
Introduction to Econometrics

The statistical analysis of economic (and related) data

1/2/3-1
Brief Overview of the Course

Economics suggests important relationships, often with policy


implications, but virtually never suggests quantitative
magnitudes of causal effects.
 What is the quantitative effect of reducing class size on
student achievement?
 How does another year of education change earnings?
 What is the price elasticity of cigarettes?
 What is the effect on output growth of a 1 percentage
point increase in interest rates by the Fed?
 What is the effect on housing prices of environmental
improvements?

1/2/3-2
This course is about using data to measure causal effects.
 Ideally, we would like an experiment
o what would be an experiment to estimate the effect of
class size on standardized test scores?
 But almost always we only have observational
(nonexperimental) data.
o returns to education
o cigarette prices
o monetary policy
• Most of the course deals with difficulties arising from using
observational data to estimate causal effects
o confounding effects (omitted factors)
o simultaneous causality
o “correlation does not imply causation”
1/2/3-3
In this course you will:

 Learn methods for estimating causal effects using


observational data
 Learn some tools that can be used for other purposes, for
example forecasting using time series data;
 Focus on applications – theory is used only as needed to
understand the “why”s of the methods;
 Learn to evaluate the regression analysis of others – this
means you will be able to read/understand empirical
economics papers in other econ courses;
 Get some hands-on experience with regression analysis in
your problem sets.

1/2/3-4
Review of Probability and Statistics
(SW Chapters 2, 3)

Empirical problem: Class size and educational output

 Policy question: What is the effect on test scores (or some


other outcome measure) of reducing class size by one
student per class? by 8 students/class?
 We must use data to find out (is there any way to answer
this without data?)

1/2/3-5
The California Test Score Data Set

All K-6 and K-8 California school districts (n = 420)

Variables:
• 5th grade test scores (Stanford-9 achievement test,
combined math and reading), district average
 Student-teacher ratio (STR) = no. of students in the
district divided by no. full-time equivalent teachers

1/2/3-6
Initial look at the data:
(You should already know how to interpret this table)

This table doesn’t tell us anything about the relationship


between test scores and the STR.

1/2/3-7
Do districts with smaller classes have higher test scores?
Scatterplot of test score v. student-teacher ratio

what does this figure show?


1/2/3-8
We need to get some numerical evidence on whether districts
with low STRs have higher test scores – but how?

1. Compare average test scores in districts with low STRs


to those with high STRs (“estimation”)

2. Test the “null” hypothesis that the mean test scores in


the two types of districts are the same, against the
“alternative” hypothesis that they differ (“hypothesis
testing”)

3. Estimate an interval for the difference in the mean test


scores, high v. low STR districts (“confidence
interval”)
1/2/3-9
Initial data analysis: Compare districts with “small” (STR <
20) and “large” (STR ≥ 20) class sizes:

Class size    Average score (Ȳ)    Std. dev. (s_Y)    n
Small         657.4                 19.4               238
Large         650.0                 17.9               182

1. Estimation of Δ = difference between group means
2. Test the hypothesis that Δ = 0
3. Construct a confidence interval for Δ

1/2/3-10
1. Estimation
Ȳ_small − Ȳ_large = (1/n_small) Σᵢ Yᵢ (sum over small-STR districts) − (1/n_large) Σᵢ Yᵢ (sum over large-STR districts)
= 657.4 − 650.0
= 7.4

Is this a large difference in a real-world sense?

• Standard deviation across districts = 19.1
• Difference between 60th and 75th percentiles of test score distribution is 667.6 − 659.4 = 8.2
• Is this a big enough difference to be important for school reform discussions, for parents, or for a school committee?

1/2/3-11
2. Hypothesis testing

Difference-in-means test: compute the t-statistic,

t = (Ȳ_s − Ȳ_l) / √(s_s²/n_s + s_l²/n_l) = (Ȳ_s − Ȳ_l) / SE(Ȳ_s − Ȳ_l)   (remember this?)

where SE(Ȳ_s − Ȳ_l) is the "standard error" of Ȳ_s − Ȳ_l, the subscripts s and l refer to "small" and "large" STR districts, and

s_s² = (1/(n_s − 1)) Σᵢ (Yᵢ − Ȳ_s)², with the sum over the n_s small-STR districts (etc.)

1/2/3-12
Compute the difference-of-means t-statistic:

Size     Ȳ        s_Y     n
small    657.4    19.4    238
large    650.0    17.9    182

t = (Ȳ_s − Ȳ_l) / √(s_s²/n_s + s_l²/n_l) = (657.4 − 650.0) / √(19.4²/238 + 17.9²/182) = 7.4 / 1.83 = 4.05

|t| > 1.96, so reject (at the 5% significance level) the null
hypothesis that the two means are the same.

1/2/3-13
3. Confidence interval

A 95% confidence interval for the difference between the means is,

(Ȳ_s − Ȳ_l) ± 1.96·SE(Ȳ_s − Ȳ_l)
= 7.4 ± 1.96×1.83 = (3.8, 11.0)

Two equivalent statements:
1. The 95% confidence interval for Δ doesn't include 0;
2. The hypothesis that Δ = 0 is rejected at the 5% level.
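The arithmetic above can be reproduced from the group summary statistics alone. A minimal Python sketch (illustrative only – not part of the course software):

import math

# Group summary statistics from the table above
n_s, ybar_s, s_s = 238, 657.4, 19.4   # small classes (STR < 20)
n_l, ybar_l, s_l = 182, 650.0, 17.9   # large classes (STR >= 20)

diff = ybar_s - ybar_l                          # estimated difference in means
se = math.sqrt(s_s**2 / n_s + s_l**2 / n_l)     # SE of the difference
t = diff / se                                   # difference-in-means t-statistic
ci = (diff - 1.96 * se, diff + 1.96 * se)       # large-n 95% confidence interval

print(f"diff = {diff:.1f}, SE = {se:.2f}, t = {t:.2f}, 95% CI = ({ci[0]:.1f}, {ci[1]:.1f})")
# Expected (up to rounding): diff = 7.4, SE = 1.83, t = 4.05, 95% CI = (3.8, 11.0)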

1/2/3-14
What comes next…

 The mechanics of estimation, hypothesis testing, and


confidence intervals should be familiar
 These concepts extend directly to regression and its
variants
 Before turning to regression, however, we will review
some of the underlying theory of estimation, hypothesis
testing, and confidence intervals:
 why do these procedures work, and why use these
rather than others?
 So we will review the intellectual foundations of
statistics and econometrics

1/2/3-15
Review of Statistical Theory

1. The probability framework for statistical inference


2. Estimation
3. Testing
4. Confidence Intervals

The probability framework for statistical inference


(a) Population, random variable, and distribution
(b) Moments of a distribution (mean, variance, standard
deviation, covariance, correlation)
(c) Conditional distributions and conditional means
(d) Distribution of a sample of data drawn randomly from a
population: Y₁,…, Y_n
1/2/3-16
(a) Population, random variable, and distribution

Population
 The group or collection of all possible entities of interest
(school districts)
• We will think of populations as infinitely large (∞ is an approximation to "very big")

Random variable Y
 Numerical summary of a random outcome (district
average test score, district STR)

1/2/3-17
Population distribution of Y

• The probabilities of different values of Y that occur in the population, for ex. Pr[Y = 650] (when Y is discrete)
• or: The probabilities of sets of these values, for ex. Pr[640 ≤ Y ≤ 660] (when Y is continuous).

1/2/3-18
(b) Moments of a population distribution: mean, variance,
standard deviation, covariance, correlation

mean = expected value (expectation) of Y
     = E(Y)
     = μ_Y
     = long-run average value of Y over repeated realizations of Y
variance = E(Y − μ_Y)²
     = σ_Y²
     = measure of the squared spread of the distribution
standard deviation = √variance = σ_Y

1/2/3-19
Moments, ctd.
E Y  Y  
 3

skewness =  
3
Y
= measure of asymmetry of a distribution
 skewness = 0: distribution is symmetric
 skewness > (<) 0: distribution has long right (left) tail

E Y  Y  
 4

kurtosis =  
4
Y
= measure of mass in tails
= measure of probability of large values
 kurtosis = 3: normal distribution
 skewness > 3: heavy tails (“leptokurtotic”)
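These population moments have direct sample analogues, which is a quick way to get a feel for them. A small Python sketch (illustrative only; the exponential distribution is used simply because it is strongly right-skewed):

import numpy as np

rng = np.random.default_rng(0)
y = rng.exponential(scale=1.0, size=100_000)   # a right-skewed population

m, s = y.mean(), y.std()
skew = np.mean((y - m)**3) / s**3   # sample analogue of E[(Y - mu)^3]/sigma^3
kurt = np.mean((y - m)**4) / s**4   # sample analogue of E[(Y - mu)^4]/sigma^4
print(skew, kurt)                   # roughly 2 and 9 here; a normal sample gives roughly 0 and 3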
1/2/3-20
1/2/3-21
2 random variables: joint distributions and covariance

 Random variables X and Z have a joint distribution


• The covariance between X and Z is
cov(X,Z) = E[(X − μ_X)(Z − μ_Z)] = σ_XZ

• The covariance is a measure of the linear association between X and Z; its units are units of X × units of Z
• cov(X,Z) > 0 means a positive relation between X and Z
• If X and Z are independently distributed, then cov(X,Z) = 0 (but not vice versa!!)
• The covariance of a r.v. with itself is its variance:
cov(X,X) = E[(X − μ_X)(X − μ_X)] = E[(X − μ_X)²] = σ_X²
1/2/3-22
1/2/3-23
The covariance between Test Score and STR is negative:

so is the correlation…
1/2/3-24
The correlation coefficient is defined in terms of the
covariance:

corr(X,Z) = cov(X,Z) / √(var(X)·var(Z)) = σ_XZ / (σ_X·σ_Z) = r_XZ

• −1 ≤ corr(X,Z) ≤ 1
• corr(X,Z) = 1 means perfect positive linear association
• corr(X,Z) = −1 means perfect negative linear association
• corr(X,Z) = 0 means no linear association

1/2/3-25
The correlation coefficient measures linear association

1/2/3-26
(c) Conditional distributions and conditional means

Conditional distributions
 The distribution of Y, given value(s) of some other
random variable, X
 Ex: the distribution of test scores, given that STR < 20
Conditional expectations and conditional moments
 conditional mean = mean of conditional distribution
= E(Y|X = x) (important concept and notation)
 conditional variance = variance of conditional distribution
 Example: E(Test scores|STR < 20) = the mean of test
scores among districts with small class sizes
The difference in means is the difference between the means
of two conditional distributions:
1/2/3-27
Conditional mean, ctd.

Δ = E(Test scores|STR < 20) − E(Test scores|STR ≥ 20)

Other examples of conditional means:


 Wages of all female workers (Y = wages, X = gender)
 Mortality rate of those given an experimental treatment (Y
= live/die; X = treated/not treated)
 If E(X|Z) = const, then corr(X,Z) = 0 (not necessarily vice
versa however)
The conditional mean is a (possibly new) term for the
familiar idea of the group mean

1/2/3-28
(d) Distribution of a sample of data drawn randomly
from a population: Y₁,…, Y_n

We will assume simple random sampling


• Choose an individual (district, entity) at random from the population
Randomness and data
 Prior to sample selection, the value of Y is random
because the individual selected is random
 Once the individual is selected and the value of Y is
observed, then Y is just a number – not random
• The data set is (Y₁, Y₂,…, Y_n), where Yᵢ = value of Y for the i-th individual (district, entity) sampled

1/2/3-29
Distribution of Y₁,…, Y_n under simple random sampling
• Because individuals #1 and #2 are selected at random, the value of Y₁ has no information content for Y₂. Thus:
o Y₁ and Y₂ are independently distributed
o Y₁ and Y₂ come from the same distribution, that is, Y₁ and Y₂ are identically distributed
o That is, under simple random sampling, Y₁ and Y₂ are independently and identically distributed (i.i.d.).
o More generally, under simple random sampling, {Yᵢ}, i = 1,…, n, are i.i.d.

1/2/3-30
This framework allows rigorous statistical inferences about
moments of population distributions using a sample of data
from that population …
1. The probability framework for statistical inference
2. Estimation
3. Testing
4. Confidence Intervals

Estimation
Ȳ is the natural estimator of the mean. But:
(a) What are the properties of Ȳ?
(b) Why should we use Ȳ rather than some other estimator?
• Y₁ (the first observation)
• maybe unequal weights – not simple average
1/2/3-31
• median(Y₁,…, Y_n)
The starting point is the sampling distribution of Ȳ…

1/2/3-32
(a) The sampling distribution of Ȳ
Ȳ is a random variable, and its properties are determined by the sampling distribution of Ȳ
• The individuals in the sample are drawn at random.
• Thus the values of (Y₁,…, Y_n) are random
• Thus functions of (Y₁,…, Y_n), such as Ȳ, are random: had a different sample been drawn, they would have taken on a different value
• The distribution of Ȳ over different possible samples of size n is called the sampling distribution of Ȳ.
• The mean and variance of Ȳ are the mean and variance of its sampling distribution, E(Ȳ) and var(Ȳ).
• The concept of the sampling distribution underpins all of econometrics.
1/2/3-33
The sampling distribution of Ȳ, ctd.
Example: Suppose Y takes on 0 or 1 (a Bernoulli random variable) with the probability distribution,
Pr[Y = 0] = .22, Pr[Y = 1] = .78
Then
E(Y) = p×1 + (1 − p)×0 = p = .78
σ_Y² = E[Y − E(Y)]² = p(1 − p) [remember this?]
= .78×(1 − .78) = 0.1716
The sampling distribution of Ȳ depends on n.
Consider n = 2. The sampling distribution of Ȳ is,
Pr(Ȳ = 0) = .22² = .0484
Pr(Ȳ = ½) = 2×.22×.78 = .3432
Pr(Ȳ = 1) = .78² = .6084
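These three probabilities are easy to verify by simulation. A short Python sketch (illustrative only):

import numpy as np

rng = np.random.default_rng(0)
p, n, reps = 0.78, 2, 1_000_000

samples = rng.binomial(1, p, size=(reps, n))   # many samples of size n = 2
ybar = samples.mean(axis=1)                    # sample mean of each draw

for value in (0.0, 0.5, 1.0):
    print(value, np.mean(ybar == value))
# Simulated frequencies: roughly .0484, .3432, .6084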

1/2/3-34
The sampling distribution of Ȳ when Y is Bernoulli (p = .78):

1/2/3-35
Things we want to know about the sampling distribution:

• What is the mean of Ȳ?
o If E(Ȳ) = true μ = .78, then Ȳ is an unbiased estimator of μ
• What is the variance of Ȳ?
o How does var(Ȳ) depend on n (famous 1/n formula)
• Does Ȳ become close to μ when n is large?
o Law of large numbers: Ȳ is a consistent estimator of μ
• Ȳ − μ appears bell shaped for n large… is this generally true?
o In fact, Ȳ − μ is approximately normally distributed for n large (Central Limit Theorem)

1/2/3-36
The mean and variance of the sampling distribution of Ȳ

General case – that is, for Yᵢ i.i.d. from any distribution, not just Bernoulli:

mean: E(Ȳ) = E((1/n) Σᵢ₌₁ⁿ Yᵢ) = (1/n) Σᵢ₌₁ⁿ E(Yᵢ) = (1/n) Σᵢ₌₁ⁿ μ_Y = μ_Y

variance: var(Ȳ) = E[Ȳ − E(Ȳ)]²
= E[Ȳ − μ_Y]²
= E[((1/n) Σᵢ₌₁ⁿ Yᵢ) − μ_Y]²
= E[(1/n) Σᵢ₌₁ⁿ (Yᵢ − μ_Y)]²
1/2/3-37
so var(Ȳ) = E[(1/n) Σᵢ₌₁ⁿ (Yᵢ − μ_Y)]²
= E[ ((1/n) Σᵢ₌₁ⁿ (Yᵢ − μ_Y)) × ((1/n) Σⱼ₌₁ⁿ (Yⱼ − μ_Y)) ]
= (1/n²) Σᵢ₌₁ⁿ Σⱼ₌₁ⁿ E[(Yᵢ − μ_Y)(Yⱼ − μ_Y)]
= (1/n²) Σᵢ₌₁ⁿ Σⱼ₌₁ⁿ cov(Yᵢ, Yⱼ)
= (1/n²) Σᵢ₌₁ⁿ σ_Y²
= σ_Y²/n

1/2/3-38
Mean and variance of sampling distribution of Ȳ, ctd.

E(Ȳ) = μ_Y
var(Ȳ) = σ_Y²/n

Implications:
1. Ȳ is an unbiased estimator of μ_Y (that is, E(Ȳ) = μ_Y)
2. var(Ȳ) is inversely proportional to n
• the spread of the sampling distribution is proportional to 1/√n
• Thus the sampling uncertainty associated with Ȳ is proportional to 1/√n (larger samples, less uncertainty, but square-root law)
1/2/3-39
The sampling distribution of Ȳ when n is large

For small sample sizes, the distribution of Ȳ is complicated, but if n is large, the sampling distribution is simple!
1. As n increases, the distribution of Ȳ becomes more tightly centered around μ_Y (the Law of Large Numbers)
2. Moreover, the distribution of Ȳ − μ_Y becomes normal (the Central Limit Theorem)

1/2/3-40
The Law of Large Numbers:
An estimator is consistent if the probability that it falls within an interval of the true population value tends to one as the sample size increases.
If (Y₁,…,Y_n) are i.i.d. and σ_Y² < ∞, then Ȳ is a consistent estimator of μ_Y, that is,
Pr[|Ȳ − μ_Y| < ε] → 1 as n → ∞
which can be written, Ȳ →ᵖ μ_Y
("Ȳ →ᵖ μ_Y" means "Ȳ converges in probability to μ_Y").
(the math: as n → ∞, var(Ȳ) = σ_Y²/n → 0, which implies that Pr[|Ȳ − μ_Y| < ε] → 1.)

1/2/3-41
The Central Limit Theorem (CLT):
If (Y₁,…,Y_n) are i.i.d. and 0 < σ_Y² < ∞, then when n is large the distribution of Ȳ is well approximated by a normal distribution.
• Ȳ is approximately distributed N(μ_Y, σ_Y²/n) ("normal distribution with mean μ_Y and variance σ_Y²/n")
• √n·(Ȳ − μ_Y)/σ_Y is approximately distributed N(0,1) (standard normal)
• That is, "standardized" Ȳ = (Ȳ − E(Ȳ))/√var(Ȳ) = (Ȳ − μ_Y)/(σ_Y/√n) is approximately distributed as N(0,1)
• The larger is n, the better is the approximation.
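The CLT is easy to see by simulation: draw many samples of size n from the (decidedly non-normal) Bernoulli population above, standardize each sample mean, and check how often the standardized mean lands outside ±1.96. A Python sketch (illustrative only):

import numpy as np

rng = np.random.default_rng(0)
p, n, reps = 0.78, 100, 100_000
mu, sigma = p, np.sqrt(p * (1 - p))

ybar = rng.binomial(1, p, size=(reps, n)).mean(axis=1)
z = np.sqrt(n) * (ybar - mu) / sigma          # standardized sample mean

# If the N(0,1) approximation is good, about 5% of the z's fall outside +/- 1.96
print(np.mean(np.abs(z) > 1.96))              # roughly 0.05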
Sampling distribution of Ȳ when Y is Bernoulli, p = 0.78:
1/2/3-42
1/2/3-43
Same example: sampling distribution of (Ȳ − E(Ȳ))/√var(Ȳ):

1/2/3-44
Summary: The Sampling Distribution of Ȳ
For Y₁,…,Y_n i.i.d. with 0 < σ_Y² < ∞,
• The exact (finite sample) sampling distribution of Ȳ has mean μ_Y ("Ȳ is an unbiased estimator of μ_Y") and variance σ_Y²/n
• Other than its mean and variance, the exact distribution of Ȳ is complicated and depends on the distribution of Y (the population distribution)
• When n is large, the sampling distribution simplifies:
o Ȳ →ᵖ μ_Y (Law of large numbers)
o (Ȳ − E(Ȳ))/√var(Ȳ) is approximately N(0,1) (CLT)

1/2/3-45
(b) Why Use Ȳ To Estimate μ_Y?
• Ȳ is unbiased: E(Ȳ) = μ_Y
• Ȳ is consistent: Ȳ →ᵖ μ_Y
• Ȳ is the "least squares" estimator of μ_Y; Ȳ solves,

min_m Σᵢ₌₁ⁿ (Yᵢ − m)²

so, Ȳ minimizes the sum of squared "residuals"

optional derivation (also see App. 3.2)

d/dm Σᵢ₌₁ⁿ (Yᵢ − m)² = Σᵢ₌₁ⁿ d/dm (Yᵢ − m)² = −2 Σᵢ₌₁ⁿ (Yᵢ − m)

Set derivative to zero and denote optimal value of m by m̂:

Σᵢ₌₁ⁿ Yᵢ = Σᵢ₌₁ⁿ m̂ = n·m̂, or m̂ = (1/n) Σᵢ₌₁ⁿ Yᵢ = Ȳ

1/2/3-46
Why Use Ȳ To Estimate μ_Y, ctd.

• Ȳ has a smaller variance than all other linear unbiased estimators: consider the estimator μ̂_Y = (1/n) Σᵢ₌₁ⁿ aᵢYᵢ, where {aᵢ} are such that μ̂_Y is unbiased; then var(Ȳ) ≤ var(μ̂_Y) (proof: SW, Ch. 17)
• Ȳ isn't the only estimator of μ_Y – can you think of a time you might want to use the median instead?

1/2/3-47
1. The probability framework for statistical inference
2. Estimation
3. Hypothesis Testing
4. Confidence intervals

Hypothesis Testing
The hypothesis testing problem (for the mean): make a
provisional decision, based on the evidence at hand, whether
a null hypothesis is true, or instead that some alternative
hypothesis is true. That is, test
H₀: E(Y) = μ_Y,0 vs. H₁: E(Y) > μ_Y,0 (1-sided, >)
H₀: E(Y) = μ_Y,0 vs. H₁: E(Y) < μ_Y,0 (1-sided, <)
H₀: E(Y) = μ_Y,0 vs. H₁: E(Y) ≠ μ_Y,0 (2-sided)

1/2/3-48
Some terminology for testing statistical hypotheses:

p-value = probability of drawing a statistic (e.g. Ȳ) at least as adverse to the null as the value actually computed with your data, assuming that the null hypothesis is true.

The significance level of a test is a pre-specified probability of incorrectly rejecting the null, when the null is true.

Calculating the p-value based on Ȳ:

p-value = Pr_H0[|Ȳ − μ_Y,0| > |Ȳ^act − μ_Y,0|]

where Ȳ^act is the value of Ȳ actually observed (nonrandom)


1/2/3-49
Calculating the p-value, ctd.
• To compute the p-value, you need to know the sampling distribution of Ȳ, which is complicated if n is small.
• If n is large, you can use the normal approximation (CLT):

p-value = Pr_H0[|Ȳ − μ_Y,0| > |Ȳ^act − μ_Y,0|]
= Pr_H0[ |(Ȳ − μ_Y,0)/(σ_Y/√n)| > |(Ȳ^act − μ_Y,0)/(σ_Y/√n)| ]
= Pr_H0[ |(Ȳ − μ_Y,0)/σ_Ȳ| > |(Ȳ^act − μ_Y,0)/σ_Ȳ| ]
≅ probability under left+right N(0,1) tails

where σ_Ȳ = std. dev. of the distribution of Ȳ = σ_Y/√n.
1/2/3-50
Calculating the p-value with σ_Y known:

• For large n, p-value = the probability that a N(0,1) random variable falls outside |(Ȳ^act − μ_Y,0)/σ_Ȳ|
• In practice, σ_Y is unknown – it must be estimated

1/2/3-51
Estimator of the variance of Y:

s_Y² = (1/(n−1)) Σᵢ₌₁ⁿ (Yᵢ − Ȳ)² = "sample variance of Y"

Fact:
If (Y₁,…,Y_n) are i.i.d. and E(Y⁴) < ∞, then s_Y² →ᵖ σ_Y²

Why does the law of large numbers apply?
• Because s_Y² is a sample average; see Appendix 3.3
• Technical note: we assume E(Y⁴) < ∞ because here the average is not of Yᵢ, but of its square; see App. 3.3

1/2/3-52
Computing the p-value with σ_Y² estimated:

p-value = Pr_H0[|Ȳ − μ_Y,0| > |Ȳ^act − μ_Y,0|]
= Pr_H0[ |(Ȳ − μ_Y,0)/(σ_Y/√n)| > |(Ȳ^act − μ_Y,0)/(σ_Y/√n)| ]
≅ Pr_H0[ |(Ȳ − μ_Y,0)/(s_Y/√n)| > |(Ȳ^act − μ_Y,0)/(s_Y/√n)| ]   (large n)
so
p-value = Pr_H0[|t| > |t^act|]   (σ_Y² estimated)
≅ probability under normal tails outside |t^act|
where t = (Ȳ − μ_Y,0)/(s_Y/√n)   (the usual t-statistic)
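With the large-n approximation, computing the p-value is one line: the area in the two N(0,1) tails beyond |t^act|. An illustrative Python sketch (the numbers plugged in are hypothetical, e.g. testing μ_Y,0 = 650 with the small-STR group statistics):

import math
from scipy.stats import norm

ybar, mu_0, s_y, n = 657.4, 650.0, 19.4, 238      # hypothetical example values
t_act = (ybar - mu_0) / (s_y / math.sqrt(n))      # the usual t-statistic
p_value = 2 * (1 - norm.cdf(abs(t_act)))          # area in left + right normal tails
print(t_act, p_value)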

1/2/3-53
What is the link between the p-value and the significance
level?

The significance level is prespecified. For example, if the prespecified significance level is 5%,
• you reject the null hypothesis if |t| ≥ 1.96
• equivalently, you reject if p ≤ 0.05.
 The p-value is sometimes called the marginal
significance level.
 Often, it is better to communicate the p-value than simply
whether a test rejects or not – the p-value contains more
information than the “yes/no” statement about whether the
test rejects.

1/2/3-54
At this point, you might be wondering,...
What happened to the t-table and the degrees of freedom?

Digression: the Student t distribution

If Yᵢ, i = 1,…, n is i.i.d. N(μ_Y, σ_Y²), then the t-statistic has the Student t-distribution with n − 1 degrees of freedom.
The critical values of the Student t-distribution are tabulated in the back of all statistics books. Remember the recipe?
1. Compute the t-statistic
2. Compute the degrees of freedom, which is n – 1
3. Look up the 5% critical value
4. If the t-statistic exceeds (in absolute value) this
critical value, reject the null hypothesis.

1/2/3-55
Comments on this recipe and the Student t-distribution

1. The theory of the t-distribution was one of the early


triumphs of mathematical statistics. It is astounding, really:
if Y is i.i.d. normal, then you can know the exact, finite-
sample distribution of the t-statistic – it is the Student t. So,
you can construct confidence intervals (using the Student t
critical value) that have exactly the right coverage rate, no
matter what the sample size. This result was really useful in
times when “computer” was a job title, data collection was
expensive, and the number of observations was perhaps a
dozen. It is also a conceptually beautiful result, and the
math is beautiful too – which is probably why stats profs
love to teach the t-distribution. But….
1/2/3-56
Comments on Student t distribution, ctd.

2. If the sample size is moderate (several dozen) or large (hundreds or more), the difference between the t-distribution and N(0,1) critical values is negligible. Here are some 5% critical values for 2-sided tests:

degrees of freedom (n − 1)    5% t-distribution critical value
10                            2.23
20                            2.09
30                            2.04
60                            2.00
∞                             1.96
1/2/3-57
Comments on Student t distribution, ctd.

3. So, the Student-t distribution is only relevant when the


sample size is very small; but in that case, for it to be
correct, you must be sure that the population distribution of
Y is normal. In economic data, the normality assumption is
rarely credible. Here are the distributions of some
economic data.
 Do you think earnings are normally distributed?
 Suppose you have a sample of n = 10 observations
from one of these distributions – would you feel
comfortable using the Student t distribution?

1/2/3-58
1/2/3-59
Comments on Student t distribution, ctd.
4. You might not know this. Consider the t-statistic testing the hypothesis that two means (groups s, l) are equal:

t = (Ȳ_s − Ȳ_l) / √(s_s²/n_s + s_l²/n_l) = (Ȳ_s − Ȳ_l) / SE(Ȳ_s − Ȳ_l)

Even if the population distribution of Y in the two groups is normal, this statistic doesn't have a Student t distribution!
There is a statistic testing this hypothesis that has a Student t distribution, the "pooled variance" t-statistic – see SW (Section 3.6) – however the pooled variance t-statistic is only valid if the variances of the normal distributions are the same in the two groups. Would you expect this to be true, say, for men's v. women's wages?
1/2/3-60
The Student-t distribution – summary

• The assumption that Y is distributed N(μ_Y, σ_Y²) is rarely plausible in practice (income? number of children?)
• For n > 30, the t-distribution and N(0,1) are very close (as n grows large, the t_{n−1} distribution converges to N(0,1))
 The t-distribution is an artifact from days when sample
sizes were small and “computers” were people
 For historical reasons, statistical software typically uses
the t-distribution to compute p-values – but this is
irrelevant when the sample size is moderate or large.
 For these reasons, in this class we will focus on the large-
n approximation given by the CLT

1/2/3-61
1. The probability framework for statistical inference
2. Estimation
3. Testing
4. Confidence intervals

Confidence Intervals
A 95% confidence interval for μ_Y is an interval that contains the true value of μ_Y in 95% of repeated samples.

Digression: What is random here? The values of Y₁,…,Y_n and thus any functions of them – including the confidence interval. The confidence interval will differ from one sample to the next. The population parameter, μ_Y, is not random; we just don't know it.
1/2/3-62
Confidence intervals, ctd.
A 95% confidence interval can always be constructed as the set of values of μ_Y not rejected by a hypothesis test with a 5% significance level.

{μ_Y: |(Ȳ − μ_Y)/(s_Y/√n)| ≤ 1.96} = {μ_Y: −1.96 ≤ (Ȳ − μ_Y)/(s_Y/√n) ≤ 1.96}
= {μ_Y: −1.96·s_Y/√n ≤ Ȳ − μ_Y ≤ 1.96·s_Y/√n}
= {μ_Y in (Ȳ − 1.96·s_Y/√n, Ȳ + 1.96·s_Y/√n)}

This confidence interval relies on the large-n results that Ȳ is approximately normally distributed and s_Y² →ᵖ σ_Y².

1/2/3-63
Summary:
From the two assumptions of:
(1) simple random sampling of a population, that is,
{Yi, i =1,…,n} are i.i.d.
(2) 0 < E(Y⁴) < ∞
we developed, for large samples (large n):
• Theory of estimation (sampling distribution of Ȳ)
 Theory of hypothesis testing (large-n distribution of t-
statistic and computation of the p-value)
 Theory of confidence intervals (constructed by inverting
test statistic)
Are assumptions (1) & (2) plausible in practice? Yes

1/2/3-64
Let’s go back to the original policy question:
What is the effect on test scores of reducing STR by one
student/class?
Have we answered this question?

1/2/3-65
Introduction to Linear Regression
(SW Chapter 4)

Empirical problem: Class size and educational output


 Policy question: What is the effect of reducing class
size by one student per class? by 8 students/class?
 What is the right output (performance) measure?
 parent satisfaction
 student personal development
 future adult welfare
 future adult earnings
 performance on standardized tests

4-1
What do data say about class sizes and test scores?

The California Test Score Data Set

All K-6 and K-8 California school districts (n = 420)

Variables:
 5th grade test scores (Stanford-9 achievement test,
combined math and reading), district average
 Student-teacher ratio (STR) = no. of students in the
district divided by no. full-time equivalent teachers

4-2
An initial look at the California test score data:

4-3
Do districts with smaller classes (lower STR) have higher test
scores?

4-4
The class size/test score policy question:
• What is the effect on test scores of reducing STR by one student/class?
• Object of policy interest: ΔTest score / ΔSTR
• This is the slope of the line relating test score and STR

4-5
This suggests that we want to draw a line through the
Test Score v. STR scatterplot – but how?

4-6
Some Notation and Terminology
(Sections 4.1 and 4.2)

The population regression line:

Test Score = β₀ + β₁STR

β₁ = slope of population regression line
= ΔTest score / ΔSTR
= change in test score for a unit change in STR
• Why are β₀ and β₁ "population" parameters?
• We would like to know the population value of β₁.
• We don't know β₁, so must estimate it using data.
4-7
How can we estimate β₀ and β₁ from data?
Recall that Ȳ was the least squares estimator of μ_Y: Ȳ solves,

min_m Σᵢ₌₁ⁿ (Yᵢ − m)²

By analogy, we will focus on the least squares ("ordinary least squares" or "OLS") estimator of the unknown parameters β₀ and β₁, which solves,

min_{b0,b1} Σᵢ₌₁ⁿ [Yᵢ − (b₀ + b₁Xᵢ)]²

4-8
The OLS estimator solves: min_{b0,b1} Σᵢ₌₁ⁿ [Yᵢ − (b₀ + b₁Xᵢ)]²

• The OLS estimator minimizes the average squared difference between the actual values of Yᵢ and the prediction (predicted value) based on the estimated line.
• This minimization problem can be solved using calculus (App. 4.2).
• The result is the OLS estimators of β₀ and β₁.
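The minimization has a simple closed-form solution: β̂₁ = Σᵢ(Xᵢ − X̄)(Yᵢ − Ȳ) / Σᵢ(Xᵢ − X̄)² (the same expression reappears in the algebra of Section 4.4 below) and β̂₀ = Ȳ − β̂₁X̄. An illustrative Python sketch on simulated data (not the California file, and not the course's Stata software):

import numpy as np

rng = np.random.default_rng(0)
n = 420
x = rng.uniform(14, 26, size=n)                 # hypothetical student-teacher ratios
u = rng.normal(0, 15, size=n)
y = 700 - 2.0 * x + u                           # hypothetical "true" line: beta0 = 700, beta1 = -2

beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
beta0_hat = y.mean() - beta1_hat * x.mean()
print(beta0_hat, beta1_hat)                     # close to 700 and -2 in this simulated sample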

4-9
Why use OLS, rather than some other estimator?
 OLS is a generalization of the sample average: if the
“line” is just an intercept (no X), then the OLS
estimator is just the sample average of Y1,…Yn (Y ).
• Like Ȳ, the OLS estimator has some desirable properties: under certain assumptions, it is unbiased (that is, E(β̂₁) = β₁), and it has a tighter sampling distribution than some other candidate estimators of β₁ (more on this later)
 Importantly, this is what everyone uses – the common
“language” of linear regression.

4-10
4-11
Application to the California Test Score – Class Size data

Estimated slope = β̂₁ = −2.28
Estimated intercept = β̂₀ = 698.9
Estimated regression line: TestScore = 698.9 − 2.28·STR
4-12
Interpretation of the estimated slope and intercept
TestScore = 698.9 − 2.28·STR
• Districts with one more student per teacher on average have test scores that are 2.28 points lower.
• That is, ΔTest score / ΔSTR = −2.28
 The intercept (taken literally) means that, according to
this estimated line, districts with zero students per
teacher would have a (predicted) test score of 698.9.
 This interpretation of the intercept makes no sense – it
extrapolates the line outside the range of the data – in
this application, the intercept is not itself
economically meaningful.
4-13
Predicted values & residuals:

One of the districts in the data set is Antelope, CA, for which STR = 19.33 and Test Score = 657.8
predicted value: Ŷ_Antelope = 698.9 − 2.28×19.33 = 654.8
residual: û_Antelope = 657.8 − 654.8 = 3.0
4-14
OLS regression: STATA output

regress testscr str, robust

Regression with robust standard errors Number of obs = 420


F( 1, 418) = 19.26
Prob > F = 0.0000
R-squared = 0.0512
Root MSE = 18.581

-------------------------------------------------------------------------
| Robust
testscr | Coef. Std. Err. t P>|t| [95% Conf. Interval]
--------+----------------------------------------------------------------
str | -2.279808 .5194892 -4.39 0.000 -3.300945 -1.258671
_cons | 698.933 10.36436 67.44 0.000 678.5602 719.3057
-------------------------------------------------------------------------

TestScore = 698.9 – 2.28 STR

(we’ll discuss the rest of this output later)


4-15
The OLS regression line is an estimate, computed using our sample of data; a different sample would have given a different value of β̂₁.
How can we:
• quantify the sampling uncertainty associated with β̂₁?
• use β̂₁ to test hypotheses such as β₁ = 0?
• construct a confidence interval for β₁?

Like estimation of the mean, we proceed in four steps:


1. The probability framework for linear regression
2. Estimation
3. Hypothesis Testing
4. Confidence intervals
4-16
1. Probability Framework for Linear Regression

Population
population of interest (ex: all possible school districts)

Random variables: Y, X
Ex: (Test Score, STR)

Joint distribution of (Y,X)


The key feature is that we suppose there is a linear
relation in the population that relates X and Y; this linear
relation is the “population linear regression”

4-17
The Population Linear Regression Model (Section 4.3)

Yᵢ = β₀ + β₁Xᵢ + uᵢ, i = 1,…, n

• X is the independent variable or regressor
• Y is the dependent variable
• β₀ = intercept
• β₁ = slope
• uᵢ = "error term"
• The error term consists of omitted factors, or possibly measurement error in the measurement of Y. In general, these omitted factors are other factors that influence Y, other than the variable X
4-18
Ex.: The population regression line and the error term

What are some of the omitted factors in this example?


4-19
Data and sampling
The population objects ("parameters") β₀ and β₁ are unknown; so to draw inferences about these unknown parameters we must collect relevant data.

Simple random sampling:


Choose n entities at random from the population of
interest, and observe (record) X and Y for each entity

Simple random sampling implies that {(Xi, Yi)}, i = 1,…,


n, are independently and identically distributed (i.i.d.).
(Note: (Xi, Yi) are distributed independently of (Xj, Yj) for
different observations i and j.)
4-20
Task at hand: to characterize the sampling distribution of
the OLS estimator. To do so, we make three
assumptions:

The Least Squares Assumptions

1. The conditional distribution of u given X has mean


zero, that is, E(u|X = x) = 0.
2. (Xᵢ,Yᵢ), i = 1,…,n, are i.i.d.
3. X and u have four moments, that is:
E(X⁴) < ∞ and E(u⁴) < ∞.

We’ll discuss these assumptions in order.


4-21
Least squares assumption #1: E(u|X = x) = 0.
For any given value of X, the mean of u is zero

4-22
Example: Assumption #1 and the class size example
Test Scoreᵢ = β₀ + β₁STRᵢ + uᵢ, uᵢ = other factors

“Other factors:”
 parental involvement
 outside learning opportunities (extra math class,..)
 home environment conducive to reading
 family income is a useful proxy for many such factors

So E(u|X=x) = 0 means E(Family Income|STR) = constant


(which implies that family income and STR are
uncorrelated). This assumption is not innocuous! We
will return to it often.
4-23
Least squares assumption #2:
(Xi,Yi), i = 1,…,n are i.i.d.

This arises automatically if the entity (individual, district)


is sampled by simple random sampling: the entity is
selected then, for that entity, X and Y are observed
(recorded).

The main place we will encounter non-i.i.d. sampling is


when data are recorded over time (“time series data”) –
this will introduce some extra complications.

4-24
Least squares assumption #3:
E(X⁴) < ∞ and E(u⁴) < ∞

Because Yᵢ = β₀ + β₁Xᵢ + uᵢ, assumption #3 can equivalently be stated as, E(X⁴) < ∞ and E(Y⁴) < ∞.

Assumption #3 is generally plausible. A finite domain of


the data implies finite fourth moments. (Standardized
test scores automatically satisfy this; STR, family income,
etc. satisfy this too).

4-25
1. The probability framework for linear regression
2. Estimation: the Sampling Distribution of β̂₁ (Section 4.4)
3. Hypothesis Testing
4. Confidence intervals

Like Ȳ, β̂₁ has a sampling distribution.
• What is E(β̂₁)? (where is it centered)
• What is var(β̂₁)? (measure of sampling uncertainty)
• What is its sampling distribution in small samples?
• What is its sampling distribution in large samples?

4-26
The sampling distribution of β̂₁: some algebra:
Yᵢ = β₀ + β₁Xᵢ + uᵢ
Ȳ = β₀ + β₁X̄ + ū
so Yᵢ − Ȳ = β₁(Xᵢ − X̄) + (uᵢ − ū)
Thus,

β̂₁ = Σᵢ₌₁ⁿ (Xᵢ − X̄)(Yᵢ − Ȳ) / Σᵢ₌₁ⁿ (Xᵢ − X̄)²

= Σᵢ₌₁ⁿ (Xᵢ − X̄)[β₁(Xᵢ − X̄) + (uᵢ − ū)] / Σᵢ₌₁ⁿ (Xᵢ − X̄)²

4-27
β̂₁ = Σᵢ₌₁ⁿ (Xᵢ − X̄)[β₁(Xᵢ − X̄) + (uᵢ − ū)] / Σᵢ₌₁ⁿ (Xᵢ − X̄)²

= β₁ · [Σᵢ₌₁ⁿ (Xᵢ − X̄)(Xᵢ − X̄) / Σᵢ₌₁ⁿ (Xᵢ − X̄)²] + [Σᵢ₌₁ⁿ (Xᵢ − X̄)(uᵢ − ū) / Σᵢ₌₁ⁿ (Xᵢ − X̄)²]

so

β̂₁ − β₁ = Σᵢ₌₁ⁿ (Xᵢ − X̄)(uᵢ − ū) / Σᵢ₌₁ⁿ (Xᵢ − X̄)²
) 2

4-28
We can simplify this formula by noting that:

Σᵢ₌₁ⁿ (Xᵢ − X̄)(uᵢ − ū) = Σᵢ₌₁ⁿ (Xᵢ − X̄)uᵢ − [Σᵢ₌₁ⁿ (Xᵢ − X̄)]·ū = Σᵢ₌₁ⁿ (Xᵢ − X̄)uᵢ.

Thus

β̂₁ − β₁ = Σᵢ₌₁ⁿ (Xᵢ − X̄)uᵢ / Σᵢ₌₁ⁿ (Xᵢ − X̄)² = [(1/n) Σᵢ₌₁ⁿ vᵢ] / [((n−1)/n)·s_X²]

where vᵢ = (Xᵢ − X̄)uᵢ.

4-29
β̂₁ − β₁ = [(1/n) Σᵢ₌₁ⁿ vᵢ] / [((n−1)/n)·s_X²], where vᵢ = (Xᵢ − X̄)uᵢ

We now can calculate the mean and variance of β̂₁:

E(β̂₁ − β₁) = E{ [(1/n) Σᵢ₌₁ⁿ vᵢ] / [((n−1)/n)·s_X²] }
= (n/(n−1)) · E[(1/n) Σᵢ₌₁ⁿ (vᵢ/s_X²)]
= (n/(n−1)) · (1/n) Σᵢ₌₁ⁿ E(vᵢ/s_X²)

4-30
Now E(vᵢ/s_X²) = E[(Xᵢ − X̄)uᵢ/s_X²] = 0

because E(uᵢ|Xᵢ = x) = 0 (for details see App. 4.3)

Thus, E(β̂₁ − β₁) = (n/(n−1)) · (1/n) Σᵢ₌₁ⁿ E(vᵢ/s_X²) = 0
so
E(β̂₁) = β₁

That is, β̂₁ is an unbiased estimator of β₁.

4-31
Calculation of the variance of β̂₁:

β̂₁ − β₁ = [(1/n) Σᵢ₌₁ⁿ vᵢ] / [((n−1)/n)·s_X²]

This calculation is simplified by supposing that n is large (so that s_X² can be replaced by σ_X²); the result is,

var(β̂₁) = var(v) / [n(σ_X²)²]

(For details see App. 4.3.)

4-32
The exact sampling distribution is complicated, but when the sample size is large we get some simple (and good) approximations:

(1) Because var(β̂₁) ∝ 1/n and E(β̂₁) = β₁, β̂₁ →ᵖ β₁

(2) When n is large, the sampling distribution of β̂₁ is well approximated by a normal distribution (CLT)

4-33
β̂₁ − β₁ = [(1/n) Σᵢ₌₁ⁿ vᵢ] / [((n−1)/n)·s_X²]

When n is large:
• vᵢ = (Xᵢ − X̄)uᵢ ≅ (Xᵢ − μ_X)uᵢ, which is i.i.d. (why?) and has two moments, that is, var(vᵢ) < ∞ (why?). Thus (1/n) Σᵢ₌₁ⁿ vᵢ is distributed N(0, var(v)/n) when n is large
• s_X² is approximately equal to σ_X² when n is large
• (n−1)/n = 1 − 1/n ≅ 1 when n is large

Putting these together we have:
Large-n approximation to the distribution of β̂₁:
4-34
β̂₁ − β₁ = [(1/n) Σᵢ₌₁ⁿ vᵢ] / [((n−1)/n)·s_X²] ≅ [(1/n) Σᵢ₌₁ⁿ vᵢ] / σ_X²,

which is approximately distributed N(0, σ_v² / [n(σ_X²)²]).

Because vᵢ = (Xᵢ − X̄)uᵢ, we can write this as:

β̂₁ is approximately distributed N(β₁, var[(Xᵢ − μ_X)uᵢ] / (n·σ_X⁴))

4-35
Recall the summary of the sampling distribution of Ȳ:
For (Y₁,…,Y_n) i.i.d. with 0 < σ_Y² < ∞,
• The exact (finite sample) sampling distribution of Ȳ has mean μ_Y ("Ȳ is an unbiased estimator of μ_Y") and variance σ_Y²/n
• Other than its mean and variance, the exact distribution of Ȳ is complicated and depends on the distribution of Y
• Ȳ →ᵖ μ_Y (law of large numbers)
• (Ȳ − E(Ȳ))/√var(Ȳ) is approximately distributed N(0,1) (CLT)

4-36
Parallel conclusions hold for the OLS estimator β̂₁:

Under the three Least Squares Assumptions,
• The exact (finite sample) sampling distribution of β̂₁ has mean β₁ ("β̂₁ is an unbiased estimator of β₁"), and var(β̂₁) is inversely proportional to n.
• Other than its mean and variance, the exact distribution of β̂₁ is complicated and depends on the distribution of (X,u)
• β̂₁ →ᵖ β₁ (law of large numbers)
• (β̂₁ − E(β̂₁))/√var(β̂₁) is approximately distributed N(0,1) (CLT)
4-37
4-38
1. The probability framework for linear regression
2. Estimation
3. Hypothesis Testing (Section 4.5)
4. Confidence intervals

Suppose a skeptic suggests that reducing the number of


students in a class has no effect on learning or,
specifically, test scores. The skeptic thus asserts the
hypothesis,
H₀: β₁ = 0

We wish to test this hypothesis using data – reach a


tentative conclusion whether it is correct or incorrect.
4-39
Null hypothesis and two-sided alternative:
H₀: β₁ = 0 vs. H₁: β₁ ≠ 0
or, more generally,
H₀: β₁ = β₁,₀ vs. H₁: β₁ ≠ β₁,₀
where β₁,₀ is the hypothesized value under the null.

Null hypothesis and one-sided alternative:
H₀: β₁ = β₁,₀ vs. H₁: β₁ < β₁,₀

In economics, it is almost always possible to come up with stories in which an effect could "go either way," so it is standard to focus on two-sided alternatives.
Recall hypothesis testing for the population mean using Ȳ:
4-40
t = (Ȳ − μ_Y,0)/(s_Y/√n)

then reject the null hypothesis if |t| > 1.96.

where the SE of the estimator is the square root of an estimator of the variance of the estimator.
Applied to a hypothesis about β₁:
4-41
t = (estimator − hypothesized value) / (standard error of the estimator)
so
t = (β̂₁ − β₁,₀) / SE(β̂₁)

where β₁,₀ is the value of β₁ hypothesized under the null (for example, if the null value is zero, then β₁,₀ = 0).

What is SE(β̂₁)?

SE(β̂₁) = the square root of an estimator of the variance of the sampling distribution of β̂₁

4-42
Recall the expression for the variance of β̂₁ (large n):

var(β̂₁) = var[(Xᵢ − μ_X)uᵢ] / [n(σ_X²)²] = σ_v² / (n·σ_X⁴)

where vᵢ = (Xᵢ − X̄)uᵢ. Estimator of the variance of β̂₁:

σ̂²(β̂₁) = (1/n) × (estimator of σ_v²) / (estimator of σ_X²)²
= (1/n) × [ (1/(n−2)) Σᵢ₌₁ⁿ (Xᵢ − X̄)²ûᵢ² ] / [ (1/n) Σᵢ₌₁ⁿ (Xᵢ − X̄)² ]².
4-43
σ̂²(β̂₁) = (1/n) × [ (1/(n−2)) Σᵢ₌₁ⁿ (Xᵢ − X̄)²ûᵢ² ] / [ (1/n) Σᵢ₌₁ⁿ (Xᵢ − X̄)² ]².

OK, this is a bit nasty, but:
• There is no reason to memorize this
• It is computed automatically by regression software
• SE(β̂₁) = √σ̂²(β̂₁) is reported by regression software
• It is less complicated than it seems. The numerator estimates var(v), the denominator estimates var(X).
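As a concrete illustration, the heteroskedasticity-robust variance formula above is only a few lines of code. A Python sketch on simulated data (illustrative only; regression software does this for you):

import numpy as np

rng = np.random.default_rng(0)
n = 420
x = rng.uniform(14, 26, size=n)
y = 700 - 2.0 * x + rng.normal(0, 15, size=n)   # hypothetical data

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
b0 = y.mean() - b1 * x.mean()
uhat = y - (b0 + b1 * x)                        # OLS residuals

num = np.sum((x - x.mean())**2 * uhat**2) / (n - 2)   # estimates var(v)
den = (np.sum((x - x.mean())**2) / n)**2              # estimates (var(X))^2
se_b1 = np.sqrt(num / den / n)                        # SE(beta1_hat)
print(b1, se_b1)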

4-44
Return to calculation of the t-statistic:

t = (β̂₁ − β₁,₀) / SE(β̂₁) = (β̂₁ − β₁,₀) / √σ̂²(β̂₁)

• Reject at the 5% significance level if |t| > 1.96
• p-value is p = Pr[|t| > |t^act|] = probability in tails of normal outside |t^act|
• Both the previous statements are based on the large-n approximation; typically n = 50 is large enough for the approximation to be excellent.

4-45
Example: Test Scores and STR, California data

Estimated regression line: TestScore = 698.9 − 2.28·STR

Regression software reports the standard errors:

SE(β̂₀) = 10.4   SE(β̂₁) = 0.52

t-statistic testing β₁,₀ = 0:  t = (β̂₁ − β₁,₀)/SE(β̂₁) = (−2.28 − 0)/0.52 = −4.38

• The 1% 2-sided critical value is 2.58, so we reject the null at the 1% significance level.
• Alternatively, we can compute the p-value…
4-46
The p-value based on the large-n standard normal approximation to the t-statistic is 0.00001 (10⁻⁵)
4-47
1. The probability framework for linear regression
2. Estimation
3. Hypothesis Testing
4. Confidence intervals (Section 4.6)

In general, if the sampling distribution of an estimator is


normal for large n, then a 95% confidence interval can be
constructed as estimator 1.96 standard error.

So: a 95% confidence interval for ˆ1 is,

{ ˆ1 1.96 SE( ˆ1 )}

4-48
Example: Test Scores and STR, California data
Estimated regression line: TestScore = 698.9 − 2.28·STR

SE(β̂₀) = 10.4   SE(β̂₁) = 0.52

95% confidence interval for β₁:

{β̂₁ ± 1.96×SE(β̂₁)} = {−2.28 ± 1.96×0.52} = (−3.30, −1.26)

Equivalent statements:
• The 95% confidence interval does not include zero;
• The hypothesis β₁ = 0 is rejected at the 5% level
A convention for reporting estimated regressions:
4-49
Put standard errors in parentheses below the estimates

TestScore = 698.9 − 2.28·STR
            (10.4)   (0.52)

This expression means that:
• The estimated regression line is TestScore = 698.9 − 2.28·STR
• The standard error of β̂₀ is 10.4
• The standard error of β̂₁ is 0.52

4-50
OLS regression: STATA output

regress testscr str, robust

Regression with robust standard errors Number of obs = 420


F( 1, 418) = 19.26
Prob > F = 0.0000
R-squared = 0.0512
Root MSE = 18.581
-------------------------------------------------------------------------
| Robust
testscr | Coef. Std. Err. t P>|t| [95% Conf. Interval]
--------+----------------------------------------------------------------
str | -2.279808 .5194892 -4.38 0.000 -3.300945 -1.258671
_cons | 698.933 10.36436 67.44 0.000 678.5602 719.3057
-------------------------------------------------------------------------
so:
TestScore = 698.9 − 2.28·STR
            (10.4)   (0.52)
t (β₁ = 0) = −4.38, p-value = 0.000
95% conf. interval for β₁ is (−3.30, −1.26)
4-51
Regression when X is Binary (Section 4.7)

Sometimes a regressor is binary:


 X = 1 if female, = 0 if male
 X = 1 if treated (experimental drug), = 0 if not
 X = 1 if small class size, = 0 if not

So far, β₁ has been called a "slope," but that doesn't make much sense if X is binary.

How do we interpret regression with a binary regressor?

4-52
Yᵢ = β₀ + β₁Xᵢ + uᵢ, where X is binary (Xᵢ = 0 or 1):

• When Xᵢ = 0: Yᵢ = β₀ + uᵢ
• When Xᵢ = 1: Yᵢ = β₀ + β₁ + uᵢ
thus:
• When Xᵢ = 0, the mean of Yᵢ is β₀
• When Xᵢ = 1, the mean of Yᵢ is β₀ + β₁
that is:
• E(Yᵢ|Xᵢ=0) = β₀
• E(Yᵢ|Xᵢ=1) = β₀ + β₁
so:
β₁ = E(Yᵢ|Xᵢ=1) − E(Yᵢ|Xᵢ=0)
= population difference in group means
4-53
Example: TestScore and STR, California data
Let
Dᵢ = 1 if STRᵢ < 20
Dᵢ = 0 if STRᵢ ≥ 20

The OLS estimate of the regression line relating TestScore to D (with standard errors in parentheses) is:

TestScore = 650.0 + 7.4·D
            (1.3)    (1.8)

Difference in means between groups = 7.4;
SE = 1.8,  t = 7.4/1.8 = 4.0
4-54
Compare the regression results with the group means, computed directly:

Class Size          Average score (Ȳ)   Std. dev. (s_Y)   N
Small (STR < 20)    657.4               19.4              238
Large (STR ≥ 20)    650.0               17.9              182

Estimation: Ȳ_small − Ȳ_large = 657.4 − 650.0 = 7.4

Test Δ = 0:  t = (Ȳ_s − Ȳ_l)/SE(Ȳ_s − Ȳ_l) = 7.4/1.83 = 4.05

95% confidence interval = {7.4 ± 1.96×1.83} = (3.8, 11.0)

This is the same as in the regression!
TestScore = 650.0 + 7.4·D
            (1.3)    (1.8)
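The numerical equivalence between the dummy-variable regression and the difference in group means can be verified directly. An illustrative Python sketch (hypothetical data, not the California file):

import numpy as np

rng = np.random.default_rng(0)
n = 420
d = (rng.uniform(14, 26, size=n) < 20).astype(float)   # D = 1 if "small" class
y = 650 + 7.4 * d + rng.normal(0, 19, size=n)

# OLS of y on a constant and d
b1 = np.sum((d - d.mean()) * (y - y.mean())) / np.sum((d - d.mean())**2)
b0 = y.mean() - b1 * d.mean()

print(b1, y[d == 1].mean() - y[d == 0].mean())   # identical: slope = difference in group means
print(b0, y[d == 0].mean())                      # identical: intercept = mean of the D = 0 group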

4-55
Summary: regression when Xi is binary (0/1)

Yᵢ = β₀ + β₁Xᵢ + uᵢ

• β₀ = mean of Y given that X = 0
• β₀ + β₁ = mean of Y given that X = 1
• β₁ = difference in group means, X = 1 minus X = 0
• SE(β̂₁) has the usual interpretation
 t-statistics, confidence intervals constructed as usual
 This is another way to do difference-in-means
analysis
 The regression formulation is especially useful when
we have additional regressors (coming up soon…)
4-56
Other Regression Statistics (Section 4.8)

A natural question is how well the regression line “fits”


or explains the data. There are two regression statistics
that provide complementary measures of the quality of
fit:
 The regression R2 measures the fraction of the
variance of Y that is explained by X; it is unitless and
ranges between zero (no fit) and one (perfect fit)
 The standard error of the regression measures the fit
– the typical size of a regression residual – in the units
of Y.

4-57
The R²
Write Yᵢ as the sum of the OLS prediction + OLS residual:

Yᵢ = Ŷᵢ + ûᵢ

The R² is the fraction of the sample variance of Yᵢ "explained" by the regression, that is, by Ŷᵢ:

R² = ESS / TSS,

where ESS = Σᵢ₌₁ⁿ (Ŷᵢ − Ȳ)² and TSS = Σᵢ₌₁ⁿ (Yᵢ − Ȳ)² (the sample average of the Ŷᵢ equals Ȳ).

4-58
R² = ESS / TSS,  where ESS = Σᵢ₌₁ⁿ (Ŷᵢ − Ȳ)² and TSS = Σᵢ₌₁ⁿ (Yᵢ − Ȳ)²

The R²:
• R² = 0 means ESS = 0, so X explains none of the variation of Y
• R² = 1 means ESS = TSS, so Y = Ŷ and X explains all of the variation of Y
• 0 ≤ R² ≤ 1
• For regression with a single regressor (the case here), R² is the square of the correlation coefficient between X and Y

4-59
The Standard Error of the Regression (SER)

The standard error of the regression is (almost) the sample standard deviation of the OLS residuals:

SER = √[ (1/(n−2)) Σᵢ₌₁ⁿ (ûᵢ − û̄)² ]
    = √[ (1/(n−2)) Σᵢ₌₁ⁿ ûᵢ² ]

(the second equality holds because û̄ = (1/n) Σᵢ₌₁ⁿ ûᵢ = 0).
4-60
SER = √[ (1/(n−2)) Σᵢ₌₁ⁿ ûᵢ² ]

The SER:
• has the units of u, which are the units of Y
• measures the spread of the distribution of u
• measures the average "size" of the OLS residual (the average "mistake" made by the OLS regression line)
• The root mean squared error (RMSE) is closely related to the SER:

RMSE = √[ (1/n) Σᵢ₌₁ⁿ ûᵢ² ]

This measures the same thing as the SER – the minor difference is division by 1/n instead of 1/(n−2).
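Both fit measures are simple functions of the OLS predictions and residuals. An illustrative Python sketch (simulated data):

import numpy as np

rng = np.random.default_rng(0)
n = 420
x = rng.uniform(14, 26, size=n)
y = 700 - 2.0 * x + rng.normal(0, 18, size=n)   # hypothetical data

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
b0 = y.mean() - b1 * x.mean()
yhat = b0 + b1 * x
uhat = y - yhat

ess = np.sum((yhat - yhat.mean())**2)
tss = np.sum((y - y.mean())**2)
r2 = ess / tss                                  # fraction of variance of Y explained by X
ser = np.sqrt(np.sum(uhat**2) / (n - 2))        # standard error of the regression
rmse = np.sqrt(np.mean(uhat**2))                # root mean squared error
print(r2, ser, rmse)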
4-61
Technical note: why divide by n−2 instead of n−1?

SER = √[ (1/(n−2)) Σᵢ₌₁ⁿ ûᵢ² ]

• Division by n−2 is a "degrees of freedom" correction like division by n−1 in s_Y²; the difference is that, in the SER, two parameters have been estimated (β₀ and β₁, by β̂₀ and β̂₁), whereas in s_Y² only one has been estimated (μ_Y, by Ȳ).
• When n is large, it makes negligible difference whether n, n−1, or n−2 is used – although the conventional formula uses n−2 when there is a single regressor.
• For details, see Section 15.4
4-62
Example of R2 and SER

TestScore = 698.9 – 2.28STR, R2 = .05, SER = 18.6


(10.4) (0.52)
The slope coefficient is statistically significant and large
in a policy sense, even though STR explains only a small
fraction of the variation in test scores.
4-63
A Practical Note: Heteroskedasticity,
Homoskedasticity, and the Formula for the Standard
Errors of β̂₀ and β̂₁ (Section 4.9)

 What do these two terms mean?


 Consequences of homoskedasticity
 Implication for computing standard errors

What do these two terms mean?


If var(u|X=x) is constant – that is, the variance of the
conditional distribution of u given X does not depend on
X, then u is said to be homoskedastic. Otherwise, u is
said to be heteroskedastic.
4-64
Homoskedasticity in a picture:

 E(u|X=x) = 0 (u satisfies Least Squares Assumption #1)


 The variance of u does not change with (depend on) x
4-65
Heteroskedasticity in a picture:

 E(u|X=x) = 0 (u satisfies Least Squares Assumption #1)


 The variance of u depends on x – so u is
heteroskedastic.
4-66
A real-world example of heteroskedasticity from labor economics: average hourly earnings vs. years of education (data source: 1999 Current Population Survey)
[Figure: scatterplot and OLS regression line – average hourly earnings (y-axis, roughly 0–60) against years of education (x-axis, 5–20), with fitted values overlaid]

4-67
Is heteroskedasticity present in the class size data?

Hard to say…looks nearly homoskedastic, but the spread


might be tighter for large values of STR.

4-68
So far we have (without saying so) allowed u to be heteroskedastic:

Recall the three least squares assumptions:


1. The conditional distribution of u given X has mean
zero, that is, E(u|X = x) = 0.
2. (Xi,Yi), i =1,…,n, are i.i.d.
3. X and u have four finite moments.

Heteroskedasticity and homoskedasticity concern


var(u|X=x). Because we have not explicitly assumed
homoskedastic errors, we have implicitly allowed for
heteroskedasticity.
4-69
What if the errors are in fact homoskedastic?
• You can prove some theorems about OLS (in particular, the Gauss-Markov theorem, which says that OLS has the lowest variance among unbiased estimators that are linear functions of (Y₁,…,Y_n); see Section 15.5).
• The formula for the variance of β̂₁ and the OLS standard error simplifies (App. 4.4): If var(uᵢ|Xᵢ=x) = σ_u², then

var(β̂₁) = var[(Xᵢ − μ_X)uᵢ] / [n(σ_X²)²] = … = σ_u² / (n·σ_X²)

Note: var(β̂₁) is inversely proportional to var(X): more spread in X means more information about β̂₁.


4-70
The general formula for the standard error of β̂₁ is the square root of:

σ̂²(β̂₁) = (1/n) × [ (1/(n−2)) Σᵢ₌₁ⁿ (Xᵢ − X̄)²ûᵢ² ] / [ (1/n) Σᵢ₌₁ⁿ (Xᵢ − X̄)² ]².

Special case under homoskedasticity:

σ̂²(β̂₁) = (1/n) × [ (1/(n−2)) Σᵢ₌₁ⁿ ûᵢ² ] / [ (1/n) Σᵢ₌₁ⁿ (Xᵢ − X̄)² ].

Sometimes it is said that the lower formula is simpler.


4-71
The homoskedasticity-only formula for the standard error
of ˆ1 and the “heteroskedasticity-robust” formula (the
formula that is valid under heteroskedasticity) differ – in
general, you get different standard errors using the
different formulas.
Homoskedasticity-only standard errors are the
default setting in regression software –
sometimes the only setting (e.g. Excel). To get
the general “heteroskedasticity-robust”
standard errors you must override the default.
If you don’t override the default and there is in fact
heteroskedasticity, you will get the wrong standard errors
(and wrong t-statistics and confidence intervals).
4-72
The critical points:
 If the errors are homoskedastic and you use the
heteroskedastic formula for standard errors (the one
we derived), you are OK
 If the errors are heteroskedastic and you use the
homoskedasticity-only formula for standard errors,
the standard errors are wrong.
 The two formulas coincide (when n is large) in the
special case of homoskedasticity
 The bottom line: you should always use the
heteroskedasticity-based formulas – these are
conventionally called the heteroskedasticity-robust
standard errors.
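The practical difference between the two formulas is easy to see by computing both for the same data. An illustrative Python sketch in which var(u|X) grows with X, so the homoskedasticity-only formula is the wrong one:

import numpy as np

rng = np.random.default_rng(0)
n = 1000
x = rng.uniform(0, 10, size=n)
u = rng.normal(0, 1, size=n) * (0.5 + x)        # heteroskedastic: var(u|X) depends on X
y = 1.0 + 2.0 * x + u

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
uhat = y - (y.mean() - b1 * x.mean()) - b1 * x  # OLS residuals

sxx = np.sum((x - x.mean())**2)
se_robust = np.sqrt((np.sum((x - x.mean())**2 * uhat**2) / (n - 2)) / (sxx / n)**2 / n)
se_homosk = np.sqrt((np.sum(uhat**2) / (n - 2)) / sxx)
print(se_robust, se_homosk)   # the two standard errors differ noticeably here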
4-73
Heteroskedasticity-robust standard errors in STATA

regress testscr str, robust

Regression with robust standard errors Number of obs = 420


F( 1, 418) = 19.26
Prob > F = 0.0000
R-squared = 0.0512
Root MSE = 18.581
-------------------------------------------------------------------------
| Robust
testscr | Coef. Std. Err. t P>|t| [95% Conf. Interval]
--------+----------------------------------------------------------------
str | -2.279808 .5194892 -4.39 0.000 -3.300945 -1.258671
_cons | 698.933 10.36436 67.44 0.000 678.5602 719.3057
-------------------------------------------------------------------------

Use the “, robust” option!!!

4-74
Summary and Assessment (Section 4.10)
 The initial policy question:
Suppose new teachers are hired so the student-
teacher ratio falls by one student per class. What
is the effect of this policy intervention (this
“treatment”) on test scores?
 Does our regression analysis give a convincing answer?
Not really – districts with low STR tend to be ones with lots of other resources and higher income families, which provide kids with more learning opportunities outside school… this suggests that corr(uᵢ, STRᵢ) ≠ 0, so E(uᵢ|Xᵢ) ≠ 0.

4-75
Digression on Causality

The original question (what is the quantitative effect of an intervention that reduces class size?) is a question about a causal effect: the effect on Y of applying a unit of the treatment is β₁.

 But what is, precisely, a causal effect?


 The common-sense definition of causality isn’t
precise enough for our purposes.
 In this course, we define a causal effect as the effect
that is measured in an ideal randomized controlled
experiment.
4-76
Ideal Randomized Controlled Experiment
 Ideal: subjects all follow the treatment protocol –
perfect compliance, no errors in reporting, etc.!
 Randomized: subjects from the population of interest
are randomly assigned to a treatment or control group
(so there are no confounding factors)
 Controlled: having a control group permits
measuring the differential effect of the treatment
 Experiment: the treatment is assigned as part of the
experiment: the subjects have no choice, which
means that there is no “reverse causality” in which
subjects choose the treatment they think will work
best.
4-77
Back to class size:
 What is an ideal randomized controlled experiment for
measuring the effect on Test Score of reducing STR?
 How does our regression analysis of observational data
differ from this ideal?
o The treatment is not randomly assigned
o In the US – in our observational data – districts with
higher family incomes are likely to have both
smaller classes and higher test scores.
o As a result it is plausible that E(uᵢ|Xᵢ=x) ≠ 0.
o If so, Least Squares Assumption #1 does not hold.
o If so, β̂₁ is biased: does an omitted factor make class size seem more important than it really is?
4-78
Multiple Regression
(SW Chapter 5)

OLS estimate of the Test Score/STR relation:


TestScore = 698.9 – 2.28STR, R2 = .05, SER = 18.6
(10.4) (0.52)
Is this a credible estimate of the causal effect on test
scores of a change in the student-teacher ratio?
No: there are omitted confounding factors (family
income; whether the students are native English
speakers) that bias the OLS estimator: STR could be
“picking up” the effect of these confounding factors.

5-1
Omitted Variable Bias
(SW Section 5.1)

The bias in the OLS estimator that occurs as a result of


an omitted factor is called omitted variable bias. For
omitted variable bias to occur, the omitted factor “Z”
must be:
1. a determinant of Y; and
2. correlated with the regressor X.

Both conditions must hold for the omission of Z to result


in omitted variable bias.

5-2
In the test score example:
1. English language ability (whether the student has
English as a second language) plausibly affects
standardized test scores: Z is a determinant of Y.
2. Immigrant communities tend to be less affluent and
thus have smaller school budgets – and higher STR:
Z is correlated with X.

• Accordingly, β̂₁ is biased


 What is the direction of this bias?
 What does common sense suggest?
 If common sense fails you, there is a formula…

5-3
A formula for omitted variable bias: recall the equation,

β̂₁ − β₁ = Σᵢ₌₁ⁿ (Xᵢ − X̄)uᵢ / Σᵢ₌₁ⁿ (Xᵢ − X̄)² = [(1/n) Σᵢ₌₁ⁿ vᵢ] / [((n−1)/n)·s_X²]

where vᵢ = (Xᵢ − X̄)uᵢ ≅ (Xᵢ − μ_X)uᵢ. Under Least Squares Assumption #1,

E[(Xᵢ − μ_X)uᵢ] = cov(Xᵢ, uᵢ) = 0.

But what if E[(Xᵢ − μ_X)uᵢ] = cov(Xᵢ, uᵢ) = σ_Xu ≠ 0?

5-4
Then

β̂₁ − β₁ = Σᵢ₌₁ⁿ (Xᵢ − X̄)uᵢ / Σᵢ₌₁ⁿ (Xᵢ − X̄)² = [(1/n) Σᵢ₌₁ⁿ vᵢ] / [((n−1)/n)·s_X²]

so

E(β̂₁) − β₁ = E[ Σᵢ₌₁ⁿ (Xᵢ − X̄)uᵢ / Σᵢ₌₁ⁿ (Xᵢ − X̄)² ] ≅ σ_Xu / σ_X² = (σ_u/σ_X)·ρ_Xu

where ≅ holds with equality when n is large; specifically,

β̂₁ →ᵖ β₁ + (σ_u/σ_X)·ρ_Xu, where ρ_Xu = corr(X,u)

5-5
Omitted variable bias formula: β̂₁ →ᵖ β₁ + (σ_u/σ_X)·ρ_Xu.
If an omitted factor Z is both:
(1) a determinant of Y (that is, it is contained in u); and
(2) correlated with X,
then ρ_Xu ≠ 0 and the OLS estimator β̂₁ is biased.

The math makes precise the idea that districts with few ESL students (1) do better on standardized tests and (2) have smaller classes (bigger budgets), so ignoring the ESL factor results in overstating the class size effect.
Is this actually going on in the CA data?
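The omitted variable bias formula can be checked by simulation: generate data in which an omitted variable Z both determines Y and is correlated with X, run the short regression of Y on X alone, and compare the average β̂₁ with the true β₁. An illustrative Python sketch (hypothetical numbers loosely patterned on the class size example):

import numpy as np

rng = np.random.default_rng(0)
beta1, gamma = -1.0, -0.65          # true effect of X; effect of the omitted Z on Y
n, reps = 400, 2000

b1 = np.empty(reps)
for r in range(reps):
    z = rng.normal(size=n)
    x = 20 + z + rng.normal(size=n)                               # Z is correlated with X
    y = 700 + beta1 * x + gamma * z + rng.normal(0, 10, size=n)   # Z determines Y, so it sits in u
    b1[r] = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)

print(b1.mean())   # systematically below beta1 = -1: the short regression overstates the (negative) effect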

5-6
 Districts with fewer English Learners have higher test scores
 Districts with lower percent EL (PctEL) have smaller classes
 Among districts with comparable PctEL, the effect of class
size is small (recall overall “test score gap” = 7.4)
5-7
Three ways to overcome omitted variable bias
1. Run a randomized controlled experiment in which
treatment (STR) is randomly assigned: then PctEL is
still a determinant of TestScore, but PctEL is
uncorrelated with STR. (But this is unrealistic in
practice.)
2. Adopt the “cross tabulation” approach, with finer
gradations of STR and PctEL (But soon we will run
out of data, and what about other determinants like
family income and parental education?)
3. Use a method in which the omitted variable (PctEL) is
no longer omitted: include PctEL as an additional
regressor in a multiple regression.
5-8
The Population Multiple Regression Model
(SW Section 5.2)

Consider the case of two regressors:

Yᵢ = β₀ + β₁X₁ᵢ + β₂X₂ᵢ + uᵢ, i = 1,…,n

• X₁, X₂ are the two independent variables (regressors)
• (Yᵢ, X₁ᵢ, X₂ᵢ) denote the i-th observation on Y, X₁, and X₂.
• β₀ = unknown population intercept
• β₁ = effect on Y of a change in X₁, holding X₂ constant
• β₂ = effect on Y of a change in X₂, holding X₁ constant
• uᵢ = "error term" (omitted factors)
5-9
Interpretation of multiple regression coefficients

Yi = 0 + 1X1i + 2X2i + ui, i = 1,…,n

Consider changing X1 by X1 while holding X2 constant:


Population regression line before the change:

Y = 0 + 1X1 + 2X2

Population regression line, after the change:

Y + Y = 0 + 1(X1 + X1) + 2X2

5-10
Before: Y = 0 + 1(X1 + X1) + 2X2

After: Y + Y = 0 + 1(X1 + X1) + 2X2

Difference: Y = 1X1
That is,
Y
1 = , holding X2 constant
X 1
also,
Y
2 = , holding X1 constant
X 2
and
0 = predicted value of Y when X1 = X2 = 0.
5-11
The OLS Estimator in Multiple Regression
(SW Section 5.3)

With two regressors, the OLS estimator solves:

min over b0, b1, b2 of  Σi [Yi – (b0 + b1X1i + b2X2i)]²

 The OLS estimator minimizes the average squared


difference between the actual values of Yi and the
prediction (predicted value) based on the estimated line.
 This minimization problem is solved using calculus
 The result is the OLS estimators of β0, β1, and β2.
5-12
Example: the California test score data

Regression of TestScore against STR:

TestScore = 698.9 – 2.28STR

Now include percent English Learners in the district


(PctEL):

TestScore = 686.0 – 1.10STR – 0.65PctEL

 What happens to the coefficient on STR?


 Why? (Note: corr(STR, PctEL) = 0.19)
5-13
Multiple regression in STATA

reg testscr str pctel, robust;

Regression with robust standard errors Number of obs = 420


F( 2, 417) = 223.82
Prob > F = 0.0000
R-squared = 0.4264
Root MSE = 14.464

------------------------------------------------------------------------------
| Robust
testscr | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
str | -1.101296 .4328472 -2.54 0.011 -1.95213 -.2504616
pctel | -.6497768 .0310318 -20.94 0.000 -.710775 -.5887786
_cons | 686.0322 8.728224 78.60 0.000 668.8754 703.189
------------------------------------------------------------------------------

TestScore = 686.0 – 1.10STR – 0.65PctEL

What is the sampling distribution of β̂1 and β̂2?


5-14
The Least Squares Assumptions for Multiple
Regression (SW Section 5.4)

Yi = 0 + 1X1i + 2X2i + … + kXki + ui, i = 1,…,n

1. The conditional distribution of u given the X’s has


mean zero, that is, E(u|X1 = x1,…, Xk = xk) = 0.
2. (X1i,…,Xki,Yi), i =1,…,n, are i.i.d.
3. X1,…, Xk, and u have four moments: E(X1i⁴) < ∞,…,
E(Xki⁴) < ∞, E(ui⁴) < ∞.
4. There is no perfect multicollinearity.

5-15
Assumption #1: the conditional mean of u given the
included X’s is zero.

 This has the same interpretation as in regression


with a single regressor.
 If an omitted variable (1) belongs in the equation (so
is in u) and (2) is correlated with an included X, then
this condition fails
 Failure of this condition leads to omitted variable
bias
 The solution – if possible – is to include the omitted
variable in the regression.

5-16
Assumption #2: (X1i,…,Xki,Yi), i =1,…,n, are i.i.d.
This is satisfied automatically if the data are collected
by simple random sampling.

Assumption #3: finite fourth moments


This technical assumption is satisfied automatically
by variables with a bounded domain (test scores,
PctEL, etc.)

5-17
Assumption #4: There is no perfect multicollinearity
Perfect multicollinearity is when one of the regressors is
an exact linear function of the other regressors.

Example: Suppose you accidentally include STR twice:


regress testscr str str, robust
Regression with robust standard errors Number of obs = 420
F( 1, 418) = 19.26
Prob > F = 0.0000
R-squared = 0.0512
Root MSE = 18.581
-------------------------------------------------------------------------
| Robust
testscr | Coef. Std. Err. t P>|t| [95% Conf. Interval]
--------+----------------------------------------------------------------
str | -2.279808 .5194892 -4.39 0.000 -3.300945 -1.258671
str | (dropped)
_cons | 698.933 10.36436 67.44 0.000 678.5602 719.3057
-------------------------------------------------------------------------
5-18
Perfect multicollinearity is when one of the regressors is
an exact linear function of the other regressors.
 In the previous regression, β1 is the effect on TestScore
of a unit change in STR, holding STR constant (???)
 Second example: regress TestScore on a constant, D,
and B, where: Di = 1 if STR ≤ 20, = 0 otherwise; Bi = 1
if STR >20, = 0 otherwise, so Bi = 1 – Di and there is
perfect multicollinearity
 Would there be perfect multicollinearity if the intercept
(constant) were somehow dropped (that is, omitted or
suppressed) in the regression?
 Perfect multicollinearity usually reflects a mistake in
the definitions of the regressors, or an oddity in the data
5-19
The Sampling Distribution of the OLS Estimator
(SW Section 5.5)

Under the four Least Squares Assumptions,


 The exact (finite sample) distribution of β̂1 has mean
β1, and var(β̂1) is inversely proportional to n; so too for β̂2.
 Other than its mean and variance, the exact
distribution of β̂1 is very complicated
 β̂1 is consistent: β̂1 →p β1 (law of large numbers)
 [β̂1 – E(β̂1)] / √var(β̂1) is approximately distributed N(0,1) (CLT)
 So too for β̂2,…, β̂k
5-20
Hypothesis Tests and Confidence Intervals for a
Single Coefficient in Multiple Regression
(SW Section 5.6)
 [β̂1 – E(β̂1)] / √var(β̂1) is approximately distributed N(0,1) (CLT).
 Thus hypotheses on β1 can be tested using the usual t-
statistic, and confidence intervals are constructed as
{β̂1 ± 1.96SE(β̂1)}.
 So too for β2,…, βk.
 β̂1 and β̂2 are generally not independently distributed
– so neither are their t-statistics (more on this later).

5-21
Example: The California class size data
(1) TestScore = 698.9 – 2.28STR
(10.4) (0.52)
(2) TestScore = 686.0 – 1.10STR – 0.650PctEL
(8.7) (0.43) (0.031)

 The coefficient on STR in (2) is the effect on


TestScores of a unit change in STR, holding constant
the percentage of English Learners in the district
 Coefficient on STR falls by one-half
 95% confidence interval for coefficient on STR in (2)
is {–1.10 ± 1.96×0.43} = (–1.95, –0.26)

5-22
Tests of Joint Hypotheses
(SW Section 5.7)

Let Expn = expenditures per pupil and consider the


population regression model:

TestScorei = 0 + 1STRi + 2Expni + 3PctELi + ui

The null hypothesis that “school resources don’t matter,”


and the alternative that they do, corresponds to:

H0: 1 = 0 and 2 = 0
vs. H1: either 1  0 or 2  0 or both
5-23
TestScorei = 0 + 1STRi + 2Expni + 3PctELi + ui

H0: 1 = 0 and 2 = 0
vs. H1: either 1  0 or 2  0 or both

A joint hypothesis specifies a value for two or more


coefficients, that is, it imposes a restriction on two or
more coefficients.
 A “common sense” test is to reject if either of the
individual t-statistics exceeds 1.96 in absolute value.
 But this “common sense” approach doesn’t work!
The resulting test doesn’t have the right significance
level!
5-24
Here’s why: Calculation of the probability of incorrectly
rejecting the null using the “common sense” test based on
the two individual t-statistics. To simplify the
calculation, suppose that β̂1 and β̂2 are independently
distributed. Let t1 and t2 be the t-statistics:

t1 = (β̂1 – 0)/SE(β̂1)  and  t2 = (β̂2 – 0)/SE(β̂2)

The “common sense” test is:
reject H0: β1 = β2 = 0 if |t1| > 1.96 and/or |t2| > 1.96

What is the probability that this “common sense” test


rejects H0, when H0 is actually true? (It should be 5%.)
5-25
Probability of incorrectly rejecting the null
= PrH0[|t1| > 1.96 and/or |t2| > 1.96]
= PrH0[|t1| > 1.96, |t2| > 1.96]
   + PrH0[|t1| > 1.96, |t2| ≤ 1.96]
   + PrH0[|t1| ≤ 1.96, |t2| > 1.96]          (disjoint events)
= PrH0[|t1| > 1.96] × PrH0[|t2| > 1.96]
   + PrH0[|t1| > 1.96] × PrH0[|t2| ≤ 1.96]
   + PrH0[|t1| ≤ 1.96] × PrH0[|t2| > 1.96]   (t1, t2 are independent by assumption)
= .05×.05 + .05×.95 + .95×.05
= .0975 = 9.75% – which is not the desired 5%!!
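As a quick check of this arithmetic (using Stata, as in the examples in these notes): under independence the probability that both t-statistics stay inside ±1.96 is .95², so the rejection rate is

display 1 - .95*.95      // = .0975, i.e. 9.75%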

5-26
The size of a test is the actual rejection rate under the null
hypothesis.

 The size of the “common sense” test isn’t 5%!


 Its size actually depends on the correlation between t1
and t2 (and thus on the correlation between ˆ1 and ˆ2 ).

Two Solutions:
 Use a different critical value in this procedure – not
1.96 (this is the “Bonferroni” method – see SW App. 5.3)
 Use a different test statistic that tests both β1 and β2 at
once: the F-statistic.

5-27
The F-statistic
The F-statistic tests all parts of a joint hypothesis at once.

Unpleasant formula for the special case of the joint
hypothesis β1 = β1,0 and β2 = β2,0 in a regression with two
regressors:

F = (1/2) × (t1² + t2² – 2ρ̂t1,t2 t1t2) / (1 – ρ̂²t1,t2)

where ρ̂t1,t2 estimates the correlation between t1 and t2.

Reject when F is “large”


5-28
The F-statistic testing β1 and β2 (special case):

F = (1/2) × (t1² + t2² – 2ρ̂t1,t2 t1t2) / (1 – ρ̂²t1,t2)
 The F-statistic is large when t1 and/or t2 is large


 The F-statistic corrects (in just the right way) for the
correlation between t1 and t2.
 The formula for more than two β’s is really nasty
unless you use matrix algebra.
 This gives the F-statistic a nice large-sample
approximate distribution, which is…

5-29
Large-sample distribution of the F-statistic
Consider the special case that t1 and t2 are independent, so
ρ̂t1,t2 →p 0; in large samples the formula becomes

F = (1/2) × (t1² + t2² – 2ρ̂t1,t2 t1t2) / (1 – ρ̂²t1,t2) ≈ (1/2)(t1² + t2²)

 Under the null, t1 and t2 have standard normal


distributions that, in this special case, are independent
 The large-sample distribution of the F-statistic is the
distribution of the average of two independently
distributed squared standard normal random variables.
5-30
The chi-squared distribution with q degrees of freedom
(χ²q) is defined to be the distribution of the sum of q
independent squared standard normal random variables.

In large samples, F is distributed as χ²q/q.

Selected large-sample critical values of χ²q/q


q 5% critical value
1 3.84 (why?)
2 3.00 (the case q=2 above)
3 2.60
4 2.37
5 2.21
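These critical values can be reproduced with Stata's invchi2() function, which returns the inverse of the chi-squared cdf (a quick check, assuming a reasonably recent Stata version):

display invchi2(1, .95)/1      // 3.84
display invchi2(2, .95)/2      // 3.00
display invchi2(5, .95)/5      // 2.21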
5-31
p-value using the F-statistic:
p-value = tail probability of the χ²q/q distribution
beyond the F-statistic actually computed.

Implementation in STATA
Use the “test” command after the regression

Example: Test the joint hypothesis that the population


coefficients on STR and expenditures per pupil
(expn_stu) are both zero, against the alternative that at
least one of the population coefficients is nonzero.

5-32
F-test example, California class size data:
reg testscr str expn_stu pctel, r;

Regression with robust standard errors Number of obs = 420


F( 3, 416) = 147.20
Prob > F = 0.0000
R-squared = 0.4366
Root MSE = 14.353

------------------------------------------------------------------------------
| Robust
testscr | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
str | -.2863992 .4820728 -0.59 0.553 -1.234001 .661203
expn_stu | .0038679 .0015807 2.45 0.015 .0007607 .0069751
pctel | -.6560227 .0317844 -20.64 0.000 -.7185008 -.5935446
_cons | 649.5779 15.45834 42.02 0.000 619.1917 679.9641
------------------------------------------------------------------------------
NOTE

test str expn_stu; The test command follows the regression

( 1) str = 0.0 There are q=2 restrictions being tested


( 2) expn_stu = 0.0

F( 2, 416) = 5.43 The 5% critical value for q=2 is 3.00


Prob > F = 0.0047 Stata computes the p-value for you
5-33
Two (related) loose ends:
1. Homoskedasticity-only versions of the F-statistic
2. The “F” distribution

The homoskedasticity-only (“rule-of-thumb”) F-


statistic
To compute the homoskedasticity-only F-statistic:
 Use the previous formulas, but using
homoskedasticity-only standard errors; or
 Run two regressions, one under the null hypothesis
(the “restricted” regression) and one under the
alternative hypothesis (the “unrestricted” regression).
 The second method gives a simple formula
5-34
The “restricted” and “unrestricted” regressions
Example: are the coefficients on STR and Expn zero?

Restricted population regression (that is, under H0):


TestScorei = 0 + 3PctELi + ui (why?)

Unrestricted population regression (under H1):


TestScorei = 0 + 1STRi + 2Expni + 3PctELi + ui

 The number of restrictions under H0 = q = 2.


 The fit will be better (R2 will be higher) in the
unrestricted regression (why?)
5-35
By how much must the R2 increase for the coefficients on
Expn and PctEL to be judged statistically significant?

Simple formula for the homoskedasticity-only F-statistic:

F = [(R²unrestricted – R²restricted)/q] / [(1 – R²unrestricted)/(n – kunrestricted – 1)]

where:
R²restricted = the R² for the restricted regression
R²unrestricted = the R² for the unrestricted regression
q = the number of restrictions under the null
kunrestricted = the number of regressors in the
unrestricted regression.
5-36
Example:
Restricted regression:
TestScore = 644.7 – 0.671PctEL,   R²restricted = 0.4149
            (1.0)   (0.032)
Unrestricted regression:
TestScore = 649.6 – 0.29STR + 3.87Expn – 0.656PctEL
            (15.5)  (0.48)    (1.59)     (0.032)
R²unrestricted = 0.4366, kunrestricted = 3, q = 2
so:
F = [(.4366 – .4149)/2] / [(1 – .4366)/(420 – 3 – 1)] = 8.01
5-37
The homoskedasticity-only F-statistic
F = [(R²unrestricted – R²restricted)/q] / [(1 – R²unrestricted)/(n – kunrestricted – 1)]

 The homoskedasticity-only F-statistic rejects when


adding the two variables increased the R2 by “enough”
– that is, when adding the two variables improves the
fit of the regression by “enough”
 If the errors are homoskedastic, then the
homoskedasticity-only F-statistic has a large-sample
distribution that is χ²q/q.
 But if the errors are heteroskedastic, the large-sample
distribution is a mess and is not χ²q/q
5-38
The F distribution

If:
1. u1,…,un are normally distributed; and
2. Xi is distributed independently of ui (so in
particular ui is homoskedastic)

then the homoskedasticity-only F-statistic has the


“Fq,n-k–1” distribution, where q = the number of
restrictions and k = the number of regressors under the
alternative (the unrestricted model).

5-39
The Fq,n–k–1 distribution:
 The F distribution is tabulated many places
 When n gets large the Fq,n–k–1 distribution asymptotes
to the χ²q/q distribution:
Fq,∞ is another name for χ²q/q
 For q not too big and n ≥ 100, the Fq,n–k–1 distribution
and the χ²q/q distribution are essentially identical.
 Many regression packages compute p-values of F-
statistics using the F distribution (which is OK if the
sample size is ≥ 100)
 You will encounter the “F-distribution” in published
empirical work.
5-40
Digression: A little history of statistics…
 The theory of the homoskedasticity-only F-statistic
and the Fq,n–k–1 distributions rests on implausibly
strong assumptions (are earnings normally
distributed?)
 These statistics date to the early 20th century, when
“computer” was a job description and observations
numbered in the dozens.
 The F-statistic and Fq,n–k–1 distribution were major
breakthroughs: an easily computed formula; a single
set of tables that could be published once, then
applied in many settings; and a precise,
mathematically elegant justification.
5-41
A little history of statistics, ctd…
 The strong assumptions seemed a minor price for this
breakthrough.
 But with modern computers and large samples we can
use the heteroskedasticity-robust F-statistic and the
Fq, distribution, which only require the four least
squares assumptions.
 This historical legacy persists in modern software, in
which homoskedasticity-only standard errors (and F-
statistics) are the default, and in which p-values are
computed using the Fq,n–k–1 distribution.

5-42
Summary: the homoskedasticity-only (“rule of
thumb”) F-statistic and the F distribution
 These are justified only under very strong conditions
– stronger than are realistic in practice.
 Yet, they are widely used.
 You should use the heteroskedasticity-robust F-
statistic, with χ²q/q (that is, Fq,∞) critical values.
 For n ≥ 100, the F-distribution essentially is the χ²q/q
distribution.
 For small n, the F distribution isn’t necessarily a
“better” approximation to the sampling distribution of
the F-statistic – only if the strong conditions are true.

5-43
Summary: testing joint hypotheses
 The “common-sense” approach of rejecting if either
of the t-statistics exceeds 1.96 rejects more than 5% of
the time under the null (the size exceeds the desired
significance level)
 The heteroskedasticity-robust F-statistic is built in to
STATA (“test” command); this tests all q restrictions
at once.
 For n large, F is distributed as χ²q/q (= Fq,∞)
 The homoskedasticity-only F-statistic is important
historically (and thus in practice), and is intuitively
appealing, but invalid when there is heteroskedasticity

5-44
Testing Single Restrictions on Multiple Coefficients
(SW Section 5.8)

Yi = 0 + 1X1i + 2X2i + ui, i = 1,…,n

Consider the null and alternative hypothesis,

H0: 1 = 2 vs. H1: 1  2

This null imposes a single restriction (q = 1) on multiple


coefficients – it is not a joint hypothesis with multiple
restrictions (compare with β1 = 0 and β2 = 0).

5-45
Two methods for testing single restrictions on multiple
coefficients:

1. Rearrange (“transform”) the regression


Rearrange the regressors so that the restriction
becomes a restriction on a single coefficient in
an equivalent regression

2. Perform the test directly


Some software, including STATA, lets you test
restrictions using multiple coefficients directly

5-46
Method 1: Rearrange (“transform”) the regression

Yi = 0 + 1X1i + 2X2i + ui
H0: 1 = 2 vs. H1: 1  2

Add and subtract 2X1i:


Yi = 0 + (1 – 2) X1i + 2(X1i + X2i) + ui
or
Yi = 0 + 1 X1i + 2Wi + ui
where
 1 = 1 –  2
Wi = X1i + X2i
(a) Original system:
5-47
Yi = 0 + 1X1i + 2X2i + ui
H0: 1 = 2 vs. H1: 1  2

(b) Rearranged (“transformed”) system:


Yi = 0 + 1 X1i + 2Wi + ui
where 1 = 1 – 2 and Wi = X1i + X2i
so
H0: 1 = 0 vs. H1: 1  0

The testing problem is now a simple one:


test whether 1 = 0 in specification (b).

5-48
Method 2: Perform the test directly

Yi = 0 + 1X1i + 2X2i + ui
H0: 1 = 2 vs. H1: 1  2

Example:

TestScorei = 0 + 1STRi + 2Expni + 3PctELi + ui

To test, using STATA, whether 1 = 2:

regress testscore str expn pctel, r


test str=expn

Confidence Sets for Multiple Coefficients


5-49
(SW Section 5.9)

Yi = 0 + 1X1i + 2X2i + … + kXki + ui, i = 1,…,n

What is a joint confidence set for 1 and 2?

A 95% confidence set is:


 A set-valued function of the data that contains the true
parameter(s) in 95% of hypothetical repeated samples.
 The set of parameter values that cannot be rejected at
the 5% significance level when taken as the null
hypothesis.

5-50
The coverage rate of a confidence set is the probability
that the confidence set contains the true parameter values

A “common sense” confidence set is the union of the


95% confidence intervals for β1 and β2, that is, the
rectangle:

{β̂1 ± 1.96SE(β̂1), β̂2 ± 1.96SE(β̂2)}

 What is the coverage rate of this confidence set?


 Does its coverage rate equal the desired confidence
level of 95%?

5-51
Coverage rate of “common sense” confidence set:
Pr[(1, 2)  { ˆ1  1.96SE( ˆ1 ), ˆ2 1.96  SE( ˆ2 )}]
= Pr[ ˆ1 – 1.96SE( ˆ1 )  1  ˆ1 + 1.96SE( ˆ1 ),
ˆ2 – 1.96SE( ˆ2 )  2  ˆ2 + 1.96SE( ˆ2 )]
ˆ1  1 ˆ2   2
= Pr[–1.96 1.96, –1.96 1.96]
ˆ
SE ( 1 ) ˆ
SE (  2 )
= Pr[|t1|  1.96 and |t2|  1.96]
= 1 – Pr[|t1| > 1.96 and/or |t2| > 1.96]  95% !
Why?
This confidence set “inverts” a test for which the size
doesn’t equal the significance level!

5-52
Recall: the probability of incorrectly rejecting the null
= PrH0[|t1| > 1.96 and/or |t2| > 1.96]
= PrH0[|t1| > 1.96, |t2| > 1.96]
   + PrH0[|t1| > 1.96, |t2| ≤ 1.96]
   + PrH0[|t1| ≤ 1.96, |t2| > 1.96]          (disjoint events)
= PrH0[|t1| > 1.96] × PrH0[|t2| > 1.96]
   + PrH0[|t1| > 1.96] × PrH0[|t2| ≤ 1.96]
   + PrH0[|t1| ≤ 1.96] × PrH0[|t2| > 1.96]   (if t1, t2 are independent)
= .05×.05 + .05×.95 + .95×.05
= .0975 = 9.75% – which is not the desired 5%!!

5-53
Instead, use the acceptance region of a test that has size
equal to its significance level (“invert” a valid test):

Let F(1,0,2,0) be the (heteroskedasticity-robust) F-


statistic testing the hypothesis that 1 = 1,0 and 2 = 2,0:

95% confidence set = {1,0, 2,0: F(1,0, 2,0) < 3.00}

 3.00 is the 5% critical value of the F2, distribution


 This set has coverage rate 95% because the test on
which it is based (the test it “inverts”) has size of 5%.

5-54
The confidence set based on the F-statistic is an ellipse:

{β1, β2: F = (1/2) × (t1² + t2² – 2ρ̂t1,t2 t1t2)/(1 – ρ̂²t1,t2) ≤ 3.00}

Now
F = [1/(2(1 – ρ̂²t1,t2))] × (t1² + t2² – 2ρ̂t1,t2 t1t2)
  = [1/(2(1 – ρ̂²t1,t2))] ×
    { [(β̂1 – β1,0)/SE(β̂1)]² + [(β̂2 – β2,0)/SE(β̂2)]²
      – 2ρ̂t1,t2 × [(β̂1 – β1,0)/SE(β̂1)] × [(β̂2 – β2,0)/SE(β̂2)] }

This is a quadratic form in β1,0 and β2,0 – thus the
boundary of the set F = 3.00 is an ellipse.
5-55
Confidence set based on inverting the F-statistic

5-56
The R², SER, and R̄² for Multiple Regression
(SW Section 5.10)

Actual = predicted + residual: Yi = Ŷi + ûi

As in regression with a single regressor, the SER (and the


RMSE) is a measure of the spread of the Y’s around the
regression line:

SER = √[ (1/(n – k – 1)) Σi ûi² ]

5-57
The R² is the fraction of the variance explained:

R² = ESS/TSS = 1 – SSR/TSS,

where ESS = Σi (Ŷi – Ȳ)², SSR = Σi ûi², and TSS = Σi (Yi – Ȳ)²
– just as for regression with one regressor.

 The R2 always increases when you add another


regressor – a bit of a problem for a measure of “fit”
 The R̄² corrects this problem by “penalizing” you for
including another regressor:

R̄² = 1 – [(n – 1)/(n – k – 1)] × SSR/TSS,   so R̄² < R²
5-58
How to interpret the R² and R̄²?
 A high R² (or R̄²) means that the regressors explain
the variation in Y.
 A high R² (or R̄²) does not mean that you have
eliminated omitted variable bias.
 A high R² (or R̄²) does not mean that you have an
unbiased estimator of a causal effect (β1).
 A high R² (or R̄²) does not mean that the included
variables are statistically significant – this must be
determined using hypothesis tests.

5-59
Example: A Closer Look at the Test Score Data
(SW Section 5.11, 5.12)

A general approach to variable selection and model


specification:
 Specify a “base” or “benchmark” model.
 Specify a range of plausible alternative models, which
include additional candidate variables.
 Does a candidate variable change the coefficient of
interest (β1)?
 Is a candidate variable statistically significant?
 Use judgment, not a mechanical recipe…

5-60
Variables we would like to see in the California data set:

School characteristics:
 student-teacher ratio
 teacher quality
 computers (non-teaching resources) per student
 measures of curriculum design…

Student characteristics:
 English proficiency
 availability of extracurricular enrichment
 home learning environment
 parent’s education level…
5-61
Variables actually in the California class size data set:
 student-teacher ratio (STR)
 percent English learners in the district (PctEL)
 percent eligible for subsidized/free lunch
 percent on public income assistance
 average district income

5-62
A look at more of the California data

5-63
Digression: presentation of regression results in a table
 Listing regressions in “equation” form can be
cumbersome with many regressors and many regressions
 Tables of regression results can present the key
information compactly
 Information to include:
 variables in the regression (dependent and
independent)
 estimated coefficients
 standard errors
 results of F-tests of pertinent joint hypotheses
 some measure of fit
 number of observations
5-64
5-65
Summary: Multiple Regression

 Multiple regression allows you to estimate the effect


on Y of a change in X1, holding X2 constant.
 If you can measure a variable, you can avoid omitted
variable bias from that variable by including it.
 There is no simple recipe for deciding which variables
belong in a regression – you must exercise judgment.
 One approach is to specify a base model – relying on
a-priori reasoning – then explore the sensitivity of the
key estimate(s) in alternative specifications.

5-66
Nonlinear Regression Functions
(SW Ch. 6)

 Everything so far has been linear in the X’s


 The approximation that the regression function is
linear might be good for some variables, but not for
others.
 The multiple regression framework can be extended to
handle regression functions that are nonlinear in one
or more X.

6-1
The TestScore – STR relation looks approximately
linear…

6-2
But the TestScore – average district income relation
looks like it is nonlinear.

6-3
If a relation between Y and X is nonlinear:
 The effect on Y of a change in X depends on the value
of X – that is, the marginal effect of X is not constant
 A linear regression is mis-specified – the functional
form is wrong
 The estimator of the effect on Y of a change in X is
biased – it needn’t even be right on average.
 The solution to this is to estimate a regression
function that is nonlinear in X

6-4
The General Nonlinear Population Regression Function

Yi = f(X1i,X2i,…,Xki) + ui, i = 1,…, n

Assumptions
1. E(ui| X1i,X2i,…,Xki) = 0 (same); implies that f is the
conditional expectation of Y given the X’s.
2. (X1i,…,Xki,Yi) are i.i.d. (same).
3. “enough” moments exist (same idea; the precise
statement depends on specific f).
4. No perfect multicollinearity (same idea; the precise
statement depends on the specific f).

6-5
6-6
Nonlinear Functions of a Single Independent Variable
(SW Section 6.2)

We’ll look at two complementary approaches:


1. Polynomials in X
The population regression function is approximated
by a quadratic, cubic, or higher-degree polynomial
2. Logarithmic transformations
 Y and/or X is transformed by taking its logarithm
 this gives a “percentages” interpretation that makes
sense in many applications

6-7
1. Polynomials in X
Approximate the population regression function by a
polynomial:

Yi = 0 + 1Xi + 2 X i2 +…+ r X ir + ui

 This is just the linear multiple regression model –


except that the regressors are powers of X!
 Estimation, hypothesis testing, etc. proceeds as in the
multiple regression model using OLS
 The coefficients are difficult to interpret, but the
regression function itself is interpretable

6-8
Example: the TestScore – Income relation

Incomei = average district income in the ith district


(thousand dollars per capita)

Quadratic specification:

TestScorei = 0 + 1Incomei + 2(Incomei)2 + ui

Cubic specification:

TestScorei = 0 + 1Incomei + 2(Incomei)2


+ 3(Incomei)3 + ui
6-9
Estimation of the quadratic specification in STATA

generate avginc2 = avginc*avginc; Create a new regressor


reg testscr avginc avginc2, r;

Regression with robust standard errors Number of obs = 420


F( 2, 417) = 428.52
Prob > F = 0.0000
R-squared = 0.5562
Root MSE = 12.724

------------------------------------------------------------------------------
| Robust
testscr | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
avginc | 3.850995 .2680941 14.36 0.000 3.32401 4.377979
avginc2 | -.0423085 .0047803 -8.85 0.000 -.051705 -.0329119
_cons | 607.3017 2.901754 209.29 0.000 601.5978 613.0056
------------------------------------------------------------------------------

The t-statistic on Income2 is -8.85, so the hypothesis of


linearity is rejected against the quadratic alternative at the
1% significance level.
6-10
Interpreting the estimated regression function:
(a) Plot the predicted values
TestScore = 607.3 + 3.85Incomei – 0.0423(Incomei)2
(2.9) (0.27) (0.0048)

6-11
Interpreting the estimated regression function:
(b) Compute “effects” for different values of X

TestScore = 607.3 + 3.85Incomei – 0.0423(Incomei)2


(2.9) (0.27) (0.0048)

Predicted change in TestScore for a change in income to


$6,000 from $5,000 per capita:

ΔTestScore = [607.3 + 3.85×6 – 0.0423×6²]
             – [607.3 + 3.85×5 – 0.0423×5²]
           = 3.4
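The same calculation can be done as a quick check with Stata's display command:

display (607.3 + 3.85*6 - .0423*6^2) - (607.3 + 3.85*5 - .0423*5^2)      // = 3.4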

6-12
TestScore = 607.3 + 3.85Incomei – 0.0423(Incomei)2

Predicted “effects” for different values of X


Change in Income (th$ per capita)      ΔTestScore
from 5 to 6 3.4
from 25 to 26 1.7
from 45 to 46 0.0

The “effect” of a change in income is greater at low than


high income levels (perhaps, a declining marginal benefit
of an increase in school budgets?)
Caution! What about a change from 65 to 66?
Don’t extrapolate outside the range of the data.
6-13
Estimation of the cubic specification in STATA

gen avginc3 = avginc*avginc2; Create the cubic regressor


reg testscr avginc avginc2 avginc3, r;

Regression with robust standard errors Number of obs = 420


F( 3, 416) = 270.18
Prob > F = 0.0000
R-squared = 0.5584
Root MSE = 12.707

------------------------------------------------------------------------------
| Robust

testscr | Coef. Std. Err. t P>|t| [95% Conf. Interval]


-------------+----------------------------------------------------------------
avginc | 5.018677 .7073505 7.10 0.000 3.628251 6.409104
avginc2 | -.0958052 .0289537 -3.31 0.001 -.1527191 -.0388913
avginc3 | .0006855 .0003471 1.98 0.049 3.27e-06 .0013677
_cons | 600.079 5.102062 117.61 0.000 590.0499 610.108
------------------------------------------------------------------------------

The cubic term is statistically significant at the 5%, but


not 1%, level
6-14
Testing the null hypothesis of linearity, against the
alternative that the population regression is quadratic
and/or cubic, that is, it is a polynomial of degree up to 3:

H0: pop’n coefficients on Income2 and Income3 = 0


H1: at least one of these coefficients is nonzero.

test avginc2 avginc3; Execute the test command after running the regression

( 1) avginc2 = 0.0
( 2) avginc3 = 0.0

F( 2, 416) = 37.69
Prob > F = 0.0000

The hypothesis that the population regression is linear is


rejected at the 1% significance level against the
alternative that it is a polynomial of degree up to 3.
6-15
Summary: polynomial regression functions
Yi = 0 + 1Xi + 2 X i2 +…+ r X ir + ui
 Estimation: by OLS after defining new regressors
 Coefficients have complicated interpretations
 To interpret the estimated regression function:
o plot predicted values as a function of x
o compute predicted Y/X at different values of x
 Hypotheses concerning degree r can be tested by t-
and F-tests on the appropriate (blocks of) variable(s).
 Choice of degree r
o plot the data; t- and F-tests, check sensitivity of
estimated effects; judgment.
o Or use model selection criteria (maybe later)
6-16
2. Logarithmic functions of Y and/or X
 ln(X) = the natural logarithm of X
 Logarithmic transforms permit modeling relations
in “percentage” terms (like elasticities), rather than
linearly.

Here’s why: ln(x + Δx) – ln(x) = ln(1 + Δx/x) ≈ Δx/x
(calculus: d ln(x)/dx = 1/x)
Numerically:
ln(1.01) = .00995 ≈ .01; ln(1.10) = .0953 ≈ .10 (sort of)

6-17
Three cases:

Case               Population regression function

I.   linear-log    Yi = β0 + β1ln(Xi) + ui
II.  log-linear    ln(Yi) = β0 + β1Xi + ui
III. log-log       ln(Yi) = β0 + β1ln(Xi) + ui

 The interpretation of the slope coefficient differs in


each case.
 The interpretation is found by applying the general
“before and after” rule: “figure out the change in Y for
a given change in X.”

6-18
I. Linear-log population regression function

Yi = 0 + 1ln(Xi) + ui (b)

Now change X: Y + Y = 0 + 1ln(X + X) (a)

Subtract (a) – (b): Y = 1[ln(X + X) – ln(X)]

X
now ln(X + X) – ln(X) ,
X
X
so Y 1
X
Y
or 1 (small X)
X / X
6-19
Linear-log case, continued

Yi = 0 + 1ln(Xi) + ui

for small X,


Y
1
X / X

X
Now 100 = percentage change in X, so a 1%
X
increase in X (multiplying X by 1.01) is associated with
a .011 change in Y.

6-20
Example: TestScore vs. ln(Income)
 First defining the new regressor, ln(Income)
 The model is now linear in ln(Income), so the linear-log
model can be estimated by OLS:

TestScore = 557.8 + 36.42 ln(Incomei)


(3.8) (1.40)

so a 1% increase in Income is associated with an


increase in TestScore of 0.36 points on the test.
 Standard errors, confidence intervals, R2 – all the
usual tools of regression apply here.
 How does this compare to the cubic model?
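A sketch of the corresponding Stata commands for the linear-log specification (ln() is Stata's natural-log function; variable names as in the earlier output):

gen lavginc = ln(avginc)              // create ln(Income)
regress testscr lavginc, robust       // linear-log regression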
6-21
TestScore = 557.8 + 36.42 ln(Incomei)

6-22
II. Log-linear population regression function

ln(Yi) = 0 + 1Xi + ui (b)

Now change X: ln(Y + Y) = 0 + 1(X + X) (a)

Subtract (a) – (b): ln(Y + Y) – ln(Y) = 1X

Y
so 1X
Y
Y / Y
or 1 (small X)
X

6-23
Log-linear case, continued
ln(Yi) = 0 + 1Xi + ui

Y / Y
for small X, 1
X

Y
 Now 100 = percentage change in Y, so a change
Y
in X by one unit (X = 1) is associated with a 1001%
change in Y (Y increases by a factor of 1+1).
 Note: What are the units of ui and the SER?
o fractional (proportional) deviations
o for example, SER = .2 means…

6-24
III. Log-log population regression function

ln(Yi) = 0 + 1ln(Xi) + ui (b)

Now change X: ln(Y + Y) = 0 + 1ln(X + X) (a)

Subtract: ln(Y + Y) – ln(Y) = 1[ln(X + X) – ln(X)]

Y X
so 1
Y X
Y / Y
or 1 (small X)
X / X

6-25
Log-log case, continued

ln(Yi) = 0 + 1ln(Xi) + ui

for small X,


Y / Y
1
X / X
Y X
Now 100 = percentage change in Y, and 100 =
Y X
percentage change in X, so a 1% change in X is
associated with a 1% change in Y.
 In the log-log specification, 1 has the interpretation
of an elasticity.

6-26
Example: ln( TestScore) vs. ln( Income)
 First defining a new dependent variable, ln(TestScore),
and the new regressor, ln(Income)
 The model is now a linear regression of ln(TestScore)
against ln(Income), which can be estimated by OLS:

ln(TestScore) = 6.336 + 0.0554 ln(Incomei)


(0.006) (0.0021)

A 1% increase in Income is associated with an
increase of .0554% in TestScore (a factor of about 1.000554)
 How does this compare to the log-linear model?
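A sketch of the corresponding Stata commands for the log-log specification (same caveats as for the linear-log sketch above):

gen ltestscr = ln(testscr)            // create ln(TestScore)
gen lavginc = ln(avginc)              // create ln(Income), if not already created
regress ltestscr lavginc, robust      // log-log regression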

6-27
Neither specification seems to fit as well as the cubic or linear-log
6-28
Summary: Logarithmic transformations

 Three cases, differing in whether Y and/or X is


transformed by taking logarithms.
 After creating the new variable(s) ln(Y) and/or ln(X),
the regression is linear in the new variables and the
coefficients can be estimated by OLS.
 Hypothesis tests and confidence intervals are now
standard.
 The interpretation of 1 differs from case to case.
 Choice of specification should be guided by judgment
(which interpretation makes the most sense in your
application?), tests, and plotting predicted values
6-29
Interactions Between Independent Variables
(SW Section 6.3)

 Perhaps a class size reduction is more effective in some


circumstances than in others…
 Perhaps smaller classes help more if there are many
English learners, who need individual attention
 That is, ΔTestScore/ΔSTR might depend on PctEL
 More generally, ΔY/ΔX1 might depend on X2
 How to model such “interactions” between X1 and X2?
 We first consider binary X’s, then continuous X’s
6-30
(a) Interactions between two binary variables

Yi = 0 + 1D1i + 2D2i + ui

 D1i, D2i are binary


 β1 is the effect of changing D1 = 0 to D1 = 1. In this
specification, this effect doesn’t depend on the value of
D2.
 To allow the effect of changing D1 to depend on D2,
include the “interaction term” D1i D2i as a regressor:

Yi = 0 + 1D1i + 2D2i + 3(D1i D2i) + ui

6-31
Interpreting the coefficients
Yi = 0 + 1D1i + 2D2i + 3(D1i D2i) + ui

General rule: compare the various cases


E(Yi|D1i=0, D2i=d2) = 0 + 2d2 (b)
E(Yi|D1i=1, D2i=d2) = 0 + 1 + 2d2 + 3d2 (a)

subtract (a) – (b):


E(Yi|D1i=1, D2i=d2) – E(Yi|D1i=0, D2i=d2) = 1 + 3d2

 The effect of D1 depends on d2 (what we wanted)


 3 = increment to the effect of D1, when D2 = 1

6-32
Example: TestScore, STR, English learners
Let
HiSTR = 1 if STR ≥ 20, = 0 if STR < 20
HiEL  = 1 if PctEL ≥ 10, = 0 if PctEL < 10

TestScore = 664.1 – 18.2HiEL – 1.9HiSTR – 3.5(HiSTR HiEL)


(1.4) (2.3) (1.9) (3.1)

 “Effect” of HiSTR when HiEL = 0 is –1.9


 “Effect” of HiSTR when HiEL = 1 is –1.9 – 3.5 = –5.4
 Class size reduction is estimated to have a bigger effect
when the percent of English learners is large
 This interaction isn’t statistically significant: t = 3.5/3.1
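For reference, a sketch of how these binary variables and their interaction might be constructed and the regression estimated in Stata (str and pctel as in the earlier output; the sketch assumes no missing values in these variables):

gen histr = (str >= 20)                          // HiSTR
gen hiel = (pctel >= 10)                         // HiEL
gen histr_hiel = histr*hiel                      // interaction term
regress testscr hiel histr histr_hiel, robust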
6-33
(b) Interactions between continuous and binary
variables
Yi = 0 + 1Di + 2Xi + ui

 Di is binary, X is continuous
 As specified above, the effect on Y of X (holding
constant D) = β2, which does not depend on D
 To allow the effect of X to depend on D, include the
“interaction term” Di Xi as a regressor:

Yi = 0 + 1Di + 2Xi + 3(Di Xi) + ui

6-34
Interpreting the coefficients
Yi = 0 + 1Di + 2Xi + 3(Di Xi) + ui

General rule: compare the various cases


Y = 0 + 1D + 2X + 3(D X) (b)
Now change X:
Y + Y = 0 + 1D + 2(X+X) + 3[D (X+X)] (a)
subtract (a) – (b):
Y
Y = 2X + 3DX or = 2 + 3D
X
 The effect of X depends on D (what we wanted)
 3 = increment to the effect of X, when D = 1
Example: TestScore, STR, HiEL (= 1 if PctEL ≥ 20)
6-35
TestScore = 682.2 – 0.97STR + 5.6HiEL – 1.28(STR HiEL)
(11.9) (0.59) (19.5) (0.97)

 When HiEL = 0:
TestScore = 682.2 – 0.97STR
 When HiEL = 1,
TestScore = 682.2 – 0.97STR + 5.6 – 1.28STR
= 687.8 – 2.25STR
 Two regression lines: one for each HiEL group.
 Class size reduction is estimated to have a larger effect
when the percent of English learners is large.
Example, ctd.
6-36
TestScore = 682.2 – 0.97STR + 5.6HiEL – 1.28(STR HiEL)
(11.9) (0.59) (19.5) (0.97)

Testing various hypotheses:

 The two regression lines have the same slope ⇔ the
coefficient on STR×HiEL is zero:
t = –1.28/0.97 = –1.32 ⇒ can’t reject
 The two regression lines have the same intercept ⇔
the coefficient on HiEL is zero:
t = 5.6/19.5 = 0.29 ⇒ can’t reject
Example, ctd.
TestScore = 682.2 – 0.97STR + 5.6HiEL – 1.28(STR HiEL),
(11.9) (0.59) (19.5) (0.97)

6-37
 Joint hypothesis that the two regression lines are the
same population coefficient on HiEL = 0 and
population coefficient on STR HiEL = 0:
F = 89.94 (p-value < .001) !!
 Why do we reject the joint hypothesis but neither
individual hypothesis?
 Consequence of high but imperfect multicollinearity:
high correlation between HiEL and STR HiEL
Binary-continuous interactions: the two regression lines

Yi = 0 + 1Di + 2Xi + 3(Di Xi) + ui

Observations with Di= 0 (the “D = 0” group):


6-38
Yi = 0 + 2Xi + ui

Observations with Di= 1 (the “D = 1” group):

Yi = 0 + 1 + 2Xi + 3Xi + ui
= (0+1) + (2+3)Xi + ui

6-39
6-40
(c) Interactions between two continuous variables

Yi = 0 + 1X1i + 2X2i + ui

 X1, X2 are continuous


 As specified, the effect of X1 doesn’t depend on X2
 As specified, the effect of X2 doesn’t depend on X1
 To allow the effect of X1 to depend on X2, include the
“interaction term” X1i X2i as a regressor:

Yi = 0 + 1X1i + 2X2i + 3(X1i X2i) + ui

Coefficients in continuous-continuous interactions


6-41
Yi = 0 + 1X1i + 2X2i + 3(X1i X2i) + ui
General rule: compare the various cases
Y = 0 + 1X1 + 2X2 + 3(X1 X2) (b)
Now change X1:
Y+ Y = 0 + 1(X1+X1) + 2X2 + 3[(X1+X1) X2] (a)
subtract (a) – (b):
Y
Y = 1X1 + 3X2X1 or = 2 + 3X2
X 1
 The effect of X1 depends on X2 (what we wanted)
 3 = increment to the effect of X1 from a unit change
in X2
Example: TestScore, STR, PctEL

6-42
TestScore = 686.3 – 1.12STR – 0.67PctEL + .0012(STR PctEL),
(11.8) (0.59) (0.37) (0.019)

The estimated effect of class size reduction is nonlinear
because the size of the effect itself depends on PctEL:

ΔTestScore/ΔSTR = –1.12 + .0012PctEL

PctEL        ΔTestScore/ΔSTR
0            –1.12
20%          –1.12 + .0012×20 = –1.10
Example, ctd: hypothesis tests
TestScore = 686.3 – 1.12STR – 0.67PctEL + .0012(STR PctEL),
(11.8) (0.59) (0.37) (0.019)
6-43
 Does population coefficient on STR PctEL = 0?
t = .0012/.019 = .06 can’t reject null at 5% level
 Does population coefficient on STR = 0?
t = –1.12/0.59 = –1.90 can’t reject null at 5% level
 Do the coefficients on both STR and STR PctEL = 0?
F = 3.89 (p-value = .021) reject null at 5% level(!!)
(Why? high but imperfect multicollinearity)
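A sketch of how this continuous-continuous interaction specification might be estimated in Stata, constructing the interaction regressor by hand (variable names as in the earlier output):

gen str_pctel = str*pctel                        // STR x PctEL
regress testscr str pctel str_pctel, robust
test str str_pctel                               // joint test that both STR terms are zero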

6-44
Application: Nonlinear Effects on Test Scores
of the Student-Teacher Ratio
(SW Section 6.4)

Focus on two questions:

1. Are there nonlinear effects of class size reduction on


test scores? (Does a reduction from 35 to 30 have
same effect as a reduction from 20 to 15?)

2. Are there nonlinear interactions between PctEL and


STR? (Are small classes more effective when there are
many English learners?)
6-45
Strategy for Question #1 (different effects for different STR?)

 Estimate linear and nonlinear functions of STR, holding


constant relevant demographic variables
o PctEL
o Income (remember the nonlinear TestScore-Income
relation!)
o LunchPCT (fraction on free/subsidized lunch)
 See whether adding the nonlinear terms makes an
“economically important” quantitative difference
(“economic” or “real-world” importance is different than
statistically significant)
 Test for whether the nonlinear terms are significant
6-46
What is a good “base” specification?

6-47
The TestScore – Income relation

An advantage of the logarithmic specification is that it is


better behaved near the ends of the sample, especially large
values of income.
6-48
Base specification
From the scatterplots and preceding analysis, here are
plausible starting points for the demographic control
variables:

Dependent variable: TestScore

Independent variable Functional form


PctEL linear
LunchPCT linear
Income ln(Income)
(or could use cubic)

6-49
Question #1:
Investigate by considering a polynomial in STR

TestScore = 252.0 + 64.33STR – 3.42STR2 + .059STR3


(163.6) (24.86) (1.25) (.021)

– 5.47HiEL – .420LunchPCT + 11.75ln(Income)


(1.03) (.029) (1.78)

Interpretation of coefficients on:


 HiEL?
 LunchPCT?
 ln(Income)?
 STR, STR2, STR3?

6-50
Interpreting the regression function via plots
(preceding regression is labeled (5) in this figure)

6-51
Are the higher order terms in STR statistically
significant?

TestScore = 252.0 + 64.33STR – 3.42STR2 + .059STR3


(163.6) (24.86) (1.25) (.021)

– 5.47HiEL – .420LunchPCT + 11.75ln(Income)


(1.03) (.029) (1.78)

(a) H0: quadratic in STR v. H1: cubic in STR?


t = .059/.021 = 2.86 (p = .005)

(b) H0: linear in STR v. H1: nonlinear/up to cubic in STR?


F = 6.17 (p = .002)
6-52
Question #2: STR-PctEL interactions
(to simplify things, ignore STR2, STR3 terms for now)

TestScore = 653.6 – .53STR + 5.50HiEL – .58HiEL STR


(9.9) (.34) (9.80) (.50)

– .411LunchPCT + 12.12ln(Income)
(.029) (1.80)

Interpretation of coefficients on:


 STR?
 HiEL? (wrong sign?)
 HiEL STR?
 LunchPCT?
 ln(Income)?

6-53
Interpreting the regression functions via plots:

TestScore = 653.6 – .53STR + 5.50HiEL – .58HiEL STR


(9.9) (.34) (9.80) (.50)

– .411LunchPCT + 12.12ln(Income)
(.029) (1.80)

“Real-world” (“policy” or “economic”) importance of


the interaction term:
TestScore  1.12 if HiEL  1
= –.53 – .58HiEL = 
STR  .53 if HiEL  0
 The difference in the estimated effect of reducing the
STR is substantial; class size reduction is more
effective in districts with more English learners
6-54
Is the interaction effect statistically significant?

TestScore = 653.6 – .53STR + 5.50HiEL – .58HiEL STR


(9.9) (.34) (9.80) (.50)

– .411LunchPCT + 12.12ln(Income)
(.029) (1.80)

(a) H0: coeff. on interaction=0 v. H1: nonzero interaction


t = –1.17 not significant at the 10% level
(b) H0: both coeffs involving STR = 0 vs.
H1: at least one coefficient is nonzero (STR enters)
F = 5.92 (p = .003)

Next: specifications with polynomials + interactions!


6-55
6-56
Interpreting the regression functions via plots:

6-57
Tests of joint hypotheses:

6-58
Summary: Nonlinear Regression Functions

 Using functions of the independent variables such as


ln(X) or X1 X2, allows recasting a large family of
nonlinear regression functions as multiple regression.
 Estimation and inference proceeds in the same way as
in the linear multiple regression model.
 Interpretation of the coefficients is model-specific, but
the general rule is to compute effects by comparing
different cases (different value of the original X’s)
 Many nonlinear specifications are possible, so you must
use judgment: What nonlinear effect you want to
analyze? What makes sense in your application?
6-59
Assessing Studies Based on
Multiple Regression
(SW Ch. 7)

Multiple regression has some key virtues:


 It provides an estimate of the effect on Y of arbitrary
changes in X.
 It resolves the problem of omitted variable bias, if an
omitted variable can be measured and included.
 It can handle nonlinear relations (effects that vary
with the X’s)
Still, OLS might yield a biased estimator of the true
causal effect.
7-1
A Framework for Assessing Statistical Studies

Internal and External Validity


 Internal validity: the statistical inferences about
causal effects are valid for the population being
studied.
 External validity: the statistical inferences can be
generalized from the population and setting studied to
other populations and settings, where the “setting”
refers to the legal, policy, and physical environment
and related salient features.

7-2
Threats to External Validity

How far can we generalize class size results from


California school districts?
 Differences in populations
o California in 2005?
o Massachusetts in 2005?
o Mexico in 2005?
 Differences in settings
o different legal requirements concerning special
education
o different treatment of bilingual education
o differences in teacher characteristics
7-3
Threats to Internal Validity of
Multiple Regression Analysis
(SW Section 7.2)

Internal validity: the statistical inferences about causal


effects are valid for the population being studied.
Five threats to the internal validity of regression studies:
1. Omitted variable bias
2. Wrong functional form
3. Errors-in-variables bias
4. Sample selection bias
5. Simultaneous causality bias
All of these imply that E(ui|X1i,…,Xki) ≠ 0.
7-4
1. Omitted variable bias
Arises if an omitted variable both (i) is a determinant of
Y and (ii) is correlated with at least one included
regressor.

Potential solutions to omitted variable bias


 If the variable can be measured, include it as a
regressor in multiple regression;
 Possibly, use panel data in which each entity
(individual) is observed more than once;
 If the variable cannot be measured, use instrumental
variables regression;
 Run a randomized controlled experiment.
7-5
2. Wrong functional form
Arises if the functional form is incorrect – for example,
an interaction term is incorrectly omitted; then inferences
on causal effects will be biased.

Potential solutions to functional form misspecification

 Continuous dependent variable: use the “appropriate”


nonlinear specifications in X (logarithms, interactions,
etc.)
 Discrete (example: binary) dependent variable: need
an extension of multiple regression methods (“probit”
or “logit” analysis for binary dependent variables).
7-6
3. Errors-in-variables bias

So far we have assumed that X is measured without error.


In reality, economic data often have measurement error
 Data entry errors in administrative data
 Recollection errors in surveys (when did you start
your current job?)
 Ambiguous questions problems (what was your
income last year?)
 Intentionally false response problems with surveys
(What is the current value of your financial assets?
How often do you drink and drive?)

7-7
In general, measurement error in a regressor results in
“errors-in-variables” bias.

Illustration: suppose

Yi = 0 + 1Xi + ui

is “correct” in the sense that the three least squares


assumptions hold (in particular E(ui|Xi) = 0).

Let
Xi = unmeasured true value of X
X̃i = imprecisely measured version of X
7-8
Then
Yi = β0 + β1Xi + ui
   = β0 + β1X̃i + [β1(Xi – X̃i) + ui]
or
Yi = β0 + β1X̃i + ũi, where ũi = β1(Xi – X̃i) + ui

If X̃i is correlated with ũi then β̂1 will be biased:

cov(X̃i, ũi) = cov(X̃i, β1(Xi – X̃i) + ui)
            = β1cov(X̃i, Xi – X̃i) + cov(X̃i, ui)
            = β1[cov(X̃i, Xi) – var(X̃i)] + 0 ≠ 0

because in general cov(X̃i, Xi) ≠ var(X̃i).


7-9
Yi = 0 + 1 X i + ui , where ui = 1(Xi – X i ) + ui

 If Xi is measured with error, X i is in general correlated


with ui , so ˆ1 is biased and inconsistent.
 It is possible to derive formulas for this bias, but they
require making specific mathematical assumptions
about the measurement error process (for example,
that ui and Xi are uncorrelated). Those formulas are
special and particular, but the observation that
measurement error in X results in bias is general.
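A minimal simulation sketch (hypothetical simulated data) of classical measurement error, showing the resulting attenuation of the OLS slope:

* Sketch only: classical errors-in-variables
clear
set obs 1000
set seed 2
gen xstar = rnormal()                  // true regressor
gen x = xstar + rnormal()              // measured with error
gen y = 1 + 2*xstar + rnormal()        // true slope is 2
regress y x                            // slope is biased toward zero (about 1 in this design)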

7-10
Potential solutions to errors-in-variables bias

 Obtain better data.


 Develop a specific model of the measurement error
process.
 This is only possible if a lot is known about the
nature of the measurement error – for example a
subsample of the data are cross-checked using
administrative records and the discrepancies are
analyzed and modeled. (Very specialized; we won’t
pursue this here.)
 Instrumental variables regression.

7-11
4. Sample selection bias

So far we have assumed simple random sampling of the


population. In some cases, simple random sampling is
thwarted because the sample, in effect, “selects itself.”

Sample selection bias arises when a selection process (i)


influences the availability of data and (ii) that process is
related to the dependent variable.

7-12
Example #1: Mutual funds
 Do actively managed mutual funds outperform “hold-
the-market” funds?
 Empirical strategy:
o Sampling scheme: simple random sampling of
mutual funds available to the public on a given
date.
o Data: returns for the preceding 10 years.
o Estimator: average ten-year return of the sample
mutual funds, minus ten-year return on S&P500
o Is there sample selection bias?

7-13
Sample selection bias induces correlation between a
regressor and the error term.

Mutual fund example:

returni = 0 + 1managed_fundi + ui

Being a managed fund in the sample (managed_fundi =


1) means that your return was better than failed managed
funds, which are not in the sample – so
corr(managed_fundi, ui) ≠ 0.

7-14
Example #2: returns to education
 What is the return to an additional year of education?
 Empirical strategy:
o Sampling scheme: simple random sampling of
workers
o Data: earnings and years of education
o Estimator: regress ln(earnings) on years_education
o Ignore issues of omitted variable bias and
measurement error – is there sample selection
bias?

7-15
Potential solutions to sample selection bias

 Collect the sample in a way that avoids sample


selection.
o Mutual funds example: change the sample
population from those available at the end of the
ten-year period, to those available at the
beginning of the period (include failed funds)
o Returns to education example: sample college
graduates, not workers (include the unemployed)
 Randomized controlled experiment.
 Construct a model of the sample selection problem
and estimate that model (we won’t do this).
7-16
5. Simultaneous causality bias

So far we have assumed that X causes Y.


What if Y causes X, too?

Example: Class size effect


 Low STR results in better test scores
 But suppose districts with low test scores are given
extra resources: as a result of a political process they
also have low STR
 What does this mean for a regression of TestScore on
STR?

7-17
Simultaneous causality bias in equations

(a) Causal effect of X on Y:  Yi = β0 + β1Xi + ui

(b) Causal effect of Y on X:  Xi = γ0 + γ1Yi + vi

 Large ui means large Yi, which implies large Xi (if γ1 > 0)
 Thus corr(Xi, ui) ≠ 0
 Thus β̂1 is biased and inconsistent.
 Ex: A district with particularly bad test scores given the
STR (negative ui) receives extra resources, thereby
lowering its STR; so STRi and ui are correlated
Potential solutions to simultaneous causality bias
7-18
 Randomized controlled experiment. Because Xi is
chosen at random by the experimenter, there is no
feedback from the outcome variable to Yi (assuming
perfect compliance).
 Develop and estimate a complete model of both
directions of causality. This is the idea behind many
large macro models (e.g., the Federal Reserve’s FRB/US model).
This is extremely difficult in practice.
 Use instrumental variables regression to estimate the
causal effect of interest (effect of X on Y, ignoring
effect of Y on X).

7-19
Applying this Framework: Test Scores and Class Size
(SW Chapter 7.3)

Objective: Assess the threats to the internal and external


validity of the empirical analysis of the California test
score data.

 External validity
o Compare results for California and Massachusetts
o Think hard…
 Internal validity
o Go through the list of five potential threats to
internal validity and think hard…
7-20
Check of external validity
compare the California study to one using
Massachusetts data

The Massachusetts data set


 220 elementary school districts
 Test: 1998 MCAS test – fourth grade total (Math +
English + Science)
 Variables: STR, TestScore, PctEL, LunchPct, Income

7-21
The Massachusetts data: summary statistics

7-22
7-23
7-24
 Logarithmic v. cubic function for STR?
 Evidence of nonlinearity in TestScore-STR relation?
 Is there a significant HiEL STR interaction?

7-25
Predicted effects for a class size reduction of 2
Linear specification for Mass:

TestScore = 744.0 – 0.64STR – 0.437PctEL – 0.582LunchPct


(21.3) (0.27) (0.303) (0.097)

– 3.07Income + 0.164Income2 – 0.0022Income3


(2.35) (0.085) (0.0010)
Estimated effect = –0.64×(–2) = 1.28
Standard error = 2×0.27 = 0.54
NOTE: var(aY) = a²var(Y); SE(aβ̂1) = |a|×SE(β̂1)
95% CI = 1.28 ± 1.96×0.54 = (0.22, 2.34)
Computing predicted effects in nonlinear models
7-26
Use the “before” and “after” method:

TestScore = 655.5 + 12.4STR – 0.680STR² + 0.0115STR³
            – 0.434PctEL – 0.587LunchPct
            – 3.48Income + 0.174Income² – 0.0023Income³

Estimated reduction from 20 students to 18:
ΔTestScore = [12.4×18 – 0.680×18² + 0.0115×18³]
             – [12.4×20 – 0.680×20² + 0.0115×20³] = 1.98
 compare with estimate from linear model of 1.28
 SE of this estimated effect: use the “rearrange the
regression” (“transform the regressors”) method

7-27
Summary of Findings for Massachusetts

1. Coefficient on STR falls from –1.72 to –0.69 when


control variables for student and district
characteristics are included – an indication that the
original estimate contained omitted variable bias.
2. The class size effect is statistically significant at the
1% significance level, after controlling for student
and district characteristics
3. No statistical evidence on nonlinearities in the
TestScore – STR relation
4. No statistical evidence of STR – PctEL interaction

7-28
Comparison of estimated class size effects: CA vs. MA

7-29
Summary: Comparison of California and
Massachusetts Regression Analyses

 Class size effect falls in both CA, MA data when


student and district control variables are added.
 Class size effect is statistically significant in both CA,
MA data.
 Estimated effect of a 2-student reduction in STR is
quantitatively similar for CA, MA.
 Neither data set shows evidence of STR – PctEL
interaction.
 Some evidence of STR nonlinearities in CA data, but
not in MA data.
7-30
Remaining threats to internal validity
What the CA v. MA comparison does and doesn’t show

1. Omitted variable bias


This analysis controls for:
 district demographics (income)
 some student characteristics (English speaking)
What is missing?
 Additional student characteristics, for example native
ability (but is this correlated with STR?)
 Access to outside learning opportunities
 Teacher quality (perhaps better teachers are attracted
to schools with lower STR)
7-31
Omitted variable bias, ctd.

 We have controlled for many relevant omitted factors;


 The nature of this omitted variable bias would need to
be similar in California and Massachusetts to be
consistent with these results;
 In this application we will be able to compare these
estimates based on observational data with estimates
based on experimental data – a check of this multiple
regression methodology.

7-32
2. Wrong functional form
 We have tried quite a few different functional forms,
in both the California and Mass. data
 Nonlinear effects are modest
 Plausibly, this is not a major threat at this point.

3. Errors-in-variables bias
 STR is a district-wide measure
 Presumably there is some measurement error –
students who take the test might not have experienced
the measured STR for the district
 Ideally we would like data on individual students, by
grade level.
7-33
4. Selection
 Sample is all elementary public school districts (in
California; in Mass.)
 no reason that selection should be a problem.

5. Simultaneous Causality
 School funding equalization based on test scores
could cause simultaneous causality.
 This was not in place in California or Mass. during
these samples, so simultaneous causality bias is
arguably not important.

7-34
Summary

 Framework for evaluating regression studies:


o Internal validity
o External validity
 Five threats to internal validity:
1. Omitted variable bias
2. Wrong functional form
3. Errors-in-variables bias
4. Sample selection bias
5. Simultaneous causality bias
 Rest of course focuses on econometric methods for
addressing these threats.
7-35
Regression with Panel Data
(SW Ch. 8)

A panel dataset contains observations on multiple


entities (individuals), where each entity is observed at
two or more points in time.
Examples:
 Data on 420 California school districts in 1999 and
again in 2000, for 840 observations total.
 Data on 50 U.S. states, each state is observed in 3
years, for a total of 150 observations.
 Data on 1000 individuals, in four different months,
for 4000 observations total.
8-1
Notation for panel data
A double subscript distinguishes entities (states) and
time periods (years)

i = entity (state), n = number of entities,


so i = 1,…,n

t = time period (year), T = number of time periods


so t =1,…,T

Data: Suppose we have 1 regressor. The data are:

(Xit, Yit), i = 1,…,n, t = 1,…,T


8-2
Panel data notation, ctd.
Panel data with k regressors:

(X1it, X2it,…,Xkit, Yit), i = 1,…,n, t = 1,…,T

n = number of entities (states)


T = number of time periods (years)

Some jargon…
 Another term for panel data is longitudinal data
 balanced panel: no missing observations
 unbalanced panel: some entities (states) are not
observed for some time periods (years)
8-3
Why are panel data useful?

With panel data we can control for factors that:


 Vary across entities (states) but do not vary over time
 Could cause omitted variable bias if they are omitted
 are unobserved or unmeasured – and therefore cannot
be included in the regression using multiple regression

Here’s the key idea:


If an omitted variable does not change over time, then
any changes in Y over time cannot be caused by the
omitted variable.

8-4
Example of a panel data set:
Traffic deaths and alcohol taxes

Observational unit: a year in a U.S. state


• 48 U.S. states, so n = # of entities = 48
• 7 years (1982,…,1988), so T = # of time periods = 7
• Balanced panel, so total # of observations = 7 × 48 = 336
Variables:
 Traffic fatality rate (# traffic deaths in that state in
that year, per 10,000 state residents)
 Tax on a case of beer
 Other (legal driving age, drunk driving laws, etc.)

8-5
Traffic death data for 1982

Higher alcohol taxes, more traffic deaths?


8-6
Traffic death data for 1988

Higher alcohol taxes, more traffic deaths?


8-7
Why might there be more traffic deaths in states
that have higher alcohol taxes?

Other factors that determine traffic fatality rate:


 Quality (age) of automobiles
 Quality of roads
 “Culture” around drinking and driving
 Density of cars on the road

8-8
These omitted factors could cause omitted variable bias.

Example #1: traffic density. Suppose:


(i) High traffic density means more traffic deaths
(ii) (Western) states with lower traffic density have
lower alcohol taxes
 Then the two conditions for omitted variable bias are
satisfied. Specifically, “high taxes” could reflect “high
traffic density” (so the OLS coefficient would be biased
positively – high taxes, more deaths)
 Panel data lets us eliminate omitted variable bias when
the omitted variables are constant over time within a
given state.
8-9
Example #2: cultural attitudes towards drinking and
driving
(i) arguably are a determinant of traffic deaths; and
(ii) potentially are correlated with the beer tax, so beer
taxes could be picking up cultural differences
(omitted variable bias).
 Then the two conditions for omitted variable bias are
satisfied. Specifically, “high taxes” could reflect
“cultural attitudes towards drinking” (so the OLS
coefficient would be biased)
 Panel data lets us eliminate omitted variable bias when
the omitted variables are constant over time within a
given state.
8-10
Panel Data with Two Time Periods
(SW Section 8.2)

Consider the panel data model,

FatalityRateit = β0 + β1 BeerTaxit + β2 Zi + uit

Zi is a factor that does not change over time (density), at


least during the years on which we have data.
 Suppose Zi is not observed, so its omission could
result in omitted variable bias.
 The effect of Zi can be eliminated using T = 2 years.

8-11
The key idea:
Any change in the fatality rate from 1982 to 1988
cannot be caused by Zi, because Zi (by assumption)
does not change between 1982 and 1988.

The math: consider fatality rates in 1988 and 1982:


FatalityRatei1988 = β0 + β1 BeerTaxi1988 + β2 Zi + ui1988
FatalityRatei1982 = β0 + β1 BeerTaxi1982 + β2 Zi + ui1982

Suppose E(uit|BeerTaxit, Zi) = 0.

Subtracting 1988 – 1982 (that is, calculating the change),


eliminates the effect of Zi…
8-12
FatalityRatei1988 = β0 + β1 BeerTaxi1988 + β2 Zi + ui1988

FatalityRatei1982 = β0 + β1 BeerTaxi1982 + β2 Zi + ui1982


so
FatalityRatei1988 – FatalityRatei1982 =
    β1(BeerTaxi1988 – BeerTaxi1982) + (ui1988 – ui1982)

 The new error term, (ui1988 – ui1982), is uncorrelated


with either BeerTaxi1988 or BeerTaxi1982.
 This “difference” equation can be estimated by OLS,
even though Zi isn’t observed.
 The omitted variable Zi doesn’t change, so it cannot
be a determinant of the change in Y
8-13
Example: Traffic deaths and beer taxes

1982 data:
FatalityRate = 2.01 + 0.15BeerTax (n = 48)
(.15) (.13)
1988 data:
FatalityRate = 1.86 + 0.44BeerTax (n = 48)
(.11) (.13)

Difference regression (n = 48)


FR1988 – FR1982 = –.072 – 1.04(BeerTax1988 – BeerTax1982)
(.065) (.36)
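A minimal STATA sketch of how this "changes" regression can be run, assuming the traffic-fatality panel is in memory with the variables state, year, vfrall, and beertax used in the fixed-effects examples below:

. keep if year==1982 | year==1988;
. keep state year vfrall beertax;
. reshape wide vfrall beertax, i(state) j(year);
. gen d_fr  = vfrall1988  - vfrall1982;    // change in the fatality rate
. gen d_tax = beertax1988 - beertax1982;   // change in the beer tax
. reg d_fr d_tax, r;                       // the "changes" regression, n = 48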

8-14
8-15
Fixed Effects Regression
(SW Section 8.3)

What if you have more than 2 time periods (T > 2)?

Yit = β0 + β1Xit + β2Zi + uit,  i = 1,…,n,  t = 1,…,T

We can rewrite this in two useful ways:


1. “n-1 binary regressor” regression model
2. “Fixed Effects” regression model

We first rewrite this in “fixed effects” form. Suppose we


have n = 3 states: California, Texas, Massachusetts.
8-16
Yit = β0 + β1Xit + β2Zi + uit,  i = 1,…,n,  t = 1,…,T

Population regression for California (that is, i = CA):


YCA,t = β0 + β1XCA,t + β2ZCA + uCA,t
      = (β0 + β2ZCA) + β1XCA,t + uCA,t
or
YCA,t = αCA + β1XCA,t + uCA,t

• αCA = β0 + β2ZCA doesn't change over time

• αCA is the intercept for CA, and β1 is the slope
• The intercept is unique to CA, but the slope is the
  same in all the states: parallel lines.

8-17
For TX:
YTX,t = β0 + β1XTX,t + β2ZTX + uTX,t
      = (β0 + β2ZTX) + β1XTX,t + uTX,t
or
YTX,t = αTX + β1XTX,t + uTX,t,  where αTX = β0 + β2ZTX

Collecting the lines for all three states:


YCA,t = αCA + β1XCA,t + uCA,t
YTX,t = αTX + β1XTX,t + uTX,t
YMA,t = αMA + β1XMA,t + uMA,t
or
Yit = αi + β1Xit + uit,  i = CA, TX, MA,  t = 1,…,T

8-18
The regression lines for each state in a picture

[Figure: three parallel regression lines, Y = αCA + β1X, Y = αTX + β1X, and Y = αMA + β1X, with intercepts αCA > αTX > αMA]

Recall (Fig. 6.8a) that shifts in the intercept can be


represented using binary regressors…
8-19
[Same figure as above: the three parallel state regression lines]

In binary regressor form:


Yit = β0 + γCA DCAi + γTX DTXi + β1Xit + uit

• DCAi = 1 if the state is CA, = 0 otherwise

• DTXi = 1 if the state is TX, = 0 otherwise
• leave out DMAi (why?)
8-20
Summary: Two ways to write the fixed effects model
“n-1 binary regressor” form

Yit = β0 + β1Xit + γ2D2i + … + γnDni + uit

where D2i = 1 for i = 2 (state #2), = 0 otherwise; etc.

“Fixed effects” form:


Yit = β1Xit + αi + uit

• αi is called a "state fixed effect" or "state effect" – it

  is the constant (fixed) effect of being in state i
8-21
Fixed Effects Regression: Estimation

Three estimation methods:


1. “n-1 binary regressors” OLS regression
2. “Entity-demeaned” OLS regression
3. “Changes” specification (only works for T = 2)

 These three methods produce identical estimates of the


regression coefficients, and identical standard errors.
 We already did the “changes” specification (1988
minus 1982) – but this only works for T = 2 years
 Methods #1 and #2 work for general T
 Method #1 is only practical when n isn’t too big
8-22
1. “n-1 binary regressors” OLS regression

Yit = β0 + β1Xit + γ2D2i + … + γnDni + uit    (1)

where D2i = 1 for i = 2 (state #2), = 0 otherwise; etc.

 First create the binary variables D2i,…,Dni


 Then estimate (1) by OLS
 Inference (hypothesis tests, confidence intervals) is as
usual (using heteroskedasticity-robust standard errors)
 This is impractical when n is very large (for example if
n = 1000 workers)
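A sketch of this method in STATA for the traffic-fatality panel (tabulate with the generate() option creates one dummy per state; one of the 48 dummies is then omitted from the regression):

. tab state, gen(stdum);                     // creates stdum1, …, stdum48
. reg vfrall beertax stdum2-stdum48, r;      // omit stdum1 to avoid perfect multicollinearity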
8-23
2. “Entity-demeaned” OLS regression
The fixed effects regression model:
Yit = β1Xit + αi + uit

The state averages satisfy:

(1/T) Σt Yit = αi + β1 (1/T) Σt Xit + (1/T) Σt uit

(where Σt denotes the sum over t = 1,…,T)

Deviation from state averages:

Yit – (1/T) Σt Yit = β1[ Xit – (1/T) Σt Xit ] + [ uit – (1/T) Σt uit ]

8-24
Entity-demeaned OLS regression, ctd.

Yit – (1/T) Σt Yit = β1[ Xit – (1/T) Σt Xit ] + [ uit – (1/T) Σt uit ]
or
Ỹit = β1X̃it + ũit

where Ỹit = Yit – (1/T) Σt Yit and X̃it = Xit – (1/T) Σt Xit

• For i = 1 and t = 1982, Ỹit is the difference between the

  fatality rate in Alabama in 1982 and its average value in
  Alabama averaged over all 7 years.
8-25
Entity-demeaned OLS regression, ctd.

Ỹit = β1X̃it + ũit    (2)


where Ỹit = Yit – (1/T) Σt Yit, etc.

• First construct the demeaned variables Ỹit and X̃it
• Then estimate (2) by regressing Ỹit on X̃it using OLS
• Inference (hypothesis tests, confidence intervals) is as
  usual (using heteroskedasticity-robust standard errors)
• This is like the "changes" approach, but Ỹit is deviated
  from the state average instead of from Yi1.
 This can be done in a single command in STATA

8-26
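Before the single-command version below, here is a minimal sketch of the entity-demeaning done by hand in STATA (the reported standard errors from this manual regression differ slightly from areg's, which adjust the degrees of freedom for the 48 estimated state means):

. egen fr_bar  = mean(vfrall),  by(state);
. egen tax_bar = mean(beertax), by(state);
. gen fr_dm  = vfrall  - fr_bar;         // entity-demeaned Y
. gen tax_dm = beertax - tax_bar;        // entity-demeaned X
. reg fr_dm tax_dm, nocons r;            // slope matches the fixed effects estimate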
Example: Traffic deaths and beer taxes in STATA
. areg vfrall beertax, absorb(state) r;

Regression with robust standard errors Number of obs = 336


F( 1, 287) = 10.41
Prob > F = 0.0014
R-squared = 0.9050
Adj R-squared = 0.8891
Root MSE = .18986

------------------------------------------------------------------------------
| Robust
vfrall | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
beertax | -.6558736 .2032797 -3.23 0.001 -1.055982 -.2557655
_cons | 2.377075 .1051515 22.61 0.000 2.170109 2.584041
-------------+----------------------------------------------------------------
state | absorbed (48 categories)

 “areg” automatically de-means the data


 this is especially useful when n is large
 the reported intercept is arbitrary
8-27
Example, ctd.
For n = 48, T = 7:

FatalityRate = –.66BeerTax + State fixed effects


(.20)
 Should you report the intercept?
 How many binary regressors would you include to
estimate this using the “binary regressor” method?
 Compare slope, standard error to the estimate for the
1988 v. 1982 “changes” specification (T = 2, n = 48):

FR1988 – FR1982 = –.072 – 1.04(BeerTax1988 – BeerTax1982)


(.065) (.36)
8-28
Regression with Time Fixed Effects
(SW Section 8.4)

An omitted variable might vary over time but not across


states:
 Safer cars (air bags, etc.); changes in national laws
 These produce intercepts that change over time
 Let these changes (“safer cars”) be denoted by the
variable St, which changes over time but not states.
 The resulting population regression model is:

Yit = β0 + β1Xit + β2Zi + β3St + uit

8-29
Time fixed effects only
Yit = β0 + β1Xit + β3St + uit

In effect, the intercept varies from one year to the next:

Yi,1982 = β0 + β1Xi,1982 + β3S1982 + ui,1982

        = (β0 + β3S1982) + β1Xi,1982 + ui,1982
or
Yi,1982 = λ1982 + β1Xi,1982 + ui,1982,  where λ1982 = β0 + β3S1982

Similarly,
Yi,1983 = λ1983 + β1Xi,1983 + ui,1983,  where λ1983 = β0 + β3S1983
etc.
8-30
Two formulations for time fixed effects

1. “Binary regressor” formulation:

Yit = β0 + β1Xit + δ2B2t + … + δTBTt + uit

where B2t = 1 when t = 2 (year #2), = 0 otherwise; etc.

2. "Time effects" formulation:

Yit = β1Xit + λt + uit

8-31
Time fixed effects: estimation methods

1. “T-1 binary regressors” OLS regression


Yit = β0 + β1Xit + δ2B2t + … + δTBTt + uit

 Create binary variables B2,…,BT


 B2 = 1 if t = year #2, = 0 otherwise
 Regress Y on X, B2,…,BT using OLS
 Where’s B1?

2. “Year-demeaned” OLS regression


 Deviate Yit, Xit from year (not state) averages
 Estimate by OLS using “year-demeaned” data
8-32
State and Time Fixed Effects

Yit = β0 + β1Xit + β2Zi + β3St + uit

1. "Binary regressor" formulation:

Yit = β0 + β1Xit + γ2D2i + … + γnDni
          + δ2B2t + … + δTBTt + uit

2. "State and time effects" formulation:

Yit = β1Xit + αi + λt + uit

8-33
State and time effects: estimation methods

1. “n-1 and T-1 binary regressors” OLS regression


 Create binary variables D2,…,Dn
 Create binary variables B2,…,BT
 Regress Y on X, D2,…,Dn, B2,…,BT using OLS
 What about D1 and B1?
2. “State- and year-demeaned” OLS regression
 Deviate Yit, Xit from year and state averages
 Estimate by OLS using “year- and state-
demeaned” data
These two methods can be combined too.
STATA example: Traffic deaths…
8-34
. gen y83=(year==1983);
. gen y84=(year==1984);
. gen y85=(year==1985);
. gen y86=(year==1986);
. gen y87=(year==1987);
. gen y88=(year==1988);
. areg vfrall beertax y83 y84 y85 y86 y87 y88, absorb(state) r;

Regression with robust standard errors Number of obs = 336


F( 7, 281) = 3.70
Prob > F = 0.0008
R-squared = 0.9089
Adj R-squared = 0.8914
Root MSE = .18788
------------------------------------------------------------------------------
| Robust
vfrall | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
beertax | -.6399799 .2547149 -2.51 0.013 -1.141371 -.1385884
y83 | -.0799029 .0502708 -1.59 0.113 -.1788579 .0190522
y84 | -.0724206 .0452466 -1.60 0.111 -.161486 .0166448
y85 | -.1239763 .0460017 -2.70 0.007 -.214528 -.0334246
y86 | -.0378645 .0486527 -0.78 0.437 -.1336344 .0579055
y87 | -.0509021 .0516113 -0.99 0.325 -.1524958 .0506917
y88 | -.0518038 .05387 -0.96 0.337 -.1578438 .0542361
_cons | 2.42847 .1468565 16.54 0.000 2.139392 2.717549
-------------+----------------------------------------------------------------
state | absorbed (48 categories)
Go to section for other ways to do this in STATA!
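For example, in newer versions of STATA one commonly used alternative (a sketch; it requires declaring the panel structure first) is xtreg with the fe option and factor-variable year dummies:

. xtset state year;
. xtreg vfrall beertax i.year, fe vce(robust);    // state effects via fe, time effects via i.year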
8-35
Some Theory: The Fixed Effects Regression
Assumptions (SW App. 8.2)

For a single X:
Yit = β1Xit + αi + uit,  i = 1,…,n,  t = 1,…,T

1. E(uit|Xi1,…,XiT,αi) = 0.
2. (Xi1,…,XiT,Yi1,…,YiT), i = 1,…,n, are i.i.d. draws from
   their joint distribution.
3. (Xit, uit) have finite fourth moments.
4. There is no perfect multicollinearity (multiple X's)
5. corr(uit,uis|Xit,Xis,αi) = 0 for t ≠ s.
Assumptions 3 and 4 are identical to those for multiple regression; 1 and 2 differ; 5 is new.
8-36
Assumption #1: E(uit|Xi1,…,XiT,αi) = 0
 uit has mean zero, given the state fixed effect and the
entire history of the X’s for that state
 This is an extension of the previous multiple
regression Assumption #1
 This means there are no omitted lagged effects (any
lagged effects of X must enter explicitly)
• Also, there is no feedback from u to future X:
o Whether a state has a particularly high fatality rate
this year doesn’t subsequently affect whether it
increases the beer tax.
o We’ll return to this when we take up time series
data.
8-37
Assumption #2: (Xi1,…,XiT,Yi1,…,YiT), i =1,…,n, are
i.i.d. draws from their joint distribution.
 This is an extension of Assumption #2 for multiple
regression with cross-section data
 This is satisfied if entities (states, individuals) are
randomly sampled from their population by simple
random sampling, then data for those entities are
collected over time.
 This does not require observations to be i.i.d. over
time for the same entity – that would be unrealistic
(whether a state has a mandatory DWI sentencing law
this year is strongly related to whether it will have that
law next year).
8-38
Assumption #5: corr(uit,uis|Xit,Xis,αi) = 0 for t ≠ s
 This is new.
 This says that (given X), the error terms are
uncorrelated over time within a state.
 For example, uCA,1982 and uCA,1983 are uncorrelated
 Is this plausible? What enters the error term?
o Especially snowy winter
o Opening major new divided highway
o Fluctuations in traffic density from local economic
conditions
 Assumption #5 requires these omitted factors entering
uit to be uncorrelated over time, within a state.

8-39
What if Assumption #5 fails: corr(uit,uis|Xit,Xis,αi) ≠ 0?
• A useful analogy is heteroskedasticity.
• OLS panel data estimators of β1 are unbiased,
consistent
 The OLS standard errors will be wrong – usually the
OLS standard errors understate the true uncertainty
 Intuition: if uit is correlated over time, you don’t have
as much information (as much random variation) as you
would were uit uncorrelated.
 This problem is solved by using “heteroskedasticity and
autocorrelation-consistent standard errors” – we return
to this when we focus on time series regression
Application: Drunk Driving Laws and Traffic Deaths
8-40
(SW Section 8.5)

Some facts
 Approx. 40,000 traffic fatalities annually in the U.S.
 1/3 of traffic fatalities involve a drinking driver
 25% of drivers on the road between 1am and 3am
have been drinking (estimate)
 A drunk driver is 13 times as likely to cause a fatal
crash as a non-drinking driver (estimate)

8-41
Drunk driving laws and traffic deaths, ctd.

Public policy issues


 Drunk driving causes massive externalities (sober
drivers are killed, etc. etc.) – there is ample
justification for governmental intervention
 Are there any effective ways to reduce drunk driving?
If so, what?
 What are effects of specific laws:
o mandatory punishment
o minimum legal drinking age
o economic interventions (alcohol taxes)

8-42
The drunk driving panel data set
n = 48 U.S. states, T = 7 years (1982,…,1988) (balanced)

Variables
 Traffic fatality rate (deaths per 10,000 residents)
 Tax on a case of beer (Beertax)
 Minimum legal drinking age
 Minimum sentencing laws for first DWI violation:
o Mandatory Jail
o Mandatory Community Service
o otherwise, sentence will just be a monetary fine
 Vehicle miles per driver (US DOT)
 State economic data (real per capita income, etc.)
8-43
Why might panel data help?
 Potential OV bias from variables that vary across states
but are constant over time:
o culture of drinking and driving
o quality of roads
o vintage of autos on the road
use state fixed effects
 Potential OV bias from variables that vary over time
but are constant across states:
o improvements in auto safety over time
o changing national attitudes towards drunk driving
use time fixed effects

8-44
8-45
8-46
Empirical Analysis: Main Results

 Sign of beer tax coefficient changes when fixed state


effects are included
 Fixed time effects are statistically significant but do not
have a big impact on the estimated coefficients
• Estimated effect of beer tax drops when other laws are
  included as regressors
 The only policy variable that seems to have an impact is
the tax on beer – not minimum drinking age, not
mandatory sentencing, etc.
 The other economic variables have plausibly large
coefficients: more income, more driving, more deaths
8-47
Extensions of the “n-1 binary regressor” approach

The idea of using many binary indicators to eliminate


omitted variable bias can be extended to non-panel data –
the key is that the omitted variable is constant for a group
of observations, so that in effect it means that each group
has its own intercept.
Example: Class size problem.
Suppose funding and curricular issues are determined
at the county level, and each county has several
districts. Resulting omitted variable bias could be
addressed by including binary indicators, one for each
county (omit one to avoid perfect multicollinearity).
8-48
Summary: Regression with Panel Data
(SW Section 8.6)

Advantages and limitations of fixed effects regression


Advantages
 You can control for unobserved variables that:
o vary across states but not over time, and/or
o vary over time but not across states
 More observations give you more information
 Estimation involves relatively straightforward
extensions of multiple regression

8-49
 Fixed effects estimation can be done three ways:
1. “Changes” method when T = 2
2. “n-1 binary regressors” method when n is small
3. “Entity-demeaned” regression
 Similar methods apply to regression with time fixed
effects and to both time and state fixed effects
 Statistical inference: like multiple regression.

Limitations/challenges
 Need variation in X over time within states
 Time lag effects can be important
 Standard errors might be too low (errors might be
correlated over time)
8-50
Regression with a Binary Dependent Variable
(SW Ch. 9)

So far the dependent variable (Y) has been continuous:


 district-wide average test score
 traffic fatality rate

But we might want to understand the effect of X on a


binary variable:
 Y = get into college, or not
 Y = person smokes, or not
 Y = mortgage application is accepted, or not

9-1
Example: Mortgage denial and race
The Boston Fed HMDA data set
 Individual applications for single-family mortgages
made in 1990 in the greater Boston area
 2380 observations, collected under Home Mortgage
Disclosure Act (HMDA)
Variables
 Dependent variable:
o Is the mortgage denied or accepted?
 Independent variables:
o income, wealth, employment status
o other loan, property characteristics
o race of applicant
9-2
The Linear Probability Model
(SW Section 9.1)

A natural starting point is the linear regression model


with a single regressor:

Yi = β0 + β1Xi + ui
But:
• What does β1 mean when Y is binary? Is β1 = ΔY/ΔX?
• What does the line β0 + β1X mean when Y is binary?
• What does the predicted value Ŷ mean when Y is
  binary? For example, what does Ŷ = 0.26 mean?
9-3
The linear probability model, ctd.

Yi = β0 + β1Xi + ui

Recall assumption #1: E(ui|Xi) = 0, so

E(Yi|Xi) = E(β0 + β1Xi + ui|Xi) = β0 + β1Xi

When Y is binary,
E(Y) = 1 × Pr(Y=1) + 0 × Pr(Y=0) = Pr(Y=1)
so
E(Y|X) = Pr(Y=1|X)

9-4
The linear probability model, ctd.
When Y is binary, the linear regression model
Yi = β0 + β1Xi + ui
is called the linear probability model.

 The predicted value is a probability:


o E(Y|X=x) = Pr(Y=1|X=x) = prob. that Y = 1 given x
o Ŷ = the predicted probability that Yi = 1, given X
• β1 = change in probability that Y = 1 for a given Δx:

  β1 = [Pr(Y = 1|X = x + Δx) – Pr(Y = 1|X = x)] / Δx

Example: linear probability model, HMDA data


9-5
Mortgage denial v. ratio of debt payments to income
(P/I ratio) in the HMDA data set (subset)

9-6
Linear probability model: HMDA data

deny = -.080 + .604P/I ratio (n = 2380)


(.032) (.098)

 What is the predicted value for P/I ratio = .3?


Pr(deny = 1 | P/I ratio = .3) = –.080 + .604 × .3 = .151
• Calculating "effects:" increase the P/I ratio from .3 to .4:
Pr(deny = 1 | P/I ratio = .4) = –.080 + .604 × .4 = .212
The effect on the probability of denial of an increase
in P/I ratio from .3 to .4 is to increase the probability
by .061, that is, by 6.1 percentage points (what?).
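A sketch of these calculations in STATA using the stored coefficients (deny and p_irat as in the HMDA examples that follow):

. reg deny p_irat, r;
. display _b[_cons] + _b[p_irat]*.3;    // predicted Pr(deny=1) at P/I ratio = .3
. display _b[_cons] + _b[p_irat]*.4;    // predicted Pr(deny=1) at P/I ratio = .4
. display _b[p_irat]*.1;                // the increase in the denial probability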

9-7
Next include black as a regressor:
deny = -.091 + .559P/I ratio + .177black
(.032) (.098) (.025)

Predicted probability of denial:


• for a black applicant with P/I ratio = .3:
  Pr(deny = 1) = –.091 + .559 × .3 + .177 × 1 = .254
• for a white applicant with P/I ratio = .3:
  Pr(deny = 1) = –.091 + .559 × .3 + .177 × 0 = .077
• difference = .177 = 17.7 percentage points
 Coefficient on black is significant at the 5% level
 Still plenty of room for omitted variable bias…

9-8
The linear probability model: Summary

 Models probability as a linear function of X


 Advantages:
o simple to estimate and to interpret
o inference is the same as for multiple regression
(need heteroskedasticity-robust standard errors)
 Disadvantages:
o Does it make sense that the probability should be
linear in X?
o Predicted probabilities can be <0 or >1!
 These disadvantages can be solved by using a nonlinear
probability model: probit and logit regression
9-9
Probit and Logit Regression
(SW Section 9.2)

The problem with the linear probability model is that it


models the probability of Y=1 as being linear:

Pr(Y = 1|X) = β0 + β1X

Instead, we want:
• 0 ≤ Pr(Y = 1|X) ≤ 1 for all X
• Pr(Y = 1|X) to be increasing in X (for β1 > 0)
This requires a nonlinear functional form for the
probability. How about an “S-curve”…
9-10
The probit model satisfies these conditions:
• 0 ≤ Pr(Y = 1|X) ≤ 1 for all X
• Pr(Y = 1|X) to be increasing in X (for β1 > 0)
9-11
Probit regression models the probability that Y=1 using
the cumulative standard normal distribution function,
evaluated at z = β0 + β1X:
Pr(Y = 1|X) = Φ(β0 + β1X)
• Φ is the cumulative normal distribution function.
• z = β0 + β1X is the "z-value" or "z-index" of the
probit model.

Example: Suppose β0 = –2, β1 = 3, X = .4, so


Pr(Y = 1|X = .4) = Φ(–2 + 3 × .4) = Φ(–0.8)
Pr(Y = 1|X=.4) = area under the standard normal density
to left of z = -.8, which is…

9-12
Pr(Z ≤ -0.8) = .2119
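This probability can be computed directly in STATA with the cumulative standard normal function:

. display normal(-0.8);    // = .2119 (normprob(-0.8) gives the same answer)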
9-13
Probit regression, ctd.

Why use the cumulative normal probability distribution?


 The “S-shape” gives us what we want:
o 0 ≤ Pr(Y = 1|X) ≤ 1 for all X
o Pr(Y = 1|X) to be increasing in X (for β1 > 0)
 Easy to use – the probabilities are tabulated in the
cumulative normal tables
 Relatively straightforward interpretation:
o z-value = β0 + β1X
o β̂0 + β̂1X is the predicted z-value, given X
o β1 is the change in the z-value for a unit change
in X
9-14
STATA Example: HMDA data
. probit deny p_irat, r;

Iteration 0: log likelihood = -872.0853 We’ll discuss this later


Iteration 1: log likelihood = -835.6633
Iteration 2: log likelihood = -831.80534
Iteration 3: log likelihood = -831.79234

Probit estimates Number of obs = 2380


Wald chi2(1) = 40.68
Prob > chi2 = 0.0000
Log likelihood = -831.79234 Pseudo R2 = 0.0462

------------------------------------------------------------------------------
| Robust
deny | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
p_irat | 2.967908 .4653114 6.38 0.000 2.055914 3.879901
_cons | -2.194159 .1649721 -13.30 0.000 -2.517499 -1.87082
------------------------------------------------------------------------------

Pr(deny = 1 | P/I ratio) = Φ(–2.19 + 2.97 × P/I ratio)


(.16) (.47)

9-15
STATA Example: HMDA data, ctd.
Pr(deny = 1 | P/I ratio) = Φ(–2.19 + 2.97 × P/I ratio)
                              (.16)    (.47)
• Positive coefficient: does this make sense?
• Standard errors have the usual interpretation
• Predicted probabilities:
  Pr(deny = 1 | P/I ratio = .3) = Φ(–2.19 + 2.97 × .3)
                                = Φ(–1.30) = .097
• Effect of a change in P/I ratio from .3 to .4:
  Pr(deny = 1 | P/I ratio = .4) = Φ(–2.19 + 2.97 × .4) = .159
  Predicted probability of denial rises from .097 to .159

9-16
Probit regression with multiple regressors

Pr(Y = 1|X1, X2) = Φ(β0 + β1X1 + β2X2)

• Φ is the cumulative normal distribution function.


• z = β0 + β1X1 + β2X2 is the "z-value" or "z-index" of the
  probit model.
• β1 is the effect on the z-value of a unit change in X1,
  holding X2 constant

9-17
STATA Example: HMDA data
. probit deny p_irat black, r;

Iteration 0: log likelihood = -872.0853


Iteration 1: log likelihood = -800.88504
Iteration 2: log likelihood = -797.1478
Iteration 3: log likelihood = -797.13604

Probit estimates Number of obs = 2380


Wald chi2(2) = 118.18
Prob > chi2 = 0.0000
Log likelihood = -797.13604 Pseudo R2 = 0.0859

------------------------------------------------------------------------------
| Robust
deny | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
p_irat | 2.741637 .4441633 6.17 0.000 1.871092 3.612181
black | .7081579 .0831877 8.51 0.000 .545113 .8712028
_cons | -2.258738 .1588168 -14.22 0.000 -2.570013 -1.947463
------------------------------------------------------------------------------

We’ll go through the estimation details later…

9-18
STATA Example: predicted probit probabilities
. probit deny p_irat black, r;

Probit estimates Number of obs = 2380


Wald chi2(2) = 118.18
Prob > chi2 = 0.0000
Log likelihood = -797.13604 Pseudo R2 = 0.0859

------------------------------------------------------------------------------
| Robust
deny | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
p_irat | 2.741637 .4441633 6.17 0.000 1.871092 3.612181
black | .7081579 .0831877 8.51 0.000 .545113 .8712028
_cons | -2.258738 .1588168 -14.22 0.000 -2.570013 -1.947463
------------------------------------------------------------------------------

. sca z1 = _b[_cons]+_b[p_irat]*.3+_b[black]*0;

. display "Pred prob, p_irat=.3, white: "normprob(z1);

Pred prob, p_irat=.3, white: .07546603

NOTE
_b[_cons] is the estimated intercept (-2.258738)
_b[p_irat] is the coefficient on p_irat (2.741637)
sca creates a new scalar which is the result of a calculation
display prints the indicated information to the screen
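In newer versions of STATA the same predicted probabilities can be obtained with the margins command instead of hand calculation – a sketch:

. probit deny p_irat black, r;
. margins, at(p_irat=.3 black=0) at(p_irat=.3 black=1);   // predicted Pr(deny=1), white and black applicants at P/I = .3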
9-19
STATA Example: HMDA data, ctd.
Pr(deny = 1 | P/I ratio, black)
   = Φ(–2.26 + 2.74 × P/I ratio + .71 × black)
       (.16)   (.44)              (.08)
• Is the coefficient on black statistically significant?
• Estimated effect of race for P/I ratio = .3:
  Pr(deny = 1 | .3, 1) = Φ(–2.26 + 2.74 × .3 + .71 × 1) = .233
  Pr(deny = 1 | .3, 0) = Φ(–2.26 + 2.74 × .3 + .71 × 0) = .075
• Difference in rejection probabilities = .158 (15.8
  percentage points)
• Still plenty of room for omitted variable bias…

9-20
Logit regression

Logit regression models the probability of Y=1 as the


cumulative standard logistic distribution function,
evaluated at z = β0 + β1X:

Pr(Y = 1|X) = F(β0 + β1X)

F is the cumulative logistic distribution function:

F(β0 + β1X) = 1 / [1 + e^–(β0 + β1X)]

9-21
Logistic regression, ctd.

Pr(Y = 1|X) = F(β0 + β1X)

where F(β0 + β1X) = 1 / [1 + e^–(β0 + β1X)].

Example: β0 = –3, β1 = 2, X = .4,


so β0 + β1X = –3 + 2 × .4 = –2.2, so
Pr(Y = 1|X = .4) = 1/(1 + e^–(–2.2)) = 1/(1 + e^2.2) = .0998
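Checking this in STATA:

. display 1/(1 + exp(2.2));    // = .0998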
Why bother with logit if we have probit?
 Historically, numerically convenient
 In practice, very similar to probit
9-22
STATA Example: HMDA data
. logit deny p_irat black, r;

Iteration 0: log likelihood = -872.0853 Later…


Iteration 1: log likelihood = -806.3571
Iteration 2: log likelihood = -795.74477
Iteration 3: log likelihood = -795.69521
Iteration 4: log likelihood = -795.69521

Logit estimates Number of obs = 2380


Wald chi2(2) = 117.75
Prob > chi2 = 0.0000
Log likelihood = -795.69521 Pseudo R2 = 0.0876

------------------------------------------------------------------------------
| Robust
deny | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
p_irat | 5.370362 .9633435 5.57 0.000 3.482244 7.258481
black | 1.272782 .1460986 8.71 0.000 .9864339 1.55913
_cons | -4.125558 .345825 -11.93 0.000 -4.803362 -3.447753
------------------------------------------------------------------------------

. dis "Pred prob, p_irat=.3, white: "


> 1/(1+exp(-(_b[_cons]+_b[p_irat]*.3+_b[black]*0)));

Pred prob, p_irat=.3, white: .07485143


NOTE: the probit predicted probability is .07546603
9-23
Predicted probabilities from estimated probit and logit
models usually are very close.

9-24
Estimation and Inference in Probit (and Logit)
Models (SW Section 9.3)

Probit model:
Pr(Y = 1|X) = Φ(β0 + β1X)

• Estimation and inference


o How to estimate β0 and β1?
o What is the sampling distribution of the estimators?
o Why can we use the usual methods of inference?
 First discuss nonlinear least squares (easier to explain)
 Then discuss maximum likelihood estimation (what is
actually done in practice)
9-25
Probit estimation by nonlinear least squares
Recall OLS:
min over b0,b1 of  Σi=1..n [Yi – (b0 + b1Xi)]²

• The result is the OLS estimators β̂0 and β̂1

In probit, we have a different regression function – the


nonlinear probit model. So, we could estimate β0 and β1
by nonlinear least squares:

min over b0,b1 of  Σi=1..n [Yi – Φ(b0 + b1Xi)]²

Solving this yields the nonlinear least squares estimator


of the probit coefficients.
9-26
Nonlinear least squares, ctd.

min over b0,b1 of  Σi=1..n [Yi – Φ(b0 + b1Xi)]²

How to solve this minimization problem?


• Calculus doesn't give an explicit solution.
 Must be solved numerically using the computer, e.g.
by “trial and error” method of trying one set of values
for (b0,b1), then trying another, and another,…
 Better idea: use specialized minimization algorithms
In practice, nonlinear least squares isn’t used because it
isn’t efficient – an estimator with a smaller variance is…
9-27
Probit estimation by maximum likelihood

The likelihood function is the conditional density of


Y1,…,Yn given X1,…,Xn, treated as a function of the
unknown parameters β0 and β1.
• The maximum likelihood estimator (MLE) is the value
  of (β0, β1) that maximizes the likelihood function.
• The MLE is the value of (β0, β1) that best describes the
  full distribution of the data.
 In large samples, the MLE is:
o consistent
o normally distributed
o efficient (has the smallest variance of all estimators)
9-28
Special case: the probit MLE with no X

Y = 1 with probability p, = 0 with probability 1 – p    (Bernoulli distribution)

Data: Y1,…,Yn, i.i.d.

Derivation of the likelihood starts with the density of Y1:

Pr(Y1 = 1) = p and Pr(Y1 = 0) = 1–p


so
Pr(Y1 = y1) = p^y1 (1 – p)^(1–y1)    (verify this for y1 = 0, 1!)

9-29
Joint density of (Y1,Y2):
Because Y1 and Y2 are independent,

Pr(Y1 = y1, Y2 = y2) = Pr(Y1 = y1) × Pr(Y2 = y2)

   = [p^y1 (1 – p)^(1–y1)] × [p^y2 (1 – p)^(1–y2)]

Joint density of (Y1,…,Yn):

Pr(Y1 = y1, Y2 = y2, …, Yn = yn)

   = [p^y1 (1 – p)^(1–y1)] × [p^y2 (1 – p)^(1–y2)] × … × [p^yn (1 – p)^(1–yn)]

   = p^(Σi yi) × (1 – p)^(n – Σi yi)

The likelihood is the joint density, treated as a function of


the unknown parameters, which here is p:
9-30
f(p;Y1,…,Yn) = p^(Σi Yi) × (1 – p)^(n – Σi Yi)

The MLE maximizes the likelihood. It's standard to work

with the log likelihood, ln[f(p;Y1,…,Yn)]:

ln[f(p;Y1,…,Yn)] = (Σi Yi) ln(p) + (n – Σi Yi) ln(1 – p)

d ln f(p;Y1,…,Yn)/dp = (Σi Yi)(1/p) – (n – Σi Yi)[1/(1 – p)] = 0

 

Solving for p yields the MLE; that is, pˆ MLE satisfies,

9-31
(Σi Yi)(1/p̂MLE) – (n – Σi Yi)[1/(1 – p̂MLE)] = 0

or

(Σi Yi)(1/p̂MLE) = (n – Σi Yi)[1/(1 – p̂MLE)]

or

Ȳ/(1 – Ȳ) = p̂MLE/(1 – p̂MLE)

or

p̂MLE = Ȳ = fraction of 1's

9-32
The MLE in the “no-X” case (Bernoulli distribution):
p̂MLE = Ȳ = fraction of 1's
• For Yi i.i.d. Bernoulli, the MLE is the "natural"
  estimator of p, the fraction of 1's, which is Ȳ
• We already know the essentials of inference:
  o In large n, the sampling distribution of p̂MLE = Ȳ is
    normally distributed
  o Thus inference is "as usual:" hypothesis testing via the
    t-statistic, confidence interval as estimate ± 1.96SE
• STATA note: to emphasize the requirement of large n, the
  printout calls the t-statistic the z-statistic; instead of the
  F-statistic, the chi-squared statistic (= q × F).
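A quick STATA check of this "no-X" result (a sketch, using the HMDA deny variable): a probit with no regressors estimates only a constant, and Φ of that constant reproduces the sample fraction of 1's:

. probit deny;                   // constant-only probit
. display normal(_b[_cons]);     // = the fraction of 1's …
. summarize deny;                // … i.e., the sample mean of deny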

9-33
The probit likelihood with one X
The derivation starts with the density of Y1, given X1:
Pr(Y1 = 1|X1) = Φ(β0 + β1X1)
Pr(Y1 = 0|X1) = 1 – Φ(β0 + β1X1)
so
Pr(Y1 = y1|X1) = Φ(β0 + β1X1)^y1 × [1 – Φ(β0 + β1X1)]^(1–y1)

The probit likelihood function is the joint density of


Y1,…,Yn given X1,…,Xn, treated as a function of β0, β1:
f(β0,β1; Y1,…,Yn|X1,…,Xn)
   = {Φ(β0 + β1X1)^Y1 × [1 – Φ(β0 + β1X1)]^(1–Y1)}
     × … × {Φ(β0 + β1Xn)^Yn × [1 – Φ(β0 + β1Xn)]^(1–Yn)}

The probit likelihood function:


9-34
f(β0,β1; Y1,…,Yn|X1,…,Xn)
   = {Φ(β0 + β1X1)^Y1 × [1 – Φ(β0 + β1X1)]^(1–Y1)}
     × … × {Φ(β0 + β1Xn)^Yn × [1 – Φ(β0 + β1Xn)]^(1–Yn)}

 Can’t solve for the maximum explicitly


 Must maximize using numerical methods
 As in the case of no X, in large samples:
o β̂0MLE, β̂1MLE are consistent
o β̂0MLE, β̂1MLE are normally distributed (more later…)
o Their standard errors can be computed
o Testing, confidence intervals proceeds as usual
 For multiple X’s, see SW App. 9.2
The logit likelihood with one X

9-35
 The only difference between probit and logit is the
functional form used for the probability: Φ is
replaced by the cumulative logistic function.
 Otherwise, the likelihood is similar; for details see
SW App. 9.2
 As with probit,
o β̂0MLE, β̂1MLE are consistent
o β̂0MLE, β̂1MLE are normally distributed
o Their standard errors can be computed
o Testing, confidence intervals proceeds as usual

9-36
Measures of fit
The R² and adjusted R² don't make sense here (why?). So, two
other specialized measures are used:

1. The fraction correctly predicted = fraction of Y’s for


which predicted probability is >50% (if Yi=1) or is
<50% (if Yi=0).

2. The pseudo-R² measures fit using the likelihood


function: measures the improvement in the value of
the log likelihood, relative to having no X’s (see SW
App. 9.2). This simplifies to the R2 in the linear
model with normally distributed errors.
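A sketch of both measures in STATA after a probit (predict gives the fitted probabilities; the pseudo-R² is reported by probit and stored in e(r2_p)):

. probit deny p_irat black, r;
. predict phat;                                         // predicted Pr(deny=1)
. gen correct = (phat>=.5 & deny==1) | (phat<.5 & deny==0);
. summarize correct;                                    // mean = fraction correctly predicted
. display e(r2_p);                                      // pseudo R-squared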
9-37
Large-n distribution of the MLE (not in SW)
• This is the foundation of mathematical statistics.
• We'll do this for the "no-X" special case, for which p is
  the only unknown parameter. Here are the steps:
  1. Derive the log likelihood, L(p) (done).
  2. The MLE is found by setting its derivative to zero;
     that requires solving a nonlinear equation.
  3. For large n, p̂MLE will be near the true p (ptrue), so this
     nonlinear equation can be approximated (locally) by
     a linear equation (Taylor series around ptrue).
  4. This can be solved for p̂MLE – ptrue.
  5. By the Law of Large Numbers and the CLT, for n
     large, √n(p̂MLE – ptrue) is normally distributed.
9-38
1. Derive the log likelihood
Recall: the density for observation #1 is:
Pr(Y1 = y1) = p^y1 (1 – p)^(1–y1)    (density)

so
f(p;Y1) = p^Y1 (1 – p)^(1–Y1)    (likelihood)

The likelihood for Y1,…,Yn is
f(p;Y1,…,Yn) = f(p;Y1) × … × f(p;Yn)
so the log likelihood is
L(p) = ln f(p;Y1,…,Yn)
     = ln[f(p;Y1) × … × f(p;Yn)]
     = Σi=1..n ln f(p;Yi)

2. Set the derivative of L(p) to zero to define the MLE:


9-39
dL(p)/dp |p=p̂MLE = Σi=1..n ∂ln f(p;Yi)/∂p |p=p̂MLE = 0

3. Use a Taylor series expansion around ptrue to


approximate this as a linear function of p̂MLE:

0 = dL(p)/dp |p̂MLE ≈ dL(p)/dp |ptrue + [d²L(p)/dp² |ptrue] × (p̂MLE – ptrue)

9-40
4. Solve this linear approximation for (p̂MLE – ptrue):

dL(p)/dp |ptrue + [d²L(p)/dp² |ptrue] × (p̂MLE – ptrue) ≈ 0

so

[d²L(p)/dp² |ptrue] × (p̂MLE – ptrue) ≈ – dL(p)/dp |ptrue

or

(p̂MLE – ptrue) ≈ – [d²L(p)/dp² |ptrue]^(–1) × dL(p)/dp |ptrue

9-41
5. Substitute things in and apply the LLN and CLT.
L(p) = Σi=1..n ln f(p;Yi)

dL(p)/dp |ptrue = Σi=1..n ∂ln f(p;Yi)/∂p |ptrue

d²L(p)/dp² |ptrue = Σi=1..n ∂²ln f(p;Yi)/∂p² |ptrue

so

(p̂MLE – ptrue) ≈ – [d²L(p)/dp² |ptrue]^(–1) × dL(p)/dp |ptrue

   = – [ Σi ∂²ln f(p;Yi)/∂p² |ptrue ]^(–1) × [ Σi ∂ln f(p;Yi)/∂p |ptrue ]
9-42
Multiply through by √n:
√n(p̂MLE – ptrue)
   ≈ – [ (1/n) Σi ∂²ln f(p;Yi)/∂p² |ptrue ]^(–1) × [ (1/√n) Σi ∂ln f(p;Yi)/∂p |ptrue ]

Because Yi is i.i.d., the ith terms in the summands are also
i.i.d. Thus, if these terms have enough (2) moments, then
under general conditions (not just the Bernoulli likelihood):

(1/n) Σi ∂²ln f(p;Yi)/∂p² |ptrue  →p  a  (a constant)    (WLLN)

(1/√n) Σi ∂ln f(p;Yi)/∂p |ptrue  →d  N(0, σ²lnf)    (CLT)  (Why?)

Putting this together,
9-43
√n(p̂MLE – ptrue)
   ≈ – [ (1/n) Σi ∂²ln f(p;Yi)/∂p² |ptrue ]^(–1) × [ (1/√n) Σi ∂ln f(p;Yi)/∂p |ptrue ]

where

(1/n) Σi ∂²ln f(p;Yi)/∂p² |ptrue  →p  a  (a constant)    (WLLN)

(1/√n) Σi ∂ln f(p;Yi)/∂p |ptrue  →d  N(0, σ²lnf)    (CLT)  (Why?)

so

√n(p̂MLE – ptrue)  →d  N(0, σ²lnf /a²)    (large-n normal)
Work out the details for probit/no X (Bernoulli) case:
9-44
Recall:
f(p;Yi) = p^Yi (1 – p)^(1–Yi)

so
ln f(p;Yi) = Yi ln(p) + (1 – Yi) ln(1 – p)
and
∂ln f(p;Yi)/∂p = Yi/p – (1 – Yi)/(1 – p) = (Yi – p)/[p(1 – p)]
and
∂²ln f(p;Yi)/∂p² = –Yi/p² – (1 – Yi)/(1 – p)² = –[Yi/p² + (1 – Yi)/(1 – p)²]

9-45
Denominator term first:

∂²ln f(p;Yi)/∂p² = –[Yi/p² + (1 – Yi)/(1 – p)²]

so

(1/n) Σi ∂²ln f(p;Yi)/∂p² |ptrue = –(1/n) Σi [Yi/p² + (1 – Yi)/(1 – p)²]

   = –[Ȳ/p² + (1 – Ȳ)/(1 – p)²]

   →p  –[p/p² + (1 – p)/(1 – p)²]    (LLN)

   = –[1/p + 1/(1 – p)] = –1/[p(1 – p)]
9-46
Next the numerator:

∂ln f(p;Yi)/∂p = (Yi – p)/[p(1 – p)]
so

(1/√n) Σi ∂ln f(p;Yi)/∂p |ptrue = (1/√n) Σi (Yi – p)/[p(1 – p)]

   = [1/(p(1 – p))] × (1/√n) Σi (Yi – p)

   →d  N(0, σ²Y /[p(1 – p)]²)

9-47
Put these pieces together:
√n(p̂MLE – ptrue)
   ≈ – [ (1/n) Σi ∂²ln f(p;Yi)/∂p² |ptrue ]^(–1) × [ (1/√n) Σi ∂ln f(p;Yi)/∂p |ptrue ]

where

(1/n) Σi ∂²ln f(p;Yi)/∂p² |ptrue  →p  –1/[p(1 – p)]

(1/√n) Σi ∂ln f(p;Yi)/∂p |ptrue  →d  N(0, σ²Y /[p(1 – p)]²)

Thus

√n(p̂MLE – ptrue)  →d  N(0, σ²Y)

9-48
Summary: probit MLE, no-X case

The MLE: p̂MLE = Ȳ

Working through the full MLE distribution theory gave:

√n(p̂MLE – ptrue)  →d  N(0, σ²Y)

But because ptrue = Pr(Y = 1) = E(Y) = μY, this is:


√n(Ȳ – μY)  →d  N(0, σ²Y)

A familiar result from the first week of class!


9-49
The MLE derivation applies generally

√n(p̂MLE – ptrue)  →d  N(0, σ²lnf /a²)

• Standard errors are obtained by working out

  expressions for σ²lnf /a²
• Extends to more than one parameter (β0, β1) via matrix calculus
• Because the distribution is normal for large n, inference
  is conducted as usual; for example, the 95% confidence
  interval is MLE ± 1.96SE.
• The expression above uses "robust" standard errors;
  further simplifications yield non-robust standard errors,
  which apply if ∂ln f(p;Yi)/∂p is homoskedastic.
9-50
Summary: distribution of the MLE
(Why did I do this to you?)
 The MLE is normally distributed for large n
 We worked through this result in detail for the probit
model with no X’s (the Bernoulli distribution)
 For large n, confidence intervals and hypothesis testing
proceeds as usual
 If the model is correctly specified, the MLE is efficient,
that is, it has a smaller large-n variance than all other
estimators (we didn’t show this).
 These methods extend to other models with discrete
dependent variables, for example count data (#
crimes/day) – see SW App. 9.2.
9-51
Application to the Boston HMDA Data
(SW Section 9.4)

 Mortgages (home loans) are an essential part of


buying a home.
 Is there differential access to home loans by race?
 If two otherwise identical individuals, one white and
one black, applied for a home loan, is there a
difference in the probability of denial?

9-52
The HMDA Data Set

 Data on individual characteristics, property


characteristics, and loan denial/acceptance
 The mortgage application process circa 1990-1991:
o Go to a bank or mortgage company
o Fill out an application (personal+financial info)
o Meet with the loan officer
 Then the loan officer decides – by law, in a race-blind
way. Presumably, the bank wants to make profitable
loans, and the loan officer doesn’t want to originate
defaults.

9-53
The loan officer’s decision

 Loan officer uses key financial variables:


o P/I ratio
o housing expense-to-income ratio
o loan-to-value ratio
o personal credit history
 The decision rule is nonlinear:
o loan-to-value ratio > 80%
o loan-to-value ratio > 95% (what happens in default?)
o credit score

9-54
Regression specifications
Pr(deny=1|black, other X’s) = …
 linear probability model
 probit

Main problem with the regressions so far: potential


omitted variable bias. All these (i) enter the loan officer
decision function, all (ii) are or could be correlated with
race:
 wealth, type of employment
 credit history
 family status
Variables in the HMDA data set…
9-55
9-56
9-57
9-58
9-59
9-60
Summary of Empirical Results

 Coefficients on the financial variables make sense.


 Black is statistically significant in all specifications
 Race-financial variable interactions aren’t significant.
 Including the covariates sharply reduces the effect of
race on denial probability.
 LPM, probit, logit: similar estimates of effect of race
on the probability of denial.
 Estimated effects are large in a “real world” sense.

9-61
Remaining threats to internal, external validity

 Internal validity
1. omitted variable bias
 what else is learned in the in-person interviews?
2. functional form misspecification (no…)
3. measurement error (originally, yes; now, no…)
4. selection
 random sample of loan applications
 define population to be loan applicants
5. simultaneous causality (no)
 External validity
This is for Boston in 1990-91. What about today?
9-62
Summary
(SW Section 9.5)
 If Yi is binary, then E(Y| X) = Pr(Y=1|X)
 Three models:
o linear probability model (linear multiple regression)
o probit (cumulative standard normal distribution)
o logit (cumulative standard logistic distribution)
 LPM, probit, logit all produce predicted probabilities
 Effect of X is change in conditional probability that
Y=1. For logit and probit, this depends on the initial X
 Probit and logit are estimated via maximum likelihood
o Coefficients are normally distributed for large n
o Large-n hypothesis testing, conf. intervals is as usual
9-63
Instrumental Variables Regression
(SW Ch. 10)

Three important threats to internal validity are:

 omitted variable bias from a variable that is correlated


with X but is unobserved, so cannot be included in the
regression;
 simultaneous causality bias (X causes Y, Y causes X);
 errors-in-variables bias (X is measured with error)

Instrumental variables regression can eliminate bias from


these three sources.
10-1
The IV Estimator with a Single Regressor and a
Single Instrument (SW Section 10.1)

Yi = β0 + β1Xi + ui

 Loosely, IV regression breaks X into two parts: a part


that might be correlated with u, and a part that is not.
By isolating the part that is not correlated with u, it is
possible to estimate β1.
 This is done using an instrumental variable, Zi, which
is uncorrelated with ui.
• The instrumental variable detects movements in Xi that
  are uncorrelated with ui, and uses these to estimate β1.
10-2
Terminology: endogeneity and exogeneity

An endogenous variable is one that is correlated with u


An exogenous variable is one that is uncorrelated with u

Historical note: “Endogenous” literally means


“determined within the system,” that is, a variable that
is jointly determined with Y, that is, a variable subject
to simultaneous causality. However, this definition is
narrow and IV regression can be used to address OV
bias and errors-in-variable bias, not just to
simultaneous causality bias.

10-3
Two conditions for a valid instrument

Yi = β0 + β1Xi + ui

For an instrumental variable (an “instrument”) Z to be


valid, it must satisfy two conditions:
1. Instrument relevance: corr(Zi,Xi) ≠ 0
2. Instrument exogeneity: corr(Zi,ui) = 0

Suppose for now that you have such a Zi (we’ll discuss


how to find instrumental variables later). How can you
use Zi to estimate β1?

10-4
The IV Estimator, one X and one Z
Explanation #1: Two Stage Least Squares (TSLS)
As it sounds, TSLS has two stages – two regressions:
(1) First isolates the part of X that is uncorrelated with u:
regress X on Z using OLS

Xi = π0 + π1Zi + vi    (1)

• Because Zi is uncorrelated with ui, π0 + π1Zi is


  uncorrelated with ui. We don't know π0 or π1 but we
  have estimated them, so…
• Compute the predicted values of Xi, X̂i, where X̂i =
  π̂0 + π̂1Zi, i = 1,…,n.
10-5
(2) Replace Xi by X̂i in the regression of interest:
regress Y on X̂i using OLS:

Yi = β0 + β1X̂i + ui    (2)

• Because X̂i is uncorrelated with ui in large samples, the


  first least squares assumption holds
• Thus β1 can be estimated by OLS using regression (2)
• This argument relies on large samples (so π0 and π1 are
  well estimated using regression (1))
• The resulting estimator is called the "Two Stage
  Least Squares" (TSLS) estimator, β̂1TSLS.

10-6
Two Stage Least Squares, ctd.

Suppose you have a valid instrument, Zi.

Stage 1:
Regress Xi on Zi, obtain the predicted values X̂i

Stage 2:
Regress Yi on X̂i; the coefficient on X̂i is the TSLS
estimator, β̂1TSLS.

Then β̂1TSLS is a consistent estimator of β1.

10-7
The IV Estimator, one X and one Z, ctd.
Explanation #2: (only) a little algebra
Yi = β0 + β1Xi + ui
Thus,
cov(Yi,Zi) = cov(β0 + β1Xi + ui, Zi)
           = cov(β0,Zi) + cov(β1Xi,Zi) + cov(ui,Zi)
           = 0 + β1cov(Xi,Zi) + 0
           = β1cov(Xi,Zi)

where cov(ui,Zi) = 0 (instrument exogeneity); thus

β1 = cov(Yi,Zi) / cov(Xi,Zi)
10-8
The IV Estimator, one X and one Z, ctd

β1 = cov(Yi,Zi) / cov(Xi,Zi)

The IV estimator replaces these population covariances


with sample covariances:

β̂1TSLS = sYZ / sXZ ,

sYZ and sXZ are the sample covariances.


This is the TSLS estimator – just a different derivation.
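A sketch of this covariance-ratio formula in STATA, with hypothetical variable names y, x, and z (correlate with the covariance option stores the sample covariance matrix in r(C)):

. correlate y z, covariance;
. matrix YZ = r(C);
. correlate x z, covariance;
. matrix XZ = r(C);
. display YZ[1,2]/XZ[1,2];     // sYZ/sXZ, the TSLS estimate of beta1
. ivreg y (x = z), r;          // should reproduce the same slope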
10-9
Consistency of the TSLS estimator

β̂1TSLS = sYZ / sXZ

The sample covariances are consistent: sYZ →p cov(Y,Z)
and sXZ →p cov(X,Z). Thus,

β̂1TSLS = sYZ / sXZ  →p  cov(Y,Z) / cov(X,Z) = β1

• The instrument relevance condition, cov(X,Z) ≠ 0,


  ensures that you don't divide by zero.
Example #1: Supply and demand for butter
10-10
IV regression was originally developed to estimate
demand elasticities for agricultural goods, for example
butter:
ln(Qi^butter) = β0 + β1 ln(Pi^butter) + ui

• β1 = price elasticity of butter = percent change in


  quantity for a 1% change in price (recall the log-log
  specification discussion)
• Data: observations on price and quantity of butter for
  different years
• The OLS regression of ln(Qi^butter) on ln(Pi^butter) suffers
  from simultaneous causality bias (why?)

10-11
Simultaneous causality bias in the OLS regression of
ln(Qi^butter) on ln(Pi^butter) arises because price and quantity
are determined by the interaction of demand and supply

10-12
This interaction of demand and supply produces…

Would a regression using these data produce the demand


curve?

10-13
What would you get if only supply shifted?

 TSLS estimates the demand curve by isolating shifts


in price and quantity that arise from shifts in supply.
 Z is a variable that shifts supply but not demand.
10-14
TSLS in the supply-demand example:

ln(Qi^butter) = β0 + β1 ln(Pi^butter) + ui

Let Z = rainfall in dairy-producing regions.


Is Z a valid instrument?
(1) Exogenous? corr(raini,ui) = 0?
Plausibly: whether it rains in dairy-producing
regions shouldn’t affect demand
(2) Relevant? corr(raini, ln(Pi^butter)) ≠ 0?
Plausibly: insufficient rainfall means less grazing
means less butter

10-15
TSLS in the supply-demand example, ctd.
ln(Qi^butter) = β0 + β1 ln(Pi^butter) + ui

Zi = raini = rainfall in dairy-producing regions.

Stage 1: regress ln(Pi^butter) on raini and save the predicted values


   The predicted values isolate the changes in log price that
   arise from supply (part of supply, at least)

Stage 2: regress ln(Qi^butter) on the stage-1 predicted values


   The regression counterpart of using shifts in the
   supply curve to trace out the demand curve.
10-16
Example #2: Test scores and class size

 The California regressions still could have OV bias


(e.g. parental involvement).
 This bias could be eliminated by using IV regression
(TSLS).
 IV regression requires a valid instrument, that is, an
instrument that is:
(1) relevant: corr(Zi,STRi) ≠ 0
(2) exogenous: corr(Zi,ui) = 0

10-17
Example #2: Test scores and class size, ctd.
Here is a (hypothetical) instrument:
 some districts, randomly hit by an earthquake, “double
up” classrooms:
Zi = Quakei = 1 if hit by quake, = 0 otherwise
 Do the two conditions for a valid instrument hold?
 The earthquake makes it as if the districts were in a
random assignment experiment. Thus the variation in
STR arising from the earthquake is exogenous.
 The first stage of TSLS regresses STR against Quake,
thereby isolating the part of STR that is exogenous (the
part that is “as if” randomly assigned)
We’ll go through other examples later…
10-18
Inference using TSLS
 In large samples, the sampling distribution of the TSLS
estimator is normal
 Inference (hypothesis tests, confidence intervals)
proceeds in the usual way, e.g. estimate ± 1.96SE
 The idea behind the large-sample normal distribution of
the TSLS estimator is that – like all the other estimators
we have considered – it involves an average of mean
zero i.i.d. random variables, to which we can apply the
CLT.
 Here is a sketch of the math (see SW App. 10.3 for the
details)...

10-19
β̂1TSLS = sYZ / sXZ
       = [ (1/(n–1)) Σi (Yi – Ȳ)(Zi – Z̄) ] / [ (1/(n–1)) Σi (Xi – X̄)(Zi – Z̄) ]

Now substitute in Yi = β0 + β1Xi + ui and simplify:
First,
Yi – Ȳ = β1(Xi – X̄) + (ui – ū)
so
(1/(n–1)) Σi (Yi – Ȳ)(Zi – Z̄)
   = β1 (1/(n–1)) Σi (Xi – X̄)(Zi – Z̄) + (1/(n–1)) Σi (ui – ū)(Zi – Z̄).

10-20
Thus

β̂1TSLS = [ (1/(n–1)) Σi (Yi – Ȳ)(Zi – Z̄) ] / [ (1/(n–1)) Σi (Xi – X̄)(Zi – Z̄) ]

   = [ β1 (1/(n–1)) Σi (Xi – X̄)(Zi – Z̄) + (1/(n–1)) Σi (ui – ū)(Zi – Z̄) ]
     / [ (1/(n–1)) Σi (Xi – X̄)(Zi – Z̄) ]

   = β1 + [ (1/(n–1)) Σi (ui – ū)(Zi – Z̄) ] / [ (1/(n–1)) Σi (Xi – X̄)(Zi – Z̄) ].

Subtract β1 from each side and you get,


10-21
β̂1TSLS – β1 = [ (1/(n–1)) Σi (ui – ū)(Zi – Z̄) ] / [ (1/(n–1)) Σi (Xi – X̄)(Zi – Z̄) ]

Multiplying through by √n and using the approximation

n – 1 ≈ n yields:

√n(β̂1TSLS – β1) ≈ [ (1/√n) Σi (ui – ū)(Zi – Z̄) ] / [ (1/n) Σi (Xi – X̄)(Zi – Z̄) ]

10-22
√n(β̂1TSLS – β1) ≈ [ (1/√n) Σi (ui – ū)(Zi – Z̄) ] / [ (1/n) Σi (Xi – X̄)(Zi – Z̄) ]

• First consider the numerator: in large samples,


  (1/√n) Σi (ui – ū)(Zi – Z̄) is distributed N(0, var[(Z – μZ)u])

• Next consider the denominator:


  (1/n) Σi (Xi – X̄)(Zi – Z̄)  →p  cov(X,Z) by the LLN

where cov(X,Z) ≠ 0 because the instrument is relevant


(by assumption) (What if it isn't relevant? More later.)

10-23
Put this together:

√n(β̂1TSLS – β1) ≈ [ (1/√n) Σi (ui – ū)(Zi – Z̄) ] / [ (1/n) Σi (Xi – X̄)(Zi – Z̄) ]

(1/n) Σi (Xi – X̄)(Zi – Z̄)  →p  cov(X,Z)

(1/√n) Σi (ui – ū)(Zi – Z̄) is distributed N(0, var[(Z – μZ)u])

So finally:
β̂1TSLS is approx. distributed N(β1, σ²(β̂1TSLS)),

where σ²(β̂1TSLS) = (1/n) × var[(Zi – μZ)ui] / [cov(Zi,Xi)]².

10-24
Inference using TSLS, ctd.
β̂1TSLS is approx. distributed N(β1, σ²(β̂1TSLS)),

 Statistical inference proceeds in the usual way.


 The justification is (as usual) based on large samples
 This all assumes that the instruments are valid – we’ll
discuss what happens if they aren’t valid shortly.
 Important note on standard errors:
o The OLS standard errors from the second stage
regression aren’t right – they don’t take into account
the estimation in the first stage ( Xˆ i is estimated).
o Instead, use a single specialized command that
computes the TSLS estimator and the correct SEs.
o as usual, use heteroskedasticity-robust SEs
10-25
A complete digression:
The early history of IV regression

 How much money would be raised by an import tariff


on animal and vegetable oils (butter, flaxseed oil, soy
oil, etc.)?
 To do this calculation you need to know the
elasticities of supply and demand, both domestic and
foreign
 This problem was first solved in Appendix B of
Wright (1928), “The Tariff on Animal and Vegetable
Oils.”

10-26
Figure 4, p. 296, from Appendix B (1928):

10-27
Who wrote Appendix B of Philip Wright (1928)?

…this appendix is thought to have been written


with or by his son, Sewall Wright, an important
statistician. (SW, p. 334)

Who were these guys and what’s their story?

10-28
Philip Wright (1861-1934) Sewall Wright (1889-1988)
obscure economist and poet famous genetic statistician
MA Harvard, Econ, 1887 ScD Harvard, Biology, 1915
Lecturer, Harvard, 1913-1917 Prof., U. Chicago, 1930-1954
10-29
Example: Demand for Cigarettes

 How much will a hypothetical cigarette tax reduce


cigarette consumption?
 To answer this, we need the elasticity of demand for
cigarettes, that is, β1, in the regression,

ln(Qi^cigarettes) = β0 + β1 ln(Pi^cigarettes) + ui

 Will the OLS estimator plausibly be unbiased?


Why or why not?

10-30
Example: Cigarette demand, ctd.
ln(Qi^cigarettes) = β0 + β1 ln(Pi^cigarettes) + ui

Panel data:
 Annual cigarette consumption and average prices paid
(including tax)
 48 continental US states, 1985-1995
Proposed instrumental variable:
 Zi = general sales tax per pack in the state = SalesTaxi
 Is this a valid instrument?
(1) Relevant? corr(SalesTaxi, ln(Pi^cigarettes)) ≠ 0?
(2) Exogenous? corr(SalesTaxi,ui) = 0?

10-31
For now, use data for 1995 only.

First stage OLS regression:


ln(Pi^cigarettes) = 4.63 + .031SalesTaxi, n = 48

Second stage OLS regression:


ln(Qi^cigarettes) = 9.72 – 1.08ln(Pi^cigarettes), n = 48

Combined regression with correct, heteroskedasticity-


robust standard errors:
ln(Qi^cigarettes) = 9.72 – 1.08ln(Pi^cigarettes), n = 48
                   (1.53)  (0.32)

10-32
STATA Example: Cigarette demand, First stage
Instrument = Z = rtaxso = general sales tax (real $/pack)

X Z
. reg lravgprs rtaxso if year==1995, r;

Regression with robust standard errors Number of obs = 48


F( 1, 46) = 40.39
Prob > F = 0.0000
R-squared = 0.4710
Root MSE = .09394

------------------------------------------------------------------------------
| Robust
lravgprs | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
rtaxso | .0307289 .0048354 6.35 0.000 .0209956 .0404621
_cons | 4.616546 .0289177 159.64 0.000 4.558338 4.674755
------------------------------------------------------------------------------

X-hat
. predict lravphat; Now we have the predicted values from the 1st stage

10-33
Second stage
Y X-hat
. reg lpackpc lravphat if year==1995, r;

Regression with robust standard errors Number of obs = 48


F( 1, 46) = 10.54
Prob > F = 0.0022
R-squared = 0.1525
Root MSE = .22645

------------------------------------------------------------------------------
| Robust
lpackpc | Coef. Std. Err. t P>|t| [95% Conf. Interval]

-------------+----------------------------------------------------------------
lravphat | -1.083586 .3336949 -3.25 0.002 -1.755279 -.4118932
_cons | 9.719875 1.597119 6.09 0.000 6.505042 12.93471
------------------------------------------------------------------------------

 These coefficients are the TSLS estimates


 The standard errors are wrong because they ignore the
fact that the first stage was estimated

10-34
Combined into a single command:
Y X Z
. ivreg lpackpc (lravgprs = rtaxso) if year==1995, r;

IV (2SLS) regression with robust standard errors Number of obs = 48


F( 1, 46) = 11.54
Prob > F = 0.0014
R-squared = 0.4011
Root MSE = .19035

------------------------------------------------------------------------------
| Robust
lpackpc | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
lravgprs | -1.083587 .3189183 -3.40 0.001 -1.725536 -.4416373
_cons | 9.719876 1.528322 6.36 0.000 6.643525 12.79623
------------------------------------------------------------------------------
Instrumented: lravgprs This is the endogenous regressor
Instruments: rtaxso This is the instrumental variable
------------------------------------------------------------------------------

OK, the change in the SEs was small this time...but not always!

ln(Qi^cigarettes) = 9.72 – 1.08ln(Pi^cigarettes), n = 48
                   (1.53)  (0.32)
10-35
Summary of IV Regression with a Single X and Z

 A valid instrument Z must satisfy two conditions:


(1) relevance: corr(Zi,Xi) ≠ 0
(2) exogeneity: corr(Zi,ui) = 0
 TSLS proceeds by first regressing X on Z to get X̂ , then
regressing Y on X̂ .
 The key idea is that the first stage isolates part of the
variation in X that is uncorrelated with u
 If the instrument is valid, then the large-sample
sampling distribution of the TSLS estimator is normal,
so inference proceeds as usual

10-36
The General IV Regression Model
(SW Section 10.2)

 So far we have considered IV regression with a single


endogenous regressor (X) and a single instrument (Z).
 We need to extend this to:
o multiple endogenous regressors (X1,…,Xk)
o multiple included exogenous variables (W1,…,Wr)
These need to be included for the usual OV reason
o multiple instrumental variables (Z1,…,Zm)
More (relevant) instruments can produce a smaller
variance of TSLS: the R2 of the first stage
increases, so you have more variation in X̂ .
10-37
Example: cigarette demand

 Another determinant of cigarette demand is income;


omitting income could result in omitted variable bias
 Cigarette demand with one X, one W, and 2 instruments
(2 Z’s):
ln(Qi^cigarettes) = β0 + β1ln(Pi^cigarettes) + β2ln(Incomei) + ui

Z1i = general sales tax component onlyi


Z2i = cigarette-specific tax component onlyi

 Other W’s might be state effects and/or year effects (in


panel data, later…)
10-38
The general IV regression model: notation and jargon

Yi = 0 + 1X1i + … + kXki + k+1W1i + … + k+rWri + ui

 Yi is the dependent variable


 X1i,…, Xki are the endogenous regressors (potentially
correlated with ui)
 W1i,…,Wri are the included exogenous variables or
included exogenous regressors (uncorrelated with ui)
 0, 1,…, k+r are the unknown regression coefficients
 Z1i,…,Zmi are the m instrumental variables (the excluded
exogenous variables)

10-39
The general IV regression model, ctd.
Yi = 0 + 1X1i + … + kXki + k+1W1i + … + k+rWri + ui

We need to introduce some new concepts and to extend


some old concepts to the general IV regression model:
 Terminology: identification and overidentification
 TSLS with included exogenous variables
o one endogenous regressor
o multiple endogenous regressors
 Assumptions that underlie the normal sampling
distribution of TSLS
o Instrument validity (relevance and exogeneity)
o General IV regression assumptions
10-40
Identification

 In general, a parameter is said to be identified if


different values of the parameter would produce
different distributions of the data.
 In IV regression, whether the coefficients are identified
depends on the relation between the number of
instruments (m) and the number of endogenous
regressors (k)
 Intuitively, if there are fewer instruments than
endogenous regressors, we can’t estimate β1,…,βk
 For example, suppose k = 1 but m = 0 (no instruments)!

10-41
Identification, ctd.
The coefficients 1,…,k are said to be:
 exactly identified if m = k.
There are just enough instruments to estimate
1,…,k.
 overidentified if m > k.
There are more than enough instruments to estimate
1,…,k. If so, you can test whether the instruments
are valid (a test of the “overidentifying restrictions”)
– we’ll return to this later
 underidentified if m < k.
There are too few enough instruments to estimate
1,…,k. If so, you need to get more instruments!
10-42
General IV regression: TSLS, 1 endogenous regressor
Yi = 0 + 1X1i + 2W1i + … + 1+rWri + ui

 Instruments: Z1i,…,Zm
 First stage
o Regress X1 on all the exogenous regressors: regress
X1 on W1,…,Wr,Z1,…,Zm by OLS
o Compute predicted values X̂1i, i = 1,…,n
 Second stage
o Regress Y on X̂1, W1,…,Wr by OLS
o The coefficients from this second stage regression
are the TSLS estimators, but SEs are wrong
 To get correct SEs, do this in a single step
10-43
Example: Demand for cigarettes

ln(Qi^cigarettes) = β0 + β1ln(Pi^cigarettes) + β2ln(Incomei) + ui

Z1i = general sales taxi


Z2i = cigarette-specific taxi

 Endogenous variable: ln(Pi^cigarettes) (“one X”)


 Included exogenous variable: ln(Incomei) (“one W”)
 Instruments (excluded exogenous variables): general
sales tax, cigarette-specific tax (“two Zs”)
 Is the demand elasticity β1 overidentified, exactly
identified, or underidentified?
10-44
Example: Cigarette demand, one instrument
Y W X Z
. ivreg lpackpc lperinc (lravgprs = rtaxso) if year==1995, r;

IV (2SLS) regression with robust standard errors Number of obs = 48


F( 2, 45) = 8.19
Prob > F = 0.0009
R-squared = 0.4189
Root MSE = .18957

------------------------------------------------------------------------------
| Robust
lpackpc | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
lravgprs | -1.143375 .3723025 -3.07 0.004 -1.893231 -.3935191
lperinc | .214515 .3117467 0.69 0.495 -.413375 .842405
_cons | 9.430658 1.259392 7.49 0.000 6.894112 11.9672
------------------------------------------------------------------------------
Instrumented: lravgprs
Instruments: lperinc rtaxso STATA lists ALL the exogenous regressors
as instruments – slightly different
terminology than we have been using
------------------------------------------------------------------------------

 Running IV as a single command yields correct SEs


 Use , r for heteroskedasticity-robust SEs
10-45
Example: Cigarette demand, two instruments
Y W X Z1 Z2
. ivreg lpackpc lperinc (lravgprs = rtaxso rtax) if year==1995, r;

IV (2SLS) regression with robust standard errors Number of obs = 48


F( 2, 45) = 16.17
Prob > F = 0.0000
R-squared = 0.4294
Root MSE = .18786

------------------------------------------------------------------------------
| Robust
lpackpc | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
lravgprs | -1.277424 .2496099 -5.12 0.000 -1.780164 -.7746837
lperinc | .2804045 .2538894 1.10 0.275 -.230955 .7917641
_cons | 9.894955 .9592169 10.32 0.000 7.962993 11.82692
------------------------------------------------------------------------------
Instrumented: lravgprs
Instruments: lperinc rtaxso rtax STATA lists ALL the exogenous regressors
as “instruments” – slightly different
terminology than we have been using
------------------------------------------------------------------------------

10-46
TSLS estimates, Z = sales tax (m = 1)
ln(Qi^cigarettes) = 9.43 – 1.14ln(Pi^cigarettes) + 0.21ln(Incomei)
                   (1.26)  (0.37)                  (0.31)

TSLS estimates, Z = sales tax, cig-only tax (m = 2)

ln(Qi^cigarettes) = 9.89 – 1.28ln(Pi^cigarettes) + 0.28ln(Incomei)
                   (0.96)  (0.25)                  (0.25)

 Smaller SEs for m = 2. Using 2 instruments gives more


information – more “as-if random variation”.
 Low income elasticity (not a luxury good); income
elasticity not statistically significantly different from 0
 Surprisingly high price elasticity
10-47
General IV regression: TSLS with multiple
endogenous regressors

Yi = 0 + 1X1i + … + kXki + k+1W1i + … + k+rWri + ui

 Instruments: Z1i,…,Zm
 Now there are k first stage regressions:
o Regress X1 on W1,…, Wr, Z1,…, Zm by OLS
o Compute predicted values X̂1i, i = 1,…,n
o Regress X2 on W1,…, Wr, Z1,…, Zm by OLS
o Compute predicted values X̂2i, i = 1,…,n
o Repeat for all X’s, obtaining X̂1i, X̂2i,…, X̂ki

10-48
TSLS with multiple endogenous regressors, ctd.

 Second stage
o Regress Y on X̂1i, X̂2i,…, X̂ki, W1,…, Wr by OLS
o The coefficients from this second stage regression
are the TSLS estimators, but SEs are wrong
 To get correct SEs, do this in a single step
 What would happen in the second stage regression if
the coefficients were underidentified (that is, if
#instruments < #endogenous variables); for example, if
k = 2, m = 1?
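As a sketch in STATA (the variable names y, x1, x2, w1, z1, z2, z3 are placeholders, not variables from our data set), the manual route runs one first stage per endogenous regressor and then the second stage, while a single ivreg command gives the same coefficients with correct SEs:

. reg x1 w1 z1 z2 z3;                 First stage for X1
. predict x1hat;                      Save the fitted values X̂1
. reg x2 w1 z1 z2 z3;                 First stage for X2
. predict x2hat;                      Save the fitted values X̂2
. reg y x1hat x2hat w1;               Second stage: TSLS coefficients, but wrong SEs
. ivreg y w1 (x1 x2 = z1 z2 z3), r;   One step: same coefficients, correct robust SEs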

10-49
Sampling distribution of the TSLS estimator in the
general IV regression model

 Meaning of “valid” instruments in the general case


 The IV regression assumptions
 Implications: if the IV regression assumptions hold,
then the TSLS estimator is normally distributed, and
inference (testing, confidence intervals) proceeds as
usual

10-50
A “valid” set of instruments in the general case

The set of instruments must be relevant and exogenous:

1. Instrument relevance: Special case of one X


At least one instrument must enter the population
counterpart of the first stage regression.

2. Instrument exogeneity
All the instruments are uncorrelated with the error
term: corr(Z1i,ui) = 0,…, corr(Zm,ui) = 0

10-51
“Valid” instruments in the general case, ctd.

(1) General instrument relevance condition:


 General case, multiple X’s
Suppose the second stage regression could be run
using the predicted values from the population first
stage regression. Then: there is no perfect
multicollinearity in this (infeasible) second stage
regression
 Special case of one X
At least one instrument must enter the population
counterpart of the first stage regression.

10-52
The IV Regression Assumptions
Yi = 0 + 1X1i + … + kXki + k+1W1i + … + k+rWri + ui

1. E(ui|W1i,…,Wri) = 0
2. (Yi,X1i,…,Xki,W1i,…,Wri,Z1i,…,Zmi) are i.i.d.
3. The X’s, W’s, Z’s, and Y have nonzero, finite 4th
moments
4. The W’s are not perfectly multicollinear
5. The instruments (Z1i,…,Zmi) satisfy the conditions for
a valid set of instruments.

 #1 says “the exogenous regressors are exogenous.”


 #2 – #4 are not new; we have discussed #5.
10-53
Implications: Sampling distribution of TSLS

 If the IV regression assumptions hold, then the TSLS


estimator is normally distributed in large samples.
 Inference (hypothesis testing, confidence intervals)
proceeds as usual.
 Two notes about standard errors:
o The second stage SEs are incorrect because they
don’t take into account estimation in the first stage;
to get correct SEs, run TSLS in a single command
o Use heteroskedasticity-robust SEs, for the usual

reason.
 All this hinges on having valid instruments…

10-54
Checking Instrument Validity
(SW Section 10.3)

Recall the two requirements for valid instruments:


1. Relevance (special case of one X)
At least one instrument must enter the population
counterpart of the first stage regression.
2. Exogeneity
All the instruments must be uncorrelated with the
error term: corr(Z1i,ui) = 0,…, corr(Zmi,ui) = 0

What happens if one of these requirements isn’t


satisfied? How can you check? And what do you do?
10-55
Checking Assumption #1: Instrument Relevance
We will focus on a single included endogenous regressor:
Yi = 0 + 1Xi + 2W1i + … + 1+rWri + ui

First stage regression:


Xi = 0 + 1Z1i +…+ miZmi + m+1iW1i +…+ m+kiWki + ui

 The instruments are relevant if at least one of 1,…,m


are nonzero.
 The instruments are said to be weak if all the 1,…,m
are either zero or nearly zero.
 Weak instruments explain very little of the variation
in X, beyond that explained by the W’s
10-56
What are the consequences of weak instruments?

Consider the simplest case:


Yi = 0 + 1Xi + ui
Xi = 0 + 1Zi + ui

ˆ sYZ
 The IV estimator is 1 =
TSLS

s XZ
 If cov(X,Z) is zero or small, then sXZ will be small:
With weak instruments, the denominator is nearly zero.
 If so, the sampling distribution of β̂1TSLS (and its t-
statistic) is not well approximated by its large-n normal
approximation…
10-57
An example: the distribution of the TSLS t-statistic
with weak instruments

Dark line = irrelevant instruments


Dashed light line = strong instruments
10-58
Why does our trusty normal approximation fail us!?!
β̂1TSLS = sYZ / sXZ
 If cov(X,Z) is small, small changes in sXZ (from one
sample to the next) can induce big changes in β̂1TSLS
 Suppose in one sample you calculate sXZ = .00001!
 Thus the large-n normal approximation is a poor
approximation to the sampling distribution of β̂1TSLS
 A better approximation is that β̂1TSLS is distributed as the
ratio of two correlated normal random variables (see
SW App. 10.4)
 If instruments are weak, the usual methods of inference
are unreliable – potentially very unreliable.
10-59
Measuring the strength of instruments in practice:
The first-stage F-statistic

 The first stage regression (one X):


Regress X on Z1,..,Zm,W1,…,Wk.
 Totally irrelevant instruments ⇒ all the coefficients on
Z1,…,Zm are zero.
 The first-stage F-statistic tests the hypothesis that
Z1,…,Zm do not enter the first stage regression.
 Weak instruments imply a small first stage F-statistic.

10-60
Checking for weak instruments with a single X
 Compute the first-stage F-statistic.
Rule-of-thumb: If the first stage F-statistic is less
than 10, then the set of instruments is weak.
 If so, the TSLS estimator will be biased, and statistical
inferences (standard errors, hypothesis tests, confidence
intervals) can be misleading.
 Note that simply rejecting the null hypothesis that
the coefficients on the Z’s are zero isn’t enough – you
actually need substantial predictive content for the
normal approximation to be a good one.
 There are more sophisticated things to do than just
compare F to 10 but they are beyond this course.
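A minimal STATA sketch of this check (placeholder names x, z1, z2, w1; with a single Z the first-stage F is just the square of the t-statistic on that Z):

. reg x z1 z2 w1, r;    First-stage regression with robust SEs
. test z1 z2;           First-stage F: joint test that the coefficients on the Z’s are zero

Compare the reported F to the rule-of-thumb threshold of 10.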
10-61
What to do if you have weak instruments?

 Get better instruments (!)


 If you have many instruments, some are probably
weaker than others and it’s a good idea to drop the
weaker ones (dropping an irrelevant instrument will
increase the first-stage F)
 Use a different IV estimator instead of TSLS
o There are many IV estimators available when the
coefficients are overidentified.
o Limited information maximum likelihood has been
found to be less affected by weak instruments.
o all this is beyond the scope of this course…
10-62
Checking Assumption #2: Instrument Exogeneity

 Instrument exogeneity: All the instruments are


uncorrelated with the error term: corr(Z1i,ui) = 0,…,
corr(Zmi,ui) = 0
 If the instruments aren’t correlated with the error
term, the first stage of TSLS doesn’t successfully
isolate a component of X that is uncorrelated with the
error term, so X̂ is correlated with u and TSLS is
inconsistent.
 If there are more instruments than endogenous
regressors, it is possible to test – partially – for
instrument exogeneity.
10-63
Testing overidentifying restrictions

Consider the simplest case:


Yi = 0 + 1Xi + ui,

 Suppose there are two valid instruments: Z1i, Z2i


 Then you could compute two separate TSLS estimates.
 Intuitively, if these 2 TSLS estimates are very different
from each other, then something must be wrong: one or
the other (or both) of the instruments must be invalid.
 The J-test of overidentifying restrictions makes this
comparison in a statistically precise way.
 This can only be done if #Z’s > #X’s (overidentified).
10-64
Suppose #instruments = m > # X’s = k (overidentified)
Yi = 0 + 1X1i + … + kXki + k+1W1i + … + k+rWri + ui

The J-test of overidentifying restrictions


1. First estimate the equation of interest using TSLS and
all m instruments; compute the predicted values Ŷi,
using the actual X’s (not the X̂’s used to estimate the
second stage)
2. Compute the residuals ûi = Yi – Ŷi
3. Regress ûi against Z1i,…,Zmi, W1i,…,Wri
4. Compute the F-statistic testing the hypothesis that the
coefficients on Z1i,…,Zmi are all zero;
5. The J-statistic is J = mF
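The same five steps as a STATA sketch (placeholder names y, x, w1, z1, z2, so m = 2 and k = 1), using only commands that appear later in this chapter:

. ivreg y w1 (x = z1 z2), r;    Step 1: TSLS with all m instruments
. predict uhat, resid;          Step 2: the TSLS residuals (computed with the actual X’s)
. reg uhat z1 z2 w1;            Step 3: regress the residuals on all Z’s and W’s
. test z1 z2;                   Step 4: F testing the coefficients on the Z’s
. dis "J = " r(df)*r(F);        Step 5: J = mF; compare to a chi-squared(m–k) critical value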
10-65
J = mF, where F = the F-statistic testing the
coefficients on Z1i,…,Zmi in a regression of the TSLS
residuals against Z1i,…,Zmi, W1i,…,Wri.

Distribution of the J-statistic


 Under the null hypothesis that all the instruments are
exogenous, J has a chi-squared distribution with m–k
degrees of freedom
 If m = k, J = 0 (does this make sense?)
 If some instruments are exogenous and others are
endogenous, the J statistic will be large, and the null
hypothesis that all instruments are exogenous will be
rejected.
10-66
Application to the Demand for Cigarettes
(SW Section 10.4)

Why are we interested in knowing the elasticity of


demand for cigarettes?
 Theory of optimal taxation: optimal tax is inverse to
elasticity: smaller deadweight loss if quantity is
affected less.
 Externalities of smoking – role for government
intervention to discourage smoking
o second-hand smoke (non-monetary)
o monetary externalities

10-67
Panel data set
 Annual cigarette consumption, average prices paid by
end consumer (including tax), personal income
 48 continental US states, 1985-1995

Estimation strategy
 Having panel data allows us to control for unobserved
state-level characteristics that enter the demand for
cigarettes, as long as they don’t vary over time
 But we still need to use IV estimation methods to
handle the simultaneous causality bias that arises from
the interaction of supply and demand.

10-68
Fixed-effects model of cigarette demand

ln(Qit^cigarettes) = αi + β1ln(Pit^cigarettes) + β2ln(Incomeit) + uit

 i = 1,…,48, t = 1985, 1986,…,1995


 αi reflects unobserved omitted factors that vary across
states but not over time, e.g. attitude towards smoking
 Still, corr(ln(Pit^cigarettes),uit) is plausibly nonzero because
of supply/demand interactions
 Estimation strategy:
o Use panel data regression methods to eliminate αi
o Use TSLS to handle simultaneous causality bias

10-69
Panel data IV regression: two approaches
(a) The “n-1 binary indicators” method
(b) The “changes” method (when T=2)

(a) The “n-1 binary indicators” method


Rewrite
ln(Qit^cigarettes) = αi + β1ln(Pit^cigarettes) + β2ln(Incomeit) + uit
as
ln(Qit^cigarettes) = β0 + β1ln(Pit^cigarettes) + β2ln(Incomeit)
                     + γ2D2it + … + γ48D48it + uit
Instruments:
Z1it = general sales taxit
Z2it = cigarette-specific taxit
10-70
This now fits in the general IV regression model:

ln(Qit^cigarettes) = β0 + β1ln(Pit^cigarettes) + β2ln(Incomeit)
                     + γ2D2it + … + γ48D48it + uit

 X (endogenous regressor) = ln(Pit^cigarettes)


 48 W’s (included exogenous regressors) =
ln(Incomeit), D2it,…, D48it
 Two instruments = Z1it, Z2it
 Now estimate this full model using TSLS!
 An issue arises when dynamic response (lagged
adjustment) is important, as it is here – it takes time to
kick the habit – how should we model lagged effects?
10-71
(b) The “changes” method (when T=2)
 One way to model long-term effects is to consider 10-
year changes, between 1985 and 1995
 Rewrite the regression in “changes” form:
ln(Qi,1995^cigarettes) – ln(Qi,1985^cigarettes)
   = β1[ln(Pi,1995^cigarettes) – ln(Pi,1985^cigarettes)]
   + β2[ln(Incomei,1995) – ln(Incomei,1985)]
   + (ui,1995 – ui,1985)
 Must create “10-year change” variables, for example:
10-year change in log price = ln(Pi1995) – ln(Pi1985)
 Then estimate the demand elasticity by TSLS using 10-
year changes in the instrumental variables
 We’ll take this approach
10-72
STATA: Cigarette demand

First create “10-year change” variables


10-year change in log price
= ln(Pit) – ln(Pit–10) = ln(Pit/Pit–10)

. gen dlpackpc = log(packpc/packpc[_n-10]); _n-10 is the 10-yr lagged value


. gen dlavgprs = log(avgprs/avgprs[_n-10]);
. gen dlperinc = log(perinc/perinc[_n-10]);
. gen drtaxs = rtaxs-rtaxs[_n-10];
. gen drtax = rtax-rtax[_n-10];
. gen drtaxso = rtaxso-rtaxso[_n-10];

10-73
Use TSLS to estimate the demand elasticity by using
the “10-year changes” specification
Y W X Z
. ivreg dlpackpc dlperinc (dlavgprs = drtaxso) , r;

IV (2SLS) regression with robust standard errors Number of obs = 48


F( 2, 45) = 12.31
Prob > F = 0.0001
R-squared = 0.5499
Root MSE = .09092

------------------------------------------------------------------------------
| Robust
dlpackpc | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
dlavgprs | -.9380143 .2075022 -4.52 0.000 -1.355945 -.5200834
dlperinc | .5259693 .3394942 1.55 0.128 -.1578071 1.209746
_cons | .2085492 .1302294 1.60 0.116 -.0537463 .4708446
------------------------------------------------------------------------------
Instrumented: dlavgprs
Instruments: dlperinc drtaxso
------------------------------------------------------------------------------
NOTE:
- All the variables – Y, X, W, and Z’s – are in 10-year changes
- Estimated elasticity = –.94 (SE = .21) – surprisingly elastic!
- Income elasticity small, not statistically different from zero
- Must check whether the instrument is relevant…
10-74
Check instrument relevance: compute first-stage F
. reg dlavgprs drtaxso dlperinc , r;

Regression with robust standard errors Number of obs = 48


F( 2, 45) = 16.84
Prob > F = 0.0000
R-squared = 0.5146
Root MSE = .06334

------------------------------------------------------------------------------
| Robust
dlavgprs | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
drtaxso | .0254611 .0043876 5.80 0.000 .016624 .0342982
dlperinc | -.2241037 .2188815 -1.02 0.311 -.6649536 .2167463
_cons | .5321948 .0295315 18.02 0.000 .4727153 .5916742
------------------------------------------------------------------------------

. test drtaxso; We didn’t need to run “test” here


because with m=1 instrument, the
( 1) drtaxso = 0 F-statistic is the square of the
t-statistic, that is,
F( 1, 45) = 33.67 5.80*5.80 = 33.67
Prob > F = 0.0000
First stage F = 33.7 > 10 so instrument is not weak

Can we check instrument exogeneity? No…m = k


10-75
What about two instruments (cig-only tax, sales tax)?
. ivreg dlpackpc dlperinc (dlavgprs = drtaxso drtax) , r;

IV (2SLS) regression with robust standard errors Number of obs = 48


F( 2, 45) = 21.30
Prob > F = 0.0000
R-squared = 0.5466
Root MSE = .09125

------------------------------------------------------------------------------
| Robust
dlpackpc | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
dlavgprs | -1.202403 .1969433 -6.11 0.000 -1.599068 -.8057392
dlperinc | .4620299 .3093405 1.49 0.142 -.1610138 1.085074
_cons | .3665388 .1219126 3.01 0.004 .1209942 .6120834
------------------------------------------------------------------------------
Instrumented: dlavgprs
Instruments: dlperinc drtaxso drtax
------------------------------------------------------------------------------

drtaxso = general sales tax only


drtax = cigarette-specific tax only
Estimated elasticity is -1.2, even more elastic than using general
sales tax only

With m>k, we can test the overidentifying restrictions


10-76
Test the overidentifying restrictions
. predict e, resid; Computes residuals from the most recently
estimated regression (the previous TSLS regression)
. reg e drtaxso drtax dlperinc; Regress e on Z’s and W’s

Source | SS df MS Number of obs = 48


-------------+------------------------------ F( 3, 44) = 1.64
Model | .037769176 3 .012589725 Prob > F = 0.1929
Residual | .336952289 44 .007658007 R-squared = 0.1008
-------------+------------------------------ Adj R-squared = 0.0395
Total | .374721465 47 .007972797 Root MSE = .08751

------------------------------------------------------------------------------
e | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
drtaxso | .0127669 .0061587 2.07 0.044 .000355 .0251789
drtax | -.0038077 .0021179 -1.80 0.079 -.008076 .0004607
dlperinc | -.0934062 .2978459 -0.31 0.755 -.6936752 .5068627
_cons | .002939 .0446131 0.07 0.948 -.0869728 .0928509
------------------------------------------------------------------------------
. test drtaxso drtax;

( 1) drtaxso = 0 Compute J-statistic, which is m*F,


( 2) drtax = 0 where F tests whether coefficients on
the instruments are zero
F( 2, 44) = 2.47 so J = 2 × 2.47 = 4.93
Prob > F = 0.0966 ** WARNING – this uses the wrong d.f. **
10-77
The correct degrees of freedom for the J-statistic is m–k:
 J = mF, where F = the F-statistic testing the coefficients
on Z1i,…,Zmi in a regression of the TSLS residuals
against Z1i,…,Zmi, W1i,…,Wri.
 Under the null hypothesis that all the instruments are
exogenous, J has a chi-squared distribution with m–k
degrees of freedom
 Here, J = 4.93, distributed chi-squared with d.f. = 1; the
5% critical value is 3.84, so reject at 5% sig. level.
 In STATA:
. dis "J-stat = " r(df)*r(F) " p-value = " chiprob(r(df)-1,r(df)*r(F));
J-stat = 4.9319853 p-value = .02636401

J = 2 × 2.47 = 4.93; p-value from chi-squared(1) distribution

10-78
Check instrument relevance: compute first-stage F
X Z1 Z2 W
. reg dlavgprs drtaxso drtax dlperinc , r;

Regression with robust standard errors Number of obs = 48


F( 3, 44) = 66.68
Prob > F = 0.0000
R-squared = 0.7779
Root MSE = .04333

------------------------------------------------------------------------------
| Robust
dlavgprs | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
drtaxso | .013457 .0031405 4.28 0.000 .0071277 .0197863
drtax | .0075734 .0008859 8.55 0.000 .0057879 .0093588
dlperinc | -.0289943 .1242309 -0.23 0.817 -.2793654 .2213767
_cons | .4919733 .0183233 26.85 0.000 .4550451 .5289015
------------------------------------------------------------------------------

. test drtaxso drtax;

( 1) drtaxso = 0
( 2) drtax = 0

F( 2, 44) = 88.62 88.62 > 10 so instruments aren’t weak


Prob > F = 0.0000

10-79
Tabular summary of these results:

10-80
How should we interpret the J-test rejection?
 J-test rejects the null hypothesis that both the
instruments are exogenous
 This means that either rtaxso is endogenous, or rtax is
endogenous, or both
 The J-test doesn’t tell us which!! You must think!
 Why might rtax (cig-only tax) be endogenous?
o Political forces: history of smoking or lots of
smokers ⇒ political pressure for low cigarette taxes
o If so, cig-only tax is endogenous
 This reasoning doesn’t apply to general sales tax
 ⇒ use just one instrument, the general sales tax

10-81
The Demand for Cigarettes:
Summary of Empirical Results

 Use the estimated elasticity based on TSLS with the


general sales tax as the only instrument:
Elasticity = -.94, SE = .21
 This elasticity is surprisingly large (not inelastic) – a
1% increase in prices reduces cigarette sales by nearly
1%. This is much more elastic than conventional
wisdom in the health economics literature.
 This is a long-run (ten-year change) elasticity. What
would you expect a short-run (one-year change)
elasticity to be – more or less elastic?
10-82
What are the remaining threats to internal validity?

 Omitted variable bias?


o Panel data estimator; probably OK

 Functional form mis-specification


o Hmmm…should check…
o A related question is the interpretation of the
elasticity: using 10-year differences, the elasticity
interpretation is long-term. Different estimates
would obtain using shorter differences.

10-83
Remaining threats to internal validity, ctd.
 Remaining simultaneous causality bias?
o Not if the general sales tax is a valid instrument:
 relevance?
 exogeneity?
 Errors-in-variables bias? Interesting question: are we
accurately measuring the price actually paid? What
about cross-border sales?
 Selection bias? (no, we have all the states)

Overall, this is a credible estimate of the long-term


elasticity of demand although some problems might
remain.
10-84
Where Do Valid Instruments Come From?
(SW Section 10.5)

 Valid instruments are (1) relevant and (2) exogenous


 One general way to find instruments is to look for
exogenous variation – variation that is “as if” randomly
assigned in a randomized experiment – that affects X.
o Rainfall shifts the supply curve for butter but not the
demand curve; rainfall is “as if” randomly assigned
o Sales tax shifts the supply curve for cigarettes but
not the demand curve; sales taxes are “as if”
randomly assigned
 Here is a final example…
10-85
Example: Cardiac Catheterization

Does cardiac catheterization improve longevity of heart


attack patients?

Yi = survival time (in days) of heart attack patient


Xi = 1 if patient receives cardiac catheterization,
= 0 otherwise

 Clinical trials show that CardCath affects


SurvivalDays.
 But is the treatment effective “in the field”?

10-86
SurvivalDaysi = 0 + 1CardCathi + ui

 Is OLS unbiased? The decision to treat a patient by


cardiac catheterization is endogenous – it is (was) made
in the field by the EMT technician and depends on ui
(unobserved patient health characteristics)
 If healthier patients are catheterized, then OLS has
simultaneous causality bias and overestimates the
CC effect
 Propose instrument: distance to the nearest CC hospital
– distance to the nearest “regular” hospital

10-87
 Z = differential distance to CC hospital
o Relevant? If a CC hospital is far away, patient
won’t be taken there and won’t get CC
o Exogenous? If distance to CC hospital doesn’t
affect survival, other than through effect on
CardCathi, then corr(distance,ui) = 0 so exogenous
o If patients’ location is random, then differential
distance is “as if” randomly assigned.
o The 1st stage is a linear probability model: distance
affects the probability of receiving treatment
 Results (McClellan, McNeil, Newhouse, JAMA, 1994):
o OLS estimates significant and large effect of CC
o TSLS estimates a small, often insignificant effect
10-88
Summary: IV Regression
(SW Section 10.6)
 A valid instrument lets us isolate a part of X that is
uncorrelated with u, and that part can be used to
estimate the effect of a change in X on Y
 IV regression hinges on having valid instruments:
(1) Relevance: check via first-stage F
(2) Exogeneity: Test overidentifying restrictions
via the J-statistic
 A valid instrument isolates variation in X that is “as if”
randomly assigned.
 The critical requirement of at least m valid instruments
cannot be tested – you must use your head.
10-89
Experiments and Quasi-Experiments
(SW Chapter 11)

Why study experiments?


 Ideal randomized controlled experiments provide a
benchmark for assessing observational studies.
 Actual experiments are rare ($$$) but influential.
 Experiments can solve the threats to internal validity
of observational studies, but they have their own
threats to internal validity.
 Thinking about experiments helps us to understand
quasi-experiments, or “natural experiments,” in which
some variation is “as if” randomly assigned.
11-1
Terminology: experiments and quasi-experiments
 An experiment is designed and implemented
consciously by human researchers. An experiment
entails conscious use of a treatment and control group
with random assignment (e.g. clinical trials of a drug)
 A quasi-experiment or natural experiment has a
source of randomization that is “as if” randomly
assigned, but this variation was not part of a conscious
randomized treatment and control design.
 Program evaluation is the field of statistics aimed at
evaluating the effect of a program or policy, for
example, an ad campaign to cut smoking.

11-2
Different types of experiments: three examples

 Clinical drug trial: does a proposed drug lower


cholesterol?
o Y = cholesterol level
o X = treatment or control group (or dose of drug)
 Job training program (Job Training Partnership Act)
o Y = has a job, or not (or Y = wage income)
o X = went through experimental program, or not
 Class size effect (Tennessee class size experiment)
o Y = test score (Stanford Achievement Test)
o X = class size treatment group (regular, regular +
aide, small)
11-3
Our treatment of experiments: brief outline

 Why (precisely) do ideal randomized controlled


experiments provide estimates of causal effects?
 What are the main threats to the validity (internal and
external) of actual experiments – that is, experiments
actually conducted with human subjects?
 Flaws in actual experiments can result in X and u being
correlated (threats to internal validity).
 Some of these threats can be addressed using the
regression estimation methods we have used so far:
multiple regression, panel data, IV regression.

11-4
Idealized Experiments and Causal Effects
(SW Section 11.1)
 An ideal randomized controlled experiment randomly
assigns subjects to treatment and control groups.
 More generally, the treatment level X is randomly
assigned:
Yi = 0 + 1Xi + ui

 If X is randomly assigned (for example by computer)


then u and X are independently distributed and E(ui|Xi)
= 0, so OLS yields an unbiased estimator of β1.
 The causal effect is the population value of β1 in an
ideal randomized controlled experiment
11-5
Estimation of causal effects in an ideal randomized
controlled experiment

 Random assignment of X implies that E(ui|Xi) = 0.


 Thus the OLS estimator β̂1 is unbiased.
 When the treatment is binary, β̂1 is just the difference
in mean outcome (Y) in the treatment vs. control
group (Ȳtreated – Ȳcontrol).
 This difference in means is sometimes called the
differences estimator.
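In STATA the differences estimator is just OLS of the outcome on the binary treatment indicator; a sketch with placeholder names y and x:

. reg y x, r;               Coefficient on x = Ȳtreated – Ȳcontrol, with a robust SE
. ttest y, by(x) unequal;   The same difference in means, as a two-sample t-test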

11-6
Potential Problems with Experiments in Practice
(SW Section 11.2)

Threats to Internal Validity


1. Failure to randomize (or imperfect randomization)
 for example, openings in job treatment program are
filled on first-come, first-serve basis; latecomers are
controls
 result is correlation between X and u

11-7
Threats to internal validity, ctd.

2. Failure to follow treatment protocol (or “partial


compliance”)
 some controls get the treatment
 some “treated” get controls
 “errors-in-variables” bias: corr(X,u) ≠ 0
 Attrition (some subjects drop out)
 suppose the controls who get jobs move out of town;
then corr(X,u) ≠ 0

11-8
Threats to internal validity, ctd.
3. Experimental effects
 experimenter bias (conscious or subconscious):
treatment X is associated with “extra effort” or
“extra care,” so corr(X,u) ≠ 0
 subject behavior might be affected by being in an
experiment, so corr(X,u) ≠ 0 (Hawthorne effect)

Just as in regression analysis with observational data,


threats to the internal validity of regression with
experimental data imply that corr(X,u) ≠ 0, so OLS
(the differences estimator) is biased.
George Elton Mayo and the Hawthorne Experiment
11-9
Subjects in the Hawthorne plant experiments, 1924 – 1932

11-10
Threats to External Validity

1. Nonrepresentative sample
2. Nonrepresentative “treatment” (that is, program or
policy)
3. General equilibrium effects (effect of a program can
depend on its scale; admissions counseling)
4. Treatment v. eligibility effects (which is it you want
to measure: effect on those who take the program, or
the effect on those who are eligible)

11-11
Regression Estimators of Causal Effects Using
Experimental Data
(SW Section 11.3)
 Focus on the case that X is binary (treatment/control).
 Often you observe subject characteristics, W1i,…,Wri.
 Extensions of the differences estimator:
o can improve efficiency (reduce standard errors)
o can eliminate bias that arises when:
 treatment and control groups differ
 there is “conditional randomization”
 there is partial compliance
 These extensions involve methods we have already
seen – multiple regression, panel data, IV regression
11-12
Estimators of the Treatment Effect 1 using
Experimental Data (X = 1 if treated, 0 if control)

 differences: dependent variable Y; independent variable X; estimate by OLS.
 differences-in-differences: dependent variable ΔY = Yafter – Ybefore; independent variable X; estimate by OLS – adjusts for initial differences between treatment and control groups.
 differences with add’l regressors: dependent variable Y; independent variables X, W1,…,Wr; estimate by OLS – controls for additional subject characteristics W.
11-13
Estimators with experimental data, ctd.
 differences-in-differences with add’l regressors: dependent variable ΔY = Yafter – Ybefore; independent variables X, W1,…,Wr; estimate by OLS – adjusts for group differences and controls for subject characteristics W.
 instrumental variables: dependent variable Y; independent variable X; estimate by TSLS with Z = initial random assignment – eliminates bias from partial compliance.
 TSLS with Z = initial random assignment also can be
applied to the differences-in-differences estimator and
the estimators with additional regressors (W’s)
11-14
The differences-in-differences estimator
 Suppose the treatment and control groups differ
systematically; maybe the control group is healthier
(wealthier; better educated; etc.)
 Then X is correlated with u, and the differences
estimator is biased.
 The differences-in-differences estimator adjusts for pre-
experimental differences by subtracting off each
subject’s pre-experimental value of Y
o Yi,before = value of Y for subject i before the expt
o Yi,after = value of Y for subject i after the expt
o ΔYi = Yi,after – Yi,before = change over course of expt

11-15
ˆ1diffsindiffs = (Y treat ,after –Y treat ,before ) – (Y control ,after –Y control ,before )

11-16
The differences-in-differences estimator, ctd.

(1) “Differences” formulation:

Yi = 0 + 1Xi + ui

where
Yi = Yi after – Yi before
Xi = 1 if treated, = 0 otherwise

 ˆ1 is the diffs-in-diffs estimator

11-17
The differences-in-differences estimator, ctd.

(2) Equivalent “panel data” version:


Yit = 0 + 1Xit + 2Dit + 3Git + vit, i = 1,…,n

where
t = 1 (before experiment), 2 (after experiment)
Dit = 0 for t = 1, = 1 for t = 2
Git = 0 for control group, = 1 for treatment group
Xit = 1 if treated, = 0 otherwise
= Dit Git = interaction effect of being in treatment
group in the second period
 ˆ1 is the diffs-in-diffs estimator
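A STATA sketch of the “differences” formulation, with placeholder names (one row per subject, with yafter, ybefore, and the treatment indicator x):

. gen dy = yafter - ybefore;    Change in Y over the course of the experiment
. reg dy x, r;                  Coefficient on x is the diffs-in-diffs estimator

The panel-data version instead regresses Y on X, D, and G in the stacked two-period data; when T = 2 the coefficient on X = D×G is the same estimator (this equivalence is derived on slide 11-45).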
11-18
Including additional subject characteristics (W’s)

 Typically you observe additional subject characteristics,


W1i,…,Wri
 Differences estimator with add’l regressors:

Yi = β0 + β1Xi + β2W1i + … + βr+1Wri + ui

 Differences-in-differences estimator with W’s:

ΔYi = β0 + β1Xi + β2W1i + … + βr+1Wri + ui

where ΔYi = Yi,after – Yi,before.


11-19
Why include additional subject characteristics (W’s)?

1. Efficiency: more precise estimator of β1 (smaller


standard errors)
2. Check for randomization. If X is randomly assigned,
then the OLS estimators with and without the W’s
should be similar – if they aren’t, this suggests that X
wasn’t randomly assigned (a problem with the expt.)
 Note: To check directly for randomization,
regress X on the W’s and do an F-test (see the sketch below).
3. Adjust for conditional randomization (we’ll return to
this later…)
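A minimal sketch of that direct randomization check in STATA (placeholder names x, w1, w2, w3):

. reg x w1 w2 w3;    Regress the treatment indicator on the subject characteristics
. test w1 w2 w3;     If X was randomly assigned, the W’s should not jointly predict X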

11-20
Estimation when there is partial compliance
Consider diffs-in-diffs estimator, X = actual treatment
Yi = 0 + 1Xi + ui
 Suppose there is partial compliance: some of the
treated don’t take the drug; some of the controls go to
job training anyway
 Then X is correlated with u, and OLS is biased
 Suppose initial assignment, Z, is random
 Then (1) corr(Z,X) ≠ 0 and (2) corr(Z,u) = 0
 Thus β1 can be estimated by TSLS, with instrumental
variable Z = initial assignment
 This can be extended to W’s (included exog. variables)
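As a STATA sketch (placeholder names: dy = change in the outcome, x = treatment actually received, z = initial random assignment, w1 an optional subject characteristic):

. ivreg dy w1 (x = z), r;    Z instruments for the treatment actually received; the coefficient on x estimates β1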

11-21
Experimental Estimates of the Effect of Class Size
Reduction: The Tennessee Class Size Experiment
(SW Section 11.4)
Project STAR (Student-Teacher Achievement Ratio)
 4-year study, $12 million
 Upon entering the school system, a student was
randomly assigned to one of three groups:
o regular class (22 – 25 students)
o regular class + aide
o small class (13 – 17 students)
 regular class students re-randomized after first year to
regular or regular+aide
 Y = Stanford Achievement Test scores
11-22
Deviations from experimental design

 Partial compliance:
o 10% of students switched treatment groups because
of “incompatibility” and “behavior problems” – how
much of this was because of parental pressure?
o Newcomers: incomplete receipt of treatment for
those who move into district after grade 1
 Attrition
o students move out of district
o students leave for private/religious schools

11-23
Regression analysis

 The “differences” regression model:


Yi = 0 + 1SmallClassi + 2RegAidei + ui
where
SmallClassi = 1 if in a small class
RegAidei = 1 if in regular class with aide

 Additional regressors (W’s)


o teacher experience
o free lunch eligibility
o gender, race

11-24
Differences estimates (no W’s)

11-25
11-26
How big are these estimated effects?
 Put on same basis by dividing by std. dev. of Y
 Units are now standard deviations of test scores

11-27
How do these estimates compare to those from the
California, Mass. observational studies? (Ch. 4 – 7)

11-28
Summary: The Tennessee Class Size Experiment

Remaining threats to internal validity


 partial compliance/incomplete treatment
o can use TSLS with Z = initial assignment
o Turns out, TSLS and OLS estimates are similar
(Krueger (1999)), so this bias seems not to be large

Main findings:
 The effects are small quantitatively (same size as
gender difference)
 Effect is sustained but not cumulative or increasing –
biggest effect at the youngest grades
11-29
What is the Difference Between a Control Variable
and the Variable of Interest?
(SW App. 11.3)

Example: “free lunch eligible” in the STAR regressions


 Coefficient is large, negative, statistically significant
 Policy interpretation: Making students ineligible for a
free school lunch will improve their test scores.
 Is this really an estimate of a causal effect?
 Is the OLS estimator of its coefficient unbiased?
 Can it be that the coefficient on “free lunch eligible”
is biased but the coefficient on SmallClass is not?

11-30
11-31
Example: “free lunch eligible,” ctd.
 Coefficient on “free lunch eligible” is large, negative,
statistically significant
 Policy interpretation: Making students ineligible for a
free school lunch will improve their test scores.
 Why (precisely) can we interpret the coefficient on
SmallClass as an unbiased estimate of a causal effect,
but not the coefficient on “free lunch eligible”?
 This is not an isolated example!
o Other “control variables” we have used: gender,
race, district income, state fixed effects, time fixed
effects, city (or state) population,…
 What is a “control variable” anyway?
11-32
Simplest case: one X, one control variable W

Yi = 0 + 1 Xi + 2Wi + ui

For example,
 W = free lunch eligible (binary)
 X = small class/large class (binary)
 Suppose random assignment of X depends on W
o for example, 60% of free-lunch eligibles get small
class, 40% of ineligibles get small class
o note: this wasn’t the actual STAR randomization
procedure – this is a hypothetical example
 Further suppose W is correlated with u
11-33
Yi = 0 + 1 Xi + 2Wi + ui
Suppose:
 The control variable W is correlated with u
 Given W = 0 (ineligible), X is randomly assigned
 Given W = 1 (eligible), X is randomly assigned.

Then:
 Given the value of W, X is randomly assigned;
 That is, controlling for W, X is randomly assigned;
 Thus, controlling for W, X is uncorrelated with u
 Moreover, E(u|X,W) doesn’t depend on X
 That is, we have conditional mean independence:
E(u|X,W) = E(u|W)
11-34
Implications of conditional mean independence

Yi = 0 + 1 Xi + 2Wi + ui

Suppose E(u|W) is linear in W (not restrictive – could add


quadratics etc.): then,
E(u|X,W) = E(u|W) = γ0 + γ1Wi   (*)
so
E(Yi|Xi,Wi) = E(β0 + β1Xi + β2Wi + ui|Xi,Wi)
            = β0 + β1Xi + β2Wi + E(ui|Xi,Wi)
            = β0 + β1Xi + β2Wi + γ0 + γ1Wi   by (*)
            = (β0+γ0) + β1Xi + (β2+γ1)Wi

11-35
Implications of conditional mean independence:
 The conditional mean of Y given X and W is
E(Yi|Xi,Wi) = (0+0) + 1Xi + (1+2)Wi
 The effect of a change in X under conditional mean
independence is the desired causal effect:
E(Yi|Xi = x+x,Wi) – E(Yi|Xi = x,Wi) = 1x
or
E (Yi | X i  x  x,Wi )  E (Yi | X i  x,Wi )
1 =
x
 If X is binary (treatment/control), this becomes:
E (Yi | X i  1,Wi )  E (Yi | X i  0,Wi )
1 =
x
which is the desired treatment effect.
11-36
Implications of conditional mean independence, ctd.
Yi = 0 + 1 Xi + 2Wi + ui

Conditional mean independence says:


E(u|X,W) = E(u|W)
which, with linearity, implies:
E(Yi|Xi,Wi) = (0+0) + 1Xi + (1+2)Wi

Then:
 The OLS estimator β̂1 is unbiased.
 β̂2 is not consistent and not meaningful
 The usual inference methods (standard errors,
hypothesis tests, etc.) apply to β̂1.
11-37
So, what is a control variable?

A control variable W is a variable that results in X


satisfying the conditional mean independence condition:
E(u|X,W) = E(u|W)

 Upon including a control variable in the regression, X


ceases to be correlated with the error term.
 The control variable itself can be (in general will be)
correlated with the error term.
 The coefficient on X has a causal interpretation.
 The coefficient on W does not have a causal
interpretation.
11-38
Example: Effect of teacher experience on test scores
More on the design of Project STAR:
 Teachers didn’t change school because of the expt.
 Within their normal school, teachers were randomly
assigned to small/regular/reg+aide classrooms.
 What is the effect of X = years of teacher experience?

The design implies conditional mean independence:


 W = school binary indicator
 Given W (school), X is randomly assigned
 That is, E(u|X,W) = E(u|W)
 W is plausibly correlated with u (nonzero school fixed
effects: some schools are better/richer/etc than others)
11-39
11-40
Example: teacher experience, ctd.
 Without school fixed effects (2), the estimated effect of
an additional year of experience is 1.47 (SE = .17)
 “Controlling for the school” (3), the estimated effect of
an additional year of experience is .74 (SE = .17)
 Direction of bias makes sense:
o less experienced teachers at worse schools
o years of experience picks up this school effect
 OLS estimator of coefficient on years of experience is
biased up without school effects; with school effects,
OLS yields unbiased estimator of causal effect
 School effect coefficients don’t have a causal
interpretation (effect of student changing schools)
11-41
Quasi-Experiments
(SW Section 11.5)

A quasi-experiment or natural experiment has a source


of randomization that is “as if” randomly assigned, but
this variation was not part of a conscious randomized
treatment and control design.

Two cases:
(a) Treatment (X) is “as if” randomly assigned (OLS)
(b) A variable (Z) that influences treatment (X) is
“as if” randomly assigned (IV)

11-42
Two types of quasi-experiments

(a) Treatment (X) is “as if” randomly assigned (perhaps


conditional on some control variables W)
 Ex: Effect of marginal tax rates on labor supply
o X = marginal tax rate (rate changes in one state,
not another; state is “as if” randomly assigned)

(b) A variable (Z) that influences treatment (X) is


“as if” randomly assigned (IV)
 Effect on survival of cardiac catheterization
X = cardiac catheterization;
Z = differential distance to CC hospital
11-43
Econometric methods
(a) Treatment (X) is “as if” randomly assigned (OLS)
Diffs-in-diffs estimator using panel data methods:

Yit = 0 + 1Xit + 2Dit + 3Git + uit, i = 1,…,n


where
t = 1 (before experiment), 2 (after experiment)
Dit = 0 for t = 1, = 1 for t = 2
Git = 0 for control group, = 1 for treatment group
Xit = 1 if treated, = 0 otherwise
    = Dit × Git = interaction effect of being in treatment
group in the second period
 β̂1 is the diffs-in-diffs estimator…
11-44
The panel data diffs-in-diffs estimator simplifies to
the “changes” diffs-in-diffs estimator when T = 2
Yit = 0 + 1Xit + 2Dit + 3Git + uit, i = 1,…,n (*)

For t = 1: Di1 = 0 and Xi1 = 0 (nobody treated), so


Yi1 = 0 + 3Gi1 + ui1
For t = 2: Di2 = 1 and Xi2 = 1 if treated, = 0 if not, so
Yi2 = 0 + 1Xi2 + 2 + 3Gi2 + ui2
so
Yi = Yi2–Yi1 = (0+1Xi2+2+3Gi2+ui2) – (0+3Gi1+ui1)
= 1Xi + 2 + (ui1 – ui2) (since Gi1 = Gi2)
or
Yi = 2 + 1Xi + vi, where vi = ui1 – ui2 (**)
11-45
Differences-in-differences with control variables

Yit = 0 + 1Xit + 2Dit + 3Git + 4W1it + … + 3+rWrit + uit,

Xit = 1 if the treatment is received, = 0 otherwise


= Git Dit (= 1 for treatment group in second period)
 If the treatment (X) is “as if” randomly assigned,
given W, then u is conditionally mean indep. of X:
E(u|X,D,G,W) = E(u|D,G,W)
 OLS is a consistent estimator of β1, the causal effect
of a change in X
 In general, the OLS estimators of the other
coefficients do not have a causal interpretation.
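As a sketch (placeholder names y, d, g, w1), the estimating equation is ordinary OLS on the interaction and its components plus the W’s:

. gen x = d*g;           Treatment indicator = (treatment group) × (post period)
. reg y x d g w1, r;     Coefficient on x is the diffs-in-diffs estimate, controlling for w1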
11-46
(b) A variable (Z) that influences treatment (X) is
“as if” randomly assigned (IV)

Yit = 0 + 1Xit + 2Dit + 3Git + 4W1it + … + 3+rWrit + uit,

Xit = 1 if the treatment is received, = 0 otherwise


= Git Dit (= 1 for treatment group in second period)
Zit = variable that influences treatment but is
uncorrelated with uit (given W’s)
TSLS:
 X = endogenous regressor
 D,G,W1,…,Wr = included exogenous variables
 Z = instrumental variable
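A one-line STATA sketch of this setup (placeholder names y, x, d, g, w1, z):

. ivreg y d g w1 (x = z), r;    X endogenous; D, G, and the W’s included exogenous; Z the instrument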
11-47
Potential Threats to Quasi-Experiments
(SW Section 11.6)
The threats to the internal validity of a quasi-
experiment are the same as for a true experiment, with
one addition.
1. Failure to randomize (imperfect randomization)
Is the “as if” randomization really random, so that X
(or Z) is uncorrelated with u?
2. Failure to follow treatment protocol & attrition
3. Experimental effects (not applicable)
4. Instrument invalidity (relevance + exogeneity)
(Maybe healthier patients do live closer to CC hospitals
– they might have better access to care in general)
11-48
The threats to the external validity of a quasi-
experiment are the same as for an observational study.
1. Nonrepresentative sample
2. Nonrepresentative “treatment” (that is, program or
policy)

Example: Cardiac catheterization


 The CC study has better external validity than
controlled clinical trials because the CC study uses
observational data based on real-world
implementation of cardiac catheterization.
However that study used data from the early 90’s – do its
findings apply to CC usage today?
11-49
Experimental and Quasi-Experiments Estimates in
Heterogeneous Populations
(SW Section 11.7)

 We have discussed “the” treatment effect


 But the treatment effect could vary across individuals:
o Effect of job training program probably depends on
education, years of experience, etc.
o Effect of a cholesterol-lowering drug could depend on
other health factors (smoking, age, diabetes,…)
 If this variation depends on observed variables, then
this is a job for interaction variables!
 But what if the source of variation is unobserved?
11-50
Heterogeneity of causal effects
When the causal effect (treatment effect) varies among
individuals, the population is said to be heterogeneous.

When there are heterogeneous causal effects that are not


linked to an observed variable:
 What do we want to estimate?
o Often, the average causal effect in the population
o But there are other choices, for example the average
causal effect for those who participate (effect of
treatment on the treated)
 What do we actually estimate?
o using OLS? using TSLS?
11-51
Population regression model with heterogeneous
causal effects:
Yi = 0 + 1iXi + ui, i = 1,…,n

 1i is the causal effect (treatment effect) for the ith


individual in the sample
 For example, in the JTPA experiment, β1i could be zero
if person i already has good job search skills
 What do we want to estimate?
o effect of the program on a randomly selected person
(the “average causal effect”) – our main focus
o effect on those most (least?) benefited
o effect on those who choose to go into the program?
11-52
The Average Causal Effect

Yi = 0 + 1iXi + ui, i = 1,…,n

 The average causal effect (or average treatment effect)


is the mean value of β1i in the population.
 We can think of β1i as a random variable: it has a
distribution in the population, and drawing a different
person yields a different value of β1i (just like X and Y)
 For example, for person #34 the treatment effect is not
random – it is her true treatment effect – but before she
is selected at random from the population, her value of
β1i can be thought of as randomly distributed.
11-53
The average causal effect, ctd.

Yi = 0 + 1iXi + ui, i = 1,…,n

 The average causal effect is E(1).


 What does OLS estimate:
(a) When the conditional mean of u given X is zero?
(b) Under the stronger assumption that X is randomly
assigned (as in a randomized experiment)?
In this case, OLS is a consistent estimator of the
average causal effect.

11-54
OLS with Heterogeneous Causal Effects

Yi = 0 + 1iXi + ui, i = 1,…,n

(a) Suppose E(ui|Xi) = 0 so cov(ui,Xi) = 0.


 If X is binary (treated/untreated), β̂1 = Ȳtreated – Ȳcontrol
estimates the causal effect among those who receive
the treatment.
 Why? For those treated, Y treated reflects the effect of
the treatment on them. But we don’t know how the
untreated would have responded had they been
treated!

11-55
The math: suppose X is binary and E(ui|Xi) = 0.
Then
β̂1 = Ȳtreated – Ȳcontrol
For the treated:
E(Yi|Xi=1) = β0 + E(β1iXi|Xi=1) + E(ui|Xi=1)
           = β0 + E(β1i|Xi=1)
For the controls:
E(Yi|Xi=0) = β0 + E(β1iXi|Xi=0) + E(ui|Xi=0)
           = β0
Thus:
β̂1 →p E(Yi|Xi=1) – E(Yi|Xi=0) = E(β1i|Xi=1)
= average effect of the treatment on the treated
11-56
OLS with heterogeneous treatment effects: general X
with E(ui|Xi) = 0
β̂1 = sXY / s²X →p σXY / σ²X = cov(β0 + β1iXi + ui, Xi) / var(Xi)

    = [cov(β0, Xi) + cov(β1iXi, Xi) + cov(ui, Xi)] / var(Xi)

    = cov(β1iXi, Xi) / var(Xi)   (because cov(ui,Xi) = 0)

 If X is binary, this simplifies to the “effect of
treatment on the treated”
 Without heterogeneity, β1i = β1 and β̂1 →p β1
 In general, the treatment effects of individuals with
large values of X are given the most weight
11-57
(b) Now make a stronger assumption: that X is randomly
assigned (experiment or quasi-experiment). Then
what does OLS actually estimate?
 If Xi is randomly assigned, it is distributed
independently of β1i, so there is no difference
between the population of controls and the
population in the treatment group
 Thus the effect of treatment on the treated = the
average treatment effect in the population.

11-58
The math:
β̂1 →p cov(β1iXi, Xi) / var(Xi) = E[ E( cov(β1iXi, Xi) / var(Xi) | β1i ) ]

    = E[ β1i cov(Xi, Xi) / var(Xi) ] = E[ β1i var(Xi) / var(Xi) ]

    = E(β1i)
Summary
 If Xi and 1i are independent (Xi is randomly
assigned), OLS estimates the average treatment effect.
 If Xi is not randomly assigned but E(ui|Xi) = 0, OLS
estimates the effect of treatment on the treated.
 Without heterogeneity, the effect of treatment on the
treated and the average treatment effect are the same
11-59
IV Regression with Heterogeneous Causal Effects

Suppose the treatment effect is heterogeneous and the


effect of the instrument on X is heterogeneous:

Yi = β0 + β1iXi + ui (equation of interest)

Xi = π0 + π1iZi + vi (first stage of TSLS)

In general, TSLS estimates the causal effect for those


whose value of X (probability of treatment) is most
influenced by the instrument.

11-60
IV with heterogeneous causal effects, ctd.

Yi = β0 + β1iXi + ui (equation of interest)

Xi = π0 + π1iZi + vi (first stage of TSLS)

Intuition:
 Suppose 1i’s were known. If for some people 1i =
0, then their predicted value of Xi wouldn’t depend
on Z, so the IV estimator would ignore them.
 The IV estimator puts most of the weight on
individuals for whom Z has a large influence on X.
 TSLS measures the treatment effect for those whose
probability of treatment is most influenced by Z.
11-61
The math…
Yi = β0 + β1iXi + ui (equation of interest)
Xi = π0 + π1iZi + vi (first stage of TSLS)

To simplify things, suppose:


 1i and 1i are distributed independently of (ui,vi,Zi)
 E(ui|Zi) = 0 and E(vi|Zi) = 0
 E(1i) 0
ˆ E ( 1i 1i )
p
Then 1 
TSLS
(derived in SW App. 11.4)
E ( 1i )
 TSLS estimates the causal effect for those individuals
for whom Z is most influential (those with large 1i).

11-62
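Where does this formula come from? A sketch of the argument, under the three assumptions above (the full derivation is in SW App. 11.4): with a single instrument and a single X, β̂1TSLS →p cov(Yi,Zi)/cov(Xi,Zi), and

cov(Yi, Zi) = cov(β1iXi + ui, Zi) = cov(β1i(π0 + π1iZi + vi), Zi) = E(β1iπ1i)var(Zi)
cov(Xi, Zi) = cov(π0 + π1iZi + vi, Zi) = E(π1i)var(Zi)

The var(Zi) terms cancel, leaving E(β1iπ1i)/E(π1i) – a weighted average of the β1i with weights proportional to π1i.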
When there are heterogeneous causal effects, what
TSLS estimates depends on the choice of instruments!
 With different instruments, TSLS estimates different
weighted averages!!!
 Suppose you have two instruments, Z1 and Z2.
o In general these instruments will be influential for
different members of the population.
o Using Z1, TSLS will estimate the treatment effect for
those people whose probability of treatment (X) is
most influenced by Z1
o The treatment effect for those most influenced by Z1
might differ from the treatment effect for those most
influenced by Z2
11-63
When does TSLS estimate the average causal effect?
Yi = β0 + β1iXi + ui (equation of interest)
Xi = π0 + π1iZi + vi (first stage of TSLS)

β̂1TSLS →p E(β1iπ1i)/E(π1i)

 TSLS estimates the average causal effect (that is,
β̂1TSLS →p E(β1i)) if:
o β1i and π1i are independent
o β1i = β1 (no heterogeneity in the equation of interest)
o π1i = π1 (no heterogeneity in the first stage equation)
 But in general β̂1TSLS does not estimate E(β1i)!

11-64
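This sensitivity is easy to see in a small simulation. The sketch below uses hypothetical variable names and current STATA syntax (ivregress 2sls; older versions use ivreg). Here π1i is uniform on [0,1] and β1i = 2π1i, so E(β1i) = 1 while E(β1iπ1i)/E(π1i) = 4/3 – and the TSLS estimate comes out near 4/3, not 1:

. clear;
. set seed 1;
. set obs 100000;
. gen z = (runiform() < .5);            binary instrument
. gen pi1 = runiform();                 heterogeneous first-stage effect
. gen x = (runiform() < pi1*z);         Pr(X=1|Z=1) = pi1, Pr(X=1|Z=0) = 0
. gen beta1 = 2*pi1;                    causal effect is largest for those most influenced by Z
. gen y = beta1*x + rnormal();
. ivregress 2sls y (x = z);             slope is close to 4/3, not to E(beta1) = 1

A second instrument that influenced a different part of the population would, by the same logic, deliver a different weighted average.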
Example: Cardiac catheterization
Yi = survival time (days) for AMI patients
Xi = received cardiac catheterization (or not)
Zi = differential distance to CC hospital

Equation of interest:
SurvivalDaysi = β0 + β1iCardCathi + ui
First stage (linear probability model):
CardCathi = π0 + π1iDistancei + vi

 For whom does distance have the greatest effect on the
probability of treatment?
 For those patients, what is their causal effect β1i?
11-65
Equation of interest:
SurvivalDaysi = β0 + β1iCardCathi + ui
First stage (linear probability model):
CardCathi = π0 + π1iDistancei + vi

 TSLS estimates the causal effect for those whose


value of Xi is most heavily influenced by Zi
 TSLS estimates the causal effect for those for whom
distance most influences the probability of treatment
 What is their causal effect? (“We might as well go to
the CC hospital, it’s not too much farther”)
 This is one explanation of why the TSLS estimate is
smaller than the clinical trial OLS estimate.
11-66
Heterogeneous Causal Effects: Summary

 Heterogeneous causal effects means that the causal (or


treatment) effect varies across individuals.
 When these differences depend on observable variables,
heterogeneous causal effects can be estimated using
interactions (nothing new here).
 When these differences are unobserved (β1i), the
average causal (or treatment) effect is the average value
in the population, E(β1i).
 When causal effects are heterogeneous, OLS and TSLS
estimate….

11-67
OLS with Heterogeneous Causal Effects
X is:     Relation between Xi and ui:      Then OLS estimates:
binary    E(ui|Xi) = 0                     effect of treatment on the
                                           treated: E(β1i|Xi=1)
binary    X randomly assigned (so          average causal effect E(β1i)
          Xi and ui are independent)
general   E(ui|Xi) = 0                     weighted average of β1i,
                                           placing most weight on
                                           those with large |Xi – X̄|
general   X randomly assigned              average causal effect E(β1i)

Without heterogeneity, β1i = β1 and β̂1 →p β1 in all these
cases.
11-68
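The first two rows of this table can be checked with a short simulation (a sketch with hypothetical variable names; here E(β1i) = 1):

. clear;
. set seed 1;
. set obs 100000;
. gen beta1 = 1 + rnormal();                heterogeneous causal effect, E(beta1) = 1
. gen u = rnormal();
. gen xrand = (runiform() < .5);            X randomly assigned
. gen yrand = beta1*xrand + u;
. reg yrand xrand;                          slope is close to 1 = E(beta1)
. gen xsel = (runiform() < normal(beta1));  treatment more likely when beta1 is large; E(u|X) = 0 still holds
. gen ysel = beta1*xsel + u;
. reg ysel xsel;                            slope is close to E(beta1|X=1) > 1
. sum beta1 if xsel == 1;                   compare: this mean is the effect of treatment on the treated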
TSLS with Heterogeneous Causal Effects
 TSLS estimates the causal effect for those individuals
for whom Z is most influential (those with large π1i).
 What TSLS estimates depends on the choice of Z!!
 In CC example, these were the individuals for whom
the decision to drive to a CC lab was heavily
influenced by the extra distance (those patients for
whom the EMT was otherwise “on the fence”)
 Thus TSLS also estimates a causal effect: the average
effect of treatment on those most influenced by the
instrument
o In general, this is neither the average causal effect
nor the effect of treatment on the treated
11-69
Summary: Experiments and Quasi-Experiments
(SW Section 11.8)

Experiments:
 Average causal effects are defined as expected values
of ideal randomized controlled experiments
 Actual experiments have threats to internal validity
 These threats to internal validity can be addressed (in
part) by:
o panel methods (differences-in-differences)
o multiple regression
o IV (using initial assignment as an instrument)

11-70
Summary, ctd.

Quasi-experiments:
 Quasi-experiments have an “as-if” randomly assigned
source of variation.
 This as-if random variation can generate:
o Xi which satisfies E(ui|Xi) = 0 (so estimation
proceeds using OLS); or
o instrumental variable(s) which satisfy E(ui|Zi) = 0
(so estimation proceeds using TSLS)
 Quasi-experiments also have threats to internal validity

11-71
Summary, ctd.

Two additional subtle issues:


 What is a control variable?
o A variable W for which X and u are uncorrelated,
given the value of W (conditional mean
independence: E(ui|Xi,Wi) = E(ui|Wi))
o Example: STAR & effect of teacher experience
 within their school, teachers were randomly
assigned to regular/reg+aide/small class
 OLS provides an unbiased estimator of the causal
effect, but only after controlling for school
effects.
11-72
Summary, ctd.

 What do OLS and TSLS estimate when there is


unobserved heterogeneity of causal effects?
 In general, weighted averages of causal effects:
o If X is randomly assigned, then OLS estimates the
average causal effect.
o If Xi is not randomly assigned but E(ui|Xi) = 0, OLS
estimates the average effect of treatment on the
treated.
o If E(ui|Zi) = 0, TSLS estimates the average effect of

treatment on those most influenced by Zi.

11-73
Introduction to Time Series Regression and
Forecasting
(SW Chapter 12)

Time series data are data collected on the same


observational unit at multiple time periods
 Aggregate consumption and GDP for a country (for
example, 20 years of quarterly observations = 80
observations)
 Yen/$, pound/$ and Euro/$ exchange rates (daily data
for 1 year = 365 observations)
 Cigarette consumption per capita for a state

12-1
Example #1 of time series data: US rate of inflation

12-2
Example #2: US rate of unemployment

12-3
Why use time series data?
 To develop forecasting models
o What will the rate of inflation be next year?
 To estimate dynamic causal effects
o If the Fed increases the Federal Funds rate now,
what will be the effect on the rates of inflation and
unemployment in 3 months? in 12 months?
o What is the effect over time on cigarette
consumption of a hike in the cigarette tax?
 Plus, sometimes you don’t have any choice…
o Rates of inflation and unemployment in the US can
be observed only over time.

12-4
Time series data raises new technical issues
 Time lags
 Correlation over time (serial correlation or
autocorrelation)
 Forecasting models that have no causal interpretation
(specialized tools for forecasting):
o autoregressive (AR) models
o autoregressive distributed lag (ADL) models
 Conditions under which dynamic effects can be
estimated, and how to estimate them
 Calculation of standard errors when the errors are
serially correlated

12-5
Using Regression Models for Forecasting
(SW Section 12.1)

 Forecasting and estimation of causal effects are quite


different objectives.
 For forecasting,
o R 2 matters (a lot!)
o Omitted variable bias isn’t a problem!
o We will not worry about interpreting coefficients
in forecasting models
o External validity is paramount: the model
estimated using historical data must hold into the
(near) future
12-6
Introduction to Time Series Data
and Serial Correlation
(SW Section 12.2)

First we must introduce some notation and terminology.

Notation for time series data


 Yt = value of Y in period t.
 Data set: Y1,…,YT = T observations on the time series
random variable Y
 We consider only consecutive, evenly-spaced
observations (for example, monthly, 1960 to 1999, no
missing months) (else yet more complications...)
12-7
We will transform time series variables using lags,
first differences, logarithms, & growth rates

12-8
Example: Quarterly rate of inflation at an annual rate
 CPI in the first quarter of 1999 (1999:I) = 164.87
 CPI in the second quarter of 1999 (1999:II) = 166.03
 Percentage change in CPI, 1999:I to 1999:II
   = 100 × [(166.03 – 164.87)/164.87] = 100 × (1.16/164.87) = 0.703%
 Percentage change in CPI, 1999:I to 1999:II, at an
annual rate = 4 × 0.703 = 2.81% (percent per year)
 Like interest rates, inflation rates are (as a matter of
convention) reported at an annual rate.
 Using the logarithmic approximation to percent changes
yields 4 × 100 × [log(166.03) – log(164.87)] = 2.80%

12-9
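These numbers are easy to reproduce in STATA (a quick arithmetic check, no data set needed):

. display 400*(166.03 - 164.87)/164.87;       2.81, the annualized percentage change
. display 400*(ln(166.03) - ln(164.87));      2.80, using the logarithmic approximation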
Example: US CPI inflation – its first lag and its change
CPI = Consumer price index (Bureau of Labor Statistics)

12-10
Autocorrelation

The correlation of a series with its own lagged values is


called autocorrelation or serial correlation.

 The first autocorrelation of Yt is corr(Yt,Yt–1)


 The first autocovariance of Yt is cov(Yt,Yt–1)
 Thus
corr(Yt,Yt–1) = cov(Yt,Yt–1)/√[var(Yt)var(Yt–1)] = ρ1

 These are population correlations – they describe the


population joint distribution of (Yt,Yt–1)
12-11
12-12
Sample autocorrelations
The jth sample autocorrelation is an estimate of the jth
population autocorrelation:

ρ̂j = cov(Yt, Yt–j)/var(Yt)
where
cov(Yt, Yt–j) = [1/(T – j – 1)] Σt=j+1 to T (Yt – Ȳj+1,T)(Yt–j – Ȳ1,T–j)

where Ȳj+1,T is the sample average of Yt computed over
observations t = j+1,…,T
o Note: the summation is over t = j+1 to T (why?)
12-13
Example: Autocorrelations of:
(1) the quarterly rate of U.S. inflation
(2) the quarter-to-quarter change in the quarterly rate
of inflation

12-14
 The inflation rate is highly serially correlated (ρ1 = .85)
 Last quarter’s inflation rate contains much information
about this quarter’s inflation rate
 The plot is dominated by multiyear swings
 But there are still surprise movements!
12-15
More examples of time series & transformations

12-16
More examples of time series & transformations, ctd.

12-17
Stationarity: a key idea for external validity of time
series regression
Stationarity says that the past is like the present and
the future, at least in a probabilistic sense.

We’ll focus on the case that Yt is stationary.

12-18
Autoregressions
(SW Section 12.3)

A natural starting point for a forecasting model is to use


past values of Y (that is, Yt–1, Yt–2,…) to forecast Yt.
 An autoregression is a regression model in which Yt
is regressed against its own lagged values.
 The number of lags used as regressors is called the
order of the autoregression.
o In a first order autoregression, Yt is regressed
against Yt–1
o In a pth order autoregression, Yt is regressed
against Yt–1,Yt–2,…,Yt–p.
12-19
The First Order Autoregressive (AR(1)) Model

The population AR(1) model is

Yt = β0 + β1Yt–1 + ut

 β0 and β1 do not have causal interpretations
 if β1 = 0, Yt–1 is not useful for forecasting Yt
 The AR(1) model can be estimated by OLS regression
of Yt against Yt–1
 Testing β1 = 0 v. β1 ≠ 0 provides a test of the
hypothesis that Yt–1 is not useful for forecasting Yt

12-20
Example: AR(1) model of the change in inflation
Estimated using data from 1962:I – 1999:IV:

ΔInft = 0.02 – 0.211ΔInft–1,   R2 = 0.04
       (0.14)  (0.106)

Is the lagged change in inflation a useful predictor of the


current change in inflation?
 t = –.211/.106 = –1.99, so |t| > 1.96
 Reject H0: 1 = 0 at the 5% significance level
 Yes, the lagged change in inflation is a useful
predictor of current change in infl. (but low R 2 !)

12-21
Example: AR(1) model of inflation – STATA

First, let STATA know you are using time series data
generate time=q(1959q1)+_n-1; _n is the observation no.
So this command creates a new variable
time that has a special quarterly
date format

format time %tq; Specify the quarterly date format

sort time; Sort by time

tsset time; Let STATA know that the variable time


is the variable you want to indicate the
time scale

12-22
Example: AR(1) model of inflation – STATA, ctd.
. gen lcpi = log(cpi); variable cpi is already in memory

. gen inf = 400*(lcpi[_n]-lcpi[_n-1]); quarterly rate of inflation at an


annual rate

. corrgram inf , noplot lags(8); computes first 8 sample autocorrelations

LAG AC PAC Q Prob>Q


-----------------------------------------
1 0.8459 0.8466 116.64 0.0000
2 0.7663 0.1742 212.97 0.0000
3 0.7646 0.3188 309.48 0.0000
4 0.6705 -0.2218 384.18 0.0000
5 0.5914 0.0023 442.67 0.0000
6 0.5538 -0.0231 494.29 0.0000
7 0.4739 -0.0740 532.33 0.0000
8 0.3670 -0.1698 555.3 0.0000

. gen inf = 400*(lcpi[_n]-lcpi[_n-1])


This syntax creates a new variable, inf, the “nth” observation of which is
400 times the difference between the nth observation on lcpi and the “n-
1”th observation on lcpi, that is, the first difference of lcpi

12-23
Example: AR(1) model of inflation – STATA, ctd
Syntax: L.dinf is the first lag of dinf

. reg dinf L.dinf if tin(1962q1,1999q4), r;

Regression with robust standard errors Number of obs = 152


F( 1, 150) = 3.96
Prob > F = 0.0484
R-squared = 0.0446
Root MSE = 1.6619

------------------------------------------------------------------------------
| Robust
dinf | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
dinf |
L1 | -.2109525 .1059828 -1.99 0.048 -.4203645 -.0015404
_cons | .0188171 .1350643 0.14 0.889 -.2480572 .2856914
------------------------------------------------------------------------------

if tin(1962q1,1999q4)
STATA time series syntax for using only observations between 1962q1 and
1999q4 (inclusive).
This requires defining the time scale first, as we did above

12-24
Forecasts and forecast errors

A note on terminology:
 A predicted value refers to the value of Y predicted
(using a regression) for an observation in the sample
used to estimate the regression – this is the usual
definition
 A forecast refers to the value of Y forecasted for an
observation not in the sample used to estimate the
regression.
 Predicted values are “in sample”
 Forecasts are forecasts of the future – which cannot
have been used to estimate the regression.
12-25
Forecasts: notation
 Yt|t–1 = forecast of Yt based on Yt–1,Yt–2,…, using the
population (true unknown) coefficients
 Ŷt|t–1 = forecast of Yt based on Yt–1,Yt–2,…, using the
estimated coefficients, which were estimated using
data through period t–1.

For an AR(1),
 Yt|t–1 = β0 + β1Yt–1
 Ŷt|t–1 = β̂0 + β̂1Yt–1, where β̂0 and β̂1 were estimated
using data through period t–1.

12-26
Forecast errors

The one-period ahead forecast error is,

forecast error = Yt – Ŷt|t–1

The distinction between a forecast error and a residual is


the same as between a forecast and a predicted value:
 a residual is “in-sample”
 a forecast error is “out-of-sample” – the value of Yt
isn’t used in the estimation of the regression
coefficients

12-27
The root mean squared forecast error (RMSFE)

RMSFE = √E[(Yt – Ŷt|t–1)²]

 The RMSFE is a measure of the spread of the forecast


error distribution.
 The RMSFE is like the standard deviation of ut,
except that it explicitly focuses on the forecast error
using estimated coefficients, not using the population
regression line.
 The RMSFE is a measure of the magnitude of a
typical forecasting “mistake”

12-28
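One way to estimate the RMSFE is a pseudo out-of-sample calculation: fit the model on an earlier part of the sample, forecast the later part, and take the square root of the mean squared forecast error. A minimal sketch for the AR(1) of ΔInf (the 1994:IV split and the names dinfhat and fe2 are hypothetical; a full pseudo out-of-sample exercise would re-estimate the coefficients each quarter):

. reg dinf L.dinf if tin(1962q1,1994q4), r;
. predict dinfhat if tin(1995q1,1999q4), xb;           one-quarter-ahead forecasts
. gen fe2 = (dinf - dinfhat)^2 if tin(1995q1,1999q4);  squared forecast errors
. sum fe2;
. display "RMSFE = " sqrt(r(mean));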
Example: forecasting inflation using an AR(1)

AR(1) estimated using data from 1962:I – 1999:IV:


ΔInft = 0.02 – 0.211ΔInft–1

Inf1999:III = 2.8 (units are percent, at an annual rate)
Inf1999:IV = 3.2
ΔInf1999:IV = 0.4
So the forecast of ΔInf2000:I is,
ΔInf2000:I|1999:IV = 0.02 – 0.211×0.4 = –0.06 ≈ –0.1
so
Inf2000:I|1999:IV = Inf1999:IV + ΔInf2000:I|1999:IV = 3.2 – 0.1 = 3.1

12-29
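The same forecast can be computed from the stored coefficients of the regression on slide 12-24 (a sketch; _b[_cons] and _b[L.dinf] pick up the estimated intercept and lag coefficient in current STATA syntax):

. reg dinf L.dinf if tin(1962q1,1999q4), r;
. display _b[_cons] + _b[L.dinf]*0.4;          forecast of dinf in 2000:I (about -0.07 with unrounded coefficients)
. display 3.2 + _b[_cons] + _b[L.dinf]*0.4;    forecast of inflation itself, about 3.1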
The pth order autoregressive model (AR(p))

Yt = β0 + β1Yt–1 + β2Yt–2 + … + βpYt–p + ut

 The AR(p) model uses p lags of Y as regressors


 The AR(1) model is a special case
 The coefficients do not have a causal interpretation
 To test the hypothesis that Yt–2,…,Yt–p do not further
help forecast Yt, beyond Yt–1, use an F-test
 Use t- or F-tests to determine the lag order p
 Or, better, determine p using an “information criterion”
(see SW Section 12.5 – we won’t cover this)

12-30
Example: AR(4) model of inflation

ΔInft = .02 – .21ΔInft–1 – .32ΔInft–2 + .19ΔInft–3
       (.12)  (.10)       (.09)        (.09)
       – .04ΔInft–4,   R2 = 0.21
        (.10)

 F-statistic testing lags 2, 3, 4 is 6.43 (p-value < .001)


 R 2 increased from .04 to .21 by adding lags 2, 3, 4
 Lags 2, 3, 4 (jointly) help to predict the change in
inflation, above and beyond the first lag

12-31
Example: AR(4) model of inflation – STATA
. reg dinf L(1/4).dinf if tin(1962q1,1999q4), r;

Regression with robust standard errors Number of obs = 152


F( 4, 147) = 6.79
Prob > F = 0.0000
R-squared = 0.2073
Root MSE = 1.5292

------------------------------------------------------------------------------
| Robust
dinf | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
dinf |
L1 | -.2078575 .09923 -2.09 0.038 -.4039592 -.0117558
L2 | -.3161319 .0869203 -3.64 0.000 -.4879068 -.144357
L3 | .1939669 .0847119 2.29 0.023 .0265565 .3613774
L4 | -.0356774 .0994384 -0.36 0.720 -.2321909 .1608361
_cons | .0237543 .1239214 0.19 0.848 -.2211434 .268652
------------------------------------------------------------------------------

NOTES

 L(1/4).dinf is a convenient way to say “use lags 1–4 of dinf as regressors”


 L1,…,L4 refer to the first, second,… 4th lags of dinf

12-32
Example: AR(4) model of inflation – STATA, ctd.

. dis "Adjusted Rsquared = " _result(8); result(8) is the rbar-squared


Adjusted Rsquared = .18576822 of the most recently run regression

. test L2.dinf L3.dinf L4.dinf; L2.dinf is the second lag of dinf, etc.

( 1) L2.dinf = 0.0
( 2) L3.dinf = 0.0
( 3) L4.dinf = 0.0

F( 3, 147) = 6.43
Prob > F = 0.0004

Note: some of the time series features of STATA differ


between STATA v. 7 and STATA v. 8…

12-33
Digression: we used Inf, not Inf, in the AR’s. Why?

The AR(1) model of Inft–1 is an AR(2) model of Inft:

Inft = 0 + 1Inft–1 + ut
or
Inft – Inft–1 = 0 + 1(Inft–1 – Inft–2) + ut
or
Inft = Inft–1 + 0 + 1Inft–1 – 1Inft–2 + ut
so
Inft = 0 + (1+1)Inft–1 – 1Inft–2 + ut

So why use Inft, not Inft?


12-34
AR(1) model of Inf: Inft = 0 + 1Inft–1 + ut
AR(2) model of Inf: Inft = 0 + 1Inft + 2Inft–1 + vt

 When Yt is strongly serially correlated, the OLS


estimator of the AR coefficient is biased towards zero.
 In the extreme case that the AR coefficient = 1, Yt isn’t
stationary: the ut’s accumulate and Yt blows up.
 If Yt isn’t stationary, the regression theory we are
working with here breaks down
 Here, Inft is strongly serially correlated – so to keep
ourselves in a framework we understand, the
regressions are specified using ΔInf
 For optional reading, see SW Section 12.6, 14.3, 14.4
12-35
Time Series Regression with Additional Predictors
and the Autoregressive Distributed Lag (ADL) Model
(SW Section 12.4)

 So far we have considered forecasting models that use


only past values of Y
 It makes sense to add other variables (X) that might be
useful predictors of Y, above and beyond the predictive
value of lagged values of Y:
Yt = β0 + β1Yt–1 + … + βpYt–p
       + δ1Xt–1 + … + δrXt–r + ut

 This is an autoregressive distributed lag (ADL) model


12-36
Example: lagged unemployment and inflation

 The “Phillips curve” says that if
unemployment is above its equilibrium, or “natural,”
rate, then the rate of inflation will increase.
 That is, ΔInft should be related to lagged values of the
unemployment rate, with a negative coefficient
 The rate of unemployment at which inflation neither
increases nor decreases is often called the “non-
accelerating inflation rate of unemployment”: the
NAIRU
 Is this relation found in US economic data?
 Can this relation be exploited for forecasting inflation?
12-37
The empirical “Phillips Curve”

 The NAIRU is the value of u (the unemployment rate) for which ΔInf = 0


12-38
Example: ADL(4,4) model of inflation

ΔInft = 1.32 – .36ΔInft–1 – .34ΔInft–2 + .07ΔInft–3 – .03ΔInft–4
       (.47)  (.09)       (.10)        (.08)        (.09)
       – 2.68Unemt–1 + 3.43Unemt–2 – 1.04Unemt–3 + .07Unemt–4
        (.47)          (.89)         (.89)         (.44)

 R2 = 0.35 – a big improvement over the AR(4), for
which R2 = .21

12-39
Example: dinf and unem – STATA

. reg dinf L(1/4).dinf L(1/4).unem if tin(1962q1,1999q4), r;

Regression with robust standard errors Number of obs = 152


F( 8, 143) = 7.99
Prob > F = 0.0000
R-squared = 0.3802
Root MSE = 1.371

------------------------------------------------------------------------------
| Robust
dinf | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
dinf |
L1 | -.3629871 .0926338 -3.92 0.000 -.5460956 -.1798786
L2 | -.3432017 .100821 -3.40 0.001 -.5424937 -.1439096
L3 | .0724654 .0848729 0.85 0.395 -.0953022 .240233
L4 | -.0346026 .0868321 -0.40 0.691 -.2062428 .1370377
unem |
L1 | -2.683394 .4723554 -5.68 0.000 -3.617095 -1.749692
L2 | 3.432282 .889191 3.86 0.000 1.674625 5.189939
L3 | -1.039755 .8901759 -1.17 0.245 -2.799358 .719849
L4 | .0720316 .4420668 0.16 0.871 -.8017984 .9458615
_cons | 1.317834 .4704011 2.80 0.006 .3879961 2.247672
------------------------------------------------------------------------------

12-40
Example: ADL(4,4) model of inflation – STATA, ctd.

. dis "Adjusted Rsquared = " _result(8);


Adjusted Rsquared = .34548812

. test L2.dinf L3.dinf L4.dinf;

( 1) L2.dinf = 0.0
( 2) L3.dinf = 0.0
( 3) L4.dinf = 0.0

F( 3, 143) = 4.93 The extra lags of dinf are signif.


Prob > F = 0.0028

. test L1.unem L2.unem L3.unem L4.unem;

( 1) L.unem = 0.0
( 2) L2.unem = 0.0
( 3) L3.unem = 0.0
( 4) L4.unem = 0.0

F( 4, 143) = 8.51 The lags of unem are significant


Prob > F = 0.0000

The null hypothesis that the coefficients on the lags of the unemployment
rate are all zero is rejected at the 1% significance level using the F-
statistic
12-41
The test of the joint hypothesis that none of the X’s is a
useful predictor, above and beyond lagged values of Y, is
called a Granger causality test

“causality” is an unfortunate term here: Granger


Causality simply refers to (marginal) predictive content.
12-42
Summary: Time Series Forecasting Models
 For forecasting purposes, it isn’t important to have
coefficients with a causal interpretation!
 Simple and reliable forecasts can be produced using
AR(p) models – these are common “benchmark”
forecasts against which more complicated forecasting
models can be assessed
 Additional predictors (X’s) can be added; the result is
an autoregressive distributed lag (ADL) model
 Stationarity means that the models can be used outside
the range of data for which they were estimated
 We now have the tools we need to estimate dynamic
causal effects...
12-43
Estimation of Dynamic Causal Effects
(SW Chapter 13)

A dynamic causal effect is the effect on Y of a change in


X over time.

For example:
 The effect of an increase in cigarette taxes on cigarette
consumption this year, next year, in 5 years;
 The effect of a change in the Fed Funds rate on
inflation, this month, in 6 months, and 1 year;
 The effect of a freeze in Florida on the price of orange
juice concentrate in 1 month, 2 months, 3 months…
23-1
The Orange Juice Data
(SW Section 13.1)

Data
 Monthly, Jan. 1950 – Dec. 2000 (T = 612)
 Price = price of frozen OJ (a sub-component of the
producer price index; US Bureau of Labor Statistics)
 %ChgP = percentage change in price at an annual rate,
so %ChgPt = 1200Δln(Pricet)
 FDD = number of freezing degree-days during the
month, recorded in Orlando FL
o Example: If November has 2 days with low temp <
32°F, one at 30° and one at 25°, then FDDNov = 2 + 7 = 9
23-2
23-3
Initial OJ regression

%ChgPt = –.40 + .47FDDt
         (.22)  (.13)

 Statistically significant positive relation


 More/deeper freezes, price goes up
 Standard errors: not the usual – heteroskedasticity and
autocorrelation-consistent (HAC) SE’s – more on this
later
 But what is the effect of FDD over time?

23-4
Dynamic Causal Effects
(SW Section 13.2)

Example: What is the effect of fertilizer on tomato yield?

An ideal randomized controlled experiment


 Fertilize some plots, not others (random assignment)
 Measure yield over time – over repeated harvests – to
estimate causal effect of fertilizer on:
o Yield in year 1 of expt
o Yield in year 2, etc.
 The result (in a large expt) is the causal effect of
fertilizer on yield k years later.
23-5
In time series applications, we can’t conduct this ideal
randomized controlled experiment:
 We only have one US OJ market ….
 We can’t randomly assign FDD to different replicates
of the US OJ market (?)
 We can’t measure the average (across “subjects”)
outcome at different times – only one “subject”
 So we can’t estimate the causal effect at different
times using the differences estimator

23-6
An alternative thought experiment:
 Randomly give the same subject different treatments
(FDDt) at different times
 Measure the outcome variable (%ChgPt)
 The “population” of subjects consists of the same
subject (OJ market) but at different dates
 If the “different subjects” are drawn from the same
distribution – that is, if Yt,Xt are stationary – then the
dynamic causal effect can be deduced by OLS
regression of Yt on lagged values of Xt.
 This estimator (regression of Yt on Xt and lags of Xt) is
called the distributed lag estimator.

23-7
Dynamic causal effects and the distributed lag model
The distributed lag model is:
Yt = β0 + β1Xt + … + βr+1Xt–r + ut

 β1 = impact effect of change in X = effect of change in
Xt on Yt, holding past Xt constant
 β2 = 1-period dynamic multiplier = effect of change in
Xt–1 on Yt, holding constant Xt, Xt–2, Xt–3,…
 β3 = 2-period dynamic multiplier (etc.) = effect of
change in Xt–2 on Yt, holding constant Xt, Xt–1, Xt–3,…
 Cumulative dynamic multipliers
o Ex: the 2-period cumulative dynamic multiplier
= β1 + β2 + β3 (etc.)
23-8
Exogeneity in time series regression

Exogeneity (past and present)


X is exogenous if E(ut|Xt,Xt–1,Xt–2,…) = 0.

Strict Exogeneity (past, present, and future)


X is strictly exogenous if E(ut|…,Xt+1,Xt,Xt–1, …) = 0

 Strict exogeneity implies exogeneity


 For now we suppose that X is exogenous – we’ll return
(briefly) to the case of strict exogeneity later.
 If X is exogenous then OLS estimates the dynamic
causal effect on Y of a change in X. Specifically,…
23-9
Estimation of Dynamic Causal Effects with
Exogenous Regressors
(SW Section 13.3)

Yt = β0 + β1Xt + … + βr+1Xt–r + ut

The Distributed Lag Model Assumptions


1. E(ut|Xt,Xt–1,Xt–2,…) = 0 (X is exogenous)
2. (a) Y and X have stationary distributions;
(b) (Yt,Xt) and (Yt–j,Xt–j) become independent as j
gets large
3. Y and X have eight nonzero finite moments
4. There is no perfect multicollinearity.
23-10
 Assumptions 1 and 4 are familiar
 Assumption 3 is familiar, except for 8 (not four) finite
moments – this has to do with HAC estimators
 Assumption 2 is different – before it was (Xi,Yi) are
i.i.d. – this now becomes more complicated.

2. (a) Y and X have stationary distributions;


 If so, the coefficients don’t change within the
sample (internal validity);
 and the results can be extrapolated outside the
sample (external validity).
 This is the time series counterpart of the
“identically distributed” part of i.i.d.
23-11
2. (b) (Yt,Xt) and (Yt–j,Xt–j) become independent as j
gets large
 Intuitively, this says that we have separate
experiments for time periods that are widely
separated.
 In cross-sectional data, we assumed that Y and X
were i.i.d., a consequence of simple random
sampling – this led to the CLT.
 A version of the CLT holds for time series
variables that become independent as their
temporal separation increases – assumption 2(b)
is the time series counterpart of the
“independently distributed” part of i.i.d.
23-12
Under the Distributed Lag Model Assumptions:
 OLS yields consistent estimators of β1, β2,…, βr+1 (of
the dynamic multipliers)
 The sampling distribution of β̂1, etc., is normal
 However, the formula for the variance of this
sampling distribution is not the usual one from cross-
sectional (i.i.d.) data, because ut is not i.i.d. – it is
serially correlated.
 This means that the usual OLS standard errors (usual
STATA printout) are wrong
 We need to use, instead, SEs that are robust to
autocorrelation as well as to heteroskedasticity…

23-13
Heteroskedasticity and Autocorrelation-Consistent
(HAC) Standard Errors
(SW Section 13.4)

 When ut is serially correlated, the variance of the


sampling distribution of the OLS estimator is
different.
 Consequently, we need to use a different formula for
the standard errors.
 This is easy to do using STATA and most (but not all)
other statistical software.

23-14
The math…

Consider the case of no lags:

Yt = β0 + β1Xt + ut

“Recall” that the OLS estimator is:

β̂1 = [ (1/T) Σt=1 to T (Xt – X̄)(Yt – Ȳ) ] / [ (1/T) Σt=1 to T (Xt – X̄)² ]

so…

23-15
β̂1 – β1 = [ (1/T) Σt=1 to T (Xt – X̄)ut ] / [ (1/T) Σt=1 to T (Xt – X̄)² ]     (this is SW App. 4.3)

so

β̂1 – β1 ≅ [ (1/T) Σt=1 to T vt ] / σX²     in large samples

where vt = (Xt – X̄)ut     (this is still SW App. 4.3)


23-16
β̂1 – β1 ≅ [ (1/T) Σt=1 to T vt ] / σX²     in large samples

so, in large samples,

var(β̂1) = var[ (1/T) Σt=1 to T vt ] / (σX²)²     (still SW App. 4.3)

What happens with time series data? Consider T = 2:

var[ (1/T) Σt=1 to T vt ] = var[½(v1+v2)]
     = ¼[var(v1) + var(v2) + 2cov(v1,v2)]

23-17
so
var[ ½ Σt=1 to 2 vt ] = ¼[var(v1) + var(v2) + 2cov(v1,v2)]
     = ½σv² + ½ρ1σv²          (ρ1 = corr(v1,v2))
     = ½σv² × f2, where f2 = (1+ρ1)

 In i.i.d. (cross-section) data, ρ1 = 0 so f2 = 1 – which
gives the usual formula for var(β̂1).
 In time series data, ρ1 ≠ 0, so var(β̂1) is not given by
the usual formula
 Conventional OLS SE’s are wrong when ut is serially
correlated (STATA printout is wrong).
Expression for var(β̂1), general T
23-18
var[ (1/T) Σt=1 to T vt ] = (σv²/T) fT
so
var(β̂1) = (1/T) [σv²/(σX²)²] fT
where
fT = 1 + 2 Σj=1 to T–1 ((T–j)/T) ρj

The OLS SEs are off by the factor √fT (which can be big!)

23-19
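To get a feel for how large this factor can be, suppose the vt had autocorrelations ρj = 0.5^j (an illustrative assumption, not an estimate from any particular data set). Then for large T, fT ≈ 1 + 2(.5 + .25 + …) = 3, so the conventional SEs would be too small by a factor of about √3:

. display sqrt(1 + 2*(.5/(1 - .5)));          about 1.7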
HAC Standard Errors

 Conventional OLS SEs (heteroskedasticity-robust or


not) are wrong when there is autocorrelation
 So, we need a new formula that produces SEs that are
robust to autocorrelation as well as heteroskedasticity
We need Heteroskedasticity and Autocorrelation-
Consistent (HAC) standard errors
 If we knew the factor fT, we could just make the
adjustment.
 But we don’t know fT – it depends on unknown
autocorrelations.
 HAC SEs replace fT with an estimator of fT
23-20
HAC SEs, ctd.
 1  2
 T 1
T  j
ˆ
var( 1 ) =  v
2 2
fT , where fT = 1  2  j
 T ( X )  j 1  T 
The most commonly used estimator of fT is:
ˆf = 1  2  m  j  
m 1

T  
j 1 
 j
m 

 fˆt sometimes called “Newey-West” weights


  j is an estimator of j
 m is called the truncation parameter
 What truncation parameter to use in practice?
o Use the Goldilocks method
o Or, try m = 0.75T^(1/3)
23-21
Example: OJ and HAC estimators in STATA

. gen l1fdd = L1.fdd; generate lag #1


. gen l2fdd = L2.fdd; generate lag #2
. gen l3fdd = L3.fdd; .
. gen l4fdd = L4.fdd; .
. gen l5fdd = L5.fdd; .
. gen l6fdd = L6.fdd;

. reg dlpoj l1fdd if tin(1950m1,2000m12), r; NOT HAC SEs

Regression with robust standard errors Number of obs = 612


F( 1, 610) = 3.97
Prob > F = 0.0467
R-squared = 0.0101
Root MSE = 5.0438

------------------------------------------------------------------------------
| Robust
dlpoj | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
l1fdd | .1529217 .0767206 1.99 0.047 .0022532 .3035903
_cons | -.2097734 .2071122 -1.01 0.312 -.6165128 .196966
------------------------------------------------------------------------------

23-22
Example: OJ and HAC estimators in STATA, ctd

Now compute Newey-West SEs:

. newey dlpoj l1fdd if tin(1950m1,2000m12), lag(8);

Regression with Newey-West standard errors Number of obs = 612


maximum lag : 8 F( 1, 610) = 3.83
Prob > F = 0.0507

------------------------------------------------------------------------------
| Newey-West
dlpoj | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
l1fdd | .1529217 .0781195 1.96 0.051 -.000494 .3063375
_cons | -.2097734 .2402217 -0.87 0.383 -.6815353 .2619885
------------------------------------------------------------------------------

Uses autocorrelations up to m=8 to compute the SEs


 rule-of-thumb: 0.75×612^(1/3) = 6.4, so m = 8 rounds this up a little.

OK, in this case the difference is small, but not always so!

23-23
Example: OJ and HAC estimators in STATA, ctd.
. global lfdd6 "fdd l1fdd l2fdd l3fdd l4fdd l5fdd l6fdd";

. newey dlpoj $lfdd6 if tin(1950m1,2000m12), lag(7);

Regression with Newey-West standard errors Number of obs = 612


maximum lag : 7 F( 7, 604) = 3.56
Prob > F = 0.0009

------------------------------------------------------------------------------
| Newey-West
dlpoj | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
fdd | .4693121 .1359686 3.45 0.001 .2022834 .7363407
l1fdd | .1430512 .0837047 1.71 0.088 -.0213364 .3074388
l2fdd | .0564234 .0561724 1.00 0.316 -.0538936 .1667404
l3fdd | .0722595 .0468776 1.54 0.124 -.0198033 .1643223
l4fdd | .0343244 .0295141 1.16 0.245 -.0236383 .0922871
l5fdd | .0468222 .0308791 1.52 0.130 -.0138212 .1074657
l6fdd | .0481115 .0446404 1.08 0.282 -.0395577 .1357807
_cons | -.6505183 .2336986 -2.78 0.006 -1.109479 -.1915578
------------------------------------------------------------------------------

 global lfdd6 defines a string which is all the additional lags


 What are the estimated dynamic multipliers (dynamic effects)?

23-24
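With this regression still in memory, the cumulative dynamic multipliers – and standard errors for them based on the Newey-West variance estimates – can be computed with lincom (a sketch using the lag variables defined above):

. lincom fdd + l1fdd;                          1-period cumulative dynamic multiplier
. lincom fdd + l1fdd + l2fdd;                  2-period cumulative dynamic multiplier
. lincom fdd + l1fdd + l2fdd + l3fdd + l4fdd + l5fdd + l6fdd;   cumulative effect over all six included lags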
Do I need to use HAC SEs when I estimate an AR or
an ADL model?

NO.
 The problem to which HAC SEs are the solution arises
when ut is serially correlated
 If ut is serially uncorrelated, then OLS SE’s are fine
 In AR and ADL models, the errors are serially
uncorrelated if you have included enough lags of Y
o If you include enough lags of Y, then the error term
can’t be predicted using past Y, or equivalently by
past u – so u is serially uncorrelated

23-25
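A quick way to check whether enough lags have been included is to look at the autocorrelations of the residuals (a sketch for the AR(4) of ΔInf; uhat is a hypothetical name):

. reg dinf L(1/4).dinf if tin(1962q1,1999q4), r;
. predict uhat if e(sample), resid;
. corrgram uhat, noplot lags(8);               small autocorrelations: OLS SEs are fine; large ones: add lags or use HAC SEs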
Estimation of Dynamic Causal Effects with Strictly
Exogenous Regressors
(SW Section 13.5)
 X is strictly exogenous if E(ut|…,Xt+1,Xt,Xt–1, …) = 0
 If X is strictly exogenous, there are more efficient
ways to estimate dynamic causal effects than by a
distributed lag regression.
o Generalized Least Squares (GLS)
o Autoregressive Distributed Lag (ADL)
 But the condition of strict exogeneity is very strong,
so this condition is rarely plausible in practice.
 So we won’t cover GLS or ADL estimation of
dynamic causal effects (Section 13.5 is optional)
23-26
Analysis of the OJ Price Data
(SW Section 13.6)

What is the dynamic causal effect (what are the dynamic


multipliers) of a unit increase in FDD on OJ prices?

%ChgPt = β0 + β1FDDt + … + βr+1FDDt–r + ut

 What r to use?
How about 18? (Goldilocks method)
 What m (Newey-West truncation parameter) to use?
m = .75×612^(1/3) = 6.4 ≈ 7

23-27
23-28
23-29
23-30
23-31
These dynamic multipliers were estimated using a
distributed lag model. Should we attempt to obtain more
efficient estimates using GLS or an ADL model?

 Is FDD strictly exogenous in the distributed lag


regression?

%ChgPt = β0 + β1FDDt + … + βr+1FDDt–r + ut

 OJ commodity traders can’t change the weather.


 So this implies that corr(ut,FDDt+1) = 0, right?

23-32
When Can You Estimate Dynamic Causal Effects?
That is, When is Exogeneity Plausible?
(SW Section 13.7)

In the following examples,


 is X exogenous?
 is X strictly exogenous?

Examples:
1. Y = OJ prices, X = FDD in Orlando
2. Y = Australian exports, X = US GDP (effect of US
income on demand for Australian exports)

23-33
Examples, ctd.

3. Y = EU exports, X = US GDP (effect of US income on


demand for EU exports)
4. Y = US rate of inflation, X = percentage change in
world oil prices (as set by OPEC) (effect of OPEC oil
price increase on inflation)
5. Y = GDP growth, X =Federal Funds rate (the effect of
monetary policy on output growth)
6. Y = change in the rate of inflation, X = unemployment
rate (the Phillips curve)

23-34
Exogeneity, ctd.

 You must evaluate exogeneity and strict exogeneity


on a case by case basis
 Exogeneity is often not plausible in time series data
because of simultaneous causality
 Strict exogeneity is rarely plausible in time series data
because of feedback.

23-35
Estimation of Dynamic Causal Effects: Summary
(SW Section 13.8)

 Dynamic causal effects are measurable in theory using


a randomized controlled experiment with repeated
measurements over time.
 When X is exogenous, you can estimate dynamic causal
effects using a distributed lag regression
 If u is serially correlated, conventional OLS SEs are
incorrect; you must use HAC SEs
 To decide whether X is exogenous, think hard!

23-36
