Professional Documents
Culture Documents
1/2/3-1
Brief Overview of the Course
1/2/3-2
This course is about using data to measure causal effects.
Ideally, we would like an experiment
o what would be an experiment to estimate the effect of
class size on standardized test scores?
But almost always we only have observational
(nonexperimental) data.
o returns to education
o cigarette prices
o monetary policy
Most of the course deals with difficulties arising from using
observational data to estimate causal effects
o confounding effects (omitted factors)
o simultaneous causality
o “correlation does not imply causation”
1/2/3-3
In this course you will:
1/2/3-4
Review of Probability and Statistics
(SW Chapters 2, 3)
1/2/3-5
The California Test Score Data Set
Variables:
5th grade test scores (Stanford-9 achievement test,
combined math and reading), district average
Student-teacher ratio (STR) = no. of students in the
district divided by no. full-time equivalent teachers
1/2/3-6
Initial look at the data:
(You should already know how to interpret this table)
1/2/3-7
Do districts with smaller classes have higher test scores?
Scatterplot of test score v. student-teacher ratio
1/2/3-10
1. Estimation
Ȳsmall – Ȳlarge = (1/nsmall) Σ_{i=1}^{nsmall} Yi – (1/nlarge) Σ_{i=1}^{nlarge} Yi
= 657.4 – 650.0
= 7.4
1/2/3-11
2. Hypothesis testing
t = (Ȳs – Ȳl) / √(ss²/ns + sl²/nl) = (Ȳs – Ȳl) / SE(Ȳs – Ȳl)   (remember this?)
1/2/3-12
Compute the difference-of-means t-statistic:
Size     Ȳ       sY      n
small    657.4   19.4    238
large    650.0   17.9    182

t = (657.4 – 650.0) / √(19.4²/238 + 17.9²/182) = 7.4/1.83 = 4.05
|t| > 1.96, so reject (at the 5% significance level) the null
hypothesis that the two means are the same.
1/2/3-13
3. Confidence interval
(Ȳs – Ȳl) ± 1.96·SE(Ȳs – Ȳl)
= 7.4 ± 1.96×1.83 = (3.8, 11.0)
Two equivalent statements:
1. The 95% confidence interval for μs – μl doesn’t include 0;
2. The hypothesis that μs – μl = 0 is rejected at the 5% level.
1/2/3-14
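As a concreteness check, here is a short Python sketch (added here; it is not part of the original slides) that reproduces the t-statistic and confidence interval from the small/large summary statistics above:

import math

# Summary statistics from the table above
y_small, s_small, n_small = 657.4, 19.4, 238
y_large, s_large, n_large = 650.0, 17.9, 182

diff = y_small - y_large                                   # 7.4
se = math.sqrt(s_small**2/n_small + s_large**2/n_large)    # about 1.83
t = diff / se                                              # about 4.05; |t| > 1.96, reject H0
ci = (diff - 1.96*se, diff + 1.96*se)                      # about (3.8, 11.0)
print(t, ci)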
What comes next…
1/2/3-15
Review of Statistical Theory
Population
The group or collection of all possible entities of interest
(school districts)
We will think of populations as infinitely large (∞ is an
approximation to “very big”)
Random variable Y
Numerical summary of a random outcome (district
average test score, district STR)
1/2/3-17
Population distribution of Y
1/2/3-18
(b) Moments of a population distribution: mean, variance,
standard deviation, covariance, correlation
1/2/3-19
Moments, ctd.

skewness = E[(Y – μY)³] / σY³
= measure of asymmetry of a distribution
 skewness = 0: distribution is symmetric
 skewness > (<) 0: distribution has long right (left) tail

kurtosis = E[(Y – μY)⁴] / σY⁴
= measure of mass in tails
= measure of probability of large values
 kurtosis = 3: normal distribution
 kurtosis > 3: heavy tails (“leptokurtotic”)
1/2/3-20
1/2/3-21
2 random variables: joint distributions and covariance
so is the correlation…
1/2/3-24
The correlation coefficient is defined in terms of the
covariance:

corr(X,Z) = cov(X,Z) / √(var(X)·var(Z)) = σXZ/(σXσZ) = rXZ

 –1 ≤ corr(X,Z) ≤ 1
 corr(X,Z) = 1 means perfect positive linear association
 corr(X,Z) = –1 means perfect negative linear association
 corr(X,Z) = 0 means no linear association
1/2/3-25
The correlation coefficient measures linear association
1/2/3-26
(c) Conditional distributions and conditional means
Conditional distributions
The distribution of Y, given value(s) of some other
random variable, X
Ex: the distribution of test scores, given that STR < 20
Conditional expectations and conditional moments
conditional mean = mean of conditional distribution
= E(Y|X = x) (important concept and notation)
conditional variance = variance of conditional distribution
Example: E(Test scores|STR < 20) = the mean of test
scores among districts with small class sizes
The difference in means is the difference between the means
of two conditional distributions:
1/2/3-27
Conditional mean, ctd.
1/2/3-28
(d) Distribution of a sample of data drawn randomly
from a population: Y1,…, Yn
1/2/3-29
Distribution of Y1,…, Yn under simple random
sampling
Because individuals #1 and #2 are selected at random, the
value of Y1 has no information content for Y2. Thus:
o Y1 and Y2 are independently distributed
o Y1 and Y2 come from the same distribution, that
is, Y1, Y2 are identically distributed
o That is, under simple random sampling, Y1 and Y2
are independently and identically distributed (i.i.d.).
o More generally, under simple random sampling,
{Yi}, i = 1,…, n, are i.i.d.
1/2/3-30
This framework allows rigorous statistical inferences about
moments of population distributions using a sample of data
from that population …
1. The probability framework for statistical inference
2. Estimation
3. Testing
4. Confidence Intervals
Estimation
Ȳ is the natural estimator of the mean. But:
(a) What are the properties of Ȳ?
(b) Why should we use Ȳ rather than some other estimator?
 Y1 (the first observation)
 maybe unequal weights – not simple average
1/2/3-31
 median(Y1,…, Yn)
The starting point is the sampling distribution of Ȳ…
1/2/3-32
(a) The sampling distribution of Ȳ
Ȳ is a random variable, and its properties are determined by
the sampling distribution of Ȳ
 The individuals in the sample are drawn at random.
 Thus the values of (Y1,…, Yn) are random
 Thus functions of (Y1,…, Yn), such as Ȳ, are random:
had a different sample been drawn, they would have taken
on a different value
 The distribution of Ȳ over different possible samples of
size n is called the sampling distribution of Ȳ.
 The mean and variance of Ȳ are the mean and variance of
its sampling distribution, E(Ȳ) and var(Ȳ).
 The concept of the sampling distribution underpins all of
econometrics.
1/2/3-33
The sampling distribution of Ȳ, ctd.
Example: Suppose Y takes on 0 or 1 (a Bernoulli random
variable) with the probability distribution,
Pr[Y = 0] = .22, Pr(Y = 1) = .78
Then
E(Y) = p×1 + (1 – p)×0 = p = .78
σY² = E[Y – E(Y)]² = p(1 – p) [remember this?]
= .78×(1 – .78) = 0.1716
The sampling distribution of Ȳ depends on n.
Consider n = 2. The sampling distribution of Ȳ is,
Pr(Ȳ = 0) = .22² = .0484
Pr(Ȳ = ½) = 2×.22×.78 = .3432
Pr(Ȳ = 1) = .78² = .6084
1/2/3-34
The sampling distribution of Ȳ when Y is Bernoulli (p = .78):
1/2/3-35
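A quick numerical sketch (added illustration, not from the slides) that verifies this n = 2 sampling distribution by simulation:

import numpy as np

rng = np.random.default_rng(0)
p, n, reps = 0.78, 2, 1_000_000

# Draw many samples of size n and compute each sample mean
ybar = rng.binomial(1, p, size=(reps, n)).mean(axis=1)

# Empirical frequencies should be close to .0484, .3432, .6084
for value in (0.0, 0.5, 1.0):
    print(value, np.mean(ybar == value))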
Things we want to know about the sampling distribution:
1/2/3-36
The mean and variance of the sampling distribution of Ȳ
General case – that is, for Yi i.i.d. from any distribution, not
just Bernoulli:

mean: E(Ȳ) = E((1/n) Σ_{i=1}^n Yi) = (1/n) Σ_{i=1}^n E(Yi) = (1/n) Σ_{i=1}^n μY = μY

variance: var(Ȳ) = σY²/n
1/2/3-38
Mean and variance of sampling distribution of Ȳ, ctd.
E(Ȳ) = μY
var(Ȳ) = σY²/n
Implications:
1. Ȳ is an unbiased estimator of μY (that is, E(Ȳ) = μY)
2. var(Ȳ) is inversely proportional to n
 the spread of the sampling distribution is
proportional to 1/√n
 Thus the sampling uncertainty associated with Ȳ is
proportional to 1/√n (larger samples, less
uncertainty, but square-root law)
1/2/3-39
The sampling distribution of Ȳ when n is large
1/2/3-40
The Law of Large Numbers:
An estimator is consistent if the probability that it falls
within an interval of the true population value tends to one
as the sample size increases.
If (Y1,…,Yn) are i.i.d. and σY² < ∞, then Ȳ is a consistent
estimator of μY, that is,
Pr[|Ȳ – μY| < ε] → 1 as n → ∞
which can be written, Ȳ →p μY
(“Ȳ →p μY” means “Ȳ converges in probability to μY”).
(the math: as n → ∞, var(Ȳ) = σY²/n → 0, which implies that
Pr[|Ȳ – μY| < ε] → 1.)
1/2/3-41
The Central Limit Theorem (CLT):
If (Y1,…,Yn) are i.i.d. and 0 < σY² < ∞, then when n is large
the distribution of Ȳ is well approximated by a normal
distribution.
 Ȳ is approximately distributed N(μY, σY²/n) (“normal
distribution with mean μY and variance σY²/n”)
 √n(Ȳ – μY)/σY is approximately distributed N(0,1)
(standard normal)
 That is, “standardized” Ȳ = (Ȳ – E(Ȳ))/√var(Ȳ) = (Ȳ – μY)/(σY/√n)
is approximately distributed as N(0,1)
 The larger is n, the better is the approximation.
Sampling distribution of Ȳ when Y is Bernoulli, p = 0.78:
1/2/3-42
1/2/3-43
Same example: sampling distribution of (Ȳ – E(Ȳ))/√var(Ȳ):
1/2/3-44
Summary: The Sampling Distribution of Ȳ
For Y1,…,Yn i.i.d. with 0 < σY² < ∞,
 The exact (finite sample) sampling distribution of Ȳ has
mean μY (“Ȳ is an unbiased estimator of μY”) and variance
σY²/n
 Other than its mean and variance, the exact distribution of
Ȳ is complicated and depends on the distribution of Y (the
population distribution)
 When n is large, the sampling distribution simplifies:
o Ȳ →p μY (Law of large numbers)
o (Ȳ – E(Ȳ))/√var(Ȳ) is approximately N(0,1) (CLT)
1/2/3-45
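Both large-n results can be seen directly by simulation. This Python sketch (an added illustration; the skewed exponential population is an arbitrary choice) standardizes Ȳ over many repeated samples:

import numpy as np

rng = np.random.default_rng(1)
reps = 10_000
pop_mean, pop_var = 1.0, 1.0        # exponential(1): mean 1, variance 1

for n in (10, 100, 1000):
    ybar = rng.exponential(1.0, size=(reps, n)).mean(axis=1)
    z = (ybar - pop_mean) / np.sqrt(pop_var / n)    # standardized Ybar
    # LLN: ybar concentrates near 1; CLT: z behaves like N(0,1)
    print(n, ybar.mean(), ybar.std(), np.mean(np.abs(z) <= 1.96))

The last column approaches .95 as n grows.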
(b) Why Use Ȳ To Estimate μY?
 Ȳ is unbiased: E(Ȳ) = μY
 Ȳ is consistent: Ȳ →p μY
 Ȳ is the “least squares” estimator of μY; Ȳ solves,

min_m Σ_{i=1}^n (Yi – m)²

To see this, take the derivative:
(d/dm) Σ_{i=1}^n (Yi – m)² = Σ_{i=1}^n (d/dm)(Yi – m)² = –2 Σ_{i=1}^n (Yi – m)
Setting this to zero gives Σ_{i=1}^n Yi = n·m, that is, m = Ȳ.
1/2/3-46
Why Use Y To Estimate Y, ctd.
1/2/3-47
1. The probability framework for statistical inference
2. Estimation
3. Hypothesis Testing
4. Confidence intervals
Hypothesis Testing
The hypothesis testing problem (for the mean): make a
provisional decision, based on the evidence at hand, whether
a null hypothesis is true, or instead that some alternative
hypothesis is true. That is, test
H0: E(Y) = μY,0 vs. H1: E(Y) > μY,0 (1-sided, >)
H0: E(Y) = μY,0 vs. H1: E(Y) < μY,0 (1-sided, <)
H0: E(Y) = μY,0 vs. H1: E(Y) ≠ μY,0 (2-sided)
1/2/3-48
Some terminology for testing statistical hypotheses:
1/2/3-51
Estimator of the variance of Y:

sY² = (1/(n–1)) Σ_{i=1}^n (Yi – Ȳ)² = “sample variance of Y”

Fact:
If (Y1,…,Yn) are i.i.d. and E(Y⁴) < ∞, then sY² →p σY²
1/2/3-52
Computing the p-value with σY² estimated:
1/2/3-53
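The computation itself is short. A Python sketch (added; the large-n normal approximation is used, and the inputs below are illustrative values, not from the slides):

import math

def p_value_two_sided(ybar, mu0, s_y, n):
    # t-statistic with estimated variance: t = (Ybar - mu0)/(s_y/sqrt(n))
    t = (ybar - mu0) / (s_y / math.sqrt(n))
    # Large-n approximation: p-value = 2*Phi(-|t|), Phi via the error function
    phi = 0.5 * (1 + math.erf(-abs(t) / math.sqrt(2)))
    return t, 2 * phi

print(p_value_two_sided(ybar=657.4, mu0=650.0, s_y=19.4, n=238))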
What is the link between the p-value and the significance
level?
1/2/3-54
At this point, you might be wondering,...
What happened to the t-table and the degrees of freedom?
1/2/3-55
Comments on this recipe and the Student t-distribution
1/2/3-58
1/2/3-59
Comments on Student t distribution, ctd.
4. You might not know this. Consider the t-statistic testing
the hypothesis that two means (groups s, l) are equal:
t = (Ȳs – Ȳl) / √(ss²/ns + sl²/nl) = (Ȳs – Ȳl) / SE(Ȳs – Ȳl)
1/2/3-61
1. The probability framework for statistical inference
2. Estimation
3. Testing
4. Confidence intervals
Confidence Intervals
A 95% confidence interval for μY is an interval that contains
the true value of μY in 95% of repeated samples.

{μY: |(Ȳ – μY)/(sY/√n)| ≤ 1.96} = {μY: –1.96 ≤ (Ȳ – μY)/(sY/√n) ≤ 1.96}
= {μY: –1.96·sY/√n ≤ Ȳ – μY ≤ 1.96·sY/√n}
= {μY ∈ (Ȳ – 1.96·sY/√n, Ȳ + 1.96·sY/√n)}

This confidence interval relies on the large-n results that Ȳ is
approximately normally distributed and sY² →p σY².
1/2/3-63
Summary:
From the two assumptions of:
(1) simple random sampling of a population, that is,
{Yi, i =1,…,n} are i.i.d.
(2) 0 < E(Y⁴) < ∞
we developed, for large samples (large n):
Theory of estimation (sampling distribution of Y )
Theory of hypothesis testing (large-n distribution of t-
statistic and computation of the p-value)
Theory of confidence intervals (constructed by inverting
test statistic)
Are assumptions (1) & (2) plausible in practice? Yes
1/2/3-64
Let’s go back to the original policy question:
What is the effect on test scores of reducing STR by one
student/class?
Have we answered this question?
1/2/3-65
Introduction to Linear Regression
(SW Chapter 4)
4-1
What do data say about class sizes and test scores?
Variables:
5th grade test scores (Stanford-9 achievement test,
combined math and reading), district average
Student-teacher ratio (STR) = no. of students in the
district divided by no. full-time equivalent teachers
4-2
An initial look at the California test score data:
4-3
Do districts with smaller classes (lower STR) have higher test
scores?
4-4
The class size/test score policy question:
What is the effect on test scores of reducing STR by
one student/class?
Object of policy interest: ΔTest score / ΔSTR
This is the slope of the line relating test score and STR
4-5
This suggests that we want to draw a line through the
Test Score v. STR scatterplot – but how?
4-6
Some Notation and Terminology
(Sections 4.1 and 4.2)
min_{b0,b1} Σ_{i=1}^n [Yi – (b0 + b1Xi)]²
4-8
The OLS estimator solves: min_{b0,b1} Σ_{i=1}^n [Yi – (b0 + b1Xi)]²
4-9
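As an illustration (an added sketch, not from the slides; the simulated data and names are hypothetical), the minimization has the familiar closed-form solution, which a few lines of numpy reproduce:

import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(20, 2, size=420)                      # e.g., student-teacher ratios
y = 699 - 2.28 * x + rng.normal(0, 10, size=420)

# OLS closed form: b1 = sum((Xi-Xbar)(Yi-Ybar)) / sum((Xi-Xbar)^2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
b0 = y.mean() - b1 * x.mean()
print(b0, b1)    # close to the true intercept and slope

# Same answer from the general least-squares solver
print(np.linalg.lstsq(np.column_stack([np.ones_like(x), x]), y, rcond=None)[0])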
Why use OLS, rather than some other estimator?
 OLS is a generalization of the sample average: if the
“line” is just an intercept (no X), then the OLS
estimator is just the sample average of Y1,…,Yn (Ȳ).
 Like Ȳ, the OLS estimator has some desirable
properties: under certain assumptions, it is unbiased
(that is, E(β̂1) = β1), and it has a tighter sampling
distribution than some other candidate estimators of
β1 (more on this later)
Importantly, this is what everyone uses – the common
“language” of linear regression.
4-10
4-11
Application to the California Test Score – Class Size data
-------------------------------------------------------------------------
| Robust
testscr | Coef. Std. Err. t P>|t| [95% Conf. Interval]
--------+----------------------------------------------------------------
str | -2.279808 .5194892 -4.39 0.000 -3.300945 -1.258671
_cons | 698.933 10.36436 67.44 0.000 678.5602 719.3057
-------------------------------------------------------------------------
Population
population of interest (ex: all possible school districts)
Random variables: Y, X
Ex: (Test Score, STR)
4-17
The Population Linear Regression Model (Section 4.3)
4-22
Example: Assumption #1 and the class size example
Test Scorei = β0 + β1STRi + ui, ui = other factors
“Other factors:”
parental involvement
outside learning opportunities (extra math class,..)
home environment conducive to reading
family income is a useful proxy for many such factors
4-24
Least squares assumption #3:
E(X⁴) < ∞ and E(u⁴) < ∞
4-25
1. The probability framework for linear regression
2. Estimation: the Sampling Distribution of β̂1
(Section 4.4)
3. Hypothesis Testing
4. Confidence intervals
4-26
The sampling distribution of β̂1: some algebra:
Yi = β0 + β1Xi + ui
Ȳ = β0 + β1X̄ + ū
so Yi – Ȳ = β1(Xi – X̄) + (ui – ū)
Thus,

β̂1 = Σ_{i=1}^n (Xi – X̄)(Yi – Ȳ) / Σ_{i=1}^n (Xi – X̄)²

 = Σ_{i=1}^n (Xi – X̄)[β1(Xi – X̄) + (ui – ū)] / Σ_{i=1}^n (Xi – X̄)²
4-27
 = β1·Σ_{i=1}^n (Xi – X̄)(Xi – X̄) / Σ_{i=1}^n (Xi – X̄)² + Σ_{i=1}^n (Xi – X̄)(ui – ū) / Σ_{i=1}^n (Xi – X̄)²
so

β̂1 – β1 = Σ_{i=1}^n (Xi – X̄)(ui – ū) / Σ_{i=1}^n (Xi – X̄)²
4-28
We can simplify this formula by noting that:

Σ_{i=1}^n (Xi – X̄)(ui – ū) = Σ_{i=1}^n (Xi – X̄)ui – ū·Σ_{i=1}^n (Xi – X̄) = Σ_{i=1}^n (Xi – X̄)ui

(since Σ_{i=1}^n (Xi – X̄) = 0). Thus

β̂1 – β1 = [(1/n) Σ_{i=1}^n vi] / [((n–1)/n)·sX²]
4-29

β̂1 – β1 = [(1/n) Σ_{i=1}^n vi] / [((n–1)/n)·sX²], where vi = (Xi – X̄)ui
4-30
Now E(vi/sX²) = E[(Xi – X̄)ui/sX²] = 0, so

E(β̂1 – β1) = E[ (1/n) Σ_{i=1}^n vi / (((n–1)/n)·sX²) ] = 0
so
E(β̂1) = β1
4-31
Calculation of the variance of β̂1:

β̂1 – β1 = [(1/n) Σ_{i=1}^n vi] / [((n–1)/n)·sX²]

so, for n large,

var(β̂1) = var(v) / (n·(σX²)²)
4-32
The exact sampling distribution is complicated, but when
the sample size is large we get some simple (and good)
approximations:

(1) Because var(β̂1) ∝ 1/n and E(β̂1) = β1, β̂1 →p β1
4-33
(2) When n is large:

β̂1 – β1 = [(1/n) Σ_{i=1}^n vi] / [((n–1)/n)·sX²]

 vi = (Xi – X̄)ui ≈ (Xi – μX)ui, which is i.i.d. (why?) and
has two moments, that is, var(vi) < ∞ (why?). Thus
 (1/n) Σ_{i=1}^n vi is distributed N(0, var(v)/n) when n is large
 so β̂1 – β1 is approximately distributed N(0, σv²/(n·(σX²)²)), that is,

β̂1 is approximately distributed N(β1, var[(Xi – μX)ui]/(n·σX⁴))
4-35
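A simulation sketch (an added illustration, not from the slides) of this large-n approximation: repeated samples give a roughly normal distribution of β̂1 centered at β1 with variance var[(Xi – μX)ui]/(n·σX⁴):

import numpy as np

rng = np.random.default_rng(3)
beta0, beta1, n, reps = 1.0, 2.0, 200, 20_000

b1_hats = np.empty(reps)
for r in range(reps):
    x = rng.normal(0, 1, n)
    u = rng.normal(0, 1, n)
    y = beta0 + beta1 * x + u
    b1_hats[r] = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)

# Theoretical large-n variance here: var[(X - muX)u]/(n * sigmaX^4) = 1/n
print(b1_hats.mean(), b1_hats.var(), 1 / n)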
Recall the summary of the sampling distribution of Ȳ:
For (Y1,…,Yn) i.i.d. with 0 < σY² < ∞,
 The exact (finite sample) sampling distribution of Ȳ
has mean μY (“Ȳ is an unbiased estimator of μY”) and
variance σY²/n
 Other than its mean and variance, the exact
distribution of Ȳ is complicated and depends on the
distribution of Y
 Ȳ →p μY (law of large numbers)
 (Ȳ – E(Ȳ))/√var(Ȳ) is approximately distributed N(0,1) (CLT)
4-36
Parallel conclusions hold for the OLS estimator β̂1:
4-42
Recall the expression for the variance of β̂1 (large n):

var(β̂1) = var[(Xi – μX)ui] / (n·(σX²)²) = σv² / (n·σX⁴)

The estimator replaces the unknown population quantities with sample counterparts:

σ̂²β̂1 = (1/n) × [estimator of σv²] / [estimator of σX²]²

 = (1/n) × [ (1/(n–2)) Σ_{i=1}^n (Xi – X̄)²·ûi² ] / [ (1/n) Σ_{i=1}^n (Xi – X̄)² ]²
4-43

σ̂²β̂1 = (1/n) × [ (1/(n–2)) Σ_{i=1}^n (Xi – X̄)²·ûi² ] / [ (1/n) Σ_{i=1}^n (Xi – X̄)² ]²
4-44
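A sketch (added here, not from the slides) of computing this heteroskedasticity-robust variance estimator directly from residuals; it assumes arrays x, y and OLS estimates b0, b1 like those in the earlier snippet:

import numpy as np

def robust_var_b1(x, y, b0, b1):
    # Heteroskedasticity-robust estimator of var(beta1_hat), as on the slide
    n = len(x)
    u_hat = y - (b0 + b1 * x)                     # OLS residuals
    xdev = x - x.mean()
    num = np.sum(xdev**2 * u_hat**2) / (n - 2)    # estimator of var(v)
    den = (np.sum(xdev**2) / n) ** 2              # (estimator of var(X))^2
    return num / den / n

# SE(beta1_hat) is the square root of robust_var_b1(x, y, b0, b1)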
Return to calculation of the t-statistic:
4-45
Example: Test Scores and STR, California data
4-48
Example: Test Scores and STR, California data
Estimated regression line: TestScore = 698.9 – 2.28STR
4-50
OLS regression: STATA output
4-52
Yi = β0 + β1Xi + ui, where X is binary (Xi = 0 or 1):
When Xi = 0: Yi = β0 + ui
When Xi = 1: Yi = β0 + β1 + ui
thus:
When Xi = 0, the mean of Yi is β0
When Xi = 1, the mean of Yi is β0 + β1
that is:
E(Yi|Xi=0) = β0
E(Yi|Xi=1) = β0 + β1
so:
β1 = E(Yi|Xi=1) – E(Yi|Xi=0)
= population difference in group means
4-53
Example: TestScore and STR, California data
Let
Di = 1 if STRi < 20
  = 0 if STRi ≥ 20
4-55
Summary: regression when Xi is binary (0/1)
Yi = 0 + 1Xi + ui
4-57
The R²
Write Yi as the sum of the OLS prediction + OLS
residual:
Yi = Ŷi + ûi

R² = ESS/TSS,

where ESS = Σ_{i=1}^n (Ŷi – Ŷ̄)² and TSS = Σ_{i=1}^n (Yi – Ȳ)².
4-58
R² = ESS/TSS, where ESS = Σ_{i=1}^n (Ŷi – Ŷ̄)² and TSS = Σ_{i=1}^n (Yi – Ȳ)²

The R²:
 R² = 0 means ESS = 0, so X explains none of the
variation of Y
 R² = 1 means ESS = TSS, so Yi = Ŷi and X explains all of
the variation of Y
 0 ≤ R² ≤ 1
 For regression with a single regressor (the case here),
R² is the square of the correlation coefficient between
X and Y
4-59
The Standard Error of the Regression (SER)

SER = √[ (1/(n–2)) Σ_{i=1}^n (ûi – û̄)² ] = √[ (1/(n–2)) Σ_{i=1}^n ûi² ]

(the second equality holds because û̄ = (1/n) Σ_{i=1}^n ûi = 0).
4-60
SER = √[ (1/(n–2)) Σ_{i=1}^n ûi² ]

The SER:
 has the units of u, which are the units of Y
 measures the spread of the distribution of u
 measures the average “size” of the OLS residual (the
average “mistake” made by the OLS regression line)
 The root mean squared error (RMSE) is closely
related to the SER:

RMSE = √[ (1/n) Σ_{i=1}^n ûi² ]
[Figure: scatterplot of average hourly earnings (0–60) against years of education (5–20), with the OLS regression line]
4-67
Is heteroskedasticity present in the class size data?
4-68
So far we have (without saying so) assumed that u might be
heteroskedastic: the large-n variance

var(β̂1) = var[(Xi – μX)ui] / (n·(σX²)²)

and its estimator

σ̂²β̂1 = (1/n) × [ (1/(n–2)) Σ_{i=1}^n (Xi – X̄)²·ûi² ] / [ (1/n) Σ_{i=1}^n (Xi – X̄)² ]²

are heteroskedasticity-robust.
Note: var(β̂1) is inversely proportional to var(X).
4-74
Summary and Assessment (Section 4.10)
The initial policy question:
Suppose new teachers are hired so the student-
teacher ratio falls by one student per class. What
is the effect of this policy intervention (this
“treatment”) on test scores?
Does our regression analysis give a convincing answer?
Not really – districts with low STR tend to be ones
with lots of other resources and higher income
families, which provide kids with more learning
opportunities outside school…this suggests that
corr(ui,STRi) ≠ 0, so E(ui|Xi) ≠ 0.
4-75
Digression on Causality
5-1
Omitted Variable Bias
(SW Section 5.1)
5-2
In the test score example:
1. English language ability (whether the student has
English as a second language) plausibly affects
standardized test scores: Z is a determinant of Y.
2. Immigrant communities tend to be less affluent and
thus have smaller school budgets – and higher STR:
Z is correlated with X.
5-3
A formula for omitted variable bias: recall the equation,

β̂1 – β1 = Σ_{i=1}^n (Xi – X̄)ui / Σ_{i=1}^n (Xi – X̄)² = [(1/n) Σ_{i=1}^n vi] / [((n–1)/n)·sX²]

where vi = (Xi – X̄)ui.
5-4
Then

β̂1 – β1 = Σ_{i=1}^n (Xi – X̄)ui / Σ_{i=1}^n (Xi – X̄)²
so

E(β̂1) – β1 = E[ Σ_{i=1}^n (Xi – X̄)ui / Σ_{i=1}^n (Xi – X̄)² ] ≈ σXu/σX² = ρXu·(σu/σX)

(where ρXu = corr(X,u) = σXu/(σXσu)).
5-5
Omitted variable bias formula: β̂1 →p β1 + ρXu·(σu/σX).
If an omitted factor Z is both:
(1) a determinant of Y (that is, it is contained in u); and
(2) correlated with X,
then ρXu ≠ 0 and the OLS estimator β̂1 is biased.
The math makes precise the idea that districts with few
ESL students (1) do better on standardized tests and (2)
have smaller classes (bigger budgets), so ignoring the
ESL factor results in overstating the class size effect.
Is this actually going on in the CA data?
5-6
Districts with fewer English Learners have higher test scores
Districts with lower percent EL (PctEL) have smaller classes
Among districts with comparable PctEL, the effect of class
size is small (recall overall “test score gap” = 7.4)
5-7
Three ways to overcome omitted variable bias
1. Run a randomized controlled experiment in which
treatment (STR) is randomly assigned: then PctEL is
still a determinant of TestScore, but PctEL is
uncorrelated with STR. (But this is unrealistic in
practice.)
2. Adopt the “cross tabulation” approach, with finer
gradations of STR and PctEL (But soon we will run
out of data, and what about other determinants like
family income and parental education?)
3. Use a method in which the omitted variable (PctEL) is
no longer omitted: include PctEL as an additional
regressor in a multiple regression.
5-8
The Population Multiple Regression Model
(SW Section 5.2)
Y = β0 + β1X1 + β2X2
After changing X1 by ΔX1, holding X2 constant:
Y + ΔY = β0 + β1(X1 + ΔX1) + β2X2
Difference: ΔY = β1ΔX1
That is,
β1 = ΔY/ΔX1, holding X2 constant
also,
β2 = ΔY/ΔX2, holding X1 constant
and
β0 = predicted value of Y when X1 = X2 = 0.
5-11
The OLS Estimator in Multiple Regression
(SW Section 5.3)
min_{b0,b1,b2} Σ_{i=1}^n [Yi – (b0 + b1X1i + b2X2i)]²
------------------------------------------------------------------------------
| Robust
testscr | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
str | -1.101296 .4328472 -2.54 0.011 -1.95213 -.2504616
pctel | -.6497768 .0310318 -20.94 0.000 -.710775 -.5887786
_cons | 686.0322 8.728224 78.60 0.000 668.8754 703.189
------------------------------------------------------------------------------
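A sketch (added here for concreteness; the simulated data are hypothetical, not the California file) of how the multiple-regression OLS estimator is computed as a least squares problem:

import numpy as np

rng = np.random.default_rng(4)
n = 420
str_ = rng.normal(20, 2, n)
pctel = rng.uniform(0, 50, n)
testscr = 686.0 - 1.1 * str_ - 0.65 * pctel + rng.normal(0, 10, n)

# Stack regressors with an intercept column and solve the least squares problem
X = np.column_stack([np.ones(n), str_, pctel])
beta_hat = np.linalg.lstsq(X, testscr, rcond=None)[0]
print(beta_hat)    # approximately [686, -1.1, -0.65]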
5-15
Assumption #1: the conditional mean of u given the
included X’s is zero.
5-16
Assumption #2: (X1i,…,Xki,Yi), i =1,…,n, are i.i.d.
This is satisfied automatically if the data are collected
by simple random sampling.
5-17
Assumption #4: There is no perfect multicollinearity
Perfect multicollinearity is when one of the regressors is
an exact linear function of the other regressors.
5-21
Example: The California class size data
(1) TestScore = 698.9 – 2.28STR
(10.4) (0.52)
(2) TestScore = 696.0 – 1.10STR – 0.650PctEL
(8.7) (0.43) (0.031)
5-22
Tests of Joint Hypotheses
(SW Section 5.7)
H0: β1 = 0 and β2 = 0
vs. H1: either β1 ≠ 0 or β2 ≠ 0 or both
5-23
TestScorei = β0 + β1STRi + β2Expni + β3PctELi + ui
H0: β1 = 0 and β2 = 0
vs. H1: either β1 ≠ 0 or β2 ≠ 0 or both
5-26
The size of a test is the actual rejection rate under the null
hypothesis.
Two Solutions:
 Use a different critical value in this procedure – not
1.96 (this is the “Bonferroni method” – see App. 5.3)
 Use a different test statistic that tests both β1 and β2 at
once: the F-statistic.
5-27
The F-statistic
The F-statistic tests all parts of a joint hypothesis at once.
For the special case of two restrictions:

F = (1/2) × (t1² + t2² – 2ρ̂t1,t2·t1t2) / (1 – ρ̂²t1,t2)

where ρ̂t1,t2 estimates the correlation between the t-statistics t1 and t2.
5-29
Large-sample distribution of the F-statistic
Consider the special case that t1 and t2 are independent, so
ρ̂t1,t2 →p 0; in large samples the formula becomes

F = (1/2) × (t1² + t2² – 2ρ̂t1,t2·t1t2) / (1 – ρ̂²t1,t2) ≈ (1/2)(t1² + t2²)
Implementation in STATA
Use the “test” command after the regression
5-32
F-test example, California class size data:
reg testscr str expn_stu pctel, r;
------------------------------------------------------------------------------
| Robust
testscr | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
str | -.2863992 .4820728 -0.59 0.553 -1.234001 .661203
expn_stu | .0038679 .0015807 2.45 0.015 .0007607 .0069751
pctel | -.6560227 .0317844 -20.64 0.000 -.7185008 -.5935446
_cons | 649.5779 15.45834 42.02 0.000 619.1917 679.9641
------------------------------------------------------------------------------
NOTE

F = [ (R²unrestricted – R²restricted) / q ] / [ (1 – R²unrestricted) / (n – kunrestricted – 1) ]

where:
 R²restricted = the R² for the restricted regression
 R²unrestricted = the R² for the unrestricted regression
 q = the number of restrictions under the null
 kunrestricted = the number of regressors in the
unrestricted regression.
5-36
Example:
Restricted regression:
TestScore = 644.7 – 0.671PctEL, R²restricted = 0.4149
      (1.0)  (0.032)
Unrestricted regression:
TestScore = 649.6 – 0.29STR + 3.87Expn – 0.656PctEL
      (15.5) (0.48)  (1.59)  (0.032)
R²unrestricted = 0.4366, kunrestricted = 3, q = 2
so:

F = [(.4366 – .4149)/2] / [(1 – .4366)/(420 – 3 – 1)] = 8.01
5-37
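The arithmetic, as a short sketch (added; the numbers are from the example above):

r2_u, r2_r, q, n, k = 0.4366, 0.4149, 2, 420, 3
f_stat = ((r2_u - r2_r) / q) / ((1 - r2_u) / (n - k - 1))
print(f_stat)    # about 8.01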
The homoskedasticity-only F-statistic

F = [ (R²unrestricted – R²restricted)/q ] / [ (1 – R²unrestricted)/(n – kunrestricted – 1) ]

If:
1. u1,…,un are normally distributed; and
2. Xi is distributed independently of ui (so in
particular ui is homoskedastic)
then the homoskedasticity-only F-statistic has the exact
Fq,n–k–1 sampling distribution.
5-39
The Fq,n–k–1 distribution:
 The F distribution is tabulated many places
 When n gets large the Fq,n–k–1 distribution asymptotes
to the χ²q/q distribution:
Fq,∞ is another name for χ²q/q
 For q not too big and n ≥ 100, the Fq,n–k–1 distribution
and the χ²q/q distribution are essentially identical.
 Many regression packages compute p-values of F-
statistics using the F distribution (which is OK if the
sample size is ≥ 100)
 You will encounter the “F-distribution” in published
empirical work.
5-40
Digression: A little history of statistics…
The theory of the homoskedasticity-only F-statistic
and the Fq,n–k–1 distributions rests on implausibly
strong assumptions (are earnings normally
distributed?)
These statistics date to the early 20th century, when
“computer” was a job description and observations
numbered in the dozens.
The F-statistic and Fq,n–k–1 distribution were major
breakthroughs: an easily computed formula; a single
set of tables that could be published once, then
applied in many settings; and a precise,
mathematically elegant justification.
5-41
A little history of statistics, ctd…
The strong assumptions seemed a minor price for this
breakthrough.
But with modern computers and large samples we can
use the heteroskedasticity-robust F-statistic and the
Fq, distribution, which only require the four least
squares assumptions.
This historical legacy persists in modern software, in
which homoskedasticity-only standard errors (and F-
statistics) are the default, and in which p-values are
computed using the Fq,n–k–1 distribution.
5-42
Summary: the homoskedasticity-only (“rule of
thumb”) F-statistic and the F distribution
These are justified only under very strong conditions
– stronger than are realistic in practice.
Yet, they are widely used.
You should use the heteroskedasticity-robust F-
statistic, with χ²q/q (that is, Fq,∞) critical values.
For n ≥ 100, the F-distribution essentially is the χ²q/q
distribution.
For small n, the F distribution isn’t necessarily a
“better” approximation to the sampling distribution of
the F-statistic – only if the strong conditions are true.
5-43
Summary: testing joint hypotheses
The “common-sense” approach of rejecting if either
of the t-statistics exceeds 1.96 rejects more than 5% of
the time under the null (the size exceeds the desired
significance level)
The heteroskedasticity-robust F-statistic is built in to
STATA (“test” command); this tests all q restrictions
at once.
For n large, F is distributed as χ²q/q (= Fq,∞)
The homoskedasticity-only F-statistic is important
historically (and thus in practice), and is intuitively
appealing, but invalid when there is heteroskedasticity
5-44
Testing Single Restrictions on Multiple Coefficients
(SW Section 5.8)
5-45
Two methods for testing single restrictions on multiple
coefficients:
5-46
Method 1: Rearrange (“transform”) the regression
Yi = 0 + 1X1i + 2X2i + ui
H0: 1 = 2 vs. H1: 1 2
5-48
Method 2: Perform the test directly
Yi = 0 + 1X1i + 2X2i + ui
H0: 1 = 2 vs. H1: 1 2
Example:
5-50
The coverage rate of a confidence set is the probability
that the confidence set contains the true parameter values
5-51
Coverage rate of “common sense” confidence set:
Pr[(β1, β2) ∈ {β̂1 ± 1.96SE(β̂1), β̂2 ± 1.96SE(β̂2)}]
= Pr[β̂1 – 1.96SE(β̂1) ≤ β1 ≤ β̂1 + 1.96SE(β̂1),
  β̂2 – 1.96SE(β̂2) ≤ β2 ≤ β̂2 + 1.96SE(β̂2)]
= Pr[–1.96 ≤ (β̂1 – β1)/SE(β̂1) ≤ 1.96, –1.96 ≤ (β̂2 – β2)/SE(β̂2) ≤ 1.96]
= Pr[|t1| ≤ 1.96 and |t2| ≤ 1.96]
= 1 – Pr[|t1| > 1.96 and/or |t2| > 1.96] ≠ 95% !
Why?
This confidence set “inverts” a test for which the size
doesn’t equal the significance level!
5-52
Recall: the probability of incorrectly rejecting the null
= PrH0[|t1| > 1.96 and/or |t2| > 1.96]
5-53
Instead, use the acceptance region of a test that has size
equal to its significance level (“invert” a valid test):
5-54
The confidence set based on the F-statistic is an ellipse

{β1, β2: F = (1/2) × (t1² + t2² – 2ρ̂t1,t2·t1t2)/(1 – ρ̂²t1,t2) ≤ 3.00}

Now

F = [1/(2(1 – ρ̂²t1,t2))] × [ ((β̂1 – β1,0)/SE(β̂1))² + ((β̂2 – β2,0)/SE(β̂2))²
  – 2ρ̂t1,t2·((β̂1 – β1,0)/SE(β̂1))·((β̂2 – β2,0)/SE(β̂2)) ]

This is a quadratic form in β1,0 and β2,0 – thus the
boundary of the set F = 3.00 is an ellipse.
5-55
Confidence set based on inverting the F-statistic
5-56
The R², SER, and R̄² for Multiple Regression
(SW Section 5.10)
SER = √[ (1/(n–k–1)) Σ_{i=1}^n ûi² ]
5-57
The R² is the fraction of the variance explained:

R² = ESS/TSS = 1 – SSR/TSS,

where ESS = Σ_{i=1}^n (Ŷi – Ŷ̄)², SSR = Σ_{i=1}^n ûi², and TSS = Σ_{i=1}^n (Yi – Ȳ)²
– just as for regression with one regressor.

The R̄² (adjusted R²):

R̄² = 1 – [(n–1)/(n–k–1)]·(SSR/TSS), so R̄² < R²
5-58
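A sketch (added) computing R², R̄², and the SER from residuals; it assumes X, y, and beta_hat in the form used in the multiple-regression snippet above:

import numpy as np

def fit_stats(X, y, beta_hat):
    # R-squared, adjusted R-squared, and SER; k = number of regressors
    n, k = X.shape[0], X.shape[1] - 1          # first column is the intercept
    u_hat = y - X @ beta_hat
    ssr = np.sum(u_hat**2)
    tss = np.sum((y - y.mean())**2)
    r2 = 1 - ssr / tss
    r2_adj = 1 - (n - 1) / (n - k - 1) * ssr / tss
    ser = np.sqrt(ssr / (n - k - 1))
    return r2, r2_adj, ser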
How to interpret the R² and R̄²?
 A high R² (or R̄²) means that the regressors explain
the variation in Y.
 A high R² (or R̄²) does not mean that you have
eliminated omitted variable bias.
 A high R² (or R̄²) does not mean that you have an
unbiased estimator of a causal effect (β1).
 A high R² (or R̄²) does not mean that the included
variables are statistically significant – this must be
determined using hypothesis tests.
5-59
Example: A Closer Look at the Test Score Data
(SW Section 5.11, 5.12)
5-60
Variables we would like to see in the California data set:
School characteristics:
student-teacher ratio
teacher quality
computers (non-teaching resources) per student
measures of curriculum design…
Student characteristics:
English proficiency
availability of extracurricular enrichment
home learning environment
parent’s education level…
5-61
Variables actually in the California class size data set:
student-teacher ratio (STR)
percent English learners in the district (PctEL)
percent eligible for subsidized/free lunch
percent on public income assistance
average district income
5-62
A look at more of the California data
5-63
Digression: presentation of regression results in a table
Listing regressions in “equation” form can be
cumbersome with many regressors and many regressions
Tables of regression results can present the key
information compactly
Information to include:
variables in the regression (dependent and
independent)
estimated coefficients
standard errors
results of F-tests of pertinent joint hypotheses
some measure of fit
number of observations
5-64
5-65
Summary: Multiple Regression
5-66
Nonlinear Regression Functions
(SW Ch. 6)
6-1
The TestScore – STR relation looks approximately
linear…
6-2
But the TestScore – average district income relation
looks like it is nonlinear.
6-3
If a relation between Y and X is nonlinear:
The effect on Y of a change in X depends on the value
of X – that is, the marginal effect of X is not constant
A linear regression is mis-specified – the functional
form is wrong
The estimator of the effect on Y of X is biased – it
needn’t even be right on average.
The solution to this is to estimate a regression
function that is nonlinear in X
6-4
The General Nonlinear Population Regression Function
Assumptions
1. E(ui| X1i,X2i,…,Xki) = 0 (same); implies that f is the
conditional expectation of Y given the X’s.
2. (X1i,…,Xki,Yi) are i.i.d. (same).
3. “enough” moments exist (same idea; the precise
statement depends on specific f).
4. No perfect multicollinearity (same idea; the precise
statement depends on the specific f).
6-5
6-6
Nonlinear Functions of a Single Independent Variable
(SW Section 6.2)
6-7
1. Polynomials in X
Approximate the population regression function by a
polynomial:
Yi = β0 + β1Xi + β2Xi² + … + βrXi^r + ui
6-8
Example: the TestScore – Income relation
Quadratic specification:
Cubic specification:
------------------------------------------------------------------------------
| Robust
testscr | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
avginc | 3.850995 .2680941 14.36 0.000 3.32401 4.377979
avginc2 | -.0423085 .0047803 -8.85 0.000 -.051705 -.0329119
_cons | 607.3017 2.901754 209.29 0.000 601.5978 613.0056
------------------------------------------------------------------------------
6-11
Interpreting the estimated regression function:
(a) Compute “effects” for different values of X
6-12
TestScore = 607.3 + 3.85Incomei – 0.0423(Incomei)²
. test avginc2 avginc3;    (execute the test command after running the regression)
 ( 1) avginc2 = 0.0
 ( 2) avginc3 = 0.0
  F( 2, 416) = 37.69
  Prob > F = 0.0000
Here’s why: ln(x + Δx) – ln(x) = ln(1 + Δx/x) ≈ Δx/x
(calculus: d ln(x)/dx = 1/x)
Numerically:
ln(1.01) = .00995 ≈ .01; ln(1.10) = .0953 ≈ .10 (sort of)
6-17
Three cases:
6-18
I. Linear-log population regression function
Yi = β0 + β1ln(Xi) + ui (b)
now ln(X + ΔX) – ln(X) ≈ ΔX/X,
so ΔY ≈ β1·(ΔX/X)
or β1 ≈ ΔY/(ΔX/X) (small ΔX)
6-19
Linear-log case, continued
Yi = β0 + β1ln(Xi) + ui
Now 100·(ΔX/X) = percentage change in X, so a 1%
increase in X (multiplying X by 1.01) is associated with
a .01β1 change in Y.
6-20
Example: TestScore vs. ln(Income)
First defining the new regressor, ln(Income)
The model is now linear in ln(Income), so the linear-log
model can be estimated by OLS:
6-22
II. Log-linear population regression function
ln(Yi) = β0 + β1Xi + ui
so ΔY/Y ≈ β1ΔX
or β1 ≈ (ΔY/Y)/ΔX (small ΔX)
6-23
Log-linear case, continued
ln(Yi) = β0 + β1Xi + ui
for small ΔX, β1 ≈ (ΔY/Y)/ΔX
Now 100·(ΔY/Y) = percentage change in Y, so a change
in X by one unit (ΔX = 1) is associated with a 100β1%
change in Y (Y increases by a factor of approximately 1 + β1).
 Note: What are the units of ui and the SER?
o fractional (proportional) deviations
o for example, SER = .2 means…
6-24
III. Log-log population regression function
ln(Yi) = β0 + β1ln(Xi) + ui
so ΔY/Y ≈ β1·(ΔX/X)
or β1 ≈ (ΔY/Y)/(ΔX/X) (small ΔX)
6-25
Log-log case, continued
ln(Yi) = 0 + 1ln(Xi) + ui
6-26
Example: ln( TestScore) vs. ln( Income)
First defining a new dependent variable, ln(TestScore),
and the new regressor, ln(Income)
The model is now a linear regression of ln(TestScore)
against ln(Income), which can be estimated by OLS:
6-27
Neither specification seems to fit as well as the cubic or linear-log
6-28
Summary: Logarithmic transformations
Yi = β0 + β1D1i + β2D2i + ui
6-31
Interpreting the coefficients
Yi = β0 + β1D1i + β2D2i + β3(D1i×D2i) + ui
6-32
Example: TestScore, STR, English learners
Let
HiSTR = 1 if STR ≥ 20, 0 if STR < 20
HiEL = 1 if PctEL ≥ 10, 0 if PctEL < 10
Di is binary, X is continuous
 As specified above, the effect on Y of X (holding
constant D) = β2, which does not depend on D
 To allow the effect of X to depend on D, include the
“interaction term” Di×Xi as a regressor:
6-34
Interpreting the coefficients
Yi = β0 + β1Di + β2Xi + β3(Di×Xi) + ui
When HiEL = 0:
TestScore = 682.2 – 0.97STR
When HiEL = 1,
TestScore = 682.2 – 0.97STR + 5.6 – 1.28STR
= 687.8 – 2.25STR
 Two regression lines: one for each HiEL group.
 Class size reduction is estimated to have a larger effect
when the percent of English learners is large.
Example, ctd.
6-36
TestScore = 682.2 – 0.97STR + 5.6HiEL – 1.28(STR×HiEL)
(11.9) (0.59) (19.5) (0.97)
6-37
Joint hypothesis that the two regression lines are the
same ⇔ population coefficient on HiEL = 0 and
population coefficient on STR×HiEL = 0:
F = 89.94 (p-value < .001) !!
Why do we reject the joint hypothesis but neither
individual hypothesis?
Consequence of high but imperfect multicollinearity:
high correlation between HiEL and STR HiEL
Binary-continuous interactions: the two regression lines
Yi = β0 + β1Di + β2Xi + β3(Di×Xi) + ui
When Di = 0: Yi = β0 + β2Xi + ui
When Di = 1: Yi = β0 + β1 + β2Xi + β3Xi + ui
 = (β0 + β1) + (β2 + β3)Xi + ui
6-39
6-40
(c) Interactions between two continuous variables
Yi = β0 + β1X1i + β2X2i + ui
6-42
TestScore = 686.3 – 1.12STR – 0.67PctEL + .0012(STR×PctEL),
(11.8) (0.59) (0.37) (0.019)
6-44
Application: Nonlinear Effects on Test Scores
of the Student-Teacher Ratio
(SW Section 6.4)
6-47
The TestScore – Income relation
6-49
Question #1:
Investigate by considering a polynomial in STR
6-50
Interpreting the regression function via plots
(preceding regression is labeled (5) in this figure)
6-51
Are the higher order terms in STR statistically
significant?
– .411LunchPCT + 12.12ln(Income)
(.029) (1.80)
6-53
Interpreting the regression functions via plots:
– .411LunchPCT + 12.12ln(Income)
(.029) (1.80)
6-57
Tests of joint hypotheses:
6-58
Summary: Nonlinear Regression Functions
7-2
Threats to External Validity
7-7
In general, measurement error in a regressor results in
“errors-in-variables” bias.
Illustration: suppose
Yi = β0 + β1Xi + ui
Let
Xi = unmeasured true value of X
X̃i = imprecisely measured version of X
7-8
Then
Yi = β0 + β1Xi + ui
 = β0 + β1X̃i + [β1(Xi – X̃i) + ui]
or
Yi = β0 + β1X̃i + ũi, where ũi = β1(Xi – X̃i) + ui
7-10
Potential solutions to errors-in-variables bias
7-11
4. Sample selection bias
7-12
Example #1: Mutual funds
Do actively managed mutual funds outperform “hold-
the-market” funds?
Empirical strategy:
o Sampling scheme: simple random sampling of
mutual funds available to the public on a given
date.
o Data: returns for the preceding 10 years.
o Estimator: average ten-year return of the sample
mutual funds, minus ten-year return on S&P500
o Is there sample selection bias?
7-13
Sample selection bias induces correlation between a
regressor and the error term.
returni = β0 + β1managed_fundi + ui
7-14
Example #2: returns to education
What is the return to an additional year of education?
Empirical strategy:
o Sampling scheme: simple random sampling of
workers
o Data: earnings and years of education
o Estimator: regress ln(earnings) on years_education
o Ignore issues of omitted variable bias and
measurement error – is there sample selection
bias?
7-15
Potential solutions to sample selection bias
7-17
Simultaneous causality bias in equations
7-19
Applying this Framework: Test Scores and Class Size
(SW Chapter 7.3)
External validity
o Compare results for California and Massachusetts
o Think hard…
Internal validity
o Go through the list of five potential threats to
internal validity and think hard…
7-20
Check of external validity
compare the California study to one using
Massachusetts data
7-21
The Massachusetts data: summary statistics
7-22
7-23
7-24
Logarithmic v. cubic function for STR?
Evidence of nonlinearity in TestScore-STR relation?
Is there a significant HiEL STR interaction?
7-25
Predicted effects for a class size reduction of 2
Linear specification for Mass:
7-27
Summary of Findings for Massachusetts
7-28
Comparison of estimated class size effects: CA vs. MA
7-29
Summary: Comparison of California and
Massachusetts Regression Analyses
7-32
2. Wrong functional form
We have tried quite a few different functional forms,
in both the California and Mass. data
Nonlinear effects are modest
Plausibly, this is not a major threat at this point.
3. Errors-in-variables bias
STR is a district-wide measure
Presumably there is some measurement error –
students who take the test might not have experienced
the measured STR for the district
Ideally we would like data on individual students, by
grade level.
7-33
4. Selection
Sample is all elementary public school districts (in
California; in Mass.)
no reason that selection should be a problem.
5. Simultaneous Causality
School funding equalization based on test scores
could cause simultaneous causality.
This was not in place in California or Mass. during
these samples, so simultaneous causality bias is
arguably not important.
7-34
Summary
Some jargon…
Another term for panel data is longitudinal data
balanced panel: no missing observations
unbalanced panel: some entities (states) are not
observed for some time periods (years)
8-3
Why are panel data useful?
8-4
Example of a panel data set:
Traffic deaths and alcohol taxes
8-5
Traffic death data for 1982
8-8
These omitted factors could cause omitted variable bias.
8-11
The key idea:
Any change in the fatality rate from 1982 to 1988
cannot be caused by Zi, because Zi (by assumption)
does not change between 1982 and 1988.
1982 data:
FatalityRate = 2.01 + 0.15BeerTax (n = 48)
(.15) (.13)
1988 data:
FatalityRate = 1.86 + 0.44BeerTax (n = 48)
(.11) (.13)
8-14
8-15
Fixed Effects Regression
(SW Section 8.3)
8-17
For TX:
YTX,t = β0 + β1XTX,t + β2ZTX + uTX,t
 = (β0 + β2ZTX) + β1XTX,t + uTX,t
or
YTX,t = αTX + β1XTX,t + uTX,t, where αTX = β0 + β2ZTX
8-18
The regression lines for each state in a picture
[Figure: three parallel regression lines, Y = αCA + β1X, Y = αTX + β1X, and Y = αMA + β1X – same slope β1, different intercepts αCA, αTX, αMA]
8-24
Entity-demeaned OLS regression, ctd.

Yit – (1/T) Σ_{t=1}^T Yit = β1·[Xit – (1/T) Σ_{t=1}^T Xit] + [uit – (1/T) Σ_{t=1}^T uit]
or
Ỹit = β1X̃it + ũit
where Ỹit = Yit – (1/T) Σ_{t=1}^T Yit and X̃it = Xit – (1/T) Σ_{t=1}^T Xit
8-26
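A sketch (added; the tiny synthetic panel below is hypothetical, not the real traffic data) of entity demeaning with pandas, which reproduces what Stata's areg ..., absorb(state) does for the slope:

import pandas as pd

# Tiny synthetic panel for illustration only
df = pd.DataFrame({
    "state": ["TX"]*3 + ["CA"]*3,
    "vfrall": [2.0, 2.1, 1.9, 3.0, 2.8, 2.9],
    "beertax": [0.5, 0.6, 0.4, 1.0, 0.8, 0.9],
})

# Entity-demean: subtract each state's time average
demeaned = df[["vfrall", "beertax"]] - df.groupby("state")[["vfrall", "beertax"]].transform("mean")

# OLS (no intercept) of demeaned Y on demeaned X = fixed-effects estimator
b1 = (demeaned["vfrall"] * demeaned["beertax"]).sum() / (demeaned["beertax"]**2).sum()
print(b1)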
Example: Traffic deaths and beer taxes in STATA
. areg vfrall beertax, absorb(state) r;
------------------------------------------------------------------------------
| Robust
vfrall | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
beertax | -.6558736 .2032797 -3.23 0.001 -1.055982 -.2557655
_cons | 2.377075 .1051515 22.61 0.000 2.170109 2.584041
-------------+----------------------------------------------------------------
state | absorbed (48 categories)
8-29
Time fixed effects only
Yit = β0 + β1Xit + β3St + uit
Similarly,
Yi,1983 = λ1983 + β1Xi,1983 + ui,1983, where λ1983 = β0 + β3S1983
etc.
8-30
Two formulations for time fixed effects
8-31
Time fixed effects: estimation methods
8-33
State and time effects: estimation methods
For a single X:
Yit = β1Xit + αi + uit, i = 1,…,n, t = 1,…,T
1. E(uit|Xi1,…,XiT,αi) = 0.
2. (Xi1,…,XiT,Yi1,…,YiT), i = 1,…,n, are i.i.d. draws from
their joint distribution.
3. (Xit, uit) have finite fourth moments.
4. There is no perfect multicollinearity (multiple X’s)
5. corr(uit,uis|Xit,Xis,αi) = 0 for t ≠ s.
Assumptions 3&4 are identical; 1, 2 differ; 5 is new.
8-36
Assumption #1: E(uit|Xi1,…,XiT,αi) = 0
uit has mean zero, given the state fixed effect and the
entire history of the X’s for that state
This is an extension of the previous multiple
regression Assumption #1
This means there are no omitted lagged effects (any
lagged effects of X must enter explicitly)
Also, there is no feedback from u to future X:
o Whether a state has a particularly high fatality rate
this year doesn’t subsequently affect whether it
increases the beer tax.
o We’ll return to this when we take up time series
data.
8-37
Assumption #2: (Xi1,…,XiT,Yi1,…,YiT), i =1,…,n, are
i.i.d. draws from their joint distribution.
This is an extension of Assumption #2 for multiple
regression with cross-section data
This is satisfied if entities (states, individuals) are
randomly sampled from their population by simple
random sampling, then data for those entities are
collected over time.
This does not require observations to be i.i.d. over
time for the same entity – that would be unrealistic
(whether a state has a mandatory DWI sentencing law
this year is strongly related to whether it will have that
law next year).
8-38
Assumption #5: corr(uit,uis|Xit,Xis,αi) = 0 for t ≠ s
This is new.
This says that (given X), the error terms are
uncorrelated over time within a state.
For example, uCA,1982 and uCA,1983 are uncorrelated
Is this plausible? What enters the error term?
o Especially snowy winter
o Opening major new divided highway
o Fluctuations in traffic density from local economic
conditions
Assumption #5 requires these omitted factors entering
uit to be uncorrelated over time, within a state.
8-39
What if Assumption #5 fails: corr(uit,uis|Xit,Xis,αi) ≠ 0?
A useful analogy is heteroskedasticity.
OLS panel data estimators of 1 are unbiased,
consistent
The OLS standard errors will be wrong – usually the
OLS standard errors understate the true uncertainty
Intuition: if uit is correlated over time, you don’t have
as much information (as much random variation) as you
would were uit uncorrelated.
This problem is solved by using “heteroskedasticity and
autocorrelation-consistent standard errors” – we return
to this when we focus on time series regression
Application: Drunk Driving Laws and Traffic Deaths
8-40
(SW Section 8.5)
Some facts
Approx. 40,000 traffic fatalities annually in the U.S.
1/3 of traffic fatalities involve a drinking driver
25% of drivers on the road between 1am and 3am
have been drinking (estimate)
A drunk driver is 13 times as likely to cause a fatal
crash as a non-drinking driver (estimate)
8-41
Drunk driving laws and traffic deaths, ctd.
8-42
The drunk driving panel data set
n = 48 U.S. states, T = 7 years (1982,…,1988) (balanced)
Variables
Traffic fatality rate (deaths per 10,000 residents)
Tax on a case of beer (Beertax)
Minimum legal drinking age
Minimum sentencing laws for first DWI violation:
o Mandatory Jail
o Mandatory Community Service
o otherwise, sentence will just be a monetary fine
Vehicle miles per driver (US DOT)
State economic data (real per capita income, etc.)
8-43
Why might panel data help?
Potential OV bias from variables that vary across states
but are constant over time:
o culture of drinking and driving
o quality of roads
o vintage of autos on the road
use state fixed effects
Potential OV bias from variables that vary over time
but are constant across states:
o improvements in auto safety over time
o changing national attitudes towards drunk driving
use time fixed effects
8-44
8-45
8-46
Empirical Analysis: Main Results
8-49
Fixed effects estimation can be done three ways:
1. “Changes” method when T = 2
2. “n-1 binary regressors” method when n is small
3. “Entity-demeaned” regression
Similar methods apply to regression with time fixed
effects and to both time and state fixed effects
Statistical inference: like multiple regression.
Limitations/challenges
Need variation in X over time within states
Time lag effects can be important
Standard errors might be too low (errors might be
correlated over time)
8-50
Regression with a Binary Dependent Variable
(SW Ch. 9)
9-1
Example: Mortgage denial and race
The Boston Fed HMDA data set
Individual applications for single-family mortgages
made in 1990 in the greater Boston area
2380 observations, collected under Home Mortgage
Disclosure Act (HMDA)
Variables
Dependent variable:
o Is the mortgage denied or accepted?
Independent variables:
o income, wealth, employment status
o other loan, property characteristics
o race of applicant
9-2
The Linear Probability Model
(SW Section 9.1)
Yi = β0 + β1Xi + ui
But:
Y
What does 1 mean when Y is binary? Is 1 = ?
X
What does the line 0 + 1X mean when Y is binary?
What does the predicted value Yˆ mean when Y is
binary? For example, what does Yˆ = 0.26 mean?
9-3
The linear probability model, ctd.
Yi = β0 + β1Xi + ui
When Y is binary,
E(Y) = 1×Pr(Y=1) + 0×Pr(Y=0) = Pr(Y=1)
so
E(Y|X) = Pr(Y=1|X)
9-4
The linear probability model, ctd.
When Y is binary, the linear regression model
Yi = β0 + β1Xi + ui
is called the linear probability model.
9-6
Linear probability model: HMDA data
9-7
Next include black as a regressor:
deny = -.091 + .559P/I ratio + .177black
(.032) (.098) (.025)
9-8
The linear probability model: Summary
Instead, we want:
0 ≤ Pr(Y = 1|X) ≤ 1 for all X
Pr(Y = 1|X) to be increasing in X (for β1 > 0)
This requires a nonlinear functional form for the
probability. How about an “S-curve”…
9-10
The probit model satisfies these conditions:
0 ≤ Pr(Y = 1|X) ≤ 1 for all X
Pr(Y = 1|X) to be increasing in X (for β1 > 0)
9-11
Probit regression models the probability that Y=1 using
the cumulative standard normal distribution function,
evaluated at z = β0 + β1X:
Pr(Y = 1|X) = Φ(β0 + β1X)
 Φ is the cumulative normal distribution function.
 z = β0 + β1X is the “z-value” or “z-index” of the
probit model.
9-12
Pr(Z ≤ -0.8) = .2119
9-13
Probit regression, ctd.
------------------------------------------------------------------------------
| Robust
deny | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
p_irat | 2.967908 .4653114 6.38 0.000 2.055914 3.879901
_cons | -2.194159 .1649721 -13.30 0.000 -2.517499 -1.87082
------------------------------------------------------------------------------
9-15
STATA Example: HMDA data, ctd.
Pr(deny = 1|P/I ratio) = Φ(–2.19 + 2.97×P/I ratio)
        (.16)  (.47)
 Positive coefficient: does this make sense?
 Standard errors have usual interpretation
 Predicted probabilities:
Pr(deny = 1|P/I ratio = .3) = Φ(–2.19 + 2.97×.3)
 = Φ(–1.30) = .097
 Effect of change in P/I ratio from .3 to .4:
Pr(deny = 1|P/I ratio = .4) = Φ(–2.19 + 2.97×.4) = .159
Predicted probability of denial rises from .097 to .159
9-16
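A quick check of these predicted probabilities (a sketch added here; scipy's standard normal CDF is assumed available):

from scipy.stats import norm

b0, b1 = -2.19, 2.97            # probit coefficients from the output above
for pi_ratio in (0.3, 0.4):
    print(pi_ratio, norm.cdf(b0 + b1 * pi_ratio))    # about .097 and .159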
Probit regression with multiple regressors
9-17
STATA Example: HMDA data
. probit deny p_irat black, r;
------------------------------------------------------------------------------
| Robust
deny | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
p_irat | 2.741637 .4441633 6.17 0.000 1.871092 3.612181
black | .7081579 .0831877 8.51 0.000 .545113 .8712028
_cons | -2.258738 .1588168 -14.22 0.000 -2.570013 -1.947463
------------------------------------------------------------------------------
9-18
STATA Example: predicted probit probabilities
. probit deny p_irat black, r;
------------------------------------------------------------------------------
| Robust
deny | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
p_irat | 2.741637 .4441633 6.17 0.000 1.871092 3.612181
black | .7081579 .0831877 8.51 0.000 .545113 .8712028
_cons | -2.258738 .1588168 -14.22 0.000 -2.570013 -1.947463
------------------------------------------------------------------------------
. sca z1 = _b[_cons]+_b[p_irat]*.3+_b[black]*0;
NOTE
_b[_cons] is the estimated intercept (-2.258738)
_b[p_irat] is the coefficient on p_irat (2.741637)
sca creates a new scalar which is the result of a calculation
display prints the indicated information to the screen
9-19
STATA Example: HMDA data, ctd.
Pr(deny = 1|P/I, black)
= Φ(–2.26 + 2.74×P/I ratio + .71×black)
   (.16)  (.44)     (.08)
 Is the coefficient on black statistically significant?
 Estimated effect of race for P/I ratio = .3:
Pr(deny = 1|.3, 1) = Φ(–2.26 + 2.74×.3 + .71×1) = .233
Pr(deny = 1|.3, 0) = Φ(–2.26 + 2.74×.3 + .71×0) = .075
 Difference in rejection probabilities = .158 (15.8
percentage points)
 Still plenty of room for omitted variable bias…
9-20
Logit regression
F(β0 + β1X) = 1/(1 + e^–(β0 + β1X))
9-21
Logistic regression, ctd.
where F(β0 + β1X) = 1/(1 + e^–(β0 + β1X)).
------------------------------------------------------------------------------
| Robust
deny | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
p_irat | 5.370362 .9633435 5.57 0.000 3.482244 7.258481
black | 1.272782 .1460986 8.71 0.000 .9864339 1.55913
_cons | -4.125558 .345825 -11.93 0.000 -4.803362 -3.447753
------------------------------------------------------------------------------
9-24
Estimation and Inference in Probit (and Logit)
Models (SW Section 9.3)
Probit model:
Pr(Y = 1|X) = Φ(β0 + β1X)
(Probit coefficients are estimated by maximum likelihood, not by the
least squares criterion min_{b0,b1} Σ_{i=1}^n [Yi – (b0 + b1Xi)]².)
Start with the special case of no X:
Y = 1 with probability p
  = 0 with probability 1 – p (Bernoulli distribution)
9-29
Joint density of (Y1,Y2):
Because Y1 and Y2 are independent, the joint density is the
product of the individual densities; for n observations,

f(p;Y1,…,Yn) = p^(Σ_{i=1}^n Yi) × (1 – p)^(n – Σ_{i=1}^n Yi)

Setting the derivative of the log likelihood to zero:

d ln f(p;Y1,…,Yn)/dp = (Σ_{i=1}^n Yi)/p – (n – Σ_{i=1}^n Yi)/(1 – p) = 0

Solving for p yields the MLE; that is, p̂MLE satisfies,
9-31
(Σ_{i=1}^n Yi)/p̂MLE – (n – Σ_{i=1}^n Yi)/(1 – p̂MLE) = 0
or
(Σ_{i=1}^n Yi)/p̂MLE = (n – Σ_{i=1}^n Yi)/(1 – p̂MLE)
or (dividing through by n)
Ȳ/p̂MLE = (1 – Ȳ)/(1 – p̂MLE)
or
p̂MLE = Ȳ = fraction of 1’s
9-32
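A numerical sketch (added) confirming that maximizing the Bernoulli log likelihood gives p̂ = Ȳ:

import numpy as np

rng = np.random.default_rng(5)
y = rng.binomial(1, 0.78, size=500)

# Evaluate the log likelihood on a grid of candidate p's
p_grid = np.linspace(0.01, 0.99, 981)
loglik = y.sum() * np.log(p_grid) + (len(y) - y.sum()) * np.log(1 - p_grid)

print(p_grid[np.argmax(loglik)], y.mean())    # the two agree (up to grid step)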
The MLE in the “no-X” case (Bernoulli distribution):
p̂MLE = Ȳ = fraction of 1’s
 For Yi i.i.d. Bernoulli, the MLE is the “natural”
estimator of p, the fraction of 1’s, which is Ȳ
 We already know the essentials of inference:
o In large samples, the sampling distribution of p̂MLE = Ȳ is
normally distributed
o Thus inference is “as usual:” hypothesis testing via
the t-statistic, confidence interval as ±1.96SE
 STATA note: to emphasize the requirement of large n, the
printout calls the t-statistic the z-statistic; instead of the
F-statistic, the chi-squared statistic (= q×F).
9-33
The probit likelihood with one X
The derivation starts with the density of Y1, given X1:
Pr(Y1 = 1|X1) = Φ(β0 + β1X1)
Pr(Y1 = 0|X1) = 1 – Φ(β0 + β1X1)
so
Pr(Y1 = y1|X1) = Φ(β0 + β1X1)^y1 × [1 – Φ(β0 + β1X1)]^(1–y1)
The probit likelihood is the product over the n observations:
f(β0,β1; Y1,…,Yn|X1,…,Xn)
= {Φ(β0 + β1X1)^Y1 [1 – Φ(β0 + β1X1)]^(1–Y1)} × …
 × {Φ(β0 + β1Xn)^Yn [1 – Φ(β0 + β1Xn)]^(1–Yn)}
9-35
 The only difference between probit and logit is the
functional form used for the probability: Φ is
replaced by the cumulative logistic function.
 Otherwise, the likelihood is similar; for details see
SW App. 9.2
 As with probit,
o β̂0MLE, β̂1MLE are consistent
o β̂0MLE, β̂1MLE are normally distributed
o Their standard errors can be computed
o Testing, confidence intervals proceeds as usual
9-36
Measures of fit
The R² and R̄² don’t make sense here (why?). So, two
other specialized measures are used:

Recall, for the no-X (Bernoulli) case,
f(p;Y1) = p^Y1(1 – p)^(1–Y1) (likelihood)
The likelihood for Y1,…,Yn is,
f(p;Y1,…,Yn) = f(p;Y1) × … × f(p;Yn)
so the log likelihood is,
L(p) = ln f(p;Y1,…,Yn) = ln[f(p;Y1) × … × f(p;Yn)] = Σ_{i=1}^n ln f(p;Yi)
Expanding the first-order condition for the MLE around p_true:

0 = ∂L(p)/∂p|_(p̂MLE) ≈ ∂L(p)/∂p|_(p_true) + ∂²L(p)/∂p²|_(p_true)·(p̂MLE – p_true)
9-40
4. Solve this linear approximation for (p̂MLE – p_true):

∂L(p)/∂p|_(p_true) + ∂²L(p)/∂p²|_(p_true)·(p̂MLE – p_true) ≈ 0
so
∂²L(p)/∂p²|_(p_true)·(p̂MLE – p_true) ≈ –∂L(p)/∂p|_(p_true)
or
(p̂MLE – p_true) ≈ –[∂²L(p)/∂p²|_(p_true)]⁻¹·[∂L(p)/∂p|_(p_true)]
9-41
5. Substitute things in and apply the LLN and CLT.

L(p) = Σ_{i=1}^n ln f(p;Yi)
∂L(p)/∂p|_(p_true) = Σ_{i=1}^n ∂ln f(p;Yi)/∂p|_(p_true)
∂²L(p)/∂p²|_(p_true) = Σ_{i=1}^n ∂²ln f(p;Yi)/∂p²|_(p_true)
so
(p̂MLE – p_true) ≈ –[ Σ_{i=1}^n ∂²ln f(p;Yi)/∂p²|_(p_true) ]⁻¹·[ Σ_{i=1}^n ∂ln f(p;Yi)/∂p|_(p_true) ]
9-42
Multiply through by √n:

√n(p̂MLE – p_true) ≈ –[ (1/n) Σ_{i=1}^n ∂²ln f(p;Yi)/∂p²|_(p_true) ]⁻¹·[ (1/√n) Σ_{i=1}^n ∂ln f(p;Yi)/∂p|_(p_true) ]

Because Yi is i.i.d., the ith terms in the summands are also
i.i.d. Thus, if these terms have enough (2) moments, then
under general conditions (not just Bernoulli likelihood):
 (1/n) Σ_{i=1}^n ∂²ln f(p;Yi)/∂p²|_(p_true) →p a (a constant) (WLLN)
 (1/√n) Σ_{i=1}^n ∂ln f(p;Yi)/∂p|_(p_true) →d N(0, σ²_lnf) (CLT) (Why?)
Putting this together,
9-43

√n(p̂MLE – p_true) →d N(0, σ²_lnf/a²) (large-n normal)

Work out the details for the probit/no-X (Bernoulli) case:
9-44
Recall:
f(p;Yi) = p^Yi(1 – p)^(1–Yi)
so
ln f(p;Yi) = Yi·ln p + (1 – Yi)·ln(1 – p)
and
∂ln f(p;Yi)/∂p = Yi/p – (1 – Yi)/(1 – p) = (Yi – p)/(p(1 – p))
and
∂²ln f(p;Yi)/∂p² = –Yi/p² – (1 – Yi)/(1 – p)²
9-45
Denominator term first:
∂²ln f(p;Yi)/∂p² = –Yi/p² – (1 – Yi)/(1 – p)²
so
(1/n) Σ_{i=1}^n ∂²ln f(p;Yi)/∂p²|_(p_true) = –(1/n) Σ_{i=1}^n [Yi/p² + (1 – Yi)/(1 – p)²]
 = –[Ȳ/p² + (1 – Ȳ)/(1 – p)²]
 →p –[p/p² + (1 – p)/(1 – p)²] (LLN)
 = –[1/p + 1/(1 – p)] = –1/(p(1 – p))
9-46
Next the numerator:
∂ln f(p;Yi)/∂p = (Yi – p)/(p(1 – p))
so
(1/√n) Σ_{i=1}^n ∂ln f(p;Yi)/∂p|_(p_true) = (1/√n) Σ_{i=1}^n (Yi – p)/(p(1 – p))
 = [1/(p(1 – p))]·(1/√n) Σ_{i=1}^n (Yi – p)
 →d N(0, σY²/[p(1 – p)]²)
9-47
Put these pieces together:

√n(p̂MLE – p_true) ≈ –[ (1/n) Σ ∂²ln f(p;Yi)/∂p²|_(p_true) ]⁻¹·[ (1/√n) Σ ∂ln f(p;Yi)/∂p|_(p_true) ]

where
 (1/n) Σ_{i=1}^n ∂²ln f(p;Yi)/∂p²|_(p_true) →p –1/(p(1 – p))
 (1/√n) Σ_{i=1}^n ∂ln f(p;Yi)/∂p|_(p_true) →d N(0, σY²/[p(1 – p)]²)
Thus

√n(p̂MLE – p_true) →d N(0, σY²)

(the factors of p(1 – p) cancel: [p(1 – p)]²·σY²/[p(1 – p)]² = σY²)
9-48
Summary: probit MLE, no-X case
√n(p̂MLE – p_true) →d N(0, σY²)
9-52
The HMDA Data Set
9-53
The loan officer’s decision
9-54
Regression specifications
Pr(deny=1|black, other X’s) = …
linear probability model
probit
9-61
Remaining threats to internal, external validity
Internal validity
1. omitted variable bias
what else is learned in the in-person interviews?
2. functional form misspecification (no…)
3. measurement error (originally, yes; now, no…)
4. selection
random sample of loan applications
define population to be loan applicants
5. simultaneous causality (no)
External validity
This is for Boston in 1990-91. What about today?
9-62
Summary
(SW Section 9.5)
If Yi is binary, then E(Y| X) = Pr(Y=1|X)
Three models:
o linear probability model (linear multiple regression)
o probit (cumulative standard normal distribution)
o logit (cumulative standard logistic distribution)
LPM, probit, logit all produce predicted probabilities
Effect of X is change in conditional probability that
Y=1. For logit and probit, this depends on the initial X
Probit and logit are estimated via maximum likelihood
o Coefficients are normally distributed for large n
o Large-n hypothesis testing, conf. intervals is as usual
9-63
Instrumental Variables Regression
(SW Ch. 10)
Yi = β0 + β1Xi + ui
10-3
Two conditions for a valid instrument
Yi = β0 + β1Xi + ui
10-4
The IV Estimator, one X and one Z
Explanation #1: Two Stage Least Squares (TSLS)
As it sounds, TSLS has two stages – two regressions:
(1) The first stage isolates the part of X that is uncorrelated with u:
regress X on Z using OLS
Xi = π0 + π1Zi + vi (1)
(2) The second stage replaces Xi by its predicted value X̂i:
Yi = β0 + β1X̂i + ui (2)
10-6
Two Stage Least Squares, ctd.
Stage 1:
Regress Xi on Zi, obtain the predicted values X̂i
Stage 2:
Regress Yi on X̂i; the coefficient on X̂i is the TSLS estimator, β̂1^TSLS.
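In Stata, the two stages can be run by hand, although in practice a single ivreg command should be used so the SEs come out right (a generic sketch with hypothetical variables y, x, z; the cigarette example later runs these steps on real data):

. reg x z;              first stage: regress X on Z
. predict xhat;         save the predicted values X-hat
. reg y xhat, r;        second stage: same TSLS coefficient, but wrong SEs
. ivreg y (x = z), r;   same coefficient, correct SEs, in a single command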
10-7
The IV Estimator, one X and one Z, ctd.
Explanation #2: (only) a little algebra
Yi = β0 + β1Xi + ui
Thus,
cov(Yi,Zi) = cov(β0 + β1Xi + ui, Zi)
= cov(β0,Zi) + cov(β1Xi,Zi) + cov(ui,Zi)
= 0 + β1cov(Xi,Zi) + 0
= β1cov(Xi,Zi)
so
β1 = cov(Yi,Zi) / cov(Xi,Zi)
10-8
The IV Estimator, one X and one Z, ctd
β1 = cov(Yi,Zi) / cov(Xi,Zi)
The TSLS estimator replaces these population covariances with the sample covariances:
β̂1^TSLS = sYZ / sXZ,
where sYZ and sXZ are the sample covariances of Y and Z and of X and Z.
The sample covariances are consistent: sYZ →p cov(Y,Z) and sXZ →p cov(X,Z). Thus,
β̂1^TSLS = sYZ / sXZ →p cov(Y,Z) / cov(X,Z) = β1
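A minimal Stata sketch of this formula (an illustration with generic variables y, x, z assumed in memory; r(cov_12) is the covariance stored by correlate):

. quietly correlate y z, covariance;
. scalar syz = r(cov_12);
. quietly correlate x z, covariance;
. scalar sxz = r(cov_12);
. display "betahat_TSLS = " syz/sxz;    should match the ivreg coefficient on x
. ivreg y (x = z);

Note the (n–1) denominators cancel in the ratio, so it does not matter whether the covariances are computed with n or n–1.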
10-11
Simultaneous causality bias in the OLS regression of ln(Qi^butter) on ln(Pi^butter) arises because price and quantity are determined by the interaction of demand and supply
10-12
This interaction of demand and supply produces…
10-13
What would you get if only supply shifted?
10-15
TSLS in the supply-demand example, ctd.
ln(Qi^butter) = β0 + β1ln(Pi^butter) + ui
10-17
Example #2: Test scores and class size, ctd.
Here is a (hypothetical) instrument:
some districts, randomly hit by an earthquake, “double
up” classrooms:
Zi = Quakei = 1 if hit by quake, = 0 otherwise
Do the two conditions for a valid instrument hold?
The earthquake makes it as if the districts were in a
random assignment experiment. Thus the variation in
STR arising from the earthquake is exogenous.
The first stage of TSLS regresses STR against Quake,
thereby isolating the part of STR that is exogenous (the
part that is “as if” randomly assigned)
We’ll go through other examples later…
10-18
Inference using TSLS
In large samples, the sampling distribution of the TSLS
estimator is normal
Inference (hypothesis tests, confidence intervals) proceeds in the usual way, e.g. the estimate ± 1.96SE
The idea behind the large-sample normal distribution of
the TSLS estimator is that – like all the other estimators
we have considered – it involves an average of mean
zero i.i.d. random variables, to which we can apply the
CLT.
Here is a sketch of the math (see SW App. 10.3 for the
details)...
10-19
$$\hat\beta_1^{TSLS} = \frac{s_{YZ}}{s_{XZ}} = \frac{\frac{1}{n-1}\sum_{i=1}^{n}(Y_i-\bar Y)(Z_i-\bar Z)}{\frac{1}{n-1}\sum_{i=1}^{n}(X_i-\bar X)(Z_i-\bar Z)}$$

Now substitute in Yi = β0 + β1Xi + ui and simplify:
First,
$$Y_i - \bar Y = \beta_1(X_i-\bar X) + (u_i-\bar u)$$
so
$$\frac{1}{n-1}\sum_{i=1}^{n}(Y_i-\bar Y)(Z_i-\bar Z) = \frac{1}{n-1}\sum_{i=1}^{n}[\beta_1(X_i-\bar X)+(u_i-\bar u)](Z_i-\bar Z)$$
$$= \beta_1\frac{1}{n-1}\sum_{i=1}^{n}(X_i-\bar X)(Z_i-\bar Z) + \frac{1}{n-1}\sum_{i=1}^{n}(u_i-\bar u)(Z_i-\bar Z).$$
10-20
Thus
$$\hat\beta_1^{TSLS} = \frac{\frac{1}{n-1}\sum_{i=1}^{n}(Y_i-\bar Y)(Z_i-\bar Z)}{\frac{1}{n-1}\sum_{i=1}^{n}(X_i-\bar X)(Z_i-\bar Z)} = \beta_1 + \frac{\frac{1}{n-1}\sum_{i=1}^{n}(u_i-\bar u)(Z_i-\bar Z)}{\frac{1}{n-1}\sum_{i=1}^{n}(X_i-\bar X)(Z_i-\bar Z)}.$$
so
$$\sqrt{n}(\hat\beta_1^{TSLS}-\beta_1) = \frac{\frac{1}{\sqrt{n}}\sum_{i=1}^{n}(u_i-\bar u)(Z_i-\bar Z)}{\frac{1}{n}\sum_{i=1}^{n}(X_i-\bar X)(Z_i-\bar Z)}$$
10-22
$$\sqrt{n}(\hat\beta_1^{TSLS}-\beta_1) = \frac{\frac{1}{\sqrt{n}}\sum_{i=1}^{n}(u_i-\bar u)(Z_i-\bar Z)}{\frac{1}{n}\sum_{i=1}^{n}(X_i-\bar X)(Z_i-\bar Z)}$$
10-23
Put this together:
$$\sqrt{n}(\hat\beta_1^{TSLS}-\beta_1) = \frac{\frac{1}{\sqrt{n}}\sum_{i=1}^{n}(u_i-\bar u)(Z_i-\bar Z)}{\frac{1}{n}\sum_{i=1}^{n}(X_i-\bar X)(Z_i-\bar Z)}$$
$$\frac{1}{n}\sum_{i=1}^{n}(X_i-\bar X)(Z_i-\bar Z) \xrightarrow{p} \mathrm{cov}(X,Z)$$
$$\frac{1}{\sqrt{n}}\sum_{i=1}^{n}(u_i-\bar u)(Z_i-\bar Z) \text{ is distributed } N(0,\mathrm{var}[(Z-\mu_Z)u])$$
So finally:
$$\hat\beta_1^{TSLS} \text{ is approx. distributed } N(\beta_1,\sigma^2_{\hat\beta_1^{TSLS}}), \quad \text{where } \sigma^2_{\hat\beta_1^{TSLS}} = \frac{1}{n}\,\frac{\mathrm{var}[(Z_i-\mu_Z)u_i]}{[\mathrm{cov}(Z_i,X_i)]^2}.$$
10-24
Inference using TSLS, ctd.
$$\hat\beta_1^{TSLS} \text{ is approx. distributed } N(\beta_1,\sigma^2_{\hat\beta_1^{TSLS}})$$
10-26
Figure 4, p. 296, from Appendix B (1928):
10-27
Who wrote Appendix B of Philip Wright (1928)?
10-28
Philip Wright (1861-1934): obscure economist and poet; MA Harvard, Econ, 1887; Lecturer, Harvard, 1913-1917
Sewall Wright (1889-1988): famous genetic statistician; ScD Harvard, Biology, 1915; Prof., U. Chicago, 1930-1954
10-29
Example: Demand for Cigarettes
10-30
Example: Cigarette demand, ctd.
ln(Qi^cigarettes) = β0 + β1ln(Pi^cigarettes) + ui
Panel data:
Annual cigarette consumption and average prices paid
(including tax)
48 continental US states, 1985-1995
Proposed instrumental variable:
Zi = general sales tax per pack in the state = SalesTaxi
Is this a valid instrument?
(1) Relevant? corr(SalesTaxi, ln(Pi^cigarettes)) ≠ 0?
(2) Exogenous? corr(SalesTaxi, ui) = 0?
10-31
For now, use data for 1995 only.
10-32
STATA Example: Cigarette demand, First stage
Instrument = Z = rtaxso = general sales tax (real $/pack)
X Z
. reg lravgprs rtaxso if year==1995, r;
------------------------------------------------------------------------------
| Robust
lravgprs | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
rtaxso | .0307289 .0048354 6.35 0.000 .0209956 .0404621
_cons | 4.616546 .0289177 159.64 0.000 4.558338 4.674755
------------------------------------------------------------------------------
X-hat
. predict lravphat; Now we have the predicted values from the 1st stage
10-33
Second stage
Y X-hat
. reg lpackpc lravphat if year==1995, r;
------------------------------------------------------------------------------
| Robust
lpackpc | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
lravphat | -1.083586 .3336949 -3.25 0.002 -1.755279 -.4118932
_cons | 9.719875 1.597119 6.09 0.000 6.505042 12.93471
------------------------------------------------------------------------------
10-34
Combined into a single command:
Y X Z
. ivreg lpackpc (lravgprs = rtaxso) if year==1995, r;
------------------------------------------------------------------------------
| Robust
lpackpc | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
lravgprs | -1.083587 .3189183 -3.40 0.001 -1.725536 -.4416373
_cons | 9.719876 1.528322 6.36 0.000 6.643525 12.79623
------------------------------------------------------------------------------
Instrumented: lravgprs This is the endogenous regressor
Instruments: rtaxso This is the instrumental variable
------------------------------------------------------------------------------
OK, the change in the SEs was small this time...but not always!
10-36
The General IV Regression Model
(SW Section 10.2)
10-39
The general IV regression model, ctd.
Yi = β0 + β1X1i + … + βkXki + βk+1W1i + … + βk+rWri + ui
10-41
Identification, ctd.
The coefficients β1,…,βk are said to be:
exactly identified if m = k.
There are just enough instruments to estimate β1,…,βk.
overidentified if m > k.
There are more than enough instruments to estimate β1,…,βk. If so, you can test whether the instruments are valid (a test of the "overidentifying restrictions") – we'll return to this later
underidentified if m < k.
There are too few instruments to estimate β1,…,βk. If so, you need to get more instruments!
10-42
General IV regression: TSLS, 1 endogenous regressor
Yi = β0 + β1X1i + β2W1i + … + β1+rWri + ui
Instruments: Z1i,…,Zmi
First stage
o Regress X1 on all the exogenous regressors: regress X1 on W1,…,Wr, Z1,…,Zm by OLS
o Compute the predicted values X̂1i, i = 1,…,n
Second stage
o Regress Y on X̂1, W1,…,Wr by OLS
o The coefficients from this second stage regression are the TSLS estimators, but SEs are wrong
To get correct SEs, do this in a single step
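In Stata the single step is one ivreg command (generic names: x1 endogenous, w1 and w2 included exogenous, z1 and z2 instruments):

. ivreg y w1 w2 (x1 = z1 z2), r;     TSLS with correct (robust) SEs in one step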
10-43
Example: Demand for cigarettes
. ivreg lpackpc (lravgprs = rtaxso) lperinc if year==1995, r;
------------------------------------------------------------------------------
| Robust
lpackpc | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
lravgprs | -1.143375 .3723025 -3.07 0.004 -1.893231 -.3935191
lperinc | .214515 .3117467 0.69 0.495 -.413375 .842405
_cons | 9.430658 1.259392 7.49 0.000 6.894112 11.9672
------------------------------------------------------------------------------
Instrumented: lravgprs
Instruments: lperinc rtaxso STATA lists ALL the exogenous regressors
as instruments – slightly different
terminology than we have been using
------------------------------------------------------------------------------
. ivreg lpackpc (lravgprs = rtaxso rtax) lperinc if year==1995, r;
------------------------------------------------------------------------------
| Robust
lpackpc | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
lravgprs | -1.277424 .2496099 -5.12 0.000 -1.780164 -.7746837
lperinc | .2804045 .2538894 1.10 0.275 -.230955 .7917641
_cons | 9.894955 .9592169 10.32 0.000 7.962993 11.82692
------------------------------------------------------------------------------
Instrumented: lravgprs
Instruments: lperinc rtaxso rtax STATA lists ALL the exogenous regressors
as “instruments” – slightly different
terminology than we have been using
------------------------------------------------------------------------------
10-46
TSLS estimates, Z = sales tax (m = 1)
ln(Qi^cigarettes) = 9.43 – 1.14 ln(Pi^cigarettes) + 0.21 ln(Incomei)
(1.26) (0.37) (0.31)
TSLS with multiple endogenous regressors
Instruments: Z1i,…,Zmi
Now there are k first stage regressions:
o Regress X1 on W1,…, Wr, Z1,…, Zm by OLS
o Compute the predicted values X̂1i, i = 1,…,n
o Regress X2 on W1,…, Wr, Z1,…, Zm by OLS
o Compute the predicted values X̂2i, i = 1,…,n
o Repeat for all the X's, obtaining X̂1i, X̂2i,…, X̂ki
10-48
TSLS with multiple endogenous regressors, ctd.
Second stage
o Regress Y on X̂1i, X̂2i,…, X̂ki, W1,…, Wr by OLS
o The coefficients from this second stage regression are the TSLS estimators, but SEs are wrong
To get correct SEs, do this in a single step
What would happen in the second stage regression if
the coefficients were underidentified (that is, if
#instruments < #endogenous variables); for example, if
k = 2, m = 1?
10-49
Sampling distribution of the TSLS estimator in the
general IV regression model
10-50
A “valid” set of instruments in the general case
2. Instrument exogeneity
All the instruments are uncorrelated with the error term: corr(Z1i,ui) = 0, …, corr(Zmi,ui) = 0
10-51
“Valid” instruments in the general case, ctd.
10-52
The IV Regression Assumptions
Yi = β0 + β1X1i + … + βkXki + βk+1W1i + … + βk+rWri + ui
1. E(ui|W1i,…,Wri) = 0
2. (Yi,X1i,…,Xki,W1i,…,Wri,Z1i,…,Zmi) are i.i.d.
3. The X’s, W’s, Z’s, and Y have nonzero, finite 4th
moments
4. The W’s are not perfectly multicollinear
5. The instruments (Z1i,…,Zmi) satisfy the conditions for
a valid set of instruments.
All this hinges on having valid instruments…
10-54
Checking Instrument Validity
(SW Section 10.3)
The IV estimator is β̂1^TSLS = sYZ / sXZ
If cov(X,Z) is zero or small, then sXZ will be small:
With weak instruments, the denominator is nearly zero.
If so, the sampling distribution of β̂1^TSLS (and its t-statistic) is not well approximated by its large-n normal approximation…
10-57
An example: the distribution of the TSLS t-statistic
with weak instruments
If cov(X,Z) is small, small changes in sXZ (from one sample to the next) can induce big changes in β̂1^TSLS
Suppose in one sample you happen to calculate sXZ = .00001!
Thus the large-n normal approximation is a poor approximation to the sampling distribution of β̂1^TSLS
A better approximation is that β̂1^TSLS is distributed as the ratio of two correlated normal random variables (see SW App. 10.4)
If instruments are weak, the usual methods of inference
are unreliable – potentially very unreliable.
10-59
Measuring the strength of instruments in practice:
The first-stage F-statistic
10-60
Checking for weak instruments with a single X
Compute the first-stage F-statistic.
Rule-of-thumb: If the first stage F-statistic is less
than 10, then the set of instruments is weak.
If so, the TSLS estimator will be biased, and statistical
inferences (standard errors, hypothesis tests, confidence
intervals) can be misleading.
Note that simply rejecting the null hypothesis that the coefficients on the Z's are zero isn't enough – you actually need substantial predictive content for the normal approximation to be a good one.
There are more sophisticated things to do than just
compare F to 10 but they are beyond this course.
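A sketch of computing the first-stage F in Stata (generic names: x endogenous, z1 and z2 instruments, w an included exogenous regressor):

. reg x z1 z2 w, r;     first-stage regression
. test z1 z2;           first-stage F-statistic; worry if F < 10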
10-61
What to do if you have weak instruments?
10-67
Panel data set
Annual cigarette consumption, average prices paid by
end consumer (including tax), personal income
48 continental US states, 1985-1995
Estimation strategy
Having panel data allows us to control for unobserved
state-level characteristics that enter the demand for
cigarettes, as long as they don’t vary over time
But we still need to use IV estimation methods to
handle the simultaneous causality bias that arises from
the interaction of supply and demand.
10-68
Fixed-effects model of cigarette demand
10-69
Panel data IV regression: two approaches
(a) The “n-1 binary indicators” method
(b) The “changes” method (when T=2)
10-73
Use TSLS to estimate the demand elasticity by using
the “10-year changes” specification
Y W X Z
. ivreg dlpackpc dlperinc (dlavgprs = drtaxso) , r;
------------------------------------------------------------------------------
| Robust
dlpackpc | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
dlavgprs | -.9380143 .2075022 -4.52 0.000 -1.355945 -.5200834
dlperinc | .5259693 .3394942 1.55 0.128 -.1578071 1.209746
_cons | .2085492 .1302294 1.60 0.116 -.0537463 .4708446
------------------------------------------------------------------------------
Instrumented: dlavgprs
Instruments: dlperinc drtaxso
------------------------------------------------------------------------------
NOTE:
- All the variables – Y, X, W, and Z’s – are in 10-year changes
- Estimated elasticity = –.94 (SE = .21) – surprisingly elastic!
- Income elasticity small, not statistically different from zero
- Must check whether the instrument is relevant…
10-74
Check instrument relevance: compute first-stage F
. reg dlavgprs drtaxso dlperinc , r;
------------------------------------------------------------------------------
| Robust
dlavgprs | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
drtaxso | .0254611 .0043876 5.80 0.000 .016624 .0342982
dlperinc | -.2241037 .2188815 -1.02 0.311 -.6649536 .2167463
_cons | .5321948 .0295315 18.02 0.000 .4727153 .5916742
------------------------------------------------------------------------------
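Since there is a single instrument here, the first-stage F is simply the square of the t-statistic on drtaxso: F = 5.80² ≈ 33.6, comfortably above the rule-of-thumb threshold of 10, so the instrument is not weak.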
------------------------------------------------------------------------------
| Robust
dlpackpc | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
dlavgprs | -1.202403 .1969433 -6.11 0.000 -1.599068 -.8057392
dlperinc | .4620299 .3093405 1.49 0.142 -.1610138 1.085074
_cons | .3665388 .1219126 3.01 0.004 .1209942 .6120834
------------------------------------------------------------------------------
Instrumented: dlavgprs
Instruments: dlperinc drtaxso drtax
------------------------------------------------------------------------------
The J-test regression: regress the TSLS residuals e on the instruments and the included exogenous regressor:
------------------------------------------------------------------------------
e | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
drtaxso | .0127669 .0061587 2.07 0.044 .000355 .0251789
drtax | -.0038077 .0021179 -1.80 0.079 -.008076 .0004607
dlperinc | -.0934062 .2978459 -0.31 0.755 -.6936752 .5068627
_cons | .002939 .0446131 0.07 0.948 -.0869728 .0928509
------------------------------------------------------------------------------
. test drtaxso drtax;
10-78
Check instrument relevance: compute first-stage F
X Z1 Z2 W
. reg dlavgprs drtaxso drtax dlperinc , r;
------------------------------------------------------------------------------
| Robust
dlavgprs | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
drtaxso | .013457 .0031405 4.28 0.000 .0071277 .0197863
drtax | .0075734 .0008859 8.55 0.000 .0057879 .0093588
dlperinc | -.0289943 .1242309 -0.23 0.817 -.2793654 .2213767
_cons | .4919733 .0183233 26.85 0.000 .4550451 .5289015
------------------------------------------------------------------------------
( 1) drtaxso = 0
( 2) drtax = 0
10-79
Tabular summary of these results:
10-80
How should we interpret the J-test rejection?
J-test rejects the null hypothesis that both the
instruments are exogenous
This means that either rtaxso is endogenous, or rtax is
endogenous, or both
The J-test doesn’t tell us which!! You must think!
Why might rtax (cig-only tax) be endogenous?
o Political forces: a history of smoking, or lots of smokers, → political pressure for low cigarette taxes
o If so, the cig-only tax is endogenous
This reasoning doesn't apply to the general sales tax
→ use just one instrument, the general sales tax
10-81
The Demand for Cigarettes:
Summary of Empirical Results
10-83
Remaining threats to internal validity, ctd.
Remaining simultaneous causality bias?
o Not if the general sales tax is a valid instrument:
relevance?
exogeneity?
Errors-in-variables bias? Interesting question: are we
accurately measuring the price actually paid? What
about cross-border sales?
Selection bias? (no, we have all the states)
10-86
SurvivalDaysi = β0 + β1CardCathi + ui
10-87
Z = differential distance to CC hospital
o Relevant? If a CC hospital is far away, the patient won't be taken there and won't get CC
o Exogenous? If distance to the CC hospital doesn't affect survival, other than through its effect on CardCathi, then corr(distance,ui) = 0, so it is exogenous
o If patients' locations are random, then differential distance is "as if" randomly assigned.
o The 1st stage is a linear probability model: distance affects the probability of receiving treatment
o The 1st stage is a linear probability model: distance
affects the probability of receiving treatment
Results (McClellan, McNeil, Newhouse, JAMA, 1994):
o OLS estimates a significant and large effect of CC
o TSLS estimates a small, often insignificant effect
10-88
Summary: IV Regression
(SW Section 10.6)
A valid instrument lets us isolate a part of X that is
uncorrelated with u, and that part can be used to
estimate the effect of a change in X on Y
IV regression hinges on having valid instruments:
(1) Relevance: check via first-stage F
(2) Exogeneity: Test overidentifying restrictions
via the J-statistic
A valid instrument isolates variation in X that is “as if”
randomly assigned.
The critical requirement of at least m valid instruments
cannot be tested – you must use your head.
10-89
Experiments and Quasi-Experiments
(SW Chapter 11)
11-2
Different types of experiments: three examples
11-4
Idealized Experiments and Causal Effects
(SW Section 11.1)
An ideal randomized controlled experiment randomly
assigns subjects to treatment and control groups.
More generally, the treatment level X is randomly
assigned:
Yi = β0 + β1Xi + ui
11-6
Potential Problems with Experiments in Practice
(SW Section 11.2)
11-7
Threats to internal validity, ctd.
11-8
Threats to internal validity, ctd.
3. Experimental effects
experimenter bias (conscious or subconscious): treatment X is associated with "extra effort" or "extra care," so corr(X,u) ≠ 0
subject behavior might be affected by being in an experiment, so corr(X,u) ≠ 0 (Hawthorne effect)
11-10
Threats to External Validity
1. Nonrepresentative sample
2. Nonrepresentative “treatment” (that is, program or
policy)
3. General equilibrium effects (the effect of a program can depend on its scale; e.g., admissions counseling)
4. Treatment v. eligibility effects (which do you want to measure: the effect on those who take the program, or the effect on those who are eligible?)
11-11
Regression Estimators of Causal Effects Using
Experimental Data
(SW Section 11.3)
Focus on the case that X is binary (treatment/control).
Often you observe subject characteristics, W1i,…,Wri.
Extensions of the differences estimator:
o can improve efficiency (reduce standard errors)
o can eliminate bias that arises when:
treatment and control groups differ
there is “conditional randomization”
there is partial compliance
These extensions involve methods we have already
seen – multiple regression, panel data, IV regression
11-12
Estimators of the Treatment Effect β1 using Experimental Data (X = 1 if treated, 0 if control)
11-15
β̂1^diffs-in-diffs = (Ȳ_treat,after – Ȳ_treat,before) – (Ȳ_control,after – Ȳ_control,before)
11-16
The differences-in-differences estimator, ctd.
ΔYi = β0 + β1Xi + ui
where
ΔYi = Yi^after – Yi^before
Xi = 1 if treated, = 0 otherwise
11-17
The differences-in-differences estimator, ctd.
Yit = β0 + β1Xit + β2Dit + β3Git + uit
where
t = 1 (before experiment), 2 (after experiment)
Dit = 0 for t = 1, = 1 for t = 2
Git = 0 for control group, = 1 for treatment group
Xit = Dit × Git = 1 if treated, = 0 otherwise; the interaction effect of being in the treatment group in the second period
β̂1 is the diffs-in-diffs estimator
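A minimal Stata sketch of this regression (hypothetical variable names: d = 1 in the second period, g = 1 for the treatment group):

. gen x = d*g;        interaction: in the treatment group in the second period
. reg y x d g, r;     the coefficient on x is the diffs-in-diffs estimate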
11-18
Including additional subject characteristics (W’s)
11-20
Estimation when there is partial compliance
Consider the diffs-in-diffs estimator, with X = actual treatment:
ΔYi = β0 + β1Xi + ui
Suppose there is partial compliance: some of the
treated don’t take the drug; some of the controls go to
job training anyway
Then X is correlated with u, and OLS is biased
Suppose the initial assignment, Z, is random
Then (1) corr(Z,X) ≠ 0 and (2) corr(Z,u) = 0
Thus β1 can be estimated by TSLS, with instrumental variable Z = initial assignment
This can be extended to W’s (included exog. variables)
11-21
Experimental Estimates of the Effect of Class Size Reduction: The Tennessee Class Size Experiment
(SW Section 11.4)
Project STAR (Student-Teacher Achievement Ratio)
4-year study, $12 million
Upon entering the school system, a student was
randomly assigned to one of three groups:
o regular class (22 – 25 students)
o regular class + aide
o small class (13 – 17 students)
regular class students re-randomized after first year to
regular or regular+aide
Y = Stanford Achievement Test scores
11-22
Deviations from experimental design
Partial compliance:
o 10% of students switched treatment groups because
of “incompatibility” and “behavior problems” – how
much of this was because of parental pressure?
o Newcomers: incomplete receipt of treatment for
those who move into district after grade 1
Attrition
o students move out of district
o students leave for private/religious schools
11-23
Regression analysis
11-24
Differences estimates (no W’s)
11-25
11-26
How big are these estimated effects?
Put on same basis by dividing by std. dev. of Y
Units are now standard deviations of test scores
11-27
How do these estimates compare to those from the
California, Mass. observational studies? (Ch. 4 – 7)
11-28
Summary: The Tennessee Class Size Experiment
Main findings:
The effects are small quantitatively (same size as
gender difference)
Effect is sustained but not cumulative or increasing
biggest effect at the youngest grades
11-29
What is the Difference Between a Control Variable
and the Variable of Interest?
(SW App. 11.3)
11-30
11-31
Example: “free lunch eligible,” ctd.
Coefficient on “free lunch eligible” is large, negative,
statistically significant
Policy interpretation: Making students ineligible for a
free school lunch will improve their test scores.
Why (precisely) can we interpret the coefficient on
SmallClass as an unbiased estimate of a causal effect,
but not the coefficient on “free lunch eligible”?
This is not an isolated example!
o Other “control variables” we have used: gender,
race, district income, state fixed effects, time fixed
effects, city (or state) population,…
What is a “control variable” anyway?
11-32
Simplest case: one X, one control variable W
Yi = β0 + β1Xi + β2Wi + ui
For example,
W = free lunch eligible (binary)
X = small class/large class (binary)
Suppose random assignment of X depends on W
o for example, 60% of free-lunch eligibles get a small class, 40% of ineligibles get a small class
o note: this wasn't the actual STAR randomization procedure – this is a hypothetical example
Further suppose W is correlated with u
11-33
Yi = β0 + β1Xi + β2Wi + ui
Suppose:
The control variable W is correlated with u
Given W = 0 (ineligible), X is randomly assigned
Given W = 1 (eligible), X is randomly assigned.
Then:
Given the value of W, X is randomly assigned;
That is, controlling for W, X is randomly assigned;
Thus, controlling for W, X is uncorrelated with u
Moreover, E(u|X,W) doesn’t depend on X
That is, we have conditional mean independence:
E(u|X,W) = E(u|W)
11-34
Implications of conditional mean independence
Yi = β0 + β1Xi + β2Wi + ui
11-35
Implications of conditional mean independence:
Suppose also that E(ui|Wi) is linear in Wi: E(ui|Wi) = γ0 + γ1Wi (a simplifying assumption used for this derivation). Then the conditional mean of Y given X and W is
$$E(Y_i|X_i,W_i) = (\beta_0+\gamma_0) + \beta_1 X_i + (\beta_2+\gamma_1)W_i$$
The effect of a change in X under conditional mean independence is the desired causal effect:
$$E(Y_i|X_i = x+\Delta x, W_i) - E(Y_i|X_i = x, W_i) = \beta_1\Delta x$$
or
$$\beta_1 = \frac{E(Y_i|X_i = x+\Delta x, W_i) - E(Y_i|X_i = x, W_i)}{\Delta x}$$
If X is binary (treatment/control), this becomes:
$$\beta_1 = E(Y_i|X_i = 1, W_i) - E(Y_i|X_i = 0, W_i)$$
which is the desired treatment effect.
11-36
Implications of conditional mean independence, ctd.
Yi = β0 + β1Xi + β2Wi + ui
Then:
The OLS estimator β̂1 is unbiased.
β̂2 is not consistent for a causal effect and is not meaningful (from the derivation above, it converges to β2 + γ1).
The usual inference methods (standard errors, hypothesis tests, etc.) apply to β̂1.
11-37
So, what is a control variable?
Two cases:
(a) Treatment (X) is “as if” randomly assigned (OLS)
(b) A variable (Z) that influences treatment (X) is
“as if” randomly assigned (IV)
11-42
Two types of quasi-experiments
11-54
OLS with Heterogeneous Causal Effects
11-55
The math: suppose X is binary and E(ui|Xi) = 0.
Then
β̂1 = Ȳ^treated – Ȳ^control
For the treated:
E(Yi|Xi=1) = β0 + E(β1iXi|Xi=1) + E(ui|Xi=1) = β0 + E(β1i|Xi=1)
For the controls:
E(Yi|Xi=0) = β0 + E(β1iXi|Xi=0) + E(ui|Xi=0) = β0
Thus:
β̂1 →p E(Yi|Xi=1) – E(Yi|Xi=0) = E(β1i|Xi=1)
= average effect of the treatment on the treated
11-56
OLS with heterogeneous treatment effects: general X
with E(ui|Xi) = 0
$$\hat\beta_1 = \frac{s_{XY}}{s_X^2} \xrightarrow{p} \frac{\sigma_{XY}}{\sigma_X^2} = \frac{\mathrm{cov}(\beta_0 + \beta_{1i}X_i + u_i,\; X_i)}{\mathrm{var}(X_i)}$$
$$= \frac{\mathrm{cov}(\beta_0,X_i) + \mathrm{cov}(\beta_{1i}X_i,X_i) + \mathrm{cov}(u_i,X_i)}{\mathrm{var}(X_i)} = \frac{\mathrm{cov}(\beta_{1i}X_i,X_i)}{\mathrm{var}(X_i)} \quad (\text{because } \mathrm{cov}(u_i,X_i)=0)$$
If X is binary, this simplifies to the "effect of treatment on the treated"
Without heterogeneity, β1i = β1 and β̂1 →p β1
In general, the treatment effects of individuals with large values of X are given the most weight
11-57
(b) Now make a stronger assumption: that X is randomly
assigned (experiment or quasi-experiment). Then
what does OLS actually estimate?
If Xi is randomly assigned, it is distributed independently of β1i, so there is no difference between the population of controls and the population in the treatment group
Thus the effect of treatment on the treated = the
average treatment effect in the population.
11-58
The math:
$$\hat\beta_1 \xrightarrow{p} \frac{\mathrm{cov}(\beta_{1i}X_i,X_i)}{\mathrm{var}(X_i)} = E\!\left[\frac{\mathrm{cov}(\beta_{1i}X_i,X_i \mid \beta_{1i})}{\mathrm{var}(X_i)}\right] = E\!\left[\beta_{1i}\frac{\mathrm{cov}(X_i,X_i)}{\mathrm{var}(X_i)}\right] = E\!\left[\beta_{1i}\frac{\mathrm{var}(X_i)}{\mathrm{var}(X_i)}\right] = E(\beta_{1i})$$
(using the independence of Xi and β1i)
Summary
If Xi and β1i are independent (Xi is randomly assigned), OLS estimates the average treatment effect.
If Xi is not randomly assigned but E(ui|Xi) = 0, OLS estimates the effect of treatment on the treated (when X is binary).
Without heterogeneity, the effect of treatment on the treated and the average treatment effect are the same
11-59
IV Regression with Heterogeneous Causal Effects
11-60
IV with heterogeneous causal effects, ctd.
Intuition:
Suppose the first-stage coefficients π1i were known. If for some people π1i = 0, then their predicted value of Xi wouldn't depend on Z, so the IV estimator would ignore them.
The IV estimator puts most of the weight on individuals for whom Z has a large influence on X.
TSLS measures the treatment effect for those whose probability of treatment is most influenced by Z.
11-61
The math…
Yi = β0 + β1iXi + ui (equation of interest)
Xi = π0 + π1iZi + vi (first stage of TSLS)
11-62
When there are heterogeneous causal effects, what
TSLS estimates depends on the choice of instruments!
With different instruments, TSLS estimates different
weighted averages!!!
Suppose you have two instruments, Z1 and Z2.
o In general these instruments will be influential for
different members of the population.
o Using Z1, TSLS will estimate the treatment effect for
those people whose probability of treatment (X) is
most influenced by Z1
o The treatment effect for those most influenced by Z1
might differ from the treatment effect for those most
influenced by Z2
11-63
When does TSLS estimate the average causal effect?
Yi = β0 + β1iXi + ui (equation of interest)
Xi = π0 + π1iZi + vi (first stage of TSLS)

$$\hat\beta_1^{TSLS} \xrightarrow{p} \frac{E(\beta_{1i}\pi_{1i})}{E(\pi_{1i})}$$
11-64
Example: Cardiac catheterization
Yi = survival time (days) for AMI patients
Xi = received cardiac catheterization (or not)
Zi = differential distance to CC hospital
Equation of interest:
SurvivalDaysi = β0 + β1iCardCathi + ui
First stage (linear probability model):
CardCathi = π0 + π1iDistancei + vi
11-67
OLS with Heterogeneous Causal Effects

X is:     Relation between Xi and ui:                    Then OLS estimates:
binary    E(ui|Xi) = 0                                   effect of treatment on the treated: E(β1i|Xi=1)
binary    X randomly assigned (so Xi, ui independent)    average causal effect E(β1i)
general   E(ui|Xi) = 0                                   weighted average of β1i, placing most weight on those with large |Xi – X̄|
general   X randomly assigned                            average causal effect E(β1i)

Without heterogeneity, β1i = β1 and β̂1 →p β1 in all these cases.
11-68
TSLS with Heterogeneous Causal Effects
TSLS estimates the causal effect for those individuals for whom Z is most influential (those with large π1i).
What TSLS estimates depends on the choice of Z!!
In CC example, these were the individuals for whom
the decision to drive to a CC lab was heavily
influenced by the extra distance (those patients for
whom the EMT was otherwise “on the fence”)
Thus TSLS also estimates a causal effect: the average
effect of treatment on those most influenced by the
instrument
o In general, this is neither the average causal effect
nor the effect of treatment on the treated
11-69
Summary: Experiments and Quasi-Experiments
(SW Section 11.8)
Experiments:
Average causal effects are defined as expected values
of ideal randomized controlled experiments
Actual experiments have threats to internal validity
These threats to internal validity can be addressed (in
part) by:
o panel methods (differences-in-differences)
o multiple regression
o IV (using initial assignment as an instrument)
11-70
Summary, ctd.
Quasi-experiments:
Quasi-experiments have an “as-if” randomly assigned
source of variation.
This as-if random variation can generate:
o Xi which satisfies E(ui|Xi) = 0 (so estimation
proceeds using OLS); or
o instrumental variable(s) which satisfy E(ui|Zi) = 0
(so estimation proceeds using TSLS)
Quasi-experiments also have threats to internal validity
11-71
Summary, ctd.
11-73
Introduction to Time Series Regression and
Forecasting
(SW Chapter 12)
12-1
Example #1 of time series data: US rate of inflation
12-2
Example #2: US rate of unemployment
12-3
Why use time series data?
To develop forecasting models
o What will the rate of inflation be next year?
To estimate dynamic causal effects
o If the Fed increases the Federal Funds rate now,
what will be the effect on the rates of inflation and
unemployment in 3 months? in 12 months?
o What is the effect over time on cigarette consumption of a hike in the cigarette tax?
Plus, sometimes you don’t have any choice…
o Rates of inflation and unemployment in the US can
be observed only over time.
12-4
Time series data raises new technical issues
Time lags
Correlation over time (serial correlation or
autocorrelation)
Forecasting models that have no causal interpretation
(specialized tools for forecasting):
o autoregressive (AR) models
o autoregressive distributed lag (ADL) models
Conditions under which dynamic effects can be
estimated, and how to estimate them
Calculation of standard errors when the errors are
serially correlated
12-5
Using Regression Models for Forecasting
(SW Section 12.1)
12-8
Example: Quarterly rate of inflation at an annual rate
CPI in the first quarter of 1999 (1999:I) = 164.87
CPI in the second quarter of 1999 (1999:II) = 166.03
Percentage change in CPI, 1999:I to 1999:II
= 100 × (166.03 – 164.87)/164.87 = 100 × (1.16/164.87) = 0.703%
Percentage change in CPI, 1999:I to 1999:II, at an annual rate = 4 × 0.703 = 2.81% (percent per year)
Like interest rates, inflation rates are (as a matter of convention) reported at an annual rate.
Using the logarithmic approximation to percent changes yields 4 × 100 × [log(166.03) – log(164.87)] = 2.80%
12-9
Example: US CPI inflation – its first lag and its change
CPI = Consumer price index (Bureau of Labor Statistics)
12-10
Autocorrelation
$$\hat\rho_j = \frac{\widehat{\mathrm{cov}}(Y_t,Y_{t-j})}{\widehat{\mathrm{var}}(Y_t)}$$
where
$$\widehat{\mathrm{cov}}(Y_t,Y_{t-j}) = \frac{1}{T-j-1}\sum_{t=j+1}^{T}(Y_t-\bar Y_{j+1,T})(Y_{t-j}-\bar Y_{1,T-j})$$
and Ȳ_{j+1,T} denotes the sample mean of Y computed over observations t = j+1,…,T.
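In Stata, sample autocorrelations can be computed with corrgram (a sketch, assuming the series dinf used below is in memory and the data have been tsset):

. corrgram dinf, lags(4);      first four sample autocorrelations of dinf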
12-14
The inflation rate is highly serially correlated (ρ̂1 = .85)
Last quarter’s inflation rate contains much information
about this quarter’s inflation rate
The plot is dominated by multiyear swings
But there are still surprise movements!
12-15
More examples of time series & transformations
12-16
More examples of time series & transformations, ctd.
12-17
Stationarity: a key idea for external validity of time
series regression
Stationarity says that the past is like the present and
the future, at least in a probabilistic sense.
12-18
Autoregressions
(SW Section 12.3)
Yt = β0 + β1Yt–1 + ut
12-20
Example: AR(1) model of the change in inflation
Estimated using data from 1962:I – 1999:IV (coefficients from the STATA output below):
ΔInf̂t = .02 – .21ΔInft–1
(.14) (.11)
12-21
Example: AR(1) model of inflation – STATA
First, let STATA know you are using time series data
generate time=q(1959q1)+_n-1;   _n is the observation no.
So this command creates a new variable time that has a special quarterly date format
12-22
Example: AR(1) model of inflation – STATA, ctd.
. gen lcpi = log(cpi); variable cpi is already in memory
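The steps that construct inflation and its change are not shown in this extract; presumably something along these lines (a sketch; the exact commands are assumed, and require the quarterly time variable to have been tsset):

. gen inf = 400*(lcpi - L.lcpi);   annualized quarterly inflation, percent per year
. gen dinf = D.inf;                dinf = the change in inflation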
12-23
Example: AR(1) model of inflation – STATA, ctd
. reg dinf L.dinf if tin(1962q1,1999q4), r;   Syntax: L.dinf is the first lag of dinf
------------------------------------------------------------------------------
| Robust
dinf | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
dinf |
L1 | -.2109525 .1059828 -1.99 0.048 -.4203645 -.0015404
_cons | .0188171 .1350643 0.14 0.889 -.2480572 .2856914
------------------------------------------------------------------------------
if tin(1962q1,1999q4)
STATA time series syntax for using only observations between 1962q1 and
1999q4 (inclusive).
This requires defining the time scale first, as we did above
12-24
Forecasts and forecast errors
A note on terminology:
A predicted value refers to the value of Y predicted
(using a regression) for an observation in the sample
used to estimate the regression – this is the usual
definition
A forecast refers to the value of Y forecasted for an
observation not in the sample used to estimate the
regression.
Predicted values are “in sample”
Forecasts are forecasts of the future – which cannot
have been used to estimate the regression.
12-25
Forecasts: notation
Yt|t–1 = forecast of Yt based on Yt–1, Yt–2, …, using the population (true unknown) coefficients
Ŷt|t–1 = forecast of Yt based on Yt–1, Yt–2, …, using the estimated coefficients, which were estimated using data through period t–1.
For an AR(1):
Yt|t–1 = β0 + β1Yt–1
Ŷt|t–1 = β̂0 + β̂1Yt–1, where β̂0 and β̂1 were estimated using data through period t–1.
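A sketch of producing such a forecast in Stata (assumes the quarterly tsset from above and that the dataset extends past 1999:IV; the forecast date is illustrative):

. reg dinf L.dinf if tin(1962q1,1999q4), r;      estimate using data through 1999:IV
. predict dinfhat, xb;                           fitted values betahat0 + betahat1*L.dinf
. list time dinf dinfhat if tin(2000q1,2000q1);  the 2000:I entry is a true forecast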
12-26
Forecast errors
12-27
The root mean squared forecast error (RMSFE)
12-28
Example: forecasting inflation using an AR(1)
12-29
The pth order autoregressive model (AR(p))
12-30
Example: AR(4) model of inflation
Estimated using data from 1962:I – 1999:IV (coefficients from the STATA output below):
ΔInf̂t = .02 – .21ΔInft–1 – .32ΔInft–2 + .19ΔInft–3 – .04ΔInft–4, R̄² = 0.21
(.12) (.10) (.09) (.08) (.10)
12-31
Example: AR(4) model of inflation – STATA
. reg dinf L(1/4).dinf if tin(1962q1,1999q4), r;
------------------------------------------------------------------------------
| Robust
dinf | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
dinf |
L1 | -.2078575 .09923 -2.09 0.038 -.4039592 -.0117558
L2 | -.3161319 .0869203 -3.64 0.000 -.4879068 -.144357
L3 | .1939669 .0847119 2.29 0.023 .0265565 .3613774
L4 | -.0356774 .0994384 -0.36 0.720 -.2321909 .1608361
_cons | .0237543 .1239214 0.19 0.848 -.2211434 .268652
------------------------------------------------------------------------------
12-32
Example: AR(4) model of inflation – STATA, ctd.
. test L2.dinf L3.dinf L4.dinf; L2.dinf is the second lag of dinf, etc.
( 1) L2.dinf = 0.0
( 2) L3.dinf = 0.0
( 3) L4.dinf = 0.0
F( 3, 147) = 6.43
Prob > F = 0.0004
12-33
Digression: we used ΔInf, not Inf, in the AR's. Why?
The AR(1) model in changes is
ΔInft = β0 + β1ΔInft–1 + ut
or
Inft – Inft–1 = β0 + β1(Inft–1 – Inft–2) + ut
or
Inft = Inft–1 + β0 + β1Inft–1 – β1Inft–2 + ut
so
Inft = β0 + (1+β1)Inft–1 – β1Inft–2 + ut
12-39
Example: dinf and unem – STATA
. reg dinf L(1/4).dinf L(1/4).unem if tin(1962q1,1999q4), r;
------------------------------------------------------------------------------
| Robust
dinf | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
dinf |
L1 | -.3629871 .0926338 -3.92 0.000 -.5460956 -.1798786
L2 | -.3432017 .100821 -3.40 0.001 -.5424937 -.1439096
L3 | .0724654 .0848729 0.85 0.395 -.0953022 .240233
L4 | -.0346026 .0868321 -0.40 0.691 -.2062428 .1370377
unem |
L1 | -2.683394 .4723554 -5.68 0.000 -3.617095 -1.749692
L2 | 3.432282 .889191 3.86 0.000 1.674625 5.189939
L3 | -1.039755 .8901759 -1.17 0.245 -2.799358 .719849
L4 | .0720316 .4420668 0.16 0.871 -.8017984 .9458615
_cons | 1.317834 .4704011 2.80 0.006 .3879961 2.247672
------------------------------------------------------------------------------
12-40
Example: ADL(4,4) model of inflation – STATA, ctd.
. test L2.dinf L3.dinf L4.dinf;
( 1) L2.dinf = 0.0
( 2) L3.dinf = 0.0
( 3) L4.dinf = 0.0
. test L.unem L2.unem L3.unem L4.unem;
( 1) L.unem = 0.0
( 2) L2.unem = 0.0
( 3) L3.unem = 0.0
( 4) L4.unem = 0.0
The null hypothesis that the coefficients on the lags of the unemployment
rate are all zero is rejected at the 1% significance level using the F-
statistic
12-41
The test of the joint hypothesis that none of the X’s is a
useful predictor, above and beyond lagged values of Y, is
called a Granger causality test
For example:
The effect of an increase in cigarette taxes on cigarette
consumption this year, next year, in 5 years;
The effect of a change in the Fed Funds rate on
inflation, this month, in 6 months, and 1 year;
The effect of a freeze in Florida on the price of orange
juice concentrate in 1 month, 2 months, 3 months…
23-1
The Orange Juice Data
(SW Section 13.1)
Data
Monthly, Jan. 1950 – Dec. 2000 (T = 612)
Price = price of frozen OJ (a sub-component of the
producer price index; US Bureau of Labor Statistics)
%ChgP = percentage change in price at an annual rate, so %ChgPt = 1200Δln(Pricet)
FDD = number of freezing degree-days during the month, recorded in Orlando FL
o Example: If November has 2 days with low temp < 32°, one at 30° and one at 25°, then FDDNov = (32 – 30) + (32 – 25) = 2 + 7 = 9
23-2
23-3
Initial OJ regression
23-4
Dynamic Causal Effects
(SW Section 13.2)
23-6
An alternative thought experiment:
Randomly give the same subject different treatments
(FDDt) at different times
Measure the outcome variable (%ChgPt)
The “population” of subjects consists of the same
subject (OJ market) but at different dates
If the “different subjects” are drawn from the same
distribution – that is, if Yt,Xt are stationary – then the
dynamic causal effect can be deduced by OLS
regression of Yt on lagged values of Xt.
This estimator (the regression of Yt on Xt and lags of Xt) is called the distributed lag estimator.
23-7
Dynamic causal effects and the distributed lag model
The distributed lag model is:
$$Y_t = \beta_0 + \beta_1 X_t + \beta_2 X_{t-1} + \cdots + \beta_{r+1} X_{t-r} + u_t$$
23-13
Heteroskedasticity and Autocorrelation-Consistent
(HAC) Standard Errors
(SW Section 13.4)
23-14
The math…
$$Y_t = \beta_0 + \beta_1 X_t + u_t$$

$$\hat\beta_1 = \frac{\frac{1}{T}\sum_{t=1}^{T}(X_t-\bar X)(Y_t-\bar Y)}{\frac{1}{T}\sum_{t=1}^{T}(X_t-\bar X)^2}$$

so…
23-15
$$\hat\beta_1 - \beta_1 = \frac{\frac{1}{T}\sum_{t=1}^{T}(X_t-\bar X)u_t}{\frac{1}{T}\sum_{t=1}^{T}(X_t-\bar X)^2} \quad \text{(this is SW App. 4.3)}$$

so, writing $v_t = (X_t-\bar X)u_t$,

$$\hat\beta_1 - \beta_1 \cong \frac{\frac{1}{T}\sum_{t=1}^{T} v_t}{\sigma_X^2} \quad \text{in large samples}$$

$$\mathrm{var}(\hat\beta_1) = \mathrm{var}\!\left(\frac{1}{T}\sum_{t=1}^{T} v_t\right)\Big/\,(\sigma_X^2)^2 \quad \text{(still SW App. 4.3)}$$

Consider the case T = 2:

$$\mathrm{var}\!\left(\frac{1}{2}\sum_{t=1}^{2} v_t\right) = \mathrm{var}[\tfrac{1}{2}(v_1+v_2)] = \tfrac{1}{4}[\mathrm{var}(v_1) + \mathrm{var}(v_2) + 2\,\mathrm{cov}(v_1,v_2)]$$
23-17
so
$$\mathrm{var}\!\left(\tfrac{1}{2}\sum_{t=1}^{2} v_t\right) = \tfrac{1}{4}[\mathrm{var}(v_1) + \mathrm{var}(v_2) + 2\,\mathrm{cov}(v_1,v_2)] = \tfrac{1}{2}\sigma_v^2 + \tfrac{1}{2}\rho_1\sigma_v^2 \qquad (\rho_1 = \mathrm{corr}(v_1,v_2))$$
$$= \tfrac{1}{2}\sigma_v^2 f_2, \quad \text{where } f_2 = 1+\rho_1$$
So the usual OLS variance formula, which ignores the covariance term, is off by the factor fT (which can be big!)
23-19
HAC Standard Errors
In general,
$$\mathrm{var}(\bar v) = \frac{\sigma_v^2}{T} f_T, \qquad f_T = 1 + 2\sum_{j=1}^{T-1}\left(\frac{T-j}{T}\right)\rho_j$$
The Newey-West HAC estimator truncates this sum at lag m and uses estimated autocorrelations:
$$\hat f_T = 1 + 2\sum_{j=1}^{m-1}\left(\frac{m-j}{m}\right)\tilde\rho_j$$
where ρ̃j is an estimator of corr(vt, vt–j) and m is the truncation parameter.
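The two tables below compare heteroskedasticity-robust and Newey-West HAC standard errors for the same regression. The commands are not shown in this extract; they were presumably of the following form (the lag choice here is an assumption, consistent with the m ≈ 7 rule computed in Section 13.6 below):

. reg dlpoj l1fdd, r;           robust (heteroskedasticity-only) SEs
. newey dlpoj l1fdd, lag(7);    Newey-West HAC SEs with truncation parameter m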
------------------------------------------------------------------------------
| Robust
dlpoj | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
l1fdd | .1529217 .0767206 1.99 0.047 .0022532 .3035903
_cons | -.2097734 .2071122 -1.01 0.312 -.6165128 .196966
------------------------------------------------------------------------------
------------------------------------------------------------------------------
| Newey-West
dlpoj | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
l1fdd | .1529217 .0781195 1.96 0.051 -.000494 .3063375
_cons | -.2097734 .2402217 -0.87 0.383 -.6815353 .2619885
------------------------------------------------------------------------------
OK, in this case the difference is small, but not always so!
23-23
Example: OJ and HAC estimators in STATA, ctd.
. global lfdd6 "fdd l1fdd l2fdd l3fdd l4fdd l5fdd l6fdd";
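(The command producing the next table is likewise not shown in the extract; presumably, with the same assumed lag choice:)

. newey dlpoj $lfdd6, lag(7);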
------------------------------------------------------------------------------
| Newey-West
dlpoj | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
fdd | .4693121 .1359686 3.45 0.001 .2022834 .7363407
l1fdd | .1430512 .0837047 1.71 0.088 -.0213364 .3074388
l2fdd | .0564234 .0561724 1.00 0.316 -.0538936 .1667404
l3fdd | .0722595 .0468776 1.54 0.124 -.0198033 .1643223
l4fdd | .0343244 .0295141 1.16 0.245 -.0236383 .0922871
l5fdd | .0468222 .0308791 1.52 0.130 -.0138212 .1074657
l6fdd | .0481115 .0446404 1.08 0.282 -.0395577 .1357807
_cons | -.6505183 .2336986 -2.78 0.006 -1.109479 -.1915578
------------------------------------------------------------------------------
23-24
Do I need to use HAC SEs when I estimate an AR or
an ADL model?
NO.
The problem to which HAC SEs are the solution arises
when ut is serially correlated
If ut is serially uncorrelated, then OLS SE’s are fine
In AR and ADL models, the errors are serially
uncorrelated if you have included enough lags of Y
o If you include enough lags of Y, then the error term
can’t be predicted using past Y, or equivalently by
past u – so u is serially uncorrelated
23-25
Estimation of Dynamic Causal Effects with Strictly
Exogenous Regressors
(SW Section 13.5)
X is strictly exogenous if E(ut|…,Xt+1,Xt,Xt–1, …) = 0
If X is strictly exogenous, there are more efficient
ways to estimate dynamic causal effects than by a
distributed lag regression.
o Generalized Least Squares (GLS)
o Autoregressive Distributed Lag (ADL)
But the condition of strict exogeneity is very strong,
so this condition is rarely plausible in practice.
So we won’t cover GLS or ADL estimation of
dynamic causal effects (Section 13.5 is optional)
23-26
Analysis of the OJ Price Data
(SW Section 13.6)
What r (the number of lags of FDD) to use?
How about 18? (the Goldilocks method)
What m (Newey-West truncation parameter) to use?
m = .75 × 612^(1/3) = 6.4 ≈ 7, so use m = 7
23-27
23-28
23-29
23-30
23-31
These dynamic multipliers were estimated using a
distributed lag model. Should we attempt to obtain more
efficient estimates using GLS or an ADL model?
23-32
When Can You Estimate Dynamic Causal Effects?
That is, When is Exogeneity Plausible?
(SW Section 13.7)
Examples:
1. Y = OJ prices, X = FDD in Orlando
2. Y = Australian exports, X = US GDP (effect of US
income on demand for Australian exports)
23-33
Examples, ctd.
23-34
Exogeneity, ctd.
23-35
Estimation of Dynamic Causal Effects: Summary
(SW Section 13.8)
23-36