Linear Regression
Kosuke Imai
Princeton University
Y_i = \alpha + \beta X_i + \epsilon_i \quad \text{where } \mathbb{E}(\epsilon_i) = 0
Data
Yi : outcome (response, dependent) variable
Xi : predictor (explanatory, independent) variable, covariate
Parameters
α: intercept
β: slope
\epsilon_i: error term, disturbance
\mathbb{E}(\epsilon_i) = 0 is not really an assumption because the intercept \alpha absorbs any nonzero mean of the error
which yields

\hat{\alpha} = \bar{Y} - \hat{\beta}\bar{X}

\hat{\beta} = \frac{\sum_{i=1}^n (Y_i - \bar{Y})(X_i - \bar{X})}{\sum_{i=1}^n (X_i - \bar{X})^2} \ \overset{p}{\longrightarrow}\ \frac{\mathrm{Cov}(X_i, Y_i)}{\mathbb{V}(X_i)} = \rho_{XY}\sqrt{\frac{\mathbb{V}(Y_i)}{\mathbb{V}(X_i)}}

Unbiasedness: \mathbb{E}(\hat{\beta}) - \beta = \mathbb{E}\{\mathbb{E}(\hat{\beta} - \beta \mid X)\} = 0
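A minimal numerical sketch of these least squares formulas (simulated data with hypothetical values of α and β; numpy assumed):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate from Y_i = alpha + beta * X_i + eps_i with E(eps_i) = 0
n, alpha, beta = 1000, 1.0, 2.0
X = rng.normal(size=n)
Y = alpha + beta * X + rng.normal(size=n)

# Least squares estimates, exactly as in the formulas above
beta_hat = np.sum((Y - Y.mean()) * (X - X.mean())) / np.sum((X - X.mean()) ** 2)
alpha_hat = Y.mean() - beta_hat * X.mean()

print(alpha_hat, beta_hat)  # close to (1.0, 2.0) by consistency
```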
[Figure: two residual plots, with residuals (vertical axis, roughly −500 to 2500) plotted against fitted values (horizontal axes, roughly 0–1200 and 200–1000).]
Unbiasedness of the variance estimator: \mathbb{E}\{\widehat{\mathbb{V}(\hat{\beta} \mid X)} \mid X\} = \mathbb{V}(\hat{\beta} \mid X)
(Unconditionally) unbiased: \mathbb{V}(\mathbb{E}(\hat{\beta} \mid X)) = 0 implies \mathbb{E}\{\widehat{\mathbb{V}(\hat{\beta} \mid X)}\} = \mathbb{E}\{\mathbb{V}(\hat{\beta} \mid X)\} = \mathbb{V}(\hat{\beta})
Prediction: \widehat{Y(x)} = \widehat{\mathbb{E}(Y \mid X_i = x)} = \hat{\alpha} + \hat{\beta}x
Forecasting a new observation: Y(x) = \widehat{Y(x)} + \epsilon_i
Variance (point estimate is still \widehat{Y(x)}):

\mathbb{V}(\widehat{Y(x)} \mid X) = \mathbb{V}(\hat{\alpha} \mid X) + x^2\,\mathbb{V}(\hat{\beta} \mid X) + 2x\,\mathrm{Cov}(\hat{\alpha}, \hat{\beta} \mid X)

where \mathbb{V}(\hat{\alpha} \mid X) = \frac{\sigma^2 \sum_{i=1}^n X_i^2}{n \sum_{i=1}^n (X_i - \bar{X})^2} and \mathrm{Cov}(\hat{\alpha}, \hat{\beta} \mid X) = -\frac{\sigma^2 \bar{X}}{\sum_{i=1}^n (X_i - \bar{X})^2}
Inference under \epsilon_i \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2):

\frac{\hat{\beta} - \beta}{\text{s.e.}} = \underbrace{\frac{\hat{\beta} - \beta}{\sigma / \sqrt{\sum_{i=1}^n (X_i - \bar{X})^2}}}_{\sim N(0,1)} \Bigg/ \sqrt{\underbrace{\frac{(n-2)\hat{\sigma}^2}{\sigma^2}}_{\sim \chi^2_{n-2}} \times \frac{1}{n-2}} \ \sim\ t_{n-2}
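A hedged sketch of this t-based inference on simulated data (numpy and scipy assumed; the simulation settings are hypothetical):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, alpha, beta = 50, 1.0, 2.0
X = rng.normal(size=n)
Y = alpha + beta * X + rng.normal(size=n)

beta_hat = np.sum((Y - Y.mean()) * (X - X.mean())) / np.sum((X - X.mean()) ** 2)
alpha_hat = Y.mean() - beta_hat * X.mean()

resid = Y - alpha_hat - beta_hat * X
sigma2_hat = np.sum(resid ** 2) / (n - 2)           # sigma-hat^2 with n - 2 df
se = np.sqrt(sigma2_hat / np.sum((X - X.mean()) ** 2))

t_stat = (beta_hat - 0) / se                        # test H0: beta = 0
p_value = 2 * stats.t.sf(np.abs(t_stat), df=n - 2)  # two-sided, t_{n-2} reference
ci = beta_hat + np.array([-1, 1]) * stats.t.ppf(0.975, df=n - 2) * se
print(t_stat, p_value, ci)
```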
F distribution with \nu_1 and \nu_2 degrees of freedom: X \sim F(\nu_1, \nu_2)
Construction I: X = (Z_1/\nu_1)/(Z_2/\nu_2) where Z_1 \sim \chi^2_{\nu_1} and Z_2 \sim \chi^2_{\nu_2} are independent
Construction II (\nu_1 = 1): X = Y^2 where Y \sim t_{\nu_2} (Student's t distribution with \nu_2 degrees of freedom)
Construction III: X = 1/Y where Y \sim F(\nu_2, \nu_1)
Chi-square limit: \nu_1 X \overset{d}{\longrightarrow} \chi^2_{\nu_1} as \nu_2 \to \infty
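A quick Monte Carlo check of Construction II (numpy and scipy assumed; the degrees of freedom and quantiles are arbitrary choices):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
nu1, nu2, B = 1, 10, 100_000

# Construction II (nu1 = 1): the square of a t_{nu2} draw is F(1, nu2)
y = stats.t.rvs(df=nu2, size=B, random_state=rng)
x = y ** 2

# Compare simulated quantiles with the F(1, nu2) quantile function
qs = [0.5, 0.9, 0.99]
print(np.quantile(x, qs))         # simulated
print(stats.f.ppf(qs, nu1, nu2))  # theoretical F(1, 10)
```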
Model-Based Asymptotic Inference
Consistency: \hat{\beta} \overset{p}{\longrightarrow} \beta
Asymptotic distribution and inference:

\sqrt{n}(\hat{\beta} - \beta) = \underbrace{\sqrt{n}\left(\frac{1}{n}\sum_{i=1}^n (X_i - \mathbb{E}(X_i))\epsilon_i + \frac{1}{n}\sum_{i=1}^n (\mathbb{E}(X_i) - \bar{X})\epsilon_i\right)}_{\overset{d}{\longrightarrow}\, N(0,\ \sigma^2 \mathbb{V}(X_i))} \times \underbrace{\left(\frac{1}{n}\sum_{i=1}^n (X_i - \bar{X})^2\right)^{-1}}_{\overset{p}{\longrightarrow}\, \mathbb{V}(X_i)^{-1}} \ \overset{d}{\longrightarrow}\ N\left(0, \frac{\sigma^2}{\mathbb{V}(X_i)}\right)

\frac{\hat{\beta} - \beta}{\text{s.e.}} \overset{d}{\longrightarrow} N(0, 1)
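A sketch of this asymptotic result by simulation (all settings hypothetical; numpy assumed): the sampling variance of \sqrt{n}(\hat{\beta} - \beta) should approach \sigma^2/\mathbb{V}(X_i).

```python
import numpy as np

rng = np.random.default_rng(3)
n, beta, sigma, B = 500, 2.0, 1.0, 2000
draws = np.empty(B)

for b in range(B):
    X = rng.uniform(0, 2, size=n)       # V(X_i) = (2 - 0)^2 / 12 = 1/3
    eps = rng.normal(0, sigma, size=n)  # homoskedastic errors
    Y = beta * X + eps                  # (intercept omitted for brevity)
    Xc = X - X.mean()
    beta_hat = np.sum(Xc * Y) / np.sum(Xc ** 2)
    draws[b] = np.sqrt(n) * (beta_hat - beta)

# Asymptotic variance sigma^2 / V(X_i) = 1 / (1/3) = 3
print(draws.var(), sigma ** 2 / (1 / 3))
```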
Bias of Model-Based Variance
Units: i = 1, 2, . . . , nj
Clusters of units: j = 1, 2, . . . , m
Treatment at cluster level: Tj ∈ {0, 1}
Outcome: Yij = Yij (Tj )
Random assignment: (Y_{ij}(1), Y_{ij}(0)) \perp\!\!\!\perp T_j
No interference between units of different clusters
Possible interference between units of the same cluster
Estimands at unit level:

\text{SATE} \equiv \frac{1}{\sum_{j=1}^m n_j} \sum_{j=1}^m \sum_{i=1}^{n_j} (Y_{ij}(1) - Y_{ij}(0))

Model-based (i.i.d.) variance of the difference-in-means estimator \hat{\tau}:

\mathbb{V}(\hat{\tau}) = \frac{\sigma_1^2}{m_1 n} + \frac{\sigma_0^2}{m_0 n}

Design-based evaluation:

\text{Finite Sample Bias} = -\left(\frac{\mathbb{V}(Y_j(1))}{m_1^2} + \frac{\mathbb{V}(Y_j(0))}{m_0^2}\right)
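A simulation sketch of this bias (a hypothetical data-generating process with cluster random effects; numpy assumed): the i.i.d. model-based variance understates the true design-based sampling variance of \hat{\tau} when outcomes are correlated within clusters.

```python
import numpy as np

rng = np.random.default_rng(4)
m, n_per, B = 20, 30, 2000          # m clusters, n_per units each
tau_hats, model_vars = np.empty(B), np.empty(B)

for b in range(B):
    u = rng.normal(0, 1, size=m)                     # cluster random effects
    y0 = u[:, None] + rng.normal(0, 1, size=(m, n_per))
    y1 = y0 + 1.0                                    # constant unit-level effect
    T = rng.permutation(np.repeat([0, 1], m // 2))   # cluster-level assignment
    y_obs = np.where(T[:, None] == 1, y1, y0)
    t1, t0 = y_obs[T == 1].ravel(), y_obs[T == 0].ravel()
    tau_hats[b] = t1.mean() - t0.mean()
    # i.i.d. model-based variance: sigma1^2/(m1 n) + sigma0^2/(m0 n)
    model_vars[b] = t1.var(ddof=1) / t1.size + t0.var(ddof=1) / t0.size

print(tau_hats.var())    # true (design-based) sampling variance
print(model_vars.mean()) # average model-based estimate: too small here
```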
[Figure I (Lee, Moretti, and Butler 2004): Total Effect of Initial Win on Future ADA Scores (γ). ADA scores at time t+1 are plotted against the Democrat vote share at time t. Each circle is the average ADA score within a 0.01 interval of the Democrat vote share. Solid lines are fitted values from fourth-order polynomial regressions on either side of the discontinuity; dotted lines are pointwise 95 percent confidence intervals. The discontinuity gap estimates the total effect γ.]
[Figure V: Specification Test: Similarity of Historical Voting Patterns between Bare Democrat and Republican Districts. Lagged ADA scores (time t − 1) are plotted against the Democrat vote share at time t, where t and t − 1 refer to congressional sessions; each point is the average lagged ADA score within a 0.01 interval of the Democrat vote share. The gap is estimated from a fourth-order polynomial in vote share fitted separately above and below the 50 percent threshold; the dotted line is the 95 percent confidence interval.]
Placebo test for natural experiments
What is a good placebo?
"Affect" versus "Elect"
If the regressors are orthogonal (X^{\top}Z = 0):

\text{SSR} = \langle Y - \beta X - \gamma Z,\ Y - \beta X - \gamma Z \rangle = \langle Y - \beta X,\ Y - \beta X \rangle + \langle Y - \gamma Z,\ Y - \gamma Z \rangle - \langle Y, Y \rangle

γ does NOT have a causal interpretation
Assumptions and Interpretation
rank(X) = dim(S(X))
Definition: Av = λv
1 a square matrix A
2 a non-zero column vector v : an eigenvector of A
3 a scalar λ: an eigenvalue of A associated with v
[Figure: geometric illustration of eigenvectors (v_1, v_2 mapped to \lambda_1 v_1, \lambda_2 v_2) and singular vectors (v_1, v_2 mapped to d_1 u_1, d_2 u_2).]
An n × K matrix of predictors: X
"Centered" predictors: \tilde{\mathbf{X}}, where \tilde{X}_k = X_k - \text{mean}(X_k)
Sample covariance matrix: \frac{1}{n}\tilde{\mathbf{X}}^{\top}\tilde{\mathbf{X}}
Singular value decomposition \tilde{\mathbf{X}} = \mathbf{U}\mathbf{D}\mathbf{V}^{\top} gives

\tilde{\mathbf{X}}^{\top}\tilde{\mathbf{X}} = \mathbf{V}\mathbf{D}^2\mathbf{V}^{\top}

Principal components z_k = \tilde{\mathbf{X}}v_k satisfy

\mathbb{V}(z_1) = \frac{d_1^2}{n} \geq \mathbb{V}(z_2) = \frac{d_2^2}{n} \geq \cdots \geq \mathbb{V}(z_K) = \frac{d_K^2}{n}
Frequently used to summarize multiple measurements
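A compact sketch of principal components via the SVD (simulated correlated predictors; numpy assumed):

```python
import numpy as np

rng = np.random.default_rng(5)
n, K = 200, 4
X = rng.normal(size=(n, K)) @ rng.normal(size=(K, K))  # correlated predictors

Xt = X - X.mean(axis=0)                            # "centered" predictors X-tilde
U, d, Vt = np.linalg.svd(Xt, full_matrices=False)  # X-tilde = U D V'

Z = Xt @ Vt.T                  # principal components z_k = X-tilde v_k
print(Z.var(axis=0, ddof=0))   # equals d_k^2 / n, in decreasing order
print(d ** 2 / n)
```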
Model parameters: β
Estimates: β̂
Predicted (fitted) values: \hat{Y} = \mathbf{X}\hat{\beta}
Residuals: \hat{\epsilon} = Y - \hat{Y} = Y - \mathbf{X}\hat{\beta}
Minimize the sum of squared residuals:

\text{SSR} = \sum_{i=1}^n \hat{\epsilon}_i^2 = \langle \hat{\epsilon}, \hat{\epsilon} \rangle = \|\hat{\epsilon}\|^2

which yields

\hat{\beta} = (\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}Y

Computation via SVD: \mathbf{X}^{\top}\mathbf{X} = \mathbf{V}\mathbf{D}^2\mathbf{V}^{\top} and (\mathbf{X}^{\top}\mathbf{X})^{-1} = \mathbf{V}\mathbf{D}^{-2}\mathbf{V}^{\top}
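A sketch comparing the normal-equations solution with the SVD-based computation (simulated data; numpy assumed):

```python
import numpy as np

rng = np.random.default_rng(6)
n, K = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
beta = np.array([1.0, 2.0, -0.5])
Y = X @ beta + rng.normal(size=n)

# Direct normal-equations solution: beta-hat = (X'X)^{-1} X'Y
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Computation via SVD: X = U D V', so beta-hat = V D^{-1} U'Y
U, d, Vt = np.linalg.svd(X, full_matrices=False)
beta_svd = Vt.T @ ((U.T @ Y) / d)
print(np.allclose(beta_hat, beta_svd))  # True
```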
Conditional unbiasedness: \mathbb{E}(\hat{\beta} \mid \mathbf{X}) = \beta
Unconditional unbiasedness: \mathbb{E}(\hat{\beta}) = \mathbb{E}\{\mathbb{E}(\hat{\beta} \mid \mathbf{X})\} = \beta
Conditional variance: \mathbb{V}(\hat{\beta} \mid \mathbf{X}) = \sigma^2(\mathbf{X}^{\top}\mathbf{X})^{-1}
Unconditional variance: \mathbb{V}(\hat{\beta}) = \mathbb{E}\{\mathbb{V}(\hat{\beta} \mid \mathbf{X})\} since \mathbb{V}\{\mathbb{E}(\hat{\beta} \mid \mathbf{X})\} = 0
Orthogonal decomposition: Y = \mathbf{P}_{\mathbf{X}}Y + \mathbf{M}_{\mathbf{X}}Y, where \mathbf{P}_{\mathbf{X}} = \mathbf{X}(\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top} and \mathbf{M}_{\mathbf{X}} = \mathbf{I}_n - \mathbf{P}_{\mathbf{X}}
Symmetric: \mathbf{P}_{\mathbf{X}} = \mathbf{P}_{\mathbf{X}}^{\top} and \mathbf{M}_{\mathbf{X}} = \mathbf{M}_{\mathbf{X}}^{\top}
Idempotent: \mathbf{P}_{\mathbf{X}}\mathbf{P}_{\mathbf{X}} = \mathbf{P}_{\mathbf{X}} and \mathbf{M}_{\mathbf{X}}\mathbf{M}_{\mathbf{X}} = \mathbf{M}_{\mathbf{X}}
Annihilator: \mathbf{M}_{\mathbf{X}}\mathbf{P}_{\mathbf{X}} = \mathbf{P}_{\mathbf{X}}\mathbf{M}_{\mathbf{X}} = \mathbf{0}
\hat{\epsilon} = Y - \mathbf{X}\hat{\beta} = \mathbf{M}_{\mathbf{X}}Y
Orthogonality (see the verification sketch after this list):
1 \langle \hat{Y}, \hat{\epsilon} \rangle = \langle \mathbf{P}_{\mathbf{X}}Y, \mathbf{M}_{\mathbf{X}}Y \rangle = 0
2 \langle X_k, \hat{\epsilon} \rangle = \langle \mathbf{P}_{\mathbf{X}}X_k, \mathbf{M}_{\mathbf{X}}Y \rangle = 0 for any column X_k of \mathbf{X}
3 More generally, x \cdot \hat{\epsilon} = 0 for any x \in S(\mathbf{X})
Zero mean: \sum_{i=1}^n \hat{\epsilon}_i = 0 since x_1 = (1, \ldots, 1)^{\top} \in S(\mathbf{X})
(Sample and population) correlation between \hat{\epsilon}_i and x_{ik} is 0 for any column k
Does not imply \mathbb{E}(\epsilon \mid \mathbf{X}) = 0, \mathbb{E}(x_k \cdot \epsilon) = 0, or \mathrm{Cor}(\epsilon_i, x_{ik}) = 0
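The verification sketch promised above: numerically checking the properties of \mathbf{P}_{\mathbf{X}} and \mathbf{M}_{\mathbf{X}} on simulated data (numpy assumed):

```python
import numpy as np

rng = np.random.default_rng(7)
n, K = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
Y = rng.normal(size=n)

P = X @ np.linalg.solve(X.T @ X, X.T)  # projection onto S(X)
M = np.eye(n) - P                      # annihilator

resid = M @ Y
print(np.allclose(P, P.T), np.allclose(P @ P, P))  # symmetric, idempotent
print(np.allclose(M @ P, 0))                       # annihilator: M P = 0
print(np.allclose(X.T @ resid, 0))                 # <X_k, eps-hat> = 0
print(np.isclose(resid.sum(), 0))                  # zero mean (intercept in X)
```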
Pythagoras' Theorem: \|Y\|^2 = \|\hat{Y}\|^2 + \|\hat{\epsilon}\|^2
Note \bar{Y} = \bar{X}\hat{\beta}
Thus, \mathbb{V}(Y_i) = \mathbb{V}(X_i\hat{\beta}) + \mathbb{V}(\hat{\epsilon}_i) (holds even in sample)
\widehat{\mathbb{V}(\hat{\beta} \mid \mathbf{X})} = \hat{\sigma}^2(\mathbf{X}^{\top}\mathbf{X})^{-1} \quad \text{where } \hat{\sigma}^2 = \frac{\|\hat{\epsilon}\|^2}{n - K}

Unbiasedness: \mathbb{E}\{\widehat{\mathbb{V}(\hat{\beta} \mid \mathbf{X})} \mid \mathbf{X}\} = \mathbb{V}(\hat{\beta} \mid \mathbf{X})
Recall \hat{\epsilon} = \mathbf{M}_{\mathbf{X}}Y = \mathbf{M}_{\mathbf{X}}(\mathbf{X}\beta + \epsilon) = \mathbf{M}_{\mathbf{X}}\epsilon
Then \|\hat{\epsilon}\|^2 = \|\mathbf{M}_{\mathbf{X}}\epsilon\|^2 = \epsilon^{\top}\mathbf{M}_{\mathbf{X}}\epsilon = \mathrm{trace}(\epsilon\epsilon^{\top}\mathbf{M}_{\mathbf{X}})
\epsilon^{\top}\mathbf{M}_{\mathbf{X}}\epsilon = \sum_i \sum_j \epsilon_i \epsilon_j M_{ij} where M_{ij} is the (i, j) element of \mathbf{M}_{\mathbf{X}}
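Taking conditional expectations completes the argument; a standard derivation assuming homoskedastic errors, \mathbb{E}(\epsilon\epsilon^{\top} \mid \mathbf{X}) = \sigma^2\mathbf{I}_n, and using the cyclic property of the trace, \mathrm{trace}(\mathbf{P}_{\mathbf{X}}) = \mathrm{trace}\{(\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathbf{X}\} = K:

```latex
\mathbb{E}(\|\hat{\epsilon}\|^2 \mid \mathbf{X})
  = \operatorname{trace}\{\mathbb{E}(\epsilon\epsilon^{\top} \mid \mathbf{X})\,\mathbf{M}_{\mathbf{X}}\}
  = \sigma^2 \operatorname{trace}(\mathbf{M}_{\mathbf{X}})
  = \sigma^2 \{n - \operatorname{trace}(\mathbf{P}_{\mathbf{X}})\}
  = \sigma^2 (n - K)
```

Dividing by n − K gives \mathbb{E}(\hat{\sigma}^2 \mid \mathbf{X}) = \sigma^2, which is the unbiasedness claim above.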
Prediction: \widehat{Y(x)} = \widehat{\mathbb{E}(Y_i \mid X_i = x)} = x^{\top}\hat{\beta}
Forecasting a new observation: Y(x) = \widehat{Y(x)} + \epsilon_i
Variance (point estimate is still \widehat{Y(x)}): \mathbb{V}(\widehat{Y(x)} \mid \mathbf{X}) = \sigma^2 x^{\top}(\mathbf{X}^{\top}\mathbf{X})^{-1}x
Standard error for \hat{\beta}_k: \text{s.e.} = \sqrt{\widehat{\mathbb{V}(\hat{\beta} \mid \mathbf{X})}_{kk}}
Sampling distribution:

\hat{\beta}_k - \beta_k \sim N(0,\ \sigma^2(\mathbf{X}^{\top}\mathbf{X})^{-1}_{kk})
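A sketch computing these standard errors directly from \hat{\sigma}^2(\mathbf{X}^{\top}\mathbf{X})^{-1} (simulated data; numpy assumed):

```python
import numpy as np

rng = np.random.default_rng(10)
n, K = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
Y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
resid = Y - X @ beta_hat
sigma2_hat = resid @ resid / (n - K)
V_hat = sigma2_hat * np.linalg.inv(X.T @ X)  # estimated V(beta-hat | X)
se = np.sqrt(np.diag(V_hat))                 # s.e. for each beta-hat_k
print(beta_hat, se)
```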
Inference under \epsilon_i \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2):

\frac{\hat{\beta}_k - \beta_k}{\text{s.e.}_k} = \underbrace{\frac{\hat{\beta}_k - \beta_k}{\sigma\sqrt{(\mathbf{X}^{\top}\mathbf{X})^{-1}_{kk}}}}_{\sim N(0,1)} \Bigg/ \sqrt{\underbrace{\frac{(n-K)\hat{\sigma}^2}{\sigma^2}}_{\sim \chi^2_{n-K}} \times \frac{1}{n-K}} \ \sim\ t_{n-K}

Testing linear restrictions \mathbf{A}\beta = c:

\frac{(\mathbf{A}\hat{\beta} - c)^{\top}\{\mathbf{A}(\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{A}^{\top}\}^{-1}(\mathbf{A}\hat{\beta} - c)}{\mathrm{rank}(\mathbf{A})\,\hat{\sigma}^2} \sim F_{\mathrm{rank}(\mathbf{A}),\, n-K}
Omitted variable: with Z = \mathbf{X}\hat{\delta} + \hat{\eta} (the sample regression of Z on \mathbf{X}), Y = \mathbf{X}(\beta + \gamma\hat{\delta}) + \gamma\hat{\eta} + \epsilon^*
ME in Y_i: Y_i = Y_i^* + e_i
Omitted variable problem: Y_i = X_i^{\top}\beta + e_i + \epsilon_i
ME in X_k: X_{ik} = X_{ik}^* + e_i
Classical ME assumption: \mathrm{Cov}(e_i, X_{ik}^*) = \mathrm{Cov}(e_i, X_{ik'}) = 0 for all k' \neq k, but \mathrm{Cov}(e_i, X_{ik}) \neq 0
Omitted variable problem: Y_i = X_i^{\top}\beta - \beta_k e_i + \epsilon_i
Attenuation bias: \hat{\beta}_k \overset{p}{\longrightarrow} \beta_k \frac{\mathbb{V}(X_{ik}^*)}{\mathbb{V}(X_{ik}^*) + \mathbb{V}(e_i)} \leq \beta_k
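A simulation sketch of attenuation bias under classical measurement error (hypothetical variances; numpy assumed):

```python
import numpy as np

rng = np.random.default_rng(8)
n, beta_k = 100_000, 2.0

x_star = rng.normal(0, 1, size=n)  # true predictor, V(X*) = 1
e = rng.normal(0, 1, size=n)       # classical measurement error, V(e) = 1
x_obs = x_star + e                 # observed, error-ridden predictor
y = beta_k * x_star + rng.normal(size=n)

xc = x_obs - x_obs.mean()
beta_hat = np.sum(xc * y) / np.sum(xc ** 2)

# Attenuation factor V(X*) / (V(X*) + V(e)) = 1/2
print(beta_hat, beta_k * 1 / (1 + 1))  # both close to 1.0
```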
(In-sample) expected prediction error:

\text{Error} \equiv \frac{1}{n}\sum_{i=1}^n \mathbb{E}\{(Y_i^{\text{new}} - X_i^{\top}\hat{\beta})^2 \mid \mathbf{X}, Y\}

where the expectation is taken over Y_i^{\text{new}} given all the data (\mathbf{X}, Y)
The Cp Statistic
Compare Error with the average SSR under the homoskedasticity assumption:

\mathbb{E}\left\{\text{Error} - \frac{1}{n}\text{SSR} \;\Big|\; \mathbf{X}\right\} = \frac{2}{n}\sum_{i=1}^n \mathbb{E}\{(Y_i - X_i^{\top}\beta)(X_i^{\top}\hat{\beta} - X_i^{\top}\beta) \mid \mathbf{X}\} = \frac{2}{n}\sum_{i=1}^n \mathrm{Cov}(Y_i, \hat{Y}_i \mid \mathbf{X})

where the expectation is taken with respect to Y given \mathbf{X}
One can show:

\frac{2}{n}\sum_{i=1}^n \mathrm{Cov}(Y_i, \hat{Y}_i \mid \mathbf{X}) = \frac{2}{n}\mathrm{trace}\{\mathrm{Cov}(Y, \mathbf{P}_{\mathbf{X}}Y \mid \mathbf{X})\} = \frac{2\sigma^2 K}{n}

The C_p statistic is defined as:

\widehat{\text{Error}} = \frac{1 + K/n}{n - K}\,\text{SSR}
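A sketch of the C_p computation (the function name cp_statistic is hypothetical; numpy assumed):

```python
import numpy as np

def cp_statistic(X, Y):
    """C_p estimate of in-sample prediction error: (1 + K/n) * SSR / (n - K)."""
    n, K = X.shape
    beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
    ssr = np.sum((Y - X @ beta_hat) ** 2)
    return (1 + K / n) * ssr / (n - K)

# Example: compare a smaller and a larger model on simulated data
rng = np.random.default_rng(9)
n = 200
X_full = np.column_stack([np.ones(n), rng.normal(size=(n, 4))])
Y = X_full[:, :3] @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)
print(cp_statistic(X_full[:, :3], Y))  # correctly specified model
print(cp_statistic(X_full, Y))         # includes irrelevant predictors
```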
Key Points