
Linear Regression

Kosuke Imai

Princeton University

POL572 Quantitative Analysis II


Spring 2016



Simple Linear Regression



Simple Linear Regression Model

The linear regression model with a single variable:

    Yi = α + β Xi + εi,   where E(εi) = 0

Data
  Yi: outcome (response, dependent) variable
  Xi: predictor (explanatory, independent) variable, covariate
Parameters
  α: intercept
  β: slope
  εi: error term, disturbance, residual
E(εi) = 0 is not really an assumption because we have α

Why simple regression? It’s simple and gives lots of intuition



Causal Interpretation
Association: you can always regress Yi on Xi and vice versa
The previous model written in terms of potential outcomes:

    Yi(Xi) = α + β Xi + εi,   where E(εi) = 0

Xi is the treatment variable
α = E(Yi(0))
β = Yi(1) − Yi(0) for all i ⇐⇒ constant additive unit causal effect

A more general model:

    Yi(Xi) = α + β Xi + εi(Xi),   where E(εi(1)) = E(εi(0)) = 0

Relax the assumption that εi = εi(1) = εi(0)
Yi(1) − Yi(0) = {α + β + εi(1)} − {α + εi(0)} = β + εi(1) − εi(0)
ATE: β = E(Yi(1) − Yi(0))
α = E(Yi(0)) as before
Assumptions

Exogeneity: E(εi | X) = E(εi) = 0 where X = (X1, X2, . . . , Xn)
  1 Orthogonality: E(εi Xj) = 0
  2 Zero correlation: Cov(εi, Xj) = 0
Homoskedasticity: V(εi | X) = V(εi) = σ²

Classical randomized experiments:
  1 Randomization of treatment: (Yi(1), Yi(0)) ⊥⊥ Xi for all i
    E(Yi(x) | Xi) = E(Yi(x)) ⇐⇒ E(εi(x) | Xi) = E(εi(x)) = 0
    E(Yi(x)) = E(Yi | Xi = x) = α + βx
  2 Random sampling of units:
    {Yi(1), Yi(0)} ⊥⊥ {Yj(1), Yj(0)} for any i ≠ j
    {εi(1), εi(0)} ⊥⊥ {εj(1), εj(0)} for any i ≠ j
  3 Variance of potential outcomes:
    V(εi(x)) = V(εi(x) | Xi) = V(Yi(x) | Xi) = V(Yi(x)) = σx²  for x = 0, 1



Least Squares Estimation
Model parameters: (α, β)
Estimates: (α̂, β̂)
Predicted (fitted) value: Ŷi = α̂ + β̂ Xi
Residual: ε̂i = Yi − Ŷi = Yi − α̂ − β̂ Xi
Minimize the sum of squared residuals (SSR):

    (α̂, β̂) = argmin_(α, β) Σ_{i=1}^n ε̂i²

which yields

    α̂ = Ȳ − β̂ X̄

    β̂ = Σ_{i=1}^n (Yi − Ȳ)(Xi − X̄) / Σ_{i=1}^n (Xi − X̄)²  →p  Cov(Xi, Yi) / V(Xi) = ρXY √( V(Yi) / V(Xi) )
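As a quick illustration (my own sketch, not from the slides), the closed-form formulas above can be computed directly with numpy; the simulated data-generating values here are arbitrary assumptions for the example:

```python
import numpy as np

# Hypothetical simulated data: Y = 1 + 2*X + noise
rng = np.random.default_rng(0)
n = 100
X = rng.normal(size=n)
Y = 1 + 2 * X + rng.normal(scale=0.5, size=n)

# Closed-form least squares estimates from the slide
beta_hat = np.sum((Y - Y.mean()) * (X - X.mean())) / np.sum((X - X.mean()) ** 2)
alpha_hat = Y.mean() - beta_hat * X.mean()

residuals = Y - alpha_hat - beta_hat * X
print(alpha_hat, beta_hat, np.sum(residuals**2))  # estimates and the SSR
```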



Unbiasedness of Least Squares Estimator

When Xi is binary, β̂ = the difference-in-means estimator
So, β̂ is unbiased from the design-based perspective too

Model-based estimation error:

    β̂ − β = Σ_{i=1}^n (Xi − X̄) εi / Σ_{i=1}^n (Xi − X̄)²

Thus, the exogeneity assumption implies

    E(β̂) − β = E{E(β̂ − β | X)} = 0

Similarly, α̂ − α = ε̄ − (β̂ − β) X̄

Thus, E(α̂) − α = 0
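A small Monte Carlo simulation can illustrate this unbiasedness; the following sketch (my own illustration, with arbitrary simulated parameter values) repeatedly draws data from the model and averages the least squares slopes:

```python
import numpy as np

# Monte Carlo check: the sampling distribution of beta-hat centers on beta
rng = np.random.default_rng(15)
n, n_sims, alpha, beta = 50, 5000, 1.0, 2.0
estimates = np.empty(n_sims)
for s in range(n_sims):
    X = rng.normal(size=n)
    Y = alpha + beta * X + rng.normal(size=n)   # exogenous errors
    estimates[s] = np.sum((Y - Y.mean()) * (X - X.mean())) / np.sum((X - X.mean()) ** 2)
print(estimates.mean())   # close to the true slope of 2.0
```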



Residuals
Estimated error term
Zero mean: Σ_{i=1}^n ε̂i = 0
Orthogonality: ⟨ε̂, X⟩ = ε̂ᵀX = Σ_{i=1}^n ε̂i Xi = 0
Sample correlation between ε̂i and Xi is exactly 0
Does not imply E(⟨ε, X⟩) = 0 or Cor(εi, Xi) = 0
Residual plot (Y = 2000 Buchanan votes, X = 1996 Perot votes):
[Figure: residual plots of residuals against fitted values, with Palm Beach (left panel) and without Palm Beach (right panel).]



The Coefficient of Determination

How much variation in Y does the model explain?

Total sum of squares:

    TSS = Σ_{i=1}^n (Yi − Ȳ)²

The coefficient of determination, or R²:

    R² ≡ (TSS − SSR) / TSS

0 ≤ R² ≤ 1
R² = 0 when β̂ = 0
R² = 1 when Yi = Ŷi for all i
Example: 0.85 (without Palm Beach) vs. 0.51 (with Palm Beach)
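For concreteness, a short numpy sketch (simulated data, not the Buchanan–Perot data) computes R² from TSS and SSR; in simple regression it coincides with the squared sample correlation:

```python
import numpy as np

rng = np.random.default_rng(16)
n = 100
X = rng.normal(size=n)
Y = 1 + 2 * X + rng.normal(size=n)

beta_hat = np.sum((Y - Y.mean()) * (X - X.mean())) / np.sum((X - X.mean()) ** 2)
alpha_hat = Y.mean() - beta_hat * X.mean()
resid = Y - alpha_hat - beta_hat * X

TSS = np.sum((Y - Y.mean()) ** 2)
SSR = np.sum(resid**2)
print((TSS - SSR) / TSS)              # R-squared
print(np.corrcoef(X, Y)[0, 1] ** 2)   # equals R-squared in simple regression
```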



Model-Based Variance and Its Estimator
The homoskedasticity assumption implies

    V(β̂ | X) = σ² / Σ_{i=1}^n (Xi − X̄)²

Standard model-based (conditional) variance estimator for β̂:

    V̂(β̂ | X) = σ̂² / Σ_{i=1}^n (Xi − X̄)²   where   σ̂² = (1/(n − 2)) Σ_{i=1}^n ε̂i²

(Conditionally) unbiased: E(σ̂² | X) = σ² implies

    E{V̂(β̂ | X) | X} = V(β̂ | X)

(Unconditionally) unbiased: V(E(β̂ | X)) = 0 implies

    V(β̂) = E{V(β̂ | X)} = E[E{V̂(β̂ | X) | X}] = E{V̂(β̂ | X)}
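A minimal numpy sketch of this variance estimator and the implied standard error, again on simulated data with arbitrary parameter values (my own illustration):

```python
import numpy as np

rng = np.random.default_rng(13)
n = 100
X = rng.normal(size=n)
Y = 1 + 2 * X + rng.normal(size=n)

beta_hat = np.sum((Y - Y.mean()) * (X - X.mean())) / np.sum((X - X.mean()) ** 2)
alpha_hat = Y.mean() - beta_hat * X.mean()
resid = Y - alpha_hat - beta_hat * X

sigma2_hat = np.sum(resid**2) / (n - 2)                   # sigma-hat squared
var_beta_hat = sigma2_hat / np.sum((X - X.mean()) ** 2)   # V-hat(beta-hat | X)
print(np.sqrt(var_beta_hat))                              # standard error of beta-hat
```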



Model-Based Prediction

(Estimated) expected value for Xi = x:

    Ŷ(x) = Ê(Y | Xi = x) = α̂ + β̂ x

Predicted value for Xi = x:

    Y(x) = Ŷ(x) + εi

Variance (the point estimate is still Ŷ(x)):

    V(Y(x) | X) = V(α̂ | X) + V(β̂ | X)·x² + 2·x·Cov(α̂, β̂ | X) + σ²
                = σ² ( Σ_{i=1}^n (Xi − x)² / (n Σ_{i=1}^n (Xi − X̄)²) + 1 )

where V(α̂ | X) = σ² Σ_{i=1}^n Xi² / (n Σ_{i=1}^n (Xi − X̄)²) and Cov(α̂, β̂ | X) = −σ² X̄ / Σ_{i=1}^n (Xi − X̄)²



Model-Based Finite Sample Inference
Standard error: s.e. = √V̂(β̂ | X)

Sampling distribution:

    β̂ − β ∼ N( 0, σ² / Σ_{i=1}^n (Xi − X̄)² )

Inference under εi ~ i.i.d. N(0, σ²):

    (β̂ − β)/s.e. = [ (β̂ − β) / (σ/√Σ_{i=1}^n (Xi − X̄)²) ] ÷ √{ [(n − 2)σ̂²/σ²] · [1/(n − 2)] } ∼ t_{n−2}

where (β̂ − β)/(σ/√Σ_{i=1}^n (Xi − X̄)²) ∼ N(0, 1) and (n − 2)σ̂²/σ² ∼ χ²_{n−2}
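The t-based test and confidence interval can be sketched as follows (my own illustration on simulated data; it assumes scipy is available for the t distribution):

```python
import numpy as np
from scipy import stats   # assumed available for t quantiles and tail probabilities

rng = np.random.default_rng(14)
n = 50
X = rng.normal(size=n)
Y = 1 + 0.3 * X + rng.normal(size=n)

beta_hat = np.sum((Y - Y.mean()) * (X - X.mean())) / np.sum((X - X.mean()) ** 2)
alpha_hat = Y.mean() - beta_hat * X.mean()
resid = Y - alpha_hat - beta_hat * X
sigma2_hat = np.sum(resid**2) / (n - 2)
se = np.sqrt(sigma2_hat / np.sum((X - X.mean()) ** 2))

t_stat = beta_hat / se                                  # test of H0: beta = 0
p_value = 2 * stats.t.sf(np.abs(t_stat), df=n - 2)      # two-sided p-value
ci = beta_hat + np.array([-1, 1]) * stats.t.ppf(0.975, df=n - 2) * se
print(t_stat, p_value, ci)
```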



Key Distributions for Linear Regression and Beyond
1 Chi-square distribution with ν degrees of freedom: X ∼ χ²_ν
  Construction I: X = Σ_{i=1}^ν Yi² where Yi ~ i.i.d. N(0, 1)
  Construction II: X = YᵀΣ⁻¹Y where Y ∼ N_ν(0, Σ)
  Mean and variance: E(X) = ν and V(X) = 2ν
  (X1 + X2) ∼ χ²_{ν1+ν2} if X1 ∼ χ²_{ν1} and X2 ∼ χ²_{ν2} are independent
2 Student's t distribution with ν degrees of freedom: X ∼ t_ν
  Construction: X = Y/√(Z/ν) if Y ∼ N(0, 1) and Z ∼ χ²_ν are independent
  Mean and variance: E(X) = 0 and V(X) = ν/(ν − 2)
  Cauchy distribution (ν = 1): E(X) and V(X) do not exist
  Normal distribution (ν = ∞): E(X) = 0 and V(X) = 1
3 F distribution with ν1 and ν2 degrees of freedom: X ∼ F(ν1, ν2)
  Construction I: X = (Y1/ν1)/(Y2/ν2) if Y1 ∼ χ²_{ν1} and Y2 ∼ χ²_{ν2} are independent
  Construction II (ν1 = 1): X = Y² where Y ∼ t_{ν2}
  Construction III: X = 1/Y where Y ∼ F(ν2, ν1)
  Chi-square limit: ν1 X →d χ²_{ν1} as ν2 → ∞
Model-Based Asymptotic Inference
Consistency: β̂ →p β

Asymptotic distribution and inference:

    √n (β̂ − β) = √n [ (1/n) Σ_{i=1}^n (Xi − E(Xi)) εi + (E(Xi) − X̄) (1/n) Σ_{i=1}^n εi ] × [ (1/n) Σ_{i=1}^n (Xi − X̄)² ]⁻¹

where the first factor →d N(0, σ² V(Xi)) and the second factor →p V(Xi)⁻¹, so that

    √n (β̂ − β) →d N( 0, σ² / V(Xi) )   and   (β̂ − β)/s.e. →d N(0, 1)
Bias of Model-Based Variance

The design-based perspective: use Neyman's exact variance

What is the bias of the model-based variance estimator?

Finite-sample bias:

    Bias = E( σ̂² / Σ_{i=1}^n (Xi − X̄)² ) − ( σ1²/n1 + σ0²/n0 )
         = (n1 − n0)(n − 1) / (n1 n0 (n − 2)) × (σ1² − σ0²)

Bias is zero when n1 = n0 or σ1² = σ0²
In general, the bias can be negative or positive and does not vanish asymptotically



Robust Standard Error
Suppose V(εi | X) = σ²(Xi) ≠ σ²
Heteroskedasticity-consistent robust variance estimator (more later):

    V̂((α̂, β̂) | X) = ( Σ_{i=1}^n X̃i X̃iᵀ )⁻¹ ( Σ_{i=1}^n ε̂i² X̃i X̃iᵀ ) ( Σ_{i=1}^n X̃i X̃iᵀ )⁻¹

where in this case X̃i = (1, Xi)ᵀ is a column vector of length 2

Model-based justification: asymptotically valid in the presence of heteroskedastic errors

Design-based evaluation:

    Finite-sample bias = E(V̂(β̂ | X)) − ( σ1²/n1 + σ0²/n0 ) = −( σ1²/n1² + σ0²/n0² )

The bias vanishes asymptotically
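A minimal numpy sketch of this sandwich estimator in its basic (HC0) form, assuming hypothetical simulated data with heteroskedastic errors; no finite-sample correction is applied:

```python
import numpy as np

# Hypothetical simulated data with heteroskedastic errors
rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)
y = 1 + 2 * x + rng.normal(scale=np.abs(x) + 0.5, size=n)

X = np.column_stack([np.ones(n), x])          # X-tilde_i = (1, X_i)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # OLS coefficients
resid = y - X @ beta_hat

bread = np.linalg.inv(X.T @ X)
meat = X.T @ (resid[:, None] ** 2 * X)        # sum_i e_i^2 * Xi Xi'
V_robust = bread @ meat @ bread               # sandwich variance estimate
robust_se = np.sqrt(np.diag(V_robust))
print(beta_hat, robust_se)
```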


Cluster Randomized Experiments

Units: i = 1, 2, . . . , nj
Clusters of units: j = 1, 2, . . . , m
Treatment at the cluster level: Tj ∈ {0, 1}
Outcome: Yij = Yij(Tj)
Random assignment: (Yij(1), Yij(0)) ⊥⊥ Tj
No interference between units of different clusters
Possible interference between units of the same cluster
Estimands at the unit level:

    SATE ≡ (1 / Σ_{j=1}^m nj) Σ_{j=1}^m Σ_{i=1}^{nj} (Yij(1) − Yij(0))

    PATE ≡ E(Yij(1) − Yij(0))

Random sampling of clusters and units

Design-Based Inference
For simplicity, assume the following:
  1 equal cluster size, i.e., nj = n for all j
  2 we observe all units in a selected cluster (no sampling of units)

The difference-in-means estimator:

    τ̂ ≡ (1/m1) Σ_{j=1}^m Tj Ȳj − (1/m0) Σ_{j=1}^m (1 − Tj) Ȳj

where Ȳj ≡ Σ_{i=1}^n Yij / n

Easy to show E(τ̂ | O) = SATE and thus E(τ̂) = PATE

Exact population variance:

    V(τ̂) = V(Ȳj(1))/m1 + V(Ȳj(0))/m0



Intracluster Correlation Coefficient

Comparison with the standard variance:

    V(τ̂) = σ1²/(m1 n) + σ0²/(m0 n)

Correlation of potential outcomes across units within a cluster:

    V(Ȳj(t)) = V( (1/n) Σ_{i=1}^n Yij(t) )
             = (1/n²) [ Σ_{i=1}^n V(Yij(t)) + Σ_{i≠i′} Cov(Yij(t), Yi′j(t)) ]
             = (σt²/n) {1 + (n − 1) ρt}  ≥  σt²/n  (typically)



Cluster Standard Error
Cluster-robust variance estimator:

    V̂((α̂, β̂) | T) = ( Σ_{j=1}^m Xjᵀ Xj )⁻¹ ( Σ_{j=1}^m Xjᵀ ε̂j ε̂jᵀ Xj ) ( Σ_{j=1}^m Xjᵀ Xj )⁻¹

where in this case Xj = [1  Tj] is an n × 2 matrix and ε̂j = (ε̂1j, . . . , ε̂nj)ᵀ is a column vector of length n

Design-based evaluation:

    Finite-sample bias = −( V(Ȳj(1))/m1² + V(Ȳj(0))/m0² )

The bias vanishes asymptotically as m → ∞ with n fixed

Implication: cluster standard errors by the unit of treatment assignment
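A sketch of this cluster-robust calculation on hypothetical simulated data with cluster-level treatment (my own illustration, arbitrary parameter values):

```python
import numpy as np

# Hypothetical cluster-randomized data: m clusters of size n
rng = np.random.default_rng(2)
m, n = 30, 10
T = rng.binomial(1, 0.5, size=m)                  # cluster-level treatment
cluster_effect = rng.normal(scale=1.0, size=m)
y = np.concatenate([1 + 2 * T[j] + cluster_effect[j] + rng.normal(size=n) for j in range(m)])
t = np.repeat(T, n)

X = np.column_stack([np.ones(m * n), t])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat

bread = np.linalg.inv(X.T @ X)
meat = np.zeros((2, 2))
for j in range(m):                                # sum_j Xj' e_j e_j' Xj
    idx = slice(j * n, (j + 1) * n)
    Xj, ej = X[idx], resid[idx]
    meat += np.outer(Xj.T @ ej, Xj.T @ ej)
V_cluster = bread @ meat @ bread
print(beta_hat, np.sqrt(np.diag(V_cluster)))      # coefficients and cluster s.e.
```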
Regression Discontinuity Design
Idea: find an arbitrary cutpoint c that determines the treatment assignment such that Ti = 1{Xi ≥ c}
Assumption: E(Yi(t) | Xi = x) is continuous in x
Contrast this with the "as-if random" assumption within a window:

    {Yi(1), Yi(0)} ⊥⊥ Ti | c0 ≤ Xi ≤ c1

Estimand: E(Yi(1) − Yi(0) | Xi = c)

Regression modeling:

    E(Yi(1) | Xi = c) = lim_{x↓c} E(Yi(1) | Xi = x) = lim_{x↓c} E(Yi | Xi = x)
    E(Yi(0) | Xi = c) = lim_{x↑c} E(Yi(0) | Xi = x) = lim_{x↑c} E(Yi | Xi = x)

Advantage: internal validity
Disadvantage: external validity
Make sure nothing else is going on at Xi = c
Close Elections as RD Design (Lee)
[Figures from Lee's study of U.S. House elections (Quarterly Journal of Economics): Figure I plots ADA scores at time t+1 against the Democratic vote share at time t; Figure V, a specification test, plots lagged ADA scores against the Democratic vote share. Each point is the average score within a 0.01 vote-share bin; solid lines are fourth-order polynomial fits on either side of the 50 percent threshold with pointwise 95 percent confidence intervals. ADA scores are smooth in the vote share everywhere except for a large discontinuous jump at the 50 percent threshold.]

Placebo test for natural experiments
What is a good placebo?
  1 expected not to have any effect
  2 closely related to the outcome of interest

Close election controversy: de la Cuesta and Imai (2016). "Misunderstandings about the Regression Discontinuity Design in the Study of Close Elections," Annual Review of Political Science
Analysis Methods under the RD Design
As-if randomization: difference-in-means within a window
Continuity: linear regression within a window
We want to choose a window in a principled manner
We want to relax the functional form assumption
Higher-order polynomial regression: not robust
Local linear regression: better behavior at the boundary than other nonparametric regressions

    f̂(x) = argmin_(α, β) Σ_{i=1}^n {Yi − α − (Xi − x)β}² K_h((Xi − x)/h)

Weighted regression with a kernel of one's choice:
  uniform kernel: k(u) = (1/2) 1{|u| < 1}
  triangular kernel: k(u) = (1 − |u|) 1{|u| < 1}
Choice of bandwidth parameter h: cross-validation, analytical minimization of MSE
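A minimal sketch of local linear regression at the cutoff using a triangular kernel, implemented as weighted least squares on hypothetical simulated RD-style data (the cutoff, bandwidth, and jump size are arbitrary assumptions):

```python
import numpy as np

def local_linear(x0, X, Y, h):
    """Local linear fit at x0 with a triangular kernel and bandwidth h."""
    u = (X - x0) / h
    w = np.where(np.abs(u) < 1, 1 - np.abs(u), 0.0)   # triangular kernel weights
    D = np.column_stack([np.ones_like(X), X - x0])     # regressors: 1, (Xi - x0)
    W = np.diag(w)
    coef = np.linalg.solve(D.T @ W @ D, D.T @ W @ Y)   # weighted least squares
    return coef[0]                                     # intercept = fitted value at x0

# Hypothetical RD data with a jump of 2 at the cutoff c = 0
rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, 500)
Y = 1 + 0.5 * X + 2 * (X >= 0) + rng.normal(scale=0.3, size=500)

h = 0.2
tau_hat = local_linear(0.0, X[X >= 0], Y[X >= 0], h) - local_linear(0.0, X[X < 0], Y[X < 0], h)
print(tau_hat)   # should be close to the true jump of 2
```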
Multiple Regression from Simple Regressions

Consider simple linear regression without an intercept:

    β̂ = Σ_{i=1}^n Xi Yi / Σ_{i=1}^n Xi² = ⟨X, Y⟩ / ⟨X, X⟩

where ⟨·, ·⟩ represents the inner product of two vectors

Another predictor Z that is orthogonal to X, i.e., ⟨X, Z⟩ = 0
Then, regressing Y on X and Z separately gives you the same coefficients as regressing Y on X and Z together:

    SSR = ⟨Y − βX − γZ, Y − βX − γZ⟩
        = ⟨Y − βX, Y − βX⟩ + ⟨Y − γZ, Y − γZ⟩ − ⟨Y, Y⟩

Orthogonal predictors have no impact on each other in multiple regression (experimental designs)
Orthogonalization

How to orthogonalize Z with respect to X? Regress Z on X and obtain the residuals!

An alternative way to estimate β in Y = α1 + βX + ε:
  1 Orthogonalize X with respect to 1:  ⟨X − X̄1, 1⟩ = 0
  2 Now, we can get β̂ from a simple regression:

    β̂ = ⟨X − X̄1, Y⟩ / ⟨X − X̄1, X − X̄1⟩

β̂ represents the additional contribution of X on Y after X has been adjusted for 1
Generalization of this idea: the Gram-Schmidt procedure
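A numpy sketch of this residual-on-residual logic (the Frisch-Waugh-Lovell idea) on hypothetical simulated data, orthogonalizing X with respect to an intercept and another predictor Z; the data-generating values are arbitrary:

```python
import numpy as np

# Hypothetical data: Y depends on X and Z, and X, Z are correlated
rng = np.random.default_rng(4)
n = 500
Z = rng.normal(size=n)
X = 0.6 * Z + rng.normal(size=n)
Y = 1 + 2 * X - 1.5 * Z + rng.normal(size=n)

# Coefficient on X from the full multiple regression
D = np.column_stack([np.ones(n), X, Z])
beta_full = np.linalg.lstsq(D, Y, rcond=None)[0][1]

# Orthogonalize X with respect to (1, Z), then run a simple regression of Y on the residual
G = np.column_stack([np.ones(n), Z])
X_resid = X - G @ np.linalg.lstsq(G, X, rcond=None)[0]
beta_orth = (X_resid @ Y) / (X_resid @ X_resid)

print(beta_full, beta_orth)   # identical up to numerical error
```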



Proof by A Picture



Multiple Linear Regression



Multiple Linear Regression Model
1 Scalar representation:

    Yi = β1 Xi1 + β2 Xi2 + · · · + βK XiK + εi

  where typically Xi1 = 1 is the intercept

2 Vector representation:

    Yi = Xiᵀβ + εi   where Xi = (Xi1, Xi2, . . . , XiK)ᵀ and β = (β1, β2, . . . , βK)ᵀ

3 Matrix representation:

    Y = Xβ + ε   where X is the n × K matrix whose ith row is Xiᵀ and ε = (ε1, ε2, . . . , εn)ᵀ



Causal Interpretation
Potential outcomes:

    Yi(Ti) = α + β Ti + Xiᵀγ + εi

Ti is a treatment variable
Xi is a vector of pre-treatment confounders
Constant unit treatment effect: β = Yi(1) − Yi(0)
A general model with PATE β = E(Yi(1) − Yi(0)):

    Yi(Ti) = α + β Ti + Xiᵀγ + εi(Ti)

Post-treatment bias: Xi shouldn't include post-treatment variables
γ does NOT have a causal interpretation
Assumptions and Interpretation

Exogeneity: E(ε | X) = E(ε) = 0
  1 Conditional expectation function (CEF): E(Y | X) = Xβ
  2 Orthogonality: E(εi · Xjk) = 0 for any i, j, k
  3 Zero correlation: Cov(εi, Xjk) = 0 for any i, j, k
Homoskedasticity: V(ε | X) = V(ε) = σ² In
  1 spherical error variance
  2 V(εi | X) = σ² and Cov(εi, εj | X) = 0 for any i ≠ j

Ignorability (unconfoundedness): (Yi(1), Yi(0)) ⊥⊥ Ti | Xi
  E(Yi(t) | Ti, Xi) = E(Yi(t) | Xi)
  E(Yi(t) | Xi) = E(Yi | Ti = t, Xi) = α + βt + Xiᵀγ
  E(εi | Ti, Xi) = E(εi | Xi) = E(εi) = 0
Random sampling: (Yi(1), Yi(0), Ti, Xi) are i.i.d.
Equal variance: V(Yi(t) | Ti, Xi) = V(Yi(t)) = σ²



Rank and Linear Independence
Assumption (no multicollinearity): X is of full column rank

    rank(X) = K ≤ n

Column (row) rank of a matrix = the maximal number of linearly independent columns (rows)
Linear independence: Xc = 0 iff c is a column vector of 0s

    c1 X1 + c2 X2 + · · · + cK XK = 0 ⇐⇒ c1 = c2 = · · · = cK = 0

Geometric interpretation of matrix rank:

    rank(X) = dim(S(X))

where S(X) = {Xc : c ∈ R^K} is the column space of X



Properties of Matrix Rank
Property 1: column rank of X = row rank of X
Proof:
  1 Let the column rank be L ≤ K. There exist basis vectors b1, . . . , bL such that Xk = c1k b1 + c2k b2 + · · · + cLk bL for each column k of X.
  2 We can write X = BC. Since every row of X is a linear combination of the rows of C, the row rank of X is no greater than the row rank of C, which is at most L, the column rank of X.
  3 Apply the same argument to Xᵀ to obtain the reverse inequality.

Property 2: rank(AB) ≤ min(rank(A), rank(B))
The proof is similar to the previous one

Property 3: If A and C are non-singular, rank(ABC) = rank(B)
Proof:
  1 rank(ABC) ≤ rank(AB) ≤ rank(B)
  2 rank(B) = rank(A⁻¹(ABC)C⁻¹) ≤ rank(A⁻¹(ABC)) ≤ rank(ABC)
Matrix and Linear Transformation
A matrix A can be interpreted as a linear transformation operator:
transformed vector = Av



Eigenvalues and Eigenvectors

Definition: Av = λv
1 a square matrix A
2 a non-zero column vector v : an eigenvector of A
3 a scalar λ: an eigenvalue of A associated with v

Interpretation: direction of v is not affected by linear transform A



Spectral Theorem

If A is a symmetric n × n matrix, then eigenvalues λ1, . . . , λn and their corresponding orthonormal eigenvectors v1, . . . , vn exist. Moreover, the following spectral decomposition (eigendecomposition) applies:

    A = VDVᵀ   where D = diag(λ1, . . . , λn) and V = [v1 · · · vn]

Orthonormality implies VᵀV = In, or V⁻¹ = Vᵀ. Thus, A = VDV⁻¹

Geometric interpretation: AV = VD = [λ1 v1 · · · λn vn]

Corollaries: trace(A) = λ1 + · · · + λn and det(A) = λ1 × · · · × λn



Singularity

1 If A is a symmetric n × n matrix, rank(A) equals the number of non-zero eigenvalues of A
  Proof: rank(A) = rank(VDV⁻¹) = rank(D)

2 A symmetric n × n matrix has full rank iff it is non-singular
  Proof: non-singular ⇔ det(A) ≠ 0 ⇔ all eigenvalues are non-zero
  Ac = b has a unique solution c = A⁻¹b (K unknowns and K equations)

3 X is of full column rank iff XᵀX is non-singular
  Suppose X has full column rank: assume XᵀXc = 0 for a non-zero c
    =⇒ cᵀXᵀXc = 0 =⇒ ‖Xc‖² = 0 =⇒ Xc = 0, a contradiction
  Suppose XᵀX has full rank: assume Xc = 0 for a non-zero c
    =⇒ XᵀXc = 0, a contradiction



Singular Value Decomposition (SVD)
For any m × n matrix A,

    A = UDVᵀ   where D = diag(d1, . . . , dn)

singular values of A: d1 ≥ · · · ≥ dn ≥ 0
orthogonal matrix U whose columns span the column space of A
orthogonal matrix V whose columns span the row space of A
Geometric interpretation: AV = UD

Relationship to the spectral decomposition:

    AᵀA = (UDVᵀ)ᵀ(UDVᵀ) = VD²Vᵀ

Columns of V are eigenvectors and diag(D²) are eigenvalues
Principal Components

An n × K matrix of predictors: X
"Centered" predictors: X̃, where X̃k = Xk − mean(Xk)
Sample covariance matrix: (1/n) X̃ᵀX̃
Singular value decomposition of X̃ᵀX̃:

    X̃ᵀX̃ = VD²Vᵀ

Columns of V are the principal component directions of X̃
zk = X̃vk = uk dk is called the kth principal component
Variances of the principal components:

    V(z1) = d1²/n ≥ V(z2) = d2²/n ≥ · · · ≥ V(zK) = dK²/n

Frequently used to summarize multiple measurements
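A short numpy sketch of principal components via the SVD of the centered predictor matrix, on hypothetical correlated measurements (my own illustration):

```python
import numpy as np

# Hypothetical correlated predictors driven by one latent factor
rng = np.random.default_rng(5)
n = 300
latent = rng.normal(size=n)
X = np.column_stack([latent + 0.3 * rng.normal(size=n) for _ in range(4)])

X_tilde = X - X.mean(axis=0)                 # centered predictors
U, d, Vt = np.linalg.svd(X_tilde, full_matrices=False)

Z = X_tilde @ Vt.T                           # principal components z_k = X-tilde v_k
print(d**2 / n)                              # variances of the principal components
print(np.allclose(Z, U * d))                 # confirms z_k = u_k d_k
```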



Least Squares Estimation and Prediction

Model parameters: β
Estimates: β̂
Predicted (fitted) value: Ŷ = Xβ̂
Residual: ε̂ = Y − Ŷ = Y − Xβ̂
Minimize the sum of squared residuals:

    SSR = Σ_{i=1}^n ε̂i² = ⟨ε̂, ε̂⟩ = ‖ε̂‖²

which yields

    β̂ = argmin_β SSR = (XᵀX)⁻¹XᵀY

Computation via SVD: XᵀX = VD²Vᵀ and (XᵀX)⁻¹ = VD⁻²Vᵀ
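A minimal numpy sketch comparing the normal-equation solution with the SVD-based computation, on hypothetical simulated data:

```python
import numpy as np

# Hypothetical design matrix with an intercept column and two predictors
rng = np.random.default_rng(6)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([1.0, 2.0, -0.5])
Y = X @ beta_true + rng.normal(size=n)

# Normal equations: (X'X) beta = X'Y
beta_normal = np.linalg.solve(X.T @ X, X.T @ Y)

# Via the SVD of X: X = U D V', so (X'X)^{-1} X'Y = V D^{-1} U'Y
U, d, Vt = np.linalg.svd(X, full_matrices=False)
beta_svd = Vt.T @ ((U.T @ Y) / d)

print(beta_normal, beta_svd)   # agree up to numerical error
```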



Geometry of Least Squares
1 Using β̂ = β + (XᵀX)⁻¹Xᵀε, show the orthogonality

    Xᵀε̂ = Xᵀ(Y − Xβ̂) = 0

  that is, any column vector of X is orthogonal to ε̂
2 Using the orthogonality, show

    SSR = ‖Y − Xβ̃‖² = ‖ε̂ + X(β̂ − β̃)‖² = ‖ε̂‖² + ‖X(β̂ − β̃)‖²

3 SSR is minimized when β̃ = β̂

S(X): the subspace of Rⁿ spanned by the columns of X
Ŷ = Xβ̂: the projection of Y onto S(X)
ε̂ is orthogonal to all columns of X and thus to S(X)
Derivation with Calculus

1 Vector calculus: let a, b be column vectors and A be a matrix
  1 ∂(aᵀb)/∂b = ∂(a1 b1 + a2 b2 + · · · + aK bK)/∂b = a
  2 ∂(Ab)/∂b = Aᵀ
  3 ∂(bᵀAb)/∂b = 2Ab when A is symmetric
2 SSR = ‖Y − Xβ̂‖² = YᵀY − 2YᵀXβ̂ + β̂ᵀXᵀXβ̂
3 First-order condition: ∂SSR/∂β̂ = 0 =⇒ (XᵀX)β̂ = XᵀY (the normal equation)
4 Second-order condition: ∂²SSR/∂β̂∂β̂ᵀ = XᵀX ≥ 0 in the matrix sense
5 Theorem: a square matrix A is positive semi-definite if A is symmetric and cᵀAc ≥ 0 for any column vector c
  XᵀX is symmetric and cᵀXᵀXc = ‖Xc‖² ≥ 0 =⇒ XᵀX is positive semi-definite



Unbiasedness of Least Squares Estimator
Recall β̂ = β + (XᵀX)⁻¹Xᵀε

Conditional unbiasedness:   E(β̂ | X) = β

Unconditional unbiasedness:   E(β̂) = β

Conditional variance:   V(β̂ | X) = V(β̂ − β | X) = σ²(XᵀX)⁻¹

Unconditional variance:   V(β̂) = E{V(β̂ − β | X)} = σ² E{(XᵀX)⁻¹}



Gauss-Markov Theorem

Under the exogeneity and homoskedasticity assumptions, β̂ is the Best Linear Unbiased Estimator (BLUE)

A linear estimator: β̃ = AY = β̂ + BY where B = A − (XᵀX)⁻¹Xᵀ

    β̃ = {(XᵀX)⁻¹Xᵀ + B}Y = β + {(XᵀX)⁻¹Xᵀ + B}ε + BXβ

E(β̃ | X) = E(β̂ | X) = β and exogeneity imply BX = 0

    V(β̃ | X) = {(XᵀX)⁻¹Xᵀ + B} V(ε | X) {(XᵀX)⁻¹Xᵀ + B}ᵀ
             = σ²{(XᵀX)⁻¹ + BBᵀ}
             ≥ σ²(XᵀX)⁻¹
             = V(β̂ | X)

Also, V(β̃) ≥ V(β̂) so long as E(β̃ | X) = β holds

Don't take it too seriously!: bias, nonlinearity, heteroskedasticity



More Geometry
Orthogonal projection matrix or "hat" matrix PX:

    Ŷ = Xβ̂ = X(XᵀX)⁻¹Xᵀ Y = PX Y

PX projects Y onto S(X) and thus PX X = X
MX = In − PX projects Y onto S⊥(X): MX X = 0
Orthogonal decomposition: Y = PX Y + MX Y

Symmetric: PX = PXᵀ and MX = MXᵀ
Idempotent: PX PX = PX and MX MX = MX
Annihilator: MX PX = PX MX = 0
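These properties are easy to verify numerically; a minimal sketch on an arbitrary simulated design matrix:

```python
import numpy as np

rng = np.random.default_rng(7)
n, K = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
Y = rng.normal(size=n)

P = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix P_X
M = np.eye(n) - P                      # annihilator M_X

print(np.allclose(P @ X, X))           # P_X leaves the columns of X unchanged
print(np.allclose(M @ X, 0))           # M_X annihilates X
print(np.allclose(P @ P, P), np.allclose(M @ M, M))   # idempotent
print(np.allclose(P, P.T), np.allclose(M, M.T))       # symmetric
print(np.allclose(Y, P @ Y + M @ Y))   # orthogonal decomposition of Y
print(np.isclose(np.trace(P), K))      # trace(P_X) = K
```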



Residuals

Projection of Y onto S⊥(X):

    ε̂ = Y − Xβ̂ = MX Y

Orthogonality:
  1 ⟨Ŷ, ε̂⟩ = ⟨PX Y, MX Y⟩ = 0
  2 ⟨Xk, ε̂⟩ = ⟨PX Xk, MX Y⟩ = 0 for any column Xk of X
  3 More generally, x · ε̂ = 0 for any x ∈ S(X)
Zero mean: Σ_{i=1}^n ε̂i = 0 since x1 = (1, . . . , 1)ᵀ ∈ S(X)
(Sample and population) correlation between ε̂i and Xik is 0 for any column k
Does not imply E(ε | X) = 0, E(Xk · ε) = 0, or Cor(εi, Xik) = 0



The Coefficient of Determination

The (centered) coefficient of determination:

    R² ≡ 1 − Σ_{i=1}^n ε̂i² / Σ_{i=1}^n (Yi − Ȳ)² = explained variance / original variance

Recall Y = Xβ̂ + ε̂, ⟨Ŷ, ε̂⟩ = 0, and ⟨Ȳ1, ε̂⟩ = 0

Pythagoras' theorem:

    ‖Y − Ȳ1‖² = ‖Xβ̂ + ε̂ − Ȳ1‖² = ‖Xβ̂ − Ȳ1‖² + ‖ε̂‖²

Note Ȳ equals the mean of the fitted values Xiᵀβ̂
Thus, V(Yi) = V(Xiᵀβ̂) + V(ε̂i) (this holds even in sample)



Variance Estimation Under Homoskedasticity

Under homoskedasticity, the standard variance estimator is:

    V̂(β̂ | X) = σ̂²(XᵀX)⁻¹   where   σ̂² = ‖ε̂‖² / (n − K)

(Conditionally) unbiased: E(σ̂² | X) = σ² implies

    E(V̂(β̂ | X) | X) = V(β̂ | X)

(Unconditionally) unbiased: V{E(β̂ | X)} = 0 implies

    V(β̂) = E{V(β̂ | X)} = E{E(V̂(β̂ | X) | X)} = E{V̂(β̂ | X)}



Proof that σ̂ 2 is Unbiased (Freedman, Theorem 4)

Recall ε̂ = MX Y = MX(Xβ + ε) = MX ε
Then ‖ε̂‖² = ‖MX ε‖² = εᵀMX ε = trace(εεᵀMX)
εᵀMX ε = Σi Σj εi εj Mij where Mij is the (i, j) element of MX
Homoskedasticity: E(εi εj | X) = 0 for i ≠ j and E(εi² | X) = σ²

    E(‖ε̂‖² | X) = σ² trace(MX)

Recall the following properties of the trace operator:
  1 trace(A + B) = trace(A) + trace(B)
  2 trace(AB) = trace(BA) where A is m × n and B is n × m

    trace(MX) = trace(In) − trace{X(XᵀX)⁻¹Xᵀ} = n − trace(IK) = n − K



Model-Based Prediction

(Estimated) expected value for Xi = x:

    Ŷ(x) = Ê(Yi | Xi = x) = xᵀβ̂

Predicted value for Xi = x:

    Y(x) = Ŷ(x) + εi

Variance (the point estimate is still Ŷ(x)):

    V(Y(x) | X) = xᵀ V(β̂ | X) x + σ² = σ² { xᵀ(XᵀX)⁻¹x + 1 }
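A minimal numpy sketch of the point prediction and its variance at a hypothetical covariate profile x (simulated data, arbitrary values):

```python
import numpy as np

rng = np.random.default_rng(8)
n, K = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
Y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
resid = Y - X @ beta_hat
sigma2_hat = resid @ resid / (n - K)

x_new = np.array([1.0, 0.5, -0.2])                       # hypothetical covariate profile
y_hat = x_new @ beta_hat                                 # estimated expected value
var_pred = sigma2_hat * (x_new @ XtX_inv @ x_new + 1)    # predictive variance
print(y_hat, np.sqrt(var_pred))                          # point prediction and its s.e.
```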



Causal Inference with Interaction Terms

A model: Yi = α + β Ti + X1iᵀγ1 + X2iᵀγ2 + Ti X1iᵀδ + εi

The average causal effect depends on the pre-treatment covariates X1i
Average causal effect when X1i = x:

    β + xᵀδ

Variance: V(β̂ | T, X) + xᵀ V(δ̂ | T, X) x + 2 xᵀ Cov(β̂, δ̂ | T, X)

Difference in the average causal effects between the case with X1i = x* and the case with X1i = x:

    (x* − x)ᵀ δ

Variance: (x* − x)ᵀ V(δ̂ | T, X) (x* − x)

Challenge: variable selection



Model-Based Finite Sample Inference

Standard error for β̂k: s.e.k = √V̂(β̂ | X)kk

Sampling distribution:

    β̂k − βk ∼ N( 0, σ²(XᵀX)⁻¹kk )

Inference under εi ~ i.i.d. N(0, σ²):

    (β̂k − βk)/s.e.k = [ (β̂k − βk) / (σ√(XᵀX)⁻¹kk) ] ÷ √{ [(n − K)σ̂²/σ²] · [1/(n − K)] } ∼ t_{n−K}

where (β̂k − βk)/(σ√(XᵀX)⁻¹kk) ∼ N(0, 1) and (n − K)σ̂²/σ² ∼ χ²_{n−K}



Derivation (see Proposition 1.3 of Hayashi)

1 Note that if x ∼ N(0, IK) and A is an idempotent matrix, then xᵀAx ∼ χ²_ν where ν = rank(A)
2 If A is a projection matrix, then rank(A) = trace(A)
  Proof: let Av = λv. Then Av = A²v = λAv = λ²v =⇒ λ ∈ {0, 1}, so every non-zero eigenvalue equals 1.
  Thus, trace(A) equals the number of non-zero eigenvalues, i.e., rank(A)
3 Recall ε̂ = MX ε
4 Given X, (n − K)σ̂²/σ² = ‖MX ε/σ‖² ∼ χ²_{rank(MX)} = χ²_{trace(MX)} = χ²_{n−K}
5 Recall β̂ = β + (XᵀX)⁻¹Xᵀε
6 Cov(β̂, ε̂ | X) = Cov(β̂ − β, MX ε | X) = (XᵀX)⁻¹Xᵀ E(εεᵀ | X) MX = 0
7 Theorem: if X ∼ N(µ, Σ), then AX ∼ N(Aµ, AΣAᵀ)
8 Thus, (β̂, ε̂) have a multivariate normal distribution given X
9 Therefore β̂ and ε̂ are independent given X


Testing Linear Null Hypothesis

Null hypothesis H0: Aβ = b where A is of full row rank
  Any linear restriction: a1β1 + a2β2 + · · · + aKβK = b
  Any number of linearly independent such restrictions

F-statistic:

    F ≡ (Aβ̂ − b)ᵀ{A(XᵀX)⁻¹Aᵀ}⁻¹(Aβ̂ − b) / (σ̂² rank(A))
      = [ (Aβ̂ − b)ᵀ{σ² A(XᵀX)⁻¹Aᵀ}⁻¹(Aβ̂ − b) / (‖ε̂‖²/σ²) ] × (n − K)/rank(A)
      ∼ F_{rank(A), n−K}

where the numerator quadratic form is ∼ χ²_{rank(A)} and ‖ε̂‖²/σ² ∼ χ²_{n−K}

If F is larger than the critical value, reject the null

Influential Observations and Leverage Points
Leverage for unit i in S(X):

    pi ≡ Xiᵀ(XᵀX)⁻¹Xi = the ith diagonal element of PX = ‖PX v(i)‖²

where v(i) is a vector such that v(i)i = 1 and v(i)i′ = 0 for i′ ≠ i

Interpretation: projecting v(i) onto S(X)
0 ≤ pi ≤ ‖v(i)‖² = 1
p̄ ≡ Σ_{i=1}^n pi / n = trace(PX)/n = K/n

How much can one observation alter the estimate?
OLS estimate without the ith observation:

    β̂(i) = (X(i)ᵀX(i))⁻¹X(i)ᵀY(i) = β̂ − (XᵀX)⁻¹Xi ε̂i / (1 − pi)

Influential points: (1) high leverage, (2) outlier, and (3) both
Cook's distance: Di ≡ (β̂(i) − β̂)ᵀ XᵀX (β̂(i) − β̂) / (σ̂² K)
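A sketch of these diagnostics on hypothetical simulated data containing one high-leverage outlier (my own illustration; the leave-one-out coefficients use the update formula above):

```python
import numpy as np

rng = np.random.default_rng(9)
n, K = 60, 2
x = rng.normal(size=n)
x[0] = 6.0                                    # one high-leverage point
X = np.column_stack([np.ones(n), x])
Y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)
Y[0] += 5.0                                   # ... that is also an outlier

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
resid = Y - X @ beta_hat
sigma2_hat = resid @ resid / (n - K)

P = X @ XtX_inv @ X.T
leverage = np.diag(P)                         # p_i
print(leverage.mean(), K / n)                 # average leverage equals K/n

# Leave-one-out coefficients via the update formula, and Cook's distance
beta_drop = beta_hat - (XtX_inv @ X.T * (resid / (1 - leverage))).T
diff = beta_drop - beta_hat
cooks_d = np.sum(diff @ (X.T @ X) * diff, axis=1) / (sigma2_hat * K)
print(cooks_d[:3])                            # unit 0 stands out
```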
Model-Based Asymptotic Inference
An alternative expression: β̂ = (Σ_{i=1}^n Xi Xiᵀ)⁻¹ (Σ_{i=1}^n Xi Yi)

Consistency: only E(Xi εi) = 0 is required

    β̂ − β = ( (1/n) Σ_{i=1}^n Xi Xiᵀ )⁻¹ ( (1/n) Σ_{i=1}^n Xi εi ) →p 0

Asymptotic distribution and inference:

    √n (β̂ − β) = ( (1/n) Σ_{i=1}^n Xi Xiᵀ )⁻¹ × √n ( (1/n) Σ_{i=1}^n Xi εi )

where the first factor →p {E(Xi Xiᵀ)}⁻¹ and the second factor →d N(0, σ² E(Xi Xiᵀ)), so that

    √n (β̂ − β) →d N( 0, σ² {E(Xi Xiᵀ)}⁻¹ )   and   (β̂k − βk)/s.e.k →d N(0, 1)
Robust Standard Errors
Heteroskedasticity: V(εi | X) ≠ σ²

    V(β̂ | X) = V(β̂ − β | X) = (XᵀX)⁻¹{Xᵀ E(εεᵀ | X) X}(XᵀX)⁻¹

How do we estimate E(εεᵀ | X) = V(ε | X)?
Sandwich estimator: meat = Xᵀ V̂(ε | X) X, bread = (XᵀX)⁻¹; asymptotically consistent
  i.i.d.: σ̂² In
  independence: diag(ε̂i²), n·diag(ε̂i²)/(n − K), etc.
  clustering: block-diagonal V̂(ε | X) with blocks ε̂g ε̂gᵀ for g = 1, . . . , G, so that meat = Σ_{g=1}^G Xgᵀ ε̂g ε̂gᵀ Xg
  autocorrelation, panel corrections, etc.
WARNING: this only fixes the asymptotic standard error, not bias!
Asymptotic Tests

Null hypothesis H0: Aβ = b where A is of full row rank
  Any linear restriction: a1β1 + a2β2 + · · · + aKβK = b
  Any number of linearly independent such restrictions

Wald statistic:

    W ≡ (Aβ̂ − b)ᵀ{A V̂(β̂) Aᵀ}⁻¹(Aβ̂ − b)
      = √n (Aβ̂ − b)ᵀ {n A V̂(β̂) Aᵀ}⁻¹ √n (Aβ̂ − b)
      →d χ²_{rank(A)}

since √n (Aβ̂ − b) →d N(0, n A V(β̂) Aᵀ) under the null and {n A V̂(β̂) Aᵀ}⁻¹ →p {n A V(β̂) Aᵀ}⁻¹

If W is larger than the critical value, reject the null



Generalized Least Squares (GLS)
Known heteroskedasticity: V(ε | X) = σ²Ω where Ω is a positive definite matrix
The OLS estimator is no longer BLUE
GLS estimator: β̂GLS = (XᵀΩ⁻¹X)⁻¹XᵀΩ⁻¹Y
β̂GLS is BLUE
Consider the transformed regression: Y* = X*β + ε* where Y* = Ω^{-1/2}Y, X* = Ω^{-1/2}X, and ε* = Ω^{-1/2}ε
Cholesky decomposition: Ω = Ω^{1/2}(Ω^{1/2})ᵀ where Ω^{1/2} is a lower triangular matrix with strictly positive diagonal elements
Variance: V(β̂GLS | X) = σ²(XᵀΩ⁻¹X)⁻¹ with σ̂² = (1/(n − K)) Σ_{i=1}^n (ε̂*i)²

Feasible GLS (FGLS):
  1 Estimate the unknown Ω
  2 β̂FGLS = (XᵀΩ̂⁻¹X)⁻¹XᵀΩ̂⁻¹Y
  3 Iterate until convergence
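A minimal sketch of GLS with a known diagonal Ω (my own illustration on simulated data), showing that it coincides with OLS on the transformed regression:

```python
import numpy as np

# Hypothetical data with known heteroskedasticity: V(eps_i) = sigma^2 * w_i
rng = np.random.default_rng(10)
n = 200
x = rng.normal(size=n)
w = np.exp(x)                                  # known variance weights (diagonal Omega)
X = np.column_stack([np.ones(n), x])
Y = X @ np.array([1.0, 2.0]) + rng.normal(scale=np.sqrt(w))

Omega_inv = np.diag(1 / w)
beta_gls = np.linalg.solve(X.T @ Omega_inv @ X, X.T @ Omega_inv @ Y)

# Equivalent transformed regression: premultiply by Omega^{-1/2}
Xs = X / np.sqrt(w)[:, None]
Ys = Y / np.sqrt(w)
beta_transformed = np.linalg.lstsq(Xs, Ys, rcond=None)[0]

print(beta_gls, beta_transformed)              # identical up to numerical error
```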
Weighted Least Squares

Ω = diag(1/wi) with Ω^{-1/2} = diag(√wi)
Independence across units, but a different variance for each unit
We never know the variances in practice
wi = sampling weight = 1 / Pr(being selected into the sample)
Known a priori, independent sampling of units
OLS estimator in the finite population:

    βP = (Σ_{i=1}^N Xi Xiᵀ)⁻¹ (Σ_{i=1}^N Xi Yi)

The WLS estimator is consistent for βP:

    β̂WLS = (Σ_{i=1}^N 1{i ∈ S} wi Xi Xiᵀ)⁻¹ (Σ_{i=1}^N 1{i ∈ S} wi Xi Yi)



Omitted Variables Bias
Omitted variables, unmeasured confounders
Scalar confounder U: Y = Xβ + Uγ + ε* with E(ε* | X, U) = 0
Decomposition: U = PX U + η̂ = Xδ̂ + η̂ where Xᵀη̂ = 0
Then,

    Y = X(β + γδ̂) + γη̂ + ε*

with E(γη̂i + ε*i) = 0 and E{Xi(γη̂i + ε*i)} = 0

    β̂k →p βk + γδk   where δ̂ →p δ

If δk′ = 0 for all k′ ≠ k, then

    β̂k →p βk + γ Cov(Ui, Xik)/V(Xik) = βk + γ Cor(Ui, Xik) √( V(Ui) / V(Xik) )



Measurement Error

Three types of measurement error (ME): classical, nondifferential, differential

ME in Yi: Yi = Yi* + ei
  Omitted variable problem: Yi = Xiᵀβ + ei + εi

ME in Xik: Xik = Xik* + ei
  Classical ME assumption: Cov(ei, Xik*) = Cov(ei, Xik′) = 0 for all k′ ≠ k, but Cov(ei, Xik) ≠ 0
  Omitted variable problem: Yi = Xiᵀβ − βk ei + εi
  Attenuation bias: β̂k →p βk ( V(Xik*) / (V(Xik*) + V(ei)) ) ≤ βk

ME in Xik yields an inconsistent estimate of βk′ unless Cov(ei, Xik′) = 0
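A quick simulation (my own illustration, arbitrary variances) shows the attenuation: classical measurement error in the predictor shrinks the slope toward zero by the reliability ratio:

```python
import numpy as np

# Simulated illustration of attenuation bias from classical measurement error in X
rng = np.random.default_rng(11)
n = 100_000
x_star = rng.normal(size=n)                    # true predictor
e = rng.normal(scale=1.0, size=n)              # classical measurement error
x_obs = x_star + e                             # observed, mismeasured predictor
y = 2.0 * x_star + rng.normal(size=n)

beta_with_true_x = np.cov(x_star, y)[0, 1] / np.var(x_star)
beta_attenuated = np.cov(x_obs, y)[0, 1] / np.var(x_obs)
reliability = np.var(x_star) / (np.var(x_star) + np.var(e))

print(beta_with_true_x)            # close to 2
print(beta_attenuated)             # close to 2 * reliability = 1
print(2.0 * reliability)
```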



Model Assessment and Selection
Need a criterion to assess model fit
R²: risk of overfitting
Adjusted R² (adjusted coefficient of determination):

    R²adj = 1 − V̂(εi)/V̂(Yi) = 1 − [ (1/(n − K)) Σ_{i=1}^n ε̂i² ] / [ (1/(n − 1)) Σ_{i=1}^n (Yi − Ȳ)² ]
          = R² − (K − 1)/(n − K) × (1 − R²)

where the last term is a model complexity penalty

Alternative: the in-sample average expected error for new observations:

    Error = (1/n) Σ_{i=1}^n E{(Yi_new − Xiᵀβ̂)² | X, Y}

where the expectation is taken over Yi_new given all the data (X, Y)
The Cp Statistic
Compare Error with the average SSR under the homoskedasticity assumption:

    E( Error − SSR/n | X ) = (2/n) Σ_{i=1}^n E{(Yi − Xiᵀβ)(Xiᵀβ̂ − Xiᵀβ) | X}
                           = (2/n) Σ_{i=1}^n Cov(Yi, Ŷi | X)

where the expectation is taken with respect to Y given X

One can show:

    (2/n) Σ_{i=1}^n Cov(Yi, Ŷi | X) = (2/n) trace{Cov(Y, PX Y | X)} = 2σ²K/n

The Cp statistic is defined as:

    Êrror = (1 + K/n) SSR/(n − K)
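For concreteness, a short numpy sketch (simulated data, arbitrary values) computing R², adjusted R², and the Cp-style estimate of the in-sample prediction error:

```python
import numpy as np

rng = np.random.default_rng(12)
n, K = 100, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
Y = X @ np.array([1.0, 2.0, 0.0, 0.0]) + rng.normal(size=n)

beta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]
resid = Y - X @ beta_hat
SSR = resid @ resid
TSS = np.sum((Y - Y.mean()) ** 2)

r2 = 1 - SSR / TSS
r2_adj = 1 - (SSR / (n - K)) / (TSS / (n - 1))
cp_error = (1 + K / n) * SSR / (n - K)          # estimated in-sample prediction error
print(r2, r2_adj, cp_error)
```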
Key Points

For experimental data, no need to run regression!
  Use covariate adjustment before randomization of treatment (e.g., matched-pair design, randomized block design) with a design-based estimator
  Robust and cluster standard errors can be justified from the design-based point of view

In observational studies, regression adjustment is common
  Many results depend on linearity and exogeneity
  Non/semi-parametric regression
  Preprocess the data with matching methods to make parametric inference robust

