
Summary Notes

M300 Econometric Methods

Povilas Lastauskas
23rd November 2013

1 Theory
The research agenda for applied economics requires addressing these major issues:
1. Causal relationship;
2. Ideal experiment;
3. Identification strategy;
4. Statistical inference.
Reality manifests itself in many ways which complicate empirical enquiry, especially when
questions about causality are raised. Drawing on Angrist and Pischke (2008), such
questions hinge on the experimentalist paradigm and potential outcomes. This new
paradigm in econometrics requires some vocabulary...

1.1 Basics of evaluation problem

Suppose we analyse a variable Yi, described by

Potential outcome =
    Y1i, the outcome if i is treated;
    Y0i, the outcome if i is not treated.

In effect, Y1i (potential outcome under treatment) and Y0i (potential outcome without
treatment) are outcomes in alternative states of the world. The inexistence of parallel worlds
leads to an inability to measure treatment effects at the individual level.

(Errors and typos to be reported to pl312@cam.ac.uk; www.lastauskas.com.)

Notice that the observed outcome Yi can be expressed in terms of the potential ones:

Yi =
    Y1i, if Di = 1,
    Y0i, if Di = 0,
or Yi = Y0i + (Y1i - Y0i)Di. One hope is, instead of individual effects, to analyse average
treatment effects. What about a comparison of the difference in means for treated and
untreated?
We obtain

E[Yi | Di = 1] - E[Yi | Di = 0] = E[Y1i | Di = 1] - E[Y0i | Di = 0],

and subtracting and adding E[Y0i | Di = 1] produces

E[Yi | Di = 1] - E[Yi | Di = 0]
    = E[Y1i | Di = 1] - E[Y0i | Di = 1]
    + E[Y0i | Di = 1] - E[Y0i | Di = 0].   (1)

The result is decomposed into two components: the average treatment effect on the treated
(ATT) and selection bias. The latter is simply the difference in average outcome under
non-treatment between those who were treated and those who were not.
Problem 1. Actual treatment status Di is not independent of potential outcomes.
One of the solutions is to randomly assign treatment to individuals in the population, since
Di then becomes independent of potential outcomes and the selection bias disappears.
Under this,

E[Yi | Di = 1] - E[Yi | Di = 0] = E[Y1i | Di = 1] - E[Y0i | Di = 1],

since E[Y0i | Di = 1] - E[Y0i | Di = 0] = 0. This enables us to infer the ATT, which coincides
with the average treatment effect (ATE) in the entire population, from the difference
in means.
Under random assignment, we can obtain the ATT and ATE by running the OLS regression
Yi = α + ρDi + ηi. For a moment suppose that the treatment effect is the same for
everybody, such that Y1i - Y0i = ρ. Using Yi = Y0i + (Y1i - Y0i)Di, we can re-express
it as Yi = E[Y0i] + (Y1i - Y0i)Di + (Y0i - E[Y0i]) = α + ρDi + ηi, where α = E[Y0i] and
ηi = Y0i - E[Y0i]. Obviously,

E[Yi | Di = 1] - E[Yi | Di = 0] = ρ + E[ηi | Di = 1] - E[ηi | Di = 0].
Claim 2. The selection bias amounts to a non-zero correlation between the regression error
term ηi and the regressor Di: what you previously called endogeneity bias is known as
selection bias here.

The correlation reflects the difference in potential outcomes (under no treatment)
between those who get treated and those who don't. To see this, note that E[ηi | Di = 1]
- E[ηi | Di = 0] = E[Y0i | Di = 1] - E[Y0i | Di = 0]. Clearly, if Di is randomly assigned,
there is no selection bias, so a regression of observed outcomes Yi on actual treatment
status Di estimates the causal effect.
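The logic above is easy to verify numerically. The sketch below is a minimal simulation (the sample size, seed, selection rule and the constant treatment effect of 1 are illustrative choices, not taken from the notes): it contrasts the naive difference in means under self-selection with the same comparison under random assignment.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Potential outcomes: Y0 varies across individuals; constant effect Y1 - Y0 = 1.
y0 = rng.normal(0.0, 1.0, n)
y1 = y0 + 1.0

# Self-selection: individuals with low Y0 tend to take the treatment.
d_sel = (y0 + rng.normal(0.0, 1.0, n) < 0).astype(int)
y_obs = np.where(d_sel == 1, y1, y0)
naive_sel = y_obs[d_sel == 1].mean() - y_obs[d_sel == 0].mean()

# Random assignment: D independent of (Y0, Y1), so selection bias vanishes.
d_rnd = rng.integers(0, 2, n)
y_obs_r = np.where(d_rnd == 1, y1, y0)
naive_rnd = y_obs_r[d_rnd == 1].mean() - y_obs_r[d_rnd == 0].mean()

print(naive_sel)   # far from the true effect of 1: selection bias at work
print(naive_rnd)   # close to 1
```

Under selection on Y0i the naive comparison is contaminated by the selection-bias term of equation (1); under random assignment it recovers the average effect.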

1.2 Basics of Regression

The Conditional Expectation Function (CEF) for a dependent variable Yi, given covariates Xi, is the expectation, or the population average, of Yi with Xi held fixed. Denote it by²

E[Yi | Xi],

from which it clearly follows that the CEF is random, as it is a function of Xi, which is
random. Further, it is a population concept which the researcher attempts to uncover
using the sample CEF. There are a number of useful CEF properties:
CEF decomposition property.
By the decomposition property,

Yi = E[Yi | Xi] + εi,

where εi is mean independent of Xi, i.e. E[εi | Xi] = 0. Notice that this follows from

E[εi | Xi] = E[Yi | Xi] - E[E[Yi | Xi] | Xi] = 0.

More importantly, we can demonstrate that this property produces the result that εi is
uncorrelated with any function of Xi. Let h(Xi) be any function of Xi; then

E[h(Xi)εi] = E[E[h(Xi)εi | Xi]] = E[h(Xi)E[εi | Xi]] = 0.
In demonstrating this relationship, we employed what is known as the Law of Iterated
Expectations, stating that an unconditional expectation can be written as the
unconditional average of the CEF, or

Ex[Ey[y | x]] = ∫ Ey[y | x] f(x) dx = ∫ (∫ y f(y | x) dy) f(x) dx = ∫∫ y f(y | x) f(x) dy dx
             = ∫∫ y f(y, x) dy dx = ∫ y (∫ f(y, x) dx) dy = ∫ y f(y) dy
             = E[y].
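The sample analogue of this result can be checked directly: averaging the group-specific conditional means with the empirical weights of x reproduces the overall mean exactly. A minimal sketch (the distributional choices below are illustrative, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500_000

# x takes five values; E[y | x] = 2x. The LIE says E_x[E[y | x]] = E[y];
# in the sample, weighting group means by group frequencies gives ybar exactly.
x = rng.integers(0, 5, n)
y = 2.0 * x + rng.normal(0.0, 1.0, n)

cef = np.array([y[x == v].mean() for v in range(5)])  # E[y | x = v], v = 0..4
lie = cef[x].mean()     # outer average of the inner conditional mean
ey = y.mean()

print(lie, ey)          # agree up to floating-point error
```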
Best predictor property.
E[Yi | Xi] is the best (minimum mean squared error, MMSE) predictor of Yi in that it
minimises E[(Yi - h(Xi))²], that is,

E[Yi | Xi] = argmin over h(Xi) of E[(Yi - h(Xi))²],

where h(Xi) is any function of Xi. The proof follows immediately by subtracting and
adding E[Yi | Xi] inside the brackets, so that

E[(Yi - h(Xi))²] = E[(Yi - E[Yi | Xi])² + 2(Yi - E[Yi | Xi])(E[Yi | Xi] - h(Xi)) + (E[Yi | Xi] - h(Xi))²].

The first term does not involve h(Xi), the second one is zero by the decomposition
property, and the last one is minimised at zero, yielding the result that the optimal
h(Xi) is the CEF.³

² In the continuous case, the CEF is (with a slight abuse of notation) E[Yi | Xi = x] =
∫ Yi fY(Yi | Xi = x) dYi, whereas in the discrete case it is E[Yi | Xi = x] = Σ Yi P(Yi | Xi = x).
We have not specified h(Xi) so far, nor really linked the CEF to regression. Define the
population regression parameter vector β as the solution to the following minimisation
problem:⁴

β = argmin over b of E[(Yi - Xi'b)²].

The FOCs yield

E[Xi(Yi - Xi'β)] = 0,

which can be used to produce β = E[XiXi']⁻¹E[XiYi]. An unbiased estimator results in
E[β̂] = β, and its (asymptotic) variance-covariance matrix is E[XiXi']⁻¹E[XiXi'εi²]E[XiXi']⁻¹.
A useful property is that the linear predictor takes the form

E*[Yi | Xi] = Xi'β,

and the solution to the Best Linear Predictor (BLP) problem, i.e. min E[(Yi - E*[Yi | Xi])²],
yields β. If the CEF is linear, then the population regression function is exactly the CEF.
The CEF is linear when Y and X are jointly normal, and also when the model is saturated,
i.e., a model with a separate parameter for every possible combination of values that the
regressors can take.
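A short numerical sketch of β = E[XiXi']⁻¹E[XiYi] (simulated data; the coefficients 1 and 2, the seed, and the sample size are arbitrary choices): computing β from sample moments coincides with the output of a least-squares routine.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

# Simulated "population": Yi = 1 + 2 Xi + error (coefficients illustrative).
x = rng.normal(0.0, 1.0, n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, n)

# beta = E[Xi Xi']^{-1} E[Xi Yi], with expectations replaced by sample moments.
Exx = X.T @ X / n
Exy = X.T @ y / n
beta_moments = np.linalg.solve(Exx, Exy)

# The same numbers from a standard least-squares routine.
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

print(beta_moments, beta_ols)
```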

2 Examples

2.1 Linear Regression

Consider the linear regression model

E[yi | xi] = α + βxi,
³ See Angrist and Pischke (2008) for the ANOVA theorem and yet another CEF property.
⁴ Wooldridge (2010) approaches this by using the linear projection of Yi on Xi. Let E[Yi | Xi] denote
the true regression function. The linear projection of Yi on Xi, denoted L(Yi | Xi) = Xi'β,
is such that the parameters β solve argmin over b of E[(Yi - Xi'b)²]. In other words, the parameters in the
linear projection L(Yi | Xi) provide the best (population) mean square error approximation to the
true regression function.

where x = 1 if an individual belongs to group 1 and x = 0 if the individual is from group
2. A random sample (yi, xi), i = 1, …, n, of observations is available.
Find the Least Squares estimators of α and β. Moreover, show that b can be written as
ȳ1 - ȳ2, where ȳj is the average of the observations from group j, j = 1, 2.

Solution
The OLS estimator for β in the model y = α + βx + ε is

b = (Σxiyi - (1/n)ΣxiΣyi) / (Σxi² - (1/n)(Σxi)²).   (2)

Let n1 be the number of observations in group 1 (where x = 1), so there are n2 =
n - n1 observations in group 2 (where x = 0). As x is a dummy variable we have that
Σxi = Σ over x=1 of xi = Σxi² = n1. Thus, (2) becomes

b = (Σxiyi - (n1/n)Σyi) / (n1 - n1²/n) = (nΣxiyi - n1Σyi) / (n1n2).   (3)

Since Σxiyi = Σ over x=1 of yi and Σyi = Σ over x=1 of yi + Σ over x=0 of yi,

b = (nΣ over x=1 of yi - n1Σ over x=1 of yi - n1Σ over x=0 of yi) / (n1n2)
  = (n2Σ over x=1 of yi - n1Σ over x=0 of yi) / (n1n2)
  = (Σ over x=1 of yi)/n1 - (Σ over x=0 of yi)/n2 = ȳ1 - ȳ2,   (4)

as requested. Finally, the estimator for α is simply

a = ȳ - b x̄ = (n1ȳ1 + n2ȳ2)/n - (ȳ1 - ȳ2)n1/n = (n1 + n2)ȳ2/n = ȳ2.   (5)

A Method of Moments interpretation: since x takes only two values, we can derive the
expectation of y conditional on these values (2 moments) to identify the two
parameters α and β. We have that

E[y | x = 1] = α + βE[x | x = 1] + E[ε | x = 1] = α + β,
E[y | x = 0] = α + βE[x | x = 0] + E[ε | x = 0] = α,

which follows from the usual condition of no correlation between x and ε, E[ε | x] = 0
(for all values of x). Then, we can replace the population conditional moments by their
sample counterparts and the parameters by their estimators. Note that ȳ1 is the sample
estimator of E[y | x = 1] and ȳ2 is the sample moment corresponding to E[y | x = 0].
Thus, the moment conditions are ȳ1 = a + b and ȳ2 = a. Upon solving this system we
get the same answer as before, b = ȳ1 - ȳ2.
The interpretation of the OLS estimator as a MoM estimator can be generalised to any
linear model, not just this particular problem.
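The claims b = ȳ1 - ȳ2 and a = ȳ2 can be confirmed on simulated data (all parameter values and the seed below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1_000

# Dummy regressor: group 1 has x = 1, group 2 has x = 0 (parameters illustrative).
x = rng.integers(0, 2, n).astype(float)
y = 2.0 + 1.5 * x + rng.normal(0.0, 1.0, n)

X = np.column_stack([np.ones(n), x])
a, b = np.linalg.lstsq(X, y, rcond=None)[0]

ybar1 = y[x == 1].mean()   # sample analogue of E[y | x = 1]
ybar2 = y[x == 0].mean()   # sample analogue of E[y | x = 0]

print(b - (ybar1 - ybar2))   # ~ 0: b = ybar1 - ybar2
print(a - ybar2)             # ~ 0: a = ybar2
```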

2.2 Densities
The joint density function of two random variables, X and Y, is given by

fXY(x, y) = y²x e^(-xy) for x > 0 and 0 ≤ y ≤ 1, and 0 otherwise.

Show that the marginal density of X is

f(x) = (2 - e^(-x)(x² + 2x + 2)) / x²,

and that Y is uniformly distributed on [0, 1]. Also find the conditional density of X
given Y. Are X and Y independent?

Solution
The marginal density of X is defined as f(x) = ∫₀¹ f(x, y) dy. Using the functional
form provided and integrating by parts twice we get

f(x) = x∫₀¹ y²e^(-xy) dy = x[-y²e^(-xy)/x]₀¹ + 2x∫₀¹ (y e^(-xy)/x) dy
     = -e^(-x) + 2([-y e^(-xy)/x]₀¹ + ∫₀¹ (e^(-xy)/x) dy)
     = -e^(-x) - (2/x)e^(-x) + (2/x²)(1 - e^(-x))
     = (2 - e^(-x)(x² + 2x + 2)) / x².

For the marginal density of Y,

f(y) = y²∫₀^∞ x e^(-xy) dx = y²([-x e^(-xy)/y]₀^∞ + ∫₀^∞ (e^(-xy)/y) dx) = y²(0 + 1/y²) = 1,  0 ≤ y ≤ 1,

so Y is uniform on [0, 1]. Finally, the conditional density is

f(x | y) = f(x, y)/f(y) = y²x e^(-xy)/1 = y²x e^(-xy) ≠ f(x),

so X and Y are not independent.
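The derivation can be cross-checked numerically by integrating the joint density on a grid (grid sizes and evaluation points below are arbitrary choices; a hand-rolled trapezoid rule is used to stay version-independent):

```python
import numpy as np

# f(x, y) = y^2 x e^{-xy} on x > 0, 0 <= y <= 1.
def f_xy(x, y):
    return y**2 * x * np.exp(-x * y)

def f_x_closed(x):
    # Claimed marginal of X: (2 - e^{-x}(x^2 + 2x + 2)) / x^2
    return (2.0 - np.exp(-x) * (x**2 + 2.0 * x + 2.0)) / x**2

def trap(g, t):
    # simple trapezoidal rule over samples g on the grid t
    return float(np.sum(0.5 * (g[1:] + g[:-1]) * np.diff(t)))

# Marginal of X: numerical integral over y matches the closed form.
y = np.linspace(0.0, 1.0, 20_001)
marg_err = max(abs(trap(f_xy(x0, y), y) - f_x_closed(x0)) for x0 in (0.5, 1.0, 3.0))

# Marginal of Y at y = 0.5: integrate over x on a wide grid; should be 1.
x = np.linspace(0.0, 200.0, 400_001)
fy_half = trap(f_xy(x, 0.5), x)

print(marg_err)   # tiny discretisation error
print(fy_half)    # close to 1, consistent with Y uniform on [0, 1]
```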

2.3 CEF and Joint Normality

Suppose Z1 and Z2 are two independent standard normal variables and

X1 = μ1 + σ1Z1,
X2 = μ2 + ρσ2Z1 + √(1 - ρ²) σ2Z2.

Hence, (X1, X2) are bivariate normal (such that Xi ~ N(μi, σi²)). The covariance
between X1 and X2 is ρσ1σ2. Recalling the OLS expressions for the estimators,

β = Cov(X1, X2)/Var(X1) = ρσ2/σ1,
α = μ2 - βμ1.

Let the linear predictor be denoted by E*(X2 | X1) = α + βX1. Then, the best linear
predictor is

E*(X2 | X1) = μ2 - βμ1 + βX1 = μ2 + β(X1 - μ1).

Note that, employing properties of the Normal distribution,

E(X2 | X1) = μ2 + β(X1 - μ1),

so the CEF is linear in this case (this is not true for other distributions in general).
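A quick simulation of the construction above (the values of μ1, μ2, σ1, σ2, ρ and the seed are arbitrary) confirms that the conditional mean of X2 in a narrow slice of X1 matches μ2 + β(X1 - μ1):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1_000_000

mu1, mu2, s1, s2, rho = 1.0, -0.5, 2.0, 1.5, 0.6   # illustrative parameters
z1 = rng.normal(size=n)
z2 = rng.normal(size=n)
x1 = mu1 + s1 * z1
x2 = mu2 + rho * s2 * z1 + np.sqrt(1.0 - rho**2) * s2 * z2

beta = rho * s2 / s1                   # Cov(X1, X2) / Var(X1)

# Empirical conditional mean of X2 near X1 = 3 versus the linear CEF.
window = np.abs(x1 - 3.0) < 0.05
emp = x2[window].mean()
theory = mu2 + beta * (3.0 - mu1)

print(emp, theory)   # close, up to simulation noise
```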

3 A Few Useful Results Using Matrix Algebra

This section summarises a few important and useful results widely used in econometrics.
Consider

y = x1β1 + x2β2 + … + xkβk,

where x and β are k-dimensional vectors. Further, define the product of two vectors to
be x'β = Σxiβi, a dot product (inner product). Two vectors are orthogonal if their
dot product equals zero, meaning geometrically that the vectors are perpendicular.
Turning to a more statistical interpretation, let x and y be two random variables, each
with mean zero and N observations. We can then construct a vector x = (x1, x2, …, xN)'
and a similar vector y = (y1, y2, …, yN)'. When we take their dot product, we calculate

x'y = Σxiyi = (N - 1)Cov(x, y) and x'x = Σxixi = (N - 1)Var(x).⁵
A matrix A is defined as a collection of n·k entries arranged into n rows and k
columns. Given an n × k matrix A with the entries described as above, the transpose
of A is the k × n matrix A' that results from interchanging the columns and rows of A.
Matrix multiplication AB is only defined when the number of columns of A equals the
number of rows of B, and the order matters. Two useful types of matrices are symmetric
matrices, which equal their transpose, A = A', and idempotent matrices, which are the same
when multiplied by themselves, AA = A. See Abadir and Magnus (2005) for an excellent
treatment of linear algebra and many applications useful in econometrics.
We generally deal with matrix inverses only in theory, so it's important to know some
theoretical properties of inverses. I'll add some rules for transposes as well, since they
mirror the others:
⁵ Geometrically, for two vectors in N-dimensional space, the angle θ between them satisfies
cos θ = x'y / (√(x'x)√(y'y)) = Corr(x, y). The cosine of two rays is one if they point in exactly
the same direction, zero if they are perpendicular, and negative one if they point in exactly
opposite directions, exactly as with correlations.

(A⁻¹)⁻¹ = A, (AB)⁻¹ = B⁻¹A⁻¹, (A')⁻¹ = (A⁻¹)',

(A')' = A, (AB)' = B'A', (A + B)' = A' + B',

where the rule (AB)⁻¹ = B⁻¹A⁻¹ works only when each matrix is square and nonsingular.
Also, matrices can be used for operations on linear functions. For instance,
y = ax yields dy/dx = a, whereas y = Ax yields ∂y/∂x' = A.
An extension to quadratic functions is also straightforward: in matrix representation
this amounts to y = x'Ax.⁶ Then ∂y/∂x' = 2x'A.⁷
Having all this basic notation and language, we can construct a system of equations
with a vector y = (y1, y2, …, yN)' (N × 1) containing all of the y values, a parameter
vector β = (β1, β2, …, βK)' (K × 1), and the matrix

X = [ 1  x12  …  x1K
      1  x22  …  x2K
      ⋮   ⋮   ⋱   ⋮
      1  xN2  …  xNK ]   (N × K),

and a vector ε = (ε1, ε2, …, εN)' (N × 1) containing all of the unobservable determinants
of the outcome y. The system of equations can be represented as y = Xβ + ε. Its
econometric counterpart is ε̂ = y - Xβ̂, where the residuals ε̂ are obtained once β̂ is
estimated. We want the residuals to be such that the size (or norm) of the vector is
minimised, i.e., min ‖ε̂‖ = min √(ε̂'ε̂) or, more familiarly, min ‖ε̂‖² = min ε̂'ε̂. In other
words, we want to minimise the expression

ε̂'ε̂ = (y - Xβ̂)'(y - Xβ̂) = y'y - y'Xβ̂ - β̂'X'y + β̂'X'Xβ̂.

The optimal β̂OLS solves -y'X - (X'y)' + 2β̂'OLS X'X = 0, or -X'y + X'Xβ̂OLS = 0,
so that β̂OLS = (X'X)⁻¹X'y, where (X'y)' = y'X.

From this expression, notice that ε̂ = y - Xβ̂ = y - X(X'X)⁻¹X'y = (I - X(X'X)⁻¹X')y.
We can decompose y into two components: the orthogonal projection onto the K-dimensional
space spanned by X, namely Xβ̂, and the component ε̂, which is the orthogonal projection
onto the (N - K)-dimensional subspace that is orthogonal to the span of X. Since β̂ is
chosen to make ε̂ as short as possible, ε̂ will be orthogonal to the space spanned by X; in
that space, X'ε̂ = 0. The FOCs that define the least squares estimator imply that this is so.

3.1 Projection Matrices

We have that Xβ̂ is the projection of y on the span of X, or Xβ̂ = X(X'X)⁻¹X'y = PX y.
Then, ε̂ is the projection of y off the space spanned by X (in other words, onto the
space that is orthogonal to the span of X): ε̂ = (I - X(X'X)⁻¹X')y = (I - PX)y =
MX y. Therefore, y = PX y + MX y = (PX + MX)y. Note that both PX and MX are
symmetric and idempotent (PX PX = PX and MX MX = MX).

⁶ For a simple illustration, suppose y = x'Ax = ax1² + 2bx1x2 + cx2². Its partial derivatives with
respect to x1 and x2, respectively, are simply 2(ax1 + bx2) and 2(bx1 + cx2).
⁷ Under symmetry; otherwise ∂y/∂x' = x'A + x'A' = x'(A + A').
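These properties are easy to confirm with a small numerical example (dimensions, seed and data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 50, 3

X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = rng.normal(size=n)

P = X @ np.linalg.inv(X.T @ X) @ X.T   # P_X: projection onto the span of X
M = np.eye(n) - P                      # M_X: projection off the span of X

print(np.allclose(P, P.T))             # P_X symmetric
print(np.allclose(P @ P, P))           # P_X idempotent
print(np.allclose(M @ M, M))           # M_X idempotent
print(np.allclose(P @ y + M @ y, y))   # y = P_X y + M_X y
print(np.allclose(X.T @ (M @ y), 0))   # X' e = 0: residuals orthogonal to X
```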


To determine the goodness of fit, also use the expression

y'y = (Xβ̂ + ε̂)'(Xβ̂ + ε̂) = β̂'X'Xβ̂ + β̂'X'ε̂ + ε̂'Xβ̂ + ε̂'ε̂ = β̂'X'Xβ̂ + ε̂'ε̂,

since X'ε̂ = 0. Then the uncentred R²U is defined as

R²U = 1 - ε̂'ε̂/y'y = β̂'X'Xβ̂/y'y = ‖PX y‖²/‖y‖² = cos²θ,

where θ is the angle between y and the span of X. For the more usual centred coefficient of
determination, introduce the n-vector ι = (1, 1, …, 1)', which we can use in forming
Mι = In - ι(ι'ι)⁻¹ι' = In - ιι'/n. Obviously, Mι y gives the vector of deviations from
the mean. Thus, R²C = 1 - ε̂'ε̂/y'Mιy = 1 - ESS/TSS, where ESS here denotes the error
(residual) sum of squares. Recalling that we construct residuals to average to zero (when
a constant is included), Mι ε̂ = ε̂.
However, the true power of this approach is best seen in more complex environments,
such as instrumental variable estimation. Suppose E(X'ε) ≠ 0 due to simultaneity,
omitted variables or errors-in-variables. Then consider some matrix Z formed of
variables uncorrelated with ε. This matrix defines a projection matrix PZ = Z(Z'Z)⁻¹Z',
so that anything projected onto the space spanned by Z will be uncorrelated with ε by
the definition of Z. Then transform the original model y = Xβ + ε into

PZ y = PZ Xβ + PZ ε,

and observe that PZ X = Z(Z'Z)⁻¹Z'X is the fitted value from a regression of X on Z
(the first stage) and E[(PZ X)'PZ ε] = E[X'PZ ε] = 0. This yields the generalised
instrumental variables estimator, defined as

β̂IV = (X'PZ X)⁻¹X'PZ y
     = (X'PZ X)⁻¹X'PZ (Xβ + ε)
     = β + (X'PZ X)⁻¹X'PZ ε,

with the bias given by [(X'Z)(Z'Z)⁻¹(Z'X)]⁻¹(X'Z)(Z'Z)⁻¹Z'ε. However, dividing each
term by N and applying the LLN, we can demonstrate that all terms converge to finite
matrices while (Z'ε)/N converges in probability to 0, which stems from E(Z'ε) = 0.
Hence, β̂IV is consistent for β. Similarly, the CLT can be invoked by scaling β̂IV - β
by √N. Hence, the IV estimator is consistent and asymptotically normally distributed,
but biased in general: even though E(X'PZ ε) = 0, E[(X'PZ X)⁻¹X'PZ ε] may not be zero,
since (X'PZ X)⁻¹ and X'PZ ε are not independent.
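The following sketch (a stylised simultaneity setup; all coefficients, the seed and the instrument design are illustrative, not from the notes) shows OLS drifting away from the true coefficient while the IV estimator, computed through the first-stage fitted values rather than the full PZ matrix, stays close to it:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200_000

# Endogenous regressor: x is correlated with the error through a common shock u.
u = rng.normal(size=n)
z = rng.normal(size=n)                 # instrument: relevant and exogenous
x = 1.0 + 0.8 * z + u + rng.normal(size=n)
eps = u + rng.normal(size=n)           # E[x eps] != 0 but E[z eps] = 0
y = 2.0 + 1.0 * x + eps

X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), z])

# OLS: inconsistent because of endogeneity.
b_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Generalised IV: beta = (X' P_Z X)^{-1} X' P_Z y, via first-stage fitted values.
Xhat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)
b_iv = np.linalg.solve(Xhat.T @ X, Xhat.T @ y)

print(b_ols[1])   # pushed above the true value 1 by the endogeneity
print(b_iv[1])    # close to 1
```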

4 Miscellaneous Examples

The estimators.⁸
In the simple linear regression model, X has two columns: a vector of ones and a
vector containing the explanatory variable x. By working on the general formula for
the estimator,

⁸ All sums below run over i = 1, …, n.

b = (X'X)⁻¹X'y,   where   X'X = [ n    Σxi
                                  Σxi  Σxi² ]   and   X'y = [ Σyi
                                                              Σxiyi ].

Take the inverse of the 2 × 2 matrix X'X:

b = (1/|X'X|) [ Σxi²  -Σxi
                -Σxi   n   ] [ Σyi
                               Σxiyi ]
  = (1/(nΣxi² - (Σxi)²)) [ Σxi²Σyi - ΣxiΣxiyi
                           nΣxiyi - ΣxiΣyi   ].

Consider the second element of the vector b,

b2 = (nΣxiyi - ΣxiΣyi) / (nΣxi² - (Σxi)²)
   = (Σxiyi - Σxi(1/n)Σyi) / (Σxi² - Σxi(1/n)Σxi)
   = Σxi(yi - ȳ) / Σxi(xi - x̄)
   = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)².

As for the other element, we have found that

b1 = (Σxi²Σyi - ΣxiΣxiyi) / (nΣxi² - (Σxi)²).

By working on the numerator (add and subtract (1/n)(Σxi)²Σyi),

Σxi²Σyi - ΣxiΣxiyi = (nΣxi² - (Σxi)²)ȳ - (nΣxiyi - ΣxiΣyi)x̄.

Therefore,

b1 = [(nΣxi² - (Σxi)²)ȳ - (nΣxiyi - ΣxiΣyi)x̄] / (nΣxi² - (Σxi)²) = ȳ - b2x̄.
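The closed-form expressions for b1 and b2 can be checked against a matrix-based solution (simulated data; all parameter values and the seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100
x = rng.normal(size=n)
y = 0.5 + 2.0 * x + rng.normal(size=n)   # illustrative coefficients

sx, sy = x.sum(), y.sum()
sxx, sxy = (x * x).sum(), (x * y).sum()
den = n * sxx - sx**2

b2 = (n * sxy - sx * sy) / den           # slope, closed form
b1 = (sxx * sy - sx * sxy) / den         # intercept, closed form

# Cross-check: b1 = ybar - b2 xbar, and both match (X'X)^{-1} X'y.
X = np.column_stack([np.ones(n), x])
b_mat = np.linalg.solve(X.T @ X, X.T @ y)

print(b1, b2)
print(b_mat)
```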

Standard errors.
To find the covariance matrix of b, note that b = β + (X'X)⁻¹X'ε, so

E[(b - β)(b - β)'] = E[(X'X)⁻¹X'εε'X(X'X)⁻¹]
                   = (X'X)⁻¹X'E[εε']X(X'X)⁻¹ = σ²(X'X)⁻¹,

which follows from the facts that X is fixed in repeated samples and E[εε'] = σ²I.⁹
Thus, for the simple regression model,

E[(b - β)(b - β)'] = (σ²/(nΣxi² - (Σxi)²)) [ Σxi²  -Σxi
                                             -Σxi   n   ].

The standard error of b1 is given by the square root of element (1, 1) of the
covariance matrix, while the covariance between b1 and b2 is given by its off-diagonal
element,

se(b1) = √( σ²Σxi² / (nΣxi² - (Σxi)²) )   and   cov(b1, b2) = -σ²Σxi / (nΣxi² - (Σxi)²).

⁹ If X is stochastic then all the moments are defined conditional on X. For example, E[εε' | X] = σ²I.
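The explicit 2 × 2 formulas agree with σ²(X'X)⁻¹ computed directly (σ², the seed and the data below are arbitrary; in practice σ² would be replaced by an estimate):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 200
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
sigma2 = 2.0   # illustrative "true" error variance

# Generic formula: covariance matrix of b is sigma^2 (X'X)^{-1}.
V = sigma2 * np.linalg.inv(X.T @ X)

# Explicit 2x2 expressions from the text.
sx, sxx = x.sum(), (x * x).sum()
den = n * sxx - sx**2
se_b1 = np.sqrt(sigma2 * sxx / den)
cov_b1_b2 = -sigma2 * sx / den

print(se_b1, np.sqrt(V[0, 0]))     # equal
print(cov_b1_b2, V[0, 1])          # equal
```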

Altering the regressors.
Consider the multiple regression model y = Xβ + ε, where β is a k × 1 vector, and
the linear transformation Z = XA, where A is a k × k nonsingular matrix. In the multiple
regression model the estimated parameter vector is b = (X'X)⁻¹X'y and the residuals
can be calculated as e = y - ŷ = y - Xb. Now we regress y on Z. The estimated
parameter vector becomes

bA = (Z'Z)⁻¹Z'y = ((XA)'(XA))⁻¹(XA)'y = (A'X'XA)⁻¹A'X'y.

Note that both A and X'X are square (k × k) and nonsingular. Remember that if
M, N, P are square and nonsingular, then (MNP)⁻¹ = P⁻¹N⁻¹M⁻¹. Thus, bA becomes

bA = A⁻¹(X'X)⁻¹(A')⁻¹A'X'y = A⁻¹(X'X)⁻¹X'y = A⁻¹b.

There is a linear relation between the coefficients estimated in the two regressions.
Replace now the definition of Z and bA in the residuals for this new regression to see
that they are the same as those in the original regression (y on X):¹⁰

eA = y - ZbA = y - (XA)(A⁻¹b) = y - Xb = e.

¹⁰ This is due to the fact that the projection matrices in both models are the same. We have
that PX = X(X'X)⁻¹X' and PA = Z(Z'Z)⁻¹Z' = XA(A'X'XA)⁻¹A'X' = XAA⁻¹(X'X)⁻¹(A')⁻¹A'X' = PX,
and therefore eA = (I - PA)y = (I - PX)y = e. In words, the space spanned by the columns of X is
the same as the span of Z; the only difference is the basis (which explains why b ≠ bA and, in
particular, why bA is a rotation of b).
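Both results, bA = A⁻¹b and eA = e, can be verified numerically (dimensions, the seed and the random nonsingular A are illustrative):

```python
import numpy as np

rng = np.random.default_rng(9)
n, k = 120, 3

X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = rng.normal(size=n)
A = rng.normal(size=(k, k))            # nonsingular with probability one

Z = X @ A                              # transformed regressors
b = np.linalg.solve(X.T @ X, X.T @ y)
bA = np.linalg.solve(Z.T @ Z, Z.T @ y)

print(np.allclose(bA, np.linalg.solve(A, b)))   # b_A = A^{-1} b
print(np.allclose(y - Z @ bA, y - X @ b))       # identical residuals
```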


References
Abadir, K., and J. Magnus (2005): Matrix Algebra, Econometric Exercises. Cambridge University Press.
Angrist, J. D., and J.-S. Pischke (2008): Mostly Harmless Econometrics: An
Empiricist's Companion. Princeton University Press.
Wooldridge, J. M. (2010): Econometric Analysis of Cross Section and Panel Data.
The MIT Press, second edn.
