
Econometrics 2 — Fall 2005

Generalized Method of Moments (GMM) Estimation

Heino Bohn Nielsen

1 of 32

Outline
(1) Introduction and motivation
(2) Moment Conditions and Identification
(3) A Model Class: Instrumental Variables (IV) Estimation
(4) Method of Moments (MM) Estimation
    Examples: Mean, OLS and Linear IV
(5) Generalized Method of Moments (GMM) Estimation
    Properties: Consistency and Asymptotic Distribution
(6) Efficient GMM
    Examples: Two-Stage Least Squares
(7) Comparison with Maximum Likelihood
    Pseudo-ML Estimation
(8) Empirical Example: C-CAPM Model

2 of 32
Introduction
Generalized method of moments (GMM) is a general estimation principle.
Estimators are derived from so-called moment conditions.

Three main motivations:


(1) Many estimators can be seen as special cases of GMM.
GMM provides a unifying framework for comparison.

(2) Maximum likelihood estimators have the smallest variance in the class of consistent
and asymptotically normal estimators.
But: We need a full description of the DGP and correct specification.
GMM is an alternative based on minimal assumptions.

(3) GMM estimation is often possible where a likelihood analysis is extremely difficult.
We only need a partial specification of the model.
A leading example is models involving rational expectations.

3 of 32

Moment Conditions and Identification


• A moment condition is a statement involving the data and the parameters:

g(θ0) = E[f (wt, zt, θ0)] = 0. (∗)


where θ is a K × 1 vector of parameters; f(·) is an R-dimensional vector of (possibly non-linear) functions; wt contains model variables; and zt contains instruments.

• If we knew the expectation then we could solve the equations in (∗) to find θ0.

• If there is a unique solution, so that

E[f (wt, zt, θ)] = 0 if and only if θ = θ0,


then we say that the system is identified.

• Identification is essential for doing econometrics. Two ideas:


(1) Is the model constructed so that θ0 is unique (identification)?
(2) Are the data informative enough to determine θ0 (empirical identification)?

4 of 32
Instrumental Variables Estimation
• In many applications, the moment condition has the specific form:

f(wt, zt, θ) = u(wt, θ) · zt,

where the R instruments in zt (R × 1) are multiplied by the scalar (1 × 1) disturbance term, u(wt, θ).

• You can think of u(wt, θ) as the equivalent of an error term.


The moment condition becomes

g(θ0) = E[u(wt, θ0) · zt] = 0,


stating that the instruments are uncorrelated with the error term of the model.

• This class of estimators is referred to as instrumental variables estimators.


The function u(wt, θ) may be linear or non-linear in θ.

5 of 32

Example: Moment Condition From RE


• Consider a monetary policy rule, where the interest rate depends on expected future inflation:

rt = β · E[πt+1 | It] + εt.

Noting that realized inflation can be written as

πt+1 = E[πt+1 | It] + vt,

where vt is the expectation error, we can write the model as

rt = β · E[πt+1 | It] + εt = β · πt+1 + (εt − β · vt) = β · πt+1 + ut.

• Under rational expectations, the expectation error, vt, should be orthogonal to the information set, It, and for zt ∈ It we have the moment condition

E[ut · zt] = E[(rt − β · πt+1) · zt] = 0.

This is enough to identify β.
6 of 32
Method of Moments (MM) Estimator
• For a given sample, wt and zt (t = 1, 2, ..., T ), we cannot calculate the expectation.
We replace the expectation with sample averages to obtain the analogous sample moments:

gT(θ) = (1/T) Σ_{t=1}^{T} f(wt, zt, θ).

We can derive an estimator, θ̂MM, as the solution to gT(θ̂MM) = 0.

• To find an estimator, we need at least as many equations as we have parameters.


The order condition for identification is R ≥ K .

— R = K is called exact identification.
  The estimator is denoted the method of moments estimator, θ̂MM.

— R > K is called over-identification.
  The estimator is denoted the generalized method of moments estimator, θ̂GMM.

7 of 32

Example: MM Estimator of the Mean


• Assume that yt is a random variable drawn from a population with expectation µ0.
We have a single moment condition:

g(µ0) = E[f (yt, µ0)] = E[yt − µ0] = 0,


where f (yt, µ0) = yt − µ0.

• For a sample, y1, y2, ..., yT , we state the corresponding sample moment conditions:
gT(µ̂) = (1/T) Σ_{t=1}^{T} (yt − µ̂) = 0.

The MM estimator of the mean µ0 is the solution, i.e.

µ̂MM = (1/T) Σ_{t=1}^{T} yt,
which is the sample average.
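• A minimal numpy sketch of this example; the data series below is simulated purely for illustration and is not part of the original slides:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.0, size=500)   # simulated sample with mu_0 = 2

def g_T(mu):
    """Sample moment g_T(mu) = (1/T) sum_t (y_t - mu)."""
    return np.mean(y - mu)

mu_mm = np.mean(y)          # the unique solution to g_T(mu) = 0
print(mu_mm, g_T(mu_mm))    # g_T evaluated at the estimate is (numerically) zero
```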

8 of 32
Example: OLS as a MM Estimator
• Consider the linear regression model of yt on xt (K × 1):

yt = xt'β0 + εt. (∗∗)

Assume that (∗∗) represents the conditional expectation:

E[yt | xt] = xt'β0 so that E[εt | xt] = 0.

• That implies the K unconditional moment conditions

g(β0) = E[xt εt] = E[xt (yt − xt'β0)] = 0,


which we recognize as the minimal assumption for consistency of the OLS estimator.

9 of 32

• We define the corresponding sample moment conditions as

gT(β̂) = (1/T) Σ_{t=1}^{T} xt (yt − xt'β̂) = (1/T) Σ_{t=1}^{T} xt yt − ((1/T) Σ_{t=1}^{T} xt xt') β̂ = 0.

And the MM estimator is derived as the unique solution:

β̂MM = (Σ_{t=1}^{T} xt xt')^{-1} Σ_{t=1}^{T} xt yt,

provided that Σ_{t=1}^{T} xt xt' is non-singular.

• Method of moments is one way to motivate the OLS estimator.


Highlights the minimal (or identifying) assumptions for OLS.
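• A minimal numpy sketch of OLS obtained from the sample moment conditions; the data-generating process below is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(1)
T = 500
X = np.column_stack([np.ones(T), rng.normal(size=T)])   # x_t = (1, x_{2t})'
beta0 = np.array([1.0, 0.5])
y = X @ beta0 + rng.normal(size=T)

# Sample moment conditions: (1/T) sum_t x_t (y_t - x_t' b) = 0
# => solve (sum_t x_t x_t') b = sum_t x_t y_t
beta_mm = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_mm)
```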

10 of 32
Example: Under-Identification
• Consider again a regression model

yt = xt'β0 + εt = x1t'γ0 + x2t'δ0 + εt.

• Assume that the K1 variables in x1t are predetermined, while the K2 = K − K1 variables in x2t are endogenous. That implies

E[x1t εt] = 0   (K1 × 1)   (†)
E[x2t εt] ≠ 0   (K2 × 1).  (††)

• We have K parameters in β0 = (γ0', δ0')', but only K1 < K moment conditions
(i.e. K1 equations to determine K unknowns).
The parameters are not identified and cannot be estimated consistently.

11 of 32

Example: Simple IV Estimator


• Assume K2 new variables, z2t, that are correlated with x2t but uncorrelated with εt:

E[z2t εt] = 0. (†††)

The K2 moment conditions in (†††) can replace (††). To simplify notation, we define

xt = (x1t', x2t')'  (K × 1)   and   zt = (x1t', z2t')'  (K × 1).

xt are model variables, z2t are new instruments, and zt are instruments.
We say that x1t are instruments for themselves.

• Using (†) and (†††) we have K moment conditions:

g(β0) = (E[x1t εt]', E[z2t εt]')' = E[zt εt] = E[zt (yt − xt'β0)] = 0,

which are sufficient to identify the K parameters in β0.

12 of 32
• The corresponding sample moment conditions are given by

gT(β̂) = (1/T) Σ_{t=1}^{T} zt (yt − xt'β̂) = 0.

• The method of moments estimator is the unique solution:

β̂MM = (Σ_{t=1}^{T} zt xt')^{-1} Σ_{t=1}^{T} zt yt,

provided that Σ_{t=1}^{T} zt xt' is non-singular.

• Note the following:


(1) We need the instruments to identify the parameters.
(2) The MM estimator coincides with the simple IV estimator.
(3) The procedure only works with exactly K2 new instruments (i.e. R = K).
(4) Non-singularity of Σ_{t=1}^{T} zt xt' requires relevant instruments.
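• A minimal numpy sketch of the simple (just-identified) IV estimator; the variable names and data-generating process are illustrative assumptions, not part of the original example:

```python
import numpy as np

rng = np.random.default_rng(2)
T = 1000
z2 = rng.normal(size=T)                          # new instrument
e = rng.normal(size=T)                           # error term
x2 = 0.8 * z2 + 0.5 * e + rng.normal(size=T)     # endogenous regressor (correlated with e)
x1 = np.ones(T)                                  # predetermined regressor (a constant)
y = 1.0 * x1 + 0.5 * x2 + e

X = np.column_stack([x1, x2])                    # model variables x_t
Z = np.column_stack([x1, z2])                    # instruments z_t (x1 instruments itself)

# beta_MM = (sum_t z_t x_t')^{-1} sum_t z_t y_t; requires R = K and Z'X non-singular
beta_iv = np.linalg.solve(Z.T @ X, Z.T @ y)
print(beta_iv)
```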

13 of 32

Generalized Method of Moments Estimation


• The case R > K is called over-identification.
There are more equations than parameters, and in general no solution to gT(θ) = 0 exists.

• Instead we minimize the distance from gT (θ) to zero.


The distance is measured by the quadratic form

QT(θ) = gT(θ)' WT gT(θ),


where WT is an R × R symmetric and positive definite weight matrix.

• The GMM estimator depends on the weight matrix:


θ̂GMM(WT) = arg min_θ {gT(θ)' WT gT(θ)}.
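• A sketch of the GMM criterion and its numerical minimization with scipy; the moment function f and the data are placeholders supplied by the user, and the tiny mean example only illustrates the interface:

```python
import numpy as np
from scipy.optimize import minimize

def gmm_objective(theta, f, data, W):
    """Q_T(theta) = g_T(theta)' W g_T(theta), with g_T the sample average of the moments."""
    g_T = np.mean(f(data, theta), axis=0)   # f returns a T x R array of moment contributions
    return g_T @ W @ g_T

def gmm_estimate(f, data, theta_start, W):
    """Minimize Q_T(theta) numerically for a given weight matrix W."""
    return minimize(gmm_objective, theta_start, args=(f, data, W), method="BFGS").x

# Tiny illustration: the mean as GMM with one moment and W = I_1
y = np.array([1.0, 2.0, 3.0, 4.0])
f_mean = lambda data, theta: (data - theta[0]).reshape(-1, 1)
print(gmm_estimate(f_mean, y, theta_start=np.array([0.0]), W=np.eye(1)))   # approx. 2.5
```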

14 of 32
Distances and Weight Matrices
• Consider a simple example with 2 moment conditions
gT(θ) = (ga, gb)',

where the dependence on T and θ is suppressed.

• First consider a simple weight matrix, WT = I2:

QT(θ) = gT(θ)' WT gT(θ) = (ga, gb) [1 0; 0 1] (ga, gb)' = ga² + gb²,

which is the square of the simple distance from gT(θ) to zero.
Here the coordinates are equally important.

• Alternatively, look at a different weight matrix:

QT(θ) = gT(θ)' WT gT(θ) = (ga, gb) [2 0; 0 1] (ga, gb)' = 2·ga² + gb²,

which attaches more weight to the first coordinate in the distance.
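• A small numpy illustration of how the two weight matrices above change the distance measure; the numbers for ga and gb are made up:

```python
import numpy as np

g = np.array([0.3, -0.2])           # g_T(theta) = (g_a, g_b)'

W_identity = np.eye(2)              # equal weights
W_unequal = np.diag([2.0, 1.0])     # double weight on the first coordinate

Q_identity = g @ W_identity @ g     # g_a^2 + g_b^2 = 0.13
Q_unequal = g @ W_unequal @ g       # 2*g_a^2 + g_b^2 = 0.22
print(Q_identity, Q_unequal)
```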
15 of 32

Consistency: Why Does it Work?


• Assume that a law of large numbers (LLN) applies to f (wt, zt, θ), i.e.
(1/T) Σ_{t=1}^{T} f(wt, zt, θ) → E[f(wt, zt, θ)]   for T → ∞.

That requires IID or stationarity and weak dependence.

• If the moment conditions are correct, g(θ0) = 0, then GMM is consistent,


θ̂GMM(WT) → θ0 as T → ∞,

for any positive definite WT.

• Intuition: If a LLN applies, then gT (θ) converges to g(θ).


Since θ̂GMM(WT) minimizes the distance from gT(θ) to zero, it will be a consistent estimator of the solution to g(θ0) = 0.

• The weight matrix, WT, has to be positive definite, so that we put a strictly positive weight on all moment conditions.
16 of 32
Asymptotic Distribution
• Assume a central limit theorem for f (wt, zt, θ), i.e.:
√T · gT(θ0) = (1/√T) Σ_{t=1}^{T} f(wt, zt, θ0) → N(0, S),
where S is the asymptotic variance.

• Then it holds that for any positive definite weight matrix, W, the asymptotic distribution of the GMM estimator is given by

√T (θ̂GMM − θ0) → N(0, V).

The asymptotic variance is given by

V = (D'WD)^{-1} D'WSWD (D'WD)^{-1},

where

D = E[∂f(wt, zt, θ)/∂θ']

is the expected value of the R × K matrix of first derivatives of the moments.
17 of 32

Efficient GMM Estimation


• The variance of θ̂GMM depends on the weight matrix, WT.
The efficient GMM estimator has the smallest possible (asymptotic) variance.

• Intuition: a moment with small variance is informative and should have large weight.
It can be shown that the optimal weight matrix, WT^opt, has the property that

plim WT^opt = S^{-1}.

With the optimal weight matrix, W = S^{-1}, the asymptotic variance simplifies to

V = (D'S^{-1}D)^{-1} D'S^{-1}SS^{-1}D (D'S^{-1}D)^{-1} = (D'S^{-1}D)^{-1}.

• The best moment conditions have small S and large D.


— A small S means that the sample variation of the moment (noise) is small.
— A large D means that the moment condition is strongly violated if θ ≠ θ0.
The moment is very informative on the true values, θ0.

18 of 32
• Hypothesis testing can be based on the asymptotic distribution:
θ̂GMM ∼ N(θ0, T^{-1} V̂)   (asymptotically).

• An estimator of the asymptotic variance is given by


V̂ = (DT' ST^{-1} DT)^{-1},

where

DT = ∂gT(θ)/∂θ' = (1/T) Σ_{t=1}^{T} ∂f(wt, zt, θ)/∂θ'   (R × K)

is the sample average of the first derivatives.


And ST is an estimator of S = T · V[gT(θ)]. If the observations are independent, a consistent estimator is

ST = (1/T) Σ_{t=1}^{T} f(wt, zt, θ) f(wt, zt, θ)'.
Estimation of the weight matrix is typically the most tricky part of GMM.
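• A sketch of how DT, ST and V̂ could be computed for the linear IV moment f(wt, zt, β) = zt (yt − xt'β), assuming independent observations; the function name is illustrative:

```python
import numpy as np

def gmm_iv_variance(y, X, Z, beta_hat):
    """Estimated variance of the efficient GMM estimator for f_t = z_t (y_t - x_t' beta),
    assuming independent observations."""
    T = len(y)
    u = y - X @ beta_hat                 # residuals
    f = Z * u[:, None]                   # T x R matrix of moment contributions
    S_T = (f.T @ f) / T                  # S_T = (1/T) sum_t f_t f_t'
    D_T = -(Z.T @ X) / T                 # D_T = (1/T) sum_t d f_t / d beta' = -(1/T) sum_t z_t x_t'
    V_hat = np.linalg.inv(D_T.T @ np.linalg.inv(S_T) @ D_T)
    return V_hat / T                     # Var(beta_hat) is approximately T^{-1} V_hat

# standard errors: np.sqrt(np.diag(gmm_iv_variance(y, X, Z, beta_hat)))
```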
19 of 32

Test of Overidentifying Moment Conditions


• Recall that K moment conditions are sufficient to estimate the K parameters in θ.
If R > K , we can test the validity of the R − K overidentifying moment conditions.

• By MM estimation we can set K moment conditions equal to zero.


If all R conditions are valid then the R − K moments should also be close to zero.

• From the CLT we have

gT(θ0) ∼ N(0, T^{-1} S)   (asymptotically).

If we use the optimal weights, WT^opt → S^{-1}, then

ξJ = T · gT(θ̂GMM)' WT^opt gT(θ̂GMM) = T · QT(θ̂GMM) → χ²(R − K).

• This is the J-test or the Hansen test for overidentifying restrictions.


In linear models it is often referred to as the Sargan test.
ξJ is not a test of the validity of the model or the underlying economic theory.
ξJ considers whether the R − K overidentifying moments are in line with the K identifying moments.
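• A sketch of the J statistic for the linear IV case with efficient (HC) weights; the function name is illustrative, and β̂ is assumed to be the efficient GMM estimate:

```python
import numpy as np
from scipy.stats import chi2

def hansen_j_test(y, X, Z, beta_hat):
    """J statistic for the R - K overidentifying restrictions, using efficient weights."""
    T, K = X.shape
    R = Z.shape[1]
    u = y - X @ beta_hat
    f = Z * u[:, None]                      # moment contributions f_t = z_t * u_t
    g_T = f.mean(axis=0)
    W_opt = np.linalg.inv((f.T @ f) / T)    # optimal weights: S_T^{-1}
    xi_J = T * g_T @ W_opt @ g_T
    p_value = 1.0 - chi2.cdf(xi_J, df=R - K)
    return xi_J, p_value
```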
20 of 32
Computational Issues
• The estimator is defined by minimizing QT(θ). Minimization requires solving the first order conditions

∂QT(θ)/∂θ = ∂(gT(θ)' WT gT(θ))/∂θ = 0   (K × 1).

Sometimes this can be done analytically, but often numerical optimization is needed.


• We need an optimal weight matrix, WT^opt, but that depends on the parameters!

Two-step efficient GMM:


(1) Choose an initial weight matrix, e.g. W[1] = IR, and find a consistent but inefficient first-step GMM estimator

θ̂[1] = arg min_θ gT(θ)' W[1] gT(θ).

(2) Find the optimal weight matrix, W[2]^opt, based on θ̂[1]. Find the efficient estimator

θ̂[2] = arg min_θ gT(θ)' W[2]^opt gT(θ).

The estimator is not unique as it depends on the initial weight matrix W[1].
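• A sketch of two-step efficient GMM for the linear IV moments, using the closed-form solution (X'Z W Z'X)^{-1} X'Z W Z'Y for a given weight matrix (derived in the 2SLS example later); names are illustrative:

```python
import numpy as np

def two_step_gmm_iv(y, X, Z):
    """Two-step efficient GMM for the linear IV moments g_T(beta) = (1/T) Z'(y - X beta)."""
    T = len(y)
    # Step 1: W_[1] = I_R gives a consistent but inefficient first-step estimator
    W1 = np.eye(Z.shape[1])
    b1 = np.linalg.solve(X.T @ Z @ W1 @ Z.T @ X, X.T @ Z @ W1 @ Z.T @ y)

    # Step 2: optimal (heteroskedasticity-robust) weights based on first-step residuals
    u1 = y - X @ b1
    f = Z * u1[:, None]
    W2 = np.linalg.inv((f.T @ f) / T)       # W_[2]^opt = S_T^{-1}
    b2 = np.linalg.solve(X.T @ Z @ W2 @ Z.T @ X, X.T @ Z @ W2 @ Z.T @ y)
    return b2
```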
21 of 32

Iterated GMM estimator:


• From the estimator θ̂[2] it is natural to update the weights, W[3]^opt, and obtain an updated estimator θ̂[3].
We can switch between estimating W[·]^opt and θ̂[·] until convergence.
Iterated GMM does not depend on the initial weight matrix.
The two approaches are asymptotically equivalent.

Continuously updated GMM estimator:


• A third approach is to recognize from the outset that the weight matrix depends on
the parameters, and minimize

QT(θ) = gT(θ)' WT(θ) gT(θ).


In general this cannot be solved analytically; the criterion is minimized numerically (a sketch follows below).
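• A sketch of the continuously updated criterion for the linear IV case, to be minimized numerically with scipy; illustrative only:

```python
import numpy as np
from scipy.optimize import minimize

def cu_gmm_objective(beta, y, X, Z):
    """Continuously updated criterion Q_T(beta) = g_T(beta)' S_T(beta)^{-1} g_T(beta)."""
    T = len(y)
    u = y - X @ beta
    f = Z * u[:, None]
    g_T = f.mean(axis=0)
    S_T = (f.T @ f) / T
    return g_T @ np.linalg.solve(S_T, g_T)

# beta_cu = minimize(cu_gmm_objective, beta_start, args=(y, X, Z), method="BFGS").x
```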

22 of 32
Example: 2SLS
• Consider again a regression model

yt = xt'β0 + εt = x1t'γ0 + x2t'δ0 + εt,

where E[x1t εt] = 0 and E[x2t εt] ≠ 0.
Assume that you have R > K valid instruments in zt so that

g(β0) = E[zt εt] = E[zt (yt − xt'β0)] = 0.

• The corresponding sample moments are given by


gT(β) = (1/T) Σ_{t=1}^{T} zt (yt − xt'β) = (1/T) Z'(Y − Xβ)   (R × 1),

where Y (T × 1), X (T × K), and Z (T × R) are the stacked data matrices.

• In this case we cannot solve gT(β) = 0 directly; Z'X is R × K and not invertible.

23 of 32

• Instead, we want to derive the GMM estimator by minimizing the criterion function

QT(β) = gT(β)' WT gT(β)
      = (T^{-1} Z'(Y − Xβ))' WT (T^{-1} Z'(Y − Xβ))
      = T^{-2} (Y'Z WT Z'Y − 2β'X'Z WT Z'Y + β'X'Z WT Z'Xβ).

• We take the first derivative, and the GMM estimator is the solution to

∂QT(β)/∂β = −2T^{-2} X'Z WT Z'Y + 2T^{-2} X'Z WT Z'Xβ = 0.

We find β̂GMM(WT) = (X'Z WT Z'X)^{-1} X'Z WT Z'Y, depending on WT.

• To estimate the optimal weight matrix, WT^opt = ST^{-1}, we use the estimator

ST = (1/T) Σ_{t=1}^{T} f(wt, zt, θ) f(wt, zt, θ)' = (1/T) Σ_{t=1}^{T} ε̂t² zt zt',

which allows for general heteroskedasticity of the disturbance term.

24 of 32
• For the asymptotic distribution, we recall that

β̂GMM ∼ N(β0, T^{-1} (D'S^{-1}D)^{-1})   (asymptotically).

The derivative is given by

DT = ∂gT(β)/∂β' = ∂(T^{-1} Σ_{t=1}^{T} zt (yt − xt'β))/∂β' = −T^{-1} Σ_{t=1}^{T} zt xt'   (R × K),

so the variance of the estimator becomes

V[β̂GMM] = T^{-1} (DT' WT^opt DT)^{-1}
         = T^{-1} ((−T^{-1} Σ_{t=1}^{T} xt zt') (T^{-1} Σ_{t=1}^{T} ε̂t² zt zt')^{-1} (−T^{-1} Σ_{t=1}^{T} zt xt'))^{-1}
         = (Σ_{t=1}^{T} xt zt')^{-1} (Σ_{t=1}^{T} ε̂t² zt zt') (Σ_{t=1}^{T} zt xt')^{-1}.

• Note that this is the heteroskedasticity consistent (HC) variance estimator of White.
GMM with allowance for heteroskedastic errors automatically produces heteroskedasticity consistent standard errors!
25 of 32

• If we assume that the error terms are IID, the optimal weight matrix simplifies to

ST = (σ̂²/T) Σ_{t=1}^{T} zt zt' = T^{-1} σ̂² Z'Z,

where σ̂² is a consistent estimator for σ².

• In this case the efficient GMM estimator becomes

β̂GMM = (X'Z ST^{-1} Z'X)^{-1} X'Z ST^{-1} Z'Y
      = (X'Z (T^{-1} σ̂² Z'Z)^{-1} Z'X)^{-1} X'Z (T^{-1} σ̂² Z'Z)^{-1} Z'Y
      = (X'Z (Z'Z)^{-1} Z'X)^{-1} X'Z (Z'Z)^{-1} Z'Y,

which is identical to the two stage least squares (2SLS) estimator.

• The variance of the estimator is

V[β̂GMM] = T^{-1} (DT' ST^{-1} DT)^{-1} = σ̂² (X'Z (Z'Z)^{-1} Z'X)^{-1},

which again coincides with the 2SLS variance.
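• A sketch of the 2SLS estimator written directly from the formula above, together with the equivalent "two stages of least squares" version; names are illustrative:

```python
import numpy as np

def tsls(y, X, Z):
    """2SLS: beta = (X'Z (Z'Z)^{-1} Z'X)^{-1} X'Z (Z'Z)^{-1} Z'y."""
    P = Z @ np.linalg.solve(Z.T @ Z, Z.T)            # projection onto the instrument space
    return np.linalg.solve(X.T @ P @ X, X.T @ P @ y)

def tsls_two_stage(y, X, Z):
    """Equivalent two-stage version: regress X on Z, then y on the fitted values X_hat."""
    X_hat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)    # first-stage fitted values
    return np.linalg.solve(X_hat.T @ X_hat, X_hat.T @ y)
```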
26 of 32
Pseudo-ML (PML) Estimation
• The first order conditions for ML estimation can be seen as a sample counterpart to
a moment condition
(1/T) Σ_{t=1}^{T} st(θ) = 0   corresponds to   E[st(θ)] = 0,
T t=1
and ML becomes a special case of GMM.

• θ̂ML is consistent under weaker assumptions than those maintained by ML.
The FOC for a normal regression model corresponds to

E[xt (yt − xt'β)] = 0,

which is weaker than the assumption that the entire distribution is correctly specified.
OLS is consistent even if εt is not normal.

• An ML estimator that maximizes a likelihood function different from the true model's likelihood is referred to as a pseudo-ML or a quasi-ML estimator.
Note that the variance matrix is no longer the inverse of the information matrix.
27 of 32

(My Unfair) Comparison of ML and GMM


Assumptions:
  ML:  Full specification: the density is known apart from θ0.
  GMM: Partial specification/weak assumptions: moment conditions E[f(data; θ0)] = 0.
       Strong economic assumptions.

Efficiency:
  ML:  Cramér-Rao lower bound (smallest possible variance).
  GMM: Efficient based on the moment conditions; larger than Cramér-Rao.

Typical approach:
  ML:  Statistical description of the data. Misspecification testing.
  GMM: Estimate deep parameters of an economic model. Restrictions recover the economics.

Robustness:
  ML:  First order conditions should hold! PML is a GMM interpretation of ML; use the larger PML variance.
  GMM: Moment conditions should hold! Weights and variances can be made robust.

28 of 32
Example: The C-CAPM Model
• Consider the consumption based capital asset pricing (C-CAPM) model of Hansen
and Singleton (1982).

• A representative agent maximizes the discounted value of lifetime utility subject to a budget constraint:

max E[Σ_{s=1}^{∞} δ^s · u(ct+s) | It],
s.t. At+1 = (1 + rt+1) At + yt+1 − ct+1,

where At is financial wealth, yt is income, 0 ≤ δ ≤ 1 is a discount factor, and It is the information set at time t.

• The first order condition is given by the Euler equation:

u'(ct) = E[δ · u'(ct+1) · Rt+1 | It],

where u'(·) is the derivative of the utility function, and Rt+1 = 1 + rt+1 is the return factor.
29 of 32

• Now assume a constant relative risk aversion (CRRA) utility function:

u(ct) = ct^{1−γ}/(1 − γ),   γ < 1,

so that u'(ct) = ct^{−γ}. That gives the explicit Euler equation:

ct^{−γ} − E[δ · ct+1^{−γ} · Rt+1 | It] = 0.

• To ensure stationarity, we reformulate:

E[δ · (ct+1/ct)^{−γ} · Rt+1 − 1 | It] = 0,

which is a conditional moment condition.

• That implies the unconditional moment conditions

E[f(ct+1, ct, Rt+1; zt; δ, γ)] = E[(δ · (ct+1/ct)^{−γ} · Rt+1 − 1) · zt] = 0,

for all variables zt ∈ It included in the information set.
30 of 32
• To estimate the parameters, θ = (δ, γ)', we need at least R = 2 instruments in zt.
We try with R = 3 instruments:

zt = (1, ct/ct−1, Rt)'.

• That produces the moment conditions

E[δ · (ct+1/ct)^{−γ} · Rt+1 − 1] = 0
E[(δ · (ct+1/ct)^{−γ} · Rt+1 − 1) · (ct/ct−1)] = 0
E[(δ · (ct+1/ct)^{−γ} · Rt+1 − 1) · Rt] = 0,

for t = 1, 2, ..., T.

• The model is formally identified but γ is poorly determined.


This may reflect weak instruments, little variation in the data, or a wrong model!
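• A sketch of a C-CAPM moment function that could be passed to a GMM routine; the variable names and data construction are assumptions, not part of the original application:

```python
import numpy as np

def ccapm_moments(theta, c_ratio, R_next, Z):
    """Moment contributions f_t = (delta * (c_{t+1}/c_t)^{-gamma} * R_{t+1} - 1) * z_t, shape T x R."""
    delta, gamma = theta
    u = delta * c_ratio ** (-gamma) * R_next - 1.0   # Euler-equation error
    return u[:, None] * Z

# g_T(theta) = ccapm_moments(theta, c_ratio, R_next, Z).mean(axis=0),
# where Z stacks e.g. a constant, lagged consumption growth c_t/c_{t-1} and the lagged return R_t.
```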

31 of 32

Results for US data, 1959:3 − 1978:12 (standard errors in parentheses).

Method        Lags   δ                 γ                  T     ξJ      DF   p-val
2-Step HC     1      0.9987 (0.0086)   0.8770 (3.6792)    237   0.434   1    0.510
Iterated HC   1      0.9982 (0.0044)   1.0249 (1.8614)    237   1.068   1    0.301
CU HC         1      0.9981 (0.0044)   0.9549 (1.8629)    237   1.067   1    0.302
2-Step HAC    1      0.9987 (0.0092)   0.8876 (4.0228)    237   0.429   1    0.513
Iterated HAC  1      0.9980 (0.0045)   0.8472 (1.8757)    237   1.091   1    0.296
CU HAC        1      0.9977 (0.0045)   0.7093 (1.8815)    237   1.086   1    0.297
2-Step HC     2      0.9975 (0.0066)   0.0149 (2.6415)    236   1.597   3    0.660
Iterated HC   2      0.9968 (0.0045)   −0.0210 (1.7925)   236   3.579   3    0.311
CU HC         2      0.9958 (0.0046)   −0.5526 (1.8267)   236   3.501   3    0.321
2-Step HAC    2      0.9970 (0.0068)   −0.1872 (2.7476)   236   1.672   3    0.643
Iterated HAC  2      0.9965 (0.0047)   −0.2443 (1.8571)   236   3.685   3    0.298
CU HAC        2      0.9952 (0.0048)   −0.9094 (1.9108)   236   3.591   3    0.309
32 of 32
