You are on page 1of 111

Contents

1 Introduction 5
1.1 Means and Standard Deviation . . . . . . . . . . . . . . . . . . . . . 5
1.2 Testing Sample Means . . . . . . . . . . . . . . . . . . . . . . . . . 6
Lecture Notes for Econometrics 2002 (first year 1.3 Covariance and Correlation . . . . . . . . . . . . . . . . . . . . . . . 8
1.4 Least Squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
PhD course in Stockholm) 1.5 Maximum Likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.6 The Distribution of β̂ . . . . . . . . . . . . . . . . . . . . . . . . . . 11
Paul Söderlind1 1.7 Diagnostic Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.8 Testing Hypotheses about β̂ . . . . . . . . . . . . . . . . . . . . . . 13
June 2002 (some typos corrected and some material added later)
A Practical Matters 14

B A CLT in Action 16

2 Univariate Time Series Analysis 19


2.1 Theoretical Background to Time Series Processes . . . . . . . . . . . 19
2.2 Estimation of Autocovariances . . . . . . . . . . . . . . . . . . . . . 20
2.3 White Noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.4 Moving Average . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.5 Autoregression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.6 ARMA Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.7 Non-stationary Processes . . . . . . . . . . . . . . . . . . . . . . . . 34

3 The Distribution of a Sample Average 42


3.1 Variance of a Sample Average . . . . . . . . . . . . . . . . . . . . . 42
1
University of St. Gallen and CEPR. Address: s/bf-HSG, Rosenbergstrasse 52, CH-9000 St. 3.2 The Newey-West Estimator . . . . . . . . . . . . . . . . . . . . . . . 46
Gallen, Switzerland. E-mail: Paul.Soderlind@unisg.ch. Document name: EcmAll.TeX.

1
3.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 7.9 GMM with Sub-Optimal Weighting Matrix∗ . . . . . . . . . . . . . . 111
7.10 GMM without a Loss Function∗ . . . . . . . . . . . . . . . . . . . . 113
4 Least Squares 50 7.11 Simulated Moments Estimator∗ . . . . . . . . . . . . . . . . . . . . . 113
4.1 Definition of the LS Estimator . . . . . . . . . . . . . . . . . . . . . 50
4.2 LS and R 2 ∗ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 8 Examples and Applications of GMM 116
4.3 Finite Sample Properties of LS . . . . . . . . . . . . . . . . . . . . . 54 8.1 GMM and Classical Econometrics: Examples . . . . . . . . . . . . . 116
4.4 Consistency of LS . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 8.2 Identification of Systems of Simultaneous Equations . . . . . . . . . 120
4.5 Asymptotic Normality of LS . . . . . . . . . . . . . . . . . . . . . . 57 8.3 Testing for Autocorrelation . . . . . . . . . . . . . . . . . . . . . . . 123
4.6 Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 8.4 Estimating and Testing a Normal Distribution . . . . . . . . . . . . . 127
4.7 Diagnostic Tests of Autocorrelation, Heteroskedasticity, and Normality∗ 64 8.5 Testing the Implications of an RBC Model . . . . . . . . . . . . . . . 130
8.6 IV on a System of Equations∗ . . . . . . . . . . . . . . . . . . . . . 132
5 Instrumental Variable Method 71
5.1 Consistency of Least Squares or Not? . . . . . . . . . . . . . . . . . 71 11 Vector Autoregression (VAR) 134
5.2 Reason 1 for IV: Measurement Errors . . . . . . . . . . . . . . . . . 71 11.1 Canonical Form . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
5.3 Reason 2 for IV: Simultaneous Equations Bias (and Inconsistency) . . 73 11.2 Moving Average Form and Stability . . . . . . . . . . . . . . . . . . 135
5.4 Definition of the IV Estimator—Consistency of IV . . . . . . . . . . 77 11.3 Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
5.5 Hausman’s Specification Test∗ . . . . . . . . . . . . . . . . . . . . . 83 11.4 Granger Causality . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
5.6 Tests of Overidentifying Restrictions in 2SLS∗ . . . . . . . . . . . . 84 11.5 Forecasts Forecast Error Variance . . . . . . . . . . . . . . . . . . . 139
11.6 Forecast Error Variance Decompositions∗ . . . . . . . . . . . . . . . 140
6 Simulating the Finite Sample Properties 86 11.7 Structural VARs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
6.1 Monte Carlo Simulations in the Simplest Case . . . . . . . . . . . . . 86 11.8 Cointegration, Common Trends, and Identification via Long-Run Restrictions∗ 151
6.2 Monte Carlo Simulations in More Complicated Cases∗ . . . . . . . . 88
6.3 Bootstrapping in the Simplest Case . . . . . . . . . . . . . . . . . . . 90 12 Kalman filter 158
6.4 Bootstrapping in More Complicated Cases∗ . . . . . . . . . . . . . . 90 12.1 Conditional Expectations in a Multivariate Normal Distribution . . . . 158
12.2 Kalman Recursions . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
7 GMM 94
7.1 Method of Moments . . . . . . . . . . . . . . . . . . . . . . . . . . 94 13 Outliers and Robust Estimators 165
7.2 Generalized Method of Moments . . . . . . . . . . . . . . . . . . . . 95 13.1 Influential Observations and Standardized Residuals . . . . . . . . . . 165
7.3 Moment Conditions in GMM . . . . . . . . . . . . . . . . . . . . . . 95 13.2 Recursive Residuals∗ . . . . . . . . . . . . . . . . . . . . . . . . . . 166
7.4 The Optimization Problem in GMM . . . . . . . . . . . . . . . . . . 98 13.3 Robust Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
7.5 Asymptotic Properties of GMM . . . . . . . . . . . . . . . . . . . . 102 13.4 Multicollinearity∗ . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
7.6 Summary of GMM . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
7.7 Efficient GMM and Its Feasible Implementation . . . . . . . . . . . . 108 14 Generalized Least Squares 171
7.8 Testing in GMM . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 14.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
14.2 GLS as Maximum Likelihood . . . . . . . . . . . . . . . . . . . . . 172

2 3
14.3 GLS as a Transformed LS . . . . . . . . . . . . . . . . . . . . . . . 175
14.4 Feasible GLS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175

15 Kernel Density Estimation and Regression 177 1 Introduction


15.1 Non-Parametric Regression . . . . . . . . . . . . . . . . . . . . . . . 177
15.2 Estimating and Testing Distributions . . . . . . . . . . . . . . . . . . 182 1.1 Means and Standard Deviation
21 Some Statistics 188 The mean and variance of a series are estimated as
21.1 Distributions and Moment Generating Functions . . . . . . . . . . . . 188 PT PT
21.2 Joint and Conditional Distributions and Moments . . . . . . . . . . . 189 x̄ = t=1 x t /T and σ̂ 2 = t=1 (x t − x̄)2 /T. (1.1)
21.3 Convergence in Probability, Mean Square, and Distribution . . . . . . 192
The standard deviation (here denoted Std(xt )), the square root of the variance, is the most
21.4 Laws of Large Numbers and Central Limit Theorems . . . . . . . . . 194
common measure of volatility.
21.5 Stationarity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
The mean and standard deviation are often estimated on rolling data windows (for
21.6 Martingales . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
instance, a “Bollinger band” is ±2 standard deviations from a moving data window around
21.7 Special Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . 196
a moving average—sometimes used in analysis of financial prices.)
21.8 Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
If xt is iid (independently and identically distributed), then it is straightforward to find
22 Some Facts about Matrices 205 the variance of the sample average. Then, note that
22.1 Rank . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205 P
T
 P
T
Var t=1 x t /T = t=1 Var (x t /T )
22.2 Vector Norms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
22.3 Systems of Linear Equations and Matrix Inverses . . . . . . . . . . . 205 = T Var (xt ) /T 2
22.4 Complex matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208 = Var (xt ) /T. (1.2)
22.5 Eigenvalues and Eigenvectors . . . . . . . . . . . . . . . . . . . . . . 208
The first equality follows from the assumption that xt and xs are independently distributed
22.6 Special Forms of Matrices . . . . . . . . . . . . . . . . . . . . . . . 209
(so the covariance is zero). The second equality follows from the assumption that xt and
22.7 Matrix Decompositions . . . . . . . . . . . . . . . . . . . . . . . . . 211
xs are identically distributed (so their variances are the same). The third equality is a
22.8 Matrix Calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
trivial simplification.
22.9 Miscellaneous . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
A sample average is (typically) unbiased, that is, the expected value of the sample
average equals the population mean. To illustrate that, consider the expected value of the
sample average of the iid xt
PT PT
E t=1 x t /T = t=1 E x t /T
= E xt . (1.3)

The first equality is always true (the expectation of a sum is the sum of expectations), and

4 5
a. Distribution of sample average b. Distribution of √T times sample average hypothesis if the sample average is far out in one the tails of the distribution. A traditional
3 0.4 two-tailed test amounts to rejecting the null hypothesis at the 10% significance level if
T=5 T=5
T=25 T=25 the test statistics is so far out that there is only 5% probability mass further out in that
2
T=50 T=50 tail (and another 5% in the other tail). The interpretation is that if the null hypothesis is
T=100 0.2 T=100
1 actually true, then there would only be a 10% chance of getting such an extreme (positive
or negative) sample average—and these 10% are considered so low that we say that the
0 0
−2 0 2 −5 0 5 null is probably wrong.
Sample average √T times sample average
Figure 1.1: Sampling
√ distributions. This figure shows the distribution of the sample
mean and of T times the sample mean of the random variable z t − 1 where z t ∼ χ 2 (1).
Figure 1.2: Density function of normal distribution with shaded 5% tails.

the second equality follows from the assumption of identical distributions which implies See Figure 1.2 for some examples or normal distributions. recall that in a normal
identical expectations. distribution, the interval ±1 standard deviation around the mean contains 68% of the
probability mass; ±1.65 standard deviations contains 90%; and ±2 standard deviations
1.2 Testing Sample Means contains 95%.
In practice, the test of a sample mean is done by “standardizing” the sampe mean so
The law of large numbers (LLN) says that the sample mean converges to the true popula- that it can be compared with a standard N (0, 1) distribution. The logic of this is as follows
tion mean as the sample size goes to infinity. This holds for a very large class of random
variables, but there are exceptions. A sufficient (but not necessary) condition for this con- Pr(x̄ ≥ 2.7) = Pr(x̄ − µ ≥ 2.7 − µ) (1.4)
x̄ − µ 2.7 − µ
 
vergence is that the sample average is unbiased (as in (1.3)) and that the variance goes to
= Pr ≥ . (1.5)
zero as the sample size goes to infinity (as in (1.2)). (This is also called convergence in h h
mean square.) To see the LLN in action, see Figure 1.1.
√ If x̄ ∼ N (µ, h 2 ), then (x̄ − µ)/ h ∼ N (0, 1), so the probability of x̄ ≥ 2.7 can be calcu-
The central limit theorem (CLT) says that T x̄ converges in distribution to a normal
lated by calculating how much probability mass of the standard normal density function
distribution as the sample size increases. See Figure 1.1 for an illustration. This also
there is above (2.7 − µ)/ h.
holds for a large class of random variables—and it is a very useful result since it allows
To construct a two-tailed test, we also need.the probability that x̄ is above some num-
us to test hypothesis. Most estimators (including LS and other methods) are effectively
ber. This number is chosen to make the two-tailed tst symmetric, that is, so that there is as
some kind of sample average, so the CLT can be applied.
much probability mass below lower number (lower tail) as above the upper number (up-
The basic approach in testing a hypothesis (the “null hypothesis”), is to compare the
per tail). With a normal distribution (or, for that matter, any symmetric distribution) this
test statistics (the sample average, say) with how the distribution of that statistics (which
is done as follows. Note that (x̄ − µ)/ h ∼ N (0, 1) is symmetric around 0. This means
is a random number since the sample is finite) would look like if the null hypothesis is
that the probability of being above some number, (C − µ)/ h, must equal the probability
true. For instance, suppose the null hypothesis is that the population mean is µ Suppose
of being below −1 times the same number, or
also that we know that distribution of the sample mean is normal with a known variance
x̄ − µ C −µ C −µ x̄ − µ
   
h 2 (which will typically be estimated and then treated as if it was known). Under the null Pr ≥ = Pr − ≤ . (1.6)
h h h h
hypothesis, the sample average should then be N (µ, h 2 ). We would then reject the null

6 7
Compared with a traditional estimate of a correlation (1.8) we here impose that the stan-
dard deviation of xt and xt− p are the same (which typically does not make much of a
Figure 1.3: Power of two-sided test
difference).

A 10% critical value is the value of (C − µ)/ h that makes both these probabilities
equal to 5%—which happens to be 1.645. The easiest way to look up such critical values 1.4 Least Squares
is by looking at the normal cumulative distribution function—see Figure 1.2.
Consider the simplest linear model

1.3 Covariance and Correlation yt = xt β0 + u t , (1.11)

The covariance of two variables (here x and y) is typically estimated as where all variables are zero mean scalars and where β0 is the true value of the parameter
PT we want to estimate. The task is to use a sample {yt , xt }t=1
T to estimate β and to test
d (xt , z t ) = − x̄) (z t − z̄) /T.
Cov t=1 (x t (1.7) hypotheses about its value, for instance that β = 0.
If there were no movements in the unobserved errors, u t , in (1.11), then any sample
Note that this is a kind of sample average, so a CLT can be used.
would provide us with a perfect estimate of β. With errors, any estimate of β will still
leave us with some uncertainty about what the true value is. The two perhaps most impor-
tant issues in econometrics are how to construct a good estimator of β and how to assess
Figure 1.4: Example of correlations on an artificial sample. Both subfigures use the same
sample of y. the uncertainty about the true value.
For any possible estimate, β̂, we get a fitted residual
The correlation of two variables is then estimated as
û t = yt − xt β̂. (1.12)
d (xt , z t )
Cov
d (xt , z t ) =
Corr , (1.8)
c (xt ) Std
Std c (z t ) One appealing method of choosing β̂ is to minimize the part of the movements in yt that
c t ) is an estimated standard deviation. A correlation must be between −1 and 1 we cannot explain by xt β̂, that is, to minimize the movements in û t . There are several
where Std(x
candidates for how to measure the “movements,” but the most common is by the mean of
(try to show it). Note that covariance and correlation measure the degree of linear relation
squared errors, that is, 6t=1
T û 2 /T . We will later look at estimators where we instead use
t
only. This is illustrated in Figure 1.4.
6t=1 û t /T .
T

The pth autocovariance of x is estimated by
With the sum or mean of squared errors as the loss function, the optimization problem
d xt , xt− p = T (xt − x̄) xt− p − x̄ /T,
 P 
Cov t=1 (1.9) T
1X
min (yt − xt β)2 (1.13)
where we use the same estimated (using all data) mean in both places. Similarly, the pth β T
t=1
autocorrelation is estimated as
d xt , xt− p

 Cov
Corr xt , xt− p =
d . (1.10)
c (xt )2
Std

8 9
has the first order condition that the derivative should be zero as the optimal estimate β̂ 1.6 The Distribution of β̂
T
1X  Equation (1.15) will give different values of β̂ when we use different samples, that is

xt yt − xt β̂ = 0, (1.14)
T different draws of the random variables u t , xt , and yt . Since the true value, β0 , is a fixed
t=1
constant, this distribution describes the uncertainty we should have about the true value
which we can solve for β̂ as
after having obtained a specific estimated value.
T
!−1 T To understand the distribution of β̂, use (1.11) in (1.15) to substitute for yt
1X 2 1X
β̂ = xt xt yt , or (1.15) !−1
T T T T
t=1 t=1 1X 2 1X
d (xt , yt ) ,
c (xt )−1 Cov β̂ = xt xt (xt β0 + u t )
= Var (1.16) T T
t=1 t=1
T
!−1 T
where a hat indicates a sample estimate. This is the Least Squares (LS) estimator. 1X 2 1X
= β0 + xt xt u t , (1.19)
T T
t=1 t=1

1.5 Maximum Likelihood where β0 is the true value.


The first conclusion from (1.19) is that, with u t = 0 the estimate would always be
A different route to arrive at an estimator is to maximize the likelihood function. If u t in
perfect — and with large movements in u t we will see large movements in β̂. The second
(1.11) is iid N 0, σ 2 , then the probability density function of u t is

conclusion is that not even a strong opinion about the distribution of u t , for instance that
1 u t is iid N 0, σ 2 , is enough to tell us the whole story about the distribution of β̂. The
h  i 
pdf (u t ) = √ exp −u 2t / 2σ 2 . (1.17)
2πσ 2 reason is that deviations of β̂ from β0 are a function of xt u t , not just of u t . Of course,
Since the errors are independent, we get the joint pdf of the u 1 , u 2 , . . . , u T by multiplying when xt are a set of deterministic variables which will always be the same irrespective
the marginal pdfs of each of the errors. Then substitute yt − xt β for u t (the derivative of of which sample we use, then β̂ − β0 is a time invariant linear function of u t , so the
the transformation is unity) and take logs to get the log likelihood function of the sample distribution of u t carries over to the distribution of β̂. This is probably an unrealistic case,
which forces us to look elsewhere to understand the properties of β̂.
T
T T   1X There are two main routes to learn more about the distribution of β̂: (i) set up a small
ln L = − ln (2π ) − ln σ 2 − (yt − xt β)2 /σ 2 . (1.18)
2 2 2 “experiment” in the computer and simulate the distribution or (ii) use the asymptotic
t=1
distribution as an approximation. The asymptotic distribution can often be derived, in
This likelihood function is maximized by minimizing the last term, which is propor-
contrast to the exact distribution in a sample of a given size. If the actual sample is large,
tional to the sum of squared errors - just like in (1.13): LS is ML when the errors are iid
then the asymptotic distribution may be a good approximation.
normally distributed. PT PT
A law of large numbers would (in most cases) say that both t=1 xt2 /T and t=1 xt u t /T
Maximum likelihood estimators have very nice properties, provided the basic dis-
in (1.19) converge to their expected values as T → ∞. The reason is that both are sam-
tributional assumptions are correct. If they are, then MLE are typically the most effi-
ple averages of random variables (clearly, both xt2 and xt u t are random variables). These
cient/precise estimators, at least asymptotically. ML also provides a coherent framework
expected values are Var(xt ) and Cov(xt , u t ), respectively (recall both xt and u t have zero
for testing hypotheses (including the Wald, LM, and LR tests).
means). The key to show that β̂ is consistent, that is, has a probability limit equal to β0 ,

10 11
is that Cov(xt , u t ) = 0. This highlights the importance of using good theory to derive not a. Pdf of N(0,1) b. Pdf of Chi−square(n)
only the systematic part of (1.11), but also in understanding the properties of the errors. 0.4 1
n=1
For instance, when theory tells us that yt and xt affect each other (as prices and quanti- n=2
ties typically do), then the errors are likely to be correlated with the regressors - and LS 0.2 0.5
n=5

is inconsistent. One common way to get around that is to use an instrumental variables
technique. More about that later. Consistency is a feature we want from most estimators,
0 0
since it says that we would at least get it right if we had enough data. −2 0 2 0 5 10
Suppose that β̂ is consistent. Can we say anything more about the asymptotic dis- x x

tribution? Well, the distribution of β̂ converges to a spike with all the mass at β0 , but Figure 1.5: Probability density functions
√ √ 
the distribution of T β̂, or T β̂ − β0 , will typically converge to a non-trivial normal
distribution. To see why, note from (1.19) that we can write 1.8 Testing Hypotheses about β̂

T
!−1 √ T Suppose we now assume that the asymptotic distribution of β̂ is such that
√   1X 2 T X
T β̂ − β0 = xt xt u t . (1.20) √ 
T T
 d  
t=1 t=1 T β̂ − β0 → N 0, v 2 or (1.21)
The first term on the right hand side will typically converge to the inverse of Var(xt ), as
√ We could then test hypotheses about β̂ as for any other random variable. For instance,
discussed earlier. The second term is T times a sample average (of the random variable
consider the hypothesis that β0 = 0. If this is true, then
xt u t ) with a zero expected value, since we assumed that β̂ is consistent. Under weak
√ √  √ 
conditions, a central limit theorem applies so T times a sample average converges to Pr T β̂/v < −2 = Pr T β̂/v > 2 ≈ 0.025, (1.22)

a normal distribution. This shows that T β̂ has an asymptotic normal distribution. It
turns out that this is a property of many estimators, basically because most estimators are which says that there is only a 2.5% chance that a random sample will deliver a value of

some kind of sample average. For an example of a central limit theorem in action, see T β̂/v less than -2 and also a 2.5% chance that a sample delivers a value larger than 2,
Appendix B assuming the true value is zero.
We then say that we reject the hypothesis that β0 = 0 at the 5% significance level

(95% confidence level) if the test statistics | T β̂/v| is larger than 2. The idea is that,
1.7 Diagnostic Tests
if the hypothesis is true (β0 = 0), then this decision rule gives the wrong decision in

Exactly what the variance of T (β̂ − β0 ) is, and how it should be estimated, depends 5% of the cases. That is, 5% of all possible random samples will make us reject a true
mostly on the properties of the errors. This is one of the main reasons for diagnostic tests. hypothesis. Note, however, that this test can only be taken to be an approximation since it
The most common tests are for homoskedastic errors (equal variances of u t and u t−s ) and relies on the asymptotic distribution, which is an approximation of the true (and typically
no autocorrelation (no correlation of u t and u t−s ). unknown) distribution.

When ML is used, it is common to investigate if the fitted errors satisfy the basic The natural interpretation of a really large test statistics, | T β̂/v| = 3 say, is that
assumptions, for instance, of normality. it is very unlikely that this sample could have been drawn from a distribution where the
hypothesis β0 = 0 is true. We therefore choose to reject the hypothesis. We also hope that
the decision rule we use will indeed make us reject false hypothesis more often than we

12 13
reject true hypothesis. For instance, we want the decision rule discussed above to reject 5. Verbeek (2000), A Guide to Modern Econometrics (general, easy, good applica-
β0 = 0 more often when β0 = 1 than when β0 = 0. tions)
There is clearly nothing sacred about the 5% significance level. It is just a matter of
6. Davidson and MacKinnon (1993), Estimation and Inference in Econometrics (gen-
convention that the 5% and 10% are the most widely used. However, it is not uncommon
eral, a bit advanced)
to use the 1% or the 20%. Clearly, the lower the significance level, the harder it is to reject
a null hypothesis. At the 1% level it often turns out that almost no reasonable hypothesis 7. Ruud (2000), Introduction to Classical Econometric Theory (general, consistent
can be rejected. projection approach, careful)
The t-test described above works only if the null hypothesis contains a single restric-
8. Davidson (2000), Econometric Theory (econometrics/time series, LSE approach)
tion. We have to use another approach whenever we want to test several restrictions
jointly. The perhaps most common approach is a Wald test. To illustrate the idea, suppose 9. Mittelhammer, Judge, and Miller (2000), Econometric Foundations (general, ad-
√ d
β is an m × 1 vector and that T β̂ → N (0, V ) under the null hypothesis , where V is a vanced)
covariance matrix. We then know that
10. Patterson (2000), An Introduction to Applied Econometrics (econometrics/time se-
√ √ d
T β̂ V
0 −1
T β̂ → χ (m) .
2
(1.23) ries, LSE approach with applications)

11. Judge et al (1985), Theory and Practice of Econometrics (general, a bit old)
The decision rule is then that if the left hand side of (1.23) is larger that the 5%, say,
critical value of the χ 2 (m) distribution, then we reject the hypothesis that all elements in 12. Hamilton (1994), Time Series Analysis
β are zero.
13. Spanos (1986), Statistical Foundations of Econometric Modelling, Cambridge Uni-
versity Press (general econometrics, LSE approach)
A Practical Matters
14. Harvey (1981), Time Series Models, Philip Allan
A.0.1 Software
15. Harvey (1989), Forecasting, Structural Time Series... (structural time series, Kalman
• Gauss, MatLab, RATS, Eviews, Stata, PC-Give, Micro-Fit, TSP, SAS filter).

• Software reviews in The Economic Journal and Journal of Applied Econometrics 16. Lütkepohl (1993), Introduction to Multiple Time Series Analysis (time series, VAR
models)
A.0.2 Useful Econometrics Literature
17. Priestley (1981), Spectral Analysis and Time Series (advanced time series)
1. Greene (2000), Econometric Analysis (general)
18. Amemiya (1985), Advanced Econometrics, (asymptotic theory, non-linear econo-
2. Hayashi (2000), Econometrics (general) metrics)

3. Johnston and DiNardo (1997), Econometric Methods (general, fairly easy) 19. Silverman (1986), Density Estimation for Statistics and Data Analysis (density es-
timation).
4. Pindyck and Rubinfeld (1997), Econometric Models and Economic Forecasts (gen-
eral, easy) 20. Härdle (1990), Applied Nonparametric Regression

14 15
B A CLT in Action These distributions are shown in Figure 1.1. It is clear that f (z̄ 1 ) converges to a spike
at zero as the sample size increases, while f (z̄ 2 ) converges to a (non-trivial) normal
This is an example of how we can calculate the limiting distribution of a sample average. distribution.

Remark 1 If T (x̄ − µ)/σ ∼ N (0, 1) then x̄ ∼ N (µ, σ 2 /T ). √
Example 6 (Distribution of 6t=1 T (z − 1) /T and T 6 T (z − 1) /T when z ∼ χ 2 (1).)
t t=1 t t

Example 2 (Distribution of 6t=1 T (z − 1) /T and T 6 T (z − 1) /T when z ∼ χ 2 (1).)
t t=1 t t When z t is iid χ 2 (1), then 6t=1
T z is χ 2 (T ), that is, has the probability density function
t
When z t is iid χ 2 (1), then 6t=1T z is distributed as a χ 2 (T ) variable with pdf f (). We
t T
1 T /2−1
f 6t=1
T
6 T zt T
z t /2 .
 
now construct a new variable by transforming 6t=1 T z as to a sample mean around one zt = exp −6t=1
t
2T /2 0 (T /2) t=1
(the mean of z t )
z̄ 1 = 6t=1
T
z t /T − 1 = 6t=1
T
(z t − 1) /T. We transform this distribution by first subtracting one from z t (to remove the mean) and

then by dividing by T or T . This gives the distributions of the sample mean, z̄ 1 =
Clearly, the inverse function is 6t=1
T z = T z̄ + T , so by the “change of variable” rule √
t 1
6t=1
T (z − 1) /T , and scaled sample mean, z̄ =
t 2 T z̄ 1 as
we get the pdf of z̄ 1 as
g(z̄ 1 ) = f T (T z̄ 1 + T ) T. 1
f (z̄ 1 ) = y T /2−1 exp (−y/2) with y = T z̄ 1 + T , and
2T /2 0 (T /2)
Example 3 Continuing the previous example, we now consider the random variable 1 √
f (z̄ 2 ) = T /2 y T /2−1 exp (−y/2) with y = T z̄ 1 + T .
√ 2 0 (T /2)
z̄ 2 = T z̄ 1 ,
These distributions are shown in Figure 1.1. It is clear that f (z̄ 1 ) converges to a spike

with inverse function z̄ 1 = z̄ 2 / T . By applying the “change of variable” rule again, we at zero as the sample size increases, while f (z̄ 2 ) converges to a (non-trivial) normal
get the pdf of z̄ 2 as distribution.
√ √ √ √
h (z̄ 2 ) = g(z̄ 2 / T )/ T = f T T z̄ 2 + T T.
Bibliography
Example 4 When z t is iid χ 2 (1), then 6t=1
T z is χ 2 (T ), which we denote f (6 T z ). We
t t=1 t
now construct two new variables by transforming 6t=1 T z
t Amemiya, T., 1985, Advanced Econometrics, Harvard University Press, Cambridge,
Massachusetts.
z̄ 1 = 6t=1
T
z t /T − 1 = 6t=1
T
(z t − 1) /T , and
√ Davidson, J., 2000, Econometric Theory, Blackwell Publishers, Oxford.
z̄ 2 = T z̄ 1 .

Example 5 We transform this distribution by first subtracting one from z t (to remove the Davidson, R., and J. G. MacKinnon, 1993, Estimation and Inference in Econometrics,
√ Oxford University Press, Oxford.
mean) and then by dividing by T or T . This gives the distributions of the sample mean

and scaled sample mean, z̄ 2 = T z̄ 1 as Greene, W. H., 2000, Econometric Analysis, Prentice-Hall, Upper Saddle River, New
1 Jersey, 4th edn.
f (z̄ 1 ) = T /2 y T /2−1 exp (−y/2) with y = T z̄ 1 + T , and
2 0 (T /2)
1 √ Hamilton, J. D., 1994, Time Series Analysis, Princeton University Press, Princeton.
f (z̄ 2 ) = T /2 y T /2−1 exp (−y/2) with y = T z̄ 1 + T .
2 0 (T /2)

16 17
Härdle, W., 1990, Applied Nonparametric Regression, Cambridge University Press, Cam-
bridge.

Harvey, A. C., 1989, Forecasting, Structural Time Series Models and the Kalman Filter, 2 Univariate Time Series Analysis
Cambridge University Press.
Reference: Greene (2000) 13.1-3 and 18.1-3
Hayashi, F., 2000, Econometrics, Princeton University Press.
Additional references: Hayashi (2000) 6.2-4; Verbeek (2000) 8-9; Hamilton (1994); John-
Johnston, J., and J. DiNardo, 1997, Econometric Methods, McGraw-Hill, New York, 4th ston and DiNardo (1997) 7; and Pindyck and Rubinfeld (1997) 16-18
edn.

Lütkepohl, H., 1993, Introduction to Multiple Time Series, Springer-Verlag, 2nd edn.
2.1 Theoretical Background to Time Series Processes

Mittelhammer, R. C., G. J. Judge, and D. J. Miller, 2000, Econometric Foundations, Cam- Suppose we have a sample of T observations of a random variable
bridge University Press, Cambridge.  i T
yt t=1 = y1i , y2i , ..., yTi ,


Patterson, K., 2000, An Introduction to Applied Econometrics: A Time Series Approach,


where subscripts indicate time periods. The superscripts indicate that this sample is from
MacMillan Press, London.
planet (realization) i. We could imagine a continuum of parallel planets where the same
Pindyck, R. S., and D. L. Rubinfeld, 1997, Econometric Models and Economic Forecasts, time series process has generated different samples with T different numbers (different
Irwin McGraw-Hill, Boston, Massachusetts, 4ed edn. realizations).
Consider period t. The distribution of yt across the (infinite number of) planets has
Priestley, M. B., 1981, Spectral Analysis and Time Series, Academic Press.
some density function, f t (yt ). The mean of this distribution
Ruud, P. A., 2000, An Introduction to Classical Econometric Theory, Oxford University Z ∞
Press. Eyt = yt f t (yt ) dyt (2.1)
−∞

Silverman, B. W., 1986, Density Estimation for Statistics and Data Analysis, Chapman is the expected value of the value in period t, also called the unconditional mean of yt .
and Hall, London. Note that Eyt could be different from Eyt+s . The unconditional variance is defined simi-
larly.
Verbeek, M., 2000, A Guide to Modern Econometrics, Wiley, Chichester.
, yti .
 i
Now consider periods t and t − s jointly. On planet i we have the pair yt−s
The bivariate distribution of these pairs, across the planets, has some density function
gt−s,t (yt−s , yt ).1 Calculate the covariance between yt−s and yt as usual
Z ∞Z ∞
Cov (yt−s , yt ) = (yt−s − Eyt−s ) (yt − Eyt ) gt−s,t (yt−s , yt ) dyt dyt−s (2.2)
−∞ −∞
= E (yt−s − Eyt−s ) (yt − Eyt ) . (2.3)
R∞
1 The relation between f t (yt ) and gt−s,t (yt−s , yt ) is, as usual, f t (yt ) = −∞ gt−s,t (yt−s , yt ) dyt−s .

18 19
This is the s th autocovariance of yt . (Of course, s = 0 or s < 0 are allowed.) One sample from an AR(1) with corr=0.85 Histogram, obs 1−20
A stochastic process is covariance stationary if 5

Eyt = µ is independent of t, (2.4) 0.2


0
Cov (yt−s , yt ) = γs depends only on s, and (2.5) 0.1
both µ and γs are finite. (2.6) −5
0
0 500 1000 −5 0 5
Most of these notes are about covariance stationary processes, but Section 2.7 is about period
non-stationary processes.
Humanity has so far only discovered one planet with coin flipping; any attempt to
Histogram, obs 1−1000 Mean and Std over longer and longer samples
estimate the moments of a time series process must therefore be based on the realization
4 Mean
of the stochastic process from planet earth only. This is meaningful only if the process is
0.2 Std
ergodic for the moment you want to estimate. A covariance stationary process is said to 2
be ergodic for the mean if 0.1 0
T
1X
plim yt = Eyt , (2.7) −2
T 0
t=1 −5 0 5 0 500 1000
so the sample mean converges in probability to the unconditional mean. A sufficient sample length

condition for ergodicity for the mean is



X Figure 2.1: Sample of one realization of yt = 0.85yt−1 + εt with y0 = 4 and Std(εt ) = 1.
|Cov (yt−s , yt )| < ∞. (2.8)
s=0 Note that R (s) does not have to be symmetric unless s = 0. However, note that R (s) =
This means that the link between the values in t and t − s goes to zero sufficiently fast R (−s)0 . This follows from noting that
as s increases (you may think of this as getting independent observations before we reach
R (−s) = E (yt − Eyt ) (yt+s − Eyt+s )0
the limit). If yt is normally distributed, then (2.8) is also sufficient for the process to be
ergodic for all moments, not just the mean. Figure 2.1 illustrates how a longer and longer = E (yt−s − Eyt−s ) (yt − Eyt )0 , (2.10a)
sample (of one realization of the same time series process) gets closer and closer to the
where we have simply changed time subscripts and exploited the fact that yt is covariance
unconditional distribution as the sample gets longer.
stationary. Transpose to get

2.2 Estimation of Autocovariances R (−s)0 = E (yt − Eyt ) (yt−s − Eyt−s )0 , (2.11)

Let yt be a vector of a covariance stationary and ergodic. The sth covariance matrix is which is the same as in (2.9). If yt is a scalar, then R (s) = R (−s), which shows that
autocovariances are symmetric around s = 0.
R (s) = E (yt − Eyt ) (yt−s − Eyt−s )0 . (2.9)

20 21
Example 1 (Bivariate case.) Let yt = [xt , z t ]0 with Ext =Ez t = 0. Then 2.3 White Noise
" #
xt h i
A white noise time process has
R̂ (s) = E xt−s z t−s
zt
" # Eεt = 0
Cov (xt , xt−s ) Cov (xt , z t−s )
= . Var (εt ) = σ 2 , and
Cov (z t , xt−s ) Cov (z t , xt−s )
Cov (εt−s , εt ) = 0 if s 6= 0. (2.15)
Note that R (−s) is
" # If, in addition, εt is normally distributed, then it is said to be Gaussian white noise. The
Cov (xt , xt+s ) Cov (xt , z t+s )
R (−s) = conditions in (2.4)-(2.6) are satisfied so this process is covariance stationary. Moreover,
Cov (z t , xt+s ) Cov (z t , xt+s )
" # (2.8) is also satisfied, so the process is ergodic for the mean (and all moments if εt is
Cov (xt−s , xt ) Cov (xt−s , z t )
= , normally distributed).
Cov (z t−s , xt ) Cov (z t−s , xt )

which is indeed the transpose of R (s). 2.4 Moving Average


The autocovariances of the (vector) yt process can be estimated as A qth -order moving average process is
T
1 X yt = εt + θ1 εt−1 + ... + θq εt−q , (2.16)
R̂ (s) = (yt − ȳ) (yt−s − ȳ)0 , (2.12)
T
t=1+s
T where the innovation εt is white noise (usually Gaussian). We could also allow both yt
1
and εt to be vectors; such a process it called a vector MA (VMA).
X
with ȳ = yt . (2.13)
T
t=1 We have Eyt = 0 and
(We typically divide by T in even if we have only T −s full observations to estimate R (s)
Var (yt ) = E εt + θ1 εt−1 + ... + θq εt−q εt + θ1 εt−1 + ... + θq εt−q
 
from.)  
Autocorrelations are then estimated by dividing the diagonal elements in R̂ (s) by the = σ 2 1 + θ12 + ... + θq2 . (2.17)
diagonal elements in R̂ (0)
Autocovariances are calculated similarly, and it should be noted that autocovariances of
ρ̂ (s) = diag R̂ (s) /diag R̂ (0) (element by element). (2.14) order q + 1 and higher are always zero for an MA(q) process.

Example 2 The mean of an MA(1), yt = εt + θ1 εt−1 , is zero since the mean of εt (and
εt−1 ) is zero. The first three autocovariance are
 
Var (yt ) = E (εt + θ1 εt−1 ) (εt + θ1 εt−1 ) = σ 2 1 + θ12

Cov (yt−1 , yt ) = E (εt−1 + θ1 εt−2 ) (εt + θ1 εt−1 ) = σ 2 θ1


Cov (yt−2 , yt ) = E (εt−2 + θ1 εt−3 ) (εt + θ1 εt−1 ) = 0, (2.18)

22 23
and Cov(yt−s , yt ) = 0 for |s| ≥ 2. Since both the mean and the covariances are finite The forecasts made in t = 2 then have the follow expressions—with an example using
and constant across t, the MA(1) is covariance stationary. Since the absolute value of θ1 = 2, ε1 = 3/4 and ε2 = 1/2 in the second column
the covariances sum to a finite number, the MA(1) is also ergodic for the mean. The first
General Example
autocorrelation of an MA(1) is
y2 = 1/2 + 2 × 3/4 = 2
θ1 E2 y3 = E2 (ε3 + θ1 ε2 ) = θ1 ε2 = 2 × 1/2 = 1
Corr (yt−1 , yt ) = .
1 + θ12 E2 y4 = E2 (ε4 + θ1 ε3 ) = 0 =0
Since the white noise process is covariance stationary, and since an MA(q) with m <
Example 4 (MA(1) and conditional variances.) From Example 3, the forecasting vari-
∞ is a finite order linear function of εt , it must be the case that the MA(q) is covariance
ances are—with the numerical example continued assuming that σ 2 = 1
stationary. It is ergodic for the mean since Cov(yt−s , yt ) = 0 for s > q, so (2.8) is
satisfied. As usual, Gaussian innovations are then sufficient for the MA(q) to be ergodic General Example
for all moments. Var(y2 − E2 y2 ) = 0 =0
The effect of εt on yt , yt+1 , ..., that is, the impulse response function, is the same as Var(y3 − E2 y3 ) = Var(ε3 + θ1 ε2 − θ1 ε2 ) = σ 2 =1
the MA coefficients Var(y4 − E2 y4 ) = Var (ε4 + θ1 ε3 ) = σ 2 + θ12 σ 2 =5
∂ yt ∂ yt+1 ∂ yt+q ∂ yt+q+k If the innovations are iid Gaussian, then the distribution of the s−period forecast error
= 1, = θ1 , ..., = θq , and = 0 for k > 0. (2.19)
∂εt ∂εt ∂εt ∂εt
This is easily seen from applying (2.16) yt − Et−s yt = εt + θ1 εt−1 + ... + θs−1 εt−(s−1)

yt = εt + θ1 εt−1 + ... + θq εt−q is h  i


(yt − Et−s yt ) ∼ N 0, σ 2 1 + θ12 + ... + θs−1
2
, (2.21)
yt+1 = εt+1 + θ1 εt + ... + θq εt−q+1
.. since εt , εt−1 , ..., εt−(s−1) are independent Gaussian random variables. This implies that
.
the conditional distribution of yt , conditional on {εw }sw=−∞ , is
yt+q = εt+q + θ1 εt−1+q + ... + θq εt
yt | {εt−s , εt−s−1 , . . .} ∼ N Et−s yt , Var(yt − Et−s yt )
 
yt+q+1 = εt+q+1 + θ1 εt+q + ... + θq εt+1 . (2.22)
h  i
∼ N θs εt−s + ... + θq εt−q , σ 1 + θ1 + ... + θs−1 . (2.23)
2 2 2
The expected value of yt , conditional on {εw }t−s
w=−∞ is
The conditional mean is the point forecast and the variance is the variance of the forecast
Et−s yt = Et−s εt + θ1 εt−1 + ... + θs εt−s + ... + θq εt−q

error. Note that if s > q, then the conditional distribution coincides with the unconditional
= θs εt−s + ... + θq εt−q , (2.20) distribution since εt−s for s > q is of no help in forecasting yt .

since Et−s εt−(s−1) = . . . = Et−s εt = 0.


Example 5 (MA(1) and convergence from conditional to unconditional distribution.) From
examples 3 and 4 we see that the conditional distributions change according to (where
Example 3 (Forecasting an MA(1).) Suppose the process is

yt = εt + θ1 εt−1 , with Var (εt ) = σ 2 .

24 25
2 indicates the information set in t = 2) Iterate backwards on (2.26)

General Example yt = A (Ayt−2 + εt−1 ) + εt


y2 | 2 ∼ N (y2 , 0) = N (2, 0) = A2 yt−2 + Aεt−1 + εt
y3 | 2 ∼ N (E2 y3 , Var(y3 − E2 y3 )) = N (1, 1) ..
.
y4 | 2 ∼ N (E2 y4 , Var(y4 − E2 y4 )) = N (0, 5)
K
X
Note that the distribution of y4 | 2 coincides with the asymptotic distribution. = A K +1 yt−K −1 + As εt−s . (2.27)
s=0
Estimation of MA processes is typically done by setting up the likelihood function Remark 7 (Spectral decomposition.) The n eigenvalues (λi ) and associated eigenvectors
and then using some numerical method to maximize it. (z i ) of the n × n matrix A satisfy

(A − λi In ) z i = 0n×1 .
2.5 Autoregression
If the eigenvectors are linearly independent, then
A p th -order autoregressive process is
λ1 0 · · · 0
 
yt = a1 yt−1 + a2 yt−2 + ... + a p yt− p + εt . (2.24)
0 λ2 · · · 0
  h i
A = Z 3Z −1 , where 3 =  .
 
 .. .. .  and Z = z 1 z 2 · · · zn
 . . · · · ..

A VAR( p) is just like the AR( p) in (2.24), but where yt is interpreted as a vector and ai 
as a matrix. 0 0 ··· λn

Example 6 (VAR(1) model.) A VAR(1) model is of the following form Note that we therefore get

A2 = A A = Z 3Z −1 Z 3Z −1 = Z 33Z −1 = Z 32 Z −1 ⇒ Aq = Z 3q Z −1 .
" # " #" # " #
y1t a11 a12 y1t−1 ε1t
= + .
y2t a21 a22 y2t−1 ε2t √
Remark 8 (Modulus of complex number.) If λ = a + bi, where i = −1, then |λ| =

All stationary AR( p) processes can be written on MA(∞) form by repeated substitu- |a + bi| = a 2 + b2 .
tion. To do so we rewrite the AR( p) as a first order vector autoregression, VAR(1). For
Take the limit of (2.27) as K → ∞. If lim K →∞ A K +1 yt−K −1 = 0, then we have
instance, an AR(2) xt = a1 xt−1 + a2 xt−2 + εt can be written as
a moving average representation of yt where the influence of the starting values vanishes
" # " #" # " #
xt a1 a2 xt−1 εt asymptotically

= + , or (2.25) X
xt−1 1 0 xt−2 0 yt = As εt−s . (2.28)
yt = Ayt−1 + εt , (2.26) s=0

We note from the spectral decompositions that A K +1 = Z 3 K +1 Z −1 , where Z is the ma-


where yt is an 2 × 1 vector and A a 4 × 4 matrix. This works also if xt and εt are vectors trix of eigenvectors and 3 a diagonal matrix with eigenvalues. Clearly, lim K →∞ A K +1 yt−K −1 =
and. In this case, we interpret ai as matrices and 1 as an identity matrix. 0 is satisfied if the eigenvalues of A are all less than one in modulus and yt−K −1 does not
grow without a bound.

26 27
Example 12 (Ergodicity of a stationary AR(1).) We know that Cov(yt , yt−s )= a |s| σ 2 / 1 − a 2 ,

Conditional moments of AR(1), y =4 Conditional distributions of AR(1), y =4
0 0
4 0.4 so the absolute value is
s=1  
s=3 |Cov(yt , yt−s )| = |a||s| σ 2 / 1 − a 2
Mean s=5
2 0.2 s=7
Variance
s=7 Using this in (2.8) gives
∞ ∞
0 0 X σ2 X s
0 10 20 −5 0 5 |Cov (yt−s , yt )| = |a|
Forecasting horizon x 1 − a2
s=0 s=0
σ2 1
= (since |a| < 1)
1 − a 2 1 − |a|
Figure 2.2: Conditional moments and distributions for different forecast horizons for the
AR(1) process yt = 0.85yt−1 + εt with y0 = 4 and Std(εt ) = 1. which is finite. The AR(1) is ergodic if |a| < 1.

Example 9 (AR(1).) For the univariate AR(1) yt = ayt−1 +εt , the characteristic equation Example 13 (Conditional distribution of AR(1).) For the AR(1) yt = ayt−1 + εt with
is (a − λ) z = 0, which is only satisfied if the eigenvalue is λ = a. The AR(1) is therefore εt ∼ N 0, σ 2 , we get


stable (and stationarity) if −1 < a < 1. This can also be seen directly by noting that
Et yt+s = a s yt ,
a K +1 yt−K −1 declines to zero if 0 < a < 1 as K increases.  
Var (yt+s − Et yt+s ) = 1 + a 2 + a 4 + ... + a 2(s−1) σ 2
Similarly, most finite order MA processes can be written (“inverted”) as AR(∞). It is
a 2s − 1 2
therefore common to approximate MA processes with AR processes, especially since the = σ .
a2 − 1
latter are much easier to estimate.
The distribution of yt+s conditional on yt is normal with these parameters. See Figure
Example 10 (Variance of AR(1).) From the MA-representation yt = ∞ s=0 a εt−s and
s
P
2.2 for an example.
the fact that εt is white noise we get Var(yt ) = σ 2 ∞ 2s = σ 2 / 1 − a 2 . Note that
P 
s=0 a
this is minimized at a = 0. The autocorrelations are obviously a |s| . The covariance 2.5.1 Estimation of an AR(1) Process
T
matrix of {yt }t=1 is therefore (standard deviation×standard deviation×autocorrelation) T
Suppose we have sample {yt }t=0 of a process which we know is an AR( p), yt = ayt−1 +
εt , with normally distributed innovations with unknown variance σ 2 .
 
1 a a2 · · · a T −1
· · · a T −2  The pdf of y1 conditional on y0 is
 
 a 1 a
σ2  
.
 a2 a 1 · · · a T −3  !
2
1−a  .
 1 (y1 − ay0 )2
 .. ... 
pdf (y1 |y0 ) = √ exp − , (2.29)
2σ 2

  2πσ 2
a T −1 a T −2 a T −3 ··· 1
and the pdf of y2 conditional on y1 and y0 is
Example 11 (Covariance stationarity of an AR(1) with |a| < 1.) From the MA-representation !
(y2 − ay1 )2
P∞ s
s=0 a εt−s , the expected value of yt is zero, since Eεt−s = 0. We know that
yt = 1
pdf (y2 | {y1 , y0 }) = √ exp − . (2.30)
Cov(yt , yt−s )= a |s| σ 2 / 1 − a 2 which is constant and finite.

2π σ 2 2σ 2

28 29
Recall that the joint and conditional pdfs of some variables z and x are related as 2.5.2 Lag Operators∗

pdf (x, z) = pdf (x|z) ∗ pdf (z) . (2.31) A common and convenient way of dealing with leads and lags is the lag operator, L. It is
such that
Applying this principle on (2.29) and (2.31) gives Ls yt = yt−s for all (integer) s.

pdf (y2 , y1 |y0 ) = pdf (y2 | {y1 , y0 }) pdf (y1 |y0 ) For instance, the ARMA(2,1) model
2 !
(y2 − ay1 )2 + (y1 − ay0 )2

1 yt − a1 yt−1 − a2 yt−2 = εt + θ1 εt−1 (2.35)
= √ exp − . (2.32)
2π σ 2 2σ 2
can be written as
Repeating this for the entire sample gives the likelihood function for the sample  
1 − a1 L − a2 L2 yt = (1 + θ1 L) εt , (2.36)
T
!
  −T /2 1 X
T
pdf {yt }t=0 y0 = 2π σ 2 exp − 2 (yt − a1 yt−1 )2 . (2.33) which is usually denoted
2σ a (L) yt = θ (L) εt . (2.37)
t=1

Taking logs, and evaluating the first order conditions for σ 2 and a gives the usual OLS
2.5.3 Properties of LS Estimates of an AR( p) Process∗
estimator. Note that this is MLE conditional on y0 . There is a corresponding exact MLE,
but the difference is usually small (the asymptotic distributions of the two estimators are Reference: Hamilton (1994) 8.2
the same under stationarity; under non-stationarity OLS still gives consistent estimates). The LS estimates are typically biased, but consistent and asymptotically normally
PT
The MLE of Var(εt ) is given by t=1 v̂t2 /T , where v̂t is the OLS residual. distributed, provided the AR is stationary.
These results carry over to any finite-order VAR. The MLE, conditional on the initial As usual the LS estimate is
observations, of the VAR is the same as OLS estimates of each equation. The MLE of " T
#−1 T
PT 1X 1X
the i j th element in Cov(εt ) is given by t=1 v̂it v̂ jt /T , where v̂it and v̂ jt are the OLS β̂ L S − β = 0
xt xt xt εt , where (2.38)
T T
residuals. t=1 t=1
h i
To get the exact MLE, we need to multiply (2.33) with the unconditional pdf of y0 xt = yt−1 yt−2 · · · yt− p .
(since we have no information to condition on)
! The first term in (2.38) is the inverse of the sample estimate of covariance matrix of
1 y2 xt (since Eyt = 0), which converges in probability to 6x−1 (yt is stationary and ergodic
pdf (y0 ) = p exp − 2 0 , (2.34) PT x
2π σ 2 /(1 − a 2 ) 2σ /(1 − a 2 ) for all moments if εt is Gaussian). The last term, T1 t=1 xt εt , is serially uncorrelated,
so we can apply a CLT. Note that Ext εt εt0 xt0 =Eεt εt0 Ext xt0 = σ 2 6x x since u t and xt are
since y0 ∼ N (0, σ 2 /(1 − a 2 )). The optimization problem is then non-linear and must be
independent. We therefore have
solved by a numerical optimization routine.
T
1 X  
√ xt εt →d N 0, σ 2 6x x . (2.39)
T t=1

30 31
Combining these facts, we get the asymptotic distribution for the autoregression coefficients. This demonstrates that testing that all the autocorre-
√     lations are zero is essentially the same as testing if all the autoregressive coefficients are
T β̂ L S − β →d N 0, 6x−1 xσ
2
. (2.40) zero. Note, however, that the transformation is non-linear, which may make a difference
in small samples.
Consistency follows from taking plim of (2.38)
T
  1X 2.6 ARMA Models
plim β̂ L S − β = 6x−1
x plim xt εt
T
t=1

= 0, An ARMA model has both AR and MA components. For instance, an ARMA(p,q) is

since xt and εt are uncorrelated. yt = a1 yt−1 + a2 yt−2 + ... + a p yt− p + εt + θ1 εt−1 + ... + θq εt−q . (2.43)

Estimation of ARMA processes is typically done by setting up the likelihood function and
2.5.4 Autoregressions versus Autocorrelations∗
then using some numerical method to maximize it.
It is straightforward to see the relation between autocorrelations and the AR model when Even low-order ARMA models can be fairly flexible. For instance, the ARMA(1,1)
the AR model is the true process. This relation is given by the Yule-Walker equations. model is
For an AR(1), the autoregression coefficient is simply the first autocorrelation coeffi- yt = ayt−1 + εt + θ εt−1 , where εt is white noise. (2.44)
cient. For an AR(2), yt = a1 yt−1 + a2 yt−2 + εt , we have The model can be written on MA(∞) form as
Cov(yt , yt ) Cov(yt , a1 yt−1 + a2 yt−2 + εt )
   

X
 Cov(yt−1 , yt )  =  Cov(yt−1 , a1 yt−1 + a2 yt−2 + εt ) 
    yt = εt + a s−1 (a + θ )εt−s . (2.45)
s=1
Cov(yt−2 , yt ) Cov(yt−2 , a1 yt−1 + a2 yt−2 + εt )
The autocorrelations can be shown to be
a1 Cov(yt , yt−1 ) + a2 Cov(yt , yt−2 ) + Cov(yt , εt )
 

=  a1 Cov(yt−1 , yt−1 ) + a2 Cov(yt−1 , yt−2 )



 , or
 (1 + aθ )(a + θ )
ρ1 = , and ρs = aρs−1 for s = 2, 3, . . . (2.46)
a1 Cov(yt−2 , yt−1 ) + a2 Cov(yt−2 , yt−2 ) 1 + θ 2 + 2aθ

γ0
 
a1 γ1 + a2 γ2 + Var(εt )
 and the conditional expectations are
 γ1  =  a1 γ0 + a2 γ1 . (2.41)
   
Et yt+s = a s−1 (ayt + θ εt ) s = 1, 2, . . . (2.47)
γ2 a1 γ1 + a2 γ0
See Figure 2.3 for an example.
To transform to autocorrelation, divide through by γ0 . The last two equations are then
" # " # " # " #
ρ1 a 1 + a 2 ρ1 ρ1 a1 / (1 − a2 )
= or = . (2.42)
ρ2 a 1 ρ1 + a 2 ρ2 a12 / (1 − a2 ) + a2

If we know the parameters of the AR(2) model (a1 , a2 , and Var(εt )), then we can solve
for the autocorrelations. Alternatively, if we know the autocorrelations, then we can solve

32 33
a. Impulse response of a=0.9 a. Impulse response of a=0 eigenvalues of the canonical form (the VAR(1) form of the AR( p)) is one. Such a process
2 2 is said to be integrated of order one (often denoted I(1)) and can be made stationary by
taking first differences.
0
θ=−0.8
0 Example 14 (Non-stationary AR(2).) The process yt = 1.5yt−1 − 0.5yt−2 + εt can be
θ=0 written
θ=0.8
" # " #" # " #
yt 1.5 −0.5 yt−1 εt
−2 −2 = + ,
0 5 10 0 5 10 yt−1 1 0 yt−2 0
period period
where the matrix has the eigenvalues 1 and 0.5 and is therefore non-stationary. Note that
subtracting yt−1 from both sides gives yt − yt−1 = 0.5 (yt−1 − yt−2 ) + εt , so the variable
a. Impulse response of a=−0.9 xt = yt − yt−1 is stationary.
2 ARMA(1,1): yt = ayt−1 + εt + θεt−1
The distinguishing feature of unit root processes is that the effect of a shock never
vanishes. This is most easily seen for the random walk. Substitute repeatedly in (2.49) to
0
get

−2 yt = µ + (µ + yt−2 + εt−1 ) + εt
0 5 10
period ...
t
X
= tµ + y0 + εs . (2.50)
Figure 2.3: Impulse response function of ARMA(1,1) s=1

The effect of εt never dies out: a non-zero value of εt gives a permanent shift of the level
2.7 Non-stationary Processes
of yt . This process is clearly non-stationary. A consequence of the permanent effect of
2.7.1 Introduction a shock is that the variance of the conditional distribution grows without bound as the
forecasting horizon is extended. For instance, for the random walk with drift, (2.50), the
A trend-stationary process can be made stationary by subtracting a linear trend. The
distribution conditional on the information in t = 0 is N y0 + tµ, sσ 2 if the innovations

simplest example is
are Gaussian. This means that the expected change is tµ and that the conditional vari-
yt = µ + βt + εt (2.48)
ance grows linearly with the forecasting horizon. The unconditional variance is therefore
where εt is white noise. infinite and the standard results on inference are not applicable.
A unit root process can be made stationary only by taking a difference. The simplest In contrast, the conditional distributions from the trend stationary model, (2.48), is
N st, σ 2 .

example is the random walk with drift
A process could have two unit roots (integrated of order 2: I(2)). In this case, we need
yt = µ + yt−1 + εt , (2.49)
to difference twice to make it stationary. Alternatively, a process can also be explosive,
where εt is white noise. The name “unit root process” comes from the fact that the largest that is, have eigenvalues outside the unit circle. In this case, the impulse response function
diverges.

34 35
Example 15 (Two unit roots.) Suppose yt in Example (14) is actually the first difference we could proceed by applying standard econometric tools to 1yt .
of some other series, yt = z t − z t−1 . We then have One may then be tempted to try first-differencing all non-stationary series, since it
may be hard to tell if they are unit root process or just trend-stationary. For instance, a
z t − z t−1 = 1.5 (z t−1 − z t−2 ) − 0.5 (z t−2 − z t−3 ) + εt first difference of the trend stationary process, (2.48), gives
z t = 2.5z t−1 − 2z t−2 + 0.5z t−3 + εt ,
yt − yt−1 = β + εt − εt−1 . (2.52)
which is an AR(3) with the following canonical form
Its unclear if this is an improvement: the trend is gone, but the errors are now of MA(1)
εt
      
zt 2.5 −2 0.5 z t−1 type (in fact, non-invertible, and therefore tricky, in particular for estimation).
 z t−1  =  1 0 0   z t−2  +  0  .
      

z t−2 0 1 0 z t−3 0 2.7.3 Testing for a Unit Root I∗


The eigenvalues are 1, 1, and 0.5, so z t has two unit roots (integrated of order 2: I(2) and Suppose we run an OLS regression of
needs to be differenced twice to become stationary).
yt = ayt−1 + εt , (2.53)
Example 16 (Explosive AR(1).) Consider the process yt = 1.5yt−1 + εt . The eigenvalue
is then outside the unit circle, so the process is explosive. This means that the impulse where the true value of |a| < 1. The asymptotic distribution is of the LS estimator is
response to a shock to εt diverges (it is 1.5s for s periods ahead). √  
T â − a ∼ N 0, 1 − a 2 .

(2.54)

2.7.2 Spurious Regressions (The variance follows from the standard OLS formula where the variance of the estimator
−1
is σ 2 X 0 X/T . Here plim X 0 X/T =Var(yt ) which we know is σ 2 / 1 − a 2 ).

Strong trends often causes problems in econometric models where yt is regressed on xt .
In essence, if no trend is included in the regression, then xt will appear to be significant, It is well known (but not easy to show) that when a = 1, then â is biased towards
just because it is a proxy for a trend. The same holds for unit root processes, even if zero in small samples. In addition, the asymptotic distribution is no longer (2.54). In
they have no deterministic trends. However, the innovations accumulate and the series fact, there is a discontinuity in the limiting distribution as we move from a stationary/to
therefore tend to be trending in small samples. A warning sign of a spurious regression is a non-stationary variable. This, together with the small sample bias means that we have
when R 2 > DW statistics. to use simulated critical values for testing the null hypothesis of a = 1 based on the OLS
For trend-stationary data, this problem is easily solved by detrending with a linear estimate from (2.53).
trend (before estimating or just adding a trend to the regression). The approach is to calculate the test statistic
However, this is usually a poor method for a unit root processes. What is needed is a â − 1
t= ,
first difference. For instance, a first difference of the random walk is Std(â)

1yt = yt − yt−1 and reject the null of non-stationarity if t is less than the critical values published by
Dickey and Fuller (typically more negative than the standard values to compensate for the
= εt , (2.51)
small sample bias) or from your own simulations.
which is white noise (any finite difference, like yt − yt−s , will give a stationary series), so In principle, distinguishing between a stationary and a non-stationary series is very

36 37
In principle, distinguishing between a stationary and a non-stationary series is very difficult (and impossible unless we restrict the class of processes, for instance, to an AR(2)), since any sample of a non-stationary process can be arbitrarily well approximated by some stationary process, and vice versa. The lesson to be learned, from a practical point of view, is that strong persistence in the data generating process (stationary or not) invalidates the usual results on inference. We are usually on safer ground to apply the unit root results in this case, even if the process is actually stationary.

2.7.4 Testing for a Unit Root II∗

Reference: Fuller (1976), Introduction to Statistical Time Series; Dickey and Fuller (1979), "Distribution of the Estimators for Autoregressive Time Series with a Unit Root," Journal of the American Statistical Association, 74, 427-431.
Consider the AR(1) with intercept
y_t = γ + α y_{t−1} + u_t, or Δy_t = γ + β y_{t−1} + u_t, where β = (α − 1). (2.55)
The DF test is to test the null hypothesis that β = 0, against β < 0, using the usual t statistic. However, under the null hypothesis, the distribution of the t statistic is far from a Student-t or normal distribution. Critical values, found in Fuller and in Dickey and Fuller, are lower than the usual ones. Remember to add any nonstochastic regressors that are required, for instance, seasonal dummies, trends, etc. If you forget a trend, then the power of the test goes to zero as T → ∞. The critical values are lower the more deterministic components that are added.
The asymptotic critical values are valid even under heteroskedasticity and non-normal distributions of u_t. However, no autocorrelation in u_t is allowed for. In contrast, the simulated small sample critical values are usually only valid for iid normally distributed disturbances.
The ADF test is a way to account for serial correlation in u_t. The same critical values apply. Consider an AR(1) u_t = ρ u_{t−1} + e_t. A Cochrane-Orcutt transformation of (2.55) gives
Δy_t = γ(1 − ρ) + β̃ y_{t−1} + ρ(β + 1)Δy_{t−1} + e_t, where β̃ = β(1 − ρ). (2.56)
The test is here the t test for β̃. The fact that β̃ = β(1 − ρ) is of no importance, since β̃ is zero only if β is (as long as ρ < 1, as it must be). (2.56) generalizes, so one should include p lags of Δy_t if u_t is an AR(p). The test remains valid even under an MA structure if the number of lags included increases at the rate T^{1/3} as the sample length increases. In practice: add lags until the remaining residual is white noise. The size of the test (probability of rejecting H0 when it is actually correct) can be awful in small samples for a series that is an I(1) process that initially "overshoots" over time, as Δy_t = e_t − 0.8 e_{t−1}, since this makes the series look mean reverting (stationary). Similarly, the power (probability of rejecting H0 when it is false) can be awful when there is a lot of persistence, for instance, if α = 0.95.
The power of the test depends on the span of the data, rather than the number of observations. Seasonally adjusted data tend to look more integrated than they are. One should then apply different critical values, see Ghysels and Perron (1993), Journal of Econometrics, 55, 57-98. A break in mean or trend also makes the data look non-stationary. One should perhaps apply tests that account for this, see Banerjee, Lumsdaine, and Stock (1992), Journal of Business and Economic Statistics, 10, 271-287.
Park (1990, "Testing for Unit Roots and Cointegration by Variable Addition," Advances in Econometrics, 8, 107-133) sets up a framework where we can use both non-stationarity as the null hypothesis and where we can have stationarity as the null. Consider the regression
y_t = Σ_{s=0}^{p} β_s t^s + Σ_{s=p+1}^{q} β_s t^s + u_t, (2.57)
where we want to test if H0: β_s = 0, s = p + 1, ..., q. If F(p, q) is the Wald statistic for this, then J(p, q) = F(p, q)/T has some (complicated) asymptotic distribution under the null. You reject non-stationarity if J(p, q) < critical value, since J(p, q) →p 0 under (trend) stationarity.
Now, define
G(p, q) = F(p, q) Var(u_t)/Var(√T ū_t) ∼ χ²_{q−p} under H0 of stationarity, (2.58)
and G(p, q) →p ∞ under non-stationarity, so we reject stationarity if G(p, q) > critical value. Note that Var(u_t) is a traditional variance, while Var(√T ū_t) can be estimated with a Newey-West estimator.
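As an illustration of the ADF regression (2.56) and its generalization with p lags of Δy_t, a minimal sketch (Python/numpy) that regresses Δy_t on a constant, y_{t−1} and p lags of Δy_t and returns the t-statistic on y_{t−1}; the default lag length and the variable names are illustrative only.

import numpy as np

def adf_tstat(y, p=4):
    """t-stat on y_{t-1} in: dy_t = gamma + beta*y_{t-1} + sum_j c_j*dy_{t-j} + e_t."""
    y = np.asarray(y, dtype=float)
    dy = np.diff(y)
    X = [np.ones(len(dy) - p), y[p:-1]]                     # constant and lagged level
    X += [dy[p - j - 1:len(dy) - j - 1] for j in range(p)]  # p lags of dy
    X = np.column_stack(X)
    Y = dy[p:]
    b = np.linalg.lstsq(X, Y, rcond=None)[0]
    u = Y - X @ b
    s2 = u @ u / (len(Y) - X.shape[1])
    cov = s2 * np.linalg.inv(X.T @ X)
    return b[1] / np.sqrt(cov[1, 1])    # compare with Dickey-Fuller critical values

In line with the advice above, one would increase p until the remaining residual looks like white noise.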
2.7.5 Cointegration∗

Suppose y_{1t} and y_{2t} are both (scalar) unit root processes, but that
z_t = y_{1t} − β y_{2t} = [1 −β][y_{1t}; y_{2t}] (2.59)
is stationary. The processes y_{1t} and y_{2t} must then share the same common stochastic trend, and are therefore cointegrated with the cointegrating vector [1 −β]. Running the regression (2.59) gives an estimator β̂_LS which converges much faster than usual (it is "superconsistent") and is not affected by any simultaneous equations bias. The intuition for the second result is that the simultaneous equations bias depends on the simultaneous reactions to the shocks, which are stationary and therefore without any long-run importance.
This can be generalized by letting y_t be a vector of n unit root processes which follows a VAR. For simplicity, assume it is a VAR(2)
y_t = A_1 y_{t−1} + A_2 y_{t−2} + ε_t. (2.60)
Subtract y_{t−1} from both sides, and add and subtract A_2 y_{t−1} on the right hand side
y_t − y_{t−1} = A_1 y_{t−1} + A_2 y_{t−2} + ε_t − y_{t−1} + A_2 y_{t−1} − A_2 y_{t−1}
= (A_1 + A_2 − I) y_{t−1} − A_2 (y_{t−1} − y_{t−2}) + ε_t. (2.61)
The left hand side is now stationary, and so are y_{t−1} − y_{t−2} and ε_t on the right hand side. It must therefore be the case that (A_1 + A_2 − I) y_{t−1} is also stationary; it must be n linear combinations of the cointegrating vectors. Since the number of cointegrating vectors must be less than n, the rank of A_1 + A_2 − I must be less than n. To impose this calls for special estimation methods.
The simplest of these is Engle and Granger's two-step procedure. In the first step, we estimate the cointegrating vectors (as in (2.59)) and calculate the different z_t series (fewer than n). In the second step, these are used in the error correction form of the VAR
y_t − y_{t−1} = γ z_{t−1} − A_2 (y_{t−1} − y_{t−2}) + ε_t (2.62)
to estimate γ and A_2. The relation to (2.61) is most easily seen in the bivariate case. Then, by using (2.59) in (2.62) we get
y_t − y_{t−1} = [γ −γβ] y_{t−1} − A_2 (y_{t−1} − y_{t−2}) + ε_t, (2.63)
so knowledge (estimates) of β (scalar), γ (2 × 1), and A_2 (2 × 2) allows us to "back out" A_1.
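A minimal sketch of the two-step procedure for a bivariate system (Python/numpy), following (2.59) and (2.62) with a single lag of the first difference; the function name and the omission of a constant in the first-step regression are simplifying assumptions made only for illustration.

import numpy as np

def engle_granger_two_step(y1, y2):
    """Step 1: cointegrating regression (2.59); step 2: error correction form (2.62)."""
    # step 1: regress y1_t on y2_t by LS (superconsistent estimate of beta)
    beta = (y2 @ y1) / (y2 @ y2)
    z = y1 - beta * y2
    # step 2: regress dy_t on z_{t-1} and dy_{t-1}, both equations at once
    Y = np.column_stack([y1, y2])
    dY = np.diff(Y, axis=0)                           # y_t - y_{t-1}
    X = np.column_stack([z[1:-1], dY[:-1]])           # [z_{t-1}, dy_{t-1}]
    coef = np.linalg.lstsq(X, dY[1:], rcond=None)[0]
    gamma = coef[0]                                   # 2-vector of adjustment coefficients
    A2 = -coef[1:].T                                  # note the minus sign in (2.62)
    return beta, gamma, A2

In practice the first-step regression would normally include a constant as well; it is left out here only to stay close to the equations in the text.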
Bibliography

Greene, W. H., 2000, Econometric Analysis, Prentice-Hall, Upper Saddle River, New Jersey, 4th edn.

Hamilton, J. D., 1994, Time Series Analysis, Princeton University Press, Princeton.

Hayashi, F., 2000, Econometrics, Princeton University Press.

Johnston, J., and J. DiNardo, 1997, Econometric Methods, McGraw-Hill, New York, 4th edn.

Pindyck, R. S., and D. L. Rubinfeld, 1997, Econometric Models and Economic Forecasts, Irwin McGraw-Hill, Boston, Massachusetts, 4th edn.

Verbeek, M., 2000, A Guide to Modern Econometrics, Wiley, Chichester.
3 The Distribution of a Sample Average

Reference: Hayashi (2000) 6.5
Additional references: Hamilton (1994) 14; Verbeek (2000) 4.10; Harris and Matyas (1999); Pindyck and Rubinfeld (1997) Appendix 10.1; Cochrane (2001) 11.7

3.1 Variance of a Sample Average

In order to understand the distribution of many estimators we need to get an important building block: the variance of a sample average.
Consider a covariance stationary vector process m_t with zero mean and Cov(m_t, m_{t−s}) = R(s) (which only depends on s). That is, we allow for serial correlation in m_t, but no heteroskedasticity. This is more restrictive than we want, but we will return to that further on.
Let m̄ = Σ_{t=1}^T m_t / T. The sampling variance of a mean estimator of the zero mean random variable m_t is defined as
Cov(m̄) = E[(Σ_{t=1}^T m_t / T)(Σ_{τ=1}^T m_τ / T)′]. (3.1)
Let the covariance (matrix) at lag s be
R(s) = Cov(m_t, m_{t−s}) = E m_t m′_{t−s}, (3.2)
since E m_t = 0 for all t.

Example 1 (m_t is a scalar iid process.) When m_t is a scalar iid process, then
Var(Σ_{t=1}^T m_t / T) = (1/T²) Σ_{t=1}^T Var(m_t) /*independently distributed*/
= (1/T²) T Var(m_t) /*identically distributed*/
= Var(m_t)/T.
This is the classical iid case. Clearly, lim_{T→∞} Var(m̄) = 0. By multiplying both sides by T we instead get Var(√T m̄) = Var(m_t), which is often more convenient for asymptotics.

Example 2 Let x_t and z_t be two scalars, with sample averages x̄ and z̄. Let m_t = [x_t z_t]′. Then Cov(m̄) is
Cov(m̄) = Cov([x̄; z̄]) = [Var(x̄) Cov(x̄, z̄); Cov(z̄, x̄) Var(z̄)].

Example 3 (Cov(m̄) with T = 3.) With T = 3, we have
Cov(T m̄) = E(m_1 + m_2 + m_3)(m′_1 + m′_2 + m′_3)
= E(m_1 m′_1 + m_2 m′_2 + m_3 m′_3) + E(m_2 m′_1 + m_3 m′_2) + E(m_1 m′_2 + m_2 m′_3) + E m_3 m′_1 + E m_1 m′_3
= 3R(0) + 2R(1) + 2R(−1) + R(2) + R(−2).

The general pattern in the previous example is
Cov(T m̄) = Σ_{s=−(T−1)}^{T−1} (T − |s|) R(s). (3.3)
Divide both sides by T
Cov(√T m̄) = Σ_{s=−(T−1)}^{T−1} (1 − |s|/T) R(s). (3.4)
This is the exact expression for a given sample size.
In many cases, we use the asymptotic expression (limiting value as T → ∞) instead. If R(s) = 0 for s > q, so m_t is an MA(q), then the limit as the sample size goes to infinity is
ACov(√T m̄) = lim_{T→∞} Cov(√T m̄) = Σ_{s=−q}^{q} R(s), (3.5)
where ACov stands for the asymptotic variance-covariance matrix. This continues to hold even if q = ∞, provided R(s) goes to zero sufficiently quickly, as it does in stationary VAR systems. In this case we have
ACov(√T m̄) = Σ_{s=−∞}^{∞} R(s). (3.6)
Estimation in finite samples will of course require some cut-off point, which is discussed below.
The traditional estimator of ACov(√T m̄) is just R(0), which is correct when m_t has no autocorrelation, that is
ACov(√T m̄) = R(0) = Cov(m_t, m_t) if Cov(m_t, m_{t−s}) = 0 for s ≠ 0. (3.7)
By comparing with (3.5) we see that this underestimates the true variance if the autocovariances are mostly positive, and overestimates it if they are mostly negative. The errors can be substantial.

Example 4 (Variance of sample mean of AR(1).) Let m_t = ρ m_{t−1} + u_t, where Var(u_t) = σ². Note that R(s) = ρ^{|s|} σ²/(1 − ρ²), so
AVar(√T m̄) = Σ_{s=−∞}^{∞} R(s) = σ²/(1 − ρ²) Σ_{s=−∞}^{∞} ρ^{|s|} = σ²/(1 − ρ²) (1 + 2 Σ_{s=1}^{∞} ρ^s)
= σ²/(1 − ρ²) × (1 + ρ)/(1 − ρ),
which is increasing in ρ (provided |ρ| < 1, as required for stationarity). The variance of m̄ is much larger for ρ close to one than for ρ close to zero: the high autocorrelation creates long swings, so the mean cannot be estimated with any good precision in a small sample. If we disregard all autocovariances, then we would conclude that the variance of √T m̄ is σ²/(1 − ρ²), which is smaller (larger) than the true value when ρ > 0 (ρ < 0). For instance, with ρ = 0.85, it is approximately 12 times too small. See Figure 3.1 for an illustration.

[Figure 3.1: Variance of √T times sample mean of AR(1) process m_t = ρm_{t−1} + u_t. Left panel: variance of the sample mean; right panel: Var(sample mean)/Var(series); both plotted against the AR(1) coefficient.]

Example 5 (Variance of sample mean of AR(1), continued.) Part of the reason why Var(m̄) increased with ρ in the previous example is that Var(m_t) increases with ρ. We can eliminate this effect by considering how much larger AVar(√T m̄) is than in the iid case, that is, AVar(√T m̄)/Var(m_t) = (1 + ρ)/(1 − ρ). This ratio is one for ρ = 0 (iid data), less than one for ρ < 0, and greater than one for ρ > 0. This says that if relatively more of the variance in m_t comes from long swings (high ρ), then the sample mean is more uncertain. See Figure 3.1 for an illustration.

Example 6 (Variance of sample mean of AR(1), illustration of the limit as T → ∞ of (3.4).) For an AR(1), (3.4) is
Var(√T m̄) = σ²/(1 − ρ²) Σ_{s=−(T−1)}^{T−1} (1 − |s|/T) ρ^{|s|}
= σ²/(1 − ρ²) [1 + 2 Σ_{s=1}^{T−1} (1 − s/T) ρ^s]
= σ²/(1 − ρ²) [1 + 2ρ/(1 − ρ) + 2(ρ^{T+1} − ρ)/(T(1 − ρ)²)].
The last term in brackets goes to zero as T goes to infinity. We then get the result in Example 4.
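Example 6 is easy to verify numerically. The following sketch (Python/numpy; ρ, σ, T and the number of replications are arbitrary illustrative values) compares a Monte Carlo estimate of Var(√T m̄) with the exact finite-sample formula in Example 6 and with the asymptotic value from Example 4.

import numpy as np

rng = np.random.default_rng(1)
rho, sigma, T, n_sim = 0.85, 1.0, 200, 20000      # illustrative settings

# simulate AR(1): m_t = rho*m_{t-1} + u_t, starting from the stationary distribution
m0 = rng.standard_normal(n_sim) * sigma / np.sqrt(1 - rho**2)
m = np.empty((n_sim, T))
m[:, 0] = rho * m0 + sigma * rng.standard_normal(n_sim)
for t in range(1, T):
    m[:, t] = rho * m[:, t - 1] + sigma * rng.standard_normal(n_sim)

mc = T * m.mean(axis=1).var()                     # Monte Carlo Var(sqrt(T)*mbar)
exact = sigma**2 / (1 - rho**2) * (1 + 2 * rho / (1 - rho)
        + 2 * (rho**(T + 1) - rho) / (T * (1 - rho)**2))    # Example 6
asy = sigma**2 / (1 - rho**2) * (1 + rho) / (1 - rho)       # Example 4
print(mc, exact, asy)

With ρ = 0.85 the asymptotic value is about (1 + ρ)/(1 − ρ) ≈ 12 times larger than the naive σ²/(1 − ρ²), in line with the factor mentioned in the text.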
3.2 The Newey-West Estimator

3.2.1 Definition of the Estimator

Newey and West (1987) suggested the following estimator of the covariance matrix in (3.5) (for some n < T)
ÂCov(√T m̄) = Σ_{s=−n}^{n} (1 − |s|/(n + 1)) R̂(s)
= R̂(0) + Σ_{s=1}^{n} (1 − s/(n + 1)) (R̂(s) + R̂(−s)), or since R̂(−s) = R̂′(s),
= R̂(0) + Σ_{s=1}^{n} (1 − s/(n + 1)) (R̂(s) + R̂′(s)), where (3.8)
R̂(s) = (1/T) Σ_{t=s+1}^{T} m_t m′_{t−s} (if E m_t = 0). (3.9)
The tent shaped (Bartlett) weights in (3.8) guarantee a positive definite covariance estimate. In contrast, equal weights (as in (3.5)) may give an estimated covariance matrix which is not positive definite, which is fairly awkward. Newey and West (1987) showed that this estimator is consistent if we let n go to infinity as T does, but in such a way that n/T^{1/4} goes to zero.
There are several other possible estimators of the covariance matrix in (3.5), but simulation evidence suggests that they typically do not improve a lot on the Newey-West estimator.

Example 7 (m_t is MA(1).) Suppose we know that m_t = ε_t + θε_{t−1}. Then R(s) = 0 for s ≥ 2, so it might be tempting to use n = 1 in (3.8). This gives ÂCov(√T m̄) = R̂(0) + (1/2)[R̂(1) + R̂′(1)], while the theoretical expression (3.5) is ACov = R(0) + R(1) + R′(1). The Newey-West estimator puts too low weights on the first lead and lag, which suggests that we should use n > 1 (or more generally, n > q for an MA(q) process).

It can also be shown that, under quite general circumstances, Ŝ in (3.8)-(3.9) is a consistent estimator of ACov(√T m̄), even if m_t is heteroskedastic (on top of being autocorrelated). (See Hamilton (1994) 10.5 for a discussion.)

[Figure 3.2: Variance of OLS estimator, heteroskedastic errors. The panel plots the std of the LS estimator against α for the model y_t = 0.9x_t + ε_t with ε_t ∼ N(0, h_t), h_t = 0.5 exp(αx_t²), comparing σ²(X′X)⁻¹, White's estimator, and the simulated value.]

3.2.2 How to Implement the Newey-West Estimator

Economic theory and/or stylized facts can sometimes help us choose the lag length n. For instance, we may have a model of stock returns which typically show little autocorrelation, so it may make sense to set n = 0 or n = 1 in that case. A popular choice of n is to round (T/100)^{1/4} down to the closest integer, although this does not satisfy the consistency requirement.
It is important to note that the definitions of the covariance matrices in (3.2) and (3.9) assume that m_t has zero mean. If that is not the case, then the mean should be removed in the calculation of the covariance matrix. In practice, you remove the same number, estimated on the whole sample, from both m_t and m_{t−s}. It is often recommended to remove the sample means even if theory tells you that the true mean is zero.
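A minimal implementation sketch of (3.8)-(3.9) (Python/numpy). The demeaning follows the recommendation above, and the default lag choice is the rounded (T/100)^{1/4} rule mentioned in the text; the function name and interface are illustrative.

import numpy as np

def newey_west(m, n=None):
    """Newey-West estimate of ACov(sqrt(T)*mbar), eq. (3.8)-(3.9); m is T x k (or length T)."""
    m = np.asarray(m, dtype=float)
    if m.ndim == 1:
        m = m[:, None]
    T = m.shape[0]
    if n is None:
        n = int((T / 100) ** 0.25)                # popular rule of thumb from the text
    m = m - m.mean(axis=0)                        # remove the sample mean, see above
    S = m.T @ m / T                               # R_hat(0)
    for s in range(1, n + 1):
        R_s = m[s:].T @ m[:-s] / T                # R_hat(s) = (1/T) sum_t m_t m_{t-s}'
        S += (1 - s / (n + 1)) * (R_s + R_s.T)    # Bartlett (tent-shaped) weights
    return S

For instance, newey_west(x * u[:, None]), with x the T × k regressor matrix and u the fitted residuals, gives an estimate of the S_0 matrix that appears in the next chapter.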
[Figure 3.3: Variance of OLS estimator, autocorrelated errors. The panels plot the std of the LS estimator against α for Corr(x_t, x_{t−1}) = −0.9, 0, and 0.9. Model: y_t = 0.9x_t + ε_t, where ε_t = αε_{t−1} + u_t and u_t is iid N(0, h) such that Std(ε_t) = 1; the panels compare σ²(X′X)⁻¹, Newey-West, and the simulated value.]

3.3 Summary

Let m̄ = (1/T) Σ_{t=1}^T m_t and R(s) = Cov(m_t, m_{t−s}). Then
ACov(√T m̄) = Σ_{s=−∞}^{∞} R(s)
ACov(√T m̄) = R(0) = Cov(m_t, m_t) if R(s) = 0 for s ≠ 0
Newey-West: ÂCov(√T m̄) = R̂(0) + Σ_{s=1}^{n} (1 − s/(n + 1)) (R̂(s) + R̂′(s)).

Bibliography

Cochrane, J. H., 2001, Asset Pricing, Princeton University Press, Princeton, New Jersey.

Hamilton, J. D., 1994, Time Series Analysis, Princeton University Press, Princeton.

Harris, D., and L. Matyas, 1999, "Introduction to the Generalized Method of Moments Estimation," in Laszlo Matyas (ed.), Generalized Method of Moments Estimation, chap. 1, Cambridge University Press.

Hayashi, F., 2000, Econometrics, Princeton University Press.

Newey, W. K., and K. D. West, 1987, "A Simple Positive Semi-Definite, Heteroskedasticity and Autocorrelation Consistent Covariance Matrix," Econometrica, 55, 703-708.

Pindyck, R. S., and D. L. Rubinfeld, 1997, Econometric Models and Economic Forecasts, Irwin McGraw-Hill, Boston, Massachusetts, 4th edn.

Verbeek, M., 2000, A Guide to Modern Econometrics, Wiley, Chichester.
4 Least Squares

Reference: Greene (2000) 6
Additional references: Hayashi (2000) 1-2; Verbeek (2000) 1-4; Hamilton (1994) 8

4.1 Definition of the LS Estimator

4.1.1 LS with Summation Operators

Consider the linear model
y_t = x′_t β_0 + u_t, (4.1)
where y_t and u_t are scalars, x_t a k × 1 vector, and β_0 is a k × 1 vector of the true coefficients. Least squares minimizes the sum of the squared fitted residuals
Σ_{t=1}^T e_t² = Σ_{t=1}^T (y_t − x′_t β)², (4.2)
by choosing the vector β. The first order conditions are
0_{k×1} = Σ_{t=1}^T x_t (y_t − x′_t β̂_LS), or (4.3)
Σ_{t=1}^T x_t y_t = Σ_{t=1}^T x_t x′_t β̂_LS, (4.4)
which are the so called normal equations. These can be solved as
β̂_LS = (Σ_{t=1}^T x_t x′_t)⁻¹ Σ_{t=1}^T x_t y_t (4.5)
= ((1/T) Σ_{t=1}^T x_t x′_t)⁻¹ (1/T) Σ_{t=1}^T x_t y_t. (4.6)

Remark 1 (Summation and vectors) Let z_t and x_t be the vectors
z_t = [z_{1t}; z_{2t}] and x_t = [x_{1t}; x_{2t}; x_{3t}],
then
Σ_{t=1}^T x_t z′_t = Σ_{t=1}^T [x_{1t}; x_{2t}; x_{3t}][z_{1t} z_{2t}] = [Σ x_{1t}z_{1t} Σ x_{1t}z_{2t}; Σ x_{2t}z_{1t} Σ x_{2t}z_{2t}; Σ x_{3t}z_{1t} Σ x_{3t}z_{2t}].

4.1.2 LS in Matrix Form

Define the matrices
Y = [y_1; y_2; ...; y_T]_{T×1}, u = [u_1; u_2; ...; u_T]_{T×1}, X = [x′_1; x′_2; ...; x′_T]_{T×k}, and e = [e_1; e_2; ...; e_T]_{T×1}. (4.7)
Write the model (4.1) as
[y_1; y_2; ...; y_T] = [x′_1; x′_2; ...; x′_T] β_0 + [u_1; u_2; ...; u_T], or (4.8)
Y = Xβ_0 + u. (4.9)

Remark 2 Let x_t be a k × 1 and z_t an m × 1 vector. Define the matrices
X = [x′_1; x′_2; ...; x′_T]_{T×k} and Z = [z′_1; z′_2; ...; z′_T]_{T×m}.
We then have
Σ_{t=1}^T x_t z′_t = X′Z.
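A minimal sketch of the LS estimator (4.5), written in the matrix form Y = Xβ_0 + u of (4.9), i.e. β̂_LS = (X′X)⁻¹X′Y (Python/numpy; the simulated data and the "true" coefficients are purely illustrative):

import numpy as np

rng = np.random.default_rng(2)
T, k = 200, 3
X = np.column_stack([np.ones(T), rng.standard_normal((T, k - 1))])  # include a constant
beta0 = np.array([1.0, 0.5, -0.25])                                  # made-up true coefficients
Y = X @ beta0 + rng.standard_normal(T)

# solve the normal equations X'X b = X'Y instead of inverting X'X explicitly
b_ls = np.linalg.solve(X.T @ X, X.T @ Y)
u_hat = Y - X @ b_ls                                                 # fitted residuals
print(b_ls)

Solving the normal equations with a linear solver (or np.linalg.lstsq) is numerically preferable to forming (X′X)⁻¹ explicitly, but it is algebraically the same estimator.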

We can then rewrite the loss function (4.2) as e′e, the first order conditions (4.3) and (4.4) as (recall that y_t = y′_t since it is a scalar)
0_{k×1} = X′(Y − Xβ̂_LS), (4.10)
X′Y = X′Xβ̂_LS, (4.11)
and the solution (4.5) as
β̂_LS = (X′X)⁻¹X′Y. (4.12)

4.2 LS and R²∗

The first order conditions in LS are
Σ_{t=1}^T x_t û_t = 0, where û_t = y_t − ŷ_t, with ŷ_t = x′_t β̂. (4.13)
This implies that the fitted residuals and fitted values are orthogonal, Σ_{t=1}^T ŷ_t û_t = Σ_{t=1}^T β̂′x_t û_t = 0. If we let x_t include a constant, then (4.13) also implies that the fitted residuals have a zero mean, Σ_{t=1}^T û_t/T = 0. We can then decompose the sample variance (denoted Var_hat) of y_t = ŷ_t + û_t as
Var_hat(y_t) = Var_hat(ŷ_t) + Var_hat(û_t), (4.14)
since ŷ_t and û_t are uncorrelated in this case. (Note that Cov(ŷ_t, û_t) = E ŷ_t û_t − E ŷ_t E û_t, so the orthogonality is not enough to allow the decomposition; we also need E ŷ_t E û_t = 0—this holds for sample moments as well.)
We define R² as the fraction of Var_hat(y_t) that is explained by the model
R² = Var_hat(ŷ_t)/Var_hat(y_t) (4.15)
= 1 − Var_hat(û_t)/Var_hat(y_t). (4.16)
LS minimizes the sum of squared fitted errors, which is proportional to Var_hat(û_t), so it maximizes R².
We can rewrite R² by noting that
Cov_hat(y_t, ŷ_t) = Cov_hat(ŷ_t + û_t, ŷ_t) = Var_hat(ŷ_t). (4.17)
Use this to substitute for Var_hat(ŷ_t) in (4.15) and then multiply both sides with Cov_hat(y_t, ŷ_t)/Var_hat(ŷ_t) = 1 to get
R² = Cov_hat(y_t, ŷ_t)²/(Var_hat(y_t) Var_hat(ŷ_t))
= Corr_hat(y_t, ŷ_t)², (4.18)
which shows that R² is the square of the correlation coefficient of the actual and fitted values. Note that this interpretation of R² relies on the fact that Cov_hat(ŷ_t, û_t) = 0. From (4.14) this implies that the sample variance of the fitted values is smaller than the sample variance of y_t. From (4.15) we see that this implies that 0 ≤ R² ≤ 1.
To get a bit more intuition for what R² represents, suppose the estimated coefficients equal the true coefficients, so ŷ_t = x′_t β_0. In this case, R² = Corr(x′_t β_0 + u_t, x′_t β_0)², that is, the squared correlation of y_t with the systematic part of y_t. Clearly, if the model is perfect so u_t = 0, then R² = 1. In contrast, when there are no movements in the systematic part (β_0 = 0), then R² = 0.

Remark 3 In a simple regression where y_t = a + bx_t + u_t, where x_t is a scalar, R² = Corr_hat(y_t, x_t)². To see this, note that, in this case, (4.18) can be written
R² = Cov_hat(y_t, b̂x_t)²/(Var_hat(y_t) Var_hat(b̂x_t)) = b̂² Cov_hat(y_t, x_t)²/(b̂² Var_hat(y_t) Var_hat(x_t)),
so the b̂² terms cancel.

Remark 4 Now, consider the reverse regression x_t = c + dy_t + v_t. The LS estimator of the slope is d̂_LS = Cov_hat(y_t, x_t)/Var_hat(y_t). Recall that b̂_LS = Cov_hat(y_t, x_t)/Var_hat(x_t). We therefore have
b̂_LS d̂_LS = Cov_hat(y_t, x_t)²/(Var_hat(y_t) Var_hat(x_t)) = R².
This shows that d̂_LS = 1/b̂_LS if (and only if) R² = 1.

4.3 Finite Sample Properties of LS

Use the true model (4.1) to substitute for y_t in the definition of the LS estimator (4.6)
β̂_LS = ((1/T) Σ_{t=1}^T x_t x′_t)⁻¹ (1/T) Σ_{t=1}^T x_t (x′_t β_0 + u_t)
= β_0 + ((1/T) Σ_{t=1}^T x_t x′_t)⁻¹ (1/T) Σ_{t=1}^T x_t u_t. (4.19)
It is possible to show unbiasedness of the LS estimator, even if x_t is stochastic and u_t is autocorrelated and heteroskedastic—provided E(u_t|x_{t−s}) = 0 for all s. Let E(u_t | {x_t}_{t=1}^T) denote the expectation of u_t conditional on all values of x_{t−s}. Using iterated expectations on (4.19) then gives
Eβ̂_LS = β_0 + E_x[((1/T) Σ_{t=1}^T x_t x′_t)⁻¹ (1/T) Σ_{t=1}^T x_t E(u_t | {x_t}_{t=1}^T)] (4.20)
= β_0, (4.21)
since E(u_t|x_{t−s}) = 0 for all s. This is, for instance, the case when the regressors are deterministic. Notice that E(u_t|x_t) = 0 is not enough for unbiasedness since (4.19) contains terms involving x_{t−s} x_t u_t from the product of ((1/T) Σ_{t=1}^T x_t x′_t)⁻¹ and x_t u_t.

Example 5 (AR(1).) Consider estimating α in y_t = α y_{t−1} + u_t. The LS estimator is
α̂_LS = ((1/T) Σ_{t=1}^T y²_{t−1})⁻¹ (1/T) Σ_{t=1}^T y_{t−1} y_t
= α + ((1/T) Σ_{t=1}^T y²_{t−1})⁻¹ (1/T) Σ_{t=1}^T y_{t−1} u_t.
In this case, the assumption E(u_t|x_{t−s}) = 0 for all s (that is, s = ..., −1, 0, 1, ...) is false, since x_{t+1} = y_t and u_t and y_t are correlated. We can therefore not use this way of proving that α̂_LS is unbiased. In fact, it is not, and it can be shown that α̂_LS is downward-biased if α > 0, and that this bias gets quite severe as α gets close to unity.

The finite sample distribution of the LS estimator is typically unknown.
Even in the most restrictive case where u_t is iid N(0, σ²) and E(u_t|x_{t−s}) = 0 for all s, we can only get that
β̂_LS | {x_t}_{t=1}^T ∼ N(β_0, (σ²/T) ((1/T) Σ_{t=1}^T x_t x′_t)⁻¹). (4.22)
This says that the estimator, conditional on the sample of regressors, is normally distributed. With deterministic x_t, this clearly means that β̂_LS is normally distributed in a small sample. The intuition is that the LS estimator with deterministic regressors is just a linear combination of the normally distributed y_t, so it must be normally distributed. However, if x_t is stochastic, then we have to take into account the distribution of {x_t}_{t=1}^T to find the unconditional distribution of β̂_LS. The principle is that
pdf(β̂) = ∫_{−∞}^{∞} pdf(β̂, x) dx = ∫_{−∞}^{∞} pdf(β̂|x) pdf(x) dx,
so the distribution in (4.22) must be multiplied with the probability density function of {x_t}_{t=1}^T and then integrated over {x_t}_{t=1}^T to give the unconditional (marginal) distribution of β̂_LS. This is typically not a normal distribution.
Another way to see the same problem is to note that β̂_LS in (4.19) is a product of two random variables, (Σ_{t=1}^T x_t x′_t/T)⁻¹ and Σ_{t=1}^T x_t u_t/T. Even if u_t happened to be normally distributed, there is no particular reason why x_t u_t should be, and certainly no strong reason for why (Σ_{t=1}^T x_t x′_t/T)⁻¹ Σ_{t=1}^T x_t u_t/T should be.

4.4 Consistency of LS

Reference: Greene (2000) 9.3-5 and 11.2; Hamilton (1994) 8.2; Davidson (2000) 3
We now study if the LS estimator is consistent.

Remark 6 Suppose the true parameter value is β_0. The estimator β̂_T (which, of course, depends on the sample size T) is said to be consistent if for every ε > 0 and δ > 0 there exists N such that for T ≥ N
Pr(‖β̂_T − β_0‖ > δ) < ε.
(‖x‖ = √(x′x), the Euclidean distance of x from zero.) We write this plim β̂_T = β_0 or just plim β̂ = β_0, or perhaps β̂ →p β_0. (For an estimator of a covariance matrix, the most convenient is to stack the unique elements in a vector and then apply the definition above.)
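The downward bias in Example 5, and the consistency defined in Remark 6, are easy to see in a small simulation (Python/numpy; the value α = 0.9 and the sample sizes are arbitrary illustrative choices, not taken from the notes):

import numpy as np

rng = np.random.default_rng(3)
alpha, n_sim = 0.9, 5000

def ar1_ls(T):
    """LS estimate of alpha in y_t = alpha*y_{t-1} + u_t for one simulated sample."""
    u = rng.standard_normal(T + 1)
    y = np.empty(T + 1)
    y[0] = u[0]
    for t in range(1, T + 1):
        y[t] = alpha * y[t - 1] + u[t]
    return (y[:-1] @ y[1:]) / (y[:-1] @ y[:-1])

for T in (25, 100, 400):
    est = np.array([ar1_ls(T) for _ in range(n_sim)])
    print(T, est.mean())        # the mean estimate creeps up towards 0.9 as T grows

The average estimate is clearly below 0.9 for small T (the downward bias) but approaches it as T increases, in line with the consistency results below; compare with Figure 4.2.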

Remark 7 (Slutsky's theorem.) If g(.) is a continuous function, then plim g(z_T) = g(plim z_T). In contrast, note that E g(z_T) is generally not equal to g(E z_T), unless g(.) is a linear function.

Remark 8 (Probability limit of product.) Let x_T and y_T be two functions of a sample of length T. If plim x_T = a and plim y_T = b, then plim x_T y_T = ab.

Assume
plim (1/T) Σ_{t=1}^T x_t x′_t = Σ_xx < ∞, and Σ_xx invertible. (4.23)
The plim carries over to the inverse by Slutsky's theorem.¹ Use the facts above to write the probability limit of (4.19) as
plim β̂_LS = β_0 + Σ_xx⁻¹ plim (1/T) Σ_{t=1}^T x_t u_t. (4.24)
To prove consistency of β̂_LS we therefore have to show that
plim (1/T) Σ_{t=1}^T x_t u_t = E x_t u_t = Cov(x_t, u_t) = 0. (4.25)
This is fairly easy to establish in special cases, for instance, when w_t = x_t u_t is iid or when there is either heteroskedasticity or serial correlation. The case with both serial correlation and heteroskedasticity is just a bit more complicated. In other cases, it is clear that the covariances of the residuals and the regressors are not all zero—for instance when some of the regressors are measured with error or when some of them are endogenous variables.
An example of a case where LS is not consistent is when the errors are autocorrelated and the regressors include lags of the dependent variable. For instance, suppose the error is an MA(1) process
u_t = ε_t + θ_1 ε_{t−1}, (4.26)
where ε_t is white noise, and that the regression equation is an AR(1)
y_t = ρ y_{t−1} + u_t. (4.27)
This is an ARMA(1,1) model and it is clear that the regressor and error in (4.27) are correlated, so LS is not a consistent estimator of an ARMA(1,1) model.

¹ This puts non-trivial restrictions on the data generating processes. For instance, if x_t includes lagged values of y_t, then we typically require y_t to be stationary and ergodic, and that u_t is independent of x_{t−s} for s ≥ 0.

4.5 Asymptotic Normality of LS

Reference: Greene (2000) 9.3-5 and 11.2; Hamilton (1994) 8.2; Davidson (2000) 3

Remark 9 (Continuous mapping theorem.) Let the sequences of random matrices {x_T} and {y_T}, and the non-random matrix {a_T} be such that x_T →d x, y_T →p y, and a_T → a (a traditional limit). Let g(x_T, y_T, a_T) be a continuous function. Then g(x_T, y_T, a_T) →d g(x, y, a). Either of y_T and a_T could be irrelevant in g.

Remark 10 From the previous remark: if x_T →d x (a random variable) and plim Q_T = Q (a constant matrix), then Q_T x_T →d Qx.

Premultiply (4.19) by √T and rearrange as
√T(β̂_LS − β_0) = ((1/T) Σ_{t=1}^T x_t x′_t)⁻¹ (√T/T) Σ_{t=1}^T x_t u_t. (4.28)
If the first term on the right hand side converges in probability to a finite matrix (as assumed in (4.23)), and the vector of random variables x_t u_t satisfies a central limit theorem, then
√T(β̂_LS − β_0) →d N(0, Σ_xx⁻¹ S_0 Σ_xx⁻¹), where (4.29)
Σ_xx = plim (1/T) Σ_{t=1}^T x_t x′_t and S_0 = Cov((√T/T) Σ_{t=1}^T x_t u_t).
The last matrix in the covariance matrix does not need to be transposed since it is symmetric (since Σ_xx is). This general expression is valid for both autocorrelated and heteroskedastic residuals—all such features are loaded into the S_0 matrix. Note that S_0 is the variance-covariance matrix of √T times a sample average (of the vector of random variables x_t u_t), which can be complicated to specify and to estimate. In simple cases, we can derive what it is. To do so, we typically need to understand the properties of the residuals. Are they autocorrelated and/or heteroskedastic? In other cases we will have to use some kind of "non-parametric" approach to estimate it.
A common approach is to estimate Σ_xx by Σ_{t=1}^T x_t x′_t/T and use the Newey-West estimator of S_0.

4.5.1 Special Case: Classical LS assumptions

Reference: Greene (2000) 9.4 or Hamilton (1994) 8.2.
We can recover the classical expression for the covariance, σ²Σ_xx⁻¹, if we assume that the regressors are stochastic, but require that x_t is independent of all u_{t+s} and that u_t is iid. It rules out, for instance, that u_t and x_{t−2} are correlated and also that the variance of u_t depends on x_t. Expand the expression for S_0 as
S_0 = E[((√T/T) Σ_{t=1}^T x_t u_t)((√T/T) Σ_{t=1}^T u_t x′_t)] (4.30)
= (1/T) E[(... + x_{s−1}u_{s−1} + x_s u_s + ...)(... + u_{s−1}x′_{s−1} + u_s x′_s + ...)].
Note that
E x_{t−s}u_{t−s}u_t x′_t = E x_{t−s}x′_t E u_{t−s}u_t (since u_t and x_{t−s} independent)
= 0 if s ≠ 0 (since E u_{t−s}u_t = 0 by iid u_t),
= E x_t x′_t E u_t u_t else. (4.31)
This means that all cross terms (involving different observations) drop out and that we can write
S_0 = (1/T) Σ_{t=1}^T E x_t x′_t E u_t² (4.32)
= σ² (1/T) Σ_{t=1}^T E x_t x′_t (since u_t is iid and σ² = E u_t²) (4.33)
= σ² Σ_xx. (4.34)
Using this in (4.29) gives
Asymptotic Cov[√T(β̂_LS − β_0)] = Σ_xx⁻¹ S_0 Σ_xx⁻¹ = Σ_xx⁻¹ σ² Σ_xx Σ_xx⁻¹ = σ² Σ_xx⁻¹.

4.5.2 Special Case: White's Heteroskedasticity

Reference: Greene (2000) 12.2 and Davidson and MacKinnon (1993) 16.2.
This section shows that the classical LS formula for the covariance matrix is valid even if the errors are heteroskedastic—provided the heteroskedasticity is independent of the regressors.
The only difference compared with the classical LS assumptions is that u_t is now allowed to be heteroskedastic, but this heteroskedasticity is not allowed to depend on the moments of x_t. This means that (4.32) holds, but (4.33) does not since E u_t² is not the same for all t.
However, we can still simplify (4.32) a bit more. We assumed that E x_t x′_t and E u_t² (which can both be time varying) are not related to each other, so we could perhaps multiply E x_t x′_t by Σ_{t=1}^T E u_t²/T instead of by E u_t². This is indeed true asymptotically—where any possible "small sample" relation between E x_t x′_t and E u_t² must wash out due to the assumptions of independence (which are about population moments).
In large samples we therefore have
S_0 = ((1/T) Σ_{t=1}^T E u_t²)((1/T) Σ_{t=1}^T E x_t x′_t)
= ((1/T) Σ_{t=1}^T E u_t²) E((1/T) Σ_{t=1}^T x_t x′_t)
= ω² Σ_xx, (4.35)
where ω² is a scalar. This is very similar to the classical LS case, except that ω² is the average variance of the residual rather than the constant variance. In practice, the estimator of ω² is the same as the estimator of σ², so we can actually apply the standard LS formulas in this case.
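A sketch of how the general sandwich expression Σ_xx⁻¹ S_0 Σ_xx⁻¹ in (4.29) can be estimated in practice (Python/numpy). Here S_0 is estimated by (1/T) Σ x_t x′_t û_t², i.e. White's heteroskedasticity-consistent choice; a Newey-West estimate of S_0, as sketched in Section 3.2, could be plugged in instead when the residuals are also autocorrelated. The function name and interface are illustrative.

import numpy as np

def ls_sandwich_cov(X, Y):
    """Estimate Cov(sqrt(T)*(b_LS - beta0)) = Sxx^-1 S0 Sxx^-1 with White's S0."""
    T = len(Y)
    b = np.linalg.solve(X.T @ X, X.T @ Y)
    u = Y - X @ b
    Sxx = X.T @ X / T
    S0 = (X * u[:, None]**2).T @ X / T          # (1/T) sum x_t x_t' u_t^2
    Sxx_inv = np.linalg.inv(Sxx)
    V = Sxx_inv @ S0 @ Sxx_inv                  # asymptotic covariance of sqrt(T)*(b - beta0)
    std_b = np.sqrt(np.diag(V) / T)             # standard errors of b itself
    return b, std_b

Under the classical assumptions this collapses (in large samples) to the usual σ²(X′X)⁻¹, which is what Figures 3.2-3.3 compare against.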

This is the motivation for why White's test for heteroskedasticity makes sense: if the heteroskedasticity is not correlated with the regressors, then the standard LS formula is correct (provided there is no autocorrelation).

4.6 Inference

Consider some estimator, β̂_{k×1}, with an asymptotic normal distribution
√T(β̂ − β_0) →d N(0, V). (4.36)
Suppose we want to test the null hypothesis that the s linear restrictions Rβ_0 = r hold, where R is an s × k matrix and r is an s × 1 vector. If the null hypothesis is true, then
√T(Rβ̂ − r) →d N(0, RVR′), (4.37)
since the s linear combinations are linear combinations of random variables with an asymptotic normal distribution as in (4.36).

Remark 11 If the n × 1 vector x ∼ N(0, Σ), then x′Σ⁻¹x ∼ χ²_n.

Remark 12 From the previous remark and Remark (9), it follows that if the n × 1 vector x →d N(0, Σ), then x′Σ⁻¹x →d χ²_n.

From this remark, it follows that if the null hypothesis, Rβ_0 = r, is true, then the Wald test statistic converges in distribution to a χ²_s variable
T(Rβ̂ − r)′(RVR′)⁻¹(Rβ̂ − r) →d χ²_s. (4.38)
Values of the test statistic above the x% critical value of the χ²_s distribution mean that we reject the null hypothesis at the x% significance level.
When there is only one restriction (s = 1), then √T(Rβ̂ − r) is a scalar, so the test can equally well be based on the fact that
√T(Rβ̂ − r)/√(RVR′) →d N(0, 1).
In this case, we should reject the null hypothesis if the test statistic is either very low (negative) or very high (positive). In particular, let Φ() be the standard normal cumulative distribution function. We then reject the null hypothesis at the x% significance level if the test statistic is below x_L such that Φ(x_L) = (x/2)% or above x_H such that Φ(x_H) = 1 − (x/2)% (that is, with (x/2)% of the probability mass in each tail).

Example 13 (TR²/(1 − R²) as a test of the regression.) Recall from (4.15)-(4.16) that R² = Var_hat(ŷ_t)/Var_hat(y_t) = 1 − Var_hat(û_t)/Var_hat(y_t), where ŷ_t and û_t are the fitted value and residual respectively. We therefore get
TR²/(1 − R²) = T Var_hat(ŷ_t)/Var_hat(û_t).
To simplify the algebra, assume that both y_t and x_t are demeaned and that no intercept is used. (We get the same results, but after more work, if we relax this assumption.) In this case, ŷ_t = x′_t β̂, so we can rewrite the previous equation as
TR²/(1 − R²) = T β̂′ Σ_xx β̂/Var_hat(û_t).
This is identical to (4.38) when R = I_k and r = 0_{k×1} and the classical LS assumptions are fulfilled (so V = Var_hat(û_t) Σ_xx⁻¹). TR²/(1 − R²) is therefore a χ²_k distributed statistic for testing if all the slope coefficients are zero.

Example 14 (F version of the test.) There is also an F_{k,T−k} version of the test in the previous example: [R²/k]/[(1 − R²)/(T − k)]. Note that k times an F_{k,T−k} variable converges to a χ²_k variable as T − k → ∞. This means that the χ²_k form in the previous example can be seen as an asymptotic version of the (more common) F form.
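A compact sketch of the Wald statistic (4.38) (Python/numpy; V_hat is whatever estimate of V one has, for instance the sandwich estimate sketched above, and scipy is used only for the asymptotic p-value; names and interface are illustrative):

import numpy as np
from scipy.stats import chi2

def wald_test(b, V_hat, R, r, T):
    """Wald test of R*beta0 = r, eq. (4.38); V_hat estimates the asymptotic covariance V."""
    R, r = np.atleast_2d(R), np.atleast_1d(r)
    diff = R @ b - r
    W = T * diff @ np.linalg.solve(R @ V_hat @ R.T, diff)
    s = R.shape[0]
    return W, 1 - chi2.cdf(W, df=s)            # statistic and asymptotic p-value

For instance, with R = I_k and r = 0 (and the classical covariance estimate) this reproduces the TR²/(1 − R²) test of Example 13.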
4.6.1 Tests of Non-Linear Restrictions∗

To test non-linear restrictions, we can use the delta method, which gives the asymptotic distribution of a function of a random variable.

Remark 16 (Delta method) Consider an estimator β̂_{k×1} which satisfies
√T(β̂ − β_0) →d N(0, Ω),
and suppose we want the asymptotic distribution of a transformation of β
γ_{q×1} = g(β),
where g(.) has continuous first derivatives. The result is
√T[g(β̂) − g(β_0)] →d N(0, Ψ_{q×q}), where
Ψ = (∂g(β_0)/∂β′) Ω (∂g(β_0)/∂β′)′, where ∂g(β_0)/∂β′ is q × k.

Proof. By the mean value theorem we have
g(β̂) = g(β_0) + (∂g(β*)/∂β′)(β̂ − β_0),
where
∂g(β)/∂β′ = [∂g_1(β)/∂β_1 ... ∂g_1(β)/∂β_k; ... ; ∂g_q(β)/∂β_1 ... ∂g_q(β)/∂β_k] (q × k),
and we evaluate it at β*, which is (weakly) between β̂ and β_0. Premultiply by √T and rearrange as
√T[g(β̂) − g(β_0)] = (∂g(β*)/∂β′) √T(β̂ − β_0).
If β̂ is consistent (plim β̂ = β_0) and ∂g(β*)/∂β′ is continuous, then by Slutsky's theorem plim ∂g(β*)/∂β′ = ∂g(β_0)/∂β′, which is a constant. The result then follows from the continuous mapping theorem.
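A small numeric sketch of the delta method (Python/numpy). The example transformation g(β) = β_1/β_2, the numerical Jacobian, and all input values are hypothetical, chosen only to illustrate the formula for Ψ.

import numpy as np

def delta_method(b_hat, Omega, g, T, eps=1e-6):
    """Asymptotic std errors of g(b_hat) by the delta method (numerical Jacobian)."""
    b_hat = np.asarray(b_hat, dtype=float)
    g0 = np.atleast_1d(g(b_hat))
    J = np.empty((g0.size, b_hat.size))          # dg/db', q x k
    for i in range(b_hat.size):
        db = np.zeros_like(b_hat)
        db[i] = eps
        J[:, i] = (np.atleast_1d(g(b_hat + db)) - g0) / eps
    Psi = J @ Omega @ J.T                        # Psi = (dg/db') Omega (dg/db')'
    return g0, np.sqrt(np.diag(Psi) / T)         # point estimate and std errors

# example: the ratio of two slope coefficients
b_hat = np.array([0.8, 0.4])
Omega = np.array([[0.5, 0.1], [0.1, 0.2]])       # made-up estimate of Omega
print(delta_method(b_hat, Omega, lambda b: b[0] / b[1], T=200))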
4.6.2 On F Tests∗

F tests are sometimes used instead of chi-square tests. However, F tests rely on very special assumptions and typically converge to chi-square tests as the sample size increases. There are therefore few compelling theoretical reasons for why we should use F tests.² This section demonstrates that point.

Remark 17 If Y_1 ∼ χ²_{n_1}, Y_2 ∼ χ²_{n_2}, and if Y_1 and Y_2 are independent, then Z = (Y_1/n_1)/(Y_2/n_2) ∼ F_{n_1,n_2}. As n_2 → ∞, n_1 Z →d χ²_{n_1} (essentially because the denominator in Z is then equal to its expected value).

To use the F test to test s linear restrictions Rβ_0 = r, we need to assume that the small sample distribution of the estimator is normal, √T(β̂ − β_0) ∼ N(0, σ²W), where σ² is a scalar and W a known matrix. This would follow from an assumption that the residuals are normally distributed and that we either consider the distribution conditional on the regressors or that the regressors are deterministic. In this case W = Σ_xx⁻¹.
Consider the test statistic
F = T(Rβ̂ − r)′(Rσ̂²WR′)⁻¹(Rβ̂ − r)/s.
This is similar to (4.38), except that we use the estimated covariance matrix σ̂²W instead of the true σ²W (recall, W is assumed to be known) and that we have divided by the number of restrictions, s. Multiply and divide this expression by σ²
F = [T(Rβ̂ − r)′(Rσ²WR′)⁻¹(Rβ̂ − r)/s]/(σ̂²/σ²).
The numerator is a χ²_s variable divided by its degrees of freedom, s. The denominator can be written σ̂²/σ² = Σ(û_t/σ)²/T, where û_t are the fitted residuals. Since we just assumed that u_t are iid N(0, σ²), the denominator is a χ²_T variable divided by its degrees of freedom, T. It can also be shown that the numerator and denominator are independent (essentially because the fitted residuals are orthogonal to the regressors), so F is an F_{s,T} variable.
We indeed need very strong assumptions to justify the F distribution. Moreover, as T → ∞, sF →d χ²_s, which is the Wald test—which does not need all these assumptions.

² However, some simulation evidence suggests that F tests may have better small sample properties than chi-square tests.
4.7 Diagnostic Tests of Autocorrelation, Heteroskedasticity, and Normality∗

Reference: Greene (2000) 12.3, 13.5 and 9.7; Johnston and DiNardo (1997) 6; Pindyck and Rubinfeld (1997) 6; Patterson (2000) 5
LS and IV are still consistent even if the residuals are autocorrelated, heteroskedastic, and/or non-normal, but the traditional expression for the variance of the parameter estimators is invalid. It is therefore important to investigate the properties of the residuals.
We would like to test the properties of the true residuals, u_t, but these are unobservable. We can instead use residuals from a consistent estimator as approximations, since the approximation error then goes to zero as the sample size increases. The residuals from an estimator are
û_t = y_t − x′_t β̂ = x′_t(β_0 − β̂) + u_t. (4.39)
If plim β̂ = β_0, then û_t converges in probability to the true residual ("pointwise consistency"). It therefore makes sense to use û_t to study the (approximate) properties of u_t. We want to understand if u_t are autocorrelated and/or heteroskedastic, since this affects the covariance matrix of the least squares estimator and also to what extent least squares is efficient. We might also be interested in studying if the residuals are normally distributed, since this also affects the efficiency of least squares (remember that LS is MLE if the residuals are normally distributed).
It is important that the fitted residuals used in the diagnostic tests are consistent. With poorly estimated residuals, we can easily find autocorrelation, heteroskedasticity, or non-normality even if the true residuals have none of these features.

4.7.1 Autocorrelation

Let ρ̂_s be the estimate of the sth autocorrelation coefficient of some variable, for instance, the fitted residuals. The sampling properties of ρ̂_s are complicated, but there are several useful large sample results for Gaussian processes (these results typically carry over to processes which are similar to the Gaussian—a homoskedastic process with finite 6th moment is typically enough). When the true autocorrelations are all zero (not ρ_0, of course), then for any i and j different from zero
√T [ρ̂_i; ρ̂_j] →d N([0; 0], [1 0; 0 1]). (4.40)
This result can be used to construct tests for both single autocorrelations (t-test or χ² test) and several autocorrelations at once (χ² test).

Example 18 (t-test) We want to test the hypothesis that ρ_1 = 0. Since the N(0,1) distribution has 5% of the probability mass below −1.65 and another 5% above 1.65, we can reject the null hypothesis at the 10% level if √T|ρ̂_1| > 1.65. With T = 100, we therefore need |ρ̂_1| > 1.65/√100 = 0.165 for rejection, and with T = 1000 we need |ρ̂_1| > 1.65/√1000 ≈ 0.052.

The Box-Pierce test follows directly from the result in (4.40), since it shows that √T ρ̂_i and √T ρ̂_j are iid N(0,1) variables. Therefore, the sum of their squares is distributed as a χ² variable. The test statistic typically used is
Q_L = T Σ_{s=1}^L ρ̂_s² →d χ²_L. (4.41)

Example 19 (Box-Pierce) Let ρ̂_1 = 0.165, and T = 100, so Q_1 = 100 × 0.165² = 2.72. The 10% critical value of the χ²_1 distribution is 2.71, so the null hypothesis of no autocorrelation is rejected.

The choice of lag order in (4.41), L, should be guided by theoretical considerations, but it may also be wise to try different values. There is clearly a trade off: too few lags may miss a significant high-order autocorrelation, but too many lags can destroy the power of the test (as the test statistic is not affected much by increasing L, but the critical values increase).

Example 20 (Residuals follow an AR(1) process) If u_t = 0.9u_{t−1} + ε_t, then the true autocorrelation coefficients are ρ_j = 0.9^j.
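A minimal sketch of the Box-Pierce statistic (4.41) applied to fitted residuals (Python; scipy is used only for the χ² p-value; the default lag length is an illustrative choice):

import numpy as np
from scipy.stats import chi2

def box_pierce(u_hat, L=4):
    """Box-Pierce Q_L = T * sum_{s=1}^L rho_hat_s^2, compared with chi2(L), eq. (4.41)."""
    u = np.asarray(u_hat, dtype=float)
    u = u - u.mean()
    T = len(u)
    gamma0 = u @ u / T
    rho = np.array([(u[s:] @ u[:-s] / T) / gamma0 for s in range(1, L + 1)])
    Q = T * np.sum(rho**2)
    return Q, 1 - chi2.cdf(Q, df=L)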
A common test of the serial correlation of residuals from a regression is the Durbin-Watson test
d = 2(1 − ρ̂_1), (4.42)
where the null hypothesis of no autocorrelation is
not rejected if d > d_upper,
rejected if d < d_lower (in favor of positive autocorrelation),
and the test is inconclusive otherwise,
where the upper and lower critical values can be found in tables. (Use 4 − d to let negative autocorrelation be the alternative hypothesis.) This test is typically not useful when lagged dependent variables enter the right hand side (d is biased towards showing no autocorrelation). Note that DW tests only for first-order autocorrelation.

Example 21 (Durbin-Watson.) With ρ̂_1 = 0.2 we get d = 1.6. For large samples, the 5% critical value is d_lower ≈ 1.6, so ρ̂_1 > 0.2 is typically considered as evidence of positive autocorrelation.

The fitted residuals used in the autocorrelation tests must be consistent in order to interpret the result in terms of the properties of the true residuals. For instance, an excluded autocorrelated variable will probably give autocorrelated fitted residuals—and also make the coefficient estimator inconsistent (unless the excluded variable is uncorrelated with the regressors). Only when we know that the model is correctly specified can we interpret a finding of autocorrelated residuals as an indication of the properties of the true residuals.

4.7.2 Heteroskedasticity

Remark 22 (Kronecker product.) If A and B are matrices, then
A ⊗ B = [a_11 B ... a_1n B; ... ; a_m1 B ... a_mn B].

Example 23 Let x_1 and x_2 be scalars. Then
[x_1; x_2] ⊗ [x_1; x_2] = [x_1 x_1; x_1 x_2; x_2 x_1; x_2 x_2].

White's test for heteroskedasticity tests the null hypothesis of homoskedasticity against the kind of heteroskedasticity which can be explained by the levels, squares, and cross products of the regressors. Let w_t be the unique elements in x_t ⊗ x_t, where we have added a constant to x_t if there was not one from the start. Run a regression of the squared fitted LS residuals on w_t
û_t² = w′_t γ + ε_t (4.43)
and test if all elements (except the constant) in γ are zero (with a χ² or F test). The reason for this specification is that if u_t² is uncorrelated with x_t ⊗ x_t, then the usual LS covariance matrix applies.
Breusch-Pagan's test is very similar, except that the vector w_t in (4.43) can be any vector which is thought of as useful for explaining the heteroskedasticity. The null hypothesis is that the variance is constant, which is tested against the alternative that the variance is some function of w_t.
The fitted residuals used in the heteroskedasticity tests must be consistent in order to interpret the result in terms of the properties of the true residuals. For instance, if some of the elements in w_t belong to the regression equation, but are excluded, then the fitted residuals will probably fail these tests.
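A sketch of White's test based on the auxiliary regression (4.43) (Python; scipy only for the p-value). The common TR² form of the test statistic is used here, where R² comes from the auxiliary regression; this is an assumption about the implementation, since the text only says to test whether all elements of γ except the constant are zero. X is assumed to hold the non-constant regressors, so no duplicate columns arise.

import numpy as np
from scipy.stats import chi2

def white_test(X, u_hat):
    """White's test: regress u_hat^2 on a constant, levels, squares, cross products (4.43)."""
    T, k = X.shape
    cols = [np.ones(T)] + [X[:, i] for i in range(k)]
    cols += [X[:, i] * X[:, j] for i in range(k) for j in range(i, k)]
    W = np.column_stack(cols)
    y = u_hat**2
    g = np.linalg.lstsq(W, y, rcond=None)[0]
    e = y - W @ g
    R2 = 1 - e.var() / y.var()
    df = W.shape[1] - 1                     # all slopes (not the constant) equal to zero
    return T * R2, 1 - chi2.cdf(T * R2, df)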
4.7.3 Normality

We often make the assumption of normally distributed errors, for instance, in maximum likelihood estimation. This assumption can be tested by using the fitted errors. This works since moments estimated from the fitted errors are consistent estimators of the moments of the true errors. Define the degree of skewness and excess kurtosis for a variable z_t (could be the fitted residuals) as
θ̂_3 = (1/T) Σ_{t=1}^T (z_t − z̄)³/σ̂³, (4.44)
θ̂_4 = (1/T) Σ_{t=1}^T (z_t − z̄)⁴/σ̂⁴ − 3, (4.45)
where z̄ is the sample mean and σ̂² is the estimated variance.

Remark 24 (χ²(n) distribution.) If x_i are independent N(0, σ_i²) variables, then Σ_{i=1}^n x_i²/σ_i² ∼ χ²(n).

In a normal distribution, the true values are zero and the test statistics θ̂_3 and θ̂_4 are themselves normally distributed with zero covariance and variances 6/T and 24/T, respectively (straightforward, but tedious, to show). Therefore, under the null hypothesis of a normal distribution, Tθ̂_3²/6 and Tθ̂_4²/24 are independent and both asymptotically distributed as χ²(1), so the sum is asymptotically a χ²(2) variable
W = T(θ̂_3²/6 + θ̂_4²/24) →d χ²(2). (4.46)
This is the Jarque and Bera test of normality.

[Figure 4.1: This figure shows a histogram from 100 draws of iid uniformly [0,1] distributed variables. For these draws, θ̂_3 = −0.14, θ̂_4 = −1.4, and W = 8.]

[Figure 4.2: Distribution of LS estimator of autoregressive parameter. The panels show the mean and the std of the LS estimate of ρ in y_t = ρy_{t−1} + ε_t (true model y_t = 0.9y_{t−1} + ε_t) against the sample size T, comparing simulation with the asymptotic values, mean and σ²(X′X)⁻¹.]

Bibliography

Davidson, J., 2000, Econometric Theory, Blackwell Publishers, Oxford.

Davidson, R., and J. G. MacKinnon, 1993, Estimation and Inference in Econometrics, Oxford University Press, Oxford.

Greene, W. H., 2000, Econometric Analysis, Prentice-Hall, Upper Saddle River, New Jersey, 4th edn.

Hamilton, J. D., 1994, Time Series Analysis, Princeton University Press, Princeton.

Hayashi, F., 2000, Econometrics, Princeton University Press.

Johnston, J., and J. DiNardo, 1997, Econometric Methods, McGraw-Hill, New York, 4th edn.

Patterson, K., 2000, An Introduction to Applied Econometrics: A Time Series Approach, MacMillan Press, London.

Pindyck, R. S., and D. L. Rubinfeld, 1997, Econometric Models and Economic Forecasts, Irwin McGraw-Hill, Boston, Massachusetts, 4th edn.

Verbeek, M., 2000, A Guide to Modern Econometrics, Wiley, Chichester.
[Figure 4.3: Distribution of LS estimator when residuals have a t3 distribution. Panels: probability density functions of the t-statistic for T = 5 and T = 100 for the model R_t = 0.9f_t + ε_t with ε_t = v_t − 2, where v_t has a χ²(2) distribution. Reported results: kurtosis of the t-statistic 27.7 (T = 5) and 3.1 (T = 100); frequency of |t-stat| > 1.645: 0.25 and 0.11; frequency of |t-stat| > 1.96: 0.19 and 0.05.]

5 Instrumental Variable Method

Reference: Greene (2000) 9.5 and 16.1-2
Additional references: Hayashi (2000) 3.1-4; Verbeek (2000) 5.1-4; Hamilton (1994) 8.2; and Pindyck and Rubinfeld (1997) 7

5.1 Consistency of Least Squares or Not?

Consider the linear model
y_t = x′_t β_0 + u_t, (5.1)
where y_t and u_t are scalars, x_t a k × 1 vector, and β_0 is a k × 1 vector of the true coefficients. The least squares estimator is
β̂_LS = ((1/T) Σ_{t=1}^T x_t x′_t)⁻¹ (1/T) Σ_{t=1}^T x_t y_t (5.2)
= β_0 + ((1/T) Σ_{t=1}^T x_t x′_t)⁻¹ (1/T) Σ_{t=1}^T x_t u_t, (5.3)
where we have used (5.1) to substitute for y_t. The probability limit is
plim β̂_LS − β_0 = (plim (1/T) Σ_{t=1}^T x_t x′_t)⁻¹ plim (1/T) Σ_{t=1}^T x_t u_t. (5.4)
In many cases the law of large numbers applies to both terms on the right hand side. The first term is typically a matrix with finite elements and the second term is the covariance of the regressors and the true residuals. This covariance must be zero for LS to be consistent.

5.2 Reason 1 for IV: Measurement Errors

Reference: Greene (2000) 9.5.
Suppose the true model is
y_t* = (x_t*)′ β_0 + u_t*. (5.5)
Data on y_t* and x_t* is not directly observable, so we instead run the regression
y_t = x′_t β + u_t, (5.6)
where y_t and x_t are proxies for the correct variables (the ones that the model is true for). We can think of the difference as measurement errors
y_t = y_t* + v_t^y and (5.7)
x_t = x_t* + v_t^x, (5.8)
where the errors are uncorrelated with the true values and the "true" residual u_t*.
Use (5.7) and (5.8) in (5.5)
y_t − v_t^y = (x_t − v_t^x)′ β_0 + u_t*, or
y_t = x′_t β_0 + ε_t, where ε_t = −(v_t^x)′ β_0 + v_t^y + u_t*. (5.9)
Suppose that x_t* is measured with error. From (5.8) we see that v_t^x and x_t are correlated, so LS on (5.9) is inconsistent in this case. To make things even worse, measurement errors in only one of the variables typically affect all the coefficient estimates.
To illustrate the effect of the error, consider the case when x_t is a scalar. Then, the probability limit of the LS estimator of β in (5.9) is
plim β̂_LS = Cov(y_t, x_t)/Var(x_t)
= Cov(x_t* β_0 + u_t*, x_t)/Var(x_t)
= Cov(x_t β_0 − v_t^x β_0 + u_t*, x_t)/Var(x_t)
= [Cov(x_t β_0, x_t) + Cov(−v_t^x β_0, x_t) + Cov(u_t*, x_t)]/Var(x_t)
= β_0 Var(x_t)/Var(x_t) + Cov(−v_t^x β_0, x_t* + v_t^x)/Var(x_t)
= β_0 − β_0 Var(v_t^x)/Var(x_t)
= β_0 [1 − Var(v_t^x)/(Var(x_t*) + Var(v_t^x))], (5.10)
since x_t* and v_t^x are uncorrelated with u_t* and with each other. This shows that β̂_LS goes to zero as the measurement error becomes relatively more volatile compared with the true value. This makes a lot of sense, since when the measurement error is very large then the regressor x_t is dominated by noise that has nothing to do with the dependent variable.
Suppose instead that only y_t* is measured with error. This is not a big problem since this measurement error is uncorrelated with the regressor, so the consistency of least squares is not affected. In fact, a measurement error in the dependent variable is like increasing the variance in the residual.

5.3 Reason 2 for IV: Simultaneous Equations Bias (and Inconsistency)

Suppose economic theory tells you that the structural form of the m endogenous variables, y_t, and the k predetermined (exogenous) variables, z_t, is
F y_t + G z_t = u_t, where u_t is iid with E u_t = 0 and Cov(u_t) = Σ, (5.11)
where F is m × m, and G is m × k. The disturbances are assumed to be uncorrelated with the predetermined variables, E(z_t u′_t) = 0.
Suppose F is invertible. Solve for y_t to get the reduced form
y_t = −F⁻¹G z_t + F⁻¹u_t (5.12)
= Π z_t + ε_t, with Cov(ε_t) = Ω. (5.13)
The reduced form coefficients, Π, can be consistently estimated by LS on each equation since the exogenous variables z_t are uncorrelated with the reduced form residuals (which are linear combinations of the structural residuals). The fitted residuals can then be used to get an estimate of the reduced form covariance matrix.
The jth line of the structural form (5.11) can be written
F_j y_t + G_j z_t = u_{jt}, (5.14)
where F_j and G_j are the jth rows of F and G, respectively. Suppose the model is normalized so that the coefficient on y_{jt} is one (otherwise, divide (5.14) with this coefficient).
Then, rewrite (5.14) as
y_{jt} = −G_{j1} z̃_t − F_{j1} ỹ_t + u_{jt}
= x′_t β + u_{jt}, where x′_t = [z̃′_t, ỹ′_t], (5.15)
where z̃_t and ỹ_t are the exogenous and endogenous variables that enter the jth equation, which we collect in the x_t vector to highlight that (5.15) looks like any other linear regression equation. The problem with (5.15), however, is that the residual is likely to be correlated with the regressors, so the LS estimator is inconsistent. The reason is that a shock to u_{jt} influences y_{jt}, which in turn will affect some other endogenous variables in the system (5.11). If any of these endogenous variables are in x_t in (5.15), then there is a correlation between the residual and (some of) the regressors.
Note that the concept of endogeneity discussed here only refers to contemporaneous endogeneity as captured by off-diagonal elements in F in (5.11). The vector of predetermined variables, z_t, could very well include lags of y_t without affecting the econometric endogeneity problem.

Example 1 (Supply and Demand. Reference: GR 16, Hamilton 9.1.) Consider the simplest simultaneous equations model for supply and demand on a market. Supply is
q_t = γ p_t + u_t^s, γ > 0,
and demand is
q_t = β p_t + α A_t + u_t^d, β < 0,
where A_t is an observable demand shock (perhaps income). The structural form is therefore
[1 −γ; 1 −β][q_t; p_t] + [0; −α] A_t = [u_t^s; u_t^d].
The reduced form is
[q_t; p_t] = [π_11; π_21] A_t + [ε_1t; ε_2t].
If we knew the structural form, then we can solve for q_t and p_t to get the reduced form in terms of the structural parameters
[q_t; p_t] = [−γα/(β − γ); −α/(β − γ)] A_t + [β/(β − γ) −γ/(β − γ); 1/(β − γ) −1/(β − γ)][u_t^s; u_t^d].

Example 2 (Supply equation with LS.) Suppose we try to estimate the supply equation in Example 1 by LS, that is, we run the regression
q_t = θ p_t + ε_t.
If data is generated by the model in Example 1, then the reduced form shows that p_t is correlated with u_t^s, so we cannot hope that LS will be consistent. In fact, when both q_t and p_t have zero means, then the probability limit of the LS estimator is
plim θ̂ = Cov(q_t, p_t)/Var(p_t)
= Cov(γα/(γ − β) A_t + γ/(γ − β) u_t^d − β/(γ − β) u_t^s, α/(γ − β) A_t + 1/(γ − β) u_t^d − 1/(γ − β) u_t^s) / Var(α/(γ − β) A_t + 1/(γ − β) u_t^d − 1/(γ − β) u_t^s),
where the second line follows from the reduced form. Suppose the supply and demand shocks are uncorrelated. In that case we get
plim θ̂ = [γα²/(γ − β)² Var(A_t) + γ/(γ − β)² Var(u_t^d) + β/(γ − β)² Var(u_t^s)] / [α²/(γ − β)² Var(A_t) + 1/(γ − β)² Var(u_t^d) + 1/(γ − β)² Var(u_t^s)]
= [γα² Var(A_t) + γ Var(u_t^d) + β Var(u_t^s)] / [α² Var(A_t) + Var(u_t^d) + Var(u_t^s)].
First, suppose the supply shocks are zero, Var(u_t^s) = 0; then plim θ̂ = γ, so we indeed estimate the supply elasticity, as we wanted. Think of a fixed supply curve, and a demand curve which moves around. These points of p_t and q_t should trace out the supply curve. It is clearly u_t^s that causes a simultaneous equations problem in estimating the supply curve: u_t^s affects both q_t and p_t, and the latter is the regressor in the supply equation. With no movements in u_t^s there is no correlation between the shock and the regressor. Second, suppose instead that both demand shocks are zero (both A_t = 0 and Var(u_t^d) = 0). Then plim θ̂ = β, so the estimated value is not the supply, but the demand elasticity. Not good. This time, think of a fixed demand curve, and a supply curve which moves around.
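The formula for plim θ̂ in Example 2 can be checked with a short simulation (Python/numpy; the parameter values are arbitrary illustrative choices consistent with γ > 0 and β < 0):

import numpy as np

rng = np.random.default_rng(4)
gamma, beta, alpha = 1.0, -1.0, 1.0           # made-up structural parameters
var_A, var_d, var_s = 1.0, 1.0, 1.0
T = 100000

A = np.sqrt(var_A) * rng.standard_normal(T)
u_d = np.sqrt(var_d) * rng.standard_normal(T)
u_s = np.sqrt(var_s) * rng.standard_normal(T)
p = (alpha * A + u_d - u_s) / (gamma - beta)  # reduced form price
q = gamma * p + u_s                           # supply equation

theta_ls = (p @ q) / (p @ p)                  # LS of q_t on p_t
theta_plim = (gamma * alpha**2 * var_A + gamma * var_d + beta * var_s) / \
             (alpha**2 * var_A + var_d + var_s)
print(theta_ls, theta_plim)                   # both around 1/3 for these numbers

The IV estimator discussed in the next section, which uses A_t as an instrument for p_t, recovers γ instead.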
[Figure 5.1: Distribution of LS estimator of autoregressive parameter. True model: y_t = ρy_{t−1} + ε_t, where ε_t = ν_t + θν_{t−1}, with ρ = 0.8, θ = 0.5 and ν_t iid N(0,2). Estimated model: y_t = ρy_{t−1} + u_t. The panels show the distribution of the LS estimator for T = 200 (mean 0.88, std 0.03) and T = 900 (mean 0.89, std 0.01).]

Example 3 (A flat demand curve.) Suppose we change the demand curve in Example 1 to be infinitely elastic, but to still have demand shocks. For instance, the inverse demand curve could be p_t = ψ A_t + u_t^D. In this case, the supply and demand system is no longer a simultaneous system of equations and both equations could be estimated consistently with LS. In fact, the system is recursive, which is easily seen by writing the system on vector form
[1 0; 1 −γ][p_t; q_t] + [−ψ; 0] A_t = [u_t^D; u_t^s].
A supply shock, u_t^s, affects the quantity, but this has no effect on the price (the regressor in the supply equation), so there is no correlation between the residual and regressor in the supply equation. A demand shock, u_t^D, affects the price and the quantity, but since quantity is not a regressor in the inverse demand function (only the exogenous A_t is) there is no correlation between the residual and the regressor in the inverse demand equation either.

[Figure 5.2: Distribution of LS, IV and MLE estimators of autoregressive parameter. See Figure 5.1 for details. For T = 200 the means (and stds) are roughly 0.88 (0.03) for LS, 0.78 (0.05) for IV and 0.78 (0.05) for ML; for T = 900 they are 0.89 (0.01), 0.80 (0.02) and 0.80 (0.02).]

5.4 Definition of the IV Estimator—Consistency of IV

Reference: Greene (2000) 9.5; Hamilton (1994) 8.2; and Pindyck and Rubinfeld (1997) 7.
Consider the linear model
y_t = x′_t β_0 + u_t, (5.16)
where y_t is a scalar, x_t a k × 1 vector, and β_0 is a vector of the true coefficients. If we suspect that x_t and u_t in (5.16) are correlated, then we may use the instrumental variables (IV) method. To do that, let z_t be a k × 1 vector of instruments (as many instruments as regressors; we will later deal with the case when we have more instruments than regressors). If x_t and u_t are not correlated, then setting x_t = z_t gives the least squares (LS) method.
Recall that LS minimizes the variance of the fitted residuals, û_t = y_t − x_t'β̂_LS. The
first order conditions for that optimization problem are
\[
0_{k\times 1} = \frac{1}{T}\sum_{t=1}^{T} x_t\bigl(y_t - x_t'\hat{\beta}_{LS}\bigr). \qquad (5.17)
\]
If x_t and u_t are correlated, then plim β̂_LS ≠ β_0. The reason is that the probability limit of
the right hand side of (5.17) is Cov(x_t, y_t − x_t'β̂_LS), which at β̂_LS = β_0 is non-zero, so the
first order conditions (in the limit) cannot be satisfied at the true parameter values. Note
that since the LS estimator by construction forces the fitted residuals to be uncorrelated
with the regressors, the properties of the LS residuals are of little help in deciding whether
to use LS or IV.

The idea of the IV method is to replace the first x_t in (5.17) with a vector (of similar
size) of some instruments, z_t. The identifying assumption of the IV method is that the
instruments are uncorrelated with the residuals (and, as we will see, correlated with the
regressors)
\[
0_{k\times 1} = \mathrm{E}\, z_t u_t \qquad (5.18)
\]
\[
\phantom{0_{k\times 1}} = \mathrm{E}\, z_t\bigl(y_t - x_t'\beta_0\bigr). \qquad (5.19)
\]
The intuition is that the linear model (5.16) is assumed to be correctly specified: the
residuals, u_t, represent factors which we cannot explain, so z_t should not contain any
information about u_t.

The sample analogue to (5.19) defines the IV estimator of β as
\[
0_{k\times 1} = \frac{1}{T}\sum_{t=1}^{T} z_t\bigl(y_t - x_t'\hat{\beta}_{IV}\bigr), \text{ or} \qquad (5.20)
\]
\[
\hat{\beta}_{IV} = \left(\frac{1}{T}\sum_{t=1}^{T} z_t x_t'\right)^{-1}\frac{1}{T}\sum_{t=1}^{T} z_t y_t. \qquad (5.21)
\]
In matrix notation, where z_t' is the t-th row of Z, we have β̂_IV = (Z'X/T)^{-1}(Z'Y/T).
It is clearly necessary for Σ z_t x_t'/T to have full rank to calculate the IV estimator.

Remark 4 (Probability limit of product) For any random variables y_T and x_T where
plim y_T = a and plim x_T = b (a and b are constants), we have plim y_T x_T = ab.

To see if the IV estimator is consistent, use (5.16) to substitute for y_t in (5.20) and
take the probability limit
\[
\operatorname{plim}\frac{1}{T}\sum_{t=1}^{T} z_t x_t'\beta_0 + \operatorname{plim}\frac{1}{T}\sum_{t=1}^{T} z_t u_t = \operatorname{plim}\frac{1}{T}\sum_{t=1}^{T} z_t x_t'\hat{\beta}_{IV}. \qquad (5.22)
\]
Two things are required for consistency of the IV estimator, plim β̂_IV = β_0. First, that
plim Σ z_t u_t /T = 0. Provided a law of large numbers applies, this is condition (5.18).
Second, that plim Σ z_t x_t'/T has full rank. To see this, suppose plim Σ z_t u_t /T = 0 is
satisfied. Then, (5.22) can be written
\[
\operatorname{plim}\left(\frac{1}{T}\sum_{t=1}^{T} z_t x_t'\right)\bigl(\beta_0 - \operatorname{plim}\hat{\beta}_{IV}\bigr) = 0. \qquad (5.23)
\]
If plim Σ z_t x_t'/T has reduced rank, then plim β̂_IV does not need to equal β_0 for (5.23) to
be satisfied. In practical terms, the first order conditions (5.20) do then not define a unique
value of the vector of estimates. If a law of large numbers applies, then plim Σ z_t x_t'/T =
E z_t x_t'. If both z_t and x_t contain constants (or at least one of them has zero means), then
a reduced rank of E z_t x_t' would be a consequence of a reduced rank of the covariance
matrix of the stochastic elements in z_t and x_t, for instance, that some of the instruments
are uncorrelated with all the regressors. This shows that the instruments must indeed be
correlated with the regressors for IV to be consistent (and to make sense).

Remark 5 (Second moment matrix) Note that E zx' = E z E x' + Cov(z, x). If E z = 0
and/or E x = 0, then the second moment matrix is a covariance matrix. Alternatively,
suppose both z and x contain constants normalized to unity: z = [1, z̃']' and x = [1, x̃']',
where z̃ and x̃ are random vectors. We can then write
\[
\mathrm{E}\, zx' =
\begin{bmatrix} 1 \\ \mathrm{E}\,\tilde{z} \end{bmatrix}
\begin{bmatrix} 1 & \mathrm{E}\,\tilde{x}' \end{bmatrix} +
\begin{bmatrix} 0 & 0 \\ 0 & \operatorname{Cov}(\tilde{z},\tilde{x}) \end{bmatrix}
=
\begin{bmatrix} 1 & \mathrm{E}\,\tilde{x}' \\ \mathrm{E}\,\tilde{z} & \mathrm{E}\,\tilde{z}\,\mathrm{E}\,\tilde{x}' + \operatorname{Cov}(\tilde{z},\tilde{x}) \end{bmatrix}.
\]
For simplicity, suppose z̃ and x̃ are scalars. Then E zx' has reduced rank if Cov(z̃, x̃) = 0,
since Cov(z̃, x̃) is then the determinant of E zx'. This is true also when z̃ and x̃ are vectors.
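To make (5.21) concrete, the following is a minimal numerical sketch (not part of the original notes) of LS versus IV for the supply slope in a simultaneous supply–demand system in the spirit of Examples 1–2. The parameter values and variable names are illustrative assumptions only; the point is that LS of quantity on price is inconsistent, while IV with the observable demand shifter A_t as instrument recovers the supply slope.

    import numpy as np

    # Illustrative simultaneous system: supply q = gamma*p + u_s, demand q = beta*p + alpha*A + u_d
    rng = np.random.default_rng(0)
    T, gamma, beta, alpha = 10_000, 1.0, -1.0, 1.0

    A   = rng.normal(size=T)          # observable demand shifter (the instrument)
    u_s = rng.normal(size=T)          # supply shock
    u_d = rng.normal(size=T)          # demand shock

    # reduced form: solve the two equations for p, then get q from the supply curve
    p = (alpha * A + u_d - u_s) / (gamma - beta)
    q = gamma * p + u_s

    gamma_ls = (p @ q) / (p @ p)      # LS of q on p: inconsistent, since p is correlated with u_s
    gamma_iv = (A @ q) / (A @ p)      # IV, cf. (5.21) with z_t = A_t
    print(f"true {gamma}, LS {gamma_ls:.2f}, IV {gamma_iv:.2f}")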


Example 6 (Supply equation with IV.) Suppose we try to estimate the supply equation in If the first term on the right hand side converges in probability to a finite matrix (as as-
Example 1 by IV. The only available instrument is At , so (5.21) becomes sumed in in proving consistency), and the vector of random variables z t u t satisfies a
!−1 central limit theorem, then
T T
1X 1X
γ̂ I V = A t pt A t qt , √ d
 
T T T (β̂ I V − β0 ) → N 0, 6zx −1
S0 6x−1
z , where (5.26)
t=1 t=1
T √ T !
so the probability limit is 1X 0 T X
6zx = z t xt and S0 = Cov zt u t .
T T
t=1 t=1
plim γ̂ I V = Cov (At , pt )−1 Cov (At , qt ) ,
0
The last matrix in the covariance matrix follows from (6zx −1 )0 = (6 )−1 = 6 −1 . This
zx xz
since all variables have zero means. From the reduced form in Example 1 we see that general expression is valid for both autocorrelated and heteroskedastic residuals—all such
1 γ features are loaded into the S0 matrix. Note that S0 is the variance-covariance matrix of
Cov (At , pt ) = − αVar (At ) and Cov (At , qt ) = − αVar (At ) , √
β −γ β −γ T times a sample average (of the vector of random variables xt u t ).
so
Example 8 (Choice of instrument in IV, simplest case) Consider the simple regression
−1 
γ
 
1
plim γ̂ I V = − αVar (At ) − αVar (At ) yt = β1 xt + u t .
β −γ β −γ
= γ.
The asymptotic variance of the IV estimator is
This shows that γ̂ I V is consistent. √ T
!
√ T X
AVar( T (β̂ I V − β0 )) = Var z t u t / Cov (z t , xt )2
T
t=1
5.4.1 Asymptotic Normality of IV

T z u / T) =
Little is known about the finite sample distribution of the IV estimator, so we focus on the If z t and u t is serially uncorrelated and independent of each other, then Var(6t=1 t t

asymptotic distribution—assuming the IV estimator is consistent. Var(z t ) Var(u t ). We can then write
d √ Var(z t ) Var(u t )
Remark 7 If x T → x (a random variable) and plim Q T = Q (a constant matrix), then AVar( T (β̂ I V − β0 )) = Var(u t ) = .
d
Q T x T → Qx. Cov (z t , xt )2 Var(xt )Corr (z t , xt )2

Use (5.16) to substitute for yt in (5.20) An instrument with a weak correlation with the regressor gives an imprecise estimator.
With a perfect correlation, then we get the precision of the LS estimator (which is precise,
T
!−1 T
1X 0 1X but perhaps not consistent).
β̂ I V = β0 + z t xt zt u t . (5.24)
T T
t=1 t=1
√ 5.4.2 2SLS
Premultiply by T and rearrange as
Suppose now that we have more instruments, z t , than regressors, xt . The IV method does
T
!−1 √ T
√ 1X 0 T X not work since, there are then more equations than unknowns in (5.20). Instead, we can
T (β̂ I V − β0 ) = z t xt zt u t . (5.25) use the 2SLS estimator. It has two steps. First, regress all elements in xt on all elements
T T
t=1 t=1

in z t with LS. Second, use the fitted values of xt , denoted x̂t , as instruments in the IV that depend on At (the observable shifts of the demand curve) are used. Movements in pt
method (use x̂t in place of z t in the equations above). In can be shown that this is the most which are due to the unobservable demand and supply shocks are disregarded in p̂t . We
efficient use of the information in z t . The IV is clearly a special case of 2SLS (when z t know from Example 2 that it is the supply shocks that make the LS estimate of the supply
has the same number of elements as xt ). curve inconsistent. The IV method suppresses both them and the unobservable demand
It is immediate from (5.22) that 2SLS is consistent under the same condiditons as shock.
PT
IV since x̂t is a linear function of the instruments, so plim t=1 x̂t u t /T = 0, if all the
instruments are uncorrelated with u t . 5.5 Hausman’s Specification Test∗
The name, 2SLS, comes from the fact that we get exactly the same result if we replace
the second step with the following: regress yt on x̂t with LS. Reference: Greene (2000) 9.5
This test is constructed to test if an efficient estimator (like LS) gives (approximately)
Example 9 (Supply equation with 2SLS.). With only one instrument, At , this is the same
the same estimate as a consistent estimator (like IV). If not, the efficient estimator is most
as Example 6, but presented in another way. First, regress pt on At
likely inconsistent. It is therefore a way to test for the presence of endogeneity and/or
Cov ( pt , At ) 1 measurement errors.
pt = δ At + u t ⇒ plim δ̂ L S = =− α.
Var (At ) β −γ Let β̂e be an estimator that is consistent and asymptotically efficient when the null
Construct the predicted values as hypothesis, H0 , is true, but inconsistent when H0 is false. Let β̂c be an estimator that is
consistent under both H0 and the alternative hypothesis. When H0 is true, the asymptotic
p̂t = δ̂ L S At . distribution is such that    
Cov β̂e , β̂c = Var β̂e . (5.27)
Second, regress qt on p̂t
d qt , p̂t
Cov
 Proof. Consider the estimator λβ̂c + (1 − λ) β̂e , which is clearly consistent under H0
qt = γ p̂t + et , with plim γ̂2S L S = plim  . since both β̂c and β̂e are. The asymptotic variance of this estimator is
Var
c p̂t
     
Use p̂t = δ̂ L S At and Slutsky’s theorem λ2 Var β̂c + (1 − λ)2 Var β̂e + 2λ (1 − λ) Cov β̂c , β̂e ,
 
d qt , δ̂ L S At
plim Cov which is minimized at λ = 0 (since β̂e is asymptotically efficient). The first order condi-
plim γ̂2S L S =   tion with respect to λ
c δ̂ L S At
plim Var
     
Cov (qt , At ) plim δ̂ L S 2λVar β̂c − 2 (1 − λ) Var β̂e + 2 (1 − 2λ) Cov β̂c , β̂e = 0
=
Var (At ) plim δ̂ 2L S
should therefore be zero at λ = 0 so
γ
h ih i
− β−γ αVar (At ) − β−γ 1
α    
= h i2 Var β̂e = Cov β̂c , β̂e .
Var (At ) − β−γ 1
α

= γ. (See Davidson (2000) 8.1)

Note that the trick here is to suppress some the movements in pt . Only those movements

This means that we can write
\[
\operatorname{Var}\bigl(\hat{\beta}_e - \hat{\beta}_c\bigr)
= \operatorname{Var}\bigl(\hat{\beta}_e\bigr) + \operatorname{Var}\bigl(\hat{\beta}_c\bigr) - 2\operatorname{Cov}\bigl(\hat{\beta}_e, \hat{\beta}_c\bigr)
= \operatorname{Var}\bigl(\hat{\beta}_c\bigr) - \operatorname{Var}\bigl(\hat{\beta}_e\bigr). \qquad (5.28)
\]
We can use this to test, for instance, if the estimates from least squares (β̂_e, since LS
is efficient if the errors are iid normally distributed) and from the instrumental variable method (β̂_c,
since it is consistent even if the true residuals are correlated with the regressors) are the same.
In this case, H_0 is that the true residuals are uncorrelated with the regressors.

All we need for this test are the point estimates and consistent estimates of the variance
matrices. Testing one of the coefficients can be done by a t test, and testing all the
parameters by a χ² test
\[
\bigl(\hat{\beta}_e - \hat{\beta}_c\bigr)'\Bigl[\operatorname{Var}\bigl(\hat{\beta}_e - \hat{\beta}_c\bigr)\Bigr]^{-1}\bigl(\hat{\beta}_e - \hat{\beta}_c\bigr) \sim \chi^2(j), \qquad (5.29)
\]
where j equals the number of regressors that are potentially endogenous or measured with
error. Note that the covariance matrix in (5.28) and (5.29) is likely to have a reduced rank,
so the inverse needs to be calculated as a generalized inverse.
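As an illustration (not in the original notes), a minimal sketch of how (5.28)–(5.29) could be computed once the two sets of point estimates and covariance matrix estimates are available. The function name and the use of the matrix rank as the degrees of freedom are illustrative choices; the generalized inverse is the pseudo-inverse suggested above.

    import numpy as np
    from scipy.stats import chi2

    def hausman_test(b_e, V_e, b_c, V_c):
        """Hausman statistic (5.29): b_e, V_e from the efficient estimator (e.g. LS),
        b_c, V_c from the consistent estimator (e.g. IV). Returns (statistic, p-value)."""
        d = b_e - b_c
        V_d = V_c - V_e                        # Var(b_e - b_c) by (5.28)
        stat = d @ np.linalg.pinv(V_d) @ d     # pseudo-inverse, since V_d may have reduced rank
        j = np.linalg.matrix_rank(V_d)         # one practical choice for the degrees of freedom
        return stat, chi2.sf(stat, j)

A large statistic (small p-value) rejects H_0, suggesting that the "efficient" estimator is in fact inconsistent (for instance, because of endogeneity or measurement errors).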

5.6 Tests of Overidentifying Restrictions in 2SLS∗

When we use 2SLS, then we can test if the instruments affect the dependent variable only
via their correlation with the regressors. If not, something is wrong with the model since
some relevant variables are excluded from the regression.

Bibliography

Davidson, J., 2000, Econometric Theory, Blackwell Publishers, Oxford.

Greene, W. H., 2000, Econometric Analysis, Prentice-Hall, Upper Saddle River, New Jersey, 4th edn.

Hamilton, J. D., 1994, Time Series Analysis, Princeton University Press, Princeton.

Hayashi, F., 2000, Econometrics, Princeton University Press.

Pindyck, R. S., and D. L. Rubinfeld, 1997, Econometric Models and Economic Forecasts, Irwin McGraw-Hill, Boston, Massachusetts, 4th edn.

Verbeek, M., 2000, A Guide to Modern Econometrics, Wiley, Chichester.

6 Simulating the Finite Sample Properties

Reference: Greene (2000) 5.3
Additional references: Cochrane (2001) 15.2; Davidson and MacKinnon (1993) 21; Davison and Hinkley (1997); Efron and Tibshirani (1993) (bootstrapping, chap 9 in particular); and Berkowitz and Kilian (2000) (bootstrapping in time series models)

We know the small sample properties of regression coefficients in linear models with
fixed regressors (X is non-stochastic) and iid normal error terms. Monte Carlo simulations
and bootstrapping are two common techniques used to understand the small sample
properties when these conditions are not satisfied.

6.1 Monte Carlo Simulations in the Simplest Case

Monte Carlo simulation is essentially a way to generate many artificial (small) samples
from a parameterized model and then estimate the statistics of interest on each of those samples.
The distribution of the statistics is then used as the small sample distribution of the estimator.

The following is an example of how Monte Carlo simulations could be done in the
special case of a linear model for a scalar dependent variable
\[
y_t = x_t'\beta + u_t, \qquad (6.1)
\]
where u_t is iid N(0, σ²) and x_t is stochastic but independent of u_{t±s} for all s. This means
that x_t cannot include lags of y_t.

Suppose we want to find the small sample distribution of a function of the estimate,
g(β̂). To do a Monte Carlo experiment, we need information on (i) β; (ii) the variance of
u_t, σ²; and (iii) a process for x_t.

The process for x_t is typically estimated from the data on x_t. For instance, we could
estimate the VAR system x_t = A_1 x_{t−1} + A_2 x_{t−2} + e_t. An alternative is to take an actual
sample of x_t and repeat it.

The values of β and σ² are often a mix of estimation results and theory. In some
cases, we simply take the point estimates. In other cases, we adjust the point estimates
so that g(β) = 0 holds, that is, so you simulate the model under the null hypothesis
in order to study the size of asymptotic tests and to find valid critical values for small
samples. Alternatively, you may simulate the model under an alternative hypothesis in
order to study the power of the test, using critical values from either the asymptotic
distribution or from a (perhaps simulated) small sample distribution.

To make it a bit concrete, suppose you want to use these simulations to get a 5%
critical value for testing the null hypothesis g(β) = 0. The Monte Carlo experiment
follows these steps.

1. (a) Construct an artificial sample of the regressors (see above), {x̃_t}_{t=1}^T.
   (b) Draw random numbers {ũ_t}_{t=1}^T and use those together with the artificial sample
   of x̃_t to calculate an artificial sample {ỹ_t}_{t=1}^T by using (6.1). Calculate an
   estimate β̂ and record it along with the value of g(β̂) and perhaps also the test
   statistic of the hypothesis that g(β) = 0.

2. Repeat the previous steps N (3000, say) times. The more times you repeat, the
   better is the approximation of the small sample distribution.

3. Sort your simulated β̂, g(β̂), and the test statistics in ascending order. For a one-sided
   test (for instance, a chi-square test), take the (0.95N)th observation in these
   sorted vectors as your 5% critical value. For a two-sided test (for instance, a t-test),
   take the (0.025N)th and (0.975N)th observations as the 5% critical values. You can
   also record how many times the 5% critical values from the asymptotic distribution
   would reject a true null hypothesis.

4. You may also want to plot a histogram of β̂, g(β̂), and the test statistics to see
   if there is a small sample bias, and what the distribution looks like. Is it close to
   normal? How wide is it?

5. See Figure 6.1 for an example, and the code sketch after Remark 1 below for a
   minimal illustration of steps 1–3.

Remark 1 (Generating N(µ, Σ) random numbers) Suppose you want to draw an n × 1
vector ε_t of N(µ, Σ) variables. Use the Cholesky decomposition to calculate the lower
triangular P such that Σ = PP' (note that Gauss and MatLab return P' instead of
P). Draw u_t from an N(0, I) distribution (randn in MatLab, rndn in Gauss), and define
ε_t = µ + Pu_t. Note that Cov(ε_t) = E Pu_t u_t'P' = PIP' = Σ.
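A minimal sketch (not from the notes) of steps 1–3 for the simple regression (6.1) with a single regressor. The parameter values are illustrative, and the errors are drawn from a t_3 distribution (as in Figure 6.2) rather than the normal; the artificial regressor sample is drawn once and then kept fixed. The sketch records the t-statistic of the true null β = β_0 in each artificial sample and compares the simulated two-sided 5% critical value with the asymptotic 1.96.

    import numpy as np

    rng = np.random.default_rng(0)
    N, T, beta0 = 3000, 50, 0.9

    x = rng.normal(size=T)                        # step 1(a): artificial regressor sample
    tstats = np.empty(N)
    for i in range(N):
        u = rng.standard_t(df=3, size=T)          # step 1(b): draw errors (thick-tailed here)
        y = beta0 * x + u
        b = (x @ y) / (x @ x)                     # LS estimate
        resid = y - b * x
        s2 = resid @ resid / (T - 1)              # error variance estimate
        se = np.sqrt(s2 / (x @ x))                # standard error of b
        tstats[i] = (b - beta0) / se              # t-stat of the (true) null hypothesis

    # step 3: simulated two-sided 5% critical value vs the asymptotic value
    crit = np.quantile(np.abs(tstats), 0.95)
    print(f"simulated 5% critical value {crit:.2f}, asymptotic 1.96, "
          f"rejection rate at 1.96: {np.mean(np.abs(tstats) > 1.96):.3f}")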

[Figure 6.1: Results from a Monte Carlo experiment of LS estimation of the AR coefficient. Data generated by an AR(1) process, 5000 simulations. Panels: mean LS estimate of y_t = 0.9y_{t−1} + ε_t, and √T × Std of the LS estimate (simulation vs asymptotic, based on σ²(X'X/T)^{-1}), plotted against the sample size T.]

6.2 Monte Carlo Simulations in More Complicated Cases∗

6.2.1 When x_t Includes Lags of y_t

If x_t contains lags of y_t, then we must set up the simulations so that this feature is preserved in
every artificial sample that we create. For instance, suppose x_t includes y_{t−1} and another
vector z_t of variables which are independent of u_{t±s} for all s. We can then generate an
artificial sample as follows. First, create a sample {z̃_t}_{t=1}^T by some time series model or
by taking the observed sample itself (as we did with x_t in the simplest case). Second,
observation t of {x̃_t, ỹ_t} is generated as
\[
\tilde{x}_t = \begin{bmatrix} \tilde{y}_{t-1} \\ \tilde{z}_t \end{bmatrix}
\text{ and } \tilde{y}_t = \tilde{x}_t'\beta + u_t, \qquad (6.2)
\]
which is repeated for t = 1, ..., T. We clearly need the initial value y_0 to start up the
artificial sample, so one observation from the original sample is lost.

[Figure 6.2: Results from a Monte Carlo experiment with thick-tailed errors. The regressor is iid normally distributed. The errors have a t_3-distribution, 5000 simulations. Model: R_t = 0.9 f_t + ε_t. Panels show the distribution of √T × (b_LS − 0.9) for T = 10, 100 and 1000. Kurtosis for T = 10, 100, 1000: 46.9, 6.1, 4.1. Rejection rates of |t-stat| > 1.645: 0.16, 0.10, 0.10; of |t-stat| > 1.96: 0.10, 0.05, 0.06.]

6.2.2 More Complicated Errors

It is straightforward to sample the errors from other distributions than the normal, for instance,
a uniform distribution. Equipped with uniformly distributed random numbers, you
can always (numerically) invert the cumulative distribution function (cdf) of any distribution
to generate random variables from any distribution by using the probability transformation
method. See Figure 6.2 for an example.

Remark 2 Let X ∼ U(0, 1) and consider the transformation Y = F^{-1}(X), where F^{-1}()
is the inverse of a strictly increasing cdf F; then Y has the cdf F(). (Proof: follows from
the lemma on change of variable in a density function.)

Example 3 The exponential cdf is x = 1 − exp(−θy) with inverse y = −ln(1 − x)/θ.
Draw x from U(0, 1) and transform to y to get an exponentially distributed variable.
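A short numerical illustration (not in the notes) of Remark 2 and Example 3: draw uniforms and push them through the inverse exponential cdf. The rate parameter is an illustrative choice.

    import numpy as np

    rng = np.random.default_rng(0)
    theta = 2.0                              # illustrative rate parameter
    x = rng.uniform(size=100_000)            # X ~ U(0,1)
    y = -np.log(1 - x) / theta               # Y = F^{-1}(X) is exponential with rate theta
    print(y.mean(), 1 / theta)               # the sample mean should be close to 1/theta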

It is more difficult to handle non-iid errors, for instance, heteroskedasticity and autocorrelation.
We then need to model the error process and generate the errors from that
model. For instance, if the errors are assumed to follow an AR(2) process, then we could
estimate that process from the errors in (6.1) and then generate artificial samples of errors.

6.3 Bootstrapping in the Simplest Case

Bootstrapping is another way to do simulations, where we construct artificial samples by
sampling from the actual data. The advantage of the bootstrap is then that we do not have
to try to estimate the process of the errors and regressors as we must do in a Monte Carlo
experiment. The real benefit of this is that we do not have to make any strong assumption
about the distribution of the errors.

The bootstrap approach works particularly well when the errors are iid and independent
of x_{t−s} for all s. This means that x_t cannot include lags of y_t. We here consider
bootstrapping the linear model (6.1), for which we have point estimates (perhaps from
LS) and fitted residuals. The procedure is similar to the Monte Carlo approach, except
that the artificial sample is generated differently. In particular, Step 1 in the Monte Carlo
simulation is replaced by the following:

1. Construct an artificial sample {ỹ_t}_{t=1}^T by
\[
\tilde{y}_t = x_t'\beta + \tilde{u}_t, \qquad (6.3)
\]
   where ũ_t is drawn (with replacement) from the fitted residuals and where β is the
   point estimate. Calculate an estimate β̂ and record it along with the value of g(β̂)
   and perhaps also the test statistic of the hypothesis that g(β) = 0.

6.4 Bootstrapping in More Complicated Cases∗

6.4.1 Case 2: Errors are iid but Correlated With x_{t+s}

When x_t contains lagged values of y_t, then we have to modify the approach in (6.3) since
ũ_t can become correlated with x_t. For instance, if x_t includes y_{t−1} and we happen to
sample ũ_t = û_{t−1}, then we get a non-zero correlation. The easiest way to handle this is as
in the Monte Carlo simulations: replace any y_{t−1} in x_t by ỹ_{t−1}, that is, the corresponding
observation in the artificial sample.

6.4.2 Case 3: Errors are Heteroskedastic but Uncorrelated with x_{t±s}

Case 1 and 2 both draw errors randomly—based on the assumption that the errors are
iid. Suppose instead that the errors are heteroskedastic, but still serially uncorrelated.
We know that if the heteroskedasticity is related to the regressors, then the traditional LS
covariance matrix is not correct (this is the case that White's test for heteroskedasticity
tries to identify). It would then be wrong to pair x_t with just any û_s since that destroys the
relation between x_t and the variance of u_t.

An alternative way of bootstrapping can then be used: generate the artificial sample
by drawing (with replacement) pairs (y_s, x_s), that is, we let the artificial pair in t be
(ỹ_t, x̃_t) = (x_s'β̂_0 + û_s, x_s) for some random draw of s, so we are always pairing the residual,
û_s, with the contemporaneous regressors, x_s. Note that we are always sampling with
replacement—otherwise the approach of drawing pairs would just re-create the original
data set. For instance, if the data set contains 3 observations, then an artificial sample could
be
\[
\begin{bmatrix} (\tilde{y}_1, \tilde{x}_1) \\ (\tilde{y}_2, \tilde{x}_2) \\ (\tilde{y}_3, \tilde{x}_3) \end{bmatrix}
=
\begin{bmatrix} (x_2'\hat{\beta}_0 + \hat{u}_2,\ x_2) \\ (x_3'\hat{\beta}_0 + \hat{u}_3,\ x_3) \\ (x_3'\hat{\beta}_0 + \hat{u}_3,\ x_3) \end{bmatrix}.
\]
In contrast, when we sample (with replacement) û_s, as we did above, then an artificial
sample could be
\[
\begin{bmatrix} (\tilde{y}_1, \tilde{x}_1) \\ (\tilde{y}_2, \tilde{x}_2) \\ (\tilde{y}_3, \tilde{x}_3) \end{bmatrix}
=
\begin{bmatrix} (x_1'\hat{\beta}_0 + \hat{u}_2,\ x_1) \\ (x_2'\hat{\beta}_0 + \hat{u}_1,\ x_2) \\ (x_3'\hat{\beta}_0 + \hat{u}_2,\ x_3) \end{bmatrix}.
\]
Davidson and MacKinnon (1993) argue that bootstrapping the pairs (y_s, x_s) makes
little sense when x_s contains lags of y_s, since there is no way to construct lags of y_s in the
bootstrap. However, what is important for the estimation is sample averages of various
functions of the dependent and independent variable within a period—not how they line up
over time (provided the assumption of no autocorrelation of the residuals is true).
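A minimal sketch (not part of the notes) of the two resampling schemes above for a simple linear regression: resampling residuals as in (6.3), and resampling (y, x) pairs. The data-generating process and sample sizes are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    T, N = 100, 2000
    X = np.column_stack([np.ones(T), rng.normal(size=T)])     # constant + one regressor
    y = X @ np.array([1.0, 0.5]) + rng.normal(size=T)          # illustrative data
    b_hat = np.linalg.lstsq(X, y, rcond=None)[0]               # point estimate
    u_hat = y - X @ b_hat                                      # fitted residuals

    b_resid, b_pairs = np.empty((N, 2)), np.empty((N, 2))
    for i in range(N):
        s = rng.integers(0, T, size=T)                         # draw indices with replacement
        y_star = X @ b_hat + u_hat[s]                          # (6.3): keep X, resample residuals
        b_resid[i] = np.linalg.lstsq(X, y_star, rcond=None)[0]
        Xs, ys = X[s], X[s] @ b_hat + u_hat[s]                 # pairs: resample (x_s, x_s'b + u_s) together
        b_pairs[i] = np.linalg.lstsq(Xs, ys, rcond=None)[0]

    print("bootstrap std of slope:", b_resid[:, 1].std(), b_pairs[:, 1].std())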

6.4.3 Other Approaches

There are many other ways to do bootstrapping. For instance, we could sample the regressors
and residuals independently of each other and construct an artificial sample of
the dependent variable ỹ_t = x̃_t'β̂ + ũ_t. This clearly makes sense if the residuals and regressors
are independent of each other and the errors are iid. In that case, the advantage of
this approach is that we do not keep the regressors fixed.

6.4.4 Serially Dependent Errors

It is quite hard to handle the case when the errors are serially dependent, since we must
resample in such a way that we do not destroy the autocorrelation structure of the data.
A common approach is to fit a model for the residuals, for instance, an AR(1), and then
bootstrap the (hopefully iid) innovations to that process.

Another approach amounts to resampling blocks of data. For instance, suppose the
sample has 10 observations, and we decide to create blocks of 3 observations. The first
block is (û_1, û_2, û_3), the second block is (û_2, û_3, û_4), and so forth until the last block,
(û_8, û_9, û_10). If we need a sample of length 3τ, say, then we simply draw τ of those
blocks randomly (with replacement) and stack them to form a longer series. To handle
end point effects (so that all data points have the same probability to be drawn), we also
create blocks by "wrapping" the data around a circle. In practice, this means that we add
the following blocks: (û_10, û_1, û_2) and (û_9, û_10, û_1). An alternative approach is to use
non-overlapping blocks. See Berkowitz and Kilian (2000) for some other recent methods.
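A small sketch (not from the notes) of the moving-blocks resampling just described, including the "wrapped" blocks. The block length, number of blocks and function name are illustrative.

    import numpy as np

    def block_bootstrap(u, block_len, n_blocks, rng):
        """Draw n_blocks overlapping blocks (with wrap-around) and stack them."""
        T = len(u)
        # all T blocks of length block_len, wrapping around the end of the sample
        blocks = np.array([np.take(u, range(i, i + block_len), mode='wrap') for i in range(T)])
        pick = rng.integers(0, T, size=n_blocks)        # draw blocks with replacement
        return np.concatenate(blocks[pick])             # stack into a longer series

    rng = np.random.default_rng(0)
    u_hat = rng.normal(size=10)                         # stand-in for fitted residuals
    u_star = block_bootstrap(u_hat, block_len=3, n_blocks=4, rng=rng)
    print(u_star.shape)                                 # a resampled series of length 3*4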

Bibliography

Berkowitz, J., and L. Kilian, 2000, "Recent Developments in Bootstrapping Time Series," Econometric Reviews, 19, 1–48.

Cochrane, J. H., 2001, Asset Pricing, Princeton University Press, Princeton, New Jersey.

Davidson, R., and J. G. MacKinnon, 1993, Estimation and Inference in Econometrics, Oxford University Press, Oxford.

Davison, A. C., and D. V. Hinkley, 1997, Bootstrap Methods and Their Applications, Cambridge University Press.

Efron, B., and R. J. Tibshirani, 1993, An Introduction to the Bootstrap, Chapman and Hall, New York.

Greene, W. H., 2000, Econometric Analysis, Prentice-Hall, Upper Saddle River, New Jersey, 4th edn.

It does not matter if the parameters are estimated separately or jointly. In contrast, if
we want the correlation, ρx y , instead of the covariance, then we change the last moment
condition to
7 GMM 1 XT √ √
xt yt − ρx y σx x σ yy = 0,
T t=1

References: Greene (2000) 4.7 and 11.5-6 which must be estimated jointly with the first two conditions.
Additional references: Hayashi (2000) 3-4; Verbeek (2000) 5; Hamilton (1994) 14; Ogaki
Example 2 (MM for an MA(1).) For an MA(1), yt = t + θ t−1 , we have
(1993), Johnston and DiNardo (1997) 10; Harris and Matyas (1999); Pindyck and Rubin-
Eyt2 E (t + θ t−1 )2 = σ2 1 + θ 2

feld (1997) Appendix 10.1; Cochrane (2001) 10-11 =
E (yt yt−1 ) = E (t + θ t−1 ) (t−1 + θ t−2 ) = σ2 θ.
 

7.1 Method of Moments The moment conditions could therefore be

Let m (xt ) be a k × 1 vector valued continuous function of a stationary process, and let the
" P  # " #
T
t=1 yt − σ 1 + θ
1 2 2 2 0
T = ,
probability limit of the mean of m (.) be a function γ (.) of a k × 1 vector β of parameters. 1 PT
y y − σ 2θ 0
T t=1 t t−1 
We want to estimate β. The method of moments (MM, not yet generalized to GMM)
estimator is obtained by replacing the probability limit with the sample mean and solving which allows us to estimate θ and σ 2 .
the system of k equations
T 7.2 Generalized Method of Moments
1X
m (xt ) − γ (β) = 0k×1 (7.1)
T GMM extends MM by allowing for more orthogonality conditions than parameters. This
t=1
could, for instance, increase efficiency and/or provide new aspects which can be tested.
for the parameters β.
Many (most) traditional estimation methods, like LS, IV, and MLE are special cases
It is clear that this is a consistent estimator of β if γ is continuous. (Proof: the sample
of GMM. This means that the properties of GMM are very general, and therefore fairly
mean is a consistent estimator of γ (.), and by Slutsky’s theorem plim γ (β̂) = γ (plim β̂)
difficult to prove.
if γ is a continuous function.)

Example 1 (Moment conditions for variances and covariance) Suppose the series xt and 7.3 Moment Conditions in GMM
yt have zero means. The following moment conditions define the traditional variance and
covariance estimators Suppose we have q (unconditional) moment conditions,
 
1 XT Em 1 (wt , β0 )
x 2 − σx x = 0
T t=1 t ..
Em(wt , β0 ) = 
 
1 XT
 . 
y 2 − σ yy = 0

T t=1 t Em q (wt , β0 )
1 XT
xt yt − σx y = 0. = 0q×1 , (7.2)
T t=1

from which we want to estimate the k × 1 (k ≤ q) vector of parameters, β. The true Remark 5 (E(u|z) = 0 versus Euz = 0.) For any random variables u and z,
values are β0 . We assume that wt is a stationary and ergodic (vector) process (otherwise
the sample means does not converge to anything meaningful as the sample size increases). Cov (z, u) = Cov [z, E (u|z)] .
The sample averages, or “sample moment conditions,” evaluated at some value of β, are
The condition E(u|z) = 0 then implies Cov(z, u) = 0. Recall that Cov(z, u) = Ezu−EzEu,
1X
T and that E(u|z) = 0 implies that Eu = 0 (by iterated expectations). We therefore get that
m̄(β) = m(wt , β). (7.3)
T
" #
t=1 Cov (z, u) = 0
E (u|z) = 0 ⇒ ⇒ Euz = 0.
The sample average m̄ (β) is a vector of functions of random variables, so they are ran- Eu = 0
dom variables themselves and depend on the sample used. It will later be interesting to Example 6 (Euler equation for optimal consumption.) The standard Euler equation for
calculate the variance of m̄ (β). Note that m̄(β1 ) and m̄(β2 ) are sample means obtained 1−γ
optimal consumption choice which with isoelastic utility U (Ct ) = Ct / (1 − γ ) is
by using two different parameter vectors, but on the same sample of data. " #
Ct+1 −γ
 
E Rt+1 β − 1 t = 0,

Example 3 (Moments conditions for IV/2SLS.) Consider the linear model yt = xt0 β0 +u t , Ct
where xt and β are k × 1 vectors. Let z t be a q × 1 vector, with q ≥ k. The moment
where Rt+1 is a gross return on an investment and t is the information set in t. Let
conditions and their sample analogues are
z t ∈ t , for instance asset returns or consumption t or earlier. The Euler equation then
T implies
1X
0q×1 = Ez t u t = E[z t (yt − xt0 β0 )], and m̄ (β) = z t (yt − xt0 β),
" #
Ct+1 −γ
 
T E Rt+1 β z t − z t = 0.
t=1
Ct
(or Z 0 (Y − Xβ)/T in matrix form). Let q = k to get IV; let z t = xt to get LS.
Let z t = (z 1t , ..., z nt )0 , and define the new (unconditional) moment conditions as
Example 4 (Moments conditions for MLE.) The maximum likelihood estimator maxi-  
u 1 (xt , β)z 1t
mizes the log likelihood function, T1 6t=1
T ln L (w ; β), which requires 1 6 T ∂ ln L (w ; β) /∂β =
t t
T t=1  u 1 (xt , β)z 2t 
 
0. A key regularity condition for the MLE is that E∂ ln L (wt ; β0 ) /∂β = 0, which is just 
..

.
 
 
like a GMM moment condition.
m(wt , β) = u(xt , β) ⊗ z t =  (x , β)z ,
 
u
 1 t nt 
 (7.5)
 u 2 (xt , β)z 1t 
 
7.3.1 Digression: From Conditional to Unconditional Moment Conditions  .. 
.
 
 
Suppose we are instead given conditional moment restrictions u m (xt , β)z nt q×1

E [u(xt , β0 )|z t ] = 0m×1 , (7.4) which by (7.4) must have an expected value of zero, that is

where z t is a vector of conditioning (predetermined) variables. We want to transform this Em(wt , β0 ) = 0q×1 . (7.6)
to unconditional moment conditions.
This a set of unconditional moment conditions—just as in (7.2). The sample moment con-
ditions (7.3) are therefore valid also in the conditional case, although we have to specify

m(wt , β) as in (7.5). Example 8 (Simple linear regression.) Consider the model
Note that the choice of instruments is often arbitrary: it often amounts to using only
a subset of the information variables. GMM is often said to be close to economic theory, yt = xt β0 + u t , (7.9)
but it should be admitted that economic theory sometimes tells us fairly little about which
where yt and xt are zero mean scalars. The moment condition and loss function are
instruments, z t , to use.
T
1X
Example 7 (Euler equation for optimal consumption, continued) The orthogonality con- m̄ (β) = xt (yt − xt β) and
T
t=1
ditions from the consumption Euler equations in Example 6 are highly non-linear, and #2
T
"
theory tells us very little about how the prediction errors are distributed. GMM has the 1X
J =W xt (yt − xt β) ,
advantage of using the theoretical predictions (moment conditions) with a minimum of T
t=1
distributional assumptions. The drawback is that it is sometimes hard to tell exactly which
so the scalar W is clearly irrelevant in this case.
features of the (underlying) distribution that are tested.
Example 9 (IV/2SLS method continued.) From Example 3, we note that the loss function
7.4 The Optimization Problem in GMM for the IV/2SLS method is
" T
#0 " T
#
7.4.1 The Loss Function 1X 1X
m̄(β)0 W m̄(β) = z t (yt − xt0 β) W z t (yt − xt0 β) .
T T
t=1 t=1
The GMM estimator β̂ minimizes the weighted quadratic form
 0    When q = k, then the model is exactly identified, so the estimator could actually be found
m̄ 1 (β) W11 · · · · · · W1q m̄ 1 (β) by setting all moment conditions to zero. We then get the IV estimator
 .   . .. ..  ..
 ..   ..
 
. .   . 
J = .   . (7.7) T
. ..
    
 ..   .. ... ..  
   1X
.  0= z t (yt − xt0 β̂ I V ) or
     T
m̄ q (β) W1q · · · · · · Wqq m̄ q (β) t=1
T
!−1 T
= m̄(β)0 W m̄(β), (7.8) 1X 0 1X
β̂ I V = z t xt z t yt
T T
t=1 t=1
where m̄(β) is the sample average of m(wt , β) given by (7.3), and where W is some
= 6̂zx
−1
6̂zy ,
q × q symmetric positive definite weighting matrix. (We will soon discuss a good choice
of weighting matrix.) There are k parameters in β to estimate, and we have q moment where 6̂zx = 6t=1
T z x 0 /T and similarly for the other second moment matrices. Let z =
t t t
conditions in m̄(β). We therefore have q − k overidentifying moment restrictions. xt to get LS
With q = k the model is exactly identified (as many equations as unknowns), and it β̂ L S = 6̂x−1
x 6̂x y .
should be possible to set all q sample moment conditions to zero by a choosing the k = q
parameters. It is clear that the choice of the weighting matrix has no effect in this case
since m̄(β̂) = 0 at the point estimates β̂.
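To fix ideas (this sketch is not in the notes), the loss function (7.8) for the simple regression in Example 8 can be coded and minimized numerically; with q = k = 1 the weighting scalar W is irrelevant and the minimizer coincides with LS. The data-generating values are illustrative.

    import numpy as np
    from scipy.optimize import minimize_scalar

    rng = np.random.default_rng(0)
    T = 500
    x = rng.normal(size=T)
    y = 0.8 * x + rng.normal(size=T)          # illustrative data, true beta = 0.8
    W = 1.0                                    # any positive weight; irrelevant when q = k

    def loss(beta):
        m_bar = np.mean(x * (y - x * beta))    # sample moment condition, cf. (7.3)
        return m_bar * W * m_bar               # J = m_bar' W m_bar, cf. (7.8)

    beta_gmm = minimize_scalar(loss).x
    beta_ls = (x @ y) / (x @ x)
    print(beta_gmm, beta_ls)                   # identical up to numerical precision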

7.4.2 First Order Conditions estimate, β̂,

Remark 10 (Matrix differentiation of non-linear functions.) Let the vector yn×1 be a ∂ m̄(β̂)0 W m̄(β̂)
0k×1 =
function of the vector xm×1 ∂β
0 
∂ m̄ 1 (β̂) ∂ m̄ 1 (β̂)
  
··· W11 · · · · · · W1q m̄ 1 (β̂)
   
y1 f 1 (x)  . ∂β1 ..
∂βk
.. .. .. ..
 .   .  ..
   
 ..  = f (x) =  .. .
 .   . . .  . 
= . .. .. .. ..  (with β̂k×1 ),
    
   
 ..
  ... 
yn f n (x)  . 


 . . 
 . 

∂ m̄ q (β̂) ∂ m̄ q (β̂)
··· W1q · · · · · · Wqq m̄ q (β̂)
∂β1 ∂βk
Then, ∂ y/∂ x 0 is an n × m matrix
(7.10)
∂ f 1 (x) ∂ f 1 (x) ∂ f 1 (x)
    !0
∂x0 ∂ x1 ··· ∂ xm ∂ m̄(β̂)
∂y  .   . .. = W m̄(β̂). (7.11)
=  ..  =  .. .

. ∂β 0 |{z}
∂x0  ∂ f
| {z }
| {z } q×q q×1
  
1 (x) ∂ f n (x) ∂ f n (x)
∂x0 ∂ x1 ··· ∂ xm k×q

(Note that the notation implies that the derivatives of the first element in y, denoted y1 , We can solve for the GMM estimator, β̂, from (7.11). This set of equations must often be
with respect to each of the elements in x 0 are found in the first row of ∂ y/∂ x 0 . A rule to solved by numerical methods, except in linear models (the moment conditions are linear
help memorizing the format of ∂ y/∂ x 0 : y is a column vector and x 0 is a row vector.) functions of the parameters) where we can find analytical solutions by matrix inversion.

Remark 11 When y = Ax where A is an n × m matrix, then f i (x) in Remark 10 is a Example 15 (First order conditions of simple linear regression.) The first order condi-
linear function. We then get ∂ y/∂ x 0 = ∂ (Ax) /∂ x 0 = A. tions of the loss function in Example 8 is
#2
Remark 12 As a special case of the previous remark y = z 0 x where both z and x are T
"
d 1X
vectors. Then ∂ z 0 x /∂ x 0 = z 0 (since z 0 plays the role of A).
 0= W xt (yt − xt β̂)
dβ T
t=1
T T
" # " #
Remark 13 (Matrix differentiation of quadratic forms.) Let xn×1 , f (x)m×1 , and Am×m 1X 2 1X
= − xt W xt (yt − xt β̂) , or
symmetric. Then T T
t=1 t=1
∂ f (x)0 A f (x) ∂ f (x) 0
 
!−1
=2 A f (x) . T
1X 2 1X
T
∂x ∂x0 β̂ = xt xt yt .
T T
t=1 t=1
Remark 14 If f (x) = x, then ∂ f (x) /∂ x 0 = I , so ∂ x 0 Ax /∂ x = 2Ax.


Example 16 (First order conditions of IV/2SLS.) The first order conditions correspond-
The k first order conditions for minimizing the GMM loss function in (7.8) with re-
spect to the k parameters are that the partial derivatives with respect to β equal zero at the

ing to (7.11) of the loss function in Example 9 (when q ≥ k) are section discusses, in an informal way, how we can arrive at those results.
" #0
∂ m̄(β̂)
0k×1 = W m̄(β̂) 7.5.1 Consistency
∂β 0
" T
#0 T Sample moments are typically consistent, so plim m (β) = E m(wt , β). This must hold at
∂ 1X 1X
= z (y
t t − x 0
β̂) W z t (yt − xt0 β̂) any parameter vector in the relevant space (for instance, those inducing stationarity and
∂β T0 t
T
t=1 t=1 variances which are strictly positive). Then, if the moment conditions (7.2) are true only at
" T
#0 T
1 X 1 X the true parameter vector, β0 , (otherwise the parameters are “unidentified”) and that they
= − z t xt0 W z t (yt − xt0 β̂)
T T are continuous in β, then GMM is consistent. The idea is thus that GMM asymptotically
t=1 t=1
solves
= −6̂x z W (6̂zy − 6̂zx β̂).
0q×1 = plim m̄(β̂)
We can solve for β̂ from the first order conditions as
 −1 = E m(wt , β̂),
β̂2S L S = 6̂x z W 6̂zx 6̂x z W 6̂zy .
which only holds at β̂ = β0 . Note that this is an application of Slutsky’s theorem.
When q = k, then the first order conditions can be premultiplied with (6̂x z W )−1 , since
Remark 17 (Slutsky’s theorem.) If {x T } is a sequence of random matrices such that
6̂x z W is an invertible k × k matrix in this case, to give
plim x T = x and g(x T ) a continuous function, then plim g(x T ) = g(x).
0k×1 = 6̂zy − 6̂zx β̂, so β̂ I V = 6̂zx
−1
6̂zy .
Example 18 (Consistency of 2SLS.) By using yt = xt0 β0 + u t , the first order conditions
This shows that the first order conditions are just the same as the sample moment condi- in Example 16 can be rewritten
tions, which can be made to hold exactly since there are as many parameters as there are T
1X
equations. 0k×1 = 6̂x z W z t (yt − xt0 β̂)
T
t=1
T
7.5 Asymptotic Properties of GMM 1X h  i
= 6̂x z W z t u t + xt0 β0 − β̂
T
t=1
We know very little about the general small sample properties, including bias, of GMM.  
= 6̂x z W 6̂zu + 6̂x z W 6̂zx β0 − β̂ .
We therefore have to rely either on simulations (Monte Carlo or bootstrap) or on the
asymptotic results. This section is about the latter. Take the probability limit
GMM estimates are typically consistent and normally distributed, even if the series  
m(wt , β) in the moment conditions (7.3) are serially correlated and heteroskedastic— 0k×1 = plim 6̂x z W plim 6̂zu + plim 6̂x z W plim 6̂zx β0 − plim β̂ .
provided wt is a stationary and ergodic process. The reason is essentially that the esti-
In most cases, plim 6̂x z is some matrix of constants, and plim 6̂zu = E z t u t = 0q×1 . It
mators are (at least as a first order approximation) linear combinations of sample means
then follows that plim β̂ = β0 . Note that the whole argument relies on that the moment
which typically are consistent (LLN) and normally distributed (CLT). More about that
condition, E z t u t = 0q×1 , is true. If it is not, then the estimator is inconsistent. For
later. The proofs are hard, since the GMM is such a broad class of estimators. This

instance, when the instruments are invalid (correlated with the residuals) or when we use sample moment conditions with respect to the parameters, evaluated at the true parameters
LS (z t = xt ) when there are measurement errors or in a system of simultaneous equations. ∂ m̄(β0 )
D0 = plim , where (7.16)
∂β 0
7.5.2 Asymptotic Normality 
∂ m̄ 1 (β) ∂ m̄ 1 (β)

∂β1 ··· ∂βk
√  .. .. 
To give the asymptotic distribution of T (β̂ − β0 ), we need to define three things. (As ∂ m̄(β0 )  . . 
√ = .. ..  at the true β vector. (7.17)
 
usual, we also need to scale with T to get a non-trivial asymptotic distribution; the ∂β 0  . . 
asymptotic distribution of β̂ − β0 is a spike at zero.) First, let S0 (a q × q matrix) denote
 
∂ m̄ q (β) ∂ m̄ q (β)
√ ∂β1 ··· ∂βk
the asymptotic covariance matrix (as sample size goes to infinity) of T times the sample
moment conditions evaluated at the true parameters Note that a similar gradient, but evaluated at β̂, also shows up in the first order conditions
h√ i (7.11). Third, let the weighting matrix be the inverse of the covariance matrix of the
S0 = ACov T m̄ (β0 ) (7.12) moment conditions (once again evaluated at the true parameters)
T
" #
1 X
= ACov √ m(wt , β0 ) , (7.13) W = S0−1 . (7.18)
T t=1
It can be shown that this choice of weighting matrix gives the asymptotically most ef-
where we use the definition of m̄ (β0 ) in (7.3). (To estimate S0 it is important to recognize
ficient estimator for a given set of orthogonality conditions. For instance, in 2SLS, this
that it is a scaled sample average.) Let R (s) be the q ×q covariance (matrix) of the vector
means a given set of instruments and (7.18) then shows only how to use these instruments
m(wt , β0 ) with the vector m(wt−2 , β0 )
in the most efficient way. Of course, another set of instruments might be better (in the
R (s) = Cov m(wt , β0 ), m(wt−s , β0 ) sense of giving a smaller Cov(β̂)).
 

= E m(wt , β0 )m(wt−s , β0 )0 . (7.14) With the definitions in (7.12) and (7.16) and the choice of weighting matrix in (7.18)
and the added assumption that the rank of D0 equals k (number of parameters) then we
Then, it is well known that can show (under fairly general conditions) that
h√ ∞ √  −1
d
i
T (β̂ − β0 ) → N (0k×1 , V ), where V = D00 S0−1 D0 .
X
ACov T m̄ (β0 ) = R(s). (7.15) (7.19)
s=−∞
This holds also when the model is exactly identified, so we really do not use any weighting
In practice, we often estimate this by using the Newey-West estimator (or something
matrix.
similar).
To prove this note the following.
Second, let D0 (a q × k matrix) denote the probability limit of the gradient of the
Remark 19 (Continuous mapping theorem.) Let the sequences of random matrices {x T }
d p
and {yT }, and the non-random matrix {aT } be such that x T → x, yT → y, and aT → a
d
(a traditional limit). Let g(x T , yT , aT ) be a continuous function. Then g(x T , yT , aT ) →
g(x, y, a). Either of yT and aT could be irrelevant in g. (See Mittelhammer (1996) 5.3.)

Example 20 For instance, the sequences in Remark 19 could be x T = T 6t=
T w /T ,
t

the scaled sample average of a random variable wt ; yT = 6t=
T w 2 /T , the sample second
t limiting distributions (see Remark 19) we then have that
moment; and aT = 6t=1 0.7 .
T t
√   d
T β̂ − β0 → plim 0 × something that is N (0, S0 ) , that is,
d √ 
Remark 21 From the previous remark: if x T → x (a random variable) and plim Q T =
 d
T β̂ − β0 → N 0k×1 , (plim 0)S0 (plim 0 0 ) .
 
d
Q (a constant matrix), then Q T x T → Qx.
The covariance matrix is then
Proof. (The asymptotic distribution (7.19). Sketch of proof.) This proof is essentially

an application of the delta rule. By the mean-value theorem the sample moment condition ACov[ T (β̂ − β0 )] = (plim 0)S0 (plim 0 0 )
evaluated at the GMM estimate, β̂, is
−1 0 −1 0 0
= D00 W D0 D0 W S0 [ D00 W D0 D0 W ] (7.24)
∂ m̄(β1 ) −1 −1
= D00 W D0 D00 W S0 W 0 D0 D00 W D0 .

(7.25)
m̄(β̂) = m̄(β0 ) + (β̂ − β0 ) (7.20)
∂β 0
If W = W 0 = S0−1 , then this expression simplifies to (7.19). (See, for instance, Hamilton
for some values β1 between β̂ and β0 . (This point is different for different elements in (1994) 14 (appendix) for more details.)
m̄.) Premultiply with [∂ m̄(β̂)/∂β 0 ]0 W . By the first order condition (7.11), the left hand It is straightforward to show that the difference between the covariance matrix in
side is then zero, so we have  −1
(7.25) and D00 S0−1 D0 (as in (7.19)) is a positive semi-definite matrix: any linear
!0 !0
∂ m̄(β̂) ∂ m̄(β̂) ∂ m̄(β1 ) combination of the parameters has a smaller variance if W = S0−1 is used as the weight-
0k×1 = W m̄(β0 ) + W (β̂ − β0 ). (7.21)
∂β 0 ∂β 0 ∂β 0 ing matrix.
√ All the expressions for the asymptotic distribution are supposed to be evaluated at the
Multiply with T and solve as true parameter vector β0 , which is unknown. However, D0 in (7.16) can be estimated by
" !0 #−1 !0 ∂ m̄(β̂)/∂β 0 , where we use the point estimate instead of the true value of the parameter
√   ∂ m̄(β̂) ∂ m̄(β1 ) ∂ m̄(β̂) √
T β̂ − β0 = − W W T m̄(β0 ). (7.22) vector. In practice, this means plugging in the point estimates into the sample moment
∂β 0 ∂β 0 ∂β 0
| {z } conditions and calculate the derivatives with respect to parameters (for instance, by a
0
numerical method).
If Similarly, S0 in (7.13) can be estimated by, for instance, Newey-West’s estimator of
∂ m̄(β̂) ∂ m̄(β0 ) ∂ m̄(β1 ) √
plim = = D0 , then plim = D0 , Cov[ T m̄(β̂)], once again using the point estimates in the moment conditions.
∂β 0 ∂β 0 ∂β 0
since β1 is between β0 and β̂. Then Example 22 (Covariance matrix of 2SLS.) Define
−1 √ T !
plim 0 = − D00 W D0 D00 W. (7.23) h√ i T X
S0 = ACov T m̄ (β0 ) = ACov zt u t
√ √ T
t=1
The last term in (7.22), T m̄(β0 ), is T times a vector of sample averages, so by a CLT
T
!
it converges in distribution to N(0, S0 ), where S0 is defined as in (7.12). By the rules of ∂ m̄(β0 ) 1X 0
D0 = plim = plim − z t t = −6zx .
x
∂β 0 T
t=1


This gives the asymptotic covariance matrix of T (β̂ − β0 ) • Iterate at least once more. (You may want to consider iterating until the point esti-
 −1  −1 mates converge.)
V = D00 S0−1 D0 = 6zx
0
S0−1 6zx .
Example 23 (Implementation of 2SLS.) Under the classical 2SLS assumptions, there is
no need for iterating since the efficient weighting matrix is 6zz
−1 /σ 2 . Only σ 2 depends
7.6 Summary of GMM
on the estimated parameters, but this scaling factor of the loss function does not affect
β̂2S L S .
Economic model : Em(wt , β0 ) = 0q×1 , β is k × 1
1X
T One word of warning: if the number of parameters in the covariance matrix Ŝ is
Sample moment conditions : m̄(β) = m(wt , β) large compared to the number of data points, then Ŝ tends to be unstable (fluctuates a lot
T
t=1
between the steps in the iterations described above) and sometimes also close to singular.
Loss function : J = m̄(β)0 W m̄(β)
!0 The saturation ratio is sometimes used as an indicator of this problem. It is defined as the
∂ m̄(β̂)0 W m̄(β̂) ∂ m̄(β̂) number of data points of the moment conditions (qT ) divided by the number of estimated
First order conditions : 0k×1 = = W m̄(β̂)
∂β ∂β 0 parameters (the k parameters in β̂ and the unique q(q + 1)/2 parameters in Ŝ if it is
Consistency : β̂ is typically consistent if Em(wt , β0 ) = 0 estimated with Newey-West). A value less than 10 is often taken to be an indicator of
h√ i ∂ m̄(β0 ) problems. A possible solution is then to impose restrictions on S, for instance, that the
Define : S0 = Cov T m̄ (β0 ) and D0 = plim
∂β 0 autocorrelation is a simple AR(1) and then estimate S using these restrictions (in which
Choose: W = S0−1 case you cannot use Newey-West, or course).
√ d
 −1
Asymptotic distribution : T (β̂ − β0 ) → N (0k×1 , V ), where V = D00 S0−1 D0
7.8 Testing in GMM
7.7 Efficient GMM and Its Feasible Implementation The result in (7.19) can be used to do Wald tests of the parameter vector. For instance,
suppose we want to test the s linear restrictions that Rβ0 = r (R is s × k and r is s × 1)
The efficient GMM (remember: for a given set of moment conditions) requires that we
then it must be the case that under null hypothesis
use W = S0−1 , which is tricky since S0 should be calculated by using the true (unknown)
√ d
parameter vector. However, the following two-stage procedure usually works fine: T (R β̂ − r ) → N (0s×1 , RV R 0 ). (7.26)

• First, estimate model with some (symmetric and positive definite) weighting matrix. Remark 24 (Distribution of quadratic forms.) If the n × 1 vector x ∼ N (0, 6), then
The identity matrix is typically a good choice for models where the moment con- x 0 6 −1 x ∼ χn2 .
ditions are of the same order of magnitude (if not, consider changing the moment
conditions). This gives consistent estimates of the parameters β. Then a consistent From this remark and the continuous mapping theorem in Remark (19) it follows that,
estimate Ŝ can be calculated (for instance, with Newey-West). under the null hypothesis that Rβ0 = r , the Wald test statistics is distributed as a χs2
variable
• Use the consistent Ŝ from the first step to define a new weighting matrix as W = T (R β̂ − r )0 RV R 0
−1 d
(R β̂ − r ) → χs2 . (7.27)
Ŝ −1 . The algorithm is run again to give asymptotically efficient estimates of β.
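A minimal sketch (not from the notes) of the two-step procedure described in Section 7.7 for the linear IV-type moment conditions E z_t(y_t − x_t'β) = 0 with more instruments than parameters, followed by the J test (7.30). As an illustrative simplification, Ŝ is estimated from the first-step moments assuming no autocorrelation (heteroskedasticity-robust only); the function and variable names are placeholders.

    import numpy as np
    from scipy.stats import chi2

    def linear_gmm(y, X, Z):
        """Two-step GMM for E z_t(y_t - x_t'beta) = 0, q > k, plus the J test (7.30)."""
        T = len(y)
        def gmm_beta(W):                        # minimizer of (7.8) for linear moment conditions
            Szx, Szy = Z.T @ X / T, Z.T @ y / T
            return np.linalg.solve(Szx.T @ W @ Szx, Szx.T @ W @ Szy)
        # step 1: identity weighting matrix -> consistent but inefficient estimate
        b1 = gmm_beta(np.eye(Z.shape[1]))
        # estimate S0 from the first-step moments (no Newey-West correction in this sketch)
        m = Z * (y - X @ b1)[:, None]
        S_hat = m.T @ m / T
        # step 2: W = S_hat^{-1} -> asymptotically efficient estimate
        W2 = np.linalg.inv(S_hat)
        b2 = gmm_beta(W2)
        m_bar = Z.T @ (y - X @ b2) / T
        J = T * m_bar @ W2 @ m_bar              # T J(beta_hat), chi2 with q-k df under the null
        return b2, J, chi2.sf(J, Z.shape[1] - X.shape[1])

If the moment conditions are autocorrelated, Ŝ should instead be estimated with a Newey-West type estimator, and the second step can be iterated as described above until the point estimates converge.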

We might also want to test the overidentifying restrictions. The first order conditions restrictions.
(7.11) imply that k linear combinations of the q moment conditions are set to zero by
solving for β̂. Therefore, we have q − k remaining overidentifying restrictions which Example 26 (Results from GMM on CCAPM; continuing Example 6.) The instruments
should also be close to zero if the model is correct (fits data). Under the null hypothesis could be anything known at t or earlier could be used as instruments. Actually, Hansen
that the moment conditions hold (so the overidentifying restrictions hold), we know that and Singleton (1982) and Hansen and Singleton (1983) use lagged Ri,t+1 ct+1 /ct as in-
√ struments, and estimate γ to be 0.68 to 0.95, using monthly data. However, T JT (β̂) is
T m̄ (β0 ) is a (scaled) sample average and therefore has (by a CLT) an asymptotic normal
distribution. It has a zero mean (the null hypothesis) and the covariance matrix in (7.12). large and the model can usually be rejected at the 5% significance level. The rejection
In short, is most clear when multiple asset returns are used. If T-bills and stocks are tested at the
√ d same time, then the rejection would probably be overwhelming.
T m̄ (β0 ) → N 0q×1 , S0 .

(7.28)

If would then perhaps be natural to expect that the quadratic form T m̄(β̂)0 S0−1 m̄(β̂) Another test is to compare a restricted and a less restricted model, where we have
should be converge in distribution to a χq2 variable. That is not correct, however, since β̂ used the optimal weighting matrix for the less restricted model in estimating both the less
chosen is such a way that k linear combinations of the first order conditions always (in restricted and more restricted model (the weighting matrix is treated as a fixed matrix in
every sample) are zero. There are, in effect, only q −k nondegenerate random variables in the latter case). It can be shown that the test of the s restrictions (the “D test”, similar in
the quadratic form (see Davidson and MacKinnon (1993) 17.6 for a detailed discussion). flavour to an LR test), is
The correct result is therefore that if we have used optimal weight matrix is used, W =
S0−1 , then T [J (β̂ r estricted ) − J (β̂ less r estricted )] ∼ χs2 , if W = S0−1 . (7.31)
d
T m̄(β̂)0 S0−1 m̄(β̂) → χq−k
2
, if W = S0−1 . (7.29) The weighting matrix is typically based on the unrestricted model. Note that (7.30) is a
The left hand side equals T times of value of the loss function (7.8) evaluated at the point special case, since the model with allows q non-zero parameters (as many as the moment
estimates, so we could equivalently write what is often called the J test conditions) always attains J = 0, and that by imposing s = q − k restrictions we get a
restricted model.
T J (β̂) ∼ χq−k
2
, if W = S0−1 . (7.30)

This also illustrates that with no overidentifying restrictions (as many moment conditions 7.9 GMM with Sub-Optimal Weighting Matrix∗
as parameters) there are, of course, no restrictions to test. Indeed, the loss function value
When the optimal weighting matrix is not used, that is, when (7.18) does not hold, then
is then always zero at the point estimates.
the asymptotic covariance matrix of the parameters is given by (7.25) instead of the result
Example 25 (Test of overidentifying assumptions in 2SLS.) In contrast to the IV method, in (7.19). That is,
2SLS allows us to test overidentifying restrictions (we have more moment conditions than √ d −1 −1
parameters, that is, more instruments than regressors). This is a test of whether the residu- T (β̂ − β0 ) → N (0k×1 , V2 ), where V2 = D00 W D0 D00 W S0 W 0 D0 D00 W D0 .
als are indeed uncorrelated with all the instruments. If not, the model should be rejected. (7.32)
It can be shown that test (7.30) is (asymptotically, at least) the same as the traditional The consistency property is not affected.
(Sargan (1964), see Davidson (2000) 8.4) test of the overidentifying restrictions in 2SLS. The test of the overidentifying restrictions (7.29) and (7.30) are not longer valid. In-
In the latter, the fitted residuals are regressed on the instruments; T R 2 from that regres-
sion is χ 2 distributed with as many degrees of freedom as the number of overidentifying

stead, the result is that 7.10 GMM without a Loss Function∗

T m̄(β̂) →d N 0q×1 , 9 , with

(7.33) Suppose we sidestep the whole optimization issue and instead specify k linear combi-
−1 0 −1 0 0 nations (as many as there are parameters) of the q moment conditions directly. That is,
9 = [I − D0 D00 W D0 D0 W ]S0 [I − D0 D00 W D0 D0 W ] . (7.34)
instead of the first order conditions (7.11) we postulate that the estimator should solve
This covariance matrix has rank q − k (the number of overidentifying restriction). This
distribution can be used to test hypotheses about the moments, for instance, that a partic- A m̄(β̂) (β̂ is k × 1).
0k×1 = |{z} (7.36)
| {z }
k×q q×1
ular moment condition is zero.
Proof. (Sketch of proof of (7.33)-(7.34)) Use (7.22) in (7.20) to get The matrix A is chosen by the researcher and it must have rank k (lower rank means that
√ √ √ ∂ m̄(β1 ) we effectively have too few moment conditions to estimate the k parameters in β). If A
T m̄(β̂) = T m̄(β0 ) +
T 0 m̄(β0 )
∂β 0 is random, then it should have a finite probability limit A0 (also with rank k). One simple
∂ m̄(β1 ) √
 
case when this approach makes sense is when we want to use a subset of the moment
= I+ 0 T m̄(β0 ).
∂β 0 conditions to estimate the parameters (some columns in A are then filled with zeros), but
−1 we want to study the distribution of all the moment conditions.
The term in brackets has a probability limit, which by (7.23) equals I −D0 D00 W D0 D00 W .
√ By comparing (7.11) and (7.36) we see that A plays the same role as [∂ m̄(β̂)/∂β 0 ]0 W ,
Since T m̄(β0 ) →d N 0q×1 , S0 we get (7.33).

but with the difference that A is chosen and not allowed to depend on the parameters.
Remark 27 If the n × 1 vector X ∼ N (0, 6), where 6 has rank r ≤ n then Y = In the asymptotic distribution, it is the probability limit of these matrices that matter, so
X 0 6 + X ∼ χr2 where 6 + is the pseudo inverse of 6. we can actually substitute A0 for D00 W in the proof of the asymptotic distribution. The
covariance matrix in (7.32) then becomes
Remark 28 The symmetric 6 can be decomposed as 6 = Z 3Z 0 where Z are the or-
0

thogonal eigenvector (Z Z = I ) and 3 is a diagonal matrix with the eigenvalues along ACov[ T (β̂ − β0 )] = (A0 D0 )−1 A0 S0 [(A0 D0 )−1 A0 ]0
the main diagonal. The pseudo inverse can then be calculated as 6 + = Z 3+ Z 0 , where = (A0 D0 )−1 A0 S0 A00 [(A0 D0 )−1 ]0 , (7.37)
" #
311
−1
0 which can be used to test hypotheses about the parameters.
3+ = ,
0 0 Similarly, the covariance matrix in (7.33) becomes

with the reciprocals of the non-zero eigen values along the principal diagonal of 3−1
11 . ACov[ T m̄(β̂)] = [I − D0 (A0 D0 )−1 A0 ]S0 [I − D0 (A0 D0 )−1 A0 ]0 , (7.38)

This remark and (7.34) implies that the test of overidentifying restrictions (Hansen’s which still has reduced rank. As before, this covariance matrix can be used to construct
J statistics) analogous to (7.29) is both t type and χ 2 tests of the moment conditions.
d
T m̄(β̂)0 9 + m̄(β̂) → χq−k
2
. (7.35)
7.11 Simulated Moments Estimator∗
It requires calculation of a generalized inverse (denoted by superscript + ), but this is
fairly straightforward since 9 is a symmetric matrix. It can be shown (a bit tricky) that Reference: Ingram and Lee (1991)
this simplifies to (7.29) when the optimal weighting matrix is used. It sometimes happens that it is not possible to calculate the theoretical moments in

GMM explicitly. For instance, suppose we want to match the variance of the model with Hamilton, J. D., 1994, Time Series Analysis, Princeton University Press, Princeton.
the variance of data
Hansen, L., and K. Singleton, 1982, “Generalized Instrumental Variables Estimation of
E m(wt , β0 ) = 0, where (7.39) Nonlinear Rational Expectations Models,” Econometrica, 50, 1269–1288.
m(wt , β) = (wt − µ)2 − Var in model (β) , (7.40) Hansen, L., and K. Singleton, 1983, “Stochastic Consumption, Risk Aversion and the
Temporal Behavior of Asset Returns,” Journal of Political Economy, 91, 249–268.
but the model is so non-linear that we cannot find a closed form expression for Var of model(β0 ).
Similary, we could match a covariance of Harris, D., and L. Matyas, 1999, “Introduction to the Generalized Method of Moments
The SME involves (i) drawing a set of random numbers for the stochastic shocks in Estimation,” in Laszlo Matyas (ed.), Generalized Method of Moments Estimation .
the model; (ii) for a given set of parameter values generate a model simulation with Tsim chap. 1, Cambridge University Press.
observations, calculating the moments and using those instead of Var of model(β0 ) (or
similarly for other moments), which is then used to evaluate the loss function JT . This is Hayashi, F., 2000, Econometrics, Princeton University Press.
repeated for various sets of parameter values until we find the one which minimizes JT . Ingram, B.-F., and B.-S. Lee, 1991, “‘Simulation Estimation of Time-Series Models,”
Basically all GMM results go through, but the covariance matrix should be scaled up Journal of Econometrics, 47, 197–205.
with 1 + T /Tsim , where T is the sample length. Note that the same sequence of random
numbers should be reused over and over again (as the parameter values are changed). Johnston, J., and J. DiNardo, 1997, Econometric Methods, McGraw-Hill, New York, 4th
edn.
Example 29 Suppose wt has two elements, xt and yt , and that we want to match both
variances and also the covariance. For simplicity, suppose both series have zero means. Mittelhammer, R. C., 1996, Mathematical Statistics for Economics and Business,
Then we can formulate the moment conditions Springer-Verlag, New York.
 2  Ogaki, M., 1993, “Generalized Method of Moments: Econometric Applications,” in G. S.
xt − Var(x) in model(β)
m(xt , yt , β) =  yt2 − Var(y) in model(β)  .
 
(7.41) Maddala, C. R. Rao, and H. D. Vinod (ed.), Handbook of Statistics, vol. 11, . chap. 17,
pp. 455–487, Elsevier.
xt yt − Cov(x,y) in model(β)
Pindyck, R. S., and D. L. Rubinfeld, 1997, Econometric Models and Economic Forecasts,
Bibliography Irwin McGraw-Hill, Boston, Massachusetts, 4ed edn.

Cochrane, J. H., 2001, Asset Pricing, Princeton University Press, Princeton, New Jersey. Verbeek, M., 2000, A Guide to Modern Econometrics, Wiley, Chichester.

Davidson, J., 2000, Econometric Theory, Blackwell Publishers, Oxford.

Davidson, R., and J. G. MacKinnon, 1993, Estimation and Inference in Econometrics,


Oxford University Press, Oxford.

Greene, W. H., 2000, Econometric Analysis, Prentice-Hall, Upper Saddle River, New
Jersey, 4th edn.

8 Examples and Applications of GMM

8.1 GMM and Classical Econometrics: Examples

8.1.1 The LS Estimator (General)

The model is

yt = xt′β0 + ut,  (8.1)

where β is a k × 1 vector.

The k moment conditions are

m̄(β) = (1/T) Σ_{t=1}^T xt (yt − xt′β) = (1/T) Σ_{t=1}^T xt yt − [(1/T) Σ_{t=1}^T xt xt′] β.  (8.2)

The point estimates are found by setting all moment conditions to zero (the model is exactly identified), m̄(β) = 0_{k×1}, which gives

β̂ = [(1/T) Σ_{t=1}^T xt xt′]^{−1} (1/T) Σ_{t=1}^T xt yt.  (8.3)

If we define

S0 = ACov[√T m̄(β0)] = ACov[(√T/T) Σ_{t=1}^T xt ut]  (8.4)
D0 = plim ∂m̄(β0)/∂β′ = plim (−(1/T) Σ_{t=1}^T xt xt′) = −Σxx,  (8.5)

then the asymptotic covariance matrix of √T(β̂ − β0) is

V_LS = (D0′ S0^{−1} D0)^{−1} = (Σxx′ S0^{−1} Σxx)^{−1} = Σxx^{−1} S0 Σxx^{−1}.  (8.6)

We can then either try to estimate S0 by Newey-West, or make further assumptions to simplify S0 (see below).

8.1.2 The IV/2SLS Estimator (General)

The model is (8.1), but we use an IV/2SLS method. The q moment conditions (with q ≥ k) are

m̄(β) = (1/T) Σ_{t=1}^T zt (yt − xt′β) = (1/T) Σ_{t=1}^T zt yt − [(1/T) Σ_{t=1}^T zt xt′] β.  (8.7)

The loss function is (for some positive definite weighting matrix W, not necessarily the optimal)

m̄(β)′ W m̄(β) = [(1/T) Σ_{t=1}^T zt (yt − xt′β)]′ W [(1/T) Σ_{t=1}^T zt (yt − xt′β)],  (8.8)

and the k first order conditions, (∂m̄(β̂)/∂β′)′ W m̄(β̂) = 0, are

0_{k×1} = [∂/∂β′ (1/T) Σ_{t=1}^T zt (yt − xt′β̂)]′ W (1/T) Σ_{t=1}^T zt (yt − xt′β̂)
        = −[(1/T) Σ_{t=1}^T zt xt′]′ W (1/T) Σ_{t=1}^T zt (yt − xt′β̂)
        = −Σ̂xz W (Σ̂zy − Σ̂zx β̂).  (8.9)

We solve for β̂ as

β̂ = (Σ̂xz W Σ̂zx)^{−1} Σ̂xz W Σ̂zy.  (8.10)

Define

S0 = ACov[√T m̄(β0)] = ACov[(√T/T) Σ_{t=1}^T zt ut]  (8.11)
D0 = plim ∂m̄(β0)/∂β′ = plim (−(1/T) Σ_{t=1}^T zt xt′) = −Σzx.  (8.12)

This gives the asymptotic covariance matrix of √T(β̂ − β0)

V = (D0′ S0^{−1} D0)^{−1} = (Σzx′ S0^{−1} Σzx)^{−1}.  (8.13)

When the model is exactly identified (q = k), then we can make some simplifications since Σ̂xz is then invertible. This is the case of the classical IV estimator. We get

β̂ = Σ̂zx^{−1} Σ̂zy and V = Σzx^{−1} S0 (Σzx^{−1})′ if q = k.  (8.14)

(Use the rule (ABC)^{−1} = C^{−1}B^{−1}A^{−1} to show this.)
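As an illustration (not part of the original notes), the following Python sketch computes the IV/GMM point estimate in (8.10) for a user-supplied weighting matrix and the covariance in (8.13) for the case where the optimal weighting matrix W = S0^{−1} is used; the data arrays y, x, z and the default 2SLS-type choice of W are assumptions.

    import numpy as np

    def iv_gmm(y, x, z, W=None):
        # beta_hat = (Sxz W Szx)^{-1} Sxz W Szy, as in (8.10)
        T = len(y)
        Szx = z.T @ x / T                       # Sigma_zx
        Szy = z.T @ y / T                       # Sigma_zy
        Sxz = Szx.T                             # Sigma_xz
        if W is None:                           # default: W = (Z'Z/T)^{-1}, the 2SLS choice
            W = np.linalg.inv(z.T @ z / T)
        beta_hat = np.linalg.solve(Sxz @ W @ Szx, Sxz @ W @ Szy)
        u = y - x @ beta_hat                    # residuals
        zu = z * u[:, None]
        S0 = zu.T @ zu / T                      # estimate of S0 (no autocorrelation assumed)
        V = np.linalg.inv(Szx.T @ np.linalg.inv(S0) @ Szx)   # (8.13); valid when W = S0^{-1}
        return beta_hat, V / T                  # Cov(beta_hat) is approximately V/T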
8.1.3 Classical LS Assumptions

Reference: Greene (2000) 9.4 and Hamilton (1994) 8.2.

This section returns to the LS estimator in Section 8.1.1 in order to highlight the classical LS assumptions that give the variance matrix σ²Σxx^{−1}.

We allow the regressors to be stochastic, but require that xt is independent of all ut+s and that ut is iid. It rules out, for instance, that ut and xt−2 are correlated and also that the variance of ut depends on xt. Expand the expression for S0 as

S0 = E [(√T/T) Σ_{t=1}^T xt ut][(√T/T) Σ_{t=1}^T ut xt′]  (8.15)
   = (1/T) E (... + xs−1 us−1 + xs us + ...)(... + us−1 x′s−1 + us x′s + ...).

Note that

E xt−s ut−s ut xt′ = E xt−s xt′ E ut−s ut (since ut and xt−s are independent)
                 = 0 if s ≠ 0 (since E us−1 us = 0 by iid ut), and E xt xt′ E ut ut otherwise.  (8.16)

This means that all cross terms (involving different observations) drop out and that we can write

S0 = (1/T) Σ_{t=1}^T E xt xt′ E ut²  (8.17)
   = σ² (1/T) Σ_{t=1}^T E xt xt′ (since ut is iid and σ² = E ut²)  (8.18)
   = σ² Σxx.  (8.19)

Using this in (8.6) gives

V = σ² Σxx^{−1}.  (8.20)

8.1.4 Almost Classical LS Assumptions: White’s Heteroskedasticity

Reference: Greene (2000) 12.2 and Davidson and MacKinnon (1993) 16.2.

The only difference compared with the classical LS assumptions is that ut is now allowed to be heteroskedastic, but this heteroskedasticity is not allowed to depend on the moments of xt. This means that (8.17) holds, but (8.18) does not since E ut² is not the same for all t.

However, we can still simplify (8.17) a bit more. We assumed that E xt xt′ and E ut² (which can both be time varying) are not related to each other, so we could perhaps multiply E xt xt′ by Σ_{t=1}^T E ut²/T instead of by E ut². This is indeed true asymptotically—where any possible “small sample” relation between E xt xt′ and E ut² must wash out due to the assumptions of independence (which are about population moments).

In large samples we therefore have

S0 = [(1/T) Σ_{t=1}^T E ut²] [(1/T) Σ_{t=1}^T E xt xt′]
   = ω² Σxx,  (8.21)

where ω² is a scalar. This is very similar to the classical LS case, except that ω² is the average variance of the residual rather than the constant variance. In practice, the estimator of ω² is the same as the estimator of σ², so we can actually apply the standard LS formulas in this case.

This is the motivation for why White’s test for heteroskedasticity makes sense: if the heteroskedasticity is not correlated with the regressors, then the standard LS formula is correct (provided there is no autocorrelation).
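The point can be illustrated with a short Python sketch (not from the notes): when the heteroskedasticity is unrelated to the regressors, the classical covariance σ̂²Σ̂xx^{−1}/T and the White sandwich Σ̂xx^{−1}Ŝ0Σ̂xx^{−1}/T should be close in large samples. The simulated regression is an illustrative assumption.

    import numpy as np

    rng = np.random.default_rng(1)
    T = 5000
    x = np.column_stack([np.ones(T), rng.standard_normal(T)])
    scale = 1.0 + (np.arange(T) % 2)                 # heteroskedastic, but unrelated to x
    u = rng.standard_normal(T) * scale
    y = x @ np.array([1.0, 0.5]) + u

    b = np.linalg.lstsq(x, y, rcond=None)[0]
    e = y - x @ b
    Sxx_inv = np.linalg.inv(x.T @ x / T)
    V_classical = np.mean(e**2) * Sxx_inv / T        # sigma^2 * Sxx^{-1} / T
    xe = x * e[:, None]
    S0_hat = xe.T @ xe / T                           # White estimate of S0
    V_white = Sxx_inv @ S0_hat @ Sxx_inv / T         # sandwich form from (8.6)
    print(np.diag(V_classical), np.diag(V_white))    # similar in this case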
8.1.5 Estimating the Mean of a Process

Suppose ut is heteroskedastic, but not autocorrelated. In the regression yt = α + ut, xt = zt = 1. This is a special case of the previous example, since E ut² is certainly unrelated to E xt xt′ = 1 (since it is a constant). Therefore, the LS covariance matrix is the correct variance of the sample mean as an estimator of the mean, even if ut is heteroskedastic (provided there is no autocorrelation).

8.1.6 The Classical 2SLS Assumptions∗

Reference: Hamilton (1994) 9.2.

The classical 2SLS case assumes that zt is independent of all ut+s and that ut is iid. The covariance matrix of the moment conditions is

S0 = E [(1/√T) Σ_{t=1}^T zt ut][(1/√T) Σ_{t=1}^T ut zt′],  (8.22)

so by following the same steps as in (8.16)-(8.19) we get S0 = σ²Σzz. The optimal weighting matrix is therefore W = Σzz^{−1}/σ² (or (Z′Z/T)^{−1}/σ² in matrix form). We use this result in (8.10) to get

β̂_2SLS = (Σ̂xz Σ̂zz^{−1} Σ̂zx)^{−1} Σ̂xz Σ̂zz^{−1} Σ̂zy,  (8.23)

which is the classical 2SLS estimator.

Since this GMM is efficient (for a given set of moment conditions), we have established that 2SLS uses its given set of instruments in the efficient way—provided the classical 2SLS assumptions are correct. Also, using this weighting matrix in (8.13) gives

V = (Σxz (1/σ²) Σzz^{−1} Σzx)^{−1}.  (8.24)

8.2 Identification of Systems of Simultaneous Equations

Reference: Greene (2000) 16.1-3

This section shows how the GMM moment conditions can be used to understand if the parameters in a system of simultaneous equations are identified or not.

The structural model (form) is

F yt + G zt = ut,  (8.25)

where yt is a vector of endogenous variables, zt a vector of predetermined (exogenous) variables, F is a square matrix, and G is another matrix.¹ We can write the jth equation of the structural form (8.25) as

y_jt = xt′β + u_jt,  (8.26)

where xt contains the endogenous and exogenous variables that enter the jth equation with non-zero coefficients, that is, subsets of yt and zt.

We want to estimate β in (8.26). Least squares is inconsistent if some of the regressors are endogenous variables (in terms of (8.25), this means that the jth row in F contains at least one additional non-zero element apart from the coefficient on y_jt). Instead, we use IV/2SLS. By assumption, the structural model summarizes all relevant information for the endogenous variables yt. This implies that the only useful instruments are the variables in zt. (A valid instrument is uncorrelated with the residuals, but correlated with the regressors.) The moment conditions for the jth equation are then

E zt (y_jt − xt′β) = 0 with sample moment conditions (1/T) Σ_{t=1}^T zt (y_jt − xt′β) = 0.  (8.27)

If there are as many moment conditions as there are elements in β, then this equation is exactly identified, so the sample moment conditions can be inverted to give the instrumental variables (IV) estimator of β. If there are more moment conditions than elements in β, then this equation is overidentified and we must devise some method for weighting the different moment conditions. This is the 2SLS method. Finally, when there are fewer moment conditions than elements in β, then this equation is unidentified, and we cannot hope to estimate the structural parameters of it.

We can partition the vector of regressors in (8.26) as xt′ = [z̃t′, ỹt′], where z̃t and ỹt are the subsets of zt and yt, respectively, that enter the right hand side of (8.26). Partition zt conformably, zt′ = [z̃t′, zt*′], where zt* are the exogenous variables that do not enter (8.26).

¹ By premultiplying with F^{−1} and rearranging we get the reduced form yt = Πzt + εt, with Π = −F^{−1}G and Cov(εt) = F^{−1}Cov(ut)(F^{−1})′.
We can then rewrite the moment conditions in (8.27) as

E [z̃t; zt*] (y_jt − [z̃t; ỹt]′β) = 0.  (8.28)

Recall from (8.25)-(8.26) that the jth structural equation can be written

y_jt = −Gj z̃t − Fj ỹt + u_jt
     = xt′β + u_jt, where xt′ = [z̃t′, ỹt′].  (8.29)

This shows that we need at least as many elements in zt* as in ỹt to have this equation identified, which confirms the old-fashioned rule of thumb: there must be at least as many excluded exogenous variables (zt*) as included endogenous variables (ỹt) to have the equation identified.

This section has discussed identification of structural parameters when 2SLS/IV, one equation at a time, is used. There are other ways to obtain identification, for instance, by imposing restrictions on the covariance matrix. See, for instance, Greene (2000) 16.1-3 for details.

Example 1 (Supply and Demand. Reference: GR 16, Hamilton 9.1.) Consider the simplest simultaneous equations model for supply and demand on a market. Supply is

qt = γ pt + u_st, γ > 0,

and demand is

qt = β pt + α At + u_dt, β < 0,

where At is an observable exogenous demand shock (perhaps income). The only meaningful instrument is At. From the supply equation we then get the moment condition

E At (qt − γ pt) = 0,

which gives one equation in one unknown, γ. The supply equation is therefore exactly identified. In contrast, the demand equation is unidentified, since there is only one (meaningful) moment condition

E At (qt − β pt − α At) = 0,

but two unknowns (β and α).

Example 2 (Supply and Demand: overidentification.) If we change the demand equation in Example 1 to

qt = β pt + α At + b Bt + u_dt, β < 0,

there are now two moment conditions for the supply curve (since there are two useful instruments)

E [At (qt − γ pt); Bt (qt − γ pt)] = [0; 0],

but still only one parameter: the supply curve is now overidentified. The demand curve is still underidentified (two instruments and three parameters).

8.3 Testing for Autocorrelation

This section discusses how GMM can be used to test if a series is autocorrelated. The analysis focuses on first-order autocorrelation, but it is straightforward to extend it to higher-order autocorrelation.

Consider a scalar random variable xt with a zero mean (it is easy to extend the analysis to allow for a non-zero mean). Consider the moment conditions

mt(β) = [xt² − σ²; xt xt−1 − ρσ²], so m̄(β) = (1/T) Σ_{t=1}^T [xt² − σ²; xt xt−1 − ρσ²], with β = [σ²; ρ].  (8.30)

σ² is the variance and ρ the first-order autocorrelation, so ρσ² is the first-order autocovariance. We want to test if ρ = 0. We could proceed along two different routes: estimate ρ and test if it is different from zero, or set ρ to zero and then test overidentifying restrictions. We analyze how these two approaches work when the null hypothesis of ρ = 0 is true.

8.3.1 Estimating the Autocorrelation Coefficient

We estimate both σ² and ρ by using the moment conditions (8.30) and then test if ρ = 0. To do that we need to calculate the asymptotic variance of ρ̂ (there is little hope of being able to calculate the small sample variance, so we have to settle for the asymptotic variance as an approximation).
We have an exactly identified system, so the weight matrix does not matter—we can then proceed as if we had used the optimal weighting matrix (all those results apply).

To find the asymptotic covariance matrix of the parameter estimators, we need the probability limit of the Jacobian of the moments and the covariance matrix of the moments—evaluated at the true parameter values. Let m̄i(β0) denote the ith element of the m̄(β) vector—evaluated at the true parameter values. The probability limit of the Jacobian is

D0 = plim [∂m̄1(β0)/∂σ²  ∂m̄1(β0)/∂ρ; ∂m̄2(β0)/∂σ²  ∂m̄2(β0)/∂ρ] = [−1 0; −ρ −σ²] = [−1 0; 0 −σ²],  (8.31)

since ρ = 0 (the true value). Note that we differentiate with respect to σ², not σ, since we treat σ² as a parameter.

The covariance matrix is more complicated. The definition is

S0 = E [(√T/T) Σ_{t=1}^T mt(β0)][(√T/T) Σ_{t=1}^T mt(β0)]′.

Assume that there is no autocorrelation in mt(β0). We can then simplify as

S0 = E mt(β0) mt(β0)′.

This assumption is stronger than assuming that ρ = 0, but we make it here in order to illustrate the asymptotic distribution. To get anywhere, we assume that xt is iid N(0, σ²). In this case (and with ρ = 0 imposed) we get

S0 = E [xt² − σ²; xt xt−1][xt² − σ²; xt xt−1]′ = E [(xt² − σ²)²  (xt² − σ²)xt xt−1; (xt² − σ²)xt xt−1  (xt xt−1)²]
   = [E xt⁴ − 2σ²E xt² + σ⁴  0; 0  E xt² x²t−1] = [2σ⁴ 0; 0 σ⁴].  (8.32)

To make the simplification in the second line we use the facts that E xt⁴ = 3σ⁴ if xt ∼ N(0, σ²), and that the normality and the iid properties of xt together imply E xt² x²t−1 = E xt² E x²t−1 and E xt³ xt−1 = E σ² xt xt−1 = 0.

By combining (8.31) and (8.32) we get that

ACov(√T [σ̂²; ρ̂]) = (D0′ S0^{−1} D0)^{−1}
 = ([−1 0; 0 −σ²]′ [2σ⁴ 0; 0 σ⁴]^{−1} [−1 0; 0 −σ²])^{−1}
 = [2σ⁴ 0; 0 1].  (8.33)

This shows the standard expression for the uncertainty of the variance and that √T ρ̂ has unit asymptotic variance. Since GMM estimators typically have an asymptotic normal distribution, we have √T ρ̂ →d N(0, 1), so we can test the null hypothesis of no first-order autocorrelation by the test statistic

T ρ̂² ∼ χ²₁.  (8.34)

This is the same as the Box-Ljung test for first-order autocorrelation.

This analysis shows that we are able to arrive at simple expressions for the sampling uncertainty of the variance and the autocorrelation—provided we are willing to make strong assumptions about the data generating process. In particular, we assumed that the data was iid N(0, σ²). One of the strong points of GMM is that we could perform similar tests without making strong assumptions—provided we use a correct estimator of the asymptotic covariance matrix S0 (for instance, Newey-West).
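A small Python sketch of the test in (8.34) (not part of the original notes); the simulated iid data is an illustrative assumption:

    import numpy as np
    from scipy.stats import chi2

    rng = np.random.default_rng(42)
    T = 500
    x = rng.standard_normal(T)                        # null hypothesis: iid, so rho = 0

    sigma2_hat = np.mean(x**2)                        # first moment condition
    rho_hat = np.mean(x[1:] * x[:-1]) / sigma2_hat    # second moment condition
    test_stat = T * rho_hat**2                        # (8.34), asymptotically chi2(1)
    p_value = chi2.sf(test_stat, df=1)
    print(f"rho_hat = {rho_hat:.3f}, T*rho^2 = {test_stat:.2f}, p = {p_value:.3f}")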
8.3.2 Testing the Overidentifying Restriction of No Autocorrelation∗

We can estimate σ² alone and then test if both moment conditions are satisfied at ρ = 0. There are several ways of doing that, but perhaps the most straightforward is to skip the loss function approach to GMM and instead specify the “first order conditions” directly as

0 = A m̄ = [1 0] (1/T) Σ_{t=1}^T [xt² − σ²; xt xt−1],  (8.35)

which sets σ̂² equal to the sample variance.

The only parameter in this estimation problem is σ², so the matrix of derivatives becomes

D0 = plim [∂m̄1(β0)/∂σ²; ∂m̄2(β0)/∂σ²] = [−1; 0].  (8.36)

By using this result, the A matrix in (8.35) and the S0 matrix in (8.32), it is straightforward to calculate the asymptotic covariance matrix of the moment conditions. In general, we have

ACov[√T m̄(β̂)] = [I − D0(A0 D0)^{−1} A0] S0 [I − D0(A0 D0)^{−1} A0]′.  (8.37)

The term in brackets is here (A0 = A since A is a matrix of constants)

I2 − D0 (A0 D0)^{−1} A0 = [1 0; 0 1] − [−1; 0] ([1 0][−1; 0])^{−1} [1 0] = [0 0; 0 1].  (8.38)

We therefore get

ACov[√T m̄(β̂)] = [0 0; 0 1][2σ⁴ 0; 0 σ⁴][0 0; 0 1]′ = [0 0; 0 σ⁴].  (8.39)

Note that the first moment condition has no sampling variance at the estimated parameters, since the choice of σ̂² always sets the first moment condition equal to zero.

The test of the overidentifying restriction that the second moment restriction is also zero is

T m̄′ (ACov[√T m̄(β̂)])⁺ m̄ ∼ χ²₁,  (8.40)

where we have to use a generalized inverse if the covariance matrix is singular (which it is in (8.39)). In this case, we get the test statistic (note the generalized inverse)

T [0; Σ_{t=1}^T xt xt−1/T]′ [0 0; 0 1/σ⁴] [0; Σ_{t=1}^T xt xt−1/T] = T (Σ_{t=1}^T xt xt−1/T)² / σ⁴,  (8.41)

which is T times the square of the sample covariance divided by σ⁴. A sample correlation, ρ̂, would satisfy Σ_{t=1}^T xt xt−1/T = ρ̂ σ̂², which we can use to rewrite (8.41) as T ρ̂² σ̂⁴/σ⁴. By approximating σ⁴ by σ̂⁴ we get the same test statistic as in (8.34).

8.4 Estimating and Testing a Normal Distribution

8.4.1 Estimating the Mean and Variance

This section discusses how the GMM framework can be used to test if a variable is normally distributed. The analysis could easily be changed in order to test other distributions as well.

Suppose we have a sample of the scalar random variable xt and that we want to test if the series is normally distributed. We analyze the asymptotic distribution under the null hypothesis that xt is N(µ, σ²).

We specify four moment conditions

mt = [xt − µ; (xt − µ)² − σ²; (xt − µ)³; (xt − µ)⁴ − 3σ⁴], so m̄ = (1/T) Σ_{t=1}^T mt.  (8.42)

Note that E mt = 0_{4×1} if xt is normally distributed.

Let m̄i(β0) denote the ith element of the m̄(β) vector—evaluated at the true parameter values. The probability limit of the Jacobian is

D0 = plim [∂m̄1(β0)/∂µ  ∂m̄1(β0)/∂σ²; ∂m̄2(β0)/∂µ  ∂m̄2(β0)/∂σ²; ∂m̄3(β0)/∂µ  ∂m̄3(β0)/∂σ²; ∂m̄4(β0)/∂µ  ∂m̄4(β0)/∂σ²]
   = plim (1/T) Σ_{t=1}^T [−1 0; −2(xt − µ) −1; −3(xt − µ)² 0; −4(xt − µ)³ −6σ²] = [−1 0; 0 −1; −3σ² 0; 0 −6σ²].  (8.43)

(Recall that we treat σ², not σ, as a parameter.)

The covariance matrix of the scaled moment conditions (at the true parameter values) is

S0 = E [(√T/T) Σ_{t=1}^T mt(β0)][(√T/T) Σ_{t=1}^T mt(β0)]′,  (8.44)
which can be a very messy expression. Assume that there is no autocorrelation in mt(β0), which would certainly be true if xt is iid. We can then simplify as

S0 = E mt(β0) mt(β0)′,  (8.45)

which is the form we use here for illustration. We therefore have (provided mt(β0) is not autocorrelated)

S0 = E mt(β0) mt(β0)′ = [σ² 0 3σ⁴ 0; 0 2σ⁴ 0 12σ⁶; 3σ⁴ 0 15σ⁶ 0; 0 12σ⁶ 0 96σ⁸].  (8.46)

It is straightforward to derive this result once we have the information in the following remark.

Remark 3 If X ∼ N(µ, σ²), then the first few moments around the mean are E(X − µ) = 0, E(X − µ)² = σ², E(X − µ)³ = 0 (all odd moments are zero), E(X − µ)⁴ = 3σ⁴, E(X − µ)⁶ = 15σ⁶, and E(X − µ)⁸ = 105σ⁸.

Suppose we use the efficient weighting matrix. The asymptotic covariance matrix of the estimated mean and variance is then ((D0′ S0^{−1} D0)^{−1})

([−1 0; 0 −1; −3σ² 0; 0 −6σ²]′ [σ² 0 3σ⁴ 0; 0 2σ⁴ 0 12σ⁶; 3σ⁴ 0 15σ⁶ 0; 0 12σ⁶ 0 96σ⁸]^{−1} [−1 0; 0 −1; −3σ² 0; 0 −6σ²])^{−1}
 = [1/σ² 0; 0 1/(2σ⁴)]^{−1} = [σ² 0; 0 2σ⁴].  (8.47)

This is the same as the result from maximum likelihood estimation, which uses the sample mean and sample variance as the estimators. The extra moment conditions (overidentifying restrictions) do not produce any more efficient estimators—for the simple reason that the first two moments completely characterize the normal distribution.

8.4.2 Testing Normality∗

The payoff from the overidentifying restrictions is that we can test if the series is actually normally distributed. There are several ways of doing that, but perhaps the most straightforward is to skip the loss function approach to GMM and instead specify the “first order conditions” directly as

0 = A m̄ = [1 0 0 0; 0 1 0 0] (1/T) Σ_{t=1}^T [xt − µ; (xt − µ)² − σ²; (xt − µ)³; (xt − µ)⁴ − 3σ⁴].  (8.48)

The asymptotic covariance matrix of the moment conditions is as in (8.37). In this case, the term in brackets is

I4 − D0 (A0 D0)^{−1} A0
 = I4 − [−1 0; 0 −1; −3σ² 0; 0 −6σ²] ([1 0 0 0; 0 1 0 0][−1 0; 0 −1; −3σ² 0; 0 −6σ²])^{−1} [1 0 0 0; 0 1 0 0]
 = [0 0 0 0; 0 0 0 0; −3σ² 0 1 0; 0 −6σ² 0 1].  (8.49)
We therefore get

ACov[√T m̄(β̂)] = [0 0 0 0; 0 0 0 0; −3σ² 0 1 0; 0 −6σ² 0 1][σ² 0 3σ⁴ 0; 0 2σ⁴ 0 12σ⁶; 3σ⁴ 0 15σ⁶ 0; 0 12σ⁶ 0 96σ⁸][0 0 0 0; 0 0 0 0; −3σ² 0 1 0; 0 −6σ² 0 1]′
 = [0 0 0 0; 0 0 0 0; 0 0 6σ⁶ 0; 0 0 0 24σ⁸].  (8.50)

We now form the test statistic for the overidentifying restrictions as in (8.40). In this case, it is (note the generalized inverse)

T [0; 0; Σ_{t=1}^T (xt − µ)³/T; Σ_{t=1}^T ((xt − µ)⁴ − 3σ⁴)/T]′ [0 0 0 0; 0 0 0 0; 0 0 1/(6σ⁶) 0; 0 0 0 1/(24σ⁸)] [0; 0; Σ_{t=1}^T (xt − µ)³/T; Σ_{t=1}^T ((xt − µ)⁴ − 3σ⁴)/T]
 = (T/(6σ⁶)) [Σ_{t=1}^T (xt − µ)³/T]² + (T/(24σ⁸)) [Σ_{t=1}^T ((xt − µ)⁴ − 3σ⁴)/T]².  (8.51)

When we approximate σ by σ̂, then this is the same as the Jarque and Bera test of normality.

The analysis shows (once again) that we can arrive at simple closed form results by making strong assumptions about the data generating process. In particular, we assumed that the moment conditions were serially uncorrelated. The GMM test, with a modified estimator of the covariance matrix S0, can typically be much more general.

8.5 Testing the Implications of an RBC Model

Reference: Christiano and Eichenbaum (1992)

This section shows how the GMM framework can be used to test if an RBC model fits data.

Christiano and Eichenbaum (1992) try to test if the RBC model predictions are significantly different from correlations and variances of data. The first step is to define a vector of parameters and some second moments

Ψ = [δ, ..., σλ, σcp/σy, ..., Corr(y/n, n)],  (8.52)

and estimate it with GMM using moment conditions. One of the moment conditions is that the sample average of the labor share in value added equals the coefficient on labor in a Cobb-Douglas production function, another is just the definition of a standard deviation, and so forth. The distribution of the estimator for Ψ is asymptotically normal. Note that the covariance matrix of the moments is calculated similarly to the Newey-West estimator.

The second step is to note that the RBC model generates second moments as a function h(.) of the model parameters {δ, ..., σλ}, which are in Ψ, that is, the model generated second moments can be thought of as h(Ψ).

The third step is to test if the non-linear restrictions of the model (the model mapping from parameters to second moments) are satisfied. That is, the restriction that the model second moments are as in data,

H(Ψ) = h(Ψ) − [σcp/σy, ..., Corr(y/n, n)] = 0,  (8.53)

is tested with a Wald test. (Note that this is much like the Rβ = 0 constraints in the linear case.) From the delta-method we get

√T H(Ψ̂) →d N(0, (∂H/∂Ψ′) Cov(Ψ̂) (∂H′/∂Ψ)).  (8.54)

Forming the quadratic form

T H(Ψ̂)′ [(∂H/∂Ψ′) Cov(Ψ̂) (∂H′/∂Ψ)]^{−1} H(Ψ̂)  (8.55)

will as usual give a χ² distributed test statistic with as many degrees of freedom as restrictions (the number of functions in (8.53)).
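The following Python sketch (not part of the original notes) shows the mechanics of the Wald test in (8.54)-(8.55) with a numerical Jacobian; the restriction function and the numbers are purely illustrative assumptions.

    import numpy as np
    from scipy.stats import chi2

    def wald_test(H, psi_hat, cov_psi, T, eps=1e-6):
        # T * H' [ (dH/dPsi') Cov (dH/dPsi')' ]^{-1} H, as in (8.55)
        h0 = H(psi_hat)
        k, n = len(h0), len(psi_hat)
        J = np.zeros((k, n))                      # numerical Jacobian dH/dPsi'
        for j in range(n):
            dp = np.zeros(n)
            dp[j] = eps
            J[:, j] = (H(psi_hat + dp) - h0) / eps
        W = T * h0 @ np.linalg.inv(J @ cov_psi @ J.T) @ h0
        return W, chi2.sf(W, df=k)                # chi2 with k degrees of freedom

    # illustrative use: test the (hypothetical) restriction that two moments are equal
    psi_hat = np.array([0.8, 0.75])
    cov_psi = np.array([[0.02, 0.005], [0.005, 0.03]])   # Cov of sqrt(T)(Psi_hat - Psi)
    W, p = wald_test(lambda p: np.array([p[0] - p[1]]), psi_hat, cov_psi, T=200)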
8.6 IV on a System of Equations∗

Suppose we have two equations

y1t = x1t′β1 + u1t
y2t = x2t′β2 + u2t,

and two sets of instruments, z1t and z2t, with the same dimensions as x1t and x2t, respectively. The sample moment conditions are

m̄(β1, β2) = (1/T) Σ_{t=1}^T [z1t (y1t − x1t′β1); z2t (y2t − x2t′β2)].

Let β = (β1′, β2′)′. Then

∂m̄(β1, β2)/∂β′ = [∂[(1/T)Σ z1t(y1t − x1t′β1)]/∂β1′  ∂[(1/T)Σ z1t(y1t − x1t′β1)]/∂β2′; ∂[(1/T)Σ z2t(y2t − x2t′β2)]/∂β1′  ∂[(1/T)Σ z2t(y2t − x2t′β2)]/∂β2′]
 = −[(1/T) Σ_{t=1}^T z1t x1t′  0; 0  (1/T) Σ_{t=1}^T z2t x2t′].

This is invertible, so we can premultiply the first order condition with the inverse of [∂m̄(β)/∂β′]′A and get m̄(β) = 0_{k×1}. We can solve this system for β1 and β2 as

[β1; β2] = [(1/T) Σ_{t=1}^T z1t x1t′  0; 0  (1/T) Σ_{t=1}^T z2t x2t′]^{−1} [(1/T) Σ_{t=1}^T z1t y1t; (1/T) Σ_{t=1}^T z2t y2t]
 = [((1/T) Σ_{t=1}^T z1t x1t′)^{−1} (1/T) Σ_{t=1}^T z1t y1t; ((1/T) Σ_{t=1}^T z2t x2t′)^{−1} (1/T) Σ_{t=1}^T z2t y2t].

This is IV on each equation separately, which follows from having an exactly identified system.

Bibliography

Christiano, L. J., and M. Eichenbaum, 1992, “Current Real-Business-Cycle Theories and Aggregate Labor-Market Fluctuations,” American Economic Review, 82, 430–450.

Davidson, R., and J. G. MacKinnon, 1993, Estimation and Inference in Econometrics, Oxford University Press, Oxford.

Greene, W. H., 2000, Econometric Analysis, Prentice-Hall, Upper Saddle River, New Jersey, 4th edn.

Hamilton, J. D., 1994, Time Series Analysis, Princeton University Press, Princeton.
11 Vector Autoregression (VAR)

Reference: Hamilton (1994) 10-11; Greene (2000) 17.5; Johnston and DiNardo (1997) 9.1-9.2 and Appendix 9.2; and Pindyck and Rubinfeld (1997) 9.2 and 13.5.

Let yt be an n × 1 vector of variables. The VAR(p) is

yt = µ + A1 yt−1 + ... + Ap yt−p + εt, εt is white noise, Cov(εt) = Ω.  (11.1)

Example 1 (VAR(2) of 2 × 1 vector.) Let yt = [xt zt]′. Then

[xt; zt] = [A1,11 A1,12; A1,21 A1,22][xt−1; zt−1] + [A2,11 A2,12; A2,21 A2,22][xt−2; zt−2] + [ε1,t; ε2,t].  (11.2)

Issues:

• Variable selection
• Lag length
• Estimation
• Purpose: data description (Granger-causality, impulse response, forecast error variance decomposition), forecasting, policy analysis (Lucas critique)?

11.1 Canonical Form

A VAR(p) can be rewritten as a VAR(1). For instance, a VAR(2) can be written as

[yt; yt−1] = [µ; 0] + [A1 A2; I 0][yt−1; yt−2] + [εt; 0], or  (11.3)
yt* = µ* + A y*t−1 + εt*.  (11.4)

Example 2 (Canonical form of a univariate AR(2).)

[yt; yt−1] = [µ; 0] + [a1 a2; 1 0][yt−1; yt−2] + [εt; 0].

Example 3 (Canonical form of VAR(2) of 2 × 1 vector.) Continuing on the previous example, we get

[xt; zt; xt−1; zt−1] = [A1,11 A1,12 A2,11 A2,12; A1,21 A1,22 A2,21 A2,22; 1 0 0 0; 0 1 0 0][xt−1; zt−1; xt−2; zt−2] + [ε1,t; ε2,t; 0; 0].

11.2 Moving Average Form and Stability

Consider a VAR(1), or a VAR(1) representation of a VAR(p) or an AR(p)

yt* = A y*t−1 + εt*.  (11.5)

Solve recursively backwards (substitute for y*t−s = A y*t−s−1 + ε*t−s, s = 1, 2, ...) to get the vector moving average representation (VMA), or impulse response function

yt* = A(A y*t−2 + ε*t−1) + εt*
    = A² y*t−2 + A ε*t−1 + εt*
    = A²(A y*t−3 + ε*t−2) + A ε*t−1 + εt*
    = A³ y*t−3 + A² ε*t−2 + A ε*t−1 + εt*
    ...
    = A^{K+1} y*t−K−1 + Σ_{s=0}^K A^s ε*t−s.  (11.6)

Remark 4 (Spectral decomposition.) The n eigenvalues (λi) and associated eigenvectors (zi) of the n × n matrix A satisfy

(A − λi In) zi = 0_{n×1}.

If the eigenvectors are linearly independent, then

A = ZΛZ^{−1}, where Λ = diag(λ1, λ2, ..., λn) and Z = [z1 z2 ... zn].
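As a small illustration (not part of the notes), the Python sketch below builds the companion matrix of a VAR(2) as in (11.3) and checks the stability condition (all eigenvalues less than one in modulus) discussed next; the coefficient matrices are illustrative assumptions.

    import numpy as np

    A1 = np.array([[0.5, 0.2], [0.1, -0.3]])
    A2 = np.array([[0.1, 0.0], [0.0, 0.1]])
    n = A1.shape[0]

    # companion matrix of the VAR(2): [[A1, A2], [I, 0]]
    A = np.block([[A1, A2], [np.eye(n), np.zeros((n, n))]])
    eigvals = np.linalg.eigvals(A)
    stable = np.all(np.abs(eigvals) < 1)     # all eigenvalues inside the unit circle
    print(np.abs(eigvals), "stable:", stable)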
Note that we therefore get

A² = AA = ZΛZ^{−1}ZΛZ^{−1} = ZΛΛZ^{−1} = ZΛ²Z^{−1} ⇒ A^q = ZΛ^q Z^{−1}.

Remark 5 (Modulus of complex number.) If λ = a + bi, where i = √−1, then |λ| = |a + bi| = √(a² + b²).

We want lim_{K→∞} A^{K+1} y*t−K−1 = 0 (stable VAR) to get a moving average representation of yt (where the influence of the starting values vanishes asymptotically). We note from the spectral decomposition that A^{K+1} = ZΛ^{K+1}Z^{−1}, where Z is the matrix of eigenvectors and Λ a diagonal matrix with eigenvalues. Clearly, lim_{K→∞} A^{K+1} y*t−K−1 = 0 is satisfied if the eigenvalues of A are all less than one in modulus.

Example 6 (AR(1).) For the univariate AR(1) yt = a yt−1 + εt, the characteristic equation is (a − λ)z = 0, which is only satisfied if the eigenvalue is λ = a. The AR(1) is therefore stable (and stationary) if −1 < a < 1.

If we have a stable VAR, then (11.6) can be written

yt* = Σ_{s=0}^∞ A^s ε*t−s = εt* + A ε*t−1 + A² ε*t−2 + ...  (11.7)

We may pick out the first n equations from (11.7) (to extract the “original” variables from the canonical form) and write them as

yt = εt + C1 εt−1 + C2 εt−2 + ...,  (11.8)

which is the vector moving average, VMA, form of the VAR.

Example 7 (AR(2), Example 2 continued.) Let µ = 0 in Example 2 and note that the VMA of the canonical form is

[yt; yt−1] = [εt; 0] + [a1 a2; 1 0][εt−1; 0] + [a1² + a2  a1 a2; a1  a2][εt−2; 0] + ...

The MA of yt is therefore

yt = εt + a1 εt−1 + (a1² + a2) εt−2 + ...

Note that

∂yt/∂εt−s′ = Cs or ∂Et yt+s/∂εt′ = Cs, with C0 = I,  (11.9)

so the impulse response function is given by {I, C1, C2, ...}. Note that it is typically only meaningful to discuss impulse responses to uncorrelated shocks with economic interpretations. The idea behind structural VARs (discussed below) is to impose enough restrictions to achieve this.

Example 8 (Impulse response function for AR(1).) Let yt = ρ yt−1 + εt. The MA representation is yt = Σ_{s=0}^t ρ^s εt−s, so ∂yt/∂εt−s = ∂Et yt+s/∂εt = ρ^s. Stability requires |ρ| < 1, so the effect of the initial value eventually dies off (lim_{s→∞} ∂yt/∂εt−s = 0).

Example 9 (Numerical VAR(1) of 2 × 1 vector.) Consider the VAR(1)

[xt; zt] = [0.5 0.2; 0.1 −0.3][xt−1; zt−1] + [ε1,t; ε2,t].

The eigenvalues are approximately 0.52 and −0.32, so this is a stable VAR. The VMA is

[xt; zt] = [ε1,t; ε2,t] + [0.5 0.2; 0.1 −0.3][ε1,t−1; ε2,t−1] + [0.27 0.04; 0.02 0.11][ε1,t−2; ε2,t−2] + ...
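A short Python check (not from the notes) of the VMA coefficients in Example 9, using that Cs = A^s for a VAR(1):

    import numpy as np

    A = np.array([[0.5, 0.2], [0.1, -0.3]])
    C = np.eye(2)
    for s in range(3):
        print(f"C_{s} =\n{C}")
        C = A @ C          # next VMA coefficient matrix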
11.3 Estimation

The MLE, conditional on the initial observations, of the VAR is the same as OLS estimates of each equation separately. The MLE of the ijth element in Cov(εt) is given by Σ_{t=1}^T v̂it v̂jt/T, where v̂it and v̂jt are the OLS residuals.

Note that the VAR system is a system of “seemingly unrelated regressions,” with the same regressors in each equation. The OLS on each equation is therefore the GLS, which coincides with MLE if the errors are normally distributed.

11.4 Granger Causality

Main message: Granger-causality might be useful, but it is not the same as causality.

Definition: if z cannot help forecast x, then z does not Granger-cause x; the MSE of the forecast E(xt | xt−s, zt−s, s > 0) equals the MSE of the forecast E(xt | xt−s, s > 0).

Test: Redefine the dimensions of xt and zt in (11.2): let xt be n1 × 1 and zt be n2 × 1. If the n1 × n2 matrices A1,12 = 0 and A2,12 = 0, then z fails to Granger-cause x. (In general, we would require As,12 = 0 for s = 1, ..., p.) This carries over to the MA representation in (11.8), so Cs,12 = 0.

These restrictions can be tested with an F-test. The easiest case is when x is a scalar, since we then simply have a set of linear restrictions on a single OLS regression.

Example 10 (RBC and nominal neutrality.) Suppose we have an RBC model which says that money has no effect on the real variables (for instance, output, capital stock, and the productivity level). Money stock should not Granger-cause real variables.

Example 11 (Granger causality and causality.) Do Christmas cards cause Christmas?

Example 12 (Granger causality and causality II, from Hamilton 11.) Consider the price Pt of an asset paying dividends Dt. Suppose the expected return (Et(Pt+1 + Dt+1)/Pt) is a constant, R. The price then satisfies Pt = Et Σ_{s=1}^∞ R^{−s} Dt+s. Suppose Dt = ut + δut−1 + vt, so Et Dt+1 = δut and Et Dt+s = 0 for s > 1. This gives Pt = δut/R, and Dt = ut + vt + R Pt−1, so the VAR is

[Pt; Dt] = [0 0; R 0][Pt−1; Dt−1] + [δut/R; ut + vt],

where P Granger-causes D. Of course, the true causality is from D to P. Problem: forward looking behavior.

Example 13 (Money and output, Sims (1972).) Sims found that output, y, does not Granger-cause money, m, but that m Granger-causes y. His interpretation was that money supply is exogenous (set by the Fed) and that money has real effects. Notice how he used a combination of two Granger causality tests to make an economic interpretation.

Example 14 (Granger causality and omitted information.∗) Consider the VAR

[y1t; y2t; y3t] = [a11 a12 0; 0 a22 0; 0 a32 a33][y1t−1; y2t−1; y3t−1] + [ε1t; ε2t; ε3t].

Notice that y2t and y3t do not depend on y1t−1, so the latter should not be able to Granger-cause y3t. However, suppose we forget to use y2t in the regression and then ask if y1t Granger-causes y3t. The answer might very well be yes, since y1t−1 contains information about y2t−1 which does affect y3t. (If you let y1t be money, y2t be the (autocorrelated) Solow residual, and y3t be output, then this is a short version of the comment in King (1986) on Bernanke (1986) (see below) on why money may appear to Granger-cause output.) Also note that adding a nominal interest rate to Sims's (see above) money-output VAR showed that money cannot be taken to be exogenous.

11.5 Forecasts and Forecast Error Variance

The forecast error of the s period ahead forecast is

yt+s − Et yt+s = εt+s + C1 εt+s−1 + ... + Cs−1 εt+1,  (11.10)

so the covariance matrix of the (s periods ahead) forecasting errors is

E(yt+s − Et yt+s)(yt+s − Et yt+s)′ = Ω + C1ΩC1′ + ... + Cs−1ΩC′s−1.  (11.11)

For a VAR(1), Cs = A^s, so we have

yt+s − Et yt+s = εt+s + A εt+s−1 + ... + A^{s−1} εt+1, and  (11.12)
E(yt+s − Et yt+s)(yt+s − Et yt+s)′ = Ω + AΩA′ + ... + A^{s−1}Ω(A^{s−1})′.  (11.13)

Note that lim_{s→∞} Et yt+s = 0, that is, the forecast goes to the unconditional mean (which is zero here, since there are no constants—you could think of yt as a deviation from the mean). Consequently, the forecast error becomes the VMA representation (11.8). Similarly, the forecast error variance goes to the unconditional variance.
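A minimal Python sketch (not from the notes) of (11.13): cumulating A^jΩ(A^j)′ gives the s-step-ahead forecast error covariance, which approaches the unconditional variance as s grows; the matrices A and Ω are illustrative assumptions.

    import numpy as np

    A = np.array([[0.5, 0.2], [0.1, -0.3]])
    Omega = np.array([[1.0, 0.3], [0.3, 2.0]])

    def fe_cov(A, Omega, s):
        V = np.zeros_like(Omega)
        Aj = np.eye(A.shape[0])
        for _ in range(s):              # sum_{j=0}^{s-1} A^j Omega (A^j)'
            V += Aj @ Omega @ Aj.T
            Aj = Aj @ A
        return V

    print(fe_cov(A, Omega, 1))          # = Omega
    print(fe_cov(A, Omega, 8))          # close to the unconditional variance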
Example 15 (Unconditional variance of VAR(1).) Letting s → ∞ in (11.13) gives

Eyt yt′ = Σ_{s=0}^∞ A^s Ω (A^s)′
        = Ω + [AΩA′ + A²Ω(A²)′ + ...]
        = Ω + A[Ω + AΩA′ + ...]A′
        = Ω + A(Eyt yt′)A′,

which suggests that we can calculate Eyt yt′ by an iteration (backwards in time) Φt = Ω + AΦt+1A′, starting from ΦT = I, until convergence.

11.6 Forecast Error Variance Decompositions∗

If the shocks are uncorrelated, then it is often useful to calculate the fraction of Var(yi,t+s − Et yi,t+s) due to the jth shock, the forecast error variance decomposition. Suppose the covariance matrix of the shocks, here Ω, is a diagonal n × n matrix with the variances ωii along the diagonal. Let cqi be the ith column of Cq. We then have

Cq Ω Cq′ = Σ_{i=1}^n ωii cqi c′qi.  (11.14)

Example 16 (Illustration of (11.14) with n = 2.) Suppose

Cq = [c11 c12; c21 c22] and Ω = [ω11 0; 0 ω22],

then

Cq Ω Cq′ = [ω11 c11² + ω22 c12²,  ω11 c11 c21 + ω22 c12 c22;  ω11 c11 c21 + ω22 c12 c22,  ω11 c21² + ω22 c22²],

which should be compared with

ω11 [c11; c21][c11; c21]′ + ω22 [c12; c22][c12; c22]′ = ω11 [c11²  c11 c21; c11 c21  c21²] + ω22 [c12²  c12 c22; c12 c22  c22²].

Applying this on (11.11) gives

E(yt+s − Et yt+s)(yt+s − Et yt+s)′ = Σ_{i=1}^n ωii I + Σ_{i=1}^n ωii c1i (c1i)′ + ... + Σ_{i=1}^n ωii cs−1i (cs−1i)′
 = Σ_{i=1}^n ωii [I + c1i (c1i)′ + ... + cs−1i (cs−1i)′],  (11.15)

which shows how the covariance matrix for the s-period forecast errors can be decomposed into its n components.

11.7 Structural VARs

11.7.1 Structural and Reduced Forms

We are usually not interested in the impulse response function (11.8) or the variance decomposition (11.11) with respect to εt, but with respect to some structural shocks, ut, which have clearer interpretations (technology, monetary policy shock, etc.).

Suppose the structural form of the model is

F yt = α + B1 yt−1 + ... + Bp yt−p + ut, ut is white noise, Cov(ut) = D.  (11.16)

This could, for instance, be an economic model derived from theory.¹

Provided F^{−1} exists, it is possible to write the time series process as

yt = F^{−1}α + F^{−1}B1 yt−1 + ... + F^{−1}Bp yt−p + F^{−1}ut  (11.17)
   = µ + A1 yt−1 + ... + Ap yt−p + εt, Cov(εt) = Ω,  (11.18)

where

µ = F^{−1}α, As = F^{−1}Bs, and εt = F^{−1}ut, so Ω = F^{−1}D(F^{−1})′.  (11.19)

Equation (11.18) is a VAR model, so a VAR can be thought of as a reduced form of the structural model (11.16).

The key to understanding the relation between the structural model and the VAR is the F matrix, which controls how the endogenous variables, yt, are linked to each other contemporaneously. In fact, identification of a VAR amounts to choosing an F matrix. Once that is done, impulse responses and forecast error variance decompositions can be made with respect to the structural shocks.

¹ This is a “structural model” in a traditional, Cowles commission, sense. This might be different from what modern macroeconomists would call structural.
For instance, the impulse response function of the VAR, (11.8), can be rewritten in terms of ut = Fεt (from (11.19))

yt = εt + C1 εt−1 + C2 εt−2 + ...
   = F^{−1}F εt + C1 F^{−1}F εt−1 + C2 F^{−1}F εt−2 + ...
   = F^{−1}ut + C1 F^{−1}ut−1 + C2 F^{−1}ut−2 + ...  (11.20)

Remark 17 The easiest way to calculate this representation is by first finding F^{−1} (see below), then writing (11.18) as

yt = µ + A1 yt−1 + ... + Ap yt−p + F^{−1}ut.  (11.21)

To calculate the impulse responses to the first element in ut, set yt−1, ..., yt−p equal to the long-run average, (I − A1 − ... − Ap)^{−1}µ, make the first element in ut unity and all other elements zero. Calculate the response by iterating forward on (11.21), but putting all elements in ut+1, ut+2, ... to zero. This procedure can be repeated for the other elements of ut.

We would typically pick F such that the elements in ut are uncorrelated with each other, so they have a clear interpretation.

The VAR form can be estimated directly from data. Is it then possible to recover the structural parameters in (11.16) from the estimated VAR (11.18)? Not without restrictions on the structural parameters in F, Bs, α, and D. To see why, note that in the structural form (11.16) we have (p + 1)n² parameters in {F, B1, ..., Bp}, n parameters in α, and n(n + 1)/2 unique parameters in D (it is symmetric). In the VAR (11.18) we have fewer parameters: pn² in {A1, ..., Ap}, n parameters in µ, and n(n + 1)/2 unique parameters in Ω. This means that we have to impose at least n² restrictions on the structural parameters {F, B1, ..., Bp, α, D} to identify all of them. This means, of course, that many different structural models can have exactly the same reduced form.

Example 18 (Structural form of the 2 × 1 case.) Suppose the structural form of the previous example is

[F11 F12; F21 F22][xt; zt] = [B1,11 B1,12; B1,21 B1,22][xt−1; zt−1] + [B2,11 B2,12; B2,21 B2,22][xt−2; zt−2] + [u1,t; u2,t].

This structural form has 3 × 4 + 3 unique parameters. The VAR in (11.2) has 2 × 4 + 3. We need at least 4 restrictions on {F, B1, B2, D} to identify them from {A1, A2, Ω}.

11.7.2 “Triangular” Identification 1: Triangular F with Fii = 1 and Diagonal D

Reference: Sims (1980).

The perhaps most common way to achieve identification of the structural parameters is to restrict the contemporaneous response of the different endogenous variables, yt, to the different structural shocks, ut. Within this class of restrictions, the triangular identification is the most popular: assume that F is lower triangular (n(n + 1)/2 restrictions) with diagonal elements equal to unity, and that D is diagonal (n(n − 1)/2 restrictions), which gives n² restrictions (exact identification).

A lower triangular F matrix is very restrictive. It means that the first variable can react to lags and the first shock, the second variable to lags and the first two shocks, etc. This is a recursive simultaneous equations model, and we obviously need to be careful with how we order the variables. The assumption that Fii = 1 is just a normalization.

A diagonal D matrix seems to be something that we would often like to have in a structural form in order to interpret the shocks as, for instance, demand and supply shocks. The diagonal elements of D are the variances of the structural shocks.

Example 19 (Lower triangular F: going from structural form to VAR.) Suppose the structural form is

[1 0; −α 1][xt; zt] = [B11 B12; B21 B22][xt−1; zt−1] + [u1,t; u2,t].

This is a recursive system where xt does not depend on the contemporaneous zt, and therefore not on the contemporaneous u2t (see first equation). However, zt does depend on xt (second equation). The VAR (reduced form) is obtained by premultiplying by F^{−1}

[xt; zt] = [1 0; α 1][B11 B12; B21 B22][xt−1; zt−1] + [1 0; α 1][u1,t; u2,t]
         = [A11 A12; A21 A22][xt−1; zt−1] + [ε1,t; ε2,t].

This means that ε1t = u1t, so the first VAR shock equals the first structural shock. In contrast, ε2,t = α u1,t + u2,t, so the second VAR shock is a linear combination of the first two shocks.
The covariance matrix of the VAR shocks is therefore

Cov([ε1,t; ε2,t]) = [Var(u1t), α Var(u1t); α Var(u1t), α² Var(u1t) + Var(u2t)].

This set of identifying restrictions can be implemented by estimating the structural form with LS—equation by equation. The reason is that this is just the old fashioned fully recursive system of simultaneous equations. See, for instance, Greene (2000) 16.3.

11.7.3 “Triangular” Identification 2: Triangular F and D = I

The identifying restrictions in Section 11.7.2 are actually the same as assuming that F is triangular and that D = I. In this latter case, the restriction on the diagonal elements of F has been moved to the diagonal elements of D. This is just a change of normalization (that the structural shocks have unit variance). It happens that this alternative normalization is fairly convenient when we want to estimate the VAR first and then recover the structural parameters from the VAR estimates.

Example 20 (Change of normalization in Example 19.) Suppose the structural shocks in Example 19 have the covariance matrix

D = Cov([u1,t; u2,t]) = [σ1² 0; 0 σ2²].

Premultiply the structural form in Example 19 by

[1/σ1 0; 0 1/σ2]

to get

[1/σ1 0; −α/σ2 1/σ2][xt; zt] = [B11/σ1 B12/σ1; B21/σ2 B22/σ2][xt−1; zt−1] + [u1,t/σ1; u2,t/σ2].

This structural form has a triangular F matrix (with diagonal elements that can be different from unity), and a covariance matrix equal to an identity matrix.

The reason why this alternative normalization is convenient is that it allows us to use the widely available Cholesky decomposition.

Remark 21 (Cholesky decomposition.) Let Ω be an n × n symmetric positive definite matrix. The Cholesky decomposition gives the unique lower triangular P such that Ω = PP′ (some software returns an upper triangular matrix, that is, Q in Ω = Q′Q instead).

Remark 22 Note the following two important features of the Cholesky decomposition. First, each column of P is only identified up to a sign transformation; they can be reversed at will. Second, the diagonal elements in P are typically not unity.

Remark 23 (Changing sign of column and inverting.) Suppose the square matrix A2 is the same as A1 except that the ith and jth columns have the reverse signs. Then A2^{−1} is the same as A1^{−1} except that the ith and jth rows have the reverse sign.

This set of identifying restrictions can be implemented by estimating the VAR with LS and then taking the following steps.

• Step 1. From (11.19) Ω = F^{−1}I(F^{−1})′ (recall D = I is assumed), so a Cholesky decomposition recovers F^{−1} (lower triangular F gives a similar structure of F^{−1}, and vice versa, so this works). The signs of each column of F^{−1} can be chosen freely, for instance, so that a productivity shock gets a positive, rather than negative, effect on output. Invert F^{−1} to get F.

• Step 2. Invert the expressions in (11.19) to calculate the structural parameters from the VAR parameters as α = Fµ, and Bs = FAs.
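A small numerical sketch in Python (not part of the notes) of Steps 1-2: recover F^{−1} from a Cholesky decomposition of Ω (with D = I and a lower triangular F), then back out F, α and Bs; the matrices Ω, A1 and µ are illustrative assumptions.

    import numpy as np

    Omega = np.array([[0.04, 0.01], [0.01, 0.09]])   # Cov of VAR residuals
    A1 = np.array([[0.5, 0.2], [0.1, -0.3]])         # VAR(1) slope estimates
    mu = np.array([0.1, 0.2])

    Finv = np.linalg.cholesky(Omega)   # Omega = F^{-1} (F^{-1})', lower triangular
    F = np.linalg.inv(Finv)            # Step 1
    alpha = F @ mu                     # Step 2: structural constant
    B1 = F @ A1                        #         structural slope matrix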
Example 24 (Identification of the 2 × 1 case.) Suppose the structural form of the previous example is

[F11 0; F21 F22][xt; zt] = [B1,11 B1,12; B1,21 B1,22][xt−1; zt−1] + [B2,11 B2,12; B2,21 B2,22][xt−2; zt−2] + [u1,t; u2,t],

with D = [1 0; 0 1]. Step 1 above solves

[Ω11 Ω12; Ω12 Ω22] = [F11 0; F21 F22]^{−1} ([F11 0; F21 F22]^{−1})′ = [1/F11², −F21/(F11²F22); −F21/(F11²F22), (F21² + F11²)/(F11²F22²)]

for the three unknowns F11, F21, and F22 in terms of the known Ω11, Ω12, and Ω22. Note that the identifying restrictions are that D = I (three restrictions) and F12 = 0 (one restriction). (This system is just four nonlinear equations in three unknowns—one of the equations for Ω12 is redundant. You do not need the Cholesky decomposition to solve it, since it could be solved with any numerical solver of non-linear equations—but why make life even more miserable?)

A practical consequence of this normalization is that the impulse response of shock i equal to unity is exactly the same as the impulse response of shock i equal to Std(uit) in the normalization in Section 11.7.2.

11.7.4 Other Identification Schemes∗

Reference: Bernanke (1986).

Not all economic models can be written in this recursive form. However, there are often cross-restrictions between different elements in F or between elements in F and D, or some other type of restrictions on F which may allow us to identify the system.

Suppose we have (estimated) the parameters of the VAR (11.18), and that we want to impose D = Cov(ut) = I. From (11.19) we then have (D = I)

Ω = F^{−1}(F^{−1})′.  (11.22)

As before we need n(n − 1)/2 restrictions on F, but this time we don’t want to impose the restriction that all elements in F above the principal diagonal are zero. Given these restrictions (whatever they are), we can solve for the remaining elements in B, typically with a numerical method for solving systems of non-linear equations.

11.7.5 What if the VAR Shocks are Uncorrelated (Ω = I)?∗

Suppose we estimate a VAR and find that the covariance matrix of the estimated residuals is (almost) an identity matrix (or diagonal). Does this mean that the identification is superfluous? No, not in general. Yes, if we also want to impose the restriction that F is triangular.

There are many ways to reshuffle the shocks and still get orthogonal shocks. Recall that the structural shocks are linear functions of the VAR shocks, ut = Fεt, and that we assume that Cov(εt) = Ω = I and we want Cov(ut) = I; that is, from (11.19) we then have (with D = I)

FF′ = I.  (11.23)

There are many such F matrices: the class of those matrices even has a name: orthogonal matrices (all columns in F are orthonormal). However, there is only one lower triangular F which satisfies (11.23) (the one returned by a Cholesky decomposition, which is I).

Suppose you know that F is lower triangular (and you intend to use this as the identifying assumption), but that your estimated Ω is (almost, at least) diagonal. The logic then requires that F is not only lower triangular, but also diagonal. This means that ut = εt (up to a scaling factor). Therefore, a finding that the VAR shocks are uncorrelated combined with the identifying restriction that F is triangular implies that the structural and reduced form shocks are proportional. We can draw no such conclusion if the identifying assumption is something else than lower triangularity.

Example 25 (Rotation of vectors (“Givens rotations”).) Consider the transformation of the vector ε into the vector u, u = G′ε, where G = In except that Gii = c, Gik = s, Gki = −s, and Gkk = c. If we let c = cos θ and s = sin θ for some angle θ, then G′G = I. To see this, consider the simple example where i = 2 and k = 3:

[1 0 0; 0 c −s; 0 s c][1 0 0; 0 c s; 0 −s c] = [1 0 0; 0 c² + s² 0; 0 0 c² + s²],
which is an identity matrix since cos²θ + sin²θ = 1. The transformation u = G′ε gives

ut = εt for t ≠ i, k
ui = εi c − εk s
uk = εi s + εk c.

The effect of this transformation is to rotate the ith and kth vectors counterclockwise through an angle of θ. (Try it in two dimensions.) There is an infinite number of such transformations (apply a sequence of such transformations with different i and k, change θ, etc.).

Example 26 (Givens rotations and the F matrix.) We could take F in (11.23) to be (the transpose of) any such sequence of Givens rotations. For instance, if G1 and G2 are Givens rotations, then F = G1′ or F = G2′ or F = G1′G2′ are all valid.

11.7.6 Identification via Long-Run Restrictions - No Cointegration∗

Suppose we have estimated a VAR system (11.1) for the first differences of some variables yt = Δxt, and that we have calculated the impulse response function as in (11.8), which we rewrite as

Δxt = εt + C1 εt−1 + C2 εt−2 + ...
    = C(L) εt, with Cov(εt) = Ω.  (11.24)

To find the MA of the level of xt, we solve recursively

xt = C(L) εt + xt−1
   = C(L) εt + C(L) εt−1 + xt−2
   ...
   = C(L)(εt + εt−1 + εt−2 + ...)
   = εt + (C1 + I) εt−1 + (C2 + C1 + I) εt−2 + ...
   = C⁺(L) εt, where C⁺s = Σ_{j=0}^s Cj with C0 = I.  (11.25)

As before, the structural shocks, ut, are

ut = Fεt with Cov(ut) = D.

The VMA in terms of the structural shocks is therefore

xt = C⁺(L) F^{−1} ut, where C⁺s = Σ_{j=0}^s Cj with C0 = I.  (11.26)

The C⁺(L) polynomial is known from the estimation, so we need to identify F in order to use this equation for impulse response functions and variance decompositions with respect to the structural shocks.

As before we assume that D = I, so

Ω = F^{−1}D(F^{−1})′  (11.27)

in (11.19) gives n(n + 1)/2 restrictions.

We now add restrictions on the long run impulse responses. From (11.26) we have

lim_{s→∞} ∂xt+s/∂ut′ = lim_{s→∞} C⁺s F^{−1} = C(1) F^{−1},  (11.28)

where C(1) = Σ_{j=0}^∞ Cj. We impose n(n − 1)/2 restrictions on these long run responses. Together we have n² restrictions, which allows us to identify all elements in F.

In general, (11.27) and (11.28) is a set of non-linear equations which have to be solved for the elements in F. However, it is common to assume that (11.28) is a lower triangular matrix. We can then use the following “trick” to find F. Since εt = F^{−1}ut

E C(1) εt εt′ C(1)′ = E C(1) F^{−1} ut ut′ (F^{−1})′ C(1)′
C(1) Ω C(1)′ = C(1) F^{−1} (F^{−1})′ C(1)′.  (11.29)

We can therefore solve for a lower triangular matrix

Λ = C(1) F^{−1}  (11.30)

by calculating the Cholesky decomposition of the left hand side of (11.29) (which is available from the VAR estimate). Finally, we solve for F^{−1} from (11.30).
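A minimal Python sketch (not from the notes) of the “trick” in (11.29)-(11.30): Cholesky-decompose C(1)ΩC(1)′ to get Λ = C(1)F^{−1} and solve for F^{−1}; the VAR(1) matrices are illustrative assumptions, and C(1) = (I − A)^{−1} for a stable VAR(1).

    import numpy as np

    A = np.array([[0.5, 0.2], [0.1, -0.3]])        # VAR(1) for the first differences
    Omega = np.array([[1.0, 0.3], [0.3, 2.0]])

    C1 = np.linalg.inv(np.eye(2) - A)              # C(1) = (I - A)^{-1} for a VAR(1)
    Lam = np.linalg.cholesky(C1 @ Omega @ C1.T)    # lower triangular Lambda
    Finv = np.linalg.inv(C1) @ Lam                 # F^{-1} from (11.30)
    F = np.linalg.inv(Finv)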
Example 27 (The 2 × 1 case.) Suppose the structural form is

[F11 F12; F21 F22][Δxt; Δzt] = [B11 B12; B21 B22][Δxt−1; Δzt−1] + [u1,t; u2,t],

and we have an estimate of the reduced form

[Δxt; Δzt] = A[Δxt−1; Δzt−1] + [ε1,t; ε2,t], with Cov([ε1,t; ε2,t]) = Ω.

The VMA form (as in (11.24)) is

[Δxt; Δzt] = [ε1,t; ε2,t] + A[ε1,t−1; ε2,t−1] + A²[ε1,t−2; ε2,t−2] + ...

and for the level (as in (11.25))

[xt; zt] = [ε1,t; ε2,t] + (A + I)[ε1,t−1; ε2,t−1] + (A² + A + I)[ε1,t−2; ε2,t−2] + ...

or, since εt = F^{−1}ut,

[xt; zt] = F^{−1}[u1,t; u2,t] + (A + I)F^{−1}[u1,t−1; u2,t−1] + (A² + A + I)F^{−1}[u1,t−2; u2,t−2] + ...

There are 8 + 3 parameters in the structural form and 4 + 3 parameters in the VAR, so we need four restrictions. Assume that Cov(ut) = I (three restrictions) and that the long run response of u1,t−s on xt is zero, that is,

[unrestricted 0; unrestricted unrestricted] = (I + A + A² + ...)[F11 F12; F21 F22]^{−1}
 = (I − A)^{−1}[F11 F12; F21 F22]^{−1}
 = [1 − A11, −A12; −A21, 1 − A22]^{−1}[F11 F12; F21 F22]^{−1}.

The upper right element of the right hand side is

(−F12 + F12 A22 + A12 F11) / [(1 − A22 − A11 + A11 A22 − A12 A21)(F11 F22 − F12 F21)],

which is one restriction on the elements in F. The other three are given by F^{−1}(F^{−1})′ = Ω, that is,

[(F22² + F12²)/(F11 F22 − F12 F21)², −(F22 F21 + F12 F11)/(F11 F22 − F12 F21)²; −(F22 F21 + F12 F11)/(F11 F22 − F12 F21)², (F21² + F11²)/(F11 F22 − F12 F21)²] = [Ω11 Ω12; Ω12 Ω22].

11.8 Cointegration, Common Trends, and Identification via Long-Run Restrictions∗

These notes are a reading guide to Mellander, Vredin, and Warne (1992), which is well beyond the first year course in econometrics. See also Englund, Vredin, and Warne (1994). (I have not yet double checked this section.)

11.8.1 Common Trends Representation and Cointegration

The common trends representation of the n variables in yt is

yt = y0 + ϒτt + Φ(L)[ϕt; ψt], with Cov([ϕt; ψt]) = In,  (11.31)
τt = τt−1 + ϕt,  (11.32)

where Φ(L) is a stable matrix polynomial in the lag operator. We see that the k × 1 vector ϕt has permanent effects on (at least some elements in) yt, while the r × 1 (r = n − k) ψt does not.

The last component in (11.31) is stationary, but τt is a k × 1 vector of random walks, so the n × k matrix ϒ makes yt share the non-stationary components: there are k common trends. If k < n, then we could find (at least) r linear combinations of yt, α′yt, where α′ is an r × n matrix of cointegrating vectors, which are such that the trends cancel each other (α′ϒ = 0).

Remark 28 (Lag operator.) We have the following rules: (i) L^k xt = xt−k; (ii) if Φ(L) = a + bL^{−m} + cL^n, then Φ(L)(xt + yt) = a(xt + yt) + b(xt+m + yt+m) + c(xt−n + yt−n) and Φ(1) = a + b + c.
Example 29 (Söderlind and Vredin (1996).) Suppose we have

yt = [ln Yt (output); ln Pt (price level); ln Mt (money stock); ln Rt (gross interest rate)],
ϒ = [0 1; 1 −1; 1 0; 0 0], and τt = [money supply trend; productivity trend],

then we see that ln Rt and ln Yt + ln Pt − ln Mt (that is, log velocity) are stationary, so

α′ = [0 0 0 1; 1 1 −1 0]

are (or rather, span the space of) cointegrating vectors. We also see that α′ϒ = 0_{2×2}.

11.8.2 VAR Representation

The VAR representation is as in (11.1). In practice, we often estimate the parameters in A*s, α, the n × r matrix γ, and Ω = Cov(εt) in the vector “error correction form”

Δyt = A*1 Δyt−1 + ... + A*_{p−1} Δyt−p+1 + γα′yt−1 + εt, with Cov(εt) = Ω.  (11.33)

This can easily be rewritten on the VAR form (11.1) or on the vector MA representation for Δyt

Δyt = εt + C1 εt−1 + C2 εt−2 + ...  (11.34)
    = C(L) εt.  (11.35)

To find the MA of the level of yt, we recurse on (11.35)

yt = C(L) εt + yt−1
   = C(L) εt + C(L) εt−1 + yt−2
   ...
   = C(L)(εt + εt−1 + εt−2 + ... + ε0) + y0.  (11.36)

We now try to write (11.36) in a form which resembles the common trends representation (11.31)-(11.32) as much as possible.

11.8.3 Multivariate Beveridge-Nelson decomposition

We want to split a vector of non-stationary series into some random walks and the rest (which is stationary). Rewrite (11.36) by adding and subtracting C(1)(εt + εt−1 + ...)

yt = C(1)(εt + εt−1 + εt−2 + ... + ε0) + [C(L) − C(1)](εt + εt−1 + εt−2 + ... + ε0).  (11.37)

Suppose εs = 0 for s < 0 and consider the second term in (11.37). It can be written

[I + C1 L + C2 L² + ... − C(1)](εt + εt−1 + εt−2 + ... + ε0)
 = /*since C(1) = I + C1 + C2 + ...*/
[−C1 − C2 − C3 − ...]εt + [−C2 − C3 − ...]εt−1 + [−C3 − ...]εt−2 + ...  (11.38)

Now define the random walks

ξt = ξt−1 + εt
   = εt + εt−1 + εt−2 + ... + ε0.  (11.39)

Use (11.38) and (11.39) to rewrite (11.37) as

yt = C(1)ξt + C*(L)εt, where  (11.40)
C*s = −Σ_{j=s+1}^∞ Cj.  (11.41)
11.8.4 Identification of the Common Trends Shocks

Rewrite (11.31)-(11.32) and (11.39)-(11.40) as

yt = C(1) Σ_{s=0}^t εs + C*(L)εt, with Cov(εt) = Ω, and  (11.42)
yt = [ϒ 0_{n×r}][Σ_{s=0}^t ϕs; ψt] + Φ(L)[ϕt; ψt], with Cov([ϕt; ψt]) = In.  (11.43)

Since both εt and [ϕt′ ψt′]′ are white noise, we notice that the response of yt+s to either must be the same, that is,

[C(1) + C*s] εt = ([ϒ 0_{n×r}] + Φs)[ϕt; ψt] for all t and s ≥ 0.  (11.44)

This means that the VAR shocks are linear combinations of the structural shocks (as in the standard setup without cointegration)

[ϕt; ψt] = Fεt = [Fk; Fr] εt.  (11.45)

Combining (11.44) and (11.45) gives that

C(1) + C*s = ϒFk + Φs [Fk; Fr]  (11.46)

must hold for all s ≥ 0. In particular, it must hold for s → ∞ where both C*s and Φs vanish

C(1) = ϒFk.  (11.47)

The identification therefore amounts to finding the n² coefficients in F, exactly as in the usual case without cointegration. Once that is done, we can calculate the impulse responses and variance decompositions with respect to the structural shocks by using εt = F^{−1}[ϕt; ψt] in (11.42).² As before, assumptions about the covariance matrix of the structural shocks are not enough to achieve identification. In this case, we typically rely on the information about long-run behavior (as opposed to short-run correlations) to supply the remaining restrictions.

• Step 1. From (11.31) we see that α′ϒ = 0_{r×k} must hold for α′yt to be stationary. Given an (estimate of) α, this gives rk equations from which we can identify rk elements in ϒ. (It will soon be clear why it is useful to know ϒ.)

• Step 2. From (11.44) we have ϒϕt = C(1)εt as s → ∞. The variances of both sides must be equal

E ϒϕtϕt′ϒ′ = E C(1)εtεt′C(1)′, or
ϒϒ′ = C(1)ΩC(1)′,  (11.48)

which gives k(k + 1)/2 restrictions on ϒ (the number of unique elements in the symmetric ϒϒ′). (However, each column of ϒ is only identified up to a sign transformation: neither step 1 nor 2 is affected by multiplying each element in column j of ϒ by −1.)

• Step 3. ϒ has nk elements, so we still need nk − rk − k(k + 1)/2 = k(k − 1)/2 further restrictions on ϒ to identify all elements. They could be, for instance, that money supply shocks have no long run effect on output (some ϒij = 0). We now know ϒ.

• Step 4. Combining Cov([ϕt; ψt]) = In with (11.45) gives

[Ik 0; 0 Ir] = [Fk; Fr] Ω [Fk; Fr]′,  (11.49)

which gives n(n + 1)/2 restrictions.

– Step 4a. Premultiply (11.47) with ϒ′ and solve for Fk

Fk = (ϒ′ϒ)^{−1}ϒ′C(1).  (11.50)

² Equivalently, we can use (11.47) and (11.46) to calculate ϒ and Φs (for all s) and then calculate the impulse response function from (11.43).
    (This means that Eϕtϕt′ = FkΩFk′ = (ϒ′ϒ)−1 ϒ′C(1)ΩC(1)′ϒ (ϒ′ϒ)−1. From (11.48) we see that this indeed is Ik as required by (11.49).) We still need to identify Fr.

  – Step 4b. From (11.49), Eϕtψt′ = 0k×r, we get FkΩFr′ = 0k×r, which gives kr restrictions on the rn elements in Fr. Similarly, from Eψtψt′ = Ir, we get FrΩFr′ = Ir, which gives r(r + 1)/2 additional restrictions on Fr. We still need r(r − 1)/2 restrictions. Exactly how they look does not matter for the impulse response function of ϕt (as long as Eϕtψt′ = 0). Note that restrictions on Fr are restrictions on ∂yt/∂ψt′, that is, on the contemporaneous response. This is exactly as in the standard case without cointegration.

A summary of identifying assumptions used by different authors is found in Englund, Vredin, and Warne (1994).
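To make step 4a concrete, here is a minimal NumPy sketch; the function name and the idea of passing in already estimated C(1), Ω, and ϒ are illustrative assumptions, not part of the derivation above.

import numpy as np

def common_trends_Fk(C1, Omega, Upsilon):
    """Step 4a in code: given estimates of C(1) (n x n), Omega = Cov(eps_t)
    (n x n) and the long-run loading matrix Upsilon (n x k), recover
    F_k = (Upsilon' Upsilon)^{-1} Upsilon' C(1) as in (11.50)."""
    UtU_inv = np.linalg.inv(Upsilon.T @ Upsilon)
    Fk = UtU_inv @ Upsilon.T @ C1                 # (11.50)
    check = Fk @ Omega @ Fk.T                     # should be close to I_k by (11.48)-(11.49)
    return Fk, check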

Bibliography

Bernanke, B., 1986, “Alternative Explanations of the Money-Income Correlation,” Carnegie-Rochester Series on Public Policy, 25, 49–100.

Englund, P., A. Vredin, and A. Warne, 1994, “Macroeconomic Shocks in an Open Economy - A Common Trends Representation of Swedish Data 1871-1990,” in Villy Bergström, and Anders Vredin (ed.), Measuring and Interpreting Business Cycles, pp. 125–233, Clarendon Press.

Greene, W. H., 2000, Econometric Analysis, Prentice-Hall, Upper Saddle River, New Jersey, 4th edn.

Hamilton, J. D., 1994, Time Series Analysis, Princeton University Press, Princeton.

Johnston, J., and J. DiNardo, 1997, Econometric Methods, McGraw-Hill, New York, 4th edn.

King, R. G., 1986, “Money and Business Cycles: Comments on Bernanke and Related Literature,” Carnegie-Rochester Series on Public Policy, 25, 101–116.

Mellander, E., A. Vredin, and A. Warne, 1992, “Stochastic Trends and Economic Fluctuations in a Small Open Economy,” Journal of Applied Econometrics, 7, 369–394.

Pindyck, R. S., and D. L. Rubinfeld, 1997, Econometric Models and Economic Forecasts, Irwin McGraw-Hill, Boston, Massachusetts, 4th edn.

Sims, C. A., 1980, “Macroeconomics and Reality,” Econometrica, 48, 1–48.

Söderlind, P., and A. Vredin, 1996, “Applied Cointegration Analysis in the Mirror of Macroeconomic Theory,” Journal of Applied Econometrics, 11, 363–382.
12 Kalman filter

12.1 Conditional Expectations in a Multivariate Normal Distribution

Reference: Harvey (1989), Lütkepohl (1993), and Hamilton (1994)
Suppose Zm×1 and Xn×1 are jointly normally distributed

[Z; X] = N([Z̄; X̄], [Σzz Σzx; Σxz Σxx]).   (12.1)

The distribution of the random variable Z conditional on that X = x is also normal with mean (expectation of the random variable Z conditional on that the random variable X has the value x)

E(Z|x) = Z̄ + Σzx Σxx−1 (x − X̄),   (12.2)

where E(Z|x) and Z̄ are m×1, Σzx is m×n, Σxx−1 is n×n, and x − X̄ is n×1, and variance (variance of Z conditional on that X = x)

Var(Z|x) = E{[Z − E(Z|x)]² | x}
         = Σzz − Σzx Σxx−1 Σxz.   (12.3)

The conditional variance is the variance of the prediction error Z − E(Z|x).
Both E(Z|x) and Var(Z|x) are in general stochastic variables, but for the multivariate normal distribution Var(Z|x) is constant. Note that Var(Z|x) is less than Σzz (in a matrix sense) if x contains any relevant information (so Σzx is not zero, that is, E(Z|x) is not a constant).
It can also be useful to know that Var(Z) = E[Var(Z|X)] + Var[E(Z|X)] (the X is now random), which here becomes Σzz − Σzx Σxx−1 Σxz + Σzx Σxx−1 Var(X) Σxx−1 Σxz = Σzz.

12.2 Kalman Recursions

12.2.1 State space form

The measurement equation is

yt = Z αt + εt, with Var(εt) = H,   (12.4)

where yt and εt are n×1 vectors, and Z an n×m matrix. (12.4) expresses some observable variables yt in terms of some (partly) unobservable state variables αt. The transition equation for the states is

αt = T αt−1 + ut, with Var(ut) = Q,   (12.5)

where αt and ut are m×1 vectors, and T an m×m matrix. This system is time invariant since all coefficients are constant. It is assumed that all errors are normally distributed, and that E(εt ut−s) = 0 for all s.

Example 1 (AR(2).) The process xt = ρ1 xt−1 + ρ2 xt−2 + et can be rewritten as

xt = [1 0] [xt; xt−1] + 0,                (yt = Z αt + εt)
[xt; xt−1] = [ρ1 ρ2; 1 0] [xt−1; xt−2] + [et; 0],   (αt = T αt−1 + ut)

with H = 0, and Q = [Var(et) 0; 0 0]. In this case n = 1, m = 2.

12.2.2 Prediction equations: E(αt|It−1)

Suppose we have an estimate of the state in t − 1 based on the information set in t − 1, denoted by α̂t−1, and that this estimate has the variance

Pt−1 = E[(α̂t−1 − αt−1)(α̂t−1 − αt−1)′].   (12.6)
Now we want an estimate of αt based on α̂t−1. From (12.5) the obvious estimate, denoted by α̂t|t−1, is

α̂t|t−1 = T α̂t−1.   (12.7)

The variance of the prediction error is

Pt|t−1 = E[(αt − α̂t|t−1)(αt − α̂t|t−1)′]
       = E{(T αt−1 + ut − T α̂t−1)(T αt−1 + ut − T α̂t−1)′}
       = E{[T(α̂t−1 − αt−1) − ut][T(α̂t−1 − αt−1) − ut]′}
       = T E[(α̂t−1 − αt−1)(α̂t−1 − αt−1)′] T′ + E ut ut′
       = T Pt−1 T′ + Q,   (12.8)

where we have used (12.5), (12.6), and the fact that ut is uncorrelated with α̂t−1 − αt−1.

Example 2 (AR(2) continued.) By substitution we get

α̂t|t−1 = [x̂t|t−1; x̂t−1|t−1] = [ρ1 ρ2; 1 0] [x̂t−1|t−1; x̂t−2|t−1], and

Pt|t−1 = [ρ1 ρ2; 1 0] Pt−1 [ρ1 1; ρ2 0] + [Var(et) 0; 0 0].

If we treat x−1 and x0 as given, then P0 = 02×2 which would give P1|0 = [Var(et) 0; 0 0].

12.2.3 Updating equations: E(αt|It−1) → E(αt|It)

The best estimate of yt, given α̂t|t−1, follows directly from (12.4)

ŷt|t−1 = Z α̂t|t−1,   (12.9)

with prediction error

vt = yt − ŷt|t−1 = Z(αt − α̂t|t−1) + εt.   (12.10)

The variance of the prediction error is

Ft = E vt vt′
   = E{[Z(αt − α̂t|t−1) + εt][Z(αt − α̂t|t−1) + εt]′}
   = Z E[(αt − α̂t|t−1)(αt − α̂t|t−1)′] Z′ + E εt εt′
   = Z Pt|t−1 Z′ + H,   (12.11)

where we have used the definition of Pt|t−1 in (12.8), and of H in (12.4). Similarly, the covariance of the prediction errors for yt and for αt is

Cov(αt − α̂t|t−1, yt − ŷt|t−1) = E(αt − α̂t|t−1)(yt − ŷt|t−1)′
  = E{(αt − α̂t|t−1)[Z(αt − α̂t|t−1) + εt]′}
  = E[(αt − α̂t|t−1)(αt − α̂t|t−1)′] Z′
  = Pt|t−1 Z′.   (12.12)

Suppose that yt is observed and that we want to update our estimate of αt from α̂t|t−1 to α̂t, where we want to incorporate the new information conveyed by yt.

Example 3 (AR(2) continued.) We get

ŷt|t−1 = Z α̂t|t−1 = [1 0] [x̂t|t−1; x̂t−1|t−1] = x̂t|t−1 = ρ1 x̂t−1|t−1 + ρ2 x̂t−2|t−1, and

Ft = [1 0] {[ρ1 ρ2; 1 0] Pt−1 [ρ1 1; ρ2 0] + [Var(et) 0; 0 0]} [1 0]′.

If P0 = 02×2 as before, then F1 = P1 = [Var(et) 0; 0 0].

By applying the rules (12.2) and (12.3) we note that the expectation of αt (like Z in (12.2)) conditional on yt (like X in (12.2)) is (note that yt is observed so we can use it to guess αt)

α̂t = α̂t|t−1 + Pt|t−1 Z′ (Z Pt|t−1 Z′ + H)−1 (yt − Z α̂t|t−1),   (12.13)

where α̂t plays the role of E(Z|x), α̂t|t−1 of EZ, Pt|t−1Z′ of Σzx, ZPt|t−1Z′ + H (= Ft) of Σxx, and Zα̂t|t−1 of Ex,
with variance

Pt = Pt|t−1 − Pt|t−1 Z′ (Z Pt|t−1 Z′ + H)−1 Z Pt|t−1,   (12.14)

where Pt corresponds to Var(Z|x), Pt|t−1 to Σzz, Pt|t−1Z′ to Σzx, (ZPt|t−1Z′ + H)−1 to Σxx−1, and ZPt|t−1 to Σxz. Here α̂t|t−1 (“EZ”) is from (12.7), Pt|t−1Z′ (“Σzx”) from (12.12), ZPt|t−1Z′ + H (“Σxx”) from (12.11), and Zα̂t|t−1 (“Ex”) from (12.9).
(12.13) uses the new information in yt, that is, the observed prediction error, in order to update the estimate of αt from α̂t|t−1 to α̂t.

Proof. The last term in (12.14) follows from the expected value of the square of the last term in (12.13)

Pt|t−1 Z′ (Z Pt|t−1 Z′ + H)−1 E(yt − Zα̂t|t−1)(yt − Zα̂t|t−1)′ (Z Pt|t−1 Z′ + H)−1 Z Pt|t−1,   (12.15)

where we have exploited the symmetry of covariance matrices. Note that yt − Zα̂t|t−1 = yt − ŷt|t−1, so the middle term in the previous expression is

E(yt − Zα̂t|t−1)(yt − Zα̂t|t−1)′ = Z Pt|t−1 Z′ + H.   (12.16)

Using this gives the last term in (12.14).

12.2.4 The Kalman Algorithm

The Kalman algorithm calculates optimal predictions of αt in a recursive way. You can also calculate the prediction errors vt in (12.10) as a by-product, which turns out to be useful in estimation.

1. Pick starting values for P0 and α0. Let t = 1.

2. Calculate (12.7), (12.8), (12.13), and (12.14) in that order. This gives values for α̂t and Pt. If you want vt for estimation purposes, calculate also (12.10) and (12.11). Increase t with one step.

3. Iterate on 2 until t = T.

One choice of starting values that works in stationary models is to set P0 to the unconditional covariance matrix of αt, and α0 to the unconditional mean. This is the matrix P to which (12.8) converges: P = T P T′ + Q. (The easiest way to calculate this is simply to start with P = I and iterate until convergence.)
In non-stationary models we could set

P0 = 1000 ∗ Im, and α0 = 0m×1,   (12.17)

in which case the first m observations of α̂t and vt should be disregarded.

12.2.5 MLE based on the Kalman filter

For any (conditionally) Gaussian time series model for the observable yt the log likelihood for an observation is

ln Lt = −(n/2) ln(2π) − (1/2) ln|Ft| − (1/2) vt′ Ft−1 vt.   (12.18)

In case the starting conditions are as in (12.17), the overall log likelihood function is

ln L = { Σ_{t=1}^T ln Lt in stationary models; Σ_{t=m+1}^T ln Lt in non-stationary models.   (12.19)

12.2.6 Inference and Diagnostics

We can, of course, use all the asymptotic MLE theory, like likelihood ratio tests etc. For diagnostic tests, we will most often want to study the normalized residuals

ṽit = vit / √(element ii in Ft), i = 1, ..., n,

since element ii in Ft is the variance of the scalar residual vit. Typical tests are CUSUMQ tests for structural breaks, various tests for serial correlation, heteroskedasticity, and normality.
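As an illustration of the algorithm, the following is a minimal NumPy sketch of the recursions (12.7)-(12.14) and the log likelihood (12.18); the function name, the simulated AR(2) data, and the parameter values are hypothetical choices for the example only.

import numpy as np

def kalman_filter(y, Z, H, T, Q, alpha0, P0):
    """Prediction (12.7)-(12.8), updating (12.13)-(12.14), and the Gaussian
    log likelihood (12.18). y is T_obs x n, Z is n x m, T is m x m."""
    alpha, P = alpha0.copy(), P0.copy()
    loglik, n = 0.0, Z.shape[0]
    for t in range(y.shape[0]):
        alpha_pred = T @ alpha                       # (12.7)
        P_pred = T @ P @ T.T + Q                     # (12.8)
        v = y[t] - Z @ alpha_pred                    # (12.10)
        F = Z @ P_pred @ Z.T + H                     # (12.11)
        K = P_pred @ Z.T @ np.linalg.inv(F)
        alpha = alpha_pred + K @ v                   # (12.13)
        P = P_pred - K @ Z @ P_pred                  # (12.14)
        loglik += -0.5 * (n * np.log(2 * np.pi)
                          + np.log(np.linalg.det(F))
                          + v @ np.linalg.inv(F) @ v)   # (12.18)
    return alpha, P, loglik

# AR(2) example in state space form (hypothetical parameter values)
rho1, rho2 = 0.5, 0.3
Z, H = np.array([[1.0, 0.0]]), np.zeros((1, 1))
T_mat = np.array([[rho1, rho2], [1.0, 0.0]])
Q = np.array([[1.0, 0.0], [0.0, 0.0]])
rng = np.random.default_rng(0)
x = np.zeros(200)
for t in range(2, 200):
    x[t] = rho1 * x[t-1] + rho2 * x[t-2] + rng.standard_normal()
alpha_T, P_T, loglik = kalman_filter(x.reshape(-1, 1), Z, H, T_mat, Q,
                                     alpha0=np.zeros(2), P0=1000 * np.eye(2))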
Bibliography

Hamilton, J. D., 1994, Time Series Analysis, Princeton University Press, Princeton.

Harvey, A. C., 1989, Forecasting, Structural Time Series Models and the Kalman Filter, Cambridge University Press.
Lütkepohl, H., 1993, Introduction to Multiple Time Series, Springer-Verlag, 2nd edn.

13 Outliers and Robust Estimators

13.1 Influential Observations and Standardized Residuals

Reference: Greene (2000) 6.9; Rousseeuw and Leroy (1987)


Consider the linear model
yt = xt′ β0 + ut,   (13.1)

where xt is k × 1. The LS estimator

β̂ = (Σ_{t=1}^T xt xt′)−1 Σ_{t=1}^T xt yt,   (13.2)

which is the solution to


min_β Σ_{t=1}^T (yt − xt′ β)².   (13.3)
The fitted values and residuals are

ŷt = xt′ β̂, and ût = yt − ŷt.   (13.4)

Suppose we were to reestimate β on the whole sample, except observation s. This would give us an estimate β̂(s). The fitted values and residuals are then

ŷt(s) = xt′ β̂(s), and ût(s) = yt − ŷt(s).   (13.5)

A common way to study the sensitivity of the results with respect to excluding observa-
tions is to plot β̂ (s) − β̂, and ŷs(s) − ŷs . Note that we here plot the fitted value of ys using
the coefficients estimated by excluding observation s from the sample. Extreme values
prompt a closer look at data (errors in data?) and perhaps also a more robust estimation
method than LS, which is very sensitive to outliers.
Another useful way to spot outliers is to study the standardized residuals, ûs/σ̂ and ûs(s)/σ̂(s), where σ̂ and σ̂(s) are standard deviations estimated from the whole sample and excluding observation s, respectively. Values below −2 or above 2 warrant attention (recall
that Pr(x > 1.96) ≈ 0.025 in a N(0, 1) distribution).
Sometimes the residuals are instead standardized by taking into account the uncertainty of the estimated coefficients. Note that

ût(s) = yt − xt′β̂(s)
      = ut + xt′(β − β̂(s)),   (13.6)

since yt = xt′β + ut. The variance of ût is therefore the variance of the sum on the right hand side of this expression. When we use the variance of ut as we did above to standardize the residuals, then we disregard the variance of β̂(s). In general, we have

Var(ût(s)) = Var(ut) + xt′ Var(β − β̂(s)) xt + 2Cov(ut, xt′(β − β̂(s))).   (13.7)

When t = s, which is the case we care about, the covariance term drops out since β̂(s) cannot be correlated with us since period s is not used in the estimation (this statement assumes that shocks are not autocorrelated). The first term is then estimated as the usual variance of the residuals (recall that period s is not used) and the second term is the estimated covariance matrix of the parameter vector (once again excluding period s) pre- and postmultiplied by xs.

Example 1 (Errors are iid independent of the regressors.) In this case the variance of the parameter vector is estimated as σ̂²(Σxtxt′)−1 (excluding period s), so we have

Var(ût(s)) = σ̂²[1 + xs′(Σxtxt′)−1 xs].
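A minimal NumPy sketch of these leave-one-out diagnostics; the function name and the use of the iid formula from Example 1 are illustrative assumptions.

import numpy as np

def leave_one_out_diagnostics(y, X):
    """For each s: beta(s) - beta, the change in the fitted value of y_s, and
    the standardized residual based on (13.7) under the iid case of Example 1."""
    T, k = X.shape
    b_full = np.linalg.lstsq(X, y, rcond=None)[0]
    out = []
    for s in range(T):
        keep = np.arange(T) != s
        b_s = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
        u_s = y[keep] - X[keep] @ b_s
        sigma2_s = u_s @ u_s / (T - 1 - k)              # excludes period s
        Sxx_inv = np.linalg.inv(X[keep].T @ X[keep])
        var_us = sigma2_s * (1 + X[s] @ Sxx_inv @ X[s])  # (13.7), Example 1
        std_resid = (y[s] - X[s] @ b_s) / np.sqrt(var_us)
        out.append((b_s - b_full, X[s] @ (b_s - b_full), std_resid))
    return out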
13.2 Recursive Residuals∗

Reference: Greene (2000) 7.8
Recursive residuals are a version of the technique discussed in Section 13.1. They are used when data is a time series. Suppose we have a sample t = 1, ..., T, and that t = 1, ..., s are used to estimate a first estimate, β̂[s] (not to be confused with β̂(s) used in Section 13.1). We then make a one-period ahead forecast and record the fitted value and the forecast error

ŷ[s]s+1 = xs+1′ β̂[s], and û[s]s+1 = ys+1 − ŷ[s]s+1.   (13.8)

[Figure 13.1: Recursive residuals and CUSUM statistics (with 95% confidence band), when data are simulated from yt = 0.85yt−1 + ut, with Var(ut) = 1.]

This is repeated for the rest of the sample by extending the sample used in the estimation by one period, making a one-period ahead forecast, and then repeating until we reach the end of the sample.
A first diagnosis can be made by examining the standardized residuals, û[s]s+1/σ̂[s], where σ̂[s] can be estimated as in (13.7) with a zero covariance term, since us+1 is not correlated with data for earlier periods (used in calculating β̂[s]), provided errors are not autocorrelated. As before, standardized residuals outside ±2 indicate problems: outliers or structural breaks (if the residuals are persistently outside ±2).
The CUSUM test uses these standardized residuals to form a sequence of test statistics. A (persistent) jump in the statistics is a good indicator of a structural break. Suppose we use r observations to form the first estimate of β, so we calculate β̂[s] and û[s]s+1/σ̂[s] for s = r, ..., T. Define the cumulative sums of standardized residuals

Wt = Σ_{s=r}^t û[s]s+1/σ̂[s], t = r, ..., T.   (13.9)

Under the null hypothesis that no structural breaks occur, that is, that the true β is the same for the whole sample, Wt has a zero mean and a variance equal to the number of elements in the sum, t − r + 1. This follows from the fact that the standardized residuals all have zero mean and unit variance and are uncorrelated with each other. Typically, Wt is plotted along with a 95% confidence interval, which can be shown to be ±[a√(T − r) + 2a(t − r)/√(T − r)] with a = 0.948. The hypothesis of no structural break is rejected if the Wt is outside this band for at least one observation. (The derivation of this confidence band is somewhat tricky, but it incorporates the fact that Wt and Wt+1
are very correlated.)
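The recursive residuals and the CUSUM statistics of (13.9) can be computed along the following lines; this is only a sketch, and the indexing convention and function name are illustrative.

import numpy as np

def cusum(y, X, r):
    """Standardized one-step ahead forecast errors (13.8), their cumulative
    sums W_t (13.9), and the 95% band a*sqrt(T-r) + 2a(t-r)/sqrt(T-r), a = 0.948."""
    T, k = X.shape
    W, cum = [], 0.0
    for s in range(r, T):                    # estimate on obs 1..s, forecast s+1
        b = np.linalg.lstsq(X[:s], y[:s], rcond=None)[0]
        u = y[:s] - X[:s] @ b
        sigma2 = u @ u / (s - k)
        Sxx_inv = np.linalg.inv(X[:s].T @ X[:s])
        fe = y[s] - X[s] @ b                 # one-step ahead forecast error
        cum += fe / np.sqrt(sigma2 * (1 + X[s] @ Sxx_inv @ X[s]))
        W.append(cum)
    a = 0.948
    band = np.array([a * np.sqrt(T - r) + 2 * a * j / np.sqrt(T - r)
                     for j in range(len(W))])   # j = t - r
    return np.array(W), band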
13.3 Robust Estimation

Reference: Greene (2000) 9.8.1; Rousseeuw and Leroy (1987); Donald and Maddala (1993); and Judge, Griffiths, Lütkepohl, and Lee (1985) 20.4.
The idea of robust estimation is to give less weight to extreme observations than in Least Squares. When the errors are normally distributed, then there should be very few extreme observations, so LS makes a lot of sense (and is indeed the MLE). When the errors have distributions with fatter tails (like the Laplace or two-tailed exponential distribution, f(u) = exp(−|u|/σ)/2σ), then LS is no longer optimal and can be fairly sensitive to outliers. The ideal way to proceed would be to apply MLE, but the true distribution is often unknown. Instead, one of the “robust estimators” discussed below is often used.

[Figure 13.2: An example of how LS and LAD can differ. In this case yt = 0.75xt + ut, but only one of the errors has a non-zero value.]

Let ût = yt − xt′β̂. Then, the least absolute deviations (LAD), least median squares (LMS), and least trimmed squares (LTS) estimators solve

β̂LAD = arg min_β Σ_{t=1}^T |ût|   (13.10)
β̂LMS = arg min_β [median(ût²)]   (13.11)
β̂LTS = arg min_β Σ_{i=1}^h ûi², û1² ≤ û2² ≤ ... and h ≤ T.   (13.12)

Note that the LTS estimator in (13.12) minimizes the sum of the h smallest squared residuals.
These estimators involve non-linearities, so they are more computationally intensive than LS. In some cases, however, a simple iteration may work.

Example 2 (Algorithm for LAD.) The LAD estimator can be written

β̂LAD = arg min_β Σ_{t=1}^T wt ût², wt = 1/|ût|,

so it is a weighted least squares where both yt and xt are scaled by |ût|−1/2. It can be shown that iterating on LS with the weights given by 1/|ût|, where the residuals are from the previous iteration, converges very quickly to the LAD estimator.

It can be noted that LAD is actually the MLE for the Laplace distribution discussed above.
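A minimal sketch of the iteration in Example 2; starting at OLS and the small eps guard against division by zero are implementation choices made only for this illustration.

import numpy as np

def lad_by_iterated_wls(y, X, iters=50, eps=1e-6):
    """Iterate on weighted LS with weights 1/|u_t| from the previous iteration;
    scaling y_t and x_t by |u_t|^(-1/2) implements each WLS step."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]          # start at OLS
    for _ in range(iters):
        u = y - X @ b
        w = 1.0 / np.sqrt(np.abs(u) + eps)            # |u_t|^(-1/2)
        b_new = np.linalg.lstsq(X * w[:, None], y * w, rcond=None)[0]
        if np.max(np.abs(b_new - b)) < 1e-8:
            break
        b = b_new
    return b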
13.4 Multicollinearity∗

Reference: Greene (2000) 6.7
When the variables in the xt vector are very highly correlated (they are “multicollinear”) then data cannot tell, with the desired precision, if the movements in yt were due to movements in xit or xjt. This means that the point estimates might fluctuate wildly over subsamples and it is often the case that individual coefficients are insignificant even though the R² is high and the joint significance of the coefficients is also high. The estimators are still consistent and asymptotically normally distributed, just very imprecise.
A common indicator for multicollinearity is to standardize each element in xt by subtracting the sample mean and then dividing by its standard deviation

x̃it = (xit − x̄it)/std(xit).   (13.13)

(Another common procedure is to use x̃it = xit/(Σ_{t=1}^T xit²/T)1/2.)
Then calculate the eigenvalues, λj, of the second moment matrix of x̃t

A = (1/T) Σ_{t=1}^T x̃t x̃t′.   (13.14)

The condition number of a matrix is the ratio of the largest (in magnitude) of the eigenvalues to the smallest

c = |λ|max / |λ|min.   (13.15)

(Some authors take c1/2 to be the condition number; others still define it in terms of the “singular values” of a matrix.) If the regressors are uncorrelated, then the condition value of A is one. This follows from the fact that A is a (sample) covariance matrix. If it is diagonal, then the eigenvalues are equal to the diagonal elements, which are all unity since the standardization in (13.13) makes all variables have unit variances. Values of c above several hundreds typically indicate serious problems.
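A minimal sketch of (13.13)-(13.15); the function name is illustrative.

import numpy as np

def condition_number(X):
    """Standardize each regressor as in (13.13), form A = (1/T) sum x~_t x~_t'
    as in (13.14) and return c = |lambda|_max / |lambda|_min from (13.15)."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    A = Xs.T @ Xs / X.shape[0]
    eig = np.abs(np.linalg.eigvalsh(A))
    return eig.max() / eig.min()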
Bibliography

Donald, S. G., and G. S. Maddala, 1993, “Identifying Outliers and Influential Observations in Econometric Models,” in G. S. Maddala, C. R. Rao, and H. D. Vinod (ed.), Handbook of Statistics, Vol 11, pp. 663–701, Elsevier Science Publishers B.V.

Greene, W. H., 2000, Econometric Analysis, Prentice-Hall, Upper Saddle River, New Jersey, 4th edn.

Judge, G. G., W. E. Griffiths, H. Lütkepohl, and T.-C. Lee, 1985, The Theory and Practice of Econometrics, John Wiley and Sons, New York, 2nd edn.

Rousseeuw, P. J., and A. M. Leroy, 1987, Robust Regression and Outlier Detection, John Wiley and Sons, New York.

14 Generalized Least Squares

Reference: Greene (2000) 11.3-4
Additional references: Hayashi (2000) 1.6; Johnston and DiNardo (1997) 5.4; Verbeek (2000) 4

14.1 Introduction

Instead of using LS in the presence of autocorrelation/heteroskedasticity (and, of course, adjusting the variance-covariance matrix), we may apply the generalized least squares method. It can often improve efficiency.
The linear model yt = xt′β0 + ut written on matrix form (GLS is one of the cases in econometrics where matrix notation really pays off) is

y = Xβ0 + u, where   (14.1)

y = [y1; y2; ...; yT], X = [x1′; x2′; ...; xT′], and u = [u1; u2; ...; uT].

Suppose that the covariance matrix of the residuals (across time) is

Euu′ = [Eu1u1 Eu1u2 ··· Eu1uT; Eu2u1 Eu2u2 ··· Eu2uT; ...; EuTu1 EuTu2 ··· EuTuT]
     = Ω (T × T).   (14.2)

This allows for both heteroskedasticity (different elements along the main diagonal) and autocorrelation (non-zero off-diagonal elements). LS is still consistent even if Ω is not proportional to an identity matrix, but it is not efficient. Generalized least squares (GLS)
is. The trick of GLS is to transform the variables and then do LS.

14.2 GLS as Maximum Likelihood

Remark 1 If the n×1 vector x has a multivariate normal distribution with mean vector µ and covariance matrix Ω, then the joint probability density function is (2π)−n/2 |Ω|−1/2 exp[−(x − µ)′Ω−1(x − µ)/2].

If the T×1 vector u is N(0, Ω), then the joint pdf of u is (2π)−T/2 |Ω|−1/2 exp[−u′Ω−1u/2]. Change variable from u to y − Xβ (the Jacobian of this transformation equals one), and take logs to get the (scalar) log likelihood function

ln L = −(T/2) ln(2π) − (1/2) ln|Ω| − (1/2)(y − Xβ)′Ω−1(y − Xβ).   (14.3)

To simplify things, suppose we know Ω. It is then clear that we maximize the likelihood function by minimizing the last term, which is a weighted sum of squared errors.
In the classical LS case, Ω = σ²I, so the last term in (14.3) is proportional to the unweighted sum of squared errors. The LS is therefore the MLE when the errors are iid normally distributed.
When errors are heteroskedastic, but not autocorrelated, then Ω has the form

Ω = [σ1² 0 ··· 0; 0 σ2² ··· 0; ...; 0 ··· 0 σT²].   (14.4)

In this case, we can decompose Ω−1 as

Ω−1 = P′P, where P = [1/σ1 0 ··· 0; 0 1/σ2 ··· 0; ...; 0 ··· 0 1/σT].   (14.5)

The last term in (14.3) can then be written

−(1/2)(y − Xβ)′Ω−1(y − Xβ) = −(1/2)(y − Xβ)′P′P(y − Xβ)
                            = −(1/2)(Py − PXβ)′(Py − PXβ).   (14.6)

This very practical result says that if we define yt∗ = yt/σt and xt∗ = xt/σt, then we get ML estimates of β running an LS regression of yt∗ on xt∗. (One of the elements in xt could be a constant—also this one should be transformed). This is the generalized least squares (GLS).

Remark 2 Let A be an n×n symmetric positive definite matrix. It can be decomposed as A = PP′. There are many such P matrices, but only one which is lower triangular P (see next remark).

Remark 3 Let A be an n×n symmetric positive definite matrix. The Cholesky decomposition gives the unique lower triangular P1 such that A = P1P1′ or an upper triangular matrix P2 such that A = P2′P2 (clearly P2 = P1′). Note that P1 and P2 must be invertible (since A is).

When errors are autocorrelated (with or without heteroskedasticity), then it is typically harder to find a straightforward analytical decomposition of Ω−1. We therefore move directly to the general case. Since the covariance matrix is symmetric and positive definite, Ω−1 is too. We therefore decompose it as

Ω−1 = P′P.   (14.7)

The Cholesky decomposition is often a convenient tool, but other decompositions can also be used. We can then apply (14.6) also in this case—the only difference is that P is typically more complicated than in the case without autocorrelation. In particular, the transformed variables Py and PX cannot be done line by line (yt∗ is a function of yt, yt−1, and perhaps more).

Example 4 (AR(1) errors, see Davidson and MacKinnon (1993) 10.6.) Let ut = aut−1 + εt where εt is iid. We have Var(ut) = σ²/(1 − a²), and Corr(ut, ut−s) = a^s. For T = 4,
the covariance matrix of the errors is

Ω = Cov([u1 u2 u3 u4]′) = σ²/(1 − a²) [1 a a² a³; a 1 a a²; a² a 1 a; a³ a² a 1].

The inverse is

Ω−1 = (1/σ²) [1 −a 0 0; −a 1+a² −a 0; 0 −a 1+a² −a; 0 0 −a 1],

and note that we can decompose it as

Ω−1 = P′P with P = (1/σ) [√(1−a²) 0 0 0; −a 1 0 0; 0 −a 1 0; 0 0 −a 1].

This is not a Cholesky decomposition, but certainly a valid decomposition (in case of doubt, do the multiplication). Premultiply the system

[y1; y2; y3; y4] = [x1′; x2′; x3′; x4′] β0 + [u1; u2; u3; u4]

by P to get

(1/σ) [√(1−a²) y1; y2 − ay1; y3 − ay2; y4 − ay3] = (1/σ) [√(1−a²) x1′; x2′ − ax1′; x3′ − ax2′; x4′ − ax3′] β0 + (1/σ) [√(1−a²) u1; ε2; ε3; ε4].

Note that all the residuals are uncorrelated in this formulation. Apart from the first observation, they are also identically distributed. The importance of the first observation becomes smaller as the sample size increases—in the limit, GLS is efficient.

14.3 GLS as a Transformed LS

When the errors are not normally distributed, then the MLE approach in the previous section is not valid. But we can still note that GLS has the same properties as LS has with iid non-normally distributed errors. In particular, the Gauss-Markov theorem applies, so the GLS is most efficient within the class of linear (in yt) and unbiased estimators (assuming, of course, that GLS and LS really are unbiased, which typically requires that ut is uncorrelated with xt−s for all s). This follows from the fact that the transformed system

Py = PXβ0 + Pu
y∗ = X∗β0 + u∗,   (14.8)

has iid errors, u∗. To see this, note that

Eu∗u∗′ = EPuu′P′
       = PEuu′P′.   (14.9)

Recall that Euu′ = Ω, P′P = Ω−1 and that P′ is invertible. Multiply both sides by P′

P′Eu∗u∗′ = P′PEuu′P′
         = Ω−1ΩP′
         = P′, so Eu∗u∗′ = I.   (14.10)

14.4 Feasible GLS

In practice, we usually do not know Ω. Feasible GLS (FGLS) is typically implemented by first estimating the model (14.1) with LS, then calculating a consistent estimate of Ω, and finally using GLS as if Ω was known with certainty. Very little is known about the finite sample properties of FGLS, but (the large sample properties) consistency, asymptotic normality, and asymptotic efficiency (assuming normally distributed errors) can often be
established. Evidence from simulations suggests that the FGLS estimator can be a lot worse than LS if the estimate of Ω is bad.
To use maximum likelihood when Ω is unknown requires that we make assumptions about the structure of Ω (in terms of a small number of parameters), and more generally about the distribution of the residuals. We must typically use numerical methods to maximize the likelihood function.

Example 5 (MLE and AR(1) errors.) If ut in Example 4 are normally distributed, then we can use the Ω−1 in (14.3) to express the likelihood function in terms of the unknown parameters: β, σ, and a. Maximizing this likelihood function requires a numerical optimization routine.
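A minimal sketch of GLS with AR(1) errors, using the P matrix of Example 4; treating a and σ as known is an assumption made only for this illustration (in practice they would be estimated, as in Section 14.4).

import numpy as np

def gls_ar1(y, X, a, sigma=1.0):
    """Build P with P'P = Omega^{-1} for AR(1) errors, transform y and X as in
    (14.6)/(14.8), and run LS on the transformed data."""
    T = len(y)
    P = np.zeros((T, T))
    P[0, 0] = np.sqrt(1 - a**2)
    for t in range(1, T):
        P[t, t], P[t, t - 1] = 1.0, -a
    P = P / sigma
    y_star, X_star = P @ y, P @ X
    return np.linalg.lstsq(X_star, y_star, rcond=None)[0]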

Bibliography

Davidson, R., and J. G. MacKinnon, 1993, Estimation and Inference in Econometrics, Oxford University Press, Oxford.

Greene, W. H., 2000, Econometric Analysis, Prentice-Hall, Upper Saddle River, New Jersey, 4th edn.

Hayashi, F., 2000, Econometrics, Princeton University Press.

Johnston, J., and J. DiNardo, 1997, Econometric Methods, McGraw-Hill, New York, 4th edn.

Verbeek, M., 2000, A Guide to Modern Econometrics, Wiley, Chichester.

15 Kernel Density Estimation and Regression

15.1 Non-Parametric Regression

Reference: Campbell, Lo, and MacKinlay (1997) 12.3; Härdle (1990); Pagan and Ullah (1999); Mittelhammer, Judge, and Miller (2000) 21

15.1.1 Simple Kernel Regression

Non-parametric regressions are used when we are unwilling to impose a parametric form on the regression equation—and we have a lot of data.
Let the scalars yt and xt be related as

yt = m(xt) + εt, εt is iid and E εt = Cov[m(xt), εt] = 0,   (15.1)

where m() is a, possibly non-linear, function.
Suppose the sample had 3 observations (say, t = 3, 27, and 99) with exactly the same value of xt, say 1.9. A natural way of estimating m(x) at x = 1.9 would then be to average over these 3 observations as we can expect the average of the error terms to be close to zero (iid and zero mean).
Unfortunately, we seldom have repeated observations of this type. Instead, we may try to approximate the value of m(x) (x is a single value, 1.9, say) by averaging over observations where xt is close to x. The general form of this type of estimator is

m̂(x) = Σ_{t=1}^T w(xt − x) yt / Σ_{t=1}^T w(xt − x),   (15.2)

where w(xt − x)/Σ_{t=1}^T w(xt − x) is the weight given to observation t. Note that the denominator makes the weights sum to unity. The basic assumption behind (15.2) is that the m(x) function is smooth so local (around x) averaging makes sense.
As an example of a w(.) function, it could give equal weight to the k values of xt which are closest to x and zero weight to all other observations (this is the “k-nearest neighbor” estimator, see Härdle (1990) 3.2). As another example, the weight function
could be defined so that it trades off the expected squared errors, E[yt − m̂(x)]², and the expected squared acceleration, E[d²m̂(x)/dx²]². This defines a cubic spline (and is often used in macroeconomics, where xt = t and is then called the Hodrick-Prescott filter).
A Kernel regression uses a pdf as the weight function w(.). The pdf of N(0, h²) is commonly used, where the choice of h allows us to easily vary the relative weights of different observations. This weighting function is positive so all observations get a positive weight, but the weights are highest for observations close to x and then taper off in a bell-shaped way. A low value of h means that the weights taper off fast—the weight function is then a normal pdf with a low variance. With this particular kernel, we get the following estimator of m(x) at a point x

m̂(x) = Σ_{t=1}^T Kh(xt − x) yt / Σ_{t=1}^T Kh(xt − x), where Kh(xt − x) = exp[−(xt − x)²/(2h²)] / (h√(2π)).   (15.3)

Note that Kh(xt − x) corresponds to the weighting function w(xt − x) in (15.2).
In practice we have to estimate m̂(x) at a finite number of points x. This could, for instance, be 100 evenly spread points in the interval between the min and max values observed in the sample. See Figure 15.2 for an illustration. Special corrections might be needed if there are a lot of observations stacked close to the boundary of the support of x (see Härdle (1990) 4.4).

Example 1 Suppose the sample has three data points [x1, x2, x3] = [1.5, 2, 2.5] and [y1, y2, y3] = [5, 4, 3.5]. Consider the estimation of m(x) at x = 1.9. With h = 1, the numerator in (15.3) is

Σ Kh(xt − x) yt = (e−(1.5−1.9)²/2 × 5 + e−(2−1.9)²/2 × 4 + e−(2.5−1.9)²/2 × 3.5)/√(2π)
               ≈ (0.92 × 5 + 1.0 × 4 + 0.84 × 3.5)/√(2π)
               = 11.52/√(2π).

The denominator is

Σ Kh(xt − x) = (e−(1.5−1.9)²/2 + e−(2−1.9)²/2 + e−(2.5−1.9)²/2)/√(2π)
            ≈ 2.75/√(2π).

The estimate at x = 1.9 is therefore

m̂(1.9) ≈ 11.52/2.75 ≈ 4.19.

[Figure 15.1: Example of kernel regression with three data points. Panels: data and weights (h = 0.25) for m(1.7), m(1.9), and m(2.1).]

Kernel regressions are typically consistent, provided longer samples are accompanied by smaller values of h, so the weighting function becomes more and more local as the sample size increases. It can be shown (see Härdle (1990) 3.1) that the mean squared error of the estimator at the value x is approximately (for general kernel functions)

E[m̂(x) − m(x)]² = variance + bias²
  = [Var(x)/(hT)] × (“peakedness of kernel”)   (15.4)
  + h⁴ × (variance of kernel)² × (curvature of m(x))².   (15.5)
[Figure 15.2: Example of kernel regression with three data points (kernel regressions with h = 0.25 and h = 0.2, plotted together with the data).]

A smaller h increases the variance (we effectively use fewer data points to estimate m(x)) but decreases the bias of the estimator (it becomes more local to x). If h decreases less than proportionally with the sample size (so hT in the denominator of the first term increases with T), then the variance goes to zero and the estimator is consistent (since the bias in the second term decreases as h does).
It can also be noted that the bias increases with the curvature of the m(x) function (that is, the average |d²m(x)/dx²|). This makes sense, since rapid changes of the slope of m(x) make it hard to get it right by averaging of nearby x values.
The variance is a function of the data and the kernel, but not of the m(x) function. The more concentrated the kernel is (larger “peakedness,” ∫K(u)²du) around x (for a given h), the less information is used in forming the average around x, and the uncertainty is therefore larger. See Section 15.2.3 for a further discussion of the choice of h.
See Figures 15.3–15.4 for an example. Note that the volatility is defined as the square of the drift minus expected drift (from the same estimation method).

[Figure 15.3: Federal funds rate, daily data 1954-2002. Panels: a. drift vs level, in bins; b. volatility vs level, in bins. Driftt+1 = yt+1 − yt, Volatilityt+1 = (Driftt+1 − fitted Driftt+1)².]
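A minimal sketch of the estimator (15.3), reproducing the numbers of Example 1; the function name is illustrative.

import numpy as np

def kernel_regression(y, x, grid, h):
    """Kernel regression (15.3) with a N(0, h^2) kernel, evaluated at each
    point in grid."""
    mhat = np.empty(len(grid))
    for i, x0 in enumerate(grid):
        w = np.exp(-(x - x0) ** 2 / (2 * h ** 2)) / (h * np.sqrt(2 * np.pi))
        mhat[i] = np.sum(w * y) / np.sum(w)
    return mhat

# the three data points of Example 1
x = np.array([1.5, 2.0, 2.5])
y = np.array([5.0, 4.0, 3.5])
print(kernel_regression(y, x, grid=np.array([1.9]), h=1.0))   # approx 4.19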
where K h x (xt − x) and K h z (z t − z) are two kernels like in (15.3) and where we may allow
h x and h y to be different (and depend on the variance of xt and yt ). In this case. the
weight of the observation (xt , z t ) is proportional to K h x (xt − x)K h z (z t − z), which is high
if both xt and yt are close to x and y respectively.

180 181
[Figure 15.4: Federal funds rate, daily data 1954-2002. Panels: a. drift vs level, kernel regression (small and large h); b. volatility vs level, kernel regression (small and large h). Driftt+1 = yt+1 − yt, Volatilityt+1 = (Driftt+1 − fitted Driftt+1)².]

See Figure 15.4 for an example.

15.2 Estimating and Testing Distributions

Reference: Harvey (1989) 260, Davidson and MacKinnon (1993) 267, Silverman (1986); Mittelhammer (1996), DeGroot (1986)

15.2.1 Parametric Tests of Normal Distribution

The skewness, kurtosis and Bera-Jarque test for normality are useful diagnostic tools. For an iid normally distributed variable, xt ∼ iid N(µ, σ²), they are

Test statistic                                            Distribution
skewness = (1/T) Σ_{t=1}^T ((xt − µ)/σ)³                  N(0, 6/T)
kurtosis = (1/T) Σ_{t=1}^T ((xt − µ)/σ)⁴                  N(3, 24/T)
Bera-Jarque = (T/6) skewness² + (T/24)(kurtosis − 3)²     χ²₂.   (15.8)

This is implemented by using the estimated mean and standard deviation. The distributions stated on the right hand side of (15.8) are under the null hypothesis that xt is iid N(µ, σ²).
The intuition for the χ²₂ distribution of the Bera-Jarque test is that both the skewness and kurtosis are, if properly scaled, N(0, 1) variables. It can also be shown that they, under the null hypothesis, are uncorrelated. The Bera-Jarque test statistic is therefore a sum of the square of two uncorrelated N(0, 1) variables, which has a χ²₂ distribution.
The Bera-Jarque test can also be implemented as a test of overidentifying restrictions in GMM. Formulate the moment conditions

g(µ, σ²) = (1/T) Σ_{t=1}^T [xt − µ; (xt − µ)² − σ²; (xt − µ)³; (xt − µ)⁴ − 3σ⁴],   (15.9)

which should all be zero if xt is N(µ, σ²). We can estimate the two parameters, µ and σ², by using the first two moment conditions only, and then test if all four moment conditions are satisfied. It can be shown that this is the same as the Bera-Jarque test if xt is indeed iid N(µ, σ²).
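A minimal sketch of (15.8), using the estimated mean and standard deviation; the function name is illustrative.

import numpy as np

def bera_jarque(x):
    """Skewness, kurtosis and the Bera-Jarque statistic of (15.8)."""
    T = len(x)
    z = (x - x.mean()) / x.std()
    skew = np.mean(z ** 3)                   # ~ N(0, 6/T) under normality
    kurt = np.mean(z ** 4)                   # ~ N(3, 24/T) under normality
    bj = T * (skew ** 2 / 6 + (kurt - 3) ** 2 / 24)   # ~ chi2(2)
    return skew, kurt, bj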

15.2.2 Non-Parametric Tests of General Distributions

The Kolmogorov-Smirnov test is designed to test if an empirical distribution function, EDF(x), conforms with a theoretical cdf, F(x). The empirical distribution function is defined as the fraction of observations which are less or equal to x, that is,

EDF(x) = (1/T) Σ_{t=1}^T δ(xt ≤ x), where δ(q) = {1 if q is true; 0 else}.   (15.10)

(The EDF(xt) and F(xt) are sometimes plotted against the sorted (in ascending order) sample {xt}t=1..T.)

Example 2 Suppose we have a sample with three data points: [x1, x2, x3] = [5, 3.5, 4]. The empirical distribution function is then as in Figure 15.5.

Define the absolute value of the maximum distance

DT = max_{xt} |EDF(xt) − F(xt)|.   (15.11)
[Figure 15.5: Example of empirical distribution function (EDF of the data plotted together with the cdf of N(3,1)).]

Example 3 Figure 15.5 also shows the cumulative distribution function (cdf) of a normally distributed variable. The test statistic (15.11) is then the largest difference (in absolute terms) of the EDF and the cdf—among the observed values of xt.

We reject the null hypothesis that EDF(x) = F(x) if √T DT > c, where c is a critical value which can be calculated from

lim_{T→∞} Pr(√T DT ≤ c) = 1 − 2 Σ_{i=1}^∞ (−1)^{i−1} e^{−2i²c²}.   (15.12)

It can be approximated by replacing ∞ with a large number (for instance, 100). For instance, c = 1.35 provides a 5% critical value. There is a corresponding test for comparing two empirical cdfs.
Pearson's χ² test does the same thing as the K-S test but for a discrete distribution. Suppose you have K categories with Ni values in category i. The theoretical distribution predicts that the fraction pi should be in category i, with Σ_{i=1}^K pi = 1. Then

Σ_{i=1}^K (Ni − Tpi)²/(Tpi) ∼ χ²_{K−1}.   (15.13)

There is a corresponding test for comparing two empirical distributions.
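A minimal sketch of (15.10)-(15.11) against a standard normal cdf; evaluating the EDF only at the sorted data points, and using the error function for the normal cdf, are simplifications made for this illustration.

import numpy as np
from math import erf, sqrt

def ks_statistic(x, cdf=lambda c: 0.5 * (1 + erf(c / sqrt(2)))):
    """sqrt(T)*D_T: largest distance between the EDF (15.10) and a theoretical
    cdf, evaluated at the observed values of x as in (15.11)."""
    xs = np.sort(x)
    T = len(xs)
    edf = np.arange(1, T + 1) / T                     # EDF at the sorted points
    D = np.max(np.abs(edf - np.array([cdf(c) for c in xs])))
    return np.sqrt(T) * D                             # compare with c = 1.35 (5%)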
15.2.3 Kernel Density Estimation

Reference: Silverman (1986)
A histogram is just a count of the relative number of observations that fall in (pre-specified) non-overlapping intervals. If we also divide by the width of the interval, then the area under the histogram is unity, so the scaled histogram can be interpreted as a density function. For instance, if the intervals (“bins”) are a wide, then the scaled histogram can be defined as

g(x|x is in bin i) = (1/T) Σ_{t=1}^T (1/a) δ(xt is in bin i).   (15.14)

Note that the area under g(x) indeed integrates to unity.
This is clearly a kind of weighted (with binary weights) average of all observations. As with the kernel regression, we can gain efficiency by using a more sophisticated weighting scheme. In particular, using a pdf as a weighting function is often both convenient and efficient. The N(0, h²) is often used. The kernel density estimator of the pdf at some point x is then

f̂(x) = (1/(Th√(2π))) Σ_{t=1}^T exp[−(x − xt)²/(2h²)].   (15.15)

The value h = Std(xt)1.06T^{−1/5} is sometimes recommended, since it can be shown to be the optimal choice (in MSE sense) if data is normally distributed and the N(0, h²) kernel is used. A more ambitious approach is to choose h by a leave-one-out cross-validation technique (see Silverman (1986)).
The results on bias and variance in (15.4) are approximately true also for the kernel density estimation if we interpret m(x) as the pdf of x.
The easiest way to handle a bounded support of x is to transform the variable into one with an unbounded support, estimate the pdf for this variable, and then use the “change of variable” technique to transform to the pdf of the original variable.
We can also estimate multivariate pdfs. Let xt be a d×1 vector and Ω̂ be the estimated covariance matrix of xt. We can then estimate the pdf at a point x by using a multivariate Gaussian kernel as

f̂(x) = (1/(T h^d (2π)^{d/2} |Ω̂|^{1/2})) Σ_{t=1}^T exp[−(x − xt)′Ω̂−1(x − xt)/(2h²)].   (15.16)
[Figure 15.6: Federal funds rate, daily data 1954-2002. Panels: a. histogram (scaled: area = 1); b. kernel density estimation with small, medium, and large h. Reported statistics: K-S (against N(µ, σ²)): √T D = 14.53; skewness: 1.222; kurtosis: 5.125; Bera-Jarque: 7901.]

The value h = 0.96T^{−1/(d+4)} is sometimes recommended.
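A minimal sketch of (15.15) with the univariate rule-of-thumb bandwidth mentioned above; the function name is illustrative.

import numpy as np

def kernel_density(x, grid, h=None):
    """Kernel density estimator (15.15) with a N(0, h^2) kernel; by default h
    is Std(x)*1.06*T^(-1/5)."""
    T = len(x)
    if h is None:
        h = x.std() * 1.06 * T ** (-1 / 5)
    pdf = np.empty(len(grid))
    for i, x0 in enumerate(grid):
        pdf[i] = np.mean(np.exp(-(x0 - x) ** 2 / (2 * h ** 2))) / (h * np.sqrt(2 * np.pi))
    return pdf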

Bibliography

Ait-Sahalia, Y., 1996, “Testing Continuous-Time Models of the Spot Interest Rate,” Review of Financial Studies, 9, 385–426.

Ait-Sahalia, Y., and A. W. Lo, 1998, “Nonparametric Estimation of State-Price Densities Implicit in Financial Asset Prices,” Journal of Finance, 53, 499–547.

Campbell, J. Y., A. W. Lo, and A. C. MacKinlay, 1997, The Econometrics of Financial Markets, Princeton University Press, Princeton, New Jersey.

Cox, J. C., J. E. Ingersoll, and S. A. Ross, 1985, “A Theory of the Term Structure of Interest Rates,” Econometrica, 53, 385–407.

Davidson, R., and J. G. MacKinnon, 1993, Estimation and Inference in Econometrics, Oxford University Press, Oxford.

DeGroot, M. H., 1986, Probability and Statistics, Addison-Wesley, Reading, Massachusetts.

Härdle, W., 1990, Applied Nonparametric Regression, Cambridge University Press, Cambridge.

Harvey, A. C., 1989, Forecasting, Structural Time Series Models and the Kalman Filter, Cambridge University Press.

Mittelhammer, R. C., 1996, Mathematical Statistics for Economics and Business, Springer-Verlag, New York.

Mittelhammer, R. C., G. J. Judge, and D. J. Miller, 2000, Econometric Foundations, Cambridge University Press, Cambridge.

Pagan, A., and A. Ullah, 1999, Nonparametric Econometrics, Cambridge University Press.

Silverman, B. W., 1986, Density Estimation for Statistics and Data Analysis, Chapman and Hall, London.
21 Some Statistics

This section summarizes some useful facts about statistics. Heuristic proofs are given in a few cases.
Some references: Mittelhammer (1996), DeGroot (1986), Greene (2000), Davidson (2000), Johnson, Kotz, and Balakrishnan (1994).

21.1 Distributions and Moment Generating Functions

Most of the stochastic variables we encounter in econometrics are continuous. For a continuous random variable X, the range is uncountably infinite and the probability that X ≤ x is Pr(X ≤ x) = ∫_{−∞}^x f(q)dq where f(q) is the continuous probability density function of X. Note that X is a random variable, x is a number (1.23 or so), and q is just a dummy argument in the integral.

Fact 1 (cdf and pdf) The cumulative distribution function of the random variable X is F(x) = Pr(X ≤ x) = ∫_{−∞}^x f(q)dq. Clearly, f(x) = dF(x)/dx. Note that x is just a number, not a random variable.

Fact 2 (Moment generating function of X) The moment generating function of the random variable X is mgf(t) = E e^{tX}. The rth moment is the rth derivative of mgf(t) evaluated at t = 0: E X^r = d^r mgf(0)/dt^r. If a moment generating function exists (that is, E e^{tX} < ∞ for some small interval t ∈ (−h, h)), then it is unique.

Fact 3 (Moment generating function of a function of X) If X has the moment generating function mgfX(t) = E e^{tX}, then g(X) has the moment generating function E e^{t g(X)}. The affine function a + bX (a and b are constants) has the moment generating function mgf_{g(X)}(t) = E e^{t(a+bX)} = e^{ta} E e^{tbX} = e^{ta} mgfX(bt). By setting b = 1 and a = −E X we obtain a mgf for central moments (variance, skewness, kurtosis, etc), mgf_{(X−EX)}(t) = e^{−t E X} mgfX(t).

Example 4 When X ∼ N(µ, σ²), then mgfX(t) = exp(µt + σ²t²/2). Let Z = (X − µ)/σ so a = −µ/σ and b = 1/σ. This gives mgfZ(t) = exp(−µt/σ) mgfX(t/σ) = exp(t²/2). (Of course, this result can also be obtained by directly setting µ = 0 and σ = 1 in mgfX.)

Fact 5 (Change of variable, univariate case, monotonic function) Suppose X has the probability density function fX(c) and cumulative distribution function FX(c). Let Y = g(X) be a continuously differentiable function with dg/dX ≠ 0 (so g(X) is monotonic) for all c such that fX(c) > 0. Then the cdf of Y is

FY(c) = Pr[Y ≤ c] = Pr[g(X) ≤ c] = Pr[X ≤ g−1(c)] = FX[g−1(c)],

where g−1 is the inverse function of g such that g−1(Y) = X. We also have that the pdf of Y is

fY(c) = fX[g−1(c)] |dg−1(c)/dc|.

Proof. Differentiate FY(c), that is, FX[g−1(c)] with respect to c.

Example 6 Let X ∼ U(0, 1) and Y = g(X) = F−1(X) where F(c) is a strictly increasing cdf. We then get

fY(c) = dF(c)/dc.

The variable Y then has the pdf dF(c)/dc and the cdf F(c).
21.2 Joint and Conditional Distributions and Moments

21.2.1 Joint and Conditional Distributions

Fact 7 (Joint and marginal cdf) Let X and Y be (possibly vectors of) random variables and let x and y be two numbers. The joint cumulative distribution function of X and Y is H(x, y) = Pr(X ≤ x, Y ≤ y) = ∫_{−∞}^x ∫_{−∞}^y h(qx, qy) dqy dqx, where h(x, y) = ∂²F(x, y)/∂x∂y is the joint probability density function.

Fact 8 (Joint and marginal pdf) The marginal cdf of X is obtained by integrating out Y: F(x) = Pr(X ≤ x, Y anything) = ∫_{−∞}^x [∫_{−∞}^∞ h(qx, qy) dqy] dqx. This shows that the marginal pdf of x is f(x) = dF(x)/dx = ∫_{−∞}^∞ h(qx, qy) dqy.
Fact 9 (Change of variable, multivariate case, monotonic function) The result in Fact 5 still holds if X and Y are both n×1 vectors, but the derivative is now ∂g−1(c)/∂c′, which is an n×n matrix. If gi−1 is the ith function in the vector g−1, then

∂g−1(c)/∂c′ = [∂g1−1(c)/∂c1 ··· ∂g1−1(c)/∂cn; ...; ∂gn−1(c)/∂c1 ··· ∂gn−1(c)/∂cn].

Fact 10 (Conditional distribution) The pdf of Y conditional on X = x (a number) is g(y|x) = h(x, y)/f(x).

21.2.2 Moments of Joint Distributions

Fact 11 (Cauchy-Schwartz) (E XY)² ≤ E(X²) E(Y²).

Proof. 0 ≤ E[(aX + Y)²] = a² E(X²) + 2a E(XY) + E(Y²). Set a = −E(XY)/E(X²) to get

0 ≤ −[E(XY)]²/E(X²) + E(Y²), that is, [E(XY)]²/E(X²) ≤ E(Y²).

Fact 12 (−1 ≤ Corr(X, Y) ≤ 1). Let Y and X in Fact 11 be zero mean variables (or variables minus their means). We then get [Cov(X, Y)]² ≤ Var(X) Var(Y), that is, −1 ≤ Cov(X, Y)/[Std(X)Std(Y)] ≤ 1.

21.2.3 Conditional Moments

Fact 13 (Conditional moments) E(Y|x) = ∫ y g(y|x) dy and Var(Y|x) = ∫ [y − E(Y|x)]² g(y|x) dy.

Fact 14 (Conditional moments as random variables) Before we observe X, the conditional moments are random variables—since X is. We denote these random variables by E(Y|X), Var(Y|X), etc.

Fact 15 (Law of iterated expectations) E Y = E[E(Y|X)]. Note that E(Y|X) is a random variable since it is a function of the random variable X. It is not a function of Y, however. The outer expectation is therefore an expectation with respect to X only.

Proof. E[E(Y|X)] = ∫[∫ y g(y|x) dy] f(x) dx = ∫∫ y g(y|x) f(x) dy dx = ∫∫ y h(y, x) dy dx = E Y.

Fact 16 (Conditional vs. unconditional variance) Var(Y) = Var[E(Y|X)] + E[Var(Y|X)].

Fact 17 (Properties of Conditional Expectations) (a) Y = E(Y|X) + U where U and E(Y|X) are uncorrelated: Cov(X, Y) = Cov[X, E(Y|X) + U] = Cov[X, E(Y|X)]. It follows that (b) Cov[Y, E(Y|X)] = Var[E(Y|X)]; and (c) Var(Y) = Var[E(Y|X)] + Var(U). Property (c) is the same as Fact 16, where Var(U) = E[Var(Y|X)].

Proof. Cov(X, Y) = ∫∫ x(y − E y) h(x, y) dy dx = ∫ x [∫(y − E y) g(y|x) dy] f(x) dx, but the term in brackets is E(Y|X) − E Y.

Fact 18 (Conditional expectation and unconditional orthogonality) E(Y|Z) = 0 ⇒ E YZ = 0.

Proof. Note from Fact 17 that E(Y|X) = 0 implies Cov(X, Y) = 0 so E XY = E X E Y (recall that Cov(X, Y) = E XY − E X E Y). Note also that E(Y|X) = 0 implies that E Y = 0 (by iterated expectations). We therefore get

E(Y|X) = 0 ⇒ [Cov(X, Y) = 0 and E Y = 0] ⇒ E YX = 0.

21.2.4 Regression Function and Linear Projection

Fact 19 (Regression function) Suppose we use information in some variables X to predict Y. The choice of the forecasting function Ŷ = k(X) = E(Y|X) minimizes E[Y − k(X)]². The conditional expectation E(Y|X) is also called the regression function of Y on X. See Facts 17 and 18 for some properties of conditional expectations.

Fact 20 (Linear projection) Suppose we want to forecast the scalar Y using the k×1 vector X and that we restrict the forecasting rule to be linear Ŷ = X′β. This rule is a linear projection, denoted P(Y|X), if β satisfies the orthogonality conditions E[X(Y − X′β)] = 0k×1, that is, if β = (E XX′)−1 E XY. A linear projection minimizes E[Y − k(X)]² within the class of linear k(X) functions.
Fact 21 (Properties of linear projections) (a) The orthogonality conditions in Fact 20 mean that

Y = X′β + ε,

where E(Xε) = 0k×1. This implies that E[P(Y|X)ε] = 0, so the forecast and forecast error are orthogonal. (b) The orthogonality conditions also imply that E[XY] = E[X P(Y|X)]. (c) When X contains a constant, so E ε = 0, then (a) and (b) carry over to covariances: Cov[P(Y|X), ε] = 0 and Cov[X, Y] = Cov[X, P(Y|X)].

Example 22 (P(1|X)) When Yt = 1, then β = (E XX′)−1 E X. For instance, suppose X = [x1t, x2t]′. Then

β = [E x1t² E x1t x2t; E x2t x1t E x2t²]−1 [E x1t; E x2t].

If x1t = 1 in all periods, then this simplifies to β = [1, 0]′.

Remark 23 Some authors prefer to take the transpose of the forecasting rule, that is, to use Ŷ = β′X. Clearly, since XX′ is symmetric, we get β′ = E(YX′)(E XX′)−1.

Fact 24 (Linear projection with a constant in X) If X contains a constant, then P(aY + b|X) = aP(Y|X) + b.

Fact 25 (Linear projection versus regression function) Both the linear regression and the regression function (see Fact 19) minimize E[Y − k(X)]², but the linear projection imposes the restriction that k(X) is linear, whereas the regression function does not impose any restrictions. In the special case when Y and X have a joint normal distribution, then the linear projection is the regression function.

Fact 26 (Linear projection and OLS) The linear projection is about population moments, but OLS is its sample analogue.

21.3 Convergence in Probability, Mean Square, and Distribution

Fact 27 (Convergence in probability) The sequence of random variables {XT} converges in probability to the random variable X if (and only if) for all ε > 0

lim_{T→∞} Pr(|XT − X| < ε) = 1.

We denote this XT →p X or plim XT = X (X is the probability limit of XT). Note: (a) X can be a constant instead of a random variable; (b) if XT and X are matrices, then XT →p X if the previous condition holds for every element in the matrices.

Example 28 Suppose XT = 0 with probability (T − 1)/T and XT = T with probability 1/T. Note that limT→∞ Pr(|XT − 0| = 0) = limT→∞ (T − 1)/T = 1, so limT→∞ Pr(|XT − 0| < ε) = 1 for any ε > 0. Note also that E XT = 0 × (T − 1)/T + T × 1/T = 1, so XT is biased.

Fact 29 (Convergence in mean square) The sequence of random variables {XT} converges in mean square to the random variable X if (and only if)

lim_{T→∞} E(XT − X)² = 0.

We denote this XT →m X. Note: (a) X can be a constant instead of a random variable; (b) if XT and X are matrices, then XT →m X if the previous condition holds for every element in the matrices.

Fact 30 (Convergence in mean square to a constant) If X in Fact 29 is a constant, then XT →m X if (and only if)

lim_{T→∞} (E XT − X)² = 0 and lim_{T→∞} Var(XT) = 0.

This means that both the variance and the squared bias go to zero as T → ∞.

Proof. E(XT − X)² = E XT² − 2X E XT + X². Add and subtract (E XT)² and recall that Var(XT) = E XT² − (E XT)². This gives E(XT − X)² = Var(XT) − 2X E XT + X² + (E XT)² = Var(XT) + (E XT − X)².

Fact 31 (Convergence in distribution) Consider the sequence of random variables {XT} with the associated sequence of cumulative distribution functions {FT}. If limT→∞ FT = F (at all points), then F is the limiting cdf of XT. If there is a random variable X with cdf F, then XT converges in distribution to X: XT →d X. Instead of comparing cdfs, the comparison can equally well be made in terms of the probability density functions or the moment generating functions.
Fact 32 (Relation between the different types of convergence) We have XT →m X ⇒ XT →p X ⇒ XT →d X. The reverse implications are not generally true.

Example 33 Consider the random variable in Example 28. The expected value is E XT = 0(T − 1)/T + T/T = 1. This means that the squared bias does not go to zero, so XT does not converge in mean square to zero.

Fact 34 (Slutsky's theorem) If {XT} is a sequence of random matrices such that plim XT = X and g(XT) a continuous function, then plim g(XT) = g(X).

Fact 35 (Continuous mapping theorem) Let the sequences of random matrices {XT} and {YT}, and the non-random matrix {aT} be such that XT →d X, YT →p Y, and aT → a (a traditional limit). Let g(XT, YT, aT) be a continuous function. Then g(XT, YT, aT) →d g(X, Y, a).

21.4 Laws of Large Numbers and Central Limit Theorems

Fact 36 (Khinchine's theorem) Let Xt be independently and identically distributed (iid) with E Xt = µ < ∞. Then Σ_{t=1}^T Xt/T →p µ.

Fact 37 (Chebyshev's theorem) If E Xt = 0 and limT→∞ Var(Σ_{t=1}^T Xt/T) = 0, then Σ_{t=1}^T Xt/T →p 0.

Fact 38 (The Lindeberg-Lévy theorem) Let Xt be independently and identically distributed (iid) with E Xt = 0 and Var(Xt) < ∞. Then (1/√T) Σ_{t=1}^T Xt/σ →d N(0, 1).

21.5 Stationarity

Fact 39 (Covariance stationarity) Xt is covariance stationary if

E Xt = µ is independent of t,
Cov(Xt−s, Xt) = γs depends only on s, and
both µ and γs are finite.

Fact 40 (Strict stationarity) Xt is strictly stationary if, for all s, the joint distribution of Xt, Xt+1, ..., Xt+s does not depend on t.

Fact 41 In general, strict stationarity does not imply covariance stationarity or vice versa. However, strict stationarity with finite first two moments implies covariance stationarity.

21.6 Martingales

Fact 42 (Martingale) Let Ωt be a set of information in t, for instance Yt, Yt−1, ... If E|Yt| < ∞ and E(Yt+1|Ωt) = Yt, then Yt is a martingale.

Fact 43 (Martingale difference) If Yt is a martingale, then Xt = Yt − Yt−1 is a martingale difference: Xt has E|Xt| < ∞ and E(Xt+1|Ωt) = 0.

Fact 44 (Innovations as a martingale difference sequence) The forecast error Xt+1 = Yt+1 − E(Yt+1|Ωt) is a martingale difference.

Fact 45 (Properties of martingales) (a) If Yt is a martingale, then E(Yt+s|Ωt) = Yt for s ≥ 1. (b) If Xt is a martingale difference, then E(Xt+s|Ωt) = 0 for s ≥ 1.

Proof. (a) Note that E(Yt+2|Ωt+1) = Yt+1 and take expectations conditional on Ωt: E[E(Yt+2|Ωt+1)|Ωt] = E(Yt+1|Ωt) = Yt. By iterated expectations, the first term equals E(Yt+2|Ωt). Repeat this for t+3, t+4, etc. (b) Essentially the same proof.

Fact 46 (Properties of martingale differences) If Xt is a martingale difference and gt−1 is a function of Ωt−1, then Xt gt−1 is also a martingale difference.

Proof. E(Xt+1 gt|Ωt) = E(Xt+1|Ωt) gt since gt is a function of Ωt.

Fact 47 (Martingales, serial independence, and no autocorrelation) (a) Xt is serially uncorrelated if Cov(Xt, Xt+s) = 0 for all s ≠ 0. This means that a linear projection of Xt+s on Xt, Xt−1, ... is a constant, so it cannot help predict Xt+s. (b) Xt is a martingale difference with respect to its history if E(Xt+s|Xt, Xt−1, ...) = 0 for all s ≥ 1. This means that no function of Xt, Xt−1, ... can help predict Xt+s. (c) Xt is serially independent if pdf(Xt+s|Xt, Xt−1, ...) = pdf(Xt+s). This means that no function of Xt, Xt−1, ... can help predict any function of Xt+s.
Fact 48 (WLLN for martingale differences) If X_t is a martingale difference, then plim Σ_{t=1}^T X_t/T = 0 if either (a) X_t is strictly stationary and E|X_t| < ∞ or (b) E|X_t|^{1+δ} < ∞ for δ > 0 and all t. (See Davidson (2000) 6.2)

Fact 49 (CLT for martingale differences) Let X_t be a martingale difference. If plim Σ_{t=1}^T (X_t² − E X_t²)/T = 0 and either

(a) X_t is strictly stationary, or
(b) max_{t∈[1,T]} (E|X_t|^{2+δ})^{1/(2+δ)} / (Σ_{t=1}^T E X_t²/T) < ∞ for δ > 0 and all T > 1,

then (Σ_{t=1}^T X_t/√T)/(Σ_{t=1}^T E X_t²/T)^{1/2} →d N(0, 1). (See Davidson (2000) 6.2)

21.7 Special Distributions

21.7.1 The Normal Distribution

Fact 50 (Univariate normal distribution) If X ∼ N(μ, σ²), then the probability density function of X, f(x), is

f(x) = (1/√(2πσ²)) exp[−(1/2)((x − μ)/σ)²].

The moment generating function is mgf_X(t) = exp(μt + σ²t²/2) and the moment generating function around the mean is mgf_{X−μ}(t) = exp(σ²t²/2).

Example 51 The first few moments around the mean are E(X − μ) = 0, E(X − μ)² = σ², E(X − μ)³ = 0 (all odd moments are zero), E(X − μ)⁴ = 3σ⁴, E(X − μ)⁶ = 15σ⁶, and E(X − μ)⁸ = 105σ⁸.

Fact 52 (Standard normal distribution) If X ∼ N(0, 1), then the moment generating function is mgf_X(t) = exp(t²/2). Since the mean is zero, m(t) gives central moments. The first few are E X = 0, E X² = 1, E X³ = 0 (all odd moments are zero), and E X⁴ = 3.

Fact 53 (Multivariate normal distribution) If X is an n × 1 vector of random variables with a multivariate normal distribution, with a mean vector μ and variance-covariance matrix Σ, N(μ, Σ), then the density function is

f(x) = (2π)^{−n/2} |Σ|^{−1/2} exp[−(1/2)(x − μ)′Σ^{−1}(x − μ)].

Fact 54 (Conditional normal distribution) Suppose Z_{m×1} and X_{n×1} are jointly normally distributed

[Z; X] ∼ N([μ_Z; μ_X], [Σ_ZZ, Σ_ZX; Σ_XZ, Σ_XX]).

The distribution of the random variable Z conditional on X = x (a number) is also normal with mean

E(Z|x) = μ_Z + Σ_ZX Σ_XX^{−1} (x − μ_X),

and variance (the variance of Z conditional on X = x, that is, the variance of the prediction error Z − E(Z|x))

Var(Z|x) = Σ_ZZ − Σ_ZX Σ_XX^{−1} Σ_XZ.

Note that the conditional variance is constant in the multivariate normal distribution (Var(Z|X) is not a random variable in this case). Note also that Var(Z|x) is less than Var(Z) = Σ_ZZ (in a matrix sense) if X contains any relevant information (so Σ_ZX is not zero, that is, E(Z|x) is not the same for all x).

Fact 55 (Stein's lemma) If Y has a normal distribution and h() is a differentiable function such that E|h′(Y)| < ∞, then Cov[Y, h(Y)] = Var(Y) E h′(Y).

Proof. E[(Y − μ)h(Y)] = ∫_{−∞}^{∞} (Y − μ)h(Y)φ(Y; μ, σ²)dY, where φ(Y; μ, σ²) is the pdf of N(μ, σ²). Note that dφ(Y; μ, σ²)/dY = −φ(Y; μ, σ²)(Y − μ)/σ², so the integral can be rewritten as −σ² ∫_{−∞}^{∞} h(Y)dφ(Y; μ, σ²). Integration by parts ("∫u dv = uv − ∫v du") gives −σ² ( [h(Y)φ(Y; μ, σ²)]_{−∞}^{∞} − ∫_{−∞}^{∞} φ(Y; μ, σ²)h′(Y)dY ) = σ² E h′(Y).

Fact 56 (Stein's lemma 2) It follows from Fact 55 that if X and Y have a bivariate normal distribution and h() is a differentiable function such that E|h′(Y)| < ∞, then Cov[X, h(Y)] = Cov(X, Y) E h′(Y).

Example 57 (a) With h(Y) = exp(Y) we get Cov[X, exp(Y)] = Cov(X, Y) E exp(Y); (b) with h(Y) = Y² we get Cov[X, Y²] = 2 Cov(X, Y) E Y, so with E Y = 0 we get a zero covariance.

Fact 58 (Truncated normal distribution) Let X ∼ N(μ, σ²), and consider truncating the distribution so that we want moments conditional on X > a. Define a₀ = (a − μ)/σ and let

λ(a₀) = φ(a₀)/[1 − Φ(a₀)] and δ(a₀) = λ(a₀)[λ(a₀) − a₀].

Then,

E(X|X > a) = μ + σλ(a₀) and Var(X|X > a) = σ²[1 − δ(a₀)].

The same results hold for E(X|X < a) and Var(X|X < a) if we redefine λ(a₀) as λ(a₀) = −φ(a₀)/Φ(a₀) (and recalculate δ(a₀)).
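As a numerical cross-check of the truncated normal moments in Fact 58, the sketch below (Python with numpy and scipy.stats; these tools and the parameter values are my own assumptions) compares the closed-form conditional mean and variance with simulated moments.

    import numpy as np
    from scipy.stats import norm

    # Sketch (assumes numpy/scipy): check E(X|X>a) = mu + sigma*lambda(a0) and
    # Var(X|X>a) = sigma^2*(1 - delta(a0)) from Fact 58 against a simulation.
    mu, sigma, a = 1.0, 2.0, 1.5
    a0 = (a - mu) / sigma
    lam = norm.pdf(a0) / (1 - norm.cdf(a0))
    delta = lam * (lam - a0)

    rng = np.random.default_rng(0)
    x = rng.normal(mu, sigma, 1_000_000)
    x = x[x > a]                             # keep only the truncated sample

    print(mu + sigma * lam, x.mean())        # the two numbers should be close
    print(sigma**2 * (1 - delta), x.var())   # ditto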
Figure 21.1: Normal distributions. (Panels: a. pdf of N(0, σ²) for σ² = 0.5, 1, 2; b. pdf of a bivariate normal with correlation 0; c. pdf of a bivariate normal with correlation 0.8.)

Example 59 Suppose X ∼ N(0, σ²) and we want to calculate E|X|. This is the same as E(X|X > 0) = 2σφ(0).

Fact 60 (Delta method) Consider an estimator β̂_{k×1} which satisfies

√T (β̂ − β₀) →d N(0, Ω),

and suppose we want the asymptotic distribution of a transformation of β

γ_{q×1} = g(β),

where g(.) has continuous first derivatives. The result is

√T [g(β̂) − g(β₀)] →d N(0, Ψ_{q×q}), where
Ψ = (∂g(β₀)/∂β′) Ω (∂g(β₀)′/∂β), where ∂g(β₀)/∂β′ is q × k.

Proof. By the mean value theorem we have

g(β̂) = g(β₀) + (∂g(β*)/∂β′)(β̂ − β₀),

where

∂g(β)/∂β′ = [∂g₁(β)/∂β₁, ⋯, ∂g₁(β)/∂β_k; ⋮, ⋱, ⋮; ∂g_q(β)/∂β₁, ⋯, ∂g_q(β)/∂β_k]   (a q × k matrix),

and we evaluate it at β*, which is (weakly) between β̂ and β₀. Premultiply by √T and rearrange as

√T [g(β̂) − g(β₀)] = (∂g(β*)/∂β′) √T (β̂ − β₀).

If β̂ is consistent (plim β̂ = β₀) and ∂g(β*)/∂β′ is continuous, then by Slutsky's theorem plim ∂g(β*)/∂β′ = ∂g(β₀)/∂β′, which is a constant. The result then follows from the continuous mapping theorem.

21.7.2 The Lognormal Distribution

Fact 61 (Univariate lognormal distribution) If x ∼ N(μ, σ²) and y = exp(x), then the probability density function of y, f(y), is

f(y) = (1/(y√(2πσ²))) exp[−(1/2)((ln y − μ)/σ)²], y > 0.

The r-th moment of y is E y^r = exp(rμ + r²σ²/2).

Example 62 The first two moments are E y = exp(μ + σ²/2) and E y² = exp(2μ + 2σ²). We therefore get Var(y) = exp(2μ + σ²)[exp(σ²) − 1] and Std(y)/E y = √(exp(σ²) − 1).
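The delta method in Fact 60 above can also be illustrated numerically. The sketch below (Python with numpy; the tools, the transformation g(β) = exp(β) and the parameter values are my own choices) compares the delta-method standard deviation with a Monte Carlo estimate when β̂ is a sample mean.

    import numpy as np

    # Sketch (assumes numpy): delta method (Fact 60) for g(beta) = exp(beta),
    # where beta_hat is a sample mean, so sqrt(T)*(beta_hat - beta0) ->d N(0, sigma^2).
    beta0, sigma, T = 0.5, 1.0, 1_000
    g_prime = np.exp(beta0)                  # dg/dbeta evaluated at beta0
    psi = g_prime * sigma**2 * g_prime       # Psi = g'(b0) * Omega * g'(b0)'

    rng = np.random.default_rng(0)
    draws = rng.normal(beta0, sigma, size=(10_000, T))
    beta_hat = draws.mean(axis=1)            # one estimate per Monte Carlo sample
    stat = np.sqrt(T) * (np.exp(beta_hat) - np.exp(beta0))

    print(np.sqrt(psi), stat.std())          # asymptotic vs simulated std, close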
Fact 63 (Moments of a truncated lognormal distribution) If x ∼ N(μ, σ²) and y = exp(x), then E(y^r|y > a) = E(y^r) Φ(rσ − a₀)/Φ(−a₀), where a₀ = (ln a − μ)/σ.

Remark 64 Note that the denominator is Pr(y > a) = Φ(−a₀), while the complement is Pr(y ≤ a) = Φ(a₀).

Example 65 The first two moments of the truncated lognormal distribution are E(y|y > a) = exp(μ + σ²/2) Φ(σ − a₀)/Φ(−a₀) and E(y²|y > a) = exp(2μ + 2σ²) Φ(2σ − a₀)/Φ(−a₀).

Fact 66 (Bivariate lognormal distribution). Let x₁ and x₂ have a bivariate normal distribution

[x₁; x₂] ∼ N([μ₁; μ₂], [σ₁², ρσ₁σ₂; ρσ₁σ₂, σ₂²]),

and consider y₁ = exp(x₁) and y₂ = exp(x₂). From Fact 54 we know that the conditional distribution of x₁ given y₂ or x₂ is then normal

x₁|(y₂ or x₂) ∼ N(μ₁ + (ρσ₁/σ₂)(x₂ − μ₂), σ₁²(1 − ρ²)).

It follows that the conditional distribution of y₁ given y₂ or x₂ is lognormal (with the parameters in the last equation).

Fact 67 In the case of Fact 66, we also get

Cov(y₁, y₂) = [exp(ρσ₁σ₂) − 1] exp[μ₁ + μ₂ + (σ₁² + σ₂²)/2], and
Corr(y₁, y₂) = [exp(ρσ₁σ₂) − 1] / √{[exp(σ₁²) − 1][exp(σ₂²) − 1]}.

21.7.3 The Chi-Square Distribution

Fact 68 If Y ∼ χ²_n, then the pdf of Y is f(y) = y^{n/2−1} e^{−y/2} / [2^{n/2} Γ(n/2)], where Γ() is the gamma function. The moment generating function is mgf_Y(t) = (1 − 2t)^{−n/2} for t < 1/2. The first moments of Y are E Y = n and Var(Y) = 2n.

Fact 69 (Quadratic forms of normally distributed random variables) If the n × 1 vector X ∼ N(0, Σ), then Y = X′Σ^{−1}X ∼ χ²_n. Therefore, if the n scalar random variables X_i, i = 1, ..., n, are uncorrelated and have the distributions N(0, σ_i²), i = 1, ..., n, then Y = Σ_{i=1}^n X_i²/σ_i² ∼ χ²_n.

Fact 70 If the n × 1 vector X ∼ N(0, I), and A is a symmetric idempotent matrix (A = A′ and A = AA = A′A) of rank r, then Y = X′AX ∼ χ²_r.

Fact 71 If the n × 1 vector X ∼ N(0, Σ), where Σ has rank r ≤ n, then Y = X′Σ⁺X ∼ χ²_r, where Σ⁺ is the pseudo inverse of Σ.

Proof. Σ is symmetric, so it can be decomposed as Σ = CΛC′ where C contains the orthogonal eigenvectors (C′C = I) and Λ is a diagonal matrix with the eigenvalues along the main diagonal. We therefore have Σ = CΛC′ = C₁Λ₁₁C₁′, where C₁ is an n × r matrix associated with the r non-zero eigenvalues (found in the r × r matrix Λ₁₁). The generalized inverse can be shown to be

Σ⁺ = [C₁ C₂] [Λ₁₁^{−1}, 0; 0, 0] [C₁ C₂]′ = C₁Λ₁₁^{−1}C₁′.

We can write Σ⁺ = C₁Λ₁₁^{−1/2}Λ₁₁^{−1/2}C₁′. Consider the r × 1 vector Z = Λ₁₁^{−1/2}C₁′X, and note that it has the covariance matrix

E ZZ′ = Λ₁₁^{−1/2}C₁′ E XX′ C₁Λ₁₁^{−1/2} = Λ₁₁^{−1/2}C₁′ C₁Λ₁₁C₁′ C₁Λ₁₁^{−1/2} = I_r,

since C₁′C₁ = I_r. This shows that Z ∼ N(0_{r×1}, I_r), so Z′Z = X′Σ⁺X ∼ χ²_r.

Fact 72 (Convergence to a normal distribution) Let Y ∼ χ²_n and Z = (Y − n)/n^{1/2}. Then Z →d N(0, 2).

Example 73 If Y = Σ_{i=1}^n X_i²/σ_i², then this transformation means Z = (Σ_{i=1}^n X_i²/σ_i² − n)/n^{1/2}.

Proof. We can directly note from the moments of a χ²_n variable that E Z = (E Y − n)/n^{1/2} = 0, and Var(Z) = Var(Y)/n = 2. From the general properties of moment generating functions, we note that the moment generating function of Z is

mgf_Z(t) = e^{−t√n} (1 − 2t/n^{1/2})^{−n/2}, with lim_{n→∞} mgf_Z(t) = exp(t²).

This is the moment generating function of a N(0, 2) distribution, which shows that Z →d N(0, 2). This result should not come as a surprise as we can think of Y as the sum of n variables; dividing by n^{1/2} is then like creating a scaled sample average for which a central limit theorem applies.
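The covariance formula for the bivariate lognormal distribution in Fact 67 is easy to verify by simulation. A minimal sketch (Python with numpy; the parameter values are arbitrary choices of mine):

    import numpy as np

    # Sketch (assumes numpy): check the bivariate lognormal covariance in Fact 67.
    mu1, mu2, s1, s2, rho = 0.1, 0.2, 0.5, 0.8, 0.3
    cov_formula = (np.exp(rho * s1 * s2) - 1) * np.exp(mu1 + mu2 + (s1**2 + s2**2) / 2)

    rng = np.random.default_rng(0)
    cov_x = [[s1**2, rho * s1 * s2], [rho * s1 * s2, s2**2]]
    x = rng.multivariate_normal([mu1, mu2], cov_x, size=1_000_000)
    y = np.exp(x)

    print(cov_formula, np.cov(y[:, 0], y[:, 1])[0, 1])   # should be close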
Figure 21.2: χ², F, and t distributions. (Panels: a. pdf of χ²(n) for n = 1, 2, 5, 10; b. pdf of F(n₁, 10) for n₁ = 2, 5, 10; c. pdf of F(n₁, 100); d. pdf of N(0, 1) compared with t(10) and t(50).)

21.7.4 The t and F Distributions

Fact 74 If Y₁ ∼ χ²_{n₁} and Y₂ ∼ χ²_{n₂} and Y₁ and Y₂ are independent, then Z = (Y₁/n₁)/(Y₂/n₂) has an F(n₁, n₂) distribution. This distribution has no moment generating function, but E Z = n₂/(n₂ − 2) for n₂ > 2.

Fact 75 The distribution of n₁Z = Y₁/(Y₂/n₂) converges to a χ²_{n₁} distribution as n₂ → ∞. (The idea is essentially that as n₂ → ∞ the denominator converges to its mean, which is E Y₂/n₂ = 1. Left is then only the numerator, which is a χ²_{n₁} variable.)

Fact 76 If X ∼ N(0, 1) and Y ∼ χ²_n and X and Y are independent, then Z = X/(Y/n)^{1/2} has a t_n distribution. The moment generating function does not exist, but E Z = 0 for n > 1 and Var(Z) = n/(n − 2) for n > 2.

Fact 77 The t distribution converges to a N(0, 1) distribution as n → ∞.

Fact 78 If Z ∼ t_n, then Z² ∼ F(1, n).

21.7.5 The Bernoulli and Binomial Distributions

Fact 79 (Bernoulli distribution) The random variable X can only take two values: 1 or 0, with probability p and 1 − p respectively. The moment generating function is mgf(t) = pe^t + 1 − p. This gives E(X) = p and Var(X) = p(1 − p).

Example 80 (Shifted Bernoulli distribution) Suppose the Bernoulli variable takes the values a or b (instead of 1 and 0) with probability p and 1 − p respectively. Then E(X) = pa + (1 − p)b and Var(X) = p(1 − p)(a − b)².

Fact 81 (Binomial distribution). Suppose X₁, X₂, ..., X_n all have Bernoulli distributions with the parameter p. Then, the sum Y = X₁ + X₂ + ... + X_n has a Binomial distribution with parameters p and n. The pdf is pdf(Y) = n!/[y!(n − y)!] p^y (1 − p)^{n−y} for y = 0, 1, ..., n. The moment generating function is mgf(t) = [pe^t + 1 − p]^n. This gives E(Y) = np and Var(Y) = np(1 − p).

Example 82 (Shifted Binomial distribution) Suppose the Bernoulli variables X₁, X₂, ..., X_n take the values a or b (instead of 1 and 0) with probability p and 1 − p respectively. Then, the sum Y = X₁ + X₂ + ... + X_n has E(Y) = n[pa + (1 − p)b] and Var(Y) = n[p(1 − p)(a − b)²].

21.8 Inference

Fact 83 (Comparing variance-covariance matrices) Let Var(β̂) and Var(β*) be the variance-covariance matrices of two estimators, β̂ and β*, and suppose Var(β̂) − Var(β*) is a positive semi-definite matrix. This means that for any non-zero vector R, R′Var(β̂)R ≥ R′Var(β*)R, so every linear combination of β̂ has a variance that is at least as large as the variance of the same linear combination of β*. In particular, this means that the variance of every element in β̂ (the diagonal elements of Var(β̂)) is at least as large as the variance of the corresponding element of β*.
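Fact 83 suggests a mechanical check: Var(β̂) − Var(β*) is positive semi-definite exactly when all its eigenvalues are non-negative, and then any vector R gives R′Var(β̂)R ≥ R′Var(β*)R. A sketch (Python with numpy; the two covariance matrices are made-up examples of mine):

    import numpy as np

    # Sketch (assumes numpy): Fact 83. If Var(b1) - Var(b2) is positive
    # semi-definite, every linear combination of b1 has at least as large a variance.
    var_b1 = np.array([[2.0, 0.5], [0.5, 1.5]])
    var_b2 = np.array([[1.0, 0.3], [0.3, 1.0]])
    diff = var_b1 - var_b2

    print(np.linalg.eigvalsh(diff))        # all >= 0: diff is positive semi-definite

    R = np.array([1.0, -2.0])              # an arbitrary linear combination
    print(R @ var_b1 @ R, R @ var_b2 @ R)  # the first number is at least as large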
Bibliography

Davidson, J., 2000, Econometric Theory, Blackwell Publishers, Oxford.

DeGroot, M. H., 1986, Probability and Statistics, Addison-Wesley, Reading, Massachusetts.

Greene, W. H., 2000, Econometric Analysis, Prentice-Hall, Upper Saddle River, New Jersey, 4th edn.

Johnson, N. L., S. Kotz, and N. Balakrishnan, 1994, Continuous Univariate Distributions, Wiley, New York, 2nd edn.

Mittelhammer, R. C., 1996, Mathematical Statistics for Economics and Business, Springer-Verlag, New York.

22 Some Facts about Matrices

Some references: Greene (2000), Golub and van Loan (1989), Björk (1996), Anton (1987), Greenberg (1988).

22.1 Rank

Fact 1 (Submatrix) Any matrix obtained from the m × n matrix A by deleting at most m − 1 rows and at most n − 1 columns is a submatrix of A.

Fact 2 (Rank) The rank of the m × n matrix A is ρ if the largest submatrix with non-zero determinant is ρ × ρ. The number of linearly independent row vectors (and column vectors) of A is then ρ.

22.2 Vector Norms

Fact 3 (Vector p-norm) Let x be an n × 1 vector. The p-norm is defined as

‖x‖_p = (Σ_{i=1}^n |x_i|^p)^{1/p}.

The Euclidian norm corresponds to p = 2,

‖x‖₂ = (Σ_{i=1}^n x_i²)^{1/2} = √(x′x).

22.3 Systems of Linear Equations and Matrix Inverses

Fact 4 (Linear systems of equations) Consider the linear system Ax = c where A is m × n, x is n × 1, and c is m × 1. A solution is a vector x such that Ax = c. It has a unique solution if and only if rank(A) = rank([A c]) = n; an infinite number of solutions if and only if rank(A) = rank([A c]) < n; and no solution if and only if rank(A) ≠ rank([A c]).
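The rank conditions in Fact 4 can be checked mechanically. The sketch below (Python with numpy; the helper function classify is my own construction, not part of the notes) classifies the three systems used in Examples 5, 6 and 10 below.

    import numpy as np

    # Sketch (assumes numpy): classify Ax = c with the rank conditions in Fact 4.
    def classify(A, c):
        A = np.atleast_2d(A)
        Ac = np.column_stack([A, c])
        rA, rAc, n = np.linalg.matrix_rank(A), np.linalg.matrix_rank(Ac), A.shape[1]
        if rA == rAc == n:
            return "unique solution"
        if rA == rAc < n:
            return "infinitely many solutions"
        return "no solution"

    print(classify(np.array([[1., 5.], [2., 6.]]), np.array([3., 6.])))  # unique
    print(classify(np.array([[1.], [2.]]), np.array([3., 7.])))          # no solution
    print(classify(np.array([[1., 2.]]), np.array([5.])))                # infinitely many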
Example 5 (Linear systems of equations, unique solution when m = n) Let x be 2 × 1, and consider the linear system

Ax = c with A = [1, 5; 2, 6] and c = [3; 6].

Here rank(A) = 2 and rank([A c]) = 2. The unique solution is x = [3 0]′.

Example 6 (Linear systems of equations, no solution when m > n) Let x be a scalar, and consider the linear system

Ax = c with A = [1; 2] and c = [3; 7].

Here rank(A) = 1 and rank([A c]) = 2. There is then no solution.

Fact 7 (Least squares) Suppose that no solution exists to Ax = c. The best approximate solution, in the sense of minimizing (the square root of) the sum of squared errors, ‖c − Ax̂‖₂ = [(c − Ax̂)′(c − Ax̂)]^{1/2}, is x̂ = (A′A)^{−1}A′c, provided the inverse exists. This is obviously the least squares solution. In the example with c = [3 7]′, it is

x̂ = ([1 2][1; 2])^{−1} [1 2][3; 7] = 17/5 or 3.4.

This is illustrated in Figure 22.1. (Translation to OLS notation: c is the vector of dependent variables for m observations, A is the matrix with explanatory variables with the t-th observation in row t, and x is the vector of parameters to estimate.)

Fact 8 (Pseudo inverse or generalized inverse) Suppose that no solution exists to Ax = c, and that A′A is not invertible. There are then several approximations, x̂, which all minimize ‖c − Ax̂‖₂. The one with the smallest ‖x̂‖₂ is given by x̂ = A⁺c, where A⁺ is the Moore-Penrose pseudo (generalized) inverse of A. See Fact 54.

Figure 22.1: Value of the quadratic loss function (sum of squared errors of the solution to Ax = c, scalar x; one case with a unique solution, A = [1 2]′ and c = [3, 6]′, and one with no solution, A = [1 2]′ and c = [3, 7]′).

Example 9 (Linear systems of equations, unique solution when m > n) Change c in Example 6 to c = [3 6]′. Then rank(A) = 1 and rank([A c]) = 1, and the unique solution is x = 3.

Example 10 (Linear systems of equations, infinite number of solutions, m < n) Let x be 2 × 1, and consider the linear system

Ax = c with A = [1 2] and c = 5.

Here rank(A) = 1 and rank([A c]) = 1. Any value of x₁ on the line x₁ = 5 − 2x₂ is a solution.

Example 11 (Pseudo inverses again) In the previous example, there is an infinite number of solutions along the line x₁ = 5 − 2x₂. Which one has the smallest norm ‖x̂‖₂ = [(5 − 2x₂)² + x₂²]^{1/2}? The first order condition gives x₂ = 2, and therefore x₁ = 1. This is the same value as given by x̂ = A⁺c, since A⁺ = [0.2, 0.4] in this case.

Fact 12 (Rank and computers) Numerical calculations of the determinant are poor indicators of whether a matrix is singular or not. For instance, det(0.1 × I₂₀) = 10⁻²⁰. Use the condition number instead (see Fact 51).
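A quick numerical companion to Fact 7 and Example 11 above (Python with numpy, an assumption of mine): the least squares solution 3.4 and the minimum-norm solution delivered by the Moore-Penrose inverse.

    import numpy as np

    # Sketch (assumes numpy): the least squares solution of Fact 7 and the
    # minimum-norm solution of Fact 8 / Example 11.
    A = np.array([[1.0], [2.0]])
    c = np.array([3.0, 7.0])
    x_ls = np.linalg.inv(A.T @ A) @ A.T @ c
    print(x_ls)                       # [3.4], as in Fact 7

    A2 = np.array([[1.0, 2.0]])
    c2 = np.array([5.0])
    print(np.linalg.pinv(A2))         # [[0.2], [0.4]], the Moore-Penrose inverse
    print(np.linalg.pinv(A2) @ c2)    # [1. 2.], the smallest-norm solution (Example 11)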
Fact 13 (Some properties of inverses) If A, B, and C are invertible, then (ABC)⁻¹ = C⁻¹B⁻¹A⁻¹; (A⁻¹)′ = (A′)⁻¹; if A is symmetric, then A⁻¹ is symmetric; (Aⁿ)⁻¹ = (A⁻¹)ⁿ.

Fact 14 (Changing sign of column and inverting) Suppose the square matrix A₂ is the same as A₁ except that the i-th and j-th columns have the reverse signs. Then A₂⁻¹ is the same as A₁⁻¹ except that the i-th and j-th rows have the reverse sign.

22.4 Complex matrices

Fact 15 (Modulus of complex number) If λ = a + bi, where i = √(−1), then |λ| = |a + bi| = √(a² + b²).

Fact 16 (Complex matrices) Let A^H denote the transpose of the complex conjugate of A, so that if

A = [1  2 + 3i] then A^H = [1; 2 − 3i].

A square matrix A is unitary (similar to orthogonal) if A^H = A⁻¹, for instance,

A = [(1 + i)/2, (1 + i)/2; (1 − i)/2, (−1 + i)/2] gives A^H = A⁻¹ = [(1 − i)/2, (1 + i)/2; (1 − i)/2, (−1 − i)/2],

and it is Hermitian (similar to symmetric) if A = A^H, for instance

A = [1/2, (1 + i)/2; (1 − i)/2, −1/2].

A Hermitian matrix has real elements along the principal diagonal and A_ji is the complex conjugate of A_ij. Moreover, the quadratic form x^H Ax is always a real number.

22.5 Eigenvalues and Eigenvectors

Fact 17 (Homogeneous linear system). Consider the linear system in Fact 4 with c = 0: A_{m×n}x_{n×1} = 0_{m×1}. Then rank(A) = rank([A c]), so it has a unique solution if and only if rank(A) = n, and an infinite number of solutions if and only if rank(A) < n. Note that x = 0 is always a solution, and it is the unique solution if rank(A) = n. We can thus only get a nontrivial solution (not all elements are zero) if rank(A) < n.

Fact 18 (Eigenvalues) The n eigenvalues, λ_i, i = 1, ..., n, and associated eigenvectors, z_i, of the n × n matrix A satisfy

(A − λ_i I) z_i = 0_{n×1}.

We require the eigenvectors to be non-trivial (not all elements are zero). From Fact 17, an eigenvalue must therefore satisfy

det(A − λ_i I) = 0.

Fact 19 (Right and left eigenvectors) A "right eigenvector" z (the most common) satisfies Az = λz, and a "left eigenvector" v (seldom used) satisfies v′A = λv′, that is, A′v = λv.

Fact 20 (Rank and eigenvalues) For any m × n matrix A, rank(A) = rank(A′) = rank(A′A) = rank(AA′) and equals the number of non-zero eigenvalues of A′A or AA′.

Example 21 Let x be an n × 1 vector, so rank(x) = 1. We then have that the outer product, xx′, also has rank 1.

Fact 22 (Determinant and eigenvalues) For any n × n matrix A, det(A) = Π_{i=1}^n λ_i.

22.6 Special Forms of Matrices

22.6.1 Triangular Matrices

Fact 23 (Triangular matrix) A lower (upper) triangular matrix has zero elements above (below) the main diagonal.

Fact 24 (Eigenvalues of triangular matrix) For a triangular matrix A, the eigenvalues equal the diagonal elements of A. This follows from

det(A − λI) = (A₁₁ − λ)(A₂₂ − λ) ⋯ (A_nn − λ).

Fact 25 (Squares of triangular matrices) If T is lower (upper) triangular, then TT is as well.
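Facts 20, 22 and 24 and Example 21 are easy to confirm numerically. A sketch (Python with numpy; the matrices are arbitrary examples of mine):

    import numpy as np

    # Sketch (assumes numpy): eigenvalue facts 20, 22 and 24.
    A = np.array([[2.0, 1.0], [0.5, 3.0]])
    lam = np.linalg.eigvals(A)
    print(np.prod(lam), np.linalg.det(A))   # Fact 22: det(A) = product of eigenvalues

    T = np.array([[2.0, 5.0], [0.0, 3.0]])  # upper triangular
    print(np.linalg.eigvals(T))             # Fact 24: eigenvalues are the diagonal, 2 and 3

    x = np.array([[1.0], [2.0], [3.0]])
    print(np.linalg.matrix_rank(x @ x.T))   # Example 21: an outer product has rank 1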
22.6.2 Orthogonal Vectors and Matrices

Fact 26 (Orthogonal vector) The n × 1 vectors x and y are orthogonal if x′y = 0.

Fact 27 (Orthogonal matrix) The n × n matrix A is orthogonal if A′A = I. Properties: If A is orthogonal, then det(A) = ±1; if A and B are orthogonal, then AB is orthogonal.

Example 28 (Rotation of vectors ("Givens rotations").) Consider the matrix G = I_n except that G_ii = c, G_ik = s, G_ki = −s, and G_kk = c. If we let c = cos θ and s = sin θ for some angle θ, then G′G = I. To see this, consider the simple example where i = 2 and k = 3:

[1, 0, 0; 0, c, s; 0, −s, c]′ [1, 0, 0; 0, c, s; 0, −s, c] = [1, 0, 0; 0, c² + s², 0; 0, 0, c² + s²],

which is an identity matrix since cos²θ + sin²θ = 1. G is thus an orthogonal matrix. It is often used to "rotate" an n × 1 vector ε as in u = G′ε, where we get

u_t = ε_t for t ≠ i, k,
u_i = ε_i c − ε_k s,
u_k = ε_i s + ε_k c.

The effect of this transformation is to rotate the i-th and k-th vectors counterclockwise through an angle of θ.

22.6.3 Positive Definite Matrices

Fact 29 (Positive definite matrix) The n × n matrix A is positive definite if for any non-zero n × 1 vector x, x′Ax > 0. (It is positive semidefinite if x′Ax ≥ 0.)

Fact 30 (Some properties of positive definite matrices) If A is positive definite, then all eigenvalues are positive and real. (To see why, note that an eigenvalue satisfies Ax = λx. Premultiply by x′ to get x′Ax = λx′x. Since both x′Ax and x′x are positive real numbers, λ must also be.)

Fact 31 (More properties of positive definite matrices) If B is a nonsingular n × n matrix and A is positive definite, then BAB′ is also positive definite.

Fact 32 (More properties of positive definite matrices) det(A) > 0; if A is pd, then A⁻¹ is too; if A is m × n with m ≥ n, then A′A is pd.

Fact 33 (Cholesky decomposition) See Fact 41.

22.6.4 Symmetric Matrices

Fact 34 (Symmetric matrix) A is symmetric if A = A′.

Fact 35 (Properties of symmetric matrices) If A is symmetric, then all eigenvalues are real, and eigenvectors corresponding to distinct eigenvalues are orthogonal.

Fact 36 If A is symmetric, then A⁻¹ is symmetric.

22.6.5 Idempotent Matrices

Fact 37 (Idempotent matrix) A is idempotent if A = AA. If A is also symmetric, then A = A′A.

22.7 Matrix Decompositions

Fact 38 (Diagonal decomposition) An n × n matrix A is diagonalizable if there exists a matrix C such that C⁻¹AC = Λ is diagonal. We can thus write A = CΛC⁻¹. The n × n matrix A is diagonalizable if and only if it has n linearly independent eigenvectors. We can then take C to be the matrix of the eigenvectors (in columns), and Λ the diagonal matrix with the corresponding eigenvalues along the diagonal.

Fact 39 (Spectral decomposition.) If the eigenvectors are linearly independent, then we can decompose A as

A = ZΛZ⁻¹, where Λ = diag(λ₁, ..., λ_n) and Z = [z₁ z₂ ⋯ z_n],

where Λ is a diagonal matrix with the eigenvalues along the principal diagonal, and Z is a matrix with the corresponding eigenvectors in the columns.
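The Givens rotation of Example 28 can be written out directly. A sketch (Python with numpy; the angle and the vector are arbitrary choices of mine):

    import numpy as np

    # Sketch (assumes numpy): the Givens rotation of Example 28 with i = 2, k = 3
    # (0-based indices 1 and 2).
    theta = 0.3
    c, s = np.cos(theta), np.sin(theta)
    G = np.eye(3)
    G[1, 1], G[1, 2], G[2, 1], G[2, 2] = c, s, -s, c

    print(np.round(G.T @ G, 12))   # identity: G is orthogonal (Fact 27)

    eps = np.array([1.0, 2.0, 3.0])
    u = G.T @ eps
    print(u)                       # u[0] unchanged; u[1] = c*e2 - s*e3, u[2] = s*e2 + c*e3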
Fact 40 (Diagonal decomposition of symmetric matrices) If A is symmetric (and possibly singular) then the eigenvectors are orthogonal, C′C = I, so C⁻¹ = C′. In this case, we can diagonalize A as C′AC = Λ, or A = CΛC′. If A is n × n but has rank r ≤ n, then we can write

A = [C₁ C₂] [Λ₁, 0; 0, 0] [C₁ C₂]′ = C₁Λ₁C₁′,

where the n × r matrix C₁ contains the r eigenvectors associated with the r non-zero eigenvalues in the r × r matrix Λ₁.

Fact 41 (Cholesky decomposition) Let Ω be an n × n symmetric positive definite matrix. The Cholesky decomposition gives the unique lower triangular P such that Ω = PP′ (some software returns an upper triangular matrix, that is, Q in Ω = Q′Q instead). Note that each column of P is only identified up to a sign transformation; they can be reversed at will.

Fact 42 (Triangular Decomposition) Let Ω be an n × n symmetric positive definite matrix. There is a unique decomposition Ω = ADA′, where A is lower triangular with ones along the principal diagonal, and D is diagonal with positive diagonal elements. This decomposition is usually not included in econometric software, but it can easily be calculated from the commonly available Cholesky decomposition, since P in the Cholesky decomposition is of the form

P = [√D₁₁, 0, ⋯, 0; √D₁₁ A₂₁, √D₂₂, ⋯, 0; ⋮, ⋮, ⋱, ⋮; √D₁₁ A_n1, √D₂₂ A_n2, ⋯, √D_nn].

Fact 43 (Schur decomposition) The decomposition of the n × n matrix A gives the n × n matrices T and Z such that

A = ZTZ^H,

where Z is a unitary n × n matrix and T is an n × n upper triangular Schur form with the eigenvalues along the diagonal. Note that premultiplying by Z⁻¹ = Z^H and postmultiplying by Z gives

T = Z^H AZ,

which is upper triangular. The ordering of the eigenvalues in T can be reshuffled, although this requires that Z is reshuffled conformably to keep A = ZTZ^H, which involves a bit of tricky "book keeping."

Fact 44 (Generalized Schur Decomposition) The decomposition of the n × n matrices G and D gives the n × n matrices Q, S, T, and Z such that Q and Z are unitary and S and T upper triangular. They satisfy

G = QSZ^H and D = QTZ^H.

The generalized Schur decomposition solves the generalized eigenvalue problem Dx = λGx, where λ are the generalized eigenvalues (which will equal the diagonal elements in T divided by the corresponding diagonal elements in S). Note that we can write

Q^H GZ = S and Q^H DZ = T.

Example 45 If G = I in the generalized eigenvalue problem Dx = λGx, then we are back to the standard eigenvalue problem. Clearly, we can pick S = I and Q = Z in this case, so G = I and D = ZTZ^H, as in the standard Schur decomposition.

Fact 46 (QR decomposition) Let A be m × n with m ≥ n. The QR decomposition is

A_{m×n} = Q_{m×m} R_{m×n} = [Q₁ Q₂] [R₁; 0] = Q₁R₁,

where Q is orthogonal (Q′Q = I) and R upper triangular. The last line is the "thin QR decomposition," where Q₁ is an m × n orthogonal matrix and R₁ an n × n upper triangular matrix.

Fact 47 (Inverting by using the QR decomposition) Solving Ax = c by inversion of A can be very numerically inaccurate (no kidding, this is a real problem). Instead, the problem can be solved with the QR decomposition. First, calculate Q₁ and R₁ such that A = Q₁R₁. Note that we can write the system of equations as

Q₁R₁x = c.

Premultiply by Q₁′ to get (since Q₁′Q₁ = I)

R₁x = Q₁′c.

This is a triangular system which can be solved very easily by back substitution (solve the last equation first, then use the solution in the second to last, and so forth).
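The QR-based solution strategy of Fact 47 looks as follows in code. A minimal sketch (Python with numpy; the random test matrix and the "true" coefficient vector are my own constructions):

    import numpy as np

    # Sketch (assumes numpy): solve Ax = c via the thin QR decomposition (Fact 47).
    rng = np.random.default_rng(0)
    A = rng.standard_normal((5, 3))
    x_true = np.array([1.0, -2.0, 0.5])
    c = A @ x_true

    Q1, R1 = np.linalg.qr(A)             # thin QR: Q1 is 5x3, R1 is 3x3
    x = np.linalg.solve(R1, Q1.T @ c)    # solve the triangular system R1 x = Q1'c
    print(x)                             # recovers [1, -2, 0.5]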
Fact 48 (Singular value decomposition) Let A be an m × n matrix of rank ρ. The singular value decomposition is

A = U_{m×m} S_{m×n} V′_{n×n},

where U and V are orthogonal and S is diagonal with the first ρ elements being non-zero, that is,

S = [S₁, 0; 0, 0], where S₁ = diag(s₁₁, ..., s_ρρ).

Fact 49 (Singular values and eigenvalues) The singular values of A are the nonnegative square roots of the eigenvalues of AA^H if m ≤ n and of A^H A if m ≥ n.

Remark 50 If the square matrix A is symmetric and idempotent (A = A′A), then the singular values are the same as the eigenvalues. From Fact 40 we know that a symmetric A can be decomposed as A = CΛC′. It follows that this is the same as the singular value decomposition.

Fact 51 (Condition number) The condition number of a matrix is the ratio of the largest (in magnitude) of the singular values to the smallest,

c = |s_ii|_max / |s_ii|_min.

For a square matrix, we can calculate the condition number from the eigenvalues of AA^H or A^H A (see Fact 49). In particular, for a square matrix we have

c = (√λ_i)_max / (√λ_i)_min,

where λ_i are the eigenvalues of AA^H.

Fact 52 (Condition number and computers) The determinant is not a good indicator of the reliability of numerical inversion algorithms. Instead, let c be the condition number of a square matrix. If 1/c is close to a computer's floating-point precision (10⁻¹³ or so), then numerical routines for a matrix inverse become unreliable. For instance, while det(0.1 × I₂₀) = 10⁻²⁰, the condition number of 0.1 × I₂₀ is unity and the matrix is indeed easy to invert to get 10 × I₂₀.

Fact 53 (Inverting by using the SVD decomposition) The inverse of the square matrix A is found by noting that if A is square, then from Fact 48 we have

AA⁻¹ = I or
USV′A⁻¹ = I, so
A⁻¹ = VS⁻¹U′,

provided S is invertible (otherwise A will not be). Since S is diagonal, S⁻¹ is also diagonal with the inverses of the diagonal elements in S, so it is very easy to compute.

Fact 54 (Pseudo inverse or generalized inverse) The Moore-Penrose pseudo (generalized) inverse of an m × n matrix A is defined as

A⁺ = VS⁺U′, where S⁺_{n×m} = [S₁₁⁻¹, 0; 0, 0],

where V and U are from Fact 48. The submatrix S₁₁⁻¹ contains the reciprocals of the non-zero singular values along the principal diagonal. A⁺ satisfies the Moore-Penrose conditions

AA⁺A = A, A⁺AA⁺ = A⁺, (AA⁺)′ = AA⁺, and (A⁺A)′ = A⁺A.

See Fact 8 for the idea behind the generalized inverse.

Fact 55 (Some properties of generalized inverses) If A has full rank, then A⁺ = A⁻¹; (BC)⁺ = C⁺B⁺; if B and C are invertible, then (BAC)⁺ = C⁻¹A⁺B⁻¹; (A⁺)′ = (A′)⁺; if A is symmetric, then A⁺ is symmetric.

Fact 56 (Pseudo inverse of symmetric matrix) If A is symmetric, then the SVD is identical to the spectral decomposition A = ZΛZ′, where Z contains the orthogonal eigenvectors (Z′Z = I) and Λ is a diagonal matrix with the eigenvalues along the main diagonal. By Fact 54 we then have

A⁺ = ZΛ⁺Z′, where Λ⁺ = [Λ₁₁⁻¹, 0; 0, 0],

with the reciprocals of the non-zero eigenvalues along the principal diagonal of Λ₁₁⁻¹.
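The SVD-related facts above (48, 51 and 54) can be checked in a few lines. A sketch (Python with numpy; the matrix is an arbitrary example of mine):

    import numpy as np

    # Sketch (assumes numpy): SVD, condition number and the Moore-Penrose
    # conditions of Facts 48, 51 and 54.
    A = np.array([[1.0, 2.0], [2.0, 4.0], [0.0, 1.0]])   # m = 3, n = 2, rank 2
    U, s, Vt = np.linalg.svd(A)
    print(s)                                     # singular values, largest first
    print(s.max() / s.min(), np.linalg.cond(A))  # Fact 51: the condition number

    Ap = np.linalg.pinv(A)
    print(np.allclose(A @ Ap @ A, A),            # the four Moore-Penrose conditions
          np.allclose(Ap @ A @ Ap, Ap),
          np.allclose((A @ Ap).T, A @ Ap),
          np.allclose((Ap @ A).T, Ap @ A))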
22.8 Matrix Calculus

Fact 57 (Matrix differentiation of non-linear functions, ∂y/∂x′) Let the vector y_{n×1} be a function of the vector x_{m×1}

[y₁; ...; y_n] = f(x) = [f₁(x); ...; f_n(x)].

Then, let ∂y/∂x′ be the n × m matrix

∂y/∂x′ = [∂f₁(x)/∂x′; ...; ∂f_n(x)/∂x′] = [∂f₁(x)/∂x₁, ⋯, ∂f₁(x)/∂x_m; ⋮, ⋱, ⋮; ∂f_n(x)/∂x₁, ⋯, ∂f_n(x)/∂x_m].

This matrix is often called the Jacobian of the f functions. (Note that the notation implies that the derivatives of the first element in y, denoted y₁, with respect to each of the elements in x′ are found in the first row of ∂y/∂x′. A rule to help memorizing the format of ∂y/∂x′: y is a column vector and x′ is a row vector.)

Fact 58 (∂y′/∂x instead of ∂y/∂x′) With the notation in the previous Fact, we get

∂y′/∂x = [∂f₁(x)/∂x, ⋯, ∂f_n(x)/∂x] = [∂f₁(x)/∂x₁, ⋯, ∂f_n(x)/∂x₁; ⋮, ⋱, ⋮; ∂f₁(x)/∂x_m, ⋯, ∂f_n(x)/∂x_m] = (∂y/∂x′)′.

Fact 59 (Matrix differentiation of linear systems) When y_{n×1} = A_{n×m}x_{m×1}, then f(x) is a linear function

[y₁; ...; y_n] = [a₁₁, ⋯, a₁m; ⋮, ⋮; a_n1, ⋯, a_nm] [x₁; ...; x_m].

In this case ∂y/∂x′ = A and ∂y′/∂x = A′.

Fact 60 (Matrix differentiation of inner product) The inner product of two column vectors, y = z′x, is a special case of a linear system with A = z′. In this case we get ∂(z′x)/∂x′ = z′ and ∂(z′x)/∂x = z. Clearly, the derivatives of x′z are the same (a transpose of a scalar).

Example 61 (∂(z′x)/∂x = z when x and z are 2 × 1 vectors)

∂/∂x ( [z₁ z₂][x₁; x₂] ) = [z₁; z₂].

Fact 62 (First order Taylor series) For each element f_i(x) in the n × 1 vector f(x), we can apply the mean-value theorem

f_i(x) = f_i(c) + (∂f_i(b_i)/∂x′)(x − c),

for some vector b_i between c and x. Stacking these expressions gives

[f₁(x); ...; f_n(x)] = [f₁(c); ...; f_n(c)] + [∂f₁(b₁)/∂x₁, ⋯, ∂f₁(b₁)/∂x_m; ⋮, ⋱, ⋮; ∂f_n(b_n)/∂x₁, ⋯, ∂f_n(b_n)/∂x_m] (x − c), or

f(x) = f(c) + (∂f(b)/∂x′)(x − c),

where the notation f(b) is a bit sloppy. It should be interpreted as meaning that we have to evaluate the derivatives at different points for the different elements in f(x).

Fact 63 (Matrix differentiation of quadratic forms) Let x_{m×1} be a vector, A_{n×n} a matrix, and f(x)_{n×1} a vector of functions. Then,

∂[f(x)′Af(x)]/∂x = (∂f(x)/∂x′)′ (A + A′) f(x)
                 = 2 (∂f(x)/∂x′)′ A f(x) if A is symmetric.

If f(x) = x, then ∂f(x)/∂x′ = I, so ∂(x′Ax)/∂x = 2Ax if A is symmetric.
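Fact 63 can be verified with finite differences. A sketch (Python with numpy; the matrix and the point of evaluation are arbitrary choices of mine):

    import numpy as np

    # Sketch (assumes numpy): finite-difference check of Fact 63,
    # d(x'Ax)/dx = (A + A')x, which is 2Ax when A is symmetric.
    A = np.array([[2.0, 1.0], [0.0, 3.0]])
    x = np.array([0.5, -1.0])
    f = lambda v: v @ A @ v

    h = 1e-6
    grad_num = np.array([(f(x + h * e) - f(x - h * e)) / (2 * h) for e in np.eye(2)])
    print(grad_num)                 # numerical gradient
    print((A + A.T) @ x)            # analytical gradient, the same up to rounding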
Example 64 (∂(x′Ax)/∂x = 2Ax when x is 2 × 1 and A is 2 × 2)

∂/∂x ( [x₁ x₂][A₁₁, A₁₂; A₂₁, A₂₂][x₁; x₂] ) = ( [A₁₁, A₁₂; A₂₁, A₂₂] + [A₁₁, A₂₁; A₁₂, A₂₂] ) [x₁; x₂]
= 2 [A₁₁, A₁₂; A₁₂, A₂₂] [x₁; x₂] if A₂₁ = A₁₂.

Example 65 (Least squares) Consider the linear model Y_{m×1} = X_{m×n}β_{n×1} + u_{m×1}. We want to minimize the sum of squared fitted errors by choosing the n × 1 vector β. The fitted errors depend on the chosen β: u(β) = Y − Xβ, so the quadratic loss function is

L = u(β)′u(β) = (Y − Xβ)′(Y − Xβ).

In this case, f(β) = u(β) = Y − Xβ, so ∂f(β)/∂β′ = −X. The first order condition for u′u is thus

−2X′(Y − Xβ̂) = 0_{n×1} or X′Y = X′Xβ̂,

which can be solved as

β̂ = (X′X)⁻¹X′Y.

22.9 Miscellaneous

Fact 66 (Some properties of transposes) (A + B)′ = A′ + B′; (ABC)′ = C′B′A′ (if conformable).

Fact 67 (Kronecker product) If A and B are matrices, then

A ⊗ B = [a₁₁B, ⋯, a₁nB; ⋮, ⋮; a_m1B, ⋯, a_mnB].

Properties: (A ⊗ B)⁻¹ = A⁻¹ ⊗ B⁻¹ (if conformable); (A ⊗ B)(C ⊗ D) = AC ⊗ BD (if conformable); (A ⊗ B)′ = A′ ⊗ B′; if a is m × 1 and b is n × 1, then a ⊗ b = (a ⊗ I_n)b.

Fact 68 (Cyclical permutation of trace) Trace(ABC) = Trace(BCA) = Trace(CAB), if the dimensions allow the products.

Fact 69 (The vec operator). vec A, where A is m × n, gives an mn × 1 vector with the columns in A stacked on top of each other. For instance, vec [a₁₁, a₁₂; a₂₁, a₂₂] = [a₁₁; a₂₁; a₁₂; a₂₂]. Properties: vec(A + B) = vec A + vec B; vec(ABC) = (C′ ⊗ A) vec B; if a and b are column vectors, then vec(ab′) = b ⊗ a.

Fact 70 (The vech operator) vech A, where A is m × m, gives an m(m + 1)/2 × 1 vector with the elements on and below the principal diagonal of A stacked on top of each other (columnwise). For instance, vech [a₁₁, a₁₂; a₂₁, a₂₂] = [a₁₁; a₂₁; a₂₂], that is, like vec, but using only the elements on and below the principal diagonal.

Fact 71 (Duplication matrix) The duplication matrix D_m is defined such that for any symmetric m × m matrix A we have vec A = D_m vech A. The duplication matrix is therefore useful for "inverting" the vech operator (the step from vec A to A is trivial). For instance, to continue the example of the vech operator,

[1, 0, 0; 0, 1, 0; 0, 1, 0; 0, 0, 1] [a₁₁; a₂₁; a₂₂] = [a₁₁; a₂₁; a₂₁; a₂₂] or D₂ vech A = vec A.
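The vec, vech, duplication-matrix and Kronecker identities in Facts 67 and 69-71 are easy to confirm numerically. A sketch (Python with numpy; the matrices are arbitrary examples of mine):

    import numpy as np

    # Sketch (assumes numpy): vec, vech, the duplication matrix D2 and the
    # vec(ABC) = (C' kron A) vec(B) rule from Facts 67 and 69-71.
    A = np.array([[1.0, 2.0], [2.0, 5.0]])            # symmetric 2x2
    vec_A = A.flatten(order="F")                      # stack the columns
    vech_A = np.array([A[0, 0], A[1, 0], A[1, 1]])

    D2 = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [0, 1, 0],
                   [0, 0, 1]], dtype=float)
    print(np.allclose(D2 @ vech_A, vec_A))            # Fact 71: D2 vech(A) = vec(A)

    B = np.array([[1.0, 0.5], [0.0, 2.0]])
    C = np.array([[3.0, 1.0], [1.0, 4.0]])
    lhs = (A @ B @ C).flatten(order="F")
    rhs = np.kron(C.T, A) @ B.flatten(order="F")
    print(np.allclose(lhs, rhs))                      # Fact 69: vec(ABC) = (C' kron A) vec B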
Fact 72 (Sums of outer products) Let y_t be m × 1, and x_t be k × 1. Suppose we have T such vectors. The sum of the outer products (an m × k matrix) is

S = Σ_{t=1}^T y_t x_t′.

Create matrices Y_{T×m} and X_{T×k} by letting y_t′ and x_t′ be the t-th rows

Y_{T×m} = [y₁′; ...; y_T′] and X_{T×k} = [x₁′; ...; x_T′].

We can then calculate the same sum of outer products, S, as

S = Y′X.

(To see this, let Y(t, :) be the t-th row of Y, and similarly for X, so

Y′X = Σ_{t=1}^T Y(t, :)′X(t, :),

which is precisely Σ_{t=1}^T y_t x_t′.)

Fact 73 (Matrix geometric series) Suppose the eigenvalues of the square matrix A are all less than one in modulus. Then,

I + A + A² + ⋯ = (I − A)⁻¹.

To see why this makes sense, consider (I − A) Σ_{t=0}^T A^t (with the convention that A⁰ = I). It can be written as

(I − A) Σ_{t=0}^T A^t = (I + A + A² + ⋯ + A^T) − A(I + A + A² + ⋯ + A^T) = I − A^{T+1}.

If all the eigenvalues are stable, then lim_{T→∞} A^{T+1} = 0, so taking the limit of the previous equation gives

(I − A) lim_{T→∞} Σ_{t=0}^T A^t = I.

Fact 74 (Matrix exponential) The matrix exponential of an n × n matrix A is defined as

exp(At) = Σ_{s=0}^∞ (At)^s / s!.

Bibliography

Anton, H., 1987, Elementary Linear Algebra, John Wiley and Sons, New York, 5th edn.

Björk, Å., 1996, Numerical Methods for Least Squares Problems, SIAM, Philadelphia.

Golub, G. H., and C. F. van Loan, 1989, Matrix Computations, The John Hopkins University Press, 2nd edn.

Greenberg, M. D., 1988, Advanced Engineering Mathematics, Prentice Hall, Englewood Cliffs, New Jersey.

Greene, W. H., 2000, Econometric Analysis, Prentice-Hall, Upper Saddle River, New Jersey, 4th edn.