Ordinary Least Squares


Rómulo A. Chumacero

February 2005

Introduction

Ragnar Frisch (one of the founders of the Econometrics Society) is credited with

coining the term econometrics. Econometrics aims at giving empirical content to

economic relationships by uniting three key ingredients: economic theory, economic

data, and statistical methods. Neither theory without measurement, nor measurement without theory are sucient for explaining economic phenomena. It is their

union that is the key to understand economic relationships.

Social scientists generally must accept the conditions under which their subjects

act and the responses occur. As economic data come almost exclusively from nonexperimental sources, researchers cannot specify or choose the level of a stimulus and

record the outcome. They can just observe the natural experiments that take place.

In this sense economics, like meteorology, is an observational science.

For example, many economists have studied the influence of monetary policy on

macroeconomic conditions, yet the effects of actions by central banks continue to be

widely debated. Some of the controversies would be removed if a central bank could

experiment with monetary policy over repeated trials under identical conditions, thus

being able to isolate the effects of policy more accurately. However, no one can turn

back the clock to try various policies under essentially the same conditions. Each

time a central bank contemplates an action, it faces a new set of conditions. The

actors and technologies have all changed. The social, economic, and political orders

are different. To learn about one aspect of the economic world, one must take into account many others. To apply past experience effectively, one must take into account similarities and differences between the past, present, and future.

Here we will review some of the finite sample properties of the most basic and popular estimation procedure in econometrics (Ordinary Least Squares, OLS for short).

The document is organized as follows: Section 2 describes the general framework for

regression analysis. Section 3 derives the OLS estimator and discusses its properties. Section 4 considers estimation subject to linear constraints, and Section 5 develops tests for linear constraints. Finally, Section 6 discusses the issue of prediction within the OLS context.

Department of Economics of the University of Chile and Research Department of the Central Bank of Chile. E-mail address: rchumace@econ.uchile.cl

Preliminaries

Assume we observe a sample $\{w_t\}_{t=1}^T$, where $w_t$ is a $(k+1)$-dimensional vector of data. Partition $w_t = (y_t, x_t)$, where $y_t \in \mathbb{R}$ and $x_t \in \mathbb{R}^k$. Let the joint density of the variables be given by $f(y_t, x_t, \theta)$, where $\theta$ is a vector of unknown parameters.

In econometrics we are often interested in the conditional distribution of one set of random variables given another set of random variables (e.g., the conditional distribution of consumption given income, or the conditional distribution of wages given individual characteristics). Recalling that the joint density can be written as the product of the conditional density and the marginal density, we have:

$$f(y_t, x_t, \theta) = f(y_t | x_t, \theta_1) f(x_t, \theta_2),$$

where $f(x_t, \theta_2) = \int f(y_t, x_t, \theta)\, dy$ is the marginal density of $x$.

Regression analysis can be defined as statistical inference on $\theta_1$. For this purpose we can ignore $f(x_t, \theta_2)$, provided there is no relationship between $\theta_1$ and $\theta_2$. In this framework, $y$ is called the dependent or endogenous variable and the vector $x$ is called the vector of independent or exogenous variables.

In regression analysis we usually want to estimate only the first and second moments of the conditional distribution, rather than the whole parameter vector $\theta_1$ (in certain cases the first two moments characterize $\theta_1$ completely). Thus we can define the conditional mean $m(x_t, \theta_3)$ and conditional variance $g(x_t, \theta_4)$ as

$$m(x_t, \theta_3) = E(y_t | x_t, \theta_3) = \int y f(y | x_t, \theta_1)\, dy,$$

$$g(x_t, \theta_4) = \int y^2 f(y | x_t, \theta_1)\, dy - \left[m(x_t, \theta_3)\right]^2.$$

The conditional mean and variance are random variables, as they are functions of the random vector $x_t$. If we define $u_t$ as the difference between $y_t$ and its conditional mean,

$$u_t = y_t - m(x_t, \theta_3),$$

we obtain:

$$y_t = m(x_t, \theta_3) + u_t. \qquad (1)$$

Other than $(y_t, x_t)$ having a joint density, no assumptions have been made to develop (1).


Proposition 1 Properties of $u_t$:

1. $E(u_t | x_t) = 0$,
2. $E(u_t) = 0$,
3. $E[h(x_t) u_t] = 0$ for any function $h(\cdot)$,
4. $E(x_t u_t) = 0$.

Proof. 1. By the definition of $u_t$ and the linearity of conditional expectations,

$$E(u_t | x_t) = E[y_t - m(x_t) | x_t] = E[y_t | x_t] - E[m(x_t) | x_t] = m(x_t) - m(x_t) = 0.$$

2. By the law of iterated expectations (which states that $E[E[y | x, z] | x] = E[y | x]$) and the first result,

$$E(u_t) = E[E(u_t | x_t)] = E(0) = 0.$$

3. By essentially the same argument,

$$E[h(x_t) u_t] = E[E[h(x_t) u_t | x_t]] = E[h(x_t) E[u_t | x_t]] = E[h(x_t) \cdot 0] = 0.$$

4. Follows from the third result, setting $h(x_t) = x_t$.

Equation (1) plus the first result of Proposition 1 are often stated jointly as the regression framework:

$$y_t = m(x_t, \theta_3) + u_t, \qquad E(u_t | x_t) = 0.$$

This is a framework, not a model, because no restrictions have been placed on the

joint distribution of the data. These equations hold true by definition.

Given that the moments $m(\cdot)$ and $g(\cdot)$ can take any shape (usually nonlinear), a regression model imposes further restrictions on the joint distribution and on $u$ (the regression error). If we assume that $m(\cdot)$ is linear we obtain what is known as the linear regression model:

$$m(x_t, \theta_3) = x_t' \beta,$$

where $\beta$ is a $k$-element vector. Finally, let

$$Y = \begin{pmatrix} y_1 \\ \vdots \\ y_T \end{pmatrix}_{T \times 1}, \qquad X = \begin{pmatrix} x_1' \\ \vdots \\ x_T' \end{pmatrix} = \begin{pmatrix} x_{1,1} & \cdots & x_{1,k} \\ \vdots & \ddots & \vdots \\ x_{T,1} & \cdots & x_{T,k} \end{pmatrix}_{T \times k}, \qquad u = \begin{pmatrix} u_1 \\ \vdots \\ u_T \end{pmatrix}_{T \times 1}.$$

Definition 1 The Linear Regression Model (LRM) consists of the following assumptions:

1. $y_t = x_t'\beta + u_t$ or $Y = X\beta + u$,
2. $E(u_t | x_t) = 0$,
3. $\operatorname{rank}(X) = k$ or $\det(X'X) \neq 0$,
4. $E(u_t u_s) = 0$ for all $t \neq s$.

The most important assumption of the model is the linearity of the conditional expectation. Furthermore, this framework considers that $x$ provides no information for forecasting $u$ and that $X$ is of full rank. Finally, it is assumed that $u_t$ is uncorrelated with $u_s$. (Occasionally, we will make the assumption of serial independence of $\{u_t\}$, which is stronger than no correlation, although both concepts are equivalent when $u$ is normal.)

Definition 2 The Homoskedastic Linear Regression Model (HLRM) is the LRM plus

5. $E(u_t^2 | x_t) = \sigma^2$ or $E(uu' | X) = \sigma^2 I_T$.

This model adds the auxiliary assumption that $g(\cdot)$ is conditionally homoskedastic.

Definition 3 The Normal Linear Regression Model (NLRM) is the LRM plus

6. $u_t \sim N(0, \sigma^2)$.

By imposing an additional assumption, this model gains the advantage that exact distributional results are available for the OLS estimators and test statistics. It is not very popular in current econometric practice and, as we will see, is not necessary to derive most of the results that follow.

OLS Estimation

This section defines the OLS estimator of $\beta$ and shows that it is the best linear unbiased estimator. The estimation of the error variance is also discussed.

3.1 The OLS Estimator

Define the sum of squared residuals as

$$S_T(\beta) = (Y - X\beta)'(Y - X\beta) = Y'Y - 2Y'X\beta + \beta'X'X\beta.$$

The OLS estimator ($\hat\beta$) is defined as the value of $\beta$ that minimizes $S_T(\beta)$. The First Order Necessary Conditions (FONC) for minimization are:

$$\left.\frac{\partial S_T(\beta)}{\partial \beta}\right|_{\hat\beta} = -2X'Y + 2X'X\hat\beta = 0,$$

which yield the normal equations $X'Y = X'X\hat\beta$.


Proposition 2 $\arg\min_\beta S_T(\beta) = \hat\beta = (X'X)^{-1}(X'Y)$.

Proof. Using the normal equations we obtain $\hat\beta = (X'X)^{-1}(X'Y)$. To verify that $\hat\beta$ is indeed a minimum we evaluate the Second Order Sufficient Conditions (SOSC):

$$\frac{\partial^2 S_T(\beta)}{\partial \beta \partial \beta'} = 2X'X,$$

which shows that $\hat\beta$ is a minimum, as $X'X$ is a positive definite matrix.

Three important implications are derived from this theorem: First, $\hat\beta$ is a linear function of $Y$. Second, even if $X$ is a nonstochastic matrix, $\hat\beta$ is a random variable, as it depends on $Y$, which is itself a random variable. Finally, in order to obtain the OLS estimator we require $X'X$ to be of full rank.
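As a quick numerical illustration of the normal equations (the data and dimensions below are made-up for the example), the following Python sketch computes $\hat\beta$ both by solving $X'X\hat\beta = X'Y$ directly and with a library least squares routine:

```python
# Minimal sketch: OLS via the normal equations, on simulated data.
import numpy as np

rng = np.random.default_rng(0)
T, k = 200, 3
X = np.column_stack([np.ones(T), rng.normal(size=(T, k - 1))])  # includes a constant
beta_true = np.array([1.0, 0.5, -2.0])
Y = X @ beta_true + rng.normal(size=T)

# Normal equations X'X beta_hat = X'Y (requires X'X to be full rank)
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# The same least squares problem solved by a numerically stabler routine
beta_ls, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(np.allclose(beta_hat, beta_ls), beta_hat)
```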

Given $\hat\beta$, we define

$$\hat u = Y - X\hat\beta, \qquad (2)$$

and call it the least squares residuals. Using $\hat u$, we can estimate $\sigma^2$ by

$$\hat\sigma^2 = T^{-1} \hat u' \hat u.$$

Using (2), we can write

$$Y = X\hat\beta + \hat u = PY + MY,$$

where $P = X(X'X)^{-1}X'$ and $M = I_T - P$. As $\hat u$ is orthogonal to $X$ (that is, $\hat u' X = 0$), OLS can be regarded as decomposing $Y$ into two orthogonal components: a component that can be written as a linear combination of the column vectors of $X$ and a component that is orthogonal to $X$. Alternatively, we can call $PY$ the projection of $Y$ onto the space spanned by the column vectors of $X$ and $MY$ the projection of $Y$ onto the space orthogonal to $X$. These properties are illustrated in Figure 1.

Proposition 3 Let $A$ be an $n \times r$ matrix of rank $r$. A matrix of the form $P = A(A'A)^{-1}A'$ is called a projection matrix and has the following properties:

i) $P = P' = P^2$ (hence $P$ is symmetric and idempotent),
ii) $\operatorname{rank}(P) = r$,
iii) the characteristic roots (eigenvalues) of $P$ consist of $r$ ones and $n - r$ zeros,
iv) if $Z = Ac$ for some vector $c$, then $PZ = Z$ (hence the word projection),
v) $M = I - P$ is also idempotent with rank $n - r$, its eigenvalues consist of $n - r$ ones and $r$ zeros, and if $Z = Ac$, then $MZ = 0$,
vi) $P$ can be written as $G'G$, where $GG' = I$, or as $v_1 v_1' + v_2 v_2' + \ldots + v_r v_r'$, where $v_i$ is a vector and $r = \operatorname{rank}(P)$.

Proof. Left as an exercise. (Appendix A presents this and other exercises.)

Figure 1: Decomposition of $Y$ into the orthogonal components $PY$ (lying in $\operatorname{Col}(X)$) and $MY$.
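The properties listed in Proposition 3 are easy to verify numerically. The sketch below (illustrative; the matrix $X$ is simulated) checks symmetry, idempotency, the eigenvalue structure, and the projection property for $P = X(X'X)^{-1}X'$ and $M = I - P$:

```python
# Numerical check of Proposition 3 on a simulated full-rank matrix.
import numpy as np

rng = np.random.default_rng(1)
T, k = 50, 4
X = rng.normal(size=(T, k))

P = X @ np.linalg.inv(X.T @ X) @ X.T
M = np.eye(T) - P

print(np.allclose(P, P.T), np.allclose(P, P @ P))    # symmetric, idempotent
eig = np.linalg.eigvalsh(P)
print(np.sum(np.isclose(eig, 1.0)))                  # k eigenvalues equal to 1
print(np.isclose(np.trace(M), T - k))                # tr(M) = rank(M) = T - k
Z = X @ rng.normal(size=k)                           # Z lies in Col(X)
print(np.allclose(P @ Z, Z), np.allclose(M @ Z, 0))  # PZ = Z, MZ = 0
```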

3.2 Maximum Likelihood

Now we relate a traditional motivation for the OLS estimator. The NLRM is $y_t = x_t'\beta + u_t$ with $u_t \sim N(0, \sigma^2)$. The density function for a single observation is

$$f(y_t | x_t, \beta, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(y_t - x_t'\beta)^2}{2\sigma^2}},$$

so the log-likelihood of the sample is

$$\ell_T(\beta, \sigma^2; Y | X) = \ln\left[\prod_{t=1}^T f(y_t | x_t, \beta, \sigma^2)\right] = \sum_{t=1}^T \ln f(y_t | x_t, \beta, \sigma^2)$$

$$= -\frac{T}{2}\ln(2\pi) - \frac{T}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}\sum_{t=1}^T (y_t - x_t'\beta)^2$$

$$= -\frac{T}{2}\ln(2\pi) - \frac{T}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}\, S_T(\beta).$$

Proposition 4 In the NLRM, $\hat\beta_{MLE} = \hat\beta_{OLS}$.

Proof. The FONC are

$$\frac{\partial \ell_T(\beta, \sigma^2)}{\partial \beta} = \frac{1}{\hat\sigma^2}\left(X'Y - X'X\hat\beta\right) = 0,$$

$$\frac{\partial \ell_T(\beta, \sigma^2)}{\partial \sigma^2} = -\frac{T}{2\hat\sigma^2} + \frac{(Y - X\hat\beta)'(Y - X\hat\beta)}{2\hat\sigma^4} = 0.$$

The first condition reproduces the normal equations, so $\hat\beta_{MLE} = \hat\beta_{OLS}$; the second yields $\hat\sigma^2 = T^{-1}\hat u'\hat u$.

MLE maximizes $\ell_T(\beta, \sigma^2)$ by minimizing $S_T(\beta)$. Due to this equivalence, the OLS estimator $\hat\beta$ is frequently referred to as the Gaussian MLE, the Gaussian Quasi-MLE, or the Gaussian Pseudo-MLE. (The term quasi (pseudo) is used for misspecified models. In this case, the normality assumption was used to construct the likelihood and the estimator, but may be believed not to be true.)
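Proposition 4 can also be verified numerically. In the sketch below (simulated data; the use of scipy.optimize for the maximization is an implementation choice, not something the text prescribes), maximizing the Gaussian log-likelihood reproduces the OLS coefficients and $\hat\sigma^2 = T^{-1}\hat u'\hat u$:

```python
# Gaussian (quasi-)MLE versus OLS on simulated data.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
T, k = 300, 2
X = np.column_stack([np.ones(T), rng.normal(size=T)])
Y = X @ np.array([0.3, 1.5]) + rng.normal(size=T)

def neg_loglik(params):
    b, log_s2 = params[:k], params[k]
    s2 = np.exp(log_s2)                 # parameterize in logs so sigma^2 > 0
    e = Y - X @ b
    return 0.5 * T * np.log(2 * np.pi * s2) + 0.5 * e @ e / s2

res = minimize(neg_loglik, x0=np.zeros(k + 1), method="BFGS")
beta_ols, *_ = np.linalg.lstsq(X, Y, rcond=None)
u_hat = Y - X @ beta_ols
print(res.x[:k], beta_ols)                  # identical coefficient estimates
print(np.exp(res.x[k]), u_hat @ u_hat / T)  # sigma^2 MLE equals u'u / T
```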

3.3 The Mean and Variance of $\hat\beta$ and $\hat\sigma^2$

Proposition 5 In the LRM, $E[\hat\beta - \beta | X] = 0$ and $E[\hat\beta] = \beta$.

Proof. Write

$$\hat\beta = (X'X)^{-1}X'Y = (X'X)^{-1}X'(X\beta + u) = \beta + (X'X)^{-1}X'u.$$

Then

$$E[\hat\beta - \beta | X] = E\left[(X'X)^{-1}X'u \,\middle|\, X\right] = (X'X)^{-1}X'E(u | X) = 0.$$

Applying the law of iterated expectations, $E[\hat\beta] = E[E[\hat\beta | X]] = \beta$. Thus, $\hat\beta$ is not only unconditionally unbiased but also conditionally unbiased, which is a stronger result.

Proposition 6 In the HLRM, $V[\hat\beta | X] = \sigma^2 (X'X)^{-1}$ and $V[\hat\beta] = \sigma^2 E[(X'X)^{-1}]$.


Proof. Since $\hat\beta - \beta = (X'X)^{-1}X'u$,

$$V[\hat\beta | X] = E\left[(\hat\beta - \beta)(\hat\beta - \beta)' \,\middle|\, X\right] = E\left[(X'X)^{-1}X'uu'X(X'X)^{-1} \,\middle|\, X\right]$$

$$= (X'X)^{-1}X'E[uu' | X]X(X'X)^{-1} = \sigma^2 (X'X)^{-1}.$$

Thus, $V[\hat\beta] = E[V(\hat\beta | X)] + V[E(\hat\beta | X)] = \sigma^2 E[(X'X)^{-1}]$.

This result is derived from the assumptions that $u$ is uncorrelated and homoskedastic. The variance-covariance matrix of $\hat\beta$ measures the precision with which the relationship between $Y$ and $X$ is estimated. Some of its features are: First, and most obvious, the variance of $\hat\beta$ grows proportionally with $\sigma^2$ (the volatility of the unpredictable component). Second, although less obvious, as the sample size increases, the variance-covariance matrix of $\hat\beta$ should decrease (we will provide formal arguments in this regard when we analyze the asymptotic properties of OLS). Finally, it also depends on the volatility of the regressors; as it increases, the precision with which we measure $\beta$ will be enhanced. Thus, we generally prefer a sample of $X$ that is more volatile, given that it would better help us to uncover its association with $Y$.

Proposition 7 In the LRM, $\hat\sigma^2$ is biased.

Proof. We know that $\hat u = MY = Mu$. Then $\hat\sigma^2 = T^{-1}\hat u'\hat u = T^{-1}u'Mu$. This implies that

$$E[\hat\sigma^2 | X] = T^{-1}E[u'Mu | X] = T^{-1}\operatorname{tr} E[u'Mu | X] = T^{-1}E[\operatorname{tr}(u'Mu) | X]$$

$$= T^{-1}E[\operatorname{tr}(Muu') | X] = T^{-1}\sigma^2 \operatorname{tr}(M) = \sigma^2 (T - k) T^{-1}.$$

Applying the law of iterated expectations we obtain $E[\hat\sigma^2] = \sigma^2 (T - k) T^{-1}$.

To derive this result we used the facts that $u'Mu$ is a scalar ($\operatorname{tr}$ denotes the trace of a matrix), that the expectation is a linear operator (thus $\operatorname{tr}$ and $E$ are interchangeable), that $\operatorname{tr}(AB) = \operatorname{tr}(BA)$, and that $M$ is symmetric, in which case $\operatorname{tr}(M) = \sum_{i=1}^T \lambda_i = T - k$, where $\lambda_i$ denotes the $i$-th eigenvalue of $M$ (here we used the results of Proposition 3).

Proposition 7 shows that $\hat\sigma^2$ is a biased estimator of $\sigma^2$. A trivial modification yields an unbiased estimator for $\sigma^2$:

$$\tilde\sigma^2 = (T - k)^{-1}\hat u'\hat u.$$
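A small Monte Carlo experiment makes the bias of Proposition 7 visible. The settings below are illustrative; nothing in the original text fixes them:

```python
# Monte Carlo check: E[sigma_hat^2] = sigma^2 (T - k)/T; sigma_tilde^2 unbiased.
import numpy as np

rng = np.random.default_rng(3)
T, k, sigma2, reps = 30, 5, 4.0, 20000
X = rng.normal(size=(T, k))            # regressors held fixed across replications
XtX_inv = np.linalg.inv(X.T @ X)

s2_hat = np.empty(reps)
for r in range(reps):
    u = rng.normal(scale=np.sqrt(sigma2), size=T)
    Y = X @ np.ones(k) + u
    u_hat = Y - X @ (XtX_inv @ (X.T @ Y))
    s2_hat[r] = u_hat @ u_hat / T

print(s2_hat.mean(), sigma2 * (T - k) / T)       # biased downward, as predicted
print((s2_hat * T / (T - k)).mean(), sigma2)     # rescaling removes the bias
```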

Proposition 8 In the NLRM, $V[\hat\sigma^2] = 2(T - k)\sigma^4 T^{-2}$.

Proof. Left as an exercise.

From the results derived so far, three facts are worth mentioning: First, with the exception of Proposition 8, none of the results derived required the assumption of normality of the error term. Second, while $\hat\sigma^2$ is biased, it coincides with the maximum likelihood estimator of the variance under the assumption of normality of $u$ and, as we will show later, it is consistent. Finally, both the variance-covariance matrix of $\hat\beta$ and $\hat\sigma^2$ depend on $\sigma^2$, which is unknown; thus in practice we use estimators of the variance-covariance matrix of the OLS estimators by replacing $\sigma^2$ with $\hat\sigma^2$ or $\tilde\sigma^2$. For example, the estimator of the variance-covariance matrix of $\hat\beta$ is $\hat V[\hat\beta] = \tilde\sigma^2 (X'X)^{-1}$.

3.4 $\hat\beta$ is BLUE

Definition 4 Let $\hat\beta$ and $\tilde\beta$ be estimators of a vector parameter $\beta$. Let $A$ and $B$ be their respective mean squared error matrices; that is, $A = E(\hat\beta - \beta)(\hat\beta - \beta)'$ and $B = E(\tilde\beta - \beta)(\tilde\beta - \beta)'$. Then $\hat\beta$ is better (or more efficient) than $\tilde\beta$ if $c'(B - A)c \geq 0$ for every vector $c$ and every parameter value, and $c'(B - A)c > 0$ for at least one value of $c$ and at least one value of the parameter. (This definition can also be stated as $B \geq A$ for every parameter value and $B \neq A$ for at least one parameter value, where $B \geq A$ means that $B - A$ is positive semi-definite and $B > A$ means that $B - A$ is positive definite.)

Once we have made precise what we mean by better, we are ready to present one of the most famous theorems in econometrics.

Theorem 1 (Gauss-Markov) The Best Linear Unbiased Estimator (BLUE) is $\hat\beta$.

Proof. Let $A = (X'X)^{-1}X'$, so that $\hat\beta = AY$. Without loss of generality, let $b = (A + C)Y$ be any other linear estimator. Then

$$E(b | X) = (X'X)^{-1}X'X\beta + CX\beta = (I + CX)\beta,$$

so unbiasedness requires $CX = 0$. Furthermore,

$$V(b | X) = E\left[(A + C)uu'(A + C)' \,\middle|\, X\right] = V[\hat\beta | X] + \sigma^2 CC'.$$

Then, $V(b | X) \geq V[\hat\beta | X]$, as $CC'$ is positive semi-definite.

Despite its popularity, the Gauss-Markov theorem is not very powerful. It restricts our quest for alternative candidates to those that are both linear and unbiased estimators. There may be a nonlinear or biased estimator that does better in the metric of Definition 4. Furthermore, OLS ceases to be BLUE when the assumption of homoskedasticity is relaxed. If both homoskedasticity and normality are present, we can rely on a stronger theorem which we will discuss later (the Cramér-Rao lower bound).

3.5 Analysis of Variance and $R^2$

By definition, $Y = \hat Y + \hat u$, so that

$$Y - \bar Y \iota = \hat Y - \bar Y \iota + \hat u,$$

where $\bar Y$ is the sample mean of $y_t$ and $\iota$ is a $T \times 1$ vector of ones. Thus

$$(Y - \bar Y \iota)'(Y - \bar Y \iota) = (\hat Y - \bar Y \iota)'(\hat Y - \bar Y \iota) + 2(\hat Y - \bar Y \iota)'\hat u + \hat u'\hat u,$$

but $\hat Y'\hat u = Y'PMY = 0$ and $\iota'\hat u = 0$ when the model contains an intercept (more generally, if $\iota$ lies in the space spanned by $X$). Thus

$$(Y - \bar Y \iota)'(Y - \bar Y \iota) = (\hat Y - \bar Y \iota)'(\hat Y - \bar Y \iota) + \hat u'\hat u.$$

This is called the analysis of variance formula, often written as

$$TSS = ESS + SSR,$$

where $TSS$, $ESS$, and $SSR$ stand for Total sum of squares, Equation sum of squares, and Sum of squares of the residuals, respectively. The $R^2$ (also known as the centered coefficient of determination) is defined as

$$R^2 = \frac{ESS}{TSS} = 1 - \frac{SSR}{TSS} = 1 - \frac{Y'MY}{Y'LY},$$

where $L = I - \iota(\iota'\iota)^{-1}\iota'$, and $0 \leq R^2 \leq 1$. If the regressors do not include a constant, $R^2$ can be negative because, without the benefit of an intercept, the regression could do worse at tracking the dependent variable than the sample mean.

The $R^2$ measures the percentage of the variance of $Y$ that is accounted for by the variation of the predicted value $\hat Y$. $R^2$ is typically reported in applied work and is frequently referenced as a measure of goodness of fit. This label is inappropriate, as $R^2$ does not measure the adequacy or fit of a model. (More unfortunate is the claim that the $R^2$ measures the percentage of the variance of $y$ that is explained by the model. An econometric model, by itself, doesn't explain anything. Only the combination of a good econometric model and sound economic theory can, in principle, explain a phenomenon.)

It is not even clear that $R^2$ has an unambiguous interpretation in terms of forecast performance. To see this, note that the explanatory power of the models $y_t = \beta x_t + u_t$ and $y_t - x_t = \gamma x_t + u_t$ with $\gamma = \beta - 1$ is the same. The models are mathematically identical and yield the same implications and forecasts. Yet their reported $R^2$ will differ greatly. For illustration, suppose that $\beta \simeq 1$. Then the $R^2$ from the second model will (nearly) equal zero, while the $R^2$ from the first model can be arbitrarily close to one. An econometrician reporting the near-unit $R^2$ from the first model might claim success, while an econometrician reporting the $R^2 \simeq 0$ from the second model might be accused of a poor fit. This difference in reporting is quite unfortunate, since the two models and their implications are mathematically identical. The bottom line is that $R^2$ is not a measure of fit and should not be interpreted as such.
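The following sketch reproduces the example numerically (the data generating process is made-up; $\beta$ is set to one so the effect is stark):

```python
# Same model, two parameterizations, very different R^2.
import numpy as np

rng = np.random.default_rng(4)
T = 500
x = rng.normal(loc=10.0, scale=0.5, size=T)
y = 1.0 * x + 0.1 * rng.normal(size=T)        # beta = 1, small noise

def r2_no_intercept(dep, reg):
    b = (reg @ dep) / (reg @ reg)             # OLS without a constant
    resid = dep - b * reg
    tss = ((dep - dep.mean()) ** 2).sum()     # centered TSS
    return 1.0 - (resid @ resid) / tss

print(r2_no_intercept(y, x))        # close to one
print(r2_no_intercept(y - x, x))    # close to zero (or negative)
```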

Another interesting fact about $R^2$ is that it necessarily increases as regressors are added to the model. As by definition the OLS estimator minimizes the $SSR$, by adding additional regressors the $SSR$ cannot increase; it either stays the same or (more likely) decreases. But the $TSS$ is unaffected by adding regressors, so $R^2$ either stays constant or increases. To counteract this effect, Theil proposed an adjustment, typically called $\bar R^2$ (or adjusted $R^2$), which penalizes model dimensionality and is defined as:

$$\bar R^2 = 1 - \frac{SSR}{TSS}\,\frac{T}{T - k} = 1 - \frac{\tilde\sigma^2}{\hat\sigma_y^2},$$

where $\hat\sigma_y^2 = T^{-1}\,TSS$. While often reported in applied work, this statistic is not used that much today, as better model evaluation criteria have been developed (we will discuss this later).
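A short sketch shows the mechanics (simulated data; the $\bar R^2$ formula used is the one defined above):

```python
# R^2 rises mechanically with junk regressors; R-bar^2 need not.
import numpy as np

rng = np.random.default_rng(5)
T = 100
x = rng.normal(size=T)
y = 1.0 + 2.0 * x + rng.normal(size=T)

def r2_and_adjusted(X):
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    ssr = ((y - X @ b) ** 2).sum()
    tss = ((y - y.mean()) ** 2).sum()
    k = X.shape[1]
    return 1 - ssr / tss, 1 - (ssr / tss) * T / (T - k)

X1 = np.column_stack([np.ones(T), x])
X2 = np.column_stack([X1, rng.normal(size=(T, 5))])   # five irrelevant regressors
print(r2_and_adjusted(X1))
print(r2_and_adjusted(X2))    # higher R^2, but a penalized R-bar^2
```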

3.6 Partitioned Regression

Sometimes we are not interested in the whole parameter vector, but only in a subset of $\beta$. Partition

$$X = \begin{pmatrix} X_1 & X_2 \end{pmatrix}, \qquad \beta = \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix}.$$

Then the normal equations $X'X\hat\beta = X'Y$ can be written as

$$X_1'X_1\hat\beta_1 + X_1'X_2\hat\beta_2 = X_1'Y, \qquad (3a)$$

$$X_2'X_1\hat\beta_1 + X_2'X_2\hat\beta_2 = X_2'Y. \qquad (3b)$$

Solving for $\hat\beta_1$ and $\hat\beta_2$,

$$\hat\beta_1 = (X_1'M_2X_1)^{-1}X_1'M_2Y, \qquad \hat\beta_2 = (X_2'M_1X_2)^{-1}X_2'M_1Y,$$

where $M_i = I - X_i(X_i'X_i)^{-1}X_i'$ for $i = 1, 2$.

These results can also be derived using the following theorem:

Theorem 2 (Frisch-Waugh-Lovell) $\hat\beta_2$ and $\hat u$ can be computed using the following algorithm:

1. Regress $Y$ on $X_1$, obtain residuals $\tilde Y$,
2. Regress $X_2$ on $X_1$, obtain residuals $\tilde X_2$,
3. Regress $\tilde Y$ on $\tilde X_2$, obtain $\hat\beta_2$ and residuals $\hat u$.

Proof. Left as an exercise.

The FWL theorem can simplify computation, but in most cases there is little computational advantage to using it. (A few decades ago, a crucial limitation for conducting OLS estimation was the computational cost of inverting even moderately sized matrices, and the FWL theorem was invoked routinely.) There are, however, two common applications of the FWL theorem, one of which is usually presented in introductory econometrics courses: the demeaning formula for regression; the other deals with ill-conditioned problems.

For the first application, consider the partition where $X_1 = \iota$ is a vector of ones and $X_2$ is the matrix of observed regressors. In this case,

$$\tilde X_2 = M_1 X_2 = X_2 - \iota(\iota'\iota)^{-1}\iota'X_2 = X_2 - \bar X_2$$

and

$$\tilde Y = M_1 Y = Y - \iota(\iota'\iota)^{-1}\iota'Y = Y - \bar Y.$$

The FWL theorem says that $\hat\beta_2$ can be computed by regressing $\tilde Y$ on $\tilde X_2$, or $y_t - \bar Y$ on $x_{2t} - \bar X_2$:

$$\hat\beta_2 = \left[\sum_{t=1}^T (x_{2t} - \bar X_2)(x_{2t} - \bar X_2)'\right]^{-1}\left[\sum_{t=1}^T (x_{2t} - \bar X_2)(y_t - \bar Y)\right].$$

Thus, the OLS estimator for the slope coefficients is a regression with demeaned data.
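The demeaning formula is easy to check numerically (illustrative simulated data):

```python
# FWL check: demeaned regression reproduces the slopes of the full regression.
import numpy as np

rng = np.random.default_rng(6)
T = 200
x2 = rng.normal(size=(T, 2))
y = 0.7 + x2 @ np.array([1.0, -0.5]) + rng.normal(size=T)

X = np.column_stack([np.ones(T), x2])
full, *_ = np.linalg.lstsq(X, y, rcond=None)          # intercept plus slopes

xd = x2 - x2.mean(axis=0)                             # M1 X2
yd = y - y.mean()                                     # M1 Y
slopes = np.linalg.solve(xd.T @ xd, xd.T @ yd)

print(full[1:], slopes)                               # identical slopes
```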

The other application is more useful. In our analysis we assumed that $X$ is full rank ($X'X$ is invertible). Suppose for a moment that $X_1$ is full rank but that $X_2$ is not. In that case $\beta_2$ cannot be estimated, but $\beta_1$ still can be estimated as follows:

$$\hat\beta_1 = (X_1'M_2^*X_1)^{-1}X_1'M_2^*Y,$$

where $M_2^*$ is formed using $X_2^*$, which has columns equal to the maximal number of linearly independent columns of $X_2$.

Constrained Least Squares

In this section we shall consider the estimation of $\beta$ and $\sigma^2$ when there are certain linear constraints on the elements of $\beta$. We shall assume that the constraints are of the form:

$$Q'\beta = c, \qquad (4)$$

where $Q$ is a $k \times q$ matrix of known constants and $c$ is a $q$-vector of known constants. We shall also assume that $q < k$ and $\operatorname{rank}(Q) = q$.

4.1 The CLS Estimator

The Constrained Least Squares (CLS) estimator of $\beta$, denoted by $\bar\beta$, is defined to be the value of $\beta$ that minimizes the $SSR$ subject to the constraint (4). The Lagrange expression for the CLS minimization problem is

$$L(\beta, \lambda) = (Y - X\beta)'(Y - X\beta) + 2\lambda'(Q'\beta - c),$$

where $\lambda$ is a $q$-vector of Lagrange multipliers corresponding to the $q$ constraints. The FONC are

$$\frac{\partial L}{\partial \beta}\bigg|_{\bar\beta, \bar\lambda} = -2X'Y + 2X'X\bar\beta + 2Q\bar\lambda = 0,$$

$$\frac{\partial L}{\partial \lambda}\bigg|_{\bar\beta, \bar\lambda} = Q'\bar\beta - c = 0.$$

Solving these equations yields

$$\bar\beta = \hat\beta - (X'X)^{-1}Q\left[Q'(X'X)^{-1}Q\right]^{-1}\left(Q'\hat\beta - c\right), \qquad (5)$$

and the corresponding estimator of the error variance is

$$\bar\sigma^2 = T^{-1}(Y - X\bar\beta)'(Y - X\bar\beta).$$
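As an illustration of (5) (the constraint below, $\beta_2 + \beta_3 = 1$, and the data are made-up for the example):

```python
# CLS via formula (5) for a single linear constraint Q'beta = c.
import numpy as np

rng = np.random.default_rng(7)
T, k = 150, 3
X = np.column_stack([np.ones(T), rng.normal(size=(T, 2))])
Y = X @ np.array([0.2, 0.6, 0.4]) + rng.normal(size=T)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y

Q = np.array([[0.0], [1.0], [1.0]])     # k x q matrix with q = 1
c = np.array([1.0])

beta_bar = beta_hat - XtX_inv @ Q @ np.linalg.inv(Q.T @ XtX_inv @ Q) @ (Q.T @ beta_hat - c)
print(beta_bar, Q.T @ beta_bar)         # the constraint holds exactly
```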

4.2 CLS as BLUE

Using (5) and the model $Y = X\beta + u$ with $Q'\beta = c$, the CLS estimator can be expressed as

$$\bar\beta = \beta + R(R'X'XR)^{-1}R'X'u,$$

where $R$ is a $k \times (k - q)$ matrix such that $R'Q = 0$. (Such a matrix can always be found and is not unique; any matrix that satisfies these conditions will do.) Therefore $\bar\beta$ is unbiased and its variance-covariance matrix is given by

$$V(\bar\beta) = \sigma^2 R(R'X'XR)^{-1}R'.$$

Now consider the class of estimators of the form $\beta^* = D'Y - d$, where $D$ is a $T \times k$ matrix and $d$ is a $k$-vector. This class is broader than the class of linear estimators considered in the unconstrained case because of the additive constant $d$. We did not include $d$ previously because in the unconstrained model the unbiasedness condition would ensure $d = 0$. Here, the unbiasedness condition $E(D'Y - d) = \beta$ implies $D'X = I + GQ'$ and $d = Gc$ for some arbitrary $k \times q$ matrix $G$. We have $V(\beta^*) = \sigma^2 D'D$, and CLS is BLUE because of the identity

$$D'D - R(R'X'XR)^{-1}R' = \left[D' - R(R'X'XR)^{-1}R'X'\right]\left[D' - R(R'X'XR)^{-1}R'X'\right]',$$

where we have used $D'X = I + GQ'$ and $R'Q = 0$.

Tests of Linear Constraints

In this section we shall regard the linear constraints (4) as a testable hypothesis, calling it the null hypothesis. For now we will assume that the normal linear regression model holds, and we derive the most frequently used tests in the OLS context. (We will discuss inference in the presence of nonlinear constraints and departures from normality of $u$ later. For the impatient: none of the results derived here change when these assumptions are relaxed, at least asymptotically.)

5.1 The t Test

The t test is an ideal test to use when we have a single constraint, that is, $q = 1$. As we assumed that $u$ is normally distributed, so is $\hat\beta$; thus under the null hypothesis we have

$$Q'\hat\beta \sim N\left(c,\; \sigma^2 Q'(X'X)^{-1}Q\right).$$

With $q = 1$, $Q'$ is a row vector and $c$ is a scalar. Therefore

$$\frac{Q'\hat\beta - c}{\left[\sigma^2 Q'(X'X)^{-1}Q\right]^{1/2}} \sim N(0, 1). \qquad (6)$$

It can also be shown that

$$\frac{\hat u'\hat u}{\sigma^2} \sim \chi^2_{T-k}, \qquad (7)$$

independently of (6), so that

$$t_T = \frac{Q'\hat\beta - c}{\left[\tilde\sigma^2 Q'(X'X)^{-1}Q\right]^{1/2}} \sim t_{T-k},$$

which is Student's t with $T - k$ degrees of freedom. Note that only now have we invoked the assumption of normality of $u$ and, as shown later, it is not necessary for (6) to hold (in large samples).

If we were interested in testing a single hypothesis of the form

$$H_0: \beta_1 = 0,$$

we would define $Q = (1\;\; 0\;\; \cdots\;\; 0)'$ and $c = 0$, in which case we would obtain the familiar t test

$$t_T = \frac{\hat\beta_1}{\sqrt{\hat V_{1,1}}},$$

where $\hat V = \tilde\sigma^2(X'X)^{-1}$ is the estimated variance-covariance matrix of $\hat\beta$.

With these tools we can construct confidence intervals $C_T$ for $\beta_i$. As $C_T$ is a function of the data, it is random. Its objective is to cover $\beta_i$ with high probability. The coverage probability is $\Pr(\beta_i \in C_T)$. We say that $C_T$ has $(1 - \alpha)$ coverage for $\beta_i$ if $\Pr(\beta_i \in C_T) \geq 1 - \alpha$. We construct a confidence interval as follows:

$$\Pr\left(\hat\beta_i - z_{\alpha/2}\sqrt{\hat V_{i,i}} < \beta_i < \hat\beta_i + z_{\alpha/2}\sqrt{\hat V_{i,i}}\right) = 1 - \alpha,$$

where $z_{\alpha/2}$ is the upper $\alpha/2$ quantile of the distribution being considered (asymptotically, the normal distribution; in small samples, the Student's t distribution). The most common choice for $\alpha$ is 0.05. If $|t_T| < z_{\alpha/2}$, we cannot reject the null hypothesis at the $\alpha$ significance level; otherwise the null hypothesis is rejected.

An alternative approach to reporting results is to report a p-value. The p-value for the above statistic is constructed as follows. Define the tail probability, or p-value function,

$$p_T = p(t_T) = \Pr(|Z| \geq |t_T|) = 2\left(1 - \Phi(|t_T|)\right).$$

If the p-value $p_T$ is small (close to zero), then the evidence against $H_0$ is strong. In a sense, p-values and hypothesis tests are equivalent, since $p_T \leq \alpha$ if and only if $|t_T| \geq z_{\alpha/2}$. The p-value is more general, however, in that the reader is allowed to pick the level of significance $\alpha$.
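The computations above are simple to carry out; the sketch below (simulated data, scipy.stats for the Student's t distribution) produces the t statistic, its p-value, and a 95% confidence interval for $\beta_1$:

```python
# t statistic, p-value, and confidence interval for H0: beta_1 = 0.
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
T, k = 120, 2
X = np.column_stack([rng.normal(size=T), np.ones(T)])   # beta_1 is the first slope
Y = X @ np.array([0.25, 1.0]) + rng.normal(size=T)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
u_hat = Y - X @ beta_hat
s2_tilde = u_hat @ u_hat / (T - k)                      # unbiased variance estimator
V_hat = s2_tilde * np.linalg.inv(X.T @ X)

se = np.sqrt(V_hat[0, 0])
t_stat = beta_hat[0] / se
p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df=T - k))
z = stats.t.ppf(0.975, df=T - k)                        # alpha = 0.05
print(t_stat, p_value, (beta_hat[0] - z * se, beta_hat[0] + z * se))
```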

A confidence interval for $\sigma^2$ can be constructed as follows:

$$\Pr\left(\frac{(T - k)\tilde\sigma^2}{\chi^2_{T-k,\, 1-\alpha/2}} < \sigma^2 < \frac{(T - k)\tilde\sigma^2}{\chi^2_{T-k,\, \alpha/2}}\right) = 1 - \alpha. \qquad (8)$$

5.2 The F Test

When $q > 1$ we cannot apply the t test described above and use instead a simple transformation of what is known as the Likelihood Ratio Test (which we will discuss at length later). Under the null hypothesis, it can be shown that

$$\frac{S_T(\bar\beta) - S_T(\hat\beta)}{\sigma^2} \sim \chi^2_q.$$

As in the previous case, when $\sigma^2$ is not known, a finite sample correction can be made by replacing $\sigma^2$ with $\tilde\sigma^2$, in which case we have

$$\frac{(Q'\hat\beta - c)'\left[Q'(X'X)^{-1}Q\right]^{-1}(Q'\hat\beta - c)}{q\,\tilde\sigma^2} = \frac{S_T(\bar\beta) - S_T(\hat\beta)}{\hat u'\hat u}\,\frac{T - k}{q} \sim F_{q,\, T-k}. \qquad (9)$$

Once again, as in the case of t tests, we reject the null hypothesis when the value computed exceeds the critical value.
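Equation (9) states that the quadratic form in $Q'\hat\beta - c$ and the difference in sums of squared residuals give the same statistic; the sketch below (an illustrative joint hypothesis on simulated data) verifies this:

```python
# F statistic (9) computed two equivalent ways.
import numpy as np

rng = np.random.default_rng(9)
T, k, q = 200, 4, 2
X = np.column_stack([np.ones(T), rng.normal(size=(T, 3))])
Y = X @ np.array([1.0, 0.0, 0.0, 0.5]) + rng.normal(size=T)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
u_hat = Y - X @ beta_hat
s2_tilde = u_hat @ u_hat / (T - k)

Q = np.zeros((k, q)); Q[1, 0] = 1.0; Q[2, 1] = 1.0   # H0: beta_2 = beta_3 = 0
c = np.zeros(q)
d = Q.T @ beta_hat - c

F_quadratic = d @ np.linalg.solve(Q.T @ XtX_inv @ Q, d) / (q * s2_tilde)

beta_bar = beta_hat - XtX_inv @ Q @ np.linalg.solve(Q.T @ XtX_inv @ Q, d)  # CLS, eq. (5)
ST_bar = ((Y - X @ beta_bar) ** 2).sum()
F_ssr = (ST_bar - u_hat @ u_hat) / (q * s2_tilde)
print(F_quadratic, F_ssr)                            # identical; compare to F_{q, T-k}
```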

5.3 Tests of Structural Change

Suppose we have two regimes, with $T_1$ and $T_2$ observations respectively, described by

$$Y_1 = X_1\beta_1 + u_1, \qquad Y_2 = X_2\beta_2 + u_2.$$

Suppose further that

$$E\left[\begin{pmatrix} u_1 \\ u_2 \end{pmatrix}\begin{pmatrix} u_1' & u_2' \end{pmatrix}\right] = \begin{pmatrix} \sigma_1^2 I_{T_1} & 0 \\ 0 & \sigma_2^2 I_{T_2} \end{pmatrix}.$$

We want to test the null hypothesis $H_0: \beta_1 = \beta_2$. First, we will derive an F test assuming homoskedasticity across regimes, and later we will relax this assumption. To apply the test we define

$$Y = X\beta + u,$$

where

$$Y = \begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix}, \qquad X = \begin{pmatrix} X_1 & 0 \\ 0 & X_2 \end{pmatrix}, \qquad \beta = \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix}, \qquad u = \begin{pmatrix} u_1 \\ u_2 \end{pmatrix}.$$

Under the null hypothesis (with $\sigma_1^2 = \sigma_2^2$), it can be shown that

$$\frac{(\hat\beta_1 - \hat\beta_2)'\left[(X_1'X_1)^{-1} + (X_2'X_2)^{-1}\right]^{-1}(\hat\beta_1 - \hat\beta_2)}{Y'\left[I - X(X'X)^{-1}X'\right]Y}\,\frac{T_1 + T_2 - 2k}{k} \sim F_{k,\, T_1+T_2-2k}, \qquad (10)$$

where $\hat\beta_1 = (X_1'X_1)^{-1}X_1'Y_1$ and $\hat\beta_2 = (X_2'X_2)^{-1}X_2'Y_2$.

Alternatively, the same result can be derived as follows. Define the sum of squares of the residuals under the alternative of structural change,

$$S_T(\hat\beta) = Y'\left[I - X(X'X)^{-1}X'\right]Y,$$

and the sum of squares of the residuals under the null hypothesis,

$$S_T(\bar\beta) = Y'\left[I - X^*(X^{*\prime}X^*)^{-1}X^{*\prime}\right]Y, \qquad X^* = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}.$$

It is easy to show that

$$\frac{S_T(\bar\beta) - S_T(\hat\beta)}{S_T(\hat\beta)}\,\frac{T_1 + T_2 - 2k}{k} \sim F_{k,\, T_1+T_2-2k},$$

where

$$\tilde\sigma^2 = \frac{S_T(\hat\beta)}{T_1 + T_2 - 2k}. \qquad (11)$$
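A sketch of the sums-of-squares version of the test (the sample split and data are illustrative):

```python
# Chow test via restricted and unrestricted sums of squared residuals.
import numpy as np

rng = np.random.default_rng(10)
T1, T2, k = 80, 120, 2
X1 = np.column_stack([np.ones(T1), rng.normal(size=T1)])
X2 = np.column_stack([np.ones(T2), rng.normal(size=T2)])
Y1 = X1 @ np.array([1.0, 0.5]) + rng.normal(size=T1)
Y2 = X2 @ np.array([1.0, 0.9]) + rng.normal(size=T2)   # the slope shifts

def ssr(X, Y):
    b, *_ = np.linalg.lstsq(X, Y, rcond=None)
    e = Y - X @ b
    return e @ e

ST_alt = ssr(X1, Y1) + ssr(X2, Y2)                            # separate regressions
ST_null = ssr(np.vstack([X1, X2]), np.concatenate([Y1, Y2]))  # pooled regression

dof = T1 + T2 - 2 * k
F = (ST_null - ST_alt) / ST_alt * dof / k
print(F)                                                      # compare with F_{k, dof}
```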

Before we remove the assumption that $\sigma_1 = \sigma_2$, we will first derive a test of the equality of the variances. Under the null hypothesis (same variances across regimes) we have

$$\frac{\hat u_i'\hat u_i}{\sigma^2} \sim \chi^2_{T_i - k} \quad \text{for } i = 1, 2,$$

and therefore

$$\frac{T_2 - k}{T_1 - k}\,\frac{\hat u_1'\hat u_1}{\hat u_2'\hat u_2} \sim F_{T_1-k,\, T_2-k}.$$

Unlike the previous tests, a two-tailed test should be used here, because either a large or a small value of the test statistic is a reason to reject the null hypothesis.

If we remove the assumption of equal variances across regimes and focus on the hypothesis of equality of the regression parameters, the tests are more involved. We will concentrate on the case in which $k = 1$, where a t test is applicable. It can be shown (though this is not trivial) that

$$t_T = \frac{\hat\beta_1 - \hat\beta_2}{\sqrt{v}} \sim t_S,$$

where

$$v = \frac{\hat\sigma_1^2}{X_1'X_1} + \frac{\hat\sigma_2^2}{X_2'X_2}, \qquad S = \frac{\left[\dfrac{\hat\sigma_1^2}{X_1'X_1} + \dfrac{\hat\sigma_2^2}{X_2'X_2}\right]^2}{\dfrac{\hat\sigma_1^4}{(T_1 - 1)(X_1'X_1)^2} + \dfrac{\hat\sigma_2^4}{(T_2 - 1)(X_2'X_2)^2}}.$$

A cleaner way to perform this type of test is through the use of direct Likelihood Ratio Tests (which we will discuss in depth later).

Even though structural change (or Chow) tests are popular, modern econometric practice is skeptical of the way in which they are described above, particularly because the econometrician sets in an ad hoc manner the point at which to split the sample. Recent theoretical and empirical work treats the period of the possible break as an endogenous latent variable.

Prediction

Suppose we are interested in predicting the value of $y$ for some period $p$ outside the estimation sample. If the model continues to hold in that period, the relationship will be:

$$y_p = x_p'\beta + u_p,$$

where $y_p$ and $u_p$ are scalars and $x_p$ is the $p$th period observation on the regressors. If we assume that the conditions outlined in the HLRM are satisfied, it is trivial to verify that the best linear predictor is $x_p'\hat\beta_T$, with $\hat\beta_T$ denoting the OLS estimator of $\beta$ conditional on the information available in period $T$. (Best is defined in terms of the candidate that minimizes the mean squared prediction error conditional on observing $x_p$.)

In this case, it can be verified that, conditional on $x_p$, the mean squared prediction error is

$$E\left[(\hat y_p - y_p)^2 \,\middle|\, x_p\right] = \sigma^2\left[1 + x_p'(X'X)^{-1}x_p\right].$$

In practice, $\sigma^2$ is replaced with $\tilde\sigma^2$. It may be thought that the construction of confidence intervals for the prediction is trivial and could be formulated as follows:

$$\Pr\left(\hat y_p - z_{\alpha/2}\sqrt{\hat V_{y_p}} < y_p < \hat y_p + z_{\alpha/2}\sqrt{\hat V_{y_p}}\right) = 1 - \alpha.$$

This, however, is usually wrong. There is a very important difference between this and the confidence interval constructed for the estimators of the parameters. To prove it, notice that

$$\frac{y_p - \hat y_p}{\sqrt{\sigma^2\left[1 + x_p'(X'X)^{-1}x_p\right]}} = \frac{u_p + x_p'(\beta - \hat\beta)}{\sqrt{\sigma^2\left[1 + x_p'(X'X)^{-1}x_p\right]}}.$$

Even though a scaled factor of the second term of the numerator can converge to a normal distribution, the first term has whatever distribution $u$ has. Thus, this relation does not have a discernible limiting distribution. Of course, the only case in which this equation would converge to a normal distribution is if $u$ were normal. As already mentioned, all the other results (at least asymptotically) did not require this restriction on $u$. (See Lam and Veall (2002) for a discussion of the construction of confidence intervals in the presence of departures from normality.)

Finally, another point worth noticing is that the mean squared prediction error assumed that the econometrician knew the future value of the vector $x_p$. Needless to say, if $x$ is stochastic and not known at $T$, the mean squared error could be seriously underestimated.
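Putting the pieces together, a sketch of the point forecast and its estimated mean squared prediction error for a hypothetical, assumed-known $x_p$:

```python
# Out-of-sample point forecast and estimated prediction standard error.
import numpy as np

rng = np.random.default_rng(11)
T, k = 100, 2
X = np.column_stack([np.ones(T), rng.normal(size=T)])
Y = X @ np.array([2.0, -1.0]) + rng.normal(size=T)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
u_hat = Y - X @ beta_hat
s2_tilde = u_hat @ u_hat / (T - k)

x_p = np.array([1.0, 0.3])                     # assumed known at time T
y_hat_p = x_p @ beta_hat
mspe = s2_tilde * (1.0 + x_p @ XtX_inv @ x_p)  # sigma^2 replaced by its estimate
print(y_hat_p, np.sqrt(mspe))
```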

If we want to assess the predictive accuracy of forecasting models, we can use a variety of measures to evaluate ex-post forecasts, that is, forecasts for which the exogenous variables do not have to be forecasted. Two that are based on the residuals from the forecasts are

$$RMSE = \sqrt{\frac{1}{P}\sum_{p=1}^P (y_p - \hat y_p)^2} \qquad \text{and} \qquad MAE = \frac{1}{P}\sum_{p=1}^P |y_p - \hat y_p|,$$

where RMSE stands for Root Mean Squared Error, MAE for Mean Absolute Error, and $P$ is the number of periods being forecasted. These measures have an obvious scaling problem. Several that do not are based on the Theil $U$ statistic:

$$U = \sqrt{\frac{\sum_{p=1}^P (y_p - \hat y_p)^2}{\sum_{p=1}^P y_p^2}}.$$

This measure is related to $R^2$ but is not bounded by zero and one. Large values indicate a poor forecasting performance. An alternative is to compute the measure in terms of the changes in $y$:

$$U = \sqrt{\frac{\sum_{p=1}^P (\Delta y_p - \Delta\hat y_p)^2}{\sum_{p=1}^P (\Delta y_p)^2}},$$

where

$$\Delta y_p = y_p - y_{p-1} \quad \text{and} \quad \Delta\hat y_p = \hat y_p - y_{p-1},$$

or, in percentage terms,

$$\Delta y_p = \frac{y_p - y_{p-1}}{y_{p-1}} \quad \text{and} \quad \Delta\hat y_p = \frac{\hat y_p - y_{p-1}}{y_{p-1}}.$$

These measures will reflect the model's ability to track turning points in the data.
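The measures above are straightforward to compute; in the sketch below the realizations and forecasts are made-up numbers used only to exercise the formulas:

```python
# RMSE, MAE, and the two Theil U statistics for a small forecast record.
import numpy as np

y = np.array([2.1, 2.4, 2.2, 2.8, 3.0])       # realizations y_p
y_hat = np.array([2.0, 2.5, 2.4, 2.6, 3.1])   # ex-post forecasts

err = y - y_hat
rmse = np.sqrt(np.mean(err ** 2))
mae = np.mean(np.abs(err))
U_levels = np.sqrt(np.sum(err ** 2) / np.sum(y ** 2))

# Percentage-change version, with the previous realization as the base
dy = np.diff(y) / y[:-1]
dy_hat = (y_hat[1:] - y[:-1]) / y[:-1]
U_changes = np.sqrt(np.sum((dy - dy_hat) ** 2) / np.sum(dy ** 2))
print(rmse, mae, U_levels, U_changes)
```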

When several competing forecast models are considered, some of them will appear more successful than others in a given dimension (say, one model has the smallest MAE for 2-step-ahead forecasts). It is inevitable then to ask how likely it is that this result is due to chance. Diebold and Mariano (1995) approach forecast comparison in this framework.

Consider the pair of $h$-step-ahead forecast errors of models $i$ and $j$, $(\hat u_{i,p}, \hat u_{j,p})$ for $p = 1, \ldots, P$, whose quality is to be judged by the loss function $g(\hat u_{i,p})$. (For example, in the case of Mean Squared Error comparison, $g(\cdot)$ is the quadratic loss function $g(\hat u_{i,p}) = \hat u_{i,p}^2$, and in the case of MAE, it is the absolute value loss function $g(\hat u_{i,p}) = |\hat u_{i,p}|$.) Defining $d_p = g(\hat u_{i,p}) - g(\hat u_{j,p})$, under the null hypothesis of equal forecast accuracy between models $i$ and $j$ we have $E(d_p) = 0$. Given the covariance-stationary realization $\{d_p\}_{p=1}^P$, it is natural to base a test on the observed sample mean:

$$\bar d = \frac{1}{P}\sum_{p=1}^P d_p.$$

Even with optimal $h$-step-ahead forecasts, the sequence of forecast errors follows an MA($h-1$) process. If the autocorrelations of order $h$ and higher are zero, the variance of $\bar d$ can be consistently estimated as follows:

$$\hat V = \frac{1}{P}\left(\hat\gamma_0 + 2\sum_{j=1}^{h-1}\hat\gamma_j\right),$$

where $\hat\gamma_j$ is an estimate of the $j$-th autocovariance of $d_p$. The Diebold-Mariano (DM) statistic is given by

$$DM = \frac{\bar d}{\sqrt{\hat V}} \sim N(0, 1)$$

under the null of equal forecast accuracy. Harvey et al. (1997) suggest modifying the DM test and using instead

$$HLN = DM\left[\frac{P + 1 - 2h + h(h - 1)/P}{P}\right]^{1/2}$$

to correct size problems of DM. They also suggest using a Student's t with $P - 1$ degrees of freedom instead of a standard normal to account for possible fat-tailed errors.
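A sketch of the DM and HLN statistics under quadratic loss (simulated forecast errors; $h = 2$ is an illustrative horizon, and the simple sample autocovariances are one way to estimate $\hat\gamma_j$):

```python
# Diebold-Mariano and Harvey-Leybourne-Newbold statistics, quadratic loss.
import numpy as np
from scipy import stats

rng = np.random.default_rng(12)
P, h = 60, 2
u_i = rng.normal(size=P)            # model i forecast errors
u_j = 1.2 * rng.normal(size=P)      # model j: noisier forecasts

d = u_i ** 2 - u_j ** 2             # loss differential d_p
d_bar = d.mean()

# Sample autocovariances up to order h - 1 (h-step errors are MA(h-1))
dc = d - d_bar
gamma = [dc @ dc / P] + [dc[j:] @ dc[:-j] / P for j in range(1, h)]
V = (gamma[0] + 2 * sum(gamma[1:])) / P

DM = d_bar / np.sqrt(V)
HLN = DM * np.sqrt((P + 1 - 2 * h + h * (h - 1) / P) / P)
print(DM, HLN, 2 * (1 - stats.t.cdf(abs(HLN), df=P - 1)))   # two-sided p-value
```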

To test whether model $i$ is not dominated by model $j$ in terms of forecasting accuracy for the loss function $g(\cdot)$, a one-sided test based on DM or HLN can be conducted, where under the null $E(d_p) \leq 0$. Thus, if the null is rejected, we conclude that model $j$ dominates model $i$.


References

Amemiya, T. (1985). Advanced Econometrics. Harvard University Press.

Baltagi, B. (1999). Econometrics. Springer-Verlag.

Diebold, F. and R. Mariano (1995). Comparing Predictive Accuracy, Journal of

Business and Economic Statistics 13, 253-65.

Greene, W. (1993). Econometric Analysis. Macmillan.

Hansen, B. (2001). Lecture Notes on Econometrics, Manuscript. Michigan University.

Harvey, D., S. Leybourne, and P. Newbold (1997). Testing the Equality of Prediction

Mean Square Errors, International Journal of Forecasting 13, 281-91.

Hayashi, F. (2000). Econometrics. Princeton University Press.

Lam, J. and M. Veall (2002). Bootstrap Prediction Intervals for Single Period Regression Forecasts, International Journal of Forecasting 18, 125-30.

Mittelhammer, R., G. Judge, and D. Miller (2000). Econometric Foundations. Cambridge University Press.

Ruud, P. (2000). An Introduction to Classical Econometric Theory. Oxford University

Press.


Workout Problems

1. Prove that independence implies no correlation but that the contrary is not

necessarily true. Give an example of variables that are uncorrelated but not

independent.

2. Let $y, x$ be scalar dichotomous random variables with zero means. Define $u = y - \operatorname{Cov}(y, x)[V(x)]^{-1}x$. Prove that $E(u | x) = 0$. Are $u$ and $x$ independent?

3. Let $y$ be a scalar random variable and $x$ a vector random variable. Prove that $E[y - E(y | x)]^2 \leq E[y - w(x)]^2$ for any function $w$.

4. Prove that if $V(u_t) = \sigma^2$, then $V(\hat u_t) = (1 - h_t)\sigma^2$. Find an expression for $h_t$.

5. Prove Proposition 3.

6. Prove Proposition 8.

7. In Theorem 1 we used the fact that $(A + C)(A + C)' = (X'X)^{-1} + CC'$. Prove this.

8. Prove that when a constant is included, $R^2 = 1 - (Y'MY/Y'LY)$, with $L$ being as defined in Section 3.5.

9. Derive the variance-covariance matrix of $\hat\beta_2$ as defined in Section 3.6.

12. Prove that the CLS estimator can be expressed as $\bar\beta = \beta + R(R'X'XR)^{-1}R'X'u$ and obtain $V(\bar\beta)$.

13. Prove that $\hat u'\hat u/\sigma^2 \sim \chi^2_{T-k}$.

14. Demonstrate (8).

16. Prove that to test the null $H_0: \beta_i = 0$ for all $i$ except the constant, the F test is equivalent to $(T - k)R^2/\left[(1 - R^2)(k - 1)\right]$.
