Econometrics Lecture Notes

Brennan Thompson∗

11th September 2006

∗ Department of Economics, Ryerson University, 350 Victoria Street, Toronto, Ontario, M5B 2K3, Canada. Email: brennan@ryerson.ca

1 Review of probability and statistics

1.1 Random variables

In the simplest of terms, a random variable (or RV) is a variable whose value is determined by the outcome of a random experiment.¹ In what follows, we will denote an RV by an uppercase letter (such as X or Y), and an outcome by a lowercase letter (such as x or y).

¹ From this definition, it should be obvious that an RV differs from a constant, which is a variable whose value is fixed.

As a simple example, consider the random experiment of rolling a single die. Let X be the RV determined by the outcome of this experiment. In this experiment, there are six possible outcomes: if the die lands with 1, 2, 3, 4, 5, or 6 facing up, we assign that value to the RV X.

This example illustrates what is known as a discrete random variable: an RV which can take on only a finite (or countably infinite) number of values. Here, the RV X can take on only the values 1, 2, 3, 4, 5, or 6. A discrete RV differs from a continuous random variable: an RV which can take on any value in some real interval (and therefore an uncountably infinite number of values). An example of a continuous RV is a person's height. For an adult, this could take on any value between (say) 4 and 8 feet, depending on the precision of measurement.

1.2 Probability functions

Associated with every RV is a probability function, which in the case of a discrete RV we will call a probability mass function, and in the case of a continuous RV we will call a probability density function (or PDF), f(x).

Consider first the discrete case. The probability mass function f(x) tells us the probability that the discrete RV X will take on the value x. More formally,

f(x) = Prob(X = x).

A probability mass function for a discrete RV requires that

0 ≤ f(x) ≤ 1,

and

Σ_x f(x) = 1.²

² The notation Σ_x is used to denote the sum over the entire range of values that the discrete RV X can take on.

In our die-rolling example, it should be clear that, since each outcome (1, 2, 3, 4, 5, or 6) is equally likely, the probability mass function for X, f(x), is equal to 1/6 for each x, i.e.

f(x) = 1/6, for x = 1, . . . , 6.

Such a probability mass function is said to represent a uniform distribution, since each outcome has an equal (or uniform) probability of occurring.
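To make the die-rolling example concrete, the short simulation below (a minimal Python/NumPy sketch, not part of the original notes) draws a large number of rolls and compares the empirical frequencies with the uniform probability mass function f(x) = 1/6.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
rolls = rng.integers(1, 7, size=n)  # simulated outcomes of the random experiment

# Empirical probability mass function: relative frequency of each outcome
values, counts = np.unique(rolls, return_counts=True)
empirical_pmf = counts / n

for x, p in zip(values, empirical_pmf):
    print(f"f({x}) ~ {p:.4f} (theoretical 1/6 = {1/6:.4f})")

# The probabilities lie in [0, 1] and sum to one, as required
assert np.all((empirical_pmf >= 0) & (empirical_pmf <= 1))
assert np.isclose(empirical_pmf.sum(), 1.0)
```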

For a continuous RV, the probability associated with any particular value is zero, so we can only assign positive probabilities to ranges of values. For example, if we are interested in the probability that the continuous RV X will take on a value in the range between a and b, the PDF f(x) is implicitly defined by

Prob(a ≤ X ≤ b) = ∫_a^b f(x) dx.

As in the discrete case, a PDF for a continuous RV has certain (roughly analogous) requirements that must be satisfied:

f(x) ≥ 0,

and

∫_x f(x) dx = 1.³

³ The notation ∫_x is used to denote the integral over the entire range of values that the continuous RV X can take on.
1.3 Distribution functions

Closely related to the concept of a probability function is a distribution function (or cumulative distribution function, CDF), which tells us the probability that an RV will take on any value less than or equal to some specific value. That is, for an RV X (whether discrete or continuous), the CDF F(x) is defined as

F(x) = Prob(X ≤ x).

It should be fairly easy to see that for a discrete RV X, this implies that

F(x) = Σ_{X ≤ x} f(x).

In our example of rolling a single die, it should be easy to see that the CDF of X is

F(x) = x/6, for x = 1, . . . , 6.

For a continuous RV X, the definition of a CDF implies that

F(x) = ∫_{−∞}^{x} f(t) dt,

and

f(x) = dF(x)/dx.

Finally, it should be noted that for any RV X (whether discrete or continuous), the CDF F(x) must satisfy

0 ≤ F(x) ≤ 1,

F(+∞) = 1,

F(−∞) = 0,

and

F(x₁) ≥ F(x₂) for x₁ > x₂.
1.4 Joint probability functions

A probability function defined over more than one RV is known as a joint probability function. For instance, in the case of two discrete RVs, X and Y, the joint probability mass function f(x, y) tells us the probability that the RV X will take on the value x and the RV Y will take on the value y (simultaneously). That is,

f(x, y) = Prob(X = x, Y = y).

Similar to the requirements of a probability mass function (for a single discrete RV), we must have

0 ≤ f(x, y) ≤ 1,

and

Σ_x Σ_y f(x, y) = 1.

For example, consider two dice, one red and one blue. Denote by X the RV whose value is determined by the random experiment of rolling the red die, and by Y the RV whose value is determined by the random experiment of rolling the blue die. It should be emphasized that these are two separate random experiments (i.e. do not confuse this with the single random experiment of tossing a pair of dice). However, the possible outcomes of these two random experiments can be combined in the following manner: let (x, y) denote the combined outcome that X takes on the value x, and Y takes on the value y.
For example, (1, 5) represents rolling a 1 with the red die and a 5 with the blue die. In total, there are 36 such combined outcomes, each with equal probability. Accordingly, the joint probability mass function for X and Y is

f(x, y) = 1/36, for x = 1, . . . , 6 and y = 1, . . . , 6.

In the continuous case, we may be interested in the probability that the RV X takes on some value in the range between a and b and the RV Y takes on some value in the range between c and d (simultaneously). The joint probability density function (or joint PDF) f(x, y) is defined so that

Prob(a ≤ X ≤ b, c ≤ Y ≤ d) = ∫_a^b ∫_c^d f(x, y) dy dx.

The requirements here are similar to those of a PDF for a single continuous RV:

f(x, y) ≥ 0,

and

∫_x ∫_y f(x, y) dx dy = 1.
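A quick enumeration (a Python sketch, not part of the original notes) confirms that the 36 equally likely combined outcomes give f(x, y) = 1/36 and that these probabilities sum to one.

```python
from fractions import Fraction

# All 36 combined outcomes (x, y) of rolling the red and blue dice
outcomes = [(x, y) for x in range(1, 7) for y in range(1, 7)]

# Each combined outcome is equally likely
f = {(x, y): Fraction(1, 36) for (x, y) in outcomes}

print(f[(1, 5)])        # 1/36: a 1 on the red die and a 5 on the blue die
print(sum(f.values()))  # 1: the joint probabilities sum to one
```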

1.5 Conditional probability functions

Closely related to the concept of a joint probability function is a conditional probability function. For two discrete RVs, X and Y, the conditional probability mass function f(y|x) tells us the probability that Y takes on the value y, given that X takes on the value x. That is,

f(y|x) = Pr(Y = y | X = x),

and is equal to

f(y|x) = Pr(Y = y, X = x) / Pr(X = x) = f(y, x) / f(x).

In the continuous case, a conditional probability density function (or conditional PDF) can be used to tell us the probability that a continuous RV takes on a value in some range, given that another continuous RV takes on some specific value. Of course, as noted above, the probability that a continuous RV takes on a specific outcome is zero, so we need to make some approximation to this. Suppose that we want to know the probability that the continuous RV Y takes on some value in the range between a and b, given that the continuous RV X takes on some value extremely close to x, say in the range between x and x + h, where h is some extremely small number. The probability that the continuous RV X takes on some value in the range between x and x + h is

Pr(x ≤ X ≤ x + h) = ∫_x^{x+h} f(v) dv ≈ h f(x).

Similarly, the probability that Y takes on some value in the range between a and b and X takes on some value in the range between x and x + h is

Pr(a ≤ Y ≤ b, x ≤ X ≤ x + h) = ∫_a^b ∫_x^{x+h} f(y, v) dv dy ≈ ∫_a^b h f(y, x) dy.

So, similar to the discrete case, the probability that Y takes on some value in the range between a and b, given that X takes on some value in the range between x and x + h, is

Pr(a ≤ Y ≤ b | x ≤ X ≤ x + h) = Pr(a ≤ Y ≤ b, x ≤ X ≤ x + h) / Pr(x ≤ X ≤ x + h) ≈ [∫_a^b h f(y, x) dy] / [h f(x)] = ∫_a^b [f(y, x)/f(x)] dy.

From this approximation, we implicitly define the conditional PDF f(y|x) as f(y, x)/f(x), so that

Pr(a ≤ Y ≤ b | X ≈ x) = ∫_a^b f(y|x) dy.

1.6 Expected value

Roughly speaking, the expected value (or mean) of an RV is the average value we would expect it to take on if we were to repeat the random experiment which determines its value many times over. It is what is known as a measure of central tendency.⁴ We generally use the notation E(X) when we speak of the expected value of the RV X, and the notation µ_X when we speak of its mean; since these terms are interchangeable, so is the notation.

⁴ Other measures of central tendency are the median and the mode. The median of the RV X (whether discrete or continuous) is the value m such that Prob(X ≤ m) ≥ 0.5 and Prob(X ≥ m) ≥ 0.5, while the mode of X is the value of x at which f(x) takes its maximum.

The mathematical expectation of a discrete RV X is defined as

E(X) = Σ_x x f(x),

while for a continuous RV X we have

E(X) = ∫ x f(x) dx.

It should be clear that these definitions are quite similar, in that the different possible values of the RV are weighted by the probability attached to them. In the discrete case, these weighted values are summed, while in the continuous case, integration is used.

In our example of rolling a single die, we can calculate the expected value of X as

E(X) = (1/6)(1) + (1/6)(2) + (1/6)(3) + (1/6)(4) + (1/6)(5) + (1/6)(6) = 3.5.

It should go without saying that the expected value of a constant is the value of the constant itself. That is, for any constant c,

E(c) = c.

On the other hand, it is not so obvious that the expected value of an RV is a constant. Since E(X) is a constant, we therefore have E[E(X)] = E(X), which might be a little less confusing if we use the notation E(µ_X) = µ_X.

In a more general way, we can also calculate the expected value of a function of an RV. Letting g(X) be some function of the discrete RV X, the expected value of this function is

E[g(X)] = Σ_x g(x) f(x).

Similarly, letting g(X) be some function of a continuous RV X, the expected value of this function is

E[g(X)] = ∫ g(x) f(x) dx.

Note that a function of an RV is an RV itself. To see this, it might help to imagine a new RV, call it Z, which is a function of our original RV, X (i.e. Z = g(X)).

As an example, let's again consider the RV X, whose value is determined by the outcome of tossing a single die. Suppose we have the function g(X) = 4X (this is what is known as a linear function). Again, it might help to imagine a new RV called Z, where Z = g(X) = 4X. The expected value of Z can be calculated as

E(Z) = E(4X) = (1/6)(4) + (1/6)(8) + (1/6)(12) + (1/6)(16) + (1/6)(20) + (1/6)(24) = 14.

Notice that this is just four times the expected value of X (which, as we saw above, is 3.5). This example illustrates an important rule: the expected value of an RV multiplied by a constant is the constant multiplied by the expected value of that RV. That is, if X is an RV and c is a constant, we have

E(cX) = cE(X).

The proof of this follows directly from the definition of the mathematical expectation of a discrete RV:⁵

E(cX) = Σ_x c x f(x) = c Σ_x x f(x) = cE(X).

⁵ This rule also holds for continuous RVs, and a similar proof could be made.

Now, consider the function g(X) = X² (this is what is known as a nonlinear function). Once again, it might help to imagine a new RV called Z, where Z = g(X) = X². The expected value of Z can be calculated as

E(Z) = E(X²) = (1/6)(1) + (1/6)(4) + (1/6)(9) + (1/6)(16) + (1/6)(25) + (1/6)(36) = 91/6 ≈ 15.17.

It is important to note that E(X²) (the expected value of the square of the RV X) is not equal to [E(X)]² (the square of the expected value of the RV X), which in this case would be 3.5² = 12.25.
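The two calculations above are easy to verify directly from the probability mass function (a Python sketch, not part of the original notes):

```python
from fractions import Fraction

pmf = {x: Fraction(1, 6) for x in range(1, 7)}  # uniform pmf of a fair die

E_X = sum(x * p for x, p in pmf.items())        # E(X)
E_4X = sum(4 * x * p for x, p in pmf.items())   # E[g(X)] with g(X) = 4X
E_X2 = sum(x**2 * p for x, p in pmf.items())    # E[g(X)] with g(X) = X^2

print(E_X)    # 7/2  = 3.5
print(E_4X)   # 14   = 4 * E(X)
print(E_X2)   # 91/6 ~ 15.17, which differs from E(X)^2 = 12.25
```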

Quite often, we are interested in functions of more than one RV. Consider g(X, Y), a function of two discrete RVs, X and Y. The expected value of this function is

E[g(X, Y)] = Σ_x Σ_y g(x, y) f(x, y),

where f(x, y) is the joint probability mass function of X and Y. Similarly, for two continuous RVs, X and Y, the expected value of the function g(X, Y) is

E[g(X, Y)] = ∫∫ g(x, y) f(x, y) dx dy,

where f(x, y) is the joint PDF of X and Y.

Consider now our example of rolling two separate dice, one red and one blue, and the function g(X, Y) = X + Y. As above, it might help to once again imagine a new RV called Z, where Z = g(X, Y) = X + Y.⁶ The expected value of Z can be calculated as

E(Z) = E(X + Y) = (1/36)(2) + (2/36)(3) + (3/36)(4) + (4/36)(5) + (5/36)(6) + (6/36)(7) + (5/36)(8) + (4/36)(9) + (3/36)(10) + (2/36)(11) + (1/36)(12) = 7.

⁶ Note that, while we are defining Z here as the RV whose value is determined by a function of two separate RVs (determined by the outcomes of two separate random experiments), Z can also be seen as the RV whose value is determined by the outcome of the single random experiment of rolling a pair of dice.

Note that this is just equal to the expected value of X (3.5) plus the expected value of Y (3.5). This example illustrates another important rule: the expected value of the sum of two (or more) RVs is the sum of their expected values. In general terms, for any two RVs X and Y, we have⁷

E(X + Y) = E(X) + E(Y).

⁷ As above, this rule also holds for continuous RVs. We omit the proof for both cases here.

After seeing this, it might be tempting to think that the same thing holds for the product of several RVs, i.e. that E(XY) = E(X)E(Y). While this is true in a certain special case (known as statistical independence), it is not true in general.

1.7 Conditional expectation

Often, we are interested in knowing what the expected value of an RV is, given that another RV takes on a certain value. Consider two discrete RVs, X and Y. The conditional expectation of Y, given that X takes on the value x, is

E(Y | X = x) = Σ_y y f(y|x).

For two continuous RVs, X and Y, the conditional expectation of Y, given that X takes on some value extremely close to x (see above), is

E(Y | X ≈ x) = ∫ y f(y|x) dy.
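Because the two dice are independent, the 36-outcome enumeration below (a Python sketch, not part of the original notes) reproduces E(X + Y) = 7, illustrates the special case mentioned above in which E(XY) = E(X)E(Y), and evaluates a conditional expectation.

```python
from fractions import Fraction

p = Fraction(1, 36)  # probability of each combined outcome (x, y)
outcomes = [(x, y) for x in range(1, 7) for y in range(1, 7)]

E_sum = sum((x + y) * p for x, y in outcomes)   # E(X + Y)
E_prod = sum(x * y * p for x, y in outcomes)    # E(XY)
E_X = sum(x * p for x, y in outcomes)           # E(X), marginal of the red die
E_Y = sum(y * p for x, y in outcomes)           # E(Y), marginal of the blue die

print(E_sum)                # 7 = E(X) + E(Y)
print(E_prod, E_X * E_Y)    # 49/4 and 49/4: equal because X and Y are independent

# Conditional expectation E(Y | X = 3): with independent dice it is just E(Y) = 3.5
cond = [y for x, y in outcomes if x == 3]
print(Fraction(sum(cond), len(cond)))   # 7/2
```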

1.8 Variance

Roughly speaking, the variance of an RV is the average squared distance from its expected value that we would expect it to take on if we were to repeat the random experiment which determines its value many times over. The variance is known as a measure of dispersion. More formally, the variance of the RV X (whether discrete or continuous), usually denoted σ²_X (or sometimes Var(X)), is defined as

σ²_X = E[(X − µ_X)²].⁸

⁸ At times, it may be convenient to define the variance of the RV X as σ²_X = E(X²) − µ²_X. It is a useful exercise to show that these two definitions are equivalent.

Notice that the variance of an RV is just the expectation of a function of that RV, i.e. the variance is the expected value of the function (X − µ_X)².⁹ Keeping this in mind, we can write the variance of the discrete RV X as

σ²_X = Σ_x (x − µ_X)² f(x),

while the variance of the continuous RV X can be written as

σ²_X = ∫ (x − µ_X)² f(x) dx.

⁹ While this function is also an RV, its expected value (i.e. the variance) is actually a constant, just like the expected value of X itself, as we pointed out above.

In our example of tossing a single die, the variance of X (remembering that µ_X = 3.5) is calculated as

σ²_X = (1/6)(1 − 3.5)² + (1/6)(2 − 3.5)² + (1/6)(3 − 3.5)² + (1/6)(4 − 3.5)² + (1/6)(5 − 3.5)² + (1/6)(6 − 3.5)² = 17.5/6 ≈ 2.92.

From this definition, it should be clear that the variance of a constant is zero. The proof of this is quite trivial: for some constant c we have

σ²_c = E[(c − µ_c)²] = E[(c − c)²] = 0.

Another measure of dispersion for an RV is its standard deviation, which is just the positive square root of its variance. The standard deviation of the RV X is usually denoted σ_X. In the die example, σ_X = √(17.5/6) ≈ 1.71.

Keeping in mind that the variance of an RV (just like its expected value) is a constant, we can also consider the variance of a function of an RV.¹⁰ Consider the linear function g(X) = cX, where X is an RV and c is a constant. The variance of this function is

Var(cX) = E[(cX − µ_cX)²] = E[(cX − cµ_X)²] = E[c²(X − µ_X)²] = c²E[(X − µ_X)²] = c²σ²_X.

¹⁰ Here, we will focus on linear functions, but the analysis could be extended to more general functions. We will consider the variance of a function of more than one RV in the next section.
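Both the variance of the die and the rule Var(cX) = c²σ²_X are easy to check numerically (a Python sketch, not part of the original notes):

```python
from fractions import Fraction

pmf = {x: Fraction(1, 6) for x in range(1, 7)}
mu = sum(x * p for x, p in pmf.items())                    # E(X) = 7/2

var_X = sum((x - mu) ** 2 * p for x, p in pmf.items())     # E[(X - mu)^2]
print(var_X, float(var_X))                                 # 35/12 ~ 2.92

# Alternative definition: E(X^2) - mu^2 gives the same answer
print(sum(x**2 * p for x, p in pmf.items()) - mu**2)       # 35/12

# Var(4X) = 16 * Var(X)
var_4X = sum((4 * x - 4 * mu) ** 2 * p for x, p in pmf.items())
print(var_4X == 16 * var_X)                                # True
```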

1.9 Covariance

Closely related to the concept of variance is covariance, which is a measure of association between two RVs. The covariance between the two RVs X and Y (whether discrete or continuous), usually denoted σ_X,Y (or sometimes Cov(X, Y)), is defined as¹¹

σ_X,Y = E[(X − µ_X)(Y − µ_Y)].

¹¹ We can also define the covariance of the RVs X and Y as σ_X,Y = E(XY) − µ_X µ_Y. As was the case with the definitions of variance, it is a useful exercise to show that these two are equivalent.

Just as we viewed variance as the expectation of a function of a (single) RV, we can view covariance as the expectation of a function of two RVs; here, the expectation is taken of the function g(X, Y) = (X − µ_X)(Y − µ_Y). For two discrete RVs X and Y, this means

σ_X,Y = Σ_x Σ_y (x − µ_X)(y − µ_Y) f(x, y),

while for two continuous RVs X and Y, this means

σ_X,Y = ∫∫ (x − µ_X)(y − µ_Y) f(x, y) dx dy.

This definition should make it clear that the covariance of an RV with itself is just equal to the variance of that RV:

σ_X,X = E[(X − µ_X)(X − µ_X)] = E[(X − µ_X)²] = σ²_X.

Keeping the covariance in mind, we can now consider the variance of a function of more than one RV.¹³ Consider the linear function g(X, Y) = X + Y, where X and Y are two RVs. The variance of this function is

Var(X + Y) = E[(X + Y − µ_{X+Y})²]
= E[(X + Y − µ_X − µ_Y)²]
= E[(X − µ_X)² + (Y − µ_Y)² + 2(X − µ_X)(Y − µ_Y)]
= σ²_X + σ²_Y + 2σ_X,Y.

¹³ We again focus only on linear functions, but more general functions could be considered.

A similar analysis (omitted here) could be used to show that the variance of the function g(X, Y) = X − Y is equal to

Var(X − Y) = σ²_X + σ²_Y − 2σ_X,Y.

Using the alternate definition of covariance, it can be seen that if two RVs X and Y are statistically independent (i.e. E(XY) = E(X)E(Y)), their covariance is zero:

σ_X,Y = E(XY) − µ_X µ_Y = E(X)E(Y) − µ_X µ_Y = 0.

Of course, unless X and Y are statistically independent, σ_X,Y is not in general equal to zero. If σ_X,Y = 0, the RVs X and Y are said to be uncorrelated. In particular, if X and Y are statistically independent,

Var(X + Y) = σ²_X + σ²_Y,

and likewise

Var(X − Y) = σ²_X + σ²_Y.

A related measure of association between the two RVs X and Y is the coefficient of correlation, denoted ρ_X,Y, which is defined as

ρ_X,Y = σ_X,Y / (σ_X σ_Y).

Note that if X and Y are statistically independent, the coefficient of correlation is zero (since σ_X,Y = 0).
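For the two independent dice, these rules can be confirmed by enumerating the 36 outcomes (a Python sketch, not part of the original notes): the covariance is zero and Var(X + Y) = Var(X) + Var(Y).

```python
from fractions import Fraction

p = Fraction(1, 36)
outcomes = [(x, y) for x in range(1, 7) for y in range(1, 7)]

E = lambda g: sum(g(x, y) * p for x, y in outcomes)   # expectation over the joint pmf

mu_x, mu_y = E(lambda x, y: x), E(lambda x, y: y)     # both equal 7/2
cov_xy = E(lambda x, y: (x - mu_x) * (y - mu_y))      # covariance
var_x = E(lambda x, y: (x - mu_x) ** 2)
var_y = E(lambda x, y: (y - mu_y) ** 2)
var_sum = E(lambda x, y: (x + y - mu_x - mu_y) ** 2)

print(cov_xy)                      # 0: independent dice are uncorrelated
print(var_sum == var_x + var_y)    # True: Var(X + Y) = Var(X) + Var(Y)
```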

1.10 Estimators

In order to calculate the measures considered above (expected value, variance, and covariance), it is necessary to know something about the probability function associated with the RV (or, in the case of covariance, the joint probability function associated with the RVs) of interest. In the die-rolling example, the RV X was known a priori to be uniformly distributed, so we were able to calculate the expected value and variance of X quite easily. However, the probability function for an RV determined by anything but the most trivial random experiment is usually unknown.

In such cases, it is common to assume that the RV follows some theoretical distribution (such as the normal distribution). For example, if we assume that the continuous RV X follows a normal distribution, we will use the PDF

f(x) = (1/√(2πσ²_X)) exp[−(x − µ_X)²/(2σ²_X)], for −∞ < x < +∞,

which requires us to know the parameters µ_X and σ²_X.¹⁴ In some cases we may choose to assume values for them, but in others, we need to estimate them using a sample of observations. The problem is that we can't derive these parameters unless we first know the PDF (which requires knowing the parameters).¹⁵

¹⁴ If the RV X follows a normal distribution with mean µ_X and variance σ²_X, it is common to write X ∼ N(µ_X, σ²_X).
¹⁵ In fact, if we are not comfortable assuming some theoretical distribution, it is even possible to estimate the probability function itself using a sample of observations. This approach is considered nonparametric, since it does not involve estimating the parameters of some theoretical distribution (which is a parametric approach).

Let's start by considering an estimator of the mean of an RV. A random sample is a set of RVs, x₁, . . . , xₙ, each of which has the same distribution as X. For this reason, we say that the RVs x₁, . . . , xₙ are independent, identically distributed (or IID). The usual estimator of the mean of the RV X, called the sample mean, is

X̄ = (1/n) Σᵢ₌₁ⁿ xᵢ,

where x₁, . . . , xₙ is a random sample of n independently drawn observations on the RV X.

Next, consider an estimator of the variance of an RV. For the RV X, the usual estimator of variance, called the sample variance and denoted s²_X, is

s²_X = (1/(n−1)) Σᵢ₌₁ⁿ (xᵢ − X̄)².

It should be clear right away that both the sample mean and sample variance of an RV are functions of RVs (the random sample), and are therefore RVs themselves. Simply put, estimators are RVs.
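In Python/NumPy (a sketch, not part of the original notes), these two estimators amount to a single line each; note the n − 1 divisor (ddof=1) in the sample variance.

```python
import numpy as np

rng = np.random.default_rng(2)
sample = rng.normal(loc=5.0, scale=2.0, size=200)   # random sample on X ~ N(5, 4)

x_bar = sample.mean()        # sample mean
s2_x = sample.var(ddof=1)    # sample variance, divisor n - 1

print(x_bar, s2_x)           # estimates of mu_X = 5 and sigma^2_X = 4
```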

As RVs, estimators have their own probability function, distribution function, expected value, variance, and so on. For example, the expected value of the sample mean of the RV X is

E(X̄) = E[(1/n) Σᵢ xᵢ] = (1/n) Σᵢ E(xᵢ) = (1/n)(µ_X + . . . + µ_X) = µ_X.

This demonstrates that the sample mean is what is called an unbiased estimator: an estimator of a parameter whose expected value is the true value of that parameter.

The variance of the sample mean of the RV X is

Var(X̄) = Var[(1/n) Σᵢ xᵢ] = (1/n²) Σᵢ Var(xᵢ) = (1/n²)(σ²_X + . . . + σ²_X) = σ²_X / n.

Turning to the sample variance, it turns out that this estimator is also unbiased. The proof is a little more complicated, but it will help to begin by rewriting the sample variance of the RV X as

s²_X = (1/(n−1)) Σᵢ [(xᵢ − µ_X) − (X̄ − µ_X)]²

(note that this is equivalent to our original expression since the µ_X terms cancel). The expected value of s²_X is thus

E(s²_X) = E{ (1/(n−1)) Σᵢ [(xᵢ − µ_X) − (X̄ − µ_X)]² }
= (1/(n−1)) E[ Σᵢ (xᵢ − µ_X)² − 2(X̄ − µ_X) Σᵢ (xᵢ − µ_X) + n(X̄ − µ_X)² ]
= (1/(n−1)) E[ Σᵢ (xᵢ − µ_X)² − n(X̄ − µ_X)² ]
= (1/(n−1)) [ Σᵢ E(xᵢ − µ_X)² − n E(X̄ − µ_X)² ]
= (1/(n−1)) [ nσ²_X − n(σ²_X/n) ]
= σ²_X,

since Σᵢ (xᵢ − µ_X) = n(X̄ − µ_X) and E[(X̄ − µ_X)²] = Var(X̄) = σ²_X/n. So s²_X is an unbiased estimator of σ²_X.

We won't do so here, but it can be shown that the variance of s²_X, if X is a normally distributed RV, is

Var(s²_X) = 2σ⁴_X / (n − 1).

It is useful to compare the estimator s²_X with an alternative estimator of the variance of the RV X. Another commonly used estimator of the variance of the RV X, denoted σ̂²_X, is defined as

σ̂²_X = (1/n) Σᵢ₌₁ⁿ (xᵢ − X̄)² = ((n−1)/n) s²_X.

The expected value of σ̂²_X is

E(σ̂²_X) = ((n−1)/n) E(s²_X) = ((n−1)/n) σ²_X,

which implies that the estimator σ̂²_X is biased (i.e. it is not unbiased). The bias of an estimator is equal to the expected value of the estimator minus the parameter being estimated. For σ̂²_X, this is

Bias(σ̂²_X) = E(σ̂²_X − σ²_X) = E(σ̂²_X) − σ²_X = ((n−1)/n) σ²_X − σ²_X = −(1/n) σ²_X.

Since σ̂²_X is a multiple of s²_X, we can use the variance of s²_X to find the variance of σ̂²_X:

Var(σ̂²_X) = Var[((n−1)/n) s²_X] = ((n−1)/n)² Var(s²_X) = ((n−1)/n)² · 2σ⁴_X/(n−1) = 2(n−1)σ⁴_X / n².

Note that this is less than the variance of s²_X, i.e. Var(σ̂²_X) < Var(s²_X). Since it has a lower variance, we would say that σ̂²_X is a more efficient estimator of variance than s²_X. Note that efficiency is a relative measure (i.e. we can only speak of the efficiency of an estimator in comparison to the efficiency of other estimators).¹⁶ For this reason, we often refer to the relative efficiency of two estimators, which is just equal to the ratio of their variances.

¹⁶ Although, in some cases, we may say that an estimator is efficient if it is more efficient than any other estimator.

This raises the important question of which estimator is better: although s²_X has a lower bias than σ̂²_X, it is not as efficient. Here we see a trade-off between bias and efficiency. A commonly used measure that takes both of these factors into consideration is the mean squared error (or MSE) of an estimator, which is simply equal to the variance of the estimator plus the square of the bias of the estimator. Formally, the MSE of an estimator is defined as its expected squared deviation from the true value; for example, MSE(s²_X) = E[(s²_X − σ²_X)²]. However, since this is equivalent to the estimator's variance plus its squared bias, we usually write it that way. (It would be a useful exercise to show this equivalence.) For s²_X, we have

MSE(s²_X) = Var(s²_X) + [Bias(s²_X)]² = 2σ⁴_X/(n−1) + [0]² = 2σ⁴_X/(n−1).

Before moving on.e.while for 2 σ ˆX . the usual estimator of covariance.Y = sX . sY X. sX sY of to calculate the sample rX. X sX.Y . denoted sample covariance sX.Y = . the the sample variance of sample standard deviation X . as sX. These dier from the that we considered above (where n was nite). (xi − X i=1 coecient of correlation where We can use the sample covariance of . asymptotic properties n sample properties The innite size (i. where form in 1. = which implies 2 MSE(s2 σX ). is dened as 1 n−1 n ¯ )(yi − Y ¯ ). 2 σ ˆX would be the preferred estimator. 16 . Of course. X ) > MSE(ˆ So. denoted X and Y rX.11 Asymptotic properties of estimators of an estimator are based on random samples of tends towards innity). 17 17 For this reason.Y . From n observations on each of two RVs. asymptotic properties are sometimes called large sample properties. n2 2 To compare the two estimators.Y . MSE is not the only basis on which to compare two (or more) estimators. let's consider an estimator of the covariance of two RVs. Although small working with random samples of innite size may seem unrealistic (and it is). we have 2 MSE(ˆ σX ) = = = = 2 2 2 Var(ˆ σX ) + [ˆ σX ] 4 −1 2 2(n − 1)σX + σ 2 n n X 4 σ4 2(n − 1)σX + X 2 n n2 4 (2n − 1)σX . we can use the dierence in their MSEs: 2 MSE(s2 σX ) X ) − MSE(ˆ 4 4 2σX (2n − 1)σX − n−1 n2 2 4 σ = n2 X > 0. called the and Y. is just the square root of is dened similarly. on the basis of MSE. asymptotic analysis provides us with an approximation of how estimators per- large random samples.

Before moving on, let's consider an estimator of the covariance of two RVs. From n observations on each of two RVs, X and Y, the usual estimator of covariance, called the sample covariance and denoted s_X,Y, is defined as

s_X,Y = (1/(n−1)) Σᵢ₌₁ⁿ (xᵢ − X̄)(yᵢ − Ȳ).

We can use the sample covariance of X and Y to calculate the sample coefficient of correlation,

r_X,Y = s_X,Y / (s_X s_Y),

where s_X, the sample standard deviation of X, is just the square root of the sample variance of X, and s_Y is defined similarly.

1.11 Asymptotic properties of estimators

The small sample properties of estimators considered above were based on random samples of finite size n. The asymptotic properties of an estimator are instead based on random samples of infinite size (i.e. where n tends towards infinity).¹⁷ Although working with random samples of infinite size may seem unrealistic (and it is), asymptotic analysis provides us with an approximation of how estimators perform in large random samples.

¹⁷ For this reason, asymptotic properties are sometimes called large sample properties.

In order to examine the asymptotic properties of estimators, it is necessary to understand the concept of convergence in probability. We begin by introducing some new notation. Let Xₙ be an RV (such as an estimator) which depends on n. Xₙ is said to converge in probability to a constant c if

lim_{n→∞} Prob(|Xₙ − c| > ε) = 0, for any ε > 0.

If Xₙ converges in probability to c (as above), we will usually write

Xₙ →ᵖ c.

Alternatively, we may say that c is the probability limit (or plim) of Xₙ and write

plim(Xₙ) = c.

Using the concept of convergence in probability, we can now consider some useful asymptotic properties of estimators. An estimator of a parameter is said to be consistent if it converges in probability to the true value of the parameter. A sufficient condition for an estimator to be consistent is that its MSE tends to zero as n approaches infinity. This condition is known as convergence in quadratic mean, and always implies convergence in probability.

For example, consider the sample mean, X̄, of the RV X. The MSE of X̄ is

MSE(X̄) = σ²_X / n,

since X̄ is unbiased. So

lim_{n→∞} MSE(X̄) = lim_{n→∞} σ²_X/n = 0,

which implies that

X̄ →ᵖ µ_X.

So, we can conclude that X̄ is a consistent estimator of the mean of X, µ_X.

Next, consider the estimators of the variance of the RV X, s²_X and σ̂²_X. We have

lim_{n→∞} MSE(s²_X) = lim_{n→∞} 2σ⁴_X/(n−1) = 0,

and

lim_{n→∞} MSE(σ̂²_X) = lim_{n→∞} (2n−1)σ⁴_X/n² = 0,

implying that

s²_X →ᵖ σ²_X  and  σ̂²_X →ᵖ σ²_X.

So, both s²_X and σ̂²_X are consistent estimators of the variance of X, σ²_X. Note that this is true even though σ̂²_X is a biased estimator of the variance of X. If the bias of an estimator goes to zero as n approaches infinity, it is said to be asymptotically unbiased. Recall that the bias of the estimator σ̂²_X was

Bias(σ̂²_X) = −(1/n) σ²_X.

As n approaches infinity, this becomes zero, and therefore σ̂²_X is asymptotically unbiased. Similarly, an estimator is said to be asymptotically efficient if, as n approaches infinity, it is more efficient than any other estimator.

Before concluding, we should make some mention of the asymptotic distribution of estimators (i.e. the distribution of estimators which are based on random samples of infinite size). This requires an understanding of the concept of convergence in distribution. The RV Xₙ, with distribution function Fₙ(x), is said to converge in distribution to the RV X, with distribution function F(x), if

lim_{n→∞} |Fₙ(x) − F(x)| = 0

at all continuity points of F(x). This is usually written as

Xₙ →ᵈ X.

In considering the asymptotic distribution of an estimator, a central limit theorem (or CLT) often comes in handy. For example, the asymptotic distribution of the sample mean can be found by using the Lindeberg-Levy CLT. This CLT states that if x₁, . . . , xₙ is a random sample of n observations on the RV X, where X has mean µ_X and variance σ²_X, then

n^{−1/2} Σᵢ₌₁ⁿ (xᵢ − µ_X) →ᵈ N(0, σ²_X).

Alternatively, this can be written as

√n (X̄ − µ_X) →ᵈ N(0, σ²_X),

since

n^{−1/2} Σᵢ₌₁ⁿ (xᵢ − µ_X) = n^{−1/2} (Σᵢ xᵢ − nµ_X) = n^{−1/2} (nX̄ − nµ_X) = √n (X̄ − µ_X),

where X̄ is as defined earlier. In this case, X̄ is said to be asymptotically normal. What is remarkable about this theorem is that it holds true even if the RV X does not follow a normal distribution (e.g. it may be uniformly distributed).
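The Lindeberg-Levy CLT is easy to visualize by simulation (a Python sketch, not part of the original notes): even though single die rolls are uniformly distributed, the standardized sample mean √n(X̄ − µ_X) behaves like a N(0, σ²_X) draw when n is large.

```python
import numpy as np

rng = np.random.default_rng(4)
mu, sigma2 = 3.5, 35 / 12          # mean and variance of a single die roll
n, reps = 500, 20_000

rolls = rng.integers(1, 7, size=(reps, n))
z = np.sqrt(n) * (rolls.mean(axis=1) - mu)   # sqrt(n) * (X-bar - mu)

print(z.mean(), z.var())   # approximately 0 and sigma^2_X = 35/12 ~ 2.92
# The proportion within 1.96 standard deviations is close to the normal 95%
print(np.mean(np.abs(z) <= 1.96 * np.sqrt(sigma2)))
```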

2 Regression Models

2.1 Introduction
Natural and social scientists use abstract models to describe the behaviour of the systems that they study. Typically, these models are formulated in mathematical terms, using one or more equations. For example, in an introductory macroeconomics course, the aggregate expenditures model is used to describe the behaviour of an economy. In this model, the equation

C = cD

is used to describe how the level of consumption, C, depends on the level of disposable income, D. Here, C is a linear function of D, whose slope is given by the parameter c, called the marginal propensity to consume (which is some arbitrary constant). Such a relationship is said to be deterministic: given the level of D, we can determine the exact level of C.

Deterministic relationships, however, rarely (if ever) hold in observed data. This is for two reasons: first, there will almost always be some error in measuring the variables involved; and second, there will very often be some other, non-systematic factors that affect the relationship but which cannot be easily identified. Of course, there may be some relationship between the variables of interest, but it will not be exact (i.e. completely deterministic). Such a relationship is said to be statistical.

To illustrate the difference, consider a model which posits that the variable Y is determined by some function of the variable X, i.e.

Y = m(X).     (1)

Now, suppose that we have n observations on each of the variables Y and X, which we denote y₁, . . . , yₙ and x₁, . . . , xₙ, respectively. For any given pair of observations, (yᵢ, xᵢ), we would like to allow the possibility that the relationship in (1) does not hold exactly. To do this, we add in what is called a disturbance term, denoted uᵢ, which gives us the regression model

yᵢ = m(xᵢ) + uᵢ,    i = 1, . . . , n.     (2)

By adding this disturbance term, we are able to capture any measurement errors or other non-systematic factors not identified in (1). On average, we would hope that each disturbance term is zero (otherwise, the model would contain some systematic error). However, for each pair of observations, (yᵢ, xᵢ), i = 1, . . . , n, the actual size of uᵢ will differ.

Since the disturbance term is, by nature, unpredictable, it is treated as an RV. Like any other RV, it has its own distribution. The most important characteristic of this distribution is that its expected value is zero, i.e. E(uᵢ) = 0 for each i = 1, . . . , n. Typically, we assume that each uᵢ, i = 1, . . . , n, is IID. This means, among other things, that each has the same variance, i.e. Var(uᵢ) = σ²_u for each i; the disturbance terms are then said to be homoskedastic (which literally means "same variance"). Given the IID assumption, it is common to write

uᵢ ∼ IID(0, σ²_u).

Note that, since yᵢ is a function of the RV uᵢ (see Equation (2)), yᵢ must also be an RV. Of course, yᵢ is also dependent on xᵢ. We haven't yet indicated whether xᵢ is an RV. This is because sometimes it is, and sometimes it isn't. If our observations are generated by a controlled experiment, x₁, . . . , xₙ are held constant (and are obviously not RVs). For example, consider an experiment where we give patients different doses of a blood pressure medication and measure the reduction in their blood pressures over some period of time. Here, x₁, . . . , xₙ are the doses of the medication given to patients 1, . . . , n, and y₁, . . . , yₙ are the reductions in the blood pressures of patients 1, . . . , n. If we were to repeat this experiment several times over (with, say, a new group of patients each time), keeping the doses unchanged, y₁, . . . , yₙ would be different each time (due to the disturbances u₁, . . . , uₙ). In this case, y₁, . . . , yₙ are RVs, but x₁, . . . , xₙ are not.

On the other hand, if our observations are generated by a field experiment, both x₁, . . . , xₙ and y₁, . . . , yₙ can be considered RVs (since we don't know their values until we complete the experiment). As an example, consider an experiment where we observe different workers' education and income levels (perhaps we want to see if higher levels of education result in higher levels of income). Here, x₁, . . . , xₙ are the education levels of workers 1, . . . , n, and y₁, . . . , yₙ are the income levels of workers 1, . . . , n. If we were to repeat this experiment several times over (with a new group of workers each time), we would expect both x₁, . . . , xₙ and y₁, . . . , yₙ to differ each time, and therefore both are RVs.

In the regression model (2), yᵢ is usually referred to as the dependent variable, and xᵢ is usually referred to as the independent variable (or explanatory variable).

Up to this point, we have been dealing with what is called a bivariate regression model, since it involves just two variables: yᵢ and xᵢ. Of course, most regression models used in the natural and social sciences involve more than just two variables. A model of this type, which has one dependent variable and more than one independent variable, is referred to as a multivariate regression model.

A multivariate regression model with k different independent variables takes the form

yᵢ = m(x₁,ᵢ, x₂,ᵢ, . . . , xₖ,ᵢ) + uᵢ,    i = 1, . . . , n,

where x₁,₁, . . . , x₁,ₙ are observations on the variable X₁, x₂,₁, . . . , x₂,ₙ are observations on the variable X₂, and so on.

It is also possible to have a univariate regression model, where only one variable is involved. Such models are usually only relevant when the variable involved is in the form of a time series, where the dependent variable is regressed on lagged values of itself. An example of such a model is what is known as an autoregressive process. Letting y₁, . . . , yₙ be observations on the variable Y over time, a pth-order autoregressive process (or AR(p) process) takes the form

y_t = m(y_{t−1}, . . . , y_{t−p}) + u_t,    t = 1, . . . , n.

For ease of exposition, the rest of this lecture will focus on bivariate regression models, but all of the results are perfectly applicable to multivariate and univariate regression models.

2.2 Estimation of regression models

Usually, the exact form of the function m(·) is unknown. Therefore, the task of the statistician is to come up with some estimate of this function, which we will denote m̂(·). Clearly, any estimate of m(·) will be based on the observations y₁, . . . , yₙ (and x₁, . . . , xₙ, which also may be RVs). Therefore, the estimate m̂(·), like any other estimator, will also be an RV. Exactly how m(·) is estimated, and how m̂(·) is distributed (which will depend on the method of estimation), will occupy us for the remainder of the course.

Assuming we have an estimate m̂(·) in hand, it can be used in the following way. Given a specific observation xᵢ, the estimate m̂(xᵢ) will give us an estimate of yᵢ, which we will denote ŷᵢ, i.e.

ŷᵢ = m̂(xᵢ).

Clearly, the estimate ŷᵢ will not always be equal to the observed value yᵢ. So, we add in what is called an error term (or residual), denoted ûᵢ, and write

yᵢ = m̂(xᵢ) + ûᵢ.

Rearranging, the error term is equal to the difference between the observed and estimated values of yᵢ:

ûᵢ = yᵢ − m̂(xᵢ) = yᵢ − ŷᵢ.

Clearly, any estimator of m(·) should have the property that it minimizes such errors. This will be a guiding principle in developing the estimators we are interested in.

2.3 Specification of regression models

While the exact form of m(·) is usually unknown, it is common for the statistician (or even the theorist) to assume some functional form. If m(·) is given a specific form, the regression model in (2) is referred to as a parametric regression model (since the assumed function must involve some parameters). On the other hand, if m(·) is left unspecified, the regression model in (2) is referred to as a nonparametric regression model. We will consider nonparametric regression models later in the course; for now, we will focus on parametric regression models.

The most common form assumed for regression models involves specifying m(·) as an affine function,¹⁸ i.e.

m(xᵢ) = α + βxᵢ,

which gives us the linear regression model

yᵢ = α + βxᵢ + uᵢ.

Here, α and β are parameters to be estimated. Denoting these estimates α̂ and β̂, we have

m̂(xᵢ) = α̂ + β̂xᵢ.

Arriving at these estimates will be the focus of the next topic.

¹⁸ An affine function, g(x), takes the form g(x) = a + bx, where a and b are any real numbers.

Of course, not all parametric regression models are linear. A nonlinear regression model is one in which m(·) is nonlinear. For example, if m(xᵢ) = xᵢ^γ, we have the nonlinear regression model

yᵢ = xᵢ^γ + uᵢ,

where γ is a parameter to be estimated. Denoting this estimate γ̂, we have

m̂(xᵢ) = xᵢ^γ̂.

Estimating nonlinear regression models is beyond the scope of this course.

3 Ordinary least squares

3.1 Estimating bivariate linear regression models

The most common method for estimating linear regression models (both bivariate and multivariate) is the method of ordinary least squares (or OLS). This method is based on minimizing the sum of squared residuals (or SSR) from the estimated regression model.

If we have the bivariate linear regression model

yᵢ = α + βxᵢ + uᵢ,    i = 1, . . . , n,     (3)

then the ith residual from the estimate of this model is

ûᵢ = yᵢ − α̂ − β̂xᵢ.

Summing the square of each of these terms (from i = 1 to n), we have

Σᵢ₌₁ⁿ ûᵢ² = Σᵢ₌₁ⁿ (yᵢ − α̂ − β̂xᵢ)².

From this expression, it is evident that the sum of squared residuals is a function of both α̂ and β̂. Since the method of OLS is based on minimizing this function, we normally write

SSR(α̂, β̂) = Σᵢ₌₁ⁿ (yᵢ − α̂ − β̂xᵢ)².

Therefore, we are faced with a very simple optimization problem: minimize the function SSR(α̂, β̂) by choosing α̂ and β̂, i.e.

min_{α̂, β̂} SSR(α̂, β̂) = Σᵢ₌₁ⁿ (yᵢ − α̂ − β̂xᵢ)².

The first-order, necessary conditions for this optimization problem are

∂SSR(α̂, β̂)/∂α̂ = −2 Σᵢ₌₁ⁿ (yᵢ − α̂ − β̂xᵢ) = 0,     (4)

and

∂SSR(α̂, β̂)/∂β̂ = −2 Σᵢ₌₁ⁿ (yᵢ − α̂ − β̂xᵢ)(xᵢ) = 0.     (5)

Equation (4) implies

Σᵢ₌₁ⁿ (yᵢ − α̂ − β̂xᵢ) = 0.

Carrying through the summation operator, we have

Σᵢ₌₁ⁿ yᵢ − nα̂ − β̂ Σᵢ₌₁ⁿ xᵢ = 0.

Finally, solving for α̂ gives

α̂ = (1/n) Σᵢ₌₁ⁿ yᵢ − β̂ (1/n) Σᵢ₌₁ⁿ xᵢ,

which can be rewritten as

α̂ = ȳ − β̂x̄,     (6)

where ȳ is the sample mean of the observations y₁, . . . , yₙ, i.e. ȳ = (1/n) Σᵢ yᵢ, and x̄ is the sample mean of the observations x₁, . . . , xₙ, i.e. x̄ = (1/n) Σᵢ xᵢ.

Moving on to Equation (5), we have

Σᵢ₌₁ⁿ (yᵢ − α̂ − β̂xᵢ)(xᵢ) = 0.

Multiplying through by xᵢ, we have

Σᵢ₌₁ⁿ (yᵢxᵢ − α̂xᵢ − β̂xᵢ²) = 0.

Next, carrying through the summation operator, we have

Σᵢ₌₁ⁿ yᵢxᵢ − α̂ Σᵢ₌₁ⁿ xᵢ − β̂ Σᵢ₌₁ⁿ xᵢ² = 0.

Now, substituting for α̂ from Equation (6) gives

Σᵢ₌₁ⁿ yᵢxᵢ − (ȳ − β̂x̄) Σᵢ₌₁ⁿ xᵢ − β̂ Σᵢ₌₁ⁿ xᵢ² = 0.

Expanding the term in brackets, we have

Σᵢ₌₁ⁿ yᵢxᵢ − ȳ Σᵢ₌₁ⁿ xᵢ + β̂x̄ Σᵢ₌₁ⁿ xᵢ − β̂ Σᵢ₌₁ⁿ xᵢ² = 0.

Finally, solving for β̂, we have

β̂ = [Σᵢ₌₁ⁿ yᵢxᵢ − ȳ Σᵢ₌₁ⁿ xᵢ] / [Σᵢ₌₁ⁿ xᵢ² − x̄ Σᵢ₌₁ⁿ xᵢ].     (7)

It is often desirable to write β̂ as

β̂ = [Σᵢ₌₁ⁿ (yᵢ − ȳ)(xᵢ − x̄)] / [Σᵢ₌₁ⁿ (xᵢ − x̄)²].     (8)

To show that Equations (7) and (8) are equivalent, we show that the numerator and denominator of each are equivalent. First, expand the numerator in Equation (8) and rearrange:

Σᵢ (yᵢ − ȳ)(xᵢ − x̄) = Σᵢ (yᵢxᵢ − yᵢx̄ − ȳxᵢ + ȳx̄)
= Σᵢ yᵢxᵢ − x̄ Σᵢ yᵢ − ȳ Σᵢ xᵢ + nȳx̄
= Σᵢ yᵢxᵢ − x̄(nȳ) − ȳ Σᵢ xᵢ + nȳx̄
= Σᵢ yᵢxᵢ − ȳ Σᵢ xᵢ,

which is exactly the numerator in Equation (7). Second, expand the denominator in Equation (8) and rearrange:

Σᵢ (xᵢ − x̄)² = Σᵢ (xᵢ² − 2xᵢx̄ + x̄²)
= Σᵢ xᵢ² − 2x̄ Σᵢ xᵢ + nx̄²
= Σᵢ xᵢ² − 2nx̄² + nx̄²
= Σᵢ xᵢ² − nx̄²
= Σᵢ xᵢ² − x̄ Σᵢ xᵢ,

which is exactly the denominator in Equation (7).

At other times, it will be convenient to rewrite β̂ as

β̂ = [Σᵢ₌₁ⁿ (xᵢ − x̄)yᵢ] / [Σᵢ₌₁ⁿ (xᵢ − x̄)xᵢ].     (9)

To show that this is equivalent to Equation (8) (and therefore Equation (7)), we again show that the numerator and denominator of each are equivalent. To do so, it is helpful to realize that

Σᵢ₌₁ⁿ (xᵢ − x̄) = Σᵢ xᵢ − nx̄ = nx̄ − nx̄ = 0,

and similarly, Σᵢ₌₁ⁿ (yᵢ − ȳ) = 0. Now, the numerator in Equation (8) can be expanded to give

Σᵢ (xᵢ − x̄)(yᵢ − ȳ) = Σᵢ [(xᵢ − x̄)yᵢ − (xᵢ − x̄)ȳ]
= Σᵢ (xᵢ − x̄)yᵢ − ȳ Σᵢ (xᵢ − x̄)
= Σᵢ (xᵢ − x̄)yᵢ − 0
= Σᵢ (xᵢ − x̄)yᵢ,

which is the numerator in Equation (9). Similarly, the denominator in Equation (8) can be expanded to give

Σᵢ (xᵢ − x̄)² = Σᵢ [(xᵢ − x̄)xᵢ − (xᵢ − x̄)x̄]
= Σᵢ (xᵢ − x̄)xᵢ − x̄ Σᵢ (xᵢ − x̄)
= Σᵢ (xᵢ − x̄)xᵢ − 0
= Σᵢ (xᵢ − x̄)xᵢ,

which is the denominator in Equation (9).

So, we have three different (but equivalent) expressions for β̂: Equations (7), (8), and (9). We most often use Equation (8), but, as we will see in the next section, Equation (9) will often come in handy.
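As an illustration (a Python/NumPy sketch, not part of the original notes), Equation (8) and its equivalent forms can be computed directly from data; the simulated example below recovers values close to the true α and β.

```python
import numpy as np

rng = np.random.default_rng(5)
n, alpha, beta = 200, 1.0, 2.0
x = rng.uniform(0, 10, size=n)
y = alpha + beta * x + rng.normal(0, 1, size=n)   # bivariate linear model (3)

# Equation (8): beta-hat = sum (y - ybar)(x - xbar) / sum (x - xbar)^2
beta_hat = np.sum((y - y.mean()) * (x - x.mean())) / np.sum((x - x.mean()) ** 2)
# Equation (6): alpha-hat = ybar - beta-hat * xbar
alpha_hat = y.mean() - beta_hat * x.mean()

print(alpha_hat, beta_hat)   # close to 1.0 and 2.0

# Equation (9) gives the same slope estimate
beta_hat_9 = np.sum((x - x.mean()) * y) / np.sum((x - x.mean()) * x)
print(np.isclose(beta_hat, beta_hat_9))   # True
```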

3.2 Properties of OLS estimators of bivariate linear regression models

The properties of OLS estimators depend on the properties of the error terms in the linear regression model we are estimating. For now, we will assume that, for each uᵢ, i = 1, . . . , n,

uᵢ ∼ IID(0, σ²_u).

In addition, we will also assume that x₁, . . . , xₙ are not RVs. This assumption does not have nearly as large an impact on the properties of OLS estimators as the IID assumption, but it will make the following analysis a little easier. As we progress, we will relax this assumption and see how the properties of OLS estimators change.

The first question we want to ask is whether or not OLS estimators are unbiased. Let's start with β̂. To do so, we substitute for yᵢ in Equation (9) from Equation (3):

β̂ = [Σᵢ (xᵢ − x̄)(α + βxᵢ + uᵢ)] / [Σᵢ (xᵢ − x̄)xᵢ].

Multiplying through, we have

β̂ = [α Σᵢ (xᵢ − x̄) + β Σᵢ (xᵢ − x̄)xᵢ + Σᵢ (xᵢ − x̄)uᵢ] / [Σᵢ (xᵢ − x̄)xᵢ]
= β + [Σᵢ (xᵢ − x̄)uᵢ] / [Σᵢ (xᵢ − x̄)xᵢ],     (10)

since Σᵢ (xᵢ − x̄) = 0. Next, we find the expected value of β̂:

E(β̂) = E{ β + [Σᵢ (xᵢ − x̄)uᵢ] / [Σᵢ (xᵢ − x̄)xᵢ] }
= β + [Σᵢ (xᵢ − x̄)E(uᵢ)] / [Σᵢ (xᵢ − x̄)xᵢ]
= β,

since E(uᵢ) = 0. So, β̂ is unbiased.

Let's now move on to α̂. From Equation (6), we have

α̂ = (1/n) Σᵢ yᵢ − β̂x̄.

Substituting for yᵢ from Equation (3), we have

α̂ = (1/n) Σᵢ (α + βxᵢ + uᵢ) − β̂x̄
= (1/n)(nα + β Σᵢ xᵢ + Σᵢ uᵢ) − β̂x̄
= α + βx̄ + (1/n) Σᵢ uᵢ − β̂x̄
= α + x̄(β − β̂) + (1/n) Σᵢ uᵢ.     (11)

Taking the expected value of α̂, we have

E(α̂) = E[α + x̄(β − β̂) + (1/n) Σᵢ uᵢ]
= α + x̄[β − E(β̂)] + (1/n) Σᵢ E(uᵢ)
= α + x̄(β − β)
= α.

So, since β̂ is unbiased, α̂ is also unbiased.

Another important property of OLS estimators is their variance. Since β̂ is unbiased, we have

Var(β̂) = E[(β̂ − β)²].

From Equation (10), this means

Var(β̂) = E{ [Σᵢ (xᵢ − x̄)uᵢ / Σᵢ (xᵢ − x̄)xᵢ]² }.

To make handling the term in brackets a little easier, let's introduce some new notation. Let

kᵢ = (xᵢ − x̄) / Σᵢ₌₁ⁿ (xᵢ − x̄)²,

so that

Var(β̂) = E[ (Σᵢ kᵢuᵢ)² ].

Expanding this, we have

Var(β̂) = E[ Σᵢ kᵢ²uᵢ² + 2 Σᵢ₌₁ⁿ⁻¹ Σⱼ₌ᵢ₊₁ⁿ kᵢkⱼuᵢuⱼ ].

Carrying the expectation through,

Var(β̂) = Σᵢ kᵢ²E(uᵢ²) + 2 Σᵢ₌₁ⁿ⁻¹ Σⱼ₌ᵢ₊₁ⁿ kᵢkⱼE(uᵢuⱼ).

Two things should be apparent here. First,

E(uᵢ²) = σ²_u,

since σ²_u = Var(uᵢ) = E[uᵢ − E(uᵢ)]² = E(uᵢ²). Second, since we have assumed that each uᵢ is IID (and therefore statistically independent from one another), we have

E(uᵢuⱼ) = E(uᵢ)E(uⱼ) = 0, for all i ≠ j.

So,

Var(β̂) = σ²_u Σᵢ kᵢ²
= σ²_u Σᵢ (xᵢ − x̄)² / [Σᵢ (xᵢ − x̄)²]²
= σ²_u / Σᵢ₌₁ⁿ (xᵢ − x̄)².
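A small Monte Carlo experiment (a Python sketch, not part of the original notes) can be used to check both results: over repeated samples with x₁, . . . , xₙ held fixed, the average of β̂ is close to β and its sampling variance is close to σ²_u / Σ(xᵢ − x̄)².

```python
import numpy as np

rng = np.random.default_rng(6)
n, alpha, beta, sigma_u = 50, 1.0, 2.0, 1.5
x = rng.uniform(0, 10, size=n)          # held fixed across replications
reps = 20_000

u = rng.normal(0, sigma_u, size=(reps, n))
y = alpha + beta * x + u                # one row per replication

xc = x - x.mean()
beta_hats = (y @ xc) / (xc @ x)         # Equation (9) applied to every replication

print(beta_hats.mean())                 # ~ 2.0: unbiasedness
print(beta_hats.var())                  # ~ sigma_u^2 / sum (x - xbar)^2
print(sigma_u**2 / np.sum(xc**2))
```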

We now move on to α̂. Since α̂ is unbiased, we have

Var(α̂) = E[(α̂ − α)²].

From Equation (11), this means

Var(α̂) = E{ [x̄(β − β̂) + (1/n) Σᵢ uᵢ]² }
= x̄² E[(β̂ − β)²] + 2x̄ E[(β − β̂)(1/n) Σᵢ uᵢ] + (1/n²) E[(Σᵢ uᵢ)²]
= x̄² Var(β̂) + 0 + (1/n²)(nσ²_u)
= σ²_u [ x̄² / Σᵢ (xᵢ − x̄)² + 1/n ],

where the cross term vanishes because E[(β̂ − β) Σᵢ uᵢ] = σ²_u Σᵢ kᵢ = 0, and E[(Σᵢ uᵢ)²] = Σᵢ E(uᵢ²) = nσ²_u under the IID assumption.

Given the variance of OLS estimators, we might want to ask how efficient they are. The Gauss-Markov theorem states that, in the class of linear, unbiased estimators, the OLS estimators are in fact the most efficient. This property is often stated as BLUE (Best Linear Unbiased Estimator); by best, we mean most efficient. Here, we will prove this only for β̂, but the proof for α̂ is very similar.

Using the kᵢ notation we introduced above, Equation (9) implies

β̂ = Σᵢ₌₁ⁿ kᵢyᵢ.

This should make it clear that β̂ is a linear estimator: the above function is linear in yᵢ. Now, let's consider some other linear, unbiased estimator of β, call it β̃:

β̃ = Σᵢ₌₁ⁿ gᵢyᵢ,

where gᵢ = kᵢ + wᵢ, and wᵢ is some non-zero term which depends on x₁, . . . , xₙ. This implies that

β̃ = Σᵢ (kᵢ + wᵢ)yᵢ = Σᵢ kᵢyᵢ + Σᵢ wᵢyᵢ = β̂ + Σᵢ wᵢyᵢ.     (12)

Taking expectations, we have

E(β̃) = E(β̂ + Σᵢ wᵢyᵢ) = β + E(Σᵢ wᵢyᵢ).

Substituting for yᵢ from Equation (3), we have

E(β̃) = β + E[Σᵢ wᵢ(α + βxᵢ + uᵢ)]
= β + α Σᵢ wᵢ + β Σᵢ wᵢxᵢ + Σᵢ wᵢE(uᵢ)
= β + α Σᵢ wᵢ + β Σᵢ wᵢxᵢ,

since E(uᵢ) = 0. In order for β̃ to be unbiased, we must therefore have

Σᵢ₌₁ⁿ wᵢ = 0  and  Σᵢ₌₁ⁿ wᵢxᵢ = 0,

since, in general, α and β are non-zero. This allows us to rewrite the second term on the right-hand side of Equation (12) as

Σᵢ wᵢyᵢ = Σᵢ wᵢ(α + βxᵢ + uᵢ) = α Σᵢ wᵢ + β Σᵢ wᵢxᵢ + Σᵢ wᵢuᵢ = Σᵢ wᵢuᵢ.

Also, note that Equation (10) implies

β̂ = β + Σᵢ kᵢuᵢ.

Substituting these last two results into Equation (12), we have

β̃ = β + Σᵢ kᵢuᵢ + Σᵢ wᵢuᵢ = β + Σᵢ (kᵢ + wᵢ)uᵢ.

Finally, the variance of β̃ is

Var(β̃) = E{ [β̃ − E(β̃)]² }
= E{ [Σᵢ (kᵢ + wᵢ)uᵢ]² }
= Σᵢ (kᵢ + wᵢ)² E(uᵢ²)
= σ²_u Σᵢ (kᵢ² + 2kᵢwᵢ + wᵢ²)
= σ²_u Σᵢ kᵢ² + 2σ²_u Σᵢ kᵢwᵢ + σ²_u Σᵢ wᵢ².

The second term on the right-hand side of this equation is

2σ²_u Σᵢ kᵢwᵢ = 2σ²_u [Σᵢ xᵢwᵢ − x̄ Σᵢ wᵢ] / Σᵢ (xᵢ − x̄)² = 0,

since Σᵢ xᵢwᵢ = 0 and Σᵢ wᵢ = 0. Therefore, we have

Var(β̃) = Var(β̂) + σ²_u Σᵢ wᵢ²,

which implies

Var(β̃) > Var(β̂),

since wᵢ is non-zero. This proves the Gauss-Markov theorem for β̂. We won't go through it here, but the proof for α̂ is quite similar.

3.3 Estimating multivariate linear regression models

Now consider the multivariate linear regression model¹⁹

yᵢ = β₁ + β₂x₂,ᵢ + . . . + βₖxₖ,ᵢ + uᵢ,    i = 1, . . . , n.

¹⁹ Note that the bivariate linear regression model considered above is just a special case of the multivariate linear regression model considered here. For this reason, what follows is applicable to both bivariate and multivariate linear regression models.

To make the analysis of such regression models much easier, we typically express this in matrix form:²⁰

y = Xβ + u,     (13)

where y and u are n-vectors,

y = (y₁, . . . , yₙ)′,  u = (u₁, . . . , uₙ)′,

β is a k-vector, β = (β₁, . . . , βₖ)′, and X is an n × k matrix, defined as

X = [ 1  x₂,₁  · · ·  xₖ,₁
      ⋮   ⋮           ⋮
      1  x₂,ₙ  · · ·  xₖ,ₙ ].

²⁰ In these notes, we make the assumption that the reader is reasonably well-versed in matrix algebra.

Letting û denote the residuals from the estimate of this model, we have

û = y − Xb,

where b is an estimator of β. The sum of squared residuals is thus

SSR(b) = û′û
= (y − Xb)′(y − Xb)
= y′y − y′Xb − b′X′y + b′X′Xb
= y′y − 2b′X′y + b′X′Xb.

Here, we make use of the fact that y′Xb is a scalar (this can be confirmed by checking its dimensions), and is therefore equal to its transpose, b′X′y.

As we saw in the bivariate case, the method of OLS is based on minimizing this function. Our optimization problem is thus

min_b SSR(b) = y′y − 2b′X′y + b′X′Xb.

The first-order necessary condition for this optimization problem is

∂SSR(b)/∂b = −2X′y + 2X′Xb = 0,

which implies

X′y = X′Xb.

Finally, solving for b, we have

b = (X′X)⁻¹X′y.     (14)
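In NumPy (a sketch, not part of the original notes), Equation (14) can be computed directly; in practice one solves the normal equations X′Xb = X′y (or uses a least-squares routine) rather than forming the inverse explicitly.

```python
import numpy as np

rng = np.random.default_rng(7)
n, beta_true = 200, np.array([1.0, 2.0, -0.5])

X = np.column_stack([np.ones(n),                   # intercept column of ones
                     rng.uniform(0, 10, size=n),   # x_2
                     rng.normal(0, 3, size=n)])    # x_3
u = rng.normal(0, 1, size=n)
y = X @ beta_true + u                              # model (13)

# b = (X'X)^{-1} X'y, computed by solving the normal equations
b = np.linalg.solve(X.T @ X, X.T @ y)
print(b)                                           # close to [1.0, 2.0, -0.5]

# np.linalg.lstsq gives the same estimate
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(b, b_lstsq))                     # True
```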

3.4 Properties of OLS estimators of multivariate linear regression models²¹

²¹ As noted above, what follows is also applicable to bivariate linear regression models (since the bivariate linear regression model is just a special case of the multivariate linear regression model).

In order to analyze the properties of b, we again need to make some assumptions about the error terms. As before, we will assume that each uᵢ is IID with mean zero and variance σ²_u. This means that the n-vector u has expected value

E(u) = (E(u₁), . . . , E(uₙ))′ = 0,

where 0 is an n-vector of zeros. The variance of u is

Var(u) = E([u − E(u)][u − E(u)]′) = E(uu′),

which is called the error covariance matrix. Since each uᵢ is IID with variance σ²_u, we have E(uᵢ²) = σ²_u and E(uᵢuⱼ) = 0 for all i ≠ j. Therefore,

Var(u) = [ σ²_u  0    · · ·  0
           0    σ²_u  · · ·  0
           ⋮            ⋱    ⋮
           0    0    · · ·  σ²_u ]  = σ²_u Iₙ,

where Iₙ is an n × n identity matrix. Given these assumptions, we typically write

u ∼ IID(0, σ²_u Iₙ).

Also, to make things easier, we will continue to assume that X is not an RV. Later in the course, we will see what the consequences are if we relax this assumption.

Let's start the analysis by asking whether b is unbiased or not. Substituting Equation (13) into Equation (14), we have

b = (X′X)⁻¹X′(Xβ + u) = (X′X)⁻¹X′Xβ + (X′X)⁻¹X′u = β + (X′X)⁻¹X′u.     (15)

Now, taking expectations, we have

E(b) = E[β + (X′X)⁻¹X′u] = β + (X′X)⁻¹X′E(u) = β,

since E(u) = 0. So b is unbiased.

Let's now move on to the variance of b. Given that b is unbiased, we have

Var(b) = E([b − E(b)][b − E(b)]′) = E[(b − β)(b − β)′].

Substituting from Equation (15), we have

Var(b) = E{ [(X′X)⁻¹X′u][(X′X)⁻¹X′u]′ }
= E[(X′X)⁻¹X′uu′X(X′X)⁻¹]
= (X′X)⁻¹X′E(uu′)X(X′X)⁻¹
= σ²_u (X′X)⁻¹X′X(X′X)⁻¹
= σ²_u (X′X)⁻¹.     (16)

Having derived the variance of b, we now want to show that the Gauss-Markov theorem holds here as well. Consider some other linear, unbiased estimator of β, call it B:

B = Gy,

where G = W + (X′X)⁻¹X′, and W is some non-zero matrix which depends on X. The expected value of B is thus

E(B) = E(Gy) = GE(y).

Note that the expected value of y is

E(y) = E(Xβ + u) = Xβ + E(u) = Xβ,

so E(B) = GXβ. This means that, in order for B to be unbiased, we must have

GX = Iₖ,

where Iₖ is a k × k identity matrix (check the dimensions of GX to confirm this), and therefore

WX = [G − (X′X)⁻¹X′]X = GX − (X′X)⁻¹X′X = Iₖ − Iₖ = 0.

Given this condition, notice that

B = Gy = G(Xβ + u) = GXβ + Gu = β + Gu.

The variance of B is therefore

Var(B) = E([B − E(B)][B − E(B)]′)
= E[(β + Gu − β)(β + Gu − β)′]
= E[(Gu)(Gu)′]
= GE(uu′)G′
= σ²_u GG′
= σ²_u [W + (X′X)⁻¹X′][W + (X′X)⁻¹X′]′
= σ²_u [WW′ + WX(X′X)⁻¹ + (X′X)⁻¹X′W′ + (X′X)⁻¹X′X(X′X)⁻¹]
= σ²_u WW′ + σ²_u (X′X)⁻¹,

since WX = 0 (and X′W′ = (WX)′ = 0). Therefore,

Var(B) > Var(b),

since W is a non-zero matrix, which proves the Gauss-Markov theorem.

Before moving on, let's briefly consider some of the asymptotic properties of OLS estimators.²² Arguably the most important of these is consistency. Recall that a sufficient condition for an estimator to be consistent is that its MSE approaches zero as n approaches infinity. Since b is unbiased, its MSE is equal to its variance,

MSE(b) = σ²_u (X′X)⁻¹.

²² We didn't do this when we focused on the bivariate linear regression model, but, once again, all of what follows is applicable to that special case.

Now, consider the matrix X′X. Since each element of this matrix is a sum of n numbers, we would expect that each such element would approach infinity as n approaches infinity. However, 1/n times each of these elements would approach some constant. Given this, it would be safe to assume that

lim_{n→∞} (1/n) X′X = Q,     (17)

where Q is a finite, non-singular matrix.

We can use this assumption to show that b is consistent by rewriting its MSE as

MSE(b) = (1/n) σ²_u [(1/n) X′X]⁻¹.

Notice that this is equivalent to our previous expression for the MSE of b, since the 1/n terms cancel each other out. Taking the limit as n approaches infinity, we have

lim_{n→∞} MSE(b) = lim_{n→∞} (1/n) σ²_u · lim_{n→∞} [(1/n) X′X]⁻¹ = (0)Q⁻¹ = 0,

which implies

b →ᵖ β.

So, b is consistent.

Finally, let's consider the asymptotic distribution of b.²³ To do so, begin by rewriting Equation (15) as

√n (b − β) = [(1/n) X′X]⁻¹ n^{−1/2} X′u.     (18)

Notice that, as above, the 1/n terms cancel each other out, leaving us with Equation (15). Now, let Xᵢ be the ith row of the matrix X. Using a CLT, we have the result that

n^{−1/2} Σᵢ₌₁ⁿ [Xᵢ′uᵢ − E(Xᵢ′uᵢ)] →ᵈ N(0, lim_{n→∞} (1/n) Σᵢ₌₁ⁿ Var(Xᵢ′uᵢ)).

The expected value of Xᵢ′uᵢ is

E(Xᵢ′uᵢ) = Xᵢ′E(uᵢ) = 0,

so its variance is

Var(Xᵢ′uᵢ) = E[(Xᵢ′uᵢ)(Xᵢ′uᵢ)′] = Xᵢ′E(uᵢuᵢ′)Xᵢ = σ²_u Xᵢ′Xᵢ.

Using Assumption (17), we thus have

lim_{n→∞} (1/n) Σᵢ₌₁ⁿ Var(Xᵢ′uᵢ) = lim_{n→∞} (1/n) Σᵢ₌₁ⁿ σ²_u Xᵢ′Xᵢ = σ²_u lim_{n→∞} (1/n) X′X = σ²_u Q.

²³ In small samples, we can't generally say anything about the distribution of b unless we know the exact distribution of X and u. For example, if X is not an RV, and u is normally distributed, then b is also normally distributed, since it is a linear function of u (see Equation (15)).

Therefore, we can write

n^{−1/2} X′u →ᵈ N(0, σ²_u Q).

Combining this with Assumption (17), Equation (18) means that

√n (b − β) →ᵈ N(0, σ²_u Q⁻¹QQ⁻¹) = N(0, σ²_u Q⁻¹).

That is, b is asymptotically normal.

3.5 Estimating the variance of the error terms of linear regression models

Even if we are willing to make the assumption that each of the error terms is IID with mean zero and variance σ²_u, we do not usually know the value of σ²_u, so we typically would like to get some estimate of it. We start here by proposing the following estimator, which we denote s²_u:

s²_u = û′û / (n − k).

While it may not be immediately clear where this comes from, we can show that it is an unbiased estimator of σ²_u. Let's start by substituting Equation (14) into the definition of û:

û = y − Xb = y − X(X′X)⁻¹X′y = (Iₙ − X(X′X)⁻¹X′)y = My,

where M = Iₙ − X(X′X)⁻¹X′. It is easy to show that the matrix M is both symmetric and idempotent, i.e. that M′ = M and MM = M.²⁴

²⁴ The proof of this will be left as an assignment question.

Substituting Equation (13) into the above expression for û, we have

    û = My
      = M(Xβ + u)
      = MXβ + Mu
      = Mu,

since

    MX = [I_n − X(X'X)^{-1}X']X = X − X(X'X)^{-1}X'X = X − X = 0.

Taking expectations of s_u², we have

    E(s_u²) = E(û'û)/(n − k)
            = E[(Mu)'(Mu)]/(n − k)
            = E(u'M'Mu)/(n − k)
            = E(u'Mu)/(n − k).

Notice that u'Mu is a scalar (check the dimensions to confirm this), and is therefore equal to its trace, i.e.

    u'Mu = Tr(u'Mu).

So, we can write

    E(s_u²) = E[Tr(u'Mu)]/(n − k)
            = E[Tr(Muu')]/(n − k)
            = Tr[ME(uu')]/(n − k)
            = σ_u²Tr(M)/(n − k)
            = σ_u²,

since

    Tr(M) = Tr[I_n − X(X'X)^{-1}X']

          = Tr(I_n) − Tr[X(X'X)^{-1}X']
          = n − Tr[(X'X)^{-1}X'X]
          = n − Tr(I_k)
          = n − k.

So, we conclude that s_u² is an unbiased estimator of σ_u².

Note that the variance of b, Var(b) = σ_u²(X'X)^{-1}, depends on σ_u². Since s_u² is an unbiased estimator of σ_u², we can use s_u² to get an estimate of the variance of b, which we denote by Var-hat(b). This estimate is found by replacing σ_u² by s_u² in the expression for Var(b), i.e.

    Var-hat(b) = s_u²(X'X)^{-1}.

The square root of Var-hat(b) is often called the standard error of b.
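As a quick illustration of these formulas, the sketch below (my addition, not from the notes) computes b, s_u², Var-hat(b), and the standard errors, taken here as the square roots of the diagonal elements of Var-hat(b). It assumes y and X are NumPy arrays already in memory, with X containing a column of ones for the intercept.

    import numpy as np

    def ols_with_standard_errors(y, X):
        """OLS estimates, s_u^2, estimated covariance matrix and standard errors."""
        n, k = X.shape
        XtX_inv = np.linalg.inv(X.T @ X)
        b = XtX_inv @ X.T @ y                 # b = (X'X)^{-1}X'y
        u_hat = y - X @ b                     # residuals
        s2_u = (u_hat @ u_hat) / (n - k)      # s_u^2 = u'u/(n - k)
        var_b = s2_u * XtX_inv                # Var-hat(b) = s_u^2 (X'X)^{-1}
        se_b = np.sqrt(np.diag(var_b))        # standard errors
        return b, s2_u, var_b, se_b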

4 Hypothesis testing

4.1 Testing linear restrictions

Having established the properties of the OLS estimator, we now move on to showing how this estimator can be used to test various hypotheses. For illustrative purposes, let's start by considering the following linear regression model:

    y_i = β₁ + β₂x_{2,i} + β₃x_{3,i} + u_i,  i = 1, ..., n.    (19)

Some examples of null hypotheses we might wish to test are:

1. H₀: β₂ = β_{2,0}, where β_{2,0} is some specified value (often zero), and

2. H₀: β₂ = β₃ = 0.

Both of these examples fit into the general linear framework

    H₀: Rβ = r,

where, here, β = (β₁ β₂ β₃)', and for

1. R = (0 1 0) and r = β_{2,0}, and for

2. R = [ 0 1 0
         0 0 1 ]  and  r = (0 0)'.

Written this way, it can be seen that each of these null hypotheses imposes some linear restriction(s) on the original regression model in (19), which we will call the unrestricted model. Each of these null hypotheses can therefore be rewritten in terms of a restricted model:

1. H₀: y_i = β₁ + β_{2,0}x_{2,i} + β₃x_{3,i} + u_i, and

2. H₀: y_i = β₁ + u_i.

Note that the first example above imposes a single restriction (i.e. q = 1), while the second imposes two restrictions (i.e. q = 2), where q is the number of restrictions. In general, R will be a q×k matrix, and r will be a q-vector.

In order to test any such linear restriction(s), we use the OLS estimator of β, b = (X'X)^{-1}X'y. Recall that, for the linear regression model y = Xβ + u, where X is an n×k matrix and β is a k-vector, if we assume that u ~ IID(0, σ_u²I_n), then E(b) = β and Var(b) = σ_u²(X'X)^{-1}. If we make the much stronger assumption that u ~ N(0, σ_u²I_n), then b ~ N[β, σ_u²(X'X)^{-1}]. Therefore, we have

    E(Rb) = RE(b) = Rβ,

and

    Var(Rb) = E([Rb − E(Rb)][Rb − E(Rb)]')
            = E[(Rb − Rβ)(Rb − Rβ)']
            = E[(Rb − Rβ)(b'R' − β'R')]
            = E[R(b − β)(b − β)'R']
            = RE[(b − β)(b − β)']R'
            = RVar(b)R'
            = σ_u²R(X'X)^{-1}R',

which implies that

    Rb ~ N[Rβ, σ_u²R(X'X)^{-1}R'],

or, alternatively, that

    Rb − Rβ ~ N[0, σ_u²R(X'X)^{-1}R'].

Note that, under the null hypothesis, Rβ = r, so we can rewrite the above as

    Rb − r ~ N[0, σ_u²R(X'X)^{-1}R'].

Therefore, we can rewrite this in quadratic form (see Appendix A) to get the test statistic

    (Rb − r)'[σ_u²R(X'X)^{-1}R']^{-1}(Rb − r) ~ χ²_(q).    (20)

Unfortunately, the parameter σ_u² is usually unknown, so we have to estimate it using

    s_u² = û'û/(n − k).

Replacing σ_u² in (20) by s_u² gives

    (Rb − r)'[R(X'X)^{-1}R']^{-1}(Rb − r) / [û'û/(n − k)].

Due to the randomness of s_u², this quantity is no longer distributed as χ²_(q). However, dividing both the numerator and denominator of the above by σ_u², we have

    (Rb − r)'[σ_u²R(X'X)^{-1}R']^{-1}(Rb − r) / [(û'û/σ_u²)/(n − k)].    (21)

Furthermore, û'û/σ_u² ~ χ²_(n−k), as shown in Appendix A, and is independent of the numerator in (21). Therefore, if we divide the numerator in (21) by its degrees of freedom (q), then we have a quantity which is distributed as F(q,n−k). That is,

    {(Rb − r)'[σ_u²R(X'X)^{-1}R']^{-1}(Rb − r)/q} / [(û'û/σ_u²)/(n − k)] ~ F(q,n−k).

Finally, eliminating the σ_u² terms and substituting in for the definition of s_u², we have the test statistic

    (Rb − r)'[s_u²R(X'X)^{-1}R']^{-1}(Rb − r)/q ~ F(q,n−k).    (22)
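The statistic in (22) can be computed directly in its quadratic form. The sketch below is my own illustration, under the assumption that b, X and s2_u have already been computed (for instance with the earlier sketch); R and r are supplied by the user.

    import numpy as np

    def wald_F_statistic(b, X, s2_u, R, r):
        """F statistic of (22) for H0: R beta = r."""
        q = R.shape[0]                                # number of restrictions
        RXXR = R @ np.linalg.inv(X.T @ X) @ R.T       # R(X'X)^{-1}R'
        diff = R @ b - r                              # Rb - r
        return diff @ np.linalg.solve(s2_u * RXXR, diff) / q

For the first example above one would take R = [[0, 1, 0]] and r = [β_{2,0}], and compare the result with the F(q, n−k) distribution.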

4.2 Testing a single restriction

Let's now see how we can construct specific test statistics for the first example considered above. First, note that Rb − r = b₂ − β_{2,0}, which is a scalar. Second, note that R(X'X)^{-1}R' picks out the 2nd diagonal element of the 3×3 matrix (X'X)^{-1}. Therefore, since Var-hat(b) = s_u²(X'X)^{-1}, we have

    s_u²R(X'X)^{-1}R' = Var-hat(b₂),

which is also a scalar (check the dimensions). In general, if there are k parameters, and we want to test the null hypothesis that the j-th one is equal to some specified value, i.e.

    H₀: β_j = β_{j,0},

we will have Rb − r = b_j − β_{j,0}, and s_u²R(X'X)^{-1}R' = Var-hat(b_j). Substituting these into (22), we have the test statistic

    (b_j − β_{j,0})² / Var-hat(b_j) ~ F(1,n−k).

Alternatively, taking the square root of this statistic, we have what is known as a t-statistic:

    (b_j − β_{j,0}) / Se(b_j) ~ t(n−k).

4.3 Testing several restrictions

When there is more than one restriction to test, constructing an appropriate test statistic is not so simple. The main problem is that it will usually involve having to estimate both the unrestricted and restricted models. To estimate the unrestricted model, y = Xβ + u, OLS can be used. However, to estimate the restricted model, we need to use a different method of estimation. The idea is to estimate the model subject to the restriction Rβ = r. To do so, we use the restricted least squares estimator of β, which we will denote b*. Letting û* denote the residuals from the estimate of this model, we have

    û* = y − Xb*.    (23)

The sum of squared residuals is thus

    SSR(b*) = û*'û*
            = (y − Xb*)'(y − Xb*)
            = y'y − 2b*'X'y + b*'X'Xb*.

As with OLS, the restricted least squares estimator is based on minimizing this function. Our (constrained) optimization problem is thus

    min_{b*} SSR(b*) = y'y − 2b*'X'y + b*'X'Xb*,  subject to Rb* = r.

The Lagrangian for this problem is

    L(b*, λ) = y'y − 2b*'X'y + b*'X'Xb* − 2λ'(Rb* − r),

and the first-order necessary conditions are

    ∂L(b*, λ)/∂b* = −2X'y + 2X'Xb* − 2R'λ = 0,    (24)

and

    ∂L(b*, λ)/∂λ = −2(Rb* − r) = 0.    (25)

Equation (24) implies

    X'Xb* = X'y + R'λ.

Solving for b*, we have

    b* = (X'X)^{-1}X'y + (X'X)^{-1}R'λ = b + (X'X)^{-1}R'λ,    (26)

where b is the usual OLS estimator. Equation (25) implies

    Rb* = r.    (27)

Premultiplying Equation (26) by R, we have

    Rb* = Rb + R(X'X)^{-1}R'λ,

which implies

    Rb* − Rb = R(X'X)^{-1}R'λ,

and therefore

    [R(X'X)^{-1}R']^{-1}(Rb* − Rb) = λ.

Using Equation (27), this becomes

    [R(X'X)^{-1}R']^{-1}(r − Rb) = λ.

Finally, substituting this into Equation (26), we have

    b* = b + (X'X)^{-1}R'[R(X'X)^{-1}R']^{-1}(r − Rb).    (28)
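Equation (28) translates almost line for line into code. The following sketch (mine, not from the notes, and assuming NumPy arrays y, X, R, r) computes b*; it can be checked by verifying that R @ b_star reproduces r.

    import numpy as np

    def restricted_ols(y, X, R, r):
        """Restricted least squares estimator of Equation (28)."""
        XtX_inv = np.linalg.inv(X.T @ X)
        b = XtX_inv @ X.T @ y                          # unrestricted OLS estimator
        A = XtX_inv @ R.T                              # (X'X)^{-1}R'
        middle = np.linalg.inv(R @ XtX_inv @ R.T)      # [R(X'X)^{-1}R']^{-1}
        b_star = b + A @ middle @ (r - R @ b)
        return b_star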

Let's now focus on the residuals from the restricted model. Adding and subtracting Xb in Equation (23), we have

    û* = y − Xb* − Xb + Xb
       = û − X(b* − b).

The sum of squared residuals from the restricted model is thus

    û*'û* = [û − X(b* − b)]'[û − X(b* − b)]
          = [û' − (b* − b)'X'][û − X(b* − b)]
          = û'û − û'X(b* − b) − (b* − b)'X'û + (b* − b)'X'X(b* − b)
          = û'û + (b* − b)'X'X(b* − b),

where we make use of the fact that X'û = 0.[25] Substituting for b* − b from Equation (28), we have

    û*'û* = û'û + {(X'X)^{-1}R'[R(X'X)^{-1}R']^{-1}(r − Rb)}'X'X(X'X)^{-1}R'[R(X'X)^{-1}R']^{-1}(r − Rb)
          = û'û + (r − Rb)'[R(X'X)^{-1}R']^{-1}R(X'X)^{-1}X'X(X'X)^{-1}R'[R(X'X)^{-1}R']^{-1}(r − Rb)
          = û'û + (r − Rb)'[R(X'X)^{-1}R']^{-1}(r − Rb).

Denoting û*'û* (the sum of squared residuals from the restricted model) by SSR_R, and û'û (the sum of squared residuals from the unrestricted model) by SSR_U, we have

    SSR_R − SSR_U = (r − Rb)'[R(X'X)^{-1}R']^{-1}(r − Rb).

Notice that the term on the right hand side appears in the numerator of (22). This leads to what is known as the F-statistic:

    [(SSR_R − SSR_U)/q] / [SSR_U/(n − k)] ~ F(q,n−k).    (29)

[25] Proving this would be a useful exercise.
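The F-statistic in (29) only needs the two sums of squared residuals. The sketch below is my own illustration; it reuses the hypothetical restricted_ols function from the earlier sketch and should give (up to rounding) the same value as the quadratic-form statistic in (22).

    import numpy as np

    def F_from_ssr(y, X, R, r):
        """F statistic of (29), computed from restricted and unrestricted SSRs."""
        n, k = X.shape
        q = R.shape[0]
        b = np.linalg.solve(X.T @ X, X.T @ y)      # unrestricted OLS
        b_star = restricted_ols(y, X, R, r)        # restricted estimator (28)
        ssr_u = np.sum((y - X @ b) ** 2)           # SSR_U
        ssr_r = np.sum((y - X @ b_star) ** 2)      # SSR_R
        return ((ssr_r - ssr_u) / q) / (ssr_u / (n - k))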

4.4 Bootstrapping

The test statistics considered above were built on the assumption that u is normally distributed. However, quite often, we may be unwilling to make this assumption. In such cases, we therefore do not know the distribution of our test statistics. This problem leads to a procedure known as the bootstrap.[26] The basic idea is to use the observed data to try to estimate the distribution of the relevant test statistic. In general, this procedure requires the following steps:

1. Using the observed data, estimate both the unrestricted and restricted models by OLS. Save the fitted values and residuals from the restricted model. Call these ŷ_R and û_R, respectively.

2. Using the estimates from the previous step, calculate the test statistic of interest (e.g. the t-statistic or the F-statistic). Call this T̂.

3. Randomly sample, with replacement, n of the residuals from the restricted model (this is known as resampling). Call these û*. Generate the bootstrap sample as

    y* = ŷ_R + û*.

4. Using the bootstrap sample, re-estimate both the unrestricted and restricted models.

5. Using the estimates from the previous step, calculate the test statistic of interest (this is known as the bootstrap test statistic, since it is based on the bootstrap sample). Call this T̂*_b.

6. Finally, repeat Steps 3-5 B (some large number[27]) times, b = 1, ..., B.

What this leaves you with is the original test statistic, T̂, and B different bootstrap test statistics, T̂*_b, b = 1, ..., B. It turns out that these B bootstrap test statistics provide a fairly good estimate of the distribution of the original test statistic, T̂. For inference purposes, we can calculate the simulated P-value of T̂ as

    p̂*(T̂) = (1/B) Σ_{b=1}^B I(T̂*_b > T̂),

where I(·) is an indicator function, equal to one if T̂*_b > T̂, and zero otherwise.

[26] Here, we present a very brief introduction to the bootstrap. A more complete introduction can be found in Davidson and MacKinnon (2004, Section 4.6).

[27] For a chosen level of significance, α, B should be chosen so that α(B + 1) is an integer (see Davidson and MacKinnon (2004, Section 4.6)). For α = 0.01, appropriate values of B would be 99, 199, 299, and so on. Since computing costs are now so low, B is often chosen to be 999 or even 9999.
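A minimal implementation of Steps 1-6 might look as follows. This is my own sketch, not code from the notes; it reuses the hypothetical F_from_ssr and restricted_ols functions introduced above and uses the F-statistic of (29) as the test statistic of interest.

    import numpy as np

    def bootstrap_p_value(y, X, R, r, B=999, seed=0):
        """Simulated P-value following Steps 1-6, with the F statistic of (29)."""
        rng = np.random.default_rng(seed)
        n = len(y)
        T_hat = F_from_ssr(y, X, R, r)              # Step 2: original test statistic
        b_star = restricted_ols(y, X, R, r)         # Step 1: restricted estimates
        y_fit_R = X @ b_star                        # fitted values, restricted model
        u_hat_R = y - y_fit_R                       # residuals, restricted model
        exceed = 0
        for _ in range(B):                          # Step 6: repeat B times
            u_boot = rng.choice(u_hat_R, size=n, replace=True)   # Step 3: resample
            y_boot = y_fit_R + u_boot               # bootstrap sample
            T_boot = F_from_ssr(y_boot, X, R, r)    # Steps 4-5: bootstrap statistic
            exceed += (T_boot > T_hat)
        return exceed / B                           # simulated P-value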

5 Generalized least squares

5.1 Introduction

In estimating the linear regression model

    y = Xβ + u,    (30)

we made the (very strong) assumption that

    u ~ IID(0, σ_u²I_n).

We would now like to relax this assumption, and consider what happens when the error terms are not IID. While we will continue to assume that E(u) = 0, we will now let

    Var(u) = E(uu') = [ E(u₁²)    ···  E(u₁u_n)
                        ⋮               ⋮
                        E(u_nu₁)  ···  E(u_n²)  ] = Ω,

where Ω, the error covariance matrix, is some n×n positive definite matrix. In other words, we allow the possibility that there is some E(u_iu_j) ≠ 0 for i ≠ j (and that therefore u_i and u_j are not independent), and that there is some E(u_i²) ≠ E(u_j²) for i ≠ j (and that therefore u_i and u_j are not identically distributed).

Notice that the OLS estimator is still unbiased, since

    E(b) = E[(X'X)^{-1}X'y]
         = E[(X'X)^{-1}X'(Xβ + u)]
         = E[β + (X'X)^{-1}X'u]
         = β + (X'X)^{-1}X'E(u)
         = β.

However, the variance of the OLS estimator is now different:

    Var(b) = E{[(X'X)^{-1}X'u][(X'X)^{-1}X'u]'}
           = E[(X'X)^{-1}X'uu'X(X'X)^{-1}]
           = (X'X)^{-1}X'E(uu')X(X'X)^{-1}
           = (X'X)^{-1}X'ΩX(X'X)^{-1}.    (31)

It turns out (as we will see shortly) that, in this more general setting, OLS is no longer the best linear unbiased estimator (BLUE). That is, there is some other linear unbiased estimator which has a variance smaller than (31).
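The contrast between the usual formula and the sandwich expression in (31) is easy to compute once an Ω is given. The sketch below is my own illustration, with Omega and sigma2_u supplied by the user; when Omega equals sigma2_u times the identity matrix the two expressions coincide.

    import numpy as np

    def ols_variances(X, Omega, sigma2_u):
        """Naive variance vs. the sandwich formula (31) when Var(u) = Omega."""
        XtX_inv = np.linalg.inv(X.T @ X)
        naive = sigma2_u * XtX_inv                        # valid only if Omega = sigma2_u * I
        sandwich = XtX_inv @ X.T @ Omega @ X @ XtX_inv    # Var(b) under a general Omega
        return naive, sandwich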

In deriving this better estimator, the basic strategy is to transform the linear regression model in (30) so that the error terms become IID. To do so, start by letting

    Ω^{-1} = ΨΨ',

where Ψ is some n×n matrix, which is usually triangular. Next, multiply both sides of Equation (30) by Ψ':

    Ψ'y = Ψ'Xβ + Ψ'u.

It is convenient to rewrite this as

    y* = X*β + u*,    (32)

where y* = Ψ'y, X* = Ψ'X, and u* = Ψ'u. Note that

    E(u*) = E(Ψ'u) = Ψ'E(u) = 0,

and that

    Var(u*) = E{[u* − E(u*)][u* − E(u*)]'}
            = E(u*u*')
            = E[(Ψ'u)(Ψ'u)']
            = E(Ψ'uu'Ψ)
            = Ψ'E(uu')Ψ
            = Ψ'Var(u)Ψ
            = Ψ'ΩΨ
            = Ψ'(ΨΨ')^{-1}Ψ
            = Ψ'(Ψ')^{-1}Ψ^{-1}Ψ
            = I_n.

That is, u* ~ IID(0, I_n).

Applying OLS to Equation (32), we get the generalized least squares (or GLS) estimator

    b_GLS = (X*'X*)^{-1}X*'y*
          = [(Ψ'X)'Ψ'X]^{-1}(Ψ'X)'Ψ'y
          = (X'ΨΨ'X)^{-1}X'ΨΨ'y
          = (X'Ω^{-1}X)^{-1}X'Ω^{-1}y.    (33)
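Equation (33) can be coded directly, without ever forming Ψ explicitly. The following sketch is my own illustration (not from the notes) and assumes Omega is a known, positive definite NumPy array.

    import numpy as np

    def gls(y, X, Omega):
        """GLS estimator of Equation (33): (X'Omega^{-1}X)^{-1} X'Omega^{-1} y."""
        Omega_inv = np.linalg.inv(Omega)
        A = X.T @ Omega_inv
        # Equivalently, one could build Psi from a factorization with
        # Omega^{-1} = Psi Psi' and run OLS on y* = Psi'y and X* = Psi'X.
        return np.linalg.solve(A @ X, A @ y)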

We now want to show that b_GLS is unbiased. Substituting Equation (30) into Equation (33), we have

    b_GLS = (X'Ω^{-1}X)^{-1}X'Ω^{-1}(Xβ + u)
          = (X'Ω^{-1}X)^{-1}X'Ω^{-1}Xβ + (X'Ω^{-1}X)^{-1}X'Ω^{-1}u
          = β + (X'Ω^{-1}X)^{-1}X'Ω^{-1}u.    (34)

Taking expectations, we have

    E(b_GLS) = E[β + (X'Ω^{-1}X)^{-1}X'Ω^{-1}u]
             = β + (X'Ω^{-1}X)^{-1}X'Ω^{-1}E(u)
             = β.

So b_GLS is unbiased.

Let's now move on to the variance of b_GLS:

    Var(b_GLS) = E([b_GLS − E(b_GLS)][b_GLS − E(b_GLS)]')
               = E[(b_GLS − β)(b_GLS − β)'].

Substituting from Equation (34), and using E(u) = 0 (so that E(uu') = Var(u) = Ω), we have

    Var(b_GLS) = E{[(X'Ω^{-1}X)^{-1}X'Ω^{-1}u][(X'Ω^{-1}X)^{-1}X'Ω^{-1}u]'}
               = E[(X'Ω^{-1}X)^{-1}X'Ω^{-1}uu'Ω^{-1}X(X'Ω^{-1}X)^{-1}]
               = (X'Ω^{-1}X)^{-1}X'Ω^{-1}E(uu')Ω^{-1}X(X'Ω^{-1}X)^{-1}
               = (X'Ω^{-1}X)^{-1}X'Ω^{-1}ΩΩ^{-1}X(X'Ω^{-1}X)^{-1}
               = (X'Ω^{-1}X)^{-1}.    (35)

We now want to consider a generalization of the Gauss-Markov theorem, which states that b_GLS is BLUE. The proof of this will be very similar to the proof of the Gauss-Markov theorem for b (except that, there, we assumed Var(u) = σ_u²I_n; here, Var(u) = Ω). We again consider some other linear, unbiased estimator of β, which we will again call B:

    B = Gy,

where, here, G = W + (X'Ω^{-1}X)^{-1}X'Ω^{-1}, and W is some non-zero matrix which depends on X. The expected value of B is thus

    E(B) = E(Gy)
         = GE(y)
         = GE(Xβ + u)
         = GXβ + GE(u)
         = GXβ.

This means that, in order for B to be unbiased, we must have GX = I_k, where I_k is a k×k identity matrix. Given this condition, notice that

    B = Gy
      = G(Xβ + u)
      = GXβ + Gu
      = β + Gu,

and that

    WX = [G − (X'Ω^{-1}X)^{-1}X'Ω^{-1}]X
       = GX − (X'Ω^{-1}X)^{-1}X'Ω^{-1}X
       = I_k − I_k
       = 0.

The variance of B is therefore

    Var(B) = E([B − E(B)][B − E(B)]')
           = E[(β + Gu − β)(β + Gu − β)']
           = E[(Gu)(Gu)']
           = E(Guu'G')
           = GE(uu')G'
           = GΩG'
           = [W + (X'Ω^{-1}X)^{-1}X'Ω^{-1}]Ω[W + (X'Ω^{-1}X)^{-1}X'Ω^{-1}]'
           = [WΩ + (X'Ω^{-1}X)^{-1}X'Ω^{-1}Ω][W' + Ω^{-1}X(X'Ω^{-1}X)^{-1}]
           = WΩW' + WΩΩ^{-1}X(X'Ω^{-1}X)^{-1} + (X'Ω^{-1}X)^{-1}X'W' + (X'Ω^{-1}X)^{-1}X'Ω^{-1}X(X'Ω^{-1}X)^{-1}
           = WΩW' + (X'Ω^{-1}X)^{-1},

where the two cross terms vanish because WX = 0. Finally, since W is a non-zero matrix and Ω is a positive definite matrix, Var(B) > Var(b_GLS), which proves our generalized Gauss-Markov theorem. It should be noted that, since B includes b as a special case[28], we also have Var(b) > Var(b_GLS).

[28] On the other hand, if Var(u) = σ_u²I_n, then b_GLS includes b as a special case. Proving this would be a useful exercise.

5.2 Feasible Generalized Least Squares

While the GLS estimator has some highly desirable properties, the main drawback with this method is that Ω is almost never actually known in practice. Therefore, GLS is not usually feasible. However, we can use some estimate of Ω, which we denote Ω̂. Replacing Ω by Ω̂ in Equation (33), we have the feasible generalized least squares (or FGLS) estimator

    b_FGLS = (X'Ω̂^{-1}X)^{-1}X'Ω̂^{-1}y.

How we actually go about estimating Ω depends on the particular problem. Below, we analyze two of the most common problems encountered in practice.

5.3 Heteroskedasticity

The problem of heteroskedasticity occurs when Ω is a diagonal matrix (meaning each u_i is independent), but each diagonal element may not be identical (meaning each u_i is not identically distributed), i.e. Var(u_i) = σ_{u_i}². That is, letting

    Ω = diag(σ_{u_1}², σ_{u_2}², ..., σ_{u_n}²),

we have, since Ω is diagonal,

    Ω^{-1} = diag(1/σ_{u_1}², 1/σ_{u_2}², ..., 1/σ_{u_n}²).

So, if we let

    Ψ = Ψ' = diag(1/σ_{u_1}, 1/σ_{u_2}, ..., 1/σ_{u_n}),

then Ω^{-1} = ΨΨ', as desired.

Therefore, transforming our variables as above, we have

    y* = Ψ'y = ( y₁/σ_{u_1}, y₂/σ_{u_2}, ..., y_n/σ_{u_n} )',

    X* = Ψ'X = [ x_{1,1}/σ_{u_1}  x_{2,1}/σ_{u_1}  ···  x_{k,1}/σ_{u_1}
                 x_{1,2}/σ_{u_2}  x_{2,2}/σ_{u_2}  ···  x_{k,2}/σ_{u_2}
                 ⋮                                      ⋮
                 x_{1,n}/σ_{u_n}  x_{2,n}/σ_{u_n}  ···  x_{k,n}/σ_{u_n} ],

and

    u* = Ψ'u = ( u₁/σ_{u_1}, u₂/σ_{u_2}, ..., u_n/σ_{u_n} )'.

So, in this case, the transformed model, y* = X*β + u*, can be written, for the i-th observation, as

    y_i/σ_{u_i} = β₁(x_{1,i}/σ_{u_i}) + ... + β_k(x_{k,i}/σ_{u_i}) + u_i/σ_{u_i},  i = 1, ..., n.

Estimating this model by OLS is a special case of GLS known as weighted least squares (or WLS).
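In code, WLS amounts to dividing each observation by its own standard deviation and then running OLS. The sketch below is my own illustration and assumes sigma_u is a length-n NumPy array holding the (known) values of σ_{u_i}.

    import numpy as np

    def wls(y, X, sigma_u):
        """Weighted least squares: OLS on variables divided by sigma_{u_i}."""
        w = 1.0 / sigma_u               # one weight per observation
        y_star = w * y
        X_star = X * w[:, None]         # divide every column of X by sigma_{u_i}
        return np.linalg.solve(X_star.T @ X_star, X_star.T @ y_star)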

Of course, in this case, this requires knowing the value of each σ_{u_i} (i.e. knowing Ω), which, as mentioned above, is almost never possible. Accordingly, we need some method to estimate Ω so that we can use FGLS (or, in this case, what might be referred to as feasible WLS). Typically, this is done as follows. Suppose we think that the variance of u can be explained by Z, an n×r matrix which may or may not include some of the columns of X. A fairly general non-linear function describing this relationship is

    E(u_i²) = α₁z_{1,i}^{γ₁} + ... + α_r z_{r,i}^{γ_r},  i = 1, ..., n.

Letting w_i denote the difference between u_i² and E(u_i²), we have

    u_i² = α₁z_{1,i}^{γ₁} + ... + α_r z_{r,i}^{γ_r} + w_i,  i = 1, ..., n,

where w_i is some error term. This is in the form of a non-linear regression model. Unfortunately, since we can't actually observe u_i², we would have to use the residuals from the OLS estimation of model (30) in their place. That is, we could estimate the non-linear regression model

    û_i² = α₁z_{1,i}^{γ₁} + ... + α_r z_{r,i}^{γ_r} + w_i,  i = 1, ..., n.[29]

We don't have time to cover non-linear regression models in this course, but estimating such a model is certainly possible. Using the parameter estimates from this model, we could construct estimates of σ_{u_i}² as follows:

    σ̂_{u_i}² = α̂₁z_{1,i}^{γ̂₁} + ... + α̂_r z_{r,i}^{γ̂_r},  i = 1, ..., n.

Using the square root of these estimates, σ̂_{u_i}, our FGLS estimates are found by estimating

    y_i/σ̂_{u_i} = β₁(x_{1,i}/σ̂_{u_i}) + ... + β_k(x_{k,i}/σ̂_{u_i}) + u_i/σ̂_{u_i},  i = 1, ..., n,

by OLS.

5.4 Autocorrelation

The problem of autocorrelation (or serial correlation) occurs when the off-diagonal elements in Ω are non-zero (meaning each u_i is no longer independent). Here, we will assume that the errors are homoskedastic (meaning each diagonal element in Ω is identical), but it is possible that autocorrelation and heteroskedasticity are simultaneously present. Autocorrelation is usually encountered in time-series data, where the error terms may be generated by some autoregressive process. For example, suppose the error terms follow a (linear) first-order autoregressive process (or AR(1) process):

    u_t = ρu_{t−1} + v_t,  t = 1, ..., n,    (36)

[29] This model could also be used to test for the presence of heteroskedasticity. Here, the null hypothesis would be H₀: α₁ = ... = α_r = 0 (implying homoskedasticity). An F-test could be used to test this restriction.

where |ρ| < 1, and v_t ~ IID(0, σ_v²). The condition that |ρ| < 1 is imposed so that the AR(1) process in (36) is what is known as covariance stationary. A covariance stationary process for u_t is one in which E(u_t), Var(u_t), and Cov(u_t, u_{t−j}), for any given j, are independent of t.

One way to see this is to imagine that the series for u_t has been in existence for an infinite time, although we only start observing it at t = 1. First, note that (36) implies that u_{t−1} = ρu_{t−2} + v_{t−1}, u_{t−2} = ρu_{t−3} + v_{t−2}, and so on. Substituting these into (36), we have

    u_t = ρ(ρu_{t−2} + v_{t−1}) + v_t
        = ρ²u_{t−2} + ρv_{t−1} + v_t
        = ρ²(ρu_{t−3} + v_{t−2}) + ρv_{t−1} + v_t
        = ρ³u_{t−3} + ρ²v_{t−2} + ρv_{t−1} + v_t,

and so on. Alternatively, this can be rewritten as

    u_t = v_t + ρv_{t−1} + ρ²v_{t−2} + ρ³v_{t−3} + ... = Σ_{i=0}^∞ ρⁱv_{t−i}.

Taking expectations, we have

    E(u_t) = E(Σ_{i=0}^∞ ρⁱv_{t−i}) = Σ_{i=0}^∞ ρⁱE(v_{t−i}) = 0,

since E(v_t) = 0. So E(u_t) is independent of t.

Next, since v_t is IID, we have

    Var(u_t) = Var(Σ_{i=0}^∞ ρⁱv_{t−i}) = Σ_{i=0}^∞ ρ²ⁱVar(v_{t−i}) = Σ_{i=0}^∞ ρ²ⁱσ_v²,

which is an infinite geometric series (since |ρ| < 1), so we can write

    Var(u_t) = σ_v²/(1 − ρ²).

Therefore, Var(u_t) is also independent of t. On the other hand, if |ρ| ≥ 1, this infinite series would not converge (rather, it would explode), and Var(u_t) would depend on t (and make the process non-stationary).

Finally, let's consider the covariance between u_t and u_{t−j}. First, note that

    Cov(u_t, u_{t−1}) = E(u_tu_{t−1})
                      = E[(ρu_{t−1} + v_t)u_{t−1}]
                      = E(ρu²_{t−1} + v_tu_{t−1})
                      = ρE(u²_{t−1})
                      = ρVar(u_{t−1})
                      = ρσ_v²/(1 − ρ²).

Similarly,

    Cov(u_t, u_{t−2}) = E(u_tu_{t−2})
                      = E[(ρu_{t−1} + v_t)u_{t−2}]
                      = E(ρu_{t−1}u_{t−2} + v_tu_{t−2})
                      = ρE(u_{t−1}u_{t−2})
                      = ρ²σ_v²/(1 − ρ²).

Generalizing these two results, for any given j, we have

    Cov(u_t, u_{t−j}) = ρʲσ_v²/(1 − ρ²),

which is also independent of t. (Since this result depends on E(u_t) and Var(u_t) being constant, it, too, depends on the condition |ρ| < 1.)

Using the above results for E(u_t), Var(u_t), and Cov(u_t, u_{t−j}), we have

    Ω = σ_v²/(1 − ρ²) × [ 1        ρ        ρ²       ···  ρ^(n−1)
                          ρ        1        ρ        ···  ρ^(n−2)
                          ρ²       ρ        1        ···  ρ^(n−3)
                          ⋮        ⋮        ⋮             ⋮
                          ρ^(n−1)  ρ^(n−2)  ρ^(n−3)  ···  1       ].
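The AR(1) covariance matrix just derived has the simple form Ω_{ts} = σ_v²ρ^(|t−s|)/(1 − ρ²), which the following sketch (my own illustration, assuming NumPy) builds directly.

    import numpy as np

    def ar1_error_covariance(n, rho, sigma2_v):
        """Error covariance matrix Omega for an AR(1) process with |rho| < 1."""
        t = np.arange(n)
        powers = np.abs(t[:, None] - t[None, :])   # |t - s|
        return (sigma2_v / (1.0 - rho ** 2)) * rho ** powers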

Inverting this matrix, we have

    Ω^{-1} = (1/σ_v²) × [ 1    −ρ     0     ···   0     0
                          −ρ   1+ρ²  −ρ     ···   0     0
                          0    −ρ    1+ρ²   ···   0     0
                          ⋮    ⋮     ⋮            ⋮     ⋮
                          0    0     0      ···  1+ρ²  −ρ
                          0    0     0      ···  −ρ     1 ]

(this can be confirmed by showing that ΩΩ^{-1} = I_n). So, if we let

    Ψ' = (1/σ_v) × [ √(1−ρ²)  0    0   ···  0   0
                     −ρ       1    0   ···  0   0
                     0        −ρ   1   ···  0   0
                     ⋮        ⋮    ⋮        ⋮   ⋮
                     0        0    0   ···  −ρ  1 ],

then Ω^{-1} = ΨΨ', as desired. Therefore, transforming our variables as above, we have

    y* = Ψ'y = (1/σ_v) ( y₁√(1−ρ²), y₂ − ρy₁, y₃ − ρy₂, ..., y_n − ρy_{n−1} )',

    X* = Ψ'X = (1/σ_v) × [ x_{1,1}√(1−ρ²)       ···  x_{k,1}√(1−ρ²)
                           x_{1,2} − ρx_{1,1}    ···  x_{k,2} − ρx_{k,1}
                           ⋮                          ⋮
                           x_{1,n} − ρx_{1,n−1}  ···  x_{k,n} − ρx_{k,n−1} ],

and

    u* = Ψ'u = (1/σ_v) ( u₁√(1−ρ²), u₂ − ρu₁, u₃ − ρu₂, ..., u_n − ρu_{n−1} )'.

So, in this case, the transformed model, y* = X*β + u*, can be written as follows. For the 1st observation, we have

    (1/σ_v) y₁√(1−ρ²) = β₁(1/σ_v) x_{1,1}√(1−ρ²) + ... + β_k(1/σ_v) x_{k,1}√(1−ρ²) + (1/σ_v) u₁√(1−ρ²),

and, for the t-th observation, t = 2, ..., n, we have

    (1/σ_v)(y_t − ρy_{t−1}) = β₁(1/σ_v)(x_{1,t} − ρx_{1,t−1}) + ... + β_k(1/σ_v)(x_{k,t} − ρx_{k,t−1}) + (1/σ_v)(u_t − ρu_{t−1}).

Of course, since σ_v is a constant, it can be cancelled from each side of the above, leaving

    y₁√(1−ρ²) = β₁x_{1,1}√(1−ρ²) + ... + β_k x_{k,1}√(1−ρ²) + u₁√(1−ρ²),

and

    y_t − ρy_{t−1} = β₁(x_{1,t} − ρx_{1,t−1}) + ... + β_k(x_{k,t} − ρx_{k,t−1}) + v_t,  t = 2, ..., n.
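The transformation above is easy to apply in code for a known ρ. The sketch below is my own illustration; it scales the first observation by √(1 − ρ²) and quasi-differences the rest, with the constant σ_v already cancelled as in the text.

    import numpy as np

    def prais_winsten_transform(y, X, rho):
        """Transform y and X as above for AR(1) errors with known rho."""
        y_star = np.empty_like(y, dtype=float)
        X_star = np.empty_like(X, dtype=float)
        c = np.sqrt(1.0 - rho ** 2)
        y_star[0] = c * y[0]                 # first observation, scaled
        X_star[0] = c * X[0]
        y_star[1:] = y[1:] - rho * y[:-1]    # remaining observations, quasi-differenced
        X_star[1:] = X[1:] - rho * X[:-1]
        return y_star, X_star

Regressing y_star on X_star by OLS then gives the GLS estimates for a known ρ.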

.t − ρxk. u j th estimate of 2. Estimating this model by OLS gives us GLS estimates. that E(RbGLS ) = RE(bGLS ) = Rβ. Using the residuals from the previous step. n. so we need to somehow estimate it (and therefore in this procedure are as follows: Ω). we need to resort to a procedure known as iterated FGLS . . Of course. ˆ = y − Xb.t − ρx1.0001). we use use bGLS (or bFGLS ) instead of b as an bGLS .t−1 ) + vt . Save the residuals. ρ is not usually known in practice.5 Hypothesis Testing Hypothesis testing can be done using GLS (or FGLS) estimates in very similar manner as it was done using OLS estimates. + βk (xk. . where (j ) (j −1) δ is some small number (say 0. In this case. j th FGLS (j ) ˆ = y − XbFGLS . (j ) Save the updated Repeat Steps 2-3 until |ρ ˆ(j ) − ρ ˆ(j −1) | < δ. respectively. at the end of this section.t−1 ) + . . in H0 : Rβ = r. bFGLS . and |bFGLS − bFGLS | < δ. 5. as ρ ˆ(j ) = 3. Use OLS to estimate (30). and get FGLS estimates. estimate of β. . . 59 . Note rst. n ˆt i=2 u ρ ˆ(j ) to get the residuals. testing the linear restriction imposed by The only dierence is that. For now. .and yt − ρyt−1 = β1 (x1. The steps involved 1. compute the ρ ρ ˆ(j ) . but. the parameter t = 2. we will use we will discuss the eects of using bFGLS . Use n ˆt u ˆt−1 i=2 u . u estimate of (30).

5.5 Hypothesis Testing

Hypothesis testing can be done using GLS (or FGLS) estimates in a very similar manner as it was done using OLS estimates. The only difference is that, in testing the linear restriction imposed by H₀: Rβ = r, we use b_GLS (or b_FGLS) instead of b as an estimate of β. For now, we will use b_GLS, but we will discuss the effects of using b_FGLS at the end of this section.

Note first that

    E(Rb_GLS) = RE(b_GLS) = Rβ,

and

    Var(Rb_GLS) = E([Rb_GLS − E(Rb_GLS)][Rb_GLS − E(Rb_GLS)]')
                = E[(Rb_GLS − Rβ)(Rb_GLS − Rβ)']
                = E[(Rb_GLS − Rβ)(b_GLS'R' − β'R')]
                = E[R(b_GLS − β)(b_GLS − β)'R']
                = RE[(b_GLS − β)(b_GLS − β)']R'
                = RVar(b_GLS)R'
                = R(X'Ω^{-1}X)^{-1}R'.

Note that, here, if we make the assumption that u ~ N(0, Ω), then

    b_GLS ~ N[β, (X'Ω^{-1}X)^{-1}],

which implies that

    Rb_GLS ~ N[Rβ, R(X'Ω^{-1}X)^{-1}R'],

or, alternatively, that

    Rb_GLS − Rβ ~ N[0, R(X'Ω^{-1}X)^{-1}R'].

Under the null hypothesis, Rβ = r, so we can rewrite the above as

    Rb_GLS − r ~ N[0, R(X'Ω^{-1}X)^{-1}R'].

Finally, we can rewrite this in quadratic form to get the test statistic

    (Rb_GLS − r)'[R(X'Ω^{-1}X)^{-1}R']^{-1}(Rb_GLS − r) ~ χ²_(q).

Of course, since we almost never know Ω, we need to use our estimate, Ω̂. Here, knowledge of Ω is needed not only to compute b_GLS, but also to construct the above test statistic. Substituting Ω̂ for Ω in the above, we have the test statistic

    (Rb_FGLS − r)'[R(X'Ω̂^{-1}X)^{-1}R']^{-1}(Rb_FGLS − r),

whose distribution is very hard to ascertain (even if u is normally distributed). As a result, it is suggested that a bootstrap procedure be used in such tests.

Appendix A

We showed these results in class. These notes will eventually be updated.

References

Davidson, R. and MacKinnon, J. G. (2004). Econometric Theory and Methods. Oxford University Press, New York.
