Consider a sample $y_t$ of $T$ observations that we assume have been drawn from a distribution with mean (location parameter) $E[y_t]$ and variance (scale parameter) $\sigma^2$. Let $y_t$ be the outcome or response variable and $x_{2t}$ the known explanatory (conditioning, control) variable. Thus $y_t$ is a random variable and $x_{2t}$ is fixed or nonstochastic. We may model the relationship as

(1) $y_t = E[y_t] + e_t = \beta_1 + x_{2t}\beta_2 + e_t$

where $\beta_1$ reflects the level and $\beta_2$ reflects the slope of the relationship, which is linear in terms of the parameters. The parameters $\beta_1, \beta_2$ are unknown and the random variables $e_t$ are unobservable.
We assume that the error term $e_t$ is an independently and identically distributed (IID) random variable with mean $E[e_t] = 0$, variance $E[e_t^2] = \sigma^2$, and covariance $E[e_t e_s] = 0$ for $t \neq s$. Denote $\mathbf{e} = (e_1, e_2, \ldots, e_T)'$. Since $E[\mathbf{e}] = \mathbf{0}$, the corresponding matrix of covariances for the random vector $\mathbf{e}$ is $E[\mathbf{ee}']$. Thus
(3) $E[\mathbf{ee}'] = E\left[\begin{pmatrix} e_1 \\ e_2 \\ \vdots \\ e_T \end{pmatrix}_{(T\times 1)} \begin{pmatrix} e_1 & e_2 & \cdots & e_T \end{pmatrix}_{(1\times T)}\right] = \begin{pmatrix} E(e_1^2) & E(e_1 e_2) & \cdots & E(e_1 e_T) \\ E(e_2 e_1) & E(e_2^2) & \cdots & E(e_2 e_T) \\ \vdots & \vdots & \ddots & \vdots \\ E(e_T e_1) & E(e_T e_2) & \cdots & E(e_T^2) \end{pmatrix}_{(T\times T)}$
Since $E(e_t^2) = \sigma^2$ and $E(e_t e_s) = 0$ for $t \neq s$, we may write Eq. (3) as

(4) $E[\mathbf{ee}'] = \begin{pmatrix} \sigma^2 & 0 & \cdots & 0 \\ 0 & \sigma^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma^2 \end{pmatrix} = \sigma^2 \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{pmatrix} = \sigma^2 \mathbf{I}_T$
Given Eq. (1), the $T$ equations, one for each $y_t$, may be written as
(5) $y_1 = \beta_1 + x_{21}\beta_2 + e_1$
$y_2 = \beta_1 + x_{22}\beta_2 + e_2$
$\vdots$
$y_T = \beta_1 + x_{2T}\beta_2 + e_T$
Defining $\mathbf{y} = (y_1, y_2, \ldots, y_T)'$, $\mathbf{x}_1 = (1, 1, \ldots, 1)'$ a $(T\times 1)$ vector of ones, and $\mathbf{x}_2 = (x_{21}, x_{22}, \ldots, x_{2T})'$, we may write the statistical model for the sample $y_1, y_2, \ldots, y_T$ as
(6) $\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_T \end{pmatrix} = \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix}\beta_1 + \begin{pmatrix} x_{21} \\ x_{22} \\ \vdots \\ x_{2T} \end{pmatrix}\beta_2 + \begin{pmatrix} e_1 \\ e_2 \\ \vdots \\ e_T \end{pmatrix}$
or, compactly, as
(7) $\mathbf{y} = \mathbf{x}_1\beta_1 + \mathbf{x}_2\beta_2 + \mathbf{e} = [\mathbf{x}_1\;\;\mathbf{x}_2]\begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix} + \mathbf{e} = \mathbf{X}\boldsymbol{\beta} + \mathbf{e}$

that is,

(9) $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{e}$
where the $(T\times 1)$ observable random vector $\mathbf{y}$ has mean $E[\mathbf{y}] = \mathbf{X}\boldsymbol{\beta}$ and covariance matrix $\sigma^2\mathbf{I}_T$, and the unobservable random vector $\mathbf{e}$ has mean $E[\mathbf{e}] = \mathbf{0}$ and covariance matrix $\sigma^2\mathbf{I}_T$.
Our problem is to estimate the unknown parameters $\beta_1$ and $\beta_2$ (assuming for the moment that the scale parameter $\sigma^2$ is known) that represent the unknown level and slope coefficients for the economic relationship under study. We will use the least squares criterion. According to this criterion, given a sample of observed values of the random variables $y_1, y_2, \ldots, y_T$, to obtain an estimate $\mathbf{b}$ for the unknown parameter vector $\boldsymbol{\beta} = (\beta_1, \beta_2)'$, an estimator is chosen that minimizes the residual sum of squares $\sum_{t=1}^{T}\hat{e}_t^2 = \hat{\mathbf{e}}'\hat{\mathbf{e}}$, where the residual vector is $\hat{\mathbf{e}} = \mathbf{y} - \hat{\mathbf{y}}$ and $\hat{\mathbf{y}} = \mathbf{Xb}$.
Formally, we can state this criterion as: given the sample observations $\mathbf{y}$, find values for $b_1$ and $b_2$ that minimize the following quadratic form
$S(b_1, b_2) = \sum_{t=1}^{T}(y_t - b_1 - x_{2t}b_2)^2 = (\mathbf{y} - \mathbf{Xb})'(\mathbf{y} - \mathbf{Xb})$
In this case we need to find the minimizing values $b_1$ and $b_2$ for $\beta_1$ and $\beta_2$ that make the partial derivatives vanish. These derivatives are

(11) $\dfrac{\partial(\hat{\mathbf{e}}'\hat{\mathbf{e}})}{\partial\mathbf{b}} = -2\mathbf{X}'\mathbf{y} + 2\mathbf{X}'\mathbf{Xb} = \mathbf{0}$,

which yields the normal equations

$\mathbf{X}'\mathbf{Xb} = \mathbf{X}'\mathbf{y}$
Pre-multiplying each side by $(\mathbf{X}'\mathbf{X})^{-1}$, we have $(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Xb} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$, or

(11a) $\mathbf{b} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$.

This is the (ordinary) least squares estimator.
In algebraic notation, derivatives (11) can be written as

(12) $\mathbf{0} = -2\mathbf{X}'\mathbf{y} + 2\mathbf{X}'\mathbf{Xb} = -2\begin{pmatrix} \mathbf{x}_1'\mathbf{y} \\ \mathbf{x}_2'\mathbf{y} \end{pmatrix} + 2\begin{pmatrix} \mathbf{x}_1'\mathbf{x}_1 & \mathbf{x}_1'\mathbf{x}_2 \\ \mathbf{x}_2'\mathbf{x}_1 & \mathbf{x}_2'\mathbf{x}_2 \end{pmatrix}\mathbf{b} = -2\begin{pmatrix} \Sigma y_t \\ \Sigma x_{2t}y_t \end{pmatrix} + 2\begin{pmatrix} T & \Sigma x_{2t} \\ \Sigma x_{2t} & \Sigma x_{2t}^2 \end{pmatrix}\begin{pmatrix} b_1 \\ b_2 \end{pmatrix}$
or

(13) $\begin{pmatrix} T & \Sigma x_{2t} \\ \Sigma x_{2t} & \Sigma x_{2t}^2 \end{pmatrix}\begin{pmatrix} b_1 \\ b_2 \end{pmatrix} = \begin{pmatrix} \Sigma y_t \\ \Sigma x_{2t}y_t \end{pmatrix}$

or

$\mathbf{X}'\mathbf{Xb} = \mathbf{X}'\mathbf{y}$.
The equations in (12) and (13) represent a system of linear simultaneous equations that must be solved for $b_1$ and $b_2$. Using the concept of the inverse of a matrix,
(14) $\begin{pmatrix} b_1 \\ b_2 \end{pmatrix} = \begin{pmatrix} T & \Sigma x_{2t} \\ \Sigma x_{2t} & \Sigma x_{2t}^2 \end{pmatrix}^{-1}\begin{pmatrix} \Sigma y_t \\ \Sigma x_{2t}y_t \end{pmatrix} = \dfrac{1}{T\Sigma x_{2t}^2 - (\Sigma x_{2t})^2}\begin{pmatrix} \Sigma x_{2t}^2 & -\Sigma x_{2t} \\ -\Sigma x_{2t} & T \end{pmatrix}\begin{pmatrix} \Sigma y_t \\ \Sigma x_{2t}y_t \end{pmatrix}$

where $T\Sigma x_{2t}^2 - (\Sigma x_{2t})^2 = T\Sigma(x_{2t} - \bar{x}_2)^2$ and $\bar{x}_2$ is the arithmetic mean of $x_{2t}$.
Consequently,

(15) $b_1 = \dfrac{(\Sigma x_{2t}^2)(\Sigma y_t) - (\Sigma x_{2t})(\Sigma x_{2t}y_t)}{T\Sigma(x_{2t} - \bar{x}_2)^2}$

$b_2 = \dfrac{T\Sigma x_{2t}y_t - (\Sigma x_{2t})(\Sigma y_t)}{T\Sigma(x_{2t} - \bar{x}_2)^2}$.

It is sometimes useful to simplify $b_1$ as $b_1 = \bar{y} - \bar{x}_2 b_2$, where $\bar{y}$ and $\bar{x}_2$ are the sample means of $y_t$ and $x_{2t}$, respectively.
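As a quick numeric sketch, the scalar formulas in (15) can be checked against the explicit $2\times 2$ inverse in (14) and against the simplification $b_1 = \bar{y} - \bar{x}_2 b_2$. The data below are invented purely for illustration.

```python
# Least squares via formulas (14)-(15); the data are illustrative assumptions.
x2 = [1.0, 2.0, 3.0, 4.0, 5.0]          # explanatory variable x_{2t}
y  = [2.1, 3.9, 6.2, 8.1, 9.8]          # observed outcomes y_t
T  = len(y)

Sx  = sum(x2)                            # Σ x_{2t}
Sy  = sum(y)                             # Σ y_t
Sxx = sum(x * x for x in x2)             # Σ x_{2t}²
Sxy = sum(x * v for x, v in zip(x2, y))  # Σ x_{2t} y_t

# Determinant of X'X: T·Σx² − (Σx)² = T·Σ(x − x̄)²
det = T * Sxx - Sx ** 2

# Formulas (15)
b1 = (Sxx * Sy - Sx * Sxy) / det
b2 = (T * Sxy - Sx * Sy) / det

# Same answer from the explicit 2×2 inverse in (14)
b1_m = (Sxx * Sy + (-Sx) * Sxy) / det
b2_m = ((-Sx) * Sy + T * Sxy) / det
assert abs(b1 - b1_m) < 1e-9 and abs(b2 - b2_m) < 1e-9

# Check the simplification b1 = ȳ − x̄₂ b2
ybar, xbar = Sy / T, Sx / T
assert abs(b1 - (ybar - xbar * b2)) < 1e-9

print(round(b1, 2), round(b2, 2))   # 0.14 1.96 for these data
```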
To summarize, $\mathbf{b}$, the least squares estimator of the unknown parameters $\beta_1$ and $\beta_2$, results from solving the two simultaneous linear equations in (13). The resulting least squares estimator is

(16) $\mathbf{b} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$

Note that the least squares estimator $\mathbf{b}$ is a linear function of the observations $\mathbf{y}$.
To learn about the mean vector of $\mathbf{b}$, we make use of the expectation operator $E$ and investigate $E[\mathbf{b}]$. Using (16),
(17) $E(\mathbf{b}) = E[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}] = E[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'(\mathbf{X}\boldsymbol{\beta} + \mathbf{e})]$
$= E[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}\boldsymbol{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{e}] = E(\mathbf{I}\boldsymbol{\beta}) + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'E(\mathbf{e})$
$= \boldsymbol{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{0} = \boldsymbol{\beta}$
since by assumption $E[e_t] = 0$. Consequently, using the least squares estimator results in a linear rule for estimating $\beta_1$ and $\beta_2$ that is unbiased. Note that what is unbiased is the rule or estimator, and not the estimate obtained from a particular sample. Thus, if we use the least squares rule $\mathbf{b}$, it will be an unbiased rule since $E[\mathbf{b} - \boldsymbol{\beta}] = \mathbf{0}$.
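A small simulation illustrates this unbiasedness: averaging the least squares estimates over many artificial samples should recover the true parameters. The parameter values, regressor, and replication count below are illustrative assumptions.

```python
# Monte Carlo sketch of unbiasedness: E[b] = β. All settings are illustrative.
import random

random.seed(0)
beta1, beta2, sigma = 1.0, 0.5, 1.0      # assumed "true" parameters
x2 = [float(t) for t in range(1, 11)]    # fixed regressor, T = 10
T  = len(x2)
Sx, Sxx = sum(x2), sum(x * x for x in x2)
det = T * Sxx - Sx ** 2

reps = 5000
sum_b1 = sum_b2 = 0.0
for _ in range(reps):
    e = [random.gauss(0.0, sigma) for _ in range(T)]
    y = [beta1 + beta2 * x + et for x, et in zip(x2, e)]
    Sy  = sum(y)
    Sxy = sum(x * v for x, v in zip(x2, y))
    sum_b1 += (Sxx * Sy - Sx * Sxy) / det   # b1 from (15)
    sum_b2 += (T * Sxy - Sx * Sy) / det     # b2 from (15)

mean_b1, mean_b2 = sum_b1 / reps, sum_b2 / reps
# With 5000 replications the averages sit close to (β1, β2) = (1.0, 0.5)
print(mean_b1, mean_b2)
```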
Our next concern is with its sampling variability or precision. We know that

$\mathbf{b} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'(\mathbf{X}\boldsymbol{\beta} + \mathbf{e})$,

or

(18) $\mathbf{b} = \boldsymbol{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{e}$

and so

(19) $\mathbf{b} - \boldsymbol{\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{e}$.
Making use of (19), we express the covariance matrix of the random vector $\mathbf{b}$ as

$\operatorname{cov}(\mathbf{b}) = E[(\mathbf{b} - \boldsymbol{\beta})(\mathbf{b} - \boldsymbol{\beta})'] = E[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{ee}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}]$
$= \sigma^2(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\,\mathbf{I}\,\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}$
$= \sigma^2(\mathbf{X}'\mathbf{X})^{-1} = \sigma^2\begin{pmatrix} T & \Sigma x_{2t} \\ \Sigma x_{2t} & \Sigma x_{2t}^2 \end{pmatrix}^{-1}$

where use is made of the assumption that $E[\mathbf{ee}'] = \sigma^2\mathbf{I}_T$. Since $\mathbf{b} = (b_1, b_2)'$ is a two-dimensional random vector, $\operatorname{cov}(\mathbf{b})$ is a $(2\times 2)$ matrix.
Using the inverse of $\mathbf{X}'\mathbf{X}$ in (14), the variances and covariances of $b_1$ and $b_2$ are

(23) $\operatorname{var}(b_1) = \dfrac{\sigma^2\,\Sigma x_{2t}^2}{T\Sigma(x_{2t} - \bar{x}_2)^2} = \sigma^2 a_1$

(24) $\operatorname{var}(b_2) = \dfrac{\sigma^2}{\Sigma(x_{2t} - \bar{x}_2)^2} = \sigma^2 a_2$

(25) $\operatorname{cov}(b_1, b_2) = \dfrac{-\sigma^2\,\bar{x}_2}{\Sigma(x_{2t} - \bar{x}_2)^2}$
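The variance and covariance formulas (23)-(25) can be verified numerically against the explicit inverse of $\mathbf{X}'\mathbf{X}$; the $x$ values and $\sigma^2$ below are arbitrary illustrative choices.

```python
# Deterministic check that σ²(X'X)⁻¹ reproduces (23)-(25); data are illustrative.
x2 = [2.0, 4.0, 5.0, 7.0, 12.0]
T  = len(x2)
sigma2 = 1.0                              # treat σ² as known here

Sx, Sxx = sum(x2), sum(x * x for x in x2)
xbar = Sx / T
Sdev = sum((x - xbar) ** 2 for x in x2)   # Σ (x_{2t} − x̄₂)²
det  = T * Sxx - Sx ** 2                  # equals T · Σ(x − x̄)²

# Elements of σ²(X'X)⁻¹ via the explicit 2×2 inverse in (14)
var_b1  = sigma2 * Sxx / det
var_b2  = sigma2 * T / det
cov_b12 = sigma2 * (-Sx) / det

# Formulas (23)-(25) give the same numbers
assert abs(var_b1 - sigma2 * Sxx / (T * Sdev)) < 1e-12
assert abs(var_b2 - sigma2 / Sdev) < 1e-12
assert abs(cov_b12 - (-sigma2 * xbar / Sdev)) < 1e-12
print(var_b1, var_b2, cov_b12)
```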
To summarize, the sampling properties of the random variables $b_1$ and $b_2$ are

$b_1 \sim (\beta_1, \sigma^2 a_1)$

and

$b_2 \sim (\beta_2, \sigma^2 a_2)$

where $a_1$ and $a_2$ are defined in (23) and (24). One thing that is apparent from (23) to (25) is that the more dispersed the explanatory variable [i.e., the larger $\Sigma(x_{2t} - \bar{x}_2)^2$ is], the greater the precision of $b_1$ and $b_2$. Also, because the number of terms in the summation $\Sigma(x_{2t} - \bar{x}_2)^2$ increases as the sample size increases, an increase in sample size generally leads to an increase in precision. Finally, the smaller the error variance $\sigma^2$, which reflects the variability of $y_t$ about its mean, the greater the precision of the estimators.
Generalizing the model in (5) to $K - 1$ explanatory variables, the $T$ sample equations become

$y_1 = \beta_1 + x_{21}\beta_2 + x_{31}\beta_3 + \cdots + x_{K1}\beta_K + e_1$
$y_2 = \beta_1 + x_{22}\beta_2 + x_{32}\beta_3 + \cdots + x_{K2}\beta_K + e_2$
$\vdots$
$y_T = \beta_1 + x_{2T}\beta_2 + x_{3T}\beta_3 + \cdots + x_{KT}\beta_K + e_T$

which may be written compactly as $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{e}$, in which

(29) $\mathbf{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_T \end{pmatrix}, \quad \mathbf{X} = \begin{pmatrix} 1 & x_{21} & \cdots & x_{K1} \\ 1 & x_{22} & \cdots & x_{K2} \\ \vdots & \vdots & & \vdots \\ 1 & x_{2T} & \cdots & x_{KT} \end{pmatrix}, \quad \boldsymbol{\beta} = \begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_K \end{pmatrix}, \quad \mathbf{e} = \begin{pmatrix} e_1 \\ e_2 \\ \vdots \\ e_T \end{pmatrix}$
where
$\mathbf{y}$ is a $T\times 1$ column vector of dependent-variable observations,
$\mathbf{X}$ is a $T\times K$ matrix of independent-variable observations,
$\boldsymbol{\beta}$ is a $K\times 1$ column vector of unknown parameters, and
$\mathbf{e}$ is a $T\times 1$ column vector of errors.
The assumptions of the classical linear regression model are:
i. The model is linear in the parameters: $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{e}$.
ii. The elements of $\mathbf{X}$ are fixed and have finite variance. In addition, $\mathbf{X}$ has rank $K$, which is less than the number of observations $T$.
iii. $\mathbf{e}$ is normally distributed with $E[\mathbf{e}] = \mathbf{0}$ and $E[\mathbf{ee}'] = \sigma^2\mathbf{I}$, where $\mathbf{I}$ is the $T\times T$ identity matrix.
The assumption that X has rank K guarantees that perfect collinearity will not be present. With
perfect collinearity, one of the columns of X would be a linear combination of the remaining
columns, and the rank of X would be less than K . The error assumptions are the strongest
possible, since they guarantee the statistical as well as arithmetic properties of the ordinary
least squares estimation process. In addition to normality, we assume that each error term has mean $0$, all variances are constant, and all covariances are $0$. The variance-covariance matrix $\sigma^2\mathbf{I}$ appears as in (3) and (4).
Now consider the properties of the least squares estimator $\mathbf{b}$. First, we can prove that $\mathbf{b}$ is an unbiased estimator of $\boldsymbol{\beta}$:

(33) $\mathbf{b} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} = \boldsymbol{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{e} = \boldsymbol{\beta} + \mathbf{Ae}$, where $\mathbf{A} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$.

Then

(34) $E[\mathbf{b}] = \boldsymbol{\beta} + \mathbf{A}\,E[\mathbf{e}] = \boldsymbol{\beta}$.
Looking at (33), notice that $\mathbf{Ae} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{e}$ represents the regression of $\mathbf{e}$ on $\mathbf{X}$. As long as the effects of missing (or omitted) variables are randomly distributed independently of $\mathbf{X}$ and have mean $0$, the least squares parameter estimator $\mathbf{b}$ will be unbiased.
Further, the least squares estimator will be normally distributed, since $\mathbf{b}$ is a linear function of $\mathbf{e}$ and $\mathbf{e}$ is normally distributed. The variances of the individual $b_i$, $i = 1, \ldots, K$, appear as the diagonal terms of the covariance matrix of the estimated parameters, while the off-diagonal terms represent the covariances. Note from (33) that $\mathbf{b} - \boldsymbol{\beta} = \mathbf{Ae}$ where $\mathbf{A} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$. Then

$(\mathbf{b} - \boldsymbol{\beta})(\mathbf{b} - \boldsymbol{\beta})' = \mathbf{Aee}'\mathbf{A}'$.

Therefore,

(36) $E[(\mathbf{b} - \boldsymbol{\beta})(\mathbf{b} - \boldsymbol{\beta})'] = \mathbf{A}\,E[\mathbf{ee}']\,\mathbf{A}' = \sigma^2\mathbf{AA}' = \sigma^2(\mathbf{X}'\mathbf{X})^{-1}$
An unbiased estimator of $\sigma^2$ can be based on the least squares residuals. Let $\mathbf{M} = \mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$, so that $\hat{\mathbf{e}} = \mathbf{My}$. Since $\mathbf{MX} = \mathbf{0}$,

(38) $\hat{\mathbf{e}} = \mathbf{My} = \mathbf{M}(\mathbf{X}\boldsymbol{\beta} + \mathbf{e}) = \mathbf{Me}$

and, because $\mathbf{M}$ is symmetric and idempotent, $\hat{\mathbf{e}}'\hat{\mathbf{e}} = \mathbf{e}'\mathbf{M}'\mathbf{Me} = \mathbf{e}'\mathbf{Me}$. A simple way to evaluate the expectation of this quadratic form is to make use of the concept of the trace of a matrix. Since $\mathbf{e}'\mathbf{Me}$ is a scalar, it is equal to its trace. Because for the product of matrices $\mathbf{A}$, $\mathbf{B}$, $\mathbf{C}$ we have $\operatorname{tr}(\mathbf{ABC}) = \operatorname{tr}(\mathbf{CAB}) = \operatorname{tr}(\mathbf{BCA})$, assuming the products exist, and because $E[\operatorname{tr}(z)] = \operatorname{tr}[E(z)]$ for any argument $z$, we have
(42) $E(\hat{\mathbf{e}}'\hat{\mathbf{e}}) = \operatorname{tr}[E(\mathbf{Mee}')] = \operatorname{tr}[\mathbf{M}\,E(\mathbf{ee}')] = \operatorname{tr}[\mathbf{M}\sigma^2\mathbf{I}] = \sigma^2\operatorname{tr}[\mathbf{M}]$
$= \sigma^2\operatorname{tr}(\mathbf{I}_T - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}')$
$= \sigma^2[\operatorname{tr}(\mathbf{I}_T) - \operatorname{tr}(\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}')]$
$= \sigma^2[\operatorname{tr}(\mathbf{I}_T) - \operatorname{tr}(\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1})]$
$= \sigma^2[\operatorname{tr}(\mathbf{I}_T) - \operatorname{tr}(\mathbf{I}_K)]$
$= \sigma^2[T - K]$
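The trace identity $\operatorname{tr}(\mathbf{M}) = T - K$ used above can be checked numerically for the two-column $\mathbf{X}$ of the simple model (a column of ones plus $x_2$); the data are an illustrative assumption.

```python
# Check that tr(M) = T − K for M = I − X(X'X)⁻¹X'; data are illustrative.
x2 = [1.0, 3.0, 4.0, 6.0, 8.0, 9.0]
T, K = len(x2), 2

Sx, Sxx = sum(x2), sum(x * x for x in x2)
det = T * Sxx - Sx ** 2
# Explicit 2×2 inverse: (X'X)⁻¹ = (1/det) [[Σx², −Σx], [−Σx, T]]
inv = [[Sxx / det, -Sx / det], [-Sx / det, T / det]]

# Element (t, s) of the "hat" matrix H = X(X'X)⁻¹X'; row t of X is (1, x2[t])
def h(t, s):
    xt, xs = (1.0, x2[t]), (1.0, x2[s])
    return sum(xt[i] * inv[i][j] * xs[j] for i in range(2) for j in range(2))

tr_M = sum(1.0 - h(t, t) for t in range(T))   # tr(I_T − H)
print(tr_M)   # equals T − K = 4 up to rounding
assert abs(tr_M - (T - K)) < 1e-9
```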
Consequently,

(43) $E\left[\dfrac{\hat{\mathbf{e}}'\hat{\mathbf{e}}}{T - K}\right] = \dfrac{1}{T - K}\,\sigma^2(T - K) = \sigma^2$.

Thus if we let $\hat{\sigma}^2 = \dfrac{\hat{\mathbf{e}}'\hat{\mathbf{e}}}{T - K}$ be an estimator of $\sigma^2$, it will be an unbiased estimator since $E\left[\dfrac{\hat{\mathbf{e}}'\hat{\mathbf{e}}}{T - K}\right] = \sigma^2$.
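A Monte Carlo sketch of (43): averaging $\hat{\sigma}^2 = \hat{\mathbf{e}}'\hat{\mathbf{e}}/(T - K)$ over many simulated samples should come out near $\sigma^2$. All numerical settings below are illustrative assumptions.

```python
# Monte Carlo check that ê'ê/(T − K) is unbiased for σ²; settings illustrative.
import random

random.seed(1)
beta1, beta2, sigma = 2.0, -1.0, 1.5
x2 = [float(t) for t in range(1, 9)]     # T = 8, K = 2
T, K = len(x2), 2
Sx, Sxx = sum(x2), sum(x * x for x in x2)
det = T * Sxx - Sx ** 2

reps, acc = 4000, 0.0
for _ in range(reps):
    e = [random.gauss(0.0, sigma) for _ in range(T)]
    y = [beta1 + beta2 * x + et for x, et in zip(x2, e)]
    Sy  = sum(y)
    Sxy = sum(x * v for x, v in zip(x2, y))
    b1 = (Sxx * Sy - Sx * Sxy) / det      # (15)
    b2 = (T * Sxy - Sx * Sy) / det
    ssr = sum((v - b1 - b2 * x) ** 2 for x, v in zip(x2, y))
    acc += ssr / (T - K)                  # σ̂² for this sample

print(acc / reps)   # ≈ σ² = 2.25
```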
Gauss-Markov Theorem
Recall that for the classical linear regression model, the following assumptions must hold:
1. The relationship between y and X is linear.
2. The X ’s are nonstochastic variables whose values are fixed.
3. The error has zero expected value: $E(e_t) = 0$ for all $t$.
4. The error term has constant variance for all observations, i.e., $E(e_t^2) = \sigma^2$ for all $t$.
5. The random variables $e_t$ are statistically independent. Thus $E(e_t e_s) = 0$ for all $t \neq s$.
(For the classical normal linear regression model, we add assumption 6, that the error term is normally distributed.) (For another way to express these assumptions, see page 11.)
Gauss-Markov Theorem: Given assumptions 1 through 5, the estimators b are the best (most
efficient) linear unbiased estimators of β in the sense that they have the minimum variance of
all linear unbiased estimators.
We have proved in (34) that $\mathbf{b}$ is an unbiased estimator of $\boldsymbol{\beta}$. To complete the proof of the Gauss-Markov theorem, we need to show that any other linear unbiased estimator $\tilde{\mathbf{b}}$ has greater variance than $\mathbf{b}$. Recall that $\mathbf{b} = \mathbf{Ay}$. Without loss of generality, we can write (for any matrix $\mathbf{C}$)

(37) $\tilde{\mathbf{b}} = (\mathbf{A} + \mathbf{C})\mathbf{y} = (\mathbf{A} + \mathbf{C})(\mathbf{X}\boldsymbol{\beta} + \mathbf{e}) = (\mathbf{A} + \mathbf{C})\mathbf{X}\boldsymbol{\beta} + (\mathbf{A} + \mathbf{C})\mathbf{e}$

The expected value of $\tilde{\mathbf{b}}$ is given by

$E(\tilde{\mathbf{b}}) = E[(\mathbf{A} + \mathbf{C})\mathbf{X}\boldsymbol{\beta}] + E[(\mathbf{A} + \mathbf{C})\mathbf{e}]$

Since $E[(\mathbf{A} + \mathbf{C})\mathbf{e}] = (\mathbf{A} + \mathbf{C})E[\mathbf{e}] = \mathbf{0}$, then

(38) $E(\tilde{\mathbf{b}}) = (\mathbf{A} + \mathbf{C})\mathbf{X}\boldsymbol{\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}\boldsymbol{\beta} + \mathbf{CX}\boldsymbol{\beta} = \mathbf{I}\boldsymbol{\beta} + \mathbf{CX}\boldsymbol{\beta}$
If $\tilde{\mathbf{b}}$ is to be unbiased, it must be that $\mathbf{CX}\boldsymbol{\beta} = \mathbf{0}$ for all $\boldsymbol{\beta}$, which requires $\mathbf{CX} = \mathbf{0}$.
Now examine the matrix $\operatorname{var}(\tilde{\mathbf{b}})$. Using (37) and the condition $\mathbf{CX} = \mathbf{0}$, we have $\tilde{\mathbf{b}} - \boldsymbol{\beta} = (\mathbf{A} + \mathbf{C})\mathbf{e}$. Thus

(39) $\operatorname{var}(\tilde{\mathbf{b}}\,|\,\mathbf{X}) = E[(\tilde{\mathbf{b}} - \boldsymbol{\beta})(\tilde{\mathbf{b}} - \boldsymbol{\beta})'] = E\{[(\mathbf{A} + \mathbf{C})\mathbf{e}][(\mathbf{A} + \mathbf{C})\mathbf{e}]'\}$
$= (\mathbf{A} + \mathbf{C})\,E[\mathbf{ee}']\,(\mathbf{A} + \mathbf{C})' = \sigma^2(\mathbf{A} + \mathbf{C})(\mathbf{A} + \mathbf{C})'$
Since

$(\mathbf{A} + \mathbf{C})(\mathbf{A} + \mathbf{C})' = \mathbf{AA}' + \mathbf{CA}' + \mathbf{AC}' + \mathbf{CC}'$

and $\mathbf{CA}' = \mathbf{CX}(\mathbf{X}'\mathbf{X})^{-1} = \mathbf{0}$ (and likewise $\mathbf{AC}' = \mathbf{0}$) because $\mathbf{CX} = \mathbf{0}$, this reduces to

$(\mathbf{A} + \mathbf{C})(\mathbf{A} + \mathbf{C})' = (\mathbf{X}'\mathbf{X})^{-1} + \mathbf{CC}'$

Therefore

(40) $\operatorname{var}(\tilde{\mathbf{b}}\,|\,\mathbf{X}) = \sigma^2[(\mathbf{X}'\mathbf{X})^{-1} + \mathbf{CC}'] = \operatorname{var}(\mathbf{b}\,|\,\mathbf{X}) + \sigma^2\mathbf{CC}'$
We can show that $\mathbf{CC}'$ is a positive semidefinite matrix. Write $\mathbf{z}' = \mathbf{a}'\mathbf{C}$, with transpose $\mathbf{z} = \mathbf{C}'\mathbf{a}$; then the quadratic form $\mathbf{z}'\mathbf{z} = \mathbf{a}'\mathbf{CC}'\mathbf{a}$ is non-negative for any non-zero vector $\mathbf{a}$. Recall from linear algebra that a symmetric $n\times n$ real matrix $\mathbf{M}$ is said to be positive definite if the scalar $\mathbf{a}'\mathbf{Ma}$ is strictly positive for every non-zero column vector $\mathbf{a}$; since $\mathbf{a}'\mathbf{CC}'\mathbf{a}$ is only guaranteed to be non-negative, $\mathbf{CC}'$ is positive semidefinite. The only case in which the quadratic form $\mathbf{a}'\mathbf{CC}'\mathbf{a}$ is $0$ for every $\mathbf{a}$ is when $\mathbf{C} = \mathbf{0}$ (all elements of $\mathbf{C}$ are $0$). When $\mathbf{C} = \mathbf{0}$, the alternative estimator becomes the ordinary least squares estimator $\mathbf{b}$, and the theorem is proved: the OLS estimator $\mathbf{b} = \mathbf{Ay}$ is the most efficient linear unbiased estimator.
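The Gauss-Markov result can be illustrated by simulation: compare the OLS slope $b_2$ with some other linear unbiased rule for $\beta_2$. As a hypothetical alternative (our own choice, not from the text) we use the "endpoint" estimator $(y_T - y_1)/(x_{2T} - x_{21})$, which is linear in $\mathbf{y}$ and unbiased for $\beta_2$ but should show larger sampling variance. All numerical settings are illustrative assumptions.

```python
# Monte Carlo illustration of Gauss-Markov: OLS slope vs an alternative
# linear unbiased slope rule (the "endpoint" estimator, our hypothetical example).
import random

random.seed(2)
beta1, beta2, sigma = 0.0, 1.0, 1.0
x2 = [float(t) for t in range(1, 11)]    # fixed regressor, T = 10
T  = len(x2)
Sx, Sxx = sum(x2), sum(x * x for x in x2)
det = T * Sxx - Sx ** 2

reps = 4000
ols, alt = [], []
for _ in range(reps):
    e = [random.gauss(0.0, sigma) for _ in range(T)]
    y = [beta1 + beta2 * x + et for x, et in zip(x2, e)]
    Sy  = sum(y)
    Sxy = sum(x * v for x, v in zip(x2, y))
    ols.append((T * Sxy - Sx * Sy) / det)          # OLS b2 from (15)
    alt.append((y[-1] - y[0]) / (x2[-1] - x2[0]))  # endpoint slope

def var(v):
    m = sum(v) / len(v)
    return sum((u - m) ** 2 for u in v) / len(v)

print(var(ols), var(alt))   # both rules are unbiased; OLS variance is smaller
assert var(ols) < var(alt)
```

Both estimators center on $\beta_2$, but the endpoint rule throws away all but two observations, so its sampling variance is roughly twice that of OLS for this design.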
Notes on some matrix derivations to get the OLS formula
(Greene, W. H. (2012). Appendix A, Matrix Algebra, A.8.1 Differentiation and Matrix Algebra, pp. 1048-1049.)
For a column vector $\mathbf{a}$ and a symmetric matrix $\mathbf{A}$, $\dfrac{\partial(\mathbf{a}'\mathbf{b})}{\partial\mathbf{b}} = \mathbf{a}$ and $\dfrac{\partial(\mathbf{b}'\mathbf{Ab})}{\partial\mathbf{b}} = 2\mathbf{Ab}$. Thus

$\dfrac{\partial(\mathbf{y}'\mathbf{y} - \mathbf{b}'\mathbf{X}'\mathbf{y} - \mathbf{y}'\mathbf{Xb} + \mathbf{b}'\mathbf{X}'\mathbf{Xb})}{\partial\mathbf{b}} = -2\mathbf{X}'\mathbf{y} + 2\mathbf{X}'\mathbf{Xb} = \mathbf{0}$.