
Ordinary Least Squares (OLS)

Linear statistical model (for two variables)

Consider a sample y_t of T observations that we assume have been drawn from a distribution that has a mean (location parameter) of E[y_t] and a variance (scale parameter) of σ². Let y_t be the outcome or response variable and x_{2t} be the known explanatory, conditioning, or control variable. Thus y_t is a random variable and x_{2t} is fixed or nonstochastic. We may model the observed random variable y_t as

(1)    y_t = E[y_t] + e_t = β_1 + x_{2t} β_2 + e_t

where β_1 reflects the level and β_2 reflects the slope of the relationship, which is linear in the parameters. The parameters β_1 and β_2 are unobserved, and the random variable e_t is unobservable.

We assume that the error term e_t is an independently and identically distributed (IID) random variable with mean E[e_t] = 0, variance E[e_t²] = σ², and covariance E[e_t e_s] = 0 for t ≠ s. Denote e = (e_1, e_2, …, e_T). Since E[e] = 0, the corresponding matrix of covariances for the vector of random variables e is

(2)    E{[e − E[e]][e − E[e]]'} = E[ee'] .

Thus

(3)    E[ee'] = E\left[ \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_T \end{bmatrix}_{(T \times 1)}
               \begin{bmatrix} e_1 & e_2 & \cdots & e_T \end{bmatrix}_{(1 \times T)} \right]
             = \begin{bmatrix}
                 E(e_1^2)   & E(e_1 e_2) & \cdots & E(e_1 e_T) \\
                 E(e_2 e_1) & E(e_2^2)   & \cdots & E(e_2 e_T) \\
                 \vdots     & \vdots     & \ddots & \vdots     \\
                 E(e_T e_1) & E(e_T e_2) & \cdots & E(e_T^2)
               \end{bmatrix}_{(T \times T)}

Since E(e_t²) = σ² and E(e_t e_s) = 0 for t ≠ s, we may write Eq. (3) as

(4)    E[ee'] = \begin{bmatrix}
                  \sigma^2 & 0 & \cdots & 0 \\
                  0 & \sigma^2 & \cdots & 0 \\
                  \vdots & \vdots & \ddots & \vdots \\
                  0 & 0 & \cdots & \sigma^2
                \end{bmatrix}
              = \sigma^2 \begin{bmatrix}
                  1 & 0 & \cdots & 0 \\
                  0 & 1 & \cdots & 0 \\
                  \vdots & \vdots & \ddots & \vdots \\
                  0 & 0 & \cdots & 1
                \end{bmatrix}
              = \sigma^2 I_T

where I_T denotes a T-th order identity matrix and σ²I_T is a scalar diagonal matrix.

Given Eq. (1), the T equations, one for each y_t, may be written as

(5)    y_1 = β_1 + x_{21} β_2 + e_1
       y_2 = β_1 + x_{22} β_2 + e_2
       ⋮
       y_T = β_1 + x_{2T} β_2 + e_T

If we represent the sample observations by the vector y = (y_1, y_2, …, y_T), let x_1 represent a (T × 1) vector of ones, and let x_2 = (x_{21}, x_{22}, …, x_{2T}), we may write the statistical model for the sample y_1, y_2, …, y_T as

(6)    \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_T \end{bmatrix}
     = \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix} \beta_1
     + \begin{bmatrix} x_{21} \\ x_{22} \\ \vdots \\ x_{2T} \end{bmatrix} \beta_2
     + \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_T \end{bmatrix}

or, compactly, as

(7)    y = x_1 β_1 + x_2 β_2 + e = [x_1  x_2] \begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix} + e = Xβ + e

where X is a (T × 2) known matrix and β is a (2 × 1) vector of unknown location parameters.


In this case the random vector y has mean E[y] = Xβ and covariance matrix

(8)    E[(y − Xβ)(y − Xβ)'] = E[ee'] = σ²I_T .

To summarize, we may write the linear statistical model as

(9)    y = Xβ + e

where the (T × 1) observable random vector y has mean E[y] = Xβ and covariance matrix σ²I_T, and the unobservable random vector e has mean E[e] = 0 and covariance matrix σ²I_T.

The least squares criterion & least squares estimators


Our concern is to estimate the unknown location parameters β_1 and β_2 (assuming for the moment that the scale parameter σ² is known) that represent the unknown level and slope coefficients of the economic relationship under study. We will use the least squares criterion. According to this criterion, given a sample of observed values of the random variables y_1, y_2, …, y_T, an estimate b of the unknown parameter vector β = (β_1, β_2) is chosen so as to minimize the residual sum of squares ∑_{t=1}^{T} ê_t² = ê'ê, where the residual is ê = y − ŷ and ŷ = Xb.

Formally, we can state this criterion as: given the sample observations y, find values for b_1 and b_2 that minimize the following quadratic form

(10)    RSS = ∑_{t=1}^{T} (y_t − x_{1t} b_1 − x_{2t} b_2)² = ê'ê
            = (y − Xb)'(y − Xb)
            = y'y − 2b'X'y + b'X'Xb

In this case we need to find the minimizing values 𝑏1 and 𝑏2 for 𝛽1 and 𝛽2 that make the partial
derivatives vanish. These derivatives are

(11)    ∂(y'y − 2b'X'y + b'X'Xb)/∂b = −2X'y + 2X'Xb = 0
Arranging the terms,

        X'Xb = X'y .

Pre-multiplying each side by (X'X)⁻¹, we have

(11a)   (X'X)⁻¹X'Xb = (X'X)⁻¹X'y ,  or  b = (X'X)⁻¹X'y .

This is the (ordinary) least squares estimator.
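As a small numerical sketch of the closed-form rule in (11a), the following NumPy snippet simulates an illustrative two-variable data set (the sample size, parameter values, and variable names are assumptions for the example, not taken from the text) and computes b = (X'X)⁻¹X'y directly:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 50                                   # sample size (illustrative)
x2 = rng.uniform(0.0, 10.0, size=T)      # fixed explanatory variable
X = np.column_stack([np.ones(T), x2])    # T x 2 design matrix [x1  x2]
beta = np.array([2.0, 0.5])              # assumed "true" beta_1, beta_2
e = rng.normal(0.0, 1.0, size=T)         # IID errors with mean 0, variance sigma^2 = 1
y = X @ beta + e                         # y = X beta + e, Eq. (9)

# Least squares estimator b = (X'X)^{-1} X'y, Eq. (11a)
b = np.linalg.inv(X.T @ X) @ (X.T @ y)
print(b)                                 # close to [2.0, 0.5]
```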
In algebraic notation, the derivatives in (11) can be written as

(12)    0 = −2X'y + 2X'Xb
          = −2 \begin{bmatrix} x_1'y \\ x_2'y \end{bmatrix}
            + 2 \begin{bmatrix} x_1' \\ x_2' \end{bmatrix} [x_1 \;\; x_2] \, b
          = −2 \begin{bmatrix} x_1'y \\ x_2'y \end{bmatrix}
            + 2 \begin{bmatrix} x_1'x_1 & x_1'x_2 \\ x_2'x_1 & x_2'x_2 \end{bmatrix} b
          = −2 \begin{bmatrix} Σy_t \\ Σx_{2t}y_t \end{bmatrix}
            + 2 \begin{bmatrix} T & Σx_{2t} \\ Σx_{2t} & Σx_{2t}^2 \end{bmatrix}
                \begin{bmatrix} b_1 \\ b_2 \end{bmatrix}

This leads us to a system of two linear equations:


(13)    T b_1 + (Σx_{2t}) b_2 = Σy_t
        (Σx_{2t}) b_1 + (Σx_{2t}^2) b_2 = Σx_{2t}y_t

or

        \begin{bmatrix} T & Σx_{2t} \\ Σx_{2t} & Σx_{2t}^2 \end{bmatrix}
        \begin{bmatrix} b_1 \\ b_2 \end{bmatrix}
      = \begin{bmatrix} Σy_t \\ Σx_{2t}y_t \end{bmatrix}

or

        X'Xb = X'y .

The equations in (12) and (13) represent a system of linear simultaneous equations that must be solved for b_1 and b_2. Using the concept of the inverse of a matrix,

(14)    \begin{bmatrix} b_1 \\ b_2 \end{bmatrix}
      = \begin{bmatrix} T & Σx_{2t} \\ Σx_{2t} & Σx_{2t}^2 \end{bmatrix}^{-1}
        \begin{bmatrix} Σy_t \\ Σx_{2t}y_t \end{bmatrix}
      = \frac{1}{T Σx_{2t}^2 − (Σx_{2t})^2}
        \begin{bmatrix} Σx_{2t}^2 & −Σx_{2t} \\ −Σx_{2t} & T \end{bmatrix}
        \begin{bmatrix} Σy_t \\ Σx_{2t}y_t \end{bmatrix}

where T Σx_{2t}² − (Σx_{2t})² = T Σ(x_{2t} − x̄_2)² and x̄_2 is the arithmetic mean of x_{2t}.
Consequently,

(15)    b_1 = \frac{(Σx_{2t}^2)(Σy_t) − (Σx_{2t})(Σx_{2t}y_t)}{T Σ(x_{2t} − x̄_2)^2}

        b_2 = \frac{T(Σx_{2t}y_t) − (Σx_{2t})(Σy_t)}{T Σ(x_{2t} − x̄_2)^2} .

It is sometimes useful to simplify b_1 as b_1 = ȳ − x̄_2 b_2, where ȳ and x̄_2 are the sample means of the observations on y and x_2, respectively.
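The scalar formulas in (15), the shortcut b_1 = ȳ − x̄_2 b_2, and the matrix form b = (X'X)⁻¹X'y can be checked against one another numerically. A brief sketch with illustrative simulated data:

```python
import numpy as np

rng = np.random.default_rng(1)
T = 30
x2 = rng.uniform(0.0, 5.0, size=T)          # illustrative regressor values
y = 1.0 + 2.0 * x2 + rng.normal(size=T)     # illustrative data: beta_1 = 1, beta_2 = 2

# Scalar formulas (15): intercept and slope for the two-variable model
denom = T * np.sum((x2 - x2.mean()) ** 2)
b2 = (T * np.sum(x2 * y) - np.sum(x2) * np.sum(y)) / denom
b1 = (np.sum(x2 ** 2) * np.sum(y) - np.sum(x2) * np.sum(x2 * y)) / denom

# Equivalent shortcut b1 = ybar - x2bar * b2
b1_alt = y.mean() - x2.mean() * b2

# Matrix formula (16): solve X'Xb = X'y
X = np.column_stack([np.ones(T), x2])
b_matrix = np.linalg.solve(X.T @ X, X.T @ y)

print(b1, b2)            # scalar formulas
print(b1_alt)            # mean-based shortcut, equals b1
print(b_matrix)          # matches [b1, b2]
```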

To summarize, b, the least squares estimator of the unknown parameters β_1 and β_2, results from solving the two simultaneous linear equations in (13). The resulting least squares estimator is

(16)    b = (X'X)⁻¹X'y .

Note that the least squares estimator b is a linear function of the observations y.

Sampling properties of the least squares estimators


Our next concern is with the sampling performance of the estimator. Since b is a linear function of the sample observations y, which is a random vector, the least squares estimator b of the location vector β is also a random vector; that is, b is a vector of random variables. We want to know about the mean and sampling variability of this random vector.

To learn about the mean vector for b, we make use of the expectations operator E and investigate E[b]. Using (16),
(17) 𝐸(𝐛) = 𝐸[(𝐗 ′ 𝐗)−1 𝐗 ′ 𝒚] = 𝐸[(𝐗 ′ 𝐗)−1 𝐗 ′ (𝐗𝛃 + 𝒆)]
= 𝐸[(𝐗 ′ 𝐗)−1 𝐗 ′ 𝐗𝛃 + (𝐗 ′ 𝐗)−1 𝐗 ′ 𝒆] = 𝐸(𝐈𝛃) + (𝐗 ′ 𝐗)−1 𝐗 ′ 𝐸(𝒆)
= 𝛃 + (𝐗 ′ 𝐗)−1 𝐗 ′ 𝟎 = 𝛃
since by assumption E[e_t] = 0. Consequently, using the least squares estimator results in a linear rule for estimating β_1 and β_2 that is unbiased. Note that what is unbiased is the rule or estimator and not the estimate obtained from a particular sample. Thus, if we use the least squares rule b, it will be an unbiased rule since E[b − β] = 0.
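A short Monte Carlo sketch of this unbiasedness property (all simulation settings here are illustrative assumptions): for a fixed X, averaging the estimates b over many fresh error draws should recover β.

```python
import numpy as np

rng = np.random.default_rng(2)
T, sigma = 40, 1.5
beta = np.array([3.0, -0.7])                     # assumed "true" parameters
x2 = rng.uniform(0.0, 10.0, size=T)
X = np.column_stack([np.ones(T), x2])            # X is held fixed across replications
rule = np.linalg.inv(X.T @ X) @ X.T              # the linear rule (X'X)^{-1} X'

reps = 10_000
estimates = np.empty((reps, 2))
for r in range(reps):
    e = rng.normal(0.0, sigma, size=T)           # fresh IID errors each replication
    y = X @ beta + e
    estimates[r] = rule @ y                      # b = (X'X)^{-1} X'y

print(estimates.mean(axis=0))                    # close to beta: E[b] = beta
```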

Our next concern is with its sampling variability or precision. We know that b = (X'X)⁻¹X'y = (X'X)⁻¹X'(Xβ + e), or

(18)    b = β + (X'X)⁻¹X'e

and so

(19)    b − β = (X'X)⁻¹X'e .

Since b is an unbiased rule, we may specify the covariance matrix for b as

(20)    E{[b − E(b)][b − E(b)]'} = E[(b − β)(b − β)'] .

Making use of (19), we express the covariance matrix Σ_b for the random vector b as

(21)    Σ_b = E[(X'X)⁻¹X'ee'X(X'X)⁻¹]
            = (X'X)⁻¹X' E[ee'] X(X'X)⁻¹
            = σ²(X'X)⁻¹X' I_T X(X'X)⁻¹
            = σ²(X'X)⁻¹
            = σ² \begin{bmatrix} T & Σx_{2t} \\ Σx_{2t} & Σx_{2t}^2 \end{bmatrix}^{-1}

where use is made of the assumption that E[ee'] = σ²I_T. Since b = (b_1, b_2) is a two-dimensional random vector, the covariance matrix Σ_b is the following (2 × 2) matrix

(22)    Σ_b = \begin{bmatrix}
                E[(b_1 − β_1)^2] & E[(b_1 − β_1)(b_2 − β_2)] \\
                E[(b_1 − β_1)(b_2 − β_2)] & E[(b_2 − β_2)^2]
              \end{bmatrix}
            = \begin{bmatrix}
                var(b_1) & cov(b_1, b_2) \\
                cov(b_1, b_2) & var(b_2)
              \end{bmatrix}

Using the inverse of X'X in (14), the variances and covariances of b_1 and b_2 are

(23)    var(b_1) = σ² \left[ \frac{Σx_{2t}^2}{T Σ(x_{2t} − x̄_2)^2} \right] = σ² a_1

(24)    var(b_2) = \frac{σ²}{Σ(x_{2t} − x̄_2)^2} = σ² a_2

(25)    cov(b_1, b_2) = σ² \left[ \frac{−x̄_2}{Σ(x_{2t} − x̄_2)^2} \right]

To summarize, the sampling properties of the random variables b_1 and b_2 can be stated as follows:

        b_1 ~ (β_1, σ²a_1)   and   b_2 ~ (β_2, σ²a_2)

where a_1 and a_2 are defined in (23) and (24). One thing that is apparent from (23) to (25) is that the more dispersed the explanatory variable [i.e., the larger is Σ(x_{2t} − x̄_2)²], the greater the precision of b_1 and b_2. Also, because the number of terms in the summation Σ(x_{2t} − x̄_2)² increases as the sample size increases, an increase in sample size generally leads to an increase in precision. Finally, the smaller the error variance σ², which reflects the variability of y_t about its mean, the more precise are the estimators.
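As a hedged numerical check of (21) and (23)–(25), the matrix σ²(X'X)⁻¹ should reproduce the scalar variance and covariance formulas entry by entry (the simulated regressor and σ² below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
T, sigma2 = 25, 2.0                              # illustrative sample size and error variance
x2 = rng.uniform(0.0, 8.0, size=T)
X = np.column_stack([np.ones(T), x2])

# Covariance matrix of b from (21): sigma^2 (X'X)^{-1}
cov_b = sigma2 * np.linalg.inv(X.T @ X)

# Scalar formulas (23)-(25)
S = np.sum((x2 - x2.mean()) ** 2)                # sum of squared deviations of x2
var_b1 = sigma2 * np.sum(x2 ** 2) / (T * S)
var_b2 = sigma2 / S
cov_b1b2 = -sigma2 * x2.mean() / S

print(cov_b)
print(var_b1, var_b2, cov_b1b2)                  # match the entries of cov_b
```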

Multiple regression model in matrix form


We now consider the multiple regression model in matrix form, which includes K + 1 variables (a dependent variable and K independent variables, including the constant term), with T observations on each. Thus for each t = 1, …, T,

(27)    y_1 = β_1 + x_{21}β_2 + x_{31}β_3 + ⋯ + x_{K1}β_K + e_1
        y_2 = β_1 + x_{22}β_2 + x_{32}β_3 + ⋯ + x_{K2}β_K + e_2
        ⋮
        y_T = β_1 + x_{2T}β_2 + x_{3T}β_3 + ⋯ + x_{KT}β_K + e_T

The corresponding matrix formulation of the model is

(28)    y = Xβ + e

in which

(29)    y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_T \end{bmatrix}   
        X = \begin{bmatrix} 1 & x_{21} & \cdots & x_{K1} \\ 1 & x_{22} & \cdots & x_{K2} \\ \vdots & \vdots & & \vdots \\ 1 & x_{2T} & \cdots & x_{KT} \end{bmatrix}   
        β = \begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_K \end{bmatrix}   
        e = \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_T \end{bmatrix}

where
        y is a T × 1 column vector of dependent variable observations
        X is a T × K matrix of independent variable observations
        β is a K × 1 column vector of unknown parameters
        e is a T × 1 column vector of errors

The assumptions of the classical linear regression model are

i.   The model specification is given by (28), that is, y = Xβ + e.

ii.  The elements of X are fixed and have finite variance. In addition, X has rank K, which is less than the number of observations T.

iii. e is normally distributed with E[e] = 0 and E[ee'] = σ²I, where I is a T × T identity matrix.

The assumption that X has rank K guarantees that perfect collinearity will not be present. With perfect collinearity, one of the columns of X would be a linear combination of the remaining columns, and the rank of X would be less than K. The error assumptions are the strongest possible, since they guarantee the statistical as well as arithmetic properties of the ordinary least squares estimation process. In addition to normality, we assume that each error term has mean 0, all variances are constant, and all covariances are 0. The variance-covariance matrix σ²I appears as in (3) and (4).

Least squares estimation

Our objective is to find a vector of parameters b that minimizes

(30)    RSS = ∑_{t=1}^{T} ê_t² = ê'ê .

Substituting ê = y − Xb into (30), we get


(31) 𝒆̂′𝒆̂ = (𝒚 − 𝐗𝐛)′ (𝒚 − 𝐗𝐛) = 𝒚′ 𝒚 − 2𝐛′ 𝐗 ′ 𝒚 + 𝐛′ 𝐗 ′ 𝐗𝐛
The last step follows because 𝐛′ 𝐗 ′ 𝒚 and 𝒚′𝐗𝐛 are both scalars and are equal to each other. To
determine the least squares estimator b, we minimize RSS as follows:

(32)    ∂RSS/∂b = −2X'y + 2X'Xb = 0
        b = (X'X)⁻¹X'y

The matrix X'X, called the cross-product matrix, is guaranteed to have an inverse because of our assumption that X has rank K.
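The rank condition is easy to check numerically. The sketch below (with an illustrative design matrix) verifies rank(X) = K, solves the normal equations X'Xb = X'y without forming the inverse explicitly, and shows how perfect collinearity destroys the rank condition:

```python
import numpy as np

rng = np.random.default_rng(4)
T, K = 60, 3
X = np.column_stack([np.ones(T), rng.normal(size=(T, K - 1))])   # T x K design matrix
y = X @ np.array([1.0, 0.5, -2.0]) + rng.normal(size=T)

print(np.linalg.matrix_rank(X) == K)             # rank K => X'X is invertible

b = np.linalg.solve(X.T @ X, X.T @ y)            # solves X'Xb = X'y without inverting X'X
print(b)

# With perfect collinearity, X'X is singular and there is no unique solution
X_bad = np.column_stack([X, X[:, 1] + X[:, 2]])  # last column is a linear combination
print(np.linalg.matrix_rank(X_bad) < X_bad.shape[1])   # True: rank deficient
```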

Properties of the least squares estimator

Now consider the properties of the least squares estimator b . First, we can prove that b is an
unbiased estimator of β :

(33)    b = (X'X)⁻¹X'y = (X'X)⁻¹X'(Xβ + e)
          = β + (X'X)⁻¹X'e
          = β + Ae ,  where A = (X'X)⁻¹X' .

Then

(34)    E[b] = β + E[Ae] = β + A E[e] = β

Looking at (33), notice that Ae = (X'X)⁻¹X'e represents the regression of e on X. As long as
the effects of missing variables (or omitted variables) are randomly distributed independently
of X and have mean 0 , the least squares parameter estimator b will be unbiased.

Further, the least squares estimator will be normally distributed, since b is a linear function of
e and e is normally distributed. The properties of the variances of the individual b_i, i = 1, …, K, and their covariances are determined as follows:

(35)    Σ_b = E[(b − β)(b − β)']

            = \begin{bmatrix}
                E[(b_1 − β_1)^2] & E[(b_1 − β_1)(b_2 − β_2)] & \cdots & E[(b_1 − β_1)(b_K − β_K)] \\
                E[(b_2 − β_2)(b_1 − β_1)] & E[(b_2 − β_2)^2] & \cdots & E[(b_2 − β_2)(b_K − β_K)] \\
                \vdots & \vdots & & \vdots \\
                E[(b_K − β_K)(b_1 − β_1)] & E[(b_K − β_K)(b_2 − β_2)] & \cdots & E[(b_K − β_K)^2]
              \end{bmatrix}

            = \begin{bmatrix}
                var(b_1) & cov(b_1, b_2) & \cdots & cov(b_1, b_K) \\
                cov(b_1, b_2) & var(b_2) & \cdots & cov(b_2, b_K) \\
                \vdots & \vdots & & \vdots \\
                cov(b_1, b_K) & cov(b_2, b_K) & \cdots & var(b_K)
              \end{bmatrix}

where Σ_b is a K × K matrix. The diagonal elements of Σ_b represent the variances of the estimated parameters, while the off-diagonal terms represent the covariances. Note from (33) that b − β = Ae, where A = (X'X)⁻¹X'. Then

        Σ_b = E[(b − β)(b − β)'] = E[(Ae)(Ae)'] = E(Aee'A')
            = A E(ee') A' = A(σ²I)A' = σ²AA'

since A and A' are matrices of fixed numbers. Note that

        AA' = [(X'X)⁻¹X'][(X'X)⁻¹X']' = [(X'X)⁻¹X'][X(X'X)⁻¹]
            = (X'X)⁻¹(X'X)(X'X)⁻¹ = (X'X)⁻¹ .

Therefore,

(36)    E[(b − β)(b − β)'] = σ²(X'X)⁻¹ .

An unbiased estimator of σ²

Note that, by definition, the vector of least squares residuals is

(37)    ê = y − Xb = y − X(X'X)⁻¹X'y
          = (I_T − X(X'X)⁻¹X')y
          = My

where M is a T × T matrix that is symmetric (M = M') and idempotent (M = MM). In view of (37), we can interpret M as a matrix that, when it premultiplies any vector y, produces the vector of least squares residuals in the regression of y on X. It follows immediately that MX = 0. One way to interpret this result is that if X is regressed on X, a perfect fit will result and the residuals will be zero.
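These properties of M can be verified numerically on an illustrative design matrix; the last line also anticipates the result tr(M) = T − K used below:

```python
import numpy as np

rng = np.random.default_rng(5)
T, K = 20, 3
X = np.column_stack([np.ones(T), rng.normal(size=(T, K - 1))])

# Residual-maker matrix M = I_T - X(X'X)^{-1}X'
M = np.eye(T) - X @ np.linalg.inv(X.T @ X) @ X.T

print(np.allclose(M, M.T))             # symmetric: M = M'
print(np.allclose(M, M @ M))           # idempotent: M = MM
print(np.allclose(M @ X, 0.0))         # MX = 0: regressing X on X leaves no residual
print(np.isclose(np.trace(M), T - K))  # tr(M) = T - K, used for the unbiased sigma^2 estimator
```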

Further, (37) can also be written as

(38)    ê = My = M(Xβ + e) = Me

since MX = 0. An estimator of σ² will be based on the sum of squared residuals:

(39)    ê'ê = e'Me .

If we make use of (39) as an estimator of σ², we need to evaluate

(40)    E[ê'ê] = E[e'Me] .

A simple way to evaluate the expectation of this quadratic form is to make use of the concept of the trace of a matrix. Since e'Me is a scalar, it is equal to its trace. Consequently,

(41)    E[ê'ê] = E[tr(e'Me)] = E[tr(Mee')]

since, for the product of matrices A, B, C, tr(ABC) = tr(CAB) = tr(BCA), provided ABC, CAB, and BCA exist. Also, because E[tr(z)] = tr[E(z)] for any argument z, we have
(42) 𝐸(𝐞̂′𝐞̂) = tr[𝐸(𝐌𝐞𝐞′)] = tr[𝐌 𝐸(𝐞𝐞′)] = tr[𝐌𝜎 2 𝐈] = 𝜎 2 tr[𝐌 𝐈] = 𝜎 2 tr[𝐌]
= 𝜎 2 tr(𝐈𝑇 − 𝐗(𝐗 ′ 𝐗)−1 𝐗′)
= 𝜎 2 [tr(𝐈𝑇 ) − tr(𝐗(𝐗 ′ 𝐗)−1 𝐗′)]
= 𝜎 2 [tr(𝐈𝑇 ) − tr(𝐗′𝐗(𝐗 ′ 𝐗)−1 )]
= 𝜎 2 [tr(𝐈𝑇 ) − tr(𝐈𝐾 )]
= 𝜎 2 [𝑇 − 𝐾]

Consequently,

(43)    E[ê'ê/(T − K)] = (1/(T − K)) σ²(T − K) = σ² .

Thus if we let σ̂² = ê'ê/(T − K) be an estimator of σ², it will be an unbiased estimator since E[ê'ê/(T − K)] = σ².
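A brief Monte Carlo sketch (simulation settings are illustrative) of the unbiasedness of σ̂² = ê'ê/(T − K), compared with the downward-biased estimator that divides by T:

```python
import numpy as np

rng = np.random.default_rng(6)
T, K, sigma2 = 15, 3, 4.0                         # small T makes the bias of the naive divisor visible
X = np.column_stack([np.ones(T), rng.normal(size=(T, K - 1))])
beta = np.array([1.0, 2.0, -1.0])
M = np.eye(T) - X @ np.linalg.inv(X.T @ X) @ X.T  # residual-maker matrix

reps = 20_000
s2_unbiased = np.empty(reps)
s2_naive = np.empty(reps)
for r in range(reps):
    e = rng.normal(0.0, np.sqrt(sigma2), size=T)
    e_hat = M @ (X @ beta + e)                    # residuals: e_hat = My = Me
    rss = e_hat @ e_hat
    s2_unbiased[r] = rss / (T - K)                # sigma_hat^2 from (43)
    s2_naive[r] = rss / T                         # dividing by T underestimates sigma^2

print(s2_unbiased.mean())   # close to 4.0
print(s2_naive.mean())      # close to 4.0 * (T - K) / T = 3.2
```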

Gauss-Markov Theorem

Recall that for the classical linear regression model, the following assumptions must hold:
1. The relationship between y and X is linear.
2. The X's are nonstochastic variables whose values are fixed.
3. The error has zero expected value: E(e_t) = 0 for all t.
4. The error term has constant variance for all observations, i.e., E(e_t²) = σ² for all t.
5. The random variables e_t are statistically independent. Thus E(e_t e_s) = 0 for all t ≠ s.
(For the classical normal linear regression model, we add assumption 6, that the error term is normally distributed.) (As another way to express these assumptions, see page 11.)

Gauss-Markov Theorem: Given assumptions 1 through 5, the estimators b are the best (most
efficient) linear unbiased estimators of β in the sense that they have the minimum variance of
all linear unbiased estimators.

We have proved in (34) that b is an unbiased estimator of β. To complete the proof of the Gauss-Markov theorem, we need to show that any other unbiased linear estimator b̃ has greater variance than b. Recall that b = Ay. Without loss of generality, we can write (for any matrix C)

(44)    b̃ = (A + C)y = (A + C)(Xβ + e)
           = (A + C)Xβ + (A + C)e .

The expected value of b̃ is given by

(45)    E(b̃) = E[(A + C)Xβ] + E[(A + C)e] .

Since E[(A + C)e] = (A + C)E[e] = 0, then

(46)    E(b̃) = (A + C)Xβ = (X'X)⁻¹X'Xβ + CXβ
             = Iβ + CXβ .

If b̃ is unbiased, then it must be that CX = 0 for all β.

Now examine the matrix var(b̃). Using (44), the condition that CX = 0, and the fact that AX = (X'X)⁻¹X'X = I, we may write

        b̃ − β = (A + C)Xβ + (A + C)e − β
              = AXβ − β + CXβ + (A + C)e
              = (A + C)e .

Thus

(47)    var(b̃ | X) = E[(b̃ − β)(b̃ − β)'] = E{[(A + C)e][(A + C)e]'}
                   = E[(A + C)ee'(A + C)'] = (A + C)E[ee'](A + C)'
                   = σ²(A + C)(A + C)' .

Since

        (A + C)(A + C)' = AA' + CA' + AC' + CC'
                        = (X'X)⁻¹X'X(X'X)⁻¹ + CX(X'X)⁻¹ + (X'X)⁻¹X'C' + CC'
                        = (X'X)⁻¹ + CC' ,

it follows that

(48)    var(b̃ | X) = σ²[(X'X)⁻¹ + CC'] = var(b | X) + σ²CC' .

We can show that CC' is a positive semidefinite matrix. Write z' = a'C and its transpose z = C'a; then the quadratic form z'z = a'CC'a is non-negative for any non-zero vector a. Recall from linear algebra that a symmetric n × n real matrix M is said to be positive definite if the scalar a'Ma is strictly positive for every non-zero column vector a. Since z'z = a'CC'a is non-negative, CC' is positive semidefinite. The only case in which the quadratic form a'CC'a is 0 for every a is when C = 0 (all elements of C are 0). When C = 0, the alternative estimator becomes the ordinary least squares estimator b, and the theorem is proved: the OLS estimator b = Ay is the most efficient.
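A numerical illustration of the theorem, with an illustrative choice of C: any C of the form DM satisfies CX = 0, so (A + C)y is another linear unbiased estimator, and its covariance matrix exceeds that of OLS by the positive semidefinite term σ²CC'.

```python
import numpy as np

rng = np.random.default_rng(7)
T, K, sigma2 = 30, 2, 1.0
X = np.column_stack([np.ones(T), rng.uniform(0.0, 10.0, size=T)])

A = np.linalg.inv(X.T @ X) @ X.T                 # OLS rule: b = Ay
M = np.eye(T) - X @ A                            # residual-maker, MX = 0

# Any C of the form DM satisfies CX = DMX = 0, so (A + C)y is also unbiased
D = 0.05 * rng.normal(size=(K, T))               # illustrative perturbation
C = D @ M
print(np.allclose(C @ X, 0.0))                   # unbiasedness condition holds

var_ols = sigma2 * np.linalg.inv(X.T @ X)        # var(b | X) = sigma^2 (X'X)^{-1}
var_alt = sigma2 * (A + C) @ (A + C).T           # var(b_tilde | X)

# The difference is sigma^2 CC', which is positive semidefinite:
diff = var_alt - var_ols
print(np.allclose(diff, sigma2 * C @ C.T))
print(np.all(np.linalg.eigvalsh(diff) >= -1e-12))  # eigenvalues non-negative
print(np.diag(var_alt) >= np.diag(var_ols))        # each variance is at least the OLS variance
```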

Notes on some matrix derivatives used to obtain the OLS formula
(Greene, W. H. (2012). Appendix A Matrix Algebra, A.8.1 Differentiation and Matrix Algebra, pp. 1048–1049)

Suppose a and x are n × 1 column vectors, and consider a linear function of x:

        y = a'x = x'a = ∑_{i=1}^{n} a_i x_i .

Then

        ∂(a'x)/∂x = a .

Note that ∂(a'x)/∂x = a, not a'.

Now suppose a quadratic form in x:

        x'Ax = ∑_{i=1}^{n} ∑_{j=1}^{n} x_i x_j a_{ij}

where A is an n × n matrix. Then

        ∂(x'Ax)/∂x = 2Ax

if A is a symmetric matrix. If A is not symmetric, then

        ∂(x'Ax)/∂x = (A + A')x .
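A quick finite-difference sanity check of these two derivative rules (the vectors and matrix below are arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 4
a = rng.normal(size=n)
A = rng.normal(size=(n, n))          # a general (non-symmetric) matrix
x = rng.normal(size=n)
h = 1e-6

def num_grad(f, x, h=h):
    """Central-difference gradient of a scalar function f at x."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        step = np.zeros_like(x)
        step[i] = h
        g[i] = (f(x + step) - f(x - step)) / (2 * h)
    return g

# d(a'x)/dx = a
print(np.allclose(num_grad(lambda v: a @ v, x), a))

# d(x'Ax)/dx = (A + A')x for general A, and 2Ax when A is symmetric
print(np.allclose(num_grad(lambda v: v @ A @ v, x), (A + A.T) @ x, atol=1e-4))
A_sym = (A + A.T) / 2
print(np.allclose(num_grad(lambda v: v @ A_sym @ v, x), 2 * A_sym @ x, atol=1e-4))
```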

The residual sum of squares (RSS) is (y − Xb)'(y − Xb) = y'y − b'X'y − y'Xb + b'X'Xb. The second and third terms are linear functions of b, while the fourth term is a quadratic form in b.

We want to evaluate the derivative


𝜕(𝒚′ 𝒚 − 𝐛′ 𝐗 ′ 𝒚 − 𝒚′𝐗𝐛 + 𝐛′ 𝐗 ′ 𝐗𝐛)
=𝟎.
𝜕𝐛

For the third term, since we can write 𝒚′ 𝐗𝐛 as (𝐗 ′ 𝒚)′ 𝐛, then


𝜕(𝒚′𝐗𝐛)
= 𝐗 ′ 𝒚.
𝜕𝐛

For the second term, since 𝐛′ 𝐗 ′ 𝒚 = 𝒚′ 𝐗𝐛, we have


𝜕(𝐛′ 𝐗 ′ 𝒚)
= 𝐗 ′ 𝒚.
𝜕𝐛
For the fourth term, since 𝐗 ′ 𝐗 is a symmetric matrix,
𝜕(𝐛′𝐗′𝐗𝐛)
= 2𝐗 ′ 𝐗𝐛.
𝜕𝐛

Thus
𝜕(𝒚′ 𝒚 − 𝐛′ 𝐗 ′ 𝒚 − 𝒚′𝐗𝐛 + 𝐛′ 𝐗 ′ 𝐗𝐛)
= −2𝐗 ′ 𝒚 + 2𝐗 ′ 𝐗𝐛 = 𝟎.
𝜕𝐛
