
Ordinary Least Squares (OLS)

Linear statistical model (for two variables)

Consider a sample y_t of T observations that we assume have been drawn from a distribution that has a mean (location parameter) of E[y_t] and a variance (scale parameter) of σ². Let y_t be the outcome or response variable and x_{2t} be the known explanatory, conditioning, or control variable. Thus y_t is a random variable and x_{2t} is fixed or nonstochastic. We may model the observed random variable y_t as

(1)    y_t = E[y_t] + e_t = β_1 + x_{2t} β_2 + e_t

where β_1 reflects the level and β_2 reflects the slope of the relationship, which is linear in the parameters. The parameters β_1 and β_2 are unobserved, and the random variable e_t is unobservable.

We assume that the error term e_t is an independently and identically distributed (IID) random variable with mean E[e_t] = 0, variance E[e_t²] = σ², and covariance E[e_t e_s] = 0 for t ≠ s. Denote e = (e_1, e_2, …, e_T). Since E[e] = 0, the corresponding matrix of covariances for the vector of random variables e is

(2)    E{[e − E[e]][e − E[e]]'} = E[ee'] .

Thus

(3)    E[ee'] = E\left[ \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_T \end{bmatrix}_{(T \times 1)}
               \begin{bmatrix} e_1 & e_2 & \cdots & e_T \end{bmatrix}_{(1 \times T)} \right]
             = \begin{bmatrix}
                 E(e_1^2)   & E(e_1 e_2) & \cdots & E(e_1 e_T) \\
                 E(e_2 e_1) & E(e_2^2)   & \cdots & E(e_2 e_T) \\
                 \vdots     & \vdots     & \ddots & \vdots     \\
                 E(e_T e_1) & E(e_T e_2) & \cdots & E(e_T^2)
               \end{bmatrix}_{(T \times T)}

Since E(e_t²) = σ² and E(e_t e_s) = 0 for t ≠ s, we may write Eq. (3) as

(4)    E[ee'] = \begin{bmatrix}
                  \sigma^2 & 0 & \cdots & 0 \\
                  0 & \sigma^2 & \cdots & 0 \\
                  \vdots & \vdots & \ddots & \vdots \\
                  0 & 0 & \cdots & \sigma^2
                \end{bmatrix}
              = \sigma^2 \begin{bmatrix}
                  1 & 0 & \cdots & 0 \\
                  0 & 1 & \cdots & 0 \\
                  \vdots & \vdots & \ddots & \vdots \\
                  0 & 0 & \cdots & 1
                \end{bmatrix}
              = \sigma^2 I_T

where I_T denotes a T-th order identity matrix and σ²I_T is a scalar diagonal matrix.

Given Eq. (1), the T equations, one for each y_t, may be written as

(5)    y_1 = β_1 + x_{21} β_2 + e_1
       y_2 = β_1 + x_{22} β_2 + e_2
       ⋮
       y_T = β_1 + x_{2T} β_2 + e_T

If we represent the sample observations by the vector y = (y_1, y_2, …, y_T), let x_1 represent a (T × 1) vector of ones, and let x_2 = (x_{21}, x_{22}, …, x_{2T}), we may write the statistical model for the sample y_1, y_2, …, y_T as

(6)    \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_T \end{bmatrix}
     = \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix} \beta_1
     + \begin{bmatrix} x_{21} \\ x_{22} \\ \vdots \\ x_{2T} \end{bmatrix} \beta_2
     + \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_T \end{bmatrix}

or, compactly, as

(7)    y = x_1 β_1 + x_2 β_2 + e = [x_1  x_2] \begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix} + e = Xβ + e

where X is a (T × 2) known matrix and β is a (2 × 1) vector of unknown location parameters.


In this case the random vector y has mean E[y] = Xβ and covariance matrix

(8)    E[(y − Xβ)(y − Xβ)'] = E[ee'] = σ²I_T .

To summarize, we may write the linear statistical model as

(9)    y = Xβ + e

where the (T × 1) observable random vector y has mean E[y] = Xβ and covariance matrix σ²I_T, and the unobservable random vector e has mean E[e] = 0 and covariance matrix σ²I_T.

The least squares criterion & least squares estimators


Our concern is to estimate the unknown location parameters β_1 and β_2 (assuming for the moment that the scale parameter σ² is known) that represent the unknown level and slope coefficients of the economic relationship under study. We will use the least squares criterion. According to this criterion, given a sample of observed values of the random variables y_1, y_2, …, y_T, an estimate b of the unknown parameter vector β = (β_1, β_2) is chosen so as to minimize the residual sum of squares ∑_{t=1}^{T} ê_t² = ê'ê, where the residual is ê = y − ŷ and ŷ = Xb.

Formally, we can state this criterion as: given the sample observations y, find values for b_1 and b_2 that minimize the following quadratic form

(10)    RSS = ∑_{t=1}^{T} (y_t − x_{1t} b_1 − x_{2t} b_2)² = ê'ê
            = (y − Xb)'(y − Xb)
            = y'y − 2b'X'y + b'X'Xb

In this case we need to find the minimizing values 𝑏1 and 𝑏2 for 𝛽1 and 𝛽2 that make the partial
derivatives vanish. These derivatives are

(11)    ∂(y'y − 2b'X'y + b'X'Xb)/∂b = −2X'y + 2X'Xb = 0
Arranging the terms,

        X'Xb = X'y .

Pre-multiplying each side by (X'X)⁻¹, we have

(11a)   (X'X)⁻¹X'Xb = (X'X)⁻¹X'y ,  or  b = (X'X)⁻¹X'y .

This is the (ordinary) least squares estimator.
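As a small numerical sketch of the closed-form rule in (11a), the following NumPy snippet simulates an illustrative two-variable data set (the sample size, parameter values, and variable names are assumptions for the example, not taken from the text) and computes b = (X'X)⁻¹X'y directly:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 50                                   # sample size (illustrative)
x2 = rng.uniform(0.0, 10.0, size=T)      # fixed explanatory variable
X = np.column_stack([np.ones(T), x2])    # T x 2 design matrix [x1  x2]
beta = np.array([2.0, 0.5])              # assumed "true" beta_1, beta_2
e = rng.normal(0.0, 1.0, size=T)         # IID errors with mean 0, variance sigma^2 = 1
y = X @ beta + e                         # y = X beta + e, Eq. (9)

# Least squares estimator b = (X'X)^{-1} X'y, Eq. (11a)
b = np.linalg.inv(X.T @ X) @ (X.T @ y)
print(b)                                 # close to [2.0, 0.5]
```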
In algebraic notation, the derivatives in (11) can be written as

(12)    0 = −2X'y + 2X'Xb
          = −2 \begin{bmatrix} x_1'y \\ x_2'y \end{bmatrix}
            + 2 \begin{bmatrix} x_1' \\ x_2' \end{bmatrix} [x_1 \;\; x_2] \, b
          = −2 \begin{bmatrix} x_1'y \\ x_2'y \end{bmatrix}
            + 2 \begin{bmatrix} x_1'x_1 & x_1'x_2 \\ x_2'x_1 & x_2'x_2 \end{bmatrix} b
          = −2 \begin{bmatrix} Σy_t \\ Σx_{2t}y_t \end{bmatrix}
            + 2 \begin{bmatrix} T & Σx_{2t} \\ Σx_{2t} & Σx_{2t}^2 \end{bmatrix}
                \begin{bmatrix} b_1 \\ b_2 \end{bmatrix}

This leads us to a system of two linear equations:


(13)    T b_1 + (Σx_{2t}) b_2 = Σy_t
        (Σx_{2t}) b_1 + (Σx_{2t}^2) b_2 = Σx_{2t}y_t

or

        \begin{bmatrix} T & Σx_{2t} \\ Σx_{2t} & Σx_{2t}^2 \end{bmatrix}
        \begin{bmatrix} b_1 \\ b_2 \end{bmatrix}
      = \begin{bmatrix} Σy_t \\ Σx_{2t}y_t \end{bmatrix}

or

        X'Xb = X'y .

The equations in (12) and (13) represent a system of linear simultaneous equations that must be solved for b_1 and b_2. Using the concept of the inverse of a matrix,

(14)    \begin{bmatrix} b_1 \\ b_2 \end{bmatrix}
      = \begin{bmatrix} T & Σx_{2t} \\ Σx_{2t} & Σx_{2t}^2 \end{bmatrix}^{-1}
        \begin{bmatrix} Σy_t \\ Σx_{2t}y_t \end{bmatrix}
      = \frac{1}{T Σx_{2t}^2 − (Σx_{2t})^2}
        \begin{bmatrix} Σx_{2t}^2 & −Σx_{2t} \\ −Σx_{2t} & T \end{bmatrix}
        \begin{bmatrix} Σy_t \\ Σx_{2t}y_t \end{bmatrix}

where T Σx_{2t}² − (Σx_{2t})² = T Σ(x_{2t} − x̄_2)² and x̄_2 is the arithmetic mean of x_{2t}.
Consequently,

(15)    b_1 = \frac{(Σx_{2t}^2)(Σy_t) − (Σx_{2t})(Σx_{2t}y_t)}{T Σ(x_{2t} − x̄_2)^2}

        b_2 = \frac{T(Σx_{2t}y_t) − (Σx_{2t})(Σy_t)}{T Σ(x_{2t} − x̄_2)^2} .

It is sometimes useful to simplify b_1 as b_1 = ȳ − x̄_2 b_2, where ȳ and x̄_2 are the sample means of the observations on y and x_2, respectively.
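The scalar formulas in (15), the shortcut b_1 = ȳ − x̄_2 b_2, and the matrix form b = (X'X)⁻¹X'y can be checked against one another numerically. A brief sketch with illustrative simulated data:

```python
import numpy as np

rng = np.random.default_rng(1)
T = 30
x2 = rng.uniform(0.0, 5.0, size=T)          # illustrative regressor values
y = 1.0 + 2.0 * x2 + rng.normal(size=T)     # illustrative data: beta_1 = 1, beta_2 = 2

# Scalar formulas (15): intercept and slope for the two-variable model
denom = T * np.sum((x2 - x2.mean()) ** 2)
b2 = (T * np.sum(x2 * y) - np.sum(x2) * np.sum(y)) / denom
b1 = (np.sum(x2 ** 2) * np.sum(y) - np.sum(x2) * np.sum(x2 * y)) / denom

# Equivalent shortcut b1 = ybar - x2bar * b2
b1_alt = y.mean() - x2.mean() * b2

# Matrix formula (16): solve X'Xb = X'y
X = np.column_stack([np.ones(T), x2])
b_matrix = np.linalg.solve(X.T @ X, X.T @ y)

print(b1, b2)            # scalar formulas
print(b1_alt)            # mean-based shortcut, equals b1
print(b_matrix)          # matches [b1, b2]
```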

To summarize, b, the least squares estimator of the unknown parameters β_1 and β_2, results from solving the two simultaneous linear equations in (13). The resulting least squares estimator is

(16)    b = (X'X)⁻¹X'y .

Note that the least squares estimator b is a linear function of the observations y.

Sampling properties of the least squares estimators


Our next concern is with the sampling performance of the estimator. Since b is a linear function of the sample observations y, which is a random vector, the least squares estimator b of the location vector β is also a random vector; that is, b is a vector of random variables. We want to know about the mean and sampling variability of this random vector.

To learn about the mean vector for b, we make use of the expectations operator E and investigate E[b]. Using (16),
(17) 𝐸(𝐛) = 𝐸[(𝐗 ′ 𝐗)−1 𝐗 ′ 𝒚] = 𝐸[(𝐗 ′ 𝐗)−1 𝐗 ′ (𝐗𝛃 + 𝒆)]
= 𝐸[(𝐗 ′ 𝐗)−1 𝐗 ′ 𝐗𝛃 + (𝐗 ′ 𝐗)−1 𝐗 ′ 𝒆] = 𝐸(𝐈𝛃) + (𝐗 ′ 𝐗)−1 𝐗 ′ 𝐸(𝒆)
= 𝛃 + (𝐗 ′ 𝐗)−1 𝐗 ′ 𝟎 = 𝛃
since by assumption E[e_t] = 0. Consequently, using the least squares estimator results in a linear rule for estimating β_1 and β_2 that is unbiased. Note that what is unbiased is the rule or estimator and not the estimate obtained from a particular sample. Thus, if we use the least squares rule b, it will be an unbiased rule since E[b − β] = 0.
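A short Monte Carlo sketch of this unbiasedness property (all simulation settings here are illustrative assumptions): for a fixed X, averaging the estimates b over many fresh error draws should recover β.

```python
import numpy as np

rng = np.random.default_rng(2)
T, sigma = 40, 1.5
beta = np.array([3.0, -0.7])                     # assumed "true" parameters
x2 = rng.uniform(0.0, 10.0, size=T)
X = np.column_stack([np.ones(T), x2])            # X is held fixed across replications
rule = np.linalg.inv(X.T @ X) @ X.T              # the linear rule (X'X)^{-1} X'

reps = 10_000
estimates = np.empty((reps, 2))
for r in range(reps):
    e = rng.normal(0.0, sigma, size=T)           # fresh IID errors each replication
    y = X @ beta + e
    estimates[r] = rule @ y                      # b = (X'X)^{-1} X'y

print(estimates.mean(axis=0))                    # close to beta: E[b] = beta
```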

Our next concern is with its sampling variability or precision. We know that b = (X'X)⁻¹X'y = (X'X)⁻¹X'(Xβ + e), or

(18)    b = β + (X'X)⁻¹X'e

and so

(19)    b − β = (X'X)⁻¹X'e .

Since b is an unbiased rule, we may specify the covariance matrix for b as

(20)    E{[b − E(b)][b − E(b)]'} = E[(b − β)(b − β)'] .

Making use of (19), we express the covariance matrix Σ_b for the random vector b as

(21)    Σ_b = E[(X'X)⁻¹X'ee'X(X'X)⁻¹]
            = (X'X)⁻¹X' E[ee'] X(X'X)⁻¹
            = σ²(X'X)⁻¹X' I_T X(X'X)⁻¹
            = σ²(X'X)⁻¹
            = σ² \begin{bmatrix} T & Σx_{2t} \\ Σx_{2t} & Σx_{2t}^2 \end{bmatrix}^{-1}

where use is made of the assumption that E[ee'] = σ²I_T. Since b = (b_1, b_2) is a two-dimensional random vector, the covariance matrix Σ_b is the following (2 × 2) matrix

(22)    Σ_b = \begin{bmatrix}
                E[(b_1 − β_1)^2] & E[(b_1 − β_1)(b_2 − β_2)] \\
                E[(b_1 − β_1)(b_2 − β_2)] & E[(b_2 − β_2)^2]
              \end{bmatrix}
            = \begin{bmatrix}
                var(b_1) & cov(b_1, b_2) \\
                cov(b_1, b_2) & var(b_2)
              \end{bmatrix}

Using the inverse of X'X in (14), the variances and covariances of b_1 and b_2 are

(23)    var(b_1) = σ² \left[ \frac{Σx_{2t}^2}{T Σ(x_{2t} − x̄_2)^2} \right] = σ² a_1

(24)    var(b_2) = \frac{σ²}{Σ(x_{2t} − x̄_2)^2} = σ² a_2

(25)    cov(b_1, b_2) = σ² \left[ \frac{−x̄_2}{Σ(x_{2t} − x̄_2)^2} \right]

To summarize, the sampling properties of the random variables b_1 and b_2 can be stated as follows:

        b_1 ~ (β_1, σ²a_1)   and   b_2 ~ (β_2, σ²a_2)

where a_1 and a_2 are defined in (23) and (24). One thing that is apparent from (23) to (25) is that the more dispersed the explanatory variable [i.e., the larger is Σ(x_{2t} − x̄_2)²], the greater the precision of b_1 and b_2. Also, because the number of terms in the summation Σ(x_{2t} − x̄_2)² increases as the sample size increases, an increase in sample size generally leads to an increase in precision. Finally, the smaller the error variance σ², which reflects the variability of y_t about its mean, the more precise are the estimators.
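As a hedged numerical check of (21) and (23)–(25), the matrix σ²(X'X)⁻¹ should reproduce the scalar variance and covariance formulas entry by entry (the simulated regressor and σ² below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
T, sigma2 = 25, 2.0                              # illustrative sample size and error variance
x2 = rng.uniform(0.0, 8.0, size=T)
X = np.column_stack([np.ones(T), x2])

# Covariance matrix of b from (21): sigma^2 (X'X)^{-1}
cov_b = sigma2 * np.linalg.inv(X.T @ X)

# Scalar formulas (23)-(25)
S = np.sum((x2 - x2.mean()) ** 2)                # sum of squared deviations of x2
var_b1 = sigma2 * np.sum(x2 ** 2) / (T * S)
var_b2 = sigma2 / S
cov_b1b2 = -sigma2 * x2.mean() / S

print(cov_b)
print(var_b1, var_b2, cov_b1b2)                  # match the entries of cov_b
```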

Multiple regression model in matrix form


We now consider the multiple regression model in matrix form, which includes K + 1 variables (a dependent variable and K independent variables, including the constant term), with T observations on each. Thus for each t = 1, …, T,

(27)    y_1 = β_1 + x_{21}β_2 + x_{31}β_3 + ⋯ + x_{K1}β_K + e_1
        y_2 = β_1 + x_{22}β_2 + x_{32}β_3 + ⋯ + x_{K2}β_K + e_2
        ⋮
        y_T = β_1 + x_{2T}β_2 + x_{3T}β_3 + ⋯ + x_{KT}β_K + e_T

The corresponding matrix formulation of the model is

(28)    y = Xβ + e

in which

(29)    y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_T \end{bmatrix}   
        X = \begin{bmatrix} 1 & x_{21} & \cdots & x_{K1} \\ 1 & x_{22} & \cdots & x_{K2} \\ \vdots & \vdots & & \vdots \\ 1 & x_{2T} & \cdots & x_{KT} \end{bmatrix}   
        β = \begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_K \end{bmatrix}   
        e = \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_T \end{bmatrix}

where
        y is a T × 1 column vector of dependent variable observations
        X is a T × K matrix of independent variable observations
        β is a K × 1 column vector of unknown parameters
        e is a T × 1 column vector of errors

The assumptions of the classical linear regression model are

i.   The model specification is given by (28), that is, y = Xβ + e.

ii.  The elements of X are fixed and have finite variance. In addition, X has rank K, which is less than the number of observations T.

iii. e is normally distributed with E[e] = 0 and E[ee'] = σ²I, where I is a T × T identity matrix.

The assumption that X has rank K guarantees that perfect collinearity will not be present. With perfect collinearity, one of the columns of X would be a linear combination of the remaining columns, and the rank of X would be less than K. The error assumptions are the strongest possible, since they guarantee the statistical as well as arithmetic properties of the ordinary least squares estimation process. In addition to normality, we assume that each error term has mean 0, all variances are constant, and all covariances are 0. The variance-covariance matrix σ²I appears as in (3) and (4).

Least squares estimation

Our objective is to find a vector of parameters b that minimizes

(30)    RSS = ∑_{t=1}^{T} ê_t² = ê'ê .

Substituting ê = y − Xb into (30), we get


(31) 𝒆̂′𝒆̂ = (𝒚 − 𝐗𝐛)′ (𝒚 − 𝐗𝐛) = 𝒚′ 𝒚 − 2𝐛′ 𝐗 ′ 𝒚 + 𝐛′ 𝐗 ′ 𝐗𝐛
The last step follows because 𝐛′ 𝐗 ′ 𝒚 and 𝒚′𝐗𝐛 are both scalars and are equal to each other. To
determine the least squares estimator b, we minimize RSS as follows:

(32)    ∂RSS/∂b = −2X'y + 2X'Xb = 0
        b = (X'X)⁻¹X'y

The matrix X'X, called the cross-product matrix, is guaranteed to have an inverse because of our assumption that X has rank K.
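The rank condition is easy to check numerically. The sketch below (with an illustrative design matrix) verifies rank(X) = K, solves the normal equations X'Xb = X'y without forming the inverse explicitly, and shows how perfect collinearity destroys the rank condition:

```python
import numpy as np

rng = np.random.default_rng(4)
T, K = 60, 3
X = np.column_stack([np.ones(T), rng.normal(size=(T, K - 1))])   # T x K design matrix
y = X @ np.array([1.0, 0.5, -2.0]) + rng.normal(size=T)

print(np.linalg.matrix_rank(X) == K)             # rank K => X'X is invertible

b = np.linalg.solve(X.T @ X, X.T @ y)            # solves X'Xb = X'y without inverting X'X
print(b)

# With perfect collinearity, X'X is singular and there is no unique solution
X_bad = np.column_stack([X, X[:, 1] + X[:, 2]])  # last column is a linear combination
print(np.linalg.matrix_rank(X_bad) < X_bad.shape[1])   # True: rank deficient
```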

Properties of the least squares estimator

Now consider the properties of the least squares estimator b . First, we can prove that b is an
unbiased estimator of β :

(33)    b = (X'X)⁻¹X'y = (X'X)⁻¹X'(Xβ + e)
          = β + (X'X)⁻¹X'e
          = β + Ae ,  where A = (X'X)⁻¹X' .

Then

(34)    E[b] = β + E[Ae] = β + A E[e] = β

Looking at (33), notice that Ae = (X'X)⁻¹X'e represents the regression of e on X. As long as
the effects of missing variables (or omitted variables) are randomly distributed independently
of X and have mean 0 , the least squares parameter estimator b will be unbiased.

Further, the least squares estimator will be normally distributed, since b is a linear function of
e and e is normally distributed. The properties of the variances of the individual b_i, i = 1, …, K, and their covariances are determined as follows:

(35)    Σ_b = E[(b − β)(b − β)']

            = \begin{bmatrix}
                E[(b_1 − β_1)^2] & E[(b_1 − β_1)(b_2 − β_2)] & \cdots & E[(b_1 − β_1)(b_K − β_K)] \\
                E[(b_2 − β_2)(b_1 − β_1)] & E[(b_2 − β_2)^2] & \cdots & E[(b_2 − β_2)(b_K − β_K)] \\
                \vdots & \vdots & & \vdots \\
                E[(b_K − β_K)(b_1 − β_1)] & E[(b_K − β_K)(b_2 − β_2)] & \cdots & E[(b_K − β_K)^2]
              \end{bmatrix}

            = \begin{bmatrix}
                var(b_1) & cov(b_1, b_2) & \cdots & cov(b_1, b_K) \\
                cov(b_1, b_2) & var(b_2) & \cdots & cov(b_2, b_K) \\
                \vdots & \vdots & & \vdots \\
                cov(b_1, b_K) & cov(b_2, b_K) & \cdots & var(b_K)
              \end{bmatrix}

where Σ_b is a K × K matrix. The diagonal elements of Σ_b represent the variances of the estimated parameters, while the off-diagonal terms represent the covariances. Note from (33) that b − β = Ae, where A = (X'X)⁻¹X'. Then

        Σ_b = E[(b − β)(b − β)'] = E[(Ae)(Ae)'] = E(Aee'A')
            = A E(ee') A' = A(σ²I)A' = σ²AA'

since A and A' are matrices of fixed numbers. Note that

        AA' = [(X'X)⁻¹X'][(X'X)⁻¹X']' = [(X'X)⁻¹X'][X(X'X)⁻¹]
            = (X'X)⁻¹(X'X)(X'X)⁻¹ = (X'X)⁻¹ .

Therefore,

(36)    E[(b − β)(b − β)'] = σ²(X'X)⁻¹ .

An unbiased estimator of σ²

Note that, by definition, the vector of least squares residuals is

(37)    ê = y − Xb = y − X(X'X)⁻¹X'y
          = (I_T − X(X'X)⁻¹X')y
          = My

where M is a T × T matrix that is symmetric (M = M') and idempotent (M = MM). In view of (37), we can interpret M as a matrix that, when it premultiplies any vector y, produces the vector of least squares residuals in the regression of y on X. It follows immediately that MX = 0. One way to interpret this result is that if X is regressed on X, a perfect fit will result and the residuals will be zero.
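These properties of M can be verified numerically on an illustrative design matrix; the last line also anticipates the result tr(M) = T − K used below:

```python
import numpy as np

rng = np.random.default_rng(5)
T, K = 20, 3
X = np.column_stack([np.ones(T), rng.normal(size=(T, K - 1))])

# Residual-maker matrix M = I_T - X(X'X)^{-1}X'
M = np.eye(T) - X @ np.linalg.inv(X.T @ X) @ X.T

print(np.allclose(M, M.T))             # symmetric: M = M'
print(np.allclose(M, M @ M))           # idempotent: M = MM
print(np.allclose(M @ X, 0.0))         # MX = 0: regressing X on X leaves no residual
print(np.isclose(np.trace(M), T - K))  # tr(M) = T - K, used for the unbiased sigma^2 estimator
```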

Further, (37) can also be written as

(38)    ê = My = M(Xβ + e) = Me

since MX = 0. An estimator of σ² will be based on the sum of squared residuals:

(39)    ê'ê = e'Me .

If we make use of (39) as an estimator of σ², we need to evaluate

(40)    E[ê'ê] = E[e'Me] .

A simple way to evaluate the expectation of this quadratic form is to make use of the concept of the trace of a matrix. Since e'Me is a scalar, it is equal to its trace. Consequently,

(41)    E[ê'ê] = E[tr(e'Me)] = E[tr(Mee')]

since, for the product of matrices A, B, C, tr(ABC) = tr(CAB) = tr(BCA), provided ABC, CAB, and BCA exist. Also, because E[tr(z)] = tr[E(z)] for any argument z, we have
(42) 𝐸(𝐞̂′𝐞̂) = tr[𝐸(𝐌𝐞𝐞′)] = tr[𝐌 𝐸(𝐞𝐞′)] = tr[𝐌𝜎 2 𝐈] = 𝜎 2 tr[𝐌 𝐈] = 𝜎 2 tr[𝐌]
= 𝜎 2 tr(𝐈𝑇 − 𝐗(𝐗 ′ 𝐗)−1 𝐗′)
= 𝜎 2 [tr(𝐈𝑇 ) − tr(𝐗(𝐗 ′ 𝐗)−1 𝐗′)]
= 𝜎 2 [tr(𝐈𝑇 ) − tr(𝐗′𝐗(𝐗 ′ 𝐗)−1 )]
= 𝜎 2 [tr(𝐈𝑇 ) − tr(𝐈𝐾 )]
= 𝜎 2 [𝑇 − 𝐾]

Consequently,

(43)    E[ê'ê/(T − K)] = (1/(T − K)) σ²(T − K) = σ² .

Thus if we let σ̂² = ê'ê/(T − K) be an estimator of σ², it will be an unbiased estimator since E[ê'ê/(T − K)] = σ².
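A brief Monte Carlo sketch (simulation settings are illustrative) of the unbiasedness of σ̂² = ê'ê/(T − K), compared with the downward-biased estimator that divides by T:

```python
import numpy as np

rng = np.random.default_rng(6)
T, K, sigma2 = 15, 3, 4.0                         # small T makes the bias of the naive divisor visible
X = np.column_stack([np.ones(T), rng.normal(size=(T, K - 1))])
beta = np.array([1.0, 2.0, -1.0])
M = np.eye(T) - X @ np.linalg.inv(X.T @ X) @ X.T  # residual-maker matrix

reps = 20_000
s2_unbiased = np.empty(reps)
s2_naive = np.empty(reps)
for r in range(reps):
    e = rng.normal(0.0, np.sqrt(sigma2), size=T)
    e_hat = M @ (X @ beta + e)                    # residuals: e_hat = My = Me
    rss = e_hat @ e_hat
    s2_unbiased[r] = rss / (T - K)                # sigma_hat^2 from (43)
    s2_naive[r] = rss / T                         # dividing by T underestimates sigma^2

print(s2_unbiased.mean())   # close to 4.0
print(s2_naive.mean())      # close to 4.0 * (T - K) / T = 3.2
```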

Gauss-Markov Theorem

Recall that for the classical linear regression model, the following assumptions must hold:
1. The relationship between y and X is linear.
2. The X's are nonstochastic variables whose values are fixed.
3. The error has zero expected value: E(e_t) = 0 for all t.
4. The error term has constant variance for all observations, i.e., E(e_t²) = σ² for all t.
5. The random variables e_t are statistically independent. Thus E(e_t e_s) = 0 for all t ≠ s.
(For the classical normal linear regression model, we add assumption 6, that the error term is normally distributed.) (As another way to express these assumptions, see page 11.)

Gauss-Markov Theorem: Given assumptions 1 through 5, the estimators b are the best (most
efficient) linear unbiased estimators of β in the sense that they have the minimum variance of
all linear unbiased estimators.

We have proved in (34) that b is an unbiased estimator of β. To complete the proof of the Gauss-Markov theorem, we need to show that any other unbiased linear estimator b̃ has greater variance than b. Recall that b = Ay. Without loss of generality, we can write (for any matrix C)

(44)    b̃ = (A + C)y = (A + C)(Xβ + e)
           = (A + C)Xβ + (A + C)e .

The expected value of b̃ is given by

(45)    E(b̃) = E[(A + C)Xβ] + E[(A + C)e] .

Since E[(A + C)e] = (A + C)E[e] = 0, then

(46)    E(b̃) = (A + C)Xβ = (X'X)⁻¹X'Xβ + CXβ
             = Iβ + CXβ .

If b̃ is unbiased, then it must be that CX = 0 for all β.

Now examine the matrix var(b̃). Using (44), the condition that CX = 0, and the fact that AX = (X'X)⁻¹X'X = I, we may write

        b̃ − β = (A + C)Xβ + (A + C)e − β
              = AXβ − β + CXβ + (A + C)e
              = (A + C)e .

Thus

(47)    var(b̃ | X) = E[(b̃ − β)(b̃ − β)'] = E{[(A + C)e][(A + C)e]'}
                   = E[(A + C)ee'(A + C)'] = (A + C)E[ee'](A + C)'
                   = σ²(A + C)(A + C)' .

Since

        (A + C)(A + C)' = AA' + CA' + AC' + CC'
                        = (X'X)⁻¹X'X(X'X)⁻¹ + CX(X'X)⁻¹ + (X'X)⁻¹X'C' + CC'
                        = (X'X)⁻¹ + CC' ,

it follows that

(48)    var(b̃ | X) = σ²[(X'X)⁻¹ + CC'] = var(b | X) + σ²CC' .

We can show that CC' is a positive semidefinite matrix. Write z' = a'C and its transpose z = C'a; then the quadratic form z'z = a'CC'a is non-negative for any non-zero vector a. Recall from linear algebra that a symmetric n × n real matrix M is said to be positive definite if the scalar a'Ma is strictly positive for every non-zero column vector a. Since z'z = a'CC'a is non-negative, CC' is positive semidefinite. The only case in which the quadratic form a'CC'a is 0 for every a is when C = 0 (all elements of C are 0). When C = 0, the alternative estimator becomes the ordinary least squares estimator b, and the theorem is proved: the OLS estimator b = Ay is the most efficient.
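A numerical illustration of the theorem, with an illustrative choice of C: any C of the form DM satisfies CX = 0, so (A + C)y is another linear unbiased estimator, and its covariance matrix exceeds that of OLS by the positive semidefinite term σ²CC'.

```python
import numpy as np

rng = np.random.default_rng(7)
T, K, sigma2 = 30, 2, 1.0
X = np.column_stack([np.ones(T), rng.uniform(0.0, 10.0, size=T)])

A = np.linalg.inv(X.T @ X) @ X.T                 # OLS rule: b = Ay
M = np.eye(T) - X @ A                            # residual-maker, MX = 0

# Any C of the form DM satisfies CX = DMX = 0, so (A + C)y is also unbiased
D = 0.05 * rng.normal(size=(K, T))               # illustrative perturbation
C = D @ M
print(np.allclose(C @ X, 0.0))                   # unbiasedness condition holds

var_ols = sigma2 * np.linalg.inv(X.T @ X)        # var(b | X) = sigma^2 (X'X)^{-1}
var_alt = sigma2 * (A + C) @ (A + C).T           # var(b_tilde | X)

# The difference is sigma^2 CC', which is positive semidefinite:
diff = var_alt - var_ols
print(np.allclose(diff, sigma2 * C @ C.T))
print(np.all(np.linalg.eigvalsh(diff) >= -1e-12))  # eigenvalues non-negative
print(np.diag(var_alt) >= np.diag(var_ols))        # each variance is at least the OLS variance
```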

Notes on some matrix derivatives used to obtain the OLS formula
(Greene, W. H. (2012). Appendix A Matrix Algebra, A.8.1 Differentiation and Matrix Algebra, pp. 1048–1049)

Suppose a and x are n × 1 column vectors, and consider a linear function of x:

        y = a'x = x'a = ∑_{i=1}^{n} a_i x_i .

Then

        ∂(a'x)/∂x = a .

Note that ∂(a'x)/∂x = a, not a'.

Now suppose a quadratic form in x:

        x'Ax = ∑_{i=1}^{n} ∑_{j=1}^{n} x_i x_j a_{ij}

where A is an n × n matrix. Then

        ∂(x'Ax)/∂x = 2Ax

if A is a symmetric matrix. If A is not symmetric, then

        ∂(x'Ax)/∂x = (A + A')x .
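A quick finite-difference sanity check of these two derivative rules (the vectors and matrix below are arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 4
a = rng.normal(size=n)
A = rng.normal(size=(n, n))          # a general (non-symmetric) matrix
x = rng.normal(size=n)
h = 1e-6

def num_grad(f, x, h=h):
    """Central-difference gradient of a scalar function f at x."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        step = np.zeros_like(x)
        step[i] = h
        g[i] = (f(x + step) - f(x - step)) / (2 * h)
    return g

# d(a'x)/dx = a
print(np.allclose(num_grad(lambda v: a @ v, x), a))

# d(x'Ax)/dx = (A + A')x for general A, and 2Ax when A is symmetric
print(np.allclose(num_grad(lambda v: v @ A @ v, x), (A + A.T) @ x, atol=1e-4))
A_sym = (A + A.T) / 2
print(np.allclose(num_grad(lambda v: v @ A_sym @ v, x), 2 * A_sym @ x, atol=1e-4))
```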

The residual sum of squares (RSS) is (y − Xb)'(y − Xb) = y'y − b'X'y − y'Xb + b'X'Xb. The second and third terms are linear functions of b, while the fourth term is a quadratic form in b.

We want to evaluate the derivative


𝜕(𝒚′ 𝒚 − 𝐛′ 𝐗 ′ 𝒚 − 𝒚′𝐗𝐛 + 𝐛′ 𝐗 ′ 𝐗𝐛)
=𝟎.
𝜕𝐛

For the third term, since we can write 𝒚′ 𝐗𝐛 as (𝐗 ′ 𝒚)′ 𝐛, then


𝜕(𝒚′𝐗𝐛)
= 𝐗 ′ 𝒚.
𝜕𝐛

For the second term, since 𝐛′ 𝐗 ′ 𝒚 = 𝒚′ 𝐗𝐛, we have


𝜕(𝐛′ 𝐗 ′ 𝒚)
= 𝐗 ′ 𝒚.
𝜕𝐛
For the fourth term, since 𝐗 ′ 𝐗 is a symmetric matrix,
𝜕(𝐛′𝐗′𝐗𝐛)
= 2𝐗 ′ 𝐗𝐛.
𝜕𝐛

Thus
𝜕(𝒚′ 𝒚 − 𝐛′ 𝐗 ′ 𝒚 − 𝒚′𝐗𝐛 + 𝐛′ 𝐗 ′ 𝐗𝐛)
= −2𝐗 ′ 𝒚 + 2𝐗 ′ 𝐗𝐛 = 𝟎.
𝜕𝐛
