
The Method of Ordinary Least Squares:

Let us consider the two-variable PRF:

𝑌𝑖 = 𝛽1 + 𝛽2 𝑋𝑖 + 𝜇𝑖 … … … (1)

However, the PRF is not directly observable. We estimate it from the SRF:

𝑌𝑖 = 𝛽̂1 + 𝛽̂2 𝑋𝑖 + 𝜇̂ 𝑖

𝑌𝑖 = 𝑌̂𝑖 + 𝜇̂ 𝑖

Where 𝑌̂𝑖 is the estimated value of 𝑌𝑖

𝜇̂ 𝑖 = 𝑌𝑖 − 𝑌̂𝑖

= 𝑌𝑖 − 𝛽̂1 − 𝛽̂2 𝑋𝑖 ………………………….(2)

Squaring both sides and summing over all observations, we get

∑ 𝜇̂ 𝑖2 = ∑(𝑌𝑖 − 𝛽̂1 − 𝛽̂2 𝑋𝑖 )2 ……………………………….(3)

Now differentiating (3) partially with respect to 𝛽̂1 and 𝛽̂2 we obtain
\frac{\partial \left(\sum \hat{\mu}_i^2\right)}{\partial \hat{\beta}_1} = -2 \sum \left(Y_i - \hat{\beta}_1 - \hat{\beta}_2 X_i\right) \quad \ldots (4)

\frac{\partial \left(\sum \hat{\mu}_i^2\right)}{\partial \hat{\beta}_2} = -2 \sum \left(Y_i - \hat{\beta}_1 - \hat{\beta}_2 X_i\right) X_i \quad \ldots (5)

Now setting (4) = 0, we get

−2 ∑(𝑌𝑖 − 𝛽̂1 − 𝛽̂2 𝑋𝑖 )=0

⟹ ∑(𝑌𝑖 − 𝛽̂1 − 𝛽̂2 𝑋𝑖 )=0

⇒ ∑ 𝑌𝑖 − ∑ 𝛽̂1 − ∑ 𝛽̂2 𝑋𝑖 = 0

⇒ ∑ 𝑌𝑖 − 𝑛𝛽̂1 − 𝛽̂2 ∑ 𝑋𝑖 =0

⇒ 𝛽̂1 = 𝑌̅ − 𝛽̂2 𝑋̅………………………………..(6)

Setting (5)=0, we get

−2 ∑(𝑌𝑖 − 𝛽̂1 − 𝛽̂2 𝑋𝑖 )𝑋𝑖 =0

⟹ ∑(𝑌𝑖 − 𝛽̂1 − 𝛽̂2 𝑋𝑖 )𝑋𝑖 = 0

⟹ ∑ 𝑌𝑖 𝑋𝑖 − ∑ 𝛽̂1 𝑋𝑖 − ∑ 𝛽̂2 𝑋𝑖2 = 0


⟹ ∑ 𝑌𝑖 𝑋𝑖 − 𝛽̂1 ∑ 𝑋𝑖 − 𝛽̂2 ∑ 𝑋𝑖2 = 0

⟹ ∑ 𝑌𝑖 𝑋𝑖 − (𝑌̅ − 𝛽̂2 𝑋̅) ∑ 𝑋𝑖 − 𝛽̂2 ∑ 𝑋𝑖2 = 0

⟹ ∑ 𝑌𝑖 𝑋𝑖 − 𝑌̅ ∑ 𝑋𝑖 + 𝛽̂2 𝑋̅ ∑ 𝑋𝑖 − 𝛽̂2 ∑ 𝑋𝑖2 = 0


\Rightarrow \sum Y_i X_i - \frac{\sum X_i \sum Y_i}{n} - \hat{\beta}_2 \left(\sum X_i^2 - \frac{\left(\sum X_i\right)^2}{n}\right) = 0

\Rightarrow \hat{\beta}_2 = \frac{\sum Y_i X_i - \frac{\sum X_i \sum Y_i}{n}}{\sum X_i^2 - \frac{\left(\sum X_i\right)^2}{n}}
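These closed-form expressions involve only observable sample quantities, so they can be computed directly. A minimal numerical sketch in Python (the toy data below are illustrative, not from the notes):

```python
import numpy as np

# Illustrative data (not from the notes)
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(X)

# Slope from the normal equations
beta2_hat = (np.sum(X * Y) - np.sum(X) * np.sum(Y) / n) / (np.sum(X**2) - np.sum(X)**2 / n)

# Intercept from equation (6): beta1_hat = Ybar - beta2_hat * Xbar
beta1_hat = Y.mean() - beta2_hat * X.mean()

print(beta1_hat, beta2_hat)
# Cross-check with numpy's built-in least-squares polynomial fit (returns [slope, intercept])
print(np.polyfit(X, Y, deg=1))
```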

Numerical properties of the estimators:

1. The OLS estimators are expressed solely in terms of the observable quantities. Therefore, they can
be easily computed.
2. They are point estimators
3. Once the OLS estimates are obtained from the sample data, the regression line can be easily
obtained. The regression line thus obtained has the following properties:
i. It passes through the sample means of Y and X
ii. The mean value of the estimated 𝑌 = 𝑌̂𝑖 is equal to the mean value of the actual Y for
𝑌̂𝑖 = 𝛽̂1 + 𝛽̂2 𝑋𝑖
= (𝑌̅ − 𝛽̂2 𝑋̅) + 𝛽̂2 𝑋𝑖
= 𝑌̅ + 𝛽̂2 (𝑋𝑖 − 𝑋̅)

Summing both sides of this last equality over the sample values and dividing through by n give

\bar{\hat{Y}} = \bar{Y}

iii. The mean value of 𝜇̂ 𝑖 is equal to zero


iv. The residuals 𝜇̂ 𝑖 are uncorrelated with the predicted values 𝑌̂𝑖
v. The residuals are uncorrelated with 𝑋𝑖
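The five properties above can be checked numerically once the fitted values and residuals are formed. A brief sketch on the same kind of toy data (illustrative only):

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(X)

beta2_hat = (np.sum(X * Y) - np.sum(X) * np.sum(Y) / n) / (np.sum(X**2) - np.sum(X)**2 / n)
beta1_hat = Y.mean() - beta2_hat * X.mean()

Y_hat = beta1_hat + beta2_hat * X   # fitted values
u_hat = Y - Y_hat                   # residuals

print(np.isclose(beta1_hat + beta2_hat * X.mean(), Y.mean()))  # (i)  line passes through (Xbar, Ybar)
print(np.isclose(Y_hat.mean(), Y.mean()))                      # (ii) mean of fitted values equals mean of Y
print(np.isclose(u_hat.mean(), 0.0))                           # (iii) residuals have zero mean
print(np.isclose(np.sum(u_hat * Y_hat), 0.0))                  # (iv) residuals uncorrelated with fitted values
print(np.isclose(np.sum(u_hat * X), 0.0))                      # (v)  residuals uncorrelated with X
```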
Gauss-Markov Theorem:
Given the assumptions of the classical linear regression model, the least-squares estimators, in the class of linear unbiased estimators, have minimum variance; that is, they are BLUE (best linear unbiased estimators).

Proof of the Gauss-Markov Theorem:

Linearity:

Let us consider the regression model

𝑌𝑖 = 𝛽1 + 𝛽2 𝑋𝑖 + 𝜇𝑖 … … … (1)

The sample regression equation corresponding to the PRF is

𝑌𝑖 = 𝛽̂1 + 𝛽̂2 𝑋𝑖 + 𝜇̂ 𝑖

Now the OLS coefficient estimators are given by the formulas

\hat{\beta}_2 = \frac{\sum Y_i X_i - \frac{\sum X_i \sum Y_i}{n}}{\sum X_i^2 - \frac{\left(\sum X_i\right)^2}{n}} = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2} = \frac{\sum y_i x_i}{\sum x_i^2} \quad \ldots (2)

where x_i = X_i - \bar{X} and y_i = Y_i - \bar{Y} denote deviations from the sample means, and

𝛽̂1 = 𝑌̅ − 𝛽̂2 𝑋̅

From (2) we can write

\hat{\beta}_2 = \frac{\sum y_i x_i}{\sum x_i^2}

= \frac{\sum (Y_i - \bar{Y}) x_i}{\sum x_i^2}

= \frac{\sum Y_i x_i}{\sum x_i^2} - \frac{\bar{Y} \sum x_i}{\sum x_i^2}

= \frac{\sum Y_i x_i}{\sum x_i^2} = \sum k_i Y_i \quad \ldots (3)

since \sum x_i = 0, where k_i = \frac{x_i}{\sum x_i^2}.

The OLS estimator 𝛽̂2 is therefore a linear function of the sample values 𝑌𝑖 , which establishes linearity.

Some properties of 𝒌𝒊 :

i. \sum k_i = \sum \frac{x_i}{\sum x_i^2} = 0

ii. \sum k_i^2 = \sum \left(\frac{x_i}{\sum x_i^2}\right)^2 = \frac{\sum x_i^2}{\left(\sum x_i^2\right)^2} = \frac{1}{\sum x_i^2}

iii. \sum k_i x_i = \sum k_i (X_i - \bar{X}) = \sum k_i X_i - \bar{X} \sum k_i = \sum k_i X_i

iv. \sum k_i x_i = 1
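These four properties follow directly from the definition of the weights, k_i = x_i / \sum x_i^2, and can be verified numerically. A small sketch (the X values are assumed purely for illustration):

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # illustrative X values
x = X - X.mean()                          # deviations x_i = X_i - Xbar
k = x / np.sum(x**2)                      # OLS weights k_i

print(np.isclose(k.sum(), 0.0))                      # (i)   sum k_i = 0
print(np.isclose(np.sum(k**2), 1.0 / np.sum(x**2)))  # (ii)  sum k_i^2 = 1 / sum x_i^2
print(np.isclose(np.sum(k * x), np.sum(k * X)))      # (iii) sum k_i x_i = sum k_i X_i
print(np.isclose(np.sum(k * x), 1.0))                # (iv)  sum k_i x_i = 1
```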
Unbiasedness of 𝛽̂2 :
We know that

𝛽̂2 = ∑ 𝑘𝑖 𝑌𝑖

= ∑ 𝑘𝑖 (𝛽1 + 𝛽2 𝑋𝑖 + 𝜇𝑖 ) (from (1))

= ∑ 𝑘𝑖 𝛽1 + 𝛽2 ∑ 𝑘𝑖 𝑋𝑖 + ∑ 𝑘𝑖 𝜇𝑖

= 𝛽1 ∑ 𝑘𝑖 + 𝛽2 ∑ 𝑘𝑖 𝑋𝑖 + ∑ 𝑘𝑖 𝜇𝑖

= 𝛽2 + ∑ 𝑘𝑖 𝜇𝑖 (using ∑ 𝑘𝑖 = 0 and ∑ 𝑘𝑖 𝑋𝑖 = 1)

Taking expectations on both sides, and noting that the 𝑘𝑖 are nonstochastic, we get

𝐸(𝛽̂2 ) = 𝐸(𝛽2 ) + ∑ 𝑘𝑖 𝐸(𝜇𝑖 ) = 𝛽2

Unbiasedness of 𝛽̂1 :
𝑌𝑖 = 𝛽1 + 𝛽2 𝑋𝑖 + 𝜇𝑖

\Rightarrow \sum Y_i = \sum \beta_1 + \beta_2 \sum X_i + \sum \mu_i

\Rightarrow \sum Y_i = n\beta_1 + \beta_2 \sum X_i + \sum \mu_i

\Rightarrow \frac{\sum Y_i}{n} = \beta_1 + \beta_2 \frac{\sum X_i}{n} + \frac{\sum \mu_i}{n}

\Rightarrow \bar{Y} = \beta_1 + \beta_2 \bar{X} + \bar{\mu} \quad \ldots (4)

Substituting (4) into the formula for 𝛽̂1 , we get

\hat{\beta}_1 = \bar{Y} - \hat{\beta}_2 \bar{X}

= \beta_1 + \beta_2 \bar{X} + \bar{\mu} - \hat{\beta}_2 \bar{X}

= \beta_1 + (\beta_2 - \hat{\beta}_2)\bar{X} + \bar{\mu}

Taking expectations, and noting that E(\bar{\mu}) = 0 and E(\hat{\beta}_2) = \beta_2 , we get E(\hat{\beta}_1) = \beta_1 .
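Unbiasedness can also be illustrated by simulation: holding X fixed across repeated samples and drawing zero-mean disturbances, the averages of 𝛽̂1 and 𝛽̂2 over many samples should be close to the true parameters. A sketch under assumed true values (β1 = 2, β2 = 0.5, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
beta1, beta2, sigma = 2.0, 0.5, 1.0      # assumed true parameters (illustrative)
X = np.linspace(1.0, 10.0, 30)           # X treated as fixed in repeated sampling
x = X - X.mean()

draws = []
for _ in range(10_000):
    u = rng.normal(0.0, sigma, size=X.size)   # disturbances with E(u_i) = 0
    Y = beta1 + beta2 * X + u
    b2 = np.sum(x * Y) / np.sum(x**2)         # beta2_hat = sum k_i Y_i
    b1 = Y.mean() - b2 * X.mean()             # beta1_hat = Ybar - beta2_hat * Xbar
    draws.append((b1, b2))

print(np.mean(draws, axis=0))   # should be close to (2.0, 0.5)
```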

Variance of 𝛽̂2 :
We know that
V(\hat{\beta}_2) = E\left[\hat{\beta}_2 - E(\hat{\beta}_2)\right]^2

= E\left[\hat{\beta}_2 - \beta_2\right]^2

= E\left[\sum k_i \mu_i\right]^2

= E\left(k_1\mu_1 + k_2\mu_2 + \dots + k_n\mu_n\right)^2

= E\left(k_1^2\mu_1^2 + \dots + k_n^2\mu_n^2 + 2k_1k_2\mu_1\mu_2 + \dots + 2k_{n-1}k_n\mu_{n-1}\mu_n\right)

Since by assumption E(\mu_i^2) = \sigma^2 for each i and E(\mu_i \mu_j) = 0 for i \neq j, it follows that

V(\hat{\beta}_2) = \sigma^2 \sum k_i^2 = \frac{\sigma^2}{\sum x_i^2}

Variance of 𝛽̂1 :

Since \hat{\beta}_1 = \bar{Y} - \hat{\beta}_2 \bar{X} and \operatorname{Cov}(\bar{Y}, \hat{\beta}_2) = \frac{\sigma^2}{n}\sum k_i = 0,

V(\hat{\beta}_1) = V(\bar{Y}) + \bar{X}^2 V(\hat{\beta}_2) = \frac{\sigma^2}{n} + \frac{\bar{X}^2 \sigma^2}{\sum x_i^2} = \frac{\sum X_i^2}{n \sum x_i^2}\,\sigma^2

(using \sum X_i^2 = \sum x_i^2 + n\bar{X}^2).
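In applications σ² is unknown; it is usually replaced by the unbiased estimator σ̂² = ∑ 𝜇̂ 𝑖²/(n − 2) (the least-squares estimator of σ² listed further below). A sketch of the resulting variance and standard-error estimates, again on illustrative data:

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(X)
x = X - X.mean()

beta2_hat = np.sum(x * Y) / np.sum(x**2)
beta1_hat = Y.mean() - beta2_hat * X.mean()
u_hat = Y - (beta1_hat + beta2_hat * X)

sigma2_hat = np.sum(u_hat**2) / (n - 2)                      # estimate of sigma^2
var_beta2 = sigma2_hat / np.sum(x**2)                        # V(beta2_hat) = sigma^2 / sum x_i^2
var_beta1 = sigma2_hat * np.sum(X**2) / (n * np.sum(x**2))   # V(beta1_hat) = sigma^2 sum X_i^2 / (n sum x_i^2)

print(np.sqrt(var_beta1), np.sqrt(var_beta2))  # standard errors of beta1_hat and beta2_hat
```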

Minimum variance property:


We know that

𝛽̂2 = ∑ 𝑘𝑖 𝑌𝑖

which shows that 𝛽̂2 is a weighted average of the 𝑌𝑖 , with the 𝑘𝑖 serving as the weights.

Let us define an alternative linear estimator of 𝛽2 as follows:

𝛽2∗ = ∑ 𝑤𝑖 𝑌𝑖
Where 𝑤𝑖 are also weights, not necessarily equal to 𝑘𝑖 . Now

𝐸(𝛽2∗ ) = ∑ 𝑤𝑖 𝐸(𝑌𝑖 )

= ∑ 𝑤𝑖 (𝛽1 + 𝛽2 𝑋𝑖 )

= 𝛽1 ∑ 𝑤𝑖 + 𝛽2 ∑ 𝑤𝑖 𝑋𝑖

Therefore, for 𝛽2∗ to be unbiased, we must have

∑ 𝑤𝑖 = 0 and ∑ 𝑤𝑖 𝑋𝑖 = 1

Also, we may write

𝑉(𝛽2∗ ) = 𝑉 (∑ 𝑤𝑖 𝑌𝑖 )

= ∑ 𝑤𝑖2 𝑉(𝑌𝑖 )

= 𝜎 2 ∑ 𝑤𝑖2

= \sigma^2 \sum \left(w_i - \frac{x_i}{\sum x_i^2} + \frac{x_i}{\sum x_i^2}\right)^2

= \sigma^2 \sum \left(w_i - \frac{x_i}{\sum x_i^2}\right)^2 + \sigma^2 \sum \left(\frac{x_i}{\sum x_i^2}\right)^2 + 2\sigma^2 \sum \left(w_i - \frac{x_i}{\sum x_i^2}\right)\left(\frac{x_i}{\sum x_i^2}\right)

= \sigma^2 \sum \left(w_i - \frac{x_i}{\sum x_i^2}\right)^2 + \sigma^2 \left(\frac{1}{\sum x_i^2}\right)

since the cross-product term vanishes: \sum w_i k_i = \frac{\sum w_i x_i}{\sum x_i^2} = \frac{\sum w_i X_i - \bar{X}\sum w_i}{\sum x_i^2} = \frac{1}{\sum x_i^2} = \sum k_i^2. Hence

V(\beta_2^*) = \sigma^2 \sum \left(w_i - \frac{x_i}{\sum x_i^2}\right)^2 + V(\hat{\beta}_2)

Since the first term on the right is nonnegative, V(\beta_2^*) \geq V(\hat{\beta}_2), with equality only when w_i = k_i for every i. Hence 𝛽̂2 has minimum variance in the class of linear unbiased estimators.
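This inequality can be illustrated numerically: take any alternative weights 𝑤𝑖 that still satisfy ∑ 𝑤𝑖 = 0 and ∑ 𝑤𝑖 𝑋𝑖 = 1 and compare σ²∑ 𝑤𝑖² with σ²∑ 𝑘𝑖². A sketch, where the particular alternative weights are an arbitrary construction assumed purely for illustration:

```python
import numpy as np

sigma2 = 1.0                       # assumed error variance (illustrative)
X = np.linspace(1.0, 10.0, 20)
x = X - X.mean()
k = x / np.sum(x**2)               # OLS weights k_i

# Perturb k_i by a direction d orthogonal to both the constant and X, so that
# sum(w) = 0 and sum(w * X) = 1 still hold and beta2* remains unbiased.
z = X**2
A = np.column_stack([np.ones_like(X), X])
d = z - A @ np.linalg.lstsq(A, z, rcond=None)[0]   # residual of z on (1, X)
w = k + 0.001 * d

print(np.isclose(w.sum(), 0.0), np.isclose(np.sum(w * X), 1.0))   # unbiasedness conditions hold
print(sigma2 * np.sum(k**2), sigma2 * np.sum(w**2))               # Var(beta2_hat) <= Var(beta2*)
```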

• Covariance of 𝛽̂2 and 𝛽̂1 :

• The least-squares estimator of 𝜎² :
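These two topics are listed here without the accompanying formulas. For completeness, the standard results under the same classical assumptions (quoted for reference, not derived in these notes) are

\operatorname{Cov}(\hat{\beta}_1, \hat{\beta}_2) = -\bar{X}\,\operatorname{Var}(\hat{\beta}_2) = -\frac{\bar{X}\,\sigma^2}{\sum x_i^2}, \qquad \hat{\sigma}^2 = \frac{\sum \hat{\mu}_i^2}{n-2}, \quad E(\hat{\sigma}^2) = \sigma^2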
The Probability distribution of disturbances:
We know that,

𝛽̂2 = ∑ 𝑘𝑖 𝑌𝑖

Since the X’s are assumed to be fixed, or nonstochastic, and since 𝑌𝑖 = 𝛽1 + 𝛽2 𝑋𝑖 + 𝜇𝑖 , we can write

𝛽̂2 = ∑ 𝑘𝑖 (𝛽1 + 𝛽2 𝑋𝑖 + 𝜇𝑖 )

Because the 𝑘𝑖 , the betas, and the X’s are all fixed, 𝛽̂2 is ultimately a linear function of 𝜇𝑖 , which is random by assumption. Therefore, the probability distribution of 𝛽̂2 will depend on the assumption made
about the probability distribution of 𝜇𝑖 .

The Normality Assumption for 𝝁𝒊


The classical linear regression model assumes that each 𝜇𝑖 is distributed normally with

𝐸(𝜇𝑖 ) = 0, 𝑉(𝜇𝑖 ) = 𝜎², and 𝐶𝑜𝑣(𝜇𝑖 , 𝜇𝑗 ) = 0 for 𝑖 ≠ 𝑗

The assumptions given above can be more compactly stated as

𝜇𝑖 ~𝑁(0, 𝜎 2 )

Why the Normality Assumption?


1. 𝜇𝑖 represent the combined influence of a large number of independent variables that are not
explicitly introduced in the regression model. As noted, we hope that the influence of these
omitted variables is small and at best random. Now by the central limit theorem of statistics, it
can be shown that if there are a large number of independent and identically distributed random
variables, then, with few exceptions, the distribution of their sum tends to a normal distribution
as the number of such variables increases indefinitely. It is the CLT that provides a theoretical
justification for the assumption of normality of 𝜇𝑖 .
2. A variant of the CLT states that, even if the number of variables is not very large or if these variables are not strictly independent, their sum may still be normally distributed.
3. With the normality assumption, the probability distributions of the OLS estimators can be easily derived because one property of the normal distribution is that any linear function of normally distributed variables is itself normally distributed.
4. The normal distribution is a simple distribution involving only two parameters (mean and variance), and its properties have been extensively studied in mathematical statistics. Besides, many phenomena seem to follow the normal distribution.
5. If we are dealing with a small, or finite, sample size, say fewer than 100 observations, the normality assumption assumes a critical role. It not only helps us derive the exact probability distributions of the OLS estimators but also enables us to use the t, F, and χ² probability distributions.
6. Finally, in large samples, the t and F statistics have approximately the t and F distributions, so the t and F tests that are based on the assumption that the error term is normally distributed can still be applied validly.

Properties of OLS estimators under the Normality Assumption:


1. They are unbiased
2. They have minimum variance
3. They are consistent; that is, as the sample size increases indefinitely, the estimators converge to their true population values.
4. \hat{\beta}_1 \sim N\left(\beta_1, \sigma_{\hat{\beta}_1}^2\right); then, by the properties of the normal distribution, the variable Z defined as

Z = \frac{\hat{\beta}_1 - \beta_1}{\sigma_{\hat{\beta}_1}} \sim N(0, 1)

5. \hat{\beta}_2 is normally distributed, \hat{\beta}_2 \sim N\left(\beta_2, \sigma_{\hat{\beta}_2}^2\right), so that

Z = \frac{\hat{\beta}_2 - \beta_2}{\sigma_{\hat{\beta}_2}} \sim N(0, 1)
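Points 4 and 5 can be illustrated by simulation: with fixed X and normally distributed disturbances, the standardized slope estimator behaves like a standard normal variable. A sketch under assumed true parameter values (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
beta1, beta2, sigma = 1.0, 0.5, 2.0        # assumed true values (illustrative)
X = np.linspace(0.0, 10.0, 25)             # fixed regressor values
x = X - X.mean()
sd_beta2 = sigma / np.sqrt(np.sum(x**2))   # sigma_beta2_hat = sqrt(sigma^2 / sum x_i^2)

Z = []
for _ in range(20_000):
    Y = beta1 + beta2 * X + rng.normal(0.0, sigma, size=X.size)  # normal disturbances
    b2 = np.sum(x * Y) / np.sum(x**2)                            # OLS slope
    Z.append((b2 - beta2) / sd_beta2)                            # standardized estimator

Z = np.asarray(Z)
print(Z.mean(), Z.std())   # should be close to 0 and 1, consistent with Z ~ N(0, 1)
```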
