
Sampling distribution

● We have defined the sampling distribution as the distribution of an estimator, and we explained it using the arithmetic mean as an example. To give a little more insight into a sampling distribution, let us consider an example.

● Now suppose that the ages in the population (X) are: 2, 4, 6, 8, 10.

There are only 5 observations. This is the entire population. Then

µ = (2 + 4 + 6 + 8 + 10)/5 = 6 years. This is the population parameter. Similarly,

σ² = Σ(X − µ)²/N = 40/5 = 8

σ = √8 = 2.83

X is a random variable, and our interest is to find out the population mean µ and the population variance σ². If the entire population is known, there is no problem: these quantities can easily be computed, and there is no need for statistical inference or hypothesis testing.

● Now suppose that we take samples of size n = 2; then the possible samples and their means are as follows:

Sample     X̄
(2, 4)     3
(2, 6)     4
(2, 8)     5
(2, 10)    6
(4, 6)     5
(4, 8)     6
(4, 10)    7
(6, 8)     7
(6, 10)    8
(8, 10)    9

The value of X̄ varies from 3 to 9. X̄ is used as an estimator of µ; here µ = 6, but in practice you will never know the value of the population parameter µ.

If your sample happens to be the first one (2, 4), then X̄ = 3, which is far away from µ. If your sample happens to be the last one (8, 10), then X̄ = 9, which is also far away from µ. If you are lucky and your sample is the fourth one (2, 10), then X̄ = 6, which is equal to the population mean (µ = 6).

Now let us see what the mean of these means is. It is denoted E(X̄):

E(X̄) = (3 + 4 + 5 + 6 + 5 + 6 + 7 + 7 + 8 + 9)/10 = 6 years

● That means E(X̄) = µ.

● That is, if X is distributed with mean µ, then X̄ is also distributed with mean µ.

Now, in what sense do we say that X̄ is a random variable? We know that X̄ takes the following values with certain frequencies; if we divide each frequency by the total frequency, we get the relative frequencies shown below.

X̄       f       Relative frequency f(X̄)
3       1       0.1
4       1       0.1
5       2       0.2
6       2       0.2
7       2       0.2
8       1       0.1
9       1       0.1
Total   10      1

● The total of the relative frequencies is 1. That is f(X̄). This relative frequency is interpreted as probability.

● Remember that X̄ takes many values, and we do not know the value of X̄ before the sample is observed. Since its value is unknown in advance, X̄ is a random variable, and for any random variable there is an associated PDF, written f(X̄). If you plot f(X̄) you get a curve (figure not reproduced here).

σ²x̄ = σ²/n

σ²x̄ = 8/2 = 4

SD(X̄) = √4 = 2

● So, if X is distributed with mean µ and variance σ², then X̄ is also distributed with the same mean µ and variance σ²/n.

● As you can see, X̄ is a random variable with associated probabilities, and its characteristics are given by E(X̄), σ²x̄ and SE(X̄). This is an example of a sampling distribution.
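To see this numerically in code, here is a short Python sketch (not part of the lecture; it assumes NumPy is available). The lecture lists the 10 samples drawn without replacement; the sketch instead enumerates all ordered samples of size n = 2 drawn with replacement, which is the iid case under which Var(X̄) = σ²/n holds exactly:

# A minimal added sketch: sampling distribution of X̄ for the 5-value population.
import itertools
import numpy as np

population = np.array([2, 4, 6, 8, 10])   # the five population ages
mu = population.mean()                     # population mean µ = 6
sigma2 = population.var()                  # population variance σ² = 8 (divide by N)

n = 2
samples = list(itertools.product(population, repeat=n))   # all 25 ordered samples with replacement
xbars = np.array([np.mean(s) for s in samples])           # X̄ for each sample

print("E(X̄)   =", xbars.mean())    # 6.0 -> equals µ
print("Var(X̄) =", xbars.var())     # 4.0 -> equals σ²/n = 8/2
print("SD(X̄)  =", xbars.std())     # 2.0

Running it prints E(X̄) = 6, Var(X̄) = 4 and SD(X̄) = 2, matching the values above.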

● In hypothesis testing we use this sampling distribution, though not directly: we convert it into the standard normal, t, F, chi-square, etc., depending on what we are trying to test. Just as we converted X into Z, we can convert X̄ into Z also.

● Now the distribution has the area properties described earlier:

µ ± σ corresponds to Z = ±1
µ ± 2σ corresponds to Z = ±2
µ ± 3σ corresponds to Z = ±3

● So, we convert X̄ into Z, and Z is used for hypothesis testing. X̄ has the sampling distribution, and Z is the standard normal distribution derived from that sampling distribution.

● Random variables and their PDFs are not what is used for testing; what is used for testing is the sampling distribution of estimators converted into test statistics. One test statistic is Z. Z is used if and only if the population variance (and hence the population standard deviation) is known. Otherwise the Z conversion is not possible.

● Let us derive these results mathematically.

X̄ = ΣX/n = (1/n)X₁ + (1/n)X₂ + ... + (1/n)Xₙ

And we assume that X₁, X₂, ..., Xₙ are iid random variables, that is, identically and independently distributed random variables with the same mean and variance. That is, each Xᵢ is distributed normally with the same mean and variance:

Xᵢ ∼ N(µ, σ²)

Then E(X̄) = (1/n)E(X₁) + (1/n)E(X₂) + ... + (1/n)E(Xₙ)
= (1/n)µ + (1/n)µ + ... + (1/n)µ
= (1/n)·nµ = µ

E(X̄) = µ

● So, this result is obtained if and only if X₁, X₂, ..., Xₙ are identically and independently distributed. And this condition is always satisfied in the case of random sampling.

● In a random sampling procedure the observations are selected independently, and they are also assumed to be identically distributed.

Var(X̄) = Var(ΣX/n)
= (1/n²)σ² + (1/n²)σ² + ... + (1/n²)σ²
= (1/n²)·nσ²
= σ²/n

● As we have stated earlier, 𝑉𝑎𝑟 ( 𝑋 + 𝑌) = 𝑉𝑎𝑟(𝑋) + 𝑉𝑎𝑟(𝑌) if and only if X and Y

are independent. Otherwise there is a covariance term also.

● Now X₁, X₂, ..., Xₙ are random variables, and we have written only the variances of X₁, X₂, ..., Xₙ; the covariance terms are omitted because, as X₁, X₂, ..., Xₙ are iid, the covariance terms disappear. That is why we write it as (1/n²)σ² + (1/n²)σ² + ... + (1/n²)σ², and the variances are assumed to be the same because X₁, X₂, ..., Xₙ are iid random variables.

● If X is distributed with mean µ and variance σ², then X̄ is also distributed with mean µ and variance σ²/n. This holds if the random variables are iid; otherwise it will not be satisfied.

● For hypothesis testing estimators are used, and estimators can be used if and only if we know the relationship between the population distribution and the distribution of the sample statistics.

● If we know the distribution of the population, then distribution of sample statistics can be

derived.

● Remember this, all the estimators are assumed to be random variables. And distribution of

the estimator is known as sampling distribution. Sampling distribution is used for hypothesis

testing. Remember this, sampling distribution is not directly used for hypothesis testing.

From sampling distribution we derive test statistics. The popular test statistics used for

hypothesis testing are Z, chi-square, F and t.

Chi-Square Distribution

● We have discussed:

X ∼ N(µ, σ²) —------- Normal distribution

Z = (X − µ)/σ —------- Standard normal distribution

X̄ ∼ N(µ, σ²/n) —------- Normal distribution (of the sample mean)

Z = (X̄ − µ)/(σ/√n) —------- Standard normal distribution

● In statistics we come across squared quantities such as

S² = Σ(X − X̄)²/(n − 1) …………………….. Sample variance

● To derive the sampling distribution of such squared quantities we use the chi-square distribution. Now let us see what the chi-square distribution is.

● If X is normal with mean µ and variance σ² (X ∼ N(µ, σ²)), Z is standard normal with mean 0 and variance 1 (Z = (X − µ)/σ). Then Z² = ((X − µ)/σ)² ∼ χ²(1); that is, Z² follows a chi-square distribution with 1 degree of freedom.

● So, the square of a standard normal variable is distributed as a chi-square variable with 1 degree of freedom.

● So, the definition is: the square of a standard normal is a chi-square with 1 degree of freedom.

● We can generalize this. If Z₁, Z₂, ..., Z_k are independently distributed standard normal variables, then the sum

Z = ΣZᵢ² ∼ χ²(k)

where k is the number of independent terms in the sum; as there are k independent squared standard normal variables, the degrees of freedom is k.

● If Z is standard normal, Z² is chi-square. So naturally, if Z₁, Z₂, ..., Z_k are independently distributed standard normal variables, then ΣZᵢ², that is, the sum of the squares of Z₁, Z₂, ..., Z_k, follows chi-square with k degrees of freedom:

ΣZᵢ² ∼ χ²(k)

This is the definition of chi-square.
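As an illustration, the following small Python sketch (an addition, assuming NumPy and SciPy are installed; the chosen k and replication count are arbitrary) checks this definition by simulation:

# A minimal added sketch: sum of k squared standard normals behaves like chi-square(k).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
k, reps = 5, 100_000

z = rng.standard_normal((reps, k))     # k independent standard normals per replication
chi_sq = (z ** 2).sum(axis=1)          # sum of squares -> should be chi-square(k)

print("simulated mean:", chi_sq.mean(), " theoretical:", k)       # E(chi2) = k
print("simulated var :", chi_sq.var(),  " theoretical:", 2 * k)   # Var(chi2) = 2k
print("simulated 95th pct:", np.quantile(chi_sq, 0.95),
      " chi2.ppf:", stats.chi2.ppf(0.95, df=k))                   # quantiles agree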

● Now, if Z₁, Z₂, ..., Z_k are independently distributed chi-square variables with kᵢ degrees of freedom, then ΣZᵢ also follows chi-square with Σkᵢ degrees of freedom.

● Let us consider it simply: if Z₁ and Z₂ are independently distributed chi-square variables with k₁ and k₂ degrees of freedom (meaning there are k₁ random variables in Z₁ and k₂ random variables in Z₂), then the sum Z = Z₁ + Z₂ also follows chi-square with k₁ + k₂ degrees of freedom. This property is known as the reproductive property of the chi-square distribution.

● So, to summarize: if X is normal, Z is standard normal and Z² is distributed as χ²(1). If Z₁, Z₂, ..., Z_k each follow the standard normal distribution, then ΣZᵢ² ∼ χ²(k). And if Z₁, Z₂, ..., Z_k are independently distributed chi-square variables with kᵢ degrees of freedom, then Z = Z₁ + Z₂ + ... + Z_k also follows a chi-square distribution with Σkᵢ degrees of freedom.

● If we expand this, Z = ((X₁ − µ)/σ)² + ((X₂ − µ)/σ)² + ... + ((Xₙ − µ)/σ)². This quantity is a sum of n independently distributed squared standard normal variables, so it has n degrees of freedom.

● The PDF of the chi-square distribution has the following shape (figure not reproduced here).

● For the χ² distribution, since it is a squared quantity it does not take negative values; it always takes non-negative values, so it starts at 0. For small degrees of freedom, say k = 2 or k = 5, it is highly skewed. But as the degrees of freedom increase, it becomes more and more like a normal distribution.

● Remember, to find the area properties of the χ² distribution you must always consider the degrees of freedom.

● If you go through any statistics book, you will see tables providing the areas of the χ² distribution for various degrees of freedom. We will consider one case.

● Suppose that the degrees of freedom is 20; then the χ² distribution has the following area properties.

● Then the probability that χ² is greater than 10.85 is 95%, the probability that it is greater than 23.83 is 25%, and the probability that it is greater than 39.99 is about 0.5%.

● In words: if the degrees of freedom is 20, the probability of getting a χ² value in excess of 39.99 is only about 0.5%; that is, about 99.5% of the area under the χ² curve lies between 0 and 39.99. You can calculate other areas for different degrees of freedom.
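The quoted areas can be reproduced with SciPy (an added sketch for illustration; chi2.sf returns the upper-tail probability):

# A minimal added sketch: upper-tail areas of the chi-square distribution with 20 df.
from scipy import stats

df = 20
for x in (10.85, 23.83, 39.99):
    print(f"P(chi2({df}) > {x}) = {stats.chi2.sf(x, df):.4f}")
# Prints roughly 0.95, 0.25 and 0.005 respectively.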

● For the χ² distribution, the mean (expected value) is k and the variance is 2k:

E(χ²) = k

Var(χ²) = 2k

● The χ² distribution is extensively used for testing hypotheses about the population variance; it is also used for testing association between two variables using contingency tables, that is, the test for independence, etc.

● Anyway, the χ² distribution is skewed, its mean is k and its variance is 2k, and its area properties depend on the degrees of freedom. We will not use estimators directly for hypothesis testing; for hypothesis testing, we develop test statistics based on estimators.

● The test statistic based on X̄ is Z = (X̄ − µ)/(σ/√n). Similarly, S² is the estimator of σ², but S² is not directly used for making inferences about the parameter σ²; using S² we develop a test statistic.

● The test statistic developed for testing the significance of the variance is (n − 1)S²/σ², and we can show that this quantity follows a χ² distribution with (n − 1) degrees of freedom:

(n − 1)S²/σ² ∼ χ²(n − 1)

where S² is an unbiased estimator of σ² based on a sample of size n, and there are (n − 1) degrees of freedom.
● And if we expand (n − 1)S²/σ² ∼ χ²(n − 1):

(n − 1)·[Σ(X − X̄)²/(n − 1)]/σ² = Σ(X − X̄)²/σ²

So this quantity follows χ² with (n − 1) degrees of freedom.

● And this quantity will be extensively used later

● There are n random variables, but they are not all independent; that is why we have (n − 1) degrees of freedom.


● The quantity (n − 1)S²/σ² is used to test hypotheses about the population variance.

● This is all about the chi-square distribution. The chi-square distribution is based on the square of a standard normal variable, and it is used to test the significance of the population variance.
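A minimal sketch of how this test statistic is used in practice is given below (added for illustration; the data and the hypothesised variance are made up, and SciPy is assumed):

# A minimal added sketch: chi-square test for a population variance, (n-1)S²/σ² ~ chi2(n-1).
import numpy as np
from scipy import stats

x = np.array([10.2, 9.8, 11.1, 10.6, 9.5, 10.9, 10.1, 9.7])   # hypothetical sample
sigma2_0 = 0.25                      # hypothesised population variance under H0

n = len(x)
s2 = x.var(ddof=1)                   # unbiased sample variance S²
chi_stat = (n - 1) * s2 / sigma2_0   # test statistic, chi-square(n-1) under H0
p_value = stats.chi2.sf(chi_stat, df=n - 1)   # upper-tail p-value (H1: variance > sigma2_0)

print("S^2 =", round(s2, 4), " chi2 statistic =", round(chi_stat, 3), " p =", round(p_value, 4))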

t Distribution

In econometrics or empirical analysis the most widely used distribution is the t distribution. This is because the Z distribution cannot be applied in most cases, since the population variance σ² is unknown. When σ is replaced by the sample standard deviation s (based on the unbiased estimator of the variance), the Z distribution is converted into the t distribution.

The properties of t differ from those of Z.

If Z₁ ∼ N(0, 1) is a standard normal variable and Z₂ ∼ χ²(k) is distributed independently of Z₁, then the variable

t = Z₁/√(Z₂/k) ∼ t(k)

The degrees of freedom of t(k) is the same as the degrees of freedom of the χ²(k) variable.

It is the ratio between two independently distributed random variables. Before deriving this

distribution, what are its area properties?

1. Like the standard normal distribution, the range of t is −∞ < t < +∞.

2. As the degrees of freedom increase indefinitely, the t distribution approaches the standard normal distribution.

3. There is a family of t distributions, one for each value of the degrees of freedom.

4. For small degrees of freedom the t distribution is flat. As the degrees of freedom increase, it approaches the normal distribution.

5. Like the standard normal distribution, the range of t is −∞ to +∞.

6. The t distribution is symmetric, but its flatness depends on the degrees of freedom.

7. Mean of t = 0; variance of t = k/(k − 2) for k > 2. In the case of the Z distribution, the mean is 0 and the variance is 1. The difference is with respect to the variance only.

8. If k = 10, var(t) = 10/8 = 1.25; if k = 30, var(t) = 30/28 ≈ 1.07; if k = 100, var(t) = 100/98 ≈ 1.02.

As the degrees of freedom increase, the variance of t approaches 1; that means t approaches the standard normal distribution. Comparing t with Z, the only difference is the variance.

How do we transform a random variable X and the sampling distribution of X̄ into t rather than Z?

t = Z₁/√(Z₂/k) ∼ t(k)

We assume that X ∼ N(µ, σ²). Then

Z₁ = (X̄ − µ)/(σ/√n) ∼ N(0, 1)

Z₂ = (n − 1)s²/σ² = Σ(X − X̄)²/σ² ∼ χ²(n − 1)

Dividing Z₁ by √(Z₂/(n − 1)) replaces σ with s and gives

t = (X̄ − µ)/(s/√n)

with (n − 1) degrees of freedom.

Suppose that the population σ is unknown. You cannot use Z as the test statistic, because that requires knowledge of σ. Whenever σ is replaced by its estimator s, the Z distribution is converted into the t distribution (whether your sample size is large or small).
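For illustration, a short Python sketch of this conversion (not from the lecture; the data are hypothetical and SciPy is assumed) computes t = (X̄ − µ₀)/(s/√n) by hand and compares it with SciPy's built-in one-sample t test:

# A minimal added sketch: one-sample t statistic when sigma is unknown.
import numpy as np
from scipy import stats

x = np.array([6.1, 5.8, 6.4, 6.0, 5.6, 6.3, 6.2, 5.9])   # hypothetical sample
mu0 = 6.0                                                 # hypothesised mean

n = len(x)
t_manual = (x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(n))   # (X̄ - µ0)/(s/√n)
t_scipy, p_scipy = stats.ttest_1samp(x, mu0)                 # same statistic, two-sided p

print("manual t =", round(t_manual, 4), " scipy t =", round(t_scipy, 4), " p =", round(p_scipy, 4))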

F distribution

If Z₁ and Z₂ are two independently distributed chi-square variables with k₁ and k₂ degrees of freedom, then

F = (Z₁/k₁) / (Z₂/k₂) ∼ F(k₁, k₂)

● As the degrees of freedom increase, it approaches a normal distribution.

● The F distribution is the test statistic used in ANOVA, econometrics, etc.

● The F distribution is used to test the overall significance of an estimated regression line.

Properties:

1. The F distribution is highly skewed for small degrees of freedom, and it approaches the normal distribution as the degrees of freedom increase.

2. E(F) = k₂/(k₂ − 2), defined for k₂ > 2

Var(F) = 2k₂²(k₁ + k₂ − 2) / [k₁(k₂ − 2)²(k₂ − 4)]

The variance is defined only if k₂ > 4.

3. t²(k) = F(1, k); as in the case of the t distribution, the degrees of freedom of F are also derived from the χ² distribution.
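The relation t²(k) = F(1, k) can be checked numerically, as in the following added sketch (SciPy assumed, with an arbitrary k and significance level):

# A minimal added sketch: the square of the t critical value equals the F(1, k) critical value.
from scipy import stats

k, alpha = 12, 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=k)      # two-sided t critical value
f_crit = stats.f.ppf(1 - alpha, dfn=1, dfd=k)  # upper-tail F(1, k) critical value

print("t_crit^2 =", round(t_crit ** 2, 4), "  F(1,k) crit =", round(f_crit, 4))  # equal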

All these distributions, Z, χ², t and F, are related to the normal distribution. The normal distribution is the basis of sampling distributions.

Even if the population is not normal, if the sample size is sufficiently large the central limit theorem applies, so the normality assumption is not at all a problem in the case of a large sample. We can conduct hypothesis testing using t, Z, F, χ², etc.

Degrees of Freedom

Suppose that you select 10 numbers. If no other conditions are stated, you are free to select all 10 numbers. Now add the condition that the sum of the 10 numbers must equal 100.

You can select 9 numbers freely, but the 10th number can only be the value that makes the sum of the 10 numbers equal to 100.

Degrees of freedom = 9 (if 9 numbers are known, the 10th is automatically determined).


S² = Σ(X − X̄)²/(n − 1) is an unbiased estimator of σ²:

E(S²) = σ²

E(X̄) = µ

E[Σ(X − X̄)²] = (n − 1)σ²

E[Σ(X − X̄)²/n] = (n − 1)σ²/n = σ² − σ²/n, which is a biased estimator.

E[Σ(X − X̄)²/(n − 1)] = [(n − 1)/(n − 1)]σ² = σ²

Then S² is an unbiased estimator of σ².

Degrees of freedom is an adjustment used to make an estimator unbiased.

df = n − k, where k is the number of constraints or restrictions and n is the sample size.
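The effect of the divisor can be seen in a small simulation (an added sketch with made-up parameters, assuming NumPy): dividing Σ(X − X̄)² by n underestimates σ², while dividing by n − 1 does not:

# A minimal added sketch: the n-1 divisor removes the bias of the sample variance.
import numpy as np

rng = np.random.default_rng(1)
mu, sigma2, n, reps = 0.0, 4.0, 5, 200_000

x = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
ss = ((x - x.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)   # Σ(X - X̄)² per sample

print("E[SS/n]     ~", round((ss / n).mean(), 3), "  theory:", sigma2 * (n - 1) / n)
print("E[SS/(n-1)] ~", round((ss / (n - 1)).mean(), 3), "  theory:", sigma2)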

Properties of Estimators

As we have discussed earlier, our aim is to estimate population parameters. But the population

parameter is generally unknown as the entire population is unknown. So in any situation we

have only a sample, and using the sample we have to estimate the population parameters. The arithmetic mean is an estimator of the population parameter µ, the sample variance is an estimator of the population variance, and the OLS estimators β̂₁ and β̂₂ are estimators of the regression parameters. For any parameter there will be many estimators. For the population parameter µ, the possible estimators are the sample mean, the sample median, the mode, etc. As an estimator of the population variance (dispersion) we can use the standard deviation, mean deviation, range, coefficient of variation, etc. Similarly, as estimators of the regression coefficients we have the OLS estimators, the method of maximum likelihood, instrumental variable estimators, method of moments estimation procedures, and many other procedures. So the question is: among the various estimators of a parameter, which estimator will we select, or what are the criteria for selecting estimators? To select one estimator from an array of estimators, we consider the properties of estimators and select as our estimator of the population parameter the one with most of the desirable properties. The properties of estimators are classified into two categories:

1. Small sample properties

2. Large sample properties/ asymptotic properties

Small sample properties will be satisfied in small samples as well as in large samples, so small sample properties are the most desirable properties. On the other hand, large sample properties will be satisfied only in large samples; they will not be satisfied in small samples. The discussion of the properties of estimators is based on the assumption that an estimator is a random variable and has a PDF. That PDF is known as the sampling distribution.

Small sample properties

If the small sample properties are satisfied we will not go for large sample properties

1. Unbiasedness

Suppose that the parameter of interest is θ and the estimator is θ̂. We say that the estimator θ̂ is an unbiased estimator of θ if the expected value of θ̂ is equal to the parameter θ:

E(θ̂) = θ

E(θ̂) − θ = 0

bias(θ̂) = E(θ̂) − θ

As we discussed in the context of the sampling distribution, unbiasedness is a repeated-sample property: if we take all possible samples and the expectation of the estimator equals the parameter, it is unbiased.

Example: suppose we have two estimators, θ̂₁ and θ̂₂, of θ (see the figure, not reproduced here).

So unbiasedness is a repeated-sample property. Unbiasedness by itself is not a very useful property, because even if an estimator is unbiased it says nothing about the behaviour of the estimator in a single sample. In the numerical example we considered earlier, µ = 6, but X̄ ranges from 3 to 9; in any given situation we take only one sample, and X̄ may be 3, far away from µ, or it may be 9, again far away from µ. So unbiasedness by itself is not a useful property, but it becomes useful when combined with another property, namely minimum variance. So the next small sample property is minimum variance.

2. Minimum variance / Best estimator

θ̂₁ is the minimum variance estimator of θ if the variance of θ̂₁ is less than or at most equal to the variance of another estimator of θ, say θ̂₂. θ̂₁ and θ̂₂ need not be unbiased; we consider only the minimum variance property here. That is, of two estimators of θ, the one with the smallest variance is the minimum variance estimator. In notation:

E[θ̂₁ − E(θ̂₁)]² ≤ E[θ̂₂ − E(θ̂₂)]²

Var(θ̂₁) ≤ Var(θ̂₂)

As in the case of unbiasedness, minimum variance by itself is not a very useful property. The reason is that even though θ̂₁ may be the minimum variance estimator, its distribution may be far away from the population parameter; or consider a constant used as an estimator: a constant has no variability, but it cannot be a sensible estimator of the parameter θ. So by itself minimum variance is not a very useful property. It becomes useful when it is combined with unbiasedness. The combined property is the minimum variance unbiased estimator.

3. Minimum variance unbiased / Best unbiased / Efficient

This is a combination of unbiasedness and minimum variance. If θ̂₁ and θ̂₂ are two unbiased estimators of θ, then θ̂₁ is said to be minimum variance unbiased, or best unbiased, if

Var(θ̂₁) ≤ Var(θ̂₂)

Here we compare only estimators that are unbiased. Thus the property is a combination of unbiasedness and minimum variance.

First condition: E(θ̂₁) = θ = E(θ̂₂)

Second condition: Var(θ̂₁) ≤ Var(θ̂₂)

If these two conditions are satisfied, then we can say that θ̂₁ is the minimum variance unbiased, best unbiased, or efficient estimator.

4. Linear estimator

An estimator is linear if it is a linear combination of the sample values. An estimator θ̂ is said to be a linear estimator of θ if it is obtained as a linear combination of the sample values.

Example: consider the estimator X̄:

X̄ = ΣX/n = (1/n)X₁ + (1/n)X₂ + ... + (1/n)Xₙ

By giving equal weights k = 1/n to X₁, X₂, ..., Xₙ, X̄ can be written as kX₁ + kX₂ + ... + kXₙ, so it is a linear combination of the random variables X₁, X₂, ..., Xₙ with k serving as the weight. So X̄ is a linear estimator. An estimator is linear if it is a linear function of the sample observations.

5. Best Linear Unbiased Estimator (BLUE)

This is a combination of the four properties above. The estimator θ̂ is said to be BLUE if it has minimum variance in the class of all linear unbiased estimators. This BLUE property is very important in econometrics, and for the OLS estimators it is established by the Gauss-Markov theorem.

6. Minimum mean square error estimator (MSE estimator)

This property is not extensively used in econometrics. Consider two estimators of θ, say θ̂₁ and θ̂₂. Suppose θ̂₁ is unbiased and θ̂₂ is biased. Which estimator should we select? If unbiasedness is used as the criterion, θ̂₁ would be selected, but it has a large variance, so the uncertainty is very high. θ̂₂ is slightly biased, but its variance is low, so the chance of getting a value far away from the parameter is low. So there is a trade-off between bias and variance, and this property says that if there is a trade-off between bias and variance, select the estimator with the minimum mean square error.

Properties of Estimators (2)

Large sample properties

We consider these properties only when small sample properties are not satisfied.

1. Asymptotic unbiasedness

An estimator θ̂ is said to be an asymptotically unbiased estimator of θ if

lim (n→∞) E(θ̂ₙ) = θ

We say that θ̂ is an asymptotically unbiased estimator of θ if θ̂ becomes unbiased as the sample size increases indefinitely. Remember, it is a repeated-sample property like unbiasedness; in small samples the estimator will be biased, but as the sample size increases indefinitely it becomes an unbiased estimator.

Example: suppose

S² = Σ(X − X̄)²/n

E(S²) = ((n − 1)/n)σ² = σ² − σ²/n

σ²/n → the bias term

So in a small sample S² is a biased estimator of σ². As the sample size increases indefinitely the bias term disappears, so we say that it is asymptotically unbiased.

2. Consistency

This is the minimum requirement an estimator must satisfy to be useful. If an estimator does not satisfy at least consistency, then we cannot use it for estimation and inference.

θ̂ is said to be a consistent estimator of θ if θ̂ approaches the true value θ as the sample size increases, or as the sample gets larger and larger.

Consider the arithmetic mean:

E(X̄) = µ

Var(X̄) = σ²/n

When n approaches infinity, the variance of X̄ approaches 0.

Example

When the sample size is small, the estimator θ̂ is biased; not only biased, its variance is also very high. As the sample size increases, the bias decreases and the variance also decreases. When the sample size is 100, the entire bias disappears and E(θ̂) = θ. Not only does the bias disappear, the variance also approaches 0. In fact, if an estimator is consistent, when the sample size increases indefinitely the entire distribution collapses to a single point; we say that the distribution degenerates, and we say that the estimator is consistent. So consistency requires that any bias in small samples must disappear and the variance must also approach 0. Asymptotic unbiasedness is about what happens to the bias only; it says nothing about what happens to the variance. Consistency requires both: the bias disappears and the variance disappears; only then can we say that an estimator is consistent. Consistency can be written technically as

lim (n→∞) Pr{|θ̂ − θ| < δ} = 1, for any δ > 0

We say that plim(θ̂) = θ, where plim means probability limit.

Consider a graph (not reproduced here): if the distribution collapses in this way, the absolute difference between θ̂ and θ becomes very small. Then we say that the estimator is consistent. We write it as

plim (n→∞) θ̂ = θ

where plim means probability limit.

An estimator is consistent if the distribution of θ̂ not only collapses to θ but its variance also decreases; in the limit, as n approaches infinity, the distribution of θ̂ collapses to the single point θ and has zero spread, or zero variance. Then we say that θ̂ is a consistent estimator of θ.
Two sufficient conditions for consistency are:

1) lim (n→∞) E(θ̂ₙ) = θ

2) lim (n→∞) Var(θ̂ₙ) = 0

The arithmetic mean is an example of a consistent estimator.
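A short simulation (an added sketch, NumPy assumed, with arbitrary µ and σ) shows this collapsing behaviour of X̄ as n grows:

# A minimal added sketch: the sampling distribution of X̄ collapses towards µ as n grows.
import numpy as np

rng = np.random.default_rng(2)
mu, sigma = 5.0, 2.0

for n in (10, 100, 1_000, 10_000):
    xbars = rng.normal(mu, sigma, size=(20_000, n)).mean(axis=1)
    print(f"n={n:6d}  E(xbar)~{xbars.mean():.4f}  Var(xbar)~{xbars.var():.6f}  (sigma^2/n={sigma**2/n:.6f})")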


There are many situations in which we may not be able to assess the small sample properties. In such cases we consider the large sample properties, and the large sample property we consider is consistency. The probability limit has some properties:

Suppose that θ̂ is a consistent estimator of θ, and h(θ̂) is a continuous function of θ̂. Then

plim (n→∞) h(θ̂) = h(θ)

For example, if θ̂ is a consistent estimator of θ, then we can say that

● 1/θ̂ is a consistent estimator of 1/θ

● log(θ̂) is a consistent estimator of log θ

● plim (n→∞) b = b, where b is a constant

Suppose that θ̂₁ and θ̂₂ are consistent estimators of θ₁ and θ₂; then

plim(θ̂₁ + θ̂₂) = plim(θ̂₁) + plim(θ̂₂)

plim(θ̂₁θ̂₂) = plim(θ̂₁)·plim(θ̂₂)

plim(θ̂₁/θ̂₂) = plim(θ̂₁)/plim(θ̂₂)

The last two properties do not hold for the expectations operator.
3. Asymptotic efficiency

Let θ̂ be a consistent estimator of θ. The variance of the asymptotic distribution of θ̂ is known as the asymptotic variance. If θ̂ is consistent and its asymptotic variance is smaller than the asymptotic variance of all other consistent estimators, then we say that θ̂ is the asymptotically efficient estimator of θ.
4. Asymptotic normality

If X ∼ N(µ, σ²), then X̄ ∼ N(µ, σ²/n). This is not a theorem; it will always be satisfied. Now, we have seen the central limit theorem: even if X is not normal with mean µ and variance σ²,

X̄ ∼ N(µ, σ²/n) as n → ∞

This property is known as asymptotic normality. It is a useful property in many cases: if you are not sure about the nature of the distribution of the population under consideration, you can use this property for hypothesis testing, provided you have a sufficiently large sample. It is a property that holds when the sample size is sufficiently large.

Among all the large sample properties, consistency is the most important; we use it when the small sample properties are either unknown or cannot be derived.

Assumptions of Classical Linear Regression Model

In the previous class we derived the OLS estimators. For the simple linear regression model the parameters of interest are β₁ and β₂, and the estimators are as follows:

● β̂₁ = Ȳ − β̂₂X̄

● β̂₂ = Σxᵢyᵢ / Σxᵢ²
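For illustration, a short Python sketch (not part of the lecture; the X and Y numbers are made up) computes these OLS estimates directly from the deviation-form formulas and cross-checks them with np.polyfit:

# A minimal added sketch: OLS slope and intercept from the deviation-form formulas.
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])   # hypothetical sample

x = X - X.mean()                 # deviations from the mean
y = Y - Y.mean()

beta2_hat = (x * y).sum() / (x ** 2).sum()      # slope: Σxy / Σx²
beta1_hat = Y.mean() - beta2_hat * X.mean()     # intercept: Ȳ - slope·X̄

print("beta1_hat =", round(beta1_hat, 4), " beta2_hat =", round(beta2_hat, 4))
print("np.polyfit:", np.polyfit(X, Y, 1))       # [slope, intercept] as a cross-check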

Now if our purpose is estimation only, then OLS is enough: by applying the principle of OLS, β̂₁ and β̂₂ are obtained. But in econometrics our objective is not estimation only; we have to use these estimators to make inferences about the regression coefficients β₁ and β₂.

For that we have to make assumptions about the model.

PRF stochastic form is 𝑌𝑖=β1+β2 𝑋𝑖+𝑢𝑖.

We know that 𝑌𝑖 depends on 𝑋𝑖 and 𝑢𝑖.

So if we have to make inferences about β₁ and β₂, we have to make assumptions about how Yᵢ is generated. As Yᵢ depends on Xᵢ and uᵢ, we have to make assumptions about how Xᵢ and uᵢ are generated.

Only then can we use the estimators to make inferences about the population parameters. What we discuss below are the assumptions of the classical linear regression model, the assumptions underlying the method of OLS. While discussing the properties of estimators we noted that the properties are classified into statistical properties and numerical properties. Numerical properties, which we have already discussed, hold as a result of the use of OLS regardless of how the data are generated. Statistical properties of the estimators hold only after making assumptions about how the data are generated. So in this discussion we are considering how the data (Yᵢ, or Xᵢ and uᵢ) are generated. These assumptions are important, and the properties of the estimators depend on the assumptions we make about Xᵢ and uᵢ, and hence Yᵢ.

So the classical linear regression model, the Gaussian linear regression model, the method of OLS, makes seven assumptions about how Xᵢ and uᵢ, and hence Yᵢ, are generated.

Now let us see what the assumptions are:

● Linear in parameters.

That is the linear regression model. We have defined linearity as linearity in parameters

and linearity in variables. And in regression analysis in econometrics linearity is defined

as linearity in parameters.

So our model 𝑌𝑖=β1+β2 𝑋𝑖+𝑢𝑖 is linear in parameters. As we have discussed this

property in detail there is nothing more to discuss here. If the model is not linear in

parameters we have to go for nonlinear least squares.

This is the first assumption underlying the method of ordinary least squares, or the Gaussian linear regression model.

● Fixed X values, or X values assumed to be independent of the stochastic error term.

The assumption in the form of an equation is:

cov(Xᵢ, uᵢ) = 0

This assumption can be stated in different ways.

1. X is assumed to be fixed in repeated samples; that is the fixed-in-repeated-samples assumption. But suppose that X is not fixed in repeated samples; that is, like Y, X also varies from sample to sample.

2. cov(Xᵢ, uᵢ) = 0: the correlation between Xᵢ and uᵢ is zero.

Now the question is what the implication of this assumption is. Let me first write an equation for the covariance:

cov(Xᵢ, uᵢ) = E{[Xᵢ − E(Xᵢ)][uᵢ − E(uᵢ)]}

As we will see in the next assumption, E(uᵢ) is assumed to be zero (the implications we will see later). So it can be written as

E{[Xᵢ − E(Xᵢ)]uᵢ}

= E[Xᵢuᵢ − E(Xᵢ)uᵢ]

= E(Xᵢuᵢ) − E(Xᵢ)E(uᵢ), and since E(uᵢ) = 0,

= E(Xᵢuᵢ)

So the assumption requires that the expected value of Xᵢuᵢ is 0.

Now, if we assume that Xᵢ is fixed in repeated samples, then Xᵢ can be treated as a constant from sample to sample. Then we can write

E(Xᵢuᵢ) = Xᵢ·E(uᵢ) = 0

So if X is assumed to be fixed in repeated samples, the assumption is satisfied automatically.

But if X is stochastic like uᵢ, then we have to ensure that cov(Xᵢ, uᵢ) = 0, because this condition will not be automatically satisfied; we have to ensure that E(Xᵢuᵢ) = 0.

As a first approximation, if this assumption is not satisfied, that is, if Xᵢ is correlated with uᵢ, we will get biased estimators. So this assumption is very important:

cov(Xᵢ, uᵢ) = 0
Now, the assumption that X is fixed in repeated samples is a highly unrealistic assumption: if Yᵢ is stochastic, Xᵢ is also stochastic. But the question is why it is assumed that X is fixed in repeated samples. As we will see later, this assumption is used at an introductory level to derive some results easily: if X is assumed to be fixed in repeated samples, it is easy to derive the estimators and their properties. That is why it is assumed to be fixed in repeated samples.

Also, the fixed-in-repeated-samples assumption is not always unrealistic; there are many cases in which it is valid. Suppose that we consider a model in which salary is a function of education: Salary = f(Education).

Suppose that education is given in years:

Years (X)    Salary (Y)
10           ...
12           ...

Suppose that you take one sample; you will be able to find many people with different levels of salary for the same level of education, with 8 years of education, 9 years of education, etc. In this case the level of education remains the same, but salary varies from sample to sample. So this is a realistic case.

Consider another case: suppose that you are applying certain amounts of fertilizer and studying the relation between fertilizer and yield.

Fertilizer (X)    Yield (Y)
10                ...
20                ...
30                ...

Now if you apply 10 kilos of fertilizer on different plots you will get different levels of yield; yield will vary from one plot to another. If the plots are randomly selected, the procedure is also correct. So the fixed-in-repeated-samples assumption is not highly unrealistic. Anyway, for the time being just keep in mind that cov(Xᵢ, uᵢ) = 0.
( )
3. E 𝑈𝑖/𝑋𝑖 = 0

( )
The assumption is that given 𝑋𝑖 values E 𝑈𝑖 = 0

As we have seen in an earlier time

34
There is PRF. So corresponding to 𝑋𝑖 is X1, X2, X3 etc. The PRF passes through the conditional

mean values.Now the deviation of a Y from its mean is U. So there are so many positive and

( )
negative U values, the assumption simply says that E 𝑈𝑖/𝑋𝑖 , that is given 𝑋𝑖 = 𝑋1𝑈1 has a

( )
distribution with E(𝑈)=0, 𝑈2 has distribution with E 𝑈2 =0, 𝑈3 has distribution with E 𝑈3 =0 ( )
like that.

Actually this is not a very important assumption, not an assumption with impact.As long as there

is an intercept in this model this assumption will be automatically satisfied or stated differently

(
we can always rewrite our model in such a way that E 𝑈𝑖/𝑋𝑖 = 0. )
As an example, suppose that E(uᵢ) = α₀. Now consider the model Yᵢ = (β₁ + α₀) + β₂Xᵢ + (uᵢ − α₀). If you take expectations you will get E(Yᵢ) = β₁ + α₀ + β₂Xᵢ, since E(uᵢ − α₀) = 0. So the only consequence is that the intercept becomes β₁ + α₀: we get a biased estimator of the intercept. That is a problem, but not a serious one, as the intercept is not very important in most cases. So this is not an assumption with much impact, but it is assumed to hold in all cases. Provided that the model is correctly specified and all the important variables are included, we can conclude that E(uᵢ | Xᵢ) = 0 is always satisfied.

There is one more point: if the conditional expectation of one random variable given the other is equal to zero, the random variables are independent. So if E(uᵢ | Xᵢ) = 0, then cov(Xᵢ, uᵢ) = 0.

cov(Xᵢ, uᵢ) = 0 does not by itself mean that the variables are independent, but the implication does hold in the case of conditional distributions: if E(Y | X) = E(Y), that is, the conditional distribution equals the marginal distribution, then the variables Y and X are independent.

If E(uᵢ | Xᵢ) = 0, then it follows that E(Y | Xᵢ) = β₁ + β₂Xᵢ. So the conditional mean of Y is β₁ + β₂Xᵢ, and the unconditional mean of Y is different from the conditional mean. So the condition E(uᵢ | Xᵢ) = 0 also implies the condition E(Y | Xᵢ) = β₁ + β₂Xᵢ.

The conditional expectation of one random variable given the other being equal to zero means that the random variables are independent. This is not true in the case of covariance: cov(X, Y) = 0 does not mean that X and Y are independent, because X and Y may be nonlinearly related. But remember, E(Y | X) = 0 means X and Y are independent.

So if E(uᵢ | Xᵢ) = 0, it means that u and X are independent; this is not true for zero covariance between two random variables.

The assumption that X is fixed in repeated samples is only a simplification used in introductory textbooks to derive some of the results easily. In the real situation X is stochastic like Y; it means that from sample to sample X varies like Y. Then the requirement is cov(Xᵢ, uᵢ) = 0. If this assumption is violated we will get biased estimates.

4. Homoskedasticity. In some books it is written as homoscedasticity; the now-accepted spelling is homoskedasticity.

Homo = equal
Skedasticity = spread

So homoskedasticity means equal spread.

The assumption is that

var(uᵢ | Xᵢ) = E[uᵢ − E(uᵢ | Xᵢ)]² = E(uᵢ² | Xᵢ) = σ²

because E(uᵢ | Xᵢ) = 0 by assumption, and the variance is a constant, sigma squared. That means the variance of uᵢ given different values of Xᵢ is a constant number σ².

In the form of a graph:

This is an example of homoskedasticity, or equal spread: the distributions of u₁, u₂, u₃ all have mean zero and the same variance. If σ² of u₁ is 10, then σ² of u₂ is 10, of u₃ is 10, and so on.

Violation of the assumption of homoskedasticity is known as heteroskedasticity.

Hetero = unequal
Skedasticity = spread

So: unequal variance.

Heteroskedasticity is written as var(uᵢ | Xᵢ) = σᵢ². The subscript i represents the fact that the variance of u varies from one observation to another.

σ² = homoskedasticity
σᵢ² = heteroskedasticity

Example of heteroskedasticity: (figure not reproduced)

There is one more meaning to this assumption:

Homoskedasticity means var(uᵢ | Xᵢ) ≠ f(Xᵢ): whether X increases or decreases, the variance of uᵢ remains the same.

Heteroskedasticity means var(uᵢ | Xᵢ) = f(Xᵢ).

There is one more point here. We know that

var(Yᵢ) = E[Yᵢ − E(Yᵢ)]² = E[β₁ + β₂Xᵢ + uᵢ − β₁ − β₂Xᵢ]² = E(uᵢ²) = σ²

So the point is that the variance of uᵢ is the same as the variance of Yᵢ. If uᵢ is homoskedastic, Yᵢ is also homoskedastic. But in the case of the mean, the mean of uᵢ is zero while the mean of Yᵢ is β₁ + β₂Xᵢ; that is the difference.

The mean of Yᵢ is different from the mean of uᵢ; or, put differently, if we subtract E(Yᵢ) from Yᵢ we get uᵢ, that is, uᵢ = Yᵢ − E(Yᵢ); we change the origin. Then the mean becomes zero, but the variance is not affected. The variance of Yᵢ is the same as the variance of uᵢ.

This is the homoskedasticity assumption. There is another way to express this:

Now consider the graph, a three-dimensional graph (not reproduced here). f(uᵢ) is the PDF of uᵢ. E(Yᵢ) = β₁ + β₂Xᵢ is the relationship between consumption and income, the PRF, evaluated at X = X₁, X = X₂, X = X₃, and so on. The PRF starts from the Y axis, and as the relation is assumed to be positive, the distance between the X axis and the PRF increases. The third dimension shows f(uᵢ), and this is the case of homoskedasticity: the distribution of uᵢ is the same in all cases.

Heteroskedasticity: the case where the variance changes (for example, decreases) as X changes (figure not reproduced).

5. No autocorrelation between disturbances.

It means that, given any two X values Xᵢ and Xⱼ, the correlation between uᵢ and uⱼ is zero. In notation:

cov(uᵢ, uⱼ) = E{[uᵢ − E(uᵢ | Xᵢ)][uⱼ − E(uⱼ | Xⱼ)]} = E[(uᵢ | Xᵢ)(uⱼ | Xⱼ)] = 0

It means that, given any X values, the covariance between uᵢ and uⱼ is zero. It simply means that given any X₁ and X₂, the covariance between u₁ and u₂ is zero.

If this assumption is violated we have a case of autocorrelation, that is, cov(uᵢ, uⱼ) is not equal to zero. One implication of this assumption: suppose that we are using quarterly data, and suppose that there is a strike in the first quarter. Then production will be disrupted. If its effect dies out in the first quarter itself, then there is no autocorrelation. But if its effect persists into the 2nd, 3rd quarter, etc., then there is autocorrelation. The implications of this assumption, its violation, consequences, etc. will be considered later.

6. n > k

The number of observations must be greater than the number of parameters to be estimated. This is important in the sense that, suppose we have a single pair of X, Y values:

Y      X
10     15

With just these two numbers we cannot estimate a model with the parameters β₁ and β₂; there must be more than one observation.

In a technical way the assumption says that:

● if n is not greater than k we will not have sufficient degrees of freedom.

● If n = k, the degrees of freedom become zero, because degrees of freedom are defined as n − k.

7. There must be some variability in X.

That means, given a set of X, Y values, not all the X values should be the same in a given sample; or, stated differently, Σ(X − X̄)² > 0.

There are various implications of this assumption. We have β̂₂ = Σxᵢyᵢ/Σxᵢ². If all the X values are identical, then it follows that Σ(Xᵢ − X̄)² = 0 and estimation is not possible. The second implication is that, irrespective of the values of Y, the variability in Y cannot then be attributed to X: if X remains constant while Y changes, the change in Y is not due to X. X can explain the variability in Y if and only if the X values are not all the same in a given sample. Technically, the variance of X, that is Σ(X − X̄)²/(n − 1), must be a positive number.

Now, when we go beyond the two-variable model, we add an 8th assumption:

8. There should be no perfect multicollinearity.

This assumption is not relevant for the two-variable model. Suppose that we have Yᵢ = β₁ + β₂X₂ᵢ + β₃X₃ᵢ + uᵢ. We have two explanatory variables, so there is a possibility of correlation between X₂ and X₃. For example, if Y is consumption, X₂ is income and X₃ is wealth, there is a possibility of correlation between income and wealth. This assumption says that there should be no perfect collinearity between X₂ and X₃. Perfect collinearity means that the correlation between X₂ and X₃ is +1 or −1, and the assumption is that it must not be.

So these are the assumptions of the classical linear regression model.

All these assumptions relate to the PRF, not the SRF. But some of the assumptions of the PRF are duplicated in the SRF as well.

Assumptions related to SRF

1. For example, E(uᵢ | Xᵢ) = 0 (assumption number 3). For the SRF we have Σûᵢ = 0, i.e. the residuals have zero mean (ū̂ = 0).

2. cov(uᵢ, Xᵢ) = 0 relates to the PRF; in the context of the SRF, ΣûᵢXᵢ = 0.

These are the two conditions imposed on the data when deriving the OLS estimators. So these two assumptions are duplicated in the SRF.

In addition, as we proceed we will consider some other assumptions also, such as that the model is correctly specified (there should be no specification errors), that there should be no errors of measurement, and other related issues.

You will not see all of these assumptions in all books; some assumptions, like assumptions 6 and 7, are taken for granted and will not be explicitly stated in many textbooks.

Also, in this model the assumptions are stated in terms of u, but you will also see the assumptions stated in terms of Y. We can state E(uᵢ | Xᵢ) = 0 as E(Y | Xᵢ) = β₁ + β₂Xᵢ. Similarly, var(uᵢ | Xᵢ) = σ² can also be stated as var(Yᵢ | Xᵢ) = σ². So we can specify the assumption of homoskedasticity in terms of u or in terms of Y. Similarly, cov(uᵢ, uⱼ) = 0 can also be stated as cov(Yᵢ, Yⱼ) = 0.
Now the question is how realistic are these assumptions.

The point is if you have a set of data satisfying all these assumptions that is the most ideal data

set. This is because if you apply OLS on such data the resulting estimators will be the most

desirable in terms of its properties.

But the problem is in the real world many of these assumptions will be violated. And we will see

what happens to the properties of estimators if these assumptions are violated. And remember

that these assumptions are stated to derive the properties of estimators. And we can compare this

model with the model of perfect competition in micro economics.You know that a perfectly

competitive market is the most desirable. But we know that in the real world most of the

assumptions of perfect competition will be violated, and we study what happens to price and

quantity when these assumptions are violated. In the same way we know that a data set satisfying

these assumptions is the most desirable. And if these assumptions are violated it will affect the

properties of estimators. Later we will consider what happens to the properties of the estimators if these assumptions are violated. Some assumptions are not important in terms of their impact on the properties.

So these are the assumptions of the classical linear regression model, the Gaussian linear regression model, and the method of OLS.

Mean and Variance of OLS Estimators

It is clear from the previous analysis that the OLS estimators are functions of the sample values. In the two-variable model, in deviation form:

β̂₂ = Σxᵢyᵢ / Σxᵢ²

β̂₁ = Ȳ − β̂₂X̄

So, as functions of sample values, the values of the OLS estimators vary from sample to sample, and we require a measure of the precision of the OLS estimators. The measure of precision of the OLS estimators is given by the standard error of the estimators, and the standard error is nothing but the standard deviation of the sampling distribution of the estimator. The sampling distribution of these estimators is nothing but the probability distribution of the estimates.

The next concern is the derivation of the standard errors of the estimators β̂₁ and β̂₂. The standard error is required as a measure of the precision of the estimators; not only that, it is also required for hypothesis testing, because for hypothesis testing we use the sampling distribution, and the test statistic is derived from the sampling distribution. So we can use a test statistic based on an estimator for hypothesis testing if and only if the standard error of the estimator is known.

In this section we derive only the mean and standard error of β̂₂; the mean and standard error of β̂₁ are simply stated.

Mean and SE of β̂₂

We have shown that β̂₂ = Σxᵢyᵢ/Σxᵢ²; this can be written as

β̂₂ = ΣxᵢYᵢ / Σxᵢ²   (as derived earlier)

This can also be written as

β̂₂ = ΣkᵢYᵢ,   where kᵢ = xᵢ / Σxᵢ²

Now, since β̂₂ = k₁Y₁ + k₂Y₂ + ... + kₙYₙ, β̂₂ is a linear function of Y₁, Y₂, ..., Yₙ with k₁, k₂, ..., kₙ serving as the weights.

So we can say that β̂₂ is a linear estimator, with kᵢ = xᵢ/Σxᵢ² serving as the weights.

The weights kᵢ are very important in econometrics.

Properties of the weights kᵢ

1. Since Xᵢ is assumed to be fixed in repeated samples, Xᵢ is non-stochastic. Since kᵢ = xᵢ/Σxᵢ², kᵢ is also non-stochastic (kᵢ, being a function of Xᵢ, is non-stochastic and remains constant from sample to sample).

2. Σkᵢ = Σ(xᵢ/Σxᵢ²) = Σxᵢ/Σxᵢ² = 0, because in the numerator Σxᵢ = Σ(Xᵢ − X̄) = 0. So Σkᵢ = 0.

3. Σkᵢ² = Σ(xᵢ/Σxᵢ²)² = Σxᵢ²/(Σxᵢ²)² = 1/Σxᵢ². So Σkᵢ² = 1/Σxᵢ².

4. Σkᵢxᵢ = ΣkᵢXᵢ, the reason being Σkᵢ(Xᵢ − X̄) = ΣkᵢXᵢ − X̄Σkᵢ = ΣkᵢXᵢ − 0 = ΣkᵢXᵢ.

ΣkᵢXᵢ = ΣxᵢXᵢ/Σxᵢ²

= Σ(Xᵢ − X̄)Xᵢ / Σ(Xᵢ − X̄)²

= (ΣXᵢ² − X̄ΣXᵢ) / (ΣXᵢ² − X̄ΣXᵢ) = 1   (the denominator expansion was derived earlier)

So Σkᵢxᵢ = ΣkᵢXᵢ = 1.
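These four properties can be verified numerically; the following added sketch (NumPy assumed, hypothetical X values) checks Σkᵢ = 0, Σkᵢ² = 1/Σxᵢ² and ΣkᵢXᵢ = Σkᵢxᵢ = 1:

# A minimal added sketch: numerical check of the properties of the weights k_i = x_i / Σx².
import numpy as np

X = np.array([2.0, 4.0, 5.0, 7.0, 9.0, 12.0])   # hypothetical regressor values
x = X - X.mean()
k = x / (x ** 2).sum()

print("sum k   =", round(k.sum(), 12))                          # 0
print("sum k^2 =", k @ k, "  1/sum x^2 =", 1 / (x ** 2).sum())  # equal
print("sum kX  =", round(k @ X, 12), "  sum kx =", round(k @ x, 12))   # both 1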

Using these properties we can derive the mean and standard error of β̂₁ and β̂₂. Since β̂₂ = ΣkᵢYᵢ, β̂₂ is a linear estimator with kᵢ serving as the weights; this is one of the small sample properties of an estimator that we discussed.

Now, we substitute for Yᵢ from the PRF:

β̂₂ = Σkᵢ(β₁ + β₂Xᵢ + uᵢ)

β̂₂ = β₁Σkᵢ + β₂ΣkᵢXᵢ + Σkᵢuᵢ

Here Σkᵢ = 0 and ΣkᵢXᵢ = 1, so

β̂₂ = β₂ + Σkᵢuᵢ, and it also follows that

β̂₂ − β₂ = Σkᵢuᵢ

Taking expectations, E(β̂₂) = β₂ + ΣkᵢE(uᵢ), and since E(uᵢ) = 0,

E(β̂₂) = β₂

This shows that β̂₂ is an unbiased estimator of β₂; it is the unbiasedness property. This unbiasedness property will be satisfied if and only if xᵢ and uᵢ are uncorrelated, and that is the second assumption, cov(Xᵢ, uᵢ) = 0.

E(β̂₂) = β₂

And similarly we can prove that

E(β̂₁) = β₁

Variance of β̂₂

Var(β̂₂) = E[β̂₂ − E(β̂₂)]²

Since E(β̂₂) = β₂,

= E[β̂₂ − β₂]²

And we have already written that β̂₂ − β₂ = Σkᵢuᵢ. So this can be rewritten as

Var(β̂₂) = E[Σkᵢuᵢ]²

When we discussed the summation operators, we expanded the term (Σxᵢ)² as x₁² + x₂² + ... + xₙ² + 2x₁x₂ + ... . By expanding this in the same way, we get

= E[k₁²u₁² + k₂²u₂² + ... + kₙ²uₙ² + 2k₁k₂u₁u₂ + ... + 2kₙ₋₁kₙuₙ₋₁uₙ]

Taking the expectation operator inside:

= k₁²E(u₁²) + k₂²E(u₂²) + ... + kₙ²E(uₙ²) + 2k₁k₂E(u₁u₂) + ... + 2kₙ₋₁kₙE(uₙ₋₁uₙ)

You can see that

E(uᵢ²) = σ²   (homoskedasticity assumption)

So = k₁²σ² + k₂²σ² + ... + kₙ²σ²

E(uᵢuⱼ) = 0   (no autocorrelation)

We use the assumptions of homoskedasticity and no autocorrelation. So the derivation of the formulas for the variance, expected value, etc. critically depends on the assumptions of the classical linear regression model, because these assumptions are used:

= σ²Σkᵢ²

Σkᵢ² = 1/Σxᵢ²

Var(β̂₂) = σ²/Σxᵢ²

The positive square root of the variance of β̂₂ is known as the standard error of β̂₂:

SE(β̂₂) = σ/√(Σxᵢ²)

This quantity is used for hypothesis testing, for the derivation of Z, t, etc.

Var(β̂₂) = σ²/Σxᵢ²  and  SE(β̂₂) = σ/√(Σxᵢ²)

It is nothing but the standard deviation of the sampling distribution of β̂₂: because β̂₂ is an estimator, it is a random variable and has a sampling distribution.

Similarly, without proof, we can show that

Var(β̂₁) = σ²ΣXᵢ² / (nΣxᵢ²)

SE(β̂₁) = σ√(ΣXᵢ²/(nΣxᵢ²))

These quantities are used as measures of the precision of the estimators; not only that, they are used for hypothesis testing.

This variance can be derived as follows:

Var(β̂₁) = σ²[1/n + X̄²/Σxᵢ²]

= σ²[(Σxᵢ² + nX̄²)/(nΣxᵢ²)]

= σ²[(ΣXᵢ² − nX̄² + nX̄²)/(nΣxᵢ²)]

Var(β̂₁) = σ²ΣXᵢ²/(nΣxᵢ²)

Considering these formulas for the variances and standard errors, you will immediately notice that Σxᵢ² is known once we have a sample, and n is also known. The unknown is σ²; σ² is the variance of uᵢ, and u is unobservable. So we cannot apply the variance formulas for hypothesis testing as they stand. What we do in practice is replace σ² with its unbiased estimator:

σ̂² = Σûᵢ²/(n − 2)

n − 2 = degrees of freedom

Σûᵢ² = sum of squared residuals

For example, s² = Σ(X − X̄)²/(n − 1); here n − 1 is the degrees of freedom.

In the context of OLS, Σûᵢ² can be derived if and only if β̂₁ and β̂₂ have been derived. So two restrictions are imposed on the data when two quantities are estimated from the data:

1. Σûᵢ = 0

2. ΣûᵢXᵢ = 0

The degrees of freedom is generally given as n minus the number of parameters estimated:

df = n − k

Degrees of freedom is an adjustment to make an estimator unbiased.

Now, σ̂² is the unbiased estimator of σ²; then E(σ̂²) = σ².

Since E[Σûᵢ²] = (n − 2)σ², the expectation of the numerator is (n − 2)σ², so if we divide by n − 2 we get

E[Σûᵢ²/(n − 2)] = [(n − 2)/(n − 2)]σ² = σ²

So division by n − 2 makes the estimator of σ², that is σ̂², an unbiased estimator.

In the equations for Var(β̂₁) and Var(β̂₂), σ² is replaced by Σûᵢ²/(n − 2).

How do we calculate Σûᵢ²?

Actually, Σûᵢ² = Σ(Yᵢ − β̂₁ − β̂₂Xᵢ)². Once β̂₁ and β̂₂ are calculated, it is easy to calculate Σûᵢ². But in practice we calculate it differently.

We know that

Yᵢ = Ŷᵢ + ûᵢ

and in deviation form

yᵢ = ŷᵢ + ûᵢ

Σyᵢ² = Σŷᵢ² + Σûᵢ² + 2Σŷᵢûᵢ

Here Σŷᵢûᵢ = 0, so

Σyᵢ² = Σŷᵢ² + Σûᵢ²

Σûᵢ² = Σyᵢ² − Σŷᵢ²

Since ŷᵢ = β̂₂xᵢ,

Σŷᵢ² = β̂₂²Σxᵢ²

So Σûᵢ² = Σyᵢ² − β̂₂²Σxᵢ²

This is the formula used to calculate Σûᵢ².

σ̂ = √(Σûᵢ²/(n − 2)) is known as the standard error of the estimate, also known as the standard error of the regression. It is nothing but the standard deviation of the Y values around the estimated regression line, and it is often considered a measure of the goodness of fit of the sample regression line to the sample data.
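Putting the formulas together, the following added Python sketch (using the same hypothetical numbers as the earlier added OLS example) computes Σûᵢ², σ̂, SE(β̂₁) and SE(β̂₂):

# A minimal added sketch: residual sum of squares and standard errors of the OLS estimators.
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])   # hypothetical sample
n = len(Y)

x, y = X - X.mean(), Y - Y.mean()
b2 = (x * y).sum() / (x ** 2).sum()             # slope estimate
b1 = Y.mean() - b2 * X.mean()                   # intercept estimate

rss = (y ** 2).sum() - b2 ** 2 * (x ** 2).sum() # Σu² = Σy² - b2²Σx²
sigma2_hat = rss / (n - 2)                      # unbiased estimator of σ²

se_b2 = np.sqrt(sigma2_hat / (x ** 2).sum())                          # SE of slope
se_b1 = np.sqrt(sigma2_hat * (X ** 2).sum() / (n * (x ** 2).sum()))   # SE of intercept

print("RSS =", round(rss, 4), " sigma_hat =", round(np.sqrt(sigma2_hat), 4))
print("SE(b1) =", round(se_b1, 4), " SE(b2) =", round(se_b2, 4))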

From the definitions of the variances and standard errors of the estimators, consider these points:

Var(β̂₂) = σ²/Σxᵢ²

Var(β̂₁) = σ²ΣXᵢ²/(nΣxᵢ²)

Consider the slope coefficient β̂₂: its variance is directly proportional to σ², so as the variance of Y increases, the variance of β̂₂ increases. It is inversely proportional to Σxᵢ², where Σxᵢ² = Σ(Xᵢ − X̄)².

If all the X values equal X̄ (there is no variability), this becomes 0 and the variance of β̂₂ becomes infinite. So if you have a sample with low variability in X, the variance of the estimator will be very high and the precision of the estimator will be low, and vice versa. If you have two data sets with different X values, always select the one with higher variability in X: as the variability in X increases, the variance of the estimator decreases and its precision increases.

Gauss Markov Theorem

In this section, we are discussing the statistical properties of least squares. We have already

discussed that properties are divided into numerical properties and statistical properties.

Numerical properties are those that hold as a result of the use of OLS regardless of how the data are generated. Statistical properties are those that hold only after making some assumptions about how the data are generated. The statistical properties of least squares are given in the form of a theorem known as the Gauss-Markov theorem. This is a basic theorem in econometrics. The

statement of the theorem is, given the assumptions of the classical linear regression model, OLS

estimators in the class of all linear unbiased estimators have minimum variance, they are BLUE.

So we consider a class of linear unbiased estimators and we will prove that in this class OLS

estimators have minimum variance, that is Gauss Markov Theorem.

As we discussed earlier, the BLUE property is a small sample property. Properties of estimators

are classified into small sample properties and large sample properties, and small sample

properties are the most desirable. If small sample properties are satisfied, then we need not

consider the large sample properties.

Let us prove the Gauss-Markov theorem with respect to β̂₂; it will be automatically applicable to β̂₁ as well. As we have written earlier:

β̂₂ = ΣkᵢYᵢ, so β̂₂ is a linear estimator ……………… (1)

We have also proved that E(β̂₂) = β₂, so β̂₂ is an unbiased estimator as well ………. (2)

Var(β̂₂) = σ²/Σxᵢ² ………………. (3)

We have derived the variance, and we have to prove that in the class of linear unbiased estimators β̂₂ has the minimum variance.

The 1st and 2nd properties are absolute, meaning these properties do not depend on other estimators. But the 3rd property is a relative property, meaning a property in relation to the variance of another estimator. So we define an alternative estimator as

β̃₂ (beta tilde 2) = ΣwᵢYᵢ, which is linear, with wᵢ serving as the weights.

Substituting for Yᵢ, it becomes

β̃₂ = Σwᵢ(β₁ + β₂Xᵢ + uᵢ)

β̃₂ = β₁Σwᵢ + β₂ΣwᵢXᵢ + Σwᵢuᵢ

E(β̃₂) = β₂ is satisfied if and only if Σwᵢ = 0 and ΣwᵢXᵢ = 1, so we assume that Σwᵢ = 0 and ΣwᵢXᵢ = 1. Then

β̃₂ = β₂ + Σwᵢuᵢ

E(β̃₂) = β₂, because E(Σwᵢuᵢ) = 0

So here we imposed two restrictions:

1. Σwᵢ = 0

2. ΣwᵢXᵢ = 1

Now we have to derive the variance of β̃₂; we can derive the variance of β̃₂ just as we derived the variance of β̂₂.

We have β̃₂ = ΣwᵢYᵢ

Then Var(β̃₂) = Var(ΣwᵢYᵢ)

= w₁²Var(Y₁) + w₂²Var(Y₂) + ... + wₙ²Var(Yₙ)

It is assumed that Var(Yᵢ) = σ², and also Cov(Yᵢ, Yⱼ) = 0; as we already discussed, Var(Yᵢ) is the same as Var(uᵢ), and Cov(Yᵢ, Yⱼ) = Cov(uᵢ, uⱼ). Using this, the covariance terms disappear, and we can write

= w₁²σ² + w₂²σ² + ... + wₙ²σ²

Var(β̃₂) = σ²Σwᵢ²

To derive this, we have used two assumptions:

1. Var(Yᵢ) = σ²

2. Cov(Yᵢ, Yⱼ) = 0

Consider that Var(β̂₂) = Var(ΣkᵢYᵢ)

= k₁²σ² + k₂²σ² + ... + kₙ²σ²

= σ²Σkᵢ² = σ²/Σxᵢ²

To write the variance like this, we have to assume that Var(Yᵢ) = σ² and Cov(Yᵢ, Yⱼ) = 0. As we stated the assumptions in terms of u, we adopted a different route earlier; if we state the assumptions in terms of Y, we can use this route as well, and both give the same result. Here we used the assumptions Var(Yᵢ) = σ² and Cov(Yᵢ, Yⱼ) = 0.

So Var(β̃₂) = σ²Σwᵢ².

Let us write wᵢ = kᵢ + (wᵢ − kᵢ). Then

Σwᵢ² = Σkᵢ² + Σ(wᵢ − kᵢ)² + 2Σkᵢ(wᵢ − kᵢ)

Consider the last term:

Σkᵢ(wᵢ − kᵢ) = Σkᵢwᵢ − Σkᵢ²

= Σxᵢwᵢ/Σxᵢ² − 1/Σxᵢ²

Σxᵢwᵢ = Σ(Xᵢ − X̄)wᵢ = ΣwᵢXᵢ − X̄Σwᵢ, and since Σwᵢ = 0, this equals ΣwᵢXᵢ = 1. So

Σkᵢ(wᵢ − kᵢ) = 1/Σxᵢ² − 1/Σxᵢ² = 0

So the last term of the above equation, Σkᵢ(wᵢ − kᵢ), is 0. Then we can write

Var(β̃₂) = σ²[Σkᵢ² + Σ(wᵢ − kᵢ)²]

= σ²Σkᵢ² + σ²Σ(wᵢ − kᵢ)²

Var(β̃₂) = σ²/Σxᵢ² + σ²Σ(wᵢ − kᵢ)²

Var(β̃₂) = Var(β̂₂) + σ²Σ(wᵢ − kᵢ)²

As you can see, Var(β̃2) = Var(β̂2) + σ²Σ(w_i − k_i)². Now, if w_i = k_i, that is, the weights w_i are the same as the k_i, the last term of the above equation disappears and

Var(β̃2) = Var(β̂2)

If w_i takes any value other than k_i (w_i < k_i or w_i > k_i), the squared quantity in the last term becomes positive, and you get the result

Var(β̃2) > Var(β̂2)

So, another way of saying this is that if there is a minimum-variance linear unbiased (BLUE) estimator of β2, it is given by the OLS estimator β̂2. This is the proof of the Gauss Markov Theorem.

The graphical presentation of the Gauss Markov Theorem

Both estimators are unbiased, and Var(β̃2) > Var(β̂2), or equivalently Var(β̂2) < Var(β̃2).
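The following is a small numerical sketch (not part of the lecture) of the result just proved: for a made-up set of X values, any alternative weights w_i that satisfy the two restrictions Σw_i = 0 and Σw_iX_i = 1 give a variance σ²Σw_i² that is never smaller than the OLS variance σ²Σk_i².

```python
# Sketch of the Gauss-Markov comparison; the X values, the perturbation and
# sigma^2 are all assumed numbers for illustration.
import numpy as np

X = np.array([80., 100., 120., 140., 160., 180., 200., 220., 240., 260.])
x = X - X.mean()                      # deviations from the mean
k = x / np.sum(x ** 2)                # OLS weights: sum(k)=0, sum(k*X)=1

# Build an arbitrary alternative linear unbiased estimator: perturb k and
# re-impose the two restrictions sum(w)=0 and sum(w*X)=1.
rng = np.random.default_rng(0)
w = k + rng.normal(scale=1e-3, size=k.size)
w = w - w.mean()                      # enforce sum(w) = 0
w = w / np.sum(w * x)                 # enforce sum(w*x) = sum(w*X) = 1

sigma2 = 1.0                          # common error variance (assumed)
print("Var(OLS)         =", sigma2 * np.sum(k ** 2))   # = sigma2 / sum(x^2)
print("Var(alternative) =", sigma2 * np.sum(w ** 2))   # always >= Var(OLS)
```

Whatever perturbation is chosen, the difference between the two variances is σ²Σ(w_i − k_i)², which is non-negative, exactly as in the proof above.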

Coefficient of Determination

● We have estimated the regression coefficients β1 and β2, and we have also derived the standard errors of the estimators.

β̂2 = Σx_iy_i / Σx_i²

Var(β̂2) = σ² / Σx_i²

β̂1 = Ȳ − β̂2X̄

Var(β̂1) = σ²ΣX_i² / (nΣx_i²)

● Now, if β̂1 and β̂2 are known, we can easily determine the sample regression line (SRL):

Ŷ_i = β̂1 + β̂2X_i

● If we know β̂1 and β̂2, then by substituting the values of X_i we can estimate Ŷ_i, and you will get the sample regression line.

● Now, our next issue is to find out how well the sample regression line fits the sample data.

● Ideally, what we expect is that all the sample points lie on the sample regression line. But in the real world, all sample points will not lie on the sample regression line: some points will lie above it and some will lie below it. That is, there will be positive residuals (û) and negative residuals (−û).

● Now, as a measure of goodness of fit, goodness of fit means how good is the fit of the

sample regression line to the sample data, we use a measure known as coefficient of

determination.

● The coefficient of determination is the measure used to assess the fit of the sample regression line to the sample data. For the simple (two-variable) linear regression model it is denoted r², and when we go beyond the two-variable model, in the general k-variable model, we call it the multiple coefficient of determination and denote it R².

● The coefficient of determination may also be written as r²_{y,x}.

● Now, in order to see what is meant by the coefficient of determination, first consider an informal way of looking at r². Suppose that we use a device known as the Venn diagram, also known as the Ballentine; suppose that the circle Y represents the variation in Y and the circle X represents the variation in X.

● The terms variance and variation are two different things. When we talk about variation, the question is: variation from what? There must be a base, and the base we usually refer to is the mean. So the variation is Σ(Y_i − Ȳ)², while the variance is Σ(Y_i − Ȳ)²/(n − 1).

● So, variation is not variance. Now suppose that the circle Y represents the variation in Y and the circle X represents the variation in X; there is no overlap in the following diagram.

● But there is overlap in the following diagram

The overlap increases in the following diagrams.

● Now r² is a measure of this overlap. In the first case, as the overlap is zero, r² = 0; in the last case, as the overlap is complete, r² = 1; in the fourth case, 0 < r² < 1. This is the informal definition of the coefficient of determination.

● To derive r² we proceed as follows.

● We know that Y_i = Ŷ_i + û_i, and in deviation form y_i = ŷ_i + û_i. This is the two-variable model in terms of the original observations and in deviation form. Squaring and summing,

Σy_i² = Σ(ŷ_i + û_i)² = Σŷ_i² + Σû_i² + 2Σŷ_iû_i

Since Σŷ_iû_i = 0, this reduces to

Σy_i² = Σŷ_i² + Σû_i²

This can also be written as

Σy_i² = β̂2²Σx_i² + Σû_i²

because, as we derived earlier, ŷ_i = β̂2x_i.

● Now, let’s see the components in the form of a graph.

● Now, let’s define various sum of squares. The first one is

(1) Σy_i² = Σ(Y_i − Ȳ)² = TSS (Total Sum of Squares)

(2) Σŷ_i² = Σ(Ŷ_i − Ȳ)², since we have proved that the mean of the fitted values equals the mean of the actual Y. We call Σ(Ŷ_i − Ȳ)² the ESS (Explained Sum of Squares): the part of the variation in Y that is explained by the sample regression line.

(3) Σû_i² is the part of the variation in Y not explained by the sample regression line. We call it the RSS (Residual Sum of Squares).

● So the various sums of squares can be arranged as

TSS = ESS + RSS

Now, dividing throughout by TSS,

1 = ESS/TSS + RSS/TSS

1 = Σŷ_i²/Σy_i² + Σû_i²/Σy_i²

1 = β̂2²Σx_i²/Σy_i² + Σû_i²/Σy_i²

Now, we define r² as

r² = ESS/TSS

So 1 = r² + RSS/TSS, and r² can also be written as

r² = 1 − RSS/TSS = 1 − Σû_i²/Σy_i²

This quantity is known as the coefficient of determination. The coefficient of determination is often used as a summary measure of goodness of fit, that is, of the fit of the sample regression line to the sample data. r² is defined as the proportion (or fraction) of the sample variation in Y explained by X. If you multiply r² by 100 (r² × 100), it becomes the percentage of the variation in Y explained by X. For example, if r² = 0.6, then 0.6 × 100 = 60%, and the interpretation is that 60% of the variation in Y is explained by X.

● For computational purposes, the formula used is

r² = β̂2² (Σx_i² / Σy_i²)

This can also be written as

r² = β̂2² (S_x² / S_y²)

● The value of r² always lies between 0 and 1.

● If Σû_i² = 0, meaning all the sample points lie on the sample regression line, then

r² = 1 − 0/Σy_i² = 1

that is, the entire variation in Y is explained by X. On the other hand, if no part of the variation in Y is explained by X, then Σû_i² = Σy_i², so Σû_i²/Σy_i² = 1 and

r² = 1 − 1 = 0
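As an illustration, here is a short sketch of the TSS = ESS + RSS decomposition and of r² computed both ways. The X and Y values are assumed numbers, loosely patterned on the income-consumption example used later in these notes; they are not data given in the lecture.

```python
# Sketch: compute r^2 from the sum-of-squares decomposition (assumed data).
import numpy as np

X = np.array([80., 100., 120., 140., 160., 180., 200., 220., 240., 260.])
Y = np.array([70.,  65.,  90.,  95., 110., 115., 120., 140., 155., 150.])

x, y = X - X.mean(), Y - Y.mean()
b2 = np.sum(x * y) / np.sum(x ** 2)       # slope estimate
b1 = Y.mean() - b2 * X.mean()             # intercept estimate
u_hat = Y - (b1 + b2 * X)                 # residuals

TSS = np.sum(y ** 2)
ESS = b2 ** 2 * np.sum(x ** 2)
RSS = np.sum(u_hat ** 2)

print("TSS = ESS + RSS ?", np.isclose(TSS, ESS + RSS))
print("r^2 =", ESS / TSS, " = 1 - RSS/TSS =", 1 - RSS / TSS)
```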

● Since r² is a squared quantity, it cannot be negative; its range is between 0 and 1, where 0 means the regression line has no fit at all and 1 means a perfect fit. In practice, r² will typically lie strictly between 0 and 1.

● If the value of r² is 0.8, 0.9, etc., we will say that the fit of the SRL is very high; if it is 0.3, 0.2, etc., the fit of the SRL is comparatively low.

● As we will see later, in cross-sectional data r² will generally be low due to the heterogeneity in the data, while in time-series data r² will be very high.

● And always remember that r² is a measure of the fit of the SRL to the sample data.

● As you will see later, r² is not a very important quantity in econometrics.

● Now, a quantity which is closely related to, but conceptually very different from, r² is the sample coefficient of correlation r:

r = Σx_iy_i / √(Σx_i² Σy_i²)

● Remember the properties of coefficient of correlation

(1) − 1 ≤ 𝑟 ≤+ 1

(2) r is symmetric. Symmetric means that the r between X and Y is the same as the r between Y and X, that is, r_{xy} = r_{yx}.

(3) r is independent of origin and scale, which means Corr(X, Y) = Corr(aX + b, cY + d) for a, c > 0; whether we change the origin or the scale, the correlation is not affected. Correlation is a unitless number.

(4) If two variables X and Y are independent, then r = 0, but zero correlation does not mean that X and Y are independent. This is because correlation is a measure of linear association only: if Y = X², the two variables are related, but r = 0, because they are non-linearly related. So even if the correlation is 0, we cannot say that the two variables are independent; but if two variables are independent, then r = 0.

There is an exception to this rule: if X and Y are normally distributed random variables, zero correlation does imply statistical independence. This is not true in all cases; it holds only for normally distributed random variables. And finally,

(5) Correlation need not imply causation. A classical example of this is spurious correlation. The concept was first introduced by Yule in statistics and was later introduced into econometrics in the context of time series. The point is that even if two variables are highly correlated, we cannot say that one causes the other; there need not be any causation between the two variables. In fact, even if the two variables are not related at all, you may still obtain a high correlation. Such a possibility is called spurious correlation.

● Now, if correlation is represented by r and measures the linear association between two variables, r² is conceptually different: it is the proportion of the variation in Y explained by X. Numerically, however, r = ±√r², and the sense in which r² is the square of r will be considered later.

● Suppose that r² = 0.81; then r = ±0.9. The problem is that r can be positive or negative, whereas r² is positive only. So is r equal to +0.9 or −0.9? To see that, you must know the relation between X and Y.

Y = f(X)

Suppose consumption is a function of income; then the slope coefficient is positive, and if the slope coefficient is positive, r is positive. Suppose instead that Q = f(P), quantity demanded as a function of price; then the slope coefficient is negative, so r is negative.

● So, in order to determine what will be the sign of r, you just consider the sign of the slope

coefficient in the relation between Y and X.

● So, r can be positive or negative, but r² is always positive. And always remember that r² is the square of r, even though the meanings of the two terms are different. The sense in which r² is the square of r will be discussed and proved in the context of the multiple linear regression model.
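A quick numerical check of this point, using made-up data assumed for illustration: in the two-variable model, the r² from the regression equals the square of the sample correlation, and the sign of r follows the sign of the slope.

```python
# Sketch: r^2 from the regression equals r*r; the data are assumed numbers.
import numpy as np

X = np.array([1., 2., 3., 4., 5., 6.])
Y = np.array([9., 7.5, 6., 5.5, 3., 2.])          # downward-sloping relation

x, y = X - X.mean(), Y - Y.mean()
b2 = np.sum(x * y) / np.sum(x ** 2)               # slope (negative here)
r = np.sum(x * y) / np.sqrt(np.sum(x ** 2) * np.sum(y ** 2))
r2 = b2 ** 2 * np.sum(x ** 2) / np.sum(y ** 2)    # computational formula

print("slope:", b2, " r:", r, " r^2:", r2, " r^2 == r*r:", np.isclose(r2, r ** 2))
```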

Normality Assumptions

● The classical theory of statistical inference has two branches, or is divided into

(1) Estimation and

(2) Hypothesis Testing

● Using the method of ordinary least squares we have estimated the parameters.

β1 is estimated as β̂1

β2 is estimated as β̂2

σ² is estimated as σ̂²

● Under the assumptions of the classical linear regression model, these estimators possess many desirable statistical properties, summarized in the Gauss Markov theorem: the BLUE property, a small sample property.

● So, provided that the assumptions of the classical linear regression model are satisfied, β̂1 and β̂2 are best linear unbiased estimators. But we also know that these are only estimators: as estimators, their values vary from sample to sample, or, stated differently, they are random variables. A measure of the precision of these estimators is given by their standard errors (SE). We have derived the standard errors of β̂1 and β̂2. The standard error is used as a measure of the precision of the estimator; not only that, standard errors are used for hypothesis testing.

● Estimation is only one part; hypothesis testing is the other part.

● Suppose that in the model 𝑌𝑖 = β1 + β2𝑋𝑖 + 𝑢𝑖, where 𝑌𝑖 is consumption and 𝑋𝑖 is

income .

● Suppose that we obtain β̂2 = 0.7, the estimated MPC. Now, if we take another sample, you will get another value for β̂2; with a third sample you will get a third value, and we do not know which β̂2 best approximates the parameter β2. Because of sampling fluctuations β̂2 varies from sample to sample, and we do not know which β̂2 will approximate the true population parameter β2.

● Our aim is not merely to estimate β̂1 and β̂2, but to use these estimators to make inferences about the population parameters. So the question is: how do we use the estimators to make inferences about the parameters?

● Suppose that we want to make an inference about µ. Using a sample we compute X̄; X̄ is an estimator and is used to make inferences about the parameter µ.

● Once X̄ is estimated from the sample, using the procedure of hypothesis testing, the Z statistic or the t statistic, we can test hypotheses about the population parameter:

Z = (X̄ − µ)/(σ/√n)

t = (X̄ − µ)/(S/√n)

● That branch is known as hypothesis testing. Statisticians have developed rules and procedures for testing hypotheses: steps in formulating hypotheses, steps in testing them, and so on.

● The question is how do we use these estimators for hypothesis testing?

● We know that estimators are random variables, so, naturally, estimators have sampling distributions. We can use estimators for hypothesis testing if and only if the sampling distribution of the estimator is known. For example, in the case of X̄: X follows a normal distribution with mean µ and variance σ², and X̄ follows a normal distribution with mean µ and variance σ²/n:

X ∼ N(µ, σ²)

X̄ ∼ N(µ, σ²/n)

● So the sampling distribution of X̄ is known, the test statistic Z or t can be constructed, and Z or t can be used for hypothesis testing.

● X̄ ∼ N(µ, σ²/n) is based on the assumption that the observations are iid, or simply that the sample is taken using the random sampling technique, and that X is normal. Even if X is not normal, we can use the central limit theorem, provided that n is sufficiently large (say greater than 30).

● Similarly, in the case of the OLS estimators, we have suggested that

β̂1 ∼ (β1, σ²ΣX_i²/(nΣx_i²))

and similarly

β̂2 ∼ (β2, σ²/Σx_i²)

● In both cases, the nature of the distribution is not specified. In the case of the arithmetic mean, the mean is normally distributed if X is normal; but even if X is not normal, X̄ is approximately normal with (µ, σ²/n) provided that the sample size is sufficiently large.

So that we can construct the test statistics Z and t. In the case of OLS we have proved that the estimators are unbiased and we have derived their variances, and we have also shown that the estimators are BLUE. But the estimators are random variables, so naturally they have sampling distributions; the nature of those sampling distributions has not been discussed so far.

● But to use these estimators to make inferences about the population parameters, we have to specify the sampling distributions of β̂1 and β̂2; without sampling distributions we cannot use the estimators to make inferences about the population parameters. Now let us derive the sampling distributions of β̂1 and β̂2.

● Now, we have stated that

β̂2 = Σx_iy_i / Σx_i² = Σk_iY_i

β̂2 = β2 + Σk_iu_i

● This step we have already derived. So, naturally, the distribution of β̂2 depends only on u_i, because β2 is a constant and the k_i are constants:

k_i = x_i / Σx_i²

● So, β̂2 depends on the u_i, and we can also show that β̂1 depends on the u_i; we will prove this only with respect to β̂2.

● Since the distribution of β̂2 depends on the u_i, we have to state assumptions about how the u_i are distributed; only then can we determine the distribution of β̂2.

● So, in the regression context, we assume that 𝑢𝑖 is normally distributed. That is the

stochastic error term 𝑢𝑖 is normally distributed, and when we add the normality

assumption, Classical Linear Regression Model(CLRM) becomes Classical Normal

Linear Regression Model(CNLRM)

● Now we assume that each u_i is normally distributed with

E(u_i) = 0

Var(u_i) = E[u_i − E(u_i)]² = σ²

Cov(u_i, u_j) = E{[u_i − E(u_i)][u_j − E(u_j)]} = 0

We can write this compactly as

u_i ∼ N(0, σ²)

This is the normality assumption.

● So, in regression analysis, when we talk about the normality assumption, we are talking about the distribution of u_i. Since each u_i is normally distributed and the covariances are assumed to be zero, we can say that each u_i is normally and independently distributed with mean 0 and variance σ²:

u_i ∼ NID(0, σ²)

● For 2 normally distributed random variables 0 covariance means independence.

● Now the question is: why do we assume that u_i is normal?

● The celebrated central limit theorem says that if there are a large number of independently and identically distributed random variables, then, with a few exceptions, their sum will also be normally distributed. A variant of the central limit theorem says that even if the number of random variables is not very large, or even if they are not strictly independently distributed, their sum still tends to be normally distributed.

● We know that u in the linear regression model in econometrics is the sum of a large number of random variables; for example, in the consumption-income example, it captures wealth, the number of family members, and many other omitted variables. As u is the sum of a large number of identically and independently distributed variables, u is normally distributed. And if u is normally distributed, β̂1 and β̂2 are normally distributed. The reason is a theorem in statistics which says that a linear combination of normally distributed random variables is itself normally distributed. We have seen that

β̂2 = β2 + Σk_iu_i

● β̂2 is a linear combination of the u_i, with the k_i serving as the weights. As β̂2 is a linear combination of the u_i, β̂2 is normally distributed, and similarly β̂1. So we can write

β̂1 ∼ N(β1, σ²ΣX_i²/(nΣx_i²))

and similarly

β̂2 ∼ N(β2, σ²/Σx_i²)

● When we study X̄, we simply assume that if X, that is the population, is normal, then the distribution of X̄ is also normal; that has been proved by statisticians. In the context of OLS, we use the normality assumption about u to prove that the estimators are normal. The normality assumption is about u and the estimators; it says nothing about the distribution of X or Y.

● Then why the normality assumption? Because u is the sum of a large number of identically and independently distributed random variables, and if a phenomenon is the sum of a large number of factors, that phenomenon will be normally distributed. Height is normally distributed, weight is normally distributed, because height, weight and so on are the outcome of a large number of factors; such variables will naturally be normally distributed. And if u is normal, the OLS estimators are normally distributed, and if the OLS estimators are normal, we can use the normal distribution for hypothesis testing.

● Since β̂1 ∼ N(β1, σ²ΣX_i²/(nΣx_i²)),

Z = (β̂1 − β1)/SE(β̂1) ∼ N(0, 1)

Similarly, β̂2 ∼ N(β2, σ²/Σx_i²) is converted to

Z = (β̂2 − β2)/SE(β̂2) ∼ N(0, 1)

● This is the standard normal variable.

● β̂1 ∼ N(β1, σ²ΣX_i²/(nΣx_i²)) is the sampling distribution of β̂1, and β̂2 ∼ N(β2, σ²/Σx_i²) is the sampling distribution of β̂2. The estimators are not directly used for hypothesis testing; using the estimators we derive the test statistics. The test statistic here is Z.
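As a sketch of this sampling-distribution idea, the following simulation (every parameter value is an assumed number, not from the lecture) draws many samples with normal errors and checks that (β̂2 − β2)/SE(β̂2) behaves like a standard normal Z when σ is known.

```python
# Sketch: simulated sampling distribution of the OLS slope (assumed values).
import numpy as np

rng = np.random.default_rng(1)
beta1, beta2, sigma = 2.0, 0.5, 1.0
X = np.linspace(1, 20, 20)
x = X - X.mean()
se_b2 = sigma / np.sqrt(np.sum(x ** 2))           # true SE, sigma known

z_vals = []
for _ in range(5000):
    u = rng.normal(0, sigma, size=X.size)         # normal errors
    Y = beta1 + beta2 * X + u
    b2 = np.sum(x * (Y - Y.mean())) / np.sum(x ** 2)
    z_vals.append((b2 - beta2) / se_b2)

z_vals = np.array(z_vals)
print("mean close to 0:", z_vals.mean())
print("std  close to 1:", z_vals.std())
```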

● So, in the form of a graph

● The sampling distribution of β̂1 is converted into Z; similarly we can convert β̂2 into Z.

● In hypothesis testing, we will use the Z distribution. The sampling distribution itself can be used for hypothesis testing, but in general we will not use it directly; we derive the test statistic Z, and that distribution is used for hypothesis testing, that is, to test hypotheses about the regression coefficients β1 and β2 using the estimators β̂1 and β̂2.

● Remember, this Z conversion is possible if and only if the standard errors are known. In reality, the standard errors will be unknown, so we go for the t distribution; that is the point we considered earlier. Now, for our future use, note the quantity

(n − 2) σ̂²/σ² ∼ χ²(n − 2) df

In the context of the χ² distribution, we have shown that

(n − 1) S²/σ² ∼ χ²(n − 1) df

That was in the context of testing the significance of σ².

● In statistics, we use the notation S² for the estimator of σ² and X̄ for the estimator of µ.

● In econometrics, we write σ̂² for the estimator of σ² and µ̂ for the estimator of µ.

● The quantity (n − 2) σ̂²/σ² ∼ χ²(n − 2) df is the one used to derive the t distribution.

● We can also show that β̂1 and β̂2 are distributed independently of σ̂², because that is the basic requirement for the t distribution: if Z1 is standard normal and Z2 is χ² with k degrees of freedom and is distributed independently of Z1, then the ratio Z1/√(Z2/k) has the t distribution. We simply assume that the estimators are independent of σ̂²; to prove it, matrix algebra is required. So, this is the normality assumption.

● The normality assumption is very important in econometrics: only if we can assume that u is normally distributed can we use the estimators for hypothesis testing. But there is another point. The normality assumption matters mainly when the sample size is small; if the sample size is very large, it is not very important, because, in general, as the sample size increases indefinitely the estimators become (approximately) normally distributed. In the case of X̄, if n > 30, whatever the distribution of the population, X̄ is approximately normally distributed. We can say the same in the context of the OLS estimators: if you have a large sample, say n > 100, the normality assumption is not required, because whether or not u is normal, or even if you are not sure about the distribution of u, β̂1 and β̂2 will be approximately normally distributed. So you can simply use the corresponding Z statistic or, if the variances are unknown, the t statistic for hypothesis testing. Adding the normality assumption gives us the Classical Normal Linear Regression Model; otherwise it is simply the Classical Linear Regression Model.

Interval Estimation Vs Point Estimation

The classical theory of statistical inference has two branches, i.e., estimation and hypothesis testing. Estimation is divided into two:

● Point estimation

● Interval estimation

OLS estimators are point estimators. When we consider hypothesis testing there are two approaches: the confidence interval approach and the test of significance approach. The confidence interval approach is based on interval estimation, and the test of significance approach is based on point estimation.

Difference between point and interval estimation

A random variable X, say height, follows a certain distribution with parameters µ and σ². The random variable is normally distributed, but we do not know the parameters; our interest is in the parameters of the distribution.

In point estimation, we assume that the random variable X follows a certain distribution and that we have a random sample of size n from this population, for example 50 observations; we use this sample to estimate the unknown parameters. In the case of point estimation the random variable X follows a certain PDF,

f(x, θ)

where θ is the parameter of the distribution; we assume, for example, that the PDF is normal.

θ̂ = f(x_1, x_2, ..., x_n)

θ̂ is the estimator, or sample statistic: a rule or formula which tells us how to estimate the unknown parameter.

Interval Estimation

Point estimation gives only a single value θ̂ for the parameter θ. The value of θ̂ varies from sample to sample due to sampling fluctuations.

In interval estimation, instead of suggesting a single value for θ (say θ̂ = 150), we suggest a range in which the parameter is likely to lie. The idea behind interval estimation is the concept of the sampling distribution of estimators: the parameter will lie between two values. We construct an interval around the point estimate by taking 2 or 3 standard errors on either side, and we say that the probability of such an interval including the parameter is 95% or 99%.

For example, X ∼ N(µ, σ²), so that X̄ ∼ N(µ, σ²/n).

We can suggest the interval X̄ ± 1.96 σ/√n.

The interval varies from one sample to another sample.

More formally, we want to find a small positive number δ and a probability 1 − α such that

P[θ̂ − δ ≤ θ ≤ θ̂ + δ] = 1 − α

or P[θ̂_1 ≤ θ ≤ θ̂_2] = 1 − α

1 − α = confidence coefficient

α = level of significance , the probability of committing type I error.

Example: suppose that the height of a population has mean µ inches, σ = 2.5, n = 100, and x̄ = 67. Then

x̄ − 1.96(σ/√n) ≤ µ ≤ x̄ + 1.96(σ/√n)

67 − 1.96(2.5/√100) ≤ µ ≤ 67 + 1.96(2.5/√100)

66.51 ≤ µ ≤ 67.49

Intervals constructed like this will include the parameter 95 out of 100 times.
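A minimal sketch of this interval in code, using the same numbers assumed in the example (σ = 2.5, n = 100, x̄ = 67):

```python
# Sketch: 95% interval for mu when sigma is known (numbers from the example).
from scipy.stats import norm

x_bar, sigma, n, alpha = 67.0, 2.5, 100, 0.05
z = norm.ppf(1 - alpha / 2)              # about 1.96
se = sigma / n ** 0.5                    # 0.25
print(f"{x_bar - z * se:.2f} <= mu <= {x_bar + z * se:.2f}")   # 66.51 <= mu <= 67.49
```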

Construction of Confidence Interval of Population Mean

First we explain the construction of confidence interval for the parameter µ using its estimator 𝑋

X ∼ N(µ, σ²)

X̄ ∼ N(µ, σ²/n)

Z = (X̄ − µ)/(σ/√n) ∼ N(0, 1)

If σ is known, we can use Z for constructing confidence intervals.

Pr[−z_{α/2} < Z < z_{α/2}] = 1 − α

Suppose that α = 0.05 (5%); then 1 − 0.05 = 0.95, or 95%: 95% of Z lies between these two limits, i.e., Z lies between −1.96 and +1.96.

−z_{α/2} and z_{α/2} are known as the critical values, and the Z in the middle is the empirical Z, i.e., Z = (x̄ − µ)/(σ/√n).

Substituting for Z,

Pr[−z_{α/2} < (X̄ − µ)/(σ/√n) < z_{α/2}] = 1 − α

Multiplying all the terms by σ/√n,

Pr[−z_{α/2}(σ/√n) < X̄ − µ < z_{α/2}(σ/√n)] = 1 − α

Subtracting X̄ from all the terms,

Pr[−z_{α/2}(σ/√n) − X̄ < −µ < z_{α/2}(σ/√n) − X̄] = 1 − α

Multiplying through by −1 (which reverses the inequalities),

Pr[X̄ + z_{α/2}(σ/√n) > µ > X̄ − z_{α/2}(σ/√n)] = 1 − α

Writing the smaller limit first,

Pr[X̄ − z_{α/2}(σ/√n) < µ < X̄ + z_{α/2}(σ/√n)] = 1 − α

µ = X̄ ± z_{α/2}(σ/√n)

If α = 0.05,

µ = X̄ ± 1.96(σ/√n)

where σ/√n is the standard error of the distribution of X̄.

In empirical applications this σ will be unknown; if σ is unknown we cannot use Z as the test statistic. Then we use the t distribution:

t = (x̄ − µ)/(s/√n) ∼ t(n − 1)

Whenever σ is replaced by its unbiased estimator s, the distribution becomes t.

To see why, let

Z1 = (X̄ − µ)/(σ/√n) ∼ N(0, 1)

Z2 = (n − 1) s²/σ² = Σ(x − x̄)²/σ² ∼ χ²(n − 1)

Then

t = Z1 / √(Z2/(n − 1)) = [(X̄ − µ)√n / σ] / √[Σ(x − x̄)² / ((n − 1)σ²)] = (X̄ − µ)/(s/√n)

Pr[−t_{α/2} < t < t_{α/2}] = 1 − α

Pr[−t_{α/2} < (X̄ − µ)/(s/√n) < t_{α/2}] = 1 − α

If α = 5%: in the case of the t distribution, the limits containing 95% of t depend on the degrees of freedom; when the degrees of freedom are different, the critical values are different.

Multiplying by s/√n, subtracting X̄ from all the terms, and multiplying through by −1, we get

Pr[X̄ − t_{α/2}(s/√n) < µ < X̄ + t_{α/2}(s/√n)] = 1 − α

µ = X̄ ± t_{α/2}(s/√n)

If the degrees of freedom are 20 and α = 5%,

µ = X̄ ± 2.086(s/√n)
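A small sketch of this t-based interval; the values of x̄ and s are assumed numbers, and n = 21 is chosen so that the degrees of freedom are 20 as in the text.

```python
# Sketch: t-based interval for mu when sigma is unknown (assumed numbers).
from scipy.stats import t

x_bar, s, n, alpha = 67.0, 2.5, 21, 0.05
t_crit = t.ppf(1 - alpha / 2, df=n - 1)          # about 2.086 for 20 df
se = s / n ** 0.5
print(f"t critical = {t_crit:.3f}")
print(f"{x_bar - t_crit * se:.2f} <= mu <= {x_bar + t_crit * se:.2f}")
```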

Construction of Confidence Interval of Regression Coefficients

Now we consider how to construct confidence intervals for the regression coefficients β1, β2.

We will explain how to construct confidence interval for β2 and generalize it for β1 or any

regression coefficients. We know that


β̂1 ∼ N(β1, σ²ΣX_i²/(nΣx_i²))

β̂2 ∼ N(β2, σ²/Σx_i²)

With the normality assumption, the OLS estimators are normally distributed. Let us consider how we construct a confidence interval for β2.

From β̂2,

Z = (β̂2 − β2) / (σ/√Σx_i²) ∼ N(0, 1)

(on the assumption that σ² is known). If σ is known, we can use Z for the construction of confidence intervals. Let us see how to construct a confidence interval for β2.

Pr[−z_{α/2} < Z < z_{α/2}] = 1 − α

Once α is fixed, the critical values are known. For example, if α = 0.05, then −z_{α/2} = −1.96 and z_{α/2} = 1.96, so that 95% of Z lies between these two values; the Z in the middle is the Z obtained from the sample, i.e., the test statistic. This can be written as

Pr[−z_{α/2} < (β̂2 − β2)/(σ/√Σx_i²) < z_{α/2}] = 1 − α

We can give any value to α, like 5%, 10%, etc.

As we did in the case of the arithmetic mean, multiply all the terms by σ/√Σx_i², subtract β̂2 from all the terms, and then multiply through by −1 (reversing the inequalities); you will get

Pr[β̂2 − z_{α/2}(σ/√Σx_i²) < β2 < β̂2 + z_{α/2}(σ/√Σx_i²)] = 1 − α

This is the confidence interval for β2 for the level of significance α. If α is 5%, this is the 95% confidence interval for β2. So we can write

β2 = β̂2 ± z_{α/2}[SE(β̂2)]

This can be generalized as

β_i = β̂_i ± z_{α/2}[SE(β̂_i)]

This is the 100(1 − α)% confidence interval for any regression coefficient β_i.

The Z distribution can be used for constructing confidence intervals only if σ is known: if σ is known and α is fixed, this confidence interval can be constructed. But, as we discussed when using the arithmetic mean to construct a confidence interval for µ, in practical applications σ will be unknown. So, as in the case of the arithmetic mean, for the OLS estimators also we use the t distribution for constructing the confidence interval:

t = (β̂2 − β2) / (σ̂/√Σx_i²) ∼ t(n − 2)

Now we are considering the case where the degrees of freedom is n-2 not n-1.

Now let us see how we actually derive this t. We know that

Z1 = (β̂2 − β2)√Σx_i² / σ ∼ N(0, 1)

Z2 = (n − 2) σ̂²/σ² = Σû_i²/σ² ∼ χ²(n − 2)

Now, substituting these values into t = Z1 / √(Z2/(n − 2)),

t = [(β̂2 − β2)√Σx_i² / σ] / √[Σû_i² / ((n − 2)σ²)] = (β̂2 − β2) / (σ̂/√Σx_i²)

This quantity follows the t distribution with n − 2 degrees of freedom. Whenever σ is replaced by its unbiased estimator σ̂, Z is converted into the t distribution. This t distribution can be used whether the sample size is large or small, because whenever σ is replaced by its estimator σ̂, the result is the t distribution (irrespective of whether the sample size is small or large).

But then why is Z sometimes used in large samples even when σ is unknown and its estimator is used? Because in large samples the Z and t distributions are not much different. In empirical applications the t distribution is the one that is widely used; if the sample size is large, Z can also be used, but, as you will see, most econometric software uses the t distribution rather than the Z distribution for hypothesis testing. Now we will use the t distribution for constructing confidence intervals.

Pr[−t_{α/2} < t < t_{α/2}] = 1 − α

where t is now the empirical quantity obtained from the sample:

Pr[−t_{α/2} < (β̂2 − β2)/(σ̂/√Σx_i²) < t_{α/2}] = 1 − α

Pr[β̂2 − t_{α/2}SE(β̂2) < β2 < β̂2 + t_{α/2}SE(β̂2)] = 1 − α

The point in this case is that the critical t value t_{α/2} depends on the degrees of freedom in addition to α, i.e., the level of significance.

Once α is fixed at 5% and the degrees of freedom are 20, the critical t value from the table or from the software is 2.086, so

−t_{α/2} = −2.086

t_{α/2} = +2.086

95% of the t values lie between these two limits, and this gives the 95% confidence interval for β2 using the t distribution:

β2 = β̂2 ± t_{α/2} SE(β̂2)

This can be generalized as

β_i = β̂_i ± t_{α/2} SE(β̂_i)

the 100(1 − α)% confidence interval for the given degrees of freedom.
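A minimal sketch of computing such an interval for a generic coefficient; the estimate, its standard error and the degrees of freedom below are assumed numbers used only for illustration.

```python
# Sketch: 100(1 - alpha)% interval beta_i = b_i +/- t_{alpha/2} * SE(b_i).
from scipy.stats import t

b_i, se_b_i, df, alpha = 0.75, 0.10, 18, 0.05     # assumed values
t_crit = t.ppf(1 - alpha / 2, df)
lo, hi = b_i - t_crit * se_b_i, b_i + t_crit * se_b_i
print(f"{100 * (1 - alpha):.0f}% CI for beta_i: [{lo:.4f}, {hi:.4f}]")
```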

There are mainly two approaches to hypothesis testing

1) Test of significance approach ( based on point estimation)

2) Confidence interval approach (based on interval estimation)

Hypothesis Testing

Formation of Null and Alternative Hypotheses

How to formulate hypotheses?

What are the various steps in the hypothesis testing procedure?

How to accept or reject a hypothesis?

Let us see how to formulate hypotheses?

There are some procedures for formulating the hypothesis

Classical statistical inference has two branches

1) Estimation

2) Hypothesis testing

Hypothesis testing is required because, as we discussed earlier, we have only samples, and using the samples we have to make inferences about the parameters. If the entire population were known to us, we could directly compute the population parameters and there would be no need for hypothesis testing; but this is not practically possible. Now let us see how to formulate hypotheses. Hypothesis testing is the process of determining whether or not a given hypothesis is true. The first step in hypothesis testing is formulating the hypotheses by specifying what is known as the null hypothesis.

Null Hypothesis(H0)

A null hypothesis is an assertion about the value of a population parameter. It is the assertion we hold to be true until we have sufficient statistical evidence to conclude otherwise. As an example, a null hypothesis may assert that the population parameter µ = 100:

H0: µ = 100

This is the assertion we treat as true until we have sufficient statistical evidence to conclude that µ ≠ 100.

For example , the average height of a population will be

𝐻0 :µ = 160 cm

This is the null hypothesis

There will also be an alternative hypothesis. The alternative hypothesis is the negation of the null hypothesis; for example, it may be written as

H1: µ ≠ 100

It is sometimes denoted H_a.

Even though the formulation of H0 and H1 seems very simple, determining H0 in a given situation can be somewhat complicated. It is important to be clear about what the null hypothesis H0 is in a particular situation, otherwise the entire process of hypothesis testing is invalid. Now let us see how to formulate H0

and H1 using a concrete example to clarify some points. As an example to this, suppose that a

company selling coca cola is selling bottles with 2 liters, we expect that on an average each

bottle will contain 2 liters of cola , due to some issues some bottles may contain a little more than

2 liters, some bottles may contain a little less than 2 liters. But on an average there will be 2 liters

of cola in each bottle. Suppose that a consumer suspects that the company is cheating the

consumers by filling less than 2 liters of cola. Then in the first case what is the null hypothesis?

and what is the alternate hypothesis? . As we discussed earlier, the null hypothesis is an assertion

about population parameters. It is an assertion we call true until we have sample evidence to

conclude that it is not true. So the company producing and selling cola is a reputed company , so

we expect that the company is doing the business honestly. So there is no need to believe

otherwise. So the null hypothesis is

𝐻0 : µ ≥ 2

This the null hypothesis and it is tested against Alternate hypothesis

𝐻1:µ< 2

The consumer advocate's suspicion has substance only if the sample evidence supports H1: µ < 2.

As this example shows, H0 is often a claim made by someone, while H1 is the suspicion someone has about that claim.

Remember that we will go for new research if and only if we have some doubt about the status

quo. If we have no doubt about the status quo, there is no need for further research. The purpose

of research is to find something new or something different. In this case, you suspect that the claim made by the company is not true, so you design an instrument, collect the data or samples and, using the samples, test the hypothesis. If the null hypothesis is rejected, you can conclude that the claim made by the company is false. But

variations are possible. Suppose that the company is not making any claim; it is simply filling all bottles with 2 liters of cola, and suppose that we want to demonstrate that, on average, the company is selling less than 2 liters. As an independent researcher you want to prove that the company is selling less than 2 liters of cola. In this case,

H1= what we want to demonstrate

𝐻0 : µ ≥ 2

Now you have to demonstrate that the company is on average selling less than 2 liters. Now, one way to distinguish H0 and H1 is:

If H0 is accepted = No corrective action is required

If H1 is accepted = Corrective action is required.

This is an easy way to distinguish H0 and H1

In this case if you are a consumer advocate or an independent researcher if H0 is accepted the

company is selling on an average 2 liters of cola no corrective action is required, suppose that if

H0 is rejected , on an average the company is selling less than 2 liters of cola .Thus corrective

action is required,

Suppose the consumer advocate has no suspicion about the company but the owner has a doubt

whether the machines fill on an average more than 2 liters. If the machine is filling more than 2

liters , it will be a loss as far as the owner is concerned. Now from the owners point of view we

write

𝐻0: µ ≤ 2

𝐻1:µ > 2

So from the owner's perspective, corrective action is required if the bottles are being filled with more than 2 liters, i.e., when the alternative hypothesis is accepted.

Third perspective:-

From the machine operator’s perspective

H0 : μ = 2 liters

H1 : μ ≠ 2 liters

These are the 3 principles to remember whenever we formulate hypotheses.

If you are an independent researcher, doing an academic research, for example:- in your field
of research such as Economics, Sociology, Psychology etc. your research hypothesis is
always H1

What you want to demonstrate is H1

H0 is Status quo

If H0 is accepted, nothing new is discovered from the sample.

On the other hand, if H0 is rejected, you have statistical evidence against H0 and you collect
data to estimate the parameters etc to demonstrate something new. So, the researcher’s
hypothesis is H1 whatever is your field of study: Economics, Sociology, Psychology etc.

This is true whether you are testing the significance of a single parameter, the difference between two parameters (say µ1 and µ2), or the difference between many groups as in ANOVA; whatever the context, the null hypothesis is the status-quo (no-difference) hypothesis.

We say that the test statistic is statistically significant if and only if H0 is rejected. Significant means that something has been discovered from the sample; since the test statistic is based on the sample, we say that the test statistic is significant.

Remember one more point: whenever we go for research, always formulate the hypotheses first. If we formulate a hypothesis after observing the results, we can always reject H0, or formulate H0 in such a way that it will be rejected.

So, never go for such a practice. The correct practice is to formulate H0 & H1 first.

If we formulate hypotheses after observing the results, it is what is known as Data Snooping.

After formulating the hypothesis, collect the suitable data; primary or secondary, time series or
panel, depending on your research question.

Formulate H1 , then estimate parameters and using the procedures of hypothesis testing, test the
significance of the population parameters. This is the first step, the formulation of null and
alternative hypotheses.

Steps in Hypothesis testing

Estimation and hypothesis testing are the twin branches of statistical inference.

Estimation means estimation of the parameters. Using a sample we estimate the parameters using

the sample statistics. The second part is hypothesis testing.

How to test hypotheses about the population parameters using sample statistics.

Now let us see what are the steps of hypothesis testing.

Steps in hypothesis testing

1. Formulation of null and alternative hypothesis.

Formulation of 𝐻0and 𝐻1

Formulate the hypotheses before beginning any research; otherwise the entire procedure is invalid.

Now let us formulate the problem of hypothesis testing more formally as follows.

Suppose we have a random variable X with PDF f(X, θ), where θ is the parameter of the distribution. We observe a random sample of size n from this population and, using it, we obtain a point estimator θ̂ of the parameter θ.

Since the parameter θ is unknown, we ask: is the estimator θ̂ compatible with some hypothesized value θ = θ*, where θ* is a specific value of θ and "compatible" means sufficiently close to?

Stated differently, could our sample have come from the PDF f(X; θ = θ*)? So the question is whether the sample was generated from a population with θ = θ*. In the language of hypothesis testing, θ = θ* is known as the null hypothesis and is written H0. It is tested against the alternative hypothesis:

H0: θ = θ*

H1: θ ≠ θ*

Now, H0 and H1 can be simple or composite. A hypothesis is an assertion about a population parameter; it is simple if it specifies a specific value of the parameter, and composite if the specific value is not specified.

As an example, suppose that X ∼ N(µ, σ²).

Suppose that our H0 is µ = µ*, σ² = σ*². This is a simple hypothesis.

Suppose instead that our H0 is µ = µ* and σ² > σ*². This is a composite hypothesis, because the value of σ² is not specified.

In econometrics we very often test the hypothesis θ = 0. Instead of specifying a particular value, we often test a hypothesis like this; it is known as the zero null hypothesis. But it is not the only hypothesis we can test; we can give any value to θ.

When you read software output, always remember that the reported tests are based on this zero null hypothesis. If you want to test other hypotheses, you have to use the options available in all the software packages for testing hypotheses other than zero.

Now usually 𝐻0 is simple.

H0: θ = θ*. But H1, the alternative hypothesis, can take any of these forms:

H1: θ > θ*

H1: θ < θ*

H1: θ ≠ θ*

And remember that all these alternative hypotheses are composite in nature. And which

hypothesis we will use depends on the problem at hand.

Example: suppose that X ∼ N(µ, σ²) and that we want to test a hypothesis about µ only, say µ = 70, tested against µ ≠ 70. The form in which H1 is formulated determines the critical region: the location of the critical region depends on how we formulate H1.

Now to test a hypothesis we divide the entire population into 2.

● One is the acceptance region.

● Other one is the critical region

In hypothesis testing we use a test statistic; suppose we use X̄ to test a hypothesis about µ. Estimators are not directly used for hypothesis testing; from the estimators we construct a test statistic, say

Z = (X̄ − µ)/(σ/√n)

X ∼ N(µ, σ²), so X̄ ∼ N(µ, σ²/n). This means that if X has mean µ, say µ = 100, then E(X̄) = 100, and when X̄ = µ the value of Z is 0.

My hypothesis is H0: µ = µ*, where µ* is a specific value for the mean of X; it is also the mean of X̄. Suppose the specified value is 100.

Now the entire range of the test statistic is divided into two parts. The values that are highly likely under H0 form the acceptance region, say 95%. The 2.5% in each tail of the graph is known as the critical region: values of the statistic that have a very low probability of being observed if H0 is true. The acceptance region consists of values that have a high probability of being observed given that H0: µ = 100.

So the entire distribution of the test statistic is divided into what is highly likely under H0 and what is highly unlikely under H0.

If H1 is of the "not equal to" form, the critical region is in both tails, left and right. If H1 is of the ">" form, the entire critical region is in the right tail only; otherwise it is in the left tail.

So whether you have a two tailed test, a right tailed and left tailed test depends on how you

formulate 𝐻1, the research hypothesis/ alternative hypothesis.

Now in the empirical applications the critical region is often fixed as 5%, and this is also known

as the level of significance α. .

In econometrics the highly unlikely region is often fixed at 5%, it is not only the value which we

use. In medical sciences it is fixed at 1%. In some cases we can go 15% etc. depending on

circumstances.

The second step in hypothesis testing is to choose the level of significance of the test.

So the point is our decision to accept 𝐻0or 𝐻1 is based on sample information. Due to sampling

fluctuations the conclusions will not always be correct. In making decisions about accepting or rejecting hypotheses, we are liable to commit two types of errors.

● Type 1 error. That is reject 𝐻0 when 𝐻0true.

● Type 2 error: accept 𝐻0 when 𝐻0 false.

Decision          H0 true          H0 false
Reject            Type I error     No error
Do not reject     No error         Type II error

We cannot minimize the two types of errors simultaneously: when we reduce the Type I error, the Type II error increases, and vice versa.
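A small sketch of this trade-off, assuming a one-sided test of H0: µ = 0 against a true mean of 1, with σ = 1 and n = 25 (all numbers assumed): as α is made smaller, β, the probability of a Type II error, rises.

```python
# Sketch of the Type I / Type II trade-off for a one-sided z test (assumed values).
from scipy.stats import norm

sigma, n, mu1 = 1.0, 25, 1.0
se = sigma / n ** 0.5
for alpha in [0.10, 0.05, 0.01]:
    crit = norm.ppf(1 - alpha) * se           # reject H0 if x_bar > crit
    beta = norm.cdf(crit, loc=mu1, scale=se)  # P(accept H0 | H1 is true)
    print(f"alpha = {alpha:.2f}  ->  beta = {beta:.4f}, power = {1 - beta:.4f}")
```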

Example: we know that in the legal proceedings police arrest a criminal, collect evidence present

the accused before the court. The court after considering the evidence suggested by the police

take a decision regarding 𝐻0 or 𝐻1.

Now in this example the 𝐻0: The accused is innocent.

Then 𝐻1: the accused has committed the crime.

In a democratic country, we cannot judge a person to be a criminal only because he has been arrested by the police. The police arrest a suspect and then collect the evidence; in our case the evidence is the sample. Suppose that on the basis of the evidence, the sample, we cannot reject H0. That means the accused is innocent: the sample evidence is not strong enough to reject H0, or nothing new has been deduced from the sample evidence.

Now what is the Type 1 error here?

● Rejecting H0 when H0 is true means the accused is innocent but we reject his innocence: an innocent person goes to jail.

● The second case is that H0 is false but we do not reject it: the accused is not innocent, he has committed the crime, but he goes free.

Which one is more serious?

The classical approach to hypothesis testing procedure is Type 1 error is considered to be more

serious.

So what we do is we fix the type 1 error at 5%, 10%,1% etc and try to minimize Type 2 error.

And the probability of Type 1 error is known as α. It is known as the level of significance.

The probability of Type 2 error is β.

The probability of not committing Type 2 error is 1-β. It is known as the power of the test.

Even though in this example Type 1 error is more serious, variations are possible.

Example: steel is used for constructing houses and also for fencing, and steel of a particular strength is required for housing.

H0: The steel has the required strength of 1000.

H1: The strength of the steel is less than 1000.

What is the cost of a Type I error? Only the cost of producing the steel. What is the cost of a Type II error? A Type II error means accepting H0 when it is false, that is, using steel that has less than the required strength.

Now the question is the cost of Type 1 and Type 2 error depends on the purpose for which it is

used. If the steel is used for constructing a house and it lacks the required strength, the house will collapse and there will be loss of life; this is highly costly.

Type 2 error is costly compared to Type 1 error. Because Type 1 error is only a cost of

production.

But suppose that it is used for fencing only, here Type 2 error is not very important because even

if the fence collapses its consequence will be very low.

So whether Type 1 or Type 2 error is more serious depends on the issue at the hand.

In general Type 1 error is more serious.

3. The third step is the location of the critical region.

We use test statistics like z, t etc for hypothesis testing

And the entire distribution is divided into 2, highly likely under 𝐻0, highly unlikely under 𝐻0.

Highly likely is the acceptance region, highly unlikely is the critical region.

Now the critical region may be the right side only , left side only or both side, depending on how

you formulate :

● If H1: θ ≠ θ* is your alternative hypothesis, it is a two-tailed critical region.

● If H1: θ > θ*, the critical region is in the right tail of the test statistic.

● If H1: θ < θ*, the critical region is in the left tail.

So whether the critical region in the right tail, left tail or both tails depends on how we

formulate the alternative hypothesis.

4. The fourth step is to choose appropriate test statistics.

To test hypotheses we have to use test statistics. If you want to test the significance of a mean, we can use z or t.

● If the population variance is known Whatever the sample size you can use z.

● If population variance is unknown and the sample size is less than 30, we will use t test.

● To test the significance of variance we use the chi square test.

● If you want to test the difference between two means (µ1 − µ2), z or t can again be used; to test the equality of more than two means, as in ANOVA, we use the F statistic.

● While discussing the normal distribution and the distributions related to it, we saw that the chi-square distribution is used to derive the sampling distributions of squared quantities, the F distribution, and related results.

We can use the appropriate test statistics depending on the context.

5. Compute the sample value of the test statistic.

Suppose that Z is our test statistic and we are testing a hypothesis about a population parameter µ. Then

Z = (x̄ − µ)/(σ/√n)

where µ takes the value given under H0. As an example, suppose that µ = 60, x̄ = 70, and σ/√n = 2. Then

Z = (70 − 60)/2 = 5

This is the value of the test statistic. This is based on the assumption that σ is known to us; otherwise we use the t test.

6. Compare the value of the test statistics with the critical value.

If α=5% 95% of z is between -1.96 and +1.96.

So if the value of the test statistic calculated from the sample falls in the critical region, we reject H0: µ = 60 against the alternative µ ≠ 60. When we reject a null hypothesis using the test statistic, we are saying that our sample was not generated by a population with µ = 60; it was generated by a different population, one with µ ≠ 60.

These are the steps in hypothesis testing. For example suppose that the value of test statistics

= -1. Then we accept 𝐻0. We will say that our sample is generated by a population with

mean = 60. There is no difference between what we have assumed under 𝐻0 and what we have

obtained from the sample or the sample value is close to the hypothesized value. So 𝐻0 is not

rejected.
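A minimal sketch of steps 5 and 6 in code, using the numbers from the example above (µ = 60 under H0, x̄ = 70, σ/√n = 2, α = 5%):

```python
# Sketch: compute Z from the sample and compare with the 5% critical values.
from scipy.stats import norm

mu0, x_bar, se = 60.0, 70.0, 2.0
z = (x_bar - mu0) / se                    # = 5
z_crit = norm.ppf(0.975)                  # about 1.96
print("Z =", z, "| reject H0:", abs(z) > z_crit)
```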

Hypothesis testing, population mean

In this session we discuss how to test the hypothesis. Now there are two approaches to

hypothesis testing;

1. Confidence interval approach

2. Test of significance approach

Test of significance approach was developed by Fisher, Neyman and Pearson.

We will first consider the confidence interval approach. While discussing estimation theory, we noted that there are two types of estimation: point estimation and interval estimation. The confidence interval approach is based on the concept of interval estimation, and the test of significance approach is based on the concept of point estimation.

Confidence interval approach

The first step in hypothesis testing is to formulate the null and alternative hypotheses. Let our null hypothesis be

H0: µ = µ*

tested against

H1: µ ≠ µ*

We first explain the approaches to hypothesis testing using the arithmetic mean as the estimator, and then extend the analysis to the OLS estimators β̂1 and β̂2.

We assume that the population X is normally distributed with mean µ and variance σ²:

X ∼ N(µ, σ²)

and X̄ is also normally distributed with the same mean µ and variance σ²/n:

X̄ ∼ N(µ, σ²/n)

If X is normally distributed, X̄ is also normally distributed; since the probability distribution of X̄ is known, we can use this known probability distribution to construct a confidence interval:

100(1-α) % confidence interval for the parameter µ

This confidence interval for µ is based on the estimator 𝑋

The decision rule is as follows. Once we construct the 100(1 − α)% confidence interval, where α is specified in advance (if α = 5% it is a 95% confidence interval, if α = 1% a 99% confidence interval, and so on), we see whether the hypothesized value µ = µ* is included in the interval or not. If the hypothesized value µ = µ* is in the interval, we do not reject the null hypothesis; otherwise we reject it.

So, in the confidence interval approach, we construct a 100(1-α) % confidence interval for the

parameter µ using the estimator 𝑋. if the hypothesized value is included in this interval we will

not reject the null hypothesis, and if the hypothesized value is not included in this interval we will reject the null hypothesis.

If α = 0.05 (5%), we construct the 100(1 − α)% = 95% confidence interval for µ.

We know that if X̄ ∼ N(µ, σ²/n), Z can be written as

Z = (X̄ − µ*)/(σ/√n)

which is a standard normal variable with mean 0 and variance 1:

Z = (X̄ − µ*)/(σ/√n) ∼ N(0, 1)

Here the distribution of X̄, and hence of Z, depends on the value of µ assumed under the null hypothesis; it is not the actual distribution, only an assumed one.

Pr[−z_{α/2} < (X̄ − µ*)/(σ/√n) < z_{α/2}] = 1 − α

If α = 5%, the above equation can be written as

Pr[−1.96 < (X̄ − µ*)/(σ/√n) < 1.96] = 1 − α

Rearranging this, we get

Pr[X̄ − 1.96(σ/√n) < µ < X̄ + 1.96(σ/√n)] = 1 − α

The 95% confidence interval for µ is X̄ ± 1.96(σ/√n).

So this confidence interval is constructed for the value of µ assumed under the null hypothesis, H0: µ = µ*.
Once the confidence interval is established, we will see whether the hypothesized value lie

within this interval or not.

Now the procedure is simple, once the 95% CI is constructed for µ using 𝑋 and its

standard error, we will see whether the hypothesized value is within this limit or not. If

the hypothesized value is within this limit we will not reject the null hypothesis, if the

hypothesized value is not within this limit, we will reject the null hypothesis.

As an example, suppose that we have taken a sample of n = 100, from which X̄ = 67, σ = 2.5, and σ/√n = 0.25.

Now X̄ + 1.96(σ/√n) = 67 + 1.96(0.25) and X̄ − 1.96(σ/√n) = 67 − 1.96(0.25); this gives the critical values

X̄ + 1.96(σ/√n) = 67 + 1.96(0.25) = 67.49

X̄ − 1.96(σ/√n) = 67 − 1.96(0.25) = 66.51

Suppose that

H0: µ = 69

is tested against

H1: µ ≠ 69

What we do is consider whether the hypothesized value lies within the range 66.51 to 67.49. The hypothesized value is 69, and it is not within this range, so we reject the null hypothesis.

In the context of confidence intervals, as we have discussed earlier, if we construct intervals like

this a repeated number of times, the parameters will lie in such intervals 95 out of 100 times.

Suppose instead that the hypothesized value is within this interval, so we accept the hypothesis. Intervals constructed in this way will contain the true parameter 95 out of 100 times, and will fail to contain it 5 out of 100 times. So, using this rule, we will wrongly reject a true null hypothesis about 5 times out of 100; this is the probability of committing a Type I error.

2. The test of significance approach

The test of significance is a procedure by which sample results are used to verify the truth or

falsity of a null hypothesis; the key idea behind it is the test statistic and the distribution of that test statistic under the null hypothesis. The decision to accept or reject the hypothesis is based on the

value of the test statistic obtained from the sample.


Under the normality assumption, X ∼ N(µ, σ²) and X̄ ∼ N(µ, σ²/n), so

Z = (X̄ − µ)/(σ/√n)

If the value of µ is specified under the null hypothesis, we can calculate Z on the assumption that σ is known; so Z can be used as the test statistic.

As Z ∼ N(0, 1), we can state the probability statement

Pr[−z_{α/2} < (X̄ − µ*)/(σ/√n) < z_{α/2}] = 1 − α

If α = 5%,

Pr[−1.96 < (X̄ − µ*)/(σ/√n) < 1.96] = 1 − α

Then we rearrange this as

Pr[µ* − z_{α/2}(σ/√n) < X̄ < µ* + z_{α/2}(σ/√n)] = 1 − α

What we did here is multiply throughout by σ/√n and then add µ* to all the terms.

This is the interval in which X̄ will fall with probability 1 − α, given that µ = µ*. (This is not a confidence interval.)

In the language of hypothesis testing, this 100(1 − α)% interval is known as the region of acceptance of the null hypothesis, and the region outside it is the region of rejection.

This interval is based on the assumption that σ is known. In most cases σ is unknown, and we use t:

t = (X̄ − µ)/(s/√n) ∼ t(n − 1)

In that case we replace Z with t, and rearranging we get

Pr[µ* − t_{α/2}(s/√n) < X̄ < µ* + t_{α/2}(s/√n)] = 1 − α

Here t_{α/2} is the critical t value for the α level of significance and (n − 1) degrees of freedom. We accept the null hypothesis if X̄ falls in this interval, given µ = µ*.

Since the distribution is normal, we know its area property: µ* ± 1.96(σ/√n) includes 95% of the values of X̄, and the Z used as the test statistic is derived from X̄, the sampling distribution. So if µ = µ*, 95% of the distribution lies within this interval and 5% falls outside. If the X̄ obtained from the sample lies in this interval, we do not reject the null hypothesis. In the language of significance testing, a statistic is said to be statistically significant if the value of the test statistic lies in the critical region; then the null hypothesis is rejected. A statistic is said to be statistically insignificant if the value of the test statistic lies in the acceptance region; then we fail to reject the null hypothesis.

Let us see a concrete example;

Suppose that the average height of students in a college is claimed to be 160 cm:

H0: µ = 160 cm

H1: µ ≠ 160 cm

A sample of size n = 100 is taken, X̄ = 150 cm, σ = 4, and SE = σ/√n = 0.4.

µ ± 1.96(σ/√n) = 160 ± 1.96(0.4) = 160.784 and 159.216 (the critical values)

Presenting this graphically:

How do we find the area between 159.22 and 160.78? That area contains 95% of the values of X̄: if we construct the sampling distribution of X̄, its area property is the same as the area property around µ.

What is the probability that the sample was generated from a population with mean 160? We know that this probability is very low, because 150 lies in a very low-probability region of that distribution. So we reject the null hypothesis. 95% of the values lie between 159.22 and 160.78, and only 5% lie outside (2.5% above 160.78 and 2.5% below 159.22), so the probability of such a sample coming from a population with mean 160 is very low.

We can calculate Z as

Z = (150 − 160)/0.4 = −25
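To make the mechanics concrete, here is a minimal Python sketch of this Z test, using only the numbers from the height example above (scipy is assumed to be available):

```python
import math
from scipy.stats import norm

# Numbers from the height example above
mu0, xbar, sigma, n = 160, 150, 4, 100
se = sigma / math.sqrt(n)            # standard error = 0.4

# Critical values of X-bar under H0 at alpha = 5% (two-tailed)
z_crit = norm.ppf(0.975)             # approximately 1.96
lower, upper = mu0 - z_crit * se, mu0 + z_crit * se

z = (xbar - mu0) / se                # test statistic, about -25
print(lower, upper, z)
print("reject H0" if abs(z) > z_crit else "do not reject H0")
```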

Let us convert the graph into Z as follows;

Hypothesis testing about regression coefficients

Now we discuss in detail hypothesis testing in the context of regression analysis. The parameters of interest are β1 and β2, and the estimators are β̂1 and β̂2. We use these estimators to make inferences about the parameters. As we discussed earlier, remember that before estimating the model we should develop hypotheses about β1 and β2.

Once the hypotheses are fixed, we collect data and apply the inferential procedure: we fix the level of significance, determine the critical region based on the alternative hypothesis, and choose the test statistic based on the problem at hand. Finally we calculate the value of the test statistic from the sample, compare it with the critical value, and make a decision.

There are two approaches to hypothesis testing; the confidence interval approach and the test of

significance approach. The fundamental question of hypothesis testing is: is a given observation or finding compatible with a stated hypothesis or not? Compatible means sufficiently close to the stated hypothesis; if the finding is sufficiently close we do not reject the stated hypothesis, otherwise we reject it. In the language of hypothesis testing, the stated hypothesis is the null hypothesis,

tested against the alternative hypothesis, also known as the maintained hypothesis or the research

hypothesis. The theory of hypothesis testing is concerned with developing rules or procedures of

deciding whether to accept or reject the null hypothesis. In both approaches of hypothesis testing,

we assume that the estimator is a random variable, and the estimator has a probability

distribution, and the probability distribution of the estimator is known as the sampling

distribution, and the sampling distribution is used for hypothesis testing. The normality

assumption is used for hypothesis testing, because if X is normal the estimator is also normal. In

the context of regression, we assumed that the u is normal and as estimators are linear

combinations of u, the estimators are also normally distributed.

Both approaches to hypothesis testing will be illustrated with the income and consumption example. Income ranges from 80 to 260, with corresponding consumption figures, and n = 10. Suppose that the model estimated using OLS is

Ŷᵢ = 24.4545 + 0.5091 Xᵢ

SE(β̂1) = 6.4138, SE(β̂2) = 0.0357

n=10

df= 8

α=5%

Critical t value = 2.306

In the discussion of regression analysis, we consider only the t value, because the standard errors

are estimated standard errors.

Now let us consider the confidence interval approach of hypothesis testing first;

Here, β̂2 = 0.5091.

Suppose our hypothesis is H0: β2 = 0.3 tested against H1: β2 ≠ 0.3 (0.3 is an assumed value for the purpose of hypothesis testing).

Once we formulate the hypothesis, we construct the 95% confidence interval for β2 as

β̂2 ± 2.306 (0.0357) = 0.5091 ± 0.0823, that is, 0.4268 < β2 < 0.5914

The 95% confidence interval for β2 is therefore 0.4268 to 0.5914. In the language of hypothesis testing, if the value of β2 under H0 lies in this 100(1 − α)% (here 95%) interval, we do not reject H0; otherwise we reject it. The hypothesized value is 0.3, which lies outside the interval, so we reject H0. That means the sample was not generated by a population with β2 = 0.3.
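A minimal Python sketch of this confidence-interval check, using only the numbers quoted above (scipy assumed available):

```python
from scipy.stats import t

beta2_hat, se_beta2, df = 0.5091, 0.0357, 8   # estimates from the example
t_crit = t.ppf(0.975, df)                     # about 2.306 for alpha = 5%

lower = beta2_hat - t_crit * se_beta2         # about 0.4268
upper = beta2_hat + t_crit * se_beta2         # about 0.5914

beta2_H0 = 0.3                                # hypothesized value
print(lower, upper)
print("reject H0" if not (lower < beta2_H0 < upper) else "do not reject H0")
```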

The procedure in confidence interval approach to hypothesis testing is;

1. Construct a 100(1-α)%confidence interval for β2

2. If β2 under the 𝐻0, falls in this region, do not reject the 𝐻0, otherwise reject

As we already discussed, the parameter will lie in such intervals 95 out of 100 times; only 5 times out of 100 will it lie outside, so the true parameter may in fact be outside the interval. In our example the hypothesized value lies outside the limits, so we reject H0: our sample was not generated from a population with β2 = 0.3.

In the language of hypothesis testing, when we reject a null hypothesis, we say that our finding is

statistically significant; when we do not reject H0, we say that our finding is not statistically significant. In this case we reject H0, so our finding is statistically significant.

Statistically significant means, our sample finding is significant or our test statistic is significant.

Similarly, in a court procedure the evidence collected by the police is used to verify whether the person has committed the crime or not. If the court accepts H0, the person is treated as innocent; that means the evidence is not significant. In our case, the evidence is our sample. The sample

is used to decide whether to accept or reject the 𝐻0. If 𝐻0 is accepted our sample finding is not

significant; in this sense we say that the test statistic is not significant. When we reject a hypothesis, we may be rejecting a true hypothesis (5 out of 100 times). The parameter will lie within the interval 95 out of 100 times and outside it 5 times, so we may still reject H0 even though H0 is true. Rejecting H0 when H0 is true is called a Type I error.

Test of significance approach

● Now we consider the test of significance approach developed independently by Fisher,

Neyman and Pearson . A test of significance is a procedure by which sample results are

used to verify the truth or falsity of a null hypothesis. The key idea behind the test of

significance approach is the test statistic and sampling distribution of test statistic under

null hypothesis. The decision to accept or reject H0 is based on the value of the test statistic obtained from the sample at hand.

● Now, under the normality assumption,

t = (β̂2 − β2) / SE(β̂2)

The standard error here is the estimated standard error.

t = (β̂2 − β2) / (σ̂/√Σxᵢ²), as we derived earlier.

● Here we explain the test of significance approach, as in the case of the confidence interval approach, using only the t statistic, because in econometrics we rarely use the Z statistic in this context: σ is always replaced by σ̂, and the standard errors given in the example are estimated standard errors.

● Now, if the value of β2 is specified under H0, the t statistic can be calculated, because the standard error is given and the only unknown is β2. So t can be used as the test statistic, and our decision depends on the value of t obtained from the sample compared with the critical t values.

● Now, as we have written earlier,

Pr[ −t_{α/2} < (β̂2 − β2*)/SE(β̂2) < t_{α/2} ] = 1 − α

where β2* is the value of β2 specified under H0, and t_{α/2} is the critical value for the given degrees of freedom and the specified level of significance. In our example it is 2.306.


Pr[ −2.306 < (β̂2 − β2*)/SE(β̂2) < 2.306 ] = 1 − α

This is the typical example.

Now, this can be rearranged as

Pr[ β2* − t_{α/2} SE(β̂2) < β̂2 < β2* + t_{α/2} SE(β̂2) ] = 1 − α

This is not a confidence interval; it is a different procedure. We multiplied throughout by the standard error of β̂2 and added β2* to all the terms to rearrange the equation.

● This is the interval in which β̂2 falls with probability 1 − α, given that β2 = β2*.

● And in the language of hypothesis testing this 100(1 − α)% confidence interval is

known as the region of acceptance of 𝐻0and outside it is the region of rejection. Before

explaining more, let us elaborate on the example.

● In our example, let us first explain this in terms of β̂2. If we calculate the values,

Pr[ β2* − t_{α/2} SE(β̂2) < β̂2 < β2* + t_{α/2} SE(β̂2) ] = 1 − α

0.3 ± 2.306(0.0357), which gives the interval 0.2177 to 0.3823.

This is not a confidence interval; it is the interval around β2* running from −t_{α/2} to +t_{α/2} in units of the standard error. What it says is that if you take samples like this a repeated number of times, you will generate a distribution, and 95% of the β̂2 values will lie between 0.2177 and 0.3823, provided that the true mean is 0.3. And this is not an actual distribution. This

is a hypothetical distribution. You will get this distribution only after formulating H0; if H0 is not formulated, you cannot derive the t statistic and you will not get this distribution.
● Remember, hypothesis testing is based on the sampling distribution of the estimators: not the actual sampling distribution, but a hypothetical one based on an assumed parameter value.

● Now, we will not do this directly; we will convert β̂2 into units of t.

● 95% of t with 8 degrees of freedom lies between −2.306 and +2.306. Now we calculate this t:

t = (0.5091 − 0.3)/0.0357 = 5.86

The value of β̂2 obtained is 0.5091; when converted into t, it is 5.86.

Now, in the test of significance approach, we compare the value obtained from the sample with the hypothesized value. If the sample value is far away from the hypothesized value, our hypothesis may not be correct, or our sample may not have been generated from a population with the hypothesized value of 0.3, so our suspicion about H0 increases. If the value of β̂2 is close to the hypothesized value β2*, we will be in the acceptance region; otherwise we will be in the

rejection region. What this says is that the probability that our sample was generated from a population with mean equal to 0.3 is less than 5% (2.5% in the left tail and 2.5% in the right tail). Only a low probability, so we reject H0. Converting into t, our t value is 5.86, and the probability of obtaining a t of 5.86 or more is less than 5%. The exact probability will be discussed later. Since it is less than 5%, H0 is again rejected.
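As a quick check of these numbers, here is a hedged Python sketch of the t test of significance using the example values (scipy assumed):

```python
from scipy.stats import t

beta2_hat, beta2_H0, se, df = 0.5091, 0.3, 0.0357, 8
t_stat = (beta2_hat - beta2_H0) / se          # about 5.86
t_crit = t.ppf(0.975, df)                     # about 2.306 for alpha = 5%, two-tailed
p_two_tail = 2 * t.sf(abs(t_stat), df)        # exact probability of |t| >= 5.86

print(t_stat, t_crit, p_two_tail)
print("reject H0" if abs(t_stat) > t_crit else "do not reject H0")
```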

● You can test a hypothesis using an estimator itself, and a sampling distribution. We will

not do this. What we do is, we convert an estimator into a sample statistic and the test

statistic is used for hypothesis testing. This is the procedure that we do.

● If β̂2 = β2* then t = 0, and t = 0 means the null hypothesis is accepted. If the estimated β̂2 is far away from the hypothesized value, a large t value is evidence against H0, while a small t value is evidence in favour of H0. To conclude, in the language of significance testing, a test statistic is said to be significant if its value lies in the critical region, and insignificant if it lies in the acceptance region. So, in this example, as the value of the test statistic lies in the critical region, we say that the test statistic is significant, meaning that the sample finding is significant.

● As we are using the t test here, we have a test procedure based on t. Remember that a statistical finding or test statistic is said to be significant only if H0 is rejected; otherwise it is not significant. If the null hypothesis is not rejected, the test statistic is not significant. This was a two-tailed test; we can adopt a one-tailed test also.

● If our null hypothesis is

H0: β2 = 0.3

H1: β2 > 0.3

Then the critical t value becomes t = 1.86 (one-tailed, 8 df), so with a one-tailed test the critical value is smaller. A test statistic lying between 1.86 and 2.306 would lead to accepting H0 under the two-tailed test but rejecting it under the one-tailed test; in this sense the one-tailed test increases the power of the test.

● If the alternative hypothesis states that β2 is less than β2*, we have a left-tailed test. Whether we have a one-tailed or a two-tailed test, and whether it is right-tailed or left-tailed, depends on the alternative hypothesis.

● So, to summarize,

Type of test | H0 | H1 | Reject H0 if
Two-tailed | β2 = β2* | β2 ≠ β2* | |t| > t_{α/2, df}
Right-tailed | β2 ≤ β2* | β2 > β2* | t > t_{α, df}
Left-tailed | β2 ≥ β2* | β2 < β2* | t < −t_{α, df}
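The table translates directly into a small decision helper. The sketch below is illustrative only; the function name and defaults are my own, and scipy is assumed:

```python
from scipy.stats import t

def reject_h0(t_stat, df, alpha=0.05, tail="two"):
    """Decision rule matching the summary table above."""
    if tail == "two":
        return abs(t_stat) > t.ppf(1 - alpha / 2, df)
    if tail == "right":
        return t_stat > t.ppf(1 - alpha, df)
    if tail == "left":
        return t_stat < -t.ppf(1 - alpha, df)
    raise ValueError("tail must be 'two', 'right' or 'left'")

print(reject_h0(5.86, 8))                  # True: reject at 5%, two-tailed
print(reject_h0(2.0, 8, tail="right"))     # True: 2.0 > 1.86
```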

● Always remember that for hypothesis testing we use the sampling distribution, and the area of the sampling distribution is divided in two: values highly likely under H0 and values highly unlikely under H0. Highly likely values are close to the hypothesized value; unlikely values are far away from it. Also remember that the distribution is hypothetical. If you specify a different hypothesized value, say 0.5 instead of 0.3, then β̂2 would fall in the acceptance region, not in the rejection region.

● So, it is important to fix the hypothesis first; if you fix the hypothesis after observing the results, you can always reject H0. That practice is not recommended. Fix the hypothesis first, then collect data, estimate the parameters, and test using the sampling distribution; remember, the sampling distribution is not used directly, we use test statistics derived from the sampling distribution.

● Also remember that the sampling distribution of the test statistic depends on the normality assumption, so normality of the estimator is important. In the regression context, normality of the estimators depends on normality of the disturbance u; it has nothing to do with the dependent or independent variable as such. This normality assumption matters mainly when the sample size is small: if you have a sufficiently large sample, say n > 100, the normality assumption is not very important.

● These are the lessons of hypothesis testing. There are many other issues related to hypothesis testing, and they will be discussed in the next sessions.

Hypothesis testing, some practical aspects

● Hypothesis testing is perhaps one of the most important areas of empirical analysis. The success of empirical research depends on how a hypothesis is formulated and how it is tested using inferential statistics. We have discussed hypothesis testing about the

population parameter µ using the estimator 𝑋, we have also discussed in detail how to test

hypotheses about regression coefficients β1and β2. The test statistics we have extensively

used is the t test, now we have to consider some practical issues regarding hypothesis

testing. In addition to the technical details of hypothesis testing, we must also be familiar

with some of the practical aspects of hypothesis testing. This session is meant to give

some concrete ideas about some of the practical issues, some of the precautions, some of

the considerations, when we test the hypothesis.

● The first issue that we are considering is what is meant by accepting or rejecting a

hypothesis?

● Remember, our decision to accept or reject the null hypothesis is based on sample evidence, so our decisions will not always be correct. Whenever we decide to accept or reject a null hypothesis, we are liable to commit two types of errors.

● Type I error : reject 𝐻0 when 𝐻0 is true

● Type II error : Accept 𝐻0 when 𝐻0 is false

● Now, as the decision to accept or reject H0 is based on sample evidence, and sample values vary from sample to sample, we can never be 100% certain in accepting or rejecting H0. Suppose that on the basis of sample evidence we decide to accept H0; that does not mean that H0 is true beyond any doubt. To explain this point, let us consider an example.

● Suppose that we hypothesize a value for the MPC in the consumption-income example:

𝐻0: β2 = 0. 7

𝐻1: β2 ≠ 0. 7

Suppose that the estimated value is β̂2 = 0.7241 and SE(β̂2) = 0.0701. Now consider the t statistic:

t = (0.7241 − 0.7)/0.0701 = 0.3438

So, this t value is not statistically significant, so, we accept 𝐻0 because 0.3438 is in the

acceptance region. So, we say that we accept 𝐻0

● Now, suppose that β2 = 0. 6, then

𝐻0: β2 = 0. 6

𝐻1: β2 ≠ 0. 6

t = (0.7241 − 0.6)/0.0701 = 1.7703

Again the t value is not significant, again we accept 𝐻0

● So, we accept the null hypothesis that β2 = 0.7 and we also accept the null hypothesis that β2 = 0.6, but we do not know which H0 is true, because the same data are compatible

with different hypotheses. The same sample evidence is compatible with different

hypotheses. So, we do not know which hypothesis is true.

● So, we say that we "may accept" H0 rather than that we "do accept" H0, or that we "do not reject" H0 at the 5% level of significance rather than that we "accept" it, just as a court declares "not found guilty" rather than "innocent". So on the basis of the

evidence collected by the Police, the Judge declares that the accused is not found guilty.

But it doesn't mean that he is innocent. The sample evidence is not sufficient to prove

that he has committed the crime. So the verdict is "not found guilty" rather than "innocent". When the accused is freed by the court, it does not mean that the accused is innocent beyond a

do not reject 𝐻0 rather than we accept 𝐻0 or we may accept 𝐻0 rather than we do accept

𝐻0. This is so because the same sample evidence will be compatible with different

hypotheses. And we do not know which hypothesis is correct. So, whenever we reject or

accept a hypothesis, always keep this point in mind.

● So, as one author says, rather than saying that we accept a null hypothesis, we can only say that the data do not allow us to reject it at the 5% (or x%) level of significance.

● The second point we want to discuss is Zero Null Hypothesis

● We can formulate H0 depending on the problem under consideration. But the null hypothesis most often tested in empirical analysis is the zero null hypothesis. Consider the relation

𝑌𝑖 = β1 + β2𝑋𝑖 + 𝑢𝑖

𝐻0: β2 = 0

𝐻1: β2 ≠ 0

Now, the purpose of this hypothesis is whether Y is related to X at all. If this hypothesis

is accepted or not rejected, there is no meaning in testing other hypotheses. Not only that,

we have many softwares for estimating econometric relations. In all these softwares the

default hypothesis is the zero null hypothesis. You have the option to test other

hypotheses and when we proceed we will consider other hypotheses also. Testing the zero

null hypothesis is not always the best practice. We have to test other hypotheses also.

Anyway, in the softwares, the default hypothesis is the zero null hypothesis.

● If we test the zero null hypothesis, then t becomes

t = β̂2 / SE(β̂2)

because the numerator is β̂2 − 0.

So, whether t is positive or negative depends on whether β2 is positive or negative. In the

consumption income example t will be positive as β2 is positive, in the quantity

demanded and price example t will be negative as β2 is negative, like that.

● So, t becomes t = β̂2 / SE(β̂2).

● As you can see, to test significance we compare β̂2 with its standard error. t will be significant only if β̂2 is much larger than the standard error of β̂2. If β̂2 equals its standard error, then t = 1, which is not significant; and if the standard error of β̂2 exceeds β̂2, t is again not significant. We judge β̂2 in relation to its standard error.

● So, remember, when we obtain the computer output we will see the coefficient, below it the standard error, and then the coefficient divided by the standard error, which is the t value. Below that we will get the p value, a topic we will take up soon.

● So, a hypothesis which we test in econometrics is a zero null hypothesis.

● Another issue of practical importance is how to formulate 𝐻0and 𝐻1?

● There are no hard and fast rules regarding how to formulate H0 and H1. In many cases economic theory will tell us how to formulate them. If economic theory is not strong, we have to use our intuition, previous empirical work of a similar nature, and so on. Remember that the formulation of H0 and H1 depends on the problem at hand. Often theory or previous empirical work will give you some guidance, or your informed guess can be used to formulate H0 and H1.

● No matter how you formulate 𝐻0and 𝐻1, it is absolutely essential that you must formulate

𝐻0and 𝐻1before estimation. This is very important. The point is, when you go for an

empirical research, the first issue is to identify a problem, by literature review we identify

a research gap, formulate a research problem.

● A research problem is specified in the form of 𝐻0and 𝐻1. Once 𝐻0and 𝐻1are formulated

you can collect your own data, that is primary data, or you can go for secondary data,

collected by some other agency. The point is 𝐻0and 𝐻1 should be formulated first.

Suppose that you identify an issue, collect data, and observe the results. If you formulate the hypotheses after observing the results, you can always formulate H0 and H1 in such a way that your null hypothesis is rejected and your finding appears significant. This is because hypothesis testing is based on a hypothetical distribution: the distribution takes shape only when the value of µ is specified as µ*, so if you give µ* one value you get one distribution, and if you give it another value you get another distribution.

● H0: µ = µ*

● So, for one value you may reject H0 and for another value you may accept it. If you formulate the hypothesis after observing the results, the problem is that you can always fix µ* far away from the observed value of X̄ for a given standard error, in which case H0 will be rejected.

● If you formulate a hypothesis after observing the results, there is a temptation to form the hypothesis just to justify your results.

● So, as one author says, the researcher will be guilty of circular reasoning or self-fulfilling prophecies: there is the possibility that H0 will be formulated in such a way that the sample finding is justified, that is, H0 is rejected. This is not scientific.

● So, the scientific procedure is: formulate the hypothesis first, then collect data, estimate, test, and accept or reject the hypothesis on the basis of the sample evidence and the hypothesis already formulated. That is a practical issue we must always keep in mind.

● Then another issue that a researcher confronts is how to fix α- level of significance?

● We divide the entire area of the sampling distribution into two: highly likely under H0 and highly unlikely under H0. The division of the area depends on the value of α. If α is, say, 5%, that is 0.05, then for a two-tailed test 2.5% lies in the left tail, 2.5% in the right tail, and 95% in the central portion of the normal distribution curve.

● So, our decision to accept or reject 𝐻0 depends on the level of significance,α. The level

of significance is the probability of committing Type I error. You know that the following

is the distribution used for hypothesis testing, say the distribution of X̄ with hypothesized value µ*. This is a hypothetical distribution and it depends on µ*; if you give another µ* you will get another distribution. Now, what is α? If α is 5%, the critical region is divided into 2.5% in each tail. If α is 10%, the critical region expands and the acceptance region shrinks.

● Suppose a sample value falls just inside the acceptance region under α = 5% but inside the critical region under α = 10%; then H0 is accepted at 5% and rejected at 10%. Our decision to accept or reject H0 therefore depends critically on α.

● So, the question is how do we fix α?

● There are no hard and fast rules. α is usually fixed at 1%, 5%, or 10%. In the social sciences the most commonly preferred level of significance is 5%; in the medical sciences it is 1%.

● The point is that α is the probability of committing a Type I error. You can reduce this probability by fixing α at a low level, such as 1%. But when you reduce α, the probability of a Type II error, β (accepting H0 when it is false), will increase. For a given sample you cannot minimize both types of error: when the probability of Type I error is reduced, the probability of Type II error increases, and vice versa. This is the problem.

● Can you reduce the Type I error to zero? Such a possibility exists: Type I error is rejecting H0 when H0 is true, so if you never reject H0, the probability of a Type I error becomes zero. But then the probability of a Type II error, accepting a false hypothesis, becomes 1.

● So, we cannot minimize both types of errors in a particular case. So, the common

practice is, econometricians and statisticians generally believe that type I error is more

serious than type II error. So, conventionally, we fix type I error at 1%, 5%, 10% etc. And

try to reduce type II error as much as possible.

● Regarding whether type I error or type II error is more serious, we have discussed an

example earlier. The cost of type I error and type II errors, the example of steel. If the

steel is used for building a house, the type II error is more serious but if it is used for

fencing type I error is more serious.

● So, when fixing α, remember that as α increases, β decreases, and vice versa. While fixing α one must keep in mind the costs of Type I and Type II errors, which depend on the purpose at hand.

Hypothesis testing P value

The other issue is the p value, or exact level of significance. The concept is very important in inferential statistics. As one author put it, the Achilles heel of classical statistical inference is the arbitrariness of fixing α at 1%, 5%, or 10%; this is the one weak spot in the classical procedure. What is suggested is that, instead of fixing α arbitrarily, once a test statistic is obtained from the sample (e.g. t = 5.86), we find the actual probability of obtaining such a t value. This is known as the probability value, or the exact or observed level of significance.

Eg: H0: MPC = 0.3.

If t equals the critical value 2.306, the probability of getting a t that large or larger (in absolute value) is exactly 5%, i.e. P = 0.05.

With α = 5%, if the sample value lies in the critical region, H0 is rejected, and the stated probability of committing a Type I error is 5%. But the actual probability is not 5%. To find the actual probability of committing a Type I error, we have to find the probability of obtaining t = 5.86:

P = 0.002 (two-tail, from tables)
= 0.001 (one-tail)

The computer value is 0.000189 (the p value of the observed t).

Suppose that p = 0.01: the test statistic is significant at 1%.
If p = 0.05, the test statistic is significant at 5%.
If p = 0.1, the test statistic is significant at 10%.
If p = 0.15, the test statistic is significant at 15%.

You report the p value along with the coefficient, standard error, and t, and let the reader decide whether to reject at 1%, 5%, 10%, 15%, and so on; one reader may be willing to reject only at 1%, another at 5% or 10%. Reporting the p value along with the results leaves that decision to the reader.

The p value is the lowest level of significance at which a null hypothesis can be rejected. It is also known as the exact or observed level of significance, and it is obtained after calculating the empirical t from the sample.
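As an illustration, the exact p value for the observed t in the earlier example can be computed as below (a sketch assuming scipy; the numbers are the ones quoted above):

```python
from scipy.stats import t

t_obs, df = 5.86, 8
p_one_tail = t.sf(t_obs, df)       # P(T >= 5.86), one tail
p_two_tail = 2 * p_one_tail        # two-tailed exact level of significance

print(p_one_tail, p_two_tail)
for alpha in (0.01, 0.05, 0.10):
    print(alpha, "significant" if p_two_tail < alpha else "not significant")
```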

Statistical significance v/s practical significance

Statistical significance is based purely on the t value: if the t value is low, the coefficient is statistically insignificant; if it is high, the coefficient is statistically significant. Practical significance, on the other hand, depends on the sign and the size of the coefficient. A coefficient may be statistically significant yet have no practical significance, for instance β̂2 = 0.00002. Both matter from a practitioner's point of view. Even if a coefficient is statistically significant, if its sign is not according to our expectation it is not practically meaningful; for example, in the consumption-income model a negative β̂2 is not theoretically acceptable. So the sign is important, and the size is also important. Statistical significance should not be confused with practical significance.

Guidelines

1. Check for statistical significance: if the coefficient is statistically significant, then consider its sign and size.

2. If it is not statistically significant at 1%, 5%, or 10%, you may still consider the expected sign and size, and report the p value; for small samples some practitioners are willing to reject H0 at p values up to 0.2 (20%).

3. If the sign (+ or −) is not theoretically justified, the result can be discarded.

Let me consider an example, dependent variable ( Y) is hourly wage and independent variable

( X) is years of education.

Y (hourly wage) | X (years of education)
4.45 | 6
.. | 7
.. | ..
13.53 | 18

Ŷᵢ = −0.0144 + 0.7240 Xᵢ
SE = (0.93173) (0.0700)
t = (−0.0154) (10.3428)
p = (0.987) (0.000)
n = 13, df = 11, α = 5%, critical t = 2.201
r² = 0.9065

About 91% of the variation in Y is explained by X. The slope coefficient is 0.7240, and its t value is far from zero, so the slope is statistically significant: there is a positive relationship between years of education and hourly wage.

Regression and ANOVA, the F test

ANOVA was introduced by Fisher to analyze data relating to agricultural experiments. Nowadays ANOVA is used extensively to compare differences between more than two means; in the regression context it is used mainly for statistical inference. To discuss ANOVA, recall that

TSS = ESS + RSS

Σyᵢ² = Σŷᵢ² + Σûᵢ² = β̂2²Σxᵢ² + Σûᵢ²

The study of the components of TSS is known as ANOVA in the context of regression. Associated with any sum of squares is its number of degrees of freedom.

TSS = Σ(Yᵢ − Ȳ)² with (n − 1) degrees of freedom.

RSS = Σûᵢ² with (n − 2) degrees of freedom.

ESS = Σ(Ŷᵢ − Ȳ)² with (n − 1) − (n − 2) = 1 degree of freedom.

ANOVA Table

Source of variation | SS | df | MSS
ESS | Σŷᵢ² = β̂2²Σxᵢ² | 1 | ESS/1
RSS | Σûᵢ² | n − 2 | Σûᵢ²/(n − 2) = σ̂²
TSS | Σyᵢ² | n − 1 |

F = β̂2²Σxᵢ² / σ̂² ∼ F(1, n − 2)

This follows from the definition of the F distribution. If Z1 ∼ χ²(k1) and Z2 ∼ χ²(k2) are independent, then

F = (Z1/k1) / (Z2/k2) ∼ F(k1, k2)

Here z1 = (β̂2 − β2)/(σ/√Σxᵢ²) ∼ N(0, 1), so

Z1 = z1² = (β̂2 − β2)²Σxᵢ² / σ² ∼ χ²(1), and Z2 = Σûᵢ²/σ² ∼ χ²(n − 2)

F = (Z1/1) / (Z2/(n − 2)) = [(β̂2 − β2)²Σxᵢ²/σ²] / [(Σûᵢ²/σ²)/(n − 2)] = (β̂2 − β2)²Σxᵢ² / σ̂² ∼ F(1, n − 2)

Suppose that H0: β2 = 0. Then

F = β̂2²Σxᵢ² / σ̂² ∼ F(1, n − 2)

This is the ratio of two independently distributed chi-square variables, each divided by its degrees of freedom. Under β2 = 0, the t statistic is

t = β̂2 / SE(β̂2)

Note that E(β̂2²Σxᵢ²) = σ² + β2²Σxᵢ² and E(σ̂²) = σ². So if β2 = 0, the expected values of the numerator and the denominator coincide and the F value will be small; RSS is a measure of the noise in the system, and a large F is evidence against H0. In general t²(k) = F(1, k); in the two-variable model, t²(n − 2) = F(1, n − 2).
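The t–F relation can be checked numerically. A small sketch using the consumption-income estimates quoted earlier (β̂2 = 0.5091, SE = 0.0357, n = 10) and the zero null hypothesis; scipy is assumed:

```python
from scipy.stats import f, t

beta2_hat, se_beta2, n = 0.5091, 0.0357, 10
t_stat = beta2_hat / se_beta2        # t under the zero null hypothesis
F_stat = t_stat ** 2                 # t(n-2)^2 equals F(1, n-2)

F_crit = f.ppf(0.95, 1, n - 2)
t_crit = t.ppf(0.975, n - 2)
print(F_stat, F_crit, t_crit ** 2)   # the two critical values agree: F_crit = t_crit^2
```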

Prediction using simple linear regression model

Example: suppose we have observations on Yᵢ and Xᵢ with n = 10. Using these observations and applying OLS, suppose that the estimated consumption function is

Ŷᵢ = 24.45 + 0.5091 Xᵢ

where 24.45 = β̂1 = intercept and 0.5091 = β̂2 = slope, the marginal propensity to consume.

Then what is the use of the estimated (historical) regression? An estimated regression is used to verify an economic theory, for policy purposes, and also for prediction. One use of this historical regression is to forecast a future value, say the consumption expenditure Y0 associated with income X0, where Y0 and X0 are not included in the original observations (Yᵢ, Xᵢ, i = 1, 2, ..., n) used to estimate the regression line.

The range of the sample used may be, say, 80 to 250. X0 may lie within this range or outside it; the only requirement is that X0 was not included in the original observations.

It is also assumed that the relationship between Y and X which generated the sample data continues to hold for the predicted pair (X0, Y0), whether X0 is within the sample range or outside it. Now, there are two types of predictions.

1) Individual prediction: predicting an individual value Y0 corresponding to X = X0.

2) Mean prediction: predicting E(Y | X0), a value on the population regression line.

Individual prediction

Suppose our aim is to predict the individual value Y0 corresponding to X = X0. The true value is

Y0 = β1 + β2X0 + u0

and we predict it as

Ŷ0 = β̂1 + β̂2X0

Ŷ0 is the predictor of Y0, and since Ŷ0 is an estimator it is likely to differ from the true value Y0: the two need not coincide, because Ŷ0 is based on the estimators β̂1 and β̂2.

The difference between Y0 and Ŷ0 is the prediction error or forecast error:

û0 = Y0 − Ŷ0 = (β1 + β2X0 + u0) − (β̂1 + β̂2X0) = (β1 − β̂1) + (β2 − β̂2)X0 + u0

There are two sources of error in this formula:

1. u0, the stochastic (additive) error term.

2. Second source of error is estimators . The reason is estimators are random variables and

the value of estimators varies from sample to sample.

But if you take expectations,

E(û0) = E(Y0 − Ŷ0) = 0

which means that the predictor Ŷ0 is unbiased.

Variance of prediction error

Var(û0) = Var(Y0 − Ŷ0) = Var[(β1 − β̂1) + (β2 − β̂2)X0 + u0]

û0 can also be written as û0 = u0 + (β1 − β̂1) + (β2 − β̂2)X0, so

û0² = u0² + (β1 − β̂1)² + (β2 − β̂2)²X0² + 2u0(β1 − β̂1) + 2u0(β2 − β̂2)X0 + 2(β1 − β̂1)(β2 − β̂2)X0

Taking expectations, using the variances and covariance of β̂1 and β̂2 (the terms involving u0 vanish because u0 is independent of the sample), we get

E(û0²) = E(Y0 − Ŷ0)² = σ²[ 1 + 1/n + (X0 − X̄)²/Σxᵢ² ]

The variance of the prediction error is given by this formula.

The variance of the prediction error is smallest when X0 = X̄; in that case it becomes σ²(1 + 1/n). The prediction error variance increases non-linearly as X0 deviates from X̄.

We know that û0 = Y0 − Ŷ0 is a linear function of normal variables, because it is a linear function of β̂1, β̂2 and u0.

We can write

(Y0 − Ŷ0) / ( σ √(1 + 1/n + (X0 − X̄)²/Σxᵢ²) ) ∼ N(0, 1)

σ is unknown, so we replace it with σ̂:

(Y0 − Ŷ0) / ( σ̂ √(1 + 1/n + (X0 − X̄)²/Σxᵢ²) ) ∼ t(n − 2)

In this formula everything is known except Y0, so we construct a 95% confidence interval for Y0 as

Ŷ0 ± t₀.₀₂₅ · σ̂ √(1 + 1/n + (X0 − X̄)²/Σxᵢ²)

Example: In the example considered earlier, suppose we want to predict Y for X0 = 100.

Ŷ0 = 24.4545 + 0.5091(100) = 75.3645

Var(û0) = σ̂²[1 + 1/n + (X0 − X̄)²/Σxᵢ²] = 42.159 (1 + 1/10 + (100 − 170)²/33000) = 52.63, so SE = 7.255

The 95% confidence interval for Y0 is

58.6245 < Y0 < 92.0945

which is known as the confidence band.
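A minimal Python sketch of this individual-prediction interval, using only the quantities quoted in the example (σ̂² = 42.159, n = 10, X̄ = 170, Σxᵢ² = 33000); scipy is assumed:

```python
import math
from scipy.stats import t

b1, b2 = 24.4545, 0.5091
sigma2_hat, n, xbar, sum_x2 = 42.159, 10, 170, 33000   # sum_x2 is in deviation form
X0 = 100

y0_hat = b1 + b2 * X0                                   # about 75.36
var_ind = sigma2_hat * (1 + 1/n + (X0 - xbar) ** 2 / sum_x2)
se_ind = math.sqrt(var_ind)                             # about 7.26

t_crit = t.ppf(0.975, n - 2)                            # 2.306
print(y0_hat - t_crit * se_ind, y0_hat + t_crit * se_ind)   # roughly 58.6 to 92.1
```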

This can be repeated for any value of X, which traces out the confidence band. If instead your aim is mean prediction, then we have to predict

E(Y | X0) = β1 + β2X0

a point on the PRF. We predict it as

Ŷ0 = β̂1 + β̂2X0

When you take the variance of the prediction error for this mean prediction, it becomes

σ²[ 1/n + (X0 − X̄)²/Σxᵢ² ]

When you construct the 95% confidence interval for this predictor, the confidence band will be close to the SRF.
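For comparison with the individual-prediction sketch above, here is the corresponding mean-prediction interval at the same X0 (same assumed quantities; scipy assumed):

```python
import math
from scipy.stats import t

sigma2_hat, n, xbar, sum_x2, X0 = 42.159, 10, 170, 33000, 100
y0_hat = 24.4545 + 0.5091 * X0                           # about 75.36

se_mean = math.sqrt(sigma2_hat * (1/n + (X0 - xbar) ** 2 / sum_x2))   # about 3.24
t_crit = t.ppf(0.975, n - 2)
print(y0_hat - t_crit * se_mean, y0_hat + t_crit * se_mean)  # a narrower band than before
```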

The point is that once you estimate a model, you can predict the value of Y corresponding to any X = X0. But if you try to predict a value far away from the mean of X, the prediction error and its variance increase non-linearly. So never try to predict a value far away from the mean of the independent variable, or at least never try to predict a value far outside the sample range. If we extend an estimated regression line far beyond the sample range it can give contradictory results. Suppose that we have information on the GNP of India up to 2020, along with aggregate consumption, and we estimate a consumption-income relationship. If we try to forecast consumption expenditure in the year 2030 by substituting a value of income for 2030, the result will be absurd, because the variance of the prediction error will be very high; such a prediction is meaningless. Thus never predict a value far away from the sample range: it will give misleading results.

Regression through the origin

Consider a PRF

Yᵢ = β2Xᵢ + uᵢ

In this model the intercept β1 is assumed to be zero. Such a model is known as a regression-through-the-origin model. As you know, the PRF of such a model starts from the origin, not from a point on the Y axis.

As an example to this , consider the permanent income hypothesis of Milton Friedman, which

says that consumption is proportional to permanent income

1) C = αYᵖ, where Yᵖ is permanent income; adding a disturbance term u, this is an example of a regression-through-the-origin model.

2) Consider a cost function TC = a₀ + a₁Q + a₂Q² + a₃Q³.

TC = TFC+TVC

If you are modeling the variable cost component, the function starts from the origin, because we know that the total variable cost curve starts from the origin.

Now if you have strong theoretical reasons to believe that the correct model is a

regression through the origin model , then you can have a model like this. Regression

through the origin mode is not generally preferred in empirical analysis, but still it is

useful for some other purposes. So it is important to know what are the properties of

regression through the origin model, because when we discuss heteroscedasticity , auto

correlation etc we will come across such models and a discussion will be useful later.

Now this is the PRF . Now let us say how to estimate SRF is

𝑌𝑖 = β2𝑋𝑖 𝑢𝑖

Yᵢ = β̂2Xᵢ + ûᵢ

ûᵢ = Yᵢ − β̂2Xᵢ

Σûᵢ² = Σ(Yᵢ − β̂2Xᵢ)²

dΣûᵢ²/dβ̂2 = −2Σ(Yᵢ − β̂2Xᵢ)Xᵢ = 0

This can be written as ΣYᵢXᵢ = β̂2ΣXᵢ², which is the condition ΣûᵢXᵢ = 0. Solving, we get

β̂2 = ΣXᵢYᵢ / ΣXᵢ²

Now if we substitute for Yᵢ from the PRF, we can write

β̂2 = ΣXᵢ(β2Xᵢ + uᵢ) / ΣXᵢ² = (β2ΣXᵢ² + ΣXᵢuᵢ) / ΣXᵢ²

β̂2 = β2 + ΣXᵢuᵢ / ΣXᵢ²

E(β̂2) = β2

So β̂2 in the regression-through-the-origin model is unbiased.
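A small numerical sketch of the two estimators, on made-up data (the numbers below are hypothetical, used only to contrast the raw and the mean-adjusted formulas):

```python
import numpy as np

# Hypothetical data, not from the text
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Regression through the origin: raw sums of squares and cross products
b2_origin = (X * Y).sum() / (X ** 2).sum()

# Conventional model: sums in deviations from the means
x, y = X - X.mean(), Y - Y.mean()
b2_intercept_model = (x * y).sum() / (x ** 2).sum()

print(b2_origin, b2_intercept_model)
```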

Now let us derive the variance of β̂2. From the step above,

E(β̂2 − β2)² = E[ ΣXᵢuᵢ / ΣXᵢ² ]² = (1/(ΣXᵢ²)²) E(ΣXᵢuᵢ)²

If we expand the square, it becomes

X₁²u₁² + ... + Xₙ²uₙ² + 2X₁X₂u₁u₂ + ... + 2Xₙ₋₁Xₙuₙ₋₁uₙ

and from the assumptions all the cross terms disappear in expectation, as we derived earlier, so this becomes

(1/(ΣXᵢ²)²) σ²ΣXᵢ²

That gives the variance of β̂2:

Var(β̂2) = σ² / ΣXᵢ²

Now, σ² is unknown, so we replace it with its unbiased estimator

σ̂² = Σûᵢ² / (n − 1)

Here it is n − 1, not n − 2, because in the regression-through-the-origin model we estimate only one parameter, so only one degree of freedom is lost.

If we take the expected value,

E[ Σûᵢ² / (n − 1) ] = (n − 1)σ² / (n − 1) = σ²

Now it is informative to compare the regression-through-the-origin model with the model having an intercept. While deriving the estimator we used the condition

1) ΣûᵢXᵢ = 0

This is a condition imposed on the data to derive β̂2. As only one condition is imposed, we have n − 1 degrees of freedom. In the conventional model, in addition to this, we have Σûᵢ = 0. But in the regression-through-the-origin model we cannot impose both conditions together. Let us see why. We know that

Yᵢ = β̂2Xᵢ + ûᵢ, so ΣYᵢ = β̂2ΣXᵢ + Σûᵢ

If we impose the condition Σûᵢ = 0, then β̂2 becomes ΣYᵢ/ΣXᵢ = Ȳ/X̄, and this quantity is biased. So in the regression-through-the-origin model we can impose ΣûᵢXᵢ = 0, but we cannot also impose Σûᵢ = 0.

Also, as we derived earlier,

Yᵢ = Ŷᵢ + ûᵢ, so averaging, Ȳ = mean(Ŷᵢ) + mean(ûᵢ)

In the conventional model Σûᵢ = 0, so the mean residual is zero and Ȳ equals the mean of the fitted values. But in the regression-through-the-origin model Σûᵢ ≠ 0, so the mean residual is not zero and Ȳ does not equal the mean of the fitted values. We cannot impose the condition that the mean of Y is the mean of the estimated Y; the two are different.
So, for comparison, recall that in the conventional model

β̂2 = Σxᵢyᵢ / Σxᵢ²

Var(β̂2) = σ² / Σxᵢ²

σ̂² = Σûᵢ² / (n − 2)

Now let us compare the two models point by point.

1) In the regression-through-the-origin model we use the raw sums of squares and cross products, whereas in the model with an intercept we use sums of squares and cross products adjusted for the means.

2) In the regression-through-the-origin model we have n − 1 degrees of freedom; in the conventional model we have n − 2.

3) In the conventional model Σ𝑢𝑖=0. But we cannot assume that Σ𝑢𝑖=0 in the regression

through the origin model.

4) In the model with an intercept, 0 ≤ r² ≤ 1, i.e. r² is always a non-negative number. But in the regression-through-the-origin model there is a possibility that r² will become negative.

5) In the regression-through-the-origin model we can show that

Σûᵢ² = ΣYᵢ² − β̂2²ΣXᵢ²

and we know that

r² = 1 − Σûᵢ² / Σyᵢ²

In the model with an intercept, Σûᵢ² = Σyᵢ² − β̂2²Σxᵢ², so Σûᵢ² ≤ Σyᵢ². But in the model without an intercept the residual sum of squares comes from the raw data, while Σyᵢ² = ΣYᵢ² − nȲ², so the conventional measure becomes

r² = 1 − (ΣYᵢ² − β̂2²ΣXᵢ²) / (ΣYᵢ² − nȲ²)

and the ratio can exceed 1. For example, if the numerator is 5 and the denominator is 4, then r² = 1 − 5/4, which is negative.

So in the regression-through-the-origin model we cannot use the conventional r² as a measure of goodness of fit. The measure suggested in the literature is the raw r²:

raw r² = (ΣXᵢYᵢ)² / (ΣXᵢ² ΣYᵢ²)

This raw r² satisfies the condition 0 ≤ r² ≤ 1; the conventional r² does not.
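A short numerical sketch of this point, on made-up data chosen so that the conventional formula actually goes negative (the data are hypothetical, only to illustrate the two formulas):

```python
import numpy as np

# Hypothetical data with a nonzero mean and a weak relation to X
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([5.0, 4.8, 5.3, 5.1, 5.6])

b2 = (X * Y).sum() / (X ** 2).sum()          # through-the-origin slope
resid = Y - b2 * X

r2_conventional = 1 - (resid ** 2).sum() / ((Y - Y.mean()) ** 2).sum()
r2_raw = (X * Y).sum() ** 2 / ((X ** 2).sum() * (Y ** 2).sum())

print(r2_conventional, r2_raw)               # conventional r2 is negative here; raw r2 lies in [0, 1]
```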

Thus compared to a model with an intercept, regression through the origin model has

many problems . So unless you have sufficient theoretical reasons , it is better to use the

conventional model . Also if your model is

Yᵢ = β1 + β2Xᵢ + uᵢ

then β̂1 will not be meaningful in many cases, and it can often simply be ignored. Suppose we test

H0: β1 = 0
H1: β1 ≠ 0

If this null hypothesis is accepted, then from a practical point of view the model is equivalent to a regression-through-the-origin model. So the point is that it is better to estimate a model with an intercept rather than a model without one, unless you have very strong theoretical reasons.

Scaling and unit of measurement

The unit in which data is expressed is very important. If we are not familiar with units of

measurement in which dependent and independent variables are given there is a possibility that

we will make wrong interpretations.

Example: Suppose that data on United States Gross Private Domestic Investment (GPDI) and Gross Domestic Product (GDP) are given from 1992 to 2005, and suppose that we regress GPDI on GDP to see how GPDI responds to GDP.

● Suppose that one researcher uses data in billions of Dollars.

● Another researcher uses data in millions of Dollars.

One billion = 1000 million

We can use data in any units, e.g. rupees, thousands or lakhs of rupees, millions, billions, as you like. If the data are in billions we can convert them into millions by multiplying by 1000.

Now let us see what are some of the issues associated with unit of measurement.

● If one researcher is using billions of Dollars, Another researcher uses data in millions of

Dollars. The question is, will the regression line be the same in both cases?

● The second question is, if the results are different, which result should one use ?

● Next question is, whether the unit of measurement of X and Y makes any difference in

the regression results obtained?

If so, what is the most sensible course of action? That is what should be the best course of

action to follow?

When we interpret the results the unit of measurement is very important.

Let us consider a regression model as 𝑌𝑖 = β1 + β2 𝑋𝑖 + 𝑢𝑖

Where 𝑌𝑖 is GPDI, 𝑋𝑖 is GDP.

Define Yᵢ* = W₁Yᵢ and Xᵢ* = W₂Xᵢ. We have defined two new variables, Yᵢ* and Xᵢ*. W₁ and W₂ are known as scale factors; they are constants. We may have W₁ = W₂ or W₁ ≠ W₂.

This means that data given in billions can be converted into millions or into some other unit of measurement. Since Yᵢ* and Xᵢ* are obtained by multiplying Yᵢ and Xᵢ by the scale factors W₁ and W₂, we can say that Yᵢ* and Xᵢ* are rescaled values of Yᵢ and Xᵢ. If Yᵢ and Xᵢ are in billions of dollars, then in millions Yᵢ* = 1000Yᵢ and Xᵢ* = 1000Xᵢ.

Now consider the regression in the rescaled variables,

Yᵢ* = β̂1* + β̂2*Xᵢ* + ûᵢ*

where Yᵢ* = W₁Yᵢ, Xᵢ* = W₂Xᵢ and ûᵢ* = W₁ûᵢ.
The rescaled variables.

We want the relationships between β̂1 and β̂1*, β̂2 and β̂2*, var(β̂1) and var(β̂1*), var(β̂2) and var(β̂2*), σ̂² and σ̂*², and r²ₓᵧ and r²ₓ*ᵧ*.

● β̂2* = (W₁/W₂) β̂2. If W₁ = W₂, then β̂2* = β̂2.

● var(β̂2*) = (W₁/W₂)² var(β̂2). If W₁ = W₂, i.e. the unit of measurement is changed in the same way for both Y and X, then the slope estimator and its variance are unchanged.

Now what about the intercept?

● β̂1* = W₁β̂1

● var(β̂1*) = W₁² var(β̂1), so its standard error is W₁ times the original standard error.

When we multiply Y and X by W₁ and W₂, the intercept is multiplied by W₁ (that is, W₁β̂1), where W₁ is the scale factor applied to Y.

The standard error of β̂1* is W₁ times the standard error of β̂1, so both β̂1 and its standard error are multiplied by W₁.

● r²ₓᵧ = r²ₓ*ᵧ*. That is, whatever the units of measurement, r² is not affected.

● σ̂*² = W₁²σ̂².

This is the case in which both X and Y are rescaled. Now suppose the X scale is unchanged and only the Y scale is changed: then the slope, the intercept, and their standard errors are all multiplied by W₁. If instead the Y scale is unchanged and only the X scale is changed, then the slope and its standard error are multiplied by 1/W₂. If W₁ = W₂, the slope and its standard error remain unchanged, while the intercept and its standard error are multiplied by W₁. In every case r² is not affected.
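These scaling rules are easy to verify numerically. A sketch on hypothetical income-consumption figures (the data and the helper function are my own, purely for illustration):

```python
import numpy as np

X = np.array([80.0, 100, 120, 140, 160, 180, 200, 220, 240, 260])   # hypothetical income
Y = np.array([70.0, 65, 90, 95, 110, 115, 120, 140, 155, 150])      # hypothetical consumption
w1, w2 = 1000.0, 1000.0            # e.g. both series converted from billions to millions

def ols(x, y):
    b2 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
    b1 = y.mean() - b2 * x.mean()
    return b1, b2

b1, b2 = ols(X, Y)
b1_star, b2_star = ols(w2 * X, w1 * Y)
print(b2_star, (w1 / w2) * b2)     # slope is scaled by w1/w2 (unchanged here)
print(b1_star, w1 * b1)            # intercept is scaled by w1
```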

Functional forms of regression model, log log model

In the previous classes we discussed the simple linear regression model, under the assumption that the model is linear in the parameters but may or may not be linear in the variables. In the real world we must consider many variations of this model: models that are nonlinear in the variables, and models that are nonlinear in appearance but can be transformed into linear form by a suitable transformation. A model that is nonlinear in the parameters but can be converted into linearity by a suitable transformation is known as an intrinsically linear model. We cannot expect the world to always be linear; it is characterized by various types of nonlinearity, and for empirical application we have to accommodate them. So we will consider several such cases and discuss how these models are applied to real economic data. The first model we consider is the log-linear model.

Log Linear Model.

Log-linear models are used to estimate elasticities. To explain, consider the exponential regression model

Yᵢ = β1 Xᵢ^β2 e^(uᵢ)

By taking natural logarithms, we get

ln Yᵢ = ln β1 + β2 ln Xᵢ + uᵢ

The difference between natural log and common log

The base of the common log is 10. The base of natural log is e. e=2.718.

As an example 𝑙𝑛𝑋 = 2. 3026𝐿𝑜𝑔𝑋

Properties of log

ln(ab) = ln a + ln b
ln(a/b) = ln a − ln b
ln(aᵏ) = k ln a
ln 1 = 0
ln(eᵃ) = a

This can be written as

ln Yᵢ = α + β2 ln Xᵢ + uᵢ, where α = ln β1

This model is linear in the parameters α and β2 and linear in the logarithms of the variables Y and X. As both the dependent and the independent variable are in log form, it is called a log-log model, a double-log model, or a log-linear model. It is a linear model; the only difference is that Y and X are expressed in logs, so we can estimate it by OLS, regressing ln Y on ln X to obtain the estimators α̂ and β̂2. The model is popular because it gives a direct estimate of the elasticity: the slope coefficient β2 is the elasticity.

The elasticity of Y with respect to X is the percentage change in Y divided by the percentage change in X:

β2 = (% change in Y) / (% change in X) = (ΔY/Y) / (ΔX/X)

So the model is popular because the slope coefficient is the estimate of the elasticity.

If Y = ln X, then

d(ln X)/dX = 1/X, or d(ln X) = dX/X

So the change in the log of a variable equals the proportionate change in that variable, that is,

(Xₜ − Xₜ₋₁)/Xₜ₋₁, or Xₜ/Xₜ₋₁ − 1

Here Xₜ − Xₜ₋₁ is the absolute change, (Xₜ − Xₜ₋₁)/Xₜ₋₁ is the relative change, and (Xₜ − Xₜ₋₁)/Xₜ₋₁ × 100 is the percentage change.

And to calculate change we generally use the proportionate change or the percentage change.

Because absolute change depends on the unit of measurement, whereas, proportionate change is

independent of the unit of measurement.

The proportionate change is therefore used as the measure of change in economics rather than the absolute change.

If Y is quantity demanded and X is price, then quantity demanded is a function of price, and if you estimate the model after converting Y into ln Y and X into ln X, the resulting β̂2 is nothing but the price elasticity of demand.

If you simply regress Y on X we get only the slope of the regression line. That is β2

will measure change in Y per unit change in X, If price increases by 1 unit, quantity demanded

decreases by β2 units.

In this model, β2 = (ΔY/Y)/(ΔX/X), which can be written as (ΔY/ΔX) × (X/Y).

If you want the slope itself, it is slope = β2 (Y/X).

Now let us see how these equations look graphically. Suppose that quantity demanded is a function of price; in log form, the demand curve is converted into a straight line.

The features of log log, double log or a log linear model.

● As we can see, in this model the elasticity β2 is assumed to be constant throughout, so it is also known as the constant elasticity model. The slope is β2 (Y/X), so as X increases, the slope decreases.

● Estimating ln Yᵢ = α + β2 ln Xᵢ + uᵢ gives α̂ and β̂2. α̂ is an unbiased estimator of α, and β̂2 is an unbiased estimator of β2. But β1, the parameter entering the original model, when estimated as β̂1 by taking the antilog of α̂, is a biased estimator. This is not a big problem, as in most cases the intercept is not very important.

Example for log log model:

ln EXPDₜ = −7.5417 + 1.6266 ln PCEₜ

Here EXPD is expenditure on durable goods and PCE is personal consumption expenditure; we want to know how expenditure on durables changes when personal consumption expenditure changes. The interpretation is that when personal consumption expenditure increases by 1%, expenditure on durable goods increases by about 1.63%.
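As a sketch of how such an elasticity is obtained in practice, the snippet below fits a log-log model by OLS on made-up demand data (the numbers are hypothetical; numpy is assumed):

```python
import numpy as np

# Hypothetical demand data: Y = quantity demanded, X = price
X = np.array([2.0, 3.0, 4.0, 5.0, 6.0, 8.0])
Y = np.array([50.0, 37.0, 30.0, 25.0, 22.0, 17.0])

lnX, lnY = np.log(X), np.log(Y)
b2, a = np.polyfit(lnX, lnY, 1)      # returns slope first, then intercept
print(b2)                            # estimated (price) elasticity, negative here
print(np.exp(a))                     # implied beta1; biased, as noted above
```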

Difference between percentage change and percentage point change

Suppose that the unemployment rate increases from 6% to 8%. Then the percentage point change is 2, while the percentage change is (8 − 6)/6 ≈ 33%.

Functional forms of Regression models, Semi log model

Now we consider the Semi-Log models, there are two types of Semi-Log models; namely log lin

model and lin log model. Log lin means Y is in log form and X is not, and lin log means X is in

log form and Y is not. So in this session we discuss the first model, the log lin model.

Log-lin models are extensively used to measure growth rates of variables such as population, GNP, money supply, and employment. To explain this model, suppose that we want to estimate the growth rate of consumption expenditure at the aggregate level using time series data.

Let, 𝑌𝑡 is the real expenditure on services at t (it can be the rate of growth of population at time

t, rate of growth of employment at time t, or rate of growth of money supply at time t etc…..)

𝑌0 is the initial value or value in the beginning of the sample period

Suppose that we consider the compounding formula

Yₜ = Y₀(1 + r)ᵗ

Where, r = compound, the rate of of growth of Y over time

𝑌𝑡 = Value of Y at time t

𝑌0 = Initial value

Taking log on both sides;

𝑙𝑛𝑌𝑡 = 𝑙𝑛 𝑌0 + t 𝑙𝑛(1+r)

This can be written as;

𝑙𝑛𝑌𝑡= β1+ β2 t

Adding a disturbance, will get it as;

𝑙𝑛𝑌𝑡= β1+ β2 t + 𝑢𝑡

This model like any other model, linear in parameters β1and β2

In this model, the dependent variable is 𝑙𝑛𝑌𝑡, and the independent variable is t

Suppose that you have data on expenditure on services from 2000 to 2015 as Y; you construct the time variable t = 1, 2, ... and regress ln Yₜ on t. That will give you β̂2.

This model is called a semi-log model, the reason is that only one variable is in log form, and the

other is not. For the descriptive purpose, we called it a log lin model, because the dependent

variable is in log form.

Properties of the model

In this model, as we can see,

β2 = Δ(ln Yₜ)/Δt = (ΔY/Y)/Δt

The slope coefficient measures the relative change in Y (the regressand) divided by the absolute change in X (here, time); that is, it measures the constant proportional change in Y for a given absolute change in X. If we multiply this relative change by 100 (100 × β2), we get the percentage change, or growth rate, in Y (the growth rate of population, of the money supply, and so on). In the literature 100 × β2 is known as the semi-elasticity of Y with respect to X.

respect to X.
In this model, β2 = (ΔY/Y)/ΔX = (1/Y)(ΔY/ΔX), so ΔY/ΔX = β2Y.

So, what is the slope? The slope ΔY/ΔX is:

Slope = β2 · Y

What is the elasticity? Elasticity = β2 · X.

Both the slope and the elasticity change from point to point, because both depend on Y and X. That is the log-lin model.

As an example, an estimated model is

ln EXSₜ (expenditure on services at time t) = 8.32 + 0.00705 t

This is based on quarterly data, so the quarterly growth rate is 0.705%, and the annual growth rate of expenditure on services is about 2.82% (0.705 × 4).

Now, in the literature a distinction is always made between the compound growth rate and the instantaneous growth rate. The coefficient β2 that we estimate is the instantaneous growth rate, i.e. growth at a point in time. It differs from the compound growth rate, which is measured over a period of time (say one month, six months, one year, etc.).

Now, if you want to calculate the compound growth rate, the procedure is

1. Take antilog of instantaneous β2 (β2 𝑖𝑠 𝑙𝑛(1 + 𝑟)) , that will give you (1+r)

2. Subtract 1 from it, (1+r)-1, that will give you r

3. Multiply this with 100, r*100, it will give you compound growth rate

This compound growth rate will be higher than the instantaneous growth rate. Because of the

compounding effect. When you estimate, you will get the β2, and it is the instantaneous growth

rate, if you want compound growth rate follow the above steps.
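A quick numeric illustration of the two growth rates, following the three steps above and using the slope from the quarterly example (Python, standard library only):

```python
import math

b2 = 0.00705                     # estimated slope from the quarterly log-lin trend
instantaneous = 100 * b2         # about 0.705% per quarter

r = math.exp(b2) - 1             # antilog of b2 gives (1 + r), so r = e^b2 - 1
compound = 100 * r               # about 0.707% per quarter, slightly higher
print(instantaneous, compound)
```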

In addition to this, there is a linear growth model.

Suppose that you regress Yₜ = β1 + β2t + uₜ, where Y is in its original units and t is time. If you simply regress Y on t, the estimated β̂2 measures the absolute change in Y when t changes by one unit. This is known as the linear trend model. Remember, though, that the linear trend model is generally less preferred: what is usually more interesting is the relative change, or growth rate, rather than the absolute change. For example, if the money supply increases from one year to the next, that annual change is β̂2 (if you are using yearly data).

Let us consider one more example, consider a model of wage and education;

Wage = β1+ β2 education+ 𝑢

Suppose that wage is measured in Dollar per hour, and education in years

Now, as you know if you estimate this model, the estimated model is;

estimated wage = −0.90 + 0.54 education

It says that as education increases by one year, the hourly wage increases by 0.54 dollars, whether education increases from 5 to 6 years or from 18 to 19 years. But this is unrealistic in most cases: in reality the wage increase for an additional year of education at a low level (say from 5 to 6 years) may be small, while at a high level (say from 18 to 19 years) it may be much larger, so the absolute increase is not uniform across education levels. What is more meaningful is the percentage change, so we convert the model into log-lin form:

ln(wage) = β1 + β2 education + u (wage is converted into log form)

When you convert wage into log, β2 gives the relative change in wage (multiply by 100 for the percentage change) as education increases by one year; say this percentage change is 5%. The percentage change is constant, but the absolute change in wage grows with the level of wage. So, for a wage equation we would expect a relation like the curve below; here the relation between wage and education is not linear.

When the model is estimated using log wage and education, β2 = 0.083 (about 8.3%); this is the growth rate of wages for each additional year of education, and the curve relating wage to education looks like this. This is known as the log-lin model: the dependent variable is in log form and the independent variable is not.

ln(Y) = β1 + β2 X + u
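As a small Python sketch of this interpretation, using the estimate β2 = 0.083: the approximate percentage change is 100 × β2, while the exact (compound) change per extra year of education is (e^β2 − 1) × 100:

```python
import numpy as np

b2 = 0.083   # estimated coefficient on education in ln(wage) = b1 + b2*education

approx_pct = b2 * 100                  # approximate: 8.3% per extra year
exact_pct = (np.exp(b2) - 1) * 100     # exact: about 8.65% per extra year

print(f"approximate change: {approx_pct:.2f}%   exact change: {exact_pct:.2f}%")
```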

The other possible shapes are:

When β2 > 0 (as we already considered)

When β2 < 0

Functional forms of regression models: lin-log and reciprocal models

Another model we consider is the lin-log model. Unlike the log-lin model, here it is the independent variable that appears in log form. We measure the absolute change in Y for a given proportionate (percentage) change in X. The model can be written as:

𝑌𝑖= β1+ β2 𝑙𝑛 𝑋𝑖+ 𝑢𝑖

The slope coefficient is β2 = ΔY / Δln(X): the absolute change in Y for a given proportionate change in X.

This can be written as ΔY = β2 Δln(X) ≈ β2 (ΔX/X), since Δln(X) ≈ ΔX/X for small changes; equivalently, β2 = (ΔY/ΔX) × X.

The slope of this model is β2 × (1/X).

The elasticity is β2 × (1/Y).

This equation states that the absolute change in Y equals β2 times the relative change in X.

As an example, if ΔX/X = 0.01 (a 1% change in X), then the absolute change in Y is ΔY = 0.01 × β2; that is, a 1% change in X produces a change of 0.01 β2 units in Y.

If β2 = 500, then ΔY = 0.01 × 500 = 5, which means that a 1% change in X produces an absolute change of 5 units in Y.

So it is very important to remember this: once the model is estimated, multiply the slope β2 by 0.01 to interpret the effect of a 1% change in X; otherwise the interpretation will be misleading.
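A minimal Python sketch of this interpretation, reusing the hypothetical value β2 = 500 from the text:

```python
# Lin-log model: Y = b1 + b2*ln(X), with the hypothetical slope from the text
b2 = 500.0

rel_change_X = 0.01            # a 1% (relative) change in X
delta_Y = b2 * rel_change_X    # absolute change in Y

print(f"A 1% change in X changes Y by {delta_Y:.1f} units")

# The slope dY/dX = b2/X itself falls as X grows, so the curve flattens out.
for X in (100.0, 1000.0, 10000.0):
    print(f"at X = {X:8.0f}: slope dY/dX = {b2 / X:6.2f}")
```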

As you can see, in this model the slope is β2 × (1/X), so naturally the slope shrinks (in absolute value) as X increases, and we get patterns like the ones below:

1. When β2 > 0: as X increases, Y increases at a diminishing rate (the slope decreases as X increases).

2. When β2 < 0: Y decreases as X increases, but the negative slope becomes smaller in absolute terms.

If you have data on Y and X, plot them, look at the pattern, and experiment with different functional forms. The question is: what is the application of a model like this? An important application is Engel's law of consumption (the Engel expenditure function), which says that as income increases, the proportion of income allotted to food decreases.

If your model is: food expenditure = β1 + β2 ln(Income) + u, and you plot it, you can see that as income increases, food expenditure increases at a diminishing rate.
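A minimal sketch of fitting such an Engel-type lin-log equation by OLS in Python, with purely hypothetical income and food-expenditure figures:

```python
import numpy as np

# Hypothetical household data: income and food expenditure (same currency units)
income = np.array([1000.0, 2000.0, 4000.0, 8000.0, 16000.0, 32000.0])
food   = np.array([ 300.0,  420.0,  520.0,  610.0,   700.0,   790.0])

# Lin-log model: food = b1 + b2*ln(income) + u
b2, b1 = np.polyfit(np.log(income), food, 1)

# A 1% rise in income raises food spending by roughly 0.01*b2 units
print(f"b2 = {b2:.1f}; a 1% rise in income adds about {0.01 * b2:.2f} to food spending")
```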

Another model we consider is the reciprocal model. Suppose that we have a model like:

Yi = β1 + β2 (1/Xi) + ui

This is an example of a reciprocal model. As you can see, the model is non-linear in X, because X enters as 1/Xi rather than as Xi, but it is linear in the parameters, so you can estimate it. If you have data on X and Y, the dependent variable is Y and the explanatory variable is 1/Xi; you regress Y on 1/X and obtain estimators of β1 and β2. Since β1 and β2 enter linearly, the model can be estimated by OLS; the only difference is that X appears in the model reciprocally.

Let us see the features of this model

1. As X increases indefinitely, β2 (1/Xi) approaches 0, and naturally Y approaches the value of β1, which is known as the asymptotic value. So, built into such a model there is an asymptote, or limit value, of β1.

2. Some of the possible shapes are:

Here β1 > 0 and β2 > 0: as X increases indefinitely, Y approaches the asymptotic value β1, and β1 is positive.

Here β1 < 0 and β2 > 0: as X increases indefinitely, Y approaches the asymptotic value β1, but now β1 is negative.

If β2 is positive, the curve decreases throughout. To see this, find the slope: the model Yi = β1 + β2 (1/Xi) + ui can be written as Yi = β1 + β2 Xi^(−1) + ui, and the slope is ΔY/ΔX = −β2 (1/X²), which is negative throughout when β2 > 0; if β2 is negative, the slope is positive throughout.

Here the elasticity can be written as:

Elasticity = (ΔY/ΔX) × (X/Y)
= −β2 (1/X²) × (X/Y)
= −β2 × 1/(X·Y)

Remember this: one celebrated application of the reciprocal model is the “Phillips curve” in macroeconomics. The point is that as X increases indefinitely, there is a limit value for Y. The model can be used in various other contexts as well.
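A minimal sketch of estimating a reciprocal model by OLS in Python, with hypothetical Phillips-curve style data (the unemployment and wage-inflation numbers are illustrative, not real series):

```python
import numpy as np

# Hypothetical data: X = unemployment rate (%), Y = wage inflation (%)
X = np.array([2.0, 3.0, 4.0, 6.0, 8.0, 10.0])
Y = np.array([9.5, 6.8, 5.4, 4.1, 3.4, 3.1])

# Reciprocal model Y = b1 + b2*(1/X) + u: linear in the parameters,
# so OLS on the transformed regressor 1/X works directly.
b2, b1 = np.polyfit(1.0 / X, Y, 1)

print(f"b1 = {b1:.2f} (asymptote as X grows), b2 = {b2:.2f}")
# As X increases indefinitely, b2*(1/X) -> 0 and Y approaches b1.
```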

In addition to these models, we can also consider various powers of X, say X², X³, etc., in order to estimate quadratic and cubic functions; such models will be discussed in the context of multiple linear regression models.

A few more points on functional forms, starting from the log-log model (already discussed in a previous session). If both Y and X are in log form, the slope coefficient β2 is directly the elasticity of Y with respect to X.

If, instead, we are considering a simple linear regression model in levels:

The slope is β2

The elasticity is β2 × (X/Y)

Now suppose that you are estimating a demand function like:

Y = β1 + β2 X + u (Y is the quantity demanded and X is the price)

You will get an estimate of the slope only. The elasticity formula is (ΔQ/ΔP) × (P/Q), which in our notation is (ΔY/ΔX) × (X/Y). So, if you are estimating a linear equation, the elasticity can be calculated as β2 × (X/Y), or equivalently β2 × (P/Q). You can compute the elasticity by plugging values into X and Y, but the problem is which X and which Y to use. The usual solution is to take the average X and the average Y; this gives you an estimate of the average elasticity.
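A small Python sketch of this average-elasticity calculation for a linear demand equation, with hypothetical price and quantity data:

```python
import numpy as np

# Hypothetical demand data: P = price, Q = quantity demanded
P = np.array([5.0, 6.0, 7.0, 8.0, 9.0, 10.0])
Q = np.array([100.0, 92.0, 85.0, 79.0, 74.0, 70.0])

# Linear demand Q = b1 + b2*P + u: the slope is constant,
# but the elasticity varies from point to point.
b2, b1 = np.polyfit(P, Q, 1)

# Average elasticity: evaluate the slope at the mean price and mean quantity
avg_elasticity = b2 * P.mean() / Q.mean()
print(f"slope b2 = {b2:.2f}, elasticity at the means = {avg_elasticity:.3f}")
```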

