
Sampling distribution

● We have defined the sampling distribution as the distribution of an estimator, and we explained it using the arithmetic mean as an example. To give a little more insight into a sampling distribution, let us consider an example.

● Now suppose that the ages in the population (X) are: 2, 4, 6, 8, 10.

There are only 5 observations. This is the entire population. Then

µ = (2 + 4 + 6 + 8 + 10)/5 = 6 years. This is the population parameter. Similarly,

σ² = Σ(X − µ)²/N = 40/5 = 8

σ = √8 = 2.83

X is a random variable, and our interest is to find out the population mean µ and the population variance σ². If the entire population is known, there is no problem: these quantities can easily be computed, and there is no need for statistical inference or hypothesis testing.

● Now suppose that we take samples of size n = 2; then the possible samples and their means are as follows:

Sample     X̄
(2, 4)     3
(2, 6)     4
(2, 8)     5
(2, 10)    6
(4, 6)     5
(4, 8)     6
(4, 10)    7
(6, 8)     7
(6, 10)    8
(8, 10)    9

The value of X̄ varies from 3 to 9. X̄ is used as an estimator of µ; here µ = 6, but in practice you will never know the value of the population parameter µ.

If your sample happens to be the first one (2, 4), then X̄ = 3, which is far away from µ. If your sample happens to be the last one (8, 10), then X̄ = 9, which is also far away from µ. If you are lucky and your sample is the fourth one (2, 10), then X̄ = 6, which is equal to the population mean (µ = 6).

Now let us see what the mean of these means is. It is denoted E(X̄):

E(X̄) = (3 + 4 + 5 + 6 + 5 + 6 + 7 + 7 + 8 + 9)/10 = 6 years

● That means E(X̄) = µ.

● That is, if X is distributed with mean µ, then X̄ is also distributed with mean µ.

Now, in what sense do we say that X̄ is a random variable? We know that X̄ takes the following values with certain frequencies; if we divide each frequency by the total frequency, we get the relative frequencies shown below.

X̄       f       Relative frequency f(X̄)
3       1       0.1
4       1       0.1
5       2       0.2
6       2       0.2
7       2       0.2
8       1       0.1
9       1       0.1
Total   10      1

● The total of the relative frequencies is 1. That is f(X̄). This relative frequency is interpreted as probability.

● Remember that X̄ takes many values, and we do not know the value of X̄ before the sample is observed. Since its value is unknown in advance, X̄ is a random variable, and for any random variable there is an associated PDF, written f(X̄). If you plot f(X̄) you get a curve (figure not reproduced here).

σ²x̄ = σ²/n

σ²x̄ = 8/2 = 4

SD(X̄) = √4 = 2

● So, if X is distributed with mean µ and variance σ², then X̄ is also distributed with the same mean µ and variance σ²/n.

● As you can see, X̄ is a random variable with associated probabilities, and its characteristics are given by E(X̄), σ²x̄ and SE(X̄). This is an example of a sampling distribution.
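To see this numerically in code, here is a short Python sketch (not part of the lecture; it assumes NumPy is available). The lecture lists the 10 samples drawn without replacement; the sketch instead enumerates all ordered samples of size n = 2 drawn with replacement, which is the iid case under which Var(X̄) = σ²/n holds exactly:

# A minimal added sketch: sampling distribution of X̄ for the 5-value population.
import itertools
import numpy as np

population = np.array([2, 4, 6, 8, 10])   # the five population ages
mu = population.mean()                     # population mean µ = 6
sigma2 = population.var()                  # population variance σ² = 8 (divide by N)

n = 2
samples = list(itertools.product(population, repeat=n))   # all 25 ordered samples with replacement
xbars = np.array([np.mean(s) for s in samples])           # X̄ for each sample

print("E(X̄)   =", xbars.mean())    # 6.0 -> equals µ
print("Var(X̄) =", xbars.var())     # 4.0 -> equals σ²/n = 8/2
print("SD(X̄)  =", xbars.std())     # 2.0

Running it prints E(X̄) = 6, Var(X̄) = 4 and SD(X̄) = 2, matching the values above.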

● In hypothesis testing we use this sampling distribution, though not directly: we convert it into the standard normal, t, F, chi-square, etc., depending on what we are trying to test. Just as we converted X into Z, we can convert X̄ into Z also.

● Now the distribution has the area properties described earlier:

µ ± σ corresponds to Z = ±1
µ ± 2σ corresponds to Z = ±2
µ ± 3σ corresponds to Z = ±3

● So, we convert X̄ into Z, and Z is used for hypothesis testing. X̄ has the sampling distribution, and Z is the standard normal distribution derived from that sampling distribution.

● Random variables and their PDFs are not what is used for testing; what is used for testing is the sampling distribution of estimators converted into test statistics. One test statistic is Z. Z is used if and only if the population variance (and hence the population standard deviation) is known. Otherwise the Z conversion is not possible.

● Let us derive these results mathematically.

X̄ = ΣX/n = (1/n)X₁ + (1/n)X₂ + ... + (1/n)Xₙ

And we assume that X₁, X₂, ..., Xₙ are iid random variables, that is, identically and independently distributed random variables with the same mean and variance. That is, each Xᵢ is distributed normally with the same mean and variance:

Xᵢ ∼ N(µ, σ²)

Then E(X̄) = (1/n)E(X₁) + (1/n)E(X₂) + ... + (1/n)E(Xₙ)
= (1/n)µ + (1/n)µ + ... + (1/n)µ
= (1/n)·nµ = µ

E(X̄) = µ

● So, this result is obtained if and only if X₁, X₂, ..., Xₙ are identically and independently distributed. And this condition is always satisfied in the case of random sampling.

● In a random sampling procedure the observations are selected independently, and they are also assumed to be identically distributed.

Var(X̄) = Var(ΣX/n)
= (1/n²)σ² + (1/n²)σ² + ... + (1/n²)σ²
= (1/n²)·nσ²
= σ²/n

● As we have stated earlier, 𝑉𝑎𝑟 ( 𝑋 + 𝑌) = 𝑉𝑎𝑟(𝑋) + 𝑉𝑎𝑟(𝑌) if and only if X and Y

are independent. Otherwise there is a covariance term also.

● Now X₁, X₂, ..., Xₙ are random variables, and we have written only the variances of X₁, X₂, ..., Xₙ; the covariance terms are omitted because, as X₁, X₂, ..., Xₙ are iid, the covariance terms disappear. That is why we write it as (1/n²)σ² + (1/n²)σ² + ... + (1/n²)σ², and the variances are assumed to be the same because X₁, X₂, ..., Xₙ are iid random variables.

● If X is distributed with mean µ and variance σ², then X̄ is also distributed with mean µ and variance σ²/n. This holds if the random variables are iid; otherwise it will not be satisfied.

● For hypothesis testing estimators are used, and estimators can be used if and only if we know the relationship between the population distribution and the distribution of the sample statistics.

● If we know the distribution of the population, then distribution of sample statistics can be

derived.

● Remember this, all the estimators are assumed to be random variables. And distribution of

the estimator is known as sampling distribution. Sampling distribution is used for hypothesis

testing. Remember this, sampling distribution is not directly used for hypothesis testing.

From sampling distribution we derive test statistics. The popular test statistics used for

hypothesis testing are Z, chi-square, F and t.

Chi-Square Distribution

● We have discussed:

X ∼ N(µ, σ²) —------- Normal distribution

Z = (X − µ)/σ —------- Standard normal distribution

X̄ ∼ N(µ, σ²/n) —------- Normal distribution (of the sample mean)

Z = (X̄ − µ)/(σ/√n) —------- Standard normal distribution

● In statistics we come across squared quantities such as

S² = Σ(X − X̄)²/(n − 1) …………………….. Sample variance

● To derive the sampling distribution of such squared quantities we use the chi-square distribution. Now let us see what the chi-square distribution is.

● If X is normal with mean µ and variance σ² (X ∼ N(µ, σ²)), Z is standard normal with mean 0 and variance 1 (Z = (X − µ)/σ). Then Z² = ((X − µ)/σ)² ∼ χ²(1); that is, Z² follows a chi-square distribution with 1 degree of freedom.

● So, the square of a standard normal variable is distributed as a chi-square variable with 1 degree of freedom.

● So, the definition is: the square of a standard normal is a chi-square with 1 degree of freedom.

● We can generalize this. If Z₁, Z₂, ..., Z_k are independently distributed standard normal variables, then the sum

Z = ΣZᵢ² ∼ χ²(k)

where k is the number of independent terms in the sum; as there are k independent squared standard normal variables, the degrees of freedom is k.

● If Z is standard normal, Z² is chi-square. So naturally, if Z₁, Z₂, ..., Z_k are independently distributed standard normal variables, then ΣZᵢ², that is, the sum of the squares of Z₁, Z₂, ..., Z_k, follows chi-square with k degrees of freedom:

ΣZᵢ² ∼ χ²(k)

This is the definition of chi-square.
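As an illustration, the following small Python sketch (an addition, assuming NumPy and SciPy are installed; the chosen k and replication count are arbitrary) checks this definition by simulation:

# A minimal added sketch: sum of k squared standard normals behaves like chi-square(k).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
k, reps = 5, 100_000

z = rng.standard_normal((reps, k))     # k independent standard normals per replication
chi_sq = (z ** 2).sum(axis=1)          # sum of squares -> should be chi-square(k)

print("simulated mean:", chi_sq.mean(), " theoretical:", k)       # E(chi2) = k
print("simulated var :", chi_sq.var(),  " theoretical:", 2 * k)   # Var(chi2) = 2k
print("simulated 95th pct:", np.quantile(chi_sq, 0.95),
      " chi2.ppf:", stats.chi2.ppf(0.95, df=k))                   # quantiles agree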

● Now, if Z₁, Z₂, ..., Z_k are independently distributed chi-square variables with kᵢ degrees of freedom, then ΣZᵢ also follows chi-square with Σkᵢ degrees of freedom.

● Let us consider it simply: if Z₁ and Z₂ are independently distributed chi-square variables with k₁ and k₂ degrees of freedom (meaning there are k₁ random variables in Z₁ and k₂ random variables in Z₂), then the sum Z = Z₁ + Z₂ also follows chi-square with k₁ + k₂ degrees of freedom. This property is known as the reproductive property of the chi-square distribution.

● So, to summarize: if X is normal, Z is standard normal and Z² is distributed as χ²(1). If Z₁, Z₂, ..., Z_k each follow the standard normal distribution, then ΣZᵢ² ∼ χ²(k). And if Z₁, Z₂, ..., Z_k are independently distributed chi-square variables with kᵢ degrees of freedom, then Z = Z₁ + Z₂ + ... + Z_k also follows a chi-square distribution with Σkᵢ degrees of freedom.

● If we expand this, Z = ((X₁ − µ)/σ)² + ((X₂ − µ)/σ)² + ... + ((Xₙ − µ)/σ)². This quantity is a sum of n independently distributed squared standard normal variables, so it has n degrees of freedom.

● The PDF of the chi-square distribution has the following shape (figure not reproduced here).

● For the χ² distribution, since it is a squared quantity it does not take negative values; it always takes non-negative values, so it starts at 0. For small degrees of freedom, say k = 2 or k = 5, it is highly skewed. But as the degrees of freedom increase, it becomes more and more like a normal distribution.

● Remember, to find the area properties of the χ² distribution you must always consider the degrees of freedom.

● If you go through any statistics book, you will see tables providing the areas of the χ² distribution for various degrees of freedom. We will consider one case.

● Suppose that the degrees of freedom is 20; then the χ² distribution has the following area properties.

● Then the probability that χ² is greater than 10.85 is 95%, the probability that it is greater than 23.83 is 25%, and the probability that it is greater than 39.99 is about 0.5%.

● In words: if the degrees of freedom is 20, the probability of getting a χ² value in excess of 39.99 is only about 0.5%; that is, about 99.5% of the area under the χ² curve lies between 0 and 39.99. You can calculate other areas for different degrees of freedom.
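The quoted areas can be reproduced with SciPy (an added sketch for illustration; chi2.sf returns the upper-tail probability):

# A minimal added sketch: upper-tail areas of the chi-square distribution with 20 df.
from scipy import stats

df = 20
for x in (10.85, 23.83, 39.99):
    print(f"P(chi2({df}) > {x}) = {stats.chi2.sf(x, df):.4f}")
# Prints roughly 0.95, 0.25 and 0.005 respectively.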

● For the χ² distribution, the mean (expected value) is k and the variance is 2k:

E(χ²) = k

Var(χ²) = 2k

● The χ² distribution is extensively used for testing hypotheses about the population variance; it is also used for testing association between two variables using contingency tables, that is, the test for independence, etc.

● Anyway, the χ² distribution is skewed, its mean is k and its variance is 2k, and its area properties depend on the degrees of freedom. We will not use estimators directly for hypothesis testing; for hypothesis testing, we develop test statistics based on estimators.

● The test statistic based on X̄ is Z = (X̄ − µ)/(σ/√n). Similarly, S² is the estimator of σ², but S² is not directly used for making inferences about the parameter σ²; using S² we develop a test statistic.

● The test statistic developed for testing the significance of the variance is (n − 1)S²/σ², and we can show that this quantity follows a χ² distribution with (n − 1) degrees of freedom:

(n − 1)S²/σ² ∼ χ²(n − 1)

where S² is an unbiased estimator of σ² based on a sample of size n, and there are (n − 1) degrees of freedom.
● And if we expand (n − 1)S²/σ² ∼ χ²(n − 1):

(n − 1)·[Σ(X − X̄)²/(n − 1)]/σ² = Σ(X − X̄)²/σ²

So this quantity follows χ² with (n − 1) degrees of freedom.

● And this quantity will be extensively used later

● There are n random variables, but they are not all independent; that is why we have (n − 1) degrees of freedom.


● The quantity (n − 1)S²/σ² is used to test hypotheses about the population variance.

● This is all about the chi-square distribution. The chi-square distribution is based on the square of a standard normal variable, and it is used to test the significance of the population variance.
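A minimal sketch of how this test statistic is used in practice is given below (added for illustration; the data and the hypothesised variance are made up, and SciPy is assumed):

# A minimal added sketch: chi-square test for a population variance, (n-1)S²/σ² ~ chi2(n-1).
import numpy as np
from scipy import stats

x = np.array([10.2, 9.8, 11.1, 10.6, 9.5, 10.9, 10.1, 9.7])   # hypothetical sample
sigma2_0 = 0.25                      # hypothesised population variance under H0

n = len(x)
s2 = x.var(ddof=1)                   # unbiased sample variance S²
chi_stat = (n - 1) * s2 / sigma2_0   # test statistic, chi-square(n-1) under H0
p_value = stats.chi2.sf(chi_stat, df=n - 1)   # upper-tail p-value (H1: variance > sigma2_0)

print("S^2 =", round(s2, 4), " chi2 statistic =", round(chi_stat, 3), " p =", round(p_value, 4))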

t Distribution

In econometrics or empirical analysis the most widely used distribution is the t distribution. This is because the Z distribution cannot be applied in most cases, since the population variance σ² is unknown. When σ is replaced by the sample standard deviation s (based on the unbiased estimator of the variance), the Z distribution is converted into the t distribution.

The properties of t differ from those of Z.

If Z₁ ∼ N(0, 1) is a standard normal variable and Z₂ ∼ χ²(k) is distributed independently of Z₁, then the variable

t = Z₁/√(Z₂/k) ∼ t(k)

The degrees of freedom of t(k) is the same as the degrees of freedom of the χ²(k) variable.

It is the ratio between two independently distributed random variables. Before deriving this

distribution, what are its area properties?

1. Like the standard normal distribution, the range of t is −∞ < t < +∞.

2. As the degrees of freedom increase indefinitely, the t distribution approaches the standard normal distribution.

3. There is a family of t distributions, one for each value of the degrees of freedom.

4. For small degrees of freedom the t distribution is flat. As the degrees of freedom increase, it approaches the normal distribution.

5. Like the standard normal distribution, the range of t is −∞ to +∞.

6. The t distribution is symmetric, but its flatness depends on the degrees of freedom.

7. Mean of t = 0; variance of t = k/(k − 2) for k > 2. In the case of the Z distribution, the mean is 0 and the variance is 1. The difference is with respect to the variance only.

8. If k = 10, var(t) = 10/8 = 1.25; if k = 30, var(t) = 30/28 ≈ 1.07; if k = 100, var(t) = 100/98 ≈ 1.02.

As the degrees of freedom increase, the variance of t approaches 1; that means t approaches the standard normal distribution. Comparing t with Z, the only difference is the variance.

How do we transform a random variable X and the sampling distribution of X̄ into t rather than Z?

t = Z₁/√(Z₂/k) ∼ t(k)

We assume that X ∼ N(µ, σ²). Then

Z₁ = (X̄ − µ)/(σ/√n) ∼ N(0, 1)

Z₂ = (n − 1)s²/σ² = Σ(X − X̄)²/σ² ∼ χ²(n − 1)

Dividing Z₁ by √(Z₂/(n − 1)) replaces σ with s and gives

t = (X̄ − µ)/(s/√n)

with (n − 1) degrees of freedom.

Suppose that the population σ is unknown. You cannot use Z as the test statistic, because that requires knowledge of σ. Whenever σ is replaced by its estimator s, the Z distribution is converted into the t distribution (whether your sample size is large or small).
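For illustration, a short Python sketch of this conversion (not from the lecture; the data are hypothetical and SciPy is assumed) computes t = (X̄ − µ₀)/(s/√n) by hand and compares it with SciPy's built-in one-sample t test:

# A minimal added sketch: one-sample t statistic when sigma is unknown.
import numpy as np
from scipy import stats

x = np.array([6.1, 5.8, 6.4, 6.0, 5.6, 6.3, 6.2, 5.9])   # hypothetical sample
mu0 = 6.0                                                 # hypothesised mean

n = len(x)
t_manual = (x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(n))   # (X̄ - µ0)/(s/√n)
t_scipy, p_scipy = stats.ttest_1samp(x, mu0)                 # same statistic, two-sided p

print("manual t =", round(t_manual, 4), " scipy t =", round(t_scipy, 4), " p =", round(p_scipy, 4))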

F distribution

If Z₁ and Z₂ are two independently distributed chi-square variables with k₁ and k₂ degrees of freedom, then

F = (Z₁/k₁) / (Z₂/k₂) ∼ F(k₁, k₂)

● As the degrees of freedom increase, it approaches a normal distribution.

● The F distribution is the test statistic used in ANOVA, econometrics, etc.

● The F distribution is used to test the overall significance of an estimated regression line.

Properties:

1. The F distribution is highly skewed for small degrees of freedom, and it approaches the normal distribution as the degrees of freedom increase.

2. E(F) = k₂/(k₂ − 2), defined for k₂ > 2

Var(F) = 2k₂²(k₁ + k₂ − 2) / [k₁(k₂ − 2)²(k₂ − 4)]

The variance is defined only if k₂ > 4.

3. t²(k) = F(1, k); as in the case of the t distribution, the degrees of freedom of F are also derived from the χ² distribution.
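The relation t²(k) = F(1, k) can be checked numerically, as in the following added sketch (SciPy assumed, with an arbitrary k and significance level):

# A minimal added sketch: the square of the t critical value equals the F(1, k) critical value.
from scipy import stats

k, alpha = 12, 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=k)      # two-sided t critical value
f_crit = stats.f.ppf(1 - alpha, dfn=1, dfd=k)  # upper-tail F(1, k) critical value

print("t_crit^2 =", round(t_crit ** 2, 4), "  F(1,k) crit =", round(f_crit, 4))  # equal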

All these distributions, Z, χ², t and F, are related to the normal distribution. The normal distribution is the basis of sampling distributions.

Even if the population is not normal, if the sample size is sufficiently large the central limit theorem applies, so the normality assumption is not at all a problem in the case of a large sample. We can conduct hypothesis testing using t, Z, F, χ², etc.

Degrees of Freedom

Suppose that you select 10 numbers. If no other conditions are stated, you are free to select all 10 numbers. Now add the condition that the sum of the 10 numbers must equal 100.

You can select 9 numbers freely, but the 10th number can only be the value that makes the sum of the 10 numbers equal to 100.

Degrees of freedom = 9 (if 9 numbers are known, the 10th is automatically determined).


S² = Σ(X − X̄)²/(n − 1) is an unbiased estimator of σ²:

E(S²) = σ²

E(X̄) = µ

E[Σ(X − X̄)²] = (n − 1)σ²

E[Σ(X − X̄)²/n] = (n − 1)σ²/n = σ² − σ²/n, which is a biased estimator.

E[Σ(X − X̄)²/(n − 1)] = [(n − 1)/(n − 1)]σ² = σ²

Then S² is an unbiased estimator of σ².

Degrees of freedom is an adjustment used to make an estimator unbiased.

df = n − k, where k is the number of constraints or restrictions and n is the sample size.
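The effect of the divisor can be seen in a small simulation (an added sketch with made-up parameters, assuming NumPy): dividing Σ(X − X̄)² by n underestimates σ², while dividing by n − 1 does not:

# A minimal added sketch: the n-1 divisor removes the bias of the sample variance.
import numpy as np

rng = np.random.default_rng(1)
mu, sigma2, n, reps = 0.0, 4.0, 5, 200_000

x = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
ss = ((x - x.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)   # Σ(X - X̄)² per sample

print("E[SS/n]     ~", round((ss / n).mean(), 3), "  theory:", sigma2 * (n - 1) / n)
print("E[SS/(n-1)] ~", round((ss / (n - 1)).mean(), 3), "  theory:", sigma2)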

Properties of Estimators

As we have discussed earlier, our aim is to estimate population parameters. But the population

parameter is generally unknown as the entire population is unknown. So in any situation we

have only a sample, and using the sample we have to estimate the population parameters. The arithmetic mean is an estimator of the population parameter µ, the sample variance is an estimator of the population variance, and the OLS estimators β̂₁ and β̂₂ are estimators of the regression parameters. For any parameter there will be many estimators. For the population parameter µ, the possible estimators are the sample mean, the sample median, the mode, etc. As an estimator of the population variance (dispersion) we can use the standard deviation, mean deviation, range, coefficient of variation, etc. Similarly, as estimators of the regression coefficients we have the OLS estimators, the method of maximum likelihood, instrumental variable estimators, method of moments estimation procedures, and many other procedures. So the question is: among the various estimators of a parameter, which estimator will we select, or what are the criteria for selecting estimators? To select one estimator from an array of estimators, we consider the properties of estimators and select as our estimator of the population parameter the one with most of the desirable properties. The properties of estimators are classified into two categories:

1. Small sample properties

2. Large sample properties/ asymptotic properties

Small sample properties will be satisfied in small samples as well as in large samples, so small sample properties are the most desirable properties. On the other hand, large sample properties will be satisfied only in large samples; they will not be satisfied in small samples. The discussion of the properties of estimators is based on the assumption that an estimator is a random variable and has a PDF. That PDF is known as the sampling distribution.

Small sample properties

If the small sample properties are satisfied we will not go for large sample properties

1. Unbiasedness

Suppose that the parameter of interest is θ and the estimator is θ̂. We say that the estimator θ̂ is an unbiased estimator of θ if the expected value of θ̂ is equal to the parameter θ:

E(θ̂) = θ

E(θ̂) − θ = 0

bias(θ̂) = E(θ̂) − θ

As we discussed in the context of the sampling distribution, unbiasedness is a repeated-sample property: if we take all possible samples and the expectation of the estimator equals the parameter, it is unbiased.

Example: suppose we have two estimators, θ̂₁ and θ̂₂, of θ (see the figure, not reproduced here).

So unbiasedness is a repeated-sample property. Unbiasedness by itself is not a very useful property, because even if an estimator is unbiased it says nothing about the behaviour of the estimator in a single sample. In the numerical example we considered earlier, µ = 6, but X̄ ranges from 3 to 9; in any given situation we take only one sample, and X̄ may be 3, far away from µ, or it may be 9, again far away from µ. So unbiasedness by itself is not a useful property, but it becomes useful when combined with another property, namely minimum variance. So the next small sample property is minimum variance.

2. Minimum variance / Best estimator

θ̂₁ is the minimum variance estimator of θ if the variance of θ̂₁ is less than or at most equal to the variance of another estimator of θ, say θ̂₂. θ̂₁ and θ̂₂ need not be unbiased; we consider only the minimum variance property here. That is, of two estimators of θ, the one with the smallest variance is the minimum variance estimator. In notation:

E[θ̂₁ − E(θ̂₁)]² ≤ E[θ̂₂ − E(θ̂₂)]²

Var(θ̂₁) ≤ Var(θ̂₂)

As in the case of unbiasedness, minimum variance by itself is not a very useful property. The reason is that even though θ̂₁ may be the minimum variance estimator, its distribution may be far away from the population parameter; or consider a constant used as an estimator: a constant has no variability, but it cannot be a sensible estimator of the parameter θ. So by itself minimum variance is not a very useful property. It becomes useful when it is combined with unbiasedness. The combined property is the minimum variance unbiased estimator.

3. Minimum variance unbiased / Best unbiased / Efficient

This is a combination of unbiasedness and minimum variance. If θ̂₁ and θ̂₂ are two unbiased estimators of θ, then θ̂₁ is said to be minimum variance unbiased, or best unbiased, if

Var(θ̂₁) ≤ Var(θ̂₂)

Here we compare only estimators that are unbiased. Thus the property is a combination of unbiasedness and minimum variance.

First condition: E(θ̂₁) = θ = E(θ̂₂)

Second condition: Var(θ̂₁) ≤ Var(θ̂₂)

If these two conditions are satisfied, then we can say that θ̂₁ is the minimum variance unbiased, best unbiased, or efficient estimator.

4. Linear estimator

An estimator is linear if it is a linear combination of the sample values. An estimator θ̂ is said to be a linear estimator of θ if it is obtained as a linear combination of the sample values.

Example: consider the estimator X̄:

X̄ = ΣX/n = (1/n)X₁ + (1/n)X₂ + ... + (1/n)Xₙ

By giving equal weights k = 1/n to X₁, X₂, ..., Xₙ, X̄ can be written as kX₁ + kX₂ + ... + kXₙ, so it is a linear combination of the random variables X₁, X₂, ..., Xₙ with k serving as the weight. So X̄ is a linear estimator. An estimator is linear if it is a linear function of the sample observations.

5. Best Linear Unbiased Estimator (BLUE)

This is a combination of the four properties above. The estimator θ̂ is said to be BLUE if it has minimum variance in the class of all linear unbiased estimators. This BLUE property is very important in econometrics, and for the OLS estimators it is established by the Gauss-Markov theorem.

6. Minimum mean square error estimator (MSE estimator)

This property is not extensively used in econometrics. Consider two estimators of θ, say θ̂₁ and θ̂₂. Suppose θ̂₁ is unbiased and θ̂₂ is biased. Which estimator should we select? If unbiasedness is used as the criterion, θ̂₁ would be selected, but it has a large variance, so the uncertainty is very high. θ̂₂ is slightly biased, but its variance is low, so the chance of getting a value far away from the parameter is low. So there is a trade-off between bias and variance, and this property says that if there is a trade-off between bias and variance, select the estimator with the minimum mean square error.

Properties of Estimators (2)

Large sample properties

We consider these properties only when small sample properties are not satisfied.

1. Asymptotic unbiasedness

An estimator θ̂ is said to be an asymptotically unbiased estimator of θ if

lim (n→∞) E(θ̂ₙ) = θ

We say that θ̂ is an asymptotically unbiased estimator of θ if θ̂ becomes unbiased as the sample size increases indefinitely. Remember, it is a repeated-sample property like unbiasedness; in small samples the estimator will be biased, but as the sample size increases indefinitely it becomes an unbiased estimator.

Example: suppose

S² = Σ(X − X̄)²/n

E(S²) = ((n − 1)/n)σ² = σ² − σ²/n

σ²/n → the bias term

So in a small sample S² is a biased estimator of σ². As the sample size increases indefinitely the bias term disappears, so we say that it is asymptotically unbiased.

2. Consistency

This is the minimum requirement an estimator must satisfy to be useful. If an estimator does not satisfy at least consistency, then we cannot use it for estimation and inference.

θ̂ is said to be a consistent estimator of θ if θ̂ approaches the true value θ as the sample size increases, or as the sample gets larger and larger.

Consider the arithmetic mean:

E(X̄) = µ

Var(X̄) = σ²/n

When n approaches infinity, the variance of X̄ approaches 0.

Example

When the sample size is small, the estimator θ̂ is biased; not only biased, its variance is also very high. As the sample size increases, the bias decreases and the variance also decreases. When the sample size is 100, the entire bias disappears and E(θ̂) = θ. Not only does the bias disappear, the variance also approaches 0. In fact, if an estimator is consistent, when the sample size increases indefinitely the entire distribution collapses to a single point; we say that the distribution degenerates, and we say that the estimator is consistent. So consistency requires that any bias in small samples must disappear and the variance must also approach 0. Asymptotic unbiasedness is about what happens to the bias only; it says nothing about what happens to the variance. Consistency requires both: the bias disappears and the variance disappears; only then can we say that an estimator is consistent. Consistency can be written technically as

lim (n→∞) Pr{|θ̂ − θ| < δ} = 1, for any δ > 0

We say that plim(θ̂) = θ, where plim means probability limit.

Consider a graph (not reproduced here): if the distribution collapses in this way, the absolute difference between θ̂ and θ becomes very small. Then we say that the estimator is consistent. We write it as

plim (n→∞) θ̂ = θ

where plim means probability limit.

An estimator is consistent if the distribution of θ̂ not only collapses to θ but its variance also decreases; in the limit, as n approaches infinity, the distribution of θ̂ collapses to the single point θ and has zero spread, or zero variance. Then we say that θ̂ is a consistent estimator of θ.
Two sufficient conditions for consistency are:

1) lim (n→∞) E(θ̂ₙ) = θ

2) lim (n→∞) Var(θ̂ₙ) = 0

The arithmetic mean is an example of a consistent estimator.
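A short simulation (an added sketch, NumPy assumed, with arbitrary µ and σ) shows this collapsing behaviour of X̄ as n grows:

# A minimal added sketch: the sampling distribution of X̄ collapses towards µ as n grows.
import numpy as np

rng = np.random.default_rng(2)
mu, sigma = 5.0, 2.0

for n in (10, 100, 1_000, 10_000):
    xbars = rng.normal(mu, sigma, size=(20_000, n)).mean(axis=1)
    print(f"n={n:6d}  E(xbar)~{xbars.mean():.4f}  Var(xbar)~{xbars.var():.6f}  (sigma^2/n={sigma**2/n:.6f})")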


There are many situations in which we may not be able to assess the small sample properties. In such cases we consider the large sample properties, and the large sample property we consider is consistency. The probability limit has some properties:

Suppose that θ̂ is a consistent estimator of θ, and h(θ̂) is a continuous function of θ̂. Then

plim (n→∞) h(θ̂) = h(θ)

For example, if θ̂ is a consistent estimator of θ, then we can say that

● 1/θ̂ is a consistent estimator of 1/θ

● log(θ̂) is a consistent estimator of log θ

● plim (n→∞) b = b, where b is a constant

Suppose that θ̂₁ and θ̂₂ are consistent estimators of θ₁ and θ₂; then

plim(θ̂₁ + θ̂₂) = plim(θ̂₁) + plim(θ̂₂)

plim(θ̂₁θ̂₂) = plim(θ̂₁)·plim(θ̂₂)

plim(θ̂₁/θ̂₂) = plim(θ̂₁)/plim(θ̂₂)

The last two properties do not hold for the expectations operator.
3. Asymptotic efficiency

Let θ̂ be a consistent estimator of θ. The variance of the asymptotic distribution of θ̂ is known as the asymptotic variance. If θ̂ is consistent and its asymptotic variance is smaller than the asymptotic variance of all other consistent estimators, then we say that θ̂ is the asymptotically efficient estimator of θ.
4. Asymptotic normality

If X ∼ N(µ, σ²), then X̄ ∼ N(µ, σ²/n). This is not a theorem; it will always be satisfied. Now, we have seen the central limit theorem: even if X is not normal with mean µ and variance σ²,

X̄ ∼ N(µ, σ²/n) as n → ∞

This property is known as asymptotic normality. It is a useful property in many cases: if you are not sure about the nature of the distribution of the population under consideration, you can use this property for hypothesis testing, provided you have a sufficiently large sample. It is a property that holds when the sample size is sufficiently large.

Among all the large sample properties, consistency is the most important; we use it when the small sample properties are either unknown or cannot be derived.

Assumptions of Classical Linear Regression Model

In the previous class we derived the OLS estimators. For the simple linear regression model the parameters of interest are β₁ and β₂, and the estimators are as follows:

● β̂₁ = Ȳ − β̂₂X̄

● β̂₂ = Σxᵢyᵢ / Σxᵢ²
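For illustration, a short Python sketch (not part of the lecture; the X and Y numbers are made up) computes these OLS estimates directly from the deviation-form formulas and cross-checks them with np.polyfit:

# A minimal added sketch: OLS slope and intercept from the deviation-form formulas.
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])   # hypothetical sample

x = X - X.mean()                 # deviations from the mean
y = Y - Y.mean()

beta2_hat = (x * y).sum() / (x ** 2).sum()      # slope: Σxy / Σx²
beta1_hat = Y.mean() - beta2_hat * X.mean()     # intercept: Ȳ - slope·X̄

print("beta1_hat =", round(beta1_hat, 4), " beta2_hat =", round(beta2_hat, 4))
print("np.polyfit:", np.polyfit(X, Y, 1))       # [slope, intercept] as a cross-check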

Now if our purpose is estimation only, then OLS is enough: by applying the principle of OLS, β̂₁ and β̂₂ are obtained. But in econometrics our objective is not estimation only; we have to use these estimators to make inferences about the regression coefficients β₁ and β₂.

For that we have to make assumptions about the model.

PRF stochastic form is 𝑌𝑖=β1+β2 𝑋𝑖+𝑢𝑖.

We know that 𝑌𝑖 depends on 𝑋𝑖 and 𝑢𝑖.

So if we have to make inferences about β₁ and β₂, we have to make assumptions about how Yᵢ is generated. As Yᵢ depends on Xᵢ and uᵢ, we have to make assumptions about how Xᵢ and uᵢ are generated.

Only then can we use the estimators to make inferences about the population parameters. What we discuss below are the assumptions of the classical linear regression model, the assumptions underlying the method of OLS. While discussing the properties of estimators we noted that the properties are classified into statistical properties and numerical properties. Numerical properties, which we have already discussed, hold as a result of the use of OLS regardless of how the data are generated. Statistical properties of the estimators hold only after making assumptions about how the data are generated. So in this discussion we are considering how the data (Yᵢ, or Xᵢ and uᵢ) are generated. These assumptions are important, and the properties of the estimators depend on the assumptions we make about Xᵢ and uᵢ, and hence Yᵢ.

So the classical linear regression model, the Gaussian linear regression model, the method of OLS, makes seven assumptions about how Xᵢ and uᵢ, and hence Yᵢ, are generated.

Now let us see what the assumptions are:

● Linear in parameters.

That is the linear regression model. We have defined linearity as linearity in parameters

and linearity in variables. And in regression analysis in econometrics linearity is defined

as linearity in parameters.

So our model 𝑌𝑖=β1+β2 𝑋𝑖+𝑢𝑖 is linear in parameters. As we have discussed this

property in detail there is nothing more to discuss here. If the model is not linear in

parameters we have to go for nonlinear least squares.

This is the first assumption underlying the method of ordinary least squares, or the Gaussian linear regression model.

● Fixed X values, or X values assumed to be independent of the stochastic error term.

The assumption in the form of an equation is:

cov(Xᵢ, uᵢ) = 0

This assumption can be stated in different ways.

1. X is assumed to be fixed in repeated samples; that is the fixed-in-repeated-samples assumption. But suppose that X is not fixed in repeated samples; that is, like Y, X also varies from sample to sample.

2. cov(Xᵢ, uᵢ) = 0: the correlation between Xᵢ and uᵢ is zero.

Now the question is what the implication of this assumption is. Let me first write an equation for the covariance:

cov(Xᵢ, uᵢ) = E{[Xᵢ − E(Xᵢ)][uᵢ − E(uᵢ)]}

As we will see in the next assumption, E(uᵢ) is assumed to be zero (the implications we will see later). So it can be written as

E{[Xᵢ − E(Xᵢ)]uᵢ}

= E[Xᵢuᵢ − E(Xᵢ)uᵢ]

= E(Xᵢuᵢ) − E(Xᵢ)E(uᵢ), and since E(uᵢ) = 0,

= E(Xᵢuᵢ)

So the assumption requires that the expected value of Xᵢuᵢ is 0.

Now, if we assume that Xᵢ is fixed in repeated samples, then Xᵢ can be treated as a constant from sample to sample. Then we can write

E(Xᵢuᵢ) = Xᵢ·E(uᵢ) = 0

So if X is assumed to be fixed in repeated samples, the assumption is satisfied automatically.

But if X is stochastic like uᵢ, then we have to ensure that cov(Xᵢ, uᵢ) = 0, because this condition will not be automatically satisfied; we have to ensure that E(Xᵢuᵢ) = 0.

As a first approximation, if this assumption is not satisfied, that is, if Xᵢ is correlated with uᵢ, we will get biased estimators. So this assumption is very important:

cov(Xᵢ, uᵢ) = 0
Now, the assumption that X is fixed in repeated samples is a highly unrealistic assumption: if Yᵢ is stochastic, Xᵢ is also stochastic. But the question is why it is assumed that X is fixed in repeated samples. As we will see later, this assumption is used at an introductory level to derive some results easily: if X is assumed to be fixed in repeated samples, it is easy to derive the estimators and their properties. That is why it is assumed to be fixed in repeated samples.

Also, the fixed-in-repeated-samples assumption is not always unrealistic; there are many cases in which it is valid. Suppose that we consider a model in which salary is a function of education: Salary = f(Education).

Suppose that education is given in years:

Years (X)    Salary (Y)
10           ...
12           ...

Suppose that you take one sample; you will be able to find many people with different levels of salary for the same level of education, with 8 years of education, 9 years of education, etc. In this case the level of education remains the same, but salary varies from sample to sample. So this is a realistic case.

Consider another case: suppose that you are applying certain amounts of fertilizer and studying the relation between fertilizer and yield.

Fertilizer (X)    Yield (Y)
10                ...
20                ...
30                ...

Now if you apply 10 kilos of fertilizer on different plots you will get different levels of yield; yield will vary from one plot to another. If the plots are randomly selected, the procedure is also correct. So the fixed-in-repeated-samples assumption is not highly unrealistic. Anyway, for the time being just keep in mind that cov(Xᵢ, uᵢ) = 0.
( )
3. E 𝑈𝑖/𝑋𝑖 = 0

( )
The assumption is that given 𝑋𝑖 values E 𝑈𝑖 = 0

As we have seen in an earlier time

34
There is PRF. So corresponding to 𝑋𝑖 is X1, X2, X3 etc. The PRF passes through the conditional

mean values.Now the deviation of a Y from its mean is U. So there are so many positive and

( )
negative U values, the assumption simply says that E 𝑈𝑖/𝑋𝑖 , that is given 𝑋𝑖 = 𝑋1𝑈1 has a

( )
distribution with E(𝑈)=0, 𝑈2 has distribution with E 𝑈2 =0, 𝑈3 has distribution with E 𝑈3 =0 ( )
like that.

Actually this is not a very important assumption, not an assumption with impact.As long as there

is an intercept in this model this assumption will be automatically satisfied or stated differently

(
we can always rewrite our model in such a way that E 𝑈𝑖/𝑋𝑖 = 0. )
As an example, suppose that E(uᵢ) = α₀. Now consider the model Yᵢ = (β₁ + α₀) + β₂Xᵢ + (uᵢ − α₀). If you take expectations you will get E(Yᵢ) = β₁ + α₀ + β₂Xᵢ, since E(uᵢ − α₀) = 0. So the only consequence is that the intercept becomes β₁ + α₀: we get a biased estimator of the intercept. That is a problem, but not a serious one, as the intercept is not very important in most cases. So this is not an assumption with much impact, but it is assumed to hold in all cases. Provided that the model is correctly specified and all the important variables are included, we can conclude that E(uᵢ | Xᵢ) = 0 is always satisfied.

There is one more point: if the conditional expectation of one random variable given the other is equal to zero, the random variables are independent. So if E(uᵢ | Xᵢ) = 0, then cov(Xᵢ, uᵢ) = 0.

cov(Xᵢ, uᵢ) = 0 does not by itself mean that the variables are independent, but the implication does hold in the case of conditional distributions: if E(Y | X) = E(Y), that is, the conditional distribution equals the marginal distribution, then the variables Y and X are independent.

If E(uᵢ | Xᵢ) = 0, then it follows that E(Y | Xᵢ) = β₁ + β₂Xᵢ. So the conditional mean of Y is β₁ + β₂Xᵢ, and the unconditional mean of Y is different from the conditional mean. So the condition E(uᵢ | Xᵢ) = 0 also implies the condition E(Y | Xᵢ) = β₁ + β₂Xᵢ.

The conditional expectation of one random variable given the other being equal to zero means that the random variables are independent. This is not true in the case of covariance: cov(X, Y) = 0 does not mean that X and Y are independent, because X and Y may be nonlinearly related. But remember, E(Y | X) = 0 means X and Y are independent.

So if E(uᵢ | Xᵢ) = 0, it means that u and X are independent; this is not true for zero covariance between two random variables.

The assumption that X is fixed in repeated samples is only a simplification used in introductory textbooks to derive some of the results easily. In the real situation X is stochastic like Y; it means that from sample to sample X varies like Y. Then the requirement is cov(Xᵢ, uᵢ) = 0. If this assumption is violated we will get biased estimates.

4. Homoskedasticity. In some books it is written as homoscedasticity; the now-accepted spelling is homoskedasticity.

Homo = equal
Skedasticity = spread

So homoskedasticity means equal spread.

The assumption is that

var(uᵢ | Xᵢ) = E[uᵢ − E(uᵢ | Xᵢ)]² = E(uᵢ² | Xᵢ) = σ²

because E(uᵢ | Xᵢ) = 0 by assumption, and the variance is a constant, sigma squared. That means the variance of uᵢ given different values of Xᵢ is a constant number σ².

In the form of a graph:

This is an example of homoskedasticity, or equal spread: the distributions of u₁, u₂, u₃ all have mean zero and the same variance. If σ² of u₁ is 10, then σ² of u₂ is 10, of u₃ is 10, and so on.

Violation of the assumption of homoskedasticity is known as heteroskedasticity.

Hetero = unequal
Skedasticity = spread

So: unequal variance.

Heteroskedasticity is written as var(uᵢ | Xᵢ) = σᵢ². The subscript i represents the fact that the variance of u varies from one observation to another.

σ² = homoskedasticity
σᵢ² = heteroskedasticity

Example of heteroskedasticity: (figure not reproduced)

There is one more meaning to this assumption:

Homoskedasticity means var(uᵢ | Xᵢ) ≠ f(Xᵢ): whether X increases or decreases, the variance of uᵢ remains the same.

Heteroskedasticity means var(uᵢ | Xᵢ) = f(Xᵢ).

There is one more point here. We know that

var(Yᵢ) = E[Yᵢ − E(Yᵢ)]² = E[β₁ + β₂Xᵢ + uᵢ − β₁ − β₂Xᵢ]² = E(uᵢ²) = σ²

So the point is that the variance of uᵢ is the same as the variance of Yᵢ. If uᵢ is homoskedastic, Yᵢ is also homoskedastic. But in the case of the mean, the mean of uᵢ is zero while the mean of Yᵢ is β₁ + β₂Xᵢ; that is the difference.

The mean of Yᵢ is different from the mean of uᵢ; or, put differently, if we subtract E(Yᵢ) from Yᵢ we get uᵢ, that is, uᵢ = Yᵢ − E(Yᵢ); we change the origin. Then the mean becomes zero, but the variance is not affected. The variance of Yᵢ is the same as the variance of uᵢ.

This is the homoskedasticity assumption. There is another way to express this:

Now consider the graph, a three-dimensional graph (not reproduced here). f(uᵢ) is the PDF of uᵢ. E(Yᵢ) = β₁ + β₂Xᵢ is the relationship between consumption and income, the PRF, evaluated at X = X₁, X = X₂, X = X₃, and so on. The PRF starts from the Y axis, and as the relation is assumed to be positive, the distance between the X axis and the PRF increases. The third dimension shows f(uᵢ), and this is the case of homoskedasticity: the distribution of uᵢ is the same in all cases.

Heteroskedasticity: the case where the variance changes (for example, decreases) as X changes (figure not reproduced).

5. No autocorrelation between disturbances.

It means that, given any two X values Xᵢ and Xⱼ, the correlation between uᵢ and uⱼ is zero. In notation:

cov(uᵢ, uⱼ) = E{[uᵢ − E(uᵢ | Xᵢ)][uⱼ − E(uⱼ | Xⱼ)]} = E[(uᵢ | Xᵢ)(uⱼ | Xⱼ)] = 0

It means that, given any X values, the covariance between uᵢ and uⱼ is zero. It simply means that given any X₁ and X₂, the covariance between u₁ and u₂ is zero.

If this assumption is violated we have a case of autocorrelation, that is, cov(uᵢ, uⱼ) is not equal to zero. One implication of this assumption: suppose that we are using quarterly data, and suppose that there is a strike in the first quarter. Then production will be disrupted. If its effect dies out in the first quarter itself, then there is no autocorrelation. But if its effect persists into the 2nd, 3rd quarter, etc., then there is autocorrelation. The implications of this assumption, its violation, consequences, etc. will be considered later.

6. n > k

The number of observations must be greater than the number of parameters to be estimated. This is important in the sense that, suppose we have a single pair of X, Y values:

Y      X
10     15

With just these two numbers we cannot estimate a model with the parameters β₁ and β₂; there must be more than one observation.

In a technical way the assumption says that:

● if n is not greater than k we will not have sufficient degrees of freedom.

● If n = k, the degrees of freedom become zero, because degrees of freedom are defined as n − k.

7. There must be some variability in X.

That means, given a set of X, Y values, not all the X values should be the same in a given sample; or, stated differently, Σ(X − X̄)² > 0.

There are various implications of this assumption. We have β̂₂ = Σxᵢyᵢ/Σxᵢ². If all the X values are identical, then it follows that Σ(Xᵢ − X̄)² = 0 and estimation is not possible. The second implication is that, irrespective of the values of Y, the variability in Y cannot then be attributed to X: if X remains constant while Y changes, the change in Y is not due to X. X can explain the variability in Y if and only if the X values are not all the same in a given sample. Technically, the variance of X, that is Σ(X − X̄)²/(n − 1), must be a positive number.

Now, when we go beyond the two-variable model, we add an 8th assumption:

8. There should be no perfect multicollinearity.

This assumption is not relevant for the two-variable model. Suppose that we have Yᵢ = β₁ + β₂X₂ᵢ + β₃X₃ᵢ + uᵢ. We have two explanatory variables, so there is a possibility of correlation between X₂ and X₃. For example, if Y is consumption, X₂ is income and X₃ is wealth, there is a possibility of correlation between income and wealth. This assumption says that there should be no perfect collinearity between X₂ and X₃. Perfect collinearity means that the correlation between X₂ and X₃ is +1 or −1, and the assumption is that it must not be.

So these are the assumptions of the classical linear regression model.

All these assumptions relate to the PRF, not the SRF. But some of the assumptions of the PRF are duplicated in the SRF as well.

Assumptions related to SRF

1. For example, E(uᵢ | Xᵢ) = 0 (assumption number 3). For the SRF we have Σûᵢ = 0, i.e. the residuals have zero mean (ū̂ = 0).

2. cov(uᵢ, Xᵢ) = 0 relates to the PRF; in the context of the SRF, ΣûᵢXᵢ = 0.

These are the two conditions imposed on the data when deriving the OLS estimators. So these two assumptions are duplicated in the SRF.

In addition, as we proceed we will consider some other assumptions also, such as that the model is correctly specified (there should be no specification errors), that there should be no errors of measurement, and other related issues.

You will not see all of these assumptions in all books; some assumptions, like assumptions 6 and 7, are taken for granted and will not be explicitly stated in many textbooks.

Also, in this model the assumptions are stated in terms of u, but you will also see the assumptions stated in terms of Y. We can state E(uᵢ | Xᵢ) = 0 as E(Y | Xᵢ) = β₁ + β₂Xᵢ. Similarly, var(uᵢ | Xᵢ) = σ² can also be stated as var(Yᵢ | Xᵢ) = σ². So we can specify the assumption of homoskedasticity in terms of u or in terms of Y. Similarly, cov(uᵢ, uⱼ) = 0 can also be stated as cov(Yᵢ, Yⱼ) = 0.
Now the question is how realistic are these assumptions.

The point is if you have a set of data satisfying all these assumptions that is the most ideal data

set. This is because if you apply OLS on such data the resulting estimators will be the most

desirable in terms of its properties.

But the problem is in the real world many of these assumptions will be violated. And we will see

what happens to the properties of estimators if these assumptions are violated. And remember

that these assumptions are stated to derive the properties of estimators. And we can compare this

model with the model of perfect competition in micro economics.You know that a perfectly

competitive market is the most desirable. But we know that in the real world most of the

assumptions of perfect competition will be violated, and we study what happens to price and

quantity when these assumptions are violated. In the same way we know that a data set satisfying

these assumptions is the most desirable. And if these assumptions are violated it will affect the

properties of estimators. Later we will consider what happens to the properties of the estimators if these assumptions are violated. Some assumptions are not important in terms of their impact on the properties.

So these are the assumptions of the classical linear regression model, the Gaussian linear regression model, and the method of OLS.

Mean and Variance of OLS Estimators

It is clear from the previous analysis that the OLS estimators are functions of the sample values. In the two-variable model, in deviation form:

β̂₂ = Σxᵢyᵢ / Σxᵢ²

β̂₁ = Ȳ − β̂₂X̄

So, as functions of sample values, the values of the OLS estimators vary from sample to sample, and we require a measure of the precision of the OLS estimators. The measure of precision of the OLS estimators is given by the standard error of the estimators, and the standard error is nothing but the standard deviation of the sampling distribution of the estimator. The sampling distribution of these estimators is nothing but the probability distribution of the estimates.

The next concern is the derivation of the standard errors of the estimators β̂₁ and β̂₂. The standard error is required as a measure of the precision of the estimators; not only that, it is also required for hypothesis testing, because for hypothesis testing we use the sampling distribution, and the test statistic is derived from the sampling distribution. So we can use a test statistic based on an estimator for hypothesis testing if and only if the standard error of the estimator is known.

In this section we derive only the mean and standard error of β̂₂; the mean and standard error of β̂₁ are simply stated.

Mean and SE of β̂₂

We have shown that β̂₂ = Σxᵢyᵢ/Σxᵢ²; this can be written as

β̂₂ = ΣxᵢYᵢ / Σxᵢ²   (as derived earlier)

This can also be written as

β̂₂ = ΣkᵢYᵢ,   where kᵢ = xᵢ / Σxᵢ²

Now, since β̂₂ = k₁Y₁ + k₂Y₂ + ... + kₙYₙ, β̂₂ is a linear function of Y₁, Y₂, ..., Yₙ with k₁, k₂, ..., kₙ serving as the weights.

So we can say that β̂₂ is a linear estimator, with kᵢ = xᵢ/Σxᵢ² serving as the weights.

The weights kᵢ are very important in econometrics.

Properties of the weights kᵢ

1. Since Xᵢ is assumed to be fixed in repeated samples, Xᵢ is non-stochastic. Since kᵢ = xᵢ/Σxᵢ², kᵢ is also non-stochastic (kᵢ, being a function of Xᵢ, is non-stochastic and remains constant from sample to sample).

2. Σkᵢ = Σ(xᵢ/Σxᵢ²) = Σxᵢ/Σxᵢ² = 0, because in the numerator Σxᵢ = Σ(Xᵢ − X̄) = 0. So Σkᵢ = 0.

3. Σkᵢ² = Σ(xᵢ/Σxᵢ²)² = Σxᵢ²/(Σxᵢ²)² = 1/Σxᵢ². So Σkᵢ² = 1/Σxᵢ².

4. Σkᵢxᵢ = ΣkᵢXᵢ, the reason being Σkᵢ(Xᵢ − X̄) = ΣkᵢXᵢ − X̄Σkᵢ = ΣkᵢXᵢ − 0 = ΣkᵢXᵢ.

ΣkᵢXᵢ = ΣxᵢXᵢ/Σxᵢ²

= Σ(Xᵢ − X̄)Xᵢ / Σ(Xᵢ − X̄)²

= (ΣXᵢ² − X̄ΣXᵢ) / (ΣXᵢ² − X̄ΣXᵢ) = 1   (the denominator expansion was derived earlier)

So Σkᵢxᵢ = ΣkᵢXᵢ = 1.
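These four properties can be verified numerically; the following added sketch (NumPy assumed, hypothetical X values) checks Σkᵢ = 0, Σkᵢ² = 1/Σxᵢ² and ΣkᵢXᵢ = Σkᵢxᵢ = 1:

# A minimal added sketch: numerical check of the properties of the weights k_i = x_i / Σx².
import numpy as np

X = np.array([2.0, 4.0, 5.0, 7.0, 9.0, 12.0])   # hypothetical regressor values
x = X - X.mean()
k = x / (x ** 2).sum()

print("sum k   =", round(k.sum(), 12))                          # 0
print("sum k^2 =", k @ k, "  1/sum x^2 =", 1 / (x ** 2).sum())  # equal
print("sum kX  =", round(k @ X, 12), "  sum kx =", round(k @ x, 12))   # both 1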

Using these properties we can derive the mean and standard error of β̂₁ and β̂₂. Since β̂₂ = ΣkᵢYᵢ, β̂₂ is a linear estimator with kᵢ serving as the weights; this is one of the small sample properties of an estimator that we discussed.

Now, we substitute for Yᵢ from the PRF:

β̂₂ = Σkᵢ(β₁ + β₂Xᵢ + uᵢ)

β̂₂ = β₁Σkᵢ + β₂ΣkᵢXᵢ + Σkᵢuᵢ

Here Σkᵢ = 0 and ΣkᵢXᵢ = 1, so

β̂₂ = β₂ + Σkᵢuᵢ, and it also follows that

β̂₂ − β₂ = Σkᵢuᵢ

Taking expectations, E(β̂₂) = β₂ + ΣkᵢE(uᵢ), and since E(uᵢ) = 0,

E(β̂₂) = β₂

This shows that β̂₂ is an unbiased estimator of β₂; it is the unbiasedness property. This unbiasedness property will be satisfied if and only if xᵢ and uᵢ are uncorrelated, and that is the second assumption, cov(Xᵢ, uᵢ) = 0.

E(β̂₂) = β₂

And similarly we can prove that

E(β̂₁) = β₁

Variance of β̂₂

Var(β̂₂) = E[β̂₂ − E(β̂₂)]²

Since E(β̂₂) = β₂,

= E[β̂₂ − β₂]²

And we have already written that β̂₂ − β₂ = Σkᵢuᵢ. So this can be rewritten as

Var(β̂₂) = E[Σkᵢuᵢ]²

When we discussed the summation operators, we expanded the term (Σxᵢ)² as x₁² + x₂² + ... + xₙ² + 2x₁x₂ + ... . By expanding this in the same way, we get

= E[k₁²u₁² + k₂²u₂² + ... + kₙ²uₙ² + 2k₁k₂u₁u₂ + ... + 2kₙ₋₁kₙuₙ₋₁uₙ]

Taking the expectation operator inside:

= k₁²E(u₁²) + k₂²E(u₂²) + ... + kₙ²E(uₙ²) + 2k₁k₂E(u₁u₂) + ... + 2kₙ₋₁kₙE(uₙ₋₁uₙ)

You can see that

E(uᵢ²) = σ²   (homoskedasticity assumption)

So = k₁²σ² + k₂²σ² + ... + kₙ²σ²

E(uᵢuⱼ) = 0   (no autocorrelation)

We use the assumptions of homoskedasticity and no autocorrelation. So the derivation of the formulas for the variance, expected value, etc. critically depends on the assumptions of the classical linear regression model, because these assumptions are used:

= σ²Σkᵢ²

Σkᵢ² = 1/Σxᵢ²

Var(β̂₂) = σ²/Σxᵢ²

The positive square root of the variance of β̂₂ is known as the standard error of β̂₂:

SE(β̂₂) = σ/√(Σxᵢ²)

This quantity is used for hypothesis testing, for the derivation of Z, t, etc.

Var(β̂₂) = σ²/Σxᵢ²  and  SE(β̂₂) = σ/√(Σxᵢ²)

It is nothing but the standard deviation of the sampling distribution of β̂₂: because β̂₂ is an estimator, it is a random variable and has a sampling distribution.

Similarly, without proof, we can show that

Var(β̂₁) = σ²ΣXᵢ² / (nΣxᵢ²)

SE(β̂₁) = σ√(ΣXᵢ²/(nΣxᵢ²))

These quantities are used as measures of the precision of the estimators; not only that, they are used for hypothesis testing.

This variance can be derived as follows:

Var(β̂₁) = σ²[1/n + X̄²/Σxᵢ²]

= σ²[(Σxᵢ² + nX̄²)/(nΣxᵢ²)]

= σ²[(ΣXᵢ² − nX̄² + nX̄²)/(nΣxᵢ²)]

Var(β̂₁) = σ²ΣXᵢ²/(nΣxᵢ²)

Considering these formulas for the variances and standard errors, you will immediately notice that Σxᵢ² is known once we have a sample, and n is also known. The unknown is σ²; σ² is the variance of uᵢ, and u is unobservable. So we cannot apply the variance formulas for hypothesis testing as they stand. What we do in practice is replace σ² with its unbiased estimator:

σ̂² = Σûᵢ²/(n − 2)

n − 2 = degrees of freedom

Σûᵢ² = sum of squared residuals

For example, s² = Σ(X − X̄)²/(n − 1); here n − 1 is the degrees of freedom.

In the context of OLS, Σûᵢ² can be derived if and only if β̂₁ and β̂₂ have been derived. So two restrictions are imposed on the data when two quantities are estimated from the data:

1. Σûᵢ = 0

2. ΣûᵢXᵢ = 0

The degrees of freedom is generally given as n minus the number of parameters estimated:

df = n − k

Degrees of freedom is an adjustment to make an estimator unbiased.

Now, σ̂² is the unbiased estimator of σ²; then E(σ̂²) = σ².

Since E[Σûᵢ²] = (n − 2)σ², the expectation of the numerator is (n − 2)σ², so if we divide by n − 2 we get

E[Σûᵢ²/(n − 2)] = [(n − 2)/(n − 2)]σ² = σ²

So division by n − 2 makes the estimator of σ², that is σ̂², an unbiased estimator.

In the equations for Var(β̂₁) and Var(β̂₂), σ² is replaced by Σûᵢ²/(n − 2).

How do we calculate Σûᵢ²?

Actually, Σûᵢ² = Σ(Yᵢ − β̂₁ − β̂₂Xᵢ)². Once β̂₁ and β̂₂ are calculated, it is easy to calculate Σûᵢ². But in practice we calculate it differently.

We know that

Yᵢ = Ŷᵢ + ûᵢ

and in deviation form

yᵢ = ŷᵢ + ûᵢ

Σyᵢ² = Σŷᵢ² + Σûᵢ² + 2Σŷᵢûᵢ

Here Σŷᵢûᵢ = 0, so

Σyᵢ² = Σŷᵢ² + Σûᵢ²

Σûᵢ² = Σyᵢ² − Σŷᵢ²

Since ŷᵢ = β̂₂xᵢ,

Σŷᵢ² = β̂₂²Σxᵢ²

So Σûᵢ² = Σyᵢ² − β̂₂²Σxᵢ²

This is the formula used to calculate Σûᵢ².

σ̂ = √(Σûᵢ²/(n − 2)) is known as the standard error of the estimate, also known as the standard error of the regression. It is nothing but the standard deviation of the Y values around the estimated regression line, and it is often considered a measure of the goodness of fit of the sample regression line to the sample data.
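Putting the formulas together, the following added Python sketch (using the same hypothetical numbers as the earlier added OLS example) computes Σûᵢ², σ̂, SE(β̂₁) and SE(β̂₂):

# A minimal added sketch: residual sum of squares and standard errors of the OLS estimators.
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])   # hypothetical sample
n = len(Y)

x, y = X - X.mean(), Y - Y.mean()
b2 = (x * y).sum() / (x ** 2).sum()             # slope estimate
b1 = Y.mean() - b2 * X.mean()                   # intercept estimate

rss = (y ** 2).sum() - b2 ** 2 * (x ** 2).sum() # Σu² = Σy² - b2²Σx²
sigma2_hat = rss / (n - 2)                      # unbiased estimator of σ²

se_b2 = np.sqrt(sigma2_hat / (x ** 2).sum())                          # SE of slope
se_b1 = np.sqrt(sigma2_hat * (X ** 2).sum() / (n * (x ** 2).sum()))   # SE of intercept

print("RSS =", round(rss, 4), " sigma_hat =", round(np.sqrt(sigma2_hat), 4))
print("SE(b1) =", round(se_b1, 4), " SE(b2) =", round(se_b2, 4))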

From the definitions of the variances and standard errors of the estimators, consider these points:

Var(β̂₂) = σ²/Σxᵢ²

Var(β̂₁) = σ²ΣXᵢ²/(nΣxᵢ²)

Consider the slope coefficient β̂₂: its variance is directly proportional to σ², so as the variance of Y increases, the variance of β̂₂ increases. It is inversely proportional to Σxᵢ², where Σxᵢ² = Σ(Xᵢ − X̄)².

If all the X values equal X̄ (there is no variability), this becomes 0 and the variance of β̂₂ becomes infinite. So if you have a sample with low variability in X, the variance of the estimator will be very high and the precision of the estimator will be low, and vice versa. If you have two data sets with different X values, always select the one with higher variability in X: as the variability in X increases, the variance of the estimator decreases and its precision increases.

Gauss Markov Theorem

In this section, we are discussing the statistical properties of least squares. We have already

discussed that properties are divided into numerical properties and statistical properties.

Numerical properties are those that hold as a result of the use of OLS regardless of how the data are generated. Statistical properties are those that hold only after making some assumptions about how the data are generated. The statistical properties of least squares are given in the form of a theorem known as the Gauss-Markov theorem. This is a basic theorem in econometrics. The

statement of the theorem is, given the assumptions of the classical linear regression model, OLS

estimators in the class of all linear unbiased estimators have minimum variance, they are BLUE.

So we consider a class of linear unbiased estimators and we will prove that in this class OLS

estimators have minimum variance, that is Gauss Markov Theorem.

As we discussed earlier, the BLUE property is a small sample property. Properties of estimators

are classified into small sample properties and large sample properties, and small sample

properties are the most desirable. If small sample properties are satisfied, then we need not

consider the large sample properties.

Let us prove the Gauss-Markov theorem with respect to β̂₂; it will be automatically applicable to β̂₁ as well. As we have written earlier:

β̂₂ = ΣkᵢYᵢ, so β̂₂ is a linear estimator ……………… (1)

We have also proved that E(β̂₂) = β₂, so β̂₂ is an unbiased estimator as well ………. (2)

Var(β̂₂) = σ²/Σxᵢ² ………………. (3)

We have derived the variance, and we have to prove that in the class of linear unbiased estimators β̂₂ has the minimum variance.

The 1st and 2nd properties are absolute, meaning these properties do not depend on other estimators. But the 3rd property is a relative property, meaning a property in relation to the variance of another estimator. So we define an alternative estimator as

β̃₂ (beta tilde 2) = ΣwᵢYᵢ, which is linear, with wᵢ serving as the weights.

Substituting for Yᵢ, it becomes

β̃₂ = Σwᵢ(β₁ + β₂Xᵢ + uᵢ)

β̃₂ = β₁Σwᵢ + β₂ΣwᵢXᵢ + Σwᵢuᵢ

E(β̃₂) = β₂ is satisfied if and only if Σwᵢ = 0 and ΣwᵢXᵢ = 1, so we assume that Σwᵢ = 0 and ΣwᵢXᵢ = 1. Then

β̃₂ = β₂ + Σwᵢuᵢ

E(β̃₂) = β₂, because E(Σwᵢuᵢ) = 0

So here we imposed two restrictions:

1. Σwᵢ = 0

2. ΣwᵢXᵢ = 1

Now we have to derive the variance of β̃₂; we can derive the variance of β̃₂ just as we derived the variance of β̂₂.

We have β̃₂ = ΣwᵢYᵢ

Then Var(β̃₂) = Var(ΣwᵢYᵢ)

= w₁²Var(Y₁) + w₂²Var(Y₂) + ... + wₙ²Var(Yₙ)

It is assumed that Var(Yᵢ) = σ², and also Cov(Yᵢ, Yⱼ) = 0; as we already discussed, Var(Yᵢ) is the same as Var(uᵢ), and Cov(Yᵢ, Yⱼ) = Cov(uᵢ, uⱼ). Using this, the covariance terms disappear, and we can write

= w₁²σ² + w₂²σ² + ... + wₙ²σ²

Var(β̃₂) = σ²Σwᵢ²

To derive this, we have used two assumptions:

1. Var(Yᵢ) = σ²

2. Cov(Yᵢ, Yⱼ) = 0

Consider that Var(β̂₂) = Var(ΣkᵢYᵢ)

= k₁²σ² + k₂²σ² + ... + kₙ²σ²

= σ²Σkᵢ² = σ²/Σxᵢ²

To write the variance like this, we have to assume that Var(Yᵢ) = σ² and Cov(Yᵢ, Yⱼ) = 0. As we stated the assumptions in terms of u, we adopted a different route earlier; if we state the assumptions in terms of Y, we can use this route as well, and both give the same result. Here we used the assumptions Var(Yᵢ) = σ² and Cov(Yᵢ, Yⱼ) = 0.

So Var(β̃₂) = σ²Σwᵢ².

Let us write wᵢ = kᵢ + (wᵢ − kᵢ). Then

Σwᵢ² = Σkᵢ² + Σ(wᵢ − kᵢ)² + 2Σkᵢ(wᵢ − kᵢ)

Consider the last term:

Σkᵢ(wᵢ − kᵢ) = Σkᵢwᵢ − Σkᵢ²

= Σxᵢwᵢ/Σxᵢ² − 1/Σxᵢ²

Σxᵢwᵢ = Σ(Xᵢ − X̄)wᵢ = ΣwᵢXᵢ − X̄Σwᵢ, and since Σwᵢ = 0, this equals ΣwᵢXᵢ = 1. So

Σkᵢ(wᵢ − kᵢ) = 1/Σxᵢ² − 1/Σxᵢ² = 0

So the last term of the above equation, Σkᵢ(wᵢ − kᵢ), is 0. Then we can write

Var(β̃₂) = σ²[Σkᵢ² + Σ(wᵢ − kᵢ)²]

= σ²Σkᵢ² + σ²Σ(wᵢ − kᵢ)²

Var(β̃₂) = σ²/Σxᵢ² + σ²Σ(wᵢ − kᵢ)²

Var(β̃₂) = Var(β̂₂) + σ²Σ(wᵢ − kᵢ)²

As you can see, Var(β̃2) = Var(β̂2) + σ²Σ(w_i − k_i)². Now, if w_i = k_i, that is, the weights w_i are the same as the k_i, the last term of the above equation disappears and

Var(β̃2) = Var(β̂2)

If w_i takes any value other than k_i (w_i < k_i or w_i > k_i), the squared quantity in the last term becomes positive, and you get the result

Var(β̃2) > Var(β̂2)

So, another way of saying this is that if there is a minimum-variance linear unbiased (BLUE) estimator of β2, it is given by the OLS estimator β̂2. This is the proof of the Gauss Markov Theorem.

The graphical presentation of the Gauss Markov Theorem

Both estimators are unbiased, and Var(β̃2) > Var(β̂2), or equivalently Var(β̂2) < Var(β̃2).
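The following is a small numerical sketch (not part of the lecture) of the result just proved: for a made-up set of X values, any alternative weights w_i that satisfy the two restrictions Σw_i = 0 and Σw_iX_i = 1 give a variance σ²Σw_i² that is never smaller than the OLS variance σ²Σk_i².

```python
# Sketch of the Gauss-Markov comparison; the X values, the perturbation and
# sigma^2 are all assumed numbers for illustration.
import numpy as np

X = np.array([80., 100., 120., 140., 160., 180., 200., 220., 240., 260.])
x = X - X.mean()                      # deviations from the mean
k = x / np.sum(x ** 2)                # OLS weights: sum(k)=0, sum(k*X)=1

# Build an arbitrary alternative linear unbiased estimator: perturb k and
# re-impose the two restrictions sum(w)=0 and sum(w*X)=1.
rng = np.random.default_rng(0)
w = k + rng.normal(scale=1e-3, size=k.size)
w = w - w.mean()                      # enforce sum(w) = 0
w = w / np.sum(w * x)                 # enforce sum(w*x) = sum(w*X) = 1

sigma2 = 1.0                          # common error variance (assumed)
print("Var(OLS)         =", sigma2 * np.sum(k ** 2))   # = sigma2 / sum(x^2)
print("Var(alternative) =", sigma2 * np.sum(w ** 2))   # always >= Var(OLS)
```

Whatever perturbation is chosen, the difference between the two variances is σ²Σ(w_i − k_i)², which is non-negative, exactly as in the proof above.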

Coefficient of Determination

● We have estimated the regression coefficients β1 and β2, and we have also derived the standard errors of the estimators.

β̂2 = Σx_iy_i / Σx_i²

Var(β̂2) = σ² / Σx_i²

β̂1 = Ȳ − β̂2X̄

Var(β̂1) = σ²ΣX_i² / (nΣx_i²)

● Now, if β̂1 and β̂2 are known, we can easily determine the sample regression line (SRL):

Ŷ_i = β̂1 + β̂2X_i

● If we know β̂1 and β̂2, then by substituting the values of X_i we can estimate Ŷ_i, and you will get the sample regression line.

● Now, our next issue is to find out how well the sample regression line fits the sample data.

● Ideally, what we expect is that all the sample points lie on the sample regression line. But in the real world, all sample points will not lie on the sample regression line: some points will lie above it and some will lie below it. That is, there will be positive residuals (û) and negative residuals (−û).

● Now, as a measure of goodness of fit, goodness of fit means how good is the fit of the

sample regression line to the sample data, we use a measure known as coefficient of

determination.

● The coefficient of determination is the measure used to assess the fit of the sample regression line to the sample data. For the simple (two-variable) linear regression model it is denoted r², and when we go beyond the two-variable model, in the general k-variable model, we call it the multiple coefficient of determination and denote it R².

● The coefficient of determination may also be written as r²_{y,x}.

● Now, in order to see what is meant by the coefficient of determination, first consider an informal way of looking at r². Suppose that we use a device known as the Venn diagram, also known as the Ballentine; suppose that the circle Y represents the variation in Y and the circle X represents the variation in X.

● The terms variance and variation are two different things. When we talk about variation, the question is: variation from what? There must be a base, and the base we usually refer to is the mean. So the variation is Σ(Y_i − Ȳ)², while the variance is Σ(Y_i − Ȳ)²/(n − 1).

● So, variation is not variance. Now suppose that the circle Y represents the variation in Y and the circle X represents the variation in X; there is no overlap in the following diagram.

● But there is overlap in the following diagram

The overlap increases in the following diagrams.

● Now r² is a measure of this overlap. In the first case, as the overlap is zero, r² = 0; in the last case, as the overlap is complete, r² = 1; in the fourth case, 0 < r² < 1. This is the informal definition of the coefficient of determination.

● To derive r² we proceed as follows.

● We know that Y_i = Ŷ_i + û_i, and in deviation form y_i = ŷ_i + û_i. This is the two-variable model in terms of the original observations and in deviation form. Squaring and summing,

Σy_i² = Σ(ŷ_i + û_i)² = Σŷ_i² + Σû_i² + 2Σŷ_iû_i

Since Σŷ_iû_i = 0, this reduces to

Σy_i² = Σŷ_i² + Σû_i²

This can also be written as

Σy_i² = β̂2²Σx_i² + Σû_i²

because, as we derived earlier, ŷ_i = β̂2x_i.

● Now, let’s see the components in the form of a graph.

● Now, let’s define various sum of squares. The first one is

(1) Σy_i² = Σ(Y_i − Ȳ)² = TSS (Total Sum of Squares)

(2) Σŷ_i² = Σ(Ŷ_i − Ȳ)², since we have proved that the mean of the fitted values equals the mean of the actual Y. We call Σ(Ŷ_i − Ȳ)² the ESS (Explained Sum of Squares): the part of the variation in Y that is explained by the sample regression line.

(3) Σû_i² is the part of the variation in Y not explained by the sample regression line. We call it the RSS (Residual Sum of Squares).

● So the various sums of squares can be arranged as

TSS = ESS + RSS

Now, dividing throughout by TSS,

1 = ESS/TSS + RSS/TSS

1 = Σŷ_i²/Σy_i² + Σû_i²/Σy_i²

1 = β̂2²Σx_i²/Σy_i² + Σû_i²/Σy_i²

Now, we define r² as

r² = ESS/TSS

So 1 = r² + RSS/TSS, and r² can also be written as

r² = 1 − RSS/TSS = 1 − Σû_i²/Σy_i²

This quantity is known as the coefficient of determination. The coefficient of determination is often used as a summary measure of goodness of fit, that is, of the fit of the sample regression line to the sample data. r² is defined as the proportion (or fraction) of the sample variation in Y explained by X. If you multiply r² by 100 (r² × 100), it becomes the percentage of the variation in Y explained by X. For example, if r² = 0.6, then 0.6 × 100 = 60%, and the interpretation is that 60% of the variation in Y is explained by X.

● For computational purposes, the formula used is

r² = β̂2² (Σx_i² / Σy_i²)

This can also be written as

r² = β̂2² (S_x² / S_y²)

● The value of r² always lies between 0 and 1.

● If Σû_i² = 0, meaning all the sample points lie on the sample regression line, then

r² = 1 − 0/Σy_i² = 1

that is, the entire variation in Y is explained by X. On the other hand, if no part of the variation in Y is explained by X, then Σû_i² = Σy_i², so Σû_i²/Σy_i² = 1 and

r² = 1 − 1 = 0
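As an illustration, here is a short sketch of the TSS = ESS + RSS decomposition and of r² computed both ways. The X and Y values are assumed numbers, loosely patterned on the income-consumption example used later in these notes; they are not data given in the lecture.

```python
# Sketch: compute r^2 from the sum-of-squares decomposition (assumed data).
import numpy as np

X = np.array([80., 100., 120., 140., 160., 180., 200., 220., 240., 260.])
Y = np.array([70.,  65.,  90.,  95., 110., 115., 120., 140., 155., 150.])

x, y = X - X.mean(), Y - Y.mean()
b2 = np.sum(x * y) / np.sum(x ** 2)       # slope estimate
b1 = Y.mean() - b2 * X.mean()             # intercept estimate
u_hat = Y - (b1 + b2 * X)                 # residuals

TSS = np.sum(y ** 2)
ESS = b2 ** 2 * np.sum(x ** 2)
RSS = np.sum(u_hat ** 2)

print("TSS = ESS + RSS ?", np.isclose(TSS, ESS + RSS))
print("r^2 =", ESS / TSS, " = 1 - RSS/TSS =", 1 - RSS / TSS)
```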

● Since r² is a squared quantity, it cannot be negative; its range is between 0 and 1, where 0 means the regression line has no fit at all and 1 means a perfect fit. In practice, r² will typically lie strictly between 0 and 1.

● If the value of r² is 0.8, 0.9, etc., we will say that the fit of the SRL is very high; if it is 0.3, 0.2, etc., the fit of the SRL is comparatively low.

● As we will see later, in cross-sectional data r² will generally be low due to the heterogeneity in the data, while in time-series data r² will be very high.

● And always remember that r² is a measure of the fit of the SRL to the sample data.

● As you will see later, r² is not a very important quantity in econometrics.

● Now, a quantity which is closely related to, but conceptually very different from, r² is the sample coefficient of correlation r:

r = Σx_iy_i / √(Σx_i² Σy_i²)

● Remember the properties of coefficient of correlation

(1) − 1 ≤ 𝑟 ≤+ 1

(2) r is symmetric. Symmetric means that the r between X and Y is the same as the r between Y and X, that is, r_{xy} = r_{yx}.

(3) r is independent of origin and scale, which means Corr(X, Y) = Corr(aX + b, cY + d) for a, c > 0; whether we change the origin or the scale, the correlation is not affected. Correlation is a unitless number.

(4) If two variables X and Y are independent, then r = 0, but zero correlation does not mean that X and Y are independent. This is because correlation is a measure of linear association only: if Y = X², the two variables are related, but r = 0, because they are non-linearly related. So even if the correlation is 0, we cannot say that the two variables are independent; but if two variables are independent, then r = 0.

There is an exception to this rule: if X and Y are normally distributed random variables, zero correlation does imply statistical independence. This is not true in all cases; it holds only for normally distributed random variables. And finally,

(5) Correlation need not imply causation. A classical example of this is spurious correlation. The concept was first introduced by Yule in statistics and was later introduced into econometrics in the context of time series. The point is that even if two variables are highly correlated, we cannot say that one causes the other; there need not be any causation between the two variables. In fact, even if the two variables are not related at all, you may still obtain a high correlation. Such a possibility is called spurious correlation.

● Now, if correlation is represented by r and measures the linear association between two variables, r² is conceptually different: it is the proportion of the variation in Y explained by X. Numerically, however, r = ±√r², and the sense in which r² is the square of r will be considered later.

● Suppose that r² = 0.81; then r = ±0.9. The problem is that r can be positive or negative, whereas r² is positive only. So is r equal to +0.9 or −0.9? To see that, you must know the relation between X and Y.

Y = f(X)

Suppose consumption is a function of income; then the slope coefficient is positive, and if the slope coefficient is positive, r is positive. Suppose instead that Q = f(P), quantity demanded as a function of price; then the slope coefficient is negative, so r is negative.

● So, in order to determine what will be the sign of r, you just consider the sign of the slope

coefficient in the relation between Y and X.

● So, r can be positive or negative, but r² is always positive. And always remember that r² is the square of r, even though the meanings of the two terms are different. The sense in which r² is the square of r will be discussed and proved in the context of the multiple linear regression model.
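A quick numerical check of this point, using made-up data assumed for illustration: in the two-variable model, the r² from the regression equals the square of the sample correlation, and the sign of r follows the sign of the slope.

```python
# Sketch: r^2 from the regression equals r*r; the data are assumed numbers.
import numpy as np

X = np.array([1., 2., 3., 4., 5., 6.])
Y = np.array([9., 7.5, 6., 5.5, 3., 2.])          # downward-sloping relation

x, y = X - X.mean(), Y - Y.mean()
b2 = np.sum(x * y) / np.sum(x ** 2)               # slope (negative here)
r = np.sum(x * y) / np.sqrt(np.sum(x ** 2) * np.sum(y ** 2))
r2 = b2 ** 2 * np.sum(x ** 2) / np.sum(y ** 2)    # computational formula

print("slope:", b2, " r:", r, " r^2:", r2, " r^2 == r*r:", np.isclose(r2, r ** 2))
```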

Normality Assumptions

● The classical theory of statistical inference has two branches, or is divided into

(1) Estimation and

(2) Hypothesis Testing

● Using the method of ordinary least squares we have estimated the parameters.

β1 is estimated as β̂1

β2 is estimated as β̂2

σ² is estimated as σ̂²

● Under the assumptions of the classical linear regression model, these estimators possess many desirable statistical properties, summarized in the Gauss Markov theorem: the BLUE property, a small sample property.

● So, provided that the assumptions of the classical linear regression model are satisfied, β̂1 and β̂2 are best linear unbiased estimators. But we also know that these are only estimators: as estimators, their values vary from sample to sample, or, stated differently, they are random variables. A measure of the precision of these estimators is given by their standard errors (SE). We have derived the standard errors of β̂1 and β̂2. The standard error is used as a measure of the precision of the estimator; not only that, standard errors are used for hypothesis testing.

● Estimation is only one part; hypothesis testing is the other part.

● Suppose that in the model 𝑌𝑖 = β1 + β2𝑋𝑖 + 𝑢𝑖, where 𝑌𝑖 is consumption and 𝑋𝑖 is

income .

● Suppose that we obtain β̂2 = 0.7, the estimated MPC. Now, if we take another sample, you will get another value for β̂2; with a third sample you will get a third value, and we do not know which β̂2 best approximates the parameter β2. Because of sampling fluctuations β̂2 varies from sample to sample, and we do not know which β̂2 will approximate the true population parameter β2.

● Our aim is not merely to estimate β̂1 and β̂2, but to use these estimators to make inferences about the population parameters. So the question is: how do we use the estimators to make inferences about the parameters?

● Suppose that we want to make an inference about µ. Using a sample we compute X̄; X̄ is an estimator and is used to make inferences about the parameter µ.

● Once X̄ is estimated from the sample, using the procedure of hypothesis testing, the Z statistic or the t statistic, we can test hypotheses about the population parameter:

Z = (X̄ − µ)/(σ/√n)

t = (X̄ − µ)/(S/√n)

● That branch is known as hypothesis testing. Statisticians have developed rules and procedures for testing hypotheses: steps in formulating hypotheses, steps in testing them, and so on.

● The question is how do we use these estimators for hypothesis testing?

● We know that estimators are random variables, so, naturally, estimators have sampling distributions. We can use estimators for hypothesis testing if and only if the sampling distribution of the estimator is known. For example, in the case of X̄: X follows a normal distribution with mean µ and variance σ², and X̄ follows a normal distribution with mean µ and variance σ²/n:

X ∼ N(µ, σ²)

X̄ ∼ N(µ, σ²/n)

● So the sampling distribution of X̄ is known, the test statistic Z or t can be constructed, and Z or t can be used for hypothesis testing.

● X̄ ∼ N(µ, σ²/n) is based on the assumption that the observations are iid, or simply that the sample is taken using the random sampling technique, and that X is normal. Even if X is not normal, we can use the central limit theorem, provided that n is sufficiently large (say greater than 30).

● Similarly, in the case of the OLS estimators, we have suggested that

β̂1 ∼ (β1, σ²ΣX_i²/(nΣx_i²))

and similarly

β̂2 ∼ (β2, σ²/Σx_i²)

● In both cases, the nature of the distribution is not specified. In the case of the arithmetic mean, the mean is normally distributed if X is normal; but even if X is not normal, X̄ is approximately normal with (µ, σ²/n) provided that the sample size is sufficiently large.

So that we can construct the test statistics Z and t. In the case of OLS we have proved that the estimators are unbiased and we have derived their variances, and we have also shown that the estimators are BLUE. But the estimators are random variables, so naturally they have sampling distributions; the nature of those sampling distributions has not been discussed so far.

● But to use these estimators to make inferences about the population parameters, we have to specify the sampling distributions of β̂1 and β̂2; without sampling distributions we cannot use the estimators to make inferences about the population parameters. Now let us derive the sampling distributions of β̂1 and β̂2.

● Now, we have stated that

β̂2 = Σx_iy_i / Σx_i² = Σk_iY_i

β̂2 = β2 + Σk_iu_i

● This step we have already derived. So, naturally, the distribution of β̂2 depends only on u_i, because β2 is a constant and the k_i are constants:

k_i = x_i / Σx_i²

● So, β̂2 depends on the u_i, and we can also show that β̂1 depends on the u_i; we will prove this only with respect to β̂2.

● Since the distribution of β̂2 depends on the u_i, we have to state assumptions about how the u_i are distributed; only then can we determine the distribution of β̂2.

● So, in the regression context, we assume that 𝑢𝑖 is normally distributed. That is the

stochastic error term 𝑢𝑖 is normally distributed, and when we add the normality

assumption, Classical Linear Regression Model(CLRM) becomes Classical Normal

Linear Regression Model(CNLRM)

● Now we assume that each u_i is normally distributed with

E(u_i) = 0

Var(u_i) = E[u_i − E(u_i)]² = σ²

Cov(u_i, u_j) = E{[u_i − E(u_i)][u_j − E(u_j)]} = 0

We can write this compactly as

u_i ∼ N(0, σ²)

This is the normality assumption.

● So, in regression analysis, when we talk about the normality assumption, we are talking about the distribution of u_i. Since each u_i is normally distributed and the covariances are assumed to be zero, we can say that each u_i is normally and independently distributed with mean 0 and variance σ²:

u_i ∼ NID(0, σ²)

● For 2 normally distributed random variables 0 covariance means independence.

● Now the question is: why do we assume that u_i is normal?

● The celebrated central limit theorem says that if there are a large number of independently and identically distributed random variables, then, with a few exceptions, their sum will also be normally distributed. A variant of the central limit theorem says that even if the number of random variables is not very large, or even if they are not strictly independently distributed, their sum still tends to be normally distributed.

● We know that u in the linear regression model in econometrics is the sum of a large number of random variables; for example, in the consumption-income example, it captures wealth, the number of family members, and many other omitted variables. As u is the sum of a large number of identically and independently distributed variables, u is normally distributed. And if u is normally distributed, β̂1 and β̂2 are normally distributed. The reason is a theorem in statistics which says that a linear combination of normally distributed random variables is itself normally distributed. We have seen that

β̂2 = β2 + Σk_iu_i

● β̂2 is a linear combination of the u_i, with the k_i serving as the weights. As β̂2 is a linear combination of the u_i, β̂2 is normally distributed, and similarly β̂1. So we can write

β̂1 ∼ N(β1, σ²ΣX_i²/(nΣx_i²))

and similarly

β̂2 ∼ N(β2, σ²/Σx_i²)

● When we study X̄, we simply assume that if X, that is the population, is normal, then the distribution of X̄ is also normal; that has been proved by statisticians. In the context of OLS, we use the normality assumption about u to prove that the estimators are normal. The normality assumption is about u and the estimators; it says nothing about the distribution of X or Y.

● Then why the normality assumption? Because u is the sum of a large number of identically and independently distributed random variables, and if a phenomenon is the sum of a large number of factors, that phenomenon will be normally distributed. Height is normally distributed, weight is normally distributed, because height, weight and so on are the outcome of a large number of factors; such variables will naturally be normally distributed. And if u is normal, the OLS estimators are normally distributed, and if the OLS estimators are normal, we can use the normal distribution for hypothesis testing.

● Since β̂1 ∼ N(β1, σ²ΣX_i²/(nΣx_i²)),

Z = (β̂1 − β1)/SE(β̂1) ∼ N(0, 1)

Similarly, β̂2 ∼ N(β2, σ²/Σx_i²) is converted to

Z = (β̂2 − β2)/SE(β̂2) ∼ N(0, 1)

● This is the standard normal variable.

● β̂1 ∼ N(β1, σ²ΣX_i²/(nΣx_i²)) is the sampling distribution of β̂1, and β̂2 ∼ N(β2, σ²/Σx_i²) is the sampling distribution of β̂2. The estimators are not directly used for hypothesis testing; using the estimators we derive the test statistics. The test statistic here is Z.
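As a sketch of this sampling-distribution idea, the following simulation (every parameter value is an assumed number, not from the lecture) draws many samples with normal errors and checks that (β̂2 − β2)/SE(β̂2) behaves like a standard normal Z when σ is known.

```python
# Sketch: simulated sampling distribution of the OLS slope (assumed values).
import numpy as np

rng = np.random.default_rng(1)
beta1, beta2, sigma = 2.0, 0.5, 1.0
X = np.linspace(1, 20, 20)
x = X - X.mean()
se_b2 = sigma / np.sqrt(np.sum(x ** 2))           # true SE, sigma known

z_vals = []
for _ in range(5000):
    u = rng.normal(0, sigma, size=X.size)         # normal errors
    Y = beta1 + beta2 * X + u
    b2 = np.sum(x * (Y - Y.mean())) / np.sum(x ** 2)
    z_vals.append((b2 - beta2) / se_b2)

z_vals = np.array(z_vals)
print("mean close to 0:", z_vals.mean())
print("std  close to 1:", z_vals.std())
```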

● So, in the form of a graph

● The sampling distribution of β̂1 is converted into Z; similarly we can convert β̂2 into Z.

● In hypothesis testing, we will use the Z distribution. The sampling distribution itself can be used for hypothesis testing, but in general we will not use it directly; we derive the test statistic Z, and that distribution is used for hypothesis testing, that is, to test hypotheses about the regression coefficients β1 and β2 using the estimators β̂1 and β̂2.

● Remember, this Z conversion is possible if and only if the standard errors are known. In reality, the standard errors will be unknown, so we go for the t distribution; that is the point we considered earlier. Now, for our future use, note the quantity

(n − 2) σ̂²/σ² ∼ χ²(n − 2) df

In the context of the χ² distribution, we have shown that

(n − 1) S²/σ² ∼ χ²(n − 1) df

That was in the context of testing the significance of σ².

● In statistics, we use the notation S² for the estimator of σ² and X̄ for the estimator of µ.

● In econometrics, we write σ̂² for the estimator of σ² and µ̂ for the estimator of µ.

● The quantity (n − 2) σ̂²/σ² ∼ χ²(n − 2) df is the one used to derive the t distribution.

● We can also show that β̂1 and β̂2 are distributed independently of σ̂², because that is the basic requirement for the t distribution: if Z1 is standard normal and Z2 is χ² with k degrees of freedom and is distributed independently of Z1, then the ratio Z1/√(Z2/k) has the t distribution. We simply assume that the estimators are independent of σ̂²; to prove it, matrix algebra is required. So, this is the normality assumption.

● The normality assumption is very important in econometrics: only if we can assume that u is normally distributed can we use the estimators for hypothesis testing. But there is another point. The normality assumption matters mainly when the sample size is small; if the sample size is very large, it is not very important, because, in general, as the sample size increases indefinitely the estimators become (approximately) normally distributed. In the case of X̄, if n > 30, whatever the distribution of the population, X̄ is approximately normally distributed. We can say the same in the context of the OLS estimators: if you have a large sample, say n > 100, the normality assumption is not required, because whether or not u is normal, or even if you are not sure about the distribution of u, β̂1 and β̂2 will be approximately normally distributed. So you can simply use the corresponding Z statistic or, if the variances are unknown, the t statistic for hypothesis testing. Adding the normality assumption gives us the Classical Normal Linear Regression Model; otherwise it is simply the Classical Linear Regression Model.

Interval Estimation Vs Point Estimation

The classical theory of statistical inference has two branches, i.e., estimation and hypothesis testing. Estimation is divided into two:

● Point estimation

● Interval estimation

OLS estimators are point estimators. When we consider hypothesis testing there are two approaches: the confidence interval approach and the test of significance approach. The confidence interval approach is based on interval estimation, and the test of significance approach is based on point estimation.

Difference between point and interval estimation

A random variable X, say height, follows a certain distribution with parameters µ and σ². The random variable is normally distributed, but we do not know the parameters; our interest is in the parameters of the distribution.

In point estimation, we assume that the random variable X follows a certain distribution and that we have a random sample of size n from this population, for example 50 observations; we use this sample to estimate the unknown parameters. In the case of point estimation the random variable X follows a certain PDF,

f(x, θ)

where θ is the parameter of the distribution; we assume, for example, that the PDF is normal.

θ̂ = f(x_1, x_2, ..., x_n)

θ̂ is the estimator, or sample statistic: a rule or formula which tells us how to estimate the unknown parameter.

Interval Estimation

Point estimation gives only a single value θ̂ for the parameter θ. The value of θ̂ varies from sample to sample due to sampling fluctuations.

In interval estimation, instead of suggesting a single value for θ (say θ̂ = 150), we suggest a range in which the parameter is likely to lie. The idea behind interval estimation is the concept of the sampling distribution of estimators: the parameter will lie between two values. We construct an interval around the point estimate by taking 2 or 3 standard errors on either side, and we say that the probability of such an interval including the parameter is 95% or 99%.

For example, X ∼ N(µ, σ²), so that X̄ ∼ N(µ, σ²/n).

We can suggest the interval X̄ ± 1.96 σ/√n.

The interval varies from one sample to another sample.

More formally, we want to find a small positive number δ and a probability 1 − α such that

P[θ̂ − δ ≤ θ ≤ θ̂ + δ] = 1 − α

or P[θ̂_1 ≤ θ ≤ θ̂_2] = 1 − α

1 − α = confidence coefficient

α = level of significance , the probability of committing type I error.

Example: suppose that the height of a population has mean µ inches, σ = 2.5, n = 100, and x̄ = 67. Then

x̄ − 1.96(σ/√n) ≤ µ ≤ x̄ + 1.96(σ/√n)

67 − 1.96(2.5/√100) ≤ µ ≤ 67 + 1.96(2.5/√100)

66.51 ≤ µ ≤ 67.49

Intervals constructed like this will include the parameter 95 out of 100 times.
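A minimal sketch of this interval in code, using the same numbers assumed in the example (σ = 2.5, n = 100, x̄ = 67):

```python
# Sketch: 95% interval for mu when sigma is known (numbers from the example).
from scipy.stats import norm

x_bar, sigma, n, alpha = 67.0, 2.5, 100, 0.05
z = norm.ppf(1 - alpha / 2)              # about 1.96
se = sigma / n ** 0.5                    # 0.25
print(f"{x_bar - z * se:.2f} <= mu <= {x_bar + z * se:.2f}")   # 66.51 <= mu <= 67.49
```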

Construction of Confidence Interval of Population Mean

First we explain the construction of confidence interval for the parameter µ using its estimator 𝑋

X ∼ N(µ, σ²)

X̄ ∼ N(µ, σ²/n)

Z = (X̄ − µ)/(σ/√n) ∼ N(0, 1)

If σ is known, we can use Z for constructing confidence intervals.

Pr[−z_{α/2} < Z < z_{α/2}] = 1 − α

Suppose that α = 0.05 (5%); then 1 − 0.05 = 0.95, or 95%: 95% of Z lies between these two limits, i.e., Z lies between −1.96 and +1.96.

−z_{α/2} and z_{α/2} are known as the critical values, and the Z in the middle is the empirical Z, i.e., Z = (x̄ − µ)/(σ/√n).

Substituting for Z,

Pr[−z_{α/2} < (X̄ − µ)/(σ/√n) < z_{α/2}] = 1 − α

Multiplying all the terms by σ/√n,

Pr[−z_{α/2}(σ/√n) < X̄ − µ < z_{α/2}(σ/√n)] = 1 − α

Subtracting X̄ from all the terms,

Pr[−z_{α/2}(σ/√n) − X̄ < −µ < z_{α/2}(σ/√n) − X̄] = 1 − α

Multiplying through by −1 (which reverses the inequalities),

Pr[X̄ + z_{α/2}(σ/√n) > µ > X̄ − z_{α/2}(σ/√n)] = 1 − α

Writing the smaller limit first,

Pr[X̄ − z_{α/2}(σ/√n) < µ < X̄ + z_{α/2}(σ/√n)] = 1 − α

µ = X̄ ± z_{α/2}(σ/√n)

If α = 0.05,

µ = X̄ ± 1.96(σ/√n)

where σ/√n is the standard error of the distribution of X̄.

In empirical applications this σ will be unknown; if σ is unknown we cannot use Z as the test statistic. Then we use the t distribution:

t = (x̄ − µ)/(s/√n) ∼ t(n − 1)

Whenever σ is replaced by its unbiased estimator s, the distribution becomes t.

To see why, let

Z1 = (X̄ − µ)/(σ/√n) ∼ N(0, 1)

Z2 = (n − 1) s²/σ² = Σ(x − x̄)²/σ² ∼ χ²(n − 1)

Then

t = Z1 / √(Z2/(n − 1)) = [(X̄ − µ)√n / σ] / √[Σ(x − x̄)² / ((n − 1)σ²)] = (X̄ − µ)/(s/√n)

Pr[−t_{α/2} < t < t_{α/2}] = 1 − α

Pr[−t_{α/2} < (X̄ − µ)/(s/√n) < t_{α/2}] = 1 − α

If α = 5%: in the case of the t distribution, the limits containing 95% of t depend on the degrees of freedom; when the degrees of freedom are different, the critical values are different.

Multiplying by s/√n, subtracting X̄ from all the terms, and multiplying through by −1, we get

Pr[X̄ − t_{α/2}(s/√n) < µ < X̄ + t_{α/2}(s/√n)] = 1 − α

µ = X̄ ± t_{α/2}(s/√n)

If the degrees of freedom are 20 and α = 5%,

µ = X̄ ± 2.086(s/√n)
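A small sketch of this t-based interval; the values of x̄ and s are assumed numbers, and n = 21 is chosen so that the degrees of freedom are 20 as in the text.

```python
# Sketch: t-based interval for mu when sigma is unknown (assumed numbers).
from scipy.stats import t

x_bar, s, n, alpha = 67.0, 2.5, 21, 0.05
t_crit = t.ppf(1 - alpha / 2, df=n - 1)          # about 2.086 for 20 df
se = s / n ** 0.5
print(f"t critical = {t_crit:.3f}")
print(f"{x_bar - t_crit * se:.2f} <= mu <= {x_bar + t_crit * se:.2f}")
```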

Construction of Confidence Interval of Regression Coefficients

Now we consider how to construct confidence intervals for the regression coefficients β1, β2.

We will explain how to construct confidence interval for β2 and generalize it for β1 or any

regression coefficients. We know that


β̂1 ∼ N(β1, σ²ΣX_i²/(nΣx_i²))

β̂2 ∼ N(β2, σ²/Σx_i²)

With the normality assumption, the OLS estimators are normally distributed. Let us consider how we construct a confidence interval for β2.

From β̂2,

Z = (β̂2 − β2) / (σ/√Σx_i²) ∼ N(0, 1)

(on the assumption that σ² is known). If σ is known, we can use Z for the construction of confidence intervals. Let us see how to construct a confidence interval for β2.

Pr[−z_{α/2} < Z < z_{α/2}] = 1 − α

Once α is fixed, the critical values are known. For example, if α = 0.05, then −z_{α/2} = −1.96 and z_{α/2} = 1.96, so that 95% of Z lies between these two values; the Z in the middle is the Z obtained from the sample, i.e., the test statistic. This can be written as

Pr[−z_{α/2} < (β̂2 − β2)/(σ/√Σx_i²) < z_{α/2}] = 1 − α

We can give any value to α, like 5%, 10%, etc.

As we did in the case of the arithmetic mean, multiply all the terms by σ/√Σx_i², subtract β̂2 from all the terms, and then multiply through by −1 (reversing the inequalities); you will get

Pr[β̂2 − z_{α/2}(σ/√Σx_i²) < β2 < β̂2 + z_{α/2}(σ/√Σx_i²)] = 1 − α

This is the confidence interval for β2 for the level of significance α. If α is 5%, this is the 95% confidence interval for β2. So we can write

β2 = β̂2 ± z_{α/2}[SE(β̂2)]

This can be generalized as

β_i = β̂_i ± z_{α/2}[SE(β̂_i)]

This is the 100(1 − α)% confidence interval for any regression coefficient β_i.

The Z distribution can be used for constructing confidence intervals only if σ is known: if σ is known and α is fixed, this confidence interval can be constructed. But, as we discussed when using the arithmetic mean to construct a confidence interval for µ, in practical applications σ will be unknown. So, as in the case of the arithmetic mean, for the OLS estimators also we use the t distribution for constructing the confidence interval:

t = (β̂2 − β2) / (σ̂/√Σx_i²) ∼ t(n − 2)

Now we are considering the case where the degrees of freedom is n-2 not n-1.

Now let us see how we actually derive this t. We know that

Z1 = (β̂2 − β2)√Σx_i² / σ ∼ N(0, 1)

Z2 = (n − 2) σ̂²/σ² = Σû_i²/σ² ∼ χ²(n − 2)

Now, substituting these values into t = Z1 / √(Z2/(n − 2)),

t = [(β̂2 − β2)√Σx_i² / σ] / √[Σû_i² / ((n − 2)σ²)] = (β̂2 − β2) / (σ̂/√Σx_i²)

This quantity follows the t distribution with n − 2 degrees of freedom. Whenever σ is replaced by its unbiased estimator σ̂, Z is converted into the t distribution. This t distribution can be used whether the sample size is large or small, because whenever σ is replaced by its estimator σ̂, the result is the t distribution (irrespective of whether the sample size is small or large).

But then why is Z sometimes used in large samples even when σ is unknown and its estimator is used? Because in large samples the Z and t distributions are not much different. In empirical applications the t distribution is the one that is widely used; if the sample size is large, Z can also be used, but, as you will see, most econometric software uses the t distribution rather than the Z distribution for hypothesis testing. Now we will use the t distribution for constructing confidence intervals.

Pr[−t_{α/2} < t < t_{α/2}] = 1 − α

where t is now the empirical quantity obtained from the sample:

Pr[−t_{α/2} < (β̂2 − β2)/(σ̂/√Σx_i²) < t_{α/2}] = 1 − α

Pr[β̂2 − t_{α/2}SE(β̂2) < β2 < β̂2 + t_{α/2}SE(β̂2)] = 1 − α

The point in this case is that the critical t value t_{α/2} depends on the degrees of freedom in addition to α, i.e., the level of significance.

Once α is fixed at 5% and the degrees of freedom are 20, the critical t value from the table or from the software is 2.086, so

−t_{α/2} = −2.086

t_{α/2} = +2.086

95% of the t values lie between these two limits, and this gives the 95% confidence interval for β2 using the t distribution:

β2 = β̂2 ± t_{α/2} SE(β̂2)

This can be generalized as

β_i = β̂_i ± t_{α/2} SE(β̂_i)

the 100(1 − α)% confidence interval for the given degrees of freedom.
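A minimal sketch of computing such an interval for a generic coefficient; the estimate, its standard error and the degrees of freedom below are assumed numbers used only for illustration.

```python
# Sketch: 100(1 - alpha)% interval beta_i = b_i +/- t_{alpha/2} * SE(b_i).
from scipy.stats import t

b_i, se_b_i, df, alpha = 0.75, 0.10, 18, 0.05     # assumed values
t_crit = t.ppf(1 - alpha / 2, df)
lo, hi = b_i - t_crit * se_b_i, b_i + t_crit * se_b_i
print(f"{100 * (1 - alpha):.0f}% CI for beta_i: [{lo:.4f}, {hi:.4f}]")
```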

There are mainly two approaches to hypothesis testing

1) Test of significance approach ( based on point estimation)

2) Confidence interval approach (based on interval estimation)

Hypothesis Testing

Formation of Null and Alternative Hypotheses

How to formulate hypotheses?

What are the various steps in the hypothesis testing procedure?

How to accept or reject a hypothesis?

Let us see how to formulate hypotheses?

There are some procedures for formulating the hypothesis

Classical statistical inference has two branches

1) Estimation

2) Hypothesis testing

Hypothesis testing is required because, as we discussed earlier, we have only samples, and using the samples we have to make inferences about the parameters. If the entire population were known to us, we could directly compute the population parameters and there would be no need for hypothesis testing; but this is not practically possible. Now let us see how to formulate hypotheses. Hypothesis testing is the process of determining whether or not a given hypothesis is true. The first step in hypothesis testing is formulating the hypotheses by specifying what is known as the null hypothesis.

Null Hypothesis(H0)

A null hypothesis is an assertion about the value of a population parameter. It is the assertion we hold to be true until we have sufficient statistical evidence to conclude otherwise. As an example, a null hypothesis may assert that the population parameter µ = 100:

H0: µ = 100

This is the assertion we treat as true until we have sufficient statistical evidence to conclude that µ ≠ 100.

For example , the average height of a population will be

𝐻0 :µ = 160 cm

This is the null hypothesis

There will also be an alternative hypothesis. The alternative hypothesis is the negation of the null hypothesis; for example, it may be written as

H1: µ ≠ 100

It is sometimes denoted H_a.

Even though the formulation of H0 and H1 seems very simple, determining H0 in a given situation can be somewhat complicated. It is important to be clear about what the null hypothesis H0 is in a particular situation, otherwise the entire process of hypothesis testing is invalid. Now let us see how to formulate H0

and H1 using a concrete example to clarify some points. As an example to this, suppose that a

company selling coca cola is selling bottles with 2 liters, we expect that on an average each

bottle will contain 2 liters of cola , due to some issues some bottles may contain a little more than

2 liters, some bottles may contain a little less than 2 liters. But on an average there will be 2 liters

of cola in each bottle. Suppose that a consumer suspects that the company is cheating the

consumers by filling less than 2 liters of cola. Then in the first case what is the null hypothesis?

and what is the alternate hypothesis? . As we discussed earlier, the null hypothesis is an assertion

about population parameters. It is an assertion we call true until we have sample evidence to

conclude that it is not true. So the company producing and selling cola is a reputed company , so

we expect that the company is doing the business honestly. So there is no need to believe

otherwise. So the null hypothesis is

𝐻0 : µ ≥ 2

This the null hypothesis and it is tested against Alternate hypothesis

𝐻1:µ< 2

The consumer advocate's suspicion has substance only if the sample evidence supports H1: µ < 2.

As this example shows, H0 is often a claim made by someone, while H1 is the suspicion someone has about that claim.

Remember that we will go for new research if and only if we have some doubt about the status

quo. If we have no doubt about the status quo, there is no need for further research. The purpose

of research is to find something new or something different. In this case, you suspect that the claim made by the company is not true, so you design an instrument, collect the data or samples and, using the samples, test the hypothesis. If the null hypothesis is rejected, you can conclude that the claim made by the company is false. But

variations are possible. Suppose that the company is not making any claim; it is simply filling all bottles with 2 liters of cola, and suppose that we want to demonstrate that, on average, the company is selling less than 2 liters. As an independent researcher you want to prove that the company is selling less than 2 liters of cola. In this case,

H1= what we want to demonstrate

𝐻0 : µ ≥ 2

Now you have to demonstrate that the company is on average selling less than 2 liters. Now, one way to distinguish H0 and H1 is:

If H0 is accepted = No corrective action is required

If H1 is accepted = Corrective action is required.

This is an easy way to distinguish H0 and H1

In this case if you are a consumer advocate or an independent researcher if H0 is accepted the

company is selling on an average 2 liters of cola no corrective action is required, suppose that if

H0 is rejected , on an average the company is selling less than 2 liters of cola .Thus corrective

action is required,

Suppose the consumer advocate has no suspicion about the company but the owner has a doubt

whether the machines fill on an average more than 2 liters. If the machine is filling more than 2

liters , it will be a loss as far as the owner is concerned. Now from the owners point of view we

write

𝐻0: µ ≤ 2

𝐻1:µ > 2

So from the owner's perspective, corrective action is required if the bottles are being filled with more than 2 liters, i.e., when the alternative hypothesis is accepted.

Third perspective:-

From the machine operator’s perspective

H0 : μ = 2 liters

H1 : μ ≠ 2 liters

These are the 3 principles to remember whenever we formulate hypotheses.

If you are an independent researcher, doing an academic research, for example:- in your field
of research such as Economics, Sociology, Psychology etc. your research hypothesis is
always H1

What you want to demonstrate is H1

H0 is Status quo

If H0 is accepted, nothing new is discovered from the sample.

On the other hand, if H0 is rejected, you have statistical evidence against H0 and you collect
data to estimate the parameters etc to demonstrate something new. So, the researcher’s
hypothesis is H1 whatever is your field of study: Economics, Sociology, Psychology etc.

This is true whether you are testing the significance of a single parameter, the difference between two parameters (say µ1 and µ2), or the difference between many groups as in ANOVA; whatever the context, the null hypothesis is the status-quo (no-difference) hypothesis.

We say that the test statistic is statistically significant if and only if H0 is rejected. Significant means that something has been discovered from the sample; since the test statistic is based on the sample, we say that the test statistic is significant.

Remember one more point: whenever we go for research, always formulate the hypotheses first. If we formulate a hypothesis after observing the results, we can always reject H0, or formulate H0 in such a way that it will be rejected.

So, never go for such a practice. The correct practice is to formulate H0 & H1 first.

If we formulate hypotheses after observing the results, it is what is known as Data Snooping.

After formulating the hypothesis, collect the suitable data; primary or secondary, time series or
panel, depending on your research question.

Formulate H1 , then estimate parameters and using the procedures of hypothesis testing, test the
significance of the population parameters. This is the first step, the formulation of null and
alternative hypotheses.

Steps in Hypothesis testing

Estimation and hypothesis testing are the twin branches of statistical inference.

Estimation means estimation of the parameters. Using a sample we estimate the parameters using

the sample statistics. The second part is hypothesis testing.

How to test hypotheses about the population parameters using sample statistics.

Now let us see what are the steps of hypothesis testing.

Steps in hypothesis testing

1. Formulation of null and alternative hypothesis.

Formulation of 𝐻0and 𝐻1

Formulate the hypotheses before beginning any research; otherwise the entire procedure is invalid.

Now let us formulate the problem of hypothesis testing more formally as follows.

Suppose we have a random variable X with PDF f(X, θ), where θ is the parameter of the distribution. We observe a random sample of size n from this population and, using it, we obtain a point estimator θ̂ of the parameter θ.

Since the parameter θ is unknown, we ask: is the estimator θ̂ compatible with some hypothesized value θ = θ*, where θ* is a specific value of θ and "compatible" means sufficiently close to?

Stated differently, could our sample have come from the PDF f(X; θ = θ*)? So the question is whether the sample was generated from a population with θ = θ*. In the language of hypothesis testing, θ = θ* is known as the null hypothesis and is written H0. It is tested against the alternative hypothesis:

H0: θ = θ*

H1: θ ≠ θ*

Now, H0 and H1 can be simple or composite. A hypothesis is an assertion about a population parameter; it is simple if it specifies a specific value of the parameter, and composite if the specific value is not specified.

As an example, suppose that X ∼ N(µ, σ²).

Suppose that our H0 is µ = µ*, σ² = σ*². This is a simple hypothesis.

Suppose instead that our H0 is µ = µ* and σ² > σ*². This is a composite hypothesis, because the value of σ² is not specified.

In econometrics we very often test the hypothesis θ = 0. Instead of specifying a particular value, we often test a hypothesis like this; it is known as the zero null hypothesis. But it is not the only hypothesis we can test; we can give any value to θ.

When you read software output, always remember that the reported tests are based on this zero null hypothesis. If you want to test other hypotheses, you have to use the options available in all the software packages for testing hypotheses other than zero.

Now usually 𝐻0 is simple.

H0: θ = θ*. But H1, the alternative hypothesis, can take any of these forms:

H1: θ > θ*

H1: θ < θ*

H1: θ ≠ θ*

And remember that all these alternative hypotheses are composite in nature. And which

hypothesis we will use depends on the problem at hand.

Example: suppose that X ∼ N(µ, σ²) and that we want to test a hypothesis about µ only, say µ = 70, tested against µ ≠ 70. The form in which H1 is formulated determines the critical region: the location of the critical region depends on how we formulate H1.

Now to test a hypothesis we divide the entire population into 2.

● One is the acceptance region.

● Other one is the critical region

In hypothesis testing we use a test statistic; suppose we use X̄ to test a hypothesis about µ. Estimators are not directly used for hypothesis testing; from the estimators we construct a test statistic, say

Z = (X̄ − µ)/(σ/√n)

X ∼ N(µ, σ²), so X̄ ∼ N(µ, σ²/n). This means that if X has mean µ, say µ = 100, then E(X̄) = 100, and when X̄ = µ the value of Z is 0.

My hypothesis is H0: µ = µ*, where µ* is a specific value for the mean of X; it is also the mean of X̄. Suppose the specified value is 100.

Now the entire range of the test statistic is divided into two parts. The values that are highly likely under H0 form the acceptance region, say 95%. The 2.5% in each tail of the graph is known as the critical region: values of the statistic that have a very low probability of being observed if H0 is true. The acceptance region consists of values that have a high probability of being observed given that H0: µ = 100.

So the entire distribution of the test statistic is divided into what is highly likely under H0 and what is highly unlikely under H0.

If H1 is of the "not equal to" form, the critical region is in both tails, left and right. If H1 is of the ">" form, the entire critical region is in the right tail only; otherwise it is in the left tail.

So whether you have a two tailed test, a right tailed and left tailed test depends on how you

formulate 𝐻1, the research hypothesis/ alternative hypothesis.

Now in the empirical applications the critical region is often fixed as 5%, and this is also known

as the level of significance α. .

In econometrics the highly unlikely region is often fixed at 5%, it is not only the value which we

use. In medical sciences it is fixed at 1%. In some cases we can go 15% etc. depending on

circumstances.

The second step in hypothesis testing is to choose the level of significance of the test.

So the point is our decision to accept 𝐻0or 𝐻1 is based on sample information. Due to sampling

fluctuations the conclusions will not always be correct. In making decisions about accepting or rejecting hypotheses, we are liable to commit two types of errors.

● Type 1 error. That is reject 𝐻0 when 𝐻0true.

● Type 2 error: accept 𝐻0 when 𝐻0 false.

Decision          H0 true          H0 false
Reject            Type I error     No error
Do not reject     No error         Type II error

We cannot minimize the two types of errors simultaneously: when we reduce the Type I error, the Type II error increases, and vice versa.
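A small sketch of this trade-off, assuming a one-sided test of H0: µ = 0 against a true mean of 1, with σ = 1 and n = 25 (all numbers assumed): as α is made smaller, β, the probability of a Type II error, rises.

```python
# Sketch of the Type I / Type II trade-off for a one-sided z test (assumed values).
from scipy.stats import norm

sigma, n, mu1 = 1.0, 25, 1.0
se = sigma / n ** 0.5
for alpha in [0.10, 0.05, 0.01]:
    crit = norm.ppf(1 - alpha) * se           # reject H0 if x_bar > crit
    beta = norm.cdf(crit, loc=mu1, scale=se)  # P(accept H0 | H1 is true)
    print(f"alpha = {alpha:.2f}  ->  beta = {beta:.4f}, power = {1 - beta:.4f}")
```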

Example: we know that in the legal proceedings police arrest a criminal, collect evidence present

the accused before the court. The court after considering the evidence suggested by the police

take a decision regarding 𝐻0 or 𝐻1.

Now in this example the 𝐻0: The accused is innocent.

Then 𝐻1: the accused has committed the crime.

In a democratic country, we cannot judge a person to be a criminal only because he has been arrested by the police. The police arrest a suspect and then collect the evidence; in our case the evidence is the sample. Suppose that on the basis of the evidence, the sample, we cannot reject H0. That means the accused is innocent: the sample evidence is not strong enough to reject H0, or nothing new has been deduced from the sample evidence.

Now what is the Type 1 error here?

● Rejecting H0 when H0 is true means the accused is innocent but we reject his innocence: an innocent person goes to jail.

● The second case is that H0 is false but we do not reject it: the accused is not innocent, he has committed the crime, but he goes free.

Which one is more serious?

The classical approach to hypothesis testing procedure is Type 1 error is considered to be more

serious.

So what we do is we fix the type 1 error at 5%, 10%,1% etc and try to minimize Type 2 error.

And the probability of Type 1 error is known as α. It is known as the level of significance.

The probability of Type 2 error is β.

The probability of not committing Type 2 error is 1-β. It is known as the power of the test.

Even though in this example Type 1 error is more serious, variations are possible.

Example: steel is used for constructing houses and also for fencing, and steel of a particular strength is required for housing.

H0: The steel has the required strength of 1000.

H1: The strength of the steel is less than 1000.

What is the cost of a Type I error? Only the cost of producing the steel. What is the cost of a Type II error? A Type II error means accepting H0 when it is false, that is, using steel that has less than the required strength.

Now the question is the cost of Type 1 and Type 2 error depends on the purpose for which it is

used. If the steel is used for constructing a house and it lacks the required strength, the house will collapse and there will be loss of life; this is highly costly.

Type 2 error is costly compared to Type 1 error. Because Type 1 error is only a cost of

production.

But suppose that it is used for fencing only, here Type 2 error is not very important because even

if the fence collapses its consequence will be very low.

So whether Type 1 or Type 2 error is more serious depends on the issue at the hand.

In general Type 1 error is more serious.

3. The third step is the location of the critical region.

We use test statistics like z, t etc for hypothesis testing

And the entire distribution is divided into 2, highly likely under 𝐻0, highly unlikely under 𝐻0.

Highly likely is the acceptance region, highly unlikely is the critical region.

Now the critical region may be the right side only , left side only or both side, depending on how

you formulate :

● If H1: θ ≠ θ* is your alternative hypothesis, it is a two-tailed critical region.

● If H1: θ > θ*, the critical region is in the right tail of the test statistic.

● If H1: θ < θ*, the critical region is in the left tail.

So whether the critical region in the right tail, left tail or both tails depends on how we

formulate the alternative hypothesis.

4. The fourth step is to choose appropriate test statistics.

To test hypotheses we have to use test statistics. If you want to test the significance of a mean, we can use z or t.

● If the population variance is known Whatever the sample size you can use z.

● If population variance is unknown and the sample size is less than 30, we will use t test.

● To test the significance of variance we use the chi square test.

● If you want to test the difference between two means (µ1 − µ2), z or t can again be used; to test the equality of more than two means, as in ANOVA, we use the F statistic.

● While discussing the normal distribution and the distributions related to it, we saw that the chi-square distribution is used to derive the sampling distributions of squared quantities, the F distribution, and related results.

We can use the appropriate test statistics depending on the context.

5. Compute the sample value of the test statistic.

Suppose that Z is our test statistic and we are testing a hypothesis about a population parameter µ. Then

Z = (x̄ − µ)/(σ/√n)

where µ takes the value given under H0. As an example, suppose that µ = 60, x̄ = 70, and σ/√n = 2. Then

Z = (70 − 60)/2 = 5

This is the value of the test statistic. This is based on the assumption that σ is known to us; otherwise we use the t test.

6. Compare the value of the test statistics with the critical value.

If α=5% 95% of z is between -1.96 and +1.96.

So if the value of the test statistic calculated from the sample falls in the critical region, we reject H0: µ = 60 against the alternative µ ≠ 60. When we reject a null hypothesis using the test statistic, we are saying that our sample was not generated by a population with µ = 60; it was generated by a different population, one with µ ≠ 60.

These are the steps in hypothesis testing. For example suppose that the value of test statistics

= -1. Then we accept 𝐻0. We will say that our sample is generated by a population with

mean = 60. There is no difference between what we have assumed under 𝐻0 and what we have

obtained from the sample or the sample value is close to the hypothesized value. So 𝐻0 is not

rejected.
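A minimal sketch of steps 5 and 6 in code, using the numbers from the example above (µ = 60 under H0, x̄ = 70, σ/√n = 2, α = 5%):

```python
# Sketch: compute Z from the sample and compare with the 5% critical values.
from scipy.stats import norm

mu0, x_bar, se = 60.0, 70.0, 2.0
z = (x_bar - mu0) / se                    # = 5
z_crit = norm.ppf(0.975)                  # about 1.96
print("Z =", z, "| reject H0:", abs(z) > z_crit)
```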

Hypothesis testing, population mean

In this session we discuss how to test the hypothesis. Now there are two approaches to

hypothesis testing;

1. Confidence interval approach

2. Test of significance approach

Test of significance approach was developed by Fisher, Neyman and Pearson.

We will first consider the confidence interval approach. While discussing estimation theory, we noted that there are two types of estimation: point estimation and interval estimation. The confidence interval approach is based on the concept of interval estimation, and the test of significance approach is based on the concept of point estimation.

Confidence interval approach

The first step in hypothesis testing is to formulate the null and alternative hypotheses. Let our null hypothesis be

H0: µ = µ*

tested against

H1: µ ≠ µ*

We first explain the approaches to hypothesis testing using the arithmetic mean as the estimator, and then extend the analysis to the OLS estimators β̂1 and β̂2.

We assume that the population X is normally distributed with mean µ and variance σ²:

X ∼ N(µ, σ²)

and X̄ is also normally distributed with the same mean µ and variance σ²/n:

X̄ ∼ N(µ, σ²/n)

If X is normally distributed, X̄ is also normally distributed; since the probability distribution of X̄ is known, we can use this known probability distribution to construct a confidence interval:

100(1-α) % confidence interval for the parameter µ

This confidence interval for µ is based on the estimator 𝑋

The decision rule is as follows. Once we construct the 100(1 − α)% confidence interval, where α is specified in advance (if α = 5% it is a 95% confidence interval, if α = 1% a 99% confidence interval, and so on), we see whether the hypothesized value µ = µ* is included in the interval or not. If the hypothesized value µ = µ* is in the interval, we do not reject the null hypothesis; otherwise we reject it.

So, in the confidence interval approach, we construct a 100(1-α) % confidence interval for the

parameter µ using the estimator 𝑋. if the hypothesized value is included in this interval we will

not reject the null hypothesis, and if the hypothesized value is not included in this interval we will reject the null hypothesis.

If α = 0.05 (5%), we construct the 100(1 − α)% = 95% confidence interval for µ.

We know that if X̄ ∼ N(µ, σ²/n), Z can be written as

Z = (X̄ − µ*)/(σ/√n)

which is a standard normal variable with mean 0 and variance 1:

Z = (X̄ − µ*)/(σ/√n) ∼ N(0, 1)

Here the distribution of X̄, and hence of Z, depends on the value of µ assumed under the null hypothesis; it is not the actual distribution, only an assumed one.

Pr[−z_{α/2} < (X̄ − µ*)/(σ/√n) < z_{α/2}] = 1 − α

If α = 5%, the above equation can be written as

Pr[−1.96 < (X̄ − µ*)/(σ/√n) < 1.96] = 1 − α

Rearranging this, we get

Pr[X̄ − 1.96(σ/√n) < µ < X̄ + 1.96(σ/√n)] = 1 − α

The 95% confidence interval for µ is X̄ ± 1.96(σ/√n).

So this confidence interval is constructed for the value of µ assumed under the null hypothesis, H0: µ = µ*.
Once the confidence interval is established, we will see whether the hypothesized value lie

within this interval or not.

Now the procedure is simple, once the 95% CI is constructed for µ using 𝑋 and its

standard error, we will see whether the hypothesized value is within this limit or not. If

the hypothesized value is within this limit we will not reject the null hypothesis, if the

hypothesized value is not within this limit, we will reject the null hypothesis.

As an example, suppose that we have taken a sample of n = 100, from which X̄ = 67, σ = 2.5, and σ/√n = 0.25.

Now X̄ + 1.96(σ/√n) = 67 + 1.96(0.25) and X̄ − 1.96(σ/√n) = 67 − 1.96(0.25); this gives the critical values

X̄ + 1.96(σ/√n) = 67 + 1.96(0.25) = 67.49

X̄ − 1.96(σ/√n) = 67 − 1.96(0.25) = 66.51

Suppose that

H0: µ = 69

is tested against

H1: µ ≠ 69

What we do is consider whether the hypothesized value lies within the range 66.51 to 67.49. The hypothesized value is 69, and it is not within this range, so we reject the null hypothesis.

In the context of confidence intervals, as we have discussed earlier, if we construct intervals like

this a repeated number of times, the parameters will lie in such intervals 95 out of 100 times.

Suppose instead that the hypothesized value is within this interval, so we accept the hypothesis. Intervals constructed in this way will contain the true parameter 95 out of 100 times, and will fail to contain it 5 out of 100 times. So, using this rule, we will wrongly reject a true null hypothesis about 5 times out of 100; this is the probability of committing a Type I error.

2. The test of significance approach

The test of significance is a procedure by which sample results are used to verify the truth or

falsity of a null hypothesis; the key idea behind it is the test statistic and the distribution of that test statistic under the null hypothesis. The decision to accept or reject the hypothesis is based on the

value of the test statistic obtained from the sample.


Under the normality assumption, X ∼ N(µ, σ²) and X̄ ∼ N(µ, σ²/n), so

Z = (X̄ − µ)/(σ/√n)

If the value of µ is specified under the null hypothesis, we can calculate Z on the assumption that σ is known; so Z can be used as the test statistic.

As Z ∼ N(0, 1), we can state the probability statement

Pr[−z_{α/2} < (X̄ − µ*)/(σ/√n) < z_{α/2}] = 1 − α

If α = 5%,

Pr[−1.96 < (X̄ − µ*)/(σ/√n) < 1.96] = 1 − α

Then we rearrange this as

Pr[µ* − z_{α/2}(σ/√n) < X̄ < µ* + z_{α/2}(σ/√n)] = 1 − α

What we did here is multiply throughout by σ/√n and then add µ* to all the terms.

This is the interval in which X̄ will fall with probability 1 − α, given that µ = µ*. (This is not a confidence interval.)

In the language of hypothesis testing, this 100(1 − α)% interval is known as the region of acceptance of the null hypothesis, and the region outside it is the region of rejection.

This interval is based on the assumption that σ is known. In most cases σ is unknown, and we use t:

t = (X̄ − µ)/(s/√n) ∼ t(n − 1)

In that case we replace Z with t, and rearranging we get

Pr[µ* − t_{α/2}(s/√n) < X̄ < µ* + t_{α/2}(s/√n)] = 1 − α

Here t_{α/2} is the critical t value for the α level of significance and (n − 1) degrees of freedom. We accept the null hypothesis if X̄ falls in this interval, given µ = µ*.

Since the distribution is normal, we know its area property: µ* ± 1.96(σ/√n) includes 95% of the values of X̄, and the Z used as the test statistic is derived from X̄, the sampling distribution. So if µ = µ*, 95% of the distribution lies within this interval and 5% falls outside. If the X̄ obtained from the sample lies in this interval, we do not reject the null hypothesis. In the language of significance testing, a statistic is said to be statistically significant if the value of the test statistic lies in the critical region; then the null hypothesis is rejected. A statistic is said to be statistically insignificant if the value of the test statistic lies in the acceptance region; then we fail to reject the null hypothesis.

Let us see a concrete example;

Suppose that the average height of students in a college is claimed to be 160 cm:

H0: µ = 160 cm

H1: µ ≠ 160 cm

A sample of size n = 100 is taken, X̄ = 150 cm, σ = 4, and SE = σ/√n = 0.4.

µ ± 1.96(σ/√n) = 160 ± 1.96(0.4) = 160.784 and 159.216 (the critical values)

Presenting this graphically:

How do we find the area between 159.22 and 160.78? That area contains 95% of the values of X̄: if we construct the sampling distribution of X̄, its area property is the same as the area property around µ.

What is the probability that the sample was generated from a population with mean 160? We know that this probability is very low, because 150 lies in a very low-probability region of that distribution. So we reject the null hypothesis. 95% of the values lie between 159.22 and 160.78, and only 5% lie outside (2.5% above 160.78 and 2.5% below 159.22), so the probability of such a sample coming from a population with mean 160 is very low.

We can calculate Z as

Z = (150 − 160)/0.4 = −25
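To make the mechanics concrete, here is a minimal Python sketch of this Z test, using only the numbers from the height example above (scipy is assumed to be available):

```python
import math
from scipy.stats import norm

# Numbers from the height example above
mu0, xbar, sigma, n = 160, 150, 4, 100
se = sigma / math.sqrt(n)            # standard error = 0.4

# Critical values of X-bar under H0 at alpha = 5% (two-tailed)
z_crit = norm.ppf(0.975)             # approximately 1.96
lower, upper = mu0 - z_crit * se, mu0 + z_crit * se

z = (xbar - mu0) / se                # test statistic, about -25
print(lower, upper, z)
print("reject H0" if abs(z) > z_crit else "do not reject H0")
```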

Let us convert the graph into Z as follows;

Hypothesis testing about regression coefficients

Now we discuss in detail hypothesis testing in the context of regression analysis. The parameters of interest are β1 and β2, and the estimators are β̂1 and β̂2. We use these estimators to make inferences about the parameters. As we discussed earlier, remember that before estimating the model we should develop hypotheses about β1 and β2.

Once the hypotheses are fixed, we collect data and apply the inferential procedure: we fix the level of significance, determine the critical region based on the alternative hypothesis, and choose the test statistic based on the problem at hand. Finally we calculate the value of the test statistic from the sample, compare it with the critical value, and make a decision.

There are two approaches to hypothesis testing; the confidence interval approach and the test of

significance approach. The fundamental question of hypothesis testing is: is a given observation or finding compatible with a stated hypothesis or not? Compatible means sufficiently close to the stated hypothesis; if the finding is sufficiently close we do not reject the stated hypothesis, otherwise we reject it. In the language of hypothesis testing, the stated hypothesis is the null hypothesis,

tested against the alternative hypothesis, also known as the maintained hypothesis or the research

hypothesis. The theory of hypothesis testing is concerned with developing rules or procedures of

deciding whether to accept or reject the null hypothesis. In both approaches of hypothesis testing,

we assume that the estimator is a random variable, and the estimator has a probability

distribution, and the probability distribution of the estimator is known as the sampling

distribution, and the sampling distribution is used for hypothesis testing. The normality

assumption is used for hypothesis testing, because if X is normal the estimator is also normal. In

the context of regression, we assumed that the u is normal and as estimators are linear

combinations of u, the estimators are also normally distributed.

Both approaches to hypothesis testing will be illustrated with the income and consumption example. Income ranges from 80 to 260, with corresponding consumption figures, and n = 10. Suppose that the model estimated using OLS is

Ŷᵢ = 24.4545 + 0.5091 Xᵢ

SE(β̂1) = 6.4138, SE(β̂2) = 0.0357

n=10

df= 8

α=5%

Critical t value = 2.306

In the discussion of regression analysis, we consider only the t value, because the standard errors

are estimated standard errors.

Now let us consider the confidence interval approach of hypothesis testing first;

Here, β̂2 = 0.5091.

Suppose our hypothesis is H0: β2 = 0.3 tested against H1: β2 ≠ 0.3 (0.3 is an assumed value for the purpose of hypothesis testing).

Once we formulate the hypothesis, we construct the 95% confidence interval for β2 as

β̂2 ± 2.306 (0.0357) = 0.5091 ± 0.0823, that is, 0.4268 < β2 < 0.5914

The 95% confidence interval for β2 is therefore 0.4268 to 0.5914. In the language of hypothesis testing, if the value of β2 under H0 lies in this 100(1 − α)% (here 95%) interval, we do not reject H0; otherwise we reject it. The hypothesized value is 0.3, which lies outside the interval, so we reject H0. That means the sample was not generated by a population with β2 = 0.3.
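A minimal Python sketch of this confidence-interval check, using only the numbers quoted above (scipy assumed available):

```python
from scipy.stats import t

beta2_hat, se_beta2, df = 0.5091, 0.0357, 8   # estimates from the example
t_crit = t.ppf(0.975, df)                     # about 2.306 for alpha = 5%

lower = beta2_hat - t_crit * se_beta2         # about 0.4268
upper = beta2_hat + t_crit * se_beta2         # about 0.5914

beta2_H0 = 0.3                                # hypothesized value
print(lower, upper)
print("reject H0" if not (lower < beta2_H0 < upper) else "do not reject H0")
```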

The procedure in confidence interval approach to hypothesis testing is;

1. Construct a 100(1-α)%confidence interval for β2

2. If β2 under the 𝐻0, falls in this region, do not reject the 𝐻0, otherwise reject

As we already discussed, the parameter will lie in such intervals 95 out of 100 times; only 5 times out of 100 will it lie outside, so the true parameter may in fact be outside the interval. In our example the hypothesized value lies outside the limits, so we reject H0: our sample was not generated from a population with β2 = 0.3.

In the language of hypothesis testing, when we reject a null hypothesis, we say that our finding is

statistically significant; when we do not reject H0, we say that our finding is not statistically significant. In this case we reject H0, so our finding is statistically significant.

Statistically significant means, our sample finding is significant or our test statistic is significant.

Similarly, in a court procedure the evidence collected by the police is used to verify whether the person has committed the crime or not. If the court accepts H0, the person is treated as innocent; that means the evidence is not significant. In our case, the evidence is our sample. The sample

is used to decide whether to accept or reject the 𝐻0. If 𝐻0 is accepted our sample finding is not

significant; in this sense we say that the test statistic is not significant. When we reject a hypothesis, we may be rejecting a true hypothesis (5 out of 100 times). The parameter will lie within the interval 95 out of 100 times and outside it 5 times, so we may still reject H0 even though H0 is true. Rejecting H0 when H0 is true is called a Type I error.

Test of significance approach

● Now we consider the test of significance approach developed independently by Fisher,

Neyman and Pearson . A test of significance is a procedure by which sample results are

used to verify the truth or falsity of a null hypothesis. The key idea behind the test of

significance approach is the test statistic and sampling distribution of test statistic under

null hypothesis. The decision to accept or reject H0 is based on the value of the test statistic obtained from the sample at hand.

● Now, under the normality assumption,

t = (β̂2 − β2) / SE(β̂2)

The standard error here is the estimated standard error.

t = (β̂2 − β2) / (σ̂/√Σxᵢ²), as we derived earlier.

● Here we explain the test of significance approach, as in the case of the confidence interval approach, using only the t statistic, because in econometrics we rarely use the Z statistic in this context: σ is always replaced by σ̂, and the standard errors given in the example are estimated standard errors.

● Now, if the value of β2 is specified under H0, the t statistic can be calculated, because the standard error is given and the only unknown is β2. So t can be used as the test statistic, and our decision depends on the value of t obtained from the sample compared with the critical t values.

● Now, as we have written earlier,

Pr[ −t_{α/2} < (β̂2 − β2*)/SE(β̂2) < t_{α/2} ] = 1 − α

where β2* is the value of β2 specified under H0, and t_{α/2} is the critical value for the given degrees of freedom and the specified level of significance. In our example it is 2.306.


Pr[ −2.306 < (β̂2 − β2*)/SE(β̂2) < 2.306 ] = 1 − α

This is the typical example.

Now, this can be rearranged as

Pr[ β2* − t_{α/2} SE(β̂2) < β̂2 < β2* + t_{α/2} SE(β̂2) ] = 1 − α

This is not a confidence interval; it is a different procedure. We multiplied throughout by the standard error of β̂2 and added β2* to all the terms to rearrange the equation.

● This is the interval in which β̂2 falls with probability 1 − α, given that β2 = β2*.

● And in the language of hypothesis testing this 100(1 − α)% confidence interval is

known as the region of acceptance of 𝐻0and outside it is the region of rejection. Before

explaining more, let us elaborate on the example.

● In our example, let us first explain this in terms of β̂2. If we calculate the values,

Pr[ β2* − t_{α/2} SE(β̂2) < β̂2 < β2* + t_{α/2} SE(β̂2) ] = 1 − α

0.3 ± 2.306(0.0357), which gives the interval 0.2177 to 0.3823.

This is not a confidence interval; it is the interval around β2* running from −t_{α/2} to +t_{α/2} in units of the standard error. What it says is that if you take samples like this a repeated number of times, you will generate a distribution, and 95% of the β̂2 values will lie between 0.2177 and 0.3823, provided that the true mean is 0.3. And this is not an actual distribution. This

is a hypothetical distribution. You will get this distribution only after formulating H0; if H0 is not formulated, you cannot derive the t statistic and you will not get this distribution.
● Remember, hypothesis testing is based on the sampling distribution of the estimators: not the actual sampling distribution, but a hypothetical one based on an assumed parameter value.

● Now, we will not do this directly; we will convert β̂2 into units of t.

● 95% of t with 8 degrees of freedom lies between −2.306 and +2.306. Now we calculate this t:

t = (0.5091 − 0.3)/0.0357 = 5.86

The value of β̂2 obtained is 0.5091; when converted into t, it is 5.86.

Now, in the test of significance approach, we compare the value obtained from the sample with the hypothesized value. If the sample value is far away from the hypothesized value, our hypothesis may not be correct, or our sample may not have been generated from a population with the hypothesized value of 0.3, so our suspicion about H0 increases. If the value of β̂2 is close to the hypothesized value β2*, we will be in the acceptance region; otherwise we will be in the

rejection region. What this says is that the probability that our sample was generated from a population with mean equal to 0.3 is less than 5% (2.5% in the left tail and 2.5% in the right tail). Only a low probability, so we reject H0. Converting into t, our t value is 5.86, and the probability of obtaining a t of 5.86 or more is less than 5%. The exact probability will be discussed later. Since it is less than 5%, H0 is again rejected.
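As a quick check of these numbers, here is a hedged Python sketch of the t test of significance using the example values (scipy assumed):

```python
from scipy.stats import t

beta2_hat, beta2_H0, se, df = 0.5091, 0.3, 0.0357, 8
t_stat = (beta2_hat - beta2_H0) / se          # about 5.86
t_crit = t.ppf(0.975, df)                     # about 2.306 for alpha = 5%, two-tailed
p_two_tail = 2 * t.sf(abs(t_stat), df)        # exact probability of |t| >= 5.86

print(t_stat, t_crit, p_two_tail)
print("reject H0" if abs(t_stat) > t_crit else "do not reject H0")
```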

● You can test a hypothesis using an estimator itself, and a sampling distribution. We will

not do this. What we do is, we convert an estimator into a sample statistic and the test

statistic is used for hypothesis testing. This is the procedure that we do.

● If β̂2 = β2* then t = 0, and t = 0 means the null hypothesis is accepted. If the estimated β̂2 is far away from the hypothesized value, a large t value is evidence against H0, while a small t value is evidence in favour of H0. To conclude, in the language of significance testing, a test statistic is said to be significant if its value lies in the critical region, and insignificant if it lies in the acceptance region. So, in this example, as the value of the test statistic lies in the critical region, we say that the test statistic is significant, meaning that the sample finding is significant.

● As we are using the t test here, we have a test procedure based on t. Remember that a statistical finding or test statistic is said to be significant only if H0 is rejected; otherwise it is not significant. If the null hypothesis is not rejected, the test statistic is not significant. This was a two-tailed test; we can adopt a one-tailed test also.

● If our null hypothesis is

H0: β2 = 0.3

H1: β2 > 0.3

Then the critical t value becomes t = 1.86 (one-tailed, 8 df), so with a one-tailed test the critical value is smaller. A test statistic lying between 1.86 and 2.306 would lead to accepting H0 under the two-tailed test but rejecting it under the one-tailed test; in this sense the one-tailed test increases the power of the test.

● If the alternative hypothesis states that β2 is less than β2*, we have a left-tailed test. Whether we have a one-tailed or a two-tailed test, and whether it is right-tailed or left-tailed, depends on the alternative hypothesis.

● So, to summarize,

Type of test | H0 | H1 | Reject H0 if
Two-tailed | β2 = β2* | β2 ≠ β2* | |t| > t_{α/2, df}
Right-tailed | β2 ≤ β2* | β2 > β2* | t > t_{α, df}
Left-tailed | β2 ≥ β2* | β2 < β2* | t < −t_{α, df}
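The table translates directly into a small decision helper. The sketch below is illustrative only; the function name and defaults are my own, and scipy is assumed:

```python
from scipy.stats import t

def reject_h0(t_stat, df, alpha=0.05, tail="two"):
    """Decision rule matching the summary table above."""
    if tail == "two":
        return abs(t_stat) > t.ppf(1 - alpha / 2, df)
    if tail == "right":
        return t_stat > t.ppf(1 - alpha, df)
    if tail == "left":
        return t_stat < -t.ppf(1 - alpha, df)
    raise ValueError("tail must be 'two', 'right' or 'left'")

print(reject_h0(5.86, 8))                  # True: reject at 5%, two-tailed
print(reject_h0(2.0, 8, tail="right"))     # True: 2.0 > 1.86
```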

● Always remember that for hypothesis testing we use the sampling distribution, and the area of the sampling distribution is divided in two: values highly likely under H0 and values highly unlikely under H0. Highly likely values are close to the hypothesized value; unlikely values are far away from it. Also remember that the distribution is hypothetical. If you specify a different hypothesized value, say 0.5 instead of 0.3, then β̂2 would fall in the acceptance region, not in the rejection region.

● So, it is important to fix the hypothesis first; if you fix the hypothesis after observing the results, you can always reject H0. That practice is not recommended. Fix the hypothesis first, then collect data, estimate the parameters, and test using the sampling distribution; remember, the sampling distribution is not used directly, we use test statistics derived from the sampling distribution.

● Also remember that the sampling distribution of the test statistic depends on the normality assumption, so normality of the estimator is important. In the regression context, normality of the estimators depends on normality of the disturbance u; it has nothing to do with the dependent or independent variable as such. This normality assumption matters mainly when the sample size is small: if you have a sufficiently large sample, say n > 100, the normality assumption is not very important.

● These are the lessons of hypothesis testing. There are many other issues related to hypothesis testing, and they will be discussed in the next sessions.

Hypothesis testing, some practical aspects

● Hypothesis testing is perhaps one of the most important areas of empirical analysis. The success of empirical research depends on how a hypothesis is formulated and how it is tested using inferential statistics. We have discussed hypothesis testing about the

population parameter µ using the estimator 𝑋, we have also discussed in detail how to test

hypotheses about regression coefficients β1and β2. The test statistics we have extensively

used is the t test, now we have to consider some practical issues regarding hypothesis

testing. In addition to the technical details of hypothesis testing, we must also be familiar

with some of the practical aspects of hypothesis testing. This session is meant to give

some concrete ideas about some of the practical issues, some of the precautions, some of

the considerations, when we test the hypothesis.

● The first issue that we are considering is what is meant by accepting or rejecting a

hypothesis?

● Remember, our decision to accept or reject the null hypothesis is based on sample evidence, so our decisions will not always be correct. Whenever we decide to accept or reject a null hypothesis, we are liable to commit two types of errors.

● Type I error : reject 𝐻0 when 𝐻0 is true

● Type II error : Accept 𝐻0 when 𝐻0 is false

● Now, as the decision to accept or reject H0 is based on sample evidence, and sample values vary from sample to sample, we can never be 100% certain in accepting or rejecting H0. Suppose that on the basis of sample evidence we decide to accept H0; that does not mean that H0 is true beyond any doubt. To explain this point, let us consider an example.

● Suppose that we hypothesize a value for the MPC in the consumption-income example:

𝐻0: β2 = 0. 7

𝐻1: β2 ≠ 0. 7

Suppose that the estimated value is β̂2 = 0.7241 and SE(β̂2) = 0.0701. Now consider the t statistic:

t = (0.7241 − 0.7)/0.0701 = 0.3438

So, this t value is not statistically significant, so, we accept 𝐻0 because 0.3438 is in the

acceptance region. So, we say that we accept 𝐻0

● Now, suppose that β2 = 0. 6, then

𝐻0: β2 = 0. 6

𝐻1: β2 ≠ 0. 6

t = (0.7241 − 0.6)/0.0701 = 1.7703

Again the t value is not significant, again we accept 𝐻0

● So, we accept the null hypothesis that β2 = 0.7 and we also accept the null hypothesis that β2 = 0.6, but we do not know which H0 is true, because the same data are compatible

with different hypotheses. The same sample evidence is compatible with different

hypotheses. So, we do not know which hypothesis is true.

● So, we say that we "may accept" H0 rather than that we "do accept" H0, or that we "do not reject" H0 at the 5% level of significance rather than that we "accept" it, just as a court declares "not found guilty" rather than "innocent". So on the basis of the

evidence collected by the Police, the Judge declares that the accused is not found guilty.

But it doesn't mean that he is innocent. The sample evidence is not sufficient to prove

that he has committed the crime. So the verdict is "not found guilty" rather than "innocent". When the accused is freed by the court, it does not mean that the accused is innocent beyond a

do not reject 𝐻0 rather than we accept 𝐻0 or we may accept 𝐻0 rather than we do accept

𝐻0. This is so because the same sample evidence will be compatible with different

hypotheses. And we do not know which hypothesis is correct. So, whenever we reject or

accept a hypothesis, always keep this point in mind.

● So, as one author says, rather than saying that we accept a null hypothesis, we can only say that the data do not allow us to reject it at the 5% (or x%) level of significance.

● The second point we want to discuss is Zero Null Hypothesis

● We can formulate H0 depending on the problem under consideration. But the null hypothesis most often tested in empirical analysis is the zero null hypothesis. Consider the relation

𝑌𝑖 = β1 + β2𝑋𝑖 + 𝑢𝑖

𝐻0: β2 = 0

𝐻1: β2 ≠ 0

Now, the purpose of this hypothesis is whether Y is related to X at all. If this hypothesis

is accepted or not rejected, there is no meaning in testing other hypotheses. Not only that,

we have many softwares for estimating econometric relations. In all these softwares the

default hypothesis is the zero null hypothesis. You have the option to test other

hypotheses and when we proceed we will consider other hypotheses also. Testing the zero

null hypothesis is not always the best practice. We have to test other hypotheses also.

Anyway, in the softwares, the default hypothesis is the zero null hypothesis.

● If we test the zero null hypothesis, then t becomes

t = β̂2 / SE(β̂2)

because the numerator is β̂2 − 0.

So, whether t is positive or negative depends on whether β2 is positive or negative. In the

consumption income example t will be positive as β2 is positive, in the quantity

demanded and price example t will be negative as β2 is negative, like that.

● So, t becomes t = β̂2 / SE(β̂2).

● As you can see, to test significance we compare β̂2 with its standard error. t will be significant only if β̂2 is much larger than the standard error of β̂2. If β̂2 equals its standard error, then t = 1, which is not significant; and if the standard error of β̂2 exceeds β̂2, t is again not significant. We judge β̂2 in relation to its standard error.

● So, remember, when we obtain the computer output we will see the coefficient, below it the standard error, and then the coefficient divided by the standard error, which is the t value. Below that we will get the p value, a topic we will take up soon.

● So, a hypothesis which we test in econometrics is a zero null hypothesis.

● Another issue of practical importance is how to formulate 𝐻0and 𝐻1?

● There are no hard and fast rules regarding how to formulate H0 and H1. In many cases economic theory will tell us how to formulate them. If economic theory is not strong, we have to use our intuition, previous empirical work of a similar nature, and so on. Remember that the formulation of H0 and H1 depends on the problem at hand. Often theory or previous empirical work will give you some guidance, or your informed guess can be used to formulate H0 and H1.

● No matter how you formulate 𝐻0and 𝐻1, it is absolutely essential that you must formulate

𝐻0and 𝐻1before estimation. This is very important. The point is, when you go for an

empirical research, the first issue is to identify a problem, by literature review we identify

a research gap, formulate a research problem.

● A research problem is specified in the form of 𝐻0and 𝐻1. Once 𝐻0and 𝐻1are formulated

you can collect your own data, that is primary data, or you can go for secondary data,

collected by some other agency. The point is 𝐻0and 𝐻1 should be formulated first.

Suppose that you identify an issue, collect data, and observe the results. If you formulate the hypotheses after observing the results, you can always formulate H0 and H1 in such a way that your null hypothesis is rejected and your finding appears significant. This is because hypothesis testing is based on a hypothetical distribution: the distribution takes shape only when the value of µ is specified as µ*, so if you give µ* one value you get one distribution, and if you give it another value you get another distribution.

● H0: µ = µ*

● So, for one value you may reject H0 and for another value you may accept it. If you formulate the hypothesis after observing the results, the problem is that you can always fix µ* far away from the observed value of X̄ for a given standard error, in which case H0 will be rejected.

● If you formulate a hypothesis after observing the results, there is a temptation to form the hypothesis just to justify your results.

● So, as one author says, the researcher will be guilty of circular reasoning or self-fulfilling prophecies: there is the possibility that H0 will be formulated in such a way that the sample finding is justified, that is, H0 is rejected. This is not scientific.

● So, the scientific procedure is: formulate the hypothesis first, then collect data, estimate, test, and accept or reject the hypothesis on the basis of the sample evidence and the hypothesis already formulated. That is a practical issue we must always keep in mind.

● Then another issue that a researcher confronts is how to fix α- level of significance?

● We divide the entire area of the sampling distribution into two: highly likely under H0 and highly unlikely under H0. The division of the area depends on the value of α. If α is, say, 5%, that is 0.05, then for a two-tailed test 2.5% lies in the left tail, 2.5% in the right tail, and 95% in the central portion of the normal distribution curve.

● So, our decision to accept or reject 𝐻0 depends on the level of significance,α. The level

of significance is the probability of committing Type I error. You know that the following

is the distribution used for hypothesis testing, say the distribution of X̄ with hypothesized value µ*. This is a hypothetical distribution and it depends on µ*; if you give another µ* you will get another distribution. Now, what is α? If α is 5%, the critical region is divided into 2.5% in each tail. If α is 10%, the critical region expands and the acceptance region shrinks.

● Suppose a sample value falls just inside the acceptance region under α = 5% but inside the critical region under α = 10%; then H0 is accepted at 5% and rejected at 10%. Our decision to accept or reject H0 therefore depends critically on α.

● So, the question is how do we fix α?

● There are no hard and fast rules. α is usually fixed at 1%, 5%, or 10%. In the social sciences the most commonly preferred level of significance is 5%; in the medical sciences it is 1%.

● The point is that α is the probability of committing a Type I error. You can reduce this probability by fixing α at a low level, such as 1%. But when you reduce α, the probability of a Type II error, β (accepting H0 when it is false), will increase. For a given sample you cannot minimize both types of error: when the probability of Type I error is reduced, the probability of Type II error increases, and vice versa. This is the problem.

● Can you reduce the Type I error to zero? Such a possibility exists: Type I error is rejecting H0 when H0 is true, so if you never reject H0, the probability of a Type I error becomes zero. But then the probability of a Type II error, accepting a false hypothesis, becomes 1.

● So, we cannot minimize both types of errors in a particular case. So, the common

practice is, econometricians and statisticians generally believe that type I error is more

serious than type II error. So, conventionally, we fix type I error at 1%, 5%, 10% etc. And

try to reduce type II error as much as possible.

● Regarding whether type I error or type II error is more serious, we have discussed an

example earlier. The cost of type I error and type II errors, the example of steel. If the

steel is used for building a house, the type II error is more serious but if it is used for

fencing type I error is more serious.

● So, when fixing α, remember that as α increases, β decreases, and vice versa. While fixing α one must keep in mind the costs of Type I and Type II errors, which depend on the purpose at hand.

Hypothesis testing P value

The other issue is the p value, or exact level of significance. The concept is very important in inferential statistics. As one author put it, the Achilles heel of classical statistical inference is the arbitrariness of fixing α at 1%, 5%, or 10%; this is the one weak spot in the classical procedure. What is suggested is that, instead of fixing α arbitrarily, once a test statistic is obtained from the sample (e.g. t = 5.86), we find the actual probability of obtaining such a t value. This is known as the probability value, or the exact or observed level of significance.

Eg: H0: MPC = 0.3.

If t equals the critical value 2.306, the probability of getting a t that large or larger (in absolute value) is exactly 5%, i.e. P = 0.05.

With α = 5%, if the sample value lies in the critical region, H0 is rejected, and the stated probability of committing a Type I error is 5%. But the actual probability is not 5%. To find the actual probability of committing a Type I error, we have to find the probability of obtaining t = 5.86:

P = 0.002 (two-tail, from tables)
= 0.001 (one-tail)

The computer value is 0.000189 (the p value of the observed t).

Suppose that p = 0.01: the test statistic is significant at 1%.
If p = 0.05, the test statistic is significant at 5%.
If p = 0.1, the test statistic is significant at 10%.
If p = 0.15, the test statistic is significant at 15%.

You report the p value along with the coefficient, standard error, and t, and let the reader decide whether to reject at 1%, 5%, 10%, 15%, and so on; one reader may be willing to reject only at 1%, another at 5% or 10%. Reporting the p value along with the results leaves that decision to the reader.

The p value is the lowest level of significance at which a null hypothesis can be rejected. It is also known as the exact or observed level of significance, and it is obtained after calculating the empirical t from the sample.
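As an illustration, the exact p value for the observed t in the earlier example can be computed as below (a sketch assuming scipy; the numbers are the ones quoted above):

```python
from scipy.stats import t

t_obs, df = 5.86, 8
p_one_tail = t.sf(t_obs, df)       # P(T >= 5.86), one tail
p_two_tail = 2 * p_one_tail        # two-tailed exact level of significance

print(p_one_tail, p_two_tail)
for alpha in (0.01, 0.05, 0.10):
    print(alpha, "significant" if p_two_tail < alpha else "not significant")
```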

Statistical significance v/s practical significance

Statistical significance is based purely on the t value: if the t value is low, the coefficient is statistically insignificant; if it is high, the coefficient is statistically significant. Practical significance, on the other hand, depends on the sign and the size of the coefficient. A coefficient may be statistically significant yet have no practical significance, for instance β̂2 = 0.00002. Both matter from a practitioner's point of view. Even if a coefficient is statistically significant, if its sign is not according to our expectation it is not practically meaningful; for example, in the consumption-income model a negative β̂2 is not theoretically acceptable. So the sign is important, and the size is also important. Statistical significance should not be confused with practical significance.

Guidelines

1. Check for statistical significance: if the coefficient is statistically significant, then consider its sign and size.

2. If it is not statistically significant at 1%, 5%, or 10%, you may still consider the expected sign and size, and report the p value; for small samples some practitioners are willing to reject H0 at p values up to 0.2 (20%).

3. If the sign (+ or −) is not theoretically justified, the result can be discarded.

Let me consider an example, dependent variable ( Y) is hourly wage and independent variable

( X) is years of education.

Y (hourly wage) | X (years of education)
4.45 | 6
.. | 7
.. | ..
13.53 | 18

Ŷᵢ = −0.0144 + 0.7240 Xᵢ
SE = (0.93173) (0.0700)
t = (−0.0154) (10.3428)
p = (0.987) (0.000)
n = 13, df = 11, α = 5%, critical t = 2.201
r² = 0.9065

About 91% of the variation in Y is explained by X. The slope coefficient is 0.7240, and its t value is far from zero, so the slope is statistically significant: there is a positive relationship between years of education and hourly wage.

Regression and ANOVA, the F test

ANOVA was introduced by Fisher to analyze data relating to agricultural experiments. Nowadays ANOVA is used extensively to compare differences between more than two means; in the regression context it is used mainly for statistical inference. To discuss ANOVA, recall that

TSS = ESS + RSS

Σyᵢ² = Σŷᵢ² + Σûᵢ² = β̂2²Σxᵢ² + Σûᵢ²

The study of the components of TSS is known as ANOVA in the context of regression. Associated with any sum of squares is its number of degrees of freedom.

TSS = Σ(Yᵢ − Ȳ)² with (n − 1) degrees of freedom.

RSS = Σûᵢ² with (n − 2) degrees of freedom.

ESS = Σ(Ŷᵢ − Ȳ)² with (n − 1) − (n − 2) = 1 degree of freedom.

ANOVA Table

Source of variation | SS | df | MSS
ESS | Σŷᵢ² = β̂2²Σxᵢ² | 1 | ESS/1
RSS | Σûᵢ² | n − 2 | Σûᵢ²/(n − 2) = σ̂²
TSS | Σyᵢ² | n − 1 |

F = β̂2²Σxᵢ² / σ̂² ∼ F(1, n − 2)

This follows from the definition of the F distribution. If Z1 ∼ χ²(k1) and Z2 ∼ χ²(k2) are independent, then

F = (Z1/k1) / (Z2/k2) ∼ F(k1, k2)

Here z1 = (β̂2 − β2)/(σ/√Σxᵢ²) ∼ N(0, 1), so

Z1 = z1² = (β̂2 − β2)²Σxᵢ² / σ² ∼ χ²(1), and Z2 = Σûᵢ²/σ² ∼ χ²(n − 2)

F = (Z1/1) / (Z2/(n − 2)) = [(β̂2 − β2)²Σxᵢ²/σ²] / [(Σûᵢ²/σ²)/(n − 2)] = (β̂2 − β2)²Σxᵢ² / σ̂² ∼ F(1, n − 2)

Suppose that H0: β2 = 0. Then

F = β̂2²Σxᵢ² / σ̂² ∼ F(1, n − 2)

This is the ratio of two independently distributed chi-square variables, each divided by its degrees of freedom. Under β2 = 0, the t statistic is

t = β̂2 / SE(β̂2)

Note that E(β̂2²Σxᵢ²) = σ² + β2²Σxᵢ² and E(σ̂²) = σ². So if β2 = 0, the expected values of the numerator and the denominator coincide and the F value will be small; RSS is a measure of the noise in the system, and a large F is evidence against H0. In general t²(k) = F(1, k); in the two-variable model, t²(n − 2) = F(1, n − 2).
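The t–F relation can be checked numerically. A small sketch using the consumption-income estimates quoted earlier (β̂2 = 0.5091, SE = 0.0357, n = 10) and the zero null hypothesis; scipy is assumed:

```python
from scipy.stats import f, t

beta2_hat, se_beta2, n = 0.5091, 0.0357, 10
t_stat = beta2_hat / se_beta2        # t under the zero null hypothesis
F_stat = t_stat ** 2                 # t(n-2)^2 equals F(1, n-2)

F_crit = f.ppf(0.95, 1, n - 2)
t_crit = t.ppf(0.975, n - 2)
print(F_stat, F_crit, t_crit ** 2)   # the two critical values agree: F_crit = t_crit^2
```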

Prediction using simple linear regression model

Example: suppose we have observations on Yᵢ and Xᵢ with n = 10. Using these observations and applying OLS, suppose that the estimated consumption function is

Ŷᵢ = 24.45 + 0.5091 Xᵢ

where 24.45 = β̂1 = intercept and 0.5091 = β̂2 = slope, the marginal propensity to consume.

Then what is the use of the estimated (historical) regression? An estimated regression is used to verify an economic theory, for policy purposes, and also for prediction. One use of this historical regression is to forecast a future value, say the consumption expenditure Y0 associated with income X0, where Y0 and X0 are not included in the original observations (Yᵢ, Xᵢ, i = 1, 2, ..., n) used to estimate the regression line.

The range of the sample used may be, say, 80 to 250. X0 may lie within this range or outside it; the only requirement is that X0 was not included in the original observations.

It is also assumed that the relationship between Y and X which generated the sample data continues to hold for the predicted pair (X0, Y0), whether X0 is within the sample range or outside it. Now, there are two types of predictions.

1) Individual prediction: predicting an individual value Y0 corresponding to X = X0.

2) Mean prediction: predicting E(Y | X0), a value on the population regression line.

Individual prediction

Suppose our aim is to predict the individual value Y0 corresponding to X = X0. The true value is

Y0 = β1 + β2X0 + u0

and we predict it as

Ŷ0 = β̂1 + β̂2X0

Ŷ0 is the predictor of Y0, and since Ŷ0 is an estimator it is likely to differ from the true value Y0: the two need not coincide, because Ŷ0 is based on the estimators β̂1 and β̂2.

The difference between Y0 and Ŷ0 is the prediction error or forecast error:

û0 = Y0 − Ŷ0 = (β1 + β2X0 + u0) − (β̂1 + β̂2X0) = (β1 − β̂1) + (β2 − β̂2)X0 + u0

There are two sources of error in this formula:

1. u0, the stochastic (additive) error term.

2. Second source of error is estimators . The reason is estimators are random variables and

the value of estimators varies from sample to sample.

But if you take expectations,

E(û0) = E(Y0 − Ŷ0) = 0

which means that the predictor Ŷ0 is unbiased.

Variance of prediction error

Var(û0) = Var(Y0 − Ŷ0) = Var[(β1 − β̂1) + (β2 − β̂2)X0 + u0]

û0 can also be written as û0 = u0 + (β1 − β̂1) + (β2 − β̂2)X0, so

û0² = u0² + (β1 − β̂1)² + (β2 − β̂2)²X0² + 2u0(β1 − β̂1) + 2u0(β2 − β̂2)X0 + 2(β1 − β̂1)(β2 − β̂2)X0

Taking expectations, using the variances and covariance of β̂1 and β̂2 (the terms involving u0 vanish because u0 is independent of the sample), we get

E(û0²) = E(Y0 − Ŷ0)² = σ²[ 1 + 1/n + (X0 − X̄)²/Σxᵢ² ]

The variance of the prediction error is given by this formula.

The variance of the prediction error is smallest when X0 = X̄; in that case it becomes σ²(1 + 1/n). The prediction error variance increases non-linearly as X0 deviates from X̄.

We know that û0 = Y0 − Ŷ0 is a linear function of normal variables, because it is a linear function of β̂1, β̂2 and u0.

We can write

(Y0 − Ŷ0) / ( σ √(1 + 1/n + (X0 − X̄)²/Σxᵢ²) ) ∼ N(0, 1)

σ is unknown, so we replace it with σ̂:

(Y0 − Ŷ0) / ( σ̂ √(1 + 1/n + (X0 − X̄)²/Σxᵢ²) ) ∼ t(n − 2)

In this formula everything is known except Y0, so we construct a 95% confidence interval for Y0 as

Ŷ0 ± t₀.₀₂₅ · σ̂ √(1 + 1/n + (X0 − X̄)²/Σxᵢ²)

Example: In the example considered earlier, suppose we want to predict Y for X0 = 100.

Ŷ0 = 24.4545 + 0.5091(100) = 75.3645

Var(û0) = σ̂²[1 + 1/n + (X0 − X̄)²/Σxᵢ²] = 42.159 (1 + 1/10 + (100 − 170)²/33000) = 52.63, so SE = 7.255

The 95% confidence interval for Y0 is

58.6245 < Y0 < 92.0945

which is known as the confidence band.
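A minimal Python sketch of this individual-prediction interval, using only the quantities quoted in the example (σ̂² = 42.159, n = 10, X̄ = 170, Σxᵢ² = 33000); scipy is assumed:

```python
import math
from scipy.stats import t

b1, b2 = 24.4545, 0.5091
sigma2_hat, n, xbar, sum_x2 = 42.159, 10, 170, 33000   # sum_x2 is in deviation form
X0 = 100

y0_hat = b1 + b2 * X0                                   # about 75.36
var_ind = sigma2_hat * (1 + 1/n + (X0 - xbar) ** 2 / sum_x2)
se_ind = math.sqrt(var_ind)                             # about 7.26

t_crit = t.ppf(0.975, n - 2)                            # 2.306
print(y0_hat - t_crit * se_ind, y0_hat + t_crit * se_ind)   # roughly 58.6 to 92.1
```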

This can be repeated for any value of X, which traces out the confidence band. If instead your aim is mean prediction, then we have to predict

E(Y | X0) = β1 + β2X0

a point on the PRF. We predict it as

Ŷ0 = β̂1 + β̂2X0

When you take the variance of the prediction error for this mean prediction, it becomes

σ²[ 1/n + (X0 − X̄)²/Σxᵢ² ]

When you construct the 95% confidence interval for this predictor, the confidence band will be close to the SRF.
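For comparison with the individual-prediction sketch above, here is the corresponding mean-prediction interval at the same X0 (same assumed quantities; scipy assumed):

```python
import math
from scipy.stats import t

sigma2_hat, n, xbar, sum_x2, X0 = 42.159, 10, 170, 33000, 100
y0_hat = 24.4545 + 0.5091 * X0                           # about 75.36

se_mean = math.sqrt(sigma2_hat * (1/n + (X0 - xbar) ** 2 / sum_x2))   # about 3.24
t_crit = t.ppf(0.975, n - 2)
print(y0_hat - t_crit * se_mean, y0_hat + t_crit * se_mean)  # a narrower band than before
```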

The point is that once you estimate a model, you can predict the value of Y corresponding to any X = X0. But if you try to predict a value far away from the mean of X, the prediction error and its variance increase non-linearly. So never try to predict a value far away from the mean of the independent variable, or at least never try to predict a value far outside the sample range. If we extend an estimated regression line far beyond the sample range it can give contradictory results. Suppose that we have information on the GNP of India up to 2020, along with aggregate consumption, and we estimate a consumption-income relationship. If we try to forecast consumption expenditure in the year 2030 by substituting a value of income for 2030, the result will be absurd, because the variance of the prediction error will be very high; such a prediction is meaningless. Thus never predict a value far away from the sample range: it will give misleading results.

Regression through the origin

Consider a PRF

Yᵢ = β2Xᵢ + uᵢ

In this model the intercept β1 is assumed to be zero. Such a model is known as a regression-through-the-origin model. As you know, the PRF of such a model starts from the origin, not from a point on the Y axis.

As an example to this , consider the permanent income hypothesis of Milton Friedman, which

says that consumption is proportional to permanent income

1) C = αYᵖ, where Yᵖ is permanent income; adding a disturbance term u, this is an example of a regression-through-the-origin model.

2) Consider a cost function TC = a₀ + a₁Q + a₂Q² + a₃Q³.

TC = TFC+TVC

If you are modeling the variable cost component, the function starts from the origin, because we know that the total variable cost curve starts from the origin.

Now if you have strong theoretical reasons to believe that the correct model is a

regression through the origin model , then you can have a model like this. Regression

through the origin mode is not generally preferred in empirical analysis, but still it is

useful for some other purposes. So it is important to know what are the properties of

regression through the origin model, because when we discuss heteroscedasticity , auto

correlation etc we will come across such models and a discussion will be useful later.

Now this is the PRF . Now let us say how to estimate SRF is

𝑌𝑖 = β2𝑋𝑖 𝑢𝑖

Yᵢ = β̂2Xᵢ + ûᵢ

ûᵢ = Yᵢ − β̂2Xᵢ

Σûᵢ² = Σ(Yᵢ − β̂2Xᵢ)²

dΣûᵢ²/dβ̂2 = −2Σ(Yᵢ − β̂2Xᵢ)Xᵢ = 0

This can be written as ΣYᵢXᵢ = β̂2ΣXᵢ², which is the condition ΣûᵢXᵢ = 0. Solving, we get

β̂2 = ΣXᵢYᵢ / ΣXᵢ²

Now if we substitute for Yᵢ from the PRF, we can write

β̂2 = ΣXᵢ(β2Xᵢ + uᵢ) / ΣXᵢ² = (β2ΣXᵢ² + ΣXᵢuᵢ) / ΣXᵢ²

β̂2 = β2 + ΣXᵢuᵢ / ΣXᵢ²

E(β̂2) = β2

So β̂2 in the regression-through-the-origin model is unbiased.
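A small numerical sketch of the two estimators, on made-up data (the numbers below are hypothetical, used only to contrast the raw and the mean-adjusted formulas):

```python
import numpy as np

# Hypothetical data, not from the text
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Regression through the origin: raw sums of squares and cross products
b2_origin = (X * Y).sum() / (X ** 2).sum()

# Conventional model: sums in deviations from the means
x, y = X - X.mean(), Y - Y.mean()
b2_intercept_model = (x * y).sum() / (x ** 2).sum()

print(b2_origin, b2_intercept_model)
```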

Now let us derive the variance of β̂2. From the step above,

E(β̂2 − β2)² = E[ ΣXᵢuᵢ / ΣXᵢ² ]² = (1/(ΣXᵢ²)²) E(ΣXᵢuᵢ)²

If we expand the square, it becomes

X₁²u₁² + ... + Xₙ²uₙ² + 2X₁X₂u₁u₂ + ... + 2Xₙ₋₁Xₙuₙ₋₁uₙ

and from the assumptions all the cross terms disappear in expectation, as we derived earlier, so this becomes

(1/(ΣXᵢ²)²) σ²ΣXᵢ²

That gives the variance of β̂2:

Var(β̂2) = σ² / ΣXᵢ²

Now, σ² is unknown, so we replace it with its unbiased estimator

σ̂² = Σûᵢ² / (n − 1)

Here it is n − 1, not n − 2, because in the regression-through-the-origin model we estimate only one parameter, so only one degree of freedom is lost.

If we take the expected value,

E[ Σûᵢ² / (n − 1) ] = (n − 1)σ² / (n − 1) = σ²

Now it is informative to compare the regression-through-the-origin model with the model having an intercept. While deriving the estimator we used the condition

1) ΣûᵢXᵢ = 0

This is a condition imposed on the data to derive β̂2. As only one condition is imposed, we have n − 1 degrees of freedom. In the conventional model, in addition to this, we have Σûᵢ = 0. But in the regression-through-the-origin model we cannot impose both conditions together. Let us see why. We know that

Yᵢ = β̂2Xᵢ + ûᵢ, so ΣYᵢ = β̂2ΣXᵢ + Σûᵢ

If we impose the condition Σûᵢ = 0, then β̂2 becomes ΣYᵢ/ΣXᵢ = Ȳ/X̄, and this quantity is biased. So in the regression-through-the-origin model we can impose ΣûᵢXᵢ = 0, but we cannot also impose Σûᵢ = 0.

Also, as we derived earlier,

Yᵢ = Ŷᵢ + ûᵢ, so averaging, Ȳ = mean(Ŷᵢ) + mean(ûᵢ)

In the conventional model Σûᵢ = 0, so the mean residual is zero and Ȳ equals the mean of the fitted values. But in the regression-through-the-origin model Σûᵢ ≠ 0, so the mean residual is not zero and Ȳ does not equal the mean of the fitted values. We cannot impose the condition that the mean of Y is the mean of the estimated Y; the two are different.
So, for comparison, recall that in the conventional model

β̂2 = Σxᵢyᵢ / Σxᵢ²

Var(β̂2) = σ² / Σxᵢ²

σ̂² = Σûᵢ² / (n − 2)

Now let us compare the two models point by point.

1) In the regression-through-the-origin model we use the raw sums of squares and cross products, whereas in the model with an intercept we use sums of squares and cross products adjusted for the means.

2) In the regression-through-the-origin model we have n − 1 degrees of freedom; in the conventional model we have n − 2.

3) In the conventional model Σ𝑢𝑖=0. But we cannot assume that Σ𝑢𝑖=0 in the regression

through the origin model.

4) In the model with an intercept, 0 ≤ r² ≤ 1, i.e. r² is always a non-negative number. But in the regression-through-the-origin model there is a possibility that r² will become negative.

5) In the regression-through-the-origin model we can show that

Σûᵢ² = ΣYᵢ² − β̂2²ΣXᵢ²

and we know that

r² = 1 − Σûᵢ² / Σyᵢ²

In the model with an intercept, Σûᵢ² = Σyᵢ² − β̂2²Σxᵢ², so Σûᵢ² ≤ Σyᵢ². But in the model without an intercept the residual sum of squares comes from the raw data, while Σyᵢ² = ΣYᵢ² − nȲ², so the conventional measure becomes

r² = 1 − (ΣYᵢ² − β̂2²ΣXᵢ²) / (ΣYᵢ² − nȲ²)

and the ratio can exceed 1. For example, if the numerator is 5 and the denominator is 4, then r² = 1 − 5/4, which is negative.

So in the regression-through-the-origin model we cannot use the conventional r² as a measure of goodness of fit. The measure suggested in the literature is the raw r²:

raw r² = (ΣXᵢYᵢ)² / (ΣXᵢ² ΣYᵢ²)

This raw r² satisfies the condition 0 ≤ r² ≤ 1; the conventional r² does not.
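A short numerical sketch of this point, on made-up data chosen so that the conventional formula actually goes negative (the data are hypothetical, only to illustrate the two formulas):

```python
import numpy as np

# Hypothetical data with a nonzero mean and a weak relation to X
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([5.0, 4.8, 5.3, 5.1, 5.6])

b2 = (X * Y).sum() / (X ** 2).sum()          # through-the-origin slope
resid = Y - b2 * X

r2_conventional = 1 - (resid ** 2).sum() / ((Y - Y.mean()) ** 2).sum()
r2_raw = (X * Y).sum() ** 2 / ((X ** 2).sum() * (Y ** 2).sum())

print(r2_conventional, r2_raw)               # conventional r2 is negative here; raw r2 lies in [0, 1]
```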

Thus compared to a model with an intercept, regression through the origin model has

many problems . So unless you have sufficient theoretical reasons , it is better to use the

conventional model . Also if your model is

Yᵢ = β1 + β2Xᵢ + uᵢ

then β̂1 will not be meaningful in many cases, and it can often simply be ignored. Suppose we test

H0: β1 = 0
H1: β1 ≠ 0

If this null hypothesis is accepted, then from a practical point of view the model is equivalent to a regression-through-the-origin model. So the point is that it is better to estimate a model with an intercept rather than a model without one, unless you have very strong theoretical reasons.

Scaling and unit of measurement

The unit in which data is expressed is very important. If we are not familiar with units of

measurement in which dependent and independent variables are given there is a possibility that

we will make wrong interpretations.

Example: Suppose that data on United States Gross Private Domestic Investment (GPDI) and Gross Domestic Product (GDP) are given from 1992 to 2005, and suppose that we regress GPDI on GDP to see how GPDI responds to GDP.

● Suppose that one researcher uses data in billions of Dollars.

● Another researcher uses data in millions of Dollars.

One billion = 1000 million

We can use data in any units, e.g. rupees, thousands or lakhs of rupees, millions, billions, as you like. If the data are in billions we can convert them into millions by multiplying by 1000.

Now let us see what are some of the issues associated with unit of measurement.

● If one researcher is using billions of Dollars, Another researcher uses data in millions of

Dollars. The question is, will the regression line be the same in both cases?

● The second question is, if the results are different, which result should one use ?

● Next question is, whether the unit of measurement of X and Y makes any difference in

the regression results obtained?

If so, what is the most sensible course of action? That is what should be the best course of

action to follow?

When we interpret the results the unit of measurement is very important.

Let us consider a regression model as 𝑌𝑖 = β1 + β2 𝑋𝑖 + 𝑢𝑖

Where 𝑌𝑖 is GPDI, 𝑋𝑖 is GDP.

Define Yᵢ* = W₁Yᵢ and Xᵢ* = W₂Xᵢ. We have defined two new variables, Yᵢ* and Xᵢ*. W₁ and W₂ are known as scale factors; they are constants. We may have W₁ = W₂ or W₁ ≠ W₂.

This means that data given in billions can be converted into millions or into some other unit of measurement. Since Yᵢ* and Xᵢ* are obtained by multiplying Yᵢ and Xᵢ by the scale factors W₁ and W₂, we can say that Yᵢ* and Xᵢ* are rescaled values of Yᵢ and Xᵢ. If Yᵢ and Xᵢ are in billions of dollars, then in millions Yᵢ* = 1000Yᵢ and Xᵢ* = 1000Xᵢ.

Now consider the regression in the rescaled variables,

Yᵢ* = β̂1* + β̂2*Xᵢ* + ûᵢ*

where Yᵢ* = W₁Yᵢ, Xᵢ* = W₂Xᵢ and ûᵢ* = W₁ûᵢ.
The rescaled variables.

We want the relationships between β̂1 and β̂1*, β̂2 and β̂2*, var(β̂1) and var(β̂1*), var(β̂2) and var(β̂2*), σ̂² and σ̂*², and r²ₓᵧ and r²ₓ*ᵧ*.

● β̂2* = (W₁/W₂) β̂2. If W₁ = W₂, then β̂2* = β̂2.

● var(β̂2*) = (W₁/W₂)² var(β̂2). If W₁ = W₂, i.e. the unit of measurement is changed in the same way for both Y and X, then the slope estimator and its variance are unchanged.

Now what about the intercept?

● β̂1* = W₁β̂1

● var(β̂1*) = W₁² var(β̂1), so its standard error is W₁ times the original standard error.

When we multiply Y and X by W₁ and W₂, the intercept is multiplied by W₁ (that is, W₁β̂1), where W₁ is the scale factor applied to Y.

The standard error of β̂1* is W₁ times the standard error of β̂1, so both β̂1 and its standard error are multiplied by W₁.

● r²ₓᵧ = r²ₓ*ᵧ*. That is, whatever the units of measurement, r² is not affected.

● σ̂*² = W₁²σ̂².

This is the case in which both X and Y are rescaled. Now suppose the X scale is unchanged and only the Y scale is changed: then the slope, the intercept, and their standard errors are all multiplied by W₁. If instead the Y scale is unchanged and only the X scale is changed, then the slope and its standard error are multiplied by 1/W₂. If W₁ = W₂, the slope and its standard error remain unchanged, while the intercept and its standard error are multiplied by W₁. In every case r² is not affected.
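These scaling rules are easy to verify numerically. A sketch on hypothetical income-consumption figures (the data and the helper function are my own, purely for illustration):

```python
import numpy as np

X = np.array([80.0, 100, 120, 140, 160, 180, 200, 220, 240, 260])   # hypothetical income
Y = np.array([70.0, 65, 90, 95, 110, 115, 120, 140, 155, 150])      # hypothetical consumption
w1, w2 = 1000.0, 1000.0            # e.g. both series converted from billions to millions

def ols(x, y):
    b2 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
    b1 = y.mean() - b2 * x.mean()
    return b1, b2

b1, b2 = ols(X, Y)
b1_star, b2_star = ols(w2 * X, w1 * Y)
print(b2_star, (w1 / w2) * b2)     # slope is scaled by w1/w2 (unchanged here)
print(b1_star, w1 * b1)            # intercept is scaled by w1
```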

Functional forms of regression model, log log model

In the previous classes we discussed the simple linear regression model, under the assumption that the model is linear in the parameters but may or may not be linear in the variables. In the real world we must consider many variations of this model: models that are nonlinear in the variables, and models that are nonlinear in appearance but can be transformed into linear form by a suitable transformation. A model that is nonlinear in the parameters but can be converted into linearity by a suitable transformation is known as an intrinsically linear model. We cannot expect the world to always be linear; it is characterized by various types of nonlinearity, and for empirical application we have to accommodate them. So we will consider several such cases and discuss how these models are applied to real economic data. The first model we consider is the log-linear model.

Log Linear Model.

Log-linear models are used to estimate elasticities. To explain, consider the exponential regression model

Yᵢ = β1 Xᵢ^β2 e^(uᵢ)

By taking natural logarithms, we get

ln Yᵢ = ln β1 + β2 ln Xᵢ + uᵢ

The difference between natural log and common log

The base of the common log is 10. The base of natural log is e. e=2.718.

As an example 𝑙𝑛𝑋 = 2. 3026𝐿𝑜𝑔𝑋

Properties of log

ln(ab) = ln a + ln b
ln(a/b) = ln a − ln b
ln(aᵏ) = k ln a
ln 1 = 0
ln(eᵃ) = a

This can be written as

ln Yᵢ = α + β2 ln Xᵢ + uᵢ, where α = ln β1

This model is linear in the parameters α and β2 and linear in the logarithms of the variables Y and X. As both the dependent and the independent variable are in log form, it is called a log-log model, a double-log model, or a log-linear model. It is a linear model; the only difference is that Y and X are expressed in logs, so we can estimate it by OLS, regressing ln Y on ln X to obtain the estimators α̂ and β̂2. The model is popular because it gives a direct estimate of the elasticity: the slope coefficient β2 is the elasticity.

The elasticity of Y with respect to X is the percentage change in Y divided by the percentage change in X:

β2 = (% change in Y) / (% change in X) = (ΔY/Y) / (ΔX/X)

So the model is popular because the slope coefficient is the estimate of the elasticity.

If Y = ln X, then

d(ln X)/dX = 1/X, or d(ln X) = dX/X

So the change in the log of a variable equals the proportionate change in that variable, that is,

(Xₜ − Xₜ₋₁)/Xₜ₋₁, or Xₜ/Xₜ₋₁ − 1

Here Xₜ − Xₜ₋₁ is the absolute change, (Xₜ − Xₜ₋₁)/Xₜ₋₁ is the relative change, and (Xₜ − Xₜ₋₁)/Xₜ₋₁ × 100 is the percentage change.

And to calculate change we generally use the proportionate change or the percentage change.

Because absolute change depends on the unit of measurement, whereas, proportionate change is

independent of the unit of measurement.

The proportionate change is therefore used as the measure of change in economics rather than the absolute change.

If Y is quantity demanded and X is price, then quantity demanded is a function of price, and if you estimate the model after converting Y into ln Y and X into ln X, the resulting β̂2 is nothing but the price elasticity of demand.

If you simply regress Y on X we get only the slope of the regression line. That is β2

will measure change in Y per unit change in X, If price increases by 1 unit, quantity demanded

decreases by β2 units.

In this model, β2 = (ΔY/Y)/(ΔX/X), which can be written as (ΔY/ΔX) × (X/Y).

If you want the slope itself, it is slope = β2 (Y/X).

Now let us see how these equations look graphically. Suppose that quantity demanded is a function of price; in log form, the demand curve is converted into a straight line.

The features of log log, double log or a log linear model.

● As we can see, in this model the elasticity β2 is assumed to be constant throughout, so it is also known as the constant elasticity model. The slope is β2 (Y/X), so as X increases, the slope decreases.

● Estimating ln Yᵢ = α + β2 ln Xᵢ + uᵢ gives α̂ and β̂2. α̂ is an unbiased estimator of α, and β̂2 is an unbiased estimator of β2. But β1, the parameter entering the original model, when estimated as β̂1 by taking the antilog of α̂, is a biased estimator. This is not a big problem, as in most cases the intercept is not very important.

Example for log log model:

ln EXPDₜ = −7.5417 + 1.6266 ln PCEₜ

Here EXPD is expenditure on durable goods and PCE is personal consumption expenditure; we want to know how expenditure on durables changes when personal consumption expenditure changes. The interpretation is that when personal consumption expenditure increases by 1%, expenditure on durable goods increases by about 1.63%.
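As a sketch of how such an elasticity is obtained in practice, the snippet below fits a log-log model by OLS on made-up demand data (the numbers are hypothetical; numpy is assumed):

```python
import numpy as np

# Hypothetical demand data: Y = quantity demanded, X = price
X = np.array([2.0, 3.0, 4.0, 5.0, 6.0, 8.0])
Y = np.array([50.0, 37.0, 30.0, 25.0, 22.0, 17.0])

lnX, lnY = np.log(X), np.log(Y)
b2, a = np.polyfit(lnX, lnY, 1)      # returns slope first, then intercept
print(b2)                            # estimated (price) elasticity, negative here
print(np.exp(a))                     # implied beta1; biased, as noted above
```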

Difference between percentage change and percentage point change

Suppose that the unemployment rate increases from 6% to 8%. Then the percentage point change is 2, while the percentage change is (8 − 6)/6 ≈ 33%.

Functional forms of Regression models, Semi log model

Now we consider the Semi-Log models, there are two types of Semi-Log models; namely log lin

model and lin log model. Log lin means Y is in log form and X is not, and lin log means X is in

log form and Y is not. So in this session we discuss the first model, the log lin model.

Log-lin models are extensively used to measure growth rates of variables such as population, GNP, money supply, and employment. To explain this model, suppose that we want to estimate the growth rate of consumption expenditure at the aggregate level using time series data.

Let, 𝑌𝑡 is the real expenditure on services at t (it can be the rate of growth of population at time

t, rate of growth of employment at time t, or rate of growth of money supply at time t etc…..)

𝑌0 is the initial value or value in the beginning of the sample period

Suppose that we consider the compounding formula

Yₜ = Y₀(1 + r)ᵗ

Where, r = compound, the rate of of growth of Y over time

𝑌𝑡 = Value of Y at time t

𝑌0 = Initial value

Taking log on both sides;

𝑙𝑛𝑌𝑡 = 𝑙𝑛 𝑌0 + t 𝑙𝑛(1+r)

This can be written as;

𝑙𝑛𝑌𝑡= β1+ β2 t

Adding a disturbance, will get it as;

𝑙𝑛𝑌𝑡= β1+ β2 t + 𝑢𝑡

This model like any other model, linear in parameters β1and β2

In this model, the dependent variable is 𝑙𝑛𝑌𝑡, and the independent variable is t

Suppose that you have data on expenditure on services from 2000 to 2015 as Y; you construct the time variable t = 1, 2, ... and regress ln Yₜ on t. That will give you β̂2.

This model is called a semi-log model, the reason is that only one variable is in log form, and the

other is not. For the descriptive purpose, we called it a log lin model, because the dependent

variable is in log form.

Properties of the model

In this model, as we can see,

β2 = Δ(ln Yₜ)/Δt = (ΔY/Y)/Δt

The slope coefficient measures the relative change in Y (the regressand) divided by the absolute change in X (here, time); that is, it measures the constant proportional change in Y for a given absolute change in X. If we multiply this relative change by 100 (100 × β2), we get the percentage change, or growth rate, in Y (the growth rate of population, of the money supply, and so on). In the literature 100 × β2 is known as the semi-elasticity of Y with respect to X.

respect to X.
In this model, β2 = (ΔY/Y)/ΔX = (1/Y)(ΔY/ΔX), so ΔY/ΔX = β2Y.

So, what is the slope? The slope ΔY/ΔX is:

Slope = β2 · Y

What is the elasticity? Elasticity = β2 · X.

Both the slope and the elasticity change from point to point, because both depend on Y and X. That is the log-lin model.

As an example, an estimated model is

ln EXSₜ (expenditure on services at time t) = 8.32 + 0.00705 t

This is based on quarterly data, so the quarterly growth rate is 0.705%, and the annual growth rate of expenditure on services is about 2.82% (0.705 × 4).

Now, in the literature a distinction is always made between the compound growth rate and the instantaneous growth rate. The coefficient β2 that we estimate is the instantaneous growth rate, i.e. growth at a point in time. It differs from the compound growth rate, which is measured over a period of time (say one month, six months, one year, etc.).

Now, if you want to calculate the compound growth rate, the procedure is

1. Take antilog of instantaneous β2 (β2 𝑖𝑠 𝑙𝑛(1 + 𝑟)) , that will give you (1+r)

2. Subtract 1 from it, (1+r)-1, that will give you r

3. Multiply this with 100, r*100, it will give you compound growth rate

This compound growth rate will be higher than the instantaneous growth rate. Because of the

compounding effect. When you estimate, you will get the β2, and it is the instantaneous growth

rate, if you want compound growth rate follow the above steps.
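A quick numeric illustration of the two growth rates, following the three steps above and using the slope from the quarterly example (Python, standard library only):

```python
import math

b2 = 0.00705                     # estimated slope from the quarterly log-lin trend
instantaneous = 100 * b2         # about 0.705% per quarter

r = math.exp(b2) - 1             # antilog of b2 gives (1 + r), so r = e^b2 - 1
compound = 100 * r               # about 0.707% per quarter, slightly higher
print(instantaneous, compound)
```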

In addition to this, there is a linear growth model.

Suppose that you regress Yₜ = β1 + β2t + uₜ, where Y is in its original units and t is time. If you simply regress Y on t, the estimated β̂2 measures the absolute change in Y when t changes by one unit. This is known as the linear trend model. Remember, though, that the linear trend model is generally less preferred: what is usually more interesting is the relative change, or growth rate, rather than the absolute change. For example, if the money supply increases from one year to the next, that annual change is β̂2 (if you are using yearly data).

Let us consider one more example, consider a model of wage and education;

Wage = β1+ β2 education+ 𝑢

Suppose that wage is measured in Dollar per hour, and education in years

Now, as you know if you estimate this model, the estimated model is;

estimated wage = −0.90 + 0.54 education

It says that as education increases by one year, the hourly wage increases by 0.54 dollars, whether education increases from 5 to 6 years or from 18 to 19 years. But this is unrealistic in most cases: in reality the wage increase for an additional year of education at a low level (say from 5 to 6 years) may be small, while at a high level (say from 18 to 19 years) it may be much larger, so the absolute increase is not uniform across education levels. What is more meaningful is the percentage change, so we convert the model into log-lin form:

ln(wage) = β1 + β2 education + u (wage is converted into log form)

When you convert wage into log, β2 gives the relative change in wage (multiply by 100 for the percentage change) as education increases by one year; say this percentage change is 5%. The percentage change is constant, but the absolute change in wage grows with the level of wage. So, for a wage equation we would expect a relation like the curve below; here the relation between wage and education is not linear.

When the model is estimated using log wage and education, β2 = 0.083 (about 8.3%); this is the growth rate of wages for each additional year of education, and the curve relating wage to education looks like this. This is known as the log-lin model: the dependent variable is in log form and the independent variable is not.

ln(Y) = β1 + β2 X + u
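As a small Python sketch of this interpretation, using the estimate β2 = 0.083: the approximate percentage change is 100 × β2, while the exact (compound) change per extra year of education is (e^β2 − 1) × 100:

```python
import numpy as np

b2 = 0.083   # estimated coefficient on education in ln(wage) = b1 + b2*education

approx_pct = b2 * 100                  # approximate: 8.3% per extra year
exact_pct = (np.exp(b2) - 1) * 100     # exact: about 8.65% per extra year

print(f"approximate change: {approx_pct:.2f}%   exact change: {exact_pct:.2f}%")
```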

The other possible shapes are:

When β2 > 0 (as we already considered)

When β2 < 0

Functional forms of regression models: lin-log and reciprocal models

Another model we consider is the lin-log model. Unlike the log-lin model, here it is the independent variable that appears in log form. We measure the absolute change in Y for a given proportionate (percentage) change in X. The model can be written as:

𝑌𝑖= β1+ β2 𝑙𝑛 𝑋𝑖+ 𝑢𝑖

The slope coefficient is β2 = ΔY / Δln(X): the absolute change in Y for a given proportionate change in X.

This can be written as ΔY = β2 Δln(X) ≈ β2 (ΔX/X), since Δln(X) ≈ ΔX/X for small changes; equivalently, β2 = (ΔY/ΔX) × X.

The slope of this model is β2 × (1/X).

The elasticity is β2 × (1/Y).

This equation states that the absolute change in Y equals β2 times the relative change in X.

As an example, if ΔX/X = 0.01 (a 1% change in X), then the absolute change in Y is ΔY = 0.01 × β2; that is, a 1% change in X produces a change of 0.01 β2 units in Y.

If β2 = 500, then ΔY = 0.01 × 500 = 5, which means that a 1% change in X produces an absolute change of 5 units in Y.

So it is very important to remember this: once the model is estimated, multiply the slope β2 by 0.01 to interpret the effect of a 1% change in X; otherwise the interpretation will be misleading.
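A minimal Python sketch of this interpretation, reusing the hypothetical value β2 = 500 from the text:

```python
# Lin-log model: Y = b1 + b2*ln(X), with the hypothetical slope from the text
b2 = 500.0

rel_change_X = 0.01            # a 1% (relative) change in X
delta_Y = b2 * rel_change_X    # absolute change in Y

print(f"A 1% change in X changes Y by {delta_Y:.1f} units")

# The slope dY/dX = b2/X itself falls as X grows, so the curve flattens out.
for X in (100.0, 1000.0, 10000.0):
    print(f"at X = {X:8.0f}: slope dY/dX = {b2 / X:6.2f}")
```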

As you can see, in this model the slope is β2 × (1/X), so naturally the slope shrinks (in absolute value) as X increases, and we get patterns like the ones below:

1. When β2 > 0: as X increases, Y increases at a diminishing rate (the slope decreases as X increases).

2. When β2 < 0: Y decreases as X increases, but the negative slope becomes smaller in absolute terms.

If you have data on Y and X, plot them, look at the pattern, and experiment with different functional forms. The question is: what is the application of a model like this? An important application is Engel's law of consumption (the Engel expenditure function), which says that as income increases, the proportion of income allotted to food decreases.

If your model is: food expenditure = β1 + β2 ln(Income) + u, and you plot it, you can see that as income increases, food expenditure increases at a diminishing rate.
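A minimal sketch of fitting such an Engel-type lin-log equation by OLS in Python, with purely hypothetical income and food-expenditure figures:

```python
import numpy as np

# Hypothetical household data: income and food expenditure (same currency units)
income = np.array([1000.0, 2000.0, 4000.0, 8000.0, 16000.0, 32000.0])
food   = np.array([ 300.0,  420.0,  520.0,  610.0,   700.0,   790.0])

# Lin-log model: food = b1 + b2*ln(income) + u
b2, b1 = np.polyfit(np.log(income), food, 1)

# A 1% rise in income raises food spending by roughly 0.01*b2 units
print(f"b2 = {b2:.1f}; a 1% rise in income adds about {0.01 * b2:.2f} to food spending")
```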

Another model we consider is the reciprocal model. Suppose that we have a model like:

Yi = β1 + β2 (1/Xi) + ui

This is an example of a reciprocal model. As you can see, the model is non-linear in X, because X enters as 1/Xi rather than as Xi, but it is linear in the parameters, so you can estimate it. If you have data on X and Y, the dependent variable is Y and the explanatory variable is 1/Xi; you regress Y on 1/X and obtain estimators of β1 and β2. Since β1 and β2 enter linearly, the model can be estimated by OLS; the only difference is that X appears in the model reciprocally.

Let us see the features of this model

1. As X increases indefinitely, β2 (1/Xi) approaches 0, and naturally Y approaches the value of β1, which is known as the asymptotic value. So, built into such a model there is an asymptote, or limit value, of β1.

2. Some of the possible shapes are:

Here β1 > 0 and β2 > 0: as X increases indefinitely, Y approaches the asymptotic value β1, and β1 is positive.

Here β1 < 0 and β2 > 0: as X increases indefinitely, Y approaches the asymptotic value β1, but now β1 is negative.

If β2 is positive, the curve decreases throughout. To see this, find the slope: the model Yi = β1 + β2 (1/Xi) + ui can be written as Yi = β1 + β2 Xi^(−1) + ui, and the slope is ΔY/ΔX = −β2 (1/X²), which is negative throughout when β2 > 0; if β2 is negative, the slope is positive throughout.

Here the elasticity can be written as:

Elasticity = (ΔY/ΔX) × (X/Y)
= −β2 (1/X²) × (X/Y)
= −β2 × 1/(X·Y)

Remember this: one celebrated application of the reciprocal model is the “Phillips curve” in macroeconomics. The point is that as X increases indefinitely, there is a limit value for Y. The model can be used in various other contexts as well.
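A minimal sketch of estimating a reciprocal model by OLS in Python, with hypothetical Phillips-curve style data (the unemployment and wage-inflation numbers are illustrative, not real series):

```python
import numpy as np

# Hypothetical data: X = unemployment rate (%), Y = wage inflation (%)
X = np.array([2.0, 3.0, 4.0, 6.0, 8.0, 10.0])
Y = np.array([9.5, 6.8, 5.4, 4.1, 3.4, 3.1])

# Reciprocal model Y = b1 + b2*(1/X) + u: linear in the parameters,
# so OLS on the transformed regressor 1/X works directly.
b2, b1 = np.polyfit(1.0 / X, Y, 1)

print(f"b1 = {b1:.2f} (asymptote as X grows), b2 = {b2:.2f}")
# As X increases indefinitely, b2*(1/X) -> 0 and Y approaches b1.
```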

In addition to these models, we can also consider various powers of X, say X², X³, etc., in order to estimate quadratic and cubic functions; such models will be discussed in the context of multiple linear regression models.

A few more points on functional forms, starting from the log-log model (already discussed in a previous session). If both Y and X are in log form, the slope coefficient β2 is directly the elasticity of Y with respect to X.

If, instead, we are considering a simple linear regression model in levels:

The slope is β2

The elasticity is β2 × (X/Y)

Now suppose that you are estimating a demand function like:

Y = β1 + β2 X + u (Y is the quantity demanded and X is the price)

You will get an estimate of the slope only. The elasticity formula is (ΔQ/ΔP) × (P/Q), which in our notation is (ΔY/ΔX) × (X/Y). So, if you are estimating a linear equation, the elasticity can be calculated as β2 × (X/Y), or equivalently β2 × (P/Q). You can compute the elasticity by plugging values into X and Y, but the problem is which X and which Y to use. The usual solution is to take the average X and the average Y; this gives you an estimate of the average elasticity.
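A small Python sketch of this average-elasticity calculation for a linear demand equation, with hypothetical price and quantity data:

```python
import numpy as np

# Hypothetical demand data: P = price, Q = quantity demanded
P = np.array([5.0, 6.0, 7.0, 8.0, 9.0, 10.0])
Q = np.array([100.0, 92.0, 85.0, 79.0, 74.0, 70.0])

# Linear demand Q = b1 + b2*P + u: the slope is constant,
# but the elasticity varies from point to point.
b2, b1 = np.polyfit(P, Q, 1)

# Average elasticity: evaluate the slope at the mean price and mean quantity
avg_elasticity = b2 * P.mean() / Q.mean()
print(f"slope b2 = {b2:.2f}, elasticity at the means = {avg_elasticity:.3f}")
```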

