explained this using the arithmetic mean as an example. To give a little more insight, consider a population of ages X = {2, 4, 6, 8, 10}, so N = 5.

µ = ΣX/N = (2 + 4 + 6 + 8 + 10)/5 = 30/5 = 6 years. This is the population parameter. Similarly,

σ² = Σ(X − µ)²/N = 40/5 = 8

σ = √σ² = √8 = 2.83
X is a random variable, and our interest is to find the population mean µ and the population variance σ². If the entire population is known, there is no problem: these quantities can easily be computed. In practice, however, only a sample is available. Consider all possible samples of size 2 from this population:

Sample      X̄
(2, 4)      3
(2, 6)      4
(2, 8)      5
(2, 10)     6
(4, 6)      5
(4, 8)      6
(4, 10)     7
(6, 8)      7
(6, 10)     8
(8, 10)     9
If your sample happens to be the first one (2, 4), then X̄ = 3, which is far away from µ. If your sample happens to be the last one (8, 10), then X̄ = 9, which is also far away from µ. If you are lucky and your sample is the fourth one (2, 10), then X̄ = 6 = µ. Taking the expectation over all possible samples,

E(X̄) = (3 + 4 + 5 + 6 + 5 + 6 + 7 + 7 + 8 + 9)/10 = 6 years
Now, in what sense do we say that X̄ is a random variable? We know that X̄ takes the values below with the given frequencies; if we divide each frequency by the total frequency, we get the relative frequency:

X̄       Frequency    Relative frequency
3        1            0.1
4        1            0.1
5        2            0.2
6        2            0.2
7        2            0.2
8        1            0.1
9        1            0.1
Total    10           1
● The total of the relative frequencies is 1. This set of relative frequencies is f(X̄), and the relative frequency is interpreted as the probability.
● Remember that X̄ takes many values, and we do not know the value of X̄ before the sample is observed. Since the value of X̄ is unknown in advance, it is a random variable. And for any random variable we have an associated PDF, written f(X̄). If you draw this you will get a curve as follows.
σ²(X̄) = σ²/n

σ²(X̄) = 8/2 = 4

SD(X̄) = √4 = 2
● So, if X is distributed with mean µ and variance σ², then X̄ is also distributed with the same mean µ and variance σ²/n.
● As you can see, X̄ is a random variable with associated probabilities, and its characteristics are given as E(X̄), σ²(X̄) and SE(X̄). This is an example of a sampling distribution.
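To make this concrete, here is a minimal Python sketch (added for illustration, not part of the lecture). One caveat is made explicit: σ²(X̄) = σ²/n holds for iid draws, so the sketch enumerates all ordered samples of size 2 drawn with replacement; the lecture's ten samples without replacement give the same mean of 6 but a smaller spread.

```python
from itertools import product

# Population and sample size from the lecture's example.
population = [2, 4, 6, 8, 10]
n = 2

mu = sum(population) / len(population)                              # 6.0
sigma2 = sum((x - mu) ** 2 for x in population) / len(population)  # 8.0

# All ordered with-replacement samples, so draws are iid as sigma^2/n assumes.
means = [sum(s) / n for s in product(population, repeat=n)]
e_xbar = sum(means) / len(means)
var_xbar = sum((m - e_xbar) ** 2 for m in means) / len(means)

print(mu, sigma2)        # 6.0 8.0
print(e_xbar, var_xbar)  # 6.0 4.0  (= mu and sigma2/n)
```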
● In hypothesis testing we use this sampling distribution, though not directly: we convert the sampling distribution into the standard normal, t, F, chi-square etc., depending on what we are testing.
● Now we have the distribution and its area properties, just like we described earlier. Under the Z transformation:

µ ± σ corresponds to Z = ±1
µ ± 2σ corresponds to Z = ±2
µ ± 3σ corresponds to Z = ±3
● So, we convert X̄ into Z, and Z is used for hypothesis testing. X̄ has the sampling distribution, and Z is the standard normal distribution derived from this sampling distribution.
● Random variables and their PDFs are not what is used for testing; what is used for testing is the sampling distribution of estimators converted into test statistics. One test statistic is Z. Z is used if and only if the population variance (and hence the population standard deviation) is known.
X̄ = ΣX/n = (1/n)X₁ + (1/n)X₂ + … + (1/n)Xₙ

And we assume that X₁, X₂, …, Xₙ are iid random variables, that is, identically and independently distributed random variables with the same mean and variance:

Xᵢ ~ N(µ, σ²)

Then E(X̄) = (1/n)E(X₁) + (1/n)E(X₂) + … + (1/n)E(Xₙ)

= (1/n)µ + (1/n)µ + … + (1/n)µ

= (1/n)nµ = µ

E(X̄) = µ
● So, this result is obtained if and only if X₁, X₂, …, Xₙ are identically and independently distributed. And this condition is always satisfied in the case of random sampling.
● In the random sampling procedure, the sample observations are selected independently, and drawing them from the same population also ensures that they are identically distributed.
Var(X̄) = Var(ΣX/n)

= (1/n²)σ² + (1/n²)σ² + … + (1/n²)σ²

= (1/n²)nσ²

= σ²/n
● Now X₁, X₂, …, Xₙ are random variables, and we have written only the variance of X₁, the variance of X₂, …, the variance of Xₙ; the covariance terms are ignored because, as X₁, X₂, …, Xₙ are iid, the covariance terms disappear. That is why we can write it as (1/n²)σ² + (1/n²)σ² + … + (1/n²)σ², with the variances assumed to be the same.
● If X is distributed with mean µ and variance σ², then X̄ is also distributed with mean µ and variance σ²/n; this will be satisfied if the random variables are iid random variables.
● For hypothesis testing estimators are used, and estimators can be used if and only if the sampling distribution of the estimator is known.
● If we know the distribution of the population, then the distribution of sample statistics can be derived.
● Remember this: all estimators are random variables, and the distribution of an estimator is known as its sampling distribution. The sampling distribution is used for hypothesis testing, though not directly: from the sampling distribution we derive test statistics. The popular test statistics used for hypothesis testing are Z, t, F and chi-square.
Chi-Square Distribution
● We have discussed:

X ~ N(µ, σ²) (normal distribution)

Z = (X − µ)/σ (standard normal distribution)

X̄ ~ N(µ, σ²/n) (normal distribution of the sample mean)

Z = (X̄ − µ)/(σ/√n) (standard normal distribution)
● If X is normal with mean µ and variance σ² (X ~ N(µ, σ²)), Z is standard normal with mean 0 and variance 1 (Z = (X − µ)/σ). Then Z² = ((X − µ)/σ)² ~ χ²(1); that is, Z² follows a chi-square distribution with 1 degree of freedom.
● So, the square of a standard normal variable is distributed as a chi-square variable with 1 degree of freedom.
● So, the definition is: the square of a standard normal is a chi-square with 1 degree of freedom.
● If Z is standard normal, Z² is chi-square. So, naturally, if Z₁, Z₂, …, Z_k are independent standard normal variables, then Z₁², Z₂², …, Z_k² are independently distributed chi-square variables, and the sum of squares

Z = Σᵢ₌₁ᵏ Zᵢ² ~ χ²(k)

where k means the number of independent terms in the sum. As there are k independent variables, the sum of the squares of Z₁, Z₂, …, Z_k follows chi-square with k degrees of freedom.
● Now, if Z₁, Z₂, …, Z_k are independently distributed chi-square variables with kᵢ degrees of freedom each, then ΣZᵢ also follows chi-square with Σkᵢ degrees of freedom.
● Let us consider it simply: if Z₁ and Z₂ are independently distributed chi-square variables with k₁ and k₂ degrees of freedom respectively, meaning there are k₁ squared standard normal variables in Z₁ and k₂ in Z₂, then the sum Z = Z₁ + Z₂ also follows a chi-square distribution with k₁ + k₂ degrees of freedom.
● So, to summarize: if X is normal, Z is standard normal, Z² is distributed as χ²(1), and Σᵢ₌₁ᵏ Zᵢ² ~ χ²(k). And if Z₁, Z₂, …, Z_k are independently distributed chi-square variables with kᵢ degrees of freedom, then their sum also follows chi-square with Σkᵢ degrees of freedom.
● If we expand this, Z = ((X₁ − µ)/σ)² + ((X₂ − µ)/σ)² + … + ((Xₙ − µ)/σ)². This quantity is the sum of n independently distributed squared standard normal variables, so it has n degrees of freedom.
● The χ² distribution, being a squared quantity, does not take negative values; it always takes non-negative values, so it starts at 0. For small degrees of freedom (say k = 2 or k = 5) it is highly skewed, but as the degrees of freedom increase it becomes more and more symmetric, approaching the normal distribution.
● Remember, to find the area properties of the χ² distribution you must always consider the degrees of freedom.
● If you go through any book of statistics, you will see tables providing the areas of the χ² distribution for various degrees of freedom. We will consider one case.
● Suppose that the degrees of freedom is 20; then the χ² distribution will be as follows.
● Then the probability that χ² is greater than 10.85 is 95%; the probability that it is greater than 23.83 is 25%.
● In words, if the degrees of freedom is 20, the probability of getting a χ² variable in excess of 39.99 is only 0.5%. That is, 99.5% of the area under the χ² curve lies between 0 and 39.99. You can read off other values in the same way from the table.
● For χ distribution, mean that is expected value of χ distribution is K
2
𝐸(χ ) = 𝐾
2
𝑉𝑎𝑟(χ ) = 2𝐾
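These two moments are easy to verify by simulation. The minimal Python sketch below (an added illustration) constructs χ²(k) values as sums of k squared standard normals, exactly as defined above, and compares the sample mean and variance with k and 2k.

```python
import random

random.seed(1)

def chi2_draw(k):
    # Sum of k squared independent standard normals ~ chi-square(k).
    return sum(random.gauss(0.0, 1.0) ** 2 for _ in range(k))

k, reps = 20, 100_000
draws = [chi2_draw(k) for _ in range(reps)]

mean = sum(draws) / reps
var = sum((d - mean) ** 2 for d in draws) / reps

print(round(mean, 2))  # approximately k = 20
print(round(var, 2))   # approximately 2k = 40
```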
● The χ² distribution is extensively used for testing hypotheses about the population variance. It is also used for testing association between two variables using contingency tables, that is, the chi-square test of independence.
● Anyway, the χ² distribution is skewed, its mean is k and its variance is 2k, and its area properties depend on the degrees of freedom. We will not use estimators directly for hypothesis testing.
● The test statistic based on X̄ is Z = (X̄ − µ)/(σ/√n). Similarly, S² is the estimator of σ², but S² is not directly used for testing inferences about the parameter σ²; using S² we develop a test statistic.
● The test statistic developed for testing the significance of a variance is

(n − 1)S²/σ²

and we can show that this quantity follows χ² with (n − 1) degrees of freedom:

(n − 1)S²/σ² ~ χ²(n − 1)

where S² is an unbiased estimator of σ² based on a sample of size n, with (n − 1) degrees of freedom.

● And if we expand (n − 1)S²/σ²:

(n − 1) · [Σ(X − X̄)²/(n − 1)] / σ² = Σ(X − X̄)²/σ²

So, this quantity follows χ² with (n − 1) degrees of freedom.
● There are n random variables in the sum, but all these variables are not independent (the deviations from X̄ must sum to zero); that is why we have only (n − 1) degrees of freedom.
● This is all about the chi-square distribution. So, the chi-square distribution is the square of a standard normal distribution, and it is used to test the significance of a population variance.
t Distribution
In econometrics or empirical analysis the most widely used distribution is the t distribution. This is because the Z distribution cannot be applied in most cases: the population variance σ² is unknown. If Z₁ ~ N(0, 1) and Z₂ ~ χ²(k) is distributed independently of Z₁, then the variable

t = Z₁/√(Z₂/k) ~ t(k)
It is the ratio of two independently distributed random variables. Before deriving this transformation, consider the properties of the t distribution:

1. Like the standard normal distribution, t ranges over −∞ < t < +∞; its exact shape depends on the degrees of freedom.

7. Mean of t = 0; variance of t = k/(k − 2), defined for k > 2. (In the case of the Z distribution, mean = 0 and variance = 1.)

As the degrees of freedom increase, the variance of t approaches 1; that means the t distribution approaches the standard normal distribution.
How do we transform a random variable X and its sampling distribution X̄ into t rather than Z?

t = Z₁/√(Z₂/k) ~ t(k)

We assume that X ~ N(µ, σ²). Then

Z₁ = (X̄ − µ)/(σ/√n) ~ N(0, 1)

Z₂ = (n − 1)S²/σ² = Σ(X − X̄)²/σ² ~ χ²(n − 1)

Substituting these into the ratio, σ cancels and we get

t = (X̄ − µ)/(S/√n) ~ t(n − 1)

Suppose that the population σ is unknown: you cannot use Z as the test statistic; you require this t statistic.
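As an added illustration (not from the lecture), the following sketch constructs t values exactly this way from repeated samples with σ treated as unknown, and checks that their mean is near 0 and their variance near k/(k − 2).

```python
import math
import random

random.seed(2)

mu, sigma, n, reps = 6.0, 2.0, 10, 50_000
k = n - 1  # degrees of freedom

ts = []
for _ in range(reps):
    x = [random.gauss(mu, sigma) for _ in range(n)]
    xbar = sum(x) / n
    s2 = sum((xi - xbar) ** 2 for xi in x) / (n - 1)  # unbiased S^2
    ts.append((xbar - mu) / math.sqrt(s2 / n))        # t = (xbar - mu)/(S/sqrt(n))

mean_t = sum(ts) / reps
var_t = sum((t - mean_t) ** 2 for t in ts) / reps
print(round(mean_t, 3))  # approximately 0
print(round(var_t, 3))   # approximately k/(k-2) = 9/7, about 1.286
```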
F distribution

If Z₁ and Z₂ are two independently distributed chi-square variables with k₁ and k₂ degrees of freedom, then the ratio

F = (Z₁/k₁)/(Z₂/k₂) ~ F(k₁, k₂)
● As the degrees of freedom increase, it approaches a normal distribution.

Properties:

1. The F distribution is highly skewed for small degrees of freedom, and it approaches the normal distribution as the degrees of freedom increase.

2. E(F) = k₂/(k₂ − 2), for k₂ > 2

   Var(F) = 2k₂²(k₁ + k₂ − 2) / [k₁(k₂ − 2)²(k₂ − 4)], for k₂ > 4

3. t²(k) = F(1, k); as in the case of the t distribution, the degrees of freedom of F are also derived from the χ² distribution.
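A short Python sketch (again an added illustration) can confirm both the mean of F and the t²(k) = F(1, k) relation by building F as a ratio of independent chi-squares divided by their degrees of freedom.

```python
import random

random.seed(3)

def chi2(k):
    # chi-square(k) as a sum of k squared standard normals
    return sum(random.gauss(0.0, 1.0) ** 2 for _ in range(k))

k1, k2, reps = 5, 20, 50_000
fs = [(chi2(k1) / k1) / (chi2(k2) / k2) for _ in range(reps)]
print(round(sum(fs) / reps, 3))  # approximately k2/(k2-2) = 20/18, about 1.111

# t(k)^2 is F(1, k): one chi-square(1) over an independent chi-square(k)/k.
k = 10
t2 = [chi2(1) / (chi2(k) / k) for _ in range(reps)]
print(round(sum(t2) / reps, 3))  # approximately k/(k-2) = 1.25, the mean of F(1, k)
```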
All these distributions (Z, χ², t, F) are related to the normal distribution; the normal distribution is the parent of all of them. If the population is not normal but the sample size is sufficiently large, the central limit theorem applies, so the normality assumption is not at all a problem in the case of a large sample. We can conduct hypothesis testing using t, Z, F, χ², etc.
Degrees of Freedom

Suppose that you select 10 numbers. If no other condition is stated, you are free to select all 10 numbers. But suppose the condition is that the 10 numbers must sum to 100. You can then select 9 numbers freely; the 10th number is not free, as it must be whatever value makes the sum of the 10 numbers equal to 100. So one restriction costs one degree of freedom.
The same idea explains the divisor in the sample variance. We require E(S²) = σ², just as E(X̄) = µ. Now

E[Σ(X − X̄)²] = (n − 1)σ²

E[Σ(X − X̄)²/n] = (n − 1)σ²/n = σ² − (1/n)σ², a biased estimator

E[Σ(X − X̄)²/(n − 1)] = [(n − 1)/(n − 1)]σ² = σ²

Then S² = Σ(X − X̄)²/(n − 1) is an unbiased estimator of σ².
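A quick simulation (added here for illustration) makes the bias concrete: dividing the sum of squared deviations by n systematically underestimates σ², while dividing by n − 1 does not.

```python
import random

random.seed(4)

mu, sigma2, n, reps = 0.0, 4.0, 5, 200_000
sd = sigma2 ** 0.5

biased, unbiased = 0.0, 0.0
for _ in range(reps):
    x = [random.gauss(mu, sd) for _ in range(n)]
    xbar = sum(x) / n
    ss = sum((xi - xbar) ** 2 for xi in x)
    biased += ss / n          # divide by n
    unbiased += ss / (n - 1)  # divide by n - 1

print(round(biased / reps, 3))    # approx ((n-1)/n)*sigma2 = 3.2
print(round(unbiased / reps, 3))  # approx sigma2 = 4.0
```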
Properties of Estimators
As we have discussed earlier, our aim is to estimate population parameters. But the population is unknown; we have only a sample, and using the sample we have to estimate the population parameters. The arithmetic mean X̄ is an estimator of µ, S² is an estimator of the population variance, and the OLS estimators β̂₁ and β̂₂ are the estimators of the regression parameters. For any parameter, there will be many estimators. For the population parameter µ, the candidate estimators are the sample mean, sample median, mode etc. As measures of the dispersion of the population we can use the standard deviation, mean deviation, range, coefficient of variation etc. Similarly, as estimators of the regression coefficients we have the OLS estimators, the method of maximum likelihood, instrumental variable estimators, method of moments estimation procedures; so many procedures are there. So the question is: among the various estimators of a parameter, which estimator will we select, or what are the criteria for selecting estimators? To select one estimator from an array of estimators, we consider the properties of estimators, and we select as our estimator of the population parameter the one with most of the desirable properties.
Properties are classified into small sample properties and large sample properties. Small sample properties are satisfied in small samples as well as in large samples, so small sample properties are the most desirable. Large sample properties, on the other hand, are satisfied only in large samples; they will not be satisfied in small samples. The discussion of the properties of estimators is based on the assumption that an estimator is a random variable with a PDF; that PDF is known as the sampling distribution. If the small sample properties are satisfied, we will not go for large sample properties.
1. Unbiasedness

We say that the estimator θ̂ is an unbiased estimator of θ if the expected value of θ̂ is equal to the parameter θ:

E(θ̂) = θ

E(θ̂) − θ = 0

bias(θ̂) = E(θ̂) − θ

As we have discussed in the context of sampling distributions, unbiasedness is a repeated sample property: if we take all the possible samples, and if the expectation of the estimator equals the parameter, it is unbiased.
θ̂₁, θ̂₂, … → θ

So unbiasedness is a repeated sample property. Unbiasedness by itself is not a very useful property, because even if an estimator is unbiased it says nothing about the behaviour of the estimator in a single sample. In our earlier example µ = 6, but X̄ ranged from 3 to 9; in any given situation we take only one sample, and X̄ may be 3, far away from µ, or it may be 9, again far away from µ. So unbiasedness by itself is not a useful property. But it becomes useful when it is combined with another property, namely minimum variance. So the next small sample property is minimum variance.
2. Minimum variance

θ̂₁ → θ ← θ̂₂

θ̂₂ need not be unbiased, and θ̂₁ need not be unbiased; we consider only the minimum variance property here. That is, comparing two estimators of θ, the one with the smallest variance is the minimum variance estimator:

E[θ̂₁ − E(θ̂₁)]² ≤ E[θ̂₂ − E(θ̂₂)]²

Variance(θ̂₁) ≤ Variance(θ̂₂)
As in the case of unbiasedness, minimum variance by itself is not a very useful property. The reason is that even though, in this example, θ̂₁ is the minimum variance estimator, its distribution may be centred far away from the population parameter. Or consider using a constant as the estimator: a constant has no variability, but it cannot be a sensible estimator of the parameter θ. So by itself minimum variance is not a very useful property; it becomes useful when it is combined with unbiasedness.

3. Efficiency

This is a combination of unbiasedness and minimum variance. If θ̂₁ and θ̂₂ are two unbiased estimators of θ, here we compare only estimators which are unbiased; thus the property is a combination of two conditions.

First condition: E(θ̂₁) = θ and E(θ̂₂) = θ.

Second condition: Variance(θ̂₁) ≤ Variance(θ̂₂).

If these two conditions are satisfied, then we say that θ̂₁ is the minimum variance unbiased estimator, the efficient estimator.
4. Linear estimator

X̄ = ΣX/n = (1/n)X₁ + … + (1/n)Xₙ

By giving equal weights 1/n to X₁, X₂, …, Xₙ, X̄ can be written as a linear function of the sample observations; an estimator that can be written this way is a linear estimator.
5. Best Linear Unbiased Estimator (BLUE)

This is a combination of the four properties above. The estimator θ̂ is said to be BLUE if it is linear and has minimum variance in the class of all linear unbiased estimators. This BLUE property is very important in econometrics, and for OLS it is established by the Gauss Markov theorem.

6. Minimum mean square error

This property is not extensively used in econometrics. Consider two estimators of θ, say θ̂₁ and θ̂₂. Suppose θ̂₂ is slightly biased, but its variance is low, so the chance of getting a value far away from the parameter is low. There is a trade-off between bias and variance, and this property says that, if there is such a trade-off, select the estimator with the minimum mean square error (squared bias plus variance).

Next we consider the large sample, or asymptotic, properties. We consider these properties only when the small sample properties are not satisfied.
1. Asymptotic unbiasedness

lim(n→∞) E(θ̂ₙ) = θ

We say that θ̂ is an asymptotically unbiased estimator of θ if θ̂ becomes unbiased as the sample size increases indefinitely. Remember, this is a repeated sample property like unbiasedness; in small samples the estimator will be biased, but as the sample size increases indefinitely it becomes unbiased.
Example: suppose

S² = Σ(X − X̄)²/n

E(S²) = [(n − 1)/n]σ² = σ² − (1/n)σ²

(1/n)σ² → the bias term

So in a small sample this S² is a biased estimator of σ². As the sample size increases indefinitely the bias term disappears, so we say that it is asymptotically unbiased.
2. Consistency

This is the minimum requirement an estimator must satisfy to be useful. If an estimator does not satisfy at least consistency, we cannot use it for estimation and inference. Consider the arithmetic mean:

E(X̄) = µ

Variance(X̄) = σ²/n

As n increases, the variance shrinks toward zero.
Example

When the sample size is small, the estimator θ̂ is biased; not only biased, its variance is also very high. As the sample size increases, the bias decreases and the variance also decreases. When the sample size is, say, 100, the entire bias disappears and E(θ̂) = θ. Not only does the bias disappear, the variance also approaches 0. Actually, if an estimator is consistent, then as the sample size increases indefinitely the entire distribution collapses to a single point; we say that the distribution degenerates, and we say that the estimator is consistent. So consistency requires that any bias in small samples must disappear and the variance must approach 0. Asymptotic unbiasedness is about what happens to the bias only; it says nothing about what happens to the variance. Consistency requires both: the bias disappears and the variance disappears; only then can we say that an estimator is consistent. Formally,

lim(n→∞) Pr{|θ̂ − θ| < δ} = 1, for any δ > 0
Consider a graph: as n grows, the distribution of θ̂ piles up ever closer to θ, so that the distance between θ̂ and θ becomes very small. Then we say that the estimator is consistent. We write this as

plim(n→∞) θ̂ = θ

where plim means probability limit. An estimator is consistent if the distribution of θ̂ not only collapses to θ, but its variance also goes to zero: at the single point θ the distribution has 0 spread, or 0 variance. Then we say that θ̂ is a consistent estimator of θ.
Two sufficient conditions for consistency are:

1) lim(n→∞) E(θ̂ₙ) = θ

2) lim(n→∞) Var(θ̂ₙ) = 0

plim also obeys convenient algebra. Suppose that θ̂₁ and θ̂₂ are consistent estimators; then

plim(θ̂₁ · θ̂₂) = plim(θ̂₁) · plim(θ̂₂)

plim(θ̂₁/θ̂₂) = plim(θ̂₁)/plim(θ̂₂)

These last two properties do not hold for the expectations operator.
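The collapse of the distribution is easy to see numerically. In the added sketch below, the spread of X̄ across repeated samples is computed for increasing n; the mean stays at µ while the variance shrinks like σ²/n, matching the two sufficient conditions.

```python
import random

random.seed(5)

mu, sd, reps = 6.0, 2.0, 5_000

for n in (5, 50, 500):
    xbars = [sum(random.gauss(mu, sd) for _ in range(n)) / n for _ in range(reps)]
    m = sum(xbars) / reps
    v = sum((xb - m) ** 2 for xb in xbars) / reps
    print(n, round(m, 3), round(v, 4))  # mean stays near 6, variance near sd**2/n
```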
3. Asymptotic efficiency

This property is defined in terms of the asymptotic variance. If θ̂₁ is consistent, and its asymptotic variance is smaller than the asymptotic variance of all other consistent estimators, then we say that θ̂₁ is the asymptotically efficient estimator of θ.
4. Asymptotic normality

If X ~ N(µ, σ²), then

X̄ ~ N(µ, σ²/n)

This is not a theorem; it will always be satisfied. But we have also seen the central limit theorem: even if X is not normal with mean µ and variance σ²,

X̄ ~ N(µ, σ²/n) as n → ∞

This property is known as asymptotic normality. It is a useful property in many cases: if you are not sure about the nature of the distribution of the population under consideration, you can use this property for hypothesis testing, provided that you have a sufficiently large sample size. This is a property that holds when the sample size is sufficiently large.
Among all the large sample properties, consistency is the most important; we use it when the small sample properties are either unknown or cannot be derived.
Assumptions of Classical Linear Regression Model

In the previous class we derived the OLS estimators. For the simple linear regression model the parameters of interest are β₁ and β₂, and the estimators are:

● β̂₁ = Ȳ − β̂₂X̄

● β̂₂ = Σxᵢyᵢ/Σxᵢ²
Now if our purpose is estimation only, then OLS is enough: by applying the principle of OLS, β̂₁ and β̂₂ are obtained. But in econometrics our objective is not estimation only. We have to use these estimators to make inferences about the parameters β₁ and β₂, the regression coefficients. So if we have to make inferences about β₁ and β₂, we have to make assumptions about how Yᵢ is generated. As Yᵢ depends on Xᵢ and uᵢ, we have to make assumptions about how Xᵢ and uᵢ are generated.
Only then can we use the estimators to make inferences about the population parameters. What we discuss below are the assumptions of the classical linear regression model, the assumptions underlying the method of OLS. While discussing the properties of estimators, we suggested that the properties are classified into statistical properties and numerical properties.
Numerical properties we have already discussed: they hold as a result of the use of OLS regardless of how the data are generated. Statistical properties of the estimators hold only after making assumptions about how the data are generated. So in this discussion we are considering how the data, Yᵢ, or Xᵢ and uᵢ, are generated. These assumptions are important, and the properties of the estimators depend on the assumptions which we make about Xᵢ and uᵢ and hence Yᵢ.
So the classical linear regression model, the Gaussian linear regression model, the method of OLS, makes seven assumptions about how Xᵢ and uᵢ, and hence Yᵢ, are generated.
● Linear in parameters.

That is the linear regression model. We have defined linearity as linearity in parameters. We have already discussed this property in detail, so there is nothing more to discuss here; if the model is not linear in parameters, it is not a linear regression model. This is the first assumption underlying the method of ordinary least squares or the classical linear regression model.
● Fixed X values, or X values are assumed to be independent of the stochastic error term:

cov(Xᵢ, uᵢ) = 0

If X is fixed in repeated samples, this needs no further assumptions. But suppose that X is not fixed in repeated samples; that is, like Y, X is also stochastic. Then the assumption is

cov(Xᵢ, uᵢ) = 0

cov(Xᵢ, uᵢ) = E{[Xᵢ − E(Xᵢ)][uᵢ − E(uᵢ)]}

As we will see in the next assumption, E(uᵢ) is assumed to be zero, so

= E[Xᵢuᵢ − E(Xᵢ)uᵢ]

= E(Xᵢuᵢ) − E(Xᵢ)E(uᵢ), and since we assume E(uᵢ) = 0,

= E(Xᵢuᵢ)

The assumption therefore requires that the expected value of Xᵢuᵢ is 0.

Now, if we assume that Xᵢ is fixed in repeated samples, then Xᵢ can be treated as a constant from sample to sample, and

E(Xᵢuᵢ) = XᵢE(uᵢ) = 0

automatically. But if X is stochastic like uᵢ, then we have to ensure that cov(Xᵢ, uᵢ) = 0; this condition is not automatically satisfied, so we have to ensure that E(Xᵢuᵢ) = 0. As a first approximation, if this assumption is not satisfied, that is, if Xᵢ is correlated with uᵢ, the OLS estimators will be biased. So keep in mind:

cov(Xᵢ, uᵢ) = 0
Now, the assumption that X is fixed in repeated samples is, in general, a highly unrealistic assumption. But why is it assumed that X is fixed in repeated samples? As we will see later, this assumption is used at an introductory level to derive some results easily: if X is assumed to be fixed in repeated samples, it is easy to derive the estimators and their properties. That is why it is made. Also, the assumption is not always unrealistic; there are many cases in which it is valid. Suppose that we consider a model of salary as a function of years of education:

Years of education (X)    Salary (Y)
10                        varies from sample to sample
12                        varies from sample to sample
Suppose that you take one sample: you will find many people with different levels of salary for the same level of education, with 8 years of education, 9 years of education etc. Across samples, the levels of education remain the same, but salary varies from sample to sample; here X can reasonably be treated as fixed in repeated samples.

Consider another case: suppose that you are applying fixed amounts of fertilizer and observing yield.

Fertilizer (X)    Yield (Y)
10                varies from plot to plot
20                varies from plot to plot
30                varies from plot to plot

Now if you apply 10 kilos of fertilizer on different plots you will get different levels of yield; yield varies from one plot to another. If the plots are randomly selected, the procedure is also correct. So the fixed-in-repeated-samples assumption is not highly unrealistic. Anyway, for the time being just keep in mind: cov(Xᵢ, uᵢ) = 0.
3. E(uᵢ|Xᵢ) = 0

The assumption is that, given the Xᵢ values, E(uᵢ) = 0.

There is the PRF. Corresponding to each Xᵢ value X₁, X₂, X₃ etc., the PRF passes through the conditional mean of Y. The deviation of a Y value from its conditional mean is u, so there are many positive and negative u values. The assumption simply says that E(uᵢ|Xᵢ) = 0: given Xᵢ = X₁, u₁ has a distribution with E(u₁) = 0; u₂ has a distribution with E(u₂) = 0; u₃ has a distribution with E(u₃) = 0; and so on.
Actually this is not a very important assumption, not an assumption with impact. As long as there is an intercept in the model this assumption is automatically satisfied; or, stated differently, we can always rewrite the model in such a way that E(uᵢ|Xᵢ) = 0.

As an example, suppose that E(uᵢ) = α₀ instead. Taking expectations, you get

E(Yᵢ) = β₁ + β₂Xᵢ + α₀ = (β₁ + α₀) + β₂Xᵢ

So the only consequence is that the intercept becomes β₁ + α₀. We get a biased estimator for the intercept; that is a problem, but not a serious one, as the intercept is not very important in most cases. So this is not an assumption with impact, but it is assumed to hold in all cases. Provided that the model is correctly specified, and provided all the important variables are included, we can conclude that E(uᵢ|Xᵢ) = 0 is always satisfied.
There is one more point. If the conditional mean of one random variable given another is zero, then their covariance is zero: if E(uᵢ|Xᵢ) = 0, then cov(Xᵢ, uᵢ) = 0. But cov(Xᵢ, uᵢ) = 0 does not mean that the variables are independent. Independence means that the conditional distribution equals the marginal distribution, e.g. E(Y|X) = E(Y); then the variables Y and X are independent.

If E(uᵢ|Xᵢ) = 0, then it follows that E(Y|Xᵢ) = β₁ + β₂Xᵢ. So the conditional mean of Y is β₁ + β₂Xᵢ, and the unconditional mean of Y is different from the conditional mean. So the condition E(uᵢ|Xᵢ) = 0 also implies the condition E(Y|Xᵢ) = β₁ + β₂Xᵢ.

Note the direction of these implications: E(uᵢ|Xᵢ) = 0 implies cov(Xᵢ, uᵢ) = 0, but cov(X, Y) = 0 does not imply that X and Y are independent, because X and Y may be non-linearly related.
The assumption that X is fixed in repeated samples is only a simplification used in introductory textbooks to derive some of the results easily. In the real situation X is stochastic, and then the requirement is cov(Xᵢ, uᵢ) = 0. If this assumption is violated we will get biased estimates.
4. Homoskedasticity

Homo = equal; skedasticity = spread.

The assumption is that

var(uᵢ|Xᵢ) = E[uᵢ − E(uᵢ|Xᵢ)]² = E(uᵢ²|Xᵢ) = σ²

because E(uᵢ|Xᵢ) = 0 by assumption. It is a constant, sigma squared. That means the variance of uᵢ, given different values of Xᵢ, is a constant number σ².
This is an example of homoskedasticity, equal spread: the distributions of u₁, u₂, u₃ all have mean zero and the same variance, the same number σ², whatever the value of X.

Hetero = unequal; skedasticity = variance. So: unequal variance.

Heteroskedasticity is written as var(uᵢ|Xᵢ) = σᵢ², where the subscript i represents the fact that the variance of u varies with i:

σ² = homoskedasticity

σᵢ² = heteroskedasticity
Example of heteroskedasticity:

Homoskedasticity means var(uᵢ|Xᵢ) ≠ f(Xᵢ)

Heteroskedasticity means var(uᵢ|Xᵢ) = f(Xᵢ)

Note also that

var(Yᵢ) = E[Yᵢ − E(Yᵢ)]² = E[β₁ + β₂Xᵢ + uᵢ − β₁ − β₂Xᵢ]² = E(uᵢ)² = σ²
So the point is that the variance of uᵢ is the same as the variance of Yᵢ: if uᵢ is homoskedastic, Yᵢ is also homoskedastic. But in the case of the mean, the mean of uᵢ is zero while the mean of Yᵢ is different from the mean of uᵢ. What we are saying is: if we subtract E(Yᵢ) from Yᵢ, that is uᵢ (uᵢ = Yᵢ − E(Yᵢ)), we change the origin; the mean becomes zero, and the variance is not affected.
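To make the distinction concrete, here is a small added sketch with made-up numbers: one error series has constant variance, the other has a standard deviation that grows with X, and the spread of u is compared at low and high X.

```python
import random

random.seed(6)

xs = [float(x) for x in range(1, 201)]

u_hom = [random.gauss(0.0, 2.0) for _ in xs]      # var(u|X) = 4 for every X
u_het = [random.gauss(0.0, 0.2 * x) for x in xs]  # sd(u|X) = 0.2*X, grows with X

def var(vals):
    m = sum(vals) / len(vals)
    return sum((v - m) ** 2 for v in vals) / len(vals)

half = len(xs) // 2
print(round(var(u_hom[:half]), 1), round(var(u_hom[half:]), 1))  # similar spreads
print(round(var(u_het[:half]), 1), round(var(u_het[half:]), 1))  # spread grows with X
```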
Now consider the graph; it is a three-dimensional graph.

f(uᵢ) is the PDF of uᵢ. E(Yᵢ) = β₁ + β₂Xᵢ is the relationship between consumption and income, the PRF. The PRF starts from the Y axis, and as the relation is assumed to be positive, the distance between the X axis and the PRF increases as X increases. The densities f(uᵢ) occupy the third dimension, and in this graph each has the same spread: this is the case of homoskedasticity.
Heteroskedasticity

In the corresponding graph for heteroskedasticity, the spread of f(uᵢ) changes as X changes.

5. No autocorrelation between the disturbances
It means that, given any two X values Xᵢ and Xⱼ, the correlation between uᵢ and uⱼ is 0. In notation:

cov(uᵢ, uⱼ) = E{[uᵢ − E(uᵢ|Xᵢ)][uⱼ − E(uⱼ|Xⱼ)]} = E[(uᵢ|Xᵢ)(uⱼ|Xⱼ)] = 0
It simply means that, given any X₁ and X₂, the covariance between u₁ and u₂ is equal to zero. One implication of this assumption: suppose that we are using quarterly data, and suppose that there is a strike in the first quarter; then production will be disrupted. If its effect dies out in the first quarter itself, there is no autocorrelation. But if its effect persists into the 2nd, 3rd quarter etc., then there is autocorrelation. The implication of this assumption is that such effects do not carry over from one observation to the next.
6. n > k

The number of observations must be greater than the number of parameters to be estimated. Suppose we observe only

Y     X
10    15

With just one pair of values we cannot estimate the two parameters β₁ and β₂ of such a model; there must be more observations than parameters.

● If n is not greater than k, we will not have sufficient degrees of freedom.
● If n = k, the degrees of freedom become zero, because degrees of freedom are defined as n − k.
7. Variability in X values

That means that, in a given sample, not all the X values may be the same; stated differently, Σ(X − X̄)² > 0. We have

β̂₂ = Σxᵢyᵢ/Σxᵢ²

If all the X values are identical, then Σ(X − X̄)² = 0 and estimation is not possible. The second implication: if X does not vary, the variability in Y cannot be attributed to X. We can attribute variability in Y to X if and only if the X values are not all the same in a given sample. Technically, the variance of X, that is Σ(X − X̄)²/(n − 1), must be a positive number.
Now when we go beyond the two-variable model we add an 8th assumption:

8. No perfect collinearity

This assumption is not important for the 2-variable model. With 2 explanatory variables, there is the possibility of correlation between them; for example, the possibility of correlation between income and wealth. So this assumption says that there should be no perfect collinearity: the correlation between X₂ and X₃ must not be +1 or −1.

Some of the assumptions, for example E(uᵢ|Xᵢ) = 0 (assumption number 3), are about the error term; others are conditions imposed on the data for estimating the OLS model. In addition, as we proceed we will consider some further assumptions, such as that the model is correctly specified: there should be no specification errors, no errors of measurement, and there are many other related issues.
You will not see all these assumptions in all books; some, like assumptions 6 and 7, are taken for granted and will not be explicitly stated in many textbooks. Also, in this model the assumptions are stated in terms of u, but you will also see the assumptions stated in terms of Y:

instead of E(uᵢ|Xᵢ) = 0, we can say that E(Y|Xᵢ) = β₁ + β₂Xᵢ

Similarly, var(uᵢ|Xᵢ) = σ² can also be stated as var(Yᵢ|Xᵢ) = σ².

So we can specify the assumption of homoskedasticity in terms of u or in terms of Y. Similarly, cov(uᵢ, uⱼ) = 0 can also be stated as cov(Yᵢ, Yⱼ) = 0.
Now the question is how realistic these assumptions are. The point is that a set of data satisfying all these assumptions is the most ideal data set, because if you apply OLS to such data the resulting estimators have the most desirable properties. But the problem is that in the real world many of these assumptions will be violated, and we will see what happens to the properties of the estimators when they are. Remember that these assumptions are stated in order to derive the properties of the estimators. We can compare this model with the model of perfect competition in microeconomics: you know that a perfectly competitive market is the most desirable, but we know that in the real world most of the assumptions of perfect competition are violated, and we study what happens to price and quantity when they are. In the same way, we know that a data set satisfying these assumptions is the most desirable, and if the assumptions are violated it will affect the properties of the estimators; later we will consider what happens to the properties of the estimators when each assumption is violated. Some assumptions are not important in terms of their impact; others are crucial.

So these are the assumptions of the classical linear regression model, the Gaussian linear regression model.
It is clear from the previous analysis that the OLS estimators are functions of the sample values:

β̂₂ = Σxᵢyᵢ/Σxᵢ²

β̂₁ = Ȳ − β̂₂X̄

So, as functions of sample values, the values of the OLS estimators vary from sample to sample. The precision of the OLS estimators is given by the standard error of the estimators, and the standard error is nothing but the standard deviation of the sampling distribution of the estimator. The sampling distribution of these estimators is nothing but their probability distribution over repeated samples.

The derivation of the standard error is required as a measure of the precision of the estimators; not only that, it is also required for hypothesis testing, because for hypothesis testing we use the sampling distribution, and the test statistic is derived from the sampling distribution. So we can use a test statistic based on an estimator for hypothesis testing if and only if the standard error of the estimator is known.
The next concern is the derivation of the standard errors of β̂₁ and β̂₂. In this section we derive only the mean and standard error of β̂₂; the mean and standard error of β̂₁ are simply stated.

Mean and SE of β̂₂

We have shown that β̂₂ = Σxᵢyᵢ/Σxᵢ². This can be written as

β̂₂ = ΣxᵢYᵢ/Σxᵢ² (this we derived earlier)

β̂₂ = ΣkᵢYᵢ, where kᵢ = xᵢ/Σxᵢ²

So β̂₂ is a linear function of Y₁, Y₂, Y₃, …, Yₙ, with k₁, k₂, k₃, …, kₙ, where kᵢ = xᵢ/Σxᵢ², serving as the weights.
Properties of the weights kᵢ

1. kᵢ = xᵢ/Σxᵢ² is non-stochastic (kᵢ being a function of Xᵢ, which is assumed fixed in repeated samples, kᵢ is non-stochastic too).

2. Σkᵢ = Σ(xᵢ/Σxᵢ²) = Σxᵢ/Σxᵢ² = 0, because in the numerator Σxᵢ = Σ(Xᵢ − X̄) = 0. So Σkᵢ = 0.

3. Σkᵢ² = Σ(xᵢ/Σxᵢ²)² = Σxᵢ²/(Σxᵢ²)² = 1/Σxᵢ². So Σkᵢ² = 1/Σxᵢ².

4. Σkᵢxᵢ = ΣkᵢXᵢ, the reason being Σkᵢ(Xᵢ − X̄) = ΣkᵢXᵢ − X̄Σkᵢ = ΣkᵢXᵢ − 0 = ΣkᵢXᵢ. And

ΣkᵢXᵢ = ΣxᵢXᵢ/Σxᵢ² = Σ(Xᵢ − X̄)Xᵢ/Σ(Xᵢ − X̄)² = (ΣXᵢ² − X̄ΣXᵢ)/(ΣXᵢ² − X̄ΣXᵢ) = 1

(the denominator expansion was derived earlier). So Σkᵢxᵢ = ΣkᵢXᵢ = 1.
Using these properties we can derive the mean and standard error of β̂₂ and β̂₁. Since β̂₂ = ΣkᵢYᵢ, β̂₂ is linear in the Yᵢ. Substituting Yᵢ = β₁ + β₂Xᵢ + uᵢ and using Σkᵢ = 0 and ΣkᵢXᵢ = 1:

β̂₂ = β₂ + Σkᵢuᵢ, and it also follows that

β̂₂ − β₂ = Σkᵢuᵢ

Taking expectations, with E(uᵢ) = 0 and kᵢ non-stochastic,

E(β̂₂) = β₂

And this shows that β̂₂ is an unbiased estimator of β₂. This is the unbiasedness property, and it will be satisfied if and only if Xᵢ and uᵢ are uncorrelated, so that the kᵢ can be treated as constants when taking expectations.

E(β̂₂) = β₂

Similarly,

E(β̂₁) = β₁
Variance of β̂₂

Var(β̂₂) = E[β̂₂ − E(β̂₂)]²

Since E(β̂₂) = β₂,

= E[β̂₂ − β₂]²

Var(β̂₂) = E[Σkᵢuᵢ]²
When we discussed the summation operators, we expanded a term like (Σxᵢ)² as x₁² + x₂² + … + xₙ² + 2x₁x₂ + … . Expanding [Σkᵢuᵢ]² in the same way:

= E[k₁²u₁² + k₂²u₂² + … + kₙ²uₙ² + 2k₁k₂u₁u₂ + … + 2kₙ₋₁kₙuₙ₋₁uₙ]

= k₁²E(u₁²) + k₂²E(u₂²) + … + kₙ²E(uₙ²) + 2k₁k₂E(u₁u₂) + … + 2kₙ₋₁kₙE(uₙ₋₁uₙ)

E(uᵢ²) = σ² (homoskedasticity assumption)

E(uᵢuⱼ) = 0 (no autocorrelation)

So the cross terms vanish and

= k₁²σ² + k₂²σ² + … + kₙ²σ² = σ²Σkᵢ²

(the formula for the variance, expected value etc. critically depends on these assumptions). Since Σkᵢ² = 1/Σxᵢ²,

Var(β̂₂) = σ²/Σxᵢ²

The positive square root of the variance of β̂₂ is known as the standard error of β̂₂:

SE(β̂₂) = σ/√(Σxᵢ²)

This quantity is used for hypothesis testing, for the derivation of Z, t etc.

Var(β̂₂) = σ²/Σxᵢ² and SE(β̂₂) = σ/√(Σxᵢ²)

It is nothing but the standard deviation of the sampling distribution of β̂₂, because β̂₂ is a random variable with its own probability distribution.
Similarly, without proof, we can show that

Var(β̂₁) = σ²ΣXᵢ²/(nΣxᵢ²)

SE(β̂₁) = √(ΣXᵢ²/(nΣxᵢ²)) · σ

These quantities are used as measures of the precision of the estimators; not only that, they are used for hypothesis testing. The variance of β̂₁ can also be written as

Var(β̂₁) = σ²[1/n + X̄²/Σxᵢ²]

= σ²[(Σxᵢ² + nX̄²)/(nΣxᵢ²)]

= σ²[(ΣXᵢ² − nX̄² + nX̄²)/(nΣxᵢ²)]

Var(β̂₁) = σ²ΣXᵢ²/(nΣxᵢ²)
Considering these formulas for the variances and standard errors, you will immediately see that Σxᵢ² is known once we have a sample, and n is also known. The unknown is σ²: σ² is the variance of uᵢ, and u is unobservable. So we cannot apply these formulas directly; we replace σ² with its unbiased estimator

σ̂² = Σûᵢ²/(n − 2)

where Σûᵢ² = the sum of squares of the residuals, and n − 2 is the degrees of freedom. (Compare s² = Σ(X − X̄)²/(n − 1), where n − 1 is the degrees of freedom.)

In the context of OLS, Σûᵢ² can be derived if and only if β̂₁ and β̂₂ have been computed. So two restrictions are imposed on the data when these two quantities are estimated from it:

1. Σûᵢ = 0

2. ΣûᵢXᵢ = 0

df = n − k

Now, σ̂² is the unbiased estimator of σ²:

E(σ̂²) = σ²

since E[Σûᵢ²] = (n − 2)σ²

The expectation of the numerator is (n − 2)σ², so if we divide by n − 2 we get

E[Σûᵢ²/(n − 2)] = [(n − 2)/(n − 2)]σ² = σ²

So division by n − 2 makes σ̂² an unbiased estimator of σ². In the equations for Var(β̂₁) and Var(β̂₂), σ² is replaced with Σûᵢ²/(n − 2).
How do we calculate Σûᵢ²?

Actually Σûᵢ² = Σ(Yᵢ − β̂₁ − β̂₂Xᵢ)²

Once β̂₁ and β̂₂ are calculated, it is easy to calculate Σûᵢ². But in practice we calculate it differently. We know that

Yᵢ = Ŷᵢ + ûᵢ

yᵢ = ŷᵢ + ûᵢ

Σyᵢ² = Σŷᵢ² + Σûᵢ² + 2Σŷᵢûᵢ

Here Σŷᵢûᵢ = 0, so

Σyᵢ² = Σŷᵢ² + Σûᵢ²

Σûᵢ² = Σyᵢ² − Σŷᵢ²

Since ŷᵢ = β̂₂xᵢ,

Σŷᵢ² = β̂₂²Σxᵢ²

So Σûᵢ² = Σyᵢ² − β̂₂²Σxᵢ²

This is the formula we use to calculate Σûᵢ².

σ̂ = √(Σûᵢ²/(n − 2))

is known as the standard error of the estimate, also known as the standard error of the regression. It is nothing but the standard deviation of the Y values around the estimated regression line, and it is often considered a measure of the goodness of fit of the sample regression line to the sample data.
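Putting this section together, here is a minimal Python sketch (added, using a small made-up data set) that computes β̂₁, β̂₂, the residual sum of squares, σ̂² = Σûᵢ²/(n − 2) and the standard errors with exactly the formulas above.

```python
# Hypothetical data: X could be income, Y consumption.
X = [10.0, 12.0, 15.0, 18.0, 20.0, 24.0, 25.0, 30.0]
Y = [8.0, 10.0, 11.0, 14.0, 15.0, 18.0, 18.0, 22.0]
n = len(X)

xbar, ybar = sum(X) / n, sum(Y) / n
x = [xi - xbar for xi in X]   # deviations
y = [yi - ybar for yi in Y]

b2 = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
b1 = ybar - b2 * xbar

rss = sum((yi - b1 - b2 * xi) ** 2 for xi, yi in zip(X, Y))  # sum of squared residuals
sigma2_hat = rss / (n - 2)                                   # unbiased estimator of sigma^2

sxx = sum(xi * xi for xi in x)
se_b2 = (sigma2_hat / sxx) ** 0.5
se_b1 = (sigma2_hat * sum(Xi * Xi for Xi in X) / (n * sxx)) ** 0.5

print(round(b1, 3), round(b2, 3))
print(round(se_b1, 3), round(se_b2, 3))
```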
From the definitions of the variances and standard errors, consider these points:

Var(β̂₂) = σ²/Σxᵢ²

Var(β̂₁) = σ²ΣXᵢ²/(nΣxᵢ²)

Consider the slope coefficient β̂₂. Its variance is directly proportional to σ²: as the variance of Y increases, the variance of β̂₂ increases. It is inversely proportional to Σxᵢ², where

Σxᵢ² = Σ(Xᵢ − X̄)²

If all the X values equal X̄ (there is no variability), this becomes 0 and the variance of β̂₂ becomes infinite. So if you have a sample with low variability in X, the variance of the estimator will be very high and the precision of the estimator will be low, and vice versa. If you have two data sets with different values of X, always select the set with the higher variability in X.
Gauss Markov Theorem

In this section we discuss the statistical properties of least squares. We have already discussed that properties are divided into numerical properties and statistical properties. Numerical properties are those that hold as a result of the use of OLS regardless of how the data are generated; statistical properties are those that hold only after making some assumptions about how the data are generated. The statistical properties of least squares are given in the form of a theorem known as the Gauss Markov theorem, a basic theorem in econometrics. The statement of the theorem is: given the assumptions of the classical linear regression model, the OLS estimators have minimum variance in the class of all linear unbiased estimators; they are BLUE. So we consider a class of linear unbiased estimators, and we will prove that in this class the OLS estimator has minimum variance.

As we discussed earlier, the BLUE property is a small sample property. Properties of estimators are classified into small sample properties and large sample properties, and small sample properties are the most desirable: if small sample properties are satisfied, we need not look for large sample properties.

Let us prove the Gauss Markov theorem with respect to β̂₂; it will automatically be applicable to β̂₁. We have

β̂₂ = ΣkᵢYᵢ …………(1)

E(β̂₂) = β₂ …………(2)

Var(β̂₂) = σ²/Σxᵢ² …………(3)
We have derived the variance, and we have to prove that, in the class of linear unbiased estimators, the OLS estimator β̂₂ has minimum variance. The first and second properties (linearity and unbiasedness) are absolute, meaning these properties do not depend on other estimators; but the third property is relative, a property defined in relation to the other estimators in the class.

Let β̃₂ (beta tilde 2) = ΣwᵢYᵢ be any other linear estimator, with wᵢ serving as the weights. Substituting for Yᵢ:

β̃₂ = Σwᵢ(β₁ + β₂Xᵢ + uᵢ)

β̃₂ = β₁Σwᵢ + β₂ΣwᵢXᵢ + Σwᵢuᵢ

E(β̃₂) = β₂ is satisfied if and only if Σwᵢ = 0 and ΣwᵢXᵢ = 1, so we assume that Σwᵢ = 0 and ΣwᵢXᵢ = 1. Then

β̃₂ = β₂ + Σwᵢuᵢ

E(β̃₂) = β₂, because E(uᵢ) = 0

So, for unbiasedness of β̃₂:

1. Σwᵢ = 0
2. ΣwᵢXᵢ = 1

Now we have to derive the variance of β̃₂; we can derive it just as we derived the variance of β̂₂. We have β̃₂ = ΣwᵢYᵢ.
Then

Var(β̃₂) = Var(ΣwᵢYᵢ)

= w₁²Var(Y₁) + w₂²Var(Y₂) + … + wₙ²Var(Yₙ)

It is assumed that Var(Yᵢ) = σ², and also that Cov(Yᵢ, Yⱼ) = 0; as we have already discussed, Var(Yᵢ) is the same as Var(uᵢ). Using this, the covariance terms disappear and we can write

= w₁²σ² + w₂²σ² + … + wₙ²σ²

Var(β̃₂) = σ²Σwᵢ²

using:

1. Var(Yᵢ) = σ²
2. Cov(Yᵢ, Yⱼ) = 0

(Recall that the same steps for OLS gave k₁²σ² + k₂²σ² + … + kₙ²σ² = σ²Σkᵢ² = σ²/Σxᵢ².)

To write the variance like this, we have to assume that Var(Yᵢ) = σ² and Cov(Yᵢ, Yⱼ) = 0. As we stated the assumptions in terms of u, we earlier adopted a different route; if we state the assumptions in terms of Y, we can use this route as well, and both give the same result. Here we used the assumptions that Var(Yᵢ) = σ² and Cov(Yᵢ, Yⱼ) = 0.
So Var(β̃₂) = σ²Σwᵢ². Now write wᵢ = kᵢ + (wᵢ − kᵢ); then

Σwᵢ² = Σkᵢ² + Σ(wᵢ − kᵢ)² + 2Σkᵢ(wᵢ − kᵢ)

Consider the last term:

Σkᵢ(wᵢ − kᵢ) = Σkᵢwᵢ − Σkᵢ²

= Σxᵢwᵢ/Σxᵢ² − 1/Σxᵢ²

Σxᵢwᵢ = Σ(Xᵢ − X̄)wᵢ = ΣwᵢXᵢ − X̄Σwᵢ, and since Σwᵢ = 0, Σxᵢwᵢ = ΣwᵢXᵢ = 1

= 1/Σxᵢ² − 1/Σxᵢ² = 0

Then we can write

Var(β̃₂) = σ²[Σkᵢ² + Σ(wᵢ − kᵢ)²]

= σ²Σkᵢ² + σ²Σ(wᵢ − kᵢ)²

Var(β̃₂) = σ²/Σxᵢ² + σ²Σ(wᵢ − kᵢ)²

Var(β̃₂) = Var(β̂₂) + σ²Σ(wᵢ − kᵢ)²
As you can see, Var(β̃₂) = Var(β̂₂) + σ²Σ(wᵢ − kᵢ)². Now, if wᵢ = kᵢ, that is, if the weights wᵢ are the same as the OLS weights kᵢ, the last term disappears and

Var(β̃₂) = Var(β̂₂)

If any wᵢ differs from kᵢ (wᵢ < kᵢ or wᵢ > kᵢ), the squared quantity in the last term becomes positive, and you get the result

Var(β̃₂) > Var(β̂₂)

So, another way of saying this: if there is a minimum variance linear unbiased estimator (BLUE) of β₂, it is given by the OLS estimator β̂₂. This is the proof of the Gauss Markov theorem. Both estimators are unbiased, and Var(β̃₂) > Var(β̂₂), or Var(β̂₂) < Var(β̃₂).
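The theorem can also be seen at work numerically. In the added sketch below, the competing linear unbiased estimator is a two-point slope, β̃₂ = (Yₙ − Y₁)/(Xₙ − X₁); its weights satisfy Σwᵢ = 0 and ΣwᵢXᵢ = 1, so it is unbiased, but across repeated samples its variance exceeds that of OLS, as the theorem predicts.

```python
import random

random.seed(7)

X = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0]
beta1, beta2, sigma = 1.0, 0.5, 2.0
n = len(X)
xbar = sum(X) / n
sxx = sum((xi - xbar) ** 2 for xi in X)

ols, alt = [], []
for _ in range(20_000):
    Y = [beta1 + beta2 * xi + random.gauss(0.0, sigma) for xi in X]
    ybar = sum(Y) / n
    ols.append(sum((xi - xbar) * (yi - ybar) for xi, yi in zip(X, Y)) / sxx)
    alt.append((Y[-1] - Y[0]) / (X[-1] - X[0]))  # two-point slope, also linear unbiased

def mean_var(v):
    m = sum(v) / len(v)
    return round(m, 3), round(sum((a - m) ** 2 for a in v) / len(v), 4)

print(mean_var(ols))  # mean near 0.5, variance near sigma^2 / sxx
print(mean_var(alt))  # mean near 0.5, but clearly larger variance
```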
Coefficient of Determination

● We have estimated the regression coefficients β₁ and β₂, and we have also derived the standard errors of the estimators:

β̂₂ = Σxᵢyᵢ/Σxᵢ²

Var(β̂₂) = σ²/Σxᵢ²

β̂₁ = Ȳ − β̂₂X̄

Var(β̂₁) = σ²ΣXᵢ²/(nΣxᵢ²)

● Now, once β̂₁ and β̂₂ are known, we can easily determine the sample regression line: Ŷᵢ = β̂₁ + β̂₂Xᵢ.
● If we know β̂₁ and β̂₂, then by substituting the values of Xᵢ, Ŷᵢ can be computed, and you get the estimated Y values.
● Our next issue is to find out how well the sample regression line fits the sample data.
● Ideally, what we would like is for all the sample points to lie on the sample regression line. But in the real world not all sample points will lie on it: some points lie above the sample regression line and some below, that is, there will be positive û and negative û.
● As a measure of goodness of fit, meaning how good the fit of the sample regression line to the sample data is, we use a measure known as the coefficient of determination.
● The coefficient of determination is the measure used to assess the fit of the sample regression line to the sample data. For the simple linear regression model it is denoted r², and when we go beyond the 2-variable model, to the general k-variable model, we denote the coefficient of determination as the multiple coefficient of determination, R².
● The coefficient of determination is written r²(y, x).
● There is an informal method of looking at r²: a technique known as the Venn diagram, also known as the Ballentine. Suppose the circle Y represents the variation in Y and the circle X represents the variation in X.
● The terms variance and variation are two different things. When we talk about variation, the question is: variation from what? There is a base, and the base we usually refer to is the mean. So variation is denoted Σ(Yᵢ − Ȳ)², while variance is Σ(Yᵢ − Ȳ)²/(n − 1). Variation is not variance.
● Now suppose that the circle Y represents the variation in Y, the circle X the variation in X, and their overlap the variation in Y explained by X. The overlap increases across the following diagrams.
● r² is a measure of this overlap. In the first case, with zero overlap, r² = 0; in the last case, with complete overlap, r² = 1; in the fourth case, r² < 1.
● To derive r² we proceed as follows. We have written the two-variable model both in terms of the original observations and in deviation form. In deviation form,

Σyᵢ² = Σ(ŷᵢ + ûᵢ)²

Σyᵢ² = Σŷᵢ² + Σûᵢ² + 2Σŷᵢûᵢ, and Σŷᵢûᵢ = 0

Σyᵢ² = β̂₂²Σxᵢ² + Σûᵢ²

The three pieces are:

(1) Σyᵢ² = Σ(Yᵢ − Ȳ)² = TSS (Total Sum of Squares)

(2) Σŷᵢ² = Σ(Ŷᵢ − Ȳ)², since we have proved that the mean of Ŷ equals Ȳ; we call this the ESS, the Explained Sum of Squares, the part of the variation in Y explained by the sample regression line.

(3) Σûᵢ² = RSS, the part of the variation in Y not explained by the sample regression line.

Dividing through by TSS:

1 = ESS/TSS + RSS/TSS

1 = Σŷᵢ²/Σyᵢ² + Σûᵢ²/Σyᵢ²

1 = β̂₂²Σxᵢ²/Σyᵢ² + Σûᵢ²/Σyᵢ²

Now we define r² as

r² = ESS/TSS

So 1 = r² + RSS/TSS, and r² can also be written as

r² = 1 − RSS/TSS = 1 − Σûᵢ²/Σyᵢ²
r² is often used as the summary measure of goodness of fit, that is, of the fit of the sample regression line to the sample data. r² is defined as the proportion of the variation in Y explained by X. If we multiply r² by 100, it becomes the percentage of the variation in Y explained by X. For example, if r² = 0.6, then 0.6 × 100 = 60%: the interpretation is that 60% of the variation in Y is explained by X.

r² = β̂₂²(Σxᵢ²/Σyᵢ²)

r² = β̂₂²(Sx²/Sy²)
● The value of r² always lies between 0 and 1.
● If Σûᵢ² = 0, all the sample points lie on the sample regression line; then r² = 1 − 0/Σyᵢ² = 1, that is, the entire variation in Y is explained by X. On the other hand, if Σûᵢ² = Σyᵢ², no part of the variation in Y is explained by X; then Σûᵢ²/Σyᵢ² = 1, so r² = 1 − 1 = 0.
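Continuing the earlier numerical sketch (added, with the same made-up data), r² can be computed as ESS/TSS, as 1 − RSS/TSS, or as β̂₂²Σxᵢ²/Σyᵢ²; all three agree, and the result also equals the square of the correlation coefficient introduced below.

```python
X = [10.0, 12.0, 15.0, 18.0, 20.0, 24.0, 25.0, 30.0]
Y = [8.0, 10.0, 11.0, 14.0, 15.0, 18.0, 18.0, 22.0]
n = len(X)

xbar, ybar = sum(X) / n, sum(Y) / n
x = [xi - xbar for xi in X]
y = [yi - ybar for yi in Y]

sxx = sum(xi * xi for xi in x)
syy = sum(yi * yi for yi in y)
sxy = sum(xi * yi for xi, yi in zip(x, y))

b2 = sxy / sxx
tss = syy                 # total sum of squares
ess = b2 * b2 * sxx       # explained sum of squares
rss = tss - ess           # residual sum of squares

r2_a = ess / tss
r2_b = 1.0 - rss / tss
r2_c = (sxy / (sxx * syy) ** 0.5) ** 2  # square of the correlation coefficient

print(round(r2_a, 4), round(r2_b, 4), round(r2_c, 4))  # all three identical
```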
● So, r² is a squared quantity; it will not be negative. Its range is between 0 and 1, where 0 means the regression line has no fit at all and 1 means a perfect fit. Typically r² lies between these two extremes.
● If the value of r² is 0.8, 0.9 etc., we say that the fit of the SRL is very high; if it is low, the fit is poor.
● As we will see later, in cross-sectional data, due to heterogeneity in the data, r² will generally be low, but in time series data r² will be very high.
● Always remember that r² is a measure of the fit of the SRL to the sample data.
● As you will see later, r² is not a very important quantity in econometrics.
● Now, a quantity closely related to, but conceptually very different from, r² is the coefficient of correlation:

r = Σxᵢyᵢ/√(Σxᵢ²Σyᵢ²)

Its properties:

(1) −1 ≤ r ≤ +1

It is symmetric: r(x, y) = r(y, x)

(4) If two variables X and Y are independent, then r = 0; but zero correlation does not mean that X and Y are independent. Correlation is a measure of linear association only: if Y = X², the variables are related, but r = 0, the reason being that they are non-linearly related. So even if the correlation is 0, we cannot say that the two variables are independent; but if two variables are independent, then r = 0.
There is an exception to this rule: if X and Y are normally distributed random variables, zero correlation does mean statistical independence. This is not true in all cases; it is true only for normally distributed random variables.

(5) Correlation need not imply causation. A classical example of this is spurious correlation, a concept first introduced by Yule in statistics and later brought into econometrics in the context of time series. The point is that even if two variables are highly correlated, we cannot say that one is a cause of the other; there need not be any causation between the two variables. That means that even if two variables are not related, you may still get a high correlation; such a possibility is known as spurious correlation.

While r is a measure of linear association between variables, r² is conceptually different: it is the proportion of the variation in Y explained by X. Actually r = ±√(r²); in which sense r² is the square of r we will consider later.
● Suppose that r² = 0.81; then r = ±0.9. The problem is that r can be positive or negative, but r² is positive only. So is r equal to +0.9 or −0.9? To see that, you must consider the underlying relation Y = f(X). Suppose consumption is a function of income: then the slope coefficient is positive, so r is positive; if the slope coefficient is negative, r is negative.
● So, in order to determine the sign of r, just consider the sign of the slope coefficient.
● So r can be positive or negative, but r² is always positive. And always remember that r² is the square of r, even though the meanings of the two terms are different. The sense in which r² is the square of r will be discussed and proved in the context of the multiple linear regression model.
Normality Assumptions

● Using the method of ordinary least squares we have estimated the parameters:

β₁ is estimated as β̂₁

β₂ is estimated as β̂₂

σ² is estimated as σ̂²

● Under the assumptions of the classical linear regression model these estimators possess many desirable statistical properties, summarized in the Gauss Markov theorem: the BLUE property.
● So, provided that the assumptions of the classical linear regression model are satisfied, β̂₁ and β̂₂ are best linear unbiased estimators. But we also know that they are only estimators: as estimators, their values vary from sample to sample; stated differently, they are random variables. A measure of the precision of these estimators is given by their standard errors (SE). We have derived the standard errors of β̂₁ and β̂₂. The standard error is used as the measure of the precision of the estimator; not only that, SEs are used for hypothesis testing.
● Estimation is only one part; hypothesis testing is the other part.
● Suppose we regress consumption on income and the estimated MPC, β̂₂, is 0.7. Now, if we take another sample, we will get another value for β̂₂; a third sample will give a third value, and we do not know which β̂₂ approximates the parameter β₂. Because of sampling fluctuations β̂₂ varies from sample to sample, and we do not know which β̂₂ approximates the true population parameter β₂.
● Our aim is not to estimate β̂₁ or β̂₂ as such, but to use these estimators to make inferences about the population parameters.
● Suppose that we want to make an inference about µ. Using a sample we compute X̄; X̄ is an estimator of µ.
● Once X̄ is estimated from the sample, we use the procedure of hypothesis testing, via the Z or t statistic:

Z = (X̄ − µ)/(σ/√n)

t = (X̄ − µ)/(S/√n)
● That branch of statistics is known as hypothesis testing. Statisticians have developed rules and procedures for testing hypotheses: steps in formulating hypotheses, steps in testing them, etc.
● We know that estimators are random variables, so, naturally, estimators have sampling distributions. We can use an estimator for hypothesis testing if and only if the sampling distribution of the estimator is known. For example, in the case of X̄: X follows a normal distribution with mean µ and variance σ², and X̄ follows a normal distribution with mean µ and variance σ²/n.

X ~ N(µ, σ²)

X̄ ~ N(µ, σ²/n)

● So the sampling distribution of X̄ is given, and the test statistic Z or t can be constructed.
● X̄ ~ N(µ, σ²/n) is based on the assumption that the sample observations are iid, or simply that the sample is taken using the random sampling technique, and that X is normal. Even if X is not normal, we can use the central limit theorem, provided that n is sufficiently large (greater than 30 or so).
For the OLS estimators we have, so far,

β̂₁ ~ (β₁, σ²ΣXᵢ²/(nΣxᵢ²))

and similarly

β̂₂ ~ (β₂, σ²/Σxᵢ²)

● In both cases the nature of the distribution is not specified; only the mean and the variance are. In the case of the arithmetic mean, even if X is not normal, X̄ is normal with (µ, σ²/n) provided that the sample size is sufficiently large, so we can construct the test statistics Z and t. In the case of OLS we have proved that the estimators are unbiased, their variances have been derived, and we have also shown that the estimators are BLUE, provided the assumptions hold. But estimators are random variables, so naturally each has a sampling distribution, and the nature of that sampling distribution has not been discussed so far.
● To use these estimators to make inferences about the population parameters, we have to specify the sampling distributions of β̂₁ and β̂₂; without sampling distributions we cannot use estimators to make inferences about population parameters. Now let us derive the distribution of β̂₂. We have

β̂₂ = Σxᵢyᵢ/Σxᵢ² = ΣkᵢYᵢ
β̂₂ = β₂ + Σkᵢuᵢ

● This step we have already derived. So the distribution of β̂₂ simply depends on the distribution of uᵢ, since

kᵢ = xᵢ/Σxᵢ²

is non-stochastic.
● So, as β̂₂ depends on uᵢ (and we can also show that β̂₁ depends on uᵢ; we will prove that later), and since the distribution of β̂₂ depends on uᵢ, we have to state an assumption about how uᵢ is distributed.
● So, in the regression context, we assume that uᵢ is normally distributed. That is, the stochastic error term uᵢ is normally distributed, and when we add the normality assumption to what we already have:

E(uᵢ) = 0

E[uᵢ − E(uᵢ)]² = σ²

and zero covariance, that is,

E{[uᵢ − E(uᵢ)][uⱼ − E(uⱼ)]} = 0

We can write this compactly as
uᵢ ~ N(0, σ²)

● So, in regression analysis, when we talk about the normality assumption, we are talking about the distribution of uᵢ. Since the covariances are also assumed to be zero, we can say that each uᵢ is normally and independently distributed with mean 0 and variance σ²:

uᵢ ~ NID(0, σ²)

● The celebrated central limit theorem says that if there are a large number of independently and identically distributed random variables, then, with a few exceptions, their sum will also be normally distributed. A variant of the central limit theorem says that even if the number of random variables is not very large, or even if they are not strictly independently distributed, their sum still tends to be normally distributed.
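This variant is easy to illustrate numerically. In the added sketch below, each simulated value is the sum of many independent, decidedly non-normal components; the moments of the sums come out close to those of a normal distribution (a normal distribution has standardized fourth moment 3).

```python
import random

random.seed(8)

# Each "error" is the sum of 40 independent uniform influences on [-1, 1].
m = 40
us = [sum(random.uniform(-1.0, 1.0) for _ in range(m)) for _ in range(50_000)]

mean = sum(us) / len(us)
var = sum((u - mean) ** 2 for u in us) / len(us)
kurt = sum((u - mean) ** 4 for u in us) / len(us) / var ** 2  # 3 for a normal

print(round(mean, 3), round(var, 2), round(kurt, 2))  # ~0, ~m/3 = 13.33, ~3
```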
● So, u in the linear regression model in econometrics is the sum of a large number of random variables: for example, in the consumption-income example, wealth, the number of family members, and many other omitted variables. As u is the sum of a large number of such variables, by the central limit theorem it tends to be normally distributed. Why does this matter for the estimators? The reason is that there is a theorem in statistics which says that a linear combination of normally distributed random variables is itself normally distributed. We have seen that

β̂₂ = β₂ + Σkᵢuᵢ

which is a linear combination of the uᵢ. So if uᵢ is normal, the estimators are normal:

β̂₁ ~ N(β₁, σ²ΣXᵢ²/(nΣxᵢ²))

and similarly

β̂₂ ~ N(β₂, σ²/Σxᵢ²)
● When we studied X̄, we simply assumed that if X, that is, the population, is normal, the distribution of X̄ is also normal; that is proved by statisticians. In the context of OLS, we use the normality assumption to prove that the estimators are normal. The normality assumption is about u and hence the estimators; it says nothing about the distribution of X or Y.
● Then why the normality assumption? Because u is the sum of a large number of identically and independently distributed random variables, and if a phenomenon is the sum of a large number of factors, that phenomenon will be normally distributed. Height is normally distributed because a large number of variables, genetic factors, environmental factors and so on, contribute to it; such variables will naturally be normally distributed. And if u is normal, the OLS estimators are normally distributed. And if the OLS estimators are normal, then we can use the normal distribution for hypothesis testing.
● Since β̂₁ ~ N(β₁, σ²ΣXᵢ²/(nΣxᵢ²)),

Z = (β̂₁ − β₁)/SE(β̂₁) ~ N(0, 1)

Similarly, β̂₂ ~ N(β₂, σ²/Σxᵢ²) is converted to

Z = (β̂₂ − β₂)/SE(β̂₂) ~ N(0, 1)

● β̂₁ ~ N(β₁, σ²ΣXᵢ²/(nΣxᵢ²)) is the sampling distribution of β̂₁, and β̂₂ ~ N(β₂, σ²/Σxᵢ²) is the sampling distribution of β̂₂. The estimators are not directly used for hypothesis testing; using the estimators we derive the test statistics, and the test statistic here is Z.
● The sampling distribution of β̂₁ is converted into Z; similarly we can convert β̂₂ into Z.
● In hypothesis testing we use the Z distribution. The sampling distribution itself could be used for hypothesis testing, but in general we do not use it directly; we derive the test statistic Z and use its distribution to test hypotheses about the regression coefficients β₁ and β₂.
● Remember, the Z conversion is possible if and only if the standard errors are known. In reality the standard errors will be unknown, so we go for the t distribution; that is the point we considered earlier. Now, for our future use, note the quantity

(n − 2)σ̂²/σ² ~ χ² with (n − 2) df
Now, in the context of the χ² distribution, we showed that

(n − 1)S²/σ² ~ χ² with (n − 1) df

That was in the context of testing the significance of σ².

● In statistics, we use the notation S² for the estimator of σ² and X̄ for the estimator of µ.
● In econometrics, we use σ̂² for the estimator of σ² and µ̂ for the estimator of µ.
● The quantity (n − 2)σ̂²/σ², distributed as χ² with (n − 2) df, is used to derive the t distribution.
● We can also show that β̂₁ and β̂₂ are distributed independently of the quantity σ̂², because that is the basic requirement for the t distribution: if Z₁ is standard normal, and Z₂ is χ² with k degrees of freedom and is distributed independently of Z₁, then the ratio Z₁/√(Z₂/k)
has a t distribution. We assume that the estimators are independent of σ̂²; to prove this requires matrix algebra. So, once the estimators are normally distributed, we can use them for hypothesis testing. But there is another point about how important this normality assumption is: it is important only if the sample size is small. If the sample size is very large, then the normality assumption is not very important, because, in general, as the sample size increases indefinitely, the estimators become normally distributed anyway, just as X̄ is normally distributed even when the population X is not. We can say the same in the context of the OLS estimators: if you have a large sample, say n > 100, the normality assumption is not required, because whether u is normal or not, or whether or not you are sure about the distribution of u, β̂₁ and β̂₂ will be normally distributed. So you can simply use the normal distribution and the corresponding Z statistics, or, if the variances are unknown, the t statistics, for hypothesis testing. Adding the normality assumption gives the Classical Normal Linear Regression Model; otherwise it is simply the Classical Linear Regression Model.
Classical statistical inference has two branches, i.e., estimation and hypothesis testing. Estimation is divided into two:
● Point estimation
● Interval estimation
OLS estimators are point estimators. When we consider hypothesis testing, there are two approaches: the confidence interval approach and the test of significance approach. The confidence interval approach is based on interval estimation, and the test of significance is based on point estimation.
RV → X (say, height) follows a certain distribution with parameters µ, σ². The random variable is normally distributed, but we do not know the parameters. Our interest is the estimation of these parameters.
In point estimation, we assume that the RV X follows a certain PDF, and we have a random sample of size n from this population, for example a sample of 50; we use this sample to estimate the unknown parameters.
θ̂ is the estimator, or the sample statistic: a rule or formula which tells us how to estimate the unknown parameter θ.
Interval Estimation
Point estimation gives only a single value for the parameter: θ̂ → θ.
In interval estimation, instead of suggesting a single value for θ, i.e., θ̂ → 150, we suggest a range in which the parameter will lie. The idea behind the notion of interval estimation is the concept of the sampling distribution of estimators. The parameter will lie in between two values. We construct an interval around the point estimate by taking 2 or 3 standard errors on either side. And we say, for example:
For eg: X ~ N(µ, σ²)
X̄ ~ N(µ, σ²/n)
We can suggest the interval X̄ ± 1.96 (σ/√n).
(θ̂ − δ, θ̂ + δ)
P[θ̂ − δ ≤ θ ≤ θ̂ + δ] = 1 − α
P[θ̂₁ ≤ θ ≤ θ̂₂] = 1 − α
1 − α = confidence coefficient
X̄ − 1.96 (σ/√n) ≤ µ ≤ X̄ + 1.96 (σ/√n)
67 − 1.96 (2.5/√100) ≤ µ ≤ 67 + 1.96 (2.5/√100)
66.51 ≤ µ ≤ 67.49
The parameter will lie in intervals like this 95 out of 100 times.
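The interval just computed can be reproduced in a few lines of Python; this is only a sketch of the arithmetic, using the same numbers as the text (X̄ = 67, σ = 2.5, n = 100) and scipy only for the 1.96 cutoff.

    from scipy.stats import norm

    xbar, sigma, n, alpha = 67.0, 2.5, 100, 0.05
    se = sigma / n**0.5                  # standard error of X̄
    z = norm.ppf(1 - alpha / 2)          # ≈ 1.96
    print(xbar - z * se, xbar + z * se)  # ≈ (66.51, 67.49)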
Construction of Confidence Interval of Population Mean
First we explain the construction of the confidence interval for the parameter µ using its estimator X̄.
X ~ N(µ, σ²)
X̄ ~ N(µ, σ²/n)
Z = (X̄ − µ) / (σ/√n) ~ N(0, 1)
Pr[−zα/2 < Z < zα/2] = 1 − α
−zα/2 and zα/2 are known as the critical values; the Z in the middle is the empirical Z, i.e., Z = (X̄ − µ)/(σ/√n).
Substituting for Z,
Pr[−zα/2 < (X̄ − µ)/(σ/√n) < zα/2] = 1 − α
Multiplying all the terms by σ/√n,
Pr[−zα/2 (σ/√n) < X̄ − µ < zα/2 (σ/√n)] = 1 − α
Subtracting X̄ from all the terms,
Pr[−zα/2 (σ/√n) − X̄ < −µ < zα/2 (σ/√n) − X̄] = 1 − α
When we multiply through by −1, < changes to >:
Pr[X̄ + zα/2 (σ/√n) > µ > X̄ − zα/2 (σ/√n)] = 1 − α
µ = X̄ ± zα/2 (σ/√n)
If α = 0.05,
µ = X̄ ± 1.96 (σ/√n)
σ/√n is the standard error of the distribution of X̄.
t = (X̄ − µ) / (s/√n) ~ t(n − 1)
If Z₁ = (X̄ − µ) / (σ/√n) ~ N(0, 1)
and Z₂ = (n − 1) s²/σ² = Σ(x − x̄)²/σ² ~ χ²(n − 1),
then
t = Z₁ / √(Z₂/(n − 1)) = [(X̄ − µ)√n / σ] / √[(1/(n − 1)) Σ(x − x̄)²/σ²] = (X̄ − µ) / (s/√n)
Pr[−tα/2 < t < tα/2] = 1 − α
Pr[−tα/2 < (X̄ − µ)/(s/√n) < tα/2] = 1 − α
If α = 5%, in the case of the t distribution the critical value containing 95% of t depends on the degrees of freedom. Multiplying all the terms by s/√n, subtracting X̄ from all the terms, and multiplying through by −1, we will get
µ = X̄ ± tα/2 (s/√n)
If the degrees of freedom is 20 and α = 5%,
µ = X̄ ± 2.086 (s/√n)
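When σ is unknown, the only change is the critical value, which now depends on the degrees of freedom. A small sketch (X̄ and s are illustrative; n = 21 is chosen only to give the 20 degrees of freedom used in the text):

    from scipy.stats import t

    xbar, s, n, alpha = 67.0, 2.5, 21, 0.05
    df = n - 1
    tc = t.ppf(1 - alpha / 2, df)         # ≈ 2.086 for 20 df
    se = s / n**0.5
    print(xbar - tc * se, xbar + tc * se)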
Now we consider how to construct confidence intervals for the regression coefficients β₁ and β₂. We will explain how to construct the confidence interval for β₂ and generalize it for β₁ or any other coefficient.
β̂₂ ~ N(β₂, σ²/Σxᵢ²)
With the normality assumption, the OLS estimators are normally distributed. Let us consider how we construct the interval.
From β̂₂,
Z = (β̂₂ − β₂) / (σ/√Σxᵢ²) ~ N(0, 1)
(on the assumption that σ² is known)
Pr[−zα/2 < Z < zα/2] = 1 − α
Once α is fixed, the critical values are known. For example, if α is 0.05, then −zα/2 = −1.96 and zα/2 = 1.96, so that 95% of Z lies between these two values. And the Z in the middle is the Z obtained from the sample:
Pr[−zα/2 < (β̂₂ − β₂)/(σ/√Σxᵢ²) < zα/2] = 1 − α
As we have done in the case of the arithmetic mean, here also multiply all the terms by σ/√Σxᵢ², then subtract β̂₂ from all the terms, then multiply all the terms by −1; you will get
β₂ = β̂₂ ± zα/2 [SE(β̂₂)]
This is the confidence interval for β₂ for the level of significance α. If α is 5%, this is the 95% confidence interval.
This can be generalized as
βᵢ = β̂ᵢ ± zα/2 [SE(β̂ᵢ)]
This is the 100(1 − α)% confidence interval for any regression coefficient βᵢ.
The Z distribution can be used for constructing the confidence interval only if σ is known. If σ is known and α is fixed, then this confidence interval can be constructed. But as we discussed in the case of the arithmetic mean, in practical applications σ will be unknown. So, as in the case of the arithmetic mean, in the case of OLS we use
t = (β̂₂ − β₂) / (σ̂/√Σxᵢ²) ~ t(n − 2)
Now we are considering the case where the degrees of freedom is n-2 not n-1.
Z₁ = (β̂₂ − β₂) √Σxᵢ² / σ ~ N(0, 1)
Z₂ = (n − 2) σ̂²/σ² = Σûᵢ²/σ² ~ χ²(n − 2)
t = Z₁ / √(Z₂/(n − 2)) = [(β̂₂ − β₂) √Σxᵢ² / σ] / √[(1/(n − 2)) Σûᵢ²/σ²]
= (β̂₂ − β₂) / (σ̂/√Σxᵢ²)
This quantity follows the t distribution with n − 2 degrees of freedom. Whenever σ is replaced by its unbiased estimator σ̂, Z is converted into the t distribution. This t distribution can be used whether the sample size is large or small, because whenever σ is replaced by its estimator σ̂, it is the t distribution (irrespective of whether the sample size is small or large). Then why is Z sometimes used in large samples even when σ is unknown and its estimator is used? This is because, in large samples, the distributions of Z and t are not much different. Still, in empirical applications the t distribution is widely used: if the sample size is large, Z can also be used, but as you will see, most econometric software uses the t distribution rather than the Z distribution for hypothesis testing. Now we will use
Pr[−tα/2 < t < tα/2] = 1 − α
t is now the empirical quantity obtained from the sample:
Pr[−tα/2 < (β̂₂ − β₂)/(σ̂/√Σxᵢ²) < tα/2] = 1 − α
Pr[β̂₂ − tα/2 SE(β̂₂) < β₂ < β̂₂ + tα/2 SE(β̂₂)] = 1 − α
The point in this case is that the critical t value tα/2 depends on the degrees of freedom in addition to α. Once α is fixed as 5% and the degrees of freedom is 20, the critical t value from the table or from the software is 2.086. So −tα/2 = −2.086 and tα/2 = +2.086. 95% of the t values lie between these two limits, and this gives the 95% confidence interval for β₂. In general,
βᵢ = β̂ᵢ ± tα/2 SE(β̂ᵢ)
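As a sketch, the coefficient interval β̂ᵢ ± tα/2·SE(β̂ᵢ) can be computed directly; the numbers used here (β̂₂ = 0.5091, SE = 0.0357, 8 df) are the ones that appear later in this text, and the underlying data are not needed for the computation.

    from scipy.stats import t

    b2_hat, se_b2, df, alpha = 0.5091, 0.0357, 8, 0.05
    tc = t.ppf(1 - alpha / 2, df)                    # ≈ 2.306 for 8 df
    print(b2_hat - tc * se_b2, b2_hat + tc * se_b2)  # ≈ (0.4268, 0.5914)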
Hypothesis Testing
Classical statistical inference has two branches:
1) Estimation
2) Hypothesis testing
Hypothesis testing is required because, as we have discussed earlier, we may have only samples, and using the samples we have to make inferences about the parameters. If the entire population were known to us, you could directly compute the population parameters, and there would be no need for hypothesis testing. But this is not practically possible. Now let us see how to formulate hypotheses. Hypothesis testing is the process of determining whether or not a given hypothesis is true. The first step in hypothesis testing is formulating the hypothesis by specifying the null and the alternative hypothesis.
Null Hypothesis (H0)
A null hypothesis is an assertion about the value of a population parameter. It is the assertion we call true until we have sufficient statistical evidence to conclude otherwise, for example:
H0: µ = 100
or
H0: µ = 160 cm
This is the null hypothesis.
There will be an alternative hypothesis also. An alternative hypothesis is the negation of the null hypothesis:
H1: µ ≠ 100
It is sometimes represented as Ha.
It is important that we are clear about what the null hypothesis, or H0, is in a particular situation; otherwise the entire process of hypothesis testing is invalid. Now let us see how to formulate H0 and H1 using a concrete example to clarify some points. Suppose that a company sells cola in 2-liter bottles. We expect that on average each bottle will contain 2 liters of cola; due to filling variation some bottles may contain a little more than 2 liters and some a little less, but on average there will be 2 liters of cola in each bottle. Suppose that a consumer suspects that the company is cheating the consumers by filling less than 2 liters. Then what is the null hypothesis and what is the alternative hypothesis? As we discussed earlier, the null hypothesis is an assertion about population parameters. It is an assertion we call true until we have sample evidence to
conclude that it is not true. The company producing and selling the cola is a reputed company, so we expect that it is doing business honestly; there is no reason to believe otherwise. So we write
H0: µ ≥ 2
H1: µ < 2
There will be some substance in the consumer advocate's suspicion if and only if the company is in fact filling less than 2 liters.
Remember that we go for new research if and only if we have some doubt about the status quo. If we have no doubt about the status quo, there is no need for further research. The purpose of research is to find something new or something different. In this case you suspect that the claim made by the company is not true, or you have some doubt about the claim; you find a suitable instrument, then you collect the data or samples, and using the samples we can test the hypothesis. If the null hypothesis is rejected, you can conclude that the claim made by the company is false. But variations are possible. Suppose that the company is not making any claim; the company is simply filling all bottles with 2 liters of cola, and as an independent researcher you want to demonstrate that, on average, the company is selling less than 2 liters. The null hypothesis is still
H0: µ ≥ 2
and you have to demonstrate that the company is on average selling less than 2 liters.
In this case, if you are a consumer advocate or an independent researcher: if H0 is accepted, the company is selling on average 2 liters of cola and no corrective action is required; if H0 is rejected, on average the company is selling less than 2 liters of cola, and corrective action is required.
Suppose the consumer advocate has no suspicion about the company, but the owner doubts whether the machines fill, on average, more than 2 liters. If the machine is filling more than 2 liters, it will be a loss as far as the owner is concerned. Now from the owner's point of view we write
H0: µ ≤ 2
H1: µ > 2
So from the owner's perspective, corrective action is required if the bottles are filled with more than 2 liters.
Third perspective: from the machine operator's perspective,
H0: µ = 2 liters
H1: µ ≠ 2 liters
If you are an independent researcher doing academic research, for example in a field such as Economics, Sociology or Psychology, your research hypothesis is always H1; H0 is the status quo.
On the other hand, if H0 is rejected, you have statistical evidence against H0, and you collect data to estimate the parameters etc. to demonstrate something new. So the researcher's hypothesis is H1, whatever your field of study: Economics, Sociology, Psychology etc. This is true whether you test the significance of only one parameter, or the difference between two parameters, i.e., between µ1 and µ2, or the difference between many means as in the case of ANOVA; whatever the context, the research hypothesis is H1.
We say that the test statistic is statistically significant if and only if H0 is rejected. The test statistic is based on the sample; when H0 is rejected, we say that the test statistic is significant.
Remember one more point: whenever we go for research, always formulate the hypothesis first. If we formulate a hypothesis after observing the results, you can always reject H0, or formulate H0 in such a way that it will be rejected.
So never go for such a practice. The correct practice is to formulate H0 and H1 first. Formulating hypotheses after observing the results is what is known as Data Snooping.
After formulating the hypothesis, collect the suitable data, primary or secondary, time series or panel, depending on your research question. Formulate H0 and H1, then estimate the parameters, and using the procedures of hypothesis testing, test the significance of the population parameters. This is the first step: the formulation of the null and alternative hypotheses.
Estimation and hypothesis testing are the twin branches of statistical inference. Estimation means estimation of the parameters: using a sample, we estimate the parameters by means of estimators. Hypothesis testing concerns how to test hypotheses about the population parameters using sample statistics.
Formulation of H0 and H1
Before beginning any research, formulate the hypotheses; otherwise the entire procedure is invalid.
Now let us formulate the problem of hypothesis testing more formally as follows. Suppose that X follows a PDF f(x; θ), where θ is the parameter of the distribution. We observe a random sample of size n, and using this random sample we obtain a point estimate θ̂. Since the parameter θ is unknown, we ask a question: is the estimate θ̂ compatible with some hypothesized value θ = θ*, where θ* is the value of θ stated under the null hypothesis, a specific value of θ? 'Compatible' means sufficiently close to. Stated differently: could our sample have come from the PDF f(X; θ = θ*)?
So the question is whether the sample is generated from a population with θ = θ*. In the language of hypothesis testing, θ = θ* is known as the null hypothesis, and it is written as H0.
This discussion is general and applicable in all situations: the parameter is θ, and θ̂ is its estimate. The question is whether the sample is generated from a population with θ = θ*, and that assertion is H0. It is written
H0: θ = θ*
H1: θ ≠ θ*
Now H0 and H1 can be simple or composite. A hypothesis is an assertion about a population parameter.
The hypothesis is simple if it specifies a specific value of the parameter; if the value is not specified, it is composite.
As an example, suppose that X ~ N(µ, σ²).
Suppose that our H0 is µ = µ*, σ² = σ*². This is a simple hypothesis.
Suppose that our H0 is µ = µ* and σ² > σ*². This is a composite hypothesis, because the value of σ² is not specified.
In empirical work we often test the hypothesis that the parameter takes the value zero. This is known as the zero null hypothesis. But this is not the only hypothesis which we will test; we can give any value for θ.
When we read software output, always remember that the output is based on this zero null hypothesis. If you want to test other hypotheses, then you have to use the options available in the software. In general,
H0: θ = θ*, but H1, the alternative hypothesis, can take any of these forms:
H1: θ > θ*
H1: θ < θ*
H1: θ ≠ θ*
And remember that all these alternative hypotheses are composite in nature. And which form H1 takes determines the critical region.
Example: suppose that our null hypothesis is that X ~ N(µ, σ²) and that we want to test µ only. Suppose we formulate the hypothesis µ = 70, tested against µ ≠ 70. Then the form in which H1 is formulated determines the critical region.
Now, to test a hypothesis, we divide the entire distribution of the test statistic into two regions. In hypothesis testing we use a test statistic; suppose that it is X̄ that is used to test a hypothesis about µ. Estimators are not directly used for hypothesis testing; from the estimator we develop a test statistic, say
Z = (X̄ − µ) / (σ/√n)
X ~ N(µ, σ²), then X̄ ~ N(µ, σ²/n). It means that if X has mean µ, say µ = 100, then E(X̄) = 100.
When X̄ = µ, the value of Z = 0.
My hypothesis is H0: µ = µ*, where µ* is a specific value for the mean of X.
Now the entire range of the test statistic is divided into two: values highly likely under H0 and values highly unlikely under H0. The 2.5% area on each side of the graph is the critical region: values of the test statistic with a low probability of being observed given H0: µ = 100. The acceptance region contains values with a high probability of being observed given H0: µ = 100.
So the entire distribution of the test statistic is divided into highly likely and highly unlikely under H0.
If H1 is of the not-equal form, the critical region is in both tails, left and right. If H1 is of the > form, the entire critical region is in the right tail; otherwise it is in the left tail.
So whether you have a two-tailed test, a right-tailed test or a left-tailed test depends on how you formulate H1.
Now, in empirical applications the critical region is often fixed at 5%, and this is also known as the level of significance. In econometrics the highly unlikely region is often fixed at 5%, but it is not the only value in use: in the medical sciences it is often fixed at 1%, and in some cases we can go to 15% etc., depending on circumstances.
The second step in hypothesis testing is to choose the level of significance of the test.
The point is that our decision to accept H0 or H1 is based on sample information. Due to sampling fluctuations the conclusions will not always be true; in any decision regarding accepting or rejecting H0 we may commit errors.
Example: we know that in legal proceedings the police arrest a suspect, collect evidence and present the accused before the court. The court, after considering the evidence presented by the police, decides whether the accused is guilty or not. In a democratic country, we cannot judge a person to be a criminal only because he has been arrested by the police. The police arrest a suspect and then collect the evidence. In our case the evidence is the sample. Suppose that, on the basis of the evidence, the sample, we cannot reject H0; that means the accused is treated as innocent: the sample evidence is not strong enough to reject H0. Two errors are possible:
● Reject H0 when H0 is true: the accused is innocent, but we convict him. This is Type 1 error.
● H0 is false, but we do not reject it: the accused is not innocent, but we set him free. This is Type 2 error.
In the classical approach to hypothesis testing, Type 1 error is considered to be more serious. So what we do is fix the Type 1 error at 5%, 10%, 1% etc. and try to minimize the Type 2 error.
And the probability of Type 1 error is known as α. It is known as the level of significance.
The probability of not committing Type 2 error is 1-β. It is known as the power of the test.
Even though in this example Type 1 error is more serious, variations are possible.
Example: steel is used for constructing houses and also for fencing, and steel of a particular strength is required; H0 is that the steel has the required strength. Type 2 error means accepting H0 when it is false, i.e., when the steel has less strength than required.
Now, the cost of Type 1 and Type 2 errors depends on the purpose for which the steel is used. If it is used for constructing a house, the house will collapse and there will be loss of life; it is highly costly. Here Type 2 error is costly compared to Type 1 error, because Type 1 error is only a cost of production. But suppose that the steel is used for fencing only; here Type 2 error is not very important, because even if the steel is weaker than specified, the consequences are minor.
So whether Type 1 or Type 2 error is more serious depends on the issue at hand.
And the entire distribution is divided into two: highly likely under H0 and highly unlikely under H0. Highly likely is the acceptance region; highly unlikely is the critical region. Now the critical region may be on the right side only, the left side only, or both sides, depending on how you formulate H1:
● If H1: θ ≠ θ* is your alternative hypothesis, then it is a two-tailed critical region.
● If it is H1: θ > θ*, the critical region is in the right tail of the test statistic.
● If it is H1: θ < θ*, the critical region is in the left tail.
So whether the critical region is in the right tail, the left tail or both tails depends on how we formulate the alternative hypothesis.
4. The fourth step is to choose the appropriate test statistic.
To test hypotheses we have to use test statistics. If you want to test the significance of a mean, we can use z or t:
● If the population variance is known, whatever the sample size, you can use z.
● If the population variance is unknown and the sample size is less than 30, we will use the t test.
● If you want to test the difference between more than two means, as in ANOVA, we can use the F test statistic.
● While discussing the normal distribution and the distributions related to it, we have seen that the chi-square is used for the sampling distribution of squared quantities such as the sample variance.
5. The fifth step is to compute the test statistic. Suppose that z is our test statistic and we are testing a hypothesis about a population parameter µ. Then
z = (X̄ − µ) / (σ/√n)
The value of µ is given under H0. As an example, suppose that we specify
µ = 60, X̄ = 70, σ/√n = 2
Then Z = (70 − 60)/2 = 5.
This is based on the assumption that σ is known to us; otherwise we will use the t test.
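The step-5 calculation written as code (the numbers are the ones in the text):

    mu0, xbar, se = 60.0, 70.0, 2.0      # hypothesized mean, sample mean, σ/√n
    z = (xbar - mu0) / se
    print(z)                             # 5.0 — far in the right tail, so reject H0 at 5%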
6. Compare the value of the test statistic with the critical value.
If the value of the test statistic calculated from the sample is in the critical region, we reject H0. When we reject a null hypothesis using the test statistic, we say that our sample is not generated by a population with µ = 60, that is, our sample is generated by a different population.
These are the steps in hypothesis testing. For example, suppose that the value of the test statistic = −1. Then we accept H0. We will say that our sample is generated by a population with mean = 60: there is no difference between what we have assumed under H0 and what we have obtained from the sample, or the sample value is close to the hypothesized value. So H0 is not rejected.
In this session we discuss how to test the hypothesis. There are two approaches to hypothesis testing: the confidence interval approach and the test of significance approach.
We will first consider the confidence interval approach. While discussing estimation theory, we noted that there are two types of estimation: point estimation and interval estimation. The confidence interval approach is based on the concept of interval estimation, and the test of significance approach is based on point estimation.
The first step in hypothesis testing is to formulate the null and alternative hypotheses. Let our hypothesis be
H0: µ = µ*
tested against
H1: µ ≠ µ*
We first explain the approaches to hypothesis testing using the estimator arithmetic mean, and then extend them to the regression coefficients.
We assume that the population X is normally distributed with mean µ and variance σ²:
X ~ N(µ, σ²)
And X̄ is also normally distributed with the same mean µ and variance σ²/n:
X̄ ~ N(µ, σ²/n)
Since this sampling distribution is known, we can use it to construct the confidence interval.
The decision rule is: once we construct a confidence interval of 100(1 − α)%, α will be specified. If α = 5%, it is a 95% confidence interval; if α = 1%, it is a 99% confidence interval, and so on. Then we see whether µ = µ* is included in this interval or not. Once the interval is constructed, what we do is check whether the hypothesized value is included in it: if the hypothesized value µ = µ* is in the interval, we will not reject the null hypothesis; otherwise we reject it.
So, in the confidence interval approach, we construct a 100(1 − α)% confidence interval for the parameter µ using the estimator X̄. If the hypothesized value is included in this interval, we do not reject the null hypothesis; if the hypothesized value is not included in this interval, we reject it.
Z = (X̄ − µ*) / (σ/√n) is a standard normal variable with mean 0 and variance 1:
Z = (X̄ − µ*) / (σ/√n) ~ N(0, 1)
Here the distribution of X̄, and hence of Z, depends on the value of µ assumed under the null hypothesis.
Pr[−zα/2 < (X̄ − µ*)/(σ/√n) < zα/2] = 1 − α
If α = 5%,
Pr[−1.96 < (X̄ − µ*)/(σ/√n) < 1.96] = 1 − α
Pr[X̄ − 1.96 (σ/√n) < µ* < X̄ + 1.96 (σ/√n)] = 1 − α
The 95% confidence interval for µ* is X̄ ± 1.96 (σ/√n).
So this confidence interval is constructed for the value of µ assumed under the null hypothesis H0: µ = µ*.
Once the confidence interval is established, we see whether the hypothesized value lies within it or not.
Now the procedure is simple: once the 95% CI is constructed for µ using X̄ and its standard error, we see whether the hypothesized value is within these limits or not. If the hypothesized value is within the limits, we do not reject the null hypothesis; if it is not within the limits, we reject the null hypothesis.
As an example, suppose that we have taken a sample of 100, from which X̄ = 67, σ = 2.5, n = 100, and σ/√n = 0.25. Now:
X̄ + 1.96 (σ/√n) = 67 + 1.96 (0.25) = 67.49
X̄ − 1.96 (σ/√n) = 67 − 1.96 (0.25) = 66.51
Suppose that
H0: µ = 69
is tested against
H1: µ ≠ 69
What we do is consider whether the hypothesized value is within the range 66.51 to 67.49. The hypothesized value is 69, and this value is not within the range, so we reject the null hypothesis.
In the context of confidence intervals, as we have discussed earlier, if we construct intervals like this a repeated number of times, the parameter will lie in such intervals 95 out of 100 times. Suppose that the hypothesized value is within the interval and we accept the hypothesis; the parameter will lie within such intervals only 95 out of 100 times, so it will lie outside 5 out of 100 times. Likewise, when we reject the null hypothesis, we will be rejecting a true null hypothesis 5 out of 100 times.
The test of significance is a procedure by which sample results are used to verify the truth or falsity of a null hypothesis, and the key idea behind it is the test statistic and the distribution of the test statistic under the null hypothesis. The decision to accept or reject the hypothesis is based on the value of the test statistic obtained from the sample. If the value of µ is specified under the null hypothesis, we can calculate Z on the assumption that σ is known.
As Z ~ N(0, 1), we can state the probability statement as
Pr[−zα/2 < (X̄ − µ*)/(σ/√n) < zα/2] = 1 − α
If α = 5%,
Pr[−1.96 < (X̄ − µ*)/(σ/√n) < 1.96] = 1 − α
Pr[µ* − zα/2 (σ/√n) < X̄ < µ* + zα/2 (σ/√n)] = 1 − α
What we did here is multiply throughout by σ/√n and then add µ* to all the terms.
This is the interval in which X̄ will fall with probability 1 − α, given that µ = µ*. (This is not the confidence interval.)
In the language of hypothesis testing, this 100(1 − α)% interval is known as the region of acceptance of H0, and the region outside it is the region of rejection.
This interval is based on the assumption that σ is known. In most cases σ will be unknown, so we use
t = (X̄ − µ*) / (s/√n) ~ t(n − 1)
Pr[µ* − tα/2 (s/√n) < X̄ < µ* + tα/2 (s/√n)] = 1 − α
This tα/2 is the critical t value for the α level of significance and (n − 1) degrees of freedom. We accept the null hypothesis if X̄ lies in the interval, given µ = µ*.
If the distribution is normal, we know the area property: µ* ± 1.96 (s/√n) includes 95% of the X̄ values; the Z (the test statistic derived from X̄) is generated from the sampling distribution of X̄. So if µ = µ*, 95% of the distribution lies within this interval and 5% falls outside. If the X̄ obtained from the sample lies in this interval, we do not reject the null hypothesis. In the language of significance tests, a statistic is said to be statistically significant if the value of the test statistic lies in the critical region; then the null hypothesis is rejected. A statistic is said to be statistically insignificant if the value of the test statistic lies in the acceptance region; then the null hypothesis is not rejected.
Suppose that the average height of students in a college is claimed to be 160 cm:
H0: µ = 160 cm
H1: µ ≠ 160 cm
A sample of size n = 100 is taken; X̄ = 150 cm, σ = 4, SE = σ/√n = 0.4.
µ* ± 1.96 (σ/√n) = 160 ± 1.96 (0.4) = 160.78 and 159.22 (critical values)
The area between 159.22 and 160.78 is 95% of the distribution of X̄. Given this interval, what is the probability that the sample was generated from a population with mean 160? We know that the probability is very low, because 150 lies in the low probability area of the distribution. So we reject the null hypothesis. 95% of the values lie between 159.22 and 160.78; only 5% of the values lie beyond these limits (2.5% above 160.78 and 2.5% below 159.22). So the probability that the sample came from a population with µ = 160 is very low. In terms of Z:
Z = (150 − 160)/0.4 = −25
Hypothesis testing about regression coefficients
Now we discuss in detail hypothesis testing in the context of regression analysis. The parameters of our discussion are β₁ and β₂, and the estimators are β̂₁ and β̂₂. We use these estimators to make inferences about the parameters. As we discussed earlier, remember this: the hypotheses must be formulated before estimation. Once the hypotheses are fixed, we collect data, and using the inferential procedures, the steps in hypothesis testing, we test the hypothesis after fixing the level of significance and the critical region based on the alternative hypothesis, and we choose the test statistic based on the problem at hand.
Finally, we calculate the value of the test statistic from the sample and compare it with the critical value.
There are two approaches to hypothesis testing: the confidence interval approach and the test of significance approach. The question of hypothesis testing is: is a given observation or finding compatible with a stated hypothesis or not? This is the fundamental question that we ask in hypothesis testing. Compatible means sufficiently close to the stated hypothesis: if it is sufficiently close, we do not reject it; otherwise we reject it. In the language of hypothesis testing, the stated hypothesis is the null hypothesis, tested against the alternative hypothesis, also known as the maintained hypothesis or the research hypothesis. The theory of hypothesis testing is concerned with developing rules or procedures for deciding whether to accept or reject the null hypothesis. In both approaches we use the fact that the estimator is a random variable with a probability distribution, known as the sampling distribution, and the sampling distribution is used for hypothesis testing. The normality assumption is used here because if X is normal, the estimator is also normal. In the context of regression, we assumed that u is normal, and as the estimators are linear functions of u, they are also normally distributed.
Both approaches to hypothesis testing will be illustrated with the income and consumption example. Income ranges from 80 to 260, n = 10, and using OLS suppose that the estimated line is
Ŷ = 24.4545 + 0.5091 Xᵢ
n = 10
df = 8
α = 5%
In the discussion of regression analysis we consider only the t value, because the standard errors are estimated.
Now let us consider the confidence interval approach to hypothesis testing first.
Here, β̂₂ = 0.5091. Suppose our hypothesis is H0: β₂ = 0.3 tested against H1: β₂ ≠ 0.3 (0.3 is an assumed value for the MPC).
Once we formulate the hypothesis, we construct the 95% confidence interval for β₂ as
β̂₂ ± tα/2 SE(β̂₂) = 0.5091 ± 2.306 (0.0357), i.e., 0.4268 ≤ β₂ ≤ 0.5914
In the language of hypothesis testing, if β₂ under H0 lies in this 100(1 − α)%, i.e., 95%, confidence interval, we do not reject H0; otherwise we reject it. The hypothesized value is 0.3, and 0.3 is outside the interval, so we reject H0. As we already discussed, the parameter will lie in such intervals 95 out of 100 times, and only 5 times will it lie outside. In our example the hypothesized value is outside the limits, so we reject H0: our sample is not generated from a population with β₂ = 0.3.
In the language of hypothesis testing, when we reject a null hypothesis, we say that our finding is statistically significant; when we do not reject H0, we say that our finding is not statistically significant. In this case we reject H0, so our finding is statistically significant. Statistically significant means that our sample finding, or our test statistic, is significant.
In the court procedure, the evidence collected by the police is used to verify whether the person has committed the crime or not. If the court accepts H0, then the person is innocent; that means the evidence is not significant. In our case, the evidence is our sample. The sample is used to decide whether to accept or reject H0. If H0 is accepted, our sample finding is not significant; in this sense, we say that the test statistic is not significant. When we reject a hypothesis, we will be rejecting a true hypothesis 5 out of 100 times: the parameter will lie within the interval 95 out of 100 times, and 5 times it will lie outside, but still we reject H0 even though H0 is true. Rejecting H0 when H0 is true is called Type I error.
The test of significance approach is associated with Neyman and Pearson. A test of significance is a procedure by which sample results are used to verify the truth or falsity of a null hypothesis. The key idea behind the test of significance approach is the test statistic and the sampling distribution of the test statistic under the null hypothesis. The decision to accept or reject H0 is based on the value of the test statistic
t = (β̂₂ − β₂) / SE(β̂₂)
t = (β̂₂ − β₂) / (σ̂/√Σxᵢ²), as we have derived earlier.
● Here we explain the test of significance approach, as in the case of the confidence interval approach, using the t test statistic only, because in econometrics we will not use the Z statistic in this context: σ is always replaced by σ̂, and the standard errors are estimated.
● Now, if the value of β₂ is specified under H0, then the t statistic can be calculated, because the standard error is given and the only unknown is β₂. Then t can be used as the test statistic, and our decision depends on the value of t obtained from the sample compared to the critical t values.
tα/2 is the critical value for the given degrees of freedom and the specified level of significance.
Pr[β₂* − tα/2 SE(β̂₂) < β̂₂ < β₂* + tα/2 SE(β̂₂)] = 1 − α
This is not the confidence interval; this is a different procedure: we multiply throughout by the standard error of β̂₂ and add β₂* to all the terms to rearrange the equation.
● Now, this is the interval in which β̂₂ falls with probability 1 − α, given that β₂ = β₂*.
● And in the language of hypothesis testing, this 100(1 − α)% interval is known as the region of acceptance of H0, and outside it is the region of rejection. In our example,
Pr[β₂* − tα/2 SE(β̂₂) < β̂₂ < β₂* + tα/2 SE(β̂₂)] = 1 − α
0.3 ± 2.306 (0.0357)
This is not a confidence interval: it runs from β₂* − tα/2 SE(β̂₂) to β₂* + tα/2 SE(β̂₂). What it says is that if you take samples like this a repeated number of times, you will generate a distribution, and 95% of the β̂₂ will lie between 0.2177 and 0.3823, provided that the true β₂ is 0.3. And this is not an actual distribution; it is a hypothetical distribution. You will get this distribution only after formulating H0; if H0 is not formulated, you cannot derive the t statistic and you will not get this distribution.
● Remember, hypothesis testing is based on the sampling distribution of the estimators.
● Now, we will not do this directly; we will convert the units of β̂₂ into units of t.
● 95% of t with 8 degrees of freedom lies between −2.306 and +2.306. Now we calculate the t:
t = (0.5091 − 0.3) / 0.0357 = 5.86
Now, in the test of significance approach, we compare the value obtained from the sample with the hypothesized value. If the value obtained from the sample is far away from the hypothesized value, it means our hypothesis may not be correct, or our sample may not be generated from a population with the hypothesized value 0.3; so we have increasing suspicion about H0. If the value of β̂₂ is close to the hypothesized value β₂*, then we will be in the acceptance region; otherwise we will be in the rejection region. What it says is that the probability that our sample is generated from a population with β₂ equal to 0.3 is less than 5% (2.5% in the left tail and 2.5% in the right tail): only a low probability, so we reject H0. Converting to t, our t value is 5.86, and the probability of a t equal to or larger than 5.86 is far less than 5%. The actual probability is much smaller.
● You can test a hypothesis using the estimator itself and its sampling distribution. We will not do this. What we do is convert the estimator into a test statistic, and the test statistic is used for hypothesis testing. This is the procedure we follow.
● If β̂₂ = β₂*, then t = 0, and t = 0 means the null hypothesis is accepted. If the estimated β̂₂ is far away from the hypothesized value, the large t value is evidence against H0, while a small t value is evidence in favour of H0. To conclude, in the language of significance tests, a test statistic is said to be significant if its value lies in the critical region; if the value lies in the acceptance region, it is statistically insignificant. So, in this example, as the value of the test statistic lies in the critical region, we say that the test statistic is significant.
● As we are using the t test here, we have a test procedure based on t. Remember: the test statistic is significant if H0 is rejected; otherwise it is not significant. If the null hypothesis is not rejected, the test statistic is not significant. And this is a two-tailed test. We can adopt a one-tailed test also:
H0: β₂ = 0.3
H1: β₂ > 0.3
Then the critical t value becomes t = 1.86 (one-tailed, 8 df): if you take a one-tailed test, the critical value will be smaller. So a value that is accepted in a two-tailed test may be rejected in a one-tailed test.
● If H1: β₂ < 0.3, we have a left-tailed test. Whether we have a one-tailed or a two-tailed test, right-tailed or left-tailed, depends on the alternative hypothesis.
● So, to summarize:
H0: β₂ = β₂*   H1: β₂ ≠ β₂*   Reject H0 if |t| > tα/2
H0: β₂ = β₂*   H1: β₂ > β₂*   Reject H0 if t > tα
H0: β₂ = β₂*   H1: β₂ < β₂*   Reject H0 if t < −tα
● Always remember that for hypothesis testing we use the sampling distribution, and the area of the sampling distribution is divided into two: highly likely and highly unlikely under H0. Highly likely values are close to the hypothesized value; unlikely values are far away from it. Also remember that the distribution is a hypothetical distribution. If you give a different hypothesized value, say 0.5 instead of 0.3, then β̂₂ will be in the acceptance region, not in the rejection region.
● So, it is important to fix the hypothesis first; if you fix the hypothesis after observing the results, you can always reject H0. The practice is not recommended. Fix the hypothesis first, then collect data, estimate the parameters, and test using the sampling distribution; remember, the sampling distribution is not used directly, we use the test statistic derived from it.
● And also remember this: the sampling distribution of the test statistic depends on the normality assumption, so the normality assumption about the estimators is important. The assumption matters if and only if the sample size is small; if you have a sufficiently large sample, say n > 100, the normality assumption is not very important.
● These are the lessons of hypothesis testing; there are many other issues related to hypothesis testing, and these will be discussed in the next sessions.
● Hypothesis testing is perhaps one of the most important areas of empirical analysis. The researcher formulates a hypothesis and tests it using inferential statistics. We have discussed hypothesis testing about the population parameter µ using the estimator X̄, and we have also discussed in detail how to test hypotheses about the regression coefficients β₁ and β₂. The test statistic we have used extensively is the t test. Now we have to consider some practical issues regarding hypothesis testing. In addition to the technical details, we must also be familiar with some of the practical aspects of hypothesis testing. This session is meant to give some concrete ideas about some of the practical issues and precautions in hypothesis testing.
● The first issue that we are considering is what is meant by accepting or rejecting a
hypothesis?
● Remember this: our decision to accept or reject the null hypothesis is based on sample evidence. Our decisions will not always be correct; whenever we decide to accept or reject, we may commit an error.
● Now, as the decision to accept or reject H0 is based on sample evidence, and sample values vary from sample to sample, we are not 100% sure about accepting or rejecting H0. Suppose that on the basis of sample evidence we decide to accept H0; that doesn't mean that H0 is true beyond any doubt. To explain this point, let us consider an example.
● Suppose that we hypothesize a value for the MPC in the consumption income example:
H0: β₂ = 0.7
H1: β₂ ≠ 0.7
Suppose that the estimated value is β̂₂ = 0.7241 and SE(β̂₂) = 0.0701. Now consider the t:
t = (0.7241 − 0.7) / 0.0701 = 0.3438
This t value is not statistically significant, so we accept H0, because 0.3438 is in the acceptance region.
● Now, suppose that the hypothesized value is β₂ = 0.6:
H0: β₂ = 0.6
H1: β₂ ≠ 0.6
t = (0.7241 − 0.6) / 0.0701 = 1.7703
● So, we accept the null hypothesis that β₂ = 0.7 and we also accept the null hypothesis that β₂ = 0.6, but we do not know which H0 is true, because the same data are compatible with different hypotheses. The same sample evidence is compatible with different null hypotheses.
● So, what we do is say we 'may accept' H0 rather than we 'do accept' H0, or that we 'do not reject' H0, just as a court declares 'not found guilty' rather than 'innocent'. On the basis of the evidence collected by the police, the judge declares that the accused is not found guilty; it doesn't mean that he is innocent. The evidence is simply not sufficient to prove that he has committed the crime. When the accused is set free by the court, it doesn't mean that the accused is innocent beyond doubt; the evidence is such that there is no ground to punish. In the same way, we 'do not reject' H0 rather than 'accept' H0. This is so because the same sample evidence will be compatible with different hypotheses, and we do not know which hypothesis is correct. So whenever we reject or accept H0, remember that the conclusion is provisional.
● So, as one author says, when we accept a null hypothesis, it is safer to say that the data do not allow us to reject it: the sample evidence is not against H0 at the chosen level of significance.
● We can formulate H0 depending on the problem under consideration. But the null hypothesis most often tested in empirical analysis is the zero null hypothesis. Consider the relation
𝑌𝑖 = β1 + β2𝑋𝑖 + 𝑢𝑖
𝐻0: β2 = 0
𝐻1: β2 ≠ 0
Now, the purpose of this hypothesis is to check whether Y is related to X at all. If this hypothesis is accepted, i.e., not rejected, there is no meaning in testing other hypotheses. Not only that: we have many software packages for estimating econometric relations, and in all of them the default hypothesis is the zero null hypothesis. You have the option to test other hypotheses, and as we proceed we will consider other hypotheses also. Testing the zero null hypothesis is not always the best practice; we have to test other hypotheses also. Anyway, in the software packages, the default hypothesis is the zero null hypothesis.
● If we test the zero null hypothesis, then t becomes
t = β̂₂ / SE(β̂₂)
because it is β̂₂ − 0. So whether t is positive or negative depends on whether β̂₂ is positive or negative.
● As you can see, to test significance with t, we compare β̂₂ with its standard error. t will be significant if and only if β̂₂ is much larger in magnitude than the standard error of β̂₂. If β̂₂ is equal to its standard error, then t = 1, which is not significant. Again, if the standard error of β̂₂ is larger than β̂₂, t is not significant. We judge β̂₂ in relation to its standard error.
● So, remember this: in the computer output we will have the coefficient, below the coefficient its standard error, and the coefficient divided by the standard error is the t value. Below that we will get the p value, which is a topic we will see soon.
● There are no hard and fast rules regarding how to formulate H0 and H1. In many cases, economic theory will tell us how to formulate them. If economic theory is not strong, we have to use our intuition, previous empirical work of a similar nature, etc. The formulation of H0 and H1 depends on the problem at hand: often theory or previous empirical work will give you some guidance, or your informed guess can be used to formulate H0 and H1.
● No matter how you formulate H0 and H1, it is absolutely essential that you formulate them before estimation. This is very important. The point is, when you go for empirical research, the first issue is to identify a problem; by a literature review we identify the research gap.
● A research problem is specified in the form of H0 and H1. Once H0 and H1 are formulated, you can collect your own data, that is primary data, or you can go for secondary data collected by some other agency. The point is that H0 and H1 should be formulated first. Suppose that you identified an issue, collected data, and observed the results. If you formulate the hypotheses after observing the results, you can always formulate H0 and H1 in such a way that your null hypothesis is rejected, because the sampling
distribution takes shape only when the value of µ is specified as µ*. If you give one value to µ, you have one distribution; give another value, and you have another distribution.
● H0: µ = µ*
● So, for one value you may reject H0 and for another value you may accept H0. So, if you observe the results and formulate the hypothesis after observing the results, the problem is that you can always fix a µ* which is far away from the observed value of X̄, for example, so that H0 is rejected.
● Now, if you formulate a hypothesis after observing the results, there may be a temptation to form hypotheses just to justify one's results.
● So, as one author says, the researcher will be guilty of circular reasoning or self-fulfilling prophecies, and there is a possibility that he will formulate H0 in such a way that it is always rejected.
● So, the scientific procedure is: formulate the hypothesis first, then collect data, estimate, test, and accept or reject the hypothesis on the basis of the sample evidence and the hypotheses already formulated.
● Then another issue that a researcher confronts is how to fix α- level of significance?
● We divide the entire area of the distribution into two: highly likely under H0 and highly unlikely under H0. The division of the area depends on the value of α. If α is, say, 5%, that is 0.05, then for a two-tailed test there is 2.5% in the left tail and 2.5% in the right tail.
● So, our decision to accept or reject H0 depends on the level of significance α. The level of significance is the probability of committing Type I error. The distribution used for hypothesis testing, say the distribution of X̄, takes shape around the hypothesized value; give a different hypothesized value and you will get another distribution. Now the question is: what should α be? If α is 5%, the critical region is 2.5% in each tail. If α is 10%, the critical region expands to 5% in each tail.
● And suppose that a value lies just outside the 5% critical region in the right tail but inside the 10% region; then at 5% H0 is accepted and at 10% it is rejected.
● There are no hard and fast rules: α is fixed at 1%, 5%, 10% etc.; in social science we mostly use 5%.
● The point is that α is the probability of committing Type I error. You can reduce the probability of committing Type I error by fixing α at a low level, 1% etc. But when you reduce the value of α, the probability of Type II error will increase. That is β: accepting H0 when it is false. In a given sample, you cannot minimize both types of errors; when α is reduced, that is, the probability of Type I error is reduced, the probability of Type II error will increase.
● But if you ask whether you can reduce Type I error to zero, such a possibility is there. Type I error is rejecting H0 when H0 is true; if you never reject H0, the probability of Type I error becomes zero. But the problem is that the probability of Type II error is then 1: you accept a false hypothesis with probability 1.
● So, we cannot minimize both types of errors in a given case. The common practice is that econometricians and statisticians generally believe that Type I error is more serious than Type II error, so conventionally we fix Type I error at 1%, 5%, 10% etc. and try to minimize Type II error.
● Regarding whether Type I or Type II error is more serious, we discussed the example of steel earlier: if the steel is used for building a house, the Type II error is more serious, but if it is used for fencing, it is not.
● So, when fixing α, remember this: as α increases, β decreases, and vice versa. So while fixing α, one must keep in mind the costs of Type I and Type II errors. This is important.
The other issue is the p value, or exact level of significance. The concept is very important as part of inferential statistics. One author said that the Achilles heel of classical statistical inference is the arbitrariness in fixing α at 1%, 5% or 10%; this is the only weak spot in the classical statistical inference procedure. What is suggested is that instead of fixing α arbitrarily, once a test statistic is obtained from the sample, e.g. t = 5.86, we find the actual probability of obtaining a value equal to or greater than it. This is the exact level of significance.
Eg: H0 → MPC = 0.3.
If t = 2.036 and the probability of getting a t equal to or greater than it is 5%, then P = 0.05, i.e., α = 5%.
If the sample value lies in the critical region, H0 is rejected. What is the probability of committing a Type 1 error? Nominally it is 5%, but it is not exactly 5%: the exact probability of committing the Type 1 error is the p value we calculate from the sample. If, say, p = 0.1, the test statistic is significant at 10% but not at 5%.
You report the p value along with the coefficient, standard error and t, and let the reader decide whether to accept or reject at 1%, 5%, 10%, 15% etc. One reader may be willing to accept a Type 1 error probability of only 1%, another of 5% or 10%; you report the p value along with the results and let the reader decide.
The p value is the probability value: the lowest level of significance at which a null hypothesis can be rejected. It is also known as the exact level of significance or the observed level of significance. It is obtained after calculating the empirical t from the sample.
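A sketch of the reporting practice just described: compute the exact p value once, then let the reader compare it with whatever α they prefer. The t value and degrees of freedom below are illustrative, not from the lecture.

    from scipy.stats import t

    t_stat, df = 2.0, 11                     # illustrative sample t and df
    p = 2 * t.sf(abs(t_stat), df)            # exact (observed) level of significance
    print(p)
    for alpha in (0.01, 0.05, 0.10):
        print(alpha, "reject H0" if p < alpha else "do not reject H0")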
Statistical significance is not the same as practical significance. Practical significance depends upon the sign and the size of the coefficient. Both are important from a practitioner's point of view: even if a coefficient is statistically significant, if the sign is not according to our expectation, it is not practically significant. For example, in the consumption income model, if you get β̂₂ as a negative coefficient, it is not theoretically justified.
Guidelines
1. Check for statistical significance: if the coefficient is statistically significant, then check its sign and size.
2. If it is not statistically significant at 1%, 5%, 10%..., you may still consider the expected sign and size and calculate the p value; for small samples, some reject H0 with p up to 0.2 (up to 20%).
3. Sign + (−): if the sign is not theoretically justified, then we can discard the result.
Let me consider an example: the dependent variable (Y) is hourly wage and the independent variable (X) is years of education.

Y (hourly wage)   X (years of education)
4.45              6
..                7
..                ..
13.53             18
Ŷᵢ = −0.0144 + 0.7240 Xᵢ
SE = (0.9317) (0.0700)
t = (−0.0154) (10.3428)
P = (0.987) (0.000)
n = 13
df = 11
α = 5%, critical t = 2.201
r² = 0.9065
90% of the variations in Y are explained by X. The range of X values is not close to 0, so the intercept has no direct interpretation, and it is statistically insignificant (p = 0.987). The slope = 0.7240, and the slope coefficient is significant: there is a positive relationship between years of education and hourly wage.
ANOVA is extensively used to compare differences between more than two means. In regression it is used to test the overall significance of the estimated model. We have the decomposition
Σyᵢ² = Σŷᵢ² + Σûᵢ²
Σyᵢ² = β̂₂²Σxᵢ² + Σûᵢ²
TSS = Σ(Y − Ȳ)², with (n − 1) degrees of freedom.
RSS = Σûᵢ², with (n − 2) degrees of freedom.
ESS = Σ(Ŷ − Ȳ)², with (n − 1) − (n − 2) = n − 1 − n + 2 = 1 degree of freedom.
ANOVA Table

Source of Variation   SS                  df      MSS
ESS                   Σŷᵢ² = β̂₂²Σxᵢ²     1       ESS/1
RSS                   Σûᵢ²                n − 2   Σûᵢ²/(n − 2) = σ̂²
TSS                   Σyᵢ²                n − 1

F = β̂₂²Σxᵢ² / σ̂² ~ F(1, n − 2)
If Z₁ ~ χ²(k₁) and Z₂ ~ χ²(k₂) are independent, then
F = (Z₁/k₁) / (Z₂/k₂) ~ F(k₁, k₂)
Here,
z₁ = (β̂₂ − β₂) / (σ/√Σxᵢ²) ~ N(0, 1)
z₁² = (β̂₂ − β₂)² Σxᵢ² / σ² ~ χ²(1)
z₂ = (n − 2) σ̂²/σ² = Σûᵢ²/σ² ~ χ²(n − 2)
F = (z₁²/1) / (z₂/(n − 2))
= [(β̂₂ − β₂)² Σxᵢ² / σ²] × [(n − 2) σ² / Σûᵢ²]
= (β̂₂ − β₂)² Σxᵢ² / σ̂² ~ F(1, n − 2)
Under H0: β₂ = 0,
F = β̂₂² Σxᵢ² / σ̂² ~ F(1, n − 2)
Compare this with t = β̂₂ / SE(β̂₂) for the same hypothesis.
E(β̂₂²Σxᵢ²) = σ² + β₂²Σxᵢ², and E(σ̂²) = σ². Therefore, if β₂ = 0, the numerator and denominator of F have the same expectation and the F value will be small; RSS is a measure of the noise in the system. If β₂ ≠ 0, the F value will be large.
In general, t²ₖ = F(1, k); in the two-variable analysis, t²ₙ₋₂ = F(1, n − 2).
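The F = t² identity is easy to verify numerically. The figures below are assumed for illustration; they happen to reproduce the fitted line Ŷ = 24.4545 + 0.5091X and σ̂² = 42.159 used in this text, so they are a plausible stand-in for the example data.

    import numpy as np

    X = np.array([80., 100., 120., 140., 160., 180., 200., 220., 240., 260.])
    Y = np.array([70., 65., 90., 95., 110., 115., 120., 140., 155., 150.])
    n = X.size
    x, y = X - X.mean(), Y - Y.mean()

    b2 = np.sum(x * y) / np.sum(x**2)            # OLS slope
    b1 = Y.mean() - b2 * X.mean()                # OLS intercept
    u = Y - b1 - b2 * X                          # residuals
    sigma2_hat = np.sum(u**2) / (n - 2)          # RSS/(n−2)

    F = b2**2 * np.sum(x**2) / sigma2_hat        # ESS/1 over RSS/(n−2)
    t_stat = b2 / np.sqrt(sigma2_hat / np.sum(x**2))
    print(F, t_stat**2)                          # identical up to rounding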
Suppose we have observations (Yᵢ, Xᵢ), n = 10. Using the observations and applying OLS, you have estimated a consumption function
Ŷᵢ = 24.4545 + 0.5091 Xᵢ
Besides describing the historical relationship, an estimated regression is also used for prediction. One use of this historical regression is to forecast a future value, say Y₀ corresponding to X = X₀.
Y₀ and X₀ are not included in the original observations (Yᵢ, Xᵢ; here i = 1, 2, 3, ..., n). This X₀ may be outside the sample range or within it; the only requirement is that X₀ is a given specific value. It is also assumed that, while using X₀ to predict Y = Y₀, the original relation between Y and X holds for the predicted (X₀, Y₀) also: the relationship which generated the sample data is applicable to the future observations, whether X₀ is within the range or outside it. Now there are two types of predictions.
Prediction of an individual Y, say Y₀, corresponding to X = X₀ (individual prediction):
Y₀ = β₁ + β₂X₀ + u₀
We predict it as
Ŷ₀ = β̂₁ + β̂₂X₀
Ŷ₀ is a predictor of the true value Y₀. Both need not be one and the same, due to the fact that Ŷ₀ is an estimator. The difference between Y₀ and Ŷ₀ is the prediction error or the forecast error. It has two sources:
1. u₀, the stochastic error term: the presence of the additive error term u₀.
2. The second source of error is the estimators: the estimators are random variables and vary from sample to sample.
E(û₀) = E(Y₀ − Ŷ₀) = 0
so Ŷ₀ is unbiased.
Var(û₀) = Var(Y₀ − Ŷ₀) = Var[(β₁ − β̂₁) + (β₂ − β̂₂)X₀ + u₀]
Expanding the square and taking expectations (the cross terms drop out),
E(û₀²) = E(Y₀ − Ŷ₀)² = σ² [1 + 1/n + (X₀ − X̄)²/Σxᵢ²]
If X₀ = X̄, the variance of the prediction error becomes σ² (1 + 1/n).
We know that û₀ = Y₀ − Ŷ₀ is a linear function of normal variables, because it is a linear combination of the uᵢ. We can write
(Y₀ − Ŷ₀) / [σ √(1 + 1/n + (X₀ − X̄)²/Σxᵢ²)] ~ N(0, 1)
and, replacing σ by σ̂,
(Y₀ − Ŷ₀) / [σ̂ √(1 + 1/n + (X₀ − X̄)²/Σxᵢ²)] ~ t(n − 2)
In this formula everything is known except Y₀. So we construct a 95% confidence interval for Y₀ as
Y₀ = Ŷ₀ ± t(0.025) σ̂ √(1 + 1/n + (X₀ − X̄)²/Σxᵢ²)
For X₀ = 100,
Ŷ₀ = 24.4545 + 0.5091(100) = 75.3645
Var(û₀) = σ̂² [1 + 1/n + (X₀ − X̄)²/Σxᵢ²] = 42.159 (1 + 1/10 + (100 − 170)²/33000) = 52.63
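A sketch of the interval computation with the quantities given in the text (β̂₁ = 24.4545, β̂₂ = 0.5091, X₀ = 100, σ̂² = 42.159, n = 10, X̄ = 170, Σxᵢ² = 33000):

    import numpy as np
    from scipy.stats import t

    b1, b2, X0 = 24.4545, 0.5091, 100.0
    sigma2_hat, n, xbar, sum_x2 = 42.159, 10, 170.0, 33000.0

    Y0_hat = b1 + b2 * X0                                        # 75.3645
    var_pred = sigma2_hat * (1 + 1/n + (X0 - xbar)**2 / sum_x2)  # ≈ 52.63
    tc = t.ppf(0.975, n - 2)                                     # 2.306 for 8 df
    half = tc * np.sqrt(var_pred)
    print(Y0_hat - half, Y0_hat + half)                          # 95% interval for Y0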
This can be repeated for any value of X; we will get the confidence band. If your aim is instead to predict the mean value of Y given X₀, i.e.
E(Y | X₀) = β₁ + β₂X₀
we predict it as
Ŷ₀ = β̂₁ + β̂₂X₀
with variance
σ² [1/n + (X₀ − X̄)²/Σxᵢ²]
When you construct the 95% confidence interval for this mean predictor, the confidence band will be closer to the SRF.
The point is, once you estimate a model, you can predict the value of Y corresponding to any X = X₀. But the problem is that if you try to predict a value which is far away from the mean of X, the prediction error and its variance increase nonlinearly. So never try to predict a value which is far away from the mean of the independent variable, or at least never try to predict a value which is far away from the sample range. If we extend an estimated regression line far away from the sample range, it will give contradictory results. Suppose that we have information about the GNP of India up to 2020 and also the value of aggregate consumption, and we estimate a consumption income relationship; if you try to forecast the consumption expenditure in the year 2030 by substituting a value for income in the year 2030, it will give absurd results, because the variance of the prediction error will be very high; such a prediction is meaningless. Thus never predict a value which is far away from the sample range; that will give you misleading results.
Consider a PRF
Yᵢ = β₂Xᵢ + uᵢ
In this model the intercept β₁ is assumed to be 0. Such a model is known as a regression through the origin model. As you know, the PRF of such a model starts from the origin, not from a point on the Y axis.
As an example, consider the permanent income hypothesis of Milton Friedman, which can be written as
1) C = αYᵖ, where Yᵖ is permanent income; adding a disturbance term u, it is an example of a regression through the origin model.
2) Consider a cost function TC = a₀ + a₁Q + a₂Q² + a₃Q³, with TC = TFC + TVC. If you are modeling the variable cost component, it starts from the origin, because we know that the variable cost is zero when output is zero.
Now, if you have strong theoretical reasons to believe that the correct model is a regression through the origin model, then you can have a model like this. The regression through the origin model is not generally preferred in empirical analysis, but it is still useful for some purposes. So it is important to know the properties of the regression through the origin model, because when we discuss heteroscedasticity, autocorrelation etc., we will come across such models, and this discussion will be useful later.
Now, this is the PRF; let us see how to estimate it. The SRF is
𝑌𝑖 = β2 𝑋𝑖 + 𝑢𝑖
𝑢𝑖= 𝑌𝑖 - β2 𝑋𝑖
$$\sum \hat u_i^2 = \sum\left[Y_i - \hat\beta_2 X_i\right]^2$$
Minimizing with respect to $\hat\beta_2$,
$$\frac{d\sum \hat u_i^2}{d\hat\beta_2} = 2\sum\left[Y_i - \hat\beta_2 X_i\right](-X_i) = -2\sum\left[Y_i - \hat\beta_2 X_i\right]X_i = 0$$
This can be written as
$$\sum Y_i X_i = \hat\beta_2 \sum X_i^2, \qquad \sum \hat u_i X_i = 0$$
so that
$$\hat\beta_2 = \frac{\sum Y_i X_i}{\sum X_i^2}$$
Substituting the PRF $Y_i = \beta_2 X_i + u_i$,
$$\hat\beta_2 = \frac{\sum X_i(\beta_2 X_i + u_i)}{\sum X_i^2} = \frac{\beta_2 \sum X_i^2 + \sum X_i u_i}{\sum X_i^2} = \beta_2 + \frac{\sum X_i u_i}{\sum X_i^2}$$
Taking expectations, since $E(u_i) = 0$,
$$E(\hat\beta_2) = \beta_2$$
Now let us see what the variance of $\hat\beta_2$ is.
$$E\left[\hat\beta_2 - \beta_2\right]^2 = E\left[\frac{\sum X_i u_i}{\sum X_i^2}\right]^2 = \frac{1}{(\sum X_i^2)^2}\,E\left[\sum X_i u_i\right]^2$$
Expanding the square,
$$X_1^2 u_1^2 + \dots + X_n^2 u_n^2 + 2X_1X_2u_1u_2 + \dots + 2X_{n-1}X_n u_{n-1}u_n$$
And from the assumptions all these cross terms disappear, as we derived earlier; then this becomes
$$\frac{1}{(\sum X_i^2)^2}\,\sigma^2 \sum X_i^2 = \frac{\sigma^2}{\sum X_i^2}$$
Now we know that $\sigma^2$ is unknown, so we replace it with its unbiased estimator
$$\hat\sigma^2 = \frac{\sum \hat u_i^2}{n-1}$$
Here the divisor is $n-1$, not $n-2$: in the regression-through-the-origin model we estimate only one parameter, so only one degree of freedom is lost.
$$E\left(\frac{\sum \hat u_i^2}{n-1}\right) = \frac{(n-1)\sigma^2}{n-1} = \sigma^2$$
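A minimal NumPy sketch, on hypothetical simulated data, ties these pieces together: the through-the-origin estimator built from raw sums of squares and cross products, the residual variance with the $n-1$ divisor, and the resulting standard error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data generated from a true through-the-origin model
beta2_true, sigma = 0.8, 2.0
X = rng.uniform(1, 20, size=200)
Y = beta2_true * X + rng.normal(0, sigma, size=200)

# OLS through the origin: raw sums of squares and cross products
beta2_hat = np.sum(X * Y) / np.sum(X ** 2)

# Residual variance uses n - 1 degrees of freedom (one parameter estimated)
u_hat = Y - beta2_hat * X
sigma2_hat = np.sum(u_hat ** 2) / (len(X) - 1)

# var(beta2_hat) = sigma^2 / sum(X^2)
se_beta2 = np.sqrt(sigma2_hat / np.sum(X ** 2))
print(beta2_hat, se_beta2)
```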
Now it is informative to compare the regression-through-the-origin model with the model having an intercept. As we discussed while deriving the estimator, one condition was imposed:
$$\sum \hat u_i X_i = 0$$
Since only this one condition is imposed on the data to derive $\hat\beta_2$, we have $n-1$ degrees of freedom. In the conventional model we also have $\sum \hat u_i = 0$, but in the regression-through-the-origin model we cannot impose both conditions.
We know that
$$Y_i = \hat\beta_2 X_i + \hat u_i$$
If we instead imposed $\sum \hat u_i = 0$, summing both sides would give $\sum Y_i = \hat\beta_2 \sum X_i$, so that
$$\hat\beta_2 = \frac{\sum Y_i}{\sum X_i} = \frac{\bar Y}{\bar X}$$
and this quantity is a biased estimator.
Averaging $Y_i = \hat Y_i + \hat u_i$ over the sample,
$$\bar Y = \bar{\hat Y} + \bar{\hat u}$$
But since in this model $\sum \hat u_i \neq 0$, we have $\bar{\hat u} \neq 0$, and therefore
$$\bar Y \neq \bar{\hat Y}$$
We cannot impose the condition that the mean of Y equals the mean of the estimated Y; the two are different.
So, as a comparison between the two models, we know that for the conventional model
$$\hat\beta_2 = \frac{\sum x_i y_i}{\sum x_i^2}, \qquad \operatorname{Var}(\hat\beta_2) = \frac{\sigma^2}{\sum x_i^2}, \qquad \hat\sigma^2 = \frac{\sum \hat u_i^2}{n-2}$$
1) In the regression-through-the-origin model we use the raw sums of squares and cross products, but in the model with an intercept we use the adjusted (mean-deviation) sums of squares and cross products.
2) In the regression-through-the-origin model we have $n-1$ degrees of freedom; in the conventional model, $n-2$.
3) In the conventional model $\sum \hat u_i = 0$, but we cannot assume that $\sum \hat u_i = 0$ in the regression-through-the-origin model.
4) In the model with an intercept, $0 \le r^2 \le 1$, i.e. $r^2$ is always a non-negative number. But in the regression-through-the-origin model, $r^2$ can become negative.
In the regression-through-the-origin model,
$$\sum \hat u_i^2 = \sum Y_i^2 - \hat\beta_2^2 \sum X_i^2$$
And we know that
$$r^2 = 1 - \frac{\sum \hat u_i^2}{\sum y_i^2}$$
In the conventional model, $\sum \hat u_i^2 = \sum y_i^2 - \hat\beta_2^2 \sum x_i^2$, so that $\sum \hat u_i^2 \le \sum y_i^2$ and $r^2$ lies between 0 and 1. But in the model without an intercept, the residual sum of squares is computed from raw sums of squares, while $\sum y_i^2 = \sum Y_i^2 - n\bar Y^2$, so $\sum \hat u_i^2$ can exceed $\sum y_i^2$. For instance, if $\sum Y_i^2 = 10$, $\hat\beta_2^2\sum X_i^2 = 5$ and $n\bar Y^2 = 6$, then $\sum \hat u_i^2 = 10 - 5 = 5$ while $\sum y_i^2 = 4$, and
$$r^2 = 1 - \frac{5}{4} = -0.25$$
Here the ratio is greater than 1, so $r^2$ becomes negative.
So in the regression-through-the-origin model we cannot use the conventional $r^2$ as a measure of goodness of fit. The measure suggested in the literature is the raw $r^2$:
$$\text{raw } r^2 = \frac{(\sum X_i Y_i)^2}{\sum X_i^2 \sum Y_i^2}$$
This raw $r^2$ satisfies the condition $0 \le r^2 \le 1$; the conventional $r^2$ does not.
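The following sketch, on deliberately mis-specified hypothetical data (a true model with a large intercept fitted through the origin), shows the conventional $r^2$ going negative while the raw $r^2$ stays between 0 and 1:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data with a large intercept, deliberately mis-fit
# by a through-the-origin regression
X = rng.uniform(1, 10, size=100)
Y = 50 - 2 * X + rng.normal(0, 1, size=100)

b2 = np.sum(X * Y) / np.sum(X ** 2)   # through-origin slope
u = Y - b2 * X                        # residuals

# Conventional r^2: compares RSS with the mean-adjusted TSS
r2_conventional = 1 - np.sum(u ** 2) / np.sum((Y - Y.mean()) ** 2)

# Raw r^2: uses raw sums of squares, always between 0 and 1
r2_raw = np.sum(X * Y) ** 2 / (np.sum(X ** 2) * np.sum(Y ** 2))

print(r2_conventional, r2_raw)  # conventional value is negative here
```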
Thus, compared with a model having an intercept, the regression-through-the-origin model has many problems. So unless you have sufficient theoretical reasons, it is better to use the conventional model
$$Y_i = \beta_1 + \beta_2 X_i + u_i$$
and test
$$H_0\!: \beta_1 = 0 \qquad H_1\!: \beta_1 \neq 0$$
If this null hypothesis is not rejected, then from the practical point of view the model is equivalent to a regression through the origin. So the point is: it is better to fit a model with an intercept rather than one without, unless you have very strong theoretical reasons.
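A minimal sketch of that testing strategy on hypothetical data whose true intercept is zero: fit the model with an intercept, then form the usual t statistic for $H_0\!: \beta_1 = 0$ using $\operatorname{var}(\hat\beta_1) = \sigma^2 \sum X_i^2 / (n \sum x_i^2)$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Hypothetical sample; the true intercept is zero
X = rng.uniform(1, 10, size=50)
Y = 0.9 * X + rng.normal(0, 1, size=50)

# OLS with intercept
x, y = X - X.mean(), Y - Y.mean()
b2 = np.sum(x * y) / np.sum(x ** 2)
b1 = Y.mean() - b2 * X.mean()

# Standard error of the intercept
n = len(X)
u = Y - b1 - b2 * X
sigma2 = np.sum(u ** 2) / (n - 2)
se_b1 = np.sqrt(sigma2 * np.sum(X ** 2) / (n * np.sum(x ** 2)))

# t test of H0: beta1 = 0
t_stat = b1 / se_b1
p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df=n - 2))
print(b1, t_stat, p_value)  # a large p-value means we cannot reject H0
```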
The unit in which data are expressed is very important. If we are not familiar with the units of measurement in which the dependent and independent variables are given, there is a possibility that the results will be misinterpreted.
Example: suppose that data on United States Gross Private Domestic Investment (GPDI) and Gross Domestic Product (GDP) are given from 1992 to 2005, and suppose that we regress GPDI on GDP.
We can use the data in any units, e.g. in rupees, in thousands or lakhs of rupees, in millions or billions, as we like. If the data are in billions, we can convert them into millions by multiplying by 1,000.
Now let us see what are some of the issues associated with unit of measurement.
● If one researcher is using billions of Dollars, Another researcher uses data in millions of
Dollars. The question is, will the regression line be the same in both cases?
● The second question is: if the results are different, which result should one use?
● The next question is whether the unit of measurement of X and Y makes any difference to the regression results, and if so, what is the most sensible course of action to follow.
Define two variables
$$Y_i^* = W_1 Y_i, \qquad X_i^* = W_2 X_i$$
where $W_1$ and $W_2$ are constants; we may have $W_1 = W_2$ or $W_1 \neq W_2$.
This means, for example, that data given in billions can be converted into millions, with each variable converted separately. Now we have $Y_i$ and $X_i$ together with $Y_i^*$ and $X_i^*$; since the latter are obtained by multiplying by $W_1$ and $W_2$, which are known as scale factors, we can say that $Y_i^*$ and $X_i^*$ are rescaled values of $Y_i$ and $X_i$. If $Y_i$ and $X_i$ are in billions of dollars, then $Y_i^* = 1000\,Y_i$ and $X_i^* = 1000\,X_i$ express the same data in millions.
Now consider the regression in the rescaled variables:
$$Y_i^* = \hat\beta_1^* + \hat\beta_2^* X_i^* + \hat u_i^*$$
where $Y_i^* = W_1 Y_i$, $X_i^* = W_2 X_i$, and $\hat u_i^* = W_1 \hat u_i$.
We want the relationships between:
$\hat\beta_1$ and $\hat\beta_1^*$
$\hat\beta_2$ and $\hat\beta_2^*$
$\operatorname{var}(\hat\beta_1)$ and $\operatorname{var}(\hat\beta_1^*)$
$\operatorname{var}(\hat\beta_2)$ and $\operatorname{var}(\hat\beta_2^*)$
$\hat\sigma^2$ and $\hat\sigma^{*2}$
$r^2_{xy}$ and $r^2_{x^*y^*}$
● $\hat\beta_2^* = \dfrac{W_1}{W_2}\,\hat\beta_2$. If $W_1 = W_2$, then $\hat\beta_2^* = \hat\beta_2$.
● $\operatorname{var}(\hat\beta_2^*) = \left(\dfrac{W_1}{W_2}\right)^{2}\operatorname{var}(\hat\beta_2)$
If $W_1 = W_2$, the unit of measurement changes in the same way for both Y and X, so the slope and its variance are unchanged.
● $\hat\beta_1^* = W_1\,\hat\beta_1$
● $\operatorname{var}(\hat\beta_1^*) = W_1^2\,\operatorname{var}(\hat\beta_1)$, so its standard error is $\sqrt{W_1^2} = W_1$ times the original.
When we rescale Y and X by $W_1$ and $W_2$, as you can see, the intercept is multiplied by $W_1$: it becomes $W_1\hat\beta_1$.
The standard error of $\hat\beta_1^*$ is $W_1$ times the standard error of $\hat\beta_1$; so both $\hat\beta_1$ and its standard error are multiplied by $W_1$.
● $r^2_{xy} = r^2_{x^*y^*}$. That is, whatever the unit of measurement, $r^2$ is not affected.
● $\hat\sigma^{*2} = W_1^2\,\hat\sigma^2$
Now suppose that the X scale has not changed and only the Y scale has changed (here $W_1$ applies to Y and $W_2$ to X). Then the slope, the intercept, and their standard errors are all multiplied by $W_1$.
If instead only the X scale changes ($W_1 = 1$), then the slope and its standard error are multiplied by $\frac{1}{W_2}$.
If $W_1 = W_2$, the slope and its standard error remain unchanged, while the intercept and its standard error are multiplied by $W_1$; $r^2$ is not affected.
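These scaling rules are easy to verify numerically; the sketch below rescales hypothetical data by $W_1$ and $W_2$ (here both 1000, as in the billions-to-millions example) and checks the slope, intercept and $r^2$ relationships.

```python
import numpy as np

rng = np.random.default_rng(3)

def ols(X, Y):
    # Simple OLS with intercept; returns (b1, b2, r2)
    x, y = X - X.mean(), Y - Y.mean()
    b2 = np.sum(x * y) / np.sum(x ** 2)
    b1 = Y.mean() - b2 * X.mean()
    u = Y - b1 - b2 * X
    r2 = 1 - np.sum(u ** 2) / np.sum(y ** 2)
    return b1, b2, r2

X = rng.uniform(1, 10, size=100)
Y = 2 + 3 * X + rng.normal(0, 1, size=100)

W1, W2 = 1000.0, 1000.0   # e.g. billions -> millions for both variables
b1, b2, r2 = ols(X, Y)
b1s, b2s, r2s = ols(W2 * X, W1 * Y)

print(b2s, (W1 / W2) * b2)  # slope scales by W1/W2 (here unchanged)
print(b1s, W1 * b1)         # intercept scales by W1
print(r2, r2s)              # r^2 is unaffected by rescaling
```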
In the previous classes we discussed the simple linear regression model, and our discussion was based on the assumption that the model is linear in parameters, whether or not it is linear in variables. But in the real world we will consider many variations of the simple linear regression model: models that are non-linear in variables, or non-linear in appearance, but that can be transformed into linearity by a suitable transformation. So we will consider models that are non-linear in variables on the one hand, and models that are non-linear in parameters but can be converted into linear-in-parameter models on the other. A model that is non-linear in parameters but can be converted into linearity by a suitable transformation is known as an intrinsically linear model. In the real world we cannot expect everything to be linear; the world is characterized by various types of non-linearity, and for empirical application we have to accommodate them. So we will consider many cases and also discuss how such models are applied to real economic data. The first model we consider is the log-linear model.
Log-linear models are used to estimate elasticities. To explain, consider the model
$$Y_i = \beta_1 X_i^{\beta_2} e^{u_i}$$
This is known as the exponential regression model. By taking logs we can transform it into a linear model. (The base of the common log is 10; the base of the natural log is $e = 2.718$.)
Properties of logs:
$\ln(ab) = \ln a + \ln b$
$\ln\frac{a}{b} = \ln a - \ln b$
$\ln a^k = k\ln a$
$\ln 1 = 0$
$\ln e^a = a$
Writing $\alpha = \ln\beta_1$, the model becomes
$$\ln Y_i = \alpha + \beta_2 \ln X_i + u_i$$
This model is linear in the parameters $\alpha$ and $\beta_2$, and linear in the logarithms of the variables Y and X. Since both the dependent and the independent variable are in log form, it is called a log-log (double-log) model. It is a linear model; the only difference is that Y and X are expressed in log form.
If we apply OLS, not by regressing Y on X but by regressing $\ln Y$ on $\ln X$, we obtain the estimators $\hat\alpha$ and $\hat\beta_2$. The model is popular because it gives a direct estimate of the elasticity: the slope coefficient $\beta_2$ is the elasticity of Y with respect to X.
If $Y = \ln X$, then
$$\frac{d\ln X}{dX} = \frac{1}{X}, \qquad\text{or}\qquad d\ln X = \frac{dX}{X}$$
So the derivative of a log variable equals the proportionate change in that variable. That is,
$$\frac{X_t - X_{t-1}}{X_{t-1}} = \frac{X_t}{X_{t-1}} - 1$$
is the relative change, and
$$\frac{X_t - X_{t-1}}{X_{t-1}} \times 100$$
is the percentage change.
To measure change we generally use the proportionate change or the percentage change, because the absolute change depends on the unit of measurement whereas the proportionate change is unit-free. The proportionate change is therefore used as the measure of change in economics rather than the absolute change.
Suppose, for example, that Y is quantity demanded and X is price. If you estimate the model after converting Y into $\ln Y$ and X into $\ln X$, you will get $\hat\beta_2$, and $\hat\beta_2$ is the (price) elasticity. If you simply regress Y on X, you get only the slope of the regression line: $\hat\beta_2$ then measures the change in Y per unit change in X, i.e. if price increases by one unit, quantity demanded decreases by $\beta_2$ units.
In this model,
$$\beta_2 = \frac{\Delta Y / Y}{\Delta X / X} = \frac{\Delta Y}{Y}\times\frac{X}{\Delta X} = \frac{\Delta Y}{\Delta X}\times\frac{X}{Y}$$
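A minimal sketch of the double-log regression on hypothetical demand data: the slope of $\ln Y$ on $\ln X$ recovers the elasticity directly, and the antilog of the intercept gives the estimate of $\beta_1$ whose bias is noted below.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical demand data generated from Y = b1 * X^b2 * exp(u)
beta1, beta2 = 100.0, -0.7           # true constant and elasticity
X = rng.uniform(1, 20, size=200)     # price
Y = beta1 * X ** beta2 * np.exp(rng.normal(0, 0.1, size=200))

# Regress ln(Y) on ln(X): the slope is a direct estimate of the elasticity
lx, ly = np.log(X), np.log(Y)
b2_hat = np.sum((lx - lx.mean()) * (ly - ly.mean())) / np.sum((lx - lx.mean()) ** 2)
alpha_hat = ly.mean() - b2_hat * lx.mean()   # estimate of ln(beta1)

print(b2_hat)             # close to -0.7: a 1% price rise cuts demand ~0.7%
print(np.exp(alpha_hat))  # antilog of alpha: a (slightly biased) estimate of beta1
```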
Now let us see how these equations look in the form of a graph. Suppose that quantity demanded is a function of price, $Q = f(P)$: with a negative elasticity the log-log demand curve slopes downward, so as price increases, quantity demanded decreases.
● Estimating $\ln Y_i = \alpha + \beta_2 \ln X_i + u_i$ gives $\hat\alpha$ and $\hat\beta_2$.
● $\hat\alpha$ is an unbiased estimator of $\alpha$ ($E(\hat\alpha) = \alpha$), and $\hat\beta_2$ is an unbiased estimator of $\beta_2$. But $\beta_1$, the parameter entering the original model, is estimated as $\hat\beta_1 = \text{antilog}(\hat\alpha)$, and this is a biased estimator. This is not a big problem in most practical applications.
Now consider PCE (personal consumption expenditure) on personal computers and expenditure on durable goods: we want to know how expenditure on durables changes when personal computer expenditure changes. For example, if a value rises from 6 to 8, the percentage change is $\frac{8-6}{6} \approx 33\%$.
Functional forms of Regression models, Semi log model
Now we consider the Semi-Log models, there are two types of Semi-Log models; namely log lin
model and lin log model. Log lin means Y is in log form and X is not, and lin log means X is in
log form and Y is not. So in this session we discuss the first model, the log lin model.
Log-lin models are extensively used to measure growth rates of economic phenomena such as population, GNP, money supply, and employment.
To explain this model, suppose that we want to estimate the growth rate of consumption expenditure on services. Let $Y_t$ be the real expenditure on services at time t (the same setup works for the growth of population, employment, money supply, and so on).
$$Y_t = Y_0(1+r)^t$$
where $Y_t$ is the value of Y at time t and $Y_0$ is the initial value. Taking logs,
$$\ln Y_t = \ln Y_0 + t\ln(1+r)$$
Writing $\beta_1 = \ln Y_0$ and $\beta_2 = \ln(1+r)$, and adding a disturbance term,
$$\ln Y_t = \beta_1 + \beta_2 t + u_t$$
This model, like any other linear model, is linear in the parameters $\beta_1$ and $\beta_2$. The dependent variable is $\ln Y_t$, and the independent variable is t. Suppose you have data on expenditure on services from 2000 to 2015 as Y; you construct the time variable t as 1, 2, ... and regress $\ln Y_t$ on t. That gives you $\hat\beta_2$. This model is called a semi-log model because only one variable is in log form; descriptively, we call it a log-lin model, because the dependent variable is in log form.
$$\beta_2 = \frac{\Delta \ln Y_t}{\Delta t} = \frac{\Delta Y / Y}{\Delta t}$$
The slope coefficient thus measures the relative change in Y (the regressand) divided by the absolute change in the regressor (here, t); that is, it measures the constant proportional change in Y for a given absolute change in the regressor.
If we multiply the relative change in Y by 100 ($\frac{\Delta \ln Y_t}{\Delta t} \times 100$), we get the percentage change, or growth rate, in Y (growth rate of population, growth rate of money supply, etc.). So $100\times\beta_2$ is the growth rate; in the literature, $100\times\beta_2$ is known as the semi-elasticity of Y with respect to X.
In this model,
$$\beta_2 = \frac{\Delta Y / Y}{\Delta X} = \frac{1}{Y}\cdot\frac{\Delta Y}{\Delta X}$$
So, what is the slope? The slope is
$$\frac{\Delta Y}{\Delta X} = \beta_2\,Y$$
What is the elasticity?
$$\text{Elasticity} = \frac{\Delta Y}{\Delta X}\cdot\frac{X}{Y} = \beta_2\,X$$
Both the slope and the elasticity change from point to point, because they depend on Y and X respectively. In the estimated example, which is based on quarterly data, the quarterly growth rate is 0.705%, so the annual growth rate is roughly $4 \times 0.705 \approx 2.82\%$.
Now, in the literature, a distinction is always made between the compound growth rate and the instantaneous growth rate. The coefficient $\beta_2$ that we estimated is the instantaneous growth rate, i.e. at a point in time. It is different from the compound growth rate, which is measured over a period of time (say one month, six months, one year, etc.).
Now, if you want to calculate the compound growth rate, the procedure is:
1. Take the antilog of the instantaneous $\hat\beta_2$ (since $\beta_2 = \ln(1+r)$); that gives you $(1+r)$.
2. Subtract 1; that gives you $r$.
3. Multiply by 100; $r \times 100$ is the compound growth rate.
This compound growth rate will be higher than the instantaneous growth rate because of the compounding effect. When you estimate the model, you get $\hat\beta_2$, the instantaneous growth rate; if you want the compound growth rate, follow the steps above.
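A two-line check of that distinction, using the quarterly figure from the example ($\hat\beta_2 = 0.00705$, i.e. 0.705% per quarter):

```python
import numpy as np

beta2_hat = 0.00705                       # quarterly instantaneous rate from the example

instantaneous = 100 * beta2_hat           # 0.705% per quarter
compound = 100 * (np.exp(beta2_hat) - 1)  # antilog, minus 1, times 100

print(instantaneous)       # 0.705
print(compound)            # slightly higher, due to compounding
print(4 * instantaneous)   # rough annualized instantaneous rate: ~2.82%
```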
Suppose you regress $Y_t = \beta_1 + \beta_2 t + u_t$, where Y is in its original units and t is time: you simply regress Y on t and obtain $\hat\beta_2$. This is an absolute change; it tells us by how many units Y changes when time changes by one unit. This model is known as the linear trend model. It is not generally preferred, because what is usually more interesting is the relative change or growth rate rather than the absolute change. For example, if the value of the money supply increases from $X_1$ to $X_2$, the model says that change is $\beta_2$ per year (if you are using yearly data).
Let us consider one more example: a model of wage and education, where the wage is measured in dollars per hour and education in years. As you know, if you estimate this model, the estimated equation is
$$\widehat{\text{wage}} = -0.90 + 0.54\,\text{education}$$
It says that as education increases by one year, the hourly wage increases by 0.54 dollars, whether education increases from 5 to 6 years or from 18 to 19. But this is unrealistic in most cases: if education increases from 5 to 6, the wage might increase by only 0.50 dollars, while an increase from 18 to 19 might raise it by 5 dollars. The linear model forces the increase in the wage to be uniform at every level of education, whereas in reality the starting level of education matters. What is more meaningful is an estimate of the percentage change, so we convert the wage into log form.
When you convert the wage into logs, the interpretation becomes: as education increases by one year, the wage changes by $100\beta_2$ percent (say 5%); the percentage change is constant, but the absolute change increases with the wage level. So for a wage equation of the form
$$\log Y = \beta_1 + \beta_2 X + u$$
an estimate such as $\hat\beta_2 = 0.083$ (8.3%) is the growth rate of the wage for each additional year of education, and the curve representing wage against education rises accordingly. This is known as the log-lin model.
When $\beta_2 > 0$ (as we already considered), the curve rises at an increasing rate; when $\beta_2 < 0$, it declines toward zero.
Functional forms of Regression models, Lin log and reciprocal models
Another model we consider is the lin-log model. Unlike the log-lin model, here the independent variable is in log form. We measure what is known as the absolute change in Y for a given proportionate change in X; that is, our concern is to estimate the absolute change in Y when X changes by some percentage.
The slope coefficient is
$$\beta_2 = \frac{\Delta Y}{\Delta \ln X} = \frac{\Delta Y}{\Delta X / X}$$
the absolute change in Y for a given proportionate change in X. The slope of this model is therefore
$$\frac{\Delta Y}{\Delta X} = \beta_2 \times \frac{1}{X}$$
and the elasticity is $\beta_2 \times \dfrac{1}{Y}$.
This equation states that the absolute change in Y equals $\beta_2$ times the relative change in X. As an example, if $\frac{\Delta X}{X} = 0.01$ (1%), the absolute change in Y is $\Delta Y = 0.01\,\beta_2$; it means a 1% change in X causes a change of $0.01\,\beta_2$ in Y. If $\beta_2 = 500$, then $\Delta Y = 0.01 \times 500 = 5$, which means that if there is a 1% change in X, the absolute change in Y is 5. So it is very important to remember that once this model is estimated, you must multiply the slope $\hat\beta_2$ by 0.01 (or divide by 100) to get the effect of a 1% change in X.
As you can see, the slope of this model is $\beta_2 \times \frac{1}{X}$, so:
1. When $\beta_2 > 0$, the slope is positive but decreases as X increases.
2. When $\beta_2 < 0$, the slope is negative, but it decreases in absolute terms as X increases.
If you have data on Y and X and you plot them, you may observe such a pattern and experiment with different functional forms. What is the application of a model like this? An important application is Engel's law of consumption (the Engel expenditure function). If your model is food expenditure $= \beta_1 + \beta_2\ln(\text{income}) + u$, plotting it shows that as income increases, food expenditure increases at a diminishing rate.
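A minimal sketch of the lin-log (Engel-type) regression on hypothetical income-food data; the only change from ordinary OLS is that the regressor is $\ln(\text{income})$, and $0.01\hat\beta_2$ gives the effect of a 1% income rise.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical Engel-curve data: food spending rises with log income
beta1, beta2 = -100.0, 80.0
income = rng.uniform(500, 5000, size=300)
food = beta1 + beta2 * np.log(income) + rng.normal(0, 5, size=300)

# Lin-log regression: regress Y on ln(X)
lx = np.log(income)
b2 = np.sum((lx - lx.mean()) * (food - food.mean())) / np.sum((lx - lx.mean()) ** 2)
b1 = food.mean() - b2 * lx.mean()

# A 1% rise in income raises food spending by roughly 0.01 * b2 units
print(b2, 0.01 * b2)
```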
Another model we consider is the reciprocal model. Suppose we have a model like
$$Y_i = \beta_1 + \beta_2\frac{1}{X_i} + u_i$$
This is an example of a reciprocal model. As you can see, it is non-linear in X, because X appears as $\frac{1}{X_i}$ rather than as X, but it is linear in parameters, so you can estimate it. If you have data on X and Y, the dependent variable is Y and the independent variable is $\frac{1}{X_i}$; you regress Y on $\frac{1}{X}$ and obtain the estimators $\hat\beta_1$ and $\hat\beta_2$. Since $\beta_1$ and $\beta_2$ enter linearly, you can estimate this model using OLS; the only difference is that X appears in the model reciprocally.
1. As X increases indefinitely, $\beta_2\frac{1}{X_i}$ approaches 0, and naturally Y approaches the limiting value $\beta_1$, which is known as the asymptotic value. So built into such a model is an asymptote.
Here, if $\beta_1 > 0$ and $\beta_2 > 0$: as X increases indefinitely, Y approaches the asymptotic value $\beta_1$, which is positive.
Here, if $\beta_1 < 0$ and $\beta_2 > 0$: as X increases indefinitely, Y approaches the asymptotic value $\beta_1$, but now $\beta_1$ is negative.
If $\beta_2$ is positive, the curve decreases throughout. To see this, find the slope: $Y_i = \beta_1 + \beta_2\frac{1}{X_i} + u_i$ can be written as $Y_i = \beta_1 + \beta_2 X_i^{-1} + u_i$, so the slope is
$$\frac{\Delta Y}{\Delta X} = -\beta_2\left(\frac{1}{X^2}\right)$$
(negative whenever $\beta_2 > 0$). The elasticity is
$$\frac{\Delta Y}{\Delta X}\times\frac{X}{Y} = -\beta_2\left(\frac{1}{X^2}\right)\times\frac{X}{Y} = -\beta_2\times\frac{1}{XY}$$
Remember that one celebrated application of the reciprocal model is the Phillips curve in macroeconomics; the point is that as X increases indefinitely, Y approaches a limiting value. The model can be used in various contexts.
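A minimal sketch of the reciprocal model on hypothetical Phillips-curve-style data: since the model is linear in parameters, OLS on the constructed regressor $Z = 1/X$ suffices, and $\hat\beta_1$ estimates the asymptote.

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical data: Y falls toward an asymptote beta1 as X grows
beta1, beta2 = 2.0, 8.0
X = rng.uniform(1, 15, size=200)                       # e.g. unemployment rate
Y = beta1 + beta2 / X + rng.normal(0, 0.3, size=200)   # e.g. wage inflation

# The model is linear in parameters: regress Y on Z = 1/X
Z = 1.0 / X
b2 = np.sum((Z - Z.mean()) * (Y - Y.mean())) / np.sum((Z - Z.mean()) ** 2)
b1 = Y.mean() - b2 * Z.mean()

print(b1, b2)  # b1 estimates the asymptote that Y approaches as X grows
```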
In addition to these models, we can also consider various powers of X, say $X^2$, $X^3$, etc., to estimate quadratic and cubic functions; such models will be discussed in a later session. Here are a few more points on the functional form of the log-log model (already discussed in the previous session).
If we are considering a simple linear regression model, the slope is
$$\beta_2 = \frac{\Delta Y}{\Delta X}$$
so you get an estimate of the slope only. The elasticity formula is $\frac{\Delta Q}{\Delta P}\times\frac{P}{Q}$, or in our notation $\frac{\Delta Y}{\Delta X}\times\frac{X}{Y}$. So if you estimate a linear equation, the elasticity can be calculated as
$$\hat\beta_2\times\frac{X}{Y} \qquad\text{or}\qquad \hat\beta_2\times\frac{P}{Q}$$
You can calculate the elasticity by plugging in values of X and Y, but the problem is which X and which Y to use. The solution is to take the average X and the average Y; that gives you an estimate of the average elasticity.
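A short sketch of that recipe on hypothetical price-quantity data: estimate the linear slope by OLS, then evaluate $\hat\beta_2\times\bar X/\bar Y$ to get the elasticity at the means.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical linear demand data
P = rng.uniform(1, 10, size=150)               # price
Q = 100 - 4 * P + rng.normal(0, 2, size=150)   # quantity demanded

# OLS slope and intercept
p, q = P - P.mean(), Q - Q.mean()
b2 = np.sum(p * q) / np.sum(p ** 2)
b1 = Q.mean() - b2 * P.mean()

# Average elasticity: evaluate beta2 * (X / Y) at the sample means
elasticity_at_means = b2 * P.mean() / Q.mean()
print(b2, elasticity_at_means)
```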