
Ordinary Least Squares

The Method of Ordinary Least Squares:


Let us consider the two-variable PRF:
Y_i = \beta_1 + \beta_2 X_i + \mu_i ………… (1)
However, the PRF is not directly observable. We estimate it from the SRF:
Y_i = \hat{\beta}_1 + \hat{\beta}_2 X_i + \hat{\mu}_i
Y_i = \hat{Y}_i + \hat{\mu}_i
where \hat{Y}_i is the estimated value of Y_i.
\hat{\mu}_i = Y_i - \hat{Y}_i = Y_i - \hat{\beta}_1 - \hat{\beta}_2 X_i ………… (2)
Squaring and summing, we get
\sum \hat{\mu}_i^2 = \sum (Y_i - \hat{\beta}_1 - \hat{\beta}_2 X_i)^2 ………… (3)
Now differentiating (3) partially with respect to \hat{\beta}_1 and \hat{\beta}_2, we obtain
\frac{\partial}{\partial \hat{\beta}_1} \sum \hat{\mu}_i^2 = -2 \sum (Y_i - \hat{\beta}_1 - \hat{\beta}_2 X_i) ………… (4)
\frac{\partial}{\partial \hat{\beta}_2} \sum \hat{\mu}_i^2 = -2 \sum (Y_i - \hat{\beta}_1 - \hat{\beta}_2 X_i) X_i ………… (5)

Now setting (4) = 0 we get
-2 \sum (Y_i - \hat{\beta}_1 - \hat{\beta}_2 X_i) = 0
\Rightarrow \sum (Y_i - \hat{\beta}_1 - \hat{\beta}_2 X_i) = 0
\Rightarrow \sum Y_i - n\hat{\beta}_1 - \hat{\beta}_2 \sum X_i = 0
\Rightarrow \hat{\beta}_1 = \bar{Y} - \hat{\beta}_2 \bar{X} ………… (6)
Setting (5) = 0, we get
-2 \sum (Y_i - \hat{\beta}_1 - \hat{\beta}_2 X_i) X_i = 0
\Rightarrow \sum (Y_i - \hat{\beta}_1 - \hat{\beta}_2 X_i) X_i = 0
\Rightarrow \sum Y_i X_i - \hat{\beta}_1 \sum X_i - \hat{\beta}_2 \sum X_i^2 = 0
\Rightarrow \sum Y_i X_i - (\bar{Y} - \hat{\beta}_2 \bar{X}) \sum X_i - \hat{\beta}_2 \sum X_i^2 = 0
\Rightarrow \sum Y_i X_i - \bar{Y} \sum X_i + \hat{\beta}_2 \bar{X} \sum X_i - \hat{\beta}_2 \sum X_i^2 = 0
\Rightarrow \sum Y_i X_i - \frac{\sum X_i \sum Y_i}{n} = \hat{\beta}_2 \left( \sum X_i^2 - \frac{(\sum X_i)^2}{n} \right)
\Rightarrow \hat{\beta}_2 = \frac{\sum Y_i X_i - \frac{\sum X_i \sum Y_i}{n}}{\sum X_i^2 - \frac{(\sum X_i)^2}{n}} ………… (7)
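To make the closed-form expressions concrete, here is a minimal NumPy sketch that evaluates equations (6) and (7) on a small made-up data set (both the data and the variable names are illustrative, not taken from the text):

```python
import numpy as np

# Illustrative data only (not from the text)
X = np.array([80, 100, 120, 140, 160, 180, 200, 220, 240, 260], dtype=float)
Y = np.array([70,  65,  90,  95, 110, 115, 120, 140, 155, 150], dtype=float)
n = len(X)

# Equation (7): slope estimator
beta2_hat = (np.sum(X * Y) - np.sum(X) * np.sum(Y) / n) / (np.sum(X**2) - np.sum(X)**2 / n)

# Equation (6): intercept estimator, beta1_hat = Ybar - beta2_hat * Xbar
beta1_hat = Y.mean() - beta2_hat * X.mean()

print(beta1_hat, beta2_hat)
```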
Numerical properties of the estimators:
1. The OLS estimators are expressed solely in terms of the observable quantities.
Therefore, they can be easily computed.
2. They are point estimators
3. Once the OLS estimates are obtained from the sample data, the regression line can be
easily obtained. The regression line thus obtained has the following properties:
i. It passes through the sample means of Y and X
ii. The mean value of the estimated Y, \hat{Y}_i, is equal to the mean value of the actual Y, for
\hat{Y}_i = \hat{\beta}_1 + \hat{\beta}_2 X_i
= (\bar{Y} - \hat{\beta}_2 \bar{X}) + \hat{\beta}_2 X_i
= \bar{Y} + \hat{\beta}_2 (X_i - \bar{X})
Summing both sides of this last equality over the sample values and dividing through by n gives
\bar{\hat{Y}} = \bar{Y}
iii. The mean value of the residuals \hat{\mu}_i is equal to zero.
iv. The residuals \hat{\mu}_i are uncorrelated with the predicted \hat{Y}_i.
v. The residuals \hat{\mu}_i are uncorrelated with X_i.
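The regression-line properties above are easy to verify numerically; a short check reusing the same illustrative data as in the previous sketch:

```python
import numpy as np

X = np.array([80, 100, 120, 140, 160, 180, 200, 220, 240, 260], dtype=float)
Y = np.array([70,  65,  90,  95, 110, 115, 120, 140, 155, 150], dtype=float)
n = len(X)
beta2_hat = (np.sum(X * Y) - np.sum(X) * np.sum(Y) / n) / (np.sum(X**2) - np.sum(X)**2 / n)
beta1_hat = Y.mean() - beta2_hat * X.mean()

Y_hat = beta1_hat + beta2_hat * X      # fitted values
u_hat = Y - Y_hat                      # residuals

print(np.isclose(Y_hat.mean(), Y.mean()))      # property ii: mean of fitted Y equals mean of actual Y
print(np.isclose(u_hat.mean(), 0.0))           # property iii: residuals have zero mean
print(np.isclose(np.sum(u_hat * Y_hat), 0.0))  # property iv: residuals uncorrelated with fitted Y
print(np.isclose(np.sum(u_hat * X), 0.0))      # property v: residuals uncorrelated with X
```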

Gauss-Markov Theorem:
Given the assumptions of the classical linear regression model, the least-squares estimators, in the class of linear unbiased estimators, have minimum variance; that is, they are BLUE.
Proof of the Gauss-Markov Theorem:
Linearity:
Let us consider the regression model
Y_i = \beta_1 + \beta_2 X_i + \mu_i ………… (1)
The sample regression equation corresponding to the PRF is
Y_i = \hat{\beta}_1 + \hat{\beta}_2 X_i + \hat{\mu}_i
Now the OLS coefficient estimators are given by the formulas
\hat{\beta}_2 = \frac{\sum Y_i X_i - \frac{\sum X_i \sum Y_i}{n}}{\sum X_i^2 - \frac{(\sum X_i)^2}{n}} ………… (2)
\hat{\beta}_1 = \bar{Y} - \hat{\beta}_2 \bar{X}
From (2), using deviations from the means (x_i = X_i - \bar{X}, y_i = Y_i - \bar{Y}), we can write
\hat{\beta}_2 = \frac{\sum y_i x_i}{\sum x_i^2}
= \frac{\sum (Y_i - \bar{Y}) x_i}{\sum x_i^2}
= \frac{\sum Y_i x_i}{\sum x_i^2} - \frac{\bar{Y} \sum x_i}{\sum x_i^2}
= \frac{\sum Y_i x_i}{\sum x_i^2} \quad (\text{since } \sum x_i = 0)
= \sum k_i Y_i ………… (3)
where k_i = \frac{x_i}{\sum x_i^2}. Thus the OLS estimator \hat{\beta}_2 is a linear function of the sample values Y_i.
Some properties of k_i:
i. \sum k_i = \sum \frac{x_i}{\sum x_i^2} = 0, since \sum x_i = 0
ii. \sum k_i^2 = \sum \left( \frac{x_i}{\sum x_i^2} \right)^2 = \frac{\sum x_i^2}{(\sum x_i^2)^2} = \frac{1}{\sum x_i^2}
iii. \sum k_i x_i = \sum k_i (X_i - \bar{X}) = \sum k_i X_i - \bar{X} \sum k_i = \sum k_i X_i
iv. \sum k_i x_i = \frac{\sum x_i^2}{\sum x_i^2} = 1, and hence \sum k_i X_i = 1
Unbiasedness of \hat{\beta}_2:
We know that
\hat{\beta}_2 = \sum k_i Y_i
= \sum k_i (\beta_1 + \beta_2 X_i + \mu_i) (from (1))
= \beta_1 \sum k_i + \beta_2 \sum k_i X_i + \sum k_i \mu_i
= \beta_2 + \sum k_i \mu_i (using \sum k_i = 0 and \sum k_i X_i = 1)
Taking expectations on both sides, and noting that the k_i are nonstochastic and E(\mu_i) = 0,
E(\hat{\beta}_2) = \beta_2 + \sum k_i E(\mu_i) = \beta_2

Unbiasedness of \hat{\beta}_1:
Y_i = \beta_1 + \beta_2 X_i + \mu_i
\Rightarrow \sum Y_i = n\beta_1 + \beta_2 \sum X_i + \sum \mu_i
Dividing through by n,
\Rightarrow \bar{Y} = \beta_1 + \beta_2 \bar{X} + \bar{\mu} ………… (2)
Substituting (2) into the formula for \hat{\beta}_1, we get
\hat{\beta}_1 = \bar{Y} - \hat{\beta}_2 \bar{X}
= \beta_1 + \beta_2 \bar{X} + \bar{\mu} - \hat{\beta}_2 \bar{X}
= \beta_1 + (\beta_2 - \hat{\beta}_2) \bar{X} + \bar{\mu}
Taking expectations, and using E(\hat{\beta}_2) = \beta_2 and E(\bar{\mu}) = 0,
E(\hat{\beta}_1) = \beta_1

Variance of \hat{\beta}_2:
We know that
V(\hat{\beta}_2) = E[\hat{\beta}_2 - E(\hat{\beta}_2)]^2
= E[\hat{\beta}_2 - \beta_2]^2
= E[\sum k_i \mu_i]^2
= E(k_1 \mu_1 + k_2 \mu_2 + \ldots + k_n \mu_n)^2
= E(k_1^2 \mu_1^2 + \ldots + k_n^2 \mu_n^2 + 2 k_1 k_2 \mu_1 \mu_2 + \ldots + 2 k_{n-1} k_n \mu_{n-1} \mu_n)
Since by assumption E(\mu_i^2) = \sigma^2 for each i and E(\mu_i \mu_j) = 0, i \neq j, it follows that
V(\hat{\beta}_2) = \sigma^2 \sum k_i^2 = \frac{\sigma^2}{\sum x_i^2}

Variance of \hat{\beta}_1:
V(\hat{\beta}_1) = V(\bar{Y} - \hat{\beta}_2 \bar{X}) = V(-\hat{\beta}_2 \bar{X} + \bar{Y}) = \bar{X}^2 V(\hat{\beta}_2) = \frac{\bar{X}^2 \sigma^2}{\sum x_i^2}
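Assuming the residual-based estimator \hat{\sigma}^2 = \sum \hat{\mu}_i^2 / (n - 2), which these notes use later as the residual variance, the variance formulas can be evaluated directly; a minimal sketch on the same illustrative data:

```python
import numpy as np

X = np.array([80, 100, 120, 140, 160, 180, 200, 220, 240, 260], dtype=float)
Y = np.array([70,  65,  90,  95, 110, 115, 120, 140, 155, 150], dtype=float)
n = len(X)

x = X - X.mean()                              # deviations from the mean
beta2_hat = np.sum(x * Y) / np.sum(x**2)      # sum(k_i * Y_i) form of the slope
beta1_hat = Y.mean() - beta2_hat * X.mean()

u_hat = Y - beta1_hat - beta2_hat * X
sigma2_hat = np.sum(u_hat**2) / (n - 2)       # estimate of sigma^2

var_beta2 = sigma2_hat / np.sum(x**2)                  # sigma^2 / sum x_i^2
var_beta1 = sigma2_hat * X.mean()**2 / np.sum(x**2)    # Xbar^2 * sigma^2 / sum x_i^2, as displayed above
# (many textbooks report var(beta1_hat) = sigma^2 * sum X_i^2 / (n * sum x_i^2),
#  which adds a sigma^2/n term to account for the sampling variation in Ybar)

print(np.sqrt(var_beta2), np.sqrt(var_beta1))          # standard errors of the two estimators
```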

Minimum variance property:

We know that
\hat{\beta}_2 = \sum k_i Y_i
which shows that \hat{\beta}_2 is a weighted average of the Y's, with the k_i serving as the weights.
Let us define an alternative linear estimator of \beta_2 as follows:
\beta_2^* = \sum w_i Y_i
where the w_i are also weights, not necessarily equal to the k_i. Now
E(\beta_2^*) = \sum w_i E(Y_i)
= \sum w_i (\beta_1 + \beta_2 X_i)
= \beta_1 \sum w_i + \beta_2 \sum w_i X_i
Therefore, for \beta_2^* to be unbiased, we must have
\sum w_i = 0 \quad \text{and} \quad \sum w_i X_i = 1

Also, we may write
V(\beta_2^*) = V(\sum w_i Y_i)
= \sum w_i^2 V(Y_i)
= \sigma^2 \sum w_i^2
= \sigma^2 \sum \left( w_i - \frac{x_i}{\sum x_i^2} + \frac{x_i}{\sum x_i^2} \right)^2
= \sigma^2 \sum \left( w_i - \frac{x_i}{\sum x_i^2} \right)^2 + \sigma^2 \sum \left( \frac{x_i}{\sum x_i^2} \right)^2 + 2\sigma^2 \sum \left( w_i - \frac{x_i}{\sum x_i^2} \right) \frac{x_i}{\sum x_i^2}
The cross-product term vanishes, because \sum w_i x_i = \sum w_i X_i - \bar{X} \sum w_i = 1 and \sum x_i^2 / (\sum x_i^2)^2 = 1 / \sum x_i^2, so the two pieces of the cross term cancel. Hence
V(\beta_2^*) = \sigma^2 \sum \left( w_i - \frac{x_i}{\sum x_i^2} \right)^2 + \sigma^2 \left( \frac{1}{\sum x_i^2} \right)
V(\beta_2^*) = \sigma^2 \sum \left( w_i - \frac{x_i}{\sum x_i^2} \right)^2 + V(\hat{\beta}_2)
Since the first term on the right is non-negative,
V(\beta_2^*) \geq V(\hat{\beta}_2)
with equality only when w_i = k_i, that is, when \beta_2^* coincides with the OLS estimator.

• Covariance of \hat{\beta}_2 and \hat{\beta}_1
• The least-squares estimator of \sigma^2

The Probability Distribution of the Disturbances:

We know that
\hat{\beta}_2 = \sum k_i Y_i
Since the X's are assumed to be fixed, or nonstochastic, and since Y_i = \beta_1 + \beta_2 X_i + \mu_i, we can write
\hat{\beta}_2 = \sum k_i (\beta_1 + \beta_2 X_i + \mu_i)
Because the k_i, the betas, and the X's are all fixed, \hat{\beta}_2 is ultimately a linear function of the random variable \mu_i, which is random by assumption. Therefore, the probability distribution of \hat{\beta}_2 will depend on the assumption made about the probability distribution of \mu_i.


The Normality Assumption for \mu_i:
The classical linear regression model assumes that each \mu_i is distributed normally with
E(\mu_i) = 0, \quad V(\mu_i) = \sigma^2, \quad \operatorname{Cov}(\mu_i, \mu_j) = 0, \; i \neq j
These assumptions can be stated more compactly as
\mu_i \sim N(0, \sigma^2)
Why the Normality Assumption?
1. \mu_i represents the combined influence of a large number of independent variables that are not explicitly introduced in the regression model. As noted, we hope that the influence of these omitted variables is small and at best random. By the central limit theorem (CLT) of statistics, if there are a large number of independent and identically distributed random variables, then, with few exceptions, the distribution of their sum tends to a normal distribution as the number of such variables increases indefinitely. It is the CLT that provides a theoretical justification for the assumption of normality of \mu_i.
2. A variant of the CLT states that, even if the number of variables is not very large or if these variables are not strictly independent, their sum may still be normally distributed.
3. With the normality assumption, the probability distributions of the OLS estimators can be easily derived because one property of the normal distribution is that any linear function of normally distributed variables is itself normally distributed.
4. The normal distribution is a simple distribution, involving only two parameters, and its properties have been extensively studied in mathematical statistics. Besides, many phenomena seem to follow the normal distribution.
5. If we are dealing with a small, or finite, sample size, say fewer than 100 observations, the normality assumption assumes a critical role. It not only helps us derive the exact probability distributions of the OLS estimators but also enables us to use the t, F, and \chi^2 probability distributions.
6. Finally, in large samples, t and F statistics have approximately the t and F distributions, so that the t and F tests that are based on the assumption that the error term is normally distributed can still be applied validly.
Properties of OLS estimators under the Normality Assumption:
1. They are unbiased.
2. They have minimum variance.
3. They are consistent.
4. \hat{\beta}_1 \sim N(\beta_1, \sigma_{\hat{\beta}_1}^2); then, by the standard normal transformation, the variable
Z = \frac{\hat{\beta}_1 - \beta_1}{\sigma_{\hat{\beta}_1}} \sim N(0, 1)
5. \hat{\beta}_2 is also normally distributed, with
Z = \frac{\hat{\beta}_2 - \beta_2}{\sigma_{\hat{\beta}_2}} \sim N(0, 1)
Functional Forms of Regression Models:
We are concerned with models that are linear in the parameters; they may or may not be linear in the variables. In this section we consider some commonly used regression models that may be nonlinear in the variables but are linear in the parameters, or that can be made so by suitable transformations of the variables. In particular, we discuss the following regression models:
1. The log-linear model
2. Semilog models
3. Reciprocal models
4. The logarithmic reciprocal model
The Log-Linear Model:
Consider the following regression model, known as the exponential regression model:
Y_i = \beta_1 X_i^{\beta_2} e^{\mu_i}
which may be expressed alternatively as
\ln Y_i = \ln \beta_1 + \beta_2 \ln X_i + \mu_i
It can be written as
\ln Y_i = \alpha + \beta_2 \ln X_i + \mu_i
where \alpha = \ln \beta_1. This model is linear in the parameters, linear in the logarithms of the variables Y and X, and can be estimated by OLS regression. Because of this linearity, such models are called log-log, double-log, or log-linear models.
The slope coefficient \beta_2 of this model measures the elasticity of Y with respect to X, that is, the percentage change in Y for a given (small) percentage change in X.
Two special features:
1. The model assumes that the elasticity coefficient between Y and X, \beta_2, remains constant throughout, hence the alternative name constant elasticity model.
2. Although \hat{\alpha} and \hat{\beta}_2 are unbiased estimators of \alpha and \beta_2, \beta_1 when estimated as \hat{\beta}_1 = \operatorname{antilog}(\hat{\alpha}) is itself a biased estimator.
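A minimal sketch of fitting a log-log (constant elasticity) model; the data are simulated with an assumed true elasticity of 0.8, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: Y = beta1 * X^beta2 * e^mu with beta1 = 2.5 and beta2 = 0.8 (illustrative)
X = rng.uniform(10, 100, size=200)
mu = rng.normal(0, 0.1, size=200)
Y = 2.5 * X**0.8 * np.exp(mu)

# Double-log transformation: ln Y = alpha + beta2 * ln X + mu
lnY, lnX = np.log(Y), np.log(X)
x = lnX - lnX.mean()
beta2_hat = np.sum(x * lnY) / np.sum(x**2)     # estimated elasticity of Y with respect to X
alpha_hat = lnY.mean() - beta2_hat * lnX.mean()

print(beta2_hat)             # close to the assumed elasticity 0.8
print(np.exp(alpha_hat))     # antilog of alpha_hat, a (biased) estimate of beta1
```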
2. Semilog Models: Log-Lin and Lin-Log Models
Log-Lin Models:
Economists, businesspeople, and governments are often interested in finding out the rate of growth of certain economic variables, such as population, GNP, money supply, employment, productivity, and the trade deficit.
Suppose we want to find out the growth rate of private final consumption expenditure on services. Let Y_t denote real expenditure on services at time t and Y_0 the initial value of that expenditure. We may recall the following well-known compound interest formula from an introductory course in economics:
Y_t = Y_0 (1 + r)^t
where r is the compound rate of growth of Y. Taking logs on both sides, we can write
\ln Y_t = \ln Y_0 + t \ln(1 + r)
Now letting
\beta_1 = \ln Y_0, \quad \beta_2 = \ln(1 + r)
we can write
\ln Y_t = \beta_1 + \beta_2 t + \mu_t
This model is like any other linear regression model in that the parameters \beta_1 and \beta_2 are linear. The only difference is that the regressand is the logarithm of Y and the regressor is "time," which takes values of 1, 2, 3, etc.
Models of this type are called semilog models because only one variable appears in logarithmic form. For descriptive purposes, a model in which the regressand is logarithmic is called a log-lin model, and a model in which the regressor is logarithmic is called a lin-log model.
Features:
1. In this model the slope coefficient measures the constant proportional or relative change in Y for a given absolute change in the value of the regressor, that is,
\beta_2 = \frac{\text{relative change in regressand}}{\text{absolute change in regressor}}
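The growth-rate interpretation can be illustrated with a simulated series; the 5% compound rate below is an assumption chosen for the example:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated expenditure series growing at a true compound rate of 5% per period (illustrative)
t = np.arange(1, 41)
Y = 100.0 * 1.05**t * np.exp(rng.normal(0, 0.02, size=t.size))

# Log-lin (growth) model: ln Y_t = beta1 + beta2 * t + mu_t
lnY = np.log(Y)
tt = t - t.mean()
beta2_hat = np.sum(tt * lnY) / np.sum(tt**2)

r_hat = np.exp(beta2_hat) - 1    # since beta2 = ln(1 + r), the compound growth rate is e^beta2 - 1
print(beta2_hat, r_hat)          # beta2_hat is approximately ln(1.05); r_hat is approximately 0.05
```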
The Lin-Log Model:
Unlike the growth model, in which we are interested in finding the percentage growth in Y for an absolute change in X, suppose we now want to find the absolute change in Y for a percentage change in X. A model that can accomplish this purpose can be written as
Y_i = \beta_1 + \beta_2 \ln X_i + \mu_i
For descriptive purposes we call such a model a lin-log model.
Let us interpret the slope coefficient \beta_2. As usual,
\beta_2 = \frac{\text{change in } Y}{\text{change in } \ln X} = \frac{\text{change in } Y}{\text{relative change in } X}
The second step follows from the fact that a change in the log of a number is a relative change. Symbolically, we have
\beta_2 = \frac{\Delta Y}{\Delta X / X}
where, as usual, \Delta denotes a small change. The above equation can be written as
\Delta Y = \beta_2 (\Delta X / X)
This equation states that the absolute change in Y is equal to the slope times the relative change in X. If the latter is multiplied by 100, then the equation gives the absolute change in Y for a percentage change in X. Thus, if \Delta X / X changes by 0.01 unit (or 1%), the absolute change in Y is 0.01\beta_2; if in an application one finds that \beta_2 = 500, the absolute change in Y is (0.01)(500) = 5.0.

The Cobb-Douglas Production Function:

The Cobb-Douglas production function, in its stochastic form, may be expressed as
Y_i = \beta_1 X_{2i}^{\beta_2} X_{3i}^{\beta_3} e^{\mu_i} …………… (1)
where
Y = output, X_2 = labor input, X_3 = capital input, \mu = stochastic disturbance term, e = base of the natural logarithm.
From (1) it is clear that the relationship between output and the two inputs is nonlinear. However, if we log-transform this model, we obtain:
\ln Y_i = \ln \beta_1 + \beta_2 \ln X_{2i} + \beta_3 \ln X_{3i} + \mu_i
= \beta_0 + \beta_2 \ln X_{2i} + \beta_3 \ln X_{3i} + \mu_i
where \beta_0 = \ln \beta_1.
Properties:
1. \beta_2 is the (partial) elasticity of output with respect to the labor input; that is, it measures the percentage change in output for, say, a 1% change in the labor input, holding the capital input constant.

2. \beta_3 is the (partial) elasticity of output with respect to the capital input; that is, it measures the percentage change in output for, say, a 1% change in the capital input, holding the labor input constant.

3. The sum (\beta_2 + \beta_3) gives information about the returns to scale, that is, the response of output to a proportionate change in the inputs. If this sum is 1, there are constant returns to scale: doubling the inputs will double the output, and so on. If the sum is less than 1, there are decreasing returns to scale: doubling the inputs will less than double the output. Finally, if the sum is greater than 1, there are increasing returns to scale: doubling the inputs will more than double the output.
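A sketch of estimating the Cobb-Douglas function by OLS on the log-transformed model; the firm data are simulated with assumed elasticities of 0.6 (labor) and 0.3 (capital):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200

# Simulated inputs and output (illustrative): Y = 1.5 * L^0.6 * K^0.3 * e^mu
L = rng.uniform(50, 500, size=n)       # labor input  (X2)
K = rng.uniform(100, 1000, size=n)     # capital input (X3)
Y = 1.5 * L**0.6 * K**0.3 * np.exp(rng.normal(0, 0.05, size=n))

# Log transformation: ln Y = beta0 + beta2 ln L + beta3 ln K + mu, estimated by OLS
Xmat = np.column_stack([np.ones(n), np.log(L), np.log(K)])
beta0, beta2, beta3 = np.linalg.lstsq(Xmat, np.log(Y), rcond=None)[0]

print(beta2, beta3)      # partial output elasticities of labor and capital
print(beta2 + beta3)     # returns to scale: about 0.9 here, i.e. decreasing returns
```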

Dummy Variable Regression Model:
In regression analysis the dependent variable, or regressand, is frequently influenced not only by ratio scale
variables but also by variables that are essentially qualitative, or nominal scale, in nature, such as sex, race,
color, religion etc.
For example, holding all other factors constant, female workers are found to earn less than their male
counterparts or non-white workers are found to earn less than whites. Whatever the reason, qualitative
variables such as sex and race seem to influence the regressand and clearly should be included among the
explanatory variables.
One way we could "quantify" such attributes is by constructing artificial variables that take on values of 1 or 0, with 1 indicating the presence of that attribute and 0 indicating its absence. For example, 1 may indicate that a person is female and 0 may designate a male.
Variables that assume such 0 and 1 values are called dummy variables. Dummy variables can be incorporated into regression models just as easily as quantitative variables. As a matter of fact, a regression model may contain regressors that are all exclusively dummy, or qualitative, in nature. Such models are called ANOVA models.

ANOVA Models:
Suppose we are given data on average per capita consumption expenditure for 17 states in India for the year 2006-07. The data pertain to consumption expenditure per person per 30 days. These 17 states are classified into three geographical regions: 1. East, 2. North-west-center, and 3. South.
Suppose we want to find out whether the average per capita consumption expenditure (PCE) differs among the three geographical regions of the country. This objective can be accomplished within the framework of regression analysis.
To see this, consider the following model:
Y_i = \beta_1 + \beta_2 D_{2i} + \beta_3 D_{3i} + \mu_i ………… (1)
where Y_i = average consumption expenditure per person per 30 days in state i
D_{2i} = 1 if the state is in the eastern region of India, = 0 otherwise
D_{3i} = 1 if the state is in the north-west-central region of the country, = 0 otherwise

Equation (1) is like a multiple regression model, except that, instead of quantitative regressors, we have only qualitative, or dummy, regressors.
What does model (1) tell us? Assuming that the error term satisfies the usual OLS assumptions, taking expectations on both sides we obtain:
Mean per capita consumption expenditure in the eastern region:
E(Y_i \mid D_{2i} = 1, D_{3i} = 0) = \beta_1 + \beta_2
Mean per capita consumption expenditure in the north-west-central region:
E(Y_i \mid D_{2i} = 0, D_{3i} = 1) = \beta_1 + \beta_3
Mean per capita consumption expenditure in the southern region:
E(Y_i \mid D_{2i} = 0, D_{3i} = 0) = \beta_1
In other words, the mean per capita consumption expenditure in the southern region is given by the intercept \beta_1 in the multiple regression (1), and the slope coefficients \beta_2 and \beta_3 tell by how much the mean per capita consumption expenditure in the east and in the north-west-central region differ from the mean per capita consumption expenditure in the south.
Consider the following results:
\hat{Y}_i = 1097.38 - 241.04 D_{2i} - 30.09 D_{3i}
se = (103.31) (133.37) (129.50)
t = (10.62) (-1.81) (-0.23)
P = (0.00) (0.09) (0.82)
As these regression results show, the mean per capita consumption expenditure in the south is about 1097.38; in the eastern region it is lower by about 241.04, and in the north-west-central region it is lower by about 30.09.
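The 17-state data set referred to above is not reproduced in these notes, so the sketch below uses made-up regional data; it illustrates that OLS on the dummy-variable model (1) returns the benchmark-group (south) mean as the intercept and the group differences as the dummy coefficients:

```python
import numpy as np

rng = np.random.default_rng(3)

# Made-up per capita expenditure figures by region (illustrative only)
region = np.array(["south"] * 6 + ["east"] * 6 + ["nwc"] * 5)
Y = np.concatenate([rng.normal(1100, 80, 6),    # south (benchmark category)
                    rng.normal(860, 80, 6),     # east
                    rng.normal(1070, 80, 5)])   # north-west-central

D2 = (region == "east").astype(float)    # 1 if the state is in the eastern region, 0 otherwise
D3 = (region == "nwc").astype(float)     # 1 if the state is in the north-west-central region, 0 otherwise

Xmat = np.column_stack([np.ones(len(Y)), D2, D3])
b1, b2, b3 = np.linalg.lstsq(Xmat, Y, rcond=None)[0]

print(b1)                  # intercept = mean expenditure of the benchmark (south) group
print(b1 + b2, b1 + b3)    # implied east and north-west-central group means
print(Y[region == "south"].mean(), Y[region == "east"].mean(), Y[region == "nwc"].mean())
```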

Caution in the Use of Dummy Variables:

1. If a qualitative variable has m categories, introduce only (m-1) dummy variables.


2. The category for which no dummy variable is assigned is known as the base, benchmark, control,
comparison, reference, or omitted category.
3. The intercept β 1 represents the mean of the benchmark category.
4. The coefficients attached to the dummy variables are known as the differential intercept coefficients
because they tell by how much the value of the category that receives the value of 1 differs from the
intercept coefficient of the benchmark category.
5. If a qualitative variable has more than one category, the choice of the benchmark category is strictly
up to the researcher.

Multicollinearity
One of the assumptions of the classical linear regression model is that there is no multicollinearity among the regressors included in the regression. In this chapter we take a critical look at this assumption by seeking answers to the following questions:
1. What is the nature of multicollinearity?
2. Is multicollinearity really a problem?
3. What are its consequences?
4. How does one detect it?
5. What remedial measures can be taken to alleviate the problem of multicollinearity?
Definition:
One of the major assumptions of linear multiple regression is that the independent variables should be independent of one another. If this is not the case, the problem that arises is called multicollinearity.
The nature of multicollinearity:
The term multicollinearity is due to Ragnar Frisch. Originally it meant the existence of a perfect, or exact, linear relationship among some or all explanatory variables of a regression model. For the k-variable regression involving the explanatory variables X_1, X_2, \ldots, X_k (where X_1 = 1 for all observations to allow for the intercept term), an exact linear relationship is said to exist if the following condition is satisfied:
\lambda_1 X_1 + \lambda_2 X_2 + \ldots + \lambda_k X_k = 0 …………… (1)
where \lambda_1, \lambda_2, \ldots, \lambda_k are constants such that not all of them are zero simultaneously.
Today, however, the term multicollinearity is used in a broader sense to include the case of perfect multicollinearity, as shown by (1), as well as the case where the X variables are intercorrelated but not perfectly so, as follows:
\lambda_1 X_1 + \lambda_2 X_2 + \ldots + \lambda_k X_k + \upsilon_i = 0 …………… (2)
where \upsilon_i is a stochastic error term.
To see the difference between perfect and less-than-perfect multicollinearity, assume, for example, that \lambda_2 \neq 0. Then (2) can be written as
X_{2i} = -\frac{\lambda_1}{\lambda_2} X_{1i} - \frac{\lambda_3}{\lambda_2} X_{3i} - \ldots - \frac{\lambda_k}{\lambda_2} X_{ki} - \frac{1}{\lambda_2} \upsilon_i
which shows that X_2 is not an exact linear combination of the other X's, because it is also determined by the stochastic error term \upsilon_i.
Sources of Multicollinearity:
1. The data collection method employed. For example, sampling over a limited range of the values taken by the regressors in the population.
2. Constraints on the model or in the population being sampled. For example, in the regression of electricity consumption on income and household size, there is a physical constraint in the population in that families with higher incomes generally have larger homes than families with lower incomes.
3. Model specification. For example, adding polynomial terms to a regression model, especially when the range of the X variable is small.
4. An overdetermined model. This happens when the model has more explanatory variables than the number of observations. This could happen in medical research, where there may be a small number of patients about whom information is collected.
5. An additional reason for multicollinearity, especially in time series data, may be that the regressors included in the model share a common trend, that is, they all increase or decrease over time.

Estimation in the presence of perfect multicollinearity:

Let us consider the three-variable regression model:
Y_i = \beta_1 + \beta_2 X_{2i} + \beta_3 X_{3i} + \mu_i
Using the deviation form, where all the variables are expressed as deviations from their sample means, we can write the three-variable regression model as
y_i = \hat{\beta}_2 x_{2i} + \hat{\beta}_3 x_{3i} + \hat{\mu}_i
Now, according to OLS,
\hat{\beta}_2 = \frac{(\sum y_i x_{2i})(\sum x_{3i}^2) - (\sum y_i x_{3i})(\sum x_{2i} x_{3i})}{(\sum x_{2i}^2)(\sum x_{3i}^2) - (\sum x_{2i} x_{3i})^2} …………… (1)
\hat{\beta}_3 = \frac{(\sum y_i x_{3i})(\sum x_{2i}^2) - (\sum y_i x_{2i})(\sum x_{2i} x_{3i})}{(\sum x_{2i}^2)(\sum x_{3i}^2) - (\sum x_{2i} x_{3i})^2} …………… (2)
Assume X_{3i} = \lambda X_{2i}, where \lambda is a nonzero constant. Substituting this into (1), we obtain
\hat{\beta}_2 = \frac{(\sum y_i x_{2i})(\lambda^2 \sum x_{2i}^2) - (\lambda \sum y_i x_{2i})(\lambda \sum x_{2i}^2)}{(\sum x_{2i}^2)(\lambda^2 \sum x_{2i}^2) - \lambda^2 (\sum x_{2i}^2)^2} = \frac{0}{0}
which is an indeterminate expression. Similarly, it can be verified that \hat{\beta}_3 is also indeterminate.
If there is perfect collinearity, there is no way to estimate the \beta's uniquely. To check this, let us substitute X_{3i} = \lambda X_{2i} into
y_i = \hat{\beta}_2 x_{2i} + \hat{\beta}_3 x_{3i} + \hat{\mu}_i
and we get
y_i = \hat{\beta}_2 x_{2i} + \hat{\beta}_3 (\lambda x_{2i}) + \hat{\mu}_i
= (\hat{\beta}_2 + \lambda \hat{\beta}_3) x_{2i} + \hat{\mu}_i
= \hat{\alpha} x_{2i} + \hat{\mu}_i …………… (4)
where \hat{\alpha} = (\hat{\beta}_2 + \lambda \hat{\beta}_3).
Applying the usual OLS formula to (4), we get
\hat{\alpha} = (\hat{\beta}_2 + \lambda \hat{\beta}_3) = \frac{\sum y_i x_{2i}}{\sum x_{2i}^2}
Although we can estimate \alpha uniquely, there is no way to estimate \beta_2 and \beta_3 uniquely; mathematically, \hat{\alpha} = (\hat{\beta}_2 + \lambda \hat{\beta}_3) gives us only one equation in two unknowns, and there is an infinite number of solutions to this equation.
Practical consequences of multicollinearity:
In cases of near or high multicollinearity, one is likely to encounter the following consequences:

1. Although BLUE, the OLS estimators have large variances and covariances, making precise estimation difficult.
2. Because of consequence 1, the confidence intervals tend to be much wider, leading to the acceptance of the "zero null hypothesis" (i.e., that the true population coefficient is zero) more readily.
3. Also because of consequence 1, the t ratios of one or more coefficients tend to be statistically insignificant.
4. Although the t ratios of one or more coefficients are statistically insignificant, R^2, the overall measure of goodness of fit, can be very high.
5. The OLS estimators and their standard errors can be sensitive to small changes in the data.

Large variances and covariances:

Let us consider the three-variable regression model:
Y_i = \beta_1 + \beta_2 X_{2i} + \beta_3 X_{3i} + \mu_i
The variances and covariances of \hat{\beta}_2 and \hat{\beta}_3 are given by
\operatorname{var}(\hat{\beta}_2) = \frac{\sigma^2}{\sum x_{2i}^2 (1 - r_{23}^2)} …………… (1)
\operatorname{var}(\hat{\beta}_3) = \frac{\sigma^2}{\sum x_{3i}^2 (1 - r_{23}^2)} …………… (2)
\operatorname{cov}(\hat{\beta}_2, \hat{\beta}_3) = \frac{-r_{23} \sigma^2}{(1 - r_{23}^2) \sqrt{\sum x_{2i}^2 \sum x_{3i}^2}} …………… (3)
where r_{23} is the coefficient of correlation between X_2 and X_3.

It is apparent from (1) and (2) that as r_{23} tends toward 1, the variances of the two estimators increase, and when r_{23} = 1, they are infinite. The speed with which the variances and covariances increase can be seen with the variance-inflating factor (VIF), which is defined as
VIF = \frac{1}{1 - r_{23}^2}
Using this definition, we can write (1) and (2) as
\operatorname{var}(\hat{\beta}_2) = \frac{\sigma^2}{\sum x_{2i}^2} VIF
\operatorname{var}(\hat{\beta}_3) = \frac{\sigma^2}{\sum x_{3i}^2} VIF
which shows that the variances of \hat{\beta}_2 and \hat{\beta}_3 are directly proportional to the VIF.
Detection of Multicollinearity:
1. High R^2 but few significant t ratios:
If R^2 is high, say in excess of 0.8, the F test in most cases will reject the hypothesis that the partial slope coefficients are simultaneously equal to zero, but the individual t tests will show that none or very few of the partial slope coefficients are statistically different from zero.
2. High pair-wise correlations among regressors:
Another suggested rule of thumb is that if the pair-wise or zero-order correlation coefficient between two regressors is high, say in excess of 0.8, then multicollinearity is a serious problem. In models involving more than two explanatory variables, however, the simple or zero-order correlations will not provide an infallible guide to the presence of multicollinearity.
3. Examination of the partial correlations:
Because of the problem just mentioned in relying on zero-order correlations, Farrar and Glauber have suggested that one should look at the partial correlation coefficients. Thus, in the regression of Y on X_2, X_3, and X_4, a finding that R_{1.234}^2 is very high but r_{12.34}^2 and r_{13.24}^2 are comparatively low may suggest that the variables X_2, X_3, and X_4 are highly intercorrelated and that at least one of these variables is superfluous.
Although a study of the partial correlations may be useful, there is no guarantee that they will provide an infallible guide to multicollinearity, for it may happen that both R^2 and all the partial correlations are sufficiently high. More importantly, C. Robert Wichers has shown that the Farrar-Glauber partial correlation test is ineffective in that a given partial correlation may be compatible with different multicollinearity patterns.
4. Auxiliary regressions:
Since multicollinearity arises because one or more of the regressors are exact or approximately exact linear combinations of the other regressors, one way of finding out which X variable is related to the other X variables is to regress each X_i on the remaining X variables and compute the corresponding R^2, which we designate R_i^2; each of these regressions is called an auxiliary regression, auxiliary to the main regression of Y on the X's. Then, following the relationship between F and R^2, the variable
F_i = \frac{R_{x_i \cdot x_2 x_3 \ldots x_k}^2 / (k - 2)}{(1 - R_{x_i \cdot x_2 x_3 \ldots x_k}^2) / (n - k + 1)} …………… (1)
follows the F distribution with (k - 2) and (n - k + 1) degrees of freedom. In (1), n stands for the sample size, k stands for the number of explanatory variables including the intercept term, and R_{x_i \cdot x_2 x_3 \ldots x_k}^2 is the coefficient of determination in the regression of variable X_i on the remaining X variables.
If the computed F exceeds the critical F at the chosen level of significance, it is taken to mean that the particular X_i is collinear with the other X's.

5. Tolerance and VIF:

The larger the value of VIF_j, the more collinear the variable X_j. As a rule of thumb, if the VIF of a variable exceeds 10, that variable is said to be highly collinear.
Tolerance is defined as TOL_j = 1/VIF_j = (1 - R_j^2). The closer TOL_j is to zero, the greater the degree of collinearity of that variable with the other regressors; the closer TOL_j is to 1, the greater the evidence that X_j is not collinear with the other regressors.
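A sketch of computing VIF and TOL through auxiliary regressions; the regressors are simulated so that X3 is deliberately near-collinear with X2 (all names and numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100

# Simulated regressors: X3 is almost a linear combination of X2 (illustrative)
X2 = rng.normal(0, 1, n)
X3 = 2.0 * X2 + rng.normal(0, 0.1, n)    # highly collinear with X2
X4 = rng.normal(0, 1, n)                 # unrelated regressor

def vif(target, others):
    """VIF of one regressor: 1 / (1 - R^2) from its auxiliary regression on the other regressors."""
    Z = np.column_stack([np.ones(len(target))] + others)
    fitted = Z @ np.linalg.lstsq(Z, target, rcond=None)[0]
    r2 = 1 - np.sum((target - fitted)**2) / np.sum((target - target.mean())**2)
    return 1.0 / (1.0 - r2)

for name, target, others in [("X2", X2, [X3, X4]), ("X3", X3, [X2, X4]), ("X4", X4, [X2, X3])]:
    v = vif(target, others)
    print(name, "VIF =", round(v, 2), "TOL =", round(1 / v, 3))   # VIF > 10 flags high collinearity
```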
Remedial Measures:
1. Do nothing
2. Rule-of-thumb procedures:
a. A priori information
b. Combining cross-sectional and time series data
c. Dropping a variable(s) and specification bias
d. Transformation of variables
e. Reducing collinearity in polynomial regressions

Heteroscedasticity
An important assumption of the classical linear regression model is that the disturbances are homoscedastic. In this chapter we examine the validity of this assumption and find out what happens if it is not fulfilled. In particular, we seek answers to the following questions:

1. What is heteroscedasticity?
2. What are its consequences?
3. How does one detect it?
4. What are the remedial measures?

What is heteroscedasticity?
One of the important assumptions of the classical linear regression model is that the variance of each disturbance term \mu_i, conditional on the chosen values of the explanatory variables, is some constant number equal to \sigma^2. This is the assumption of homoscedasticity, or equal spread, that is, equal variance. Symbolically,
E(\mu_i^2) = \sigma^2, \quad i = 1, 2, \ldots, n
If this is not the case, the situation that arises is called heteroscedasticity. Symbolically,
E(\mu_i^2) = \sigma_i^2, \quad i = 1, 2, \ldots, n

Reasons for heteroscedasticity:
1. Following the error-learning model
2. Discretionary income
3. Improvement of data collection techniques
4. Presence of outliers
5. Specification error
6. Skewness in the distribution
7. Incorrect data transformation and incorrect functional form
OLS estimation in the presence of heteroscedasticity:
Let us consider the two-variable regression model:
Y_i = \beta_1 + \beta_2 X_i + \mu_i
Applying the usual formula, the OLS estimator of \beta_2 is
\hat{\beta}_2 = \frac{\sum x_i y_i}{\sum x_i^2} = \frac{n \sum X_i Y_i - \sum X_i \sum Y_i}{n \sum X_i^2 - (\sum X_i)^2} ………………… (1)
but its variance is now given by the following expression:
\operatorname{Var}(\hat{\beta}_2) = \frac{\sum x_i^2 \sigma_i^2}{(\sum x_i^2)^2} ………………… (2)
which is obviously different from the usual variance formula obtained under the assumption of homoscedasticity, namely
\operatorname{Var}(\hat{\beta}_2) = \frac{\sigma^2}{\sum x_i^2}
We know that \hat{\beta}_2 is the best linear unbiased estimator (BLUE) if the assumptions of the classical linear model, including homoscedasticity, hold. If we drop the assumption of homoscedasticity and replace it with heteroscedasticity, it can still be proved that \hat{\beta}_2 is unbiased, linear, and consistent, but it no longer has the minimum variance property.

BLUE in the presence of heteroscedasticity:

Let us consider the two-variable regression model:
Y_i = \beta_1 + \beta_2 X_i + \mu_i ………………… (1)
which, for ease of algebraic manipulation, we write as
Y_i = \beta_1 X_{0i} + \beta_2 X_i + \mu_i ………………… (2)
where X_{0i} = 1 for each i.
Now assume that the heteroscedastic variances \sigma_i^2 are known. Divide (2) through by \sigma_i to obtain
\frac{Y_i}{\sigma_i} = \beta_1 \left( \frac{X_{0i}}{\sigma_i} \right) + \beta_2 \left( \frac{X_i}{\sigma_i} \right) + \left( \frac{\mu_i}{\sigma_i} \right) ………………… (3)
which for ease of exposition we write as
Y_i^* = \beta_1^* X_{0i}^* + \beta_2^* X_i^* + \mu_i^* ………………… (4)
where the transformed variables are the original variables divided by \sigma_i. We use the notation \beta_1^* and \beta_2^*, the parameters of the transformed model, to distinguish them from the usual OLS parameters \beta_1 and \beta_2.

Now,
\operatorname{Var}(\mu_i^*) = E(\mu_i^*)^2 = E\left( \frac{\mu_i}{\sigma_i} \right)^2 \quad \text{since } E(\mu_i^*) = 0
= \frac{1}{\sigma_i^2} E(\mu_i^2) \quad \text{since } \sigma_i^2 \text{ is known}
= \frac{1}{\sigma_i^2} (\sigma_i^2) \quad \text{since } E(\mu_i^2) = \sigma_i^2
= 1, which is a constant.
That is, the variance of the transformed disturbance term \mu_i^* is now homoscedastic. Since we are still retaining the other assumptions of the classical model, the finding that it is \mu_i^* that is homoscedastic suggests that if we apply OLS to the transformed model (4), it will produce estimators that are BLUE.

In short, GLS is OLS on the transformed variables that satisfy the standard least-squares assumptions. The estimators thus obtained are known as GLS estimators, and it is these estimators that are BLUE.
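A minimal sketch of this transformation, with the \sigma_i treated as known (here they are known only because the data are simulated):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200

# Simulated heteroscedastic data: error standard deviation proportional to X (illustrative)
X = rng.uniform(1, 10, n)
sigma_i = 0.5 * X                               # the assumed-known sigma_i
Y = 2.0 + 3.0 * X + rng.normal(0, sigma_i)

# GLS / weighted least squares: divide every variable, including the constant, by sigma_i
X0_star = 1.0 / sigma_i                         # transformed constant term X_0i / sigma_i
X_star = X / sigma_i
Y_star = Y / sigma_i

Z = np.column_stack([X0_star, X_star])
beta1_star, beta2_star = np.linalg.lstsq(Z, Y_star, rcond=None)[0]
print(beta1_star, beta2_star)                   # GLS (BLUE) estimates of beta1 and beta2
```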

Difference between OLS and GLS:

1. In OLS we minimize
\sum \hat{\mu}_i^2 = \sum (Y_i - \hat{\beta}_1 - \hat{\beta}_2 X_i)^2
but in GLS we minimize
\sum w_i \hat{\mu}_i^2 = \sum w_i (Y_i - \hat{\beta}_1^* X_{0i} - \hat{\beta}_2^* X_i)^2
where w_i = 1/\sigma_i^2. Thus in GLS we minimize a weighted sum of squared residuals, with the w_i = 1/\sigma_i^2 acting as the weights, whereas in OLS we minimize an unweighted residual sum of squares.

2. [Figure omitted: scatter diagram with points A and B and an extreme (outlying) observation C.]
In OLS, each \hat{\mu}_i^2 associated with points A, B, and C receives the same weight in minimizing the RSS. Obviously, in this case the \hat{\mu}_i^2 associated with point C will dominate the RSS. In GLS, however, the extreme observation C gets a relatively smaller weight than the other two observations.

3. GLS is OLS on the transformed variables that satisfy the standard least-squares assumptions.

Detection of the heteroscedasticity:


Two methods are available.

1. Informal method
a. Nature of the problem
b. Graphical method
2. Formal Method:
a. Park test
b. Glejser Test
c. Spearman’s Rank Correlation Test
d. Goldfeld-Quandt Test
e. Breusch-Pagan-Godfrey Test
f. Others

Goldfeld-Quandt Test:

This method is applicable if one assumes that the heteroscedastic variance \sigma_i^2 is positively related to one of the explanatory variables in the regression model. For simplicity, consider the two-variable regression model:

Y_i = \beta_1 + \beta_2 X_i + \mu_i …………………… (1)

Suppose \sigma_i^2 is positively related to X_i as
\sigma_i^2 = \sigma^2 X_i^2, where \sigma^2 is a constant.
If this assumption is appropriate, it would mean that \sigma_i^2 would be larger, the larger the value of X_i. If that turns out to be the case, heteroscedasticity is most likely to be present in the model. To test this explicitly, Goldfeld and Quandt suggest the following steps:

1. Order or rank the observations according to the values of X_i, beginning with the lowest X value.
2. Omit c central observations, where c is specified a priori, and divide the remaining (n - c) observations into two groups of (n - c)/2 observations each.
3. Fit separate OLS regressions to the first (n - c)/2 observations and the last (n - c)/2 observations, and obtain the respective residual sums of squares RSS_1 and RSS_2, with RSS_1 representing the RSS from the regression corresponding to the smaller X_i values and RSS_2 that from the larger X_i values. These RSS each have [(n - c)/2] - k df, where k is the number of parameters to be estimated, including the intercept. For the two-variable case, k = 2.
4. Compute the ratio
\lambda = \frac{RSS_2 / df}{RSS_1 / df}
If the \mu_i are assumed to be normally distributed, and if the assumption of homoscedasticity is valid, then it can be shown that \lambda follows the F distribution with numerator and denominator df each equal to [(n - c)/2] - k.

If \lambda is greater than the critical F at the chosen level of significance, we can reject the hypothesis of homoscedasticity; that is, we can say that heteroscedasticity is very likely. A code sketch of these steps follows the note below.

Note: The success of the G-Q test depends on how c is chosen. For the two-variable regression model, the Monte Carlo experiments done by Goldfeld and Quandt suggest that c is about 8 if the sample size is about 30, and about 16 if the sample size is about 60. Judge et al. note that c = 4 if n = 30 and c = 10 if n is about 60 have been found satisfactory.
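A sketch of the four steps on simulated data whose error variance grows with X; SciPy is used only for the F-distribution p-value and is an added dependency not mentioned in the notes:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n, c = 30, 8                                   # omit c central observations, as suggested for n ~ 30

# Simulated heteroscedastic data: error spread grows with X (illustrative)
X = np.sort(rng.uniform(1, 20, n))             # step 1: observations ordered by X
Y = 5.0 + 2.0 * X + rng.normal(0, 0.4 * X)

def rss(Xs, Ys):
    """Residual sum of squares from a simple OLS fit of Ys on a constant and Xs."""
    Z = np.column_stack([np.ones(len(Xs)), Xs])
    resid = Ys - Z @ np.linalg.lstsq(Z, Ys, rcond=None)[0]
    return np.sum(resid**2)

m = (n - c) // 2                               # step 2: two groups of (n - c)/2 observations
rss1 = rss(X[:m], Y[:m])                       # step 3: RSS from the smaller-X group
rss2 = rss(X[-m:], Y[-m:])                     #         RSS from the larger-X group

df = m - 2                                     # [(n - c)/2] - k with k = 2
lam = (rss2 / df) / (rss1 / df)                # step 4: the lambda ratio
print(lam, stats.f.sf(lam, df, df))            # compare with the critical F; p-value shown
```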

Remedial Measures:
As we have seen, heteroscedasticity does not destroy the unbiasedness and consistency properties of the OLS estimators, but they are no longer efficient, not even asymptotically. This lack of efficiency makes the usual hypothesis-testing procedure of dubious value. Therefore, remedial measures may be called for. There are two approaches to remediation: when \sigma_i^2 is known and when \sigma_i^2 is not known.

When \sigma_i^2 is known:

The method of weighted least squares (WLS): as shown in the GLS discussion above, divide each observation by \sigma_i and apply OLS to the transformed data.

When \sigma_i^2 is not known:

1. White's Heteroscedasticity-Consistent Variances and Standard Errors:

White has shown that the OLS variances and standard errors can be estimated so that asymptotically valid statistical inferences can be made about the true parameter values. White's heteroscedasticity-corrected standard errors are also known as robust standard errors.
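As an illustration, robust standard errors are available in the statsmodels library (an assumption here; the notes do not name any software) through the cov_type argument:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 200

# Simulated heteroscedastic data (illustrative)
X = rng.uniform(1, 10, n)
Y = 2.0 + 3.0 * X + rng.normal(0, 0.5 * X)

Z = sm.add_constant(X)
ols_res = sm.OLS(Y, Z).fit()                     # conventional OLS standard errors
robust_res = sm.OLS(Y, Z).fit(cov_type="HC1")    # White heteroscedasticity-consistent (robust) SEs

print(ols_res.bse)       # usual standard errors (unreliable under heteroscedasticity)
print(robust_res.bse)    # robust standard errors (asymptotically valid)
```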

2. Plausible Assumptions about the Heteroscedasticity Pattern:

Apart from being a large-sample procedure, one drawback of the White procedure is that the estimators thus obtained may not be as efficient as those obtained by methods that transform the data to reflect specific types of heteroscedasticity.

To illustrate this, let us revert to the two-variable regression model:

Y_i = \beta_1 + \beta_2 X_i + \mu_i

We now consider several assumptions about the pattern of heteroscedasticity.

Assumption 1: E(\mu_i^2) = \sigma^2 X_i^2

If, as a matter of "speculation," or through graphical methods or the Park and Glejser approaches, it is believed that the variance of \mu_i is proportional to the square of the explanatory variable X, one may transform the original model as follows. Divide the original model through by X_i:
\frac{Y_i}{X_i} = \frac{\beta_1}{X_i} + \beta_2 + \frac{\mu_i}{X_i}
= \beta_1 \frac{1}{X_i} + \beta_2 + \vartheta_i
where \vartheta_i is the transformed disturbance term, equal to \mu_i / X_i. Now it is easy to verify that
E(\vartheta_i^2) = E\left( \frac{\mu_i}{X_i} \right)^2 = \frac{1}{X_i^2} E(\mu_i^2) = \sigma^2
Hence the variance of \vartheta_i is now homoscedastic, and one may proceed to apply OLS to the transformed equation.

Assumption 2: E(\mu_i^2) = \sigma^2 X_i

If it is believed that the variance of \mu_i, instead of being proportional to X_i^2, is proportional to X_i itself, then the original model can be transformed as follows (for X_i > 0):
\frac{Y_i}{\sqrt{X_i}} = \frac{\beta_1}{\sqrt{X_i}} + \beta_2 \sqrt{X_i} + \frac{\mu_i}{\sqrt{X_i}}
= \beta_1 \frac{1}{\sqrt{X_i}} + \beta_2 \sqrt{X_i} + \vartheta_i
where \vartheta_i = \mu_i / \sqrt{X_i}. Now
E(\vartheta_i^2) = E\left( \frac{\mu_i}{\sqrt{X_i}} \right)^2 = \frac{1}{X_i} E(\mu_i^2) = \sigma^2
Hence the variance of \vartheta_i is now homoscedastic, and one may proceed to apply OLS to the transformed equation.

Assumption 3: E(\mu_i^2) = \sigma^2 [E(Y_i)]^2

That is, the variance of \mu_i is proportional to the square of the expected value of Y. Now
E(Y_i) = \beta_1 + \beta_2 X_i
Therefore, if we transform the original equation as follows,
\frac{Y_i}{E(Y_i)} = \frac{\beta_1}{E(Y_i)} + \beta_2 \frac{X_i}{E(Y_i)} + \frac{\mu_i}{E(Y_i)}
= \beta_1 \frac{1}{E(Y_i)} + \beta_2 \frac{X_i}{E(Y_i)} + \vartheta_i
where \vartheta_i = \mu_i / E(Y_i), then
E(\vartheta_i^2) = E\left( \frac{\mu_i}{E(Y_i)} \right)^2 = \frac{1}{[E(Y_i)]^2} E(\mu_i^2) = \sigma^2
Hence the variance of \vartheta_i is now homoscedastic, and one may proceed to apply OLS to the transformed equation. In practice, since E(Y_i) is unknown, the transformation is carried out using the fitted values \hat{Y}_i from a preliminary OLS regression.

Assumption 4: A log transformation such as
\ln Y_i = \beta_1 + \beta_2 \ln X_i + \mu_i
very often reduces heteroscedasticity when compared with the regression Y_i = \beta_1 + \beta_2 X_i + \mu_i, because the log transformation compresses the scales in which the variables are measured.

Autocorrelation
In this chapter we will discuss:

1. What is the nature of autocorrelation?
2. What are the theoretical and practical consequences of autocorrelation?
3. How does one detect autocorrelation?
4. How does one remedy the problem of autocorrelation?

Definition:

The term autocorrelation may be defined as "correlation between members of a series of observations ordered in time or space." In the regression context, the classical linear regression model assumes that there is no correlation between the disturbance terms. Symbolically,

E(\mu_i \mu_j) = 0, \quad i \neq j

If this assumption is violated, that is, E(\mu_i \mu_j) \neq 0 for some i \neq j, the problem that arises is called autocorrelation.

Reasons for autocorrelation:
Inertia:

A salient feature of most economic time series is inertia, or sluggishness. As is well known, time series such as GNP, price indexes, production, employment, and unemployment exhibit cycles. Starting at the bottom of a recession, when economic recovery starts, most of these series start moving upward. In this upswing, the value of a series at one point in time is greater than its previous value. Thus there is a momentum built into them, and it continues until something happens to slow it down. Therefore, in regressions involving time series data, successive observations are likely to be interdependent.

Specification bias: excluded variables case

Suppose we have the following demand model:

Y_t = \beta_1 + \beta_2 X_{2t} + \beta_3 X_{3t} + \beta_4 X_{4t} + \mu_t ……………… (1)

where Y_t = quantity of beef demanded, X_2 = price of beef, X_3 = consumer income, X_4 = price of pork, and t = time.
However, for some reason we run the following regression:
Y_t = \beta_1 + \beta_2 X_{2t} + \beta_3 X_{3t} + \vartheta_t ……………… (2)
Now if (1) is the correct model, or true relation, running (2) is tantamount to letting \vartheta_t = \beta_4 X_{4t} + \mu_t. And to the extent the price of pork affects the consumption of beef, the error or disturbance term \vartheta_t will reflect a systematic pattern, thus creating autocorrelation.

Specification bias: incorrect functional form

Suppose the "true" or correct model in a cost-output study is as follows:

Marginal cost_i = \beta_1 + \beta_2 output_i + \beta_3 output_i^2 + \mu_i (nonlinear)

But we fit the following model:

Marginal cost_i = \alpha_1 + \alpha_2 output_i + \vartheta_i (linear)

[Figure omitted: the true (quadratic) marginal cost curve and the fitted linear cost curve, with marginal cost of production on the vertical axis and output on the horizontal axis; the two curves cross at points A and B.]

The marginal cost curve corresponding to the true model is shown in the figure along with the incorrect linear cost curve. The figure shows that between points A and B the linear marginal cost curve will consistently overestimate the true marginal cost, whereas beyond these points it will consistently underestimate the true marginal cost.

This result is to be expected, because the disturbance term \vartheta_i will, in fact, equal \beta_3 output_i^2 + \mu_i, and hence will catch the systematic effect of the output^2 term on marginal cost. In this case, \vartheta_i will reflect autocorrelation because of the use of an incorrect functional form.

Cobweb Phenomenon:

The supply of many agricultural commodities reflects the so-called cobweb phenomenon, where supply
reacts to the price with a lag of one time period because supply decisions take time to implement. Thus, at
the beginning of this year’s planting of crops, farmers are influenced by the price prevailing last year, so that
their supply function is

Supply_t = \beta_1 + \beta_2 P_{t-1} + \mu_t

Suppose at the end of period t, price P_t turns out to be lower than P_{t-1}. Therefore, in period t+1 farmers may very well decide to produce less than they did in period t. Obviously, in this situation the disturbances \mu_t are not expected to be random, because if farmers overproduce in year t, they are likely to reduce their production in t+1, and so on, leading to a cobweb pattern.

Lags:

In a time series regression of consumption expenditure on income, it is not uncommon to find that consumption expenditure in the current period depends, among other things, on the consumption expenditure of the previous period. That is,

Consumption_t = \beta_1 + \beta_2 Income_t + \beta_3 Consumption_{t-1} + \mu_t

Consumers do not change their consumption habits readily, for psychological, technological, or institutional reasons. Now if we neglect the lagged term, the resulting error term will reflect a systematic pattern due to the influence of lagged consumption on current consumption.

Manipulation of Data: (Check, P- 441, Text book)

Data Transformation: (Check, P- 441, Text book)

Nonstationarity:

A time series is stationary if its characteristics (e.g., mean, variance, covariance) are time invariant, that is, they do not change over time. If that is not the case, we have a nonstationary time series. In a two-variable regression model, it is quite possible that both Y and X are nonstationary and therefore the error term is also nonstationary. In that case, the error term will exhibit autocorrelation.

OLS Estimation in the Presence of Autocorrelation

Let us consider the two-variable regression model

Y_t = \beta_1 + \beta_2 X_t + \mu_t ……… (1)

To make any headway, we must assume the mechanism that generates \mu_t, for E(\mu_t \mu_{t+s}) \neq 0 \; (s \neq 0) is too general an assumption to be of any practical use. As a starting point, or first approximation, one can assume that the disturbance, or error, terms are generated by the following mechanism:

\mu_t = \rho \mu_{t-1} + \varepsilon_t, \quad -1 < \rho < +1 ………………… (2)

where \rho is known as the coefficient of autocovariance and \varepsilon_t is a stochastic disturbance term that satisfies the standard OLS assumptions, namely,

E(\varepsilon_t) = 0
V(\varepsilon_t) = \sigma_\varepsilon^2
\operatorname{Cov}(\varepsilon_t, \varepsilon_{t+s}) = 0, \quad s \neq 0

In the engineering literature, an error term with the preceding properties is often called a white noise error term. Equation (2) postulates that the value of the disturbance term in period t is equal to \rho times its value in the previous period plus a purely random error term.

The scheme (2) is known as a Markov first-order autoregressive scheme, or simply a first-order autoregressive scheme, usually denoted AR(1). Note that \rho, the coefficient of autocovariance in (2), can also be interpreted as the first-order coefficient of autocorrelation, or more accurately, the coefficient of autocorrelation at lag 1. Given the AR(1) scheme, it can be shown that
V(\mu_t) = E(\mu_t^2) = \frac{\sigma_\varepsilon^2}{1 - \rho^2}
\operatorname{Cov}(\mu_t, \mu_{t+s}) = E(\mu_t \mu_{t+s}) = \rho^s \frac{\sigma_\varepsilon^2}{1 - \rho^2}
\operatorname{Cor}(\mu_t, \mu_{t+s}) = \rho^s
One reason we use the AR(1) process is not only its simplicity compared to higher-order AR schemes, but also that in many applications it has proved to be quite useful. Now, for the two-variable regression model we know that
\hat{\beta}_2 = \frac{\sum x_t y_t}{\sum x_t^2}
and its variance is given by
V(\hat{\beta}_2) = \frac{\sigma^2}{\sum x_t^2}
where the small letters, as usual, denote deviations from the mean values.

Under the AR(1) scheme, it can be shown that the variance of this estimator is
V(\hat{\beta}_2)_{AR(1)} = \frac{\sigma^2}{\sum x_t^2} \left[ 1 + 2\rho \frac{\sum x_t x_{t+1}}{\sum x_t^2} + 2\rho^2 \frac{\sum x_t x_{t+2}}{\sum x_t^2} + \ldots + 2\rho^{n-1} \frac{x_1 x_n}{\sum x_t^2} \right]
where V(\hat{\beta}_2)_{AR(1)} means the variance of \hat{\beta}_2 under a first-order autoregressive scheme.

BLUE estimator in the presence of autocorrelation:

For the regression model Y_t = \beta_1 + \beta_2 X_t + \mu_t, and assuming the AR(1) process, it can be shown that the BLUE estimator of \beta_2 is given by the following expression:
\hat{\beta}_2^{GLS} = \frac{\sum_{t=2}^{n} (x_t - \rho x_{t-1})(y_t - \rho y_{t-1})}{\sum_{t=2}^{n} (x_t - \rho x_{t-1})^2} + C
where C is a correction factor that may be disregarded in practice. Its variance is given by
V(\hat{\beta}_2^{GLS}) = \frac{\sigma^2}{\sum_{t=2}^{n} (x_t - \rho x_{t-1})^2} + D
where D, too, is a correction factor that may be disregarded in practice.

Consequences of using OLS in the presence of autocorrelation:

In the presence of autocorrelation, the OLS estimators are still linear, unbiased, consistent, and asymptotically normally distributed, but they are no longer efficient.

OLS estimation disregarding autocorrelation:

1. The residual variance \hat{\sigma}^2 = \frac{\sum \hat{\mu}_i^2}{n - 2} is likely to underestimate the true \sigma^2.
2. As a result, we are likely to overestimate R^2.
3. Even if \sigma^2 is not underestimated, V(\hat{\beta}_2) may underestimate V(\hat{\beta}_2)_{AR(1)}, its variance under (first-order) autocorrelation, even though the latter is inefficient compared to V(\hat{\beta}_2^{GLS}).
4. Therefore, the usual t and F tests of significance are no longer valid and, if applied, are likely to give seriously misleading conclusions about the statistical significance of the estimated regression coefficients.
Run Test:

Null and alternative hypotheses for the run test:

H0: The observed sequence of residuals is random
H1: The observed sequence of residuals is not random
Let the level of significance be \alpha = 0.05.
Let

N = total number of observations (N_1 + N_2)
N_1 = number of (+) residuals
N_2 = number of (-) residuals
R = number of runs

Under the null hypothesis that successive outcomes are independent, and assuming N_1 > 10 and N_2 > 10, the number of runs is distributed (asymptotically) normally with

Mean: E(R) = \frac{2 N_1 N_2}{N} + 1

Variance: \sigma_R^2 = \frac{2 N_1 N_2 (2 N_1 N_2 - N)}{N^2 (N - 1)}

If the null hypothesis of randomness is sustainable, then, following the properties of the normal distribution, we should expect that

\Pr[E(R) - 1.96 \sigma_R \leq R \leq E(R) + 1.96 \sigma_R] = 0.95

That is, the probability is 95% that the preceding interval will include R. Hence the decision rule: do not reject the null hypothesis of randomness with 95% confidence if R, the number of runs, lies in the preceding confidence interval; reject the null hypothesis if the estimated R lies outside these limits.
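A minimal sketch of the run test; the residual values below are made up for illustration:

```python
import numpy as np

# Residuals from some fitted regression (illustrative values, not from the text)
u_hat = np.array([0.3, 0.5, 0.2, -0.4, -0.6, -0.1, 0.7, 0.9, 0.4, 0.2,
                  -0.3, -0.5, -0.8, -0.2, 0.1, 0.6, 0.3, -0.4, -0.7, 0.5,
                  0.2, 0.8, -0.1, -0.3, 0.4, 0.6, -0.5, -0.2, 0.3, 0.1])

signs = u_hat > 0
N1 = np.sum(signs)                        # number of (+) residuals
N2 = np.sum(~signs)                       # number of (-) residuals
N = N1 + N2
R = 1 + np.sum(signs[1:] != signs[:-1])   # a new run starts at every sign change

ER = 2 * N1 * N2 / N + 1                                        # E(R)
var_R = 2 * N1 * N2 * (2 * N1 * N2 - N) / (N**2 * (N - 1))      # sigma_R^2
lo, hi = ER - 1.96 * np.sqrt(var_R), ER + 1.96 * np.sqrt(var_R)

print(R, (lo, hi))   # reject randomness (suspect autocorrelation) if R falls outside the interval
```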

Durbin-Watson Test:

The most celebrated test for detecting serial correlation is the one developed by the statisticians Durbin and Watson. It is popularly known as the Durbin-Watson d statistic. The null and alternative hypotheses for the D-W test are

H0: There is no first-order autocorrelation (positive or negative)
H1: There is first-order autocorrelation

The d test statistic is defined as
d = \frac{\sum_{t=2}^{n} (\hat{\mu}_t - \hat{\mu}_{t-1})^2}{\sum_{t=1}^{n} \hat{\mu}_t^2} ………………… (1)
which is simply the ratio of the sum of squared differences in successive residuals to the RSS.

Assumptions:

1. The regression model includes an intercept term.
2. The X's are nonstochastic.
3. The \mu_t are generated by the first-order autoregressive scheme \mu_t = \rho \mu_{t-1} + \varepsilon_t. Therefore, the test cannot be used to detect higher-order autoregressive schemes.
4. The \mu_t are assumed to be normally distributed.
5. The regression model does not include the lagged value(s) of the dependent variable as one of the independent variables.

The exact sampling or probability distribution of the d statistic given in (1) is difficult to derive because, as Durbin and Watson have shown, it depends in a complicated way on the X values present in a given sample. Since d is computed from the \hat{\mu}_t, which depend on the given X's, unlike the t, F, or \chi^2 tests there is no unique critical value that leads to the rejection or acceptance of the null hypothesis of no first-order autocorrelation.

Durbin and Watson were, however, successful in deriving a lower bound d_L and an upper bound d_U such that if the computed d lies outside these critical values, a decision can be made regarding the presence of autocorrelation.

Decision rule:
0 \leq d < d_L: reject H0; evidence of positive autocorrelation
d_L \leq d \leq d_U: zone of indecision
d_U < d < 4 - d_U: do not reject H0 or H0*; no evidence of first-order autocorrelation
4 - d_U \leq d \leq 4 - d_L: zone of indecision
4 - d_L < d \leq 4: reject H0*; evidence of negative autocorrelation
where H0: no positive autocorrelation, and H0*: no negative autocorrelation.
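A sketch of computing d from OLS residuals; the data are simulated with AR(1) errors and \rho = 0.7 (an assumption for the example), and the computed d would then be compared with the tabulated d_L and d_U:

```python
import numpy as np

rng = np.random.default_rng(8)
n = 60

# Simulated regression with AR(1) errors, rho = 0.7 (illustrative)
X = np.linspace(1, 30, n)
mu = np.zeros(n)
eps = rng.normal(0, 1, n)
for t in range(1, n):
    mu[t] = 0.7 * mu[t - 1] + eps[t]      # mu_t = rho * mu_(t-1) + eps_t
Y = 2.0 + 0.5 * X + mu

# OLS residuals
Z = np.column_stack([np.ones(n), X])
u_hat = Y - Z @ np.linalg.lstsq(Z, Y, rcond=None)[0]

# Durbin-Watson d statistic, equation (1)
d = np.sum(np.diff(u_hat)**2) / np.sum(u_hat**2)
print(d)    # well below 2 (roughly 2(1 - rho)), pointing to positive autocorrelation
```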

Correcting for Pure Autocorrelation (GLS):

Knowing the consequences of autocorrelation, especially the lack of efficiency of the OLS estimators, we may need to remedy the problem. The remedy depends on the knowledge one has about the nature of interdependence among the disturbances, that is, knowledge about the structure of autocorrelation.

As a starter, consider the two-variable regression model:

Y_t = \beta_1 + \beta_2 X_t + \mu_t ……… (1)

and assume that the error term follows the AR(1) scheme, namely,

\mu_t = \rho \mu_{t-1} + \varepsilon_t, \quad -1 \leq \rho \leq 1

Now consider two cases:

When \rho is known:
If the coefficient of first-order autocorrelation is known, the problem of autocorrelation can be easily solved. If equation (1) holds at time t, it also holds at time t-1. Hence
Y_{t-1} = \beta_1 + \beta_2 X_{t-1} + \mu_{t-1} ……… (2)
Multiplying (2) by \rho on both sides, we obtain
\rho Y_{t-1} = \rho \beta_1 + \rho \beta_2 X_{t-1} + \rho \mu_{t-1} ……… (3)
Subtracting (3) from (1), we get
(Y_t - \rho Y_{t-1}) = \beta_1 (1 - \rho) + \beta_2 (X_t - \rho X_{t-1}) + (\mu_t - \rho \mu_{t-1}) …………… (4)
= \beta_1 (1 - \rho) + \beta_2 (X_t - \rho X_{t-1}) + \varepsilon_t, \quad \text{where } \varepsilon_t = \mu_t - \rho \mu_{t-1}
We can express (4) as
Y_t^* = \beta_1^* + \beta_2^* X_t^* + \varepsilon_t
where
\beta_1^* = \beta_1 (1 - \rho), \quad Y_t^* = (Y_t - \rho Y_{t-1}), \quad X_t^* = (X_t - \rho X_{t-1}), \quad \beta_2^* = \beta_2
Since the error term in equation (4) satisfies the usual OLS assumptions, we can apply OLS to the transformed variables (the generalized, or quasi, difference transformation) and obtain estimators with all the optimum properties, namely BLUE.
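A sketch of the generalized (quasi) difference transformation with \rho assumed known; the data and \rho = 0.6 are illustrative:

```python
import numpy as np

rng = np.random.default_rng(9)
n, rho = 100, 0.6                       # rho assumed known (illustrative)

# Simulated model Y_t = 2 + 0.5 X_t + mu_t with AR(1) errors
X = np.linspace(1, 50, n)
mu = np.zeros(n)
eps = rng.normal(0, 1, n)
for t in range(1, n):
    mu[t] = rho * mu[t - 1] + eps[t]
Y = 2.0 + 0.5 * X + mu

# Generalized difference transformation, equation (4); the first observation is dropped
Y_star = Y[1:] - rho * Y[:-1]
X_star = X[1:] - rho * X[:-1]

Z = np.column_stack([np.ones(n - 1), X_star])
b1_star, b2_hat = np.linalg.lstsq(Z, Y_star, rcond=None)[0]
beta1_hat = b1_star / (1 - rho)         # recover beta1 from beta1* = beta1 (1 - rho)
print(beta1_hat, b2_hat)
```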
When \rho is not known:

The method of generalized difference is difficult to implement because \rho is rarely known in practice. Therefore, we need to find ways of estimating \rho. Several possibilities exist.

The first difference method:
Since \rho lies between 0 and \pm 1, one could start from two extreme positions. At one extreme, one could assume that \rho = 0, that is, no (first-order) serial correlation; at the other extreme we could let \rho = \pm 1, that is, perfect positive or negative correlation. As a matter of fact, when a regression is run, one generally assumes that there is no autocorrelation and then lets the Durbin-Watson or other test show whether this assumption is justified. If, however, \rho = +1, the generalized difference equation (4) reduces to the first difference equation:
(Y_t - Y_{t-1}) = \beta_2 (X_t - X_{t-1}) + (\mu_t - \mu_{t-1})
\Delta Y_t = \beta_2 \Delta X_t + \varepsilon_t ………………… (5)

where \Delta is the first difference operator.

Since the error term in (5) is free from (first-order) autocorrelation, to run the regression (5) all one has to do is form the first differences of both the regressand and the regressor(s) and run the regression on these first differences. This transformation may be appropriate if the coefficient of autocorrelation is very high, say in excess of 0.8, or the Durbin-Watson d is quite low; a rule of thumb proposed by Maddala is to use the first difference form whenever d < R^2. An interesting feature of (5) is that there is no intercept in it. Hence, to estimate (5), one has to use the regression-through-the-origin routine. If, however, you forget to drop the intercept term and estimate the model that includes it,
\Delta Y_t = \beta_1 + \beta_2 \Delta X_t + \varepsilon_t
then the original model must have a trend in it, and \beta_1 represents the coefficient of the trend variable.
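A minimal sketch of the first difference method, using regression through the origin as (5) requires; the random-walk (\rho = 1) errors are an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(10)
n = 100

# Simulated model with rho = 1 (random-walk) errors, where first differencing is appropriate
X = np.cumsum(rng.normal(1.0, 0.5, n))
mu = np.cumsum(rng.normal(0, 1, n))         # random-walk disturbances
Y = 4.0 + 0.5 * X + mu

dY, dX = np.diff(Y), np.diff(X)             # first differences of regressand and regressor

# Regression through the origin, since equation (5) has no intercept
beta2_hat = np.sum(dX * dY) / np.sum(dX**2)
print(beta2_hat)                            # close to the true slope 0.5
```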
