
ECONOMETRIC THEORY

Bibliography

Dr. Shalabh
Department of Mathematics and Statistics
Indian Institute of Technology Kanpur

1. Jack Johnston and John DiNardo (1997): Econometric Methods, Third edition, McGraw Hill.

2. G.S. Maddala (2002): Introduction to Econometrics, Third edition, John Wiley.

3. Damodar Gujarati (2003): Basic Econometrics, McGraw Hill.

4. William Greene (2008): Econometric Analysis, Sixth edition, Prentice Hall.

Suggested readings

1. George G. Judge, William E. Griffiths, R. Carter Hill, Helmut Lütkepohl and Tsoung-Chao Lee (1985): The Theory and Practice of Econometrics, John Wiley.

2. C.R. Rao, H. Toutenburg, Shalabh and C. Heumann (2008): Linear Models and Generalizations - Least Squares and Alternatives, Springer.

3. J.M. Wooldridge (2002): Introductory Econometrics - A Modern Approach, South-Western College Publishing.

4. Badi H. Baltagi (1999): Econometrics, Second edition, Springer-Verlag.

5. Aman Ullah and H.D. Vinod (1981): Recent Advances in Regression Models, Marcel Dekker.

6. Henri Theil (1971): Principles of Econometrics, John Wiley.


ECONOMETRIC THEORY
MODULE – I
Lecture - 1
Introduction to Econometrics

Dr. Shalabh
Department of Mathematics and Statistics
Indian Institute of Technology Kanpur

Econometrics deals with the measurement of economic relationships. It is an integration of economic theory, mathematical economics and statistics, with the objective of providing numerical values for the parameters of economic relationships.

The relationships of economic theory are usually expressed in mathematical form and combined with empirical economics. Econometric methods are used to obtain the values of the parameters, which are essentially the coefficients of the mathematical form of the economic relationships. The statistical methods which help in explaining economic phenomena are adapted as econometric methods.

Econometric relationships depict the random behaviour of economic relationships, which is generally not considered in economic and mathematical formulations.

It may be pointed out that econometric methods can also be used in other areas like engineering sciences, biological sciences, medical sciences, geosciences, agricultural sciences, etc. In simple words, whenever there is a need to find a stochastic relationship in mathematical form, the econometric methods and tools help. The econometric tools are helpful in explaining the relationships among variables.

Econometric models

A model is a simplified representation of a real-world process. It should be representative in the sense that it should contain the salient features of the phenomenon under study.

In general, one of the objectives in modeling is to have a simple model to explain a complex phenomenon. Such an objective may sometimes lead to an oversimplified model, and sometimes the assumptions made are unrealistic.

In practice, generally all the variables which the experimenter thinks are relevant to explain the phenomenon are included in the model. The rest of the variables are dumped in a basket called "disturbances", where the disturbances are random variables. This is the main difference between economic modeling and econometric modeling. It is also the main difference between mathematical modeling and statistical modeling. Mathematical modeling is exact in nature, whereas statistical modeling contains a stochastic term also.

An economic model is a set of assumptions that describes the behaviour of an economy, or, more generally, a phenomenon.

An econometric model consists of
- a set of equations describing the behaviour. These equations are derived from the economic model and have two parts - observed variables and disturbances.
- a statement about the errors in the observed values of variables.
- a specification of the probability distribution of the disturbances.

Aims of econometrics

The three main aims of econometrics are as follows:

1. Formulation and specification of econometric models

The economic models are formulated in an empirically testable form. Several econometric models can be derived from the same economic model. Such models differ due to different choices of functional form, specification of the stochastic structure of the variables, etc.

2. Estimation and testing of models

The models are estimated on the basis of the observed set of data and are tested for their suitability. This is the statistical inference part of the modeling. Various estimation procedures are used to obtain the numerical values of the unknown parameters of the model. Based on the various formulations of statistical models, a suitable and appropriate model is selected.

3. Use of models

The obtained models are used for forecasting and policy formulation, which is an essential part of any policy decision. Such forecasts help the policy makers to judge the goodness of the fitted model and to take necessary measures to re-adjust the relevant economic variables.

Econometrics and statistics

Econometrics differs both from mathematical statistics and from economic statistics. In economic statistics, the empirical data is collected, recorded, tabulated and used in describing the pattern in their development over time. Economic statistics is a descriptive aspect of economics. It provides neither explanations of the development of the various variables nor measurement of the parameters of the relationships.

Statistical methods describe the methods of measurement which are developed on the basis of controlled experiments. Such methods may not be suitable for economic phenomena, as they do not fit in the framework of controlled experiments. For example, in real-world experiments, the variables usually change continuously and simultaneously, and so the set-up of controlled experiments is not suitable.

Econometrics uses statistical methods after adapting them to the problems of economic life. These adapted statistical methods are usually termed econometric methods. Such methods are adjusted so that they become appropriate for the measurement of stochastic relationships. These adjustments basically attempt to specify the stochastic element which operates in real-world data and enters into the determination of the observed data. This enables the data to be treated as a random sample, which is needed for the application of statistical tools.

Theoretical econometrics includes the development of appropriate methods for the measurement of economic relationships which are not meant for controlled experiments conducted inside laboratories. The econometric methods are generally developed for the analysis of non-experimental data.

Applied econometrics includes the application of econometric methods to specific branches of economic theory and problems like demand, supply, production, investment, consumption, etc. Applied econometrics involves the application of the tools of econometric theory for the analysis of economic phenomena and forecasting economic behaviour.
Types of data

Various types of data are used in the estimation of the model.

1. Time series data

Time series data give information about the numerical values of variables from period to period and are collected over time. For example, the data during the years 1990-2010 for monthly income constitute a time series.

2. Cross section data

Cross section data give information on the variables concerning individual agents (e.g., consumers or producers) at a given point of time. For example, a cross section sample of consumers is a sample of family budgets showing expenditures on various commodities by each family, as well as information on family income, family composition and other demographic, social or financial characteristics.

3. Panel data

Panel data are the data from a repeated survey of a single (cross-section) sample in different periods of time.

4. Dummy variable data

When the variables are qualitative in nature, the data is recorded in the form of an indicator function. The values of the variables do not reflect the magnitude of the data; they reflect only the presence or absence of a characteristic. For example, variables like religion, sex, taste, etc. are qualitative variables. The variable `sex' takes two values - male or female; the variable `taste' takes values like or dislike, etc. Such values are denoted by a dummy variable. For example, `1' can represent male and `0' female; similarly, `1' can represent liking of taste and `0' disliking of taste.
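As an illustration, a qualitative variable can be turned into 0/1 indicators in a few lines of code. The sketch below uses pandas on a small made-up sample; the column names and values are hypothetical.

import pandas as pd

# Hypothetical sample with one quantitative and two qualitative variables
data = pd.DataFrame({
    "income": [12000, 15000, 11000, 18000],
    "sex":    ["male", "female", "female", "male"],
    "taste":  ["like", "dislike", "like", "like"],
})

# Map each qualitative variable to a 0/1 dummy variable
data["sex_male"] = (data["sex"] == "male").astype(int)      # 1 = male, 0 = female
data["taste_like"] = (data["taste"] == "like").astype(int)  # 1 = like, 0 = dislike

print(data[["income", "sex_male", "taste_like"]])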

Aggregation problem

Aggregation problems arise when aggregative variables are used in functions. Such aggregative variables may involve:

1. Aggregation over individuals

For example, total income may comprise the sum of individual incomes.

2. Aggregation over commodities

The quantities of various commodities may be aggregated over a group of commodities. This is done by using a suitable index, e.g., a price index.

3. Aggregation over time periods

Sometimes the data is available for shorter or longer time periods than required in the functional form of the economic relationship. In such cases, the data needs to be aggregated over time periods. For example, the production of most manufactured commodities is completed in a period shorter than a year. If annual figures are to be used in the model, then there may be some error in the production function.

4. Spatial aggregation

Sometimes the aggregation is related to spatial issues, for example, the population of towns, countries, or the production in a city or region, etc.

Such sources of aggregation introduce an "aggregation bias" in the estimates of the coefficients. It is important to examine the possibility of such errors before estimating the model.
Econometrics and regression analysis

One of the very important roles of econometrics is to provide tools for modeling on the basis of given data. The regression modeling technique helps a lot in this task. Regression models can be either linear or non-linear, based on which we have linear regression analysis and non-linear regression analysis. We will consider only the tools of linear regression analysis, and our main interest will be the fitting of a linear regression model to a given set of data.

Linear models and regression analysis

Suppose the outcome of a process is denoted by a random variable y, called the dependent (or study) variable, which depends on k independent (or explanatory) variables denoted by X_1, X_2, ..., X_k. Suppose the behaviour of y can be explained by a relationship given by

y = f(X_1, X_2, ..., X_k; β_1, β_2, ..., β_k) + ε

where f is some well-defined function and β_1, β_2, ..., β_k are the parameters which characterize the role and contribution of X_1, X_2, ..., X_k, respectively. The term ε reflects the stochastic nature of the relationship between y and X_1, X_2, ..., X_k and indicates that such a relationship is not exact in nature. When ε = 0, the relationship is called a mathematical model, otherwise a statistical model. The term "model" is broadly used to represent any phenomenon in a mathematical framework.

A model or relationship is termed linear if it is linear in the parameters and nonlinear if it is not linear in the parameters. In other words, if all the partial derivatives of y with respect to each of the parameters β_1, β_2, ..., β_k are independent of the parameters, then the model is called a linear model. If any of the partial derivatives of y with respect to any of β_1, β_2, ..., β_k is not independent of the parameters, the model is called nonlinear. Note that the linearity or non-linearity of the model is not determined by the linearity or nonlinearity of the explanatory variables in the model.
For example,

y = β_1 X_1^2 + β_2 X_2 + β_3 log X_3 + ε

is a linear model because ∂y/∂β_i (i = 1, 2, 3) are independent of the parameters β_i (i = 1, 2, 3). On the other hand,

y = β_1^2 X_1 + β_2 X_2 + β_3 log X_3 + ε

is a nonlinear model because ∂y/∂β_1 = 2 β_1 X_1 depends on β_1, although ∂y/∂β_2 and ∂y/∂β_3 are independent of any of β_1, β_2 or β_3.
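The linearity-in-parameters check described above can also be carried out symbolically. The following sketch (assuming the sympy library is available) differentiates the two example models with respect to each β and tests whether any parameter survives in the derivatives.

import sympy as sp

# Symbols for the parameters and explanatory variables (hypothetical example)
b1, b2, b3 = sp.symbols("beta1 beta2 beta3")
X1, X2, X3, eps = sp.symbols("X1 X2 X3 epsilon")

# Model that is linear in the parameters
y_lin = b1 * X1**2 + b2 * X2 + b3 * sp.log(X3) + eps
# Model that is nonlinear in the parameters (beta1 enters as beta1**2)
y_nonlin = b1**2 * X1 + b2 * X2 + b3 * sp.log(X3) + eps

for name, expr in [("linear", y_lin), ("nonlinear", y_nonlin)]:
    derivs = [sp.diff(expr, b) for b in (b1, b2, b3)]
    # The model is linear in the parameters if no derivative still contains a parameter
    is_linear = all(not d.free_symbols & {b1, b2, b3} for d in derivs)
    print(name, derivs, "-> linear in parameters:", is_linear)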

When the function f is linear in the parameters, then y = f(X_1, X_2, ..., X_k; β_1, β_2, ..., β_k) + ε is called a linear model, and when the function f is nonlinear in the parameters, it is called a nonlinear model. In general, the function f is chosen as

f(X_1, X_2, ..., X_k; β_1, β_2, ..., β_k) = β_1 X_1 + β_2 X_2 + ... + β_k X_k

to describe a linear model. Since X_1, X_2, ..., X_k are pre-determined variables and y is the outcome, both are known. Thus the knowledge of the model depends on the knowledge of the parameters β_1, β_2, ..., β_k.

The statistical linear modeling essentially consists of developing approaches and tools to determine β_1, β_2, ..., β_k in the linear model

y = β_1 X_1 + β_2 X_2 + ... + β_k X_k + ε

given the observations on y and X_1, X_2, ..., X_k.

Different statistical estimation procedures, e.g., method of maximum likelihood, principle of least squares, method of
moments etc. can be employed to estimate the parameters of the model. The method of maximum likelihood needs further
knowledge of the distribution of y whereas the method of moments and the principle of least squares do not need any
knowledge about the distribution of y.

The regression analysis is a tool to determine the values of the parameters given the data on y and X1,X2,…,Xk. The
literal meaning of regression is “to move in the backward direction”. Before discussing and understanding the meaning of
“backward direction”, let us find which of the following statements is correct:
S1: model generates data or
S2: data generates model.

Obviously, S1 is correct. It can be broadly thought that the model exists in nature but is unknown to the experimenter.
When some values to the explanatory variables are provided, then the values for the output or study variable are
generated accordingly, depending on the form of the function f and the nature of phenomenon. So ideally, the pre-existing
model gives rise to the data. Our objective is to determine the functional form of this model. Now we move in the backward
direction. We propose to first collect the data on study and explanatory variables. Then we employ some statistical
techniques and use this data to know the form of function f. Equivalently, the data from the model is recorded first and then
used to determine the parameters of the model. The regression analysis is a technique which helps in determining the
statistical model by using the data on study and explanatory variables. The classification of linear and nonlinear regression
analysis is based on the determination of linear and nonlinear models, respectively.

Consider a simple example to understand the meaning of "regression". Suppose the yield of a crop (y) depends linearly on two explanatory variables, viz., the quantity of fertilizer (X_1) and the level of irrigation (X_2), as

y = β_1 X_1 + β_2 X_2 + ε.

The true values of β_1 and β_2 exist in nature but are unknown to the experimenter. Some values of y are recorded by providing different values of X_1 and X_2. There exists some relationship between y and X_1, X_2 which gives rise to systematically behaved data on y, X_1 and X_2. Such a relationship is unknown to the experimenter. To determine the model, we move in the backward direction in the sense that the collected data is used to determine the unknown parameters β_1 and β_2 of the model. In this sense such an approach is termed regression analysis.
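As a small illustration of this "backward" step, the sketch below generates data from a known two-variable linear model and recovers the parameters by least squares. The numpy library is assumed, and the true values 2.0 and 0.5 are made up purely for the illustration.

import numpy as np

rng = np.random.default_rng(0)
n = 100
X1 = rng.uniform(0, 10, n)          # quantity of fertilizer (hypothetical units)
X2 = rng.uniform(0, 5, n)           # level of irrigation (hypothetical units)
eps = rng.normal(0, 1, n)           # random disturbance
y = 2.0 * X1 + 0.5 * X2 + eps       # data generated by the (unknown-to-us) model

# Move "backwards": estimate beta_1, beta_2 from the recorded data
X = np.column_stack([X1, X2])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)                      # should be close to [2.0, 0.5]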
ECONOMETRIC THEORY

MODULE – I
Lecture - 2
Introduction to Econometrics

Dr. Shalabh
Department of Mathematics and Statistics
Indian Institute of Technology Kanpur

The theory and fundamentals of linear models lay the foundation for developing the tools for regression analysis that are based on valid statistical theory and concepts.

Steps in regression analysis

Regression analysis includes the following steps:

- Statement of the problem under consideration
- Choice of relevant variables
- Collection of data on relevant variables
- Specification of model
- Choice of method for fitting the data
- Fitting of model
- Model validation and criticism
- Using the chosen model(s) for the solution of the posed problem and forecasting

These steps are examined below.

1. Statement of the problem under consideration

The first important step in conducting any regression analysis is to specify the problem and the objectives to be addressed by the regression analysis. A wrong formulation or a wrong understanding of the problem will lead to wrong statistical inferences. The choice of variables depends upon the objectives of the study and the understanding of the problem. For example, the height and weight of children are related. Now there can be two issues to address:

(i) determination of height for given weight, or
(ii) determination of weight for given height.

In case (i), height is the response variable and weight is the explanatory variable, whereas in case (ii) weight is the response variable and height is the explanatory variable. The roles of the explanatory and response variables are thus interchanged in cases (i) and (ii).

2. Choice of potentially relevant variables

Once the problem is carefully formulated and the objectives have been decided, the next question is to choose the relevant variables. It has to be kept in mind that the correct choice of variables determines the correctness of the statistical inferences. For example, in any agricultural experiment, the yield depends on explanatory variables like quantity of fertilizer, rainfall, irrigation, temperature, etc. These variables are denoted by X_1, X_2, ..., X_k as a set of k explanatory variables.

3. Collection of data on relevant variables

Once the objective of the study is clearly stated and the variables are chosen, the next step is to collect data on these relevant variables. The data are essentially the measurements on these variables. For example, suppose we want to collect data on age. For this, it is important to know how to record the data on age: either the date of birth can be recorded, which provides the exact age on any specific date, or the age in terms of completed years as on a specific date can be recorded. Moreover, it is also important to decide whether the data are to be collected on the variables as quantitative variables or as qualitative variables. For example, if the ages (in years) are 15, 17, 19, 21, 23, then these are quantitative values. If the ages are instead recorded by a variable that takes the value 1 when the age is less than 18 years and 0 otherwise, then the earlier recorded data is converted to 1, 1, 0, 0, 0. Note that there is a loss of information in converting quantitative data into qualitative data.

The methods and approaches for qualitative and quantitative data are also different. If the study variable is binary, then logistic and probit regressions, etc. are used. If all explanatory variables are qualitative, then the analysis of variance technique is used. If some explanatory variables are qualitative and others are quantitative, then the analysis of covariance technique is used. The techniques of analysis of variance and analysis of covariance are special cases of regression analysis.

Generally, the data are collected on n subjects. Then y denotes the response or study variable and y_1, y_2, ..., y_n are its n observed values. If there are k explanatory variables X_1, X_2, ..., X_k, then x_ij denotes the i-th value of the j-th variable, i = 1, 2, ..., n; j = 1, 2, ..., k. The observations can be presented in the following table:

Notations for the data used in regression analysis

Observation number | Response y | Explanatory variables X_1, X_2, ..., X_k
1                  | y_1        | x_11, x_12, ..., x_1k
2                  | y_2        | x_21, x_22, ..., x_2k
3                  | y_3        | x_31, x_32, ..., x_3k
...                | ...        | ...
n                  | y_n        | x_n1, x_n2, ..., x_nk
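In practice this tabular layout corresponds to a response vector and an n x k matrix of explanatory variables. A minimal sketch with numpy (the numbers are made up):

import numpy as np

# Hypothetical data: n = 4 observations on k = 2 explanatory variables
y = np.array([3.1, 4.0, 5.2, 6.1])          # responses y_1, ..., y_n
X = np.array([[1.0, 2.0],                    # row i holds x_i1, x_i2
              [1.5, 2.2],
              [2.0, 3.1],
              [2.5, 3.8]])

n, k = X.shape
print(n, k)            # 4 observations, 2 explanatory variables
print(X[2, 0])         # x_31 in the notation above (0-based indexing in code)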

4. Specification of model

The experimenter or the person working in the subject usually helps in determining the form of the model. Only the form of the tentative model can be ascertained, and it will depend on some unknown parameters. For example, a general form will be

y = f(X_1, X_2, ..., X_k; β_1, β_2, ..., β_k) + ε

where ε is the random error reflecting mainly the difference between the observed value of y and the value of y obtained through the model. The form of f(X_1, X_2, ..., X_k; β_1, β_2, ..., β_k) can be linear as well as nonlinear, depending on the way the parameters β_1, β_2, ..., β_k enter. A model is said to be linear if it is linear in the parameters. For example,

y = β_1 X_1 + β_2 X_1^2 + β_3 X_2 + ε
y = β_1 + β_2 ln X_2 + ε

are linear models, whereas

y = β_1 X_1 + β_2^2 X_2 + β_3 X_2 + ε
y = (ln β_1) X_1 + β_2 X_2 + ε

are nonlinear models. Many times, nonlinear models can be converted into linear models through suitable transformations, so the class of linear models is wider than it appears initially.

If a model contains only one explanatory variable, it is called a simple regression model. When there is more than one independent variable, it is called a multiple regression model. When there is only one study variable, the regression is termed univariate regression; when there is more than one study variable, the regression is termed multivariate regression. Note that simple versus multiple regression is not the same distinction as univariate versus multivariate regression: simple versus multiple regression is determined by the number of explanatory variables, whereas univariate versus multivariate regression is determined by the number of study variables.

5. Choice of method for fitting the data

After the model has been defined and the data have been collected, the next task is to estimate the parameters of the model based on the collected data. This is also referred to as parameter estimation or model fitting. The most commonly used method of estimation is the least squares method. Under certain assumptions, the least squares method produces estimators with desirable properties. Other estimation methods are the maximum likelihood method, the ridge regression method, the principal components method, etc.

6. Fitting of model

The estimation of the unknown parameters using an appropriate method provides the values of the parameters. Substituting these values in the equation gives us a usable model. This is termed model fitting. The estimates of the parameters β_1, β_2, ..., β_k in the model

y = f(X_1, X_2, ..., X_k; β_1, β_2, ..., β_k) + ε

are denoted by β̂_1, β̂_2, ..., β̂_k, which gives the fitted model as

y = f(X_1, X_2, ..., X_k; β̂_1, β̂_2, ..., β̂_k).

When the value of y is obtained for given values of X_1, X_2, ..., X_k, it is denoted by ŷ and called the fitted value. The fitted equation is also used for prediction, in which case ŷ is termed the predicted value. Note that the fitted value is the one where the values used for the explanatory variables correspond to one of the n observations in the data, whereas the predicted value is the one obtained for any other set of values of the explanatory variables. It is not generally recommended to predict y-values for sets of explanatory-variable values which lie outside the range of the data. When the values of the explanatory variables are future values of the explanatory variables, the predicted values are called forecasted values.
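The distinction between fitted and predicted values can be made concrete with a small sketch (numpy assumed; data and coefficients are hypothetical):

import numpy as np

# Hypothetical observed data and a fitted line yhat = b0 + b1 * x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])
b1, b0 = np.polyfit(x, y, 1)        # least squares fit of a straight line

fitted = b0 + b1 * x                # fitted values: at the observed x_i
x_new = np.array([2.5, 5.0])        # 5.0 lies outside the data range
predicted = b0 + b1 * x_new         # predicted values: at other x values

print(fitted)
print(predicted)                    # extrapolation at 5.0 should be treated with caution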

7. Model criticism and selection

The validity of the statistical methods used for regression analysis depends on various assumptions. These assumptions are essentially assumptions about the model and the data. The quality of the statistical inferences heavily depends on whether these assumptions are satisfied or not. For these assumptions to be valid and satisfied, care is needed from the beginning of the experiment. One has to be careful in choosing the required assumptions and in examining whether the assumptions are valid for the given experimental conditions or not. It is also important to identify the situations in which the assumptions may not be met.

The validation of the assumptions must be made before drawing any statistical conclusion. Any departure from the validity of the assumptions will be reflected in the statistical inferences. In fact, regression analysis is an iterative process where the outputs are used to diagnose, validate, criticize and modify the inputs. The iterative process is illustrated in the following figure.

[Figure: the iterative process of regression analysis. Inputs (theories, model, assumptions, data, statistical methods) are used for estimation; the outputs (estimates of parameters, confidence regions, tests of hypotheses, graphical displays) are fed back through diagnosis, validation and criticism to modify the inputs.]

8. Objectives of regression analysis

The determination of the explicit form of the regression equation is the ultimate objective of regression analysis. The outcome is a good and valid relationship between the study variable and the explanatory variables. The regression equation helps in understanding the interrelationships among the variables. Such a regression equation can be used for several purposes, for example, to determine the role of any explanatory variable in the joint relationship for policy formulation, or to forecast the values of the response variable for a given set of values of the explanatory variables.


ECONOMETRIC THEORY

MODULE – II
Lecture - 3
Simple Linear Regression Analysis

Dr. Shalabh
Department of Mathematics and Statistics
Indian Institute of Technology Kanpur

The simple linear regression model

We consider the modeling between the dependent variable and one independent variable. When there is only one independent variable in the linear regression model, the model is generally termed a simple linear regression model. When there is more than one independent variable in the model, the linear model is termed the multiple linear regression model.

Consider a simple linear regression model

y = β_0 + β_1 X + ε

where y is termed the dependent or study variable and X is termed the independent or explanatory variable. The terms β_0 and β_1 are the parameters of the model. The parameter β_0 is termed the intercept term and the parameter β_1 is termed the slope parameter. These parameters are usually called regression coefficients. The unobservable error component ε accounts for the failure of the data to lie on a straight line and represents the difference between the true and observed realizations of y. It is termed the disturbance or error term. There can be several reasons for such a difference, e.g., the effect of all the variables deleted from the model, the variables may be qualitative, inherent randomness in the observations, etc. We assume that ε is an independent and identically distributed random variable with mean zero and constant variance σ^2. Later, we will additionally assume that ε is normally distributed.

The independent variable is viewed as controlled by the experimenter, so it is considered non-stochastic, whereas y is viewed as a random variable with

E(y) = β_0 + β_1 X

and

Var(y) = σ^2.

Sometimes X can also be a random variable. In such a case, instead of the simple mean and simple variance of y, we consider the conditional mean of y given X = x as

E(y | x) = β_0 + β_1 x

and the conditional variance of y given X = x as

Var(y | x) = σ^2.

When the values of β_0, β_1 and σ^2 are known, the model is completely described. The parameters β_0, β_1 and σ^2 are generally unknown and ε is unobserved. The determination of the statistical model y = β_0 + β_1 X + ε depends on the determination (i.e., estimation) of β_0, β_1 and σ^2.

In order to know the values of the parameters, n pairs of observations (x_i, y_i) (i = 1, ..., n) on (X, y) are observed/collected and are used to determine these unknown parameters.

Various methods of estimation can be used to determine the estimates of the parameters. Among them, the least squares and maximum likelihood principles are the most popular methods of estimation.

Least squares estimation

Suppose a sample of n pairs of observations (x_i, y_i) (i = 1, 2, ..., n) is available. These observations are assumed to satisfy the simple linear regression model, so we can write

y_i = β_0 + β_1 x_i + ε_i (i = 1, 2, ..., n).

The method of least squares estimates the parameters β_0 and β_1 by minimizing the sum of squares of the differences between the observations and the line in the scatter diagram. Such an idea can be viewed from different perspectives. When the vertical differences between the observations and the line in the scatter diagram are considered, and their sum of squares is minimized to obtain the estimates of β_0 and β_1, the method is known as direct regression.

Alternatively, the sum of squares of the differences between the observations and the line in the horizontal direction in the scatter diagram can be minimized to obtain the estimates of β_0 and β_1. This is known as the reverse (or inverse) regression method.

Instead of horizontal or vertical errors, if the sum of squares of the perpendicular distances between the observations and the line in the scatter diagram is minimized to obtain the estimates of β_0 and β_1, the method is known as orthogonal regression or the major axis regression method.

Instead of minimizing a distance, an area can also be minimized. The reduced major axis regression method minimizes the sum of the areas of the rectangles defined between the observed data points and the nearest point on the line in the scatter diagram to obtain the estimates of the regression coefficients.

The method of least absolute deviation regression considers the sum of the absolute deviations of the observations from the line in the vertical direction in the scatter diagram, as in the case of direct regression, to obtain the estimates of β_0 and β_1.

No assumption is required about the form of the probability distribution of ε_i in deriving the least squares estimates. For the purpose of deriving statistical inferences only, we assume that the ε_i are random variables with

E(ε_i) = 0, Var(ε_i) = σ^2 and Cov(ε_i, ε_j) = 0 for all i ≠ j (i, j = 1, 2, ..., n).

This assumption is needed to find the mean, variance and other properties of the least squares estimates. The assumption that the ε_i are normally distributed is utilized while constructing the tests of hypotheses and confidence intervals for the parameters.

Based on these approaches, different estimates of β_0 and β_1 are obtained which have different statistical properties. Among them, the direct regression approach is the most popular. Generally, the direct regression estimates are referred to as the least squares estimates or ordinary least squares estimates.
Direct regression method

This method is also known as ordinary least squares estimation. Assume that a set of n paired observations (x_i, y_i), i = 1, 2, ..., n, is available which satisfies the linear regression model y = β_0 + β_1 X + ε. So we can write the model for each observation as y_i = β_0 + β_1 x_i + ε_i (i = 1, 2, ..., n).

The direct regression approach minimizes the sum of squares due to errors given by

S(β_0, β_1) = Σ_{i=1}^n ε_i^2 = Σ_{i=1}^n (y_i − β_0 − β_1 x_i)^2

with respect to β_0 and β_1.

The partial derivative of S(β_0, β_1) with respect to β_0 is

∂S(β_0, β_1)/∂β_0 = −2 Σ_{i=1}^n (y_i − β_0 − β_1 x_i)

and the partial derivative of S(β_0, β_1) with respect to β_1 is

∂S(β_0, β_1)/∂β_1 = −2 Σ_{i=1}^n (y_i − β_0 − β_1 x_i) x_i.

The solutions for β_0 and β_1 are obtained by setting

∂S(β_0, β_1)/∂β_0 = 0,
∂S(β_0, β_1)/∂β_1 = 0.

The solutions of these two equations are called the direct regression estimators, or usually the ordinary least squares (OLS) estimators, of β_0 and β_1.

This gives the ordinary least squares estimates b_0 of β_0 and b_1 of β_1 as

b_0 = ȳ − b_1 x̄
b_1 = s_xy / s_xx

where

s_xy = Σ_{i=1}^n (x_i − x̄)(y_i − ȳ),
s_xx = Σ_{i=1}^n (x_i − x̄)^2,
x̄ = (1/n) Σ_{i=1}^n x_i,
ȳ = (1/n) Σ_{i=1}^n y_i.
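These formulas translate directly into code. A minimal sketch with numpy on made-up data:

import numpy as np

# Hypothetical paired observations (x_i, y_i)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.2, 3.9, 6.1, 8.2, 9.8])

xbar, ybar = x.mean(), y.mean()
s_xy = np.sum((x - xbar) * (y - ybar))
s_xx = np.sum((x - xbar) ** 2)

b1 = s_xy / s_xx          # slope estimate
b0 = ybar - b1 * xbar     # intercept estimate
print(b0, b1)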

Further, we have

∂^2 S(β_0, β_1)/∂β_0^2 = −2 Σ_{i=1}^n (−1) = 2n,
∂^2 S(β_0, β_1)/∂β_1^2 = 2 Σ_{i=1}^n x_i^2,
∂^2 S(β_0, β_1)/∂β_0 ∂β_1 = 2 Σ_{i=1}^n x_i = 2n x̄.

The Hessian matrix, which is the matrix of second order partial derivatives, is in this case given by

H* = [ ∂^2 S/∂β_0^2      ∂^2 S/∂β_0 ∂β_1 ]
     [ ∂^2 S/∂β_0 ∂β_1   ∂^2 S/∂β_1^2    ]

   = 2 [ n     n x̄              ]
       [ n x̄   Σ_{i=1}^n x_i^2  ]

   = 2 (ℓ, x)'(ℓ, x)

where ℓ = (1, 1, ..., 1)' is an n-vector with all elements unity and x = (x_1, ..., x_n)' is the n-vector of observations on X. The matrix H* is positive definite if its determinant and the element in the first row and first column of H* are positive.

The determinant of H* is given by

|H*| = 4 ( n Σ_{i=1}^n x_i^2 − n^2 x̄^2 )
     = 4 n Σ_{i=1}^n (x_i − x̄)^2
     ≥ 0.

The case Σ_{i=1}^n (x_i − x̄)^2 = 0 is not interesting, because then all the observations are identical, i.e., x_i = c (some constant). In such a case there is no relationship between x and y in the context of regression analysis. Since Σ_{i=1}^n (x_i − x̄)^2 > 0, therefore |H*| > 0. So H* is positive definite for any (β_0, β_1); therefore S(β_0, β_1) has a global minimum at (b_0, b_1).

The fitted line, or fitted linear regression model, is

ŷ = b_0 + b_1 x

and the predicted values are

ŷ_i = b_0 + b_1 x_i (i = 1, 2, ..., n).

The difference between the observed value y_i and the fitted (or predicted) value ŷ_i is called a residual. The i-th residual is

e_i = y_i − ŷ_i (i = 1, 2, ..., n)
    = y_i − (b_0 + b_1 x_i).
ECONOMETRIC THEORY

MODULE – II
Lecture - 4
Simple Linear Regression Analysis

Dr. Shalabh
Department of Mathematics and Statistics
Indian Institute of Technology Kanpur

Properties of the direct regression estimators

Unbiasedness

Note that b_1 = s_xy/s_xx and b_0 = ȳ − b_1 x̄ are linear combinations of the y_i (i = 1, ..., n). Therefore

b_1 = Σ_{i=1}^n k_i y_i

where k_i = (x_i − x̄)/s_xx. Note that Σ_{i=1}^n k_i = 0 and Σ_{i=1}^n k_i x_i = 1, so

E(b_1) = Σ_{i=1}^n k_i E(y_i)
       = Σ_{i=1}^n k_i (β_0 + β_1 x_i)
       = β_1.

Thus b_1 is an unbiased estimator of β_1. Next,

E(b_0) = E[ȳ − b_1 x̄]
       = E[β_0 + β_1 x̄ − b_1 x̄]
       = β_0 + β_1 x̄ − β_1 x̄
       = β_0.

Thus b_0 is an unbiased estimator of β_0.
Variances

Using the assumption that the y_i are independently distributed, the variance of b_1 is

Var(b_1) = Σ_i k_i^2 Var(y_i) + Σ_i Σ_{j≠i} k_i k_j Cov(y_i, y_j)
         = σ^2 Σ_i (x_i − x̄)^2 / s_xx^2    (since y_1, ..., y_n are independent)
         = σ^2 s_xx / s_xx^2
         = σ^2 / s_xx.

Similarly, the variance of b_0 is

Var(b_0) = Var(ȳ) + x̄^2 Var(b_1) − 2 x̄ Cov(ȳ, b_1).

First we find

Cov(ȳ, b_1) = E[{ȳ − E(ȳ)}{b_1 − E(b_1)}]
            = E[ ε̄ ( Σ_i k_i y_i − β_1 ) ]
            = (1/n) E[ ( Σ_i ε_i )( β_0 Σ_i k_i + β_1 Σ_i k_i x_i + Σ_i k_i ε_i ) − β_1 Σ_i ε_i ]
            = (1/n) [0 + 0 + 0 + 0]
            = 0,

so

Var(b_0) = σ^2 (1/n + x̄^2/s_xx).

Covariance

The covariance between b_0 and b_1 is

Cov(b_0, b_1) = Cov(ȳ, b_1) − x̄ Var(b_1)
              = −(x̄/s_xx) σ^2.

It can further be shown that the ordinary least squares estimators b_0 and b_1 possess the minimum variance in the class of linear and unbiased estimators, so they are termed the Best Linear Unbiased Estimators (BLUE). This property is known as the Gauss-Markov theorem, which is discussed later in the module on the multiple linear regression model.
Residual sum of squares

The residual sum of squares is given by

SS_res = Σ_{i=1}^n e_i^2
       = Σ_{i=1}^n (y_i − ŷ_i)^2
       = Σ_{i=1}^n (y_i − b_0 − b_1 x_i)^2
       = Σ_{i=1}^n [(y_i − ȳ + b_1 x̄ − b_1 x_i)]^2
       = Σ_{i=1}^n [(y_i − ȳ) − b_1 (x_i − x̄)]^2
       = Σ_{i=1}^n (y_i − ȳ)^2 + b_1^2 Σ_{i=1}^n (x_i − x̄)^2 − 2 b_1 Σ_{i=1}^n (x_i − x̄)(y_i − ȳ)
       = s_yy + b_1^2 s_xx − 2 b_1^2 s_xx
       = s_yy − b_1^2 s_xx
       = s_yy − (s_xy/s_xx)^2 s_xx
       = s_yy − s_xy^2 / s_xx
       = s_yy − b_1 s_xy,

where

s_yy = Σ_{i=1}^n (y_i − ȳ)^2, ȳ = (1/n) Σ_{i=1}^n y_i.

Estimation of σ^2

The estimator of σ^2 is obtained from the residual sum of squares as follows. Assuming that y_i is normally distributed, SS_res has a χ^2 distribution with (n − 2) degrees of freedom, i.e.,

SS_res / σ^2 ~ χ^2(n − 2).

Thus, using the result about the expectation of a chi-square random variable, we have

E(SS_res) = (n − 2) σ^2.

Thus an unbiased estimator of σ^2 is

s^2 = SS_res / (n − 2).

Note that SS_res has only (n − 2) degrees of freedom. The two degrees of freedom are lost due to the estimation of b_0 and b_1. Since s^2 depends on the estimates b_0 and b_1, it is a model-dependent estimate of σ^2.

Estimates of the variances of b_0 and b_1

The estimators of the variances of b_0 and b_1 are obtained by replacing σ^2 by its estimate σ̂^2 = s^2 as follows:

Var̂(b_0) = s^2 (1/n + x̄^2/s_xx)

and

Var̂(b_1) = s^2 / s_xx.

It is observed that since Σ_{i=1}^n (y_i − ŷ_i) = 0, we have Σ_{i=1}^n e_i = 0. In the light of this property, e_i can be regarded as an estimate of the unknown ε_i (i = 1, ..., n). This helps in verifying the different model assumptions on the basis of the given sample (x_i, y_i), i = 1, 2, ..., n.

Further, note that

(i) Σ_{i=1}^n x_i e_i = 0,
(ii) Σ_{i=1}^n ŷ_i e_i = 0,
(iii) Σ_{i=1}^n y_i = Σ_{i=1}^n ŷ_i, and
(iv) the fitted line always passes through (x̄, ȳ).
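These residual properties are easy to check numerically. A short sketch, reusing the OLS formulas above on made-up data (numpy assumed):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.2, 3.9, 6.1, 8.2, 9.8])

xbar, ybar = x.mean(), y.mean()
b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
b0 = ybar - b1 * xbar

yhat = b0 + b1 * x
e = y - yhat                              # residuals
s2 = np.sum(e ** 2) / (len(x) - 2)        # unbiased estimate of sigma^2

print(np.isclose(e.sum(), 0))             # sum of residuals is zero
print(np.isclose((x * e).sum(), 0))       # property (i)
print(np.isclose((yhat * e).sum(), 0))    # property (ii)
print(np.isclose(y.sum(), yhat.sum()))    # property (iii)
print(np.isclose(b0 + b1 * xbar, ybar))   # property (iv): line passes through (xbar, ybar)
print(s2)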



Centered model

Sometimes it is useful to measure the independent variable around its mean. In such a case, the model y_i = β_0 + β_1 x_i + ε_i has the centered version

y_i = β_0 + β_1 (x_i − x̄) + β_1 x̄ + ε_i (i = 1, 2, ..., n)
    = β_0* + β_1 (x_i − x̄) + ε_i

where β_0* = β_0 + β_1 x̄. The sum of squares due to error is given by

S(β_0*, β_1) = Σ_{i=1}^n ε_i^2 = Σ_{i=1}^n [ y_i − β_0* − β_1 (x_i − x̄) ]^2.

Now solving

∂S(β_0*, β_1)/∂β_0* = 0,
∂S(β_0*, β_1)/∂β_1 = 0,

we get the direct regression least squares estimates of β_0* and β_1 as

b_0* = ȳ

and

b_1 = s_xy / s_xx,

respectively. Thus the form of the estimate of the slope parameter β_1 remains the same in the usual and centered models, whereas the form of the estimate of the intercept term changes.

Further, the Hessian matrix of second order partial derivatives of S(β_0*, β_1) with respect to β_0* and β_1 is positive definite at β_0* = b_0* and β_1 = b_1, which ensures that S(β_0*, β_1) is minimized at β_0* = b_0* and β_1 = b_1.

Under the assumptions that E(ε_i) = 0, Var(ε_i) = σ^2 and Cov(ε_i, ε_j) = 0 for all i ≠ j = 1, 2, ..., n, it follows that

E(b_0*) = β_0*, E(b_1) = β_1,
Var(b_0*) = σ^2/n, Var(b_1) = σ^2/s_xx.

In this case, the fitted model of y_i = β_0* + β_1 (x_i − x̄) + ε_i is

ŷ = ȳ + b_1 (x − x̄),

and the predicted values are

ŷ_i = ȳ + b_1 (x_i − x̄) (i = 1, ..., n).

Note that in the centered model

Cov(b_0*, b_1) = 0.

No intercept term model

Sometimes in practice a model without an intercept term is used in those situations where x_i = 0 ⇒ y_i = 0 for all i = 1, 2, ..., n. A no-intercept model is

y_i = β_1 x_i + ε_i (i = 1, 2, ..., n).

For example, in analyzing the relationship between the illumination of a bulb (y) and the electric current (X), the illumination of the bulb is zero when the current is zero.

Using the data (x_i, y_i), i = 1, 2, ..., n, the direct regression least squares estimate of β_1 is obtained by minimizing

S(β_1) = Σ_{i=1}^n ε_i^2 = Σ_{i=1}^n (y_i − β_1 x_i)^2

and solving

∂S(β_1)/∂β_1 = 0

gives the estimator of β_1 as

b_1* = Σ_{i=1}^n y_i x_i / Σ_{i=1}^n x_i^2.

The second order partial derivative of S(β_1) with respect to β_1 at β_1 = b_1* is positive, which ensures that b_1* minimizes S(β_1).

Using the assumptions that E(ε_i) = 0, Var(ε_i) = σ^2 and Cov(ε_i, ε_j) = 0 for all i ≠ j = 1, 2, ..., n, the properties of b_1* can be derived as follows:

E(b_1*) = Σ_{i=1}^n x_i E(y_i) / Σ_{i=1}^n x_i^2
        = Σ_{i=1}^n x_i^2 β_1 / Σ_{i=1}^n x_i^2
        = β_1.

Thus b_1* is an unbiased estimator of β_1. The variance of b_1* is obtained as follows:

Var(b_1*) = Σ_{i=1}^n x_i^2 Var(y_i) / ( Σ_{i=1}^n x_i^2 )^2
          = σ^2 Σ_{i=1}^n x_i^2 / ( Σ_{i=1}^n x_i^2 )^2
          = σ^2 / Σ_{i=1}^n x_i^2,

and an unbiased estimator of σ^2 is

( Σ_{i=1}^n y_i^2 − b_1* Σ_{i=1}^n y_i x_i ) / (n − 1).
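A quick numerical sketch of the no-intercept estimator b_1* = Σ y_i x_i / Σ x_i^2 and the associated σ^2 estimate (numpy assumed, data made up):

import numpy as np

# Hypothetical data where y is roughly proportional to x (y = 0 when x = 0)
x = np.array([0.5, 1.0, 1.5, 2.0, 2.5])
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1])

b1_star = np.sum(y * x) / np.sum(x ** 2)           # no-intercept slope estimate
sigma2_hat = (np.sum(y ** 2) - b1_star * np.sum(y * x)) / (len(x) - 1)

print(b1_star, sigma2_hat)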
ECONOMETRIC THEORY

MODULE – II
Lecture - 5
Simple Linear Regression Analysis

Dr. Shalabh
Department of Mathematics and Statistics
Indian Institute of Technology Kanpur

Maximum likelihood estimation

We assume that the ε_i (i = 1, 2, ..., n) are independent and identically distributed following a normal distribution N(0, σ^2). Now we use the method of maximum likelihood to estimate the parameters of the linear regression model

y_i = β_0 + β_1 x_i + ε_i (i = 1, 2, ..., n),

so the observations y_i (i = 1, 2, ..., n) are independently distributed as N(β_0 + β_1 x_i, σ^2) for all i = 1, 2, ..., n. The likelihood function of the given observations (x_i, y_i) and the unknown parameters β_0, β_1 and σ^2 is

L(x_i, y_i; β_0, β_1, σ^2) = Π_{i=1}^n (1/(2πσ^2))^{1/2} exp[ −(1/(2σ^2)) (y_i − β_0 − β_1 x_i)^2 ].

The maximum likelihood estimates of β_0, β_1 and σ^2 can be obtained by maximizing L(x_i, y_i; β_0, β_1, σ^2) or, equivalently, ln L(x_i, y_i; β_0, β_1, σ^2), where

ln L(x_i, y_i; β_0, β_1, σ^2) = −(n/2) ln 2π − (n/2) ln σ^2 − (1/(2σ^2)) Σ_{i=1}^n (y_i − β_0 − β_1 x_i)^2.

The normal equations are obtained by partially differentiating the log-likelihood with respect to β_0, β_1 and σ^2 and equating the derivatives to zero:

∂ ln L/∂β_0 = (1/σ^2) Σ_{i=1}^n (y_i − β_0 − β_1 x_i) = 0,
∂ ln L/∂β_1 = (1/σ^2) Σ_{i=1}^n (y_i − β_0 − β_1 x_i) x_i = 0,

and

∂ ln L/∂σ^2 = −n/(2σ^2) + (1/(2σ^4)) Σ_{i=1}^n (y_i − β_0 − β_1 x_i)^2 = 0.
The solution of these normal equations gives the maximum likelihood estimates of β_0, β_1 and σ^2 as

b_0 = ȳ − b_1 x̄,
b_1 = Σ_{i=1}^n (x_i − x̄)(y_i − ȳ) / Σ_{i=1}^n (x_i − x̄)^2 = s_xy / s_xx,

and

σ̃^2 = (1/n) Σ_{i=1}^n (y_i − b_0 − b_1 x_i)^2,

respectively.

It can be verified that the Hessian matrix of second order partial derivatives of ln L with respect to β_0, β_1 and σ^2 is negative definite at β_0 = b_0, β_1 = b_1 and σ^2 = σ̃^2, which ensures that the likelihood function is maximized at these values.

Note that the least squares and maximum likelihood estimates of β_0 and β_1 are identical when the disturbances are normally distributed. The least squares and maximum likelihood estimates of σ^2 are different. In fact, the least squares (unbiased) estimate of σ^2 is

s^2 = (1/(n − 2)) Σ_{i=1}^n (y_i − ŷ_i)^2,

so it is related to the maximum likelihood estimate as σ̃^2 = ((n − 2)/n) s^2.

Thus b_0 and b_1 are unbiased estimators of β_0 and β_1, whereas σ̃^2 is a biased estimate of σ^2; it is, however, asymptotically unbiased. The variances of the maximum likelihood estimators b_0 and b_1 are the same as those of the least squares estimators, but for the variance estimators the mean squared errors satisfy MSE(σ̃^2) < Var(s^2).



Testing of hypotheses and confidence interval estimation for the slope parameter

Now we consider the tests of hypotheses and confidence interval estimation for the slope parameter of the model under two cases, viz., when σ^2 is known and when σ^2 is unknown.

Case 1: When σ^2 is known

Consider the simple linear regression model y_i = β_0 + β_1 x_i + ε_i (i = 1, 2, ..., n). It is assumed that the ε_i are independent and identically distributed and follow N(0, σ^2).

First we develop a test for the null hypothesis related to the slope parameter

H_0: β_1 = β_10

where β_10 is some given constant.

Assuming σ^2 to be known, we know that E(b_1) = β_1, Var(b_1) = σ^2/s_xx, and b_1 is a linear combination of normally distributed y_i, so

b_1 ~ N(β_1, σ^2/s_xx)

and the following statistic can be constructed:

Z_1 = (b_1 − β_10) / sqrt(σ^2/s_xx)

which is distributed as N(0, 1) when H_0 is true.

A decision rule to test H_1: β_1 ≠ β_10 can be framed as follows:

Reject H_0 if |Z_1| > z_{α/2}

where z_{α/2} is the α/2 percentage point of the normal distribution. Similarly, the decision rule for a one-sided alternative hypothesis can also be framed.

The 100(1 − α)% confidence interval for β_1 can be obtained using the Z_1 statistic as follows:

P( −z_{α/2} ≤ Z_1 ≤ z_{α/2} ) = 1 − α
P( −z_{α/2} ≤ (b_1 − β_1)/sqrt(σ^2/s_xx) ≤ z_{α/2} ) = 1 − α
P( b_1 − z_{α/2} sqrt(σ^2/s_xx) ≤ β_1 ≤ b_1 + z_{α/2} sqrt(σ^2/s_xx) ) = 1 − α.

So the 100(1 − α)% confidence interval for β_1 is

[ b_1 − z_{α/2} sqrt(σ^2/s_xx), b_1 + z_{α/2} sqrt(σ^2/s_xx) ]

where z_{α/2} is the α/2 percentage point of the N(0, 1) distribution.

Case 2: When σ^2 is unknown

When σ^2 is unknown, we proceed as follows. We know that

SS_res / σ^2 ~ χ^2(n − 2)

and

E[ SS_res / (n − 2) ] = σ^2.

Further, SS_res/σ^2 and b_1 are independently distributed. This result will be proved formally later in the module on the multiple linear regression model. It also follows from the result that, under a normal distribution, the maximum likelihood estimates, viz., the sample mean (estimator of the population mean) and the sample variance (estimator of the population variance), are independently distributed; so b_1 and s^2 are also independently distributed.

Thus the following statistic can be constructed:

t_0 = (b_1 − β_10) / sqrt(σ̂^2/s_xx)
    = (b_1 − β_10) / sqrt( SS_res / ((n − 2) s_xx) )

which follows a t-distribution with (n − 2) degrees of freedom, denoted t_{n−2}, when H_0 is true.

A decision rule to test H_1: β_1 ≠ β_10 is to

reject H_0 if |t_0| > t_{n−2, α/2}

where t_{n−2, α/2} is the α/2 percentage point of the t-distribution with (n − 2) degrees of freedom. Similarly, the decision rule for a one-sided alternative hypothesis can also be framed.

The 100(1 − α)% confidence interval for β_1 can be obtained using the t_0 statistic as follows:

P( −t_{α/2} ≤ t_0 ≤ t_{α/2} ) = 1 − α
P( −t_{α/2} ≤ (b_1 − β_1)/sqrt(σ̂^2/s_xx) ≤ t_{α/2} ) = 1 − α
P( b_1 − t_{α/2} sqrt(σ̂^2/s_xx) ≤ β_1 ≤ b_1 + t_{α/2} sqrt(σ̂^2/s_xx) ) = 1 − α.

So the 100(1 − α)% confidence interval for β_1 is

[ b_1 − t_{n−2, α/2} sqrt( SS_res/((n − 2) s_xx) ), b_1 + t_{n−2, α/2} sqrt( SS_res/((n − 2) s_xx) ) ].
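A sketch of this t-based interval on made-up data (numpy and scipy assumed):

import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 4.3, 5.8, 8.1, 9.7, 12.2])
n = len(x)

xbar, ybar = x.mean(), y.mean()
s_xx = np.sum((x - xbar) ** 2)
b1 = np.sum((x - xbar) * (y - ybar)) / s_xx
b0 = ybar - b1 * xbar
ss_res = np.sum((y - b0 - b1 * x) ** 2)

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
se_b1 = np.sqrt(ss_res / ((n - 2) * s_xx))

print(b1 - t_crit * se_b1, b1 + t_crit * se_b1)   # 95% confidence interval for beta_1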

Testing of hypotheses and confidence interval estimation for the intercept term

Now we consider the tests of hypotheses and confidence interval estimation for the intercept term under two cases, viz., when σ^2 is known and when σ^2 is unknown.

Case 1: When σ^2 is known

Suppose the null hypothesis under consideration is H_0: β_0 = β_00, where σ^2 is known. Then, using the results that E(b_0) = β_0, Var(b_0) = σ^2 (1/n + x̄^2/s_xx) and that b_0 is a linear combination of normally distributed random variables, the statistic

Z_0 = (b_0 − β_00) / sqrt( σ^2 (1/n + x̄^2/s_xx) )

has a N(0, 1) distribution when H_0 is true.

A decision rule to test H_1: β_0 ≠ β_00 can be framed as follows:

Reject H_0 if |Z_0| > z_{α/2}

where z_{α/2} is the α/2 percentage point of the normal distribution. Similarly, the decision rule for a one-sided alternative hypothesis can also be framed.

The 100(1 − α)% confidence interval for β_0 when σ^2 is known can be derived using the Z_0 statistic as follows:

P( −z_{α/2} ≤ Z_0 ≤ z_{α/2} ) = 1 − α
P( −z_{α/2} ≤ (b_0 − β_0)/sqrt( σ^2 (1/n + x̄^2/s_xx) ) ≤ z_{α/2} ) = 1 − α
P( b_0 − z_{α/2} sqrt( σ^2 (1/n + x̄^2/s_xx) ) ≤ β_0 ≤ b_0 + z_{α/2} sqrt( σ^2 (1/n + x̄^2/s_xx) ) ) = 1 − α.

So the 100(1 − α)% confidence interval for β_0 is

[ b_0 − z_{α/2} sqrt( σ^2 (1/n + x̄^2/s_xx) ), b_0 + z_{α/2} sqrt( σ^2 (1/n + x̄^2/s_xx) ) ].

Case 2: When σ^2 is unknown

When σ^2 is unknown, the statistic is constructed as

t_0 = (b_0 − β_00) / sqrt( (SS_res/(n − 2)) (1/n + x̄^2/s_xx) )

which follows a t-distribution with (n − 2) degrees of freedom, i.e., t_{n−2}, when H_0 is true.

A decision rule to test H_1: β_0 ≠ β_00 is as follows:

Reject H_0 whenever |t_0| > t_{n−2, α/2}

where t_{n−2, α/2} is the α/2 percentage point of the t-distribution with (n − 2) degrees of freedom. Similarly, the decision rule for a one-sided alternative hypothesis can also be framed.
The 100(1 − α)% confidence interval for β_0 can be obtained as follows. Consider

P( −t_{n−2, α/2} ≤ t_0 ≤ t_{n−2, α/2} ) = 1 − α
P( −t_{n−2, α/2} ≤ (b_0 − β_0)/sqrt( (SS_res/(n − 2)) (1/n + x̄^2/s_xx) ) ≤ t_{n−2, α/2} ) = 1 − α
P( b_0 − t_{n−2, α/2} sqrt( (SS_res/(n − 2)) (1/n + x̄^2/s_xx) ) ≤ β_0 ≤ b_0 + t_{n−2, α/2} sqrt( (SS_res/(n − 2)) (1/n + x̄^2/s_xx) ) ) = 1 − α.

The 100(1 − α)% confidence interval for β_0 is

[ b_0 − t_{n−2, α/2} sqrt( (SS_res/(n − 2)) (1/n + x̄^2/s_xx) ), b_0 + t_{n−2, α/2} sqrt( (SS_res/(n − 2)) (1/n + x̄^2/s_xx) ) ].

Confidence interval for σ^2

A confidence interval for σ^2 can also be derived as follows. Since SS_res/σ^2 ~ χ^2_{n−2}, consider

P( χ^2_{n−2, α/2} ≤ SS_res/σ^2 ≤ χ^2_{n−2, 1−α/2} ) = 1 − α
P( SS_res/χ^2_{n−2, 1−α/2} ≤ σ^2 ≤ SS_res/χ^2_{n−2, α/2} ) = 1 − α.

The corresponding 100(1 − α)% confidence interval for σ^2 is

[ SS_res/χ^2_{n−2, 1−α/2}, SS_res/χ^2_{n−2, α/2} ].
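A sketch of this σ^2 interval on made-up data (numpy and scipy assumed):

import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 4.3, 5.8, 8.1, 9.7, 12.2])
n = len(x)

xbar, ybar = x.mean(), y.mean()
b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
b0 = ybar - b1 * xbar
ss_res = np.sum((y - b0 - b1 * x) ** 2)

alpha = 0.05
lower = ss_res / stats.chi2.ppf(1 - alpha / 2, df=n - 2)
upper = ss_res / stats.chi2.ppf(alpha / 2, df=n - 2)
print(lower, upper)     # 95% confidence interval for sigma^2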
ECONOMETRIC THEORY

MODULE – II
Lecture - 6
Simple Linear Regression Analysis

Dr. Shalabh
Department of Mathematics and Statistics
Indian Institute of Technology Kanpur

Joint confidence region for β_0 and β_1

A joint confidence region for β_0 and β_1 can also be found. Such a region provides 100(1 − α)% confidence that the estimates of β_0 and β_1 are simultaneously correct. Consider the centered version of the linear regression model,

y_i = β_0* + β_1 (x_i − x̄) + ε_i,

where β_0* = β_0 + β_1 x̄. The least squares estimators of β_0* and β_1 are b_0* = ȳ and b_1 = s_xy/s_xx, respectively.

We use the results that

E(b_0*) = β_0*, E(b_1) = β_1,
Var(b_0*) = σ^2/n, Var(b_1) = σ^2/s_xx.

When σ^2 is known, the statistics

(b_0* − β_0*) / sqrt(σ^2/n) ~ N(0, 1)

and

(b_1 − β_1) / sqrt(σ^2/s_xx) ~ N(0, 1).
Moreover, both statistics are independently distributed. Thus

[ (b_0* − β_0*) / sqrt(σ^2/n) ]^2 ~ χ^2_1

and

[ (b_1 − β_1) / sqrt(σ^2/s_xx) ]^2 ~ χ^2_1

are also independently distributed, because b_0* and b_1 are independently distributed. Consequently, their sum

n (b_0* − β_0*)^2 / σ^2 + s_xx (b_1 − β_1)^2 / σ^2 ~ χ^2_2.

Since

SS_res / σ^2 ~ χ^2_{n−2}

and SS_res is distributed independently of b_0* and b_1, the ratio

[ ( n (b_0* − β_0*)^2/σ^2 + s_xx (b_1 − β_1)^2/σ^2 ) / 2 ] / [ ( SS_res/σ^2 ) / (n − 2) ] ~ F_{2, n−2}.

Substituting b_0* = b_0 + b_1 x̄ and β_0* = β_0 + β_1 x̄, we get

( (n − 2)/2 ) ( Q_f / SS_res ) ~ F_{2, n−2}

where

Q_f = n (b_0 − β_0)^2 + 2 Σ_{i=1}^n x_i (b_0 − β_0)(b_1 − β_1) + Σ_{i=1}^n x_i^2 (b_1 − β_1)^2.

Since

P[ ( (n − 2)/2 ) ( Q_f / SS_res ) ≤ F_{2, n−2; α} ] = 1 − α

holds true for all values of β_0 and β_1, the 100(1 − α)% confidence region for β_0 and β_1 is

( (n − 2)/2 ) ( Q_f / SS_res ) ≤ F_{2, n−2; α}.

This confidence region is an ellipse which gives 100(1 − α)% probability that β_0 and β_1 are contained simultaneously within it.
Analysis of variance

The technique of analysis of variance is usually used for testing hypotheses related to the equality of more than one parameter, like population means or slope parameters. It is more meaningful in the case of the multiple regression model, when there is more than one slope parameter. The technique is discussed and illustrated here to explain the related basic concepts and fundamentals, which will be used in developing the analysis of variance in the next module on the multiple linear regression model, where there are several explanatory variables.

A test statistic for testing H_0: β_1 = 0 can also be formulated using the analysis of variance technique as follows. On the basis of the identity

y_i − ŷ_i = (y_i − ȳ) − (ŷ_i − ȳ),

the sum of squared residuals is

S(b) = Σ_{i=1}^n (y_i − ŷ_i)^2
     = Σ_{i=1}^n (y_i − ȳ)^2 + Σ_{i=1}^n (ŷ_i − ȳ)^2 − 2 Σ_{i=1}^n (y_i − ȳ)(ŷ_i − ȳ).

Further, consider

Σ_{i=1}^n (y_i − ȳ)(ŷ_i − ȳ) = Σ_{i=1}^n (y_i − ȳ) b_1 (x_i − x̄)
                             = b_1^2 Σ_{i=1}^n (x_i − x̄)^2
                             = Σ_{i=1}^n (ŷ_i − ȳ)^2.

Thus we have

Σ_{i=1}^n (y_i − ȳ)^2 = Σ_{i=1}^n (y_i − ŷ_i)^2 + Σ_{i=1}^n (ŷ_i − ȳ)^2.

The term Σ_{i=1}^n (y_i − ȳ)^2 is called the sum of squares about the mean, or the corrected sum of squares of y (i.e., SS_corrected), or the total sum of squares, denoted s_yy.
The term Σ_{i=1}^n (y_i − ŷ_i)^2 describes the deviation "observation minus predicted value", viz., the residual sum of squares,

SS_res = Σ_{i=1}^n (y_i − ŷ_i)^2,

whereas the term Σ_{i=1}^n (ŷ_i − ȳ)^2 describes the variability explained by the regression,

SS_reg = Σ_{i=1}^n (ŷ_i − ȳ)^2.

If all observations y_i are located on a straight line, then Σ_{i=1}^n (y_i − ŷ_i)^2 = 0 and thus SS_corrected = SS_reg.

Note that SS_reg is completely determined by b_1 and so has only one degree of freedom. The total sum of squares s_yy = Σ_{i=1}^n (y_i − ȳ)^2 has (n − 1) degrees of freedom due to the constraint Σ_{i=1}^n (y_i − ȳ) = 0, and SS_res has (n − 2) degrees of freedom as it depends on b_0 and b_1.

If the errors are normally distributed, these sums of squares, divided by σ^2, are mutually independent and distributed as χ^2_df with the corresponding df degrees of freedom.

The mean square due to regression is

MS_reg = SS_reg / 1

and the mean square due to residuals is

MSE = SS_res / (n − 2).

The test statistic for testing H_0: β_1 = 0 is

F_0 = MS_reg / MSE.
If H_0: β_1 = 0 is true, then MS_reg and MSE are independently distributed and thus F_0 ~ F_{1, n−2}.

The decision rule for H_1: β_1 ≠ 0 is to reject H_0 if F_0 > F_{1, n−2; 1−α} at the α level of significance. The test procedure can be summarized in an analysis of variance table.

Analysis of variance for testing H_0: β_1 = 0

Source of variation | Sum of squares | Degrees of freedom | Mean square | F
Regression          | SS_reg         | 1                  | MS_reg      | MS_reg / MSE
Residual            | SS_res         | n − 2              | MSE         |
Total               | s_yy           | n − 1              |             |

Some other forms of SS_reg, SS_res and s_yy can be derived as follows. The sample correlation coefficient may be written as

r_xy = s_xy / sqrt(s_xx s_yy).

Moreover, we have

b_1 = s_xy / s_xx = r_xy sqrt(s_yy / s_xx).

The estimator of σ^2 in this case may be expressed as

s^2 = (1/(n − 2)) Σ_{i=1}^n e_i^2 = SS_res / (n − 2).
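A sketch of this F test on made-up data (numpy and scipy assumed):

import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.3, 3.1, 4.8, 5.2, 6.9, 7.1, 8.8, 9.4])
n = len(x)

xbar, ybar = x.mean(), y.mean()
s_xx = np.sum((x - xbar) ** 2)
s_xy = np.sum((x - xbar) * (y - ybar))
s_yy = np.sum((y - ybar) ** 2)

b1 = s_xy / s_xx
ss_reg = b1 * s_xy                  # equals b1^2 * s_xx
ss_res = s_yy - ss_reg

ms_reg = ss_reg / 1
mse = ss_res / (n - 2)
F0 = ms_reg / mse
p_value = stats.f.sf(F0, 1, n - 2)  # P(F_{1, n-2} > F0)

print(F0, p_value)                  # reject H0: beta_1 = 0 for small p-values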

Various alternative formulations of SS_res are in use as well:

SS_res = Σ_{i=1}^n [ y_i − (b_0 + b_1 x_i) ]^2
       = Σ_{i=1}^n [ (y_i − ȳ) − b_1 (x_i − x̄) ]^2
       = s_yy + b_1^2 s_xx − 2 b_1 s_xy
       = s_yy − b_1^2 s_xx
       = s_yy − s_xy^2 / s_xx.

Using this result, we find that

SS_corrected = s_yy

and

SS_reg = s_yy − SS_res = s_xy^2 / s_xx = b_1^2 s_xx = b_1 s_xy.

Goodness of fit of regression

A fitted model can be said to be good when the residuals are small. Since SS_res is based on the residuals, a measure of the quality of the fitted model can be based on SS_res. When an intercept term is present in the model, a measure of goodness of fit of the model is given by

R^2 = 1 − SS_res / s_yy = SS_reg / s_yy.

This is known as the coefficient of determination. The measure is based on the idea of how much of the variation in y, measured by s_yy, is explained by SS_reg and how much of the unexplained part is contained in SS_res. The ratio SS_reg/s_yy describes the proportion of variability that is explained by the regression in relation to the total variability of y. The ratio SS_res/s_yy describes the proportion of variability that is not explained by the regression.

It can be seen that

R^2 = r_xy^2

where r_xy is the simple correlation coefficient between x and y. Clearly 0 ≤ R^2 ≤ 1, so a value of R^2 closer to one indicates a better fit and a value of R^2 closer to zero indicates a poor fit.
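A short sketch computing R^2 both ways and checking R^2 = r_xy^2 (numpy assumed, data made up):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 4.3, 5.8, 8.1, 9.7, 12.2])

xbar, ybar = x.mean(), y.mean()
s_xx = np.sum((x - xbar) ** 2)
s_xy = np.sum((x - xbar) * (y - ybar))
s_yy = np.sum((y - ybar) ** 2)

b1 = s_xy / s_xx
ss_res = s_yy - b1 * s_xy
R2 = 1 - ss_res / s_yy
r_xy = s_xy / np.sqrt(s_xx * s_yy)

print(R2, np.isclose(R2, r_xy ** 2))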
ECONOMETRIC THEORY

MODULE – II
Lecture - 7
Simple Linear Regression Analysis

Dr. Shalabh
Department of Mathematics and Statistics
Indian Institute of Technology Kanpur

Reverse regression method

The reverse (or inverse) regression approach minimizes the sum of squares of the horizontal distances between the observed data points and the line in the scatter diagram to obtain the estimates of the regression parameters.

The reverse regression has been advocated in the analysis of sex (or race) discrimination in salaries. For example, if y
denotes salary and x denotes qualifications and we are interested in determining if there is a sex discrimination in salaries,
we can ask:
“Whether men and women with the same qualifications (value of x) are getting the same salaries (value of y). This
question is answered by the direct regression.”
Alternatively, we can ask:
“Whether men and women with the same salaries (value of y) have the same qualifications (value of x). This
question is answered by the reverse regression, i.e., regression of x on y.”

The regression equation in the case of reverse regression can be written as

x_i = β_0* + β_1* y_i + δ_i (i = 1, 2, ..., n)

where the δ_i are the associated random error components and satisfy assumptions analogous to those of the usual simple linear regression model.

The reverse regression estimates β̂_OR of β_0* and β̂_1R of β_1* for this model are obtained by interchanging x and y in the direct regression estimators of β_0 and β_1. The estimates are

β̂_OR = x̄ − β̂_1R ȳ

and

β̂_1R = s_xy / s_yy

for β_0* and β_1*, respectively. The residual sum of squares in this case is

SS*_res = s_xx − s_xy^2 / s_yy.

Note that

β̂_1R b_1 = s_xy^2 / (s_xx s_yy) = r_xy^2

where b_1 is the direct regression estimator of the slope parameter and r_xy is the correlation coefficient between x and y. Hence if r_xy^2 is close to 1, the two regression lines will be close to each other.

An important application of the reverse regression method is in solving the calibration problem.
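A sketch contrasting the direct and reverse slope estimates and verifying β̂_1R · b_1 = r_xy^2 (numpy assumed, data made up):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.4, 3.1, 5.2, 5.9, 8.3, 9.1])

xbar, ybar = x.mean(), y.mean()
s_xx = np.sum((x - xbar) ** 2)
s_yy = np.sum((y - ybar) ** 2)
s_xy = np.sum((x - xbar) * (y - ybar))

b1_direct = s_xy / s_xx        # slope of the regression of y on x
b1_reverse = s_xy / s_yy       # slope of the regression of x on y
r_xy = s_xy / np.sqrt(s_xx * s_yy)

print(b1_direct, b1_reverse)
print(np.isclose(b1_direct * b1_reverse, r_xy ** 2))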


4

Orthogonal regression method (or major axis regression method)

The direct and reverse regression methods of estimation assume that the errors in the observations are either in x-direction
or y-direction. In other words, the errors can be either in dependent variable or independent variable. There can be
situations when uncertainties are involved in dependent and independent variables both. In such situations, the orthogonal
regression is more appropriate. In order to take care of errors in both the directions, the least squares principle in
orthogonal regression minimizes the squared perpendicular distance between the observed data points and the line in the
following scatter diagram to obtain the estimates of regression coefficients. This is also known as major axis regression
method. The estimates obtained are called as orthogonal regression
estimates or major axis regression estimates of the regression coefficients.

If we assume that the regression line to be fitted is Y_i = β_0 + β_1 X_i, then it is expected that all the observations (x_i, y_i), i = 1, 2, ..., n lie on this line. But these points deviate from the line, and in such a case the squared perpendicular distance of an observed data point (x_i, y_i) (i = 1, 2, ..., n) from the line is given by

d_i² = (X_i - x_i)² + (Y_i - y_i)²

where (X_i, Y_i) denotes the ith pair of observations without any error which lie on the line.

Figure: Orthogonal or major axis regression method (scatter diagram showing the perpendicular distances d_i).


5
The objective is to minimize the sum of squared perpendicular distances Σ_{i=1}^n d_i² to obtain the estimates of β_0 and β_1.

The observations (x_i, y_i) (i = 1, 2, ..., n) are expected to lie on the line

Y_i = β_0 + β_1 X_i,

so let

E_i = Y_i - β_0 - β_1 X_i = 0.

The regression coefficients are obtained by minimizing Σ_{i=1}^n d_i² under the constraints E_i's using the Lagrangian multiplier method. The Lagrangian function is

L_0 = Σ_{i=1}^n d_i² - 2 Σ_{i=1}^n λ_i E_i

where λ_1, ..., λ_n are the Lagrangian multipliers.

The set of equations is obtained by setting

∂L_0/∂X_i = 0, ∂L_0/∂Y_i = 0, ∂L_0/∂β_0 = 0 and ∂L_0/∂β_1 = 0 (i = 1, 2, ..., n).
Thus we find

∂L_0/∂X_i = (X_i - x_i) + λ_i β_1 = 0

∂L_0/∂Y_i = (Y_i - y_i) - λ_i = 0

∂L_0/∂β_0 = Σ_{i=1}^n λ_i = 0

∂L_0/∂β_1 = Σ_{i=1}^n λ_i X_i = 0.
Since

X_i = x_i - λ_i β_1
Y_i = y_i + λ_i,

substituting these values in E_i, we obtain

E_i = (y_i + λ_i) - β_0 - β_1(x_i - λ_i β_1) = 0
⇒ λ_i = (β_0 + β_1 x_i - y_i) / (1 + β_1²).

Also, using this λ_i in the equation Σ_{i=1}^n λ_i = 0, we get

Σ_{i=1}^n (β_0 + β_1 x_i - y_i) / (1 + β_1²) = 0

and using (X_i - x_i) + λ_i β_1 = 0 and Σ_{i=1}^n λ_i X_i = 0, we get

Σ_{i=1}^n λ_i (x_i - λ_i β_1) = 0.

Substituting λ_i in this equation, we get

Σ_{i=1}^n (β_0 x_i + β_1 x_i² - y_i x_i) / (1 + β_1²) - β_1 Σ_{i=1}^n (β_0 + β_1 x_i - y_i)² / (1 + β_1²)² = 0.     (1)

Using λ_i in the equation Σ_{i=1}^n λ_i = 0, we solve

Σ_{i=1}^n (β_0 + β_1 x_i - y_i) / (1 + β_1²) = 0.

The solution provides the orthogonal regression estimate of β_0 as

β̂_0OR = ȳ - β̂_1OR x̄

where β̂_1OR is the orthogonal regression estimate of β_1.

Now, substituting β̂_0OR in equation (1), we get

Σ_{i=1}^n (1 + β_1²)[ȳ x_i - β_1 x̄ x_i + β_1 x_i² - x_i y_i] - β_1 Σ_{i=1}^n (ȳ - β_1 x̄ + β_1 x_i - y_i)² = 0

or

(1 + β_1²) Σ_{i=1}^n x_i [(y_i - ȳ) - β_1(x_i - x̄)] + β_1 Σ_{i=1}^n [-(y_i - ȳ) + β_1(x_i - x̄)]² = 0

or

(1 + β_1²) Σ_{i=1}^n (u_i + x̄)(v_i - β_1 u_i) + β_1 Σ_{i=1}^n (-v_i + β_1 u_i)² = 0

where

u_i = x_i - x̄,  v_i = y_i - ȳ.

Since Σ_{i=1}^n u_i = Σ_{i=1}^n v_i = 0, so

Σ_{i=1}^n [β_1² u_i v_i + β_1(u_i² - v_i²) - u_i v_i] = 0

or

β_1² s_xy + β_1(s_xx - s_yy) - s_xy = 0.
8

Solving this quadratic equation provides the orthogonal regression estimate of β1 as

β̂_1OR = [(s_yy - s_xx) + sign(s_xy) √((s_xx - s_yy)² + 4 s_xy²)] / (2 s_xy)

where sign(s_xy) denotes the sign of s_xy, which can be positive or negative, i.e.,

sign(s_xy) = 1 if s_xy > 0,  sign(s_xy) = -1 if s_xy < 0.

Notice that the quadratic equation gives two solutions for β̂_1OR. We choose the solution which minimizes Σ_{i=1}^n d_i². The other solution maximizes Σ_{i=1}^n d_i² and is in the direction perpendicular to the optimal solution. The optimal solution can be chosen with the sign of s_xy.
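As an illustrative sketch (not part of the original notes, hypothetical data), the orthogonal regression slope can be computed directly from the chosen root of the quadratic equation above:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

sxx = np.sum((x - x.mean())**2)
syy = np.sum((y - y.mean())**2)
sxy = np.sum((x - x.mean()) * (y - y.mean()))

# root of beta1^2*s_xy + beta1*(s_xx - s_yy) - s_xy = 0 with sign(s_xy)
b1_OR = ((syy - sxx) + np.sign(sxy) * np.sqrt((sxx - syy)**2 + 4 * sxy**2)) / (2 * sxy)
b0_OR = y.mean() - b1_OR * x.mean()
print(b0_OR, b1_OR)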


ECONOMETRIC THEORY

MODULE – II
Lecture - 8
Simple Linear Regression Analysis

Dr. Shalabh
Department of Mathematics and Statistics
Indian Institute of Technology Kanpur
2

Reduced major axis regression method

The direct, reverse and orthogonal methods of estimation minimize the errors in a particular direction which is usually the
distance between the observed data points and the line in the scatter diagram. Alternatively, one can consider the area
extended by the data points in certain neighbourhood and instead of distances, the area of rectangles defined between
corresponding observed data point and nearest point on the line in the following scatter diagram can also be minimized.
Such an approach is more appropriate when the uncertainties are present in study as well as explanatory variables. This
approach is termed as reduced major axis regression.

Suppose the regression line to be fitted is Y_i = β_0 + β_1 X_i, on which all the observed points are expected to lie. Suppose the points (x_i, y_i), i = 1, 2, ..., n are observed which lie away from the line.
3

The area of the rectangle extended between the ith observed data point and the line is

A_i = (X_i ~ x_i)(Y_i ~ y_i),  i = 1, 2, ..., n

where (X_i, Y_i) denotes the ith pair of observations without any error which lie on the line.

The total area extended by the n data points is

Σ_{i=1}^n A_i = Σ_{i=1}^n (X_i ~ x_i)(Y_i ~ y_i).

All observed data points (x_i, y_i), i = 1, 2, ..., n are expected to lie on the line

Y_i = β_0 + β_1 X_i,

so let

E_i* = Y_i - β_0 - β_1 X_i = 0.

So now the objective is to minimize the sum of areas under the constraints E_i* to obtain the reduced major axis estimates of the regression coefficients. Using the Lagrangian multiplier method, the Lagrangian function is

L_R = Σ_{i=1}^n A_i - Σ_{i=1}^n μ_i E_i*
    = Σ_{i=1}^n (X_i - x_i)(Y_i - y_i) - Σ_{i=1}^n μ_i E_i*

where μ_1, ..., μ_n are the Lagrangian multipliers. The set of equations is obtained by setting

∂L_R/∂X_i = 0, ∂L_R/∂Y_i = 0, ∂L_R/∂β_0 = 0, ∂L_R/∂β_1 = 0  (i = 1, 2, ..., n).
4
Thus ∂LR
=(Yi − yi ) + β1µi = 0
∂X i
∂LR
= ( X i − xi ) − µi = 0
∂Yi
∂LR n
=
∂β 0
∑=
µ
i =1
i 0

∂LR n
=
∂β1
∑=
µX
i =1
i i 0.

Now
X=
i xi + µi
Y=
i yi − β1µi
β 0 + β1 X i =yi − β1µi
β 0 + β1 ( xi + µi ) =yi − β1µi
y − β 0 − β1 xi
⇒ µi =i .
2 β1
n
Substituting µi in ∑µ i =1
i = 0 , we get the reduced major axis regression estimate of β 0 is obtained as

βˆ0 RM= y − βˆ1RM x


n
where βˆ1RM is the reduced major axis regression estimate of β1. Using X=
i xi + µi , µi and βˆ0 RM in ∑µ X
i =1
i i = 0,

we get n
 yi − y + β1 x − β1 xi   yi − y + β1 x − β1 xi 
∑
i =1  2 β1
  xi −
2 β1
=0.
 
5

n
Let ui =
xi − x and vi =
yi − y , then this equation can be re-expressed as ∑ (v − β u )(v + β u + 2β x ) =
i =1
i 1 i i 1 i 10.
n n
Using i∑=
u ∑
=i 1 =i 1
= v i 0, we get

n n
2
i 1∑v
2

=i 1 =i 1
−β ∑u 2
i =
0.

Solving this equation, the reduced major axis regression estimate of β1 is obtained as

s yy
βˆ1RM = sign( sxy )
s xx

where
1 if sxy > 0
sign ( sxy ) = 
−1 if sxy < 0.

We choose the regression estimator which has same sign as that of sxy.
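A brief illustrative sketch (not from the notes, hypothetical data) of the reduced major axis estimates:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

sxx = np.sum((x - x.mean())**2)
syy = np.sum((y - y.mean())**2)
sxy = np.sum((x - x.mean()) * (y - y.mean()))

b1_RM = np.sign(sxy) * np.sqrt(syy / sxx)   # reduced major axis slope
b0_RM = y.mean() - b1_RM * x.mean()         # intercept
print(b0_RM, b1_RM)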
6

Least absolute deviation regression method

The least squares principle advocates the minimization of sum of squared errors. The idea of squaring the errors is useful in

place of simple errors because the random errors can be positive as well as negative. So consequently their sum can be

close to zero indicating that there is no error in the model which can be misleading. Instead of the sum of random errors,

the sum of absolute random errors can be considered which avoids the problem due to positive and negative random

errors.

In the method of least squares, the estimates of the parameters β_0 and β_1 in the model y_i = β_0 + β_1 x_i + ε_i (i = 1, 2, ..., n) are chosen such that the sum of squared deviations Σ_{i=1}^n ε_i² is minimum. In the method of least absolute deviation (LAD) regression, the parameters β_0 and β_1 are estimated such that the sum of absolute deviations Σ_{i=1}^n |ε_i| is minimum. It minimizes the sum of absolute vertical errors between the observed data points and the line.

The LAD estimates β̂_0L and β̂_1L are the values of β_0 and β_1, respectively, which minimize

LAD(β_0, β_1) = Σ_{i=1}^n |y_i - β_0 - β_1 x_i|

for the given observations (x_i, y_i) (i = 1, 2, ..., n).

Conceptually, the LAD procedure is simpler than the OLS procedure because |e| (absolute residual) is a more straightforward measure of the size of a residual than e² (squared residual). The LAD regression estimates of β_0 and β_1 are not available in closed form. Rather, they are obtained numerically using algorithms. Moreover, this creates the problems of non-uniqueness and degeneracy in the estimates. Non-uniqueness means that more than one best line passes through a data point. Degeneracy means that the best line through a data point also passes through more than one other data point. The non-uniqueness and degeneracy concepts are used in algorithms to judge the quality of the estimates. The algorithm for finding the estimates generally proceeds in steps. At each step, the best line is found that passes through a given data point. The best line always passes through another data point, and this data point is used in the next step. When there is non-uniqueness, there is more than one best line. When there is degeneracy, the best line passes through more than one other data point. When either of these problems is present, there is more than one choice for the data point to be used in the next step, and the algorithm may go around in circles or make a wrong choice of the LAD regression line. Exact tests of hypothesis and confidence intervals for the LAD regression estimates cannot be derived analytically. Instead, they are derived by analogy with the tests of hypothesis and confidence intervals related to the ordinary least squares estimates.
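Since no closed form exists, one way to obtain LAD estimates numerically is to minimize the sum of absolute residuals directly. The sketch below is only an illustration (not the algorithm described above): it uses hypothetical data and SciPy's general-purpose Nelder–Mead optimizer, started from the OLS estimates.

import numpy as np
from scipy.optimize import minimize

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 30.0])   # last point is an outlier

def lad_loss(theta):
    b0, b1 = theta
    return np.sum(np.abs(y - b0 - b1 * x))       # sum of absolute deviations

# OLS starting values
b1_ols = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
b0_ols = y.mean() - b1_ols * x.mean()

res = minimize(lad_loss, x0=[b0_ols, b1_ols], method="Nelder-Mead")
b0_LAD, b1_LAD = res.x
print(b0_LAD, b1_LAD)    # typically less affected by the outlier than OLS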


8

Estimation of parameters when X is stochastic

In the usual linear regression model, the study variable is supposed to be random and the explanatory variables are assumed to be fixed. In practice, there may be situations in which the explanatory variable also becomes random.

Suppose both the dependent and independent variables are stochastic in the simple linear regression model

y = β_0 + β_1 X + ε

where ε is the associated random error component. The observations (x_i, y_i), i = 1, 2, ..., n are assumed to be jointly distributed. Then statistical inferences can be drawn in such cases conditionally on X.

Assume the joint distribution of X and y to be bivariate normal N(μ_x, μ_y, σ_x², σ_y², ρ) where μ_x and μ_y are the means of X and y, σ_x² and σ_y² are the variances of X and y, and ρ is the correlation coefficient between X and y. Then the conditional distribution of y given X = x is univariate normal with conditional mean

E(y | X = x) = μ_{y|x} = β_0 + β_1 x

and conditional variance

Var(y | X = x) = σ²_{y|x} = σ_y²(1 - ρ²)

where

β_0 = μ_y - μ_x β_1

and

β_1 = ρ σ_y / σ_x.
9
When both X and y are stochastic, the problem of estimation of parameters can be reformulated as follows. Consider a conditional random variable y | X = x having a normal distribution with conditional mean μ_{y|x} and conditional variance σ²_{y|x}. Obtain n independently distributed observations y_i | x_i, i = 1, 2, ..., n from N(μ_{y|x}, σ²_{y|x}) with nonstochastic X. Now the method of maximum likelihood can be used to estimate the parameters, which yields the estimates of β_0 and β_1, as earlier in the case of nonstochastic X, as

b_0 = ȳ - b_1 x̄

and

b_1 = s_xy / s_xx

respectively.

Moreover, the correlation coefficient

ρ = E[(y - μ_y)(X - μ_x)] / (σ_y σ_x)

can be estimated by the sample correlation coefficient

ρ̂ = Σ_{i=1}^n (y_i - ȳ)(x_i - x̄) / √[Σ_{i=1}^n (x_i - x̄)² Σ_{i=1}^n (y_i - ȳ)²]
  = s_xy / √(s_xx s_yy)
  = b_1 √(s_xx / s_yy).
10

Thus

ρ̂² = b_1² s_xx / s_yy
   = b_1 s_xy / s_yy
   = (s_yy - Σ_{i=1}^n ε̂_i²) / s_yy
   = R²

which is the same as the coefficient of determination. Thus R² has the same expression as in the case when X is fixed, and R² again measures the goodness of the fitted model even when X is stochastic.
ECONOMETRIC THEORY

MODULE – III
Lecture - 9
Multiple Linear Regression Analysis

Dr. Shalabh
Department of Mathematics and Statistics
Indian Institute of Technology Kanpur
2

We consider now the problem of regression when study variable depends on more than one explanatory or independent
variables, called as multiple linear regression model. This model generalizes the simple linear regression in two ways. It
allows the mean function E(y) to depend on more than one explanatory variables and to have shapes other than straight
lines, although it does not allow for arbitrary shapes.

The multiple linear regression model


Let y denotes the dependent (or study) variable that is linearly related to k independent (or explanatory) variables
X1, X2,…, Xk through the parameters β1 , β 2 ,..., β k and we write

y X 1β1 + X 2 β 2 + ... + X k β k + ε .
=

This is called as the multiple linear regression model. The parameters β1 , β 2 ,..., β k are the regression coefficients
associated with X1,X2,…,Xk respectively and ε is the random error component reflecting the difference between the
observed and fitted linear relationship. There can be various reasons for such difference, e.g., joint effect of those
variables not included in the model, random factors which can not be accounted in the model etc.

The jth regression coefficient βj represents the expected change in y per unit change in jth independent variable Xj .
Assuming E (ε ) = 0,

∂E ( y )
βj = .
∂X j
3

Linear model

A model is said to be linear when it is linear in the parameters. In such a case ∂y/∂β_j (or equivalently ∂E(y)/∂β_j) should not depend on any β's. For example:

i) y = β_1 + β_2 X is a linear model as it is linear in the parameters.

ii) y = β_1 X^{β_2} can be written as

log y = log β_1 + β_2 log X
y* = β_1* + β_2 x*

which is linear in the parameters β_1* and β_2, but nonlinear in the variables y* = log y, x* = log X. So it is a linear model.

iii) y = β_1 + β_2 X + β_3 X² is linear in the parameters β_1, β_2 and β_3 but nonlinear in the variable X. So it is a linear model.

iv) y = β_1 + β_2 / (X - β_3) is nonlinear in the parameters and the variables both. So it is a nonlinear model.

v) y = β_1 + β_2 X^{β_3} is nonlinear in the parameters and the variables both. So it is a nonlinear model.

vi) y = β_1 + β_2 X + β_3 X² + β_4 X³ is a cubic polynomial model which can be written as y = β_1 + β_2 X_2 + β_3 X_3 + β_4 X_4, which is linear in the parameters β_1, β_2, β_3, β_4 and linear in the variables X_2 = X, X_3 = X², X_4 = X³. So it is a linear model.
4

Example

The income and education of a person are related. It is expected that, on average, a higher level of education provides higher income. So a simple linear regression model can be expressed as

income = β_1 + β_2 education + ε.

Note that β_2 reflects the change in income with respect to a per unit change in education, and β_1 reflects the income when education is zero, as it is expected that even an illiterate person can have some income.

Further, this model neglects that most people have higher income when they are older than when they are young, regardless of education. So β_2 will over-state the marginal impact of education. If age and education are positively correlated, then the regression model will associate all the observed increase in income with an increase in education. So a better model is

income = β_1 + β_2 education + β_3 age + ε.

Usually it is observed that income tends to rise less rapidly in the later earning years than in the early years. To accommodate such a possibility, we might extend the model to

income = β_1 + β_2 education + β_3 age + β_4 age² + ε.

This is how we proceed for regression modeling in real life situation. One needs to consider the experimental condition and
the phenomenon before taking the decision on how many, why and how to choose the dependent and independent
variables.
5

Model set up
Let an experiment be conducted n times and the data is obtained as follows:

Observation number   Response (y)   Explanatory variables
                                    X_1     X_2    ...   X_k
        1                y_1        x_11    x_12   ...   x_1k
        2                y_2        x_21    x_22   ...   x_2k
        :                 :           :       :            :
        n                y_n        x_n1    x_n2   ...   x_nk

Assuming that the model is

y = β_1 X_1 + β_2 X_2 + ... + β_k X_k + ε,

the n-tuples of observations are also assumed to follow the same model. Thus they satisfy

y_1 = β_1 x_11 + β_2 x_12 + ... + β_k x_1k + ε_1
y_2 = β_1 x_21 + β_2 x_22 + ... + β_k x_2k + ε_2
  :
y_n = β_1 x_n1 + β_2 x_n2 + ... + β_k x_nk + ε_n.
6
These n equations can be written as

[ y_1 ]   [ x_11  x_12  ...  x_1k ] [ β_1 ]   [ ε_1 ]
[ y_2 ] = [ x_21  x_22  ...  x_2k ] [ β_2 ] + [ ε_2 ]
[  :  ]   [  :     :          :   ] [  :  ]   [  :  ]
[ y_n ]   [ x_n1  x_n2  ...  x_nk ] [ β_k ]   [ ε_n ]

or

y = Xβ + ε

where y = (y_1, y_2, ..., y_n)' is an n × 1 vector of n observations on the study variable,

X = [ x_11  x_12  ...  x_1k ]
    [ x_21  x_22  ...  x_2k ]
    [  :     :          :   ]
    [ x_n1  x_n2  ...  x_nk ]

is an n × k matrix of n observations on each of the k explanatory variables, β = (β_1, β_2, ..., β_k)' is a k × 1 vector of regression coefficients and ε = (ε_1, ε_2, ..., ε_n)' is an n × 1 vector of random error components or disturbance terms.

If an intercept term is present, take the first column of X to be (1, 1, ..., 1)', i.e., x_i1 = 1 for all i, so that

X = [ 1  x_12  ...  x_1k ]
    [ 1  x_22  ...  x_2k ]
    [ :   :          :   ]
    [ 1  x_n2  ...  x_nk ].

In this case, there are (k - 1) explanatory variables and one intercept term.
7

Assumptions in multiple linear regression model


y X β + ε for drawing the statistical inferences. The following
Some assumptions are needed in the model=
assumptions are made:

(i) E (ε ) = 0

(ii) E (εε ') = σ 2 I n

(iii) Rank ( X ) = k

(iv) X is a non-stochastic matrix.

(v) ε ~ N (0, σ 2 I n )
These assumptions are used to study the statistical properties of estimator of regression coefficients. The
following assumption is required to study particularly the large sample properties of the estimators
(vi) lim_{n→∞} (X'X / n) = Δ exists and is a non-stochastic and nonsingular matrix (with finite elements).

The explanatory variables can also be stochastic in some cases. We assume that X is non-stochastic unless stated
separately.

We consider the problems of estimation and testing of hypothesis on regression coefficient vector under the stated
assumption.
8

Estimation of parameters
A general procedure for the estimation of the regression coefficient vector is to minimize

Σ_{i=1}^n M(ε_i) = Σ_{i=1}^n M(y_i - x_i1 β_1 - x_i2 β_2 - ... - x_ik β_k)

for a suitably chosen function M.

Some examples of the choice of M are

M(x) = |x|
M(x) = x²
M(x) = |x|^p, in general.

We consider the principle of least squares, which is related to M(x) = x², and the method of maximum likelihood estimation for the estimation of the parameters.


9

Principle of ordinary least squares (OLS)


Let B be the set of all possible vectors β. If there is no further information, then B is the k-dimensional real Euclidean space. The objective is to find a vector b' = (b_1, b_2, ..., b_k) from B that minimizes the sum of squared deviations of the ε_i's, i.e.,

S(β) = Σ_{i=1}^n ε_i² = ε'ε = (y - Xβ)'(y - Xβ)

for given y and X. A minimum will always exist as S(β) is a real valued, convex and differentiable function. Write

S(β) = y'y + β'X'Xβ - 2β'X'y.

Differentiating S(β) with respect to β,

∂S(β)/∂β = 2X'Xβ - 2X'y
∂²S(β)/∂β∂β' = 2X'X  (at least non-negative definite).

The normal equation is

∂S(β)/∂β = 0
⇒ X'Xb = X'y

where the following result is used:
10


Result: If f(z) = Z'AZ is a quadratic form, Z is an m × 1 vector and A is any m × m symmetric matrix, then ∂f(z)/∂z = 2Az.

Since it is assumed that rank(X) = k (full rank), X'X is positive definite and the unique solution of the normal equation is

b = (X'X)^{-1} X'y

which is termed the ordinary least squares estimator (OLSE) of β.

Since ∂²S(β)/∂β∂β' is at least non-negative definite, b minimizes S(β).

In case X is not of full rank, then

b = (X'X)^- X'y + [I - (X'X)^- X'X] ω

where (X'X)^- is the generalized inverse of X'X and ω is an arbitrary vector. The generalized inverse (X'X)^- of X'X satisfies

X'X (X'X)^- X'X = X'X
X (X'X)^- X'X = X
X'X (X'X)^- X' = X'.
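A minimal sketch (not part of the notes) of the full-rank case: hypothetical data are simulated and b = (X'X)^{-1}X'y is obtained by solving the normal equations numerically.

import numpy as np

rng = np.random.default_rng(0)
n, k = 30, 3                                   # k columns including the intercept
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta = np.array([1.0, 2.0, -0.5])
y = X @ beta + rng.normal(scale=0.5, size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)          # solves the normal equations X'Xb = X'y
print(b)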


ECONOMETRIC THEORY

MODULE – III
Lecture - 10
Multiple Linear Regression Analysis

Dr. Shalabh
Department of Mathematics and Statistics
Indian Institute of Technology Kanpur
2

Theorem

i. Let ŷ = Xb be the empirical predictor of y. Then ŷ has the same value for all solutions b of X ' Xb = X ' y.

ii. S ( β ) attains the minimum for any solution of X ' Xb = X ' y.

Proof:

i. Let b be any member of

b = (X'X)^- X'y + [I - (X'X)^- X'X] ω.

Since X (X'X)^- X'X = X, we have

Xb = X (X'X)^- X'y + X [I - (X'X)^- X'X] ω = X (X'X)^- X'y

which is independent of ω. This implies that ŷ has the same value for all solutions b of X'Xb = X'y.

ii. Note that for any β,

S(β) = [y - Xb + X(b - β)]' [y - Xb + X(b - β)]
     = (y - Xb)'(y - Xb) + (b - β)'X'X(b - β) + 2(b - β)'X'(y - Xb)
     = (y - Xb)'(y - Xb) + (b - β)'X'X(b - β)   (using X'Xb = X'y)
     ≥ (y - Xb)'(y - Xb) = S(b)
     = y'y - 2y'Xb + b'X'Xb
     = y'y - b'X'Xb
     = y'y - ŷ'ŷ.
3

Fitted values

Now onwards, we assume that X is a full column rank matrix.

If β̂ is any estimator of β for the model y = Xβ + ε, then the fitted values are defined as ŷ = Xβ̂.

In case β̂ = b,

ŷ = Xb = X(X'X)^{-1}X'y = Hy

where H = X(X'X)^{-1}X' is termed the 'hat matrix', which is

i.   symmetric,
ii.  idempotent (i.e., HH = H) and
iii. tr H = tr X(X'X)^{-1}X' = tr X'X(X'X)^{-1} = tr I_k = k.
4

Residuals
The difference between the observed and fitted values of the study variable is called a residual. It is denoted as

e = y - ŷ = y - Xb = y - Hy = (I - H)y = H̄y

where H̄ = I - H.

Note that

(i)   H̄ is a symmetric matrix,
(ii)  H̄ is an idempotent matrix, i.e., H̄H̄ = (I - H)(I - H) = (I - H) = H̄, and
(iii) tr H̄ = tr I_n - tr H = n - k.


5

Properties of OLSE

(i) Estimation error

The estimation error of b is

b - β = (X'X)^{-1}X'y - β
      = (X'X)^{-1}X'(Xβ + ε) - β
      = (X'X)^{-1}X'ε.

(ii) Bias

Since X is assumed to be nonstochastic and E(ε) = 0,

E(b - β) = (X'X)^{-1}X'E(ε) = 0.

Thus OLSE is an unbiased estimator of β.

(iii) Covariance matrix

The covariance matrix of b is

V(b) = E(b - β)(b - β)'
     = E[(X'X)^{-1}X' εε' X(X'X)^{-1}]
     = (X'X)^{-1}X' E(εε') X(X'X)^{-1}
     = σ²(X'X)^{-1}X' I X(X'X)^{-1}
     = σ²(X'X)^{-1}.
6

(iv) Variance

The variance of b can be obtained as the sum of the variances of b_1, b_2, ..., b_k, which is the trace of the covariance matrix of b. Thus

Var(b) = tr[V(b)] = Σ_{i=1}^k E(b_i - β_i)² = Σ_{i=1}^k Var(b_i).

Estimation of σ²

The least squares criterion cannot be used to estimate σ² because σ² does not appear in S(β). Since E(ε_i²) = σ², we attempt to use the residuals e_i to estimate σ² as follows:

e = y - ŷ = y - X(X'X)^{-1}X'y = [I - X(X'X)^{-1}X']y = H̄y.
7

Consider the residual sum of squares

SS_res = Σ_{i=1}^n e_i²
       = e'e
       = (y - Xb)'(y - Xb)
       = y'(I - H)(I - H)y
       = y'(I - H)y
       = y'H̄y.

Also

SS_res = (y - Xb)'(y - Xb)
       = y'y - 2b'X'y + b'X'Xb
       = y'y - b'X'y    (using X'Xb = X'y).

Further,

SS_res = y'H̄y
       = (Xβ + ε)'H̄(Xβ + ε)
       = ε'H̄ε    (using H̄X = 0).

Since ε ~ N(0, σ²I), so y ~ N(Xβ, σ²I). Hence y'H̄y / σ² ~ χ²(n - k).
8

Thus E[y'H̄y] = (n - k)σ²

or E[y'H̄y / (n - k)] = σ²

or E[MS_res] = σ²

where MS_res = SS_res / (n - k) is the mean sum of squares due to residuals.

Thus an unbiased estimator of σ² is

σ̂² = MS_res = s² (say),

which is a model dependent estimator.

Covariance matrix of ŷ

The covariance matrix of ŷ is

V(ŷ) = V(Xb) = X V(b) X' = σ² X(X'X)^{-1}X' = σ² H.
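As an illustrative sketch (not from the notes), the hat matrix, residuals and the unbiased estimator σ̂² = SS_res/(n - k) can be computed for simulated hypothetical data as follows.

import numpy as np

rng = np.random.default_rng(0)
n, k = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.5, size=n)
b = np.linalg.solve(X.T @ X, X.T @ y)

H = X @ np.linalg.inv(X.T @ X) @ X.T      # hat matrix
e = (np.eye(n) - H) @ y                   # residuals e = (I - H) y
SS_res = e @ e
sigma2_hat = SS_res / (n - k)             # unbiased estimator of sigma^2
print(np.trace(H), sigma2_hat)            # trace(H) = k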
9

Gauss-Markov theorem
The ordinary least squares estimator (OLSE) is the best linear unbiased estimator (BLUE) of β .
Proof: The OLSE of β is

b = ( X ' X ) −1 X ' y
which is a linear function of y. Consider an arbitrary linear estimator b* = a'y of the linear parametric function ℓ'β, where the elements of a are arbitrary constants.

Then for b*,

E(b*) = E(a'y) = a'Xβ

and so b* is an unbiased estimator of ℓ'β when

E(b*) = a'Xβ = ℓ'β
⇒ a'X = ℓ'.

Since we wish to consider only those estimators that are linear and unbiased, we restrict ourselves to estimators for which a'X = ℓ'.

Further,

Var(a'y) = a' Var(y) a = σ² a'a
Var(ℓ'b) = ℓ' Var(b) ℓ = σ² ℓ'(X'X)^{-1}ℓ = σ² a'X(X'X)^{-1}X'a.

Consider

Var(a'y) - Var(ℓ'b) = σ² [a'a - a'X(X'X)^{-1}X'a]
                    = σ² a'[I - X(X'X)^{-1}X']a
                    = σ² a'(I - H)a.

Since (I - H) is a positive semi-definite matrix,

Var(a'y) - Var(ℓ'b) ≥ 0.

This reveals that if b* is any linear unbiased estimator, then its variance must be no smaller than that of b.

If we consider ℓ = (0, 0, ..., 0, 1, 0, ..., 0)' (here 1 occurs at the ith place), then ℓ'b = b_i is the best linear unbiased estimator of ℓ'β = β_i for all i = 1, 2, ..., k.

Consequently b is the best linear unbiased estimator of β , where ‘best’ refers to the fact that b is efficient within the class of
linear and unbiased estimators.
ECONOMETRIC THEORY

MODULE – III
Lecture - 11
Multiple Linear Regression Analysis

Dr. Shalabh
Department of Mathematics and Statistics
Indian Institute of Technology Kanpur
2

Maximum likelihood estimation


In the model y = Xβ + ε, it is assumed that the errors are normally and independently distributed with constant variance σ², i.e., ε ~ N(0, σ²I).

The normal density function for the errors is

f(ε_i) = [1/(σ√(2π))] exp[-ε_i²/(2σ²)],  i = 1, 2, ..., n.

The likelihood function is the joint density of ε_1, ε_2, ..., ε_n, given as

L(β, σ²) = Π_{i=1}^n f(ε_i)
         = (2πσ²)^{-n/2} exp[-(1/(2σ²)) Σ_{i=1}^n ε_i²]
         = (2πσ²)^{-n/2} exp[-(1/(2σ²)) ε'ε]
         = (2πσ²)^{-n/2} exp[-(1/(2σ²)) (y - Xβ)'(y - Xβ)].

Since the log transformation is monotonic, we maximize ln L(β, σ²) instead of L(β, σ²):

ln L(β, σ²) = -(n/2) ln(2πσ²) - (1/(2σ²)) (y - Xβ)'(y - Xβ).
3

The maximum likelihood estimators (m.l.e.) of β and σ² are obtained by equating the first order derivatives of ln L(β, σ²) with respect to β and σ² to zero as follows:

∂ln L(β, σ²)/∂β = (1/(2σ²)) 2X'(y - Xβ) = 0

∂ln L(β, σ²)/∂σ² = -n/(2σ²) + (1/(2(σ²)²)) (y - Xβ)'(y - Xβ) = 0.

The likelihood equations are given by

X'Xβ = X'y
σ̃² = (1/n)(y - Xβ̃)'(y - Xβ̃).

Since rank(X) = k, the unique m.l.e. of β and σ² are obtained as

β̃ = (X'X)^{-1}X'y
σ̃² = (1/n)(y - Xβ̃)'(y - Xβ̃).

Next we verify that these values maximize the likelihood function. First we find

∂²ln L(β, σ²)/∂β∂β' = -(1/σ²) X'X

∂²ln L(β, σ²)/∂(σ²)² = n/(2σ⁴) - (1/σ⁶)(y - Xβ)'(y - Xβ)

∂²ln L(β, σ²)/∂β∂σ² = -(1/σ⁴) X'(y - Xβ).
4

Thus the Hessian matrix of second order partial derivatives of ln L(β, σ²) with respect to β and σ² is

[ ∂²ln L(β, σ²)/∂β∂β'    ∂²ln L(β, σ²)/∂β∂σ²   ]
[ ∂²ln L(β, σ²)/∂σ²∂β    ∂²ln L(β, σ²)/∂(σ²)²  ]

which is negative definite at β = β̃ and σ² = σ̃². This ensures that the likelihood function is maximized at these values.

Comparing with the OLSEs, we find that

i.  The OLSE and the m.l.e. of β are the same. So the m.l.e. of β is also an unbiased estimator of β.

ii. The OLSE of σ² is s², which is related to the m.l.e. of σ² as σ̃² = [(n - k)/n] s². So the m.l.e. of σ² is a biased estimator of σ².
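A small Monte Carlo sketch (not from the notes, hypothetical data) illustrating point (ii): dividing SS_res by n gives a downward-biased estimate of σ², while dividing by (n - k) is unbiased.

import numpy as np

rng = np.random.default_rng(1)
n, k, sigma2 = 25, 3, 4.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta = np.array([1.0, -2.0, 0.5])

s2_vals, ml_vals = [], []
for _ in range(2000):                       # Monte Carlo check of the bias
    y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)
    b = np.linalg.solve(X.T @ X, X.T @ y)
    SS_res = np.sum((y - X @ b)**2)
    s2_vals.append(SS_res / (n - k))        # unbiased estimator
    ml_vals.append(SS_res / n)              # ML estimator, biased downwards

print(np.mean(s2_vals), np.mean(ml_vals))   # approx 4.0 and 4.0*(n-k)/n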
5

Cramer-Rao lower bound


Let θ = (β, σ²)'. Assume that both β and σ² are unknown. If E(θ̂) = θ, then the covariance matrix of θ̂ is greater than or equal to (the Cramer-Rao lower bound) the matrix inverse of

I(θ) = -E[∂²ln L(θ)/∂θ∂θ']

     = [ -E(∂²ln L(β, σ²)/∂β∂β')    -E(∂²ln L(β, σ²)/∂β∂σ²)   ]
       [ -E(∂²ln L(β, σ²)/∂σ²∂β)    -E(∂²ln L(β, σ²)/∂(σ²)²)  ]

     = [ -E(-X'X/σ²)            -E(-X'(y - Xβ)/σ⁴)                     ]
       [ -E(-(y - Xβ)'X/σ⁴)     -E(n/(2σ⁴) - (y - Xβ)'(y - Xβ)/σ⁶)     ]

     = [ X'X/σ²    0        ]
       [ 0         n/(2σ⁴)  ].

Then

[I(θ)]^{-1} = [ σ²(X'X)^{-1}    0      ]
              [ 0               2σ⁴/n  ]

is the Cramer-Rao lower bound matrix for β and σ².
6

The covariance matrix of the OLSEs of β and σ² is

Σ_OLS = [ σ²(X'X)^{-1}    0            ]
        [ 0               2σ⁴/(n - k)  ]

which means that the Cramer-Rao bound is attained for the covariance of b but not for s².
ECONOMETRIC THEORY

MODULE – III
Lecture - 12
Multiple Linear Regression Analysis

Dr. Shalabh
Department of Mathematics and Statistics
Indian Institute of Technology Kanpur
2

Standardized regression coefficients

Usually it is difficult to compare the regression coefficients because the magnitude of β̂_j reflects the units of measurement of the jth explanatory variable X_j. For example, in the fitted regression model

ŷ = 5 X_1 + 1000 X_2,

y is measured in litres, X_1 in millilitres and X_2 in litres. Although β̂_2 >> β̂_1, the effect of both explanatory variables is identical: a one litre change in either X_1 or X_2, when the other variable is held fixed, produces the same change in ŷ.

Sometimes it is helpful to work with scaled explanatory and study variables that produces dimensionless regression
coefficients.

These dimensionless regression coefficients are called as standardized regression coefficients.

There are two popular approaches for scaling which gives standardized regression coefficients.
We discuss them as follows:
3

1. Unit normal scaling

Employ unit normal scaling to each explanatory variable and to the study variable. So define

z_ij = (x_ij - x̄_j)/s_j,  i = 1, 2, ..., n,  j = 1, 2, ..., k

y_i* = (y_i - ȳ)/s_y

where

s_j² = (1/(n - 1)) Σ_{i=1}^n (x_ij - x̄_j)²  and  s_y² = (1/(n - 1)) Σ_{i=1}^n (y_i - ȳ)²

are the sample variances of the jth explanatory variable and the study variable, respectively.

All scaled explanatory variables and the scaled study variable have mean zero and sample variance unity. Using these new variables, the regression model becomes

y_i* = γ_1 z_i1 + γ_2 z_i2 + ... + γ_k z_ik + ε_i,  i = 1, 2, ..., n.

Such centering removes the intercept term from the model. The least squares estimate of γ = (γ_1, γ_2, ..., γ_k)' is

γ̂ = (Z'Z)^{-1} Z'y*.

This scaling has a similarity to standardizing a normal random variable, i.e., subtracting the mean from an observation and dividing by its standard deviation. So it is called unit normal scaling.
4
2. Unit length scaling

In unit length scaling, define

ω_ij = (x_ij - x̄_j)/S_jj^{1/2},  i = 1, 2, ..., n;  j = 1, 2, ..., k

y_i⁰ = (y_i - ȳ)/SS_T^{1/2}

where

S_jj = Σ_{i=1}^n (x_ij - x̄_j)²

is the corrected sum of squares for the jth explanatory variable X_j and

SS_T = Σ_{i=1}^n (y_i - ȳ)²

is the total sum of squares.

In this scaling, each new explanatory variable W_j has mean ω̄_j = (1/n) Σ_{i=1}^n ω_ij = 0 and length

√(Σ_{i=1}^n (ω_ij - ω̄_j)²) = 1.

In terms of these variables, the regression model is

y_i⁰ = δ_1 ω_i1 + δ_2 ω_i2 + ... + δ_k ω_ik + ε_i,  i = 1, 2, ..., n.

The least squares estimate of the regression coefficient δ = (δ_1, δ_2, ..., δ_k)' is

δ̂ = (W'W)^{-1} W'y⁰.
5

In such a case, the matrix W'W is in the form of a correlation matrix, i.e.,

W'W = [ 1     r_12   r_13   ...   r_1k ]
      [ r_12  1      r_23   ...   r_2k ]
      [ r_13  r_23   1      ...   r_3k ]
      [ :     :      :            :    ]
      [ r_1k  r_2k   r_3k   ...   1    ]

where

r_ij = Σ_{u=1}^n (x_ui - x̄_i)(x_uj - x̄_j) / (S_ii S_jj)^{1/2} = S_ij / (S_ii S_jj)^{1/2}

is the simple correlation coefficient between the explanatory variables X_i and X_j.

Similarly,

W'y⁰ = (r_1y, r_2y, ..., r_ky)'

where

r_jy = Σ_{u=1}^n (x_uj - x̄_j)(y_u - ȳ) / (S_jj SS_T)^{1/2} = S_jy / (S_jj SS_T)^{1/2}

is the simple correlation coefficient between the jth explanatory variable X_j and the study variable y.

Note that it is customary to refer to r_ij and r_jy as correlation coefficients even though the X_i's are not random variables.
6

If unit normal scaling is used, then

Z'Z = (n - 1) W'W.

So the estimates of the regression coefficients in unit normal scaling (i.e., γ̂) and unit length scaling (i.e., δ̂) are identical. So it does not matter which scaling is used, and

γ̂ = δ̂.

The regression coefficients obtained after such scaling, viz., γ̂ or δ̂, are usually called standardized regression coefficients.

The relationship between the original and standardized regression coefficients is

b_j = δ̂_j (SS_T / S_jj)^{1/2},  j = 1, 2, ..., k

and

b_0 = ȳ - Σ_{j=1}^k b_j x̄_j

where b_0 is the OLSE of the intercept term and the b_j are the OLSEs of the slope parameters β_j.
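A brief illustrative sketch (not from the notes, hypothetical data) of unit length scaling: W'W becomes the correlation matrix of the regressors, and the relation b_j = δ̂_j (SS_T/S_jj)^{1/2} recovers the original slopes.

import numpy as np

rng = np.random.default_rng(2)
n = 50
X2 = rng.normal(size=(n, 2))                       # two explanatory variables
y = 1.0 + 5.0 * X2[:, 0] - 3.0 * X2[:, 1] + rng.normal(size=n)

S_jj = np.sum((X2 - X2.mean(axis=0))**2, axis=0)   # corrected sums of squares
SS_T = np.sum((y - y.mean())**2)

W = (X2 - X2.mean(axis=0)) / np.sqrt(S_jj)         # unit length scaled regressors
y0 = (y - y.mean()) / np.sqrt(SS_T)

delta_hat = np.linalg.solve(W.T @ W, W.T @ y0)     # standardized coefficients
b_slopes = delta_hat * np.sqrt(SS_T / S_jj)        # back to the original scale
print(np.round(W.T @ W, 3))                        # correlation matrix of the regressors
print(b_slopes)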
7

The model in deviation form


The multiple linear regression model can also be expressed in the deviation form.
First all the data is expressed in terms of deviations from sample mean.
The estimation of regression parameters is performed in two steps:

First step: Estimate the slope parameters.


Second step : Estimate the intercept term.
The multiple linear regression model in deviation form is expressed as follows:

Let

A = I - (1/n) ℓℓ'

where ℓ = (1, 1, ..., 1)' is an n × 1 vector with each element unity, i.e., A is the n × n identity matrix minus 1/n times the n × n matrix with all elements unity.

Then

ȳ = (1/n) Σ_{i=1}^n y_i = (1/n)(1, 1, ..., 1)(y_1, y_2, ..., y_n)' = (1/n) ℓ'y

and

Ay = y - ȳℓ = (y_1 - ȳ, y_2 - ȳ, ..., y_n - ȳ)'.
8

Thus pre-multiplication of any column vector by A produces the vector of those observations in deviation form.

Note that

Aℓ = ℓ - (1/n) ℓℓ'ℓ = ℓ - (1/n) ℓ · n = ℓ - ℓ = 0

and A is a symmetric and idempotent matrix.

In the model

y = Xβ + ε,

the OLSE of β is

b = (X'X)^{-1} X'y

and the residual vector is

e = y - Xb.

Note that Ae = e.
9
Let the n × k matrix X be partitioned as

X = [X_1  X_2*]

where X_1 = (1, 1, ..., 1)' is an n × 1 vector with all elements unity representing the intercept term, and X_2* is an n × (k - 1) matrix of observations on the (k - 1) explanatory variables X_2, X_3, ..., X_k. The OLSE b = (b_1, b_2*')' is partitioned accordingly, with b_1 the OLSE of the intercept term β_1 and b_2* the (k - 1) × 1 vector of OLSEs associated with β_2, β_3, ..., β_k.

Then

y = X_1 b_1 + X_2* b_2* + e.

Premultiplication by A gives

Ay = A X_1 b_1 + A X_2* b_2* + Ae
   = A X_2* b_2* + e.

Further, premultiplication by X_2*' gives

X_2*' Ay = X_2*' A X_2* b_2* + X_2*' e
         = X_2*' A X_2* b_2*.

Since A is symmetric and idempotent,

(A X_2*)'(Ay) = (A X_2*)'(A X_2*) b_2*.

This equation can be compared with the normal equations X'y = X'Xb in the model y = Xβ + ε. Such a comparison yields the following conclusions:


10

 b_2* is the sub-vector of the OLSE.

 Ay is the study variable vector in deviation form.

 A X_2* is the explanatory variable matrix in deviation form.

 This is the normal equation in terms of deviations. Its solution gives the OLSEs of the slope coefficients as

b_2* = [(A X_2*)'(A X_2*)]^{-1} (A X_2*)'(Ay).

The estimate of the intercept term is obtained in the second step as follows:

Premultiplying y = Xb + e by (1/n) ℓ' gives

(1/n) ℓ'y = (1/n) ℓ'Xb + (1/n) ℓ'e

ȳ = [1  X̄_2  X̄_3  ...  X̄_k] (b_1, b_2, ..., b_k)' + 0

⇒ b_1 = ȳ - b_2 X̄_2 - b_3 X̄_3 - ... - b_k X̄_k.
11
Now we explain the various sums of squares in terms of this model.

The total sum of squares is

TSS = y'Ay.

Since Ay = A X_2* b_2* + e,

y'Ay = y'A X_2* b_2* + y'e
     = (Xb + e)'A X_2* b_2* + y'e
     = (X_1 b_1 + X_2* b_2* + e)'A X_2* b_2* + (X_1 b_1 + X_2* b_2* + e)'e
     = b_2*' X_2*' A X_2* b_2* + e'e,

so

TSS = SS_reg + SS_res

where the sum of squares due to regression is

SS_reg = b_2*' X_2*' A X_2* b_2*

and the sum of squares due to residuals is

SS_res = e'e.
ECONOMETRIC THEORY

MODULE – III
Lecture - 13
Multiple Linear Regression Analysis

Dr. Shalabh
Department of Mathematics and Statistics
Indian Institute of Technology Kanpur
2

Testing of hypothesis
There are several important questions which can be answered through the test of hypothesis concerning the regression
coefficients.

For example
1. What is the overall adequacy of the model?
2. Which specific explanatory variables seems to be important?
etc.

In order to answer such questions, we first develop the test of hypothesis for a general framework, viz., the general linear hypothesis. Then several tests of hypothesis can be derived as its special cases. So first we discuss the test of a general linear hypothesis.


3

Test of hypothesis for H 0 : R β = r


We consider a general linear hypothesis that the parameters in β are contained in a subspace of parameter space for

which Rβ = r , where R is a (J x k) matrix of known elements and r is a (J x 1) vector of known elements. Note that the

matrix X ' X is of full rank. In general, the null hypothesis

H 0 : Rβ = r

is termed as general linear hypothesis and

H1 : R β ≠ r

is the alternative hypothesis.

We assume that rank(R) = J, i.e., R is of full row rank, so that there is no linear dependence in the hypothesis.

Some special cases and interesting examples of H_0: Rβ = r are as follows:

(i) H_0: β_i = 0

Choose J = 1, r = 0, R = [0, 0, ..., 0, 1, 0, ..., 0] where 1 occurs at the ith position in R.

This particular hypothesis explains whether X_i has any effect on the linear model or not.

(ii) H_0: β_3 = β_4, or H_0: β_3 - β_4 = 0

Choose J = 1, r = 0, R = [0, 0, 1, -1, 0, ..., 0].
4
(iii) H_0: β_3 = β_4 = β_5, or H_0: β_3 - β_4 = 0, β_3 - β_5 = 0

Choose J = 2, r = (0, 0)', R = [ 0 0 1 -1  0 0 ... 0 ]
                               [ 0 0 1  0 -1 0 ... 0 ].

(iv) H_0: β_3 + 5β_4 = 2

Choose J = 1, r = 2, R = [0, 0, 1, 5, 0, ..., 0].

(v) H_0: β_2 = β_3 = ... = β_k = 0

Choose

J = k - 1
r = (0, 0, ..., 0)'
R = [0  I_{k-1}], a (k - 1) × k matrix.

This particular hypothesis explains the goodness of fit. It tells whether the β_i have a linear effect or not and whether they are of any importance. It also tests whether X_2, X_3, ..., X_k have any influence in the determination of y or not. Here β_1 = 0 is excluded because this involves the additional implication that the mean level of y is zero. Our main concern is to know whether the explanatory variables help in explaining the variation in y around its mean value or not.

We develop the likelihood ratio test for H_0: Rβ = r.


5
Likelihood ratio test

The likelihood ratio test statistic is

λ = max L(β, σ² | y, X) / max L(β, σ² | y, X, Rβ = r) = L̂(Ω) / L̂(ω)

where Ω denotes the whole parametric space and ω denotes the restricted parametric space under H_0.

If both the likelihoods are maximized, one constrained and the other unconstrained, then the value of the unconstrained maximum will not be smaller than the value of the constrained maximum. Hence λ ≥ 1.

First we discuss the likelihood ratio test for the simpler case when R = I_k and r = β_0, i.e., β = β_0. This will give a better and more detailed understanding of the details, and then we generalize it for Rβ = r in general.

Likelihood ratio test for H_0: β = β_0

Let the null hypothesis related to the k × 1 vector β be H_0: β = β_0, where β_0 is specified by the investigator. The elements of β_0 can take any value, including zero. The concerned alternative hypothesis is H_1: β ≠ β_0.

Since ε ~ N(0, σ²I) in y = Xβ + ε, so y ~ N(Xβ, σ²I). Thus the whole parametric space and the restricted space, Ω and ω respectively, are given by

Ω : {(β, σ²) : -∞ < β_i < ∞, σ² > 0, i = 1, 2, ..., k}
ω : {(β, σ²) : β = β_0, σ² > 0}.
6

The unconstrained likelihood under Ω is

L(β, σ² | y, X) = (2πσ²)^{-n/2} exp[-(1/(2σ²))(y - Xβ)'(y - Xβ)].

This is maximized over Ω when

β̃ = (X'X)^{-1}X'y
σ̃² = (1/n)(y - Xβ̃)'(y - Xβ̃)

where β̃ and σ̃² are the maximum likelihood estimates of β and σ², which are the values obtained by maximizing the likelihood function. Thus

L̂(Ω) = max L(β, σ² | y, X)
      = n^{n/2} exp(-n/2) / {(2π)^{n/2} [(y - Xβ̃)'(y - Xβ̃)]^{n/2}}.
7

The constrained likelihood under ω is

L(β_0, σ² | y, X) = (2πσ²)^{-n/2} exp[-(1/(2σ²))(y - Xβ_0)'(y - Xβ_0)].

Since β_0 is known, the constrained likelihood function has the optimum variance estimator

σ̃_ω² = (1/n)(y - Xβ_0)'(y - Xβ_0)

and

L̂(ω) = max L(β, σ² | y, X, β = β_0)
      = n^{n/2} exp(-n/2) / {(2π)^{n/2} [(y - Xβ_0)'(y - Xβ_0)]^{n/2}}.

The likelihood ratio is

L̂(Ω)/L̂(ω) = { [(y - Xβ_0)'(y - Xβ_0)] / [(y - Xβ̃)'(y - Xβ̃)] }^{n/2}
           = (σ̃_ω² / σ̃²)^{n/2}
           = λ^{n/2}
8

where

λ = (y - Xβ_0)'(y - Xβ_0) / [(y - Xβ̃)'(y - Xβ̃)]

is the ratio of the quadratic forms. Now we simplify the numerator of λ as follows:

(y - Xβ_0)'(y - Xβ_0) = [(y - Xβ̃) + X(β̃ - β_0)]'[(y - Xβ̃) + X(β̃ - β_0)]
                      = (y - Xβ̃)'(y - Xβ̃) + 2y'[I - X(X'X)^{-1}X']X(β̃ - β_0) + (β̃ - β_0)'X'X(β̃ - β_0)
                      = (y - Xβ̃)'(y - Xβ̃) + (β̃ - β_0)'X'X(β̃ - β_0).

Thus

λ = 1 + (β̃ - β_0)'X'X(β̃ - β_0) / [(y - Xβ̃)'(y - Xβ̃)]

or

λ - 1 = λ_0 = (β̃ - β_0)'X'X(β̃ - β_0) / [(y - Xβ̃)'(y - Xβ̃)]

where 0 ≤ λ_0 < ∞.
9

Distribution of ratio of quadratic forms

Now we find the distribution of the quadratic forms involved in λ_0 as follows:

(y - Xβ̃)'(y - Xβ̃) = e'e
                   = y'[I - X(X'X)^{-1}X']y
                   = y'H̄y
                   = (Xβ + ε)'H̄(Xβ + ε)
                   = ε'H̄ε    (using H̄X = 0)
                   = (n - k)σ̂².

Result: Let Z be an n × 1 random vector that is distributed as N(0, σ²I_n) and A any symmetric idempotent n × n matrix of rank p. Then Z'AZ/σ² ~ χ²(p). If B is another n × n symmetric idempotent matrix of rank q, then Z'BZ/σ² ~ χ²(q). If AB = 0, then Z'AZ is distributed independently of Z'BZ.

Using this result, we have

y'H̄y/σ² = (n - k)σ̂²/σ² ~ χ²(n - k).
10

Further, if H_0 is true, then β = β_0; substituting β = β_0 in the following expression gives the numerator of λ_0. The numerator of λ_0 can be rewritten in a general form for any β as

(β̃ - β)'X'X(β̃ - β) = ε'X(X'X)^{-1}X'X(X'X)^{-1}X'ε
                    = ε'X(X'X)^{-1}X'ε
                    = ε'Hε

where H = X(X'X)^{-1}X' is an idempotent matrix of rank k.

Thus, using the above result, we have

ε'Hε/σ² = ε'X(X'X)^{-1}X'ε/σ² ~ χ²(k).

Furthermore, the product of the quadratic form matrices in the numerator (ε'Hε) and in the denominator (ε'H̄ε) of λ_0 is

[I - X(X'X)^{-1}X'] X(X'X)^{-1}X' = X(X'X)^{-1}X' - X(X'X)^{-1}X'X(X'X)^{-1}X' = 0

and hence the χ² random variables in the numerator and denominator of λ_0 are independent. Dividing each χ² random variable by its respective degrees of freedom, we get
λ_1 = { [(β̃ - β_0)'X'X(β̃ - β_0)/σ²] / k } / { [(n - k)σ̂²/σ²] / (n - k) }
    = (β̃ - β_0)'X'X(β̃ - β_0) / (k σ̂²)
    = [(y - Xβ_0)'(y - Xβ_0) - (y - Xβ̃)'(y - Xβ̃)] / (k σ̂²)
    ~ F(k, n - k) under H_0.

Note that

(y - Xβ_0)'(y - Xβ_0) : restricted error sum of squares
(y - Xβ̃)'(y - Xβ̃)     : unrestricted error sum of squares
Numerator of λ_1       : difference between the restricted and unrestricted error sums of squares.

The decision rule is to reject H_0: β = β_0 at α level of significance whenever

λ_1 ≥ F_α(k, n - k)

where F_α(k, n - k) is the upper critical point of the central F-distribution with k and (n - k) degrees of freedom.
ECONOMETRIC THEORY

MODULE – III
Lecture - 14
Multiple Linear Regression Analysis

Dr. Shalabh
Department of Mathematics and Statistics
Indian Institute of Technology Kanpur
2

Likelihood ratio test for H_0: Rβ = r

The same logic and reasoning used in the development of the likelihood ratio test for H_0: β = β_0 can be extended to develop the likelihood ratio test for H_0: Rβ = r as follows:

Ω = {(β, σ²) : -∞ < β_i < ∞, σ² > 0, i = 1, 2, ..., k}
ω = {(β, σ²) : -∞ < β_i < ∞, Rβ = r, σ² > 0}.

Let β̃ = (X'X)^{-1}X'y. Then

E(Rβ̃) = Rβ

V(Rβ̃) = E[R(β̃ - β)(β̃ - β)'R'] = R V(β̃) R' = σ² R(X'X)^{-1}R'.

Since β̃ ~ N(β, σ²(X'X)^{-1}), so

Rβ̃ ~ N(Rβ, σ² R(X'X)^{-1}R')

and, under H_0,

Rβ̃ - r = Rβ̃ - Rβ = R(β̃ - β) ~ N(0, σ² R(X'X)^{-1}R').
3
There exists a matrix Q such that

[R(X'X)^{-1}R']^{-1} = QQ'

and then

ξ = Q'R(β̃ - β) ~ N(0, σ² I_J).

Therefore, under H_0: Rβ - r = 0,

ξ'ξ/σ² = (Rβ̃ - r)'QQ'(Rβ̃ - r)/σ²
       = (Rβ̃ - r)'[R(X'X)^{-1}R']^{-1}(Rβ̃ - r)/σ²
       = (β̃ - β)'R'[R(X'X)^{-1}R']^{-1}R(β̃ - β)/σ²
       = ε'X(X'X)^{-1}R'[R(X'X)^{-1}R']^{-1}R(X'X)^{-1}X'ε/σ²
       ~ χ²(J)

which follows because X(X'X)^{-1}R'[R(X'X)^{-1}R']^{-1}R(X'X)^{-1}X' is an idempotent matrix and its trace is J, the number of degrees of freedom associated with the quadratic form.

Also, irrespective of whether H_0 is true or not,

e'e/σ² = (y - Xβ̃)'(y - Xβ̃)/σ² = y'H̄y/σ² = (n - k)σ̂²/σ² ~ χ²(n - k).
4

Moreover, the product of the quadratic form matrices of e'e and (β̃ - β)'R'[R(X'X)^{-1}R']^{-1}R(β̃ - β) is zero, implying that the two quadratic forms are independently distributed.

So, in terms of the likelihood ratio test statistic,

λ_1 = { (Rβ̃ - r)'[R(X'X)^{-1}R']^{-1}(Rβ̃ - r)/σ² / J } / { (n - k)σ̂²/σ² / (n - k) }
    = (Rβ̃ - r)'[R(X'X)^{-1}R']^{-1}(Rβ̃ - r) / (J σ̂²)
    ~ F(J, n - k) under H_0.

So the decision rule is to reject H_0 whenever

λ_1 ≥ F_α(J, n - k)

where F_α(J, n - k) is the upper critical point of the central F distribution with J and (n - k) degrees of freedom.
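As an illustrative sketch (not from the notes, hypothetical data and a hypothetical R matrix), the F statistic for a general linear hypothesis Rβ = r can be computed directly from the formula above:

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, k = 40, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 2.0, 2.0, 0.0]) + rng.normal(size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)
sigma2_hat = np.sum((y - X @ b)**2) / (n - k)

R = np.array([[0.0, 1.0, -1.0, 0.0]])        # H0: equality of the 2nd and 3rd coefficients (J = 1)
r = np.array([0.0])
J = R.shape[0]

XtX_inv = np.linalg.inv(X.T @ X)
diff = R @ b - r
F = diff @ np.linalg.solve(R @ XtX_inv @ R.T, diff) / (J * sigma2_hat)
p_value = 1 - stats.f.cdf(F, J, n - k)
print(F, p_value)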
5
Test of significance of regression (Analysis of variance)

If we set R = [0  I_{k-1}], r = 0, then the hypothesis H_0: Rβ = r reduces to the following null hypothesis:

H_0: β_2 = β_3 = ... = β_k = 0

which is tested against the alternative hypothesis

H_1: β_j ≠ 0 for at least one j = 2, 3, ..., k.

This hypothesis determines whether there is a linear relationship between y and any of the explanatory variables X_2, X_3, ..., X_k. Notice that X_1 corresponds to the intercept term in the model, and hence x_i1 = 1 for all i = 1, 2, ..., n.

This is an overall or global test of model adequacy. Rejection of the null hypothesis indicates that at least one of the explanatory variables X_2, X_3, ..., X_k contributes significantly to the model. This is called analysis of variance.

Since ε ~ N(0, σ²I), so y ~ N(Xβ, σ²I) and

b = (X'X)^{-1}X'y ~ N(β, σ²(X'X)^{-1}).

Also

σ̂² = SS_res/(n - k)
   = (y - ŷ)'(y - ŷ)/(n - k)
   = y'[I - X(X'X)^{-1}X']y/(n - k)
   = y'H̄y/(n - k)
   = (y'y - b'X'y)/(n - k).

Since (X'X)^{-1}X'H̄ = 0, b and σ̂² are independently distributed.


6

Since y'H̄y = ε'H̄ε and H̄ is an idempotent matrix, so

SS_res/σ² ~ χ²(n - k),

i.e., a central χ² distribution with (n - k) degrees of freedom.

Partition X = [X_1, X_2*] where the submatrix X_2* contains the explanatory variables X_2, X_3, ..., X_k, and partition β = (β_1, β_2*')' where the subvector β_2* contains the regression coefficients β_2, β_3, ..., β_k.

Now partition the total sum of squares due to the y's as

SS_T = y'Ay = SS_reg + SS_res

where

SS_reg = b_2*' X_2*' A X_2* b_2*

is the sum of squares due to regression and b_2* is the OLSE of β_2*.

The sum of squares due to residuals is given by

SS_res = (y - Xb)'(y - Xb) = y'H̄y = SS_T - SS_reg.
7
Further,

SS_reg/σ² ~ χ²_{k-1}(β_2*' X_2*' A X_2* β_2* / (2σ²)), i.e., a non-central χ² distribution with non-centrality parameter β_2*' X_2*' A X_2* β_2* / (2σ²),

SS_T/σ² ~ χ²_{n-1}(β_2*' X_2*' A X_2* β_2* / (2σ²)), i.e., a non-central χ² distribution with non-centrality parameter β_2*' X_2*' A X_2* β_2* / (2σ²).

Since X_2*' H̄ = 0, SS_reg and SS_res are independently distributed. The mean square due to regression is

MS_reg = SS_reg / (k - 1)

and the mean square due to residuals (or error) is

MS_res = SS_res / (n - k).

Then

MS_reg / MS_res ~ F_{k-1, n-k}(β_2*' X_2*' A X_2* β_2* / (2σ²))

which is a non-central F-distribution with (k - 1) and (n - k) degrees of freedom and non-centrality parameter β_2*' X_2*' A X_2* β_2* / (2σ²).

Under H_0: β_2 = β_3 = ... = β_k = 0,

F = MS_reg / MS_res ~ F_{k-1, n-k}.

The decision rule is to reject H_0 at α level of significance whenever

F ≥ F_α(k - 1, n - k).
8

The calculation of the F-statistic can be summarized in the form of an analysis of variance (ANOVA) table as follows:

Source of variation   Sum of squares   Degrees of freedom   Mean squares                F
Regression            SS_reg           k - 1                MS_reg = SS_reg/(k - 1)     F = MS_reg/MS_res
Error                 SS_res           n - k                MS_res = SS_res/(n - k)
Total                 SS_T             n - 1

Rejection of H_0 indicates that it is likely that at least one β_i ≠ 0 (i = 1, 2, ..., k).
9

Test of hypothesis on individual regression coefficients

In case the test in the analysis of variance rejects the null hypothesis, another question arises: which of the regression coefficients is/are responsible for the rejection of the null hypothesis? The explanatory variables corresponding to such regression coefficients are important for the model.

Adding explanatory variables also increases the variance of the fitted values ŷ, so one needs to be cautious that only those regressors are added which are really important in explaining the response. Adding unimportant explanatory variables may increase the residual mean square, which may decrease the usefulness of the model.

Testing the null hypothesis H_0: β_j = 0 versus the alternative hypothesis H_1: β_j ≠ 0 has already been discussed in the case of the simple linear regression model. In the present case, if H_0 is accepted, it implies that the explanatory variable X_j can be deleted from the model. The corresponding test statistic is

t = b_j / se(b_j) ~ t(n - k) under H_0

where the standard error of the OLSE b_j of β_j is

se(b_j) = √(σ̂² C_jj), with C_jj the jth diagonal element of (X'X)^{-1} corresponding to b_j.

The decision rule is to reject H_0 at α level of significance if |t| > t_{α/2, n-k}.

Note that this is only a partial or marginal test because b_j depends on all the other explanatory variables X_i (i ≠ j) that are in the model. It is a test of the contribution of X_j given the other explanatory variables in the model.
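A brief illustrative sketch (not from the notes, hypothetical data) of the marginal t tests for the individual coefficients:

import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, k = 35, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([2.0, 1.5, 0.0]) + rng.normal(size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)
sigma2_hat = np.sum((y - X @ b)**2) / (n - k)
C = np.linalg.inv(X.T @ X)

t_stats = b / np.sqrt(sigma2_hat * np.diag(C))           # t_j = b_j / se(b_j)
p_values = 2 * (1 - stats.t.cdf(np.abs(t_stats), df=n - k))
print(t_stats, p_values)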
ECONOMETRIC THEORY

MODULE – III
Lecture - 15
Multiple Linear Regression Analysis

Dr. Shalabh
Department of Mathematics and Statistics
Indian Institute of Technology Kanpur
2

Confidence interval estimation

The confidence intervals in the multiple regression model can be constructed for individual regression coefficients as well as jointly. We consider both of them as follows:

Confidence interval on an individual regression coefficient

Assuming that the ε_i's are identically and independently distributed following N(0, σ²) in y = Xβ + ε, we have

y ~ N(Xβ, σ²I)
b ~ N(β, σ²(X'X)^{-1}).

Thus the marginal distribution of any regression coefficient estimate is

b_j ~ N(β_j, σ² C_jj)

where C_jj is the jth diagonal element of (X'X)^{-1}.

Thus

t_j = (b_j - β_j) / √(σ̂² C_jj) ~ t(n - k),  j = 1, 2, ..., k

where

σ̂² = SS_res/(n - k) = (y'y - b'X'y)/(n - k).
3

So the 100(1 - α)% confidence interval for β_j (j = 1, 2, ..., k) is obtained as follows:

P[-t_{α/2, n-k} ≤ (b_j - β_j)/√(σ̂² C_jj) ≤ t_{α/2, n-k}] = 1 - α

P[b_j - t_{α/2, n-k} √(σ̂² C_jj) ≤ β_j ≤ b_j + t_{α/2, n-k} √(σ̂² C_jj)] = 1 - α.

So the confidence interval is

(b_j - t_{α/2, n-k} √(σ̂² C_jj),  b_j + t_{α/2, n-k} √(σ̂² C_jj)).
4
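As an illustrative sketch (not from the notes, hypothetical data), 95% confidence intervals b_j ± t_{α/2, n-k} √(σ̂² C_jj) for all coefficients can be computed as follows.

import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n, k, alpha = 35, 3, 0.05
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([2.0, 1.5, -1.0]) + rng.normal(size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)
sigma2_hat = np.sum((y - X @ b)**2) / (n - k)
se = np.sqrt(sigma2_hat * np.diag(np.linalg.inv(X.T @ X)))

t_crit = stats.t.ppf(1 - alpha / 2, df=n - k)
lower, upper = b - t_crit * se, b + t_crit * se
print(np.column_stack([lower, b, upper]))      # lower limit, estimate, upper limit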

Simultaneous confidence intervals on regression coefficients

A set of confidence intervals that hold simultaneously with probability (1 - α) are called simultaneous or joint confidence intervals.

It is relatively easy to define a joint confidence region for β in the multiple regression model. Since

(b - β)'X'X(b - β) / (k MS_res) ~ F_{k, n-k},

⇒ P[(b - β)'X'X(b - β) / (k MS_res) ≤ F_α(k, n - k)] = 1 - α.

So a 100(1 - α)% joint confidence region for all of the parameters in β is

(b - β)'X'X(b - β) / (k MS_res) ≤ F_α(k, n - k)

which describes an elliptically shaped region.


5

Coefficient of determination (R²) and adjusted R²

Let R be the multiple correlation coefficient between y and X_1, X_2, ..., X_k. Then the square of the multiple correlation coefficient (R²) is called the coefficient of determination. The value of R² commonly describes how well the sample regression line fits the observed data. It is also treated as a measure of goodness of fit of the model.

Assuming that the intercept term is present in the model as

y_i = β_1 + β_2 X_i2 + β_3 X_i3 + ... + β_k X_ik + u_i,  i = 1, 2, ..., n,

then

R² = 1 - e'e / Σ_{i=1}^n (y_i - ȳ)²
   = 1 - SS_res / SS_T
   = SS_reg / SS_T

where

SS_res : sum of squares due to residuals,
SS_T   : total sum of squares,
SS_reg : sum of squares due to regression.

R² measures the explanatory power of the model, which in turn reflects the goodness of fit of the model. It reflects the model adequacy in the sense of how much explanatory power the explanatory variables have.
6
Since

e'e = y'[I - X(X'X)^{-1}X']y = y'H̄y,

Σ_{i=1}^n (y_i - ȳ)² = Σ_{i=1}^n y_i² - nȳ²,

where ȳ = (1/n) Σ_{i=1}^n y_i = (1/n) ℓ'y with ℓ = (1, 1, ..., 1)', y = (y_1, y_2, ..., y_n)'.

Thus

Σ_{i=1}^n (y_i - ȳ)² = y'y - n (1/n²) y'ℓℓ'y
                     = y'y - (1/n) y'ℓℓ'y
                     = y'y - y'ℓ(ℓ'ℓ)^{-1}ℓ'y
                     = y'[I - ℓ(ℓ'ℓ)^{-1}ℓ']y
                     = y'Ay

where A = I - ℓ(ℓ'ℓ)^{-1}ℓ'.

So

R² = 1 - y'H̄y / y'Ay.

The limits of R² are 0 and 1, i.e., 0 ≤ R² ≤ 1.

• R² = 0 indicates the poorest fit of the model.
• R² = 1 indicates the best fit of the model.
• R² = 0.95 indicates that 95% of the variation in y is explained by the explanatory variables. In simple words, the model is 95% good.
• Similarly, any other value of R² between 0 and 1 indicates the adequacy of the fitted model.
7

Adjusted R²

If more explanatory variables are added to the model, then R² increases. In case the added variables are irrelevant, R² will still increase and give an overly optimistic picture.

With the purpose of correcting this overly optimistic picture, the adjusted R², denoted as R̄² or adj R², is used, which is defined as

R̄² = 1 - [SS_res/(n - k)] / [SS_T/(n - 1)]
   = 1 - [(n - 1)/(n - k)] (1 - R²).

We will see later that (n - k) and (n - 1) are the degrees of freedom associated with the distributions of SS_res and SS_T. Moreover, the quantities SS_res/(n - k) and SS_T/(n - 1) are based on the unbiased estimators of the respective variances of e and y in the context of analysis of variance.

The adjusted R² will decline if the addition of an extra variable produces too small a reduction in (1 - R²) to compensate for the increase in (n - 1)/(n - k).

Another limitation of the adjusted R² is that it can also be negative. For example, if k = 3, n = 10 and R² = 0.16, then

R̄² = 1 - (9/7) × 0.84 = -0.08 < 0

which has no interpretation.
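A brief illustrative sketch (not from the notes, hypothetical data) comparing R² and the adjusted R²; the irrelevant extra regressor raises R² but tends to lower the adjusted value.

import numpy as np

def r2_adj(X, y):
    n, k = X.shape
    b = np.linalg.solve(X.T @ X, X.T @ y)
    SS_res = np.sum((y - X @ b)**2)
    SS_T = np.sum((y - y.mean())**2)
    R2 = 1 - SS_res / SS_T
    R2_bar = 1 - (n - 1) / (n - k) * (1 - R2)     # adjusted R^2
    return R2, R2_bar

rng = np.random.default_rng(7)
n = 20
x1 = rng.normal(size=n)
noise = rng.normal(size=n)                        # irrelevant regressor
y = 1 + 2 * x1 + rng.normal(size=n)

X_small = np.column_stack([np.ones(n), x1])
X_big = np.column_stack([np.ones(n), x1, noise])
print(r2_adj(X_small, y))
print(r2_adj(X_big, y))                           # R^2 rises, adjusted R^2 often falls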
8
Limitations

1. If the constant term is absent in the model, then R² cannot be defined. In such cases, R² can be negative. Some ad-hoc measures based on R² for regression through the origin have been proposed in the literature.

2. R² is sensitive to extreme values, so R² lacks robustness.

3. Consider a situation where we have the following two models:

y_i = β_1 + β_2 X_i2 + ... + β_k X_ik + u_i,  i = 1, 2, ..., n

log y_i = γ_1 + γ_2 X_i2 + ... + γ_k X_ik + v_i.

The question now is which model is better.

For the first model,

R_1² = 1 - Σ_{i=1}^n (y_i - ŷ_i)² / Σ_{i=1}^n (y_i - ȳ)²

and for the second model, letting ŷ_i* denote the fitted value of log y_i, an option is to define R² as

R_2² = 1 - Σ_{i=1}^n (log y_i - ŷ_i*)² / Σ_{i=1}^n (log y_i - mean(log y))².

As such, R_1² and R_2² are not comparable. If the two models still need to be compared, a better proposition is to define R² as

R_3² = 1 - Σ_{i=1}^n (y_i - antilog ŷ_i*)² / Σ_{i=1}^n (y_i - ȳ)².

Now R_1² and R_3² on comparison may give an idea about the adequacy of the two models.
9

Relationship of analysis of variance test and coefficient of determination

Assuming β1 to be an intercept term, then for H 0 : β 2= β 3= ...= β k= 0 the F-statistic in analysis of variance test is

MS r e g
F=
MS res
SS r e g
(n − k ) SS r e g  n − k  SS r e g  n − k  SST
=  =  
(k − 1) SS r e s  k − 1  SST − SS r e g  k − 1  1 − SS r e g
SST
n−k  R
2
= 
 k −1  1− R
2

where R2 is the coefficient of determination.

So F and R2 are closely related. When R2 = 0, then F = 0.

In the limit, when R² = 1, F = ∞. So both F and R² vary directly: a larger R² implies a greater F value.

That is why the F test under analysis of variance is termed the measure of the overall significance of the estimated regression.

It is also a test of significance of R2. If F is highly significant, it implies that we can reject H0, i.e., y is linearly related to
X’s.
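The identity between F and R² is easy to verify numerically. The minimal sketch below (simulated data, illustrative dimensions) computes the overall F statistic both from the sums of squares and from R²:

```python
# Sketch: overall F statistic from sums of squares and, equivalently, from R^2.
import numpy as np

rng = np.random.default_rng(1)
n, k = 50, 4                                   # k columns including the intercept
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta = np.array([1.0, 0.5, -0.3, 0.2])
y = X @ beta + rng.normal(size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
ss_res = e @ e
ss_t = np.sum((y - y.mean()) ** 2)
ss_reg = ss_t - ss_res
R2 = 1 - ss_res / ss_t

F_anova = (ss_reg / (k - 1)) / (ss_res / (n - k))
F_from_R2 = (n - k) / (k - 1) * R2 / (1 - R2)
print(F_anova, F_from_R2)                      # identical up to rounding error
```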
ECONOMETRIC THEORY

MODULE – IV
Lecture - 16
Predictions in Linear Regression
Model
Dr. Shalabh
Department of Mathematics and Statistics
Indian Institute of Technology Kanpur
2

Prediction of values of study variable


An important use of linear regression modeling is to predict the average and actual values of study variable. The term
prediction of value of study variable corresponds to knowing the value of E(y) (in case of average value) and value of y (in
case of actual value) for a given value of explanatory variable. We consider both the cases. The prediction of values
consists of two steps. In the first step, the regression coefficients are estimated on the basis of given observations. In the
second step, these estimators are then used to construct the predictor which provides the prediction of actual or average
values of study variables. Based on this approach of construction of predictors, there are two situations in which the actual
and average values of the study variable can be predicted: within sample prediction and outside sample prediction. We describe the prediction in both situations.

Within sample prediction in simple linear regression model


Consider the linear regression model y = β₀ + β₁x + ε, based on a sample of n pairs of observations (xᵢ, yᵢ), i = 1, 2,..., n, following yᵢ = β₀ + β₁xᵢ + εᵢ, where the εᵢ's are identically and independently distributed as N(0, σ²). The parameters β₀ and β₁ are estimated by ordinary least squares as b₀ and b₁, respectively, as
b₀ = ȳ − b₁x̄
b₁ = sxy / sxx
where
sxy = Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ), sxx = Σᵢ₌₁ⁿ (xᵢ − x̄)², x̄ = (1/n) Σᵢ₌₁ⁿ xᵢ, ȳ = (1/n) Σᵢ₌₁ⁿ yᵢ.
The fitted model is ŷ = b₀ + b₁x.
3
Case 1: Prediction of average value of y
Suppose we want to predict the value of E(y) for a given value of x = x0 . Then the predictor is given by
pm = b₀ + b₁x₀.
Here m stands for mean value.

Predictive bias
The prediction error is given as
pm − E ( y ) =b0 + b1 x0 − E ( β 0 + β1 x0 + ε )
=b0 + b1 x0 − ( β 0 + β1 x0 )
= (b0 − β 0 ) + (b1 − β1 ) x0 .
Then the prediction bias is given as
E [ pm − E ( y ) ] = E (b0 − β 0 ) + E (b1 − β1 ) x0

= 0 + 0 = 0.
Thus the predictor pm is an unbiased predictor of E(y).

Predictive variance
The predictive variance of pm is
PV(pm) = Var(b₀ + b₁x₀)
= Var[ȳ + b₁(x₀ − x̄)]
= Var(ȳ) + (x₀ − x̄)² Var(b₁) + 2(x₀ − x̄) Cov(ȳ, b₁)
= σ²/n + σ²(x₀ − x̄)²/sxx + 0
= σ² [1/n + (x₀ − x̄)²/sxx].
4

Estimate of predictive variance


The predictive variance can be estimated by substituting σ² by σ̂² = MSE as
PV̂(pm) = σ̂² [1/n + (x₀ − x̄)²/sxx]
= MSE [1/n + (x₀ − x̄)²/sxx].

Prediction interval estimation


The 100(1 − α )% prediction interval for E(y) is obtained as follows:
The predictor pm is a linear combination of normally distributed random variables, so it is also normally distributed as

pm ~ N ( β 0 + β1 x0 , PV ( pm ) ) .
So if σ² is known, then the distribution of
[pm − E(y)] / √PV(pm)
is N(0, 1). So the 100(1 − α)% prediction interval is obtained as
P[ −z_{α/2} ≤ (pm − E(y))/√PV(pm) ≤ z_{α/2} ] = 1 − α
which gives the prediction interval for E(y) as
[ pm − z_{α/2} √(σ²(1/n + (x₀ − x̄)²/sxx)),  pm + z_{α/2} √(σ²(1/n + (x₀ − x̄)²/sxx)) ].
5

When σ² is unknown, it is replaced by σ̂² = MSE, and in this case the sampling distribution of
[pm − E(y)] / √(MSE(1/n + (x₀ − x̄)²/sxx))
is the t-distribution with (n − 2) degrees of freedom, i.e., t_{n−2}.

The 100(1 − α)% prediction interval in this case is
P[ −t_{α/2, n−2} ≤ (pm − E(y))/√(MSE(1/n + (x₀ − x̄)²/sxx)) ≤ t_{α/2, n−2} ] = 1 − α
which gives the prediction interval as
[ pm − t_{α/2, n−2} √(MSE(1/n + (x₀ − x̄)²/sxx)),  pm + t_{α/2, n−2} √(MSE(1/n + (x₀ − x̄)²/sxx)) ].

Note that the width of the prediction interval for E(y) is a function of x₀. The interval width is minimum for x₀ = x̄ and widens as |x₀ − x̄| increases. This is expected, since the best estimates of y are made at x-values near the centre of the data, and the precision of estimation deteriorates as we move towards the boundary of the x-space.
6

Case 2: Prediction of actual value


If x₀ is the value of the explanatory variable, then the actual value predictor for y is
pₐ = b₀ + b₁x₀.
Here a stands for "actual". Note that the form of the predictor is the same as that of the average value predictor, but its predictive error and other properties are different. This is the dual nature of the predictor.

Predictive bias
The predictive error of pa is given by

pa − y0 = b0 + b1 x0 − ( β 0 + β1 x0 + ε )
= (b0 − β 0 ) + (b1 − β1 ) x0 − ε .

Thus, we find that

E ( pa − y0 ) = E (b0 − β 0 ) + E (b1 − β1 ) x0 − E (ε )
= 0+0+0
=0
which implies that pa is an unbiased predictor of y0.
7

Predictive variance

Because the future observation y₀ is independent of pₐ, the predictive variance of pₐ is
PV(pₐ) = E(pₐ − y₀)²
= E[(b₀ − β₀) + (x₀ − x̄)(b₁ − β₁) + x̄(b₁ − β₁) − ε₀]²
= Var(b₀) + (x₀ − x̄)² Var(b₁) + x̄² Var(b₁) + Var(ε₀) + 2(x₀ − x̄) Cov(b₀, b₁) + 2x̄ Cov(b₀, b₁) + 2(x₀ − x̄)x̄ Var(b₁)
[the remaining terms are 0, assuming the independence of ε₀ with ε₁, ε₂,..., εₙ]
= Var(b₀) + [(x₀ − x̄)² + x̄² + 2(x₀ − x̄)x̄] Var(b₁) + Var(ε₀) + 2[(x₀ − x̄) + x̄] Cov(b₀, b₁)
= Var(b₀) + x₀² Var(b₁) + Var(ε₀) + 2x₀ Cov(b₀, b₁)
= σ²(1/n + x̄²/sxx) + x₀²σ²/sxx + σ² − 2x₀x̄σ²/sxx
= σ² [1 + 1/n + (x₀ − x̄)²/sxx].

Estimate of predictive variance


The estimate of the predictive variance can be obtained by replacing σ² by its estimate σ̂² = MSE as
PV̂(pₐ) = σ̂² [1 + 1/n + (x₀ − x̄)²/sxx]
= MSE [1 + 1/n + (x₀ − x̄)²/sxx].
8
Prediction interval

If σ² is known, then the distribution of
(pₐ − y₀) / √PV(pₐ)
is N(0, 1). So the 100(1 − α)% prediction interval is obtained as
P[ −z_{α/2} ≤ (pₐ − y₀)/√PV(pₐ) ≤ z_{α/2} ] = 1 − α
which gives the prediction interval for y₀ as
[ pₐ − z_{α/2} √(σ²(1 + 1/n + (x₀ − x̄)²/sxx)),  pₐ + z_{α/2} √(σ²(1 + 1/n + (x₀ − x̄)²/sxx)) ].

When σ² is unknown, then
(pₐ − y₀) / √PV̂(pₐ)
follows a t-distribution with (n − 2) degrees of freedom. The 100(1 − α)% prediction interval for y₀ in this case is obtained as
P[ −t_{α/2, n−2} ≤ (pₐ − y₀)/√PV̂(pₐ) ≤ t_{α/2, n−2} ] = 1 − α
which gives the prediction interval
[ pₐ − t_{α/2, n−2} √(MSE(1 + 1/n + (x₀ − x̄)²/sxx)),  pₐ + t_{α/2, n−2} √(MSE(1 + 1/n + (x₀ − x̄)²/sxx)) ].
9

The prediction interval is of minimum width at x0 = x and widens as x0 − x increases.

The prediction interval for pa is wider than the prediction interval for pm because the prediction interval for pa
depends on both the error from the fitted model as well as the error associated with the future observations.
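A compact numerical sketch of both intervals (simulated data; x₀ and the 95% level are illustrative choices, and scipy is assumed to be available) shows that the interval for the actual value is always wider than the interval for E(y):

```python
# Sketch: 95% intervals for E(y) and for an actual y at x0 in simple linear regression.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 30
x = rng.uniform(0, 10, size=n)
y = 2.0 + 1.5 * x + rng.normal(scale=2.0, size=n)

sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
b0 = y.mean() - b1 * x.mean()
mse = np.sum((y - b0 - b1 * x) ** 2) / (n - 2)     # estimate of sigma^2

x0 = 5.0
p = b0 + b1 * x0                                   # pm and pa share the same point value
t = stats.t.ppf(0.975, df=n - 2)
se_mean = np.sqrt(mse * (1 / n + (x0 - x.mean()) ** 2 / sxx))
se_actual = np.sqrt(mse * (1 + 1 / n + (x0 - x.mean()) ** 2 / sxx))

print("interval for E(y):", (p - t * se_mean, p + t * se_mean))
print("interval for y0  :", (p - t * se_actual, p + t * se_actual))   # wider
```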

Within sample prediction in multiple linear regression model


Consider the multiple regression model with k explanatory variables,
y = Xβ + ε,
where y = (y₁, y₂,..., yₙ)' is an n×1 vector of n observations on the study variable, X = ((xᵢⱼ)), i = 1, 2,..., n, j = 1, 2,..., k, is an n×k matrix of n observations on each of the k explanatory variables, β = (β₁, β₂,..., βₖ)' is a k×1 vector of regression coefficients and ε = (ε₁, ε₂,..., εₙ)' is an n×1 vector of random error components (disturbance terms) following N(0, σ²Iₙ). If an intercept term is present, the first column of X is taken to be (1, 1,..., 1)'.

Let the parameter β be estimated by its ordinary least squares estimator b = (X'X)⁻¹X'y. Then the predictor is p = Xb, which can be used for predicting both the actual and the average values of the study variable. This is the dual nature of the predictor.
10

Case 1: Prediction of average value of y

When the objective is to predict the average value of y, i.e., E(y) then the estimation error is given by
p − E ( y ) =Xb − X β
= X (b − β )
= X ( X ' X ) −1 X ' ε
= Hε
where H = X ( X ' X ) −1 X '.

Then
E [ p − E ( y)] = X β − X β = 0
which proves that the predictor p = Xb provides unbiased prediction for average value.

The predictive variance of p is
PVₘ(p) = E[{p − E(y)}'{p − E(y)}]
= E[ε'H·Hε]
= E(ε'Hε)
= σ² tr H
= σ² k.
The predictive variance can be estimated by PV̂ₘ(p) = σ̂²k, where σ̂² = MSE is obtained from the analysis of variance based on the OLSE.
11
Case 2: Prediction of actual value of y
When the predictor p = Xb is used for predicting the actual value of the study variable y, then its prediction error is given by
p − y = Xb − Xβ − ε
= X(b − β) − ε
= X(X'X)⁻¹X'ε − ε
= −[I − X(X'X)⁻¹X']ε
= −(I − H)ε.
Thus
E(p − y) = 0
which shows that p provides unbiased predictions for the actual values of the study variable.

The predictive variance in this case is
PVₐ(p) = E[(p − y)'(p − y)]
= E[ε'(I − H)(I − H)ε]
= E[ε'(I − H)ε]
= σ² tr(I − H)
= σ²(n − k).
The predictive variance can be estimated by
PV̂ₐ(p) = σ̂²(n − k)
where σ̂² = MSE is obtained from the analysis of variance based on the OLSE. Comparing the performance of p in predicting the actual and the average values, we find that p is a better predictor of the average value than of the actual value when
PVₘ(p) < PVₐ(p), or k < (n − k), or 2k < n,
i.e., when the total number of observations is more than twice the number of explanatory variables.
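The two predictive variances differ only through tr(H) = k and tr(I − H) = n − k. A minimal sketch (simulated X; the dimensions and σ² are illustrative) verifies this and the 2k < n comparison:

```python
# Sketch: within-sample predictive variances sigma^2 * k (average value) vs sigma^2 * (n - k) (actual value).
import numpy as np

rng = np.random.default_rng(3)
n, k, sigma2 = 25, 4, 2.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])

H = X @ np.linalg.inv(X.T @ X) @ X.T
PV_mean = sigma2 * np.trace(H)                 # = sigma^2 * k
PV_actual = sigma2 * np.trace(np.eye(n) - H)   # = sigma^2 * (n - k)
print(PV_mean, sigma2 * k)
print(PV_actual, sigma2 * (n - k))
print("p better for the average value:", 2 * k < n)
```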
ECONOMETRIC THEORY

MODULE – IV
Lecture – 17
Predictions in Linear Regression
Model
Dr. Shalabh
Department of Mathematics and Statistics
Indian Institute of Technology Kanpur
2

Outside sample prediction in multiple linear regression model


Consider the model
y = Xβ + ε      (1)
where y is an n×1 vector of n observations on the study variable, X is an n×k matrix of explanatory variables and ε is an n×1 vector of disturbances following N(0, σ²Iₙ).
Further, suppose a set of n_f observations on the same set of explanatory variables is also available, but the corresponding n_f observations on the study variable are not available. Assuming that this set of observations also follows the same model, we can write
y_f = X_f β + ε_f      (2)
where y_f is an n_f×1 vector of future values, X_f is an n_f×k matrix of known values of the explanatory variables and ε_f is an n_f×1 vector of disturbances following N(0, σ²I_{n_f}). It is also assumed that the elements of ε and ε_f are independently distributed.

We now consider the prediction of the y_f values for given X_f from model (2). This can be done by estimating the regression coefficients from model (1) based on the n observations and using them in formulating the predictor for model (2). If ordinary least squares estimation is used to estimate β in model (1) as
b = (X'X)⁻¹X'y,
then the corresponding predictor is
p_f = X_f b = X_f(X'X)⁻¹X'y.
3
Case 1: Prediction of average value of y
When the aim is to predict the average value E(y_f), then the prediction error is
p_f − E(y_f) = X_f b − X_f β
= X_f(b − β)
= X_f(X'X)⁻¹X'ε.
Then
E[p_f − E(y_f)] = X_f(X'X)⁻¹X'E(ε) = 0.
Thus p_f provides unbiased prediction for the average value.

The predictive covariance matrix of p_f is
Covₘ(p_f) = E[{p_f − E(y_f)}{p_f − E(y_f)}']
= E[X_f(X'X)⁻¹X'εε'X(X'X)⁻¹X_f']
= X_f(X'X)⁻¹X'E(εε')X(X'X)⁻¹X_f'
= σ²X_f(X'X)⁻¹X'X(X'X)⁻¹X_f'
= σ²X_f(X'X)⁻¹X_f'.
The predictive variance of p_f is
PVₘ(p_f) = E[{p_f − E(y_f)}'{p_f − E(y_f)}]
= tr[Covₘ(p_f)]
= σ² tr[(X'X)⁻¹X_f'X_f].
If σ² is unknown, then replace σ² by σ̂² = MSE in the expressions of the predictive covariance matrix and predictive variance; their estimates are
Cov̂ₘ(p_f) = σ̂²X_f(X'X)⁻¹X_f'
PV̂ₘ(p_f) = σ̂² tr[(X'X)⁻¹X_f'X_f].
4
Case 2: Prediction of actual value of y
When p_f is used to predict the actual value y_f, then the prediction error is
p_f − y_f = X_f b − X_f β − ε_f
= X_f(b − β) − ε_f.
Then
E(p_f − y_f) = X_f E(b − β) − E(ε_f) = 0.
Thus p_f provides unbiased prediction for the actual values.

The predictive covariance matrix of p_f in this case is
Covₐ(p_f) = E[(p_f − y_f)(p_f − y_f)']
= E[{X_f(b − β) − ε_f}{(b − β)'X_f' − ε_f'}]
= X_f V(b) X_f' + E(ε_f ε_f')      (using (b − β) = (X'X)⁻¹X'ε and the independence of ε and ε_f)
= σ²[X_f(X'X)⁻¹X_f' + I_{n_f}].
The predictive variance of p_f is
PVₐ(p_f) = E[(p_f − y_f)'(p_f − y_f)]
= tr[Covₐ(p_f)]
= σ²{tr[(X'X)⁻¹X_f'X_f] + n_f}.
The estimates of the covariance matrix and predictive variance can be obtained by replacing σ² by σ̂² = MSE as
Cov̂ₐ(p_f) = σ̂²[X_f(X'X)⁻¹X_f' + I_{n_f}]
PV̂ₐ(p_f) = σ̂²{tr[(X'X)⁻¹X_f'X_f] + n_f}.
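A minimal sketch of outside-sample prediction (simulated data; the dimensions and coefficient values are illustrative) computing p_f and the two estimated predictive variances:

```python
# Sketch: outside-sample predictor p_f = X_f b with its estimated predictive variances.
import numpy as np

rng = np.random.default_rng(4)
n, nf, k = 40, 5, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
Xf = np.column_stack([np.ones(nf), rng.normal(size=(nf, k - 1))])
beta = np.array([1.0, 2.0, -1.0])
y = X @ beta + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
pf = Xf @ b
mse = np.sum((y - X @ b) ** 2) / (n - k)                 # sigma^2 estimate

PV_mean = mse * np.trace(XtX_inv @ Xf.T @ Xf)            # average-value prediction
PV_actual = mse * (np.trace(XtX_inv @ Xf.T @ Xf) + nf)   # actual-value prediction
print(pf)
print(PV_mean, PV_actual)
```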
5
Simultaneous prediction of average and actual values of study variable
The predictions are generally obtained either for the average values of study variable or actual values of study variable. In

many applications, it may not be appropriate to confine our attention only to either of the two. It may be more appropriate

in some situations to predict both the values simultaneously, i.e., consider the prediction of actual and average values of

study variable simultaneously.

For example, suppose a firm deals with the sale of fertilizer to the user. The interest of company would be in predicting

the average value of yield which the company would like to use in showing that the average yield of the crop increases by

using their fertilizer. On the other side, the user would not be interested in the average value. The user would like to know

the actual increase in the yield by using the fertilizer. Suppose both the seller and the user go for prediction through

regression modeling. Now using the classical tools, the statistician can predict either the actual value or the average value.

This can safeguard the interest of either the user or the seller. Instead of this, it is required to safeguard the interests of

both by striking a balance between the objectives of the seller and the user.

This can be achieved by combining both the predictions of actual and average values. This can be done by formulating an

objective function or target function. Such a target function has to be flexible and should allow to assign different weights

to the choice of two kinds of predictions depending upon their importance in any given application and also reducible to

individual predictions leading to actual and average value prediction.

Now we consider the simultaneous prediction in within and outside sample cases.
6

Simultaneous prediction in within sample prediction


Define a target function
τ = λy + (1 − λ)E(y),  0 ≤ λ ≤ 1,
which is a convex combination of the actual value y and the average value E(y). The weight λ is a constant lying between zero and one whose value reflects the importance being assigned to the actual value prediction. Moreover, λ = 0 gives the average value prediction and λ = 1 gives the actual value prediction. For example, the value of λ in the fertilizer example depends on the rules and regulations of the market, the norms of society and other considerations. The value of λ is the choice of the practitioner.
Consider the multiple regression model
y = Xβ + ε,  E(ε) = 0,  E(εε') = σ²Iₙ.
Estimate β by ordinary least squares and construct the predictor
p = Xb.
Now employ this predictor for predicting the actual and average values simultaneously through the target function. The prediction error is
p − τ = Xb − λy − (1 − λ)E(y)
= Xb − λ(Xβ + ε) − (1 − λ)Xβ
= X(b − β) − λε.
Thus
E(p − τ) = X E(b − β) − λ E(ε) = 0.
So p provides unbiased prediction for τ.
7

The variance of p is
Var(p) = E[(p − τ)'(p − τ)]
= E[{(b − β)'X' − λε'}{X(b − β) − λε}]
= E[ε'X(X'X)⁻¹X'X(X'X)⁻¹X'ε + λ²ε'ε − λ(b − β)'X'ε − λε'X(b − β)]
= E[(1 − 2λ)ε'X(X'X)⁻¹X'ε + λ²ε'ε]
= σ²(1 − 2λ) tr[(X'X)⁻¹X'X] + λ²σ² tr Iₙ
= σ²[(1 − 2λ)k + λ²n].

Simultaneous prediction in outside sample prediction

Consider the model described earlier under outside sample prediction as
y = Xβ + ε,  E(ε) = 0, V(ε) = σ²Iₙ
y_f = X_f β + ε_f,  E(ε_f) = 0, V(ε_f) = σ²I_{n_f},
where y is n×1, X is n×k, β is k×1, ε is n×1, y_f is n_f×1, X_f is n_f×k and ε_f is n_f×1.

The target function in this case is defined as
τ_f = λy_f + (1 − λ)E(y_f),  0 ≤ λ ≤ 1.

The predictor based on the OLSE of β is
p_f = X_f b,  b = (X'X)⁻¹X'y.
8

The predictive error of p is


p f − τ f= X f b − λ y f − (1 − λ ) E ( y f )
= X f b − λ ( X f β + ε f ) − (1 − λ ) X f β
= X f (b − β ) − λε f .
So
E ( p f − τ=
f ) X f E (b − β ) − λ E (ε f )
= 0.
Thus pf provides unbiased prediction for τ f .

The variance of pf is

Var ( p f ) =E ( p f − τ ) '( p f − τ )
= E {(b − β ) ' ε 'f − λε 'f }{ X f (b − β ) − λε f }

E ε ' X ( X ' X ) −1 X 'f X f ( X ' X ) X ' ε + λε 'f ε f − 2λε 'f X f ( X ' X ) −1 X ' ε 
−1
 
σ 2 tr { X ( X ' X ) −1 X 'f X f ( X ' X ) −1 X '} + λ 2 n f 

assuming that the elements is ε and ε f are mutually independent.
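A short Monte Carlo sketch of the within-sample result Var(p) = σ²[(1 − 2λ)k + λ²n] (all quantities simulated and illustrative, with λ = 0.4 chosen arbitrarily):

```python
# Sketch: Monte Carlo check of E[(p - tau)'(p - tau)] = sigma^2 * ((1 - 2*lam)*k + lam^2 * n).
import numpy as np

rng = np.random.default_rng(5)
n, k, sigma, lam = 30, 3, 1.0, 0.4
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta = np.array([1.0, 0.5, -0.5])
XtX_inv = np.linalg.inv(X.T @ X)

vals = []
for _ in range(20000):
    eps = rng.normal(scale=sigma, size=n)
    y = X @ beta + eps
    p = X @ (XtX_inv @ X.T @ y)                # OLS predictor p = Xb
    tau = lam * y + (1 - lam) * (X @ beta)     # target: convex combination of y and E(y)
    vals.append((p - tau) @ (p - tau))

print(np.mean(vals))                                   # Monte Carlo estimate
print(sigma**2 * ((1 - 2 * lam) * k + lam**2 * n))     # theoretical value
```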


ECONOMETRIC THEORY

MODULE – V
Lecture - 18
Generalized and Weighted Least
Squares Estimation
Dr. Shalabh
Department of Mathematics and Statistics
Indian Institute of Technology Kanpur
2
The usual linear regression model assumes that all the random error components are identically and independently distributed with constant variance. When this assumption is violated, the ordinary least squares estimator of the regression coefficients loses its property of minimum variance in the class of linear and unbiased estimators. The violation of this assumption can arise in any one of the following situations:
1. The variance of random error components is not constant.
2. The random error components are not independent.
3. The random error components do not have constant variance as well as they are not independent.

In such cases, the covariance matrix of random error components does not remain in the form of an identity matrix but can
be considered as any positive definite matrix. Under such assumption, the OLSE does not remain efficient as in the case of
identity covariance matrix. The generalized or weighted least squares method is used in such situations to estimate the
parameters of the model.

In this method, the deviation between the observed and expected values of yi is multiplied by a weight ωi where ωi
is chosen to be inversely proportional to the variance of yi.

For the simple linear regression model, the weighted least squares function is
S(β₀, β₁) = Σᵢ₌₁ⁿ ωᵢ(yᵢ − β₀ − β₁xᵢ)².
The least squares normal equations are obtained by differentiating S(β₀, β₁) with respect to β₀ and β₁ and equating them to zero:
β̂₀ Σᵢ₌₁ⁿ ωᵢ + β̂₁ Σᵢ₌₁ⁿ ωᵢxᵢ = Σᵢ₌₁ⁿ ωᵢyᵢ
β̂₀ Σᵢ₌₁ⁿ ωᵢxᵢ + β̂₁ Σᵢ₌₁ⁿ ωᵢxᵢ² = Σᵢ₌₁ⁿ ωᵢxᵢyᵢ.
The solution of these two normal equations gives the weighted least squares estimates of β₀ and β₁.
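A minimal sketch of these two normal equations in matrix form (simulated heteroskedastic data; the weights ωᵢ = 1/Var(yᵢ) are the illustrative choice described above):

```python
# Sketch: weighted least squares for the simple linear regression model via its normal equations.
import numpy as np

rng = np.random.default_rng(6)
n = 100
x = rng.uniform(1, 5, size=n)
sd = 0.5 * x                                   # error standard deviation grows with x
y = 1.0 + 2.0 * x + rng.normal(scale=sd)
w = 1.0 / sd**2                                # weights inversely proportional to Var(y_i)

# Normal equations:
#   b0*sum(w)   + b1*sum(w*x)   = sum(w*y)
#   b0*sum(w*x) + b1*sum(w*x^2) = sum(w*x*y)
A = np.array([[w.sum(), (w * x).sum()],
              [(w * x).sum(), (w * x**2).sum()]])
c = np.array([(w * y).sum(), (w * x * y).sum()])
b0_hat, b1_hat = np.linalg.solve(A, c)
print(b0_hat, b1_hat)
```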
3

Generalized least squares estimation

Suppose in the usual multiple regression model
y = Xβ + ε with E(ε) = 0, V(ε) = σ²I,
the assumption V(ε) = σ²I is violated and becomes
V(ε) = σ²Ω
where Ω is a known n×n nonsingular, positive definite and symmetric matrix.

This structure of Ω incorporates both cases:
 when Ω is diagonal but with unequal variances, and
 when Ω is not necessarily diagonal; depending on the presence of correlated errors, the off-diagonal elements are nonzero.

The OLSE of β is
b = (X'X)⁻¹X'y.
In such cases the OLSE gives an unbiased estimate but has more variability, as
E(b) = (X'X)⁻¹X'E(y) = (X'X)⁻¹X'Xβ = β
V(b) = (X'X)⁻¹X'V(y)X(X'X)⁻¹ = σ²(X'X)⁻¹X'ΩX(X'X)⁻¹.
Now we attempt to find a better estimator as follows:


4
Since Ω is positive definite and symmetric, there exists a nonsingular matrix K such that
KK' = Ω.
Then premultiplying the model
y = Xβ + ε
by K⁻¹ gives
K⁻¹y = K⁻¹Xβ + K⁻¹ε
or
z = Bβ + g
where z = K⁻¹y, B = K⁻¹X, g = K⁻¹ε. Now observe that
E(g) = K⁻¹E(ε) = 0
V(g) = E[{g − E(g)}{g − E(g)}']
= E(gg')
= E[K⁻¹εε'K'⁻¹]
= K⁻¹E(εε')K'⁻¹
= σ²K⁻¹ΩK'⁻¹
= σ²K⁻¹KK'K'⁻¹
= σ²I.
Thus the elements of g have zero mean and constant variance σ², and they are uncorrelated.
5

So either minimize
S(β) = g'g = ε'Ω⁻¹ε = (y − Xβ)'Ω⁻¹(y − Xβ)
and get the normal equations
(X'Ω⁻¹X)β̂ = X'Ω⁻¹y
or
β̂ = (X'Ω⁻¹X)⁻¹X'Ω⁻¹y.
Alternatively, we can apply OLS to the transformed model and obtain the OLSE of β as
β̂ = (B'B)⁻¹B'z
= (X'K'⁻¹K⁻¹X)⁻¹X'K'⁻¹K⁻¹y
= (X'Ω⁻¹X)⁻¹X'Ω⁻¹y.
This is termed the generalized least squares estimator (GLSE) of β.

The estimation error of the GLSE is
β̂ = (B'B)⁻¹B'(Bβ + g) = β + (B'B)⁻¹B'g
or
β̂ − β = (B'B)⁻¹B'g.
6

Then
E(β̂ − β) = (B'B)⁻¹B'E(g) = 0
which shows that the GLSE is an unbiased estimator of β. The covariance matrix of the GLSE is given by
V(β̂) = E[{β̂ − E(β̂)}{β̂ − E(β̂)}']
= E[(B'B)⁻¹B'gg'B(B'B)⁻¹]
= (B'B)⁻¹B'E(gg')B(B'B)⁻¹.
Since
E(gg') = K⁻¹E(εε')K'⁻¹ = σ²K⁻¹ΩK'⁻¹ = σ²K⁻¹KK'K'⁻¹ = σ²I,
so
V(β̂) = σ²(B'B)⁻¹B'B(B'B)⁻¹
= σ²(B'B)⁻¹
= σ²(X'K'⁻¹K⁻¹X)⁻¹
= σ²(X'Ω⁻¹X)⁻¹.

Now we prove that GLSE is the best linear unbiased estimator of β.
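A minimal sketch of the GLSE (simulated data with a known diagonal Ω; all names and numbers are illustrative), computed both from the direct formula and via the K⁻¹-transformed model, which agree:

```python
# Sketch: GLS estimator computed directly and via the transformed model z = B*beta + g.
import numpy as np

rng = np.random.default_rng(7)
n, k = 60, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta = np.array([1.0, -2.0, 0.5])
Omega = np.diag(rng.uniform(0.5, 3.0, size=n))       # known positive definite covariance structure
K = np.linalg.cholesky(Omega)                        # K K' = Omega
y = X @ beta + K @ rng.normal(size=n)                # errors with covariance sigma^2 * Omega (sigma = 1)

Omega_inv = np.linalg.inv(Omega)
beta_gls = np.linalg.solve(X.T @ Omega_inv @ X, X.T @ Omega_inv @ y)

K_inv = np.linalg.inv(K)
z, B = K_inv @ y, K_inv @ X                          # transformed model has spherical errors
beta_via_transform = np.linalg.lstsq(B, z, rcond=None)[0]

print(beta_gls)
print(beta_via_transform)                            # identical up to rounding
```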


7

The Gauss–Markov theorem for the case Var(ε) = Ω

The Gauss–Markov theorem establishes that the generalized least squares (GLS) estimator of β, given by β̂ = (X'Ω⁻¹X)⁻¹X'Ω⁻¹y, is BLUE (best linear unbiased estimator). By best we mean that β̂ minimizes the variance of any linear combination of the estimated coefficients, ℓ'β̂. We note that
E(β̂) = E[(X'Ω⁻¹X)⁻¹X'Ω⁻¹y]
= (X'Ω⁻¹X)⁻¹X'Ω⁻¹E(y)
= (X'Ω⁻¹X)⁻¹X'Ω⁻¹Xβ
= β.
Thus β̂ is an unbiased estimator of β.

The covariance matrix of β̂ is given by
V(β̂) = [(X'Ω⁻¹X)⁻¹X'Ω⁻¹] V(y) [(X'Ω⁻¹X)⁻¹X'Ω⁻¹]'
= [(X'Ω⁻¹X)⁻¹X'Ω⁻¹] Ω [(X'Ω⁻¹X)⁻¹X'Ω⁻¹]'
= (X'Ω⁻¹X)⁻¹X'Ω⁻¹ Ω Ω⁻¹X(X'Ω⁻¹X)⁻¹
= (X'Ω⁻¹X)⁻¹.
Thus
Var(ℓ'β̂) = ℓ' V(β̂) ℓ = ℓ'(X'Ω⁻¹X)⁻¹ℓ.


8

Let β be another unbiased estimator of β that is a linear combination of the data. Our goal, then, is to show that
Var ( ' β ) ≥  '( X ' Ω −1 X ) −1  with at least one  such that Var ( ' β ) ≥  '( X ' Ω −1 X ) −1  .
We first note that we can write any other estimator of β that is a linear combination of the data as

 ( X ' Ω −1 X ) −1 X ' Ω −1 + B  y + b*
β=   0

where B is an p x n matrix and bo* is a p x 1 vector of constants that appropriately adjusts the GLS estimator to form
the alternative estimate. Then

(
( β ) E ( X ' Ω −1 X ) −1 X ' Ω −1 + B  y + bo*
E= )
= ( X ' Ω −1 X ) −1 X ' Ω −1 + B  E ( y ) + b0*

= ( X ' Ω −1 X ) −1 X ' Ω −1 + B  XB + b0*

=( X ' Ω −1 X ) −1 X ' Ω −1 X β + BX β + b0*


β + BX β + b0* .
=

Consequently, β is unbiased if and only if both b0 = 0 and BX = 0. The covariance matrix of


*
β is

(
V ( β ) Var ( X ' Ω −1 X ) −1 X ' Ω −1 + B  y
= )
= ( X ' Ω −1 X ) −1 X ' Ω −1 + B  V ( y ) ( X ' Ω −1 X ) −1 X ' Ω −1 + B  '

= ( X ' Ω −1 X ) −1 X ' Ω −1 + B  Ω ( X ' Ω −1 X ) −1 X ' Ω −1 + B  '

= ( X ' Ω −1 X ) −1 X ' Ω −1 + B  Ω Ω −1 X ( X ' Ω −1 X ) −1 + B '

= ( X ' Ω −1 X ) −1 + BΩB '


9

because BX = 0, which implies that (BX)' = X'B' = 0. Then
Var(ℓ'β̃) = ℓ' V(β̃) ℓ
= ℓ'[(X'Ω⁻¹X)⁻¹ + BΩB']ℓ
= ℓ'(X'Ω⁻¹X)⁻¹ℓ + ℓ'BΩB'ℓ
= Var(ℓ'β̂) + ℓ'BΩB'ℓ.

We note that Ω is a positive definite matrix. Consequently, there exists some nonsingular matrix K such that Ω = K'K. As a result, BΩB' = BK'KB' is at least a positive semidefinite matrix; hence ℓ'BΩB'ℓ ≥ 0. Next note that we can define ℓ* = KB'ℓ. As a result,
ℓ'BΩB'ℓ = ℓ*'ℓ* = Σᵢ ℓᵢ*²
which must be strictly greater than 0 for some ℓ ≠ 0 unless B = 0.

For ℓ = (0, 0,..., 0, 1, 0,..., 0)' where the 1 occurs at the ith place, ℓ'β̂ = β̂ᵢ is the best linear unbiased estimator of ℓ'β = βᵢ for all i = 1, 2,…, k. Thus, the GLS estimator of β is the best linear unbiased estimator.
10

Weighted least squares estimation

When the ε's are uncorrelated and have unequal variances, then
V(ε) = σ²Ω = σ² diag(1/ω₁, 1/ω₂,..., 1/ωₙ).
The estimation procedure is usually called weighted least squares.

Let W = Ω⁻¹; then the weighted least squares estimator of β is obtained by solving the normal equations (X'WX)β̂ = X'Wy, which give β̂ = (X'WX)⁻¹X'Wy, where ω₁, ω₂,..., ωₙ are called the weights.

The observations with large variances usually get smaller weights than the observations with small variances.
ECONOMETRIC THEORY

MODULE – VI
Lecture – 19
Regression Analysis Under Linear
Restrictions
Dr. Shalabh
Department of Mathematics and Statistics
Indian Institute of Technology Kanpur
2

One of the basic objectives in any statistical modeling is to find good estimators of the parameters. In the context of the multiple linear regression model y = Xβ + ε, the ordinary least squares estimator b = (X'X)⁻¹X'y is the best linear unbiased estimator of β. Several approaches have been attempted in the literature to improve further upon the OLSE. One approach to
improve the estimators is the use of extraneous information or prior information. In applied work, such prior information
may be available about the regression coefficients. For example, in economics, the constant returns to scale imply that
the exponents in a Cobb-Douglas production function should sum to unity. In another example, absence of money illusion
on the part of consumers implies that the sum of money income and price elasticities in a demand function should be
zero. These types of constraints or the prior information may be available from
i. some theoretical considerations.
ii. past experience of the experimenter.
iii. empirical investigations.
iv. some extraneous sources etc.

To utilize such information in improving the estimation of regression coefficients, it can be expressed in the form of
i. exact linear restrictions.
ii. stochastic linear restrictions.
iii. inequality restrictions.

We consider the use of prior information in the form of exact and stochastic linear restrictions in the model y = Xβ + ε
where y is a ( n × 1) vector of observations on study variable, X is a ( n × k ) matrix of observations on explanatory
variables X 1 , X 2 ,..., X k , β is a (k ×1) vector of regression coefficients and ε is a (n × 1) vector of disturbance terms.
3
Exact linear restrictions
Suppose the prior information binding the regression coefficients is available from some extraneous sources which can be
expressed in the form of exact linear restrictions as
r = Rβ
where r is a ( q × 1) vector and R is a (q × k ) matrix with rank(R) = q (q < k). The elements in r and R are known.
Some examples of exact linear restriction r = R β are as follows:
(i) If there are two restrictions with k = 6, like
β₂ = β₄
β₃ + 2β₄ + β₅ = 1,
then
r = (0, 1)',  R = [ 0 1 0 −1 0 0 ; 0 0 1 2 1 0 ]
(rows separated by semicolons).

(ii) If k = 3 and suppose β₂ = 3, then
r = [3],  R = [0 1 0].

(iii) If k = 3 and suppose β₁ : β₂ : β₃ :: ab : b : 1, then
r = (0, 0, 0)',  R = [ 1 −a 0 ; 0 1 −b ; 1 0 −ab ].

The ordinary least squares estimator b = (X'X)⁻¹X'y does not use the prior information. It does not obey the restrictions in the sense that r ≠ Rb. So the issue is how to use the sample information and the prior information together to find an improved estimator of β.
4
Restricted least squares estimation
The restricted least squares estimation method enables the use of the sample information and the prior information simultaneously. In this method, β is chosen such that the error sum of squares is minimized subject to the linear restrictions r = Rβ. This can be achieved using the Lagrangian multiplier technique. Define the Lagrangian function
S(β, λ) = (y − Xβ)'(y − Xβ) − 2λ'(Rβ − r)
where λ is a (q × 1) vector of Lagrangian multipliers.

Using the result that if a and b are vectors and A is a suitably defined matrix, then
∂(a'Aa)/∂a = (A + A')a
∂(a'b)/∂a = b,
we have
∂S(β, λ)/∂β = 2X'Xβ − 2X'y − 2R'λ = 0    (*)
∂S(β, λ)/∂λ = Rβ − r = 0.
Pre-multiplying equation (*) by R(X'X)⁻¹, we have
2Rβ − 2R(X'X)⁻¹X'y − 2R(X'X)⁻¹R'λ = 0
or
Rβ − Rb − R(X'X)⁻¹R'λ = 0
⇒ λ = −[R(X'X)⁻¹R']⁻¹(Rb − r),
using the fact that R(X'X)⁻¹R' is positive definite and r = Rβ.

Substituting λ in equation (*), we get
2X'Xβ − 2X'y + 2R'[R(X'X)⁻¹R']⁻¹(Rb − r) = 0
or
X'Xβ = X'y − R'[R(X'X)⁻¹R']⁻¹(Rb − r).
Pre-multiplying by (X'X)⁻¹ yields
β̂_R = (X'X)⁻¹X'y + (X'X)⁻¹R'[R(X'X)⁻¹R']⁻¹(r − Rb)
= b + (X'X)⁻¹R'[R(X'X)⁻¹R']⁻¹(r − Rb).
This estimator is termed the restricted regression estimator of β.
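A minimal sketch of the restricted estimator (simulated data; the single restriction β₂ = β₃, i.e. R = [0, 1, −1], r = 0, is an illustrative choice), checking that Rβ̂_R = r holds exactly:

```python
# Sketch: restricted least squares b_R = b + (X'X)^{-1} R' [R (X'X)^{-1} R']^{-1} (r - R b).
import numpy as np

rng = np.random.default_rng(8)
n, k = 80, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 0.9, 1.1]) + rng.normal(size=n)

R = np.array([[0.0, 1.0, -1.0]])       # restriction beta_2 = beta_3  (q = 1)
r = np.array([0.0])

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y                  # unrestricted OLSE
M = np.linalg.inv(R @ XtX_inv @ R.T)
b_R = b + XtX_inv @ R.T @ M @ (r - R @ b)

print(b, b_R)
print(R @ b_R)                         # equals r = 0 (up to rounding), unlike R @ b
```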

Properties of restricted regression estimator


1. The restricted regression estimator obeys the exact restrictions, i.e., r = Rβ̂_R. To verify this, consider
Rβ̂_R = R[ b + (X'X)⁻¹R'{R(X'X)⁻¹R'}⁻¹(r − Rb) ]
= Rb + r − Rb
= r.
6

2. Unbiasedness
The estimation error of β̂_R is
β̂_R − β = (b − β) + (X'X)⁻¹R'[R(X'X)⁻¹R']⁻¹(Rβ − Rb)
= [ I − (X'X)⁻¹R'{R(X'X)⁻¹R'}⁻¹R ](b − β)
= D(b − β)
where
D = I − (X'X)⁻¹R'[R(X'X)⁻¹R']⁻¹R.
Thus
E(β̂_R − β) = D E(b − β) = 0,
implying that β̂_R is an unbiased estimator of β.
7

3. Covariance matrix
The covariance matrix of β̂_R is
V(β̂_R) = E[(β̂_R − β)(β̂_R − β)']
= D E[(b − β)(b − β)'] D'
= D V(b) D'
= σ² D(X'X)⁻¹D'
= σ²(X'X)⁻¹ − σ²(X'X)⁻¹R'[R(X'X)⁻¹R']⁻¹R(X'X)⁻¹,
which can be obtained as follows. Consider
D(X'X)⁻¹ = (X'X)⁻¹ − (X'X)⁻¹R'[R(X'X)⁻¹R']⁻¹R(X'X)⁻¹
so that
D(X'X)⁻¹D' = [ (X'X)⁻¹ − (X'X)⁻¹R'{R(X'X)⁻¹R'}⁻¹R(X'X)⁻¹ ][ I − R'{R(X'X)⁻¹R'}⁻¹R(X'X)⁻¹ ]
= (X'X)⁻¹ − (X'X)⁻¹R'[R(X'X)⁻¹R']⁻¹R(X'X)⁻¹ − (X'X)⁻¹R'[R(X'X)⁻¹R']⁻¹R(X'X)⁻¹
  + (X'X)⁻¹R'[R(X'X)⁻¹R']⁻¹R(X'X)⁻¹R'[R(X'X)⁻¹R']⁻¹R(X'X)⁻¹
= (X'X)⁻¹ − (X'X)⁻¹R'[R(X'X)⁻¹R']⁻¹R(X'X)⁻¹.
 
8

Maximum likelihood estimation under exact restrictions

Assuming ε ~ N(0, σ²I), the maximum likelihood estimators of β and σ² can also be derived such that they obey r = Rβ. The Lagrangian function as per the maximum likelihood procedure can be written as
L(β, σ², λ) = (1/(2πσ²))^(n/2) exp[ −(y − Xβ)'(y − Xβ)/(2σ²) ] − λ'(Rβ − r)
where λ is a (q × 1) vector of Lagrangian multipliers. The normal equations are obtained by partially differentiating the log-likelihood function with respect to β, σ² and λ and equating them to zero:
∂ln L(β, σ², λ)/∂β = −(1/σ²)(X'Xβ − X'y) + 2R'λ = 0    (1)
∂ln L(β, σ², λ)/∂λ = 2(Rβ − r) = 0    (2)
∂ln L(β, σ², λ)/∂σ² = −n/(2σ²) + (y − Xβ)'(y − Xβ)/(2σ⁴) = 0.    (3)
Let β̃_R, σ̃²_R and λ̃ denote the maximum likelihood estimators of β, σ² and λ, respectively, which are obtained by solving equations (1), (2) and (3) as follows.

From equation (1), we get the optimal λ as
λ̃ = [R(X'X)⁻¹R']⁻¹(r − Rβ̃) / σ̃²_R.
9
Substituting λ in equation (1) gives

( r − Rβ )
−1
β + ( X ' X ) R '  R ( X ' X ) R '
βR =
−1 −1
 
where β = ( X ' X ) X ' y
−1
is the maximum likelihood estimator of β without restrictions. From equation (3), we get

σ 2
=
( y − X β ) ' ( y − X β )
R R
.
R
n

The Hessian matrix of second order partial derivatives of β and σ is positive definite
= at β β=
R and σ σ R2 .
2 2

The restricted least squares and restricted maximum likelihood estimators of β are same whereas they are different
for σ2.

Test of hypothesis
It is important to test the hypothesis
H 0 : r = Rβ
H1 : r ≠ Rβ
before using it in the estimation procedure.
The construction of the test statistic for this hypothesis is detailed in Module 3 on the multiple linear regression model. The resulting test statistic is
F = { (r − Rb)'[R(X'X)⁻¹R']⁻¹(r − Rb)/q } / { (y − Xb)'(y − Xb)/(n − k) }
which follows an F-distribution with q and (n − k) degrees of freedom under H₀. The decision rule is to reject H₀ at the α level of significance whenever F ≥ F₁₋α(q, n − k).
ECONOMETRIC THEORY

MODULE – VI
Lecture – 20
Regression Analysis Under Linear
Restrictions
Dr. Shalabh
Department of Mathematics and Statistics
Indian Institute of Technology Kanpur
2

Stochastic linear restrictions


The exact linear restrictions assume that there is no randomness involved in the auxiliary or prior information. This
assumption may not hold true in many practical situations and some randomness may be present. The prior information in
such cases can be formulated as

r = Rβ + V
where r is a (q × 1) vector, R is a (q × k) matrix and V is a (q × 1) vector of random errors. The elements in r and R are known. The term V reflects the randomness involved in the prior information r = Rβ. Assume
E(V) = 0,  E(VV') = Ψ,  E(εV') = 0,
where Ψ is a known (q × q) positive definite matrix and ε is the disturbance term in the multiple regression model y = Xβ + ε.
Note that E(r) = Rβ.

The possible reasons for such stochastic linear restriction are as follows:
i. Stochastic linear restrictions reflect the instability, i.e., the sampling variability, of earlier estimates. An unbiased estimate with a reasonable standard error may be available from such studies. For example, in repetitive studies, the surveys are conducted every year. Suppose the regression coefficient β₁ remains stable for several years and its estimate is provided along with its standard error; say its value remains stable around 0.5 with standard error 2. This information can be expressed as
r = β₁ + V₁,
where r = 0.5, E(V₁) = 0, E(V₁²) = 2².
Now Ψ can be formulated with this data. It is not necessary that we should have information on all the regression coefficients; we may have information on only some of them.
3
(ii) Sometimes the restrictions are in the form of inequality. Such restrictions may arise from theoretical considerations.
For example, the value of a regression coefficient may lie between 3 and 5, i.e., 3 ≤ β1 ≤ 5, say. In another example,
consider a simple linear regression model
y =β 0 + β1 x + ε
where y denotes the consumption expenditure on food and x denotes the income. Then the marginal propensity
(tendency) to consume is
dy/dx = β₁,
i.e., if the salary increases by one rupee, then one is expected to spend an amount β₁ of that rupee on food and save the amount (1 − β₁). We may put a bound on β₁: one can neither spend the whole of the additional rupee nor nothing out of it, so 0 < β₁ < 1. This is a natural restriction arising from theoretical considerations.

These bounds can be treated as p-sigma limits, say 2-sigma limits or confidence limits. Thus
μ − 2σ = 0,  μ + 2σ = 1
⇒ μ = 1/2, σ = 1/4.
These values can be interpreted as
β₁ + V₁ = 1/2,  E(V₁²) = 1/16.
(iii) Sometimes the truthfulness of exact linear restriction r = R β can be suspected and accordingly, an element of
uncertainty can be introduced. For example, one may say that 95% of the restrictions hold. So some element of
uncertainty prevails.
4

Pure and mixed regression estimation

Consider the multiple regression model
y = Xβ + ε
with n observations and k explanatory variables X₁, X₂,..., Xₖ. The ordinary least squares estimator of β is
b = (X'X)⁻¹X'y,
which is termed the pure estimator. The pure estimator b does not satisfy the restriction r = Rβ + V. So the objective is to obtain an estimate of β by utilizing the stochastic restrictions such that the resulting estimator satisfies the stochastic restrictions also. In order to avoid the conflict between the prior information and the sample information, we can combine them as follows.

Write
y = Xβ + ε with E(ε) = 0, E(εε') = σ²Iₙ
r = Rβ + V with E(V) = 0, E(VV') = Ψ, E(εV') = 0
jointly as
(y' r')' = (X' R')'β + (ε' V')'
or
a = Aβ + w
where a = (y' r')', A = (X' R')', w = (ε' V')'.

Note that
E(w) = (E(ε)' E(V)')' = 0
E(ww') = [ E(εε') E(εV') ; E(Vε') E(VV') ] = [ σ²Iₙ 0 ; 0 Ψ ] = Ω, say.

This shows that the disturbances are non-spherical or heteroskedastic. So the application of generalized least squares estimation will yield a more efficient estimator than ordinary least squares estimation. Applying generalized least squares to the model
a = Aβ + w with E(w) = 0, V(w) = Ω,
the generalized least squares estimator of β is given by
β̂_M = (A'Ω⁻¹A)⁻¹A'Ω⁻¹a.

The explicit form of this estimator is obtained as follows:
A'Ω⁻¹a = [X' R'] [ (1/σ²)Iₙ 0 ; 0 Ψ⁻¹ ] (y' r')' = (1/σ²)X'y + R'Ψ⁻¹r
A'Ω⁻¹A = [X' R'] [ (1/σ²)Iₙ 0 ; 0 Ψ⁻¹ ] (X' R')' = (1/σ²)X'X + R'Ψ⁻¹R.

Thus
β̂_M = [ (1/σ²)X'X + R'Ψ⁻¹R ]⁻¹ [ (1/σ²)X'y + R'Ψ⁻¹r ],
assuming σ² to be known. This is termed the mixed regression estimator.

If σ² is unknown, then σ² can be replaced by its estimator
σ̂² = s² = (y − Xb)'(y − Xb)/(n − k)
and the feasible mixed regression estimator of β is obtained as
β̂_f = [ (1/s²)X'X + R'Ψ⁻¹R ]⁻¹ [ (1/s²)X'y + R'Ψ⁻¹r ].
This is also termed the estimated or operationalized generalized least squares estimator.
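A minimal sketch of the mixed regression estimator under one stochastic restriction (all numbers illustrative; σ² is treated as known here for simplicity):

```python
# Sketch: mixed regression estimator combining sample information with r = R*beta + V.
import numpy as np

rng = np.random.default_rng(9)
n, k, sigma2 = 50, 3, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta = np.array([1.0, 0.5, -0.8])
y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)

# Stochastic prior information on beta_2: r = beta_2 + V, with Var(V) = 0.25
R = np.array([[0.0, 1.0, 0.0]])
Psi = np.array([[0.25]])
r = R @ beta + rng.normal(scale=0.5, size=1)

Psi_inv = np.linalg.inv(Psi)
A = X.T @ X / sigma2 + R.T @ Psi_inv @ R
c = X.T @ y / sigma2 + R.T @ Psi_inv @ r
beta_mixed = np.linalg.solve(A, c)

b_ols = np.linalg.solve(X.T @ X, X.T @ y)      # pure estimator, for comparison
print(b_ols, beta_mixed)
```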


7

Properties of mixed regression estimator

(i) Unbiasedness
The estimation error of β̂_M is
β̂_M − β = (A'Ω⁻¹A)⁻¹A'Ω⁻¹a − β
= (A'Ω⁻¹A)⁻¹A'Ω⁻¹(Aβ + w) − β
= (A'Ω⁻¹A)⁻¹A'Ω⁻¹w.
Then
E(β̂_M − β) = (A'Ω⁻¹A)⁻¹A'Ω⁻¹E(w) = 0.
So the mixed regression estimator provides an unbiased estimator of β. Note that the pure regression estimator b = (X'X)⁻¹X'y is also an unbiased estimator of β.

(ii) Covariance matrix
The covariance matrix of β̂_M is
V(β̂_M) = E[(β̂_M − β)(β̂_M − β)']
= (A'Ω⁻¹A)⁻¹A'Ω⁻¹E(ww')Ω⁻¹A(A'Ω⁻¹A)⁻¹
= (A'Ω⁻¹A)⁻¹
= [ (1/σ²)X'X + R'Ψ⁻¹R ]⁻¹.
8
(iii) The estimator β̂_M satisfies the stochastic linear restrictions in the sense that, writing r = Rβ̂_M + V,
E(r) = R E(β̂_M) + E(V) = Rβ + 0 = Rβ.

(iv) Comparison with OLSE

We first state a result that is used to establish the dominance of β̂_M over b.

Result: If A₁ and A₂ are suitably defined matrices, then the difference (A₁ − A₂) is positive definite if (A₂⁻¹ − A₁⁻¹) is positive definite.

Let
A₁ ≡ V(b) = σ²(X'X)⁻¹
A₂ ≡ V(β̂_M) = [ (1/σ²)X'X + R'Ψ⁻¹R ]⁻¹;
then
A₂⁻¹ − A₁⁻¹ = (1/σ²)X'X + R'Ψ⁻¹R − (1/σ²)X'X = R'Ψ⁻¹R
which is a positive definite matrix. This implies that
A₁ − A₂ = V(b) − V(β̂_M)
is a positive definite matrix. Thus β̂_M is more efficient than b under the criterion of covariance matrices, or Loewner ordering, provided σ² is known.
9
Testing of hypothesis

In the prior information specified by the stochastic restriction r = Rβ + V, we want to test whether there is a close relation between the sample information and the prior information. The compatibility of the sample and prior information is tested by the χ²-test statistic given by
χ² = (1/σ²)(r − Rb)'[ R(X'X)⁻¹R' + Ψ ]⁻¹(r − Rb), where b = (X'X)⁻¹X'y,
assuming σ² is known. This follows a χ²-distribution with q degrees of freedom under the null hypothesis H₀: E(r) = Rβ.
If Ψ = 0, then the distribution is degenerate and hence r becomes a fixed quantity. For the feasible version of the mixed regression estimator
β̂_f = [ (1/s²)X'X + R'Ψ⁻¹R ]⁻¹ [ (1/s²)X'y + R'Ψ⁻¹r ],
the optimal properties of the mixed regression estimator like linearity, unbiasedness and/or minimum variance do not remain valid. So there can be situations when the incorporation of prior information may lead to a loss in efficiency. This is not a favourable situation; in such situations, the pure regression estimator is better to use. In order to know whether the use of the prior information will lead to a better estimator or not, the null hypothesis H₀: E(r) = Rβ can be tested.
For testing the null hypothesis H₀: E(r) = Rβ when σ² is unknown, we use the statistic
F = { (r − Rb)'[ R(X'X)⁻¹R' + Ψ ]⁻¹(r − Rb)/q } / s²
where s² = (y − Xb)'(y − Xb)/(n − k), and F follows an F-distribution with q and (n − k) degrees of freedom under H₀.
ECONOMETRIC THEORY

MODULE – VII
Lecture - 21
Multicollinearity

Dr. Shalabh
Department of Mathematics and Statistics
Indian Institute of Technology Kanpur
2
A basic assumption in the multiple linear regression model is that the rank of the matrix of observations on the explanatory variables is the same as the number of explanatory variables. In other words, such a matrix is of full column rank. This in turn implies that all the explanatory variables are independent, i.e., there is no linear relationship among the explanatory variables. In such a case, the explanatory variables are said to be orthogonal.

In many situations in practice, the explanatory variables may not remain independent due to various reasons. The
situation where the explanatory variables are highly intercorrelated is referred to as multicollinearity.

Consider the multiple regression model
y = Xβ + ε,  ε ~ N(0, σ²I),
where y is n×1, X is n×k and β is k×1, with k explanatory variables X₁, X₂,…, Xₖ and the usual assumptions including rank(X) = k.
Assume that the observations on all the Xᵢ's and yᵢ's are centered and scaled to unit length. So
 X'X becomes a k×k matrix of correlation coefficients between the explanatory variables and
 X'y becomes a k×1 vector of correlation coefficients between the explanatory and study variables.

Let X = [X₁, X₂,…, Xₖ] where Xⱼ is the jth column of X denoting the n observations on Xⱼ. The column vectors X₁, X₂,…, Xₖ are linearly dependent if there exists a set of constants ℓ₁, ℓ₂,…, ℓₖ, not all zero, such that
Σⱼ₌₁ᵏ ℓⱼXⱼ = 0.
If this holds exactly for a subset of X₁, X₂,…, Xₖ, then rank(X'X) < k. Consequently (X'X)⁻¹ does not exist. If the condition Σⱼ₌₁ᵏ ℓⱼXⱼ = 0 is approximately true for some subset of X₁, X₂,…, Xₖ, then there will be a near-linear dependency in X'X. In such a case, the multicollinearity problem exists. It is also said that X'X becomes ill-conditioned.
3

Source of multicollinearity
1. Method of data collection
It is expected that the data is collected over the whole cross-section of variables. It may happen that the data is collected
over a subspace of the explanatory variables where the variables are linearly dependent. For example, sampling is done
only over a limited range of explanatory variables in the population.

2. Model and population constraints


There may exist some constraints on the model or on the population from which the sample is drawn. The sample may be generated from that part of the population having linear combinations.

3. Existence of identities or definitional relationships


There may exist some relationships among the variables which may be due to the definition of variables or any identity
relation among them. For example, if data is collected on the variables like income, saving and expenditure, then
income = saving + expenditure.
Such relationship will not change even when the sample size increases.
4

4. Imprecise formulation of model


The formulation of the model may unnecessarily be complicated. For example, the quadratic (or polynomial) terms or cross
product terms may appear as explanatory variables. For example, let there be 3 variables X1, X2 and X3, so k = 3.
Suppose their cross-product terms X1 X2 , X2 X3 and X1 X3 are also added. Then k rises to 6.

5. An over-determined model
Sometimes, due to over enthusiasm, large number of variables are included in the model to make it more realistic and
consequently the number of observations (n) becomes smaller than the number of explanatory variables (k). Such
situation can arise in medical research where the number of patients may be small but information is collected on a large
number of variables. In another example, if there is time series data for 50 years on consumption pattern, then it is
expected that the consumption pattern does not remain same for 50 years. So better option is to choose smaller number
of variables and hence it results into n < k. But this is not always advisable. For example, in microarray experiments, it is
not advisable to choose smaller number of variables.
5
Consequences of multicollinearity

To illustrate the consequences of the presence of multicollinearity, consider the model
y = β₁x₁ + β₂x₂ + ε,  E(ε) = 0, V(ε) = σ²I,
where x₁, x₂ and y are scaled to unit length.

The normal equations (X'X)b = X'y in this model become
[ 1 r ; r 1 ] (b₁ b₂)' = (r₁y r₂y)'
where r is the correlation coefficient between x₁ and x₂, rⱼy is the correlation coefficient between xⱼ and y (j = 1, 2) and b = (b₁ b₂)' is the OLSE of β. Then
(X'X)⁻¹ = (1/(1 − r²)) [ 1 −r ; −r 1 ]
⇒ b₁ = (r₁y − r·r₂y)/(1 − r²)
b₂ = (r₂y − r·r₁y)/(1 − r²).
So the covariance matrix is V(b) = σ²(X'X)⁻¹
⇒ Var(b₁) = Var(b₂) = σ²/(1 − r²)
Cov(b₁, b₂) = −rσ²/(1 − r²).
6

If x₁ and x₂ are uncorrelated, then r = 0,
X'X = [ 1 0 ; 0 1 ]
and rank(X'X) = 2.
If x₁ and x₂ are perfectly correlated, then r = ±1 and rank(X'X) = 1.
If r → ±1, then Var(b₁) = Var(b₂) → ∞.
So if the variables are perfectly collinear, the variance of the OLSEs becomes large. This indicates highly unreliable estimates, and this is an inadmissible situation.

Consider the following result:
r : 0.99, 0.9, 0.1, 0
Var(b₁) = Var(b₂) : 50σ², 5σ², 1.01σ², σ²

The standard errors of b₁ and b₂ rise sharply as r → ±1 and they break down at r = ±1 because X'X becomes singular.
 If r is close to 0, then the multicollinearity does not harm, and it is termed non-harmful multicollinearity.
 If r is close to +1 or −1, then the multicollinearity inflates the variance and it rises terribly. This is termed harmful multicollinearity.
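The 1/(1 − r²) inflation in the table above is easy to reproduce with a minimal sketch (the r values are those quoted in the table; the tabulated 50σ² and 5σ² are rounded):

```python
# Sketch: Var(b1)/sigma^2 = 1/(1 - r^2) for the two-regressor model with correlation r.
import numpy as np

for r in [0.99, 0.9, 0.1, 0.0]:
    XtX = np.array([[1.0, r], [r, 1.0]])       # correlation form of X'X
    var_factor = np.linalg.inv(XtX)[0, 0]      # Var(b1) in units of sigma^2
    print(r, round(var_factor, 2))             # about 50.25, 5.26, 1.01, 1.0
```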
7

There is no clear cut boundary to distinguish between the harmful and non-harmful multicollinearity. Generally, if r is low,
the multicollinearity is considered as non-harmful and if r is high, the multicollinearity is considered as harmful.

In the case of near or high multicollinearity, the following possible consequences are encountered.

1. The OLSE remains an unbiased estimator of β, but its sampling variance becomes very large. So the OLSE becomes imprecise and the property of BLUE does not hold anymore.
2. Due to large standard errors, the regression coefficients may not appear significant. Consequently, important variables may be dropped.
For example, to test H₀: β₁ = 0, we use the t-ratio
t₀ = b₁ / √(Var̂(b₁)).
Since Var̂(b₁) is large, t₀ is small, and consequently H₀ is more often accepted.
Thus harmful multicollinearity tends to delete important variables.

3. Due to large standard errors, a large confidence region may arise. For example, the confidence interval is given by
[ b₁ ± t_{α/2, n−1} √(Var̂(b₁)) ].
When Var̂(b₁) becomes large, the confidence interval becomes wider.

4. The OLSE may be sensitive to small changes in the values of explanatory variables. If some observations are

added or dropped, OLSE may change considerably in magnitude as well as in sign. Ideally, OLSE should not

change with inclusion or deletion of few observations. Thus OLSE loses stability and robustness.
8

When the number of explanatory variables is more than two, say k, as X₁, X₂,…, Xₖ, then the jth diagonal element of
C = (X'X)⁻¹
is
Cⱼⱼ = 1/(1 − Rⱼ²)
where Rⱼ² is the multiple correlation coefficient, or coefficient of determination, from the regression of Xⱼ on the remaining (k − 1) explanatory variables.
If Xⱼ is highly correlated with any subset of the other (k − 1) explanatory variables, then Rⱼ² is high and close to 1. Consequently the variance of the jth OLSE,
Var(bⱼ) = Cⱼⱼσ² = σ²/(1 − Rⱼ²),
becomes very high. The covariance between bᵢ and bⱼ will also be large if Xᵢ and Xⱼ are involved in the linear relationship leading to multicollinearity.

The least squares estimates bⱼ become too large in absolute value in the presence of multicollinearity. For example, consider the squared distance between b and β,
L² = (b − β)'(b − β).
Then
E(L²) = Σⱼ₌₁ᵏ E(bⱼ − βⱼ)² = Σⱼ₌₁ᵏ Var(bⱼ) = σ² tr(X'X)⁻¹.


9

The trace of a matrix is the same as the sum of its eigenvalues. If λ₁, λ₂,…, λₖ are the eigenvalues of X'X, then 1/λ₁, 1/λ₂,…, 1/λₖ are the eigenvalues of (X'X)⁻¹ and hence
E(L²) = σ² Σⱼ₌₁ᵏ (1/λⱼ),  λⱼ > 0.
If X'X is ill-conditioned due to the presence of multicollinearity, then at least one of the eigenvalues will be small. So the distance between b and β may also be large. Thus
E(L²) = E[(b − β)'(b − β)] = σ² tr(X'X)⁻¹ = E(b'b − 2b'β + β'β)
⇒ E(b'b) = σ² tr(X'X)⁻¹ + β'β
⇒ b is generally larger in magnitude than β
⇒ the OLSEs are too large in absolute value.

The least squares method produces bad estimates of the parameters in the presence of multicollinearity. This does not imply that the fitted model produces bad predictions also. If the predictions are confined to the X-space with non-harmful multicollinearity, then the predictions are satisfactory.
ECONOMETRIC THEORY

MODULE – VII
Lecture - 22
Multicollinearity

Dr. Shalabh
Department of Mathematics and Statistics
Indian Institute of Technology Kanpur
2

Multicollinearity diagnostics
An important question that arises is how to diagnose the presence of multicollinearity in the data on the basis of the given sample information. Several diagnostic measures are available, and each of them is based on a particular approach. It is difficult to say which diagnostic is the best or ultimate one. Some of the popular and important diagnostics are described further.

The detection of multicollinearity involves 3 aspects:


(i) Determining its presence.
(ii) Determining its severity.
(iii) Determining its form or location.

1. Determinant of X'X, i.e., |X'X|

This measure is based on the fact that the matrix X'X becomes ill-conditioned in the presence of multicollinearity. The value of the determinant of X'X, i.e., |X'X|, declines as the degree of multicollinearity increases.

If rank(X'X) < k, then X'X will be singular and so |X'X| = 0. So as |X'X| → 0, the degree of multicollinearity increases, and it becomes exact or perfect at |X'X| = 0. Thus |X'X| serves as a measure of multicollinearity, and |X'X| = 0 indicates that perfect multicollinearity exists.
3

Limitations
This measure has the following limitations:
i. It is not bounded, as 0 < |X'X| < ∞.
ii. It is affected by the dispersion of the explanatory variables. For example, if k = 2, then
|X'X| = | Σᵢ₌₁ⁿ x₁ᵢ²  Σᵢ₌₁ⁿ x₁ᵢx₂ᵢ ; Σᵢ₌₁ⁿ x₂ᵢx₁ᵢ  Σᵢ₌₁ⁿ x₂ᵢ² | = (Σᵢ₌₁ⁿ x₁ᵢ²)(Σᵢ₌₁ⁿ x₂ᵢ²)(1 − r₁₂²)
where r₁₂ is the correlation coefficient between X₁ and X₂. So |X'X| depends on the correlation coefficient and the variability of the explanatory variables. If the explanatory variables have very low variability, then |X'X| may tend to zero, which would indicate the presence of multicollinearity even when that is not the case.
iii. It gives no idea about the relative effects on individual coefficients. If multicollinearity is present, it will not indicate which variable is causing it, and this is hard to determine.
4

2. Inspection of correlation matrix


The inspection of the off-diagonal elements rᵢⱼ of X'X gives an idea about the presence of multicollinearity. If Xᵢ and Xⱼ are nearly linearly dependent, then |rᵢⱼ| will be close to 1. Note that the observations in X are standardized in the sense that the mean of each variable is subtracted from its observations, which are then divided by the square root of the corrected sum of squares of that variable.

When more than two explanatory variables are considered and if they are involved in near-linear dependency, then it is not
necessary that any of the rij will be large. Generally, pairwise inspection of correlation coefficients is not sufficient for
detecting multicollinearity in the data.

3. Determinant of correlation matrix


Let D be the determinant of correlation matrix then 0 ≤ D ≤ 1.
If D = 0 then it indicates the existence of exact linear dependence among explanatory variables.
If D = 1 then the columns of X matrix are orthonormal.
Thus a value close to 0 is an indication of high degree of multicollinearity. Any value of D between 0 and 1 gives an idea
of the degree of multicollinearity.

Limitation
It gives no information about the number of linear dependencies among explanatory variables.
5

Advantages over |X'X|

(i) It is a bounded measure, 0 ≤ D ≤ 1.

(ii) It is not affected by the dispersion of the explanatory variables. For example, when k = 2 and the observations are standardized,
D = | 1 r₁₂ ; r₁₂ 1 | = 1 − r₁₂²,
which involves only the correlation coefficient and not the variability of the explanatory variables.

4. Measure based on partial regression


A measure of multicollinearity can be obtained on the basis of coefficients of determination based on partial regression. Let R² be the coefficient of determination in the full model, i.e., based on all the explanatory variables, and Rⱼ² be the coefficient of determination in the model when the jth explanatory variable is dropped, j = 1, 2,…, k, and let R_L² = Max(R₁², R₂²,…, Rₖ²).
6

Procedure
i. Drop one of the explanatory variables among the k variables, say X₁.
ii. Run the regression of y on the rest of the (k − 1) variables X₂, X₃,…, Xₖ.
iii. Calculate R₁².
iv. Similarly calculate R₂², R₃²,…, Rₖ².
v. Find R_L² = Max(R₁², R₂²,…, Rₖ²).
vi. Determine R² − R_L².

The quantity (R² − R_L²) provides a measure of multicollinearity. If multicollinearity is present, R_L² will be high. The higher the degree of multicollinearity, the higher the value of R_L². So in the presence of multicollinearity, (R² − R_L²) will be low. Thus if (R² − R_L²) is close to 0, it indicates a high degree of multicollinearity.

Limitations
i. It gives no information about the underlying relations among the explanatory variables, i.e., how many relationships are present or how many explanatory variables are responsible for the multicollinearity.
ii. A small value of (R² − R_L²) may also occur because of poor specification of the model, in which case multicollinearity may be wrongly inferred to be present.
7
5. Variance inflation factors (VIF)
The matrix X'X becomes ill-conditioned in the presence of multicollinearity in the data. So the diagonal elements of C = (X'X)⁻¹ help in the detection of multicollinearity. If Rⱼ² denotes the coefficient of determination obtained when Xⱼ is regressed on the remaining (k − 1) variables, then the jth diagonal element of C is
Cⱼⱼ = 1/(1 − Rⱼ²).
If Xⱼ is nearly orthogonal to the remaining explanatory variables, then Rⱼ² is small and consequently Cⱼⱼ is close to 1.
If Xⱼ is nearly linearly dependent on a subset of the remaining explanatory variables, then Rⱼ² is close to 1 and consequently Cⱼⱼ is large.
The variance of the jth OLSE of βⱼ is Var(bⱼ) = σ²Cⱼⱼ.
So Cⱼⱼ is the factor by which the variance of bⱼ increases when the explanatory variables are near-linearly dependent. Based on this concept, the variance inflation factor for the jth explanatory variable is defined as
VIFⱼ = 1/(1 − Rⱼ²).
This is the factor which is responsible for inflating the sampling variance. The combined effect of the dependencies among the explanatory variables on the variance of a term is measured by the VIF of that term in the model.
One or more large VIFs indicate the presence of multicollinearity in the data.

In practice, usually a VIF > 5 or 10 indicates that the associated regression coefficients are poorly estimated because of multicollinearity. When the regression coefficients are estimated by OLSE, the covariance matrix is σ²(X'X)⁻¹, and VIFⱼ indicates how much its jth diagonal element is inflated.
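A minimal sketch computing VIFⱼ = 1/(1 − Rⱼ²) from the auxiliary regressions (simulated collinear data; the variable names and the degree of collinearity are illustrative):

```python
# Sketch: VIF_j = 1/(1 - R_j^2), where R_j^2 comes from regressing X_j on the other columns.
import numpy as np

def vif(X):
    """VIF of each column of X (X should not contain the intercept column)."""
    n, k = X.shape
    out = []
    for j in range(k):
        Xj = X[:, j]
        Z = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        e = Xj - Z @ np.linalg.lstsq(Z, Xj, rcond=None)[0]
        r2_j = 1 - (e @ e) / np.sum((Xj - Xj.mean()) ** 2)
        out.append(1 / (1 - r2_j))
    return np.array(out)

rng = np.random.default_rng(10)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + 0.05 * rng.normal(size=n)            # nearly collinear with x1
print(vif(np.column_stack([x1, x2, x3])))      # large VIFs for x1 and x3
```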
8

Limitations
(i) It sheds no light on the number of dependencies among the explanatory variables.
(ii) The rule of VIF > 5 or 10 is a rule of thumb which may differ from one situation to another situation.

Another interpretation of VIFj


The VIFs can also be viewed as follows.

The confidence interval of the jth OLSE of β_j is given by

\left( b_j \pm \sqrt{\hat{\sigma}^2 C_{jj}}\; t_{\alpha/2,\, n-k-1} \right).

The length of the confidence interval is

L_j = 2 \sqrt{\hat{\sigma}^2 C_{jj}}\; t_{\alpha/2,\, n-k-1}.

Now consider a situation where X is an orthogonal matrix, i.e., X'X = I so that C_{jj} = 1, with the same sample size and the same root mean squares \left( \frac{1}{n}\sum_{i=1}^{n}(x_{ij} - \bar{x}_j)^2 \right). Then the length of the confidence interval becomes

L^{*} = 2 \hat{\sigma}\; t_{\alpha/2,\, n-k-1}.

Consider the ratio

\frac{L_j}{L^{*}} = \sqrt{C_{jj}}.

Thus \sqrt{VIF_j} indicates the increase in the length of the confidence interval of the jth regression coefficient due to the presence of multicollinearity.
9
6. Condition number and condition index
Let λ1 , λ2 ,..., λk be the eigenvalues (or characteristic roots) of X ' X . Let

λmax = max(λ1 , λ2 ,..., λk )


λmin = min(λ1 , λ2 ,..., λk ).

The condition number (CN) is defined as

CN = \frac{\lambda_{max}}{\lambda_{min}}, \qquad 0 < CN < \infty.

The small values of characteristic roots indicates the presence of near-linear dependencies in the data. The CN provides a
measure of spread in the spectrum of characteristic roots of X’X.

The condition number provides a measure of multicollinearity.


 If CN < 100, then it is considered as non-harmful multicollinearity.
 If 100 < CN < 1000, then it indicates that the multicollinearity is moderate to severe (or strong). This range is
referred to as danger level.
 If CN > 1000, then it indicates a severe (or strong) multicollinearity.

The condition number is based on only two eigenvalues: λ_min and λ_max. Other measures are the condition indices, which use information on the other eigenvalues as well. The condition indices of X'X are defined as

C_j = \frac{\lambda_{max}}{\lambda_j}, \qquad j = 1, 2, ..., k.

In fact, the largest C_j equals CN.
The number of condition indices that are large, say more than 1000, indicates the number of near-linear dependencies in X'X. A limitation of CN and C_j is that they are unbounded measures, as 0 < CN < ∞ and 0 < C_j < ∞.
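A minimal sketch computing the condition number and condition indices from the eigenvalues of X'X; scaling the columns of X to unit length beforehand is a common convention assumed here for illustration, not part of the definition above.

```python
import numpy as np

def condition_diagnostics(X, scale=True):
    """Condition number and condition indices based on the eigenvalues of X'X."""
    if scale:                                   # optional unit-length scaling of columns
        X = X / np.linalg.norm(X, axis=0)
    lam = np.linalg.eigvalsh(X.T @ X)           # eigenvalues in ascending order
    cn = lam.max() / lam.min()                  # condition number
    cj = lam.max() / lam                        # condition indices
    return cn, np.sort(cj)[::-1]

# CN < 100: non-harmful; 100-1000: moderate to severe; > 1000: severe multicollinearity.
```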
10

7. Measure based on characteristic roots and proportion of variances


Let λ1 , λ2 ,.., λk be the eigenvalues of X ' X , Λ =diag (λ1 , λ2 ,..., λk ) is k × k matrix and V is a k × k matrix
constructed by the eigenvectors of X’X. Obviously, V is an orthogonal matrix. Then X’X can be decomposed as
X ' X = V ΛV ' . Let V1 , V2 ,..., Vk be the column of V. If there is near-linear dependency in the data, then λ j is close to
zero and the nature of linear dependency is described by the elements of associated eigenvector Vj.

The covariance matrix of the OLSE is

V(b) = σ²(X'X)^{-1} = σ²(VΛV')^{-1} = σ² V Λ^{-1} V'

⇒ Var(b_i) = σ² \left( \frac{v_{i1}^2}{\lambda_1} + \frac{v_{i2}^2}{\lambda_2} + \cdots + \frac{v_{ik}^2}{\lambda_k} \right),

where v_{i1}, v_{i2}, ..., v_{ik} are the elements of the ith row of V.

The condition indices are

C_j = \frac{\lambda_{max}}{\lambda_j}, \qquad j = 1, 2, ..., k.
11

Procedure
i. Find the condition indices C_1, C_2, ..., C_k.
ii. (a) Identify those λ_j's for which C_j is greater than the danger level 1000.
    (b) This gives the number of linear dependencies.
    (c) Do not consider those C_j's which are below the danger level.
iii. For those λ's with condition index above the danger level, choose one such eigenvalue, say λ_j.
iv. Find the proportion of variance corresponding to λ_j in Var(b_1), Var(b_2), ..., Var(b_k) as

p_{ij} = \frac{v_{ij}^2/\lambda_j}{\sum_{j=1}^{k} v_{ij}^2/\lambda_j} = \frac{v_{ij}^2/\lambda_j}{VIF_i}.

Note that v_{ij}^2/\lambda_j can be found from the expression

Var(b_i) = σ² \left( \frac{v_{i1}^2}{\lambda_1} + \frac{v_{i2}^2}{\lambda_2} + \cdots + \frac{v_{ik}^2}{\lambda_k} \right),

i.e., it is the term corresponding to the jth factor.
The proportion of variance pij provides a measure of multicollinearity.

If pij > 0.5, it indicates that bi is adversely affected by the multicollinearity, i.e., estimate of βi is influenced by the
presence of multicollinearity.

It is a good diagnostic tool in the sense that it tells about the presence of harmful multicollinearity and also indicates the number of linear dependencies responsible for it. This diagnostic is better than the other diagnostics.
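A minimal sketch of the variance-decomposition proportions p_ij described above, using the eigendecomposition X'X = VΛV'; X is a hypothetical data matrix supplied by the user.

```python
import numpy as np

def variance_proportions(X):
    """p[i, j]: share of Var(b_i) associated with eigenvalue lambda_j of X'X."""
    lam, V = np.linalg.eigh(X.T @ X)       # X'X = V diag(lam) V'
    contrib = V**2 / lam                   # element (i, j) = v_ij^2 / lambda_j
    return contrib / contrib.sum(axis=1, keepdims=True)

# A proportion above 0.5 on a component with a large condition index flags the
# coefficients harmed by that near-linear dependency.
```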
12
The condition indices can also be defined through the singular value decomposition of the X matrix as follows:

X = UDV'

where U is an n × k matrix, V is a k × k matrix, U'U = I, V'V = I, and D is a k × k matrix with D = diag(μ_1, μ_2, ..., μ_k), where μ_1, μ_2, ..., μ_k are the singular values of X. V is the matrix whose columns are the eigenvectors of X'X, and U is the matrix whose columns are the eigenvectors of XX' associated with its k nonzero eigenvalues.
The condition indices of the X matrix are defined as

η_j = \frac{\mu_{max}}{\mu_j}, \qquad j = 1, 2, ..., k,

where μ_max = max(μ_1, μ_2, ..., μ_k).
If λ_1, λ_2, ..., λ_k are the eigenvalues of X'X, then

X'X = (UDV')'(UDV') = VD²V' = VΛV',

so μ_j² = λ_j, j = 1, 2, ..., k. Note that with μ_j = \sqrt{\lambda_j},

Var(b_j) = σ² \sum_{i=1}^{k} \frac{v_{ji}^2}{\mu_i^2}, \qquad
VIF_j = \sum_{i=1}^{k} \frac{v_{ji}^2}{\mu_i^2}, \qquad
p_{ij} = \frac{v_{ji}^2/\mu_i^2}{VIF_j}.
13

The ill-conditioning in X is reflected in the size of the singular values. There will be one small singular value for each near-linear dependency. The extent of ill-conditioning is described by how small μ_j is relative to μ_max.

It is suggested that the explanatory variables should be scaled to unit length but should not be centered when computing
pij . This will help in diagnosing the role of intercept term in near-linear dependence. No unique guidance is available in
literature on the issue of centering the explanatory variables. The centering makes the intercept orthogonal to explanatory
variables. So this may remove the ill conditioning due to intercept term in the model.
ECONOMETRIC THEORY

MODULE – VII
Lecture - 23
Multicollinearity

Dr. Shalabh
Department of Mathematics and Statistics
Indian Institute of Technology Kanpur
2

Remedies for multicollinearity


Various techniques have been proposed to deal with the problems resulting from the presence of multicollinearity in the
data.

1. Obtain more data


The harmful multicollinearity arises essentially because rank of X’X falls below k and then |X’X| = 0 which clearly
suggests the presence of linear dependencies in the columns of X. It is the case when |X’X| is close to zero which needs
attention. Additional data may help in reducing the sampling variance of the estimates. The data need to be collected such
that it helps in breaking up the multicollinearity in the data.

It is not always possible to collect additional data, for various reasons, as follows.


 The experiment and process have finished and no longer available.
 The economic constrains may also not allow to collect the additional data.
 The additional data may not match with the earlier collected data and may be unusual.
 If the data is in time series, then longer time series may force to take data that is too far in the past.
 If multicollinearity is due to any identity or exact relationship, then increasing the sample size will not help.
 Sometimes, it is not advisable to use the data even if it is available. For example, if the data on consumption
pattern is available for the years 1950-2010, then one may not like to use it as the consumption pattern usually
does not remains same for such a long period.
3
2. Drop some variables that are collinear

If possible, identify the variables which seem to be causing the multicollinearity. These collinear variables can be dropped so as to satisfy the condition of full rank of the X-matrix. The process of omitting variables may be carried out on the basis of some kind of ordering of the explanatory variables, e.g., the variables with smaller t-ratios can be deleted first.

In another example, suppose the experimenter is not interested in all the parameters. In such cases, one can get the
estimators of the parameters of interest which have smaller mean squared errors than the variance of OLSE of full vector
by dropping some variables. If some variables are eliminated, this may reduce the predictive power of the model, and there is no assurance that the resulting model will exhibit less multicollinearity.

3. Use some relevant prior information


One may search for some relevant prior information about the regression coefficients. This may lead to specification of
estimates of some coefficients. More general situation includes the specification of some exact linear restrictions and
stochastic linear restrictions. The procedures like restricted regression and mixed regression can be used for this purpose.

The relevance and correctness of information plays an important role in such analysis but it is difficult to ensure it in
practice. For example, the estimates derived in U.K. may not be valid in India.
4
4. Employ generalized inverse
If rank ( X ' X ) < k , then the generalized inverse can be used to find the inverse of X 'X . Then β can be estimated by
βˆ = ( X ' X ) − X ' y.

In such case, the estimates will not be unique except in the case of use of Moore-Penrose inverse of ( X ' X ). Different
methods of finding generalized inverse may give different results. So applied workers will get different results. Moreover, it
is also not known that which method of finding generalized inverse is optimum.

5. Use of principal component regression


The principal component regression is based on the technique of principal component analysis. The explanatory variables

are transformed into a new set of orthogonal variables called as principal components. Usually this technique is used for

reducing the dimensionality of data by retaining some levels of variability of explanatory variables which is expressed by the

variability in study variable. The principal components involves the determination of a set of linear combinations of

explanatory variables such that they retain the total variability of the system and these linear combinations are mutually

independent of each other. Such obtained principal components are ranked in the order of their importance. The importance

being judged in terms of variability explained by a principal component relative to the total variability in the system. The

procedure then involves eliminating some of the principal components which contribute in explaining relatively less variation.

After elimination of the least important principal components, the set up of multiple regression is used by replacing the

explanatory variables with principal components.


5

Then study variable is regressed against the set of selected principal components using ordinary least squares method.

Since all the principal components are orthogonal, they are mutually independent and so OLS is used without any problem.

Once the estimates of regression coefficients for the reduced set of orthogonal variables (principal components) have been

obtained, they are mathematically transformed into a new set of estimated regression coefficients that correspond to the

original correlated set of variables. These new estimated coefficients are the principal components estimators of regression

coefficients.

Suppose there are k explanatory variables X_1, X_2, ..., X_k. Consider linear functions of X_1, X_2, ..., X_k like

Z_1 = \sum_{i=1}^{k} a_i X_i, \qquad Z_2 = \sum_{i=1}^{k} b_i X_i, \quad \text{etc.}

The constants a_1, a_2, ..., a_k are determined such that the variance of Z_1 is maximized subject to the normalizing condition \sum_{i=1}^{k} a_i^2 = 1. The constants b_1, b_2, ..., b_k are determined such that the variance of Z_2 is maximized subject to the normalizing condition \sum_{i=1}^{k} b_i^2 = 1 and Z_2 is independent of the first principal component.
6

We continue this process and obtain k such linear combinations, each orthogonal to the preceding ones and satisfying the normalizing condition, and then obtain their variances. Suppose these linear combinations are Z_1, Z_2, ..., Z_k with Var(Z_1) > Var(Z_2) > ... > Var(Z_k). The linear combination having the largest variance is the first principal component, the linear combination having the second largest variance is the second principal component, and so on. These principal components have the property that \sum_{i=1}^{k} Var(Z_i) = \sum_{i=1}^{k} Var(X_i). Also, X_1, X_2, ..., X_k are correlated, but Z_1, Z_2, ..., Z_k are orthogonal or uncorrelated, so there is zero multicollinearity among Z_1, Z_2, ..., Z_k.

The problem of multicollinearity arises because X_1, X_2, ..., X_k are not independent. Since the principal components based on X_1, X_2, ..., X_k are mutually independent, they can be used as explanatory variables, and such a regression will combat the multicollinearity.

Let λ1 , λ2 ,..., λk be the eigenvalues of X ' X , Λ =diag (λ1 , λ2 ,..., λk ) is k × k diagonal matrix, V is a k × k orthogonal
matrix whose columns are the eigenvectors associated with λ1 , λ2 ,..., λk .

Consider the canonical form of the linear model

y = Xβ + ε = XVV'β + ε = Zα + ε

where

Z = XV, \quad α = V'β, \quad V'X'XV = Z'Z = Λ.

The columns of Z = (Z_1, Z_2, ..., Z_k) define a new set of explanatory variables, which are called principal components.
7

The OLSE of α is

α̂ = (Z'Z)^{-1} Z'y = Λ^{-1} Z'y

and its covariance matrix is

V(α̂) = σ²(Z'Z)^{-1} = σ² Λ^{-1} = σ² diag\left( \frac{1}{\lambda_1}, \frac{1}{\lambda_2}, ..., \frac{1}{\lambda_k} \right).

Note that λ_j is the variance of the jth principal component and Z'Z = Λ, i.e., Z_i'Z_j = 0 for i ≠ j and Z_j'Z_j = λ_j.

A small eigenvalue of X'X means that a near-linear relationship among the original explanatory variables exists and the variance of the corresponding orthogonal regression coefficient is large, which indicates that multicollinearity exists. If one or more λ_j are small, it indicates that multicollinearity is present.
8

Retainment of principal components

The new set of variables, i.e., principal components are orthogonal, and they retain the same magnitude of variance as of

original set. If multicollinearity is severe, then there will be at least one small value of eigenvalue. The elimination of one or

more principal components associated with smallest eigenvalues will reduce the total variance in the model. Moreover, the

principal components responsible for creating multicollinearity will be removed and the resulting model will be appreciably

improved.

The principal component matrix Z = [ Z1 , Z 2 ,..., Z k ] with Z1 , Z 2 ,..., Z k contains exactly the same information as the original

data in X in the sense that the total variability in X and Z is same. The difference between them is that the original data are

arranged into a set of new variables which are uncorrelated with each other and can be ranked with respect to the

magnitude of their eigenvalues. The jth column vector Z_j corresponding to the largest λ_j accounts for the largest proportion of the variation in the original data. Thus the Z_j's are indexed so that λ_1 > λ_2 > ... > λ_k > 0, and λ_j is the variance of Z_j.
A strategy of elimination of principal components is to begin by discarding the component associated with the smallest

eigenvalue. The idea behind to do this is that the principal component with smallest eigenvalue is contributing the least

variance and so is least informative.


9
Using this procedure, principal components are eliminated until the remaining components explain some preselected percentage of the total variance. For example, suppose 90% of the total variance is to be retained and r principal components are eliminated, so that the remaining (k - r) principal components contribute 90% of the total variation; then r is selected to satisfy

\frac{\sum_{i=1}^{k-r} \lambda_i}{\sum_{i=1}^{k} \lambda_i} > 0.90.

Various strategies to choose required number of principal components are also available in the literature.
Suppose after using such a rule, the r principal components are eliminated. Now only (k - r) components will be used for
regression. So the Z matrix is partitioned as

Z = (Z_r \;\; Z_{k-r}) = X(V_r \;\; V_{k-r})

where the submatrix Z_r is of order n × r and contains the principal components to be eliminated, and the submatrix Z_{k-r} is of order n × (k - r) and contains the principal components to be retained.

The reduced model obtained after the elimination of r principal components can be expressed as

y = Z_{k-r} α_{k-r} + ε*.

The random error component is denoted ε* just to distinguish it from ε. The reduced coefficient vector contains the coefficients associated with the retained Z_j's. So

Z_{k-r} = (Z_1, Z_2, ..., Z_{k-r}),
α_{k-r} = (α_1, α_2, ..., α_{k-r}),
V_{k-r} = (V_1, V_2, ..., V_{k-r}).
10
Using OLS on the model with the retained principal components, the OLSE of α_{k-r} is

α̂_{k-r} = (Z'_{k-r} Z_{k-r})^{-1} Z'_{k-r} y.

Now it is transformed back to the original explanatory variables as follows:

α = V'β, \quad α_{k-r} = V'_{k-r} β \;\;⇒\;\; β̂_{pc} = V_{k-r} α̂_{k-r},

which is the principal component regression estimator of β.


This method improves the efficiency as well as combats multicollinearity.
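A minimal numpy sketch of the principal component regression estimator derived above: form Z = XV from the eigenvectors of X'X, drop the components with the smallest eigenvalues, estimate α on the retained components, and transform back to β̂_pc. The choice of r is an illustrative assumption.

```python
import numpy as np

def pcr_estimator(X, y, r):
    """Principal component regression estimator dropping the r smallest components."""
    lam, V = np.linalg.eigh(X.T @ X)       # eigenvalues in ascending order
    order = np.argsort(lam)[::-1]          # reorder components by decreasing eigenvalue
    V = V[:, order]
    keep = V[:, : X.shape[1] - r]          # V_{k-r}: eigenvectors to retain
    Z = X @ keep                           # retained principal components
    alpha_hat = np.linalg.solve(Z.T @ Z, Z.T @ y)
    return keep @ alpha_hat                # beta_pc = V_{k-r} alpha_hat

# Example (hypothetical X, y): drop the single smallest component
# beta_pc = pcr_estimator(X, y, r=1)
```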
ECONOMETRIC THEORY

MODULE – VII
Lecture - 24
Multicollinearity

Dr. Shalabh
Department of Mathematics and Statistics
Indian Institute of Technology Kanpur
2

6. Ridge regression

The OLSE is the best linear unbiased estimator of the regression coefficient in the sense that it has minimum variance in the class of linear and unbiased estimators. However, if the condition of unbiasedness is relaxed, it is possible to find a biased estimator of the regression coefficient, say β̂, that has smaller variance than the unbiased OLSE b. The mean squared error (MSE) of β̂ is

MSE(β̂) = E(β̂ - β)²
        = E[{β̂ - E(β̂)} + {E(β̂) - β}]²
        = Var(β̂) + [E(β̂) - β]²
        = Var(β̂) + [Bias(β̂)]².

Thus MSE(β̂) can be made smaller than Var(b) by introducing a small bias in β̂. One approach to do so is ridge regression. The ridge regression estimator is obtained by solving modified normal equations of least squares estimation,

(X'X + δI) β̂_{ridge} = X'y
⇒ β̂_{ridge} = (X'X + δI)^{-1} X'y,

where β̂_{ridge} is the ridge regression estimator of β and δ ≥ 0 is a characterizing scalar termed the biasing parameter.
3
As δ → 0, β̂_{ridge} → b (the OLSE), and as δ → ∞, β̂_{ridge} → 0. So the larger the value of δ, the larger the shrinkage towards zero. Note that the OLSE is inappropriate to use in the sense that it has very high variance when multicollinearity is present in the data. On the other hand, a very small value of β̂ may lead to accepting the null hypothesis H_0: β = 0, indicating that the corresponding variables are not relevant. The value of the biasing parameter controls the amount of shrinkage in the estimates.
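A minimal sketch of the ridge estimator and a crude ridge trace: β̂_ridge(δ) = (X'X + δI)^{-1} X'y evaluated over a grid of δ values; the grid itself is an arbitrary illustrative choice.

```python
import numpy as np

def ridge_estimator(X, y, delta):
    """Ridge regression estimator (X'X + delta*I)^{-1} X'y."""
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + delta * np.eye(k), X.T @ y)

# Ridge trace: coefficients over a grid of biasing parameters (hypothetical X, y)
# deltas = np.logspace(-4, 2, 30)
# trace = np.array([ridge_estimator(X, y, d) for d in deltas])
# Choose the smallest delta beyond which the columns of `trace` stabilize.
```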

Bias of ridge regression estimator

The bias of β̂_{ridge} is

Bias(β̂_{ridge}) = E(β̂_{ridge}) - β
               = (X'X + δI)^{-1} X'E(y) - β
               = [(X'X + δI)^{-1} X'X - I] β
               = (X'X + δI)^{-1} [X'X - X'X - δI] β
               = -δ (X'X + δI)^{-1} β.

Thus the ridge regression estimator is a biased estimator of β.


4

Covariance matrix

The covariance matrix of β̂_{ridge} is defined as

V(β̂_{ridge}) = E[{β̂_{ridge} - E(β̂_{ridge})}{β̂_{ridge} - E(β̂_{ridge})}'].

Since

β̂_{ridge} - E(β̂_{ridge}) = (X'X + δI)^{-1} X'y - (X'X + δI)^{-1} X'Xβ
                         = (X'X + δI)^{-1} X'(y - Xβ)
                         = (X'X + δI)^{-1} X'ε,

so

V(β̂_{ridge}) = (X'X + δI)^{-1} X' V(ε) X (X'X + δI)^{-1}
             = σ² (X'X + δI)^{-1} X'X (X'X + δI)^{-1}.
5
Mean squared error

Writing X'X = Λ = diag(λ_1, λ_2, ..., λ_k), the mean squared error of β̂_{ridge} is

MSE(β̂_{ridge}) = Var(β̂_{ridge}) + [bias(β̂_{ridge})]²
              = tr[V(β̂_{ridge})] + [bias(β̂_{ridge})]²
              = σ² tr[(X'X + δI)^{-1} X'X (X'X + δI)^{-1}] + δ² β'(X'X + δI)^{-2} β
              = σ² \sum_{j=1}^{k} \frac{\lambda_j}{(\lambda_j + \delta)^2} + δ² \sum_{j=1}^{k} \frac{\beta_j^2}{(\lambda_j + \delta)^2},

where λ_1, λ_2, ..., λ_k are the eigenvalues of X'X.

Thus as δ increases, the bias in βˆridge increases but its variance decreases. The trade off between bias and variance
hinges upon the value of δ. It can be shown that there exists a value of δ such that MSE ( βˆridge ) < Var (b)
provided β ' β is bounded.

Choice of δ
The estimation of ridge regression estimator depends upon the value of δ . Various approaches have been suggested in
the literature to determine the value of δ. The value of δ can be chosen on the basis of criteria like
- stability of estimators with respect to δ .
- reasonable signs.
- magnitude of residual sum of squares etc.
We consider here the determination of δ by the inspection of ridge trace.
6

Ridge trace

Ridge trace is the graphical display of ridge regression estimator versus δ .

If multicollinearity is present and severe, then the instability of the regression coefficients is reflected in the ridge trace. As δ increases, some of the ridge estimates vary dramatically and then stabilize at some value of δ. The objective in the ridge

trace is to inspect the trace (curve) and find the reasonably small value of δ at which the ridge regression estimators are

stable. The ridge regression estimator with such a choice of δ will have smaller MSE than the variance of OLSE.

An example of ridge trace for a model with 6 parameters is as follows: In this ridge trace, the βˆridge is evaluated for various

choices of δ and the corresponding values of all regression coefficients βˆ j ( ridge ) ’s, j = 1, 2,…, 6 are plotted versus δ .

These values are denoted by different symbols and are joined by a smooth curve. This produces a ridge trace for

respective parameter. Now choose the value of δ where all the curves stabilize and become nearly parallel. For example,

the curves in following figure become nearly parallel starting from δ = δ4 or so. Thus one possible choice of δ is δ = δ4
and parameters can be estimated as

β̂_{ridge} = (X'X + δ_4 I)^{-1} X'y.
7

The figure drastically exposes the presence of multicollinearity in the data. The behaviour of βˆi ( ridge ) at δ0 ≈ 0 is very
different than at other values of δ . For small values of δ , the estimates change rapidly. The estimates stabilize
gradually as δ increases. The value of δ at which all the estimates stabilize gives the desired value of δ because
moving away from such δ will not bring any appreciable reduction in the residual sum of squares. If multicollinearity is
present, then the variation in ridge regression estimators is rapid around δ 0 ≈ 0. The optimal δ is chosen such that after
that value of δ , almost all traces stabilize.
8

Limitations

1. The choice of δ is data dependent and therefore is a random variable. Using it as a random variable violates the
assumption that δ is a constant. This will disturb the optimal properties derived under the assumption of constancy
of δ .
2. The value of δ lies in the interval (0, ∞). So large number of values are required for exploration. This results in
wasting of time. However, this is not a big issue when working with software.
3. The choice of δ from graphical display may not be unique. Different people may choose different δ and
consequently the values of ridge regression estimators will be changing. However δ is chosen so that all the
estimators of all the coefficients stabilize. Hence small variation in choosing the value of δ may not produce much
change in ridge estimators of the coefficients. Another choice of δ is

δ = \frac{k \hat{\sigma}^2}{b'b}

where b and σˆ 2 are obtained from the least squares estimation.


4. The stability of numerical estimates of βˆi ' s is a rough way to determine δ . Different estimates may exhibit stability
for different δ and it may often be hard to strike a compromise. In such a situation, generalized ridge regression
estimators are used.
5. There is no guidance available regarding the testing of hypothesis and for confidence interval estimation.
9

Idea behind ridge regression estimator

The problem of multicollinearity arises because some of the eigenvalues (characteristic roots) of X'X are close to zero or are zero. So if λ_1, λ_2, ..., λ_k are the characteristic roots and

X'X = Λ = diag(λ_1, λ_2, ..., λ_k),

then

β̂_{ridge} = (Λ + δI)^{-1} X'y = (I + δΛ^{-1})^{-1} b

where b is the OLSE of β given by

b = ( X ' X ) −1 X ' y = Λ −1 X ' y.

Thus a particular element of β̂_{ridge} will be of the form

\frac{1}{\lambda_i + \delta}\, x_i'y = \frac{\lambda_i}{\lambda_i + \delta}\, b_i.

So a small quantity δ is added to λ_i so that even if λ_i = 0, the factor \frac{1}{\lambda_i + \delta} remains meaningful.
10

Another interpretation of ridge regression estimator


In the model y = Xβ + ε, obtain the least squares estimator of β subject to \sum_{i=1}^{k} \beta_i^2 = C, where C is some constant. So minimize

S(β) = (y - Xβ)'(y - Xβ) + δ(β'β - C)

where δ is the Lagrangian multiplier. Differentiating S ( β ) with respect to β , the normal equations are obtained as

\frac{\partial S(\beta)}{\partial \beta} = 0 \;\;⇒\;\; -2X'y + 2X'Xβ + 2δβ = 0 \;\;⇒\;\; β̂_{ridge} = (X'X + δI)^{-1} X'y.

Note that if δ is very small, it may indicate that most of the regression coefficients are close to zero and if δ is large,
then it may indicate that the regression coefficients are away from zero. So δ puts a sort of penalty on the regression
coefficients to enable its estimation.
ECONOMETRIC THEORY

MODULE – VIII
Lecture - 25
Heteroskedasticity

Dr. Shalabh
Department of Mathematics and Statistics
Indian Institute of Technology Kanpur
2

In the multiple regression model


y = Xβ + ε,

it is assumed that

V(ε) = σ²I,

i.e., Var(ε_i) = σ² and Cov(ε_i, ε_j) = 0 for i ≠ j = 1, 2, ..., n.
In this case, the diagonal elements of covariance matrix of ε are same indicating that the variance of each εi is same
and off-diagonal elements of covariance matrix of ε are zero indicating that all disturbances are pairwise uncorrelated.
This property of constancy of variance is termed as homoskedasticity and disturbances are called as homoskedastic
disturbances.

In many situations, this assumption may not be plausible and the variances may not remain same. The disturbances
whose variances are not constant across the observations are called heteroskedastic disturbance and this property is
termed as heteroskedasticity. In this case
Var(ε_i) = σ_i², \; i = 1, 2, ..., n,
and disturbances are pairwise uncorrelated.

The covariance matrix of the disturbances is

V(ε) = diag(σ_1^2, σ_2^2, ..., σ_n^2) = \begin{pmatrix} σ_1^2 & 0 & \cdots & 0 \\ 0 & σ_2^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & σ_n^2 \end{pmatrix}.

Graphically, the following pictures depict homoskedasticity and heteroskedasticity.


3

[Figures: homoskedasticity; heteroskedasticity with Var(y) increasing with x; heteroskedasticity with Var(y) decreasing with x.]


4

Examples
Suppose in a simple linear regression model, x denote the income and y denotes the expenditure on food. It is observed

that as the income increases, the variation in expenditure on food increases because the choice and varieties in food

increases, in general, upto certain extent. So the variance of observations on y will not remain constant as income

changes. The assumption of homoscedasticity implies that the consumption pattern of food will remain same irrespective

of the income of the person. This may not generally be a correct assumption in real situations. Rather the consumption

pattern changes and hence the variance of y and so the variances of disturbances will not remain constant. In general, it

will be increasing as income increases.

In another example, suppose in a simple linear regression model, x denotes the number of hours of practice for typing and

y denotes the number of typing errors per page. It is expected that the number of typing mistakes per page decreases as

the person practices more. The homoskedastic disturbances assumption implies that the number of errors per page will

remain same irrespective of the number of hours of typing practice which may not be true in practice.
5
Possible reasons for heteroskedasticity
There are various reasons due to which the heteroskedasticity is introduced in the data. Some of them are as follows:
1. The nature of phenomenon under study may have an increasing or decreasing trend. For example, the variation in
consumption pattern on food increases as income increases. Similarly the variation in number of typing mistakes
decreases as the number of hours of typing practice increases.

2. Sometimes the observations are in the form of averages, and this introduces heteroskedasticity into the model. For example, it is easier to collect data on the expenditure on clothes for the whole family rather than for a particular family member. Suppose in a simple linear regression model

y_{ij} = β_0 + β_1 x_{ij} + ε_{ij}, \quad i = 1, 2, ..., n, \; j = 1, 2, ..., m_i,

y_{ij} denotes the expenditure on clothes of the jth member of the ith family, which has m_i members, and x_{ij} denotes the age of the jth member of the ith family. It is difficult to record data for each individual family member, but it is easier to get data for the whole family, so the y_{ij}'s are known only collectively.

Then, instead of per-member expenditure, we use the average expenditure for each family member,

\bar{y}_i = \frac{1}{m_i} \sum_{j=1}^{m_i} y_{ij},

and the model becomes

\bar{y}_i = β_0 + β_1 \bar{x}_i + \bar{ε}_i.

If we assume E(ε_{ij}) = 0 and Var(ε_{ij}) = σ², then E(\bar{ε}_i) = 0 and Var(\bar{ε}_i) = σ²/m_i, which indicates that the resultant variance of the disturbances does not remain constant but depends on the number of members m_i in the family. So heteroskedasticity enters the data. The variance remains constant only when all the m_i's are the same.
6

3. Sometimes the theoretical considerations introduces the heteroskedasticity in the data. For example, suppose in the
simple linear model
yi =β 0 + β1 xi + ε i , i =1, 2,..., n
yi denotes the yield of rice and xi denotes the quantity of fertilizer in an agricultural experiment. It is observed that
when the quantity of fertilizer increases, then yield increases. In fact, initially the yield increases when quantity of
fertilizer increases. Gradually, the rate of increase slows down and if fertilizer is increased further, the crop burns. So
notice that β1 changes with different levels of fertilizer. In such cases, when β1 changes, a possible way is to express
it as a random variable with constant mean β1 and constant variance θ 2 like

β_{1i} = β_1 + v_i, \quad i = 1, 2, ..., n,

with

E(v_i) = 0, \quad Var(v_i) = θ², \quad E(ε_i v_i) = 0.

So the complete model becomes

y_i = β_0 + β_{1i} x_i + ε_i, \quad β_{1i} = β_1 + v_i
⇒ y_i = β_0 + β_1 x_i + (ε_i + x_i v_i) = β_0 + β_1 x_i + w_i,

where

w_i = ε_i + x_i v_i
is like a new random error component.
7

So
E ( wi ) = 0
Var ( wi ) = E ( wi2 )
=E (ε i2 ) + xi2 E (vi2 ) + 2 xi E (ε i vi )
=σ 2 + xi2θ 2 + 0
= σ 2 + xi2θ 2 .

So variance depends on i and thus heteroskedasticity is introduced in the model. Note that we assume homoskedastic
disturbances for the model

yi =β 0 + β1 xi + ε i , β1i =β1 + vi

but finally ends up with heteroskedastic disturbances. This is due to theoretical considerations.

4. The skewness in the distribution of one or more explanatory variables in the model also causes
heteroskedasticity in the model.
5. The incorrect data transformations and incorrect functional form of the model can also give rise to the heteroskedasticity
problem.
6. Sometimes the study variable may have distribution such as Binomial or Poisson where variances are the functions of
means. In such cases, when the mean varies, then the variance also varies.
8

Tests for heteroskedasticity

The presence of heteroskedasticity affects the estimation and test of hypothesis. The heteroskedasticity can enter into the
data due to various reasons. The tests for heteroskedasticity assume a specific nature of heteroskedasticity. Various
tests are available in literature for testing the presence of heteroskedasticity, e.g.,

1. Bartlett test

2. Breusch Pagan test

3. Goldfeld Quandt test

4. Glesjer test

5. Test based on Spearman’s rank correlation coefficient

6. White test

7. Ramsey test

8. Harvey Phillips test

9. Szroeter test

10. Peak test (nonparametric) test.

We discuss the Bartlett’s test.


9

Bartlett’s test
It is a test for testing the null hypothesis
H 0 : σ 12= σ 22= ...= σ i2= ...= σ n2 .
This hypothesis is termed as the hypothesis of homoskedasticity. This test can be used only when replicated data is
available.

Since in the model

y_i = β_1 X_{i1} + β_2 X_{i2} + ... + β_k X_{ik} + ε_i, \quad E(ε_i) = 0, \; Var(ε_i) = σ_i^2, \; i = 1, 2, ..., n,
only one observation yi is available to find σ i2 , so the usual tests can not be applied. This problem can be overcome if
replicated data is available. So consider the model of the form

y_i^* = X_i β + ε_i^*

where y_i^* is an m_i × 1 vector, X_i is an m_i × k matrix, β is a k × 1 vector and ε_i^* is an m_i × 1 vector. So replicated data are now available for every y_i^* in the following way:

y_1^* = X_1 β + ε_1^*  consists of m_1 observations,
y_2^* = X_2 β + ε_2^*  consists of m_2 observations,
⋮
y_n^* = X_n β + ε_n^*  consists of m_n observations.
10
All the individual models can be written together as

\begin{pmatrix} y_1^* \\ y_2^* \\ \vdots \\ y_n^* \end{pmatrix} = \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_n \end{pmatrix} β + \begin{pmatrix} ε_1^* \\ ε_2^* \\ \vdots \\ ε_n^* \end{pmatrix}

or

y^* = Xβ + ε^*,

where y^* is a vector of order \left( \sum_{i=1}^{n} m_i \right) × 1, X is a \left( \sum_{i=1}^{n} m_i \right) × k matrix, β is a k × 1 vector and ε^* is a \left( \sum_{i=1}^{n} m_i \right) × 1 vector.

Application of OLS to this model yields

β̂ = (X'X)^{-1} X'y^*

and the residual vectors

e_i^* = y_i^* - X_i β̂.

Based on these, obtain

s_i^2 = \frac{1}{m_i - k} e_i^{*'} e_i^*, \qquad
s^2 = \frac{\sum_{i=1}^{n} (m_i - k) s_i^2}{\sum_{i=1}^{n} (m_i - k)}.
Now apply Bartlett's test as

χ² = \frac{1}{C} \sum_{i=1}^{n} (m_i - k) \log \frac{s^2}{s_i^2},

which has an asymptotic χ² distribution with (n - 1) degrees of freedom, where

C = 1 + \frac{1}{3(n-1)} \left[ \sum_{i=1}^{n} \frac{1}{m_i - k} - \frac{1}{\sum_{i=1}^{n} (m_i - k)} \right].


Another variant of Bartlett’s test


Another variant of Bartlett's test is based on the likelihood ratio test statistic. Suppose there are m independent normal random samples with n_i observations in the ith sample. Then the likelihood ratio test statistic for testing H_0: σ_1^2 = σ_2^2 = ... = σ_i^2 = ... = σ_m^2 is

u = \prod_{i=1}^{m} \left( \frac{s_i^2}{s^2} \right)^{n_i/2},

where \bar{y}_i and s_i^2 are the sample mean and sample variance, respectively, of the ith sample, and

s_i^2 = \frac{1}{n_i} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2, \quad i = 1, 2, ..., m; \; j = 1, 2, ..., n_i,

s^2 = \frac{1}{n} \sum_{i=1}^{m} n_i s_i^2, \qquad n = \sum_{i=1}^{m} n_i.
12

To obtain an unbiased test and a modification of -2 ln u which is a closer approximation to χ²_{m-1} under H_0, Bartlett's test replaces n_i by (n_i - 1) and divides by a scaling constant.

This leads to the statistic

M = \frac{(n - m) \log \hat{\sigma}^2 - \sum_{i=1}^{m} (n_i - 1) \log \hat{\sigma}_i^2}{1 + \frac{1}{3(m-1)} \left[ \sum_{i=1}^{m} \frac{1}{n_i - 1} - \frac{1}{n - m} \right]}

which has a χ² distribution with (m - 1) degrees of freedom under H_0, where

\hat{\sigma}_i^2 = \frac{1}{n_i - 1} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2, \qquad
\hat{\sigma}^2 = \frac{1}{n - m} \sum_{i=1}^{m} (n_i - 1) \hat{\sigma}_i^2.

In experimental sciences, it is easier to get replicated data and this test can be easily applied. In real life applications, it is

difficult to get replicated data and this test may not be applied. This difficulty is overcome in Breusch Pagan test.
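When replicated groups are available, the statistic M above is what scipy.stats.bartlett computes from the group samples; a minimal sketch, with hypothetical groups standing in for the replicated observations (or residual groups).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# hypothetical replicated samples of possibly unequal sizes
groups = [rng.normal(scale=s, size=n) for s, n in [(1.0, 12), (1.5, 10), (3.0, 15)]]

stat, p_value = stats.bartlett(*groups)   # Bartlett's M statistic and its p-value
print(stat, p_value)                      # small p-value -> reject equal variances
```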
ECONOMETRIC THEORY

MODULE – VIII
Lecture - 26
Heteroskedasticity

Dr. Shalabh
Department of Mathematics and Statistics
Indian Institute of Technology Kanpur
2

2. Breusch Pagan test


This test can be applied when the replicated data is not available but only single observations are available. When it is
suspected that the variance is some function (but not necessarily multiplicative) of more than one explanatory variable,
then Breusch Pagan test can be used.

Assuming that under the alternative hypothesis σ_i^2 is expressible as

σ_i^2 = h(Z_i'γ) = h(γ_1 + Z_i^{*'}γ^*),

where h is some unspecified function which is independent of i,

Z_i' = (1, Z_i^{*'}) = (1, Z_{i2}, Z_{i3}, ..., Z_{ip})

is the vector of observable explanatory variables with first element unity, and γ' = (γ_1, γ^{*'}) = (γ_1, γ_2, ..., γ_p) is a vector of unknown coefficients related to β, with the first element being the intercept term. The heterogeneity is defined by these p variables. The Z_i's may also include some of the X's.

Specifically, assume that

σ i2 = γ 1 + γ 2 Z i 2 + ... + γ p Z ip .

The null hypothesis


H 0 : σ 12= σ 22= ...= σ n2
can be expressed as

H 0 : γ 2= γ 3= ...= γ p= 0.

If H0 is accepted , it implies that Z i 2 , Z i 3 ,..., Z ip do not have any effect on σ i2 and we get σ i2 = γ 1 .
3
The test procedure is as follows (a sketch of these steps in code is given after the remarks below):
1. Ignoring the heteroskedasticity, apply OLS to
   y_i = β_1 + β_2 X_{i2} + ... + β_k X_{ik} + ε_i
   and obtain the residuals
   e = y - Xb, \quad b = (X'X)^{-1} X'y.
2. Construct the variables
   g_i = \frac{e_i^2}{\sum_{i=1}^{n} e_i^2 / n} = \frac{n e_i^2}{SS_{res}},
   where SS_{res} is the residual sum of squares based on the e_i's.
3. Run the regression of g on Z_1, Z_2, ..., Z_p and get the residual sum of squares SS^*_{res}.
4. For testing, calculate the test statistic
   Q = \frac{1}{2} \left[ \sum_{i=1}^{n} g_i^2 - SS^*_{res} \right],
   which is asymptotically distributed as χ² with (p - 1) degrees of freedom.
5. The decision rule is to reject H_0 if Q > χ²_{1-α}(p - 1).

 This test is very simple to perform.


 A fairly general form is assumed for heteroskedasticity, so it is a very general test.
 This is an asymptotic test.
 This test is quite powerful in the presence of heteroskedasticity.
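A minimal numpy sketch that follows the steps above literally (OLS residuals, the g_i variables, the auxiliary regression on Z, and Q); X, Z and y are hypothetical arrays, and both X and Z are assumed to include an intercept column.

```python
import numpy as np

def breusch_pagan(y, X, Z):
    """Q statistic following the steps above; X and Z include intercept columns."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b                                   # step 1: OLS residuals
    g = len(y) * e**2 / np.sum(e**2)                # step 2: g_i = n e_i^2 / SS_res
    gamma, *_ = np.linalg.lstsq(Z, g, rcond=None)   # step 3: regress g on Z
    ss_res_star = np.sum((g - Z @ gamma) ** 2)
    return 0.5 * (np.sum(g**2) - ss_res_star)       # step 4: test statistic Q

# Compare Q with the chi-square critical value with (p - 1) degrees of freedom.
```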
4

3. Goldfeld Quandt test


This test is based on the assumption that σ_i^2 is positively related to X_{ij}, i.e., one of the explanatory variables explains the heteroskedasticity in the model. Suppose the jth explanatory variable explains the heteroskedasticity, so that

σ_i^2 ∝ X_{ij}  or  σ_i^2 = σ^2 X_{ij}.

The test procedure is as follows (a code sketch follows the remarks below):

1. Rank the observations according to decreasing order of X_j.
2. Split the observations into two equal parts, leaving out c observations in the middle, so that each part contains (n - c)/2 observations, provided (n - c)/2 > k.
3. Run two separate regressions on the two parts using OLS and obtain the residual sums of squares SS_{res1} and SS_{res2}.
4. The test statistic is
   F_0 = \frac{SS_{res2}}{SS_{res1}},
   which follows an F-distribution, i.e., F\left( \frac{n-c}{2} - k, \frac{n-c}{2} - k \right), when H_0 is true.
5. The decision rule is to reject H_0 whenever F_0 > F_{1-α}\left( \frac{n-c}{2} - k, \frac{n-c}{2} - k \right).
5

 This is a simple test, but it is based on the assumption that only one of the explanatory variables helps in determining the heteroskedasticity.
 This test is an exact finite sample test.
 One difficulty in this test is that the choice of c is not obvious. If a large value of c is chosen, it reduces the degrees of freedom (n - c)/2 - k, and the condition (n - c)/2 > k may be violated.

On the other hand, if a smaller value of c is chosen, the test may fail to reveal the heteroskedasticity: the ordering of the observations and the deletion of the middle observations may then not bring out the heteroskedasticity effect. Since the first and last values of σ_i^2 give the maximum discrimination, deleting too small a number of central observations may not give a proper idea of the heteroskedasticity. Keeping these two points in view, the working choice of c is suggested as c = n/3.

Moreover, the choice of X_{ij} is also difficult. Since σ_i^2 ∝ X_{ij}, if all important variables are included in the model, it may be difficult to decide which of the variables is influencing the heteroskedasticity.
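A minimal sketch of the Goldfeld–Quandt steps, assuming the heteroskedasticity-driving column index j and the number of omitted central observations c are chosen by the analyst, and that X includes an intercept column.

```python
import numpy as np

def goldfeld_quandt(y, X, j, c):
    """F0 = SS_res2 / SS_res1 after ordering by column j and dropping c middle rows."""
    order = np.argsort(X[:, j])          # order by X_j (ascending)
    X, y = X[order], y[order]
    m = (len(y) - c) // 2                # size of each part

    def ss_res(Xp, yp):
        b, *_ = np.linalg.lstsq(Xp, yp, rcond=None)
        r = yp - Xp @ b
        return r @ r

    ss1 = ss_res(X[:m], y[:m])           # part with the smaller X_j values
    ss2 = ss_res(X[-m:], y[-m:])         # part with the larger X_j values
    return ss2 / ss1                     # compare with F(m - k, m - k)
```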
6

4. Glesjer test
This test is based on the assumption that σ_i^2 is influenced by a single variable Z, i.e., there is only one variable which is influencing the heteroskedasticity. This variable could be either one of the explanatory variables or it can be chosen from some extraneous source.

The test procedure is as follows:

1. Use OLS and obtain the residual vector e on the basis of the available study and explanatory variables.
2. Choose Z and apply OLS to
   e_i = δ_0 + δ_1 Z_i^h + v_i,
   where v_i is the associated disturbance term.
3. Test H_0: δ_1 = 0 using the t-ratio test statistic.
4. Conduct the test for h = ±1, ±1/2, so the test procedure is repeated four times.

In practice, one can choose any value of h. For simplicity, we choose h = 1.

 The test has only asymptotic justification and the four choices of h give generally satisfactory results.

 This test sheds light on the nature of heteroskedasticity.


7

5. Spearman’s rank correlation test


If d_i denotes the difference in the ranks assigned to two different characteristics of the ith object or phenomenon and n is the number of objects or phenomena ranked, then Spearman's rank correlation coefficient is defined as

r = 1 - 6 \left[ \frac{\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} \right], \qquad -1 ≤ r ≤ 1.
This can be used for testing the hypothesis about the heteroskedasticity.
Consider the model
yi =β 0 + β1 X i + ε i .
1. Run the regression of y on X and obtain the residuals e.
2. Consider | ei |.
3. Rank both | ei | and X i (or yˆi ) in an ascending (or descending) order.
4. Compute rank correlation coefficient r based on | ei | and X i (or yˆ i ).
5. Assuming that the population rank correlation coefficient is zero and n > 8, use the test statistic

   t_0 = \frac{r \sqrt{n - 2}}{\sqrt{1 - r^2}},

   which follows a t-distribution with (n - 2) degrees of freedom.
6. The decision rule is to reject the null hypothesis of no heteroskedasticity whenever t_0 ≥ t_{1-α}(n - 2).
7. If there are more than one explanatory variables, then rank correlation coefficient can be computed between | ei | and
each of the explanatory variables separately and can be tested using t0.
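A minimal sketch of steps 1–6 above using scipy.stats.spearmanr for the rank correlation between |e_i| and X_i; x is a single hypothetical explanatory variable.

```python
import numpy as np
from scipy import stats

def spearman_het_test(y, x):
    """t0 statistic based on the rank correlation of |residuals| with x."""
    Z = np.column_stack([np.ones(len(y)), x])
    b, *_ = np.linalg.lstsq(Z, y, rcond=None)
    abs_e = np.abs(y - Z @ b)                    # steps 1-2: residuals and |e_i|
    r, _ = stats.spearmanr(abs_e, x)             # steps 3-4: rank correlation
    n = len(y)
    t0 = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)  # step 5
    return t0                                    # compare with t_{1-alpha}(n - 2)
```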
8
Estimation under heteroskedasticity
Consider the model

y = Xβ + ε

with k explanatory variables and assume that

E(ε) = 0, \qquad V(ε) = Ω = \begin{pmatrix} σ_1^2 & 0 & \cdots & 0 \\ 0 & σ_2^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & σ_n^2 \end{pmatrix}.
The OLSE is
b = ( X ' X ) −1 X ' y.

Its estimation error is

b - β = (X'X)^{-1} X'ε

and

E(b - β) = (X'X)^{-1} X'E(ε) = 0.
Thus OLSE remains unbiased even under heteroskedasticity.

The covariance matrix of b is

V (b) =E (b − β )(b − β ) '


= ( X ' X ) −1 X ' E (εε ') X ( X ' X ) −1
= ( X ' X ) −1 X ' Ω X ( X ' X ) −1
which is not the same as the conventional expression σ²(X'X)^{-1}. So the OLSE is less efficient under heteroskedasticity than under homoskedasticity.
9
Now we check whether E(e_i^2) = σ_i^2 or not, where e_i is the ith residual.

The residual vector is

e = y - Xb = Hε,

where H = I - X(X'X)^{-1}X'. Let ℓ_i be the n × 1 vector with all elements zero except the ith element, which is unity, so that e_i = ℓ_i'e. Then

e_i^2 = ℓ_i' e e' ℓ_i = ℓ_i' H εε' H ℓ_i,
E(e_i^2) = ℓ_i' H E(εε') H ℓ_i = ℓ_i' H Ω H ℓ_i.

Writing Hℓ_i = (h_{1i}, h_{2i}, ..., h_{ni})' for the ith column of H, this quadratic form evaluates to

E(e_i^2) = \sum_{j=1}^{n} h_{ji}^2 σ_j^2.
10

Thus E(e_i^2) ≠ σ_i^2, and so e_i^2 becomes a biased estimator of σ_i^2 in the presence of heteroskedasticity.

In the presence of heteroskedasticity, we use generalized least squares estimation. The generalized least squares estimator (GLSE) of β is

β̂ = (X'Ω^{-1}X)^{-1} X'Ω^{-1} y.

Its estimation error is obtained as

β̂ = (X'Ω^{-1}X)^{-1} X'Ω^{-1}(Xβ + ε)
β̂ - β = (X'Ω^{-1}X)^{-1} X'Ω^{-1}ε.

Thus

E(β̂ - β) = (X'Ω^{-1}X)^{-1} X'Ω^{-1} E(ε) = 0,

V(β̂) = E(β̂ - β)(β̂ - β)'
     = (X'Ω^{-1}X)^{-1} X'Ω^{-1} E(εε') Ω^{-1} X (X'Ω^{-1}X)^{-1}
     = (X'Ω^{-1}X)^{-1} X'Ω^{-1} Ω Ω^{-1} X (X'Ω^{-1}X)^{-1}
     = (X'Ω^{-1}X)^{-1}.
11

Example. Consider the simple linear regression model

y_i = β_0 + β_1 x_i + ε_i, \quad i = 1, 2, ..., n.

The variances of the OLSE and GLSE of β_1 are

Var(b_1) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2 σ_i^2}{\left[ \sum_{i=1}^{n} (x_i - \bar{x})^2 \right]^2}, \qquad
Var(β̂_1) = \frac{1}{\sum_{i=1}^{n} (x_i - \bar{x})^2 / σ_i^2},

so that

\frac{Var(β̂_1)}{Var(b_1)} = \frac{\left[ \sum_{i=1}^{n} (x_i - \bar{x})^2 \right]^2}{\left[ \sum_{i=1}^{n} (x_i - \bar{x})^2 σ_i^2 \right] \left[ \sum_{i=1}^{n} (x_i - \bar{x})^2 / σ_i^2 \right]}

= square of the correlation coefficient between σ_i(x_i - \bar{x}) and \frac{x_i - \bar{x}}{σ_i} ≤ 1

⇒ Var(β̂_1) ≤ Var(b_1).

So the relative efficiency of the OLSE and GLSE depends upon the correlation coefficient between (x_i - \bar{x})σ_i and \frac{x_i - \bar{x}}{σ_i}.
12
The generalized least squares estimation assumes that Ω is known, i.e., the nature of heteroskedasticity is completely
specified. Based on this assumption, the possibilities of following two cases arise:
 Ω is completely specified or
 Ω is not completely specified

We consider both the cases as follows:

Case 1: σ_i^2's are prespecified

Suppose σ_1^2, σ_2^2, ..., σ_n^2 are completely known in the model

y_i = β_1 + β_2 X_{i2} + ... + β_k X_{ik} + ε_i.

Now deflate the model by σ_i, i.e.,

\frac{y_i}{σ_i} = β_1 \frac{1}{σ_i} + β_2 \frac{X_{i2}}{σ_i} + ... + β_k \frac{X_{ik}}{σ_i} + \frac{ε_i}{σ_i}.

Let ε_i^* = ε_i / σ_i; then E(ε_i^*) = 0 and Var(ε_i^*) = σ_i^2 / σ_i^2 = 1.

Now OLS can be applied to this deflated model and the usual tools for drawing statistical inferences can be used.

Note that when the model is deflated, the intercept term is lost, as β_1/σ_i is itself a variable. This point has to be taken care of in the software output.
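A minimal sketch of Case 1: with the σ_i known, deflate y and every column of X (including the column of ones) by σ_i and run OLS on the transformed data; sigma is a hypothetical vector of known standard deviations.

```python
import numpy as np

def deflated_ols(y, X, sigma):
    """OLS on the model deflated by known sigma_i (weighted least squares)."""
    Xs = X / sigma[:, None]     # the former intercept column becomes 1/sigma_i
    ys = y / sigma
    beta, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
    return beta
```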
13
Case 2: Ω may not be completely specified

Let σ_1^2, σ_2^2, ..., σ_n^2 be partially known and suppose

σ_i^2 ∝ X_{ij}^{2λ}  or  σ_i^2 = σ^2 X_{ij}^{2λ},

but σ^2 is not available. Consider the model

y_i = β_1 + β_2 X_{i2} + ... + β_k X_{ik} + ε_i

and deflate it by X_{ij}^{λ} as

\frac{y_i}{X_{ij}^{λ}} = \frac{β_1}{X_{ij}^{λ}} + β_2 \frac{X_{i2}}{X_{ij}^{λ}} + ... + β_k \frac{X_{ik}}{X_{ij}^{λ}} + \frac{ε_i}{X_{ij}^{λ}}.

Now apply OLS to this transformed model and use the usual statistical tools for drawing inferences.

A caution is to be kept in mind while doing so. This is illustrated in the following example with a one-explanatory-variable model. Consider the model

y_i = β_0 + β_1 x_i + ε_i.

Deflating it by x_i, we get

\frac{y_i}{x_i} = \frac{β_0}{x_i} + β_1 + \frac{ε_i}{x_i}.

Note that the roles of β_0 and β_1 in the original and deflated models are interchanged: in the original model β_0 is the intercept term and β_1 is the slope parameter, whereas in the deflated model β_1 becomes the intercept term and β_0 becomes the slope parameter. So essentially one can use OLS, but one needs to be careful in identifying the intercept term and the slope parameter, particularly in the software output.
ECONOMETRIC THEORY

MODULE – IX
Lecture - 27
Autocorrelation

Dr. Shalabh
Department of Mathematics and Statistics
Indian Institute of Technology Kanpur
2

One of the basic assumptions in the linear regression model is that the random error components, or disturbances, are identically and independently distributed. So in the model y = Xβ + u, it is assumed that

E(u_t u_{t-s}) = σ_u^2 if s = 0, and 0 if s ≠ 0,

i.e., the correlation between successive disturbances is zero.

When the assumption E(u_t u_{t-s}) = σ_u^2, s = 0, is violated, i.e., the variance of the disturbance term does not remain constant, the problem of heteroskedasticity arises.

When the assumption E(u_t u_{t-s}) = 0, s ≠ 0, is violated, i.e., the variance of the disturbance term remains constant while the successive disturbance terms are correlated, the problem is termed autocorrelation.

When autocorrelation is present, some or all off diagonal elements in E (uu ') are nonzero.

Sometimes the study and explanatory variables have a natural sequence order over time, i.e., the data is collected with
respect to time. Such data is termed as time series data. The disturbance terms in time series data are serially
correlated.
3

The autocovariance at lag s is defined as

γ s = E (ut , ut − s ); s = 0, ±1, ±2,... .

At zero lag, we have constant variance, i.e.,

γ_0 = E(u_t^2) = σ^2.

The autocorrelation coefficient at lag s is defined as

ρ_s = \frac{E(u_t u_{t-s})}{\sqrt{Var(u_t)\,Var(u_{t-s})}}, \qquad s = 0, ±1, ±2, ...

Assume ρ s and γ s are symmetrical in s, i.e., these coefficients are constant over time and depend only on length of lag s.

The autocorrelation between the successive terms (u2 and u1), (u3 and u2),…, (un and un-1) gives the autocorrelation of
order one, i.e., ρ1 .

Similarly, the autocorrelation between the successive terms (u3 and u1 ), (u4 and u2 ),..., (un and un − 2 )
gives the autocorrelation of order two, i.e., ρ2 .
4

Source of autocorrelation

Some of the possible reasons for the introduction of autocorrelation in the data are as follows:

1. Carryover of effect, atleast in part, is an important source of autocorrelation. For example, the monthly data on
expenditure on household is influenced by the expenditure of preceding month. The autocorrelation is present in cross-
section data as well as time series data. In the cross-section data, the neighboring units tend to be similar with respect
to the characteristic under study. In time series data, the time is the factor that produces autocorrelation. Whenever
some ordering of sampling units is present, the autocorrelation may arise.

2. Another source of autocorrelation is the effect of deletion of some variables. In regression modeling, it is not possible
to include all the variables in the model. There can be various reasons for this, e.g., some variable may be qualitative,
sometimes direct observations may not be available on the variable etc. The joint effect of such deleted variables gives
rise to autocorrelation in the data.

3. The misspecification of the form of relationship can also introduce autocorrelation in the data. It is assumed that the form
of relationship between study and explanatory variables is linear. If there are log or exponential terms present in the
model so that the linearity of the model is questionable then this also gives rise to autocorrelation in the data.

4. The difference between the observed and true values of variable is called measurement error or errors–in-variable. The
presence of measurement errors on the dependent variable may also introduce the autocorrelation in the data.
5
Structure of disturbance term
Consider the situation where the disturbances are autocorrelated:

E(εε') = \begin{pmatrix} γ_0 & γ_1 & \cdots & γ_{n-1} \\ γ_1 & γ_0 & \cdots & γ_{n-2} \\ \vdots & \vdots & & \vdots \\ γ_{n-1} & γ_{n-2} & \cdots & γ_0 \end{pmatrix}
= γ_0 \begin{pmatrix} 1 & ρ_1 & \cdots & ρ_{n-1} \\ ρ_1 & 1 & \cdots & ρ_{n-2} \\ \vdots & \vdots & & \vdots \\ ρ_{n-1} & ρ_{n-2} & \cdots & 1 \end{pmatrix}
= σ_u^2 \begin{pmatrix} 1 & ρ_1 & \cdots & ρ_{n-1} \\ ρ_1 & 1 & \cdots & ρ_{n-2} \\ \vdots & \vdots & & \vdots \\ ρ_{n-1} & ρ_{n-2} & \cdots & 1 \end{pmatrix}.

Observe that now there are (n + k) parameters- β1 , β 2 ,..., β k , σ u2 , ρ1 , ρ 2 ,..., ρ n −1. These (n + k) parameters are to be
estimated on the basis of available n observations. Since the number of parameters are more than the number of
observations, so the situation is not good from the statistical point of view. In order to handle the situation, some special
form and the structure of the disturbance term is needed to be assumed so that the number of parameters in the
covariance matrix of disturbance term can be reduced.
The following structures are popular in autocorrelation:
1. Autoregressive (AR) process.
2. Moving average (MA) process.
3. Joint autoregressive moving average (ARMA) process.
6

1. Autoregressive (AR) process


The structure of the disturbance term in the autoregressive (AR) process is assumed to be

u_t = φ_1 u_{t-1} + φ_2 u_{t-2} + ... + φ_q u_{t-q} + ε_t,

i.e., the current disturbance term depends on the q lagged disturbances, and φ_1, φ_2, ..., φ_q are the parameters (coefficients) associated with u_{t-1}, u_{t-2}, ..., u_{t-q}, respectively. An additional disturbance term ε_t is introduced which is assumed to satisfy

E(ε_t) = 0, \qquad E(ε_t ε_{t-s}) = σ_ε^2 if s = 0, and 0 if s ≠ 0.
This process is termed as AR (q) process. In practice, the AR (1) process is more popular.

2. Moving average (MA) process


The structure of disturbance term in the moving average (MA) process is

ut = ε t + θ1ε t −1 + .... + θ pε t − p ,

i.e., the present disturbance term ut depends on the p lagged values. The coefficients θ1 ,θ 2 ,...,θ p are the parameters and
are associated with ε t −1 , ε t − 2 ,..., ε t − p , respectively. This process is termed as MA (p) process.
7

3. Joint autoregressive moving average (ARMA) process


The structure of disturbance term in the joint autoregressive moving average (ARMA) process is

u_t = φ_1 u_{t-1} + ... + φ_q u_{t-q} + ε_t + θ_1 ε_{t-1} + ... + θ_p ε_{t-p}.
This is termed as ARMA(q, p) process.

The method of the correlogram is used to check which of these processes the data follow. The correlogram is a two-dimensional graph of the autocorrelation coefficient ρ_s against the lag s, with lag s plotted on the X-axis and ρ_s on the Y-axis.

In the MA(1) process,

u_t = ε_t + θ_1 ε_{t-1},

ρ_s = \frac{θ_1}{1 + θ_1^2} \text{ for } s = 1, \qquad ρ_s = 0 \text{ for } s ≥ 2,

so ρ_0 = 1, ρ_1 ≠ 0 and ρ_i = 0 for i = 2, 3, ...

So there is no autocorrelation between the disturbances that are more than one period apart.
8

In the ARMA(1,1) process,

u_t = φ_1 u_{t-1} + ε_t + θ_1 ε_{t-1},

ρ_s = \frac{(1 + φ_1 θ_1)(φ_1 + θ_1)}{1 + θ_1^2 + 2 φ_1 θ_1} \text{ for } s = 1, \qquad ρ_s = φ_1 ρ_{s-1} \text{ for } s ≥ 2,

σ_u^2 = \left( \frac{1 + θ_1^2 + 2 φ_1 θ_1}{1 - φ_1^2} \right) σ_ε^2.

The autocorrelation function begins at some point determined by both the AR and MA components but thereafter, declines
geometrically at a rate determined by the AR component.

In general, the autocorrelation function


 is nonzero but is geometrically damped for AR process.
 becomes zero after a finite number of periods for MA process.

The ARMA process combines both these features.

The results of any lower order of process are not applicable in higher order schemes. As the order of the process increases,
the difficulty in handling them mathematically also increases
ECONOMETRIC THEORY

MODULE – IX
Lecture - 28
Autocorrelation

Dr. Shalabh
Department of Mathematics and Statistics
Indian Institute of Technology Kanpur
2

Estimation under the first order autoregressive process


Consider a simple linear regression model

y_t = β_0 + β_1 X_t + u_t, \quad t = 1, 2, ..., n.

Assume the u_t's follow a first order autoregressive scheme defined as

u_t = ρ u_{t-1} + ε_t,

where |ρ| < 1, E(ε_t) = 0,

E(ε_t ε_{t+s}) = σ_ε^2 if s = 0, and 0 if s ≠ 0,

for all t = 1, 2, ..., n, and ρ is the first order autocorrelation between u_t and u_{t-1}. Now

u_t = ρ u_{t-1} + ε_t
    = ρ(ρ u_{t-2} + ε_{t-1}) + ε_t
    = ⋯
    = ε_t + ρ ε_{t-1} + ρ^2 ε_{t-2} + ⋯
    = \sum_{r=0}^{\infty} ρ^r ε_{t-r},

so

E(u_t) = 0,
E(u_t^2) = E(ε_t^2) + ρ^2 E(ε_{t-1}^2) + ρ^4 E(ε_{t-2}^2) + ⋯
        = (1 + ρ^2 + ρ^4 + ⋯) σ_ε^2   (the ε_t's are serially independent),
E(u_t^2) = σ_u^2 = \frac{σ_ε^2}{1 - ρ^2} \quad \text{for all } t.
3

E(u_t u_{t-1}) = E[(ε_t + ρ ε_{t-1} + ρ^2 ε_{t-2} + ⋯)(ε_{t-1} + ρ ε_{t-2} + ρ^2 ε_{t-3} + ⋯)]
             = E[\{ε_t + ρ(ε_{t-1} + ρ ε_{t-2} + ⋯)\}\{ε_{t-1} + ρ ε_{t-2} + ⋯\}]
             = ρ E[(ε_{t-1} + ρ ε_{t-2} + ⋯)^2]
             = ρ σ_u^2.

Similarly,

E(u_t u_{t-2}) = ρ^2 σ_u^2.

In general,

E(u_t u_{t-s}) = ρ^s σ_u^2,

E(uu') = Ω = σ_u^2 \begin{pmatrix} 1 & ρ & ρ^2 & \cdots & ρ^{n-1} \\ ρ & 1 & ρ & \cdots & ρ^{n-2} \\ ρ^2 & ρ & 1 & \cdots & ρ^{n-3} \\ \vdots & & & & \vdots \\ ρ^{n-1} & ρ^{n-2} & ρ^{n-3} & \cdots & 1 \end{pmatrix}.

Note that the disturbance terms are no longer independent and E(uu') ≠ σ^2 I; the disturbances are nonspherical.
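A small sketch constructing the AR(1) covariance matrix E(uu') = Ω above for given ρ and σ_ε².

```python
import numpy as np

def ar1_covariance(n, rho, sigma_eps2):
    """Omega = sigma_u^2 * rho^{|t-s|} with sigma_u^2 = sigma_eps^2 / (1 - rho^2)."""
    sigma_u2 = sigma_eps2 / (1 - rho**2)
    t = np.arange(n)
    return sigma_u2 * rho ** np.abs(t[:, None] - t[None, :])
```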
4

Consequences of autocorrelated disturbances


Consider the model with first order autoregressive disturbances

y = Xβ + u,  (y: n × 1, X: n × k, β: k × 1, u: n × 1),
u_t = ρ u_{t-1} + ε_t, \quad t = 1, 2, ..., n,

with the assumptions

E(u) = 0, \quad E(uu') = Ω,
E(ε_t) = 0, \quad E(ε_t ε_{t+s}) = σ_ε^2 if s = 0, and 0 if s ≠ 0,
where Ω is a positive definite matrix.
The ordinary least squares estimator of β is

b = ( X ' X ) −1 X ' y
= ( X ' X ) −1 X '( X β + u )
b−β = ( X ' X ) −1 X ' u
E (b − β ) =0.
So OLSE remains unbiased under autocorrelated disturbances.

The covariance matrix of b is


V (b) =E (b − β )(b − β ) '
= ( X ' X ) −1 X ' E (uu ') X ( X ' X ) −1
= ( X ' X ) −1 X ' ΩX ( X ' X ) −1
≠ σ u2 ( X ' X ) −1.
5
The residual vector is

e = y - Xb = Hy = Hu,
e'e = y'Hy = u'Hu,
E(e'e) = E(u'u) - E[u'X(X'X)^{-1}X'u]
       = n σ_u^2 - tr[(X'X)^{-1} X'ΩX].

Since s^2 = \frac{e'e}{n-1}, we have

E(s^2) = \frac{n σ_u^2}{n-1} - \frac{1}{n-1} tr[(X'X)^{-1} X'ΩX],

so s^2 is a biased estimator of σ_u^2; in fact, s^2 has a downward bias.

Application of OLS fails in case of autocorrelation in the data and leads to serious consequences as
 overly optimistic view from R2.
 narrow confidence interval.
 usual t-ratio and F - ratio tests provide misleading results.
 prediction may have large variances.
Since disturbances are nonspherical, so generalized least squares estimate of β yields more efficient estimates than
OLSE.
The GLSE of β is

β̂ = (X'Ω^{-1}X)^{-1} X'Ω^{-1} y,  with  E(β̂) = β  and  V(β̂) = σ_u^2 (X'Ω^{-1}X)^{-1}.

The GLSE is best linear unbiased estimator of β.


6

Tests of autocorrelation
Durbin-Watson (D-W) test
The Durbin-Watson (D-W) test is used for testing the hypothesis of lack of first order autocorrelation in the disturbance
term. The null hypothesis is
H0 : ρ = 0
Use OLS to estimate β in y = Xβ + u and obtain the residual vector

e = y - Xb = Hy,

where b = (X'X)^{-1} X'y and H = I - X(X'X)^{-1} X'.

The D-W test statistic is


d = \frac{\sum_{t=2}^{n} (e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2}
  = \frac{\sum_{t=2}^{n} e_t^2}{\sum_{t=1}^{n} e_t^2} + \frac{\sum_{t=2}^{n} e_{t-1}^2}{\sum_{t=1}^{n} e_t^2} - 2 \frac{\sum_{t=2}^{n} e_t e_{t-1}}{\sum_{t=1}^{n} e_t^2}.

For large n,

d ≈ 1 + 1 - 2r = 2(1 - r),
7

where r is the sample autocorrelation coefficient from residuals based on OLSE and can be regarded as the regression
coefficient of et on et-1. Here
positive autocorrelation of et’s ⇒ d < 2,
negative autocorrelation of et’s ⇒ d > 2,
zero autocorrelation of et’s ⇒ d ≈ 2.

As -1 < r < 1, so
if -1 < r < 0, then 2 < d < 4 and
if 0 < r < 1, then 0 < d < 2.
So d lies between 0 and 4.

Since e depends on X, different data sets give different values of d, so the sampling distribution of d depends on X. Consequently, exact critical values of d cannot be tabulated owing to their dependence on X. Durbin and Watson therefore obtained two statistics \underline{d} and \bar{d} such that

\underline{d} < d < \bar{d}

whose sampling distributions do not depend upon X.

Considering the distributions of \underline{d} and \bar{d}, they tabulated the critical values as d_L and d_U respectively. They prepared tables of critical values for 15 < n < 100 and k ≤ 5. Tables are now available for 6 < n < 200 and k ≤ 10.
8

The test procedure is as follows:

H₀: ρ = 0

Nature of H₁    Reject H₀ when               Retain H₀ when          The test is inconclusive when
H₁: ρ > 0       d < d_L                      d > d_U                 d_L < d < d_U
H₁: ρ < 0       d > (4 − d_L)                d < (4 − d_U)           (4 − d_U) < d < (4 − d_L)
H₁: ρ ≠ 0       d < d_L or d > (4 − d_L)     d_U < d < (4 − d_U)     d_L < d < d_U or (4 − d_U) < d < (4 − d_L)

Values of d_L and d_U are obtained from the tables.
9

Limitations of D-W test

1. If d falls in the inconclusive zone, then no conclusive inference can be drawn. This zone becomes fairly large for low
degrees of freedom. One solution is to reject H₀ if the test is inconclusive. A better solution is to modify the test as
• Reject H₀ when d < d_U.
• Accept H₀ when d ≥ d_U.
This test gives a satisfactory solution when the values of the Xᵢ's change slowly, e.g., price, expenditure etc.

2. The D-W test is not applicable when the intercept term is absent in the model. In such a case, one can use another set of
critical values, say d_M, in place of d_L. The tables of critical values d_M are available.

3. The test is not valid when lagged dependent variables appear as explanatory variables. For example,

y_t = β₁y_{t−1} + β₂y_{t−2} + ... + β_r y_{t−r} + β_{r+1}x_{t1} + ... + β_k x_{t,k−r} + u_t
u_t = ρu_{t−1} + ε_t.

In such case, Durbin’s h test is used which is given as follows.


10

Durbin’s h-test
Apply OLS to
y_t = β₁y_{t−1} + β₂y_{t−2} + ... + β_r y_{t−r} + β_{r+1}x_{t1} + ... + β_k x_{t,k−r} + u_t
u_t = ρu_{t−1} + ε_t
and find the OLSE b₁ of β₁. Let its variance be Var(b₁) and its estimator be V̂ar(b₁). Then Durbin’s h-statistic is

h = r √[ n / (1 − n V̂ar(b₁)) ]

which is asymptotically distributed as N(0, 1), where

r = Σ_{t=2}^{n} e_t e_{t−1} / Σ_{t=2}^{n} e_t².

This test is applicable when n is large. When [1 − n V̂ar(b₁)] < 0, the test breaks down. In such cases, the following test
procedure can be adopted.
Introduce the lagged residual e_{t−1} as a proxy for ε_{t−1} in u_t = ρu_{t−1} + ε_t and consider

e_t = δ e_{t−1} + disturbance.

Now apply OLS to this model and test H₀A: δ = 0 versus H₁A: δ ≠ 0 using the t-test. If H₀A is accepted, then accept
H 0 : ρ = 0.
If H 0 A : δ = 0 is rejected, then reject H 0 : ρ = 0.
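A small hedged sketch (NumPy assumed; the helper name durbin_h is illustrative and not from the lecture) of how the h-statistic could be computed once the model containing y_{t−1} has been fitted by OLS:

import numpy as np

def durbin_h(e, var_b1, n):
    """Durbin's h-statistic from OLS residuals e of a model containing y_{t-1}.

    var_b1 is the estimated variance of the coefficient of y_{t-1};
    returns None when 1 - n*var_b1 <= 0 (the case where the test breaks down)."""
    r = np.sum(e[1:] * e[:-1]) / np.sum(e[1:] ** 2)
    denom = 1.0 - n * var_b1
    if denom <= 0:
        return None
    return r * np.sqrt(n / denom)

When the function returns None, one would fall back on the auxiliary regression of e_t on e_{t−1} described above.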
11

4. If H₀: ρ = 0 is rejected by the D-W test, it does not necessarily mean the presence of first order autocorrelation in the
disturbances. It could also happen for other reasons, e.g.,
• the disturbances may follow a higher order AR process,
• some important variables are omitted,
• the dynamics of the model are misspecified,
• the functional form of the model is incorrect.
ECONOMETRIC THEORY

MODULE – IX
Lecture - 29
Autocorrelation

Dr. Shalabh
Department of Mathematics and Statistics
Indian Institute of Technology Kanpur
2

Estimation procedures with autocorrelated errors when the autocorrelation coefficient is known

Consider the estimation of the regression coefficients under first order autoregressive disturbances when the autocorrelation
coefficient is known. The model is
y = Xβ + u,
u_t = ρu_{t−1} + ε_t
and assume that E(u) = 0, E(uu′) = ψ ≠ σ²I, E(ε) = 0, E(εε′) = σ_ε²I.
The OLSE of β is unbiased but not, in general, efficient, and the estimate of σ² is biased. So we use the generalized least
squares estimation procedure, and the GLSE of β is

β̂ = (X′ψ^{−1}X)^{−1}X′ψ^{−1}y

where

          [  1      −ρ       0      ...    0       0  ]
          [ −ρ     1+ρ²     −ρ      ...    0       0  ]
ψ^{−1} =  [  0      −ρ      1+ρ²    ...    0       0  ]
          [  ...                                      ]
          [  0       0       0      ...   1+ρ²    −ρ  ]
          [  0       0       0      ...   −ρ       1  ].

To employ this, we proceed as follows:

1. Find a matrix P such that P′P = ψ^{−1}. In this case

     [ √(1−ρ²)   0    0   ...   0   0 ]
     [   −ρ      1    0   ...   0   0 ]
P =  [    0     −ρ    1   ...   0   0 ]
     [   ...                          ]
     [    0      0    0   ...   1   0 ]
     [    0      0    0   ...  −ρ   1 ].

2. Transform the variables as
y* = Py,  X* = PX,  ε* = Pε.
Such transformation yields

       [ √(1−ρ²) y₁  ]            [ √(1−ρ²)   √(1−ρ²) x₁₂     ...   √(1−ρ²) x₁ₖ    ]
       [ y₂ − ρy₁    ]            [ 1 − ρ      x₂₂ − ρx₁₂     ...   x₂ₖ − ρx₁ₖ     ]
y* =   [ y₃ − ρy₂    ],     X* =  [ 1 − ρ      x₃₂ − ρx₂₂     ...   x₃ₖ − ρx₂ₖ     ]
       [ ...         ]            [ ...                                            ]
       [ yₙ − ρyₙ₋₁  ]            [ 1 − ρ      xₙ₂ − ρxₙ₋₁,₂  ...   xₙₖ − ρxₙ₋₁,ₖ  ].

Note that the first observation is treated differently from the other observations. For the first observation,

√(1 − ρ²) y₁ = √(1 − ρ²) x₁′β + √(1 − ρ²) u₁,

whereas for the other observations

y_t − ρy_{t−1} = (x_t − ρx_{t−1})′β + (u_t − ρu_{t−1}),  t = 2, 3, ..., n,

where x_t′ is a row vector of X. Also, √(1 − ρ²) u₁ and (u_t − ρu_{t−1}) have the same properties, so we expect these
errors to be uncorrelated and homoscedastic.

If the first column of X is a vector of ones, then the first column of X* is not constant. Its first element is √(1 − ρ²).

Now employ OLS with the observations y* and X*; the resulting estimator of β is

β̂ = (X*′X*)^{−1}X*′y*,

its covariance matrix is

V(β̂) = σ²(X*′X*)^{−1} = σ²(X′ψ^{−1}X)^{−1}

and its estimator is

V̂(β̂) = σ̂²(X′ψ^{−1}X)^{−1},  where  σ̂² = (y − Xβ̂)′ψ^{−1}(y − Xβ̂)/(n − k).
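A brief sketch (NumPy assumed; the function name prais_winsten_P is illustrative) of constructing P for a given ρ and checking that P′P reproduces the tridiagonal ψ^{−1} displayed above:

import numpy as np

def prais_winsten_P(rho, n):
    """Transformation matrix P with P'P = psi^{-1}; first row scaled by sqrt(1 - rho^2)."""
    P = np.eye(n)
    P[0, 0] = np.sqrt(1.0 - rho ** 2)
    for t in range(1, n):
        P[t, t - 1] = -rho
    return P

rho, n = 0.5, 6
P = prais_winsten_P(rho, n)
psi_inv = P.T @ P          # tridiagonal: 1 + rho^2 on the interior diagonal, -rho off-diagonal
print(np.round(psi_inv, 3))

Once P is available, GLS reduces to OLS on the transformed data y* = Py and X* = PX, exactly as described above.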
5

Estimation procedures with autocorrelated errors when the autocorrelation coefficient is unknown

Several procedures have been suggested to estimate the regression coefficients when the autocorrelation coefficient is
unknown. The feasible GLSE of β is

β̂_F = (X′Ω̂^{−1}X)^{−1}X′Ω̂^{−1}y

where Ω̂^{−1} is the ψ^{−1} matrix with ρ replaced by its estimator ρ̂.

1. Use of sample correlation coefficient

The most common method is to use the sample correlation coefficient r between successive residuals as the natural estimator
of ρ. The sample correlation can be estimated using the residuals in place of the disturbances as

r = Σ_{t=2}^{n} e_t e_{t−1} / Σ_{t=2}^{n} e_t²

where e_t = y_t − x_t′b, t = 1, 2, ..., n, and b is the OLSE of β.
Two modifications have been suggested which can be used in place of r:
1. r* = [(n − k)/(n − 1)] r, which is Theil’s estimator.
2. r** = 1 − d/2 for large n, where d is the Durbin-Watson statistic for H₀: ρ = 0.
6

2. Durbin procedure
In the Durbin procedure, the model

y_t − ρy_{t−1} = β₀(1 − ρ) + β(x_t − ρx_{t−1}) + ε_t,  t = 2, 3, ..., n

is expressed as

y_t = β₀(1 − ρ) + ρy_{t−1} + βx_t − ρβx_{t−1} + ε_t
    = β₀* + ρy_{t−1} + βx_t + β*x_{t−1} + ε_t,  t = 2, 3, ..., n     (*)

where β₀* = β₀(1 − ρ) and β* = −ρβ.

Now run a regression using OLS on model (*) and take the estimated coefficient of y_{t−1} as the estimate of ρ.

Another possibility is, since ρ ∈ (−1, 1), to search over suitable values of ρ for the one with the smallest error sum of squares.
7

3. Cochrane-Orcutt procedure

This procedure utilizes the P matrix defined while estimating β when ρ is known. It has the following steps:

(i) Apply OLS to y_t = β₀ + β₁x_t + u_t and obtain the residual vector e.

(ii) Estimate ρ by  r = Σ_{t=2}^{n} e_t e_{t−1} / Σ_{t=2}^{n} e²_{t−1}.

Note that r is a consistent estimator of ρ.

(iii) Replace ρ by r in

y_t − ρy_{t−1} = β₀(1 − ρ) + β(x_t − ρx_{t−1}) + ε_t

and apply OLS to the transformed model

y_t − ry_{t−1} = β₀* + β(x_t − rx_{t−1}) + disturbance term

and obtain the estimators of β₀* and β as β̂₀* and β̂, respectively.

This is the Cochrane-Orcutt procedure. Since two successive applications of OLS are involved, it is also called a two-step
procedure.
8

This application can be repeated in the procedure as follows:

(i) Substitute β̂₀* and β̂ in the original model.

(ii) Calculate the residual sum of squares.

(iii) Calculate ρ by  r = Σ_{t=2}^{n} e_t e_{t−1} / Σ_{t=2}^{n} e²_{t−1},  substitute it in the model

y_t − ρy_{t−1} = β₀(1 − ρ) + β(x_t − ρx_{t−1}) + ε_t

and again obtain the transformed model.

(iv) Apply OLS to this model and calculate the regression coefficients.

This procedure is repeated until convergence is achieved, i.e., iterate the process till two successive estimates are nearly the
same so that stability of the estimator is achieved.

This is an iterative and numerically convergent procedure. Such estimates are asymptotically efficient, and there
is a loss of one observation.
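A hedged sketch of the iterated Cochrane-Orcutt procedure (NumPy assumed; cochrane_orcutt is an illustrative helper name, and X is assumed to contain a column of ones so that the transformed intercept column equals (1 − ρ)):

import numpy as np

def cochrane_orcutt(y, X, tol=1e-6, max_iter=50):
    """Iterative Cochrane-Orcutt estimation (a sketch; the first observation is dropped)."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)                # step (i): OLS
    rho_old = 0.0
    for _ in range(max_iter):
        e = y - X @ b
        rho = np.sum(e[1:] * e[:-1]) / np.sum(e[:-1] ** 2)   # step (ii): estimate rho
        y_star = y[1:] - rho * y[:-1]                        # step (iii): transform
        X_star = X[1:] - rho * X[:-1]
        b, *_ = np.linalg.lstsq(X_star, y_star, rcond=None)  # step (iv): re-estimate
        if abs(rho - rho_old) < tol:
            break
        rho_old = rho
    return b, rho

Because the transformed intercept column is the constant (1 − ρ), the coefficient returned for that column is an estimate of β₀ itself.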
9

4. Hildreth-Lu procedure or grid-search procedure

The Hildreth-Lu procedure has the following steps:

(i) Apply OLS to

(y_t − ρy_{t−1}) = β₀(1 − ρ) + β(x_t − ρx_{t−1}) + ε_t,  t = 2, 3, ..., n

using different values of ρ (−1 ≤ ρ ≤ 1), such as ρ = ±0.1, ±0.2, ... .

(ii) Calculate the residual sum of squares in each case.

(iii) Select that value of ρ for which the residual sum of squares is smallest.

Suppose we get ρ = 0.4. Now choose a finer grid. For example, choose ρ such that 0.3 < ρ < 0.5, consider
ρ = 0.31, 0.32, ..., 0.49 and pick the ρ with the smallest residual sum of squares. Such iteration can be repeated until a
value of ρ corresponding to the minimum residual sum of squares is obtained. The selected final value of ρ can then be
used for transforming the model as in the Cochrane-Orcutt procedure. The estimators obtained with this
procedure are as efficient as those obtained by the Cochrane-Orcutt procedure, and there is a loss of one observation.
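A minimal grid-search sketch under the same assumptions (NumPy; hildreth_lu is an illustrative name, and y, X are placeholders for the data):

import numpy as np

def hildreth_lu(y, X, grid):
    """Grid search over rho; returns the value minimizing the residual sum of squares."""
    best = (None, np.inf, None)
    for rho in grid:
        y_star = y[1:] - rho * y[:-1]
        X_star = X[1:] - rho * X[:-1]
        b, *_ = np.linalg.lstsq(X_star, y_star, rcond=None)
        rss = np.sum((y_star - X_star @ b) ** 2)
        if rss < best[1]:
            best = (rho, rss, b)
    return best

# coarse grid first, then a finer grid around the chosen value, e.g.:
# coarse = hildreth_lu(y, X, np.arange(-0.9, 1.0, 0.1))
# fine   = hildreth_lu(y, X, np.arange(coarse[0] - 0.09, coarse[0] + 0.10, 0.01))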
10

5. Prais-Winsten procedure

This is also an iterative procedure based on a two-step transformation.

(i) Estimate ρ by  ρ̂ = Σ_{t=2}^{n} e_t e_{t−1} / Σ_{t=2}^{n} e²_{t−1},  where the e_t's are the residuals based on the OLSE.

(ii) Replace ρ by ρ̂ in the model as in the Cochrane-Orcutt procedure, transforming the first observation as well:

√(1 − ρ̂²) y₁ = √(1 − ρ̂²) β₀ + β √(1 − ρ̂²) x₁ + √(1 − ρ̂²) u₁
y_t − ρ̂y_{t−1} = (1 − ρ̂)β₀ + β(x_t − ρ̂x_{t−1}) + (u_t − ρ̂u_{t−1}),  t = 2, 3, ..., n.

(iii) Use OLS for estimating the parameters.

The estimators obtained with this procedure are asymptotically as efficient as the best linear unbiased estimator. There is no
loss of any observation.


11

6. Maximum likelihood procedure

Assuming that y ~ N(Xβ, σ_ε²ψ), the likelihood function for β, ρ and σ_ε² is

L = [(2πσ_ε²)^{n/2} |ψ|^{1/2}]^{−1} exp[ −(1/(2σ_ε²)) (y − Xβ)′ψ^{−1}(y − Xβ) ].

Ignoring the constant and using |ψ| = 1/(1 − ρ²), the log-likelihood is

ln L(β, σ_ε², ρ) = −(n/2) ln σ_ε² + (1/2) ln(1 − ρ²) − (1/(2σ_ε²)) (y − Xβ)′ψ^{−1}(y − Xβ).

The maximum likelihood estimators of β, ρ and σ_ε² can be obtained by solving the normal equations

∂ln L/∂β = 0,  ∂ln L/∂ρ = 0,  ∂ln L/∂σ_ε² = 0.

These normal equations turn out to be nonlinear in the parameters and cannot be solved easily.
One solution is to
• first derive the maximum likelihood estimator of σ_ε²,
• substitute it back into the likelihood function and obtain the likelihood function as a function of β and ρ,
• maximize this likelihood function with respect to β and ρ.
Thus

∂ln L/∂σ_ε² = 0  ⇒  −n/(2σ_ε²) + (1/(2σ_ε⁴)) (y − Xβ)′ψ^{−1}(y − Xβ) = 0
⇒  σ̂_ε² = (1/n) (y − Xβ)′ψ^{−1}(y − Xβ)

is the estimator of σ_ε².
is the estimator of σ ε2 .
12

Substituting σ̂_ε² in place of σ_ε² in the log-likelihood function yields

ln L*(β, ρ) = −(n/2) ln[ (1/n)(y − Xβ)′ψ^{−1}(y − Xβ) ] + (1/2) ln(1 − ρ²) − n/2
            = −(n/2) [ ln{(y − Xβ)′ψ^{−1}(y − Xβ)} − (1/n) ln(1 − ρ²) ] + k
            = k − (n/2) ln[ (y − Xβ)′ψ^{−1}(y − Xβ) / (1 − ρ²)^{1/n} ]

where k = (n/2) ln n − n/2.

Maximization of ln L* is equivalent to minimizing the function

(y − Xβ)′ψ^{−1}(y − Xβ) / (1 − ρ²)^{1/n}.
Using optimization techniques of non-linear regression, this function can be minimized and estimates of β and ρ
can be obtained.

If n is large and |ρ| is not too close to one, then the factor (1 − ρ²)^{−1/n} is close to one and its effect is negligible, so the estimates of β
will be essentially the same as those obtained by nonlinear least squares estimation.
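A sketch of minimizing this concentrated objective numerically (NumPy and SciPy assumed; for each trial ρ, β is profiled out by OLS on the Prais-Winsten-transformed data, which is one convenient way, not the only way, of evaluating (y − Xβ)′ψ^{−1}(y − Xβ)):

import numpy as np
from scipy.optimize import minimize_scalar

def concentrated_objective(rho, y, X):
    """(y - Xb)' psi^{-1} (y - Xb) / (1 - rho^2)^{1/n}, with beta profiled out by OLS."""
    n = len(y)
    P = np.eye(n)
    P[0, 0] = np.sqrt(1.0 - rho ** 2)
    for t in range(1, n):
        P[t, t - 1] = -rho
    y_s, X_s = P @ y, P @ X
    b, *_ = np.linalg.lstsq(X_s, y_s, rcond=None)
    rss = np.sum((y_s - X_s @ b) ** 2)        # equals (y - Xb)' psi^{-1} (y - Xb)
    return rss / (1.0 - rho ** 2) ** (1.0 / n)

# res = minimize_scalar(concentrated_objective, bounds=(-0.99, 0.99),
#                       args=(y, X), method='bounded')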


ECONOMETRIC THEORY

MODULE – X
Lecture - 30
Dummy Variable Models

Dr. Shalabh
Department of Mathematics and Statistics
Indian Institute of Technology Kanpur
2

In general, the explanatory variables in any regression analysis are assumed to be quantitative in nature.

For example, the variables like temperature, distance, age etc. are quantitative in the sense that they are recorded on a well
defined scale.

In many applications, the variables can not be defined on a well defined scale and they are qualitative in nature.

For example, the variables like sex (male or female), colour (black, white), nationality, employment status (employed,

unemployed) are defined on a nominal scale. Such variables do not have any natural scale of measurement. Such

variables usually indicate the presence or absence of a “quality” or an attribute like employed or unemployed, graduate or

non-graduate, smokers or non- smokers, yes or no, acceptance or rejection, so they are defined on a nominal scale. Such

variables can be quantified by artificially constructing the variables that take the values, e.g., 1 and 0 where “1” indicates

usually the presence of attribute and “0” indicates usually the absence of attribute. For example, “1” indicates that the

person is male and “0” indicates that the person is female. Similarly, “1” may indicate that the person is employed and

then “0” indicates that the person is unemployed.

Such variables classify the data into mutually exclusive categories. These variables are called indicator variables or

dummy variables.
3

Usually, the dummy or indicator variables take on the values 0 and 1 to identify the mutually exclusive classes of the
explanatory variables. For example,

D = 1 if the person is male, 0 if the person is female;
D = 1 if the person is employed, 0 if the person is unemployed.

Here we use the notation D in place of X to denote the dummy variable. The choice of 1 and 0 to identify a category is
arbitrary. For example, one can also define the dummy variables in the above examples as

D = 1 if the person is female, 0 if the person is male;
D = 1 if the person is unemployed, 0 if the person is employed.

It is also not necessary to choose only 1 and 0 to denote the category. In fact, any distinct value of D will serve the
purpose. The choices of 1 and 0 are preferred as they make the calculations simple, help in easy interpretation of the
values and usually turn out to be a satisfactory choice.

In a given regression model, the qualitative and quantitative variables may also occur together, i.e., some variables may be
qualitative and others quantitative.
4

When all explanatory variables are


 quantitative, then the model is called a regression model,
 qualitative, then the model is called an analysis of variance model and
 quantitative and qualitative both, then the model is called an analysis of covariance model.

Such models can be dealt within the framework of regression analysis. The usual tools of regression analysis can be used
in case of dummy variables.

Example
Consider the following model with x1 as quantitative and D2 as dummy variable

y =β 0 + β1 x1 + β 2 D2 + ε , E (ε ) =0, Var (ε ) =σ 2
0 if an observation belongs to group A
D2 = 
1 if an observation belongs to group B.

The interpretation of result is important. We proceed as follows:


If D₂ = 0, then
y = β₀ + β₁x₁ + β₂·0 + ε = β₀ + β₁x₁ + ε
E(y | D₂ = 0) = β₀ + β₁x₁,

which is a straight line relationship with intercept β₀ and slope β₁.

If D₂ = 1, then
y = β₀ + β₁x₁ + β₂·1 + ε = (β₀ + β₂) + β₁x₁ + ε
E(y | D₂ = 1) = (β₀ + β₂) + β₁x₁,

which is a straight line relationship with intercept (β₀ + β₂) and slope β₁.

The quantities E(y | D₂ = 0) and E(y | D₂ = 1) are the average responses when an observation belongs to group A and
group B, respectively. Thus

β₂ = E(y | D₂ = 1) − E(y | D₂ = 0),
which has an interpretation as the difference between the average values of y with D2 = 0 and D2 = 1.

Graphically, it looks as in the following figure, which describes two parallel regression lines with the same variance.

[Figure: two parallel lines plotted against x — E(y | D₂ = 0) = β₀ + β₁x₁ and E(y | D₂ = 1) = (β₀ + β₂) + β₁x₁ — with common slope β₁, intercepts β₀ and β₀ + β₂, and vertical separation β₂.]
6
If there are three explanatory variables in the model with two dummy variables D₂ and D₃, then they will describe three
levels, e.g., groups A, B and C. The levels of the dummy variables are as follows:

1. D₂ = 0, D₃ = 0 if the observation is from group A.
2. D₂ = 1, D₃ = 0 if the observation is from group B.
3. D₂ = 0, D₃ = 1 if the observation is from group C.

The concerned regression model is

y =β 0 + β1 x1 + β 2 D2 + β3 D3 + ε , E (ε ) =0, Var (ε ) =σ 2 .

In general, if a qualitative variable has m levels, then (m - 1) dummy variables are required and each of them takes value
0 and 1.
7

Consider the following examples to understand how to define such dummy variables and how they can be handled.

Example
Suppose y denotes the monthly salary of a person and D denotes whether the person is graduate or non-graduate. The
model is
y =β 0 + β1 D + ε , E (ε ) =0, var(ε ) =σ 2 .
With n observations, the model is

y_i = β₀ + β₁D_i + ε_i,  i = 1, 2, ..., n
E(y_i | D_i = 0) = β₀
E(y_i | D_i = 1) = β₀ + β₁
β₁ = E(y_i | D_i = 1) − E(y_i | D_i = 0).

Thus
- β0 measures the mean salary of a non-graduate
- β1 measures the difference in the mean salaries of a graduate and non-graduate person.

Now consider the same model with two dummy variables defined in the following way:

D_i1 = 1 if the person is a graduate, 0 if the person is a non-graduate;
D_i2 = 1 if the person is a non-graduate, 0 if the person is a graduate.

8

The model with n observations is

y_i = β₀ + β₁D_i1 + β₂D_i2 + ε_i,  E(ε_i) = 0, Var(ε_i) = σ², i = 1, 2, ..., n.

Then we have

1. E[y_i | D_i1 = 0, D_i2 = 1] = β₀ + β₂ : average salary of a non-graduate.
2. E[y_i | D_i1 = 1, D_i2 = 0] = β₀ + β₁ : average salary of a graduate.
3. E[y_i | D_i1 = 0, D_i2 = 0] = β₀ : cannot exist.
4. E[y_i | D_i1 = 1, D_i2 = 1] = β₀ + β₁ + β₂ : cannot exist.

Notice that in this case

D_i1 + D_i2 = 1 for all i,

which is an exact linear constraint and indicates the contradiction as follows:

D_i1 + D_i2 = 1 ⇒ person is a graduate
D_i1 + D_i2 = 1 ⇒ person is a non-graduate.

So multicollinearity is present in such cases. Hence the rank of the matrix of explanatory variables falls short by 1, so β₀, β₁
and β₂ are indeterminate and the least squares method breaks down. Thus introducing two dummy variables appears useful
but leads to serious consequences. This is known as the dummy variable trap.
9

If the intercept term is ignored, then the model becomes

y_i = β₁D_i1 + β₂D_i2 + ε_i,  E(ε_i) = 0, Var(ε_i) = σ², i = 1, 2, ..., n,

and then

E(y_i | D_i1 = 1, D_i2 = 0) = β₁ ⇒ average salary of a graduate,
E(y_i | D_i1 = 0, D_i2 = 1) = β₂ ⇒ average salary of a non-graduate.

So when the intercept term is dropped, β₁ and β₂ have proper interpretations as the average salaries of a graduate
and a non-graduate person, respectively.

Now the parameters can be estimated using the ordinary least squares principle, and standard procedures for drawing
inferences can be used.

Rule: When the explanatory variable leads to a classification into m mutually exclusive categories, use (m − 1) dummy
variables for its representation, as illustrated in the sketch below. Alternatively, use m dummy variables and drop the intercept term.
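A minimal sketch of the rule (NumPy assumed; the simulated salary data and numeric values are illustrative): one dummy with an intercept, or one dummy per category without an intercept, give equivalent but differently parameterized fits.

import numpy as np

rng = np.random.default_rng(1)
n = 200
graduate = rng.integers(0, 2, size=n)                 # D = 1 for graduate, 0 otherwise
salary = 30 + 12 * graduate + rng.normal(scale=4, size=n)

# (m - 1) = 1 dummy plus an intercept: beta0 = mean salary of non-graduates,
# beta1 = difference between graduate and non-graduate mean salaries
X = np.column_stack([np.ones(n), graduate])
beta, *_ = np.linalg.lstsq(X, salary, rcond=None)
print(beta)

# Equivalent parameterization without the intercept: one dummy per category,
# coefficients are the two group means directly (no dummy variable trap)
X2 = np.column_stack([1 - graduate, graduate])
means, *_ = np.linalg.lstsq(X2, salary, rcond=None)
print(means)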
10

Interaction term
Suppose a model has two explanatory variables – one quantitative variable and other a dummy variable. Suppose both
interact and an explanatory variable as the interaction of them is added to the model.

yi =β 0 + β1 xi1 + β 2 Di 2 + β3 xi1 Di 2 + ε i , E (ε i ) =0, Var (ε i ) =σ 2,i =1, 2,..., n.

To interpret the model parameters, we proceed as follows:


Suppose the indicator variable is given by

D_i2 = 1 if the ith person belongs to group A, 0 if the ith person belongs to group B,

and y_i = salary of the ith person.

Then
E(y_i | D_i2 = 0) = β₀ + β₁x_i1 + β₂·0 + β₃x_i1·0
                  = β₀ + β₁x_i1.

This is a straight line with intercept β₀ and slope β₁. Next,

E(y_i | D_i2 = 1) = β₀ + β₁x_i1 + β₂·1 + β₃x_i1·1
                  = (β₀ + β₂) + (β₁ + β₃)x_i1.

This is a straight line with intercept term (β₀ + β₂) and slope (β₁ + β₃).


The model
E ( yi ) =β 0 + β1 xi1 + β 2 Di 2 + β3 xi1 Di 2
has different slopes and different intercept terms.
11

Thus
β 2 reflects the change in intercept term associated with the change in the group of person i.e., when group
changes from A to B.
β3 reflects the change in slope associated with the change in the group of person, i.e., when group changes
from A to B.

Fitting the model y_i = β₀ + β₁x_i1 + β₂D_i2 + β₃x_i1D_i2 + ε_i

is equivalent to fitting two separate regression models corresponding to D_i2 = 1 and D_i2 = 0, i.e.,

y_i = β₀ + β₁x_i1 + β₂·1 + β₃x_i1·1 + ε_i = (β₀ + β₂) + (β₁ + β₃)x_i1 + ε_i

and

y_i = β₀ + β₁x_i1 + β₂·0 + β₃x_i1·0 + ε_i = β₀ + β₁x_i1 + ε_i,

respectively.
The test of hypothesis becomes convenient by using a dummy variable. For example, if we want to test whether the two
regression models are identical, the test of hypothesis involves testing

H₀: β₂ = β₃ = 0
H₁: β₂ ≠ 0 and/or β₃ ≠ 0.

Acceptance of H0 indicates that only a single model is necessary to explain the relationship.
In another example, if the objective is to test that the two models differ with respect to intercepts only and they have
same slopes, then the test of hypothesis involves testing H 0 : β 3 = 0
H1 : β 3 ≠ 0.
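A hedged sketch (NumPy; f_test_nested is an illustrative helper) of carrying out such tests by comparing the residual sums of squares of the restricted and full models:

import numpy as np

def f_test_nested(y, X_restricted, X_full):
    """F statistic for a restricted model against the full model."""
    def rss(X):
        b, *_ = np.linalg.lstsq(X, y, rcond=None)
        return np.sum((y - X @ b) ** 2)
    n = len(y)
    q = X_full.shape[1] - X_restricted.shape[1]         # number of restrictions
    rss_r, rss_f = rss(X_restricted), rss(X_full)
    return ((rss_r - rss_f) / q) / (rss_f / (n - X_full.shape[1]))

# X_full       = columns [1, x1, D2, x1*D2]
# X_restricted = columns [1, x1]         -> tests H0: beta2 = beta3 = 0
# X_restricted = columns [1, x1, D2]     -> tests H0: beta3 = 0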
12

Dummy variables versus quantitative explanatory variable


The quantitative explanatory variables can be converted into dummy variables. For example, if the ages of persons are
grouped as follows:

Group 1: 1 day to 3 years


Group 2: 3 years to 8 years
Group 3: 8 years to 12 years
Group 4: 12 years to 17 years
Group 5: 17 years to 25 years

then the variable “age” can be represented by four different dummy variables.

Since it is sometimes difficult to collect data on individual ages, this grouping helps in easy collection of data. A disadvantage is that some
loss of information occurs. For example, if the ages in years are 2, 3, 4, 5, 6, 7 and the dummy variable is defined
as

D_i = 1 if the age of the ith person is ≥ 5 years, 0 if the age of the ith person is < 5 years,

then these values become 0, 0, 0, 1, 1, 1. Now looking at the value 1, one cannot determine whether it corresponds to age 5, 6 or
7 years.
13

Moreover, if a quantitative explanatory variable is grouped into m categories, then (m -1) parameters are required
whereas if the original variable is used as such, then only one parameter is required.

Treating a quantitative variable as qualitative variable increases the complexity of the model.

The degrees of freedom for error are also reduced. This can affect the inferences if the data set is small.

In large data sets, such effect may be small.

The use of dummy variables does not require any assumption about the functional form of the relationship between
study and explanatory variables.
ECONOMETRIC THEORY

MODULE – XI
Lecture - 31
Specification Error Analysis

Dr. Shalabh
Department of Mathematics and Statistics
Indian Institute of Technology Kanpur
2

The specification of a linear regression model consists of a formulation of the regression relationships and of statements or
assumptions concerning the explanatory variables and disturbances. If any of these is violated, e.g., incorrect functional
form, incorrect introduction of disturbance term in the model etc., then specification error occurs. In narrower sense, the
specification error refers to explanatory variables.

The complete regression analysis depends on the explanatory variables present in the model. It is understood in the
regression analysis that only correct and important explanatory variables appear in the model. In practice, after ensuring
the correct functional form of the model, the analyst usually has a pool of explanatory variables which possibly influence
the process or experiment. Generally, all such candidate variables are not used in the regression modeling but a subset
of explanatory variables is chosen from this pool. How to determine such an appropriate subset of explanatory variables
to be used in regression is called the problem of variable selection.

While choosing a subset of explanatory variables, there are two possible options:
1. In order to make the model as realistic as possible, the analyst may include as many as possible explanatory
variables.
2. In order to make the model as simple as possible, one may include only fewer number of explanatory variables.

There can be two types of incorrect model specifications.

1. Omission/exclusion of relevant variables.

2. Inclusion of irrelevant variables .

Now we discuss the statistical consequences arising from the both situations.
3

1. Exclusion of relevant variables


In order to keep the model simple, the analyst may delete some of the explanatory variables which may be of importance

from the point of view of theoretical considerations. There can be several reasons behind such decisions, e.g., it may be

hard to quantify the variables like taste, intelligence etc. Sometimes it may be difficult to take correct observations on the

variables like income etc.

Let there be k candidate explanatory variables, out of which suppose r variables are included and (k − r) variables are
deleted from the model. So partition X and β as

X = [X₁  X₂]  and  β = [β₁′  β₂′]′,
where X₁ is n×r, X₂ is n×(k − r), β₁ is r×1 and β₂ is (k − r)×1.

The model y = Xβ + ε, E(ε) = 0, V(ε) = σ²I can be expressed as

y = X₁β₁ + X₂β₂ + ε,

which is called the full model or true model.

After dropping the (k − r) explanatory variables in X₂, the new model is

y = X₁β₁ + δ,

which is called the misspecified model or false model.
4

Applying OLS to the false model, the OLSE of β₁ is

b₁F = (X₁′X₁)^{−1}X₁′y.

The estimation error is obtained as follows:

b₁F = (X₁′X₁)^{−1}X₁′(X₁β₁ + X₂β₂ + ε)
    = β₁ + (X₁′X₁)^{−1}X₁′X₂β₂ + (X₁′X₁)^{−1}X₁′ε
b₁F − β₁ = θ + (X₁′X₁)^{−1}X₁′ε

where

θ = (X₁′X₁)^{−1}X₁′X₂β₂.

Thus

E(b₁F − β₁) = θ + (X₁′X₁)^{−1}X₁′E(ε) = θ,

which is a linear function of β₂, i.e., of the coefficients of the excluded variables. So b₁F is biased, in general. The bias vanishes
if X₁′X₂ = 0, i.e., if X₁ and X₂ are orthogonal or uncorrelated.

The mean squared error matrix of b₁F is

MSE(b₁F) = E(b₁F − β₁)(b₁F − β₁)′
         = E[θθ′ + θε′X₁(X₁′X₁)^{−1} + (X₁′X₁)^{−1}X₁′εθ′ + (X₁′X₁)^{−1}X₁′εε′X₁(X₁′X₁)^{−1}]
         = θθ′ + 0 + 0 + σ²(X₁′X₁)^{−1}X₁′IX₁(X₁′X₁)^{−1}
         = θθ′ + σ²(X₁′X₁)^{−1}.

So efficiency generally declines. Note that the second term is the conventional form of the MSE (the covariance matrix of b₁F).
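A small simulation sketch (NumPy assumed; all numeric values are illustrative) showing the omitted-variable bias θ when X₁ and X₂ are correlated:

import numpy as np

rng = np.random.default_rng(2)
n, reps = 200, 2000
bias_estimates = []
for _ in range(reps):
    x1 = rng.normal(size=n)
    x2 = 0.7 * x1 + rng.normal(size=n)       # X1 and X2 correlated, so theta != 0
    y = 1.0 + 2.0 * x1 + 1.5 * x2 + rng.normal(size=n)
    X1 = np.column_stack([np.ones(n), x1])   # misspecified model omits x2
    b1F, *_ = np.linalg.lstsq(X1, y, rcond=None)
    bias_estimates.append(b1F[1] - 2.0)
print(np.mean(bias_estimates))   # close to 0.7 * 1.5 = 1.05 rather than zero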
5

The residual sum of squares is

σ̂² = SS_res/(n − r) = e′e/(n − r)

where

e = y − X₁b₁F = H₁y,  H₁ = I − X₁(X₁′X₁)^{−1}X₁′.

Thus

H₁y = H₁(X₁β₁ + X₂β₂ + ε) = 0 + H₁(X₂β₂ + ε) = H₁(X₂β₂ + ε),

y′H₁y = (X₁β₁ + X₂β₂ + ε)′H₁(X₂β₂ + ε)
      = β₂′X₂′H₁X₂β₂ + β₂′X₂′H₁ε + ε′H₁X₂β₂ + ε′H₁ε    (using H₁X₁ = 0).

E(s²) = (1/(n − r)) [E(β₂′X₂′H₁X₂β₂) + 0 + 0 + E(ε′H₁ε)]
      = (1/(n − r)) [β₂′X₂′H₁X₂β₂ + (n − r)σ²]
      = σ² + (1/(n − r)) β₂′X₂′H₁X₂β₂.

Thus s² is a biased estimator of σ²; in fact, s² provides an overestimate of σ². Note that even if X₁′X₂ = 0, s² still
gives an overestimate of σ². So the statistical inferences based on it will be faulty. The t-test and confidence region will
be invalid in this case.
6

If the response is to be predicted at x′ = (x₁′, x₂′), then using the full model, the predicted value is

ŷ = x′b = x′(X′X)^{−1}X′y

with

E(ŷ) = x′β,
Var(ŷ) = σ²[1 + x′(X′X)^{−1}x].

When the subset model is used, the predictor is

ŷ₁ = x₁′b₁F

and then

E(ŷ₁) = x₁′(X₁′X₁)^{−1}X₁′E(y)
      = x₁′(X₁′X₁)^{−1}X₁′(X₁β₁ + X₂β₂)
      = x₁′β₁ + x₁′(X₁′X₁)^{−1}X₁′X₂β₂
      = x₁′β₁ + x₁′θ.

Thus ŷ₁ is a biased predictor of y. It is unbiased when X₁′X₂ = 0. The MSE of this predictor is

MSE(ŷ₁) = σ²[1 + x₁′(X₁′X₁)^{−1}x₁] + (x₁′θ − x₂′β₂)².

Also,

Var(ŷ) ≥ MSE(ŷ₁)

provided V(β̂₂) − β₂β₂′ is positive semidefinite.
7

2. Inclusion of irrelevant variables

Sometimes due to enthusiasm and to make the model more realistic, the analyst may include some explanatory variables
that are not very relevant to the model. Such variables may contribute very little to the explanatory power of the model.
This may tend to reduce the degrees of freedom (n - k) and consequently the validity of inference drawn may be
questionable. For example, the value of coefficient of determination will increase indicating that the model is getting better
which may not really be true.

Let the true model be


X β + ε , E (ε ) =
y= 0, V (ε ) =
σ 2I
which comprise k explanatory variable. Suppose now r additional explanatory variables are added to the model and
resulting model becomes
y = X β + Zγ + δ
where Z is a n × r matrix of n observations on each of the r explanatory variables and γ is r ×1 vector of regression
coefficient associated with Z and δ is disturbance term. This model is termed as false model.
Applying OLS to the false model, we get

[b_F]   [X′X  X′Z]⁻¹ [X′y]
[c_F] = [Z′X  Z′Z]   [Z′y]

or equivalently

X′Xb_F + X′Zc_F = X′y      (1)
Z′Xb_F + Z′Zc_F = Z′y      (2)

where b_F and c_F are the OLSEs of β and γ, respectively.


8
Premultiplying equation (2) by X′Z(Z′Z)^{−1}, we get

X′Z(Z′Z)^{−1}Z′Xb_F + X′Z(Z′Z)^{−1}Z′Zc_F = X′Z(Z′Z)^{−1}Z′y.      (3)

Subtracting equation (3) from equation (1), we get

[X′X − X′Z(Z′Z)^{−1}Z′X] b_F = X′y − X′Z(Z′Z)^{−1}Z′y
X′[I − Z(Z′Z)^{−1}Z′]X b_F = X′[I − Z(Z′Z)^{−1}Z′]y
⇒ b_F = (X′H_Z X)^{−1}X′H_Z y,  where H_Z = I − Z(Z′Z)^{−1}Z′.

The estimation error of b_F is

b_F − β = (X′H_Z X)^{−1}X′H_Z y − β
        = (X′H_Z X)^{−1}X′H_Z (Xβ + ε) − β
        = (X′H_Z X)^{−1}X′H_Z ε.

Thus

E(b_F − β) = (X′H_Z X)^{−1}X′H_Z E(ε) = 0,

so b_F is unbiased even when some irrelevant variables are added to the model.

The covariance matrix is

V(b_F) = E(b_F − β)(b_F − β)′
       = E[(X′H_Z X)^{−1}X′H_Z εε′H_Z X(X′H_Z X)^{−1}]
       = σ²(X′H_Z X)^{−1}X′H_Z I H_Z X(X′H_Z X)^{−1}
       = σ²(X′H_Z X)^{−1}.
9

If OLS is applied to true model, then the estimator of β is

bT = ( X ' X ) −1 X ' y
with
E (bT ) = β
V (bT ) = σ 2 ( X ' X ) −1.
To compare bF and bT we use the following result.

Result: If A and B are two positive definite matrices, then A − B is at least positive semidefinite if B^{−1} − A^{−1} is also
at least positive semidefinite.

Let

A = (X′H_Z X)^{−1},  B = (X′X)^{−1}.
B^{−1} − A^{−1} = X′X − X′H_Z X
               = X′X − X′X + X′Z(Z′Z)^{−1}Z′X
               = X′Z(Z′Z)^{−1}Z′X,

which is at least a positive semidefinite matrix. This implies that efficiency declines unless X′Z = 0. If X′Z = 0, i.e., X
and Z are orthogonal, then both estimators are equally efficient.

The residual sum of squares under false model is

SS res = eF' eF
10

where

e_F = y − Xb_F − Zc_F,
b_F = (X′H_Z X)^{−1}X′H_Z y,
c_F = (Z′Z)^{−1}Z′y − (Z′Z)^{−1}Z′Xb_F
    = (Z′Z)^{−1}Z′(y − Xb_F)
    = (Z′Z)^{−1}Z′[I − X(X′H_Z X)^{−1}X′H_Z]y
    = (Z′Z)^{−1}Z′H_ZX y,
H_Z = I − Z(Z′Z)^{−1}Z′,
H_ZX = I − X(X′H_Z X)^{−1}X′H_Z,
H_ZX² = H_ZX : idempotent.

So

e_F = y − X(X′H_Z X)^{−1}X′H_Z y − Z(Z′Z)^{−1}Z′H_ZX y
    = [I − X(X′H_Z X)^{−1}X′H_Z − Z(Z′Z)^{−1}Z′H_ZX] y
    = [H_ZX − (I − H_Z)H_ZX] y
    = H_Z H_ZX y
    = H*_ZX y,  where H*_ZX = H_Z H_ZX.
11

Thus

SS_res = e_F′e_F
       = y′H_ZX′H_Z H_ZX y
       = y′H_Z H_ZX y
       = y′H*_ZX y,
E(SS_res) = σ² tr(H*_ZX)
          = σ²(n − k − r),
E[SS_res/(n − k − r)] = σ².

So SS_res/(n − k − r) is an unbiased estimator of σ².

A comparison of exclusion and inclusion of variables is as follows:

                                          Exclusion type                    Inclusion type
Estimation of coefficients                Biased                            Unbiased
Efficiency                                Generally declines                Declines
Estimation of the disturbance variance    Over-estimate                     Unbiased
Conventional tests of hypothesis and      Invalid and faulty inferences     Valid though erroneous
confidence regions
ECONOMETRIC THEORY

MODULE – XII
Lecture – 32
Tests for Structural Change and
Stability
Dr. Shalabh
Department of Mathematics and Statistics
Indian Institute of Technology Kanpur
2
A fundamental assumption in a regression modeling is that the pattern of data on dependent and independent variables
remains the same throughout the period over which the data is collected. Under such an assumption, a single linear
regression model is fitted over the entire data set. The regression model is estimated and used for prediction assuming that
the parameters remain same over the entire time period of estimation and prediction. When it is suspected that there exist
a change in the pattern of data, then the fitting of single linear regression model may not be appropriate and more than one
regression models may be required to be fitted. Before taking such a decision to fit a single or more than one regression
models, a question arises how to test and decide if there is a change in the structure or pattern of data. Such changes can
be characterized by the change in the parameters of the model and are termed as structural change.

Now we consider some examples to understand the problem of structural change in the data. Suppose the data on the
consumption pattern is available for several years and suppose there was a war in between the years over which the
consumption data is available. Obviously, the consumption pattern before and after the war does not remain the same as the
economy of the country gets disturbed. So if a model
yi = β 0 + β1 X i1 + ... + β k X ik + ε i , i = 1, 2,..., n

is fitted then the regression coefficients before and after the war period will change. Such a change is referred to as
structural break or structural change in the data. A better option in this case would be to fit two different linear regression
models- one for the data before war and another for the data after war.

In an another example, suppose the study variable is the salary of a person and explanatory variable is the number of
years of schooling. Suppose the objective is to find if there is any discrimination in the salaries of males and females. To
know this, two different regression models can be fitted - one for male employees and another for female employees. By
calculating and comparing the regression coefficients of both the models, one can check the presence of sex
discrimination in the salaries of male and female employees.
3

Consider another example of structural change. Suppose an experiment is conducted to study certain objectives and data
is collected in USA and in India. Then a question arises whether the data sets from both the countries can be pooled
together or not. The data sets can be pooled if they originate from the same model in the sense that there is no structural
change present in the data. In such case, the presence of structural change in the data can be tested and if there is no
change, then both the data sets can be merged and single regression model can be fitted. If structural change is present,
then two models are needed to be fitted.

The objective is now how to test for the presence of structural change in the data and stability of regression coefficients. In
other words, we want to test the hypothesis that some of or all the regression coefficients differ in different subsets of
data.

Analysis

We consider here a situation where only one structural change is present in the data. The data in this case can be divided into
two parts. Suppose we have a data set of n observations which is divided into two parts consisting of n₁ and n₂ observations
such that n₁ + n₂ = n.

Consider the model

y = αℓ + Xβ + ε

where ℓ is an (n × 1) vector with all elements unity, α is a scalar denoting the intercept term, X is an (n × k) matrix of
observations on k explanatory variables, β is a (k × 1) vector of regression coefficients and ε is an (n × 1) vector of
disturbances.
4
Now partition ℓ, X and ε into two subgroups based on the n₁ and n₂ observations as

ℓ = [ℓ₁′  ℓ₂′]′,  X = [X₁′  X₂′]′,  ε = [ε₁′  ε₂′]′

where ℓ₁ is (n₁ × 1), ℓ₂ is (n₂ × 1), X₁ is (n₁ × k), X₂ is (n₂ × k), ε₁ is (n₁ × 1) and ε₂ is (n₂ × 1).

Based on this partition, the two models corresponding to the two subgroups are

y₁ = αℓ₁ + X₁β + ε₁
y₂ = αℓ₂ + X₂β + ε₂.

In matrix notation, we can write

Model (1):  [y₁]   [ℓ₁  X₁] [α]   [ε₁]
            [y₂] = [ℓ₂  X₂] [β] + [ε₂]

and term it as Model (1).

In this case, the intercept terms and regression coefficients remain the same for both submodels, so there is no structural
change in this situation.
The problem of structural change can be characterized if the intercept terms and/or regression coefficients in the submodels
are different.
If the structural change is caused by a change in the intercept terms only, then the situation is characterized by the
following model:

y₁ = α₁ℓ₁ + X₁β + ε₁
y₂ = α₂ℓ₂ + X₂β + ε₂

or

Model (2):  [y₁]   [ℓ₁  0   X₁] [α₁]   [ε₁]
            [y₂] = [0   ℓ₂  X₂] [α₂] + [ε₂].
                                [β ]
5
If the structural change is due to different intercept terms as well as different regression coefficients, then the model is

y₁ = α₁ℓ₁ + X₁β₁ + ε₁
y₂ = α₂ℓ₂ + X₂β₂ + ε₂

or

Model (3):  [y₁]   [ℓ₁  0   X₁  0 ] [α₁]   [ε₁]
            [y₂] = [0   ℓ₂  0   X₂] [α₂] + [ε₂].
                                    [β₁]
                                    [β₂]

The test of hypothesis related to structural change is conducted by testing any one of the following null hypotheses,
depending upon the situation:

(I)   H₀: α₁ = α₂
(II)  H₀: β₁ = β₂
(III) H₀: α₁ = α₂, β₁ = β₂.

To construct the test statistics, apply ordinary least squares estimation to models (1), (2) and (3) and obtain the residual sums
of squares RSS₁, RSS₂ and RSS₃ respectively.
Note that the degrees of freedom associated with
- RSS₁ from model (1) is n − (k + 1),
- RSS₂ from model (2) is n − (k + 1 + 1) = n − (k + 2),
- RSS₃ from model (3) is n − (k + 1 + k + 1) = n − 2(k + 1).
6

The null hypothesis H₀: α₁ = α₂, i.e., the hypothesis of no change in the intercept terms, is tested by the statistic

F = [(RSS₁ − RSS₂)/1] / [RSS₂/(n − k − 2)]

which follows F(1, n − k − 2) under H₀. This statistic tests α₁ = α₂ for model (2) using model (1), i.e., model (1) contrasted
with model (2).

The null hypothesis H₀: β₁ = β₂, i.e., the hypothesis of no change in the regression coefficients, is tested by

F = [(RSS₂ − RSS₃)/k] / [RSS₃/(n − 2k − 2)]

which follows F(k, n − 2k − 2) under H₀. This statistic tests β₁ = β₂ from model (3) using model (2), i.e., model (2)
contrasted with model (3).

The null hypothesis H₀: α₁ = α₂, β₁ = β₂, i.e., no change in the intercepts and the slope parameters, can be jointly
tested by the test statistic

F = [(RSS₁ − RSS₃)/(k + 1)] / [RSS₃/(n − 2k − 2)]

which follows F(k + 1, n − 2k − 2) under H₀. This statistic jointly tests α₁ = α₂ and β₁ = β₂ for model (3) using model (1),
i.e., model (1) contrasted with model (3). This test is known as the Chow test. It requires n₁ > k and n₂ > k for the stability
of the regression coefficients in the two models. The development of this test, which is based on the set-up of the
analysis of variance test, is as follows.
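A minimal sketch of the joint Chow test (NumPy assumed; chow_test is an illustrative helper; X₁ and X₂ are the subgroup design matrices, each including a column of ones, so p = k + 1):

import numpy as np

def chow_test(y1, X1, y2, X2):
    """Chow F statistic for H0: same intercept and slopes in both subgroups."""
    def rss(y, X):
        b, *_ = np.linalg.lstsq(X, y, rcond=None)
        return np.sum((y - X @ b) ** 2)
    n, p = len(y1) + len(y2), X1.shape[1]
    rss_pooled = rss(np.concatenate([y1, y2]), np.vstack([X1, X2]))   # model (1)
    rss_separate = rss(y1, X1) + rss(y2, X2)                          # model (3)
    return ((rss_pooled - rss_separate) / p) / (rss_separate / (n - 2 * p))

# Compare the returned value with the critical value of F(p, n - 2p) = F(k+1, n - 2k - 2).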
7
Development of Chow test
Consider the models

y₁ = X₁β₁ + ε₁     (i)     (y₁: n₁×1, X₁: n₁×p, β₁: p×1, ε₁: n₁×1)
y₂ = X₂β₂ + ε₂     (ii)    (y₂: n₂×1, X₂: n₂×p, β₂: p×1, ε₂: n₂×1)
y  = Xβ + ε        (iii)   (y: n×1, X: n×p, β: p×1, ε: n×1)
n = n₁ + n₂

where p = k + 1, which includes k explanatory variables and an intercept term.

Define

H₁ = I₁ − X₁(X₁′X₁)^{−1}X₁′
H₂ = I₂ − X₂(X₂′X₂)^{−1}X₂′
H  = I − X(X′X)^{−1}X′

where I₁ and I₂ are the identity matrices of orders (n₁ × n₁) and (n₂ × n₂). The residual sums of squares based on
models (i), (ii) and (iii) are obtained as

RSS₁ = ε₁′H₁ε₁
RSS₂ = ε₂′H₂ε₂
RSS  = ε′Hε.

Then define

H₁* = [H₁  0]       H₂* = [0  0 ]
      [0   0],            [0  H₂].
8

Both H₁* and H₂* are (n × n) matrices. Now RSS₁ and RSS₂ can be re-expressed as

RSS₁ = ε′H₁*ε
RSS₂ = ε′H₂*ε

where ε is the disturbance term related to model (iii) based on (n₁ + n₂) observations.

Note that H₁*H₂* = 0, which implies that RSS₁ and RSS₂ are independently distributed.

We can write

H = I − X(X′X)^{−1}X′
  = [I₁  0 ] − [X₁] (X′X)^{−1} (X₁′  X₂′)
    [0   I₂]   [X₂]
  = [H₁₁  H₁₂]
    [H₂₁  H₂₂]

where

H₁₁ = I₁ − X₁(X′X)^{−1}X₁′
H₁₂ = −X₁(X′X)^{−1}X₂′
H₂₁ = −X₂(X′X)^{−1}X₁′
H₂₂ = I₂ − X₂(X′X)^{−1}X₂′.
9

Define

H* = H₁* + H₂*

so that

RSS₃ = ε′H*ε   and   RSS₁ = ε′Hε,

where RSS₃ and RSS₁ now denote the residual sums of squares of Model (3) (separate fits for the two subgroups) and Model (1) (the pooled model), respectively.

Note that (H − H*) and H* are idempotent matrices, and (H − H*)H* = 0. First we see how this result holds.

Consider

(H − H₁*)H₁* = [H₁₁ − H₁   H₁₂] [H₁  0]
               [H₂₁        H₂₂] [0   0]
             = [(H₁₁ − H₁)H₁   0]
               [H₂₁H₁          0].

Since X₁′H₁ = 0, we have

H₂₁H₁ = 0,
H₁₁H₁ = H₁.

Also, since H₁ is idempotent, it follows that

(H₁₁ − H₁)H₁ = H₁ − H₁ = 0.

Thus (H − H₁*)H₁* = 0, or HH₁* = H₁*.
10
Similarly, it can be shown that

(H − H₂*)H₂* = 0, or HH₂* = H₂*.

Thus

(H − H₁* − H₂*)(H₁* + H₂*) = 0, or (H − H*)H* = 0.

Also, we have

tr H = n − p
tr H* = tr H₁* + tr H₂* = (n₁ − p) + (n₂ − p) = n − 2p = n − 2k − 2.

Hence (RSS₁ − RSS₃) and RSS₃ are independently distributed. Further,

(RSS₁ − RSS₃)/σ² ~ χ²_p,
RSS₃/σ² ~ χ²_{n−2p}.

Hence under the null hypothesis

[(RSS₁ − RSS₃)/p] / [RSS₃/(n − 2p)] ~ F(p, n − 2p)

or

[(RSS₁ − RSS₃)/(k + 1)] / [RSS₃/(n − 2k − 2)] ~ F(k + 1, n − 2k − 2).

Limitations of these tests

1. All the tests are based on the assumption that σ² remains the same. So first the stability of σ² should be checked and
then these tests can be used.
2. It is assumed in these tests that the point of change is exactly known. In practice, it is difficult to find such a point at
which the change occurs, and it is even more difficult when the change occurs slowly. These tests are not
applicable when the point of change is unknown. An ad hoc technique when the point of change is unknown is to
delete the data of the transition period.
3. When there is more than one point of structural change, the analysis becomes difficult.
ECONOMETRIC THEORY

MODULE – XIII
Lecture - 33
Asymptotic Theory and Stochastic
Regressors
Dr. Shalabh
Department of Mathematics and Statistics
Indian Institute of Technology Kanpur
2

The nature of explanatory variable is assumed to be non-stochastic or fixed in repeated samples in any regression analysis.

Such an assumption is appropriate for those experiments which are conducted inside the laboratories where the

experimenter can control the values of explanatory variables. Then the repeated observations on study variable can be

obtained for fixed values of explanatory variables. In practice, such an assumption may not always be satisfied.

Sometimes, the explanatory variables in a given model are the study variable in another model. Thus the study variable

depends on the explanatory variables that are stochastic in nature. Under such situations, the statistical inferences drawn

from the linear regression model based on the assumption of fixed explanatory variables may not remain valid.

We assume now that the explanatory variables are stochastic but uncorrelated with the disturbance term. In case, they are

correlated then the issue is addressed through instrumental variable estimation. Such a situation arises in the case of

measurement error models.


3

Stochastic regressors model


Consider the linear regression model
y Xβ +ε
=
where X is a (n x k) matrix of n observations on k explanatory variables X 1 , X 2 ,..., X k which are stochastic in nature, y
is a (n x 1) vector of n observations on study variable, β is a (k x 1) vector of regression coefficients and ε is the (n x 1)
V ( ε ) σ 2 I , the distribution of
E (ε ) 0,=
vector of disturbances. Under the assumption= ε i , conditional on xi' , satisfy
these properties for all values of X where xi' denotes the ith row of X. This is demonstrated as follows:

Let p (ε i | xi' ) be the conditional probability density function of εi given xi' and p ( ε i ) is the unconditional probability
density function of ε i . Then
E (ε i | xi' ) = ∫ ε i p (ε i | xi' ) d ε i

= ∫ ε i p (ε i ) d ε i

= E (ε i )
=0

E (ε i2 | xi' ) = ∫ ε i2 p (ε i | xi' ) d ε i

= ∫ ε i2 p ( ε i ) d ε i

= E (ε i2 )

= σ 2.
In case, ε i and xi
'
(
are independent, then p ε i | xi = p ( ε i ) .
'
)
4

Least squares estimation of parameters

The additional assumption that the explanatory variables are stochastic poses no problem in the ordinary least squares
estimation of β and σ². The OLSE of β is obtained by minimizing (y − Xβ)′(y − Xβ) with respect to β as

b = (X′X)^{−1}X′y

and the estimator of σ² is obtained as

s² = (y − Xb)′(y − Xb)/(n − k).
5

Maximum likelihood estimation of parameters

Assuming ε ~ N(0, σ²I) in the model y = Xβ + ε, with X stochastic and independent of ε, the joint probability
density function of ε and X can be derived from the joint probability density function of y and X as follows:

f(ε, X) = f(ε₁, ε₂, ..., εₙ, x₁′, x₂′, ..., xₙ′)
        = [∏_{i=1}^{n} f(ε_i)] [∏_{i=1}^{n} f(x_i′)]
        = [∏_{i=1}^{n} f(y_i | x_i′)] [∏_{i=1}^{n} f(x_i′)]
        = ∏_{i=1}^{n} f(y_i | x_i′) f(x_i′)
        = ∏_{i=1}^{n} f(y_i, x_i′)
        = f(y₁, y₂, ..., yₙ, x₁′, x₂′, ..., xₙ′)
        = f(y, X).

This implies that the maximum likelihood estimators of β and σ² will be based on

∏_{i=1}^{n} f(y_i | x_i′) = ∏_{i=1}^{n} f(ε_i),

so they will be the same as those based on the assumption that the ε_i's, i = 1, 2, ..., n, are distributed as N(0, σ²). So the
maximum likelihood estimators of β and σ² when the explanatory variables are stochastic are obtained as

β̃ = (X′X)^{−1}X′y
σ̃² = (1/n)(y − Xβ̃)′(y − Xβ̃).
6
Alternative approach for deriving the maximum likelihood estimates

Alternatively, the maximum likelihood estimators of β and σ² can also be derived using the joint probability density
function of y and X.
Let x_i′, i = 1, 2, ..., n, be from a multivariate normal distribution with mean vector μ_x and covariance matrix Σ_xx, i.e.,
x_i′ ~ N(μ_x, Σ_xx), and let the joint distribution of y and x_i′ be

[y  ]       [μ_y]  [σ_yy  Σ_yx]
[x_i′] ~ N ([μ_x], [Σ_xy  Σ_xx]).

Let the linear regression model be

y_i = β₀ + x_i′β₁ + ε_i,  i = 1, 2, ..., n,

where x_i′ is a [1 × (k − 1)] vector of observations on the random vector x, β₀ is the intercept term and β₁ is the (k − 1) × 1
vector of regression coefficients. Further, ε_i is the disturbance term with ε_i ~ N(0, σ²) and is independent of x′.

Note: the vector x′ in this section has order [1 × (k − 1)], which excludes the intercept term.

Suppose

[y]      [μ_y]  [σ_yy  Σ_yx]
[x] ~ N ([μ_x], [Σ_xy  Σ_xx]).

The joint probability density function of (y, x′) based on a random sample of size n is

f(y, x′) = [(2π)^{k/2} |Σ|^{1/2}]^{−1} exp[ −(1/2) (y − μ_y, x − μ_x)′ Σ^{−1} (y − μ_y, x − μ_x) ].

Now, using the following result, we find Σ^{−1}:
7

Result:

Let A be a nonsingular matrix which is partitioned suitably as

A = [B  C]
    [D  E],

where E and F = B − CE^{−1}D are nonsingular matrices. Then

A^{−1} = [F^{−1}            −F^{−1}CE^{−1}               ]
         [−E^{−1}DF^{−1}     E^{−1} + E^{−1}DF^{−1}CE^{−1}].

Note that AA^{−1} = A^{−1}A = I. Thus

Σ^{−1} = (1/σ²) [1                  −Σ_yx Σ_xx^{−1}                          ]
                [−Σ_xx^{−1}Σ_xy      σ²Σ_xx^{−1} + Σ_xx^{−1}Σ_xy Σ_yx Σ_xx^{−1}]

where

σ² = σ_yy − Σ_yx Σ_xx^{−1} Σ_xy.

Then

f(y, x′) = [(2π)^{k/2} |Σ|^{1/2}]^{−1} exp[ −(1/(2σ²)) { [y − μ_y − (x − μ_x)′Σ_xx^{−1}Σ_xy]² + σ²(x − μ_x)′Σ_xx^{−1}(x − μ_x) } ].

The marginal distribution of x′ is obtained by integrating f(y, x′) over y, and the resulting distribution is a (k − 1)-variate
multivariate normal distribution with density

g(x′) = [(2π)^{(k−1)/2} |Σ_xx|^{1/2}]^{−1} exp[ −(1/2)(x − μ_x)′Σ_xx^{−1}(x − μ_x) ].
8
The conditional probability density function of y given x′ is

f(y | x′) = f(y, x′)/g(x′)
          = (2πσ²)^{−1/2} exp[ −(1/(2σ²)) { (y − μ_y) − (x − μ_x)′Σ_xx^{−1}Σ_xy }² ],

which is the probability density function of a normal distribution with

• conditional mean
E(y | x′) = μ_y + (x − μ_x)′Σ_xx^{−1}Σ_xy, and
• conditional variance
Var(y | x′) = σ_yy(1 − ρ²)

where

ρ² = Σ_yx Σ_xx^{−1} Σ_xy / σ_yy

is the population multiple correlation coefficient.

In the model

y = β₀ + x′β₁ + ε,

the conditional mean is

E(y | x′) = β₀ + x′β₁ + E(ε | x) = β₀ + x′β₁.
9

Comparing this conditional mean with the conditional mean of the normal distribution, we obtain the relationships for β₀ and β₁
as follows:

β₁ = Σ_xx^{−1}Σ_xy
β₀ = μ_y − μ_x′β₁.

The likelihood function of (y, x′) based on a sample of size n is

L = [(2π)^{nk/2} |Σ|^{n/2}]^{−1} exp[ −(1/2) Σ_{i=1}^{n} (y_i − μ_y, x_i − μ_x)′ Σ^{−1} (y_i − μ_y, x_i − μ_x) ].

Maximizing the log-likelihood function with respect to μ_y, μ_x, Σ_xx and Σ_xy, the maximum likelihood estimates of the respective
parameters are obtained as

μ̂_y = ȳ = (1/n) Σ_{i=1}^{n} y_i
μ̂_x = x̄ = (1/n) Σ_{i=1}^{n} x_i = (x̄₂, x̄₃, ..., x̄_k)
Σ̂_xx = S_xx = (1/n) [ Σ_{i=1}^{n} x_i x_i′ − n x̄ x̄′ ]
Σ̂_xy = S_xy = (1/n) [ Σ_{i=1}^{n} x_i y_i − n x̄ ȳ ]

where x_i′ = (x_{i2}, x_{i3}, ..., x_{ik}), S_xx is a [(k − 1) × (k − 1)] matrix with elements (1/n) Σ_t (x_{ti} − x̄_i)(x_{tj} − x̄_j),
and S_xy is a [(k − 1) × 1] vector with elements (1/n) Σ_t (x_{ti} − x̄_i)(y_t − ȳ).
10

Based on these estimates, the maximum likelihood estimators of β₀ and β₁ are obtained as

β̂₁ = S_xx^{−1}S_xy
β̂₀ = ȳ − x̄′β̂₁,

which can be written jointly as

β̂ = [β̂₀  β̂₁′]′ = (X′X)^{−1}X′y.

Properties of the least squares estimator

The estimation error of the OLSE b = (X′X)^{−1}X′y of β is

b − β = (X′X)^{−1}X′y − β
      = (X′X)^{−1}X′(Xβ + ε) − β
      = (X′X)^{−1}X′ε.

Then, assuming that E[(X′X)^{−1}X′] exists, we have

E(b − β) = E[(X′X)^{−1}X′ε]
         = E[ E{(X′X)^{−1}X′ε | X} ]
         = E[(X′X)^{−1}X′ E(ε)]
         = 0

because (X′X)^{−1}X′ and ε are independent. So b is an unbiased estimator of β.
11
The covariance matrix of b is obtained as

V(b) = E(b − β)(b − β)′
     = E[(X′X)^{−1}X′εε′X(X′X)^{−1}]
     = E[ E{(X′X)^{−1}X′εε′X(X′X)^{−1} | X} ]
     = E[(X′X)^{−1}X′E(εε′)X(X′X)^{−1}]
     = E[(X′X)^{−1}X′σ²X(X′X)^{−1}]
     = σ² E[(X′X)^{−1}].

Thus the covariance matrix involves a mathematical expectation. The unknown σ² can be estimated by

σ̂² = e′e/(n − k) = (y − Xb)′(y − Xb)/(n − k)

where e = y − Xb is the residual, and

E(σ̂²) = E[ E(σ̂² | X) ]
       = E[ E{e′e/(n − k) | X} ]
       = E(σ²)
       = σ².
12

Note that the OLSE b = (X′X)^{−1}X′y involves the stochastic matrix X and the stochastic vector y, so b is not a linear
estimator. It is also no longer the best linear unbiased estimator of β, as it is in the case when X is nonstochastic. The
estimator of σ², being conditional on the given X, is an efficient estimator.


ECONOMETRIC THEORY

MODULE – XIII
Lecture - 34
Asymptotic Theory and Stochastic
Regressors
Dr. Shalabh
Department of Mathematics and Statistics
Indian Institute of Technology Kanpur
2
Asymptotic theory
The asymptotic properties of an estimator concern the behaviour of the estimator when the sample size n grows large.

For the need and understanding of asymptotic theory, we consider an example. Consider the simple linear regression
model with one explanatory variable and n observations,

y_i = β₀ + β₁x_i + ε_i,  E(ε_i) = 0, Var(ε_i) = σ², i = 1, 2, ..., n.

The OLSE of β₁ is

b₁ = Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) / Σ_{i=1}^{n} (x_i − x̄)²

and its variance is

Var(b₁) = σ² / Σ_{i=1}^{n} (x_i − x̄)².

If the sample size grows large, then the variance of b₁ gets smaller. This shrinkage in variance implies that as the sample
size n increases, the probability density of the OLSE b₁ collapses around its mean, because Var(b₁) tends to zero.

Let there be three OLSEs b₁, b₂ and b₃ which are based on sample sizes n₁, n₂ and n₃ respectively, such that
n₁ < n₂ < n₃, say. If c and δ are some arbitrarily chosen positive constants, then the probability that the value of b
lies within the interval β ± c can be made greater than (1 − δ) for a large value of n. This property is the
consistency of b, which ensures that even with a very large sample we can be confident with high probability that
b will yield an estimate that is close to β.
3

Probability in limit
Let β̂ₙ be an estimator of β based on a sample of size n, and let γ be any small positive constant. Then for large n, the
requirement that β̂ₙ takes values with probability almost one in an arbitrarily small neighborhood of the true parameter
value β is

lim_{n→∞} P[ |β̂ₙ − β| < γ ] = 1,

which is denoted as

plim β̂ₙ = β,

and it is said that β̂ₙ converges to β in probability. The estimator β̂ₙ is then said to be a consistent estimator of β.
A sufficient but not necessary condition for β̂ₙ to be a consistent estimator of β is that

lim_{n→∞} E(β̂ₙ) = β  and  lim_{n→∞} Var(β̂ₙ) = 0.
4

Consistency of estimators

Now we look at the consistency of the estimators of β and σ².

(i) Consistency of b
Under the assumption that lim_{n→∞} (X′X/n) = Δ exists as a nonstochastic and nonsingular matrix (with finite elements), we
have

lim_{n→∞} V(b) = σ² lim_{n→∞} (1/n) (X′X/n)^{−1}
               = σ² lim_{n→∞} (1/n) Δ^{−1}
               = 0.

This implies that the OLSE converges to β in quadratic mean.
Thus the OLSE is a consistent estimator of β. This holds true for the maximum likelihood estimator also.

The same conclusion can also be reached using the concept of convergence in probability.

The consistency of the OLSE can be obtained under the following weaker assumptions:

(i) plim (X′X/n) = Δ* exists and is a nonsingular and nonstochastic matrix.
(ii) plim (X′ε/n) = 0.
5

Since
b−β =
( X ' X ) −1 X ' ε
−1
 X ' X  X 'ε
=  .
 n  n
So
−1
 X 'X   X 'ε 
plim(b − β ) = plim   plim  
 n   n 
= ∆*−1.0
= 0.
Thus b is a consistent estimator of β . The same is true for maximum likelihood estimator also.
6

(ii) Consistency of s²

Now we look at the consistency of s² as an estimator of σ². We have

s² = e′e/(n − k)
   = ε′Hε/(n − k)
   = (1 − k/n)^{−1} (1/n) [ε′ε − ε′X(X′X)^{−1}X′ε]
   = (1 − k/n)^{−1} [ ε′ε/n − (ε′X/n)(X′X/n)^{−1}(X′ε/n) ].

Note that ε′ε/n = (1/n) Σ_{i=1}^{n} ε_i², and {ε_i², i = 1, 2, ..., n} is a sequence of independently and identically
distributed random variables with mean σ². Using the law of large numbers,

plim (ε′ε/n) = σ²,
plim [ (ε′X/n)(X′X/n)^{−1}(X′ε/n) ] = plim(ε′X/n) · [plim(X′X/n)]^{−1} · plim(X′ε/n)
                                    = 0 · Δ*^{−1} · 0
                                    = 0,
⇒ plim(s²) = (1 − 0)^{−1} [σ² − 0] = σ².

Thus s² is a consistent estimator of σ². The same holds true for the maximum likelihood estimator also.
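A short simulation sketch (NumPy assumed; the data-generating values are illustrative) showing that both b and s² approach the true values as n grows:

import numpy as np

rng = np.random.default_rng(3)
beta = np.array([1.0, 2.0])
for n in (50, 500, 5000, 50000):
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    y = X @ beta + rng.normal(size=n)          # true sigma^2 = 1
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b
    s2 = e @ e / (n - 2)
    print(n, np.abs(b - beta).max(), abs(s2 - 1.0))   # both errors shrink with n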
7

Asymptotic distributions

Suppose we have a sequence of random variables {αₙ} with a corresponding sequence of cumulative distribution functions
{Fₙ}, and a random variable α with cumulative distribution function F. Then αₙ converges in distribution to α if Fₙ
converges to F pointwise. In this case, F is called the asymptotic distribution of αₙ.

Note that since convergence in probability implies convergence in distribution,

plim αₙ = α  ⇒  αₙ →ᴰ α

(αₙ tends to α in distribution), i.e., the asymptotic distribution of αₙ is F, which is the distribution of α.
Note that

E(α) : mean of the asymptotic distribution
Var(α) : variance of the asymptotic distribution
lim_{n→∞} E(αₙ) : asymptotic mean
lim_{n→∞} E[αₙ − lim_{n→∞} E(αₙ)]² : asymptotic variance.
8

Asymptotic distribution of the sample mean and the least squares estimator

Let αₙ = Ȳₙ = (1/n) Σ_{i=1}^{n} Y_i be the sample mean based on a sample of size n. Since the sample mean is a consistent estimator
of the population mean Ȳ, we have

plim Ȳₙ = Ȳ,

which is a constant. Thus the asymptotic distribution of Ȳₙ is the distribution of a constant. This is not a regular distribution,
as all the probability mass is concentrated at one point. Thus as the sample size increases, the distribution of Ȳₙ collapses.

Suppose we consider only one third of the observations in the sample and find the sample mean

Ȳₙ* = (3/n) Σ_{i=1}^{n/3} Y_i.

Then

E(Ȳₙ*) = Ȳ

and

Var(Ȳₙ*) = (9/n²) Σ_{i=1}^{n/3} Var(Y_i)
         = (9/n²)(n/3)σ²
         = 3σ²/n
         → 0 as n → ∞.
9

Thus plim Ȳₙ* = Ȳ, and Ȳₙ* has the same degenerate asymptotic distribution as Ȳₙ. Since Var(Ȳₙ*) > Var(Ȳₙ), Ȳₙ is
preferred over Ȳₙ*.

Now we observe the asymptotic behaviour of Ȳₙ and Ȳₙ*. Consider the sequences of random variables

αₙ  = √n (Ȳₙ − Ȳ),
αₙ* = √n (Ȳₙ* − Ȳ).

Thus for all n, we have

E(αₙ)  = √n E(Ȳₙ − Ȳ)  = 0,
E(αₙ*) = √n E(Ȳₙ* − Ȳ) = 0,
Var(αₙ)  = n E(Ȳₙ − Ȳ)²  = n (σ²/n)  = σ²,
Var(αₙ*) = n E(Ȳₙ* − Ȳ)² = n (3σ²/n) = 3σ².

Assuming the population to be normal, the asymptotic distribution of
• αₙ = √n(Ȳₙ − Ȳ) is N(0, σ²),
• αₙ* = √n(Ȳₙ* − Ȳ) is N(0, 3σ²).
10

So now, on asymptotic grounds as well, Ȳₙ is preferable over Ȳₙ*. The central limit theorem can be used to show that αₙ has an
asymptotically normal distribution even if the population is not normally distributed.
Also, since

√n (Ȳₙ − Ȳ) ~ N(0, σ²)
⇒ Z = √n (Ȳₙ − Ȳ)/σ ~ N(0, 1),

and this statement holds true for finite-sample as well as asymptotic distributions.

Consider the ordinary least squares estimator b = (X′X)^{−1}X′y of β in the linear regression model y = Xβ + ε. If X is
nonstochastic, then the finite-sample covariance matrix of b is

V(b) = σ²(X′X)^{−1}.

The asymptotic covariance matrix of b, under the assumption that lim_{n→∞} (X′X/n) = Σ_xx exists and is nonsingular, is given
by

σ² lim_{n→∞} (X′X)^{−1} = σ² lim_{n→∞} (1/n) · lim_{n→∞} (X′X/n)^{−1}
                        = σ² · 0 · Σ_xx^{−1}
                        = 0,

which is a null matrix.
11

Consider the asymptotic distribution of √n (b − β). Even if ε is not necessarily normally distributed, asymptotically

√n (b − β) ~ N(0, σ² Σxx⁻¹)

n (b − β)' Σxx (b − β) / σ² ~ χ²(k).

If X'X/n is considered as an estimator of Σxx, then

n (b − β)' (X'X/n) (b − β) / σ² = (b − β)' X'X (b − β) / σ²

is the usual test statistic, as in the case of finite samples with b ~ N(β, σ²(X'X)⁻¹).
ECONOMETRIC THEORY

MODULE – XIV
Lecture - 35
Stein-Rule Estimation

Dr. Shalabh
Department of Mathematics and Statistics
Indian Institute of Technology Kanpur
2

The ordinary least squares estimation of regression coefficients in linear regression model provides the estimators having

minimum variance in the class of linear and unbiased estimators.

The criterion of linearity is desirable because such estimators involve less mathematical complexity, they are easy to

compute and it is easier to investigate their statistical properties.

The criterion of unbiasedness is attractive because it is intuitively desirable to have an estimator whose expected value,

i.e., the mean of the estimator, is the same as the parameter being estimated.

Considerations of linearity and unbiasedness may sometimes lead to an unacceptably high price being paid in terms

of the variability of the estimator around the true parameter. To improve upon such an estimator, one possibility is to consider

a nonlinear and biased estimator which may have better statistical properties.

It is to be noted that one of the main objectives of estimation is to find an estimator whose values have high concentration

around the true parameter. Sometimes it is possible to have a nonlinear and biased estimator that has smaller variability

than the variability of best linear unbiased estimator of the parameter under some mild restrictions.
3

In the multiple regression model

y = Xβ + ε,   E(ε) = 0,   V(ε) = σ²I,

with y of order (n x 1), X of order (n x k), β of order (k x 1) and ε of order (n x 1), the ordinary least squares estimator (OLSE) of β is b = (X'X)⁻¹X'y, which is the best linear unbiased estimator of β in the sense that it is linear in y, E(b) = β and b has the smallest variance among all linear and unbiased estimators of β. Its covariance matrix is

V(b) = E(b − β)(b − β)' = σ²(X'X)⁻¹.

The weighted mean squared error of any estimator β̂ is defined as

E(β̂ − β)'W(β̂ − β) = Σi Σj wij E(β̂i − βi)(β̂j − βj)

where W is a (k x k) fixed positive definite matrix of weights wij. Two popular choices of the weight matrix W are:

(i) W is an identity matrix, i.e., W = I; then E(β̂ − β)'(β̂ − β) is called the total mean squared error (MSE) of β̂.
(ii) W = X'X; then

E(β̂ − β)' X'X (β̂ − β) = E(Xβ̂ − Xβ)'(Xβ̂ − Xβ)

is called the predictive mean squared error of β̂. Note that Xβ̂ is the predictor of the average value E(y) = Xβ and (Xβ̂ − Xβ) is the corresponding prediction error.

There can be other choices of W, and it depends entirely on the analyst how to define the loss function so that the variability is minimum.
4

If a random vector with k elements (k > 2) is normally distributed as N ( µ , I ) , µ being the mean vector, then Stein

established that if the linearity and unbiasedness are dropped, then it is possible to improve upon the maximum likelihood

estimator of µ under the criterion of total MSE.

Later, this result was generalized by James and Stein for linear regression model. They demonstrated that if the criteria of

linearity and unbiasedness of the estimators are dropped, then a nonlinear estimator can be obtained which has better

performance than the best linear unbiased estimator under the criterion of predictive MSE. In other words, James and

Stein established that OLSE is inadmissible for k > 2 under predictive MSE criterion, i.e., for k > 2, there exists an

estimator β̂ such that

E(β̂ − β)' X'X (β̂ − β) ≤ E(b − β)' X'X (b − β)

for all values of β, with strict inequality holding for some values of β; we then say that "b can be beaten" in this sense. For k ≤ 2, no such estimator exists. Thus, for k > 2, it is possible to find estimators which beat b, i.e., a nonlinear

and biased estimator can be defined which has better performance than the OLSE.

Such an estimator is given by the family of Stein-rule estimators.


5
The family of Stein-rule estimators is given by

β̂ = [1 − c σ² / (b'X'Xb)] b      when σ² is known

and

β̂ = [1 − c e'e / (b'X'Xb)] b     when σ² is unknown.

Here c is a fixed positive characterizing scalar, e'e is the residual sum of squares based on the OLSE and e = y − Xb is the residual vector. By assigning different values to c, we can generate different estimators. So a class of estimators characterized by c can be defined.

Let

δ = 1 − c σ² / (b'X'Xb)

be a scalar quantity. Then
β̂ = δ b.

So essentially we say that instead of estimating β1, β2, ..., βk by b1, b2, ..., bk, we estimate them by δb1, δb2, ..., δbk, respectively. So, in order to increase the efficiency, the OLSE is multiplied by a constant δ. Thus δ is called the shrinkage factor. As the Stein-rule estimators attempt to shrink the components of b towards zero, these estimators are known as shrinkage estimators.
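A direct implementation of this family is straightforward. The sketch below is an illustration, not code from the notes; it computes the OLSE and the Stein-rule estimate for the unknown-σ² case, and the default c = (k − 2)/(n − k + 2) anticipates the optimum value quoted at the end of this lecture (valid for k > 2).

```python
# Sketch of the Stein-rule estimator beta_hat = [1 - c*e'e/(b'X'Xb)] * b (sigma^2 unknown).
import numpy as np

def stein_rule(X, y, c=None):
    """Return (OLS estimate b, Stein-rule estimate beta_hat)."""
    n, k = X.shape
    b, *_ = np.linalg.lstsq(X, y, rcond=None)        # OLSE b = (X'X)^{-1} X'y
    e = y - X @ b                                    # residual vector
    if c is None:
        c = (k - 2) / (n - k + 2)                    # optimum c quoted later in the lecture (k > 2)
    shrink = 1.0 - c * (e @ e) / (b @ X.T @ X @ b)   # shrinkage factor delta
    return b, shrink * b
```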

First we discuss a result which is used to prove the dominance of Stein-rule estimator over OLSE.
6

Result: Suppose a random vector Z of order (k x 1) is normally distributed as N(µ, I), where µ is the mean vector and I is the covariance matrix. Then

E[ Z'(Z − µ) / (Z'Z) ] = (k − 2) E[ 1 / (Z'Z) ].

An important point to be noted in this result is that the left-hand side involves µ explicitly, whereas the right-hand side does not.

Now we consider the Stein-rule estimator β̂ when σ² is known. Note that

E(β̂) = E(b) − cσ² E[ b / (b'X'Xb) ]
     = β − cσ² E[ b / (b'X'Xb) ],

where the second term is, in general, a non-zero quantity, so E(β̂) ≠ β in general.

Thus the Stein-rule estimator is biased, while the OLSE b is unbiased for β.

The predictive risks of b and β̂ are

PR(b) = E(b − β)' X'X (b − β)
PR(β̂) = E(β̂ − β)' X'X (β̂ − β).

The Stein-rule estimator β̂ is better than the OLSE b under the criterion of predictive risk if

PR(β̂) < PR(b).
7
Solving the expressions, we get

b − β = (X'X)⁻¹X'ε

PR(b) = E[ε'X(X'X)⁻¹ X'X (X'X)⁻¹X'ε]
      = E[ε'X(X'X)⁻¹X'ε]
      = E[tr{(X'X)⁻¹X'εε'X}]
      = tr{(X'X)⁻¹X' E(εε') X}
      = σ² tr{(X'X)⁻¹X'X}
      = σ² tr(Ik)
      = σ²k.

PR(β̂) = E[ {(b − β) − cσ² b/(b'X'Xb)}' X'X {(b − β) − cσ² b/(b'X'Xb)} ]
      = E[(b − β)'X'X(b − β)] − E[ (cσ²/(b'X'Xb)) {b'X'X(b − β) + (b − β)'X'Xb} ]
        + E[ (c²σ⁴/(b'X'Xb)²) b'X'Xb ]
      = σ²k − 2 E[ cσ² (b − β)'X'Xb / (b'X'Xb) ] + E[ c²σ⁴ / (b'X'Xb) ].
8

Suppose

Z = (1/σ)(X'X)^(1/2) b,   or   b = σ(X'X)^(−1/2) Z
µ = (1/σ)(X'X)^(1/2) β,   or   β = σ(X'X)^(−1/2) µ

and Z ~ N(µ, I), i.e., Z1, Z2, ..., Zk are independent. Substituting these values in the expression for PR(β̂), we get

PR(β̂) = σ²k − 2 E[ cσ² · σ²(Z − µ)'Z / (σ²Z'Z) ] + E[ c²σ⁴ / (σ²Z'Z) ]
      = σ²k − 2cσ² E[ Z'(Z − µ) / (Z'Z) ] + c²σ² E[ 1/(Z'Z) ]
      = σ²k − cσ² [2(k − 2) − c] E[ 1/(Z'Z) ]          (using the Result)
      = PR(b) − cσ² [2(k − 2) − c] E[ 1/(Z'Z) ].
9

Thus

PR(β̂) < PR(b)

if and only if

cσ² [2(k − 2) − c] E[ 1/(Z'Z) ] > 0.

Since Z ~ N(µ, I) and σ² > 0, Z'Z has a non-central chi-square distribution. Thus

E[ 1/(Z'Z) ] > 0
⇒ c [2(k − 2) − c] > 0.

Since c > 0 is assumed, this inequality holds true when

2(k − 2) − c > 0, i.e., 0 < c < 2(k − 2), provided k > 2.

So as long as 0 < c < 2(k − 2) is satisfied, the Stein-rule estimator will have smaller predictive risk than the OLSE.

This inequality is not satisfied for k = 1 and k = 2. The largest gain in efficiency arises when c = k − 2. So if the number of explanatory variables is more than two, then it is always possible to construct an estimator which is better than the OLSE.
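The dominance condition can be illustrated by a small Monte-Carlo sketch. Everything below (the design matrix, β, σ² and the replication count) is an arbitrary illustrative choice, not an example from the notes; it estimates the predictive risks of b and of the Stein-rule estimator with c = k − 2 when σ² is known.

```python
# Monte-Carlo comparison of predictive risks of OLSE and Stein-rule (known sigma^2).
import numpy as np

rng = np.random.default_rng(2)
n, k, sigma2 = 30, 5, 4.0
X = rng.normal(size=(n, k))
beta = np.full(k, 0.3)                  # a point with small ||beta||, where gains tend to be large
c = k - 2
XtX = X.T @ X

pr_ols, pr_stein, reps = 0.0, 0.0, 5000
for _ in range(reps):
    y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)
    b = np.linalg.solve(XtX, X.T @ y)
    bh = (1.0 - c * sigma2 / (b @ XtX @ b)) * b      # Stein rule with sigma^2 known
    pr_ols += (b - beta) @ XtX @ (b - beta)
    pr_stein += (bh - beta) @ XtX @ (bh - beta)

print(pr_ols / reps, pr_stein / reps)   # first value near sigma^2 * k = 20; second is smaller
```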
10

When σ² is unknown, it can be shown that the Stein-rule estimator

β̂ = [1 − c e'e / (b'X'Xb)] b

is better than the OLSE b if and only if

0 < c < 2(k − 2)/(n − k + 2);   k > 2.

The optimum choice of c, giving the largest gain in efficiency, is

c = (k − 2)/(n − k + 2).
ECONOMETRIC THEORY

MODULE – XV
Lecture - 36
Instrumental Variables Estimation

Dr. Shalabh
Department of Mathematics and Statistics
Indian Institute of Technology Kanpur
2
A basic assumption in analyzing the performance of estimators in multiple regression is that the explanatory variables and the disturbance terms are independently distributed. The violation of such an assumption disturbs the optimal properties of the estimators. The instrumental variable estimation method helps in estimating the regression coefficients in the multiple linear regression model when such a violation occurs.

Consider the multiple linear regression model

y = Xβ + ε

where y is an (n x 1) vector of observations on the study variable, X is an (n x k) matrix of observations on X1, X2, ..., Xk, β is a (k x 1) vector of regression coefficients and ε is an (n x 1) vector of disturbances.

Suppose one or more of the explanatory variables is correlated with the disturbances in the limit, so that

plim(X'ε/n) ≠ 0.

The consequences of such an assumption for the ordinary least squares estimator are as follows:

b = (X'X)⁻¹X'y
  = (X'X)⁻¹X'(Xβ + ε)
b − β = (X'X)⁻¹X'ε
      = (X'X/n)⁻¹ (X'ε/n)

plim(b − β) = plim(X'X/n)⁻¹ plim(X'ε/n)
            ≠ 0,

assuming plim(X'X/n) = ΣXX exists and is nonsingular.

Consequently plim b ≠ β, and thus the OLSE becomes an inconsistent estimator of β.


3
To overcome this problem and to obtain a consistent estimator of β, the instrumental variable estimation can be used.

Consider the model

y = Xβ + ε   with   plim(X'ε/n) ≠ 0.

Suppose that it is possible to find a data matrix Z of order (n x k) with the following properties:

(i) plim(Z'X/n) = ΣZX is a finite and nonsingular matrix of full rank. This means that the variables in Z are correlated with those in X, in the limit.
(ii) plim(Z'ε/n) = 0, i.e., the variables in Z are uncorrelated with ε, in the limit.
(iii) plim(Z'Z/n) = ΣZZ exists.

Thus the Z-variables are postulated to be
 uncorrelated with ε, in the limit, and
 to have a nonzero cross product with X.

Such variables are called instrumental variables.

If some of the X-variables are likely to be uncorrelated with ε, then these can be used to form some of the columns of Z, and extraneous variables are found only for the remaining columns.
4

First we understand the role of the term X'ε in the OLS estimation. The OLSE b of β is derived by solving the equation

∂(y − Xβ)'(y − Xβ)/∂β = 0
or  X'y = X'Xb
or  X'(y − Xb) = 0.

Now we look at this normal equation as if it is obtained by pre-multiplying y = Xβ + ε by X' as

X'y = X'Xβ + X'ε

where the term X'ε is dropped and β is replaced by b.

The disappearance of X'ε can be explained when X and ε are uncorrelated as follows:

X'y = X'Xβ + X'ε
X'y/n = (X'X/n)β + X'ε/n
plim(X'y/n) = plim(X'X/n) β + plim(X'ε/n)
⇒ β = [plim(X'X/n)]⁻¹ [plim(X'y/n) − plim(X'ε/n)].
5

Let

plim(X'X/n) = ΣXX
plim(X'y/n) = ΣXy

where the population cross moments ΣXX and ΣXy are finite, and ΣXX is nonsingular.

If X and ε are uncorrelated, so that

plim(X'ε/n) = 0,

then

β = ΣXX⁻¹ ΣXy.

If ΣXX is estimated by the sample cross moment X'X/n and ΣXy is estimated by the sample cross moment X'y/n, then the OLS estimator of β is obtained as

b = (X'X/n)⁻¹ (X'y/n) = (X'X)⁻¹X'y.

Such an analysis suggests using Z to pre-multiply the multiple regression model as follows:

Z'y = Z'Xβ + Z'ε
Z'y/n = (Z'X/n)β + Z'ε/n
plim(Z'y/n) = plim(Z'X/n) β + plim(Z'ε/n)
⇒ β = [plim(Z'X/n)]⁻¹ [plim(Z'y/n) − plim(Z'ε/n)]
    = ΣZX⁻¹ ΣZy.

Substituting the sample cross moments

Z'X/n for ΣZX and
Z'y/n for ΣZy,

the following instrumental variable estimator of β is obtained:

β̂IV = (Z'X)⁻¹Z'y,

which is termed the instrumental variable estimator of β, and this method is called the instrumental variable method.
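As an illustration (the data-generating process below is an arbitrary choice, not an example from the notes), the following sketch compares the OLSE with β̂IV = (Z'X)⁻¹Z'y when one regressor is correlated with the disturbance.

```python
# Sketch: OLS vs. instrumental variable estimation with an endogenous regressor.
import numpy as np

rng = np.random.default_rng(3)
n, beta = 20000, np.array([1.0, 2.0])

z = rng.normal(size=n)                         # instrument: correlated with x, independent of eps
eps = rng.normal(size=n)
x = 0.8 * z + 0.6 * eps + rng.normal(size=n)   # endogenous regressor (correlated with eps)
y = beta[0] + beta[1] * x + eps

X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), z])           # the intercept instruments itself

b_ols = np.linalg.solve(X.T @ X, X.T @ y)      # inconsistent here
b_iv = np.linalg.solve(Z.T @ X, Z.T @ y)       # (Z'X)^{-1} Z'y, consistent
print(b_ols, b_iv)   # the OLS slope is pushed away from 2; the IV slope is close to 2
```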
7

Since

β̂IV − β = (Z'X)⁻¹Z'(Xβ + ε) − β
        = (Z'X)⁻¹Z'ε,

plim(β̂IV − β) = plim[(Z'X)⁻¹Z'ε]
             = plim[(Z'X/n)⁻¹ (Z'ε/n)]
             = plim(Z'X/n)⁻¹ plim(Z'ε/n)
             = ΣZX⁻¹ · 0

⇒ plim β̂IV = β.

Thus the instrumental variable estimator is consistent.

Note that the variables Z1, Z2, ..., Zk in Z are chosen such that they are uncorrelated with ε and correlated with X, at least asymptotically, so that the second order moment matrix ΣZX exists and is nonsingular.
8
Asymptotic distribution

The asymptotic distribution of

√n (β̂IV − β) = (Z'X/n)⁻¹ (1/√n) Z'ε

is normal with mean vector 0 and asymptotic covariance matrix given by

AsyVar(β̂IV) = plim[ (Z'X/n)⁻¹ (1/n) Z'E(εε')Z (X'Z/n)⁻¹ ]
            = σ² plim[ (Z'X/n)⁻¹ (Z'Z/n) (X'Z/n)⁻¹ ]
            = σ² plim(Z'X/n)⁻¹ plim(Z'Z/n) plim(X'Z/n)⁻¹
            = σ² ΣZX⁻¹ ΣZZ ΣXZ⁻¹.

For large samples,

V(β̂IV) = (σ²/n) ΣZX⁻¹ ΣZZ ΣXZ⁻¹

which can be estimated by

V̂(β̂IV) = (σ̂²/n) Σ̂ZX⁻¹ Σ̂ZZ Σ̂XZ⁻¹
        = s² (Z'X)⁻¹ Z'Z (X'Z)⁻¹

where b = (X'X)⁻¹X'y and s² = (y − Xb)'(y − Xb)/(n − k).

The variance of β̂IV is not necessarily a minimum asymptotic variance, because there can be more than one set of instrumental variables that fulfils the requirement of being uncorrelated with ε and correlated with the stochastic regressors.
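For completeness, the earlier sketch can be extended to compute this estimated covariance matrix; the setup below is again illustrative and self-contained, and the use of OLS residuals in s² follows the definition given above.

```python
# Sketch: estimated asymptotic covariance s^2 (Z'X)^{-1} Z'Z (X'Z)^{-1} of beta_IV.
import numpy as np

rng = np.random.default_rng(4)
n = 5000
z = rng.normal(size=n)
eps = rng.normal(size=n)
x = 0.8 * z + 0.6 * eps + rng.normal(size=n)
y = 1.0 + 2.0 * x + eps

X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), z])
b_iv = np.linalg.solve(Z.T @ X, Z.T @ y)

b_ols = np.linalg.solve(X.T @ X, X.T @ y)
s2 = (y - X @ b_ols) @ (y - X @ b_ols) / (n - X.shape[1])   # s^2 from OLS residuals, as in the notes
ZX_inv = np.linalg.inv(Z.T @ X)
V_hat = s2 * ZX_inv @ (Z.T @ Z) @ np.linalg.inv(X.T @ Z)    # estimated Var(beta_IV)
print(b_iv, np.sqrt(np.diag(V_hat)))                        # IV estimates and their standard errors
```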
ECONOMETRIC THEORY

MODULE – XVI
Lecture – 37
Measurement Error Models

Dr. Shalabh
Department of Mathematics and Statistics
Indian Institute of Technology Kanpur
2
A fundamental assumption in all the statistical analysis is that all the observations are correctly measured. In the context of

multiple regression model, it is assumed that the observations on study and explanatory variables are observed without

any error. In many situations, this basic assumption is violated. There can be several reasons for such violation.

 For example, the variables may not be measurable, e.g., taste, climatic conditions, intelligence, education, ability
etc. In such cases, the dummy variables are used and the observations can be recorded in terms of values of
dummy variables.
 Sometimes the variables are clearly defined but it is hard to take correct observations. For example, the age is
generally reported in complete years or in multiple of five.
 Sometimes the variable is conceptually well defined but it is not possible to take correct observation on it.
Instead, the observations are obtained on closely related proxy variables, e.g., the level of education is measured
by the number of years of schooling.
 Sometimes the variable is well understood but it is qualitative in nature. For example, intelligence is measured
by intelligence quotient (IQ) scores.

In all such cases, the true value of the variable cannot be recorded. Instead, it is observed with some error. The difference
between the observed and true values of the variable is called measurement error or errors-in-variables.

Difference between disturbances and measurement errors


The disturbances in the linear regression model arise due to factors like unpredictable element of randomness, lack of
deterministic relationship, measurement error in the study variable, etc. The disturbance term is generally thought of as
representing the influence of various explanatory variables that have not actually been included in the relation. The
measurement errors arise due to the use of an imperfect measure of the true variables.
3
Large and small measurement errors

If the magnitude of measurement errors is small, then they can be assumed to be merged in the disturbance term and they
will not affect the statistical inferences much. On the other hand, if they are large in magnitude, then they will lead to
incorrect and invalid statistical inferences. For example, in the context of linear regression model, the ordinary least
squares estimator (OLSE) is the best linear unbiased estimator of regression coefficient when measurement errors are
absent. When the measurement errors are present in the data, the same OLSE becomes biased as well as inconsistent
estimator of regression coefficients.

Consequences of measurement errors

We first describe the measurement error model. Let the true relationship between the correctly observed study and explanatory variables be

ỹ = X̃β

where ỹ is an (n x 1) vector of true observations on the study variable, X̃ is an (n x k) matrix of true observations on the explanatory variables, and β is a (k x 1) vector of regression coefficients. The values ỹ and X̃ are not observable due to the presence of measurement errors. Instead, the values of y and X are observed with additive measurement errors as

y = ỹ + u
X = X̃ + V

where y is an (n x 1) vector of observed values of the study variable, observed with the (n x 1) measurement error vector u. Similarly, X is an (n x k) matrix of observed values of the explanatory variables, observed with the (n x k) matrix V of measurement errors in X̃. In such a case, the usual disturbance term can be assumed to be subsumed in u without loss of generality. Since our aim is to see the impact of measurement errors, it is not considered separately in the present case.
4
Alternatively, the same setup can be expressed as

y = X̃β + u
X = X̃ + V

where it can be assumed that only X̃ is measured with measurement errors V, and u can be considered as the usual disturbance term in the model.

In case some of the explanatory variables are measured without any measurement error, the corresponding values in V will be set to zero.

We assume that

E(u) = 0,   E(uu') = σ²I
E(V) = 0,   E(V'V) = Ω,   E(V'u) = 0.

The following set of equations describes the measurement error model:

ỹ = X̃β
y = ỹ + u
X = X̃ + V

which can be re-expressed as

y = ỹ + u
  = X̃β + u
  = (X − V)β + u
  = Xβ + (u − Vβ)
  = Xβ + ω

where ω = u − Vβ is called the composite disturbance term. This model resembles the usual linear regression model. A basic assumption in the linear regression model is that the explanatory variables and the disturbances are uncorrelated.
5
Let us verify this assumption in the model y = Xβ + ω as follows:

E[{X − E(X)}'{ω − E(ω)}] = E[V'(u − Vβ)]
                         = E(V'u) − E(V'V)β
                         = 0 − Ωβ
                         = −Ωβ
                         ≠ 0.

Thus X and ω are correlated, so OLS will not provide valid results here; in fact, as shown below, the OLSE becomes biased and inconsistent.

Suppose we ignore the measurement errors and obtain the OLSE. Note that ignoring the measurement errors in the data
does not mean that they are not present. We now observe the properties of such an OLSE under the setup of measurement
error model.

The OLSE is

b = (X'X)⁻¹X'y
b − β = (X'X)⁻¹X'(Xβ + ω) − β
      = (X'X)⁻¹X'ω

E(b − β) = E[(X'X)⁻¹X'ω]
         ≠ (X'X)⁻¹X'E(ω)
         ≠ 0

as X is a random matrix which is correlated with ω. So b becomes a biased estimator of β.


6
Now we check the consistency property of the OLSE.
Assume

plim(X̃'X̃/n) = Σxx
plim(V'V/n) = Σvv
plim(X̃'V/n) = 0
plim(V'u/n) = 0.

Then

plim(b − β) = plim[(X'X/n)⁻¹ (X'ω/n)].

Now

(1/n)X'X = (1/n)(X̃ + V)'(X̃ + V)
         = (1/n)X̃'X̃ + (1/n)X̃'V + (1/n)V'X̃ + (1/n)V'V

plim(X'X/n) = plim(X̃'X̃/n) + plim(X̃'V/n) + plim(V'X̃/n) + plim(V'V/n)
            = Σxx + 0 + 0 + Σvv
            = Σxx + Σvv.
7

(1/n)X'ω = (1/n)X̃'ω + (1/n)V'ω
         = (1/n)X̃'(u − Vβ) + (1/n)V'(u − Vβ)

plim(X'ω/n) = plim(X̃'u/n) − plim(X̃'V/n)β + plim(V'u/n) − plim(V'V/n)β
            = 0 − 0 + 0 − Σvv β

plim(b − β) = plim(X'X/n)⁻¹ plim(X'ω/n)
            = −(Σxx + Σvv)⁻¹ Σvv β
            ≠ 0.

Thus b is an inconsistent estimator of β. Such inconsistency arises essentially due to the correlation between X and ω.

Note: It should not be misunderstood that the OLSE b = (X'X)⁻¹X'y is obtained by minimizing S = ω'ω = (y − Xβ)'(y − Xβ) in the model y = Xβ + ω. In fact, ω'ω cannot be minimized as in the case of the usual linear regression, because the composite error ω = u − Vβ is itself a function of β.
8
To see the nature of the inconsistency, consider the simple linear regression model with measurement error as

ỹi = β0 + β1 x̃i,   i = 1, 2, ..., n
yi = ỹi + ui
xi = x̃i + vi.

Now

X = [1 x1; 1 x2; … ; 1 xn],   X̃ = [1 x̃1; 1 x̃2; … ; 1 x̃n],   V = [0 v1; 0 v2; … ; 0 vn]

and assuming that

plim[(1/n) Σi x̃i] = µ
plim[(1/n) Σi (x̃i − µ)²] = σx²,

we have

Σxx = plim[(1/n) X̃'X̃]
    = plim [ 1,           (1/n)Σi x̃i ;
             (1/n)Σi x̃i,  (1/n)Σi x̃i² ]
    = [ 1,  µ ;
        µ,  σx² + µ² ].
9
Also

Σvv = plim[(1/n) V'V]
    = [ 0,  0 ;
        0,  σv² ].

Now

plim(b − β) = −(Σxx + Σvv)⁻¹ Σvv β

plim [ b0 − β0 ;  b1 − β1 ] = − [ 1,  µ ;  µ,  σx² + µ² + σv² ]⁻¹ [ 0,  0 ;  0,  σv² ] [ β0 ;  β1 ]

   = − (1/(σx² + σv²)) [ σx² + µ² + σv²,  −µ ;  −µ,  1 ] [ 0 ;  β1σv² ]

   = [  µ β1 σv² / (σx² + σv²) ;
       − β1 σv² / (σx² + σv²)  ].

Thus we find that the OLSEs of β0 and β1 are biased and inconsistent. So if a variable is subject to measurement error, this not only affects the estimate of its own parameter but also affects the estimators of the parameters associated with the variables that are measured without any error. In other words, the presence of measurement errors in even a single variable makes the OLSEs of all the regression coefficients inconsistent, including those of variables measured without error.
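The attenuation of the slope estimate implied above, plim(b1) = β1 − β1σv²/(σx² + σv²) = β1 σx²/(σx² + σv²), is easy to reproduce numerically. The sketch below is illustrative only; all numerical values are arbitrary choices.

```python
# Attenuation of the OLS slope under measurement error in x.
import numpy as np

rng = np.random.default_rng(5)
n, beta0, beta1 = 100000, 1.0, 2.0
sigma_x2, sigma_v2, sigma_u2 = 1.0, 0.5, 0.25

x_true = rng.normal(loc=3.0, scale=np.sqrt(sigma_x2), size=n)
y = beta0 + beta1 * x_true + rng.normal(scale=np.sqrt(sigma_u2), size=n)
x_obs = x_true + rng.normal(scale=np.sqrt(sigma_v2), size=n)   # x observed with error

X = np.column_stack([np.ones(n), x_obs])
b = np.linalg.lstsq(X, y, rcond=None)[0]
print(b[1], beta1 * sigma_x2 / (sigma_x2 + sigma_v2))   # both close to 2 * 1/1.5 = 1.333
```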
10

Forms of the measurement error model

Based on the assumption about the true values of the explanatory variable, there are three forms of the measurement error model.

Consider the model

ỹi = β0 + β1 x̃i,   i = 1, 2, ..., n
yi = ỹi + ui
xi = x̃i + vi.

1. Functional form: When the x̃i's are unknown constants (fixed), the measurement error model is said to be in its functional form.

2. Structural form: When the x̃i's are identically and independently distributed random variables, say, with mean µ and variance σ² (σ² > 0), the measurement error model is said to be in its structural form.

Note that in the case of the functional form, σ² = 0.

3. Ultrastructural form: When the x̃i's are independently distributed random variables with different means, say µi, and variance σ² (σ² > 0), the model is said to be in its ultrastructural form. This form is a synthesis of the functional and structural forms in the sense that both forms are particular cases of the ultrastructural form.
ECONOMETRIC THEORY

MODULE – XVI
Lecture – 38
Measurement Error Models

Dr. Shalabh
Department of Mathematics and Statistics
Indian Institute of Technology Kanpur
2

Methods for consistent estimation of β


The OLSE of β which is the best linear unbiased estimator becomes biased and inconsistent in the presence of

measurement errors. An important objective in measurement error models is how to obtain the consistent estimators of

regression coefficients. The instrumental variable estimation and method of maximum likelihood (or method of moments)

are utilized to obtain the consistent estimates of the parameters.

Instrumental variable estimation

The instrumental variable method provides the consistent estimate of regression coefficients in linear regression model
when the explanatory variables and disturbance terms are correlated. Since in measurement error model, the
explanatory variables and disturbance are correlated, so this method helps. The instrumental variable method consists
of finding a set of variables which are correlated with the explanatory variables in the model but uncorrelated with the
composite disturbances, at least asymptotically, to ensure consistency.

Let Z1, Z2, ..., Zk be the k instrumental variables. In the context of the model

y = Xβ + ω,   ω = u − Vβ,

let Z be the (n x k) matrix of k instrumental variables Z1, Z2, ..., Zk, each having n observations, such that
 Z and X are correlated, at least asymptotically, and
 Z and ω are uncorrelated, at least asymptotically.
3
So we have

plim(Z'X/n) = ΣZX
plim(Z'ω/n) = 0.

The instrumental variable estimator of β is given by

β̂IV = (Z'X)⁻¹Z'y
     = (Z'X)⁻¹Z'(Xβ + ω)
β̂IV − β = (Z'X)⁻¹Z'ω

plim(β̂IV − β) = plim(Z'X/n)⁻¹ plim(Z'ω/n)
             = ΣZX⁻¹ · 0
             = 0.

So β̂IV is a consistent estimator of β.

Any instrument that fulfils the requirement of being uncorrelated with the composite disturbance term and correlated with the explanatory variables will result in a consistent estimate of the parameters. However, there can be various sets of variables which satisfy these conditions and so qualify as instrumental variables. Different choices of instruments give different consistent estimators. It is difficult to assert which choice of instruments will give an instrumental variable estimator having minimum asymptotic variance. Moreover, it is also difficult to decide which choice of instrumental variables is better and more appropriate in comparison to the others. An additional difficulty is to check whether the chosen instruments are indeed uncorrelated with the disturbance term or not.
4

Choice of instrument
We discuss some popular choices of instruments in a univariate measurement error model. Consider the model

yi = β0 + β1 xi + ωi,   ωi = ui − β1 vi,   i = 1, 2, ..., n.

A variable that is likely to satisfy the two requirements of an instrumental variable is a discrete grouping variable. Wald's, Bartlett's and Durbin's methods are based on different choices of discrete grouping variables.

1. Wald's method
Find the median of the given observations x1, x2, ..., xn. Now classify the observations by defining an instrumental variable Z such that

Zi =  1  if xi > median(x1, x2, ..., xn)
     −1  if xi < median(x1, x2, ..., xn).

In this case,

Z = [1 Z1; 1 Z2; … ; 1 Zn],   X = [1 x1; 1 x2; … ; 1 xn].

Now form two groups of observations as follows.

 One group with those xi's below the median of x1, x2, ..., xn. Find the means of the yi's and xi's, say ȳ1 and x̄1, respectively, in this group.
 Another group with those xi's above the median of x1, x2, ..., xn. Find the means of the yi's and xi's, say ȳ2 and x̄2, respectively, in this group.
5
Now we find the instrumental variable estimator under this setup as follows. Let x̄ = (1/n) Σi xi and ȳ = (1/n) Σi yi. Then

β̂IV = (Z'X)⁻¹Z'y

Z'X = [ n,      Σi xi ;       =  [ n,  n x̄ ;
        Σi Zi,  Σi Zi xi ]        0,  (n/2)(x̄2 − x̄1) ]

Z'y = [ Σi yi ;   Σi Zi yi ]  =  [ n ȳ ;   (n/2)(ȳ2 − ȳ1) ]

[ β̂0IV ; β̂1IV ] = [ n,  n x̄ ;  0,  (n/2)(x̄2 − x̄1) ]⁻¹ [ n ȳ ;  (n/2)(ȳ2 − ȳ1) ]

               = (2/(x̄2 − x̄1)) [ (x̄2 − x̄1)/2,  −x̄ ;  0,  1 ] [ ȳ ;  (ȳ2 − ȳ1)/2 ]

               = [ ȳ − x̄ (ȳ2 − ȳ1)/(x̄2 − x̄1) ;
                   (ȳ2 − ȳ1)/(x̄2 − x̄1) ]

⇒ β̂1IV = (ȳ2 − ȳ1)/(x̄2 − x̄1)
   β̂0IV = ȳ − β̂1IV x̄.

If n is odd, then the middle observation can be deleted. Under fairly general conditions, the estimators are consistent but are likely to have large sampling variances. This is the limitation of this method.
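A compact implementation of Wald's grouping estimator is sketched below (illustrative, not code from the notes); the grouping is done on the observed x-values at their median, as described above, and all data-generating values are arbitrary choices.

```python
# Sketch of Wald's grouping estimator: beta1_hat = (ybar2 - ybar1)/(xbar2 - xbar1).
import numpy as np

def wald_estimator(x, y):
    med = np.median(x)
    lo, hi = x < med, x > med          # observations equal to the median are dropped
    b1 = (y[hi].mean() - y[lo].mean()) / (x[hi].mean() - x[lo].mean())
    b0 = y.mean() - b1 * x.mean()
    return b0, b1

# illustrative data with measurement error in x
rng = np.random.default_rng(6)
n = 4001                                # odd n: the middle observation is discarded
x_true = rng.normal(3.0, 1.0, size=n)
y = 1.0 + 2.0 * x_true + rng.normal(0, 0.5, size=n)
x = x_true + rng.normal(0, 0.7, size=n)
print(wald_estimator(x, y))             # Wald grouping estimates of (beta0, beta1)
```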
6

2. Bartlett's method
Let x1, x2, ..., xn be the n observations. Rank these observations and order them in increasing or decreasing order. Now three groups can be formed, each containing n/3 observations. Define the instrumental variable as

Zi =  1  if the observation is in the top group
      0  if the observation is in the middle group
     −1  if the observation is in the bottom group.

Now discard the observations in the middle group and compute the means of the yi's and xi's in the
• bottom group, say ȳ1 and x̄1, and
• top group, say ȳ3 and x̄3.

Substituting the values of X and Z in β̂IV = (Z'X)⁻¹Z'y and solving, we get

β̂1IV = (ȳ3 − ȳ1)/(x̄3 − x̄1)
β̂0IV = ȳ − β̂1IV x̄.

These estimators are consistent. No conclusive evidence is available to compare Bartlett's method and Wald's method, but the three-group method generally provides more efficient estimates than the two-group method in many cases.
7

3. Durbin's method
Let x1, x2, ..., xn be the observations. Arrange these observations in ascending order. Define the instrumental variable Zi as the rank of xi. Then, substituting the suitable values of Z and X in β̂IV = (Z'X)⁻¹Z'y, we get the instrumental variable estimator

β̂1IV = Σi Zi(yi − ȳ) / Σi Zi(xi − x̄).

When there is more than one explanatory variable, one may choose the instrument as the rank of that particular variable.

Since the estimator uses more information, it is believed to be superior in efficiency to the other grouping methods. However, nothing definite is known about the efficiency of this method.

In general, the instrumental variable estimators may have fairly large standard errors in comparison to the ordinary least squares estimators, which is the price paid for consistency. However, inconsistent estimators have little appeal.
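Durbin's rank-based choice of instrument can be coded in a few lines; the sketch below is an illustration only, written for the same simple model as above.

```python
# Sketch of Durbin's rank-based instrument: Z_i = rank of x_i,
# beta1_hat = sum Z_i (y_i - ybar) / sum Z_i (x_i - xbar).
import numpy as np

def durbin_estimator(x, y):
    z = np.argsort(np.argsort(x)) + 1.0      # ranks 1..n of the x observations
    b1 = np.sum(z * (y - y.mean())) / np.sum(z * (x - x.mean()))
    b0 = y.mean() - b1 * x.mean()
    return b0, b1
```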
8

Maximum likelihood estimation in the structural form

Consider the maximum likelihood estimation of parameters in the simple measurement error model given by

ỹi = β0 + β1 x̃i,   i = 1, 2, ..., n
yi = ỹi + ui
xi = x̃i + vi.

Here (x̃i, ỹi) are unobservable and (xi, yi) are observable.

Assume

E(ui) = 0,   E(ui uj) = σu² if i = j, and 0 if i ≠ j,
E(vi) = 0,   E(vi vj) = σv² if i = j, and 0 if i ≠ j,
E(ui vj) = 0   for all i = 1, 2, ..., n; j = 1, 2, ..., n.

For the application of the method of maximum likelihood, we assume a normal distribution for ui and vi.

We consider the estimation of parameters in the structural form of the model, in which the x̃i's are stochastic.
So assume

x̃i ~ N(µ, σ²)

and the x̃i's are independent of ui and vi.
9

Thus

E(x̃i) = µ
Var(x̃i) = σ²

E(xi) = µ
Var(xi) = E[xi − E(xi)]²
        = E[x̃i + vi − µ]²
        = E(x̃i − µ)² + E(vi²) + 2E[(x̃i − µ)vi]
        = σ² + σv²

E(yi) = β0 + β1 E(x̃i)
      = β0 + β1 µ

Var(yi) = E[yi − E(yi)]²
        = E[β0 + β1x̃i + ui − β0 − β1µ]²
        = β1² E(x̃i − µ)² + E(ui²) + 2β1 E[(x̃i − µ)ui]
        = β1²σ² + σu²

Cov(xi, yi) = E[{xi − E(xi)}{yi − E(yi)}]
            = E[{x̃i + vi − µ}{β0 + β1x̃i + ui − β0 − β1µ}]
            = β1 E(x̃i − µ)² + E[(x̃i − µ)ui] + β1 E[(x̃i − µ)vi] + E(ui vi)
            = β1σ² + 0 + 0 + 0
            = β1σ².
10

So

[ yi ; xi ] ~ N( [ β0 + β1µ ; µ ],  [ β1²σ² + σu²,  β1σ² ;  β1σ²,  σ² + σv² ] ).

The likelihood function is the joint probability density function of ui and vi, i = 1, 2, ..., n:

L = f(u1, u2, ..., un, v1, v2, ..., vn)
  = (1/(2πσuσv))ⁿ exp(−Σi ui²/(2σu²)) exp(−Σi vi²/(2σv²))
  = (1/(2πσuσv))ⁿ exp(−Σi (yi − β0 − β1x̃i)²/(2σu²)) exp(−Σi (xi − x̃i)²/(2σv²)).

The log-likelihood is

L* = ln L = constant − (n/2)(ln σu² + ln σv²) − Σi (yi − β0 − β1x̃i)²/(2σu²) − Σi (xi − x̃i)²/(2σv²).
11
The normal equations are obtained by setting the partial derivatives equal to zero:

(1)  ∂L*/∂β0 = (1/σu²) Σi (yi − β0 − β1x̃i) = 0
(2)  ∂L*/∂β1 = (1/σu²) Σi x̃i (yi − β0 − β1x̃i) = 0
(3)  ∂L*/∂x̃i = (1/σv²)(xi − x̃i) + (β1/σu²)(yi − β0 − β1x̃i) = 0,   i = 1, 2, ..., n
(4)  ∂L*/∂σu² = −n/(2σu²) + (1/(2σu⁴)) Σi (yi − β0 − β1x̃i)² = 0
(5)  ∂L*/∂σv² = −n/(2σv²) + (1/(2σv⁴)) Σi (xi − x̃i)² = 0.

These are (n + 4) equations in (n + 4) parameters, but squaring equation (3), summing over i = 1, 2, ..., n and using equations (4) and (5), we get

σu² = β1² σv²,

which is undesirable.

These equations can be used to estimate the two means (µ and β0 + β1µ), two variances and one covariance. The six parameters µ, β0, β1, σu², σv² and σ² can be estimated from the following five structural relations derived from these normal equations:

(i)   x̄ = µ
(ii)  ȳ = β0 + β1µ
(iii) mxx = σ² + σv²
(iv)  myy = β1²σ² + σu²
(v)   mxy = β1σ²
12

where x̄ = (1/n)Σi xi, ȳ = (1/n)Σi yi, mxx = (1/n)Σi (xi − x̄)², myy = (1/n)Σi (yi − ȳ)² and mxy = (1/n)Σi (xi − x̄)(yi − ȳ).

These equations can also be derived directly using the sufficiency property of the parameters in the bivariate normal distribution, using the definition of the structural relationship as

E(x) = µ
E(y) = β0 + β1µ
Var(x) = σ² + σv²
Var(y) = β1²σ² + σu²
Cov(x, y) = β1σ².

We observe that there are six parameters β0, β1, µ, σ², σu² and σv² to be estimated on the basis of the five structural equations (i)-(v). So no unique solution exists. Only µ can be uniquely determined, while the remaining parameters cannot be uniquely determined. So only µ is identifiable and the remaining parameters are unidentifiable. This is called the problem of identification. One relation is lacking for a unique solution, so an additional a priori restriction relating any of the six parameters is required.

Note: The same equations (i)-(v) can also be derived using the method of moments, by equating the sample and population moments. The assumption of a normal distribution for ui, vi and x̃i is not needed in the case of the method of moments.
ECONOMETRIC THEORY

MODULE – XVI
Lecture – 39
Measurement Error Models

Dr. Shalabh
Department of Mathematics and Statistics
Indian Institute of Technology Kanpur
2

Additional information for the consistent estimation of parameters


The parameters in the model can be consistently estimated only when some additional information about the model is
available.
From equations (i) and (ii), we have

µ̂ = x̄

and so µ is clearly estimated. Further,

β̂0 = ȳ − β̂1 x̄

is estimated if β̂1 is uniquely determined. So we consider the estimation of β1, σ², σu² and σv² only. Some additional information is required for the unique determination of these parameters. We now consider various types of additional information which are used for estimating the parameters uniquely.

1. σv² is known
Suppose σv² is known a priori. Now the remaining parameters can be estimated as follows:

mxx = σ² + σv²      ⇒  σ̂² = mxx − σv²
mxy = β1σ²          ⇒  β̂1 = mxy/(mxx − σv²)
myy = β1²σ² + σu²   ⇒  σ̂u² = myy − β̂1²σ̂² = myy − mxy²/(mxx − σv²).

Note that σ̂² = mxx − σv² can be negative, because σv² is known and mxx is based upon the sample. So we assume that σ̂² > 0 and redefine

β̂1 = mxy/(mxx − σv²);   mxx > σv².
3

Similarly, σ̂u² is also assumed to be positive under a suitable condition. All the estimators β̂1, σ̂² and σ̂u² are consistent estimators of β1, σ² and σu², respectively. Note that β̂1 looks as if the direct regression estimator of β1 has been adjusted by σv² for its inconsistency. So it is also termed an adjusted estimator.
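A small sketch of these adjusted estimators is given below. It is illustrative only; it assumes the prior value sigma_v2 is supplied by the analyst and that mxx > σv², as required above.

```python
# Adjusted estimators when sigma_v^2 is known a priori.
import numpy as np

def adjusted_estimates(x, y, sigma_v2):
    m_xx, m_yy = x.var(), y.var()                        # (1/n) sums of squared deviations
    m_xy = np.mean((x - x.mean()) * (y - y.mean()))
    assert m_xx > sigma_v2, "requires m_xx > sigma_v^2"
    beta1 = m_xy / (m_xx - sigma_v2)                     # adjusted slope estimator
    beta0 = y.mean() - beta1 * x.mean()
    sigma2 = m_xx - sigma_v2
    sigma_u2 = m_yy - beta1**2 * sigma2
    return beta0, beta1, sigma2, sigma_u2
```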

2. σu² is known
Suppose σu² is known a priori. Then, using mxy = β1σ², we can rewrite

myy = β1²σ² + σu²
    = β1 mxy + σu²
⇒ β̂1 = (myy − σu²)/mxy ;   myy > σu²

σ̂² = mxy/β̂1
σ̂v² = mxx − σ̂².

The estimators β̂1, σ̂² and σ̂v² are consistent estimators of β1, σ² and σv², respectively. Note that β̂1 looks as if the reverse regression estimator of β1 has been adjusted by σu² for its inconsistency. So it is also termed an adjusted estimator.
4

3. λ = σu²/σv² is known
Suppose the ratio of the measurement error variances is known, so let

λ = σu²/σv²

be known.

Consider

myy = β1²σ² + σu²
    = β1 mxy + λσv²                  (using (v))
    = β1 mxy + λ(mxx − σ²)           (using (iii))
    = β1 mxy + λ(mxx − mxy/β1)       (using (v))

β1² mxy + λ(β1 mxx − mxy) − β1 myy = 0     (β1 ≠ 0)
β1² mxy + β1(λ mxx − myy) − λ mxy = 0.

Solving this quadratic equation,

β̂1 = [ (myy − λ mxx) ± √{(myy − λ mxx)² + 4λ mxy²} ] / (2 mxy)
   = U/(2 mxy),  say.
5

Since mxy = β1σ² and

σ̂² ≥ 0
⇒ mxy/β̂1 ≥ 0
⇒ 2mxy²/U ≥ 0
⇒ since mxy² ≥ 0, U must be nonnegative.

This implies that the positive sign in U has to be considered, and so

β̂1 = [ (myy − λ mxx) + √{(myy − λ mxx)² + 4λ mxy²} ] / (2 mxy).

The other estimates are

σ̂v² = (myy − 2β̂1 mxy + β̂1² mxx) / (λ + β̂1²)
σ̂² = mxy / β̂1.

Note that the same estimator β̂1 of β1 can be obtained by orthogonal regression. This amounts to transforming xi to xi/σv and yi to yi/σu and using orthogonal regression estimation with the transformed variables.
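The case of known λ can be implemented directly from the quadratic-root formula above; the helper below is an illustrative sketch, with the positive root chosen as argued above.

```python
# Estimator of (beta0, beta1) when lambda = sigma_u^2 / sigma_v^2 is known.
import numpy as np

def lambda_known_estimate(x, y, lam):
    m_xx, m_yy = x.var(), y.var()
    m_xy = np.mean((x - x.mean()) * (y - y.mean()))
    d = m_yy - lam * m_xx
    beta1 = (d + np.sqrt(d**2 + 4.0 * lam * m_xy**2)) / (2.0 * m_xy)   # positive root of the quadratic
    return y.mean() - beta1 * x.mean(), beta1                          # (beta0_hat, beta1_hat)
```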
6
4. Reliability ratio is known
The reliability ratio associated with the explanatory variable is defined as the ratio of the variances of the true and observed values of the explanatory variable, so

Kx = Var(x̃)/Var(x) = σ²/(σ² + σv²);   0 ≤ Kx ≤ 1

is the reliability ratio. Note that Kx = 1 when σv² = 0, which means that there is no measurement error in the explanatory variable, and Kx = 0 means σ² = 0, which means the explanatory variable is fixed. A higher value of Kx is obtained when σv² is small, i.e., when the impact of the measurement errors is small. The reliability ratio is a popular measure in psychometrics.

Let Kx be known a priori. Then

mxx = σ² + σv²
mxy = β1σ²
⇒ mxy/mxx = β1σ²/(σ² + σv²) = β1 Kx
⇒ β̂1 = mxy/(Kx mxx)

σ² = mxy/β1      ⇒  σ̂² = Kx mxx
mxx = σ² + σv²   ⇒  σ̂v² = (1 − Kx) mxx.

Note that β̂1 = Kx⁻¹ b, where b is the ordinary least squares estimator b = mxy/mxx.
7
5. β0 is known

Suppose β0 is known a priori and E(x) ≠ 0. Then

ȳ = β0 + β1µ
⇒ β̂1 = (ȳ − β0)/µ̂ = (ȳ − β0)/x̄

σ̂² = mxy/β̂1
σ̂u² = myy − β̂1 mxy
σ̂v² = mxx − mxy/β̂1.

6. Both σu² and σv² are known

This case leads to over-identification in the sense that the number of parameters to be estimated is smaller than the number of structural relationships binding them. So no unique solution is obtained in this case.

Note: In each of the cases 1-6, note that the form of the estimate depends on the type of available information, which is needed for the consistent estimation of the parameters. Such information can be available from various sources, e.g., long association of the experimenter with the experiment, similar types of studies conducted in the past, some extraneous source, etc.
8
Estimation of parameters in the functional form
In the functional form of the measurement error model, the x̃i's are assumed to be fixed. This assumption is unrealistic in the sense that when the x̃i's are unobservable and unknown, it is difficult to know whether they are fixed or not. It cannot be ensured even in repeated sampling that the same values are repeated. All that can be said in this case is that the information is conditional upon the x̃i's. So assume that the x̃i's are conditionally known. So the model is

ỹi = β0 + β1 x̃i
xi = x̃i + vi
yi = ỹi + ui.

Then

[ yi ; xi ] ~ N( [ β0 + β1x̃i ; x̃i ],  [ σu²,  0 ;  0,  σv² ] ).

The likelihood function is

L = (1/(2πσu²))^(n/2) exp(−Σi (yi − β0 − β1x̃i)²/(2σu²)) · (1/(2πσv²))^(n/2) exp(−Σi (xi − x̃i)²/(2σv²)).

The log-likelihood is

L* = ln L = constant − (n/2) ln σu² − (1/(2σu²)) Σi (yi − β0 − β1x̃i)² − (n/2) ln σv² − (1/(2σv²)) Σi (xi − x̃i)².
The normal equations are obtained by partially differentiating L* and equating to zero:

(I)   ∂L*/∂β0 = 0  ⇒  (1/σu²) Σi (yi − β0 − β1x̃i) = 0
(II)  ∂L*/∂β1 = 0  ⇒  (1/σu²) Σi (yi − β0 − β1x̃i) x̃i = 0
(III) ∂L*/∂σu² = 0  ⇒  −n/(2σu²) + (1/(2σu⁴)) Σi (yi − β0 − β1x̃i)² = 0
(IV)  ∂L*/∂σv² = 0  ⇒  −n/(2σv²) + (1/(2σv⁴)) Σi (xi − x̃i)² = 0
(V)   ∂L*/∂x̃i = 0  ⇒  (β1/σu²)(yi − β0 − β1x̃i) + (1/σv²)(xi − x̃i) = 0.

Squaring and summing equation (V), we get

Σi [ (β1/σu²)(yi − β0 − β1x̃i) ]² = Σi [ −(1/σv²)(xi − x̃i) ]²

or  (β1²/σu⁴) Σi (yi − β0 − β1x̃i)² = (1/σv⁴) Σi (xi − x̃i)².

Using equations (III) and (IV), we get

nβ1²/σu² = n/σv²
⇒ β1 = σu/σv,
10

which is unacceptable because β1 can also be negative, whereas here, since σu > 0 and σv > 0, β1 would always be positive. Thus the maximum likelihood estimation breaks down because of insufficient information in the model. Increasing the sample size n does not solve the problem. If restrictions like σu² known, σv² known, or λ = σu²/σv² known are incorporated, then the maximum likelihood estimation proceeds as in the case of the structural form and similar estimates may be obtained. For example, if λ = σu²/σv² is known, then substitute it in the likelihood function and maximize it. The same solutions as in the case of the structural form are obtained.


ECONOMETRIC THEORY

MODULE – XVII
Lecture - 40
Simultaneous Equations Models

Dr. Shalabh
Department of Mathematics and Statistics
Indian Institute of Technology Kanpur
2

In any regression modeling, generally an equation is considered to represent a relationship describing a phenomenon.

Many situations involve a set of relationships which explain the behaviour of certain variables. For example, in analyzing

the market conditions for a particular commodity, there can be a demand equation and a supply equation which explain the

price and quantity of commodity exchanged in the market at market equilibrium. So there are two equations to explain the

whole phenomenon - one for demand and another for supply.

In such cases, it is not necessary that all the variables should appear in all the equations. So estimation of parameters

under this type of situation has those features that are not present when a model involves only a single relationship. In

particular, when a relationship is a part of a system, then some explanatory variables are stochastic and are correlated

with the disturbances. So the basic assumption of a linear regression model that the explanatory variable and disturbance

are uncorrelated or explanatory variables are fixed is violated and consequently ordinary least squares estimator becomes

inconsistent.

Similar to the classification of variables as explanatory variable and study variable in linear regression model, the variables

in simultaneous equation models are classified as endogenous variables and exogenous variables.
3

Endogenous variables (Jointly determined variables)

The variables which are explained by the functioning of system and values of which are determined by the simultaneous
interaction of the relations in the model are endogenous variables or jointly determined variables.

Exogenous variables (Predetermined variables)

The variables that contribute to provide explanations for the endogenous variables and values of which are determined
from outside the model are exogenous variables or predetermined variables.

Exogenous variables help in explaining the variations in endogenous variables. It is customary to include past values of
endogenous variables in the predetermined group. Since exogenous variables are predetermined, so they are
independent of disturbance term in the model. They satisfy those assumptions which explanatory variables satisfy in the
usual regression model. Exogenous variables influence the endogenous variables but are not themselves influenced by
them. One variable which is endogenous for one model can be exogenous variable for the other model.

Note that in linear regression model, the explanatory variables influence study variable but not vice versa. So relationship
is one sided.

The classification of variables as endogenous and exogenous is important because a necessary condition for uniquely
estimating all the parameters is that the number of endogenous variables is equal to the number of independent equations
in the system. Moreover, the main distinction of predetermined variable in the estimation of parameters is that they are
uncorrelated with disturbance term in the equations in which they appear.
4
Simultaneous equation systems

A model constitutes a system of simultaneous equations if all the involved relationships are needed for determining the
value of at least one of the endogenous variables included in the model. This implies that at least one of the relationships
includes more than one endogenous variable.

Example 1
Now we consider the following example in detail and introduce various concepts and terminologies used in describing the
simultaneous equations models.

Consider the situation of an ideal market where transactions of only one commodity, say wheat, take place. Assume that the number of buyers and sellers is large enough so that the market is a perfectly competitive market. It is also assumed that the amount of wheat that comes into the market in a day is completely sold out on the same day. No seller takes it back. Now we develop a model for such a mechanism.
Let
dt denotes the demand of the commodity, say wheat, at time t,
st denotes the supply of the commodity, say wheat, at time t, and
qt denotes the quantity of the commodity, say wheat, transacted at time t.

By the economic theory of an ideal market, we have the following condition:

dt = st,   t = 1, 2, ..., n.
5
Observe that
• demand of wheat depends on
- price of wheat (pt) at time t.
- income of buyer (it) at time t.
• supply of wheat depends on
- price of wheat (pt) at time t.
- rainfall (rt) at time t.

From market conditions, we have

qt = dt = st.

Demand, supply and price are determined from each other.
Note that
 income can influence demand and supply, but demand and supply cannot influence the income.
 supply is influenced by rainfall, but rainfall is not influenced by the supply of wheat.

Our aim is to study the behaviour of dt, st and pt, which are determined by the simultaneous equations model.
Since endogenous variables are influenced by exogenous variables but not vice versa,
 dt, st and pt are endogenous variables.
 it and rt are exogenous variables.

Now consider an additional variable for the model as lagged value of price pt, denoted as pt-1. In a market, generally the
price of the commodity depends on the price of the commodity on previous day. If the price of commodity today is less
than the price in the previous day, then buyer would like to buy more. For seller also, today’s price of commodity depends
on previous day’s price and based on which he decides the quantity of commodity (wheat) to be brought in the market.
6

So the lagged price affects the demand and supply equations both. Updating both the models, we can now write that
 demand depends on pt , it and pt −1.
 supply depends on pt , rt and pt −1.

Note that the lagged variables are considered as exogenous variable. The updated list of endogenous and exogenous
variables is as follows:
 Endogenous variables: pt , d t , st
 Exogenous variables: pt−1, it, rt

The mechanism of the market is now described by the following set of equations:

 demand:  dt = α1 + β1 pt + ε1t
 supply:  st = α2 + β2 pt + ε2t
 equilibrium condition:  dt = st = qt

where the α's denote the intercept terms, the β's denote the regression coefficients and the ε's denote the disturbance terms.

These equations are called structural equations. The error terms ε1 and ε2 are called structural disturbances. The
coefficients α1 , α 2 , β1 and β 2 are called the structural coefficients.

The system of equations is called the structural form of the model.


7
Since qt = dt = st, the demand and supply equations can be expressed as

qt = α1 + β1 pt + ε1t    (I)
qt = α2 + β2 pt + ε2t    (II)

So there are only two structural relationships. The price is determined by the mechanism of the market and not by the buyer or the supplier. Thus qt and pt are the endogenous variables. Without loss of generality, we can assume that the variables associated with α1 and α2 are X1 and X2 respectively, such that X1 = 1 and X2 = 1. So X1 = 1 and X2 = 1 are predetermined and can be regarded as exogenous variables.

From the statistical point of view, we would like to write the model in such a form that OLS can be directly applied. So, writing equations (I) and (II) as

α1 + β1 pt + ε1t = α2 + β2 pt + ε2t

or  pt = (α1 − α2)/(β2 − β1) + (ε1t − ε2t)/(β2 − β1)
       = π11 + v1t                                          (III)

    qt = (β2α1 − β1α2)/(β2 − β1) + (β2ε1t − β1ε2t)/(β2 − β1)
       = π21 + v2t                                          (IV)

where

π11 = (α1 − α2)/(β2 − β1),        π21 = (β2α1 − β1α2)/(β2 − β1),
v1t = (ε1t − ε2t)/(β2 − β1),       v2t = (β2ε1t − β1ε2t)/(β2 − β1).
8
Each endogenous variable is expressed as a function of the exogenous variables. Note that the exogenous variable 1 (from X1 = 1 or X2 = 1) is not clearly identifiable.

The equations (III) and (IV) are called the reduced form relationships and, in general, the reduced form of the model.

The coefficients π11 and π21 are called the reduced form coefficients, and the errors v1t and v2t are called the reduced form disturbances. The reduced form essentially expresses every endogenous variable as a function of the exogenous variables. This presents a clear relationship between the reduced form coefficients and the structural coefficients, as well as between the structural disturbances and the reduced form disturbances. The reduced form is ready for the application of the OLS technique and satisfies all the assumptions needed for its application.

Suppose we apply the OLS technique to equations (III) and (IV) and obtain the OLS estimates π̂11 and π̂21 of π11 and π21, respectively, which estimate

π11 = (α1 − α2)/(β2 − β1)
π21 = (β2α1 − β1α2)/(β2 − β1).

Note that π̂11 and π̂21 are numerical values. So now there are two equations and four unknown parameters α1, α2, β1 and β2. So it is not possible to derive unique estimates of the parameters of the model by applying the OLS technique to the reduced form. This is known as the problem of identification.
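The identification problem can be seen numerically. The simulation sketch below (all parameter values are arbitrary choices, not from the notes) generates data from the two-equation system and shows that the OLS regression of qt on pt recovers neither β1 nor β2, while the reduced-form estimates amount to only two numbers and therefore cannot determine the four structural parameters.

```python
# Simulation of the unidentified demand-supply system (I)-(II).
import numpy as np

rng = np.random.default_rng(7)
alpha1, beta1, alpha2, beta2 = 10.0, -1.0, 2.0, 0.5      # demand slope < 0, supply slope > 0
n = 100000
e1 = rng.normal(0, 1, n)
e2 = rng.normal(0, 1, n)

p = (alpha1 - alpha2 + e1 - e2) / (beta2 - beta1)                                    # reduced form (III)
q = (beta2 * alpha1 - beta1 * alpha2 + beta2 * e1 - beta1 * e2) / (beta2 - beta1)    # reduced form (IV)

slope_ols = np.cov(p, q, ddof=0)[0, 1] / np.var(p)
print(slope_ols)            # a value between beta1 and beta2, identifying neither
print(p.mean(), q.mean())   # consistent for pi_11 and pi_21, but two numbers cannot give four parameters
```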
9
By this example, the following have been described upto now:
 Structural form relationship.
 Reduced form relationship.
 Need for reducing the structural form into reduced form.
 Reason for the problem of identification.
Now we describe the problem of identification in more detail.

The identification problem

Consider the model of Example 1, which describes the behaviour of a perfectly competitive market in which only one commodity, say wheat, is transacted. The models describing the behaviour of the consumer and the supplier are prescribed by the demand and supply conditions given as

Demand:  dt = α1 + β1 pt + ε1t,   t = 1, 2, ..., n
Supply:   st = α2 + β2 pt + ε2t
Equilibrium condition:  dt = st.

If quantity qt is transacted at time t, then

dt = st = qt.

So we have a model of two structural equations in two endogenous variables (qt and pt) and one exogenous variable (with value 1, given by X1 = 1, X2 = 1). The set of three equations is reduced to a set of two equations as follows:

Demand:  qt = α1 + β1 pt + ε1t    (1)
Supply:   qt = α2 + β2 pt + ε2t    (2)

Before the analysis, we would like to check whether it is possible to estimate the parameters α1, α2, β1 and β2 or not.
10

Multiplying equation (1) by λ and (2) by (1 − λ) and then adding them together gives

λqt + (1 − λ)qt = [λα1 + (1 − λ)α2] + [λβ1 + (1 − λ)β2] pt + [λε1t + (1 − λ)ε2t]

or  qt = α + β pt + εt    (3)

where α = λα1 + (1 − λ)α2, β = λβ1 + (1 − λ)β2, εt = λε1t + (1 − λ)ε2t and λ is any scalar lying between 0 and 1.

Comparing equation (3) with equations (1) and (2), we notice that they have the same form. So it is difficult to say which is the supply equation and which is the demand equation. To see this, let equation (3) be the demand equation. Then there is no way to distinguish the true demand equation (1) from the pretended demand equation (3).

A similar exercise can be done for the supply equation, and we find that there is no way to distinguish the true supply equation (2) from the pretended supply equation (3).
Suppose we apply the OLS technique to these models. Applying OLS to equation (1) yields

β̂1 = Σt (pt − p̄)(qt − q̄) / Σt (pt − p̄)²  =  0.6, say,

where p̄ = (1/n)Σt pt and q̄ = (1/n)Σt qt.

Applying OLS to equation (3) yields

β̂ = Σt (pt − p̄)(qt − q̄) / Σt (pt − p̄)²  =  0.6.
11
Note that β̂1 and β̂ have the same analytical expression, so they will also have the same numerical value, say 0.6. Looking at the value 0.6, it is difficult to say whether the value 0.6 determines equation (1) or (3).

Applying OLS to equation (2) yields

β̂2 = Σt (pt − p̄)(qt − q̄) / Σt (pt − p̄)²  =  0.6

because β̂2 has the same analytical expression as β̂1 and β̂. So

β̂1 = β̂2 = β̂.

Thus it is difficult to decide and identify whether β̂1 is determined by the value 0.6 or β̂2 is determined by the value 0.6. Increasing the number of observations also does not help in the identification of these equations.

So we are not able to identify the parameters. So we take the help of economic theory to identify the parameters.
12

The economic theory suggests that when the price increases, the supply increases but the demand decreases. So the plot will look like

[Figure: demand and supply curves, with quantity (qt) plotted against price (pt); the supply curve (st) slopes upward and the demand curve (dt) slopes downward.]

and this implies β1 < 0 and β2 > 0. Thus, since 0.6 > 0, we can say that the value 0.6 represents β̂2 > 0 and so β̂2 = 0.6. But one can always choose a value of λ such that the pretended equation does not violate the sign of the coefficients, say β > 0. So it again becomes difficult to see whether equation (3) represents the supply equation (2) or not. So none of the parameters is identifiable.
13

Now we obtain the reduced form of the model as

pt = (α1 − α2)/(β2 − β1) + (ε1t − ε2t)/(β2 − β1)
or  pt = π11 + v1t        (4)

qt = (α1β2 − α2β1)/(β2 − β1) + (β2ε1t − β1ε2t)/(β2 − β1)
or  qt = π21 + v2t.       (5)

Applying OLS to equations (4) and (5), we obtain the OLSEs π̂11 and π̂21, which estimate

π11 = (α1 − α2)/(β2 − β1)
π21 = (α1β2 − α2β1)/(β2 − β1).

There are two equations and four unknowns. So unique estimates of the parameters cannot be obtained. Thus equations (1) and (2) cannot be identified, the model is not identifiable and estimation of the parameters is not possible.
ECONOMETRIC THEORY

MODULE – XVII
Lecture - 41
Simultaneous Equations Models

Dr. Shalabh
Department of Mathematics and Statistics
Indian Institute of Technology Kanpur
2
Suppose a new exogenous variable, income it, is introduced in the model, representing the income of the buyer at time t. Note that the demand of the commodity is influenced by income. On the other hand, the supply of the commodity is not influenced by the income, so this variable is introduced only in the demand equation. The structural equations (1) and (2) now become

Demand:  qt = α1 + β1 pt + γ1 it + ε1t    (6)
Supply:   qt = α2 + β2 pt + ε2t            (7)

where γ1 is the structural coefficient associated with income. The pretended equation is obtained by multiplying equations (6) and (7) by λ and (1 − λ), respectively, and then adding them together. This gives

λqt + (1 − λ)qt = [λα1 + (1 − λ)α2] + [λβ1 + (1 − λ)β2] pt + λγ1 it + [λε1t + (1 − λ)ε2t]

or  qt = α + β pt + γ it + εt    (8)

where α = λα1 + (1 − λ)α2, β = λβ1 + (1 − λ)β2, γ = λγ1, εt = λε1t + (1 − λ)ε2t and 0 ≤ λ ≤ 1 is a scalar.

Suppose we claim that equation (8) is the true demand equation because it contains pt and it, which influence the demand. But we note that it is difficult to decide which of the two equations (6) or (8) is the true demand equation.

Suppose now we claim that equation (8) is the true supply equation. This claim is wrong because income does not affect the supply. So equation (7) is the supply equation.

Thus the supply equation is now identifiable but the demand equation is not identifiable. Such a situation is termed partial identification.

Now we find the reduced form of the structural equations (6) and (7). This is achieved by first solving equation (6) for pt and substituting it in equation (7) to obtain an equation in qt. Such an exercise yields reduced form equations of the following form:

pt = π11 + π12 it + v1t     (9)
qt = π21 + π22 it + v2t.    (10)
3

Applying OLS to equations (9) and (10), we get the OLSEs π̂11, π̂12, π̂21, π̂22. The four reduced form coefficients are
functions of the five structural unknowns (α1, α2, β1, β2, γ1). With four equations in five unknowns, the parameters are not
determined uniquely. Thus the model as a whole is not identifiable.

However, here

β2 = π22 / π12
α2 = π21 − (π11 / π12) π22.

If πˆ11 , πˆ 22 , πˆ12 and πˆ 21 are available, then αˆ 2 and βˆ2 can be obtained by substituting πˆ ' s in place of π ' s. So
α2 and β 2 are determined uniquely which are the parameters of supply equation. So supply equation is identified but
demand equation is still not identifiable.

Now, as done earlier, we introduce another exogenous variable, rainfall, denoted as rt, which is the amount of rainfall
at time t. Rainfall influences the supply because better rainfall produces a better yield of wheat. On the other hand, the
demand for wheat is not influenced by rainfall. So the updated set of structural equations is
Demand : qt =α1 + β1 pt + γ 1it + ε1t (11)
Supply: qt =α 2 + β 2 pt + δ 2 rt + ε 2t . (12)
The pretended equation is obtained by adding together the equations obtained after multiplying equation (11) by λ and
equation (12) by (1 − λ ) as follows:

λ qt + (1 − λ) qt = [λα1 + (1 − λ)α2] + [λβ1 + (1 − λ)β2] pt + λγ1 it + (1 − λ)δ2 rt + [λε1t + (1 − λ)ε2t]

or qt = α + β pt + γ it + δ rt + εt        (13)

where α = λα1 + (1 − λ)α2, β = λβ1 + (1 − λ)β2, γ = λγ1, δ = (1 − λ)δ2, εt = λε1t + (1 − λ)ε2t and 0 ≤ λ ≤ 1 is a scalar.


4

Now we claim that equation (13) is a demand equation. The demand does not depend on rainfall. So unless λ =1
so that rt is absent in the model, the equation (13) cannot be a demand equation. Thus equation (11) is a demand
equation. So demand equation is identified.

Now we claim that equation (13) is supply equation. The supply is not influenced by the income of buyer, so (13) cannot be
a supply equation. Thus equation (12) is the supply equation. So now the supply equation is also identified.

The reduced form model from structural equations (11) and (12) can be obtained which have the following forms:

pt =π 11 + π 12it + π 13 rt + v1t (14)


qt =π 21 + π 22it + π 23 rt + v2t . (15)

Application of the OLS technique to equations (14) and (15) yields the OLSEs π̂11, π̂12, π̂13, π̂21, π̂22 and π̂23. There are now
six equations relating the reduced form coefficients to the six structural unknowns α1, α2, β1, β2, γ1 and δ2. So all the
structural parameters are uniquely determined, and equations (11) and (12) are exactly identified.

Finally, we introduce a lagged endogenous variable pt - 1 which denotes the price of commodity on previous day. Since only
the supply of wheat is affected by the price on the previous day, so it is introduced in the supply equation only as

Demand : qt =α1 + β1 pt + γ 1it + ε1t (16)


Supply: qt =α 2 + β 2 pt + δ 2 rt + θ 2 pt −1 + ε 2t (17)

where θ 2 is the structural coefficient associated with pt - 1 .


5

The pretended equation is obtained by first multiplying equations (16) by λ and (17) by (1 − λ ) and then adding them
together as follows:
λ qt + (1 − λ) qt = α + β pt + γ it + δ rt + (1 − λ)θ2 pt−1 + εt
or qt = α + β pt + γ it + δ rt + θ pt−1 + εt,        (18)
where θ = (1 − λ)θ2 and α, β, γ, δ, εt are as defined earlier.

Now we claim that equation (18) represents the demand equation. Since rainfall and lagged price do not affect the
demand, so equation (18) cannot be demand equation. Thus equation (16) is a demand equation and the demand
equation is identified.

Now finally, we claim that equation (18) is the supply equation. Since income does not affect supply, so equation (18)
cannot be a supply equation. Thus equation (17) is supply equation and the supply equation is identified.

The reduced form equations corresponding to equations (16) and (17) can be obtained in the following form:

pt = π11 + π12 it + π13 rt + π14 pt−1 + v1t        (19)

qt = π21 + π22 it + π23 rt + π24 pt−1 + v2t.       (20)

Applying the OLS technique to equations (19) and (20) gives the OLSEs π̂11, π̂12, π̂13, π̂14, π̂21, π̂22, π̂23 and π̂24.

So there are eight equations in the seven structural parameters α1, α2, β1, β2, γ1, δ2 and θ2, and the system is
over-determined: the structural parameters can be recovered, but not uniquely. In fact, in this case, the supply equation (17)
is exactly identified and the demand equation (16) is over-identified (in the sense of multiple solutions).
6
The whole analysis in this example can be classified into three categories –

(1) Under identifiable case


The estimation of parameters is not possible at all in this case. Not enough equations are available to recover the
structural parameters.

(2) Exactly identifiable case


The estimation of parameters is possible in this case. The OLSE of reduced form coefficients leads to unique estimates
of structural coefficients.

(3) Over identifiable case


The estimation of parameters in this case is possible. The OLSE of reduced form coefficients leads to multiple estimates
of structural coefficients.
7

Analysis
Suppose there are G jointly dependent (endogenous) variables y1, y2, ..., yG and K predetermined (exogenous) variables
x1, x2, ..., xK. Let there be n observations available on each variable, and let there be G structural equations connecting
these variables which describe the complete model as follows:

β11 y1t + β12 y2t + ... + β1G yGt + γ11 x1t + γ12 x2t + ... + γ1K xKt = ε1t
β21 y1t + β22 y2t + ... + β2G yGt + γ21 x1t + γ22 x2t + ... + γ2K xKt = ε2t
⋮
βG1 y1t + βG2 y2t + ... + βGG yGt + γG1 x1t + γG2 x2t + ... + γGK xKt = εGt.

These equations can be expressed in matrix form as

[ β11 β12 ⋯ β1G ; ⋮ ; βG1 βG2 ⋯ βGG ] (y1t, ..., yGt)' + [ γ11 γ12 ⋯ γ1K ; ⋮ ; γG1 γG2 ⋯ γGK ] (x1t, ..., xKt)' = (ε1t, ..., εGt)'

or

S : B yt + Γ xt = εt,   t = 1, 2, ..., n,

where B is the (G × G) matrix of unknown coefficients of the jointly dependent variables, yt is the (G × 1) vector of
observations on the G jointly dependent variables at time t, Γ is the (G × K) matrix of structural coefficients of the
predetermined variables, xt is the (K × 1) vector of observations on the K predetermined variables and εt is the (G × 1)
vector of structural disturbances. The structural form S describes the functioning of the model at time t.
8

Assuming B is nonsingular, premultiplying the structural form equations by B^{-1}, we get

B^{-1} B yt + B^{-1} Γ xt = B^{-1} εt
or yt = π xt + vt,   t = 1, 2, ..., n.

This is the reduced form of the model, where π = −B^{-1} Γ is the matrix of reduced form coefficients and vt = B^{-1} εt
is the reduced form disturbance vector.

If B is singular, then one or more structural relations would be a linear combination of other structural relations. If B is
non-singular, such identities are eliminated.

The structural form relationship describes the interaction taking place inside the model. In reduced form relationship, the
jointly dependent (endogenous) variables are expressed as linear combination of predetermined (exogenous) variables.
This is the difference between structural and reduced form relationships.

Assume that the εt's are identically and independently distributed as N(0, Σ) and the vt's are identically and
independently distributed as N(0, Ω), where Ω = B^{-1} Σ (B^{-1})', with

E(εt) = 0, E(εt εt') = Σ, E(εt εt*') = 0 for all t ≠ t*,
E(vt) = 0, E(vt vt') = Ω, E(vt vt*') = 0 for all t ≠ t*.
9

The joint probability density function of yt given xt is

p(yt | xt) = p(vt)
           = p(εt) |∂εt/∂vt|
           = p(εt) |det(B)|,

where ∂εt/∂vt is the Jacobian of the transformation and |det(B)| is the absolute value of the determinant of B.

The likelihood function is

L = p(y1, y2, ..., yn | x1, x2, ..., xn)
  = ∏_{t=1}^{n} p(yt | xt)
  = |det(B)|^n ∏_{t=1}^{n} p(εt).

Applying a nonsingular linear transformation on structural equation S with a nonsingular matrix D, we get

D B yt + D Γ xt = D εt
or S* : B* yt + Γ* xt = εt*,   t = 1, 2, ..., n,

where B* = DB, Γ* = DΓ, εt* = D εt, and the structural model S* describes the functioning of this model at time t.

Now find p ( yt xt ) with S* as follows:


10

p(yt | xt) = p(εt*) |∂εt*/∂vt|
           = p(εt*) |det(B*)|.

Also

εt* = D εt  ⇒  ∂εt*/∂εt = D.

Thus

p(yt | xt) = p(εt) |∂εt/∂εt*| |det(B*)|
           = p(εt) |det(D^{-1})| |det(DB)|
           = p(εt) |det(D^{-1})| |det(D)| |det(B)|
           = p(εt) |det(B)|.

The likelihood function corresponding to S* is

L* = |det(B*)|^n ∏_{t=1}^{n} p(εt*)
   = |det(D)|^n |det(B)|^n ∏_{t=1}^{n} p(εt) |∂εt/∂εt*|
   = |det(D)|^n |det(B)|^n ∏_{t=1}^{n} p(εt) |det(D^{-1})|
   = |det(B)|^n ∏_{t=1}^{n} p(εt)
   = L.
11

Thus both the structural forms S and S* have the same likelihood function. Since the likelihood function forms the basis of
statistical analysis, S and S* have the same implications. Moreover, it is impossible to tell whether the likelihood
function corresponds to S or to S*. Any attempt to estimate the parameters will result in failure in the sense that we cannot
know whether we are estimating S or S*. Thus S and S* are observationally equivalent, and so the model is not identifiable.

A parameter is said to be identifiable within the model if the parameter has the same value for all equivalent structures
contained in the model.

If all the parameters in structural equation are identifiable, then the structural equation is said to be identifiable.

Given a structure, we can thus find many observationally equivalent structures by non-singular linear transformation.

The apriori restrictions on B and Γ may help in the identification of parameters. The derived structures may not satisfy
these restrictions and may therefore not be admissible.

The presence and/or absence of certain variables helps in the identifiability. So we use and apply some apriori restrictions.
These apriori restrictions may arise from various sources like economic theory, e.g. it is known that the price and income
affect the demand of wheat but rainfall does not affect it. Similarly, supply of wheat depends on income, price and rainfall.
There are many types of restrictions available which can solve the problem of identification. We consider zero-one type
restrictions.
ECONOMETRIC THEORY

MODULE – XVII
Lecture - 42
Simultaneous Equations Models

Dr. Shalabh
Department of Mathematics and Statistics
Indian Institute of Technology Kanpur
2

Zero-one type restrictions


Suppose the apriori restrictions are of zero-one type, i.e., some of the coefficients are one and others are zero.
Without loss of generality, consider S as

B yt + Γ xt = εt,   t = 1, 2, ..., n.

When the zero-one type restrictions are incorporated in the model, suppose there are G∆ jointly dependent and K*
predetermined variables in S having nonzero coefficients. The remaining (G − G∆) jointly dependent and (K − K*)
predetermined variables have zero coefficients.

Without loss of generality, let β ∆ and γ * be the row vectors formed by the nonzero elements in the first row of B and Γ
respectively. Thus the first row of B can be expressed as ( β∆ 0 ) . So B has G∆ coefficients which are one (indicating the
presence) and ( G − G∆ ) coefficients which are zero (indicating the absence).

Similarly the first row of Γ can be written as ( γ * 0 ) . So in Γ, there are K* elements present ( those take value one) and
( K − K* ) elements absent (those take value zero).
The first equation of the model can be rewritten as
β11 y1t + β12 y2t + ... + β1G∆ yG∆t + γ11 x1t + γ12 x2t + ... + γ1K* xK*t = ε1t
or (β∆  0) yt + (γ*  0) xt = ε1t,   t = 1, 2, ..., n.
Assume every equation describes the behaviour of a particular variable, so that we can take β11 = 1. If β11 ≠ 1, then
divide the whole equation by β11 so that the coefficient of y1t is one.

So the first equation of the model becomes


y1t + β12 y2t + ... + β1G∆ yG∆t + γ11 x1t + γ12 x2t + ... + γ1K* xK*t = ε1t.
Now the reduced form coefficient relationship is

π = −B^{-1} Γ
or B π = −Γ
or (β∆  0) π = −(γ*  0),

where β∆ has G∆ elements, the adjoining zero block has (G − G∆) elements, γ* has K* elements and its adjoining zero
block has (K − K*) elements.

Partition

π = [ π∆*  π∆** ; π∆∆*  π∆∆** ],

where the orders of π∆*, π∆**, π∆∆* and π∆∆** are (G∆ × K*), (G∆ × K**), (G∆∆ × K*) and (G∆∆ × K**) respectively, with
G∆∆ = G − G∆ and K** = K − K*.

We can re-express

(β∆  0) π = −(γ*  0)
or (β∆  0∆∆) [ π∆*  π∆** ; π∆∆*  π∆∆** ] = −(γ*  0**)
⇒ β∆ π∆* = −γ*        (i)
   β∆ π∆** = 0**.      (ii)

Assume π is known. Then (i) gives a unique solution for γ* if β ∆ is uniquely found from (ii). Thus identifiability of S lies
upon the unique determination of β ∆. Out of G∆ elements in β ∆ , one has coefficient 1, so there are ( G∆ − 1) elements in
β ∆ that are unknown.
4

Note that

β∆ = (1, β12, ..., β1G∆).

As

β∆ π∆** = 0**,

the (G∆ − 1) unknown elements of β∆ are uniquely determined as a non-trivial solution when

rank(π∆**) = G∆ − 1.

Thus β∆ is identifiable when

rank(π∆**) = G∆ − 1.
This is known as rank condition for the identifiability of parameters in S. This condition is necessary and sufficient.

Another condition, known as order condition is only necessary and not sufficient. The order condition is derived as
follows:

We now use the result that for any matrix A of order m × n, rank(A) ≤ min(m, n); so if rank(A) = m, then necessarily n ≥ m.
Since β∆ π∆** = 0** and the (G∆ − 1) unknown elements of β∆ are identifiable when rank(π∆**) = G∆ − 1, while π∆** has only
K** = K − K* columns, this requires

K − K* ≥ G∆ − 1.

This is known as the order condition for identifiability.


5

There are various ways in which these conditions can be represented to have meaningful interpretations. We discuss them
as follows:

1. ( G − G∆ ) + ( K − K* ) ≥ ( G − G∆ ) + ( G∆ − 1)
or ( G − G∆ ) + ( K − K* ) ≥ G − 1.

Here

G − G∆ : the number of jointly dependent variables left out of the equation
y1t + β12 y2t + ... + β1G∆ yG∆t + γ11 x1t + ... + γ1K* xK*t = ε1t,

K − K* : the number of predetermined variables left out of the same equation,

G − 1 : the total number of equations minus 1.

So the left hand side of the condition is the total number of variables excluded from this equation.

Thus, if the total number of variables excluded from this equation is at least (G − 1), the order condition for identifiability
of the equation is satisfied.
6
2. K ≥ K* + G∆ − 1
Here
K* : the number of predetermined variables present in the equation.
G∆ : the number of jointly dependent variables present in the equation.

3. Define L = K − K* − (G∆ − 1)
L measures the degree of overidentification.

If L = 0 then the equation is said to be exactly identified.


L > 0 , then the equation is said to be over identified.
L < 0 ⇒ the parameters cannot be estimated.
L= 0 ⇒ the unique estimates of parameters are obtained.
L > 0 ⇒ the parameters are estimated with multiple estimators.

So by looking at the value of L, we can check the identifiability of the equation.

Rank condition tells whether the equation is identified or not. If identified, then the order condition tells whether the
equation is exactly identified or over identified.

Note
We have illustrated the various aspects of estimation and conditions for identification in the simultaneous equation system
using the first equation of the system. It may be noted that this can be done for any equation of the system and it is not
necessary to consider only the first equation. From now onwards, we do not restrict ourselves to the first equation but
consider any, say the ith, equation (i = 1, 2, ..., G).
7

Working rule for rank condition


The checking of rank condition sometimes, in practice, can be a difficult task. An equivalent form of rank condition based
on the partitioning of structural coefficients is as follows.
Suppose

B = [ β∆  0∆∆ ; B∆  B∆∆ ],   Γ = [ γ*  0** ; Γ*  Γ** ],

where β∆, γ*, 0∆∆ and 0** are row vectors consisting of G∆, K*, G∆∆ and K** elements respectively. Similarly, the orders
of B∆, B∆∆, Γ* and Γ** are ((G − 1) × G∆), ((G − 1) × G∆∆), ((G − 1) × K*) and ((G − 1) × K**) respectively.

Note that B∆∆ and Γ** are the matrices of structural coefficients of the variables omitted from the equation under study
but included in the other structural equations.

Form a new matrix

∆ = (B  Γ) = [ β∆  0∆∆  γ*  0** ; B∆  B∆∆  Γ*  Γ** ].

Then

B^{-1} ∆ = (B^{-1} B   B^{-1} Γ) = (I   −π) = [ I∆∆  0  −π∆,*  −π∆,** ; 0  I∆∆,∆∆  −π∆∆,*  −π∆∆,** ].
8
If

∆* = [ 0∆∆  0** ; B∆∆  Γ** ],

i.e., ∆* consists of the columns of ∆ corresponding to the variables omitted from the equation under study, then clearly
rank(∆*) = rank(B∆∆  Γ**), since the rank of a matrix is not affected by enlarging it with a row of zeros or by switching
columns.

Now using the result that multiplying a matrix A by a nonsingular matrix leaves its rank unchanged, we can write

rank(∆*) = rank(B^{-1} ∆*)

rank(B^{-1} ∆*) = rank [ 0  −π∆,** ; I∆∆,∆∆  −π∆∆,** ]
               = rank(I∆∆,∆∆) + rank(π∆,**)
               = (G − G∆) + (G∆ − 1)
               = G − 1

rank(B^{-1} ∆*) = rank(∆*) = rank(B∆∆  Γ**).

So the equation is identifiable when

rank(B∆∆  Γ**) = G − 1.

Note that (B∆∆  Γ**) is a matrix constructed from the coefficients of the variables excluded from that particular equation
but included in the other equations of the model. If rank(B∆∆  Γ**) = G − 1, then the equation is identifiable, and this is a
necessary and sufficient condition. An advantage of this form is that it avoids the inversion of a matrix. A working rule
based on it is as follows.
9

Working rule
1. Write the model in tabular form by putting ‘X’ if the variable is present and ‘0’ if the variable is absent.

2. For the equation under study, mark the 0’s (zeros) and pick up the corresponding columns suppressing that row.

3. If we can choose (G − 1) rows and (G − 1) columns of this submatrix such that no chosen row and no chosen column has all elements zero, then the equation is identified.
10

Example 2
Now we discuss the identifiability of following simultaneous equations model with three structural equations.

(1) y1 = β13 y3 + γ 11 x1 + γ 13 x3 + ε1
(2) y1 = γ 21 x1 + γ 23 x3 + ε 2
(3) y2 = β 33 y3 + γ 31 x1 + γ 32 x2 + ε 3.

First we represent the equation (1)-(3) in tabular form as follows:

Equation number   y1   y2   y3   x1   x2   x3   G∆ − 1      K*   L
1                 X    0    X    X    0    X    2 − 1 = 1   2    3 − 3 = 0
2                 X    0    0    X    0    X    1 − 1 = 0   2    3 − 2 = 1
3                 0    X    X    X    X    0    2 − 1 = 1   2    3 − 3 = 0

 G = Number of equations = 3.

 ‘X’ denotes the presence and ‘0’ denotes the absence of variables in an equation.

 G∆ = Numbers of ‘X’ in ( y1 y2 y3 ) .

 K* = Number of ‘X’ in ( x1 x2 x3 ) .
11

Consider equation (1)

 Write columns corresponding to ‘0’ which are columns corresponding to y2 and x2 and write as follows

y2 x2

Equation (2) 0 0
Equation (3) X X

 Check whether any row or column of this block has all elements ‘0’. If we can pick up a block of order (G − 1) in which
no row and no column has all elements ‘0’, then the equation is identifiable.

Here G − 1 = 3 − 1 = 2, B∆∆ = (0, X)' and Γ** = (0, X)'. So we need a (2 × 2) block in which no row and no column has all
elements ‘0’. In the case of equation (1), the first row is all ‘0’, so the equation is not identifiable.

Notice that the order condition gives L = 0, which would suggest that the equation is exactly identified; this conclusion

is misleading because the order condition is only necessary, not sufficient.
12

Consider equation (2)

 Identify ‘0’ and then

y2 y3 x2

Equation (1) 0 X 0
Equation (3) X X X

 B∆∆ = [ 0  X ; X  X ],   Γ** = (0, X)'.

 G − 1 = 3 − 1 = 2.
 We see that there is at least one (2 × 2) block in which no row and no column is all ‘0’. So equation (2) is identified.
 Also, L =1 > 0 ⇒ Equation (2) is over identified.

Consider equation (3)

 Identify ‘0’
y1 x3

Equation (1) X X
Equation (2) X X

 So B∆∆ = (X, X)' and Γ** = (X, X)'.
 We see that there is no ‘0’ present in the block. So equation (3) is identified .

 Also L= 0 ⇒ Equation (3) is exactly identified.
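The working rule above can be mechanized. The following Python sketch (an illustration, not part of the lecture; the array layout, variable names and the brute-force search are assumptions) applies the order condition and the working rule for the rank condition to the three equations of Example 2.

# A sketch of the working rule applied to Example 2: 'X' (coded 1) marks a
# variable present in an equation, '0' marks it absent.
from itertools import combinations
import numpy as np

presence = np.array([          # columns: y1, y2, y3, x1, x2, x3
    [1, 0, 1, 1, 0, 1],        # equation (1)
    [1, 0, 0, 1, 0, 1],        # equation (2)
    [0, 1, 1, 1, 1, 0],        # equation (3)
])
G = presence.shape[0]          # number of equations / endogenous variables
K = 3                          # number of predetermined variables

def order_condition_L(i):
    """L = K - K_* - (G_delta - 1) for equation i."""
    G_delta = presence[i, :G].sum()    # endogenous variables present
    K_star = presence[i, G:].sum()     # predetermined variables present
    return K - K_star - (G_delta - 1)

def rank_condition(i):
    """Working rule: take the columns excluded from equation i and the rows of
    the other equations; look for a (G-1)x(G-1) block with no all-zero row/column."""
    cols = np.where(presence[i] == 0)[0]
    rows = [r for r in range(G) if r != i]
    sub = presence[np.ix_(rows, cols)]
    for c in combinations(range(sub.shape[1]), G - 1):
        block = sub[:, c]
        if block.sum(axis=1).all() and block.sum(axis=0).all():
            return True
    return False

for i in range(G):
    print(f"Equation ({i+1}): L = {order_condition_L(i)}, rank condition: {rank_condition(i)}")
# Expected: equation (1) fails the rank condition even though L = 0;
# equation (2) satisfies it with L = 1; equation (3) satisfies it with L = 0.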


ECONOMETRIC THEORY

MODULE – XVII
Lecture - 43
Simultaneous Equations Models

Dr. Shalabh
Department of Mathematics and Statistics
Indian Institute of Technology Kanpur
2

Estimation of parameters
To estimate the parameters of the structural equation, we assume that the equations are identifiable.
Consider the first equation of the model
β11 y1t + β12 y2t + ... + β1G yGt + γ11 x1t + γ12 x2t + ... + γ1K xKt = ε1t,   t = 1, 2, ..., n.

Assume β11 = 1 and incorporate the zero-one type restrictions. Then we have

y1t = −β12 y2t − ... − β1G∆ yG∆t − γ11 x1t − γ12 x2t − ... − γ1K* xK*t + ε1t,   t = 1, 2, ..., n.
Writing this equation in vector and matrix notations by collecting all n observations, we can write

y1 =Y1β + X 1γ + ε1
where

y1 = (y11, y12, ..., y1n)',
Y1 = [ y21  y31  ⋯  yG∆1 ; y22  y32  ⋯  yG∆2 ; ⋮ ; y2n  y3n  ⋯  yG∆n ],
X1 = [ x11  x21  ⋯  xK*1 ; x12  x22  ⋯  xK*2 ; ⋮ ; x1n  x2n  ⋯  xK*n ],
β = (−β12, −β13, ..., −β1G∆)',   γ = (−γ11, −γ12, ..., −γ1K*)'.

The order of y1 is (n × 1), Y1 is (n × (G∆ − 1)), X1 is (n × K*), β is ((G∆ − 1) × 1) and γ is (K* × 1).

This describes one equation of the structural model.


3

Now we describe the general notations in this model.

Consider the model with incorporation of zero-one type restrictions as


y1 =Y1β + X 1γ + ε
where y1 is a ( n × 1) vector of jointly dependent variables, Y1 is ( n × (G∆ − 1) ) matrix of jointly dependent variables where
G∆ denote the number of jointly dependent variables present on the right hand side of the equation, β is a ((G∆ − 1) × 1)
vector of associated structural coefficients, X1 is a ( n × K* ) matrix of n observations on each of the K* predetermined
variables and ε is a ( n × 1) vector of structural disturbances. This equation is one of the equations of complete
simultaneous equation model.

Stacking all the observations according to the variable, rather than time, the complete model consisting of G equations
describing the structural equation model can be written as

Y B + X Γ = Φ

where Y is a matrix ( n × G ) of observation on jointly dependent variables, B is ( G × G ) matrix of associated structural


coefficients, X is (n× K ) matrix of observations on predetermined variables, Γ is ( K × G ) matrix of associated
structural coefficients and Φ is a ( n × G ) matrix of structural disturbances.
4

Assume E(Φ) = 0 and (1/n) E(Φ'Φ) = Σ, where Σ is a positive definite symmetric matrix.
n
The reduced form of the model is obtained from the structural equation model by postmultiplying by B^{-1} as

Y B B^{-1} + X Γ B^{-1} = Φ B^{-1}
or Y = X π + V,

where π = −Γ B^{-1} and V = Φ B^{-1} are the matrices of reduced form coefficients and reduced form disturbances

respectively.

The structural equation is expressed as


y =Y1β + X 1γ + ε1
β 
= (Y1 X 1 )   + ε1
γ 
= Aδ + ε

=
where A (=
Y1 , X 1 ) , δ ( β γ ) ' and ε = ε1 . This model looks like a multiple linear regression model.
Since X1 is a submatrix of X, so it can be expressed as

X 1 = XJ1
where J1 is called as a select matrix and consists of two types of elements, viz., 0 and 1. If the corresponding variable is
present, its value is 1 and if absent, its value is 0.
5

Method of consistent estimation of parameters

1. Indirect least squares (ILS) method


This method for the consistent estimation of parameters is available only for exactly identified equations.
If equations are exactly identifiable, then
K = G∆ + K* − 1.
Step 1: Apply ordinary least squares to each of the reduced form equations and estimate the reduced form coefficient
matrix.
Step 2: Find algebraic relations between structural and reduced form coefficients. Then find the structural coefficients.

The structural model at time t is


β yt + Γ=
xt ε t=
; t 1, 2,..., n
where
yt (=
y1t , y2t ,..., yGt ) ', xt ( x1t , x2t ,..., xKt ) '.

Stacking all n such equations, the structural model is obtained as

Y B + X Γ = Φ,

where Y is an (n × G) matrix, X is an (n × K) matrix and Φ is an (n × G) matrix.


The reduced form equation is obtained by postmultiplying by B^{-1} as

Y B B^{-1} + X Γ B^{-1} = Φ B^{-1}
or Y = X π + V,

where π = −Γ B^{-1} and V = Φ B^{-1}. Applying OLS to the reduced form equation yields the OLSE of π as

π̂ = (X'X)^{-1} X'Y.
6

This is the first step of the ILS procedure and yields the set of estimated reduced form coefficients.
Suppose we are interested in the estimation of the following structural equation:

y1 = Y1 β + X1 γ + ε,

where y1 is the (n × 1) vector of n observations on the dependent (endogenous) variable, Y1 is the (n × (G∆ − 1)) matrix of
observations on the other current endogenous variables present in the equation, X1 is the (n × K*) matrix of observations
on the K* predetermined (exogenous) variables in the equation and ε is the (n × 1) vector of structural disturbances.
Write this model as

(y1  Y1  X1) (1  −β'  −γ')' = ε,

or, more generally,

(y1  Y1  Y2  X1  X2) (1  −β'  0'  −γ'  0')' = ε,

where Y2 and X2 are the matrices of observations on the (G − G∆) endogenous and (K − K*) predetermined variables
which are excluded from the equation due to the zero-one type restrictions.
Write

π B = −Γ
or π (1  −β'  0')' = (γ'  0')'.
 
7

Substitute π by π̂ = (X'X)^{-1} X'Y and solve for β and γ. This gives the indirect least squares estimators b and c of
β and γ respectively, obtained by solving

π̂ (1  −b'  0')' = (c'  0')'
or (X'X)^{-1} X'Y (1  −b'  0')' = (c'  0')'
or (X'X)^{-1} X' (y1  Y1  Y2) (1  −b'  0')' = (c'  0')'
⇒ X'y1 − X'Y1 b = X'X (c'  0')'.

Since X = (X1  X2), this gives

(X1'Y1) b + (X1'X1) c = X1'y1        (i)
(X2'Y1) b + (X2'X1) c = X2'y1.       (ii)

Equations (i) and (ii) are K equations in the (G∆ + K* − 1) unknowns. Solving them gives the ILS
estimators of β and γ.
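As a numerical illustration, the following Python sketch (not from the lecture; the simulated parameter values and sample size are assumptions) applies ILS to the exactly identified supply equation of the earlier demand-supply example with income, using the relations β2 = π22/π12 and α2 = π21 − β2 π11.

# A sketch of indirect least squares for the supply equation q_t = a2 + b2*p_t + e2t.
import numpy as np

rng = np.random.default_rng(0)
n = 5000
a1, b1, g1 = 10.0, -1.2, 0.8          # demand: q = a1 + b1*p + g1*i + e1
a2, b2 = 2.0, 0.9                     # supply: q = a2 + b2*p + e2

i_t = rng.normal(20, 4, n)            # exogenous income
e1, e2 = rng.normal(0, 1, n), rng.normal(0, 1, n)

# Equilibrium price and quantity from the two structural equations.
p_t = (a1 - a2 + g1 * i_t + e1 - e2) / (b2 - b1)
q_t = a2 + b2 * p_t + e2

# Step 1: OLS on the reduced form equations p = pi11 + pi12*i and q = pi21 + pi22*i.
Z = np.column_stack([np.ones(n), i_t])
pi11, pi12 = np.linalg.lstsq(Z, p_t, rcond=None)[0]
pi21, pi22 = np.linalg.lstsq(Z, q_t, rcond=None)[0]

# Step 2: recover the structural supply coefficients from the reduced form.
b2_ils = pi22 / pi12
a2_ils = pi21 - b2_ils * pi11
print(f"ILS estimates: a2 ≈ {a2_ils:.3f}, b2 ≈ {b2_ils:.3f}")   # near (2.0, 0.9)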
8
2. Two stage least squares (2SLS) or generalized classical linear (GCL) method
This is more widely used estimation procedure as it is applicable to exactly as well as overidentified equations. The least
squares method is applied in two stages in this method.
Consider equation y1 =Y1β + X 1γ + ε .

Stage 1: Apply the least squares method to estimate the reduced form parameters in the reduced form model

Y1 = X π1 + V1
⇒ π̂1 = (X'X)^{-1} X'Y1
⇒ Ŷ1 = X π̂1.
Stage 2: Replace Y1 in the structural equation y1 = Y1 β + X1 γ + ε by Ŷ1 and apply OLS to the structural equation thus
obtained:

y1 = Ŷ1 β + X1 γ + ε
   = X (X'X)^{-1} X' Y1 β + X1 γ + ε
   = [X (X'X)^{-1} X' Y1   X1] (β'  γ')' + ε
   = Â δ + ε,

where Â = [X (X'X)^{-1} X' Y1   X1] = [Ŷ1   X1] = HA, A = [Y1   X1], H = X (X'X)^{-1} X' is idempotent and δ = (β'  γ')'.

Applying OLS to y1 = Â δ + ε gives the OLSE of δ as

δ̂ = (Â'Â)^{-1} Â' y1
   = (A'HA)^{-1} A'H y1

or

(β̂'  γ̂')' = [ Y1'HY1   Y1'X1 ; X1'Y1   X1'X1 ]^{-1} [ Y1'H y1 ; X1'y1 ]
           = [ Y1'Y1 − V̂1'V̂1   Y1'X1 ; X1'Y1   X1'X1 ]^{-1} [ (Y1' − V̂1') y1 ; X1'y1 ],
9
where V1 = Y1 − X π1 is estimated by V̂1 = Y1 − X π̂1 = (I − H) Y1. Solving these equations gives β̂
and γ̂, which are the two stage least squares estimators of β and γ respectively.
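The two stages can be written compactly in code. The following Python sketch (not from the lecture; the helper function, data layout and simulated values are assumptions) estimates the supply equation of the income-and-rainfall example by 2SLS.

# A sketch of two stage least squares for one structural equation
# y1 = Y1*beta + X1*gamma + eps.
import numpy as np

def two_stage_least_squares(y1, Y1, X1, X):
    """Stage 1: regress the included endogenous regressors Y1 on all
    predetermined variables X. Stage 2: OLS of y1 on (Y1_hat, X1)."""
    pi1_hat = np.linalg.lstsq(X, Y1, rcond=None)[0]   # (X'X)^{-1} X'Y1
    Y1_hat = X @ pi1_hat                              # fitted values H Y1
    A_hat = np.column_stack([Y1_hat, X1])             # stage 2 design matrix
    return np.linalg.solve(A_hat.T @ A_hat, A_hat.T @ y1)

rng = np.random.default_rng(1)
n = 4000
i_t, r_t = rng.normal(20, 4, n), rng.normal(5, 2, n)
e1, e2 = rng.normal(0, 1, n), rng.normal(0, 1, n)
a1, b1, g1 = 10.0, -1.2, 0.8          # demand: q = a1 + b1*p + g1*i + e1
a2, b2, d2 = 2.0, 0.9, 0.5            # supply: q = a2 + b2*p + d2*r + e2
p_t = (a1 - a2 + g1 * i_t - d2 * r_t + e1 - e2) / (b2 - b1)
q_t = a2 + b2 * p_t + d2 * r_t + e2

X = np.column_stack([np.ones(n), i_t, r_t])   # all predetermined variables
# Estimate the supply equation: endogenous regressor p_t, included exogenous (1, r_t).
delta = two_stage_least_squares(q_t, p_t.reshape(-1, 1),
                                np.column_stack([np.ones(n), r_t]), X)
print("2SLS (b2, a2, d2):", np.round(delta, 3))   # should be near (0.9, 2.0, 0.5)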
Now we see the consistency of δˆ .

δ̂ = (Â'Â)^{-1} Â' y1
   = (Â'Â)^{-1} Â' (Â δ + ε)

δ̂ − δ = (Â'Â)^{-1} Â' ε

plim(δ̂ − δ) = plim[(1/n) Â'Â]^{-1} plim[(1/n) Â'ε].

The 2SLS estimator δ̂ is consistent if

plim[(1/n) Â'ε] = ( plim[(1/n) Ŷ1'ε] , plim[(1/n) X1'ε] )' = 0.

Since, by assumption, the X variables are uncorrelated with ε in the limit,

plim[(1/n) X'ε] = 0.

For plim[(1/n) Ŷ1'ε], we observe that
10
1  1 
plim  Yˆ1'ε  = plim  πˆ1 X ' ε 
n  n 
 1 
= ( plimπˆ1 )  plim X ' ε 
 n 
= π 1.0
= 0.

( )
Thus plim δˆ -δ = 0 and so the 2SLS estimators are consistent.

The asymptotic covariance matrix of δ̂ is

Asy Var(δ̂) = n^{-1} plim[ n (δ̂ − δ)(δ̂ − δ)' ]
            = n^{-1} plim[ n (Â'Â)^{-1} Â'εε'Â (Â'Â)^{-1} ]
            = n^{-1} plim[(1/n) Â'Â]^{-1} plim[(1/n) Â'εε'Â] plim[(1/n) Â'Â]^{-1}
            = n^{-1} σε² plim[(1/n) Â'Â]^{-1},

where Var(ε) = σε².

The asymptotic covariance matrix is estimated by

s² (Â'Â)^{-1} = s² [ Y1'X(X'X)^{-1}X'Y1   Y1'X1 ; X1'Y1   X1'X1 ]^{-1},

where

s² = (y1 − Y1 β̂ − X1 γ̂)' (y1 − Y1 β̂ − X1 γ̂) / (n − G − K).
11

Other estimation methods

There are some other methods available for the estimation of parameters like

• Limited information maximum likelihood estimator

• Three stage least squares (3SLS)

• Full information maximum likelihood (FIML) method etc.

Types of prior information for restriction

We have used zero-one type restrictions. Some other type of prior information that can be used for restrictions are

1. Normalizations.
2. Restrictions among parameters within each structural equation.
3. Restriction among parameters across structural equations.
4. Restrictions on some of the elements of Σ.
ECONOMETRIC THEORY

MODULE – XVIII
Lecture - 44
Seemingly Unrelated Regression
Equations Models
Dr. Shalabh
Department of Mathematics and Statistics
Indian Institute of Technology Kanpur
2
A basic nature of multiple regression model is that it describes the behaviour of a particular study variable based on a set of

explanatory variables. When the objective is to explain the whole system, there may be more than one multiple regression

equations. For example, in a set of individual linear multiple regression equations, each equation may explain some

economic phenomenon. One approach to handle such a set of equations is to consider the set up of simultaneous

equations model in which one or more of the explanatory variables in one or more equations are themselves the dependent

(endogenous) variable associated with another equation in the full system. On the other hand, suppose that none of the

variables in the system are simultaneously both explanatory and dependent in nature. There may still be interactions

between the individual equations if the random error components associated with at least some of the different equations

are correlated with each other. This means that the equations may be linked statistically, even though not structurally –

through the jointness of the distribution of the error terms and through the non-diagonal covariance matrix. Such a

behaviour is reflected in the seemingly unrelated regression equations (SURE) model in which the individual equations are

in fact related to one another, even though superficially they may not seem to be.

The basic philosophy of the SURE model is as follows. The jointness of the equations is explained by the structure of the

SURE model and the covariance matrix of the associated disturbances. Such jointness introduces additional information

which is over and above the information available when the individual equations are considered separately. So it is desired

to consider all the separate relationships collectively to draw the statistical inferences about the model parameters.
3
Example

Suppose a country has 20 states and the objective is to study the consumption pattern of the country. There is one
consumption equation for each state. So altogether there are 20 equations which describe 20 consumption functions. It
is also not necessary that the same variables are present in all the models. Different equations may contain different
variables. It may be noted that the consumption pattern of the neighbouring states may have the characteristics in
common. Apparently, the equations may look distinct individually but there may be some kind of relationship that may be
existing among the equations. Such equations can be used to examine the jointness of the distribution of disturbances. It
seems reasonable to assume that the error terms associated with the equations may be contemporaneously correlated.
The equations are apparently or “seemingly” unrelated regressions rather than independent relationships.

Model

We consider here a model comprising of M multiple regression equations of the form


yti = Σ_{j=1}^{ki} xtij βij + εti,   t = 1, 2, ..., T;  i = 1, 2, ..., M;  j = 1, 2, ..., ki,

where yti is the tth observation on the ith dependent variable which is to be explained by the ith regression equation, xtij is
the tth observation on jth explanatory variable appearing in the ith equation, βij is the coefficient associated with xtij at each
observation and ε ti is the tth value of the random error component associated with ith equation of the model.

These M equations can be compactly expressed as


yi = Xi βi + εi,   i = 1, 2, ..., M,

where yi is a (T × 1) vector with elements yti; Xi is a (T × ki) matrix whose columns represent the T observations on the
explanatory variables in the ith equation; βi is a (ki × 1) vector with elements βij; and εi is a (T × 1) vector of
disturbances.
4
These M equations can be further expressed as

(y1', y2', ..., yM')' = [ X1  0  ⋯  0 ; 0  X2  ⋯  0 ; ⋮ ; 0  0  ⋯  XM ] (β1', β2', ..., βM')' + (ε1', ε2', ..., εM')'

or y = X β + ε,

where the orders of y, X, β and ε are (TM × 1), (TM × k*), (k* × 1) and (TM × 1) respectively, and k* = Σi ki.

Treat each of the M equations as the classical regression model and make conventional assumptions for i = 1, 2,..., M as

 Xi is fixed.

 rank(Xi) = ki.

 lim_{T→∞} (1/T) Xi'Xi = Qii, where Qii is nonsingular with fixed and finite elements.

 E(εi) = 0.

 E(εi εi') = σii IT, where σii is the variance of the disturbances in the ith equation for each observation in the sample.

Considering the interactions between the M equations of the model, we assume

• lim_{T→∞} (1/T) Xi'Xj = Qij

• E(εi εj') = σij IT,   i, j = 1, 2, ..., M,

where Qij is a nonsingular matrix with fixed and finite elements and σij is the covariance between the disturbances of the
ith and jth equations for each observation in the sample.
Compactly, we can write

E(ε) = 0,
E(εε') = [ σ11 IT  σ12 IT  ⋯  σ1M IT ; σ21 IT  σ22 IT  ⋯  σ2M IT ; ⋮ ; σM1 IT  σM2 IT  ⋯  σMM IT ] = Σ ⊗ IT = ψ,

where ⊗ denotes the Kronecker product operator, ψ is an (MT × MT) matrix and Σ = ((σij)) is an (M × M) positive
definite symmetric matrix. The positive definiteness of Σ avoids the possibility of linear dependencies among the
contemporaneous disturbances in the M equations of the model.

The structure E(εε') = Σ ⊗ IT implies that

 the variance of εti is constant for all t,
 the contemporaneous covariance between εti and εtj is constant for all t,
 the intertemporal covariance between εti and εt*j (t ≠ t*) is zero for all i and j.

By using the terminologies “contemporaneous” and “intertemporal” covariance, we are implicitly assuming that the data
are available in time series form but this is not restrictive. The results can be used for cross-section data also. The
constancy of the contemporaneous covariances across sample points is a natural generalization of homoskedastic
disturbances in a single equation model.

It is clear that the M equations may appear to be unrelated in the sense that there is no simultaneity between the
variables in the system and each equation has its own explanatory variables to explain the study variable. The equations
are related stochastically through the disturbances, which are contemporaneously correlated across the equations of the
model. That is why this system is referred to as the SURE model.
6

The SURE model is a particular case of the simultaneous equations model involving M structural equations with M jointly
dependent variables and k (≥ ki for all i) distinct exogenous variables, and in which neither current nor lagged endogenous
variables appear as explanatory variables in any of the structural equations.

The SURE model differs from the multivariate regression model only in the sense that it takes account of prior information
concerning the absence of certain explanatory variables from certain equations of the model. Such exclusions are highly
realistic in many economic situations.

OLS and GLS estimation


The SURE model is
y = X β + ε , E ( ε ) = 0, V ( ε ) = Σ ⊗ IT = ψ .
Assume that ψ is known.
The OLS estimator of β is

b0 = (X'X)^{-1} X'y.

Further,

E(b0) = β
V(b0) = E(b0 − β)(b0 − β)' = (X'X)^{-1} X' ψ X (X'X)^{-1}.
7
The generalized least squares (GLS) estimator of β is

β̂ = (X' ψ^{-1} X)^{-1} X' ψ^{-1} y
   = [X' (Σ^{-1} ⊗ IT) X]^{-1} X' (Σ^{-1} ⊗ IT) y

E(β̂) = β
V(β̂) = E(β̂ − β)(β̂ − β)'
      = (X' ψ^{-1} X)^{-1}
      = [X' (Σ^{-1} ⊗ IT) X]^{-1}.

Define

G = (X'X)^{-1} X' − (X' ψ^{-1} X)^{-1} X' ψ^{-1};

then GX = 0 and we find that

V(b0) − V(β̂) = G ψ G'.

Since ψ is positive definite, G ψ G' is at least positive semidefinite, and so the GLSE is, in general, more efficient than the
OLSE for estimating β. In fact, using the result that the GLSE is the best linear unbiased estimator of β, we can conclude
that β̂ is the best linear unbiased estimator in this case also.
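The GLS computation with known Σ is straightforward to code. The following Python sketch (not from the lecture; the simulated two-equation design, sample size and parameter values are assumptions) builds ψ^{-1} = Σ^{-1} ⊗ IT and compares GLS with OLS.

# A sketch of the GLS estimator for a two-equation SURE model with known Sigma.
import numpy as np

rng = np.random.default_rng(2)
T, M = 200, 2
Sigma = np.array([[1.0, 0.7],
                  [0.7, 2.0]])                 # contemporaneous covariance matrix

X1 = np.column_stack([np.ones(T), rng.normal(size=T)])        # k1 = 2
X2 = np.column_stack([np.ones(T), rng.normal(size=(T, 2))])   # k2 = 3
beta1_true, beta2_true = np.array([1.0, 2.0]), np.array([-1.0, 0.5, 1.5])

# Contemporaneously correlated disturbances and the stacked system y = X beta + eps.
E = rng.multivariate_normal(np.zeros(M), Sigma, size=T)       # T x M
y = np.concatenate([X1 @ beta1_true + E[:, 0], X2 @ beta2_true + E[:, 1]])
X = np.block([[X1, np.zeros((T, X2.shape[1]))],
              [np.zeros((T, X1.shape[1])), X2]])              # block diagonal design

psi_inv = np.kron(np.linalg.inv(Sigma), np.eye(T))            # Sigma^{-1} kron I_T
beta_gls = np.linalg.solve(X.T @ psi_inv @ X, X.T @ psi_inv @ y)
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
print("GLS:", np.round(beta_gls, 3))
print("OLS:", np.round(beta_ols, 3))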
8

Feasible generalized least squares estimation


When Σ is unknown, the GLSE of β cannot be used. Then Σ can be estimated and replaced by an (M × M) matrix S.

With such a replacement, we obtain a feasible generalized least squares (FGLS) estimator of β as

β̂F = [X' (S^{-1} ⊗ IT) X]^{-1} X' (S^{-1} ⊗ IT) y.

Assume that S = ((sij)) is a nonsingular matrix and sij is some estimator of σij.

Estimation of Σ
There are two possible ways to estimate σ ij ' s.

1. Use of unrestricted residuals


Let K be the total number of distinct explanatory variables among the k1, k2, ..., kM variables in the full model
y = X β + ε , E ( ε ) = 0, V ( ε ) = Σ ⊗ IT
and let Z be a T × K observation matrix of these variables.

Regress each of the M study variables on the columns of Z and obtain the (T × 1) residual vectors

ε̂i = yi − Z (Z'Z)^{-1} Z' yi = HZ yi,   i = 1, 2, ..., M,

where HZ = IT − Z (Z'Z)^{-1} Z'.
9
Then obtain

sij = (1/T) ε̂i' ε̂j = (1/T) yi' HZ yj

and construct the matrix S = ((sij)) accordingly.

Since Xi is a submatrix of Z, we can write

Xi = Z Ji,

where Ji is a (K × ki) selection matrix. Then

HZ Xi = Xi − Z (Z'Z)^{-1} Z' Z Ji = Xi − Z Ji = 0,

and thus

yi' HZ yj = (βi' Xi' + εi') HZ (Xj βj + εj) = εi' HZ εj.

Hence

E(sij) = (1/T) E(εi' HZ εj)
       = (1/T) σij tr(HZ)
       = (1 − K/T) σij,

so that

E[ (T/(T − K)) sij ] = σij.

Thus an unbiased estimator of σij is given by (T/(T − K)) sij.
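Putting the pieces together, the following Python sketch (not from the lecture; the function name and the way Z is formed from the union of columns are assumptions) estimates Σ from unrestricted residuals and then computes the FGLS estimator.

# A sketch of FGLS for the SURE model using unrestricted residuals.
import numpy as np

def sure_fgls(y_list, X_list):
    """y_list: list of M (T,) responses; X_list: list of M (T, ki) designs.
    Returns the FGLS estimate of the stacked coefficient vector."""
    T, M = len(y_list[0]), len(y_list)
    # Z: all distinct explanatory variables (here the union of the columns).
    Z = np.unique(np.column_stack(X_list), axis=1)
    H_Z = np.eye(T) - Z @ np.linalg.solve(Z.T @ Z, Z.T)        # annihilator of Z
    resid = np.column_stack([H_Z @ yi for yi in y_list])        # unrestricted residuals
    S = (resid.T @ resid) / T                                   # s_ij = resid_i' resid_j / T

    # Stack the system y = X beta + eps and apply GLS with S in place of Sigma.
    ks = [Xi.shape[1] for Xi in X_list]
    X = np.zeros((T * M, sum(ks)))
    for i, Xi in enumerate(X_list):
        X[i * T:(i + 1) * T, sum(ks[:i]):sum(ks[:i + 1])] = Xi
    y = np.concatenate(y_list)
    psi_inv = np.kron(np.linalg.inv(S), np.eye(T))
    return np.linalg.solve(X.T @ psi_inv @ X, X.T @ psi_inv @ y)

# Example call, reusing the simulated two-equation data from the GLS sketch above:
# beta_fgls = sure_fgls([y[:T], y[T:]], [X1, X2])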
10

2. Use of restricted residuals


In this approach to find an estimator of σ ij , the residuals obtained by taking into account the restrictions on the coefficients

which distinguish the SURE model from the multivariate regression model are used as follows.
Regress yi on Xi , i.e., regress each equation, i = 1,2,…,M by OLS and obtain the residual vector

u=  I − X ( X ' X ) −1 X '  y


i  i i i i i

= H X i yi .

A consistent estimator of σij is obtained as

sij* = (1/T) ui' uj = (1/T) yi' HXi HXj yj,

where

HXi = I − Xi (Xi'Xi)^{-1} Xi'
HXj = I − Xj (Xj'Xj)^{-1} Xj'.

Using the sij*, the matrix S can be constructed as a consistent estimator of Σ.


If T in sij* is replaced by

tr(HXi HXj) = T − ki − kj + tr[ (Xi'Xi)^{-1} Xi'Xj (Xj'Xj)^{-1} Xj'Xi ],

then sij* is an unbiased estimator of σij.


ECONOMETRIC THEORY

MODULE – XIX
Lecture - 45
Exercises

Dr. Shalabh
Department of Mathematics and Statistics
Indian Institute of Technology Kanpur
2
Exercise 1
The following data has been obtained on 26 days on the price of a share in the stock market and the daily change in the
stock market index. Now we illustrate the use of regression tools and the interpretation of the output from a statistical
software through this example.

Day no. (i)   Change in index (Xi)   Value of share (yi)   |   Day no. (i)   Change in index (Xi)   Value of share (yi)

1 87 199 14 70 172

2 52 160 15 71 176

3 50 160 16 65 165

4 88 175 17 67 150

5 79 193 18 75 170

6 70 189 19 80 189

7 40 155 20 82 190

8 63 150 21 76 185

9 70 170 22 86 190

10 60 171 23 68 165

11 44 160 24 89 200

12 47 165 25 70 172

13 42 148 26 83 190
3

The first step involves the check whether a simple linear regression model can be fitted to this data or not. For this, we plot
a scatter diagram as follows:

Figure 1 Figure 2

Looking at the scatter plots in Figures 1 and 2, an increasing linear trend in the relationship between share price and
change in index is clear. It can also be concluded from the figure 2 that it is reasonable to fit a linear regression model.
4
A typical output of a statistical software for the regression of the value of share (y) on the change in index (x) will look as follows:

Regression Analysis: y versus x

The regression equation is


y = 116 + 0.848 x

Predictor Coef SE Coef T P


Constant 115.544 8.807 13.12 0.000
x 0.8483 0.1262 6.72 0.000

R-Sq = 65.3% R-Sq(adj) = 63.8%

PRESS = 2418.97 R-Sq(pred) = 60.12%

Analysis of Variance

Source DF SS MS F P
Regression 1 3961.0 3961.0 45.15 0.000
Residual Error 24 2105.3 87.7
Lack of Fit 21 1868.6 89.0 1.13 0.536
Pure Error 3 236.8 78.9
Total 25 6066.3

Unusual Observations
Obs x y Fit SE Fit Residual St Resid
8 63.0 150.00 168.99 1.95 -18.99 -2.07R
17 67.0 150.00 172.38 1.84 -22.38 -2.44R

R denotes an observation with a large standardized residual.

Durbin-Watson statistic = 1.66365

Contd....
5

Contd....

Obs x y Fit SE Fit Residual St Resid


1 87.0 199.00 189.34 3.00 9.66 1.09
2 52.0 160.00 159.65 2.75 0.35 0.04
3 50.0 160.00 157.96 2.94 2.04 0.23
4 88.0 175.00 190.19 3.10 -15.19 -1.72
5 79.0 193.00 182.56 2.29 10.44 1.15
6 70.0 189.00 174.92 1.85 14.08 1.53
7 40.0 155.00 149.48 4.01 5.52 0.65
8 63.0 150.00 168.99 1.95 -18.99 -2.07R
9 70.0 170.00 174.92 1.85 -4.92 -0.54
10 60.0 171.00 166.44 2.11 4.56 0.50
11 44.0 160.00 152.87 3.57 7.13 0.82
12 47.0 165.00 155.41 3.25 9.59 1.09
13 42.0 148.00 151.17 3.79 -3.17 -0.37
14 70.0 172.00 174.92 1.85 -2.92 -0.32
15 71.0 176.00 175.77 1.87 0.23 0.02
16 65.0 165.00 170.68 1.88 -5.68 -0.62
17 67.0 150.00 172.38 1.84 -22.38 -2.44R
18 75.0 170.00 179.17 2.03 -9.17 -1.00
19 80.0 189.00 183.41 2.36 5.59 0.62
20 82.0 190.00 185.10 2.53 4.90 0.54
21 76.0 185.00 180.01 2.08 4.99 0.55
22 86.0 190.00 188.50 2.90 1.50 0.17
23 68.0 165.00 173.23 1.84 -8.23 -0.90
24 89.0 200.00 191.04 3.20 8.96 1.02
25 70.0 172.00 174.92 1.85 -2.92 -0.32
26 83.0 190.00 185.95 2.62 4.05 0.45

R denotes an observation with a large standardized residual.


5
6
Now we discuss the interpretation of the results part-wise.
First look at the following section of the output.

The regression equation is


y = 116 + 0.848 x

Predictor Coef SE Coef T P


Constant 115.544 8.807 13.12 0.000
x 0.8483 0.1262 6.72 0.000

R-Sq = 65.3% R-Sq(adj) = 63.8%

PRESS = 2418.97 R-Sq(pred) = 60.12%

The fitted linear regression model is


y = 116 + 0.848 x

in which the regression coefficients are obtained as

b0 = ȳ − b1 x̄ = 115.544 ≈ 116
b1 = sxy / sxx = 0.848

where

sxy = Σ_{i=1}^{n} (xi − x̄)(yi − ȳ),  sxx = Σ_{i=1}^{n} (xi − x̄)²,  x̄ = (1/n) Σ_{i=1}^{n} xi,  ȳ = (1/n) Σ_{i=1}^{n} yi.

The same is indicated in the output by 'Constant 115.544' and 'x 0.8483'.
7
The term 'SE Coef' denotes the standard errors of the estimates of the intercept term and the slope parameter. They are
obtained as

SE(b0) = sqrt[ s² (1/n + x̄²/sxx) ] = 8.807
SE(b1) = sqrt( s² / sxx ) = 0.1262,

where

s² = Σ_{i=1}^{n} (yi − b0 − b1 xi)² / (n − 2),   n = 26.

The term 'T' gives the values of the t-statistics for testing the significance of the regression coefficients. They are
obtained as follows.

For H0: β0 = β00 with β00 = 0,

t0 = (b0 − β00) / sqrt[ (SSres/(n − 2)) (1/n + x̄²/sxx) ] = 13.12.

For H0: β1 = β10 with β10 = 0,

t0 = (b1 − β10) / sqrt[ SSres / ((n − 2) sxx) ] = 6.72.

The corresponding P values are both 0.000, which is less than the level of significance α = 0.05. This indicates that the
null hypotheses are rejected and that the intercept term and the slope parameter are significant.
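These quantities can be reproduced directly. The following Python sketch (not part of the software output; only the first few observations from the table are typed in, so the printed numbers are illustrative, while with all 26 observations it should come close to b0 ≈ 115.5 and b1 ≈ 0.848) computes the estimates, their standard errors and the t-statistics.

# A sketch of the simple linear regression quantities for the share-price data.
import numpy as np

x = np.array([87, 52, 50, 88, 79, 70, 40, 63], dtype=float)          # change in index
y = np.array([199, 160, 160, 175, 193, 189, 155, 150], dtype=float)  # value of share
n = len(x)

sxx = np.sum((x - x.mean()) ** 2)
sxy = np.sum((x - x.mean()) * (y - y.mean()))
b1 = sxy / sxx                      # slope estimate
b0 = y.mean() - b1 * x.mean()       # intercept estimate

resid = y - (b0 + b1 * x)
s2 = np.sum(resid ** 2) / (n - 2)   # residual mean square
se_b0 = np.sqrt(s2 * (1 / n + x.mean() ** 2 / sxx))
se_b1 = np.sqrt(s2 / sxx)
t0, t1 = b0 / se_b0, b1 / se_b1     # t-statistics for H0: beta = 0
print(b0, b1, se_b0, se_b1, t0, t1)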
8
Now we discuss the interpretation of the goodness of fit statistics.

The value of the coefficient of determination is given by R-Sq = 65.3%.

This value is obtained from

R² = 1 − e'e / Σ_{i=1}^{n} (yi − ȳ)² = 0.653.

The value of the adjusted coefficient of determination is given by R-Sq(adj) = 63.8%.

This value is obtained from

R̄² = 1 − ((n − 1)/(n − k)) (1 − R²) = 0.638 with n = 26 and k = 2 (the number of estimated parameters, intercept plus slope).

This means that the fitted model can explain about 64% of the variability in y through X.
9

The value of the PRESS statistic is given by PRESS = 2418.97.

This value is obtained from

PRESS = Σ_{i=1}^{n} [yi − ŷ(i)]² = Σ_{i=1}^{n} [ei / (1 − hii)]² = 2418.97,

where ŷ(i) is the fitted value of yi obtained after deleting the ith observation and hii is the ith diagonal element of the hat matrix.

This is also a measure of model quality. It gives an idea of how well a regression model will perform in predicting new data.
It is apparent that the value is quite high; a model with a small value of PRESS is desirable.

The value of R² for prediction based on the PRESS statistic is given by R-Sq(pred) = 60.12%.

This value is obtained from

R²_prediction = 1 − PRESS / SST = 0.6012.

This statistic gives an indication of the predictive capability of the fitted regression model. Here R²_prediction = 0.6012
indicates that the model is expected to explain about 60% of the variability in predicting new observations.
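A brief Python sketch of how PRESS and the prediction R² follow from the hat matrix (not part of the software output; X and y are assumed to hold the design matrix with an intercept column and the response):

# A sketch of PRESS and R^2 for prediction.
import numpy as np

def press_statistic(X, y):
    """PRESS = sum of squared leave-one-out prediction errors, computed from
    the ordinary residuals e_i and the hat-matrix diagonals h_ii."""
    H = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix
    e = y - H @ y                           # ordinary residuals
    h = np.diag(H)
    press = np.sum((e / (1 - h)) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    return press, 1 - press / sst           # (PRESS, R^2_prediction)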
10
Now we look at the following output on analysis of variance.

Analysis of Variance

Source DF SS MS F P
Regression 1 3961.0 3961.0 45.15 0.000
Residual Error 24 2105.3 87.7
Lack of Fit 21 1868.6 89.0 1.13 0.536
Pure Error 3 236.8 78.9
Total 25 6066.3

However, in this example with one explanatory variable, the analysis of variance does not add much; it is equivalent to
testing the hypothesis H0: β1 = β10 with β10 = 0 by

t0 = (b1 − β10) / sqrt[ SSres / ((n − 2) sxx) ].

We will discuss the details on analysis of variance in the next example.

It may be noted that the test of hypothesis for lack of fit has a P-value of 0.536, which is much larger than the level of
significance, so there is no evidence of lack of fit. At the same time, the value of the coefficient of determination is only
around 0.65, which is not very high. The reason for such a moderate value is clear from the scatter diagram: the points are
fairly dispersed around the line.
11
Now we look into the following part of the output. The following outcome presents the analysis of residuals to find
unusual observations which can possibly be outliers or extreme observations. Its interpretation in the present case is
that the 8th and 17th observations need attention.

Unusual Observations
Obs x y Fit SE Fit Residual St Resid
8 63.0 150.00 168.99 1.95 -18.99 -2.07R
17 67.0 150.00 172.38 1.84 -22.38 -2.44R

R denotes an observation with a large standardized residual.

Durbin-Watson statistic = 1.66365

The standardized residuals are high for the 8th and 17th observations which indicates that these two observations are
unusual and need attention for further exploration.

The Durbin-Watson statistic is reported as Durbin-Watson statistic = 1.66.

Its value is obtained from

d = Σ_{t=2}^{n} (et − et−1)² / Σ_{t=1}^{n} et² = 1.66.

A value of d = 2 indicates that there is no first order autocorrelation in the data. As the value d = 1.66 is less than 2, it
indicates the possible presence of first order positive autocorrelation in the data.
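A short Python sketch of this computation (not part of the software output; e is assumed to be the vector of residuals from the fitted model):

# A sketch of the Durbin-Watson statistic.
import numpy as np

def durbin_watson(e):
    """d = sum_t (e_t - e_{t-1})^2 / sum_t e_t^2; values near 2 suggest no first
    order autocorrelation, values below 2 suggest positive autocorrelation."""
    e = np.asarray(e, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)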
12
Now we analyze the following output:

Obs x y Fit SE Fit Residual St Resid


1 87.0 199.00 189.34 3.00 9.66 1.09
2 52.0 160.00 159.65 2.75 0.35 0.04
3 50.0 160.00 157.96 2.94 2.04 0.23
4 88.0 175.00 190.19 3.10 -15.19 -1.72
5 79.0 193.00 182.56 2.29 10.44 1.15
6 70.0 189.00 174.92 1.85 14.08 1.53
7 40.0 155.00 149.48 4.01 5.52 0.65
8 63.0 150.00 168.99 1.95 -18.99 -2.07R
9 70.0 170.00 174.92 1.85 -4.92 -0.54
10 60.0 171.00 166.44 2.11 4.56 0.50
11 44.0 160.00 152.87 3.57 7.13 0.82
12 47.0 165.00 155.41 3.25 9.59 1.09
13 42.0 148.00 151.17 3.79 -3.17 -0.37
14 70.0 172.00 174.92 1.85 -2.92 -0.32
15 71.0 176.00 175.77 1.87 0.23 0.02
16 65.0 165.00 170.68 1.88 -5.68 -0.62
17 67.0 150.00 172.38 1.84 -22.38 -2.44R
18 75.0 170.00 179.17 2.03 -9.17 -1.00
19 80.0 189.00 183.41 2.36 5.59 0.62
20 82.0 190.00 185.10 2.53 4.90 0.54
21 76.0 185.00 180.01 2.08 4.99 0.55
22 86.0 190.00 188.50 2.90 1.50 0.17
23 68.0 165.00 173.23 1.84 -8.23 -0.90
24 89.0 200.00 191.04 3.20 8.96 1.02
25 70.0 172.00 174.92 1.85 -2.92 -0.32
26 83.0 190.00 185.95 2.62 4.05 0.45

R denotes an observation with a large standardized residual.


13

The residuals are standardized based on the idea of subtracting the mean of the residual and dividing by its standard
deviation. Since E(ei) = 0 and MSres estimates the approximate average variance of the residuals, the scaled residual

di = ei / sqrt(MSres),   i = 1, 2, ..., n,

is called the standardized residual, for which

E(di) = 0,  Var(di) ≈ 1.

So a large value of di potentially indicates an outlier.

The residuals are obtained as ei = yi − ŷi, where the observed values are denoted by yi and the fitted values are denoted
by 'Fit' and obtained as

ŷi = 116 + 0.848 xi,   i = 1, 2, ..., 26.

The standard error of ŷi is denoted by 'SE Fit'. 'Residual' denotes the values of ei = yi − ŷi, and the standardized
residuals are given under 'St Resid'.
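A small Python sketch of the standardized residuals (not part of the software output; X and y are assumed to be the design matrix with intercept column and the response):

# A sketch of standardized residuals d_i = e_i / sqrt(MS_res).
import numpy as np

def standardized_residuals(X, y):
    """Scale the ordinary residuals by sqrt(MS_res); |d_i| larger than about 2
    flags a potentially unusual observation (cf. the 'R' marks in the output)."""
    n, p = X.shape
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ b
    ms_res = np.sum(e ** 2) / (n - p)
    return e / np.sqrt(ms_res)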


14
Next we consider the graphical analysis.

The normal probability plot is obtained as follows. It is a plot of the ordered standardized residuals versus the cumulative
probability

Pi = (i − 1/2) / n,   i = 1, 2, ..., n.

It can be seen from the plot that most of the points lie close to the line. This indicates that the assumption of a normal
distribution for the random errors is satisfactory for the given data.
15
Next we consider the graph of the residuals against the fitted values.
Such a plot is helpful in detecting several common types of model inadequacies.

If the plot is such that the residuals can be contained in a horizontal band (and the residuals fluctuate more or less
randomly inside the band), then there are no obvious model defects.

The points in this plot can more or less be contained in a horizontal band, so the model has no obvious defects. The
picture does not fully indicate that all the points can be enclosed in the horizontal band. This was indicated by the
coefficient of determination, which is only around 63%; this moderate degree of fit is reflected in this graph also.
16
Exercise 2

The following data set has 22 observations on a study variable y with three explanatory variables X1, X2 and X3.
Now we illustrate the use of regression tools and the interpretation of the output from a statistical software through this
example.

i yi X1i X2i X3i i yi X1i X2i X3i

1 36 7 55.21 5.54 12 25 23 58.26 16.58

2 15 27 75.32 36.24 13 8 49 74.36 32.95

3 13 28 70.31 38.24 14 48 13 66.24 13.85

4 12 45 76.28 46.85 15 44 12 71.95 16.64

5 36 12 80.25 38.91 16 19 28 77.36 35.91

6 23 15 89.85 45.84 17 24 11 65.38 16.56

7 24 20 91.21 44.26 18 35 15 83.25 29.35

8 9 35 99.95 70.21 19 30 18 84.68 27.36

9 21 56 55.78 66.59 20 15 21 89.37 38.95

10 55 8 55.32 9.65 21 23 30 75.25 36.65

11 28 19 65.96 18.32 22 35 18 88.94 39.16


17
The first step involves checking whether a linear regression model can be fitted to this data or not. For this, we plot
a matrix scatter diagram as follows.

Looking at the scatter plots in the various blocks, a good degree of linear trend in the relationship between the study
variable and each of the explanatory variables is clear. The degrees of linear relationship in the various blocks are
different. It can also be concluded that it is appropriate to fit a linear regression model.
18
A typical output of a statistical software on regression of y on X 1 , X 2 , X 3 will look like as follows:

The regression equation is


y = 71.6 - 0.758 X1 - 0.417 X2 + 0.106 X3

Predictor Coef SE Coef T P VIF


Constant 71.60 14.36 4.99 0.000
X1 -0.7585 0.2312 -3.28 0.004 2.9
X2 -0.4166 0.2144 -1.94 0.068 2.3
X3 0.1063 0.2223 0.48 0.638 4.2

R-Sq = 63.0% R-Sq(adj) = 56.9%

PRESS = 2554.13 R-Sq(pred) = 23.77%

Analysis of Variance

Source DF SS MS F P
Regression 3 2111.52 703.84 10.23 0.000
Residual Error 18 1238.85 68.82
Total 21 3350.36

Continued...
19

Obs X1 y Fit SE Fit Residual St Resid


1 7.0 36.00 43.89 3.94 -7.89 -1.08
2 27.0 15.00 23.60 1.84 -8.60 -1.06
3 28.0 13.00 25.14 2.26 -12.14 -1.52
4 45.0 12.00 10.68 3.66 1.32 0.18
5 12.0 36.00 33.21 3.56 2.79 0.37
6 15.0 23.00 27.67 3.61 -4.67 -0.63
7 20.0 24.00 23.15 2.99 0.85 0.11
8 35.0 9.00 10.89 4.72 -1.89 -0.28
9 56.0 21.00 12.97 7.17 8.03 1.92 X
10 8.0 55.00 43.52 3.94 11.48 1.57
11 19.0 28.00 31.67 2.47 -3.67 -0.46
12 23.0 25.00 31.65 3.10 -6.65 -0.86
13 49.0 8.00 6.97 6.14 1.03 0.19 X
14 13.0 48.00 35.62 2.72 12.38 1.58
15 12.0 44.00 34.30 2.59 9.70 1.23
16 28.0 19.00 21.96 2.01 -2.96 -0.37
17 11.0 24.00 37.79 2.77 -13.79 -1.76
18 15.0 35.00 28.67 2.49 6.33 0.80
19 18.0 30.00 25.59 2.98 4.41 0.57
20 21.0 15.00 22.59 2.78 -7.59 -0.97
21 30.0 23.00 21.40 2.06 1.60 0.20
22 18.0 35.00 25.07 2.76 9.93 1.27

X denotes an observation whose X value gives it large influence.

Durbin-Watson statistic = 1.46718

19
20
Now we discuss the interpretation of the results part-wise.
First look at the following section of the output.

The regression equation is


y = 71.6 - 0.758 X1 - 0.417 X2 + 0.106 X3

Predictor Coef SE Coef T P VIF


Constant 71.60 14.36 4.99 0.000
X1 -0.7585 0.2312 -3.28 0.004 2.9
X2 -0.4166 0.2144 -1.94 0.068 2.3
X3 0.1063 0.2223 0.48 0.638 4.2

R-Sq = 63.0% R-Sq(adj) = 56.9%

PRESS = 2554.13 R-Sq(pred) = 23.77%

The fitted linear regression model is


y = 71.6 - 0.758 X1 - 0.417 X2 + 0.106 X3
in which the regression coefficients are obtained as

b = (X'X)^{-1} X'y,

which is a (4 × 1) vector,

b = (b1, b2, b3, b4)' = (71.6, −0.758, −0.417, 0.106)',

where b1 is the OLS estimator of the intercept term and b2, b3, b4 are the OLS estimators of the regression coefficients.
The same is indicated in the output by ‘Constant 71.60’, ‘X1 -0.7585’, ‘X2 -0.4166’ and ‘X3 0.1063’.
21
The term 'SE Coef' denotes the standard errors of the estimates of the intercept term and the slope parameters. They are
obtained as the positive square roots of the diagonal elements of the estimated covariance matrix

V̂(b) = σ̂² (X'X)^{-1},

where

σ̂² = Σ_{i=1}^{n} (yi − b1 − b2 X1i − b3 X2i − b4 X3i)² / (n − 4),   n = 22.

The term 'T' gives the values of the t-statistics for testing the significance of the regression coefficients. They are obtained
for testing H0j: βj = βj0 with βj0 = 0 against H1j: βj ≠ βj0 using the statistic

tj = (bj − βj0) / sqrt[ V̂ar(bj) ],   j = 1, 2, 3, 4:

t1 = 4.99 with corresponding p-value 0.000,

t2 = −3.28 with corresponding p-value 0.004,

t3 = −1.94 with corresponding p-value 0.068,

t4 = 0.48 with corresponding p-value 0.638.

A null hypothesis is rejected at the α = 0.05 level of significance when the corresponding P value is less than 0.05.
Thus H01: β1 = 0 (intercept) and H02: β2 = 0 (coefficient of X1) are rejected, H03: β3 = 0 (coefficient of X2) is rejected only at
the 10% level, and H04: β4 = 0 (coefficient of X3) is accepted. Thus X1 (and marginally X2) stay in the model while X3
leaves the model.
22

Next we consider the values of the variance inflation factor (VIF) given in the output under 'VIF'. The variance inflation
factor for the jth explanatory variable is defined as

VIFj = 1 / (1 − Rj²),

where Rj² is the coefficient of determination obtained when the jth explanatory variable is regressed on the remaining
explanatory variables.
This is the factor which is responsible for inflating the sampling variance. The combined effect of dependencies among the
explanatory variables on the variance of a term is measured by the VIF of that term in the model. One or more large VIFs
indicate the presence of multicollinearity in the data.

In practice, usually a VIF > 5 or 10 indicates that the associated regression coefficients are poorly estimated because of
multicollinearity.

In our case, the values of VIFs due to first, second and third explanatory variables are 2.9, 2.3 and 4.2, respectively. Thus
there is no indication of the presence of multicollinearity in the data.
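A Python sketch of this computation (not part of the software output; X is assumed to be an (n × k) matrix of explanatory variables without the intercept column, and the function name is an assumption):

# A sketch of variance inflation factors.
import numpy as np

def variance_inflation_factors(X):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing column j of X
    on the remaining columns (with an intercept)."""
    n, k = X.shape
    vifs = []
    for j in range(k):
        xj = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        fitted = others @ np.linalg.lstsq(others, xj, rcond=None)[0]
        r2_j = 1 - np.sum((xj - fitted) ** 2) / np.sum((xj - xj.mean()) ** 2)
        vifs.append(1.0 / (1.0 - r2_j))
    return np.array(vifs)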
23
Now we discuss the interpretation of the goodness of fit statistics.

The value of the coefficient of determination is given by R-Sq = 63.0%.

This value is obtained from

R² = 1 − e'e / Σ_{i=1}^{n} (yi − ȳ)² = 0.63.

This means that the fitted model can explain about 63% of the variability in y through the X's.

The value of the adjusted coefficient of determination is given by R-Sq(adj) = 56.9%.

This value is obtained from

R̄² = 1 − ((n − 1)/(n − k)) (1 − R²) = 0.569 with n = 22, k = 4.

This means that the fitted model can explain about 57% of the variability in y through the X's.
24

The value of the PRESS statistic is given by PRESS = 2554.13.

This value is obtained from

PRESS = Σ_{i=1}^{n} [ei / (1 − hii)]² = 2554.13,

where hii is the ith diagonal element of the hat matrix.

This is also a measure of model quality. It gives an idea of how well a regression model will perform in predicting new data.
It is apparent that the value is quite high; a model with a small value of PRESS is desirable.

The value of R² for prediction based on the PRESS statistic is given by R-Sq(pred) = 23.77%.

This value is obtained from

R²_prediction = 1 − PRESS / SST = 0.2377.

This statistic gives an indication of the predictive capability of the fitted regression model. Here R²_prediction = 0.2377
indicates that the model is expected to explain only about 24% of the variability in predicting new observations. This is also
reflected in the value of the PRESS statistic. So the fitted model is not very good for prediction purposes.
25
Now we look at the following output on the analysis of variance. This output tests the null hypothesis H0: β2 = β3 = β4 = 0
against the alternative hypothesis that at least one of these regression coefficients is different from zero. There are three
explanatory variables.

Analysis of Variance

Source DF SS MS F P
Regression 3 2111.52 703.84 10.23 0.000
Residual Error 18 1238.85 68.82
Total 21 3350.36

The sum of squares due to regression is obtained by

$SS_{reg} = b'X'y - n\bar{y}^2 = y'Hy - n\bar{y}^2 = 2111.52.$

The total sum of squares is obtained by

$SS_T = y'y - n\bar{y}^2 = 3350.36.$

The sum of squares due to error is obtained by

$SS_{res} = SS_T - SS_{reg} = 1238.85.$
The mean squares are obtained by dividing the sum of squares by the degrees of freedom.

The mean square due to regression is obtained as $MS_{reg} = \dfrac{SS_{reg}}{3} = 703.84.$

The mean square due to error is obtained as $MS_{res} = \dfrac{SS_{res}}{18} = 68.82.$

The F-statistic is obtained by $F = \dfrac{MS_{reg}}{MS_{res}} = 10.23.$
The null hypothesis is rejected at the 5% level of significance because the p-value (0.000) is less than $\alpha = 0.05$.
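The ANOVA arithmetic can be verified directly from the numbers reported in the table; the short Python check below reproduces the mean squares, the F-statistic and its p-value.

```python
# Re-checking the ANOVA computations with the numbers reported in the output.
from scipy import stats

ss_reg, df_reg = 2111.52, 3      # regression sum of squares and its d.f.
ss_res, df_res = 1238.85, 18     # residual sum of squares and its d.f.

ms_reg = ss_reg / df_reg         # 703.84
ms_res = ss_res / df_res         # 68.82
F = ms_reg / ms_res              # about 10.23
p_value = stats.f.sf(F, df_reg, df_res)        # upper tail of F(3, 18)
print(round(ms_reg, 2), round(ms_res, 2), round(F, 2), round(p_value, 4))
```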
Now we look into the following part of the output. It presents the analysis of residuals to find unusual observations which may possibly be outliers or extreme observations.

Obs X1 y Fit SE Fit Residual St Resid


1 7.0 36.00 43.89 3.94 -7.89 -1.08
2 27.0 15.00 23.60 1.84 -8.60 -1.06
3 28.0 13.00 25.14 2.26 -12.14 -1.52
4 45.0 12.00 10.68 3.66 1.32 0.18
5 12.0 36.00 33.21 3.56 2.79 0.37
6 15.0 23.00 27.67 3.61 -4.67 -0.63
7 20.0 24.00 23.15 2.99 0.85 0.11
8 35.0 9.00 10.89 4.72 -1.89 -0.28
9 56.0 21.00 12.97 7.17 8.03 1.92 X
10 8.0 55.00 43.52 3.94 11.48 1.57
11 19.0 28.00 31.67 2.47 -3.67 -0.46
12 23.0 25.00 31.65 3.10 -6.65 -0.86
13 49.0 8.00 6.97 6.14 1.03 0.19 X
14 13.0 48.00 35.62 2.72 12.38 1.58
15 12.0 44.00 34.30 2.59 9.70 1.23
16 28.0 19.00 21.96 2.01 -2.96 -0.37
17 11.0 24.00 37.79 2.77 -13.79 -1.76
18 15.0 35.00 28.67 2.49 6.33 0.80
19 18.0 30.00 25.59 2.98 4.41 0.57
20 21.0 15.00 22.59 2.78 -7.59 -0.97
21 30.0 23.00 21.40 2.06 1.60 0.20
22 18.0 35.00 25.07 2.76 9.93 1.27

X denotes an observation whose X value gives it large influence.

Durbin-Watson statistic = 1.46718

No evidence of lack of fit (P >= 0.1).


In the residual analysis output above, 'y' denotes the observed values, 'Fit' denotes the fitted values $\hat{y}_i$, 'SE Fit' denotes the standard error of the fitted value $\hat{y}_i$, 'Residual' denotes the ordinary residuals obtained by

$e_i = y_i - \hat{y}_i = y_i - (71.6 - 0.758 X_1 - 0.417 X_2 + 0.106 X_3)$

and 'St Resid' denotes the standardized residuals.

The 9th and 13th observations are flagged with 'X', which indicates that the values of their explanatory variables give them large influence (high leverage) on the fit. Observation 9 also has the largest standardized residual (1.92).
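A brief sketch of how standardized residuals and leverages can be computed is given below. The synthetic data and the rule of thumb $h_{ii} > 2k/n$ for flagging high-leverage points are illustrative assumptions; the software may use its own flagging rule.

```python
# Illustrative sketch: standardized residuals and leverages (synthetic data).
import numpy as np

def standardized_residuals_and_leverage(X, y):
    """r_i = e_i / sqrt(sigma^2-hat (1 - h_ii)) and the leverages h_ii."""
    n, k = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T       # hat matrix
    e = y - H @ y                              # ordinary residuals
    h = np.diag(H)                             # leverages
    sigma2_hat = (e @ e) / (n - k)
    return e / np.sqrt(sigma2_hat * (1 - h)), h

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(22), rng.normal(size=(22, 3))])
y = rng.normal(size=22)
r, h = standardized_residuals_and_leverage(X, y)
print(np.where(h > 2 * X.shape[1] / X.shape[0])[0])   # one common high-leverage flag
```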

There is no evidence of lack of fit.

The value of the Durbin-Watson statistic is reported as Durbin-Watson statistic = 1.46718.

Its value is obtained from

$d = \dfrac{\sum_{t=2}^{n} (e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2} = 1.46718.$

A value of d close to 2 indicates that there is no first order autocorrelation in the data. As the value d = 1.47 is less than 2, it indicates the possible presence of first order positive autocorrelation in the data.
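The statistic is easy to reproduce from the 'Residual' column of the table above; the short sketch below does so and recovers the reported value up to the rounding of the printed residuals.

```python
# Computing the Durbin-Watson statistic from the residuals printed in the output.
import numpy as np

def durbin_watson(e):
    """d = sum_{t=2}^n (e_t - e_{t-1})^2 / sum_{t=1}^n e_t^2."""
    e = np.asarray(e, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

residuals = [-7.89, -8.60, -12.14, 1.32, 2.79, -4.67, 0.85, -1.89, 8.03, 11.48,
             -3.67, -6.65, 1.03, 12.38, 9.70, -2.96, -13.79, 6.33, 4.41, -7.59,
             1.60, 9.93]
print(round(durbin_watson(residuals), 4))      # close to the reported 1.46718
```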
Next we consider the graphical analysis.

The normal probability plot is obtained as follows.

The normal probability plot is a plot of the ordered standardized residuals versus the cumulative probability

$P_i = \dfrac{i - \tfrac{1}{2}}{n}, \qquad i = 1, 2, \ldots, n.$
It can be seen from the plot that most of the points are lying close to the line. This clearly indicates that the assumption of
normal distribution for random errors is satisfactory.
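The plot itself is not reproduced here, but it can be regenerated from the 'St Resid' column of the output. The sketch below is one way of doing so with matplotlib; the function name is illustrative.

```python
# Illustrative sketch of the normal probability plot described above.
import numpy as np
import matplotlib.pyplot as plt

def normal_probability_plot(std_resid):
    """Plot ordered standardized residuals against P_i = (i - 1/2)/n."""
    r = np.sort(np.asarray(std_resid, dtype=float))
    n = len(r)
    p = (np.arange(1, n + 1) - 0.5) / n        # cumulative probabilities
    plt.plot(r, p, "o")
    plt.xlabel("ordered standardized residuals")
    plt.ylabel("cumulative probability P_i")
    plt.show()

# Standardized residuals read off the 'St Resid' column of the output above.
normal_probability_plot([-1.08, -1.06, -1.52, 0.18, 0.37, -0.63, 0.11, -0.28,
                         1.92, 1.57, -0.46, -0.86, 0.19, 1.58, 1.23, -0.37,
                         -1.76, 0.80, 0.57, -0.97, 0.20, 1.27])
```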
Next we consider the graph between the residuals and the fitted values.
Such a plot is helpful in detecting several common types of model inadequacies.

If the plot is such that the residuals can be contained in a horizontal band (and the residuals fluctuate more or less in a random fashion inside the band), then there are no obvious model defects.

The points in this plot can more or less be contained in an outward-opening funnel. So the assumption of constant variance is violated, and it indicates that the variance possibly increases with an increase in the values of the study variable.
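The residuals-versus-fitted plot can likewise be regenerated from the 'Fit' and 'Residual' columns of the output; the sketch below (with an illustrative function name) shows one way to draw it.

```python
# Illustrative sketch of the residuals-versus-fitted-values plot.
import matplotlib.pyplot as plt

def residuals_vs_fitted(fitted, residuals):
    """Scatter the residuals against the fitted values with a zero reference line."""
    plt.scatter(fitted, residuals)
    plt.axhline(0.0, linestyle="--")
    plt.xlabel("fitted values")
    plt.ylabel("residuals")
    plt.show()

# Values read off the 'Fit' and 'Residual' columns of the output above.
fit = [43.89, 23.60, 25.14, 10.68, 33.21, 27.67, 23.15, 10.89, 12.97, 43.52,
       31.67, 31.65, 6.97, 35.62, 34.30, 21.96, 37.79, 28.67, 25.59, 22.59,
       21.40, 25.07]
resid = [-7.89, -8.60, -12.14, 1.32, 2.79, -4.67, 0.85, -1.89, 8.03, 11.48,
         -3.67, -6.65, 1.03, 12.38, 9.70, -2.96, -13.79, 6.33, 4.41, -7.59,
         1.60, 9.93]
residuals_vs_fitted(fit, resid)
```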
