Dr. Shalabh
Department of Mathematics and Statistics
Indian Institute of Technology Kanpur
Econometrics is an integration of economic theory, mathematical economics and statistics, with the objective of providing numerical values for the parameters of economic relationships. The relationships of economic theories are usually expressed in mathematical form and combined with empirical economics. Econometric methods are used to obtain the values of the parameters, which are essentially the coefficients of the mathematical form of the economic relationships. The statistical methods which help in explaining economic phenomena are adapted as econometric methods. Econometric relationships depict the random behaviour of economic relationships, which is generally not considered in mathematical economics.

It may be pointed out that econometric methods can also be used in other areas, such as the engineering and biological sciences. In simple words, whenever there is a need to express a stochastic relationship in mathematical form, econometric methods and tools help. Econometric tools are helpful in explaining the relationships among variables.
Econometric models
A model is a simplified representation of a real-world process. It should be representative in the sense that it should capture the major features of the phenomenon under study. In general, one of the objectives in modeling is to have a simple model to explain a complex phenomenon. Such an objective may sometimes lead to an oversimplified model, and sometimes the assumptions made are unrealistic.

In practice, generally all the variables which the experimenter thinks are relevant to explain the phenomenon are included in the model. The rest of the variables are dumped in a basket called “disturbances”, where the disturbances are random variables. This is the main difference between economic modeling and econometric modeling: mathematical modeling is exact in nature, whereas statistical modeling contains a stochastic term also.

An economic model is a set of assumptions that describes the behaviour of an economy or, more generally, a phenomenon. An econometric model consists of a set of equations describing the behaviour. These equations are derived from the economic model and have two parts: the observed variables and the disturbances.
Aims of econometrics
The three main aims of econometrics are as follows:
1. Formulation and specification of econometric models
The economic models are formulated in an empirically testable form. Several econometric models can be derived from an economic model; such models differ due to different choices of functional form and different specifications of the stochastic structure of the variables.
2. Estimation and testing of models
The models are estimated on the basis of the observed data and tested for their suitability. This is the statistical-inference part of the modeling.
3. Use of models
The obtained models are used for forecasting and policy formulation, which is an essential part of any policy decision. Such forecasts help the policy makers to judge the goodness of the fitted model and take necessary measures to re-adjust the relevant economic variables.
In economic statistics, the economic data are collected, recorded, tabulated and used in describing the pattern in their development over time. Economic statistics is thus a descriptive aspect of economics; it provides neither explanations of the development of the various variables nor measurement of the parameters of the relationships.

Statistical methods describe the methods of measurement which are developed on the basis of controlled experiments. Such methods may not be suitable for economic phenomena, as these do not fit in the framework of controlled experiments: in real-world settings, the variables usually change continuously and simultaneously, so the set-up of a controlled experiment is not available.

Econometrics uses statistical methods after adapting them to the problems of economic life. These adapted statistical methods are usually termed as econometric methods. Such methods are adjusted so that they become appropriate for the measurement of stochastic relationships. These adjustments basically attempt to specify the stochastic element which operates in real-world data and enters into the determination of the observed data. This enables the data to be treated as random variables.
Theoretical econometrics includes the development of appropriate methods for the measurement of economic relationships which are not meant for controlled experiments conducted inside laboratories; the econometric methods are generally developed for the analysis of non-experimental data. Applied econometrics includes the application of econometric methods to specific branches of economic theory and problems like demand, supply, production, investment, consumption etc. It involves the application of the tools of econometric theory for the analysis of economic phenomena and for forecasting economic behaviour.
Types of data
Various types of data are used in the estimation of the model:
1. Time series data
2. Cross-section data
3. Panel data
Panel data are the data from a repeated survey of a single (cross-section) sample in different periods of time.

Aggregation problem
Aggregation problems arise when aggregative variables are used in functions. Such aggregative variables may involve:
1. Aggregation over individuals
2. Aggregation over commodities
3. Aggregation over time periods
4. Spatial aggregation
Sometimes the aggregation is related to spatial issues, for example, the populations of towns, regions or countries.
Consider the model

y = f(X1, X2, …, Xk; β1, β2, …, βk) + ε

where f is some well defined function and β1, β2, …, βk are the parameters which characterize the role and contribution of X1, X2, …, Xk, respectively. The term ε reflects the stochastic nature of the relationship between y and X1, X2, …, Xk and indicates that such a relationship is not exact in nature. When ε = 0, the relationship is called a mathematical model; otherwise it is a statistical model. The term “model” is broadly used to represent any phenomenon in a mathematical framework.

A model or relationship is termed linear if it is linear in parameters, and nonlinear if it is not linear in parameters. In other words, if all the partial derivatives of y with respect to each of the parameters β1, β2, …, βk are independent of the parameters, then the model is called a linear model. If any of the partial derivatives of y with respect to any of β1, β2, …, βk is not independent of the parameters, the model is called nonlinear. Note that the linearity or nonlinearity of the model is not described by the linearity or nonlinearity of the explanatory variables in the model.
For example,

y = β1X1² + β2X2 + β3 log X3 + ε

is a linear model because ∂y/∂βi (i = 1, 2, 3) are independent of the parameters βi (i = 1, 2, 3). On the other hand,

y = β1²X1 + β2X2 + β3 log X + ε

is a nonlinear model because ∂y/∂β1 = 2β1X1 depends on β1.

When the function f is linear in parameters, then y = f(X1, X2, …, Xk; β1, β2, …, βk) + ε is called a linear model, and when the function f is nonlinear in parameters, it is called a nonlinear model. In general, the function f is chosen as

f(X1, X2, …, Xk; β1, β2, …, βk) = β1X1 + β2X2 + … + βkXk

to describe a linear model. Since X1, X2, …, Xk are pre-determined variables and y is the outcome, both are known. Thus the knowledge of the model depends on the knowledge of the parameters β1, β2, …, βk.

Statistical linear modeling essentially consists of developing approaches and tools to determine β1, β2, …, βk in the linear model

y = β1X1 + β2X2 + … + βkXk + ε

given the observations on y and X1, X2, …, Xk.
Different statistical estimation procedures, e.g., the method of maximum likelihood, the principle of least squares, the method of moments etc., can be employed to estimate the parameters of the model. The method of maximum likelihood needs further knowledge of the distribution of y, whereas the method of moments and the principle of least squares do not need any knowledge about the distribution of y.
Regression analysis is a tool to determine the values of the parameters given the data on y and X1, X2, …, Xk. The literal meaning of regression is “to move in the backward direction”. Before discussing and understanding the meaning of “backward direction”, let us find out which of the following statements is correct:
S1: model generates data, or
S2: data generates model.
Obviously, S1 is correct. It can broadly be thought that the model exists in nature but is unknown to the experimenter. When some values of the explanatory variables are provided, the values of the output or study variable are generated accordingly, depending on the form of the function f and the nature of the phenomenon. So, ideally, the pre-existing model gives rise to the data. Our objective is to determine the functional form of this model. Now we move in the backward direction: we first collect the data on the study and explanatory variables, and then employ some statistical techniques to use these data to learn the form of the function f. Equivalently, the data from the model are recorded first and then used to determine the parameters of the model. Regression analysis is a technique which helps in determining the statistical model by using the data on the study and explanatory variables. The classification into linear and nonlinear regression analysis is based on the determination of linear and nonlinear models, respectively.
Consider a simple example to understand the meaning of “regression”. Suppose the yield of crop (y) depends linearly on
two explanatory variables, viz., the quantity of a fertilizer (X1) and level of irrigation (X2) as
y = β1X1 + β2X2 + ε.
The true values of β1 and β2 exist in nature but are unknown to the experimenter. Some values of y are recorded by providing different values of X1 and X2. There exists some relationship between y and X1, X2 which gives rise to systematically behaved data on y, X1 and X2. Such a relationship is unknown to the experimenter. To determine the model, we move in the backward direction in the sense that the collected data are used to determine the unknown parameters β1 and β2 of the model. In this sense such an approach is termed as regression analysis.
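The crop-yield illustration can be mimicked numerically. The sketch below (with made-up numbers, not from the notes) lets a hypothetical “true” model generate data for chosen fertilizer and irrigation levels, then moves in the backward direction to recover β1 and β2 from the data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical "true" model in nature: y = 3*X1 + 5*X2 + error
beta1_true, beta2_true = 3.0, 5.0
X1 = rng.uniform(0, 10, size=100)   # quantity of fertilizer
X2 = rng.uniform(0, 5, size=100)    # level of irrigation
y = beta1_true * X1 + beta2_true * X2 + rng.normal(scale=1.0, size=100)

# Backward direction: use the recorded data to estimate the unknown parameters
X = np.column_stack([X1, X2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b)   # estimates of (beta1, beta2), close to (3, 5)
```

The experimenter sees only `X` and `y`; the least squares fit recovers the parameters of the pre-existing model.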
ECONOMETRIC THEORY
MODULE – I
Lecture - 2
Introduction to Econometrics
The theory and fundamentals of linear models lay the foundation for developing the tools for regression analysis that are based on valid statistical theory and concepts. The main steps in any regression analysis are:
1. Statement of the problem
2. Choice of relevant variables
3. Collection of data
4. Specification of model
5. Choice of method for fitting the data
6. Fitting of model
7. Model validation and criticism
8. Using the chosen model(s) for the solution of the posed problem and forecasting
These steps are examined below.
1. Statement of the problem
The first important step in conducting any regression analysis is to specify the problem and the objectives to be addressed by the regression analysis. A wrong formulation or a wrong understanding of the problem will give wrong statistical inferences.

2. Choice of relevant variables
The choice of variables depends upon the objectives of the study and the understanding of the problem. For example, height and weight of children are related; now there can be two issues to be addressed: (i) to determine the height for a given weight, or (ii) to determine the weight for a given height. In case (i), height is the response variable whereas weight is the response variable in case (ii), and the role of the explanatory variable is assigned accordingly. It has to be kept in mind that the correct choice of variables will determine the statistical inferences correctly. For example, in an agricultural experiment, the yield depends on explanatory variables like quantity of fertilizer, rainfall, irrigation, temperature etc. These variables are denoted by X1, X2, …, Xk as a set of k explanatory variables.
3. Collection of data
The next step is to collect the data on such relevant variables. The data are essentially the measurements on these variables. For example, suppose we want to collect the data on age. For this, it is important to know how to record the data on age: either the date of birth can be recorded, which will provide the exact age on any specific date, or the age in terms of completed years as on a specific date can be recorded. Moreover, it is also important to decide whether the data have to be collected on variables as quantitative variables or qualitative variables. For example, if the ages (in years) are 15, 17, 19, 21, 23, then these are quantitative values. If the ages are defined by a variable that takes value 1 if the age is less than 18 years and 0 if the age is more than 18 years, then the earlier recorded data are converted to 1, 1, 0, 0, 0. Note that there is a loss of information in converting the quantitative data into qualitative data.

The methods and approaches for qualitative and quantitative data are also different. If the study variable is binary, then logistic and probit regressions etc. are used. If all the explanatory variables are qualitative, then the analysis of variance technique is used. If some explanatory variables are qualitative and others are quantitative, then the analysis of covariance technique is used. The techniques of analysis of variance and analysis of covariance are special cases of regression analysis.
Generally, the data are collected on n subjects. Then y denotes the response or study variable and y1, y2, …, yn are its n values. If there are k explanatory variables X1, X2, …, Xk, then xij denotes the ith value of the jth variable, i = 1, 2, …, n; j = 1, 2, …, k. The observations can be presented in the following table:

Observation number    Observation on y    Observations on X1, X2, …, Xk
        1                   y1              x11, x12, …, x1k
        2                   y2              x21, x22, …, x2k
        ⋮                   ⋮                       ⋮
        n                   yn              xn1, xn2, …, xnk
4. Specification of model
The experimenter or the person working in the subject usually helps in determining the form of the model. Only the form of the tentative model can be ascertained, and it will depend on some unknown parameters. For example, a general form will be like

y = f(X1, X2, …, Xk; β1, β2, …, βk) + ε

where ε is the random error reflecting mainly the difference between the observed value of y and the value of y obtained through the model. The form of f(X1, X2, …, Xk; β1, β2, …, βk) can be linear as well as nonlinear, depending on the form of the parameters β1, β2, …, βk. A model is said to be linear if it is linear in parameters. For example,

y = β1X1 + β2X1² + β3X2 + ε
y = β1 + β2 ln X2 + ε

are linear models, whereas

y = β1X1 + β2²X2 + β3X2 + ε
y = (ln β1)X1 + β2X2 + ε

are nonlinear models. Many times the nonlinear models can be converted into linear models through some transformations, so the class of linear models is wider than it initially appears.
If a model contains only one explanatory variable, it is called a simple regression model; when there are more than one explanatory variables, it is called a multiple regression model. When there is only one study variable, the regression is termed as univariate regression; when there are more than one study variables, it is termed as multivariate regression. Note that simple and multiple regressions are not the same as univariate and multivariate regressions: simple and multiple regression are determined by the number of explanatory variables, whereas univariate and multivariate regressions are determined by the number of study variables.
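The linear-in-parameters criterion can be checked mechanically: a model is linear exactly when every partial derivative ∂y/∂βi is free of all the parameters. A sketch using SymPy (illustrative helper, not part of the notes):

```python
import sympy as sp

X1, X2 = sp.symbols('X1 X2', positive=True)
b1, b2 = sp.symbols('beta1 beta2')

def is_linear_in_parameters(expr, params):
    # Linear model: each partial derivative with respect to a parameter
    # must itself contain no parameter.
    return all(sp.diff(expr, p).free_symbols.isdisjoint(params)
               for p in params)

linear_model = b1*X1 + b2*sp.log(X2)   # linear in beta1 and beta2
nonlinear_model = b1**2*X1 + b2*X2     # d/d(beta1) = 2*beta1*X1, depends on beta1

print(is_linear_in_parameters(linear_model, {b1, b2}))     # True
print(is_linear_in_parameters(nonlinear_model, {b1, b2}))  # False
```

Note that `log(X2)` appearing in the first model does not matter: linearity concerns the parameters, not the explanatory variables.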
6. Fitting of model
The estimation of the unknown parameters using an appropriate method provides the values of the parameters. Substituting these values in the equation gives us a usable model; this is termed as model fitting. The estimates of the parameters β1, β2, …, βk in the model

y = f(X1, X2, …, Xk; β1, β2, …, βk) + ε

are denoted as β̂1, β̂2, …, β̂k, which gives the fitted model as

y = f(X1, X2, …, Xk; β̂1, β̂2, …, β̂k).

When the value of y is obtained for given values of X1, X2, …, Xk, it is denoted as ŷ and called the fitted value. The fitted equation is also used for prediction, in which case ŷ is termed the predicted value. Note that the fitted value is the one where the values used for the explanatory variables correspond to one of the n observations in the data, whereas the predicted value is the one obtained for any set of values of the explanatory variables. It is not generally recommended to predict the y-values for a set of values of the explanatory variables which lie outside the range of the data. When the values of the explanatory variables are future values of the explanatory variables, the predicted values are called forecasted values.
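The fitted/predicted distinction can be seen in a short sketch (illustrative numbers; `np.polyfit` is just one convenient least squares fitter, not the method prescribed by the notes):

```python
import numpy as np

# Illustrative data with x observed on [1, 10]
x = np.array([1.0, 2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([2.1, 3.9, 8.2, 11.8, 16.1, 20.2])

b1, b0 = np.polyfit(x, y, deg=1)   # least squares fit of y = b0 + b1*x

y_fitted = b0 + b1 * x             # fitted values: at the observed x's
y_pred = b0 + b1 * 5.0             # predicted value: a new x inside the data range
# Evaluating at, say, x = 25 would extrapolate far outside [1, 10],
# which is not generally recommended.
```

If x = 5.0 were a future value of the explanatory variable, `y_pred` would be called a forecasted value.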
7. Model validation and criticism
The validation of the assumptions must be done before drawing any statistical conclusion; any departure from the validity of the assumptions will be reflected in the statistical inferences. In fact, regression analysis is an iterative process in which the outputs are used to diagnose, validate, criticize and modify the inputs (figure omitted: the inputs and outputs feed back into each other).

8. Using the chosen model(s)
The determination of the explicit form of the regression equation is the ultimate objective of regression analysis. It is finally a good and valid relationship between the study variable and the explanatory variables. The regression equation helps in understanding the interrelationships among the variables, and can be used for several purposes: for example, to determine the role of any explanatory variable in the joint relationship for policy formulation, or to forecast the values of the study variable.
MODULE – II
Lecture - 3
Simple Linear Regression Analysis
We consider the modeling between the dependent variable and one independent variable. When there is only one independent variable in the linear regression model, the model is generally termed as the simple linear regression model; when there are more than one independent variables, it is termed as the multiple linear regression model. The simple linear regression model is

y = β0 + β1X + ε

where y is termed as the dependent or study variable and X is termed as the independent or explanatory variable. The terms β0 and β1 are the parameters of the model: β0 is the intercept term and β1 is the slope parameter. These parameters are usually called regression coefficients. The unobservable error component ε accounts for the failure of the data to lie on a straight line and represents the difference between the true and observed realizations of y; it is termed as the disturbance or error term. There can be several reasons for such a difference, e.g., the effect of all the variables deleted from the model, the variables may be qualitative, inherent randomness in the observations etc. We assume that the errors ε are independent and identically distributed random variables with mean zero and constant variance σ². Later, we will additionally assume that ε is normally distributed.
The independent variable is viewed as controlled by the experimenter, so it is considered as non-stochastic, whereas y is viewed as a random variable with

E(y) = β0 + β1X  and  Var(y) = σ².
Sometimes X can also be a random variable. In such a case, instead of the simple mean and simple variance of y, we consider the conditional mean of y given X = x,

E(y | x) = β0 + β1x,

and the conditional variance of y given X = x,

Var(y | x) = σ².

When the values of β0, β1 and σ² are known, the model is completely described. The parameters β0, β1 and σ² are generally unknown and ε is unobserved. The determination of the statistical model y = β0 + β1X + ε depends on the determination (i.e., estimation) of β0, β1 and σ². In order to know the values of the parameters, n pairs of observations (xi, yi), i = 1, …, n, on (X, y) are observed/collected and used to determine these unknown parameters.

Various methods of estimation can be used to determine the estimates of the parameters. Among them, the least squares and maximum likelihood principles are the popular methods of estimation.
The method of least squares estimates the parameters β0 and β1 by minimizing the sum of squares of the differences between the observations and the line in the scatter diagram. Such an idea can be viewed from different perspectives. When the vertical difference between the observations and the line is considered and its sum of squares is minimized to obtain the estimates of β0 and β1, the method is known as direct regression.
The method of least absolute deviation regression considers the sum of the absolute deviations of the observations from the line in the vertical direction in the scatter diagram, as in the case of direct regression, to obtain the estimates of β0 and β1.
No assumption is required about the form of the probability distribution of εi in deriving the least squares estimates. For the purpose of deriving the statistical inferences only, we assume that the εi are random variables with

E(εi) = 0, Var(εi) = σ² and Cov(εi, εj) = 0 for all i ≠ j (i, j = 1, 2, …, n).

This assumption is needed to find the mean, variance and other properties of the least squares estimates. The assumption that the εi are normally distributed is utilized while constructing the tests of hypotheses and confidence intervals for the parameters.
Based on these approaches, different estimates of β0 and β1 are obtained which have different statistical properties. Among them, the direct regression approach is the most popular. Generally, the direct regression estimates are referred to as the least squares estimates or ordinary least squares estimates.
Direct regression method
This method is also known as ordinary least squares estimation. Assume that a set of n paired observations (xi, yi), i = 1, 2, …, n, is available which satisfies the linear regression model y = β0 + β1X + ε, so we can write the model for each observation as yi = β0 + β1xi + εi (i = 1, 2, …, n). The direct regression approach minimizes the sum of squares due to error,

S(β0, β1) = ∑ εi² = ∑ (yi − β0 − β1xi)²,  (sums over i = 1, …, n)

with respect to β0 and β1. Setting the partial derivatives ∂S/∂β0 and ∂S/∂β1 equal to zero gives the normal equations. The solutions of these two equations are called the direct regression estimators, or usually the ordinary least squares (OLS) estimators, of β0 and β1.
The resulting estimates are

b0 = ȳ − b1x̄,  b1 = sxy / sxx,

where

sxy = ∑ (xi − x̄)(yi − ȳ),  sxx = ∑ (xi − x̄)²,  x̄ = (1/n) ∑ xi,  ȳ = (1/n) ∑ yi,

with all sums over i = 1, …, n.
Further, we have

∂²S(β0, β1)/∂β0² = −2 ∑ (−1) = 2n,
∂²S(β0, β1)/∂β1² = 2 ∑ xi²,
∂²S(β0, β1)/∂β0∂β1 = 2 ∑ xi = 2nx̄.
The Hessian matrix, which is the matrix of second order partial derivatives, is in this case

H* = [ ∂²S/∂β0²     ∂²S/∂β0∂β1 ]  =  2 [ n     nx̄    ]  =  2 (𝟏, x)′(𝟏, x)
     [ ∂²S/∂β0∂β1   ∂²S/∂β1²   ]       [ nx̄   ∑ xi² ]

where 𝟏 = (1, 1, …, 1)′ is an n-vector of elements unity and x = (x1, …, xn)′ is the n-vector of observations on X. The matrix H* is positive definite if its determinant and the element in the first row and column of H* are positive. The first element is 2n > 0, and the determinant is

det(H*) = 4 (n ∑ xi² − n²x̄²) = 4n ∑ (xi − x̄)² ≥ 0.

The case ∑ (xi − x̄)² = 0 is not interesting, because then all the observations are identical, i.e., xi = c (some constant); in such a case there is no relationship between x and y in the context of regression analysis. Since in general ∑ (xi − x̄)² > 0, the determinant is positive and H* is positive definite for any (β0, β1); therefore S(β0, β1) has a global minimum at (b0, b1).
The difference between the observed value yi and the fitted value ŷi is called the ith residual:

ei = yi − ŷi = yi − (b0 + b1xi), (i = 1, 2, …, n).
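The closed-form estimates are easy to compute directly; a sketch with simulated data (the true parameter values below are hypothetical, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
x = rng.uniform(0, 10, n)
beta0, beta1, sigma = 1.0, 2.5, 1.0             # hypothetical true values
y = beta0 + beta1 * x + rng.normal(0, sigma, n)

xbar, ybar = x.mean(), y.mean()
sxx = np.sum((x - xbar) ** 2)
sxy = np.sum((x - xbar) * (y - ybar))

b1 = sxy / sxx            # slope estimate
b0 = ybar - b1 * xbar     # intercept estimate
e = y - (b0 + b1 * x)     # residuals

# Consequences of the normal equations:
# the residuals sum to zero and are orthogonal to x.
print(b0, b1, e.sum(), (e * x).sum())
```

The two printed sums are zero up to floating-point rounding, which reflects exactly the two normal equations ∂S/∂β0 = 0 and ∂S/∂β1 = 0.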
ECONOMETRIC THEORY
MODULE – II
Lecture - 4
Simple Linear Regression Analysis
Unbiased property
Note that b1 = sxy/sxx and b0 = ȳ − b1x̄ are linear combinations of yi (i = 1, …, n). Therefore

b1 = ∑ ki yi

where ki = (xi − x̄)/sxx. Note that ∑ ki = 0 and ∑ ki xi = 1 (sums over i = 1, …, n). Hence

E(b1) = ∑ ki E(yi) = ∑ ki (β0 + β1xi) = β1,

so b1 is an unbiased estimator of β1. It follows similarly that E(b0) = β0.
The variance of b1 is

Var(b1) = ∑ ki² Var(yi)  (since y1, …, yn are independent)
        = σ² ∑ (xi − x̄)² / sxx²
        = σ² sxx / sxx²
        = σ² / sxx.
For the intercept,

Var(b0) = Var(ȳ) + x̄² Var(b1) − 2x̄ Cov(ȳ, b1).

First we find

Cov(ȳ, b1) = E[{ȳ − E(ȳ)}{b1 − E(b1)}]
           = (1/n) E[(∑ εi)(β0 ∑ ki + β1 ∑ ki xi + ∑ ki εi − β1)]
           = (1/n) E[(∑ εi)(∑ ki εi)]   (using ∑ ki = 0 and ∑ ki xi = 1)
           = (σ²/n) ∑ ki
           = 0,

so, with Var(ȳ) = σ²/n,

Var(b0) = σ² (1/n + x̄²/sxx).
Covariance

Cov(b0, b1) = Cov(ȳ, b1) − x̄ Var(b1) = −(x̄/sxx) σ².
It can further be shown that the ordinary least squares estimators b0 and b1 possess the minimum variance in the class of linear and unbiased estimators, so they are termed the best linear unbiased estimators (BLUE). This property is known as the Gauss-Markov theorem, which is discussed later in the multiple linear regression model.
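These moment results (unbiasedness, the variance formulas and the covariance) can be checked by simulation. A sketch with a fixed design and hypothetical true values, redrawing only the errors:

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0, 10, 20)            # fixed, non-stochastic design
beta0, beta1, sigma = 1.0, 2.0, 1.0   # hypothetical true values
xbar = x.mean()
sxx = np.sum((x - xbar) ** 2)

b0s, b1s = [], []
for _ in range(20000):
    y = beta0 + beta1 * x + rng.normal(0, sigma, x.size)
    b1 = np.sum((x - xbar) * (y - y.mean())) / sxx
    b0 = y.mean() - b1 * xbar
    b0s.append(b0)
    b1s.append(b1)

b0s, b1s = np.array(b0s), np.array(b1s)
print(b1s.mean())              # ~ beta1 = 2          (unbiasedness)
print(b1s.var())               # ~ sigma^2 / sxx
print(np.cov(b0s, b1s)[0, 1])  # ~ -xbar * sigma^2 / sxx  (negative)
```

The negative covariance is intuitive here: with x̄ > 0, a slope estimated too high forces the intercept down, and vice versa.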
Residual sum of squares
The residual sum of squares is given as

SSres = ∑ ei² = ∑ (yi − ŷi)²
      = ∑ (yi − b0 − b1xi)²
      = ∑ [(yi − ȳ) − b1(xi − x̄)]²   (using b0 = ȳ − b1x̄)
      = ∑ (yi − ȳ)² + b1² ∑ (xi − x̄)² − 2b1 ∑ (xi − x̄)(yi − ȳ)
      = syy + b1²sxx − 2b1sxy
      = syy − b1sxy   (using b1 = sxy/sxx),

where syy = ∑ (yi − ȳ)² and all sums run over i = 1, …, n.
Estimation of σ²
The estimator of σ² is obtained from the residual sum of squares as follows. Assuming that yi is normally distributed, SSres/σ² has a χ² distribution with (n − 2) degrees of freedom:

SSres/σ² ~ χ²(n − 2).

Thus, using the result about the expectation of a chi-square random variable,

E(SSres) = (n − 2)σ²,

and an unbiased estimator of σ² is

s² = SSres/(n − 2).

Note that SSres has only (n − 2) degrees of freedom; two degrees of freedom are lost due to the estimation of b0 and b1. Since s² depends on the estimates b0 and b1, it is a model-dependent estimate of σ².
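A quick numerical check of E(SSres) = (n − 2)σ², with simulated normal errors and hypothetical parameter values:

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(1, 9, 10)
beta0, beta1, sigma = 0.5, 1.5, 2.0   # hypothetical true values
xbar = x.mean()
sxx = np.sum((x - xbar) ** 2)
n = x.size

ss_res = []
for _ in range(20000):
    y = beta0 + beta1 * x + rng.normal(0, sigma, n)
    b1 = np.sum((x - xbar) * (y - y.mean())) / sxx
    b0 = y.mean() - b1 * xbar
    ss_res.append(np.sum((y - b0 - b1 * x) ** 2))

# Average SS_res should be near (n - 2) * sigma^2, so SS_res/(n - 2)
# averages to sigma^2 = 4: the estimator s^2 is unbiased.
print(np.mean(ss_res) / (n - 2))
```

Dividing by n instead of (n − 2) would systematically underestimate σ², which is exactly the bias of the maximum likelihood estimate discussed in the next lecture.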
The estimated variances of b0 and b1 are obtained by replacing σ² by s²:

Var̂(b0) = s² (1/n + x̄²/sxx)  and  Var̂(b1) = s²/sxx.

It is observed that, since ∑ (yi − ŷi) = 0, we have ∑ ei = 0 (sums over i = 1, …, n). In the light of this property, ei can be regarded as an estimate of the unknown εi (i = 1, …, n). This helps in verifying the different model assumptions on the basis of the given sample (xi, yi), i = 1, 2, …, n.
Centered model
Sometimes it is useful to measure the independent variable around its mean. In such a case, the model yi = β0 + β1xi + εi has a centered version as follows:

yi = β0 + β1(xi − x̄) + β1x̄ + εi, (i = 1, 2, …, n)
   = β0* + β1(xi − x̄) + εi,

where β0* = β0 + β1x̄. The least squares estimates are obtained by minimizing

S(β0*, β1) = ∑ εi² = ∑ [yi − β0* − β1(xi − x̄)]².

Now solving

∂S(β0*, β1)/∂β0* = 0  and  ∂S(β0*, β1)/∂β1 = 0,

we get the direct regression least squares estimates of β0* and β1 as

b0* = ȳ  and  b1 = sxy/sxx,

respectively.
Thus the form of the estimate of the slope parameter β1 remains the same in the usual and centered models, whereas the form of the estimate of the intercept term changes. Further, the Hessian matrix of the second order partial derivatives of S(β0*, β1) with respect to β0* and β1 is positive definite at β0* = b0* and β1 = b1, which ensures that S(β0*, β1) is minimized at β0* = b0* and β1 = b1.
Under the assumption that E(εi) = 0, Var(εi) = σ² and Cov(εi, εj) = 0 for all i ≠ j = 1, 2, …, n, it follows that

E(b0*) = β0*,  E(b1) = β1,
Var(b0*) = σ²/n,  Var(b1) = σ²/sxx.

In this case the fitted model is

y = ȳ + b1(x − x̄),

with fitted values ŷi = ȳ + b1(xi − x̄) (i = 1, …, n). Moreover, in the centered model

Cov(b0*, b1) = 0.
No-intercept model
Consider now the linear regression model without the intercept term, yi = β1xi + εi (i = 1, 2, …, n). Using the data (xi, yi), i = 1, 2, …, n, the direct regression least squares estimate of β1 is obtained by minimizing

S(β1) = ∑ εi² = ∑ (yi − β1xi)²

and solving

∂S(β1)/∂β1 = 0

gives the estimator of β1 as

b1* = ∑ yixi / ∑ xi²,

with all sums over i = 1, …, n. The second order partial derivative of S(β1) with respect to β1 at β1 = b1* is positive, which ensures that b1* minimizes S(β1).
Using the assumption that E(εi) = 0, Var(εi) = σ² and Cov(εi, εj) = 0 for all i ≠ j = 1, 2, …, n, the properties of b1* can be derived as follows:

E(b1*) = ∑ xi E(yi) / ∑ xi²
       = ∑ xi² β1 / ∑ xi²
       = β1,

so b1* is an unbiased estimator of β1. Its variance is

Var(b1*) = ∑ xi² Var(yi) / (∑ xi²)²
         = σ² ∑ xi² / (∑ xi²)²
         = σ² / ∑ xi²,

and an unbiased estimator of σ² is

(∑ yi² − b1* ∑ yixi) / (n − 1),

with all sums over i = 1, …, n.
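A sketch of the no-intercept fit on simulated data through the origin (parameter values are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(1, 10, 300)
beta1, sigma = 2.0, 1.0                 # hypothetical true values
y = beta1 * x + rng.normal(0, sigma, x.size)

b1_star = np.sum(y * x) / np.sum(x ** 2)   # slope of the line through the origin

# Unbiased estimator of sigma^2 with (n - 1) degrees of freedom:
# only one parameter (the slope) is estimated here.
s2 = (np.sum(y ** 2) - b1_star * np.sum(y * x)) / (x.size - 1)
print(b1_star, s2)
```

Note the divisor n − 1 rather than n − 2: only one degree of freedom is lost because there is no intercept to estimate.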
ECONOMETRIC THEORY
MODULE – II
Lecture - 5
Simple Linear Regression Analysis
Maximum likelihood estimation
Now we use the method of maximum likelihood to estimate the parameters of the linear regression model yi = β0 + β1xi + εi, where the observations yi (i = 1, 2, …, n) are independently distributed as N(β0 + β1xi, σ²). The likelihood function is

L(xi, yi; β0, β1, σ²) = ∏ (1/(2πσ²))^(1/2) exp[ −(1/(2σ²)) (yi − β0 − β1xi)² ]  (product over i = 1, …, n),

and the log-likelihood is

ln L(xi, yi; β0, β1, σ²) = −(n/2) ln 2π − (n/2) ln σ² − (1/(2σ²)) ∑ (yi − β0 − β1xi)².

The normal equations are obtained by partially differentiating the log-likelihood with respect to β0, β1 and σ² and equating the derivatives to zero:

∂ ln L/∂β0 = (1/σ²) ∑ (yi − β0 − β1xi) = 0,
∂ ln L/∂β1 = (1/σ²) ∑ (yi − β0 − β1xi) xi = 0,
∂ ln L/∂σ² = −n/(2σ²) + (1/(2σ⁴)) ∑ (yi − β0 − β1xi)² = 0.
The solution of these normal equations gives the maximum likelihood estimates of β0, β1 and σ² as

b0 = ȳ − b1x̄,
b1 = ∑ (xi − x̄)(yi − ȳ) / ∑ (xi − x̄)² = sxy/sxx

and

s̃² = (1/n) ∑ (yi − b0 − b1xi)²,

respectively. It can be verified that the Hessian matrix of second order partial derivatives of ln L with respect to β0, β1 and σ² is negative definite at β0 = b0, β1 = b1 and σ² = s̃², which ensures that the likelihood function is maximized at these values.

Note that the least squares and maximum likelihood estimates of β0 and β1 are identical when the disturbances are normally distributed. The least squares and maximum likelihood estimates of σ² differ: the least squares estimate is

s² = (1/(n − 2)) ∑ (yi − b0 − b1xi)²,

so the maximum likelihood estimate is related to it by s̃² = ((n − 2)/n) s². Thus b0 and b1 are unbiased estimators of β0 and β1, whereas s̃² is a biased estimate of σ², though it is asymptotically unbiased. The variances of the maximum likelihood estimates of β0 and β1 are the same as those of the least squares estimates, while the two estimates of σ² differ in their bias and mean squared error.
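The two σ² estimates and the relation s̃² = ((n − 2)/n)s² in code (illustrative data and parameter values):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 50
x = rng.uniform(0, 10, n)
y = 1.0 + 2.0 * x + rng.normal(0, 1.0, n)   # hypothetical true model

xbar = x.mean()
b1 = np.sum((x - xbar) * (y - y.mean())) / np.sum((x - xbar) ** 2)
b0 = y.mean() - b1 * xbar
ss_res = np.sum((y - b0 - b1 * x) ** 2)

s2_ols = ss_res / (n - 2)   # unbiased least squares estimate of sigma^2
s2_mle = ss_res / n         # biased maximum likelihood estimate

# The two differ only by the factor (n - 2)/n, which tends to 1 as n grows.
print(s2_ols, s2_mle)
```

Both estimates use the same residual sum of squares; only the divisor differs, so the MLE's bias vanishes asymptotically.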
Testing of hypotheses and confidence interval estimation for the slope parameter
Consider the null hypothesis

H0: β1 = β10,

where β10 is some given constant. When σ² is known, the test uses the result that

b1 ~ N(β1, σ²/sxx),

so that under H0

Z1 = (b1 − β10) / √(σ²/sxx) ~ N(0, 1).

The 100(1 − α)% confidence interval for β1 can be obtained using the Z1 statistic as follows:

P[ −z(α/2) ≤ Z1 ≤ z(α/2) ] = 1 − α
P[ −z(α/2) ≤ (b1 − β1)/√(σ²/sxx) ≤ z(α/2) ] = 1 − α
P[ b1 − z(α/2) √(σ²/sxx) ≤ β1 ≤ b1 + z(α/2) √(σ²/sxx) ] = 1 − α.

So the 100(1 − α)% confidence interval for β1 when σ² is known is

[ b1 − z(α/2) √(σ²/sxx), b1 + z(α/2) √(σ²/sxx) ].
When σ² is unknown, it is replaced by the unbiased estimator SSres/(n − 2), using E[SSres/(n − 2)] = σ². Further, SSres/σ² and b1 are independently distributed. This result will be proved formally later in the module on multiple linear regression. It also follows from the result that, under a normal distribution, the maximum likelihood estimates, viz., the sample mean (estimator of the population mean) and the sample variance (estimator of the population variance), are independently distributed; so b1 and s² are also independently distributed.
Similarly, the decision rule for a one-sided alternative hypothesis can also be framed. The 100(1 − α)% confidence interval for β1 can be obtained using the statistic

t0 = (b1 − β10) / √( SSres/((n − 2) sxx) ),

which follows a t distribution with (n − 2) degrees of freedom under H0, as follows:

P[ −t(n−2, α/2) ≤ t0 ≤ t(n−2, α/2) ] = 1 − α
P[ −t(n−2, α/2) ≤ (b1 − β1)/√(σ̂²/sxx) ≤ t(n−2, α/2) ] = 1 − α,  where σ̂² = SSres/(n − 2)
P[ b1 − t(n−2, α/2) √(σ̂²/sxx) ≤ β1 ≤ b1 + t(n−2, α/2) √(σ̂²/sxx) ] = 1 − α.

So the 100(1 − α)% confidence interval for β1 is

[ b1 − t(n−2, α/2) √(SSres/((n − 2) sxx)), b1 + t(n−2, α/2) √(SSres/((n − 2) sxx)) ].
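The t-based interval for β1 can be computed directly, using SciPy only for the t quantile (illustrative data and hypothetical true values):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n = 30
x = rng.uniform(0, 10, n)
y = 1.0 + 2.0 * x + rng.normal(0, 1.0, n)   # hypothetical true model

xbar = x.mean()
sxx = np.sum((x - xbar) ** 2)
b1 = np.sum((x - xbar) * (y - y.mean())) / sxx
b0 = y.mean() - b1 * xbar
ss_res = np.sum((y - b0 - b1 * x) ** 2)

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)       # t_{n-2, alpha/2}
half_width = t_crit * np.sqrt(ss_res / ((n - 2) * sxx))
ci = (b1 - half_width, b1 + half_width)
print(ci)   # covers the true slope in about 95% of repeated samples
```

Since σ² is estimated here, the t quantile with n − 2 degrees of freedom replaces the normal quantile, widening the interval slightly.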
Now we consider the tests of hypotheses and confidence interval estimation for the intercept term under two cases, viz., when σ² is known and when σ² is unknown. Consider the null hypothesis H0: β0 = β00. When σ² is known, then using the results that E(b0) = β0, Var(b0) = σ²(1/n + x̄²/sxx) and that b0 is a linear combination of normally distributed random variables, the statistic

Z0 = (b0 − β00) / √( σ² (1/n + x̄²/sxx) )

has a N(0, 1) distribution when H0 is true. Similarly, the decision rule for a one-sided alternative hypothesis can also be framed.
The 100(1 − α)% confidence interval for β0 when σ² is known can be derived using the Z0 statistic as follows:

P[ −z(α/2) ≤ Z0 ≤ z(α/2) ] = 1 − α
P[ −z(α/2) ≤ (b0 − β0)/√(σ²(1/n + x̄²/sxx)) ≤ z(α/2) ] = 1 − α
P[ b0 − z(α/2) √(σ²(1/n + x̄²/sxx)) ≤ β0 ≤ b0 + z(α/2) √(σ²(1/n + x̄²/sxx)) ] = 1 − α.

So the 100(1 − α)% confidence interval for β0 is

[ b0 − z(α/2) √(σ²(1/n + x̄²/sxx)), b0 + z(α/2) √(σ²(1/n + x̄²/sxx)) ].
When σ² is unknown, the test statistic is

t0 = (b0 − β00) / √( (SSres/(n − 2)) (1/n + x̄²/sxx) ),

which follows a t distribution with (n − 2) degrees of freedom when H0 is true. Similarly, the decision rule for a one-sided alternative hypothesis can also be framed.
The 100(1 − α)% confidence interval for β0 can be obtained as follows. Consider

P[ −t(n−2, α/2) ≤ t0 ≤ t(n−2, α/2) ] = 1 − α
P[ −t(n−2, α/2) ≤ (b0 − β0)/√((SSres/(n − 2))(1/n + x̄²/sxx)) ≤ t(n−2, α/2) ] = 1 − α
P[ b0 − t(n−2, α/2) √((SSres/(n − 2))(1/n + x̄²/sxx)) ≤ β0 ≤ b0 + t(n−2, α/2) √((SSres/(n − 2))(1/n + x̄²/sxx)) ] = 1 − α.

So the 100(1 − α)% confidence interval for β0 is

[ b0 − t(n−2, α/2) √((SSres/(n − 2))(1/n + x̄²/sxx)), b0 + t(n−2, α/2) √((SSres/(n − 2))(1/n + x̄²/sxx)) ].
Similarly, a confidence interval for σ² can be obtained. Since $\dfrac{SS_{res}}{\sigma^2} \sim \chi^2_{n-2}$,
$$P\left[\chi^2_{n-2,\alpha/2} \le \frac{SS_{res}}{\sigma^2} \le \chi^2_{n-2,1-\alpha/2}\right] = 1-\alpha$$
$$P\left[\frac{SS_{res}}{\chi^2_{n-2,1-\alpha/2}} \le \sigma^2 \le \frac{SS_{res}}{\chi^2_{n-2,\alpha/2}}\right] = 1-\alpha.$$
The corresponding 100(1 − α)% confidence interval for σ² is
$$\left[\frac{SS_{res}}{\chi^2_{n-2,1-\alpha/2}},\;\; \frac{SS_{res}}{\chi^2_{n-2,\alpha/2}}\right].$$
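A minimal sketch of this interval; the $SS_{res}$ value and the χ² quantiles for 6 degrees of freedom are hypothetical table values, not computed here:

```python
# hypothetical residual sum of squares from a fit with n = 8 points (n - 2 = 6 df);
# chi-square quantiles chi2_{6,0.025} and chi2_{6,0.975} taken from standard tables
ss_res = 0.195
chi2_lower, chi2_upper = 1.237, 14.449

# 95% confidence interval for sigma^2: [SS_res/chi2_upper, SS_res/chi2_lower]
ci_sigma2 = (ss_res / chi2_upper, ss_res / chi2_lower)
```

Note that the upper χ² quantile appears in the lower endpoint, and vice versa, because of the inversion of the inequality.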
ECONOMETRIC THEORY
MODULE – II
Lecture - 6
Simple Linear Regression Analysis
To construct a joint confidence region for $\beta_0$ and $\beta_1$, rewrite the linear regression model as
$$y_i = \beta_0^* + \beta_1(x_i - \bar x) + \varepsilon_i$$
where $\beta_0^* = \beta_0 + \beta_1\bar x$. The least squares estimators of $\beta_0^*$ and $\beta_1$ are $b_0^* = \bar y$ and $b_1 = \dfrac{s_{xy}}{s_{xx}}$, respectively.
Using the results that
$$E(b_0^*) = \beta_0^*,\quad E(b_1) = \beta_1,\quad Var(b_0^*) = \frac{\sigma^2}{n},\quad Var(b_1) = \frac{\sigma^2}{s_{xx}},$$
we have
$$\frac{b_0^* - \beta_0^*}{\sqrt{\sigma^2/n}} \sim N(0,1) \quad\text{and}\quad \frac{b_1 - \beta_1}{\sqrt{\sigma^2/s_{xx}}} \sim N(0,1).$$
Moreover, both statistics are independently distributed. Thus
$$\left(\frac{b_0^* - \beta_0^*}{\sqrt{\sigma^2/n}}\right)^2 \sim \chi_1^2 \quad\text{and}\quad \left(\frac{b_1 - \beta_1}{\sqrt{\sigma^2/s_{xx}}}\right)^2 \sim \chi_1^2$$
are also independently distributed because $b_0^*$ and $b_1$ are independently distributed. Consequently, the sum of these two statistics,
$$\frac{n(b_0^* - \beta_0^*)^2 + s_{xx}(b_1 - \beta_1)^2}{\sigma^2} \sim \chi_2^2.$$
Since
$$\frac{SS_{res}}{\sigma^2} \sim \chi^2_{n-2}$$
independently of the above, the ratio of the two χ² statistics, each divided by its degrees of freedom, follows an F distribution. Substituting $b_0^* = b_0 + b_1\bar x$ and $\beta_0^* = \beta_0 + \beta_1\bar x$, we get
$$\frac{n-2}{2}\cdot\frac{Q_f}{SS_{res}} \sim F_{2,n-2}$$
where
$$Q_f = n(b_0 - \beta_0)^2 + 2\sum_{i=1}^n x_i\,(b_0 - \beta_0)(b_1 - \beta_1) + \sum_{i=1}^n x_i^2\,(b_1 - \beta_1)^2.$$
Since
$$P\left[\frac{n-2}{2}\cdot\frac{Q_f}{SS_{res}} \le F_{2,n-2;\alpha}\right] = 1-\alpha$$
holds true for all values of $\beta_0$ and $\beta_1$, the 100(1 − α)% confidence region for $\beta_0$ and $\beta_1$ is
$$\frac{n-2}{2}\cdot\frac{Q_f}{SS_{res}} \le F_{2,n-2;\alpha}.$$
This confidence region is an ellipse which gives the 100(1 − α)% probability that $\beta_0$ and $\beta_1$ are contained simultaneously in this ellipse.
Analysis of variance
The technique of analysis of variance is usually used for testing hypotheses related to the equality of more than one parameter, like population means or slope parameters. It is more meaningful in the case of the multiple regression model, where there is more than one slope parameter. The technique is discussed and illustrated here to explain the related basic concepts and fundamentals, which will be used in developing the analysis of variance for the multiple linear regression model in the next module, where there are more than two explanatory variables.
A test statistic for testing $H_0: \beta_1 = 0$ can also be formulated using the analysis of variance technique as follows.
Consider the identity $y_i - \hat y_i = (y_i - \bar y) - (\hat y_i - \bar y)$. Squaring both sides and summing over $i$ gives
$$\sum_{i=1}^n (y_i - \hat y_i)^2 = \sum_{i=1}^n (y_i - \bar y)^2 + \sum_{i=1}^n (\hat y_i - \bar y)^2 - 2\sum_{i=1}^n (y_i - \bar y)(\hat y_i - \bar y).$$
Further consider
$$\sum_{i=1}^n (y_i - \bar y)(\hat y_i - \bar y) = \sum_{i=1}^n (y_i - \bar y)\,b_1(x_i - \bar x) = b_1^2\sum_{i=1}^n (x_i - \bar x)^2 = \sum_{i=1}^n (\hat y_i - \bar y)^2.$$
Thus we have
$$\sum_{i=1}^n (y_i - \bar y)^2 = \sum_{i=1}^n (y_i - \hat y_i)^2 + \sum_{i=1}^n (\hat y_i - \bar y)^2.$$
The term $\sum_{i=1}^n (y_i - \bar y)^2$ is called the sum of squares about the mean, or the corrected sum of squares of $y$ (i.e., $SS_{corrected}$), or the total sum of squares, denoted as $s_{yy}$.
The term $\sum_{i=1}^n (y_i - \hat y_i)^2$ describes the deviations of the observations from the predicted values, viz., the residual sum of squares
$$SS_{res} = \sum_{i=1}^n (y_i - \hat y_i)^2,$$
whereas the term $\sum_{i=1}^n (\hat y_i - \bar y)^2$ describes the variability explained by the regression,
$$SS_{reg} = \sum_{i=1}^n (\hat y_i - \bar y)^2.$$
If all observations $y_i$ lie on a straight line, then $\sum_{i=1}^n (y_i - \hat y_i)^2 = 0$ and thus $SS_{corrected} = SS_{reg}$.
Note that $SS_{reg}$ is completely determined by $b_1$ and so has only one degree of freedom. The total sum of squares $s_{yy} = \sum_{i=1}^n (y_i - \bar y)^2$ has (n − 1) degrees of freedom due to the constraint $\sum_{i=1}^n (y_i - \bar y) = 0$, and $SS_{res}$ has (n − 2) degrees of freedom.
All sums of squares are mutually independent and, scaled by σ², distributed as $\chi^2_{df}$ with df degrees of freedom if the errors are normally distributed.
The mean square due to regression is
$$MS_{reg} = \frac{SS_{reg}}{1}$$
and the mean square due to residuals is
$$MSE = \frac{SS_{res}}{n-2}.$$
The test statistic for testing $H_0: \beta_1 = 0$ is
$$F_0 = \frac{MS_{reg}}{MSE}.$$
If $H_0: \beta_1 = 0$ is true, then $MS_{reg}$ and $MSE$ are independently distributed and thus $F_0 \sim F_{1,n-2}$.
The analysis of variance table is:

Source of variation    Sum of squares    Degrees of freedom    Mean square    F
Regression             SSreg             1                     MSreg          MSreg/MSE
Residual               SSres             n − 2                 MSE
Total                  syy               n − 1
Some other forms of $SS_{reg}$, $SS_{res}$ and $s_{yy}$ can be derived as follows:
The sample correlation coefficient may be written as
$$r_{xy} = \frac{s_{xy}}{\sqrt{s_{xx}s_{yy}}}.$$
Moreover, we have
$$b_1 = \frac{s_{xy}}{s_{xx}} = r_{xy}\sqrt{\frac{s_{yy}}{s_{xx}}}.$$
The residual sum of squares is
$$SS_{res} = \sum_{i=1}^n \left[(y_i - \bar y) - b_1(x_i - \bar x)\right]^2 = s_{yy} - b_1^2 s_{xx} = s_{yy} - \frac{s_{xy}^2}{s_{xx}}.$$
Also
$$SS_{corrected} = s_{yy}$$
and
$$SS_{reg} = s_{yy} - SS_{res} = \frac{s_{xy}^2}{s_{xx}} = b_1^2 s_{xx} = b_1 s_{xy}.$$
It can be noted that a fitted model can be said to be good when the residuals are small. Since $SS_{res}$ is based on the residuals, a measure of the quality of the fitted model can be based on $SS_{res}$. When an intercept term is present in the model, a measure of goodness of fit of the model is given by
$$R^2 = 1 - \frac{SS_{res}}{s_{yy}} = \frac{SS_{reg}}{s_{yy}}.$$
This is known as the coefficient of determination. This measure is based on how much of the variation in the $y$'s, as measured by $s_{yy}$, is explained by $SS_{reg}$, and how much of the unexplained part is contained in $SS_{res}$. The ratio $SS_{reg}/s_{yy}$ describes the proportion of variability that is explained by the regression in relation to the total variability of $y$. The ratio $SS_{res}/s_{yy}$ describes the proportion of variability that is not covered by the regression.
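The sum-of-squares identities and the F statistic above can be checked numerically; a small sketch on hypothetical data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.2, 3.9, 6.1, 8.2, 9.8, 12.1])

n = len(x)
sxx = np.sum((x - x.mean()) ** 2)
sxy = np.sum((x - x.mean()) * (y - y.mean()))
syy = np.sum((y - y.mean()) ** 2)

b1 = sxy / sxx
ss_reg = b1 * sxy              # SS_reg = b1 * sxy = b1^2 * sxx
ss_res = syy - ss_reg          # SS_res = syy - SS_reg

ms_reg = ss_reg / 1.0
mse = ss_res / (n - 2)
F0 = ms_reg / mse              # F statistic for H0: beta1 = 0
R2 = ss_reg / syy              # coefficient of determination
```

The decomposition $s_{yy} = SS_{reg} + SS_{res}$ holds exactly by construction here.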
MODULE – II
Lecture - 7
Simple Linear Regression Analysis
The reverse (or inverse) regression approach minimizes the sum of squares of horizontal distances between the observed
data points and the line in the following scatter diagram to obtain the estimates of regression parameters.
The reverse regression has been advocated in the analysis of sex (or race) discrimination in salaries. For example, if y
denotes salary and x denotes qualifications and we are interested in determining if there is a sex discrimination in salaries,
we can ask:
“Whether men and women with the same qualifications (value of x) are getting the same salaries (value of y). This
question is answered by the direct regression.”
Alternatively, we can ask:
“Whether men and women with the same salaries (value of y) have the same qualifications (value of x). This
question is answered by the reverse regression, i.e., regression of x on y.”
The regression equation in the case of reverse regression can be written as
$$x_i = \beta_0^* + \beta_1^* y_i + \delta_i \quad (i = 1,2,\ldots,n)$$
where the $\delta_i$'s are the associated random error components and satisfy the assumptions as in the case of the usual simple linear regression model.
The reverse regression estimates $\hat\beta_{0R}$ of $\beta_0^*$ and $\hat\beta_{1R}$ of $\beta_1^*$ for this model are obtained by interchanging the roles of $x$ and $y$ in the direct regression estimators:
$$\hat\beta_{0R} = \bar x - \hat\beta_{1R}\bar y$$
and
$$\hat\beta_{1R} = \frac{s_{xy}}{s_{yy}}$$
for $\beta_0^*$ and $\beta_1^*$, respectively. The residual sum of squares in this case is
$$SS^*_{res} = s_{xx} - \frac{s_{xy}^2}{s_{yy}}.$$
Note that
$$\hat\beta_{1R}\, b_1 = \frac{s_{xy}^2}{s_{xx}s_{yy}} = r_{xy}^2$$
where $b_1$ is the direct regression estimator of the slope parameter and $r_{xy}$ is the correlation coefficient between $x$ and $y$. Hence if $r_{xy}^2$ is close to 1, the two regression lines will be close to each other.
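The relation $\hat\beta_{1R}\,b_1 = r_{xy}^2$ is easy to verify numerically; a sketch on hypothetical data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

sxx = np.sum((x - x.mean()) ** 2)
syy = np.sum((y - y.mean()) ** 2)
sxy = np.sum((x - x.mean()) * (y - y.mean()))

b1 = sxy / sxx                 # direct regression slope (y on x)
b1R = sxy / syy                # reverse regression slope (x on y)
r2 = sxy ** 2 / (sxx * syy)    # squared correlation coefficient
```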
The direct and reverse regression methods of estimation assume that the errors in the observations are either in the x-direction or in the y-direction. In other words, the errors can be either in the dependent variable or in the independent variable. There can be situations when uncertainties are involved in both the dependent and independent variables. In such situations, orthogonal regression is more appropriate. In order to take care of errors in both directions, the least squares principle in orthogonal regression minimizes the squared perpendicular distance between the observed data points and the line to obtain the estimates of the regression coefficients. This is also known as the major axis regression method. The estimates obtained are called orthogonal regression estimates or major axis regression estimates of the regression coefficients.
All observed data points are expected to lie on the line, but in practice they deviate from it. The squared perpendicular distance of an observed data point $(x_i, y_i)\ (i = 1,2,\ldots,n)$ from the line is given by
$$d_i^2 = (X_i - x_i)^2 + (Y_i - y_i)^2$$
where $(X_i, Y_i)$ denotes the $i$-th pair of observations without any error, which lies on the line.
Minimizing the sum of squared perpendicular distances $\sum_{i=1}^n d_i^2$ subject to the constraints $E_i = Y_i - \beta_0 - \beta_1 X_i = 0$ by the Lagrangian multiplier method, with multipliers $\lambda_1,\ldots,\lambda_n$, yields among the stationarity conditions
$$\sum_{i=1}^n \lambda_i = 0 \quad\text{and}\quad \frac{\partial L_0}{\partial\beta_1} = \sum_{i=1}^n \lambda_i X_i = 0.$$
Since
$$X_i = x_i - \lambda_i\beta_1,\qquad Y_i = y_i + \lambda_i,$$
substituting these values in $E_i$, we obtain
$$E_i = (y_i + \lambda_i) - \beta_0 - \beta_1(x_i - \lambda_i\beta_1) = 0$$
$$\Rightarrow \lambda_i = \frac{\beta_0 + \beta_1 x_i - y_i}{1 + \beta_1^2}.$$
Also, using this $\lambda_i$ in the equation $\sum_{i=1}^n \lambda_i = 0$, we get
$$\frac{\sum_{i=1}^n (\beta_0 + \beta_1 x_i - y_i)}{1 + \beta_1^2} = 0,$$
and using $(X_i - x_i) + \lambda_i\beta_1 = 0$ and $\sum_{i=1}^n \lambda_i X_i = 0$, we get
$$\sum_{i=1}^n \lambda_i(x_i - \lambda_i\beta_1) = 0.$$
Substituting for $\lambda_i$, this becomes
$$\frac{\sum_{i=1}^n (\beta_0 x_i + \beta_1 x_i^2 - y_i x_i)}{1 + \beta_1^2} - \frac{\beta_1\sum_{i=1}^n (\beta_0 + \beta_1 x_i - y_i)^2}{(1 + \beta_1^2)^2} = 0. \qquad (1)$$
Solving the first equation gives
$$\hat\beta_{0OR} = \bar y - \hat\beta_{1OR}\bar x.$$
Substituting this value of $\beta_0$ in (1) gives
$$(1 + \beta_1^2)\sum_{i=1}^n x_i\left[y_i - \bar y - \beta_1(x_i - \bar x)\right] + \beta_1\sum_{i=1}^n \left[-(y_i - \bar y) + \beta_1(x_i - \bar x)\right]^2 = 0$$
or
$$(1 + \beta_1^2)\sum_{i=1}^n (u_i + \bar x)(v_i - \beta_1 u_i) + \beta_1\sum_{i=1}^n (-v_i + \beta_1 u_i)^2 = 0$$
where
$$u_i = x_i - \bar x,\qquad v_i = y_i - \bar y.$$
Since $\sum_{i=1}^n u_i = \sum_{i=1}^n v_i = 0$, this reduces to
$$\sum_{i=1}^n \left[\beta_1^2 u_i v_i + \beta_1(u_i^2 - v_i^2) - u_i v_i\right] = 0$$
or
$$\beta_1^2 s_{xy} + \beta_1(s_{xx} - s_{yy}) - s_{xy} = 0.$$
This quadratic equation has two roots, one positive and one negative; the root having the same sign as $s_{xy}$ minimizes the sum of squared perpendicular distances, which gives the orthogonal regression estimate of $\beta_1$ as
$$\hat\beta_{1OR} = \frac{(s_{yy} - s_{xx}) + \sqrt{(s_{yy} - s_{xx})^2 + 4s_{xy}^2}}{2s_{xy}}$$
where sign($s_{xy}$) denotes the sign of $s_{xy}$, which can be positive or negative, and $\hat\beta_{1OR}$ carries the same sign as sign($s_{xy}$).
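A numerical sketch of the orthogonal regression slope on hypothetical data, solving the quadratic in the slope and taking the root with the same sign as $s_{xy}$ (this root choice is an assumption consistent with the sign($s_{xy}$) remark above). On points lying exactly on a line, the estimate reproduces the line's slope:

```python
import numpy as np

def orthogonal_slope(x, y):
    """Root of beta1^2*sxy + beta1*(sxx - syy) - sxy = 0 whose sign
    matches that of sxy (the minimizing root)."""
    sxx = np.sum((x - x.mean()) ** 2)
    syy = np.sum((y - y.mean()) ** 2)
    sxy = np.sum((x - x.mean()) * (y - y.mean()))
    disc = np.sqrt((syy - sxx) ** 2 + 4.0 * sxy ** 2)
    return ((syy - sxx) + disc) / (2.0 * sxy)

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x                    # exact line y = 2x
b1_or = orthogonal_slope(x, y)
```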
MODULE – II
Lecture - 8
Simple Linear Regression Analysis
The direct, reverse and orthogonal methods of estimation minimize the errors in a particular direction, which is usually the distance between the observed data points and the line in the scatter diagram. Alternatively, one can consider the area extended by the data points in a certain neighbourhood: instead of distances, the areas of the rectangles defined between each observed data point and the nearest point on the line can be minimized. Such an approach is more appropriate when uncertainties are present in the study as well as the explanatory variables. This approach is termed reduced major axis regression.
The area of the rectangle extended between the $i$-th observed data point and the line is
$$A_i = (X_i \sim x_i)(Y_i \sim y_i) \quad (i = 1,2,\ldots,n)$$
where $(X_i, Y_i)$ denotes the $i$-th pair of observations without any error, which lies on the line, and $(X_i \sim x_i)$ denotes the absolute difference between $X_i$ and $x_i$.
The total area extended by the $n$ data points is
$$\sum_{i=1}^n A_i = \sum_{i=1}^n (X_i \sim x_i)(Y_i \sim y_i).$$
All observed data points $(x_i, y_i)\ (i = 1,2,\ldots,n)$ are expected to lie on the line
$$Y_i = \beta_0 + \beta_1 X_i,$$
and let
$$E_i^* \equiv Y_i - \beta_0 - \beta_1 X_i = 0$$
denote the corresponding constraints.
So now the objective is to minimize the sum of the areas subject to the constraints $E_i^*$ to obtain the reduced major axis estimates of the regression coefficients. Using the Lagrangian multiplier method, the Lagrangian function is
$$L_R = \sum_{i=1}^n A_i - \sum_{i=1}^n \mu_i E_i^* = \sum_{i=1}^n (X_i - x_i)(Y_i - y_i) - \sum_{i=1}^n \mu_i E_i^*$$
where $\mu_1,\ldots,\mu_n$ are the Lagrangian multipliers. The set of equations is obtained by setting the partial derivatives of $L_R$ to zero; these include
$$\sum_{i=1}^n \mu_i = 0 \quad\text{and}\quad \frac{\partial L_R}{\partial\beta_1} = \sum_{i=1}^n \mu_i X_i = 0.$$
Now
$$X_i = x_i + \mu_i,\qquad Y_i = y_i - \beta_1\mu_i.$$
Substituting these in $Y_i = \beta_0 + \beta_1 X_i$:
$$y_i - \beta_1\mu_i = \beta_0 + \beta_1(x_i + \mu_i)$$
$$\Rightarrow \mu_i = \frac{y_i - \beta_0 - \beta_1 x_i}{2\beta_1}.$$
Substituting this $\mu_i$ in $\sum_{i=1}^n \mu_i = 0$, the reduced major axis regression estimate of $\beta_0$ is obtained as
$$\hat\beta_{0RM} = \bar y - \hat\beta_{1RM}\bar x.$$
Using $\mu_i$ and this value of $\beta_0$ in $\sum_{i=1}^n \mu_i X_i = 0$, we get
$$\sum_{i=1}^n \left(\frac{y_i - \bar y + \beta_1\bar x - \beta_1 x_i}{2\beta_1}\right)\left(x_i + \frac{y_i - \bar y + \beta_1\bar x - \beta_1 x_i}{2\beta_1}\right) = 0.$$
Let $u_i = x_i - \bar x$ and $v_i = y_i - \bar y$; then this equation can be re-expressed as
$$\sum_{i=1}^n (v_i - \beta_1 u_i)(v_i + \beta_1 u_i + 2\beta_1\bar x) = 0.$$
Using $\sum_{i=1}^n u_i = \sum_{i=1}^n v_i = 0$, we get
$$\sum_{i=1}^n (v_i^2 - \beta_1^2 u_i^2) = 0.$$
Solving this equation, the reduced major axis regression estimate of $\beta_1$ is obtained as
$$\hat\beta_{1RM} = \text{sign}(s_{xy})\sqrt{\frac{s_{yy}}{s_{xx}}}$$
where
$$\text{sign}(s_{xy}) = \begin{cases} 1 & \text{if } s_{xy} > 0 \\ -1 & \text{if } s_{xy} < 0, \end{cases}$$
so that the estimator has the same sign as $s_{xy}$.
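A minimal numerical sketch of the reduced major axis estimates on hypothetical data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

sxx = np.sum((x - x.mean()) ** 2)
syy = np.sum((y - y.mean()) ** 2)
sxy = np.sum((x - x.mean()) * (y - y.mean()))

b1_rm = np.sign(sxy) * np.sqrt(syy / sxx)   # reduced major axis slope
b0_rm = y.mean() - b1_rm * x.mean()         # intercept
```

Note that $|\hat\beta_{1RM}|$ equals the geometric mean of the magnitude of the direct slope $s_{xy}/s_{xx}$ and the reciprocal of the reverse slope, $s_{yy}/s_{xy}$.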
The least squares principle advocates the minimization of sum of squared errors. The idea of squaring the errors is useful in
place of simple errors because the random errors can be positive as well as negative. So consequently their sum can be
close to zero indicating that there is no error in the model which can be misleading. Instead of the sum of random errors,
the sum of absolute random errors can be considered which avoids the problem due to positive and negative random
errors.
In the method of least squares, the estimates of the parameters $\beta_0$ and $\beta_1$ in the model $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i\ (i = 1,2,\ldots,n)$ are chosen such that the sum of squared deviations $\sum_{i=1}^n \varepsilon_i^2$ is minimum. In the method of least absolute deviation (LAD) regression, the estimates $\hat\beta_{0L}$ and $\hat\beta_{1L}$ are the values of $\beta_0$ and $\beta_1$, respectively, which minimize
$$LAD(\beta_0, \beta_1) = \sum_{i=1}^n |y_i - \beta_0 - \beta_1 x_i|.$$
Conceptually, the LAD procedure is simpler than the OLS procedure because $|e|$ (the absolute residual) is a more straightforward measure of the size of a residual than $e^2$ (the squared residual). The LAD regression estimates of $\beta_0$ and $\beta_1$ are not available in closed form; rather, they are obtained numerically through algorithms. This creates the problems of non-uniqueness and degeneracy in the estimates. Non-uniqueness means that more than one best line passes through a data point. Degeneracy means that the best line through a data point also passes through more than one other data point. The non-uniqueness and degeneracy concepts are used in algorithms to judge the quality of the estimates. The algorithm for finding the estimates generally proceeds in steps. At each step, the best line is found that passes through a given data point. The best line always passes through another data point, and this data point is used in the next step. When there is non-uniqueness, there is more than one best line. When there is degeneracy, the best line passes through more than one other data point. When either problem is present, there is more than one choice for the data point to be used in the next step, and the algorithm may go around in circles or make a wrong choice of the LAD regression line. Exact tests of hypothesis and confidence intervals for the LAD regression estimates can not be derived analytically; instead, they are derived analogously to the tests of hypothesis and confidence intervals of least squares regression.
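Since the LAD estimates have no closed form, they must be computed numerically. The sketch below uses iteratively reweighted least squares, which is one of several possible algorithms (and not the stepwise best-line algorithm described above); the data and tolerances are hypothetical:

```python
import numpy as np

def lad_fit(x, y, n_iter=100, delta=1e-8):
    """Approximate LAD estimates (beta0, beta1) by iteratively
    reweighted least squares: weight each point by 1/|residual|."""
    X = np.column_stack([np.ones_like(x), x])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]      # OLS starting point
    for _ in range(n_iter):
        r = y - X @ beta
        w = 1.0 / np.maximum(np.abs(r), delta)       # L1 weights
        beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
    return beta

# ten points exactly on y = 1 + 2x plus one gross outlier
x = np.append(np.arange(10.0), 10.0)
y = np.append(1.0 + 2.0 * np.arange(10.0), 100.0)
beta0_lad, beta1_lad = lad_fit(x, y)
```

Unlike OLS, the LAD fit is barely moved by the single outlier, illustrating its robustness.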
In the usual linear regression model, the study variable is supposed to be random and the explanatory variables are assumed to be fixed. In practice, there may be situations in which the explanatory variable also becomes random.
Suppose both the dependent and independent variables are stochastic in the simple linear regression model
$$y = \beta_0 + \beta_1 X + \varepsilon$$
where $\varepsilon$ is the associated random error component. The observations $(x_i, y_i),\ i = 1,2,\ldots,n$ are assumed to be jointly distributed. The statistical inferences can then be drawn in such cases conditional on $X$.
When $(X, y)$ are jointly normally distributed, the conditional mean and variance of $y$ given $X = x$ are
$$E(y\,|\,X = x) = \mu_{y|x} = \beta_0 + \beta_1 x$$
$$Var(y\,|\,X = x) = \sigma^2_{y|x} = \sigma_y^2(1 - \rho^2)$$
where
$$\beta_0 = \mu_y - \beta_1\mu_x$$
and
$$\beta_1 = \frac{\sigma_y}{\sigma_x}\,\rho.$$
When both $X$ and $y$ are stochastic, the problem of estimation of the parameters can be reformulated as follows. Consider a conditional random variable $y|X = x$ having a normal distribution with mean equal to the conditional mean $\mu_{y|x}$ and variance equal to the conditional variance $Var(y|X = x) = \sigma^2_{y|x}$. Obtain $n$ independently distributed observations $y_i|x_i,\ i = 1,2,\ldots,n$ from $N(\mu_{y|x}, \sigma^2_{y|x})$ with nonstochastic $X$. The method of maximum likelihood can now be used to estimate the parameters, and it yields the estimates of $\beta_0$ and $\beta_1$ as earlier in the case of nonstochastic $X$:
$$b_0 = \bar y - b_1\bar x \quad\text{and}\quad b_1 = \frac{s_{xy}}{s_{xx}},$$
respectively.
The population correlation coefficient
$$\rho = \frac{E\left[(y - \mu_y)(X - \mu_x)\right]}{\sigma_y\sigma_x}$$
can be estimated by the sample correlation coefficient
$$\hat\rho = \frac{\sum_{i=1}^n (y_i - \bar y)(x_i - \bar x)}{\sqrt{\sum_{i=1}^n (x_i - \bar x)^2}\sqrt{\sum_{i=1}^n (y_i - \bar y)^2}} = \frac{s_{xy}}{\sqrt{s_{xx}s_{yy}}} = b_1\sqrt{\frac{s_{xx}}{s_{yy}}}.$$
Thus
$$\hat\rho^2 = b_1^2\,\frac{s_{xx}}{s_{yy}} = b_1\,\frac{s_{xy}}{s_{yy}} = \frac{s_{yy} - \sum_{i=1}^n \hat\varepsilon_i^2}{s_{yy}} = R^2$$
which is the same as the coefficient of determination.
Thus R2 again measures the goodness of fitted model even when X is stochastic.
ECONOMETRIC THEORY
MODULE – III
Lecture - 9
Multiple Linear Regression Analysis
We now consider the problem of regression when the study variable depends on more than one explanatory or independent variable; this is called the multiple linear regression model. This model generalizes the simple linear regression in two ways. It allows the mean function $E(y)$ to depend on more than one explanatory variable and to have shapes other than straight lines, although it does not allow for arbitrary shapes. Consider the model
$$y = X_1\beta_1 + X_2\beta_2 + \cdots + X_k\beta_k + \varepsilon.$$
This is called the multiple linear regression model. The parameters $\beta_1, \beta_2, \ldots, \beta_k$ are the regression coefficients associated with $X_1, X_2, \ldots, X_k$ respectively, and $\varepsilon$ is the random error component reflecting the difference between the observed and fitted linear relationship. There can be various reasons for such a difference, e.g., the joint effect of variables not included in the model, random factors which can not be accounted for in the model, etc.
The $j$-th regression coefficient $\beta_j$ represents the expected change in $y$ per unit change in the $j$-th independent variable $X_j$. Assuming $E(\varepsilon) = 0$,
$$\beta_j = \frac{\partial E(y)}{\partial X_j}.$$
Linear model
A model is said to be linear when it is linear in the parameters. In such a case $\dfrac{\partial y}{\partial\beta_j}$ (or equivalently $\dfrac{\partial E(y)}{\partial\beta_j}$) should not depend on any $\beta$'s. For example:
i) $y = \beta_1 + \beta_2 X$ is a linear model, as it is linear in the parameters.
ii) $y = \beta_1 X^{\beta_2}$ can be written as
$$\log y = \log\beta_1 + \beta_2\log X$$
$$y^* = \beta_1^* + \beta_2 x^*$$
which is linear in the parameters $\beta_1^*$ and $\beta_2$, with $y^* = \log y$, $\beta_1^* = \log\beta_1$ and $x^* = \log X$. So the model is linear in the transformed form.
iii) $y = \beta_1 + \beta_2 X + \beta_3 X^2$ is linear in the parameters $\beta_1, \beta_2$ and $\beta_3$ but nonlinear in the variable $X$. So it is a linear model.
iv) $y = \beta_1 + \dfrac{\beta_2}{X - \beta_3}$ is nonlinear in the parameters and variables both. So it is a nonlinear model.
v) $y = \beta_1 + \beta_2 X^{\beta_3}$ is nonlinear in the parameters and variables both. So it is a nonlinear model.
vi) $y = \beta_1 + \beta_2 X + \beta_3 X^2 + \beta_4 X^3$ is a cubic polynomial model which can be written as $y = \beta_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4$, which is linear in the parameters $\beta_1, \beta_2, \beta_3, \beta_4$ and linear in the variables $X_2 = X$, $X_3 = X^2$, $X_4 = X^3$. So it is a linear model.
Example
The income and education of a person are related. It is expected that, on average, a higher level of education provides higher income. So a simple linear regression model can be expressed as
$$\text{income} = \beta_1 + \beta_2\,\text{education} + \varepsilon.$$
Note that $\beta_2$ reflects the change in income with respect to a per unit change in education, and $\beta_1$ reflects the income when education is zero, as it is expected that even an illiterate person can have some income.
Further, this model neglects that most people have higher income when they are older than when they are young, regardless of education. So $\beta_2$ will over-state the marginal impact of education. If age and education are positively correlated, then the regression model will associate all the observed increase in income with an increase in education. So a better model is
$$\text{income} = \beta_1 + \beta_2\,\text{education} + \beta_3\,\text{age} + \varepsilon.$$
Usually it is observed that income tends to rise less rapidly in the later earning years than in the early years. To accommodate such a possibility, we might extend the model to
$$\text{income} = \beta_1 + \beta_2\,\text{education} + \beta_3\,\text{age} + \beta_4\,\text{age}^2 + \varepsilon.$$
This is how we proceed for regression modeling in real-life situations. One needs to consider the experimental conditions and the phenomenon before taking a decision on how many, why and how to choose the dependent and independent variables.
Model set up
Let an experiment be conducted $n$ times, and the data be obtained as
$$y_1 = \beta_1 x_{11} + \beta_2 x_{12} + \cdots + \beta_k x_{1k} + \varepsilon_1$$
$$y_2 = \beta_1 x_{21} + \beta_2 x_{22} + \cdots + \beta_k x_{2k} + \varepsilon_2$$
$$\vdots$$
$$y_n = \beta_1 x_{n1} + \beta_2 x_{n2} + \cdots + \beta_k x_{nk} + \varepsilon_n.$$
These $n$ equations can be written as
$$y = X\beta + \varepsilon$$
where $y = (y_1, y_2, \ldots, y_n)'$ is an $n \times 1$ vector of $n$ observations on the study variable, $X$ is the $n \times k$ matrix of observations on the $k$ explanatory variables, $\beta = (\beta_1, \beta_2, \ldots, \beta_k)'$ is the $k \times 1$ vector of regression coefficients and $\varepsilon = (\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n)'$ is the $n \times 1$ vector of random error components.
The assumptions in this model are:
(i) $E(\varepsilon) = 0$
(ii) $E(\varepsilon\varepsilon') = \sigma^2 I_n$
(iii) Rank$(X) = k$
(iv) $X$ is a non-stochastic matrix
(v) $\varepsilon \sim N(0, \sigma^2 I_n)$.
These assumptions are used to study the statistical properties of the estimators of the regression coefficients. The following assumption is required to study, in particular, the large sample properties of the estimators:
(vi) $\lim_{n\to\infty}\dfrac{X'X}{n} = \Delta$ exists and is a non-stochastic and nonsingular matrix (with finite elements).
The explanatory variables can also be stochastic in some cases. We assume that $X$ is non-stochastic unless stated otherwise.
We consider the problems of estimation and testing of hypotheses on the regression coefficient vector under the stated assumptions.
Estimation of parameters
A general procedure for the estimation of the regression coefficient vector is to minimize
$$\sum_{i=1}^n M(\varepsilon_i) = \sum_{i=1}^n M(y_i - x_{i1}\beta_1 - x_{i2}\beta_2 - \cdots - x_{ik}\beta_k)$$
for a suitably chosen function $M$, e.g.,
$$M(x) = |x|,\qquad M(x) = x^2,\qquad M(x) = |x|^p \text{ in general}.$$
We consider the principle of least squares, which is related to $M(x) = x^2$, and the method of maximum likelihood estimation for the estimation of the parameters.
The principle of least squares minimizes the sum of squared errors
$$S(\beta) = \sum_{i=1}^n \varepsilon_i^2 = \varepsilon'\varepsilon = (y - X\beta)'(y - X\beta)$$
for given $y$ and $X$. A minimum will always exist, as $S(\beta)$ is a real-valued, convex and differentiable function. Write
$$S(\beta) = y'y + \beta'X'X\beta - 2\beta'X'y.$$
Differentiating with respect to $\beta$ and setting the derivative to zero,
$$\frac{\partial S(\beta)}{\partial\beta} = 0 \;\Rightarrow\; X'Xb = X'y,$$
where the following result is used:
Result: If $f(z) = Z'AZ$ is a quadratic form, $Z$ is an $m \times 1$ vector and $A$ is any $m \times m$ symmetric matrix, then $\dfrac{\partial f(z)}{\partial z} = 2AZ$.
Since it is assumed that rank$(X) = k$ (full column rank), $X'X$ is positive definite and the unique solution of the normal equations is
$$b = (X'X)^{-1}X'y.$$
Since $\dfrac{\partial^2 S(\beta)}{\partial\beta\,\partial\beta'} = 2X'X$ is at least non-negative definite, $b$ minimizes $S(\beta)$.
In case $X$ is not of full rank, then
$$b = (X'X)^{-}X'y + \left[I - (X'X)^{-}X'X\right]\omega$$
where $(X'X)^{-}$ is the generalized inverse of $X'X$ and $\omega$ is an arbitrary vector. The generalized inverse $(X'X)^{-}$ of $X'X$ satisfies
$$X(X'X)^{-}X'X = X.$$
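A minimal numerical sketch of the normal equations on simulated data, solving $X'Xb = X'y$ directly rather than forming the inverse:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + 0.1 * rng.normal(size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)   # solve the normal equations X'Xb = X'y
e = y - X @ b                           # residuals satisfy X'e = 0
```

Solving the linear system is numerically preferable to explicitly inverting $X'X$.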
MODULE – III
Lecture - 10
Multiple Linear Regression Analysis
Theorem
i. Let $\hat y = Xb$ be the empirical predictor of $y$. Then $\hat y$ has the same value for all solutions $b$ of $X'Xb = X'y$.
Proof: Using the general solution $b = (X'X)^{-}X'y + [I - (X'X)^{-}X'X]\omega$,
$$\hat y = Xb = X(X'X)^{-}X'y + \left[X - X(X'X)^{-}X'X\right]\omega = X(X'X)^{-}X'y$$
which is independent of $\omega$ (using $X(X'X)^{-}X'X = X$). This implies that $\hat y$ has the same value for all solutions $b$ of $X'Xb = X'y$.
ii. $S(\beta)$ attains its minimum at any solution $b$ of $X'Xb = X'y$:
$$S(\beta) = \left[y - Xb + X(b - \beta)\right]'\left[y - Xb + X(b - \beta)\right]$$
$$= (y - Xb)'(y - Xb) + (b - \beta)'X'X(b - \beta) + 2(b - \beta)'X'(y - Xb)$$
$$= (y - Xb)'(y - Xb) + (b - \beta)'X'X(b - \beta) \quad (\text{using } X'Xb = X'y)$$
$$\ge (y - Xb)'(y - Xb) = S(b),$$
and the minimum value is
$$S(b) = y'y - 2y'Xb + b'X'Xb = y'y - b'X'Xb = y'y - \hat y'\hat y.$$
Fitted values
In case $\hat\beta = b$, the fitted values are
$$\hat y = Xb = X(X'X)^{-1}X'y = Hy$$
where $H = X(X'X)^{-1}X'$ is called the hat matrix. It satisfies:
i. $H$ is symmetric,
ii. $H$ is idempotent, i.e., $HH = H$, and
iii. $\text{tr}\,H = \text{tr}\,X(X'X)^{-1}X' = \text{tr}\,X'X(X'X)^{-1} = \text{tr}\,I_k = k.$
Residuals
The difference between the observed and fitted values of the study variable is called the residual. It is denoted as
$$e = y \sim \hat y = y - \hat y = y - Xb = y - Hy = (I - H)y = \bar H y$$
where $\bar H = I - H$.
Note that $\bar H$ is also symmetric and idempotent, with $\text{tr}\,\bar H = n - k$.
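The stated properties of $H$ and $\bar H = I - H$ are easy to confirm numerically on a simulated design matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 20, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])

H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix
Hbar = np.eye(n) - H                   # residual-maker matrix
```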
Properties of OLSE
(i) Estimation error: the estimation error of $b$ is
$$b - \beta = (X'X)^{-1}X'y - \beta = (X'X)^{-1}X'(X\beta + \varepsilon) - \beta = (X'X)^{-1}X'\varepsilon.$$
(ii) Bias
$$E(b - \beta) = (X'X)^{-1}X'E(\varepsilon) = 0,$$
so $b$ is an unbiased estimator of $\beta$.
(iii) Covariance matrix: $V(b) = E\left[(b - \beta)(b - \beta)'\right] = \sigma^2(X'X)^{-1}$.
(iv) Variance
The variance of $b$ can be obtained as the sum of the variances of $b_1, b_2, \ldots, b_k$, which is the trace of the covariance matrix of $b$. Thus
$$Var(b) = \text{tr}\left[V(b)\right] = \sum_{i=1}^k E(b_i - \beta_i)^2 = \sum_{i=1}^k Var(b_i).$$
Estimation of σ²
The least squares criterion can not be used to estimate σ² because σ² does not appear in $S(\beta)$. Since $E(\varepsilon_i^2) = \sigma^2$, we attempt to estimate σ² from the residuals $e_i$ as follows:
$$e = y - \hat y = y - X(X'X)^{-1}X'y = \left[I - X(X'X)^{-1}X'\right]y = \bar Hy.$$
Consider the residual sum of squares
$$SS_{res} = e'e = (y - Xb)'(y - Xb) = y'(I - H)'(I - H)y = y'(I - H)y = y'\bar Hy.$$
Also
$$SS_{res} = (y - Xb)'(y - Xb) = y'y - 2b'X'y + b'X'Xb = y'y - b'X'y \quad (\text{using } X'Xb = X'y).$$
Further,
$$SS_{res} = y'\bar Hy = (X\beta + \varepsilon)'\bar H(X\beta + \varepsilon) = \varepsilon'\bar H\varepsilon \quad (\text{using } \bar HX = 0).$$
Since $\varepsilon \sim N(0, \sigma^2 I)$, so $y \sim N(X\beta, \sigma^2 I)$. Hence
$$\frac{y'\bar Hy}{\sigma^2} \sim \chi^2(n - k).$$
Thus $E\left[y'\bar Hy\right] = (n - k)\sigma^2$,
or $E\left[\dfrac{y'\bar Hy}{n - k}\right] = \sigma^2$,
or $E\left[MS_{res}\right] = \sigma^2$
where $MS_{res} = \dfrac{SS_{res}}{n - k}$ is the mean sum of squares due to residuals.
Thus an unbiased estimator of σ² is
$$\hat\sigma^2 = MS_{res} = s^2 \text{ (say)}.$$
Covariance matrix of ŷ
The covariance matrix of $\hat y$ is
$$V(\hat y) = V(Xb) = XV(b)X' = \sigma^2 X(X'X)^{-1}X' = \sigma^2 H.$$
Gauss-Markov theorem
The ordinary least squares estimator (OLSE) is the best linear unbiased estimator (BLUE) of β .
Proof: The OLSE of β is
b = ( X ' X ) −1 X ' y
which is a linear function of $y$. Consider an arbitrary linear estimator $b^* = a'y$ of the linear parametric function $\ell'\beta$, where the elements of $a$ are arbitrary constants.
Then for $b^*$,
$$E(b^*) = E(a'y) = a'X\beta,$$
and so $b^*$ is an unbiased estimator of $\ell'\beta$ when
$$E(b^*) = a'X\beta = \ell'\beta \;\Rightarrow\; a'X = \ell'.$$
Since we wish to consider only those estimators that are linear and unbiased, we restrict ourselves to those estimators for which $a'X = \ell'$.
Further,
$$Var(a'y) = a'Var(y)a = \sigma^2 a'a$$
$$Var(\ell'b) = \ell'Var(b)\ell = \sigma^2 \ell'(X'X)^{-1}\ell = \sigma^2 a'X(X'X)^{-1}X'a.$$
Consider
$$Var(b^*) - Var(\ell'b) = \sigma^2\left[a'a - a'X(X'X)^{-1}X'a\right] = \sigma^2 a'\left[I - X(X'X)^{-1}X'\right]a = \sigma^2 a'(I - H)a \ge 0,$$
since $(I - H)$ is symmetric, idempotent and hence non-negative definite. This reveals that if $b^*$ is any linear unbiased estimator, then its variance must be no smaller than that of $\ell'b$.
If we consider $\ell = (0, 0, \ldots, 0, 1, 0, \ldots, 0)'$ (here 1 occurs at the $i$-th place), then $\ell'b = b_i$ is the best linear unbiased estimator of $\ell'\beta = \beta_i$ for all $i = 1, 2, \ldots, k$.
Consequently, $b$ is the best linear unbiased estimator of $\beta$, where 'best' refers to the fact that $b$ is efficient within the class of linear and unbiased estimators.
ECONOMETRIC THEORY
MODULE – III
Lecture - 11
Multiple Linear Regression Analysis
Under the assumption $\varepsilon \sim N(0, \sigma^2 I_n)$, the likelihood function of the model $y = X\beta + \varepsilon$ is
$$L(\beta, \sigma^2) = \prod_{i=1}^n \frac{1}{(2\pi\sigma^2)^{1/2}}\exp\left(-\frac{\varepsilon_i^2}{2\sigma^2}\right) = \frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left(-\frac{1}{2\sigma^2}\varepsilon'\varepsilon\right) = \frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left(-\frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta)\right).$$
The log-likelihood is
$$\ln L(\beta, \sigma^2) = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta).$$
Differentiating and equating to zero,
$$\frac{\partial\ln L}{\partial\beta} = \frac{1}{\sigma^2}X'(y - X\beta) = 0,\qquad \frac{\partial\ln L}{\partial\sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}(y - X\beta)'(y - X\beta) = 0$$
give the maximum likelihood estimators
$$\tilde\beta = (X'X)^{-1}X'y,\qquad \tilde\sigma^2 = \frac{1}{n}(y - X\tilde\beta)'(y - X\tilde\beta).$$
Thus the Hessian matrix of second-order partial derivatives of $\ln L(\beta, \sigma^2)$ with respect to $\beta$ and $\sigma^2$ is
$$\begin{pmatrix} \dfrac{\partial^2\ln L(\beta,\sigma^2)}{\partial\beta\,\partial\beta'} & \dfrac{\partial^2\ln L(\beta,\sigma^2)}{\partial\beta\,\partial\sigma^2} \\ \dfrac{\partial^2\ln L(\beta,\sigma^2)}{\partial\sigma^2\,\partial\beta'} & \dfrac{\partial^2\ln L(\beta,\sigma^2)}{\partial(\sigma^2)^2} \end{pmatrix}$$
which is negative definite at $\beta = \tilde\beta$ and $\sigma^2 = \tilde\sigma^2$, confirming the maximum. Two observations follow:
i. The m.l.e. of $\beta$ is the same as the OLSE $b$ of $\beta$.
ii. The OLSE of σ² is $s^2$, which is related to the m.l.e. of σ² as $\tilde\sigma^2 = \dfrac{n-k}{n}s^2$. So the m.l.e. of σ² is a biased estimator of σ².
The Cramer-Rao lower bound is obtained from the information matrix
$$I(\theta) = -E\left[\frac{\partial^2\ln L(\theta)}{\partial\theta\,\partial\theta'}\right],\quad \theta = (\beta', \sigma^2)'$$
$$= \begin{pmatrix} -E\left[-\dfrac{X'X}{\sigma^2}\right] & -E\left[-\dfrac{X'(y - X\beta)}{\sigma^4}\right] \\ -E\left[-\dfrac{(y - X\beta)'X}{\sigma^4}\right] & -E\left[\dfrac{n}{2\sigma^4} - \dfrac{(y - X\beta)'(y - X\beta)}{\sigma^6}\right] \end{pmatrix} = \begin{pmatrix} \dfrac{X'X}{\sigma^2} & 0 \\ 0 & \dfrac{n}{2\sigma^4} \end{pmatrix}.$$
Then
$$\left[I(\theta)\right]^{-1} = \begin{pmatrix} \sigma^2(X'X)^{-1} & 0 \\ 0 & \dfrac{2\sigma^4}{n} \end{pmatrix}$$
is the Cramer-Rao lower bound matrix for $\beta$ and σ².
The covariance matrix of the OLS estimators $b$ and $s^2$ is
$$\Sigma_{OLS} = \begin{pmatrix} \sigma^2(X'X)^{-1} & 0 \\ 0 & \dfrac{2\sigma^4}{n-k} \end{pmatrix}$$
which means that the Cramer-Rao bound is attained for the covariance of $b$ but not for $s^2$.
ECONOMETRIC THEORY
MODULE – III
Lecture - 12
Multiple Linear Regression Analysis
Usually it is difficult to compare the regression coefficients because the magnitude of $\hat\beta_j$ reflects the units of measurement of the $j$-th explanatory variable $X_j$. For example, in the following fitted regression model
$$\hat y = 5 + X_1 + 1000\,X_2,$$
$y$ is measured in liters, $X_1$ in milliliters and $X_2$ in liters. Although $\hat\beta_2 \gg \hat\beta_1$, the effect of both explanatory variables is identical: a one-liter change in either $X_1$ or $X_2$, when the other variable is held fixed, produces the same change in $\hat y$.
Sometimes it is helpful to work with scaled explanatory and study variables that produce dimensionless regression coefficients.
There are two popular approaches for scaling which give standardized regression coefficients.
We discuss them as follows:
1. Unit normal scaling
Define
$$z_{ij} = \frac{x_{ij} - \bar x_j}{s_j},\quad i = 1, 2, \ldots, n,\ j = 1, 2, \ldots, k$$
$$y_i^* = \frac{y_i - \bar y}{s_y}$$
where
$$s_j^2 = \frac{1}{n-1}\sum_{i=1}^n (x_{ij} - \bar x_j)^2 \quad\text{and}\quad s_y^2 = \frac{1}{n-1}\sum_{i=1}^n (y_i - \bar y)^2$$
are the sample variances of the $j$-th explanatory variable and the study variable, respectively.
All scaled explanatory variables and the scaled study variable have mean zero and sample variance unity, i.e., using these new variables, the regression model becomes
$$y_i^* = \gamma_1 z_{i1} + \gamma_2 z_{i2} + \cdots + \gamma_k z_{ik} + \varepsilon_i,\quad i = 1, 2, \ldots, n.$$
Such centering removes the intercept term from the model. The least squares estimate of $\gamma = (\gamma_1, \gamma_2, \ldots, \gamma_k)'$ is
$$\hat\gamma = (Z'Z)^{-1}Z'y^*.$$
This scaling has a similarity to standardizing a normal random variable, i.e., subtracting the mean from an observation and dividing by the standard deviation. So it is called unit normal scaling.
2. Unit length scaling
In unit length scaling, define
$$\omega_{ij} = \frac{x_{ij} - \bar x_j}{S_{jj}^{1/2}},\quad i = 1, 2, \ldots, n;\ j = 1, 2, \ldots, k$$
$$y_i^0 = \frac{y_i - \bar y}{SS_T^{1/2}}$$
where
$$S_{jj} = \sum_{i=1}^n (x_{ij} - \bar x_j)^2$$
is the corrected sum of squares of the $j$-th explanatory variable and $SS_T = s_{yy}$ is the total corrected sum of squares of $y$. In this scaling, each scaled explanatory variable has mean $\bar\omega_j = 0$ and unit length,
$$\sum_{i=1}^n (\omega_{ij} - \bar\omega_j)^2 = 1.$$
The regression model in the scaled variables is
$$y_i^0 = \delta_1\omega_{i1} + \delta_2\omega_{i2} + \cdots + \delta_k\omega_{ik} + \varepsilon_i,\quad i = 1, 2, \ldots, n.$$
In such a case, the matrix $W'W$ is in the form of a correlation matrix, i.e.,
$$W'W = \begin{pmatrix} 1 & r_{12} & \cdots & r_{1k} \\ r_{21} & 1 & \cdots & r_{2k} \\ \vdots & \vdots & & \vdots \\ r_{k1} & r_{k2} & \cdots & 1 \end{pmatrix}$$
where
$$r_{ij} = \frac{\sum_{u=1}^n (x_{ui} - \bar x_i)(x_{uj} - \bar x_j)}{(S_{ii}S_{jj})^{1/2}} = \frac{S_{ij}}{(S_{ii}S_{jj})^{1/2}}$$
is the simple correlation coefficient between the explanatory variables $X_i$ and $X_j$.
Similarly,
$$W'y^0 = (r_{1y}, r_{2y}, \ldots, r_{ky})'$$
where
$$r_{jy} = \frac{\sum_{u=1}^n (x_{uj} - \bar x_j)(y_u - \bar y)}{(S_{jj}SS_T)^{1/2}} = \frac{S_{jy}}{(S_{jj}SS_T)^{1/2}}$$
is the simple correlation coefficient between the $j$-th explanatory variable $X_j$ and the study variable $y$.
Note that it is customary to refer to $r_{ij}$ and $r_{jy}$ as correlation coefficients though the $X_i$'s are not random variables.
The least squares estimate of $\delta = (\delta_1, \ldots, \delta_k)'$ is $\hat\delta = (W'W)^{-1}W'y^0$, and the two scalings are related by
$$\hat\gamma = \hat\delta.$$
The regression coefficients obtained after such scaling, viz., $\hat\gamma$ or $\hat\delta$, are usually called standardized regression coefficients. They are related to the original coefficients by
$$b_j = \hat\delta_j\left(\frac{SS_T}{S_{jj}}\right)^{1/2},\ j = 1, 2, \ldots, k,\qquad b_0 = \bar y - \sum_{j=1}^k b_j\bar x_j,$$
where $b_0$ is the OLSE of the intercept term and the $b_j$ are the OLSEs of the slope parameters $\beta_j$.
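A sketch of unit length scaling on simulated data with wildly different units; the back-transformation $b_j = \hat\delta_j(SS_T/S_{jj})^{1/2}$ recovers the raw-scale slopes:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 40
Xraw = rng.normal(size=(n, 2)) * np.array([100.0, 0.01])   # very different scales
y = 3.0 + 0.02 * Xraw[:, 0] + 500.0 * Xraw[:, 1] + 0.1 * rng.normal(size=n)

Xc = Xraw - Xraw.mean(axis=0)
Sjj = np.sum(Xc ** 2, axis=0)
SST = np.sum((y - y.mean()) ** 2)

W = Xc / np.sqrt(Sjj)                        # unit length scaled regressors
y0 = (y - y.mean()) / np.sqrt(SST)           # unit length scaled response

delta = np.linalg.solve(W.T @ W, W.T @ y0)   # standardized coefficients
slopes = delta * np.sqrt(SST / Sjj)          # back to original units
```

$W'W$ computed here equals the correlation matrix of the raw regressors, as stated above.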
Let
$$A = I - \frac{1}{n}\ell\ell'$$
where $\ell = (1, 1, \ldots, 1)'$ is an $n \times 1$ vector with each element unity. So
$$A = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & & & \vdots \\ 0 & 0 & \cdots & 1 \end{pmatrix} - \frac{1}{n}\begin{pmatrix} 1 & 1 & \cdots & 1 \\ 1 & 1 & \cdots & 1 \\ \vdots & & & \vdots \\ 1 & 1 & \cdots & 1 \end{pmatrix}.$$
Then
$$\bar y = \frac{1}{n}\sum_{i=1}^n y_i = \frac{1}{n}(1, 1, \ldots, 1)\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} = \frac{1}{n}\ell'y$$
and
$$Ay = y - \bar y\ell = (y_1 - \bar y, y_2 - \bar y, \ldots, y_n - \bar y)'.$$
Thus pre-multiplication of any column vector by $A$ produces a vector showing those observations in deviation form.
Note that
$$A\ell = \ell - \frac{1}{n}\ell\ell'\ell = \ell - \frac{1}{n}\ell\cdot n = \ell - \ell = 0$$
and $A$ is a symmetric and idempotent matrix.
In the model
$$y = X\beta + \varepsilon,$$
the OLSE of $\beta$ is $b = (X'X)^{-1}X'y$ and the residual vector is
$$e = y - Xb.$$
Note that $Ae = e$.
Let the $n \times k$ matrix $X$ be partitioned as
$$X = [X_1\ \ X_2^*]$$
where $X_1 = (1, 1, \ldots, 1)'$ is an $n \times 1$ vector with all elements unity, representing the intercept term, and $X_2^*$ is the $n \times (k-1)$ matrix of observations on the remaining explanatory variables, with the OLSE of the intercept term $\beta_1$ as $b_1$ and $b_2^*$ as the $(k-1) \times 1$ vector of OLSEs associated with $\beta_2, \beta_3, \ldots, \beta_k$.
Then
$$y = X_1 b_1 + X_2^* b_2^* + e.$$
Premultiplication by $A$ gives
$$Ay = AX_1 b_1 + AX_2^* b_2^* + Ae = AX_2^* b_2^* + e$$
(since $AX_1 = A\ell = 0$ and $Ae = e$). Further, premultiplication by $X_2^{*\prime}$ gives
$$X_2^{*\prime}Ay = X_2^{*\prime}AX_2^* b_2^* + X_2^{*\prime}e = X_2^{*\prime}AX_2^* b_2^*.$$
Since $A$ is symmetric and idempotent, this can be written as
$$(AX_2^*)'(Ay) = (AX_2^*)'(AX_2^*)\,b_2^*.$$
This equation can be compared with the normal equations $X'y = X'Xb$ in the model $y = X\beta + \varepsilon$. Such a comparison shows that $b_2^*$ is the sub-vector of the OLSE, with $Ay$ and $AX_2^*$ playing the roles of $y$ and $X$, respectively. This is the normal equation in terms of deviations. Its solution gives the OLS estimates of the slope coefficients as
$$b_2^* = \left[(AX_2^*)'(AX_2^*)\right]^{-1}(AX_2^*)'(Ay).$$
The estimate of the intercept term is obtained by premultiplying $y = Xb + e$ by $\frac{1}{n}\ell'$:
$$\frac{1}{n}\ell'y = \frac{1}{n}\ell'Xb + \frac{1}{n}\ell'e$$
$$\bar y = (1, \bar X_2, \bar X_3, \ldots, \bar X_k)\begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_k \end{pmatrix} + 0$$
$$\Rightarrow b_1 = \bar y - b_2\bar X_2 - b_3\bar X_3 - \cdots - b_k\bar X_k.$$
Now we explain the various sums of squares in terms of this model.
Since $Ay = AX_2^* b_2^* + e$,
$$y'Ay = y'AX_2^* b_2^* + y'e,$$
i.e.,
$$TSS = SS_{reg} + SS_{res}$$
where the total sum of squares is $TSS = y'Ay$, the sum of squares due to regression is $SS_{reg} = y'AX_2^* b_2^*$ and the residual sum of squares is $SS_{res} = y'e = e'e$.
ECONOMETRIC THEORY
MODULE – III
Lecture - 13
Multiple Linear Regression Analysis
Testing of hypothesis
There are several important questions which can be answered through tests of hypotheses concerning the regression coefficients.
For example:
1. What is the overall adequacy of the model?
2. Which specific explanatory variables seem to be important?
etc.
In order to answer such questions, we first develop the test of hypothesis for a general framework, viz., the general linear hypothesis.
The null hypothesis is of the form $H_0: R\beta = r$, where $R$ is a $(J \times k)$ matrix of known elements and $r$ is a $(J \times 1)$ vector of known elements. The hypothesis to be tested is
$$H_0: R\beta = r \quad\text{against}\quad H_1: R\beta \ne r.$$
We assume that rank$(R) = J$, i.e., $R$ is of full row rank, so that there is no linear dependence in the hypothesis. Some examples of such hypotheses are:
(i) $H_0: \beta_i = 0$.
Choose $J = 1$, $r = 0$, and $R = [0, \ldots, 0, 1, 0, \ldots, 0]$ with 1 at the $i$-th place.
(ii) $H_0: \beta_3 = \beta_4$, or $H_0: \beta_3 - \beta_4 = 0$.
Choose $J = 1$, $r = 0$, $R = [0, 0, 1, -1, 0, \ldots, 0]$.
(iii) $H_0: \beta_3 = \beta_4 = \beta_5$, or $H_0: \beta_3 - \beta_4 = 0,\ \beta_3 - \beta_5 = 0$.
Choose $J = 2$, $r = (0, 0)'$,
$$R = \begin{pmatrix} 0 & 0 & 1 & -1 & 0 & 0 & \cdots & 0 \\ 0 & 0 & 1 & 0 & -1 & 0 & \cdots & 0 \end{pmatrix}.$$
(iv) $H_0: \beta_3 + 5\beta_4 = 2$.
Choose $J = 1$, $r = 2$, $R = [0, 0, 1, 5, 0, \ldots, 0]$.
(v) $H_0: \beta_2 = \beta_3 = \cdots = \beta_k = 0$.
Choose $J = k - 1$, $r = (0, 0, \ldots, 0)'$,
$$R = \begin{pmatrix} 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & & & & \vdots \\ 0 & 0 & 0 & \cdots & 1 \end{pmatrix}_{(k-1)\times k} = [\,0 \;\; I_{k-1}\,].$$
This particular hypothesis explains the goodness of fit. It tells whether the $\beta_i$ have a linear effect and whether they are of any importance. It also tests whether $X_2, X_3, \ldots, X_k$ have any influence in the determination of $y$. Here $\beta_1 = 0$ is excluded because this involves the additional implication that the mean level of $y$ is zero. Our main concern is to know whether the explanatory variables help in explaining the variation in $y$ around its mean value or not.
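Constructing R and r is mechanical; a sketch for example (iii) above, with an arbitrary choice of k = 6:

```python
import numpy as np

k = 6  # number of regression coefficients beta_1, ..., beta_k

# example (iii): H0: beta3 = beta4 = beta5, written as the two restrictions
# beta3 - beta4 = 0 and beta3 - beta5 = 0, i.e. R beta = r with J = 2
R = np.zeros((2, k))
R[0, 2], R[0, 3] = 1.0, -1.0   # beta3 - beta4  (0-based positions 2 and 3)
R[1, 2], R[1, 4] = 1.0, -1.0   # beta3 - beta5  (0-based positions 2 and 4)
r = np.zeros(2)

beta_null = np.array([5.0, -1.0, 2.0, 2.0, 2.0, 7.0])  # satisfies H0
```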
If both the likelihoods are maximized, one constrained and the other unconstrained, then the value of the unconstrained will
not be smaller than the value of the constrained. Hence λ ≥ 1.
First we discuss the likelihood ratio test for the simpler case when R = Ik and r = β0, i.e., H0 : β = β0. This will give a better and more detailed understanding of the minor details, and then we generalize it for Rβ = r.
where β0 is specified by the investigator. The elements of β0 can take on any value, including zero.
The concerned alternative hypothesis is H1 : β ≠ β 0 .
Since ε ~ N(0, σ²I) in y = Xβ + ε, we have y ~ N(Xβ, σ²I). The whole parametric space Ω and the restricted space ω are respectively given by
Ω : {(β, σ²) : −∞ < βi < ∞, σ² > 0, i = 1, 2, ..., k}
ω : {(β, σ²) : β = β0, σ² > 0}.
The likelihood function is
L(β, σ² | y, X) = [1/(2πσ²)^(n/2)] exp[ −(1/2σ²)(y − Xβ)'(y − Xβ) ].
The maximum likelihood estimates of β and σ², obtained by maximizing the likelihood function, are
β̃ = (X'X)⁻¹X'y
σ̃² = (1/n)(y − Xβ̃)'(y − Xβ̃).
Thus
L̂(Ω) = max L(β, σ² | y, X)
      = [ n / (2π (y − Xβ̃)'(y − Xβ̃)) ]^(n/2) exp(−n/2)
      = n^(n/2) exp(−n/2) / [ (2π)^(n/2) ((y − Xβ̃)'(y − Xβ̃))^(n/2) ].
Since β0 is known, the constrained likelihood function has the optimum variance estimator
σ̃ω² = (1/n)(y − Xβ0)'(y − Xβ0)
and
L̂(ω) = n^(n/2) exp(−n/2) / [ (2π)^(n/2) ((y − Xβ0)'(y − Xβ0))^(n/2) ].
The likelihood ratio is
L̂(Ω)/L̂(ω) = [ (y − Xβ0)'(y − Xβ0) / (y − Xβ̃)'(y − Xβ̃) ]^(n/2)
            = (σ̃ω² / σ̃²)^(n/2)
            = λ^(n/2)
where
λ = (y − Xβ0)'(y − Xβ0) / (y − Xβ̃)'(y − Xβ̃)
is the ratio of the quadratic forms. Now we simplify the numerator of λ as follows:
(y − Xβ0)'(y − Xβ0) = [(y − Xβ̃) + X(β̃ − β0)]' [(y − Xβ̃) + X(β̃ − β0)]
                    = (y − Xβ̃)'(y − Xβ̃) + (β̃ − β0)' X'X (β̃ − β0),
the cross-product terms vanishing because X'(y − Xβ̃) = 0. Thus
λ = [ (y − Xβ̃)'(y − Xβ̃) + (β̃ − β0)' X'X (β̃ − β0) ] / (y − Xβ̃)'(y − Xβ̃)
  = 1 + (β̃ − β0)' X'X (β̃ − β0) / (y − Xβ̃)'(y − Xβ̃)
or
λ − 1 = λ0 = (β̃ − β0)' X'X (β̃ − β0) / (y − Xβ̃)'(y − Xβ̃)
where 0 ≤ λ0 < ∞.
Now we find the distribution of the quadratic forms involved in λ0 to find the distribution of λ0 as follows:
(y − Xβ̃)'(y − Xβ̃) = e'e
 = y'[I − X(X'X)⁻¹X']y
 = y'Hy
 = (Xβ + ε)' H (Xβ + ε)
 = ε'Hε   (using HX = 0)
 = (n − k)σ̂².
Result: Let Z be an n x 1 random vector distributed as N(0, σ²In) and let A be any symmetric idempotent n x n matrix of rank p; then Z'AZ/σ² ~ χ²(p). Let B be another n x n symmetric idempotent matrix of rank q; then Z'BZ/σ² ~ χ²(q). If AB = 0, then Z'AZ is distributed independently of Z'BZ.
Applying this result, we have
y'Hy/σ² = (n − k)σ̂²/σ² ~ χ²(n − k).
Further, if H0 is true, then β = β0. Substituting β = β0 in this expression gives exactly the numerator of λ0. The numerator of λ0 can be rewritten in a general form for any β as
(β̃ − β)' X'X (β̃ − β) = ε' X(X'X)⁻¹X'X(X'X)⁻¹X' ε
                       = ε'(I − H)ε
where (I − H) = X(X'X)⁻¹X' is an idempotent matrix with rank k.
Furthermore, the product of the quadratic form matrices in the numerator (ε'(I − H)ε) and in the denominator (ε'Hε) of λ0 is (I − H)H = 0, and hence the χ² random variables in the numerator and denominator of λ0 are independent. Dividing each χ² random variable by its respective degrees of freedom, we get
λ1 = { [(β̃ − β0)' X'X (β̃ − β0) / σ²] / k } / { [(n − k)σ̂² / σ²] / (n − k) }
   = (β̃ − β0)' X'X (β̃ − β0) / (k σ̂²)
   = [ (y − Xβ0)'(y − Xβ0) − (y − Xβ̃)'(y − Xβ̃) ] / (k σ̂²)
   ~ F(k, n − k) under H0.
Note that
( y − X β 0 ) '( y − X β 0 ) : Restricted error sum of squares
( y − X β ) '( y − X β ) : Unrestricted error sum of squares
Numerator of λ1 : Difference between the restricted and unrestricted error sum of squares.
The decision rule is to reject H0 whenever
λ1 ≥ Fα(k, n − k)
where Fα(k, n − k) is the upper α critical point of the central F-distribution with k and (n − k) degrees of freedom.
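As a numerical illustration, the statistic λ1 for H0 : β = β0 can be computed directly with matrix operations. This is a minimal sketch on simulated data (the dataset, seed and dimensions are illustrative assumptions, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta_true = np.array([1.0, 2.0, 0.0])
y = X @ beta_true + rng.normal(size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)       # ML/OLS estimate of beta
resid = y - X @ b
sigma2_hat = resid @ resid / (n - k)        # sigma^2-hat = SSres / (n - k)

beta0 = np.array([1.0, 2.0, 0.0])           # hypothesised value under H0
d = b - beta0
# lambda_1 = (b - beta0)' X'X (b - beta0) / (k sigma^2-hat) ~ F(k, n - k) under H0
lam1 = d @ (X.T @ X) @ d / (k * sigma2_hat)
```

Equivalently, λ1 equals the difference between the restricted and unrestricted error sums of squares divided by k σ̂², which provides a handy internal check.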
ECONOMETRIC THEORY
MODULE – III
Lecture - 14
Multiple Linear Regression Analysis
Now consider the general linear hypothesis H0 : Rβ = r. Here the whole parametric space and the restricted space are
Ω = {(β, σ²) : −∞ < βi < ∞, σ² > 0, i = 1, 2, ..., k}
ω = {(β, σ²) : −∞ < βi < ∞, Rβ = r, σ² > 0}.
Then
E(Rb) = Rβ
V(Rb) = R V(b) R' = σ² R(X'X)⁻¹R'.
Since b ~ N(β, σ²(X'X)⁻¹), it follows that Rb ~ N(Rβ, σ² R(X'X)⁻¹R').
Also (n − k)σ̂²/σ² ~ χ²(n − k), independently of b. Hence the test statistic is
λ1 = (Rb − r)' [R(X'X)⁻¹R']⁻¹ (Rb − r) / (J σ̂²)
   ~ F(J, n − k) under H0.
The decision rule is to reject H0 whenever
λ1 ≥ Fα(J, n − k)
where Fα(J, n − k) is the upper α critical point of the central F-distribution with J and (n − k) degrees of freedom.
Test of significance of regression (Analysis of variance)
If we set R = [0  I(k−1)] and r = 0, then the hypothesis H0 : Rβ = r reduces to the following null hypothesis:
H 0 : β 2= β3= ...= β k= 0
which is tested against the alternative hypothesis
H1 : β j ≠ 0 for at least one j = 2, 3,…, k.
This hypothesis determines if there is a linear relationship between y and any set of the explanatory variables X 2 , X 3 ,..., X k .
Notice that X1 corresponds to the intercept term in the model and hence xi1 = 1 for all i = 1, 2, ..., n.
This is an overall or global test of model adequacy. Rejection of the null hypothesis indicates that at least one of the explanatory variables among X2, X3, ..., Xk contributes significantly to the model. This is called analysis of variance.
Since ε ~ N(0, σ²I), so y ~ N(Xβ, σ²I). Also
σ̂² = SSres / (n − k)
    = (y − ŷ)'(y − ŷ) / (n − k)
    = y'[I − X(X'X)⁻¹X']y / (n − k)
    = y'Hy / (n − k)
    = (y'y − b'X'y) / (n − k).
Since (X'X)⁻¹X'H = 0, the estimator b is distributed independently of SSres, and
SSres/σ² ~ χ²(n − k).
Partition X = [X1, X2*] where the submatrix X2* contains the explanatory variables X2, X3, ..., Xk, and partition β = [β1, β2*] where the subvector β2* contains the regression coefficients β2, β3, ..., βk. Then the total sum of squares can be decomposed as
SST = y'Ay = SSreg + SSres
where
SSreg = b2*' X2*' A X2* b2*
is the sum of squares due to regression, b2* is the OLSE of β2*, and
SSres = (y − Xb)'(y − Xb) = y'Hy = SST − SSreg.
Further,
SSreg/σ² ~ χ²(k − 1; β2*' X2*' A X2* β2* / 2σ²),
i.e., a non-central χ² distribution with non-centrality parameter β2*' X2*' A X2* β2* / 2σ². The mean square due to regression is
MSreg = SSreg / (k − 1)
and the mean square due to residuals (or error) is
MSres = SSres / (n − k).
Then
F = MSreg / MSres ~ F(k − 1, n − k) under H0.
The decision rule is to reject H0 whenever
F ≥ Fα(k − 1, n − k).
The calculation of the F-statistic can be summarized in the form of an analysis of variance (ANOVA) table given as follows:

Source of variation    Sum of squares    Degrees of freedom    Mean square    F
Regression             SSreg             k − 1                 MSreg          MSreg/MSres
Error (residual)       SSres             n − k                 MSres
Total                  SST               n − 1
Adding such explanatory variables also increases the variance of fitted values ŷ , so one needs to be cautious that only
those regressors are added which are really important in explaining the response. Adding unimportant explanatory
variables may increase the residual mean square which may decrease the usefulness of the model.
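The ANOVA quantities above can be reproduced in a few lines. A minimal sketch on simulated data (dimensions and seed are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([0.5, 1.0, -1.0]) + rng.normal(size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)
ss_res = (y - X @ b) @ (y - X @ b)          # residual sum of squares
ss_t = ((y - y.mean()) ** 2).sum()          # total sum of squares about the mean
ss_reg = ss_t - ss_res                      # regression sum of squares

ms_reg = ss_reg / (k - 1)                   # mean square due to regression
ms_res = ss_res / (n - k)                   # mean square due to residuals
F = ms_reg / ms_res                         # ~ F(k - 1, n - k) under H0
```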
Note that this is only a partial or marginal test because b j depends on all the other explanatory variables X i (i ≠ j )
that are in the model. This is a test of the contribution of X j given the other explanatory variables in the model.
ECONOMETRIC THEORY
MODULE – III
Lecture - 15
Multiple Linear Regression Analysis
The confidence intervals in the multiple regression model can be constructed for individual regression coefficients as well as jointly. We consider both of them as follows:
Since y ~ N(Xβ, σ²I), we have b ~ N(β, σ²(X'X)⁻¹). Thus the marginal distribution of any regression coefficient estimate is
bj ~ N(βj, σ²Cjj)
where Cjj is the jth diagonal element of (X'X)⁻¹.
Thus
tj = (bj − βj) / √(σ̂²Cjj) ~ t(n − k) under H0, j = 1, 2, ..., k,
where
σ̂² = SSres / (n − k) = (y'y − b'X'y) / (n − k).
bj − β j
P − tα ≤ ≤ tα = 1−α
2 , n − k
σ 2
ˆ C jj ,n − k
2
P b j − tα σˆ 2C jj ≤ β j ≤ b j + tα σˆ 2C jj =1 − α .
,n − k ,n − k
2 2
So the confidence interval is
b j − tα ,n − k σˆ C jj , b j + tα ,n − k σˆ C jj
2 2
.
2 2
4
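A sketch of the marginal confidence interval for a single coefficient on simulated data; the critical value t(0.025, 28) ≈ 2.048 is taken from standard tables rather than computed, which is an assumption of this sketch:

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 0.8]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
sigma2_hat = (y - X @ b) @ (y - X @ b) / (n - k)

j = 1                                   # coefficient of interest (the slope)
t_crit = 2.048                          # t(0.025, n - k) for n - k = 28, from tables
# half-width = t * sqrt(sigma^2-hat * C_jj), C_jj the jth diagonal of (X'X)^{-1}
half_width = t_crit * np.sqrt(sigma2_hat * XtX_inv[j, j])
ci = (b[j] - half_width, b[j] + half_width)
```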
For a joint confidence region, we use
(b − β)' X'X (b − β) / (k MSres) ~ F(k, n − k)
⇒ P[ (b − β)' X'X (b − β) / (k MSres) ≤ Fα(k, n − k) ] = 1 − α.
Let R be the multiple correlation coefficient between y and X1, X2, ..., Xk. Then the square of the multiple correlation coefficient (R²) is called the coefficient of determination. The value of R² commonly describes how well the sample regression line fits the observed data. It is also treated as a measure of goodness of fit of the model.
If the model
yi = β1 + β2 Xi2 + β3 Xi3 + ... + βk Xik + ui, i = 1, 2, ..., n
is fitted, then
R² = 1 − e'e / Σi (yi − ȳ)²
   = 1 − SSres/SST
   = SSreg/SST
where
SSres : sum of squares due to residuals,
SST : total sum of squares,
SSreg : sum of squares due to regression.
R² measures the explanatory power of the model, which in turn reflects the goodness of fit of the model. It reflects model adequacy in the sense of how much of the variation in y is explained by the explanatory variables.
Since
e'e = y'[I − X(X'X)⁻¹X']y = y'Hy
and
Σi (yi − ȳ)² = Σi yi² − nȳ²,
where ȳ = (1/n)Σi yi = (1/n)ℓ'y with ℓ = (1, 1, ..., 1)' and y = (y1, y2, ..., yn)', we have
Σi (yi − ȳ)² = y'y − (1/n) y'ℓℓ'y
             = y'y − y'ℓ(ℓ'ℓ)⁻¹ℓ'y
             = y'[I − ℓ(ℓ'ℓ)⁻¹ℓ']y
             = y'Ay
where A = I − ℓ(ℓ'ℓ)⁻¹ℓ'. So
R² = 1 − y'Hy / y'Ay.
Adjusted R2
If more explanatory variables are added to the model, then R2 increases. In case the variables are irrelevant, then R2 will
still increase and gives an overly optimistic picture.
To correct this overly optimistic picture, the adjusted R², denoted R̄² or adj R², is used, which is defined as
R̄² = 1 − [SSres/(n − k)] / [SST/(n − 1)]
    = 1 − [(n − 1)/(n − k)] (1 − R²).
We will see later that (n - k) and (n - 1) are the degrees of freedom associated with the distributions of SSres and SST.
Moreover, the quantities SSres/(n − k) and SST/(n − 1) are based on the unbiased estimators of the respective variances of e and y in the context of analysis of variance.
The adjusted R² will decline if the addition of an extra variable produces too small a reduction in (1 − R²) to compensate for the increase in (n − 1)/(n − k).
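Both goodness-of-fit measures follow directly from the two sums of squares. A small sketch on simulated data (the setup is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 60, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 0.5, 0.0, -0.3]) + rng.normal(size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)
ss_res = (y - X @ b) @ (y - X @ b)
ss_t = ((y - y.mean()) ** 2).sum()

r2 = 1 - ss_res / ss_t                          # coefficient of determination
r2_adj = 1 - (n - 1) / (n - k) * (1 - r2)       # adjusted R^2, penalises extra regressors
```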
Consider the two models
yi = β1 + β2 Xi2 + ... + βk Xik + ui, i = 1, 2, ..., n
log yi = γ1 + γ2 Xi2 + ... + γk Xik + vi.
For the first model,
R1² = 1 − Σi (yi − ŷi)² / Σi (yi − ȳ)²
and for the second model,
R2² = 1 − Σi (log yi − ŷi*)² / Σi (log yi − mean(log y))²
where ŷi* is the fitted value of log yi. Since the study variable differs in the two models (y versus log y), R1² and R2² are not directly comparable. Transforming the fitted values of the second model back to the original scale, define
R3² = 1 − Σi (yi − antilog ŷi*)² / Σi (yi − ȳ)².
Now R1² and R3² on comparison may give an idea about the adequacy of the two models.
Assuming β1 to be the intercept term, for H0 : β2 = β3 = ... = βk = 0 the F-statistic in the analysis of variance test is
F = MSreg / MSres
  = [(n − k)/(k − 1)] SSreg/SSres
  = [(n − k)/(k − 1)] (SSreg/SST) / (1 − SSreg/SST)
  = [(n − k)/(k − 1)] R² / (1 − R²).
In the limit, when R² = 1, F = ∞. So F and R² vary directly: a larger R² implies a greater F value. That is why the F-test under analysis of variance is termed a measure of the overall significance of the estimated regression. It is also a test of the significance of R². If F is highly significant, it implies that we can reject H0, i.e., y is linearly related to the X's.
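The identity F = [(n − k)/(k − 1)] R²/(1 − R²) is easy to verify numerically. The following sketch (simulated data, illustrative only) computes F both from the ANOVA decomposition and from R²:

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 45, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([0.2, 1.5, -0.7]) + rng.normal(size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)
ss_res = (y - X @ b) @ (y - X @ b)
ss_t = ((y - y.mean()) ** 2).sum()
r2 = 1 - ss_res / ss_t

# F from the ANOVA decomposition ...
F_anova = ((ss_t - ss_res) / (k - 1)) / (ss_res / (n - k))
# ... and F recovered from R^2 alone
F_from_r2 = (n - k) / (k - 1) * r2 / (1 - r2)
```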
ECONOMETRIC THEORY
MODULE – IV
Lecture - 16
Predictions in Linear Regression
Model
Consider the simple linear regression model y = β0 + β1x + ε, where the disturbances follow N(0, σ²). The parameters β0 and β1 are estimated using ordinary least squares by b0 and b1 as
b0 = ȳ − b1x̄
b1 = sxy / sxx
where
sxy = Σi (xi − x̄)(yi − ȳ),  sxx = Σi (xi − x̄)²,  x̄ = (1/n)Σi xi,  ȳ = (1/n)Σi yi.
The fitted model is ŷ = b0 + b1x.
Case 1: Prediction of average value of y
Suppose we want to predict the value of E(y) for a given value of x = x0 . Then the predictor is given by
pm= b0 + b1 x0 .
Here m stands for mean value.
Predictive bias
The prediction error is given as
pm − E ( y ) =b0 + b1 x0 − E ( β 0 + β1 x0 + ε )
=b0 + b1 x0 − ( β 0 + β1 x0 )
= (b0 − β 0 ) + (b1 − β1 ) x0 .
Then the prediction bias is given as
E [ pm − E ( y ) ] = E (b0 − β 0 ) + E (b1 − β1 ) x0
= 0 + 0 = 0.
Thus the predictor pm is an unbiased predictor of E(y).
Predictive variance
The predictive variance of pm is
PV(pm) = Var(b0 + b1x0)
       = Var[ȳ + b1(x0 − x̄)]
       = Var(ȳ) + (x0 − x̄)² Var(b1) + 2(x0 − x̄) Cov(ȳ, b1)
       = σ²/n + σ²(x0 − x̄)²/sxx + 0
       = σ² [1/n + (x0 − x̄)²/sxx].
Its estimate is
P̂V(pm) = σ̂² [1/n + (x0 − x̄)²/sxx]
        = MSE [1/n + (x0 − x̄)²/sxx].
Also pm ~ N(β0 + β1x0, PV(pm)). So if σ² is known, then the distribution of
(pm − E(y)) / √PV(pm)
is N(0, 1), and
P[ −z(α/2) ≤ (pm − E(y))/√PV(pm) ≤ z(α/2) ] = 1 − α.
When σ² is unknown, it is replaced by σ̂² = MSE, and in this case the sampling distribution of
(pm − E(y)) / √( MSE [1/n + (x0 − x̄)²/sxx] )
is the t-distribution with (n − 2) degrees of freedom, i.e., t(n−2). So
P[ −t(α/2, n−2) ≤ (pm − E(y)) / √( MSE [1/n + (x0 − x̄)²/sxx] ) ≤ t(α/2, n−2) ] = 1 − α,
which gives the 100(1 − α)% confidence interval for E(y) as
( pm − t(α/2, n−2) √( MSE [1/n + (x0 − x̄)²/sxx] ),  pm + t(α/2, n−2) √( MSE [1/n + (x0 − x̄)²/sxx] ) ).
Note that the width of the prediction interval for E(y) is a function of x0. The interval width is minimum for x0 = x̄ and widens as |x0 − x̄| increases. This is expected, since the best predictions of y are made at x-values near the center of the data, and the precision of estimation deteriorates as x0 moves toward the boundary of the x-space.
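A sketch of the confidence interval for E(y) at x = x0 in simple regression. The data are simulated and the critical value t(0.025, 23) ≈ 2.069 is an assumed table value:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 25
x = rng.uniform(0, 10, size=n)
y = 2.0 + 0.5 * x + rng.normal(size=n)

xbar, ybar = x.mean(), y.mean()
sxx = ((x - xbar) ** 2).sum()
b1 = ((x - xbar) * (y - ybar)).sum() / sxx
b0 = ybar - b1 * xbar

mse = ((y - b0 - b1 * x) ** 2).sum() / (n - 2)

x0 = 5.0                                   # point at which E(y) is predicted
pm = b0 + b1 * x0
t_crit = 2.069                             # t(0.025, n - 2) for n - 2 = 23, from tables
half = t_crit * np.sqrt(mse * (1 / n + (x0 - xbar) ** 2 / sxx))
ci_mean = (pm - half, pm + half)
```

The width is smallest at x0 = x̄, matching the remark above.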
Case 2: Prediction of actual value of y
Suppose we want to predict the actual value of y at x = x0. The predictor is
pa = b0 + b1x0.
Here a means "actual". Note that the form of the predictor is the same as that of the average value predictor, but its predictive error and other properties are different. This is the dual nature of the predictor.
Predictive bias
The predictive error of pa is given by
pa − y0 = b0 + b1x0 − (β0 + β1x0 + ε0)
        = (b0 − β0) + (b1 − β1)x0 − ε0.
Thus
E(pa − y0) = E(b0 − β0) + E(b1 − β1)x0 − E(ε0) = 0 + 0 + 0 = 0
which implies that pa is an unbiased predictor of y0.
Predictive variance
The predictive variance of pa is
PV(pa) = E(pa − y0)²
       = E[(b0 − β0) + (b1 − β1)x0 − ε0]²
       = Var(b0 + b1x0) + Var(ε0)
       = σ² [1/n + (x0 − x̄)²/sxx] + σ²
       = σ² [1 + 1/n + (x0 − x̄)²/sxx],
the cross terms being 0 under the assumed independence of ε0 with ε1, ε2, ..., εn. Its estimate is
P̂V(pa) = σ̂² [1 + 1/n + (x0 − x̄)²/sxx]
        = MSE [1 + 1/n + (x0 − x̄)²/sxx].
Prediction interval
If σ² is known, then
P[ −z(α/2) ≤ (pa − y0)/√PV(pa) ≤ z(α/2) ] = 1 − α,
which gives the prediction interval for y0 as
( pa − z(α/2) √( σ² [1 + 1/n + (x0 − x̄)²/sxx] ),  pa + z(α/2) √( σ² [1 + 1/n + (x0 − x̄)²/sxx] ) ).
When σ² is unknown, (pa − y0)/√P̂V(pa) follows a t-distribution with (n − 2) degrees of freedom, and the 100(1 − α)% prediction interval is obtained from
P[ −t(α/2, n−2) ≤ (pa − y0)/√P̂V(pa) ≤ t(α/2, n−2) ] = 1 − α,
which gives the prediction interval
( pa − t(α/2, n−2) √( MSE [1 + 1/n + (x0 − x̄)²/sxx] ),  pa + t(α/2, n−2) √( MSE [1 + 1/n + (x0 − x̄)²/sxx] ) ).
The prediction interval for pa is wider than the prediction interval for pm because the prediction interval for pa
depends on both the error from the fitted model as well as the error associated with the future observations.
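The widths of the two intervals can be compared directly: the interval for the actual value carries the extra "1 +" term. A sketch on simulated data, with t(0.025, 18) ≈ 2.101 as an assumed table value:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 20
x = rng.uniform(0, 10, size=n)
y = 1.0 + 0.8 * x + rng.normal(size=n)

xbar, ybar = x.mean(), y.mean()
sxx = ((x - xbar) ** 2).sum()
b1 = ((x - xbar) * (y - ybar)).sum() / sxx
b0 = ybar - b1 * xbar
mse = ((y - b0 - b1 * x) ** 2).sum() / (n - 2)

x0 = 4.0
pa = b0 + b1 * x0
t_crit = 2.101                             # t(0.025, n - 2) for n - 2 = 18, from tables
# half-width for E(y) at x0 ...
half_mean = t_crit * np.sqrt(mse * (1 / n + (x0 - xbar) ** 2 / sxx))
# ... and for the actual value y0, with the extra "1 +" term
half_actual = t_crit * np.sqrt(mse * (1 + 1 / n + (x0 - xbar) ** 2 / sxx))
```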
Let the parameter β be estimated by its ordinary least squares estimator b = (X'X)⁻¹X'y. Then the predictor is p = Xb, which can be used for predicting both the actual and the average value of the study variable. This is the dual nature of the predictor.
Case 1: Prediction of average value of y
When the objective is to predict the average value of y, i.e., E(y) then the estimation error is given by
p − E ( y ) =Xb − X β
= X (b − β )
= X ( X ' X ) −1 X ' ε
= Hε
where H = X ( X ' X ) −1 X '.
Then
E [ p − E ( y)] = X β − X β = 0
which proves that the predictor p = Xb provides unbiased prediction for average value.
The predictive variance of p is
PVm(p) = E[{p − E(y)}'{p − E(y)}]
       = E(ε'H'Hε)
       = E(ε'Hε)
       = σ² tr H
       = σ² k.
The predictive variance can be estimated by P̂Vm(p) = σ̂²k, where σ̂² = MSE is obtained from the analysis of variance based on the OLSE.
Case 2: Prediction of actual value of y
When the predictor p = Xb is used for predicting the actual value of the study variable y, its prediction error is given by
p − y = Xb − Xβ − ε
      = X(b − β) − ε
      = X(X'X)⁻¹X'ε − ε
      = −[I − X(X'X)⁻¹X']ε
      = −(I − H)ε.
Thus
E(p − y) = 0,
which shows that p provides unbiased predictions for the actual values of the study variable. The predictive variance is
PVa(p) = E[(p − y)'(p − y)]
       = E[ε'(I − H)ε]
       = σ² tr(I − H)
       = σ² (n − k).
The predictive variance can be estimated by
P̂Va(p) = σ̂²(n − k)
where σ̂² = MSE is obtained from the analysis of variance based on the OLSE. Comparing the performances of p in predicting actual and average values, we find that p is a better predictor of the average value than of the actual value when
PVm(p) < PVa(p), i.e., k < (n − k) or 2k < n,
i.e., when the total number of observations exceeds twice the number of explanatory variables.
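The comparison PVm(p) = σ̂²k versus PVa(p) = σ̂²(n − k) can be sketched as follows (simulated data; here 2k < n, so p is the better predictor of the average value):

```python
import numpy as np

rng = np.random.default_rng(7)
n, k = 30, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 0.4, -0.2, 0.9]) + rng.normal(size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)
mse = (y - X @ b) @ (y - X @ b) / (n - k)

pv_mean = mse * k            # estimated predictive variance for the average value E(y)
pv_actual = mse * (n - k)    # estimated predictive variance for the actual value y
```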
ECONOMETRIC THEORY
MODULE – IV
Lecture – 17
Predictions in Linear Regression
Model
yf = Xf β + εf   (2)
where yf is an (nf x 1) vector of future values, Xf is an (nf x k) matrix of known values of the explanatory variables, and εf is an (nf x 1) vector of disturbances following N(0, σ²Inf). It is also assumed that the elements of ε and εf are independently distributed.
We now consider the prediction of the yf values for given Xf from model (2). This can be done by estimating the regression coefficients from model (1), based on the n observations, and using them in formulating the predictor for model (2). If ordinary least squares estimation is used to estimate β in model (1) as
b = (X'X)⁻¹X'y,
then the predictor is
pf = Xf b = Xf (X'X)⁻¹X'y.
Prediction of average value of y
When the aim is to predict the average value E(yf), the prediction error is
pf − E(yf) = Xf b − Xf β
           = Xf (b − β)
           = Xf (X'X)⁻¹X'ε.
Then
E[pf − E(yf)] = Xf (X'X)⁻¹X' E(ε) = 0.
Thus pf provides unbiased prediction for the average value. The predictive covariance matrix is
Covm(pf) = E[{pf − E(yf)}{pf − E(yf)}'] = σ² Xf (X'X)⁻¹Xf'
and the predictive variance is
PVm(pf) = E[{pf − E(yf)}'{pf − E(yf)}]
        = tr Covm(pf)
        = σ² tr[(X'X)⁻¹Xf'Xf].
If σ² is unknown, then replacing σ² by σ̂² = MSE in these expressions gives their estimates
Ĉovm(pf) = σ̂² Xf (X'X)⁻¹Xf'  and  P̂Vm(pf) = σ̂² tr[(X'X)⁻¹Xf'Xf].
Prediction of actual value of y
When the aim is to predict the actual values yf, the prediction error is
pf − yf = Xf b − Xf β − εf
        = Xf (b − β) − εf.
Then
E(pf − yf) = Xf E(b − β) − E(εf) = 0.
Thus pf provides unbiased prediction for the actual values. The predictive covariance matrix is
Cova(pf) = σ² [Xf (X'X)⁻¹Xf' + Inf]
and the predictive variance is
PVa(pf) = tr Cova(pf) = σ² { tr[(X'X)⁻¹Xf'Xf] + nf }.
The estimates of the covariance matrix and predictive variance can be obtained by replacing σ² by σ̂² = MSE as
Ĉova(pf) = σ̂² [Xf (X'X)⁻¹Xf' + Inf]  and  P̂Va(pf) = σ̂² { tr[(X'X)⁻¹Xf'Xf] + nf }.
In many applications, it may not be appropriate to confine our attention to only either of the two. It may be more appropriate in some situations to predict both values simultaneously, i.e., to consider the prediction of the actual and average values of y together.
For example, suppose a firm sells fertilizer to users. The company's interest would be in predicting the average value of the yield, which it would like to use to show that the average yield of the crop increases by using its fertilizer. On the other side, the user is not interested in the average value; the user would like to know the actual increase in yield from using the fertilizer. Suppose both the seller and the user go for prediction through regression modeling. Using the classical tools, the statistician can predict either the actual value or the average value, which safeguards the interest of either the user or the seller but not both. Instead, it is required to safeguard the interests of both by striking a balance between the objectives of the seller and the user.
This can be achieved by combining both the predictions of actual and average values, which can be done by formulating an objective function or target function. Such a target function has to be flexible: it should allow different weights to be assigned to the two kinds of predictions depending upon their importance in a given application, and it should reduce to either one of them as a special case.
Now we consider the simultaneous prediction in the within-sample and outside-sample cases.
Within-sample prediction: define the target function
τ = λy + (1 − λ)E(y), 0 ≤ λ ≤ 1,
and consider the predictor
p = Xb.
Now employ this predictor for predicting the actual and average values simultaneously through the target function.
The prediction error is
p − τ = Xb − λ y − (1 − λ ) E ( y )
= Xb − λ ( X β + ε ) − (1 − λ ) X β
= X (b − β ) − λε .
Thus
E(p − τ) = X E(b − β) − λ E(ε) = 0.
So p provides unbiased prediction for τ.
The variance of p is
Var(p) = E[(p − τ)'(p − τ)]
       = E[{(b − β)'X' − λε'}{X(b − β) − λε}]
       = E[ε'X(X'X)⁻¹X'X(X'X)⁻¹X'ε + λ²ε'ε − λ(b − β)'X'ε − λε'X(b − β)]
       = E[(1 − 2λ)ε'X(X'X)⁻¹X'ε + λ²ε'ε]
       = σ²[(1 − 2λ)k + λ²n].
For the outside-sample case, consider
y = Xβ + ε,  E(ε) = 0,  V(ε) = σ²In
(y : n x 1, X : n x k, β : k x 1, ε : n x 1) and
yf = Xf β + εf;  E(εf) = 0,  V(εf) = σ²Inf
(yf : nf x 1, Xf : nf x k, εf : nf x 1). The target function is
τf = λyf + (1 − λ)E(yf); 0 ≤ λ ≤ 1
and the predictor is
pf = Xf b,  b = (X'X)⁻¹X'y.
The variance of pf is
Var(pf) = E[(pf − τf)'(pf − τf)]
        = E[{(b − β)'Xf' − λεf'}{Xf(b − β) − λεf}]
        = E[ε'X(X'X)⁻¹Xf'Xf(X'X)⁻¹X'ε + λ²εf'εf − 2λεf'Xf(X'X)⁻¹X'ε]
        = σ²[ tr{X(X'X)⁻¹Xf'Xf(X'X)⁻¹X'} + λ²nf ].
MODULE – V
Lecture - 18
Generalized and Weighted Least
Squares Estimation
The usual linear regression model assumes that all the random error components are identically and independently
distributed with constant variance. When this assumption is violated, the ordinary least squares estimator of the regression coefficients loses its property of minimum variance in the class of linear and unbiased estimators. The violation of such an assumption can arise in any one of the following situations:
1. The variance of random error components is not constant.
2. The random error components are not independent.
3. The random error components do not have constant variance as well as they are not independent.
In such cases, the covariance matrix of random error components does not remain in the form of an identity matrix but can
be considered as any positive definite matrix. Under such assumption, the OLSE does not remain efficient as in the case of
identity covariance matrix. The generalized or weighted least squares method is used in such situations to estimate the
parameters of the model.
In this method, the deviation between the observed and expected values of yi is multiplied by a weight ωi where ωi
is chosen to be inversely proportional to the variance of yi.
For the simple linear regression model, the weighted least squares function is
S(β0, β1) = Σi ωi (yi − β0 − β1xi)².
The least squares normal equations are obtained by differentiating S(β0, β1) with respect to β0 and β1 and equating them to zero:
β̂0 Σi ωi + β̂1 Σi ωi xi = Σi ωi yi
β̂0 Σi ωi xi + β̂1 Σi ωi xi² = Σi ωi xi yi.
The solution of these two normal equations gives the weighted least squares estimates of β0 and β1.
Consider the model y = Xβ + ε with E(ε) = 0 and V(ε) = σ²Ω, where Ω is a known positive definite matrix. The OLSE of β is
b = (X'X)⁻¹X'y.
In such cases the OLSE remains unbiased but no longer has minimum variance. Since Ω is positive definite, there exists a nonsingular matrix K such that
KK' = Ω.
Premultiplying the model by K⁻¹ gives
K⁻¹y = K⁻¹Xβ + K⁻¹ε
or
z = Bβ + g
where z = K⁻¹y, B = K⁻¹X, g = K⁻¹ε. Now observe that
E(g) = K⁻¹E(ε) = 0
V(g) = E[{g − E(g)}{g − E(g)}'] = E(gg') = K⁻¹E(εε')(K⁻¹)' = σ²K⁻¹Ω(K')⁻¹ = σ²K⁻¹KK'(K')⁻¹ = σ²I.
So either minimize
S(β) = g'g = ε'Ω⁻¹ε = (y − Xβ)'Ω⁻¹(y − Xβ)
and get the normal equations
(X'Ω⁻¹X)β̂ = X'Ω⁻¹y
or β̂ = (X'Ω⁻¹X)⁻¹X'Ω⁻¹y;
or apply OLS to the transformed model, giving
β̂ = (B'B)⁻¹B'z
   = (X'(K⁻¹)'K⁻¹X)⁻¹X'(K⁻¹)'K⁻¹y
   = (X'Ω⁻¹X)⁻¹X'Ω⁻¹y.
This is the generalized least squares estimator (GLSE) of β. Also,
β̂ − β = (B'B)⁻¹B'g.
Then
E(β̂ − β) = (B'B)⁻¹B'E(g) = 0;
equivalently,
E(β̂) = (X'Ω⁻¹X)⁻¹X'Ω⁻¹E(y) = (X'Ω⁻¹X)⁻¹X'Ω⁻¹Xβ = β,
which shows that the GLSE is an unbiased estimator of β. The covariance matrix of the GLSE is given by
V(β̂) = E[{β̂ − E(β̂)}{β̂ − E(β̂)}']
      = E[(B'B)⁻¹B'gg'B(B'B)⁻¹]
      = σ²(B'B)⁻¹
      = σ²(X'Ω⁻¹X)⁻¹.
Thus, for any fixed vector ℓ,
Var(ℓ'β̂) = ℓ'V(β̂)ℓ = σ²ℓ'(X'Ω⁻¹X)⁻¹ℓ.
Let β̃ be another unbiased estimator of β that is a linear combination of the data. Our goal, then, is to show that
Var(ℓ'β̃) ≥ σ²ℓ'(X'Ω⁻¹X)⁻¹ℓ for every ℓ.
We first note that we can write any estimator of β that is a linear combination of the data as
β̃ = [(X'Ω⁻¹X)⁻¹X'Ω⁻¹ + B0]y + b0*
where B0 is a (k x n) matrix and b0* is a (k x 1) vector of constants that appropriately adjust the GLS estimator to form the alternative estimator. Then
E(β̃) = [(X'Ω⁻¹X)⁻¹X'Ω⁻¹ + B0]E(y) + b0* = β + B0Xβ + b0*,
so unbiasedness of β̃ for every β requires B0X = 0 and b0* = 0. Consequently,
V(β̃) = Var([(X'Ω⁻¹X)⁻¹X'Ω⁻¹ + B0]y)
      = [(X'Ω⁻¹X)⁻¹X'Ω⁻¹ + B0] V(y) [(X'Ω⁻¹X)⁻¹X'Ω⁻¹ + B0]'
      = σ²[(X'Ω⁻¹X)⁻¹ + B0ΩB0']   (using B0X = 0)
so that
Var(ℓ'β̃) = σ²ℓ'(X'Ω⁻¹X)⁻¹ℓ + σ²ℓ'B0ΩB0'ℓ.
We note that Ω is a positive definite matrix. Consequently, there exists some nonsingular matrix K such that Ω = K'K. As a result, B0ΩB0' = B0K'KB0' is at least a positive semidefinite matrix; hence ℓ'B0ΩB0'ℓ ≥ 0. Indeed, defining ℓ* = KB0'ℓ,
ℓ'B0ΩB0'ℓ = ℓ*'ℓ* = Σi ℓi*² ≥ 0
for all ℓ. Thus the GLS estimator of β is the best linear unbiased estimator.
A special case of this framework arises when the errors are uncorrelated but have unequal variances:
V(ε) = σ²Ω = σ² diag(1/ω1, 1/ω2, ..., 1/ωn).
The estimation procedure in this case is usually called weighted least squares. Let W = Ω⁻¹ = diag(ω1, ω2, ..., ωn); then the weighted least squares estimator of β is obtained by solving the normal equations
(X'WX)β̂ = X'Wy,
which give
β̂ = (X'WX)⁻¹X'Wy,
where ω1, ω2, ..., ωn are called the weights. The observations with large variances receive smaller weights than the observations with small variances.
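The GLSE and the "transform-then-OLS" route give identical answers, which the following sketch checks for a diagonal Ω (a heteroskedastic example; all numerical choices are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(9)
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=n)])
# heteroskedastic errors: Omega is diagonal with known relative variances
omega_diag = rng.uniform(0.5, 3.0, size=n)
Omega = np.diag(omega_diag)                 # V(eps) = sigma^2 * Omega, Omega known
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n) * np.sqrt(omega_diag)

Omega_inv = np.linalg.inv(Omega)
# GLSE: beta_hat = (X' Omega^{-1} X)^{-1} X' Omega^{-1} y
beta_gls = np.linalg.solve(X.T @ Omega_inv @ X, X.T @ Omega_inv @ y)

# equivalent route: transform by K^{-1} with Omega = KK', then apply OLS
K_inv = np.diag(1.0 / np.sqrt(omega_diag))
z, B = K_inv @ y, K_inv @ X
beta_ols_transformed = np.linalg.solve(B.T @ B, B.T @ z)
```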
ECONOMETRIC THEORY
MODULE – VI
Lecture – 19
Regression Analysis Under Linear
Restrictions
One of the basic objectives in any statistical modeling is to find good estimators of parameters. In the context of multiple
linear regression model y = Xβ + ε, the ordinary least squares estimator b = (X'X)⁻¹X'y is the best linear unbiased estimator of β. Several approaches have been attempted in the literature to improve further upon the OLSE. One approach to
improve the estimators is the use of extraneous information or prior information. In applied work, such prior information
may be available about the regression coefficients. For example, in economics, the constant returns to scale imply that
the exponents in a Cobb-Douglas production function should sum to unity. In another example, absence of money illusion
on the part of consumers implies that the sum of money income and price elasticities in a demand function should be
zero. These types of constraints or the prior information may be available from
i. some theoretical considerations.
ii. past experience of the experimenter.
iii. empirical investigations.
iv. some extraneous sources etc.
To utilize such information in improving the estimation of regression coefficients, it can be expressed in the form of
i. exact linear restrictions.
ii. stochastic linear restrictions.
iii. inequality restrictions.
We consider the use of prior information in the form of exact and stochastic linear restrictions in the model y = Xβ + ε,
where y is a ( n × 1) vector of observations on study variable, X is a ( n × k ) matrix of observations on explanatory
variables X 1 , X 2 ,..., X k , β is a (k ×1) vector of regression coefficients and ε is a (n × 1) vector of disturbance terms.
Exact linear restrictions
Suppose the prior information binding the regression coefficients is available from some extraneous sources which can be
expressed in the form of exact linear restrictions as
r = Rβ
where r is a ( q × 1) vector and R is a (q × k ) matrix with rank(R) = q (q < k). The elements in r and R are known.
Some examples of exact linear restriction r = R β are as follows:
(i) If there are two restrictions with k = 6 like
β2 = β4
β3 + 2β4 + β5 = 1
then
r = (0, 1)',  R = [0 1 0 −1 0 0
                   0 0 1  2 1 0].
(ii) If k = 3 and β2 = 3, then
r = [3],  R = [0 1 0].
(iii) If k = 3 and the restrictions are β1 = aβ2, β2 = bβ3 and β1 = abβ3, then
r = (0, 0, 0)',  R = [1 −a   0
                      0  1  −b
                      1  0  −ab].
The ordinary least squares estimator b = (X'X)⁻¹X'y does not use the prior information. It does not obey the restrictions in the sense that r ≠ Rb. So the issue is how to use the sample information and the prior information together in finding an improved estimator of β.
Restricted least squares estimation
The restricted least squares estimation method enables the use of sample information and prior information
simultaneously. In this method, choose β such that the error sum of squares is minimized subject to linear restrictions
r = Rβ. This can be achieved using the Lagrangian multiplier technique. Define the Lagrangian function
S(β, λ) = (y − Xβ)'(y − Xβ) − 2λ'(Rβ − r)
where λ is a (q x 1) vector of Lagrangian multipliers.
Using the result that if a and b are vectors and A is a suitably defined matrix, then
∂(a'Aa)/∂a = (A + A')a
∂(a'b)/∂a = b,
we have
∂S(β, λ)/∂β = 2X'Xβ − 2X'y − 2R'λ = 0    (*)
∂S(β, λ)/∂λ = −2(Rβ − r) = 0, i.e., Rβ = r.
Pre-multiplying equation (*) by R(X'X)⁻¹ and using Rβ = r, we can solve for λ; substituting back gives
β̂R = (X'X)⁻¹X'y + (X'X)⁻¹R'[R(X'X)⁻¹R']⁻¹(r − Rb)
    = b + (X'X)⁻¹R'[R(X'X)⁻¹R']⁻¹(r − Rb).
This estimator is termed the restricted regression estimator of β.
2. Unbiasedness
When the restriction r = Rβ holds, the estimation error of β̂R is
β̂R − β = D(b − β)
where
D = I − (X'X)⁻¹R'[R(X'X)⁻¹R']⁻¹R.
Thus
E(β̂R − β) = D E(b − β) = 0
implying that β̂R is an unbiased estimator of β.
3. Covariance matrix
The covariance matrix of β̂R is
V(β̂R) = E[(β̂R − β)(β̂R − β)']
       = D E[(b − β)(b − β)'] D'
       = D V(b) D'
       = σ² D (X'X)⁻¹ D'
       = σ² { (X'X)⁻¹ − (X'X)⁻¹R'[R(X'X)⁻¹R']⁻¹R(X'X)⁻¹ }.
The last step follows on expanding
D(X'X)⁻¹D' = { (X'X)⁻¹ − (X'X)⁻¹R'[R(X'X)⁻¹R']⁻¹R(X'X)⁻¹ } { I − R'[R(X'X)⁻¹R']⁻¹R(X'X)⁻¹ }
           = (X'X)⁻¹ − 2(X'X)⁻¹R'[R(X'X)⁻¹R']⁻¹R(X'X)⁻¹
             + (X'X)⁻¹R'[R(X'X)⁻¹R']⁻¹R(X'X)⁻¹R'[R(X'X)⁻¹R']⁻¹R(X'X)⁻¹
           = (X'X)⁻¹ − (X'X)⁻¹R'[R(X'X)⁻¹R']⁻¹R(X'X)⁻¹.
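The restricted estimator β̂R takes only a few matrix operations, and by construction it satisfies the restriction exactly. A sketch with a single illustrative restriction β2 = β3 on simulated data:

```python
import numpy as np

rng = np.random.default_rng(10)
n, k = 50, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 0.5, 0.5, -1.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y                       # unrestricted OLS estimator

# one restriction beta_2 = beta_3, i.e. R beta = r with
R = np.array([[0.0, 1.0, -1.0, 0.0]])
r = np.array([0.0])

M = np.linalg.inv(R @ XtX_inv @ R.T)
b_r = b + XtX_inv @ R.T @ M @ (r - R @ b)   # restricted least squares estimator
```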
where λ is a (q x 1) vector of Lagrangian multipliers. The normal equations are obtained by partially differentiating the log-likelihood function with respect to β, σ² and λ and equating them to zero:
∂ln L(β, σ², λ)/∂β = −(1/σ²)(X'Xβ − X'y) + 2R'λ = 0    (1)
∂ln L(β, σ², λ)/∂λ = 2(Rβ − r) = 0    (2)
∂ln L(β, σ², λ)/∂σ² = −n/(2σ²) + (y − Xβ)'(y − Xβ)/(2σ⁴) = 0.    (3)
Let β̃R, σ̃R² and λ̃ denote the maximum likelihood estimators of β, σ² and λ respectively, obtained by solving these equations. From (1) and (2),
λ̃ = (1/2σ̃²) [R(X'X)⁻¹R']⁻¹ (r − Rβ̃).
Substituting λ̃ in equation (1) gives
β̃R = β̃ + (X'X)⁻¹R'[R(X'X)⁻¹R']⁻¹(r − Rβ̃)
where
β̃ = (X'X)⁻¹X'y
is the maximum likelihood estimator of β without restrictions. From equation (3), we get
σ̃R² = (y − Xβ̃R)'(y − Xβ̃R)/n.
The Hessian matrix of second order partial derivatives with respect to β and σ² is positive definite at β = β̃R and σ² = σ̃R².
The restricted least squares and restricted maximum likelihood estimators of β are the same, whereas they differ for σ².
Test of hypothesis
It is important to test the hypothesis
H 0 : r = Rβ
H1 : r ≠ Rβ
before using it in the estimation procedure.
The construction of the test statistic for this hypothesis is detailed in Module 3 on the multiple linear regression model. The resulting test statistic is
F = { (r − Rb)'[R(X'X)⁻¹R']⁻¹(r − Rb) / q } / { (y − Xb)'(y − Xb) / (n − k) }
which follows an F-distribution with q and (n − k) degrees of freedom under H0. The decision rule is to reject H0 at α level of significance whenever F ≥ F1−α(q, n − k).
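The F statistic for testing r = Rβ can be sketched as follows (simulated data with q = 1 restriction; all numerical choices are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(11)
n, k, q = 60, 3, 1
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([2.0, 1.0, 1.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
ss_res = (y - X @ b) @ (y - X @ b)

R = np.array([[0.0, 1.0, -1.0]])   # restriction beta_2 = beta_3
r = np.array([0.0])

d = r - R @ b
# F = [(r - Rb)'(R (X'X)^{-1} R')^{-1} (r - Rb) / q] / [SSres / (n - k)]
F = (d @ np.linalg.inv(R @ XtX_inv @ R.T) @ d / q) / (ss_res / (n - k))
```

F is then compared with the F(q, n − k) critical value at the chosen level.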
ECONOMETRIC THEORY
MODULE – VI
Lecture – 20
Regression Analysis Under Linear
Restrictions
The prior information is now expressed as
r = Rβ + V
where r is a (q x 1) vector, R is a (q x k) matrix and V is a (q x 1) vector of random errors. The elements in r and R are known. The term V reflects the randomness involved in the prior information r = Rβ. Assume
E(V) = 0
E(VV') = ψ
E(εV') = 0
where ψ is a known (q x q) positive definite matrix and ε is the disturbance term in the multiple regression model y = Xβ + ε.
Note that E (r ) = Rβ .
The possible reasons for such stochastic linear restrictions are as follows:

(i) An earlier estimate of a regression coefficient may be available along with its standard error. For example, in repetitive studies where surveys are conducted every year, suppose the regression coefficient β₁ remains stable over several years, with its estimate staying around the value 0.5 with standard error 2. This information can be expressed as

r = β₁ + V₁, with r = 0.5, E(V₁) = 0, E(V₁²) = 2² = 4.

Now ψ can be formulated with this data. It is not necessary to have such information on all the regression coefficients; it may be available for only some of them.
(ii) Sometimes the restrictions are in the form of inequalities. Such restrictions may arise from theoretical considerations. For example, the value of a regression coefficient may lie between 3 and 5, i.e., 3 ≤ β₁ ≤ 5, say. As another example, consider a simple linear regression model

y = β₀ + β₁x + ε

where y denotes the consumption expenditure on food and x denotes income. Then the marginal propensity (tendency) to consume is

\[ \frac{dy}{dx} = \beta_1, \]

i.e., if the salary increases by one rupee, then one is expected to spend β₁ of that rupee on food and save (1 − β₁) of it. We may put a bound on β₁ reflecting that one can neither spend all of the rupee nor none of it, so 0 < β₁ < 1. This is a natural restriction arising from theoretical considerations.

These bounds can be treated as p-sigma limits, say 2-sigma limits or confidence limits. Thus

\[ \mu - 2\sigma = 0, \qquad \mu + 2\sigma = 1 \quad\Rightarrow\quad \mu = \frac{1}{2}, \ \sigma = \frac{1}{4}. \]

These values can be interpreted as

\[ \beta_1 + V_1 = \frac{1}{2}, \qquad E(V_1^2) = \frac{1}{16}. \]
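The conversion of theoretical bounds into a stochastic prior is mechanical and can be written as a tiny helper; the function name below is ours, not the lecture's:

```python
def bounds_to_prior(a, b, p=2):
    """Treat a < beta < b as p-sigma limits: mu - p*sigma = a, mu + p*sigma = b."""
    mu = (a + b) / 2.0
    sigma = (b - a) / (2.0 * p)
    return mu, sigma

# the 0 < beta_1 < 1 example from the lecture, with 2-sigma limits
mu, sigma = bounds_to_prior(0.0, 1.0)
```

This reproduces μ = 1/2, σ = 1/4 and hence E(V₁²) = σ² = 1/16.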
(iii) Sometimes the truthfulness of an exact linear restriction r = Rβ may be suspect, and accordingly an element of uncertainty can be introduced. For example, one may say that only 95% of the restriction holds. So some element of uncertainty prevails.
which is termed the pure estimator. The pure estimator b does not satisfy the restriction r = Rβ + V. So the objective is to obtain an estimate of β by utilizing the stochastic restrictions such that the resulting estimator satisfies them as well. In order to avoid the conflict between prior information and sample information, we can combine them as follows:

Write

y = Xβ + ε with E(ε) = 0, E(εε') = σ²Iₙ
r = Rβ + V with E(V) = 0, E(VV') = ψ, E(εV') = 0

jointly as

\[ \begin{pmatrix} y \\ r \end{pmatrix} = \begin{pmatrix} X \\ R \end{pmatrix}\beta + \begin{pmatrix} \varepsilon \\ V \end{pmatrix} \]

or

a = Aβ + w

where a = (y'  r')', A = (X'  R')', w = (ε'  V')'. Note that

\[ E(w) = \begin{pmatrix} E(\varepsilon) \\ E(V) \end{pmatrix} = 0 \]

\[ E(ww') = E\begin{pmatrix} \varepsilon\varepsilon' & \varepsilon V' \\ V\varepsilon' & VV' \end{pmatrix} = \begin{pmatrix} \sigma^2 I_n & 0 \\ 0 & \psi \end{pmatrix} = \Omega, \text{ say.} \]
This shows that the disturbances are non-spherical or heteroskedastic, so generalized least squares estimation will yield a more efficient estimator than ordinary least squares. Applying generalized least squares to the model

a = Aβ + w with E(w) = 0, V(w) = Ω,

we have

\[ A'\Omega^{-1}a = \begin{pmatrix} X' & R' \end{pmatrix}\begin{pmatrix} \frac{1}{\sigma^2}I_n & 0 \\ 0 & \psi^{-1} \end{pmatrix}\begin{pmatrix} y \\ r \end{pmatrix} = \frac{1}{\sigma^2}X'y + R'\psi^{-1}r \]

\[ A'\Omega^{-1}A = \begin{pmatrix} X' & R' \end{pmatrix}\begin{pmatrix} \frac{1}{\sigma^2}I_n & 0 \\ 0 & \psi^{-1} \end{pmatrix}\begin{pmatrix} X \\ R \end{pmatrix} = \frac{1}{\sigma^2}X'X + R'\psi^{-1}R. \]

Thus

\[ \hat{\beta}_M = \left(\frac{1}{\sigma^2}X'X + R'\psi^{-1}R\right)^{-1}\left(\frac{1}{\sigma^2}X'y + R'\psi^{-1}r\right), \]

assuming σ² to be known. This is termed the mixed regression estimator.

If σ² is unknown, then it can be replaced by its estimator

\[ s^2 = \frac{(y - Xb)'(y - Xb)}{n - k}, \]

giving the feasible mixed regression estimator

\[ \hat{\beta}_f = \left(\frac{1}{s^2}X'X + R'\psi^{-1}R\right)^{-1}\left(\frac{1}{s^2}X'y + R'\psi^{-1}r\right). \]
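The feasible mixed regression estimator can be sketched directly from the formula above; the simulated data and the prior below (β₁ ≈ 0.5 with variance ψ) are illustrative assumptions:

```python
import numpy as np

def mixed_estimator(X, y, R, r, Psi):
    """Feasible mixed regression estimator combining y = X beta + eps
    with the stochastic prior r = R beta + V, E(VV') = Psi."""
    n, k = X.shape
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    s2 = (y - X @ b) @ (y - X @ b) / (n - k)    # s^2 replaces sigma^2
    Pinv = np.linalg.inv(Psi)
    A = X.T @ X / s2 + R.T @ Pinv @ R
    c = X.T @ y / s2 + R.T @ Pinv @ r
    return np.linalg.solve(A, c)

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 2))
y = X @ np.array([0.6, 1.5]) + rng.normal(size=40)
R = np.array([[1.0, 0.0]])                      # prior on beta_1 only
beta_loose = mixed_estimator(X, y, R, np.array([0.5]), np.array([[4.0]]))
beta_tight = mixed_estimator(X, y, R, np.array([0.5]), np.array([[1e-10]]))
```

As ψ → 0 the prior becomes an exact restriction, so the first component of the estimate is pulled to the prior value 0.5.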
(i) Unbiasedness
The estimation error of \(\hat{\beta}_M\) is

\[ \hat{\beta}_M - \beta = (A'\Omega^{-1}A)^{-1}A'\Omega^{-1}a - \beta = (A'\Omega^{-1}A)^{-1}A'\Omega^{-1}(A\beta + w) - \beta = (A'\Omega^{-1}A)^{-1}A'\Omega^{-1}w \]

so that

\[ E(\hat{\beta}_M - \beta) = (A'\Omega^{-1}A)^{-1}A'\Omega^{-1}E(w) = 0. \]

So the mixed regression estimator is an unbiased estimator of β, as is the pure regression estimator b = (X'X)⁻¹X'y.

(ii) Covariance matrix
The covariance matrix of \(\hat{\beta}_M\) is

\[ V(\hat{\beta}_M) = (A'\Omega^{-1}A)^{-1} = \left(\frac{1}{\sigma^2}X'X + R'\psi^{-1}R\right)^{-1}. \]
(iii) The estimator \(\hat{\beta}_M\) satisfies the stochastic linear restrictions in the sense that, with r = R\(\hat{\beta}_M\) + V,

\[ E(r) = R\,E(\hat{\beta}_M) + E(V) = R\beta + 0 = R\beta. \]

We first state a result that is used to establish the dominance of \(\hat{\beta}_M\) over b.

Result: If A₁ and A₂ are suitably defined matrices, then the difference (A₁ − A₂) is positive definite if (A₂⁻¹ − A₁⁻¹) is positive definite.

Let

\[ A_1 \equiv V(b) = \sigma^2 (X'X)^{-1}, \qquad A_2 \equiv V(\hat{\beta}_M) = \left(\frac{1}{\sigma^2}X'X + R'\psi^{-1}R\right)^{-1}; \]

then

\[ A_2^{-1} - A_1^{-1} = \frac{1}{\sigma^2}X'X + R'\psi^{-1}R - \frac{1}{\sigma^2}X'X = R'\psi^{-1}R, \]

which is a positive definite matrix. This implies that

\[ A_1 - A_2 = V(b) - V(\hat{\beta}_M) \]

is positive definite. Thus \(\hat{\beta}_M\) is more efficient than b under the criterion of covariance matrices (Loewner ordering), provided σ² is known.
Testing of hypothesis
If ψ = 0, then the distribution of r is degenerate and r becomes a fixed quantity. For the feasible version of the mixed regression estimator

\[ \hat{\beta}_f = \left(\frac{1}{s^2}X'X + R'\psi^{-1}R\right)^{-1}\left(\frac{1}{s^2}X'y + R'\psi^{-1}r\right), \]

the optimal properties of the mixed regression estimator, like linearity, unbiasedness and/or minimum variance, do not remain valid. So there can be situations where incorporating the prior information leads to a loss in efficiency, which is not a favourable situation. In such situations, the pure regression estimator is better to use. In order to know whether the use of prior information will lead to a better estimator or not, the null hypothesis H₀ : E(r) = Rβ can be tested.

For testing H₀ : E(r) = Rβ when σ² is unknown, we use the statistic

\[ F = \frac{(r - Rb)'\left[R(X'X)^{-1}R' + \psi\right]^{-1}(r - Rb)/q}{s^2} \]

where s² = (y − Xb)'(y − Xb)/(n − k), and F follows an F-distribution with q and (n − k) degrees of freedom under H₀.
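The compatibility statistic can be computed directly as displayed above; the data, names and choice of ψ below are illustrative assumptions:

```python
import numpy as np

def compatibility_F(X, y, R, r, Psi):
    """Statistic for H0: E(r) = R beta, as displayed in the text."""
    n, k = X.shape
    q = R.shape[0]
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    s2 = (y - X @ b) @ (y - X @ b) / (n - k)
    d = r - R @ b
    M = R @ XtX_inv @ R.T + Psi        # positive definite, so the form is >= 0
    return (d @ np.linalg.inv(M) @ d / q) / s2

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 2))
y = X @ np.array([0.5, 1.0]) + rng.normal(size=30)
F = compatibility_F(X, y, np.array([[1.0, 0.0]]), np.array([0.5]),
                    np.array([[4.0]]))
```

A large value of F relative to F₁₋α(q, n − k) signals that the prior information conflicts with the sample information.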
ECONOMETRIC THEORY
MODULE – VII
Lecture - 21
Multicollinearity
A basic assumption in the multiple linear regression model is that the rank of the matrix of observations on the explanatory variables equals the number of explanatory variables; in other words, the matrix has full column rank. This implies that there is no exact linear relationship among the explanatory variables; they are then said to be orthogonal.

In many practical situations, the explanatory variables may not remain independent for various reasons. The situation where the explanatory variables are highly intercorrelated is referred to as multicollinearity.

Consider the model

\[ y = X\beta + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2 I), \]

where X is n × k and β is k × 1, with k explanatory variables X₁, X₂, …, X_k and the usual assumptions, including rank(X) = k.
Assume the observations on all the Xᵢ's and y are centered and scaled to unit length, so that X'X becomes a k × k matrix of correlation coefficients among the explanatory variables and X'y becomes a k × 1 vector of correlation coefficients between the explanatory variables and the study variable.

Let X = [X₁, X₂, …, X_k], where X_j is the jth column of X containing the n observations on the jth variable. The column vectors X₁, X₂, …, X_k are linearly dependent if there exists a set of constants ℓ₁, ℓ₂, …, ℓ_k, not all zero, such that

\[ \sum_{j=1}^{k} \ell_j X_j = 0. \]

If this holds exactly for a subset of X₁, X₂, …, X_k, then rank(X'X) < k and consequently (X'X)⁻¹ does not exist. If the condition \(\sum_{j=1}^{k} \ell_j X_j = 0\) is approximately true for some subset of X₁, X₂, …, X_k, then there is a near-linear dependency in X'X. In such a case, the multicollinearity problem exists, and X'X is said to be ill-conditioned.
Sources of multicollinearity
1. Method of data collection
It is expected that the data be collected over the whole cross-section of variables, but it may happen that the data are collected over a subspace of the explanatory variables where the variables are nearly linearly dependent — for example, when sampling is done only over a limited range of the explanatory variables in the population.
5. An over-determined model
Sometimes, due to over-enthusiasm, a large number of variables is included in the model to make it more realistic, and consequently the number of observations (n) becomes smaller than the number of explanatory variables (k). Such a situation can arise in medical research, where the number of patients may be small but information is collected on a large number of variables. As another example, if there is time series data for 50 years on consumption patterns, the consumption pattern cannot be expected to remain the same over 50 years, so the better option is to use only the relevant recent observations, which can result in n < k. But reducing the problem in this way is not always advisable; for example, in microarray experiments it is not advisable to choose a smaller number of variables.
Consequences of multicollinearity
Consider the model with two centered and scaled explanatory variables,

\[ y = \beta_1 x_1 + \beta_2 x_2 + \varepsilon, \qquad E(\varepsilon) = 0, \ V(\varepsilon) = \sigma^2 I, \]

and let r denote the correlation coefficient between x₁ and x₂. Then

\[ (X'X)^{-1} = \frac{1}{1 - r^2}\begin{pmatrix} 1 & -r \\ -r & 1 \end{pmatrix} \]

\[ \Rightarrow\quad b_1 = \frac{r_{1y} - r\,r_{2y}}{1 - r^2}, \qquad b_2 = \frac{r_{2y} - r\,r_{1y}}{1 - r^2}, \]

where r₁ᵧ and r₂ᵧ are the correlations of x₁ and x₂ with y. The covariance matrix is V(b) = σ²(X'X)⁻¹, so

\[ \mathrm{Var}(b_1) = \mathrm{Var}(b_2) = \frac{\sigma^2}{1 - r^2}, \qquad \mathrm{Cov}(b_1, b_2) = -\frac{r\sigma^2}{1 - r^2}. \]

As r → ±1,

\[ \mathrm{Var}(b_1) = \mathrm{Var}(b_2) \to \infty. \]

So if the variables are nearly perfectly collinear, the variances of the OLSEs become large, indicating highly unreliable estimates. The standard errors of b₁ and b₂ rise sharply as r → ±1 and break down at r = ±1 because X'X becomes singular.

If r is close to 0, the multicollinearity does not harm, and it is termed non-harmful multicollinearity. If r is close to +1 or −1, the multicollinearity inflates the variances terribly; this is termed harmful multicollinearity.
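The inflation factor 1/(1 − r²) for the two-regressor model is easy to tabulate; the helper name below is ours:

```python
import numpy as np

def var_factor(r):
    """Var(b1)/sigma^2 = 1/(1 - r^2) in the two-regressor model above."""
    return 1.0 / (1.0 - r ** 2)

# inflation of the sampling variance as the correlation grows
factors = {r: var_factor(r) for r in (0.0, 0.5, 0.9, 0.99)}
```

At r = 0.99 the variance is roughly 50 times its value under an orthogonal design, which is the numerical face of harmful multicollinearity.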
There is no clear-cut boundary to distinguish between harmful and non-harmful multicollinearity. Generally, if r is low, the multicollinearity is considered non-harmful, and if r is high, it is considered harmful. Further consequences include the following.

2. The t-statistic for testing H₀ : β₁ = 0,

\[ t_0 = \frac{b_1}{\sqrt{\widehat{\mathrm{Var}}(b_1)}}, \]

becomes small when the standard error is inflated, so relevant variables may wrongly appear insignificant.

3. Due to large standard errors, a large confidence region may arise. For example, the confidence interval is given by \(b_1 \pm t_{\alpha/2,\,n-1}\sqrt{\widehat{\mathrm{Var}}(b_1)}\); when \(\widehat{\mathrm{Var}}(b_1)\) becomes large, the confidence interval becomes wider.

4. The OLSE may be sensitive to small changes in the values of the explanatory variables. If some observations are added or dropped, the OLSE may change considerably in magnitude as well as in sign. Ideally, the OLSE should not change much with the inclusion or deletion of a few observations. Thus the OLSE loses stability and robustness.
When there are more than two explanatory variables, say k variables X₁, X₂, …, X_k, the jth diagonal element of

C = (X'X)⁻¹

is

\[ C_{jj} = \frac{1}{1 - R_j^2} \]

where \(R_j^2\) is the coefficient of determination from the regression of X_j on the remaining (k − 1) explanatory variables.

If X_j is highly correlated with any subset of the other (k − 1) explanatory variables, then \(R_j^2\) is high and close to 1. Consequently the variance of the jth OLSE,

\[ \mathrm{Var}(b_j) = \sigma^2 C_{jj} = \frac{\sigma^2}{1 - R_j^2}, \]

becomes very high. The covariance between bᵢ and b_j will also be large if Xᵢ and X_j are involved in the linear relationship leading to multicollinearity.
The least squares estimates b_j also become too large in absolute value in the presence of multicollinearity. To see this, consider the squared distance between b and β,

\[ L^2 = (b - \beta)'(b - \beta). \]

Then

\[ E(L^2) = \sum_{j=1}^{k} E(b_j - \beta_j)^2 = \sum_{j=1}^{k} \mathrm{Var}(b_j) = \sigma^2\,\mathrm{tr}(X'X)^{-1}. \]

The trace of a matrix equals the sum of its eigenvalues. If λ₁, λ₂, …, λ_k are the eigenvalues of X'X, then 1/λ₁, 1/λ₂, …, 1/λ_k are the eigenvalues of (X'X)⁻¹, and hence

\[ E(L^2) = \sigma^2 \sum_{j=1}^{k} \frac{1}{\lambda_j}, \qquad \lambda_j > 0. \]

If X'X is ill-conditioned due to the presence of multicollinearity, then at least one of the eigenvalues is small, so the distance between b and β tends to be large. Further,

\[ E(L^2) = E(b - \beta)'(b - \beta) = E(b'b) - 2\beta' E(b) + \beta'\beta = \sigma^2\,\mathrm{tr}(X'X)^{-1} \]

\[ \Rightarrow\quad E(b'b) = \sigma^2\,\mathrm{tr}(X'X)^{-1} + \beta'\beta, \]

so b is, on average, longer than β, i.e., the estimates are too large in absolute value.

Least squares thus produces poor estimates of the parameters in the presence of multicollinearity. This does not imply that the fitted model produces bad predictions as well: if the predictions are confined to the X-space with non-harmful multicollinearity, then the predictions are satisfactory.
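The identity tr(X'X)⁻¹ = Σⱼ 1/λⱼ behind E(L²) can be checked numerically on any full-rank X; the simulated matrix below is only an illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 4))
XtX = X.T @ X
lam = np.linalg.eigvalsh(XtX)                 # eigenvalues of X'X
trace_inv = float(np.trace(np.linalg.inv(XtX)))   # tr (X'X)^{-1}
sum_recip = float(np.sum(1.0 / lam))              # sum of 1/lambda_j
```

The two quantities agree to machine precision, and either one blows up as soon as one eigenvalue approaches zero.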
ECONOMETRIC THEORY
MODULE – VII
Lecture - 22
Multicollinearity
Multicollinearity diagnostics
An important question is how to diagnose the presence of multicollinearity in the data on the basis of the given sample information. Several diagnostic measures are available, each based on a particular approach, and it is difficult to say which diagnostic is best or ultimate. Some of the popular and important diagnostics are described below.

If rank(X'X) < k, then X'X is singular and |X'X| = 0. So as |X'X| → 0, the degree of multicollinearity increases, becoming exact or perfect at |X'X| = 0. Thus |X'X| serves as a measure of multicollinearity, with |X'X| = 0 indicating perfect multicollinearity.
Limitations
This measure has the following limitations:
i. It is not bounded, as 0 < |X'X| < ∞.
ii. It is affected by the dispersion of the explanatory variables. For example, if k = 2, then

\[ |X'X| = \begin{vmatrix} \sum_{i=1}^{n} x_{1i}^2 & \sum_{i=1}^{n} x_{1i}x_{2i} \\ \sum_{i=1}^{n} x_{2i}x_{1i} & \sum_{i=1}^{n} x_{2i}^2 \end{vmatrix} = \sum_{i=1}^{n} x_{1i}^2 \sum_{i=1}^{n} x_{2i}^2 \,(1 - r_{12}^2) \]

where r₁₂ is the correlation coefficient between X₁ and X₂. So |X'X| depends on the correlation coefficient as well as the variability of the explanatory variables. If the explanatory variables have very low variability, then |X'X| may tend to zero, which would indicate the presence of multicollinearity even when that is not the case.
iii. It gives no idea about the relative effects on individual coefficients. If multicollinearity is present, it does not indicate which variable is causing it, and this is hard to determine.
When more than two explanatory variables are considered and they are involved in a near-linear dependency, it is not necessary that any of the pairwise correlation coefficients rᵢⱼ will be large. Generally, pairwise inspection of correlation coefficients is not sufficient for detecting multicollinearity in the data.

Limitation
It gives no information about the number of linear dependencies among the explanatory variables.
(ii) It is not affected by the dispersion of the explanatory variables. For example, when k = 2 and the variables are scaled to unit length,

\[ |X'X| = \begin{vmatrix} \sum_{i=1}^{n} x_{1i}^2 & \sum_{i=1}^{n} x_{1i}x_{2i} \\ \sum_{i=1}^{n} x_{1i}x_{2i} & \sum_{i=1}^{n} x_{2i}^2 \end{vmatrix} = 1 - r_{12}^2. \]
Procedure
i. Drop one of the explanatory variables among the k variables, say X₁.
ii. Run the regression of y on the remaining (k − 1) variables X₂, X₃, …, X_k.
iii. Calculate R₁², the coefficient of determination of this regression.
iv. Similarly calculate R₂², R₃², …, R_k².
v. Find R_L² = max(R₁², R₂², …, R_k²).
vi. Determine R² − R_L², where R² is the coefficient of determination from the full model.

The quantity (R² − R_L²) provides a measure of multicollinearity. If multicollinearity is present, R_L² will be high; the higher the degree of multicollinearity, the higher the value of R_L². So in the presence of multicollinearity, (R² − R_L²) will be low.

Limitations
i. It gives no information about the underlying relations among the explanatory variables, i.e., how many relationships are present or how many explanatory variables are responsible for the multicollinearity.
ii. A small value of (R² − R_L²) may also occur because of poor specification of the model, in which case multicollinearity may be wrongly inferred.
5. Variance inflation factors (VIF)
The matrix X'X becomes ill-conditioned in the presence of multicollinearity, so the diagonal elements of C = (X'X)⁻¹ help in the detection of multicollinearity. If \(R_j^2\) denotes the coefficient of determination obtained when X_j is regressed on the remaining (k − 1) variables, then the jth diagonal element of C is

\[ C_{jj} = \frac{1}{1 - R_j^2}. \]

If X_j is nearly orthogonal to the remaining explanatory variables, then \(R_j^2\) is small and C_jj is close to 1. If X_j is nearly linearly dependent on a subset of the remaining explanatory variables, then \(R_j^2\) is close to 1 and C_jj is large.

The variance of the jth OLSE is Var(b_j) = σ²C_jj, so C_jj is the factor by which the variance of b_j increases when the explanatory variables are nearly linearly dependent. Based on this concept, the variance inflation factor for the jth explanatory variable is defined as

\[ \mathrm{VIF}_j = \frac{1}{1 - R_j^2}. \]

This is the factor responsible for inflating the sampling variance. The combined effect of dependencies among the explanatory variables on the variance of a term is measured by the VIF of that term in the model. One or more large VIFs indicate the presence of multicollinearity in the data.

In practice, a VIF > 5 or 10 usually indicates that the associated regression coefficient is poorly estimated because of multicollinearity. Since the covariance matrix of the OLSE is σ²(X'X)⁻¹, VIF_j measures the part of the variance of b_j attributable to the near-linear dependence.
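A VIF computation via the auxiliary regressions can be sketched as follows; the function name and simulated data are our assumptions:

```python
import numpy as np

def vifs(X):
    """VIF_j = 1/(1 - R_j^2), with R_j^2 from regressing X_j on the others."""
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        xj = X[:, j]
        # auxiliary regression of X_j on an intercept and the other columns
        Z = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(Z, xj, rcond=None)
        resid = xj - Z @ coef
        r2 = 1.0 - resid @ resid / np.sum((xj - xj.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(4)
x1, x2 = rng.normal(size=(2, 200))
X_ok = np.column_stack([x1, x2, rng.normal(size=200)])
X_bad = np.column_stack([x1, x2, x1 + x2 + 0.01 * rng.normal(size=200)])
```

Independent columns give VIFs near 1, while the nearly dependent third column of X_bad drives its VIF into the thousands.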
Limitations
(i) It sheds no light on the number of dependencies among the explanatory variables.
(ii) The rule VIF > 5 or 10 is a rule of thumb which may differ from one situation to another.

The VIF also has an interpretation in terms of confidence intervals. The confidence interval for β_j is

\[ b_j \pm \sqrt{\hat{\sigma}^2 C_{jj}}\; t_{\alpha/2,\, n-k-1}, \]

with length

\[ L_j = 2\sqrt{\hat{\sigma}^2 C_{jj}}\; t_{\alpha/2,\, n-k-1}. \]

Now consider a situation where X is an orthogonal matrix, i.e., X'X = I so that C_jj = 1, with the same sample size and the same root mean square \(\left[\frac{1}{n}\sum_{i=1}^{n}(x_{ij} - \bar{x}_j)^2\right]^{1/2}\). Then the length of the confidence interval becomes

\[ L^{*} = 2\hat{\sigma}\, t_{\alpha/2,\, n-k-1}. \]

Consider the ratio

\[ \frac{L_j}{L^{*}} = \sqrt{C_{jj}}. \]

Thus the VIF indicates the increase in the length of the confidence interval of the jth regression coefficient due to the presence of multicollinearity.
6. Condition number and condition index
Let λ₁, λ₂, …, λ_k be the eigenvalues (or characteristic roots) of X'X. The condition number is defined as

\[ CN = \frac{\lambda_{\max}}{\lambda_{\min}}. \]

Small values of the characteristic roots indicate the presence of near-linear dependencies in the data, and the CN provides a measure of the spread in the spectrum of the characteristic roots of X'X.

The condition number is based on only two eigenvalues, λ_min and λ_max. Other measures are the condition indices, which use information on the other eigenvalues as well. The condition indices of X'X are defined as

\[ C_j = \frac{\lambda_{\max}}{\lambda_j}, \qquad j = 1, 2, \ldots, k. \]

In fact, the largest C_j equals CN. The number of condition indices that are large, say more than 1000, indicates the number of near-linear dependencies in X'X. A limitation of CN and C_j is that they are unbounded measures, as 0 < CN < ∞ and 0 < C_j < ∞.
Using the spectral decomposition X'X = VΛV', where Λ = diag(λ₁, λ₂, …, λ_k) and V is the orthogonal matrix of eigenvectors,

\[ V(b) = \sigma^2 (X'X)^{-1} = \sigma^2 (V\Lambda V')^{-1} = \sigma^2 V\Lambda^{-1}V' \]

\[ \Rightarrow\quad \mathrm{Var}(b_i) = \sigma^2\left(\frac{v_{i1}^2}{\lambda_1} + \frac{v_{i2}^2}{\lambda_2} + \cdots + \frac{v_{ik}^2}{\lambda_k}\right) \]

where vᵢ₁, vᵢ₂, …, vᵢ_k are the elements of the ith row of V, and the condition indices are

\[ C_j = \frac{\lambda_{\max}}{\lambda_j}, \qquad j = 1, 2, \ldots, k. \]
Procedure
i. Find the condition indices C₁, C₂, …, C_k.
ii. (a) Identify those λ_j's for which C_j is greater than the danger level 1000. (b) This gives the number of linear dependencies. (c) Do not consider those C_j's which are below the danger level.
iii. For the λ's with condition index above the danger level, choose one such eigenvalue, say λ_j.
iv. Find the proportion of variance corresponding to λ_j in Var(b₁), Var(b₂), …, Var(b_k) as

\[ p_{ij} = \frac{v_{ij}^2/\lambda_j}{\sum_{j=1}^{k} \left(v_{ij}^2/\lambda_j\right)} = \frac{v_{ij}^2/\lambda_j}{\mathrm{VIF}_i}. \]

Note that \(v_{ij}^2/\lambda_j\) can be read off from the expression

\[ \mathrm{Var}(b_i) = \sigma^2\left(\frac{v_{i1}^2}{\lambda_1} + \frac{v_{i2}^2}{\lambda_2} + \cdots + \frac{v_{ik}^2}{\lambda_k}\right), \]

i.e., it is the term corresponding to the jth factor.

The proportion of variance p_ij provides a measure of multicollinearity. If p_ij > 0.5, it indicates that bᵢ is adversely affected by the multicollinearity, i.e., the estimate of βᵢ is influenced by the presence of multicollinearity.

This is a good diagnostic tool in the sense that it tells about the presence of harmful multicollinearity and also indicates the number of linear dependencies responsible for it. This diagnostic is better than the other diagnostics discussed above.
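The eigen-based procedure above can be sketched compactly; the danger threshold and the simulated collinear data are illustrative assumptions:

```python
import numpy as np

def condition_diagnostics(X):
    """Condition indices C_j = lmax/lambda_j and variance-decomposition
    proportions p[i, j] = (v_ij^2/lambda_j) / sum_j (v_ij^2/lambda_j)."""
    lam, V = np.linalg.eigh(X.T @ X)
    C = lam.max() / lam
    phi = V ** 2 / lam                 # phi[i, j] = v_ij^2 / lambda_j
    p = phi / phi.sum(axis=1, keepdims=True)
    return C, p

rng = np.random.default_rng(5)
x1 = rng.normal(size=100)
X = np.column_stack([x1,
                     x1 + 1e-4 * rng.normal(size=100),   # near-duplicate column
                     rng.normal(size=100)])
C, p = condition_diagnostics(X)
```

Here one condition index is far above the danger level (one near-linear dependency), and each row of p sums to one, so a p_ij > 0.5 flags the coefficients hurt by it.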
The condition indices can also be obtained from the singular value decomposition of the X matrix:

\[ X = UDV' \]

where D = diag(μ₁, μ₂, …, μ_k) contains the singular values μ_j of X. Since X'X = VD²V', we have μ_j² = λ_j, j = 1, 2, …, k. With \(\mu_j = \sqrt{\lambda_j}\),

\[ \mathrm{Var}(b_j) = \sigma^2 \sum_{i=1}^{k} \frac{v_{ji}^2}{\mu_i^2}, \qquad \mathrm{VIF}_j = \sum_{i=1}^{k} \frac{v_{ji}^2}{\mu_i^2}, \qquad p_{ij} = \frac{v_{ji}^2/\mu_i^2}{\mathrm{VIF}_j}. \]
The ill-conditioning in X is reflected in the size of the singular values: there will be one small singular value for each near-linear dependency. The extent of ill-conditioning is described by how small μ_j is relative to μ_max.

It is suggested that the explanatory variables should be scaled to unit length but not centered when computing p_ij; this helps in diagnosing the role of the intercept term in near-linear dependence. No unique guidance is available in the literature on the issue of centering the explanatory variables. Centering makes the intercept orthogonal to the explanatory variables, so it may remove ill-conditioning due to the intercept term in the model.
ECONOMETRIC THEORY
MODULE – VII
Lecture - 23
Multicollinearity
If possible, identify the variables which seem to be causing multicollinearity. These collinear variables can be dropped so as to restore the full-rank condition on the X matrix. The omission of variables may be carried out on the basis of some ordering of the explanatory variables; e.g., the variables with smaller t-ratios can be deleted first.

If the experimenter is not interested in all the parameters, then estimators of the parameters of interest with smaller mean squared errors than the variance of the OLSE of the full vector can be obtained by dropping some variables. However, eliminating variables may reduce the predictive power of the model, and there is no assurance that the reduced model will exhibit less multicollinearity.

The relevance and correctness of prior information plays an important role in such analysis, but it is difficult to ensure in practice. For example, estimates derived in the U.K. may not be valid in India.
4. Employ generalized inverse
If rank(X'X) < k, then a generalized inverse can be used in place of the inverse of X'X, and β can be estimated by

\[ \hat{\beta} = (X'X)^{-}X'y. \]

In this case the estimates are not unique, except when the Moore-Penrose inverse of X'X is used. Different methods of finding a generalized inverse may give different results, so applied workers will get different results. Moreover, it is not known which method of finding the generalized inverse is optimal.
5. Principal components regression
The explanatory variables are transformed into a new set of orthogonal variables called principal components. Usually this technique is used for reducing the dimensionality of the data while retaining as much of the variability of the explanatory variables as possible. The principal components are a set of linear combinations of the explanatory variables that together retain the total variability of the system and are mutually orthogonal. The principal components so obtained are ranked in order of importance, judged in terms of the variability explained by a component relative to the total variability in the system. The procedure then eliminates those principal components which contribute relatively little variation.

After elimination of the least important principal components, the study variable is regressed on the set of retained principal components using ordinary least squares. Since the principal components are orthogonal, OLS can be applied without any problem. Once the estimates of the regression coefficients for the reduced set of orthogonal variables (principal components) have been obtained, they are transformed back into a set of estimated regression coefficients corresponding to the original, correlated set of variables. These are the principal components estimators of the regression coefficients.

Suppose there are k explanatory variables X₁, X₂, …, X_k. Consider linear functions of X₁, X₂, …, X_k like

\[ Z_1 = \sum_{i=1}^{k} a_i X_i, \qquad Z_2 = \sum_{i=1}^{k} b_i X_i, \quad \text{etc.} \]

The constants a₁, a₂, …, a_k are determined such that the variance of Z₁ is maximized subject to the normalizing condition \(\sum_{i=1}^{k} a_i^2 = 1\). The constants b₁, b₂, …, b_k are determined such that the variance of Z₂ is maximized subject to the normalizing condition \(\sum_{i=1}^{k} b_i^2 = 1\) and to Z₂ being uncorrelated with the first principal component.
We continue this process to obtain k such linear combinations, each orthogonal to the preceding ones and satisfying the normalizing condition, and we obtain their variances. Suppose the linear combinations are Z₁, Z₂, …, Z_k with Var(Z₁) > Var(Z₂) > … > Var(Z_k). The linear combination having the largest variance is the first principal component, the one having the second largest variance is the second principal component, and so on.

The problem of multicollinearity arises because X₁, X₂, …, X_k are not independent. Since the principal components based on X₁, X₂, …, X_k are mutually orthogonal, they can be used as explanatory variables, and such a regression combats the multicollinearity.

Let λ₁, λ₂, …, λ_k be the eigenvalues of X'X, Λ = diag(λ₁, λ₂, …, λ_k) the k × k diagonal matrix, and V the k × k orthogonal matrix whose columns are the eigenvectors associated with λ₁, λ₂, …, λ_k. Then

\[ y = X\beta + \varepsilon = XVV'\beta + \varepsilon = Z\alpha + \varepsilon \]

where Z = XV and α = V'β. The columns of Z = (Z₁, Z₂, …, Z_k) define a new set of explanatory variables, which are called the principal components.
The OLSE of α is

\[ \hat{\alpha} = (Z'Z)^{-1}Z'y = \Lambda^{-1}Z'y \]

with covariance matrix

\[ V(\hat{\alpha}) = \sigma^2 (Z'Z)^{-1} = \sigma^2 \Lambda^{-1} = \sigma^2\, \mathrm{diag}\!\left(\frac{1}{\lambda_1}, \frac{1}{\lambda_2}, \ldots, \frac{1}{\lambda_k}\right). \]

Note that Z'Z = V'X'XV = Λ, so λ_j is the variance of the jth principal component. A small eigenvalue of X'X means that a near-linear relationship among the original explanatory variables exists and that the variance of the corresponding orthogonal regression coefficient is large, which indicates multicollinearity. If one or more of the λ_j are small, multicollinearity is present.
The new set of variables, i.e., the principal components, are orthogonal and retain the same total variability as the original set. If multicollinearity is severe, there will be at least one small eigenvalue. The elimination of one or more principal components associated with the smallest eigenvalues will reduce the total variance in the model; moreover, the components responsible for creating the multicollinearity will be removed, and the resulting model will be appreciably improved.

The principal component matrix Z = [Z₁, Z₂, …, Z_k] contains exactly the same information as the original data in X, in the sense that the total variability in X and Z is the same. The difference is that the original data are rearranged into a set of new variables which are uncorrelated with each other and can be ranked according to their eigenvalues. The idea is that the principal components with the smallest eigenvalues contribute least to the total variability, so one may retain the first (k − r) components (with the λᵢ arranged in decreasing order) chosen, for example, so that

\[ \frac{\sum_{i=1}^{k-r} \lambda_i}{\sum_{i=1}^{k} \lambda_i} > 0.90. \]

Various strategies to choose the required number of principal components are also available in the literature.
Suppose that, after using such a rule, r principal components are eliminated and only (k − r) components are used for the regression. Partition Z as

\[ Z = (Z_r \;\; Z_{k-r}) = X(V_r \;\; V_{k-r}) \]

where the submatrix Z_r, of order n × r, contains the principal components to be eliminated, and the submatrix Z_{k−r}, of order n × (k − r), contains the principal components to be retained.

The reduced model obtained after the elimination of the r principal components can be expressed as

\[ y = Z_{k-r}\alpha_{k-r} + \varepsilon^{*}. \]

The random error component is written as ε* just to distinguish it from ε. The reduced coefficient vector contains the coefficients associated with the retained Z_j's, so

\[ Z_{k-r} = (Z_1, Z_2, \ldots, Z_{k-r}), \qquad \alpha_{k-r} = (\alpha_1, \alpha_2, \ldots, \alpha_{k-r})', \qquad V_{k-r} = (V_1, V_2, \ldots, V_{k-r}). \]

Using OLS on the model with the retained principal components, the OLSE of α_{k−r} is

\[ \hat{\alpha}_{k-r} = (Z_{k-r}'Z_{k-r})^{-1}Z_{k-r}'y. \]

It is then transformed back to the original explanatory variables as follows: since α = V'β and α_{k−r} = V_{k−r}'β, the principal components estimator of β is

\[ \hat{\beta}_{pc} = V_{k-r}\hat{\alpha}_{k-r}. \]
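The whole procedure can be sketched compactly in numpy; the number of retained components and the simulated data below are illustrative assumptions:

```python
import numpy as np

def pcr(X, y, n_keep):
    """Principal components regression: regress y on the n_keep components
    with the largest eigenvalues of X'X, then map back via beta = V alpha."""
    lam, V = np.linalg.eigh(X.T @ X)        # eigenvalues in ascending order
    Vk = V[:, -n_keep:]                     # eigenvectors of the largest ones
    Z = X @ Vk                              # retained principal components
    alpha = np.linalg.solve(Z.T @ Z, Z.T @ y)
    return Vk @ alpha                       # beta_pc = V_{k-r} alpha_hat

rng = np.random.default_rng(6)
X = rng.normal(size=(60, 3))
y = X @ np.array([1.0, -1.0, 0.5]) + rng.normal(size=60)
b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
b_pc = pcr(X, y, 2)                         # drop the smallest component
```

Retaining all k components reproduces OLS exactly; dropping the components with small eigenvalues trades a little bias for a large reduction in variance.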
MODULE – VII
Lecture - 24
Multicollinearity
6. Ridge regression
The OLSE is the best linear unbiased estimator of the regression coefficients in the sense that it has minimum variance in the class of linear unbiased estimators. However, if the condition of unbiasedness is relaxed, it is possible to find a biased estimator of the regression coefficients, say β̂, that has smaller variance than the unbiased OLSE b. The mean squared error (MSE) of β̂ is

\[ \mathrm{MSE}(\hat{\beta}) = E(\hat{\beta} - \beta)^2 = E\{\hat{\beta} - E(\hat{\beta})\}^2 + \{E(\hat{\beta}) - \beta\}^2 = \mathrm{Var}(\hat{\beta}) + \left[\mathrm{Bias}(\hat{\beta})\right]^2. \]

The ridge regression estimator is obtained by adding a constant to the diagonal elements of X'X:

\[ \hat{\beta}_{ridge} = (X'X + \delta I)^{-1}X'y, \]

where \(\hat{\beta}_{ridge}\) is the ridge regression estimator of β and δ ≥ 0 is a characterizing scalar termed the biasing parameter. As δ → 0, \(\hat{\beta}_{ridge}\) approaches the OLSE b, and as δ → ∞, it shrinks towards zero; so the larger the value of δ, the larger the shrinkage towards zero. Note that the OLSE is inappropriate to use in the sense that it has very high variance when multicollinearity is present in the data. On the other hand, a very small value of β̂ may tend to favour accepting the null hypothesis H₀ : β = 0, indicating that the corresponding variables are not relevant. The value of the biasing parameter controls the amount of shrinkage in the estimates.
Bias
The bias of \(\hat{\beta}_{ridge}\) is

\[ \mathrm{Bias}(\hat{\beta}_{ridge}) = E(\hat{\beta}_{ridge}) - \beta = (X'X + \delta I)^{-1}X'E(y) - \beta = \left[(X'X + \delta I)^{-1}X'X - I\right]\beta = (X'X + \delta I)^{-1}\left[X'X - X'X - \delta I\right]\beta = -\delta (X'X + \delta I)^{-1}\beta. \]

Covariance matrix
The covariance matrix of \(\hat{\beta}_{ridge}\) is

\[ V(\hat{\beta}_{ridge}) = E\left[\{\hat{\beta}_{ridge} - E(\hat{\beta}_{ridge})\}\{\hat{\beta}_{ridge} - E(\hat{\beta}_{ridge})\}'\right]. \]

Since

\[ \hat{\beta}_{ridge} - E(\hat{\beta}_{ridge}) = (X'X + \delta I)^{-1}X'y - (X'X + \delta I)^{-1}X'X\beta = (X'X + \delta I)^{-1}X'(y - X\beta) = (X'X + \delta I)^{-1}X'\varepsilon, \]

we get

\[ V(\hat{\beta}_{ridge}) = (X'X + \delta I)^{-1}X'V(\varepsilon)X(X'X + \delta I)^{-1} = \sigma^2 (X'X + \delta I)^{-1}X'X(X'X + \delta I)^{-1}. \]
Writing X'X = Λ = diag(λ₁, λ₂, …, λ_k), the mean squared error of \(\hat{\beta}_{ridge}\) is

\[ \mathrm{MSE}(\hat{\beta}_{ridge}) = \mathrm{tr}\,V(\hat{\beta}_{ridge}) + \left[\mathrm{Bias}(\hat{\beta}_{ridge})\right]'\left[\mathrm{Bias}(\hat{\beta}_{ridge})\right] = \sigma^2 \sum_{j=1}^{k} \frac{\lambda_j}{(\lambda_j + \delta)^2} + \delta^2 \sum_{j=1}^{k} \frac{\beta_j^2}{(\lambda_j + \delta)^2}. \]

Thus as δ increases, the bias in \(\hat{\beta}_{ridge}\) increases but its variance decreases. The trade-off between bias and variance hinges upon the value of δ. It can be shown that there exists a value of δ such that \(\mathrm{MSE}(\hat{\beta}_{ridge}) < \mathrm{Var}(b)\), provided β'β is bounded.
Choice of δ
The ridge regression estimator depends upon the value of δ. Various approaches have been suggested in the literature to determine δ; its value can be chosen on the basis of criteria like
- stability of the estimators with respect to δ,
- reasonable signs,
- magnitude of the residual sum of squares, etc.
We consider here the determination of δ by inspection of the ridge trace.
Ridge trace
If multicollinearity is present and severe, the instability of the regression coefficients is reflected in the ridge trace. As δ increases, some of the ridge estimates vary dramatically and then stabilize at some value of δ. The objective in the ridge trace is to inspect the trace (curve) and find the reasonably small value of δ at which the ridge regression estimators are stable. The ridge regression estimator with such a choice of δ will have smaller MSE than the variance of the OLSE.

As an example of a ridge trace for a model with 6 parameters, \(\hat{\beta}_{ridge}\) is evaluated for various choices of δ, and the corresponding values of all the regression coefficients \(\hat{\beta}_{j(ridge)}\), j = 1, 2, …, 6, are plotted against δ. These values are denoted by different symbols and joined by a smooth curve, producing a ridge trace for each parameter. Now choose the value of δ where all the curves stabilize and become nearly parallel. For example, if the curves become nearly parallel starting from δ = δ₄ or so, then one possible choice is δ = δ₄, and the parameters can be estimated as

\[ \hat{\beta}_{ridge} = (X'X + \delta_4 I)^{-1}X'y. \]
Such a figure exposes the presence of multicollinearity in the data: the behaviour of \(\hat{\beta}_{i(ridge)}\) at δ₀ ≈ 0 is very different from that at other values of δ. For small values of δ, the estimates change rapidly and then stabilize gradually as δ increases. The value of δ at which all the estimates stabilize gives the desired value, because moving away from such δ will not bring any appreciable reduction in the residual sum of squares. If multicollinearity is present, the variation in the ridge regression estimators is rapid around δ₀ ≈ 0. The optimal δ is chosen such that, beyond it, almost all the traces stabilize.
Limitations
1. The choice of δ is data dependent and is therefore a random variable. Using it as a random variable violates the assumption that δ is a constant, which disturbs the optimal properties derived under the assumption of constancy of δ.
2. The value of δ lies in the interval (0, ∞), so a large number of values must be explored. This is time consuming, although it is not a big issue when working with software.
3. The choice of δ from a graphical display may not be unique; different people may choose different δ and consequently obtain different values of the ridge regression estimators. However, δ is chosen so that the estimators of all the coefficients stabilize, so a small variation in the choice of δ may not produce much change in the ridge estimators of the coefficients. Another choice of δ is

δ = k σ̂² / (b'b)

where k is the number of parameters and b is the OLSE of β.
The problem of multicollinearity arises because some of the eigenvalues of X'X are close to zero or are zero. So if λ₁, λ₂, …, λ_p are the characteristic roots, with Λ = diag(λ₁, λ₂, …, λ_p) in the canonical form, then

β̂_ridge = (Λ + δI)⁻¹ X'y = (I + δΛ⁻¹)⁻¹ b

with components

β̂_i(ridge) = (1/(λ_i + δ)) x_i'y = (λ_i/(λ_i + δ)) b_i.

So a small quantity δ is added to λ_i, so that even if λ_i = 0, the term 1/(λ_i + δ) remains meaningful.
S(β) = (y − Xβ)'(y − Xβ) + δ(β'β − C)

where δ is the Lagrangian multiplier. Differentiating S(β) with respect to β, the normal equations are obtained as

∂S(β)/∂β = 0 ⇒ −2X'y + 2X'Xβ + 2δβ = 0
⇒ β̂_ridge = (X'X + δI)⁻¹ X'y.
Note that if δ is very small, it may indicate that most of the regression coefficients are close to zero, and if δ is large, it may indicate that the regression coefficients are away from zero. So δ puts a sort of penalty on the regression coefficients to enable their estimation.
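The shrinkage behaviour described above can be sketched numerically. The following is an illustrative sketch (not from the lecture) that computes β̂_ridge = (X'X + δI)⁻¹X'y on simulated, nearly collinear data for a grid of δ values, as one would do when drawing a ridge trace; all data and parameter values are made up for illustration.

```python
import numpy as np

# Simulated data with two nearly collinear regressors (illustrative).
rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
X = np.column_stack([x1,
                     x1 + 0.01 * rng.normal(size=n),  # nearly collinear with x1
                     rng.normal(size=n)])
beta = np.array([1.0, 2.0, -1.0])
y = X @ beta + rng.normal(scale=0.5, size=n)

def ridge(X, y, delta):
    """Ridge estimator (X'X + delta*I)^{-1} X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + delta * np.eye(p), X.T @ y)

# Evaluate the estimator over a grid of delta values (a crude ridge trace).
deltas = [0.0, 0.01, 0.1, 1.0, 10.0]
trace = np.array([ridge(X, y, d) for d in deltas])
norms = np.linalg.norm(trace, axis=1)
# As delta grows, the coefficient vector shrinks toward zero: the bias
# increases while the variance decreases, as described above.
```

In practice one would plot each column of `trace` against δ and pick the smallest δ at which the curves become stable and nearly parallel.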
ECONOMETRIC THEORY
MODULE – VIII
Lecture - 25
Heteroskedasticity
Dr. Shalabh
Department of Mathematics and Statistics
Indian Institute of Technology Kanpur
In many situations, this assumption may not be plausible, and the variances may not remain the same. Disturbances whose variances are not constant across the observations are called heteroskedastic disturbances, and this property is termed heteroskedasticity. In this case,

Var(ε_i) = σ_i², i = 1, 2, …, n,

and the disturbances are pairwise uncorrelated, so that

V(ε) = diag(σ₁², σ₂², …, σ_n²).
Examples
Suppose in a simple linear regression model, x denotes the income and y denotes the expenditure on food. It is observed that as the income increases, the variation in expenditure on food increases, because the choice and variety in food increase, in general, up to a certain extent. So the variance of observations on y will not remain constant as income changes. The assumption of homoskedasticity implies that the consumption pattern of food will remain the same irrespective of the income of the person. This may not generally be a correct assumption in real situations. Rather, the consumption pattern changes, and hence the variance of y, and so the variances of the disturbances, will not remain constant.
In another example, suppose in a simple linear regression model, x denotes the number of hours of typing practice and y denotes the number of typing errors per page. It is expected that the number of typing mistakes per page decreases as the person practices more. The assumption of homoskedastic disturbances implies that the number of errors per page will remain the same irrespective of the number of hours of typing practice, which may not be true in practice.
Possible reasons for heteroskedasticity
There are various reasons why heteroskedasticity is introduced into the data. Some of them are as follows:
1. The nature of phenomenon under study may have an increasing or decreasing trend. For example, the variation in
consumption pattern on food increases as income increases. Similarly the variation in number of typing mistakes
decreases as the number of hours of typing practice increases.
2. Sometimes the observations are in the form of averages and this introduces the heteroskedasticity in the model. For
example, it is easier to collect data on the expenditure on clothes for the whole family rather than on a particular
family member. Suppose in a simple linear regression model
y_ij denotes the expenditure on clothes for the j-th family having m_j members, and x_ij denotes the age of the i-th person in the j-th family. It is difficult to record data for each individual family member, but it is easier to get data for the whole family. So the y_ij's are known collectively.

Then, instead of per-member expenditure, we use the average expenditure per family member,

ȳ_j = (1/m_j) Σ_{i=1}^{m_j} y_ij.

If the individual disturbances have a common variance σ², the disturbance attached to such an average has variance σ²/m_j, which varies with the family size, and so heteroskedasticity is introduced.
3. Sometimes theoretical considerations introduce heteroskedasticity in the data. For example, suppose in the simple linear model
y_i = β₀ + β₁x_i + ε_i, i = 1, 2, …, n,
yi denotes the yield of rice and xi denotes the quantity of fertilizer in an agricultural experiment. It is observed that
when the quantity of fertilizer increases, then yield increases. In fact, initially the yield increases when quantity of
fertilizer increases. Gradually, the rate of increase slows down and if fertilizer is increased further, the crop burns. So
notice that β₁ changes with different levels of fertilizer. In such cases, when β₁ changes, a possible way is to express it as a random variable with constant mean β₁ and constant variance θ²:

β_{1i} = β₁ + v_i, with E(v_i) = 0, Var(v_i) = θ², E(ε_i v_i) = 0.

Then
y_i = β₀ + β_{1i}x_i + ε_i
    = β₀ + β₁x_i + (ε_i + x_i v_i)
    = β₀ + β₁x_i + w_i

where w_i = ε_i + x_i v_i is like a new random error component.
So
E(w_i) = 0,
Var(w_i) = E(w_i²) = E(ε_i²) + x_i²E(v_i²) + 2x_iE(ε_i v_i) = σ² + x_i²θ² + 0 = σ² + x_i²θ².
So the variance depends on i, and thus heteroskedasticity is introduced into the model. Note that we assume homoskedastic disturbances ε_i for the model
y_i = β₀ + β_{1i}x_i + ε_i, β_{1i} = β₁ + v_i,
but finally end up with heteroskedastic disturbances. This is due to theoretical considerations.
4. The skewness in the distribution of one or more explanatory variables in the model also causes
heteroskedasticity in the model.
5. The incorrect data transformations and incorrect functional form of the model can also give rise to the heteroskedasticity
problem.
6. Sometimes the study variable may have a distribution, such as the binomial or Poisson, in which the variance is a function of the mean. In such cases, when the mean varies, the variance also varies.
The presence of heteroskedasticity affects the estimation and the tests of hypotheses. Heteroskedasticity can enter the data for various reasons. The tests for heteroskedasticity assume a specific nature of heteroskedasticity. Various tests are available in the literature for testing the presence of heteroskedasticity, e.g.,
1. Bartlett test
4. Glesjer test
6. White test
7. Ramsey test
9. Szroeter test
Bartlett’s test
It is a test of the null hypothesis
H₀: σ₁² = σ₂² = … = σ_i² = … = σ_n².
This hypothesis is termed the hypothesis of homoskedasticity. The test can be used only when replicated data are available.
y_i* = X_i β + ε_i*

where y_i* is an m_i × 1 vector, X_i is an m_i × k matrix, β is a k × 1 vector and ε_i* is an m_i × 1 vector. So replicated data are now available for every y_i*, and a pooled estimate of the variance is

s² = Σ_{i=1}^n (m_i − k)s_i² / Σ_{i=1}^n (m_i − k).
Now apply Bartlett's test as

χ² = (1/C) Σ_{i=1}^n (m_i − k) log(s²/s_i²)

where

C = 1 + [1/(3(n − 1))] [ Σ_{i=1}^n 1/(m_i − k) − 1/Σ_{i=1}^n (m_i − k) ].
where ȳ_i and s_i² are the sample mean and sample variance, respectively, of the i-th sample:

s_i² = (1/n_i) Σ_{j=1}^{n_i} (y_ij − ȳ_i)², i = 1, 2, …, m; j = 1, 2, …, n_i,
s² = (1/n) Σ_{i=1}^m n_i s_i²,
n = Σ_{i=1}^m n_i.
To obtain an unbiased test and a modification of −2 ln u that is a closer approximation to χ²_{m−1} under H₀, the Bartlett test replaces n_i by (n_i − 1) and divides by a scalar constant:

M = [ (n − m) log σ̂² − Σ_{i=1}^m (n_i − 1) log σ̂_i² ] / [ 1 + (1/(3(m − 1))) ( Σ_{i=1}^m 1/(n_i − 1) − 1/(n − m) ) ]

where

σ̂_i² = (1/(n_i − 1)) Σ_{j=1}^{n_i} (y_ij − ȳ_i)²,  σ̂² = (1/(n − m)) Σ_{i=1}^m (n_i − 1)σ̂_i².
In experimental sciences, it is easy to get replicated data, and this test can be applied easily. In real-life applications, it is difficult to get replicated data, and then this test may not be applicable. This difficulty is overcome in the Breusch-Pagan test.
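As a rough illustration of the modified Bartlett statistic M above, the following sketch computes M on synthetic replicated data; the group sizes, scales and variable names are illustrative assumptions, not from the lecture.

```python
import numpy as np

def bartlett_M(groups):
    """Modified Bartlett statistic M, approximately chi-square(m-1) under H0."""
    m = len(groups)
    ni = np.array([len(g) for g in groups])
    n = ni.sum()
    s2i = np.array([np.var(g, ddof=1) for g in groups])   # sigma_i^2 hat
    s2 = np.sum((ni - 1) * s2i) / (n - m)                 # pooled sigma^2 hat
    num = (n - m) * np.log(s2) - np.sum((ni - 1) * np.log(s2i))
    C = 1 + (np.sum(1.0 / (ni - 1)) - 1.0 / (n - m)) / (3 * (m - 1))
    return num / C

rng = np.random.default_rng(1)
# Four groups with equal variances (H0 true) vs. very unequal variances.
equal = [rng.normal(scale=1.0, size=30) for _ in range(4)]
unequal = [rng.normal(scale=s, size=30) for s in (0.5, 1.0, 2.0, 4.0)]
M_equal = bartlett_M(equal)
M_unequal = bartlett_M(unequal)
# M is small under equal variances and large under unequal variances.
```

Under H₀, M would be compared with a χ² critical value with m − 1 degrees of freedom.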
ECONOMETRIC THEORY
MODULE – VIII
Lecture - 26
Heteroskedasticity
σ i2 = γ 1 + γ 2 Z i 2 + ... + γ p Z ip .
H 0 : γ 2= γ 3= ...= γ p= 0.
If H0 is accepted , it implies that Z i 2 , Z i 3 ,..., Z ip do not have any effect on σ i2 and we get σ i2 = γ 1 .
The test procedure is as follows:
1. Ignoring heteroskedasticity, apply OLS to
yi = β1 + β 2 X i1 + ... + β k X ik + ε i
and obtain the residuals
e = y − Xb, b = (X'X)⁻¹X'y.
2. Construct the variables
g_i = e_i² / (SS_res/n) = n e_i²/SS_res
where SS_res = Σ_{i=1}^n e_i² is the residual sum of squares based on the e_i's.
3. Run the regression of g on Z₁, Z₂, …, Z_p and get the residual sum of squares SS*_res.
4. Compute the test statistic
Q = (1/2) ( Σ_{i=1}^n g_i² − SS*_res ),
which is asymptotically distributed as χ² with (p − 1) degrees of freedom.
5. The decision rule is to reject H₀ if Q > χ²_{1−α}(p − 1).
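The steps above can be sketched as follows on synthetic data. Note that, for the sketch, the statistic is computed in the common "half the explained sum of squares" form of the LM statistic for the regression of g on Z, which is the standard way this test is implemented; all data and parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x = rng.uniform(1, 5, size=n)
eps = rng.normal(scale=0.5 * x)          # disturbance variance grows with x
y = 1.0 + 2.0 * x + eps

# Step 1: OLS ignoring heteroskedasticity.
X = np.column_stack([np.ones(n), x])
b = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ b

# Step 2: g_i = e_i^2 / (SS_res / n); by construction the g_i average to 1.
ss_res = e @ e
g = n * e**2 / ss_res

# Steps 3-4: regress g on Z (here Z = [1, x]) and take half the explained SS.
Z = np.column_stack([np.ones(n), x])
ghat = Z @ np.linalg.lstsq(Z, g, rcond=None)[0]
Q = 0.5 * np.sum((ghat - g.mean())**2)   # ~ chi-square(p - 1) under H0
```

With p = 2 here, Q would be compared with the χ²(1) critical value (3.84 at the 5% level); the simulated heteroskedasticity makes Q large.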
heteroskedasticity in the model. Let the j-th explanatory variable explain the heteroskedasticity, so that
σ_i² ∝ X_ij or σ_i² = σ²X_ij.
3. Run two separate regressions on the two parts using OLS and obtain the residual sums of squares SS_res1 and SS_res2.
4. The test statistic is
F₀ = SS_res2 / SS_res1,
which follows an F-distribution with ((n − c)/2 − k, (n − c)/2 − k) degrees of freedom when H₀ is true.
5. The decision rule is to reject H₀ whenever F₀ > F_{1−α}((n − c)/2 − k, (n − c)/2 − k).
5
This is a simple test, but it is based on the assumption that only one of the explanatory variables helps in determining the heteroskedasticity.

One difficulty in this test is that the choice of c is not obvious. If a large value of c is chosen, it reduces the degrees of freedom (n − c)/2 − k, and the condition (n − c)/2 > k may be violated.

On the other hand, if a smaller value of c is chosen, the test may fail to reveal the heteroskedasticity. The basic objective of ordering the observations and deleting the observations in the middle part may then not reveal the heteroskedasticity effect. Since the first and last values of σ_i² give the maximum discretion, deleting too few middle observations may not give a proper idea of the heteroskedasticity. Keeping these two points in view, the working choice of c is suggested as c = n/3.

Moreover, the choice of X_ij is also difficult. Since σ_i² ∝ X_ij, if all important variables are included in the model, it may be difficult to decide which variable is influencing the heteroskedasticity.
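A minimal sketch of this test on synthetic data, assuming k = 2 parameters and the working choice c = n/3; all names and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 90
x = np.sort(rng.uniform(1, 10, size=n))          # observations ordered by x
y = 2.0 + 1.5 * x + rng.normal(scale=0.3 * x)    # sigma_i grows with x

c = n // 3                        # working choice c = n/3
m = (n - c) // 2                  # observations in each remaining part
lo, hi = slice(0, m), slice(n - m, n)

def rss(xs, ys):
    """Residual sum of squares from a simple OLS fit."""
    X = np.column_stack([np.ones(len(xs)), xs])
    b = np.linalg.lstsq(X, ys, rcond=None)[0]
    r = ys - X @ b
    return r @ r

F0 = rss(x[hi], y[hi]) / rss(x[lo], y[lo])   # SS_res2 / SS_res1
# F0 is compared with F((n-c)/2 - k, (n-c)/2 - k); here k = 2.
```

Under the simulated increasing variance, F0 comes out well above one.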
4. Glesjer test
This test is based on the assumption that σ_i² is influenced by one variable Z, i.e., there is only one variable influencing the heteroskedasticity. This variable could be either one of the explanatory variables, or it can be chosen from some extraneous source.
The test has only an asymptotic justification, and the four choices of h generally give satisfactory results.
Writing e = Hε with H = I − X(X'X)⁻¹X' = (h_ij), and letting ı_i denote the i-th column of the identity matrix,

e_i² = ı_i'ee'ı_i = ı_i'Hεε'Hı_i
E(e_i²) = ı_i'H E(εε') H ı_i = ı_i'HΩHı_i

where Ω = diag(σ₁², σ₂², …, σ_n²), so that

E(e_i²) = Σ_{j=1}^n h_ij² σ_j².
In the presence of heteroskedasticity, use generalized least squares estimation. The generalized least squares estimator (GLSE) of β is

β̂ = (X'Ω⁻¹X)⁻¹ X'Ω⁻¹y
   = (X'Ω⁻¹X)⁻¹ X'Ω⁻¹(Xβ + ε).

Thus
V(β̂) = E(β̂ − β)(β̂ − β)'
      = (X'Ω⁻¹X)⁻¹ X'Ω⁻¹ E(εε') Ω⁻¹X (X'Ω⁻¹X)⁻¹
      = (X'Ω⁻¹X)⁻¹ X'Ω⁻¹ Ω Ω⁻¹X (X'Ω⁻¹X)⁻¹
      = (X'Ω⁻¹X)⁻¹.
For the simple linear regression model, the variances of the OLSE b and the GLSE β̂ of the slope parameter are

Var(b) = Σ_{i=1}^n (x_i − x̄)²σ_i² / [Σ_{i=1}^n (x_i − x̄)²]²

Var(β̂) = 1 / Σ_{i=1}^n (x_i − x̄)²/σ_i².

Then

Var(β̂)/Var(b) = [Σ_{i=1}^n (x_i − x̄)²]² / { Σ_{i=1}^n (x_i − x̄)²σ_i² · Σ_{i=1}^n (x_i − x̄)²/σ_i² }
= square of the correlation coefficient between σ_i(x_i − x̄) and (x_i − x̄)/σ_i
≤ 1
⇒ Var(β̂) ≤ Var(b).

So the relative efficiency of the OLSE and the GLSE depends upon the correlation coefficient between (x_i − x̄)σ_i and (x_i − x̄)/σ_i.
The generalized least squares estimation assumes that Ω is known, i.e., the nature of heteroskedasticity is completely specified. Based on this assumption, two cases arise:
- Ω is completely specified, or
- Ω is not completely specified.

Case 1: Ω is completely specified
Let ε_i* = ε_i/σ_i; then E(ε_i*) = 0 and Var(ε_i*) = σ_i²/σ_i² = 1.

Now OLS can be applied to this deflated model, and the usual tools for drawing statistical inferences can be used.
Note that when the model is deflated, the intercept term is lost, as β₁/σ_i is itself a variable. This point has to be taken care of in software output.
Case 2: Ω may not be completely specified
Suppose it is known that
σ_i² ∝ X_ij^{2λ}, or σ_i² = σ²X_ij^{2λ},
but σ² is not available. Consider the model
y_i = β₁ + β₂X_i2 + … + β_kX_ik + ε_i
and deflate it by X_ij^λ as
y_i/X_ij^λ = β₁/X_ij^λ + β₂X_i2/X_ij^λ + … + β_kX_ik/X_ij^λ + ε_i/X_ij^λ.
Now apply OLS to this transformed model and use the usual statistical tools for drawing inferences.
A caution is to be kept in mind while doing so. This is illustrated in the following example with a one-explanatory-variable model.
Consider the model
y_i = β₀ + β₁x_i + ε_i.
Deflating it by x_i gives
y_i/x_i = β₀(1/x_i) + β₁ + ε_i/x_i.
Note that the roles of β₀ and β₁ in the original and deflated models are interchanged: in the original model, β₀ is the intercept term and β₁ is the slope parameter, whereas in the deflated model, β₁ becomes the intercept term and β₀ becomes the slope parameter. So essentially one can use OLS, but one needs to be careful in identifying the intercept term and the slope parameter, particularly in software output.
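The deflation idea, and the caution about the swapped intercept and slope, can be sketched on synthetic data, assuming σ_i² = σ²x_i² (i.e., λ = 1 with the deflator equal to x_i); parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500
x = rng.uniform(1, 10, size=n)
y = 3.0 + 2.0 * x + rng.normal(scale=0.5 * x)   # Var(eps_i) proportional to x_i^2

# Deflated model: y/x = beta1 + beta0*(1/x) + eps/x, with homoskedastic errors.
Xd = np.column_stack([np.ones(n), 1.0 / x])
coef = np.linalg.lstsq(Xd, y / x, rcond=None)[0]
b1_hat = coef[0]   # coefficient of the constant = original SLOPE beta1
b0_hat = coef[1]   # coefficient of 1/x = original INTERCEPT beta0
```

Misreading `coef[0]` as the original intercept is exactly the software-output pitfall mentioned above.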
ECONOMETRIC THEORY
MODULE – IX
Lecture - 27
Autocorrelation
One of the basic assumptions in the linear regression model is that the random error components or disturbances are identically and independently distributed. So in the model y = Xβ + u, it is assumed that

E(u_t u_{t−s}) = σ_u² if s = 0, and 0 if s ≠ 0,

i.e., the correlation between successive disturbances is zero.

When E(u_t u_{t−s}) = σ_u², s = 0, is violated, i.e., the variance of the disturbance term does not remain constant, the problem of heteroskedasticity arises. When E(u_t u_{t−s}) = 0, s ≠ 0, is violated, i.e., the disturbances are correlated, the problem of autocorrelation arises. When autocorrelation is present, some or all off-diagonal elements in E(uu') are nonzero.
Sometimes the study and explanatory variables have a natural sequence order over time, i.e., the data is collected with
respect to time. Such data is termed as time series data. The disturbance terms in time series data are serially
correlated.
The autocovariance and autocorrelation of order s are

γ_s = E(u_t u_{t−s}), with γ₀ = E(u_t²) = σ_u²,
ρ_s = E(u_t u_{t−s}) / √(Var(u_t)Var(u_{t−s})); s = 0, ±1, ±2, ….
Assume ρ s and γ s are symmetrical in s, i.e., these coefficients are constant over time and depend only on length of lag s.
The autocorrelation between the successive terms (u2 and u1), (u3 and u2),…, (un and un-1) gives the autocorrelation of
order one, i.e., ρ1 .
Similarly, the autocorrelation between the successive terms (u3 and u1 ), (u4 and u2 ),..., (un and un − 2 )
gives the autocorrelation of order two, i.e., ρ2 .
Source of autocorrelation
Some of the possible reasons for the introduction of autocorrelation in the data are as follows:
1. Carryover of effect, at least in part, is an important source of autocorrelation. For example, monthly data on household expenditure are influenced by the expenditure of the preceding month. Autocorrelation is present in cross-section data as well as time series data. In cross-section data, neighboring units tend to be similar with respect to the characteristic under study. In time series data, time is the factor that produces autocorrelation. Whenever some ordering of sampling units is present, autocorrelation may arise.
2. Another source of autocorrelation is the effect of deletion of some variables. In regression modeling, it is not possible
to include all the variables in the model. There can be various reasons for this, e.g., some variable may be qualitative,
sometimes direct observations may not be available on the variable etc. The joint effect of such deleted variables gives
rise to autocorrelation in the data.
3. The misspecification of the form of relationship can also introduce autocorrelation in the data. It is assumed that the form
of relationship between study and explanatory variables is linear. If there are log or exponential terms present in the
model so that the linearity of the model is questionable then this also gives rise to autocorrelation in the data.
4. The difference between the observed and true values of a variable is called measurement error or errors-in-variables. The presence of measurement errors in the dependent variable may also introduce autocorrelation into the data.
Structure of disturbance term
Consider the situation where the disturbances are autocorrelated,
E(uu') = [ γ₀ γ₁ … γ_{n−1} ; γ₁ γ₀ … γ_{n−2} ; … ; γ_{n−1} γ_{n−2} … γ₀ ]
       = γ₀ [ 1 ρ₁ … ρ_{n−1} ; ρ₁ 1 … ρ_{n−2} ; … ; ρ_{n−1} ρ_{n−2} … 1 ]
       = σ_u² [ 1 ρ₁ … ρ_{n−1} ; ρ₁ 1 … ρ_{n−2} ; … ; ρ_{n−1} ρ_{n−2} … 1 ].
Observe that there are now (n + k) parameters: β₁, β₂, …, β_k, σ_u², ρ₁, ρ₂, …, ρ_{n−1}. These (n + k) parameters are to be estimated on the basis of the available n observations. Since the number of parameters is larger than the number of observations, the situation is not good from the statistical point of view. In order to handle it, some special form and structure of the disturbance term is assumed, so that the number of parameters in the covariance matrix of the disturbance term can be reduced.
The following structures are popular in autocorrelation:
1. Autoregressive (AR) process.
2. Moving average (MA) process.
3. Joint autoregressive moving average (ARMA) process.
u_t = ε_t + θ₁ε_{t−1} + … + θ_pε_{t−p},
i.e., the present disturbance u_t depends on the p lagged values of ε_t. The coefficients θ₁, θ₂, …, θ_p are the parameters associated with ε_{t−1}, ε_{t−2}, …, ε_{t−p}, respectively. This process is termed an MA(p) process.
u_t = φ₁u_{t−1} + … + φ_qu_{t−q} + ε_t + θ₁ε_{t−1} + … + θ_pε_{t−p}.
This is termed an ARMA(q, p) process.
The method of the correlogram is used to check which of these processes the data follow. The correlogram is a two-dimensional graph of the autocorrelation coefficient ρ_s against the lag s, with s on the X-axis and ρ_s on the Y-axis.
In the MA(1) process
u_t = ε_t + θ₁ε_{t−1},
ρ_s = θ₁/(1 + θ₁²) for s = 1, and ρ_s = 0 for s ≥ 2,
so
ρ₀ = 1, ρ₁ ≠ 0, ρ_i = 0 for i = 2, 3, ….
So there is no autocorrelation between disturbances that are more than one period apart.
In the ARMA(1,1) process
u_t = φ₁u_{t−1} + ε_t + θ₁ε_{t−1},
ρ_s = (1 + φ₁θ₁)(φ₁ + θ₁)/(1 + θ₁² + 2φ₁θ₁) for s = 1, and ρ_s = φ₁ρ_{s−1} for s ≥ 2,
σ_u² = [(1 + θ₁² + 2φ₁θ₁)/(1 − φ₁²)] σ_ε².
The autocorrelation function begins at some point determined by both the AR and MA components but thereafter, declines
geometrically at a rate determined by the AR component.
The results of a lower order process are not applicable to higher order schemes. As the order of the process increases, the difficulty in handling it mathematically also increases.
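A quick simulation sketch (with illustrative parameter values) showing the correlogram patterns described above: geometric decay for an AR(1) process and a cutoff after lag one for an MA(1) process.

```python
import numpy as np

rng = np.random.default_rng(5)
T = 20000
eps = rng.normal(size=T + 1)

phi, theta = 0.7, 0.7
ar = np.zeros(T)
for t in range(1, T):
    ar[t] = phi * ar[t - 1] + eps[t]      # AR(1): u_t = phi*u_{t-1} + e_t
ma = eps[1:] + theta * eps[:-1]           # MA(1): u_t = e_t + theta*e_{t-1}

def acf(u, s):
    """Sample autocorrelation at lag s."""
    u = u - u.mean()
    return (u[s:] @ u[:-s]) / (u @ u) if s else 1.0

ar_acf = [acf(ar, s) for s in range(4)]
ma_acf = [acf(ma, s) for s in range(4)]
# AR(1): rho_s = phi**s decays geometrically.
# MA(1): rho_1 = theta/(1 + theta**2), rho_s ~ 0 for s >= 2.
```

Plotting `ar_acf` and `ma_acf` against the lag s gives the two correlograms.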
ECONOMETRIC THEORY
MODULE – IX
Lecture - 28
Autocorrelation
Consider the first order autoregressive process
u_t = ρu_{t−1} + ε_t, |ρ| < 1, E(ε_t) = 0,
E(ε_tε_{t+s}) = σ_ε² if s = 0, and 0 if s ≠ 0,
for all t = 1, 2, …, n, where ρ is the first order autocorrelation between u_t and u_{t−1}, t = 1, 2, …, n. Now
u_t = ρu_{t−1} + ε_t
    = ρ(ρu_{t−2} + ε_{t−1}) + ε_t
    = ε_t + ρε_{t−1} + ρ²ε_{t−2} + …
    = Σ_{r=0}^∞ ρ^rε_{t−r},
so
E(u_t) = 0,
E(u_t²) = E(ε_t²) + ρ²E(ε²_{t−1}) + ρ⁴E(ε²_{t−2}) + …
        = (1 + ρ² + ρ⁴ + …)σ_ε²   (the ε_t's are serially independent)
E(u_t²) = σ_u² = σ_ε²/(1 − ρ²) for all t.
E(u_tu_{t−1}) = E[(ε_t + ρε_{t−1} + ρ²ε_{t−2} + …)(ε_{t−1} + ρε_{t−2} + ρ²ε_{t−3} + …)]
             = ρE(ε_{t−1} + ρε_{t−2} + …)²
             = ρσ_u².
Similarly,
E(u_tu_{t−2}) = ρ²σ_u².
In general,
E(u_tu_{t−s}) = ρ^sσ_u²,
so that

E(uu') = Ω = σ_u² [ 1 ρ ρ² … ρ^{n−1} ; ρ 1 ρ … ρ^{n−2} ; ρ² ρ 1 … ρ^{n−3} ; … ; ρ^{n−1} ρ^{n−2} ρ^{n−3} … 1 ].

Note that the disturbance terms are no longer independent and E(uu') ≠ σ²I. The disturbances are nonspherical.
Consider the model y = Xβ + u with
u_t = ρu_{t−1} + ε_t, t = 1, 2, …, n,
and assumptions
E(u) = 0, E(uu') = Ω,
E(ε_t) = 0, E(ε_tε_{t+s}) = σ_ε² if s = 0 and 0 if s ≠ 0,
where Ω is a positive definite matrix.

The ordinary least squares estimator of β is
b = (X'X)⁻¹X'y = (X'X)⁻¹X'(Xβ + u)
b − β = (X'X)⁻¹X'u
E(b − β) = 0.
So the OLSE remains unbiased under autocorrelated disturbances. Further,

E(s²) = [nσ_u² − tr{(X'X)⁻¹X'ΩX}]/(n − 1),

so s² is a biased estimator of σ². In fact, s² has a downward bias.
Application of OLS fails in the case of autocorrelation in the data and leads to serious consequences such as
- an overly optimistic view from R²,
- narrow confidence intervals,
- misleading usual t-ratio and F-ratio tests,
- predictions with large variances.
Since the disturbances are nonspherical, the generalized least squares estimation of β yields more efficient estimates than the OLSE. The GLSE of β is

β̂ = (X'Ω⁻¹X)⁻¹X'Ω⁻¹y
E(β̂) = β
V(β̂) = σ_u²(X'Ω⁻¹X)⁻¹.
Tests of autocorrelation
Durbin-Watson (D-W) test
The Durbin-Watson (D-W) test is used for testing the hypothesis of lack of first order autocorrelation in the disturbance
term. The null hypothesis is
H₀: ρ = 0.
Use OLS to estimate β in y = Xβ + u and obtain the residual vector
e = y − Xb = Hy
where b = (X'X)⁻¹X'y and H = I − X(X'X)⁻¹X'. The D-W test statistic is
d = Σ_{t=2}^n (e_t − e_{t−1})² / Σ_{t=1}^n e_t²
  = Σ_{t=2}^n e_t²/Σ_{t=1}^n e_t² + Σ_{t=2}^n e²_{t−1}/Σ_{t=1}^n e_t² − 2Σ_{t=2}^n e_te_{t−1}/Σ_{t=1}^n e_t².

For large n,
d ≈ 1 + 1 − 2r, i.e., d ≈ 2(1 − r),
where r is the sample autocorrelation coefficient from residuals based on OLSE and can be regarded as the regression
coefficient of et on et-1. Here
positive autocorrelation of et’s ⇒ d < 2,
negative autocorrelation of et’s ⇒ d > 2,
zero autocorrelation of et’s ⇒ d ≈ 2.
As -1 < r < 1, so
if -1 < r < 0, then 2 < d < 4 and
if 0 < r < 1, then 0 < d < 2.
So d lies between 0 and 4.
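The d statistic and its large-n relation d ≈ 2(1 − r) can be sketched as follows on simulated AR(1) disturbances (all parameter values are illustrative).

```python
import numpy as np

rng = np.random.default_rng(6)
n = 500
x = rng.normal(size=n)
u = np.zeros(n)
for t in range(1, n):
    u[t] = 0.8 * u[t - 1] + rng.normal()     # positively autocorrelated disturbances
y = 1.0 + 2.0 * x + u

# OLS residuals.
X = np.column_stack([np.ones(n), x])
e = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]

d = np.sum(np.diff(e)**2) / np.sum(e**2)     # Durbin-Watson statistic
r = (e[1:] @ e[:-1]) / (e @ e)               # sample autocorrelation of residuals
# Positive autocorrelation pushes d well below 2, and d is close to 2*(1 - r).
```

With ρ = 0.8 in the simulation, d comes out far below 2, signalling positive autocorrelation.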
Since e depends on X, different data sets give different values of d, so the sampling distribution of d depends on X. Consequently, exact critical values of d cannot be tabulated, owing to their dependence on X. Durbin and Watson therefore obtained two statistics d̲ and d̄ such that
d̲ < d < d̄
and whose sampling distributions do not depend upon X.
Considering the distributions of d̲ and d̄, they tabulated the critical values as d_L and d_U, respectively. They prepared tables of critical values for 15 < n < 100 and k ≤ 5. Tables are now available for 6 < n < 200 and k ≤ 10.
H₀: ρ = 0

Nature of H₁     Reject H₀ when              Retain H₀ when        The test is inconclusive when
H₁: ρ > 0        d < d_L                     d > d_U               d_L < d < d_U
H₁: ρ < 0        d > (4 − d_L)               d < (4 − d_U)         (4 − d_U) < d < (4 − d_L)
H₁: ρ ≠ 0        d < d_L or d > (4 − d_L)    d_U < d < (4 − d_U)   d_L < d < d_U or (4 − d_U) < d < (4 − d_L)

The values of d_L and d_U are obtained from the tables.
2. The D-W test is not applicable when the intercept term is absent in the model. In such a case, one can use other critical values, say d_M, in place of d_L. Tables of the critical values d_M are available.
3. The test is not valid when lagged dependent variables appear as explanatory variables, as in the following model.
Durbin's h-test
Apply OLS to
y_t = β₁y_{t−1} + β₂y_{t−2} + … + β_ry_{t−r} + β_{r+1}x_{t1} + … + β_kx_{t,k−r} + u_t, u_t = ρu_{t−1} + ε_t,
and find the OLSE b₁ of β₁. Let its variance be Var(b₁) and its estimator V̂ar(b₁). Then Durbin's h-statistic is

h = r √[ n / (1 − n·V̂ar(b₁)) ]

where

r = Σ_{t=2}^n e_te_{t−1} / Σ_{t=2}^n e_t².
This test is applicable when n is large. When 1 − n·V̂ar(b₁) < 0, the test breaks down. In such cases, the following test procedure can be adopted.

Introduce the lagged residual corresponding to u_t = ρu_{t−1} + ε_t and run the regression
e_t = δe_{t−1} + v_t.
Now apply OLS to this model and test H₀A: δ = 0 versus H₁A: δ ≠ 0 using a t-test. If H₀A is accepted, then accept H₀: ρ = 0; if H₀A: δ = 0 is rejected, then reject H₀: ρ = 0.
4. If H₀: ρ = 0 is rejected by the D-W test, it does not necessarily mean the presence of first order autocorrelation in the disturbances. It could also happen for other reasons, e.g.,
- the disturbances may follow a higher order AR process,
- some important variables have been omitted,
- the dynamics of the model are misspecified,
- the functional form of the model is incorrect.
ECONOMETRIC THEORY
MODULE – IX
Lecture - 29
Autocorrelation
When ρ is known, the GLSE of β is
β̂ = (X'ψ⁻¹X)⁻¹X'ψ⁻¹y
where

ψ⁻¹ =
[ 1     −ρ      0      …  0      0   ]
[ −ρ    1+ρ²   −ρ      …  0      0   ]
[ 0     −ρ     1+ρ²    …  0      0   ]
[ ⋮                                  ]
[ 0     0      0       …  1+ρ²  −ρ   ]
[ 0     0      0       …  −ρ     1   ]

and the transformation matrix

P =
[ √(1−ρ²)   0    0   …  0    0 ]
[ −ρ        1    0   …  0    0 ]
[ 0        −ρ    1   …  0    0 ]
[ ⋮                            ]
[ 0         0    0   …  −ρ   1 ]

satisfies P'P = ψ⁻¹.
The transformed variables are

y* = Py = [ √(1−ρ²)y₁, y₂ − ρy₁, y₃ − ρy₂, …, y_n − ρy_{n−1} ]',

and X* = PX, whose first row is ( √(1−ρ²), √(1−ρ²)x₁₂, …, √(1−ρ²)x₁k ) and whose t-th row (t ≥ 2) is ( 1 − ρ, x_t2 − ρx_{t−1,2}, …, x_tk − ρx_{t−1,k} ).
Note that the first observation is treated differently from the other observations. For the first observation,
√(1−ρ²)y₁ = √(1−ρ²)x₁'β + √(1−ρ²)u₁,
whereas for the other observations,
y_t − ρy_{t−1} = (x_t − ρx_{t−1})'β + (u_t − ρu_{t−1}), t = 2, 3, …, n,
where x_t' is a row vector of X. Also, √(1−ρ²)u₁ and (u_t − ρu_{t−1}) have the same properties, so we expect these errors to be uncorrelated and homoskedastic.
If the first column of X is a vector of ones, then the first column of X* is not constant; its first element is √(1−ρ²). Further,

V(β̂) = σ²(X*'X*)⁻¹ = σ²(X'ψ⁻¹X)⁻¹,
V̂(β̂) = σ̂²(X'ψ⁻¹X)⁻¹

where

σ̂² = (y − Xβ̂)'ψ⁻¹(y − Xβ̂)/(n − k).
Several procedures have been suggested to estimate the regression coefficients when the autocorrelation coefficient is unknown. The feasible GLSE of β is

β̂_F = (X'Ω̂⁻¹X)⁻¹X'Ω̂⁻¹y

where Ω̂⁻¹ is the Ψ⁻¹ matrix with ρ replaced by its estimator ρ̂, for example

r = Σ_{t=2}^n e_te_{t−1} / Σ_{t=2}^n e_t²

based on the OLS residuals e_t.
2. Durbin procedure
In the Durbin procedure, the model
y_t − ρy_{t−1} = β₀(1 − ρ) + β(x_t − ρx_{t−1}) + ε_t, t = 2, 3, …, n,
is expressed as
y_t = β₀(1 − ρ) + ρy_{t−1} + βx_t − ρβx_{t−1} + ε_t
    = β₀* + ρy_{t−1} + βx_t + β*x_{t−1} + ε_t, t = 2, 3, …, n  (*)
where β₀* = β₀(1 − ρ) and β* = −ρβ.
Now run a regression using OLS on model (*) and take the estimated coefficient of y_{t−1} as the estimator r* of ρ.
Another possibility is that, since ρ ∈ (−1, 1), one can search for a suitable ρ which gives a smaller error sum of squares.
3. Cochrane-Orcutt procedure
This procedure utilizes the P matrix defined while estimating β when ρ is known. It has the following steps:
(ii) Estimate ρ by
r = Σ_{t=2}^n e_te_{t−1} / Σ_{t=2}^n e²_{t−1}.
(iii) Replace ρ by r in
y_t − ρy_{t−1} = β₀(1 − ρ) + β(x_t − ρx_{t−1}) + ε_t
and apply OLS to the transformed model.
This is the Cochrane-Orcutt procedure. Since two successive applications of OLS are involved, it is also called a two-step procedure.
(iii) Calculate ρ by
r = Σ_{t=2}^n e_te_{t−1} / Σ_{t=2}^n e²_{t−1}
and substitute it in the model
y_t − ρy_{t−1} = β₀(1 − ρ) + β(x_t − ρx_{t−1}) + ε_t
to again obtain the transformed model.
(iv) Apply OLS to this model and calculate the regression coefficients.
This procedure is repeated until convergence is achieved, i.e., the process is iterated until two successive estimates are nearly the same, so that stability of the estimator is achieved.
This is an iterative and numerically convergent procedure. Such estimates are asymptotically efficient, and there is a loss of one observation.
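The iterative Cochrane-Orcutt procedure can be sketched as follows on synthetic data (a minimal sketch with one explanatory variable; parameter values are illustrative).

```python
import numpy as np

rng = np.random.default_rng(7)
n = 400
x = rng.normal(size=n)
u = np.zeros(n)
for t in range(1, n):
    u[t] = 0.6 * u[t - 1] + rng.normal(scale=0.5)   # AR(1) disturbances
y = 1.0 + 2.0 * x + u

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Iterate: residuals -> rho estimate -> OLS on transformed model -> repeat.
rho = 0.0
b = ols(np.column_stack([np.ones(n), x]), y)        # initial OLS fit
for _ in range(50):
    e = y - np.column_stack([np.ones(n), x]) @ b
    rho_new = (e[1:] @ e[:-1]) / (e[:-1] @ e[:-1])  # r = sum e_t e_{t-1} / sum e_{t-1}^2
    ys = y[1:] - rho_new * y[:-1]
    # Column (1 - r) carries beta0; column (x_t - r*x_{t-1}) carries beta1.
    Xs = np.column_stack([(1 - rho_new) * np.ones(n - 1),
                          x[1:] - rho_new * x[:-1]])
    b = ols(Xs, ys)
    if abs(rho_new - rho) < 1e-8:                   # successive estimates agree
        break
    rho = rho_new
```

After convergence, `rho` and `b = (beta0, beta1)` estimate the autocorrelation and the regression coefficients.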
(y_t − ρy_{t−1}) = β₀(1 − ρ) + β(x_t − ρx_{t−1}) + ε_t, t = 2, 3, …, n.
(iii) Select that value of ρ for which the residual sum of squares is smallest.

Suppose we get ρ = 0.4. Now choose a finer grid, for example 0.3 < ρ < 0.5, and consider ρ = 0.31, 0.32, …, 0.49, picking the ρ with the smallest residual sum of squares. Such iterations can be repeated until a value of ρ corresponding to the minimum residual sum of squares is obtained. The final selected value of ρ can then be used for transforming the model, as in the Cochrane-Orcutt procedure. The estimators obtained with this procedure are as efficient as those obtained by the Cochrane-Orcutt procedure, and there is a loss of one observation.
5. Prais-Winsten procedure
(i) Estimate ρ by

ρ̂ = Σ_{t=2}^n e_te_{t−1} / Σ_{t=3}^n e²_{t−1}

where the e_t's are the residuals based on the OLSE.
(ii) Transform the first observation as

√(1−ρ̂²)y₁ = √(1−ρ̂²)β₀ + β√(1−ρ̂²)x₁ + √(1−ρ̂²)u₁.
The estimators obtained with this procedure are asymptotically as efficient as the best linear unbiased estimators, and there is no loss of an observation.

Maximum likelihood procedure
Assuming normality, the likelihood function is

L = [ (2πσ_ε²)^{n/2} |ψ|^{1/2} ]⁻¹ exp[ −(1/(2σ_ε²))(y − Xβ)'ψ⁻¹(y − Xβ) ].

Ignoring the constant and using |ψ| = 1/(1 − ρ²), the log-likelihood is

ln L(β, σ_ε², ρ) = −(n/2) ln σ_ε² + (1/2) ln(1 − ρ²) − (1/(2σ_ε²))(y − Xβ)'ψ⁻¹(y − Xβ).
The maximum likelihood estimators of β, ρ and σ_ε² can be obtained by solving the normal equations

∂lnL/∂β = 0, ∂lnL/∂ρ = 0, ∂lnL/∂σ_ε² = 0.

These normal equations turn out to be nonlinear in the parameters and cannot be solved easily. One solution is to
- first derive the maximum likelihood estimator of σ_ε²,
- substitute it back into the likelihood function, obtaining the likelihood as a function of β and ρ only,
- maximize this likelihood function with respect to β and ρ.
Thus

∂lnL/∂σ_ε² = 0 ⇒ −n/(2σ_ε²) + (1/(2σ_ε⁴))(y − Xβ)'ψ⁻¹(y − Xβ) = 0
⇒ σ̂_ε² = (1/n)(y − Xβ)'ψ⁻¹(y − Xβ)

is the estimator of σ_ε².
Substituting σ̂_ε² in place of σ_ε² in the log-likelihood function yields

ln L* = ln L*(β, ρ) = −(n/2) ln[ (1/n)(y − Xβ)'ψ⁻¹(y − Xβ) ] + (1/2) ln(1 − ρ²) − n/2
      = k − (n/2) ln[ (y − Xβ)'ψ⁻¹(y − Xβ) / (1 − ρ²)^{1/n} ]

where k = (n/2) ln n − n/2. So maximizing ln L* is equivalent to minimizing

(y − Xβ)'ψ⁻¹(y − Xβ) / (1 − ρ²)^{1/n}.

Using the optimization techniques of non-linear regression, this function can be minimized and the estimates of β and ρ can be obtained. If n is large and |ρ| is not too close to one, then the factor (1 − ρ²)^{1/n} is negligible (close to one), and the estimates of β and ρ are essentially those minimizing (y − Xβ)'ψ⁻¹(y − Xβ).
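The minimization of (y − Xβ)'ψ⁻¹(y − Xβ)/(1 − ρ²)^{1/n} can be sketched by a simple grid search over ρ, transforming the data with the P matrix for each candidate ρ; this is a crude stand-in for formal non-linear optimization, on synthetic data with illustrative parameter values.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 300
x = rng.normal(size=n)
u = np.zeros(n)
for t in range(1, n):
    u[t] = 0.5 * u[t - 1] + rng.normal(scale=0.4)   # true rho = 0.5
y = 2.0 + 1.0 * x + u
X = np.column_stack([np.ones(n), x])

def criterion(rho):
    """(y - Xb)' psi^{-1} (y - Xb) / (1 - rho^2)^{1/n}, with b fit by OLS
    on the P-transformed data (so the quadratic form uses psi^{-1} = P'P)."""
    s = np.sqrt(1 - rho**2)
    ys = np.concatenate([[s * y[0]], y[1:] - rho * y[:-1]])
    Xs = np.vstack([s * X[0], X[1:] - rho * X[:-1]])
    b = np.linalg.lstsq(Xs, ys, rcond=None)[0]
    r = ys - Xs @ b
    return (r @ r) / (1 - rho**2) ** (1.0 / n)

grid = np.linspace(-0.9, 0.9, 181)
rho_hat = grid[np.argmin([criterion(r) for r in grid])]
```

The grid minimizer `rho_hat` approximates the maximum likelihood estimate of ρ.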
MODULE – X
Lecture - 30
Dummy Variable Models
In general, the explanatory variables in any regression analysis are assumed to be quantitative in nature. For example, variables like temperature, distance and age are quantitative in the sense that they are recorded on a well-defined scale.

In many applications, a variable cannot be measured on a well-defined scale, and it is qualitative in nature. For example, variables like sex (male or female), colour (black, white), nationality and employment status (employed, unemployed) are defined on a nominal scale. Such variables do not have any natural scale of measurement. They usually indicate the presence or absence of a "quality" or an attribute, like employed or unemployed, graduate or non-graduate, smoker or non-smoker, yes or no, acceptance or rejection, so they are defined on a nominal scale. Such variables can be quantified by artificially constructing variables that take, e.g., the values 1 and 0, where "1" usually indicates the presence of the attribute and "0" usually indicates its absence. For example, "1" may indicate that the person is male and "0" that the person is female. Similarly, "1" may indicate that the person is employed and "0" that the person is unemployed.

Such variables classify the data into mutually exclusive categories. These variables are called indicator variables or dummy variables.
Usually, the dummy or indicator variables take the values 0 and 1 to identify the mutually exclusive classes of the explanatory variables. Here we use the notation D in place of X to denote the dummy variable. The choice of 1 and 0 to identify a category is arbitrary; one could equally well interchange them. It is also not necessary to choose only 1 and 0 to denote the categories; in fact, any two distinct values of D will serve the purpose. The choices of 1 and 0 are preferred as they make the calculations simple, help in easy interpretation of the values and usually turn out to be a satisfactory choice.
In a given regression model, the qualitative and quantitative variables may also occur together, i.e., some variables may be
qualitative and others quantitative.
Such models can be dealt with within the framework of regression analysis, and the usual tools of regression analysis can be used with dummy variables.
Example
Consider the following model with x1 as a quantitative variable and D2 as a dummy variable:

  y = β0 + β1x1 + β2D2 + ε,  E(ε) = 0, Var(ε) = σ²,

where

  D2 = 0 if an observation belongs to group A,
  D2 = 1 if an observation belongs to group B.

For group A (D2 = 0),

  y = β0 + β1x1 + β2·0 + ε = β0 + β1x1 + ε,  so E(y | D2 = 0) = β0 + β1x1,

which is a straight-line relationship with intercept β0 and slope β1. For group B (D2 = 1),

  y = β0 + β1x1 + β2·1 + ε = (β0 + β2) + β1x1 + ε,  so E(y | D2 = 1) = (β0 + β2) + β1x1,

which is a straight-line relationship with intercept (β0 + β2) and the same slope β1.

The quantities E(y | D2 = 0) and E(y | D2 = 1) are the average responses when an observation belongs to group A and group B, respectively. Thus

  β2 = E(y | D2 = 1) − E(y | D2 = 0),

which has an interpretation as the difference between the average values of y with D2 = 1 and with D2 = 0.
Graphically, this is shown in the following figure: two parallel regression lines with the same variance, one with intercept β0 (group A) and one with intercept β0 + β2 (group B), both with slope β1; the vertical distance between the lines is β2.

[Figure: the parallel lines E(y | D2 = 0) = β0 + β1x1 and E(y | D2 = 1) = (β0 + β2) + β1x1 plotted against x.]
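As a quick numerical illustration of this model, the sketch below (hypothetical simulated data; all parameter values are made up) fits y = β0 + β1x1 + β2D2 + ε by ordinary least squares and checks that the coefficient of D2 estimates the vertical distance β2 between the two parallel lines:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.uniform(0, 10, n)
D2 = np.repeat([0, 1], n // 2)          # group A -> 0, group B -> 1
beta0, beta1, beta2 = 2.0, 1.5, 3.0      # hypothetical true parameters
y = beta0 + beta1 * x1 + beta2 * D2 + rng.normal(0, 0.5, n)

# design matrix [1, x1, D2]; OLS via least squares
X = np.column_stack([np.ones(n), x1, D2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)

# b[2] estimates beta2 = E(y | D2 = 1) - E(y | D2 = 0) at any fixed x1
print(b)
```

The fitted vector b should be close to (2.0, 1.5, 3.0), with b[2] estimating the intercept shift between the groups.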
If there are three explanatory variables in the model with two dummy variables D2 and D3, then they can describe three levels, e.g., groups A, B and C. The levels of the dummy variables are as follows:

1. D2 = 0, D3 = 0 if the observation is from group A.
2. D2 = 1, D3 = 0 if the observation is from group B.
3. D2 = 0, D3 = 1 if the observation is from group C.

The model is

  y = β0 + β1x1 + β2D2 + β3D3 + ε,  E(ε) = 0, Var(ε) = σ².

In general, if a qualitative variable has m levels, then (m − 1) dummy variables are required, each taking the values 0 and 1.
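The (m − 1)-dummies encoding can be sketched in code (hypothetical data; the group labels A, B, C are made up). With m = 3 levels, two dummy variables suffice, and one level (here A) serves as the base category:

```python
import numpy as np

# hypothetical categorical variable with m = 3 levels: A, B, C
groups = np.array(["A", "B", "C", "B", "A", "C", "C", "B"])

# (m - 1) = 2 dummy variables; level "A" is the base category
D2 = (groups == "B").astype(int)
D3 = (groups == "C").astype(int)

# each observation activates at most one dummy; group A has both zero
for g, d2, d3 in zip(groups, D2, D3):
    assert (d2, d3) == {"A": (0, 0), "B": (1, 0), "C": (0, 1)}[g]
print(np.column_stack([D2, D3]))
```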
Consider the following examples to understand how to define such dummy variables and how they can be handled.
Example
Suppose y denotes the monthly salary of a person and D denotes whether the person is a graduate or a non-graduate. The model is

  y = β0 + β1D + ε,  E(ε) = 0, Var(ε) = σ².

With n observations, the model is

  yi = β0 + β1Di + εi,  i = 1, 2, ..., n,

so that

  E(yi | Di = 0) = β0,
  E(yi | Di = 1) = β0 + β1,
  β1 = E(yi | Di = 1) − E(yi | Di = 0).

Thus
- β0 measures the mean salary of a non-graduate, and
- β1 measures the difference in the mean salaries of a graduate and a non-graduate person.
Now consider the same model with two dummy variables defined in the following way:

  Di1 = 1 if the person is a graduate, Di1 = 0 if the person is a non-graduate;
  Di2 = 1 if the person is a non-graduate, Di2 = 0 if the person is a graduate;

so that the model becomes yi = β0 + β1Di1 + β2Di2 + εi.
The various averages are

1. E[yi | Di1 = 0, Di2 = 1] = β0 + β2 : average salary of a non-graduate.
2. E[yi | Di1 = 1, Di2 = 0] = β0 + β1 : average salary of a graduate.
3. E[yi | Di1 = 0, Di2 = 0] = β0 : this case cannot exist, since every person is either a graduate or a non-graduate.

Because Di1 + Di2 = 1 for every observation, exact multicollinearity is present, and the rank of the matrix of explanatory variables falls short by one. So β0, β1 and β2 are indeterminate and the least squares method breaks down. Thus the proposition of introducing two dummy variables is appealing, but it leads to serious consequences. This is known as the dummy variable trap.
If instead the intercept term is dropped and the model is fitted as

  yi = β1Di1 + β2Di2 + εi,

then

  E(yi | Di1 = 1, Di2 = 0) = β1 : average salary of a graduate,
  E(yi | Di1 = 0, Di2 = 1) = β2 : average salary of a non-graduate.

So when the intercept term is dropped, β1 and β2 have proper interpretations as the average salaries of a graduate and a non-graduate person, respectively.
Now the parameters can be estimated using the ordinary least squares principle, and the standard procedures for drawing inferences can be used.
Rule: When the explanatory variable leads to a classification into m mutually exclusive categories, use (m − 1) dummy variables together with an intercept term. Alternatively, use m dummy variables and drop the intercept term.
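The dummy variable trap can be verified numerically: with an intercept column and both dummies, the columns are exactly linearly dependent, so the design matrix loses rank and X'X is singular. A minimal sketch with hypothetical data:

```python
import numpy as np

n = 50
Di1 = np.repeat([0, 1], n // 2)   # 1 = graduate (hypothetical data)
Di2 = 1 - Di1                     # 1 = non-graduate, so Di1 + Di2 = 1 always

# intercept column + both dummies -> exact multicollinearity (rank deficient)
X_trap = np.column_stack([np.ones(n), Di1, Di2])
print(np.linalg.matrix_rank(X_trap))   # rank 2, not 3: X'X is singular

# Remedy 1: drop one dummy.  Remedy 2: drop the intercept.
X_ok1 = np.column_stack([np.ones(n), Di1])
X_ok2 = np.column_stack([Di1, Di2])
print(np.linalg.matrix_rank(X_ok1), np.linalg.matrix_rank(X_ok2))  # 2 2
```

Both remedies give a full-column-rank design matrix, so least squares is well defined.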
Interaction term
Suppose a model has two explanatory variables, one quantitative (xi1) and the other a dummy variable (Di2). Suppose the two interact, and an explanatory variable formed as their product is added to the model:

  yi = β0 + β1xi1 + β2Di2 + β3xi1Di2 + εi.

Thus
- β2 reflects the change in the intercept term associated with the change in the group of the person, i.e., when the group changes from A to B;
- β3 reflects the change in the slope associated with the change in the group of the person, i.e., when the group changes from A to B.

Fitting this model is equivalent to fitting two separate regression models corresponding to Di2 = 1 and Di2 = 0, i.e.,

  yi = β0 + β1xi1 + β2·1 + β3xi1·1 + εi = (β0 + β2) + (β1 + β3)xi1 + εi

and

  yi = β0 + β1xi1 + β2·0 + β3xi1·0 + εi = β0 + β1xi1 + εi,

respectively.
The test of hypothesis becomes convenient by using a dummy variable. For example, if we want to test whether the two regression models are identical, the test involves

  H0 : β2 = β3 = 0 against H1 : β2 ≠ 0 and/or β3 ≠ 0.

Acceptance of H0 indicates that only a single model is necessary to explain the relationship.

In another example, if the objective is to test that the two models differ with respect to their intercepts only, having the same slopes, then the test involves

  H0 : β3 = 0 against H1 : β3 ≠ 0.
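The equivalence between the interaction model and two separate group-wise regressions can be checked numerically. The sketch below uses hypothetical simulated data; the coefficients from the separate fits match (b0, b1) for group A and (b0 + b2, b1 + b3) for group B exactly:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
x1 = rng.uniform(0, 5, n)
D2 = np.repeat([0, 1], n // 2)           # group A -> 0, group B -> 1
y = 1.0 + 2.0 * x1 + 0.5 * D2 + 1.5 * x1 * D2 + rng.normal(0, 0.3, n)

# full model with interaction term x1 * D2
X = np.column_stack([np.ones(n), x1, D2, x1 * D2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)

# two separate fits, one per group
A = D2 == 0
bA, *_ = np.linalg.lstsq(np.column_stack([np.ones(A.sum()), x1[A]]), y[A], rcond=None)
bB, *_ = np.linalg.lstsq(np.column_stack([np.ones((~A).sum()), x1[~A]]), y[~A], rcond=None)

# group A: intercept b0, slope b1; group B: intercept b0+b2, slope b1+b3
print(bA, [b[0], b[1]])
print(bB, [b[0] + b[2], b[1] + b[3]])
```

The agreement is exact (up to floating point), because the fully interacted design is equivalent to a block-diagonal design with one block per group.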
Dummy variables can also replace a quantitative variable. For example, if the ages of persons are grouped into five categories, then the variable "age" can be represented by four dummy variables. Since it is difficult to collect data on individual ages, this eases data collection; a disadvantage is that some loss of information occurs. For example, suppose the ages in years are 2, 3, 4, 5, 6, 7 and the dummy variable is defined as

  Di = 1 if the age of the i-th person is ≥ 5 years,
  Di = 0 if the age of the i-th person is < 5 years.

Then these values become 0, 0, 0, 1, 1, 1. Now, looking at a value 1, one cannot determine whether it corresponds to age 5, 6 or 7 years.
Moreover, if a quantitative explanatory variable is grouped into m categories, then (m − 1) parameters are required, whereas if the original variable is used as such, only one parameter is required. Treating a quantitative variable as qualitative thus increases the complexity of the model. The degrees of freedom for error are also reduced, which can affect the inferences if the data set is small.

On the other hand, the use of dummy variables does not require any assumption about the functional form of the relationship between the study and explanatory variables.
ECONOMETRIC THEORY
MODULE – XI
Lecture - 31
Specification Error Analysis
The specification of a linear regression model consists of a formulation of the regression relationships and of statements or assumptions concerning the explanatory variables and disturbances. If any of these is violated, e.g., through an incorrect functional form or an incorrect introduction of the disturbance term in the model, then a specification error occurs. In a narrower sense, specification error refers to the choice of explanatory variables.

The complete regression analysis depends on the explanatory variables present in the model. It is understood in regression analysis that only correct and important explanatory variables appear in the model. In practice, after ensuring the correct functional form of the model, the analyst usually has a pool of explanatory variables which possibly influence the process or experiment. Generally, not all such candidate variables are used in the regression modelling; rather, a subset of explanatory variables is chosen from this pool. How to determine an appropriate subset of explanatory variables to be used in the regression is called the problem of variable selection.

While choosing a subset of explanatory variables, there are two possible options:
1. In order to make the model as realistic as possible, the analyst may include as many explanatory variables as possible.
2. In order to make the model as simple as possible, one may include only a few explanatory variables.

We now discuss the statistical consequences arising from both situations.
Some candidate variables may also be excluded from the point of view of theoretical considerations. There can be several reasons behind such decisions: e.g., it may be hard to quantify variables like taste or intelligence, or it may be difficult to take correct observations on some of the variables.
Let there be k candidate explanatory variables, out of which suppose r variables are included and the remaining (k − r) variables are deleted from the model. So partition X and β as

  X = [X1  X2], with X1 of order n × r and X2 of order n × (k − r),
  β = (β1', β2')', with β1 of order r × 1 and β2 of order (k − r) × 1.

The model y = Xβ + ε, E(ε) = 0, V(ε) = σ²I can then be expressed as

  y = X1β1 + X2β2 + ε,

which is called the full model or true model.

After dropping the (k − r) explanatory variables in X2, the new model is

  y = X1β1 + δ,

which is called the misspecified model or false model.
Applying OLS to the misspecified model gives

  b1F = (X1'X1)⁻¹X1'y = (X1'X1)⁻¹X1'(X1β1 + X2β2 + ε) = β1 + θ + (X1'X1)⁻¹X1'ε,

where θ = (X1'X1)⁻¹X1'X2β2. Thus

  E(b1F − β1) = θ + (X1'X1)⁻¹X1'E(ε) = θ,

which is a linear function of β2, i.e., of the coefficients of the excluded variables. So b1F is biased in general. The bias vanishes if X1'X2 = 0, i.e., if X1 and X2 are orthogonal (uncorrelated).

The mean squared error matrix of b1F is

  E(b1F − β1)(b1F − β1)' = E[θθ' + θε'X1(X1'X1)⁻¹ + (X1'X1)⁻¹X1'εθ' + (X1'X1)⁻¹X1'εε'X1(X1'X1)⁻¹]
                         = θθ' + σ²(X1'X1)⁻¹.

Note that the second term is the conventional covariance matrix of the OLS estimator, while the first term θθ' is the additional contribution of the bias; unless θ = 0, the mean squared error exceeds the conventional covariance matrix.
An estimator of σ² based on the misspecified model is

  s² = SSres/(n − r) = e'e/(n − r),

where

  e = y − X1b1F = H1y,  H1 = I − X1(X1'X1)⁻¹X1'.

Thus, using H1X1 = 0,

  H1y = H1(X1β1 + X2β2 + ε) = 0 + H1(X2β2 + ε) = H1(X2β2 + ε),

and

  y'H1y = (X2β2 + ε)'H1(X2β2 + ε)
        = β2'X2'H1X2β2 + β2'X2'H1ε + ε'H1X2β2 + ε'H1ε.

Taking expectations,

  E(s²) = [E(β2'X2'H1X2β2) + 0 + 0 + E(ε'H1ε)]/(n − r)
        = [β2'X2'H1X2β2 + (n − r)σ²]/(n − r)
        = σ² + β2'X2'H1X2β2/(n − r).

Thus s² is a biased estimator of σ², and it provides an overestimate of σ². Note that even if X1'X2 = 0, s² still gives an overestimate of σ² (since then H1X2 = X2 ≠ 0). So statistical inferences based on s² will be faulty: the t-tests and confidence regions will be invalid in this case.
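A small simulation illustrates both consequences of omitting a relevant variable (hypothetical data; all parameter values are made up): the misspecified estimator b1F is biased by θ = (X1'X1)⁻¹X1'X2β2, and s² overestimates σ²:

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 100, 500
beta1, beta2, sigma = 1.0, 2.0, 1.0
x1 = rng.uniform(0, 1, n)
x2 = 0.8 * x1 + rng.uniform(0, 1, n)   # correlated with x1, so X1'X2 != 0

X1 = x1.reshape(-1, 1)
A = np.linalg.inv(X1.T @ X1) @ X1.T     # (X1'X1)^{-1} X1'
theta = (A @ (beta2 * x2)).item()       # theoretical bias of b1F

b1F = np.empty(reps)
s2 = np.empty(reps)
for r in range(reps):
    eps = rng.normal(0, sigma, n)
    y = beta1 * x1 + beta2 * x2 + eps   # true model (intercept omitted for brevity)
    b = (A @ y).item()                  # OLS fit of the misspecified model y ~ x1
    b1F[r] = b
    e = y - x1 * b
    s2[r] = e @ e / (n - 1)             # SSres/(n - r) with r = 1

print(b1F.mean() - beta1)   # close to theta: the omitted-variable bias
print(theta)
print(s2.mean())            # noticeably larger than sigma^2 = 1
```

The Monte Carlo bias of b1F matches θ, and the average of s² exceeds σ² by roughly β2²·x2'H1x2/(n − 1), as the algebra above predicts.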
If the response is to be predicted at x' = (x1', x2'), then the predicted value using the full model is ŷ = x'b = x1'b1 + x2'b2, whereas the misspecified model predicts only ŷF = x1'b1F, which is in general a biased prediction.
Inclusion of irrelevant variables

Sometimes, out of enthusiasm and to make the model look more realistic, the analyst may include some explanatory variables that are not very relevant to the model. Such variables may contribute very little to the explanatory power of the model. This reduces the degrees of freedom (n − k), and consequently the validity of the inferences drawn may be questionable. For example, the value of the coefficient of determination will increase, indicating that the model is getting better, which may not really be true.
Let the false model include an additional set of r irrelevant explanatory variables Z:

  y = Xβ + Zγ + ε,

where in truth γ = 0. Solving the normal equations of this model for the coefficient of X gives

  bF = (X'HZX)⁻¹X'HZy,  where HZ = I − Z(Z'Z)⁻¹Z'.

Since E(bF) = β under the true model, the covariance matrix is

  V(bF) = E(bF − β)(bF − β)'
        = σ²(X'HZX)⁻¹X'HZ·I·HZX(X'HZX)⁻¹
        = σ²(X'HZX)⁻¹.
From the true model, the OLS estimator is

  bT = (X'X)⁻¹X'y

with

  E(bT) = β,  V(bT) = σ²(X'X)⁻¹.

To compare bF and bT we use the following result.

Result: If A and B are two positive definite matrices, then A − B is at least positive semidefinite if B⁻¹ − A⁻¹ is at least positive semidefinite.

Let

  A = (X'HZX)⁻¹,  B = (X'X)⁻¹.

Then

  B⁻¹ − A⁻¹ = X'X − X'HZX = X'X − X'X + X'Z(Z'Z)⁻¹Z'X = X'Z(Z'Z)⁻¹Z'X,

which is at least a positive semidefinite matrix. This implies that V(bF) − V(bT) is at least positive semidefinite, i.e., the efficiency declines unless X'Z = 0. If X'Z = 0, i.e., X and Z are orthogonal, then both estimators are equally efficient.
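The matrix comparison can be verified numerically on a hypothetical X and Z. The sketch checks that B⁻¹ − A⁻¹ equals X'Z(Z'Z)⁻¹Z'X, that this matrix is positive semidefinite, and that consequently every coefficient variance in V(bF) is at least as large as in V(bT):

```python
import numpy as np

rng = np.random.default_rng(4)
n, k, r = 60, 3, 2
X = rng.normal(size=(n, k))      # relevant regressors (hypothetical)
Z = rng.normal(size=(n, r))      # irrelevant regressors (true gamma = 0)

HZ = np.eye(n) - Z @ np.linalg.inv(Z.T @ Z) @ Z.T
A_inv = X.T @ HZ @ X             # inverse of V(bF) / sigma^2
B_inv = X.T @ X                  # inverse of V(bT) / sigma^2

# B^{-1} - A^{-1} = X'Z(Z'Z)^{-1}Z'X is positive semidefinite
diff = B_inv - A_inv
target = X.T @ Z @ np.linalg.inv(Z.T @ Z) @ Z.T @ X
print(np.allclose(diff, target))
print(np.linalg.eigvalsh(diff).min() >= -1e-8)     # p.s.d. up to rounding

# hence V(bF) - V(bT) is p.s.d.: every coefficient variance can only grow
V_diff = np.linalg.inv(A_inv) - np.linalg.inv(B_inv)
print(np.diag(V_diff).min() >= -1e-8)
```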
The residual sum of squares from the false model is

  SSres = eF'eF,  where eF = y − XbF − ZcF,

with

  bF = (X'HZX)⁻¹X'HZy,
  cF = (Z'Z)⁻¹Z'y − (Z'Z)⁻¹Z'XbF = (Z'Z)⁻¹Z'(y − XbF) = (Z'Z)⁻¹Z'HZXy,

where

  HZ = I − Z(Z'Z)⁻¹Z',
  HZX = I − X(X'HZX)⁻¹X'HZ,  with HZX² = HZX (idempotent).

So

  eF = y − X(X'HZX)⁻¹X'HZy − Z(Z'Z)⁻¹Z'HZXy
     = [I − X(X'HZX)⁻¹X'HZ − Z(Z'Z)⁻¹Z'HZX]y
     = [HZX − (I − HZ)HZX]y
     = HZHZXy
     = H*ZXy,  where H*ZX = HZHZX.

Thus, using the idempotency of H*ZX,

  SSres = eF'eF = y'(HZHZX)'(HZHZX)y = y'HZHZXy = y'H*ZXy,

and

  E(SSres) = σ²tr(H*ZX) = σ²(n − k − r),

so that

  E[SSres/(n − k − r)] = σ².

So SSres/(n − k − r) is an unbiased estimator of σ².
MODULE – XII
Lecture – 32
Tests for Structural Change and Stability
A fundamental assumption in regression modelling is that the pattern of the data on the dependent and independent variables remains the same throughout the period over which the data are collected. Under such an assumption, a single linear regression model is fitted over the entire data set, and the model is estimated and used for prediction assuming that the parameters remain the same over the entire time period of estimation and prediction. When it is suspected that there exists a change in the pattern of the data, then fitting a single linear regression model may not be appropriate, and more than one regression model may need to be fitted. Before deciding whether to fit a single model or several, the question arises of how to test and decide whether there is a change in the structure or pattern of the data. Such changes can be characterized by a change in the parameters of the model and are termed a structural change.
Now we consider some examples to understand the problem of structural change in the data. Suppose data on the consumption pattern are available for several years, and suppose there was a war in between the years over which the consumption data are available. Obviously, the consumption pattern before and after the war does not remain the same, as the economy of the country gets disturbed. So if a model

  yi = β0 + β1Xi1 + ... + βkXik + εi,  i = 1, 2, ..., n,

is fitted, then the regression coefficients before and after the war period will change. Such a change is referred to as a structural break or structural change in the data. A better option in this case would be to fit two different linear regression models: one for the data before the war and another for the data after the war.
In another example, suppose the study variable is the salary of a person and the explanatory variable is the number of years of schooling. Suppose the objective is to find whether there is any discrimination in the salaries of males and females. To investigate this, two different regression models can be fitted: one for male employees and another for female employees. By calculating and comparing the regression coefficients of the two models, one can check for the presence of sex discrimination in the salaries of male and female employees.
Consider another example of structural change. Suppose an experiment is conducted to study certain objectives, and data are collected in the USA and in India. A question then arises whether the data sets from the two countries can be pooled together. The data sets can be pooled if they originate from the same model, in the sense that there is no structural change present in the data. In such a case, the presence of structural change in the data can be tested; if there is no change, then both data sets can be merged and a single regression model fitted, whereas if structural change is present, then two models need to be fitted.

The objective now is to test for the presence of structural change in the data, i.e., the stability of the regression coefficients. In other words, we want to test the hypothesis that some or all of the regression coefficients differ in different subsets of the data.
Analysis
We consider here a situation where only one structural change is present in the data, so the data can be divided into two parts. Suppose we have a data set of n observations which is divided into two parts consisting of n1 and n2 observations, such that

  n1 + n2 = n.

Based on this partition, the two models corresponding to the two subgroups are

  y1 = α·1n1 + X1β + ε1,
  y2 = α·1n2 + X2β + ε2,

where 1n1 and 1n2 denote the (n1 × 1) and (n2 × 1) vectors of ones. In matrix notation, writing a semicolon to separate the block rows of a partitioned matrix, we can write

  Model (1):  (y1; y2) = (1n1, X1; 1n2, X2)(α; β) + (ε1; ε2),

and term it Model (1). In this case, the intercept term and the regression coefficients are the same for both submodels, so there is no structural change in this situation.
The problem of structural change arises when the intercept terms and/or the regression coefficients in the submodels differ.

If the structural change is caused by a change in the intercept terms only, the situation is characterized by the following model:

  y1 = α1·1n1 + X1β + ε1,
  y2 = α2·1n2 + X2β + ε2,

or

  Model (2):  (y1; y2) = (1n1, 0, X1; 0, 1n2, X2)(α1; α2; β) + (ε1; ε2).
If the structural change is due to different intercept terms as well as different regression coefficients, then the model is

  y1 = α1·1n1 + X1β1 + ε1,
  y2 = α2·1n2 + X2β2 + ε2,

or

  Model (3):  (y1; y2) = (1n1, 0, X1, 0; 0, 1n2, 0, X2)(α1; α2; β1; β2) + (ε1; ε2).
The test of hypothesis related to structural change is conducted by testing any one of the following null hypotheses, depending upon the situation:

  (I)   H0 : α1 = α2,
  (II)  H0 : β1 = β2,
  (III) H0 : α1 = α2, β1 = β2.

To construct the test statistics, apply ordinary least squares estimation to models (1), (2) and (3) and obtain the residual sums of squares RSS1, RSS2 and RSS3, respectively. Note that the degrees of freedom associated with

  RSS1 from model (1) is n − (k + 1),
  RSS2 from model (2) is n − (k + 1 + 1) = n − (k + 2),
  RSS3 from model (3) is n − (k + 1 + k + 1) = n − 2(k + 1).
The null hypothesis H0 : α1 = α2 (i.e., different intercept terms) is tested by the statistic

  F = [(RSS1 − RSS2)/1] / [RSS2/(n − k − 2)],

which follows F(1, n − k − 2) under H0. This statistic tests α1 = α2 in model (2) using model (1), i.e., model (1) contrasted with model (2).

The null hypothesis H0 : β1 = β2 is tested by

  F = [(RSS2 − RSS3)/k] / [RSS3/(n − 2k − 2)],

which follows F(k, n − 2k − 2) under H0. This statistic tests β1 = β2 in model (3) using model (2), i.e., model (2) contrasted with model (3).

The null hypothesis H0 : α1 = α2, β1 = β2 (i.e., different intercepts and different slope parameters) is jointly tested by

  F = [(RSS1 − RSS3)/(k + 1)] / [RSS3/(n − 2k − 2)],

which follows F(k + 1, n − 2k − 2) under H0.
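These three F statistics can be computed directly by fitting models (1), (2) and (3) with a subgroup indicator and an interaction term. A sketch with hypothetical simulated data in which only the intercept shifts between the two subgroups (so the intercept and joint tests should reject while the slope test should not):

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares from an OLS fit of y on X."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b
    return float(e @ e)

rng = np.random.default_rng(5)
n1, n2, k = 40, 50, 1
x1 = rng.uniform(0, 10, n1); x2 = rng.uniform(0, 10, n2)
y1 = 1.0 + 2.0 * x1 + rng.normal(0, 1, n1)     # subgroup 1
y2 = 4.0 + 2.0 * x2 + rng.normal(0, 1, n2)     # subgroup 2: intercept shifted
n = n1 + n2
x = np.concatenate([x1, x2]); y = np.concatenate([y1, y2])
d = np.concatenate([np.zeros(n1), np.ones(n2)])   # subgroup indicator

RSS1 = rss(np.column_stack([np.ones(n), x]), y)              # model (1): common alpha, beta
RSS2 = rss(np.column_stack([np.ones(n), d, x]), y)           # model (2): different intercepts
RSS3 = rss(np.column_stack([np.ones(n), d, x, d * x]), y)    # model (3): both differ

F_intercept = ((RSS1 - RSS2) / 1) / (RSS2 / (n - k - 2))         # ~ F(1, n-k-2) under H0
F_slope = ((RSS2 - RSS3) / k) / (RSS3 / (n - 2 * k - 2))         # ~ F(k, n-2k-2) under H0
F_joint = ((RSS1 - RSS3) / (k + 1)) / (RSS3 / (n - 2 * k - 2))   # ~ F(k+1, n-2k-2) under H0
print(F_intercept, F_slope, F_joint)
```

Comparing each statistic with the corresponding F critical value completes the test; here the intercept and joint statistics should be large, the slope statistic small.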
To derive the joint test, write the two subgroup models and the pooled model as

  y1 = X1β1 + ε1,  (i)    (y1 : n1 × 1, X1 : n1 × p, β1 : p × 1)
  y2 = X2β2 + ε2,  (ii)   (y2 : n2 × 1, X2 : n2 × p, β2 : p × 1)
  y  = Xβ + ε,     (iii)  (y : n × 1, X : n × p, β : p × 1)

with n = n1 + n2 and p = k + 1 (the intercept column is now absorbed into X1, X2 and X). Define

  H1 = I1 − X1(X1'X1)⁻¹X1',
  H2 = I2 − X2(X2'X2)⁻¹X2',
  H  = I − X(X'X)⁻¹X',

where I1 and I2 are the identity matrices of orders (n1 × n1) and (n2 × n2). The residual sums of squares based on models (i), (ii) and (iii) are ε1'H1ε1, ε2'H2ε2 and ε'Hε, respectively. Define the (n × n) matrices

  H1* = (H1, 0; 0, 0),  H2* = (0, 0; 0, H2),

so that ε1'H1ε1 = ε'H1*ε and ε2'H2ε2 = ε'H2*ε, and the residual sums of squares RSS1 and RSS3 can be re-expressed as quadratic forms in ε.
We can write

  H = I − X(X'X)⁻¹X' = (I1, 0; 0, I2) − (X1; X2)(X'X)⁻¹(X1', X2')
    = (H11, H12; H21, H22),

where

  H11 = I1 − X1(X'X)⁻¹X1',  H12 = −X1(X'X)⁻¹X2',
  H21 = −X2(X'X)⁻¹X1',      H22 = I2 − X2(X'X)⁻¹X2'.

Define

  H* = H1* + H2*,

so that

  RSS3 = ε'H*ε,  RSS1 = ε'Hε.

Note that (H − H*) and H* are idempotent matrices and (H − H*)H* = 0. To see how this result holds, consider

  (H − H1*)H1* = (H11 − H1, H12; H21, H22)(H1, 0; 0, 0)
               = ((H11 − H1)H1, 0; H21H1, 0).

Since X1'H1 = 0, we have

  H21H1 = −X2(X'X)⁻¹X1'H1 = 0,
  H11H1 = [I1 − X1(X'X)⁻¹X1']H1 = H1, so (H11 − H1)H1 = H11H1 − H1 = 0.

Thus

  (H − H1*)H1* = 0, or HH1* = H1*.
Similarly, it can be shown that

  (H − H2*)H2* = 0, or HH2* = H2*.

Since also H1*H2* = 0 (the two matrices occupy disjoint diagonal blocks),

  (H − H1* − H2*)(H1* + H2*) = 0, or (H − H*)H* = 0.

Also, we have

  tr H = n − p,
  tr H* = tr H1* + tr H2* = (n1 − p) + (n2 − p) = n − 2p = n − 2k − 2.

Consequently,

  (RSS1 − RSS3)/σ² ~ χ²(p),
  RSS3/σ² ~ χ²(n − 2p),

and the two quadratic forms are independent because (H − H*)H* = 0. Hence

  [(RSS1 − RSS3)/p] / [RSS3/(n − 2p)] ~ F(p, n − 2p),

or

  [(RSS1 − RSS3)/(k + 1)] / [RSS3/(n − 2k − 2)] ~ F(k + 1, n − 2k − 2).
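The algebraic facts used above — HH1* = H1*, the idempotency of H and H*, (H − H*)H* = 0, and the traces tr H = n − p and tr H* = n − 2p — are easy to confirm numerically for a small hypothetical example:

```python
import numpy as np

rng = np.random.default_rng(6)
n1, n2, p = 8, 7, 3
X1 = rng.normal(size=(n1, p)); X2 = rng.normal(size=(n2, p))  # hypothetical data
X = np.vstack([X1, X2]); n = n1 + n2

def resid_maker(M):
    """Residual-maker (annihilator) matrix I - M(M'M)^{-1}M'."""
    return np.eye(M.shape[0]) - M @ np.linalg.inv(M.T @ M) @ M.T

H, H1, H2 = resid_maker(X), resid_maker(X1), resid_maker(X2)
Hs = np.zeros((n, n))          # H* = H1* + H2* (block diagonal)
Hs[:n1, :n1] = H1
Hs[n1:, n1:] = H2

print(np.allclose((H - Hs) @ Hs, 0))                       # (H - H*) H* = 0
print(np.allclose(Hs @ Hs, Hs), np.allclose(H @ H, H))     # idempotency
print(round(np.trace(H)), n - p)                           # tr H = n - p
print(round(np.trace(Hs)), n - 2 * p)                      # tr H* = n - 2p
```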
1. All these tests are based on the assumption that σ² remains the same in both subgroups. So the stability of σ² should be checked first, and only then can these tests be used.

2. It is assumed in these tests that the point of change is known exactly. In practice, it is difficult to find such a point at which the change occurs, and it is even more difficult when the change occurs slowly. These tests are not applicable when the point of change is unknown. An ad hoc technique when the point of change is unknown is to delete the data of the transition period.

3. When there is more than one point of structural change, the analysis becomes more difficult.
ECONOMETRIC THEORY
MODULE – XIII
Lecture - 33
Asymptotic Theory and Stochastic Regressors
In regression analysis, the explanatory variables are usually assumed to be non-stochastic, i.e., fixed in repeated samples. Such an assumption is appropriate for experiments conducted inside laboratories, where the experimenter can control the values of the explanatory variables; repeated observations on the study variable can then be obtained for fixed values of the explanatory variables. In practice, such an assumption may not always be satisfied. Sometimes, the explanatory variables in a given model are the study variables in another model; the study variable then depends on explanatory variables that are stochastic in nature. Under such situations, the statistical inferences drawn from the linear regression model under the assumption of fixed explanatory variables may not remain valid.

We now assume that the explanatory variables are stochastic but uncorrelated with the disturbance term. If they are correlated, the issue is addressed through instrumental variable estimation; such a situation arises, for example, in simultaneous equation models.
Let p(εi | xi') be the conditional probability density function of εi given xi', and p(εi) the unconditional probability density function of εi. If εi and xi' are independent, then p(εi | xi') = p(εi), so that

  E(εi | xi') = ∫ εi p(εi | xi') dεi = ∫ εi p(εi) dεi = E(εi) = 0,
  E(εi² | xi') = ∫ εi² p(εi | xi') dεi = ∫ εi² p(εi) dεi = E(εi²) = σ².
The additional assumption that the explanatory variables are stochastic poses no problem in the ordinary least squares estimation of β and σ². The OLSE of β is obtained by minimizing (y − Xβ)'(y − Xβ) with respect to β as

  b = (X'X)⁻¹X'y.

Assuming ε ~ N(0, σ²I) in the model y = Xβ + ε, with X stochastic and independent of ε, the joint probability density function of ε and X can be related to the joint probability density function of y and X as follows:

  f(ε, X) = f(ε1, ε2, ..., εn, x1', x2', ..., xn')
          = [∏(i=1..n) f(εi)] [∏(i=1..n) f(xi')]
          = [∏(i=1..n) f(yi | xi')] [∏(i=1..n) f(xi')]
          = ∏(i=1..n) f(yi | xi') f(xi')
          = ∏(i=1..n) f(yi, xi').

This implies that the maximum likelihood estimators of β and σ² will be based on

  ∏(i=1..n) f(yi | xi') = ∏(i=1..n) f(εi),

so they will be the same as those based on the assumption that the εi, i = 1, 2, ..., n, are distributed as N(0, σ²). The maximum likelihood estimators of β and σ² when the explanatory variables are stochastic are therefore

  β̃ = (X'X)⁻¹X'y,
  σ̃² = (1/n)(y − Xβ̃)'(y − Xβ̃).
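A simulation sketch (hypothetical data; all values made up) illustrates that with stochastic regressors drawn independently of the disturbances, the OLS/ML estimator of β remains unbiased across repeated samples:

```python
import numpy as np

rng = np.random.default_rng(7)
n, reps = 30, 2000
beta = np.array([1.0, 2.0])
est = np.empty((reps, 2))
for r in range(reps):
    X = np.column_stack([np.ones(n), rng.normal(size=n)])  # regressor redrawn each sample
    eps = rng.normal(0, 1, n)                              # drawn independently of X
    y = X @ beta + eps
    est[r] = np.linalg.lstsq(X, y, rcond=None)[0]

print(est.mean(axis=0))   # close to beta: unbiased despite stochastic X
```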
Alternative approach for deriving the maximum likelihood estimates
Alternatively, the maximum likelihood estimators of β and σ² can be derived using the joint probability density function of y and x'.

Note: In this section the vector x' is of order 1 × (k − 1); it excludes the intercept term.

Let xi', i = 1, 2, ..., n, come from a multivariate normal distribution with mean vector μx and covariance matrix Σxx, i.e., xi' ~ N(μx, Σxx), and suppose the joint distribution of y and x' is

  (y; x) ~ N( (μy; μx), Σ ),  Σ = (σyy, Σyx; Σxy, Σxx),

where a semicolon separates block rows. The joint density is

  f(y, x') = (2π)^(−k/2) |Σ|^(−1/2) exp{ −(1/2) (y − μy; x − μx)' Σ⁻¹ (y − μy; x − μx) }.

Result: If

  A = (B, C; D, E),

where E and F = B − CE⁻¹D are nonsingular matrices, then

  A⁻¹ = (F⁻¹, −F⁻¹CE⁻¹; −E⁻¹DF⁻¹, E⁻¹ + E⁻¹DF⁻¹CE⁻¹)

(note that AA⁻¹ = A⁻¹A = I). Thus

  Σ⁻¹ = (1/σ²)(1, −ΣyxΣxx⁻¹; −Σxx⁻¹Σxy, σ²Σxx⁻¹ + Σxx⁻¹ΣxyΣyxΣxx⁻¹),

where

  σ² = σyy − ΣyxΣxx⁻¹Σxy.

Then

  f(y, x') = (2π)^(−k/2)|Σ|^(−1/2) exp{ −(1/2σ²)( [y − μy − (x − μx)'Σxx⁻¹Σxy]² + σ²(x − μx)'Σxx⁻¹(x − μx) ) }.

The marginal distribution of x' is obtained by integrating f(y, x') over y; the result is a (k − 1)-variate multivariate normal distribution,

  g(x') = (2π)^(−(k−1)/2)|Σxx|^(−1/2) exp{ −(1/2)(x − μx)'Σxx⁻¹(x − μx) }.
The conditional probability density function of y given x' is

  f(y | x') = f(y, x')/g(x')
            = (2πσ²)^(−1/2) exp{ −(1/2σ²)[(y − μy) − (x − μx)'Σxx⁻¹Σxy]² },

a normal density with conditional mean

  E(y | x') = μy + (x − μx)'Σxx⁻¹Σxy

and conditional variance

  Var(y | x') = σyy(1 − ρ²),

where

  ρ² = ΣyxΣxx⁻¹Σxy / σyy

is the squared population multiple correlation coefficient.

In the model

  y = β0 + x'β1 + ε,

the conditional mean is

  E(y | x') = β0 + x'β1 + E(ε | x') = β0 + x'β1.
Comparing this conditional mean with the conditional mean of the normal distribution, we obtain the relationships of β0 and β1 as

  β1 = Σxx⁻¹Σxy,
  β0 = μy − μx'β1.

The likelihood function based on n observations is

  L = (2π)^(−nk/2)|Σ|^(−n/2) exp{ −(1/2) Σ(i=1..n) (yi − μy; xi − μx)' Σ⁻¹ (yi − μy; xi − μx) }.

Maximizing the log-likelihood function with respect to μy, μx, Σxx and Σxy, the maximum likelihood estimates of the respective parameters are obtained as

  μ̂y = ȳ = (1/n) Σ(i=1..n) yi,
  μ̂x = x̄ = (1/n) Σ(i=1..n) xi = (x̄2, x̄3, ..., x̄k)',
  Σ̂xx = Sxx = (1/n)( Σ(i=1..n) xixi' − n x̄x̄' ),
  Σ̂xy = Sxy = (1/n)( Σ(i=1..n) xiyi − n x̄ȳ ),

where xi' = (xi2, xi3, ..., xik), Sxx is the [(k − 1) × (k − 1)] matrix with elements (1/n) Σt (xti − x̄i)(xtj − x̄j), and Sxy is the [(k − 1) × 1] vector with elements (1/n) Σt (xti − x̄i)(yt − ȳ).
Based on these estimates, the maximum likelihood estimators of β0 and β1 are obtained as

  β̂1 = Sxx⁻¹Sxy,
  β̂0 = ȳ − x̄'β̂1.

Stacked together,

  β̂ = (β̂0; β̂1) = (X'X)⁻¹X'y,

the same as the OLSE b. To study the unbiasedness of b, write

  b − β = (X'X)⁻¹X'(Xβ + ε) − β = (X'X)⁻¹X'ε.

Then

  E(b − β) = E[ E{(X'X)⁻¹X'ε | X} ]
           = E[ (X'X)⁻¹X'E(ε) ]
           = 0,

because (X'X)⁻¹X' and ε are independent. So b is an unbiased estimator of β.
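The identity between the moment-based maximum likelihood route (β̂1 = Sxx⁻¹Sxy, β̂0 = ȳ − x̄'β̂1) and the direct least squares solution (X'X)⁻¹X'y can be confirmed numerically on hypothetical data:

```python
import numpy as np

rng = np.random.default_rng(8)
n = 40
x = rng.normal(size=(n, 2))                 # two non-constant regressors (hypothetical)
y = 1.0 + x @ np.array([0.5, -1.0]) + rng.normal(0, 0.2, n)

# moment-based estimates (the maximum likelihood route)
xbar, ybar = x.mean(axis=0), y.mean()
Sxx = x.T @ x / n - np.outer(xbar, xbar)
Sxy = x.T @ y / n - xbar * ybar
beta1_hat = np.linalg.solve(Sxx, Sxy)
beta0_hat = ybar - xbar @ beta1_hat

# direct OLS with an intercept column
X = np.column_stack([np.ones(n), x])
b = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(np.concatenate([[beta0_hat], beta1_hat]), b))
```

The two routes coincide exactly, since centring the data is equivalent to including an intercept column.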
The covariance matrix of b is obtained as

  V(b) = E(b − β)(b − β)' = E[(X'X)⁻¹X'εε'X(X'X)⁻¹] = σ²E[(X'X)⁻¹].

Thus the covariance matrix involves a mathematical expectation over X. The unknown σ² can be estimated by

  σ̂² = e'e/(n − k) = (y − Xb)'(y − Xb)/(n − k),

where e = y − Xb is the residual, and

  E(σ̂²) = E[ E(σ̂² | X) ] = E[ E{e'e/(n − k) | X} ] = E(σ²) = σ².
Note that the OLSE b = (X'X)⁻¹X'y involves the stochastic matrix X and the stochastic vector y, so b is not a linear estimator. It is also no longer the best linear unbiased estimator of β, as it is in the case when X is non-stochastic.
MODULE – XIII
Lecture - 34
Asymptotic Theory and Stochastic Regressors
Asymptotic theory
The asymptotic properties of an estimator concern its behaviour when the sample size n grows large. To motivate the need for asymptotic theory, consider the simple linear regression model with one explanatory variable and n observations,

  yi = β0 + β1xi + εi,  E(εi) = 0, Var(εi) = σ²,  i = 1, 2, ..., n.

The OLSE of β1 is

  b1 = Σ(i=1..n)(xi − x̄)(yi − ȳ) / Σ(i=1..n)(xi − x̄)².

As the sample size n increases, the probability density of the OLSE b collapses around its mean β, because Var(b) tends to zero.

Let there be three OLSEs b1, b2 and b3 based on sample sizes n1, n2 and n3, respectively, with n1 < n2 < n3, say. If c and δ are arbitrarily chosen positive constants, then the probability that the value of b lies within the interval β ± c can be made greater than (1 − δ) for a large value of n. This property is the consistency of b, which ensures that for a very large sample we can be confident, with high probability, that b lies arbitrarily close to the true value β.
Probability in limit
Let β̂n be an estimator of β based on a sample of size n, and let γ be any small positive constant. For large n, the requirement that β̂n takes values with probability almost one in an arbitrarily small neighbourhood of the true parameter value β is

  lim(n→∞) P(|β̂n − β| < γ) = 1,

written as

  plim β̂n = β,

and we say that β̂n converges to β in probability. The estimator β̂n is then said to be a consistent estimator of β.

A sufficient (but not necessary) condition for β̂n to be a consistent estimator of β is that

  lim(n→∞) E(β̂n) = β and lim(n→∞) Var(β̂n) = 0.
Consistency of estimators

(i) Consistency of b

Under the assumption that lim(n→∞)(X'X/n) exists and is nonsingular,

  Var(b) = σ²(X'X)⁻¹ = (σ²/n)(X'X/n)⁻¹ → 0 as n → ∞,

while E(b) = β for every n. This implies that the OLSE converges to β in quadratic mean. Thus the OLSE is a consistent estimator of β; this holds true for the maximum likelihood estimator also.

The same conclusion can also be proved using the concept of convergence in probability. The consistency of the OLSE can be obtained under the following weaker assumptions:

(i) plim (X'X/n) = Δ* exists and is a nonsingular and nonstochastic matrix.

(ii) plim (X'ε/n) = 0.
Since

  b − β = (X'X)⁻¹X'ε = (X'X/n)⁻¹(X'ε/n),

we have

  plim(b − β) = plim(X'X/n)⁻¹ · plim(X'ε/n) = Δ*⁻¹ · 0 = 0.

Thus b is a consistent estimator of β. The same is true for the maximum likelihood estimator also.
(ii) Consistency of s²

Write

  s² = ε'Hε/(n − k) = (1 − k/n)⁻¹ [ ε'ε/n − (ε'X/n)(X'X/n)⁻¹(X'ε/n) ].

Note that ε'ε/n = (1/n) Σ(i=1..n) εi², and {εi², i = 1, 2, ..., n} is a sequence of independently and identically distributed random variables with mean σ². Using the law of large numbers,

  plim(ε'ε/n) = σ²,

and

  plim[(ε'X/n)(X'X/n)⁻¹(X'ε/n)] = plim(ε'X/n) · plim(X'X/n)⁻¹ · plim(X'ε/n) = 0 · Δ*⁻¹ · 0 = 0,

so that

  plim(s²) = (1 − 0)⁻¹ (σ² − 0) = σ².

Thus s² is a consistent estimator of σ². The same holds true for the maximum likelihood estimator also.
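Consistency of b and s² can be illustrated by fitting the model at increasing sample sizes (hypothetical simulated data); the estimation errors shrink toward zero as n grows:

```python
import numpy as np

rng = np.random.default_rng(9)
beta, sigma = np.array([1.0, 2.0]), 1.5

def fit(n):
    """Fit a simple linear model on n simulated observations; return (b, s2)."""
    X = np.column_stack([np.ones(n), rng.uniform(0, 1, n)])
    y = X @ beta + rng.normal(0, sigma, n)
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ b
    return b, e @ e / (n - 2)

for n in (50, 500, 50000):
    b, s2 = fit(n)
    print(n, np.abs(b - beta).max(), abs(s2 - sigma**2))
```

At the largest n, both the coefficient errors and |s² − σ²| are close to zero, as consistency predicts.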
Asymptotic distributions
Suppose we have a sequence of random variables {αn} with a corresponding sequence of cumulative distribution functions {Fn}, and a random variable α with cumulative distribution function F. Then αn converges in distribution to α if Fn converges to F pointwise (at all continuity points of F). In this case, F is called the asymptotic distribution of αn. Convergence in probability implies convergence in distribution:

  plim αn = α ⇒ αn →d α,

i.e., the asymptotic distribution of αn is F, the distribution of α.

Consider the sample mean Yn based on n observations on a variable with population mean Y. Note that

  plim Yn = Y,

which is a constant. Thus the asymptotic distribution of Yn is the distribution of a constant. This is not a regular distribution, as all the probability mass is concentrated at one point: as the sample size increases, the distribution of Yn collapses.
Suppose instead we consider only one-third of the observations in the sample and compute the sample mean

  Yn* = (3/n) Σ(i=1..n/3) Yi.

Then

  E(Yn*) = Y

and

  Var(Yn*) = (9/n²) Σ(i=1..n/3) Var(Yi) = (9/n²)(n/3)σ² = 3σ²/n → 0 as n → ∞.
Thus plim Yn* = Y, and Yn* has the same degenerate asymptotic distribution as Yn. Since Var(Yn*) > Var(Yn) for every n, Yn should be preferred over Yn*, but both estimators are consistent and their degenerate asymptotic distributions cannot distinguish between them.

To observe the asymptotic behaviour of Yn and Yn*, consider the scaled sequences of random variables

  αn = √n (Yn − Y),
  αn* = √n (Yn* − Y).

For all n we have

  E(αn) = √n E(Yn − Y) = 0,
  E(αn*) = √n E(Yn* − Y) = 0,
  Var(αn) = n E(Yn − Y)² = n (σ²/n) = σ²,
  Var(αn*) = n E(Yn* − Y)² = n (3σ²/n) = 3σ².

Assuming the population to be normal, the asymptotic distribution of

  αn = √n (Yn − Y) is N(0, σ²),
  αn* = √n (Yn* − Y) is N(0, 3σ²).
So now Yn is clearly preferable over Yn*. The central limit theorem can be used to show that αn has an asymptotically normal distribution even if the population is not normally distributed. Also, since

  √n (Yn − Y) ~ N(0, σ²)  asymptotically,

we have

  Z = √n (Yn − Y)/σ ~ N(0, 1),

and this statement holds for the finite-sample distribution (under normality) as well as for the asymptotic distribution.

Similarly, consider the asymptotic distribution of √n (b − β). Even if ε is not necessarily normally distributed, asymptotically

  √n (b − β) ~ N(0, σ²Σxx⁻¹),
  n (b − β)'Σxx(b − β)/σ² ~ χ²(k),

where Σxx = plim(X'X/n). If X'X/n is considered as an estimator of Σxx, then

  n (b − β)'(X'X/n)(b − β)/σ² = (b − β)'X'X(b − β)/σ²

has the same asymptotic χ²(k) distribution.
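The contrast between αn = √n(Yn − Y) and αn* = √n(Yn* − Y) can be simulated. The sketch below (hypothetical setup) draws from a uniform, i.e., non-normal, population, so the approximate normality of the scaled means comes from the central limit theorem; the sample variances of the two scaled statistics come out near σ² and 3σ², respectively:

```python
import numpy as np

rng = np.random.default_rng(10)
n, reps = 400, 5000
mu, sigma2 = 0.5, 1.0 / 12.0            # mean and variance of U(0, 1)

Y = rng.uniform(0, 1, size=(reps, n))   # the population is uniform, not normal
a_full = np.sqrt(n) * (Y.mean(axis=1) - mu)              # sqrt(n)(Yn - Y)
a_third = np.sqrt(n) * (Y[:, : n // 3].mean(axis=1) - mu)  # uses only n/3 observations

print(a_full.var(), sigma2)        # ~ sigma^2
print(a_third.var(), 3 * sigma2)   # ~ 3 sigma^2: the wasteful mean is noisier
```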
MODULE – XIV
Lecture - 35
Stein-Rule Estimation
The ordinary least squares estimation of regression coefficients in the linear regression model provides estimators that are linear, unbiased and have minimum variance among linear unbiased estimators.

The criterion of linearity is desirable because such estimators involve less mathematical complexity and are easy to compute.

The criterion of unbiasedness is attractive because it is intuitively desirable that the expected value of the estimator, i.e., its mean, should be the same as the parameter being estimated.
Insisting on linear and unbiased estimators may sometimes exact an unacceptably high price in terms of the variability of the estimator around the true parameter. To improve upon such an estimator, one possibility is to consider a nonlinear and biased estimator which may have better statistical properties.
It is to be noted that one of the main objectives of estimation is to find an estimator whose values have high concentration
around the true parameter. Sometimes it is possible to have a nonlinear and biased estimator that has smaller variability
than the variability of best linear unbiased estimator of the parameter under some mild restrictions.
Consider the linear regression model

y = Xβ + ε,  E(ε) = 0,  V(ε) = σ²I,

where y is n×1, X is n×k, β is k×1 and ε is n×1. The covariance matrix of the ordinary least squares estimator (OLSE) b = (X'X)⁻¹X'y is

V(b) = E(b − β)(b − β)' = σ²(X'X)⁻¹.
Consider the weighted mean squared error

E(β̂ − β)'W(β̂ − β) = Σᵢ Σⱼ wᵢⱼ E(β̂ᵢ − βᵢ)(β̂ⱼ − βⱼ)

where W is a k×k fixed positive definite matrix of weights wᵢⱼ. Two popular choices of the weight matrix W are:

(i) W = I, the identity matrix; then E(β̂ − β)'(β̂ − β) is called the total mean squared error (MSE) of β̂.

(ii) W = X'X; then

E(β̂ − β)'X'X(β̂ − β) = E(Xβ̂ − Xβ)'(Xβ̂ − Xβ)

is called the predictive mean squared error of β̂. Note that Xβ̂ is the predictor of the average value E(y) = Xβ, and (Xβ̂ − Xβ) is the corresponding prediction error.

There can be other choices of W; it depends entirely on the analyst how to define the loss function so that the variability is minimized.
If a random vector with k elements (k > 2) is normally distributed as N(µ, I), µ being the mean vector, then Stein established that if linearity and unbiasedness are dropped, it is possible to improve upon the maximum likelihood estimator of µ.

Later, this result was generalized by James and Stein for the linear regression model. They demonstrated that if the criteria of linearity and unbiasedness of the estimators are dropped, then a nonlinear estimator can be obtained which has better performance than the best linear unbiased estimator under the criterion of predictive MSE. In other words, James and Stein established that the OLSE is inadmissible under the predictive MSE criterion for k > 2, i.e., for k > 2 there exists an estimator β̂ such that

E(β̂ − β)'X'X(β̂ − β) ≤ E(b − β)'X'X(b − β)

for all values of β, with strict inequality holding for some values of β, and we say that "b can be beaten" in this sense. For k ≤ 2, no such estimator exists. Thus for k > 2 a nonlinear and biased estimator can be defined which has better performance than the OLSE.
The Stein-rule estimator of β is

β̂ = [1 − c σ²/(b'X'Xb)] b  when σ² is known,

and

β̂ = [1 − c (e'e)/(b'X'Xb)] b  when σ² is unknown.

Here c is a fixed positive characterizing scalar, e = y − Xb is the residual vector and e'e is the residual sum of squares based on the OLSE. By assigning different values to c, we can generate different estimators, so a class of estimators characterized by c can be defined.
Let

δ = 1 − c σ²/(b'X'Xb)

be a scalar quantity. Then

β̂ = δ b.

So essentially, instead of estimating β₁, β₂, …, βₖ by b₁, b₂, …, bₖ, we estimate them by δb₁, δb₂, …, δbₖ, respectively. In order to increase the efficiency, the OLSE is multiplied by the constant δ, which is therefore called the shrinkage factor. As Stein-rule estimators attempt to shrink the components of b towards zero, they are also known as shrinkage estimators.
First we discuss a result which is used to prove the dominance of Stein-rule estimator over OLSE.
Result: Suppose a random vector Z of order k×1 is normally distributed as N(µ, I), where µ is the mean vector and I is the covariance matrix. Then

E[Z'(Z − µ)/(Z'Z)] = (k − 2) E[1/(Z'Z)].

An important point to be noted in this result is that the left-hand side involves µ explicitly whereas the right-hand side does not.
The mean of the Stein-rule estimator (with σ² known) is

E(β̂) = E(b) − c σ² E[b/(b'X'Xb)]
= β − (in general, a non-zero quantity)
≠ β, in general.

Thus the Stein-rule estimator is biased, whereas the OLSE b is unbiased for β.
Define the predictive risks

PR(b) = E(b − β)'X'X(b − β)

PR(β̂) = E(β̂ − β)'X'X(β̂ − β).

The Stein-rule estimator β̂ is better than the OLSE b under the criterion of predictive risk if

PR(β̂) < PR(b).
Solving the expressions, we first get

b − β = (X'X)⁻¹X'ε,

so that

PR(b) = E[ε'X(X'X)⁻¹X'X(X'X)⁻¹X'ε] = σ² tr[(X'X)⁻¹X'X] = σ² tr Iₖ = σ²k.

For the Stein-rule estimator,

PR(β̂) = E{(b − β) − cσ² b/(b'X'Xb)}' X'X {(b − β) − cσ² b/(b'X'Xb)}

= E(b − β)'X'X(b − β) − cσ² E[{b'X'X(b − β) + (b − β)'X'Xb}/(b'X'Xb)] + c²σ⁴ E[(b'X'Xb)/(b'X'Xb)²]

= σ²k − 2cσ² E[(b − β)'X'Xb/(b'X'Xb)] + c²σ⁴ E[1/(b'X'Xb)].
Suppose

Z = (1/σ)(X'X)^{1/2} b  or  b = σ(X'X)^{−1/2} Z,

µ = (1/σ)(X'X)^{1/2} β  or  β = σ(X'X)^{−1/2} µ,

and Z ~ N(µ, I), i.e., Z₁, Z₂, …, Zₖ are independent. Substituting these values in the expression for PR(β̂), we get
PR(β̂) = σ²k − 2cσ² E[σ²(Z − µ)'Z/(σ²Z'Z)] + c²σ⁴ E[1/(σ²Z'Z)]

= σ²k − 2cσ² E[Z'(Z − µ)/(Z'Z)] + c²σ² E[1/(Z'Z)]

= σ²k − cσ² [2(k − 2) − c] E[1/(Z'Z)]   (using the Result)

= PR(b) − cσ² [2(k − 2) − c] E[1/(Z'Z)].
Thus

PR(β̂) < PR(b)

if and only if

cσ² [2(k − 2) − c] E[1/(Z'Z)] > 0.

Since Z ~ N(µ, I), Z'Z has a non-central chi-square distribution and σ² > 0, so

E[1/(Z'Z)] > 0

⇒ c [2(k − 2) − c] > 0.

Since c > 0 is assumed, this inequality holds true when

2(k − 2) − c > 0, i.e., 0 < c < 2(k − 2), provided k > 2.
So as long as 0 < c < 2(k − 2) is satisfied, the Stein-rule estimator will have smaller predictive risk than the OLSE. This inequality cannot be satisfied for k = 1 or k = 2. The largest gain in efficiency arises when c = k − 2. So if the number of explanatory variables is more than two, then it is always possible to construct an estimator which is better than the OLSE.
When σ² is unknown and the Stein-rule estimator based on e'e is used, the corresponding choice of the characterizing scalar is

c = (k − 2)/(n − k + 2).
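The dominance result can be illustrated with a small Monte Carlo sketch. The design below (dimensions, σ and the value of β are arbitrary assumptions, not from the lecture) uses the known-σ² form of the estimator with c = k − 2 and compares empirical predictive risks:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, sigma = 50, 6, 2.0
c = k - 2                                  # inside the dominance range 0 < c < 2(k - 2)
X = rng.normal(size=(n, k))
beta = np.full(k, 0.5)

reps = 2000
pr_ols = pr_stein = 0.0
for _ in range(reps):
    y = X @ beta + rng.normal(scale=sigma, size=n)
    b = np.linalg.solve(X.T @ X, X.T @ y)              # OLSE
    bhat = (1 - c * sigma**2 / (b @ X.T @ X @ b)) * b  # Stein-rule, sigma^2 known
    pr_ols += (b - beta) @ X.T @ X @ (b - beta)
    pr_stein += (bhat - beta) @ X.T @ X @ (bhat - beta)

print(pr_ols / reps)    # ≈ σ²k = 24
print(pr_stein / reps)  # smaller: the Stein-rule dominates for k > 2
```

The empirical predictive risk of the Stein-rule estimator falls below σ²k, in line with the inequality derived above.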
ECONOMETRIC THEORY
MODULE – XV
Lecture - 36
Instrumental Variables Estimation
A basic assumption in analyzing the performance of estimators in a multiple regression is that the explanatory variables and
disturbance terms are independently distributed. The violation of such assumption disturbs the optimal properties of the
estimators. The instrumental variable estimation method helps in estimating the regression coefficients in multiple linear
regression model when such violation occurs.
Suppose one or more explanatory variables are correlated with the disturbances in the limit, i.e.,

plim (1/n) X'ε ≠ 0.
The consequences of such a violation for the ordinary least squares estimator are as follows:

b = (X'X)⁻¹X'y = (X'X)⁻¹X'(Xβ + ε)

b − β = (X'X)⁻¹X'ε = (X'X/n)⁻¹ (X'ε/n)

plim (b − β) = plim (X'X/n)⁻¹ plim (X'ε/n) ≠ 0,

assuming plim (X'X/n) = Σ_XX exists and is nonsingular.
Suppose that it is possible to find a data matrix Z of order n×k with the following properties:

(i) plim (Z'X/n) = Σ_ZX is a finite and nonsingular matrix of full rank, i.e., the variables in Z are correlated with those in X, in the limit.

(ii) plim (Z'ε/n) = 0, i.e., the variables in Z are uncorrelated with ε, in the limit.

(iii) plim (Z'Z/n) = Σ_ZZ exists.

If some of the X variables are likely to be uncorrelated with ε, then these can be used to form some of the columns of Z, and extraneous variables need to be found only for the remaining columns.
First we understand the role of the term X'ε in OLS estimation. The OLSE b of β is derived by solving

∂(y − Xβ)'(y − Xβ)/∂β = 0

or X'y = X'Xb

or X'(y − Xb) = 0.

Now we look at this normal equation as if it were obtained by pre-multiplying y = Xβ + ε by X'.
The disappearance of X'ε can be explained when X and ε are uncorrelated as follows. Let

plim (X'X/n) = Σ_XX

plim (X'y/n) = Σ_Xy

where the population cross moments Σ_XX and Σ_Xy are finite and Σ_XX is nonsingular. If

plim (X'ε/n) = 0,

then

β = Σ_XX⁻¹ Σ_Xy.
If Σ_XX is estimated by the sample cross moment X'X/n and Σ_Xy is estimated by the sample cross moment X'y/n, then the OLS estimator of β is obtained as

b = (X'X/n)⁻¹ (X'y/n) = (X'X)⁻¹X'y.
Such an analysis suggests using Z to pre-multiply the multiple regression model as follows:

Z'y = Z'Xβ + Z'ε

Z'y/n = (Z'X/n) β + Z'ε/n

plim (Z'y/n) = plim (Z'X/n) β + plim (Z'ε/n)

⇒ β = [plim (Z'X/n)]⁻¹ [plim (Z'y/n) − plim (Z'ε/n)] = Σ_ZX⁻¹ Σ_Zy.

The corresponding sample analogue

β̂_IV = (Z'X)⁻¹ Z'y

is termed the instrumental variable estimator of β, and this method is called the instrumental variable method.
Since

β̂_IV − β = (Z'X)⁻¹Z'(Xβ + ε) − β = (Z'X)⁻¹Z'ε,

we have

plim (β̂_IV − β) = plim (Z'X)⁻¹Z'ε
= plim (Z'X/n)⁻¹ plim (Z'ε/n)
= Σ_ZX⁻¹ · 0

⇒ plim β̂_IV = β.
Note that the variables Z1 , Z 2 ,..., Z k in Z are chosen such that they are uncorrelated with ε and correlated with X, at
least asymptotically so that the second order moment matrix Σ ZX exists and is nonsingular.
Asymptotic distribution

√n (β̂_IV − β) = (Z'X/n)⁻¹ (1/√n) Z'ε.

The asymptotic covariance matrix is

AsyVar(β̂_IV) = plim [(Z'X/n)⁻¹ (Z'E(εε')Z/n) (X'Z/n)⁻¹]

= σ² plim (Z'X/n)⁻¹ plim (Z'Z/n) plim (X'Z/n)⁻¹

= σ² Σ_ZX⁻¹ Σ_ZZ Σ_XZ⁻¹.

For large samples,

V(β̂_IV) = (σ²/n) Σ_ZX⁻¹ Σ_ZZ Σ_XZ⁻¹,

which is estimated by

V̂(β̂_IV) = s² (Z'X)⁻¹ Z'Z (X'Z)⁻¹,

where b = (X'X)⁻¹X'y and s² = (y − Xb)'(y − Xb)/(n − k).
The variance of βˆIV is not necessarily a minimum asymptotic variance because there can be more than one sets of
instrumental variables that fulfill the requirement of being uncorrelated with ε and correlated with stochastic regressors.
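The consistency argument can be illustrated with a minimal single-regressor simulation (the data-generating values below are arbitrary assumptions, not from the lecture). The regressor is built to be correlated with the disturbance, so OLS converges to the wrong value while the IV estimator (Z'X)⁻¹Z'y recovers β:

```python
import numpy as np

rng = np.random.default_rng(1)
n, beta = 100_000, 2.0
z = rng.normal(size=n)                         # instrument: correlated with x, not with eps
eps = rng.normal(size=n)
x = 0.8 * z + 0.6 * eps + rng.normal(size=n)   # x is endogenous: corr(x, eps) != 0
y = beta * x + eps

b_ols = (x @ y) / (x @ x)   # OLS: plim = beta + cov(x, eps)/var(x) = 2.3 here
b_iv = (z @ y) / (z @ x)    # IV: scalar case of (Z'X)^{-1} Z'y, plim = 2.0

print(b_ols, b_iv)
```

With this design cov(x, ε) = 0.6 and var(x) = 2, so the OLS estimate settles near 2.3 while the IV estimate settles near the true β = 2.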
ECONOMETRIC THEORY
MODULE – XVI
Lecture – 37
Measurement Error Models
A fundamental assumption in all the statistical analysis is that all the observations are correctly measured. In the context of
multiple regression model, it is assumed that the observations on study and explanatory variables are observed without
any error. In many situations, this basic assumption is violated. There can be several reasons for such violation.
For example, the variables may not be measurable, e.g., taste, climatic conditions, intelligence, education, ability
etc. In such cases, the dummy variables are used and the observations can be recorded in terms of values of
dummy variables.
Sometimes the variables are clearly defined, but it is hard to take correct observations. For example, age is generally reported in completed years or in multiples of five.
Sometimes the variable is conceptually well defined but it is not possible to take correct observation on it.
Instead, the observations are obtained on closely related proxy variables, e.g., the level of education is measured
by the number of years of schooling.
Sometimes the variable is well understood but is qualitative in nature. For example, intelligence is measured by intelligence quotient (IQ) scores.
In all such cases, the true value of the variable cannot be recorded. Instead, it is observed with some error. The difference between the observed and true values of the variable is called measurement error or errors-in-variables.
If the magnitude of measurement errors is small, then they can be assumed to be merged in the disturbance term and they
will not affect the statistical inferences much. On the other hand, if they are large in magnitude, then they will lead to
incorrect and invalid statistical inferences. For example, in the context of linear regression model, the ordinary least
squares estimator (OLSE) is the best linear unbiased estimator of regression coefficient when measurement errors are
absent. When the measurement errors are present in the data, the same OLSE becomes biased as well as inconsistent
estimator of regression coefficients.
Let the observed values be related to the true values as

y = ỹ + u

X = X̃ + V

where y is an (n×1) vector of observed values of the study variable, observed with the (n×1) measurement error vector u, and ỹ denotes the true (unobserved) values. Similarly, X is an (n×k) matrix of observed values of the explanatory variables, observed with the (n×k) matrix V of measurement errors in X̃. In such a case, the usual disturbance term can be assumed to be subsumed in u without loss of generality. Since our aim is to see the impact of measurement errors, the usual disturbance is not considered separately in the present case.
Alternatively, the same setup can be expressed as

y = X̃β + u

X = X̃ + V

where it can be assumed that only X is measured with measurement errors V, and u can be considered as the usual disturbance term in the model. In case some of the explanatory variables are measured without any measurement error, the corresponding columns of V are set to zero.

We assume that

E(u) = 0,  E(uu') = σ²I,

E(V) = 0,  E(V'V) = Ω,  E(V'u) = 0.
Suppose we ignore the measurement errors and obtain the OLSE. Note that ignoring the measurement errors in the data
does not mean that they are not present. We now observe the properties of such an OLSE under the setup of measurement
error model.
The OLSE is

b = (X'X)⁻¹X'y.

Writing y = Xβ + ω with ω = u − Vβ (since y = X̃β + u = (X − V)β + u), we get

b − β = (X'X)⁻¹X'ω

E(b − β) = E[(X'X)⁻¹X'ω] ≠ (X'X)⁻¹X'E(ω), in general,

so that E(b − β) ≠ 0 in general.
Then

plim (b − β) = plim (X'X/n)⁻¹ plim (X'ω/n).

Now

(1/n) X'X = (1/n)(X̃ + V)'(X̃ + V) = (1/n) X̃'X̃ + (1/n) X̃'V + (1/n) V'X̃ + (1/n) V'V,

so

plim (X'X/n) = Σ_xx + 0 + 0 + Σ_vv = Σ_xx + Σ_vv,

where Σ_xx = plim X̃'X̃/n and Σ_vv = plim V'V/n.
Similarly,

(1/n) X'ω = (1/n) X̃'ω + (1/n) V'ω = (1/n) X̃'(u − Vβ) + (1/n) V'(u − Vβ),

so

plim (X'ω/n) = 0 − 0 + 0 − Σ_vv β = −Σ_vv β.

Hence

plim (b − β) = −(Σ_xx + Σ_vv)⁻¹ Σ_vv β ≠ 0.
Thus b is an inconsistent estimator of β. Such inconsistency arises essentially due to correlation between X and ω.
To illustrate, consider the univariate model

ỹᵢ = β₀ + β₁x̃ᵢ,  i = 1, 2, …, n

yᵢ = ỹᵢ + uᵢ

xᵢ = x̃ᵢ + vᵢ.

Now X̃, X and V are the n×2 matrices with rows (1, x̃ᵢ), (1, xᵢ) and (0, vᵢ), respectively, so that

Σ_xx = plim (1/n) X̃'X̃ = plim [1, (1/n)Σᵢ x̃ᵢ; (1/n)Σᵢ x̃ᵢ, (1/n)Σᵢ x̃ᵢ²] = [1, µ; µ, σₓ² + µ²].
Also,

Σ_vv = plim (1/n) V'V = [0, 0; 0, σᵥ²].
Now

plim (b₀ − β₀, b₁ − β₁)' = −(Σ_xx + Σ_vv)⁻¹ Σ_vv β

= − [1, µ; µ, σₓ² + µ² + σᵥ²]⁻¹ [0, 0; 0, σᵥ²] (β₀, β₁)'

= − [1/(σₓ² + σᵥ²)] [σₓ² + µ² + σᵥ², −µ; −µ, 1] (0, σᵥ²β₁)'

= ( µσᵥ²β₁/(σₓ² + σᵥ²),  −σᵥ²β₁/(σₓ² + σᵥ²) )',

so that

plim b₀ = β₀ + µσᵥ²β₁/(σₓ² + σᵥ²),  plim b₁ = β₁σₓ²/(σₓ² + σᵥ²).
Thus we find that the OLSEs of β₀ and β₁ are biased and inconsistent. So the presence of measurement errors in even a single variable not only makes the OLSE of its own coefficient inconsistent but also makes the estimates of the coefficients of the other variables, which are measured without any error, inconsistent.
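The two plim expressions above can be checked numerically. In the sketch below (all parameter values are arbitrary assumptions), the OLS slope converges to β₁σₓ²/(σₓ² + σᵥ²) rather than to β₁, and the intercept is biased as well:

```python
import numpy as np

rng = np.random.default_rng(2)
n, beta0, beta1 = 200_000, 1.0, 2.0
sig_x, sig_v, sig_u, mu = 1.0, 0.5, 0.3, 3.0

x_true = rng.normal(mu, sig_x, n)
x = x_true + rng.normal(0.0, sig_v, n)       # explanatory variable with measurement error
y = beta0 + beta1 * x_true + rng.normal(0.0, sig_u, n)

# OLS on the error-ridden x
X = np.column_stack([np.ones(n), x])
b0, b1 = np.linalg.lstsq(X, y, rcond=None)[0]

kappa = sig_x**2 / (sig_x**2 + sig_v**2)     # attenuation factor = 0.8 here
print(b1, beta1 * kappa)                     # slope ≈ 1.6, not 2.0
print(b0, beta0 + mu * sig_v**2 * beta1 / (sig_x**2 + sig_v**2))  # intercept ≈ 2.2, not 1.0
```

Both estimates settle at the predicted inconsistent limits, not at the true parameters.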
1. Functional form: When the x̃ᵢ's are unknown constants (fixed), the measurement error model is said to be in its functional form.

2. Structural form: When the x̃ᵢ's are identically and independently distributed random variables, say with mean µ and variance σ² (σ² > 0), the measurement error model is said to be in the structural form.

3. Ultrastructural form: When the x̃ᵢ's are independently distributed random variables with different means, say µᵢ, and variance σ² (σ² > 0), the model is said to be in the ultrastructural form. This form is a synthesis of the functional and structural forms in the sense that both are particular cases of the ultrastructural form.
ECONOMETRIC THEORY
MODULE – XVI
Lecture – 38
Measurement Error Models
We have seen that the ordinary least squares estimator becomes biased and inconsistent in the presence of measurement errors. An important objective in measurement error models is therefore how to obtain consistent estimators of the regression coefficients. The instrumental variable estimation and the method of maximum likelihood (or method of moments) are utilized for this purpose.

The instrumental variable method provides consistent estimates of the regression coefficients in the linear regression model when the explanatory variables and disturbance terms are correlated. Since in the measurement error model the explanatory variables and the disturbance are correlated, this method helps. The instrumental variable method consists of finding a set of variables which are correlated with the explanatory variables in the model but uncorrelated with the composite disturbance, at least asymptotically, to ensure consistency.
In the measurement error model

y = X̃β + ω,  ω = u − Vβ,

let Z be the n×k matrix of k instrumental variables Z₁, Z₂, …, Zₖ, each having n observations, such that

Z and X are correlated, at least asymptotically, and

Z and ω are uncorrelated, at least asymptotically.

So we have

plim (1/n) Z'X = Σ_ZX

plim (1/n) Z'ω = 0.
The instrumental variable estimator is β̂_IV = (Z'X)⁻¹Z'y, so

β̂_IV = (Z'X)⁻¹Z'(Xβ + ω)

β̂_IV − β = (Z'X)⁻¹Z'ω

plim (β̂_IV − β) = [plim (1/n) Z'X]⁻¹ plim (1/n) Z'ω = Σ_ZX⁻¹ · 0 = 0.

So β̂_IV is a consistent estimator of β.
Any instrument that fulfills the requirements of being uncorrelated with the composite disturbance term and correlated with the explanatory variables will result in a consistent estimate of the parameter. However, there can be various sets of variables which satisfy these conditions to become instrumental variables. Different choices of instruments give different consistent estimators. It is difficult to assert which choice of instruments will give an instrumental variable estimator having minimum asymptotic variance, and it is also difficult to decide which choice of instrumental variable is better and more appropriate in comparison to the others. An additional difficulty is to check whether the chosen instruments are indeed uncorrelated with the disturbance term or not.
Choice of instrument
We discuss some popular choices of instruments in a univariate measurement error model. Consider the model
yᵢ = β₀ + β₁xᵢ + ωᵢ,  ωᵢ = uᵢ − β₁vᵢ,  i = 1, 2, …, n.
A variable that is likely to satisfy the two requirements of an instrumental variable is the discrete grouping variable. The
Wald’s, Bartlett’s and Durbin’s methods are based on different choices of discrete grouping variables.
1. Wald’s method
Find the median of the given observations x₁, x₂, …, xₙ. Now classify the observations by defining an instrumental variable Z with

Zᵢ = 1 if xᵢ > median(x₁, x₂, …, xₙ)

Zᵢ = −1 if xᵢ < median(x₁, x₂, …, xₙ).
In this case, Z and X are the n×2 matrices with rows (1, Zᵢ) and (1, xᵢ), respectively. Let x̄₁, ȳ₁ denote the means of the xᵢ's and yᵢ's in the group below the median, and x̄₂, ȳ₂ the corresponding means in the group above the median. Then for even n,

Z'X = [n, n x̄; 0, (n/2)(x̄₂ − x̄₁)]

Z'y = (n ȳ, (n/2)(ȳ₂ − ȳ₁))',

so that

(β̂₀IV, β̂₁IV)' = (Z'X)⁻¹ Z'y

⇒ β̂₁IV = (ȳ₂ − ȳ₁)/(x̄₂ − x̄₁)

β̂₀IV = ȳ − β̂₁IV x̄.
If n is odd, then the middle observations can be deleted. Under fairly general conditions, the estimators are consistent but
are likely to have large sampling variance. This is the limitation of this method.
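The group-mean expressions for Wald's estimator coincide with the general formula β̂_IV = (Z'X)⁻¹Z'y; the sketch below (simulated data with arbitrary parameter values) verifies this algebraic identity:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000                                   # even n, continuous x, so no ties at the median
x = rng.normal(3.0, 1.0, n)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.5, n)

med = np.median(x)
Zi = np.where(x > med, 1.0, -1.0)          # Wald's grouping instrument
Z = np.column_stack([np.ones(n), Zi])
X = np.column_stack([np.ones(n), x])

b_iv = np.linalg.solve(Z.T @ X, Z.T @ y)   # matrix form (Z'X)^{-1} Z'y

top, bot = Zi > 0, Zi < 0                  # group-mean form
b1 = (y[top].mean() - y[bot].mean()) / (x[top].mean() - x[bot].mean())
b0 = y.mean() - b1 * x.mean()
print(b_iv, (b0, b1))                      # the two forms agree
```

The two computations agree up to floating-point rounding, confirming the closed-form solution of the 2×2 system.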
2. Bartlett's method

Let x₁, x₂, …, xₙ be the n observations. Rank these observations and order them in increasing order. Now form three groups, each containing n/3 observations, and define the instrumental variable by (for example) Zᵢ = 1, 0 or −1 according as xᵢ falls in the top, middle or bottom group. Now discard the observations in the middle group and compute the means of the yᵢ's and xᵢ's in the

• bottom group, say ȳ₁ and x̄₁, and

• top group, say ȳ₃ and x̄₃.

Substituting the values of X and Z in β̂_IV = (Z'X)⁻¹Z'y and solving, we get

β̂₁IV = (ȳ₃ − ȳ₁)/(x̄₃ − x̄₁).

These estimators are consistent. No conclusive evidence is available to compare Bartlett's method and Wald's method, but the three-group method generally provides more efficient estimates than the two-group method in many cases.
3. Durbin's method

Let x₁, x₂, …, xₙ be the observations. Arrange them in ascending order and define the instrumental variable Zᵢ as the rank of xᵢ. Then substituting the suitable values of Z and X in β̂_IV = (Z'X)⁻¹Z'y, we get the instrumental variable estimator

β̂₁IV = Σᵢ Zᵢ(yᵢ − ȳ) / Σᵢ Zᵢ(xᵢ − x̄).

When there is more than one explanatory variable, one may choose as instrument the rank of that particular variable.
Since this estimator uses more information, it is believed to be superior in efficiency to the other grouping methods. In general, the instrumental variable estimators may have fairly large standard errors in comparison to the ordinary least squares estimator, which is the price paid for consistency. However, inconsistent estimators have little appeal.
Assume

E(uᵢ) = 0,  E(uᵢuⱼ) = σᵤ² if i = j, 0 if i ≠ j,

E(vᵢ) = 0,  E(vᵢvⱼ) = σᵥ² if i = j, 0 if i ≠ j,

E(uᵢvⱼ) = 0 for all i = 1, 2, …, n; j = 1, 2, …, n.

For the application of the method of maximum likelihood, we assume a normal distribution for uᵢ and vᵢ. We consider the estimation of parameters in the structural form of the model, in which the x̃ᵢ's are stochastic. So assume

x̃ᵢ ~ N(µ, σ²)

and the x̃ᵢ's are independent of uᵢ and vᵢ.
Thus

E(x̃ᵢ) = µ,  Var(x̃ᵢ) = σ²

E(xᵢ) = E(x̃ᵢ + vᵢ) = µ

Var(xᵢ) = E[xᵢ − E(xᵢ)]² = E[x̃ᵢ + vᵢ − µ]² = σ² + σᵥ²

E(yᵢ) = β₀ + β₁E(x̃ᵢ) = β₀ + β₁µ

Var(yᵢ) = E[yᵢ − E(yᵢ)]² = E[β₀ + β₁x̃ᵢ + uᵢ − β₀ − β₁µ]² = β₁²σ² + σᵤ²

Cov(xᵢ, yᵢ) = E{xᵢ − E(xᵢ)}{yᵢ − E(yᵢ)} = E(x̃ᵢ + vᵢ − µ){β₁(x̃ᵢ − µ) + uᵢ} = β₁σ² + 0 + 0 + 0 = β₁σ².
So

(yᵢ, xᵢ)' ~ N( (β₀ + β₁µ, µ)',  [β₁²σ² + σᵤ², β₁σ²; β₁σ², σ² + σᵥ²] ).
The likelihood function is the joint probability density function of uᵢ and vᵢ, i = 1, 2, …, n:

L = (2πσᵤσᵥ)⁻ⁿ exp[ −Σᵢ (yᵢ − β₀ − β₁x̃ᵢ)²/(2σᵤ²) ] exp[ −Σᵢ (xᵢ − x̃ᵢ)²/(2σᵥ²) ].

The log-likelihood is

L* = ln L = constant − (n/2)(ln σᵤ² + ln σᵥ²) − Σᵢ (yᵢ − β₀ − β₁x̃ᵢ)²/(2σᵤ²) − Σᵢ (xᵢ − x̃ᵢ)²/(2σᵥ²).
The normal equations are obtained by equating the partial derivatives to zero:

(1) ∂L*/∂β₀ = (1/σᵤ²) Σᵢ (yᵢ − β₀ − β₁x̃ᵢ) = 0

(2) ∂L*/∂β₁ = (1/σᵤ²) Σᵢ x̃ᵢ(yᵢ − β₀ − β₁x̃ᵢ) = 0

(3) ∂L*/∂x̃ᵢ = (β₁/σᵤ²)(yᵢ − β₀ − β₁x̃ᵢ) + (1/σᵥ²)(xᵢ − x̃ᵢ) = 0,  i = 1, 2, …, n

(4) ∂L*/∂σᵤ² = −n/(2σᵤ²) + (1/(2σᵤ⁴)) Σᵢ (yᵢ − β₀ − β₁x̃ᵢ)² = 0

(5) ∂L*/∂σᵥ² = −n/(2σᵥ²) + (1/(2σᵥ⁴)) Σᵢ (xᵢ − x̃ᵢ)² = 0.

These are (n + 4) equations in (n + 4) unknowns, but squaring equation (3), summing over i = 1, 2, …, n and using equations (4) and (5), we get

σᵤ² = β₁²σᵥ²,

which is an undesirable restriction.
The observations can be used to estimate the two means (µ and β₀ + β₁µ), the two variances and the one covariance. The six parameters µ, β₀, β₁, σᵤ², σᵥ² and σ² are therefore to be estimated from the following five structural relations derived from these moments:

(i) x̄ = µ

(ii) ȳ = β₀ + β₁µ

(iii) m_xx = σ² + σᵥ²

(iv) m_yy = β₁²σ² + σᵤ²

(v) m_xy = β₁σ²
where x̄ = (1/n) Σᵢ xᵢ, ȳ = (1/n) Σᵢ yᵢ, m_xx = (1/n) Σᵢ (xᵢ − x̄)², m_yy = (1/n) Σᵢ (yᵢ − ȳ)² and m_xy = (1/n) Σᵢ (xᵢ − x̄)(yᵢ − ȳ).
These relations can be derived directly using the sufficiency property of the parameters in the bivariate normal distribution, using the structural relationships

E(x) = µ

E(y) = β₀ + β₁µ

Var(x) = σ² + σᵥ²

Var(y) = β₁²σ² + σᵤ²

Cov(x, y) = β₁σ².
Note: The same equations (i)-(v) can also be derived using the method of moments, by equating the sample and population moments. The assumption of a normal distribution for uᵢ, vᵢ and x̃ᵢ is not needed in the case of the method of moments.
ECONOMETRIC THEORY
MODULE – XVI
Lecture – 39
Measurement Error Models
The intercept is estimated as

β̂₀ = ȳ − β̂₁x̄

once β̂₁ is uniquely determined. So we consider the estimation of β₁, σ², σᵤ² and σᵥ² only. Some additional information is required for the unique determination of these parameters. We now consider various types of additional information which are used for estimating the parameters uniquely.
1. σᵥ² is known

Suppose σᵥ² is known a priori. The remaining parameters can then be estimated as follows:

m_xx = σ² + σᵥ²  ⇒  σ̂² = m_xx − σᵥ²

m_xy = β₁σ²  ⇒  β̂₁ = m_xy/(m_xx − σᵥ²)

m_yy = β₁²σ² + σᵤ²  ⇒  σ̂ᵤ² = m_yy − β̂₁²σ̂² = m_yy − m_xy²/(m_xx − σᵥ²).

Note that σ̂² = m_xx − σᵥ² can be negative because σᵥ² is known whereas m_xx is based upon the sample. So we assume σ̂² > 0 and redefine

β̂₁ = m_xy/(m_xx − σᵥ²);  m_xx > σᵥ².

Similarly, σ̂ᵤ² is also assumed to be positive under a suitable condition. All the estimators β̂₁, σ̂² and σ̂ᵤ² are consistent estimators of β₁, σ² and σᵤ², respectively. Note that β̂₁ looks as if the direct regression estimator of β₁ has been adjusted by σᵥ² for its inconsistency.
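A simulation sketch of case 1 (all data-generating values below are arbitrary assumptions): with σᵥ² treated as known, the adjusted estimator removes the attenuation that the direct ratio m_xy/m_xx suffers from:

```python
import numpy as np

rng = np.random.default_rng(4)
n, beta0, beta1 = 100_000, 1.0, 2.0
sig2, sig_v2, sig_u2, mu = 1.0, 0.25, 0.09, 3.0   # sig_v2 assumed known a priori

x_true = rng.normal(mu, np.sqrt(sig2), n)
x = x_true + rng.normal(0.0, np.sqrt(sig_v2), n)
y = beta0 + beta1 * x_true + rng.normal(0.0, np.sqrt(sig_u2), n)

m_xx = np.var(x)                          # (1/n) sum (x_i - xbar)^2
m_yy = np.var(y)
m_xy = np.cov(x, y, bias=True)[0, 1]

b1_ols = m_xy / m_xx                      # direct regression: inconsistent (≈ 1.6 here)
b1_adj = m_xy / (m_xx - sig_v2)           # adjusted estimator: consistent (≈ 2.0)
sig2_hat = m_xx - sig_v2                  # ≈ sigma^2 = 1.0
sig_u2_hat = m_yy - m_xy**2 / (m_xx - sig_v2)   # ≈ sigma_u^2 = 0.09

print(b1_ols, b1_adj, sig2_hat, sig_u2_hat)
```

The direct regression slope converges to the attenuated value β₁σ²/(σ² + σᵥ²), while the adjusted estimator recovers β₁, σ² and σᵤ².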
2. σᵤ² is known

Suppose σᵤ² is known a priori. Then using m_xy = β₁σ², we can rewrite

m_yy = β₁²σ² + σᵤ² = β₁m_xy + σᵤ²

⇒ β̂₁ = (m_yy − σᵤ²)/m_xy;  m_yy > σᵤ²

σ̂² = m_xy/β̂₁

σ̂ᵥ² = m_xx − σ̂².

The estimators β̂₁, σ̂² and σ̂ᵥ² are consistent estimators of β₁, σ² and σᵥ², respectively. Note that β̂₁ looks as if the reverse regression estimator of β₁ has been adjusted by σᵤ² for its inconsistency, so it is also termed an adjusted estimator.
3. λ = σᵤ²/σᵥ² is known

Suppose the ratio of the measurement error variances

λ = σᵤ²/σᵥ²

is known. Consider

m_yy = β₁²σ² + σᵤ²

= β₁m_xy + λσᵥ²  (using (v) and σᵤ² = λσᵥ²)

= β₁m_xy + λ(m_xx − σ²)  (using (iii))

= β₁m_xy + λ(m_xx − m_xy/β₁)  (using (v)).

Multiplying by β₁ and rearranging (β₁ ≠ 0),

β₁²m_xy + β₁(λm_xx − m_yy) − λm_xy = 0.

Solving this quadratic in β₁ and taking the root with the positive sign,

β̂₁ = [ (m_yy − λm_xx) + {(m_yy − λm_xx)² + 4λm_xy²}^{1/2} ] / (2m_xy) = U/(2m_xy), say.
To see that the positive sign is appropriate, note that we require

σ̂² ≥ 0  ⇒  m_xy/β̂₁ ≥ 0  ⇒  2m_xy²/U ≥ 0,

and since m_xy² ≥ 0, U must be nonnegative, which the positive root ensures. The remaining parameters are then estimated as

σ̂² = m_xy/β̂₁

σ̂ᵥ² = (m_yy − 2β̂₁m_xy + β̂₁²m_xx)/(λ + β̂₁²),  σ̂ᵤ² = λσ̂ᵥ².

Note that the same estimator β̂₁ of β₁ can be obtained by orthogonal regression; this amounts to transforming xᵢ to xᵢ/σᵥ and yᵢ to yᵢ/σᵤ and using orthogonal regression estimation with the transformed variables.
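Case 3 can likewise be sketched numerically (the parameter values below are arbitrary assumptions); the positive root of the quadratic recovers β₁ consistently:

```python
import numpy as np

rng = np.random.default_rng(5)
n, beta0, beta1 = 100_000, 1.0, 2.0
sig2, sig_v2, sig_u2, mu = 1.0, 0.25, 0.5, 3.0
lam = sig_u2 / sig_v2                     # variance ratio assumed known a priori

x_true = rng.normal(mu, np.sqrt(sig2), n)
x = x_true + rng.normal(0.0, np.sqrt(sig_v2), n)
y = beta0 + beta1 * x_true + rng.normal(0.0, np.sqrt(sig_u2), n)

m_xx, m_yy = np.var(x), np.var(y)
m_xy = np.cov(x, y, bias=True)[0, 1]

# positive root of  b^2 m_xy + b (lam m_xx - m_yy) - lam m_xy = 0
U = (m_yy - lam * m_xx) + np.sqrt((m_yy - lam * m_xx)**2 + 4 * lam * m_xy**2)
b1 = U / (2 * m_xy)
print(b1)   # ≈ beta1 = 2.0
```

With these values the population quantities are m_yy = 4.5, m_xx = 1.25 and m_xy = 2, so U → 8 and β̂₁ → 2, the true slope.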
4. Reliability ratio is known

The reliability ratio associated with the explanatory variable is defined as the ratio of the variances of the true and observed values of the explanatory variable,

K_x = Var(x̃)/Var(x) = σ²/(σ² + σᵥ²);  0 ≤ K_x ≤ 1.

Note that K_x = 1 when σᵥ² = 0, which means there is no measurement error in the explanatory variable, and K_x = 0 when σ² = 0, which means the explanatory variable is fixed. A higher value of K_x is obtained when σᵥ² is small, i.e., when the impact of measurement errors is small. The reliability ratio is a popular measure in psychometrics.

If K_x is known, then using (iii) and (v),

m_xx = σ² + σᵥ²  ⇒  σ̂² = K_x m_xx,  σ̂ᵥ² = (1 − K_x) m_xx

β̂₁ = m_xy/σ̂² = K_x⁻¹ b,

where b = m_xy/m_xx is the ordinary least squares estimator.
5. β₀ is known

When β₀ is known, relations (i) and (ii) give β̂₁ = (ȳ − β₀)/x̄, and then

σ̂² = m_xy/β̂₁,  σ̂ᵤ² = m_yy − β̂₁m_xy,  σ̂ᵥ² = m_xx − m_xy/β̂₁.

Note: In each of the cases 1-5, the form of the estimate depends on the type of information available, which is needed for consistent estimation of the parameters. Such information can come from various sources, e.g., long association of the experimenter with the experiment, similar studies conducted in the past, some extraneous source etc.
Estimation of parameters in functional form

In the functional form of the measurement error model, the x̃ᵢ's are assumed to be fixed. This assumption is unrealistic in the sense that when the x̃ᵢ's are unobservable and unknown, it is difficult to know whether they are fixed or not; even in repeated sampling it cannot be ensured that the same value is repeated. All that can be said is that the analysis is conditional upon the x̃ᵢ's. So assume that the x̃ᵢ's are conditionally known. The model is

ỹᵢ = β₀ + β₁x̃ᵢ

xᵢ = x̃ᵢ + vᵢ

yᵢ = ỹᵢ + uᵢ.

Then

(yᵢ, xᵢ)' ~ N( (β₀ + β₁x̃ᵢ, x̃ᵢ)',  [σᵤ², 0; 0, σᵥ²] )

and the log-likelihood is

L* = ln L = constant − (n/2) ln σᵤ² − (n/2) ln σᵥ² − (1/(2σᵤ²)) Σᵢ (yᵢ − β₀ − β₁x̃ᵢ)² − (1/(2σᵥ²)) Σᵢ (xᵢ − x̃ᵢ)².
The normal equations are obtained by partially differentiating L* and equating to zero:
(I) ∂L*/∂β₀ = 0 ⇒ (1/σᵤ²) Σᵢ (yᵢ − β₀ − β₁x̃ᵢ) = 0

(II) ∂L*/∂β₁ = 0 ⇒ (1/σᵤ²) Σᵢ (yᵢ − β₀ − β₁x̃ᵢ) x̃ᵢ = 0

(III) ∂L*/∂σᵤ² = 0 ⇒ −n/(2σᵤ²) + (1/(2σᵤ⁴)) Σᵢ (yᵢ − β₀ − β₁x̃ᵢ)² = 0

(IV) ∂L*/∂σᵥ² = 0 ⇒ −n/(2σᵥ²) + (1/(2σᵥ⁴)) Σᵢ (xᵢ − x̃ᵢ)² = 0

(V) ∂L*/∂x̃ᵢ = 0 ⇒ (β₁/σᵤ²)(yᵢ − β₀ − β₁x̃ᵢ) + (1/σᵥ²)(xᵢ − x̃ᵢ) = 0,  i = 1, 2, …, n.

Squaring equation (V), summing over i and using equations (III) and (IV), we get

nβ₁²/σᵤ² = n/σᵥ²  ⇒  β₁ = σᵤ/σᵥ,
which is unacceptable: β₁ can well be negative, but here, since σᵤ > 0 and σᵥ > 0, β₁ would always be positive. Thus maximum likelihood breaks down because of insufficient information in the model, and increasing the sample size n does not solve the problem. If restrictions like σᵤ² known, σᵥ² known, or λ = σᵤ²/σᵥ² known are incorporated, then the maximum likelihood estimation proceeds as in the case of the structural form and similar estimates are obtained. For example, if λ = σᵤ²/σᵥ² is known, one substitutes it in the likelihood function and maximizes.
MODULE – XVII
Lecture - 40
Simultaneous Equations Models
In any regression modeling, generally an equation is considered to represent a relationship describing a phenomenon.
Many situations involve a set of relationships which explain the behaviour of certain variables. For example, in analyzing
the market conditions for a particular commodity, there can be a demand equation and a supply equation which explain the
price and quantity of the commodity exchanged in the market at equilibrium. So there are two equations to explain the behaviour of price and quantity.
In such cases, it is not necessary that all the variables should appear in all the equations. So estimation of parameters
under this type of situation has those features that are not present when a model involves only a single relationship. In
particular, when a relationship is a part of a system, then some explanatory variables are stochastic and are correlated
with the disturbances. So the basic assumption of a linear regression model that the explanatory variable and disturbance
are uncorrelated or explanatory variables are fixed is violated and consequently ordinary least squares estimator becomes
inconsistent.
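The inconsistency caused by simultaneity is easy to demonstrate with a toy demand-supply system (all coefficients and error variances below are arbitrary assumptions). Regressing equilibrium quantity on equilibrium price by OLS recovers neither the demand slope nor the supply slope:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 100_000
a0, a1 = 10.0, 1.0        # demand:  q = a0 - a1*p + u_d
b0, b1 = 2.0, 1.0         # supply:  q = b0 + b1*p + u_s
u_d = rng.normal(0.0, 1.0, n)
u_s = rng.normal(0.0, 1.0, n)

# Market clearing (d_t = s_t) determines price and quantity jointly
p = (a0 - b0 + u_d - u_s) / (a1 + b1)   # equilibrium price
q = b0 + b1 * p + u_s                   # equilibrium quantity

slope_ols = np.cov(p, q, bias=True)[0, 1] / np.var(p)
print(slope_ols)   # neither -a1 = -1 nor b1 = 1
```

Because p is determined by both disturbances, it is correlated with the error of each equation, and the OLS slope converges to a mixture (here, with equal error variances, it is near zero) rather than to either structural slope.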
Similar to the classification of variables as explanatory variable and study variable in linear regression model, the variables
in simultaneous equation models are classified as endogenous variables and exogenous variables.
3
The variables which are explained by the functioning of system and values of which are determined by the simultaneous
interaction of the relations in the model are endogenous variables or jointly determined variables.
The variables that contribute to provide explanations for the endogenous variables and values of which are determined
from outside the model are exogenous variables or predetermined variables.
Exogenous variables help in explaining the variations in endogenous variables. It is customary to include past values of
endogenous variables in the predetermined group. Since exogenous variables are predetermined, so they are
independent of disturbance term in the model. They satisfy those assumptions which explanatory variables satisfy in the
usual regression model. Exogenous variables influence the endogenous variables but are not themselves influenced by
them. One variable which is endogenous for one model can be exogenous variable for the other model.
Note that in linear regression model, the explanatory variables influence study variable but not vice versa. So relationship
is one sided.
The classification of variables as endogenous and exogenous is important because a necessary condition for uniquely
estimating all the parameters is that the number of endogenous variables is equal to the number of independent equations
in the system. Moreover, the main distinction of predetermined variable in the estimation of parameters is that they are
uncorrelated with disturbance term in the equations in which they appear.
Simultaneous equation systems
A model constitutes a system of simultaneous equations if all the involved relationships are needed for determining the value of at least one of the endogenous variables included in the model. This implies that at least one of the relationships includes more than one endogenous variable.
Example 1
Now we consider the following example in detail and introduce various concepts and terminologies used in describing the
simultaneous equations models.
Consider a situation of an ideal market where transactions of only one commodity, say wheat, take place. Assume that the numbers of buyers and sellers are large enough so that the market is a perfectly competitive market. It is also assumed that the amount of wheat that comes into the market in a day is completely sold out on the same day. No seller takes it back.
Now we develop a model for such mechanism.
Let
dt denote the demand of the commodity, say wheat, at time t,
st denote the supply of the commodity, say wheat, at time t, and
qt denote the quantity of the commodity, say wheat, transacted at time t.
Since the market is cleared every day,
dt = st,  t = 1, 2, ..., n.
Observe that
• demand of wheat depends on
- price of wheat (pt) at time t.
- income of buyer (it) at time t.
• supply of wheat depends on
- price of wheat (pt) at time t.
- rainfall (rt) at time t.
Our aim is to study the behaviour of dt, st and pt, which are determined by the simultaneous equation model. Since endogenous variables are influenced by exogenous variables but not vice versa:
dt, st and pt are endogenous variables;
it and rt are exogenous variables.
Now consider an additional variable for the model, the lagged value of the price pt, denoted as pt−1. In a market, the price of the commodity generally depends on the price of the commodity on the previous day. If the price of the commodity today is less than the price on the previous day, then the buyer would like to buy more. For the seller also, today's price of the commodity depends on the previous day's price, based on which he decides the quantity of the commodity (wheat) to be brought to the market.
So the lagged price affects both the demand and supply equations. Updating both the equations, we can now write that
demand depends on pt, it and pt−1;
supply depends on pt, rt and pt−1.
Note that the lagged variables are considered as exogenous variables. The updated list of endogenous and exogenous variables is as follows:
Endogenous variables: pt, dt, st
Exogenous variables: pt−1, it, rt
Consider first the simple model with structural equations
Demand: qt = α1 + β1 pt + ε1t    (I)
Supply: qt = α2 + β2 pt + ε2t    (II)
These equations are called structural equations. The error terms ε1t and ε2t are called structural disturbances. The coefficients α1, α2, β1 and β2 are called the structural coefficients.
So there are only two structural relationships. The price is determined by the mechanism of the market and not by the buyer or supplier. Thus qt and pt are the endogenous variables. Without loss of generality, we can assume that the variables associated with α1 and α2 are X1 and X2 respectively, such that X1 = 1 and X2 = 1. So X1 = 1 and X2 = 1 are predetermined and can be regarded as exogenous variables.
From a statistical point of view, we would like to write the model in such a form that OLS can be directly applied. Equating (I) and (II) gives
α1 + β1 pt + ε1t = α2 + β2 pt + ε2t
or pt = (α1 − α2)/(β2 − β1) + (ε1t − ε2t)/(β2 − β1)
      = π11 + v1t.    (III)
Substituting (III) into (I) gives
qt = (β2 α1 − β1 α2)/(β2 − β1) + (β2 ε1t − β1 ε2t)/(β2 − β1)
   = π21 + v2t.    (IV)
The equations (III) and (IV) are called the reduced form relationships and, in general, are called the reduced form of the model.
The coefficients π11 and π21 are called reduced form coefficients and the errors v1t and v2t are called the reduced form disturbances. The reduced form essentially expresses every endogenous variable as a function of exogenous variables. This presents a clear relationship between reduced form coefficients and structural coefficients, as well as between structural disturbances and reduced form disturbances. The reduced form is ready for the application of the OLS technique; the reduced form of the model satisfies all the assumptions needed for the application of OLS.
Suppose we apply the OLS technique to equations (III) and (IV) and obtain the OLS estimates of π11 and π21 as π̂11 and π̂21 respectively, which estimate
π11 = (α1 − α2)/(β2 − β1)
π21 = (β2 α1 − β1 α2)/(β2 − β1).
Note that π̂11 and π̂21 are the numerical values of the estimates. So now there are two equations and four unknown parameters α1, α2, β1 and β2. So it is not possible to derive unique estimates of the parameters of the model by applying the OLS technique to the reduced form. This is known as the problem of identification.
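As a numerical illustration of why the reduced form cannot pin down the four structural parameters, the sketch below (all parameter values made up for illustration) exhibits two different structural parameter sets that imply exactly the same reduced form coefficients π11 and π21:

```python
# Demand q = alpha1 + beta1*p and supply q = alpha2 + beta2*p imply the
# reduced form coefficients pi11 and pi21 of equations (III) and (IV).

def reduced_form(a1, a2, b1, b2):
    """Map structural coefficients to (pi11, pi21)."""
    pi11 = (a1 - a2) / (b2 - b1)            # equilibrium price intercept
    pi21 = (b2 * a1 - b1 * a2) / (b2 - b1)  # equilibrium quantity intercept
    return pi11, pi21

s1 = (10.0, 2.0, -1.0, 1.0)   # demand slope -1, supply slope +1 (illustrative)
s2 = (18.0, 2.0, -3.0, 1.0)   # a different structure (illustrative)

print(reduced_form(*s1))  # (4.0, 6.0)
print(reduced_form(*s2))  # (4.0, 6.0) -- indistinguishable from the data
```

Both structures generate identical reduced forms, so no amount of data on (pt, qt) can separate them.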
By this example, the following have been described up to now:
Structural form relationship.
Reduced form relationship.
Need for reducing the structural form into reduced form.
Reason for the problem of identification.
Now we describe the problem of identification in more detail.
Before analysis, we would like to check whether it is possible to estimate the parameters α1 , α 2 , β1 and β 2 or not.
Multiplying equation (1) by λ and equation (2) by (1 − λ) and then adding them together gives
qt = α + β pt + εt    (3)
where α = λα1 + (1 − λ)α2, β = λβ1 + (1 − λ)β2, εt = λε1t + (1 − λ)ε2t and λ is any scalar lying between 0 and 1.
Comparing equation (3) with equations (1) and (2), we notice that they have the same form. So it is difficult to say which is the supply equation and which is the demand equation. To see this, let equation (3) be the demand equation. Then there is no way to distinguish the true demand equation (1) from the pretended demand equation (3).
A similar exercise can be done for the supply equation, and we find that there is no way to distinguish the true supply equation (2) from the pretended supply equation (3).
Suppose we apply the OLS technique to these models. Applying OLS to equation (1) yields

β̂1 = [ Σ_{t=1}^{n} (pt − p̄)(qt − q̄) ] / [ Σ_{t=1}^{n} (pt − p̄)² ] = 0.6, say,

where p̄ = (1/n) Σ_{t=1}^{n} pt and q̄ = (1/n) Σ_{t=1}^{n} qt. Similarly, applying OLS to the pretended equation (3) yields

β̂ = [ Σ_{t=1}^{n} (pt − p̄)(qt − q̄) ] / [ Σ_{t=1}^{n} (pt − p̄)² ] = 0.6.
Note that β̂1 and β̂ have the same analytical expression, so they will also have the same numerical value, say 0.6. Looking at the value 0.6, it is difficult to say whether the value 0.6 determines equation (1) or (3). Similarly, applying OLS to equation (2) yields

β̂2 = [ Σ_{t=1}^{n} (pt − p̄)(qt − q̄) ] / [ Σ_{t=1}^{n} (pt − p̄)² ] = 0.6,

so that β̂1 = β̂2 = β̂.
Thus it is difficult to decide and identify whether β̂1 or β̂2 is determined by the value 0.6. Increasing the number of observations also does not help in the identification of these equations.
So we are not able to identify the parameters, and we take the help of economic theory to identify them.
The economic theory suggests that when price increases, supply increases but demand decreases. So a plot of quantity (qt) against price (pt) shows a downward-sloping demand curve (dt) and an upward-sloping supply curve (st), and this implies β1 < 0 and β2 > 0. Thus, since 0.6 > 0, we can say that the value 0.6 represents β̂2 > 0 and so β̂2 = 0.6.
But one can always choose a value of λ such that the pretended equation does not violate the sign of the coefficient, say β > 0. So it again becomes difficult to see whether equation (3) represents the supply equation (2) or not. So none of the parameters is identifiable.
Applying OLS to the reduced form equations (4) and (5), we obtain the OLSEs π̂11 and π̂21, which estimate
π11 = (α1 − α2)/(β2 − β1)
π21 = (α1 β2 − α2 β1)/(β2 − β1).
There are two equations and four unknowns, so the unique estimates of the parameters cannot be obtained. Thus the equations (1) and (2) cannot be identified, the model is not identifiable, and estimation of the parameters is not possible.
ECONOMETRIC THEORY
MODULE – XVII
Lecture - 41
Simultaneous Equations Models
Suppose a new exogenous variable, income it, is introduced in the model; it represents the income of the buyer at time t. Note that the demand of the commodity is influenced by income. On the other hand, the supply of the commodity is not influenced by income, so this variable is introduced only in the demand equation. The structural equations (1) and (2) now become
Demand: qt = α1 + β1 pt + γ1 it + ε1t    (6)
Supply: qt = α2 + β2 pt + ε2t    (7)
where γ1 is the structural coefficient associated with income. The pretended equation is obtained by multiplying equations (6) and (7) by λ and (1 − λ) respectively and then adding them together. This gives
λqt + (1 − λ)qt = [λα1 + (1 − λ)α2] + [λβ1 + (1 − λ)β2] pt + λγ1 it + [λε1t + (1 − λ)ε2t]
or qt = α + β pt + γ it + εt    (8)
where γ = λγ1.
Suppose we claim that equation (8) is the true demand equation because it contains pt and it, which influence the demand. But we note that it is difficult to decide which of the two equations (6) and (8) is the true demand equation.
Suppose now we claim that equation (8) is the true supply equation. This claim is wrong because income does not affect the supply. So equation (7) is the supply equation.
Thus supply equation is now identifiable but demand equation is not identifiable. Such situation is termed as partial
identification.
Now we find the reduced form of the structural equations (6) and (7). This is achieved by first solving equation (6) for pt and substituting it in equation (7) to obtain an equation in qt. Such an exercise yields the reduced form equations
pt = π11 + π12 it + v1t    (9)
qt = π21 + π22 it + v2t.    (10)
Applying OLS to equations (9) and (10), we get the OLSEs π̂11, π̂12, π̂21 and π̂22. Now we have four equations (expressing the four reduced form coefficients in terms of the structural parameters) and five unknowns (α1, α2, β1, β2, γ1). So the parameters are not determined uniquely. Thus the model is not identifiable.
However, here
β2 = π22 / π12
α2 = π21 − (π22 / π12) π11.
If π̂11, π̂12, π̂21 and π̂22 are available, then α̂2 and β̂2 can be obtained by substituting the π̂'s in place of the π's. So α2 and β2 are determined uniquely, and these are the parameters of the supply equation. So the supply equation is identified but the demand equation is still not identifiable.
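The recovery of the supply parameters from the reduced form can be verified numerically. In the sketch below all structural parameter values are arbitrary illustrative choices:

```python
# Demand: q = a1 + b1*p + g1*i (income i shifts only demand)
# Supply: q = a2 + b2*p
a1, b1, g1 = 10.0, -1.0, 0.5   # illustrative demand parameters
a2, b2 = 2.0, 1.0              # illustrative supply parameters

# Reduced form implied by equating demand and supply:
pi11 = (a1 - a2) / (b2 - b1)
pi12 = g1 / (b2 - b1)
pi21 = a2 + b2 * pi11
pi22 = b2 * pi12

# Supply coefficients recovered from the reduced form coefficients:
b2_rec = pi22 / pi12
a2_rec = pi21 - (pi22 / pi12) * pi11

assert abs(b2_rec - b2) < 1e-12   # beta2 is recovered exactly
assert abs(a2_rec - a2) < 1e-12   # alpha2 is recovered exactly
```

The demand parameters (a1, b1, g1) admit no such inversion here, which is the partial identification described above.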
Now, as done earlier, we introduce another exogenous variable, rainfall, denoted as rt, which denotes the amount of rainfall at time t. The rainfall influences the supply because better rainfall produces a better yield of wheat. On the other hand, the demand of wheat is not influenced by the rainfall. So the updated set of structural equations is
Demand: qt = α1 + β1 pt + γ1 it + ε1t    (11)
Supply: qt = α2 + β2 pt + δ2 rt + ε2t.    (12)
The pretended equation is obtained by adding together the equations obtained after multiplying equation (11) by λ and equation (12) by (1 − λ):
λqt + (1 − λ)qt = [λα1 + (1 − λ)α2] + [λβ1 + (1 − λ)β2] pt + λγ1 it + (1 − λ)δ2 rt + [λε1t + (1 − λ)ε2t]
or qt = α + β pt + γ it + δ rt + εt.    (13)
Now we claim that equation (13) is a demand equation. The demand does not depend on rainfall. So unless λ = 1, so that rt is absent from the model, equation (13) cannot be a demand equation. Thus equation (11) is the demand equation, and the demand equation is identified.
Now we claim that equation (13) is a supply equation. The supply is not influenced by the income of the buyer, so (13) cannot be a supply equation. Thus equation (12) is the supply equation. So now the supply equation is also identified.
The reduced form model from the structural equations (11) and (12) can be obtained in the following form:
pt = π11 + π12 it + π13 rt + v1t    (14)
qt = π21 + π22 it + π23 rt + v2t.    (15)
Application of the OLS technique to equations (14) and (15) yields the OLSEs π̂11, π̂12, π̂13, π̂21, π̂22 and π̂23. So now there are six such equations and six unknowns α1, α2, β1, β2, γ1 and δ2. So all the estimates are uniquely determined. Thus the equations (11) and (12) are exactly identified.
Finally, we introduce a lagged endogenous variable pt−1, which denotes the price of the commodity on the previous day. Since only the supply of wheat is affected by the price on the previous day, it is introduced in the supply equation only:
Demand: qt = α1 + β1 pt + γ1 it + ε1t    (16)
Supply: qt = α2 + β2 pt + δ2 rt + θ2 pt−1 + ε2t.    (17)
The pretended equation is obtained by first multiplying equation (16) by λ and equation (17) by (1 − λ) and then adding them together:
λqt + (1 − λ)qt = α + β pt + γ it + δ rt + (1 − λ)θ2 pt−1 + εt
or qt = α + β pt + γ it + δ rt + θ pt−1 + εt.    (18)
Now we claim that equation (18) represents the demand equation. Since rainfall and lagged price do not affect the demand, equation (18) cannot be a demand equation. Thus equation (16) is the demand equation, and the demand equation is identified.
Finally, we claim that equation (18) is the supply equation. Since income does not affect supply, equation (18) cannot be a supply equation. Thus equation (17) is the supply equation, and the supply equation is identified.
The reduced form equations from equations (16) and (17) can be obtained in the following form:
pt = π11 + π12 it + π13 rt + π14 pt−1 + v1t    (19)
qt = π21 + π22 it + π23 rt + π24 pt−1 + v2t.    (20)
Applying the OLS technique to equations (19) and (20) gives the OLSEs π̂11, π̂12, π̂13, π̂14, π̂21, π̂22, π̂23 and π̂24. So there are eight equations in seven parameters α1, α2, β1, β2, γ1, δ2 and θ2. So unique estimates of all the parameters are not available. In fact, in this case, the supply equation (17) is identifiable and the demand equation (16) is over-identified (in terms of multiple solutions).
The whole analysis in this example can be classified into three categories – unidentified, exactly identified and over-identified equations.
Analysis
Suppose there are G jointly dependent (endogenous) variables y1, y2, ..., yG and K predetermined (exogenous) variables x1, x2, ..., xK. Let there be n observations available on each of the variables and G structural equations connecting both sets of variables, which describe the complete model as follows:
β11 y1t + β12 y2t + ... + β1G yGt + γ11 x1t + γ12 x2t + ... + γ1K xKt = ε1t
β21 y1t + β22 y2t + ... + β2G yGt + γ21 x1t + γ22 x2t + ... + γ2K xKt = ε2t
...
βG1 y1t + βG2 y2t + ... + βGG yGt + γG1 x1t + γG2 x2t + ... + γGK xKt = εGt.
If B is singular, then one or more structural relations would be a linear combination of other structural relations. If B is
non-singular, such identities are eliminated.
The structural form relationship describes the interaction taking place inside the model. In reduced form relationship, the
jointly dependent (endogenous) variables are expressed as linear combination of predetermined (exogenous) variables.
This is the difference between structural and reduced form relationships.
Assume that the εt's are identically and independently distributed following N(0, Σ), and the vt's are identically and independently distributed following N(0, Ω) where Ω = B⁻¹Σ(B⁻¹)′, with
E(εt) = 0, E(εt εt′) = Σ, E(εt εt*′) = 0 for all t ≠ t*,
E(vt) = 0, E(vt vt′) = Ω, E(vt vt*′) = 0 for all t ≠ t*.
Since vt = B⁻¹εt, i.e. εt = Bvt, the conditional density of yt given xt is
p(yt | xt) = p(vt)
           = p(εt) |∂εt/∂vt|
           = p(εt) |det(B)|
where ∂εt/∂vt is the Jacobian of the transformation and |det(B)| is the absolute value of the determinant of B.
The likelihood function of the sample is
L = p(y1, y2, ..., yn | x1, x2, ..., xn)
  = ∏_{t=1}^{n} p(yt | xt)
  = |det(B)|^n ∏_{t=1}^{n} p(εt).
Applying a nonsingular linear transformation to the structural equation S with a nonsingular matrix D, we get
(S*): B* yt + Γ* xt = εt*
where B* = DB, Γ* = DΓ, εt* = Dεt, and the structural model S* describes the functioning of the model at time t.
Under S*,
p(yt | xt) = p(εt*) |∂εt*/∂vt|
           = p(εt*) |det(B*)|.
Also
εt* = Dεt ⇒ ∂εt*/∂εt = D.
Thus
p(yt | xt) = p(εt) |∂εt/∂εt*| |det(B*)|
           = p(εt) |det(D)|⁻¹ |det(D)| |det(B)|
           = p(εt) |det(B)|.
Similarly, the likelihood function under S* is
L* = |det(B*)|^n ∏_{t=1}^{n} p(εt*)
   = |det(D)|^n |det(B)|^n ∏_{t=1}^{n} p(εt) |∂εt/∂εt*|
   = |det(D)|^n |det(B)|^n |det(D⁻¹)|^n ∏_{t=1}^{n} p(εt)
   = L.
Thus both the structural forms S and S* have the same likelihood function. Since the likelihood function forms the basis of the statistical analysis, S and S* have the same implications. Moreover, it is impossible to tell whether the likelihood function corresponds to S or to S*. Any attempt to estimate the parameters will result in failure in the sense that we cannot know whether we are estimating S or S*. Thus S and S* are observationally equivalent. So the model is not identifiable.
A parameter is said to be identifiable within the model if the parameter has the same value for all equivalent structures
contained in the model.
If all the parameters in structural equation are identifiable, then the structural equation is said to be identifiable.
Given a structure, we can thus find many observationally equivalent structures by non-singular linear transformation.
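The observational equivalence can also be seen through the reduced form: any nonsingular transformation D leaves π = −B⁻¹Γ unchanged, so S and S* = (DB, DΓ) generate the same reduced form. A small numerical check, with randomly generated matrices that are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
G, K = 3, 4
B = rng.normal(size=(G, G)) + 3.0 * np.eye(G)   # shifted to stay away from singular
Gamma = rng.normal(size=(G, K))
D = rng.normal(size=(G, G)) + 3.0 * np.eye(G)   # an arbitrary nonsingular transformation

pi_S = -np.linalg.solve(B, Gamma)               # reduced form of S
pi_Sstar = -np.linalg.solve(D @ B, D @ Gamma)   # reduced form of S* = (DB, DGamma)

assert np.allclose(pi_S, pi_Sstar)              # identical reduced forms
```

Since the data only ever reveal the reduced form, S and S* cannot be told apart without further restrictions.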
The a priori restrictions on B and Γ may help in the identification of parameters. The derived structures may not satisfy these restrictions and may therefore not be admissible.
The presence and/or absence of certain variables helps in the identifiability. So we use and apply some a priori restrictions. These a priori restrictions may arise from various sources like economic theory, e.g., it is known that price and income affect the demand of wheat but rainfall does not affect it. Similarly, the supply of wheat depends on price and rainfall.
There are many types of restrictions available which can solve the problem of identification. We consider zero-one type restrictions.
ECONOMETRIC THEORY
MODULE – XVII
Lecture - 42
Simultaneous Equations Models
Consider the structural model
B yt + Γ xt = εt,  t = 1, 2, ..., n.
When the zero-one type restrictions are incorporated in the model, suppose there are G∆ jointly dependent and K* predetermined variables in S with nonzero coefficients. The remaining (G − G∆) jointly dependent and (K − K*) predetermined variables have zero coefficients.
Without loss of generality, let β∆ and γ* be the row vectors formed by the nonzero elements in the first row of B and Γ respectively. Thus the first row of B can be expressed as (β∆ 0). So B has G∆ coefficients which are nonzero (indicating the presence of the variable) and (G − G∆) coefficients which are zero (indicating its absence).
Similarly, the first row of Γ can be written as (γ* 0). So in Γ, there are K* elements present (nonzero) and (K − K*) elements absent (zero).
The first equation of the model can be rewritten as
β11 y1t + β12 y2t + ... + β1G∆ yG∆t + γ11 x1t + γ12 x2t + ... + γ1K* xK*t = ε1t
or (β∆ 0) yt + (γ* 0) xt = ε1t,  t = 1, 2, ..., n.
Assume every equation describes the behaviour of a particular variable, so that we can take β11 = 1. If β11 ≠ 1, then
divide the whole equation by β11 so that the coefficient of y1t is one.
The reduced form coefficient matrix satisfies
π = −B⁻¹Γ
or Bπ = −Γ
or (β∆ 0) π = −(γ* 0),
where β∆ has G∆ elements, its adjoining zero vector has (G − G∆) elements, γ* has K* elements and its adjoining zero vector has (K − K*) elements. Partition π conformably as

π = ( π∆*   π∆**  )
    ( π∆∆*  π∆∆** )
We can re-express

(β∆ 0∆∆) ( π∆*   π∆**  ) = −(γ* 0**)
         ( π∆∆*  π∆∆** )

⇒ β∆ π∆* = −γ*    (i)
   β∆ π∆** = 0**.   (ii)
Assume π is known. Then (i) gives a unique solution for γ* if β∆ is uniquely found from (ii). Thus the identifiability of S rests upon the unique determination of β∆. Out of the G∆ elements in β∆, one has coefficient 1, so there are (G∆ − 1) unknown elements in β∆.
Note that β∆ (up to the normalization) is uniquely determined by (ii) when
rank(π∆**) = G∆ − 1.
This is known as the rank condition for the identifiability of the parameters in S. This condition is necessary and sufficient. Another condition, known as the order condition, is only necessary and not sufficient. The order condition is derived as follows:
We now use the result that for any matrix A of order m × n, rank(A) ≤ min(m, n). So if rank(A) = m, then necessarily n ≥ m. Since β∆ π∆** = 0** and β∆ has (G∆ − 1) unknown elements, which are identifiable when
rank(π∆**) = G∆ − 1,
and π∆** has (K − K*) columns, we need
K − K* ≥ G∆ − 1.
There are various ways in which these conditions can be represented to have meaningful interpretations. We discuss them as follows:
1. (G − G∆) + (K − K*) ≥ (G − G∆) + (G∆ − 1)
   or (G − G∆) + (K − K*) ≥ G − 1.
Here (G − G∆) is the number of jointly dependent variables and (K − K*) the number of predetermined variables excluded from the equation
y1t + β12 y2t + ... + β1G∆ yG∆t + γ11 x1t + ... + γ1K* xK*t = ε1t.
So the left hand side of the condition is the total number of variables excluded from this equation. Thus if the total number of variables excluded from this equation is at least (G − 1), then the equation is identifiable.
2. K ≥ K* + G∆ − 1.
Here
K*: number of predetermined variables present in the equation,
G∆: number of jointly dependent variables present in the equation.
3. Define L = K − K* − (G∆ − 1).
L measures the degree of over-identification.
The rank condition tells whether the equation is identified or not. If identified, then the order condition tells whether the equation is exactly identified or over-identified.
Note
We have illustrated the various aspects of estimation and the conditions for identification in the simultaneous equation system using the first equation of the system. It may be noted that this can be done for any equation of the system, and it is not necessary to consider only the first equation. From now on, we do not restrict ourselves to the first equation but consider any, say the ith, equation (i = 1, 2, ..., G).
where β∆, γ*, 0∆∆ and 0** are row vectors consisting of G∆, K*, G∆∆ (= G − G∆) and K** (= K − K*) elements respectively. Similarly, the orders of B∆, B∆∆, Γ* and Γ** are (G − 1) × G∆, (G − 1) × G∆∆, (G − 1) × K* and (G − 1) × K** respectively.
Note that B∆∆ and Γ** are the matrices of structural coefficients of the variables omitted from the ith equation (i = 1, 2, ..., G) but included in the other structural equations.
Write

∆ = (B  Γ) = ( β∆  0∆∆  γ*  0** )
             ( B∆  B∆∆  Γ*  Γ** )

B⁻¹∆ = (B⁻¹B  B⁻¹Γ) = (I  −π) = ( I∆∆  0       −π∆,*   −π∆,**  )
                                ( 0    I∆∆,∆∆  −π∆∆,*  −π∆∆,** )
If

∆* = ( 0∆∆  0** )
     ( B∆∆  Γ** )

then clearly the rank of ∆* is the same as the rank of (B∆∆ Γ**), since the rank of a matrix is not affected by enlarging it with rows of zeros or by switching columns.
Now using the result that if a matrix A is multiplied by a nonsingular matrix then the product has the same rank as A, the rank condition rank(π∆**) = G∆ − 1 can be restated as
rank(B∆∆ Γ**) = G − 1,
and then the equation is identifiable.
Note that (B∆∆ Γ**) is a matrix constructed from the coefficients of the variables excluded from that particular equation but included in the other equations of the model. If rank(B∆∆ Γ**) = G − 1, then the equation is identifiable, and this is a necessary and sufficient condition. An advantage of this form of the condition is that it avoids the inversion of a matrix. A working rule is proposed as follows.
Working rule
1. Write the model in tabular form, putting 'X' if the variable is present and '0' if the variable is absent.
2. For the equation under study, mark the 0's (zeros) and pick up the corresponding columns, suppressing that row.
3. If we can choose (G − 1) rows and (G − 1) columns of which no row and no column consists entirely of zeros, then the equation is identifiable.
Example 2
Now we discuss the identifiability of following simultaneous equations model with three structural equations.
(1) y1 = β13 y3 + γ 11 x1 + γ 13 x3 + ε1
(2) y1 = γ 21 x1 + γ 23 x3 + ε 2
(3) y2 = β 33 y3 + γ 31 x1 + γ 32 x2 + ε 3.
Equation   y1  y2  y3   x1  x2  x3   G∆ − 1      K*   L
(1)        X   0   X    X   0   X    2 − 1 = 1   2    3 − 3 = 0
(2)        X   0   0    X   0   X    1 − 1 = 0   2    3 − 2 = 1
(3)        0   X   X    X   X   0    2 − 1 = 1   2    3 − 3 = 0

G = number of equations = 3.
'X' denotes the presence and '0' denotes the absence of a variable in an equation.
G∆ = number of 'X' among (y1 y2 y3).
K* = number of 'X' among (x1 x2 x3).
Identifiability of equation (1): the variables absent from equation (1) are y2 and x2. Write the corresponding columns for the other equations:

              y2   x2
Equation (2)   0    0
Equation (3)   X    X

Check whether any row or column has all elements '0'. If we can pick up a block of order (G − 1) in which no row and no column consists entirely of zeros, then the equation is identifiable.
Here G − 1 = 3 − 1 = 2, B∆∆ = (0; X) and Γ** = (0; X). So we need a (2 × 2) block in which no row and no column has all elements '0'. In the case of equation (1), the first row is all '0', so equation (1) is not identifiable.
Notice that from the order condition we have L = 0, which would indicate that the equation is exactly identified; this conclusion is misleading. This happens because the order condition is only necessary but not sufficient.
Identifiability of equation (2): the variables absent from equation (2) are y2, y3 and x2. Write the corresponding columns for the other equations:

              y2   y3   x2
Equation (1)   0    X    0
Equation (3)   X    X    X

Here B∆∆ = (0 X; X X), Γ** = (0; X) and G − 1 = 3 − 1 = 2. We see that there is at least one (2 × 2) block in which no row and no column is all '0'. So equation (2) is identified.
Also, L = 1 > 0 ⇒ equation (2) is over-identified.
Identifiability of equation (3): the variables absent from equation (3) are y1 and x3. Write the corresponding columns for the other equations:

              y1   x3
Equation (1)   X    X
Equation (2)   X    X

So B∆∆ = (X; X) and Γ** = (X; X). We see that there is no '0' present in the block. So equation (3) is identified.
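The working rule can be automated. The sketch below (an illustrative implementation, not part of the original notes) checks the rank condition for the three equations of Example 2 by filling the 'X' entries with arbitrary nonzero values:

```python
import numpy as np

rng = np.random.default_rng(1)
def x():
    # a generic nonzero coefficient standing in for an 'X' entry
    return rng.uniform(1.0, 2.0)

# Rows: equations (1), (2), (3); columns: y1 y2 y3 x1 x2 x3
A = np.array([[x(), 0.0, x(), x(), 0.0, x()],
              [x(), 0.0, 0.0, x(), 0.0, x()],
              [0.0, x(), x(), x(), x(), 0.0]])
G = 3

def identified(eq):
    excluded = A[eq] == 0.0             # variables absent from this equation
    others = np.delete(A, eq, axis=0)   # their coefficients in the other equations
    return np.linalg.matrix_rank(others[:, excluded]) == G - 1

print([identified(i) for i in range(G)])   # [False, True, True]
```

This reproduces the conclusions above: equation (1) fails the rank condition, while equations (2) and (3) are identified.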
MODULE – XVII
Lecture - 43
Simultaneous Equations Models
Estimation of parameters
To estimate the parameters of the structural equation, we assume that the equations are identifiable.
Consider the first equation of the model
β11 y1t + β12 y2t + ... + β1G yGt + γ11 x1t + γ12 x2t + ... + γ1K xKt = ε1t,  t = 1, 2, ..., n.
Assume β11 = 1 and incorporate the zero-one type restrictions. Then we have
y1t = −β12 y2t − ... − β1G∆ yG∆t − γ11 x1t − γ12 x2t − ... − γ1K* xK*t + ε1t,  t = 1, 2, ..., n.
Writing this equation in vector and matrix notation by collecting all n observations, we can write
y1 = Y1 β + X1 γ + ε1
where Y1 is the n × (G∆ − 1) matrix of observations on the other jointly dependent variables present in the equation, X1 is the n × K* matrix of observations on the predetermined variables present in the equation, β and γ are the corresponding coefficient vectors, and ε1 is the n × 1 vector of disturbances.
Stacking all the observations according to the variable rather than time, the complete model consisting of the G structural equations can be written as
YB + XΓ = Φ.
Assume E(Φ) = 0 and (1/n) E(Φ′Φ) = Σ, where Σ is a positive definite symmetric matrix.
The reduced form of the model is obtained from the structural equation model by post-multiplying by B⁻¹:
YBB⁻¹ + XΓB⁻¹ = ΦB⁻¹
or Y = Xπ + V
where π = −ΓB⁻¹ and V = ΦB⁻¹ are the matrices of reduced form coefficients and reduced form disturbances respectively.
The first structural equation can then be written as
y1 = Aδ + ε
where A = (Y1, X1), δ = (β′ γ′)′ and ε = ε1. This model looks like a multiple linear regression model.
Since X1 is a submatrix of X, it can be expressed as
X1 = XJ1
where J1 is called a selection matrix and consists of two types of elements, viz., 0 and 1: if the corresponding variable is present, its value is 1, and if absent, its value is 0.
1. Indirect least squares (ILS) method
Consider the reduced form of the model,
Y = Xπ + V
where π = −ΓB⁻¹ and V = ΦB⁻¹. Applying OLS to the reduced form equation yields the OLSE of π as
π̂ = (X′X)⁻¹ X′Y.
This is the first step of ILS procedure and yields the set of estimated reduced form coefficients.
Suppose we are interested in the estimation of following structural equation
y =Y1β + X 1γ + ε
where y1 is the n × 1 vector of n observations on the dependent (endogenous) variable, Y1 is the n × (G∆ − 1) matrix of observations on the remaining current endogenous variables present in the equation, X1 is the n × K* matrix of observations on the K* predetermined (exogenous) variables in the equation, and ε is the n × 1 vector of structural disturbances. Write this model as
(y1  Y1  X1) (  1 )
             ( −β ) = ε
             ( −γ )

or, more generally,

(y1  Y1  Y2  X1  X2) (  1 )
                     ( −β )
                     (  0 ) = ε
                     ( −γ )
                     (  0 )

where Y2 and X2 are the matrices of observations on the (G − G∆) endogenous and (K − K*) predetermined variables which are excluded from the equation due to the zero-one type restrictions.
Write
πB = −Γ

or π (  1 )   ( γ )
     ( −β ) = ( 0 ).
     (  0 )
Substitute π̂ = (X′X)⁻¹X′Y in place of π and solve for β and γ. This gives the indirect least squares estimators b and c of β and γ respectively by solving

π̂ (  1 )   ( c )
   ( −b ) = ( 0 )
   (  0 )

or (X′X)⁻¹X′ (y1  Y1  Y2) (  1 )   ( c )
                          ( −b ) = ( 0 )
                          (  0 )

⇒ X′y1 − X′Y1 b = X′X ( c )
                      ( 0 ).

Since X = (X1  X2), this gives
(X1′Y1) b + (X1′X1) c = X1′ y1    (i)
(X2′Y1) b + (X2′X1) c = X2′ y1.   (ii)
These are K equations in (G∆ + K* − 1) unknowns. Solving equations (i) and (ii) gives the ILS estimators of β and γ.
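A simulation sketch of ILS for the just-identified supply equation of Example 1, with income shifting only the demand equation; all numerical values below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
a1, b1, g1 = 10.0, -1.0, 0.5     # demand: q = a1 + b1*p + g1*inc  (illustrative)
a2, b2 = 2.0, 1.0                # supply: q = a2 + b2*p

inc = rng.normal(5.0, 2.0, n)    # income, the exogenous demand shifter
e1 = rng.normal(0.0, 1.0, n)
e2 = rng.normal(0.0, 1.0, n)
p = (a1 - a2 + g1 * inc + e1 - e2) / (b2 - b1)   # market-clearing price
q = a2 + b2 * p + e2                             # transacted quantity

# Step 1: OLS on the two reduced form equations
X = np.column_stack([np.ones(n), inc])
pi_p = np.linalg.lstsq(X, p, rcond=None)[0]      # (pi11, pi12)
pi_q = np.linalg.lstsq(X, q, rcond=None)[0]      # (pi21, pi22)

# Step 2: solve back for the supply parameters
b2_ils = pi_q[1] / pi_p[1]                       # beta2 = pi22 / pi12
a2_ils = pi_q[0] - b2_ils * pi_p[0]              # alpha2 = pi21 - beta2*pi11
print(b2_ils, a2_ils)                            # close to 1.0 and 2.0
```

Note that a direct OLS regression of q on p would be inconsistent here, because p is correlated with the structural disturbances.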
2. Two stage least squares (2SLS) or generalized classical linear (GCL) method
This is the more widely used estimation procedure, as it is applicable to exactly identified as well as over-identified equations. The least squares method is applied in two stages.
Stage 1: Apply the least squares method to estimate the reduced form parameters in the reduced form model
Y1 = Xπ1 + V1
⇒ π̂1 = (X′X)⁻¹ X′Y1,  Ŷ1 = Xπ̂1.
Stage 2: Replace Y1 in the structural equation y1 = Y1β + X1γ + ε by Ŷ1 and apply OLS to the structural equation thus obtained, as follows:
y1 = Ŷ1 β + X1 γ + ε
   = X(X′X)⁻¹X′Y1 β + X1 γ + ε
   = ( X(X′X)⁻¹X′Y1   X1 ) ( β )
                           ( γ ) + ε
   = Â δ + ε
where Â = ( X(X′X)⁻¹X′Y1   X1 ) = ( Ŷ1   X1 ) = HA, H = X(X′X)⁻¹X′ is idempotent, A = (Y1, X1) and δ = (β′ γ′)′.
Applying OLS to y1 = Âδ + ε gives the OLSE of δ as
δ̂ = (Â′Â)⁻¹ Â′ y1

or ( β̂ )   ( Y1′HY1  Y1′X1 )⁻¹ ( Y1′H y1 )
   ( γ̂ ) = ( X1′Y1   X1′X1 )   ( X1′ y1  )

          = ( Y1′Y1 − V̂1′V̂1  Y1′X1 )⁻¹ ( Y1′ − V̂1′ )
            ( X1′Y1           X1′X1 )   ( X1′       ) y1
where V1 = Y1 − Xπ1 is estimated by V̂1 = Y1 − Xπ̂1 = (I − H)Y1 = H̄Y1 with H̄ = I − H. Solving these two equations gives β̂ and γ̂, which are the two stage least squares estimators of β and γ respectively.
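The two stages can be sketched in a few lines. The simulated market below mirrors the wheat example, with income as the exogenous variable excluded from the supply equation; all numerical values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
inc = rng.normal(5.0, 2.0, n)                    # excluded exogenous variable
e1 = rng.normal(0.0, 1.0, n)
e2 = rng.normal(0.0, 1.0, n)
p = (10.0 - 2.0 + 0.5 * inc + e1 - e2) / 2.0     # equilibrium price (endogenous)
q = 2.0 + 1.0 * p + e2                           # supply: alpha2 = 2, beta2 = 1

X = np.column_stack([np.ones(n), inc])           # all predetermined variables

# Stage 1: regress the endogenous regressor p on X and keep the fitted values
p_hat = X @ np.linalg.lstsq(X, p, rcond=None)[0]

# Stage 2: OLS of q on (p_hat, intercept)
A_hat = np.column_stack([p_hat, np.ones(n)])
beta2_hat, alpha2_hat = np.linalg.lstsq(A_hat, q, rcond=None)[0]
print(beta2_hat, alpha2_hat)                     # close to 1.0 and 2.0
```

Since this equation is exactly identified, the 2SLS estimates coincide with the ILS estimates; with over-identified equations only 2SLS remains applicable.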
Now we examine the consistency of δ̂. Since
δ̂ = (Â′Â)⁻¹ Â′ y1,
we have
δ̂ − δ = (Â′Â)⁻¹ Â′ ε
plim(δ̂ − δ) = [plim (1/n) Â′Â]⁻¹ plim (1/n) Â′ε.
Since the predetermined variables are uncorrelated with ε,
plim (1/n) X′ε = 0.
For plim (1/n) Ŷ1′ε, we observe that
plim (1/n) Ŷ1′ε = plim (1/n) π̂1′ X′ε
                = (plim π̂1)′ plim (1/n) X′ε
                = π1′ · 0
                = 0.
Thus plim(δ̂ − δ) = 0, and so the 2SLS estimators are consistent.
The asymptotic covariance matrix of δ̂ is
Asy Var(δ̂) = n⁻¹ plim [ n (δ̂ − δ)(δ̂ − δ)′ ]
           = n⁻¹ plim [ n (Â′Â)⁻¹ Â′ εε′ Â (Â′Â)⁻¹ ]
           = n⁻¹ [plim (1/n) Â′Â]⁻¹ [plim (1/n) Â′εε′Â] [plim (1/n) Â′Â]⁻¹
           = n⁻¹ σε² [plim (1/n) Â′Â]⁻¹
where Var(εt) = σε².
The variance σε² is estimated by
s² = (y1 − Y1β̂ − X1γ̂)′ (y1 − Y1β̂ − X1γ̂) / (n − G − K).
There are some other methods also available for the estimation of the parameters.
We have used zero-one type restrictions. Some other types of prior information that can be used as restrictions are
1. Normalizations.
2. Restrictions among parameters within each structural equation.
3. Restriction among parameters across structural equations.
4. Restrictions on some of the elements of Σ.
ECONOMETRIC THEORY
MODULE – XVIII
Lecture - 44
Seemingly Unrelated Regression
Equations Models
A basic feature of the multiple regression model is that it describes the behaviour of a particular study variable based on a set of explanatory variables. When the objective is to explain the whole system, there may be more than one multiple regression equation. For example, in a set of individual linear multiple regression equations, each equation may explain some economic phenomenon. One approach to handling such a set of equations is the set-up of the simultaneous equations model, in which one or more of the explanatory variables in one or more equations is itself the dependent (endogenous) variable associated with another equation in the full system. On the other hand, suppose that none of the variables in the system is simultaneously both explanatory and dependent in nature. There may still be interactions between the individual equations if the random error components associated with at least some of the different equations are correlated with each other. This means that the equations may be linked statistically, even though not structurally – through the jointness of the distribution of the error terms and through the non-diagonal covariance matrix. Such behaviour is reflected in the seemingly unrelated regression equations (SURE) model, in which the individual equations are in fact related to one another, even though superficially they may not seem to be.
The basic philosophy of the SURE model is as follows. The jointness of the equations is explained by the structure of the
SURE model and the covariance matrix of the associated disturbances. Such jointness introduces additional information
which is over and above the information available when the individual equations are considered separately. So it is desired
to consider all the separate relationships collectively to draw the statistical inferences about the model parameters.
Example
Suppose a country has 20 states and the objective is to study the consumption pattern of the country. There is one consumption equation for each state, so altogether there are 20 equations which describe 20 consumption functions. It is also not necessary that the same variables are present in all the models; different equations may contain different variables. It may be noted that the consumption patterns of neighbouring states may have characteristics in common. Apparently the equations may look distinct individually, but there may be some kind of relationship existing among them. Such equations can be used to examine the jointness of the distribution of the disturbances. It seems reasonable to assume that the error terms associated with the equations are contemporaneously correlated. The equations are then apparently or "seemingly" unrelated regressions rather than independent relationships.
Model
The SURE model consists of M regression equations

y_ti = Σ_{j=1}^{k_i} β_ij x_tij + ε_ti,   t = 1, 2, ..., T;  i = 1, 2, ..., M

where y_ti is the t-th observation on the i-th dependent variable which is to be explained by the i-th regression equation, x_tij is
the t-th observation on the j-th explanatory variable appearing in the i-th equation, β_ij is the coefficient associated with x_tij at each
observation, and ε_ti is the t-th value of the random error component associated with the i-th equation of the model.
These M equations can be written compactly as

[y_1]   [X_1  0   ...  0  ] [β_1]   [ε_1]
[y_2] = [0   X_2  ...  0  ] [β_2] + [ε_2]
[ ⋮ ]   [ ⋮    ⋮         ⋮ ] [ ⋮ ]   [ ⋮ ]
[y_M]   [0    0   ...  X_M] [β_M]   [ε_M]

or y = Xβ + ε.
Treat each of the M equations as the classical regression model and make the conventional assumptions for i = 1, 2, ..., M:
• X_i is fixed.
• rank(X_i) = k_i.
• lim_{T→∞} (1/T) X_i'X_i = Q_ii, where Q_ii is nonsingular with fixed and finite elements.
• E(u_i) = 0.
• E(u_i u_i') = σ_ii I_T, where σ_ii is the variance of the disturbances in the i-th equation for each observation in the sample.
For the interactions across the equations, assume further:
• lim_{T→∞} (1/T) X_i'X_j = Q_ij
• E(u_i u_j') = σ_ij I_T;  i, j = 1, 2, ..., M
where Q_ij is a nonsingular matrix with fixed and finite elements and σ_ij is the covariance between the disturbances of the i-th
and j-th equations for each observation in the sample.
Compactly, we can write
E(ε) = 0

           [σ_11 I_T  σ_12 I_T  ...  σ_1M I_T]
E(εε') =   [σ_21 I_T  σ_22 I_T  ...  σ_2M I_T]  = Σ ⊗ I_T = ψ.
           [   ⋮          ⋮              ⋮    ]
           [σ_M1 I_T  σ_M2 I_T  ...  σ_MM I_T]
By using the terminologies “contemporaneous” and “intertemporal” covariance, we are implicitly assuming that the data
are available in time series form but this is not restrictive. The results can be used for cross-section data also. The
constancy of the contemporaneous covariances across sample points is a natural generalization of homoskedastic
disturbances in a single equation model.
It is clear that the M equations may appear to be not related in the sense that there is no simultaneity between the
variables in the system and each equation has its own explanatory variables to explain the study variable. The equations
are related stochastically through the disturbances, which are contemporaneously correlated across the equations of the model. That
is why this system is referred to as the SURE model.
The SURE model is a particular case of the simultaneous equations model involving M structural equations with M jointly
dependent variables and k (≥ k_i for all i) distinct exogenous variables, and in which neither current nor lagged endogenous
variables appear as explanatory variables in any of the structural equations.
The SURE model differs from the multivariate regression model only in the sense that it takes account of prior information
concerning the absence of certain explanatory variables from certain equations of the model. Such exclusions are highly
realistic in many economic situations.
The ordinary least squares (OLS) estimator of β is
b_0 = (X'X)^(-1) X'y.
Further,
E(b_0) = β
V(b_0) = E(b_0 − β)(b_0 − β)' = (X'X)^(-1) X'ψX (X'X)^(-1).
The generalized least squares (GLS) estimator of β is
β̂ = (X'ψ^(-1)X)^(-1) X'ψ^(-1)y
  = [X'(Σ^(-1) ⊗ I_T)X]^(-1) X'(Σ^(-1) ⊗ I_T)y
with
E(β̂) = β
V(β̂) = E(β̂ − β)(β̂ − β)' = (X'ψ^(-1)X)^(-1) = [X'(Σ^(-1) ⊗ I_T)X]^(-1).
Define
G = (X'X)^(-1) X' − (X'ψ^(-1)X)^(-1) X'ψ^(-1);
then GX = 0 and we find that
V(b_0) − V(β̂) = Gψ G'.
Since ψ is positive definite, Gψ G' is at least positive semidefinite, and so the GLSE is, in general, more efficient than the
OLSE for estimating β. In fact, using the result that the GLSE is the best linear unbiased estimator of β, we can conclude that
β̂ is the best linear unbiased estimator in this case also.
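As a small numerical sketch (Python with numpy; not part of the original notes), the stacked GLS estimator β̂ = (X'ψ⁻¹X)⁻¹X'ψ⁻¹y can be formed with ψ = Σ ⊗ I_T. When Σ is diagonal the equations carry no cross-equation information, and GLS collapses to equation-by-equation OLS, as the following simulated two-equation system with hypothetical regressors illustrates:

```python
import numpy as np

rng = np.random.default_rng(0)
T, M = 50, 2                        # T observations, M equations
X1 = np.column_stack([np.ones(T), rng.normal(size=T)])
X2 = np.column_stack([np.ones(T), rng.normal(size=(T, 2))])   # different regressors
X = np.block([[X1, np.zeros((T, X2.shape[1]))],
              [np.zeros((T, X1.shape[1])), X2]])              # block-diagonal X

Sigma = np.diag([1.0, 4.0])         # diagonal contemporaneous covariance (sigma_12 = 0)
Psi = np.kron(Sigma, np.eye(T))     # psi = Sigma (Kronecker) I_T

beta = np.array([1.0, 2.0, -1.0, 0.5, 3.0])
e = rng.multivariate_normal(np.zeros(M), Sigma, size=T)       # T x M disturbances
y = X @ beta + np.concatenate([e[:, 0], e[:, 1]])             # stacked y = (y1', y2')'

Pinv = np.linalg.inv(Psi)
b_gls = np.linalg.solve(X.T @ Pinv @ X, X.T @ Pinv @ y)       # GLS estimator
b_ols = np.linalg.solve(X.T @ X, X.T @ y)                     # OLS on the stacked system

# With a diagonal Sigma, GLS reduces to equation-by-equation OLS:
print(np.allclose(b_gls, b_ols))    # True
```

With a non-diagonal Σ the two estimators would differ, and the efficiency gain Gψ G' above would be strictly positive semidefinite in general.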
In practice Σ is usually unknown and is replaced by an estimator S. With such replacement, we obtain a feasible generalized least squares (FGLS) estimator of β as
β̂_F = [X'(S^(-1) ⊗ I_T)X]^(-1) X'(S^(-1) ⊗ I_T)y.
Estimation of Σ
There are two possible ways to estimate the σ_ij's. In the first approach, let Z denote the (T × K) matrix of all the K distinct explanatory variables appearing in the system.
Regress each of the M study variables on the columns of Z and obtain the (T × 1) residual vectors
ε̂_i = y_i − Z(Z'Z)^(-1)Z'y_i = H_Z y_i,  i = 1, 2, ..., M
where H_Z = I_T − Z(Z'Z)^(-1)Z'.
Then obtain
s_ij = (1/T) ε̂_i'ε̂_j = (1/T) y_i'H_Z y_j
and construct the matrix S = ((s_ij)) accordingly.
Since the columns of X_i are contained among the columns of Z, we can write
X_i = Z J_i
where J_i is a (K × k_i) selection matrix. Then
H_Z X_i = X_i − Z(Z'Z)^(-1)Z'X_i = X_i − Z J_i = 0
and thus
y_i'H_Z y_j = (β_i'X_i' + ε_i') H_Z (X_j β_j + ε_j) = ε_i'H_Z ε_j.
Hence
E(s_ij) = (1/T) E(ε_i'H_Z ε_j) = (1/T) σ_ij tr(H_Z) = (1 − K/T) σ_ij = ((T − K)/T) σ_ij.
Thus an unbiased estimator of σ_ij is given by (T/(T − K)) s_ij.
In the second approach, the prior restrictions which distinguish the SURE model from the multivariate regression model are used as follows.
Regress y_i on X_i, i.e., estimate each equation i = 1, 2, ..., M by OLS and obtain the residual vector ε̂_i = H_{X_i} y_i, where
H_{X_i} = I − X_i(X_i'X_i)^(-1)X_i'
H_{X_j} = I − X_j(X_j'X_j)^(-1)X_j'.
Then
tr(H_{X_i} H_{X_j}) = T − k_i − k_j + tr[(X_i'X_i)^(-1) X_i'X_j (X_j'X_j)^(-1) X_j'X_i],
MODULE – XIX
Lecture - 45
Exercises
Dr. Shalabh
Department of Mathematics and Statistics
Indian Institute of Technology Kanpur
Exercise 1
The following data has been obtained on 26 days on the price of a share in the stock market and the daily change in the
stock market index. Now we illustrate the use of regression tools and the interpretation of the output from a statistical
software through this example.
Day no. (i)  Change in index (x_i)  Value of share (y_i)   Day no. (i)  Change in index (x_i)  Value of share (y_i)
 1   87   199       14   70   172
 2   52   160       15   71   176
 3   50   160       16   65   165
 4   88   175       17   67   150
 5   79   193       18   75   170
 6   70   189       19   80   189
 7   40   155       20   82   190
 8   63   150       21   76   185
 9   70   170       22   86   190
10   60   171       23   68   165
11   44   160       24   89   200
12   47   165       25   70   172
13   42   148       26   83   190
The first step involves checking whether a simple linear regression model can be fitted to this data or not. For this, we plot
a scatter diagram as follows:
Figure 1  Figure 2
Looking at the scatter plots in Figures 1 and 2, an increasing linear trend in the relationship between the share price and the
change in index is clear. It can also be concluded from Figure 2 that it is reasonable to fit a linear regression model.
A typical output of a statistical software on the regression of value of share (y) versus change in index (x) will look as follows:
Analysis of Variance
Source DF SS MS F P
Regression 1 3961.0 3961.0 45.15 0.000
Residual Error 24 2105.3 87.7
Lack of Fit 21 1868.6 89.0 1.13 0.536
Pure Error 3 236.8 78.9
Total 25 6066.3
Unusual Observations
Obs x y Fit SE Fit Residual St Resid
8 63.0 150.00 168.99 1.95 -18.99 -2.07R
17 67.0 150.00 172.38 1.84 -22.38 -2.44R
First look at the estimates of the regression coefficients:
b_0 = ȳ − b_1 x̄ = 115.544 ≈ 116
b_1 = s_xy / s_xx = 0.8483
where
s_xy = Σ_{i=1}^n (x_i − x̄)(y_i − ȳ),  s_xx = Σ_{i=1}^n (x_i − x̄)²,  x̄ = (1/n) Σ_{i=1}^n x_i,  ȳ = (1/n) Σ_{i=1}^n y_i.
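These least squares formulas can be checked directly on the data of Exercise 1 (a Python sketch with numpy; the data arrays are transcribed from the table above):

```python
import numpy as np

# Data from Exercise 1: change in index (x) and value of share (y) on 26 days
x = np.array([87, 52, 50, 88, 79, 70, 40, 63, 70, 60, 44, 47, 42,
              70, 71, 65, 67, 75, 80, 82, 76, 86, 68, 89, 70, 83], float)
y = np.array([199, 160, 160, 175, 193, 189, 155, 150, 170, 171, 160, 165, 148,
              172, 176, 165, 150, 170, 189, 190, 185, 190, 165, 200, 172, 190], float)

sxx = np.sum((x - x.mean()) ** 2)               # corrected sum of squares of x
sxy = np.sum((x - x.mean()) * (y - y.mean()))   # corrected cross product
b1 = sxy / sxx                                  # slope
b0 = y.mean() - b1 * x.mean()                   # intercept
print(round(b0, 3), round(b1, 4))               # 115.544 0.8483
```

This reproduces the values 'Constant 115.544' and slope 0.8483 reported in the software output.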
The same is indicated in the output by 'Constant 115.544' and the slope coefficient 0.8483.
The term SE Coef denotes the standard errors of the estimates of the intercept term and the slope parameter. They are
obtained from
Var(b_0) = s² (1/n + x̄²/s_xx), giving SE(b_0) = 8.807
Var(b_1) = s²/s_xx, giving SE(b_1) = 0.1262
where
s² = Σ_{i=1}^n (y_i − b_0 − b_1 x_i)² / (n − 2),  n = 26.
The term T denotes the value of the t-statistic for testing the significance of a regression coefficient. For testing
H_0: β_0 = β_00 with β_00 = 0,
t_0 = (b_0 − β_00) / sqrt[ (SS_res/(n − 2)) (1/n + x̄²/s_xx) ] = 13.12,
and for testing
H_0: β_1 = β_10 with β_10 = 0,
t_0 = (b_1 − β_10) / sqrt[ SS_res / ((n − 2) s_xx) ] = 6.72.
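Using summary quantities from the output above (n, x̄, s_xx, SS_res and the fitted coefficients, all rounded slide values), the standard errors and t-statistics can be reproduced in a few lines:

```python
import math

# Summary quantities taken from the Exercise 1 output above
n, xbar, sxx = 26, 68.2308, 5504.62
b0, b1 = 115.544, 0.8483
ss_res = 2105.3

s2 = ss_res / (n - 2)                            # MS_res, approximately 87.7
se_b0 = math.sqrt(s2 * (1 / n + xbar**2 / sxx))  # SE of intercept
se_b1 = math.sqrt(s2 / sxx)                      # SE of slope
t0, t1 = b0 / se_b0, b1 / se_b1                  # t-statistics for beta_00 = beta_10 = 0
print(round(se_b0, 3), round(se_b1, 4))          # approx. 8.807 0.1262
print(round(t0, 2), round(t1, 2))                # approx. 13.12 6.72
```

Small discrepancies can arise only from the rounding of the inputs; the values match the SE Coef and T columns of the output.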
The corresponding P values are all 0.000, which are less than the level of significance α = 0.05. This indicates that the
null hypotheses are rejected and the intercept term and slope parameter are significant.
Now we discuss the interpretation of the goodness of fit statistics. The adjusted coefficient of determination is
R²_adj = 1 − ((n − 1)/(n − k)) (1 − R²) = 0.638 with n = 26, k = 2,
where k counts the estimated parameters (intercept and slope). This means that the fitted model can explain about 64% of the variability in y through x.
Next, the PRESS statistic is
PRESS = Σ_{i=1}^n (y_i − ŷ_(i))² = Σ_{i=1}^n (e_i/(1 − h_ii))² = 2418.97.
This is also a measure of model quality. It gives an idea how well a regression model will perform in predicting new data. It
is apparent that the value is quite high. A model with small value of PRESS is desirable.
The value of R2 for prediction based on PRESS statistics is given by R-Sq(pred) = 60.12%
This value is obtained from
R²_prediction = 1 − PRESS/SS_T = 0.6012.
This statistic gives an indication of the predictive capability of the fitted regression model.
Here R²_prediction = 0.6012 indicates that the model is expected to explain about 60% of the variability in predicting new
observations.
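The PRESS computation for a simple linear regression can be sketched as follows (the helper `press_simple` is illustrative, not from the original notes; it is demonstrated on a tiny hand-checkable dataset rather than the share-price data):

```python
import numpy as np

def press_simple(x, y):
    """PRESS for a simple linear regression, via the leverages h_ii."""
    n = len(x)
    xbar = x.mean()
    sxx = np.sum((x - xbar) ** 2)
    b1 = np.sum((x - xbar) * (y - y.mean())) / sxx
    b0 = y.mean() - b1 * xbar
    e = y - (b0 + b1 * x)                     # ordinary residuals
    h = 1 / n + (x - xbar) ** 2 / sxx         # leverages (diagonal of the hat matrix)
    return np.sum((e / (1 - h)) ** 2)         # sum of squared PRESS residuals

# Tiny hand-checkable case: x = (0,1,2), y = (0,1,3) gives PRESS = 2.25
x = np.array([0.0, 1.0, 2.0])
y = np.array([0.0, 1.0, 3.0])
print(round(press_simple(x, y), 6))           # 2.25
```

Since each h_ii lies strictly between 0 and 1, every PRESS residual is at least as large in magnitude as the ordinary residual, so PRESS ≥ SS_res always; a large gap between the two warns that the fit leans heavily on a few observations.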
Now we look at the following output on analysis of variance.
Analysis of Variance
Source DF SS MS F P
Regression 1 3961.0 3961.0 45.15 0.000
Residual Error 24 2105.3 87.7
Lack of Fit 21 1868.6 89.0 1.13 0.536
Pure Error 3 236.8 78.9
Total 25 6066.3
However, in this example with one explanatory variable, the analysis of variance does not add much beyond the t-test: it is
equivalent to testing the hypothesis
H_0: β_1 = β_10 with β_10 = 0 by
t_0 = (b_1 − β_10) / sqrt[ SS_res / ((n − 2) s_xx) ].
It may be noted that the test of hypothesis for lack of fit has a P-value of 0.536, which is more than 0.05, so the null hypothesis
of no lack of fit is not rejected and there is no significant evidence of lack of fit. At the same time, the value of the coefficient of
determination is only around 0.65, which is not very high. The reason for such a moderate value is clear from the scatter diagram: the points are fairly dispersed around the line.
Now we look into the following part of the output. The following outcome presents the analysis of residuals to find
unusual observations which can possibly be outliers or extreme observations. Its interpretation in the present case is
that the 8th and 17th observations need attention.
Unusual Observations
Obs x y Fit SE Fit Residual St Resid
8 63.0 150.00 168.99 1.95 -18.99 -2.07R
17 67.0 150.00 172.38 1.84 -22.38 -2.44R
The standardized residuals are high for the 8th and 17th observations which indicates that these two observations are
unusual and need attention for further exploration.
The Durbin-Watson statistic is
d = Σ_{t=2}^n (e_t − e_{t-1})² / Σ_{t=1}^n e_t² = 1.66.
A value of d near 2 indicates that there is no first order autocorrelation in the data. As the value d = 1.66 is less than 2, it
indicates the possible presence of first order positive autocorrelation in the data.
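The Durbin-Watson statistic is computed directly from the residuals as in the formula above (a short sketch; `durbin_watson` is an illustrative helper, not software output):

```python
import numpy as np

def durbin_watson(e):
    """d = sum_{t=2}^n (e_t - e_{t-1})^2 / sum_{t=1}^n e_t^2."""
    e = np.asarray(e, float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

# Perfectly alternating residuals push d toward the negative-autocorrelation
# end of the 0-to-4 scale; for this short series d works out to exactly 3:
print(durbin_watson([1, -1, 1, -1]))   # 3.0
```

Residuals with strong positive autocorrelation change slowly, making the successive differences small and d close to 0; independent residuals give d near 2.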
Now we analyze the following output. The residuals are standardized based on the concept of a residual minus its mean,
divided by its standard deviation. Since E(e_i) = 0 and MS_res estimates the approximate average variance, the scaling of the residuals is
d_i = e_i / sqrt(MS_res),  i = 1, 2, ..., n
with E(d_i) = 0 and Var(d_i) ≈ 1.
The normal probability plot is a plot of the ordered standardized residuals versus the cumulative probability
P_i = (i − 1/2)/n,  i = 1, 2, ..., n.
It can be seen from the plot that most of the points are lying close to the line. This clearly indicates that the assumption of
normal distribution for random errors is satisfactory in the given data.
Next we consider the graph between the residuals and the fitted values.
Such a plot is helpful in detecting several common type of model inadequacies.
If the plot is such that the residuals can be contained in a horizontal band (and the residuals fluctuate more or less randomly
inside the band), then there are no obvious model defects.
The points in this plot can more or less be contained in a horizontal band, so the model has no obvious defects. The
picture may not fully indicate that all the points can be enclosed in the horizontal band. This was indicated by the
coefficient of determination, which is only around 63%. This indicates the degree of lack of fit, and it is reflected in this graph
also.
Exercise 2
The following data has 22 observations on the study variable with three explanatory variables X_1i, X_2i and X_3i.
Now we illustrate the use of regression tools and the interpretation of the output from a statistical software through this
example.
Looking at the scatter plots in various blocks, a good degree of linear trend in the relationship between the study variable and
each of the explanatory variables is clear. The degrees of linear relationship in the various blocks are different. It can also be
concluded that it is appropriate to fit a linear regression model.
A typical output of a statistical software on the regression of y on X_1, X_2, X_3 will look as follows:
Analysis of Variance
Source DF SS MS F P
Regression 3 2111.52 703.84 10.23 0.000
Residual Error 18 1238.85 68.82
Total 21 3350.36
Now we discuss the interpretation of the results part-wise.
First look at the following section of the output.
b = (X'X)^(-1) X'y
which is a (4 × 1) vector
b = (b_1, b_2, b_3, b_4)' = (71.6, −0.758, −0.417, 0.106)'
where b_1 is the OLS estimator of the intercept term and b_2, b_3, b_4 are the OLS estimators of the regression coefficients.
The same is indicated by 'Constant 71.60', 'x1 -0.7585', 'x2 -0.4166', and 'x3 0.1063'.
The term SE Coef denotes the standard errors of the estimates of the intercept term and the slope parameters. They are
obtained as the positive square roots of the diagonal elements of the estimated covariance matrix of b, based on
σ̂² = Σ_{i=1}^n (y_i − b_1 − b_2 x_i2 − b_3 x_i3 − b_4 x_i4)² / (n − 4),  n = 22.
The term T denotes the values of the t-statistics for testing the significance of the regression coefficients. They are obtained for testing
H_0j: β_j = β_j0 with β_j0 = 0 against H_1j: β_j ≠ β_j0
using the statistic
t_j = (b_j − β_j0) / sqrt(Var(b_j)),  j = 1, 2, 3, 4.
A null hypothesis is rejected at the α = 0.05 level of significance when the corresponding P value is less than 0.05.
Thus H_01: β_1 = 0, H_02: β_2 = 0 and H_03: β_3 = 0 are rejected, and H_04: β_4 = 0 is accepted. Thus the variables X_1 and X_2 stay in
the model and X_3 leaves the model.
Next we consider the values of variance inflation factor (VIF) given in the output as VIF. The variance inflation factor for
the jth explanatory variable is defined as
VIF_j = 1 / (1 − R_j²).
This is the factor which is responsible for inflating the sampling variance. The combined effect of dependencies among the
explanatory variables on the variance of a term is measured by the VIF of that term in the model. One or more large VIFs
indicate the presence of multicollinearity in the data.
In practice, usually a VIF > 5 or 10 indicates that the associated regression coefficients are poorly estimated because of
multicollinearity.
In our case, the values of VIFs due to first, second and third explanatory variables are 2.9, 2.3 and 4.2, respectively. Thus
there is no indication of the presence of multicollinearity in the data.
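The VIFs can be computed from the auxiliary regressions of each explanatory variable on the others (a Python sketch; `vifs` is an illustrative helper and R_j² here is taken as the coefficient of determination of that auxiliary regression, a standard reading of the formula above):

```python
import numpy as np

def vifs(X):
    """VIF_j = 1/(1 - R_j^2), where R_j^2 comes from regressing the j-th
    column of X on the remaining columns (with an intercept)."""
    n, k = X.shape
    out = []
    for j in range(k):
        y = X[:, j]
        Z = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        b = np.linalg.lstsq(Z, y, rcond=None)[0]   # auxiliary OLS fit
        resid = y - Z @ b
        r2 = 1 - resid @ resid / np.sum((y - y.mean()) ** 2)
        out.append(1 / (1 - r2))
    return out

# Uncorrelated columns give VIFs near 1, i.e. no multicollinearity:
X = np.array([[1., 1.], [2., -1.], [3., 1.], [4., -1.], [5., 1.]])
print([round(v, 2) for v in vifs(X)])   # [1.0, 1.0]
```

As a column of X becomes nearly a linear combination of the others, R_j² approaches 1 and the corresponding VIF blows up, which is exactly the inflation of sampling variance described above.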
Now we discuss the interpretation of the goodness of fit statistics.
R² = SS_reg/SS_T = 0.630. This means that the fitted model can explain about 63% of the variability in y through the X's.
The adjusted coefficient of determination is
R²_adj = 1 − ((n − 1)/(n − k)) (1 − R²) = 0.569 with n = 22, k = 4.
This means that the fitted model can explain about 57% of the variability in y through the X's.
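Both values follow directly from the ANOVA table of Exercise 2 (a quick arithmetic check in Python):

```python
# Sums of squares from the Exercise 2 ANOVA table
ss_reg, ss_total = 2111.52, 3350.36
n, k = 22, 4                      # k counts all estimated parameters

r2 = ss_reg / ss_total                         # coefficient of determination
r2_adj = 1 - (n - 1) / (n - k) * (1 - r2)      # adjusted R-squared
print(round(r2, 3), round(r2_adj, 3))          # 0.63 0.569
```

The adjustment penalizes the extra parameters: with k = 4 and only 22 observations, about one fifth of the unexplained variation is charged back against the fit.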
Next, the PRESS statistic is
PRESS = Σ_{i=1}^n (e_i/(1 − h_ii))² = 2554.13.
This is also a measure of model quality. It gives an idea how well a regression model will perform in predicting new data. It
is apparent that the value is quite high. A model with small value of PRESS is desirable.
The value of R2 for prediction based on PRESS statistics is given by R-Sq(pred) = 23.77%
This value is obtained from
R²_prediction = 1 − PRESS/SS_T = 0.2377.
This statistic gives an indication of the predictive capability of the fitted regression model. Here R²_prediction = 0.2377
indicates that the model is expected to explain about 24% of the variability in predicting new observations. This is also
reflected in the value of PRESS statistic. So the fitted model is not very good for prediction purpose.
Now we look at the following output on analysis of variance. This output tests the null hypothesis H_0: β_2 = β_3 = β_4 = 0
against the alternative hypothesis that at least one of these regression coefficients is nonzero. There are three
explanatory variables.
Analysis of Variance
Source DF SS MS F P
Regression 3 2111.52 703.84 10.23 0.000
Residual Error 18 1238.85 68.82
Total 21 3350.36
The mean square due to regression is obtained as MS_reg = SS_reg/3 = 703.84.
The mean square due to error is obtained as MS_res = SS_res/18 = 68.82.
The F-statistic is obtained by F = MS_reg/MS_res = 10.23.
The null hypothesis is rejected at the 5% level of significance because the P value is less than α = 0.05.
Now we look into the following part of the output. The following outcome presents the analysis of residuals to find an
unusual observation which can possibly be an outlier or an extreme observation.
The Durbin-Watson statistic is
d = Σ_{t=2}^n (e_t − e_{t-1})² / Σ_{t=1}^n e_t² = 1.46718.
A value of d near 2 indicates that there is no first order autocorrelation in the data. As the value d = 1.47 is less than 2, it
indicates the possible presence of first order positive autocorrelation in the data.
Next we consider the graphical analysis.
The normal probability plot is a plot of the ordered standardized residuals versus the cumulative probability
P_i = (i − 1/2)/n,  i = 1, 2, ..., n.
It can be seen from the plot that most of the points are lying close to the line. This clearly indicates that the assumption of
normal distribution for random errors is satisfactory.
Next we consider the graph between the residuals and the fitted values.
Such a plot is helpful in detecting several common type of model inadequacies.
If the plot is such that the residuals can be contained in a horizontal band (and the residuals fluctuate more or less randomly
inside the band), then there are no obvious model defects.
The points in this plot can more or less be contained in an outward-opening funnel. So the assumption of constant
variance is violated, and it indicates that the variance possibly increases with the increase in the values of the study variable.