Yi = β2 Xi + ui …….. (1)
In this model the intercept term is absent or zero, hence the name regression through the origin.
As an illustration, consider the Capital Asset Pricing Model (CAPM) of modern portfolio theory, which,
in its risk-premium form, may be expressed as
(ERi − rf) = βi (ERm − rf) …….. (2)
Where,
ERi = Expected rate of return on security i
ERm = Expected rate of return on the market portfolio as represented by, say, the S&P 500
composite stock index
rf = Risk-free rate of return, say, the return on 90-day Treasury bills
βi = The Beta coefficient, a measure of systematic risk, i.e., risk that cannot be eliminated through
diversification. Also, a measure of the extent to which the ith security’s rate of return moves
with the market.
A security with βi > 1 is said to be volatile or aggressive; one with βi < 1 is defensive.
(Note: Do not confuse this βi with the slope coefficient of the two-variable regression, β2.)
If capital markets work efficiently, then CAPM postulates that security i’s expected risk premium
(= ERi − rf) is equal to that security’s β coefficient times the expected market risk premium (= ERm − rf).
If the CAPM holds, we have the situation depicted in Fig.1. The line shown in the figure is known as the
security market line (SML). For empirical purposes, (2) is often expressed as
…….. (3)
Fig.1: Systematic risk. Fig.2: The Market Model of Portfolio Theory (assuming αi = 0).
In equation (3) the dependent variable, Y, is (Ri − rf) and the explanatory variable, X, is βi, the
volatility coefficient, and not (Rm − rf). Therefore, to run regression (3), one must first estimate βi, which
is usually derived from the characteristic line.
As this example shows, sometimes the underlying theory dictates that the intercept term be absent
from the model. Other instances where the zero intercept model may be appropriate are Milton
Friedman’s permanent income hypothesis, which states that permanent consumption is proportional to
permanent income; cost analysis theory, where it is postulated that the variable cost of production is
proportional to output; and some versions of monetarist theory that state that the rate of change of prices
(i.e., the rate of inflation) is proportional to the rate of change of the money supply.
Estimation of the SRF of regression through the origin
Applying the OLS method to this model, we obtain the following formulas for ˆβ2 and its variance
(lowercase xi and yi denote deviations from the sample means):

Particulars of differences     Without intercept                  With intercept
ˆβ2                            Σ XiYi / Σ Xi²                     Σ xiyi / Σ xi²
Variance of ˆβ2                σ² / Σ Xi²                         σ² / Σ xi²
ˆσ² estimation                 Σ ûi² / (n − 1)                    Σ ûi² / (n − 2)
Sums used in the two           raw sums of squares and            mean-adjusted sums of squares
sets of formulas               cross products                     and cross products
df for computing ˆσ²           (n − 1)                            (n − 2)
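The two columns of formulas can be checked numerically. Below is a minimal Python sketch (the function names and the data are mine, purely for illustration) computing the slope both ways:

```python
# Illustrative sketch: slope estimates with and without an intercept,
# using the raw vs. mean-adjusted sums from the table above.
def slope_no_intercept(X, Y):
    # beta2_hat = sum(X*Y) / sum(X^2)  (raw sums of squares and cross products)
    return sum(x * y for x, y in zip(X, Y)) / sum(x * x for x in X)

def slope_with_intercept(X, Y):
    # beta2_hat = sum(x*y) / sum(x^2), with x, y measured from their means
    mx, my = sum(X) / len(X), sum(Y) / len(Y)
    sxy = sum((x - mx) * (y - my) for x, y in zip(X, Y))
    sxx = sum((x - mx) ** 2 for x in X)
    return sxy / sxx

X = [1.0, 2.0, 3.0, 4.0]
Y = [2.1, 3.9, 6.2, 7.8]
print(slope_no_intercept(X, Y))    # raw-sums slope
print(slope_with_intercept(X, Y))  # mean-adjusted slope
```

When the data really do pass through the origin (Y exactly proportional to X), both formulas give the same slope; otherwise they differ.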
Although this raw r², defined as (Σ XiYi)² / (Σ Xi² Σ Yi²), satisfies the relation 0 ≤ r² ≤ 1, it is not
directly comparable to the conventional r² value. For this reason, some authors do not report the r² value
for zero intercept regression models.
models. Because of these special features of this model, one needs to exercise great caution in using the
zero intercept regression model. Unless there is very strong a priori expectation, one would be well
advised to stick to the conventional, intercept-present model.
This has a dual advantage.
First, if the intercept term is included in the model but it turns out to be statistically insignificant (i.e.,
statistically equal to zero), for all practical purposes we have a regression through the origin.
Second, and more important, if in fact there is an intercept in the model but we insist on fitting a
regression through the origin, we would be committing a specification error, thus violating
Assumption 9 of the classical linear regression model.
Regression on Standardized Variables
The units in which the regressand and regressor(s) are expressed affect the interpretation of the
regression coefficients. This can be avoided if we are willing to express the regressand and regressor(s) as
standardized variables. A variable is said to be standardized if we subtract the mean value of the variable
from its individual values and divide the difference by the standard deviation of that variable. Thus, in the
regression of Y and X, if we redefine these variables as
Y*i = (Yi − Y̅) / SY  and  X*i = (Xi − X̅) / SX ……..(1)
where Y̅ = sample mean of Y, SY = sample standard deviation of Y, X̅ = sample mean of X, and SX is the
sample standard deviation of X; the variables Y*i and X*i are called standardized variables.
An interesting property of a standardized variable is that its mean value is always zero and its standard
deviation is always 1.
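This property is easy to verify numerically. A small Python sketch (the data are made up):

```python
import statistics as st

def standardize(v):
    # z_i = (v_i - mean) / sample standard deviation
    m, s = st.mean(v), st.stdev(v)
    return [(x - m) / s for x in v]

x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
z = standardize(x)
print(st.mean(z))   # ~0 (up to floating-point error)
print(st.stdev(z))  # ~1
```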
As a result, it does not matter in what unit the regressand and regressor(s) are measured. Therefore,
instead of running the standard (bivariate) regression:
Yi = β1 + β2 Xi + ui ……..(2)
we could run regression on the standardized variables as
Y*i = β*1 + β*2 X*i + u*i ……..(3)
since it is easy to show that, in the regression involving standardized regressand and regressor(s), the
intercept term is always zero. The regression coefficients of the standardized variables, denoted by β*1
and β*2, are known in the literature as the beta coefficients. Incidentally, notice that (3) is a regression
through the origin. How do we interpret the beta coefficients? The interpretation is that if the
(standardized) regressor increases by one standard deviation, on average, the (standardized) regressand
increases by β*2 standard deviation units. Thus, unlike the traditional model (2), we measure the effect
not in terms of the original units in which Y and X are expressed, but in standard deviation units.
To show the difference between (2) and (3), consider a regression of gross private domestic
investment (GPDI) on GDP run in standardized form.
Interpretation: if the (standardized) GDP increases by one standard deviation, on average, the
(standardized) GPDI increases by about 0.94 standard deviations.
Advantage of the standardized regression model
If there is more than one regressor, standardizing all the regressors puts them on an equal basis,
so they can be compared directly. If the coefficient of one standardized regressor is larger than that
of another standardized regressor appearing in the same model, then the former contributes relatively
more to the explanation of the regressand than the latter. Thus, the beta coefficients can be used as a
measure of the relative strength of the various regressors.
Points to note under regression through the origin
The r² value is not given because this is a regression through the origin, for which the usual r² is not
applicable.
There is an interesting relationship between the β coefficients of the conventional model and the
beta coefficients. For the bivariate case,
ˆβ*2 = ˆβ2 (Sx / Sy)
where Sx = the sample standard deviation of the X regressor and Sy = the sample standard deviation of
the regressand. Therefore, one can move back and forth between the β and beta coefficients if the
(sample) standard deviations of the regressor and regressand are known.
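The relationship between the conventional slope and the beta coefficient can be verified numerically; the sketch below uses made-up data:

```python
import statistics as st

def ols_slope(X, Y):
    # conventional OLS slope from mean-adjusted sums
    mx, my = st.mean(X), st.mean(Y)
    return sum((x - mx) * (y - my) for x, y in zip(X, Y)) / sum((x - mx) ** 2 for x in X)

X = [80.0, 100.0, 120.0, 140.0, 160.0]
Y = [70.0, 65.0, 90.0, 95.0, 110.0]
b2 = ols_slope(X, Y)                                   # conventional slope
zx = [(x - st.mean(X)) / st.stdev(X) for x in X]       # standardized X
zy = [(y - st.mean(Y)) / st.stdev(Y) for y in Y]       # standardized Y
beta_star = ols_slope(zx, zy)                          # beta coefficient
# Relationship from the text: beta* = b2 * (Sx / Sy)
print(abs(beta_star - b2 * st.stdev(X) / st.stdev(Y)) < 1e-9)  # True
```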
Functional Forms of Regression Models
Commonly used regression models that may be nonlinear in the variables but are linear in the parameters
or that can be made so by suitable transformations of the variables.
1. The log-linear model
Consider the following model, known as the exponential regression model
Yi = β1 Xi^β2 e^ui ……………… (1)
which may be expressed alternatively as follows using properties of the logarithms like
(1) ln (AB) = ln A + ln B,
(2) ln (A/B) = ln A − ln B, and
(3) ln (A^k) = k ln A,
assuming that A and B are positive, and where k is some constant.
ln Yi = ln β1 + β2 ln Xi + ui ……………… (2)
Where,
ln = natural log (i.e., log to the base e, and where e = 2.718).
In practice one may use common logarithms, that is, log to the base 10. The relationship between
the natural log and common log is: lne X = 2.3026 log10 X.
By convention, ln means natural logarithm, and log means logarithm to the base 10; hence there
is no need to write the subscripts e and 10 explicitly
If we write (2) as follows
ln Yi = α + β2 ln Xi + ui ……………… (3)
where α = ln β1, this model is linear in the parameters α and β2, linear in the logarithms of the variables Y
and X, and can be estimated by OLS regression. Because of this linearity, such models are called log-log,
double-log, or loglinear models.
If the assumptions of the classical linear regression model are fulfilled, the parameters of above
equation (3) can be estimated by the OLS method by letting
Y*i = α + β2 X*i + ui ……………… (4)
where Y*i= ln Yi and X*i= ln Xi. The OLS estimators ˆα and ˆβ2 obtained will be best linear unbiased
estimators of α and β2, respectively.
One attractive feature of the log-log model, which has made it popular in applied work, is that the
slope coefficient β2 measures the elasticity of Y with respect to X, that is, the percentage change in Y for a
given (small) percentage change in X. Thus, if Y represents the quantity of a commodity demanded and X
its unit price, β2 measures the price elasticity of demand, a parameter of considerable economic interest. If
the relationship between quantity demanded and price is as shown in Fig.3a, the double-log
transformation as shown in Fig.3b will then give the estimate of the price elasticity (−β2).
Two special features of the log-linear model:
1. The model assumes that the elasticity coefficient between Y and X, β2, remains constant throughout,
hence the alternative name constant elasticity model. In other words, as Fig.3b shows, the change in
ln Y per unit change in ln X (i.e., the elasticity, β2) remains the same no matter at which ln X we
measure the elasticity.
2. Another feature of the model is that although ˆα and ˆ β2 are unbiased estimates of α and β2, β1 (the
parameter entering the original model) when estimated as ˆβ1= antilog (ˆα) is itself a biased estimator.
In most practical problems, however, the intercept term is of secondary importance, and one need not
worry about obtaining its unbiased estimate.
In the two-variable model, the simplest way to decide whether the loglinear model fits the data is
to plot the scattergram of lnYi against ln Xi and see if the scatter points lie approximately on a straight
line, as in Fig.3b.
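The constant-elasticity property can be illustrated with synthetic data generated from Y = 5X^(−1.5), so the true elasticity is −1.5; a log–log regression should recover it exactly (this sketch is not part of the original example):

```python
import math
import statistics as st

def ols(X, Y):
    # returns (intercept, slope) from the usual OLS formulas
    mx, my = st.mean(X), st.mean(Y)
    b = sum((x - mx) * (y - my) for x, y in zip(X, Y)) / sum((x - mx) ** 2 for x in X)
    return my - b * mx, b

# Synthetic constant-elasticity data: Y = 5 * X^(-1.5), elasticity = -1.5
X = [1.0, 2.0, 4.0, 8.0, 16.0]
Y = [5.0 * x ** -1.5 for x in X]
a, b2 = ols([math.log(x) for x in X], [math.log(y) for y in Y])
print(round(b2, 6))           # -1.5  (slope = elasticity)
print(round(math.exp(a), 6))  # 5.0   (antilog of the intercept)
```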
2. The log-lin model (measuring the growth rate)
Consider the compound-growth model
Yt = Y0 (1 + r)^t ……………… (5)
where r is the compound (i.e., over time) rate of growth of Y. Taking the natural logarithm of (5), we can
write
ln Yt = ln Y0 + t ln (1 + r)
Now let
β1 = ln Y0
β2 = ln (1 + r)
so that
ln Yt = β1 + β2 t ……………… (6)
Adding the disturbance term to this equation, we obtain
ln Yt = β1 + β2 t + ut ……………… (7)
This model is like any other linear regression model in that it is linear in the parameters β1 and β2.
The only difference is that the regressand is the logarithm of Y and the regressor is “time,” which takes
values of 1, 2, 3, etc.
Models like (7) are called semilog models because only one variable (in this case the regressand)
appears in the logarithmic form.
A model in which the regressand is logarithmic will be called a log-lin model.
A model in which the regressand is linear but the regressor(s) are logarithmic is a lin-log model.
Properties of model (6)
In this model the slope coefficient measures the constant proportional or relative change in Y for
a given absolute change in the value of the regressor (in this case the variable t), that is,
β2 = (relative change in regressand) / (absolute change in regressor) ……………… (8)
because β2 = d(ln Y)/dX = (1/Y)(dY/dX) = (dY/Y)/dX, which is nothing but the relation above. For
small changes in Y and X this relation may be approximated by (ΔY/Y)/ΔX, where X is t.
If we multiply the relative change in Y by 100, (8) will then give the percentage change, or the
growth rate, in Y for an absolute change in X, the regressor. That is, 100 times β2 gives the growth rate in
Y; 100 times β2 is known in the literature as the semielasticity of Y with respect to X.
Instantaneous versus Compound Rate of Growth.
The coefficient of the trend variable in the growth model (7), β2, gives the instantaneous (at a
point in time) rate of growth and not the compound (over a period of time) rate of growth. But the latter
can easily be found from the former: take the antilog of the estimated β2, subtract 1 from it, and multiply
the difference by 100; that is, compound rate r = antilog(ˆβ2) − 1.
For the illustrative example, the estimated slope coefficient is 0.00743. Therefore, [antilog
(0.00743) − 1] = 0.00746 or 0.746 percent. Thus, in the illustrative example, the compound rate of growth
on expenditure on services was about 0.746 percent per quarter, which is slightly higher than the
instantaneous growth rate of 0.743 percent. This is of course due to the compounding effect.
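The arithmetic of this example can be reproduced directly:

```python
import math

b2 = 0.00743                      # estimated slope from the log-lin growth model
instantaneous = 100 * b2          # instantaneous growth rate, percent per quarter
compound = 100 * (math.exp(b2) - 1)  # compound rate: 100 * [antilog(b2) - 1]
print(round(instantaneous, 3))    # 0.743
print(round(compound, 3))         # 0.746
```

The compound rate slightly exceeds the instantaneous rate, as the text notes, because of compounding.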
Linear Trend Model.
Instead of estimating model (7), one sometimes estimates the following model
Yt = β1 + β2 t + ut ……………… (9)
That is, instead of regressing the log of Y on time, regress Y on time, where Y is the regressand
under consideration. Such a model is called a linear trend model and the time variable t is known as the
trend variable. If the slope coefficient is positive, there is an upward trend in Y, whereas if it is negative,
there is a downward trend in Y.
For the expenditure on services data, the results of fitting the linear trend model are as follows
………………….(10)
The Lin–Log Model
A model in which the regressand is linear but the regressor is logarithmic,
Yi = β1 + β2 ln Xi + ui
is for descriptive purposes called a lin–log model.
The interpretation of the slope coefficient β2 is as usual:
ΔY = β2 (ΔX/X)
This equation states that the absolute change in Y (= ΔY) is equal to the slope times the relative change
in X. If the latter is multiplied by 100, the equation gives the absolute change in Y for a percentage
change in X. Thus, if (ΔX/X) changes by 0.01 unit (or 1 percent), the absolute change in Y is 0.01(β2);
if in an application one finds that β2 = 500, the absolute change in Y is (0.01)(500) = 5.0.
Therefore, when regression (10) is estimated by OLS, do not forget to multiply the value of the
estimated slope coefficient by 0.01, or divide it by 100. If not, interpretation in an application will be
highly misleading.
The lin–log model has an interesting application in the so-called Engel expenditure models, named
after the German statistician Ernst Engel (1821–1896). Engel postulated that “the total expenditure that is
devoted to food tends to increase in arithmetic progression as total expenditure increases in geometric
progression.”
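The 0.01 adjustment for lin–log slopes can be checked numerically; the sketch below uses the hypothetical β2 = 500 mentioned in the text:

```python
import math

# Hypothetical lin-log fit: Y = b1 + b2 * ln(X), with b2 = 500 as in the text.
b2 = 500.0
# Approximate rule: a 1 percent increase in X changes Y by about 0.01 * b2 units.
print(0.01 * b2)            # 5.0
# Exact check: raise X by 1 percent and evaluate the fitted change in Y.
dY = b2 * math.log(1.01)    # b2 * change in ln(X)
print(round(dY, 2))         # 4.98, close to the 5.0 approximation
```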
Reciprocal Models
Consider the model
Yi = β1 + β2 (1/Xi) + ui
Although this model is nonlinear in the variable X, because X enters inversely or reciprocally, the model
is linear in β1 and β2 and is therefore a linear regression model.
If we let X*i = (1/Xi), then the model is linear in the parameters as well as in the variables Yi and X*i.
Features of Reciprocal model
As X increases indefinitely, the term β2(1/X) approaches zero (note: β2 is a constant) and Y
approaches the limiting or asymptotic value β1. Therefore, these models have built into them an asymptote
or limiting value that the dependent variable will take as the value of the X variable increases indefinitely.
Some likely shapes of the curve corresponding to this equation are shown in following Fig. 4.
As an illustration of Fig.4a, consider the data given in the following Table-1. These are cross-sectional
data for 64 countries on child mortality and a few other variables. For now, concentrate on the variables,
child mortality (CM) and per capita GNP, which are plotted in Fig.5.
This figure resembles Fig. 4.a: As per capita GNP increases, one would expect child mortality to
decrease because people can afford to spend more on health care, assuming all other factors remain
constant. But the relationship is not a straight line one: As per capita GNP increases, initially there is
dramatic drop in CM but the drop tapers off as per capita GNP continues to increase.
The slope of the reciprocal model above is dY/dX = −β2 (1/X²), implying that if β2 is positive the
slope is negative throughout, and if β2 is negative the slope is positive throughout [Fig. 4a and 4c,
respectively].
Table-1: Fertility and other Data for 64 Countries
Fig. 4: The reciprocal model: Y = β1 + β2 [1/X]
Fig. 5: Relationship between child mortality and per capita GNP in 64 countries
By fitting the reciprocal model, the regression results are
As per capita GNP increases indefinitely, child mortality approaches its asymptotic value of about
82 deaths per thousand. As said earlier, the positive value of the coefficient of (1/PGNPt) implies that the
rate of change of CM with respect to PGNP is negative.
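The asymptote property can be illustrated with synthetic data built from Y = 82 + 2000(1/X), so the true asymptote is 82 (the numbers are made up to echo the example; they are not the actual CM/PGNP data):

```python
import statistics as st

def ols(X, Y):
    # returns (intercept, slope) from the usual OLS formulas
    mx, my = st.mean(X), st.mean(Y)
    b = sum((x - mx) * (y - my) for x, y in zip(X, Y)) / sum((x - mx) ** 2 for x in X)
    return my - b * mx, b

# Synthetic data following Y = 82 + 2000*(1/X) exactly (asymptote = 82)
X = [100.0, 200.0, 500.0, 1000.0, 5000.0]
Y = [82.0 + 2000.0 / x for x in X]
b1, b2 = ols([1.0 / x for x in X], Y)  # regress Y on 1/X
print(round(b1, 6))  # 82.0   -> the asymptotic value of Y as X grows
print(round(b2, 6))  # 2000.0 -> positive b2, so dY/dX = -b2/X^2 < 0 throughout
```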
One of the important applications of Fig.4 b is the celebrated Phillips curve of macroeconomics.
Using the data on percent rate of change of money wages (Y) and the unemployment rate (X) for the United
Kingdom for the period 1861–1957, Phillips obtained a curve whose general shape resembles Fig. 4 b.
As Fig. 6 shows, there is an asymmetry in the response of wage changes to the level of the
unemployment rate: wages rise faster for a unit change in unemployment if the unemployment rate is
below Un, which economists call the natural rate of unemployment [defined as the rate of unemployment
required to keep (wage) inflation constant], and fall more slowly for an equivalent change when the
unemployment rate is above the natural rate, with β1 indicating the asymptotic floor for wage change.
This particular feature of the Phillips curve may be due to institutional factors, such as union bargaining
power, minimum wages, unemployment compensation, etc.
Guidelines for choosing among competing functional forms:
3. The coefficients of the model chosen should satisfy certain a priori expectations. For example, if we
are considering the demand for automobiles as a function of price and some other variables, we
should expect a negative coefficient for the price variable.
4. Sometimes more than one model may fit a given set of data reasonably well: all coefficients in line
with prior expectations and statistically significant, but with different r² values. In comparing two r²
values, make sure that the dependent variable, or the regressand, of the two models is the same; the
regressor(s) can take any form.
5. In general, one should not overemphasize the r² measure in the sense that the higher the r², the better
the model, since r² increases as we add more regressors to the model. Of greater importance are the
theoretical underpinning of the chosen model, the signs of the estimated coefficients, and their
statistical significance. If a model is good on these criteria, a model with a lower r² may be quite
acceptable.
The Gauss–Markov Theorem
Given the assumptions of the classical linear regression model, the least-squares estimates
possess some ideal or optimum properties. These properties are contained in the well-known
Gauss–Markov theorem. To understand this theorem, consider the best linear unbiasedness
property of an estimator. An estimator, say the OLS estimator ˆβ2, is said to be a Best Linear
Unbiased Estimator (BLUE) of β2 if the following hold:
1. It is linear, that is, a linear function of a random variable, such as the dependent variable Y in
the regression model.
2. It is unbiased, i.e., its average or expected value, E(ˆβ2), is equal to the true value, β2.
3. It has minimum variance in the class of all such linear unbiased estimators; an unbiased
estimator with the least variance is known as an efficient estimator.
In the regression context it can be proved that the OLS estimators are BLUE. This is the
gist of the famous Gauss–Markov theorem, which can be stated as follows:
Gauss–Markov Theorem: Given the assumptions of the classical linear regression model, the
least-squares estimators, in the class of unbiased linear estimators, have minimum variance, that is,
they are BLUE.
Figure (a) shows the sampling distribution of the OLS estimator ˆβ2, that is, the
distribution of the values taken by ˆβ2 in repeated sampling experiments. For convenience it is
assumed ˆβ2 to be distributed symmetrically. As the figure shows, the mean of the ˆβ2 values,
E(ˆβ2), is equal to the true β2. In this situation we say that ˆβ2 is an unbiased estimator of β2. Figure
(b) shows the sampling distribution of β*2, an alternative estimator of β2 obtained by using another
(i.e., other than OLS) method. For convenience, assume that β*2, like ˆβ2, is unbiased, that is, its
average or expected value is equal to β2.
Assume further that both ˆβ2 and β*2 are linear estimators, i.e., linear functions of Y. Which
estimator, ˆβ2 or β*2, should one choose? To answer this question, superimpose the two figures, as
in Figure (c). It is obvious that although both ˆβ2 and β*2 are unbiased, the distribution of β*2 is
AEC 507/GMG/Goodness of fit 1
more diffused or widespread around the mean value than the distribution of ˆβ2. In other words, the
variance of β*2 is larger than the variance of ˆβ2.
Now given two estimators that are both linear and unbiased, one would choose the
estimator with the smaller variance because it is more likely to be close to β2 than the alternative
estimator. In short, one would choose the BLUE estimator.
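A small Monte Carlo experiment illustrates the theorem. Both estimators below are linear and unbiased, but the non-OLS one (the slope through the first and last points, chosen here purely for illustration) has the larger sampling variance:

```python
import random
import statistics as st

random.seed(0)
beta1, beta2, sigma = 1.0, 2.0, 1.0       # true parameters (made up)
X = [float(i) for i in range(1, 11)]      # fixed regressor values
mx = st.mean(X)
sxx = sum((x - mx) ** 2 for x in X)

ols_est, alt_est = [], []
for _ in range(5000):                     # repeated sampling experiments
    Y = [beta1 + beta2 * x + random.gauss(0, sigma) for x in X]
    my = st.mean(Y)
    # OLS slope estimator
    ols_est.append(sum((x - mx) * (y - my) for x, y in zip(X, Y)) / sxx)
    # Alternative linear unbiased estimator: slope through first and last points
    alt_est.append((Y[-1] - Y[0]) / (X[-1] - X[0]))

print(round(st.mean(ols_est), 2), round(st.mean(alt_est), 2))  # both near 2.0
print(st.variance(ols_est) < st.variance(alt_est))             # True: OLS wins
```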
Ex: Linear regression function/ Linear Production function
The function Y= a+bX is a straight line with two constants ‘a’ and ‘b’. ‘a’ is the intercept
(Value of Y when X is zero i.e., If X = 0, then Y =a). ‘b’ is the regression coefficient of Y on X (When
X will change by one unit Y will change by ‘b’ units). In this equation Y is dependent variable and X is
independent variable. Let us consider one example
SN Y X Y2 X2 XY
1 8.0 5.5 64.00 30.25 44
2 8.5 6.0 72.25 36.00 51
3 10.0 8.5 100.00 72.25 85
Total 26.5 20.00 236.25 138.50 180
We know the normal equations for fitting Y = a + bX:
ΣY = n a + b ΣX
ΣXY = a ΣX + b ΣX²
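Solving the normal equations for the three observations in the table above gives a and b; a quick check in Python:

```python
# Solving the normal equations for the table data:
#   sum(Y)  = n*a + b*sum(X)
#   sum(XY) = a*sum(X) + b*sum(X^2)
n, sY, sX, sX2, sXY = 3, 26.5, 20.0, 138.5, 180.0
b = (n * sXY - sX * sY) / (n * sX2 - sX ** 2)  # regression coefficient of Y on X
a = (sY - b * sX) / n                          # intercept
print(round(b, 4))  # 0.6452
print(round(a, 4))  # 4.5323
```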
The important properties of the linear function are that it is represented by a straight line and has a
constant slope at every point on the line; productivity is constant because MP = AP; and output
changes proportionately for proportional changes in inputs.
Variation means the sum of squared deviations of a variable from its mean value.
The circle Y represents variation in the dependent variable Y and the circle X represents
variation in the explanatory variable X. The overlap of the two circles (the shaded area) indicates
the extent to which the variation in Y is explained by the variation in X (say, via an OLS
regression).
The greater the extent of the overlap, the greater the variation in Y that is explained by X. The r²
is simply a numerical measure of this overlap. In the figure, as we move from left to right, the
area of the overlap increases; that is, successively a greater proportion of the variation in Y is
explained by X, i.e., r² increases. When there is no overlap, r² is obviously zero, but when the
overlap is complete, r² is 1, since 100 percent of the variation in Y is explained by X. Thus r² lies
between 0 and 1.
Computation of r²
In deviation form, yi = ŷi + ûi. Squaring both sides and summing over the sample gives
Σyi² = Σŷi² + Σûi² + 2Σŷi ûi = Σŷi² + Σûi²
since Σŷi ûi = 0 and ŷi = ˆβ2 xi.
Total Sum of Squares (TSS) = Σ(Yi − Y̅)² = total variation of the actual Y values about their sample
mean.
Explained Sum of Squares (ESS) = Σ(Ŷi − Y̅)² = variation of the estimated Y values about their mean
(which equals Y̅), appropriately called the sum of squares due to regression [i.e., due to the
explanatory variable(s)], or explained by regression, or ESS.
The total variation in the observed Y values about their mean value can be partitioned into two
parts, one attributable to the regression line and the other to random forces, because not all actual
Y observations lie on the fitted line. Geometrically,
TSS = ESS + RSS
Dividing both sides by TSS, we obtain
1 = ESS/TSS + RSS/TSS
We now define r² as
r² = ESS/TSS
or, alternatively,
r² = 1 − RSS/TSS
The quantity r2 thus defined is known as the (sample) coefficient of determination and is
the most commonly used measure of the goodness of fit of a regression line. Verbally, r2
measures the proportion or percentage of the total variation in Y explained by the
regression model.
Two properties of r2 are
1. It is a non-negative quantity.
2. Its limits are 0 ≤ r2 ≤ 1. An r2 of 1 means a perfect fit, that is, ˆYi = Yi for each i. On
the other hand, an r2 of zero means that there is no relationship between the regressand
and the regressor whatsoever (i.e., ˆβ2 = 0). In this case, ˆYi = ˆβ1 = Y̅, that is, the best
prediction of any Y value is simply its mean value. In this situation therefore the
regression line will be horizontal to the X axis.
Although r² can be computed directly from its definition, it can be obtained more
quickly from the following formula:
r² = ˆβ2² (Σxi² / Σyi²)
If we divide the numerator and the denominator of this expression by the sample size n (or
n − 1 if the sample size is small), we obtain
r² = ˆβ2² (Sx² / Sy²)
where Sx² and Sy² are the sample variances of X and Y, respectively.
Since ˆβ2 = Σxiyi / Σxi², the expression above can also be written as
r² = (Σxiyi)² / (Σxi² Σyi²)
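All of these expressions for r² agree, as a quick numerical check with made-up data shows:

```python
import statistics as st

X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [2.0, 1.0, 4.0, 3.0, 5.0]
mx, my = st.mean(X), st.mean(Y)
sxx = sum((x - mx) ** 2 for x in X)
b2 = sum((x - mx) * (y - my) for x, y in zip(X, Y)) / sxx
b1 = my - b2 * mx
Yhat = [b1 + b2 * x for x in X]

TSS = sum((y - my) ** 2 for y in Y)                    # total sum of squares
ESS = sum((yh - my) ** 2 for yh in Yhat)               # explained sum of squares
RSS = sum((y - yh) ** 2 for y, yh in zip(Y, Yhat))     # residual sum of squares
r2_a = ESS / TSS           # definition: ESS/TSS
r2_b = 1 - RSS / TSS       # alternative: 1 - RSS/TSS
r2_c = b2 ** 2 * sxx / TSS # quick formula: b2^2 * sum(x^2) / sum(y^2)
print(round(r2_a, 6) == round(r2_b, 6) == round(r2_c, 6))  # True
```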
Given the definition of r², we can express ESS and RSS discussed earlier as
ESS = r² · Σyi²  and  RSS = (1 − r²) · Σyi²
A quantity closely related to, but conceptually very different from, r² is the
coefficient of correlation, r, which is a measure of the degree of association between two
variables. It can be computed either from
r = ±√r²
or from its definition as the single-equation correlation coefficient
r = Σxiyi / √(Σxi² Σyi²)
Properties of r
1. It can be positive or negative, the sign depending on the sign of the term in the
numerator, which measures the sample covariation of two variables.
2. It lies between the limits of −1 and +1; that is, −1 ≤ r ≤ 1.
3. It is symmetrical in nature; that is, the coefficient of correlation between X and Y (rXY)
is the same as that between Y and X (rYX).
4. It is independent of the origin and scale; that is, if we define X*i = aXi + c and Y*i =
bYi + d, where a > 0, b > 0, and c and d are constants, then r between X* and Y* is the
same as that between the original variables X and Y.
5. If X and Y are statistically independent, the correlation coefficient between them is
zero; but if r = 0, it does not mean that two variables are independent. In other words,
zero correlation does not necessarily imply independence. [Figure (h).]
6. It is a measure of linear association or linear dependence only; it has no meaning for
describing nonlinear relations. Thus, in Figure (h), Y = X² is an exact relationship yet r
is zero.
7. Although it is a measure of linear association between two variables, it does not
necessarily imply any cause-and-effect relationship.
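Properties 4 and 6 are easy to verify numerically (the data are made up):

```python
import statistics as st

def corr(X, Y):
    # sample correlation coefficient: sum(xy) / sqrt(sum(x^2) * sum(y^2))
    mx, my = st.mean(X), st.mean(Y)
    num = sum((x - mx) * (y - my) for x, y in zip(X, Y))
    den = (sum((x - mx) ** 2 for x in X) * sum((y - my) ** 2 for y in Y)) ** 0.5
    return num / den

X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [2.0, 1.0, 4.0, 3.0, 5.0]
# Property 4: invariant to changes of origin and scale (a, b > 0)
Xs = [3.0 * x + 7.0 for x in X]
Ys = [0.5 * y - 2.0 for y in Y]
print(abs(corr(X, Y) - corr(Xs, Ys)) < 1e-9)  # True
# Property 6: exact nonlinear relation Y = X^2 on a symmetric X range gives r = 0
Xq = [-2.0, -1.0, 0.0, 1.0, 2.0]
Yq = [x * x for x in Xq]
print(corr(Xq, Yq))  # 0.0
```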
In the regression context, r2 is a more meaningful measure than r, for the former tells
us the proportion of variation in the dependent variable explained by the explanatory
variable(s) and therefore provides an overall measure of the extent to which the variation
in one variable determines the variation in the other. The latter does not have such value.
Moreover, as we shall see, the interpretation of r (= R) in a multiple regression model is of
dubious value.
In passing, note that the r² defined previously can also be computed as the squared
coefficient of correlation between the actual Yi and the estimated Yi, namely Ŷi. That is,
r² = [Σ(Yi − Y̅)(Ŷi − Y̅)]² / [Σ(Yi − Y̅)² Σ(Ŷi − Y̅)²]
where Yi = actual Y, Ŷi = estimated Y, and Y̅ (= the mean of the Ŷi) = the mean of Y. This
expression justifies the description of r² as a measure of goodness of fit, for it tells how close the
estimated Y values are to their actual values.
Keynesian consumption function
Keynes stated that “The fundamental psychological law . . . is that men [women]
are disposed, as a rule and on average, to increase their consumption as their income
increases, but not by as much as the increase in their income,” that is, the marginal
propensity to consume (MPC) is greater than zero but less than one. Although Keynes did
not specify the exact functional form of the relationship between consumption and income,
for simplicity assume that the relationship is linear. As a test of the Keynesian
consumption function, the sample data in following Table is used to obtain the estimates
of the regression coefficients, their standard errors, etc.,
Table: Data on weekly family consumption expenditure Y & weekly family income X
The estimated SRF, Ŷi = 24.4545 + 0.5091 Xi, and the associated regression line are interpreted as follows:
Each point on the regression line gives an estimate of the expected or mean value of Y
corresponding to the chosen X value; that is, Ŷi is an estimate of E(Y | Xi). The value of ˆβ2 =
0.5091, which measures the slope of the line, shows that, within the sample range of X between
$80 and $260 per week, as X increases, say, by $1, the estimated increase in the mean or average
weekly consumption expenditure amounts to about 51 cents. The value of ˆβ1 =24.4545, which is
the intercept of the line, indicates the average level of weekly consumption expenditure when
weekly income is zero. However, this is a mechanical interpretation of the intercept term. In
regression analysis such literal interpretation of the intercept term may not be always meaningful,
although in the present example it can be argued that a family without any income (because of
unemployment, layoff, etc.) might maintain some minimum level of consumption expenditure
either by borrowing or dissaving. But in general one has to use common sense in interpreting the
intercept term, for very often the sample range of X values may not include zero as one of the
observed values.
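The reported coefficients can be reproduced if one assumes the classic ten-family data commonly used with this example (weekly income running from $80 to $260 in steps of $20); since the table itself is not shown here, treat the data values below as an assumption:

```python
import statistics as st

# Assumed illustrative data: X = weekly income ($80-$260), Y = consumption
X = [80.0, 100.0, 120.0, 140.0, 160.0, 180.0, 200.0, 220.0, 240.0, 260.0]
Y = [70.0, 65.0, 90.0, 95.0, 110.0, 115.0, 120.0, 140.0, 155.0, 150.0]
mx, my = st.mean(X), st.mean(Y)
b2 = sum((x - mx) * (y - my) for x, y in zip(X, Y)) / sum((x - mx) ** 2 for x in X)
b1 = my - b2 * mx
print(round(b2, 4))  # 0.5091  -> slope: an extra $1 of income raises consumption ~51 cents
print(round(b1, 4))  # 24.4545 -> intercept
```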
Fig. Sample regression line based on the data given in above table
Analysis of variance (ANOVA)
Study of components of TSS is known as the analysis of variance (ANOVA).
TSS = ESS + RSS,
Total sum of squares (TSS) = Explained sum of squares (ESS) + residual sum of squares
(RSS).
Associated with any sum of squares is its df, the number of independent observations on
which it is based.
TSS has n− 1 df because we lose 1 df in computing the sample mean Ῡ.
RSS has n− 2 df. (Why?) (Note: This is true only for the two-variable regression model
with the intercept β1 present.)
ESS has 1 df (again true of the two-variable case only), which follows from the fact that
ESS = ˆβ2² Σxi² is a function of ˆβ2 only, since Σxi² is known.
Let us arrange the various sums of squares and their associated df in Table 5.3, which is
the standard form of the AOV table, sometimes called the ANOVA table. Given the entries of
Table 5.3, we now consider the following variable:
F = (ESS/1) / (RSS/(n − 2)) = ˆβ2² Σxi² / ˆσ²
If we assume that the disturbances ui are normally distributed, which we do under the
CNLRM, and if the null hypothesis (H0) is that β2 = 0, then it can be shown that this F variable
follows the F distribution with 1 df in the numerator and (n − 2) df in the
denominator.
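Using the same assumed ten-family consumption data as above, the ANOVA decomposition and the F ratio can be computed as follows (the data values are an assumption, not the original table):

```python
import statistics as st

# Assumed illustrative data (X = weekly income, Y = consumption)
X = [80.0, 100.0, 120.0, 140.0, 160.0, 180.0, 200.0, 220.0, 240.0, 260.0]
Y = [70.0, 65.0, 90.0, 95.0, 110.0, 115.0, 120.0, 140.0, 155.0, 150.0]
n = len(X)
mx, my = st.mean(X), st.mean(Y)
sxx = sum((x - mx) ** 2 for x in X)
b2 = sum((x - mx) * (y - my) for x, y in zip(X, Y)) / sxx

TSS = sum((y - my) ** 2 for y in Y)  # n-1 df
ESS = b2 ** 2 * sxx                  # 1 df (two-variable case)
RSS = TSS - ESS                      # n-2 df
F = (ESS / 1) / (RSS / (n - 2))      # F variable with (1, n-2) df
print(round(F, 2))  # 202.87
```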
Classical Normal Linear Regression Model (CNLRM)
The classical theory of statistical inference consists of two branches, namely, estimation
and hypothesis testing.
Using the method of OLS we estimated the parameters β1, β2, and σ2. Under the
assumptions of the classical linear regression model (CLRM), we were able to show that the
estimators of these parameters, ˆβ1, ˆβ2, and ˆσ2, satisfy several desirable statistical properties, such
as unbiasedness, minimum variance, etc. Since these are estimators, their values will change from
sample to sample; therefore, these estimators are random variables. Estimation is half the
battle; hypothesis testing is the other half.
Since ˆβ1, ˆβ2, and ˆσ2 are random variables, we need to find out their probability
distributions, for without that knowledge we will not be able to relate them to their true values.
Probability Distribution of Disturbances Ui
To find out the probability distributions of the OLS estimators, proceed with ˆβ2, which can be written as
ˆβ2 = Σki Yi ……………… (1)
where ki = xi/Σxi². Since the X’s are assumed fixed, or non-stochastic, because ours is conditional
regression analysis, conditional on the fixed values of Xi, equation (1) shows that ˆβ2 is a
linear function of Yi, which is random by assumption. But since Yi = β1 + β2Xi + ui, we can write (1)
as
ˆβ2 = Σki (β1 + β2 Xi + ui)
Because ki, the betas, and Xi are all fixed, ˆβ2 is ultimately a linear function of the random
variable ui, which is random by assumption. Therefore, the probability distribution of ˆβ2 (and also
of ˆβ1) will depend on the assumption made about the probability distribution of ui. And since
knowledge of the probability distributions of OLS estimators is necessary to draw inferences about
their population values, the nature of the probability distribution of ui assumes an extremely
important role in hypothesis testing.
Since the method of OLS does not make any assumption about the probabilistic nature of
ui, it is of little help for the purpose of drawing inferences about the PRF from the SRF.
This void can be filled if we are willing to assume that the u’s follow some probability
distribution. In the regression context it is usually assumed that the u’s follow the normal
distribution. Adding the normality assumption for ui to the assumptions of the classical linear
regression model (CLRM) we obtain what is known as the classical normal linear regression
model (CNLRM).
The Normality Assumption for Ui
The classical normal linear regression model assumes that each ui is distributed normally with

Mean: E(ui) = 0
Variance: E(ui²) = σ²
Covariance: E(ui uj) = 0, i ≠ j

These assumptions may be compactly stated as ui ~ N(0, σ²), Where the symbol ~ means
distributed as and N stands for the normal distribution, the terms in the parentheses
representing the two parameters of the normal distribution, namely, the mean and the variance.

For two normally distributed variables, zero covariance or correlation means independence
of the two variables. Therefore, with the normality assumption, ui and uj are not only
uncorrelated but are also independently distributed. Therefore, we can write the assumptions above as

ui ~ NID(0, σ²)

where NID stands for normally and independently distributed.
Reasons for the Normality Assumption
1. ui represent the combined influence (on the dependent variable) of a large number of
independent variables that are not explicitly introduced in the regression model. We hope that
the influence of these omitted or neglected variables is small and at best random. Now by the
celebrated central limit theorem (CLT) of statistics, it can be shown that if there are a large
number of independent and identically distributed random variables, then, with a few
exceptions, the distribution of their sum tends to a normal distribution as the number of such
variables increases indefinitely. It is the CLT that provides a theoretical justification for the
assumption of normality of ui.
2. A variant of the CLT states that, even if the number of variables is not very large or if these
variables are not strictly independent, their sum may still be normally distributed.
3. With the normality assumption, the probability distributions of the OLS estimators can be easily
derived, because one property of the normal distribution is that any linear function of
normally distributed variables is itself normally distributed. The OLS estimators ˆβ1 and
ˆβ2 are linear functions of ui. Therefore, if the ui are normally distributed, so are ˆβ1 and ˆβ2, which
makes the task of hypothesis testing very straightforward.
4. The normal distribution is a comparatively simple distribution involving only two parameters
(mean and variance); it is very well known and its theoretical properties have been extensively
studied in mathematical statistics. Besides, many phenomena follow the normal distribution.
5. Finally, if we are dealing with a small, or finite, sample size, say data of less than 100
observations, the normality assumption assumes a critical role. It not only helps us to derive
the exact probability distributions of OLS estimators but also enables us to use the t, F, and χ2
statistical tests for regression models.
Properties of OLS Estimators under the Normality Assumption
With the assumption that the ui follow the normal distribution, the OLS estimators have the
following properties.
1. They are unbiased.
2. They have minimum variance. Combined with 1, this means that they are minimum-variance
unbiased, or efficient estimators.
3. They are consistent; that is, as the sample size increases indefinitely, the estimators
converge to their true population values.
4. ˆβ1 (being a linear function of ui) is normally distributed with

Mean: E(ˆβ1) = β1
Variance: var(ˆβ1) = (ΣXi² / nΣxi²) σ²

Or, more compactly, ˆβ1 ~ N(β1, σ²ˆβ1). Then, by the properties of the normal distribution, the variable

Z = (ˆβ1 − β1) / σˆβ1

follows the standard normal distribution, that is, a normal distribution with zero mean and unit (= 1) variance; more compactly, Z ~ N(0, 1).

5. ˆβ2 (being a linear function of ui) is normally distributed with

Mean: E(ˆβ2) = β2
Variance: var(ˆβ2) = σ² / Σxi²

so that Z = (ˆβ2 − β2) / σˆβ2 also follows the standard normal distribution. Geometrically, the probability
distributions of ˆβ1 and ˆβ2 are shown in the accompanying figure.
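The normality of the OLS slope estimator can also be checked by simulation. The sketch below is illustrative only: the values of β1, β2, σ, and the fixed X’s are assumptions, not from the text. It draws repeated samples with normal errors and compares the simulated mean and variance of ˆβ2 with the theoretical values E(ˆβ2) = β2 and var(ˆβ2) = σ²/Σxi²:

```python
import numpy as np

# Illustrative assumptions: beta1 = 2, beta2 = 0.5, sigma = 1, five fixed X values
rng = np.random.default_rng(0)
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
beta1, beta2, sigma = 2.0, 0.5, 1.0
x = X - X.mean()                                  # deviations from the mean

b2_draws = []
for _ in range(20000):
    u = rng.normal(0.0, sigma, size=X.size)       # normal disturbances
    Y = beta1 + beta2 * X + u
    y = Y - Y.mean()
    b2_draws.append((x * y).sum() / (x * x).sum())  # OLS slope in deviation form

b2_draws = np.array(b2_draws)
print(b2_draws.mean())   # close to beta2 = 0.5
print(b2_draws.var())    # close to sigma^2 / sum(x^2) = 1/10
```

Because ˆβ2 is a linear function of the normal ui, its sampling distribution matches the theoretical normal distribution stated above.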
GMG/AEC 507/CLRM/2021 CLRM 3
Multicollinearity
If λ2 ≠ 0, Eq. (1) can be written as

X2i = −(λ3/λ2) X3i − (λ4/λ2) X4i − · · · − (λk/λ2) Xki ………………… (3)

which shows how X2 is exactly linearly related to other variables or how it can be derived
from a linear combination of other X variables. In this situation, the coefficient of
correlation between the variable X2 and the linear combination on the right side of (3) is
bound to be unity.

Similarly, if λ2 ≠ 0, Eq. (2) can be written as

X2i = −(λ3/λ2) X3i − (λ4/λ2) X4i − · · · − (λk/λ2) Xki − (1/λ2) vi ………………… (4)

which shows that X2 is not an exact linear combination of other X’s because it is also
determined by the stochastic error term vi.
As a numerical example, consider the following hypothetical data:

X2:   10   15   18   24   30
X3:   50   75   90  120  150
X*3:  52   75   97  129  152

It is apparent that X3i = 5X2i. Therefore, there is perfect collinearity between X2 and
X3 since the coefficient of correlation r23 is unity. The variable X*3 was created from X3 by
simply adding to it the numbers taken from a table of random numbers: 2, 0, 7, 9, 2. Now
there is no longer perfect collinearity between X2 and X*3. However, the two variables are
highly correlated because calculations will show that the coefficient of correlation between
them is 0.9959.
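The 0.9959 figure can be verified directly. The sketch below assumes, for illustration, the values X2 = (10, 15, 18, 24, 30), with X3 = 5·X2 and X*3 obtained by adding the random numbers (2, 0, 7, 9, 2):

```python
import numpy as np

x2 = np.array([10.0, 15.0, 18.0, 24.0, 30.0])
x3 = 5.0 * x2                                        # perfect collinearity: X3 = 5*X2
x3_star = x3 + np.array([2.0, 0.0, 7.0, 9.0, 2.0])   # add the random numbers

r_perfect = np.corrcoef(x2, x3)[0, 1]
r_near = np.corrcoef(x2, x3_star)[0, 1]
print(r_perfect)          # exactly 1: perfect collinearity
print(round(r_near, 4))   # 0.9959: near-perfect collinearity
```

Adding even a small random component breaks the exact linear dependence, but the correlation remains very close to unity.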
Diagrammatically, multicollinearity can be portrayed using the following Ballentine.
In this figure the circles Y, X2, and X3 represent, respectively, the variations in Y
(the dependent variable) and X2 and X3 (the explanatory variables). The degree of
collinearity can be measured by the extent of the overlap (shaded area) of the X2 and X3
circles. In Fig. (a) above there is no overlap between X2 and X3, and hence no collinearity.
In Fig. (b) through (e) there is a “low” to “high” degree of collinearity—the greater the
overlap between X2 and X3 (shaded area), the higher the degree of collinearity.
In the extreme, if X2 and X3 were to overlap completely (or if X2 were completely
inside X3, or vice versa), collinearity would be perfect.
Note that multicollinearity, as we have defined it, refers only to linear relationships
among the X variables. It does not rule out nonlinear relationships among them. For
example, consider the following regression model:
Yi = β1 + β2Xi + β3Xi² + β4Xi³ + ui …………………(5)

where, say, Y = total cost of production and X = output. The variables Xi² (output squared)
and Xi³ (output cubed) are obviously functionally related to Xi, but the relationship is
nonlinear.
Therefore, models like (5) do not violate the assumption of no multicollinearity.
However, in concrete applications, the conventionally measured correlation coefficient
will show Xi, Xi², and Xi³ to be highly correlated, which will make it difficult to estimate
the parameters of (5) with great precision (i.e., with small standard errors).
The classical linear regression model assumes no multicollinearity among the X’s
because, if multicollinearity is perfect in the sense of (1), the regression coefficients of the
X variables are indeterminate and their standard errors are infinite. If multicollinearity is
less than perfect, as in (2), the regression coefficients, although determinate, possess large
standard errors (in relation to the coefficients themselves), which means the coefficients
cannot be estimated with great precision or accuracy.
Sources of multicollinearity [as given by Montgomery and Peck]:
1. The data collection method employed; for example, sampling over a limited range of the values
taken by the regressors in the population.
2. Constraints on the model or in the population being sampled. For example, in the regression of
electricity consumption on income (X2) and house size (X3) there is a physical
constraint in the population in that families with higher incomes generally have larger
homes than families with lower incomes.
3. Model specification, for example, adding polynomial terms to a regression model,
especially when the range of the X variable is small.
4. An over-determined model. This happens when the model has more explanatory
variables than the number of observations. This could happen in medical research
where there may be a small number of patients about whom information is collected
on a large number of variables.
5. In time series data, regressors included in the model share a common trend, i.e., they
all increase or decrease over time. Thus, in the regression of consumption expenditure
on income, wealth, and population, the regressors income, wealth, and population may
all be growing over time at more or less the same rate, leading to collinearity among
these variables.
Problem with Multicollinearity (MC)
The central problem is the estimation of the regression coefficients: they either cannot be determined at all or cannot be estimated with precision.
In the case of perfect multicollinearity
Here the regression coefficients remain indeterminate and their standard errors are infinite.
This fact can be demonstrated in terms of the three-variable regression model. Using the deviation
form, where all the variables are expressed as deviations from their sample means, we can write
the three-variable regression model as

yi = ˆβ2 x2i + ˆβ3 x3i + ˆui

Assume that X3i = λX2i, where λ is a nonzero constant (e.g., 2, 4, 1.8, etc.). Substituting this
into the previous equation, we obtain

yi = ˆβ2 x2i + ˆβ3 (λx2i) + ˆui = (ˆβ2 + λ ˆβ3) x2i + ˆui = ˆα x2i + ˆui

Where, ˆα = (ˆβ2 + λ ˆβ3). Applying the usual OLS formula to this equation gives

ˆα = (ˆβ2 + λ ˆβ3) = Σx2i yi / Σx2i²

which gives us only one equation in two unknowns (note λ is given), and there is an infinity of solutions
for ˆβ2 and ˆβ3 for given values of ˆα and λ.
To put this idea in concrete terms, let ˆα = 0.8 and λ = 2. Then we have

0.8 = ˆβ2 + 2 ˆβ3
or ˆβ2 = 0.8 − 2 ˆβ3
Now, choose a value of ˆβ3 arbitrarily, and we will have a solution for ˆβ2. Choose another
value for ˆβ3, and we will have another solution for ˆβ2. No matter how hard we try, there is no
unique value for ˆβ2.
Thus in the case of perfect multicollinearity one cannot get a unique solution for the
individual regression coefficients. But notice that one can get a unique solution for linear
combinations of these coefficients. The linear combination (β2 + λβ3) is uniquely estimated by ˆα,
given the value of λ.
In the case of perfect multicollinearity the variances and standard errors of ˆβ2 and ˆβ3
individually are infinite.
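A minimal numerical sketch (with made-up data) shows the same breakdown from the matrix side: when x3 = λ·x2, the cross-product matrix X′X of the normal equations is singular, so no unique solution exists:

```python
import numpy as np

x2 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = x2 - x2.mean()          # deviation form
lam = 2.0
x3 = lam * x2                # perfect collinearity: X3 = 2 * X2

X = np.column_stack([x2, x3])
XtX = X.T @ X                # matrix of the normal equations

print(np.linalg.matrix_rank(XtX))       # 1, not 2: rank deficient
print(abs(np.linalg.det(XtX)) < 1e-8)   # True: determinant is zero
```

A rank-deficient X′X cannot be inverted, which is the matrix counterpart of the "one equation in two unknowns" argument above.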
Estimation in the Presence of “High” but “Imperfect” Multicollinearity
The perfect multicollinearity situation is a pathological extreme. Generally, there is no
exact linear relationship among the X variables, especially in data involving economic time series.
Thus, turning to the three-variable model in the deviation form, instead of exact multicollinearity,
we may have

x3i = λ x2i + vi

where λ ≠ 0 and where vi is a stochastic error term such that Σx2i vi = 0. Incidentally, the Ballentines
shown in Figures (b) to (e) earlier represent cases of imperfect collinearity.

In this case, estimation of the regression coefficients β2 and β3 may be possible.
For example, substituting the above equation into the OLS formula for ˆβ2, we obtain

ˆβ2 = [Σyi x2i (λ²Σx2i² + Σvi²) − (λΣyi x2i + Σyi vi)(λΣx2i²)] / [Σx2i² (λ²Σx2i² + Σvi²) − (λΣx2i²)²]

where use is made of Σx2i vi = 0. A similar expression can be derived for ˆβ3.
Theoretical Consequences of Multicollinearity
Even if multicollinearity is very high, as in the case of near multicollinearity, the OLS
estimators still retain the property of BLUE.
The only effect of multicollinearity is to make it hard to obtain coefficient estimates with
small standard errors. Multicollinearity is essentially a feature of the sample and of sample size, not
of the population: if the sample size is small, the standard errors are high, and hence the variances are also high.
Practical Consequences of High Multicollinearity
1. Although BLUE, the OLS estimators have large variances and covariances, making precise
estimation difficult.
2. Because of consequence 1, the confidence intervals tend to be much wider, leading to the
acceptance of the “zero null hypothesis” (i.e., the true population coefficient is zero) more readily.
3. Also because of consequence 1, the t ratio of one or more coefficients tends to be statistically
insignificant.
4. Although the t ratio of one or more coefficients is statistically insignificant, R2, the overall
measure of goodness of fit, can be very high.
5. The OLS estimators and their standard errors can be sensitive to small changes in the data.
Demonstration of consequences
Large Variances and Covariances of OLS Estimators
To see large variances and covariances, note that the variances and covariances of ˆβ2 and ˆβ3 are given by

var(ˆβ2) = σ² / [Σx2i² (1 − r23²)]
var(ˆβ3) = σ² / [Σx3i² (1 − r23²)]
cov(ˆβ2, ˆβ3) = −r23 σ² / [(1 − r23²) √(Σx2i²) √(Σx3i²)]

where r23 is the coefficient of correlation between X2 and X3. It is apparent from the first
two equations that as r23 tends toward 1, that is, as collinearity increases, the variances of the two
estimators increase and in the limit when r23 = 1, they are infinite. It is equally clear from the third
equation that as r23 increases toward 1, the covariance of the two estimators also increases in
absolute value. [Note: cov (ˆβ2, ˆβ3) ≡ cov (ˆβ3, ˆβ2).]
If r23² = 1, the variances and standard errors are infinite, and the covariance also grows
without bound.
Variance-Inflating Factor (VIF)
The speed with which variances and covariances increase can be seen with the variance-inflating
factor (VIF), defined as

VIF = 1 / (1 − r23²)

with which we can detect the presence or absence of multicollinearity.
The VIF shows how the variance of an estimator is inflated by the presence of multicollinearity.
As r23² approaches 1, the VIF approaches infinity. That is, as the extent of collinearity
increases, the variance of an estimator increases, and in the limit it can become infinite. As can be
readily seen, if there is no collinearity between X2 and X3, the VIF will be 1.
Thus the variances of ˆβ2 and ˆβ3 are directly proportional to the VIF and inversely
proportional to the sums of squared deviations of the x values.
The inverse of the VIF is called tolerance (TOL).
To have an idea of how fast the variances and covariances increase as r23 increases,
consider the data in the table above on variances and covariances for selected values of r23. The
increases in r23 have a dramatic effect on the estimated variances and covariances of the OLS
estimators. When r23 = 0.50, var(ˆβ2) is 1.33 times the variance when r23 is zero, but by the
time r23 reaches 0.95 it is about 10 times as high as when there is no collinearity. And an increase
of r23 from 0.95 to 0.995 makes the estimated variance 100 times what it is when collinearity is zero.
The same dramatic effect is seen on the estimated covariance, as indicated in the following figure.
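These multiples follow directly from the variance-inflating factor, VIF = 1/(1 − r23²); a quick check, using the r23 values from the discussion above:

```python
def vif(r23):
    """Variance-inflating factor for the two-regressor case."""
    return 1.0 / (1.0 - r23 ** 2)

# ratio of the variance at each r23 to the no-collinearity variance (r23 = 0)
for r in (0.0, 0.50, 0.95, 0.995):
    print(r, round(vif(r) / vif(0.0), 2))
# r23 = 0.50  -> variance 1.33 times the no-collinearity value
# r23 = 0.95  -> about 10 times
# r23 = 0.995 -> about 100 times
```

Since var(ˆβ2) = (σ²/Σx2i²)·VIF, the ratio of variances at two values of r23 is just the ratio of the VIFs.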
Detection of Multicollinearity
Having studied the nature and consequences of multicollinearity, how does one detect the presence of
collinearity in any given situation, especially in models involving more than two explanatory
variables? Here it is useful to consider Kmenta’s warning:
Multicollinearity is a question of degree and not of kind. The meaningful distinction is not
between the presence and the absence of multicollinearity, but between its various degrees.
Since multicollinearity refers to the condition of the explanatory variables that are assumed to
be non-stochastic, it is a feature of the sample and not of the population. Therefore, we do not
“test for multicollinearity” but can, if we wish, measure its degree in any particular sample.
Since multicollinearity is essentially a sample phenomenon, arising out of the largely
non-experimental data collected in the social sciences, there is no unique method of detecting it or
measuring its strength. Instead, there are some rules of thumb, some informal and some formal:
1. High R2 but few significant t ratios: This is the “classic” symptom of multicollinearity. If R2 is
high, say, in excess of 0.8, the F test in most cases will reject the hypothesis that the partial
slope coefficients are simultaneously equal to zero, but the individual t tests will show that none
or very few of the partial slope coefficients are statistically different from zero.
Although this diagnostic is sensible, its disadvantage is that “it is too strong in the sense that
multicollinearity is considered as harmful only when all of the influences of the explanatory
variables on Y cannot be disentangled.”
2. High pair-wise correlations among regressors: If the pair-wise or zero-order correlation
coefficient between two regressors is high, say, in excess of 0.8, then multicollinearity is a
serious problem. The problem with this criterion is that, although high zero-order correlations
may suggest collinearity, it is not necessary that they be high to have collinearity in any specific
case. To put the matter somewhat technically, high zero-order correlations are a sufficient but
not a necessary condition for the existence of multicollinearity because it can exist even though
the zero-order or simple correlations are comparatively low (say, less than 0.50).
Suppose, for the four-variable model, that

X4i = λ2 X2i + λ3 X3i

where λ2 and λ3 are constants, not both zero. Obviously, X4 is an exact linear combination of X2 and
X3, giving R²4.23 = 1, the coefficient of determination in the regression of X4 on X2 and X3. Expressing this
coefficient of determination in terms of the simple correlations,

R²4.23 = (r²42 + r²43 − 2 r42 r43 r23) / (1 − r²23)

with r42 = 0.5, r43 = 0.5, and r23 = −0.5, which are not very high values, we get R²4.23 = 1.
Therefore, in models involving more than two explanatory variables, the simple or zero-
order correlation will not provide an infallible guide to the presence of multicollinearity. Of
course, if there are only two explanatory variables, the zero-order correlations will suffice.
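This claim can be checked numerically from the formula for the multiple coefficient of determination in terms of the simple correlations:

```python
# R^2 of X4 on X2 and X3, computed from the pairwise (zero-order) correlations
r42, r43, r23 = 0.5, 0.5, -0.5
R2_4_23 = (r42**2 + r43**2 - 2.0 * r42 * r43 * r23) / (1.0 - r23**2)
print(R2_4_23)   # 1.0: perfect collinearity despite modest pairwise correlations
```

No pairwise correlation exceeds 0.5 in absolute value, yet X4 is an exact linear combination of X2 and X3, which is why zero-order correlations alone are an unreliable diagnostic with more than two regressors.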
3. Examination of partial correlations: Farrar and Glauber have suggested that one should look
at the partial correlation coefficients. In the regression of Y on X2, X3, and X4, a finding that
R²1.234 is very high but r²12.34, r²13.24, and r²14.23 are comparatively low may suggest that
the variables X2, X3, and X4 are highly inter-correlated and that at least one of these variables is
superfluous.
In other words, if the partial correlations between the dependent variable and the individual
regressors are low while the overall fit is high, we can suspect the presence of multicollinearity.
4. Auxiliary regressions: Since multicollinearity arises because one or more of the regressors are
exact or approximately linear combinations of the other regressors, one way of finding out
which X variable is related to other X variables is to regress each Xi on the remaining X
variables and compute the corresponding R2, which we designate as R2i; each one of these
regressions is called an auxiliary regression, auxiliary to the main regression of Y on the X’s.
Then, following the relationship between F and R², the variable

Fi = [R²i / (k − 2)] / [(1 − R²i) / (n − k + 1)]

follows the F distribution with (k − 2) and (n − k + 1) df, where n is the sample size and k is the
number of explanatory variables including the intercept term. If the computed F exceeds the critical
F at the chosen level of significance, the particular Xi is collinear with the other X’s.

5. Eigenvalues and condition index: From the eigenvalues of the X′X matrix we can derive the
condition number k, defined as the ratio of the maximum eigenvalue to the minimum eigenvalue,
and the condition index CI = √k.

Rule of thumb:
If k is between 100 and 1000 there is moderate to strong multicollinearity, and if it exceeds
1000 there is severe multicollinearity.
Alternatively, if the CI (= √k) is between 10 and 30, there is moderate to strong
multicollinearity, and if it exceeds 30 there is severe multicollinearity.
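A sketch of the condition-index computation with numpy follows; the data matrix is a made-up near-collinear example (a constant, X2, and a third column close to 5·X2), not a dataset from the text:

```python
import numpy as np

# made-up regressors: constant, X2, and X3 approximately equal to 5 * X2
X = np.array([
    [1.0, 10.0,  52.0],
    [1.0, 15.0,  75.0],
    [1.0, 18.0,  97.0],
    [1.0, 24.0, 129.0],
    [1.0, 30.0, 152.0],
])
eig = np.linalg.eigvalsh(X.T @ X)   # eigenvalues of the symmetric matrix X'X
k = eig.max() / eig.min()           # condition number
CI = np.sqrt(k)                     # condition index
print(CI > 30.0)                    # True: "severe" by the rule of thumb above
```

With near-collinear columns the smallest eigenvalue of X′X is tiny relative to the largest, so both k and the CI blow up.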
6. Tolerance and variance inflation factor: As R²j increases toward unity, that is, as the
collinearity of Xj with the other regressors increases, VIFj also increases. The larger the value of
VIFj, the more “troublesome” or collinear the variable Xj.
As a rule of thumb, if the VIF of a variable exceeds 10, which will happen if R²j exceeds 0.90,
that variable is said to be highly collinear. The closer TOLj (the inverse of VIFj) is to zero, the
greater the degree of collinearity of that variable with the other regressors, while the closer TOLj is
to 1, the greater the evidence that Xj is not collinear with the other regressors.
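The auxiliary-regression R²j, VIFj, and TOLj can be computed in a few lines of plain least squares. The two-column data below are illustrative near-collinear values (the same kind of made-up example used earlier), not the text's own table:

```python
import numpy as np

def vif_per_column(X):
    """VIF_j = 1 / (1 - R2_j), where R2_j is from the auxiliary regression
    of column j on the remaining columns (with an intercept)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = []
    for j in range(p):
        y = X[:, j]
        Z = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        resid = y - Z @ beta
        r2 = 1.0 - resid.var() / y.var()     # R^2 of the auxiliary regression
        out.append(1.0 / (1.0 - r2))
    return out

x2 = np.array([10.0, 15.0, 18.0, 24.0, 30.0])
x3_star = np.array([52.0, 75.0, 97.0, 129.0, 152.0])   # nearly 5 * x2
v2, v3 = vif_per_column(np.column_stack([x2, x3_star]))
print(v2 > 10.0, v3 > 10.0)   # both far above the rule-of-thumb cutoff of 10
tol2 = 1.0 / v2               # tolerance: close to zero here
```

With only two regressors the two VIFs coincide, since each auxiliary R² equals the squared simple correlation between the pair.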
According to Goldberger, exact micronumerosity (the counterpart of exact
multicollinearity) arises when n, the sample size, is zero, in which case any kind of estimation is
impossible. Near micronumerosity, like near multicollinearity, arises when the number of
observations barely exceeds the number of parameters to be estimated.
Remedial Measures
What if multicollinearity is serious? We have two choices: (1) do nothing, or (2) follow some
rules of thumb.
The “do nothing” school of thought is expressed by Blanchard as follows:
When students run their first ordinary least squares (OLS) regression, the first problem
that they usually encounter is that of multicollinearity. Many of them conclude that there is
something wrong with OLS; some resort to new and often creative techniques to get around the
problem. But, we tell them, this is wrong. Multicollinearity is God’s will, not a problem with OLS
or statistical technique in general.
What Blanchard is saying is that multicollinearity is essentially a data deficiency problem
(micronumerosity, again) and sometimes we have no choice over the data we have available for
empirical analysis. Also, it is not always the case that all the coefficients in a regression model are
statistically insignificant. Moreover, even if we cannot estimate one or more regression coefficients
with great precision, a linear combination of them (i.e., an estimable function) can be estimated
relatively efficiently.
Rule-of-Thumb Procedures
One can try rules of thumb to address the problem of multicollinearity, the success
depending on the severity of the collinearity problem.
1. A priori information: Suppose we consider the model

Yi = β1 + β2X2i + β3X3i + ui

Where, Y = consumption, X2 = income, and X3 = wealth. The income and wealth variables tend to
be highly collinear. But suppose a priori we believe that β3 = 0.10β2; that is, the rate of change of
consumption with respect to wealth is one-tenth the corresponding rate with respect to income. We
can then run the following regression:

Yi = β1 + β2Xi + ui

Where, Xi = X2i + 0.1X3i. Once we obtain ˆβ2, we can estimate ˆβ3 from the postulated relationship
between β2 and β3.
The a priori information could come from previous empirical work in which the
collinearity problem happens to be less serious or from the relevant theory underlying the field of
study.
2. Combining cross-sectional and time series data:
A variant of the extraneous or apriori information technique is the combination of cross
sectional and time-series data, known as pooling the data. Suppose we want to study the demand
for automobiles in the United States and assume we have time series data on the number of cars
sold, average price of the car, and consumer income. Suppose also that

ln Yt = β1 + β2 ln Pt + β3 ln It + ut

where Y = number of cars sold, P = average price, I = income, and t = time. In time series data the
price and income variables generally tend to be highly collinear. If, however, we have cross-sectional
data, we can obtain a fairly reliable estimate of the income elasticity ˆβ3, because in such data, taken at
a point in time, prices do not vary much. Using this estimate, the time series regression may be written as

Y*t = β1 + β2 ln Pt + ut

where Y* = ln Y − ˆβ3 ln I, that is, Y* represents that value of Y after removing from it the effect of
income. We can now obtain an estimate of the price elasticity β2 from the preceding regression.
3. Dropping a variable(s) and specification bias:
When faced with severe multicollinearity, one of the “simplest” things to do is to drop one
of the collinear variables, so that a formerly non-significant variable may become significant (or vice versa).
But in dropping a variable from the model we may be committing a specification bias or
specification error. Specification bias arises from incorrect specification of the model used in the
analysis.
4. Transformation of variables: Suppose we have time series data on consumption expenditure,
income, and wealth. One reason for high multicollinearity between income and wealth in such
data is that over time both the variables tend to move in the same direction. One way of
minimizing this dependence is to proceed as follows
If the relation

Yt = β1 + β2X2t + β3X3t + ut

holds at time t, it must also hold at time t − 1, because the origin of time is arbitrary anyway.
Therefore, we have

Yt−1 = β1 + β2X2,t−1 + β3X3,t−1 + ut−1

On taking the difference of the two equations, we obtain

Yt − Yt−1 = β2(X2t − X2,t−1) + β3(X3t − X3,t−1) + vt

where vt = ut − ut−1. This equation is known as the first difference form because we run the
regression, not on the original variables, but on the differences of successive values of the
variables.
The first difference regression model often reduces the severity of multicollinearity because,
although the levels of X2 and X3 may be highly correlated, there is no a priori reason to believe
that their differences will also be highly correlated.
An incidental advantage of the first-difference transformation is that it may make a
nonstationary time series stationary. Loosely speaking, a time series, say, Yt, is stationary if its
mean and variance do not change systematically over time.
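A small simulation sketches how differencing lowers the correlation between the regressors; the two trending series are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.arange(50, dtype=float)
income = 10.0 + 2.0 * t + rng.normal(0.0, 1.0, 50)   # shares an upward trend
wealth = 5.0 + 3.0 * t + rng.normal(0.0, 1.0, 50)    # with the other series

r_levels = np.corrcoef(income, wealth)[0, 1]
r_diffs = np.corrcoef(np.diff(income), np.diff(wealth))[0, 1]
print(r_levels > 0.99)      # True: levels are nearly perfectly correlated
print(abs(r_diffs) < 0.5)   # True: the correlation collapses after differencing
```

The common trend dominates the levels, but the first differences are driven by the (independent) random components, so their correlation is small.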
Another commonly used transformation in practice is the ratio transformation.
Consider the model:

Yt = β1 + β2X2t + β3X3t + ut

where Y is consumption expenditure in real dollars, X2 is GDP, and X3 is total population. Since
GDP and population grow over time, they are likely to be correlated. One “solution” to this
problem is to express the model on a per capita basis, that is, dividing through by X3, to obtain:

Yt/X3t = β1(1/X3t) + β2(X2t/X3t) + (ut/X3t)
5. Additional or new data:
Since multicollinearity is a sample feature, it is possible that in another sample involving
the same variables collinearity may not be so serious as in the first sample. Sometimes simply
increasing the size of the sample (if possible) may attenuate the collinearity problem.
For example, in the three-variable model
are also high, multicollinearity may not be readily detectable. Also, as pointed out by
C. Robert Wichers, Krishna Kumar, John O’Hagan, and Brendan McCabe, there are some
statistical problems with the partial correlation test suggested by Farrar and Glauber.
e. Therefore, one may regress each of the Xi variables on the remaining X variables in the
model and find out the corresponding coefficients of determination R2i. A high R2i
would suggest that Xi is highly correlated with the rest of the X’s. Thus, one may drop
that Xi from the model, provided it does not lead to serious specification bias.
4. Detection of multicollinearity is half the battle. The other half is concerned with how to get rid
of the problem. Again there are no sure methods, only a few rules of thumb. Some of these
rules are:
a. using extraneous or prior information,
b. combining cross-sectional and time series data,
c. omitting a highly collinear variable,
d. transforming data, and
e. obtaining additional or new data.
Which of these rules will work in practice will depend on the nature of the data and
severity of the collinearity problem.
5. We noted the role of multicollinearity in prediction and pointed out that unless the collinearity
structure continues in the future sample it is hazardous to use the estimated regression that has
been plagued by multicollinearity for the purpose of forecasting.
6. Although multicollinearity has received extensive (some would say excessive) attention in the
literature, an equally important problem encountered in empirical research is that of
micronumerosity, smallness of sample size. According to Goldberger, “When a research
article complains about multicollinearity, readers ought to see whether the complaints would
be convincing if “micronumerosity” were substituted for “multicollinearity.” He suggests that
one ought to decide how small n, the number of observations, is before deciding that one
has a small-sample problem, just as one decides how high an R2 value is in an auxiliary
regression before declaring that the collinearity problem is very severe.
Autocorrelation:
(What happens if the error terms are correlated?)
The three types of data available for empirical analysis include: (1) Cross section, (2) Time series,
and (3) Combination of cross section and time series, also known as pooled data.
The assumption of homoscedasticity (or equal error variance) may not always be tenable in
cross-sectional data, as this kind of data is often plagued by the problem of heteroscedasticity.
In cross-section studies, data are often collected on the basis of a random sample of cross-sectional
units (households in a consumption-function analysis, or firms in an investment-study analysis), so there
is no prior reason to believe that the error term pertaining to one household (firm) is correlated with the
error term of another household (firm). If by chance such a correlation is observed in cross-sectional
units, it is called spatial autocorrelation, that is, correlation in space rather than over time (temporal).
However, it is important to remember that, in cross-sectional analysis, the ordering of the data
must have some logic, or economic interest, to make sense of any determination of whether (spatial)
autocorrelation is present or not.
The situation is different when dealing with time series data, for the observations in such data
follow a natural ordering over time so that successive observations are likely to exhibit inter-correlations,
especially if the time interval between successive observations is short, such as a day, a week, or a month
rather than a year. If one observes stock price indices of a company over successive days, they move up
or down for several days in succession. Obviously, in such situations, the assumption of no auto, or
serial, correlation in the error terms that underlies the CLRM will be violated.
The Nature of the Problem
Autocorrelation may be defined as “correlation between members of a series of observations
ordered in time [as in time series data] or space [as in cross-sectional data].” In the regression context, the
classical linear regression model assumes that such autocorrelation does not exist in the disturbances ui.
Symbolically,

E(ui uj) = 0, i ≠ j
The classical model assumes that the disturbance term relating to any observation is not
influenced by the disturbance term relating to any other observation.
Ex: i) If we are dealing with quarterly time series data involving the regression of output on labor and
capital inputs and if, say, there is a labor strike affecting output in one quarter, there is no reason to
believe that this disruption will be carried over to the next quarter. That is, if output is lower this
quarter, there is no reason to expect it to be lower next quarter.
ii) While dealing with cross-sectional data involving the regression of family consumption
expenditure on family income, the effect of an increase of one family’s income on its consumption
expenditure is not expected to affect the consumption expenditure of another family.
However, if there is such a dependence, we have autocorrelation.
Symbolically,

E(ui uj) ≠ 0, i ≠ j
In this situation, the disruption caused by a strike this quarter may very well affect output next
quarter, or the increases in the consumption expenditure of one family may very well prompt another
family to increase its consumption expenditure if it wants to keep up with the Joneses.
The terms autocorrelation and serial correlation are often used synonymously, but some authors,
like Tintner, differentiate them. Tintner defines autocorrelation as “lag correlation of a given series with
itself, lagged by a number of time units,’’ whereas he reserves serial correlation for “lag correlation between
two different series.’’ Thus, correlation between two time series such as u1, u2, . . . , u10 and u2, u3, . . . , u11,
where the former is the latter series lagged by one time period, is autocorrelation, whereas correlation
between time series such as u1, u2, . . . , u10 and v2, v3, . . . , v11, where u and v are two different time
series, is called serial correlation.
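Tintner's "lag correlation of a given series with itself" can be illustrated by simulating autocorrelated disturbances. The sketch below assumes, for illustration, a first-order autoregressive scheme ut = ρu(t−1) + εt with ρ = 0.8:

```python
import numpy as np

rng = np.random.default_rng(2)
rho, n = 0.8, 5000
eps = rng.normal(0.0, 1.0, n)
u = np.zeros(n)
for t in range(1, n):
    u[t] = rho * u[t - 1] + eps[t]   # each disturbance carries part of the last one

# autocorrelation: correlation of the series with itself lagged one period
r1 = np.corrcoef(u[:-1], u[1:])[0, 1]
print(round(r1, 1))   # close to the assumed rho of 0.8
```

Such disturbances clearly violate the E(ui uj) = 0 assumption of the classical model.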
In the following figures, Fig. 1a to d show that there is a distinct pattern among the u’s. Fig. 1a shows
a cyclical pattern; Fig. 1b and c suggest an upward or downward linear trend in the disturbances; whereas
Fig. 1d indicates that both linear and quadratic trend terms are present in the disturbances. Only Fig. 1e
indicates no systematic pattern, supporting the no-autocorrelation assumption of the classical linear
regression model.
Causes for autocorrelation:
1. Inertia: A salient feature of most economic time series is inertia (inactivity), or sluggishness. As is
well known, time series such as GNP, price indices, production, employment, and unemployment
exhibit (business) cycles.
Starting at the bottom of the recession, when economic recovery starts, most of these series start
moving upward. In this upswing, the value of a series at one point in time is greater than its previous
value. This “momentum’’ built into them continues until something happens (e.g., an increase in interest
rates or taxes or both) to slow them down. Therefore, in regressions involving time series data, successive
observations are likely to be interdependent.
2. Specification Bias- Excluded Variables Case:
The researcher often starts with a plausible regression model that may not be the most “perfect’’
one. After the regression analysis, the researcher checks whether the results accord with a priori
expectations. If the plotted residuals ˆui obtained from the fitted regression follow patterns such as those
shown in Fig. 1a to d, these residuals (which are proxies for ui) may suggest that some variables that were
omitted from the model for a variety of reasons should be included. This is the case of excluded-variable
specification bias.
Often the inclusion of such variables removes the correlation pattern observed among the
residuals. Ex: For the following demand model:
Yt = β1 + β2X2t + β3X3t + β4X4t + ut ………. (1)
Where,
Y = Quantity of beef demanded,
X2 = Price of beef,
X3 = Consumer income,
X4 = Price of pork, and
t = time.
However, for some reason, if we run the following regression:
Yt = β1 + β2X2t + β3X3t + vt ………. (2)
If (1) is the “correct’’ model or true relation, running (2) is equivalent to letting vt = β4X4t + ut.
And to the extent the price of pork affects the consumption of beef, the error term v will reflect a
systematic pattern, thus creating (false) autocorrelation. A simple test of this would be to run both (1) and
(2) and see whether autocorrelation, if any, observed in model (2) disappears when (1) is run.
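The suggestion above can be illustrated with a small simulation; all series and coefficients below are hypothetical illustration values, and the Durbin–Watson statistic (discussed later in these notes) is used as a summary measure of residual autocorrelation:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
t = np.arange(n)
x2 = rng.normal(10, 1, n)                    # price of beef (hypothetical)
x3 = rng.normal(50, 5, n)                    # consumer income (hypothetical)
x4 = 5 + 0.1 * t + rng.normal(0, 0.3, n)     # price of pork, trending over time
y = 2 - 1.5 * x2 + 0.8 * x3 + 1.2 * x4 + rng.normal(0, 1, n)   # "true" model (1)

def ols_resid(y, X):
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ b

def dw(res):
    # Durbin-Watson statistic; values well below 2 signal positive autocorrelation
    return np.sum(np.diff(res) ** 2) / np.sum(res ** 2)

full = ols_resid(y, np.column_stack([np.ones(n), x2, x3, x4]))   # model (1)
short = ols_resid(y, np.column_stack([np.ones(n), x2, x3]))      # model (2): X4 omitted

print(dw(full))    # near 2: no pattern left in the residuals
print(dw(short))   # well below 2: the omitted trending X4 masquerades as autocorrelation
```

Running both regressions, as the text suggests, makes the (false) autocorrelation induced by the omitted variable visible.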
3. Specification bias - Incorrect functional form
Suppose the “true’’ or correct model in a cost-output study is as follows:
MCi = β1 + β2 outputi + β3 outputi² + ui ……….(3)
but we fit the following model:
MCi = β1 + β2 outputi + vi ……….(4)
Fig.2: Specification bias: incorrect functional form
The marginal cost curve corresponding to the “true’’ model is shown in Fig. 2 along with the
“incorrect’’ linear cost curve. As Fig.2 shows, between points A and B the linear marginal cost curve will
consistently overestimate the true marginal cost, whereas beyond these points it will consistently
underestimate the true marginal cost. This result is to be expected, because the disturbance term vi is, in fact, equal to β3 outputi² + ui, and hence will catch the systematic effect of the output² term on marginal cost. In this case, vi will reflect autocorrelation because of the use of an incorrect functional form.
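A minimal sketch of the same point, with hypothetical cost-function coefficients: fit the linear model (4) to data generated by the quadratic model (3) and the residuals, ordered by output, display a systematic pattern.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 80
output = np.sort(rng.uniform(1, 10, n))                        # output levels, ordered
mc = 20 - 3 * output + 0.5 * output**2 + rng.normal(0, 1, n)   # "true" quadratic model (3)

X = np.column_stack([np.ones(n), output])                      # fitted linear model (4)
beta, *_ = np.linalg.lstsq(X, mc, rcond=None)
v = mc - X @ beta                          # v_i absorbs the omitted output^2 term

dw = np.sum(np.diff(v) ** 2) / np.sum(v ** 2)
print(dw)   # far below 2: a systematic pattern masquerading as autocorrelation
```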
4. Cobweb Phenomenon:
The supply of many agricultural commodities reflects the cobweb phenomenon, where supply reacts to price with a lag of one time period because supply decisions take time to implement (the gestation period). Thus, in deciding this year’s planting of crops, farmers are influenced by the price prevailing last year, so that their supply function is
Supplyt = β1 + β2Pt−1 + ut ……….(5)
Suppose at the end of period t, price Pt turns out to be lower than Pt−1.
Therefore, in period t+1 farmers may very well decide to produce less than they did in period t.
Obviously, in this situation the disturbances ut are not expected to be random, because if the farmers overproduce in year t, they are likely to reduce their production in t + 1, and so on, leading to a cobweb pattern.
5. Lags
In a time series regression of consumption expenditure on income, it is not uncommon to find that
the consumption expenditure in the current period depends, among other things, on the consumption
expenditure of the previous period. That is,
Consumptiont = β1 + β2Incomet + β3Consumptiont−1 + ut ……….(6)
The above regression is known as an autoregression because one of the explanatory variables is the lagged value of the dependent variable.
Consumers do not change their consumption habits readily for psychological, technological, or
institutional reasons. Now if we neglect the lagged term in above equation, the resulting error term will
reflect a systematic pattern due to the influence of lagged consumption on current consumption.
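A small sketch of this case, with hypothetical coefficients: consumption is generated by the autoregression (6), but the lagged term is neglected in the fitted regression, leaving a pattern in the errors.

```python
import numpy as np

rng = np.random.default_rng(9)
n = 300
income = 50 + np.cumsum(rng.normal(0.5, 1, n))   # slowly growing income series
c = np.empty(n)
c[0] = 40.0
for t in range(1, n):
    # true autoregression (6): current consumption depends on lagged consumption
    c[t] = 5 + 0.4 * income[t] + 0.5 * c[t - 1] + rng.normal(0, 1)

# misspecified static regression: the lagged consumption term is neglected
X = np.column_stack([np.ones(n - 1), income[1:]])
b, *_ = np.linalg.lstsq(X, c[1:], rcond=None)
res = c[1:] - X @ b
dw = np.sum(np.diff(res) ** 2) / np.sum(res ** 2)
print(dw)   # below 2: the neglected lag leaves a systematic pattern in the errors
```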
6. Manipulation of Data:
In empirical analysis, the raw data are often “manipulated.’’
Ex: In time series regressions involving quarterly data, the quarterly figures are usually obtained by adding up the three monthly values and dividing the sum by 3. This averaging introduces smoothness into the data by dampening the fluctuations in the monthly data. Therefore, the graph plotting the quarterly data looks much smoother than that of the monthly data, and this smoothness may itself lead to a systematic pattern in the disturbances, thereby introducing autocorrelation.
Another source of manipulation is interpolation or extrapolation of data.
Ex: The Population Census is conducted every 10 years, e.g., in 1991 and 2001. If there is a need to obtain data for some year within the inter-census period 1991–2001, the common practice is to interpolate on the basis of some ad hoc assumptions. All such data “massaging’’ techniques might impose upon the data a systematic pattern that might not exist in the original data.
7. Data Transformation:
Consider the following model:
Yt = β1 + β2Xt + ut ……….(7)
where,
Y = Consumption expenditure and X = Income.
Since it holds true at every time period, it holds true also in the previous time period, (t − 1). So,
we can write (7) as
Yt−1 = β1 + β2Xt−1 + ut−1 ……….(8) [Known as level form regression]
Yt−1, Xt−1, and ut−1 are known as the lagged values of Y, X, and u, respectively, by one period.
Subtracting (8) from (7), we obtain
∆Yt = β2∆Xt + ∆ut ……….(9) [Known as the difference form regression]
where ∆, the first-difference operator, tells us to take successive differences of the variables. For empirical purposes, we write this equation as
∆Yt = β2∆Xt + vt ……….(10)
where vt = ut − ut−1.
If in (8) Y and X represent the logarithms of consumption expenditure and income, then in (9) ∆Y and
∆X will represent changes in the logs of consumption expenditure and income. A change in the log of a
variable is a relative or a percentage change, if the former is multiplied by 100. So, instead of studying
relationships between variables in the level form, we may be interested in their relationships in the growth
form.
Now if the error term in (7) satisfies the standard OLS assumptions, particularly the assumption
of no autocorrelation, it can be shown that the error term vt in (10) is autocorrelated.
Since, vt = ut − ut−1,
E(vt) = E(ut − ut−1) = E(ut)−E(ut−1) = 0, since E(u) = 0, for each t.
Now, var (vt) = var (ut − ut−1) = var (ut) + var (ut−1) = 2σ²,
since the variance of each ut is σ² and the u’s are independently distributed. Hence, vt is homoscedastic.
But cov (vt , vt−1) = E(vtvt−1) = E[(ut − ut−1)(ut−1 − ut−2)] = −σ2 which is obviously nonzero. Therefore,
although the u’s are not autocorrelated, the v’s are.
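The result that vt = ut − ut−1 has variance 2σ² and lag-1 covariance −σ² can be checked numerically; the sketch below uses an arbitrary σ = 1.5.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma = 1.5
u = rng.normal(0, sigma, 200_000)      # white-noise errors of the level equation (7)
v = u[1:] - u[:-1]                     # errors of the difference equation: v_t = u_t - u_(t-1)

var_v = v.var()
cov_v = np.mean(v[1:] * v[:-1])        # E(v) = 0, so this estimates cov(v_t, v_(t-1))
print(var_v)                           # ~ 2 * sigma^2 = 4.5
print(cov_v)                           # ~ -sigma^2 = -2.25
```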
8. Nonstationarity:
A time series is stationary if its characteristics (e.g., mean, variance, and covariance) are time
invariant; that is, they do not change over time. If that is not the case, we have a nonstationary time
series. In a regression model (7), it is quite possible that both Y and X are nonstationary and therefore the
error u is also nonstationary. In that case, the error term will exhibit autocorrelation.
Autocorrelation can be positive (Fig. a) as well as negative, although most economic time series
generally exhibit positive autocorrelation because most of them either move upward or downward over
extended time periods and do not exhibit a constant up-and-down movement such as that shown in Fig. b.
Fig.3: (a) Positive and (b) negative autocorrelation
It is often assumed that the disturbances are generated by the following mechanism:
ut = ρut−1 + εt, −1 < ρ < 1 ………. (12)
where εt satisfies the standard OLS assumptions: E(εt) = 0, var (εt) = σε², and cov (εt, εt+s) = 0 for s ≠ 0.
An error term with these properties is called a white noise error term. Equation (12) postulates
that the value of the disturbance term in period t is equal to rho times its value in the previous period plus
a purely random error term. The scheme (12) is known as the Markov first-order autoregressive scheme,
or simply a first-order autoregressive scheme, usually denoted as AR(1). The name autoregressive is
appropriate because (12) can be interpreted as the regression of ut on itself lagged one period. It is first
order because ut and its immediate past value are involved; that is, the maximum lag is 1.
If the model is ut = ρ1ut−1 + ρ2ut−2 + εt, it would be an AR(2), or second-order, autoregressive
scheme, and so on.
Rho (coefficient of autocovariance) in (12), can also be interpreted as the first-order coefficient
of autocorrelation, or more accurately, the coefficient of autocorrelation at lag 1.
Given the AR(1) scheme, it can be shown that
var (ut) = σε²/(1 − ρ²) ……………….. (13)
cov (ut, ut+s) = ρ^s σε²/(1 − ρ²) …………………(14)
cor (ut, ut+s) = ρ^s …………………(15)
Where, cov (ut, ut+s) means covariance between error terms s periods apart and where cor (ut, ut+s) means
correlation between error terms s periods apart. Because of the symmetry property of covariances and
correlations, cov (ut , ut+s) = cov (ut, ut−s) and cor(ut, ut+s) = cor(ut, ut−s) .
Since ρ is a constant between −1 and +1, (13) shows that under the AR(1) scheme, the variance of
ut is still homoscedastic, but ut is correlated not only with its immediate past value but its values several
periods in the past. It is critical to note that |ρ| < 1, that is, the absolute value of rho is less than one.
If ρ =1, the variances and covariances listed above are not defined.
If |ρ| < 1, we say that the AR(1) process given in (12,i.e., ut = ρut−1 + εt) is stationary; that is, the
mean, variance, and covariance of ut do not change over time.
If |ρ| is less than one, then it is clear from (14) that the value of the covariance will decline as we
go into the distant past.
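The stationarity properties (13)–(15) can be checked by simulation; the sketch below uses arbitrary values ρ = 0.7 and σε = 1.

```python
import numpy as np

rng = np.random.default_rng(3)
rho, n = 0.7, 200_000
eps = rng.normal(0, 1.0, n)                 # white noise, sigma_eps = 1
u = np.empty(n)
u[0] = eps[0] / np.sqrt(1 - rho**2)         # start from the stationary distribution
for t in range(1, n):
    u[t] = rho * u[t - 1] + eps[t]          # scheme (12)

var_u = u.var()                              # (13): sigma_eps^2/(1 - rho^2) = 1.9608
corr1 = np.corrcoef(u[1:], u[:-1])[0, 1]     # (15) with s = 1: rho = 0.7
corr3 = np.corrcoef(u[3:], u[:-3])[0, 1]     # (15) with s = 3: rho^3 = 0.343
print(var_u, corr1, corr3)
```

Note how the correlation declines geometrically as the lag s grows, as (15) predicts.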
One reason we use the AR(1) process is not only because of its simplicity compared to higher-order
AR schemes, but also because in many applications it has proved to be quite useful. Additionally, a
considerable amount of theoretical and empirical work has been done on the AR(1) scheme.
Consider the two-variable regression Yt = β1 + β2Xt + ut. The usual OLS estimator of the slope is
ˆβ2 = ∑xt yt/∑xt² …………………(16)
where the small letters denote deviations from sample means, and its variance under the classical assumptions is
var (ˆβ2) = σ²/∑xt² …………………(17)
Now under the AR(1) scheme, it can be shown that the variance of this estimator is:
var (ˆβ2)AR1 = (σ²/∑xt²)[1 + 2ρ ∑xt xt+1/∑xt² + 2ρ² ∑xt xt+2/∑xt² + · · · + 2ρ^(n−1) x1xn/∑xt²] ……(18)
Where, var (ˆβ2)AR1 means the variance of ˆβ2 under first-order autoregressive scheme.
A comparison of (18) with (17) shows the former is equal to the latter times a term that depends
on ρ as well as the sample autocorrelations between the values taken by the regressor X at various lags.
In general we cannot predict whether var (ˆβ2) is less than or greater than var (ˆβ2)AR1. Of course,
if rho is zero, the two formulas will coincide. Also, if the correlations among the successive values of the
regressor are very small, the usual OLS variance of the slope estimator will not be seriously biased. But,
as a general principle, the two variances will not be the same.
To give some idea about the difference between the variances given in (17) and (18), assume that
the regressor X also follows the first-order autoregressive scheme with a coefficient of autocorrelation of
r. Then it can be shown that (18) reduces to:
var (ˆβ2)AR1 = (σ²/∑xt²)[(1 + rρ)/(1 − rρ)] ……(19)
If, for example, r = 0.6 and ρ = 0.8, using (19)
var (ˆβ2)AR1 = 2.8461 var (ˆβ2)OLS.
To put it another way,
var (ˆβ2)OLS = 1/2.8461var (ˆβ2)AR1 = 0.3513 var (ˆβ2)AR1 .
That is, the usual OLS formula [i.e.,(17)] will underestimate the variance of (ˆβ2)AR1 by about 65 percent.
[This answer is specific for the given values of r and ρ].
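The arithmetic behind the factor in (19) is easy to verify for the stated values r = 0.6 and ρ = 0.8:

```python
r, rho = 0.6, 0.8
ratio = (1 + rho * r) / (1 - rho * r)   # (1 + 0.48)/(1 - 0.48) = 1.48/0.52
print(ratio)          # ~2.846, i.e., the text's 2.8461 up to rounding
print(1 / ratio)      # ~0.351, so OLS understates the variance by about 65 percent
```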
Warning:
A blind application of the usual OLS formulas to compute the variances and standard errors of the
OLS estimators could give seriously misleading results.
Suppose we continue to use the OLS estimator ˆβ2 and adjust the usual variance formula by taking into
account the AR(1) scheme. That is, we use ˆβ2 given by (16) but use the variance formula given by (18).
What now are the properties of ˆβ2? It can be proved that ˆβ2 is still linear and unbiased.
Detecting Autocorrelation
Ex: Relationship between wages and productivity in the business sector of the US, 1959–1998
Table-1: Indices of Real Compensation and Productivity, United States, 1959–1998
Table 1 above contains the data on indices of real compensation per hour (Y) and output per hour (X) in the business sector of the U.S. economy for the period 1959–1998 [base: 1992 = 100].
Fig.4: Real compensation (Y) plotted against productivity (X), 1959–1998
Plotting the data on Y and X, we obtain Fig.4. Since the relationship between them is expected to be positive, it is not surprising that the two variables are positively related; what is surprising is that the relationship between the two is almost linear, although there is some hint that at higher values of productivity the relationship may be slightly nonlinear. Therefore, linear as well as log–linear models were estimated, with the following results.
Qualitatively, both the models give similar results. In both cases the estimated coefficients are
“highly” significant, as indicated by the high t values.
In the linear model, if the index of productivity goes up by a unit, on average, the index of
compensation goes up by about 0.71 units.
In the log–linear model, if the index of productivity goes up by 1 percent, on average, the index
of real compensation goes up by about 0.67 percent.
How reliable are the results if there is autocorrelation?
Tests / Methods to detect Autocorrelation
If there is autocorrelation, the estimated standard errors are biased, as a result of which the estimated t ratios are unreliable. Hence there is a need to find out whether the data suffer from autocorrelation, using the different methods detailed below with the LINEAR MODEL.
I. Graphical Method
The assumption of non-autocorrelation of the classical model relates to the population
disturbances ut, which are not directly observable. What we have instead are their proxies, the residuals
ˆut, which can be obtained by the usual OLS procedure. Although the ˆut are not the same thing as ut, very
often a visual examination of the ˆu’s gives us some clues about the likely presence of autocorrelation in
the u’s. Actually, a visual examination of ˆut or (ˆu2t) can provide useful information not only about
autocorrelation but also about heteroscedasticity, model inadequacy, or specification bias.
There are various ways of examining the residuals. In the time sequence plot [Fig.5], the residuals are plotted against time; the figure shows the residuals obtained from the wages–productivity regression (1). The values of these residuals are given in Table 2 along with some other data.
Fig.5: Residuals and standardized residuals from the wages–productivity regression (1)
Alternatively, we can plot the standardized residuals against time, which are also shown in
Fig.5 and Table 2.
The standardized residuals are the residuals (ˆut) divided by the standard error of the regression
(√ˆσ2), that is, they are (ˆut/ˆσ ). [Note that ˆut and ˆσ are measured in the units in which the regressand Y
is measured]. The values of the standardized residuals will therefore be pure numbers (devoid of units of
measurement) and can be compared with the standardized residuals of other regressions. Moreover, the
standardized residuals, like ˆut, have zero mean and approximately unit variance. In large samples (ˆut/ˆσ)
is approximately normally distributed with zero mean and unit variance. For our example, ˆσ = 2.6755.
Examining the time sequence plot given in Fig 5, one can observe that both ˆut and the standardized ˆut
exhibit a pattern observed in Figure 1d, suggesting that perhaps ut are not random.
To see this differently, we can plot ût against ût−1, that is, plot the residuals at time t against their value at time (t − 1), a kind of empirical test of the AR(1) scheme. If the residuals are nonrandom, we should obtain pictures similar to those shown in Fig.3. This plot for the wages–productivity regression is shown in Fig.6 for the data in Table 2. As this figure reveals, most of the residuals are bunched in the first (northeast) and the third (southwest) quadrants, suggesting a strong positive correlation in the residuals.
II. Runs Test
Examine the signs (+, −) of the residuals arranged in time order. A run is an uninterrupted sequence of one sign. Let N1 = the number of negative residuals, N2 = the number of positive residuals, N = N1 + N2, and R = the number of runs. Under the null hypothesis that successive signs are independent, R is (for large N1 and N2) approximately normally distributed with
E(R) = 2N1N2/N + 1
σ²R = 2N1N2(2N1N2 − N)/[N²(N − 1)]
If the null hypothesis of randomness is sustainable, following the properties of the normal
distribution, we should expect that
Prob [E(R) − 1.96σR ≤ R ≤ E(R) + 1.96σR] = 0.95
That is, the probability is 95 percent that the preceding interval will include R.
Decision Rule:
Do not reject the null hypothesis of randomness with 95% confidence if R, the number of runs,
lies in the preceding confidence interval; reject the null hypothesis if the estimated R lies outside these
limits. (One can choose any other confidence level.)
In this example,
N1 - the number of minuses = 19
N2 - the number of pluses = 21
R = 3.
Then using the above formula,
E(R) = 20.95
σ²R = 9.6936
σR = 3.1134
The 95% confidence interval for R is thus:
[20.95 ± 1.96(3.1134)] = (14.8477, 27.0523)
Obviously, this interval does not include 3. Hence, we can reject the hypothesis that the residuals
in our wages–productivity regression are random with 95% confidence. In other words, the residuals
exhibit autocorrelation.
As a general rule, if there is
Positive autocorrelation- the number of runs will be few, whereas
Negative autocorrelation - the number of runs will be many.
Swed and Eisenhart have developed special tables that give critical values of the runs expected in a
random sequence of N observations if N1 or N2 is smaller than 20.
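A minimal sketch of the runs-test computation using the normal approximation; the sign sequence below is a hypothetical one constructed to have N1 = 19 minuses, N2 = 21 pluses, and R = 3 runs, matching the example in the text.

```python
import math

def runs_test(residuals, z=1.96):
    signs = [e > 0 for e in residuals]
    n1 = signs.count(False)                              # number of minuses
    n2 = signs.count(True)                               # number of pluses
    n = n1 + n2
    runs = 1 + sum(a != b for a, b in zip(signs, signs[1:]))
    mean = 2 * n1 * n2 / n + 1                           # E(R)
    var = 2 * n1 * n2 * (2 * n1 * n2 - n) / (n**2 * (n - 1))
    sd = math.sqrt(var)
    return runs, mean, var, (mean - z * sd, mean + z * sd)

# hypothetical sequence: 10 minuses, 21 pluses, 9 minuses -> 3 runs
example = [-1.0] * 10 + [1.0] * 21 + [-1.0] * 9
R, m, v, ci = runs_test(example)
print(R, m, v, ci)   # R = 3 falls outside the 95% interval: reject randomness
```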
III. Durbin–Watson d Test
The most celebrated test for detecting serial correlation is that developed by the statisticians Durbin and Watson. It is popularly known as the Durbin–Watson d statistic, which is defined as
d = ∑t=2..n (ût − ût−1)²/∑t=1..n ût²
Since ∑ût² and ∑û²t−1 differ in only one observation, they are approximately equal. Therefore, setting ∑û²t−1 ≈ ∑ût², the above equation may be written as
d ≈ 2(1 − ρ̂), where ρ̂ = ∑ût ût−1/∑ût²
Illustration: For the wages–productivity regression fitted to the data in Table 2, the estimated d value is 0.1229, suggesting positive serial correlation in the residuals. From the Durbin–Watson tables, for 40 observations and one explanatory variable, dL = 1.44 and dU = 1.54 at the 5 percent level. Since the computed d of 0.1229 lies below dL, we conclude that there is statistically significant positive serial correlation in the residuals.
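The d statistic and its approximation 2(1 − ρ̂) can be sketched as follows; the residual series here is a hypothetical AR(1) series with ρ = 0.9, not the Table 2 data.

```python
import numpy as np

def durbin_watson(res):
    res = np.asarray(res, dtype=float)
    return np.sum(np.diff(res) ** 2) / np.sum(res ** 2)

rng = np.random.default_rng(4)
n = 5000
eps = rng.normal(0, 1, n)
u = np.empty(n)
u[0] = eps[0]
for t in range(1, n):
    u[t] = 0.9 * u[t - 1] + eps[t]          # strongly positively autocorrelated series

d = durbin_watson(u)
rho_hat = np.sum(u[1:] * u[:-1]) / np.sum(u ** 2)
print(d, 2 * (1 - rho_hat))   # d is well below 2 and close to 2(1 - rho_hat)
```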
However, the d test has a great drawback: if the statistic falls in the indecisive zone, one cannot conclude whether (first-order) autocorrelation exists. In that case, the following modified d test can be used.
Given the level of significance α,
1. H0: ρ = 0 versus H1:ρ > 0. Reject H0 at α level if d < dU. That is, there is statistically significant
positive autocorrelation.
2. H0: ρ = 0 versus H1:ρ < 0. Reject H0 at α level if the estimated (4 − d) < dU, that is, there is
statistically significant evidence of negative autocorrelation.
3. H0: ρ = 0 versus H1: ρ ≠ 0. Reject H0 at 2α level if d < dU or (4 − d) < dU, that is, there is
statistically significant evidence of autocorrelation, positive or negative.
It may be pointed out that the indecisive zone narrows as the sample size increases, which can be
seen clearly from the Durbin–Watson tables. For example, with 4 regressors and 20 observations, the 5
percent lower and upper d values are 0.894 and 1.828, respectively, but these values are 1.515 and 1.739
if the sample size is 75.
The computer program Shazam performs an exact d test, that is, it gives the p value, the exact probability of the computed d value. With modern computing facilities, it is no longer difficult to find the p value of the computed d statistic. Using SHAZAM (version 9) for the wages–productivity regression, the p value of the computed d of 0.1229 is practically zero, thereby reconfirming our earlier conclusion based on the Durbin–Watson tables.
The Durbin–Watson d test has become so venerable that practitioners often forget the
assumptions underlying the test. In particular, the assumptions that (1) the explanatory variables, or
regressors, are nonstochastic; (2) the error term follows the normal distribution; and (3) that the regression
models do not include the lagged value(s) of the regressand are very important for the application of the d
test.
If a regression model contains lagged value(s) of the regressand, the d value in such cases is often
around 2, which would suggest that there is no (first-order) autocorrelation in such models. Thus, there is
a built-in bias against discovering (first-order) autocorrelation in such models.
IV. Breusch–Godfrey (BG) Test
This test allows for
(1) nonstochastic regressors, such as the lagged values of the regressand;
(2) higher-order autoregressive schemes, such as AR(1), AR(2), etc.; and
(3) simple or higher-order moving averages of white noise error terms, such as εt in equation (12).
The BG test is also known as the LM (Lagrange multiplier) test.
The null hypothesis is that the ρ’s of the AR(p) error scheme in eqn (2) are all zero. Operationally, the OLS residuals ût are regressed on the original regressors together with the p lagged residuals ût−1, . . . , ût−p [the eqn (4) regression], and R² is obtained from that auxiliary regression. Then
(n − p)R² ~ χ²p (asymptotically) …………..(5)
That is, asymptotically, n − p times the R2 value obtained from the eqn (4) regression follows the chi-
square distribution with p df. If in an application, (n − p)R2 exceeds the critical chi-square value at the
chosen level of significance, we reject the null hypothesis, in which case at least one rho in eqn (2) is
statistically significantly different from zero.
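A minimal sketch of the BG computation, assuming a simple variant in which the missing initial lagged residuals are padded with zeros; the data are hypothetical, generated with AR(1) errors so the null should be rejected.

```python
import numpy as np

def bg_stat(y, X, p):
    """(n - p) * R^2 from the auxiliary regression of the OLS residuals on X
    and p lagged residuals (missing initial lags padded with zeros)."""
    n = len(y)
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    uhat = y - X @ b
    lags = np.column_stack([np.r_[np.zeros(k), uhat[:-k]] for k in range(1, p + 1)])
    Z = np.column_stack([X, lags])
    g, *_ = np.linalg.lstsq(Z, uhat, rcond=None)
    e = uhat - Z @ g
    r2 = 1 - (e @ e) / ((uhat - uhat.mean()) @ (uhat - uhat.mean()))
    return (n - p) * r2

rng = np.random.default_rng(5)
n = 400
x = rng.normal(0, 1, n)
eps = rng.normal(0, 1, n)
u = np.empty(n)
u[0] = eps[0]
for t in range(1, n):
    u[t] = 0.6 * u[t - 1] + eps[t]          # AR(1) errors
y = 1 + 2 * x + u
X = np.column_stack([np.ones(n), x])

stat = bg_stat(y, X, p=2)
print(stat)   # far above the 5% chi-square(2) critical value of 5.99
```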
Heteroscedasticity
Note the subscript on σ², which tells us that the conditional variances of ui (= the conditional variances of Yi) are no longer constant.
To make the difference between homoscedasticity and heteroscedasticity clear, assume
that in the two-variable model
Yi = β1 + β2Xi + ui,
Where, Y represents savings and X represents income. Figures 1 and 2 show that as income
increases, savings on the average also increase. But in Figure 1 the variance of savings remains the
same at all levels of income, whereas in Fig. 2 it increases with income, i.e., the higher income
families on the average save more than the lower-income families, but there is also more
variability in their savings.
Reasons for varying variances of ui
1. Following the error-learning models: As people learn, their errors of behavior become smaller over time, so σi² is expected to decrease. Ex: Fig. 3 relates the number of typing errors made in a given time period on a test to the hours put into typing practice. As the number of hours of typing practice increases, the average number of typing errors as well as their variance decreases.
AEC 507/GMG/ Heteroscedasticity 1
Fig. 3: Illustration of heteroscedasticity
2. As incomes grow, people have more discretionary income and hence more scope for choice
about the disposition of their income. Hence, σi2 is likely to increase with income. Thus in the
regression of savings on income one is likely to find σi2 increasing with income because people
have more choices about their savings behavior. Similarly, companies with larger profits are
generally expected to show greater variability in their dividend policies than companies with
lower profits. Also, growth oriented companies are likely to show more variability in their
dividend payout ratio than established companies.
3. As data collecting techniques improve, σi2 is likely to decrease. Thus, banks that have
sophisticated data processing equipment are likely to commit fewer errors in the monthly or
quarterly statements of their customers than banks without such facilities.
4. Heteroscedasticity can also arise as a result of the presence of outliers. An outlying
observation, or outlier, is an observation that is much different (either very small or very large)
in relation to the observations in the sample. More precisely, an outlier is an observation from
a different population to that generating the remaining sample observations. The inclusion or
exclusion of such an observation, especially if the sample size is small, can substantially alter
the results of regression analysis.
Fig.4: The relationship between stock prices and consumer prices
5. Another source of heteroscedasticity arises from violating Assumption 9 of CLRM, viz., that
the regression model is correctly specified. Heteroscedasticity may be due to the fact that some
important variables are omitted from the model. Thus, in the demand function for a
commodity, if we do not include the prices of commodities complementary to or competing
with the commodity in question (the omitted variable bias), the residuals obtained from the
regression may give the distinct impression that the error variance may not be constant. But if
the omitted variables are included in the model, that impression may disappear.
Ex: Consider advertising impressions retained (Y) in relation to advertising expenditure (X). If we regress Y on X only and examine the residuals from this regression, we see one pattern; but if we regress Y on X and X², we see another pattern, as Fig.5 clearly shows. We have already seen that X² belongs in the model.
Fig.5: Residuals from the regression of (a) impressions on advertising expenditure and (b) impressions on Adexp and Adexp².
6. Another source of heteroscedasticity is skewness in the distribution of one or more regressors
included in the model. Examples are economic variables such as income, wealth, and
education. It is well known that the distribution of income and wealth in most societies is
uneven, with the bulk of the income and wealth being owned by a few at the top.
7. Other sources of heteroscedasticity: As David Hendry notes, heteroscedasticity can also arise
because of (1) incorrect data transformation (e.g., ratio or first difference transformations) and
(2) incorrect functional form (e.g., linear versus log–linear models).
Note that the problem of heteroscedasticity is likely to be more common in cross-sectional
than in time series data. In cross-sectional data, one usually deals with members of a population at
a given point in time, such as individual consumers or their families, firms, industries, or
geographical subdivisions such as state, country, city, etc. Moreover, these members may be of
different sizes, such as small, medium, or large firms or low, medium, or high income. In time
series data, on the other hand, the variables tend to be of similar orders of magnitude because one
generally collects the data for the same entity over a period of time. Examples are GNP,
consumption expenditure, savings, or employment in the United States, say, for the period 1950 to
2000.
The following table gives data on compensation per employee in 10 nondurable goods manufacturing industries, classified by the employment size of the firm or the establishment for the year 1958. Also given in the table are average productivity figures for nine employment classes. Although the industries differ in their output composition, the table clearly shows that on average large firms pay more than small firms. As an example, firms employing one to four employees paid on average about $3396, whereas those employing 1000 to 2499 employees paid on average about $4843. But notice that there is considerable variability in earnings among the various employment classes, as indicated by the estimated standard deviations of earnings. This can
be seen also from Fig. 6, which plots the standard deviation of compensation and average
compensation in each employment class. Thus it can be seen clearly that on average, the standard
deviation of compensation increases with the average value of compensation.
Table-1: Compensation per employee ($) in nondurable manufacturing industries according
to Employment size of establishment, 1958
Fig.6: Standard deviation of compensation and mean compensation
OLS Estimation in the Presence of Heteroscedasticity
What happens to OLS estimators and their variances if we introduce heteroscedasticity by
letting E(ui2 ) = σi2 but retain all other assumptions of the classical model? To answer this question,
let us revert to the two-variable model:
Yi = β1 + β2Xi + ui
Applying the usual OLS formulas, the slope estimator ˆβ2 = ∑xi yi/∑xi² remains unbiased, but its variance now becomes
var (ˆβ2) = ∑xi²σi²/(∑xi²)²
which is obviously different from the usual variance formula obtained under the assumption of homoscedasticity, namely,
var (ˆβ2) = σ²/∑xi²
under certain conditions (called regularity conditions), ˆβ2 is asymptotically normally distributed.
Of course, what we have said about ˆβ2 also holds true of other parameters of a multiple regression
model. Granted that ˆβ2 is still linear unbiased and consistent, is it “efficient” or “best”; that is,
does it have minimum variance in the class of unbiased estimators? The answer is no: ˆβ2 is no
longer best. Then what is BLUE in the presence of heteroscedasticity?
The answer is given in the following section.
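The point just made can be checked numerically; the Monte Carlo sketch below uses hypothetical data with error standard deviation growing with X, and compares the true sampling variance of the OLS slope with what the usual homoscedastic formula reports.

```python
import numpy as np

rng = np.random.default_rng(8)
n, reps = 50, 4000
x = np.linspace(1, 10, n)
X = np.column_stack([np.ones(n), x])
xc = x - x.mean()
sigma = 0.2 * x**2                     # heteroscedastic error s.d. (hypothetical)

b2 = np.empty(reps)
usual = np.empty(reps)
for i in range(reps):
    y = 3 + 2 * x + rng.normal(0, sigma)
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    u = y - X @ b
    b2[i] = b[1]
    usual[i] = (u @ u / (n - 2)) / (xc @ xc)   # usual formula: sigma_hat^2 / sum x_i^2

# heteroscedasticity-correct variance: sum(x_i^2 sigma_i^2) / (sum x_i^2)^2
true_var = np.sum(xc**2 * sigma**2) / (xc @ xc) ** 2
print(b2.mean())            # ~2: the OLS slope is still unbiased
print(b2.var(), true_var)   # sampling variance matches the correct formula
print(usual.mean())         # the usual homoscedastic formula is off target
```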
The Method of Generalized Least Squares (GLS)
Why is the usual OLS estimator of β2 given above not best, although it is still unbiased?
Intuitively, we can see the reason from Table 1 above. As the table shows, there is considerable
variability in the earnings between employment classes. If we were to regress per-employee
compensation on the size of employment, we would like to make use of the knowledge that there is
considerable interclass variability in earnings. Ideally, we would like to devise the estimating
scheme in such a manner that observations coming from populations with greater variability are
given less weight than those coming from populations with smaller variability. Examining Table 1,
we would like to weight observations coming from employment classes 10–19 and 20–49 more
heavily than those coming from employment classes like 5–9 and 250–499, for the former are
more closely clustered around their mean values than the latter, thereby enabling us to estimate the
PRF more accurately.
Unfortunately, the usual OLS method does not follow this strategy and therefore does not
make use of the “information” contained in the unequal variability of the dependent variable Y,
say, employee compensation of Table 1: It assigns equal weight or importance to each observation.
But a method of estimation, known as generalized least squares (GLS), takes such information
into account explicitly and is therefore capable of producing estimators that are BLUE. To see this, let us consider the two-variable model
Yi = β1 + β2Xi + ui
Assuming the heteroscedastic variances σi² are known, divide both sides by σi to obtain
(Yi/σi) = β1(1/σi) + β2(Xi/σi) + (ui/σi)
which, for ease of exposition, we write as
Y*i = β*1X*0i + β*2X*i + u*i
where the starred, or transformed, variables are the original variables divided by (the known) σi, and X*0i = 1/σi.
We use the notation β*1 and β*2, the parameters of the transformed model, to distinguish them from
the usual OLS parameters β1 and β2. What is the purpose of transforming the original model? To
see this, notice the following feature of the transformed error term ui*:
var (u*i) = E(u*i²) = E(ui/σi)² = (1/σi²)E(ui²) = σi²/σi² = 1
which is a constant; that is, the variance of the transformed disturbance term u*i is now homoscedastic. Given the other assumptions of the classical model, this finding suggests that if we apply OLS to the transformed model, the resulting estimators will be BLUE. In short, it is the estimated β*1 and β*2 that are now BLUE, not the OLS estimators ˆβ1 and ˆβ2.
The procedure of transforming the original variables in such a way that the transformed
variables satisfy the assumptions of the classical model and then applying OLS to them is known
as the method of Generalized Least Squares (GLS). In short, GLS is OLS on the transformed
variables that satisfy the standard least-squares assumptions. The estimators thus obtained are
known as GLS estimators, and it is these estimators that are BLUE.
Mechanics of estimating β*1 and β*2
First, write down the SRF of the transformed model:
Yi/σi = ˆβ*1(1/σi) + ˆβ*2(Xi/σi) + (ûi/σi)
i.e., Y*i = ˆβ*1X*0i + ˆβ*2X*i + û*i
To obtain the GLS estimators, we minimize
∑(û*i)² = ∑wi(Yi − ˆβ*1 − ˆβ*2Xi)², where wi = 1/σi²
Thus, in OLS we minimize an unweighted or (what amounts to the same thing) equally weighted RSS, whereas in GLS we minimize a weighted sum of squared residuals, with wi = 1/σi² acting as the weights.
To see the difference between OLS and GLS clearly, consider the hypothetical scattergram in Fig.7.
Fig.7: Hypothetical scattergram
In the (unweighted) OLS, each ûi2 associated with points A, B, and C will receive the same
weight in minimizing the RSS. Obviously, in this case the ûi2 associated with point C will
dominate the RSS. But in GLS the extreme observation C will get relatively smaller weight than
the other two observations. As noted earlier, this is the right strategy, for in estimating the
population regression function (PRF) more reliably we would like to give more weight to
observations that are closely clustered around their (population) mean than to those that are widely
scattered about.
Since this procedure minimizes a weighted RSS, it is appropriately known as weighted least squares (WLS), and the estimators thus obtained are known as WLS estimators. But WLS is just a special case of the more general estimating technique, GLS. In the context of heteroscedasticity, one can treat the two terms WLS and GLS interchangeably.
Note that if wi = w, a constant for all i, ˆβ*2 is identical with ˆβ2 and var (ˆβ*2) is identical
with the usual (i.e., homoscedastic) var (ˆβ2), which should not be surprising.
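The transformation described above can be sketched as follows; the data and the known σi are hypothetical, with the error standard deviation growing with X.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 2000
x = rng.uniform(1, 10, n)
sigma = 0.5 * np.sqrt(x)               # known error s.d., growing with X (hypothetical)
y = 3 + 2 * x + rng.normal(0, sigma)   # true model: Y = 3 + 2X + u, var(u_i) = sigma_i^2

# transformed variables: divide Y, the intercept column, and X by sigma_i
Xstar = np.column_stack([1 / sigma, x / sigma])
ystar = y / sigma
b_star, *_ = np.linalg.lstsq(Xstar, ystar, rcond=None)   # GLS/WLS estimates

b_ols, *_ = np.linalg.lstsq(np.column_stack([np.ones(n), x]), y, rcond=None)
print(b_star)   # both estimates close to the true (3, 2)
print(b_ols)    # OLS is also unbiased here, but less efficient
```

Running OLS on the starred variables is exactly WLS with weights wi = 1/σi².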
Detection of Heteroscedasticity
Informal Methods
Nature of the Problem
Prais and Houthakker, in their study of family budgets, found that the residual variance around the regression of consumption on income increased with income; one now generally assumes that in similar surveys one can expect unequal variances among the disturbances. As a matter of fact, in cross-sectional data involving heterogeneous units, heteroscedasticity may be the rule rather than the exception.
Graphical Method
Do the regression analysis on the assumption that there is no heteroscedasticity and then
do a postmortem examination of the squared residuals ûi2 to see if they exhibit any systematic
pattern. Although the ûi2 are not the same thing as the ui2, they can be used as proxies, especially if the
sample size is sufficiently large. An examination of the ûi2 may reveal patterns such as those
shown in Fig.8.
Fig.8: Hypothetical patterns of estimated squared residuals.
In the above figure ûi2 are plotted against Ŷi, the estimated Yi from the regression line, the
idea being to find out whether the estimated mean value of Y is systematically related to the
squared residual.
In Figure (a) one can see that there is no systematic pattern between the two
variables, suggesting that perhaps no heteroscedasticity is present in the data. Figures (b) through (e),
however, exhibit definite patterns. For instance, Figure (c) suggests a linear relationship, whereas
Figures (d) and (e) indicate a quadratic relationship between ûi2 and Ŷi. Using such knowledge,
albeit informal, one may transform the data in such a manner that the transformed data do not
exhibit heteroscedasticity.
Instead of plotting ûi2 against Ŷi, one may plot them against one of the explanatory
variables, especially if plotting ûi2 against Ŷi results in a pattern similar to Figure (a). In the
case of the two-variable model, plotting ûi2 against Ŷi is equivalent to plotting it against Xi,
since Ŷi is a linear function of Xi.
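The postmortem examination can be done numerically as well as graphically. The following is a minimal sketch on simulated data (the data-generating process is an assumption for illustration): after a first-pass OLS fit, a clearly positive correlation between ûi2 and Ŷi is the numeric counterpart of patterns (b) through (e) above.

```python
import numpy as np

# Simulated two-variable model whose error spread grows with X.
rng = np.random.default_rng(1)
X = np.linspace(1.0, 10.0, 200)
y = 1.0 + 2.0 * X + rng.normal(scale=0.5 * X)

# First pass: OLS disregarding heteroscedasticity.
A = np.column_stack([np.ones_like(X), X])
b, *_ = np.linalg.lstsq(A, y, rcond=None)
y_hat = A @ b
u2 = (y - y_hat) ** 2

# Postmortem: is the squared residual systematically related to Y_hat?
r = np.corrcoef(y_hat, u2)[0, 1]
print(f"corr(Y_hat, u_hat^2) = {r:.2f}")  # positive for these data
```

In the two-variable model, correlating u2 with X instead of y_hat gives the same verdict, since Ŷi is a linear function of Xi.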
Formal Methods
Park Test
Park formalizes the graphical method by suggesting that σ2i is some function of the explanatory
variable Xi. The functional form he suggested was
σ2i = σ2 Xiβ evi
or, taking logs,
ln σ2i = ln σ2 + β ln Xi + vi
Since σ2i is generally not known, Park suggests using ûi2 as a proxy and running the regression
ln ûi2 = ln σ2 + β ln Xi + vi = α + β ln Xi + vi
First stage: run the OLS regression disregarding the heteroscedasticity question and
obtain the residuals ûi.
Second stage: run the regression of ln ûi2 on ln Xi. If β turns out to be statistically
significant, heteroscedasticity is likely present in the data.
Although the test is empirically appealing, Goldfeld and Quandt have argued that the error term vi may
not satisfy the OLS assumptions and may itself be heteroscedastic.
Illustration of Park approach
From data given in Table 1 run the following regression:
Yi = β1 + β2Xi + ui
where Y = average compensation in thousands of dollars, X = average productivity in thousands of
dollars, and i = i th employment size of the establishment. The results of the regression were as
follows:
The results reveal that the estimated slope coefficient is significant at the 5 percent level on the
basis of a one-tail t test. The equation shows that as labor productivity increases by, say, a dollar,
labor compensation on the average increases by about 23 cents.
The residuals obtained from this regression were then regressed on Xi as suggested, giving the following
results:
Inference: There is no statistically significant relationship between the two variables. Following
the Park test, one may conclude that there is no heteroscedasticity in the error variance.
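The two-stage Park procedure can be sketched in Python. The data below are simulated for illustration (they are not the Table 1 figures), and here the error spread is built to grow with Xi, so the second-stage slope comes out significant, the opposite verdict to the compensation example above:

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.linspace(1.0, 10.0, 100)
y = 3.0 + 0.8 * X + rng.normal(scale=0.4 * X)  # sigma_i grows with X_i

def ols(y, A):
    """Return OLS coefficients and residuals for y = A @ b + u."""
    b, *_ = np.linalg.lstsq(A, y, rcond=None)
    return b, y - A @ b

# First stage: OLS disregarding heteroscedasticity; keep u_hat.
A1 = np.column_stack([np.ones_like(X), X])
_, u_hat = ols(y, A1)

# Second stage: regress ln(u_hat^2) on ln(X); a significant slope
# signals heteroscedasticity of the form sigma_i^2 = sigma^2 * X_i^beta.
A2 = np.column_stack([np.ones_like(X), np.log(X)])
b2, e2 = ols(np.log(u_hat**2), A2)

# t statistic for the slope beta
n, k = len(X), 2
s2 = (e2 @ e2) / (n - k)
cov = s2 * np.linalg.inv(A2.T @ A2)
t_slope = b2[1] / np.sqrt(cov[1, 1])
print(f"slope = {b2[1]:.2f}, t = {t_slope:.2f}")
```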
Glejser Test
The Glejser test is similar in spirit to the Park test. After obtaining the residuals ˆui from
the OLS regression, Glejser suggests regressing the absolute values of ˆui on the X variable that is
thought to be closely associated with σ2i. Glejser used the following functional forms:
|ûi| = β1 + β2Xi + vi
|ûi| = β1 + β2√Xi + vi
|ûi| = β1 + β2(1/Xi) + vi
|ûi| = β1 + β2(1/√Xi) + vi
|ûi| = √(β1 + β2Xi) + vi
|ûi| = √(β1 + β2Xi2) + vi
where vi is the error term.
Continuing with the previous example, the absolute values of the residuals obtained from the
regression were regressed on average productivity (X), giving the following results:
Inference: Regression results revealed that there is no relationship between the absolute value of
the residuals and the regressor, average productivity. This reinforces the conclusion based on the
Park test.
Glejser has found that for large samples the first four of the preceding models give
generally satisfactory results in detecting heteroscedasticity. As a practical matter, therefore, the
Glejser technique may be used for large samples; in small samples it may be used strictly as a
qualitative device to learn something about heteroscedasticity.
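The Glejser test is straightforward to sketch with the first (linear) functional form. As with the Park sketch, the data here are simulated with an error spread that grows with Xi, so the test detects heteroscedasticity; the compensation example above, by contrast, did not:

```python
import numpy as np

rng = np.random.default_rng(3)
X = np.linspace(1.0, 10.0, 100)
y = 3.0 + 0.8 * X + rng.normal(scale=0.4 * X)  # sigma_i grows with X_i

# Obtain the OLS residuals u_hat, then take absolute values.
A = np.column_stack([np.ones_like(X), X])
b, *_ = np.linalg.lstsq(A, y, rcond=None)
abs_u = np.abs(y - A @ b)

# Glejser's linear form: |u_hat_i| = b1 + b2 * X_i + v_i
b_g, *_ = np.linalg.lstsq(A, abs_u, rcond=None)
e_g = abs_u - A @ b_g
s2 = (e_g @ e_g) / (len(X) - 2)
cov = s2 * np.linalg.inv(A.T @ A)
t_slope = b_g[1] / np.sqrt(cov[1, 1])
print(f"slope = {b_g[1]:.3f}, t = {t_slope:.2f}")
```

A significant positive slope says that the typical size of the residual rises with X, which is exactly the heteroscedastic pattern the test is designed to catch.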