
Extension of the Two-variable Linear Regression Model

Regression through the Origin


There are occasions when the two-variable PRF assumes the following form

Yi = β2Xi + ui …….. (1)

In this model the intercept term is absent or zero, hence the name regression through the origin.
As an illustration, consider the Capital Asset Pricing Model (CAPM) of modern portfolio theory, which,
in its risk-premium form, may be expressed as
(ERi − rf) = βi (ERm − rf) …….. (2)

Where,
ERi = Expected rate of return on security i
ERm = Expected rate of return on the market portfolio as represented by, say, the S&P 500
composite stock index
rf = Risk-free rate of return, say, the return on 90-day Treasury bills
βi = The Beta coefficient, a measure of systematic risk, i.e., risk that cannot be eliminated through
diversification. Also, a measure of the extent to which the ith security’s rate of return moves
with the market.
Value of βi > 1 implies a volatile or aggressive security, and βi < 1 is a defensive security.
(Note: Do not confuse this βi with the slope coefficient of the two variable regression, β2.)
If capital markets work efficiently, then CAPM postulates that security i’s expected risk premium
(= ERi − rf) is equal to that security’s β coefficient times the expected market risk premium (= ERm − rf).
If the CAPM holds, we have the situation depicted in Fig.1. The line shown in the figure is known as the
security market line (SML). For empirical purposes, (2) is often expressed as

(Ri − rf) = αi + βi (Rm − rf) + ui …….. (3)

If CAPM holds, αi is expected to be zero.

Fig.1: Systematic risk. Fig.2: The Market Model of Portfolio Theory (assuming αi = 0).
In equation (3) the dependent variable, Y, is (Ri − rf) and the explanatory variable, X, is βi, the
volatility coefficient, and not (Rm − rf). Therefore, to run regression-3, one must first estimate βi, which
is usually derived from the characteristic line.
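
As a rough illustration, here is a minimal Python sketch (using numpy and entirely hypothetical return series, not data from the text) that estimates βi as the OLS slope of the characteristic line, i.e., a regression of the security's excess returns on the market's excess returns:

import numpy as np

# hypothetical per-period returns; the risk-free rate is an assumed constant
r_f = 0.01
R_i = np.array([0.04, -0.02, 0.07, 0.03, 0.05, -0.01])   # security returns
R_m = np.array([0.03, -0.01, 0.05, 0.02, 0.04, -0.02])   # market returns

y = R_i - r_f          # security risk premium
x = R_m - r_f          # market risk premium

# OLS slope of the characteristic line = estimate of beta_i
beta_i = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
print(round(beta_i, 3))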
As this example shows, sometimes the underlying theory dictates that the intercept term be absent
from the model. Other instances where the zero intercept model may be appropriate are Milton
Friedman’s permanent income hypothesis, which states that permanent consumption is proportional to
permanent income; cost analysis theory, where it is postulated that the variable cost of production is
proportional to output; and some versions of monetarist theory that state that the rate of change of prices
(i.e., the rate of inflation) is proportional to the rate of change of the money supply.
Estimation of SRF of regression through origin

Applying the OLS method to equation (1), we obtain the following formulas for ˆβ2 and its variance, contrasted with the conventional (intercept-present) model:

Particulars of differences          Without intercept                 With intercept
ˆβ2                                 ΣXiYi / ΣXi2                      Σxiyi / Σxi2
Variance of ˆβ2                     σ2 / ΣXi2                         σ2 / Σxi2
σ2 estimation (ˆσ2)                 Σˆui2 / (n − 1)                   Σˆui2 / (n − 2)
Sums used in the formulas           raw sums of squares and           mean-adjusted sums of squares
                                    cross products                    and cross products
df for computing ˆσ2                (n − 1)                           (n − 2)

(Here Xi and Yi denote the raw observations, while xi and yi denote deviations from the respective sample means.)
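
The following numpy sketch (hypothetical data, not taken from the text) contrasts the two sets of formulas, using raw sums of squares and cross products for the zero-intercept model and mean-adjusted sums for the conventional model; it also computes the raw r2 discussed below:

import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(Y)

# regression through the origin: raw sums, df = n - 1
b2_origin = np.sum(X * Y) / np.sum(X ** 2)
sigma2_origin = np.sum((Y - b2_origin * X) ** 2) / (n - 1)

# conventional model: deviation (mean-adjusted) sums, df = n - 2
x, y = X - X.mean(), Y - Y.mean()
b2_conv = np.sum(x * y) / np.sum(x ** 2)
b1_conv = Y.mean() - b2_conv * X.mean()
sigma2_conv = np.sum((Y - b1_conv - b2_conv * X) ** 2) / (n - 2)

# raw r-squared for the zero-intercept model (see below)
raw_r2 = np.sum(X * Y) ** 2 / (np.sum(X ** 2) * np.sum(Y ** 2))

print(b2_origin, sigma2_origin, b2_conv, b1_conv, sigma2_conv, raw_r2)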

Features of regression through the origin model


1. ∑ˆui, which is always zero for the model with the intercept term (the conventional model), need
not be zero for the regression through the origin.
2. r2, the coefficient of determination, which is always non-negative for the conventional model, can
on occasions turn out to be negative for the intercept-less model.
r 2 for Regression-through-Origin Model
The conventional r2 is not appropriate for regressions that do not contain the intercept. But one
can compute what is known as the raw r2 for such models, which is defined as
raw r2 = (ΣXiYi)2 / (ΣXi2 ΣYi2)
Although this raw r2 satisfies the relation 0 < r2 < 1, it is not directly comparable to the
conventional r2 value. For this reason, some authors do not report the r2 value for zero intercept regression
models. Because of these special features of this model, one needs to exercise great caution in using the
zero intercept regression model. Unless there is very strong a priori expectation, one would be well
advised to stick to the conventional, intercept-present model.
This has a dual advantage.
 First, if the intercept term is included in the model but it turns out to be statistically insignificant (i.e.,
statistically equal to zero), for all practical purposes we have a regression through the origin.
 Second, and more important, if in fact there is an intercept in the model but we insist on fitting a
regression through the origin, we would be committing a specification error, thus violating
Assumption 9 of the classical linear regression model.
Regression on Standardized Variables
The units in which the regressand and regressor(s) are expressed affect the interpretation of the
regression coefficients. This can be avoided if we are willing to express the regressand and regressor(s) as
standardized variables. A variable is said to be standardized if we subtract the mean value of the variable
from its individual values and divide the difference by the standard deviation of that variable. Thus, in the
regression of Y on X, if we redefine these variables as
Y*i = (Yi − Y̅)/SY and X*i = (Xi − X̅)/SX ……..(1)
where Y̅ = sample mean of Y, SY = sample standard deviation of Y, X̅ = sample mean of X, and SX is the
sample standard deviation of X; the variables Y*i and X*i are called standardized variables.
An interesting property of a standardized variable is that its mean value is always zero and its standard
deviation is always 1.
As a result, it does not matter in what unit the regressand and regressor(s) are measured. Therefore,
instead of running the standard (bivariate) regression:

Yi = β1 + β2Xi + ui ……..(2)
we could run regression on the standardized variables as

Y*i = β*1 + β*2X*i + u*i ……..(3)
since it is easy to show that, in the regression involving standardized regressand and regressor(s), the
intercept term is always zero. The regression coefficients of the standardized variables, denoted by β*1
and β*2, are known in the literature as the beta coefficients. Incidentally, notice that (3) is a regression
through the origin. How do we interpret the beta coefficients? The interpretation is that if the
(standardized) regressor increases by one standard deviation, on average, the (standardized) regressand
increases by β*2 standard deviation units. Thus, unlike the traditional model (2), we measure the effect
not in terms of the original units in which Y and X are expressed, but in standard deviation units.
To show the difference between (2) and (3), consider the following results,

where GPDI and GDP are measured in billions of dollars.


Interpretation: If GDP goes up by a dollar, on average GPDI goes up by about 30 cents.
The results corresponding to (3) are as follows, where the starred variables are standardized variables:

Interpretation: if the (standardized) GDP increases by one standard deviation, on average, the
(standardized) GPDI increases by about 0.94 standard deviations.
Advantage of the standardized regression model
If there is more than one regressor, by standardizing all regressors we put them on an equal basis
and therefore can compare them directly. If the coefficient of a standardized regressor is larger than that
of another standardized regressor appearing in that model, then the former contributes relatively more to the
explanation of the regressand than the latter. Thus, we can use the beta coefficients as a measure of
relative strength of the various regressors.
Points to note under Regression through Origin
 The r2 value is not given because this is a regression through the origin, for which the usual r2 is not
applicable.
 There is an interesting relationship between the β coefficients of the conventional model and the
beta coefficients. For the bivariate case,
ˆβ*2 = ˆβ2 (Sx / Sy)
Where, Sx = the sample standard deviation of the X regressor and Sy = the sample standard deviation of
the regressand. Therefore, one can crisscross between the β and beta coefficients if we know the (sample)
standard deviation of the regressor and regressand.
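
A short numpy sketch (hypothetical data) that verifies this relation between the conventional slope and the beta coefficient:

import numpy as np

X = np.array([3.0, 5.0, 7.0, 9.0, 11.0, 13.0])
Y = np.array([10.0, 14.0, 15.0, 20.0, 24.0, 25.0])

x, y = X - X.mean(), Y - Y.mean()
b2 = np.sum(x * y) / np.sum(x ** 2)          # conventional slope

Xs = x / X.std(ddof=1)                       # standardized regressor
Ys = y / Y.std(ddof=1)                       # standardized regressand
b2_star = np.sum(Xs * Ys) / np.sum(Xs ** 2)  # beta coefficient (regression through origin)

print(b2_star, b2 * X.std(ddof=1) / Y.std(ddof=1))   # the two values coincide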
Functional Forms of Regression Models
Commonly used regression models that may be nonlinear in the variables but are linear in the parameters
or that can be made so by suitable transformations of the variables.
1. The log-linear model
Consider the following model, known as the exponential regression model

Yi = β1 Xi^β2 e^ui ……………… (1)
which may be expressed alternatively as follows using properties of the logarithms like
(1) ln (AB) = ln A + ln B,
(2) ln (A/B) = ln A − ln B, and
(3) ln (Ak) = kln A,
assuming that A and B are positive, and where k is some constant.

ln Yi = ln β1 + β2 ln Xi + ui ……………… (2)

Where,
ln = natural log (i.e., log to the base e, and where e = 2.718).
In practice one may use common logarithms, that is, log to the base 10. The relationship between
the natural log and common log is: lne X = 2.3026 log10 X.
By convention, ln means natural logarithm, and log means logarithm to the base 10; hence there
is no need to write the subscripts e and 10 explicitly
If we write (2) as follows

ln Yi = α + β2 ln Xi + ui ……………… (3)
where α = ln β1, this model is linear in the parameters α and β2, linear in the logarithms of the variables Y
and X, and can be estimated by OLS regression. Because of this linearity, such models are called log-log,
double-log, or loglinear models.
If the assumptions of the classical linear regression model are fulfilled, the parameters of above
equation (3) can be estimated by the OLS method by letting
Y*i = α + β2 X*i + ui ……………… (4)
where Y*i= ln Yi and X*i= ln Xi. The OLS estimators ˆα and ˆβ2 obtained will be best linear unbiased
estimators of α and β2, respectively.
One attractive feature of the log-log model, which has made it popular in applied work, is that the
slope coefficient β2 measures the elasticity of Y with respect to X, that is, the percentage change in Y for a
given (small) percentage change in X. Thus, if Y represents the quantity of a commodity demanded and X
its unit price, β2 measures the price elasticity of demand, a parameter of considerable economic interest. If
the relationship between quantity demanded and price is as shown in Fig.3a, the double-log
transformation as shown in Fig.3b will then give the estimate of the price elasticity (−β2).
Two special features of the log-linear model:
1. The model assumes that the elasticity coefficient between Y and X, β2, remains constant throughout,
hence the alternative name constant elasticity model. In other words, as Fig.3b shows, the change in
ln Y per unit change in ln X (i.e., the elasticity, β2) remains the same no matter at which ln X we
measure the elasticity.
2. Another feature of the model is that although ˆα and ˆ β2 are unbiased estimates of α and β2, β1 (the
parameter entering the original model) when estimated as ˆβ1= antilog (ˆα) is itself a biased estimator.
In most practical problems, however, the intercept term is of secondary importance, and one need not
worry about obtaining its unbiased estimate.
In the two-variable model, the simplest way to decide whether the loglinear model fits the data is
to plot the scattergram of lnYi against ln Xi and see if the scatter points lie approximately on a straight
line, as in Fig.3b.
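
A minimal Python sketch (hypothetical quantity–price data) of fitting the double-log model by OLS on the logged variables; the slope is the estimated elasticity, and the antilog of the intercept is only a biased estimate of β1:

import numpy as np

price = np.array([2.0, 3.0, 4.0, 5.0, 6.0, 8.0])        # hypothetical unit prices
qty   = np.array([50.0, 36.0, 28.0, 24.0, 20.0, 16.0])  # hypothetical quantities demanded

lnX, lnY = np.log(price), np.log(qty)
b2, a = np.polyfit(lnX, lnY, 1)   # slope (elasticity) and intercept of the log-log fit
b1 = np.exp(a)                    # antilog of the intercept: a biased estimate of beta1

print(round(b2, 3))               # negative for a demand relationship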

Fig.3: Constant Elasticity model


Technical note:
d(ln X)/dX = 1/X or
d(ln X) = dX/X,
that is, for infinitesimally small changes (note the differential operator d) the change in ln X is equal to the
relative or proportional change in X. In practice, though, if the change in X is small, this relationship can
be written as: change in ln X ≅ relative change in X, where, ≅ means approximately.
Thus, for small changes,
(ln Xt − ln Xt−1) ≅ (Xt − Xt−1)/Xt−1 = relative change in X
Frequently occurring terms:
(1) Absolute change, = (Xt − Xt−1)
(2) Relative or proportional change, (Xt − Xt−1)/Xt−1 = (Xt/Xt−1 − 1) and
(3) Percentage change, or percent growth rate [(Xt − Xt−1)/Xt−1]100
Where, Xt and Xt−1 are, respectively, the current and previous values of the variable X.
Semilog Models: Log–Lin and Lin–Log Models
Measurement of Growth Rate: The Log–Lin Model
Suppose we want to find out the rate of growth of personal consumption expenditure on services for the data
given in Table 6.3. Let Yt denote real expenditure on services at time t and Y0 the initial value of the
expenditure on services (i.e., the value at the end of 1992-IV). Recall the well-known
compound interest formula
Yt = Y0 (1 + r)^t ……………… (5)
where r is the compound (i.e., over time) rate of growth of Y. Taking the natural logarithm of (5), we can
write
ln Yt = ln Y0 + t ln (1 + r)
Now let
β1 = ln Y0
β2 = ln (1 + r)
so that
ln Yt = β1 + β2 t ……………… (6)
Adding the disturbance term to this equation, we obtain
ln Yt = β1 + β2 t + ut ……………… (7)
This model is like any other linear regression model in that it is linear in the parameters β1 and β2.
The only difference is that the regressand is the logarithm of Y and the regressor is “time,” which will take
values of 1, 2, 3, etc.
Models like (7) are called semilog models because only one variable (in this case the regressand)
appears in the logarithmic form.
 A model in which the regressand is logarithmic will be called a log-lin model.
 A model in which the regressand is linear but the regressor(s) are logarithmic is a lin-log model.
Properties of model (6)
In this model the slope coefficient measures the constant proportional or relative change in Y for
a given absolute change in the value of the regressor (in this case the variable t), that is

β2 = (relative change in regressand) / (absolute change in regressor) ……………… (8)
because
β2 = d(ln Y)/dX = (1/Y)(dY/dX) = (dY/Y)/dX, which is nothing but (8). For small
changes in Y and X this relation may be approximated by
β2 ≅ (ΔY/Y) / ΔX
where X is t.

If we multiply the relative change in Y by 100, (8) will then give the percentage change, or the
growth rate, in Y for an absolute change in X, the regressor. That is, 100 times β2 gives the growth rate in
Y; 100 times β2 is known in the literature as the semielasticity of Y with respect to X.
Instantaneous versus Compound Rate of Growth.
The coefficient of the trend variable in the growth model (7), β2, gives the instantaneous (at a
point in time) rate of growth and not the compound (over a period of time) rate of growth. But the latter
can easily be found by taking the antilog of the estimated β2, subtracting 1 from it, and multiplying the
difference by 100; that is, compound growth rate = [antilog(ˆβ2) − 1] × 100.
For the illustrative example, the estimated slope coefficient is 0.00743. Therefore, [antilog
(0.00743) − 1] = 0.00746 or 0.746 percent. Thus, in the illustrative example, the compound rate of growth
on expenditure on services was about 0.746 percent per quarter, which is slightly higher than the
instantaneous growth rate of 0.743 percent. This is of course due to the compounding effect.
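
A small Python check of this conversion, using the slope value 0.00743 quoted above:

import math

b2 = 0.00743
instantaneous = 100 * b2                   # about 0.743 percent per quarter
compound = 100 * (math.exp(b2) - 1)        # [antilog(b2) - 1] x 100, about 0.746 percent
print(instantaneous, compound)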
Linear Trend Model.
Instead of estimating model (7), one sometimes estimates the following model
Yt = β1 + β2 t + ut ……………… (9)
That is, instead of regressing the log of Y on time, regress Y on time, where Y is the regressand
under consideration. Such a model is called a linear trend model and the time variable t is known as the
trend variable. If the slope coefficient is positive, there is an upward trend in Y, whereas if it is negative,
there is a downward trend in Y.
For the expenditure on services data, the results of fitting the linear trend model are as follows

In contrast to the growth model (7), the interpretation of Eq. (9) is as follows:


Over the quarterly period 1993-I to 1998-III, on average, expenditure on services increased at the
absolute (note: not relative) rate of about 20 billion dollars per quarter. That is, there was an upward trend
in the expenditure on services.
The choice between the growth rate model (7) and the linear trend model (9) will depend
upon whether one is interested in the relative or absolute change in the expenditure on services, although
for comparative purposes it is the relative change that is generally more relevant.
The Lin–Log Model
Suppose we want to find the absolute change in Y for a percent change in X. The following model can
accomplish this purpose:
Yi = β1 + β2 ln Xi + ui ………………….(10)
For descriptive purposes such a model is called a lin–log model.
The interpretation of the slope coefficient β2 is as usual:
β2 = (change in Y) / (change in ln X) = (change in Y) / (relative change in X)
The first step uses differential calculus, β2 = dY/d(ln X); the second step follows from the fact that a
change in the log of a number is a relative change. For small changes, therefore,
ΔY = β2 (ΔX/X)
This equation states that the absolute change in Y (=ΔY) is equal to slope times the relative change
in X. If the latter is multiplied by 100, then above equation gives the absolute change in Y for a percentage
change in X. Thus, if (ΔX/X) changes by 0.01 unit (or 1 percent), the absolute change in Y is 0.01(β2); If
in an application one finds that β2 = 500, the absolute change in Y is (0.01)(500) = 5.0.
Therefore, when regression (10) is estimated by OLS, do not forget to multiply the value of the
estimated slope coefficient by 0.01, or divide it by 100. If not, interpretation in an application will be
highly misleading.
The lin–log model has an interesting application in the so-called Engel expenditure models, named
after the German statistician Ernst Engel (1821–1896). Engel postulated that “the total expenditure that is
devoted to food tends to increase in arithmetic progression as total expenditure increases in geometric
progression.”
Reciprocal Models
Consider the following model:
Yi = β1 + β2 (1/Xi) + ui
Although this model is nonlinear in the variable X because it enters inversely or reciprocally, the model is
linear in β1 and β2 and is therefore a linear regression model.
If we let X*i= (1/Xi), then above equation is linear in the parameters as well as the variables Yi and X*i.
Features of Reciprocal model
As X increases indefinitely, the term β2(1/X) approaches zero (note: β2 is a constant) and Y
approaches the limiting or asymptotic value β1. Therefore, these models have built in them an asymptote
or limit value that the dependent variable will take when the value of the X variable increases indefinitely.
Some likely shapes of the curve corresponding to this equation are shown in following Fig. 4.
As an illustration of Fig.4a, consider the data given in the following Table-1. These are cross-sectional
data for 64 countries on child mortality and a few other variables. For now, concentrate on the variables,
child mortality (CM) and per capita GNP, which are plotted in Fig.5.
This figure resembles Fig. 4.a: As per capita GNP increases, one would expect child mortality to
decrease because people can afford to spend more on health care, assuming all other factors remain
constant. But the relationship is not a straight line one: As per capita GNP increases, initially there is
dramatic drop in CM but the drop tapers off as per capita GNP continues to increase.
The slope of reciprocal model above is: dY/dX = −β2(1/X2), implying that if β2 is positive, the
slope is negative throughout, and if β2 is negative, the slope is positive throughout. [Fig.4a and 4c,
respectively]
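
A minimal Python sketch (hypothetical data, not the 64-country data set) of fitting the reciprocal model by OLS on the transformed regressor 1/X; the intercept estimates the asymptote β1:

import numpy as np

X = np.array([100.0, 300.0, 600.0, 1000.0, 2000.0, 5000.0])   # e.g., per capita GNP
Y = np.array([350.0, 170.0, 130.0, 110.0, 95.0, 88.0])        # e.g., child mortality

b2, b1 = np.polyfit(1.0 / X, Y, 1)     # slope on 1/X, then intercept
print(round(b1, 1), round(b2, 1))
# slope of Y with respect to X is -b2 / X**2: a positive b2 implies Y falls as X rises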
Table-1: Fertility and other Data for 64 Countries
Fig. 4: The reciprocal model: Y = β1 + β2 [1/X]

Fig. 5: Relationship between child mortality and per capita GNP in 64 countries
By fitting the reciprocal model, the regression results are

As per capita GNP increases indefinitely, child mortality approaches its asymptotic value of about
82 deaths per thousand. As said earlier, the positive value of the coefficient of (1/PGNPi) implies that the
rate of change of CM with respect to PGNP is negative.
One of the important applications of Fig.4 b is the celebrated Phillips curve of macroeconomics.
Using the data on percent rate of change of money wages (Y) and the unemployment rate (X) for the United
Kingdom for the period 1861–1957, Phillips obtained a curve whose general shape resembles Fig. 4 b.
As Fig. 6 shows, there is an asymmetry in the response of wage changes to the level of the
unemployment rate: wages rise faster for a unit decrease in unemployment when the unemployment rate is
below Un, which economists call the natural rate of unemployment [defined as the rate of
unemployment required to keep (wage) inflation constant], than they fall for an equivalent increase
when the unemployment rate is above the natural rate, with β1 indicating the asymptotic floor for wage
change.
change. This particular feature of the Phillips curve may be due to institutional factors, such as union
bargaining power, minimum wages, unemployment compensation, etc.

Fig. 6: Phillips Curve

Log Hyperbola or Logarithmic Reciprocal Model
The logarithmic reciprocal model takes the form
ln Yi = β1 − β2 (1/Xi) + ui
Its graphical presentation is shown in Fig. 7 below.

Fig. 7: Log Reciprocal Model


From the figures it is clear that initially Y increases at an increasing rate (i.e., the curve is initially
convex) and then it increases at a decreasing rate (i.e., the curve becomes concave). Such models
may be appropriate for modeling a short-run production function. In microeconomics, if labor and capital
are the inputs in a production function and we keep the capital input constant but increase the labor
input, the short-run output–labor relationship will resemble Fig. 7.

From the model, d(ln Y)/dX = β2 (1/X2). But d(ln Y)/dX = (1/Y)(dY/dX).
Then, on substituting, we get
dY/dX = β2 (Y/X2)
which indicates the slope of Y with respect to X.

Choice of Functional Form
The choice of a particular functional form may be comparatively easy in the two-variable case, wherein
we can plot the variables and get some rough idea about the appropriate model. The choice becomes
difficult in the case of a multiple regression model involving more than one regressor. A great deal of skill
and experience are required in choosing an appropriate model for empirical estimation. The guidelines
that can be of some use are:
1. The underlying theory (e.g., the Phillips curve) may suggest a particular functional form.
2. It is good practice to find out the rate of change (i.e., the slope) of the regressand with respect to the
regressor as well as to find out the elasticity of the regressand with respect to the regressor. For the
various formulas for the slope and elasticity coefficients of the various models, see the following
table. Knowledge of these formulas will help us to compare the various models.

3. The coefficients of the model chosen should satisfy certain a priori expectations. For example, if we
are considering the demand for automobiles as a function of price and some other variables, we
should expect a negative coefficient for the price variable.
4. Sometimes more than one model may fit a given set of data reasonably well, with coefficients in line
with prior expectations and all statistically significant, but with different r2 values. In comparing two
r2 values, however, make sure that the dependent variable, or the regressand, of the two models is the
same; the regressor(s) can take any form.
5. In general, one should not overemphasize the r2 measure in the sense that the higher the r2 the better
the model, since r2 increases as we add more regressors to the model. What is of greater importance is
the theoretical underpinning of the chosen model, the signs of the estimated coefficients and their
statistical significance. If a model is good on these criteria, a model with a lower r2 may be quite
acceptable.
The Gauss–Markov Theorem

Given the assumptions of the classical linear regression model, the least-squares estimates
possess some ideal or optimum properties. These properties are contained in the well-known
Gauss–Markov theorem. To understand this theorem, consider the best linear unbiasedness
property of an estimator. An estimator, say the OLS estimator ˆβ2, is said to be a Best Linear
Unbiased Estimator (BLUE) of β2 if the following hold:
1. It is linear, that is, a linear function of a random variable, such as the dependent variable Y in
the regression model.
2. It is unbiased, i.e., its average /expected value, E(ˆβ2), is equal to the true value, β2.
3. It has minimum variance in the class of all such linear unbiased estimators; an unbiased
estimator with the least variance is known as an efficient estimator.
In the regression context it can be proved that the OLS estimators are BLUE. This is the
gist of the famous Gauss–Markov theorem, which can be stated as follows:
Gauss–Markov Theorem: Given the assumptions of the classical linear regression model, the
least-squares estimators, in the class of unbiased linear estimators, have minimum variance, that is,
they are BLUE.

Sampling distributions of the OLS estimator ˆβ2 and an alternative estimator β*2

Figure (a) shows the sampling distribution of the OLS estimator ˆβ2, that is, the
distribution of the values taken by ˆβ2 in repeated sampling experiments. For convenience it is
assumed ˆβ2 to be distributed symmetrically. As the figure shows, the mean of the ˆβ2 values,
E(ˆβ2), is equal to the true β2. In this situation we say that ˆβ2 is an unbiased estimator of β2. Figure
(b) shows the sampling distribution of β*2, an alternative estimator of β2 obtained by using another
(i.e., other than OLS) method. For convenience, assume that β*2, like ˆβ2, is unbiased, that is, its
average or expected value is equal to β2.
Assume further that both ˆβ2 and β*2 are linear estimators, i.e., linear functions of Y. Which
estimator, ˆβ2 or β*2, should one choose? To answer this question, superimpose the two figures, as
in Figure (c). It is obvious that although both ˆβ2 and β*2 are unbiased, the distribution of β*2 is

more diffused or widespread around the mean value than the distribution of ˆβ2. In other words, the
variance of β*2 is larger than the variance of ˆβ2.
Now given two estimators that are both linear and unbiased, one would choose the
estimator with the smaller variance because it is more likely to be close to β2 than the alternative
estimator. In short, one would choose the BLUE estimator.
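
A small Monte Carlo sketch (simulated data, with an arbitrarily chosen alternative estimator) illustrating the theorem: the OLS slope and the slope through the first and last observations are both linear in Y and unbiased, but OLS has the smaller sampling variance:

import numpy as np

rng = np.random.default_rng(0)
beta1, beta2, sigma = 2.0, 0.5, 1.0
X = np.linspace(1.0, 10.0, 20)
x = X - X.mean()

ols, alt = [], []
for _ in range(5000):
    u = rng.normal(0.0, sigma, size=X.size)
    Y = beta1 + beta2 * X + u
    ols.append(np.sum(x * (Y - Y.mean())) / np.sum(x ** 2))   # OLS slope
    alt.append((Y[-1] - Y[0]) / (X[-1] - X[0]))               # another linear unbiased estimator

print(np.mean(ols), np.mean(alt))   # both close to the true value 0.5 (unbiased)
print(np.var(ols), np.var(alt))     # the OLS variance is the smaller of the two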
Ex: Linear regression function/ Linear Production function
The function Y= a+bX is a straight line with two constants ‘a’ and ‘b’. ‘a’ is the intercept
(Value of Y when X is zero i.e., If X = 0, then Y =a). ‘b’ is the regression coefficient of Y on X (When
X changes by one unit, Y will change by ‘b’ units). In this equation Y is the dependent variable and X is the
independent variable. Let us consider one example
SN       Y       X       Y2        X2        XY
1        8.0     5.5     64.00     30.25      44
2        8.5     6.0     72.25     36.00      51
3       10.0     8.5    100.00     72.25      85
Total   26.5    20.0    236.25    138.50     180
We know normal equations

ΣY = Na +bΣX ………………… (i)


ΣXY = aΣX + bΣX2 …………………(ii)
where N is the number of observations. On substituting the values in the above normal equations, we get
26.5 = 3a + 20b ...………………(iii)
180 = 20a + 138.5b ……...…………(iv)
Solving the simultaneous equations:
Multiplying equation (iii) by 20 and (iv) by 3 and subtracting (iv) from (iii), we get
530 = 60a + 400b
540 = 60a + 415.5b
−10 = −15.5b
b = 10/15.5 = 0.6452
Then, to obtain the value of ‘a’, put the value of ‘b’ in equation (iii)
26.5 = 3a + 20(0.6452)
3a = 26.5 − 12.904 = 13.596, so a = 4.532
Now the equation can be written as
Y = 4.532 + 0.6452 X
R2 = RgSS/TSS = 2.1507/2.1667 = 0.9926
[Hint: RgSS = b[ΣXY − {ΣX ΣY}/N] = 0.6452 [180 − 20×26.5/3] = 2.1507
and TSS = ΣY2 − (ΣY)2/N = 236.25 − (26.5)2/3 = 2.1667]
Alternatively, R2 = 1 − [Residual SS/TSS]
Source of variation      Degrees of freedom      SS        MSS       F
Regression [r − 1]       1                       2.1507    2.1507    2.1507/0.0160 = 134.42
Error [n − r]            1                       0.0160    0.0160
Total [n − 1]            2                       2.1667
The estimated regression coefficient is highly significant, indicating that there is an
increase of 0.6452 units in Y per unit increase in X, and the coefficient of determination (R2)
indicates that the straight line fitted to the data is a very good fit, as it explains 99.26% of the
variation in Y due to variation in X.
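
A short numpy check of the hand computation above (small rounding differences relative to the hand-rounded figures are expected):

import numpy as np

Y = np.array([8.0, 8.5, 10.0])
X = np.array([5.5, 6.0, 8.5])
n = len(Y)

b = (np.sum(X * Y) - np.sum(X) * np.sum(Y) / n) / (np.sum(X ** 2) - np.sum(X) ** 2 / n)
a = Y.mean() - b * X.mean()

TSS = np.sum(Y ** 2) - np.sum(Y) ** 2 / n
RgSS = b * (np.sum(X * Y) - np.sum(X) * np.sum(Y) / n)
RSS = TSS - RgSS
R2 = RgSS / TSS
F = (RgSS / 1) / (RSS / (n - 2))

print(a, b, R2, F)   # roughly 4.53, 0.645, 0.99, 133-134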

The important properties of the linear function are that it is represented by a straight line with a
constant slope at every point on the line, it exhibits constant productivity (since MP = AP), and the output
changes proportionately for proportional changes in inputs.

The coefficient of determination r 2: A measure of “goodness of fit”


From Fig. 1 it is clear that if all the observations were to lie on the regression line, we
would obtain a “perfect” fit, but this is a rare case. Generally, there will be some positive ˆui and
some negative ˆui. What we hope for is that these residuals around the regression line are as small
as possible. The coefficient of determination r2 (two-variable case) or R2 (multiple regression) is
a summary measure that tells how well the sample regression line fits the data.
Let us consider a heuristic explanation of r2 in terms of a graphical device, known as
the Venn diagram, or the Ballentine (Fig.1).

Ballentine view of r2: (a) r2 = 0; (f) r2 = 1.

Variation means the sum of squares of the deviations of a variable from its mean value, that is,
the total of the squared differences between the actual values of the variable and its mean.
The circle Y represents variation in the dependent variable Y and the circle X represents
variation in the explanatory variable X. The overlap of the two circles (the shaded area) indicates
the extent to which the variation in Y is explained by the variation in X (say, via an OLS
regression).
The greater the extent of the overlap, the greater the variation in Y that is explained by X. The r2
is simply a numerical measure of this overlap. In the figure, as we move from left to right, the
area of the overlap increases, that is, successively a greater proportion of the variation in Y is
explained by X. i.e., r2 increases. When there is no overlap, r2 is obviously zero, but when the
overlap is complete, r2 is 1, since 100 percent of the variation in Y is explained by X. The r2 lies
between 0 and 1.
Computation of r2
Recall that
Yi = ˆYi + ˆui
In deviation form,
yi = ˆyi + ˆui
Squaring on both sides and summing over the sample, we get
Σyi2 = Σˆyi2 + Σˆui2 + 2Σˆyi ˆui
     = Σˆyi2 + Σˆui2
     = ˆβ22 Σxi2 + Σˆui2
since Σˆyi ˆui = 0 and ˆyi = ˆβ2 xi. That is,
TSS = ESS + RSS

Total Sum of Squares (TSS) = Total variation of the actual Y values about their sample
mean. i.e.

Explained Sum of Squares (ESS) = Variation of the estimated Y values about their mean
(note that the mean of ˆYi equals Y̅), which appropriately may be called the sum of squares due to regression [i.e.,
due to the explanatory variable(s)], or explained by regression, or ESS.

Residual Sum of Squares (RSS) = residual or unexplained variation of the Y values
about the regression line, or simply RSS.

The total variation in the observed Y values about their mean value can be
partitioned into two parts, one attributable to the regression line and the other to random
forces, because not all actual Y observations lie on the fitted line. Geometrically,

Breakup of the variation of Yi into two components.

Dividing by TSS on both sides, we obtain
1 = ESS/TSS + RSS/TSS = (Σˆyi2 / Σyi2) + (Σˆui2 / Σyi2)
We now define r2 as
r2 = ESS/TSS = Σˆyi2 / Σyi2
or alternatively
r2 = 1 − RSS/TSS = 1 − (Σˆui2 / Σyi2)
The quantity r2 thus defined is known as the (sample) coefficient of determination and is
the most commonly used measure of the goodness of fit of a regression line. Verbally, r2
measures the proportion or percentage of the total variation in Y explained by the
regression model.
Two properties of r2 are
1. It is a non-negative quantity.
2. Its limits are 0 ≤ r2 ≤ 1. An r2 of 1 means a perfect fit, that is, ˆYi = Yi for each i. On
the other hand, an r2 of zero means that there is no relationship between the regressand
and the regressor whatsoever (i.e., ˆβ2 = 0). In this case, ˆYi = ˆβ1 = Y̅, that is, the best
prediction of any Y value is simply its mean value. In this situation therefore the
regression line will be horizontal to the X axis.
Although r2 can be computed directly from its definition, it can be obtained more
quickly from the following formula
r2 = ˆβ22 (Σxi2 / Σyi2)
If we divide the numerator and the denominator of this expression by the sample size n (or
n − 1 if the sample size is small), we obtain
r2 = ˆβ22 (Sx2 / Sy2)
where Sx2 and Sy2 are the sample variances of X and Y, respectively.

Since ˆβ2 = Σxiyi / Σxi2, the above equation can also be expressed as
r2 = (Σxiyi)2 / (Σxi2 Σyi2)
Given the definition of r2, we can express ESS and RSS discussed earlier as follows:
ESS = r2 · TSS and RSS = (1 − r2) · TSS
Therefore, we can write
TSS = ESS + RSS = r2 · TSS + (1 − r2) · TSS
A quantity closely related to but conceptually very much different from r2 is the
coefficient of correlation, which is a measure of the degree of association between two
variables. It can be computed either from
r = ±√r2
or from its definition
r = Σxiyi / √(Σxi2 Σyi2)
This is known as the sample correlation coefficient (ρ denotes the population correlation coefficient).

Single equation correlation coefficient

Correlation patterns (adapted from Henri Theil, Introduction to Econometrics, Prentice-Hall, Englewood Cliffs, N.J., 1978, p. 86).

Properties of r
1. It can be positive or negative, the sign depending on the sign of the term in the
numerator, which measures the sample covariation of two variables.
2. It lies between the limits of −1 and +1; that is, −1 ≤ r ≤ 1.
3. It is symmetrical in nature; that is, the coefficient of correlation between X and Y(rXY)
is the same as that between Y and X(rYX).

4. It is independent of the origin and scale; that is, if we define X*i = aXi + c and Y*i =
bYi + d, where a > 0, b > 0, and c and d are constants, then r between X* and Y* is the
same as that between the original variables X and Y.
5. If X and Y are statistically independent, the correlation coefficient between them is
zero; but if r = 0, it does not mean that two variables are independent. In other words,
zero correlation does not necessarily imply independence. [Figure (h).]
6. It is a measure of linear association or linear dependence only; it has no meaning for
describing nonlinear relations. Thus in Figure (h), Y = X2 is an exact relationship yet r
is zero.
7. Although it is a measure of linear association between two variables, it does not
necessarily imply any cause-and-effect relationship.

In the regression context, r2 is a more meaningful measure than r, for the former tells
us the proportion of variation in the dependent variable explained by the explanatory
variable(s) and therefore provides an overall measure of the extent to which the variation
in one variable determines the variation in the other. The latter does not have such value.
Moreover, as we shall see, the interpretation of r (= R) in a multiple regression model is of
dubious value.
In passing, note that the r2 defined previously can also be computed as the squared
coefficient of correlation between actual Yi and the estimated Yi, namely, ˆYi . That is,
using the definition of the correlation coefficient, we can write
r2 = {Σ(Yi − Y̅)(ˆYi − ˆY̅)}2 / {Σ(Yi − Y̅)2 Σ(ˆYi − ˆY̅)2}
that is,
r2 = (Σ yi ˆyi)2 / (Σyi2 Σˆyi2)
Where Yi = actual Y, ˆYi = estimated Y, and Y̅ = ˆY̅= the mean of Y. Above equation
justifies the description of r2 as a measure of goodness of fit, for it tells how close the
estimated Y values are to their actual values.

Keynesian consumption function
Keynes stated that “The fundamental psychological law . . . is that men [women]
are disposed, as a rule and on average, to increase their consumption as their income
increases, but not by as much as the increase in their income,” that is, the marginal
propensity to consume (MPC) is greater than zero but less than one. Although Keynes did
not specify the exact functional form of the relationship between consumption and income,
for simplicity assume that the relationship is linear. As a test of the Keynesian
consumption function, the sample data in following Table is used to obtain the estimates
of the regression coefficients, their standard errors, etc.,
Table: Data on weekly family consumption expenditure Y & weekly family income X

The estimated regression function is also indicated graphically in following figure

This above SRF and the associated regression line given below are interpreted as follows:
Each point on the regression line gives an estimate of the expected or mean value of Y
corresponding to the chosen X value; that is, Ŷi is an estimate of E(Y | Xi). The value of ˆβ2 =
0.5091, which measures the slope of the line, shows that, within the sample range of X between
$80 and $260 per week, as X increases, say, by $1, the estimated increase in the mean or average
weekly consumption expenditure amounts to about 51 cents. The value of ˆβ1 =24.4545, which is
the intercept of the line, indicates the average level of weekly consumption expenditure when
weekly income is zero. However, this is a mechanical interpretation of the intercept term. In
regression analysis such literal interpretation of the intercept term may not be always meaningful,
although in the present example it can be argued that a family without any income (because of
unemployment, layoff, etc.) might maintain some minimum level of consumption expenditure
either by borrowing or dissaving. But in general one has to use common sense in interpreting the
intercept term, for very often the sample range of X values may not include zero as one of the
observed values.
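
A small Python sketch that uses only the estimated coefficients reported above (24.4545 and 0.5091) to compute fitted mean consumption at a few illustrative income levels inside the sample range:

b1_hat, b2_hat = 24.4545, 0.5091        # values quoted in the text

for income in (80, 140, 200, 260):      # illustrative weekly income levels ($)
    print(income, round(b1_hat + b2_hat * income, 2))
# each extra dollar of weekly income raises estimated mean consumption by about 51 cents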

Fig. Sample regression line based on the data given in above table

Analysis of variance (ANOVA)
Study of components of TSS is known as the analysis of variance (ANOVA).
TSS = ESS + RSS,
Total sum of squares (TSS) = Explained sum of squares (ESS) + residual sum of squares
(RSS).
Associated with any sum of squares is its df, the number of independent observations on
which it is based.
TSS has n− 1 df because we lose 1 df in computing the sample mean Ῡ.
RSS has n− 2 df. (Why?) (Note: This is true only for the two-variable regression model
with the intercept β1 present.)
ESS has 1 df (again true of the two-variable case only), which follows from the fact that
ESS = ˆβ22 Ʃxi2 is a function of ˆβ2 only, since Ʃxi2 is known.

Let us arrange the various sums of squares and their associated df in the standard form of
the AOV table, sometimes called the ANOVA table. Given the entries of
this table, we now consider the following variable:
F = (MSS of ESS) / (MSS of RSS) = (ˆβ22 Σxi2) / (Σˆui2 / (n − 2))
If we assume that the disturbances ui are normally distributed, which we do under the
CNLRM, and if the null hypothesis (H0) is that β2 = 0, then it can be shown that the F variable of
above equation follows the F distribution with 1 df in the numerator and (n − 2) df in the
denominator.

Classical Normal Linear Regression Model (CNLRM)

The classical theory of statistical inference consists of two branches, namely, estimation
and hypothesis testing.
Using the method of OLS we estimated the parameters β1, β2, and σ2. Under the
assumptions of the classical linear regression model (CLRM), we were able to show that the
estimators of these parameters, ˆβ1, ˆβ2, and ˆσ2, satisfy several desirable statistical properties, such
as unbiasedness, minimum variance, etc. Since these are estimators, their values will change from
sample to sample. Therefore, these estimators are random variables. The estimation is half the
battle; Hypothesis testing is the other half.
Since ˆβ1, ˆβ2, and ˆσ2 are random variables, we need to find out their probability
distributions, for without that knowledge we will not be able to relate them to their true values.
Probability Distribution of Disturbances Ui
To find out the probability distributions of the OLS estimators, proceed with ˆβ2. Recall that
ˆβ2 = Σki Yi ……………… (1)
where
ki = xi/Σxi2. But since the X’s are assumed fixed, or non-stochastic, because ours is conditional
regression analysis, conditional on the fixed values of Xi, above Equation (1) shows that ˆβ2 is a
linear function of Yi, which is random by assumption. But since Yi = β1 + β2Xi + ui, we can write (1)
as
ˆβ2 = Σki (β1 + β2Xi + ui)
Because ki, the betas, and Xi are all fixed, ˆβ2 is ultimately a linear function of the random
variable ui, which is random by assumption. Therefore, the probability distribution of ˆβ2 (and also
of ˆβ1) will depend on the assumption made about the probability distribution of ui. And since
knowledge of the probability distributions of OLS estimators is necessary to draw inferences about
their population values, the nature of the probability distribution of ui assumes an extremely
important role in hypothesis testing.
Since the method of OLS does not make any assumption about the probabilistic nature of
ui, it is of little help for the purpose of drawing inferences about the PRF from the SRF.
This void can be filled if we are willing to assume that the u’s follow some probability
distribution. In the regression context it is usually assumed that the u’s follow the normal
distribution. Adding the normality assumption for ui to the assumptions of the classical linear
regression model (CLRM) we obtain what is known as the classical normal linear regression
model (CNLRM).
The Normality Assumption for Ui
The classical normal linear regression model assumes that each ui is distributed normally with
Mean: E(ui) = 0
Variance: E(ui2) = σ2
Covariance: E(ui uj) = 0 for i ≠ j
This can be compactly written as
ui ∼ N(0, σ2)
Where, the symbol ~ means distributed as and N stands for the normal distribution, the
terms in the parentheses representing the two parameters of the normal distribution, namely, the
mean and the variance.
For two normally distributed variables, zero covariance or correlation means independence
of the two variables. Therefore, with the normality assumption, ui and uj are not only
uncorrelated but are also independently distributed. Therefore, we can write the above as
ui ∼ NID(0, σ2)
where NID stands for normally and independently distributed.

Reasons for assumption of the Normality
1. ui represent the combined influence (on the dependent variable) of a large number of
independent variables that are not explicitly introduced in the regression model. We hope that
the influence of these omitted or neglected variables is small and at best random. Now by the
celebrated central limit theorem (CLT) of statistics, it can be shown that if there are a large
number of independent and identically distributed random variables, then, with a few
exceptions, the distribution of their sum tends to a normal distribution as the number of such
variables increases indefinitely. It is the CLT that provides a theoretical justification for the
assumption of normality of ui.
2. A variant of the CLT states that, even if the number of variables is not very large or if these
variables are not strictly independent, their sum may still be normally distributed.
3. With the normality assumption, the probability distributions of OLS estimators can be easily
derived because, as one property of the normal distribution is that any linear function of
normally distributed variables is itself normally distributed. The OLS estimators ˆβ1 and
ˆβ2 are linear functions of ui. Therefore, if ui are normally distributed, so are ˆβ1 and ˆβ2, which
makes task of hypothesis testing very straightforward.
4. The normal distribution is a comparatively simple distribution involving only two parameters
(mean and variance); it is very well known and its theoretical properties have been extensively
studied in mathematical statistics. Besides, many phenomena follow the normal distribution.
5. Finally, if we are dealing with a small, or finite, sample size, say data of less than 100
observations, the normality assumption assumes a critical role. It not only helps us to derive
the exact probability distributions of OLS estimators but also enables us to use the t, F, and χ2
statistical tests for regression models.
Properties of OLS Estimators under the Normality Assumption
With the assumption that the ui follow the normal distribution, the OLS estimators have the
following properties.
1. They are unbiased.
2. They have minimum variance. Combined with 1, this means that they are minimum-variance
unbiased, or efficient estimators.
3. They have consistency; that is, as the sample size increases indefinitely, the estimators
converge to their true population values.
4. ˆβ1 (being a linear function of ui) is normally distributed with
Mean: E(ˆβ1) = β1 and var(ˆβ1) = (ΣXi2 / (n Σxi2)) σ2
Or more specifically,
ˆβ1 ∼ N(β1, var(ˆβ1))
Then, defining Z = (ˆβ1 − β1)/√var(ˆβ1), by the properties of the normal distribution the variable Z follows the standard normal
distribution, that is, a normal distribution with zero mean and unit (=1) variance, or Z ∼ N(0, 1).
5. ˆβ2 (being a linear function of ui) is normally distributed with
Mean: E(ˆβ2) = β2 and var(ˆβ2) = σ2 / Σxi2
or more compactly,
ˆβ2 ∼ N(β2, σ2/Σxi2)
and Z, defined as
Z = (ˆβ2 − β2) / √var(ˆβ2),
also follows the standard normal distribution. Geometrically, the probability distributions of ˆβ1
and ˆ β2 are shown below.

Fig. 1 Probability distributions of ˆ β1 and ˆβ2

6. (n − 2)(ˆσ2/σ2) is distributed as the χ2 (chi-square) distribution with (n − 2) df.


This knowledge will help us to draw inferences about the true σ2 from the estimated σ2.
7. (ˆβ1, ˆβ2) are distributed independently of ˆσ2.
8. ˆ β1 and ˆβ2 have minimum variance in the entire class of unbiased estimators, whether linear or
not. This result, due to Rao, is very powerful because, unlike the Gauss–Markov theorem, it is
not restricted to the class of linear estimators only. Therefore, we can say that the least-squares
estimators are best unbiased estimators (BUE); that is, they have minimum variance in the
entire class of unbiased estimators.
Summary
The important point to note is that the normality assumption enables us to derive the
probability, or sampling, distributions of ˆβ1 and ˆβ2 (both normal) and ˆσ2 (related to the chi
square). With the assumption that ui ~ N(0, σ2), Yi , being a linear function of ui, is itself normally
distributed with the mean and variance given by
Mean: E(Yi) = β1 + β2Xi and var(Yi) = σ2
More specifically,
Yi ∼ N(β1 + β2Xi, σ2)
The Method of Maximum Likelihood (ML)


A method of point estimation with some stronger theoretical properties than the method of
OLS is the method of maximum likelihood (ML). It is important to note that if ui are assumed to be
normally distributed, the ML and OLS estimators of the regression coefficients, the β’s, are
identical, and this is true of simple as well as multiple regressions. The ML estimator of σ2 is
∑ˆu2i/n. This estimator is biased, whereas the OLS estimator of σ2 = ∑ˆu2i/(n− 2), as we have seen,
is unbiased. But comparing these two estimators of σ2, we see that as the sample size n gets larger
the two estimators of σ2 tend to be equal. Thus, asymptotically (i.e., as n increases indefinitely), the
ML estimator of σ2 is also unbiased. Since the method of least squares with the added assumption
of normality of ui provides us with all the tools necessary for both estimation and hypothesis
testing of the linear regression models, there is no loss for those who may not want to pursue the
maximum likelihood method because of its slight mathematical complexity.
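
A small simulation sketch (hypothetical parameter values) illustrating the point: the ML estimator RSS/n is biased downward in small samples, while RSS/(n − 2) is unbiased, and the two converge as n grows:

import numpy as np

rng = np.random.default_rng(1)
beta1, beta2, sigma2 = 1.0, 2.0, 4.0

for n in (10, 50, 500):
    ml, ols = [], []
    for _ in range(2000):
        X = rng.uniform(0.0, 10.0, size=n)
        Y = beta1 + beta2 * X + rng.normal(0.0, np.sqrt(sigma2), size=n)
        x, y = X - X.mean(), Y - Y.mean()
        b2 = np.sum(x * y) / np.sum(x ** 2)
        b1 = Y.mean() - b2 * X.mean()
        rss = np.sum((Y - b1 - b2 * X) ** 2)
        ml.append(rss / n)
        ols.append(rss / (n - 2))
    print(n, np.mean(ml), np.mean(ols))   # the ML mean sits below 4 for small n; both approach 4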

Multicollinearity

Assumption 10 of the classical linear regression model (CLRM) is that there is no


multicollinearity among the regressors included in the regression model.
Multicollinearity refers to the existence of more than one exact linear relationship,
and collinearity refers to the existence of a single linear relationship. But this distinction is
rarely maintained in practice, and multicollinearity refers to both cases.
Now take a look at this assumption by seeking answers to the questions like
1. What is the nature of multicollinearity? 2. Is multicollinearity really a problem?
3. What are its practical consequences? 4. How does one detect it?
5. What remedial measures can be taken to alleviate the problem of multicollinearity?
The term multicollinearity is due to Ragnar Frisch. Originally it meant the
existence of a “perfect,” or exact linear relationship among some or all explanatory
variables of a regression model. For the k-variable regression involving explanatory
variable X1, X2, . . , Xk (where X1 = 1 for all observations to allow for the intercept term),
an exact linear relationship is said to exist if the following condition is satisfied:
λ1X1 + λ2X2 + · · · + λkXk = 0 …………………(1)
Where λ1, λ2, . . . , λk are constants such that not all of them are zero simultaneously.
The chances of one’s obtaining a sample of values where the regressors are related
in this fashion (multi-collinear) are indeed very small in practice except by design when,
for example, the number of observations is smaller than the number of regressors or if one
falls into the “dummy variable trap”.
Today, however, the term multicollinearity is used in a broader sense to include the
case of perfect multicollinearity, as shown in above equation, as well as the case where the
X variables are inter-correlated but not perfectly so, as follows.
[If there are only two explanatory variables, inter-correlation can be measured by the zero
order or simple correlation coefficient. But if there are more than two X variables, inter-
correlation can be measured by the partial correlation coefficients or by the multiple
correlation coefficient R of one X variable with all other X variables taken together]
λ1X1 + λ2X2 + · · · + λkXk + vi = 0 ………………… (2)
where vi is a stochastic error term.
To see the difference between perfect and less than perfect multicollinearity,
assume, for example, that λ2 ≠ 0. Then, equation (1) above can be written as
X2i = −(λ1/λ2) X1i − (λ3/λ2) X3i − · · · − (λk/λ2) Xki ………………… (3)
which shows how X2 is exactly linearly related to other variables or how it can be derived
from a linear combination of other X variables. In this situation, the coefficient of
correlation between the variable X2 and the linear combination on the right side of (3) is
bound to be unity.
Similarly, if λ2 ≠ 0, Eq. (2) can be written as
X2i = −(λ1/λ2) X1i − (λ3/λ2) X3i − · · · − (λk/λ2) Xki − (1/λ2) vi ………………… (4)
which shows that X2 is not an exact linear combination of other X’s because it is also
determined by the stochastic error term vi.
As a numerical example, consider the following hypothetical data:

It is apparent that X3i = 5X2i. Therefore, there is perfect collinearity between X2 and
X3 since the coefficient of correlation r23 is unity. The variable X*3 was created from X3 by
simply adding to it the numbers taken from a table of random numbers: 2, 0, 7, 9, 2. Now
there is no longer perfect collinearity between X2 and X*3. However, the two variables are
highly correlated because calculations will show that the coefficient of correlation between
them is 0.9959.
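
A short numpy sketch of this example. The X2 values below are illustrative assumptions (the source table is not reproduced here); X3 and X*3 follow the construction described in the text, X3 = 5·X2 and X*3 = X3 plus the random numbers 2, 0, 7, 9, 2:

import numpy as np

X2 = np.array([10.0, 15.0, 18.0, 24.0, 30.0])            # assumed illustrative values
X3 = 5.0 * X2                                            # perfectly collinear with X2
X3_star = X3 + np.array([2.0, 0.0, 7.0, 9.0, 2.0])       # perturbed: no longer perfect

print(np.corrcoef(X2, X3)[0, 1])        # exactly 1.0
print(np.corrcoef(X2, X3_star)[0, 1])   # very high; with these values, about 0.996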
Diagrammatically, multicollinearity can be portrayed using the following Ballentine.

In this figure the circles Y, X2, and X3 represent, respectively, the variations in Y
(the dependent variable) and X2 and X3 (the explanatory variables). The degree of
collinearity can be measured by the extent of the overlap (shaded area) of the X2 and X3
circles. In above Fig (a) there is no overlap between X2 and X3, and hence no collinearity.

In Fig. (b) through (e) there is a “low” to “high” degree of collinearity—the greater the
overlap between X2 and X3 (shaded area), the higher the degree of collinearity.
In the extreme, if X2 and X3 were to overlap completely (or if X2 were completely
inside X3, or vice versa), collinearity would be perfect.
Note that multicollinearity, as we have defined it, refers only to linear relationships
among the X variables. It does not rule out nonlinear relationships among them. For
example, consider the following regression model:

Yi = β1 + β2Xi + β3Xi2 + β4Xi3 + ui …………………(5)
where, say, Y = total cost of production and X = output. The variables Xi2 (output squared)
and Xi3 (output cubed) are obviously functionally related to Xi, but the relationship is
nonlinear.
Therefore, models like (5) do not violate the assumption of no multicollinearity.
However, in concrete applications, the conventionally measured correlation coefficient
will show Xi, Xi2, and Xi3 to be highly correlated, which will make it difficult to estimate
the parameters of (5) with greater precision (i.e., with smaller standard errors).
The classical linear regression model assumes no multicollinearity among the X’s
because, if multicollinearity is perfect in the sense of (1), the regression coefficients of the
X variables are indeterminate and their standard errors are infinite. If multicollinearity is
less than perfect, as in (2), the regression coefficients, although determinate, possess large
standard errors (in relation to the coefficients themselves), which means the coefficients
cannot be estimated with great precision or accuracy.
Sources of multicollinearity [as given by Montgomery and Peck]:
1. The data collection method employed, for example, sampling over a limited range of the values
taken by the regressors in the population.
2. Constraints on the model or in the population being sampled –Ex: In the regression of
electricity consumption on income (X2) and house size (X3) there is a physical
constraint in the population in that families with higher incomes generally have larger
homes than families with lower incomes.
3. Model specification, for example, adding polynomial terms to a regression model,
especially when the range of the X variable is small.
4. An over-determined model. This happens when the model has more explanatory
variables than the number of observations. This could happen in medical research
where there may be a small number of patients about whom information is collected
on a large number of variables.
5. In time series data, regressors included in the model share a common trend, i.e., they
all increase or decrease over time. Thus, in the regression of consumption expenditure
on income, wealth, and population, the regressors income, wealth, and population may
all be growing over time at more or less the same rate, leading to collinearity among
these variables.

Problem with Multicollinearity (MC)
The central problem with multicollinearity is the estimation of the individual regression coefficients: they cannot be obtained with precision, and under perfect multicollinearity not at all.
In the case of perfect multicollinearity
Here the regression coefficients remain indeterminate and their standard errors are infinite.
This fact can be demonstrated in terms of the three-variable regression model. Using the deviation
form, where all the variables are expressed as deviations from their sample means, we can write
the three-variable regression model as
yi = ˆβ2 x2i + ˆβ3 x3i + ˆui
Assume that X3i = λX2i, where λ is a nonzero constant (e.g., 2, 4, 1.8, etc.). Recall that
ˆβ2 = [(Σyi x2i)(Σx3i2) − (Σyi x3i)(Σx2i x3i)] / [(Σx2i2)(Σx3i2) − (Σx2i x3i)2]
Substituting x3i = λx2i into this expression, we obtain
ˆβ2 = [(Σyi x2i)(λ2 Σx2i2) − (λ Σyi x2i)(λ Σx2i2)] / [(Σx2i2)(λ2 Σx2i2) − (λ Σx2i2)2] = 0/0
which is an indeterminate expression; therefore we cannot get the regression coefficients, as
both numerator and denominator are zero.
ˆβ2 gives the rate of change in the average value of Y as X2 changes by a unit, holding X3
constant. But if X3 and X2 are perfectly collinear, there is no way X3 can be kept constant: As X2
changes, so does X3 by the factor λ. What it means, then, is that there is no way of disentangling
the separate influences of X2 and X3 from the given sample: For practical purposes X2 and X3 are
indistinguishable. In applied econometrics this problem is most damaging since the entire intent is
to separate the partial effects of each X upon the dependent variable.
On substituting X3i = λX2i into the deviation-form model, we obtain
yi = ˆβ2 x2i + ˆβ3 (λ x2i) + ˆui = (ˆβ2 + λˆβ3) x2i + ˆui = ˆα x2i + ˆui
where
α = (β2 + λβ3)
Applying OLS we get
ˆα = Σ x2i yi / Σ x2i2
Therefore, although we can estimate α uniquely, there is no way to estimate β2 and β3
uniquely; mathematically
ˆα = ˆβ2 + λˆβ3
gives us only one equation in two unknowns (note λ is given) and there is an infinity of solutions
to this equation for given values of ˆα and λ.
To put this idea in concrete terms, let ˆα = 0.8 and λ = 2. Then we have
0.8 = ˆβ2 + 2ˆβ3
or ˆβ2 = 0.8 − 2ˆβ3

Now, choose a value of ˆβ3 arbitrarily, and we will have a solution for ˆβ2. Choose another
value for ˆβ3, and we will have another solution for ˆβ2. No matter how hard we try, there is no
unique value for ˆβ2.
Thus in the case of perfect multicollinearity one cannot get a unique solution for the
individual regression coefficients. But notice that one can get a unique solution for linear
combinations of these coefficients. The linear combination (β2 + λβ3) is uniquely estimated by α,
given the value of λ.
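
A minimal numpy illustration (simulated data): with x3 = λx2 the matrix of normal equations X'X is singular, so the individual coefficients are indeterminate, while the combination β2 + λβ3 is estimated without difficulty:

import numpy as np

rng = np.random.default_rng(2)
x2 = rng.normal(size=30)
lam = 2.0
x3 = lam * x2                                    # perfect collinearity
y = 0.8 * x2 + rng.normal(scale=0.1, size=30)    # data generated so that beta2 + lam*beta3 = 0.8

X = np.column_stack([x2, x3])
print(np.linalg.matrix_rank(X.T @ X))            # rank 1, not 2: X'X is singular
alpha_hat = np.sum(x2 * y) / np.sum(x2 ** 2)     # OLS on the single combined regressor
print(alpha_hat)                                 # close to 0.8 = beta2 + lam*beta3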
In the case of perfect multicollinearity the variances and standard errors of ˆβ2 and ˆβ3
individually are infinite.
Estimation in the Presence of “High” but “Imperfect” Multicollinearity
The perfect multicollinearity situation is a pathological extreme. Generally, there is no
exact linear relationship among the X variables, especially in data involving economic time series.
Thus, turning to the three-variable model in the deviation form, instead of exact multicollinearity,
we may have
x3i = λ x2i + vi
where λ ≠ 0 and where vi is a stochastic error term such that Σx2i vi = 0. Incidentally, the Ballentines
shown in the earlier Figure, panels (b) to (e), represent cases of imperfect collinearity.
In this case, estimation of the regression coefficients β2 and β3 may be possible.
For example, substituting x3i = λx2i + vi into the formula for ˆβ2, we obtain
ˆβ2 = [(Σyi x2i)(λ2 Σx2i2 + Σvi2) − (λ Σyi x2i + Σyi vi)(λ Σx2i2)] / [(Σx2i2)(λ2 Σx2i2 + Σvi2) − (λ Σx2i2)2]
where use is made of Σx2i vi = 0. A similar expression can be derived for ˆβ3.
Theoretical Consequences of Multicollinearity
Even if multicollinearity is very high, as in the case of near multicollinearity, the OLS
estimators still retain the property of BLUE.
The only effect of multicollinearity is to make it hard to obtain coefficient estimates with
small standard errors. The MC problem is largely a feature of the sample and the sample size, not
of the population: if the sample size is small, the standard errors are large and hence the variances are also large.
Practical Consequences of High Multicollinearity
1. Although BLUE, the OLS estimators have large variances and covariances, making precise
estimation difficult.
2. Because of consequence 1, the confidence intervals tend to be much wider, leading to the
acceptance of the “zero null hypothesis” (i.e., the true population coefficient is zero) more readily.
3. Also because of consequence 1, the t ratio of one or more coefficients tends to be statistically
insignificant.
4. Although the t ratio of one or more coefficients is statistically insignificant, R2, the overall
measure of goodness of fit, can be very high.
5. The OLS estimators and their standard errors can be sensitive to small changes in the data.

Demonstration of consequences
Large Variances and Covariances of OLS Estimators
To see the large variances and covariances, recall that for the three-variable model the variances and covariances of ˆβ2 and ˆβ3 are given by
var(ˆβ2) = σ² / [∑x2i² (1 − r23²)]
var(ˆβ3) = σ² / [∑x3i² (1 − r23²)]
cov(ˆβ2, ˆβ3) = −r23 σ² / [(1 − r23²) √(∑x2i²) √(∑x3i²)]
where r23 is the coefficient of correlation between X2 and X3. It is apparent from above first
two equations that as r23 tends toward 1, that is, as collinearity increases, the variances of the two
estimators increase and in the limit when r23 = 1, they are infinite. It is equally clear from third
equation that as r23 increases toward 1, the covariance of the two estimators also increases in
absolute value. [Note: cov (ˆβ2, ˆβ3) ≡ cov (ˆβ3, ˆβ2).]
If r23² = 1, the variances and standard errors become infinite, and the covariance likewise increases without bound.
Variance-Inflating Factor (VIF)
The speed with which variances and covariances increase can be measured by the variance-inflating factor (VIF), which can also be used to detect the presence or absence of multicollinearity.
VIF = 1 / (1 − r23²)
The VIF shows how the variance of an estimator is inflated by the presence of multicollinearity. As r23² approaches 1, the VIF approaches infinity. That is, as the extent of collinearity increases, the variance of an estimator increases, and in the limit it can become infinite. As can be readily seen, if there is no collinearity between X2 and X3, VIF will be 1.
var(ˆβ2) = (σ² / ∑x2i²) VIF and var(ˆβ3) = (σ² / ∑x3i²) VIF
Thus the variances of ˆβ2 and ˆβ3 are directly proportional to the VIF and inversely proportional to the sums of squared deviations of the X values.
The inverse of the VIF is called tolerance (TOL): TOL = 1/VIF = (1 − r23²).
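As a rough illustration (not part of the original notes), the following Python sketch computes the auxiliary R², VIF, and TOL for each regressor; the DataFrame X of explanatory variables is hypothetical.

import pandas as pd
import statsmodels.api as sm

def vif_tolerance(X: pd.DataFrame) -> pd.DataFrame:
    """Compute auxiliary R^2, VIF and TOL for every column of X."""
    rows = []
    for col in X.columns:
        # Regress X_j on the remaining regressors (plus an intercept) to get R_j^2
        others = sm.add_constant(X.drop(columns=col))
        r2 = sm.OLS(X[col], others).fit().rsquared
        vif = 1.0 / (1.0 - r2)              # VIF_j = 1 / (1 - R_j^2)
        rows.append({"variable": col, "R2_aux": r2, "VIF": vif, "TOL": 1.0 / vif})
    return pd.DataFrame(rows)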

To have an idea about how fast the variances and covariances increase as r23 increases,
consider the data in above Table on variances and covariances for selected values of r23. The
increases in r23 have a dramatic effect on the estimated variances and covariances of the OLS
estimators. When r23 = 0.50, the var ( ˆβ2) is 1.33 times the variance when r23 is zero, but by the
time r23 reaches 0.95 it is about 10 times as high as when there is no collinearity. And, an increase
of r23 from 0.95 to 0.995 makes the estimated variance 100 times that when collinearity is zero.
The same dramatic effect is seen on the estimated covariance, as indicated in the following figure.

Fig.: The behavior of var(ˆβ2) as a function of r23.

Detection of Multicollinearity
Having studied the nature and consequences of multicollinearity, the question is how to detect the presence of collinearity in any given situation, especially in models involving more than two explanatory variables. In this context, consider Kmenta's warning:
 Multicollinearity is a question of degree and not of kind. The meaningful distinction is not
between the presence and the absence of multicollinearity, but between its various degrees.
 Since multicollinearity refers to the condition of the explanatory variables that are assumed to
be non-stochastic, it is a feature of the sample and not of the population. Therefore, we do not
“test for multicollinearity” but can, if we wish, measure its degree in any particular sample.
Since multicollinearity is essentially a sample phenomenon, arising out of the largely non-experimental data collected in the social sciences, there is no unique method of detecting it or measuring its strength. Instead, there are some rules of thumb, some informal and some formal:
1. High R2 but few significant t ratios: This is the “classic” symptom of multicollinearity. If R2 is
high, say, in excess of 0.8, the F test in most cases will reject the hypothesis that the partial
slope coefficients are simultaneously equal to zero, but the individual t tests will show that none
or very few of the partial slope coefficients are statistically different from zero.
Although this diagnostic is sensible, its disadvantage is that “it is too strong in the sense that
multicollinearity is considered as harmful only when all of the influences of the explanatory
variables on Y cannot be disentangled.”
2. High pair-wise correlations among regressors: If the pair-wise or zero-order correlation
coefficient between two regressors is high, say, in excess of 0.8, then multicollinearity is a
serious problem. The problem with this criterion is that, although high zero-order correlations
may suggest collinearity, it is not necessary that they be high to have collinearity in any specific
case. To put the matter somewhat technically, high zero-order correlations are a sufficient but
not a necessary condition for the existence of multicollinearity because it can exist even though
the zero-order or simple correlations are comparatively low (say, less than 0.50).
Suppose, for the four-variable model
Yi = β1 + β2X2i + β3X3i + β4X4i + ui
if
X4i = λ2X2i + λ3X3i
where λ2 and λ3 are constants, not both zero. Obviously, X4 is an exact linear combination of X2 and X3, giving R²4.23 = 1, the coefficient of determination in the regression of X4 on X2 and X3. Now
R²4.23 = (r²42 + r²43 − 2 r42 r43 r23) / (1 − r²23)
But since R²4.23 = 1 because of perfect collinearity, we obtain
1 = (r²42 + r²43 − 2 r42 r43 r23) / (1 − r²23)
With r42 = 0.5, r43 = 0.5, and r23 = −0.5, which are not very high values, we get R²4.23 = 1.
Therefore, in models involving more than two explanatory variables, the simple or zero-
order correlation will not provide an infallible guide to the presence of multicollinearity. Of
course, if there are only two explanatory variables, the zero-order correlations will suffice.
3. Examination of partial correlations: Farrar and Glauber have suggested that one should look
at the partial correlation coefficients. In the regression of Y on X2, X3, and X4, a finding that R²1.234 is very high but r²12.34, r²13.24, and r²14.23 are comparatively low may suggest that the variables X2, X3, and X4 are highly inter-correlated and that at least one of these variables is superfluous.

Thus, if the partial correlations between the dependent variable and the individual regressors are low while the overall R² is high, we may suspect the presence of multicollinearity.
4. Auxiliary regressions: Since multicollinearity arises because one or more of the regressors are
exact or approximately linear combinations of the other regressors, one way of finding out
which X variable is related to other X variables is to regress each Xi on the remaining X
variables and compute the corresponding R2, which we designate as R2i; each one of these
regressions is called an auxiliary regression, auxiliary to the main regression of Y on the X’s.
Then, following the relationship between F and R², the variable
Fi = [R²xi·x2x3···xk / (k − 2)] / [(1 − R²xi·x2x3···xk) / (n − k + 1)]
follows the F distribution with (k − 2) and (n − k + 1) df,
Where,
n = sample size,
k = number of explanatory variables including the intercept term, and
R²xi·x2x3···xk = the coefficient of determination in the regression of variable Xi on the remaining X variables.
If the computed F >critical Fi at the chosen level of significance, it is taken to mean that
the particular Xi is collinear with other X’s; non-collinear otherwise.
Instead of formally testing all auxiliary R² values, one may adopt Klein's rule of thumb, which
suggests that multicollinearity may be a troublesome problem only if the R2 obtained from an
auxiliary regression is greater than the overall R2, that is, that obtained from the regression of Y on
all the regressors.
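A minimal sketch of this auxiliary-regression check, including Klein's rule of thumb, is given below; it assumes a dependent series y and a DataFrame X of regressors (hypothetical names) and is not taken from the notes.

import pandas as pd
import statsmodels.api as sm

def klein_rule(y: pd.Series, X: pd.DataFrame) -> pd.DataFrame:
    """Compare each auxiliary R^2 with the overall R^2 of the main regression."""
    overall_r2 = sm.OLS(y, sm.add_constant(X)).fit().rsquared
    rows = []
    for col in X.columns:
        aux_r2 = sm.OLS(X[col], sm.add_constant(X.drop(columns=col))).fit().rsquared
        rows.append({"variable": col,
                     "R2_auxiliary": aux_r2,
                     "R2_overall": overall_r2,
                     "troublesome": aux_r2 > overall_r2})  # Klein's rule of thumb
    return pd.DataFrame(rows)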
5. Eigenvalues and condition index: SAS uses eigenvalues and the condition index to diagnose multicollinearity. Derive the condition number k from the eigenvalues as
k = Maximum eigenvalue / Minimum eigenvalue
Then the condition index (CI) is
CI = √(Maximum eigenvalue / Minimum eigenvalue) = √k
Rule of thumb:
If k is between 100 and 1000 there is moderate to strong multicollinearity and if it exceeds
1000 there is severe multicollinearity.
Alternatively, if the CI (=√k) is between 10 and 30, there is moderate to strong
multicollinearity and if it exceeds 30 there is severe multicollinearity.
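As an illustrative sketch (assumptions: X is a NumPy design matrix of the regressors; software such as SAS typically rescales the columns before computing the eigenvalues), the condition number and condition index can be obtained as follows.

import numpy as np

def condition_diagnostics(X: np.ndarray):
    """Condition number k = max/min eigenvalue of X'X; condition index CI = sqrt(k)."""
    eigvals = np.linalg.eigvalsh(X.T @ X)   # eigenvalues of the symmetric matrix X'X
    k = eigvals.max() / eigvals.min()
    ci = np.sqrt(k)
    return k, ci                            # apply the 100/1000 and 10/30 rules of thumb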
6. Tolerance and variance inflation factor: As R²j increases towards unity, that is, as the collinearity of Xj with the other regressors increases, VIF also increases. The larger the value of VIFj, the more "troublesome" or collinear the variable Xj.
As a rule of thumb, if the VIF of a variable exceeds 10, which will happen if R²j exceeds 0.90, that variable is said to be highly collinear. The closer TOLj (the inverse of VIFj) is to zero, the greater the degree of collinearity of that variable with the other regressors, while the closer TOLj is to 1, the greater the evidence that Xj is not collinear with the other regressors.
According to Goldberger, exact micronumerosity (the counterpart of exact
multicollinearity) arises when n, the sample size, is zero, in which case any kind of estimation is
impossible. Near micronumerosity, like near multicollinearity, arises when the number of
observations barely exceeds the number of parameters to be estimated.

Remedial Measures
What can be done if multicollinearity is serious? We have two choices: (1) do nothing, or (2) follow some rules of thumb.
The “do nothing” school of thought is expressed by Blanchard as follows:
When students run their first ordinary least squares (OLS) regression, the first problem
that they usually encounter is that of multicollinearity. Many of them conclude that there is
something wrong with OLS; some resort to new and often creative techniques to get around the
problem. But, we tell them, this is wrong. Multicollinearity is God’s will, not a problem with OLS
or statistical technique in general.
What Blanchard is saying is that multicollinearity is essentially a data deficiency problem
(micronumerosity, again) and sometimes we have no choice over the data we have available for
empirical analysis. Also, it is not always the case that all the coefficients in a regression model are statistically insignificant. Moreover, even if we cannot estimate one or more regression coefficients with great precision, a linear combination of them (i.e., an estimable function) can be estimated relatively efficiently.
Rule-of-Thumb Procedures
One can try rules of thumb to address the problem of multicollinearity, the success
depending on the severity of the collinearity problem.
1. A priori information: Suppose we consider the model
Yi = β1 + β2X2i + β3X3i + ui
Where, Y = consumption, X2 = income, and X3 = wealth. Income and wealth tend to be highly collinear. But suppose a priori we believe that β3 = 0.10β2; that is, the rate of change of consumption with respect to wealth is one-tenth the corresponding rate with respect to income. We can then run the following regression
Yi = β1 + β2X2i + 0.10β2X3i + ui = β1 + β2Xi + ui
Where, Xi = X2i + 0.1X3i. Once ˆβ2 is obtained, ˆβ3 can be estimated from the postulated relationship between β2 and β3.
The a priori information could come from previous empirical work in which the
collinearity problem happens to be less serious or from the relevant theory underlying the field of
study.
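A minimal sketch of imposing this a priori restriction, assuming a hypothetical DataFrame df with columns 'consumption', 'income', and 'wealth':

import statsmodels.api as sm

X_combined = df["income"] + 0.1 * df["wealth"]          # X_i = X2i + 0.1*X3i
fit = sm.OLS(df["consumption"], sm.add_constant(X_combined)).fit()
beta2_hat = fit.params.iloc[1]    # estimate of beta2
beta3_hat = 0.1 * beta2_hat       # recovered from the postulated restriction beta3 = 0.1*beta2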
2. Combining cross-sectional and time series data:
A variant of the extraneous or apriori information technique is the combination of cross
sectional and time-series data, known as pooling the data. Suppose we want to study the demand
for automobiles in the United States and assume we have time series data on the number of cars
sold, average price of the car, and consumer income. Suppose also that
ln Yt = β1 + β2 ln Pt + β3 ln It + ut
where Y = number of cars sold, P = average price, I = income, and t = time.

Objective: To estimate the price elasticity β2 and income elasticity β3.
In time series data the price and income variables generally tend to be highly collinear.
Therefore, if we run the preceding regression, we shall be faced with the usual multicollinearity
problem. A way out of this has been suggested by Tobin. He says that if we have cross-sectional
data (for example, data generated by consumer panels, or budget studies conducted by various
private and governmental agencies), we can obtain a fairly reliable estimate of the income
elasticity β3 because in such data, which are at a point in time, the prices do not vary much. Let the
cross-sectionally estimated income elasticity be ˆβ3. Using this estimate, we may write the preceding time series regression as
Y*t = β1 + β2 ln Pt + ut
where Y* = ln Y − ˆβ3 ln I, that is, Y* represents that value of Y after removing from it the effect of
income. We can now obtain an estimate of the price elasticity β2 from the preceding regression.
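The two-step idea can be sketched as follows (hypothetical data: beta3_cs is the income elasticity taken from a cross-sectional study, and ts is a time-series DataFrame with columns 'cars', 'price', and 'income'):

import numpy as np
import statsmodels.api as sm

beta3_cs = 0.9                                                     # assumed cross-sectional estimate
y_star = np.log(ts["cars"]) - beta3_cs * np.log(ts["income"])      # Y* = ln Y - beta3_hat * ln I
fit = sm.OLS(y_star, sm.add_constant(np.log(ts["price"]))).fit()
beta2_hat = fit.params.iloc[1]                                     # price elasticity from the time series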
3. Dropping a variable(s) and specification bias:
When faced with severe multicollinearity, one of the "simplest" things to do is to drop one of the collinear variables, so that a previously non-significant variable may become significant, or vice versa.
But in dropping a variable from the model we may be committing a specification bias or
specification error. Specification bias arises from incorrect specification of the model used in the
analysis.
4. Transformation of variables: Suppose we have time series data on consumption expenditure,
income, and wealth. One reason for high multicollinearity between income and wealth in such
data is that over time both the variables tend to move in the same direction. One way of
minimizing this dependence is to proceed as follows
If the relation
Yt = β1 + β2X2t + β3X3t + ut
holds at time t, it must also hold at time t−1 because the origin of time is arbitrary anyway. Therefore, we have
Yt−1 = β1 + β2X2,t−1 + β3X3,t−1 + ut−1
On taking the difference, we obtain
Yt − Yt−1 = β2(X2t − X2,t−1) + β3(X3t − X3,t−1) + (ut − ut−1)
where vt = ut - ut−1. This equation is known as the first difference form because we run the
regression, not on the original variables, but on the differences of successive values of the
variables.
 The first difference regression model often reduces the severity of multicollinearity because,
although the levels of X2 and X3 may be highly correlated, there is no a priori reason to believe
that their differences will also be highly correlated.
 An incidental advantage of the first-difference transformation is that it may make a
nonstationary time series stationary. Loosely speaking, a time series, say, Yt, is stationary if its
mean and variance do not change systematically over time.
 Another commonly used transformation in practice is the ratio transformation.
Consider the model:
Yt = β1 + β2X2t + β3X3t + ut
where Y is consumption expenditure in real dollars, X2 is GDP, and X3 is total population. Since GDP and population grow over time, they are likely to be correlated. One "solution" to this problem is to express the model on a per capita basis, that is, dividing through by X3, we obtain:
Yt/X3t = β1(1/X3t) + β2(X2t/X3t) + β3 + (ut/X3t)
Such a transformation may reduce collinearity in the original variables.


But the first-difference and ratio transformations are not without problems. The error term vt in the first-difference model may not satisfy the CLRM assumption that the disturbances are serially uncorrelated; there is also a loss of one observation due to the differencing procedure, so the degrees of freedom are reduced by one; and the first-differencing procedure may not be appropriate in cross-sectional data, where there is no logical ordering of the observations.
In the ratio model, the error term (ut/X3t) will be heteroscedastic even if the original error term ut is homoscedastic.
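Both transformations can be sketched in a few lines (hypothetical DataFrame df with columns 'consumption', 'gdp', and 'population'); the first-difference regression is run without an intercept, and the per-capita regression includes 1/X3 as a regressor so that the constant estimates β3.

import pandas as pd
import statsmodels.api as sm

# First-difference form: delta-Y_t = beta2*delta-X2_t + beta3*delta-X3_t + v_t (no intercept)
d = df[["consumption", "gdp", "population"]].diff().dropna()
fd_fit = sm.OLS(d["consumption"], d[["gdp", "population"]]).fit()

# Ratio form: Y/X3 = beta1*(1/X3) + beta2*(X2/X3) + beta3 + u/X3
Z = pd.DataFrame({"inv_pop": 1.0 / df["population"],
                  "gdp_pc": df["gdp"] / df["population"]})
ratio_fit = sm.OLS(df["consumption"] / df["population"], sm.add_constant(Z)).fit()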

5. Additional or new data:
Since multicollinearity is a sample feature, it is possible that in another sample involving
the same variables collinearity may not be so serious as in the first sample. Sometimes simply
increasing the size of the sample (if possible) may attenuate the collinearity problem.
For example, in the three-variable model the variance of ˆβ2 is
var(ˆβ2) = σ² / [∑x2i² (1 − r23²)]
Now as the sample size increases, ∑x2i² will generally increase.
Therefore, for any given r23, the variance of ˆβ2 will decrease, thus decreasing the standard error,
which will enable us to estimate β2 more precisely.
6. Reducing collinearity in polynomial regressions: Multicollinearity in polynomial regressions, particularly with small samples, can be reduced by using orthogonal polynomials. If the explanatory variable(s) are expressed in deviation form (i.e., as deviations from the mean value), multicollinearity is substantially reduced. But even then the problem may persist, in which case one may want to consider techniques such as orthogonal polynomials.
7. Frisch's Confluence Analysis: Start with one explanatory variable and note the R² and significance; then add another variable and see its effect on the R² and significance, and continue until adding a variable yields non-significance. The procedure is similar to step-wise regression.
8. Other methods: Multivariate statistical techniques such as factor analysis and principal
components or techniques such as ridge regression are often employed to “solve” the problem
of multicollinearity.
Multicollinearity is not a serious problem when the objective is forecasting or predicting an economic phenomenon, but it is a problem when the objective is to analyse cause-and-effect relationships.
Conclusion
1. One of the assumptions of the classical linear regression model is that there is no
multicollinearity among the explanatory variables, the X’s. Broadly interpreted,
multicollinearity refers to the situation where there is either an exact or approximately exact
linear relationship among the X variables.
2. The consequences of multicollinearity: If there is perfect collinearity among the X’s, their
regression coefficients are indeterminate and their standard errors are not defined. If
collinearity is high but not perfect, estimation of regression coefficients is possible but their
standard errors tend to be large. As a result, the population values of the coefficients cannot be
estimated precisely. However, if the objective is to estimate linear combinations of these
coefficients, the estimable functions, this can be done even in the presence of perfect
multicollinearity.
3. Although there are no sure methods of detecting collinearity, there are several indicators of it
like:
a. The clearest sign of multicollinearity is when R2 is very high but none of the
regression coefficients is statistically significant on the basis of the conventional t test.
This case is, of course, extreme.
b. In models involving just two explanatory variables, a fairly good idea of collinearity
can be obtained by examining the zero-order, or simple, correlation coefficient
between the two variables. If this correlation is high, multicollinearity is generally the
culprit.
c. However, the zero-order correlation coefficients can be misleading in models
involving more than two X variables since it is possible to have low zero-order
correlations and yet find high multicollinearity. In situations like these, one may need
to examine the partial correlation coefficients.
d. If R2 is high but the partial correlations are low, multicollinearity is a possibility. Here
one or more variables may be superfluous. But if R2 is high and the partial correlations
are also high, multicollinearity may not be readily detectable. Also, as pointed out by
C. Robert, Krishna Kumar, John O’Hagan, and Brendan McCabe, there are some
statistical problems with the partial correlation test suggested by Farrar and Glauber.
e. Therefore, one may regress each of the Xi variables on the remaining X variables in the
model and find out the corresponding coefficients of determination R2i. A high R2i
would suggest that Xi is highly correlated with the rest of the X’s. Thus, one may drop
that Xi from the model, provided it does not lead to serious specification bias.
4. Detection of multicollinearity is half the battle. The other half is concerned with how to get rid
of the problem. Again there are no sure methods, only a few rules of thumb. Some of these
rules are:
a. using extraneous or prior information,
b. combining cross-sectional and time series data,
c. omitting a highly collinear variable,
d. transforming data, and
e. obtaining additional or new data.
Which of these rules will work in practice will depend on the nature of the data and
severity of the collinearity problem.
5. We noted the role of multicollinearity in prediction and pointed out that unless the collinearity
structure continues in the future sample it is hazardous to use the estimated regression that has
been plagued by multicollinearity for the purpose of forecasting.
6. Although multicollinearity has received extensive (some would say excessive) attention in the
literature, an equally important problem encountered in empirical research is that of
micronumerosity, the smallness of sample size. According to Goldberger, "When a research article complains about multicollinearity, readers ought to see whether the complaints would be convincing if 'micronumerosity' were substituted for 'multicollinearity'." He suggests that one ought to decide how small n, the number of observations, may be before deciding that one has a small-sample problem, just as one decides how high an R² value in an auxiliary regression must be before declaring that the collinearity problem is very severe.

Autocorrelation:
(What happens if the error terms are correlated?)

The three types of data available for empirical analysis include: (1) Cross section, (2) Time series,
and (3) Combination of cross section and time series, also known as pooled data.
The assumption of homoscedasticity (or equal error variance) may not always be tenable in cross-sectional data, as this kind of data is often plagued by the problem of heteroscedasticity.
In cross-section studies, data are usually collected from a random sample of cross-sectional units (households in a consumption-function analysis, or firms in an investment study), so there is no prior reason to believe that the error term pertaining to one household (or firm) is correlated with the error term of another household (or firm). If by chance such a correlation is observed in cross-sectional units, it is called spatial autocorrelation, that is, correlation in space rather than over time (temporal correlation).
However, it is important to remember that, in cross-sectional analysis, the ordering of the data
must have some logic, or economic interest, to make sense of any determination of whether (spatial)
autocorrelation is present or not.
The situation, is different while dealing with time series data, for the observations in such data
follow a natural ordering over time so that successive observations are likely to exhibit inter-correlations,
especially if the time interval between successive observations is short, such as a day, a week, or a month
rather than a year. If one observes stock price indices of a company over successive days, they move up
or down for several days in succession. Obviously, in such situations, the assumption of no auto, or
serial, correlation in the error terms that underlies the CLRM will be violated.
The Nature of the Problem
The autocorrelation may be defined as “correlation between members of series of observations
ordered in time [as in time series data] or space [as in cross-sectional data]”. In the regression context, the
classical linear regression model assumes that such autocorrelation does not exist in the disturbances ui.
Symbolically,
E(ui uj) = 0, i ≠ j
The classical model assumes that the disturbance term relating to any observation is not
influenced by the disturbance term relating to any other observation.
Ex: i) If we are dealing with quarterly time series data involving the regression of output on labor and
capital inputs and if, say, there is a labor strike affecting output in one quarter, there is no reason to
believe that this disruption will be carried over to the next quarter. That is, if output is lower this
quarter, there is no reason to expect it to be lower next quarter.
ii) While dealing with cross-sectional data involving the regression of family consumption
expenditure on family income, the effect of an increase of one family’s income on its consumption
expenditure is not expected to affect the consumption expenditure of another family.
However, if there is such a dependence, we have autocorrelation.
Symbolically,
E(ui uj) ≠ 0, i ≠ j
In this situation, the disruption caused by a strike this quarter may very well affect output next
quarter, or the increases in the consumption expenditure of one family may very well prompt another
family to increase its consumption expenditure if it wants to keep up with the Joneses.
The terms autocorrelation and serial correlation are often used synonymously, but some authors, such as Tintner, differentiate them: he defines autocorrelation as "lag correlation of a given series with itself, lagged by a number of time units," whereas he uses serial correlation for "lag correlation between two different series." Thus, correlation between two time series such as u1, u2, . . . , u10 and u2, u3, . . . , u11,
where the former is the latter series lagged by one time period, is autocorrelation, whereas correlation
between time series such as u1, u2, . . . , u10 and v2, v3, . . . , v11, where u and v are two different time
series, is called serial correlation.

In the following figures, Figs. 1a to 1d show that there is a distinct pattern among the u's: Fig. 1a shows a cyclical pattern; Figs. 1b and 1c suggest an upward or downward linear trend in the disturbances; whereas Fig. 1d indicates that both linear and quadratic trend terms are present in the disturbances. Only Fig. 1e indicates no systematic pattern, supporting the non-autocorrelation assumption of the classical linear regression model.

Fig.1: Patterns of autocorrelation and non-autocorrelation.

Causes for autocorrelation:
1. Inertia: A salient feature of most economic time series is inertia (inactivity), or sluggishness. As is
well known, time series such as GNP, price indices, production, employment, and unemployment
exhibit (business) cycles.
Starting at the bottom of the recession, when economic recovery starts, most of these series start
moving upward. In this upswing, the value of a series at one point in time is greater than its previous
value. This "momentum" built into them continues until something happens (e.g., an increase in interest rates or taxes or both) to slow them down. Therefore, in regressions involving time series data, successive
observations are likely to be interdependent.
2. Specification Bias- Excluded Variables Case:
The researcher often starts with a plausible regression model that may not be the most "perfect" one. After the regression analysis, he or she examines whether the results accord with a priori expectations. If the plotted residuals ˆui obtained from the fitted regression follow patterns such as those shown in Figs. 1a to 1d, these residuals (which are proxies for ui) may suggest that some variables that, for a variety of reasons, should have been included were left out of the model. This is the case of excluded-variable specification bias.
Often the inclusion of such variables removes the correlation pattern observed among the
residuals. Ex: For the following demand model:
Yt = β1 + β2X2t + β3X3t + β4X4t + ut ………. (1)
Where,
Y = Quantity of beef demanded,
X2 = Price of beef,
X3 = Consumer income,
X4 = Price of pork, and
t = time.
However, for some reason, if we run the following regression:
Yt = β1 + β2X2t + β3X3t + vt ………. (2)
If (1) is the “correct’’ model or true relation, running (2) is equivalent to letting vt = β4X4t + ut.
And to the extent the price of pork affects the consumption of beef, the error term v will reflect a
systematic pattern, thus creating (false) autocorrelation. A simple test of this would be to run both (1) and
(2) and see whether autocorrelation, if any, observed in model (2) disappears when (1) is run.
3. Specification bias - Incorrect functional form
Suppose the "true" or correct model in a cost-output study is as follows:
Marginal costi = β1 + β2 outputi + β3 output²i + ui ……….(3)
but we fit the following model:
Marginal costi = α1 + α2 outputi + vi ……….(4)

Fig.2: Specification bias: incorrect functional form
The marginal cost curve corresponding to the “true’’ model is shown in Fig. 2 along with the
“incorrect’’ linear cost curve. As Fig.2 shows, between points A and B the linear marginal cost curve will
consistently overestimate the true marginal cost, whereas beyond these points it will consistently
underestimate the true marginal cost. This result is to be expected, because the disturbance term vi is, in fact, equal to β3 output²i + ui, and hence will catch the systematic effect of the output² term on marginal cost. In this case, vi will reflect autocorrelation because of the use of an incorrect functional form.
4. Cobweb Phenomenon:
The supply of many agricultural commodities reflects the cobweb phenomenon, where supply
reacts to price with a lag of one time period because supply decisions take time to implement (the gestation period). Thus, in planting crops this year, farmers are influenced by the price prevailing last year, so that their supply function is
Supplyt = β1 + β2Pt−1 + ut ……….(5)
Suppose at the end of period t, price Pt turns out to be lower than Pt−1.
Therefore, in period t+1 farmers may very well decide to produce less than they did in period t.
Obviously, in this situation the disturbances ut are not expected to be random because if the farmers
overproduce in year t, they are likely to reduce their production in t + 1, and so on, leading to a Cobweb
pattern.
5. Lags
In a time series regression of consumption expenditure on income, it is not uncommon to find that
the consumption expenditure in the current period depends, among other things, on the consumption
expenditure of the previous period. That is,
Consumptiont = β1 + β2Incomet + β3Consumptiont−1 + ut ……….(6)
Above regression is known as auto-regression because one of the explanatory variables is the
lagged value of the dependent variable.
Consumers do not change their consumption habits readily for psychological, technological, or
institutional reasons. Now if we neglect the lagged term in above equation, the resulting error term will
reflect a systematic pattern due to the influence of lagged consumption on current consumption.

6. Manipulation of Data:
In empirical analysis, the raw data are often “manipulated.’’
Ex: In time series regressions involving quarterly data, the quarterly figures are usually derived by averaging the three monthly values (adding the three monthly values and dividing by 3). This averaging
introduces smoothness into the data by dampening the fluctuations in the monthly data. Therefore, the
graph plotting the quarterly data looks much smoother than the monthly data, and this smoothness may
itself lend to a systematic pattern in the disturbances, thereby introducing autocorrelation.
Another source of manipulation is interpolation or extrapolation of data.
Ex: The Population Census is conducted every 10 years, as in 1991 and 2001. Now if there is a need to obtain data for some year within the inter-census period 1991–2001, the common practice is to interpolate on the basis of some ad hoc assumptions. All such data "massaging" techniques might impose
upon the data a systematic pattern that might not exist in the original data.
7. Data Transformation:
Consider the following model:
Yt = β1 + β2Xt + ut ……….(7)
where,
Y = Consumption expenditure and X = Income.
Since it holds true at every time period, it holds true also in the previous time period, (t − 1). So,
we can write (7) as
Yt−1 = β1 + β2Xt−1 + ut−1 ……….(8) [Known as level form regression]
Yt−1, Xt−1, and ut−1 are known as the lagged values of Y, X, and u, respectively, by one period.
subtracting (8) from (7), we obtain
∆Yt = β2∆Xt +∆ut……….(9) [Known as difference form regression]
∆ is the first difference operator, tells us to take successive differences of the variables.
For empirical purposes, we write this equation as
∆Yt = β2∆Xt + vt-----------------(10) [Known as dynamic regression models – which
involve lagged regressands]
If in (8) Y and X represent the logarithms of consumption expenditure and income, then in (9) ∆Y and
∆X will represent changes in the logs of consumption expenditure and income. A change in the log of a
variable is a relative or a percentage change, if the former is multiplied by 100. So, instead of studying
relationships between variables in the level form, we may be interested in their relationships in the growth
form.
Now if the error term in (7) satisfies the standard OLS assumptions, particularly the assumption
of no autocorrelation, it can be shown that the error term vt in (10) is autocorrelated.
Since, vt = ut − ut−1,
E(vt) = E(ut − ut−1) = E(ut)−E(ut−1) = 0, since E(u) = 0, for each t.
Now, var (vt) = var (ut − ut−1) = var (ut) + var (ut−1) = 2σ2,
Since, the variance of each ut is σ2 and the u’s are independently distributed. Hence, vt is homoscedastic.
But cov (vt , vt−1) = E(vtvt−1) = E[(ut − ut−1)(ut−1 − ut−2)] = −σ2 which is obviously nonzero. Therefore,
although the u’s are not autocorrelated, the v’s are.
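A quick simulation sketch (purely illustrative) confirms this point: differencing a white noise error produces a first-order autocorrelation close to −σ²/2σ² = −0.5.

import numpy as np

rng = np.random.default_rng(0)
u = rng.normal(size=100_000)              # homoscedastic, serially independent u_t
v = np.diff(u)                            # v_t = u_t - u_{t-1}
print(np.corrcoef(v[1:], v[:-1])[0, 1])   # approximately -0.5, so the v's are autocorrelated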
8. Nonstationarity:
A time series is stationary if its characteristics (e.g., mean, variance, and covariance) are time
invariant; that is, they do not change over time. If that is not the case, we have a nonstationary time
series. In a regression model (7), it is quite possible that both Y and X are nonstationary and therefore the
error u is also nonstationary. In that case, the error term will exhibit autocorrelation.
Autocorrelation can be positive (Fig. 3a) as well as negative, although most economic time series generally exhibit positive autocorrelation because most of them either move upward or downward over extended time periods and do not exhibit a constant up-and-down movement such as that shown in Fig. 3b.

Fig.3: (a) Positive and (b) negative autocorrelation

OLS Estimation in the Presence of Autocorrelation


What happens to the OLS estimators and their variances if we introduce autocorrelation in the disturbances by assuming that E(utut+s) ≠ 0 (s ≠ 0) but retain all the other assumptions of the classical model? Note again that we use the subscript t on the disturbances to emphasize that we are dealing with time series data.
Two-variable regression model
Yt = β1 + β2Xt + ut …………….(11)
We must assume the mechanism that generates ut, for E(utut+s) ≠ 0 (s ≠ 0) is too general an assumption to be of any practical use. As a starting point, assume that the disturbance, or error, terms are generated by the following mechanism:
ut = ρut−1 + εt, −1 < ρ < 1 …………….(12)
where ρ (= rho) is known as the coefficient of autocovariance and where εt is the stochastic disturbance term such that it satisfies the standard OLS assumptions, namely,
E(εt) = 0, var(εt) = σε², and cov(εt, εt+s) = 0 for s ≠ 0
An error term with these properties is called a white noise error term. Equation (12) postulates
that the value of the disturbance term in period t is equal to rho times its value in the previous period plus
a purely random error term. The scheme (12) is known as Markov first-order autoregressive scheme,
or simply a first-order autoregressive scheme, usually denoted as AR(1). The name autoregressive is
appropriate because (12) can be interpreted as the regression of ut on itself lagged one period. It is first
order because ut and its immediate past value are involved; that is, the maximum lag is 1.
If the model is ut = ρ1ut−1 + ρ2ut−2 + εt, it would be an AR(2), or second-order, autoregressive
scheme, and so on.
Rho (coefficient of autocovariance) in (12), can also be interpreted as the first-order coefficient
of autocorrelation, or more accurately, the coefficient of autocorrelation at lag 1.
Given the AR(1) scheme, it can be shown that
var(ut) = E(ut²) = σε² / (1 − ρ²) ……………….. (13)
cov(ut, ut+s) = E(ut ut+s) = ρ^s σε² / (1 − ρ²) …………………(14)
cor(ut, ut+s) = ρ^s …………………(15)
Where, cov (ut, ut+s) means covariance between error terms s periods apart and where cor (ut, ut+s) means
correlation between error terms s periods apart. Because of the symmetry property of covariances and
correlations, cov (ut , ut+s) = cov (ut, ut−s) and cor(ut, ut+s) = cor(ut, ut−s) .
Since ρ is a constant between −1 and +1, (13) shows that under the AR(1) scheme, the variance of
ut is still homoscedastic, but ut is correlated not only with its immediate past value but its values several
periods in the past. It is critical to note that |ρ| < 1, that is, the absolute value of rho is less than one.
 If ρ =1, the variances and covariances listed above are not defined.
 If |ρ| < 1, we say that the AR(1) process given in (12,i.e., ut = ρut−1 + εt) is stationary; that is, the
mean, variance, and covariance of ut do not change over time.
 If |ρ| is less than one, then it is clear from (14) that the value of the covariance will decline as we
go into the distant past.
One reason we use the AR(1) process is not only because of its simplicity compared to higher-order
AR schemes, but also because in many applications it has proved to be quite useful. Additionally, a
considerable amount of theoretical and empirical work has been done on the AR(1) scheme.
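The properties in (13) to (15) can be checked with a short simulation (illustrative only; the values of rho and sigma are arbitrary choices):

import numpy as np

rng = np.random.default_rng(1)
rho, sigma_eps, n = 0.8, 1.0, 200_000
u = np.zeros(n)
for t in range(1, n):
    u[t] = rho * u[t - 1] + rng.normal(scale=sigma_eps)   # AR(1): u_t = rho*u_{t-1} + eps_t

print(u.var(), sigma_eps**2 / (1 - rho**2))               # sample vs theoretical variance, eq. (13)
print(np.corrcoef(u[3:], u[:-3])[0, 1], rho**3)           # lag-3 autocorrelation vs rho^3, eq. (15)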

Ex: Two-variable regression model: Yt = β1 + β2Xt + ut.


We know that the OLS estimator of the slope coefficient and its variance are given by
ˆβ2 = ∑xt yt / ∑xt² ……………….. (16)
var(ˆβ2) = σ² / ∑xt² …………………(17)
Now under the AR(1) scheme, it can be shown that the variance of this estimator is:
var(ˆβ2)AR1 = (σ² / ∑xt²) [1 + 2ρ (∑xt xt+1/∑xt²) + 2ρ² (∑xt xt+2/∑xt²) + · · · + 2ρ^(n−1) (x1xn/∑xt²)] ……(18)
Where, var (ˆβ2)AR1 means the variance of ˆβ2 under first-order autoregressive scheme.

A comparison of (18) with (17) shows the former is equal to the latter times a term that depends
on ρ as well as the sample autocorrelations between the values taken by the regressor X at various lags.

In general we cannot predict whether var (ˆβ2) is less than or greater than var (ˆβ2)AR1. Of course,
if rho is zero, the two formulas will coincide. Also, if the correlations among the successive values of the
regressor are very small, the usual OLS variance of the slope estimator will not be seriously biased. But,
as a general principle, the two variances will not be the same.
To give some idea about the difference between the variances given in (17) and (18), assume that
the regressor X also follows the first-order autoregressive scheme with a coefficient of autocorrelation of
r. Then it can be shown that (18) reduces to:
var(ˆβ2)AR1 = (σ² / ∑xt²) [(1 + rρ) / (1 − rρ)] ……(19)
If, for example, r = 0.6 and ρ = 0.8, using (19)
var (ˆβ2)AR1 = 2.8461 var (ˆβ2)OLS.
To put it another way,
var (ˆβ2)OLS = 1/2.8461var (ˆβ2)AR1 = 0.3513 var (ˆβ2)AR1 .
That is, the usual OLS formula [i.e.,(17)] will underestimate the variance of (ˆβ2)AR1 by about 65 percent.
[This answer is specific for the given values of r and ρ].
Warning:
A blind application of the usual OLS formulas to compute the variances and standard errors of the
OLS estimators could give seriously misleading results.
Suppose we continue to use the OLS estimator ˆβ2 and adjust the usual variance formula by taking into
account the AR(1) scheme. That is, we use ˆβ2 given by (16) but use the variance formula given by (18).
What now are the properties of ˆβ2? It can be proved that ˆ β2 is still linear and unbiased.
Detecting Autocorrelation
Ex: Relationship between wages and productivity in the business sector of the US, 1959–1998
Table-1: Indices of Real Compensation and Productivity, United States, 1959–1998

Table 1 above contains the data on indices of real compensation per hour (Y) and output per hour (X) in the business sector of the U.S. economy for the period 1959–1998 [base: 1992 = 100].

Fig.4: Index of real compensation (Y) plotted against index of productivity (X)
Plotting the data on Y and X, we obtain Fig. 4. Since the relationship between the two variables is expected to be positive, it is not surprising that they are positively related; what is surprising is that the relationship is almost linear, although there is some hint that at higher values of productivity the relationship may be slightly nonlinear. Therefore, linear as well as log–linear models were estimated, and the results are as follows.

where d is the Durbin–Watson statistic

Qualitatively, both the models give similar results. In both cases the estimated coefficients are
“highly” significant, as indicated by the high t values.
In the linear model, if the index of productivity goes up by a unit, on average, the index of
compensation goes up by about 0.71 units.
In the log–linear model, if the index of productivity goes up by 1 percent, on average, the index
of real compensation goes up by about 0.67 percent.
How reliable are the results if there is autocorrelation?
Tests / Methods to detect Autocorrelation
If there is autocorrelation, the estimated standard errors are biased, as a result of which the estimated t ratios are unreliable. Hence there is a need to find out whether the data suffer from autocorrelation, using the different methods detailed below with the linear model.
1. Graphical Method
The assumption of non-autocorrelation of the classical model relates to the population
disturbances ut, which are not directly observable. What we have instead are their proxies, the residuals
ˆut, which can be obtained by the usual OLS procedure. Although the ˆut are not the same thing as ut, very
often a visual examination of the ˆu’s gives us some clues about the likely presence of autocorrelation in
the u's. Actually, a visual examination of ˆut or (ˆut²) can provide useful information not only about
autocorrelation but also about heteroscedasticity, model inadequacy, or specification bias.
There are various ways of examining the residuals. One is to plot them against time: the time sequence plot in Fig. 5 shows the residuals obtained from the wages–productivity regression (1). The values of these residuals are given in Table 2 along with some other data.

Fig.5: Residuals and standardized residuals from the wages–productivity regression (1)

Table-2: Residuals: Actual, Standardized, And Lagged

Alternatively, we can plot the standardized residuals against time, which are also shown in
Fig.5 and Table 2.
The standardized residuals are the residuals (ˆut) divided by the standard error of the regression
(√ˆσ2), that is, they are (ˆut/ˆσ ). [Note that ˆut and ˆσ are measured in the units in which the regressand Y
is measured]. The values of the standardized residuals will therefore be pure numbers (devoid of units of
measurement) and can be compared with the standardized residuals of other regressions. Moreover, the
standardized residuals, like ˆut, have zero mean and approximately unit variance. In large samples (ˆut/ˆσ)
is approximately normally distributed with zero mean and unit variance. For our example, ˆσ = 2.6755.
Examining the time sequence plot given in Fig 5, one can observe that both ˆut and the standardized ˆut
exhibit a pattern observed in Figure 1d, suggesting that perhaps ut are not random.
To see this differently, we can plot ˆut against ˆut−1, that is, plot the residuals at time t against their
value at time (t-1), a kind of empirical test of the AR(1) scheme. If the residuals are nonrandom, we
should obtain pictures similar to those shown in Fig.3. This plot for wages–productivity regression is
shown in Fig.6 for the data in Table 2. As this figure reveals, most of the residuals are bunched in the
second (northeast) and the fourth (southwest) quadrants, suggesting a strong positive correlation in the
residuals.
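These two plots are easy to reproduce; the sketch below assumes hypothetical NumPy arrays y and x holding the compensation and productivity indices.

import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

fit = sm.OLS(y, sm.add_constant(x)).fit()
u = fit.resid
u_std = u / np.sqrt(fit.scale)                 # standardized residuals (u_hat / sigma_hat)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(u_std, marker="o")                    # time sequence plot of standardized residuals
ax1.set_title("Standardized residuals over time")
ax2.scatter(u[1:], u[:-1])                     # current residuals against lagged residuals
ax2.set_xlabel("u_t"); ax2.set_ylabel("u_(t-1)")
plt.show()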

Fig.6: Current residuals versus lagged residuals
This graphical method, although powerful and suggestive, is subjective or qualitative in nature. But there are several quantitative tests that one can use to supplement it, as follows.
2. The Runs Test
Fig.5 reveals that initially, several residuals are negative, then there is a series of positive
residuals, and then there are several residuals that are negative. If these residuals were purely random,
could we observe such a pattern? Intuitively, it seems unlikely. This intuition can be checked by the so-
called runs test, sometimes also known as the Geary test, a nonparametric test.
Explanation
First note down the signs (+ or −) of the residuals obtained from the wages–productivity regression,
which are given in the first column of Table 2.
(−−−−−−−−−)(+++++++++++++++++++++)(−−−−−−−−−−)
There are 9 negative residuals, 21 positive residuals, followed by 10 negative residuals, for a total of 40
observations.
We now define a run as an uninterrupted sequence of one symbol or attribute, such as + or −. We
further define the length of a run as the number of elements in it. In the above sequence, there are 3 runs:
 A run of 9 minuses (i.e., of length 9),
 a run of 21 pluses (i.e., of length 21) and
 a run of 10 minuses (i.e., of length 10).
By examining how runs behave in a strictly random sequence of observations, one can derive a
test of randomness of runs.
If there are too many runs, it would mean that in our example the residuals change sign
frequently, thus indicating negative serial correlation (Fig.3b). Similarly, if there are too few runs, they
may suggest positive autocorrelation, as in Fig.3a. A priori, then, Fig.5 would indicate positive
correlation in the residuals.

Now let
N = Total number of observations = N1 + N2
N1 = Number of + symbols (i.e., + residuals)
N2 = Number of − symbols (i.e., − residuals)
R = Number of runs
Then under the null hypothesis that the successive outcomes (here, residuals) are independent, and assuming that N1 > 10 and N2 > 10, the number of runs is (asymptotically) normally distributed with
Mean: E(R) = 2N1N2/N + 1
Variance: σR² = 2N1N2(2N1N2 − N) / [N²(N − 1)]
Note: N = N1 + N2.
If the null hypothesis of randomness is sustainable, following the properties of the normal
distribution, we should expect that
Prob [E(R) − 1.96σR ≤ R ≤ E(R) + 1.96σR] = 0.95
That is, the probability is 95 percent that the preceding interval will include R.
Decision Rule:
Do not reject the null hypothesis of randomness with 95% confidence if R, the number of runs,
lies in the preceding confidence interval; reject the null hypothesis if the estimated R lies outside these
limits. (can choose any level of confidence)
In this example,
N1 - the number of minuses = 19
N2 - the number of pluses = 21
R = 3.
Then using the above formula,
E(R) = 2(19)(21)/40 + 1 = 20.95
σR² = 9.6936
σR = 3.1134
The 95% confidence interval for R is thus:
[20.95 ± 1.96(3.1134)] = (14.8477, 27.0523)
Obviously, this interval does not include 3. Hence, we can reject the hypothesis that the residuals
in our wages–productivity regression are random with 95% confidence. In other words, the residuals
exhibit autocorrelation.
As a general rule, if there is
Positive autocorrelation- the number of runs will be few, whereas
Negative autocorrelation - the number of runs will be many.
Swed and Eisenhart have developed special tables that give critical values of the runs expected in a
random sequence of N observations if N1 or N2 is smaller than 20.
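A minimal sketch of the runs test on a vector of OLS residuals, using the large-sample normal approximation above (the array name resid is hypothetical):

import numpy as np

def runs_test(resid: np.ndarray):
    """Return the number of runs, E(R), sigma_R and the normal test statistic."""
    signs = np.sign(resid)
    signs = signs[signs != 0]                          # ignore exact zeros, if any
    R = 1 + int(np.sum(signs[1:] != signs[:-1]))       # number of runs
    n1 = int(np.sum(signs > 0))                        # number of + residuals
    n2 = int(np.sum(signs < 0))                        # number of - residuals
    N = n1 + n2
    mean_R = 2 * n1 * n2 / N + 1
    var_R = 2 * n1 * n2 * (2 * n1 * n2 - N) / (N**2 * (N - 1))
    z = (R - mean_R) / np.sqrt(var_R)
    return R, mean_R, np.sqrt(var_R), z                # reject randomness if |z| > 1.96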

3. Durbin–Watson d Test
The most celebrated test for detecting serial correlation is that developed by statisticians Durbin
and Watson. It is popularly known as the Durbin–Watson d statistic, which is defined as
d = ∑(t=2 to n) (ût − ût−1)² / ∑(t=1 to n) ût²
This is simply the ratio of the sum of squared differences in successive residuals to the RSS. Note
that in the numerator of the d statistic the number of observations is n− 1 (starting from 2) because one
observation is lost in taking successive differences.
A great advantage of the d statistic is that it is based on the estimated residuals, which are
routinely computed in regression analysis. Because of this advantage, it is now a common practice to
report the Durbin–Watson d along with summary measures, such as R2, adjusted R2, t, and F.
Assumptions underlying the d statistic:
1. The regression model includes the intercept term. If it is not present, as in the case of the regression
through the origin, it is essential to rerun the regression including the intercept term to obtain the
RSS.
2. The explanatory variables, the X’s, are nonstochastic, or fixed in repeated sampling.
3. The disturbances ut are generated by the first-order autoregressive scheme: ut = ρut−1 + εt. Therefore,
it cannot be used to detect higher-order autoregressive schemes.
4. The error term ut is assumed to be normally distributed.
5. The regression model does not include the lagged value(s) of the dependent variable as one of the
explanatory variables. Thus, the test is inapplicable in models of the following type:
a. Yt = β1 + β2X2t + β3X3t + · · · + βkXkt + γYt−1 + ut
where Yt−1 is the one period lagged value of Y. Such models are known as autoregressive models.
6. There are no missing observations in the data; if there are, the d statistic makes no allowance for such missing observations.
The exact sampling or probability distribution of the d statistic is difficult to derive.
Unlike the t, F, or χ2 tests, there is no unique critical value that will lead to the rejection or the
acceptance of the null hypothesis that there is no first-order serial correlation in the disturbances ui.
However, Durbin and Watson were successful in deriving a lower bound dL and an upper bound dU such
that if the computed d lies outside these critical values, a decision can be made regarding the presence of
positive or negative serial correlation.
Durbin and Watson prepared tables of d values for n ranging from 6 to 200 and for up to 20 explanatory variables. These tables give the lower and upper limits of d (dL and dU), from which the presence or absence of autocorrelation can be judged.
Test Procedure
It can be explained with the aid of Fig.7, which shows that the limits of d are 0 and 4. These can
be established as follows
The expanded form of the above equation for d is
d = (∑ût² + ∑ût−1² − 2∑ût ût−1) / ∑ût²
Since ∑ût² and ∑ût−1² differ in only one observation, they are approximately equal. Therefore, setting ∑ût−1² ≈ ∑ût², the above equation may be written as
d ≈ 2[1 − (∑ût ût−1 / ∑ût²)]
where ≈ means approximately.

Fig. 7: Durbin–Watson d statistic

Now let us define
ρˆ = ∑ût ût−1 / ∑ût²
as the sample first-order coefficient of autocorrelation, an estimator of ρ. Substituting ρˆ into the above equation, we get
d ≈ 2(1 − ρˆ)
But since −1 ≤ ρˆ ≤ 1, this implies that
0 ≤ d ≤ 4
These are the bounds of d; any estimated d value must lie within these limits.
 If ρˆ = 0, d = 2; - no first-order autocorrelation, either positive or negative.
 If ρˆ = +1, d ≈ 0 -indicating perfect positive correlation in the residuals.
 If ρˆ = -1, d ≈ 4 - indicating perfect negative correlation in the residuals.
Therefore,
The closer d is to 0, the greater the evidence of positive serial correlation. If there is positive
autocorrelation, the ˆut ’s will be bunched together and their differences will therefore tend to be small.
As a result, the numerator sum of squares will be smaller in comparison with the denominator sum of
squares, which remains a unique value for any given regression.
If ρˆ = −1, that is, there is perfect negative correlation among successive residuals, d ≈ 4. Hence,
the closer d is to 4, the greater the evidence of negative serial correlation. For if there is negative
autocorrelation, a positive ˆut will tend to be followed by a negative ˆut and vice versa so that | ˆut −
ˆut−1| will usually be greater than | ˆut |. Therefore, the numerator of d will be comparatively larger than
the denominator.
Mechanics of the Durbin–Watson test
Assuming that the assumptions underlying the test are fulfilled:
1. Run the OLS regression and obtain the residuals.
2. Compute d. (Most computer programs now do this routinely.)
3. For the given sample size and given number of explanatory variables find out the critical dL and dU
values.

4. Now follow the decision rules given in Table-3 below. For ease of reference, these decision rules are also depicted in Fig.7 above.
Table-3: Durbin–Watson d Test: Decision Rules
Null hypothesis | Decision | If
No positive autocorrelation | Reject | 0 < d < dL
No positive autocorrelation | No decision | dL ≤ d ≤ dU
No negative autocorrelation | Reject | 4 − dL < d < 4
No negative autocorrelation | No decision | 4 − dU ≤ d ≤ 4 − dL
No autocorrelation, positive or negative | Do not reject | dU < d < 4 − dU
Illustration: For the wages–productivity regression with the data given in Table 2, the estimated d value can be shown to be 0.1229, suggesting that there is positive serial correlation in the residuals. From the Durbin–Watson tables, we find that for 40 observations and one explanatory variable, dL = 1.44 and dU = 1.54 at the 5 percent level. Since the computed d of 0.1229 lies below dL, we reject the null hypothesis of no positive serial correlation; that is, the residuals exhibit positive serial correlation.
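The d statistic is easy to compute directly from the residuals, and statsmodels provides the same quantity; a small sketch (assuming a fitted OLS results object fit):

import numpy as np
from statsmodels.stats.stattools import durbin_watson

def dw(resid: np.ndarray) -> float:
    """Durbin-Watson d = sum of squared successive differences / residual sum of squares."""
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

# print(dw(fit.resid), durbin_watson(fit.resid))   # the two values should agree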
However, the d test has the great drawback that if the computed value falls in the indecisive zone, one cannot conclude whether (first-order) autocorrelation does or does not exist. In that case the following modified d test can be used.
Given the level of significance α,
1. H0: ρ = 0 versus H1: ρ > 0. Reject H0 at the α level if d < dU. That is, there is statistically significant positive autocorrelation.
2. H0: ρ = 0 versus H1: ρ < 0. Reject H0 at the α level if the estimated (4 − d) < dU, that is, there is statistically significant evidence of negative autocorrelation.
3. H0: ρ = 0 versus H1: ρ ≠ 0. Reject H0 at the 2α level if d < dU or (4 − d) < dU, that is, there is statistically significant evidence of autocorrelation, positive or negative.
It may be pointed out that the indecisive zone narrows as the sample size increases, which can be
seen clearly from the Durbin–Watson tables. For example, with 4 regressors and 20 observations, the 5
percent lower and upper d values are 0.894 and 1.828, respectively, but these values are 1.515 and 1.739
if the sample size is 75.
The computer program Shazam performs an exact d test, that is, it gives the p value, the exact
probability of the computed d value. With modern computing facilities, it is no longer difficult to find the
p value of the computed d statistic. Using SHAZAM (version 9) the wages–productivity regression, the p
value of the computed d of 0.1229 is practically zero, thereby reconfirming our earlier conclusion based
on the Durbin–Watson tables.
The Durbin–Watson d test has become so venerable that practitioners often forget the
assumptions underlying the test. In particular, the assumptions that (1) the explanatory variables, or
regressors, are nonstochastic; (2) the error term follows the normal distribution; and (3) that the regression
models do not include the lagged value(s) of the regressand are very important for the application of the d
test.
If a regression model contains lagged value(s) of the regressand, the d value in such cases is often
around 2, which would suggest that there is no (first-order) autocorrelation in such models. Thus, there is
a built-in bias against discovering (first-order) autocorrelation in such models.
Breusch–Godfrey (BG) Test
This test allows for
(1) nonstochastic regressors, such as the lagged values of the regressand;
(2) higher-order autoregressive schemes, such as AR(1), AR(2), etc.; and
(3) simple or higher-order moving averages of white noise error terms, such as εt in Equation (12). The BG test is also known as the LM (Lagrange multiplier) test.

Ex: Consider the two-variable regression model, although many regressors can be added to the model, as can lagged values of the regressand. Let
Yt = β1 + β2Xt + ut ………….(1)
Assume that the error term ut follows the pth-order autoregressive, AR(p), scheme as follows:
ut = ρ1ut−1 + ρ2ut−2 + · · · + ρput−p + εt ………(2)
where εt is a white noise error term as discussed previously. This is simply the extension of the AR(1) scheme. The null hypothesis H0 to be tested is that
H0: ρ1 = ρ2 = · · · = ρp = 0 ………(3)
That is, there is no serial correlation of any order. The BG test involves the following steps:
1. Estimate above function by OLS and obtain the residuals, ˆut.
2. Regress ˆut on the original Xt (if there is more than one X variable in the original model, include them
also) and ˆut−1, ˆut−2, . . . , ˆut−p, where the latter are the lagged values of the estimated residuals in
step 1. Thus, if p = 4, we will introduce four lagged values of the residuals as additional regressors in
the model. Note that to run this regression we will have only (n − p) observations. In short, run the
following regression:
ût = α1 + α2Xt + ρˆ1ût−1 + ρˆ2ût−2 + · · · + ρˆpût−p + εt ……….(4)
and obtain R2 from this (auxiliary) regression.
3. If the sample size is large (technically, infinite), Breusch and Godfrey have shown that
(n − p)R² ~ χ²(p) (asymptotically) …………..(5)
That is, asymptotically, n − p times the R2 value obtained from the eqn (4) regression follows the chi-
square distribution with p df. If in an application, (n − p)R2 exceeds the critical chi-square value at the
chosen level of significance, we reject the null hypothesis, in which case at least one rho in eqn (2) is
statistically significantly different from zero.
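In practice the BG test is available directly in statsmodels; the sketch below assumes a fitted OLS results object fit and tests up to fourth-order serial correlation.

from statsmodels.stats.diagnostic import acorr_breusch_godfrey

lm_stat, lm_pvalue, f_stat, f_pvalue = acorr_breusch_godfrey(fit, nlags=4)
# Reject H0 of no serial correlation up to order 4 if lm_pvalue is below the chosen alpha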

Heteroscedasticity

An important assumption of the classical linear regression model (Assumption 4) is that the disturbances ui appearing in the population regression function are homoscedastic; that is, they all have the same variance. In the following sections let us try to understand the nature of heteroscedasticity, its consequences, how to detect it, and the different remedial measures.
Nature of Heteroscedasticity
One of the important assumptions of the classical linear regression model is that the variance of each disturbance term ui, conditional on the chosen values of the explanatory variables, is some constant number equal to σ². This is the assumption of homoscedasticity, or equal (homo) spread (scedasticity), that is, equal variance. Symbolically,
E(ui²) = σ², i = 1, 2, . . . , n
Diagrammatically, in the two-variable regression model homoscedasticity can be shown as

Fig.-1: Homoscedastic disturbances Fig.-2: Heteroscedastic disturbances


As Fig. 1 shows, the conditional variance of Yi (which is equal to that of ui), conditional upon the given Xi, remains the same regardless of the values taken by the variable X. In contrast, Fig. 2 shows that the conditional variance of Yi increases as X increases. Here, the variances of Yi are not the same; hence, there is heteroscedasticity. Symbolically,
E(ui²) = σi²
Note the subscript on σ², which indicates that the conditional variances of ui (= conditional variances of Yi) are no longer constant.
To make the difference between homoscedasticity and heteroscedasticity clear, assume
that in the two-variable model
Yi = β1 + β2Xi + ui,
where Y represents savings and X represents income. Figures 1 and 2 show that as income increases, savings on the average also increase. But in Fig. 1 the variance of savings remains the same at all levels of income, whereas in Fig. 2 it increases with income; that is, higher-income families on the average save more than lower-income families, but there is also more variability in their savings.
Reasons for varying variances of ui
1. Following the error-learning models, as people learn, their errors of behavior become smaller over time. In this case, σi² is expected to decrease. For example, Fig. 3 relates the number of typing errors made in a given time period on a test to the hours put into typing practice: as the number of hours of typing practice increases, the average number of typing errors as well as their variance decreases.
Fig. 3: Illustration of heteroscedasticity
2. As incomes grow, people have more discretionary income and hence more scope for choice
about the disposition of their income. Hence, σi2 is likely to increase with income. Thus in the
regression of savings on income one is likely to find σi2 increasing with income because people
have more choices about their savings behavior. Similarly, companies with larger profits are
generally expected to show greater variability in their dividend policies than companies with
lower profits. Also, growth oriented companies are likely to show more variability in their
dividend payout ratio than established companies.
3. As data collecting techniques improve, σi2 is likely to decrease. Thus, banks that have
sophisticated data processing equipment are likely to commit fewer errors in the monthly or
quarterly statements of their customers than banks without such facilities.
4. Heteroscedasticity can also arise as a result of the presence of outliers. An outlying
observation, or outlier, is an observation that is much different (either very small or very large)
in relation to the other observations in the sample. More precisely, an outlier is an observation drawn from a different population than the one generating the remaining sample observations. The inclusion or
exclusion of such an observation, especially if the sample size is small, can substantially alter
the results of regression analysis.
Fig.4: The relationship between stock prices and consumer prices
5. Another source of heteroscedasticity arises from violating Assumption 9 of CLRM, viz., that
the regression model is correctly specified. Heteroscedasticity may be due to the fact that some
important variables are omitted from the model. Thus, in the demand function for a
commodity, if we do not include the prices of commodities complementary to or competing
with the commodity in question (the omitted variable bias), the residuals obtained from the
regression may give the distinct impression that the error variance may not be constant. But if
the omitted variables are included in the model, that impression may disappear.
Example: consider advertising impressions retained (Y) in relation to advertising expenditure (X). If we regress Y on X only and examine the residuals from this regression, we see one pattern; but if we regress Y on X and X², we see another pattern, as Fig. 5 shows clearly. We have already seen that X² belongs in the model.
Fig. 5: Residuals from the regression of (a) impressions on advertising expenditure and (b) impressions on Adexp and Adexp².
6. Another source of heteroscedasticity is skewness in the distribution of one or more regressors
included in the model. Examples are economic variables such as income, wealth, and
education. It is well known that the distribution of income and wealth in most societies is
uneven, with the bulk of the income and wealth being owned by a few at the top.
7. Other sources of heteroscedasticity: As David Hendry notes, heteroscedasticity can also arise
because of (1) incorrect data transformation (e.g., ratio or first difference transformations) and
(2) incorrect functional form (e.g., linear versus log–linear models).
Note that the problem of heteroscedasticity is likely to be more common in cross-sectional
than in time series data. In cross-sectional data, one usually deals with members of a population at
a given point in time, such as individual consumers or their families, firms, industries, or
geographical subdivisions such as state, country, city, etc. Moreover, these members may be of
different sizes, such as small, medium, or large firms or low, medium, or high income. In time
series data, on the other hand, the variables tend to be of similar orders of magnitude because one
generally collects the data for the same entity over a period of time. Examples are GNP,
consumption expenditure, savings, or employment in the United States, say, for the period 1950 to
2000.
The following table gives data on compensation per employee in 10 nondurable goods manufacturing industries, classified by the employment size of the firm or establishment for the year 1958. Also given in the table are average productivity figures for nine employment classes. Although the industries differ in their output composition, the table shows clearly that on average large firms pay more than small firms. As an example, firms employing one to four employees paid on average about $3396, whereas those employing 1000 to 2499 employees paid on average about $4843. But notice that there is considerable variability in earnings among the various employment classes, as indicated by the estimated standard deviations of earnings. This can also be seen from Fig. 6, which plots the standard deviation of compensation and average compensation in each employment class. It is clear that, on average, the standard deviation of compensation increases with the average value of compensation.
Table 1: Compensation per employee ($) in nondurable manufacturing industries according to employment size of establishment, 1958
Fig. 6: Standard deviation of compensation and mean compensation.
OLS Estimation in the Presence of Heteroscedasticity
What happens to OLS estimators and their variances if we introduce heteroscedasticity by
letting E(ui2 ) = σi2 but retain all other assumptions of the classical model? To answer this question,
let us revert to the two-variable model:
Yi = β1 + β2Xi + ui
Applying the usual OLS formula, ˆβ2 = Σxiyi / Σxi² (with xi and yi measured as deviations from their sample means), the estimator remains unbiased,
but its variance is now given by the following expression:
var(ˆβ2) = Σxi²σi² / (Σxi²)²
which is obviously different from the usual variance formula obtained under the assumption of homoscedasticity, namely,
var(ˆβ2) = σ² / Σxi²
Of course, if σi² = σ² for each i, the two formulas will be identical.
As a matter of fact, to establish the unbiasedness of ˆβ2 it is not necessary that the
disturbances (ui) be homoscedastic. In fact, the variance of ui, homoscedastic or heteroscedastic,
plays no part in the determination of the unbiasedness property.
ˆβ2 is a consistent estimator despite heteroscedasticity; that is, as the sample size increases
indefinitely, the estimated β2 converges to its true value. Furthermore, it can also be shown that
under certain conditions (called regularity conditions), ˆβ2 is asymptotically normally distributed.
Of course, what we have said about ˆβ2 also holds true of other parameters of a multiple regression
model. Granted that ˆβ2 is still linear unbiased and consistent, is it “efficient” or “best”; that is,
does it have minimum variance in the class of unbiased estimators? The answer is no: ˆβ2 is no
longer best. Then what is BLUE in the presence of heteroscedasticity?
The answer is given in the following section.
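Before turning to GLS, the following small simulation sketch (an illustrative addition; the NumPy-based setup and the data-generating values are assumptions) demonstrates the two points just made: under heteroscedasticity the OLS slope remains unbiased, but the usual homoscedastic variance formula σ²/Σxi² no longer describes its true sampling variance.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 100, 5000
beta1, beta2 = 2.0, 0.5

x = rng.uniform(1.0, 10.0, size=n)        # regressor held fixed across replications
sigma_i = 0.5 * x                          # error s.d. grows with X: heteroscedasticity
xd = x - x.mean()

slopes, usual_vars = [], []
for _ in range(reps):
    u = rng.normal(scale=sigma_i, size=n)
    y = beta1 + beta2 * x + u
    b2 = np.sum(xd * (y - y.mean())) / np.sum(xd ** 2)     # OLS slope
    resid = y - (y.mean() - b2 * x.mean()) - b2 * x
    sigma2_hat = np.sum(resid ** 2) / (n - 2)
    slopes.append(b2)
    usual_vars.append(sigma2_hat / np.sum(xd ** 2))        # usual homoscedastic formula

print("true beta2:                          ", beta2)
print("mean of OLS slopes (unbiasedness):   ", np.mean(slopes))
print("true sampling variance of the slope: ", np.var(slopes))
print("average of usual variance estimates: ", np.mean(usual_vars))
```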
The Method of Generalized Least Squares (GLS)
Why is the usual OLS estimator of β2 given above not best, although it is still unbiased?
Intuitively, we can see the reason from Table 1 above. As the table shows, there is considerable
variability in the earnings between employment classes. If we were to regress per-employee
compensation on the size of employment, we would like to make use of the knowledge that there is
considerable interclass variability in earnings. Ideally, we would like to devise the estimating
scheme in such a manner that observations coming from populations with greater variability are
given less weight than those coming from populations with smaller variability. Examining Table 1,
we would like to weight observations coming from employment classes 10–19 and 20–49 more
heavily than those coming from employment classes like 5–9 and 250–499, for the former are
more closely clustered around their mean values than the latter, thereby enabling us to estimate the
PRF more accurately.
Unfortunately, the usual OLS method does not follow this strategy and therefore does not
make use of the “information” contained in the unequal variability of the dependent variable Y,
say, employee compensation of Table 1: It assigns equal weight or importance to each observation.
But a method of estimation, known as generalized least squares (GLS), takes such information
into account explicitly and is therefore capable of producing estimators that are BLUE. To see how, let us consider the two-variable model:
Yi = β1 + β2Xi + ui
which for ease of algebraic manipulation we write as
Yi = β1X0i + β2Xi + ui
where X0i = 1 for each i.
Now assume that the heteroscedastic variances σi² are known. Divide the above equation through by σi to obtain
Yi/σi = β1(X0i/σi) + β2(Xi/σi) + ui/σi
which for ease of exposition we write as
Yi* = β*1X0i* + β*2Xi* + ui*
where the starred, or transformed, variables are the original variables divided by (the known) σi. We use the notation β*1 and β*2, the parameters of the transformed model, to distinguish them from the usual OLS parameters β1 and β2. What is the purpose of transforming the original model? To see this, notice the following feature of the transformed error term ui*:
var(ui*) = E(ui*²) = E(ui²/σi²) = (1/σi²)E(ui²) = σi²/σi² = 1
which is a constant; that is, the variance of the transformed disturbance term ui* is now homoscedastic. Since the other assumptions of the classical model are still satisfied, the finding that ui* is homoscedastic suggests that if we apply OLS to the transformed model it will produce estimators that are BLUE. In short, it is the estimated β*1 and β*2 that are now BLUE, not the OLS estimators ˆβ1 and ˆβ2.
The procedure of transforming the original variables in such a way that the transformed
variables satisfy the assumptions of the classical model and then applying OLS to them is known
as the method of Generalized Least Squares (GLS). In short, GLS is OLS on the transformed
variables that satisfy the standard least-squares assumptions. The estimators thus obtained are
known as GLS estimators, and it is these estimators that are BLUE.
Mechanics of estimating β*1 and β*2
First, write down the SRF of the transformed model:
Yi* = ˆβ*1X0i* + ˆβ*2Xi* + ûi*
To obtain the GLS estimators, minimize
Σûi*² = Σ(Yi* − ˆβ*1X0i* − ˆβ*2Xi*)²
i.e.,
Σ(ûi/σi)² = Σ[(Yi − ˆβ*1X0i − ˆβ*2Xi)/σi]²
Then the GLS estimator of β*2 and its variance are given by
ˆβ*2 = [(Σwi)(ΣwiXiYi) − (ΣwiXi)(ΣwiYi)] / [(Σwi)(ΣwiXi²) − (ΣwiXi)²]
var(ˆβ*2) = Σwi / [(Σwi)(ΣwiXi²) − (ΣwiXi)²]
where wi = 1/σi².
Difference between OLS and GLS
In OLS we minimize an unweighted or (what amounts to the same thing) equally weighted residual sum of squares, whereas in GLS we minimize a weighted sum of squared residuals, with wi = 1/σi² acting as the weights.
To see the difference between OLS and GLS clearly, consider the following hypothetical scattergram (Fig. 7).
Fig.7: Hypothetical scattergram
In the (unweighted) OLS, each ûi2 associated with points A, B, and C will receive the same
weight in minimizing the RSS. Obviously, in this case the ûi2 associated with point C will
dominate the RSS. But in GLS the extreme observation C will get relatively smaller weight than
the other two observations. As noted earlier, this is the right strategy, for in estimating the
population regression function (PRF) more reliably we would like to give more weight to
observations that are closely clustered around their (population) mean than to those that are widely
scattered about.
Since the above procedure minimizes a weighted RSS, it is appropriately known as weighted
least squares (WLS), and the estimators thus obtained are known as WLS estimators. But WLS
is just a special case of the more general estimating technique, GLS. In the context of
heteroscedasticity, one can treat the two terms WLS and GLS interchangeably.
Note that if wi = w, a constant for all i, ˆβ*2 is identical with ˆβ2 and var (ˆβ*2) is identical
with the usual (i.e., homoscedastic) var (ˆβ2), which should not be surprising.
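To make the mechanics concrete, the sketch below (an illustrative addition that assumes the σi are known, which is rarely true in practice) obtains the GLS/WLS estimates in two equivalent ways: by dividing Yi, X0i (= 1), and Xi by σi and running plain OLS on the transformed variables, and by calling WLS in statsmodels with weights wi = 1/σi².

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 200
x = rng.uniform(1.0, 10.0, size=n)
sigma_i = 0.4 * x                              # (assumed) known heteroscedastic s.d.
y = 2.0 + 0.5 * x + rng.normal(scale=sigma_i, size=n)

# GLS by hand: divide Y, X0 (= 1), and X by sigma_i, then apply plain OLS
X = sm.add_constant(x)                         # columns: X0i = 1 and Xi
X_star = X / sigma_i[:, None]
y_star = y / sigma_i
gls_manual = sm.OLS(y_star, X_star).fit()      # no extra constant: X0i*/sigma_i is already a column

# The same point estimates via WLS with weights w_i = 1 / sigma_i^2
wls = sm.WLS(y, X, weights=1.0 / sigma_i ** 2).fit()

print("manual GLS coefficients:  ", gls_manual.params)
print("statsmodels WLS coefficients:", wls.params)          # identical to the manual GLS
print("plain OLS coefficients:   ", sm.OLS(y, X).fit().params)
```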
Detection of Heteroscedasticity
Informal Methods
Nature of the Problem
Prais and Houthakker, in their study of family budgets, found that the residual variance around the regression of consumption on income increased with income. One now generally assumes that in similar surveys one can expect unequal variances among the
disturbances. As a matter of fact, in cross-sectional data involving heterogeneous units,
heteroscedasticity may be the rule rather than the exception.
Graphical Method
One can do the regression analysis on the assumption that there is no heteroscedasticity and then do a postmortem examination of the squared residuals ûi² to see whether they exhibit any systematic pattern. Although the ûi² are not the same thing as the ui², they can be used as proxies, especially if the sample size is sufficiently large. An examination of the ûi² may reveal patterns such as those shown in Fig. 8.
Fig.8: Hypothetical patterns of estimated squared residuals.
In the above figure ûi2 are plotted against Ŷi, the estimated Yi from the regression line, the
idea being to find out whether the estimated mean value of Y is systematically related to the
squared residual.
In Fig. 8(a) one can see that there is no systematic pattern between the two variables, suggesting that perhaps no heteroscedasticity is present in the data. Figures 8(b) to 8(e), however, exhibit definite patterns. For instance, Fig. 8(c) suggests a linear relationship, whereas Figs. 8(d) and 8(e) indicate a quadratic relationship between ûi² and Ŷi. Using such knowledge, albeit informal, one may transform the data in such a manner that the transformed data do not exhibit heteroscedasticity.
Instead of plotting ûi² against Ŷi, one may plot them against one of the explanatory variables, especially if plotting ûi² against Ŷi results in a pattern similar to that in Fig. 8(a). In the case of the two-variable model, plotting ûi² against Ŷi is equivalent to plotting them against Xi.
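As a quick sketch (an illustrative addition; matplotlib and the simulated heteroscedastic data are assumptions), such diagnostic plots of ûi² against Ŷi and against Xi can be produced as follows.

```python
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
n = 150
x = rng.uniform(1.0, 10.0, size=n)
y = 2.0 + 0.5 * x + rng.normal(scale=0.4 * x, size=n)   # error variance rises with X

res = sm.OLS(y, sm.add_constant(x)).fit()
u2 = res.resid ** 2                                      # squared residuals as proxies for u_i^2

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))
ax1.scatter(res.fittedvalues, u2)
ax1.set(xlabel="fitted values (Y-hat)", ylabel="squared residuals")
ax2.scatter(x, u2)
ax2.set(xlabel="X", ylabel="squared residuals")
plt.tight_layout()
plt.show()        # a fan- or funnel-shaped pattern suggests heteroscedasticity
```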
Formal Methods
Park Test
Park formalizes the graphical method by suggesting that σi² is some function of the explanatory variable Xi. The functional form he suggested was
σi² = σ² Xi^β e^(vi)
or
ln σi² = ln σ² + β ln Xi + vi
where vi is the stochastic disturbance term.
Since σi² is generally not known, Park suggests using ûi² as a proxy and running the following regression:
ln ûi² = ln σ² + β ln Xi + vi = α + β ln Xi + vi
If β turns out to be statistically significant, it would suggest that heteroscedasticity is present in the data. If it turns out to be insignificant, we may accept the assumption of homoscedasticity. The Park test is thus a two-stage procedure:
- First stage: Run the OLS regression disregarding the heteroscedasticity question and obtain the residuals ûi.
- Second stage: Run the regression of ln ûi² on ln Xi.
Although empirically appealing, the Park test has a drawback: Goldfeld and Quandt have argued that the error term vi may not satisfy the OLS assumptions and may itself be heteroscedastic.
Illustration of Park approach
Using the data given in Table 1, run the following regression:
Yi = β1 + β2Xi + ui
where Y = average compensation in thousands of dollars, X = average productivity in thousands of dollars, and i = ith employment class of the establishment. The results of the regression were as follows:

The results reveal that the estimated slope coefficient is significant at the 5 percent level on the
basis of a one-tail t test. The equation shows that as labor productivity increases by, say, a dollar,
labor compensation on the average increases by about 23 cents.
The residuals obtained from this regression were then regressed on Xi as suggested by the Park test, giving the following results:

Inference: There is no statistically significant relationship between the two variables. Following
the Park test, one may conclude that there is no heteroscedasticity in the error variance.
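The two stages of the Park test can be coded directly. The sketch below (an illustrative addition using simulated data rather than the Table 1 figures) runs the first-stage OLS regression and then regresses ln ûi² on ln Xi, inspecting the t statistic on the slope β.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 120
x = rng.uniform(1.0, 10.0, size=n)
y = 2.0 + 0.5 * x + rng.normal(scale=0.3 * x, size=n)    # heteroscedastic errors

# Stage 1: OLS regression disregarding the heteroscedasticity question
stage1 = sm.OLS(y, sm.add_constant(x)).fit()
uhat2 = stage1.resid ** 2

# Stage 2: Park regression  ln(uhat^2) = alpha + beta * ln(X) + v
park = sm.OLS(np.log(uhat2), sm.add_constant(np.log(x))).fit()
print("slope beta:", park.params[1])
print("t statistic:", park.tvalues[1], " p-value:", park.pvalues[1])
# A statistically significant beta suggests heteroscedasticity.
```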
Glejser Test
The Glejser test is similar in spirit to the Park test. After obtaining the residuals ˆui from
the OLS regression, Glejser suggests regressing the absolute values of ˆui on the X variable that is
thought to be closely associated with σ2i. Glejser used the following functional forms:
|ûi| = β1 + β2Xi + vi
|ûi| = β1 + β2√Xi + vi
|ûi| = β1 + β2(1/Xi) + vi
|ûi| = β1 + β2(1/√Xi) + vi
|ûi| = √(β1 + β2Xi) + vi
|ûi| = √(β1 + β2Xi²) + vi
where vi is the error term.
Illustration:
Continuing with the previous example, the absolute values of the residuals obtained from the regression were regressed on average productivity (X), giving the following results:

Inference: Regression results revealed that there is no relationship between the absolute value of
the residuals and the regressor, average productivity. This reinforces the conclusion based on the
Park test.
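A minimal sketch of the Glejser test follows (an illustrative addition; only the first functional form, |ûi| on Xi, is shown, and the simulated data are an assumption). The other forms are obtained by replacing Xi with √Xi, 1/Xi, and so on.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 120
x = rng.uniform(1.0, 10.0, size=n)
y = 2.0 + 0.5 * x + rng.normal(scale=0.3 * x, size=n)

# Obtain the OLS residuals from the original regression
resid = sm.OLS(y, sm.add_constant(x)).fit().resid

# Glejser regression: |u_i| on X_i
glejser = sm.OLS(np.abs(resid), sm.add_constant(x)).fit()
print("slope:", glejser.params[1],
      " t:", glejser.tvalues[1],
      " p:", glejser.pvalues[1])
# A statistically significant slope points to heteroscedasticity.
```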
Critics: Goldfeld and Quandt point out that the error term vi has some problems: its expected value is nonzero, it is serially correlated, and, ironically, it is itself heteroscedastic. An additional difficulty with the Glejser method is that the last two of the preceding models, namely
|ûi| = √(β1 + β2Xi) + vi
and
|ûi| = √(β1 + β2Xi²) + vi
are nonlinear in the parameters and therefore cannot be estimated with the usual OLS procedure.
Glejser has found that for large samples the first four of the preceding models give generally satisfactory results in detecting heteroscedasticity. As a practical matter, therefore, the Glejser technique may be used in large samples, and in small samples strictly as a qualitative device to learn something about heteroscedasticity.
Spearman's Rank Correlation Test