


Lecture Notes: Econometrics I

Andrea Weber

Institute for Advanced Studies


Department of Economics and Finance
email: andrea.weber@ihs.ac.at

December, 2003

Help in text processing from Andrey Launov and Ivan Prianichnikov is gratefully acknowledged.
Thanks to Michael Grabner for helpful comments and finding lots of typos.
Contents

1 Introduction 3

2 Descriptive linear regression 5


2.1 The method of least squares . . . . . . . . . . . . . . . . . . . . . . 5
2.2 The geometry of least squares . . . . . . . . . . . . . . . . . . . . . 6
2.3 Measuring the goodness of fit . . . . . . . . . . . . . . . . . . . . . . 11

3 The classical linear regression model 15


3.1 Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.2 The statistical estimation problem . . . . . . . . . . . . . . . . . . . 17
3.3 Prediction or the out of sample forecasting . . . . . . . . . . . . . . . 24

4 Stochastic regression 25

5 Statistical inference in the classical linear regression model 27


5.1 Introduction and review . . . . . . . . . . . . . . . . . . . . . . . . . 27
5.2 Testing principles . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
5.3 Testing values of β: practical part . . . . . . . . . . . . . . . . . . . 32
5.4 Testing values of β: theoretical part . . . . . . . . . . . . . . . . . . 39
5.5 Confidence intervals . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.6 Testing linear restrictions . . . . . . . . . . . . . . . . . . . . . . . . 46

6 Some tests for specification error 56


6.1 Tests for structural change . . . . . . . . . . . . . . . . . . . . . . . 56
6.2 Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
6.3 Further specification tests . . . . . . . . . . . . . . . . . . . . . . . . 64
6.4 Tests based on recursive estimation: CUSUM and CUSUMQ test . . . 65

7 Asymptotic theory 68
7.1 Introduction to asymptotic theory . . . . . . . . . . . . . . . . . . . . 68
7.2 Asymptotic properties of OLS estimators . . . . . . . . . . . . . . . 73

8 The generalised linear regression model 77
8.1 Aitken estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
8.2 Asymptotic properties of GLS . . . . . . . . . . . . . . . . . . . . . 79
8.3 Heteroscedasticity . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
8.4 Autocorrelation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

9 Limited dependent variables models 98


9.1 Binary regression models: Logit, Probit . . . . . . . . . . . . . . . . 98
9.2 Censored regression models: Tobit . . . . . . . . . . . . . . . . . . . 113

References
Baltagi, B. H. (Ed.), 2001. A companion to theoretical econometrics. Blackwell Publishers.

Davidson, R., MacKinnon, J., 1993. Estimation and inference in econometrics. Oxford
University Press.

Greene, W. H., 1997. Econometrics, 3rd Edition. Prentice Hall.

Hayashi, F., 2000. Econometrics. Princeton University Press.

Johnston, J., DiNardo, J., 1997. Econometric methods, 4th Edition. McGraw Hill.

Schönfeld, P., 1969. Methoden der Ökonometrie I. Verlag Franz Vahlen.

Scott Long, J., 1997. Regression models for categorical and limited dependent variables. SAGE Publications.

Wooldridge, J., 2000. Introductory econometrics: a modern approach. South-Western College Publishing.

1 Introduction

The aim of the lecture is twofold. First, students should receive guidelines for applied
empirical research. Second, the lecture should also provide a sound theoretical basis
for advanced econometrics courses.

What is Econometrics?

At the beginning of the twentieth century economic theory was mainly intuitive and
empirical support for it was largely anecdotal. Now economics has a rich array of
formal models and a high quality data base. Empirical regularities motivate theory in
many areas of economics, and data are routinely used to test theory. Many economic
theories have been developed as measurement frameworks to suggest what data should
be collected and how they should be interpreted.
Econometric theory was developed to analyse and interpret economic data. Most
econometric theory adopts methods originally developed in statistics.


According to Heckman (2000, Quarterly Journal of Economics) the major achievements of
econometrics during the twentieth century were

- The definition of a causal parameter within a well-defined economic model.
- Analysis of what is required to recover causal parameters from data (the identification problem). Many theoretical models may be consistent with the same data.
- Clarification of the role of causal parameters in policy evaluation and in forecasting the effects of policies never previously experienced.

The concept of a causal parameter

By a causal effect economists mean a "ceteris paribus" change (all other things being equal).
Consider, for example, a model of production of output $y$ based on inputs $x$ that can
be varied independently. We write the function
$$y = g(x_1, x_2, \dots, x_k)$$
where $x = (x_1, \dots, x_k)$ is a vector of inputs. Assuming that each input can be
freely varied, the change in $y$ produced from the variation in $x_j$, holding all other
inputs constant, is the causal effect of $x_j$. If $g$ is differentiable in $x_j$, the marginal
causal effect of $x_j$ is
$$\frac{\partial g(x_1, \dots, x_k)}{\partial x_j}.$$
A special case occurs if $g$ is separable in $x_j$,
$$y = g_j(x_j) + g_{-j}(x_{-j}),$$
and the causal effect of $x_j$ can be defined independently of the level of the other
values of $x$.
Examples

Price effect on consumer demand

Effects of fertiliser on crop yields

Measuring returns to education

The structure of economic data

- Cross-sectional data set: a set of individuals, firms, regions, etc., observed at a given point in time. Assumption: random sampling from the underlying population.
- Time series data set: observations of a variable over time, e.g. stock prices, the consumer price index, etc. The chronological ordering of observations contains potentially important information.
- Pooled cross-sectional data set.
- Panel or longitudinal data set: a time series for each cross-sectional member in the data set, e.g. household panel surveys, OECD main economic indicators.

2 Descriptive linear regression

2.1 The method of least squares

As an extension of the linear regression model in two variables let us introduce the
multiple linear regression model.


Observations:       $(y_i, x_{i1}, \dots, x_{ik})$, $i = 1, \dots, n$
Functional form:    $y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik} + \varepsilon_i$
Fitting criterion:  $\min_\beta \sum_{i=1}^n \varepsilon_i^2 = \sum_{i=1}^n (y_i - \beta_1 x_{i1} - \dots - \beta_k x_{ik})^2$

Table 1: The multiple linear regression model, $i = 1, \dots, n$

To make notation more convenient we transform the model into matrix form. We define
the n-dimensional vectors of observations and error terms and the k-dimensional parameter vector,
$$y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, \qquad \varepsilon = \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{pmatrix}, \qquad \beta = \begin{pmatrix} \beta_1 \\ \vdots \\ \beta_k \end{pmatrix},$$
and the $n \times k$ matrix
$$X = \begin{pmatrix} x_{11} & \dots & x_{1k} \\ \vdots & & \vdots \\ x_{n1} & \dots & x_{nk} \end{pmatrix}.$$
In matrix notation the multiple linear regression model can be written in the following
way
$$y = X\beta + \varepsilon$$
and the fitting criterion is given by
$$\min_{\beta}\ \varepsilon'\varepsilon = (y - X\beta)'(y - X\beta).$$

In the literature we find several names for the variables in the model, which are listed
below. We will commonly use the names in the first row.



dependent variable independent variables error
explained variable explanatory variables disturbance
regressand regressors, covariates

Table 2: variable names

2.2 The geometry of least squares

$y, x_1, \dots, x_k$ are vectors in the n-dimensional Euclidean space $\mathbb{R}^n$. The inner product
of two points $a$ and $b$ in $\mathbb{R}^n$ is defined by
$$\langle a, b \rangle = a'b = \sum_{i=1}^{n} a_i b_i.$$
All points in $\mathbb{R}^n$ are determined by their length and direction. The Euclidean length of
a vector $a$ is
$$\|a\| = (a'a)^{1/2} = \Big(\sum_{i=1}^{n} a_i^2\Big)^{1/2}.$$
If we assume that

1. $n > k$: there are more observations than independent variables,
2. $\operatorname{rank}(X) = k$: the columns of $X$ are linearly independent,
then the columns of $X$ span a $k$-dimensional subspace of $\mathbb{R}^n$ which we call $S(X)$:
$$S(X) = \{ z \in \mathbb{R}^n : z = Xb \text{ for some } b \in \mathbb{R}^k \}.$$
The orthogonal complement of $S(X)$ is denoted by $S^{\perp}(X)$. In the Euclidean space $\mathbb{R}^n$
the regression problem translates to finding the "closest" point to $y$ in $S(X)$, which
means the point in $S(X)$ with the minimal distance to $y$.

Figure 2: The column space of $X$ in $\mathbb{R}^n$

Remember the fitting criterion for the least squares model
$$\min_{\beta}\ (y - X\beta)'(y - X\beta) = y'y - 2\beta'X'y + \beta'X'X\beta.$$
To solve the minimisation problem we calculate the first order condition
$$\frac{\partial (y - X\beta)'(y - X\beta)}{\partial \beta} = -2X'y + 2X'X\beta = 0.$$
Rules for matrix differentiation:
$$\frac{\partial a'b}{\partial b} = a, \qquad \frac{\partial b'Ab}{\partial b} = (A + A')b = 2Ab \ \text{ for symmetric } A.$$
We get the normal equations
$$X'X\beta = X'y.$$
If the inverse matrix $(X'X)^{-1}$ exists, the optimal parameter vector is given by
$$\hat{\beta} = (X'X)^{-1}X'y.$$

Remember that
- the columns of $X$ are linearly independent, which implies that $X'X$ has full rank,
- $X'X$ is then positive definite, which implies that the objective function is strictly convex and must have a unique minimum.
Therefore $\hat{\beta}$ is determined uniquely by the normal equations.


Let us define the residuals from the regression by
$$e = y - X\hat{\beta}$$
and the fitted values by
$$\hat{y} = X\hat{\beta}.$$
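To make the algebra concrete, here is a minimal Python sketch (simulated data; the variable names and dimensions are illustrative assumptions, not part of the original notes) that computes $\hat{\beta}$ from the normal equations, forms the fitted values and residuals, and checks that $X'e = 0$:

    import numpy as np

    rng = np.random.default_rng(0)
    n, k = 100, 3
    X = np.column_stack([np.ones(n), rng.standard_normal((n, k - 1))])  # includes a constant
    beta_true = np.array([1.0, 2.0, -0.5])
    y = X @ beta_true + rng.standard_normal(n)

    # normal equations X'X beta = X'y; solving is preferred to forming the inverse
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    y_fit = X @ beta_hat          # fitted values  y_hat = X beta_hat
    e = y - y_fit                 # residuals      e = y - X beta_hat

    print(beta_hat)
    print(np.allclose(X.T @ e, 0))   # normal equations: X'e = 0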
Figure 3: The projection of $y$ onto $S(X)$

The normal equations $X'(y - X\hat{\beta}) = 0$ also have a geometric interpretation: the residual vector must be orthogonal to $S(X)$.
Now we define some projection matrices
$$P = X(X'X)^{-1}X'.$$
The $n \times n$ matrix $P$ projects $y$ orthogonally onto the column space $S(X)$.
$$M = I - P = I - X(X'X)^{-1}X'$$
$M$ projects onto the orthogonal complement $S^{\perp}(X)$.

Properties of the projectors:
- $P$ and $M$ are symmetric,
- $PX = X$, $MX = 0$,
- $P + M = I$: there exists an orthogonal decomposition $y = Py + My$,
- $P$, $M$ are idempotent: $PP = P$, $MM = M$,
- $PM = MP = 0$.

With the help of the projection matrices we can decompose the vector of the dependent
variables,
$$y = Py + My,$$
where $\hat{y} = Py$ are the fitted values and $e = My$ are the residuals, and rewrite the normal equations as
$$X'e = 0.$$

Figure 4: The orthogonal decomposition of $y$

If a constant term is included in the regression model (one column of $X$ is $\iota = (1, \dots, 1)'$),
the residuals sum up to 0, because in this case $X'e = 0$ implies
$$\iota'e = \sum_{i=1}^{n} e_i = 0.$$
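The properties of the projectors can be verified numerically. A minimal Python sketch (again with arbitrary simulated data; not from the original notes):

    import numpy as np

    rng = np.random.default_rng(1)
    n, k = 50, 3
    X = np.column_stack([np.ones(n), rng.standard_normal((n, k - 1))])
    y = rng.standard_normal(n)

    P = X @ np.linalg.inv(X.T @ X) @ X.T   # projects onto S(X)
    M = np.eye(n) - P                      # projects onto the orthogonal complement

    assert np.allclose(P @ P, P) and np.allclose(M @ M, M)   # idempotent
    assert np.allclose(P @ M, 0)                             # PM = 0
    assert np.allclose(P @ X, X) and np.allclose(M @ X, 0)   # PX = X, MX = 0
    assert np.allclose(P @ y + M @ y, y)                     # orthogonal decomposition
    print(np.sum(M @ y))   # residuals sum to ~0 because X contains a constant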
2.3 Measuring the goodness of fit

The idempotency of $P$ and $M$ often makes expressions associated with least squares
regression very simple. For example the sum of squared residuals is given by
$$e'e = (My)'(My) = y'M'My = y'My.$$
Similarly, the sum of squared fitted values, which is also called the explained sum of
squares, is
$$\hat{y}'\hat{y} = (Py)'(Py) = y'Py,$$
and the total sum of squares equals the sum of squared dependent variables,
$$y'y = \sum_{i=1}^{n} y_i^2.$$
In Figure 5 we see the geometric interpretation of the vectors $y$, $\hat{y}$ and $e$. They
form the sides of a right-angled triangle.

Figure 5: The orthogonal decomposition of $y$

By applying Pythagoras' Theorem we see that
$$y'y = \hat{y}'\hat{y} + e'e. \qquad (1)$$

Thus the total sum of squares of $y$ equals the explained sum of squares plus the sum of
squared residuals. The fact that the total variation in the regressand can be divided into
two parts, one "explained" by the regressors and one not explained, suggests a natural
measure of how good the regression fits. Let us divide equation (1) by $y'y$,
$$1 = \frac{\hat{y}'\hat{y}}{y'y} + \frac{e'e}{y'y},$$
and define the uncentered $R^2$, or coefficient of determination, by
$$R_u^2 = \frac{\hat{y}'\hat{y}}{y'y} = 1 - \frac{e'e}{y'y}.$$

Properties of $R_u^2$:

1. $0 \le R_u^2 \le 1$, it is unit free.

2. Return to the triangle in Figure 5 and call $\phi$ the angle between $y$ and $\hat{y}$.
The cosine of $\phi$ is given by
$$\cos\phi = \frac{\|\hat{y}\|}{\|y\|},$$
and hence $R_u^2 = \cos^2\phi$.

3. Anything that changes $\phi$ will also change $R_u^2$, e.g. adding a constant to $y$.

A modification of $R^2$ lets us get around the problem addressed in the last point. This
version is called the centered $R^2$,
$$R^2 = 1 - \frac{e'e}{y'M_0 y},$$
where $M_0 = I - \frac{1}{n}\iota\iota'$ and $\iota = (1, \dots, 1)'$. Multiplication of $M_0$ with
a vector gives the vector of deviations from the mean,
$$M_0 y = \begin{pmatrix} y_1 - \bar{y} \\ \vdots \\ y_n - \bar{y} \end{pmatrix} \quad \text{where} \quad \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i.$$
For $R^2$ we can derive a decomposition into explained sum of squares and residual sum of squares, like in equation (1), only if a constant is included in the regression.
From the orthogonal decomposition of $y$ we get
$$M_0 y = M_0\hat{y} + M_0 e = M_0\hat{y} + e \quad (\text{since } \iota'e = 0).$$
Also we note that
$$\hat{y}'M_0 e = \hat{y}'e - \tfrac{1}{n}\hat{y}'\iota\,\iota'e = 0 \quad \text{if } \iota'e = 0,$$
and hence
$$y'M_0 y = \hat{y}'M_0\hat{y} + e'e,$$
$$1 = \frac{\text{SSE}}{\text{SST}} + \frac{\text{SSR}}{\text{SST}}, \qquad R^2 = \frac{\text{SSE}}{\text{SST}} = 1 - \frac{\text{SSR}}{\text{SST}},$$
where SST denotes the total, SSE the explained and SSR the residual sum of squares (in deviations from the mean).

Properties of $R^2$:

1. $0 \le R^2 \le 1$ only if a constant is included in $X$ ($R^2$ is difficult to interpret
unless there is a constant in $X$).

2. $R^2$ never decreases if an additional variable is added to the regression.
A measure of the goodness of fit which does not suffer from this problem is the
adjusted $R^2$, defined as
$$\bar{R}^2 = 1 - \frac{e'e/(n-k)}{y'M_0 y/(n-1)}.$$

3. The adjusted $R^2$ can be written in terms of $R^2$ as
$$\bar{R}^2 = 1 - \frac{n-1}{n-k}\,(1 - R^2).$$

4. In the triangle, similar to the one in Figure 5, with sides $M_0 y$, $M_0\hat{y}$ and $e$,
$R^2$ is the squared cosine of the angle $\varphi$ between $M_0 y$ and $M_0\hat{y}$:
$$R^2 = \cos^2\varphi.$$
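As a numerical illustration (a sketch with simulated data, not taken from the notes), the uncentered, centered and adjusted $R^2$ can be computed directly from the residuals:

    import numpy as np

    rng = np.random.default_rng(2)
    n, k = 200, 4
    X = np.column_stack([np.ones(n), rng.standard_normal((n, k - 1))])
    y = X @ np.array([1.0, 0.5, -1.0, 0.2]) + rng.standard_normal(n)

    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    e = y - X @ beta_hat

    R2_uncentered = 1 - (e @ e) / (y @ y)
    y_dev = y - y.mean()                         # M0 y: deviations from the mean
    R2_centered = 1 - (e @ e) / (y_dev @ y_dev)
    R2_adjusted = 1 - (e @ e / (n - k)) / (y_dev @ y_dev / (n - 1))
    print(R2_uncentered, R2_centered, R2_adjusted)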
3 The classical linear regression model

3.1 Assumptions

For the multiple linear regression model
$$y = X\beta + \varepsilon \qquad (2)$$
we start with a set of assumptions

A1: $y = X\beta + \varepsilon$
A2: $X$ is an $n \times k$ matrix with $\operatorname{rank}(X) = k$
A3: $E(\varepsilon \mid X) = 0$
A4: $E(\varepsilon\varepsilon' \mid X) = \sigma^2 I_n$
A5: $X$ is nonstochastic

Remarks

1. Assumption A1 includes a wider range of functional forms; for example we can
model exponential functions by taking logs of the equation
$$y = e^{\beta_1} x_2^{\beta_2} x_3^{\beta_3} e^{\varepsilon},$$
or quadratic functions like
$$y = \beta_1 + \beta_2 x + \beta_3 x^2 + \varepsilon.$$

2. A2 works as an identification condition. In the two-dimensional model the assumption
means that $x$ is not constant. There has to be enough variation in the model.

3. A3: $E(\varepsilon \mid X) = 0$ means $E(\varepsilon_i \mid X) = 0$ for every $i$.
The zero conditional mean implies that the unconditional mean is also zero, since
$$E(\varepsilon_i) = E_X\big[E(\varepsilon_i \mid X)\big] = E_X[0] = 0.$$
If we require $E(\varepsilon_i \mid X)$ to be constant, we can as well set this constant equal to
zero, if a constant term is included in the model. For example, if $E(\varepsilon_i \mid X) = \mu \neq 0$, we can rewrite the model with intercept $\beta_1 + \mu$ and error term $\varepsilon_i - \mu$.
A3 further implies that
$$E(\varepsilon_i x_{jl}) = 0 \quad \text{for any } i, j = 1, \dots, n, \ l = 1, \dots, k,$$
which is seen by
$$E(\varepsilon_i x_{jl}) = E\big[E(\varepsilon_i x_{jl} \mid X)\big] = E\big[x_{jl}\, E(\varepsilon_i \mid X)\big] = 0.$$
This is why assumption A3 is also called the strict exogeneity condition. It
requires the regressors to be orthogonal to the error term not only of the same
observation but also to the error terms of the other observations. A3 also implies that
$$E(y \mid X) = X\beta,$$
the regression of $y$ on $X$ is the conditional mean.

4. A4 more completely specifies the distribution of the error term.
The condition $\operatorname{Var}(\varepsilon_i \mid X) = \sigma^2$ requires homoscedastic errors,
and the condition on the covariances, $\operatorname{Cov}(\varepsilon_i, \varepsilon_j \mid X) = 0$ for $i \neq j$, requires that the errors
are non-autocorrelated.
5. A5: $X$ is nonstochastic in an experimental setting, where the analyst chooses
the independent variables $X$ and then observes the outcome $y$.
Example: in an agricultural experiment the outcome $y$ may be crop yields and
the analyst chooses the amount of fertilizer that is applied.
An alternative view in the experimental setup is that the observations of $X$ are
fixed in repeated samples.
In economics we do not often have the opportunity to analyse experimental data,
so the assumption of nonstochastic $X$ is not very appropriate. However, we will
see that this assumption can be dropped at a low cost.

Figures: errors satisfying $E(\varepsilon \mid X) = 0$, $E(\varepsilon\varepsilon' \mid X) = \sigma^2 I$; heteroskedastic errors; autocorrelated errors.

3.2 The statistical estimation problem

From A1 we know that there exists a true $\beta$ and we want to estimate it as well as
possible from the data.
We choose an estimator from the class of linear estimators,
$$\hat{\beta} = Cy + d.$$
We require that the estimator is unbiased,
$$E(\hat{\beta}) = \beta \quad \text{for all } \beta.$$
Out of these we choose the best estimator, which is the one with the smallest variance:
$$\operatorname{Var}(\hat{\beta}) = E\big[(\hat{\beta} - E\hat{\beta})(\hat{\beta} - E\hat{\beta})'\big] \ \text{ is minimal.}$$
Remarks

1) Figure: density function of the estimator of $\beta$.

2) The concept of the smallest variance is the following: $\operatorname{Var}(\hat{\beta})$ is a $k \times k$ variance-covariance matrix, therefore it is symmetric and positive definite. An estimator $\hat{\beta}$ has smaller variance than $\tilde{\beta}$ if $\operatorname{Var}(\tilde{\beta}) - \operatorname{Var}(\hat{\beta})$ is non-negative definite, i.e. $a'\operatorname{Var}(\hat{\beta})a \le a'\operatorname{Var}(\tilde{\beta})a$ for all $a \in \mathbb{R}^k$.

Lemma 1 Under the assumptions A1, A3, A5 the linear estimator $\hat{\beta} = Cy + d$ is
unbiased if and only if $CX = I$ and $d = 0$.
Proof.
$$\hat{\beta} = Cy + d = C(X\beta + \varepsilon) + d,$$
$$E(\hat{\beta}) = E\big[CX\beta + C\varepsilon + d\big] = CX\beta + C\,E(\varepsilon) + d = CX\beta + d.$$
This equals $\beta$ for every $\beta$ if and only if $CX = I$ and $d = 0$.

Lemma 2 Let $y$ be an n-dimensional random variable, whose first and second moments
exist, and let $z = Cy + d$ be a k-dimensional random variable; then
$$\operatorname{Var}(z) = C \operatorname{Var}(y)\, C'.$$
Proof. Exercise.
As a consequence of these Lemmas we note:
$\hat{\beta} = Cy$ with $CX = I$ is an unbiased linear estimator with variance covariance matrix
$$\operatorname{Var}(\hat{\beta}) = C \operatorname{Var}(y)\, C' = C\,\sigma^2 I\, C' = \sigma^2 CC'.$$
The remaining problem is to minimise $\sigma^2 CC'$ under the restriction $CX = I$.

Lemma 3 Denote $A = (X'X)^{-1}$. Further, let $X$ have full rank and let $CX = L$; then
$$CC' = LAL' + (C - LAX')(C - LAX')'.$$
Proof. Exercise.
Now we apply Lemma 3 with $L = I$ and get
$$CC' = (X'X)^{-1} + \big(C - (X'X)^{-1}X'\big)\big(C - (X'X)^{-1}X'\big)'.$$
The first term is constant; the second term is non-negative definite and vanishes exactly for $C = (X'X)^{-1}X'$.
With these steps we have proven the Gauss Markov Theorem.

Theorem 4 (Gauss Markov Theorem) Under the assumptions A1 to A5 the estimator
$$\hat{\beta} = (X'X)^{-1}X'y$$
is the Best Linear Unbiased Estimator (BLUE) of $\beta$.

Remarks
- In general, among the unbiased estimators there are better (nonlinear) estimators of $\beta$. We will show that if the error terms have a normal distribution, $\varepsilon \sim N(0, \sigma^2 I)$, $\hat{\beta}$ is the best unbiased estimator, which means that $\hat{\beta}$ is also efficient.
- If we give up unbiasedness, in general estimators with a smaller variance-covariance matrix exist.

Estimating the variance of $\hat{\beta}$
$$\hat{\beta} = (X'X)^{-1}X'y = (X'X)^{-1}X'(X\beta + \varepsilon) = \beta + (X'X)^{-1}X'\varepsilon,$$
$$\operatorname{Var}(\hat{\beta}) = E\big[(\hat{\beta} - \beta)(\hat{\beta} - \beta)'\big] = E\big[(X'X)^{-1}X'\varepsilon\varepsilon'X(X'X)^{-1}\big] = \sigma^2 (X'X)^{-1}.$$
Note that $\operatorname{Var}(\hat{\beta})$ depends on
- $\sigma^2$, the variance of the error $\varepsilon$,
- the condition of $X'X$. If $X'X$ is almost singular we talk of the
problem of multicollinearity.
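A small Monte Carlo sketch (illustrative only; sample size, parameters and error variance are arbitrary choices, not from the notes) makes the two results tangible: across repeated samples the OLS estimates average to $\beta$ and their empirical covariance is close to $\sigma^2 (X'X)^{-1}$.

    import numpy as np

    rng = np.random.default_rng(3)
    n, k, sigma, reps = 100, 3, 2.0, 5000
    X = np.column_stack([np.ones(n), rng.standard_normal((n, k - 1))])  # fixed X (A5)
    beta = np.array([1.0, 0.5, -0.3])
    XtX_inv = np.linalg.inv(X.T @ X)

    draws = np.empty((reps, k))
    for r in range(reps):
        y = X @ beta + sigma * rng.standard_normal(n)
        draws[r] = XtX_inv @ X.T @ y

    print(draws.mean(axis=0))              # close to beta (unbiasedness)
    print(np.cov(draws, rowvar=False))     # close to sigma^2 (X'X)^{-1}
    print(sigma**2 * XtX_inv)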

Linear transformations of $\hat{\beta}$

Corollary 5 (Corollary from GM Theorem) Under the assumptions A1 – A5 the estimator $L\hat{\beta}$ ($L$ an $m \times k$ matrix) is BLUE for $L\beta$.

This corollary has some important implications. The BLUE for $E(y \mid X) = X\beta$ is given by
$$\hat{y} = X\hat{\beta}$$
with variance
$$\operatorname{Var}(X\hat{\beta}) = X \operatorname{Var}(\hat{\beta})\, X' = \sigma^2 X(X'X)^{-1}X' = \sigma^2 P.$$
Estimating $X\beta$

Like before we want to find the BLUE.
Linear estimator: $\hat{y} = Cy + d$.
Unbiased estimator: $E(\hat{y}) = E(X\beta) = X\beta$.
We define the estimation error $\delta$ as $\delta = \hat{y} - X\beta$ and require $E(\delta) = 0$:
$$E(\delta) = E(Cy + d - X\beta) = CX\beta + d - X\beta = (CX - X)\beta + d.$$
So the conditions for unbiasedness are
$$CX = X \quad \text{and} \quad d = 0,$$
and the minimisation problem to solve is
$$\min_C\ E(\delta\delta') \quad \text{such that } CX = X,$$
$$E(\delta\delta') = E\big[C\varepsilon\varepsilon'C'\big] = \sigma^2 CC'.$$
Applying Lemma 3 with $L = X$ gives
$$E(\delta\delta') = \sigma^2\Big[X(X'X)^{-1}X' + \big(C - X(X'X)^{-1}X'\big)\big(C - X(X'X)^{-1}X'\big)'\Big],$$
which is minimal if $C = X(X'X)^{-1}X' = P$.
Thus we get the BLUE for $X\beta$:
$$\hat{y} = Py = X\hat{\beta} \quad \text{with} \quad \operatorname{Var}(\hat{y}) = \sigma^2 P, \qquad E(\hat{y}) = X\beta.$$
Estimation of $\sigma^2$

Think of the expression $\varepsilon'\varepsilon$ and suppose $\varepsilon$ were observed. Then we get
$$E(\varepsilon'\varepsilon) = \sum_{i=1}^{n} E(\varepsilon_i^2) = n\sigma^2.$$
This suggests that we could take $\frac{1}{n}e'e$ as an estimator for $\sigma^2$. But, due to the
construction of OLS we have $e = M\varepsilon$, hence $\frac{1}{n}e'e$ is biased.
We can see this by
$$e = My = M(X\beta + \varepsilon) = M\varepsilon,$$
$$E(e'e) = E(\varepsilon'M\varepsilon) = E\big[\operatorname{tr}(\varepsilon'M\varepsilon)\big] = E\big[\operatorname{tr}(M\varepsilon\varepsilon')\big] = \operatorname{tr}\big(M\,E(\varepsilon\varepsilon')\big) = \sigma^2 \operatorname{tr}(M) = \sigma^2 (n-k);$$
the last step follows from
$$\operatorname{tr}(M) = \operatorname{tr}(I_n) - \operatorname{tr}\big(X(X'X)^{-1}X'\big) = n - \operatorname{tr}\big((X'X)^{-1}X'X\big) = n - \operatorname{tr}(I_k) = n - k.$$
So we have
$$E(e'e) = (n-k)\,\sigma^2,$$
and we can get an unbiased estimator for $\sigma^2$ by
$$\hat{\sigma}^2 = \frac{e'e}{n-k}.$$
Now that we have an estimator for $\sigma^2$ we can present an estimator for the variance of $\hat{\beta}$:
$$\widehat{\operatorname{Var}}(\hat{\beta}) = \hat{\sigma}^2 (X'X)^{-1}.$$

Remarks

The standard error $\hat{\sigma}$ is also called the Standard Error of the Regression.
The standard error of a single component $\hat{\beta}_j$ of the parameter vector is given by
$$se(\hat{\beta}_j) = \hat{\sigma}\,\big[(X'X)^{-1}\big]_{jj}^{1/2}.$$

The population equivalent of $R^2$

Remember that we defined the centered coefficient of determination by
$$R^2 = 1 - \frac{e'e}{y'M_0 y}.$$
It can be shown that $R^2$ is a biased estimator of its population counterpart. To reduce
the bias we define the corrected $R^2$ or, alternatively, the adjusted $R^2$,
$$\bar{R}^2 = 1 - \frac{e'e/(n-k)}{y'M_0 y/(n-1)}.$$
Both usually have a smaller, but negative, bias.

3.3 Prediction or the out of sample forecasting

We use the model
$$y_0 = x_0'\beta + \varepsilon_0,$$
where $y_0$ is unknown and $x_0$ is given.
Example: with the help of past values predict consumption in 2001.

The problem is again to find the best linear unbiased predictor for $y_0$:
$$\hat{y}_0 = c'y + d, \qquad E(c'y + d) = E(y_0), \qquad \delta_0 = \hat{y}_0 - y_0.$$
Find $\hat{y}_0$ such that $E(\delta_0) = 0$ and $E(\delta_0^2)$ is minimal.

Theorem 6 (Gauss Markov Theorem – Continued) Under the assumptions A1 to A5
the best linear unbiased forecast for $y_0$ is given by
$$\hat{y}_0 = x_0'\hat{\beta}.$$
4 Stochastic regression

Social scientists are rarely able to analyse experimental data. Thus it is necessary to
extend the results of the preceding section to cases in which some or all independent
variables are randomly drawn from some probability distribution.
A convenient method of obtaining the statistical properties of $\hat{\beta}$ is to

1. obtain results on the statistical properties conditional on $X$ (equivalent to the case of non-stochastic regressors),

2. find unconditional results by "averaging" (integrating over) the conditional distributions.

As before,
$$\hat{\beta} = \beta + (X'X)^{-1}X'\varepsilon.$$
So, conditioned on the observed $X$,
$$E(\hat{\beta} \mid X) = \beta + (X'X)^{-1}X'E(\varepsilon \mid X) = \beta,$$
$$E(\hat{\beta}) = E_X\big[E(\hat{\beta} \mid X)\big] = \beta.$$
The unbiasedness of $\hat{\beta}$ is robust to assumptions about $X$; it rests only on assumption A3.
The variance of $\hat{\beta}$, conditioned on $X$, is
$$\operatorname{Var}(\hat{\beta} \mid X) = \sigma^2 (X'X)^{-1},$$
and the unconditional variance is
$$\operatorname{Var}(\hat{\beta}) = E_X\big[\operatorname{Var}(\hat{\beta} \mid X)\big] + \operatorname{Var}_X\big[E(\hat{\beta} \mid X)\big] = \sigma^2 E_X\big[(X'X)^{-1}\big].$$
With the Gauss-Markov Theorem in the previous section we have shown that
$$\operatorname{Var}(\tilde{\beta} \mid X) \ge \operatorname{Var}(\hat{\beta} \mid X) \quad \text{for any linear unbiased estimator } \tilde{\beta}.$$
This inequality, if it holds for every particular $X$, must hold over the average values of $X$:
$$\operatorname{Var}(\tilde{\beta}) \ge \operatorname{Var}(\hat{\beta}).$$

Theorem 7 (Gauss Markov Theorem – Continued) In the classical linear regression
model the least squares estimator $\hat{\beta} = (X'X)^{-1}X'y$ is the minimum variance
linear unbiased estimator of $\beta$, whether $X$ is stochastic or nonstochastic.

Further Remark on Assumption A3

We noticed that
$$E(\varepsilon_i x_{jl}) = E\big[x_{jl}\,E(\varepsilon_i \mid X)\big] = 0,$$
but by the same argument we get
$$\operatorname{Cov}(X, \varepsilon) = E(X'\varepsilon) - E(X')E(\varepsilon) = 0,$$
which says that $X$ and $\varepsilon$ are uncorrelated. The interpretation is that $X$ in some sense
captures all relevant effects which are necessary to explain $y$.
Example: $y$ – wage, $x$ – education, $\varepsilon$ – ability. $E(\varepsilon \mid x) = 0$: the average ability in all
education groups is the same.
5 Statistical inference in the classical linear regression model

So far we have solved the estimation problem, but there remain some open questions,
like
- Which explanatory variables are best included in the model?
- What about the functional form?
- What is the distribution of the residuals?
We need a testing framework in the linear regression model and exact distributional
assumptions for the errors. We make an additional assumption

A6: $\varepsilon \sim N(0, \sigma^2 I)$

There are several arguments in favour of the normal distribution:
- normality is preserved under linear transformations,
- quadratic forms of normals give $\chi^2$- or F-distributed random variables,
- under normality of $\varepsilon$, the OLS estimator $\hat{\beta}$ is also the Maximum Likelihood estimator.

5.1 Introduction and review

The Normal Distribution

An n-dimensional random variable $x$ is normally distributed if its density function is
of the following form:
$$f(x) = (2\pi)^{-n/2}\,|\Sigma|^{-1/2}\exp\Big\{-\tfrac{1}{2}(x - \mu)'\Sigma^{-1}(x - \mu)\Big\},$$
with $\mu \in \mathbb{R}^n$ and $\Sigma$ symmetric and positive definite.
$$E(x) = \mu, \qquad \operatorname{Var}(x) = \Sigma.$$
Corollary 8 Let $x$ be an $n \times 1$ random variable with $x \sim N(\mu, \Sigma)$.
If $z = Cx + d$, where $C$ is an $m \times n$ matrix with $\operatorname{rank}(C) = m$ and $m \le n$, then
$$z \sim N\big(C\mu + d,\ C\Sigma C'\big).$$
Let us apply the corollary to the linear regression model:
$$\varepsilon \sim N(0, \sigma^2 I),$$
$$y = X\beta + \varepsilon \sim N(X\beta, \sigma^2 I),$$
$$\hat{\beta} = (X'X)^{-1}X'y \sim N\big(\beta,\ \sigma^2 (X'X)^{-1}\big).$$

The principle of Maximum Likelihood

Suppose we have a given sample of observations $y_1, \dots, y_n$ with the joint density
function $f(y_1, \dots, y_n; \theta)$ and we want to estimate the parameter vector $\theta$. The
estimator of $\theta$ dependent on the observations is called $\hat{\theta}(y_1, \dots, y_n)$.
The Likelihood function is defined as
$$L(\theta; y_1, \dots, y_n) = f(y_1, \dots, y_n; \theta).$$
The principle of Maximum Likelihood states that an estimator of $\theta$ is given by the
maximum of the Likelihood Function,
$$\hat{\theta}(y_1, \dots, y_n) = \arg\max_{\theta} L(\theta; y_1, \dots, y_n).$$
Here we want to find the ML estimator for the linear regression model. First we set up
the Likelihood Function for $y_1, \dots, y_n$:
$$L(\beta, \sigma^2) = f(y_1, \dots, y_n; \beta, \sigma^2) = (2\pi\sigma^2)^{-n/2}\exp\Big\{-\frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta)\Big\}.$$
Taking logarithms,
$$\log L = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta),$$
and setting the derivatives with respect to the parameters equal to zero,
$$\frac{\partial \log L}{\partial \beta} = \frac{1}{\sigma^2}\big(X'y - X'X\beta\big) = 0,$$
$$\frac{\partial \log L}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}(y - X\beta)'(y - X\beta) = 0,$$
gives
$$\hat{\beta}_{ML} = (X'X)^{-1}X'y, \qquad \hat{\sigma}^2_{ML} = \frac{e'e}{n}.$$
The Maximum Likelihood estimator for $\beta$ equals the OLS estimator and the Maximum
Likelihood estimator for $\sigma^2$ is given by the biased variance estimator.

Theorem 9 In the classical regression model with normally distributed errors the
least squares estimator $\hat{\beta}$ has minimal variance among all unbiased estimators. Thus $\hat{\beta}$ is
efficient, not only linearly efficient.

Remark: For non-normally distributed errors the ML estimator usually has a smaller
variance than $\hat{\beta}_{OLS}$. Thus for non-normality it is better to use the ML estimator than OLS.
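The coincidence of the ML and OLS estimators under normality can be checked numerically. The following Python sketch (simulated data; it assumes scipy is installed, and the optimiser settings are illustrative) minimises the negative log-likelihood and compares the result with the closed-form OLS solution:

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(4)
    n, k = 200, 3
    X = np.column_stack([np.ones(n), rng.standard_normal((n, k - 1))])
    y = X @ np.array([1.0, -0.5, 0.8]) + 1.5 * rng.standard_normal(n)

    def neg_loglik(params):
        b, log_s2 = params[:k], params[k]      # sigma^2 parameterised via its log
        s2 = np.exp(log_s2)
        resid = y - X @ b
        return 0.5 * (n * np.log(2 * np.pi * s2) + resid @ resid / s2)

    res = minimize(neg_loglik, x0=np.zeros(k + 1), method="BFGS")
    beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
    e = y - X @ beta_ols
    print(res.x[:k], beta_ols)           # ML beta equals OLS beta
    print(np.exp(res.x[k]), e @ e / n)   # ML variance equals e'e/n (the biased estimator)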

The statistical testing problem

We start with a parameter space $\Theta$.
Null Hypothesis: $H_0: \theta \in \Theta_0$
Alternative: $H_1: \theta \in \Theta_1 = \Theta \setminus \Theta_0$
Example: $\Theta = \mathbb{R}$, $\Theta_0 = \{\theta_0\}$ for a given value $\theta_0$.

A statistical test is a decision rule based on the sample. The decision rule determines
if $H_0$ is accepted or rejected.
For the possible outcomes of the testing procedure we find the following:

              Accept H0        Reject H0
H0 true       ok               Error Type 1
H0 false      Error Type 2     ok

Table 3: Testing outcomes

Probability(Error Type 1) = $\alpha$ ... size of the test
1 − Probability(Error Type 2) ... power of the test

Tests are compared in terms of size and power. The "best" test has maximal power
for a given (fixed) size.

5.2 Testing principles

In this section we discuss three testing principles which are based on Maximum Likelihood estimation of the parameter $\theta$. Given an arbitrary function $c(\cdot)$, the testing hypothesis is
$$H_0: c(\theta) = 0.$$

Likelihood Ratio Test

If the restriction $c(\theta) = 0$ is valid, imposing it should not lead to a large reduction of
the Likelihood Function. Therefore we base the test on the difference $\log L_U - \log L_R$,
where $L_U$ is the likelihood at the unconstrained value of $\theta$ and $L_R$ is the value of the likelihood
at the restricted estimate. Let $\hat{\theta}_U$ be the maximum likelihood estimate of $\theta$ obtained
without regard to the constraints, and let $\hat{\theta}_R$ be the maximum likelihood estimate we
receive by maximising the likelihood function subject to the constraint. If $L(\hat{\theta}_U)$ and
$L(\hat{\theta}_R)$ are the values of the Likelihood Function evaluated at these two estimates, then the
Likelihood Ratio is
$$\lambda = \frac{L(\hat{\theta}_R)}{L(\hat{\theta}_U)}.$$
This ratio must be between 0 and 1. If $\lambda$ is too small we reject the null hypothesis.

Wald Test

If the restriction $c(\theta) = 0$ is valid, $c(\hat{\theta}_U)$ should be close to zero, because maximum
likelihood estimation is consistent. Therefore the test is based on $c(\hat{\theta}_U)$. We reject
the null hypothesis if this is significantly different from zero.

Lagrange Multiplier Test

If the restriction $c(\theta) = 0$ is valid, the restricted estimator should be near the point that
maximises the log likelihood. Therefore, the slope of the likelihood function should be
near zero at the restricted estimator. The test is based on the slope of the log Likelihood
at the point where the function is maximised subject to the restriction. The derivative
of the Likelihood with respect to the parameters is called the score,
$$s(\theta) = \frac{\partial \log L(\theta)}{\partial \theta}.$$
The test is based on $s(\hat{\theta}_R)$ and we reject the null hypothesis if this is significantly
different from zero.

The three tests have asymptotically equivalent behaviour, but differ in small samples.
The choice among the three principles is often made on practical computational considerations:
Wald Test: requires estimation of the unrestricted model
LM Test: requires estimation of the restricted model
LR Test: requires estimation of both models

Example: derive a decision rule in the case of the LR test

Let the parameter space be given by $\Theta$ and the null hypothesis by $H_0: \theta \in \Theta_0$ with $\Theta_0 \subset \Theta$.
Then the test statistic of the Likelihood Ratio test is given by
$$\lambda(y_1, \dots, y_n) = \frac{\max_{\theta \in \Theta_0} L(\theta; y_1, \dots, y_n)}{\max_{\theta \in \Theta} L(\theta; y_1, \dots, y_n)}.$$
The test statistic $\lambda$ is a function of the random variables $y_1, \dots, y_n$, thus it is
itself a random variable with density function $g(\lambda \mid \theta)$.

Note: $\lambda$ is always between 0 and 1, because both likelihood functions are positive
and the restricted maximum cannot be larger than the unrestricted one.
If $H_0$ is true: $\lambda$ is near 1.
If $H_0$ is false: $\lambda$ is near 0.
We need the statistical framework to specify what "near" means. For a given significance level $\alpha$ the graph shows the density of $\lambda$ in the case that the null hypothesis is
true and in the case that it is not true.

Figure: $g(\lambda)$ if $H_0$ is true; $g(\lambda)$ if $H_0$ is false

For a given $\alpha$ the critical region $\lambda \le \lambda_\alpha$ is defined by the $\alpha$-quantile of $g$:
$$\int_0^{\lambda_\alpha} g(\lambda \mid H_0)\, d\lambda = \alpha.$$
And we get the decision rule:
Reject $H_0$ if $\lambda \le \lambda_\alpha$; accept $H_0$ if $\lambda > \lambda_\alpha$.

5.3 Testing values of β: practical part

Testing a hypothesis about a single parameter

We start with testing a hypothesis about a single element of the parameter vector $\beta$,
like
$$H_0: \beta_j = \beta_j^0.$$
To derive a decision rule we have to find a suitable test statistic with a known probability distribution. Remember that under A6
$$\hat{\beta} \sim N\big(\beta, \sigma^2 (X'X)^{-1}\big)$$
and
$$\frac{\hat{\beta}_j - \beta_j}{\sigma\,\big[(X'X)^{-1}\big]_{jj}^{1/2}} \sim N(0, 1).$$
If we knew $\sigma$ we would be fine, but we have to replace $\sigma$ by the estimated value $\hat{\sigma}$.
It can be shown (in the next section) that
$$t_j = \frac{\hat{\beta}_j - \beta_j}{se(\hat{\beta}_j)}$$
has a t-distribution with $n - k$ degrees of freedom. Remember that the standard error
of $\hat{\beta}_j$ is given by $se(\hat{\beta}_j) = \hat{\sigma}\big[(X'X)^{-1}\big]_{jj}^{1/2}$.
The most common application of a test on a single value of $\beta$ is to test the hypothesis
$$H_0: \beta_j = 0.$$
This hypothesis claims that the variable $x_j$ does not have a partial effect on $y$ after the
other independent variables $x_1, \dots, x_{j-1}, x_{j+1}, \dots, x_k$ have been accounted for. If the
null hypothesis is not rejected, $x_j$ can be eliminated from the regression equation.
In this case the t-statistic is
$$t_j = \frac{\hat{\beta}_j}{se(\hat{\beta}_j)}.$$
$t_j$ is small if either $\hat{\beta}_j$ is close to zero or $se(\hat{\beta}_j)$ is large.
There are two possibilities to define the alternative hypothesis. We start with a one-sided alternative,
$$H_0: \beta_j = 0, \qquad H_1: \beta_j > 0,$$
and we choose the significance level, e.g. $\alpha = 0.05$.
To set up a decision rule we have to find a sufficiently large value of $t_j$ in order to
reject $H_0: \beta_j = 0$. We reject the null hypothesis if $t_j > c$, with $c$ the 95th percentile
of the $t_{n-k}$-distribution.

Figure 14: One-sided alternative

According to this decision rule a rejection of $H_0$ will occur in 5% of all random samples in which $H_0$ is true (Type I error).

Example 10 We consider a model which explains log wages by the years of education,
years of working experience and years of tenure with the current employer,
$$\log(wage) = \beta_1 + \beta_2\, educ + \beta_3\, exper + \beta_4\, tenure + \varepsilon.$$
Now we want to test if experience has a partial influence on wages, once education
and tenure have been accounted for:
$$H_0: \beta_3 = 0, \qquad H_1: \beta_3 > 0.$$
Estimating the model on a sample of $n = 526$ workers, the degrees of freedom are $n - k = 526 - 4 = 522$. Let $\alpha = 0.05$; then the critical value is $c \approx 1.65$, and the t-statistic $t_3 = \hat{\beta}_3/se(\hat{\beta}_3)$ computed from the estimates exceeds it. Hence we reject $H_0$.
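The t-test of $H_0: \beta_j = 0$ is easy to carry out by hand. The Python sketch below (simulated data standing in for the wage example; the coefficient values are arbitrary, and scipy is assumed for the t quantiles) computes $\hat{\sigma}^2$, the standard errors and the t-statistics:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(5)
    n, k = 526, 4
    X = np.column_stack([np.ones(n), rng.standard_normal((n, k - 1))])
    y = X @ np.array([0.5, 0.1, 0.05, 0.02]) + 0.4 * rng.standard_normal(n)

    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    e = y - X @ beta_hat
    sigma2_hat = e @ e / (n - k)
    se = np.sqrt(sigma2_hat * np.diag(XtX_inv))

    t_stats = beta_hat / se                        # tests of H0: beta_j = 0
    c_one_sided = stats.t.ppf(0.95, df=n - k)      # one-sided 5% critical value
    c_two_sided = stats.t.ppf(0.975, df=n - k)     # two-sided 5% critical value
    print(t_stats, c_one_sided, c_two_sided)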
The second possibility is to define a two-sided alternative,
$$H_0: \beta_j = 0, \qquad H_1: \beta_j \neq 0.$$
In this case we reject $H_0$ if $|t_j| > c$.

Figure 15: Two-sided alternative

Example 11 We consider a model in which the college grade point average (colGPA) is
explained by other test results (high school GPA and an achievement test score) and the average number of lectures missed per week
(skipped). We want to test if missing lectures has an influence on college GPA:
$$colGPA = \beta_1 + \beta_2\, hsGPA + \beta_3\, ACT + \beta_4\, skipped + \varepsilon,$$
$$H_0: \beta_4 = 0, \qquad H_1: \beta_4 \neq 0.$$
For the significance level $\alpha = 0.05$ the critical value is $c \approx 1.96$, and the estimated
t-statistic on skipped exceeds it in absolute value. Hence $H_0$ can be rejected.
If we want to test if $\beta_j$ is equal to some given constant $a_j$ (e.g. $a_j = 1$ or $a_j = -1$) we
proceed in the same way:
$$H_0: \beta_j = a_j, \qquad H_1: \beta_j \neq a_j,$$
and the test statistic is given by
$$t_j = \frac{\hat{\beta}_j - a_j}{se(\hat{\beta}_j)}.$$

Example 12 (Constant elasticity model) In this model we study the effect of air pollution on housing prices. The dependent variable is the median housing price in 506
Boston regions and the variable nox gives the average amount of nitrous oxide in the air. The model is
$$\log(price) = \beta_1 + \beta_2 \log(nox) + \dots + \varepsilon,$$
and we test whether the elasticity of the price with respect to nox equals $-1$:
$$H_0: \beta_2 = -1, \qquad H_1: \beta_2 \neq -1.$$
The test statistic $t_2 = (\hat{\beta}_2 + 1)/se(\hat{\beta}_2)$ is close to zero. $H_0$ cannot be rejected at any conventional significance
level, hence $\hat{\beta}_2$ is not significantly different from $-1$.

Testing multiple exclusion restrictions

Now we want to test whether a group of variables has a partial effect on the dependent
variable.
Consider the two models:
$$\log(wage) = \beta_1 + \beta_2\, educ + \beta_3\, exper + \beta_4\, tenure + \varepsilon \qquad (3)$$
and
$$\log(wage) = \beta_1 + \beta_2\, educ + \varepsilon. \qquad (4)$$
A test between the two models is a test for the hypothesis
$$H_0: \beta_3 = \beta_4 = 0, \qquad H_1: H_0 \text{ is not true.}$$
First, we estimate the unrestricted model given by equation (3) and keep the coefficient
of determination $R^2_U$, which tells us how well the model fits. Second, we estimate the
restricted model of equation (4) and keep $R^2_R$ from the regression. A test statistic can
then be based on the difference between the $R^2$ from both models:
$$F = \frac{(R^2_U - R^2_R)/q}{(1 - R^2_U)/(n - k)}.$$
To generalise the procedure applied in this example we consider the linear regression
model $y = X\beta + \varepsilon$ and rearrange the columns of $X$ so that the independent variables
from the restricted model come first:
$$X = (X_1 \ X_2), \qquad \beta = \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix}.$$
Now $X_1$ and $X_2$ are matrices of dimension $n \times (k - q)$ and $n \times q$ respectively,
$\beta_1$ and $\beta_2$ are vectors of dimension $(k - q) \times 1$ and $q \times 1$, and $q$ is the number of
restrictions.
$$H_0: \beta_2 = 0 \ \big(\text{i.e. } \beta_{k-q+1} = \dots = \beta_k = 0\big), \qquad H_1: H_0 \text{ is not true.}$$
The F-statistic is given by
$$F = \frac{(R^2_U - R^2_R)/q}{(1 - R^2_U)/(n - k)}.$$

We will show that $F$ has an F-distribution with degrees of freedom $q$ and $n - k$.
Note that $R^2_U \ge R^2_R$. We reject $H_0$ if the F-statistic is sufficiently "large".

Example 13 Consider again the example with the wage equation from before. Now
we want to test if wages are completely determined by years of education:
$$H_0: \beta_3 = \beta_4 = 0, \qquad H_1: H_0 \text{ is not true.}$$
The estimation of the unrestricted model gives $R^2_U$ for $n = 526$ and $k = 4$; for the
restricted model we get $R^2_R$, and the number of restrictions is $q = 2$. The resulting F-statistic clearly exceeds the critical value of the $F(2, n-k)$ distribution at the significance
level $\alpha = 0.05$, so $H_0$ is rejected.
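A sketch of the F-test for exclusion restrictions based on the two $R^2$ values (illustrative Python code with simulated data, not the wage data; scipy supplies the F quantile):

    import numpy as np
    from scipy import stats

    def r_squared(y, X):
        b = np.linalg.solve(X.T @ X, X.T @ y)
        e = y - X @ b
        yd = y - y.mean()
        return 1 - (e @ e) / (yd @ yd)

    rng = np.random.default_rng(6)
    n, k, q = 526, 4, 2
    X = np.column_stack([np.ones(n), rng.standard_normal((n, 3))])
    y = X @ np.array([0.5, 0.1, 0.05, 0.02]) + 0.4 * rng.standard_normal(n)

    R2_u = r_squared(y, X)                 # unrestricted: all k regressors
    R2_r = r_squared(y, X[:, :k - q])      # restricted: last q regressors excluded
    F = ((R2_u - R2_r) / q) / ((1 - R2_u) / (n - k))
    crit = stats.f.ppf(0.95, dfn=q, dfd=n - k)
    print(F, crit, F > crit)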

Test the overall significance of the regression

Consider the regression model
$$y_i = \beta_1 + \beta_2 x_{i2} + \dots + \beta_k x_{ik} + \varepsilon_i,$$
where a constant is included, say by $x_{i1} = 1$ for all $i$.
To test if this regression makes any sense at all we set up the hypothesis
$$H_0: \beta_2 = \dots = \beta_k = 0, \qquad H_1: H_0 \text{ is not true.}$$
We will show that in this case the test statistic
$$F = \frac{R^2/(k-1)}{(1 - R^2)/(n - k)}$$
has an F-distribution with parameters $(k - 1, n - k)$.

Example 14 (wage equation) For the wage equation above the F-statistic for overall
significance is far above the 5% critical value, so the regressors are jointly significant.

5.4 Testing values of β: theoretical part

We begin with the introduction of several important testing distributions. Then we
derive some results on statistical independence in the linear regression model. Finally
we derive the test statistics.

$\chi^2$-distribution

Let $x = (x_1, \dots, x_m)'$ be an m-dimensional random vector with $x \sim N(0, I_m)$; then
$z = x'x = \sum_{i=1}^m x_i^2$ has a distribution with the density function
$$f(z) = c_m\, z^{m/2 - 1} e^{-z/2} \ \text{ if } z > 0, \qquad f(z) = 0 \ \text{ otherwise},$$
where $c_m = \big[2^{m/2}\,\Gamma(m/2)\big]^{-1}$.
$z$ is called centrally $\chi^2$-distributed with $m$ degrees of freedom.
Let $x \sim N(\mu, I_m)$; then $z = x'x$ has a non-central $\chi^2$-distribution with parameters $m$
and $\lambda = \mu'\mu$.

Remark: If $x \sim N(\mu, \Sigma)$ has an arbitrary normal distribution, then
$z_1 = (x - \mu)'\Sigma^{-1}(x - \mu)$ has a central $\chi^2$-distribution with $m$ degrees of freedom
and
$z_2 = x'\Sigma^{-1}x$ has a non-central $\chi^2$-distribution with parameters $m$ and $\lambda = \mu'\Sigma^{-1}\mu$.
F-distribution

Let $z_1$ and $z_2$ be two independent $\chi^2$-distributed random variables with $m_1$ and $m_2$
degrees of freedom respectively; then the random variable
$$F = \frac{z_1/m_1}{z_2/m_2}$$
has a central F-distribution. The corresponding density function is given by
$$f(F) = c_{m_1, m_2}\,\frac{F^{m_1/2 - 1}}{\big(1 + \tfrac{m_1}{m_2}F\big)^{(m_1 + m_2)/2}} \ \text{ if } F > 0, \qquad f(F) = 0 \ \text{ otherwise},$$
with
$$c_{m_1, m_2} = \Big(\frac{m_1}{m_2}\Big)^{m_1/2}\frac{\Gamma\big(\tfrac{m_1 + m_2}{2}\big)}{\Gamma\big(\tfrac{m_1}{2}\big)\,\Gamma\big(\tfrac{m_2}{2}\big)}.$$
Let $z_1$ and $z_2$ be independent random variables,
let $z_1$ be non-centrally $\chi^2$-distributed and $z_2$ be centrally $\chi^2$-distributed;
then the ratio of both is non-centrally F-distributed.

t-distribution

The t-distribution is given as a special case of the F-distribution.
If $m_1 = 1$ and the random variable $F$ is centrally F-distributed, then the random
variable
$$t = \sqrt{F}$$
has a central t-distribution with $m_2$ degrees of freedom;
if $F$ is non-centrally F-distributed, $t$ has a non-central t-distribution.
Example: the ratio of a standard normally distributed random variable and the square
root of an independently $\chi^2$-distributed random variable divided by its degrees of freedom gives a random variable
with a t-distribution.
Results on stochastic independence of quadratic forms

Theorem 15 Let the n-dimensional random vector $x \sim N(\mu, \sigma^2 I)$; then
$$z = \frac{x'Ax}{\sigma^2} \ \text{ is } \chi^2(r) \text{ distributed with } \lambda = \frac{\mu'A\mu}{\sigma^2}$$
if and only if A is idempotent, symmetric and rank(A) = r.

Theorem 16 Let A be an idempotent $n \times n$ matrix, rank(A) = r,
let B be an $m \times n$ matrix with $BA = 0$,
and let $x \sim N(\mu, \sigma^2 I)$ be an $n \times 1$ random vector;
then the random variables $Bx$ and $z = x'Ax$ are stochastically independent.

Example: In the classical linear regression model with normally distributed errors, $\hat{\beta}$
and $\hat{\sigma}^2$ are independent:
$$\hat{\beta} = \beta + (X'X)^{-1}X'\varepsilon, \qquad \hat{\sigma}^2 = \frac{e'e}{n-k} = \frac{\varepsilon'M\varepsilon}{n-k}, \qquad \varepsilon \sim N(0, \sigma^2 I),$$
and
$$\big[(X'X)^{-1}X'\big] M = (X'X)^{-1}X' - (X'X)^{-1}X'X(X'X)^{-1}X' = 0,$$
so Theorem 16 applies with $B = (X'X)^{-1}X'$ and $A = M$.

Theorem 17 Let A be a symmetric idempotent $n \times n$ matrix, rank(A) = r,
let B be a symmetric $n \times n$ matrix and $x \sim N(\mu, \sigma^2 I)$.
If BA = 0 the quadratic forms $x'Ax$ and $x'Bx$ are stochastically independent.

Theorem 18 Under the assumptions A1, A2, A3, A4, A6 the quadratic form
$e'e/\sigma^2$ is $\chi^2$ distributed with (n-k) degrees of freedom.

Proof.
$$\frac{e'e}{\sigma^2} = \frac{\varepsilon'M\varepsilon}{\sigma^2},$$
where the matrix $M$ is symmetric and idempotent with rank($M$) = (n-k). The rest
follows from Theorem 15.

Theorem 19 Let $\beta_0$ be fixed.
Under the assumptions A1, A2, A3, A4, A6 the quadratic form
$$\frac{(\hat{\beta} - \beta_0)'X'X(\hat{\beta} - \beta_0)}{\sigma^2} \sim \chi^2(k, \lambda)$$
with
$$\lambda = \frac{(\beta - \beta_0)'X'X(\beta - \beta_0)}{\sigma^2},$$
and it is independent from $e'e/\sigma^2$.

Proof.
$$\hat{\beta} - \beta_0 = (X'X)^{-1}X'y - \beta_0 = (X'X)^{-1}X'(y - X\beta_0),$$
so that
$$X(\hat{\beta} - \beta_0) = X(X'X)^{-1}X'(y - X\beta_0) = P(y - X\beta_0)$$
and
$$(\hat{\beta} - \beta_0)'X'X(\hat{\beta} - \beta_0) = (y - X\beta_0)'P(y - X\beta_0), \qquad y - X\beta_0 \sim N\big(X(\beta - \beta_0), \sigma^2 I\big).$$
The matrix $P$ is idempotent and symmetric, hence the $\chi^2$ distribution follows from
Theorem 15. Further $PM = 0$, and we get independence from Theorem 16 and
Theorem 17.
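A quick simulation sketch (not part of the original notes; the dimensions are arbitrary) illustrates Theorem 18: across repeated samples $e'e/\sigma^2$ behaves like a $\chi^2_{n-k}$ variable, with mean $n-k$ and variance $2(n-k)$.

    import numpy as np

    rng = np.random.default_rng(7)
    n, k, sigma, reps = 30, 4, 1.5, 20000
    X = np.column_stack([np.ones(n), rng.standard_normal((n, k - 1))])
    M = np.eye(n) - X @ np.linalg.inv(X.T @ X) @ X.T

    eps = sigma * rng.standard_normal((reps, n))
    stat = np.einsum('ri,ij,rj->r', eps, M, eps) / sigma**2   # e'e / sigma^2 = eps'M eps / sigma^2
    print(stat.mean(), n - k)        # approximately n - k
    print(stat.var(), 2 * (n - k))   # approximately 2(n - k)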
Now we have collected all tools that are necessary to derive the test statistics.
First we note a helpful equation which can be verified by multiplying out:
$$(y - X\beta_0)'(y - X\beta_0) = e'e + (\hat{\beta} - \beta_0)'X'X(\hat{\beta} - \beta_0). \qquad (5)$$

Hypothesis I: $H_0: \beta = \beta_0$

The first hypothesis is a hypothesis on the complete parameter vector. We derive the F
test statistic from a Likelihood Ratio test. Remember that the LR test statistic is given
by
$$\lambda = \frac{L(\hat{\theta}_R)}{L(\hat{\theta}_U)}.$$
We already derived the Maximum Likelihood estimators for $\beta$ and $\sigma^2$ from the Likelihood Function
$$L(\beta, \sigma^2) = (2\pi\sigma^2)^{-n/2}\exp\Big\{-\frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta)\Big\}:$$
$$\hat{\beta} = (X'X)^{-1}X'y \quad \text{and} \quad \hat{\sigma}^2 = \frac{e'e}{n}.$$
In the restricted model we have $\beta = \beta_0$ and $\hat{\sigma}_R^2 = \frac{1}{n}(y - X\beta_0)'(y - X\beta_0)$.
What we need are the values of the Likelihood function at the restricted maximum and
at the unrestricted maximum:
$$L(\hat{\theta}_U) = (2\pi\hat{\sigma}^2)^{-n/2} e^{-n/2}, \qquad L(\hat{\theta}_R) = (2\pi\hat{\sigma}_R^2)^{-n/2} e^{-n/2}.$$
Now we get the LR test statistic
$$\lambda^{-2/n} = \frac{\hat{\sigma}_R^2}{\hat{\sigma}^2} = \frac{(y - X\beta_0)'(y - X\beta_0)}{e'e} = 1 + \frac{(\hat{\beta} - \beta_0)'X'X(\hat{\beta} - \beta_0)}{e'e} = 1 + \frac{k}{n - k}\,F \qquad (6)$$
with
$$F = \frac{(\hat{\beta} - \beta_0)'X'X(\hat{\beta} - \beta_0)/k}{e'e/(n - k)};$$
for the last equality we used equation (5).
Under $H_0$, $F$ is centrally F-distributed with parameters $(k, n - k)$ (see Theorem 19
and the definition of the F-distribution).
Under $H_1$, $F$ is non-centrally F-distributed with parameters $(k, n - k, \lambda)$ and $\lambda = (\beta - \beta_0)'X'X(\beta - \beta_0)/\sigma^2$.

Hypothesis II: $H_0: \beta_2 = \beta_2^0$

This is a hypothesis on a part of the parameter vector. We partition $\beta$ in two parts
according to the test hypothesis,
$$\beta = \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix}, \qquad X = (X_1 \ X_2),$$
where $\beta_1$ is a $(k - q) \times 1$ vector and $\beta_2$ is a $q \times 1$ vector. That means we are testing
$q$ restrictions,
$$H_0: \beta_2 = \beta_2^0.$$
Now the restricted model is given by
$$y - X_2\beta_2^0 = X_1\beta_1 + \varepsilon.$$
We define $P_1$ by
$$P_1 = X_1(X_1'X_1)^{-1}X_1', \qquad M_1 = I - P_1, \qquad H = X_2'M_1X_2,$$
and we can express the residuals from this regression by
$$e_1 = M_1(y - X_2\beta_2^0).$$
Next we need an auxiliary equation analogous to equation (5):
$$e_1'e_1 = e'e + (\hat{\beta}_2 - \beta_2^0)'H(\hat{\beta}_2 - \beta_2^0). \qquad (7)$$
The LR test statistic now equals, applying equation (7),
$$\lambda^{-2/n} = \frac{\hat{\sigma}_R^2}{\hat{\sigma}^2} = \frac{e_1'e_1}{e'e} = 1 + \frac{q}{n - k}\,F,$$
with
$$F = \frac{(\hat{\beta}_2 - \beta_2^0)'H(\hat{\beta}_2 - \beta_2^0)/q}{e'e/(n - k)}.$$
Under $H_0$, $F$ is centrally F-distributed with parameters $(q, n - k)$. Under $H_1$, $F$ is
non-centrally F-distributed with parameters $(q, n - k, \lambda)$ and $\lambda = (\beta_2 - \beta_2^0)'H(\beta_2 - \beta_2^0)/\sigma^2$.


Hypothesis III: $H_0: \beta_j = \beta_j^0$

This is the hypothesis about a single element in the parameter vector. The third hypothesis is a special case of the second with $q = 1$:
$$H_0: \beta_j = \beta_j^0, \qquad H_1: \beta_j \neq \beta_j^0.$$
As this is a special case of Hypothesis II with $q = 1$,
$$H = X_2'M_1X_2 = \Big(\big[(X'X)^{-1}\big]_{jj}\Big)^{-1},$$
and thus
$$F = \frac{(\hat{\beta}_j - \beta_j^0)^2}{\hat{\sigma}^2\big[(X'X)^{-1}\big]_{jj}}.$$
Now we can derive the familiar t test statistic,
$$t = \frac{\hat{\beta}_j - \beta_j^0}{\hat{\sigma}\big[(X'X)^{-1}\big]_{jj}^{1/2}} = \frac{\hat{\beta}_j - \beta_j^0}{se(\hat{\beta}_j)}, \qquad t^2 = F.$$
Under $H_0$, $t$ is centrally t-distributed with $n - k$ degrees of freedom. Under $H_1$, $t$ is non-centrally t-distributed.
5.5 Confidence intervals

With the help of the t-statistic we can construct a confidence interval for the parameter $\beta_j$:
$$t = \frac{\hat{\beta}_j - \beta_j}{se(\hat{\beta}_j)} \sim t_{n-k}.$$
Therefore a confidence interval for $\beta_j$ is given by
$$\Pr\Big(\hat{\beta}_j - t_{\alpha/2}\, se(\hat{\beta}_j) \le \beta_j \le \hat{\beta}_j + t_{\alpha/2}\, se(\hat{\beta}_j)\Big) = 1 - \alpha,$$
where $t_{\alpha/2}$ is the $(1 - \alpha/2)$ quantile of the t-distribution with parameter $n - k$.
For example the 95% confidence interval for $\beta_j$ is given by
$$\big[\hat{\beta}_j - c\, se(\hat{\beta}_j),\ \hat{\beta}_j + c\, se(\hat{\beta}_j)\big],$$
where $c$ is the 97.5%-quantile of the $t_{n-k}$-distribution.

To find a confidence interval for $\sigma^2$ we remember that
$$\frac{(n-k)\,\hat{\sigma}^2}{\sigma^2} = \frac{e'e}{\sigma^2} \sim \chi^2_{n-k},$$
and we construct the confidence interval by
$$\Pr\Big(\frac{(n-k)\,\hat{\sigma}^2}{\chi^2_{1-\alpha/2}} \le \sigma^2 \le \frac{(n-k)\,\hat{\sigma}^2}{\chi^2_{\alpha/2}}\Big) = 1 - \alpha,$$
where $\chi^2_{\gamma}$ denotes the $\gamma$-quantile of the $\chi^2_{n-k}$-distribution.
5.6 Testing linear restrictions

We consider again the classical linear regression model with normally distributed errors, in which assumptions A1 – A6 hold:
$$y = X\beta + \varepsilon.$$
In addition we impose a set of $q$ linear restrictions on the model,
$$r_{11}\beta_1 + r_{12}\beta_2 + \dots + r_{1k}\beta_k = d_1$$
$$\vdots$$
$$r_{q1}\beta_1 + r_{q2}\beta_2 + \dots + r_{qk}\beta_k = d_q.$$
In matrix form the restrictions are written as
$$R\beta = d,$$
where $R$ is a $q \times k$ matrix of known constants with linearly independent rows and
$q \le k$. The vector $d$ is a $q \times 1$ vector of known constants.

Testing one linear restriction

We start with the case of $q = 1$ where we test the following restriction:
$$H_0: r_1\beta_1 + r_2\beta_2 + \dots + r_k\beta_k = r'\beta = d.$$
The sample estimate of $r'\beta$ is given by $r'\hat{\beta}$,
and we can construct a t-test for it by
$$t = \frac{r'\hat{\beta} - d}{se(r'\hat{\beta})} \sim t_{n-k}.$$
We still need to specify the standard deviation $se(r'\hat{\beta})$. Under the assumption of normality of the error terms it is determined by Corollary 8:
$$se(r'\hat{\beta}) = \sqrt{\widehat{\operatorname{Var}}(r'\hat{\beta})} = \hat{\sigma}\sqrt{r'(X'X)^{-1}r}.$$
The other possibility to construct the test is by re-parametrisation. We will see how
this works in an example.
this works in an example.

Example 20 The aim is to compare the returns to education between a two-year college (junior college) and a four-year college (university). The model we have in mind
is
$$\log(wage) = \beta_0 + \beta_1\, jc + \beta_2\, univ + \beta_3\, exper + \varepsilon.$$
The hypothesis of interest: is one year of a junior college worth one year of university
education?
$$H_0: \beta_1 = \beta_2, \qquad H_1: \beta_1 \neq \beta_2.$$
The restriction equation is given by
$$\beta_1 - \beta_2 = d = 0,$$
and the test statistic we derived before is
$$t = \frac{\hat{\beta}_1 - \hat{\beta}_2}{se(\hat{\beta}_1 - \hat{\beta}_2)}.$$
Now we estimate the model and get the following estimation result (standard errors in
parentheses):
$$\widehat{\log(wage)} = 1.430 + 0.098\, jc + 0.124\, univ + 0.019\, exper$$
$$\qquad\qquad\quad (0.270)\quad\ (0.031)\qquad (0.035)\qquad\ \ (0.008)$$
The difference between the coefficients of interest is $\hat{\beta}_1 - \hat{\beta}_2 = -0.026$, which is −2.6%
of wage. Is this statistically significant?
For the re-parametrisation method we define a new parameter $\theta$ by
$$\theta = \beta_1 - \beta_2.$$
We want to test
$$H_0: \theta = 0, \qquad H_1: \theta \neq 0.$$
It is possible to rewrite the model so that $\theta$ appears directly as the parameter of one of the
independent variables:
$$\log(wage) = \beta_0 + (\theta + \beta_2)\, jc + \beta_2\, univ + \beta_3\, exper + \varepsilon = \beta_0 + \theta\, jc + \beta_2\,(jc + univ) + \beta_3\, exper + \varepsilon,$$
where $totcoll = jc + univ$ is the total number of years of college.
OLS estimation of the new model gives
$$\widehat{\log(wage)} = 1.430 - 0.026\, jc + 0.124\, totcoll + 0.019\, exper$$
$$\qquad\qquad\quad (0.270)\quad\ (0.018)\qquad (0.035)\qquad\quad (0.008)$$
Now we immediately see that $t_\theta = -0.026/0.018 \approx -1.44$, not significantly different from
zero at the 5% level.
Note:
- $\hat{\beta}_0$ and $\hat{\beta}_3$ are equal for both estimated equations,
- $R^2$ is the same as well.
A confidence interval for $\theta$ is given by $\hat{\theta} \pm c\cdot se(\hat{\theta})$.

Testing q linear restrictions

Now we will generalise the two approaches that were introduced in the example above:
- direct approach,
- re-parametrisation approach.
We consider the general hypothesis
$$H_0: R\beta = d.$$
To simplify the analysis we partition $X$ in two groups of variables according to its
columns. One group of variables consists of $q$ linearly independent columns and the
second group consists of the remaining $k - q$ columns. To be consistent in notation
the linearly independent columns come last in $R$ (that means they form $X_2$):
$$X = (X_1 \ X_2), \qquad R = (R_1 \ R_2), \qquad R\beta = d.$$
Consequently, in the restricted model (the model on which the restrictions are imposed)
only the first $k - q$ elements of $\beta$ are free to vary.
All the hypotheses we studied in the previous section are special cases of linear restrictions. Examples of how the hypotheses translate into linear restrictions are
1. $H_0: \beta = \beta_0$: $R = I_k$, $d = \beta_0$.

2. $H_0: \beta_2 = \beta_2^0$ (a subvector): $R = (0 \ \ I_q)$, $d = \beta_2^0$.

3. $H_0: \beta_j = \beta_j^0$: $R = (0, \dots, 0, 1, 0, \dots, 0)$ with the 1 in position $j$, $d = \beta_j^0$.

4. Several restrictions at once, e.g. $H_0: \beta_2 = \beta_3,\ \beta_4 = 0$:
$$R = \begin{pmatrix} 0 & 1 & -1 & 0 & \cdots & 0 \\ 0 & 0 & 0 & 1 & \cdots & 0 \end{pmatrix}, \qquad d = \begin{pmatrix} 0 \\ 0 \end{pmatrix}.$$
Given the OLS-estimator $\hat{\beta}$, our interest centres on the discrepancy vector
$$m = R\hat{\beta} - d.$$
Remember
$$\varepsilon \sim N(0, \sigma^2 I), \qquad \hat{\beta} \sim N\big(\beta, \sigma^2(X'X)^{-1}\big),$$
$$R\hat{\beta} \sim N\big(R\beta, \sigma^2 R(X'X)^{-1}R'\big),$$
so that under $H_0: R\beta = d$
$$m = R\hat{\beta} - d \sim N\big(0, \sigma^2 R(X'X)^{-1}R'\big).$$
To construct a test statistic we use the approach of the Wald test:
$$W = m'\big[\operatorname{Var}(m)\big]^{-1}m = (R\hat{\beta} - d)'\big[\sigma^2 R(X'X)^{-1}R'\big]^{-1}(R\hat{\beta} - d) \sim \chi^2_q.$$
$\sigma^2$, however, is still unknown, therefore we will construct an F-test.
We know that $W \sim \chi^2_q$; by a result analogous to the one in Theorem 19 it is independent from
$$\frac{e'e}{\sigma^2} = \frac{(n-k)\,\hat{\sigma}^2}{\sigma^2} \sim \chi^2_{n-k},$$
and an F-statistic is constructed by
$$F = \frac{(R\hat{\beta} - d)'\big[R(X'X)^{-1}R'\big]^{-1}(R\hat{\beta} - d)/q}{e'e/(n-k)} = \frac{(R\hat{\beta} - d)'\big[\hat{\sigma}^2 R(X'X)^{-1}R'\big]^{-1}(R\hat{\beta} - d)}{q}.$$
The two following applications show that this approach results in exactly the same test
statistics we derived in the previous section.

1. Hypothesis III on a single parameter:
$$H_0: \beta_j = \beta_j^0, \qquad R = (0, \dots, 0, 1, 0, \dots, 0), \qquad d = \beta_j^0,$$
$$R(X'X)^{-1}R' = \big[(X'X)^{-1}\big]_{jj}, \qquad \hat{\sigma}^2\big[(X'X)^{-1}\big]_{jj} = se(\hat{\beta}_j)^2,$$
$$F = \frac{(\hat{\beta}_j - \beta_j^0)^2}{se(\hat{\beta}_j)^2} = t_j^2.$$

2. Hypothesis II on a group of parameters:
$$H_0: \beta_2 = 0,$$
where $R = (0 \ \ I_q)$ is a partitioned matrix according to our notation and $d = (0, \dots, 0)'$
is a $q \times 1$ vector of zeros.
We make use of a result on the inversion of partitioned matrices: for
$$X'X = \begin{pmatrix} X_1'X_1 & X_1'X_2 \\ X_2'X_1 & X_2'X_2 \end{pmatrix}$$
the bottom right block of $(X'X)^{-1}$ equals $\big(X_2'M_1X_2\big)^{-1}$ with $M_1 = I - X_1(X_1'X_1)^{-1}X_1'$.
Multiplying $(X'X)^{-1}$ from the left and from the right with the partitioned matrix $R = (0 \ \ I_q)$
cuts out this bottom right corner of $(X'X)^{-1}$. Thus
$$F = \frac{\hat{\beta}_2'\big[R(X'X)^{-1}R'\big]^{-1}\hat{\beta}_2/q}{e'e/(n-k)} = \frac{\hat{\beta}_2'X_2'M_1X_2\,\hat{\beta}_2/q}{e'e/(n-k)},$$
which is the well-known F-test statistic with $H = X_2'M_1X_2$ the matrix defined
in the last section.

Further interpretation of the F-test

We start with examining the quadratic form
$$\hat{\beta}_2'X_2'M_1X_2\hat{\beta}_2.$$
Premultiplying the fitted regression equation of the unrestricted model,
$$y = X_1\hat{\beta}_1 + X_2\hat{\beta}_2 + e,$$
by $M_1$ yields
$$M_1 y = M_1X_1\hat{\beta}_1 + M_1X_2\hat{\beta}_2 + M_1 e.$$
We know that $M_1X_1 = 0$ and $M_1 e = e$ (since $X_1'e = 0$), thus
$$M_1 y = M_1X_2\hat{\beta}_2 + e.$$
Transpose this equation and multiply by $M_1 y$:
$$y'M_1 y = \hat{\beta}_2'X_2'M_1X_2\hat{\beta}_2 + e'e,$$
so that
$$\hat{\beta}_2'X_2'M_1X_2\hat{\beta}_2 = y'M_1 y - e'e = e_R'e_R - e_U'e_U,$$
where the subscripts $R$ and $U$ stand for "restricted" and "unrestricted".
Applying this result we can now rewrite the F-test statistic:
$$F = \frac{(e_R'e_R - e_U'e_U)/q}{e_U'e_U/(n-k)} = \frac{(R^2_U - R^2_R)/q}{(1 - R^2_U)/(n-k)}.$$
The intuition for the reformulated F-test statistic comes from the following paragraphs, so be a little patient with the interpretation.

Restricted least squares estimation

Here we have the restricted linear regression model in mind. That means we want to
solve the optimisation problem
$$\min_{\beta}\ S(\beta) = (y - X\beta)'(y - X\beta) \quad \text{subject to } R\beta = d.$$
Let us set up the Lagrange function
$$\mathcal{L}(\beta, \lambda) = (y - X\beta)'(y - X\beta) + 2\lambda'(R\beta - d)$$
and derive the first order conditions
$$\frac{\partial \mathcal{L}}{\partial \beta} = -2X'y + 2X'X\beta + 2R'\lambda = 0,$$
$$\frac{\partial \mathcal{L}}{\partial \lambda} = 2(R\beta - d) = 0.$$
If we call the optimal parameter value $\beta^*$, solving the first order conditions results in
$$\beta^* = (X'X)^{-1}X'y - (X'X)^{-1}R'\lambda = \hat{\beta} - (X'X)^{-1}R'\lambda,$$
$$R\beta^* = R\hat{\beta} - R(X'X)^{-1}R'\lambda = d,$$
$$\lambda = \big[R(X'X)^{-1}R'\big]^{-1}(R\hat{\beta} - d),$$
and we get
$$\beta^* = \hat{\beta} - (X'X)^{-1}R'\big[R(X'X)^{-1}R'\big]^{-1}(R\hat{\beta} - d). \qquad (8)$$
For the variance of this restricted estimator we note
$$\operatorname{Var}(\beta^*) = \sigma^2(X'X)^{-1} - \sigma^2(X'X)^{-1}R'\big[R(X'X)^{-1}R'\big]^{-1}R(X'X)^{-1},$$
where the subtracted term is a non-negative definite matrix.

The variance of the restricted estimator $\beta^*$ equals the variance of the unrestricted OLS
estimator minus a non-negative definite matrix. This implies that
$$\operatorname{Var}(\beta^*) \le \operatorname{Var}(\hat{\beta}).$$
There occurs a reduction in the variance if we move from the unrestricted model
to the restricted model. The intuition is that the restrictions contain additional information on the model and consequently the precision of the estimation increases. This
leads us to a new idea for a test. If the restrictions are valid in the general unrestricted
model then the reduction in variance should not be very large. On the other hand, if the
restrictions are not valid in general, they contain substantial additional information and
therefore we should observe a large reduction in the variance of the restricted estimate.
So we construct a test based on the loss of fit. Denote by $e^*$ the residuals from the
restricted linear regression model,
$$e^* = y - X\beta^* = y - X\hat{\beta} + X(\hat{\beta} - \beta^*) = e + X(\hat{\beta} - \beta^*).$$
Note: According to the notation from above $e^* = e_R$. The unrestricted model given
by $y = X\beta + \varepsilon$ has the residuals $e = e_U$.
Transpose the equation above and multiply by $e^*$ (the cross terms vanish because $X'e = 0$):
$$e^{*\prime}e^* = e'e + (\hat{\beta} - \beta^*)'X'X(\hat{\beta} - \beta^*).$$
Now substitute for $\hat{\beta} - \beta^*$ from equation (8):
$$e^{*\prime}e^* - e'e = (R\hat{\beta} - d)'\big[R(X'X)^{-1}R'\big]^{-1}(R\hat{\beta} - d).$$
Finally we end up with the F-test statistic
$$F = \frac{(e^{*\prime}e^* - e'e)/q}{e'e/(n-k)} = \frac{(R\hat{\beta} - d)'\big[R(X'X)^{-1}R'\big]^{-1}(R\hat{\beta} - d)/q}{e'e/(n-k)} = \frac{(e_R'e_R - e_U'e_U)/q}{e_U'e_U/(n-k)}.$$
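The restricted least squares estimator of equation (8) and the F-statistic can be coded directly. The sketch below uses a hypothetical restriction $\beta_2 + \beta_3 = 1$ on simulated data (an assumption for illustration only) and checks that the Wald form and the loss-of-fit form give the same value:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(9)
    n, k = 120, 3
    X = np.column_stack([np.ones(n), rng.standard_normal((n, k - 1))])
    y = X @ np.array([1.0, 0.6, 0.4]) + rng.standard_normal(n)

    R = np.array([[0.0, 1.0, 1.0]])       # restriction: beta_2 + beta_3 = 1
    d = np.array([1.0])
    q = R.shape[0]

    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    e = y - X @ beta_hat

    A = R @ XtX_inv @ R.T
    beta_star = beta_hat - XtX_inv @ R.T @ np.linalg.solve(A, R @ beta_hat - d)  # equation (8)
    e_star = y - X @ beta_star

    m = R @ beta_hat - d
    F_direct = (m @ np.linalg.solve(A, m) / q) / (e @ e / (n - k))
    F_lossfit = ((e_star @ e_star - e @ e) / q) / (e @ e / (n - k))
    print(F_direct, F_lossfit, stats.f.ppf(0.95, q, n - k))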
6 Some tests for specification error

Given the assumptions for the multiple linear regression model in section 3 we derived
estimators and showed that they have desirable properties (linearity, unbiasedness and
minimal variance). Further we employed an array of inference procedures. However,
there is a crucial question: how do we know whether the assumptions underlying our estimation framework are valid for the given data set?
If the assumptions are wrong there is a specification error in the model.

6.1 Tests for structural change

In the classical linear regression model the assumptions apply to all observations in the
whole sample. In this section we want to test the hypothesis that some or all regression
coefficients are different in subsets of the sample.
Applications for these tests occur in different contexts, mainly depending on the data type:
Time series data: Regime shifts

Cross-sectional data: Differences among population groups

Pooled cross-sectional data: Training models, policy evaluations

Structural break in all variables

We explain the derivation of the Chow test on the basis of examples using the Longley
data. The dependent variable in the regression models will be employment, either total
or in one of two sectors. The independent variables will be a constant, a time
trend, GNP, the GNP deflator and the number of armed forces. The data set spans
the years 1947 - 1962. Within this period falls the Korean war, ending in 1953. We
consider a model for employment,
$$employment_t = \beta_1 + \beta_2\, year_t + \beta_3\, GNP_t + \beta_4\, GNPdeflator_t + \beta_5\, armedforces_t + \varepsilon_t. \qquad (9)$$
Is there a difference between wartime 1947 - 1953 and peacetime 1954 - 1962? We
partition the observations according to these periods:
$$y_1, X_1 \ \text{ in the years 1947 - 1953}, \qquad y_2, X_2 \ \text{ in the years 1954 - 1962}.$$
The unrestricted model is the model which allows for different parameters in both
periods,
$$\begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = \begin{pmatrix} X_1 & 0 \\ 0 & X_2 \end{pmatrix}\begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \end{pmatrix};$$
the estimator for the parameters is equivalent to the one we get from estimating two
separate regressions,
$$\hat{\beta}_1 = (X_1'X_1)^{-1}X_1'y_1, \qquad \hat{\beta}_2 = (X_2'X_2)^{-1}X_2'y_2.$$
The sum of squared residuals in the unrestricted model equals
$$e_U'e_U = e_1'e_1 + e_2'e_2.$$
To test whether the parameters are equal in both periods we set up the hypothesis
$$H_0: \beta_1 = \beta_2, \quad \text{i.e. } R\beta = d \ \text{ with } R = (I_k \ \ -I_k), \ d = 0.$$
The test statistic for this problem is given by
$$F = \frac{(R\hat{\beta} - d)'\big[R(X'X)^{-1}R'\big]^{-1}(R\hat{\beta} - d)/k}{e_U'e_U/(n - 2k)}.$$
The computation of the test statistic in this form requires additional programming
steps. We can, however, choose to estimate the restricted model and simplify the
computations. The restricted model is the model which pools all observations. As
no differences between the time periods are assumed, OLS can be estimated for the
complete sample,
$$\begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}\beta + \varepsilon.$$
The residuals from the restricted model are $e_R$.
We formulate the test statistic by comparison of the residuals. This test is also known
as the Chow Breakpoint Test:
$$F = \frac{(e_R'e_R - e_U'e_U)/k}{e_U'e_U/(n - 2k)}.$$
Estimation results from the Longley data give

Years   1947 - 62   1947 - 53   1954 - 62
SSR     4898.6      345.2       800.2

$$F = \frac{\big(4898.6 - (345.2 + 800.2)\big)/5}{(345.2 + 800.2)/(16 - 10)} \approx 3.9,$$
which for $\alpha = 0.05$ is below the critical value 4.39; the hypothesis of equal parameters in both periods is therefore not rejected.
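A generic sketch of the Chow breakpoint test (Python code with simulated data; the break point and dimensions are placeholders, not the Longley figures):

    import numpy as np
    from scipy import stats

    def ssr(y, X):
        b = np.linalg.solve(X.T @ X, X.T @ y)
        e = y - X @ b
        return e @ e

    rng = np.random.default_rng(10)
    n, k, n1 = 80, 3, 40                                   # n1 observations before the break
    X = np.column_stack([np.ones(n), rng.standard_normal((n, k - 1))])
    y = X @ np.array([1.0, 0.5, -0.5]) + rng.standard_normal(n)

    ssr_r = ssr(y, X)                                      # pooled (restricted) model
    ssr_u = ssr(y[:n1], X[:n1]) + ssr(y[n1:], X[n1:])      # two subsample regressions
    F = ((ssr_r - ssr_u) / k) / (ssr_u / (n - 2 * k))
    print(F, stats.f.ppf(0.95, k, n - 2 * k))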


In this example we were interested in whether the regime shift affects all parameters. It might,
however, be that the differences in regimes only affect certain parameters, e.g. only the
intercept terms or only the slopes.

Different constant terms

In the next example we keep the model from equation (9) above, but we consider differences in employment between the agricultural and the nonagricultural sectors. Employment levels in both sectors are of different magnitude. Therefore we could allow
for different intercepts and test whether the independent variables affect employment
in both sectors differently. This means we test whether the slope coefficients alone are
different. Now $y_1$ and $y_2$ correspond to employment in each sector and the matrices
of independent variables are equal, $X_1 = X_2$.
We can formulate the restricted model as follows:
$$\begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = \begin{pmatrix} \iota & 0 & X^* \\ 0 & \iota & X^* \end{pmatrix}\begin{pmatrix} \alpha_1 \\ \alpha_2 \\ \beta \end{pmatrix} + \varepsilon.$$
The first two columns are dummy variables indicating the sector in which the observation falls. $X^*$ includes all columns of $X$ except the constant.

Estimation results show $F = 106.8$, significant.
To investigate further which variable is responsible for the differences in employment
in both sectors we change to a subset of coefficients. In the next step we allow for
different intercepts and time trends in both sectors.
                          Agricultural   Non Agricultural   Restricted M.
Constant (Agricultural)   201.8                             626.2
Constant (Non Agric.)                    1086.7             662.4
SSR                       241.2          5037.9             107780.2

The restricted model now allows for sector-specific intercepts and time trends, while the remaining slope coefficients are restricted to be equal across sectors:
$$\begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = \begin{pmatrix} \iota & 0 & t & 0 & X^{**} \\ 0 & \iota & 0 & t & X^{**} \end{pmatrix}\begin{pmatrix} \alpha_1 \\ \alpha_2 \\ \gamma_1 \\ \gamma_2 \\ \beta \end{pmatrix} + \varepsilon,$$
where $t$ is the time trend and $X^{**}$ contains the remaining regressors.
The test gives $F = 9.22$, critical value 3.05.

Dummy variables

We will introduce dummy variables on the basis of examples of wage equations. Individual wages in a cross-sectional sample are explained by the degree of education and
further personal characteristics.
So far in our examples we mainly worked with quantitative variables like wage, GNP, years of education,
etc.
Qualitative variables are, for example, sex, race, marital status, sector or region.
How can we include qualitative variables as independent variables in the regression
model?

Binary variables. We define a binary variable - a dummy variable - for example
$$female_i = \begin{cases} 1 & \text{if person } i \text{ is a woman} \\ 0 & \text{otherwise.} \end{cases}$$
We can use this variable to estimate differences in the mean wage for men and women
in the model
$$wage = \beta_0 + \delta_0\, female + \beta_1\, educ + \varepsilon.$$
The parameter
$$\delta_0 = E(wage \mid female = 1, educ) - E(wage \mid female = 0, educ)$$
estimates the average wage differential between men and women holding years of
education fixed.

Multiple categories. In the same way we can define dummy variables for multiple
categories like
married woman, married man, unmarried woman, unmarried man.
In the model we have to exclude one reference category, e.g.
$$\log(wage) = \beta_0 + \delta_1\, marrfem + \delta_2\, marrmale + \delta_3\, singfem + \beta_1\, educ + \varepsilon.$$
The parameters $\delta_1, \delta_2, \delta_3$ give the wage differentials of the other groups relative to single men,
again holding years of education fixed.

Ordinal variables. Suppose we only know the individual's highest educational degree instead of the years of education. There is information on: primary school, high
school, college education, etc. It is possible to construct a variable of the form
$$educ = \begin{cases} 1 & \text{primary school} \\ 2 & \text{high school} \\ 3 & \text{college} \\ \ \vdots \end{cases}$$
and include it in the model. But it is preferable to form a set of dummy variables $D_1$,
$D_2$, $D_3$, ..., because the differences between the educational categories may not be
linear.
Interaction of variables. We already had an example for the interaction of dummy
variables: interacting the categories woman/man and married/unmarried.
It is also possible to interact dummies and quantitative variables. Consider the model
$$\log(wage) = \beta_0 + \delta_0\, female + \beta_1\, educ + \delta_1\,(female \cdot educ) + \varepsilon = (\beta_0 + \delta_0\, female) + (\beta_1 + \delta_1\, female)\, educ + \varepsilon.$$
With this regression model we can examine the question whether there are differences in the
return to education for men and women. We allow for gender-specific slopes of the
wage profiles as well as for different intercepts.

The hypothesis that no differences in returns to education between the sexes exist is
given by
$$H_0: \delta_1 = 0,$$
and the hypothesis that there are no wage differences between women and men is the
following:
$$H_0: \delta_0 = 0, \ \delta_1 = 0.$$

6.2 Prediction

After estimation of the model parameters suppose we want to predict the value $y_0$ for
some specific $k \times 1$ vector of regressors $x_0$.
In section 3.3 we have already shown that
$$\hat{y}_0 = x_0'\hat{\beta}$$
is the best linear unbiased predictor for $y_0$.
Now we can construct a confidence interval for the expected value
$$E(y_0) = x_0'\beta.$$
According to the regression model $\hat{y}_0 = x_0'\hat{\beta}$ is also the predictor for $E(y_0)$ and its
variance is given by
$$\operatorname{Var}(x_0'\hat{\beta}) = x_0'\operatorname{Var}(\hat{\beta})\,x_0 = \sigma^2\, x_0'(X'X)^{-1}x_0.$$
Consequently we get
$$\frac{x_0'\hat{\beta} - x_0'\beta}{\sigma\sqrt{x_0'(X'X)^{-1}x_0}} \sim N(0, 1),$$
and replacing $\sigma$ by its estimate $\hat{\sigma}$ gives
$$\frac{x_0'\hat{\beta} - x_0'\beta}{\hat{\sigma}\sqrt{x_0'(X'X)^{-1}x_0}} \sim t_{n-k}.$$
Hence we can construct a confidence interval, with significance level $\alpha$, for $E(y_0)$ by
$$x_0'\hat{\beta} \pm t_{\alpha/2}\, \hat{\sigma}\sqrt{x_0'(X'X)^{-1}x_0}.$$

In the next step we want to construct a prediction interval for $y_0 = x_0'\beta + \varepsilon_0$. We
define the prediction error by
$$\hat{e}_0 = \hat{y}_0 - y_0 = x_0'(\hat{\beta} - \beta) - \varepsilon_0.$$
The variance of the prediction error $\hat{e}_0$ is given by
$$\operatorname{Var}(\hat{e}_0) = E(\hat{e}_0\hat{e}_0') = \sigma^2\, x_0'(X'X)^{-1}x_0 + \sigma^2 = \sigma^2\big(1 + x_0'(X'X)^{-1}x_0\big).$$
Now we have
$$\frac{\hat{y}_0 - y_0}{\hat{\sigma}\sqrt{1 + x_0'(X'X)^{-1}x_0}} \sim t_{n-k},$$
and the prediction interval for $y_0$ is
$$x_0'\hat{\beta} \pm t_{\alpha/2}\, \hat{\sigma}\sqrt{1 + x_0'(X'X)^{-1}x_0}.$$
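A sketch of a point forecast with its confidence and prediction intervals (Python, simulated data; x0 is an arbitrary new regressor vector chosen for illustration):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(11)
    n, k, alpha = 60, 3, 0.05
    X = np.column_stack([np.ones(n), rng.standard_normal((n, k - 1))])
    y = X @ np.array([2.0, 1.0, -0.5]) + rng.standard_normal(n)

    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    e = y - X @ beta_hat
    sigma_hat = np.sqrt(e @ e / (n - k))

    x0 = np.array([1.0, 0.5, -1.0])                 # new observation (includes the constant)
    y0_hat = x0 @ beta_hat
    c = stats.t.ppf(1 - alpha / 2, df=n - k)
    h = x0 @ XtX_inv @ x0
    ci_mean = (y0_hat - c * sigma_hat * np.sqrt(h), y0_hat + c * sigma_hat * np.sqrt(h))
    pi_y0 = (y0_hat - c * sigma_hat * np.sqrt(1 + h), y0_hat + c * sigma_hat * np.sqrt(1 + h))
    print(y0_hat, ci_mean, pi_y0)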
A convenient method of computing forecasts:

Suppose that estimation is based upon $n$ observations in the model $y = X\beta + \varepsilon$ and
we want to make $m$ forecasts of $y^0 = (y_1^0, \dots, y_m^0)'$ for given regressors $X^0$.
First construct the augmented regression model
$$\begin{pmatrix} y \\ y^0 \end{pmatrix} = \begin{pmatrix} X & 0 \\ X^0 & I_m \end{pmatrix}\begin{pmatrix} \beta \\ \gamma \end{pmatrix} + \varepsilon.$$
Each variable in the second block of regressors is a dummy variable which takes the value 1
for one observation and 0 for all other observations.

From OLS estimation of the augmented regression model we get the following results:
- The regression produces the coefficient vector $(\hat{\beta}', \hat{\gamma}')'$, where $\hat{\beta}$ are the OLS coefficients from the original model and $\hat{\gamma} = y^0 - X^0\hat{\beta}$ is the vector of prediction errors, so that $X^0\hat{\beta}$ are the predictions for $y^0$.
- The residuals from the augmented regression equal the residuals of the original model (the last $m$ residuals are zero), since the coefficients are the same.
- The estimated covariance matrix for $(\hat{\beta}', \hat{\gamma}')'$ is given by
$$\hat{\sigma}^2\begin{pmatrix} X'X + X^{0\prime}X^0 & X^{0\prime} \\ X^0 & I_m \end{pmatrix}^{-1},$$
which contains $\widehat{\operatorname{Var}}(\hat{\beta}) = \hat{\sigma}^2(X'X)^{-1}$ in the upper left and $\widehat{\operatorname{Var}}(\hat{e}_0) = \hat{\sigma}^2\big(I_m + X^0(X'X)^{-1}X^{0\prime}\big)$, the prediction error variance, in the lower right block.

Note:
A dummy variable that takes the value 1 for only one observation has the effect of
deleting this observation from the least squares computations.

6.3 Further specification tests

CHOW forecast test

This test is an alternative to the Chow breakpoint test if there is an insufficient number
of observations available in one of the subsamples. The concept of the test is based on an evaluation of the
predictive power of the model.
Out-of-sample forecasts provide an easy check of the model fit. To see how well the
estimated model predicts we might proceed in the following way.
First we estimate the OLS coefficients with $n_1$ observations and get the parameter
estimate
$$\hat{\beta}_1 = (X_1'X_1)^{-1}X_1'y_1.$$
Then we compute predictors for the other $n_2 = n - n_1$ observations,
$$\hat{y}_2 = X_2\hat{\beta}_1,$$
and obtain the prediction errors
$$\hat{e}_2 = y_2 - \hat{y}_2 = y_2 - X_2\hat{\beta}_1.$$
Finally we test the hypothesis $H_0: E(\hat{e}_2) = 0$.

Under the null hypothesis the restricted model is the regression model which pools all
observations,
$$\begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}\beta + \varepsilon.$$
Now we make use of the method for computing forecasts described above. As a
result of this method we get the unrestricted model, defined by
$$\begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = \begin{pmatrix} X_1 & 0 \\ X_2 & I_{n_2} \end{pmatrix}\begin{pmatrix} \beta \\ \gamma \end{pmatrix} + \varepsilon, \qquad e_U'e_U = e_1'e_1.$$
The F-test statistic is easily computed by
$$F = \frac{(e_R'e_R - e_1'e_1)/n_2}{e_1'e_1/(n_1 - k)}.$$
Note: Compare the test statistic of the CHOW Breakpoint test,
$$F = \frac{(e_R'e_R - e_U'e_U)/k}{e_U'e_U/(n - 2k)}.$$

6.4 Tests based on recursive estimation: CUSUM and CUSUMQ test

The next group of tests is based on a similar intuition: how good is the model's
ability to predict outside the range of observations used to estimate it?
The primary aim of the tests are applications to time series data. The tests are more
general than the CHOW tests in the sense that they do not require a prior specification
of when the structural break takes place.
The disadvantage of the CUSUM and CUSUMQ tests is, however, that they are of
limited power compared to the CHOW test.
First we introduce the concept of recursive residuals.

Recursive residuals

Suppose the sample contains $T$ observations. (We use $T$ instead of $n$ to indicate that
we are in a time-series setting.) Then the $t$-th recursive residual $e_t$ is defined as the one-step-ahead prediction error: the prediction error for $y_t$ from the model estimated with
only the first $t - 1$ observations,
$$e_t = y_t - x_t'\hat{\beta}_{t-1}, \qquad t = k+1, \dots, T,$$
where $x_t'$ corresponds to the $t$-th row of $X$ and $\hat{\beta}_{t-1}$ is the parameter estimate from the
model with $t - 1$ observations. The variance of the $t$-th recursive residual is given by
$$\operatorname{Var}(e_t) = \sigma^2\big(1 + x_t'(X_{t-1}'X_{t-1})^{-1}x_t\big),$$
where $X_{t-1}$ contains the first $t - 1$ rows of $X$. We define the $t$-th scaled residual as
$$w_t = \frac{y_t - x_t'\hat{\beta}_{t-1}}{\sqrt{1 + x_t'(X_{t-1}'X_{t-1})^{-1}x_t}}, \qquad t = k+1, \dots, T.$$
Thus, under the assumptions A1 – A6 and under the null hypothesis that the parameters
are constant during the full sample period, $w_t \sim N(0, \sigma^2)$. It can also be shown that
the scaled recursive residuals are pairwise uncorrelated.
The tests are based on the hypothesis that the distribution of $w_t$ does not change over
time.
1. CUSUM
The CUSUM test is based on the cumulative sum of the scaled recursive residuals,

$W_t = \sum_{j=k+1}^{t}\frac{w_j}{\hat\sigma}, \qquad t = k+1, \dots, T$

with

$\hat\sigma^2 = \frac{1}{T-k}\sum_{t=k+1}^{T}(w_t - \bar w)^2 \quad\text{and}\quad \bar w = \frac{1}{T-k}\sum_{t=k+1}^{T}w_t$

Under the null hypothesis $E(W_t) = 0$ and $Var(W_t) \approx t - k$.

The test is performed by plotting $W_t$ against time. Confidence bounds are obtained by two lines connecting the points $(k,\ \pm a\sqrt{T-k})$ and $(T,\ \pm 3a\sqrt{T-k})$. The parameter $a$ corresponds to the significance level; for $\alpha = 0.05$ one uses $a = 0.948$.
2. CUSUMQ
The CUSUMQ test is based on the cumulative sum of squares. It uses the test statistic

$S_t = \dfrac{\sum_{j=k+1}^{t} w_j^2}{\sum_{j=k+1}^{T} w_j^2}, \qquad t = k+1, \dots, T$

Since the scaled residuals are independent, the numerator and denominator of $S_t$ are approximately $\chi^2$ distributed and therefore

$E(S_t) \approx \dfrac{t-k}{T-k}$

Again the test statistic is plotted against time. Confidence bounds for $E(S_t)$, $t = k+1, \dots, T$, are constructed and plotted.
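The recursive residuals and the CUSUM path can be computed with a simple loop. Below is a minimal Python sketch (not part of the original notes); the function name and the convention that rows of X are ordered in time are assumptions.

```python
import numpy as np

def cusum_statistics(X, y):
    """Scaled recursive residuals w_t and their cumulative sums W_t (CUSUM path)."""
    T, k = X.shape
    w = []
    for t in range(k, T):                      # first usable period after k initial observations
        Xs, ys = X[:t], y[:t]                  # observations 1, ..., t-1 in the notes' notation
        XtX_inv = np.linalg.inv(Xs.T @ Xs)
        b = XtX_inv @ Xs.T @ ys
        xt = X[t]
        e_t = y[t] - xt @ b                    # one step ahead prediction error
        w.append(e_t / np.sqrt(1.0 + xt @ XtX_inv @ xt))
    w = np.array(w)
    sigma_hat = w.std(ddof=1)                  # estimate of sigma from the scaled residuals
    W = np.cumsum(w) / sigma_hat               # CUSUM path, to be plotted against time
    return w, W
```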
7 Asymptotic theory

Consider the estimation problem where we would like to estimate a parameter vector $\theta$ from a sample $y_1, \dots, y_n$. Let $\hat\theta_n$ be an estimator for $\theta$, i.e. let $\hat\theta_n = h(y_1, \dots, y_n)$ be a function of the sample. In the linear regression model $\hat\theta_n$ is a linear function of $y_1, \dots, y_n$, and we can easily express the expected value and the variance covariance matrix of $\hat\theta_n$ in terms of the first and second moments of the $y_i$, provided that they exist. In particular we saw that if the sample is normally distributed, so is $\hat\theta_n$. Frequently, however, the estimator of interest will be a nonlinear function of the sample, and the calculation of the exact expressions becomes very complex, or it is inappropriate to make specific assumptions about the distribution of the error. In view of these difficulties in obtaining the exact expressions for the characteristics of the estimators and their moments we will often have to resort to approximations for these exact expressions. Asymptotic theory is one of the ways of obtaining such expressions by essentially asking what happens to the exact expressions as the sample size tends to infinity. For example, if we are interested in the expected value of $\hat\theta_n$ and an exact expression is unavailable, we could ask if the expected value of $\hat\theta_n$ converges to $\theta$ in an appropriate sense.
In this section we give a short introduction to asymptotic theory and then apply it to the linear regression model. That means we drop assumption A6 about the normality of the error term and see what asymptotic theory can tell us about the distribution of the estimators $\hat\beta$ and $\hat\sigma^2$. Using convergence theorems like Laws of Large Numbers and Central Limit Theorems we will prove consistency and asymptotic normality of the OLS estimators.

7.1 Introduction to asymptotic theory

Various modes of convergence

First we define and discuss various modes of convergence for sequences of random variables taking their values in $\mathbb{R}$. The definitions and results can be extended to $k$-dimensional random vectors.

Definition 21 (Convergence in Probability) The sequence of random variables $x_n$ is said to converge to the random variable $x$ in probability if for every $\epsilon > 0$

$\lim_{n\to\infty} P(|x_n - x| > \epsilon) = 0$

We then write $x_n \xrightarrow{p} x$ or $\text{plim}\,x_n = x$.

Definition 22 (Convergence in Mean Square) The sequence of random variables $x_n$ is said to converge to the random variable $x$ in mean square if

$\lim_{n\to\infty} E(x_n - x)^2 = 0$

We then write $x_n \xrightarrow{m.s.} x$.

Definition 23 (Convergence in Distribution) The sequence of random variables $x_n$ is said to converge to the random variable $x$ in distribution if the distribution function $F_n$ of $x_n$ converges to the distribution function $F$ of $x$ at every continuity point of $F$.

We then write $x_n \xrightarrow{d} x$ and call $F$ the limiting distribution of $x_n$.

The following theorems establish relationships between the concepts of convergence.
Theorem 24 $x_n \xrightarrow{m.s.} x$ implies $x_n \xrightarrow{p} x$.

This is a direct consequence of Chebyshev's inequality, as we see from

Theorem 25 (Chebyshev) $E(x_n^2) \to 0$ implies $x_n \xrightarrow{p} 0$.

Proof. We can write

$E(x_n^2) = \int_{|x|\le\epsilon} x^2\,dF_n(x) + \int_{|x|>\epsilon} x^2\,dF_n(x) \ \ge\ \int_{|x|>\epsilon} x^2\,dF_n(x)$   (10)

and

$\int_{|x|>\epsilon} x^2\,dF_n(x) \ \ge\ \epsilon^2\int_{|x|>\epsilon} dF_n(x) = \epsilon^2\,P(|x_n| > \epsilon)$   (11)

Combining the inequalities in (10) and (11) we get

$P(|x_n| > \epsilon) \le \dfrac{E(x_n^2)}{\epsilon^2}$
This inequality is called Chebyshev’s inequality and it implies the theorem.
Note

1. A generalised form of Chebyshev's inequality is given by

$P\left(g(x_n) > \epsilon\right) \le \dfrac{E\,g(x_n)}{\epsilon}$

where $g$ is a nonnegative continuous function.

2. The general statement $x_n \xrightarrow{m.s.} x \ \Rightarrow\ x_n \xrightarrow{p} x$ follows from Theorem 25 if $x_n$ is replaced by $x_n - x$.

3. The converse of Theorem 25 is not generally true. For example, let

$x_n = \begin{cases} n & \text{with probability } 1/n \\ 0 & \text{with probability } 1 - 1/n \end{cases}$

Then $P(|x_n| > \epsilon) = 1/n \to 0$, so $x_n \xrightarrow{p} 0$, but $E(x_n^2) = n \to \infty$, so $x_n$ is not convergent in mean square.

The next corollary follows immediately from Theorem 25 by utilising the decomposition $E(x_n - c)^2 = Var(x_n) + \left(E(x_n) - c\right)^2$.

Corollary 26 Suppose $E(x_n) \to c$ and $Var(x_n) \to 0$; then $x_n \xrightarrow{p} c$.

This corollary is frequently used to show that for an estimator $\hat\theta_n$ with $E(\hat\theta_n) \to \theta$ (i.e. an asymptotically unbiased estimator) and with $Var(\hat\theta_n) \to 0$ we have $\hat\theta_n \xrightarrow{p} \theta$.
Theorem 27 $x_n \xrightarrow{p} x$ implies $x_n \xrightarrow{d} x$.

The converse of the theorem does not hold in general. To see this consider the following example. Let $x \sim N(0,1)$ and put $x_n = -x$ for all $n$. Then $x_n$ does not converge to $x$ in probability. But since each $x_n \sim N(0,1)$, evidently $x_n \xrightarrow{d} x$.

Convergence properties and transformations

Theorem 28 Let $y_n$ be a sequence of random variables whose first and second moments exist with $E(y_n) \to c$ and $E(y_n^2) \to c^2$, and let $x_n$ be a sequence of random variables with $\text{plim}\,x_n = x$.
Then $\text{plim}\,(x_n + y_n) = x + c$.

Theorem 29 (Slutsky) Let $x_n$ and $y_n$ be sequences of random variables with $\text{plim}\,x_n = c$, $\text{plim}\,y_n = d$, with $c$ and $d$ non-stochastic, and let $g(\cdot,\cdot)$ be a function continuous at $(c, d)$.

Then $\text{plim}\,g(x_n, y_n) = g(c, d)$.

Examples:

$\text{plim}\,(x_n + y_n) = c + d, \qquad \text{plim}\,(x_n y_n) = c\,d, \qquad \text{plim}\,(x_n / y_n) = c/d \ \ (d \ne 0)$

Such relations do not hold for expected values unless $x_n$ and $y_n$ are stochastically independent.

Theorem 30 (Bernstein) Let $x_n \xrightarrow{d} x$ and $y_n \xrightarrow{p} 0$. Then $x_n + y_n \xrightarrow{d} x$.

Theorem 31 Let $x_n = A_n z_n$ with $\text{plim}\,A_n = A$, and $z_n \xrightarrow{d} z$. Then $x_n \xrightarrow{d} Az$.

Asymptotic properties of estimators

Definition 32 (Consistency) A sequence of estimators $\hat\theta_n$ is called consistent if $\text{plim}\,\hat\theta_n = \theta$.

Definition 33 (Asymptotic Normality) A sequence of estimators $\hat\theta_n$ is called asymptotically normally distributed with mean $\mu$ and variance $\Sigma$ if

$\sqrt{n}\,(\hat\theta_n - \theta) \xrightarrow{d} z \quad\text{with}\quad z \sim N(\mu, \Sigma)$

Remark: $\hat\theta_n$ is called asymptotically efficient if it has minimum variance in the limiting distribution.

Let $\hat\theta_n$ be an estimator of the parameter $\theta$ and assume $\hat\theta_n \xrightarrow{p} \theta$. If $G_n$ denotes the cumulative distribution function of $\hat\theta_n$, then as $n \to \infty$

$G_n(c) \to \begin{cases} 0 & \text{for } c < \theta \\ 1 & \text{for } c > \theta \end{cases}$

To see this observe that for $c < \theta$

$P(\hat\theta_n \le c) = P(\hat\theta_n - \theta \le c - \theta) \le P\left(|\hat\theta_n - \theta| \ge \theta - c\right) \to 0$

and for $c > \theta$

$P(\hat\theta_n \le c) \ge 1 - P\left(|\hat\theta_n - \theta| > c - \theta\right) \to 1$

The result shows that the distribution of $\hat\theta_n$ collapses into the degenerate distribution at $\theta$, that means into

$G(c) = \begin{cases} 0 & \text{for } c < \theta \\ 1 & \text{for } c \ge \theta \end{cases}$

Consequently, knowing that $\hat\theta_n \xrightarrow{p} \theta$ does not provide information about the shape of $G_n$. This raises the question of how we can obtain information about $G_n$ based on some limiting process. Consider, for example, the case where $\hat\theta_n$ is the sample mean of iid random variables with mean $\theta$ and variance $\sigma^2$. Then $\hat\theta_n \xrightarrow{p} \theta$ in the light of Corollary 26, since $E(\hat\theta_n) = \theta$ and $Var(\hat\theta_n) = \sigma^2/n \to 0$. Consequently, as discussed above, the distribution of $\hat\theta_n$ collapses into the degenerate distribution at $\theta$. Observe, however, that the rescaled variable $\sqrt{n}(\hat\theta_n - \theta)$ has mean zero and variance $\sigma^2$. This indicates that the distribution of $\sqrt{n}(\hat\theta_n - \theta)$ will not collapse to a degenerate distribution. Using Theorem 35 below it can be shown that $\sqrt{n}(\hat\theta_n - \theta)$ converges to a $N(0, \sigma^2)$ distributed random variable. As a result we take $N(0, \sigma^2)$ as an approximation for the finite sample distribution of $\sqrt{n}(\hat\theta_n - \theta)$, and consequently take $N(\theta, \sigma^2/n)$ as an approximation for the finite sample distribution of $\hat\theta_n$.

Laws of large numbers

Let $x_i$, $i \in \mathbb{N}$, be a sequence of random variables with $E(x_i) = \mu_i$. Furthermore let $\bar x_n = \frac{1}{n}\sum_{i=1}^n x_i$ denote the sample mean, and let $\bar\mu_n = E(\bar x_n) = \frac{1}{n}\sum_{i=1}^n \mu_i$. A law of large numbers (LLN) then specifies conditions under which

$\bar x_n - E(\bar x_n) = \frac{1}{n}\sum_{i=1}^n (x_i - \mu_i)$

converges to zero in probability.

The usefulness of LLNs stems from the fact that many estimators can be expressed as continuous functions of sample averages of random variables. Thus to establish the probability limit of such an estimator we may try to establish in a first step the limits for the respective averages by means of LLNs. In a second step we may then use Theorem 29 to describe the actual limit for the estimator.

Theorem 34 (Law of large numbers, Kolmogorov) Let $x_i$ be a sequence of identically and independently distributed (iid) random variables with $E|x_i| < \infty$ and $E(x_i) = \mu$; then $\bar x_n \xrightarrow{p} \mu$ as $n \to \infty$.
Central limit theorems

Let $x_i$, $i \in \mathbb{N}$, be a sequence of iid random variables with $E(x_i) = \mu$ and $Var(x_i) = \sigma^2$, $0 < \sigma^2 < \infty$. Let $\bar x_n = \frac{1}{n}\sum_{i=1}^n x_i$ denote the sample mean. By Kolmogorov's law of large numbers for iid random variables it then follows that $\bar x_n - E(\bar x_n)$ converges to zero in probability. This implies that the limiting distribution of $\bar x_n - E(\bar x_n)$ is degenerate at zero, and thus no insight is gained from this limiting distribution regarding the shape of the distribution of the sample mean for finite $n$. Suppose we consider the rescaled quantity

$z_n = \sqrt{n}\left(\bar x_n - E(\bar x_n)\right) = \frac{1}{\sqrt{n}}\sum_{i=1}^n (x_i - \mu)$

Then the variance of the rescaled expression is $\sigma^2$ for all $n$, indicating that its limiting distribution will not be degenerate. Theorems that provide results concerning the limiting distribution of expressions like that are called central limit theorems (CLT).

Theorem 35 (Lindeberg - Lévy) Let $x_i$ be a sequence of iid random variables with $E(x_i) = \mu$, $Var(x_i) = \sigma^2 < \infty$. Then

$\frac{1}{\sqrt{n}}\sum_{i=1}^n \frac{x_i - \mu}{\sigma} \xrightarrow{d} N(0, 1)$

Theorem 36 (Lindeberg - Feller) Let $x_i$ be a sequence of independent random variables with $E(x_i) = \mu_i$ and $Var(x_i) = \sigma_i^2 < \infty$. Let $\bar\sigma_n^2 = \frac{1}{n}\sum_{i=1}^n \sigma_i^2$ and suppose that $\bar\sigma_n^2 > 0$, except for finitely many $n$. If for every $\epsilon > 0$

$\lim_{n\to\infty}\ \frac{1}{n\,\bar\sigma_n^2}\sum_{i=1}^n E\left[(x_i - \mu_i)^2\,\mathbf{1}\{|x_i - \mu_i| > \epsilon\sqrt{n}\,\bar\sigma_n\}\right] = 0$

then

$\sqrt{n}\,\frac{\bar x_n - \bar\mu_n}{\bar\sigma_n} \xrightarrow{d} N(0, 1)$

7.2 Asymptotic properties of OLS estimators

Consistency of $\hat\beta$

Consider the classical linear regression model under assumptions A1-A5. Assume that

$\lim_{n\to\infty}\frac{1}{n}X'X = Q$

and $Q$ is a positive definite matrix. Then

$\hat\beta = (X'X)^{-1}X'y = \beta + \left(\frac{1}{n}X'X\right)^{-1}\frac{1}{n}X'\varepsilon$

$E(\hat\beta) = \beta + \left(\frac{1}{n}X'X\right)^{-1}\frac{1}{n}X'E(\varepsilon) = \beta$   (12)

$Var(\hat\beta) = E(\hat\beta - \beta)(\hat\beta - \beta)' = \frac{\sigma^2}{n}\left(\frac{1}{n}X'X\right)^{-1} \to \frac{\sigma^2}{n}\,Q^{-1} \to 0$   (13)

From equations (12) and (13) we get the conditions for Corollary 26, which implies that $\hat\beta \xrightarrow{p} \beta$.
Under assumptions A1 - A5 and $\lim \frac{1}{n}X'X = Q$ the estimator $\hat\beta$ is consistent.

Consistency of $\hat\sigma^2$

$\hat\sigma^2 = \frac{e'e}{n-k} = \frac{\varepsilon'M\varepsilon}{n-k} = \frac{n}{n-k}\left[\frac{\varepsilon'\varepsilon}{n} - \frac{\varepsilon'X}{n}\left(\frac{X'X}{n}\right)^{-1}\frac{X'\varepsilon}{n}\right]$

The leading constant converges to 1. The second term in brackets converges to zero. That leaves

$\frac{1}{n}\varepsilon'\varepsilon = \frac{1}{n}\sum_{i=1}^n\varepsilon_i^2$

Assuming that the errors are independent, $\frac{1}{n}\varepsilon'\varepsilon$ is the mean of a random sample, and we can apply the law of large numbers (Theorem 34) and get

$\text{plim}\ \frac{1}{n}\varepsilon'\varepsilon = \sigma^2$

So under assumptions A1 - A5 and $\lim\frac{1}{n}X'X = Q$, with $Q$ positive definite, $\hat\sigma^2$ is consistent.

Asymptotic distribution of $\hat\beta$

As a corollary to the Lindeberg-Lévy CLT we note the following

Theorem 37 Let $\varepsilon_i$, $i = 1, 2, \dots$, be a sequence of iid random variables with $E(\varepsilon_i) = 0$ and $E(\varepsilon_i^2) = \sigma^2 < \infty$. Let $x_i$, $i = 1, 2, \dots$, be a sequence of real nonstochastic $(k \times 1)$ vectors (the rows of $X$) with $\lim\frac{1}{n}X'X = Q$ finite. Let $z_n = \frac{1}{\sqrt n}X'\varepsilon = \frac{1}{\sqrt n}\sum_{i=1}^n x_i\varepsilon_i$; then

$z_n \xrightarrow{d} N(0, \sigma^2 Q)$

We remember that

$\sqrt{n}\,(\hat\beta - \beta) = \left(\frac{1}{n}X'X\right)^{-1}\frac{1}{\sqrt n}X'\varepsilon$

Under the conditions A1 to A5 and $\lim\frac{1}{n}X'X = Q$, and if furthermore $Q$ is nonsingular, we obtain

$\frac{1}{\sqrt n}X'\varepsilon \xrightarrow{d} N(0, \sigma^2 Q)$

$\left(\frac{1}{n}X'X\right)^{-1}\frac{1}{\sqrt n}X'\varepsilon \xrightarrow{d} N\!\left(0,\ Q^{-1}\sigma^2 Q\,Q^{-1}\right)$

and finally

$\sqrt{n}\,(\hat\beta - \beta) \xrightarrow{d} N(0, \sigma^2 Q^{-1})$

If the regressors are well-behaved, the asymptotic normality of the least squares estimator does not depend on the normality of the disturbances.
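A small Monte Carlo experiment makes the last statement concrete. The sketch below (not from the original notes) draws non-normal errors and checks the variance of $\sqrt{n}(\hat\beta - \beta)$ against $\sigma^2 Q^{-1}$; the chosen distributions, sample size and seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_sqrt_n_beta(n, reps=5000, beta=1.0):
    """Draws of sqrt(n)(beta_hat - beta) for a one-regressor model with uniform (non-normal) errors."""
    out = np.empty(reps)
    for r in range(reps):
        x = rng.uniform(1.0, 2.0, size=n)        # well-behaved, bounded regressor
        eps = rng.uniform(-1.0, 1.0, size=n)     # non-normal errors, mean 0, variance 1/3
        y = beta * x + eps
        beta_hat = (x @ y) / (x @ x)             # OLS in the one-regressor model
        out[r] = np.sqrt(n) * (beta_hat - beta)
    return out

draws = simulate_sqrt_n_beta(n=200)
# theoretical asymptotic variance: sigma^2 / Q with Q = E(x^2) = 7/3 and sigma^2 = 1/3
print(draws.var(), (1.0 / 3.0) / (7.0 / 3.0))
```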
Asymptotic distribution of $\hat\sigma^2$

Under the additional assumption that $E(\varepsilon_i^4) = \mu_4 < \infty$ it can be shown that

$\sqrt{n}\,(\hat\sigma^2 - \sigma^2) \xrightarrow{d} N(0,\ \mu_4 - \sigma^4)$

which equals $N(0, 2\sigma^4)$ under normality. Similar results can be derived for the asymptotic distributions of the test statistics. We have shown that

$\hat\beta \xrightarrow{a} N\!\left(\beta,\ \frac{\sigma^2}{n}Q^{-1}\right)$

This implies

$\frac{\hat\beta_j - \beta_j}{\sqrt{\frac{\sigma^2}{n}\left(Q^{-1}\right)_{jj}}} \xrightarrow{d} N(0, 1)$

and

$t_j = \frac{\hat\beta_j - \beta_j}{\hat\sigma\sqrt{\left((X'X)^{-1}\right)_{jj}}} \xrightarrow{d} N(0, 1)$

For testing the validity of linear restrictions $H_0: R\beta = r$ with $\text{rank}(R) = q$,

$\frac{(R\hat\beta - r)'\left[R(X'X)^{-1}R'\right]^{-1}(R\hat\beta - r)}{\hat\sigma^2} \xrightarrow{d} \chi^2(q)$
8 The generalised linear regression model

8.1 Aitken estimator

We modify assumption A4 in our set of assumptions.

A1 $y = X\beta + \varepsilon$

A2 dim($X$) = ($n \times k$); rank($X$) = $k$

A3 $E(\varepsilon) = 0$

A4* $E(\varepsilon\varepsilon') = \sigma^2\Omega$, $\Omega \ne I$, $\Omega$ symmetric and positive definite

A5 $X$ is nonstochastic

If $\Omega$ is unknown we have $\frac{n(n+1)}{2} - 1$ additional parameters in the model. That means there are more unknown parameters than observations.
Therefore we first assume that $\Omega$ is known and only $\sigma^2$ is unknown. As a normalisation condition we take $tr(\Omega) = n$.
The symmetric positive definite matrix $\Omega$ and its inverse can be decomposed according to the Cholesky decomposition as

$\Omega = KK' \qquad\text{and}\qquad \Omega^{-1} = P'P$

with $P = K^{-1}$.

Now we premultiply the regression model with the matrix $P$,

$Py = PX\beta + P\varepsilon$

and get a new transformed model

$y^* = X^*\beta + \varepsilon^*$   (14)

This model has the properties

1. $X^* = PX$ is non-stochastic

2. $X^{*\prime}X^* = X'P'PX = X'\Omega^{-1}X$ is non-singular

3. $E(\varepsilon^*) = P\,E(\varepsilon) = 0$

4. $E(\varepsilon^*\varepsilon^{*\prime}) = P\,E(\varepsilon\varepsilon')\,P' = \sigma^2 P\Omega P' = \sigma^2 I$

That means in the transformed model (14) the assumptions of the classical regression model are fulfilled. We can apply the OLS estimator to this model and get the following result.

Theorem 38 Under the assumptions A1, A2, A3, A4* and A5 the best linear unbiased estimator of $\beta$ is given by

$\tilde\beta = (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}y$

Proof. Applying OLS to the transformed model gives

$\tilde\beta = (X^{*\prime}X^*)^{-1}X^{*\prime}y^* = (X'P'PX)^{-1}X'P'Py = (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}y$

$\tilde\beta$ is called the Generalised Least Squares (GLS) estimator or Aitken estimator.

Covariance matrix of GLS estimator

$Var(\tilde\beta) = E(\tilde\beta - \beta)(\tilde\beta - \beta)'$
$= E\left[(X'\Omega^{-1}X)^{-1}X'\Omega^{-1}\varepsilon\varepsilon'\Omega^{-1}X(X'\Omega^{-1}X)^{-1}\right]$
$= (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}E(\varepsilon\varepsilon')\,\Omega^{-1}X(X'\Omega^{-1}X)^{-1}$
$= \sigma^2(X'\Omega^{-1}X)^{-1}X'\Omega^{-1}\Omega\,\Omega^{-1}X(X'\Omega^{-1}X)^{-1}$
$= \sigma^2(X'\Omega^{-1}X)^{-1}$

An unbiased estimator of $\sigma^2$ is $\tilde\sigma^2 = \dfrac{\tilde\varepsilon'\Omega^{-1}\tilde\varepsilon}{n-k}$ with $\tilde\varepsilon = y - X\tilde\beta$, and $\widehat{Var}(\tilde\beta) = \tilde\sigma^2(X'\Omega^{-1}X)^{-1}$.

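The GLS estimator can be computed through the whitening transformation used in the proof. A minimal Python sketch follows (not part of the original notes); the function name and the use of numpy's Cholesky factorisation are illustrative assumptions, and Omega is assumed known here.

```python
import numpy as np

def gls(X, y, Omega):
    """Aitken/GLS estimator via the whitening transformation P y = P X beta + P eps."""
    # P is any matrix with P'P = Omega^{-1}; here we use the inverse Cholesky factor.
    L = np.linalg.cholesky(Omega)              # Omega = L L'
    P = np.linalg.inv(L)                       # then P'P = (L L')^{-1} = Omega^{-1}
    Xs, ys = P @ X, P @ y                      # transformed (classical) model
    beta_tilde = np.linalg.lstsq(Xs, ys, rcond=None)[0]
    n, k = X.shape
    resid = ys - Xs @ beta_tilde
    sigma2_tilde = resid @ resid / (n - k)
    cov = sigma2_tilde * np.linalg.inv(Xs.T @ Xs)   # = sigma2 (X' Omega^{-1} X)^{-1}
    return beta_tilde, cov
```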
Properties of OLS in the Generalised Regression Model

1. OLS is unbiased:

$E(\hat\beta) = E\left[(X'X)^{-1}X'y\right] = \beta + (X'X)^{-1}X'E(\varepsilon) = \beta$

2. In general $\hat\beta$ is less efficient than $\tilde\beta$:

$Var(\hat\beta) = E(\hat\beta - \beta)(\hat\beta - \beta)' = (X'X)^{-1}X'E(\varepsilon\varepsilon')X(X'X)^{-1} = \sigma^2(X'X)^{-1}X'\Omega X(X'X)^{-1}$

3. The OLS estimator $s^2 = \dfrac{e'e}{n-k}$ of $\sigma^2$ is biased:

$E(e'e) = E(\varepsilon'M\varepsilon) = E\left[tr(M\varepsilon\varepsilon')\right] = \sigma^2\,tr(M\Omega)$
$= \sigma^2\left[tr(\Omega) - tr\left((X'X)^{-1}X'\Omega X\right)\right] = \sigma^2\left[n - tr\left((X'X)^{-1}X'\Omega X\right)\right]$

using the normalisation $tr(\Omega) = n$. In general this differs from $\sigma^2(n-k)$, so the OLS estimator of $\sigma^2$ is biased.

8.2 Asymptotic properties of GLS

Theorem 39 Under the assumptions A1, A2, A3, A4*, A5 and

$\lim_{n\to\infty}\frac{1}{n}X'\Omega^{-1}X = Q^*$

with $Q^*$ positive definite, the GLS estimator $\tilde\beta$ is consistent.

Proof.

$\tilde\beta = \beta + (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}\varepsilon$

$E(\tilde\beta) = \beta$

$Var(\tilde\beta) = E(\tilde\beta - \beta)(\tilde\beta - \beta)' = \sigma^2(X'\Omega^{-1}X)^{-1} = \frac{\sigma^2}{n}\left(\frac{1}{n}X'\Omega^{-1}X\right)^{-1} \to 0$

Thus $Var(\tilde\beta) \to 0$, and by Corollary 26 this implies $\tilde\beta \xrightarrow{p} \beta$.

Remark: The assumption

$\lim\frac{1}{n}X'\Omega^{-1}X = Q^*$ with $Q^*$ positive definite

implies

$\left(\frac{1}{n}X'\Omega^{-1}X\right)^{-1} \to Q^{*-1}$

We give an interpretation of this assumption for the case $\Omega = I$, $k = 1$:

$\frac{1}{n}\sum_{i=1}^n x_i^2 \to Q^*$

with $Q^*$ positive definite. It means that $\frac{1}{n}\sum_i x_i^2$ must not converge to $0$. There has to be enough variation in the data. When the sample size increases the variation has to increase as well.

Is OLS consistent in the Generalised Regression Model?

Here we present two examples: one in which OLS is not consistent in the generalised model and one in which it is consistent.

Example 1 OLS is not consistent (heteroscedastic errors)

We consider the model

$y_i = \beta x_i + \varepsilon_i$

with a single regressor ($k = 1$). About the error terms we assume $\varepsilon_i \sim N(0, \sigma_i^2)$ with $\sigma_i^2 = \sigma^2 i$, so the diagonal elements of $\Omega$ are not constant and we have heteroscedastic errors. Further, there exist upper and lower bounds for the regressor,

$0 < a \le |x_i| \le b < \infty$

The consistent GLS estimator is given by

$\tilde\beta = \left(\sum_{i=1}^n\frac{x_i^2}{\sigma_i^2}\right)^{-1}\sum_{i=1}^n\frac{x_i y_i}{\sigma_i^2}$

with $Var(\tilde\beta) = \left(\sum_i x_i^2/\sigma_i^2\right)^{-1} \le \left(\frac{a^2}{\sigma^2}\sum_i\frac{1}{i}\right)^{-1} \to 0$. Let us compare it to the OLS estimator

$\hat\beta = \frac{\sum_i x_i y_i}{\sum_i x_i^2}$

with expected value and variance

$E(\hat\beta) = \beta$

$Var(\hat\beta) = \frac{\sum_i x_i^2\sigma_i^2}{\left(\sum_i x_i^2\right)^2} \ \ge\ \frac{\sigma^2 a^2\sum_i i}{\left(n b^2\right)^2} = \frac{\sigma^2 a^2(n+1)}{2nb^4} \ \to\ \frac{\sigma^2 a^2}{2b^4} > 0$

The variance of $\hat\beta$ stays bounded away from zero, therefore $P(|\hat\beta - \beta| > \epsilon)$ does not converge to zero: the OLS estimator is inconsistent.

Example 2 OLS is consistent (AR(1) errors)

We consider the same model as above, but we change the assumptions about the error terms to

$\varepsilon_i = \rho\varepsilon_{i-1} + v_i, \qquad |\rho| < 1$

Further we assume that

$\frac{1}{n}\sum_{i=1}^n x_i^2 \to q > 0$

Then

$\hat\beta - \beta = \frac{\frac{1}{n}\sum_i x_i\varepsilon_i}{\frac{1}{n}\sum_i x_i^2}$

The denominator converges to $q$. We still need to see whether $\frac{1}{n}\sum_i x_i\varepsilon_i$ converges to zero. Its expectation is

$E\left(\frac{1}{n}\sum_i x_i\varepsilon_i\right) = 0$

We evaluate the variance:

$Var\left(\frac{1}{n}\sum_i x_i\varepsilon_i\right) = \frac{1}{n^2}x'E(\varepsilon\varepsilon')x = \frac{\sigma^2}{n^2}\,x'\Omega x \ \le\ \frac{\sigma^2\,\|x\|_\infty^2}{n^2}\sum_{t=1}^n\sum_{s=1}^n|\rho|^{|t-s|} \ \le\ \frac{\sigma^2\,\|x\|_\infty^2}{n}\cdot\frac{1+|\rho|}{1-|\rho|} \ \to\ 0$

and so $\frac{1}{n}\sum_i x_i\varepsilon_i \xrightarrow{m.s.} 0$, which implies $\frac{1}{n}\sum_i x_i\varepsilon_i \xrightarrow{p} 0$ and hence $\hat\beta \xrightarrow{p} \beta$.
The matrix $\Omega$ in this model is given by

$\Omega = \begin{pmatrix} 1 & \rho & \rho^2 & \dots & \rho^{n-1} \\ \rho & 1 & \rho & \dots & \rho^{n-2} \\ \vdots & & \ddots & & \vdots \\ \rho^{n-1} & \rho^{n-2} & \dots & \rho & 1 \end{pmatrix}, \qquad \sigma^2 = \frac{\sigma_v^2}{1-\rho^2}$

$\|\cdot\|_\infty$ denotes the supremum norm; it is defined as the maximum of the absolute values of the elements.

Let us summarise this example. We found that under the assumptions

$|x_i| \le \|x\|_\infty < \infty \qquad\text{and}\qquad \frac{1}{n}\sum_{i=1}^n x_i^2 \to q > 0$

in the model $y_i = \beta x_i + \varepsilon_i$, with error terms $\varepsilon_i = \rho\varepsilon_{i-1} + v_i$, where $|\rho| < 1$ and $v_i \sim N(0, \sigma_v^2)$, the OLS estimator is consistent.

Asymptotic distribution of $\tilde\beta$

Theorem 40 Under the assumptions A1, A2, A3, A4*, A5 and

$\lim\frac{1}{n}X'\Omega^{-1}X = Q^* \quad\text{positive definite}$

$\sqrt{n}\,(\tilde\beta - \beta)$ converges in distribution to $N(0, \sigma^2 Q^{*-1})$.
Two-Step Estimation (Feasible GLS)

So far we always assumed that $\Omega$ is given. But how do we proceed if $\Omega$ is unknown? We can use a 2-step procedure:

1. Estimate $\Omega$ by $\hat\Omega$.

2. Estimate $\beta$ with $\hat{\tilde\beta} = (X'\hat\Omega^{-1}X)^{-1}X'\hat\Omega^{-1}y$.

$\Omega$ still has $\frac{n(n+1)}{2} - 1$ unknown parameters, and the number of parameters exceeds the number of observations $n$. Therefore we have to find a parametrization for $\Omega$ which reduces the number of parameters.
Let us look at some examples.

Example 1 AR(1)

$\varepsilon_t = \rho\varepsilon_{t-1} + v_t$

1. In the first step we calculate the OLS estimator and the OLS residuals $e = y - X\hat\beta$. Then we estimate $\rho$ from the equation $e_t = \rho e_{t-1} + \text{error}$:

$\hat\rho = \frac{\sum_{t=2}^n e_t e_{t-1}}{\sum_{t=2}^n e_{t-1}^2}$

Note that we only have to estimate one additional parameter $\rho$.

2. In the second step we use

$\hat\Omega = \begin{pmatrix} 1 & \hat\rho & \hat\rho^2 & \dots & \hat\rho^{n-1} \\ \hat\rho & 1 & \hat\rho & \dots & \hat\rho^{n-2} \\ \vdots & & \ddots & & \vdots \\ \hat\rho^{n-1} & \hat\rho^{n-2} & \dots & \hat\rho & 1 \end{pmatrix}$

and estimate $\beta$ with $\hat{\tilde\beta}$.

Example 2 heteroscedastic errors

$\sigma^2\Omega = \begin{pmatrix} \sigma_1^2 & & 0 \\ & \ddots & \\ 0 & & \sigma_n^2 \end{pmatrix}$

Like in the previous example we want to find estimators $\hat\sigma_i^2$ for the $\sigma_i^2$ in the first step and then estimate $\beta$ with $\hat{\tilde\beta}$ in the second step. But here we still have $n$ additional parameters, and further assumptions are necessary to reduce the dimension of the parameter space.
Concerning the asymptotic distribution of the feasible GLS estimator we note that under relatively general assumptions we get

$\text{plim}\ \sqrt{n}\,(\hat{\tilde\beta} - \tilde\beta) = 0$

or

$\sqrt{n}\,(\hat{\tilde\beta} - \beta) \xrightarrow{d} N(0, \sigma^2 Q^{*-1})$

The 2-step estimator is consistent and has the same asymptotic distribution as $\tilde\beta$ (GLS). Asymptotically $\hat{\tilde\beta}$ and $\tilde\beta$ are equivalent.

8.3 Heteroscedasticity

We consider the linear regression model

$y = X\beta + \varepsilon$

with the following structure on the error terms:

$E(\varepsilon) = 0$

$E(\varepsilon\varepsilon') = \begin{pmatrix} \sigma_1^2 & & 0 \\ & \ddots & \\ 0 & & \sigma_n^2 \end{pmatrix} = \sigma^2\Omega$

This model has $(n + k)$ unknown parameters and we need additional assumptions for estimation. But first we examine what happens if we estimate $\beta$ with OLS in this model.
Properties of OLS

If $X$ is non-stochastic, OLS has the following properties:

1. OLS is unbiased.

2. OLS is consistent if

$\lim\frac{1}{n}X'\Omega X = Q_*$ with $Q_*$ finite

$\lim\frac{1}{n}X'X = Q$ with $Q$ finite and non-singular

3. OLS is inefficient.

4. OLS standard errors are incorrect.

Remember: the BLUE in this model is given by the GLS estimator

$\tilde\beta = (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}y$

and according to the Gauss Markov Theorem the minimum variance of all unbiased linear estimators is

$Var(\tilde\beta) = \sigma^2(X'\Omega^{-1}X)^{-1}$

For the OLS estimator we have

$Var(\hat\beta) = \sigma^2(X'X)^{-1}X'\Omega X(X'X)^{-1}$

Concerning item 3 we note that the Gauss-Markov Theorem states that $Var(\hat\beta) \ge Var(\tilde\beta)$, and this implies the inefficiency of OLS.
Further we note that if we let the number of observations go to infinity

$Var(\hat\beta) = \frac{\sigma^2}{n}\left(\frac{1}{n}X'X\right)^{-1}\left(\frac{1}{n}X'\Omega X\right)\left(\frac{1}{n}X'X\right)^{-1} \to \frac{\sigma^2}{n}\,Q^{-1}Q_*Q^{-1} \to 0$

Thus under the conditions that $\frac{1}{n}X'\Omega X \to Q_*$ and $\frac{1}{n}X'X \to Q$, with $Q_*$ finite and $Q$ finite and non-singular, also $Var(\hat\beta) \to 0$ and thus $\hat\beta$ is consistent, which verifies item 2.
Concerning item 4 note that in the classical linear regression model we calculate $\hat\sigma^2(X'X)^{-1}$ as estimator for the variance of $\hat\beta$ instead of $Var(\hat\beta) = \sigma^2(X'X)^{-1}X'\Omega X(X'X)^{-1}$. Consequently the standard errors for $\hat\beta$ are calculated wrongly under OLS, and all inference based on OLS standard errors is incorrect.

Correction of OLS Standard Errors

If the sample size $n$ is very large we can proceed with OLS in spite of the inefficiency of the parameter estimates. The main problem is how to get valid statistical inference. Without additional assumptions the estimation of $\sigma^2\Omega$ is still impossible, because $\Omega$ contains $n$ unknown parameters.
In an important article, White (1980, Econometrica) has shown that it is sufficient to estimate $\frac{1}{n}\sigma^2 X'\Omega X$, which is of dimension $(k \times k)$.
Let $x_i' = (x_{i1}, \dots, x_{ik})$ be the $i$-th observation vector, corresponding to the $i$-th row of $X$. Then we can write

$\frac{1}{n}\sigma^2 X'\Omega X = \frac{1}{n}\sum_{i=1}^n\sigma_i^2\,x_i x_i'$

In this expression White's estimator replaces the unknown $\sigma_i^2$ by the squared OLS residuals $e_i^2$:

$\hat S = \frac{1}{n}\sum_{i=1}^n e_i^2\,x_i x_i'$

It can be shown that this is a consistent estimator for $\frac{1}{n}\sigma^2 X'\Omega X$. An estimator for the variance of $\hat\beta$ is then given by

$\widehat{Var}(\hat\beta) = (X'X)^{-1}\left(\sum_{i=1}^n e_i^2\,x_i x_i'\right)(X'X)^{-1}$

With this method we get standard errors of $\hat\beta_j$ by $\sqrt{\widehat{Var}(\hat\beta)_{jj}}$, and they are called Heteroscedasticity-Consistent (Robust) Standard Errors.
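A minimal Python sketch of this sandwich estimator follows (not from the original notes); the function name and array conventions are illustrative assumptions.

```python
import numpy as np

def white_robust_se(X, y):
    """OLS with White heteroscedasticity-consistent standard errors."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    e = y - X @ beta_hat                        # OLS residuals
    S = (X * (e ** 2)[:, None]).T @ X           # sum_i e_i^2 x_i x_i'
    cov_robust = XtX_inv @ S @ XtX_inv          # sandwich estimator
    return beta_hat, np.sqrt(np.diag(cov_robust))
```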
Tests for Heteroscedasticity

White (1980) presents a test for the hypothesis

$H_0: \sigma_i^2 = \sigma^2$ for all $i$
$H_1:$ heteroscedasticity

The test is performed in two steps:

Step 1 Estimate the OLS residuals $e_i$.

Step 2 Estimate an auxiliary regression by regressing $e_i^2$ on the levels, squares and cross-products of all regressors including the constant,

$e_i^2 = \alpha_0 + \alpha_1 x_{i1} + \dots + \alpha_k x_{ik} + \alpha_{k+1}x_{i1}^2 + \alpha_{k+2}x_{i1}x_{i2} + \dots + \eta_i$

and keep the $R^2$ from this regression.

It can be shown that

$nR^2 \xrightarrow{d} \chi^2(P - 1)$

where $P$ is the number of regressors in the auxiliary regression (including the constant).

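A sketch of the two steps in Python (not from the original notes); it assumes X holds the non-constant regressors, that the products of the columns of X are not collinear, and it uses numpy/scipy for the regression and the chi-squared p-value.

```python
import numpy as np
from scipy import stats

def white_test(X, y):
    """White test: regress squared OLS residuals on levels, squares and cross-products."""
    n, k = X.shape
    W = np.column_stack([np.ones(n), X])                  # regressors of the original model
    e = y - W @ np.linalg.lstsq(W, y, rcond=None)[0]      # OLS residuals
    e2 = e ** 2
    cols = [np.ones(n)]
    cols += [X[:, j] for j in range(k)]
    cols += [X[:, j] * X[:, l] for j in range(k) for l in range(j, k)]
    Z = np.column_stack(cols)                             # auxiliary regressors
    fitted = Z @ np.linalg.lstsq(Z, e2, rcond=None)[0]
    r2 = 1.0 - np.sum((e2 - fitted) ** 2) / np.sum((e2 - e2.mean()) ** 2)
    stat = n * r2
    dof = Z.shape[1] - 1                                  # auxiliary regressors excluding the constant
    return stat, 1.0 - stats.chi2.cdf(stat, dof)
```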
Estimation under Heteroscedasticity

Estimation by feasible GLS requires prior knowledge about the structural form of the heteroscedasticity. Consider an example.

Example Let $x_i$ denote household income and let $y_i$ denote household consumption, and consider a model in which consumption expenditures are explained by income and other variables:

$y_i = \beta_1 + \beta_2 x_i + \dots + \varepsilon_i$

One often observes that the variation of average expenditure increases as income increases. Therefore we model $\sigma_i^2$ proportional to $x_i$:

$\sigma_i^2 = \sigma^2 x_i$

and get

$E(\varepsilon\varepsilon') = \sigma^2\begin{pmatrix} x_1 & & 0 \\ & \ddots & \\ 0 & & x_n \end{pmatrix} = \sigma^2\Omega$

A more general model for heteroscedasticity is given by

$\sigma_i^2 = \gamma_1\,z_i^{\gamma_2}$

where $z_i$ is a single variable, usually one of the regressors. Depending on the values of the parameters $\gamma_1, \gamma_2$ we get the models

$\gamma_2 = 0$: homoscedasticity
$\gamma_2 = 1$: variance proportional to $z_i$
$\gamma_2 = 2$: variance proportional to $z_i^2$

Weighted Least Squares

Modelling heteroscedasticity in this form is also called weighted least squares regression. We can see this from

$\sigma^2\Omega = \begin{pmatrix} \sigma_1^2 & & 0 \\ & \ddots & \\ 0 & & \sigma_n^2 \end{pmatrix}$

The rows of the matrix $X$ correspond to the individual observations on all independent variables. Here we call $x_i' = (x_{i1}, \dots, x_{ik})$ the $i$-th row of $X$. Accordingly $y_i$ is the observation of the dependent variable for the $i$-th individual. We rewrite the regression model $y = X\beta + \varepsilon$ by dividing each observation by $\sigma_i$:

$\frac{y_i}{\sigma_i} = \frac{x_i'}{\sigma_i}\,\beta + \frac{\varepsilon_i}{\sigma_i}, \qquad i = 1, \dots, n$

and the GLS estimator can be written as

$\tilde\beta = \left(\sum_{i=1}^n\frac{x_i x_i'}{\sigma_i^2}\right)^{-1}\sum_{i=1}^n\frac{x_i y_i}{\sigma_i^2}$

From this expression we see that the GLS estimator gives the OLS estimator for the weighted set of observations $(y_i/\sigma_i,\ x_i/\sigma_i)$.
To estimate the parameters $\gamma_1, \gamma_2$ in $\sigma_i^2 = \gamma_1 z_i^{\gamma_2}$ directly from the data, one could proceed in the following way:
Use the OLS residuals and substitute $e_i^2$ for $\sigma_i^2$.
By non-linear regression estimate

$e_i^2 = \gamma_1\,z_i^{\gamma_2} + v_i$

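When the form of the variance function is taken as known, the weighted regression is a one-liner. The sketch below (not from the original notes) assumes $\sigma_i^2 \propto z_i^{\gamma_2}$ with $\gamma_2$ supplied by the user (e.g. 1 for variance proportional to z); the function name and numpy usage are illustrative.

```python
import numpy as np

def weighted_least_squares(X, y, z, gamma2=1.0):
    """WLS when sigma_i^2 is proportional to z_i**gamma2 (gamma2 assumed known)."""
    w = 1.0 / np.sqrt(z ** gamma2)       # weights 1/sigma_i up to the common factor
    Xw = X * w[:, None]                  # divide each observation by sigma_i
    yw = y * w
    return np.linalg.lstsq(Xw, yw, rcond=None)[0]
```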
8.4 Autocorrelation

Autoregressive process of order one

A stochastic process of the form

$\varepsilon_t = \rho\varepsilon_{t-1} + v_t$
$E(v_t) = 0$
$E(v_t v_s) = \delta_{ts}\,\sigma_v^2$

is called a first order autoregressive process, or AR(1).

Under the assumption that the process starts from some non-stochastic value $\varepsilon_0$, a solution for $\varepsilon_t$ is given by

$\varepsilon_1 = \rho\varepsilon_0 + v_1$
$\varepsilon_2 = \rho^2\varepsilon_0 + \rho v_1 + v_2$
$\vdots$
$\varepsilon_t = \rho^t\varepsilon_0 + \sum_{j=0}^{t-1}\rho^j v_{t-j}$

The properties of $\varepsilon_t$ as $t$ goes to infinity depend on the parameter $\rho$. Here we treat convergence in the sense of convergence in mean square.

Case 1 $|\rho| < 1$.

$E(\varepsilon_t) = \rho^t\varepsilon_0 \to 0$

$Var(\varepsilon_t) = \sum_{j=0}^{t-1}\rho^{2j}\sigma_v^2 = \sigma_v^2\,\frac{1-\rho^{2t}}{1-\rho^2} \to \frac{\sigma_v^2}{1-\rho^2}$

The expected value converges to zero and the variance to some finite value. In this case we speak of a "stable solution".

Case 2 $\rho = 1$.

$\varepsilon_t = \varepsilon_0 + \sum_{j=1}^{t}v_j$
$E(\varepsilon_t) = \varepsilon_0$
$Var(\varepsilon_t) = E\left(\sum_{j=1}^{t}v_j\right)^2 = t\,\sigma_v^2$

The variance of $\varepsilon_t$ increases as $t \to \infty$. This is called the "random walk" solution.

Case 3 $|\rho| > 1$

$E(\varepsilon_t) \to \infty$ (for $\varepsilon_0 \ne 0$)
$Var(\varepsilon_t) \to \infty$

Both expected value and variance increase over time. This is the "unstable" solution.

From now on we consider only stable solutions where $|\rho| < 1$.


The general solution for the  starting in the infinite past has the following properties

90

  D B  
 

#    # D B
 
    

 

 


(     # D B    # D  B  B  
   

D %3  %3
 

  D


)*+     #   
D  B  B      %3

D  

D

 %3 D   D  %3  E2  1


  D
 

The mean and the variance of the process are constant and finite and the covariance of
two observations  and  only depends on the difference 2  1. Processes with these
properties are called weakly “stationary”.

Generalised Regression Model with AR(1) Disturbances

Now we return to the linear regression model

$y = X\beta + \varepsilon$

but we assume that the error term $\varepsilon_t$ follows a first order autoregressive process with $|\rho| < 1$:

$E(\varepsilon) = 0$

$E(\varepsilon\varepsilon') = \sigma^2\Omega = \frac{\sigma_v^2}{1-\rho^2}\begin{pmatrix} 1 & \rho & \rho^2 & \dots & \rho^{T-1} \\ \rho & 1 & \rho & \dots & \rho^{T-2} \\ \vdots & & \ddots & & \vdots \\ \rho^{T-1} & \rho^{T-2} & \dots & \rho & 1 \end{pmatrix}, \qquad \sigma^2 = \frac{\sigma_v^2}{1-\rho^2}$

Model misspecification may be a reason for autocorrelated disturbances.

Example Suppose the true model were the following

$y_t = \beta_1 + \beta_2 x_t + \beta_3 x_{t-1} + \varepsilon_t$

But the researcher estimates instead

$y_t = \beta_1 + \beta_2 x_t + u_t$

In this misspecified model the error term $u_t$ is autocorrelated: $u_t = \beta_3 x_{t-1} + \varepsilon_t$.

Properties of OLS

Like for heteroscedasticity of the errors, if $X$ is non-stochastic we have the following properties of OLS:

1. OLS is unbiased.

2. OLS is consistent under certain conditions.

3. OLS is inefficient.

4. OLS standard errors are incorrect.

However, if $X$ is stochastic, we have to be more careful. This is also the case if the lagged dependent variable is one of the regressors. Consider again an example.

Example We have the model

$y_t = \beta y_{t-1} + \varepsilon_t, \quad |\beta| < 1$
$\varepsilon_t = \rho\varepsilon_{t-1} + v_t, \quad |\rho| < 1$
$E(v_t) = 0, \quad E(v_t v_s) = \delta_{ts}\sigma_v^2$

Estimating this model by OLS results in

$\hat\beta = \frac{\sum_t y_{t-1}y_t}{\sum_t y_{t-1}^2} = \beta + \frac{\frac{1}{T}\sum_t y_{t-1}\varepsilon_t}{\frac{1}{T}\sum_t y_{t-1}^2}$

$\text{plim}\ \hat\beta = \beta + \frac{\text{plim}\ \frac{1}{T}\sum_t y_{t-1}\varepsilon_t}{\text{plim}\ \frac{1}{T}\sum_t y_{t-1}^2}$

Consider now $\frac{1}{T}\sum_t y_{t-1}\varepsilon_t$:

$E(y_{t-1}\varepsilon_t) = E\left[(\beta y_{t-2} + \varepsilon_{t-1})\,(\rho\varepsilon_{t-1} + v_t)\right] = \frac{\rho\,\sigma_\varepsilon^2}{1-\beta\rho}$

so that

$\text{plim}\ \frac{1}{T}\sum_t y_{t-1}\varepsilon_t = \frac{\rho\,\sigma_\varepsilon^2}{1-\beta\rho} > 0 \quad\text{for } \rho > 0$

This means that OLS is inconsistent!

Testing for Autocorrelated Disturbances

In the model

$y = X\beta + \varepsilon, \qquad \varepsilon_t = \rho\varepsilon_{t-1} + v_t$

we want to test the hypothesis

$H_0: \rho = 0$
$H_1: \rho \ne 0$

This is a hypothesis about the error term $\varepsilon$. But $\varepsilon$ is unobservable, and therefore one has to look for a test based on the OLS residuals $e$.
Even if $H_0$ is true and $E(\varepsilon\varepsilon') = \sigma^2 I$, the OLS residuals will display autocorrelation, because $E(ee') = \sigma^2 M$ with $M = I - X(X'X)^{-1}X'$ is not a diagonal matrix and it depends on $X$.

Durbin – Watson Test
The Durbin - Watson test statistic is given by

$d = \frac{\sum_{t=2}^{T}(e_t - e_{t-1})^2}{\sum_{t=1}^{T}e_t^2}$

Note that $d$ is small for positive autocorrelation and large for negative autocorrelation. The test statistic $d$ will take on an intermediate value (around 2) for no autocorrelation.
Durbin and Watson established upper and lower bounds for the distribution of $d$. These bounds are independent of $X$ under the assumptions that

1. $X$ is non-stochastic.

2. $\varepsilon \sim N(0, \sigma^2 I)$.

3. A constant is included in $X$.

$d_L \le d \le d_U$

Durbin and Watson suggest the following test procedure:
For a given significance level we can derive critical values $d_L$ and $d_U$ from the bounding distributions. Then we apply the decision rule:

If $d < d_L$ we reject the null hypothesis of no autocorrelation in favour of positive first order autocorrelation.

If $d > d_U$ we do not reject the null hypothesis.

If $d_L \le d \le d_U$ the test result is inconclusive.
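The statistic itself is simple to compute from the OLS residuals. A minimal sketch in Python (not from the original notes); the tabulated bounds dL and dU must still be looked up by the user.

```python
import numpy as np

def durbin_watson(e):
    """Durbin-Watson statistic from OLS residuals ordered in time."""
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

# decision sketch for positive autocorrelation, with tabulated bounds dL, dU:
# if d < dL: reject H0;  if d > dU: do not reject;  otherwise: inconclusive.
```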

The asymptotic properties of the Durbin - Watson test statistic follow from

$d = \frac{\sum_t e_t^2 + \sum_t e_{t-1}^2 - 2\sum_t e_t e_{t-1}}{\sum_t e_t^2} \approx 2\left(1 - \frac{\sum_t e_t e_{t-1}}{\sum_t e_t^2}\right) = 2(1 - \hat\rho)$

Since

$E(\varepsilon_t\varepsilon_{t-1}) = E\left[(\rho\varepsilon_{t-1} + v_t)\,\varepsilon_{t-1}\right] = \rho\,E(\varepsilon_{t-1}^2) = \rho\,\sigma_\varepsilon^2$

we have $\hat\rho \xrightarrow{p} \rho$ and therefore

$\text{plim}\ d = 2(1 - \rho)$

For $d > 2$ a test of $H_0$ against the alternative of negative autocorrelation can be constructed by calculating $4 - d$ and comparing this to $d_L$ and $d_U$.

Disadvantages of the DW Test

$d_L$ and $d_U$ are only tabulated; analytical functional forms are not given.

There is an inconclusive region.

Only non-stochastic $X$ is allowed.

Further tests: Breusch – Godfrey LM-test

Feasible GLS with Autocorrelated Disturbances

Beware of the problem of misspecification!

$y = X\beta + \varepsilon, \qquad \varepsilon_t = \rho\varepsilon_{t-1} + v_t, \qquad |\rho| < 1$

$E(\varepsilon\varepsilon') = \sigma^2\Omega = \frac{\sigma_v^2}{1-\rho^2}\begin{pmatrix} 1 & \rho & \rho^2 & \dots & \rho^{T-1} \\ \rho & 1 & \rho & \dots & \rho^{T-2} \\ \vdots & & \ddots & & \vdots \\ \rho^{T-1} & \rho^{T-2} & \dots & \rho & 1 \end{pmatrix}$

The inverse of $\Omega$ has the tridiagonal form

$\Omega^{-1} = \frac{1}{1-\rho^2}\begin{pmatrix} 1 & -\rho & 0 & \dots & 0 \\ -\rho & 1+\rho^2 & -\rho & \dots & 0 \\ \vdots & & \ddots & & \vdots \\ 0 & \dots & -\rho & 1+\rho^2 & -\rho \\ 0 & \dots & 0 & -\rho & 1 \end{pmatrix}$

A suitable transformation matrix is

$P = \begin{pmatrix} \sqrt{1-\rho^2} & 0 & 0 & \dots & 0 \\ -\rho & 1 & 0 & \dots & 0 \\ 0 & -\rho & 1 & \dots & 0 \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & \dots & 0 & -\rho & 1 \end{pmatrix}$

for which $P'P = (1-\rho^2)\,\Omega^{-1}$; the scaling factor does not affect the GLS estimator.

For a model with a constant and one explanatory variable as regressors the transformed model becomes

$\begin{pmatrix} \sqrt{1-\rho^2}\,y_1 \\ y_2 - \rho y_1 \\ \vdots \\ y_T - \rho y_{T-1} \end{pmatrix} = \begin{pmatrix} \sqrt{1-\rho^2} & \sqrt{1-\rho^2}\,x_1 \\ 1-\rho & x_2 - \rho x_1 \\ \vdots & \vdots \\ 1-\rho & x_T - \rho x_{T-1} \end{pmatrix}\begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix} + \begin{pmatrix} \sqrt{1-\rho^2}\,\varepsilon_1 \\ v_2 \\ \vdots \\ v_T \end{pmatrix}$

The transformation is the same for all observations except the first.
Cochrane – Orcutt transformation:
For computational simplicity the Cochrane - Orcutt transformation omits the first observation and uses only the quasi-differenced observations

$y_t - \rho y_{t-1} = (1-\rho)\,\beta_1 + \beta_2(x_t - \rho x_{t-1}) + v_t, \qquad t = 2, \dots, T$

Prais – Winsten transformation:

The Prais - Winsten transformation uses all $T$ observations, including the transformed first observation. Asymptotically the two transformations are equivalent, but they differ in small samples.
To obtain estimators for $\beta$ and $\rho$ we can apply the following iterative procedure:

Step 1 OLS regression of $y$ on $X$ gives residuals $e$.

Step 2 OLS regression $e_t = \rho e_{t-1} + v_t$ gives an OLS estimator $\hat\rho$.

Step 3 Use $\hat\rho$ in one of the transformations for $y_t$ and $x_t$, and get a new estimator for $\beta$ and new residuals $e$.

Step 4 Iterate steps 2 and 3.

Note that iterative methods always depend on the starting values, as they may converge to local extrema.
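A minimal sketch of the iteration in Python (not from the original notes); the function name, the convergence rule and the assumption that X contains the constant in its first column are illustrative choices.

```python
import numpy as np

def cochrane_orcutt(X, y, n_iter=20, tol=1e-8):
    """Iterative Cochrane-Orcutt estimation; X is assumed to contain the constant column."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]           # step 1: OLS on the original data
    rho = 0.0
    for _ in range(n_iter):
        e = y - X @ beta
        rho_new = (e[1:] @ e[:-1]) / (e[:-1] @ e[:-1])    # step 2: AR(1) coefficient of the residuals
        # step 3: quasi-difference the data and drop the first observation
        y_star = y[1:] - rho_new * y[:-1]
        X_star = X[1:] - rho_new * X[:-1]
        beta = np.linalg.lstsq(X_star, y_star, rcond=None)[0]
        converged = abs(rho_new - rho) < tol
        rho = rho_new
        if converged:
            break
    return beta, rho
```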

Serial Correlation Robust Inference after OLS

Like in the heteroscedasticity model, methods of correcting standard errors after OLS estimation exist: Newey and West (1987).
9 Limited dependent variables models

The linear regression model assumes that the dependent variable is continuous and has been measured for all cases in the sample. Yet, many outcomes of fundamental interest to social scientists are not continuous or are not observed for all cases. There are special regression models that are appropriate when the dependent variable is censored, truncated, binary, ordinal, or count. Variables of that kind are often subsumed as categorical and limited dependent variables.
In this section we will discuss models for binary and censored dependent variables.

Differences in parameter interpretation between linear and nonlinear regression models

Consider a linear model in two independent variables

$y = \alpha + \beta x + \delta d$

where $x$ is a continuous variable and $d$ is dichotomous with values 0 and 1. For simplicity we assume there is no random error. A graph of the model is given in Figure 6. In this model a change of $x$ by one unit will always result in a change of $\beta$ units in the dependent variable, regardless of the values of $x$ and $d$. And a change in $d$ from $0$ to $1$ will always result in a change of $y$ by $\delta$ units.
Now, consider the same graph for the nonlinear model

$y = g(\alpha + \beta x + \delta d)$

where $g$ is a non-linear function. In Figure 7 we see that the effect of a unit change in $x$ on $y$ depends on the level of $x$ as well as on the level of $d$. Analogously also the effect of a change in $d$ from $0$ to $1$ changes for different levels of $x$.
We will return to these basic observations when it comes to parameter interpretation in nonlinear models for limited dependent variables.

9.1 Binary regression models: Logit, Probit

A binary response model is a regression model in which the dependent variable $y$ is a binary random variable that takes only the values zero and one. In many economic applications of this model, an agent makes a choice between two alternatives: for example, a commuter chooses to drive a car to work or to take public transport. Another example is the choice of a worker between taking a job or not. Driving to work and taking a job are choices that correspond to $y = 1$, and taking public transport and not taking a job to $y = 0$. The model gives the probability that $y = 1$ is chosen, conditional on a set of explanatory variables.
The econometric problem is to estimate the conditional probability that $y = 1$, considered as a function of the explanatory variables. The most commonly used approach, notably the logit and probit models, assumes that the functional form of the dependence on the explanatory variables is known.

[Figure 6: Linear Model (y against x for d = 0 and d = 1)]
ered as a function of the explanatory variables. The most commonly used approach,
notably logit and probit models, assumes that the functional form of the dependence
on the explanatory variables is known.

The linear probability model

A first approach to estimate a model with a binary dependent variable would be to


use standard OLS. This is the so called linear probability model. We note it here not
to recommend its use but to illustrate the problems resulting from a binary dependent
variable and to motivate the discussion of the logit an probit models.

[Figure 7: Nonlinear Model (y against x for d = 0 and d = 1)]

The structural model is

$y_i = x_i'\beta + \varepsilon_i$
$E(\varepsilon_i|x_i) = 0$
$E(y_i|x_i) = x_i'\beta$

Note: In this chapter we use elementwise notation; $y_i$ gives the $i$-th component of the vector $y$, and $x_i'$ is the row of $X$ corresponding to the observations for the $i$-th individual. We will eventually suppress the index $i$ to make the notation simpler.
A graph for a model with a single independent variable is given in Figure 8. Let us consider the meaning of $E(y_i|x_i)$:

$E(y_i|x_i) = 1\cdot P(y_i = 1|x_i) + 0\cdot P(y_i = 0|x_i) = P(y_i = 1|x_i)$

and hence

$P(y_i = 1|x_i) = x_i'\beta$
So $x_i'\beta$ gives the probability of $y_i = 1$ given $x_i$. Depending on $x_i$ this probability need not always be between zero and one. So there is a problem of nonsensical predictions in the linear probability model.
As $y$ only takes on two values, also the error term for a given $x$ only takes two values:

$\varepsilon_i = 1 - x_i'\beta \quad$ if $y_i = 1$
$\varepsilon_i = -x_i'\beta \quad$ if $y_i = 0$

So the errors cannot be normally distributed.

For the variances of the errors we have

$Var(\varepsilon_i|x_i) = (1 - x_i'\beta)^2\,x_i'\beta + (x_i'\beta)^2(1 - x_i'\beta) = x_i'\beta\,(1 - x_i'\beta)$

This means there is a further problem with heteroscedastic errors in the linear probability model.

[Figure 8: Linear Probability Model]
Latent variable model

As before, we have an observed binary variable $y$. Suppose that there is an unobserved (latent) variable $y^*$, which is continuous and ranging between $-\infty$ and $+\infty$, that generates the observed $y$. In the example of labour force participation we can think of this variable as the utility derived from working. The structural model for $y^*$ is the following linear one:

$y_i^* = x_i'\beta + \varepsilon_i$

The individual joins the labour force only if its utility is above a certain threshold $\tau$. Hence we only observe a discrete outcome $y$, which is linked to $y^*$ by

$y_i = \begin{cases} 1 & \text{if } y_i^* > \tau \\ 0 & \text{if } y_i^* \le \tau \end{cases}$

Since $y^*$ is continuous, we avoid the problems encountered in the linear probability model. However, since the dependent variable is unobserved, the model cannot be estimated by OLS. Instead we use Maximum Likelihood estimation, which requires assumptions about the distribution of the errors.
First we assume $E(\varepsilon_i|x_i) = 0$ like in the linear probability model. Since $y^*$ is unobserved we cannot estimate the variance of the error as in the linear model. In the probit model we assume $Var(\varepsilon_i|x_i) = 1$ and in the logit model we assume $Var(\varepsilon_i|x_i) = \pi^2/3$.
By assuming a specific form for the distribution of $\varepsilon$ it is possible to compute the probability of $y = 1$ for a given $x$. Setting $\tau = 0$ consider

$P(y = 1|x) = P(y^* > 0|x) = P(x'\beta + \varepsilon > 0|x) = P(\varepsilon > -x'\beta|x) = P(\varepsilon \le x'\beta|x)$

using the symmetry of the error distribution. This is simply the cumulative distribution function of the error evaluated at $x'\beta$. Accordingly,

$P(y = 1|x) = F(x'\beta)$

where $F$ is the standard normal distribution function $\Phi$ for the probit model and the logistic distribution function $\Lambda(x'\beta) = \dfrac{\exp(x'\beta)}{1 + \exp(x'\beta)}$ for the logit model.
Model identification assumptions

In specifying the logit and probit model we made the following identifying assumptions:

$E(\varepsilon|x) = 0$

$\tau = 0$; this is not restrictive if a constant is included in the model

$Var(\varepsilon) = 1$ resp. $Var(\varepsilon) = \pi^2/3$

These assumptions are arbitrary, in the sense that they cannot be tested, but they are necessary to identify the model. Since a latent variable is unobserved, its mean and variance cannot be estimated. To see the relationship between the variance of the dependent variable and the identification of the $\beta$'s in a regression model, consider the model $y = x\beta + \varepsilon$ and assume we rescale $y$ by $y_\delta = \delta y$. The variance of $y_\delta$ equals

$Var(y_\delta) = Var(\delta y) = \delta^2\,Var(y)$

and it follows that

$y_\delta = x(\delta\beta) + \delta\varepsilon$

The magnitude of the slope parameter depends on the scale of the dependent variable. If we do not know the variance of the dependent variable, then the slope coefficients are not identified.
Differences in the variances of the error terms in the logit and probit model also affect the parameter estimates. Let

logit: $y^* = x'\beta_L + \varepsilon_L$ with $Var(\varepsilon_L) = \pi^2/3$

probit: $y^* = x'\beta_P + \varepsilon_P$ with $Var(\varepsilon_P) = 1$

As a transformation to compare coefficients from the logit and probit models we can use

$\beta_L \approx \frac{\pi}{\sqrt 3}\,\beta_P \approx 1.8\,\beta_P$

Nonlinear probability model

The logit and probit models can also be derived without appealing to an underlying latent variable. This is done by specifying a nonlinear model relating the $x$'s to the probability of the event $y = 1$. Remember, in the linear probability model we had the problem that the predicted probabilities $P(y=1|x) = x'\beta$ can take on values that are greater than one or less than zero. To eliminate this problem we transform $P(y=1|x)$ into a function of $x'\beta$ that ranges between $0$ and $1$. First, consider the odds

$\frac{P(y=1|x)}{1 - P(y=1|x)}$

which range between $0$ and $\infty$. Then take the logarithm and get an expression between $-\infty$ and $\infty$. In the logit model this equals

$\ln\frac{P(y=1|x)}{1 - P(y=1|x)} = x'\beta$

because

$P(y=1|x) = \Lambda(x'\beta) = \frac{\exp(x'\beta)}{1 + \exp(x'\beta)}$

General probability models can be generated by choosing functions of $x'\beta$ that range between $0$ and $1$, e.g. any distribution function $F$:

$P(y=1|x) = F(x'\beta)$   (15)

Maximum Likelihood estimation

To specify the likelihood function, define $p_i$ as

$p_i = \begin{cases} P(y_i = 1|x_i) & \text{if } y_i = 1 \\ 1 - P(y_i = 1|x_i) & \text{if } y_i = 0 \end{cases}$

where $P(y_i = 1|x_i)$ is defined by equation (15). If the observations are independent, the likelihood function is given by

$L(\beta|y, X) = \prod_{i=1}^n p_i = \prod_{y_i = 1}P(y_i=1|x_i)\prod_{y_i = 0}\left(1 - P(y_i=1|x_i)\right) = \prod_{i=1}^n F(x_i'\beta)^{y_i}\left(1 - F(x_i'\beta)\right)^{1-y_i}$

and the log likelihood is

$\ln L(\beta|y, X) = \sum_{i=1}^n\left[y_i\ln F(x_i'\beta) + (1 - y_i)\ln\left(1 - F(x_i'\beta)\right)\right]$

It has been shown that under mild conditions the likelihood function is globally concave. The estimates are consistent, asymptotically normally distributed, and asymptotically efficient.
Remark: These are only asymptotic properties, and nothing is said about small sample properties of the Maximum Likelihood estimators. Contrary to OLS, Maximum Likelihood estimation of nonlinear functions is only justified for relatively large sample sizes (above 500 observations).

Numerical maximisation procedures

For nonlinear models algebraic solutions are rarely possible. Consequently, numerical methods are used to maximise the likelihood function. Numerical methods start with a guess of the values of the parameters and iterate to improve on that guess.
Assume that we are trying to estimate the vector of parameters $\theta$. We start with an initial guess $\theta_0$, called start values, and attempt to improve on it by adding a vector $\zeta$ of adjustments and proceed updating

$\theta_1 = \theta_0 + \zeta_0$
$\vdots$
$\theta_{j+1} = \theta_j + \zeta_j$

Iterations continue until a convergence criterion is reached. This may be either that the gradient of $\ln L(\theta_j)$ is close to zero, or that the parameter values do not change any more.
How is $\zeta_j$ determined? By a product of gradient and direction matrix,

$\zeta_j = D_j\,\frac{\partial\ln L(\theta_j)}{\partial\theta}$

The gradient vector indicates the direction of a change in the likelihood function for a change in the parameters. $D_j$ is the direction matrix that reflects the curvature of the likelihood function.

Method of steepest ascent

$D_j = I$

Newton Raphson method

$D_j = -\left[\frac{\partial^2\ln L(\theta_j)}{\partial\theta\,\partial\theta'}\right]^{-1}$

Method of scoring

$D_j = -\left[E\,\frac{\partial^2\ln L(\theta_j)}{\partial\theta\,\partial\theta'}\right]^{-1}$

BHHH

$D_j = \left[\sum_{i=1}^n\frac{\partial\ln L_i(\theta_j)}{\partial\theta}\,\frac{\partial\ln L_i(\theta_j)}{\partial\theta'}\right]^{-1}$

In addition to estimating the parameter $\theta$, numerical methods provide estimates of the asymptotic covariance matrix $Var(\hat\theta)$, which are used for the test statistics. The asymptotic covariance matrix for the maximum likelihood estimator equals

$Var(\hat\theta) = \left[-E\,\frac{\partial^2\ln L}{\partial\theta\,\partial\theta'}\right]^{-1}$

evaluated at the true parameter value $\theta$.

Estimators for the covariance matrix are given by

$\widehat{Var}(\hat\theta) = \left[-\frac{\partial^2\ln L}{\partial\theta\,\partial\theta'}\Big|_{\hat\theta}\right]^{-1}$

the inverse of the negative Hessian evaluated at the maximum of the likelihood function, or

$\widehat{Var}(\hat\theta) = \left[\sum_{i=1}^n\frac{\partial\ln L_i}{\partial\theta}\,\frac{\partial\ln L_i}{\partial\theta'}\Big|_{\hat\theta}\right]^{-1}$

the inverse of the outer product of the gradient vectors evaluated at the maximum of the likelihood function.
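For the logit model the gradient and Hessian are available in closed form, so the Newton-Raphson scheme is easy to write down. The Python sketch below (not from the original notes) is one possible implementation; the function name, the zero start values and the convergence rule are illustrative assumptions.

```python
import numpy as np

def logit_newton_raphson(X, y, n_iter=50, tol=1e-10):
    """Logit ML by Newton-Raphson; returns estimates and the inverse-Hessian covariance matrix."""
    n, k = X.shape
    beta = np.zeros(k)                                    # start values
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))               # Lambda(x'beta)
        gradient = X.T @ (y - p)                          # d lnL / d beta
        hessian = -(X * (p * (1.0 - p))[:, None]).T @ X   # d^2 lnL / d beta d beta'
        step = np.linalg.solve(hessian, gradient)
        beta = beta - step                                # beta + D * gradient with D = -H^{-1}
        if np.max(np.abs(gradient)) < tol:
            break
    cov = np.linalg.inv(-hessian)                         # estimated asymptotic covariance matrix
    return beta, cov
```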

Parameter interpretation

Since binary regression models are nonlinear, no single approach to parameter interpretation can fully describe the relationship between a variable and the outcome probability. Here we discuss several methods to interpret parameters. For a given application, you may need to try each method before a final approach is determined.

Predicted probabilities: The most direct approach for interpretation is to examine the predicted probabilities of an event for different values of the independent variables. A useful first step is to examine the range of predicted probabilities within the sample, and the degree to which each variable affects the probabilities. If we consider the shape of the cumulative Normal (or logistic) distribution function, we see that it is approximately linear in the range between 0.2 and 0.8. Hence if the range of predicted probabilities is between 0.2 and 0.8, the relationship between the $x$'s and the predicted probability is nearly linear, and simple measures can be used to summarise the results.
Minimum and maximum predicted probabilities in the sample are evaluated by

$\min_i F(x_i'\hat\beta) \qquad\text{and}\qquad \max_i F(x_i'\hat\beta)$

To examine the effect of a single variable on the predicted probabilities we can allow one variable to vary from its minimum to its maximum, while all other variables are fixed at their means. Let $\hat P(y=1|\bar x, x_k)$ be the probability computed when all variables except $x_k$ are set equal to their means, and $x_k$ equals some specified value. Then the predicted change in the probability as $x_k$ changes from its minimum to its maximum value is given by

$\hat P(y=1|\bar x,\ x_k = \max x_k) - \hat P(y=1|\bar x,\ x_k = \min x_k)$

One can also use plots of predicted probabilities to examine the effect of one or two variables while the other variables are held constant. The effect of a discrete independent variable on the probability can be illustrated by tabulating the predicted probabilities at selected values.
Partial change or marginal effect: In the structural latent variable model $y^* = x'\beta + \varepsilon$ the partial derivative with respect to $x_k$ is

$\frac{\partial y^*}{\partial x_k} = \beta_k$

Since the model is linear in $y^*$ the interpretation is straightforward: for a unit change in $x_k$, $y^*$ is expected to change by $\beta_k$ units, holding all other variables constant. The problem is that the variance of $y^*$ is unknown, so the meaning of a change of $y^*$ by $\beta_k$ is unclear. Since the variance of $y^*$ changes when a new variable is added to the model, the magnitudes of all $\beta_j$ will change even if the new variable is uncorrelated with the original variables. This makes it misleading to compare coefficients from different specifications.
The $\beta_j$ can be used to compute the partial change in the probability of an event. Let

$P(y=1|x) = F(x'\beta)$

then the partial change in the probability, or the marginal effect, is

$\frac{\partial P(y=1|x)}{\partial x_k} = f(x'\beta)\,\beta_k$

In the probit case

$\frac{\partial P(y=1|x)}{\partial x_k} = \phi(x'\beta)\,\beta_k$

In the logit case

$\frac{\partial P(y=1|x)}{\partial x_k} = \Lambda(x'\beta)\left(1 - \Lambda(x'\beta)\right)\beta_k = P(y=1|x)\left(1 - P(y=1|x)\right)\beta_k$

The marginal effect is the slope of the probability curve relating $P(y=1|x)$ to $x_k$, holding all other variables fixed. The sign of the marginal effect is determined by $\beta_k$. The magnitude of the change depends on the magnitude of $\beta_k$ and the value of $x'\beta$.
To assess the relative magnitudes of the marginal effects of two variables we can use the ratio of the marginal effects for $x_k$ and $x_l$:

$\frac{\partial P(y=1|x)/\partial x_k}{\partial P(y=1|x)/\partial x_l} = \frac{f(x'\beta)\,\beta_k}{f(x'\beta)\,\beta_l} = \frac{\beta_k}{\beta_l}$

Since the value of the marginal effect depends on the levels of all variables, we must decide on which values of the variables to use when computing the marginal effect. One method is to use the average over all observations,

$\frac{1}{n}\sum_{i=1}^n f(x_i'\hat\beta)\,\hat\beta_k$

or to compute the marginal effect at the mean of the independent variables,

$f(\bar x'\hat\beta)\,\hat\beta_k$

Taking the mean value of an independent variable does of course make no sense for 0-1 dummy variables. In that case it is better to fix the dummy variable at either value or go for discrete changes.
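Both summary measures are easy to compute once the coefficients are estimated. A short Python sketch for the logit case (not from the original notes); the function name and array conventions are illustrative assumptions.

```python
import numpy as np

def logit_marginal_effects(X, beta):
    """Average marginal effects and marginal effects at the mean for a logit model."""
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    ame = np.mean(p * (1.0 - p)) * beta              # average of Lambda(1 - Lambda) over the sample, times beta
    p_bar = 1.0 / (1.0 + np.exp(-X.mean(axis=0) @ beta))
    mem = p_bar * (1.0 - p_bar) * beta               # marginal effect at the mean of the regressors
    return ame, mem
```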

Discrete change: Let $P(y=1|x, x_k)$ be the probability of an event given $x$, noting in particular the value of $x_k$. The discrete change in the probability for a change of $\delta$ in $x_k$ equals

$\frac{\Delta P(y=1|x)}{\Delta x_k} = P(y=1|x,\ x_k + \delta) - P(y=1|x,\ x_k)$

In nonlinear models the discrete change is unequal to the marginal change, except in the limit as $\delta$ becomes infinitely small. The practical problem is again choosing which values of the variables to consider and how much to let them change. Some options are:

unit change in $x_k$, if $x_k$ increases from $x_k$ to $x_k + 1$

$\frac{\Delta P(y=1|x)}{\Delta x_k} = P(y=1|x,\ x_k + 1) - P(y=1|x,\ x_k)$

centred unit change

$\frac{\Delta P(y=1|x)}{\Delta x_k} = P(y=1|x,\ x_k + 1/2) - P(y=1|x,\ x_k - 1/2)$

standard deviation change, where $s_k$ is the standard deviation of $x_k$

$\frac{\Delta P(y=1|x)}{\Delta x_k} = P(y=1|x,\ x_k + s_k/2) - P(y=1|x,\ x_k - s_k/2)$

a change from 0 to 1 for dummy variables

$\frac{\Delta P(y=1|x)}{\Delta x_k} = P(y=1|x,\ x_k = 1) - P(y=1|x,\ x_k = 0)$

Odds ratios: This method takes advantage of the tractable form of the logit model. Define the odds of an event as

$O(x) = \frac{P(y=1|x)}{P(y=0|x)} = \frac{P(y=1|x)}{1 - P(y=1|x)}$

Taking logs in the logit model gives

$\ln O(x) = x'\beta$

The log odds are linear in $x$. The interpretation of the effect of a change in $x_k$ is straightforward: for a unit change in $x_k$, we expect the log odds to change by $\beta_k$, holding all other variables constant,

$\ln O(x,\ x_k + 1) - \ln O(x,\ x_k) = \beta_k$

To compare the odds before and after adding $\delta$ to $x_k$, we take the odds ratio

$\frac{O(x,\ x_k + \delta)}{O(x,\ x_k)} = \exp(\delta\beta_k)$

and the parameters can be interpreted in terms of odds ratios:

For a change of $\delta$ in $x_k$, the odds are expected to change by a factor of $\exp(\delta\beta_k)$, holding all other variables constant.
Hypothesis testing and measures of goodness of fit

Using the result about the asymptotic normality of the Maximum Likelihood estimator allows formulating a Wald test for testing linear restrictions:

$H_0: R\beta = r, \quad \text{rank}(R) = q$

$W = (R\hat\beta - r)'\left[R\,\widehat{Var}(\hat\beta)\,R'\right]^{-1}(R\hat\beta - r) \xrightarrow{d} \chi^2(q)$

where $\widehat{Var}(\hat\beta)$ is the estimated covariance matrix for $\hat\beta$, which we get from the iterative maximisation procedure.
Alternatively we can formulate a Likelihood Ratio test,

$LR = 2\left(\ln L_U - \ln L_R\right) \xrightarrow{d} \chi^2(q)$

where $\ln L_U$ is the value of the log likelihood function evaluated at the Maximum Likelihood estimates for the unconstrained model, and $\ln L_R$ is the value of the log likelihood function evaluated at the Maximum Likelihood estimates for the constrained model.
To test the overall significance of the model we can use the so-called Likelihood Ratio $\chi^2$ test. It compares the unconstrained model with a model where only a constant is included:

$M_U$: full model
$M_R$: model with only a constant included

$LR = 2\left(\ln L_U - \ln L_R\right) \xrightarrow{d} \chi^2(k-1)$
Residuals: For a binary model we define

$\hat\pi_i = \hat E(y_i|x_i) = \hat P(y_i = 1|x_i) = F(x_i'\hat\beta)$

Since $y$ is binary, the deviations $y_i - \hat\pi_i$ are heteroscedastic, with $Var(y_i - \pi_i|x_i) = \pi_i(1-\pi_i)$. This suggests the Pearson residual:

$r_i = \frac{y_i - \hat\pi_i}{\sqrt{\hat\pi_i(1-\hat\pi_i)}}$

Large values of $r_i$ suggest a failure of the model to fit a given observation. Pearson residuals can be used to construct a summary measure of fit, known as the Pearson statistic,

$\chi^2 = \sum_{i=1}^n r_i^2$

Beware that while $Var(y_i|x_i) = \pi_i(1-\pi_i)$,

$Var(y_i - \hat\pi_i) \ne \hat\pi_i(1-\hat\pi_i)$

Consequently the variance of $r_i$ is not equal to 1.

Pseudo $R^2$'s: Several pseudo-$R^2$ measures for nonlinear models have been defined in analogy to the formulas for the linear regression model. These formulas produce different values in models with categorical outcomes and, consequently, are thought of as distinct measures.

Percentage of explained variation

$R^2 = 1 - \frac{\sum_i(y_i - \hat\pi_i)^2}{\sum_i(y_i - \bar y)^2}$

Likelihood Ratio Index, McFadden

$R^2_{McF} = 1 - \frac{\ln L_U}{\ln L_R}$

where $\ln L_R$ is the log likelihood of the model with only a constant. Since $R^2_{McF}$ increases as a new variable is added to the model, an adjusted version penalises the number of parameters $K$:

$\bar R^2_{McF} = 1 - \frac{\ln L_U - K}{\ln L_R}$

Observed versus predicted values

$\hat y_i = 1$ if $\hat\pi_i > 1/2$, $\quad \hat y_i = 0$ if $\hat\pi_i \le 1/2$

$R^2_{count} = \frac{1}{n}\,\#\{\text{correctly predicted observations}\}$
9.2 Censored regression models: Tobit

In the linear regression model, the values of all variables are known for the entire sample. Here we consider the situation in which the sample is limited by censoring or truncation. Censoring occurs when we observe the independent variables for the entire sample, but for some observations we have only limited information about the dependent variable. For example, we might know that the dependent variable is less than 100, but not know how much less. Truncation limits the data more severely by excluding observations based on characteristics of the dependent variable. For example, in a truncated sample all cases where the dependent variable is less than 100 would be deleted. While truncation changes the sample, censoring does not.

Problem of censoring, example: Let $y^*$ be a dependent variable that is not censored. If we do not know the value of $y^*$ for $y^* \le 1$, then $y^*$ is a latent variable that cannot be observed over its entire range. The censored variable $y$ is defined by

$y = \begin{cases} y^* & \text{if } y^* > 1 \\ 1 & \text{if } y^* \le 1 \end{cases}$

Consider the model $y^* = \alpha + \beta x + \varepsilon$. With no censoring of observations OLS would result in unbiased estimates

$\hat y = \hat\alpha + \hat\beta x$

If $y^*$ is censored from below at 1, we know $x$ for all observations, but observe $y^*$ only for $y^* > 1$; the values at or below 1 are recorded as $y = 1$. The resulting OLS estimate on the censored sample,

$\hat y = \hat\alpha_c + \hat\beta_c x$

overestimates the intercept and underestimates the slope.

Since including censored observations causes problems, we might use OLS to estimate the regression after truncating the sample to exclude cases with a censored dependent variable. After deleting the cases at $y = 1$, the OLS estimate

$\hat y = \hat\alpha_t + \hat\beta_t x$

also overestimates the intercept and underestimates the slope.
Truncated and censored Normal distribution

Assume $y^*$ is normally distributed, $y^* \sim N(\mu, \sigma^2)$, with density

$f(y^*|\mu, \sigma^2) = \frac{1}{\sigma}\,\phi\!\left(\frac{y^* - \mu}{\sigma}\right)$

Truncated Normal distribution: When values below $\tau$ are deleted, the variable $y^*|y^* > \tau$ has a truncated Normal distribution with density

$f(y^*|y^* > \tau, \mu, \sigma^2) = \frac{f(y^*|\mu, \sigma^2)}{P(y^* > \tau)} = \frac{\frac{1}{\sigma}\phi\!\left(\frac{y^* - \mu}{\sigma}\right)}{1 - \Phi\!\left(\frac{\tau - \mu}{\sigma}\right)} = \frac{\frac{1}{\sigma}\phi\!\left(\frac{y^* - \mu}{\sigma}\right)}{\Phi\!\left(\frac{\mu - \tau}{\sigma}\right)}$

Given that the left-hand side of the distribution has been truncated, $E(y^*|y^* > \tau)$ must be larger than $E(y^*) = \mu$. For the Normal distribution we have

$E(y^*|y^* > \tau) = \mu + \sigma\,\frac{\phi\!\left(\frac{\mu - \tau}{\sigma}\right)}{\Phi\!\left(\frac{\mu - \tau}{\sigma}\right)} = \mu + \sigma\,\lambda\!\left(\frac{\mu - \tau}{\sigma}\right)$   (16)

where $\lambda(z) = \phi(z)/\Phi(z)$ is called the inverse Mill's ratio.

Censored Normal distribution:

$y = \begin{cases} y^* & \text{if } y^* > \tau \\ \tau_y & \text{if } y^* \le \tau \end{cases}$

Most often, $\tau_y = \tau$. We know that if $y^*$ is Normal, then the probability of an observation being censored is

$P(\text{censored}) = P(y^* \le \tau) = \Phi\!\left(\frac{\tau - \mu}{\sigma}\right)$   (17)

and the probability of not being censored is

$P(\text{uncensored}) = 1 - \Phi\!\left(\frac{\tau - \mu}{\sigma}\right) = \Phi\!\left(\frac{\mu - \tau}{\sigma}\right)$   (18)

Thus the expected value of a censored variable equals

$E(y) = P(\text{uncensored})\,E(y^*|y^* > \tau) + P(\text{censored})\,\tau_y = \Phi\!\left(\frac{\mu - \tau}{\sigma}\right)\left[\mu + \sigma\lambda\!\left(\frac{\mu - \tau}{\sigma}\right)\right] + \Phi\!\left(\frac{\tau - \mu}{\sigma}\right)\tau_y$   (19)

Tobit model for censored data

The structural equation is

$y_i^* = x_i'\beta + \varepsilon_i$

where $\varepsilon_i \sim N(0, \sigma^2)$. $y^*$ is a latent variable that is observed for values greater than $\tau$ and censored for values less than or equal to $\tau$:

$y_i = \begin{cases} y_i^* & \text{if } y_i^* > \tau \\ \tau_y & \text{if } y_i^* \le \tau \end{cases}$

Combining the two equations results in the model

$y_i = \begin{cases} x_i'\beta + \varepsilon_i & \text{if } y_i^* > \tau \\ \tau_y & \text{if } y_i^* \le \tau \end{cases}$

The probability of an observation being censored for a given $x_i$ is

$P(\text{censored}|x_i) = P(y_i^* \le \tau|x_i) = P(\varepsilon_i \le \tau - x_i'\beta\,|\,x_i)$   (20)

Since $\varepsilon_i \sim N(0, \sigma^2)$,

$P(\text{censored}|x_i) = P\!\left(\frac{\varepsilon_i}{\sigma} \le \frac{\tau - x_i'\beta}{\sigma}\,\Big|\,x_i\right) = \Phi\!\left(\frac{\tau - x_i'\beta}{\sigma}\right)$   (21)

and

$P(\text{uncensored}|x_i) = 1 - \Phi\!\left(\frac{\tau - x_i'\beta}{\sigma}\right) = \Phi\!\left(\frac{x_i'\beta - \tau}{\sigma}\right)$
Deriving the probability of a case being censored is very similar to deriving the probability of an event in the probit model. In the tobit model we know the value of $y_i^*$ if $y_i^* > \tau$, while in the probit model we only know whether $y_i^* > \tau$. Since more information is available in tobit, estimates of the parameters from tobit are more efficient than the estimates from probit. Further, since all cases are censored in probit, we have no way to estimate the variance of $y^*$ and must make assumptions about it, while $Var(\varepsilon) = \sigma^2$ can be estimated in the tobit model.

Analysing a truncated sample:

$E(y_i|y_i^* > \tau, x_i) = E(x_i'\beta + \varepsilon_i|y_i^* > \tau, x_i) = x_i'\beta + E(\varepsilon_i|y_i^* > \tau, x_i)$

From equation (16) we know that

$E(y_i|y_i^* > \tau, x_i) = x_i'\beta + \sigma\lambda(\delta_i)$   (22)

where $\lambda$ is the Mill's ratio and $\delta_i = (x_i'\beta - \tau)/\sigma$. The problem introduced by truncation is that the regression model implied by equation (22) is of the form

$y_i = x_i'\beta + \sigma\lambda_i + v_i$

In this equation $\lambda_i = \lambda(\delta_i)$ may be thought of as an additional variable. If we estimate $\beta$ using only $y_i = x_i'\beta + v_i$, we have a misspecified model that excludes $\lambda_i$.

Analysing a censored sample: In the case of a censored sample, applying equation (19) we get

$E(y_i|x_i) = P(\text{uncensored}|x_i)\,E(y_i|y_i^* > \tau, x_i) + P(\text{censored}|x_i)\,\tau_y$   (23)

Using equations (20) and (21) with $\delta_i = (x_i'\beta - \tau)/\sigma$,

$E(y_i|x_i) = \Phi(\delta_i)\,x_i'\beta + \sigma\phi(\delta_i) + \Phi(-\delta_i)\,\tau_y$

$E(y_i|x_i)$ is nonlinear in $\beta$, so estimating the OLS regression of $y$ on $x$ results in inconsistent estimates of the parameters.
Estimation: Maximum Likelihood estimation of the tobit model involves dividing the observations into two sets. The first contains the uncensored observations, which ML treats in the same way as in the linear regression model. For the censored observations we do not know the value of $y_i^*$, but we know that $y_i^* \le \tau$. Hence we use the probability of being censored as the likelihood contribution.

Formally, for uncensored observations the likelihood contributions are

$L_i(\beta, \sigma^2|y_i, x_i) = \frac{1}{\sigma}\,\phi\!\left(\frac{y_i - x_i'\beta}{\sigma}\right)$

$\ln L_U(\beta, \sigma^2) = \sum_{\text{uncensored}}\ln\frac{1}{\sigma}\,\phi\!\left(\frac{y_i - x_i'\beta}{\sigma}\right)$

and for censored observations

$L_i(\beta, \sigma^2|x_i) = P(y_i^* \le \tau|x_i) = \Phi\!\left(\frac{\tau - x_i'\beta}{\sigma}\right)$

$\ln L_C(\beta, \sigma^2) = \sum_{\text{censored}}\ln\Phi\!\left(\frac{\tau - x_i'\beta}{\sigma}\right)$

The log likelihood is thus given by

$\ln L(\beta, \sigma^2) = \sum_{\text{uncensored}}\ln\frac{1}{\sigma}\,\phi\!\left(\frac{y_i - x_i'\beta}{\sigma}\right) + \sum_{\text{censored}}\ln\Phi\!\left(\frac{\tau - x_i'\beta}{\sigma}\right)$

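This log likelihood can be maximised numerically. Below is a minimal Python sketch (not from the original notes); the log-sigma parametrisation, the function names and the use of scipy's optimiser are illustrative assumptions.

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize

def tobit_negative_loglik(params, X, y, tau=0.0):
    """Negative tobit log likelihood with censoring from below at tau; params = (beta, log sigma)."""
    beta, sigma = params[:-1], np.exp(params[-1])         # log-parametrisation keeps sigma > 0
    xb = X @ beta
    censored = y <= tau
    ll_unc = stats.norm.logpdf((y - xb) / sigma) - np.log(sigma)
    ll_cen = stats.norm.logcdf((tau - xb) / sigma)
    return -(np.sum(ll_unc[~censored]) + np.sum(ll_cen[censored]))

def fit_tobit(X, y, tau=0.0):
    start = np.zeros(X.shape[1] + 1)                      # crude start values
    res = minimize(tobit_negative_loglik, start, args=(X, y, tau), method="BFGS")
    return res.x[:-1], np.exp(res.x[-1])                  # beta_hat, sigma_hat
```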
Interpretation: For the interpretation of the estimation outcome it is important whether censoring or truncation occurred by accident or whether censoring is a genuine property of the data. We distinguish between the effects of changes in the independent variables on the latent variable, the truncated variable, or the censored variable.

Change in the latent outcome:

$\frac{\partial E(y^*|x)}{\partial x_k} = \beta_k$

Change in the truncated outcome (with $\delta = (x'\beta - \tau)/\sigma$):

$\frac{\partial E(y|y^* > \tau, x)}{\partial x_k} = \beta_k\left[1 - \delta\,\lambda(\delta) - \lambda(\delta)^2\right]$

Change in the censored outcome:

$\frac{\partial E(y|x)}{\partial x_k} = \beta_k\,\Phi(\delta) = \beta_k\,P(\text{uncensored}|x) \qquad\text{if } \tau_y = \tau$

McDonald and Moffitt's decomposition: McDonald and Moffitt suggest a decomposition of $\partial E(y|x)/\partial x_k$ that highlights two sources of change in the censored outcome. The simplest way to derive their decomposition is to differentiate equation (23) using the product rule:

$E(y|x) = P(\text{uncensored}|x)\,E(y|y^* > \tau, x) + P(\text{censored}|x)\,\tau_y$

$\frac{\partial E(y|x)}{\partial x_k} = P(\text{uncensored}|x)\,\frac{\partial E(y|y^* > \tau, x)}{\partial x_k} + \left[E(y|y^* > \tau, x) - \tau_y\right]\frac{\partial P(\text{uncensored}|x)}{\partial x_k}$

The decomposition shows that when $x_k$ changes, it affects the expectation of $y^*$ for uncensored cases weighted by the probability of being uncensored, and it affects the probability of being uncensored weighted by the expected value for uncensored cases minus the censoring value $\tau_y$.
