Andrea Weber
December, 2003
Help with text processing from Andrey Launov and Ivan Prianichnikov is gratefully acknowledged. Thanks to Michael Grabner for helpful comments and for finding lots of typos.
Contents

1 Introduction
4 Stochastic regression
7 Asymptotic theory
  7.1 Introduction to asymptotic theory
  7.2 Asymptotic properties of OLS estimators
8 The generalised linear regression model
  8.1 Aitken estimator
  8.2 Asymptotic properties of GLS
  8.3 Heteroscedasticity
  8.4 Autocorrelation
References
Baltagi, B. H. (Ed.), 2001. A Companion to Theoretical Econometrics. Blackwell Publishers.
Davidson, R., MacKinnon, J., 1993. Estimation and Inference in Econometrics. Oxford University Press.
Johnston, J., DiNardo, J., 1997. Econometric Methods, 4th Edition. McGraw Hill.
Scott Long, J., 1997. Regression Models for Categorical and Limited Dependent Variables. SAGE Publications.
1 Introduction
The aim of the lecture is twofold. First, students should receive guidelines for applied empirical research. Second, the lecture should also provide a good theoretical basis for advanced econometrics courses.
What is Econometrics?
At the beginning of the twentieth century economic theory was mainly intuitive and
empirical support for it was largely anecdotal. Now economics has a rich array of
formal models and a high quality data base. Empirical regularities motivate theory in
many areas of economics, and data are routinely used to test theory. Many economic
theories have been developed as measurement frameworks to suggest what data should
be collected and how they should be interpreted.
Econometric theory was developed to analyse and interpret economic data. Most econometric theory adapts methods originally developed in statistics.
Figure 1: Analysis of what is required to recover causal parameters from data (the identification problem). Many theoretical models may be consistent with the same data.
The concept of a causal parameter
By a causal effect economists mean a “ceteris paribus” change (all other things are
equal).
Consider, for example, a model of production of output $y$ based on inputs $x_1, \dots, x_k$ that can be varied independently. We write the function

$$y = f(x_1, \dots, x_k),$$

and the causal effect of $x_j$, the partial derivative $\partial f / \partial x_j$, can be defined independently of the levels of the other inputs.
Examples
A time series data set consists of observations of a variable over time, e.g. stock prices, the consumer price index, etc. The chronological ordering of observations contains potentially important information.

A panel or longitudinal data set consists of a time series for each cross-sectional member in the data set, e.g. household panel surveys, OECD main economic indicators.
As an extension of the linear regression model in two variables let us introduce the
multiple linear regression model.
The model is specified by three ingredients: the observations $(y_i, x_{i1}, \dots, x_{ik})$, $i = 1, \dots, n$; the functional form $y_i = \beta_1 x_{i1} + \dots + \beta_k x_{ik} + \varepsilon_i$; and the fitting criterion, least squares.
To make the notation more convenient we write the model in matrix form. We define the $n$-dimensional vector of observations

$$y = (y_1, \dots, y_n)',$$

the $k$-dimensional parameter vector $\beta = (\beta_1, \dots, \beta_k)'$, the $n$-dimensional vector of error terms $\varepsilon = (\varepsilon_1, \dots, \varepsilon_n)'$, and the $n \times k$ matrix

$$X = \begin{pmatrix} x_{11} & \cdots & x_{1k} \\ \vdots & & \vdots \\ x_{n1} & \cdots & x_{nk} \end{pmatrix}.$$

In matrix notation the multiple linear regression model can be written in the following way:

$$y = X\beta + \varepsilon.$$
In the literature we find several names for the variables in the model, which are listed
below. We will commonly use the names in the first row.
dependent variable independent variables error
explained variable explanatory variables disturbance
regressand regressors, covariates
All points in $\mathbb{R}^n$ are determined by their length and direction. The Euclidean length of a vector $a$ is

$$\|a\| = \sqrt{a'a} = \Big(\sum_{i=1}^n a_i^2\Big)^{1/2}.$$

We assume that

1. $n > k$: there are more observations than independent variables,
To solve the minimisation problem

$$\min_\beta\ (y - X\beta)'(y - X\beta),$$

we calculate the first order condition, the normal equations

$$X'X\hat\beta = X'y.$$

Remember that the columns of $X$ are linearly independent, which implies that $X'X$ has full rank, so

$$\hat\beta = (X'X)^{-1}X'y.$$
Figure 3: The projection of $y$ onto the column space of $X$.

Define the projection matrices $P = X(X'X)^{-1}X'$ and $M = I - P$. Both are symmetric and idempotent, $P^2 = P$, $M^2 = M$, and $PM = 0$. With the help of the projection matrices we can decompose the vector of the dependent variables:

$$y = Py + My = \hat y + e,$$

where $\hat y = Py = X\hat\beta$ are the fitted values and $e = My$ are the residuals.
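As a concrete check of these formulas, here is a minimal sketch in NumPy (simulated data; all variable names are our own, not from the text):

```python
import numpy as np

# Simulated data: n observations, k regressors, first column a constant.
rng = np.random.default_rng(0)
n, k = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(size=n)

# OLS from the normal equations X'X beta_hat = X'y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Decomposition y = y_hat + e via the projection P.
P = X @ np.linalg.inv(X.T @ X) @ X.T
y_hat = P @ y            # fitted values, equal to X @ beta_hat
e = y - y_hat            # residuals, equal to (I - P) @ y

# The residuals are orthogonal to the columns of X: X'e = 0.
orthogonality = X.T @ e
```

Numerically, `orthogonality` is zero to machine precision, confirming $X'e = 0$.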
If a constant is included in the regression, the residuals sum up to 0, because in this case one column of $X$ is the vector $\iota = (1, \dots, 1)'$, and $X'e = 0$ implies

$$\iota'e = \sum_{i=1}^n e_i = 0.$$
2.3 Measuring the goodness of fit

The idempotency of $P$ and $M$ often makes expressions associated with least squares regression very simple. For example, the sum of squared residuals is given by

$$e'e = y'M'My = y'My.$$

Similarly, the sum of squared fitted values, which is also called the explained sum of squares, is

$$\hat y'\hat y = y'P'Py = y'Py.$$

Since $y = \hat y + e$ with $\hat y'e = 0$, the total sum of squares equals the sum of squared dependent variables:

$$y'y = \hat y'\hat y + e'e. \qquad (1)$$
Thus the total sum of squares of $y$ equals the explained sum of squares plus the sum of squared residuals. The fact that the total variation in the regressand can be divided into two parts, one "explained" by the regressors and one not explained, suggests a natural measure of how well the regression fits. Let us divide equation (1) by $y'y$:

$$1 = \frac{\hat y'\hat y}{y'y} + \frac{e'e}{y'y}, \qquad R^2 = \frac{\hat y'\hat y}{y'y} = 1 - \frac{e'e}{y'y}.$$
Properties of $R^2$:

1. $0 \le R^2 \le 1$, and it is unit free.
2. It is not invariant to a change of origin: adding a constant to $y$ changes its value.

A modification of $R^2$ lets us get around the problem addressed in the last point. This version is called the centered $R_c^2$,

$$R_c^2 = 1 - \frac{e'e}{y'M_0 y},$$

where $M_0 = I - \frac{1}{n}\iota\iota'$ and $\iota = (1, \dots, 1)'$. Multiplication of $M_0$ with a vector gives the vector of deviations from the mean,

$$M_0 a = (a_1 - \bar a, \dots, a_n - \bar a)', \qquad \bar a = \frac{1}{n}\sum_{i=1}^n a_i.$$
For $R_c^2$ we can derive a decomposition into explained sum of squares and residual sum of squares, as in equation (1), only if a constant is included in the regression. From the orthogonal decomposition of $M_0 y$ we get

$$y'M_0 y = \hat y'M_0 \hat y + e'e \quad \text{if a constant is included,}$$

and hence

$$1 = \frac{\text{SSE}}{\text{SST}} + \frac{\text{SSR}}{\text{SST}}, \qquad R_c^2 = \frac{\text{SSE}}{\text{SST}} = 1 - \frac{\text{SSR}}{\text{SST}},$$

where SST $= y'M_0 y$ is the total, SSE $= \hat y'M_0 \hat y$ the explained, and SSR $= e'e$ the residual sum of squares.
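The decomposition can be verified numerically; a sketch with simulated data (a constant is included, names are our own):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 0.8]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat
e = y - y_hat

# Centered sums of squares: SST = SSE + SSR holds because X contains a constant.
ybar = y.mean()
sst = np.sum((y - ybar) ** 2)       # total sum of squares
sse = np.sum((y_hat - ybar) ** 2)   # explained sum of squares
ssr = np.sum(e ** 2)                # residual sum of squares

r2 = sse / sst                      # centered R^2 = 1 - SSR/SST
```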
Properties of $R_c^2$:

3. $0 \le R_c^2 \le 1$ if a constant is included in the regression.

4. In the triangle, similar to the one in Figure 5, with sides $M_0 y$, $M_0 \hat y$ and $e$, $R_c^2$ is the squared cosine of the angle $\theta$ between $M_0 y$ and $M_0 \hat y$.
3 The classical linear regression model

3.1 Assumptions

$$y = X\beta + \varepsilon \qquad (2)$$

A1 The model $y = X\beta + \varepsilon$ is the true model.
A2 $X$ has full column rank $k$.
A3 $E(\varepsilon \mid X) = 0$.
A4 $E(\varepsilon\varepsilon' \mid X) = \sigma^2 I$.
A5 $X$ is nonstochastic.

Remarks
3. A3: $E(\varepsilon \mid X) = 0$. The zero conditional mean implies that the unconditional mean is also zero, since

$$E(\varepsilon) = E\big[E(\varepsilon \mid X)\big] = 0.$$

It also implies that $\varepsilon$ and $X$ are uncorrelated, which is seen by

$$E(X'\varepsilon) = E\big[E(X'\varepsilon \mid X)\big] = E\big[X'E(\varepsilon \mid X)\big] = 0,$$

and that the regression of $y$ on $X$ is the conditional mean, $E(y \mid X) = X\beta$.
5. A5: $X$ is nonstochastic in an experimental setting, where the analyst chooses the independent variables and then observes the outcome $y$.

Example: in an agricultural experiment the outcome $y$ may be crop yields, and the analyst chooses the amount of fertilizer that is applied.

An alternative view in the experimental setup is that the observations of $X$ are fixed in repeated samples.

In economics we do not often have the opportunity to analyse experimental data, so the assumption of nonstochastic $X$ is not very appropriate. However, we will see that this assumption can be dropped at a low cost.
A4, $E(\varepsilon\varepsilon' \mid X) = \sigma^2 I$, rules out two important departures:

Heteroskedastic errors: $\operatorname{Var}(\varepsilon_i)$ differs across observations.
Autocorrelated errors: $E(\varepsilon_i \varepsilon_j) \neq 0$ for some $i \neq j$.
From A1 we know that there exists a true $\beta$, and we want to estimate it as well as possible from the data.
We choose an estimator from the class of linear estimators,

$$\tilde\beta = Ay,$$

which are unbiased: $E(\tilde\beta) = \beta$ for all $\beta$. Out of these we choose the best estimator, which is the one with the smallest variance, i.e. for which $\operatorname{Var}(\tilde\beta)$ is minimal in the matrix sense.
Proof. Write $A = (X'X)^{-1}X' + C$. Unbiasedness requires

$$E(Ay) = AX\beta = \beta \ \text{ for all } \beta,$$

i.e. $AX = I$, which implies $CX = 0$.

Lemma 2 Let $z$ be an $n$-dimensional random variable whose first and second moments exist, and let $Dz$ be a $k$-dimensional random variable, where $D$ is a fixed $k \times n$ matrix. Then

$$\operatorname{Var}(Dz) = D \operatorname{Var}(z) D'.$$

Proof. Exercise.

As a consequence of these lemmas we note that $\tilde\beta = Ay$ is an unbiased linear estimator with variance covariance matrix

$$\operatorname{Var}(\tilde\beta) = \sigma^2 AA'.$$

Lemma 3 Let $X'X$ have full rank and let $A = (X'X)^{-1}X' + C$ with $CX = 0$; then

$$AA' = (X'X)^{-1} + CC'.$$

Proof. Exercise.

Now we apply Lemma 3 and get

$$\operatorname{Var}(\tilde\beta) = \sigma^2 AA' = \underbrace{\sigma^2 (X'X)^{-1}}_{=\operatorname{Var}(\hat\beta)} + \sigma^2 CC' \ \ge\ \sigma^2 (X'X)^{-1},$$

since $CC'$ is positive semidefinite.
With these steps we have proven the Gauss-Markov Theorem: under A1–A5 the OLS estimator $\hat\beta$ is the best linear unbiased estimator (BLUE) of $\beta$.

Remarks

$\hat\beta$ is unbiased, $E(\hat\beta) = \beta$, with variance covariance matrix

$$\operatorname{Var}(\hat\beta) = \sigma^2 (X'X)^{-1}.$$

Linear transformations $R\hat\beta$ of $\hat\beta$ are BLUE for $R\beta$, with variance $\sigma^2 R(X'X)^{-1}R'$.
Estimating a linear combination $c'\beta$

Consider estimating $\delta = c'\beta$ by a linear estimator $a'y$. Unbiasedness requires $E(a'y) = a'X\beta = c'\beta$ for all $\beta$, i.e. $X'a = c$. Minimising the variance $\sigma^2 a'a$ subject to this constraint gives $a = X(X'X)^{-1}c$.

Thus we get the BLUE for $c'\beta$,

$$\hat\delta = c'\hat\beta, \qquad \operatorname{Var}(\hat\delta) = \sigma^2\, c'(X'X)^{-1}c.$$
Estimation of $\sigma^2$

Since $E(\varepsilon_i^2) = \sigma^2$, this suggests that we could take $\frac{1}{n}e'e$ as an estimator for $\sigma^2$. But, due to the difference between the residuals $e$ and the true errors $\varepsilon$, $\frac{1}{n}e'e$ is biased.

We can see this by

$$e = My = M(X\beta + \varepsilon) = M\varepsilon,$$

so that, using the trace operator,

$$E(e'e) = E(\varepsilon'M\varepsilon) = E\big[\operatorname{tr}(\varepsilon'M\varepsilon)\big] = E\big[\operatorname{tr}(M\varepsilon\varepsilon')\big] = \operatorname{tr}\big(M\,E(\varepsilon\varepsilon')\big) = \sigma^2 \operatorname{tr}(M),$$

with

$$\operatorname{tr}(M) = \operatorname{tr}(I_n) - \operatorname{tr}(P) = n - k.$$

So we have

$$E\Big(\frac{e'e}{n-k}\Big) = \sigma^2,$$

and $s^2 = e'e/(n-k)$ is an unbiased estimator for $\sigma^2$.
Now that we have an estimator for $\sigma^2$, we can present an estimator for the variance of $\hat\beta$:

$$\widehat{\operatorname{Var}}(\hat\beta) = s^2 (X'X)^{-1}.$$

Remarks

The standard error of a single component $\hat\beta_j$ of the parameter vector is given by

$$\operatorname{se}(\hat\beta_j) = \sqrt{s^2 \big[(X'X)^{-1}\big]_{jj}}.$$
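A sketch of this variance estimation step in code (simulated data; names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([0.5, 1.0, -1.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
e = y - X @ beta_hat

# Unbiased variance estimator s^2 = e'e / (n - k).
s2 = e @ e / (n - k)

# Estimated covariance matrix of beta_hat and the standard errors
# (square roots of its diagonal).
V_hat = s2 * XtX_inv
se = np.sqrt(np.diag(V_hat))
```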
Remember that we defined the centered coefficient of determination by

$$R_c^2 = 1 - \frac{e'e}{y'M_0 y},$$

or alternatively the adjusted

$$\bar R^2 = 1 - \frac{e'e/(n-k)}{y'M_0 y/(n-1)}.$$

Both usually have a small but negative bias.
Prediction

We use the model $y = X\beta + \varepsilon$, where $\beta$ is unknown, and a vector of regressors $x_0$ is given.

Example: with the help of past values, predict consumption in 2001.

The problem is again to find the best linear unbiased predictor for $y_0 = x_0'\beta + \varepsilon_0$. By the same argument as in the Gauss-Markov Theorem, it is given by

$$\hat y_0 = x_0'\hat\beta.$$
4 Stochastic regression
Social scientists are rarely able to analyse experimental data. Thus it is necessary to
extend the results of the preceding section to cases in which some or all independent
variables are randomly drawn from some probability distribution.
A convenient method of obtaining the statistical properties of $\hat\beta$ is to condition on $X$. As before,

$$E(\hat\beta \mid X) = \beta + (X'X)^{-1}X'E(\varepsilon \mid X) = \beta,$$

so by the law of iterated expectations $E(\hat\beta) = E\big[E(\hat\beta \mid X)\big] = \beta$: the OLS estimator remains unbiased. Similarly,

$$\operatorname{Var}(\hat\beta \mid X) = \sigma^2 (X'X)^{-1}, \qquad \operatorname{Var}(\hat\beta) = \sigma^2 E\big[(X'X)^{-1}\big].$$
With the Gauss-Markov Theorem in the previous section we have shown that $\operatorname{Var}(\tilde\beta \mid X) \ge \operatorname{Var}(\hat\beta \mid X)$. This inequality, if it holds for every particular $X$, must also hold on average over the values of $X$.
We noticed that A3 implies

$$\operatorname{Cov}(X, \varepsilon) = 0,$$

which says that $X$ and $\varepsilon$ are uncorrelated. The interpretation is that in some sense $X$ captures all relevant effects which are necessary to explain $y$.

Example: $y$ – wage, $x$ – education, $\varepsilon$ – ability. $E(\varepsilon \mid x) = 0$ means that average ability in all education groups is the same.
5 Statistical inference in the classical linear regression model

So far we have solved the estimation problem, but there remain some open questions, like: how precise are the estimates, and how can we test hypotheses about the parameters?

We need a testing framework in the linear regression model and exact distributional assumptions for the errors. We make an additional assumption:

A6 $\varepsilon \sim N(0, \sigma^2 I)$.

Together with A1–A5, A6 implies $y \sim N(X\beta, \sigma^2 I)$, and since $\hat\beta$ is a linear function of $y$, $\hat\beta$ is normally distributed as well.
Corollary 8 Let $z$ be an $n$-dimensional random variable with $z \sim N(\mu, \Sigma)$. If $w = Dz + c$, where $D$ is an $m \times n$ matrix with $\operatorname{rank}(D) = m$ and $m \le n$, then

$$w \sim N(D\mu + c,\ D\Sigma D').$$
Applying Corollary 8 to $\hat\beta = (X'X)^{-1}X'y$ with $y \sim N(X\beta, \sigma^2 I)$ gives

$$\hat\beta \sim N\big(\beta,\ \sigma^2 (X'X)^{-1}\big).$$

Next we turn to Maximum Likelihood (ML) estimation. For a sample with density $f(y; \theta)$, the likelihood function is $L(\theta) = f(y; \theta)$, and the ML estimator is $\hat\theta_{ML} = \arg\max_\theta L(\theta)$.
Here we want to find the ML estimator for the linear regression model. First we set up the likelihood function for $\beta$ and $\sigma^2$:

$$L(\beta, \sigma^2) = (2\pi\sigma^2)^{-n/2} \exp\!\Big(-\frac{(y - X\beta)'(y - X\beta)}{2\sigma^2}\Big).$$

Taking logarithms,

$$\log L(\beta, \sigma^2) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log\sigma^2 - \frac{(y - X\beta)'(y - X\beta)}{2\sigma^2}.$$

Maximising with respect to $\beta$ and $\sigma^2$ gives

$$\hat\beta_{ML} = (X'X)^{-1}X'y, \qquad \hat\sigma^2_{ML} = \frac{e'e}{n}.$$

The Maximum Likelihood estimator for $\beta$ equals the OLS estimator, and the Maximum Likelihood estimator for $\sigma^2$ is given by the biased variance estimator.
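A small numerical illustration of this equivalence (simulated data; for any fixed $\sigma^2$ the Gaussian log likelihood is maximised in $\beta$ at the OLS solution, so the value at $(\hat\beta, e'e/n)$ dominates nearby parameter values):

```python
import numpy as np

rng = np.random.default_rng(10)
n = 80
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 1.5]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # OLS = ML estimator of beta
e = y - X @ beta_hat
sigma2_ml = e @ e / n                          # ML estimator (biased: divides by n)

def loglik(beta, sigma2):
    """Gaussian log likelihood of the linear regression model."""
    resid = y - X @ beta
    return -0.5 * n * np.log(2 * np.pi * sigma2) - resid @ resid / (2 * sigma2)

# The log likelihood at the ML solution dominates a perturbed beta.
ll_max = loglik(beta_hat, sigma2_ml)
ll_other = loglik(beta_hat + 0.1, sigma2_ml)
```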
Theorem 9 In the classical regression model with normally distributed errors, the least squares estimator has minimal variance among all unbiased estimators. Thus $\hat\beta$ is efficient, not only linearly efficient.

Remark: For non-normally distributed errors, the ML estimator (based on the true error distribution) usually has a smaller variance than $\hat\beta$. Thus under non-normality it is better to use the ML estimator than OLS.
The general testing problem is to decide between a null hypothesis and an alternative, for example $H_0: \theta = \theta_0$ against $H_1: \theta \neq \theta_0$.
A statistical test is a decision rule based on the sample. The decision rule determines whether $H_0$ is accepted or rejected. For the possible outcomes of the testing procedure we find the following:

                 Accept H0            Reject H0
  H0 true       correct decision      type I error
  H0 false      type II error         correct decision

Tests are compared in terms of size and power. The "best" test has maximal power for a given (fixed) size.
In this section we discuss three testing principles which are based on Maximum Likelihood estimation of the parameter $\theta$. Given an arbitrary function $c(\cdot)$, the testing hypothesis is the following:

$$H_0: c(\theta) = 0.$$

Likelihood Ratio (LR) Test

The LR test is based on the ratio

$$\lambda = \frac{L_r}{L_u}$$

of the restricted maximum $L_r$ of the likelihood (imposing $c(\theta) = 0$) to the unrestricted maximum $L_u$. This ratio must be between 0 and 1. If $\lambda$ is too small we reject the null hypothesis.
Wald Test

If the restriction $c(\theta) = 0$ is valid, $c(\hat\theta)$ should be close to zero, because Maximum Likelihood estimation is consistent. Therefore the test is based on $c(\hat\theta)$. We reject the null hypothesis if this is significantly different from zero.
Lagrange Multiplier (LM) Test

If the restriction $c(\theta) = 0$ is valid, the restricted estimator should be near the point that maximises the log likelihood. Therefore, the slope of the likelihood function should be near zero at the restricted estimator. The test is based on the slope of the log likelihood at the point where the function is maximised subject to the restriction. The derivative of the log likelihood with respect to the parameters is called the score,

$$s(\theta) = \frac{\partial \log L(\theta)}{\partial \theta}.$$

The test is based on $s(\hat\theta_r)$, and we reject the null hypothesis if this is significantly different from zero.
The three tests are asymptotically equivalent, but differ in small samples. The choice among the three principles is often made on practical computational considerations:
Wald Test: requires estimation of the unrestricted model
LM Test: requires estimation of the restricted model
LR Test: requires estimation of both models
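As an illustration of the LR principle for one exclusion restriction, here is a sketch with simulated data (under normal errors, $-2\log\lambda$ reduces to $n$ times the log of the ratio of restricted to unrestricted residual sums of squares; all names are our own):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 120
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 0.5 * x1 + rng.normal(size=n)  # the restriction "x2 excluded" is true

def ssr(X, y):
    """Residual sum of squares from an OLS fit."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ b
    return e @ e

X_u = np.column_stack([np.ones(n), x1, x2])  # unrestricted model
X_r = np.column_stack([np.ones(n), x1])      # restricted model (x2 excluded)

# With normal errors the ML variance estimates are SSR/n, so
# -2 log(lambda) = n * log(SSR_restricted / SSR_unrestricted) >= 0.
lr = n * np.log(ssr(X_r, y) / ssr(X_u, y))
```

Because the restricted model is nested in the unrestricted one, the restricted SSR can never be smaller, so the statistic is always nonnegative.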
Note: $\lambda$ is always between 0 and 1, because both likelihood functions are positive and $L_r$ cannot be larger than $L_u$ (a restricted maximum is never greater than an unrestricted one).

If $H_0$ is true: $\lambda$ is near 1.
If $H_0$ is false: $\lambda$ is near 0.
We need the statistical framework to specify what "near" means. For a given significance level, the graph shows the density of $\lambda$ in the case that the null hypothesis is true and in the case that it is not true.
We start with testing a hypothesis about a single element of the parameter vector $\beta$, like

$$H_0: \beta_j = \beta_j^0.$$
To derive a decision rule we have to find a suitable test statistic with a known probability distribution. Remember that under A6

$$\hat\beta \sim N\big(\beta, \sigma^2 (X'X)^{-1}\big),$$

and therefore

$$\frac{\hat\beta_j - \beta_j}{\sqrt{\sigma^2 \big[(X'X)^{-1}\big]_{jj}}} \sim N(0, 1).$$

Replacing $\sigma^2$ by its unbiased estimator $s^2$, the statistic

$$t = \frac{\hat\beta_j - \beta_j}{\operatorname{se}(\hat\beta_j)}$$

has a $t$-distribution with $n - k$ degrees of freedom. Remember that the standard error is $\operatorname{se}(\hat\beta_j) = \sqrt{s^2 \big[(X'X)^{-1}\big]_{jj}}$.
The most common application of a test on a single value of $\beta$ is to test the hypothesis

$$H_0: \beta_j = 0.$$

This hypothesis claims that the variable $x_j$ does not have a partial effect on $y$ after the other independent variables have been accounted for. If the null hypothesis is not rejected, $x_j$ can be eliminated from the regression equation. In this case the $t$-statistic is

$$t = \frac{\hat\beta_j}{\operatorname{se}(\hat\beta_j)}.$$

$t$ is small if either $\hat\beta_j$ is close to zero or $\operatorname{se}(\hat\beta_j)$ is large.
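Computing these t-statistics in code (a sketch with simulated data; names are our own):

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 150, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 2.0, 0.0]) + rng.normal(size=n)  # last coefficient truly 0

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
e = y - X @ beta_hat
s2 = e @ e / (n - k)
se = np.sqrt(s2 * np.diag(XtX_inv))

# t-statistics for H0: beta_j = 0, one per coefficient.
t_stats = beta_hat / se
```

With the true slope of 2.0 on the second regressor, its t-statistic is far in the rejection region, while the third coefficient's statistic is typically small.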
There are two possibilities to define the alternative hypothesis. We start with a one-sided alternative:

$$H_0: \beta_j = 0, \qquad H_1: \beta_j > 0.$$

We reject $H_0$ if $t > c$, with $c$ the $(1 - \alpha)$ percentile of the $t_{n-k}$-distribution. According to this decision rule, for $\alpha = 5\%$ a rejection of $H_0$ will occur in 5% of all random samples in which $H_0$ is true (type I error).
Example 10 We consider a model which explains log wages by the years of education, years of working experience and years of tenure with the current employer:

$$\log(wage) = \beta_1 + \beta_2\, educ + \beta_3\, exper + \beta_4\, tenure + \varepsilon.$$

Now we want to test if experience has a partial influence on wages, once education and tenure have been accounted for:

$$H_0: \beta_3 = 0, \qquad H_1: \beta_3 > 0.$$
The second possibility to define an alternative hypothesis is a two-sided alternative:

$$H_0: \beta_j = 0, \qquad H_1: \beta_j \neq 0.$$

We reject $H_0$ if $|t| > c$, where for significance level $\alpha$ the critical value $c$ is the $(1 - \alpha/2)$ percentile of the $t_{n-k}$-distribution. In the wage example the statistic $t_3$ exceeds the critical value; hence $H_0$ can be rejected.
If we want to test if $\beta_j$ is equal to some given constant $a_j$, we proceed in the same way:

$$H_0: \beta_j = a_j, \qquad H_1: \beta_j \neq a_j, \qquad t = \frac{\hat\beta_j - a_j}{\operatorname{se}(\hat\beta_j)}.$$
Example 12 (Constant elasticity model) In this model we study the effect of air pollution on housing prices. The dependent variable is the median housing price in 506 Boston regions, and the variable $nox$ gives the average amount of nitrous oxide. In the constant elasticity model

$$\log(price) = \beta_1 + \beta_2 \log(nox) + \dots + \varepsilon,$$

$\beta_2$ is the elasticity of prices with respect to $nox$, and a hypothesis about a given elasticity value $a_2$ is tested with $t = (\hat\beta_2 - a_2)/\operatorname{se}(\hat\beta_2)$.
Now we want to test whether a group of variables has a partial effect on the dependent variable. Consider the two models:

$$y = \beta_1 x_1 + \dots + \beta_{k-r} x_{k-r} + \dots + \beta_k x_k + \varepsilon \qquad (3)$$

and

$$y = \beta_1 x_1 + \dots + \beta_{k-r} x_{k-r} + \varepsilon. \qquad (4)$$

A test between the two models is a test for the hypothesis

$$H_0: \beta_{k-r+1} = \dots = \beta_k = 0, \qquad H_1: H_0 \text{ is not true}.$$

First, we estimate the unrestricted model given by equation (3) and keep the coefficient of determination $R_u^2$, which tells us how well the model fits. Second, we estimate the restricted model of equation (4) and keep $R_r^2$ from the regression. A test statistic can then be based on the difference between the $R^2$ from both models:

$$F = \frac{(R_u^2 - R_r^2)/r}{(1 - R_u^2)/(n-k)}.$$
To generalise the procedure applied in this example, we consider the linear regression model

$$y = X\beta + \varepsilon$$

and rearrange the columns of $X$ so that the independent variables from the restricted model come first,

$$y = X_1\beta_1 + X_2\beta_2 + \varepsilon,$$

with $X_1$ of dimension $n \times (k-r)$ and $X_2$ of dimension $n \times r$. The hypothesis is

$$H_0: \beta_2 = 0, \qquad H_1: H_0 \text{ is not true},$$

and under $H_0$ the statistic

$$F = \frac{(R_u^2 - R_r^2)/r}{(1 - R_u^2)/(n-k)}$$

has an $F_{r,\,n-k}$ distribution.
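The R²-based F statistic can be sketched as follows (simulated data with two truly irrelevant regressors; names are our own):

```python
import numpy as np

rng = np.random.default_rng(5)
n, k, r = 100, 4, 2   # k regressors in the unrestricted model, r restrictions
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 1.0, 0.0, 0.0]) + rng.normal(size=n)

def r_squared(X, y):
    """Centered R^2 from an OLS fit (a constant is included in X)."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ b
    return 1 - e @ e / np.sum((y - y.mean()) ** 2)

r2_u = r_squared(X, y)              # unrestricted: all k regressors
r2_r = r_squared(X[:, :k - r], y)   # restricted: last r regressors dropped

# F = ((R_u^2 - R_r^2)/r) / ((1 - R_u^2)/(n - k))
F = ((r2_u - r2_r) / r) / ((1 - r2_u) / (n - k))
```

Since the restricted model is nested, $R_r^2 \le R_u^2$ always holds, so the statistic is nonnegative by construction.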
Figure 16:
Example 13 Consider again the example with the wage equation from before. Now we want to test if wages are completely determined by years of education:

$$H_0: \beta_3 = \beta_4 = 0, \qquad H_1: H_0 \text{ is not true}.$$

The estimation of the unrestricted model gives $R_u^2$; for the restricted model we keep $R_r^2$, and the number of restrictions is $r = 2$. The test statistic is then computed as above and compared with the critical value of the $F_{2,\,n-4}$ distribution at the chosen significance level.
Consider now the general hypothesis

$$H_0: \beta_2 = \beta_2^0, \qquad H_1: H_0 \text{ is not true}.$$

We will show that in this case the test statistic

$$F = \frac{(e_r'e_r - e'e)/r}{e'e/(n-k)}$$

is $F_{r,\,n-k}$ distributed under $H_0$, where $e_r$ denotes the residuals of the restricted model.
The $\chi^2$-distribution

Let $z$ be an $n$-dimensional random vector with $z \sim N(0, I_n)$; then $v = z'z$ has a $\chi^2$-distribution with density function

$$f(v) = \frac{v^{n/2-1} e^{-v/2}}{2^{n/2}\,\Gamma(n/2)} \ \text{ if } v > 0, \qquad f(v) = 0 \ \text{ otherwise},$$

where $\Gamma(\cdot)$ denotes the gamma function. $v$ is called centrally $\chi^2$-distributed with $n$ degrees of freedom.

Let $z \sim N(\mu, I_n)$; then $v = z'z$ has a non-central $\chi^2$-distribution with parameters $n$ and non-centrality $\lambda = \mu'\mu$.
The $F$-distribution

Let $\chi_1^2$ and $\chi_2^2$ be two independent $\chi^2$-distributed random variables with $m$ and $n$ degrees of freedom respectively; then the random variable

$$F = \frac{\chi_1^2/m}{\chi_2^2/n}$$

has an $F$-distribution with $(m, n)$ degrees of freedom.

The $t$-distribution

Let $z \sim N(0, 1)$ and $\chi_n^2$ be independent; then

$$t = \frac{z}{\sqrt{\chi_n^2/n}}$$

has a $t$-distribution with $n$ degrees of freedom, and $t^2$ has an $F_{1,n}$ distribution.
Results on stochastic independence of quadratic forms

Example: In the classical linear regression model with normally distributed errors, $\hat\beta$ and $s^2$ are independent, since $\hat\beta$ depends on $\varepsilon$ only through $(X'X)^{-1}X'\varepsilon$, $s^2$ only through $M\varepsilon$, and $(X'X)^{-1}X'M = 0$.

Theorem 18 Under the assumptions A1, A2, A3, A4, A6 the quadratic form

$$\frac{e'e}{\sigma^2}$$

is $\chi^2$ distributed with $(n-k)$ degrees of freedom.

Proof.

$$\frac{e'e}{\sigma^2} = \frac{\varepsilon' M \varepsilon}{\sigma^2},$$

where the matrix $M$ is symmetric and idempotent with $\operatorname{rank}(M) = n - k$. The rest follows from Theorem 15.
Theorem 19 Let $\beta^0$ be fixed. Under the assumptions A1, A2, A3, A4, A6 the quadratic form

$$\chi^2 = \frac{(\hat\beta - \beta^0)'X'X(\hat\beta - \beta^0)}{\sigma^2},$$

with non-centrality parameter

$$\lambda = \frac{(\beta - \beta^0)'X'X(\beta - \beta^0)}{\sigma^2},$$

is independent of $e'e/\sigma^2$.

Proof. Write $\hat\beta - \beta^0 = (X'X)^{-1}X'(y - X\beta^0)$ with $y - X\beta^0 \sim N\big(X(\beta - \beta^0), \sigma^2 I\big)$. The matrix $P = X(X'X)^{-1}X'$ is idempotent and symmetric, hence the $\chi^2$ distribution follows from Theorem 15. Further, $PM = 0$, and we get independence from Theorem 16 and Theorem 17.
Now we have collected all tools that are necessary to derive the test statistics. First we note a helpful equation which can be verified by multiplying out:

$$(y - X\beta^0)'(y - X\beta^0) = e'e + (\hat\beta - \beta^0)'X'X(\hat\beta - \beta^0). \qquad (5)$$
Hypothesis I: $H_0: \beta = \beta^0$

The first hypothesis is a hypothesis on the complete parameter vector. We derive the test statistic from a Likelihood Ratio test. Remember that the LR test statistic is given by

$$\lambda = \frac{\max_{\sigma^2} L(\beta^0, \sigma^2)}{\max_{\beta, \sigma^2} L(\beta, \sigma^2)}.$$

We already derived the Maximum Likelihood estimators for $\beta$ and $\sigma^2$ from the likelihood function: $\hat\beta = (X'X)^{-1}X'y$ and $\hat\sigma^2 = e'e/n$. In the restricted model we have $\beta = \beta^0$ and $\hat\sigma_r^2 = (y - X\beta^0)'(y - X\beta^0)/n$.

What we need are the values of the likelihood function at the restricted maximum and at the unrestricted maximum:

$$\log L(\hat\beta, \hat\sigma^2) = -\frac{n}{2}\log(2\pi\hat\sigma^2) - \frac{n}{2}, \qquad \log L(\beta^0, \hat\sigma_r^2) = -\frac{n}{2}\log(2\pi\hat\sigma_r^2) - \frac{n}{2}.$$

Now we get the LR test statistic

$$\lambda = \Big(\frac{\hat\sigma_r^2}{\hat\sigma^2}\Big)^{-n/2}. \qquad (6)$$

By equation (5), $\hat\sigma_r^2/\hat\sigma^2 = 1 + (\hat\beta - \beta^0)'X'X(\hat\beta - \beta^0)/e'e$, so $\lambda$ is a monotone function of

$$F = \frac{(\hat\beta - \beta^0)'X'X(\hat\beta - \beta^0)/k}{e'e/(n-k)},$$

which under $H_0$ has an $F_{k,\,n-k}$ distribution.
Hypothesis II: $H_0: \beta_2 = \beta_2^0$

We partition $\beta = (\beta_1', \beta_2')'$ and $X = (X_1, X_2)$ according to the test hypothesis, where $\beta_1$ is a $(k-r) \times 1$ vector and $\beta_2$ is an $r \times 1$ vector. That means we are testing $r$ restrictions,

$$H_0: \beta_2 = \beta_2^0.$$

Now the restricted model is given by

$$y - X_2\beta_2^0 = X_1\beta_1 + \varepsilon.$$

The restricted ML estimators are

$$\hat\beta_1^r = (X_1'X_1)^{-1}X_1'(y - X_2\beta_2^0), \qquad \hat\sigma_r^2 = \frac{e_r'e_r}{n}, \qquad (7)$$

where $e_r$ are the residuals of the restricted model. Proceeding as for Hypothesis I and applying equation (7), the LR statistic is a monotone function of

$$F = \frac{(e_r'e_r - e'e)/r}{e'e/(n-k)},$$

which under $H_0$ has an $F$ distribution with $r$ and $n-k$ degrees of freedom.
Hypothesis III: $H_0: \beta_j = \beta_j^0$

This is the hypothesis about a single element in the parameter vector. The third hypothesis is a special case of the second with $r = 1$. The $F$-statistic reduces to

$$F = \frac{(\hat\beta_j - \beta_j^0)^2}{s^2\big[(X'X)^{-1}\big]_{jj}} = t^2,$$

the square of the $t$-statistic

$$t = \frac{\hat\beta_j - \beta_j^0}{\operatorname{se}(\hat\beta_j)}.$$
5.5 Confidence intervals

With the help of the $t$-statistic we can construct a confidence interval for the parameter $\beta_j$. Since

$$t = \frac{\hat\beta_j - \beta_j}{\operatorname{se}(\hat\beta_j)} \sim t_{n-k},$$

a $(1-\alpha)$ confidence interval for $\beta_j$ is given by

$$\big[\hat\beta_j - c_{1-\alpha/2}\operatorname{se}(\hat\beta_j),\ \hat\beta_j + c_{1-\alpha/2}\operatorname{se}(\hat\beta_j)\big],$$

where $c_{1-\alpha/2}$ is the $(1-\alpha/2)$ percentile of the $t_{n-k}$ distribution.

A confidence interval for $\sigma^2$ follows from $e'e/\sigma^2 \sim \chi^2_{n-k}$:

$$\Big[\frac{e'e}{\chi^2_{n-k,\,1-\alpha/2}},\ \frac{e'e}{\chi^2_{n-k,\,\alpha/2}}\Big].$$
We consider again the classical linear regression model with normally distributed errors, in which assumptions A1–A6 hold. In addition we impose a set of $r$ linear restrictions on the model,

$$R\beta = q,$$

where $R$ is an $r \times k$ matrix with $\operatorname{rank}(R) = r$ and $q$ is an $r \times 1$ vector. For a single restriction ($r = 1$) a $t$-test can be based on

$$t = \frac{R\hat\beta - q}{\operatorname{se}(R\hat\beta)}.$$

We still need to specify the standard deviation $\operatorname{se}(R\hat\beta)$. Under the assumption of normality of the error terms it is determined by Corollary 8:

$$\operatorname{se}(R\hat\beta) = \sqrt{s^2\, R(X'X)^{-1}R'}.$$
The other possibility to construct the test is by re-parametrisation. We will see how this works in an example.

Example 20 The aim is to compare the returns to education between a two-year college (junior college) and a four-year college (university). The model we have in mind is

$$\log(wage) = \beta_1 + \beta_2\, jc + \beta_3\, univ + \beta_4\, exper + \varepsilon.$$

The hypothesis of interest: is one year of a junior college worth one year of university education?

$$H_0: \beta_2 = \beta_3, \qquad H_1: \beta_2 < \beta_3, \qquad t = \frac{\hat\beta_2 - \hat\beta_3}{\operatorname{se}(\hat\beta_2 - \hat\beta_3)}.$$

Now we estimate the model and get the following estimation result (standard errors in parentheses):

  log(wage) = 1.430 + 0.098 jc + 0.124 univ + 0.019 exper
             (0.270) (0.031)    (0.035)      (0.008)

Define $\theta = \beta_2 - \beta_3$. We want to test

$$H_0: \theta = 0, \qquad H_1: \theta < 0.$$

Substituting $\beta_2 = \theta + \beta_3$ gives the re-parametrised model with the combined regressor $totcoll = jc + univ$:

$$\log(wage) = \beta_1 + \theta\, jc + \beta_3\, totcoll + \beta_4\, exper + \varepsilon.$$

OLS estimation of the new model gives

  log(wage) = 1.430 - 0.026 jc + 0.124 totcoll + 0.019 exper
             (0.270) (0.018)    (0.035)         (0.008)

Now we immediately see that $t = -0.026/0.018 \approx -1.4$, so $\hat\theta$ is not significantly different from zero at the 5% level.

Note: the advantage of the re-parametrisation is that $\hat\theta$ and $\operatorname{se}(\hat\theta)$ can be read off directly from the regression output.
Now we will generalise the two approaches that were introduced in the example above:

direct approach
re-parametrisation approach

The hypothesis is $H_0: R\beta = q$ with $\operatorname{rank}(R) = r$. Consequently, in the restricted model (the model on which the restrictions are imposed) only $k - r$ elements of $\beta$ are free to vary.

All the hypotheses we studied in the previous section are special cases of linear restrictions. Examples of how the hypotheses translate into linear restrictions:

1. $H_0: \beta = \beta^0$: $R = I_k$, $q = \beta^0$.
3. $H_0: \beta_j = \beta_j^0$: $R = e_j'$ (the $j$-th unit vector), $q = \beta_j^0$.
4. $H_0: \beta_2 = \beta_3$: $R = (0\ \ 1\ \ {-1}\ \ 0\ \cdots)$, $q = 0$.
Direct approach. Remember that under A1–A6

$$\hat\beta \sim N\big(\beta, \sigma^2 (X'X)^{-1}\big),$$

so by Corollary 8

$$R\hat\beta \sim N\big(R\beta, \sigma^2 R(X'X)^{-1}R'\big).$$

Under $H_0: R\beta = q$ the quadratic form

$$\frac{(R\hat\beta - q)'\big[R(X'X)^{-1}R'\big]^{-1}(R\hat\beta - q)}{\sigma^2} \sim \chi^2_r,$$

and an $F$-statistic is constructed by

$$F = \frac{(R\hat\beta - q)'\big[R(X'X)^{-1}R'\big]^{-1}(R\hat\beta - q)/r}{e'e/(n-k)} \sim F_{r,\,n-k}.$$

The following two applications show that this approach results in exactly the same test statistics we derived in the previous section.
First consider $H_0: \beta_j = \beta_j^0$. With $R = e_j'$ and $q = \beta_j^0$ we have $R(X'X)^{-1}R' = \big[(X'X)^{-1}\big]_{jj}$, so the $F$-statistic reduces to $t^2$.

Next consider $H_0: \beta_2 = \beta_2^0$, where $R = (0\ \ I_r)$ with $0$ an $r \times (k-r)$ zero matrix. We make use of a result on the inversion of partitioned matrices.
Partition

$$X'X = \begin{pmatrix} X_1'X_1 & X_1'X_2 \\ X_2'X_1 & X_2'X_2 \end{pmatrix}.$$

Multiplying the partitioned inverse of $X'X$ from the left with $R = (0\ \ I_r)$ and from the right with $R'$ cuts out the bottom right hand corner of $(X'X)^{-1}$:

$$R(X'X)^{-1}R' = \big(X_2'M_1X_2\big)^{-1}, \qquad M_1 = I - X_1(X_1'X_1)^{-1}X_1'.$$

Applying this result we can now rewrite the $F$-test statistic:

$$F = \frac{(\hat\beta_2 - \beta_2^0)'\,X_2'M_1X_2\,(\hat\beta_2 - \beta_2^0)/r}{e'e/(n-k)}.$$
The intuition for the reformulated $F$-test statistic comes from the following paragraph, so be a little patient with the interpretation.

Here we have the restricted linear regression model in mind. That means we want to solve the optimisation problem

$$\min_\beta\ (y - X\beta)'(y - X\beta) \quad \text{subject to } R\beta = q,$$

with Lagrangian

$$\mathcal{L}(\beta, \mu) = (y - X\beta)'(y - X\beta) + 2\mu'(R\beta - q).$$

If we call the optimal parameter value $\hat\beta_r$, solving the first order conditions results in

$$\hat\beta_r = \hat\beta - (X'X)^{-1}R'\big[R(X'X)^{-1}R'\big]^{-1}(R\hat\beta - q). \qquad (8)$$
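Equation (8) can be checked numerically; a sketch with simulated data and one illustrative restriction, $\beta_2 = \beta_3$ (all names are our own):

```python
import numpy as np

rng = np.random.default_rng(6)
n, k = 80, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.7, 0.7]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
e = y - X @ beta_hat

# One restriction: beta_2 - beta_3 = 0, written as R beta = q.
R = np.array([[0.0, 1.0, -1.0]])
q = np.array([0.0])

# Restricted estimator, equation (8):
# beta_r = beta_hat - (X'X)^{-1} R' [R (X'X)^{-1} R']^{-1} (R beta_hat - q)
middle = np.linalg.inv(R @ XtX_inv @ R.T)
beta_r = beta_hat - XtX_inv @ R.T @ middle @ (R @ beta_hat - q)
e_r = y - X @ beta_r
```

The restricted estimator satisfies the restriction exactly, and its residual sum of squares can never be below the unrestricted one.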
The restricted estimator $\hat\beta_r$ is unbiased if the restrictions are valid, and its variance is

$$\operatorname{Var}(\hat\beta_r) = \operatorname{Var}(\hat\beta) - \sigma^2 (X'X)^{-1}R'\big[R(X'X)^{-1}R'\big]^{-1}R(X'X)^{-1}.$$

The variance of the restricted estimator equals the variance of the unrestricted OLS estimator minus a positive semidefinite matrix. This implies that

$$\operatorname{Var}(\hat\beta_r) \le \operatorname{Var}(\hat\beta).$$

There occurs a reduction in the variance if we move from the unrestricted model to the restricted model. The intuition is that the restrictions contain additional information on the model, and consequently the precision of the estimation increases. This leads us to a new idea for a test. If the restrictions are valid in the general unrestricted model, the reduction in variance should not be very large. On the other hand, if the restrictions are not valid in general, they contain substantial additional information and therefore we should observe a large reduction in the variance of the restricted estimate.
So we construct a test based on the loss of fit. Denote by $e_r$ the residuals from the restricted linear regression model,

$$e_r = y - X\hat\beta_r.$$

Note: according to the notation from above, $e_r = e + X(\hat\beta - \hat\beta_r)$. The unrestricted model given by $y = X\beta + \varepsilon$ has the residuals $e$.

Transposing this equation and multiplying out, using $X'e = 0$ and equation (8),

$$e_r'e_r = e'e + (R\hat\beta - q)'\big[R(X'X)^{-1}R'\big]^{-1}(R\hat\beta - q).$$

Finally we end up with the $F$-test statistic

$$F = \frac{(e_r'e_r - e'e)/r}{e'e/(n-k)} \sim F_{r,\,n-k} \ \text{ under } H_0.$$
6 Some tests for specification error

Given the assumptions for the multiple linear regression model in section 3, we derived estimators and showed that they have desirable properties (linearity, unbiasedness and minimal variance). Further, we employed an array of inference procedures. However, there is a crucial question: how do we know if the assumptions underlying our estimation framework are valid given the data set? If the assumptions are wrong, there is a specification error in the model.

In the classical linear regression model the assumptions apply to all observations in the whole sample. In this section we want to test the hypothesis that some or all regression coefficients are different in subsets of the sample. Applications for these tests occur in different contexts, mainly depending on the data type.
We explain the derivation of the Chow test on the basis of examples using the Longley data. The dependent variable in the regression models will be employment, either total or in one of two given sectors. The independent variables will be a constant, a time trend, GNP, the GNP deflator and the number of armed forces. This data set spans the years 1947–1962. Within this period falls the Korean war, ending in 1953. We consider a model for employment:

$$employ_t = \beta_1 + \beta_2\, year_t + \beta_3\, GNP_t + \beta_4\, GNPdefl_t + \beta_5\, armed_t + \varepsilon_t. \qquad (9)$$

Is there a difference between wartime 1947–1953 and peacetime 1954–1962? We partition the observations according to these periods:

$$y_1, X_1 \ \text{ in the years 1947–1953}, \qquad y_2, X_2 \ \text{ in the years 1954–1962}.$$
The unrestricted model is the model which allows for different parameters in both periods,

$$\begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = \begin{pmatrix} X_1 & 0 \\ 0 & X_2 \end{pmatrix}\begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix} + \varepsilon;$$

the estimator for the parameters is equivalent to the one we get from estimating two separate regressions. To test whether the parameters are equal in both periods we set up the hypothesis

$$H_0: \beta_1 = \beta_2,$$

i.e. $k$ restrictions. The computation of the test statistic in this form requires additional programming steps. We can, however, choose to estimate the restricted model and simplify the computations. The restricted model is the model which pools all observations. As no differences between the time periods are assumed, OLS can be estimated for the complete sample:

$$\begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}\beta + \varepsilon.$$
We formulate the test statistic by comparison of the residuals. This test is also known as the Chow breakpoint test:

$$F = \frac{(e_r'e_r - e'e)/k}{e'e/(n - 2k)} \sim F_{k,\,n-2k} \ \text{ under } H_0,$$

where $e'e = e_1'e_1 + e_2'e_2$ is the sum of squared residuals from the two subperiod regressions and $e_r'e_r$ that of the pooled regression.
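A sketch of the Chow breakpoint test on simulated data (not the Longley data; names are our own):

```python
import numpy as np

rng = np.random.default_rng(7)
n1, n2, k = 40, 40, 2
x = rng.normal(size=n1 + n2)
X = np.column_stack([np.ones(n1 + n2), x])
# Structural break: the slope changes between the two subperiods.
beta1, beta2 = np.array([1.0, 1.0]), np.array([1.0, 2.0])
y = np.concatenate([X[:n1] @ beta1, X[n1:] @ beta2]) + rng.normal(size=n1 + n2)

def ssr(X, y):
    """Residual sum of squares from an OLS fit."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ b
    return e @ e

# Pooled (restricted) fit vs. separate (unrestricted) fits per subperiod.
ssr_pooled = ssr(X, y)
ssr_sep = ssr(X[:n1], y[:n1]) + ssr(X[n1:], y[n1:])

# Chow statistic: F = ((SSR_r - SSR_u)/k) / (SSR_u/(n - 2k)).
F = ((ssr_pooled - ssr_sep) / k) / (ssr_sep / (n1 + n2 - 2 * k))
```

With a genuine slope break, the pooled residual sum of squares is clearly larger than the sum from the separate fits, producing a large statistic.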
In the next example we keep the model from equation (9) above, but we consider differences in employment between the agricultural and the nonagricultural sectors. Employment levels in both sectors are of different magnitude. Therefore we allow for different intercepts and test whether the independent variables affect employment in both sectors differently; this means we test whether the slope coefficients alone are different. Now $y_1$ and $y_2$ correspond to employment in each sector and the matrices of independent variables are equal, $X_1 = X_2$.

We can formulate the restricted model as follows:

$$\begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = \begin{pmatrix} \iota & 0 & Z \\ 0 & \iota & Z \end{pmatrix}\begin{pmatrix} \alpha_1 \\ \alpha_2 \\ \gamma \end{pmatrix} + \varepsilon.$$

The first two columns are dummy variables indicating the sector in which the observation falls; $Z$ includes all columns of $X$ except the constant.
The estimated constants are:

                 separate regression    restricted model
  Agricultural        201.8                  626.2
  Non Agric.         1086.7                  662.4

In the restricted model, which imposes common slope coefficients, the two sector intercepts (626.2 and 662.4) move towards each other relative to the constants from the separate sector regressions (201.8 and 1086.7).
Dummy variables

We will introduce dummy variables on the basis of examples of wage equations. Individual wages in a cross-sectional sample are explained by the degree of education and further personal individual characteristics.

So far in our examples we mainly worked with quantitative variables like wage, GNP, education, etc. Qualitative variables are, for example, gender, race, sector, region, etc. How can we include qualitative variables as independent variables in the regression model?

Binary variables We define a binary variable (dummy variable), for example

$$female_i = \begin{cases} 1 & \text{if person } i \text{ is a woman,} \\ 0 & \text{else.} \end{cases}$$

We can use this variable to estimate differences in the mean wage for men and women in the model

$$wage_i = \beta_1 + \delta_1\, female_i + \beta_2\, educ_i + \varepsilon_i,$$

where $\delta_1$ measures the wage differential between women and men, holding years of education fixed.
Multiple categories In the same way we can define dummy variables for multiple categories, like married woman, married man, unmarried woman, unmarried man. In the model we have to exclude one reference category, e.g. single men:

$$wage_i = \beta_1 + \delta_1\, marrwoman_i + \delta_2\, marrman_i + \delta_3\, singwoman_i + \beta_2\, educ_i + \varepsilon_i.$$

The parameters $\delta_1$, $\delta_2$, $\delta_3$ give wage differentials of the other groups relative to single men, again holding years of education fixed.
Ordinal variables Suppose we only know the individual's highest educational degree instead of the years of education. There is information on: primary school, high school, college education, etc. It is possible to construct a variable of the form

$$educlevel_i = \begin{cases} 1 & \text{primary school,} \\ 2 & \text{high school,} \\ 3 & \text{college,} \\ \ \vdots \end{cases}$$

and include it in the model. But it is preferable to form a set of dummy variables $D_1, D_2, D_3, \dots$, because the differences between the educational categories may not be linear.
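Constructing such category dummies is mechanical; a sketch with hypothetical data (reference category: primary school):

```python
import numpy as np

# Highest degree per individual, as an ordinal code
# (1 = primary school, 2 = high school, 3 = college).
educ = np.array([1, 3, 2, 2, 3, 1, 2])

# Instead of entering the ordinal code directly, form one dummy per category
# and drop the reference category (primary school) to avoid collinearity
# with the constant.
d_high = (educ == 2).astype(float)
d_college = (educ == 3).astype(float)
X = np.column_stack([np.ones(len(educ)), d_high, d_college])
```

Each coefficient on a dummy then measures the wage differential of that category relative to primary school, without forcing the category effects to be equally spaced.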
Interaction of variables We already had an example for the interaction of dummy variables: interacting the categories woman/man and married/unmarried.

It is also possible to interact dummies and quantitative variables. Consider the model

$$wage_i = \beta_1 + \delta_1\, female_i + \beta_2\, educ_i + \delta_2\, (female_i \times educ_i) + \varepsilon_i.$$

With this regression model we can examine the question whether there are differences in the return to education for men and women. We allow for gender specific slopes of the wage profiles as well as for different intercepts.

The hypothesis that no differences in returns to education between the sexes exist is given by

$$H_0: \delta_2 = 0,$$

and the hypothesis that there are no wage differences between women and men is the following:

$$H_0: \delta_1 = \delta_2 = 0.$$
6.2 Prediction

After estimation of the model parameters, suppose we want to predict the value $y_0$ for some specific vector of regressors $x_0$. In section 3.3 we have already shown that

$$\hat y_0 = x_0'\hat\beta$$

is the best linear unbiased predictor for $y_0$. Now we can construct a confidence interval for the expected value $E(y_0) = x_0'\beta$. Since

$$\operatorname{Var}(\hat y_0) = \sigma^2\, x_0'(X'X)^{-1}x_0,$$

we get

$$\frac{\hat y_0 - x_0'\beta}{\sqrt{s^2\, x_0'(X'X)^{-1}x_0}} \sim t_{n-k},$$

and the confidence interval is constructed as before. For the prediction error $y_0 - \hat y_0$ the relevant variance includes the error term of the new observation,

$$\operatorname{Var}(y_0 - \hat y_0) = \sigma^2\big(1 + x_0'(X'X)^{-1}x_0\big),$$

so a prediction interval uses the standard error $\sqrt{s^2\big(1 + x_0'(X'X)^{-1}x_0\big)}$.
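The two standard errors (for the conditional mean and for the prediction error) can be sketched as follows (simulated data; the critical value 1.98 is an illustrative t percentile, not computed from the data):

```python
import numpy as np

rng = np.random.default_rng(9)
n, k = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([2.0, 1.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
e = y - X @ beta_hat
s2 = e @ e / (n - k)

# Point prediction at a new regressor vector x0.
x0 = np.array([1.0, 0.5])
y0_hat = x0 @ beta_hat

# Standard errors for the conditional mean and for the prediction error.
se_mean = np.sqrt(s2 * (x0 @ XtX_inv @ x0))
se_pred = np.sqrt(s2 * (1 + x0 @ XtX_inv @ x0))

# Prediction interval using an illustrative critical value c.
c = 1.98
pred_interval = (y0_hat - c * se_pred, y0_hat + c * se_pred)
```

The prediction interval is always wider than the interval for the conditional mean, because it adds the variance of the new observation's error term.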
First construct the augmented regression model

$$y = X\beta + D\gamma + \varepsilon,$$

where each variable in the second part, $D$, is a dummy variable which takes the value 1 for one observation and 0 for all other observations.

From OLS estimation of the augmented regression model we get the following results. The regression of $y$ on $(X, D)$ produces the coefficient vector $(\hat\beta', \hat\gamma')'$, where $\hat\beta$ are the OLS coefficients from the original model (estimated without the dummied observations) and $\hat\gamma$ is the vector of prediction errors for these observations. The residuals from the augmented regression are zero for the dummied observations.

The estimated covariance matrix for $(\hat\beta', \hat\gamma')'$ contains $\widehat{\operatorname{Var}}(\hat\beta)$ in the upper left and $\widehat{\operatorname{Var}}(\hat\gamma)$ in the lower right blocks.

Note: a dummy variable that takes the value 1 only for one observation has the effect of deleting this observation from the least squares computations.
This test is an alternative to the Chow breakpoint test if an insufficient number of observations is available in one of the subsamples. The concept of the test is based on an evaluation of the predictive power of the model.

Out of sample forecasts provide an easy check of the model fit. To see how well the estimated model predicts, we might proceed in the following way. First we estimate the OLS coefficients with the first $n_1$ observations and get the parameter estimate $\hat\beta_{(1)}$; then we compare the forecasts based on $\hat\beta_{(1)}$ with the remaining $n_2$ observations.
Under the null hypothesis of no structural change, the restricted model is the regression model which pools all observations, with residual sum of squares $e_r'e_r$. Now we make use of the method for computing forecasts as described above. As a result we get the unrestricted model: the first $n_1$ observations together with one dummy variable for each of the $n_2$ forecast observations, whose residual sum of squares equals $e_1'e_1$, the residual sum of squares from the regression on the first $n_1$ observations alone. The test statistic is

$$F = \frac{(e_r'e_r - e_1'e_1)/n_2}{e_1'e_1/(n_1 - k)} \sim F_{n_2,\,n_1-k} \ \text{ under } H_0.$$
The next group of tests is based on a similar intuition: how well can the model predict outside the range of observations used to estimate it? The primary aim of these tests is application to time series data, and they are more general than the Chow tests in the sense that they do not require a prior specification of when the structural break takes place. The disadvantage of the CUSUM and CUSUMQ tests is, however, that they have limited power compared to the Chow test.

First we introduce the concept of recursive residuals.
Recursive residuals

Suppose the sample contains $T$ observations. (We use $T$ instead of $n$ to indicate that we are in a time-series setting.) Then the $t$-th recursive residual $e_t$ is defined as the one step ahead prediction error: the prediction error for $y_t$ from the model estimated with only the first $t-1$ observations,

$$e_t = y_t - x_t'\hat\beta_{t-1}, \qquad t = k+1, \dots, T,$$

where $x_t'$ corresponds to the $t$-th row in $X$ and $\hat\beta_{t-1}$ is the parameter estimate from the model with $t-1$ observations. The variance of the $t$-th recursive residual is given by

$$\operatorname{Var}(e_t) = \sigma^2\big(1 + x_t'(X_{t-1}'X_{t-1})^{-1}x_t\big),$$

so we define the scaled recursive residuals

$$w_t = \frac{e_t}{\sqrt{1 + x_t'(X_{t-1}'X_{t-1})^{-1}x_t}}, \qquad t = k+1, \dots, T.$$

Thus, under the assumptions A1–A6 and under the null hypothesis that the parameters are constant during the full sample period, $w_t \sim N(0, \sigma^2)$. It can also be shown that the scaled recursive residuals are pairwise uncorrelated.

The tests are based on the hypothesis that the distribution of $w_t$ does not change over time.
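A sketch of the scaled recursive residuals and the resulting CUSUM path (simulated, stable-parameter data; all names are our own):

```python
import numpy as np

rng = np.random.default_rng(8)
T, k = 60, 2
X = np.column_stack([np.ones(T), rng.normal(size=T)])
y = X @ np.array([1.0, 0.5]) + rng.normal(size=T)

# Scaled recursive residuals w_t: one-step-ahead prediction errors,
# scaled so that Var(w_t) = sigma^2 under parameter constancy.
w = []
for t in range(k, T):                       # 0-based: predict observation t
    X_t, y_t = X[:t], y[:t]                 # data up to t-1
    b_t = np.linalg.lstsq(X_t, y_t, rcond=None)[0]
    x_next = X[t]
    pred_err = y[t] - x_next @ b_t
    scale = np.sqrt(1 + x_next @ np.linalg.inv(X_t.T @ X_t) @ x_next)
    w.append(pred_err / scale)
w = np.array(w)

# CUSUM path: cumulative sums of the standardised w_t, to be plotted
# against time together with the confidence bounds.
sigma_hat = w.std(ddof=1)
W = np.cumsum(w / sigma_hat)
```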
1. CUSUM

The CUSUM test is based on the cumulative sum of the scaled recursive residuals,

$$W_t = \sum_{j=k+1}^{t} \frac{w_j}{\hat\sigma}, \qquad t = k+1, \dots, T,$$

with

$$\hat\sigma^2 = \frac{1}{T-k-1}\sum_{j=k+1}^{T}(w_j - \bar w)^2, \qquad \bar w = \frac{1}{T-k}\sum_{j=k+1}^{T} w_j.$$

The test is performed by plotting $W_t$ against time. Confidence bounds are obtained by two lines connecting the points $\big(k, \pm a\sqrt{T-k}\big)$ and $\big(T, \pm 3a\sqrt{T-k}\big)$; the parameter $a$ corresponds to the significance level, $a = 0.948$ for the 5% level.
2. CUSUMQ

The CUSUMQ test is based on the cumulative sum of squares. It uses the test statistic

$$S_t = \frac{\sum_{j=k+1}^{t} w_j^2}{\sum_{j=k+1}^{T} w_j^2}, \qquad t = k+1, \dots, T.$$

Since the scaled recursive residuals are independent, the numerator and denominator of $S_t$ are approximately $\chi^2$ distributed, and therefore

$$E(S_t) \approx \frac{t-k}{T-k}.$$

Again the test statistic is plotted against time, together with confidence bounds for $S_t$.
7 Asymptotic theory

Consider the estimation problem where we would like to estimate a parameter vector $\theta$ from a sample $y_1, \dots, y_n$. Let $\hat\theta$ be an estimator for $\theta$, i.e. let $\hat\theta = h(y_1, \dots, y_n)$ be a function of the sample. In the linear regression model $\hat\beta$ is a linear function of $y$, and we can easily express the expected value and the variance covariance matrix of $\hat\beta$ in terms of the first and second moments of $y$, provided that they exist. In particular, we saw that if the sample is normally distributed, so is $\hat\beta$. Frequently, however, the estimator of interest will be a nonlinear function of the sample, and the calculation of the exact expressions becomes very complex, or it is inappropriate to make specific assumptions about the distribution of the error. In view of these difficulties in obtaining exact expressions for the characteristics of the estimators and their moments, we will often have to resort to approximations. Asymptotic theory is one way of obtaining such approximations, by essentially asking what happens to the exact expressions as the sample size tends to infinity. For example, if we are interested in the expected value of $\hat\theta$ and an exact expression is unavailable, we could ask if the expected value of $\hat\theta$ converges to $\theta$ in an appropriate sense.

In this section we give a short introduction to asymptotic theory and then apply it to the linear regression model. That means we drop assumption A6 about the normality of the error term and see what asymptotic theory can tell us about the distribution of the estimators $\hat\beta$ and $s^2$. Using convergence theorems like Laws of Large Numbers and Central Limit Theorems, we will prove consistency and asymptotic normality of OLS estimators.
First we define and discuss various modes of convergence for sequences of random
variables taking their values in $\mathbb{R}$. The definitions and results can be extended to $K$-dimensional random vectors.
Definition (Convergence in Probability) The sequence of random variables $\{X_N\}$ is said to converge to the random variable $X$ in probability if for every $\varepsilon > 0$
$$\lim_{N\to\infty} P(|X_N - X| > \varepsilon) = 0.$$
We then write $\mathrm{plim}\, X_N = X$ or $X_N \xrightarrow{p} X$.
Definition 22 (Convergence in Mean Square) The sequence of random variables $\{X_N\}$ is said to converge to the random variable $X$ in mean square if
$$\lim_{N\to\infty} E(X_N - X)^2 = 0.$$
We then write $X_N \xrightarrow{m.s.} X$.
Theorem 24 $X_N \xrightarrow{m.s.} X$ implies $X_N \xrightarrow{p} X$.
Theorem 25 (Chebyshev) $E X_N^2 \to 0$ implies $X_N \xrightarrow{p} 0$.
Proof. For a random variable $X$ with density $f$, $E X^2 < \infty$ and any $\varepsilon > 0$,
$$P(|X| \geq \varepsilon) = \int_{|x| \geq \varepsilon} f(x)\,dx \leq \int_{|x| \geq \varepsilon} \frac{x^2}{\varepsilon^2}\, f(x)\,dx \leq \frac{E X^2}{\varepsilon^2}. \qquad (10)$$
This inequality is called Chebyshev's inequality and it implies the theorem.
Note that applying the inequality to $X - EX$ gives
$$P(|X - EX| \geq \varepsilon) \leq \frac{Var(X)}{\varepsilon^2}.$$
The converse of Theorem 25 is not generally true. For example, let
$$X_N = \begin{cases} 0 & \text{with probability } 1 - 1/N \\ N & \text{with probability } 1/N. \end{cases}$$
Then $X_N \xrightarrow{p} 0$, but $E X_N^2 = N$ does not converge to zero.
The next corollary follows immediately from Theorem 25 by utilising the decomposition $E(X_N - c)^2 = (E X_N - c)^2 + Var(X_N)$.
Corollary 26 Suppose $E X_N \to c$ and $Var(X_N) \to 0$; then $X_N \xrightarrow{p} c$.
This corollary is frequently used to show that for an estimator $\hat\theta_N$ with $E\hat\theta_N \to \theta$ (i.e. an asymptotically unbiased estimator) and with $Var(\hat\theta_N) \to 0$ we have $\hat\theta_N \xrightarrow{p} \theta$.
Theorem 27 $X_N \xrightarrow{p} X$ implies $X_N \xrightarrow{d} X$.
The converse of the theorem does not hold in general. To see this consider the following example. Let $X \sim N(0,1)$ and put $X_N = -X$. Then $X_N$ does not converge to $X$ in probability. But since each $X_N \sim N(0,1)$, evidently $X_N \xrightarrow{d} X$.
Theorem 28 Let $\{X_N\}$ be a sequence of random variables whose first and second moments exist with $E X_N \to c$ and $Var(X_N) \to 0$, and let $\{Y_N\}$ be a sequence of random variables with $\mathrm{plim}\, Y_N = d$.
Then $\mathrm{plim}\, X_N Y_N = cd$.
Theorem 29 (Slutsky) Let $\{X_N\}$ and $\{Y_N\}$ be sequences of random variables with
$$\mathrm{plim}\, X_N = c, \qquad \mathrm{plim}\, Y_N = d,$$
with $c$ and $d$ non-stochastic, and let $g$ be a function continuous in $(c, d)$. Then
$$\mathrm{plim}\, g(X_N, Y_N) = g(c, d).$$
Examples:
$$\mathrm{plim}(X_N + Y_N) = c + d, \qquad \mathrm{plim}(X_N Y_N) = cd, \qquad \mathrm{plim}(X_N / Y_N) = c/d \;\text{ for } d \neq 0.$$
Such relations do not hold for expected values unless $X_N$ and $Y_N$ are stochastically independent.
Theorem 30 (Bernstein) Let $X_N \xrightarrow{d} X$ and $\mathrm{plim}\, Y_N = 0$. Then $X_N + Y_N \xrightarrow{d} X$.
Theorem 31 Let $X_N \xrightarrow{d} X$ with $\mathrm{plim}\, Y_N = c$ and $c$ non-stochastic. Then $X_N Y_N \xrightarrow{d} cX$ and $X_N + Y_N \xrightarrow{d} X + c$.
Consider an estimator $\hat\theta_N$ with $\hat\theta_N \xrightarrow{p} \theta$. The result shows that the distribution of $\hat\theta_N$ collapses into the degenerate distribution at $\theta$, that means into
$$F(z) = \begin{cases} 0 & \text{for } z < \theta \\ 1 & \text{for } z \geq \theta. \end{cases}$$
Consequently, knowing that $\hat\theta_N \xrightarrow{p} \theta$ does not provide information about the shape of the finite sample distribution of $\hat\theta_N$. This raises the question of how we can obtain information about this distribution based on some limiting process. Consider, for example, the case where $\hat\theta_N$ is the sample mean of iid random variables with mean $\theta$ and variance $\sigma^2$. Then $\hat\theta_N \xrightarrow{p} \theta$ in the light of Corollary 26, since $E\hat\theta_N = \theta$ and $Var(\hat\theta_N) = \sigma^2/N \to 0$. Consequently, as discussed above, the distribution of $\hat\theta_N$ collapses into the degenerate distribution at $\theta$. Observe, however, that the rescaled variable $\sqrt{N}(\hat\theta_N - \theta)$ has mean zero and variance $\sigma^2$. This indicates that the distribution of $\sqrt{N}(\hat\theta_N - \theta)$ will not collapse to a degenerate distribution. Using Theorem 35 below it can be shown that $\sqrt{N}(\hat\theta_N - \theta)$ converges to a $N(0, \sigma^2)$ distributed random variable. As a result we take $N(0, \sigma^2)$ as an approximation for the finite sample distribution of $\sqrt{N}(\hat\theta_N - \theta)$, and consequently take $N(\theta, \sigma^2/N)$ as an approximation for the finite sample distribution of $\hat\theta_N$.
Laws of large numbers
Theorem 34 (Kolmogorov) Let $\{X_t\}$, $t \in \mathbb{N}$, be a sequence of iid random variables with $E|X_t| < \infty$ and $E X_t = \mu$. Then
$$\bar X_N = \frac{1}{N}\sum_{t=1}^{N} X_t \xrightarrow{p} \mu.$$
Central limit theorems
Suppose the $X_t$ are iid with mean $\mu$ and finite variance $\sigma^2$. By Kolmogorov's law of large numbers for iid random variables it then follows that $\bar X_N - \mu$ converges to zero in probability. This implies that the limiting distribution of $\bar X_N - \mu$ is degenerate at zero, and thus no insight is gained from this limiting distribution regarding the shape of the distribution of the sample mean for finite $N$. Suppose we consider the rescaled quantity
$$\sqrt{N}(\bar X_N - \mu) = \frac{1}{\sqrt{N}}\sum_{t=1}^{N}(X_t - \mu).$$
Then the variance of the rescaled expression is $\sigma^2$ for all $N$, indicating that its limiting distribution will not be degenerate. Theorems that provide results concerning the limiting distribution of expressions like that are called central limit theorems (CLT).
Theorem 35 (Lindeberg-Levy) Let $\{X_t\}$ be a sequence of iid random variables with $E X_t = \mu$ and $Var(X_t) = \sigma^2 < \infty$. Then
$$\sqrt{N}(\bar X_N - \mu) \xrightarrow{d} N(0, \sigma^2),$$
or equivalently $\sqrt{N}(\bar X_N - \mu)/\sigma \xrightarrow{d} N(0, 1)$.
Consistency of $b$
Assume that
$$\frac{1}{N}X'X \to Q$$
and $Q$ is a positive definite matrix. Writing $b = \beta + (X'X)^{-1}X'\varepsilon$ we have
$$E\, b = \beta \qquad (12)$$
and
$$Var(b) = \sigma^2(X'X)^{-1} = \frac{\sigma^2}{N}\Big(\frac{1}{N}X'X\Big)^{-1} \to 0, \qquad (13)$$
since $(\frac{1}{N}X'X)^{-1} \to Q^{-1}$. From equations (12) and (13) we get the conditions for Corollary 26, which implies that
$$\mathrm{plim}\, b = \beta.$$
Under assumptions A1 - A5 and $\frac{1}{N}X'X \to Q$ the estimator $b$ is consistent.
Consistency of $s^2$
Write
$$s^2 = \frac{1}{N-K}\, e'e = \frac{N}{N-K}\Big[\frac{1}{N}\varepsilon'\varepsilon - \frac{1}{N}\varepsilon'X\Big(\frac{1}{N}X'X\Big)^{-1}\frac{1}{N}X'\varepsilon\Big].$$
The leading constant converges to 1. The second term in brackets converges to zero. That leaves
$$\frac{1}{N}\varepsilon'\varepsilon = \frac{1}{N}\sum_{t=1}^{N}\varepsilon_t^2.$$
Assuming that the errors are independent this is the mean of a random sample, and we can apply the law of large numbers (Theorem 34) and get
$$\mathrm{plim}\, s^2 = \sigma^2.$$
Asymptotic distribution of $b$
We show that
$$\sqrt{N}(b - \beta) \xrightarrow{d} N(0, \sigma^2 Q^{-1}).$$
We remember that
$$\sqrt{N}(b - \beta) = \Big(\frac{1}{N}X'X\Big)^{-1}\frac{1}{\sqrt{N}}X'\varepsilon.$$
By a central limit theorem $\frac{1}{\sqrt{N}}X'\varepsilon \xrightarrow{d} N(0, \sigma^2 Q)$, so that
$$\sqrt{N}(b - \beta) \xrightarrow{d} Q^{-1}\, N(0, \sigma^2 Q) = N(0, \sigma^2 Q^{-1} Q Q^{-1}),$$
and finally
$$\sqrt{N}(b - \beta) \xrightarrow{d} N(0, \sigma^2 Q^{-1}).$$
If regressors are well-behaved the asymptotic normality of the least squares estimator does not depend on the normality of the disturbances.
Asymptotic distribution of $s^2$
Under suitable moment conditions one can also show that
$$\sqrt{N}(s^2 - \sigma^2) \xrightarrow{d} N(0, m_4 - \sigma^4),$$
where $m_4 = E\varepsilon_t^4$; under normality $m_4 - \sigma^4 = 2\sigma^4$.
Similar results can be derived for asymptotic distributions of the test statistics. We have shown that
$$\sqrt{N}(b - \beta) \xrightarrow{d} N(0, \sigma^2 Q^{-1}).$$
This implies
$$\frac{b_k - \beta_k}{\sqrt{s^2\,[(X'X)^{-1}]_{kk}}} \xrightarrow{d} N(0, 1),$$
and for $J$ linear restrictions $R\beta = r$ the $F$ statistic satisfies
$$J \cdot F = \frac{(Rb - r)'\big[R(X'X)^{-1}R'\big]^{-1}(Rb - r)}{s^2} \xrightarrow{d} \chi^2_J.$$
8 The generalised linear regression model
We generalise the classical linear regression model by allowing for a general covariance matrix of the errors:
A1 $y = X\beta + \varepsilon$
A3 $E\varepsilon = 0$
A4 $Var(\varepsilon) = E\varepsilon\varepsilon' = \sigma^2\Omega$, with $\Omega$ known and positive definite
A5 $X$ is nonstochastic
Since $\Omega$ is positive definite there exists a nonsingular matrix $P$ with $P'P = \Omega^{-1}$. Premultiplying the model by $P$ gives the transformed model
$$Py = PX\beta + P\varepsilon. \qquad (14)$$
For the transformed model we note:
1. $PX$ is non-stochastic
2. $(PX)'(PX) = X'\Omega^{-1}X$ is non-singular
3. $E(P\varepsilon) = P\,E\varepsilon = 0$
4. $Var(P\varepsilon) = E(P\varepsilon\varepsilon'P') = \sigma^2 P\Omega P' = \sigma^2 I$
That means in the transformed model (14) the assumptions of the classical regression model are fulfilled. We can apply the OLS estimator on this model and get the following result.
Theorem 38 Under the assumptions A1, A2, A3, A4 and A5 the best linear unbiased estimator of $\beta$ is given by
$$\hat\beta = (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}y.$$
Proof.
'
'
# % %
# '
'
'
'
'
'
# '
'
% '
'
''
'
% '
%
% and %% % ' %
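A minimal numpy sketch of the Aitken estimator for a known $\Omega$ (here diagonal, i.e. heteroscedastic errors); the design, the variance function and all names are illustrative assumptions.

```python
# GLS (Aitken) estimator beta_hat = (X'Omega^{-1}X)^{-1} X'Omega^{-1} y
# for a known diagonal Omega; compared with plain OLS on the same data.
import numpy as np

rng = np.random.default_rng(2)
N = 200
X = np.column_stack([np.ones(N), rng.normal(size=N)])
omega_diag = np.exp(X[:, 1])                  # assumed known error variances
beta = np.array([1.0, -0.5])
y = X @ beta + rng.normal(size=N) * np.sqrt(omega_diag)

Omega_inv = np.diag(1.0 / omega_diag)
beta_gls = np.linalg.solve(X.T @ Omega_inv @ X, X.T @ Omega_inv @ y)
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)  # unbiased but less efficient
```

Both estimators are unbiased here; the GLS one has the smaller variance, as Theorem 38 asserts.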
Properties of the OLS estimator $b$ in the generalised model:
1. OLS is unbiased:
$$E\, b = \beta + (X'X)^{-1}X'E\varepsilon = \beta.$$
2. In general $b$ is less efficient than $\hat\beta$. Its variance is
$$Var(b) = E(b - \beta)(b - \beta)' = (X'X)^{-1}X'(\sigma^2\Omega)X(X'X)^{-1},$$
and by the Gauss-Markov theorem applied to the transformed model
$$Var(b) \geq Var(\hat\beta) = \sigma^2(X'\Omega^{-1}X)^{-1}.$$
3. The usual estimator $s^2$ is biased: with $e = y - Xb = M\varepsilon$ and $M = I - X(X'X)^{-1}X'$,
$$E\, s^2 = \frac{1}{N-K}\,E(\varepsilon'M\varepsilon) = \frac{\sigma^2}{N-K}\,\mathrm{tr}(M\Omega) \neq \sigma^2$$
in general, so inference based on $s^2(X'X)^{-1}$ is invalid.
If
$$\frac{1}{N}X'\Omega^{-1}X \to Q^*$$
with $Q^*$ positive definite, the GLS estimator is consistent.
Proof. We have $\hat\beta = \beta + (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}\varepsilon$, hence
$$E\hat\beta = \beta, \qquad Var(\hat\beta) = \sigma^2(X'\Omega^{-1}X)^{-1} = \frac{\sigma^2}{N}\Big(\frac{1}{N}X'\Omega^{-1}X\Big)^{-1} \to 0.$$
Thus $E\hat\beta \to \beta$ and $Var(\hat\beta) \to 0$, and Corollary 26 implies $\mathrm{plim}\,\hat\beta = \beta$.
Remark: The assumption
$$\frac{1}{N}X'\Omega^{-1}X \to Q^* \qquad\text{with } Q^* \text{ pos. definite}$$
is an assumption on both $X$ and $\Omega$; it neither implies nor is implied by
$$\frac{1}{N}X'X \to Q \qquad\text{with } Q \text{ pos. definite}.$$
Here we present two examples, one in which OLS is not consistent in the generalised model and one in which it is consistent. In the first example the variance of the OLS estimator does not converge to zero as $N \to \infty$, so the OLS estimator is not consistent.
In the second example consider the model $y_t = \beta x_t + \varepsilon_t$ with AR(1) errors
$$\varepsilon_t = \rho\varepsilon_{t-1} + \nu_t, \qquad |\rho| < 1, \qquad \nu_t \sim \text{iid } N(0, \sigma_\nu^2).$$
Further we assume that the regressors are bounded and
$$\frac{1}{N}X'X \to Q$$
with $Q$ finite and nonsingular. Then $b = \beta + (\frac{1}{N}X'X)^{-1}\frac{1}{N}X'\varepsilon$, and we still need to see if
$$\mathrm{plim}\,\frac{1}{N}X'\varepsilon = 0.$$
We evaluate the variance of $\frac{1}{N}X'\varepsilon$ column by column:
$$Var\Big(\frac{1}{N}\sum_t x_{tk}\varepsilon_t\Big) = \frac{1}{N^2}\sum_t\sum_s x_{tk}x_{sk}\,Cov(\varepsilon_t, \varepsilon_s), \qquad Cov(\varepsilon_t, \varepsilon_s) = \frac{\sigma_\nu^2\,\rho^{|t-s|}}{1-\rho^2}.$$
Since $\sum_s |\rho|^{|t-s|} \leq \frac{2}{1-|\rho|}$ is bounded, this variance is of order $1/N$ and converges to zero. Hence $\mathrm{plim}\,\frac{1}{N}X'\varepsilon = 0$, which implies $\mathrm{plim}\, b = \beta$.
The matrix $\Omega$ in this model is given by
$$\sigma^2\Omega = \frac{\sigma_\nu^2}{1-\rho^2}\begin{pmatrix} 1 & \rho & \cdots & \rho^{N-1} \\ \rho & 1 & \cdots & \rho^{N-2} \\ \vdots & & \ddots & \vdots \\ \rho^{N-1} & \rho^{N-2} & \cdots & 1 \end{pmatrix}.$$
Let us summarise this example. We found that under the assumptions above, in the model $y_t = \beta x_t + \varepsilon_t$ with error terms $\varepsilon_t = \rho\varepsilon_{t-1} + \nu_t$, where $|\rho| < 1$ and $\nu_t \sim N(0, \sigma_\nu^2)$, the OLS estimator is consistent.
Asymptotic distribution of $\hat\beta$
If
$$\frac{1}{N}X'\Omega^{-1}X \to Q^*, \qquad Q^* \text{ positive definite},$$
then $\sqrt{N}(\hat\beta - \beta)$ converges in distribution to $N(0, \sigma^2 (Q^*)^{-1})$.
Two-Step Estimation (Feasible GLS)
So far we always assumed that $\Omega$ is given. But how do we proceed if $\Omega$ is unknown? We can use a 2-step procedure:
1. Estimate $\Omega$ by $\hat\Omega$.
2. Estimate $\beta$ with the feasible GLS estimator
$$\hat{\hat\beta} = (X'\hat\Omega^{-1}X)^{-1}X'\hat\Omega^{-1}y.$$
Example 1 AR(1) errors
$$\varepsilon_t = \rho\varepsilon_{t-1} + \nu_t$$
1. In the first step we calculate the OLS estimator $b$ and the OLS residuals $e = y - Xb$. Then we estimate $\rho$ from the regression $e_t = \rho\, e_{t-1} + \text{residual}$, i.e.
$$\hat\rho = \frac{\sum_{t=2}^{N} e_t e_{t-1}}{\sum_{t=2}^{N} e_{t-1}^2}.$$
2. In the second step we use
$$\hat\Omega = \frac{1}{1-\hat\rho^2}\begin{pmatrix} 1 & \hat\rho & \cdots & \hat\rho^{N-1} \\ \hat\rho & 1 & \cdots & \hat\rho^{N-2} \\ \vdots & & \ddots & \vdots \\ \hat\rho^{N-1} & \hat\rho^{N-2} & \cdots & 1 \end{pmatrix}$$
and estimate $\beta$ with $\hat{\hat\beta} = (X'\hat\Omega^{-1}X)^{-1}X'\hat\Omega^{-1}y$.
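The two steps of Example 1 can be sketched as follows; the simulated data and all variable names are illustrative assumptions.

```python
# Two-step (feasible GLS) sketch for AR(1) errors: estimate rho from OLS
# residuals, then apply GLS with the implied Omega.
import numpy as np

rng = np.random.default_rng(3)
N, rho = 300, 0.6
X = np.column_stack([np.ones(N), rng.normal(size=N)])
eps = np.empty(N)
eps[0] = rng.normal() / np.sqrt(1 - rho**2)     # start from stationary dist.
for t in range(1, N):
    eps[t] = rho * eps[t - 1] + rng.normal()
beta = np.array([1.0, 2.0])
y = X @ beta + eps

# Step 1: OLS, then rho from e_t = rho * e_{t-1} + residual
b_ols = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b_ols
rho_hat = (e[1:] @ e[:-1]) / (e[:-1] @ e[:-1])

# Step 2: GLS with Omega_{ts} = rho^{|t-s|} / (1 - rho^2)
idx = np.arange(N)
Omega = rho_hat ** np.abs(idx[:, None] - idx[None, :]) / (1 - rho_hat**2)
Omega_inv = np.linalg.inv(Omega)
b_fgls = np.linalg.solve(X.T @ Omega_inv @ X, X.T @ Omega_inv @ y)
```

In practice one would exploit the band structure of $\Omega^{-1}$ rather than invert the full $N \times N$ matrix; the dense inverse is used here only for clarity.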
Example 2 heteroscedastic errors
$$\sigma^2\Omega = \begin{pmatrix} \sigma_1^2 & & \\ & \ddots & \\ & & \sigma_N^2 \end{pmatrix}$$
Like in the previous example we want to find estimators for the $\sigma_t^2$ in the first step and then estimate $\beta$ with feasible GLS in the second step. But here we still have $N$ additional parameters, and further assumptions are necessary to reduce the dimension of the parameter space.
Concerning the asymptotic distribution of the feasible GLS estimator we note that under relatively general assumptions we get
$$\sqrt{N}\big(\hat{\hat\beta} - \beta\big) - \sqrt{N}\big(\hat\beta - \beta\big) \xrightarrow{p} 0.$$
The 2-step estimator is consistent and has the same asymptotic distribution as $\hat\beta$ (GLS). Asymptotically $\hat{\hat\beta}$ and $\hat\beta$ are equal.
8.3 Heteroscedasticity
We consider the regression model with $E\varepsilon = 0$ and
$$E\varepsilon\varepsilon' = \sigma^2\Omega = \begin{pmatrix} \sigma_1^2 & & \\ & \ddots & \\ & & \sigma_N^2 \end{pmatrix},$$
i.e. the errors are uncorrelated but their variances differ across observations. This model has $(N + K)$ unknown parameters and we need additional assumptions for estimation. But first we examine what happens if we estimate with OLS in this model.
Properties of OLS
1. OLS is unbiased.
2. OLS is consistent if
$$\frac{1}{N}X'\Omega X \to Z \;\text{ with } Z \text{ finite} \qquad\text{and}\qquad \frac{1}{N}X'X \to Q \;\text{ with } Q \text{ finite and non-singular}.$$
3. OLS is inefficient.
4. The conventional OLS standard errors are incorrect.
Remember
$$Var(b) = (X'X)^{-1}X'(\sigma^2\Omega)X(X'X)^{-1}.$$
The BLUE in this model is given by the GLS estimator
$$\hat\beta = (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}y,$$
and according to the Gauss-Markov Theorem the minimum variance of all unbiased linear estimators is
$$Var(\hat\beta) = \sigma^2(X'\Omega^{-1}X)^{-1}.$$
Concerning item 3 we note that the Gauss-Markov Theorem states that
$$Var(\hat\beta) \leq Var(b),$$
and this implies the inefficiency of OLS.
Further we note that if we let the number of observations go to infinity,
$$Var(b) = \frac{\sigma^2}{N}\Big(\frac{1}{N}X'X\Big)^{-1}\Big(\frac{1}{N}X'\Omega X\Big)\Big(\frac{1}{N}X'X\Big)^{-1} \to 0.$$
Thus under the conditions that $\frac{1}{N}X'\Omega X \to Z$ and $\frac{1}{N}X'X \to Q$, with $Q$ non-singular and $Q$, $Z$ finite, we have $Var(b) \to 0$ and thus $b$ is consistent, which verifies item 2.
Concerning item 4 note that in the classical linear regression model we calculate
$$s^2(X'X)^{-1}$$
as estimator for the variance of $b$ instead of
$$Var(b) = (X'X)^{-1}X'(\sigma^2\Omega)X(X'X)^{-1}.$$
Consequently the standard errors for $b$ are calculated wrongly under OLS and all inference based on OLS standard errors is incorrect.
If the sample size is very large we can proceed with OLS in spite of the inefficiency of the parameter estimates. The main problem is how to get valid statistical inference. Without additional assumptions the estimation of $\sigma^2 X'\Omega X$ seems impossible, because $\Omega$ contains $N$ unknown parameters. In this expression White's estimator replaces the unknown $\sigma_t^2$ by the squared OLS residuals $e_t^2$. The resulting estimator for the variance of $b$ is given by
$$\widehat{Var}(b) = (X'X)^{-1}\Big(\sum_{t=1}^{N} e_t^2\, x_t x_t'\Big)(X'X)^{-1}.$$
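White's estimator as written above can be sketched in numpy; the simulated design and the name `meat` for the middle matrix are our illustrative assumptions.

```python
# White heteroscedasticity-consistent covariance:
# (X'X)^{-1} (sum_t e_t^2 x_t x_t') (X'X)^{-1}
import numpy as np

rng = np.random.default_rng(4)
N = 500
X = np.column_stack([np.ones(N), rng.normal(size=N)])
eps = rng.normal(size=N) * np.abs(X[:, 1])        # Var(eps_t | x_t) = x_t^2
y = X @ np.array([0.5, 1.0]) + eps

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
e = y - X @ b

meat = (X * (e**2)[:, None]).T @ X                # sum_t e_t^2 x_t x_t'
V_white = XtX_inv @ meat @ XtX_inv
se_white = np.sqrt(np.diag(V_white))              # robust standard errors

s2 = e @ e / (N - 2)
se_ols = np.sqrt(np.diag(s2 * XtX_inv))           # conventional (invalid) SEs
```

With variance proportional to $x_t^2$, the robust standard error of the slope is markedly larger than the conventional one, illustrating why OLS inference is misleading here.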
Tests for Heteroscedasticity
$$H_0: \sigma_t^2 = \sigma^2 \;\text{ for all } t \qquad\text{vs.}\qquad H_1: \text{heteroscedasticity}$$
A common approach (White / Breusch-Pagan type tests) runs an auxiliary regression of the squared OLS residuals $e_t^2$ on a set of regressors including a constant. Under $H_0$,
$$N \cdot R^2 \xrightarrow{d} \chi^2_J,$$
where $R^2$ is from the auxiliary regression and $J$ is the number of slope regressors in the auxiliary regression.
Estimation of feasible GLS requires prior knowledge about the structural form of heteroscedasticity. Consider an example: one often observes that the variation of average expenditure increases as income increases. Therefore we model $\sigma_t^2$ as proportional to a power of a variable $z_t$,
$$\sigma_t^2 = \sigma^2 z_t^{\delta},$$
so that
$$E\varepsilon\varepsilon' = \sigma^2\Omega = \sigma^2\begin{pmatrix} z_1^{\delta} & & \\ & \ddots & \\ & & z_N^{\delta} \end{pmatrix},$$
where $z$ is a single variable, usually one of the regressors. Depending on the values of the parameter $\delta$ we get the models:
$\delta = 0$: homoscedasticity;
$\delta = 1$: variance proportional to $z_t$;
$\delta = 2$: variance proportional to $z_t^2$.
Consider the case $\delta = 2$, i.e. $\sigma_t^2 = \sigma^2 z_t^2$. The rows in the matrix $X$ correspond to the individual observations on all independent variables. Here we call $x_t'$ the $t$-th row of $X$. Accordingly $y_t$ is the observation of the dependent variable for the $t$-th individual. We rewrite the regression model observation by observation,
$$\frac{y_t}{z_t} = \Big(\frac{x_t}{z_t}\Big)'\beta + \frac{\varepsilon_t}{z_t}, \qquad Var\Big(\frac{\varepsilon_t}{z_t}\Big) = \sigma^2.$$
From this expression we see that the GLS estimator gives the OLS estimator for the weighted set of observations $(y_t/z_t,\; x_t/z_t)$.
To estimate the parameters in $\sigma_t^2 = \sigma^2 z_t^{\delta}$ directly from the data, one could proceed the following way: use the squared OLS residuals $e_t^2$ as substitutes for the $\sigma_t^2$ and estimate $\sigma^2$ and $\delta$ from the regression
$$\ln e_t^2 = \ln\sigma^2 + \delta\,\ln z_t + \text{error}.$$
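For variance proportional to $z_t^2$ the weighting argument can be checked numerically: OLS on the observations divided by $z_t$ coincides with the GLS formula. A minimal sketch with assumed names:

```python
# WLS sketch: with Var(eps_t) = sigma^2 * z_t^2, dividing each observation
# by z_t restores homoscedasticity; OLS on transformed data equals GLS.
import numpy as np

rng = np.random.default_rng(5)
N = 400
z = rng.uniform(1.0, 3.0, size=N)
X = np.column_stack([np.ones(N), z])
y = X @ np.array([2.0, 0.5]) + rng.normal(size=N) * z   # variance prop. to z^2

Xw, yw = X / z[:, None], y / z                          # weighted observations
b_wls = np.linalg.solve(Xw.T @ Xw, Xw.T @ yw)

# Same estimator via the GLS formula with Omega = diag(z^2)
Omega_inv = np.diag(1.0 / z**2)
b_gls = np.linalg.solve(X.T @ Omega_inv @ X, X.T @ Omega_inv @ y)
```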
8.4 Autocorrelation
Consider the regression model where the error term follows a first-order autoregressive (AR(1)) process,
$$\varepsilon_t = \rho\varepsilon_{t-1} + \nu_t, \qquad E\nu_t = 0, \quad Var(\nu_t) = \sigma_\nu^2, \quad \nu_t \text{ iid}.$$
By repeated substitution
$$\varepsilon_t = \rho\varepsilon_{t-1} + \nu_t = \rho^2\varepsilon_{t-2} + \rho\nu_{t-1} + \nu_t = \dots = \rho^t\varepsilon_0 + \sum_{j=0}^{t-1}\rho^j\nu_{t-j}.$$
Case 1: $|\rho| < 1$.
$$E\varepsilon_t = \rho^t E\varepsilon_0 \to 0, \qquad Var(\varepsilon_t) = \rho^{2t}\,Var(\varepsilon_0) + \sigma_\nu^2\sum_{j=0}^{t-1}\rho^{2j} \to \frac{\sigma_\nu^2}{1-\rho^2}.$$
The expected value converges to zero and the variance to some finite value. In this case we speak of a "stable solution".
Case 2: $\rho = 1$.
$$\varepsilon_t = \varepsilon_0 + \sum_{j=1}^{t}\nu_j, \qquad E\varepsilon_t = E\varepsilon_0, \qquad Var(\varepsilon_t) = Var(\varepsilon_0) + t\,\sigma_\nu^2 \to \infty.$$
The variance increases linearly over time (a "random walk").
Case 3: $|\rho| > 1$.
$$|E\varepsilon_t| = |\rho|^t\,|E\varepsilon_0| \to \infty \;\text{ (for } E\varepsilon_0 \neq 0\text{)}, \qquad Var(\varepsilon_t) \to \infty.$$
Both expected value and variance increase over time. This is the "unstable" solution.
For the stationary case $|\rho| < 1$ (with the process started in the infinite past)
$$E\varepsilon_t = 0, \qquad Var(\varepsilon_t) = \frac{\sigma_\nu^2}{1-\rho^2}, \qquad Cov(\varepsilon_t, \varepsilon_{t-s}) = \rho^s\,\frac{\sigma_\nu^2}{1-\rho^2}.$$
The mean and the variance of the process are constant and finite and the covariance of two observations $\varepsilon_t$ and $\varepsilon_s$ only depends on the difference $t - s$. Processes with these properties are called weakly "stationary".
We keep the assumptions of the classical regression model but assume that the error term follows a first order autoregressive process with $|\rho| < 1$. Then $E\varepsilon = 0$ and
$$E\varepsilon\varepsilon' = \sigma^2\Omega = \frac{\sigma_\nu^2}{1-\rho^2}\begin{pmatrix} 1 & \rho & \rho^2 & \cdots & \rho^{N-1} \\ \rho & 1 & \rho & \cdots & \rho^{N-2} \\ \vdots & & & \ddots & \vdots \\ \rho^{N-1} & \rho^{N-2} & \cdots & \rho & 1 \end{pmatrix}, \qquad \sigma^2 = \frac{\sigma_\nu^2}{1-\rho^2}.$$
Model misspecification may be a reason for autocorrelated disturbances.
Properties of OLS
1. OLS is unbiased.
2. OLS is consistent under conditions analogous to those of the previous section.
3. OLS is inefficient.
However, if $X$ is stochastic, we have to be more careful. This is also the case if the lagged dependent variable is one of the regressors. Consider again an example:
$$y_t = \beta y_{t-1} + \varepsilon_t, \qquad \varepsilon_t = \rho\varepsilon_{t-1} + \nu_t, \qquad E\nu_t = 0.$$
Estimating this model by OLS results in
$$\mathrm{plim}\, b = \beta + \frac{\mathrm{plim}\,\frac{1}{N}\sum_t y_{t-1}\varepsilon_t}{\mathrm{plim}\,\frac{1}{N}\sum_t y_{t-1}^2} \neq \beta,$$
since $y_{t-1}$ and $\varepsilon_t$ are correlated through $\varepsilon_{t-1}$.
Consider now the covariance term in more detail. Under stationarity,
$$\mathrm{plim}\,\frac{1}{N}\sum_t y_{t-1}\varepsilon_t = Cov(y_{t-1}, \varepsilon_t) = \frac{\rho\,\sigma_\nu^2}{(1-\rho^2)(1-\rho\beta)} \neq 0 \quad\text{for } \rho \neq 0,$$
so OLS is inconsistent when a lagged dependent variable is combined with autocorrelated errors.
In the model with AR(1) errors $\varepsilon_t = \rho\varepsilon_{t-1} + \nu_t$ we test
$$H_0: \rho = 0 \qquad\text{vs.}\qquad H_1: \rho \neq 0.$$
This is a hypothesis about the error term $\varepsilon$. But $\varepsilon$ is unobservable, and therefore one has to look for a test based on the OLS residuals $e$. Even if $H_0$ is true and $E\varepsilon\varepsilon' = \sigma^2 I$, the OLS residuals will display some correlation, because $Var(e) = \sigma^2 M$ is not a diagonal matrix and it is dependent on $X$.
Durbin – Watson Test
The Durbin - Watson test statistic is given by
$$d = \frac{\sum_{t=2}^{N}(e_t - e_{t-1})^2}{\sum_{t=1}^{N} e_t^2}.$$
Note that $d$ is small for positive autocorrelation and large for negative autocorrelation. The test statistic $d$ will take on an intermediate value (around 2) for no autocorrelation.
Durbin and Watson established upper and lower bounds $d_L$ and $d_U$ for the distribution of $d$. These bounds are independent of $X$ under the assumptions that
1. $X$ is non-stochastic.
2. $\varepsilon \sim N(0, \sigma^2 I)$ under $H_0$.
3. A constant is included in $X$.
The decision rule for a test against positive autocorrelation is: reject $H_0$ if $d < d_L$; do not reject if $d > d_U$; if $d_L \leq d \leq d_U$ the test result is inconclusive.
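The statistic itself is a one-liner on OLS residuals; the simulated data below are an illustrative assumption, and the critical bounds $d_L$, $d_U$ must still be taken from tables.

```python
# Durbin-Watson statistic d = sum_t (e_t - e_{t-1})^2 / sum_t e_t^2; d is
# near 2 without autocorrelation and small for positive autocorrelation.
import numpy as np

def durbin_watson(e):
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

rng = np.random.default_rng(6)
N = 500
x = rng.normal(size=N)
eps = np.empty(N)                      # positively autocorrelated errors
eps[0] = rng.normal()
for t in range(1, N):
    eps[t] = 0.8 * eps[t - 1] + rng.normal()
y = 1.0 + 0.5 * x + eps

X = np.column_stack([np.ones(N), x])
e = y - X @ np.linalg.solve(X.T @ X, X.T @ y)
d_pos = durbin_watson(e)               # well below 2, since d ~ 2*(1 - rho_hat)
```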
Expanding the test statistic shows the relation between $d$ and the estimated autocorrelation coefficient:
$$d = \frac{\sum e_t^2 + \sum e_{t-1}^2 - 2\sum e_t e_{t-1}}{\sum e_t^2} \approx 2\Big(1 - \frac{\sum e_t e_{t-1}}{\sum e_{t-1}^2}\Big) = 2(1 - \hat\rho).$$
Thus $d \approx 2$ for $\hat\rho = 0$, $d \approx 0$ for $\hat\rho = 1$ and $d \approx 4$ for $\hat\rho = -1$. The bounds $d_L$ and $d_U$ are only tabulated; analytical functional forms are not given, and for values of $d$ in the inconclusive region no decision is possible without further computation.
Consider now GLS estimation in the model with AR(1) errors,
$$\varepsilon_t = \rho\varepsilon_{t-1} + \nu_t, \qquad |\rho| < 1,$$
with
$$E\varepsilon\varepsilon' = \sigma^2\Omega = \frac{\sigma_\nu^2}{1-\rho^2}\begin{pmatrix} 1 & \rho & \cdots & \rho^{N-1} \\ \rho & 1 & \cdots & \rho^{N-2} \\ \vdots & & \ddots & \vdots \\ \rho^{N-1} & \rho^{N-2} & \cdots & 1 \end{pmatrix}.$$
The inverse is the band matrix
$$\Omega^{-1} = \begin{pmatrix} 1 & -\rho & & & \\ -\rho & 1+\rho^2 & -\rho & & \\ & \ddots & \ddots & \ddots & \\ & & -\rho & 1+\rho^2 & -\rho \\ & & & -\rho & 1 \end{pmatrix},$$
and a matrix $P$ with $P'P = \Omega^{-1}$ is
$$P = \begin{pmatrix} \sqrt{1-\rho^2} & & & \\ -\rho & 1 & & \\ & \ddots & \ddots & \\ & & -\rho & 1 \end{pmatrix}.$$
For a model with a constant and one explanatory variable as regressors the transformed model becomes
$$\sqrt{1-\rho^2}\;y_1 = \beta_1\sqrt{1-\rho^2} + \beta_2\sqrt{1-\rho^2}\;x_1 + \sqrt{1-\rho^2}\;\varepsilon_1,$$
$$y_t - \rho y_{t-1} = \beta_1(1-\rho) + \beta_2(x_t - \rho x_{t-1}) + \nu_t, \qquad t = 2, \dots, N.$$
The transformation is the same for all observations except the first.
Cochrane – Orcutt transformation:
For computational simplicity the Cochrane - Orcutt transformation omits the first observation and uses only
$$y_t - \rho y_{t-1} = \beta_1(1-\rho) + \beta_2(x_t - \rho x_{t-1}) + \nu_t, \qquad t = 2, \dots, N.$$
In practice $\rho$ is unknown: it is estimated from the OLS residuals, the transformed model is estimated, new residuals and a new $\hat\rho$ are computed, and the procedure is iterated until convergence.
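An iterated Cochrane - Orcutt procedure might be sketched as below; the stopping rule and the simulated data are our illustrative assumptions.

```python
# Iterative Cochrane-Orcutt: drop the first observation, regress the
# quasi-differenced data, and update rho from the residuals until it settles.
import numpy as np

def cochrane_orcutt(y, X, tol=1e-8, max_iter=100):
    b = np.linalg.solve(X.T @ X, X.T @ y)      # start from OLS
    rho = 0.0
    for _ in range(max_iter):
        e = y - X @ b
        rho_new = (e[1:] @ e[:-1]) / (e[:-1] @ e[:-1])
        ys = y[1:] - rho_new * y[:-1]          # transformed observations
        Xs = X[1:] - rho_new * X[:-1]          # constant column becomes (1-rho)
        b = np.linalg.solve(Xs.T @ Xs, Xs.T @ ys)
        if abs(rho_new - rho) < tol:
            break
        rho = rho_new
    return b, rho

rng = np.random.default_rng(7)
N, rho_true = 400, 0.7
x = rng.normal(size=N)
eps = np.empty(N)
eps[0] = rng.normal()
for t in range(1, N):
    eps[t] = rho_true * eps[t - 1] + rng.normal()
y = 1.0 + 2.0 * x + eps
X = np.column_stack([np.ones(N), x])
b_co, rho_hat = cochrane_orcutt(y, X)
```

Note that because the transformed constant column equals $(1-\rho)$, the returned coefficients are directly $\beta_1$ and $\beta_2$.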
Note that iterative methods always depend on the starting values, as they may converge to local extrema.
Like in the heteroscedasticity model, methods of correcting standard errors after OLS estimation exist: Newey and West (1987).
9 Limited dependent variables models
The linear regression model assumes that the dependent variable is continuous and
has been measured for all cases in the sample. Yet, many outcomes of fundamental
interest to social scientists are not continuous or are not observed for all cases. There
are special regression models that are appropriate when the dependent variable is cen-
sored, truncated, binary, ordinal, or count. Variables of that kind are often subsumed
as categorical and limited dependent variables.
In this section we will discuss models for binary and censored dependent variables.
To fix ideas, first consider the linear model
$$y = \alpha + \beta x + \delta d,$$
where $x$ is a continuous variable and $d$ is dichotomous with values 0 and 1. For simplicity we assume there is no random error. A graph of the model is given in Figure 6. In this model a change of $x$ by one unit will always result in a change of $\beta$ units in the dependent variable, regardless of the values of $x$ and $d$. And a change in $d$ from 0 to 1 will always result in a change of $y$ by $\delta$ units.
Now, consider the same graph for the nonlinear model
$$y = g(\alpha + \beta x + \delta d),$$
where $g$ is a non-linear function. In Figure 7 we see that the effect of a unit change in $x$ on $y$ depends on the level of $x$ as well as on the level of $d$. Analogously, also the effect of a change in $d$ from 0 to 1 changes for different levels of $x$.
We will return to these basic observations when it comes to parameter interpretation
in nonlinear models for limited dependent variables.
Figure 6: Linear Model
(Two parallel lines for $d = 0$ and $d = 1$ with common slope $\beta$ and vertical distance $\delta$, evaluated at the points $x_1$ and $x_2$.)
A typical example is the choice of a worker between taking a job or not, or between driving to work and taking public transport. Driving to work and taking a job are choices that correspond to $y = 1$, and taking public transport and not taking a job to $y = 0$. The model gives the probability that $y = 1$ is chosen conditional on a set of explanatory variables.
The econometric problem is to estimate the conditional probability that $y = 1$, considered as a function of the explanatory variables. The most commonly used approach, notably logit and probit models, assumes that the functional form of the dependence on the explanatory variables is known.
Figure 7: Nonlinear Model
(Two S-shaped curves for $d = 0$ and $d = 1$, so the effects of changes in $x$ and $d$ vary with the levels of $x$ and $d$, evaluated at the points $x_1$ and $x_2$.)
In the linear probability model we specify
$$E[y|x] = x'\beta.$$
Since $y$ takes only the values 0 and 1,
$$E[y|x] = 1\cdot P(y=1|x) + 0\cdot P(y=0|x) = P(y=1|x),$$
and hence
$$P(y=1|x) = x'\beta.$$
So $x'\beta$ gives the probability of $y = 1$ given $x$. Depending on $x$ this probability need not always be between zero and one. So there is a problem of nonsensical predictions in the linear probability model.
As $y$ only takes on two values, also the error term for a given $x$ only takes two values:
$$\varepsilon = \begin{cases} 1 - x'\beta & \text{if } y = 1 \\ -x'\beta & \text{if } y = 0, \end{cases}$$
with
$$Var(\varepsilon|x) = x'\beta(1 - x'\beta).$$
This means there is a further problem with heteroscedastic errors in the linear probability model.
Latent variable model
Think of an unobserved (latent) variable $y^*$, e.g. the utility of joining the labour force, that depends linearly on the explanatory variables:
$$y^* = x'\beta + \varepsilon.$$
The individual joins the labour force only if its utility is above a certain threshold $\tau$. Hence we only observe a discrete outcome $y$, which is linked to $y^*$ by
$$y = \begin{cases} 1 & \text{if } y^* > \tau \\ 0 & \text{if } y^* \leq \tau. \end{cases}$$
Since $y^*$ is continuous, we avoid the problems encountered in the linear probability model. However, since the dependent variable $y^*$ is unobserved, the model cannot be estimated by OLS. Instead we use Maximum Likelihood estimation, which requires assumptions for the distribution of the errors.
First we assume $E\varepsilon = 0$, like in the linear probability model. Since $y^*$ is unobserved we cannot estimate the variance of the error as in the linear model. In the probit model we assume $Var(\varepsilon) = 1$ and in the logit model we assume $Var(\varepsilon) = \pi^2/3$.
By assuming a specific form for the distribution of $\varepsilon$ it is possible to compute the probability of $y = 1$ for a given $x$. Setting $\tau = 0$, consider
$$P(y=1|x) = P(y^* > 0\,|\,x) = P(\varepsilon > -x'\beta).$$
By the symmetry of the error distribution, this is simply the cumulative distribution function of the error evaluated at $x'\beta$. Accordingly,
$$P(y=1|x) = F(x'\beta),$$
where $F$ is the standard normal distribution function $\Phi$ for the probit model and the logistic distribution function
$$\Lambda(x'\beta) = \frac{e^{x'\beta}}{1 + e^{x'\beta}}$$
for the logit model.
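The two response probabilities can be written directly with the standard-library error function; a minimal sketch with assumed names.

```python
# P(y=1|x) = F(x'beta): normal cdf (probit) vs. logistic cdf (logit).
import numpy as np
from math import erf, sqrt

def probit_prob(xb):
    return 0.5 * (1.0 + erf(xb / sqrt(2.0)))   # Phi(x'beta)

def logit_prob(xb):
    return np.exp(xb) / (1.0 + np.exp(xb))     # exp(x'beta)/(1+exp(x'beta))

p_probit = probit_prob(0.0)                    # both equal 0.5 at x'beta = 0
p_logit = logit_prob(0.0)
```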
Model identification assumptions
In specifying the logit and probit model we made the following identifying assumptions:
$$E\varepsilon = 0, \qquad Var(\varepsilon) = 1 \;\text{(probit)} \quad\text{or}\quad Var(\varepsilon) = \pi^2/3 \;\text{(logit)}.$$
These assumptions are arbitrary, in the sense that they cannot be tested, but they are necessary to identify the model. Since a latent variable is unobserved, its mean and variance cannot be estimated. To see the relationship between the variance of the dependent variable and the identification of the $\beta$'s in a regression model, consider the model
$$y = x'\beta + \varepsilon$$
and assume we rescale $y$ by $\delta$: $\tilde y = \delta y$. The variance of $\tilde y$ equals
$$Var(\tilde y) = Var(\delta y) = \delta^2\,Var(y),$$
and it follows that the rescaled model is
$$\tilde y = x'(\delta\beta) + \delta\varepsilon.$$
The magnitude of the slope parameter depends on the scale of the dependent variable. If we do not know the variance of the dependent variable, then the slope coefficients are not identified.
Differences in the variances of the error terms in the logit and probit model also affect the parameter estimates. Let
$$\text{logit:}\quad y^* = x'\beta_L + \varepsilon_L \quad\text{with } Var(\varepsilon_L) = \pi^2/3,$$
$$\text{probit:}\quad y^* = x'\beta_P + \varepsilon_P \quad\text{with } Var(\varepsilon_P) = 1.$$
As a transformation to compare coefficients from the logit and probit models we can use
$$\beta_L \approx \frac{\pi}{\sqrt{3}}\,\beta_P \approx 1.8\,\beta_P.$$
The logit and probit models can also be derived without appealing to an underlying latent variable. This is done by specifying a nonlinear model relating the $x$ to the probability of the event $y = 1$. Remember, in the linear probability model we had the problem that the predicted probabilities $P = x'\beta$ can take on values that are greater than one or less than zero. To eliminate this problem we transform $P$ into the odds $P/(1-P)$, which ranges between $0$ and $\infty$. Then take the logarithm and get an expression between $-\infty$ and $\infty$. In the logit model this equals
$$\ln\frac{P(y=1|x)}{1 - P(y=1|x)} = x'\beta, \qquad (15)$$
because
$$P(y=1|x) = \frac{e^{x'\beta}}{1 + e^{x'\beta}}.$$
The response probability is defined by equation (15). If the observations are independent, the likelihood function is given by
$$L(\beta) = \prod_{i=1}^{N} p_i^{y_i}(1 - p_i)^{1 - y_i}, \qquad p_i = P(y_i = 1|x_i) = \Lambda(x_i'\beta),$$
and the log likelihood is
$$\ln L(\beta) = \sum_{i=1}^{N}\big[y_i\ln p_i + (1 - y_i)\ln(1 - p_i)\big].$$
It has been shown that under mild conditions, the likelihood function is globally con-
cave. The estimates are consistent, asymptotically normally distributed, and asymp-
totically efficient.
Remark: These are only asymptotic properties and nothing is said about small sample
properties of the Maximum Likelihood estimators. Contrary to OLS, Maximum Like-
lihood estimation of nonlinear functions is only justified for relatively large sample
sizes (above 500 observations).
For nonlinear models algebraic solutions are rarely possible. Consequently, numerical
methods are used to maximise the likelihood function. Numerical methods start with
a guess of the values of the parameters and iterate to improve on that guess.
Assume that we are trying to estimate the vector of parameters $\theta$. We start with an initial guess $\theta_0$, called start values, and attempt to improve on it by adding a vector $\zeta_0$ of adjustments, and proceed updating:
$$\theta_1 = \theta_0 + \zeta_0$$
$$\vdots$$
$$\theta_{j+1} = \theta_j + \zeta_j.$$
Iterations continue until a convergence criterion is reached. This may be either that the gradient of $\ln L(\theta)$ is close to zero, or that parameter values do not change any more.
How is $\zeta_j$ determined? By a product of gradient and direction matrix,
$$\zeta_j = D_j\,\frac{\partial\ln L}{\partial\theta}\Big|_{\theta_j}.$$
The gradient vector indicates the direction of a change in the likelihood function for a change in the parameters. $D_j$ is the direction matrix that reflects the curvature of the likelihood function.
Method of steepest ascent
$$D_j = I, \qquad \zeta_j = \frac{\partial\ln L}{\partial\theta}\Big|_{\theta_j}.$$
Newton - Raphson
$$D_j = -\Big(\frac{\partial^2\ln L}{\partial\theta\,\partial\theta'}\Big|_{\theta_j}\Big)^{-1},$$
the inverse of the negative Hessian.
Method of scoring
$$D_j = -\Big(E\,\frac{\partial^2\ln L}{\partial\theta\,\partial\theta'}\Big|_{\theta_j}\Big)^{-1},$$
which replaces the Hessian by its expected value (the information matrix).
BHHH
$$D_j = \Big(\sum_{i=1}^{N}\frac{\partial\ln L_i}{\partial\theta}\,\frac{\partial\ln L_i}{\partial\theta'}\Big|_{\theta_j}\Big)^{-1},$$
the inverse of the outer product of the gradients of the individual likelihood contributions.
An estimate of the covariance matrix of the Maximum Likelihood estimator is obtained as
$$\widehat{Var}(\hat\theta) = -\Big(\frac{\partial^2\ln L}{\partial\theta\,\partial\theta'}\Big|_{\hat\theta}\Big)^{-1},$$
the inverse of the negative Hessian evaluated at the maximum of the likelihood function, or as
$$\widehat{Var}(\hat\theta) = \Big(\sum_{i=1}^{N}\frac{\partial\ln L_i}{\partial\theta}\,\frac{\partial\ln L_i}{\partial\theta'}\Big|_{\hat\theta}\Big)^{-1},$$
the inverse of the outer product of the gradient vector evaluated at the maximum of the likelihood function.
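As a concrete instance of these iterative schemes, here is a minimal Newton-Raphson iteration for the logit log likelihood, whose gradient and Hessian have the closed forms $X'(y - p)$ and $-X'WX$ with $W = \mathrm{diag}(p(1-p))$; the simulated data and all names are illustrative assumptions.

```python
# Newton-Raphson for the logit MLE: theta_{j+1} = theta_j + (-H)^{-1} grad.
import numpy as np

rng = np.random.default_rng(8)
N = 1000
X = np.column_stack([np.ones(N), rng.normal(size=N)])
beta_true = np.array([0.5, -1.0])
p_true = 1.0 / (1.0 + np.exp(-X @ beta_true))
y = (rng.uniform(size=N) < p_true).astype(float)

theta = np.zeros(2)                            # start values
for _ in range(50):
    p = 1.0 / (1.0 + np.exp(-X @ theta))
    grad = X.T @ (y - p)                       # gradient of log likelihood
    H = -(X * (p * (1 - p))[:, None]).T @ X    # Hessian (negative definite)
    theta = theta - np.linalg.solve(H, grad)   # equals theta + (-H)^{-1} grad
    if np.max(np.abs(grad)) < 1e-8:            # gradient close to zero: stop
        break
```

The negative inverse Hessian at the final `theta` would serve directly as the covariance estimate described above.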
Parameter interpretation
Since binary regression models are nonlinear, no single approach to parameter in-
terpretation can fully describe the relationship between a variable and the outcome
probability. Here we discuss several methods to interpret parameters. For a given
application, you may need to try each method before a final approach is determined.
To examine the effect of a single variable on the predicted probabilities we could allow one variable to vary from its minimum to its maximum, while all other variables are fixed at their means. Let $P(y=1|\bar x, x_k)$ be the probability computed when all variables except $x_k$ are set equal to their means, and $x_k$ equals some specified value. Then the predicted change in the probability as $x_k$ changes from its minimum to its maximum value is given by
$$P(y=1|\bar x, \max x_k) - P(y=1|\bar x, \min x_k).$$
One can also use plots of predicted probabilities to examine the effect of one or two variables while the other variables are held constant. The effect of a discrete independent variable on the probability can be illustrated by tabulating the predicted probabilities at selected values.
Partial change or marginal effect: In the structural latent variable model, adding a new variable changes the implicit variance normalisation, so the magnitudes of all $\beta_k$ will change even if the new variable is uncorrelated with the original variables. This makes it misleading to compare coefficients from different specifications.
The $\beta_k$ can be used to compute the partial change in the probability of an event:
$$\frac{\partial P(y=1|x)}{\partial x_k} = f(x'\beta)\,\beta_k,$$
where $f$ is the density corresponding to $F$: $\phi$ for the probit model and $\lambda = \Lambda(1-\Lambda)$ for the logit model.
The ratio of marginal effects for $x_k$ and $x_l$ equals $\beta_k/\beta_l$ and does not depend on $x$.
Since the value of the marginal effect depends on the levels of all variables, we must decide on which values of the variables to use when computing the marginal effect. One method is to use the average over all observations,
$$\frac{1}{N}\sum_{i=1}^{N} f(x_i'\beta)\,\beta_k.$$
Taking the mean value of an independent variable does of course make no sense for 0-1 dummy variables. In that case it is better to fix the dummy variable at either value or go for discrete changes.
Discrete change: the change in the predicted probability for a change in $x_k$ of some amount $\delta$,
$$\frac{\Delta P(y=1|x)}{\Delta x_k} = P(y=1|x, x_k + \delta) - P(y=1|x, x_k).$$
In nonlinear models the discrete change is unequal to the marginal change, except in the limit as $\delta$ becomes infinitely small. The practical problem is again choosing which values of the variables to consider and how much to let them change. Some options are:
unit change in $x_k$, if $x_k$ increases from $\bar x_k$ to $\bar x_k + 1$:
$$P(y=1|\bar x, \bar x_k + 1) - P(y=1|\bar x, \bar x_k);$$
standard deviation change, where $s_k$ is the standard deviation of $x_k$:
$$P(y=1|\bar x, \bar x_k + s_k) - P(y=1|\bar x, \bar x_k);$$
or, for dummy variables, a change from 0 to 1:
$$P(y=1|\bar x, x_k = 1) - P(y=1|\bar x, x_k = 0).$$
Odds ratios: This method takes advantage of the tractable form of the logit model. Define the odds of an event as
$$\Omega(x) = \frac{P(y=1|x)}{P(y=0|x)} = e^{x'\beta}.$$
The log odds $\ln\Omega(x) = x'\beta$ is linear in $x$. The interpretation of the effect of a change in $x_k$ is straightforward: for a unit change in $x_k$, we expect the log odds to change by $\beta_k$, holding all other variables constant.
To compare the odds before and after adding $\delta$ to $x_k$, we take the odds ratio
$$\frac{\Omega(x, x_k + \delta)}{\Omega(x, x_k)} = e^{\beta_k\delta},$$
so a unit change in $x_k$ multiplies the odds by $e^{\beta_k}$.
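The odds-ratio property is easy to verify numerically; the coefficients and values below are arbitrary illustrative assumptions.

```python
# Numerical check: adding delta to x_k multiplies the logit odds
# P/(1-P) = exp(x'beta) by exp(beta_k * delta).
import numpy as np

beta = np.array([0.2, 0.8])            # illustrative coefficients
x = np.array([1.0, 1.5])               # constant and one regressor

def odds(xvec):
    return np.exp(xvec @ beta)         # odds in the logit model

delta = 0.7
x_shift = x + np.array([0.0, delta])   # add delta to the second regressor
ratio = odds(x_shift) / odds(x)        # equals exp(beta_k * delta)
```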
Hypothesis testing and measures of goodness of fit
Using the result about asymptotic normality of the Maximum Likelihood estimator allows formulating a Wald test for testing $J$ linear restrictions
$$H_0: R\theta = r$$
via
$$W = (R\hat\theta - r)'\big[R\,\widehat{Var}(\hat\theta)\,R'\big]^{-1}(R\hat\theta - r) \xrightarrow{d} \chi^2_J,$$
where $\widehat{Var}(\hat\theta)$ is the estimated covariance matrix for $\hat\theta$ which we get from the iterative maximisation procedure.
Alternatively we can formulate a Likelihood Ratio test,
$$LR = 2\big(\ln L_U - \ln L_R\big) \xrightarrow{d} \chi^2_J,$$
where $\ln L_U$ is the value of the likelihood function evaluated at the Maximum Likelihood estimates for the unconstrained model, and $\ln L_R$ is the value of the likelihood function evaluated at the Maximum Likelihood estimates for the constrained model.
To test the overall significance of the model we can use the so-called Likelihood Ratio $\chi^2$ test. It compares the unconstrained (full) model with a model where only a constant is included:
$$LR = 2\big(\ln L_{\text{full}} - \ln L_{\text{const}}\big) \xrightarrow{d} \chi^2_{K-1}.$$
Let $\hat p_i = F(x_i'\hat\beta)$ denote the predicted probability. Since $y_i$ is binary, the deviations $y_i - \hat p_i$ are heteroscedastic, with
$$Var(y_i - \hat p_i\,|\,x_i) = \hat p_i(1 - \hat p_i).$$
This suggests the Pearson residual:
$$r_i = \frac{y_i - \hat p_i}{\sqrt{\hat p_i(1 - \hat p_i)}}.$$
Large values of $r_i$ suggest a failure of the model to fit a given observation. Pearson residuals can be used to construct a summary measure of fit, known as the Pearson statistic,
$$\chi^2 = \sum_{i=1}^{N} r_i^2 = \sum_{i=1}^{N}\frac{(y_i - \hat p_i)^2}{\hat p_i(1 - \hat p_i)}.$$
Pseudo $R^2$'s: Several pseudo-$R^2$ measures for nonlinear models have been defined in analogy to the formulas for the linear regression model, e.g. as the percentage of explained variation of the latent variable. These formulas produce different values in models with categorical outcomes, and consequently are thought of as distinct measures.
McFadden's $R^2$ compares the log likelihood of the full model with that of the constant-only model,
$$R^2_{McF} = 1 - \frac{\ln L_{\text{full}}}{\ln L_{\text{const}}};$$
like the linear $R^2$ it increases as a new variable is added to the model.
Count $R^2$: define the prediction $\hat y_i = 1$ if $\hat p_i > 0.5$ and $\hat y_i = 0$ if $\hat p_i \leq 0.5$; then
$$R^2_{\text{count}} = \frac{1}{N}\,\#\{i:\ \hat y_i = y_i\},$$
the fraction of correctly predicted observations.
9.2 Censored regression models: Tobit
In the linear regression model, the values of all variables are known for the entire sam-
ple. Here we consider the situation in which the sample is limited by censoring or
truncation. Censoring occurs when we observe the independent variables for the entire
sample, but for some observations we have only limited information about the depen-
dent variable. For example, we might know that the dependent variable is less than
100, but not know how much less. Truncation limits the data more severely by exclud-
ing observations based on characteristics of the dependent variable. For example, in
a truncated sample all cases where the dependent variable is less than 100 would be
deleted. While truncation changes the sample, censoring does not.
Formally,
$$y = \begin{cases} y^* & \text{if } y^* > \tau \\ \tau & \text{if } y^* \leq \tau. \end{cases}$$
If $y^*$ were censored from below at 1, we would know $x$ for all observations, but observe $y^*$ only for $y^* > 1$; the values of $y^*$ at or below 1 are censored with $y = 1$.
Truncated and censored Normal distribution
Let $y \sim N(\mu, \sigma^2)$ with density
$$f(y) = \frac{1}{\sigma}\,\phi\Big(\frac{y - \mu}{\sigma}\Big),$$
where $\phi$ is the standard normal density.
Truncated Normal distribution: When values below $\tau$ are deleted, the variable $y\,|\,y > \tau$ has a truncated Normal distribution with density
$$f(y\,|\,y > \tau) = \frac{\frac{1}{\sigma}\phi\big(\frac{y-\mu}{\sigma}\big)}{P(y > \tau)} = \frac{\frac{1}{\sigma}\phi\big(\frac{y-\mu}{\sigma}\big)}{1 - \Phi\big(\frac{\tau-\mu}{\sigma}\big)},$$
where $\Phi$ is the standard normal distribution function. Given that the left-hand side of the distribution has been truncated, $E[y\,|\,y > \tau]$ must be larger than $E[y] = \mu$. For the Normal distribution we have
$$E[y\,|\,y > \tau] = \mu + \sigma\,\lambda(\alpha), \qquad \alpha = \frac{\tau - \mu}{\sigma}, \qquad (16)$$
where
$$\lambda(\alpha) = \frac{\phi(\alpha)}{1 - \Phi(\alpha)}$$
is called the (inverse) Mills ratio.
Censored Normal distribution: Define
$$y = \begin{cases} y^* & \text{if } y^* > \tau \\ \tau & \text{if } y^* \leq \tau, \end{cases}$$
with $y^* \sim N(\mu, \sigma^2)$. The probability of an observation being censored is
$$P(\text{censored}) = P(y^* \leq \tau) = \Phi\Big(\frac{\tau - \mu}{\sigma}\Big), \qquad (17)$$
and the probability of not being censored is
$$P(\text{uncensored}) = 1 - \Phi\Big(\frac{\tau - \mu}{\sigma}\Big) = \Phi\Big(\frac{\mu - \tau}{\sigma}\Big). \qquad (18)$$
The expected value of the censored variable is
$$E[y] = P(\text{uncensored})\cdot E[y^*\,|\,y^* > \tau] + P(\text{censored})\cdot\tau$$
$$= \Phi\Big(\frac{\mu - \tau}{\sigma}\Big)\big[\mu + \sigma\lambda(\alpha)\big] + \Phi\Big(\frac{\tau - \mu}{\sigma}\Big)\,\tau, \qquad (19)$$
with $\alpha = (\tau - \mu)/\sigma$.
Consider now the tobit model
$$y^* = x'\beta + \varepsilon,$$
where $\varepsilon \sim N(0, \sigma^2)$. Here $y^*$ is a latent variable that is observed for values greater than $\tau$ and censored for values less or equal $\tau$:
$$y = \begin{cases} y^* & \text{if } y^* > \tau \\ \tau_y & \text{if } y^* \leq \tau. \end{cases}$$
Combining the two equations results in the model
$$y = \begin{cases} x'\beta + \varepsilon & \text{if } y^* > \tau \\ \tau_y & \text{if } y^* \leq \tau. \end{cases}$$
The probability of censoring is
$$P(\text{censored}\,|\,x) = P(y^* \leq \tau\,|\,x) = P(x'\beta + \varepsilon \leq \tau), \qquad (20)$$
and since $\varepsilon \sim N(0, \sigma^2)$,
$$P(\text{censored}\,|\,x) = P\Big(\frac{\varepsilon}{\sigma} \leq \frac{\tau - x'\beta}{\sigma}\Big) = \Phi\Big(\frac{\tau - x'\beta}{\sigma}\Big), \qquad (21)$$
and
$$P(\text{uncensored}\,|\,x) = 1 - \Phi\Big(\frac{\tau - x'\beta}{\sigma}\Big) = \Phi\Big(\frac{x'\beta - \tau}{\sigma}\Big).$$
Deriving the probability of a case being censored is very similar to deriving the probability of an event in the probit model. In the tobit model, however, we know the value of $y^*$ whenever it exceeds $\tau$. Since more information is available in tobit, estimates of the parameters from tobit are more efficient than the estimates from probit. Further, since only the binary outcome is observed in probit, we have no way to estimate the variance of $y^*$ and must make assumptions about it, while $Var(\varepsilon)$ can be estimated in the tobit model.
Why not simply estimate by OLS using the uncensored observations? For those observations
$$E[y\,|\,y > \tau, x] = x'\beta + E[\varepsilon\,|\,y^* > \tau, x],$$
and about $E[\varepsilon\,|\,y^* > \tau, x]$ we know from equation (16) that
$$E[y\,|\,y > \tau, x] = x'\beta + \sigma\lambda(\delta), \qquad (22)$$
where $\lambda$ is the Mills ratio and $\delta = (\tau - x'\beta)/\sigma$. The problem introduced by truncation is that the regression model implied by equation (22) is of the form
$$y = x'\beta + \sigma\lambda(\delta) + \text{error};$$
using only $y = x'\beta + \text{error}$ we have a misspecified model that excludes the "regressor" $\lambda(\delta)$.
For the censored sample as a whole,
$$E[y|x] = P(\text{uncensored}\,|\,x)\cdot E[y\,|\,y > \tau, x] + P(\text{censored}\,|\,x)\cdot\tau_y \qquad (23)$$
$$= \Phi(-\delta)\big[x'\beta + \sigma\lambda(\delta)\big] + \Phi(\delta)\,\tau_y.$$
$E[y|x]$ is nonlinear in $x$, so that estimating the OLS regression of $y$ on $x$ results in biased and inconsistent estimates of $\beta$.
Estimation. Maximum Likelihood estimation of the tobit model involves dividing the observations into two sets. The first contains the uncensored observations, which ML treats in the same way as in the linear regression model. For the censored observations we do not know the value of $y^*$, but we know that $y^* \leq \tau$. Hence we use the probability of being censored as the likelihood contribution.
Formally, for uncensored observations the likelihood contributions are
$$L_i = \frac{1}{\sigma}\,\phi\Big(\frac{y_i - x_i'\beta}{\sigma}\Big), \qquad \ln L_i = -\frac{1}{2}\ln(2\pi\sigma^2) - \frac{(y_i - x_i'\beta)^2}{2\sigma^2},$$
and for censored observations
$$L_i = P(y_i^* \leq \tau) = \Phi\Big(\frac{\tau - x_i'\beta}{\sigma}\Big), \qquad \ln L_i = \ln\Phi\Big(\frac{\tau - x_i'\beta}{\sigma}\Big).$$
The log likelihood is thus given by
$$\ln L(\beta, \sigma) = \sum_{\text{uncensored}}\Big[-\frac{1}{2}\ln(2\pi\sigma^2) - \frac{(y_i - x_i'\beta)^2}{2\sigma^2}\Big] + \sum_{\text{censored}}\ln\Phi\Big(\frac{\tau - x_i'\beta}{\sigma}\Big).$$
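The tobit log likelihood can be coded directly from the two kinds of contributions; the data-generating step and all names below are illustrative assumptions (a real application would maximise this function over $\beta$ and $\sigma$).

```python
# Tobit log likelihood (censoring from below at tau): uncensored observations
# contribute log((1/sigma)*phi((y - x'b)/sigma)), censored ones
# log(Phi((tau - x'b)/sigma)).
import numpy as np
from math import erf, sqrt

def norm_pdf(z):
    return np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)

def norm_cdf(z):
    return 0.5 * (1.0 + np.vectorize(erf)(z / sqrt(2.0)))

def tobit_loglik(beta, sigma, y, X, tau=0.0):
    xb = X @ beta
    cens = y <= tau                                   # censored observations
    ll_unc = -np.log(sigma) + np.log(norm_pdf((y - xb) / sigma))
    ll_cen = np.log(norm_cdf((tau - xb) / sigma))
    return np.sum(np.where(cens, ll_cen, ll_unc))

rng = np.random.default_rng(9)
N = 200
X = np.column_stack([np.ones(N), rng.normal(size=N)])
y_star = X @ np.array([0.5, 1.0]) + rng.normal(size=N)
y = np.maximum(y_star, 0.0)                           # censored from below at 0
ll = tobit_loglik(np.array([0.5, 1.0]), 1.0, y, X)
```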
Parameter interpretation in the tobit model. The coefficients give the marginal effects on the latent variable,
$$\frac{\partial E[y^*|x]}{\partial x_k} = \beta_k.$$
For the truncated (uncensored) mean, using $\delta = (\tau - x'\beta)/\sigma$ and equation (16),
$$\frac{\partial E[y\,|\,y > \tau, x]}{\partial x_k} = \beta_k\Big[1 - \lambda(\delta)\big(\lambda(\delta) - \delta\big)\Big],$$
and for the change in the censored outcome, differentiating
$$E[y|x] = \Phi(-\delta)\big[x'\beta + \sigma\lambda(\delta)\big] + \Phi(\delta)\,\tau_y$$
with respect to $x_k$ gives, for $\tau = \tau_y = 0$,
$$\frac{\partial E[y|x]}{\partial x_k} = \Phi(-\delta)\,\beta_k = P(\text{uncensored}\,|\,x)\,\beta_k,$$
i.e. the marginal effect on the observed outcome equals $\beta_k$ scaled down by the probability of being uncensored.