


Lecture Notes: Econometrics I

Andrea Weber

Institute for Advanced Studies


Department of Economics and Finance
email: andrea.weber@ihs.ac.at

December, 2003

Help in text processing from Andrey Launov and Ivan Prianichnikov is gratefully acknowledged.
Thanks to Michael Grabner for helpful comments and finding lots of typos.
Contents

1 Introduction 3

2 Descriptive linear regression 5


2.1 The method of least squares . . . . . . . . . . . . . . . . . . . . . . 5
2.2 The geometry of least squares . . . . . . . . . . . . . . . . . . . . . 6
2.3 Measuring the goodness of fit . . . . . . . . . . . . . . . . . . . . . . 11

3 The classical linear regression model 15


3.1 Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.2 The statistical estimation problem . . . . . . . . . . . . . . . . . . . 17
3.3 Prediction or the out of sample forecasting . . . . . . . . . . . . . . . 24

4 Stochastic regression 25

5 Statistical inference in the classical linear regression model 27


5.1 Introduction and review . . . . . . . . . . . . . . . . . . . . . . . . . 27
5.2 Testing principles . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
5.3 Testing values of β: practical part . . . . . . . . . . . . . . . . . . . 32
5.4 Testing values of β: theoretical part . . . . . . . . . . . . . . . . . . 39
5.5 Confidence intervals . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.6 Testing linear restrictions . . . . . . . . . . . . . . . . . . . . . . . . 46

6 Some tests for specification error 56


6.1 Tests for structural change . . . . . . . . . . . . . . . . . . . . . . . 56
6.2 Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
6.3 Further specification tests . . . . . . . . . . . . . . . . . . . . . . . . 64
6.4 Tests based on recursive estimation: CUSUM and CUSUMQ test . . . 65

7 Asymptotic theory 68
7.1 Introduction to asymptotic theory . . . . . . . . . . . . . . . . . . . . 68
7.2 Asymptotic properties of OLS estimators . . . . . . . . . . . . . . . 73

8 The generalised linear regression model 77
8.1 Aitken estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
8.2 Asymptotic properties of GLS . . . . . . . . . . . . . . . . . . . . . 79
8.3 Heteroscedasticity . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
8.4 Autocorrelation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

9 Limited dependent variables models 98


9.1 Binary regression models: Logit, Probit . . . . . . . . . . . . . . . . 98
9.2 Censored regression models: Tobit . . . . . . . . . . . . . . . . . . . 113

References
Baltagi, B. H. (Ed.), 2001. A companion to theoretical econometrics. Blackwell Publishers.

Davidson, R., MacKinnon, J., 1993. Estimation and inference in econometrics. Oxford
University Press.

Greene, W. H., 1997. Econometrics, 3rd Edition. Prentice Hall.

Hayashi, F., 2000. Econometrics. Princeton University Press.

Johnston, J., DiNardo, J., 1997. Econometric methods, 4th Edition. McGraw Hill.

Schönfeld, P., 1969. Methoden der Ökonometrie I. Verlag Franz Vahlen.

Scott Long, J., 1997. Regression models for categorical and limited dependent variables. SAGE Publications.

Wooldridge, J., 2000. Introductory econometrics: a modern approach. South-Western College Publishing.

1 Introduction

The aim of the lecture is twofold. First, students should receive guidelines for applied
empirical research. Second, the lecture should also provide a sound theoretical basis
for advanced econometrics courses.

What is Econometrics?

At the beginning of the twentieth century economic theory was mainly intuitive and
empirical support for it was largely anecdotal. Now economics has a rich array of
formal models and a high quality data base. Empirical regularities motivate theory in
many areas of economics, and data are routinely used to test theory. Many economic
theories have been developed as measurement frameworks to suggest what data should
be collected and how they should be interpreted.
Econometric theory was developed to analyse and interpret economic data. Most
econometric theory adopts methods originally developed in statistics.


According to Heckman (2000, Quarterly Journal of Economics) the major achievements of
econometrics during the twentieth century were

- The definition of a causal parameter within a well-defined economic model.
- Analysis of what is required to recover causal parameters from data (the identification problem). Many theoretical models may be consistent with the same data.
- Clarification of the role of causal parameters in policy evaluation and in forecasting the effects of policies never previously experienced.

The concept of a causal parameter

By a causal effect economists mean a "ceteris paribus" change (all other things being equal).
Consider, for example, a model of production of output $y$ based on inputs $x$ that can
be varied independently. We write the function
$$y = g(x_1, x_2, \dots, x_k)$$
where $x = (x_1, \dots, x_k)$ is a vector of inputs. Assuming that each input can be
freely varied, the change in $y$ produced from the variation in $x_j$, holding all other
inputs constant, is the causal effect of $x_j$. If $g$ is differentiable in $x_j$, the marginal
causal effect of $x_j$ is
$$\frac{\partial g(x_1, \dots, x_k)}{\partial x_j}.$$
A special case occurs if $g$ is separable in $x_j$,
$$y = g_j(x_j) + g_{-j}(x_{-j}),$$
and the causal effect of $x_j$ can be defined independently of the level of the other
values of $x$.
Examples

Price effect on consumer demand

Effects of fertiliser on crop yields

Measuring returns to education

The structure of economic data

- Cross-sectional data set: a set of individuals, firms, regions, etc., observed at a given point in time. Assumption: random sampling from the underlying population.
- Time series data set: observations of a variable over time, e.g. stock prices, the consumer price index, etc. The chronological ordering of observations contains potentially important information.
- Pooled cross-sectional data set.
- Panel or longitudinal data set: a time series for each cross-sectional member in the data set, e.g. household panel surveys, OECD main economic indicators.

2 Descriptive linear regression

2.1 The method of least squares

As an extension of the linear regression model in two variables let us introduce the
multiple linear regression model.


Observations:       $(y_i, x_{i1}, \dots, x_{ik})$, $i = 1, \dots, n$
Functional form:    $y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik} + \varepsilon_i$
Fitting criterion:  $\min_\beta \sum_{i=1}^n \varepsilon_i^2 = \sum_{i=1}^n (y_i - \beta_1 x_{i1} - \dots - \beta_k x_{ik})^2$

Table 1: The multiple linear regression model, $i = 1, \dots, n$

To make notation more convenient we transform the model into matrix form. We define
the n-dimensional vectors of observations and error terms and the k-dimensional parameter vector,
$$y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, \qquad \varepsilon = \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{pmatrix}, \qquad \beta = \begin{pmatrix} \beta_1 \\ \vdots \\ \beta_k \end{pmatrix},$$
and the $n \times k$ matrix
$$X = \begin{pmatrix} x_{11} & \dots & x_{1k} \\ \vdots & & \vdots \\ x_{n1} & \dots & x_{nk} \end{pmatrix}.$$
In matrix notation the multiple linear regression model can be written in the following
way
$$y = X\beta + \varepsilon$$
and the fitting criterion is given by
$$\min_{\beta}\ \varepsilon'\varepsilon = (y - X\beta)'(y - X\beta).$$

In the literature we find several names for the variables in the model, which are listed
below. We will commonly use the names in the first row.



dependent variable independent variables error
explained variable explanatory variables disturbance
regressand regressors, covariates

Table 2: variable names

2.2 The geometry of least squares

$y, x_1, \dots, x_k$ are vectors in the n-dimensional Euclidean space $\mathbb{R}^n$. The inner product
of two points $a$ and $b$ in $\mathbb{R}^n$ is defined by
$$\langle a, b \rangle = a'b = \sum_{i=1}^{n} a_i b_i.$$
All points in $\mathbb{R}^n$ are determined by their length and direction. The Euclidean length of
a vector $a$ is
$$\|a\| = (a'a)^{1/2} = \Big(\sum_{i=1}^{n} a_i^2\Big)^{1/2}.$$
If we assume that

1. $n > k$: there are more observations than independent variables,
2. $\operatorname{rank}(X) = k$: the columns of $X$ are linearly independent,
then the columns of $X$ span a $k$-dimensional subspace of $\mathbb{R}^n$ which we call $S(X)$:
$$S(X) = \{ z \in \mathbb{R}^n : z = Xb \text{ for some } b \in \mathbb{R}^k \}.$$
The orthogonal complement of $S(X)$ is denoted by $S^{\perp}(X)$. In the Euclidean space $\mathbb{R}^n$
the regression problem translates to finding the "closest" point to $y$ in $S(X)$, which
means the point in $S(X)$ with the minimal distance to $y$.

Figure 2: The column space of $X$ in $\mathbb{R}^n$

Remember the fitting criterion for the least squares model
$$\min_{\beta}\ (y - X\beta)'(y - X\beta) = y'y - 2\beta'X'y + \beta'X'X\beta.$$
To solve the minimisation problem we calculate the first order condition
$$\frac{\partial (y - X\beta)'(y - X\beta)}{\partial \beta} = -2X'y + 2X'X\beta = 0.$$
Rules for matrix differentiation:
$$\frac{\partial a'b}{\partial b} = a, \qquad \frac{\partial b'Ab}{\partial b} = (A + A')b = 2Ab \ \text{ for symmetric } A.$$
We get the normal equations
$$X'X\beta = X'y.$$
If the inverse matrix $(X'X)^{-1}$ exists, the optimal parameter vector is given by
$$\hat{\beta} = (X'X)^{-1}X'y.$$

Remember that
- the columns of $X$ are linearly independent, which implies that $X'X$ has full rank,
- $X'X$ is then positive definite, which implies that the objective function is strictly convex and must have a unique minimum.
Therefore $\hat{\beta}$ is determined uniquely by the normal equations.


Let us define the residuals from the regression by
$$e = y - X\hat{\beta}$$
and the fitted values by
$$\hat{y} = X\hat{\beta}.$$
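To make the algebra concrete, here is a minimal Python sketch (simulated data; the variable names and dimensions are illustrative assumptions, not part of the original notes) that computes $\hat{\beta}$ from the normal equations, forms the fitted values and residuals, and checks that $X'e = 0$:

    import numpy as np

    rng = np.random.default_rng(0)
    n, k = 100, 3
    X = np.column_stack([np.ones(n), rng.standard_normal((n, k - 1))])  # includes a constant
    beta_true = np.array([1.0, 2.0, -0.5])
    y = X @ beta_true + rng.standard_normal(n)

    # normal equations X'X beta = X'y; solving is preferred to forming the inverse
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    y_fit = X @ beta_hat          # fitted values  y_hat = X beta_hat
    e = y - y_fit                 # residuals      e = y - X beta_hat

    print(beta_hat)
    print(np.allclose(X.T @ e, 0))   # normal equations: X'e = 0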
Figure 3: The projection of $y$ onto $S(X)$

The normal equations $X'(y - X\hat{\beta}) = 0$ also have a geometric interpretation: the residual vector must be orthogonal to $S(X)$.
Now we define some projection matrices
$$P = X(X'X)^{-1}X'.$$
The $n \times n$ matrix $P$ projects $y$ orthogonally onto the column space $S(X)$.
$$M = I - P = I - X(X'X)^{-1}X'$$
$M$ projects onto the orthogonal complement $S^{\perp}(X)$.

Properties of the projectors:
- $P$ and $M$ are symmetric,
- $PX = X$, $MX = 0$,
- $P + M = I$: there exists an orthogonal decomposition $y = Py + My$,
- $P$, $M$ are idempotent: $PP = P$, $MM = M$,
- $PM = MP = 0$.

With the help of the projection matrices we can decompose the vector of the dependent
variables,
$$y = Py + My,$$
where $\hat{y} = Py$ are the fitted values and $e = My$ are the residuals, and rewrite the normal equations as
$$X'e = 0.$$

Figure 4: The orthogonal decomposition of $y$

If a constant term is included in the regression model (one column of $X$ is $\iota = (1, \dots, 1)'$),
the residuals sum up to 0, because in this case $X'e = 0$ implies
$$\iota'e = \sum_{i=1}^{n} e_i = 0.$$
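The properties of the projectors can be verified numerically. A minimal Python sketch (again with arbitrary simulated data; not from the original notes):

    import numpy as np

    rng = np.random.default_rng(1)
    n, k = 50, 3
    X = np.column_stack([np.ones(n), rng.standard_normal((n, k - 1))])
    y = rng.standard_normal(n)

    P = X @ np.linalg.inv(X.T @ X) @ X.T   # projects onto S(X)
    M = np.eye(n) - P                      # projects onto the orthogonal complement

    assert np.allclose(P @ P, P) and np.allclose(M @ M, M)   # idempotent
    assert np.allclose(P @ M, 0)                             # PM = 0
    assert np.allclose(P @ X, X) and np.allclose(M @ X, 0)   # PX = X, MX = 0
    assert np.allclose(P @ y + M @ y, y)                     # orthogonal decomposition
    print(np.sum(M @ y))   # residuals sum to ~0 because X contains a constant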
2.3 Measuring the goodness of fit

The idempotency of $P$ and $M$ often makes expressions associated with least squares
regression very simple. For example the sum of squared residuals is given by
$$e'e = (My)'(My) = y'M'My = y'My.$$
Similarly, the sum of squared fitted values, which is also called the explained sum of
squares, is
$$\hat{y}'\hat{y} = (Py)'(Py) = y'Py,$$
and the total sum of squares equals the sum of squared dependent variables,
$$y'y = \sum_{i=1}^{n} y_i^2.$$
In Figure 5 we see the geometric interpretation of the vectors $y$, $\hat{y}$ and $e$. They
form the sides of a right-angled triangle.

Figure 5: The orthogonal decomposition of $y$

By applying Pythagoras' Theorem we see that
$$y'y = \hat{y}'\hat{y} + e'e. \qquad (1)$$

Thus the total sum of squares of $y$ equals the explained sum of squares plus the sum of
squared residuals. The fact that the total variation in the regressand can be divided into
two parts, one "explained" by the regressors and one not explained, suggests a natural
measure of how good the regression fits. Let us divide equation (1) by $y'y$,
$$1 = \frac{\hat{y}'\hat{y}}{y'y} + \frac{e'e}{y'y},$$
and define the uncentered $R^2$, or coefficient of determination, by
$$R_u^2 = \frac{\hat{y}'\hat{y}}{y'y} = 1 - \frac{e'e}{y'y}.$$

Properties of $R_u^2$:

1. $0 \le R_u^2 \le 1$, it is unit free.

2. Return to the triangle in Figure 5 and call $\phi$ the angle between $y$ and $\hat{y}$.
The cosine of $\phi$ is given by
$$\cos\phi = \frac{\|\hat{y}\|}{\|y\|},$$
and hence $R_u^2 = \cos^2\phi$.

3. Anything that changes $\phi$ will also change $R_u^2$, e.g. adding a constant to $y$.

A modification of $R^2$ lets us get around the problem addressed in the last point. This
version is called the centered $R^2$,
$$R^2 = 1 - \frac{e'e}{y'M_0 y},$$
where $M_0 = I - \frac{1}{n}\iota\iota'$ and $\iota = (1, \dots, 1)'$. Multiplication of $M_0$ with
a vector gives the vector of deviations from the mean,
$$M_0 y = \begin{pmatrix} y_1 - \bar{y} \\ \vdots \\ y_n - \bar{y} \end{pmatrix} \quad \text{where} \quad \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i.$$
For $R^2$ we can derive a decomposition into explained sum of squares and residual sum of squares, like in equation (1), only if a constant is included in the regression.
From the orthogonal decomposition of $y$ we get
$$M_0 y = M_0\hat{y} + M_0 e = M_0\hat{y} + e \quad (\text{since } \iota'e = 0).$$
Also we note that
$$\hat{y}'M_0 e = \hat{y}'e - \tfrac{1}{n}\hat{y}'\iota\,\iota'e = 0 \quad \text{if } \iota'e = 0,$$
and hence
$$y'M_0 y = \hat{y}'M_0\hat{y} + e'e,$$
$$1 = \frac{\text{SSE}}{\text{SST}} + \frac{\text{SSR}}{\text{SST}}, \qquad R^2 = \frac{\text{SSE}}{\text{SST}} = 1 - \frac{\text{SSR}}{\text{SST}},$$
where SST denotes the total, SSE the explained and SSR the residual sum of squares (in deviations from the mean).

Properties of $R^2$:

1. $0 \le R^2 \le 1$ only if a constant is included in $X$ ($R^2$ is difficult to interpret
unless there is a constant in $X$).

2. $R^2$ never decreases if an additional variable is added to the regression.
A measure of the goodness of fit which does not suffer from this problem is the
adjusted $R^2$, defined as
$$\bar{R}^2 = 1 - \frac{e'e/(n-k)}{y'M_0 y/(n-1)}.$$

3. The adjusted $R^2$ can be written in terms of $R^2$ as
$$\bar{R}^2 = 1 - \frac{n-1}{n-k}\,(1 - R^2).$$

4. In the triangle, similar to the one in Figure 5, with sides $M_0 y$, $M_0\hat{y}$ and $e$,
$R^2$ is the squared cosine of the angle $\varphi$ between $M_0 y$ and $M_0\hat{y}$:
$$R^2 = \cos^2\varphi.$$
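As a numerical illustration (a sketch with simulated data, not taken from the notes), the uncentered, centered and adjusted $R^2$ can be computed directly from the residuals:

    import numpy as np

    rng = np.random.default_rng(2)
    n, k = 200, 4
    X = np.column_stack([np.ones(n), rng.standard_normal((n, k - 1))])
    y = X @ np.array([1.0, 0.5, -1.0, 0.2]) + rng.standard_normal(n)

    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    e = y - X @ beta_hat

    R2_uncentered = 1 - (e @ e) / (y @ y)
    y_dev = y - y.mean()                         # M0 y: deviations from the mean
    R2_centered = 1 - (e @ e) / (y_dev @ y_dev)
    R2_adjusted = 1 - (e @ e / (n - k)) / (y_dev @ y_dev / (n - 1))
    print(R2_uncentered, R2_centered, R2_adjusted)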
3 The classical linear regression model

3.1 Assumptions

For the multiple linear regression model
$$y = X\beta + \varepsilon \qquad (2)$$
we start with a set of assumptions

A1: $y = X\beta + \varepsilon$
A2: $X$ is an $n \times k$ matrix with $\operatorname{rank}(X) = k$
A3: $E(\varepsilon \mid X) = 0$
A4: $E(\varepsilon\varepsilon' \mid X) = \sigma^2 I_n$
A5: $X$ is nonstochastic

Remarks

1. Assumption A1 includes a wider range of functional forms; for example we can
model exponential functions by taking logs of the equation
$$y = e^{\beta_1} x_2^{\beta_2} x_3^{\beta_3} e^{\varepsilon},$$
or quadratic functions like
$$y = \beta_1 + \beta_2 x + \beta_3 x^2 + \varepsilon.$$

2. A2 works as an identification condition. In the two-dimensional model the assumption
means that $x$ is not constant. There has to be enough variation in the model.

3. A3: $E(\varepsilon \mid X) = 0$ means $E(\varepsilon_i \mid X) = 0$ for every $i$.
The zero conditional mean implies that the unconditional mean is also zero, since
$$E(\varepsilon_i) = E_X\big[E(\varepsilon_i \mid X)\big] = E_X[0] = 0.$$
If we require $E(\varepsilon_i \mid X)$ to be constant, we can as well set this constant equal to
zero, if a constant term is included in the model. For example, if $E(\varepsilon_i \mid X) = \mu \neq 0$, we can rewrite the model with intercept $\beta_1 + \mu$ and error term $\varepsilon_i - \mu$.
A3 further implies that
$$E(\varepsilon_i x_{jl}) = 0 \quad \text{for any } i, j = 1, \dots, n, \ l = 1, \dots, k,$$
which is seen by
$$E(\varepsilon_i x_{jl}) = E\big[E(\varepsilon_i x_{jl} \mid X)\big] = E\big[x_{jl}\, E(\varepsilon_i \mid X)\big] = 0.$$
This is why assumption A3 is also called the strict exogeneity condition. It
requires the regressors to be orthogonal to the error term not only of the same
observation but also to the error terms of the other observations. A3 also implies that
$$E(y \mid X) = X\beta,$$
the regression of $y$ on $X$ is the conditional mean.

4. A4 more completely specifies the distribution of the error term.
The condition $\operatorname{Var}(\varepsilon_i \mid X) = \sigma^2$ requires homoscedastic errors,
and the condition on the covariances, $\operatorname{Cov}(\varepsilon_i, \varepsilon_j \mid X) = 0$ for $i \neq j$, requires that the errors
are non-autocorrelated.
5. A5: $X$ is nonstochastic in an experimental setting, where the analyst chooses
the independent variables $X$ and then observes the outcome $y$.
Example: in an agricultural experiment the outcome $y$ may be crop yields and
the analyst chooses the amount of fertilizer that is applied.
An alternative view in the experimental setup is that the observations of $X$ are
fixed in repeated samples.
In economics we do not often have the opportunity to analyse experimental data,
so the assumption of nonstochastic $X$ is not very appropriate. However, we will
see that this assumption can be dropped at a low cost.

Figures: errors satisfying $E(\varepsilon \mid X) = 0$, $E(\varepsilon\varepsilon' \mid X) = \sigma^2 I$; heteroskedastic errors; autocorrelated errors.

3.2 The statistical estimation problem

From A1 we know that there exists a true $\beta$ and we want to estimate it as well as
possible from the data.
We choose an estimator from the class of linear estimators,
$$\hat{\beta} = Cy + d.$$
We require that the estimator is unbiased,
$$E(\hat{\beta}) = \beta \quad \text{for all } \beta.$$
Out of these we choose the best estimator, which is the one with the smallest variance:
$$\operatorname{Var}(\hat{\beta}) = E\big[(\hat{\beta} - E\hat{\beta})(\hat{\beta} - E\hat{\beta})'\big] \ \text{ is minimal.}$$
Remarks

1) Figure: density function of the estimator of $\beta$.

2) The concept of the smallest variance is the following: $\operatorname{Var}(\hat{\beta})$ is a $k \times k$ variance-covariance matrix, therefore it is symmetric and positive definite. An estimator $\hat{\beta}$ has smaller variance than $\tilde{\beta}$ if $\operatorname{Var}(\tilde{\beta}) - \operatorname{Var}(\hat{\beta})$ is non-negative definite, i.e. $a'\operatorname{Var}(\hat{\beta})a \le a'\operatorname{Var}(\tilde{\beta})a$ for all $a \in \mathbb{R}^k$.

Lemma 1 Under the assumptions A1, A3, A5 the linear estimator $\hat{\beta} = Cy + d$ is
unbiased if and only if $CX = I$ and $d = 0$.
Proof.
$$\hat{\beta} = Cy + d = C(X\beta + \varepsilon) + d,$$
$$E(\hat{\beta}) = E\big[CX\beta + C\varepsilon + d\big] = CX\beta + C\,E(\varepsilon) + d = CX\beta + d.$$
This equals $\beta$ for every $\beta$ if and only if $CX = I$ and $d = 0$.

Lemma 2 Let $y$ be an n-dimensional random variable, whose first and second moments
exist, and let $z = Cy + d$ be a k-dimensional random variable; then
$$\operatorname{Var}(z) = C \operatorname{Var}(y)\, C'.$$
Proof. Exercise.
As a consequence of these Lemmas we note:
$\hat{\beta} = Cy$ with $CX = I$ is an unbiased linear estimator with variance covariance matrix
$$\operatorname{Var}(\hat{\beta}) = C \operatorname{Var}(y)\, C' = C\,\sigma^2 I\, C' = \sigma^2 CC'.$$
The remaining problem is to minimise $\sigma^2 CC'$ under the restriction $CX = I$.

Lemma 3 Denote $A = (X'X)^{-1}$. Further, let $X$ have full rank and let $CX = L$; then
$$CC' = LAL' + (C - LAX')(C - LAX')'.$$
Proof. Exercise.
Now we apply Lemma 3 with $L = I$ and get
$$CC' = (X'X)^{-1} + \big(C - (X'X)^{-1}X'\big)\big(C - (X'X)^{-1}X'\big)'.$$
The first term is constant; the second term is non-negative definite and vanishes exactly for $C = (X'X)^{-1}X'$.
With these steps we have proven the Gauss Markov Theorem.

Theorem 4 (Gauss Markov Theorem) Under the assumptions A1 to A5 the estimator
$$\hat{\beta} = (X'X)^{-1}X'y$$
is the Best Linear Unbiased Estimator (BLUE) of $\beta$.

Remarks
- In general, among the unbiased estimators there are better (nonlinear) estimators of $\beta$. We will show that if the error terms have a normal distribution, $\varepsilon \sim N(0, \sigma^2 I)$, $\hat{\beta}$ is the best unbiased estimator, which means that $\hat{\beta}$ is also efficient.
- If we give up unbiasedness, in general estimators with a smaller variance-covariance matrix exist.

Estimating the variance of $\hat{\beta}$
$$\hat{\beta} = (X'X)^{-1}X'y = (X'X)^{-1}X'(X\beta + \varepsilon) = \beta + (X'X)^{-1}X'\varepsilon,$$
$$\operatorname{Var}(\hat{\beta}) = E\big[(\hat{\beta} - \beta)(\hat{\beta} - \beta)'\big] = E\big[(X'X)^{-1}X'\varepsilon\varepsilon'X(X'X)^{-1}\big] = \sigma^2 (X'X)^{-1}.$$
Note that $\operatorname{Var}(\hat{\beta})$ depends on
- $\sigma^2$, the variance of the error $\varepsilon$,
- the condition of $X'X$. If $X'X$ is almost singular we talk of the
problem of multicollinearity.
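A small Monte Carlo sketch (illustrative only; sample size, parameters and error variance are arbitrary choices, not from the notes) makes the two results tangible: across repeated samples the OLS estimates average to $\beta$ and their empirical covariance is close to $\sigma^2 (X'X)^{-1}$.

    import numpy as np

    rng = np.random.default_rng(3)
    n, k, sigma, reps = 100, 3, 2.0, 5000
    X = np.column_stack([np.ones(n), rng.standard_normal((n, k - 1))])  # fixed X (A5)
    beta = np.array([1.0, 0.5, -0.3])
    XtX_inv = np.linalg.inv(X.T @ X)

    draws = np.empty((reps, k))
    for r in range(reps):
        y = X @ beta + sigma * rng.standard_normal(n)
        draws[r] = XtX_inv @ X.T @ y

    print(draws.mean(axis=0))              # close to beta (unbiasedness)
    print(np.cov(draws, rowvar=False))     # close to sigma^2 (X'X)^{-1}
    print(sigma**2 * XtX_inv)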

Linear transformations of $\hat{\beta}$

Corollary 5 (Corollary from GM Theorem) Under the assumptions A1 – A5 the estimator $L\hat{\beta}$ ($L$ an $m \times k$ matrix) is BLUE for $L\beta$.

This corollary has some important implications. The BLUE for $E(y \mid X) = X\beta$ is given by
$$\hat{y} = X\hat{\beta}$$
with variance
$$\operatorname{Var}(X\hat{\beta}) = X \operatorname{Var}(\hat{\beta})\, X' = \sigma^2 X(X'X)^{-1}X' = \sigma^2 P.$$
Estimating $X\beta$

Like before we want to find the BLUE.
Linear estimator: $\hat{y} = Cy + d$.
Unbiased estimator: $E(\hat{y}) = E(X\beta) = X\beta$.
We define the estimation error $\delta$ as $\delta = \hat{y} - X\beta$ and require $E(\delta) = 0$:
$$E(\delta) = E(Cy + d - X\beta) = CX\beta + d - X\beta = (CX - X)\beta + d.$$
So the conditions for unbiasedness are
$$CX = X \quad \text{and} \quad d = 0,$$
and the minimisation problem to solve is
$$\min_C\ E(\delta\delta') \quad \text{such that } CX = X,$$
$$E(\delta\delta') = E\big[C\varepsilon\varepsilon'C'\big] = \sigma^2 CC'.$$
Applying Lemma 3 with $L = X$ gives
$$E(\delta\delta') = \sigma^2\Big[X(X'X)^{-1}X' + \big(C - X(X'X)^{-1}X'\big)\big(C - X(X'X)^{-1}X'\big)'\Big],$$
which is minimal if $C = X(X'X)^{-1}X' = P$.
Thus we get the BLUE for $X\beta$:
$$\hat{y} = Py = X\hat{\beta} \quad \text{with} \quad \operatorname{Var}(\hat{y}) = \sigma^2 P, \qquad E(\hat{y}) = X\beta.$$
Estimation of $\sigma^2$

Think of the expression $\varepsilon'\varepsilon$ and suppose $\varepsilon$ were observed. Then we get
$$E(\varepsilon'\varepsilon) = \sum_{i=1}^{n} E(\varepsilon_i^2) = n\sigma^2.$$
This suggests that we could take $\frac{1}{n}e'e$ as an estimator for $\sigma^2$. But, due to the
construction of OLS we have $e = M\varepsilon$, hence $\frac{1}{n}e'e$ is biased.
We can see this by
$$e = My = M(X\beta + \varepsilon) = M\varepsilon,$$
$$E(e'e) = E(\varepsilon'M\varepsilon) = E\big[\operatorname{tr}(\varepsilon'M\varepsilon)\big] = E\big[\operatorname{tr}(M\varepsilon\varepsilon')\big] = \operatorname{tr}\big(M\,E(\varepsilon\varepsilon')\big) = \sigma^2 \operatorname{tr}(M) = \sigma^2 (n-k);$$
the last step follows from
$$\operatorname{tr}(M) = \operatorname{tr}(I_n) - \operatorname{tr}\big(X(X'X)^{-1}X'\big) = n - \operatorname{tr}\big((X'X)^{-1}X'X\big) = n - \operatorname{tr}(I_k) = n - k.$$
So we have
$$E(e'e) = (n-k)\,\sigma^2,$$
and we can get an unbiased estimator for $\sigma^2$ by
$$\hat{\sigma}^2 = \frac{e'e}{n-k}.$$
Now that we have an estimator for $\sigma^2$ we can present an estimator for the variance of $\hat{\beta}$:
$$\widehat{\operatorname{Var}}(\hat{\beta}) = \hat{\sigma}^2 (X'X)^{-1}.$$

Remarks

The standard error $\hat{\sigma}$ is also called the Standard Error of the Regression.
The standard error of a single component $\hat{\beta}_j$ of the parameter vector is given by
$$se(\hat{\beta}_j) = \hat{\sigma}\,\big[(X'X)^{-1}\big]_{jj}^{1/2}.$$

The population equivalent of $R^2$

Remember that we defined the centered coefficient of determination by
$$R^2 = 1 - \frac{e'e}{y'M_0 y}.$$
It can be shown that $R^2$ is a biased estimator of its population counterpart. To reduce
the bias we define the corrected $R^2$ or, alternatively, the adjusted $R^2$,
$$\bar{R}^2 = 1 - \frac{e'e/(n-k)}{y'M_0 y/(n-1)}.$$
Both usually have a smaller, but negative, bias.

3.3 Prediction or the out of sample forecasting

We use the model
$$y_0 = x_0'\beta + \varepsilon_0,$$
where $y_0$ is unknown and $x_0$ is given.
Example: with the help of past values predict consumption in 2001.

The problem is again to find the best linear unbiased predictor for $y_0$:
$$\hat{y}_0 = c'y + d, \qquad E(c'y + d) = E(y_0), \qquad \delta_0 = \hat{y}_0 - y_0.$$
Find $\hat{y}_0$ such that $E(\delta_0) = 0$ and $E(\delta_0^2)$ is minimal.

Theorem 6 (Gauss Markov Theorem – Continued) Under the assumptions A1 to A5
the best linear unbiased forecast for $y_0$ is given by
$$\hat{y}_0 = x_0'\hat{\beta}.$$
4 Stochastic regression

Social scientists are rarely able to analyse experimental data. Thus it is necessary to
extend the results of the preceding section to cases in which some or all independent
variables are randomly drawn from some probability distribution.
A convenient method of obtaining the statistical properties of $\hat{\beta}$ is to

1. obtain results on the statistical properties conditional on $X$ (equivalent to the case of non-stochastic regressors),

2. find unconditional results by "averaging" (integrating over) the conditional distributions.

As before,
$$\hat{\beta} = \beta + (X'X)^{-1}X'\varepsilon.$$
So, conditioned on the observed $X$,
$$E(\hat{\beta} \mid X) = \beta + (X'X)^{-1}X'E(\varepsilon \mid X) = \beta,$$
$$E(\hat{\beta}) = E_X\big[E(\hat{\beta} \mid X)\big] = \beta.$$
The unbiasedness of $\hat{\beta}$ is robust to assumptions about $X$; it rests only on assumption A3.
The variance of $\hat{\beta}$, conditioned on $X$, is
$$\operatorname{Var}(\hat{\beta} \mid X) = \sigma^2 (X'X)^{-1},$$
and the unconditional variance is
$$\operatorname{Var}(\hat{\beta}) = E_X\big[\operatorname{Var}(\hat{\beta} \mid X)\big] + \operatorname{Var}_X\big[E(\hat{\beta} \mid X)\big] = \sigma^2 E_X\big[(X'X)^{-1}\big].$$
With the Gauss-Markov Theorem in the previous section we have shown that
$$\operatorname{Var}(\tilde{\beta} \mid X) \ge \operatorname{Var}(\hat{\beta} \mid X) \quad \text{for any linear unbiased estimator } \tilde{\beta}.$$
This inequality, if it holds for every particular $X$, must hold over the average values of $X$:
$$\operatorname{Var}(\tilde{\beta}) \ge \operatorname{Var}(\hat{\beta}).$$

Theorem 7 (Gauss Markov Theorem – Continued) In the classical linear regression
model the least squares estimator $\hat{\beta} = (X'X)^{-1}X'y$ is the minimum variance
linear unbiased estimator of $\beta$, whether $X$ is stochastic or nonstochastic.

Further Remark on Assumption A3

We noticed that
$$E(\varepsilon_i x_{jl}) = E\big[x_{jl}\,E(\varepsilon_i \mid X)\big] = 0,$$
but by the same argument we get
$$\operatorname{Cov}(X, \varepsilon) = E(X'\varepsilon) - E(X')E(\varepsilon) = 0,$$
which says that $X$ and $\varepsilon$ are uncorrelated. The interpretation is that $X$ in some sense
captures all relevant effects which are necessary to explain $y$.
Example: $y$ – wage, $x$ – education, $\varepsilon$ – ability. $E(\varepsilon \mid x) = 0$: the average ability in all
education groups is the same.
5 Statistical inference in the classical linear regression model

So far we have solved the estimation problem, but there remain some open questions,
like
- Which explanatory variables are best included in the model?
- What about the functional form?
- What is the distribution of the residuals?
We need a testing framework in the linear regression model and exact distributional
assumptions for the errors. We make an additional assumption

A6: $\varepsilon \sim N(0, \sigma^2 I)$

There are several arguments in favour of the normal distribution:
- normality is preserved under linear transformations,
- quadratic forms of normals give $\chi^2$- or F-distributed random variables,
- under normality of $\varepsilon$, the OLS estimator $\hat{\beta}$ is also the Maximum Likelihood estimator.

5.1 Introduction and review

The Normal Distribution

An n-dimensional random variable $x$ is normally distributed if its density function is
of the following form:
$$f(x) = (2\pi)^{-n/2}\,|\Sigma|^{-1/2}\exp\Big\{-\tfrac{1}{2}(x - \mu)'\Sigma^{-1}(x - \mu)\Big\},$$
with $\mu \in \mathbb{R}^n$ and $\Sigma$ symmetric and positive definite.
$$E(x) = \mu, \qquad \operatorname{Var}(x) = \Sigma.$$
Corollary 8 Let $x$ be an $n \times 1$ random variable with $x \sim N(\mu, \Sigma)$.
If $z = Cx + d$, where $C$ is an $m \times n$ matrix with $\operatorname{rank}(C) = m$ and $m \le n$, then
$$z \sim N\big(C\mu + d,\ C\Sigma C'\big).$$
Let us apply the corollary to the linear regression model:
$$\varepsilon \sim N(0, \sigma^2 I),$$
$$y = X\beta + \varepsilon \sim N(X\beta, \sigma^2 I),$$
$$\hat{\beta} = (X'X)^{-1}X'y \sim N\big(\beta,\ \sigma^2 (X'X)^{-1}\big).$$

The principle of Maximum Likelihood

Suppose we have a given sample of observations $y_1, \dots, y_n$ with the joint density
function $f(y_1, \dots, y_n; \theta)$ and we want to estimate the parameter vector $\theta$. The
estimator of $\theta$ dependent on the observations is called $\hat{\theta}(y_1, \dots, y_n)$.
The Likelihood function is defined as
$$L(\theta; y_1, \dots, y_n) = f(y_1, \dots, y_n; \theta).$$
The principle of Maximum Likelihood states that an estimator of $\theta$ is given by the
maximum of the Likelihood Function,
$$\hat{\theta}(y_1, \dots, y_n) = \arg\max_{\theta} L(\theta; y_1, \dots, y_n).$$
Here we want to find the ML estimator for the linear regression model. First we set up
the Likelihood Function for $y_1, \dots, y_n$:
$$L(\beta, \sigma^2) = f(y_1, \dots, y_n; \beta, \sigma^2) = (2\pi\sigma^2)^{-n/2}\exp\Big\{-\frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta)\Big\}.$$
Taking logarithms,
$$\log L = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta),$$
and setting the derivatives with respect to the parameters equal to zero,
$$\frac{\partial \log L}{\partial \beta} = \frac{1}{\sigma^2}\big(X'y - X'X\beta\big) = 0,$$
$$\frac{\partial \log L}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}(y - X\beta)'(y - X\beta) = 0,$$
gives
$$\hat{\beta}_{ML} = (X'X)^{-1}X'y, \qquad \hat{\sigma}^2_{ML} = \frac{e'e}{n}.$$
The Maximum Likelihood estimator for $\beta$ equals the OLS estimator and the Maximum
Likelihood estimator for $\sigma^2$ is given by the biased variance estimator.

Theorem 9 In the classical regression model with normally distributed errors the
least squares estimator $\hat{\beta}$ has minimal variance among all unbiased estimators. Thus $\hat{\beta}$ is
efficient, not only linearly efficient.

Remark: For non-normally distributed errors the ML estimator usually has a smaller
variance than $\hat{\beta}_{OLS}$. Thus for non-normality it is better to use the ML estimator than OLS.
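The coincidence of the ML and OLS estimators under normality can be checked numerically. The following Python sketch (simulated data; it assumes scipy is installed, and the optimiser settings are illustrative) minimises the negative log-likelihood and compares the result with the closed-form OLS solution:

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(4)
    n, k = 200, 3
    X = np.column_stack([np.ones(n), rng.standard_normal((n, k - 1))])
    y = X @ np.array([1.0, -0.5, 0.8]) + 1.5 * rng.standard_normal(n)

    def neg_loglik(params):
        b, log_s2 = params[:k], params[k]      # sigma^2 parameterised via its log
        s2 = np.exp(log_s2)
        resid = y - X @ b
        return 0.5 * (n * np.log(2 * np.pi * s2) + resid @ resid / s2)

    res = minimize(neg_loglik, x0=np.zeros(k + 1), method="BFGS")
    beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
    e = y - X @ beta_ols
    print(res.x[:k], beta_ols)           # ML beta equals OLS beta
    print(np.exp(res.x[k]), e @ e / n)   # ML variance equals e'e/n (the biased estimator)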

The statistical testing problem

We start with a parameter space $\Theta$.
Null Hypothesis: $H_0: \theta \in \Theta_0$
Alternative: $H_1: \theta \in \Theta_1 = \Theta \setminus \Theta_0$
Example: $\Theta = \mathbb{R}$, $\Theta_0 = \{\theta_0\}$ for a given value $\theta_0$.

A statistical test is a decision rule based on the sample. The decision rule determines
if $H_0$ is accepted or rejected.
For the possible outcomes of the testing procedure we find the following:

              Accept H0        Reject H0
H0 true       ok               Error Type 1
H0 false      Error Type 2     ok

Table 3: Testing outcomes

Probability(Error Type 1) = $\alpha$ ... size of the test
1 − Probability(Error Type 2) ... power of the test

Tests are compared in terms of size and power. The "best" test has maximal power
for a given (fixed) size.

5.2 Testing principles

In this section we discuss three testing principles which are based on Maximum Likelihood estimation of the parameter $\theta$. Given an arbitrary function $c(\cdot)$, the testing hypothesis is
$$H_0: c(\theta) = 0.$$

Likelihood Ratio Test

If the restriction $c(\theta) = 0$ is valid, imposing it should not lead to a large reduction of
the Likelihood Function. Therefore we base the test on the difference $\log L_U - \log L_R$,
where $L_U$ is the likelihood at the unconstrained value of $\theta$ and $L_R$ is the value of the likelihood
at the restricted estimate. Let $\hat{\theta}_U$ be the maximum likelihood estimate of $\theta$ obtained
without regard to the constraints, and let $\hat{\theta}_R$ be the maximum likelihood estimate we
receive by maximising the likelihood function subject to the constraint. If $L(\hat{\theta}_U)$ and
$L(\hat{\theta}_R)$ are the values of the Likelihood Function evaluated at these two estimates, then the
Likelihood Ratio is
$$\lambda = \frac{L(\hat{\theta}_R)}{L(\hat{\theta}_U)}.$$
This ratio must be between 0 and 1. If $\lambda$ is too small we reject the null hypothesis.

Wald Test

If the restriction $c(\theta) = 0$ is valid, $c(\hat{\theta}_U)$ should be close to zero, because maximum
likelihood estimation is consistent. Therefore the test is based on $c(\hat{\theta}_U)$. We reject
the null hypothesis if this is significantly different from zero.

Lagrange Multiplier Test

If the restriction $c(\theta) = 0$ is valid, the restricted estimator should be near the point that
maximises the log likelihood. Therefore, the slope of the likelihood function should be
near zero at the restricted estimator. The test is based on the slope of the log Likelihood
at the point where the function is maximised subject to the restriction. The derivative
of the Likelihood with respect to the parameters is called the score,
$$s(\theta) = \frac{\partial \log L(\theta)}{\partial \theta}.$$
The test is based on $s(\hat{\theta}_R)$ and we reject the null hypothesis if this is significantly
different from zero.

The three tests have asymptotically equivalent behaviour, but differ in small samples.
The choice among the three principles is often made on practical computational considerations:
Wald Test: requires estimation of the unrestricted model
LM Test: requires estimation of the restricted model
LR Test: requires estimation of both models

Example: derive a decision rule in the case of the LR test

Let the parameter space be given by $\Theta$ and the null hypothesis by $H_0: \theta \in \Theta_0$ with $\Theta_0 \subset \Theta$.
Then the test statistic of the Likelihood Ratio test is given by
$$\lambda(y_1, \dots, y_n) = \frac{\max_{\theta \in \Theta_0} L(\theta; y_1, \dots, y_n)}{\max_{\theta \in \Theta} L(\theta; y_1, \dots, y_n)}.$$
The test statistic $\lambda$ is a function of the random variables $y_1, \dots, y_n$, thus it is
itself a random variable with density function $g(\lambda \mid \theta)$.

Note: $\lambda$ is always between 0 and 1, because both likelihood functions are positive
and the restricted maximum cannot be larger than the unrestricted one.
If $H_0$ is true: $\lambda$ is near 1.
If $H_0$ is false: $\lambda$ is near 0.
We need the statistical framework to specify what "near" means. For a given significance level $\alpha$ the graph shows the density of $\lambda$ in the case that the null hypothesis is
true and in the case that it is not true.

Figure: $g(\lambda)$ if $H_0$ is true; $g(\lambda)$ if $H_0$ is false

For a given $\alpha$ the critical region $\lambda \le \lambda_\alpha$ is defined by the $\alpha$-quantile of $g$:
$$\int_0^{\lambda_\alpha} g(\lambda \mid H_0)\, d\lambda = \alpha.$$
And we get the decision rule:
Reject $H_0$ if $\lambda \le \lambda_\alpha$; accept $H_0$ if $\lambda > \lambda_\alpha$.

5.3 Testing values of β: practical part

Testing a hypothesis about a single parameter

We start with testing a hypothesis about a single element of the parameter vector $\beta$,
like
$$H_0: \beta_j = \beta_j^0.$$
To derive a decision rule we have to find a suitable test statistic with a known probability distribution. Remember that under A6
$$\hat{\beta} \sim N\big(\beta, \sigma^2 (X'X)^{-1}\big)$$
and
$$\frac{\hat{\beta}_j - \beta_j}{\sigma\,\big[(X'X)^{-1}\big]_{jj}^{1/2}} \sim N(0, 1).$$
If we knew $\sigma$ we would be fine, but we have to replace $\sigma$ by the estimated value $\hat{\sigma}$.
It can be shown (in the next section) that
$$t_j = \frac{\hat{\beta}_j - \beta_j}{se(\hat{\beta}_j)}$$
has a t-distribution with $n - k$ degrees of freedom. Remember that the standard error
of $\hat{\beta}_j$ is given by $se(\hat{\beta}_j) = \hat{\sigma}\big[(X'X)^{-1}\big]_{jj}^{1/2}$.
The most common application of a test on a single value of $\beta$ is to test the hypothesis
$$H_0: \beta_j = 0.$$
This hypothesis claims that the variable $x_j$ does not have a partial effect on $y$ after the
other independent variables $x_1, \dots, x_{j-1}, x_{j+1}, \dots, x_k$ have been accounted for. If the
null hypothesis is not rejected, $x_j$ can be eliminated from the regression equation.
In this case the t-statistic is
$$t_j = \frac{\hat{\beta}_j}{se(\hat{\beta}_j)}.$$
$t_j$ is small if either $\hat{\beta}_j$ is close to zero or $se(\hat{\beta}_j)$ is large.
There are two possibilities to define the alternative hypothesis. We start with a one-sided alternative,
$$H_0: \beta_j = 0, \qquad H_1: \beta_j > 0,$$
and we choose the significance level, e.g. $\alpha = 0.05$.
To set up a decision rule we have to find a sufficiently large value of $t_j$ in order to
reject $H_0: \beta_j = 0$. We reject the null hypothesis if $t_j > c$, with $c$ the 95th percentile
of the $t_{n-k}$-distribution.

Figure 14: One-sided alternative

According to this decision rule a rejection of $H_0$ will occur in 5% of all random samples in which $H_0$ is true (Type I error).

Example 10 We consider a model which explains log wages by the years of education,
years of working experience and years of tenure with the current employer,
$$\log(wage) = \beta_1 + \beta_2\, educ + \beta_3\, exper + \beta_4\, tenure + \varepsilon.$$
Now we want to test if experience has a partial influence on wages, once education
and tenure have been accounted for:
$$H_0: \beta_3 = 0, \qquad H_1: \beta_3 > 0.$$
Estimating the model on a sample of $n = 526$ workers, the degrees of freedom are $n - k = 526 - 4 = 522$. Let $\alpha = 0.05$; then the critical value is $c \approx 1.65$, and the t-statistic $t_3 = \hat{\beta}_3/se(\hat{\beta}_3)$ computed from the estimates exceeds it. Hence we reject $H_0$.
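The t-test of $H_0: \beta_j = 0$ is easy to carry out by hand. The Python sketch below (simulated data standing in for the wage example; the coefficient values are arbitrary, and scipy is assumed for the t quantiles) computes $\hat{\sigma}^2$, the standard errors and the t-statistics:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(5)
    n, k = 526, 4
    X = np.column_stack([np.ones(n), rng.standard_normal((n, k - 1))])
    y = X @ np.array([0.5, 0.1, 0.05, 0.02]) + 0.4 * rng.standard_normal(n)

    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    e = y - X @ beta_hat
    sigma2_hat = e @ e / (n - k)
    se = np.sqrt(sigma2_hat * np.diag(XtX_inv))

    t_stats = beta_hat / se                        # tests of H0: beta_j = 0
    c_one_sided = stats.t.ppf(0.95, df=n - k)      # one-sided 5% critical value
    c_two_sided = stats.t.ppf(0.975, df=n - k)     # two-sided 5% critical value
    print(t_stats, c_one_sided, c_two_sided)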
The second possibility is to define a two-sided alternative,
$$H_0: \beta_j = 0, \qquad H_1: \beta_j \neq 0.$$
In this case we reject $H_0$ if $|t_j| > c$.

Figure 15: Two-sided alternative

Example 11 We consider a model in which the college grade point average (colGPA) is
explained by other test results (high school GPA and an achievement test score) and the average number of lectures missed per week
(skipped). We want to test if missing lectures has an influence on college GPA:
$$colGPA = \beta_1 + \beta_2\, hsGPA + \beta_3\, ACT + \beta_4\, skipped + \varepsilon,$$
$$H_0: \beta_4 = 0, \qquad H_1: \beta_4 \neq 0.$$
For the significance level $\alpha = 0.05$ the critical value is $c \approx 1.96$, and the estimated
t-statistic on skipped exceeds it in absolute value. Hence $H_0$ can be rejected.
If we want to test if $\beta_j$ is equal to some given constant $a_j$ (e.g. $a_j = 1$ or $a_j = -1$) we
proceed in the same way:
$$H_0: \beta_j = a_j, \qquad H_1: \beta_j \neq a_j,$$
and the test statistic is given by
$$t_j = \frac{\hat{\beta}_j - a_j}{se(\hat{\beta}_j)}.$$

Example 12 (Constant elasticity model) In this model we study the effect of air pollution on housing prices. The dependent variable is the median housing price in 506
Boston regions and the variable nox gives the average amount of nitrous oxide in the air. The model is
$$\log(price) = \beta_1 + \beta_2 \log(nox) + \dots + \varepsilon,$$
and we test whether the elasticity of the price with respect to nox equals $-1$:
$$H_0: \beta_2 = -1, \qquad H_1: \beta_2 \neq -1.$$
The test statistic $t_2 = (\hat{\beta}_2 + 1)/se(\hat{\beta}_2)$ is close to zero. $H_0$ cannot be rejected at any conventional significance
level, hence $\hat{\beta}_2$ is not significantly different from $-1$.

Testing multiple exclusion restrictions

Now we want to test whether a group of variables has a partial effect on the dependent
variable.
Consider the two models:
$$\log(wage) = \beta_1 + \beta_2\, educ + \beta_3\, exper + \beta_4\, tenure + \varepsilon \qquad (3)$$
and
$$\log(wage) = \beta_1 + \beta_2\, educ + \varepsilon. \qquad (4)$$
A test between the two models is a test for the hypothesis
$$H_0: \beta_3 = \beta_4 = 0, \qquad H_1: H_0 \text{ is not true.}$$
First, we estimate the unrestricted model given by equation (3) and keep the coefficient
of determination $R^2_U$, which tells us how well the model fits. Second, we estimate the
restricted model of equation (4) and keep $R^2_R$ from the regression. A test statistic can
then be based on the difference between the $R^2$ from both models:
$$F = \frac{(R^2_U - R^2_R)/q}{(1 - R^2_U)/(n - k)}.$$
To generalise the procedure applied in this example we consider the linear regression
model $y = X\beta + \varepsilon$ and rearrange the columns of $X$ so that the independent variables
from the restricted model come first:
$$X = (X_1 \ X_2), \qquad \beta = \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix}.$$
Now $X_1$ and $X_2$ are matrices of dimension $n \times (k - q)$ and $n \times q$ respectively,
$\beta_1$ and $\beta_2$ are vectors of dimension $(k - q) \times 1$ and $q \times 1$, and $q$ is the number of
restrictions.
$$H_0: \beta_2 = 0 \ \big(\text{i.e. } \beta_{k-q+1} = \dots = \beta_k = 0\big), \qquad H_1: H_0 \text{ is not true.}$$
The F-statistic is given by
$$F = \frac{(R^2_U - R^2_R)/q}{(1 - R^2_U)/(n - k)}.$$

We will show that $F$ has an F-distribution with degrees of freedom $q$ and $n - k$.
Note that $R^2_U \ge R^2_R$. We reject $H_0$ if the F-statistic is sufficiently "large".

Example 13 Consider again the example with the wage equation from before. Now
we want to test if wages are completely determined by years of education:
$$H_0: \beta_3 = \beta_4 = 0, \qquad H_1: H_0 \text{ is not true.}$$
The estimation of the unrestricted model gives $R^2_U$ for $n = 526$ and $k = 4$; for the
restricted model we get $R^2_R$, and the number of restrictions is $q = 2$. The resulting F-statistic clearly exceeds the critical value of the $F(2, n-k)$ distribution at the significance
level $\alpha = 0.05$, so $H_0$ is rejected.
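A sketch of the F-test for exclusion restrictions based on the two $R^2$ values (illustrative Python code with simulated data, not the wage data; scipy supplies the F quantile):

    import numpy as np
    from scipy import stats

    def r_squared(y, X):
        b = np.linalg.solve(X.T @ X, X.T @ y)
        e = y - X @ b
        yd = y - y.mean()
        return 1 - (e @ e) / (yd @ yd)

    rng = np.random.default_rng(6)
    n, k, q = 526, 4, 2
    X = np.column_stack([np.ones(n), rng.standard_normal((n, 3))])
    y = X @ np.array([0.5, 0.1, 0.05, 0.02]) + 0.4 * rng.standard_normal(n)

    R2_u = r_squared(y, X)                 # unrestricted: all k regressors
    R2_r = r_squared(y, X[:, :k - q])      # restricted: last q regressors excluded
    F = ((R2_u - R2_r) / q) / ((1 - R2_u) / (n - k))
    crit = stats.f.ppf(0.95, dfn=q, dfd=n - k)
    print(F, crit, F > crit)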

Test the overall significance of the regression

Consider the regression model
$$y_i = \beta_1 + \beta_2 x_{i2} + \dots + \beta_k x_{ik} + \varepsilon_i,$$
where a constant is included, say by $x_{i1} = 1$ for all $i$.
To test if this regression makes any sense at all we set up the hypothesis
$$H_0: \beta_2 = \dots = \beta_k = 0, \qquad H_1: H_0 \text{ is not true.}$$
We will show that in this case the test statistic
$$F = \frac{R^2/(k-1)}{(1 - R^2)/(n - k)}$$
has an F-distribution with parameters $(k - 1, n - k)$.

Example 14 (wage equation) For the wage equation above the F-statistic for overall
significance is far above the 5% critical value, so the regressors are jointly significant.

5.4 Testing values of β: theoretical part

We begin with the introduction of several important testing distributions. Then we
derive some results on statistical independence in the linear regression model. Finally
we derive the test statistics.

$\chi^2$-distribution

Let $x = (x_1, \dots, x_m)'$ be an m-dimensional random vector with $x \sim N(0, I_m)$; then
$z = x'x = \sum_{i=1}^m x_i^2$ has a distribution with the density function
$$f(z) = c_m\, z^{m/2 - 1} e^{-z/2} \ \text{ if } z > 0, \qquad f(z) = 0 \ \text{ otherwise},$$
where $c_m = \big[2^{m/2}\,\Gamma(m/2)\big]^{-1}$.
$z$ is called centrally $\chi^2$-distributed with $m$ degrees of freedom.
Let $x \sim N(\mu, I_m)$; then $z = x'x$ has a non-central $\chi^2$-distribution with parameters $m$
and $\lambda = \mu'\mu$.

Remark: If $x \sim N(\mu, \Sigma)$ has an arbitrary normal distribution, then
$z_1 = (x - \mu)'\Sigma^{-1}(x - \mu)$ has a central $\chi^2$-distribution with $m$ degrees of freedom
and
$z_2 = x'\Sigma^{-1}x$ has a non-central $\chi^2$-distribution with parameters $m$ and $\lambda = \mu'\Sigma^{-1}\mu$.
F-distribution

Let $z_1$ and $z_2$ be two independent $\chi^2$-distributed random variables with $m_1$ and $m_2$
degrees of freedom respectively; then the random variable
$$F = \frac{z_1/m_1}{z_2/m_2}$$
has a central F-distribution. The corresponding density function is given by
$$f(F) = c_{m_1, m_2}\,\frac{F^{m_1/2 - 1}}{\big(1 + \tfrac{m_1}{m_2}F\big)^{(m_1 + m_2)/2}} \ \text{ if } F > 0, \qquad f(F) = 0 \ \text{ otherwise},$$
with
$$c_{m_1, m_2} = \Big(\frac{m_1}{m_2}\Big)^{m_1/2}\frac{\Gamma\big(\tfrac{m_1 + m_2}{2}\big)}{\Gamma\big(\tfrac{m_1}{2}\big)\,\Gamma\big(\tfrac{m_2}{2}\big)}.$$
Let $z_1$ and $z_2$ be independent random variables,
let $z_1$ be non-centrally $\chi^2$-distributed and $z_2$ be centrally $\chi^2$-distributed;
then the ratio of both is non-centrally F-distributed.

t-distribution

The t-distribution is given as a special case of the F-distribution.
If $m_1 = 1$ and the random variable $F$ is centrally F-distributed, then the random
variable
$$t = \sqrt{F}$$
has a central t-distribution with $m_2$ degrees of freedom;
if $F$ is non-centrally F-distributed, $t$ has a non-central t-distribution.
Example: the ratio of a standard normally distributed random variable and the square
root of an independently $\chi^2$-distributed random variable divided by its degrees of freedom gives a random variable
with a t-distribution.
Results on stochastic independence of quadratic forms

Theorem 15 Let the n-dimensional random vector $x \sim N(\mu, \sigma^2 I)$; then
$$z = \frac{x'Ax}{\sigma^2} \ \text{ is } \chi^2(r) \text{ distributed with } \lambda = \frac{\mu'A\mu}{\sigma^2}$$
if and only if A is idempotent, symmetric and rank(A) = r.

Theorem 16 Let A be an idempotent $n \times n$ matrix, rank(A) = r,
let B be an $m \times n$ matrix with $BA = 0$,
and let $x \sim N(\mu, \sigma^2 I)$ be an $n \times 1$ random vector;
then the random variables $Bx$ and $z = x'Ax$ are stochastically independent.

Example: In the classical linear regression model with normally distributed errors, $\hat{\beta}$
and $\hat{\sigma}^2$ are independent:
$$\hat{\beta} = \beta + (X'X)^{-1}X'\varepsilon, \qquad \hat{\sigma}^2 = \frac{e'e}{n-k} = \frac{\varepsilon'M\varepsilon}{n-k}, \qquad \varepsilon \sim N(0, \sigma^2 I),$$
and
$$\big[(X'X)^{-1}X'\big] M = (X'X)^{-1}X' - (X'X)^{-1}X'X(X'X)^{-1}X' = 0,$$
so Theorem 16 applies with $B = (X'X)^{-1}X'$ and $A = M$.

Theorem 17 Let A be a symmetric idempotent $n \times n$ matrix, rank(A) = r,
let B be a symmetric $n \times n$ matrix and $x \sim N(\mu, \sigma^2 I)$.
If BA = 0 the quadratic forms $x'Ax$ and $x'Bx$ are stochastically independent.

Theorem 18 Under the assumptions A1, A2, A3, A4, A6 the quadratic form
$e'e/\sigma^2$ is $\chi^2$ distributed with (n-k) degrees of freedom.

Proof.
$$\frac{e'e}{\sigma^2} = \frac{\varepsilon'M\varepsilon}{\sigma^2},$$
where the matrix $M$ is symmetric and idempotent with rank($M$) = (n-k). The rest
follows from Theorem 15.

Theorem 19 Let $\beta_0$ be fixed.
Under the assumptions A1, A2, A3, A4, A6 the quadratic form
$$\frac{(\hat{\beta} - \beta_0)'X'X(\hat{\beta} - \beta_0)}{\sigma^2} \sim \chi^2(k, \lambda)$$
with
$$\lambda = \frac{(\beta - \beta_0)'X'X(\beta - \beta_0)}{\sigma^2},$$
and it is independent from $e'e/\sigma^2$.

Proof.
$$\hat{\beta} - \beta_0 = (X'X)^{-1}X'y - \beta_0 = (X'X)^{-1}X'(y - X\beta_0),$$
so that
$$X(\hat{\beta} - \beta_0) = X(X'X)^{-1}X'(y - X\beta_0) = P(y - X\beta_0)$$
and
$$(\hat{\beta} - \beta_0)'X'X(\hat{\beta} - \beta_0) = (y - X\beta_0)'P(y - X\beta_0), \qquad y - X\beta_0 \sim N\big(X(\beta - \beta_0), \sigma^2 I\big).$$
The matrix $P$ is idempotent and symmetric, hence the $\chi^2$ distribution follows from
Theorem 15. Further $PM = 0$, and we get independence from Theorem 16 and
Theorem 17.
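A quick simulation sketch (not part of the original notes; the dimensions are arbitrary) illustrates Theorem 18: across repeated samples $e'e/\sigma^2$ behaves like a $\chi^2_{n-k}$ variable, with mean $n-k$ and variance $2(n-k)$.

    import numpy as np

    rng = np.random.default_rng(7)
    n, k, sigma, reps = 30, 4, 1.5, 20000
    X = np.column_stack([np.ones(n), rng.standard_normal((n, k - 1))])
    M = np.eye(n) - X @ np.linalg.inv(X.T @ X) @ X.T

    eps = sigma * rng.standard_normal((reps, n))
    stat = np.einsum('ri,ij,rj->r', eps, M, eps) / sigma**2   # e'e / sigma^2 = eps'M eps / sigma^2
    print(stat.mean(), n - k)        # approximately n - k
    print(stat.var(), 2 * (n - k))   # approximately 2(n - k)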
Now we have collected all tools that are necessary to derive the test statistics.
First we note a helpful equation which can be verified by multiplying out:
$$(y - X\beta_0)'(y - X\beta_0) = e'e + (\hat{\beta} - \beta_0)'X'X(\hat{\beta} - \beta_0). \qquad (5)$$

Hypothesis I: $H_0: \beta = \beta_0$

The first hypothesis is a hypothesis on the complete parameter vector. We derive the F
test statistic from a Likelihood Ratio test. Remember that the LR test statistic is given
by
$$\lambda = \frac{L(\hat{\theta}_R)}{L(\hat{\theta}_U)}.$$
We already derived the Maximum Likelihood estimators for $\beta$ and $\sigma^2$ from the Likelihood Function
$$L(\beta, \sigma^2) = (2\pi\sigma^2)^{-n/2}\exp\Big\{-\frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta)\Big\}:$$
$$\hat{\beta} = (X'X)^{-1}X'y \quad \text{and} \quad \hat{\sigma}^2 = \frac{e'e}{n}.$$
In the restricted model we have $\beta = \beta_0$ and $\hat{\sigma}_R^2 = \frac{1}{n}(y - X\beta_0)'(y - X\beta_0)$.
What we need are the values of the Likelihood function at the restricted maximum and
at the unrestricted maximum:
$$L(\hat{\theta}_U) = (2\pi\hat{\sigma}^2)^{-n/2} e^{-n/2}, \qquad L(\hat{\theta}_R) = (2\pi\hat{\sigma}_R^2)^{-n/2} e^{-n/2}.$$
Now we get the LR test statistic
$$\lambda^{-2/n} = \frac{\hat{\sigma}_R^2}{\hat{\sigma}^2} = \frac{(y - X\beta_0)'(y - X\beta_0)}{e'e} = 1 + \frac{(\hat{\beta} - \beta_0)'X'X(\hat{\beta} - \beta_0)}{e'e} = 1 + \frac{k}{n - k}\,F \qquad (6)$$
with
$$F = \frac{(\hat{\beta} - \beta_0)'X'X(\hat{\beta} - \beta_0)/k}{e'e/(n - k)};$$
for the last equality we used equation (5).
Under $H_0$, $F$ is centrally F-distributed with parameters $(k, n - k)$ (see Theorem 19
and the definition of the F-distribution).
Under $H_1$, $F$ is non-centrally F-distributed with parameters $(k, n - k, \lambda)$ and $\lambda = (\beta - \beta_0)'X'X(\beta - \beta_0)/\sigma^2$.

Hypothesis II: $H_0: \beta_2 = \beta_2^0$

This is a hypothesis on a part of the parameter vector. We partition $\beta$ in two parts
according to the test hypothesis,
$$\beta = \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix}, \qquad X = (X_1 \ X_2),$$
where $\beta_1$ is a $(k - q) \times 1$ vector and $\beta_2$ is a $q \times 1$ vector. That means we are testing
$q$ restrictions,
$$H_0: \beta_2 = \beta_2^0.$$
Now the restricted model is given by
$$y - X_2\beta_2^0 = X_1\beta_1 + \varepsilon.$$
We define $P_1$ by
$$P_1 = X_1(X_1'X_1)^{-1}X_1', \qquad M_1 = I - P_1, \qquad H = X_2'M_1X_2,$$
and we can express the residuals from this regression by
$$e_1 = M_1(y - X_2\beta_2^0).$$
Next we need an auxiliary equation analogous to equation (5):
$$e_1'e_1 = e'e + (\hat{\beta}_2 - \beta_2^0)'H(\hat{\beta}_2 - \beta_2^0). \qquad (7)$$
The LR test statistic now equals, applying equation (7),
$$\lambda^{-2/n} = \frac{\hat{\sigma}_R^2}{\hat{\sigma}^2} = \frac{e_1'e_1}{e'e} = 1 + \frac{q}{n - k}\,F,$$
with
$$F = \frac{(\hat{\beta}_2 - \beta_2^0)'H(\hat{\beta}_2 - \beta_2^0)/q}{e'e/(n - k)}.$$
Under $H_0$, $F$ is centrally F-distributed with parameters $(q, n - k)$. Under $H_1$, $F$ is
non-centrally F-distributed with parameters $(q, n - k, \lambda)$ and $\lambda = (\beta_2 - \beta_2^0)'H(\beta_2 - \beta_2^0)/\sigma^2$.


Hypothesis III: $H_0: \beta_j = \beta_j^0$

This is the hypothesis about a single element in the parameter vector. The third hypothesis is a special case of the second with $q = 1$:
$$H_0: \beta_j = \beta_j^0, \qquad H_1: \beta_j \neq \beta_j^0.$$
As this is a special case of Hypothesis II with $q = 1$,
$$H = X_2'M_1X_2 = \Big(\big[(X'X)^{-1}\big]_{jj}\Big)^{-1},$$
and thus
$$F = \frac{(\hat{\beta}_j - \beta_j^0)^2}{\hat{\sigma}^2\big[(X'X)^{-1}\big]_{jj}}.$$
Now we can derive the familiar t test statistic,
$$t = \frac{\hat{\beta}_j - \beta_j^0}{\hat{\sigma}\big[(X'X)^{-1}\big]_{jj}^{1/2}} = \frac{\hat{\beta}_j - \beta_j^0}{se(\hat{\beta}_j)}, \qquad t^2 = F.$$
Under $H_0$, $t$ is centrally t-distributed with $n - k$ degrees of freedom. Under $H_1$, $t$ is non-centrally t-distributed.
5.5 Confidence intervals

With the help of the t-statistic we can construct a confidence interval for the parameter $\beta_j$:
$$t = \frac{\hat{\beta}_j - \beta_j}{se(\hat{\beta}_j)} \sim t_{n-k}.$$
Therefore a confidence interval for $\beta_j$ is given by
$$\Pr\Big(\hat{\beta}_j - t_{\alpha/2}\, se(\hat{\beta}_j) \le \beta_j \le \hat{\beta}_j + t_{\alpha/2}\, se(\hat{\beta}_j)\Big) = 1 - \alpha,$$
where $t_{\alpha/2}$ is the $(1 - \alpha/2)$ quantile of the t-distribution with parameter $n - k$.
For example the 95% confidence interval for $\beta_j$ is given by
$$\big[\hat{\beta}_j - c\, se(\hat{\beta}_j),\ \hat{\beta}_j + c\, se(\hat{\beta}_j)\big],$$
where $c$ is the 97.5%-quantile of the $t_{n-k}$-distribution.

To find a confidence interval for $\sigma^2$ we remember that
$$\frac{(n-k)\,\hat{\sigma}^2}{\sigma^2} = \frac{e'e}{\sigma^2} \sim \chi^2_{n-k},$$
and we construct the confidence interval by
$$\Pr\Big(\frac{(n-k)\,\hat{\sigma}^2}{\chi^2_{1-\alpha/2}} \le \sigma^2 \le \frac{(n-k)\,\hat{\sigma}^2}{\chi^2_{\alpha/2}}\Big) = 1 - \alpha,$$
where $\chi^2_{\gamma}$ denotes the $\gamma$-quantile of the $\chi^2_{n-k}$-distribution.
5.6 Testing linear restrictions

We consider again the classical linear regression model with normally distributed errors, in which assumptions A1 – A6 hold:
$$y = X\beta + \varepsilon.$$
In addition we impose a set of $q$ linear restrictions on the model,
$$r_{11}\beta_1 + r_{12}\beta_2 + \dots + r_{1k}\beta_k = d_1$$
$$\vdots$$
$$r_{q1}\beta_1 + r_{q2}\beta_2 + \dots + r_{qk}\beta_k = d_q.$$
In matrix form the restrictions are written as
$$R\beta = d,$$
where $R$ is a $q \times k$ matrix of known constants with linearly independent rows and
$q \le k$. The vector $d$ is a $q \times 1$ vector of known constants.

Testing one linear restriction

We start with the case of $q = 1$ where we test the following restriction:
$$H_0: r_1\beta_1 + r_2\beta_2 + \dots + r_k\beta_k = r'\beta = d.$$
The sample estimate of $r'\beta$ is given by $r'\hat{\beta}$,
and we can construct a t-test for it by
$$t = \frac{r'\hat{\beta} - d}{se(r'\hat{\beta})} \sim t_{n-k}.$$
We still need to specify the standard deviation $se(r'\hat{\beta})$. Under the assumption of normality of the error terms it is determined by Corollary 8:
$$se(r'\hat{\beta}) = \sqrt{\widehat{\operatorname{Var}}(r'\hat{\beta})} = \hat{\sigma}\sqrt{r'(X'X)^{-1}r}.$$
The other possibility to construct the test is by re-parametrisation. We will see how
this works in an example.
this works in an example.

Example 20 The aim is to compare the returns to education between a two-year college (junior college) and a four-year college (university). The model we have in mind
is
$$\log(wage) = \beta_0 + \beta_1\, jc + \beta_2\, univ + \beta_3\, exper + \varepsilon.$$
The hypothesis of interest: is one year of a junior college worth one year of university
education?
$$H_0: \beta_1 = \beta_2, \qquad H_1: \beta_1 \neq \beta_2.$$
The restriction equation is given by
$$\beta_1 - \beta_2 = d = 0,$$
and the test statistic we derived before is
$$t = \frac{\hat{\beta}_1 - \hat{\beta}_2}{se(\hat{\beta}_1 - \hat{\beta}_2)}.$$
Now we estimate the model and get the following estimation result (standard errors in
parentheses):
$$\widehat{\log(wage)} = 1.430 + 0.098\, jc + 0.124\, univ + 0.019\, exper$$
$$\qquad\qquad\quad (0.270)\quad\ (0.031)\qquad (0.035)\qquad\ \ (0.008)$$
The difference between the coefficients of interest is $\hat{\beta}_1 - \hat{\beta}_2 = -0.026$, which is −2.6%
of wage. Is this statistically significant?
For the re-parametrisation method we define a new parameter $\theta$ by
$$\theta = \beta_1 - \beta_2.$$
We want to test
$$H_0: \theta = 0, \qquad H_1: \theta \neq 0.$$
It is possible to rewrite the model so that $\theta$ appears directly as the parameter of one of the
independent variables:
$$\log(wage) = \beta_0 + (\theta + \beta_2)\, jc + \beta_2\, univ + \beta_3\, exper + \varepsilon = \beta_0 + \theta\, jc + \beta_2\,(jc + univ) + \beta_3\, exper + \varepsilon,$$
where $totcoll = jc + univ$ is the total number of years of college.
OLS estimation of the new model gives
$$\widehat{\log(wage)} = 1.430 - 0.026\, jc + 0.124\, totcoll + 0.019\, exper$$
$$\qquad\qquad\quad (0.270)\quad\ (0.018)\qquad (0.035)\qquad\quad (0.008)$$
Now we immediately see that $t_\theta = -0.026/0.018 \approx -1.44$, not significantly different from
zero at the 5% level.
Note:
- $\hat{\beta}_0$ and $\hat{\beta}_3$ are equal for both estimated equations,
- $R^2$ is the same as well.
A confidence interval for $\theta$ is given by $\hat{\theta} \pm c\cdot se(\hat{\theta})$.

Testing q linear restrictions

Now we will generalise the two approaches that were introduced in the example above:
- direct approach,
- re-parametrisation approach.
We consider the general hypothesis
$$H_0: R\beta = d.$$
To simplify the analysis we partition $X$ in two groups of variables according to its
columns. One group of variables consists of $q$ linearly independent columns and the
second group consists of the remaining $k - q$ columns. To be consistent in notation
the linearly independent columns come last in $R$ (that means they form $X_2$):
$$X = (X_1 \ X_2), \qquad R = (R_1 \ R_2), \qquad R\beta = d.$$
Consequently, in the restricted model (the model on which the restrictions are imposed)
only the first $k - q$ elements of $\beta$ are free to vary.
All the hypotheses we studied in the previous section are special cases of linear restrictions. Examples of how the hypotheses translate into linear restrictions are
1. $H_0: \beta = \beta_0$: $R = I_k$, $d = \beta_0$.

2. $H_0: \beta_2 = \beta_2^0$ (a subvector): $R = (0 \ \ I_q)$, $d = \beta_2^0$.

3. $H_0: \beta_j = \beta_j^0$: $R = (0, \dots, 0, 1, 0, \dots, 0)$ with the 1 in position $j$, $d = \beta_j^0$.

4. Several restrictions at once, e.g. $H_0: \beta_2 = \beta_3,\ \beta_4 = 0$:
$$R = \begin{pmatrix} 0 & 1 & -1 & 0 & \cdots & 0 \\ 0 & 0 & 0 & 1 & \cdots & 0 \end{pmatrix}, \qquad d = \begin{pmatrix} 0 \\ 0 \end{pmatrix}.$$
Given the OLS-estimator $\hat{\beta}$, our interest centres on the discrepancy vector
$$m = R\hat{\beta} - d.$$
Remember
$$\varepsilon \sim N(0, \sigma^2 I), \qquad \hat{\beta} \sim N\big(\beta, \sigma^2(X'X)^{-1}\big),$$
$$R\hat{\beta} \sim N\big(R\beta, \sigma^2 R(X'X)^{-1}R'\big),$$
so that under $H_0: R\beta = d$
$$m = R\hat{\beta} - d \sim N\big(0, \sigma^2 R(X'X)^{-1}R'\big).$$
To construct a test statistic we use the approach of the Wald test:
$$W = m'\big[\operatorname{Var}(m)\big]^{-1}m = (R\hat{\beta} - d)'\big[\sigma^2 R(X'X)^{-1}R'\big]^{-1}(R\hat{\beta} - d) \sim \chi^2_q.$$
$\sigma^2$, however, is still unknown, therefore we will construct an F-test.
We know that $W \sim \chi^2_q$; by a result analogous to the one in Theorem 19 it is independent from
$$\frac{e'e}{\sigma^2} = \frac{(n-k)\,\hat{\sigma}^2}{\sigma^2} \sim \chi^2_{n-k},$$
and an F-statistic is constructed by
$$F = \frac{(R\hat{\beta} - d)'\big[R(X'X)^{-1}R'\big]^{-1}(R\hat{\beta} - d)/q}{e'e/(n-k)} = \frac{(R\hat{\beta} - d)'\big[\hat{\sigma}^2 R(X'X)^{-1}R'\big]^{-1}(R\hat{\beta} - d)}{q}.$$
The two following applications show that this approach results in exactly the same test
statistics we derived in the previous section.

1. Hypothesis III on a single parameter:
$$H_0: \beta_j = \beta_j^0, \qquad R = (0, \dots, 0, 1, 0, \dots, 0), \qquad d = \beta_j^0,$$
$$R(X'X)^{-1}R' = \big[(X'X)^{-1}\big]_{jj}, \qquad \hat{\sigma}^2\big[(X'X)^{-1}\big]_{jj} = se(\hat{\beta}_j)^2,$$
$$F = \frac{(\hat{\beta}_j - \beta_j^0)^2}{se(\hat{\beta}_j)^2} = t_j^2.$$

2. Hypothesis II on a group of parameters:
$$H_0: \beta_2 = 0,$$
where $R = (0 \ \ I_q)$ is a partitioned matrix according to our notation and $d = (0, \dots, 0)'$
is a $q \times 1$ vector of zeros.
We make use of a result on the inversion of partitioned matrices: for
$$X'X = \begin{pmatrix} X_1'X_1 & X_1'X_2 \\ X_2'X_1 & X_2'X_2 \end{pmatrix}$$
the bottom right block of $(X'X)^{-1}$ equals $\big(X_2'M_1X_2\big)^{-1}$ with $M_1 = I - X_1(X_1'X_1)^{-1}X_1'$.
Multiplying $(X'X)^{-1}$ from the left and from the right with the partitioned matrix $R = (0 \ \ I_q)$
cuts out this bottom right corner of $(X'X)^{-1}$. Thus
$$F = \frac{\hat{\beta}_2'\big[R(X'X)^{-1}R'\big]^{-1}\hat{\beta}_2/q}{e'e/(n-k)} = \frac{\hat{\beta}_2'X_2'M_1X_2\,\hat{\beta}_2/q}{e'e/(n-k)},$$
which is the well-known F-test statistic with $H = X_2'M_1X_2$ the matrix defined
in the last section.

Further interpretation of the F-test

We start with examining the quadratic form
$$\hat{\beta}_2'X_2'M_1X_2\hat{\beta}_2.$$
Premultiplying the fitted regression equation of the unrestricted model,
$$y = X_1\hat{\beta}_1 + X_2\hat{\beta}_2 + e,$$
by $M_1$ yields
$$M_1 y = M_1X_1\hat{\beta}_1 + M_1X_2\hat{\beta}_2 + M_1 e.$$
We know that $M_1X_1 = 0$ and $M_1 e = e$ (since $X_1'e = 0$), thus
$$M_1 y = M_1X_2\hat{\beta}_2 + e.$$
Transpose this equation and multiply by $M_1 y$:
$$y'M_1 y = \hat{\beta}_2'X_2'M_1X_2\hat{\beta}_2 + e'e,$$
so that
$$\hat{\beta}_2'X_2'M_1X_2\hat{\beta}_2 = y'M_1 y - e'e = e_R'e_R - e_U'e_U,$$
where the subscripts $R$ and $U$ stand for "restricted" and "unrestricted".
Applying this result we can now rewrite the F-test statistic:
$$F = \frac{(e_R'e_R - e_U'e_U)/q}{e_U'e_U/(n-k)} = \frac{(R^2_U - R^2_R)/q}{(1 - R^2_U)/(n-k)}.$$
The intuition for the reformulated F-test statistic comes from the following paragraphs, so be a little patient with the interpretation.

Restricted least squares estimation

Here we have the restricted linear regression model in mind. That means we want to
solve the optimisation problem
$$\min_{\beta}\ S(\beta) = (y - X\beta)'(y - X\beta) \quad \text{subject to } R\beta = d.$$
Let us set up the Lagrange function
$$\mathcal{L}(\beta, \lambda) = (y - X\beta)'(y - X\beta) + 2\lambda'(R\beta - d)$$
and derive the first order conditions
$$\frac{\partial \mathcal{L}}{\partial \beta} = -2X'y + 2X'X\beta + 2R'\lambda = 0,$$
$$\frac{\partial \mathcal{L}}{\partial \lambda} = 2(R\beta - d) = 0.$$
If we call the optimal parameter value $\beta^*$, solving the first order conditions results in
$$\beta^* = (X'X)^{-1}X'y - (X'X)^{-1}R'\lambda = \hat{\beta} - (X'X)^{-1}R'\lambda,$$
$$R\beta^* = R\hat{\beta} - R(X'X)^{-1}R'\lambda = d,$$
$$\lambda = \big[R(X'X)^{-1}R'\big]^{-1}(R\hat{\beta} - d),$$
and we get
$$\beta^* = \hat{\beta} - (X'X)^{-1}R'\big[R(X'X)^{-1}R'\big]^{-1}(R\hat{\beta} - d). \qquad (8)$$
For the variance of this restricted estimator we note
$$\operatorname{Var}(\beta^*) = \sigma^2(X'X)^{-1} - \sigma^2(X'X)^{-1}R'\big[R(X'X)^{-1}R'\big]^{-1}R(X'X)^{-1},$$
where the subtracted term is a non-negative definite matrix.

The variance of the restricted estimator $\beta^*$ equals the variance of the unrestricted OLS
estimator minus a non-negative definite matrix. This implies that
$$\operatorname{Var}(\beta^*) \le \operatorname{Var}(\hat{\beta}).$$
There occurs a reduction in the variance if we move from the unrestricted model
to the restricted model. The intuition is that the restrictions contain additional information on the model and consequently the precision of the estimation increases. This
leads us to a new idea for a test. If the restrictions are valid in the general unrestricted
model then the reduction in variance should not be very large. On the other hand, if the
restrictions are not valid in general, they contain substantial additional information and
therefore we should observe a large reduction in the variance of the restricted estimate.
So we construct a test based on the loss of fit. Denote by $e^*$ the residuals from the
restricted linear regression model,
$$e^* = y - X\beta^* = y - X\hat{\beta} + X(\hat{\beta} - \beta^*) = e + X(\hat{\beta} - \beta^*).$$
Note: According to the notation from above $e^* = e_R$. The unrestricted model given
by $y = X\beta + \varepsilon$ has the residuals $e = e_U$.
Transpose the equation above and multiply by $e^*$ (the cross terms vanish because $X'e = 0$):
$$e^{*\prime}e^* = e'e + (\hat{\beta} - \beta^*)'X'X(\hat{\beta} - \beta^*).$$
Now substitute for $\hat{\beta} - \beta^*$ from equation (8):
$$e^{*\prime}e^* - e'e = (R\hat{\beta} - d)'\big[R(X'X)^{-1}R'\big]^{-1}(R\hat{\beta} - d).$$
Finally we end up with the F-test statistic
$$F = \frac{(e^{*\prime}e^* - e'e)/q}{e'e/(n-k)} = \frac{(R\hat{\beta} - d)'\big[R(X'X)^{-1}R'\big]^{-1}(R\hat{\beta} - d)/q}{e'e/(n-k)} = \frac{(e_R'e_R - e_U'e_U)/q}{e_U'e_U/(n-k)}.$$
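The restricted least squares estimator of equation (8) and the F-statistic can be coded directly. The sketch below uses a hypothetical restriction $\beta_2 + \beta_3 = 1$ on simulated data (an assumption for illustration only) and checks that the Wald form and the loss-of-fit form give the same value:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(9)
    n, k = 120, 3
    X = np.column_stack([np.ones(n), rng.standard_normal((n, k - 1))])
    y = X @ np.array([1.0, 0.6, 0.4]) + rng.standard_normal(n)

    R = np.array([[0.0, 1.0, 1.0]])       # restriction: beta_2 + beta_3 = 1
    d = np.array([1.0])
    q = R.shape[0]

    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    e = y - X @ beta_hat

    A = R @ XtX_inv @ R.T
    beta_star = beta_hat - XtX_inv @ R.T @ np.linalg.solve(A, R @ beta_hat - d)  # equation (8)
    e_star = y - X @ beta_star

    m = R @ beta_hat - d
    F_direct = (m @ np.linalg.solve(A, m) / q) / (e @ e / (n - k))
    F_lossfit = ((e_star @ e_star - e @ e) / q) / (e @ e / (n - k))
    print(F_direct, F_lossfit, stats.f.ppf(0.95, q, n - k))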
6 Some tests for specification error

Given the assumptions for the multiple linear regression model in section 3 we derived
estimators and showed that they have desirable properties (linearity, unbiasedness and
minimal variance). Further we employed an array of inference procedures. However,
there is a crucial question: how do we know whether the assumptions underlying our estimation framework are valid for the given data set?
If the assumptions are wrong there is a specification error in the model.

6.1 Tests for structural change

In the classical linear regression model the assumptions apply to all observations in the
whole sample. In this section we want to test the hypothesis that some or all regression
coefficients are different in subsets of the sample.
Applications for these tests occur in different contexts, mainly depending on the data type:
Time series data: Regime shifts

Cross-sectional data: Differences among population groups

Pooled cross-sectional data: Training models, policy evaluations

Structural break in all variables

We explain the derivation of the Chow test on the basis of examples using the Longley
data. The dependent variable in the regression models will be employment, either total
or in one of two sectors. The independent variables will be a constant, a time
trend, GNP, the GNP deflator and the number of armed forces. The data set spans
the years 1947 - 1962. Within this period falls the Korean war, ending in 1953. We
consider a model for employment,
$$employment_t = \beta_1 + \beta_2\, year_t + \beta_3\, GNP_t + \beta_4\, GNPdeflator_t + \beta_5\, armedforces_t + \varepsilon_t. \qquad (9)$$
Is there a difference between wartime 1947 - 1953 and peacetime 1954 - 1962? We
partition the observations according to these periods:
$$y_1, X_1 \ \text{ in the years 1947 - 1953}, \qquad y_2, X_2 \ \text{ in the years 1954 - 1962}.$$
The unrestricted model is the model which allows for different parameters in both
periods,
$$\begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = \begin{pmatrix} X_1 & 0 \\ 0 & X_2 \end{pmatrix}\begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \end{pmatrix};$$
the estimator for the parameters is equivalent to the one we get from estimating two
separate regressions,
$$\hat{\beta}_1 = (X_1'X_1)^{-1}X_1'y_1, \qquad \hat{\beta}_2 = (X_2'X_2)^{-1}X_2'y_2.$$
The sum of squared residuals in the unrestricted model equals
$$e_U'e_U = e_1'e_1 + e_2'e_2.$$
To test whether the parameters are equal in both periods we set up the hypothesis
$$H_0: \beta_1 = \beta_2, \quad \text{i.e. } R\beta = d \ \text{ with } R = (I_k \ \ -I_k), \ d = 0.$$
The test statistic for this problem is given by
$$F = \frac{(R\hat{\beta} - d)'\big[R(X'X)^{-1}R'\big]^{-1}(R\hat{\beta} - d)/k}{e_U'e_U/(n - 2k)}.$$
The computation of the test statistic in this form requires additional programming
steps. We can, however, choose to estimate the restricted model and simplify the
computations. The restricted model is the model which pools all observations. As
no differences between the time periods are assumed, OLS can be estimated for the
complete sample,
$$\begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}\beta + \varepsilon.$$
The residuals from the restricted model are $e_R$.
We formulate the test statistic by comparison of the residuals. This test is also known
as the Chow Breakpoint Test:
$$F = \frac{(e_R'e_R - e_U'e_U)/k}{e_U'e_U/(n - 2k)}.$$
Estimation results from the Longley data give

Years   1947 - 62   1947 - 53   1954 - 62
SSR     4898.6      345.2       800.2

$$F = \frac{\big(4898.6 - (345.2 + 800.2)\big)/5}{(345.2 + 800.2)/(16 - 10)} \approx 3.9,$$
which for $\alpha = 0.05$ is below the critical value 4.39; the hypothesis of equal parameters in both periods is therefore not rejected.
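A generic sketch of the Chow breakpoint test (Python code with simulated data; the break point and dimensions are placeholders, not the Longley figures):

    import numpy as np
    from scipy import stats

    def ssr(y, X):
        b = np.linalg.solve(X.T @ X, X.T @ y)
        e = y - X @ b
        return e @ e

    rng = np.random.default_rng(10)
    n, k, n1 = 80, 3, 40                                   # n1 observations before the break
    X = np.column_stack([np.ones(n), rng.standard_normal((n, k - 1))])
    y = X @ np.array([1.0, 0.5, -0.5]) + rng.standard_normal(n)

    ssr_r = ssr(y, X)                                      # pooled (restricted) model
    ssr_u = ssr(y[:n1], X[:n1]) + ssr(y[n1:], X[n1:])      # two subsample regressions
    F = ((ssr_r - ssr_u) / k) / (ssr_u / (n - 2 * k))
    print(F, stats.f.ppf(0.95, k, n - 2 * k))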


In this example we were interested in whether the regime shift affects all parameters. It might,
however, be that the differences in regimes only affect certain parameters, e.g. only the
intercept terms or only the slopes.

Different constant terms

In the next example we keep the model from equation (9) above, but we consider differences in employment between the agricultural and the nonagricultural sectors. Employment levels in both sectors are of different magnitude. Therefore we could allow
for different intercepts and test whether the independent variables affect employment
in both sectors differently. This means we test whether the slope coefficients alone are
different. Now $y_1$ and $y_2$ correspond to employment in each sector and the matrices
of independent variables are equal, $X_1 = X_2$.
We can formulate the restricted model as follows:
$$\begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = \begin{pmatrix} \iota & 0 & X^* \\ 0 & \iota & X^* \end{pmatrix}\begin{pmatrix} \alpha_1 \\ \alpha_2 \\ \beta \end{pmatrix} + \varepsilon.$$
The first two columns are dummy variables indicating the sector in which the observation falls. $X^*$ includes all columns of $X$ except the constant.

Estimation results show $F = 106.8$, significant.
To investigate further which variable is responsible for the differences in employment
in both sectors we change to a subset of coefficients. In the next step we allow for
different intercepts and time trends in both sectors.
                          Agricultural   Non Agricultural   Restricted M.
Constant (Agricultural)   201.8                             626.2
Constant (Non Agric.)                    1086.7             662.4
SSR                       241.2          5037.9             107780.2

The restricted model now allows for sector-specific intercepts and time trends, while the remaining slope coefficients are restricted to be equal across sectors:
$$\begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = \begin{pmatrix} \iota & 0 & t & 0 & X^{**} \\ 0 & \iota & 0 & t & X^{**} \end{pmatrix}\begin{pmatrix} \alpha_1 \\ \alpha_2 \\ \gamma_1 \\ \gamma_2 \\ \beta \end{pmatrix} + \varepsilon,$$
where $t$ is the time trend and $X^{**}$ contains the remaining regressors.
The test gives $F = 9.22$, critical value 3.05.

Dummy variables

We will introduce dummy variables on the basis of examples of wage equations. Individual wages in a cross-sectional sample are explained by the degree of education and
further personal characteristics.
So far in our examples we mainly worked with quantitative variables like wage, GNP, years of education,
etc.
Qualitative variables are, for example, sex, race, marital status, sector or region.
How can we include qualitative variables as independent variables in the regression
model?

Binary variables. We define a binary variable - a dummy variable - for example
$$female_i = \begin{cases} 1 & \text{if person } i \text{ is a woman} \\ 0 & \text{otherwise.} \end{cases}$$
We can use this variable to estimate differences in the mean wage for men and women
in the model
$$wage = \beta_0 + \delta_0\, female + \beta_1\, educ + \varepsilon.$$
The parameter
$$\delta_0 = E(wage \mid female = 1, educ) - E(wage \mid female = 0, educ)$$
estimates the average wage differential between men and women holding years of
education fixed.

Multiple categories. In the same way we can define dummy variables for multiple
categories like
married woman, married man, unmarried woman, unmarried man.
In the model we have to exclude one reference category, e.g.
$$\log(wage) = \beta_0 + \delta_1\, marrfem + \delta_2\, marrmale + \delta_3\, singfem + \beta_1\, educ + \varepsilon.$$
The parameters $\delta_1, \delta_2, \delta_3$ give the wage differentials of the other groups relative to single men,
again holding years of education fixed.

Ordinal variables. Suppose we only know the individual's highest educational degree instead of the years of education. There is information on: primary school, high
school, college education, etc. It is possible to construct a variable of the form
$$educ = \begin{cases} 1 & \text{primary school} \\ 2 & \text{high school} \\ 3 & \text{college} \\ \ \vdots \end{cases}$$
and include it in the model. But it is preferable to form a set of dummy variables $D_1$,
$D_2$, $D_3$, ..., because the differences between the educational categories may not be
linear.
Interaction of variables. We already had an example for the interaction of dummy
variables: interacting the categories woman/man and married/unmarried.
It is also possible to interact dummies and quantitative variables. Consider the model
$$\log(wage) = \beta_0 + \delta_0\, female + \beta_1\, educ + \delta_1\,(female \cdot educ) + \varepsilon = (\beta_0 + \delta_0\, female) + (\beta_1 + \delta_1\, female)\, educ + \varepsilon.$$
With this regression model we can examine the question whether there are differences in the
return to education for men and women. We allow for gender-specific slopes of the
wage profiles as well as for different intercepts.

The hypothesis that no differences in returns to education between the sexes exist is
given by
$$H_0: \delta_1 = 0,$$
and the hypothesis that there are no wage differences between women and men is the
following:
$$H_0: \delta_0 = 0, \ \delta_1 = 0.$$

6.2 Prediction

After estimation of the model parameters suppose we want to predict the value $y_0$ for
some specific $k \times 1$ vector of regressors $x_0$.
In section 3.3 we have already shown that
$$\hat{y}_0 = x_0'\hat{\beta}$$
is the best linear unbiased predictor for $y_0$.
Now we can construct a confidence interval for the expected value
$$E(y_0) = x_0'\beta.$$
According to the regression model $\hat{y}_0 = x_0'\hat{\beta}$ is also the predictor for $E(y_0)$ and its
variance is given by
$$\operatorname{Var}(x_0'\hat{\beta}) = x_0'\operatorname{Var}(\hat{\beta})\,x_0 = \sigma^2\, x_0'(X'X)^{-1}x_0.$$
Consequently we get
$$\frac{x_0'\hat{\beta} - x_0'\beta}{\sigma\sqrt{x_0'(X'X)^{-1}x_0}} \sim N(0, 1),$$
and replacing $\sigma$ by its estimate $\hat{\sigma}$ gives
$$\frac{x_0'\hat{\beta} - x_0'\beta}{\hat{\sigma}\sqrt{x_0'(X'X)^{-1}x_0}} \sim t_{n-k}.$$
Hence we can construct a confidence interval, with significance level $\alpha$, for $E(y_0)$ by
$$x_0'\hat{\beta} \pm t_{\alpha/2}\, \hat{\sigma}\sqrt{x_0'(X'X)^{-1}x_0}.$$

In the next step we want to construct a prediction interval for $y_0 = x_0'\beta + \varepsilon_0$. We
define the prediction error by
$$\hat{e}_0 = \hat{y}_0 - y_0 = x_0'(\hat{\beta} - \beta) - \varepsilon_0.$$
The variance of the prediction error $\hat{e}_0$ is given by
$$\operatorname{Var}(\hat{e}_0) = E(\hat{e}_0\hat{e}_0') = \sigma^2\, x_0'(X'X)^{-1}x_0 + \sigma^2 = \sigma^2\big(1 + x_0'(X'X)^{-1}x_0\big).$$
Now we have
$$\frac{\hat{y}_0 - y_0}{\hat{\sigma}\sqrt{1 + x_0'(X'X)^{-1}x_0}} \sim t_{n-k},$$
and the prediction interval for $y_0$ is
$$x_0'\hat{\beta} \pm t_{\alpha/2}\, \hat{\sigma}\sqrt{1 + x_0'(X'X)^{-1}x_0}.$$
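A sketch of a point forecast with its confidence and prediction intervals (Python, simulated data; x0 is an arbitrary new regressor vector chosen for illustration):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(11)
    n, k, alpha = 60, 3, 0.05
    X = np.column_stack([np.ones(n), rng.standard_normal((n, k - 1))])
    y = X @ np.array([2.0, 1.0, -0.5]) + rng.standard_normal(n)

    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    e = y - X @ beta_hat
    sigma_hat = np.sqrt(e @ e / (n - k))

    x0 = np.array([1.0, 0.5, -1.0])                 # new observation (includes the constant)
    y0_hat = x0 @ beta_hat
    c = stats.t.ppf(1 - alpha / 2, df=n - k)
    h = x0 @ XtX_inv @ x0
    ci_mean = (y0_hat - c * sigma_hat * np.sqrt(h), y0_hat + c * sigma_hat * np.sqrt(h))
    pi_y0 = (y0_hat - c * sigma_hat * np.sqrt(1 + h), y0_hat + c * sigma_hat * np.sqrt(1 + h))
    print(y0_hat, ci_mean, pi_y0)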
A convenient method of computing forecasts:

Suppose that estimation is based upon $n$ observations in the model $y = X\beta + \varepsilon$ and
we want to make $m$ forecasts of $y^0 = (y_1^0, \dots, y_m^0)'$ for given regressors $X^0$.
First construct the augmented regression model
$$\begin{pmatrix} y \\ y^0 \end{pmatrix} = \begin{pmatrix} X & 0 \\ X^0 & I_m \end{pmatrix}\begin{pmatrix} \beta \\ \gamma \end{pmatrix} + \varepsilon.$$
Each variable in the second block of regressors is a dummy variable which takes the value 1
for one observation and 0 for all other observations.

From OLS estimation of the augmented regression model we get the following results:
- The regression produces the coefficient vector $(\hat{\beta}', \hat{\gamma}')'$, where $\hat{\beta}$ are the OLS coefficients from the original model and $\hat{\gamma} = y^0 - X^0\hat{\beta}$ is the vector of prediction errors, so that $X^0\hat{\beta}$ are the predictions for $y^0$.
- The residuals from the augmented regression equal the residuals of the original model (the last $m$ residuals are zero), since the coefficients are the same.
- The estimated covariance matrix for $(\hat{\beta}', \hat{\gamma}')'$ is given by
$$\hat{\sigma}^2\begin{pmatrix} X'X + X^{0\prime}X^0 & X^{0\prime} \\ X^0 & I_m \end{pmatrix}^{-1},$$
which contains $\widehat{\operatorname{Var}}(\hat{\beta}) = \hat{\sigma}^2(X'X)^{-1}$ in the upper left and $\widehat{\operatorname{Var}}(\hat{e}_0) = \hat{\sigma}^2\big(I_m + X^0(X'X)^{-1}X^{0\prime}\big)$, the prediction error variance, in the lower right block.

Note:
A dummy variable that takes the value 1 for only one observation has the effect of
deleting this observation from the least squares computations.

6.3 Further specification tests

CHOW forecast test

This test is an alternative to the Chow breakpoint test if there is an insufficient number
of observations available in one of the subsamples. The concept of the test is based on an evaluation of the
predictive power of the model.
Out-of-sample forecasts provide an easy check of the model fit. To see how well the
estimated model predicts we might proceed in the following way.
First we estimate the OLS coefficients with $n_1$ observations and get the parameter
estimate
$$\hat{\beta}_1 = (X_1'X_1)^{-1}X_1'y_1.$$
Then we compute predictors for the other $n_2 = n - n_1$ observations,
$$\hat{y}_2 = X_2\hat{\beta}_1,$$
and obtain the prediction errors
$$\hat{e}_2 = y_2 - \hat{y}_2 = y_2 - X_2\hat{\beta}_1.$$
Finally we test the hypothesis $H_0: E(\hat{e}_2) = 0$.

Under the null hypothesis the restricted model is the regression model which pools all
observations,
$$\begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}\beta + \varepsilon.$$
Now we make use of the method for computing forecasts described above. As a
result of this method we get the unrestricted model, defined by
$$\begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = \begin{pmatrix} X_1 & 0 \\ X_2 & I_{n_2} \end{pmatrix}\begin{pmatrix} \beta \\ \gamma \end{pmatrix} + \varepsilon, \qquad e_U'e_U = e_1'e_1.$$
The F-test statistic is easily computed by
$$F = \frac{(e_R'e_R - e_1'e_1)/n_2}{e_1'e_1/(n_1 - k)}.$$
Note: Compare the test statistic of the CHOW Breakpoint test,
$$F = \frac{(e_R'e_R - e_U'e_U)/k}{e_U'e_U/(n - 2k)}.$$

6.4 Tests based on recursive estimation: CUSUM and CUSUMQ test

The next group of tests is based on a similar intuition: how good is the model's
ability to predict outside the range of observations used to estimate it?
The primary aim of the tests are applications to time series data. The tests are more
general than the CHOW tests in the sense that they do not require a prior specification
of when the structural break takes place.
The disadvantage of the CUSUM and CUSUMQ tests is, however, that they are of
limited power compared to the CHOW test.
First we introduce the concept of recursive residuals.

Recursive residuals

Suppose the sample contains $T$ observations. (We use $T$ instead of $n$ to indicate that
we are in a time-series setting.) Then the $t$-th recursive residual $e_t$ is defined as the one-step-ahead prediction error: the prediction error for $y_t$ from the model estimated with
only the first $t - 1$ observations,
$$e_t = y_t - x_t'\hat{\beta}_{t-1}, \qquad t = k+1, \dots, T,$$
where $x_t'$ corresponds to the $t$-th row of $X$ and $\hat{\beta}_{t-1}$ is the parameter estimate from the
model with $t - 1$ observations. The variance of the $t$-th recursive residual is given by
$$\operatorname{Var}(e_t) = \sigma^2\big(1 + x_t'(X_{t-1}'X_{t-1})^{-1}x_t\big),$$
where $X_{t-1}$ contains the first $t - 1$ rows of $X$. We define the $t$-th scaled residual as
$$w_t = \frac{y_t - x_t'\hat{\beta}_{t-1}}{\sqrt{1 + x_t'(X_{t-1}'X_{t-1})^{-1}x_t}}, \qquad t = k+1, \dots, T.$$
Thus, under the assumptions A1 – A6 and under the null hypothesis that the parameters
are constant during the full sample period, $w_t \sim N(0, \sigma^2)$. It can also be shown that
the scaled recursive residuals are pairwise uncorrelated.
The tests are based on the hypothesis that the distribution of $w_t$ does not change over
time.
1. CUSUM
The CUSUM test is based on the cumulative sum of the scaled recursive residuals,

$W_t = \sum_{j=k+1}^{t}\frac{w_j}{\hat\sigma}, \qquad t = k+1, \dots, T$

with

$\hat\sigma^2 = \frac{1}{T-k}\sum_{t=k+1}^{T}(w_t - \bar w)^2 \quad\text{and}\quad \bar w = \frac{1}{T-k}\sum_{t=k+1}^{T}w_t$

Under the null hypothesis $E(W_t) = 0$ and $Var(W_t) \approx t - k$.

The test is performed by plotting $W_t$ against time. Confidence bounds are obtained by two lines connecting the points $(k,\ \pm a\sqrt{T-k})$ and $(T,\ \pm 3a\sqrt{T-k})$. The parameter $a$ corresponds to the significance level; for $\alpha = 0.05$ one uses $a = 0.948$.
2. CUSUMQ
The CUSUMQ test is based on the cumulative sum of squares. It uses the test statistic

$S_t = \dfrac{\sum_{j=k+1}^{t} w_j^2}{\sum_{j=k+1}^{T} w_j^2}, \qquad t = k+1, \dots, T$

Since the scaled residuals are independent, the numerator and denominator of $S_t$ are approximately $\chi^2$ distributed and therefore

$E(S_t) \approx \dfrac{t-k}{T-k}$

Again the test statistic is plotted against time. Confidence bounds for $E(S_t)$, $t = k+1, \dots, T$, are constructed and plotted.
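The recursive residuals and the CUSUM path can be computed with a simple loop. Below is a minimal Python sketch (not part of the original notes); the function name and the convention that rows of X are ordered in time are assumptions.

```python
import numpy as np

def cusum_statistics(X, y):
    """Scaled recursive residuals w_t and their cumulative sums W_t (CUSUM path)."""
    T, k = X.shape
    w = []
    for t in range(k, T):                      # first usable period after k initial observations
        Xs, ys = X[:t], y[:t]                  # observations 1, ..., t-1 in the notes' notation
        XtX_inv = np.linalg.inv(Xs.T @ Xs)
        b = XtX_inv @ Xs.T @ ys
        xt = X[t]
        e_t = y[t] - xt @ b                    # one step ahead prediction error
        w.append(e_t / np.sqrt(1.0 + xt @ XtX_inv @ xt))
    w = np.array(w)
    sigma_hat = w.std(ddof=1)                  # estimate of sigma from the scaled residuals
    W = np.cumsum(w) / sigma_hat               # CUSUM path, to be plotted against time
    return w, W
```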
7 Asymptotic theory

Consider the estimation problem where we would like to estimate a parameter vector $\theta$ from a sample $y_1, \dots, y_n$. Let $\hat\theta_n$ be an estimator for $\theta$, i.e. let $\hat\theta_n = h(y_1, \dots, y_n)$ be a function of the sample. In the linear regression model $\hat\theta_n$ is a linear function of $y_1, \dots, y_n$, and we can easily express the expected value and the variance covariance matrix of $\hat\theta_n$ in terms of the first and second moments of the $y_i$, provided that they exist. In particular we saw that if the sample is normally distributed, so is $\hat\theta_n$. Frequently, however, the estimator of interest will be a nonlinear function of the sample, and the calculation of the exact expressions becomes very complex, or it is inappropriate to make specific assumptions about the distribution of the error. In view of these difficulties in obtaining the exact expressions for the characteristics of the estimators and their moments we will often have to resort to approximations for these exact expressions. Asymptotic theory is one of the ways of obtaining such expressions by essentially asking what happens to the exact expressions as the sample size tends to infinity. For example, if we are interested in the expected value of $\hat\theta_n$ and an exact expression is unavailable, we could ask if the expected value of $\hat\theta_n$ converges to $\theta$ in an appropriate sense.
In this section we give a short introduction to asymptotic theory and then apply it to the linear regression model. That means we drop assumption A6 about the normality of the error term and see what asymptotic theory can tell us about the distribution of the estimators $\hat\beta$ and $\hat\sigma^2$. Using convergence theorems like Laws of Large Numbers and Central Limit Theorems we will prove consistency and asymptotic normality of the OLS estimators.

7.1 Introduction to asymptotic theory

Various modes of convergence

First we define and discuss various modes of convergence for sequences of random variables taking their values in $\mathbb{R}$. The definitions and results can be extended to $k$-dimensional random vectors.

Definition 21 (Convergence in Probability) The sequence of random variables $x_n$ is said to converge to the random variable $x$ in probability if for every $\epsilon > 0$

$\lim_{n\to\infty} P(|x_n - x| > \epsilon) = 0$

We then write $x_n \xrightarrow{p} x$ or $\text{plim}\,x_n = x$.

Definition 22 (Convergence in Mean Square) The sequence of random variables $x_n$ is said to converge to the random variable $x$ in mean square if

$\lim_{n\to\infty} E(x_n - x)^2 = 0$

We then write $x_n \xrightarrow{m.s.} x$.

Definition 23 (Convergence in Distribution) The sequence of random variables $x_n$ is said to converge to the random variable $x$ in distribution if the distribution function $F_n$ of $x_n$ converges to the distribution function $F$ of $x$ at every continuity point of $F$.

We then write $x_n \xrightarrow{d} x$ and call $F$ the limiting distribution of $x_n$.

The following theorems establish relationships between the concepts of convergence.
Theorem 24 $x_n \xrightarrow{m.s.} x$ implies $x_n \xrightarrow{p} x$.

This is a direct consequence of Chebyshev's inequality, as we see from

Theorem 25 (Chebyshev) $E(x_n^2) \to 0$ implies $x_n \xrightarrow{p} 0$.

Proof. We can write

$E(x_n^2) = \int_{|x|\le\epsilon} x^2\,dF_n(x) + \int_{|x|>\epsilon} x^2\,dF_n(x) \ \ge\ \int_{|x|>\epsilon} x^2\,dF_n(x)$   (10)

and

$\int_{|x|>\epsilon} x^2\,dF_n(x) \ \ge\ \epsilon^2\int_{|x|>\epsilon} dF_n(x) = \epsilon^2\,P(|x_n| > \epsilon)$   (11)

Combining the inequalities in (10) and (11) we get

$P(|x_n| > \epsilon) \le \dfrac{E(x_n^2)}{\epsilon^2}$
This inequality is called Chebyshev’s inequality and it implies the theorem.
Note

1. A generalised form of Chebyshev's inequality is given by

$P\left(g(x_n) > \epsilon\right) \le \dfrac{E\,g(x_n)}{\epsilon}$

where $g$ is a nonnegative continuous function.

2. The general statement $x_n \xrightarrow{m.s.} x \ \Rightarrow\ x_n \xrightarrow{p} x$ follows from Theorem 25 if $x_n$ is replaced by $x_n - x$.

3. The converse of Theorem 25 is not generally true. For example, let

$x_n = \begin{cases} n & \text{with probability } 1/n \\ 0 & \text{with probability } 1 - 1/n \end{cases}$

Then $P(|x_n| > \epsilon) = 1/n \to 0$, so $x_n \xrightarrow{p} 0$, but $E(x_n^2) = n \to \infty$, so $x_n$ is not convergent in mean square.

The next corollary follows immediately from Theorem 25 by utilising the decomposition $E(x_n - c)^2 = Var(x_n) + \left(E(x_n) - c\right)^2$.

Corollary 26 Suppose $E(x_n) \to c$ and $Var(x_n) \to 0$; then $x_n \xrightarrow{p} c$.

This corollary is frequently used to show that for an estimator $\hat\theta_n$ with $E(\hat\theta_n) \to \theta$ (i.e. an asymptotically unbiased estimator) and with $Var(\hat\theta_n) \to 0$ we have $\hat\theta_n \xrightarrow{p} \theta$.
Theorem 27 $x_n \xrightarrow{p} x$ implies $x_n \xrightarrow{d} x$.

The converse of the theorem does not hold in general. To see this consider the following example. Let $x \sim N(0,1)$ and put $x_n = -x$ for all $n$. Then $x_n$ does not converge to $x$ in probability. But since each $x_n \sim N(0,1)$, evidently $x_n \xrightarrow{d} x$.

Convergence properties and transformations

Theorem 28 Let $y_n$ be a sequence of random variables whose first and second moments exist with $E(y_n) \to c$ and $E(y_n^2) \to c^2$, and let $x_n$ be a sequence of random variables with $\text{plim}\,x_n = x$.
Then $\text{plim}\,(x_n + y_n) = x + c$.

Theorem 29 (Slutsky) Let $x_n$ and $y_n$ be sequences of random variables with $\text{plim}\,x_n = c$, $\text{plim}\,y_n = d$, with $c$ and $d$ non-stochastic, and let $g(\cdot,\cdot)$ be a function continuous at $(c, d)$.

Then $\text{plim}\,g(x_n, y_n) = g(c, d)$.

Examples:

$\text{plim}\,(x_n + y_n) = c + d, \qquad \text{plim}\,(x_n y_n) = c\,d, \qquad \text{plim}\,(x_n / y_n) = c/d \ \ (d \ne 0)$

Such relations do not hold for expected values unless $x_n$ and $y_n$ are stochastically independent.

Theorem 30 (Bernstein) Let $x_n \xrightarrow{d} x$ and $y_n \xrightarrow{p} 0$. Then $x_n + y_n \xrightarrow{d} x$.

Theorem 31 Let $x_n = A_n z_n$ with $\text{plim}\,A_n = A$, and $z_n \xrightarrow{d} z$. Then $x_n \xrightarrow{d} Az$.

Asymptotic properties of estimators

Definition 32 (Consistency) A sequence of estimators $\hat\theta_n$ is called consistent if $\text{plim}\,\hat\theta_n = \theta$.

Definition 33 (Asymptotic Normality) A sequence of estimators $\hat\theta_n$ is called asymptotically normally distributed with mean $\mu$ and variance $\Sigma$ if

$\sqrt{n}\,(\hat\theta_n - \theta) \xrightarrow{d} z \quad\text{with}\quad z \sim N(\mu, \Sigma)$

Remark: $\hat\theta_n$ is called asymptotically efficient if it has minimum variance in the limiting distribution.

Let $\hat\theta_n$ be an estimator of the parameter $\theta$ and assume $\hat\theta_n \xrightarrow{p} \theta$. If $G_n$ denotes the cumulative distribution function of $\hat\theta_n$, then as $n \to \infty$

$G_n(c) \to \begin{cases} 0 & \text{for } c < \theta \\ 1 & \text{for } c > \theta \end{cases}$

To see this observe that for $c < \theta$

$P(\hat\theta_n \le c) = P(\hat\theta_n - \theta \le c - \theta) \le P\left(|\hat\theta_n - \theta| \ge \theta - c\right) \to 0$

and for $c > \theta$

$P(\hat\theta_n \le c) \ge 1 - P\left(|\hat\theta_n - \theta| > c - \theta\right) \to 1$

The result shows that the distribution of $\hat\theta_n$ collapses into the degenerate distribution at $\theta$, that means into

$G(c) = \begin{cases} 0 & \text{for } c < \theta \\ 1 & \text{for } c \ge \theta \end{cases}$

Consequently, knowing that $\hat\theta_n \xrightarrow{p} \theta$ does not provide information about the shape of $G_n$. This raises the question of how we can obtain information about $G_n$ based on some limiting process. Consider, for example, the case where $\hat\theta_n$ is the sample mean of iid random variables with mean $\theta$ and variance $\sigma^2$. Then $\hat\theta_n \xrightarrow{p} \theta$ in the light of Corollary 26, since $E(\hat\theta_n) = \theta$ and $Var(\hat\theta_n) = \sigma^2/n \to 0$. Consequently, as discussed above, the distribution of $\hat\theta_n$ collapses into the degenerate distribution at $\theta$. Observe, however, that the rescaled variable $\sqrt{n}(\hat\theta_n - \theta)$ has mean zero and variance $\sigma^2$. This indicates that the distribution of $\sqrt{n}(\hat\theta_n - \theta)$ will not collapse to a degenerate distribution. Using Theorem 35 below it can be shown that $\sqrt{n}(\hat\theta_n - \theta)$ converges to a $N(0, \sigma^2)$ distributed random variable. As a result we take $N(0, \sigma^2)$ as an approximation for the finite sample distribution of $\sqrt{n}(\hat\theta_n - \theta)$, and consequently take $N(\theta, \sigma^2/n)$ as an approximation for the finite sample distribution of $\hat\theta_n$.

Laws of large numbers

Let $x_i$, $i \in \mathbb{N}$, be a sequence of random variables with $E(x_i) = \mu_i$. Furthermore let $\bar x_n = \frac{1}{n}\sum_{i=1}^n x_i$ denote the sample mean, and let $\bar\mu_n = E(\bar x_n) = \frac{1}{n}\sum_{i=1}^n \mu_i$. A law of large numbers (LLN) then specifies conditions under which

$\bar x_n - E(\bar x_n) = \frac{1}{n}\sum_{i=1}^n (x_i - \mu_i)$

converges to zero in probability.

The usefulness of LLNs stems from the fact that many estimators can be expressed as continuous functions of sample averages of random variables. Thus to establish the probability limit of such an estimator we may try to establish in a first step the limits for the respective averages by means of LLNs. In a second step we may then use Theorem 29 to describe the actual limit for the estimator.

Theorem 34 (Law of large numbers, Kolmogorov) Let $x_i$ be a sequence of identically and independently distributed (iid) random variables with $E|x_i| < \infty$ and $E(x_i) = \mu$; then $\bar x_n \xrightarrow{p} \mu$ as $n \to \infty$.
Central limit theorems

Let $x_i$, $i \in \mathbb{N}$, be a sequence of iid random variables with $E(x_i) = \mu$ and $Var(x_i) = \sigma^2$, $0 < \sigma^2 < \infty$. Let $\bar x_n = \frac{1}{n}\sum_{i=1}^n x_i$ denote the sample mean. By Kolmogorov's law of large numbers for iid random variables it then follows that $\bar x_n - E(\bar x_n)$ converges to zero in probability. This implies that the limiting distribution of $\bar x_n - E(\bar x_n)$ is degenerate at zero, and thus no insight is gained from this limiting distribution regarding the shape of the distribution of the sample mean for finite $n$. Suppose we consider the rescaled quantity

$z_n = \sqrt{n}\left(\bar x_n - E(\bar x_n)\right) = \frac{1}{\sqrt{n}}\sum_{i=1}^n (x_i - \mu)$

Then the variance of the rescaled expression is $\sigma^2$ for all $n$, indicating that its limiting distribution will not be degenerate. Theorems that provide results concerning the limiting distribution of expressions like that are called central limit theorems (CLT).

Theorem 35 (Lindeberg - Lévy) Let $x_i$ be a sequence of iid random variables with $E(x_i) = \mu$, $Var(x_i) = \sigma^2 < \infty$. Then

$\frac{1}{\sqrt{n}}\sum_{i=1}^n \frac{x_i - \mu}{\sigma} \xrightarrow{d} N(0, 1)$

Theorem 36 (Lindeberg - Feller) Let $x_i$ be a sequence of independent random variables with $E(x_i) = \mu_i$ and $Var(x_i) = \sigma_i^2 < \infty$. Let $\bar\sigma_n^2 = \frac{1}{n}\sum_{i=1}^n \sigma_i^2$ and suppose that $\bar\sigma_n^2 > 0$, except for finitely many $n$. If for every $\epsilon > 0$

$\lim_{n\to\infty}\ \frac{1}{n\,\bar\sigma_n^2}\sum_{i=1}^n E\left[(x_i - \mu_i)^2\,\mathbf{1}\{|x_i - \mu_i| > \epsilon\sqrt{n}\,\bar\sigma_n\}\right] = 0$

then

$\sqrt{n}\,\frac{\bar x_n - \bar\mu_n}{\bar\sigma_n} \xrightarrow{d} N(0, 1)$

7.2 Asymptotic properties of OLS estimators

Consistency of $\hat\beta$

Consider the classical linear regression model under assumptions A1-A5. Assume that

$\lim_{n\to\infty}\frac{1}{n}X'X = Q$

and $Q$ is a positive definite matrix. Then

$\hat\beta = (X'X)^{-1}X'y = \beta + \left(\frac{1}{n}X'X\right)^{-1}\frac{1}{n}X'\varepsilon$

$E(\hat\beta) = \beta + \left(\frac{1}{n}X'X\right)^{-1}\frac{1}{n}X'E(\varepsilon) = \beta$   (12)

$Var(\hat\beta) = E(\hat\beta - \beta)(\hat\beta - \beta)' = \frac{\sigma^2}{n}\left(\frac{1}{n}X'X\right)^{-1} \to \frac{\sigma^2}{n}\,Q^{-1} \to 0$   (13)

From equations (12) and (13) we get the conditions for Corollary 26, which implies that $\hat\beta \xrightarrow{p} \beta$.
Under assumptions A1 - A5 and $\lim \frac{1}{n}X'X = Q$ the estimator $\hat\beta$ is consistent.

Consistency of $\hat\sigma^2$

$\hat\sigma^2 = \frac{e'e}{n-k} = \frac{\varepsilon'M\varepsilon}{n-k} = \frac{n}{n-k}\left[\frac{\varepsilon'\varepsilon}{n} - \frac{\varepsilon'X}{n}\left(\frac{X'X}{n}\right)^{-1}\frac{X'\varepsilon}{n}\right]$

The leading constant converges to 1. The second term in brackets converges to zero. That leaves

$\frac{1}{n}\varepsilon'\varepsilon = \frac{1}{n}\sum_{i=1}^n\varepsilon_i^2$

Assuming that the errors are independent, $\frac{1}{n}\varepsilon'\varepsilon$ is the mean of a random sample, and we can apply the law of large numbers (Theorem 34) and get

$\text{plim}\ \frac{1}{n}\varepsilon'\varepsilon = \sigma^2$

So under assumptions A1 - A5 and $\lim\frac{1}{n}X'X = Q$, with $Q$ positive definite, $\hat\sigma^2$ is consistent.

Asymptotic distribution of $\hat\beta$

As a corollary to the Lindeberg-Lévy CLT we note the following

Theorem 37 Let $\varepsilon_i$, $i = 1, 2, \dots$, be a sequence of iid random variables with $E(\varepsilon_i) = 0$ and $E(\varepsilon_i^2) = \sigma^2 < \infty$. Let $x_i$, $i = 1, 2, \dots$, be a sequence of real nonstochastic $(k \times 1)$ vectors (the rows of $X$) with $\lim\frac{1}{n}X'X = Q$ finite. Let $z_n = \frac{1}{\sqrt n}X'\varepsilon = \frac{1}{\sqrt n}\sum_{i=1}^n x_i\varepsilon_i$; then

$z_n \xrightarrow{d} N(0, \sigma^2 Q)$

We remember that

$\sqrt{n}\,(\hat\beta - \beta) = \left(\frac{1}{n}X'X\right)^{-1}\frac{1}{\sqrt n}X'\varepsilon$

Under the conditions A1 to A5 and $\lim\frac{1}{n}X'X = Q$, and if furthermore $Q$ is nonsingular, we obtain

$\frac{1}{\sqrt n}X'\varepsilon \xrightarrow{d} N(0, \sigma^2 Q)$

$\left(\frac{1}{n}X'X\right)^{-1}\frac{1}{\sqrt n}X'\varepsilon \xrightarrow{d} N\!\left(0,\ Q^{-1}\sigma^2 Q\,Q^{-1}\right)$

and finally

$\sqrt{n}\,(\hat\beta - \beta) \xrightarrow{d} N(0, \sigma^2 Q^{-1})$

If the regressors are well-behaved, the asymptotic normality of the least squares estimator does not depend on the normality of the disturbances.
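A small Monte Carlo experiment makes the last statement concrete. The sketch below (not from the original notes) draws non-normal errors and checks the variance of $\sqrt{n}(\hat\beta - \beta)$ against $\sigma^2 Q^{-1}$; the chosen distributions, sample size and seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_sqrt_n_beta(n, reps=5000, beta=1.0):
    """Draws of sqrt(n)(beta_hat - beta) for a one-regressor model with uniform (non-normal) errors."""
    out = np.empty(reps)
    for r in range(reps):
        x = rng.uniform(1.0, 2.0, size=n)        # well-behaved, bounded regressor
        eps = rng.uniform(-1.0, 1.0, size=n)     # non-normal errors, mean 0, variance 1/3
        y = beta * x + eps
        beta_hat = (x @ y) / (x @ x)             # OLS in the one-regressor model
        out[r] = np.sqrt(n) * (beta_hat - beta)
    return out

draws = simulate_sqrt_n_beta(n=200)
# theoretical asymptotic variance: sigma^2 / Q with Q = E(x^2) = 7/3 and sigma^2 = 1/3
print(draws.var(), (1.0 / 3.0) / (7.0 / 3.0))
```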
Asymptotic distribution of $\hat\sigma^2$

Under the additional assumption that $E(\varepsilon_i^4) = \mu_4 < \infty$ it can be shown that

$\sqrt{n}\,(\hat\sigma^2 - \sigma^2) \xrightarrow{d} N(0,\ \mu_4 - \sigma^4)$

which equals $N(0, 2\sigma^4)$ under normality. Similar results can be derived for the asymptotic distributions of the test statistics. We have shown that

$\hat\beta \xrightarrow{a} N\!\left(\beta,\ \frac{\sigma^2}{n}Q^{-1}\right)$

This implies

$\frac{\hat\beta_j - \beta_j}{\sqrt{\frac{\sigma^2}{n}\left(Q^{-1}\right)_{jj}}} \xrightarrow{d} N(0, 1)$

and

$t_j = \frac{\hat\beta_j - \beta_j}{\hat\sigma\sqrt{\left((X'X)^{-1}\right)_{jj}}} \xrightarrow{d} N(0, 1)$

For testing the validity of linear restrictions $H_0: R\beta = r$ with $\text{rank}(R) = q$,

$\frac{(R\hat\beta - r)'\left[R(X'X)^{-1}R'\right]^{-1}(R\hat\beta - r)}{\hat\sigma^2} \xrightarrow{d} \chi^2(q)$
8 The generalised linear regression model

8.1 Aitken estimator

We modify assumption A4 in our set of assumptions.

A1 $y = X\beta + \varepsilon$

A2 dim($X$) = ($n \times k$); rank($X$) = $k$

A3 $E(\varepsilon) = 0$

A4* $E(\varepsilon\varepsilon') = \sigma^2\Omega$, $\Omega \ne I$, $\Omega$ symmetric and positive definite

A5 $X$ is nonstochastic

If $\Omega$ is unknown we have $\frac{n(n+1)}{2} - 1$ additional parameters in the model. That means there are more unknown parameters than observations.
Therefore we first assume that $\Omega$ is known and only $\sigma^2$ is unknown. As a normalisation condition we take $tr(\Omega) = n$.
The symmetric positive definite matrix $\Omega$ and its inverse can be decomposed according to the Cholesky decomposition as

$\Omega = KK' \qquad\text{and}\qquad \Omega^{-1} = P'P$

with $P = K^{-1}$.

Now we premultiply the regression model with the matrix $P$,

$Py = PX\beta + P\varepsilon$

and get a new transformed model

$y^* = X^*\beta + \varepsilon^*$   (14)

This model has the properties

1. $X^* = PX$ is non-stochastic

2. $X^{*\prime}X^* = X'P'PX = X'\Omega^{-1}X$ is non-singular

3. $E(\varepsilon^*) = P\,E(\varepsilon) = 0$

4. $E(\varepsilon^*\varepsilon^{*\prime}) = P\,E(\varepsilon\varepsilon')\,P' = \sigma^2 P\Omega P' = \sigma^2 I$

That means in the transformed model (14) the assumptions of the classical regression model are fulfilled. We can apply the OLS estimator to this model and get the following result.

Theorem 38 Under the assumptions A1, A2, A3, A4* and A5 the best linear unbiased estimator of $\beta$ is given by

$\tilde\beta = (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}y$

Proof. Applying OLS to the transformed model gives

$\tilde\beta = (X^{*\prime}X^*)^{-1}X^{*\prime}y^* = (X'P'PX)^{-1}X'P'Py = (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}y$

$\tilde\beta$ is called the Generalised Least Squares (GLS) estimator or Aitken estimator.

Covariance matrix of GLS estimator

$Var(\tilde\beta) = E(\tilde\beta - \beta)(\tilde\beta - \beta)'$
$= E\left[(X'\Omega^{-1}X)^{-1}X'\Omega^{-1}\varepsilon\varepsilon'\Omega^{-1}X(X'\Omega^{-1}X)^{-1}\right]$
$= (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}E(\varepsilon\varepsilon')\,\Omega^{-1}X(X'\Omega^{-1}X)^{-1}$
$= \sigma^2(X'\Omega^{-1}X)^{-1}X'\Omega^{-1}\Omega\,\Omega^{-1}X(X'\Omega^{-1}X)^{-1}$
$= \sigma^2(X'\Omega^{-1}X)^{-1}$

An unbiased estimator of $\sigma^2$ is $\tilde\sigma^2 = \dfrac{\tilde\varepsilon'\Omega^{-1}\tilde\varepsilon}{n-k}$ with $\tilde\varepsilon = y - X\tilde\beta$, and $\widehat{Var}(\tilde\beta) = \tilde\sigma^2(X'\Omega^{-1}X)^{-1}$.

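The GLS estimator can be computed through the whitening transformation used in the proof. A minimal Python sketch follows (not part of the original notes); the function name and the use of numpy's Cholesky factorisation are illustrative assumptions, and Omega is assumed known here.

```python
import numpy as np

def gls(X, y, Omega):
    """Aitken/GLS estimator via the whitening transformation P y = P X beta + P eps."""
    # P is any matrix with P'P = Omega^{-1}; here we use the inverse Cholesky factor.
    L = np.linalg.cholesky(Omega)              # Omega = L L'
    P = np.linalg.inv(L)                       # then P'P = (L L')^{-1} = Omega^{-1}
    Xs, ys = P @ X, P @ y                      # transformed (classical) model
    beta_tilde = np.linalg.lstsq(Xs, ys, rcond=None)[0]
    n, k = X.shape
    resid = ys - Xs @ beta_tilde
    sigma2_tilde = resid @ resid / (n - k)
    cov = sigma2_tilde * np.linalg.inv(Xs.T @ Xs)   # = sigma2 (X' Omega^{-1} X)^{-1}
    return beta_tilde, cov
```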
Properties of OLS in the Generalised Regression Model

1. OLS is unbiased:

$E(\hat\beta) = E\left[(X'X)^{-1}X'y\right] = \beta + (X'X)^{-1}X'E(\varepsilon) = \beta$

2. In general $\hat\beta$ is less efficient than $\tilde\beta$:

$Var(\hat\beta) = E(\hat\beta - \beta)(\hat\beta - \beta)' = (X'X)^{-1}X'E(\varepsilon\varepsilon')X(X'X)^{-1} = \sigma^2(X'X)^{-1}X'\Omega X(X'X)^{-1}$

3. The OLS estimator $s^2 = \dfrac{e'e}{n-k}$ of $\sigma^2$ is biased:

$E(e'e) = E(\varepsilon'M\varepsilon) = E\left[tr(M\varepsilon\varepsilon')\right] = \sigma^2\,tr(M\Omega)$
$= \sigma^2\left[tr(\Omega) - tr\left((X'X)^{-1}X'\Omega X\right)\right] = \sigma^2\left[n - tr\left((X'X)^{-1}X'\Omega X\right)\right]$

using the normalisation $tr(\Omega) = n$. In general this differs from $\sigma^2(n-k)$, so the OLS estimator of $\sigma^2$ is biased.

8.2 Asymptotic properties of GLS

Theorem 39 Under the assumptions A1, A2, A3, A4*, A5 and

$\lim_{n\to\infty}\frac{1}{n}X'\Omega^{-1}X = Q^*$

with $Q^*$ positive definite, the GLS estimator $\tilde\beta$ is consistent.

Proof.

$\tilde\beta = \beta + (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}\varepsilon$

$E(\tilde\beta) = \beta$

$Var(\tilde\beta) = E(\tilde\beta - \beta)(\tilde\beta - \beta)' = \sigma^2(X'\Omega^{-1}X)^{-1} = \frac{\sigma^2}{n}\left(\frac{1}{n}X'\Omega^{-1}X\right)^{-1} \to 0$

Thus $Var(\tilde\beta) \to 0$, and by Corollary 26 this implies $\tilde\beta \xrightarrow{p} \beta$.

Remark: The assumption

$\lim\frac{1}{n}X'\Omega^{-1}X = Q^*$ with $Q^*$ positive definite

implies

$\left(\frac{1}{n}X'\Omega^{-1}X\right)^{-1} \to Q^{*-1}$

We give an interpretation of this assumption for the case $\Omega = I$, $k = 1$:

$\frac{1}{n}\sum_{i=1}^n x_i^2 \to Q^*$

with $Q^*$ positive definite. It means that $\frac{1}{n}\sum_i x_i^2$ must not converge to $0$. There has to be enough variation in the data. When the sample size increases the variation has to increase as well.

Is OLS consistent in the Generalised Regression Model?

Here we present two examples: one in which OLS is not consistent in the generalised model and one in which it is consistent.

Example 1 OLS is not consistent (heteroscedastic errors)

We consider the model

$y_i = \beta x_i + \varepsilon_i$

with a single regressor ($k = 1$). About the error terms we assume $\varepsilon_i \sim N(0, \sigma_i^2)$ with $\sigma_i^2 = \sigma^2 i$, so the diagonal elements of $\Omega$ are not constant and we have heteroscedastic errors. Further, there exist upper and lower bounds for the regressor,

$0 < a \le |x_i| \le b < \infty$

The consistent GLS estimator is given by

$\tilde\beta = \left(\sum_{i=1}^n\frac{x_i^2}{\sigma_i^2}\right)^{-1}\sum_{i=1}^n\frac{x_i y_i}{\sigma_i^2}$

with $Var(\tilde\beta) = \left(\sum_i x_i^2/\sigma_i^2\right)^{-1} \le \left(\frac{a^2}{\sigma^2}\sum_i\frac{1}{i}\right)^{-1} \to 0$. Let us compare it to the OLS estimator

$\hat\beta = \frac{\sum_i x_i y_i}{\sum_i x_i^2}$

with expected value and variance

$E(\hat\beta) = \beta$

$Var(\hat\beta) = \frac{\sum_i x_i^2\sigma_i^2}{\left(\sum_i x_i^2\right)^2} \ \ge\ \frac{\sigma^2 a^2\sum_i i}{\left(n b^2\right)^2} = \frac{\sigma^2 a^2(n+1)}{2nb^4} \ \to\ \frac{\sigma^2 a^2}{2b^4} > 0$

The variance of $\hat\beta$ stays bounded away from zero, therefore $P(|\hat\beta - \beta| > \epsilon)$ does not converge to zero: the OLS estimator is inconsistent.

Example 2 OLS is consistent (AR(1) errors)

We consider the same model as above, but we change the assumptions about the error terms to

$\varepsilon_i = \rho\varepsilon_{i-1} + v_i, \qquad |\rho| < 1$

Further we assume that

$\frac{1}{n}\sum_{i=1}^n x_i^2 \to q > 0$

Then

$\hat\beta - \beta = \frac{\frac{1}{n}\sum_i x_i\varepsilon_i}{\frac{1}{n}\sum_i x_i^2}$

The denominator converges to $q$. We still need to see whether $\frac{1}{n}\sum_i x_i\varepsilon_i$ converges to zero. Its expectation is

$E\left(\frac{1}{n}\sum_i x_i\varepsilon_i\right) = 0$

We evaluate the variance:

$Var\left(\frac{1}{n}\sum_i x_i\varepsilon_i\right) = \frac{1}{n^2}x'E(\varepsilon\varepsilon')x = \frac{\sigma^2}{n^2}\,x'\Omega x \ \le\ \frac{\sigma^2\,\|x\|_\infty^2}{n^2}\sum_{t=1}^n\sum_{s=1}^n|\rho|^{|t-s|} \ \le\ \frac{\sigma^2\,\|x\|_\infty^2}{n}\cdot\frac{1+|\rho|}{1-|\rho|} \ \to\ 0$

and so $\frac{1}{n}\sum_i x_i\varepsilon_i \xrightarrow{m.s.} 0$, which implies $\frac{1}{n}\sum_i x_i\varepsilon_i \xrightarrow{p} 0$ and hence $\hat\beta \xrightarrow{p} \beta$.
The matrix $\Omega$ in this model is given by

$\Omega = \begin{pmatrix} 1 & \rho & \rho^2 & \dots & \rho^{n-1} \\ \rho & 1 & \rho & \dots & \rho^{n-2} \\ \vdots & & \ddots & & \vdots \\ \rho^{n-1} & \rho^{n-2} & \dots & \rho & 1 \end{pmatrix}, \qquad \sigma^2 = \frac{\sigma_v^2}{1-\rho^2}$

$\|\cdot\|_\infty$ denotes the supremum norm; it is defined as the maximum of the absolute values of the elements.

Let us summarise this example. We found that under the assumptions

$|x_i| \le \|x\|_\infty < \infty \qquad\text{and}\qquad \frac{1}{n}\sum_{i=1}^n x_i^2 \to q > 0$

in the model $y_i = \beta x_i + \varepsilon_i$, with error terms $\varepsilon_i = \rho\varepsilon_{i-1} + v_i$, where $|\rho| < 1$ and $v_i \sim N(0, \sigma_v^2)$, the OLS estimator is consistent.

Asymptotic distribution of $\tilde\beta$

Theorem 40 Under the assumptions A1, A2, A3, A4*, A5 and

$\lim\frac{1}{n}X'\Omega^{-1}X = Q^* \quad\text{positive definite}$

$\sqrt{n}\,(\tilde\beta - \beta)$ converges in distribution to $N(0, \sigma^2 Q^{*-1})$.
Two-Step Estimation (Feasible GLS)

So far we always assumed that $\Omega$ is given. But how do we proceed if $\Omega$ is unknown? We can use a 2-step procedure:

1. Estimate $\Omega$ by $\hat\Omega$.

2. Estimate $\beta$ with $\hat{\tilde\beta} = (X'\hat\Omega^{-1}X)^{-1}X'\hat\Omega^{-1}y$.

$\Omega$ still has $\frac{n(n+1)}{2} - 1$ unknown parameters, and the number of parameters exceeds the number of observations $n$. Therefore we have to find a parametrization for $\Omega$ which reduces the number of parameters.
Let us look at some examples.

Example 1 AR(1)

$\varepsilon_t = \rho\varepsilon_{t-1} + v_t$

1. In the first step we calculate the OLS estimator and the OLS residuals $e = y - X\hat\beta$. Then we estimate $\rho$ from the equation $e_t = \rho e_{t-1} + \text{error}$:

$\hat\rho = \frac{\sum_{t=2}^n e_t e_{t-1}}{\sum_{t=2}^n e_{t-1}^2}$

Note that we only have to estimate one additional parameter $\rho$.

2. In the second step we use

$\hat\Omega = \begin{pmatrix} 1 & \hat\rho & \hat\rho^2 & \dots & \hat\rho^{n-1} \\ \hat\rho & 1 & \hat\rho & \dots & \hat\rho^{n-2} \\ \vdots & & \ddots & & \vdots \\ \hat\rho^{n-1} & \hat\rho^{n-2} & \dots & \hat\rho & 1 \end{pmatrix}$

and estimate $\beta$ with $\hat{\tilde\beta}$.

Example 2 heteroscedastic errors

$\sigma^2\Omega = \begin{pmatrix} \sigma_1^2 & & 0 \\ & \ddots & \\ 0 & & \sigma_n^2 \end{pmatrix}$

Like in the previous example we want to find estimators $\hat\sigma_i^2$ for the $\sigma_i^2$ in the first step and then estimate $\beta$ with $\hat{\tilde\beta}$ in the second step. But here we still have $n$ additional parameters, and further assumptions are necessary to reduce the dimension of the parameter space.
Concerning the asymptotic distribution of the feasible GLS estimator we note that under relatively general assumptions we get

$\text{plim}\ \sqrt{n}\,(\hat{\tilde\beta} - \tilde\beta) = 0$

or

$\sqrt{n}\,(\hat{\tilde\beta} - \beta) \xrightarrow{d} N(0, \sigma^2 Q^{*-1})$

The 2-step estimator is consistent and has the same asymptotic distribution as $\tilde\beta$ (GLS). Asymptotically $\hat{\tilde\beta}$ and $\tilde\beta$ are equivalent.

8.3 Heteroscedasticity

We consider the linear regression model

$y = X\beta + \varepsilon$

with the following structure on the error terms:

$E(\varepsilon) = 0$

$E(\varepsilon\varepsilon') = \begin{pmatrix} \sigma_1^2 & & 0 \\ & \ddots & \\ 0 & & \sigma_n^2 \end{pmatrix} = \sigma^2\Omega$

This model has $(n + k)$ unknown parameters and we need additional assumptions for estimation. But first we examine what happens if we estimate $\beta$ with OLS in this model.
Properties of OLS

If $X$ is non-stochastic, OLS has the following properties:

1. OLS is unbiased.

2. OLS is consistent if

$\lim\frac{1}{n}X'\Omega X = Q_*$ with $Q_*$ finite

$\lim\frac{1}{n}X'X = Q$ with $Q$ finite and non-singular

3. OLS is inefficient.

4. OLS standard errors are incorrect.

Remember: the BLUE in this model is given by the GLS estimator

$\tilde\beta = (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}y$

and according to the Gauss Markov Theorem the minimum variance of all unbiased linear estimators is

$Var(\tilde\beta) = \sigma^2(X'\Omega^{-1}X)^{-1}$

For the OLS estimator we have

$Var(\hat\beta) = \sigma^2(X'X)^{-1}X'\Omega X(X'X)^{-1}$

Concerning item 3 we note that the Gauss-Markov Theorem states that $Var(\hat\beta) \ge Var(\tilde\beta)$, and this implies the inefficiency of OLS.
Further we note that if we let the number of observations go to infinity

$Var(\hat\beta) = \frac{\sigma^2}{n}\left(\frac{1}{n}X'X\right)^{-1}\left(\frac{1}{n}X'\Omega X\right)\left(\frac{1}{n}X'X\right)^{-1} \to \frac{\sigma^2}{n}\,Q^{-1}Q_*Q^{-1} \to 0$

Thus under the conditions that $\frac{1}{n}X'\Omega X \to Q_*$ and $\frac{1}{n}X'X \to Q$, with $Q_*$ finite and $Q$ finite and non-singular, also $Var(\hat\beta) \to 0$ and thus $\hat\beta$ is consistent, which verifies item 2.
Concerning item 4 note that in the classical linear regression model we calculate $\hat\sigma^2(X'X)^{-1}$ as estimator for the variance of $\hat\beta$ instead of $Var(\hat\beta) = \sigma^2(X'X)^{-1}X'\Omega X(X'X)^{-1}$. Consequently the standard errors for $\hat\beta$ are calculated wrongly under OLS, and all inference based on OLS standard errors is incorrect.

Correction of OLS Standard Errors

If the sample size $n$ is very large we can proceed with OLS in spite of the inefficiency of the parameter estimates. The main problem is how to get valid statistical inference. Without additional assumptions the estimation of $\sigma^2\Omega$ is still impossible, because $\Omega$ contains $n$ unknown parameters.
In an important article, White (1980, Econometrica) has shown that it is sufficient to estimate $\frac{1}{n}\sigma^2 X'\Omega X$, which is of dimension $(k \times k)$.
Let $x_i' = (x_{i1}, \dots, x_{ik})$ be the $i$-th observation vector, corresponding to the $i$-th row of $X$. Then we can write

$\frac{1}{n}\sigma^2 X'\Omega X = \frac{1}{n}\sum_{i=1}^n\sigma_i^2\,x_i x_i'$

In this expression White's estimator replaces the unknown $\sigma_i^2$ by the squared OLS residuals $e_i^2$:

$\hat S = \frac{1}{n}\sum_{i=1}^n e_i^2\,x_i x_i'$

It can be shown that this is a consistent estimator for $\frac{1}{n}\sigma^2 X'\Omega X$. An estimator for the variance of $\hat\beta$ is then given by

$\widehat{Var}(\hat\beta) = (X'X)^{-1}\left(\sum_{i=1}^n e_i^2\,x_i x_i'\right)(X'X)^{-1}$

With this method we get standard errors of $\hat\beta_j$ by $\sqrt{\widehat{Var}(\hat\beta)_{jj}}$, and they are called Heteroscedasticity-Consistent (Robust) Standard Errors.
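A minimal Python sketch of this sandwich estimator follows (not from the original notes); the function name and array conventions are illustrative assumptions.

```python
import numpy as np

def white_robust_se(X, y):
    """OLS with White heteroscedasticity-consistent standard errors."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    e = y - X @ beta_hat                        # OLS residuals
    S = (X * (e ** 2)[:, None]).T @ X           # sum_i e_i^2 x_i x_i'
    cov_robust = XtX_inv @ S @ XtX_inv          # sandwich estimator
    return beta_hat, np.sqrt(np.diag(cov_robust))
```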
Tests for Heteroscedasticity

White (1980) presents a test for the hypothesis

$H_0: \sigma_i^2 = \sigma^2$ for all $i$
$H_1:$ heteroscedasticity

The test is performed in two steps:

Step 1 Estimate the OLS residuals $e_i$.

Step 2 Estimate an auxiliary regression by regressing $e_i^2$ on the levels, squares and cross-products of all regressors including the constant,

$e_i^2 = \alpha_0 + \alpha_1 x_{i1} + \dots + \alpha_k x_{ik} + \alpha_{k+1}x_{i1}^2 + \alpha_{k+2}x_{i1}x_{i2} + \dots + \eta_i$

and keep the $R^2$ from this regression.

It can be shown that

$nR^2 \xrightarrow{d} \chi^2(P - 1)$

where $P$ is the number of regressors in the auxiliary regression (including the constant).

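A sketch of the two steps in Python (not from the original notes); it assumes X holds the non-constant regressors, that the products of the columns of X are not collinear, and it uses numpy/scipy for the regression and the chi-squared p-value.

```python
import numpy as np
from scipy import stats

def white_test(X, y):
    """White test: regress squared OLS residuals on levels, squares and cross-products."""
    n, k = X.shape
    W = np.column_stack([np.ones(n), X])                  # regressors of the original model
    e = y - W @ np.linalg.lstsq(W, y, rcond=None)[0]      # OLS residuals
    e2 = e ** 2
    cols = [np.ones(n)]
    cols += [X[:, j] for j in range(k)]
    cols += [X[:, j] * X[:, l] for j in range(k) for l in range(j, k)]
    Z = np.column_stack(cols)                             # auxiliary regressors
    fitted = Z @ np.linalg.lstsq(Z, e2, rcond=None)[0]
    r2 = 1.0 - np.sum((e2 - fitted) ** 2) / np.sum((e2 - e2.mean()) ** 2)
    stat = n * r2
    dof = Z.shape[1] - 1                                  # auxiliary regressors excluding the constant
    return stat, 1.0 - stats.chi2.cdf(stat, dof)
```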
Estimation under Heteroscedasticity

Estimation by feasible GLS requires prior knowledge about the structural form of the heteroscedasticity. Consider an example.

Example Let $x_i$ denote household income and let $y_i$ denote household consumption, and consider a model in which consumption expenditures are explained by income and other variables:

$y_i = \beta_1 + \beta_2 x_i + \dots + \varepsilon_i$

One often observes that the variation of average expenditure increases as income increases. Therefore we model $\sigma_i^2$ proportional to $x_i$:

$\sigma_i^2 = \sigma^2 x_i$

and get

$E(\varepsilon\varepsilon') = \sigma^2\begin{pmatrix} x_1 & & 0 \\ & \ddots & \\ 0 & & x_n \end{pmatrix} = \sigma^2\Omega$

A more general model for heteroscedasticity is given by

$\sigma_i^2 = \gamma_1\,z_i^{\gamma_2}$

where $z_i$ is a single variable, usually one of the regressors. Depending on the values of the parameters $\gamma_1, \gamma_2$ we get the models

$\gamma_2 = 0$: homoscedasticity
$\gamma_2 = 1$: variance proportional to $z_i$
$\gamma_2 = 2$: variance proportional to $z_i^2$

Weighted Least Squares

Modelling heteroscedasticity in this form is also called weighted least squares regression. We can see this from

$\sigma^2\Omega = \begin{pmatrix} \sigma_1^2 & & 0 \\ & \ddots & \\ 0 & & \sigma_n^2 \end{pmatrix}$

The rows of the matrix $X$ correspond to the individual observations on all independent variables. Here we call $x_i' = (x_{i1}, \dots, x_{ik})$ the $i$-th row of $X$. Accordingly $y_i$ is the observation of the dependent variable for the $i$-th individual. We rewrite the regression model $y = X\beta + \varepsilon$ by dividing each observation by $\sigma_i$:

$\frac{y_i}{\sigma_i} = \frac{x_i'}{\sigma_i}\,\beta + \frac{\varepsilon_i}{\sigma_i}, \qquad i = 1, \dots, n$

and the GLS estimator can be written as

$\tilde\beta = \left(\sum_{i=1}^n\frac{x_i x_i'}{\sigma_i^2}\right)^{-1}\sum_{i=1}^n\frac{x_i y_i}{\sigma_i^2}$

From this expression we see that the GLS estimator gives the OLS estimator for the weighted set of observations $(y_i/\sigma_i,\ x_i/\sigma_i)$.
To estimate the parameters $\gamma_1, \gamma_2$ in $\sigma_i^2 = \gamma_1 z_i^{\gamma_2}$ directly from the data, one could proceed in the following way:
Use the OLS residuals and substitute $e_i^2$ for $\sigma_i^2$.
By non-linear regression estimate

$e_i^2 = \gamma_1\,z_i^{\gamma_2} + v_i$

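When the form of the variance function is taken as known, the weighted regression is a one-liner. The sketch below (not from the original notes) assumes $\sigma_i^2 \propto z_i^{\gamma_2}$ with $\gamma_2$ supplied by the user (e.g. 1 for variance proportional to z); the function name and numpy usage are illustrative.

```python
import numpy as np

def weighted_least_squares(X, y, z, gamma2=1.0):
    """WLS when sigma_i^2 is proportional to z_i**gamma2 (gamma2 assumed known)."""
    w = 1.0 / np.sqrt(z ** gamma2)       # weights 1/sigma_i up to the common factor
    Xw = X * w[:, None]                  # divide each observation by sigma_i
    yw = y * w
    return np.linalg.lstsq(Xw, yw, rcond=None)[0]
```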
8.4 Autocorrelation

Autoregressive process of order one

A stochastic process of the form

$\varepsilon_t = \rho\varepsilon_{t-1} + v_t$
$E(v_t) = 0$
$E(v_t v_s) = \delta_{ts}\,\sigma_v^2$

is called a first order autoregressive process, or AR(1).

Under the assumption that the process starts from some non-stochastic value $\varepsilon_0$, a solution for $\varepsilon_t$ is given by

$\varepsilon_1 = \rho\varepsilon_0 + v_1$
$\varepsilon_2 = \rho^2\varepsilon_0 + \rho v_1 + v_2$
$\vdots$
$\varepsilon_t = \rho^t\varepsilon_0 + \sum_{j=0}^{t-1}\rho^j v_{t-j}$

The properties of $\varepsilon_t$ as $t$ goes to infinity depend on the parameter $\rho$. Here we treat convergence in the sense of convergence in mean square.

Case 1 $|\rho| < 1$.

$E(\varepsilon_t) = \rho^t\varepsilon_0 \to 0$

$Var(\varepsilon_t) = \sum_{j=0}^{t-1}\rho^{2j}\sigma_v^2 = \sigma_v^2\,\frac{1-\rho^{2t}}{1-\rho^2} \to \frac{\sigma_v^2}{1-\rho^2}$

The expected value converges to zero and the variance to some finite value. In this case we speak of a "stable solution".

Case 2 $\rho = 1$.

$\varepsilon_t = \varepsilon_0 + \sum_{j=1}^{t}v_j$
$E(\varepsilon_t) = \varepsilon_0$
$Var(\varepsilon_t) = E\left(\sum_{j=1}^{t}v_j\right)^2 = t\,\sigma_v^2$

The variance of $\varepsilon_t$ increases as $t \to \infty$. This is called the "random walk" solution.

Case 3 $|\rho| > 1$

$E(\varepsilon_t) \to \infty$ (for $\varepsilon_0 \ne 0$)
$Var(\varepsilon_t) \to \infty$

Both expected value and variance increase over time. This is the "unstable" solution.

From now on we consider only stable solutions where $|\rho| < 1$.


The general solution for the  starting in the infinite past has the following properties

90

  D B  
 

#    # D B
 
    

 

 


(     # D B    # D  B  B  
   

D %3  %3
 

  D


)*+     #   
D  B  B      %3

D  

D

 %3 D   D  %3  E2  1


  D
 

The mean and the variance of the process are constant and finite and the covariance of
two observations  and  only depends on the difference 2  1. Processes with these
properties are called weakly “stationary”.

Generalised Regression Model with AR(1) Disturbances

Now we return to the linear regression model

$y = X\beta + \varepsilon$

but we assume that the error term $\varepsilon_t$ follows a first order autoregressive process with $|\rho| < 1$:

$E(\varepsilon) = 0$

$E(\varepsilon\varepsilon') = \sigma^2\Omega = \frac{\sigma_v^2}{1-\rho^2}\begin{pmatrix} 1 & \rho & \rho^2 & \dots & \rho^{T-1} \\ \rho & 1 & \rho & \dots & \rho^{T-2} \\ \vdots & & \ddots & & \vdots \\ \rho^{T-1} & \rho^{T-2} & \dots & \rho & 1 \end{pmatrix}, \qquad \sigma^2 = \frac{\sigma_v^2}{1-\rho^2}$

Model misspecification may be a reason for autocorrelated disturbances.

Example Suppose the true model were the following

$y_t = \beta_1 + \beta_2 x_t + \beta_3 x_{t-1} + \varepsilon_t$

But the researcher estimates instead

$y_t = \beta_1 + \beta_2 x_t + u_t$

In this misspecified model the error term $u_t$ is autocorrelated: $u_t = \beta_3 x_{t-1} + \varepsilon_t$.

Properties of OLS

Like for heteroscedasticity of the errors, if $X$ is non-stochastic we have the following properties of OLS:

1. OLS is unbiased.

2. OLS is consistent under certain conditions.

3. OLS is inefficient.

4. OLS standard errors are incorrect.

However, if $X$ is stochastic, we have to be more careful. This is also the case if the lagged dependent variable is one of the regressors. Consider again an example.

Example We have the model

$y_t = \beta y_{t-1} + \varepsilon_t, \quad |\beta| < 1$
$\varepsilon_t = \rho\varepsilon_{t-1} + v_t, \quad |\rho| < 1$
$E(v_t) = 0, \quad E(v_t v_s) = \delta_{ts}\sigma_v^2$

Estimating this model by OLS results in

$\hat\beta = \frac{\sum_t y_{t-1}y_t}{\sum_t y_{t-1}^2} = \beta + \frac{\frac{1}{T}\sum_t y_{t-1}\varepsilon_t}{\frac{1}{T}\sum_t y_{t-1}^2}$

$\text{plim}\ \hat\beta = \beta + \frac{\text{plim}\ \frac{1}{T}\sum_t y_{t-1}\varepsilon_t}{\text{plim}\ \frac{1}{T}\sum_t y_{t-1}^2}$

Consider now $\frac{1}{T}\sum_t y_{t-1}\varepsilon_t$:

$E(y_{t-1}\varepsilon_t) = E\left[(\beta y_{t-2} + \varepsilon_{t-1})\,(\rho\varepsilon_{t-1} + v_t)\right] = \frac{\rho\,\sigma_\varepsilon^2}{1-\beta\rho}$

so that

$\text{plim}\ \frac{1}{T}\sum_t y_{t-1}\varepsilon_t = \frac{\rho\,\sigma_\varepsilon^2}{1-\beta\rho} > 0 \quad\text{for } \rho > 0$

This means that OLS is inconsistent!

Testing for Autocorrelated Disturbances

In the model

$y = X\beta + \varepsilon, \qquad \varepsilon_t = \rho\varepsilon_{t-1} + v_t$

we want to test the hypothesis

$H_0: \rho = 0$
$H_1: \rho \ne 0$

This is a hypothesis about the error term $\varepsilon$. But $\varepsilon$ is unobservable, and therefore one has to look for a test based on the OLS residuals $e$.
Even if $H_0$ is true and $E(\varepsilon\varepsilon') = \sigma^2 I$, the OLS residuals will display autocorrelation, because $E(ee') = \sigma^2 M$ with $M = I - X(X'X)^{-1}X'$ is not a diagonal matrix and it depends on $X$.

Durbin – Watson Test
The Durbin - Watson test statistic is given by

$d = \frac{\sum_{t=2}^{T}(e_t - e_{t-1})^2}{\sum_{t=1}^{T}e_t^2}$

Note that $d$ is small for positive autocorrelation and large for negative autocorrelation. The test statistic $d$ will take on an intermediate value (around 2) for no autocorrelation.
Durbin and Watson established upper and lower bounds for the distribution of $d$. These bounds are independent of $X$ under the assumptions that

1. $X$ is non-stochastic.

2. $\varepsilon \sim N(0, \sigma^2 I)$.

3. A constant is included in $X$.

$d_L \le d \le d_U$

Durbin and Watson suggest the following test procedure:
For a given significance level we can derive critical values $d_L$ and $d_U$ from the bounding distributions. Then we apply the decision rule:

If $d < d_L$ we reject the null hypothesis of no autocorrelation in favour of positive first order autocorrelation.

If $d > d_U$ we do not reject the null hypothesis.

If $d_L \le d \le d_U$ the test result is inconclusive.
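The statistic itself is simple to compute from the OLS residuals. A minimal sketch in Python (not from the original notes); the tabulated bounds dL and dU must still be looked up by the user.

```python
import numpy as np

def durbin_watson(e):
    """Durbin-Watson statistic from OLS residuals ordered in time."""
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

# decision sketch for positive autocorrelation, with tabulated bounds dL, dU:
# if d < dL: reject H0;  if d > dU: do not reject;  otherwise: inconclusive.
```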

The asymptotic properties of the Durbin - Watson test statistic follow from

$d = \frac{\sum_t e_t^2 + \sum_t e_{t-1}^2 - 2\sum_t e_t e_{t-1}}{\sum_t e_t^2} \approx 2\left(1 - \frac{\sum_t e_t e_{t-1}}{\sum_t e_t^2}\right) = 2(1 - \hat\rho)$

Since

$E(\varepsilon_t\varepsilon_{t-1}) = E\left[(\rho\varepsilon_{t-1} + v_t)\,\varepsilon_{t-1}\right] = \rho\,E(\varepsilon_{t-1}^2) = \rho\,\sigma_\varepsilon^2$

we have $\hat\rho \xrightarrow{p} \rho$ and therefore

$\text{plim}\ d = 2(1 - \rho)$

For $d > 2$ a test of $H_0$ against the alternative of negative autocorrelation can be constructed by calculating $4 - d$ and comparing this to $d_L$ and $d_U$.

Disadvantages of the DW Test

$d_L$ and $d_U$ are only tabulated; analytical functional forms are not given.

There is an inconclusive region.

Only non-stochastic $X$ is allowed.

Further tests: Breusch – Godfrey LM-test

Feasible GLS with Autocorrelated Disturbances

Beware of the problem of misspecification!

$y = X\beta + \varepsilon, \qquad \varepsilon_t = \rho\varepsilon_{t-1} + v_t, \qquad |\rho| < 1$

$E(\varepsilon\varepsilon') = \sigma^2\Omega = \frac{\sigma_v^2}{1-\rho^2}\begin{pmatrix} 1 & \rho & \rho^2 & \dots & \rho^{T-1} \\ \rho & 1 & \rho & \dots & \rho^{T-2} \\ \vdots & & \ddots & & \vdots \\ \rho^{T-1} & \rho^{T-2} & \dots & \rho & 1 \end{pmatrix}$

The inverse of $\Omega$ has the tridiagonal form

$\Omega^{-1} = \frac{1}{1-\rho^2}\begin{pmatrix} 1 & -\rho & 0 & \dots & 0 \\ -\rho & 1+\rho^2 & -\rho & \dots & 0 \\ \vdots & & \ddots & & \vdots \\ 0 & \dots & -\rho & 1+\rho^2 & -\rho \\ 0 & \dots & 0 & -\rho & 1 \end{pmatrix}$

A suitable transformation matrix is

$P = \begin{pmatrix} \sqrt{1-\rho^2} & 0 & 0 & \dots & 0 \\ -\rho & 1 & 0 & \dots & 0 \\ 0 & -\rho & 1 & \dots & 0 \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & \dots & 0 & -\rho & 1 \end{pmatrix}$

for which $P'P = (1-\rho^2)\,\Omega^{-1}$; the scaling factor does not affect the GLS estimator.

For a model with a constant and one explanatory variable as regressors the transformed model becomes

$\begin{pmatrix} \sqrt{1-\rho^2}\,y_1 \\ y_2 - \rho y_1 \\ \vdots \\ y_T - \rho y_{T-1} \end{pmatrix} = \begin{pmatrix} \sqrt{1-\rho^2} & \sqrt{1-\rho^2}\,x_1 \\ 1-\rho & x_2 - \rho x_1 \\ \vdots & \vdots \\ 1-\rho & x_T - \rho x_{T-1} \end{pmatrix}\begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix} + \begin{pmatrix} \sqrt{1-\rho^2}\,\varepsilon_1 \\ v_2 \\ \vdots \\ v_T \end{pmatrix}$

The transformation is the same for all observations except the first.
Cochrane – Orcutt transformation:
For computational simplicity the Cochrane - Orcutt transformation omits the first observation and uses only the quasi-differenced observations

$y_t - \rho y_{t-1} = (1-\rho)\,\beta_1 + \beta_2(x_t - \rho x_{t-1}) + v_t, \qquad t = 2, \dots, T$

Prais – Winsten transformation:

The Prais - Winsten transformation uses all $T$ observations, including the transformed first observation. Asymptotically the two transformations are equivalent, but they differ in small samples.
To obtain estimators for $\beta$ and $\rho$ we can apply the following iterative procedure:

Step 1 OLS regression of $y$ on $X$ gives residuals $e$.

Step 2 OLS regression $e_t = \rho e_{t-1} + v_t$ gives an OLS estimator $\hat\rho$.

Step 3 Use $\hat\rho$ in one of the transformations for $y_t$ and $x_t$, and get a new estimator for $\beta$ and new residuals $e$.

Step 4 Iterate steps 2 and 3.

Note that iterative methods always depend on the starting values, as they may converge to local extrema.
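A minimal sketch of the iteration in Python (not from the original notes); the function name, the convergence rule and the assumption that X contains the constant in its first column are illustrative choices.

```python
import numpy as np

def cochrane_orcutt(X, y, n_iter=20, tol=1e-8):
    """Iterative Cochrane-Orcutt estimation; X is assumed to contain the constant column."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]           # step 1: OLS on the original data
    rho = 0.0
    for _ in range(n_iter):
        e = y - X @ beta
        rho_new = (e[1:] @ e[:-1]) / (e[:-1] @ e[:-1])    # step 2: AR(1) coefficient of the residuals
        # step 3: quasi-difference the data and drop the first observation
        y_star = y[1:] - rho_new * y[:-1]
        X_star = X[1:] - rho_new * X[:-1]
        beta = np.linalg.lstsq(X_star, y_star, rcond=None)[0]
        converged = abs(rho_new - rho) < tol
        rho = rho_new
        if converged:
            break
    return beta, rho
```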

Serial Correlation Robust Inference after OLS

Like in the heteroscedasticity model, methods of correcting standard errors after OLS estimation exist: Newey and West (1987).
9 Limited dependent variables models

The linear regression model assumes that the dependent variable is continuous and has been measured for all cases in the sample. Yet, many outcomes of fundamental interest to social scientists are not continuous or are not observed for all cases. There are special regression models that are appropriate when the dependent variable is censored, truncated, binary, ordinal, or count. Variables of that kind are often subsumed as categorical and limited dependent variables.
In this section we will discuss models for binary and censored dependent variables.

Differences in parameter interpretation between linear and nonlinear regression models

Consider a linear model in two independent variables

$y = \alpha + \beta x + \delta d$

where $x$ is a continuous variable and $d$ is dichotomous with values 0 and 1. For simplicity we assume there is no random error. A graph of the model is given in Figure 6. In this model a change of $x$ by one unit will always result in a change of $\beta$ units in the dependent variable, regardless of the values of $x$ and $d$. And a change in $d$ from $0$ to $1$ will always result in a change of $y$ by $\delta$ units.
Now, consider the same graph for the nonlinear model

$y = g(\alpha + \beta x + \delta d)$

where $g$ is a non-linear function. In Figure 7 we see that the effect of a unit change in $x$ on $y$ depends on the level of $x$ as well as on the level of $d$. Analogously also the effect of a change in $d$ from $0$ to $1$ changes for different levels of $x$.
We will return to these basic observations when it comes to parameter interpretation in nonlinear models for limited dependent variables.

9.1 Binary regression models: Logit, Probit

A binary response model is a regression model in which the dependent variable $y$ is a binary random variable that takes only the values zero and one. In many economic applications of this model, an agent makes a choice between two alternatives: for example, a commuter chooses to drive a car to work or to take public transport. Another example is the choice of a worker between taking a job or not. Driving to work and taking a job are choices that correspond to $y = 1$, and taking public transport and not taking a job to $y = 0$. The model gives the probability that $y = 1$ is chosen, conditional on a set of explanatory variables.
The econometric problem is to estimate the conditional probability that $y = 1$, considered as a function of the explanatory variables. The most commonly used approach, notably the logit and probit models, assumes that the functional form of the dependence on the explanatory variables is known.

[Figure 6: Linear Model (y against x for d = 0 and d = 1)]
ered as a function of the explanatory variables. The most commonly used approach,
notably logit and probit models, assumes that the functional form of the dependence
on the explanatory variables is known.

The linear probability model

A first approach to estimate a model with a binary dependent variable would be to


use standard OLS. This is the so called linear probability model. We note it here not
to recommend its use but to illustrate the problems resulting from a binary dependent
variable and to motivate the discussion of the logit an probit models.

[Figure 7: Nonlinear Model (y against x for d = 0 and d = 1)]

The structural model is

$y_i = x_i'\beta + \varepsilon_i$
$E(\varepsilon_i|x_i) = 0$
$E(y_i|x_i) = x_i'\beta$

Note: In this chapter we use elementwise notation; $y_i$ gives the $i$-th component of the vector $y$, and $x_i'$ is the row of $X$ corresponding to the observations for the $i$-th individual. We will eventually suppress the index $i$ to make the notation simpler.
A graph for a model with a single independent variable is given in Figure 8. Let us consider the meaning of $E(y_i|x_i)$:

$E(y_i|x_i) = 1\cdot P(y_i = 1|x_i) + 0\cdot P(y_i = 0|x_i) = P(y_i = 1|x_i)$

and hence

$P(y_i = 1|x_i) = x_i'\beta$
So $x_i'\beta$ gives the probability of $y_i = 1$ given $x_i$. Depending on $x_i$ this probability need not always be between zero and one. So there is a problem of nonsensical predictions in the linear probability model.
As $y$ only takes on two values, also the error term for a given $x$ only takes two values:

$\varepsilon_i = 1 - x_i'\beta \quad$ if $y_i = 1$
$\varepsilon_i = -x_i'\beta \quad$ if $y_i = 0$

So the errors cannot be normally distributed.

For the variances of the errors we have

$Var(\varepsilon_i|x_i) = (1 - x_i'\beta)^2\,x_i'\beta + (x_i'\beta)^2(1 - x_i'\beta) = x_i'\beta\,(1 - x_i'\beta)$

This means there is a further problem with heteroscedastic errors in the linear probability model.

[Figure 8: Linear Probability Model]
Latent variable model

As before, we have an observed binary variable $y$. Suppose that there is an unobserved (latent) variable $y^*$, which is continuous and ranging between $-\infty$ and $+\infty$, that generates the observed $y$. In the example of labour force participation we can think of this variable as the utility derived from working. The structural model for $y^*$ is the following linear one:

$y_i^* = x_i'\beta + \varepsilon_i$

The individual joins the labour force only if its utility is above a certain threshold $\tau$. Hence we only observe a discrete outcome $y$, which is linked to $y^*$ by

$y_i = \begin{cases} 1 & \text{if } y_i^* > \tau \\ 0 & \text{if } y_i^* \le \tau \end{cases}$

Since $y^*$ is continuous, we avoid the problems encountered in the linear probability model. However, since the dependent variable is unobserved, the model cannot be estimated by OLS. Instead we use Maximum Likelihood estimation, which requires assumptions about the distribution of the errors.
First we assume $E(\varepsilon_i|x_i) = 0$ like in the linear probability model. Since $y^*$ is unobserved we cannot estimate the variance of the error as in the linear model. In the probit model we assume $Var(\varepsilon_i|x_i) = 1$ and in the logit model we assume $Var(\varepsilon_i|x_i) = \pi^2/3$.
By assuming a specific form for the distribution of $\varepsilon$ it is possible to compute the probability of $y = 1$ for a given $x$. Setting $\tau = 0$ consider

$P(y = 1|x) = P(y^* > 0|x) = P(x'\beta + \varepsilon > 0|x) = P(\varepsilon > -x'\beta|x) = P(\varepsilon \le x'\beta|x)$

using the symmetry of the error distribution. This is simply the cumulative distribution function of the error evaluated at $x'\beta$. Accordingly,

$P(y = 1|x) = F(x'\beta)$

where $F$ is the standard normal distribution function $\Phi$ for the probit model and the logistic distribution function $\Lambda(x'\beta) = \dfrac{\exp(x'\beta)}{1 + \exp(x'\beta)}$ for the logit model.
Model identification assumptions

In specifying the logit and probit model we made the following identifying assumptions:

$E(\varepsilon|x) = 0$

$\tau = 0$; this is not restrictive if a constant is included in the model

$Var(\varepsilon) = 1$ resp. $Var(\varepsilon) = \pi^2/3$

These assumptions are arbitrary, in the sense that they cannot be tested, but they are necessary to identify the model. Since a latent variable is unobserved, its mean and variance cannot be estimated. To see the relationship between the variance of the dependent variable and the identification of the $\beta$'s in a regression model, consider the model $y = x\beta + \varepsilon$ and assume we rescale $y$ by $y_\delta = \delta y$. The variance of $y_\delta$ equals

$Var(y_\delta) = Var(\delta y) = \delta^2\,Var(y)$

and it follows that

$y_\delta = x(\delta\beta) + \delta\varepsilon$

The magnitude of the slope parameter depends on the scale of the dependent variable. If we do not know the variance of the dependent variable, then the slope coefficients are not identified.
Differences in the variances of the error terms in the logit and probit model also affect the parameter estimates. Let

logit: $y^* = x'\beta_L + \varepsilon_L$ with $Var(\varepsilon_L) = \pi^2/3$

probit: $y^* = x'\beta_P + \varepsilon_P$ with $Var(\varepsilon_P) = 1$

As a transformation to compare coefficients from the logit and probit models we can use

$\beta_L \approx \frac{\pi}{\sqrt 3}\,\beta_P \approx 1.8\,\beta_P$

Nonlinear probability model

The logit and probit models can also be derived without appealing to an underlying latent variable. This is done by specifying a nonlinear model relating the $x$'s to the probability of the event $y = 1$. Remember, in the linear probability model we had the problem that the predicted probabilities $P(y=1|x) = x'\beta$ can take on values that are greater than one or less than zero. To eliminate this problem we transform $P(y=1|x)$ into a function of $x'\beta$ that ranges between $0$ and $1$. First, consider the odds

$\frac{P(y=1|x)}{1 - P(y=1|x)}$

which range between $0$ and $\infty$. Then take the logarithm and get an expression between $-\infty$ and $\infty$. In the logit model this equals

$\ln\frac{P(y=1|x)}{1 - P(y=1|x)} = x'\beta$

because

$P(y=1|x) = \Lambda(x'\beta) = \frac{\exp(x'\beta)}{1 + \exp(x'\beta)}$

General probability models can be generated by choosing functions of $x'\beta$ that range between $0$ and $1$, e.g. any distribution function $F$:

$P(y=1|x) = F(x'\beta)$   (15)

Maximum Likelihood estimation

To specify the likelihood function, define $p_i$ as

$p_i = \begin{cases} P(y_i = 1|x_i) & \text{if } y_i = 1 \\ 1 - P(y_i = 1|x_i) & \text{if } y_i = 0 \end{cases}$

where $P(y_i = 1|x_i)$ is defined by equation (15). If the observations are independent, the likelihood function is given by

$L(\beta|y, X) = \prod_{i=1}^n p_i = \prod_{y_i = 1}P(y_i=1|x_i)\prod_{y_i = 0}\left(1 - P(y_i=1|x_i)\right) = \prod_{i=1}^n F(x_i'\beta)^{y_i}\left(1 - F(x_i'\beta)\right)^{1-y_i}$

and the log likelihood is

$\ln L(\beta|y, X) = \sum_{i=1}^n\left[y_i\ln F(x_i'\beta) + (1 - y_i)\ln\left(1 - F(x_i'\beta)\right)\right]$

It has been shown that under mild conditions the likelihood function is globally concave. The estimates are consistent, asymptotically normally distributed, and asymptotically efficient.
Remark: These are only asymptotic properties, and nothing is said about small sample properties of the Maximum Likelihood estimators. Contrary to OLS, Maximum Likelihood estimation of nonlinear functions is only justified for relatively large sample sizes (above 500 observations).

Numerical maximisation procedures

For nonlinear models algebraic solutions are rarely possible. Consequently, numerical methods are used to maximise the likelihood function. Numerical methods start with a guess of the values of the parameters and iterate to improve on that guess.
Assume that we are trying to estimate the vector of parameters $\theta$. We start with an initial guess $\theta_0$, called start values, and attempt to improve on it by adding a vector $\zeta$ of adjustments and proceed updating

$\theta_1 = \theta_0 + \zeta_0$
$\vdots$
$\theta_{j+1} = \theta_j + \zeta_j$

Iterations continue until a convergence criterion is reached. This may be either that the gradient of $\ln L(\theta_j)$ is close to zero, or that the parameter values do not change any more.
How is $\zeta_j$ determined? By a product of gradient and direction matrix,

$\zeta_j = D_j\,\frac{\partial\ln L(\theta_j)}{\partial\theta}$

The gradient vector indicates the direction of a change in the likelihood function for a change in the parameters. $D_j$ is the direction matrix that reflects the curvature of the likelihood function.

Method of steepest ascent

$D_j = I$

Newton Raphson method

$D_j = -\left[\frac{\partial^2\ln L(\theta_j)}{\partial\theta\,\partial\theta'}\right]^{-1}$

Method of scoring

$D_j = -\left[E\,\frac{\partial^2\ln L(\theta_j)}{\partial\theta\,\partial\theta'}\right]^{-1}$

BHHH

$D_j = \left[\sum_{i=1}^n\frac{\partial\ln L_i(\theta_j)}{\partial\theta}\,\frac{\partial\ln L_i(\theta_j)}{\partial\theta'}\right]^{-1}$

In addition to estimating the parameter $\theta$, numerical methods provide estimates of the asymptotic covariance matrix $Var(\hat\theta)$, which are used for the test statistics. The asymptotic covariance matrix for the maximum likelihood estimator equals

$Var(\hat\theta) = \left[-E\,\frac{\partial^2\ln L}{\partial\theta\,\partial\theta'}\right]^{-1}$

evaluated at the true parameter value $\theta$.

Estimators for the covariance matrix are given by

$\widehat{Var}(\hat\theta) = \left[-\frac{\partial^2\ln L}{\partial\theta\,\partial\theta'}\Big|_{\hat\theta}\right]^{-1}$

the inverse of the negative Hessian evaluated at the maximum of the likelihood function, or

$\widehat{Var}(\hat\theta) = \left[\sum_{i=1}^n\frac{\partial\ln L_i}{\partial\theta}\,\frac{\partial\ln L_i}{\partial\theta'}\Big|_{\hat\theta}\right]^{-1}$

the inverse of the outer product of the gradient vectors evaluated at the maximum of the likelihood function.
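For the logit model the gradient and Hessian are available in closed form, so the Newton-Raphson scheme is easy to write down. The Python sketch below (not from the original notes) is one possible implementation; the function name, the zero start values and the convergence rule are illustrative assumptions.

```python
import numpy as np

def logit_newton_raphson(X, y, n_iter=50, tol=1e-10):
    """Logit ML by Newton-Raphson; returns estimates and the inverse-Hessian covariance matrix."""
    n, k = X.shape
    beta = np.zeros(k)                                    # start values
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))               # Lambda(x'beta)
        gradient = X.T @ (y - p)                          # d lnL / d beta
        hessian = -(X * (p * (1.0 - p))[:, None]).T @ X   # d^2 lnL / d beta d beta'
        step = np.linalg.solve(hessian, gradient)
        beta = beta - step                                # beta + D * gradient with D = -H^{-1}
        if np.max(np.abs(gradient)) < tol:
            break
    cov = np.linalg.inv(-hessian)                         # estimated asymptotic covariance matrix
    return beta, cov
```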

Parameter interpretation

Since binary regression models are nonlinear, no single approach to parameter interpretation can fully describe the relationship between a variable and the outcome probability. Here we discuss several methods to interpret parameters. For a given application, you may need to try each method before a final approach is determined.

Predicted probabilities: The most direct approach for interpretation is to examine the predicted probabilities of an event for different values of the independent variables. A useful first step is to examine the range of predicted probabilities within the sample, and the degree to which each variable affects the probabilities. If we consider the shape of the cumulative Normal (or logistic) distribution function, we see that it is approximately linear in the range between 0.2 and 0.8. Hence if the range of predicted probabilities is between 0.2 and 0.8, the relationship between the $x$'s and the predicted probability is nearly linear, and simple measures can be used to summarise the results.
Minimum and maximum predicted probabilities in the sample are evaluated by

$\min_i F(x_i'\hat\beta) \qquad\text{and}\qquad \max_i F(x_i'\hat\beta)$

To examine the effect of a single variable on the predicted probabilities we can allow one variable to vary from its minimum to its maximum, while all other variables are fixed at their means. Let $\hat P(y=1|\bar x, x_k)$ be the probability computed when all variables except $x_k$ are set equal to their means, and $x_k$ equals some specified value. Then the predicted change in the probability as $x_k$ changes from its minimum to its maximum value is given by

$\hat P(y=1|\bar x,\ x_k = \max x_k) - \hat P(y=1|\bar x,\ x_k = \min x_k)$

One can also use plots of predicted probabilities to examine the effect of one or two variables while the other variables are held constant. The effect of a discrete independent variable on the probability can be illustrated by tabulating the predicted probabilities at selected values.
Partial change or marginal effect: In the structural latent variable model $y^* = x'\beta + \varepsilon$ the partial derivative with respect to $x_k$ is

$\frac{\partial y^*}{\partial x_k} = \beta_k$

Since the model is linear in $y^*$ the interpretation is straightforward: for a unit change in $x_k$, $y^*$ is expected to change by $\beta_k$ units, holding all other variables constant. The problem is that the variance of $y^*$ is unknown, so the meaning of a change of $y^*$ by $\beta_k$ is unclear. Since the variance of $y^*$ changes when a new variable is added to the model, the magnitudes of all $\beta_j$ will change even if the new variable is uncorrelated with the original variables. This makes it misleading to compare coefficients from different specifications.
The $\beta_j$ can be used to compute the partial change in the probability of an event. Let

$P(y=1|x) = F(x'\beta)$

then the partial change in the probability, or the marginal effect, is

$\frac{\partial P(y=1|x)}{\partial x_k} = f(x'\beta)\,\beta_k$

In the probit case

$\frac{\partial P(y=1|x)}{\partial x_k} = \phi(x'\beta)\,\beta_k$

In the logit case

$\frac{\partial P(y=1|x)}{\partial x_k} = \Lambda(x'\beta)\left(1 - \Lambda(x'\beta)\right)\beta_k = P(y=1|x)\left(1 - P(y=1|x)\right)\beta_k$

The marginal effect is the slope of the probability curve relating $P(y=1|x)$ to $x_k$, holding all other variables fixed. The sign of the marginal effect is determined by $\beta_k$. The magnitude of the change depends on the magnitude of $\beta_k$ and the value of $x'\beta$.
To assess the relative magnitudes of the marginal effects of two variables we can use the ratio of the marginal effects for $x_k$ and $x_l$:

$\frac{\partial P(y=1|x)/\partial x_k}{\partial P(y=1|x)/\partial x_l} = \frac{f(x'\beta)\,\beta_k}{f(x'\beta)\,\beta_l} = \frac{\beta_k}{\beta_l}$

Since the value of the marginal effect depends on the levels of all variables, we must decide on which values of the variables to use when computing the marginal effect. One method is to use the average over all observations,

$\frac{1}{n}\sum_{i=1}^n f(x_i'\hat\beta)\,\hat\beta_k$

or to compute the marginal effect at the mean of the independent variables,

$f(\bar x'\hat\beta)\,\hat\beta_k$

Taking the mean value of an independent variable does of course make no sense for 0-1 dummy variables. In that case it is better to fix the dummy variable at either value or go for discrete changes.
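Both summary measures are easy to compute once the coefficients are estimated. A short Python sketch for the logit case (not from the original notes); the function name and array conventions are illustrative assumptions.

```python
import numpy as np

def logit_marginal_effects(X, beta):
    """Average marginal effects and marginal effects at the mean for a logit model."""
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    ame = np.mean(p * (1.0 - p)) * beta              # average of Lambda(1 - Lambda) over the sample, times beta
    p_bar = 1.0 / (1.0 + np.exp(-X.mean(axis=0) @ beta))
    mem = p_bar * (1.0 - p_bar) * beta               # marginal effect at the mean of the regressors
    return ame, mem
```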

Discrete change: Let $P(y=1|x, x_k)$ be the probability of an event given $x$, noting in particular the value of $x_k$. The discrete change in the probability for a change of $\delta$ in $x_k$ equals

$\frac{\Delta P(y=1|x)}{\Delta x_k} = P(y=1|x,\ x_k + \delta) - P(y=1|x,\ x_k)$

In nonlinear models the discrete change is unequal to the marginal change, except in the limit as $\delta$ becomes infinitely small. The practical problem is again choosing which values of the variables to consider and how much to let them change. Some options are:

unit change in $x_k$, if $x_k$ increases from $x_k$ to $x_k + 1$

$\frac{\Delta P(y=1|x)}{\Delta x_k} = P(y=1|x,\ x_k + 1) - P(y=1|x,\ x_k)$

centred unit change

$\frac{\Delta P(y=1|x)}{\Delta x_k} = P(y=1|x,\ x_k + 1/2) - P(y=1|x,\ x_k - 1/2)$

standard deviation change, where $s_k$ is the standard deviation of $x_k$

$\frac{\Delta P(y=1|x)}{\Delta x_k} = P(y=1|x,\ x_k + s_k/2) - P(y=1|x,\ x_k - s_k/2)$

a change from 0 to 1 for dummy variables

$\frac{\Delta P(y=1|x)}{\Delta x_k} = P(y=1|x,\ x_k = 1) - P(y=1|x,\ x_k = 0)$

Odds ratios: This method takes advantage of the tractable form of the logit model. Define the odds of an event as

$O(x) = \frac{P(y=1|x)}{P(y=0|x)} = \frac{P(y=1|x)}{1 - P(y=1|x)}$

Taking logs in the logit model gives

$\ln O(x) = x'\beta$

The log odds are linear in $x$. The interpretation of the effect of a change in $x_k$ is straightforward: for a unit change in $x_k$, we expect the log odds to change by $\beta_k$, holding all other variables constant,

$\ln O(x,\ x_k + 1) - \ln O(x,\ x_k) = \beta_k$

To compare the odds before and after adding $\delta$ to $x_k$, we take the odds ratio

$\frac{O(x,\ x_k + \delta)}{O(x,\ x_k)} = \exp(\delta\beta_k)$

and the parameters can be interpreted in terms of odds ratios:

For a change of $\delta$ in $x_k$, the odds are expected to change by a factor of $\exp(\delta\beta_k)$, holding all other variables constant.
Hypothesis testing and measures of goodness of fit

Using the result about the asymptotic normality of the Maximum Likelihood estimator allows formulating a Wald test for testing linear restrictions:

$H_0: R\beta = r, \quad \text{rank}(R) = q$

$W = (R\hat\beta - r)'\left[R\,\widehat{Var}(\hat\beta)\,R'\right]^{-1}(R\hat\beta - r) \xrightarrow{d} \chi^2(q)$

where $\widehat{Var}(\hat\beta)$ is the estimated covariance matrix for $\hat\beta$, which we get from the iterative maximisation procedure.
Alternatively we can formulate a Likelihood Ratio test,

$LR = 2\left(\ln L_U - \ln L_R\right) \xrightarrow{d} \chi^2(q)$

where $\ln L_U$ is the value of the log likelihood function evaluated at the Maximum Likelihood estimates for the unconstrained model, and $\ln L_R$ is the value of the log likelihood function evaluated at the Maximum Likelihood estimates for the constrained model.
To test the overall significance of the model we can use the so-called Likelihood Ratio $\chi^2$ test. It compares the unconstrained model with a model where only a constant is included:

$M_U$: full model
$M_R$: model with only a constant included

$LR = 2\left(\ln L_U - \ln L_R\right) \xrightarrow{d} \chi^2(k-1)$
Residuals: For a binary model we define

$\hat\pi_i = \hat E(y_i|x_i) = \hat P(y_i = 1|x_i) = F(x_i'\hat\beta)$

Since $y$ is binary, the deviations $y_i - \hat\pi_i$ are heteroscedastic, with $Var(y_i - \pi_i|x_i) = \pi_i(1-\pi_i)$. This suggests the Pearson residual:

$r_i = \frac{y_i - \hat\pi_i}{\sqrt{\hat\pi_i(1-\hat\pi_i)}}$

Large values of $r_i$ suggest a failure of the model to fit a given observation. Pearson residuals can be used to construct a summary measure of fit, known as the Pearson statistic,

$\chi^2 = \sum_{i=1}^n r_i^2$

Beware that while $Var(y_i|x_i) = \pi_i(1-\pi_i)$,

$Var(y_i - \hat\pi_i) \ne \hat\pi_i(1-\hat\pi_i)$

Consequently the variance of $r_i$ is not equal to 1.

Pseudo $R^2$'s: Several pseudo-$R^2$ measures for nonlinear models have been defined in analogy to the formulas for the linear regression model. These formulas produce different values in models with categorical outcomes and, consequently, are thought of as distinct measures.

Percentage of explained variation

$R^2 = 1 - \frac{\sum_i(y_i - \hat\pi_i)^2}{\sum_i(y_i - \bar y)^2}$

Likelihood Ratio Index, McFadden

$R^2_{McF} = 1 - \frac{\ln L_U}{\ln L_R}$

where $\ln L_R$ is the log likelihood of the model with only a constant. Since $R^2_{McF}$ increases as a new variable is added to the model, an adjusted version penalises the number of parameters $K$:

$\bar R^2_{McF} = 1 - \frac{\ln L_U - K}{\ln L_R}$

Observed versus predicted values

$\hat y_i = 1$ if $\hat\pi_i > 1/2$, $\quad \hat y_i = 0$ if $\hat\pi_i \le 1/2$

$R^2_{count} = \frac{1}{n}\,\#\{\text{correctly predicted observations}\}$
9.2 Censored regression models: Tobit

In the linear regression model, the values of all variables are known for the entire sample. Here we consider the situation in which the sample is limited by censoring or truncation. Censoring occurs when we observe the independent variables for the entire sample, but for some observations we have only limited information about the dependent variable. For example, we might know that the dependent variable is less than 100, but not know how much less. Truncation limits the data more severely by excluding observations based on characteristics of the dependent variable. For example, in a truncated sample all cases where the dependent variable is less than 100 would be deleted. While truncation changes the sample, censoring does not.

Problem of censoring, example: Let $y^*$ be a dependent variable that is not censored. If we do not know the value of $y^*$ for $y^* \le 1$, then $y^*$ is a latent variable that cannot be observed over its entire range. The censored variable $y$ is defined by

$y = \begin{cases} y^* & \text{if } y^* > 1 \\ 1 & \text{if } y^* \le 1 \end{cases}$

Consider the model $y^* = \alpha + \beta x + \varepsilon$. With no censoring of observations OLS would result in unbiased estimates

$\hat y = \hat\alpha + \hat\beta x$

If $y^*$ is censored from below at 1, we know $x$ for all observations, but observe $y^*$ only for $y^* > 1$; the values at or below 1 are recorded as $y = 1$. The resulting OLS estimate on the censored sample,

$\hat y = \hat\alpha_c + \hat\beta_c x$

overestimates the intercept and underestimates the slope.

Since including censored observations causes problems, we might use OLS to estimate the regression after truncating the sample to exclude cases with a censored dependent variable. After deleting the cases at $y = 1$, the OLS estimate

$\hat y = \hat\alpha_t + \hat\beta_t x$

also overestimates the intercept and underestimates the slope.
Truncated and censored Normal distribution

Assume $y^*$ is normally distributed, $y^* \sim N(\mu, \sigma^2)$, with density

$f(y^*|\mu, \sigma^2) = \frac{1}{\sigma}\,\phi\!\left(\frac{y^* - \mu}{\sigma}\right)$

Truncated Normal distribution: When values below $\tau$ are deleted, the variable $y^*|y^* > \tau$ has a truncated Normal distribution with density

$f(y^*|y^* > \tau, \mu, \sigma^2) = \frac{f(y^*|\mu, \sigma^2)}{P(y^* > \tau)} = \frac{\frac{1}{\sigma}\phi\!\left(\frac{y^* - \mu}{\sigma}\right)}{1 - \Phi\!\left(\frac{\tau - \mu}{\sigma}\right)} = \frac{\frac{1}{\sigma}\phi\!\left(\frac{y^* - \mu}{\sigma}\right)}{\Phi\!\left(\frac{\mu - \tau}{\sigma}\right)}$

Given that the left-hand side of the distribution has been truncated, $E(y^*|y^* > \tau)$ must be larger than $E(y^*) = \mu$. For the Normal distribution we have

$E(y^*|y^* > \tau) = \mu + \sigma\,\frac{\phi\!\left(\frac{\mu - \tau}{\sigma}\right)}{\Phi\!\left(\frac{\mu - \tau}{\sigma}\right)} = \mu + \sigma\,\lambda\!\left(\frac{\mu - \tau}{\sigma}\right)$   (16)

where $\lambda(z) = \phi(z)/\Phi(z)$ is called the inverse Mill's ratio.

Censored Normal distribution:

$y = \begin{cases} y^* & \text{if } y^* > \tau \\ \tau_y & \text{if } y^* \le \tau \end{cases}$

Most often, $\tau_y = \tau$. We know that if $y^*$ is Normal, then the probability of an observation being censored is

$P(\text{censored}) = P(y^* \le \tau) = \Phi\!\left(\frac{\tau - \mu}{\sigma}\right)$   (17)

and the probability of not being censored is

$P(\text{uncensored}) = 1 - \Phi\!\left(\frac{\tau - \mu}{\sigma}\right) = \Phi\!\left(\frac{\mu - \tau}{\sigma}\right)$   (18)

Thus the expected value of a censored variable equals

$E(y) = P(\text{uncensored})\,E(y^*|y^* > \tau) + P(\text{censored})\,\tau_y = \Phi\!\left(\frac{\mu - \tau}{\sigma}\right)\left[\mu + \sigma\lambda\!\left(\frac{\mu - \tau}{\sigma}\right)\right] + \Phi\!\left(\frac{\tau - \mu}{\sigma}\right)\tau_y$   (19)

Tobit model for censored data

The structural equation is

$y_i^* = x_i'\beta + \varepsilon_i$

where $\varepsilon_i \sim N(0, \sigma^2)$. $y^*$ is a latent variable that is observed for values greater than $\tau$ and censored for values less than or equal to $\tau$:

$y_i = \begin{cases} y_i^* & \text{if } y_i^* > \tau \\ \tau_y & \text{if } y_i^* \le \tau \end{cases}$

Combining the two equations results in the model

$y_i = \begin{cases} x_i'\beta + \varepsilon_i & \text{if } y_i^* > \tau \\ \tau_y & \text{if } y_i^* \le \tau \end{cases}$

The probability of an observation being censored for a given $x_i$ is

$P(\text{censored}|x_i) = P(y_i^* \le \tau|x_i) = P(\varepsilon_i \le \tau - x_i'\beta\,|\,x_i)$   (20)

Since $\varepsilon_i \sim N(0, \sigma^2)$,

$P(\text{censored}|x_i) = P\!\left(\frac{\varepsilon_i}{\sigma} \le \frac{\tau - x_i'\beta}{\sigma}\,\Big|\,x_i\right) = \Phi\!\left(\frac{\tau - x_i'\beta}{\sigma}\right)$   (21)

and

$P(\text{uncensored}|x_i) = 1 - \Phi\!\left(\frac{\tau - x_i'\beta}{\sigma}\right) = \Phi\!\left(\frac{x_i'\beta - \tau}{\sigma}\right)$
Deriving the probability of a case being censored is very similar to deriving the probability of an event in the probit model. In the tobit model we know the value of $y_i^*$ if $y_i^* > \tau$, while in the probit model we only know whether $y_i^* > \tau$. Since more information is available in tobit, estimates of the parameters from tobit are more efficient than the estimates from probit. Further, since all cases are censored in probit, we have no way to estimate the variance of $y^*$ and must make assumptions about it, while $Var(\varepsilon) = \sigma^2$ can be estimated in the tobit model.

Analysing a truncated sample:

$E(y_i|y_i^* > \tau, x_i) = E(x_i'\beta + \varepsilon_i|y_i^* > \tau, x_i) = x_i'\beta + E(\varepsilon_i|y_i^* > \tau, x_i)$

From equation (16) we know that

$E(y_i|y_i^* > \tau, x_i) = x_i'\beta + \sigma\lambda(\delta_i)$   (22)

where $\lambda$ is the Mill's ratio and $\delta_i = (x_i'\beta - \tau)/\sigma$. The problem introduced by truncation is that the regression model implied by equation (22) is of the form

$y_i = x_i'\beta + \sigma\lambda_i + v_i$

In this equation $\lambda_i = \lambda(\delta_i)$ may be thought of as an additional variable. If we estimate $\beta$ using only $y_i = x_i'\beta + v_i$, we have a misspecified model that excludes $\lambda_i$.

Analysing a censored sample: In the case of a censored sample, applying equation (19) we get

$E(y_i|x_i) = P(\text{uncensored}|x_i)\,E(y_i|y_i^* > \tau, x_i) + P(\text{censored}|x_i)\,\tau_y$   (23)

Using equations (20) and (21) with $\delta_i = (x_i'\beta - \tau)/\sigma$,

$E(y_i|x_i) = \Phi(\delta_i)\,x_i'\beta + \sigma\phi(\delta_i) + \Phi(-\delta_i)\,\tau_y$

$E(y_i|x_i)$ is nonlinear in $\beta$, so estimating the OLS regression of $y$ on $x$ results in inconsistent estimates of the parameters.
Estimation: Maximum Likelihood estimation of the tobit model involves dividing the observations into two sets. The first contains the uncensored observations, which ML treats in the same way as in the linear regression model. For the censored observations we do not know the value of $y_i^*$, but we know that $y_i^* \le \tau$. Hence we use the probability of being censored as the likelihood contribution.

Formally, for uncensored observations the likelihood contributions are

$L_i(\beta, \sigma^2|y_i, x_i) = \frac{1}{\sigma}\,\phi\!\left(\frac{y_i - x_i'\beta}{\sigma}\right)$

$\ln L_U(\beta, \sigma^2) = \sum_{\text{uncensored}}\ln\frac{1}{\sigma}\,\phi\!\left(\frac{y_i - x_i'\beta}{\sigma}\right)$

and for censored observations

$L_i(\beta, \sigma^2|x_i) = P(y_i^* \le \tau|x_i) = \Phi\!\left(\frac{\tau - x_i'\beta}{\sigma}\right)$

$\ln L_C(\beta, \sigma^2) = \sum_{\text{censored}}\ln\Phi\!\left(\frac{\tau - x_i'\beta}{\sigma}\right)$

The log likelihood is thus given by

$\ln L(\beta, \sigma^2) = \sum_{\text{uncensored}}\ln\frac{1}{\sigma}\,\phi\!\left(\frac{y_i - x_i'\beta}{\sigma}\right) + \sum_{\text{censored}}\ln\Phi\!\left(\frac{\tau - x_i'\beta}{\sigma}\right)$

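This log likelihood can be maximised numerically. Below is a minimal Python sketch (not from the original notes); the log-sigma parametrisation, the function names and the use of scipy's optimiser are illustrative assumptions.

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize

def tobit_negative_loglik(params, X, y, tau=0.0):
    """Negative tobit log likelihood with censoring from below at tau; params = (beta, log sigma)."""
    beta, sigma = params[:-1], np.exp(params[-1])         # log-parametrisation keeps sigma > 0
    xb = X @ beta
    censored = y <= tau
    ll_unc = stats.norm.logpdf((y - xb) / sigma) - np.log(sigma)
    ll_cen = stats.norm.logcdf((tau - xb) / sigma)
    return -(np.sum(ll_unc[~censored]) + np.sum(ll_cen[censored]))

def fit_tobit(X, y, tau=0.0):
    start = np.zeros(X.shape[1] + 1)                      # crude start values
    res = minimize(tobit_negative_loglik, start, args=(X, y, tau), method="BFGS")
    return res.x[:-1], np.exp(res.x[-1])                  # beta_hat, sigma_hat
```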
Interpretation: For the interpretation of the estimation outcome it is important whether censoring or truncation occurred by accident or whether censoring is a genuine property of the data. We distinguish between the effects of changes in the independent variables on the latent variable, the truncated variable, or the censored variable.

Change in the latent outcome:

$\frac{\partial E(y^*|x)}{\partial x_k} = \beta_k$

Change in the truncated outcome (with $\delta = (x'\beta - \tau)/\sigma$):

$\frac{\partial E(y|y^* > \tau, x)}{\partial x_k} = \beta_k\left[1 - \delta\,\lambda(\delta) - \lambda(\delta)^2\right]$

Change in the censored outcome:

$\frac{\partial E(y|x)}{\partial x_k} = \beta_k\,\Phi(\delta) = \beta_k\,P(\text{uncensored}|x) \qquad\text{if } \tau_y = \tau$

McDonald and Moffitt's decomposition: McDonald and Moffitt suggest a decomposition of $\partial E(y|x)/\partial x_k$ that highlights two sources of change in the censored outcome. The simplest way to derive their decomposition is to differentiate equation (23) using the product rule:

$E(y|x) = P(\text{uncensored}|x)\,E(y|y^* > \tau, x) + P(\text{censored}|x)\,\tau_y$

$\frac{\partial E(y|x)}{\partial x_k} = P(\text{uncensored}|x)\,\frac{\partial E(y|y^* > \tau, x)}{\partial x_k} + \left[E(y|y^* > \tau, x) - \tau_y\right]\frac{\partial P(\text{uncensored}|x)}{\partial x_k}$

The decomposition shows that when $x_k$ changes, it affects the expectation of $y^*$ for uncensored cases weighted by the probability of being uncensored, and it affects the probability of being uncensored weighted by the expected value for uncensored cases minus the censoring value $\tau_y$.
