
Jigjiga University, College of Business and Economics, Department of Economics

CHAPTER ONE
1. Regression Analysis with Qualitative Information: Binary (or Dummy) Variables
In Econometrics I, the dependent and independent variables in our regression models have had
quantitative meaning. Just a few examples include hourly wage rate, years of education, college
grade point average, household’s consumption level and saving rate, deposit interest rate,
and country’s output level, inflation rate, imports and exports. In each case, the magnitude of the
variable conveys useful information. In empirical work, we must also incorporate qualitative
factors into regression models. The gender (male/female) or race of an individual (white, black),
the industry of a firm (manufacturing, retail, etc.), religious affiliation categories (Catholic, Jewish,
Protestant, Muslim, other), primary mode of transportation to work (automobile, bicycle, bus,
subway, walk), favorite type of music (classical, country, folk, jazz, rock), favorite place to shop
(local mall, local downtown, Internet, other) and the regional states in Ethiopia where a city is
located (ESRS, Oromia, Tigray, Amhara, Harari, Afar, SNNP, other) are all considered to be
qualitative factors.

This chapter is dedicated to regression analysis with qualitative variables. After we discuss the
appropriate ways to describe qualitative information in Section 1.1, we show how qualitative
explanatory variables can be easily incorporated into multiple regression models in Section 1.2.
The last section of this chapter deals with a dummy as the dependent variable (the LPM, logit, and
probit models).

1.1. Describing Qualitative Information


Qualitative factors often come in the form of binary information: a person is female or male; a
person does or does not own a personal computer; a firm offers a certain kind of employee pension
plan or it does not; a state administers capital punishment or it does not. In all of these examples,
the relevant information can be captured by defining a binary variable or a zero-one variable. In
econometrics, binary variables are most commonly called dummy variables, although this name is
not especially descriptive.

In defining a dummy variable, we must decide which event is assigned the value one and which is
assigned the value zero. For example, in a study of individual wage determination, we might define
female to be a binary variable taking on the value one for females and the value zero for males.
The name in this case indicates the event with the value one. The same information is captured by
defining male to be one if the person is male and zero if the person is female. Either of these is
better than using gender because this name does not make it clear when the dummy variable is
one: does gender =1 correspond to male or female? What we call our variables is unimportant for
getting regression results, but it always helps to choose names that clarify equations and
expositions.

Econometrics II Lecture Notes



Variables that assume such 0 and 1 values are called dummy variables. Such variables are thus
essentially a device to classify data into mutually exclusive categories such as male or female.
Dummy variables usually indicate a dichotomy: “presence” or “absence”, “yes” or “no”, etc. Such
variables indicate a “quality” or an attribute, such as “male” or “female”, “black” or “white”,
“urban” or “non-urban”, “before” or “after”, “North” or “South”, “East” or “West”, marital status,
job category, region, season, etc. We quantify such variables by artificially assigning values to
them (for example, assigning 0 and 1 to sex, where 0 indicates male and 1 indicates female), and
use them in the regression equation together with the other independent variables. Such variables
are called dummy variables. Alternative names are indicator variables, binary variables,
categorical variables, and dichotomous variables.

Dummy variables can be incorporated in regression models just as easily as quantitative variables.
As a matter of fact, a regression model may contain regressors that are all exclusively dummy, or
qualitative, in nature. Such models are called Analysis of Variance (ANOVA) models. Regression
models in most economic research involve quantitative explanatory variables in addition to
dummy variables. Such models are known as Analysis of Covariance (ANCOVA) models.

1.2. Dummy as Independent Variables


1.2.1. A Single Dummy Independent Variable

How do we incorporate binary information into regression models? In the simplest case, with only
a single dummy explanatory variable, we just add it as an independent variable in the equation.
For example, consider the following simple model of hourly wage determination:

wage = β₀ + δ₀D + β₁educ + u .......................................... (1.1)

Where: wage = wage rate of a certain individual

educ = level of education

D = 1 if female; 0 otherwise (hereafter, we shall designate all dummy variables by the letter D).

In model (1.1), only two observed factors affect wage: gender and education. Since D = 1 when
the person is female, and D = 0 when the person is male, the parameter δ₀ has the following
interpretation: δ₀ is the difference in hourly wage between females and males, given the same
amount of education (and the same error term u). Thus, the coefficient δ₀ determines whether there
is discrimination against women: if δ₀ < 0, then, for the same level of other factors, women earn
less than men on average.

In terms of expectations, if we assume the zero conditional mean assumption E(u | D, educ) = 0, then


0  E(wage/ D  1,educ)  E(wage/ D  0,educ)

The key here is that the level of education is the same in both expectations; the difference, δ₀, is
due to gender only.

The situation can be depicted graphically as an intercept shift between males and females. In Figure
1.1, the case δ₀ < 0 is shown, so that men earn a fixed amount more per hour than women. The
difference does not depend on the amount of education, and this explains why the wage-education
profiles for women and men are parallel.

At this point, you may wonder why we do not also include in (1.1) a dummy variable, say male,
which is one for males and zero for females. The reason is that this would be redundant. In (1.1), the
intercept for males is β₀, and the intercept for females is β₀ + δ₀. Since there are just two groups,
we only need two different intercepts. This means that, in addition to β₀, we need to use only one
dummy variable; we have chosen to include the dummy variable for females. Using two dummy
variables would introduce perfect collinearity because female + male = 1, which means that male
is a perfect linear function of female. Including dummy variables for both genders is the simplest
example of the so-called dummy variable trap, which arises when too many dummy variables
describe a given number of groups.
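The dummy variable trap can be seen numerically: with an intercept, adding dummies for both groups makes the design matrix rank deficient, so OLS breaks down. A minimal sketch with made-up data:

```python
import numpy as np

# Hypothetical sample: 1 = female, 0 = male
female = np.array([1, 0, 1, 0, 1])
male = 1 - female                     # perfectly collinear: female + male = 1
educ = np.array([12.0, 10.0, 16.0, 11.0, 14.0])
const = np.ones(5)

# Correct design matrix: intercept + one dummy (the m - 1 rule)
X_ok = np.column_stack([const, female, educ])
# Trap: intercept + both dummies -> perfect collinearity
X_trap = np.column_stack([const, female, male, educ])

print(np.linalg.matrix_rank(X_ok))    # 3: full column rank, OLS is estimable
print(np.linalg.matrix_rank(X_trap))  # 3 < 4 columns: rank deficient
```

Dropping either the intercept or one of the two dummies restores full rank, which is exactly the m − 1 rule discussed later in this chapter.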

Figure 1.1. Intercept shift between the wage–education profiles of men and women (the case δ₀ < 0; the two lines are parallel)


Model (1.1) contains one quantitative variable (level of education) and one qualitative variable (sex)
that has two classes (or categories), namely, male and female. What is the meaning of this
equation? Assuming, as usual, that E(uᵢ) = 0, we see that

Mean salary of female college professor: E(Y | D = 1, educ) = (β₀ + δ₀) + β₁educ ---------- (1.2)

Mean salary of male college professor: E(Y | D = 0, educ) = β₀ + β₁educ ---------- (1.3)

If the assumption of common slopes is valid, a test of the hypothesis that the two regressions (1.2)
and (1.3) have the same intercept (i.e., there is no sex discrimination) can be made easily by
running regression (1.1) and noting the statistical significance of the estimated δ₀ on the basis
of the traditional t test. If the t test shows that it is statistically significant, we reject the null
hypothesis that the male and female college professors’ levels of mean annual salary are the same.

Before proceeding further, note the following features of the dummy variable regression model
considered previously.

1. To distinguish the two categories, male and female, we have introduced only one dummy
variable D. For example, if Dᵢ = 1 always denotes a male, when Dᵢ = 0 we know that it
is a female since there are only two possible outcomes. Hence, one dummy variable
suffices to distinguish two categories. The general rule is this: If a qualitative variable has
‘m’ categories, introduce only ‘m-1’ dummy variables. In our example, sex has two
categories, and hence we introduced only a single dummy variable. If this rule is not
followed, we shall fall into what might be called the dummy variable trap, that is, the
situation of perfect multicollinearity.
2. The assignment of 1 and 0 values to two categories, such as male and female, is arbitrary
in the sense that in our example we could have assigned D=1 for female and D=0 for male.
3. The group, category, or classification that is assigned the value of 0 is often referred to as
the base, benchmark, control, comparison, reference, or omitted category. It is the
base in the sense that comparisons are made with that category.
4. The coefficient attached to the dummy variable D can be called the differential intercept
coefficient because it tells by how much the value of the intercept term of the category that
receives the value of 1 differs from the intercept coefficient of the base category.
Example 1.1
(Effects of Computer Ownership on College GPA)
In order to determine the effects of computer ownership on college grade point average, we
estimate the model
colGPA = β₀ + δ₀D + β₁hsGPA + β₂ACT + u
where the dummy variable D equals one if a student owns a personal computer and zero
otherwise. There are various reasons PC ownership might have an effect on colGPA. A student’s


work might be of higher quality if it is done on a computer, and time can be saved by not having
to wait at a computer lab. Of course, a student might be more inclined to play computer games
or surf the Internet if he or she owns a PC, so it is not obvious that δ₀ is positive. The variables
hsGPA (high school GPA) and ACT (achievement test score) are used as controls: it could be
that stronger students, as measured by high school GPA and ACT scores, are more likely to own
computers. We control for these factors because we would like to know the average effect on
colGPA if a student is picked at random and given a personal computer.
If the estimated model is

colGPA = 1.26 + 0.157D + 0.447hsGPA + 0.0087ACT,  R² = .219
SE        (.33)   (.057)     (.049)        (.0105)

this equation implies that a student who owns a PC has a predicted GPA about .16 point higher
than a comparable student without a PC (remember, both colGPA and hsGPA are on a four-
point scale). The effect is also statistically significant, with t = .157/.057 ≈ 2.75.
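An intercept-shift regression like this can be estimated by ordinary least squares with the dummy entered as just another column. The sketch below uses numpy on a small made-up data set (the variable names and numbers are illustrative, not the GPA data above):

```python
import numpy as np

# Hypothetical data: a female dummy D and years of education
D = np.array([1, 0, 1, 0, 1, 0, 1, 0])
educ = np.array([12, 12, 14, 14, 16, 16, 10, 10], dtype=float)
# Build a noiseless wage embedding beta0 = 2, delta0 = -1.5, beta1 = 0.5
wage = 2.0 - 1.5 * D + 0.5 * educ

# Design matrix: intercept, dummy, quantitative regressor
X = np.column_stack([np.ones_like(educ), D, educ])
coef, *_ = np.linalg.lstsq(X, wage, rcond=None)
b0, d0, b1 = coef
# With noiseless data, OLS recovers the coefficients (up to rounding)
print(round(b0, 6), round(d0, 6), round(b1, 6))  # approx. 2.0 -1.5 0.5
```

The estimated d0 is the male/female intercept difference at every education level, mirroring the interpretation of δ₀ in (1.1).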

1.2.2. Using Dummy Variables for Multiple Categories

Suppose that, on the basis of the cross-sectional data, we want to regress the annual expenditure
on health care by an individual on the income and education of the individual. Since the
variable education is qualitative in nature, suppose we consider three mutually exclusive levels
of education: less than high school, high school, and college. Now, unlike the previous case,
we have more than two categories of the qualitative variable education. Therefore, following
the rule that the number of dummies be one less than the number of categories of the variable,
we should introduce two dummies to take care of the three levels of education. Assuming that
the three educational groups have a common slope but different intercepts in the regression of
annual expenditure on health care on annual income, we can use the following model:

Yᵢ = α₁ + α₂D₂ᵢ + α₃D₃ᵢ + βXᵢ + uᵢ …………………………………. (1.4)

Where: Yᵢ = annual expenditure on health care

Xᵢ = annual income

D₂ᵢ = 1 if high school education; 0 otherwise

D₃ᵢ = 1 if college education; 0 otherwise

Note that in the preceding assignment of the dummy variables we are arbitrarily treating the “less
than high school education” category as the base category. Therefore, the intercept α₁ will reflect


the intercept for this category. The differential intercepts α₂ and α₃ tell by how much the
intercepts of the other two categories differ from the intercept of the base category, which can be
readily checked as follows: Assuming E(uᵢ) = 0, we obtain from (1.4)

E(Yᵢ | D₂ = 0, D₃ = 0, Xᵢ) = α₁ + βXᵢ

E(Yᵢ | D₂ = 1, D₃ = 0, Xᵢ) = (α₁ + α₂) + βXᵢ

E(Yᵢ | D₂ = 0, D₃ = 1, Xᵢ) = (α₁ + α₃) + βXᵢ

which are, respectively, the mean health care expenditure functions for the three levels of education,
namely, less than high school, high school, and college. Geometrically, the situation is shown in
Figure 1.2 (for illustrative purposes it is assumed that α₃ > α₂).
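The "m − 1 dummies for m categories" rule above can be mechanized: given a categorical variable with m levels, build one dummy per non-base category. A sketch (the category labels are illustrative):

```python
# Education with m = 3 categories; "less_hs" is the base category
educ_level = ["less_hs", "high_school", "college", "college", "less_hs"]

# m - 1 = 2 dummies, one for each non-base category
D2 = [1 if e == "high_school" else 0 for e in educ_level]
D3 = [1 if e == "college" else 0 for e in educ_level]

print(D2)  # [0, 1, 0, 0, 0]
print(D3)  # [0, 0, 1, 1, 0]
# A base-category observation has D2 = D3 = 0, so its mean is
# absorbed by the intercept alpha_1, as in equation (1.4).
```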

Figure 1.2. Expenditure on health care in relation to income for three levels of education

1.2.3. Regression on One Quantitative Variable and Two Qualitative Variables

The technique of dummy variable can be easily extended to handle more than one qualitative
variable. Let us revert to the college professors’ salary regression (1.1), but now assume that
in addition to years of teaching experience and sex, the skin color of the teacher is also an
important determinant of salary. For simplicity, assume that color has two categories: black
and white. We can now write (1.1) as:

Yᵢ = α₁ + α₂D₂ᵢ + α₃D₃ᵢ + βXᵢ + uᵢ ------------------------------------------- (1.5)


Where: Yᵢ = annual salary

Xᵢ = years of teaching experience

D₂ᵢ = 1 if male; 0 otherwise

D₃ᵢ = 1 if white; 0 otherwise

Notice that each of the two qualitative variables, sex and color, has two categories and hence needs
one dummy variable for each. Note also that the omitted, or base, category now is “black female
professor.”

Assuming E(uᵢ) = 0, we can obtain the following regressions from (1.5):

Mean salary for black female professor:
E(Yᵢ | D₂ = 0, D₃ = 0, Xᵢ) = α₁ + βXᵢ

Mean salary for black male professor:
E(Yᵢ | D₂ = 1, D₃ = 0, Xᵢ) = (α₁ + α₂) + βXᵢ

Mean salary for white female professor:
E(Yᵢ | D₂ = 0, D₃ = 1, Xᵢ) = (α₁ + α₃) + βXᵢ

Mean salary for white male professor:
E(Yᵢ | D₂ = 1, D₃ = 1, Xᵢ) = (α₁ + α₂ + α₃) + βXᵢ
Once again, it is assumed that the preceding regressions differ only in the intercept
coefficient but not in the slope coefficient β.

An OLS estimation of (1.5) will enable us to test a variety of hypotheses. Thus, if α₃ is
statistically significant, it will mean that color does affect a professor’s salary. Similarly,
if α₂ is statistically significant, it will mean that sex also affects a professor’s salary. If
both these differential intercepts are statistically significant, it would mean sex as well as
color is an important determinant of professors’ salaries.
From the preceding discussion it follows that we can extend our model to include more than one
quantitative variable and more than two qualitative variables. The only precaution to be taken is
that the number of dummies for each qualitative variable should be one less than the number of
categories of that variable.


Example 1.2
(Log Hourly Wage)
Let us estimate a model that allows for wage differences among four groups: married men,
married women, single men, and single women. To do this, we must select a base group; we
choose single men. Then, we must define dummy variables for each of the remaining groups.
Call these D1 (married male), D2 (married female), and D3 (single female). Putting these three
variables into (1.1) and if the estimated equation gives the following result,
log(wage) = .321 + .213D1 − .198D2 − .110D3 + .079educ + .027exp
SE          (.100)  (.055)   (.058)   (.056)   (.007)    (.005)
To interpret the coefficients on the dummy variables, we must remember that the base group is
single males. Thus, the estimates on the three dummy variables measure the proportionate
difference in wage relative to single males. For example, married men are estimated to earn
about 21.3% more than single men, holding levels of education and experience fixed. A married
woman, on the other hand, earns a predicted 19.8% less than a single man with the same levels
of the other variables.
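One caveat when reading dummy coefficients in a log(wage) equation: the coefficient is only an approximation to the proportionate differential; the exact differential is exp(coefficient) − 1. A quick check for the married-male coefficient above:

```python
import math

b_married_male = 0.213               # logged-wage dummy coefficient
exact = math.exp(b_married_male) - 1  # exact proportionate differential
print(round(100 * exact, 1))          # about 23.7%, not 21.3%
```

The approximation is good for small coefficients and deteriorates as they grow, which is why the "21.3% more" reading above is best treated as approximate.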

1.2.4. Interactions Among Dummy Variables

Consider the following model:

Yᵢ = α₁ + α₂D₂ᵢ + α₃D₃ᵢ + βXᵢ + uᵢ ……………………………………………. (1.6)

Where: Yᵢ = annual expenditure on clothing

Xᵢ = annual income

D₂ᵢ = 1 if female; 0 if male

D₃ᵢ = 1 if college graduate; 0 otherwise

Implicit in this model is the assumption that the differential effect of the sex dummy D2 is constant
across the two levels of education and the differential effect of the education dummy D3 is also
constant across the two sexes. That is, if, say, the mean expenditure on clothing is higher for
females than for males, this is so whether they are college graduates or not. Likewise, if, say, college
graduates on average spend more on clothing than non-college graduates, this is so whether
they are females or males.

In many applications such an assumption may be untenable. A female college graduate may spend
more on clothing than a male college graduate. In other words, there may be interaction between
the two qualitative variables D2 and D3 . Therefore their effect on mean Y may not be simply
additive as in (1.6) but multiplicative as well, as in the following model:


Yᵢ = α₁ + α₂D₂ᵢ + α₃D₃ᵢ + α₄(D₂ᵢD₃ᵢ) + βXᵢ + uᵢ ………………………………. (1.7)

From (1.7) we obtain


E(Yᵢ | D₂ = 1, D₃ = 1, Xᵢ) = (α₁ + α₂ + α₃ + α₄) + βXᵢ

which is the mean clothing expenditure of graduate females. Notice that

α₂ = differential effect of being a female
α₃ = differential effect of being a college graduate
α₄ = differential effect of being a female graduate

which shows that the mean clothing expenditure of graduate females differs (by α₄) from the mean
clothing expenditure of females or college graduates. If α₂, α₃, and α₄ are all positive, the average
clothing expenditure of females is higher (than the base category, which here is male non-
graduate), but it is much more so if the females also happen to be graduates. Similarly, the
average expenditure on clothing by a college graduate tends to be higher than the base
category but much more so if the graduate happens to be a female. This shows how the
interaction dummy modifies the effect of the two attributes considered individually.
Whether the coefficient of the interaction dummy is statistically significant can be tested
by the usual t test. If it turns out to be significant, the simultaneous presence of the two
attributes will attenuate or reinforce the individual effects of these attributes. Needless to
say, omitting a significant interaction term incorrectly will lead to a specification bias.
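The four group intercepts implied by the interaction model (1.7) can be tabulated directly from its coefficients. A sketch with illustrative (made-up) coefficient values:

```python
# Hypothetical coefficients for model (1.7): alpha_1 .. alpha_4
a1, a2, a3, a4 = 100.0, 20.0, 30.0, 15.0

# Intercept for each (D2, D3) cell: female dummy x graduate dummy
groups = {}
for D2 in (0, 1):
    for D3 in (0, 1):
        groups[(D2, D3)] = a1 + a2 * D2 + a3 * D3 + a4 * D2 * D3

print(groups[(0, 0)])  # base: male non-graduate -> 100.0
print(groups[(1, 1)])  # female graduate -> 100 + 20 + 30 + 15 = 165.0
# Without the interaction, the (1,1) cell would be only 150.0;
# the extra 15.0 is alpha_4, the interaction effect.
```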

1.2.5. Testing for Structural Stability of Regression Models

Until now, in the models considered in this chapter we assumed that the qualitative variables affect
the intercept but not the slope coefficient of the various subgroup regressions. But what if the
slopes are also different? If the slopes are in fact different, testing for differences in the intercepts
may be of little practical significance. Therefore, we need to develop a general methodology to
find out whether two (or more) regressions are different, where the difference may be in the
intercepts or the slopes or both.

Suppose we are interested in estimating a simple saving function that relates domestic
household savings (S) with gross domestic product (Y) for Ethiopia. Suppose further that, at a
certain point in time, a series of economic reforms was introduced. The hypothesis
here is that such reforms might have considerably influenced the savings–income relationship,
that is, the relationship between savings and income might be different in the post-reform period
as compared to that in the pre-reform period. If this hypothesis is true, then we say a structural
change has happened. How do we check if this is so?

Write the savings function as:


Sₜ = α₀ + α₁Dₜ + α₂Yₜ + α₃(YₜDₜ) + uₜ ..........................................................(1.8)

where Sₜ is household saving at time t, Yₜ is GDP at time t, and:

Dₜ = 0 if pre-reform (before 1991); Dₜ = 1 if post-reform (1991 and after)

Here α₃ is the differential slope coefficient indicating how much the slope coefficient of the pre-
reform period savings function differs from the slope coefficient of the savings function in the post-
reform period. If α₁ and α₃ are both statistically significant as judged by the t-test, then the pre-
reform and post-reform regressions differ in both the intercept and the slope. However, if only
α₁ is statistically significant, then the pre-reform and post-reform regressions differ only in
the intercept (meaning the marginal propensity to save (MPS) is the same for the pre-reform
and post-reform periods). Similarly, if only α₃ is statistically significant, then the two regressions
differ only in the slope (MPS).

Example: Ŝₜ = −20.76005 + 5.99916Dₜ + 2.616285Yₜ − 0.5298177(YₜDₜ)

SE          (6.04)      (6.4)       (.57)       (.6035149)

Since α̂₁ and α̂₃ are both statistically insignificant, there is no difference between the pre-reform
and post-reform regressions for the saving model.
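The structural-stability test above amounts to one pooled OLS regression with a period dummy and a dummy–income interaction. A minimal numpy sketch on made-up pre/post-reform data:

```python
import numpy as np

# Hypothetical income series; D = 1 marks post-reform years
Y = np.array([100., 120., 140., 160., 180., 200., 220., 240.])
D = np.array([0, 0, 0, 0, 1, 1, 1, 1])
# Build savings with an intercept shift (+5) and a slope shift (+0.1) post-reform
S = 10.0 + 5.0 * D + 0.20 * Y + 0.10 * (Y * D)

# Pooled regression of equation (1.8): constant, D, Y, and Y*D
X = np.column_stack([np.ones_like(Y), D, Y, Y * D])
a0, a1, a2, a3 = np.linalg.lstsq(X, S, rcond=None)[0]
print(round(a2, 4), round(a3, 4))  # approx. 0.2 and 0.1
# Pre-reform MPS is a2; post-reform MPS is a2 + a3, so a significant
# a3 would signal a structural change in the slope.
```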

1.3. Dummy as Dependent Variable


In the last several sections, we studied how, through the use of binary independent variables, we
can incorporate qualitative information as explanatory variables in a multiple regression model. In
all of the models up until now, the dependent variable Y has had quantitative meaning (for
example, Y is a dollar amount, a test score, a percent, or the logs of these). What happens if we
want to use multiple regression to explain a qualitative event?

In the simplest case, and one that often arises in practice, the event we would like to explain is a
binary outcome. In other words, our dependent variable, Y, takes on only two values: zero and
one. For example, Y can be defined to indicate whether an adult has a high school education; or Y
can indicate whether a college student used illegal drugs during a given school year; or Y can
indicate whether a firm was taken over by another firm during a given year. In each of these
examples, we can let Y = 1 denote one of the outcomes and Y = 0 the other outcome.

There are several methods to analyze regression models where the dependent variable is binary.
The simplest procedure is to just use the usual OLS method. In this case the model is called the
linear probability model (LPM). The other alternative is to say that there is an underlying or latent
variable Y * which we do not observe. What we observe is


Y = 1 if Y* > 0; Y = 0 otherwise

This is the idea behind the logit and probit models.

1.3.1.The Linear Probability Models (LPM)

When we use a linear regression model to estimate probabilities, we call the model the linear
probability model. Consider the following model, where Y is a binary variable.

Y = β₀ + β₁X₁ + β₂X₂ + … + βₖXₖ + u = Xβ + u ........................................ (1.9)

Since Y can take on only two values, βⱼ cannot be interpreted as the change in Y given a one-unit
increase in Xⱼ, holding all other factors fixed: Y either changes from zero to one or from one to
zero. Nevertheless, the βⱼ still have useful interpretations. If we assume that the zero conditional
mean assumption holds, that is, E(u | X) = 0, then we have, as always,

E(Y | X) = β₀ + β₁X₁ + β₂X₂ + … + βₖXₖ = Xβ

The key point is that when Y is a binary variable taking on the values zero and one, it is always
true that P(Y = 1 | X) = E(Y | X): the probability of “success” (that is, the probability that Y = 1)
is the same as the expected value of Y. Thus, we have the important equation

P(Y = 1 | X) = β₀ + β₁X₁ + β₂X₂ + … + βₖXₖ = Xβ ..................................(1.10)

which says that the probability of success, say P(X) = P(Y = 1 | X), is a linear function of the Xⱼ.
Equation (1.10) is an example of a binary response model, and P(Y = 1 | X) is also called the
response probability. Because probabilities must sum to one, P(Y = 0 | X) = 1 − P(Y = 1 | X)
(which is called the non-response probability) is also a linear function of the Xⱼ.

The multiple linear regression model with a binary dependent variable is called the linear
probability model (LPM) because the response probability is linear in the parameters βⱼ. In
the LPM, βⱼ measures the change in the probability of success when Xⱼ changes, holding other
factors fixed. This is the usual linear regression model, which makes linear probability models easy
to estimate and interpret, but it also highlights some shortcomings of the LPM.

The drawbacks of this model are:


1. The right hand side of equation (1.9) is a combination of discrete and continuous variables
while the left hand side variable is discrete.
2. Usually we arbitrarily (or for convenience) use 0 and 1 for Y. If we use other values for Y,
say 3 and 4, β will also change even if the vector of factors X remains unchanged.
3. u assumes only two values:
   if Y = 1, then u = 1 − Xβ (with probability P)
   if Y = 0, then u = −Xβ (with probability 1 − P)
   Consequently, u is not normally distributed but rather follows a discrete (binary)
   probability distribution.
4. It is easy to see that, if we plug in certain combinations of values for the independent
variables into (1.9), we can get predictions either less than zero or greater than one. Since
these are predicted probabilities, and probabilities must be between zero and one, this can
be a little embarrassing.
5. Due to problem 3, u is heteroscedastic: its variance, Xβ(1 − Xβ), depends on the regressors.
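Drawback 4 can be demonstrated directly: a fitted LPM line produces "probabilities" outside [0, 1] for sufficiently extreme regressor values. The coefficients below are illustrative, not estimates from any real data:

```python
# Hypothetical fitted LPM: P(Y=1|X) = -0.2 + 0.01 * X
b0, b1 = -0.2, 0.01

def lpm_prob(x):
    """Fitted value (predicted 'probability') from the LPM."""
    return b0 + b1 * x

print(round(lpm_prob(10), 3))   # -0.1 -> below zero: not a valid probability
print(round(lpm_prob(50), 3))   #  0.3 -> fine
print(round(lpm_prob(150), 3))  #  1.3 -> above one: not a valid probability
```

The logit and probit models of the next section avoid this by passing Xβ through a cdf, which maps every index value into (0, 1).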

1.3.2. The Logit and Probit Models

The linear probability model is simple to estimate and use, but it has some drawbacks that we
discussed in Section 1.3.1. The two most important disadvantages are that the fitted probabilities
can be less than zero or greater than one and the partial effect of any explanatory variable
(appearing in level form) is constant. These limitations of the LPM can be overcome by using more
sophisticated binary response models.

In a binary response model, interest lies primarily in the response probability

P(Y = 1 | X) = P(Y = 1 | X₁, X₂, …, Xₖ) ………………………………….. (1.11)

where we use X to denote the full set of explanatory variables. For example, when Y is an
employment indicator, X might contain various individual characteristics such as education, age,
marital status, and other factors that affect employment status, including a binary indicator variable
for participation in a recent job training program.

Specifying Logit and Probit Models

In the LPM, we assume that the response probability is linear in a set of parameters, βⱼ. To avoid
the LPM limitations, consider a class of binary response models of the form

P(Y = 1 | X) = G(β₀ + β₁X₁ + β₂X₂ + … + βₖXₖ) = G(Xβ) ..........................................(1.12)


where G is a function taking on values strictly between zero and one: 0 < G(z) < 1, for all real
numbers z. This ensures that the estimated response probabilities are strictly between zero and one.
As in Econometrics I, we write Xβ = β₀ + β₁X₁ + β₂X₂ + … + βₖXₖ.

Various nonlinear functions have been suggested for the function G in order to make sure that the
probabilities are between zero and one. The two we will cover here are used in the vast majority
of applications (along with the LPM). In the logit model, G is the logistic function:

G(z) = exp(z) / [1 + exp(z)] = Λ(z), so that G(Xβ) = e^(Xβ) / (1 + e^(Xβ)) ..........................................(1.13)

which is between zero and one for all real numbers z. This is the cumulative distribution function
(cdf) of a standard logistic random variable.

Here the response probability P(Y = 1 | X) is evaluated as:

P = P(Y = 1 | X) = e^(Xβ) / (1 + e^(Xβ))

Similarly, the non-response probability is evaluated as:

1 − P = P(Y = 0 | X) = 1 − e^(Xβ) / (1 + e^(Xβ)) = 1 / (1 + e^(Xβ))

Note that the response and non-response probabilities both lie in the interval [0, 1], and
hence are interpretable.

For the logit model, the ratio

P / (1 − P) = P(Y = 1 | X) / P(Y = 0 | X) = [e^(Xβ) / (1 + e^(Xβ))] / [1 / (1 + e^(Xβ))] = e^(Xβ) = e^(β₀) · e^(β₁X₁) · e^(β₂X₂) ··· e^(βₖXₖ)

is the ratio of the odds of Y = 1 against Y = 0. The natural logarithm of the odds (log-odds) is:

ln[P / (1 − P)] = β₀ + β₁X₁ + β₂X₂ + … + βₖXₖ

Thus, the log-odds is a linear function of the explanatory variables.
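The probability–odds–log-odds algebra above can be verified numerically for any value of the linear index Xβ (here an illustrative value of 0.8):

```python
import math

xb = 0.8                                 # illustrative linear index X*beta

P = math.exp(xb) / (1 + math.exp(xb))    # response probability, eq. (1.13)
odds = P / (1 - P)                       # odds of Y = 1 against Y = 0
log_odds = math.log(odds)                # should recover the linear index

print(round(P, 4))         # strictly between 0 and 1
print(round(log_odds, 4))  # 0.8: the log-odds equals X*beta
```

This closes the loop on the derivation: the logit transformation makes the log-odds, not the probability itself, linear in the parameters.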

In the probit model, G is the standard normal cumulative distribution function (cdf ), which is
expressed as an integral:


G(z) = Φ(z) = ∫₋∞ᶻ φ(v) dv ..........................................................(1.14)

where φ(z) is the standard normal density:

φ(z) = (2π)^(−1/2) exp(−z² / 2) ..........................................................(1.15)

The standard normal cdf has a shape very similar to that of the logistic cdf.

The estimating model that emerges from the normal CDF is popularly known as the probit model,
although sometimes it is also known as the normit model.

Note that both the probit and the logit models are estimated by Maximum Likelihood Estimation.

1.3.3.Interpreting the Probit and Logit Model Estimates

Given modern computers, from a practical perspective, the most difficult aspect of logit or probit
models is presenting and interpreting the results. The coefficient estimates, their standard errors,
and the value of the log-likelihood function are reported by all software packages that do logit and
probit, and these should be reported in any application.

The coefficients give the signs of the partial effects of each Xⱼ on the response probability, and the
statistical significance of Xⱼ is determined by whether we can reject H₀: βⱼ = 0 at a sufficiently
small significance level. However, the magnitude of the estimated parameters (which give dz/dXⱼ,
the effect on the latent index z) has no particular interpretation. We care about the magnitude of
dProb(Y = 1)/dXⱼ. From the computer

output for a probit or logit estimation, you can interpret the statistical significance and sign of each
coefficient directly. Assessing magnitude is trickier.
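For the logit model, a standard result (not derived in these notes) converts a coefficient into a marginal effect: dP(Y = 1|X)/dXⱼ = βⱼ · Λ(Xβ)[1 − Λ(Xβ)], so the effect depends on where it is evaluated. A sketch with an illustrative coefficient:

```python
import math

def logistic(z):
    """Standard logistic cdf Lambda(z)."""
    return math.exp(z) / (1.0 + math.exp(z))

beta_j = 0.5                        # illustrative logit coefficient on X_j
for xb in (-2.0, 0.0, 2.0):         # evaluate at different index values
    p = logistic(xb)
    marginal = beta_j * p * (1 - p)  # dP(Y=1|X)/dX_j for the logit
    print(round(marginal, 4))
# The effect peaks at P = 0.5 (xb = 0) and shrinks toward the tails,
# unlike the LPM, where beta_j is the constant effect everywhere.
```

This is why researchers typically report marginal effects evaluated at the sample means of the regressors, or averaged over the sample, rather than raw coefficients.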

Goodness of Fit Statistics

The conventional measure of goodness of fit, R², is not particularly meaningful in binary
regressand models. Measures similar to R², called pseudo R² measures, are available, and there
are a variety of them.

1. Measures based on likelihood ratios

Let LUR be the maximum of the likelihood function when maximized with respect to all the
parameters and LR be its maximum when maximized under the restrictions βi = 0. Then

R² = 1 − (LR / LUR)^(2/n)

2. Cragg and Uhler (1970) suggested a pseudo R² that lies between 0 and 1:

R² = [LUR^(2/n) − LR^(2/n)] / [(1 − LR^(2/n)) LUR^(2/n)]

3. McFadden (1974) defined R² as

R² = 1 − (log LUR / log LR)

4. Another goodness-of-fit measure that is usually reported is the so-called percent correctly
predicted, which is computed as follows. For each i, we compute the estimated probability
that Yi takes on the value one, Ŷi. If Ŷi ≥ 0.5 the prediction of Yi is unity, and if Ŷi < 0.5,
Yi is predicted to be zero. The percentage of times the predicted Yi matches the actual Yi
(which we know to be zero or one) is the percent correctly predicted:

Count R² = No. of correct predictions / Total no. of observations
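The likelihood-based measures above can be computed directly from the restricted and unrestricted log likelihoods. As a sketch, the numbers below are taken from the Stata iteration log in the numerical example that follows (Iteration 0 gives log LR; the final iteration gives log LUR):

```python
import math

# Log likelihoods from the Stata logit output reported below
ll_R, ll_UR, n = -249.98826, -239.06481, 400

# McFadden's pseudo R-squared: 1 - log L_UR / log L_R
mcfadden = 1 - ll_UR / ll_R

# Likelihood-ratio measure: 1 - (L_R / L_UR)^(2/n), computed in logs
# because the raw likelihoods would underflow to zero.
lr_based = 1 - math.exp((2 / n) * (ll_R - ll_UR))

print(round(mcfadden, 4))  # 0.0437, matching Stata's "Pseudo R2" line
print(round(lr_based, 4))
```

Note that the two measures need not agree; they are different transformations of the same likelihood ratio, which is why any pseudo R² should be read only as a rough summary of fit.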


Numerical Example

This session shows an example of probit and logit regression analysis with Stata. The data in this
example were gathered on undergraduates applying to graduate school and includes undergraduate
GPAs, the reputation of the school of the undergraduate (a topnotch indicator), the students' GRE
score, and whether or not the student was admitted to graduate school. Using this dataset, we can
predict admission to graduate school using undergraduate GPA, GRE scores, and the reputation of
the school of the undergraduate. Our outcome variable is binary, and we will use either a probit
or a logit model. Thus, our model will calculate a predicted probability of admission based on our
predictors.

. logit admit GRE topnotch GPA

Iteration 0: log likelihood = -249.98826


Iteration 1: log likelihood = -239.17277
Iteration 2: log likelihood = -239.06484
Iteration 3: log likelihood = -239.06481
Iteration 4: log likelihood = -239.06481

Logistic regression                             Number of obs  =        400
                                                LR chi2(3)     =      21.85
                                                Prob > chi2    =     0.0001
Log likelihood = -239.06481                     Pseudo R2      =     0.0437

       admit       Coef.   Std. Err.      z    P>|z|    [95% Conf. Interval]
         GRE    .0024768    .0010702    2.31   0.021     .0003792    .0045744
    topnotch    .4372236    .2918532    1.50   0.134    -.1347983    1.009245
         GPA    .6675556    .3252593    2.05   0.040     .0300591    1.305052
       _cons   -4.600814     1.09638   -4.20   0.000    -6.749678   -2.451949

Iteration History - This is a listing of the log likelihoods at each iteration for the probit/logit
model. Remember that probit/logit regression uses maximum likelihood estimation, which is an
iterative procedure. The first iteration (called Iteration 0) is the log likelihood of the "null" or
"empty" model; that is, a model with no predictors. At the next iteration (called Iteration 1), the
specified predictors are included in the model. In this example, the predictors are GRE, topnotch
and GPA. At each iteration, the log likelihood increases because the goal is to maximize the log
likelihood. When the difference between successive iterations is very small, the model is said to
have "converged" and the iterating stops.
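To make the iteration idea concrete, here is a rough sketch of maximum likelihood estimation for a one-regressor logit by simple gradient ascent, on simulated data with hypothetical coefficients (not the admissions data): the log likelihood rises with each step until the changes become negligible, mirroring the iteration log above.

```python
import math, random

random.seed(1)

# Simulated data for a one-regressor logit; the "true" coefficients
# (0.5 and 1.0) are hypothetical choices for illustration only.
n = 200
x = [random.gauss(0, 1) for _ in range(n)]
y = [1 if random.random() < 1 / (1 + math.exp(-(0.5 + xi))) else 0 for xi in x]

def loglik(b0, b1):
    ll = 0.0
    for xi, yi in zip(x, y):
        p = 1 / (1 + math.exp(-(b0 + b1 * xi)))
        ll += math.log(p if yi else 1 - p)
    return ll

b0 = b1 = 0.0
history = [loglik(b0, b1)]   # "Iteration 0": no predictors fitted yet
for _ in range(60):
    # average gradient of the log likelihood
    g0 = sum(yi - 1 / (1 + math.exp(-(b0 + b1 * xi))) for xi, yi in zip(x, y)) / n
    g1 = sum((yi - 1 / (1 + math.exp(-(b0 + b1 * xi)))) * xi
             for xi, yi in zip(x, y)) / n
    b0, b1 = b0 + 2.0 * g0, b1 + 2.0 * g1   # small uphill step
    history.append(loglik(b0, b1))

# The log likelihood rises and then flattens out: convergence
print(history[0], history[-1])
```

Stata uses a Newton-type algorithm rather than plain gradient ascent, but the logic is the same: climb the log likelihood until successive iterations barely change.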

Log likelihood - This is the log likelihood of the fitted model. It is used in the Likelihood Ratio
Chi-Square test of whether all predictors' regression coefficients in the model are simultaneously
zero.

LR chi2(3) - This is the Likelihood Ratio (LR) Chi-Square test that at least one of the predictors'
regression coefficient is not equal to zero. The number in the parentheses indicates the degrees of
freedom of the Chi-Square distribution used to test the LR Chi-Square statistic and is defined by
the number of predictors in the model (3).

Prob > chi2 - This is the probability of getting a LR test statistic as extreme as, or more so, than
the observed statistic under the null hypothesis; the null hypothesis is that all of the regression
coefficients are simultaneously equal to zero. In other words, this is the probability of obtaining
this chi-square statistic or one more extreme if there is in fact no effect of the predictor variables.


This p-value is compared to a specified alpha level, our willingness to accept a type I error, which
is typically set at 0.05 or 0.01. The small p-value from the LR test, 0.0001, would lead us to
conclude that at least one of the regression coefficients in the model is not equal to zero. The
parameter of the chi-square distribution used to test the null hypothesis is defined by the degrees
of freedom in the prior line, chi2(3).
Pseudo R2 - This is McFadden's pseudo R-squared. Because this statistic does not mean what R-
square means in OLS regression (the proportion of variance of the response variable explained by
the predictors), it should be interpreted with great caution.

The interpretation of the coefficients can be awkward. For example, for a one unit increase in
GPA, the log odds of being admitted to graduate school (vs. not being admitted) increases by .667.
For this reason, many researchers prefer to exponentiate the coefficients and interpret them as
odds-ratios. Look at the following result.

Logistic regression                             Number of obs  =        400
                                                LR chi2(3)     =      21.85
                                                Prob > chi2    =     0.0001
Log likelihood = -239.06481                     Pseudo R2      =     0.0437

       admit   Odds Ratio   Std. Err.      z    P>|z|    [95% Conf. Interval]
         GRE      1.00248    .0010729    2.31   0.021     1.000379    1.004585
    topnotch     1.548402    .4519062    1.50   0.134     .8738922     2.74353
         GPA     1.949466     .634082    2.05   0.040     1.030515    3.687881

Now we can say that for a one unit increase in GPA, the odds of being admitted to graduate school
(vs. not being admitted) increase by a factor of about 1.95.

Since GRE scores increase only in units of 10, we can take the odds ratio and raise it to the 10th
power, e.g., 1.00248^10 = 1.0250786, and say that for a 10-unit increase in GRE score the odds of
admission to graduate school increase by a factor of about 1.025.
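The exponentiation behind these odds ratios is easy to verify; a short sketch using the coefficients from the first output above:

```python
import math

# Logit coefficients from the first Stata output above
b_gpa, b_gre = 0.6675556, 0.0024768

# Exponentiating a logit coefficient gives an odds ratio
or_gpa = math.exp(b_gpa)   # ≈ 1.9495, matching the Odds Ratio column
or_gre = math.exp(b_gre)   # ≈ 1.00248

# For a 10-point GRE increase, raise the odds ratio to the 10th power
or_gre_10 = or_gre ** 10   # ≈ 1.0251

print(round(or_gpa, 4), round(or_gre_10, 4))
```

This also makes clear why the two interpretations are equivalent: adding 0.667 to the log odds is the same as multiplying the odds by e^0.667.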

Dear students, we will look into the regression results and interpretation of the LPM, logit and
probit models in detail in the lab session.


CHAPTER TWO

2. Introduction to Basic Regression Analysis with Time Series Data



2.1. The Nature of Time Series Data

Time series data are data collected for a single entity (person, firm, or country) observed at
multiple time periods. Examples:

Aggregate consumption and GDP for a country (for example, 20 years of quarterly
observations = 80 observations)
Birr/$, pound/$ and Euro/$ exchange rates (daily data for 2 years = 730 observations)
Inflation rate for Ethiopia (quarterly data for 30 years = 120 observations)
Gross domestic investment for Ethiopia (annual data for 40 years = 40 observations)

An obvious characteristic of time series data which distinguishes it from cross-sectional data is
that a time series data set comes with a temporal ordering. For example, in the above data set
mentioned in the examples, we must know that the data for 1970 immediately precede the data for
1971. For analyzing time series data in the social sciences, we must recognize that the past can
affect the future, but not vice versa. To emphasize the proper ordering of time series data, Table
2.1 gives a partial listing of the data on Ethiopian gross capital formation (GCF) and gross
domestic savings (GDS), both in millions of ETB, for 1969–2010.

Year    GCF       GDS
1969    759.9     703.4
1970    822.1     693
1971    894.3     763.4
1972    844.2     892.1
1973    814.1     919.4
1974    859.6     1009.82
.       .         .
.       .         .
.       .         .
2006    38044     14905.01
2007    55512     13266.85
2008    76185     21433.07
2009    94497     21351.63
2010    130466    44981.8

Table 2.1 Partial Listing of Data on Ethiopia GCF and GDS, 1969–2010


Another difference between cross-sectional and time series data is more subtle. In Chapters 2 and
3 of Econometrics I, we studied statistical properties of the OLS estimators based on the notion
that samples were randomly drawn from the appropriate population. Understanding why cross-
sectional data should be viewed as random outcomes is fairly straightforward: a different sample
drawn from the population will generally yield different values of the independent and dependent
variables (such as education, experience, wage, and so on). Therefore, the OLS estimates
computed from different random samples will generally differ, and this is why we consider the
OLS estimators to be random variables.

How should we think about randomness in time series data? Certainly, economic time series satisfy
the intuitive requirements for being outcomes of random variables. For example, today we do not
know what the stock price will be at its close at the end of the next trading day. We do not know
what the annual growth in output will be in Ethiopia during the coming year. Since the outcomes
of these variables are not foreknown, they should clearly be viewed as random variables.

Formally, a sequence of random variables indexed by time is called a stochastic process or a time
series process. (“Stochastic” is a synonym for random.) When we collect a time series data set, we
obtain one possible outcome, or realization, of the stochastic process. We can only see a single
realization, because we cannot go back in time and start the process over again. (This is analogous
to cross-sectional analysis where we can collect only one random sample.) However, if certain
conditions in history had been different, we would generally obtain a different realization for the
stochastic process, and this is why we think of time series data as the outcome of random variables.
The set of all possible realizations of a time series process plays the role of the population in cross-
sectional analysis.

2.2. Stationary and Nonstationary Stochastic Processes


2.2.1. Stochastic Processes

A random or stochastic process is a collection of random variables ordered in time. If we let Y


denote a random variable, and if it is continuous, we denote it as Y(t), but if it is discrete, we
denote it as Yt. An example of the former is an electrocardiogram, and examples of the latter are
GDP, PDI, GDI, GDS, etc. Since most economic data are collected at discrete points in time, for
our purpose we will use the notation Yt rather than Y(t). If we let Y represent GDS, for our data
we have Y1, Y2, Y3, ..., Y39, Y40, Y41, where the subscript 1 denotes the first observation (i.e., GDS
of 1969) and the subscript 41 denotes the last observation (i.e., GDS of 2010). Keep in mind that
each of these Y's is a random variable.


Stationary Stochastic Processes

A type of stochastic process that has received a great deal of attention and analysis by time series
analysts is the so-called stationary stochastic process. Broadly speaking, a stochastic process is
said to be stationary if its mean and variance are constant over time and the value of the
covariance between the two time periods depends only on the distance or gap or lag between the
two time periods and not the actual time at which the covariance is computed. In the time series
literature, such a stochastic process is known as a weakly stationary, or covariance stationary, or
second-order stationary, or wide sense, stochastic process. For the purpose of this chapter, and in
most practical situations, this type of stationarity often suffices.

To explain weak stationarity, let Yt be a stochastic time series with these properties:

Mean: E(Yt) = μ ..........................................(2.1)

Variance: var(Yt) = E(Yt − μ)² = σ² ..........................................(2.2)

Covariance: γk = E[(Yt − μ)(Yt+k − μ)] ..........................................(2.3)

where γk, the covariance (or autocovariance) at lag k, is the covariance between the values of Yt
and Yt+k, that is, between two Y values k periods apart. If k = 0, we obtain γ0, which is simply the
variance of Y (= σ²); if k = 1, γ1 is the covariance between two adjacent values of Y.

Suppose we shift the origin of Y from Yt to Yt+m (say, from 1969 to 1974 for our GDS data). Now
if Yt is to be stationary, the mean, variance, and autocovariances of Yt+m must be the same as those
of Yt . In short, if a time series is stationary, its mean, variance, and autocovariance (at various lags)
remain the same no matter at what point we measure them; that is, they are time invariant. Such a
time series will tend to return to its mean (called mean reversion) and fluctuations around this
mean (measured by its variance) will have a broadly constant amplitude.

If a time series is not stationary in the sense just defined, it is called a nonstationary time series
(keep in mind we are talking only about weak stationarity). In other words, a nonstationary time
series will have a time-varying mean or a time-varying variance or both.
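These defining properties can be checked numerically. The sketch below (made-up parameters: ρ = 0.5 for the stationary case, shocks with σ = 1) simulates many realizations of a stationary AR(1) process and of a random walk, then compares the cross-realization variance at an early and a late date:

```python
import random

random.seed(0)

def simulate(rho, T):
    """One realization of Y_t = rho*Y_{t-1} + u_t, starting from Y_0 = 0."""
    y, path = 0.0, []
    for _ in range(T):
        y = rho * y + random.gauss(0, 1)
        path.append(y)
    return path

def variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

reps, T = 2000, 200
ar1 = [simulate(0.5, T) for _ in range(reps)]   # stationary: |rho| < 1
rw  = [simulate(1.0, T) for _ in range(reps)]   # random walk: rho = 1

# Variance across realizations at t = 50 and at t = 200
ar1_early, ar1_late = variance([p[49] for p in ar1]), variance([p[-1] for p in ar1])
rw_early,  rw_late  = variance([p[49] for p in rw]),  variance([p[-1] for p in rw])

print(ar1_early, ar1_late)  # both near 1/(1 - 0.5**2) ≈ 1.33: time invariant
print(rw_early, rw_late)    # roughly 50 vs 200: the variance grows with t
```

For the AR(1) series the variance is essentially the same at both dates, while for the random walk it grows roughly in proportion to t, anticipating equation (2.7) below.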


Why are stationary time series so important? Because if a time series is nonstationary, we can
study its behavior only for the time period under consideration. Each set of time series data will
therefore be for a particular episode. As a consequence, it is not possible to generalize it to other
time periods. Therefore, for the purpose of forecasting, such (nonstationary) time series may be
of little practical value.

How do we know that a particular time series is stationary? In particular, is the time series shown
in Figure 2.1 stationary? We will take this important topic up in Section 2.5, where we will
consider several tests of stationarity. But if we depend on common sense, it would seem that the
time series depicted in Figure 2.1 is nonstationary, at least in its mean value.

Figure 2.1 Graph of Foreign Exchange Rate for Ethiopia, 1971–2010
[time series plot: exchange rate (vertical axis, 0 to 1500) against Year (horizontal axis, 1970 to 2010)]

Before we move on, we mention a special type of stochastic process (or time series), namely, a
purely random, or white noise, process. We call a stochastic process purely random if it has zero
mean, constant variance σ 2 , and is serially uncorrelated. You may recall that the error term ut ,
entering the classical normal linear regression model that we discussed in Econometrics I was
assumed to be a white noise process, which we denoted as ut ∼IID N(0,σ2 ); that is, ut is


independently and identically distributed as a normal distribution with zero mean and constant
variance.

Nonstationary Stochastic Processes

Although our interest is in stationary time series, one often encounters nonstationary time series,
the classic example being the random walk model (RWM). It is often said that asset prices, such
as stock prices or exchange rates, follow a random walk; that is, they are nonstationary. We
distinguish two types of random walks: (1) random walk without drift (i.e., no constant or intercept
term) and (2) random walk with drift (i.e., a constant term is present).

Random Walk without Drift: Suppose ut is a white noise error term with mean 0 and variance
σ². Then the series Yt is said to be a random walk if

Yt = Yt−1 + ut ..........................................(2.4)

In the random walk model, as (2.4) shows, the value of Y at time t is equal to its value at time (t−1)
plus a random shock; thus it is an AR(1) model in the language of Chapter 4 of Econometrics I.
We can think of (2.4) as a regression of Y at time t on its value lagged one period. Believers in the
efficient capital market hypothesis argue that stock prices are essentially random and therefore
there is no scope for profitable speculation in the stock market: If one could predict tomorrow’s
price on the basis of today’s price, we would all be millionaires.

Now from (2.4) we can write

Y1 = Y0 + u1
Y2 = Y1 + u2 = Y0 + u1 + u2
Y3 = Y2 + u3 = Y0 + u1 + u2 + u3

In general, if the process started at some time 0 with a value of Y0, we have

Yt = Y0 + Σut ..........................................(2.5)


Therefore,

E(Yt) = E(Y0 + Σut) = Y0 ..........................................(2.6)

In like fashion, it can be shown that

var(Yt) = tσ² ..........................................(2.7)

As the preceding expression shows, the mean of Y is equal to its initial, or starting, value, which
is constant, but as t increases, its variance increases indefinitely, thus violating a condition of
stationarity. In short, the RWM without drift is a nonstationary stochastic process. In practice Y0
is often set at zero, in which case E (Yt ) =0.

An interesting feature of the RWM is the persistence of random shocks (i.e., random errors), which
is clear from (2.5): Yt is the sum of the initial Y0 plus the sum of random shocks. As a result, the
impact of a particular shock does not die away. For example, if u2 = 2 rather than u2 = 0, then all
Yt's from Y2 onward will be 2 units higher, and the effect of this shock never dies out. That is why
the random walk is said to have an infinite memory: it remembers the shocks forever.

Interestingly, if we write (2.4) as

ΔYt = (Yt − Yt−1) = ut ..........................................(2.8)

where Δ is the first-difference operator, it is easy to show that, while Yt is nonstationary, its first
difference is stationary. In other words, the first differences of a random walk time series are
stationary.
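A quick numerical check of this claim, using simulated shocks (the sample size and seed are arbitrary): differencing a random walk recovers the white-noise shocks exactly.

```python
import random

random.seed(42)

# White-noise shocks and the random walk they generate: Y_t = Y_{t-1} + u_t
u = [random.gauss(0, 1) for _ in range(100)]
Y = []
for shock in u:
    Y.append((Y[-1] if Y else 0.0) + shock)

# First difference: ΔY_t = Y_t − Y_{t−1} = u_t, a stationary white-noise series
dY = [Y[t] - Y[t - 1] for t in range(1, len(Y))]

assert all(abs(d - s) < 1e-12 for d, s in zip(dY, u[1:]))
print("first differences equal the white-noise shocks")
```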

Random Walk with Drift. Let us modify (2.4) as follows:

Yt = δ + Yt−1 + ut ..........................................(2.9)


Where δ is known as the drift parameter. The name drift comes from the fact that if we write the
preceding equation as

(Yt − Yt−1) = ΔYt = δ + ut ..........................................(2.10)

it shows that Yt drifts upward or downward, depending on δ being positive or negative. Note that
model (2.9) is also an AR(1) model. Following the procedure discussed for random walk without
drift, it can be shown that for the random walk with drift model (2.9),

E(Yt) = Y0 + tδ ..........................................(2.11)

var(Yt) = tσ² ..........................................(2.12)

As you can see, for RWM with drift the mean as well as the variance increases over time, again
violating the conditions of (weak) stationarity. In short, RWM, with or without drift, is a
nonstationary stochastic process. The random walk model is an example of what is known in the
literature as a unit root process.

Unit Root Stochastic Process

Let us write the RWM (2.4) as:

Yt = ρYt−1 + ut; −1 ≤ ρ ≤ 1 ..........................................(2.4.1)

This model resembles the Markov first-order autoregressive model that we discussed under
autocorrelation. If ρ = 1, (2.4.1) becomes a RWM (without drift). If ρ is in fact 1, we face what is
known as the unit root problem, that is, a situation of nonstationarity; we already know that in this
case the variance of Yt is not constant over time. The name unit root is due to the fact that ρ = 1.
Thus the terms nonstationarity, random walk, and unit root can be treated as synonymous.

If, however, |ρ|<1, that is if the absolute value of ρ is less than one, then it can be shown that the
time series Yt is stationary in the sense we have defined it.


2.3. Trend Stationary and Difference Stationary Stochastic Processes

If the trend in a time series is completely predictable and not variable, we call it a deterministic
trend, whereas if it is not predictable, we call it a stochastic trend. To make the definition more
formal, consider the following model of the time series Yt .

Yt = β1 + β2t + β3Yt−1 + ut ..........................................(2.13)

where ut is a white noise error term and t is time measured chronologically. Now we have the
following possibilities:

Pure random walk: If in (2.13) β1 = 0, β2 = 0, β3 = 1, we get

Yt = Yt−1 + ut ..........................................(2.14)

which is nothing but a RWM without drift and is therefore nonstationary. But note that, if we
write (2.14) as

ΔYt = (Yt − Yt−1) = ut ..........................................(2.15)

it becomes stationary, as noted before. Hence, a RWM without drift is a difference stationary
process (DSP).

Random walk with drift: If in (2.13) β1 ≠ 0, β2 = 0, β3 = 1, we get

Yt = β1 + Yt−1 + ut ..........................................(2.16)

which is a random walk with drift and is therefore nonstationary. If we write it as

Yt − Yt−1 = ΔYt = β1 + ut ..........................................(2.16a)

this means Yt will exhibit a positive (β1 > 0) or negative (β1 < 0) trend. Such a trend is called a
stochastic trend. Equation (2.16a) is a DSP process because the nonstationarity in Yt can be
eliminated by taking first differences of the time series.


Deterministic trend: If in (2.13) β1 ≠ 0, β2 ≠ 0, β3 = 0, we get

Yt = β1 + β2t + ut ..........................................(2.17)

which is called a trend stationary process (TSP). Although the mean of Yt is β1 + β2t, which is not
constant, its variance (= σ²) is. Once the values of β1 and β2 are known, the mean can be forecast
perfectly. Therefore, if we subtract the mean of Yt from Yt, the resulting series will be stationary,
hence the name trend stationary. This procedure of removing the (deterministic) trend is called
detrending.
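A sketch of detrending on simulated data (the trend parameters 2 and 0.5 and the noise variance are arbitrary choices): fit β1 + β2t by least squares, then subtract the fitted trend to obtain a stationary residual series.

```python
import random

random.seed(3)

# A trend stationary series with hypothetical parameters: Y_t = 2 + 0.5*t + u_t
T = 500
t = list(range(1, T + 1))
y = [2.0 + 0.5 * ti + random.gauss(0, 1) for ti in t]

# Fit the deterministic trend b1 + b2*t by OLS (normal equations)
n = len(t)
tbar, ybar = sum(t) / n, sum(y) / n
b2 = sum((ti - tbar) * (yi - ybar) for ti, yi in zip(t, y)) / \
     sum((ti - tbar) ** 2 for ti in t)
b1 = ybar - b2 * tbar

# Detrending: subtract the fitted trend; the residuals are stationary
resid = [yi - (b1 + b2 * ti) for ti, yi in zip(t, y)]
print(round(b2, 2))  # close to the true trend slope 0.5
```

Note the contrast with a DSP: here the trend is removed by subtraction, not by differencing.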

Random walk with drift and deterministic trend: If in (2.13) β1 ≠ 0, β2 ≠ 0, β3 = 1, we obtain:

Yt = β1 + β2t + Yt−1 + ut ..........................................(2.18)

we have a random walk with drift and a deterministic trend, which can be seen if we write this
equation as

ΔYt = β1 + β2t + ut ..........................................(2.18a)

which means that Yt is nonstationary.

Deterministic trend with stationary AR(1) component: If in (2.13) β1 ≠ 0, β2 ≠ 0, β3 < 1, we
obtain:

Yt = β1 + β2t + β3Yt−1 + ut ..........................................(2.19)

which is stationary around the deterministic trend.

2.4. Integrated Stochastic Process

The random walk model is but a specific case of a more general class of stochastic processes
known as integrated processes. Recall that the RWM without drift is nonstationary, but its first
difference, as shown in (2.8), is stationary. Therefore, we call the RWM without drift integrated


of order 1, denoted as I(1). Similarly, if a time series has to be differenced twice (i.e., take the first
difference of the first differences) to make it stationary, we call such a time series integrated of
order 2. In general, if a (nonstationary) time series has to be differenced d times to make it
stationary, that time series is said to be integrated of order d. A time series Yt integrated of order
d is denoted as Yt ∼I(d). If a time series Yt is stationary to begin with (i.e., it does not require any
differencing), it is said to be integrated of order zero, denoted by Yt ∼I(0).Thus, we will use the
terms “stationary time series” and “time series integrated of order zero” to mean the same thing.
Most economic time series are generally I(1); that is, they generally become stationary only after
taking their first differences.

Properties of Integrated Series

The following properties of integrated time series may be noted: Let Xt , Yt , and Zt be three time
series.

i. If Xt ∼ I(0) and Yt ∼ I(1), then Zt = (Xt + Yt) ∼ I(1); that is, a linear combination or sum of
stationary and nonstationary time series is nonstationary.
ii. If Xt ∼ I(d), then Zt = (a + bXt) ∼ I(d), where a and b are constants. That is, a linear
combination of an I(d) series is also I(d). Thus, if Xt ∼ I(0), then Zt = (a + bXt) ∼ I(0).
iii. If Xt ∼ I(d1) and Yt ∼ I(d2), then Zt = (aXt + bYt) ∼ I(d2), where d1 < d2.
iv. If Xt ∼ I(d) and Yt ∼ I(d), then Zt = (aXt + bYt) ∼ I(d*); d* is generally equal to d, but in
some cases d* < d.
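A sketch of the order-of-integration idea with simulated data: cumulating white noise once gives an I(1) series (a random walk) and cumulating it twice gives an I(2) series, and differencing d times recovers the stationary I(0) shocks.

```python
import random

random.seed(7)

# Start from I(0) white-noise shocks
u = [random.gauss(0, 1) for _ in range(300)]

def cumsum(xs):
    out, s = [], 0.0
    for x in xs:
        s += x
        out.append(s)
    return out

def diff(xs):
    return [xs[i] - xs[i - 1] for i in range(1, len(xs))]

i1 = cumsum(u)    # I(1): a random walk, one differencing needed
i2 = cumsum(i1)   # I(2): must be differenced twice

# Differencing the I(2) series twice recovers the original I(0) shocks
recovered = diff(diff(i2))
assert all(abs(a - b) < 1e-9 for a, b in zip(recovered, u[2:]))
print("d = 2 differences make the series stationary")
```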
2.5. Tests of Stationarity: The Unit Root Test

A test of stationarity (or nonstationarity) that has become widely popular over the past several
years is the unit root test. The starting point is the unit root (stochastic) process that we discussed
in Section 2.2. We start with (2.4.1)

Yt = ρYt−1 + ut; −1 ≤ ρ ≤ 1 ..........................................(2.20)

Where ut is a white noise error term.


We know that if ρ =1, that is, in the case of the unit root, (2.20) becomes a random walk model
without drift, which we know is a nonstationary stochastic process. Therefore, why not simply
regress Yt on its (one period) lagged value Yt−1 and find out if the estimated ρ is statistically equal
to 1? If it is, then Yt is nonstationary. This is the general idea behind the unit root test of stationarity.

For theoretical reasons, we manipulate (2.20) as follows: Subtract Yt−1 from both sides of (2.20)
to obtain:

Yt − Yt−1 = ρYt−1 − Yt−1 + ut
         = (ρ − 1)Yt−1 + ut ..........................................(2.21)

which can be alternatively written as:

ΔYt = δYt−1 + ut ..........................................(2.22)

Where δ = (ρ − 1) and Δ, as usual, is the first-difference operator.

In practice, therefore, instead of estimating (2.20), we estimate (2.22) and test the (null) hypothesis
that δ = 0. If δ = 0, then ρ = 1; that is, we have a unit root, meaning the time series under
consideration is nonstationary. Unfortunately, under the null hypothesis that δ = 0 (i.e., ρ = 1), the
t value of the estimated coefficient of Yt−1 does not follow the t distribution even in large samples;
that is, it does not have an asymptotic normal distribution. Dickey and Fuller have shown that
under the null hypothesis that δ = 0, the estimated t value of the coefficient of Yt−1 in (2.22) follows
the τ (tau) statistic. In the literature the tau statistic or test is known as the Dickey–Fuller (DF) test,
in honor of its discoverers.

To allow for the various possibilities, the DF test is estimated in three different forms, that is, under
three different null hypotheses.

Yt is a random walk: Yt = Yt-1 +u t ..............................................................(2.23a)


Yt is a random walk with drift: Yt =1   Yt-1 +u t .............................................(2.23b)
Yt is a random walk with drift
around a stochastic trend: Yt =1   2t   Yt-1 +u t ......................................(2.23c)


Where t is the time or trend variable. In each case, the null hypothesis is that δ = 0; that is, there is
a unit root—the time series is nonstationary. The alternative hypothesis is that δ is less than zero;
that is, the time series is stationary. If the null hypothesis is rejected, it means that Yt is a stationary
time series.

The Augmented Dickey–Fuller (ADF) Test

In conducting the DF test as in (2.23a-c), it was assumed that the error term ut was uncorrelated.
But in case the ut are correlated, Dickey and Fuller have developed a test, known as the Augmented
Dickey–Fuller (ADF) test. This test is conducted by “augmenting” the preceding three equations
by adding the lagged values of the dependent variable, ΔYt−i. To be specific, suppose we use
(2.23c). The ADF test here consists of estimating the following regression:

ΔYt = β1 + β2t + δYt−1 + Σ αi ΔYt−i + εt ..........................................(2.24)

Where εt is a pure white noise error term and where ΔYt−1 = (Yt−1 − Yt−2), ΔYt−2 = (Yt−2 − Yt−3), etc.
In the ADF test we still test whether δ = 0, and the ADF test follows the same asymptotic distribution as
the DF statistic, so the same critical values can be used.
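As an illustration (not the GDP data; a simulated stationary AR(1) series with an arbitrary ρ = 0.2), the DF regression of form (2.23b) can be run by ordinary least squares and the τ statistic formed as the estimated δ divided by its standard error. The point of the sketch is only that τ is computed like an ordinary t ratio but must be compared against the DF critical values, not the t table:

```python
import math, random

random.seed(11)

# Simulated stationary AR(1) series (hypothetical rho = 0.2), so the
# true delta = rho - 1 = -0.8 and the unit root null should be rejected.
T, rho = 500, 0.2
y = [0.0]
for _ in range(T):
    y.append(rho * y[-1] + random.gauss(0, 1))

# DF regression (form 2.23b): dY_t = b1 + delta * Y_{t-1} + u_t
dy = [y[t] - y[t - 1] for t in range(1, len(y))]
ylag = y[:-1]
n = len(dy)
xbar, dbar = sum(ylag) / n, sum(dy) / n
delta = sum((x - xbar) * (d - dbar) for x, d in zip(ylag, dy)) / \
        sum((x - xbar) ** 2 for x in ylag)
b1 = dbar - delta * xbar

# tau statistic: delta-hat divided by its OLS standard error
resid = [d - b1 - delta * x for x, d in zip(ylag, dy)]
s2 = sum(e * e for e in resid) / (n - 2)
se = math.sqrt(s2 / sum((x - xbar) ** 2 for x in ylag))
tau = delta / se
print(round(delta, 2), round(tau, 1))  # tau far below the DF critical values
```

Here τ is strongly negative, so the unit root null δ = 0 is rejected; for a true random walk, τ would typically lie above (be less negative than) the critical values, as in the GDP example that follows.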

To give a glimpse of this procedure, we estimated (2.24) for the GDP series using one lagged
difference of the natural log of GDP of Ethiopia; the results were as follows:

ΔlnGDPt = −0.2095743 + 0.0015952t + 0.0197157 lnGDPt−1 + 0.0269423 ΔlnGDPt−1
t(= τ)      (−0.28)     (0.67)       (0.27)               (0.15)
(1% CV = −4.242; 5% CV = −3.540)

The t(= τ) value of the lnGDPt−1 coefficient (= δ) is 0.27, but this value in absolute terms is much
less than even the 1% and 5% critical τ values of −4.242 and −3.540, respectively, again suggesting
that even after taking care of possible autocorrelation in the error term, the GDP series is
nonstationary.

The Phillips–Perron (PP) Unit Root Tests

An important assumption of the DF test is that the error terms ut are independently and identically
distributed. The ADF test adjusts the DF test to take care of possible serial correlation in the error
terms by adding the lagged difference terms of the regressand. Phillips and Perron use


nonparametric statistical methods to take care of the serial correlation in the error terms without
adding lagged difference terms.

The Phillips-Perron test involves fitting the following regression:

ΔYt = β1 + β2t + ρYt−1 + ut

Under the null hypothesis that ρ = 0, the PP Z(t) and Z(ρ) statistics have the same asymptotic
distributions as the ADF t-statistic and normalized bias statistics. One advantage of the PP tests
over the ADF tests is that the PP tests are robust to general forms of heteroscedasticity in the error
term ut . Another advantage is that the user does not have to specify a lag length for the test
regression. Now let’s test for whether lnGDP is stationary or not using PP test.

Phillips-Perron test for unit root          Number of obs   =      41
                                            Newey-West lags =       3

                ---------- Interpolated Dickey-Fuller ----------
            Test          1% Critical    5% Critical    10% Critical
            Statistic     Value          Value          Value

 Z(rho)      2.441         -24.548        -19.116        -16.368
 Z(t)        1.210          -4.233         -3.536         -3.202

MacKinnon approximate p-value for Z(t) = 1.0000

The above result shows that lnGDP is nonstationary in levels.


CHAPTER THREE

3. INTRODUCTION TO SIMULTANEOUS EQUATION MODELS

3.1. The Nature of Simultaneous Equation Models

So far, we have been concerned exclusively with single-equation models, i.e., models in which there
was a single dependent variable Y and one or more explanatory variables, the X's. In such models
the emphasis was on estimating and/or predicting the average value of Y conditional upon the
fixed values of the X variables. The cause-and-effect relationship, if any, in such models therefore
ran from the X's to the Y.

But in many situations, such a one-way or unidirectional cause-and-effect relationship is not


meaningful. This occurs if Y is determined by the X’s, and some of the X’s are, in turn, determined
by Y. In short, there is a two way, or simultaneous, relationship between Y and (some of) the X’s,
which makes the distinction between dependent and explanatory variables of dubious value. It is
better to lump together a set of variables that can be determined simultaneously by the remaining
set of variables—precisely what is done in simultaneous-equation models. In such models there is
more than one equation—one for each of the mutually, or jointly, dependent or endogenous
variables. And unlike the single-equation models, in the simultaneous-equation models one may
not estimate the parameters of a single equation without taking into account information provided
by other equations in the system.

Example: At the macro level, aggregate consumption expenditure depends on aggregate
disposable income; aggregate disposable income depends upon the national income and taxes
imposed by the government; national income depends on aggregate consumption expenditure of
the economy. Disregarding these sequences of relationship, if we estimate a single equation
of, say, aggregate consumption on disposable income, then the estimates will be biased and
inconsistent.

The classic example of simultaneous causality in economics is supply and demand. Both prices
and quantities adjust until supply and demand are in equilibrium. A shock to demand or supply
causes both prices and quantities to move. As is well known, the price P of a commodity and the
quantity Q sold are determined by the intersection of the demand-and-supply curves for that
commodity (see Figure 3.1: Interdependence of Price and Quantity). Thus, assuming for simplicity
that the demand-and-supply curves are linear and adding the stochastic disturbance terms u1 and
u2, we may write the empirical demand-and-supply functions as


Demand function:        Qtd = α0 + α1Pt + α2Yt + u1t,   α1 < 0 ................................................. (3.1)

Supply function:        Qts = β0 + β1Pt + u2t,   β1 > 0 ............................................................. (3.2)

Equilibrium condition:  Qtd = Qts

where Qtd = quantity demanded, Qts = quantity supplied, t = time, and the α's and β's are the
parameters.

Now it is not too difficult to see that P and Q are jointly dependent variables. If, for example, u1t
in (3.1) changes because of changes in other variables affecting Qtd (such as income, wealth, and
tastes), the demand curve will shift upward if u1t is positive and downward if u1t is negative. These
shifts are shown in Figure 3.1. As the figure shows, a shift in the demand curve changes both P
and Q. Similarly, a change in u2t (because of strikes, weather, import or export restrictions, etc.)
will shift the supply curve, again affecting both P and Q. Because of this simultaneous dependence
between Q and P, u1t and Pt in (3.1) and u2t and Pt in (3.2) cannot be independent. Therefore, a
regression of Q on P as in (3.1) would violate an important assumption of the classical linear
regression model, namely, the assumption of no correlation between the explanatory variable(s)
and the disturbance term.

Definitions of Some Concepts

The variables P and Q are called endogenous variables because their values are determined
within the system we have created.
The income variable Y has a value that is given to us, and which is determined outside this
system. It is called an exogenous variable.
Predetermined variables are exogenous variables, lagged exogenous variables and lagged
endogenous variables. Predetermined variables are non-stochastic and hence independent
of the disturbance terms.
Structural models: A structural model describes the complete structure of the relationships
among the economic variables. Structural equations of the model may be expressed in terms
of endogenous variables, exogenous variables and disturbances (random variables).


Reduced form of the model: The reduced form of a structural model is the model in which the
endogenous variables are expressed a function of the predetermined variables and the error term
only.

Example: The following simple Keynesian model of income determination can be considered as a
structural model.

C = α + βY + U -----------------------------------------------(3.3)

Y = C + Z ----------------------------------------------------(3.4)

for α > 0 and 0 < β < 1

where: C=consumption expenditure

Z=non-consumption expenditure

Y=national income

C and Y are endogenous variables while Z is exogenous variable.

Reduced form of the model:


The reduced form of a structural model is the model in which the endogenous variables are
expressed a function of the predetermined variables and the error term only.

Illustration: Find the reduced form of the above structural model.

Since C and Y are endogenous variables and only Z is exogenous, we have to express C and Y in
terms of Z. To do this, substitute equation (3.4) into equation (3.3):

C = α + β(C + Z) + U

C = α + βC + βZ + U

C − βC = α + βZ + U

C(1 − β) = α + βZ + U

C = α/(1 − β) + [β/(1 − β)]Z + U/(1 − β) ----------------------------------(3.5)

Substituting (3.5) into (3.4) we get:

Y = α/(1 − β) + [1/(1 − β)]Z + U/(1 − β) --------------------------------(3.6)

Equations (3.5) and (3.6) are called the reduced form of the structural model above. We can
write this more formally as:

Structural form equations:
C = α + βY + U
Y = C + Z

Reduced form equations:
C = α/(1 − β) + [β/(1 − β)]Z + U/(1 − β) = π01 + π11Z + v1
Y = α/(1 − β) + [1/(1 − β)]Z + U/(1 − β) = π02 + π12Z + v2

Parameters of the reduced form measure the total effect (direct and indirect) of a change in the
exogenous variables on the endogenous variables.
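A quick numerical check of the reduced form, with made-up values α = 100, β = 0.8, Z = 50 and U = 0, confirms that (3.5) and (3.6) satisfy both structural equations:

```python
# illustrative values only, not estimates from any data
alpha, beta, Z, U = 100.0, 0.8, 50.0, 0.0

# reduced form equations (3.5) and (3.6)
C = alpha/(1 - beta) + (beta/(1 - beta))*Z + U/(1 - beta)
Y = alpha/(1 - beta) + (1/(1 - beta))*Z + U/(1 - beta)

# the reduced-form solution must satisfy the structural equations (3.3), (3.4)
assert abs(C - (alpha + beta*Y + U)) < 1e-9   # C = alpha + beta*Y + U
assert abs(Y - (C + Z)) < 1e-9                # Y = C + Z
print(round(C, 1), round(Y, 1))               # 700.0 750.0
```

A one-unit increase in Z raises Y by the multiplier 1/(1 − β) = 5, which is exactly the total effect the reduced-form coefficient measures.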

3.2. Simultaneity Bias

Unlike the single equation models, in simultaneous equation models it is not usually possible
(possible only under specific assumptions) to estimate a single equation of the model without
taking into account the information provided by other equation of the system. If one applies OLS
to estimate the parameters of each equation disregarding other equations of the model, the
estimates so obtained are not only biased but also inconsistent; i.e. even if the sample size
increases indefinitely, the estimators do not converge to their true values.

The bias arising from application of such procedure of estimation which treats each equation of
the simultaneous equations model as though it were a single model is known as simultaneity bias
or simultaneous equation bias. It is useful to see, in a simple model, that an explanatory variable


that is determined simultaneously with the dependent variable is generally correlated with the error
term, which leads to bias and inconsistency in OLS.

The two-way causation in such a relationship violates an important assumption of the linear
regression model: a variable can be the dependent variable in one equation while also appearing
as an explanatory variable in other equations of the simultaneous-equation model. In this case
E[XiUi] may be different from zero. To show simultaneity bias, let's consider the following
simple simultaneous equation model.

Y = β0 + β1X + U
X = α0 + α1Y + α2Z + V        -------------------------------------------------- (3.3)

Suppose that the following assumptions hold.

(U )  0 , (V )  0
(U 2 )   u2 , (V 2 )   u2
(U iU j )  0 , (ViV j )  0, also (UiVi )  0;

where X and Y are endogenous variables and Z is an exogenous variable.

The reduced form of X of the above model is obtained by substituting Y in the equation of X.

X = α0 + α1(β0 + β1X + U) + α2Z + V

X = (α0 + α1β0)/(1 − α1β1) + [α2/(1 − α1β1)]Z + (α1U + V)/(1 − α1β1) ------------------------ (3.4)

Applying OLS to the first equation of the above structural model will result in a biased
estimator because cov(X, U) ≠ 0. Let us now prove this.

cov(X, U) = E{[X − E(X)][U − E(U)]}
          = E{[X − E(X)]U}   (since E(U) = 0) ------------------------ (3.5)

From the reduced form (3.4), and since E(U) = E(V) = 0,

E(X) = (α0 + α1β0)/(1 − α1β1) + [α2/(1 − α1β1)]Z

so that

X − E(X) = (α1U + V)/(1 − α1β1)

Substituting this into equation (3.5):

cov(X, U) = E{[(α1U + V)/(1 − α1β1)]U}
          = [1/(1 − α1β1)] E(α1U² + UV)
          = α1σu²/(1 − α1β1) ≠ 0,   since E(UV) = 0

That is, the covariance between X and U is not zero. As a consequence, if OLS is applied to each
equation of the model separately, the coefficients will turn out to be biased. Now, let's examine
how the non-zero covariance of the error term and the explanatory variable leads to bias in the
OLS estimates of the parameters.

If we apply OLS to the first equation of the above structural model (3.3), Y = β0 + β1X + U, we
obtain

β̂1 = Σxy/Σx² = Σx(Y − Ȳ)/Σx² = ΣxY/Σx²   (since ȲΣx = 0)

   = Σx(β0 + β1X + U)/Σx² = β0Σx/Σx² + β1ΣxX/Σx² + ΣxU/Σx²

where lower-case letters denote deviations from the mean (x = X − X̄, y = Y − Ȳ). But we know
that Σx = 0 and ΣxX/Σx² = 1; hence

β̂1 = β1 + ΣxU/Σx² ------------------------ (3.6)

Taking the expected values on both sides;

E(β̂1) = β1 + E(ΣxU/Σx²)

Since we have already proved that cov(X, U) ≠ 0, it follows that E(ΣxU/Σx²) ≠ 0. Consequently,
E(β̂1) ≠ β1; that is, β̂1 is biased by an amount equal to E(ΣxU/Σx²).
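The bias is easy to see in a small Monte Carlo experiment (all coefficient values below are invented for the illustration): data are generated from the two-equation system with β1 = 0.5, and the OLS slope of Y on X settles above that value, near β1 + cov(X, U)/var(X).

```python
import numpy as np

rng = np.random.default_rng(42)
b0, b1 = 1.0, 0.5            # structural equation: Y = b0 + b1*X + U
a0, a1, a2 = 2.0, 0.8, 1.0   # X = a0 + a1*Y + a2*Z + V
n, reps = 500, 200
d = 1 - a1*b1                # the 1 - alpha1*beta1 term of reduced form (3.4)

estimates = []
for _ in range(reps):
    Z, U, V = rng.normal(size=(3, n))
    X = (a0 + a1*b0)/d + (a2/d)*Z + (a1*U + V)/d  # reduced form for X
    Y = b0 + b1*X + U                             # structural equation for Y
    x, y = X - X.mean(), Y - Y.mean()
    estimates.append((x @ y) / (x @ x))           # OLS slope of Y on X

print(round(float(np.mean(estimates)), 2))  # about 0.68, well above the true 0.5
```

The upward push matches the sign of cov(X, U) = α1σu²/(1 − α1β1) > 0 derived above; increasing n does not remove it, which is exactly the inconsistency the text describes.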

3.3.The Identification Problem

In simultaneous equation models, the problem of identification is a problem of model formulation;
it does not concern the estimation of the model. The estimation of the model depends upon the
empirical data and the form of the model. If the model is not in the proper statistical form, the
parameters may not be uniquely estimated even though adequate and relevant data are available.
In the language of econometrics, a model is said to be identified only when it is in a unique
statistical form that enables us to obtain unique estimates of its parameters from the sample
data. To illustrate the problem of identification, let's consider a simplified wage-price model.

W = α + βP + γE + U -------------------------------------- (i)

P = δ + λW + V ------------------------------------------------ (ii)

where W and P are percentage rates of wage and price inflation respectively, E is a measure of
excess demand in the labor market, and U and V are disturbances. If E is assumed to be
exogenously determined, then (i) and (ii) represent two equations determining two endogenous
variables: W and P. Let's explain the problem of identification with the help of these two
equations of a simultaneous equation model.

Let's use equation (ii) to express W in terms of P:

W = −δ/λ + (1/λ)P − (1/λ)V ------------------------------------------------- (iii)

Now, suppose A and B are any two constants. Let’s multiply equation (i) by A, multiply equation
(ii) by B and then add the two equations. This gives

(A − Bλ)W = (Aα + Bδ) + (Aβ − B)P + AγE + AU + BV

or

W = (Aα + Bδ)/(A − Bλ) + [(Aβ − B)/(A − Bλ)]P + [Aγ/(A − Bλ)]E
    + (AU + BV)/(A − Bλ) --------------- (iv)

Equation (iv) is what is known as a linear combination of (i) and (ii). The point about equation (iv)
is that it is of the same statistical form as the wage equation (i). That is, it has the form:

W = constant + (constant)P + (constant)E + disturbance

Moreover, since A and B can take any values we like, this implies that our wage price model
generates an infinite number of equations such as (iv), which are all statistically indistinguishable
from the wage equation (i). Hence, if we apply OLS or any other technique to data on W, P and
E in an attempt to estimate the wage equation, we can’t know whether we are actually estimating
(i) rather than one of the infinite number of possibilities given by (iv). Equation (i) is said to be
unidentified, and consequently there is now no way in which unbiased or even consistent
estimators of its parameters may be obtained.

Notice that, in contrast, price equation (ii) cannot be confused with the linear combination (iv),
because it is a relationship involving W and P only and does not, like (iv), contain the variable E.
The price equation (ii) is therefore said to be identified, and in principle it is possible to obtain
consistent estimates of its parameters. A function (an equation) belonging to a system of
simultaneous equations is identified if it has a unique statistical form, i.e., if there is no other
equation in the system, or formed by algebraic manipulation of the other equations of the system,
that contains the same variables as the function (equation) in question.

Identification problems do not arise only in two-equation models. Using the above procedure, we
can check identification easily if we have two or three equations in a given simultaneous
equation model. However, for an n-equation simultaneous equation model, such a procedure is very
cumbersome. In general, for any number of equations in a given simultaneous equation model, there
are two conditions that need to be satisfied to say whether the model is identified or not. In
the following section we will see the formal conditions for identification.

3.4.Order and Rank Conditions of Identification (without proof)


3.4.1. The order condition for identification

This condition is based on a counting rule of the variables included and excluded from the
particular equation. It is a necessary but not sufficient condition for the identification of
an equation. The order condition may be stated as follows.

For an equation to be identified the total number of variables (endogenous and exogenous)
excluded from it must be equal to or greater than the number of endogenous variables in
the model less one. Given that in a complete model the number of endogenous variables
is equal to the number of equations of the model, the order condition for identification is
sometimes stated in the following equivalent form. For an equation to be identified the total
number of variables excluded from it but included in other equations must be at least as
great as the number of equations of the system less one.
Let: G = total number of equations (= total number of endogenous variables)
K= number of total variables in the model (endogenous and predetermined)
M= number of variables, endogenous and exogenous, included in a particular
equation.
Then the order condition for identification may be symbolically expressed as:

(K − M) ≥ (G − 1)
[excluded variables ≥ total number of equations − 1]

For example, if a system contains 10 equations with 15 variables, ten endogenous and five
exogenous, an equation containing 11 variables is not identified, while another containing
5 variables is identified.
a. For the first equation we have
G  10 K  15 M  11


Order condition:
(K − M) < (G − 1)
(15 − 11) < (10 − 1); that is, the order condition is not satisfied.

b. For the second equation we have

G = 10,  K = 15,  M = 5

Order condition:
(K − M) > (G − 1)
(15 − 5) > (10 − 1); that is, the order condition is satisfied.

The order condition for identification is necessary for a relation to be identified, but it is
not sufficient, that is, it may be fulfilled in any particular equation and yet the relation may
not be identified.
3.4.2. The rank condition for identification
The rank condition states that: in a system of G equations any particular equation is
identified if and only if it is possible to construct at least one non-zero determinant of order
(G-1) from the coefficients of the variables excluded from that particular equation but
contained in the other equations of the model. The practical steps for tracing the
identifiability of an equation of a structural model may be outlined as follows.
Firstly. Write the parameters of all the equations of the model in a separate table, noting
that the parameter of a variable excluded from an equation is equal to zero.
For example, let a structural model be:

y1 = 3y2 − 2x1 + x2 + u1
y2 = y3 + x3 + u2
y3 = y1 − y2 − 2x3 + u3

where the y's are the endogenous variables and the x's are the predetermined variables.
This model may be rewritten in the form:

−y1 + 3y2 + 0y3 − 2x1 + x2 + 0x3 + u1 = 0
0y1 − y2 + y3 + 0x1 + 0x2 + x3 + u2 = 0
y1 − y2 − y3 + 0x1 + 0x2 − 2x3 + u3 = 0


Ignoring the random disturbances, the table of the parameters of the model is as follows:

Equations        y1   y2   y3   x1   x2   x3
1st equation     -1    3    0   -2    1    0
2nd equation      0   -1    1    0    0    1
3rd equation      1   -1   -1    0    0   -2

Secondly. Strike out the row of coefficients of the equation which is being examined for
identification. For example, if we want to examine the identifiability of the second equation
of the model we strike out the second row of the table of coefficients.
Thirdly. Strike out the columns in which a non-zero coefficient of the equation being
examined appears. By deleting the relevant row and columns we are left with the
coefficients of variables not included in the particular equation, but contained in the other
equations of the model. For example, if we are examining for identification the second
equation of the system, we will strike out the second, third and the sixth columns of the
above table, thus obtaining the following tables.
Table of structural parameters:                Table of parameters of excluded variables:

          y1   y2   y3   x1   x2   x3                    y1   x1   x2
1st       -1    3    0   -2    1    0          1st       -1   -2    1
2nd        0   -1    1    0    0    1
3rd        1   -1   -1    0    0   -2          3rd        1    0    0

Fourthly. Form the determinant(s) of order (G-1) and examine their value. If at least one
of these determinants is non-zero, the equation is identified. If all the determinants of order
(G-1) are zero, the equation is underidentified.
In the above example of exploration of the identifiability of the second structural equation
we have three determinants of order (G − 1) = 3 − 1 = 2. They are:

Δ1 = | -1  -2 | = 2 ≠ 0     Δ2 = | -2  1 | = 0     Δ3 = | -1  1 | = -1 ≠ 0
     |  1   0 |                  |  0  0 |              |  1  0 |

(the symbol Δ stands for 'determinant'). We see that we can form two non-zero determinants of
order G − 1 = 3 − 1 = 2; hence the second equation of our system is identified.

Fifthly. To see whether the equation is exactly identified or overidentified we use the order
condition (K − M) ≥ (G − 1). With this criterion, if the equality sign is satisfied, that is, if
(K − M) = (G − 1), the equation is exactly identified. If the inequality sign holds, that is, if
(K − M) > (G − 1), the equation is overidentified.


In the case of the second equation we have:

G = 3,  K = 6,  M = 3

and the counting rule (K − M) ≥ (G − 1) gives (6 − 3) > (3 − 1).

Therefore the second equation of the model is overidentified.
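The five steps can be carried out with a short matrix computation: delete the row of the equation under examination, keep the columns where that equation has zero coefficients, and ask whether the remaining matrix contains a non-zero determinant of order G − 1 (equivalently, whether its rank is G − 1). `rank_identified` is a name invented for this sketch.

```python
import numpy as np

# coefficient table of the model (rows: equations; columns: y1 y2 y3 x1 x2 x3)
A = np.array([[-1,  3,  0, -2, 1,  0],
              [ 0, -1,  1,  0, 0,  1],
              [ 1, -1, -1,  0, 0, -2]], dtype=float)

def rank_identified(A, eq):
    """Rank condition for equation `eq` (0-based index)."""
    G = A.shape[0]
    excluded = A[:, A[eq] == 0]            # columns of variables excluded from eq
    sub = np.delete(excluded, eq, axis=0)  # drop the equation's own row
    return np.linalg.matrix_rank(sub) == G - 1

print(rank_identified(A, 1))  # True: the second equation is identified
```

Changing `eq` checks the other equations of the system in exactly the same way.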
3.5. Estimation of Simultaneous Equations Models
1. Indirect Least Squares (ILS) Method

In this method, we first obtain the estimates of the reduced form parameters by applying OLS to
the reduced form equations and then indirectly get the estimates of the parameters of the structural
model. This method is applied to exactly identified equations.

Steps:

a. Obtain the reduced form equations (that is, express the endogenous variables in terms
of predetermined variables).
b. Apply OLS to the reduced form equations individually. OLS will yield consistent
estimates of the reduced form parameters (since each equation involves only non-
stochastic (predetermined) variables that appear as ‘independent’ variables).
c. Obtain (or recover back) the estimates of the original structural coefficients
using the estimates in step (b).
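These three steps can be sketched on the Keynesian model of Section 3.1 (the true values α = 100, β = 0.8 and the simulated sample are invented for the illustration): estimate the reduced-form regression C = π0 + π1Z + v by OLS, then recover the structural coefficients from π1 = β/(1 − β) and π0 = α/(1 − β).

```python
import numpy as np

rng = np.random.default_rng(7)
n = 1000
alpha, beta = 100.0, 0.8                 # true structural parameters
Z = rng.uniform(10, 100, size=n)         # exogenous variable
U = rng.normal(scale=5.0, size=n)
C = alpha/(1 - beta) + (beta/(1 - beta))*Z + U/(1 - beta)  # reduced form (3.5)

# step (b): OLS on the reduced form  C = pi0 + pi1*Z + v
X = np.column_stack([np.ones(n), Z])
pi0, pi1 = np.linalg.lstsq(X, C, rcond=None)[0]

# step (c): recover the structural coefficients
beta_hat = pi1 / (1 + pi1)        # from pi1 = beta/(1 - beta)
alpha_hat = pi0 * (1 - beta_hat)  # from pi0 = alpha/(1 - beta)
print(round(beta_hat, 2))         # close to the true 0.8
```

ILS works here because the consumption function is exactly identified: the two reduced-form coefficients map one-to-one into (α, β).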
2. Two-Stage Least Squares (2SLS) Method

The 2SLS procedure is generally applicable for the estimation of over-identified equations as it
provides unique estimators.

Steps:

a) Estimate the reduced form equations by OLS and obtain the predicted values Ŷi.


b) Replace the right-hand-side endogenous variables in the structural equations by the
corresponding Ŷi and estimate them by OLS.

Consider the following simultaneous equations model:

Y1  a1  b1Y2  c1 z1  c2 z2  u1............................................(a)
Y2  a2  b2Y1  c3 z3  u2 .....................................................(b)

where Y1 and Y2 are endogenous while z1, z2 and z3 are predetermined.

The 2-SLS procedure of estimation of equation (b) (which is over-identified) is:

• We first estimate the reduced form equations by OLS; that is, we regress Y1 on z1, z2 and z3
using OLS and obtain Ŷ1. We then replace Y1 by Ŷ1 and estimate equation (b) by OLS; that is,
we apply OLS to: Y2 = a2 + b2Ŷ1 + c3z3 + u2
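A sketch of the two stages on simulated data (all coefficient values are invented): the system is solved for Y1 to generate internally consistent data, Y1 is regressed on all predetermined variables, and the fitted values replace Y1 in equation (b).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
z1, z2, z3 = rng.normal(size=(3, n))   # predetermined variables
u1, u2 = rng.normal(size=(2, n))
a1, b1, c1, c2 = 0.5, 0.8, 1.0, 1.5    # equation (a), invented values
a2, b2, c3 = 1.0, 0.5, 2.0             # equation (b), true b2 = 0.5

# solve (a) and (b) simultaneously for Y1, then generate Y2 from (b)
d = 1 - b1*b2
Y1 = (a1 + b1*a2 + c1*z1 + c2*z2 + b1*c3*z3 + u1 + b1*u2) / d
Y2 = a2 + b2*Y1 + c3*z3 + u2

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

# stage 1: regress Y1 on all predetermined variables, keep fitted values
X1 = np.column_stack([np.ones(n), z1, z2, z3])
Y1_hat = X1 @ ols(X1, Y1)

# stage 2: replace Y1 by Y1_hat in equation (b) and apply OLS
X2 = np.column_stack([np.ones(n), Y1_hat, z3])
a2_hat, b2_hat, c3_hat = ols(X2, Y2)
print(round(b2_hat, 2))  # close to the true 0.5
```

In practice the second-stage standard errors from this naive replacement need adjustment, which dedicated routines (e.g., Stata's ivregress) handle automatically.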


CHAPTER FOUR

4. INTRODUCTION TO PANEL DATA REGRESSION MODELS


4.1.Introduction

You may recall that we discussed briefly the types of data that are generally available for empirical
analysis, namely, time series, cross section, and panel. In time series data we observe the values
of one or more variables over a period of time (e.g., GDP for several quarters or years). In cross-
section data, values of one or more variables are collected for several sample units, or entities, at
the same point in time (e.g., crime rates for 9 regions in Ethiopia for a given year). In panel data
the same cross-sectional unit (say a family or a firm or a state) is surveyed over time. In short,
panel data have space as well as time dimensions.

Hypothetical examples:

 Data on 200 Ethiopian Somali regional state schools in 2004 and again in 2005, for
400 observations total.

 Data on 9 regional states of Ethiopia, each state is observed in 5 years, for a total
of 45 observations.

 Data on 1000 individuals, in four different months, for 4000 observations total.

There are other names for panel data, such as pooled data (pooling of time series and cross-
sectional observations), combination of time series and cross-section data (cross-sectional
time-series data), micropanel data, and longitudinal data (a study over time of a variable or
group of subjects).

Why Should We Use Panel Data? Their Benefits and Limitations

Baltagi (2005) lists several benefits of using panel data. These include the following.

1. Controlling for individual heterogeneity. Panel data allows you to control for variables you
cannot observe or measure like cultural factors or difference in business practices across


companies; or variables that change over time but not across entities (i.e., national policies,
federal regulations, international agreements, etc.). That is, it accounts for individual
heterogeneity. Time-series and cross-section studies not controlling for this heterogeneity run
the risk of obtaining biased results.
2. Panel data give more informative data, more variability, less collinearity among the
variables, more degrees of freedom and more efficiency. Time-series studies are plagued
with multicollinearity.
3. Panel data are better able to study the dynamics of adjustment. Cross-sectional distributions
that look relatively stable hide a multitude of changes.
4. Panel data are better able to identify and measure effects that are simply not detectable in
pure cross-section or pure time-series data.
5. Panel data models allow us to construct and test more complicated behavioral models than
purely cross-section or time-series data. For example, technical efficiency is better studied
and modeled with panels.
6. Micro panel data gathered on individuals, firms and households may be more accurately
measured than similar variables measured at the macro level. Biases resulting from
aggregation over firms or individuals may be reduced or eliminated.

Limitations of panel data include:

1. Design and data collection problems. These include problems of coverage (incomplete
account of the population of interest), nonresponse (due to lack of cooperation of the
respondent or because of interviewer error), recall (respondent not remembering correctly),
frequency of interviewing, interview spacing, reference period, the use of bounding and
time-in-sample bias.
2. Distortions of measurement errors. Measurement errors may arise because of faulty
responses due to unclear questions, memory errors, deliberate distortion of responses (e.g.
prestige bias), inappropriate informants, misrecording of responses and interviewer effects.
3. Selectivity problems. These include:
(a) Self-selectivity. People choose not to work because the reservation wage is higher than
the offered wage. In this case we observe the characteristics of these individuals but


not their wage. Since only their wage is missing, the sample is censored. However, if
we do not observe all data on these people this would be a truncated sample.
(b) Nonresponse. This can occur at the initial wave of the panel due to refusal to participate,
nobody at home, untraced sample unit, and other reasons. Item (or partial) nonresponse
occurs when one or more questions are left unanswered or are found not to provide a
useful response.
(c) Attrition. While nonresponse occurs also in cross-section studies, it is a more serious
problem in panels because subsequent waves of the panel are still subject to
nonresponse. Respondents may die, or move, or find that the cost of responding is high.
4. Short time-series dimension. Typical micro panels involve annual data covering a short
time span for each individual. This means that asymptotic arguments rely crucially on the
number of individuals tending to infinity. Increasing the time span of the panel is not
without cost either. In fact, this increases the chances of attrition and increases the
computational difficulty for limited dependent variable panel data models.
5. Cross-section dependence. Macro panels on countries or regions with long time series that
do not account for cross-country dependence may lead to misleading inference.
Notation for panel data
A double subscript is used to distinguish entities (states, family, country, individuals, etc.)
and time periods. Consider the following simple panel data regression model:
Yit = β0 + β1Xit + β2Zi + uit,   i = 1,…,n,   t = 1,…,T …………………………(4.1)
Where i = entity (state), n = number of entities, so i = 1,…,n
t = time period (year, month, quarter, etc.), T = number of time periods, so that t =1,…,T

Panel data with k regressors:

Yit = β0 + β1X1it + β2X2it + … + βkXkit + uit ;   i = 1,…,n,   t = 1,…,T .............................(4.2)

4.2.Estimation of Panel Data Regression Model


4.2.1. The Fixed Effects (Entity/Time Fixed) Approach

You may apply entity fixed effects regression when you want to control for omitted variables
that differ among panels but are constant over time. On the other hand, if there are unobserved


effects that vary across time rather than across panels, we apply time fixed effects regression
model.

Use fixed effects (FE) whenever you are only interested in analyzing the impact of variables that
vary over time. FE explores the relationship between predictor and outcome variables within an
entity (country, person, company, etc.). Each entity has its own individual characteristics that may
or may not influence the predictor variables (for example, being male or female could influence
the opinion toward a certain issue, or the political system of a particular country could have some
effect on trade or GDP, or the business practices of a company may influence its stock price).

When using FE we assume that something within the individual may impact or bias the predictor
or outcome variables and we need to control for this. This is the rationale behind the assumption
of correlation between the entity's error term and the predictor variables. FE removes the effect of
those time-invariant characteristics from the predictor variables so we can assess the predictors'
net effect.

Another important assumption of the FE model is that those time-invariant characteristics are
unique to the individual and should not be correlated with other individual characteristics. Each
entity is different, therefore the entity's error term and the constant (which captures individual
characteristics) should not be correlated with the others. If the error terms are correlated, then FE
is not suitable since inferences may not be correct, and you need to model that relationship
(probably using random effects).

Entity-demeaned OLS Regression

Think of the following two variables panel regression model in fixed effect form:

Yit = αi + β1Xit + uit ………………………………………………….. (4.3)

 αi is called an "entity fixed effect" or "entity effect" – it is the constant (fixed) effect of
being in entity i.

The state averages satisfy:

(1/T) Σt Yit = αi + β1 (1/T) Σt Xit + (1/T) Σt uit

Deviation from entity averages:

Yit − (1/T) Σt Yit = β1 [Xit − (1/T) Σt Xit] + [uit − (1/T) Σt uit]

that is,

Ỹit = β1X̃it + ũit

where Ỹit = Yit − (1/T) Σt Yit, X̃it = Xit − (1/T) Σt Xit, and ũit = uit − (1/T) Σt uit.

Then we apply OLS to Ỹit = β1X̃it + ũit to estimate β1.
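A simulated sketch of entity demeaning (all numbers invented): the fixed effects αi are built to be correlated with Xit, so pooled OLS is biased, while OLS on the demeaned data recovers β1 = 2.

```python
import numpy as np

rng = np.random.default_rng(3)
n, T = 50, 10
beta1 = 2.0
alpha = rng.normal(scale=5.0, size=n)          # entity fixed effects
X = rng.normal(size=(n, T)) + alpha[:, None]   # X correlated with alpha
U = rng.normal(size=(n, T))
Y = alpha[:, None] + beta1*X + U               # model (4.3)

# pooled OLS ignores the fixed effects and is biased upward here
x, y = X.ravel() - X.mean(), Y.ravel() - Y.mean()
b_pooled = (x @ y) / (x @ x)

# entity-demeaned OLS: subtract each entity's time average first
Xd = (X - X.mean(axis=1, keepdims=True)).ravel()
Yd = (Y - Y.mean(axis=1, keepdims=True)).ravel()
b_fe = (Xd @ Yd) / (Xd @ Xd)

print(round(b_pooled, 2), round(b_fe, 2))  # pooled is far from 2; FE is close
```

Demeaning also wipes out αi itself, which is why time-invariant regressors cannot be estimated in a fixed-effects model.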

Example: Traffic Deaths and Alcohol Taxes

There are approximately 40,000 highway traffic fatalities each year in the U.S. Approximately
one-third of fatal crashes involve a driver who was drinking, and this fraction rises during peak
drinking periods. One study (Levitt and Porter, 2001) estimates that as many as 25% of drivers
on the road between 1 A.M. and 3 A.M. have been drinking, and that a driver who is legally
drunk is at least 13 times as likely to cause a fatal crash as a driver who has not been drinking.

Public policy issues

 Drunk driving causes massive externalities (sober drivers are killed, society bears
medical costs, etc. etc.) – there is ample justification for governmental intervention
 Are there any effective ways to reduce drunk driving? If so, what?
 What are effects of specific laws:
 mandatory punishment
 minimum legal drinking age
 economic interventions (alcohol taxes)
This example examines how effective various government policies to discourage drunk driving
(e.g., a beer tax) actually are in reducing traffic deaths.


Fixed-effects (within) regression Number of obs = 336


Group variable: state Number of groups = 48

R-sq: within = 0.0407 Obs per group: min = 7


between = 0.1101 avg = 7.0
overall = 0.0934 max = 7

F(1,47) = 5.05
corr(u_i, Xb) = -0.6885 Prob > F = 0.0294

(Std. Err. adjusted for 48 clusters in state)

Robust
fatalities Coef. Std. Err. t P>|t| [95% Conf. Interval]

beertax -.0000656 .0000292 -2.25 0.029 -.0001243 -6.87e-06


_cons .0002377 .000015 15.87 0.000 .0002076 .0002678

sigma_u .00007147
sigma_e .00001899
rho .93408484 (fraction of variance due to u_i)

3.4.2. The Random Effects (RE) Approach

If you believe that some omitted variables may be constant over time but vary among panels, and
others may be fixed among panels but vary over time, then you can apply the random effects
regression model.

Random effects assumes that the entity’s error term is not correlated with the predictors, which
allows time-invariant variables to play a role as explanatory variables. In random effects you
need to specify those individual characteristics that may or may not influence the predictor
variables.

The basic idea of the random effects model is to start with (4.3):

Yit = αi + β1Xit + uit ……………………………………. (4.3a)

Instead of treating αi as fixed, we assume that it is a random variable with a mean value of α (no
subscript i here). The intercept value for an individual entity can then be expressed as

αi = α + εi,  i = 1, 2, ..., N ........................................ (4.4)

where εi is a random error term with a mean value of zero and a variance of σε².

What we are essentially saying is that the entities included in our sample are drawn from a much
larger universe of such entities; they have a common mean value for the intercept (= α), and the
individual differences in the intercept values of each entity are reflected in the error term εi.


Substituting (4.4) into (4.3a), we get:

Yit = α + β1Xit + εi + uit
    = α + β1Xit + wit ………………………………….. (4.5)

where wit = εi + uit

In random effects model (REM) or error component model (ECM) it is assumed that the intercept
of an individual unit is a random drawing from a much larger population with a constant mean
value. The individual intercept is then expressed as a deviation from this constant mean value. One
advantage of ECM over FEM is that it is economical in degrees of freedom, as we do not have to
estimate N cross-sectional intercepts. We need only to estimate the mean value of the intercept
and its variance. ECM is appropriate in situations where the (random) intercept of each cross-
sectional unit is uncorrelated with the regressors.
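In practice the ECM is estimated by (feasible) GLS, which can be computed as OLS on quasi-demeaned data: a fraction θ of the entity average is subtracted from each variable, with θ between 0 (pooled OLS) and 1 (fixed effects). A minimal sketch of the standard textbook weight for a balanced panel follows; the variance numbers are hypothetical, not estimates from any data in these notes.

```python
import math

def re_theta(sigma_u2, sigma_eps2, T):
    """Quasi-demeaning weight theta for the RE/GLS transform
    Y*it = Yit - theta * Ybar_i, in the notation of (4.5):
    sigma_eps2 = Var(entity effect eps_i), sigma_u2 = Var(idiosyncratic uit),
    T = number of time periods (balanced panel assumed)."""
    return 1.0 - math.sqrt(sigma_u2 / (sigma_u2 + T * sigma_eps2))

# Hypothetical variance components:
print(re_theta(1.0, 0.0, 7))              # -> 0.0: no entity effect, RE = pooled OLS
print(round(re_theta(1.0, 100.0, 7), 3))  # -> 0.962: entity effect dominates, RE ~ FE
```

This makes concrete why RE sits "between" pooled OLS and FE: the larger the entity-effect variance relative to the idiosyncratic variance, the closer θ is to full demeaning.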

Random-effects GLS regression Number of obs = 336


Group variable: state Number of groups = 48

R-sq: within = 0.0407 Obs per group: min = 7


between = 0.1101 avg = 7.0
overall = 0.0934 max = 7

Wald chi2(1) = 0.18


corr(u_i, X) = 0 (assumed) Prob > chi2 = 0.6753

fatalities Coef. Std. Err. z P>|z| [95% Conf. Interval]

beertax -5.20e-06 .0000124 -0.42 0.675 -.0000295 .0000191


_cons .0002067 1.00e-05 20.68 0.000 .0001871 .0002263

sigma_u .00005158
sigma_e .00001899
rho .88067496 (fraction of variance due to u_i)

3.5. Choosing Between Fixed and Random Effects

If you are not sure which model, fixed effects or random effects, you should use, you can run a
Hausman test. To run a Hausman test in Stata, you need to save the coefficients from each of the
models and use the stored results in the test. To store the coefficients, you can use the "estimates
store" command.

. hausman fixed random

Coefficients
(b) (B) (b-B) sqrt(diag(V_b-V_B))
fixed random Difference S.E.

beertax -.0000656 -5.20e-06 -.0000604 .0000141

b = consistent under Ho and Ha; obtained from xtreg


B = inconsistent under Ha, efficient under Ho; obtained from xtreg

Test: Ho: difference in coefficients not systematic

chi2(1) = (b-B)'[(V_b-V_B)^(-1)](b-B)
= 18.35
Prob>chi2 = 0.0000

The Hausman test tests the null hypothesis that the coefficients estimated by the efficient random
effects estimator are the same as the ones estimated by the consistent fixed effects estimator. If
they are, then it is safe to use random effects. If you get a statistically significant P-value, however,
you should use fixed effects. In this example, the P-value is statistically significant; therefore,
fixed effects is more appropriate in this case.
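With a single regressor, the quadratic form shown in the Stata output collapses to a scalar, so the reported chi2 can be reproduced by hand from the numbers in the table (a sketch using the FE and RE estimates and the standard error of their difference):

```python
# Reproducing Stata's one-parameter Hausman statistic from the table above:
# chi2 = (b - B)' [V_b - V_B]^(-1) (b - B) reduces to ((b - B) / se_diff)^2.
b = -0.0000656       # fixed effects estimate (consistent under Ho and Ha)
B = -5.20e-06        # random effects estimate (efficient under Ho)
se_diff = 0.0000141  # sqrt(diag(V_b - V_B)) reported by Stata
chi2 = ((b - B) / se_diff) ** 2
print(round(chi2, 2))  # -> 18.35, matching the Stata output
```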

Summary

                  Fixed Effect Model                      Random Effect Model

Functional form   Yit = (α + ui) + βXit + vit             Yit = α + βXit + (ui + vit)
Intercepts        Varying across groups and/or times      Constant
Error variances   Constant                                Varying across groups and/or times
Slopes            Constant                                Constant
Estimation        LSDV, within effect method              GLS, FGLS
Hypothesis test   Incremental F test                      Breusch-Pagan LM test

Other Tests/Diagnostics

Testing for Time-fixed Effects


To see whether time fixed effects are needed when running a FE model, use the command testparm.
It performs a joint test of whether the dummies for all years are equal to 0; if they are, then no
time fixed effects are needed.
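Conceptually, this joint test is the usual restricted-versus-unrestricted F test. A sketch follows; only q = 6 restrictions and the 281 residual degrees of freedom correspond to the regression reported further down, while the sums of squared residuals are invented for illustration.

```python
def joint_f(ssr_r, ssr_ur, q, df_resid):
    """Joint F statistic for the null that q coefficients (here, the year
    dummies) are all zero: F = ((SSR_r - SSR_ur)/q) / (SSR_ur/df_resid)."""
    return ((ssr_r - ssr_ur) / q) / (ssr_ur / df_resid)

# Hypothetical SSRs for the restricted (no year dummies) and unrestricted models:
print(round(joint_f(120.0, 100.0, 6, 281), 2))  # -> 9.37
```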

. xi:xtreg fatality beertax i.year,fe


i.year _Iyear_1982-1988 (naturally coded; _Iyear_1982 omitted)

Fixed-effects (within) regression Number of obs = 336


Group variable: state Number of groups = 48

R-sq: within = 0.0803 Obs per group: min = 7


between = 0.1101 avg = 7.0
overall = 0.0876 max = 7

F(7,281) = 3.50
corr(u_i, Xb) = -0.6781 Prob > F = 0.0013

fatality Coef. Std. Err. t P>|t| [95% Conf. Interval]

beertax -.000064 .0000197 -3.24 0.001 -.0001029 -.0000251


_Iyear_1983 -7.99e-06 3.84e-06 -2.08 0.038 -.0000155 -4.41e-07
_Iyear_1984 -7.24e-06 3.84e-06 -1.89 0.060 -.0000148 3.07e-07
_Iyear_1985 -.0000124 3.84e-06 -3.23 0.001 -.00002 -4.83e-06
_Iyear_1986 -3.79e-06 3.86e-06 -0.98 0.327 -.0000114 3.81e-06
_Iyear_1987 -5.09e-06 3.90e-06 -1.31 0.193 -.0000128 2.58e-06
_Iyear_1988 -5.18e-06 3.96e-06 -1.31 0.192 -.000013 2.62e-06
_cons .0002428 .0000108 22.46 0.000 .0002216 .0002641

sigma_u .00007095
sigma_e .00001879
rho .93446372 (fraction of variance due to u_i)

F test that all u_i=0: F(47, 281) = 53.19 Prob > F = 0.0000

. testparm _Iyear*

( 1) _Iyear_1983 = 0
( 2) _Iyear_1984 = 0
( 3) _Iyear_1985 = 0
( 4) _Iyear_1986 = 0
( 5) _Iyear_1987 = 0
( 6) _Iyear_1988 = 0

F( 6, 281) = 2.01
Prob > F = 0.0642

We fail to reject the null that all the year coefficients are jointly equal to zero; therefore, no time
fixed effects are needed.

Testing for Random Effects: Breusch-Pagan Lagrange Multiplier (LM)


The LM test helps you decide between a random effects regression and a simple OLS regression.

The null hypothesis in the LM test is that the variance across entities is zero. That is, there is no
significant difference across units (i.e. no panel effect). The command in Stata is xttest0; type it
right after running the random effects model.

xtreg fatality beertax,re


. xttest0

Breusch and Pagan Lagrangian multiplier test for random effects

fatality[state,t] = Xb + u[state] + e[state,t]

Estimated results:
Var sd = sqrt(Var)

fatality 3.25e-09 .000057


e 3.60e-10 .000019
u 2.66e-09 .0000516

Test: Var(u) = 0
chibar2(01) = 754.57
Prob > chibar2 = 0.0000

Here we reject the null and conclude that random effects is appropriate. That is, there is evidence
of significant differences across states, therefore you should run a random effects regression
rather than a simple pooled OLS regression.
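The statistic xttest0 builds on is the classic Breusch-Pagan LM statistic computed from pooled OLS residuals; for a balanced panel it can be sketched as below. The toy residuals are invented, not taken from the fatality data.

```python
def bp_lm(residuals):
    """Breusch-Pagan LM statistic for random effects (balanced panel).
    residuals: dict mapping entity -> list of pooled-OLS residuals (length T).
    LM = NT/(2(T-1)) * [ sum_i (sum_t e_it)^2 / sum_{i,t} e_it^2 - 1 ]^2,
    chi-squared with 1 df under Ho: Var(u) = 0."""
    N = len(residuals)
    T = len(next(iter(residuals.values())))
    num = sum(sum(e) ** 2 for e in residuals.values())
    den = sum(x ** 2 for e in residuals.values() for x in e)
    return N * T / (2 * (T - 1)) * (num / den - 1) ** 2

# Toy residuals strongly clustered within entities (a clear panel effect):
resid = {"A": [1.0, 1.0], "B": [-1.0, -1.0]}
print(bp_lm(resid))  # -> 2.0
```

Stata's chibar2(01) is a boundary-corrected version of this test (variances cannot be negative), so the reference distribution it uses differs slightly from a plain chi-squared(1).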

Testing for Cross-Sectional Dependence/Contemporaneous Correlation: Using


Breusch-Pagan LM Test of Independence
According to Baltagi, cross-sectional dependence is a problem in macro panels with long time
series (over 20-30 years). This is not much of a problem in micro panels (few years and a large
number of cases). The null hypothesis in the B-P/LM test of independence is that residuals across
entities are not correlated. The command to run this test is xttest2 (run it after xtreg, fe):

xtreg fatality beertax, fe

xttest2

Testing for Cross-Sectional Dependence/Contemporaneous Correlation: Using
Pesaran CD Test

The Pesaran CD (cross-sectional dependence) test is used to test whether the residuals are
correlated across entities. Cross-sectional dependence (also called contemporaneous correlation)
can lead to bias in test results. The null hypothesis is that residuals are not correlated.

The command for the test is xtcsd; you have to install it by typing ssc install xtcsd

xtreg fatality beertax, fe

xtcsd, pesaran (alternatives include xtcsd, friedman and xtcsd, frees)

Pesaran's test of cross sectional independence = 5.256, Pr = 0.0000

Since the P-value is significant, we reject the null: there is cross-sectional dependence.

Since there is cross-sectional dependence in our model, it is suggested to use Driscoll and Kraay
standard errors via the command xtscc.


. xtscc fatality beertax,fe

Regression with Driscoll-Kraay standard errors Number of obs = 336


Method: Fixed-effects regression Number of groups = 48
Group variable (i): state F( 1, 47) = 50.19
maximum lag: 2 Prob > F = 0.0000
within R-squared = 0.0407

Drisc/Kraay
fatality Coef. Std. Err. t P>|t| [95% Conf. Interval]

beertax -.0000656 9.26e-06 -7.08 0.000 -.0000842 -.000047


_cons .0002377 5.74e-06 41.39 0.000 .0002262 .0002493

A test for heteroscedasticity is available for the fixed-effects model using the command xttest3.

. xttest3

Modified Wald test for groupwise heteroskedasticity


in fixed effect regression model

H0: sigma(i)^2 = sigma^2 for all i

chi2 (48) = 4826.21


Prob>chi2 = 0.0000

The null is homoscedasticity (or constant variance). Above we reject the null and conclude
heteroscedasticity.

Serial correlation tests apply to macro panels with long time series (over 20-30 years); it is not a
problem in micro panels (with very few years). Serial correlation causes the standard errors of
the coefficients to be smaller than they actually are, and the R-squared to be higher.

A Lagrange-Multiplier test for serial correlation is available using the command xtserial.
. xtserial fatality beertax

Wooldridge test for autocorrelation in panel data


H0: no first-order autocorrelation
F( 1, 47) = 14.261
Prob > F = 0.0004
The null is no serial correlation. Above we reject the null and conclude the data do have
first-order autocorrelation.

Testing for Unit Roots/Stationarity


xtunitroot performs a variety of tests for unit roots (or stationarity) in panel datasets. The Levin-
Lin-Chu (2002), Harris-Tzavalis (1999), Breitung (2000; Breitung and Das 2005), Im-Pesaran-
Shin (2003), and Fisher-type (Choi 2001) tests have as the null hypothesis that all the panels
contain a unit root. The Hadri (2000) Lagrange multiplier (LM) test has as the null hypothesis
that all the panels are (trend) stationary. The top of the output for each test makes explicit the
null and alternative hypotheses. Options allow you to include panel-specific means (fixed
effects) and time trends in the model of the data-generating process.
