Professional Documents
Culture Documents
1
Chapter One
Regression on Dummy Variables
2
1.1 The nature of dummy variables
3
Cont.
• For example, holding all other factors constant, female
daily wage workers are found to earn less than their
male counterparts, and nonwhites are found to earn
less than whites.
• This pattern may result from sex or racial discrimination,
but whatever the reason, qualitative variables such as sex
and race do influence the dependent variable and clearly
should be included among the explanatory variables.
4
Cont.
• Qualitative variables usually indicate the presence or
absence of a “quality” or an attribute, such as male or
female, black or white, or Christian or Muslim.
• One method of “quantifying” such attributes is by
constructing artificial variables that take on values of 1
or 0, 0 indicating the absence of an attribute and 1
indicating the presence (or possession) of that attribute.
5
Cont.
• For example, 1 may indicate that a person is a male, and 0
may designate a female; or 1 may indicate that a person is a
college graduate, and 0 that he is not, and so on.
• Variables that assume such 0 and 1 values are called dummy
variables.
• Alternative names are indicator variables, binary variables,
categorical variables, and dichotomous variables.
6
Cont.
• Dummy variables can be used in regression models just as
easily as quantitative variables. As a matter of fact, a
regression model may contain explanatory variables that are
exclusively dummy, or qualitative, in nature.
7
Cont.
• Model (1.01) may enable us to find out whether sex makes any
difference in a college professor’s salary, assuming, of course,
that all other variables such as age, degree attained, and years of
experience are held constant.
• Assuming that the disturbance satisfies the usually assumptions
of the classical linear regression model, we obtain from (1.01).
8
Assumptions of the Classical Linear Regression Model:
9
Cont.
the intercept term gives the mean salary of female college professors and the slope
coefficient tells by how much the mean salary of a male college professor differs from the
mean salary of his female counterpart, + reflecting the mean salary of the male college
professor.
A test of the null hypothesis that there is no sex discrimination ( H 0 : = 0) can be easily made
by running regression (1.01) in the usual manner and finding out whether on the basis of the t
test the estimated is statistically significant.
10
A. Dummy Independent Variable Models
1.2 Regression on one quantitative variable and one qualitative
variable with two classes, or categories
11
Cont.
• Model (1.03) contains one quantitative variable (years of
teaching experience) and one qualitative variable (sex)
that has two classes (or levels, classifications, or
categories), namely, male and female.
What is the meaning of this equation? Assuming, as usual, that E (u i ) = 0, we see that
12
Cont.
• Geometrically, we have the situation shown in fig. 1.1 (for
illustration, it is assumed that ). In words, model 1.01
postulates that the male and female college professors’ salary
functions in relation to the years of teaching experience have
the same slope but different intercepts.
• In other words, it is assumed that the level of the male
professor’s mean salary is different from that of the female
professor’s mean salary (by but the rate of change in the
mean annual salary by years of experience is the same for both
sexes.
13
14
Cont.
• If the assumption of common slopes is valid, a test of the
hypothesis that the two regressions (1.04) and (1.05) have the
same intercept (i.e., there is no sex discrimination) can be
made easily by running the regression (1.03) and noting the
statistical significance of the estimated on the basis of the
traditional t test.
• If the t test shows that is statistically significant, we reject the
null hypothesis that the male and female college professors’
levels of mean annual salary are the same.
15
Cont.
• Before proceeding further, note the following features of the
dummy variable regression model considered previously
1. To distinguish the two categories, male and female, we have
introduced only one dummy variable . For if always
denotes a male, when D = 0 we know that it is a female since
there are only two possible outcomes.
Hence, one dummy variable suffices to distinguish two
categories. The general rule is this: If a qualitative variable
has ‘m’ categories, introduce only ‘m-1’ dummy variables.
16
Cont.
• In our example, sex has two categories, and hence we
introduced only a single dummy variable. If this rule is not
followed, we shall fall into what might be called the dummy
variable trap, that is, the situation of perfect multicollinearity.
2. The assignment of 1 and 0 values to two categories, such as
male and female, is arbitrary in the sense that in our example we
could have assigned D = 1 for female and D = 0 for male.
17
Cont.
3. The group, category, or classification that is assigned the
value of 0 is often referred to as the base, benchmark, control,
comparison, reference, or omitted category. It is the base in
the sense that comparisons are made with that category.
18
What is dummy variable ?
• In statistics and econometrics, particularly in regression analysis,
a dummy variable is one that takes only the value 0 or 1 to
indicate the absence or presence of some categorical effect that
may be expected to shift the outcome.
19
What is the purpose of dummy variables?
• Dummy variables are useful because they enable us to use a
single regression equation to represent multiple groups.
20
How do you determine the number of dummy
variables?
• The first step in this process is to decide the number of dummy
variables.
• This is easy; it's simply k-1, where k is the number of levels of
the original variable.
• You could also create dummy variables for all levels in the
original variable, and simply drop one from each analysis.
21
Is 0 male or female?
• In the case of gender, there is typically no natural reason to code
the variable female = 0, male = 1, versus male = 0, female = 1.
• However, convention may suggest one coding is more familiar
to a reader; or choosing a coding that makes the regression
coefficient positive may ease interpretation.
22
Can dummy variables be 1 and 2?
• Technically, dummy variables are dichotomous,
quantitative variables.
• Their range of values is small; they can take on only two
quantitative values.
• As a practical matter, regression results are easiest to interpret
when dummy variables are limited to two specific values, 1 or
0.
23
Why do we drop one dummy variable?
24
1.3 Regression on one quantitative variable and one
qualitative variable with more than two classes
25
Cont.
• Now, unlike the previous case, we have more than two categories
of the qualitative variable education.
• Therefore, following the rule that the number of dummies be
one less than the number of categories of the variable, we should
introduce two dummies to take care of the three levels of
education.
• Assuming that the three educational groups have a
common slope but different intercepts in the regression
of annual expenditure on health care on annual income,
we can use the following model:
26
Cont.
Yi = 1 + 2 D2i + 3 D3i + X i + ui --------------------------(1.06)
X i = annual expenditure
27
Cont.
• Note that in the preceding assignment of the dummy variables
we are arbitrarily treating the “less than high school
education” category as the base category. Therefore, the
intercept will reflect the intercept for this category.
28
Cont.
• Assuming , we obtain from (1.06)
E (Yi | D2 = 0, D3 = 0, X i ) = 1 + X i
E (Yi | D2 = 1, D3 = 0, X i ) = ( 1 + 2 ) + X i
E (Yi | D2 = 0, D3 = 1, X i ) = ( 1 + 3 ) + X i
29
Cont.
• Geometrically, the situation is shown in fig 1.2 (for illustrative
purposes it is assumed that ).
30
1.4 Regression on one quantitative variable and two
qualitative variables
31
Cont.
• We can now write (1.03) as:
Yi = 1 + 2 D2i + 3 D3i + X i + ui ----------(1.07)
Where Yi = annual salary
D2 = 1 if female
=0 otherwise
D3 = 1 if white
=0 otherwise
32
Cont.
• Notice that each of the two qualitative variables, sex and color,
has two categories and hence needs one dummy variable for
each. Note also that the omitted, or base, category now is
“black female professor”.
33
Cont.
34
Cont.
• Once again, it is assumed that the preceding regressions
differ only in the intercept coefficient but not in the slope
coefficient.
• An OLS estimation of (1.07) will enable us to test a variety
of hypotheses. Thus, if is statistically significant, it will
mean that colour does affect a professor’s salary.
• Similarly, if is statistically significant, it will mean that
sex also affects a professor’s salary. If both these differential
intercepts are statistically significant, it would mean sex as
well as colour is an important determinant of professors’
salaries.
35
Cont.
• From the preceding discussion it follows that we can extend our
model to include more than one quantitative variable and more
than two qualitative variables.
• The only precaution to be taken is that the number of dummies
for each qualitative variable should be one less than the
number of categories of that variable.
36
1.5 Interaction effects
37
Cont.
• The implicit assumption in this model is that the differential
effect of the sex dummy is constant across the two levels of
education and the differential effect of the education dummy
is also constant across the two sexes.
• That is, if, say, the mean expenditure on clothing is higher for
females than males this is so whether they are college
graduates or not. Likewise, if, say, college graduates on the
average spend more on clothing than non-college graduates, this
is so whether they are female or males.
38
Cont.
• In many applications, such an assumption may be untenable. A
female college graduate may spend more on clothing than a
male graduate.
• In other words, there may be interaction between the two
qualitative variables and therefore their effect on mean Y
may not be simply additive as in (1.08) but multiplicative as
well, as in the following model:
39
Cont.
• From (4.09) we obtain
E (Yi | D2 = 1, D3 = 1, X i ) = (1 + 2 + 3 + 4 ) + X i ------------(4.10)
• which is the mean clothing expenditure of graduate females.
Notice that
• differential effect of being a female
• differential effect of being a college graduate
• differential effect of being a female graduate
40
Cont.
• If are all positive, the average clothing
expenditure of females is higher than the base category (which
here is male non-graduate), but it is much more so if the females
also happen to be graduates.
• This shows how the interaction dummy modifies the effect of the
two attributes considered individually.
• Whether the coefficient of the interaction dummy is statistically
significant can be tested by the usual t test. Omitting a significant
interaction term will lead to a specification bias.
41
Cont.
The importance of interactions among\
dummy variables
Help us to get influential variables
To avoid misspecification bias
42
Slope indicator variables
45
Cont.
We can allow for a change in a slope by
including in the model an additional
explanatory variable that is equal to the
product of an indicator variable and
continuous variable.
In our model, the slope of the relationship is
the value of an additional number of bed
rooms.
If we assume 1 value for homes in desirable
neibourhood, and 0 other wise; we can specify
our model as follows:
prhou = 0 + 1nbdr + (nbdr * neib) + ui
46
Cont.
The new variable (nbdr*neib) is the product of the
number of bedroom and the indicator variables, is
called an interaction variable as it captures the
interaction of location and number of bedroom on
condominium house prices.
Or it is called a slope –indicator variable or a slope
dummy variable, b/c it allows for the change in the
slope of the relationship.
The slope indicator variable takes a value equal to
nbdr for houses in the desirable neibourhood, when
neib = 1, and it is 0 for homes in other
neighbourhoods.
47
Cont.
A slope indicator variable is treated as just like
any other explanatory variable in a regression
model.
0 + 1 nbdr + nbdr − − − when D = 1
E ( prhou ) =
0 + 1 nbdr − − − − − − − − when D = 0
48
Cont.
The effect of including a slope indicator variable
also can be see by using calculus.
The partial derivatives of expected house price
with respect to number of bed rooms
E ( prhou ) 1 + when D = 1
(nbdr ) = 1 w hen D = 0
If 0
E( prhou) = 0 + ( 1 + )nbdr
slope = 1 +
prhou
E( prhou) = 0 + 1nbdr
0
slope = 1
nbdr
49
Cont.
( 0 + ) + ( 1 + ) nbdr − − − when D = 1
E ( prhou ) =
0 + 1 nbdr − − − − − − − − when D = 0
50
Some Important Uses of Dummy Variable
51
Cont.
52
Cont.
53
54
Cont.
55
56
57
58
59
Cont.
60
Cont.
61
Structural Stability
Testing for structural stability will help us to find
out whether two or more regressions are different,
where the difference may be in the intercepts or
the slopes or both.
Suppose we are interested in estimating a
simple saving function that relates domestic
household savings (S) with gross domestic
product (Y) for Ethiopia.
Suppose further that, at a certain point of
time, a series of economic reforms have been
introduced.
62
Cont.
So far we assumed that the intercept and
all the slope coefficients (βj's) are the
same/stable for the whole set of
observations. Y = Xβ + e
But, structural shifts and/or group
differences are common in the real world.
May be:
the intercept differs/changes, or
the (partial) slope differs/changes, or
both differ/change across categories or
time period.
63
Cont.
The hypothesis here is that such reforms might have
considerably influenced the savings - income
relationship, that is, the relationship between savings
and income might be different in the post reform
period as compared to that in the pre-reform period.
If this hypothesis is true, then we say a structural
change has happened.
65
Cont.
is the differential slope coefficient
3
66
Cont.
If only is statistically significant, then the
1
67
Cont.
Example 2: Using the DVR to Test for Structural
Break:
Recall the example of consumption function:
Period 1: consi = α1+ β1*inci+ui
vs. Period 2: consi = α2+ β2*inci+ui
Let’s define a dummy variable D1, where:
for the period 1974-1991, and
for the period 1992-2006
Then, consi = α0+α1*D1+β0*inci+β1(D1*inci)+ui
For period 1: consi = (α0+α1)+(β0+β1)inci+ui
For period 2 (base category): consi= α0+ β0*inci+ui
Regressing cons on inc, D1 and (D1*inc) gives:
cons = 1.95 + 152D1 + 0.806*inc – 0.056(D1*inc)
p-value: (0.968) (0.010) (0.000) (0.002)
68
Cont.
D1=1 for i ϵ period-1 & D1=0 for i ϵ period-2:
Period 1 (1974-1991):cons = 153.95 + 0.75*inc
Period 2 (1992-2005): cons = 1.95 + 0.806*inc
69
2. Chow’s test
One approach for testing the presence of structural change (structural
instability) is by means of Chow’s test. The steps involved in this
procedure:
Step 1: Estimate the regression equation for the whole period (pre-
reform plus post-reform periods) and find the error sum of
squares ( ESSR ) or RRSS.
Step 2: Estimate equation (model) using the available data in the pre-
reform period (say, of size n1), and find the error sum of squares (ESS1) or
RSS1
Step 3: Estimate equation (model) using the available data in the post-
reform period (say, of size n2), and find the error sum of squares (ESS2) or
RSS2.
Step 4: Calculate RSSUR= RSS1+RSS2.
Step 5: Calculate the Chow test statistic
(RSS R − RSSU ) / k
Fc =
RSSU /(n1 + n2 − 2k)
Where k is number of estimated regression coefficients
and intercept
70
Cont.
F is the critical value from the t-
(k,n1 +n2 −2k)
74
Cont.
Draw backs:
Chow’s test does not tell us whether the difference
(change) in the slope only, in the intercept only or
in both the intercept and the slope.
The Chow Tests
Using an F-test to determine whether a single
regression is more efficient than two/more separate
regressions on sub-samples.
75
Cont.
80
Dummy dependent variable
(Qualitative Response Model)
Qualitative Response Model shows situations in
which the dependent variable in a regression
equation simply represents a discrete choice
assuming only a limited number of values
Such a model is called
Limited dependent variable
Qualitative response
Categories of Qualitative Response Models
There are two broad categories of QRM
1. Binomial Model: it shows the choice
between two alternatives
e.g: Decision to participate in labor force or not
2. Multinomial models: the choice between
more than two alternatives
e.g: Y= 1, occupation is farming
=2, occupation is carpentry
=0, government employee
Important terminologies
Binary variables: variables that have two categories and used
to an event that has occurred or some characteristics present.
Cont.
1if Y i 0
Y =
0 if Y i 0
Probit model
The latent variable Y* is continuous (- < Y* <
).
It generates the observed binary variable Y.
An observed variable, Y can be observed in two
states:
if an event occurs it takes a value of 1
if an event does not occur it takes a value of 0
The latent variable is assumed to be a linear
function of the observed X’s through the
structural model.
Probit model
However, since the latent dependent variable is
unobserved the model cannot be estimated using
OLS.
Maximization of the likelihood function for
either the probit or the logit model is
accomplished by nonlinear estimation methods.
Maximum likelihood can be used instead.
Most often, the choice is between normal errors
and logistic errors, resulting in the probit
(normit) and logit models, respectively.
Probit model
The coefficients derived from the maximum
likelihood (ML) function will be the coefficients
for the probit model, if we assume a normal
distribution.
If we assume that the appropriate distribution
of the error term is a logistic distribution, the
coefficients that we get from the ML function
will be the coefficient of the logit model.
In both cases, as with the LPM, it is assumed
that E[i/Xi] = 0
Probit model
In the probit model, it is assumed that Var
(i/Xi) = 1; In the logit model, it is assumed that
Var (i/Xi) = 2 3 .
Hence, the estimates of the parameters ( ’s)
from the two models are not directly
comparable.
But as Amemiya suggests, a logit estimate of a
parameter multiplied by 0.625 gives a fairly
good approximation of the probit estimate of the
same parameter.
Probit model
CountR =
2