
8 Dummy Variables and Functional Forms

8.1 Dummy Variables

• We have implicitly assumed that our data are continuous in nature.

• Yet data are often inherently discrete, not continuous.

• What if we wanted to look at how GDP has changed over time?

• There is a good argument that during war years GDP should behave differently than during

peacetime.

• We can control for these “war years” with a “dummy variable.”

• A dummy variable (DV) is a dichotomous variable, which typically takes a value of zero if a

condition is not met and one if it is met.

• Thus, for our GDP model we might have something like

$$Y_t = \beta_0 D_{1t} + \beta_1 D_{2t} + \beta_2 X_t + \epsilon_t$$

• Let D1t = 1 when there is a war and 0 otherwise.

• Let D2t = 1 when there is no war, and 0 otherwise.

• Note that this model has no specific intercept term.

• If you estimated this model with an intercept term, the estimation would break down because

of perfect multicollinearity. This is because D1t + D2t = 1 for every observation and the

constant term’s “variable” is also 1.
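
• A minimal sketch of why the intercept breaks this model, assuming made-up data (the war and x lists and the matrix_rank helper are illustrative and not part of these notes): once a constant column is added, the regressor matrix has four columns but only rank three.

```python
from fractions import Fraction

def matrix_rank(rows):
    """Rank of a matrix via Gaussian elimination with exact fractions."""
    m = [[Fraction(v) for v in row] for row in rows]
    rank = 0
    for col in range(len(m[0])):
        # Find a row at or below position `rank` with a non-zero entry in this column.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank

# Hypothetical data: war = 1 in a war year, 0 in peacetime; x is any regressor.
war = [1, 1, 0, 0, 0, 1]
x = [3, 5, 2, 8, 7, 4]
d1 = war                      # D1 = 1 when there is a war
d2 = [1 - w for w in war]     # D2 = 1 when there is no war

with_intercept = [[1, a, b, c] for a, b, c in zip(d1, d2, x)]   # 4 columns
without_intercept = [[a, b, c] for a, b, c in zip(d1, d2, x)]   # 3 columns

rank_trap = matrix_rank(with_intercept)
rank_ok = matrix_rank(without_intercept)
print(rank_trap, rank_ok)
```

• The four-column matrix is rank deficient because the constant equals D1 + D2, so OLS cannot separate the three effects; dropping either the constant or one dummy restores full column rank.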

• We could run the model with only one dummy variable and use an intercept term:

$$Y_t = \beta_0 + \beta_1 D_{2t} + \beta_2 X_t + \epsilon_t$$

• Here, we could run OLS on this model with no problems.

• What do dummy variables do for us?

• The dummy variables actually act as a shift parameter for the mean of one group. They allow

us to differentiate across different groups. Why?

• We know that there are various functional forms we can use in econometrics. But they all

have to be linear in parameters. This linearity is restrictive because, on its own, it offers no

way to introduce discontinuities in the model.

• Dummy variables allow us to introduce discontinuities into a linear system.

• Consider the model with the intercept included. If D = 0 then the intercept for the model

is simply β0 . If D = 1 then the intercept becomes β0 + β1 .

• Thus, whenever the condition is satisfied, such that D = 1, the intercept of our regression

line shifts.

• Of course, whether β1 is greater than, less than or equal to zero determines which way the

line shifts. Consider the case β1 > 0 and β2 > 0:

[Figure: Y on the vertical axis. Two parallel lines with slope = β2; the lower line has intercept β0 and the upper line has intercept β0 + β1.]

• Here, the intercept shift allows the same marginal influence of X on GDP (Y ) but recognizes

that during peacetime GDP is greater by β1 at every value of X.
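
• A small sketch of the intercept shift, assuming made-up coefficient values (b0, b1 and b2 below are illustrative only): the gap between the D = 1 and D = 0 lines is β1 at every X, while the slope stays β2.

```python
def predict(x, d, b0=2.0, b1=1.5, b2=0.5):
    """Fitted value from Y = b0 + b1*D + b2*X (hypothetical coefficients)."""
    return b0 + b1 * d + b2 * x

# Vertical distance between the two lines at several values of X.
gaps = [predict(x, 1) - predict(x, 0) for x in (0, 10, 100)]

# Slope of each line over a one-unit change in X.
slope_d0 = predict(11, 0) - predict(10, 0)
slope_d1 = predict(11, 1) - predict(10, 1)
print(gaps, slope_d0, slope_d1)
```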

• One must be careful in interpreting intercept shifts by looking at the exact definition of the

dummy variables.

• One must also be careful of the “dummy variable trap.”

• As mentioned earlier, one cannot estimate a perfectly collinear model (because the rank of

X < k). Thus, if the dummy variables included in your model always sum to one, then there

can be no formal intercept in the econometric model.

• A good way to avoid the “trap” is to include an intercept term always.

• Then, we include j − 1 dummy variables for our qualitative variables where j is the number

of possible categories.

• For example: If sex is our qualitative variable, then j = 2 and we include one dummy, SEX,

equal to 0 for male and 1 for female.

• However, if we are looking at highest education then we may have several categories, e.g.,

grade school, high-school, some college, undergraduate degree, master’s degree, doctorate.

Here, j = 6 but we only include 5 dummy variables.

• The omitted category is the “reference” category, against which all other categories are

compared.

• Thus, if income were the dependent variable and level of education dummy variables are

included, then we might expect positive parameter estimates if grade school is the reference

category. On the other hand, if doctorate were the reference category, we might expect

negative parameter estimates. Be careful in the interpretation of dummy variable coefficients!
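
• A minimal sketch of the j − 1 coding for the education example, assuming a hypothetical encode helper (not from these notes) and grade school as the reference category:

```python
# Education categories in the order listed in the text (j = 6).
CATEGORIES = ["grade school", "high school", "some college",
              "undergraduate degree", "master's degree", "doctorate"]

def encode(level, reference="grade school"):
    """Return j - 1 dummies (0/1) for `level`, omitting the reference category."""
    if level not in CATEGORIES:
        raise ValueError(f"unknown category: {level}")
    kept = [c for c in CATEGORIES if c != reference]
    return [1 if level == c else 0 for c in kept]

row_ref = encode("grade school")   # reference category: all dummies are zero
row_phd = encode("doctorate")      # only the doctorate dummy is one
print(len(row_ref), row_ref, row_phd)
```

• Because the reference category is coded as all zeros, its mean is absorbed by the intercept, and each dummy coefficient measures the difference from that category. Passing a different reference argument changes which comparison each coefficient makes.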

8.2 Interaction Terms

• A potential pitfall in the use of dummy variables can be given in an example taken from labor

economics.

• Consider the hypothesis that there is wage discrimination against females. One may wish

to estimate a wage equation that would control for various individual attributes such as

productivity, motivation, teamwork, etc. At the same time, one would want to include a

dummy variable that would control for the sex of a particular worker.

• This is a common approach in labor models. Thus, an example of a wage model would be

$$W_i = \beta_0 + \beta_1 SEX_i + \beta_2 EDU_i + \alpha Z_i + \epsilon_i$$

where EDU is the education (typically in years), and Zi is a set of variables thought to

influence wages such as the years of tenure at a particular job, the age of the worker and the

experience of the worker.

• Oftentimes, the explanatory variables are highly correlated with each other. This is a potential

problem, because such multicollinearity inflates the standard errors of the estimates.

• Nonetheless, let’s say that we estimate a model and find that β1 < 0. Some would claim that

this is evidence of discrimination, but is it really?

• The estimated equation only states that, given the sample used, women start out at a

lower wage. In this model, the returns to education, tenure, age and experience are assumed

the same for men and women.

• Thus, a picture of this in the EDU space would look like

[Figure: wage plotted against EDUCATION. Two parallel lines with slope ∆W/∆EDU = β2: the Males line with intercept β0 and, below it, the Females line with intercept β0 + β1.]

• Here, the implication is that women start at a lower wage (as indicated by the lower intercept),

i.e., that β1 < 0 but that the returns to education are the same for men and women alike.

• This might be true in some areas, not in others, e.g., economics.

• Perhaps we think that women and men are actually rewarded differently for their education.

• To accommodate this possibility, we then interact our SEX dummy variable with EDU to

obtain:

$$W_i = \beta_0 + \beta_1 SEX_i + \beta_2 EDU_i + \gamma_0 (EDU_i \times SEX_i) + \alpha Z_i + \epsilon_i$$

• In this model, the intercepts are allowed to shift as well as the slope parameter on EDU.

• If the observation is of a male, then we see that

$$\frac{\Delta W_i}{\Delta EDU_i} = \beta_2$$

• Whereas if the observation is of a female then we see that,

$$\frac{\Delta W_i}{\Delta EDU_i} = \beta_2 + \gamma_0$$
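
• To see where these two slopes come from, substitute SEX = 0 (male) and SEX = 1 (female) into the interacted model. The worked step below uses only the symbols already defined in the wage equation:

$$
\begin{aligned}
\text{Male } (SEX_i = 0):\quad & W_i = \beta_0 + \beta_2 EDU_i + \alpha Z_i + \epsilon_i \\
\text{Female } (SEX_i = 1):\quad & W_i = (\beta_0 + \beta_1) + (\beta_2 + \gamma_0)\, EDU_i + \alpha Z_i + \epsilon_i
\end{aligned}
$$

• So β1 shifts the intercept and γ0 shifts the slope on EDU.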

• When the slopes differ across groups, then we can claim a difference in the returns to education

across the two sexes.

• If the parameter γ0 is insignificant, then we cannot reject the hypothesis that there is no

difference in the marginal effect of education on wages across the sexes. This is NOT the same

as saying that there is no wage differential, however: β1 may still differ from zero.

• What if we think that the returns to education are not exactly linear, but that there may be

some second-order effect of education on wages? This would seem a rather straightforward

idea.

• We could accommodate this non-linearity by including an interaction of education with itself


to obtain:

$$W_i = \beta_0 + \beta_1 SEX_i + \beta_2 EDU_i + \beta_3 EDU_i^2 + \gamma_0 (EDU_i \times SEX_i) + \gamma_1 (EDU_i^2 \times SEX_i) + \alpha Z_i + \epsilon_i$$

• This model allows for a second-order effect of education (probably negative) that may

differ across sexes.
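
• A short sketch of the marginal return to education in this quadratic model, assuming made-up coefficient values and ignoring Z and the error term (all numbers below are illustrative only). The analytic slope (β2 + γ0 SEX) + 2 EDU (β3 + γ1 SEX) is checked against a finite difference:

```python
def wage(edu, sex, b0=5.0, b1=-1.0, b2=1.2, b3=-0.03, g0=0.1, g1=0.005):
    """Wage from the quadratic model; sex = 1 for female, 0 for male."""
    return (b0 + b1 * sex + b2 * edu + b3 * edu**2
            + g0 * edu * sex + g1 * edu**2 * sex)

def marginal_return(edu, sex, b2=1.2, b3=-0.03, g0=0.1, g1=0.005):
    """Analytic dW/dEDU: (b2 + g0*SEX) + 2*EDU*(b3 + g1*SEX)."""
    return (b2 + g0 * sex) + 2 * edu * (b3 + g1 * sex)

def numeric_slope(edu, sex, h=1e-6):
    """Central finite difference of wage() with respect to education."""
    return (wage(edu + h, sex) - wage(edu - h, sex)) / (2 * h)

# Compare the analytic and numeric slopes at 12 years of education.
slopes = {sex: (marginal_return(12, sex), numeric_slope(12, sex)) for sex in (0, 1)}
print(slopes)
```

• Setting b3 = g1 = 0 recovers the linear interaction model, where the slopes are β2 for men and β2 + γ0 for women.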

• A picture of this may look like

[Figure: Wi plotted against EDUCATION. Two concave curves: Males, with intercept β0 and slope = β2 + 2β3 EDUi, and Females, with intercept β0 + β1 and slope = (β2 + γ0 ) + 2EDUi (β3 + γ1 ).]

• In this case, there is a difference in the intercept terms, but the γ parameters would be

insignificant (γ0 = γ1 = 0). This is reflected in the parallel course of the two curves.

• Note that there would be a “glass ceiling” in this graph: even though the returns to education

are the same on the margin, wage levels differ at every level of education because of the

starting values of the wages (as reflected in the intercepts).

• Consider an alternative picture, however.

[Figure: Wi plotted against EDUCATION. The Males curve has intercept β0 and slope = β2 + 2β3 EDUi. The Females curve starts lower, at intercept β0 + β1, has slope = (β2 + γ0 ) + 2EDUi (β3 + γ1 ), and crosses above the Males curve at higher levels of education.]

• Here, we see that β1 < 0 so that women start out at a lower wage. We also see that the

returns to education on the margin are greater for women than for men (γ0 + 2γ1 EDUi > 0,

for example with γ1 > 0), so that at some level of education, women will begin to earn more

than men. The question then is how much education does a woman need to overcome the

initial disparity?
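
• One way to answer this question, sketched under assumed coefficient values (the numbers are hypothetical): in the quadratic model the female-minus-male wage gap is β1 + γ0 EDU + γ1 EDU², so the break-even level of education is the smallest non-negative root of that expression.

```python
import math

def break_even_education(b1, g0, g1):
    """Smallest EDU >= 0 at which b1 + g0*EDU + g1*EDU**2 = 0, or None if none exists."""
    if g1 == 0:
        # Linear gap: a single candidate root, if the slopes differ at all.
        if g0 == 0:
            return None
        root = -b1 / g0
        return root if root >= 0 else None
    disc = g0**2 - 4 * g1 * b1
    if disc < 0:
        return None  # the two wage curves never cross
    roots = [(-g0 + s * math.sqrt(disc)) / (2 * g1) for s in (1, -1)]
    valid = [r for r in roots if r >= 0]
    return min(valid) if valid else None

# Hypothetical estimates: lower starting wage, higher marginal return for women.
edu_star = break_even_education(b1=-1.0, g0=0.1, g1=0.005)
gap_at_star = -1.0 + 0.1 * edu_star + 0.005 * edu_star**2
print(edu_star, gap_at_star)
```

• With estimated coefficients in place of these made-up values, the same calculation gives the break-even point implied by a fitted model; a result of None means the curves never cross for non-negative education.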

• Thus, the use of dummy variables and interaction terms can allow us to test all sorts of extra

hypotheses.

• These pictures do not necessarily represent general conclusions. Different data samples will

reveal different results.

8.3 Time Trends

• In certain applications we recognize that there may be a time trend in the data.

• Good examples are prices and GDP, which tend to grow over time. We would like to control

for these time trends.

• One may think to use dummy variables to control for different time periods. However, this

could be problematic.

• If you have T time periods in your sample and T observations, then to treat each time period

as different would require T − 1 dummy variables, plus an intercept term. At this point, you

have exhausted all the degrees of freedom available in the sample.
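
• The degrees-of-freedom count can be sketched as follows, using a hypothetical residual_df helper and the 73 annual observations from the example later in these notes:

```python
def residual_df(n_obs, n_slopes, intercept=True):
    """Residual degrees of freedom: observations minus estimated parameters."""
    return n_obs - n_slopes - (1 if intercept else 0)

T = 73
df_four_regressors = residual_df(T, 4)      # four regressors plus an intercept
df_five_regressors = residual_df(T, 5)      # five regressors plus an intercept
df_period_dummies = residual_df(T, T - 1)   # one dummy per period except the reference
print(df_four_regressors, df_five_regressors, df_period_dummies)
```

• With a full set of period dummies nothing is left over to estimate the error variance, which is why a single trend variable is the more practical choice.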

• An alternative approach is to create what is called a “trend variable.”

• A trend variable is a monotonically increasing variable, typically just the time index.

• Trend variables enter into an equation as a separate explanatory variable, e.g.,

$$y_t = \beta_0 + \beta_1 X_t + \beta_2 TIME_t + \epsilon_t$$

where TIME = 1, 2, . . . , T .

• Note: there is a potential problem with time trends in log-log or lin-log models.

• If you take the log of time, you obtain the following relative to a linear time trend:

[Figure: the series trend and lntrend plotted against trend (horizontal axis 0 to 100, vertical axis 0 to 100).]

• Thus, when you take the log of time, you may be discounting the effect of time on your

dependent variable.

• Note: The log of zero is undefined, so be careful in how you set up your time trend (for example, start it at 1 rather than 0 if it will be logged).
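
• A brief sketch of both points, assuming a trend that runs from 1 to 100 (the variable names mirror the figure labels, but the code itself is illustrative): each extra period adds less and less to the logged trend, and a trend that starts at zero cannot be logged at all.

```python
import math

T = 100
trend = list(range(1, T + 1))            # start at 1 so the log is defined
lntrend = [math.log(t) for t in trend]

# Increment in the logged trend early and late in the sample.
first_step = lntrend[1] - lntrend[0]     # log(2) - log(1)
last_step = lntrend[-1] - lntrend[-2]    # log(100) - log(99)

# A trend that starts at 0 cannot be logged.
try:
    math.log(0)
    log_zero_defined = True
except ValueError:
    log_zero_defined = False

print(first_step, last_step, log_zero_defined)
```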

8.4 Example: The Taft-Hartley Act of 1947

• Here is a graph of annual work stoppages from 1916-2001.


[Figure: strikes (vertical axis, 0 to 5000) plotted against year (horizontal axis, 1920 to 2000).]

• It is very apparent that something dramatic happened around the mid-1940s: work stoppages

declined significantly.

• In fact, the Taft-Hartley Act was passed in 1947. From Infoplease.com: “the act qualified

or amended much of the National Labor Relations (Wagner) Act of 1935, the federal law

regulating labor relations of enterprises engaged in interstate commerce, and it nullified

parts of the Federal Anti-Injunction (Norris-LaGuardia) Act of 1932. The act established

control of labor disputes on a new basis by enlarging the National Labor Relations Board and

providing that the union or the employer must, before terminating a collective-bargaining

agreement, serve notice on the other party and on a government mediation service.”

• Multiple regression results using 73 years of data from 1929-2001:

. sum strikes realgdp unemp minwage time if e(sample)

Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
strikes | 73 829.2466 1330.081 17 4985
realgdp | 73 3627.236 2473.242 603.3 9214.54
unemp | 73 7.306849 5.179692 1.2 25.2
minwage | 73 1.846575 1.621499 0 5.15
time | 73 36 21.21713 0 72
. reg strikes realgdp unemp time minwage

Source | SS df MS Number of obs = 73


-------------+------------------------------ F( 4, 68) = 12.89
Model | 54920903.3 4 13730225.8 Prob > F = 0.0000
Residual | 72455474.2 68 1065521.68 R-squared = 0.4312
-------------+------------------------------ Adj R-squared = 0.3977
Total | 127376378 72 1769116.36 Root MSE = 1032.2

---------------------------------------------------------------------
strikes | Coef. Std. Err. t P>|t| [95% Conf. Interval]
---------+-----------------------------------------------------------
realgdp | .530194 .3803215 1.39 0.168 -.2287257 1.289114
unemp |-27.50805 29.85787 -0.92 0.360 -87.0885 32.07241
time |-106.8932 25.38961 -4.21 0.000 -157.5574 -56.22906
minwage | 84.74099 515.9023 0.16 0.870 -944.726 1114.208
_cons | 2798.781 577.0888 4.85 0.000 1647.218 3950.344
---------------------------------------------------------------------
. reg strikes realgdp unemp time minwage pre47

Source | SS df MS Number of obs = 73


-------------+------------------------------ F( 5, 67) = 68.86
Model | 106626680 5 21325336 Prob > F = 0.0000
Residual | 20749697.4 67 309696.976 R-squared = 0.8371
-------------+------------------------------ Adj R-squared = 0.8249
Total | 127376378 72 1769116.36 Root MSE = 556.5

---------------------------------------------------------------------
strikes | Coef. Std. Err. t P>|t| [95% Conf. Interval]
---------+-----------------------------------------------------------
realgdp |-.8260032 .2303428 -3.59 0.001 -1.285769 -.366237
unemp |-100.4215 17.05749 -5.89 0.000 -134.4684 -66.37464
time | 112.2559 21.79502 5.15 0.000 68.75289 155.759
minwage | 24.52896 278.1735 0.09 0.930 -530.7077 579.7656
pre47 | 4519.136 349.7473 12.92 0.000 3821.037 5217.234
_cons | -641.696 409.5055 -1.57 0.122 -1459.072 175.6804
---------------------------------------------------------------------

• If we do not include the pre47 dummy variable the constant term is overstated and the

impacts of unemployment and real gross domestic product are muted (insignificant).

• If we include the pre47 dummy variable, the constant term during the pre-1947 period is

equal to

. lincom _b[pre47]+_b[_cons]

( 1) pre47 + _cons = 0

-----------------------------------------------------------------
strikes | Coef. Std. Err. t P>|t| [95% Conf. Interval]
----------+------------------------------------------------------
(1) | 3877.44 649.3468 5.97 0.000 2581.338 5173.542

whereas after 1947, the constant term is -641.696, which is not significantly different from zero.
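
• The point estimate and t-statistic in the lincom row can be reproduced by hand from the reported numbers. A small check, with the values copied from the output above (the variable names are illustrative):

```python
# Coefficients copied from the regression that includes pre47.
b_pre47 = 4519.136
b_cons = -641.696

# Intercept in the pre-1947 period: _cons + pre47 * 1.
pre47_intercept = b_cons + b_pre47

# t-statistic on the lincom row: estimate divided by its reported standard error.
lincom_se = 649.3468
t_stat = pre47_intercept / lincom_se

print(round(pre47_intercept, 3), round(t_stat, 2))
```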

• Notice that after we include the pre47 dummy variable, REALGDP is negatively related to

work stoppages, as is UNEMP. Both of these results make sense: with greater income and

higher unemployment, workers are less likely to go on strike.

• What if we include a post-1947 dummy instead?

. reg strikes realgdp unemp time minwage post47

Source | SS df MS Number of obs = 73


-------------+------------------------------ F( 5, 67) = 68.86
Model | 106626680 5 21325336 Prob > F = 0.0000
Residual | 20749697.4 67 309696.976 R-squared = 0.8371
-------------+------------------------------ Adj R-squared = 0.8249
Total | 127376378 72 1769116.36 Root MSE = 556.5

---------------------------------------------------------------------
strikes | Coef. Std. Err. t P>|t| [95% Conf. Interval]
---------+-----------------------------------------------------------
realgdp | -.8260032 .2303428 -3.59 0.001 -1.285769 -.366237
unemp | -100.4215 17.05749 -5.89 0.000 -134.4684 -66.37464
time | 112.2559 21.79502 5.15 0.000 68.75289 155.759
minwage | 24.52896 278.1735 0.09 0.930 -530.7077 579.7656
post47 | -4519.136 349.7473 -12.92 0.000 -5217.234 -3821.037
_cons | 3877.44 322.1265 12.04 0.000 3234.473 4520.407
---------------------------------------------------------------------

• Notice that the marginal impacts of REALGDP and UNEMP have not changed. The only

thing that has changed is the intercept term. Now, the post-1947 intercept is exactly equal

to _cons from the regression that included pre47:

. lincom _b[post47]+_b[_cons]

( 1) post47 + _cons = 0

------------------------------------------------------------------
strikes | Coef. Std. Err. t P>|t| [95% Conf. Interval]
---------+--------------------------------------------------------
(1) |-641.696 739.3518 -0.87 0.389 -2117.448 834.0564
------------------------------------------------------------------
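
• The equivalence of the two regressions follows from post47 = 1 − pre47. Writing c for _cons and δ for the coefficient on pre47 in the pre47 regression (symbols introduced here only for illustration):

$$
\begin{aligned}
c + \delta\,\mathrm{pre47} &= c + \delta\,(1 - \mathrm{post47}) = (c + \delta) - \delta\,\mathrm{post47} \\
c + \delta &= -641.696 + 4519.136 = 3877.44, \qquad -\delta = -4519.136
\end{aligned}
$$

• These are the _cons and post47 coefficients reported in the post47 regression, and every other coefficient is unchanged because the two sets of regressors span the same space.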
