
Semester II

Application of Econometric Theory with STATA

STATA Instruction Manual

By

TANIMA BANERJEE

UNIVERSITY OF CALCUTTA

Email id: tanima.bnrj@gmail.com


Unit 3: Dummy Variable Regression
Variables are mainly of two types: quantitative variables and qualitative variables. Here,
our focus is on qualitative variables, also called categorical variables, because they only
represent categories and take a limited set of numerical codes for those categories. Unlike
quantitative variables, they are not continuous by nature.

When a regression model includes categorical variables, we need to create dummy variables
to incorporate them. Dummy variables represent the categories that a qualitative variable
consists of. If a qualitative variable has two categories, then we can create two dummy
variables, one for each category. It is important to note that a dummy variable can take only
one of two values, usually 0 and 1. For example, gender/sex is a categorical/qualitative
variable with only two categories in our data, male and female. We can then create two dummy
variables, one for male and another for female. The male dummy takes the value 1 for
individuals who are male and 0 for those who are female, while the opposite holds for the
female dummy. These simple variables are a very powerful tool for capturing qualitative
characteristics of individuals, such as gender, race, and geographic region of residence.
When dummy variables are used in a regression model, some changes in model interpretation
are required. We will discuss how to incorporate dummy variables in a regression equation,
how to run a dummy variable regression in STATA and how to interpret the results.

Application of dummy variables using STATA:


Use of Intercept Dummy:
Dummy variables allow us to construct models in which some or all regression model
parameters, including the intercept, change for some observations in the sample. To
make matters specific, we consider an example from wage determination. Wage is determined by
a number of factors. For the present, let us consider only age (as a proxy for the level of
experience of an individual) and gender as the explanatory variables in the wage regression
model. So, we can write the wage equation as follows:
wage_i = β1 + β2 age_i + β3 sex_i + u_i        (1)
Since sex/gender is a categorical variable, we need to incorporate it through a dummy
variable. Sex has two categories, male and female, but we must include only one dummy
variable to represent it in order to avoid the dummy variable trap. If we include both the
male and the female dummies along with the intercept, the two dummies sum to 1 for every
observation, which creates the problem of perfect collinearity. Hence, in a model with an
intercept, we should always incorporate one less dummy variable than the number of
categories of any qualitative variable. The excluded category is the reference category
against which we make inferences about the impact of the other categories on the dependent
variable in the regression model. Here, we will incorporate only the female dummy variable
and the males will serve as the reference group. So, we can rewrite the regression equation
as follows:
wage_i = β1 + β2 age_i + β3 d_female_i + u_i        (2)

Here, d_female takes the value 1 for female and 0 for male.
Now, for any given age,
wage_i^ = β1^ + β2^ age_i + β3^ represents estimated wage for female, and
wage_i^ = β1^ + β2^ age_i represents estimated wage for male,
where the sign ^ stands for estimated value obtained after running regression.
Thus, the presence of the gender dummy variable causes a change in intercept of the regression
function for certain observations. If d_female = 1, then the intercept of the regression function
is (β1^ + β3^). On the other hand, for other observations, i.e. if d_female = 0, then the
intercept is only β1^.
The inclusion of a gender dummy variable causes a parallel shift in the relationship
between age and wage by the amount β3^. Since females generally earn less than males on
average, we expect β3^ < 0, in which case the wage regression line for females lies below the
male wage regression line. The absolute value of β3^ then stands for the ‘wage premium’ that
a male worker receives over a female worker of the same age.
A dummy variable like d_female that is incorporated into a regression model to capture
a shift in the intercept as the result of some qualitative factor, here gender, is an intercept
dummy variable.
The least squares estimator’s properties are not affected by the fact that one of the
explanatory variables consists only of zeros and ones. A dummy variable, here d_female, is
treated as any other explanatory variable. We can construct an interval estimate for β3
(the coefficient of the dummy variable) or we can test the significance of its least squares
estimate β3^. Such a test is a statistical test of whether the gender effect on wage is
“statistically significant” in our example. If β3^ is not significantly different from 0, then
there is no statistical evidence of a wage premium due to gender.
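
The dummy variable trap described above can be seen directly in Stata. The following is a minimal sketch on simulated data, not on wage.dta: the data-generating numbers, the seed and the extra variable d_male are assumptions made only for illustration, while the other variable names follow the manual.

```stata
* Hypothetical simulated data; only the variable names follow the manual.
clear
set seed 12345
set obs 1000
generate sex = 1 + (runiform() < 0.5)        // 1 = male, 2 = female
generate age = 20 + floor(41*runiform())     // ages 20 to 60
generate d_female = (sex == 2)
generate d_male   = (sex == 1)               // illustration only
generate log_wage = 6.8 + 0.007*age - 0.06*d_female + rnormal()

* The two dummies always sum to 1, which duplicates the constant term.
assert d_female + d_male == 1

* Stata detects the perfect collinearity and omits one of the dummies.
regress log_wage age d_female d_male
display "rank of the estimated model = " e(rank)
```

Because one dummy is dropped automatically, only three parameters are estimated (the constant, age and one gender dummy). Including a single dummy from the start makes the choice of reference group explicit instead of leaving it to the software.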

Important Note: In practice, we generally use the log of wage as the dependent variable to
capture the non-linearity of the wage relation. So, we actually estimate the following
regression equation:
log_wage_i = β1 + β2 age_i + β3 d_female_i + u_i        (3)

Then, each regression coefficient multiplied by 100 represents the approximate percentage
change in the dependent variable (wage) for a one unit change in the explanatory variable.
Now for the dummy variable, d_female, if β3^ < 0, then we can say that females earn
approximately (|β3^| * 100) percent less than males on average irrespective of age. In other
words, if we hold age (and other explanatory variables, if any, in the regression model)
constant at a certain level, then females would earn approximately (|β3^| * 100) percent less
than males on average. This approximation works well when β3^ is small in absolute value;
the exact percentage difference is 100 * (exp(β3^) - 1).
To run the regression equation of our example in Stata, we first need to create a female
dummy variable. First open the data set named wage.dta. Then, type:
g d_female = 1 if sex == 2
replace d_female = 0 if d_female == .
[In this data set the variable sex has two values: 1 representing male and 2 representing
female. Hence, d_female will take the value 1 when sex takes the value 2]
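
The two-step recode above assigns 0 to every observation that is not coded 2, including any observation where sex is missing. The sketch below shows a one-line alternative that keeps missing values missing, followed by a quick check of the coding. It runs on simulated data with a few artificial missing values; the data and the seed are assumptions for illustration only.

```stata
* Hypothetical simulated data with a few artificial missing values.
clear
set seed 12345
set obs 1000
generate sex = 1 + (runiform() < 0.5)    // 1 = male, 2 = female
replace sex = . in 1/5

* The comparison returns 1 or 0; the if qualifier leaves missing sex as missing.
generate d_female = (sex == 2) if !missing(sex)

* Cross-tabulate the original variable against the dummy to check the coding.
tabulate sex d_female, missing

assert d_female == 1 if sex == 2
assert d_female == 0 if sex == 1
assert missing(d_female) if missing(sex)
```

If sex has no missing values, as the note above suggests for this data set, both approaches produce the same dummy.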
To run the regression, type:
regress log_wage age d_female

We will get the following result:


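. regress log_wage age d_female

      Source |       SS           df       MS      Number of obs   =     4,120
-------------+----------------------------------   F(2, 4117)      =     31.57
       Model |  68.0065914         2  34.0032957   Prob > F        =    0.0000
    Residual |  4434.62284     4,117   1.0771491   R-squared       =    0.0151
-------------+----------------------------------   Adj R-squared   =    0.0146
       Total |  4502.62943     4,119  1.09313655   Root MSE        =    1.0379

------------------------------------------------------------------------------
    log_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0068111   .0008812     7.73   0.000     .0050834    .0085388
    d_female |  -.0581914   .0323988    -1.80   0.073    -.1217105    .0053276
       _cons |   6.836737     .03611   189.33   0.000     6.765941    6.907532
------------------------------------------------------------------------------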

From this result, we can infer that the gender coefficient is not significantly different
from 0 at the 5 percent level of significance, as the p-value of the t statistic corresponding
to the d_female regression coefficient (0.073) is greater than 0.05. However, at the 10 percent
level of significance the coefficient is significant, as the same p-value is smaller than 0.10.
It is also worth noting that the estimated regression coefficient corresponding to d_female
is negative. So, considering a 10 percent level of significance, we can infer from the
regression outcome that gender has some impact on the wage earnings of individuals in our
data set. To be more precise, a female worker earns about 6 percent less (β3^ * 100 = -5.8)
than a male on average at any given age.
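
As a small worked check of the percentage interpretation, the sketch below compares the usual approximation, 100 times the coefficient, with the exact transformation 100 * (exp(coefficient) - 1). The coefficient is copied by hand from the reported output; the scalar names are hypothetical.

```stata
* Coefficient on d_female, copied from the reported regression output.
scalar b3 = -.0581914

scalar pct_approx = 100*b3                // log approximation
scalar pct_exact  = 100*(exp(b3) - 1)     // exact percentage difference

display "approximate percent difference: " %6.2f pct_approx
display "exact percent difference:       " %6.2f pct_exact
```

For a coefficient this small the two figures are close (roughly -5.8 versus -5.7), so the log approximation is adequate here. With the data loaded, running nlcom 100*(exp(_b[d_female]) - 1) after the regression would be one way to obtain the exact figure together with a standard error.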

Use of slope dummy:


We can allow for a change in the slope of a regression line by including in the model
an additional explanatory variable that is equal to the product of a dummy variable and a
continuous variable. A slope dummy is actually an interaction variable where a dummy variable
interacts with a continuous variable. In our example, if we plot wage on the Y axis and age on
the X axis, then the slope of the regression line is the additional wage income earned when
the age of the worker increases by 1 year. Now, if we assume that the additional wage obtained
through an increase in the age of the worker varies across gender, then we need to incorporate
an interaction variable (the product of the gender dummy, i.e. d_female, and age) as an
additional explanatory variable in the original regression equation. So, we can write the
model as follows:
log_wage_i = β1 + β2 age_i + β3 (d_female*age)_i + u_i        (4)
The new variable d_female*age is the product of the female dummy and age, and it is called
an interaction variable, as it captures the interaction effect of gender and age on wage.
Alternatively, it is also called a slope dummy variable as it allows for a change in the slope
of the relationship.
In this example, the interaction variable, i.e. d_female*age, takes a value equal to age if the
worker is female, i.e. when d_female = 1, and the interaction variable takes the value equal to 0
if the worker is a male.
So we have:
log_wage_i^ = β1^ + (β2^ + β3^) age_i represents estimated log_wage for female,
and
log_wage_i^ = β1^ + β2^ age_i represents estimated log_wage for male.

Thus, we can say that for females wage increases by approximately (β2^ + β3^)*100 percent
for a one year increase in age, while for males wage increases by approximately β2^*100
percent as age increases by one year. β3^*100 stands for the difference, in percentage points,
between the female and male wage increases due to a one year increase in age, at any given
age. Thus, the slopes of the female and male regression lines are different.
If the assumptions of the regression model hold for Equation (4), then the least
squares estimators have their usual good properties. A test of the hypothesis that wage
increases in the same proportion with age for both males and females is carried out by testing
the null hypothesis H0: β3 = 0 against the alternative H1: β3 ≠ 0. In this male-female wage
case, we might test H0: β3 = 0 against H1: β3 < 0, since we expect the effect to be negative,
i.e. we assume that the increase in wage for a one year increase in age is smaller for females
than for males.
To estimate equation (4) in STATA, we first need to create the slope dummy/interaction
variable. Hence, type:
g d_female_age = d_female*age
Now to run the regression, type:
regress log_wage age d_female_age

We will get the following result:


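. g d_female_age = d_female*age

. regress log_wage age d_female_age

      Source |       SS           df       MS      Number of obs   =     4,120
-------------+----------------------------------   F(2, 4117)      =     32.70
       Model |   70.414296         2   35.207148   Prob > F        =    0.0000
    Residual |  4432.21514     4,117  1.07656428   R-squared       =    0.0156
-------------+----------------------------------   Adj R-squared   =    0.0152
       Total |  4502.62943     4,119  1.09313655   Root MSE        =    1.0376

------------------------------------------------------------------------------
    log_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0077491   .0009664     8.02   0.000     .0058545    .0096437
d_female_age |  -.0020439   .0008744    -2.34   0.019    -.0037582   -.0003297
       _cons |   6.809999   .0326603   208.51   0.000     6.745968    6.874031
------------------------------------------------------------------------------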

Given this regression outcome, it can be deduced that the slope dummy coefficient is
significantly different from 0 at the 5 percent level of significance, as the p-value of the
t statistic corresponding to the d_female_age regression coefficient (0.019) is smaller than
0.05. It is also worth noting that the estimated coefficient is negative. So, at the 5 percent
level of significance, we can infer that the impact of age on wage varies by gender in our
data set. To be more precise, an additional year of age raises wage by about 0.77 percent for
a male worker (β2^ * 100) but by only about 0.57 percent for a female worker
((β2^ + β3^) * 100), i.e. about 0.2 percentage points less, on average at any given age. Even
though the magnitude of this difference is very small, it is significantly different from 0,
implying the presence of some gender effect on wage income.
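
The one-sided test of H0: β3 = 0 against H1: β3 < 0 mentioned earlier can be worked out by hand from the reported coefficient and standard error. The sketch below uses Stata's ttail() function, which returns the upper-tail probability of the t distribution; the scalar names are hypothetical and the numbers are copied from the output above.

```stata
* Values copied from the reported output for d_female_age.
scalar b_slope  = -.0020439
scalar se_slope = .0008744
scalar df_resid = 4117

scalar t_stat  = b_slope/se_slope
* Two-sided p-value, as printed by regress.
scalar p_two   = 2*ttail(df_resid, abs(t_stat))
* One-sided p-value for H1: beta3 < 0, i.e. P(T < t_stat) = P(T > -t_stat).
scalar p_lower = ttail(df_resid, -t_stat)

display "t statistic       = " %6.2f t_stat
display "two-sided p-value = " %6.4f p_two
display "one-sided p-value = " %6.4f p_lower
```

Because the estimate has the sign expected under H1, the one-sided p-value is half the two-sided one, so the slope dummy is significant at the 5 percent level under either alternative.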

If we consider that gender affects both the intercept and the slope, then both effects can
be incorporated into a single model. The resulting regression model is as follows:
log_wage_i = β1 + β2 age_i + β3 d_female_i + β4 (d_female*age)_i + u_i        (5)
So here we have:
log_wage_i^ = (β1^ + β3^) + (β2^ + β4^) age_i represents estimated log_wage for
female, and
log_wage_i^ = β1^ + β2^ age_i represents estimated log_wage for male.

Thus, in this model gender can affect both the intercept (through β3^) and the slope
(through β4^), so the estimated wage difference between males and females is allowed to vary
with age.
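
The manual does not report results for equation (5), so the following is only a sketch of how it might be estimated and tested, using simulated data. The data-generating coefficients and the seed are assumptions; the variable names follow the manual.

```stata
* Hypothetical simulated data; coefficients below are illustrative only.
clear
set seed 12345
set obs 2000
generate sex = 1 + (runiform() < 0.5)        // 1 = male, 2 = female
generate age = 20 + floor(41*runiform())     // ages 20 to 60
generate d_female = (sex == 2)
generate d_female_age = d_female*age
generate log_wage = 6.8 + 0.008*age - 0.05*d_female ///
    - 0.002*d_female_age + rnormal()

* Equation (5): intercept dummy and slope dummy together.
regress log_wage age d_female d_female_age

* Joint test of H0: beta3 = 0 and beta4 = 0 (no gender effect of either kind).
test d_female d_female_age

* Estimated female-male gap in log_wage at age 30: beta3 + 30*beta4.
lincom d_female + 30*d_female_age
```

The joint test matters here because d_female and d_female_age are strongly correlated, so each coefficient can look insignificant on its own even when gender matters overall. Factor-variable notation, regress log_wage c.age##i.d_female, should fit the same model without creating the interaction by hand.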
In all these cases you can generate the predicted values of log_wage immediately after
running the regression in STATA. After running the regression and generating the regression
outcome, type:
predict log_wage_predicted
Now, for example, if you want to figure out the difference in predicted value of log_wage
between male and female who are 30 years old, then first type:
tabulate log_wage_predicted if d_female == 0 & age == 30
We will have:
. tabulate log_wage_predicted if d_female == 0 & age == 30

     Fitted |
     values |      Freq.     Percent        Cum.
------------+-----------------------------------
   7.041071 |        338      100.00      100.00
------------+-----------------------------------
      Total |        338      100.00

Therefore, the predicted log_wage of males aged 30 years is 7.04.


Now, type:
tabulate log_wage_predicted if d_female == 1 & age == 30
We will have:
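. tabulate log_wage_predicted if d_female == 1 & age == 30

     Fitted |
     values |      Freq.     Percent        Cum.
------------+-----------------------------------
   6.982879 |        392      100.00      100.00
------------+-----------------------------------
      Total |        392      100.00

The predicted log_wage of females aged 30 years is 6.98. So the estimated log_wage
difference at age 30 is (7.04 - 6.98) = 0.06. These fitted values correspond to the
intercept-dummy regression (3) reported earlier: the difference equals the absolute value of
the d_female coefficient, 0.058.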

Note: We can calculate this log_wage difference in all three cases, i.e. when gender
has only an intercept effect, only a slope effect, and both slope and intercept effects. Then,
we can compare the results and figure out in which case the gender effect is the strongest.
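
The comparison suggested in this note can be sketched by hand for the two models whose results are reported. The coefficients below are copied from the outputs of regressions (3) and (4); the scalar names are hypothetical. Equation (5) is left out because the manual reports no estimates for it.

```stata
* Intercept-dummy model (3): reported _cons, age and d_female coefficients.
scalar male_30_m3   = 6.836737 + .0068111*30
scalar female_30_m3 = male_30_m3 - .0581914
scalar gap_m3       = female_30_m3 - male_30_m3     // same at every age

* Slope-dummy model (4): the gap is the d_female_age coefficient times age.
scalar gap_m4_30 = -.0020439*30
scalar gap_m4_50 = -.0020439*50

display "model (3): male 30 = " %8.6f male_30_m3 ", female 30 = " %8.6f female_30_m3
display "model (3) gap at any age: " %7.4f gap_m3
display "model (4) gap at age 30:  " %7.4f gap_m4_30
display "model (4) gap at age 50:  " %7.4f gap_m4_50
```

The first two scalars reproduce the tabulated fitted values 7.041071 and 6.982879. Under the slope-dummy model the estimated gap is about 0.061 at age 30 and grows to about 0.102 at age 50, whereas the intercept-dummy model holds it fixed at about 0.058.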
Use of interaction between qualitative variables through intercept dummies:
We may have two or more qualitative variables in the regression model and we may be
interested in the interaction between these qualitative variables as well. For example, let us
incorporate household group as another qualitative variable in our example along with
gender/sex.
In our data, the household group variable (hhd_gr) has four groups: STs (code 1), SCs
(code 2), OBCs (code 3) and Others/General (code 9). But for the present, let us focus on only
two household groups, SCs and Non-SCs, where Non-SCs combines STs, OBCs and Others. Now,
through the interaction between gender/sex and household group, four categories can be
created, namely female SCs, male SCs, female Non-SCs and male Non-SCs. Let us consider male
Non-SCs as the reference group. Then, we need to create three dummy variables for the
remaining three groups to incorporate in our model.
In this case, the regression model can be written as follows:
log_wage_i = β1 + β2 age_i + β3 (d_SC_female)_i + β4 (d_Non_SC_female)_i + β5 (d_SC_male)_i
+ u_i        (6)

Here, d_SC_female takes the value 1 if the individual is a female and belongs to the SC
household group, and 0 for any other observation. d_Non_SC_female takes the value 1 if the
individual is a female and belongs to the Non-SC household group, and 0 for any other
observation. And, d_SC_male takes the value 1 if the individual is a male and belongs to the SC
household group, and 0 for any other observation. All three variables are intercept dummies,
as they change only the intercept of the regression line. These dummies help to figure out
how the impact of gender on wage varies across household groups, holding other factors
constant.
To execute this regression in STATA, first we need to create the dummy variables:
To generate d_SC_female, type:
g d_SC_female = 1 if hhd_gr == 2 & sex == 2
replace d_SC_female = 0 if d_SC_female == .
To generate d_Non_SC_female, type:
g d_Non_SC_female = 1 if hhd_gr == 1 & sex == 2
replace d_Non_SC_female = 1 if hhd_gr == 3 & sex == 2
replace d_Non_SC_female = 1 if hhd_gr == 9 & sex == 2
replace d_Non_SC_female = 0 if d_Non_SC_female == .
To generate d_SC_male, type:
g d_SC_male = 1 if hhd_gr == 2 & sex == 1
replace d_SC_male = 0 if d_SC_male == .
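
As a check on the coding above, the four gender-by-household groups should be mutually exclusive and should cover every observation. The sketch below builds the same dummies in a more compact way with inlist() and verifies this on simulated data. The data, the seed, the helper variable u and the extra variable d_Non_SC_male (the reference group) are assumptions for illustration only.

```stata
* Hypothetical simulated data using the codes described in the manual.
clear
set seed 12345
set obs 1000
generate sex = 1 + (runiform() < 0.5)        // 1 = male, 2 = female
generate u = runiform()
generate hhd_gr = cond(u < .1, 1, cond(u < .3, 2, cond(u < .7, 3, 9)))

* Each logical expression evaluates to 1 when true and 0 otherwise.
generate d_SC_female     = (hhd_gr == 2 & sex == 2)
generate d_Non_SC_female = (inlist(hhd_gr, 1, 3, 9) & sex == 2)
generate d_SC_male       = (hhd_gr == 2 & sex == 1)
generate d_Non_SC_male   = (inlist(hhd_gr, 1, 3, 9) & sex == 1)   // reference group

* Every observation should fall in exactly one of the four groups.
assert d_SC_female + d_Non_SC_female + d_SC_male + d_Non_SC_male == 1
tabulate hhd_gr sex
```

Only the first three dummies enter the regression; d_Non_SC_male is created here purely for the check. If hhd_gr or sex contained missing values, an if !missing(hhd_gr, sex) qualifier would be needed to stop those observations from being coded 0.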

Now, to run the regression type:


regress log_wage age d_SC_female d_Non_SC_female d_SC_male

We will get the following result:

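. regress log_wage age d_SC_female d_Non_SC_female d_SC_male

      Source |       SS           df       MS      Number of obs   =     4,120
-------------+----------------------------------   F(4, 4115)      =     29.82
       Model |  126.832884         4  31.7082209   Prob > F        =    0.0000
    Residual |  4375.79655     4,115  1.06337705   R-squared       =    0.0282
-------------+----------------------------------   Adj R-squared   =    0.0272
       Total |  4502.62943     4,119  1.09313655   Root MSE        =    1.0312

----------------------------------------------------------------------------------
        log_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-----------------+----------------------------------------------------------------
             age |   .0066066   .0008761     7.54   0.000     .0048889    .0083242
     d_SC_female |  -.3412171   .0513009    -6.65   0.000    -.4417945   -.2406397
 d_Non_SC_female |  -.0328997   .0378403    -0.87   0.385    -.1070871    .0412877
       d_SC_male |  -.2244718   .0496077    -4.52   0.000    -.3217298   -.1272138
           _cons |    6.90437    .038649   178.64   0.000     6.828597    6.980143
----------------------------------------------------------------------------------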

Given the above result, the null hypotheses that, for any given age, the wage difference
between SC female and Non-SC male workers is 0 and that the wage difference between SC male
and Non-SC male workers is 0 are both rejected at the 5 percent level of significance, as the
p-values for the coefficients corresponding to d_SC_female and d_SC_male are less than 0.05.
However, the null hypothesis that, for any given age, the wage difference between Non-SC
female and Non-SC male workers is 0 cannot be rejected at the 5 percent level of significance,
as the p-value for the coefficient corresponding to d_Non_SC_female (0.385) is greater than
0.05.
Moreover, it can be deduced from the above result that, for any given age, wage earnings
are approximately 34 percent lower for SC females than for Non-SC male workers. Similarly, we
can infer that SC males receive around 22 percent lower wages than Non-SC males. Since these
coefficients are not small, the exact differences implied by 100 * (exp(β^) - 1) are somewhat
smaller, about 29 percent and 20 percent respectively. However, Non-SC females do not receive
significantly different wages from Non-SC male workers when age is held constant.
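
The regression above compares each group with Non-SC males only. To see how the gender gap itself differs across household groups, the coefficients have to be compared with each other, which needs the estimated covariances and so cannot be done from the printed table alone. The sketch below shows one way to do this with lincom and test on simulated data; the data-generating coefficients, the seed and the helper variable u are assumptions for illustration only.

```stata
* Hypothetical simulated data; coefficients below are illustrative only.
clear
set seed 12345
set obs 2000
generate sex = 1 + (runiform() < 0.5)        // 1 = male, 2 = female
generate age = 20 + floor(41*runiform())     // ages 20 to 60
generate u = runiform()
generate hhd_gr = cond(u < .1, 1, cond(u < .3, 2, cond(u < .7, 3, 9)))
generate d_SC_female     = (hhd_gr == 2 & sex == 2)
generate d_Non_SC_female = (hhd_gr != 2 & sex == 2)
generate d_SC_male       = (hhd_gr == 2 & sex == 1)
generate log_wage = 6.9 + 0.007*age - 0.34*d_SC_female ///
    - 0.03*d_Non_SC_female - 0.22*d_SC_male + rnormal()

regress log_wage age d_SC_female d_Non_SC_female d_SC_male

* Gender gap within the SC group: SC females relative to SC males.
lincom d_SC_female - d_SC_male

* H0: the gender gap is the same in the SC and Non-SC groups.
test d_SC_female - d_SC_male = d_Non_SC_female
```

The gender gap within the Non-SC group is the coefficient on d_Non_SC_female itself, because Non-SC males are the reference group. Rejecting the last hypothesis would indicate that the impact of gender on wage differs between the two household groups.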
