By
TANIMA BANERJEE
UNIVERSITY OF CALCUTTA
When a regression model includes categorical variables, we need to create dummy variables
to incorporate these qualitative/categorical variables. Dummy variables represent the categories
of a qualitative variable. If a qualitative variable has two categories, then we can create two
dummy variables, one for each category. It is important to note that a dummy variable can take
only one of two values, usually 0 and 1. For example, gender/sex is a categorical/qualitative
variable with only two categories: male and female. We can then create two dummy variables,
one for male and another for female. The male dummy takes the value 1 for individuals who are
male and 0 for those who are female, while the opposite is the case for the female dummy.
Because the two dummies always sum to 1, only one of them enters a regression that includes an
intercept; the omitted category serves as the reference group. These simple variables are a very
powerful tool for capturing qualitative characteristics of individuals, such as gender, race, and
geographic region of residence. When using dummy variables in a regression model, some
changes in model interpretation are required. We will discuss how to incorporate dummy
variables in a regression equation, how to run a dummy variable regression in STATA and how
to interpret the results.
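Consider a wage regression with age and a female dummy as explanatory variables:
$$\text{wage}_i = \beta_1 + \beta_2\,\text{age}_i + \beta_3\,\text{d\_female}_i + u_i$$
Here, d_female takes the value 1 for female and 0 for male.
Now, for any given age, the estimated wage for a female is
$$\widehat{\text{wage}}_i = \hat{\beta}_1 + \hat{\beta}_2\,\text{age}_i + \hat{\beta}_3$$
and the estimated wage for a male is
$$\widehat{\text{wage}}_i = \hat{\beta}_1 + \hat{\beta}_2\,\text{age}_i$$
where the hat (written as ^ in the running text, e.g. β3^) stands for the estimated value
obtained after running the regression.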
Thus, the presence of the gender dummy variable causes a change in the intercept of the
regression function for certain observations. If d_female = 1, then the intercept of the regression
function is (β1^ + β3^). On the other hand, for the other observations, i.e. if d_female = 0, the
intercept is only β1^.
The inclusion of a gender dummy variable thus causes a parallel shift in the relationship
between age and wage by the amount β3^. Since females generally earn less than males on
average, we expect β3^ < 0, in which case the wage regression line for females lies below the
male wage regression line. The absolute value of β3^ then stands for the ‘wage premium’ that a
male worker receives over a female worker, irrespective of age, due to being male.
A dummy variable like d_female that is incorporated into a regression model to capture
a shift in the intercept as the result of some qualitative factor, here gender, is an intercept
dummy variable.
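The size of the parallel shift can be checked by subtracting the two fitted lines at the same age. The short derivation below is only a restatement of the two fitted equations given above, with the subscripts "female" and "male" added here for readability:

```latex
$$
\begin{aligned}
\widehat{\text{wage}}_{\text{female}} - \widehat{\text{wage}}_{\text{male}}
  &= (\hat{\beta}_1 + \hat{\beta}_2\,\text{age} + \hat{\beta}_3)
   - (\hat{\beta}_1 + \hat{\beta}_2\,\text{age}) \\
  &= \hat{\beta}_3
\end{aligned}
$$
```

The age terms cancel, which is why the gap is the same at every age and the two lines are parallel.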
The properties of the least squares estimator are not affected by the fact that one of the
explanatory variables consists only of zeros and ones. A dummy variable, here d_female, is
treated as any other explanatory variable. We can construct an interval estimate for β3
(the coefficient of the dummy variable) or we can test the significance of its least squares
estimate β3^. Such a test is a statistical test of whether the gender effect on wage is
“statistically significant” in our example. If β3^ is not significantly different from 0, then there
is no statistical evidence of a wage premium due to gender.
Important Note: In practice, we generally use the log of wage as the dependent variable to
capture the non-linearity of the wage relation. So, we actually estimate the following regression
equation:
$$\text{log\_wage}_i = \beta_1 + \beta_2\,\text{age}_i + \beta_3\,\text{d\_female}_i + u_i \tag{3}$$
Then, each regression coefficient multiplied by 100 represents the approximate percentage
change in the dependent variable (wage) for a one unit change in the explanatory variable. Now
for the dummy variable, d_female, if β3^ < 0, then we can say that females earn approximately
(|β3^| * 100) percent less than males on average, irrespective of age. In other words, if we hold
age (and other explanatory variables, if any, in the regression model) constant at a certain
level, then a female would earn approximately (|β3^| * 100) percent less than a male on average.
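The percentage reading of the dummy coefficient is an approximation that works well when the coefficient is small. The derivation below is a sketch that uses only equation (3); the subscripts "female" and "male" and the value -0.06 are introduced here purely for illustration.

```latex
$$
\begin{aligned}
\log(\text{wage}_{\text{female}}) - \log(\text{wage}_{\text{male}}) &= \beta_3
  \quad \text{(same age)} \\
\frac{\text{wage}_{\text{female}}}{\text{wage}_{\text{male}}} &= e^{\beta_3} \\
\text{percentage difference} &= 100\,(e^{\beta_3} - 1) \;\approx\; 100\,\beta_3
  \quad \text{for small } |\beta_3|
\end{aligned}
$$
```

For instance, with β3 = -0.06 the exact difference is 100(e^{-0.06} - 1) ≈ -5.8 percent, close to the approximate value of -6 percent.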
To run the regression equation of our example in Stata, we first need to create a female
dummy variable. First open the data set named wage.dta. Then, type:
g d_female = 1 if sex == 2
replace d_female = 0 if d_female == .
[In this data set the variable sex has two values: 1 representing male and 2 representing
female. Hence, d_female will take the value 1 when sex takes the value 2]
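One caveat with the two commands above: any observation with a missing value of sex would also be coded 0, i.e. treated as male. The sketch below shows one possible alternative that leaves such observations missing. It uses a small made-up data set rather than wage.dta, and assumes the same coding of sex (1 = male, 2 = female).

```stata
* Toy data for illustration only (not wage.dta): 1 = male, 2 = female, . = missing
clear
input sex
1
2
2
.
end

* The logical expression returns 1/0; the if-qualifier keeps missing sex as missing
generate d_female = (sex == 2) if !missing(sex)

* Cross-check the coding, showing missing values as well
tabulate sex d_female, missing
```

On wage.dta, the single generate line would be run instead of the two commands above, since a variable can only be generated once.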
To run the regression, type:
regress log_wage age d_female
From this result, we can infer that the gender coefficient is not significantly different
from 0 at the 5 percent level of significance, as the p-value of the t statistic corresponding to
the d_female regression coefficient is greater than 0.05. However, at the 10 percent level of
significance, the result is significant, as the same p-value is smaller than 0.10. It is also
interesting to note that the estimated regression coefficient corresponding to d_female is
negative. So, considering a 10 percent level of significance, we can infer from the regression
outcome that gender has some impact on the wage earnings of individuals in our data set. To be
more precise, a female worker earns about 6 percent less than a male on average at any given
age.
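The percentage figure can also be computed directly from the stored estimates instead of being read off the output table. A minimal sketch, assuming wage.dta is loaded and d_female has been created as described above:

```stata
* Re-run the regression so that the coefficient estimates are in memory
regress log_wage age d_female

* Approximate percentage gap: 100 times the dummy coefficient
display "Approximate gap (%): " 100*_b[d_female]

* Exact percentage gap implied by the log specification
display "Exact gap (%): " 100*(exp(_b[d_female]) - 1)
```

For a coefficient of this size the two figures are close; they diverge as the coefficient grows in absolute value.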
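Gender may instead affect the slope of the relationship between age and wage rather than the
intercept. In that case we include the interaction of d_female and age, a slope dummy variable,
in place of the intercept dummy:
$$\text{log\_wage}_i = \beta_1 + \beta_2\,\text{age}_i + \beta_3\,(\text{d\_female} \times \text{age})_i + u_i \tag{4}$$
Thus, we can say that for a female the wage increases by approximately (β2^ + β3^)*100
percent due to a one year increase in age, while for a male the wage increases by approximately
β2^*100 percent as age increases by one year. (β3^)*100 stands for the difference, in percentage
points, between the wage increase of females and that of males due to a one year increase in age,
at any given age. Thus, the slopes of the female and male regression lines are different.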
If the assumptions of the regression model hold for equation (4), then the least
squares estimators have their usual good properties. A test of the hypothesis that wage increases
at the same rate with age for both males and females is carried out by testing the null hypothesis
H0: β3 = 0 against the alternative H1: β3 ≠ 0. In this male-female wage case, we might instead
test H0: β3 = 0 against H1: β3 < 0, since we expect the increase in wage for a one year increase
in age to be smaller for females than for males.
To estimate equation (4) in STATA, we first need to create the slope dummy/interaction
variable. Hence, type:
g d_female_age = d_female*age
Now to run the regression, type:
regress log_wage age d_female_age
Given this regression outcome, it can be deduced that the slope dummy coefficient is
significantly different from 0 at the 5 percent level of significance, as the p-value of the t
statistic corresponding to the d_female_age regression coefficient is smaller than 0.05. It is also
interesting to note that the estimated regression coefficient corresponding to d_female_age is
negative. So, considering a 5 percent level of significance, we can infer that gender causes the
impact of age on wage to vary in our data set. To be more precise, an additional year of age
raises the wage of a female worker by about 0.02 percentage points less than that of a male
worker on average, at any given age. Even though the magnitude of this difference is very
small, it is significantly different from 0, implying the presence of some gender effect on
wage income.
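The regression table reports two-sided p-values, while the alternative H1: β3 < 0 mentioned earlier is one-sided. The following sketch shows one way to obtain the one-sided p-value; it assumes the slope dummy regression has just been run, and the scalar name t_stat is chosen here for illustration.

```stata
* Assumes wage.dta is loaded and d_female_age has been created as in the text
regress log_wage age d_female_age

* t statistic of the slope dummy coefficient
scalar t_stat = _b[d_female_age] / _se[d_female_age]

* ttail(df, t) returns P(T > t); by symmetry P(T < t_stat) = ttail(df, -t_stat)
display "One-sided p-value for H1: beta3 < 0: " ttail(e(df_r), -t_stat)
```

When the estimate is negative, this one-sided p-value is half of the two-sided p-value shown in the regression table.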
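If we consider that gender affects both the intercept and the slope, then both effects can
be incorporated into a single model. The resulting regression model is as follows:
$$\text{log\_wage}_i = \beta_1 + \beta_2\,\text{age}_i + \beta_3\,\text{d\_female}_i + \beta_4\,(\text{d\_female} \times \text{age})_i + u_i \tag{5}$$
So here the estimated log_wage for a female is
$$\widehat{\text{log\_wage}}_i = (\hat{\beta}_1 + \hat{\beta}_3) + (\hat{\beta}_2 + \hat{\beta}_4)\,\text{age}_i$$
and the estimated log_wage for a male is
$$\widehat{\text{log\_wage}}_i = \hat{\beta}_1 + \hat{\beta}_2\,\text{age}_i$$
Thus, gender now shifts both the intercept (by β3^) and the slope (by β4^), so the estimated
female-male gap in log_wage at a given age is β3^ + β4^*age rather than a single constant.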
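Equation (5) can be estimated with the variables already created for the two earlier models. The sketch below assumes wage.dta is loaded and that d_female and d_female_age exist as generated above; the joint test and the factor-variable version are optional additions, not steps taken from the original text.

```stata
* Intercept dummy and slope dummy in the same regression (equation 5)
regress log_wage age d_female d_female_age

* Joint test that gender has no effect at all: H0: beta3 = 0 and beta4 = 0
test d_female d_female_age

* Equivalent specification using Stata's factor-variable notation
regress log_wage c.age##i.d_female
```

The factor-variable form builds the interaction internally, so the coefficient on 1.d_female#c.age corresponds to the coefficient on d_female_age in the first regression.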
In all these cases you can generate the predicted values of log_wage immediately after
running the regression in STATA. After running the regression and generating the regression
outcome, type:
predict log_wage_predicted
Now, for example, if you want to figure out the difference in predicted value of log_wage
between male and female who are 30 years old, then first type:
tabulate log_wage_predicted if d_female == 0 & age == 30
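We will have a one-way frequency table of the fitted values (columns: Fitted values, Freq.,
Percent, Cum.) for males aged 30 [output table not reproduced]. Repeating the command with
d_female == 1 gives the corresponding table for females aged 30.
Males aged 30 years have a predicted log_wage of 7.04, while females aged 30 years are getting
a log_wage of 6.98 on average. So the estimated log_wage difference is (7.04-6.98) = 0.06.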
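The same difference can be computed without reading values off two tables. A minimal sketch, assuming log_wage_predicted has already been created with predict as described above; the scalar names male30 and female30 are hypothetical.

```stata
* Mean fitted value for males aged 30
summarize log_wage_predicted if d_female == 0 & age == 30
scalar male30 = r(mean)

* Mean fitted value for females aged 30
summarize log_wage_predicted if d_female == 1 & age == 30
scalar female30 = r(mean)

* Difference in predicted log_wage (male minus female)
display "Predicted log_wage gap at age 30: " male30 - female30
```

If no observation in the data is exactly 30 years old, summarize returns no mean and the scalars will be missing.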
Note: We can calculate this log_wage difference in all three cases, i.e. when gender has
only an intercept effect, only a slope effect, and both slope and intercept effects. Then, we
can compare the results and figure out when the gender effect is the strongest.
Use of interaction between qualitative variables through intercept dummies:
We may have more than one qualitative variable in the regression model and we may be
interested in the interaction between these qualitative variables as well. For example, let us
incorporate household group as another qualitative variable in our example along with
gender/sex.
In our data, the household group variable (hhd_gr) has four groups: STs (code 1), SCs (code
2), OBCs (code 3) and Others/General (code 9). But for the present, let us focus on only two
household groups, SCs and Non-SCs, where Non-SCs combines STs, OBCs and Others. Now,
through the interaction between gender/sex and household group, four categories can be created,
namely female SCs, male SCs, female Non-SCs and male Non-SCs. Let us consider male
Non-SCs as the reference group. Then, we need to create three dummy variables to incorporate
in our model for the remaining three groups.
In this case, the regression model can be written as follows:
$$\text{log\_wage}_i = \beta_1 + \beta_2\,\text{age}_i + \beta_3\,(\text{d\_SC\_female})_i + \beta_4\,(\text{d\_Non\_SC\_female})_i + \beta_5\,(\text{d\_SC\_male})_i + u_i \tag{6}$$
Here, d_SC_female takes the value 1 if the individual is a female and belongs to the SC
household group, and 0 for any other observation. d_Non_SC_female takes the value 1 if the
individual is a female and belongs to the Non-SC household group, and 0 for any other
observation. And, d_SC_male takes the value 1 if the individual is a male and belongs to the SC
household group, and 0 for any other observation. All three of these variables are intercept
dummies, as they change only the intercept of the regression line. These dummies help to figure
out how the impact of gender on wage varies across household groups, holding other factors
constant.
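To make the role of the reference group explicit, the intercept implied by equation (6) for each of the four categories can be written out. In this sketch the coefficients on d_SC_female, d_Non_SC_female and d_SC_male are labelled β3, β4 and β5, in the order in which the dummies appear in the equation.

```latex
$$
\begin{aligned}
\text{Non-SC male (reference):} \quad & \beta_1 \\
\text{SC female:} \quad & \beta_1 + \beta_3 \\
\text{Non-SC female:} \quad & \beta_1 + \beta_4 \\
\text{SC male:} \quad & \beta_1 + \beta_5
\end{aligned}
$$
```

Each dummy coefficient therefore measures the difference in log_wage between its group and Non-SC males at the same age.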
To execute this regression in STATA, first we need to create the dummy variables:
To generate d_SC_female, type:
g d_SC_female = 1 if hhd_gr == 2 & sex == 2
replace d_SC_female = 0 if d_SC_female == .
To generate d_Non_SC_female, type:
g d_Non_SC_female = 1 if hhd_gr == 1 & sex == 2
replace d_Non_SC_female = 1 if hhd_gr == 3 & sex == 2
replace d_Non_SC_female = 1 if hhd_gr == 9 & sex == 2
replace d_Non_SC_female = 0 if d_Non_SC_female == .
To generate d_SC_male, type:
g d_SC_male = 1 if hhd_gr == 2 & sex == 1
replace d_SC_male = 0 if d_SC_male == .
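With the three dummies in place, equation (6) can be estimated in the same way as the earlier models. The sketch below assumes wage.dta is loaded and the dummies have been created as above; the lincom line is an optional extra comparison, not a step taken from the original text.

```stata
* Reference group: Non-SC males (all three dummies equal 0)
regress log_wage age d_SC_female d_Non_SC_female d_SC_male

* Optional: is the SC vs Non-SC gap different for females and males?
* (SC female - Non-SC female) minus (SC male - Non-SC male)
lincom d_SC_female - d_Non_SC_female - d_SC_male
```

A lincom estimate that is not significantly different from 0 would suggest that the SC wage gap is similar for both genders.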
Given the above result, the null hypotheses that, for any given age, the wage difference
between SC female and Non-SC male workers is 0 and that the difference between SC male and
Non-SC male workers is 0 are both rejected at the 5 percent level of significance, as the
p-values for the coefficients corresponding to d_SC_female and d_SC_male are less than 0.05.
However, the null hypothesis that, for any given age, the wage difference between Non-SC
female and Non-SC male workers is 0 cannot be rejected at the 5 percent level of significance,
as the p-value for the coefficient corresponding to d_Non_SC_female is greater than 0.05.
Moreover, it can be deduced from the above result that, for any given age, wage earnings
are approximately 34 percent lower for SC females than for Non-SC male workers.
Similarly, we can infer that SC males receive around 22 percent lower wages than Non-SC
males. However, Non-SC females do not receive significantly different wages from Non-SC
male workers, holding age constant.
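For gaps of this size, the approximation of 100 times the coefficient is noticeably less accurate than for small coefficients. The calculation below is a sketch that assumes the quoted figures correspond to coefficients of about -0.34 and -0.22 (an assumption, since the output table is not shown), and converts them using 100(e^β - 1).

```latex
$$
\begin{aligned}
\text{SC female vs Non-SC male:} \quad & 100\,(e^{-0.34} - 1) \approx -28.8 \text{ percent} \\
\text{SC male vs Non-SC male:} \quad & 100\,(e^{-0.22} - 1) \approx -19.7 \text{ percent}
\end{aligned}
$$
```

Under that assumption, the exact percentage gaps are somewhat smaller in magnitude than the approximate figures of 34 and 22 percent.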