
dummyvariables

October 6, 2021

0.1 Dummy variables


We will illustrate the use of dummy variables with an application in Stata, which has its own
notation for dealing with dummy variables.
We start by reading a dataset.

[1]: use housing2, clear

(Sample of single-family houses sold in CHS, SC 2003-2007)


These data contain records for over 5,000 individual house sales, with information on the
characteristics of each house.

[2]: describe

Contains data from housing2.dta


obs: 5,065 Sample of single-family houses sold in CHS, SC 2003-2007
vars: 17 1 Oct 2021 10:56
--------------------------------------------------------------------------------
storage display value
variable name type format label variable label
--------------------------------------------------------------------------------
bedrooms int %8.0g Number of Bedrooms
baths_full int %8.0g Number of Full Baths
baths_half int %8.0g Number of Half Baths
price long %12.0g Price sold
heatsqft long %12.0g Total heated square footage
year_built int %8.0g Year of construction
built6080 float %9.0g Built between 1961-1980
built8000 float %9.0g Built between 1981-2000
built01plus float %9.0g Built after 2000
brickdummy float %9.0g Brick house
story2 float %9.0g 2 Stories
story3 float %9.0g 3 Stories
storyhalf float %9.0g 1.5 Stories
year float %9.0g Year sold
antebellum float %9.0g Built Before 1861
month float %9.0g Month sold

unemp float %8.0g Monthly unemployment rate in %
--------------------------------------------------------------------------------
Sorted by:
Next we compute summary statistics.

[3]: summarize

Variable | Obs Mean Std. Dev. Min Max


-------------+---------------------------------------------------------
bedrooms | 5,065 3.408292 .6935228 1 6
baths_full | 5,065 2.127345 .6483761 1 6
baths_half | 5,065 .4805528 .5182938 0 3
price | 5,065 296301.6 327033.6 50000 5250000
heatsqft | 5,065 1981.833 732.0524 609 6915
-------------+---------------------------------------------------------
year_built | 5,065 1988.923 25.48778 1588 2007
built6080 | 5,065 .1443238 .3514524 0 1
built8000 | 5,065 .3119447 .4633331 0 1
built01plus | 5,065 .4471866 .497252 0 1
brickdummy | 5,065 .3026654 .459457 0 1
-------------+---------------------------------------------------------
story2 | 5,065 .4420533 .4966799 0 1
story3 | 5,065 .0234946 .151483 0 1
storyhalf | 5,065 .0221125 .1470641 0 1
year | 5,065 2005.005 1.325774 2003 2007
antebellum | 5,065 .0071076 .0840147 0 1
-------------+---------------------------------------------------------
month | 5,065 6.595064 3.270593 1 12
unemp | 5,065 5.143139 .5423721 4.4 6.3
We start by regressing price on a constant only.

[4]: regress price

Source | SS df MS Number of obs = 5,065


-------------+---------------------------------- F(0, 5064) = 0.00
Model | 0 0 . Prob > F = .
Residual | 5.4160e+14 5,064 1.0695e+11 R-squared = 0.0000
-------------+---------------------------------- Adj R-squared = 0.0000
Total | 5.4160e+14 5,064 1.0695e+11 Root MSE = 3.3e+05

------------------------------------------------------------------------------
price | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_cons | 296301.6 4595.181 64.48 0.000 287293.1 305310.1
------------------------------------------------------------------------------

Two things are worth mentioning: the estimate of β1 is the sample mean of price, and the R2 is
zero.
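Both claims can be checked directly. The lines below are a minimal sketch that assumes housing2 is still in memory; the scalar name ybar is chosen here for illustration and is not part of the original session.

```stata
* Assumes housing2 is still in memory from the first cell.
* ybar is a hypothetical scalar name used only in this sketch.
quietly summarize price
scalar ybar = r(mean)

* Constant-only regression: the intercept should equal the sample mean.
quietly regress price
display "mean of price  = " ybar
display "_cons estimate = " _b[_cons]
display "R-squared      = " e(r2)
```

With no regressors the fitted value is the same for every house, so the residual sum of squares equals the total sum of squares and the R2 is zero by construction.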
Next we will run several different specifications with the sole purpose of illustrating how to handle
dummies in Stata. We start by adding a dummy variable that takes the value 1 if the house was
built after 2000. First we tabulate the variable:

[5]: tab built01plus

Built after |
2000 | Freq. Percent Cum.
------------+-----------------------------------
0 | 2,800 55.28 55.28
1 | 2,265 44.72 100.00
------------+-----------------------------------
Total | 5,065 100.00
and now we run the regression

[6]: regress price built01plus

Source | SS df MS Number of obs = 5,065


-------------+---------------------------------- F(1, 5063) = 3.59
Model | 3.8338e+11 1 3.8338e+11 Prob > F = 0.0583
Residual | 5.4122e+14 5,063 1.0690e+11 R-squared = 0.0007
-------------+---------------------------------- Adj R-squared = 0.0005
Total | 5.4160e+14 5,064 1.0695e+11 Root MSE = 3.3e+05

------------------------------------------------------------------------------
price | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
built01plus | -17498.2 9239.704 -1.89 0.058 -35612.02 615.6139
_cons | 304126.6 6178.776 49.22 0.000 292013.5 316239.6
------------------------------------------------------------------------------
The estimate of β1 is the average price of a house built before 2001, while the estimate
of β2 is the difference between the average prices of the two groups (built after 2000 minus built before 2001).
We can easily confirm this:

[7]: tabstat price, by(built01plus) stat(mean)

Summary for variables: price


by categories of: built01plus (Built after 2000)

built01plus | mean
------------+----------
0 | 304126.6

1 | 286628.4
------------+----------
Total | 296301.6
-----------------------
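The same two means can be recovered from the regression itself. This is a small sketch, assuming housing2 is in memory: the intercept is the mean for built01plus equal to 0, and the intercept plus the slope (304126.6 - 17498.2 = 286628.4) is the mean for built01plus equal to 1.

```stata
* Assumes housing2 is in memory.
quietly regress price built01plus

* Mean price when built01plus == 0 is the intercept.
display "mean, built01plus == 0: " _b[_cons]

* Mean price when built01plus == 1 is intercept plus slope.
* lincom also reports a standard error and confidence interval.
lincom _cons + built01plus
```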
However, given the way we introduced the variable in the model, Stata does not “know” that built01plus
is a dummy variable and treats it like any other variable. This is not the recommended
approach. We could instead have told Stata to create a dummy variable using the “i.” notation.
This has several advantages that will become obvious as we proceed.

[8]: regress price i.built01plus

Source | SS df MS Number of obs = 5,065


-------------+---------------------------------- F(1, 5063) = 3.59
Model | 3.8338e+11 1 3.8338e+11 Prob > F = 0.0583
Residual | 5.4122e+14 5,063 1.0690e+11 R-squared = 0.0007
-------------+---------------------------------- Adj R-squared = 0.0005
Total | 5.4160e+14 5,064 1.0695e+11 Root MSE = 3.3e+05

-------------------------------------------------------------------------------
price | Coef. Std. Err. t P>|t| [95% Conf. Interval]
--------------+----------------------------------------------------------------
1.built01plus | -17498.2 9239.704 -1.89 0.058 -35612.02 615.6139
_cons | 304126.6 6178.776 49.22 0.000 292013.5 316239.6
-------------------------------------------------------------------------------
The regression results are the same, but what happened is that Stata
now created a dummy variable (1.built01plus) that equals 1 when built01plus is 1 and 0
otherwise. Of course, 1.built01plus and built01plus are identical, so the results of the two regressions
are also identical. What if we wanted to define the dummy the other way around, taking the value
1 for houses built before 2001 and 0 otherwise?

[9]: regress price i0.built01plus

Source | SS df MS Number of obs = 5,065


-------------+---------------------------------- F(1, 5063) = 3.59
Model | 3.8338e+11 1 3.8338e+11 Prob > F = 0.0583
Residual | 5.4122e+14 5,063 1.0690e+11 R-squared = 0.0007
-------------+---------------------------------- Adj R-squared = 0.0005
Total | 5.4160e+14 5,064 1.0695e+11 Root MSE = 3.3e+05

-------------------------------------------------------------------------------
price | Coef. Std. Err. t P>|t| [95% Conf. Interval]
--------------+----------------------------------------------------------------
0.built01plus | 17498.2 9239.704 1.89 0.058 -615.6139 35612.02
_cons | 286628.4 6869.852 41.72 0.000 273160.5 300096.2
-------------------------------------------------------------------------------
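Two related commands may help here. This is a sketch assuming housing2 is in memory. For a two-level variable, setting the base level to 1 with ib1. should give the same regression as i0. above. After a fit that uses factor-variable notation, margins reports the predicted mean price at each level, so neither parameterization needs to be worked out by hand.

```stata
* Assumes housing2 is in memory.
* Base level set to 1, so the reported dummy is 0.built01plus,
* as with i0.built01plus above.
regress price ib1.built01plus

* After a factor-variable fit, margins gives the predicted mean
* price at each level of built01plus.
quietly regress price i.built01plus
margins built01plus
```

The two margins should match the group means reported by tabstat earlier (304126.6 and 286628.4).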

We should note that the “i.” dummies are created on the fly but not saved with the data. This is particularly
handy because we don’t need to store extra variables in the dataset. Take the variable bedrooms (Stata accepts
the unambiguous abbreviation bedroom used in some commands below). Suppose that we want
to run a regression that includes a dummy variable for each number of bedrooms. With the “i.”
construct this is very easy:

[10]: regress price i.bedroom

Source | SS df MS Number of obs = 5,065


-------------+---------------------------------- F(5, 5059) = 183.33
Model | 8.3081e+13 5 1.6616e+13 Prob > F = 0.0000
Residual | 4.5852e+14 5,059 9.0634e+10 R-squared = 0.1534
-------------+---------------------------------- Adj R-squared = 0.1526
Total | 5.4160e+14 5,064 1.0695e+11 Root MSE = 3.0e+05

------------------------------------------------------------------------------
price | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
bedrooms |
2 | 33764.27 103020.3 0.33 0.743 -168200.1 235728.7
3 | -2861.138 100501.1 -0.03 0.977 -199886.8 194164.6
4 | 124921.5 100649.7 1.24 0.215 -72395.51 322238.5
5 | 385224.4 101719.1 3.79 0.000 185810.8 584638
6 | 1099137 115357.7 9.53 0.000 872986 1325288
|
_cons | 228555.6 100351.7 2.28 0.023 31822.79 425288.3
------------------------------------------------------------------------------
Stata created 5 dummy variables, and the omitted (base) category was “1 bedroom”. But we can
control this. For example, if we wanted 3 bedrooms to be the base category we could do:

[11]: regress price ib3.bedroom

Source | SS df MS Number of obs = 5,065


-------------+---------------------------------- F(5, 5059) = 183.33
Model | 8.3081e+13 5 1.6616e+13 Prob > F = 0.0000
Residual | 4.5852e+14 5,059 9.0634e+10 R-squared = 0.1534
-------------+---------------------------------- Adj R-squared = 0.1526
Total | 5.4160e+14 5,064 1.0695e+11 Root MSE = 3.0e+05

------------------------------------------------------------------------------
price | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
bedrooms |
1 | 2861.138 100501.1 0.03 0.977 -194164.6 199886.8
2 | 36625.41 23931.79 1.53 0.126 -10291.27 83542.09
4 | 127782.7 9482.348 13.48 0.000 109193.1 146372.2
5 | 388085.5 17502.43 22.17 0.000 353773.2 422397.9

6 | 1101998 57157.2 19.28 0.000 989945.3 1214051
|
_cons | 225694.4 5478.258 41.20 0.000 214954.7 236434.2
------------------------------------------------------------------------------
and now the omitted category is “3 bedrooms”. We could have typed “ib(freq).bedrooms” to use
the most frequent category, or “ib(first).bedrooms” or “ib(last).bedrooms” to use the first or
last category as the omitted one. Another possibility is to not use a base category:

[12]: regress price ibn.bedroom, nocons

Source | SS df MS Number of obs = 5,065


-------------+---------------------------------- F(6, 5059) = 970.50
Model | 5.2776e+14 6 8.7960e+13 Prob > F = 0.0000
Residual | 4.5852e+14 5,059 9.0634e+10 R-squared = 0.5351
-------------+---------------------------------- Adj R-squared = 0.5346
Total | 9.8628e+14 5,065 1.9472e+11 Root MSE = 3.0e+05

------------------------------------------------------------------------------
price | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
bedrooms |
1 | 228555.6 100351.7 2.28 0.023 31822.79 425288.3
2 | 262319.8 23296.34 11.26 0.000 216648.9 307990.7
3 | 225694.4 5478.258 41.20 0.000 214954.7 236434.2
4 | 353477.1 7739.742 45.67 0.000 338303.8 368650.3
5 | 613779.9 16622.99 36.92 0.000 581191.7 646368.2
6 | 1327693 56894.06 23.34 0.000 1216156 1439230
------------------------------------------------------------------------------
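but now we had better drop the constant, or we will have a perfect multicollinearity problem. Note that
without a constant Stata reports an uncentered R2 (0.5351 here), which is not comparable with the 0.1534 above. Actually,
if you didn’t use the “nocons” option in regress, Stata would simply drop one variable and proceed
with estimation. Try it for yourself!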
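Taking up that suggestion, the sketch below (assuming housing2 is in memory) first lists the mean price by number of bedrooms, which should match the six coefficients of the “nocons” regression above. It then reruns the regression with the constant kept, so you can see which level Stata omits.

```stata
* Assumes housing2 is in memory.
* With ibn. and nocons each coefficient is a cell mean:
* compare these means with the six coefficients reported above.
tabstat price, by(bedrooms) stat(mean n)

* Keeping the constant: Stata should flag the collinearity and
* omit one of the six levels instead of failing.
regress price ibn.bedrooms
```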
Finally, what if we want to include a dummy only for “6 bedrooms”? It is very easy to create individual
dummies based on a categorical variable.

[13]: regress price i6.bedrooms

Source | SS df MS Number of obs = 5,065


-------------+---------------------------------- F(1, 5063) = 296.38
Model | 2.9951e+13 1 2.9951e+13 Prob > F = 0.0000
Residual | 5.1165e+14 5,063 1.0106e+11 R-squared = 0.0553
-------------+---------------------------------- Adj R-squared = 0.0551
Total | 5.4160e+14 5,064 1.0695e+11 Root MSE = 3.2e+05

------------------------------------------------------------------------------
price | Coef. Std. Err. t P>|t| [95% Conf. Interval]

-------------+----------------------------------------------------------------
6.bedrooms | 1037124 60243.01 17.22 0.000 919022 1155227
_cons | 290568.2 4479.154 64.87 0.000 281787.2 299349.3
------------------------------------------------------------------------------
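For comparison, this is roughly what the “i.” notation saves us from doing by hand. The sketch assumes housing2 is in memory; the variable names bed6 and bed_1 to bed_6 are hypothetical and chosen only for this illustration. Unlike the “i.” dummies, these variables are stored in the dataset.

```stata
* Assumes housing2 is in memory; bed6 and bed_1-bed_6 are
* hypothetical names chosen for this illustration.
generate byte bed6 = (bedrooms == 6) if !missing(bedrooms)

* Should reproduce the i6.bedrooms regression above.
regress price bed6

* A full set of stored dummies, one per observed level of bedrooms.
tabulate bedrooms, generate(bed_)
describe bed_*
```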

0.2 Be careful with dummies!!!


Stata automatically drops collinear variables. By default it drops one category for each set of dummy variables,
but if collinearity persists then it will arbitrarily set one more coefficient to zero and report the regression
results. To understand how this works, let us construct a variable that changes only by year. For example,
suppose that instead of the monthly unemployment rate we use the annual unemployment rate (which,
for simplicity, we take to be the average of unemp over the observations in each year).

[14]: bys year: egen ur=mean(unemp)
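Before using ur, it is worth confirming that it really takes a single value per year, since that is what drives everything that follows. A minimal check, assuming ur was created with the egen command above:

```stata
* Assumes ur was created with the egen command above.
* Within each year the standard deviation of ur should be 0.
tabstat ur, by(year) stat(mean sd n)

* Stops with an error if ur varies within any year.
bysort year (ur): assert ur[1] == ur[_N]
```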

Now, let us run a regression that includes as regressors ur and the year dummies:

[15]: regress price ur i.year

note: 2007.year omitted because of collinearity

Source | SS df MS Number of obs = 5,065


-------------+---------------------------------- F(4, 5060) = 13.67
Model | 5.7892e+12 4 1.4473e+12 Prob > F = 0.0000
Residual | 5.3581e+14 5,060 1.0589e+11 R-squared = 0.0107
-------------+---------------------------------- Adj R-squared = 0.0099
Total | 5.4160e+14 5,064 1.0695e+11 Root MSE = 3.3e+05

------------------------------------------------------------------------------
price | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
ur | -69836.53 11115.09 -6.28 0.000 -91626.92 -48046.15
|
year |
2004 | 10907.82 13070.23 0.83 0.404 -14715.49 36531.13
2005 | 16965.74 12524.12 1.35 0.176 -7586.963 41518.44
2006 | -15907.05 14783.37 -1.08 0.282 -44888.86 13074.76
2007 | 0 (omitted)
|
_cons | 652627.8 59555.26 10.96 0.000 535873.7 769381.9
------------------------------------------------------------------------------
To avoid perfect collinearity Stata dropped the dummy variable for the year 2003 (the base year)
but, because we included ur, the collinearity persisted. Thus, Stata arbitrarily dropped the dummy
for 2007 and reported a coefficient for ur. But notice what happens to the coefficient on ur if, instead
of 2003, we use 2004 as the base year.

[16]: regress price ur ib2004.year

note: 2007.year omitted because of collinearity

Source | SS df MS Number of obs = 5,065
-------------+---------------------------------- F(4, 5060) = 13.67
Model | 5.7892e+12 4 1.4473e+12 Prob > F = 0.0000
Residual | 5.3581e+14 5,060 1.0589e+11 R-squared = 0.0107
-------------+---------------------------------- Adj R-squared = 0.0099
Total | 5.4160e+14 5,064 1.0695e+11 Root MSE = 3.3e+05

------------------------------------------------------------------------------
price | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
ur | -58344.01 15943.72 -3.66 0.000 -89600.61 -27087.41
|
year |
2003 | -16384.97 19633.19 -0.83 0.404 -54874.53 22104.59
2005 | 11519.96 11987.9 0.96 0.337 -11981.5 35021.43
2006 | -16165.42 14717.29 -1.10 0.272 -45017.68 12686.85
2007 | 0 (omitted)
|
_cons | 599901.7 81904.01 7.32 0.000 439334.4 760469
------------------------------------------------------------------------------
or 2005,

[17]: regress price ur ib2005.year

note: 2007.year omitted because of collinearity

Source | SS df MS Number of obs = 5,065


-------------+---------------------------------- F(4, 5060) = 13.67
Model | 5.7892e+12 4 1.4473e+12 Prob > F = 0.0000
Residual | 5.3581e+14 5,060 1.0589e+11 R-squared = 0.0107
-------------+---------------------------------- Adj R-squared = 0.0099
Total | 5.4160e+14 5,064 1.0695e+11 Root MSE = 3.3e+05

------------------------------------------------------------------------------
price | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
ur | -34032.78 30851.31 -1.10 0.270 -94514.7 26449.15
|
year |
2003 | -51045.66 37681.95 -1.35 0.176 -124918.6 22827.27
2004 | -23074.35 24011.61 -0.96 0.337 -70147.5 23998.8
2006 | -16711.96 14522.53 -1.15 0.250 -45182.41 11758.5
2007 | 0 (omitted)
|
_cons | 488365.2 150418.7 3.25 0.001 193479.3 783251
------------------------------------------------------------------------------
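The sums of squares, R2 and F statistic are the same in all three regressions, which suggests that they are the same fit written in different ways. The sketch below checks this by comparing fitted values; it assumes housing2 is in memory and ur exists, and the names fit_b2003, fit_b2005, fit_yearonly, gap1 and gap2 are hypothetical.

```stata
* Assumes housing2 is in memory and ur was created as above.
* All new variable names here are hypothetical.
quietly regress price ur ib2003.year
predict double fit_b2003, xb
quietly regress price ur ib2005.year
predict double fit_b2005, xb
quietly regress price i.year
predict double fit_yearonly, xb

* Absolute gaps between the three sets of fitted values;
* they should be zero up to rounding error.
generate double gap1 = abs(fit_b2003 - fit_b2005)
generate double gap2 = abs(fit_b2003 - fit_yearonly)
summarize gap1 gap2
```

If the gaps are negligible, adding ur to the year dummies contributes nothing to the fit: every specification simply predicts the average price of the year in which the house was sold.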

In this case the coefficient on ur is meaningless and cannot be interpreted: it changes with the
arbitrary choice of base year. By the way, what happens if we change the order of the regressors?

[18]: regress price ib2005.year ur

note: ur omitted because of collinearity

Source | SS df MS Number of obs = 5,065


-------------+---------------------------------- F(4, 5060) = 13.67
Model | 5.7892e+12 4 1.4473e+12 Prob > F = 0.0000
Residual | 5.3581e+14 5,060 1.0589e+11 R-squared = 0.0107
-------------+---------------------------------- Adj R-squared = 0.0099
Total | 5.4160e+14 5,064 1.0695e+11 Root MSE = 3.3e+05

------------------------------------------------------------------------------
price | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
year |
2003 | -83439.87 14501.43 -5.75 0.000 -111869 -55010.79
2004 | -39249.08 13717.24 -2.86 0.004 -66140.82 -12357.34
2006 | -1350.491 13478.43 -0.10 0.920 -27774.05 25073.06
2007 | 16126.55 14619 1.10 0.270 -12533.02 44786.13
|
ur | 0 (omitted)
_cons | 316101 9297.417 34.00 0.000 297874 334327.9
------------------------------------------------------------------------------
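The pattern across regressions [15] to [18] can be made explicit with a short derivation. This is a sketch in notation that does not appear in the original: D_{ti} is the dummy for sale year t, \bar{u}_t is the value of ur in year t, and \bar{p}_t is the average price in year t. It also assumes that \bar{u}_b and \bar{u}_o differ.

```latex
% The constant and ur are both linear combinations of the five year dummies:
\[
  1 = \sum_{t=2003}^{2007} D_{ti}, \qquad
  \mathit{ur}_i = \sum_{t=2003}^{2007} \bar{u}_t \, D_{ti}.
\]
% Hence {1, ur, D_2003, ..., D_2007} spans only a 5-dimensional space:
% at most five coefficients are identified, and the fitted value for a
% house sold in year t is always \bar{p}_t.
%
% Let b be the base year and o the year omitted for collinearity.
% Neither has a dummy, so with intercept \alpha and ur coefficient \gamma:
\[
  \alpha + \gamma\,\bar{u}_b = \bar{p}_b, \qquad
  \alpha + \gamma\,\bar{u}_o = \bar{p}_o
  \quad\Longrightarrow\quad
  \hat{\gamma} = \frac{\bar{p}_o - \bar{p}_b}{\bar{u}_o - \bar{u}_b}.
\]
```

So the reported coefficient on ur is just the slope of the line through two years' (ur, average price) points, and it moves whenever the base year changes. When ur is listed last, as in [18], Stata omits ur instead, and the constant plus four year dummies reproduce the same five yearly means. That is why the sums of squares, R2 and F statistic are identical in regressions [15] to [18].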
