Dummy variables
October 6, 2021
[2]: describe
unemp float %8.0g Monthly unemployment rate in %
--------------------------------------------------------------------------------
Sorted by:
Next we compute summary statistics:
[3]: summarize
------------------------------------------------------------------------------
price | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_cons | 296301.6 4595.181 64.48 0.000 287293.1 305310.1
------------------------------------------------------------------------------
Two things are worth noting: the estimate of β1 (the constant) is the sample mean of price, and the
R2 is zero because the model contains no regressors.
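The input cell that produced the table above did not survive the export. A minimal sketch of what it may have looked like, assuming the dataset is already in memory and the dependent variable is named price as in the output:

```stata
* Regress price on a constant only (assumed form of the lost command)
regress price

* The estimated constant and the R-squared stored by regress
display "constant  = " _b[_cons]
display "R-squared = " e(r2)

* The sample mean of price, for comparison with the constant
summarize price
display "mean      = " r(mean)
```

If the two displayed values for the constant and the mean agree, the claim about β1 is confirmed for this sample.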
Next we will run several different specifications with the single purpose of illustrating how to handle
dummies in Stata. We start by adding a dummy variable that takes the value 1 if the house was
built after 2000. First we tabulate the variable:
Built after |
2000 | Freq. Percent Cum.
------------+-----------------------------------
0 | 2,800 55.28 55.28
1 | 2,265 44.72 100.00
------------+-----------------------------------
Total | 5,065 100.00
and now we run the regression
------------------------------------------------------------------------------
price | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
built01plus | -17498.2 9239.704 -1.89 0.058 -35612.02 615.6139
_cons | 304126.6 6178.776 49.22 0.000 292013.5 316239.6
------------------------------------------------------------------------------
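The commands behind the tabulation and the regression above are not shown in the export. A plausible reconstruction, assuming only the variable names that appear in the output (price and built01plus):

```stata
* Frequency table of the dummy (0 = built before 2001, 1 = built after 2000)
tabulate built01plus

* Enter the dummy as an ordinary regressor
regress price built01plus
```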
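The estimate of β1 (the constant) is the average price of a house built before 2001, while the estimate
of β2 (the coefficient on built01plus) is the difference between the two period averages: houses built
after 2000 are on average 17498.2 cheaper, so their average price is 304126.6 - 17498.2 = 286628.4.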
We can easily confirm this:
built01plus | mean
------------+----------
0 | 304126.6
1 | 286628.4
------------+----------
Total | 296301.6
-----------------------
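The layout of the table above resembles tabstat output, although the original command is not shown. A sketch under that assumption, followed by a check that rebuilds the two group means from the regression coefficients:

```stata
* Mean price by construction period, with the overall mean in the Total row
tabstat price, by(built01plus) statistics(mean)

* Rebuild the same means from the regression of price on the dummy
regress price built01plus
* Mean for built01plus == 0 is the constant
display _b[_cons]
* Mean for built01plus == 1 is the constant plus the slope
display _b[_cons] + _b[built01plus]
```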
However, because of the way we entered the variable in the model, Stata does not “know” that
built01plus is a dummy variable and treats it like any other (continuous) regressor. This is not the
recommended approach. We could instead have “informed” Stata that the variable is categorical by
using the “i.” notation, which makes Stata create the dummy variable itself.
This has several advantages that will become obvious as we proceed.
-------------------------------------------------------------------------------
price | Coef. Std. Err. t P>|t| [95% Conf. Interval]
--------------+----------------------------------------------------------------
1.built01plus | -17498.2 9239.704 -1.89 0.058 -35612.02 615.6139
_cons | 304126.6 6178.776 49.22 0.000 292013.5 316239.6
-------------------------------------------------------------------------------
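The row label 1.built01plus suggests the regression was run with factor-variable notation. A sketch of the likely command, plus one illustration of why the notation helps (the margins call is an addition for illustration, not part of the original notebook):

```stata
* Factor-variable notation: Stata treats built01plus as categorical
regress price i.built01plus

* Because Stata knows the variable is categorical, margins can report
* the predicted mean price for each level directly
margins built01plus
```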
The results of the two regressions are identical. What changed is that Stata now created a dummy
variable (1.built01plus) that equals 1 when built01plus is 1 and 0 otherwise. Since 1.built01plus
and built01plus are the same variable, the estimates cannot differ. What if we wanted to define the
dummy the other way around, taking the value 1 for the period before 2001 and 0 otherwise?
-------------------------------------------------------------------------------
price | Coef. Std. Err. t P>|t| [95% Conf. Interval]
--------------+----------------------------------------------------------------
0.built01plus | 17498.2 9239.704 1.89 0.058 -615.6139 35612.02
_cons | 286628.4 6869.852 41.72 0.000 273160.5 300096.2
-------------------------------------------------------------------------------
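The command for the reversed dummy is not shown. Two forms that should give the output above, assuming the same variable names; which one the author used is unknown:

```stata
* Make level 1 the base, so the reported dummy is for built01plus == 0
regress price ib1.built01plus

* Equivalent for a two-level variable: include only the indicator for level 0
regress price 0.built01plus
```

In both cases the constant becomes the average price of houses built after 2000, and the coefficient changes sign.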
Note that the dummies generated by the “i.” notation are created on the fly and are not saved with
the data. This is particularly handy because we do not need to create and store extra variables.
Take the variable bedrooms. Suppose that we want to run a regression that includes a dummy variable
for each number of bedrooms. With the “i.” construct this is very easy:
------------------------------------------------------------------------------
price | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
bedrooms |
2 | 33764.27 103020.3 0.33 0.743 -168200.1 235728.7
3 | -2861.138 100501.1 -0.03 0.977 -199886.8 194164.6
4 | 124921.5 100649.7 1.24 0.215 -72395.51 322238.5
5 | 385224.4 101719.1 3.79 0.000 185810.8 584638
6 | 1099137 115357.7 9.53 0.000 872986 1325288
|
_cons | 228555.6 100351.7 2.28 0.023 31822.79 425288.3
------------------------------------------------------------------------------
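A sketch of the command that likely produced the table above, assuming the categorical variable is named bedrooms as in the output. The baselevels display option is added here for illustration, since it makes the omitted category visible:

```stata
* One dummy per number of bedrooms; the lowest level is the default base
regress price i.bedrooms

* Same regression, but also print a row for the base (omitted) level
regress price i.bedrooms, baselevels
```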
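Stata created 5 dummy variables and the omitted (base) category was “1 bedroom”, so each coefficient
measures the difference in average price relative to one-bedroom houses. We can control this choice.
For example, if we wanted 3 bedrooms to be the base category we could do: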
------------------------------------------------------------------------------
price | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
bedrooms |
1 | 2861.138 100501.1 0.03 0.977 -194164.6 199886.8
2 | 36625.41 23931.79 1.53 0.126 -10291.27 83542.09
4 | 127782.7 9482.348 13.48 0.000 109193.1 146372.2
5 | 388085.5 17502.43 22.17 0.000 353773.2 422397.9
6 | 1101998 57157.2 19.28 0.000 989945.3 1214051
|
_cons | 225694.4 5478.258 41.20 0.000 214954.7 236434.2
------------------------------------------------------------------------------
and now the omitted category is “3 bedrooms”. We could have typed “ib(freq).bedrooms” to use
the most frequent category or “ib(first).bedrooms” or “ib(last).bedrooms” to use either the first or
last category as the omitted one. Another possibility is to not use a base category:
------------------------------------------------------------------------------
price | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
bedrooms |
1 | 228555.6 100351.7 2.28 0.023 31822.79 425288.3
2 | 262319.8 23296.34 11.26 0.000 216648.9 307990.7
3 | 225694.4 5478.258 41.20 0.000 214954.7 236434.2
4 | 353477.1 7739.742 45.67 0.000 338303.8 368650.3
5 | 613779.9 16622.99 36.92 0.000 581191.7 646368.2
6 | 1327693 56894.06 23.34 0.000 1216156 1439230
------------------------------------------------------------------------------
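but now we had better drop the constant, or we will have a perfect multicollinearity problem: the six
dummies sum to one, which is exactly the constant. With the constant dropped, each coefficient is
simply the average price for that number of bedrooms. In fact, if you did not use the “nocons”
option in regress, Stata would simply drop one variable and proceed with estimation. Try it for yourself!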
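The base-category choices discussed above can be sketched as follows. The ib3 and ib(...) operators are named in the text or implied by the output; the ibn form for "no base category" is an assumption about what the author typed:

```stata
* Base category set explicitly to 3 bedrooms
regress price ib3.bedrooms

* Base category chosen by rule: most frequent, first, or last level
regress price ib(freq).bedrooms
regress price ib(first).bedrooms
regress price ib(last).bedrooms

* No base category: one dummy per level, so the constant is dropped
regress price ibn.bedrooms, noconstant

* Without noconstant, Stata detects the collinearity and omits one term
regress price ibn.bedrooms
```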
Finally, what if we want to include a dummy for “6 bedrooms” only? It is very easy to create an
individual dummy from a categorical variable by prefixing the variable name with the level of interest,
as in 6.bedrooms:
------------------------------------------------------------------------------
price | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
6.bedrooms | 1037124 60243.01 17.22 0.000 919022 1155227
_cons | 290568.2 4479.154 64.87 0.000 281787.2 299349.3
------------------------------------------------------------------------------
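A sketch of a command consistent with the 6.bedrooms row above, next to the manual alternative it replaces. The variable name bed6 is hypothetical and used only for illustration:

```stata
* Single indicator for houses with exactly 6 bedrooms, created on the fly
regress price 6.bedrooms

* Manual equivalent: build and store the dummy first (bed6 is a made-up name)
generate byte bed6 = (bedrooms == 6) if !missing(bedrooms)
regress price bed6
```

The first form leaves the dataset unchanged, which is the point made earlier about the “i.” dummies not being saved.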
Now, let us run a regression that includes as regressors ur and the year dummies:
------------------------------------------------------------------------------
price | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
ur | -69836.53 11115.09 -6.28 0.000 -91626.92 -48046.15
|
year |
2004 | 10907.82 13070.23 0.83 0.404 -14715.49 36531.13
2005 | 16965.74 12524.12 1.35 0.176 -7586.963 41518.44
2006 | -15907.05 14783.37 -1.08 0.282 -44888.86 13074.76
2007 | 0 (omitted)
|
_cons | 652627.8 59555.26 10.96 0.000 535873.7 769381.9
------------------------------------------------------------------------------
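The command for this regression is not shown. A likely form, assuming the variables are named ur and year as in the output, together with a quick check (added for illustration) of how ur varies within each year:

```stata
* Unemployment rate plus a full set of year dummies (2003 is the default base)
regress price ur i.year

* If min and max of ur coincide within every year, ur is a linear
* combination of the year dummies and the constant
tabstat ur, by(year) statistics(min max)
```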
To avoid perfect collinearity Stata dropped the dummy variable for the year 2003 (the base year)
but, because we included ur, the collinearity persisted: ur is an exact linear combination of the
constant and the year dummies, as happens when it takes a single value within each year. Stata
therefore dropped one more variable, arbitrarily the dummy for 2007, and reported a coefficient for ur.
But notice what happens to the coefficient on ur if, instead of 2003, we use 2004 as the base year:
Source | SS df MS Number of obs = 5,065
-------------+---------------------------------- F(4, 5060) = 13.67
Model | 5.7892e+12 4 1.4473e+12 Prob > F = 0.0000
Residual | 5.3581e+14 5,060 1.0589e+11 R-squared = 0.0107
-------------+---------------------------------- Adj R-squared = 0.0099
Total | 5.4160e+14 5,064 1.0695e+11 Root MSE = 3.3e+05
------------------------------------------------------------------------------
price | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
ur | -58344.01 15943.72 -3.66 0.000 -89600.61 -27087.41
|
year |
2003 | -16384.97 19633.19 -0.83 0.404 -54874.53 22104.59
2005 | 11519.96 11987.9 0.96 0.337 -11981.5 35021.43
2006 | -16165.42 14717.29 -1.10 0.272 -45017.68 12686.85
2007 | 0 (omitted)
|
_cons | 599901.7 81904.01 7.32 0.000 439334.4 760469
------------------------------------------------------------------------------
or 2005,
------------------------------------------------------------------------------
price | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
ur | -34032.78 30851.31 -1.10 0.270 -94514.7 26449.15
|
year |
2003 | -51045.66 37681.95 -1.35 0.176 -124918.6 22827.27
2004 | -23074.35 24011.61 -0.96 0.337 -70147.5 23998.8
2006 | -16711.96 14522.53 -1.15 0.250 -45182.41 11758.5
2007 | 0 (omitted)
|
_cons | 488365.2 150418.7 3.25 0.001 193479.3 783251
------------------------------------------------------------------------------
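The two regressions above differ only in the base year. A sketch of commands that would change it, assuming the ib#. operator was used as in the bedrooms example:

```stata
* Base year 2004
regress price ur ib2004.year

* Base year 2005
regress price ur ib2005.year
```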
In this case the coefficient on ur is meaningless and cannot be interpreted: it changes from
-69836.53 to -58344.01 to -34032.78 depending only on which base year we pick. By the way, what
happens if we change the order of the regressors?
------------------------------------------------------------------------------
price | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
year |
2003 | -83439.87 14501.43 -5.75 0.000 -111869 -55010.79
2004 | -39249.08 13717.24 -2.86 0.004 -66140.82 -12357.34
2006 | -1350.491 13478.43 -0.10 0.920 -27774.05 25073.06
2007 | 16126.55 14619 1.10 0.270 -12533.02 44786.13
|
ur | 0 (omitted)
_cons | 316101 9297.417 34.00 0.000 297874 334327.9
------------------------------------------------------------------------------
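The table above lists the year dummies first and omits 2005, so the underlying command was probably of the following form. This is an inference from the output, not something the export shows:

```stata
* Year dummies listed before ur, with 2005 as the base year
regress price ib2005.year ur
```

With this ordering Stata keeps all four remaining year dummies and omits ur instead. The fitted values are the same as before, and only the arbitrary choice of which collinear variable to drop has changed.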