Diego Garzon - dig5269

Homework 3

100 points

README: The following two problems will require a lot of calculations in STATA. It will
generate many pages of output. Here is how you should organize it. The first pages should
contain your answers to all the questions, along with showing any key algebraic equations
or explanations you need to use along the way. After that, include a printout of the output
from the regressions you executed in support of your answers. Highlight any numbers in this
output that you used in the first section. (To save paper, you may print this section double-sided
and/or with 2-up format.) Last, include a copy of the DO file that contains the commands you
asked STATA to execute. Be sure you organize these in a way that will be clear to the reader.

1. (52 points total. 12 parts worth 4 each, 4 points free.)


With this assignment you will find a STATA data file called HW3.Housing.dta. For
reference, the variables in this file are:

price = House Price in $


sqft = Total square feet of living area
beds = Number of bedrooms
baths = Number of full bathrooms
age = age of house in years
stories = number of stories of the house
vacant = If yes, this variable=1. If no, this variable=0.

Open this dataset within STATA. Before you begin answering the following, it's not a
bad idea to ask STATA to summarize the data using the command summarize. You
should also start a log file to store your results.

a. Run the following regression:

$Price = \beta_0 + \beta_1 \cdot beds$

. regress price beds

Source | SS df MS Number of obs = 880

-------------+------------------------------ F( 1, 878) = 188.50

Model | 4.3334e+11 1 4.3334e+11 Prob > F = 0.0000

Residual | 2.0185e+12 878 2.2989e+09 R-squared = 0.1767

-------------+------------------------------ Adj R-squared = 0.1758

Total | 2.4518e+12 879 2.7893e+09 Root MSE = 47947


------------------------------------------------------------------------------

price | Coef. Std. Err. t P>|t| [95% Conf. Interval]

-------------+----------------------------------------------------------------

beds | 32104.46 2338.381 13.73 0.000 27514.99 36693.93

_cons | 11025.06 7587.875 1.45 0.147 -3867.426 25917.55

------------------------------------------------------------------------------

b. Hypothesize the sign of the bias, if any, resulting from excluding sqft from the
regression. Explain your reasoning.

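Excluding sqft from the regression, I hypothesize that the bias in the coefficient on beds
will be positive. Square footage raises price, and houses with more bedrooms tend to have
more square footage, so with sqft omitted the beds coefficient picks up part of the effect
of size and is overstated.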

c. Use the data to verify (or not) your claim from b). Break down the bias into the
component pieces as we did in class

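The data show that the number of bedrooms is positively related to price. In the regression
from part a, the constant is 11025.06 and the coefficient on beds is 32104.46. Once sqft and
the other variables are controlled for in part e, the beds coefficient falls to -17474.4,
which is consistent with a positive bias.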
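
The decomposition asked for in part c can be sketched as follows. The notation here is an assumption (a common textbook convention) and may differ from the one used in class: tildes mark the short regression of price on beds alone, hats mark the regression of price on beds and sqft, and $\tilde{\delta}_1$ is the slope from regressing sqft on beds.

```latex
\tilde{\beta}_1 = \hat{\beta}_1 + \hat{\beta}_2\,\tilde{\delta}_1,
\qquad
\text{bias in } \tilde{\beta}_1 = \hat{\beta}_2\,\tilde{\delta}_1
```

The bias is positive when both pieces are positive, that is, when sqft raises price and sqft is positively correlated with beds.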

d. You will see from c. that the effect of beds is negative, once we control for square
footage. Does this make sense?

Yes. If we hold square footage fixed and add bedrooms, the same living area is divided
into more and smaller rooms. Who would want that?

e. Now, run the regression:


$Price = \beta_0 + \beta_1 \cdot beds + \beta_2 \cdot sqft + \beta_3 \cdot baths + \beta_4 \cdot age + \beta_5 \cdot stories$

reg price beds sqft baths age stories

Source | SS df MS Number of obs = 880


-------------+------------------------------ F( 5, 874) = 431.77
Model | 1.7452e+12 5 3.4905e+11 Prob > F = 0.0000


Residual | 7.0655e+11 874 808414232 R-squared = 0.7118
-------------+------------------------------ Adj R-squared = 0.7102
Total | 2.4518e+12 879 2.7893e+09 Root MSE = 28433

------------------------------------------------------------------------------
price | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
beds | -17474.4 1983.609 -8.81 0.000 -21367.59 -13581.2
sqft | 98.13947 2.992471 32.80 0.000 92.26621 104.0127
baths | -1332.919 2688.311 -0.50 0.620 -6609.219 3943.381
age | -299.074 53.9454 -5.54 0.000 -404.9516 -193.1963
stories | -6708.066 3246.125 -2.07 0.039 -13079.18 -336.9543
_cons | 28193.49 5344.806 5.27 0.000 17703.34 38683.64
------------------------------------------------------------------------------

f. At a level of $\alpha = .05$, for which, if any, values of i would you reject the null
hypothesis that $\beta_i = 0$?

We reject the null hypothesis for every coefficient except the one on baths, whose p-value
of 0.620 is greater than .05.

g. What is the predicted price with beds=4, sqft=2185, age=45, baths=2.5,
stories=3? (Note: these are the figures for my house here in State College, but
the data is not, so the price here is pretty meaningless as a predictor of my own home
value.)
From the regression in Q1e:

Constant: 28193.49

Beds: 4*-17474.4= -69897.6

Sqft: 2185*98.13947= 214434.742

Baths: 2.5*-1332.919= -3332.2975

Age: 45*-299.074= -13458.33

Stories: 3*-6708.066= -20124.198

= 135,815.81

According to this model, how much will my house change in value five years from
today?

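Constant: 28193.49

Beds: 4*-17474.4= -69897.6

Sqft: 2185*98.13947= 214434.742

Baths: 2.5*-1332.919= -3332.2975

Age: (45+5)*-299.074= -14953.7

Stories: 3*-6708.066= -20124.198

= 134,320.44

Only the age term changes, so the predicted value changes by 5*-299.074 = -1,495.37. The
model says the house loses about $1,495 in value over five years.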
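
The arithmetic for part g can be checked with a short script. This is a minimal sketch: the coefficients are copied from the part e output, while the function name `predict_price` and the dictionary `COEF` are illustrative names, not part of the assignment. Note that a fitted value includes the constant term `_cons`.

```python
# Coefficients copied from the part e regression output.
COEF = {
    "_cons": 28193.49,
    "beds": -17474.4,
    "sqft": 98.13947,
    "baths": -1332.919,
    "age": -299.074,
    "stories": -6708.066,
}


def predict_price(beds, sqft, baths, age, stories):
    """Fitted price: the constant plus each coefficient times its value."""
    values = {"beds": beds, "sqft": sqft, "baths": baths,
              "age": age, "stories": stories}
    return COEF["_cons"] + sum(COEF[name] * value for name, value in values.items())


price_now = predict_price(beds=4, sqft=2185, baths=2.5, age=45, stories=3)
price_in_5_years = predict_price(beds=4, sqft=2185, baths=2.5, age=50, stories=3)

# Only age differs between the two predictions, so the change is 5 * (age coefficient).
change = price_in_5_years - price_now

print(f"{price_now:.2f} {price_in_5_years:.2f} {change:.2f}")
```

Because every other regressor is held fixed, the five-year change does not depend on the constant or on the other coefficients.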

h. What percentage of the variation in price is explained by the five X-variables?

The R-squared of 0.7118 tells us that 71.18% of the variation in price is explained by the five X-variables.

Now change the measurement of price. Use the gen command:


gen price_thous=price/1000
and then use this in place of price in the regression command for part e

. gen price_thous=price/1000
. reg price_thous beds sqft baths age stories

Source | SS df MS Number of obs = 880


-------------+------------------------------ F( 5, 874) = 431.77
Model | 1745246.47 5 349049.293 Prob > F = 0.0000
Residual | 706554.038 874 808.414231 R-squared = 0.7118
-------------+------------------------------ Adj R-squared = 0.7102
Total | 2451800.5 879 2789.3066 Root MSE = 28.433

------------------------------------------------------------------------------
price_thous | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
beds | -17.4744 1.983609 -8.81 0.000 -21.36759 -13.5812
sqft | .0981395 .0029925 32.80 0.000 .0922662 .1040127
baths | -1.332919 2.688311 -0.50 0.620 -6.609219 3.943381
age | -.299074 .0539454 -5.54 0.000 -.4049516 -.1931963
stories | -6.708066 3.246125 -2.07 0.039 -13.07918 -.3369542
_cons | 28.19349 5.344806 5.27 0.000 17.70334 38.68364
------------------------------------------------------------------------------
i. Compare the coefficients, standard error, and t-statistics for the independent variables.
Briefly interpret the difference between this model and the version from part e.

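The coefficients describe the same relationship as before; only the decimal place
has moved. Because price is now measured in thousands of dollars, every coefficient,
standard error, and confidence interval bound (and the Root MSE) is divided by 1000.
The t-statistics, p-values, F-statistic, and R-squared are unchanged.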
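
A quick check of this claim, using three rows copied from the two regression tables. This is only a sketch; the variable names are illustrative, and small differences come from rounding in the printed output.

```python
# (coefficient, standard error) pairs copied from the part e and part i tables.
rows_dollars = {"beds": (-17474.4, 1983.609),
                "sqft": (98.13947, 2.992471),
                "age": (-299.074, 53.9454)}
rows_thousands = {"beds": (-17.4744, 1.983609),
                  "sqft": (0.0981395, 0.0029925),
                  "age": (-0.299074, 0.0539454)}

scale_ratios = {}   # coefficient in dollars / coefficient in thousands
t_gaps = {}         # difference between the two t-statistics

for name, (b_d, se_d) in rows_dollars.items():
    b_t, se_t = rows_thousands[name]
    scale_ratios[name] = b_d / b_t
    # t = coefficient / standard error; the factor of 1000 cancels.
    t_gaps[name] = abs(b_d / se_d - b_t / se_t)

print(scale_ratios)
print(t_gaps)
```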

j. Create a new age variable by converting age from years to days (365 days in a year).
Rerun the regression from e with the new age variable in place of the original age.
gen age_days=age*365

. reg price beds sqft baths age_days stories

Source | SS df MS Number of obs = 880

-------------+------------------------------ F( 5, 874) = 431.77

Model | 1.7452e+12 5 3.4905e+11 Prob > F = 0.0000

Residual | 7.0655e+11 874 808414232 R-squared = 0.7118

-------------+------------------------------ Adj R-squared = 0.7102

Total | 2.4518e+12 879 2.7893e+09 Root MSE = 28433

------------------------------------------------------------------------------

price | Coef. Std. Err. t P>|t| [95% Conf. Interval]

-------------+----------------------------------------------------------------

beds | -17474.4 1983.609 -8.81 0.000 -21367.59 -13581.2

sqft | 98.13947 2.992471 32.80 0.000 92.26621 104.0127

baths | -1332.919 2688.311 -0.50 0.620 -6609.219 3943.381

age_days | -.8193808 .1477956 -5.54 0.000 -1.109457 -.5293049

stories | -6708.066 3246.125 -2.07 0.039 -13079.18 -336.9543

_cons | 28193.49 5344.806 5.27 0.000 17703.34 38683.64

------------------------------------------------------------------------------

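k. What has changed between the regression in e and the regression in j? Be precise.


Only the row for the age variable has changed. Because age_days = 365*age, its coefficient,
standard error, and confidence interval bounds are each divided by 365
(-299.074/365 = -.8193808). The t-statistic and p-value for age, and every other number in
the output, are unchanged.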
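
The division by 365 can be verified directly from the two tables. A minimal sketch with illustrative variable names; the reported values are copied from the part j output.

```python
DAYS_PER_YEAR = 365

# Age row from part e (age in years).
coef_years, se_years = -299.074, 53.9454

# Age row reported in part j (age in days).
coef_days_reported, se_days_reported = -0.8193808, 0.1477956

# Rescaling a regressor by 365 divides its coefficient and standard error by 365.
coef_days = coef_years / DAYS_PER_YEAR
se_days = se_years / DAYS_PER_YEAR

# The t-statistic is a ratio, so the factor cancels.
t_years = coef_years / se_years
t_days = coef_days / se_days

print(coef_days, se_days, t_years, t_days)
```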

2. (48 Points Total, 9 parts worth 5 points each, add on 3 points for free.)
For the following problem, use the STATA dataset called crime.dta. This data set was
compiled by Christopher Cornwell and William Trumbull to study factors that influence
crime rates. The data set contains observations for 90 counties in North Carolina for
1981. The definitions of the variables are given in the data set:

According to the economic model of crime rates, lower crime rates are associated with
better labor markets (higher wages), more police presence and tougher sentences, and
lower population density. We will use this data set to examine these hypotheses. Use a
significance level of $\alpha = .05$ for all hypothesis tests. All of the following regressions
will utilize the following subset of variables from this dataset.

crmrte=crime rate
prbarr=probability of arrest
prbconv=probability of conviction
prbpris=probability of a prison sentence
avgsen=average sentence in days
polpc=number of police per capita
density=population density
pctmin=percent minority
taxpc=tax revenue per capita
wmfg=average weekly wage in manufacturing
wcon=average weekly wage in construction
wtuc=average weekly wage in transportation,utilities,and communications
wtrd=average weekly wage in wholesale and retail trade
wfir=average weekly wage in finance,insurance,and real estate
wser=average weekly wage in services
wfed=average weekly wage in federal government
wsta=average weekly wage in state government
wloc=average weekly wage in local government

a. Run a regression of crmrte on the variables listed above. Call this Model 1.

//model 1

.
. reg crmrte prbarr prbconv prbpris avgsen polpc density pctmin taxpc wmfg wcon wtuc wtrd
wfir wser wfed wsta wloc

Source | SS df MS Number of obs = 90

-------------+------------------------------ F( 17, 72) = 15.02

Model | .020071496 17 .001180676 Prob > F = 0.0000

Residual | .005661278 72 .000078629 R-squared = 0.7800

-------------+------------------------------ Adj R-squared = 0.7281

Total | .025732774 89 .000289132 Root MSE = .00887

------------------------------------------------------------------------------

crmrte | Coef. Std. Err. t P>|t| [95% Conf. Interval]

-------------+----------------------------------------------------------------

prbarr | -.0381948 .0094201 -4.05 0.000 -.0569734 -.0194162

prbconv | -.0009047 .0009389 -0.96 0.338 -.0027764 .0009669

prbpris | -.0022334 .0131996 -0.17 0.866 -.0285463 .0240795

avgsen | -.0002937 .0003008 -0.98 0.332 -.0008933 .0003059

polpc | -.0461456 .9862908 -0.05 0.963 -2.01228 1.919989

density | .007183 .0010027 7.16 0.000 .005184 .0091819

pctmin80 | .0002989 .0000698 4.28 0.000 .0001597 .000438

taxpc | .0001502 .000162 0.93 0.357 -.0001726 .0004731

wmfg | .0000101 .0000223 0.45 0.651 -.0000343 .0000546

wcon | .0000462 .0000448 1.03 0.306 -.0000431 .0001355

wtuc | -.0000134 .000022 -0.61 0.544 -.0000573 .0000305

wtrd | .0000197 .0000628 0.31 0.755 -.0001055 .0001449

wfir | .0000686 .0000533 1.29 0.202 -.0000376 .0001748

wser | 3.07e-06 .0000515 0.06 0.953 -.0000997 .0001058

wfed | .0000111 .00003 0.37 0.712 -.0000487 .0000709

wsta | 1.40e-06 .0000398 0.04 0.972 -.000078 .0000808

wloc | -.0000551 .0000871 -0.63 0.529 -.0002287 .0001186

_cons | .0083859 .0180953 0.46 0.644 -.0276865 .0444583


------------------------------------------------------------------------------

b. Do any p-values indicate a variable is not statistically significant? Which?

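Yes. Only probability of arrest (prbarr), population density (density), and percent minority
(pctmin80) are statistically significant, each with a p-value of 0.000. Every other variable,
including prbconv, prbpris, avgsen, polpc, taxpc, and all nine wage variables, has a p-value
above our alpha of .05 and is not statistically significant.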
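
The classification can be reproduced by comparing each p-value from the Model 1 table with alpha. A minimal sketch; the p-values are copied from the output above and the variable names are illustrative.

```python
ALPHA = 0.05

# P>|t| column copied from the Model 1 output (regressors only, constant omitted).
P_VALUES = {
    "prbarr": 0.000, "prbconv": 0.338, "prbpris": 0.866, "avgsen": 0.332,
    "polpc": 0.963, "density": 0.000, "pctmin80": 0.000, "taxpc": 0.357,
    "wmfg": 0.651, "wcon": 0.306, "wtuc": 0.544, "wtrd": 0.755,
    "wfir": 0.202, "wser": 0.953, "wfed": 0.712, "wsta": 0.972, "wloc": 0.529,
}

# A coefficient is significant at the 5% level when its p-value is below alpha.
significant = [name for name, p in P_VALUES.items() if p < ALPHA]
not_significant = [name for name, p in P_VALUES.items() if p >= ALPHA]

print("significant:", significant)
print("not significant:", not_significant)
```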

c. Interpret the F-statistic STATA has calculated for Model 1.

In general, the F statistic is (SSR restricted minus SSR unrestricted) divided by the
number of restrictions, all over SSR unrestricted divided by (n - k - 1), where n is the
number of observations and k is the number of independent variables in the unrestricted
regression. Model 1 reports F(17, 72) = 15.02, so the numerator (model) df is 17 and the
denominator (residual) df is 72 = 90 - 17 - 1. It tests the joint hypothesis that the
coefficients on all 17 variables are zero. The p-value of 0.0000 tells us that we reject
this null hypothesis at the .05 level.
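
The overall F statistic can be recomputed from the sums of squares in the Model 1 table. A minimal sketch with illustrative variable names; the restricted model here is the one with only a constant, so its SSR equals the total sum of squares.

```python
n_obs = 90          # Number of obs
k = 17              # independent variables in Model 1

ss_model = 0.020071496      # "Model" sum of squares
ss_residual = 0.005661278   # "Residual" sum of squares

df_model = k                   # numerator df: one restriction per slope
df_residual = n_obs - k - 1    # denominator df

# Ratio of the two mean squares, as in the MS column of the table.
f_overall = (ss_model / df_model) / (ss_residual / df_residual)

print(df_model, df_residual, f_overall)
```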

d. Test the hypothesis that the coefficients on wsta and wloc are equal to each other.
Use the t-test method described in the lectures. What transformation do you need to
do here? Be specific.

wstawloc | -8.84e-06 .0000356 -0.25 0.804 -.0000797 .000062

We use a t-test on a transformed regression. We generate a new variable wstawloc = wsta + wloc
and rerun the Model 1 regression using wstawloc in place of wsta and wloc. We find a t-value
of -0.25 and a p-value of 0.804, so we would not reject with an alpha of .05.
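
For reference, a common textbook form of the transformation for testing equality of two coefficients is sketched below. The symbol $\theta$ and the exact layout are assumptions and may differ from the lecture notation. In this form both wsta and the sum (wsta + wloc) stay in the regression, and the t-statistic on wsta tests $H_0: \theta = 0$ directly.

```latex
\theta = \beta_{wsta} - \beta_{wloc}, \qquad H_0: \theta = 0

\beta_{wsta}\,wsta + \beta_{wloc}\,wloc
  = (\theta + \beta_{wloc})\,wsta + \beta_{wloc}\,wloc
  = \theta\,wsta + \beta_{wloc}\,(wsta + wloc)
```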

e. Test the hypothesis that the coefficients on wfed, wsta and wloc are all equal to
each other. Do this by writing down the formula for the relevant F-statistic.
Calculate it (by running the appropriate restricted regression) and test the hypothesis.
Report these results. This restricted version of the regression will be called Model 2.

For this, we run a restricted regression. We generate a variable
Qe = wsta + wfed + wloc and run the Model 1 regression of crmrte with Qe
in place of wsta, wfed and wloc, which imposes 2 restrictions. The F statistic is

$F = \frac{(.005698845 - .005661278)/2}{.005661278/72} = .238887688$

This is well below the 5% critical value of F(2, 72), which is about 3.12, so we
fail to reject.

Restricted:
Source | SS df MS Number of obs = 90
-------------+------------------------------ F( 15, 74) = 17.34
Model | .020033929 15 .001335595 Prob > F = 0.0000
Residual | .005698845 74 .000077011 R-squared = 0.7785
-------------+------------------------------ Adj R-squared = 0.7336
Total | .025732774 89 .000289132 Root MSE = .00878

Wstawfedwloc=Qe | 1.99e-06 .0000206 0.10 0.923 -.000039 .000043

Unrestricted is Model 1
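
The F statistic for Model 2 can be recomputed from the two residual sums of squares. This is a minimal sketch: the function name `f_statistic` is illustrative, and the critical value is an approximate figure assumed from an F table rather than computed here.

```python
def f_statistic(ssr_restricted, ssr_unrestricted, n_restrictions, df_unrestricted):
    """F = [(SSR_r - SSR_ur) / q] / [SSR_ur / (n - k - 1)]."""
    numerator = (ssr_restricted - ssr_unrestricted) / n_restrictions
    denominator = ssr_unrestricted / df_unrestricted
    return numerator / denominator


# Residual SS from Model 2 (restricted) and Model 1 (unrestricted).
f_model2 = f_statistic(
    ssr_restricted=0.005698845,
    ssr_unrestricted=0.005661278,
    n_restrictions=2,        # wfed = wsta and wsta = wloc
    df_unrestricted=72,      # 90 - 17 - 1 from Model 1
)

# Approximate 5% critical value of F(2, 72); an assumed table lookup.
CRIT_F_2_72 = 3.12
reject_null = f_model2 > CRIT_F_2_72

print(f_model2, reject_null)
```

The denominator always uses the residual degrees of freedom of the unrestricted model, here 72.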

f. Return to Model 1. Now test the hypothesis that all 9 of the wage variables have a
coefficient of zero. Do this by writing down the formula for the relevant F-statistic.
Calculate it (by running the appropriate restricted regression) and test the hypothesis.
Report these results.

Restricted, with Qf in place of all wages:

Source | SS df MS Number of obs = 90
-------------+------------------------------ F( 9, 80) = 29.63
Model | .01979444 9 .002199382 Prob > F = 0.0000
Residual | .005938334 80 .000074229 R-squared = 0.7692
-------------+------------------------------ Adj R-squared = 0.7433
Total | .025732774 89 .000289132 Root MSE = .00862

Qf | .000012 5.55e-06 2.17 0.033 9.72e-07 .0000231

Same process as in part e, but all nine wage variables are replaced by Qf, their sum.
This imposes 8 restrictions (that the nine wage coefficients are equal). Testing that
all nine are zero would instead drop the wage variables altogether, for 9 restrictions.

$F = \frac{(.005938334 - .005661278)/8}{.005661278/72} \approx .440$

This is below the 5% critical value of F(8, 72), which is about 2.07, so we fail to
reject H0.

Unrestricted is Model 1.

g. If a crime is committed, the probability of arrest is prbarr. If a person is arrested for
the crime, the probability of conviction is prbconv. If the person is convicted, the
probability of prison is prbpris. Assume all these probabilities are independent.
What is the formula for calculating the probability that someone who commits a
crime will a) get arrested AND b) get convicted AND c) get a prison sentence? That
is, how would you calculate the probability of this intersection of statistically
independent events? [Note: the probabilities produced by the researchers are
derived from the arrest data, and thus may not follow the usual rules of probability.
In particular, some probabilities are greater than one. Don't worry about that here.]
Call this variable prjail_ifcrime and create it in STATA.

gen prjail_ifcrime= prbarr if prbconv & prbpris

Variable | Obs Mean Std. Dev. Min Max


-------------+--------------------------------------------------------
prjail_ifc~e | 90 .2990574 .1268487 .0588235 .7
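
The formula the question asks for follows from the multiplication rule. A sketch, reading prbconv and prbpris as the conditional probabilities described in the question:

```latex
P(\text{arrest} \cap \text{conviction} \cap \text{prison})
  = P(\text{arrest}) \cdot P(\text{conviction} \mid \text{arrest})
    \cdot P(\text{prison} \mid \text{conviction})
  = prbarr \times prbconv \times prbpris
```

Under this formula the Stata variable would be generated as the product prbarr*prbconv*prbpris.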

h. Given this result prjail_ifcrime, how would you use the variable avgsen to calculate
the expected time in jail if commiting a crime. Call this variable jailtime_ifcrime and
create it in STATA.

gen jailtime_ifcrime= avgsen if prjail_ifcrime

Variable | Obs Mean Std. Dev. Min Max

-------------+--------------------------------------------------------

jailtime_i~e | 90 10.62656 3.511753 4.64 25.83
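
Expected time in jail is the probability of going to jail times the average sentence. The sketch below illustrates the calculation with made-up numbers for a single hypothetical county; the function names and input values are assumptions for illustration and do not come from crime.dta.

```python
def prob_jail_if_crime(prbarr, prbconv, prbpris):
    """Probability of arrest AND conviction AND prison, by the multiplication rule."""
    return prbarr * prbconv * prbpris


def expected_jail_time(prbarr, prbconv, prbpris, avgsen):
    """Expected days in jail per crime: P(jail | crime) times the average sentence."""
    return prob_jail_if_crime(prbarr, prbconv, prbpris) * avgsen


# Hypothetical county (illustrative values only, not taken from the data set).
p_jail = prob_jail_if_crime(prbarr=0.30, prbconv=0.50, prbpris=0.40)
days = expected_jail_time(prbarr=0.30, prbconv=0.50, prbpris=0.40, avgsen=10.0)

print(p_jail, days)
```

In Stata the same calculation is a product of variables, applied to all 90 counties at once.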

i. Return to regression Model 1. Replace the variables prbarr, prbconv, prbpris and
avgsen with your new variable jailtime_ifcrime. This is Model 4. Write a paragraph
in which you discuss how Model 4 compares with Model 1.

. reg crmrte jailtime_ifcrime polpc density pctmin taxpc wmfg wcon wtuc wtrd wfir
wser wfed wsta wloc
Source | SS df MS Number of obs = 90
-------------+------------------------------ F( 14, 75) = 14.31
Model | .018723082 14 .001337363 Prob > F = 0.0000
Residual | .007009691 75 .000093463 R-squared = 0.7276
-------------+------------------------------ Adj R-squared = 0.6767
Total | .025732774 89 .000289132 Root MSE = .00967

----------------------------------------------------------------------------------
crmrte | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-----------------+----------------------------------------------------------------
jailtime_ifcrime | -.0001843 .000319 -0.58 0.565 -.0008197 .000451
polpc | -1.135305 .7543213 -1.51 0.137 -2.637991 .3673797
density | .0081923 .0010578 7.74 0.000 .006085 .0102996
pctmin80 | .0002044 .0000719 2.84 0.006 .0000611 .0003476
taxpc | .0002442 .00017 1.44 0.155 -.0000944 .0005828
wmfg | .0000139 .0000238 0.58 0.561 -.0000335 .0000614
wcon | .0000297 .0000481 0.62 0.539 -.0000662 .0001256
wtuc | -.0000255 .0000233 -1.09 0.278 -.000072 .000021
wtrd | -.0000212 .000066 -0.32 0.749 -.0001527 .0001103
wfir | .0000352 .000057 0.62 0.538 -.0000782 .0001487
wser | 2.04e-06 .000055 0.04 0.970 -.0001075 .0001116
wfed | .0000485 .0000305 1.59 0.116 -.0000123 .0001093
wsta | .000016 .0000424 0.38 0.707 -.0000685 .0001006
wloc | -7.90e-06 .0000932 -0.08 0.933 -.0001935 .0001777
_cons | -.0116069 .0185137 -0.63 0.533 -.0484881 .0252743

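Comparing Model 4 with Model 1, we find several differences. The new variable
jailtime_ifcrime has a negative coefficient (-.0001843) but is not statistically
significant (p = 0.565), whereas prbarr was strongly significant in Model 1. The
coefficient on police per capita becomes much more negative (-.0461 to -1.1353),
and the coefficients on tax revenue per capita and the manufacturing wage rise
slightly, but all three remain insignificant at the 5% level. Among the other wage
coefficients, wcon, wtuc, wtrd, wfir and wser fall while wfed, wsta and wloc rise;
none is significant in either model. Density and percent minority are the only
regressors significant in Model 4. More telling, R-squared falls from 0.7800 to
0.7276 and adjusted R-squared from 0.7281 to 0.6767, so Model 4 fits the data less
well than Model 1.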
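
The fit comparison can be reproduced from the sums of squares in the two tables. A minimal sketch; the function name `fit_measures` is illustrative. Adjusted R-squared is the fairer comparison here because the two models have different numbers of regressors.

```python
def fit_measures(ss_residual, ss_total, n_obs, k):
    """Return (R-squared, adjusted R-squared) for a model with k regressors."""
    r2 = 1 - ss_residual / ss_total
    # Adjusted R-squared penalizes each sum of squares by its degrees of freedom.
    adj_r2 = 1 - (ss_residual / (n_obs - k - 1)) / (ss_total / (n_obs - 1))
    return r2, adj_r2


SS_TOTAL = 0.025732774  # same dependent variable, so the same total SS in both models

r2_m1, adj_m1 = fit_measures(0.005661278, SS_TOTAL, n_obs=90, k=17)  # Model 1
r2_m4, adj_m4 = fit_measures(0.007009691, SS_TOTAL, n_obs=90, k=14)  # Model 4

print(r2_m1, adj_m1)
print(r2_m4, adj_m4)
```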