
Econ7810: Applied Econometrics, Fall 2022

Homework #2
Due date: 27 October 2022, 1 pm.
Do not copy and paste answers from your classmates. Two identical homework submissions will be treated as
cheating. Do not copy and paste the entire output of your statistical package. Report only the relevant part
of the output. Please also submit your R script for the empirical part. Please put all your work in one single
file and upload it via Moodle.

Part I : Multiple Choice (3 points each, 24 points in total)

Please choose the answer that you think is appropriate.


1. The slope estimator, β1, has a smaller standard error, other things equal, if
a. there is more variation in the explanatory variable, X.
b. there is a large variance of the error term, u.
c. the sample size is smaller.
d. the intercept, β0, is small.

2. The reason why estimators have a sampling distribution is that

a. economics is not a precise science.
b. individuals respond differently to incentives.
c. in real life you typically get to sample many times.
d. the values of the explanatory variable and the error term differ across samples.

3. To decide whether or not the slope coefficient is large or small,

a. you should analyze the economic importance of a given increase in X.
b. the slope coefficient must be larger than one.
c. the slope coefficient must be statistically significant.
d. you should change the scale of the X variable if the coefficient appears to be too small.

4. The p-value for a one-sided right-tail test is given by

a. $\Pr(Z > t^{act}) = 1 - \Phi(t^{act})$
b. $\Pr(Z < t^{act}) = \Phi(t^{act})$
c. $\Pr(Z > t^{act}) < 1.645$
d. cannot be calculated, since probabilities must always be positive.

5. Imagine you regressed earnings of individuals on a constant, a binary variable (Male) which takes on
the value 1 for males and is 0 otherwise, and another binary variable (Female) which takes on the value 1
for females and is 0 otherwise. Because females typically earn less than males, you would expect
a. the coefficient for Male to have a positive sign, and for Female a negative sign.
b. both coefficients to be the same distance from the constant, one above and the other below.
c. none of the OLS estimators to exist because there is perfect multicollinearity.
d. this to yield a difference in means statistic.

6. Using the textbook example of 420 California school districts and the regression of test scores on the
student-teacher ratio, you find that the standard error on the slope coefficient is 0.51 when using the
heteroskedasticity-robust formula, while it is 0.48 when employing the homoskedasticity-only formula. When
calculating the t-statistic, the recommended procedure is to
a. use the homoskedasticity-only formula because the t-statistic becomes larger
b. first test for homoskedasticity of the errors and then make a decision
c. use the heteroskedasticity-robust formula
d. make a decision depending on how different the estimate of the slope is under the two procedures

7. Consider the multiple regression model with two regressors X1 and X2, where both variables are
determinants of the dependent variable. You first regress Y on X1 only and find no relationship. However,
when regressing Y on X1 and X2, the slope coefficient changes by a large amount. This suggests that your
first regression suffers from
a. heteroskedasticity
b. perfect multicollinearity
c. omitted variable bias
d. dummy variable trap

8. Imperfect multicollinearity
a. implies that it will be difficult to estimate precisely one or more of the partial effects using the data at
hand
b. violates one of the four Least Squares assumptions in the multiple regression model
c. means that you cannot estimate the effect of at least one of the Xs on Y
d. suggests that a standard spreadsheet program does not have enough power to estimate the multiple
regression model

Part II Short Questions (29 points in total)

Please limit your answer (except for tables or figures) to less than or equal to 5 lines per sub-question.
Note: for each sub-question, the answer should not be longer than 7 lines.
(14 points) 2.1 You have collected a sub-sample from the Current Population Survey for the western region
of the United States. Running a regression of average hourly earnings (ahe) on an intercept only, you get the
following result:
$\widehat{ahe} = \hat{\beta}_0 = 18.58$
(3 points) a. Interpret the result.
(3 points) b. You decide to include a single explanatory variable without an intercept. The binary variable
DFemme takes on a value of 1 for females but is 0 otherwise. The regression result changes as follows:

$\widehat{ahe} = \hat{\beta}_1 \times DFemme = 16.50 \times DFemme$

What is the interpretation now?

(3 points) c. You generate a new binary variable DMale by subtracting DFemme from 1, and run the new
regression:

$\widehat{ahe} = \hat{\beta}_2 \times DMale = 20.09 \times DMale$

What is the interpretation of the coefficient now?
(5 points) d. After thinking about the above results, you recognize that you could have generated the last
two results either by running a regression on both binary variables, or on an intercept and one of the binary
variables. What would the results have been?

(15 points) 2.2 The cost of attending your college has once again gone up. Although you have been told
that education is investment in human capital, which carries a return of roughly 10% a year, you (and your
parents) are not pleased. One of the administrators at your university/college does not make the situation
better by telling you that you pay more because the reputation of your institution is better than that of others.
To investigate this hypothesis, you collect data randomly for 100 national universities and liberal arts colleges
from the 2000-2001 U.S. News and World Report annual rankings. Next you perform the following regression

$\widehat{Cost} = 7{,}311.17 + 3{,}985.20 \times Reputation - 0.20 \times Size + 8{,}406.79 \times Dpriv - 416.38 \times Dlibert - 2{,}376.51 \times Dreligion$
$R^2 = 0.72$
where Cost is Tuition, Fees, Room and Board in dollars, Reputation is the index used in U.S. News and
World Report (based on a survey of university presidents and chief academic officers), which ranges from 1
(marginal) to 5 (distinguished), Size is the number of undergraduate students, and Dpriv, Dlibert, and
Dreligion are binary variables indicating whether the institution is private, a liberal arts college, and has a
religious affiliation.

(5 points) (a) Interpret the results. Do the coefficients have the expected sign?
(2 points) (b) What is the forecasted cost for a liberal arts college, which has no religious affiliation, a size
of 1,500 students and a reputation level of 4.5? (All liberal arts colleges are private.)
(3 points) (c) To save money, you are willing to switch from a private university to a public university,
which has a ranking of 0.5 less and 10,000 more students. What is the effect on your cost? Is it substantial?
(5 points) (d) Eliminating the Size and Dlibert variables from your regression, the estimated regression
becomes
$\widehat{Cost} = 5{,}450.33 + 3{,}538.84 \times Reputation + 10{,}935.70 \times Dpriv - 2{,}783.31 \times Dreligion$
$R^2 = 0.72$
Why do you think that the effect of attending a private institution has increased now?

Part III Empirical Exercise (49 points in total)

For all regressions, please report the heteroskedasticity-robust standard errors. Please limit your answer
(except for tables or figures) to less than or equal to 8 lines per sub-question. You may use an appropriate table
in answering the questions. Please hand in your R script file with the problem set.
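Note on robust standard errors (illustration only). The sketch below shows one way heteroskedasticity-robust (HC1) standard errors could be computed in base R. The data are simulated and made up for this example, and the helper name `robust_se` is hypothetical. Many users instead rely on add-on packages such as `sandwich` and `lmtest`; whether these are installed on your machine is an assumption you should check.

```r
# Simulated data (made up for illustration; NOT the homework data).
set.seed(1)
n <- 200
x <- runif(n, 0, 10)
u <- rnorm(n, sd = 0.5 + 0.3 * x)   # error spread grows with x (heteroskedastic)
y <- 1 + 2 * x + u

fit <- lm(y ~ x)

# Hypothetical helper: HC1 sandwich estimator computed by hand,
# (X'X)^{-1} [X' diag(e^2) X] (X'X)^{-1} * n / (n - k)
robust_se <- function(model) {
  X <- model.matrix(model)
  e <- residuals(model)
  n <- nrow(X)
  k <- ncol(X)
  bread <- solve(crossprod(X))
  meat  <- crossprod(X * e)          # each row of X scaled by its residual
  vc    <- bread %*% meat %*% bread * n / (n - k)
  sqrt(diag(vc))
}

se_robust  <- robust_se(fit)
se_classic <- sqrt(diag(vcov(fit)))  # homoskedasticity-only standard errors
print(cbind(estimate = coef(fit), se_classic, se_robust))
```

The printed table places the homoskedasticity-only and robust standard errors side by side, so you can see how much they differ in a given sample.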
(19 points) 3.1 Please download the data set retro2021.dta for this problem. The data come from Paul
Glewwe, Michael Kremer et al., "Retrospective vs. Prospective Analyses of School Inputs: The Case of
Flip Charts in Kenya." (If you like, you could read the paper (kenya.pdf) for background on this education
project.) We are going to examine the effect of wall charts on test scores using the data. In this data set,
there is a variable wallchar that measures the number of wall charts in the school which the student attends.
When measuring test scores in the regressions of this problem, we will always be using normalized test score
(nmsc).
(3 points) (a). Please use the R command str() to check the description of the variables and read the content
in label to understand the meaning of each variable. Please provide a summary statistics table including the
number of observations, mean, standard deviation, minimum and maximum of all variables in the dataset, except
for schoolid, std, pupid, sub. What are the mean and standard deviation of the normalized test score (nmsc) in
the data? (Hint: you may want to use as.data.frame() to convert your data into a data frame if it is not one. You
may use R code to select the part of the data you like, e.g. part <- subset(retro, select = -c(schoolid, std, ...)))
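Note on building a summary statistics table (illustration only). The sketch below uses a small made-up data frame, `retro_demo`, as a stand-in for the real data; its values and the object names are hypothetical. Reading the actual .dta file would need an import function, for example `haven::read_dta()`, assuming that package is installed.

```r
# Made-up stand-in data (NOT the homework data).
set.seed(2)
retro_demo <- data.frame(
  schoolid = rep(1:5, each = 4),
  nmsc     = rnorm(20),
  wallchar = rpois(20, lambda = 3)
)
retro_demo$nmsc[3] <- NA   # one missing value, to show how it is handled

# Drop identifier columns before summarising.
vars <- subset(retro_demo, select = -c(schoolid))

# One row per variable: non-missing count, mean, sd, min, max.
summary_table <- data.frame(
  n    = sapply(vars, function(v) sum(!is.na(v))),
  mean = sapply(vars, mean, na.rm = TRUE),
  sd   = sapply(vars, sd,   na.rm = TRUE),
  min  = sapply(vars, min,  na.rm = TRUE),
  max  = sapply(vars, max,  na.rm = TRUE),
  row.names = names(vars)
)
print(round(summary_table, 3))
```

Counting non-missing values per variable, rather than using the number of rows, keeps the reported number of observations honest when some variables have gaps.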
(4 points) (b). In an attempt to estimate the effect of wall charts on test scores, regress normalized test
score on wallchar. Report and interpret the coefficient on wallchar.
(6 points) (c). Now regress the normalized test score on wallchar, controlling for (i.e., including as
explanatory variables) whether the classroom is indoors, whether the roof leaks, blackboard condition, textbooks
per pupil (use bkpup), desks per pupil, teacher training level, and class size. Why might you want to do this?
How does the coefficient on wallchar in this regression compare with the coefficient on wallchar in (b)? What
do you conclude from this comparison? Do the conditions of the classroom matter?
(6 points) (d). What might be the problem with using an OLS regression like (b) to estimate the effect of
wall charts on test scores? Which way would you expect the coefficient on wallchar to be biased?

(30 points) 3.2 Use the data homework2022.dta for this exercise. Dr. Qin wants to study the relationship
between performance, measured as the score of homework #1 (hw1), and the attendance rate (attend), and
maybe some other characteristics of students. attend is measured as a percent ranging from 0 to 100, and the
score (hw1) has a maximum possible value of 100.
(3 points) (a) Please provide a summary statistics table including the number of observations, mean,
standard deviation, minimum and maximum of all variables in the dataset.
(3 points) (b) Using the data to estimate the population model

hw1 = β0 + β1 attend + u

Report the results in a table, including sample size and R-squared. Interpret the coefficient β1. Does attend
explain a lot of the variation in the score of homework #1?
(4 points) (c) Please produce a scatter plot for attend vs. hw1. What do you find? Is there any procedure
you would like to implement? Why?

(3 points) (d) After you have implemented the procedure in (c), estimate the population model again

hw1 = β0 + β1 attend + u

Report the results in the same table (as used in (b)), including sample size and R-squared. Interpret the
coefficient β1. Does attend explain a lot of the variation in hw1?
(4 points) (e) Using the data from (d), rescale the homework score into units of 100 points, i.e., generate a new
variable, hw1_100 = hw1/100. Estimate the population model

hw1_100 = β0 + β1 attend + u

Report the result in the same table and interpret the coefficient β1. Is the interpretation different from (d)?
(4 points) (f) Dr. Qin just found that another variable is available in this data set, entry_GPA, which is the
GPA before the students were enrolled in the program. If Dr. Qin is interested in discovering the relationship
between homework performance and the attendance rate, should she add entry_GPA to the regression
model? Why or why not?
(5 points) (g) Dr. Qin tries to estimate the model

hw1 = β0 + β1 attend + β2 entry_GPA + u

Please report the result in the same table. Please interpret β1 and β2. What is the difference between the
coefficient estimate for β1 here and the estimate from (d)? What does the difference suggest?
(4 points) (h) Dr. Qin further wonders whether different gender groups of students have the same
performance in the homework. Please suggest a regression model, estimate it, report it in the same table, and use
the result to answer the question.
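Note on reporting several regressions in the same table (illustration only). The sketch below collects coefficients, sample size and R-squared from more than one fitted model into a single matrix, one column per model. The data frame `demo` is simulated and made up for this example; only the variable names mirror the exercise, and the numbers carry no information about the real data. Standard errors are left out to keep the sketch short; they would be added as extra rows.

```r
# Made-up data (NOT the homework data); variable names mirror the exercise.
set.seed(3)
n <- 60
demo <- data.frame(
  attend    = runif(n, 40, 100),
  entry_GPA = runif(n, 2, 4)
)
demo$hw1 <- 20 + 0.4 * demo$attend + 8 * demo$entry_GPA + rnorm(n, sd = 5)

models <- list(
  "(1)" = lm(hw1 ~ attend, data = demo),
  "(2)" = lm(hw1 ~ attend + entry_GPA, data = demo)
)

# One column per model, one row per coefficient, plus N and R-squared.
terms_all <- unique(unlist(lapply(models, function(m) names(coef(m)))))
results <- sapply(models, function(m) {
  est <- coef(m)[terms_all]          # NA where a term is not in the model
  c(est, N = nobs(m), R2 = summary(m)$r.squared)
})
rownames(results) <- c(terms_all, "N", "R2")
print(round(results, 3))
```

A blank (NA) cell shows that a regressor was not part of that specification, which makes it easy to compare how a coefficient moves as controls are added.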
