Fall 2019
Your solutions should be typed and well organized. You need to explain / show all of the steps you
used to arrive at your answer. Submit your work through Blackboard as a Word or pdf file.
1. Indicate whether the following statements are true or false, along with a brief explanation.
(a) The assumptions regarding the disturbance term e for a multiple linear regression are the
same as those for a simple linear regression.
(b) The coefficient of determination is not useful when analyzing a multiple linear
regression.
(d) Heteroskedasticity occurs if the disturbances/errors are not independent of one another.
(e) Multicollinearity occurs when an explanatory variable is correlated with the dependent
variable.
(f) Multicollinearity makes it more likely that a researcher will commit a Type I error when
evaluating $H_0: \beta_1 = 0$.
(g) While an explanatory variable can be lagged in a time series regression model, it is not
possible to lag the dependent variable in a time series regression model.
(h) The full model should be used when a researcher conducts a partial F test and rejects the
null hypothesis.
(i) For a multiple linear regression, there is less uncertainty in predicting a conditional mean
of Y than in predicting an individual value of Y.
(j) A larger value for the variance inflation factor (VIF) implies it is more likely that
multicollinearity is a problem.
Solution:
(a) True: the assumptions are that the disturbances (i) are zero on average; (ii) are normally
distributed; (iii) have a constant variance; and (iv) are independent of one another.
(b) False: the coefficient of determination is still useful in the context of a multiple linear
regression since it indicates the proportion of the variation in Y explained by the model.
(c) False. It is mathematically impossible for the coefficient of determination to decrease as
explanatory variables are added to a regression model. This fact implies that the
coefficient of determination is a poor criterion to assess whether additional variables
should be added to a model. A better criterion in this regard is the adjusted $R^2$ since it
can decrease if an additional variable does not contribute much to the model.
(d) False: autocorrelation occurs if the disturbances/errors are not independent of one
another. Heteroskedasticity occurs if the variance of the errors is not constant over
different values of the explanatory variable(s).
(e) False: multicollinearity occurs when an explanatory variable is correlated with another
explanatory variable.
(f) False: multicollinearity can result in a disproportionately large standard error for a
regression coefficient (e.g., the standard error of $b_1$). This implies that the t
statistic will be small, which makes it less likely that the hypothesis $H_0: \beta_1 = 0$ will be
rejected. A Type II error occurs when a researcher fails to reject $H_0: \beta_1 = 0$ when, in
fact, the hypothesis is false. Thus, multicollinearity makes it more likely a Type II error
will be committed.
(g) False: both the explanatory variable(s) and the dependent variable can be lagged in a
time series regression model.
(h) True: this is because the null hypothesis of a partial F test is that the additional betas
(i.e., those that appear only in the full model) all equal zero. Rejecting this null
hypothesis is tantamount to saying the full model should be used since at least one of
the additional betas is not zero.
(i) True: there is greater uncertainty in predicting an individual value of Y than there is
with predicting a conditional mean of Y.
(j) True: a larger VIF implies that multicollinearity is more likely to be a problem. Page
163 in our text provides some suggested guidelines about how large a VIF must be
before multicollinearity becomes problematic.
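The point made in (c) about the adjusted $R^2$ can be illustrated numerically. The sketch below is written in Python rather than Stata and uses the standard formula, adjusted $R^2 = 1 - (1 - R^2)(n - 1)/(n - k - 1)$. The sample size and the two $R^2$ values are hypothetical and are chosen only to show the penalty at work; they do not come from any dataset in this assignment.

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R-squared for n observations and k explanatory variables."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

# Hypothetical numbers (not from the assignment): adding a third variable
# raises R-squared only slightly, from 0.500 to 0.501, with n = 30.
before = adjusted_r2(0.500, n=30, k=2)
after = adjusted_r2(0.501, n=30, k=3)

print(f"adjusted R2 with 2 variables: {before:.4f}")
print(f"adjusted R2 with 3 variables: {after:.4f}")
print("adjusted R2 fell:", after < before)
```

In this illustration the ordinary $R^2$ rises while the adjusted $R^2$ falls, which is the behavior described in (c).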
2. The textbook dataset DIV4 contains data from a random sample of 42 firms listed on the
S&P 500 in 2003. The dividend yield (DIVYIELD), the earnings per share (EPS), and the
stock price (PRICE) were recorded for these 42 firms. The following questions should be
answered using Stata (see footnote 1).
(a) Use Stata’s correlate command to obtain a correlation matrix of DIVYIELD, EPS,
and PRICE. Show Stata’s output in your solutions.
(b) Use Stata’s univar command to obtain summary statistics on DIVYIELD, EPS, and
PRICE. Show Stata’s output in your solutions.
(c) Use Stata to run a multiple linear regression using DIVYIELD as the dependent
variable. Show Stata’s output in your solutions.
(d) What is the sample regression equation relating DIVYIELD to PRICE and EPS?
Please use an equation editor to type this equation and be precise with your notation.
Round values to the nearest hundredth.
(e) What percentage of the variation of DIVYIELD has been explained by the
regression?
(f) Test the overall fit of the regression. Use a 10% level of significance and the p-value
method. State the hypotheses to be tested, the decision rule, the p value, and your
decision. What conclusion can be drawn from the test result?
(g) Consider the relationship between DIVYIELD and EPS. Given the results you
obtained in (a) through (f), do you believe there is strong evidence that a statistical
relationship exists between these two variables? Briefly explain.
Solution:
Footnote 1: Please do the following to show your work that appears on the screen in Stata: highlight the output in Stata, right-click
and choose “Copy as picture”, then paste this into your solutions. If desired, you can then crop this picture and/or stretch
it to a suitable size.
(a) Command: correlate divyield eps price

             | divyield      eps    price
-------------+---------------------------
    divyield |   1.0000
         eps |   0.2397   1.0000
       price |   0.0244   0.6494   1.0000
(e) About 8.73% of the variation in dividend yield is explained by the regression model.
(f) Test the overall fit (i.e., F test):
Hypotheses: $H_0: \beta_1 = \beta_2 = 0$ versus $H_1: \beta_1 \neq 0$ and/or $\beta_2 \neq 0$
Decision rule: Reject the null if p-value $< \alpha$, where $\alpha = 0.10$
Test statistic: p-value = 0.1684
Decision: Do not reject the null hypothesis at the 10% level.
Conclusion: there is insufficient evidence that either $\beta_1$ or $\beta_2$ differs from zero. Taken
together, EPS and PRICE are not shown to be useful in explaining the variation in dividend yield.
(g) No, there is not strong evidence. The individual coefficient for EPS is statistically
significant, but we should be distrustful of this because (1) multicollinearity seems likely
given the correlation between EPS and PRICE (0.6494); and (2) the F test fails to reject the
hypothesis that both coefficients equal zero.
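The p-value used in the overall F test can be cross-checked from the $R^2$ reported in (e). The Python sketch below assumes all 42 firms enter the regression, so the residual degrees of freedom are 42 − 2 − 1 = 39. It relies on the fact that an F distribution with exactly 2 numerator degrees of freedom has a closed-form upper-tail probability; this is a check added for illustration, not the Stata output itself.

```python
# Values reported in the solution above.
r2 = 0.0873     # R-squared from (e), rounded to four digits
n, k = 42, 2    # 42 firms; 2 explanatory variables (EPS, PRICE)
alpha = 0.10    # significance level from the question

# Assumption: no observations were dropped, so df_resid = n - k - 1.
df_resid = n - k - 1

# Overall F statistic written in terms of R-squared.
f_stat = (r2 / k) / ((1.0 - r2) / df_resid)

# Closed form that holds only when the numerator df equals 2:
# P(F > f) = (1 + 2 f / df_resid) ** (-df_resid / 2)
p_value = (1.0 + 2.0 * f_stat / df_resid) ** (-df_resid / 2.0)

reject_null = p_value < alpha

print(f"F({k}, {df_resid}) = {f_stat:.2f}")
print(f"p-value = {p_value:.4f}")
print("reject H0 at the 10% level:", reject_null)
```

Because the $R^2$ is rounded, the computed p-value agrees with the reported 0.1684 only to about three decimal places.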
3. The textbook dataset COLLEGE4 contains data from 195 public and private schools across
the U.S. that appeared in a 2003 issue of Kiplinger’s Personal Finance. The following
questions should be answered using Stata (see footnote 2). The regression model should use GRADRATE4
as the dependent variable. The following variables are included in the dataset:
GRADRATE4 Proportion of students who earned a bachelor’s degree in 4 years
(a) Do you expect GRADRATE4 to have a positive, negative, or no relationship with each
of the other variables in the dataset? Briefly explain.
(b) Use Stata’s graph matrix command to obtain a scatterplot matrix for all of the variables
in the dataset. Include this graph in your solutions. Briefly comment on whether these
scatterplots match your expectations in (a).
(d) Use Stata to run a multiple linear regression using GRADRATE4 as the dependent
variable, and ADMISRATE, SFACRATIO, and AVGDEBT as the explanatory
variables. Show Stata’s output in your solutions.
(e) Stata calculates variance inflation factors (VIFs) after a regression has been run by
entering the command vif. Run this command and show the output in your solutions.
Based on the three guidelines discussed in the textbook on p. 163, do you believe
multicollinearity is a serious problem here? Briefly explain.
(f) Use Stata’s test command to perform a partial F test of the null hypothesis that the
coefficient on AVGDEBT equals zero. Show Stata’s output in your solutions. Based
on this test, what can you conclude?
(g) Use the regression output in (d) to fill in the blanks for the following statements:
After controlling for student-faculty ratio and the average student debt level, the analysis
indicates that, on average, a school’s graduation rate will change by _____ percentage point(s)
when the admission rate increases by 10 percentage points.
Footnote 2: In order to show your work that appears in the results window in Stata, please do the following: highlight the output in
Stata, right-click and choose “Copy as picture”, then paste this into your solutions. If desired, you can then crop this
picture and/or stretch it to a suitable size.
Solution:
(a) I expect to observe the following relationships:
(b) The top row of the scatterplot matrix plots the dependent variable (GRADRATE4) on the vertical axis
and the explanatory variables on the horizontal axes. The scatterplot of GRADRATE4 against
ADMISRATE shows a negative relationship, which is consistent with my expectation in (a).
The scatterplot of GRADRATE4 against SFACRATIO also shows a negative relationship, which is
consistent with my expectation in (a). Lastly, the scatterplot of GRADRATE4 against AVGDEBT
does not clearly show any relationship.
(c) Command: univar gradrate4 admisrate sfacratio avgdebt

                                           -------------- Quantiles --------------
Variable      n       Mean      S.D.     Min        .25        Mdn        .75        Max
----------------------------------------------------------------------------------------
gradrate4   195       0.55      0.24    0.00       0.32       0.58       0.77       0.91
admisrate   195       0.54      0.21    0.08       0.36       0.55       0.70       0.95
sfacratio   195      13.18      4.22    3.00      10.00      13.00      17.00      24.00
avgdebt     195   15454.82   4755.33    0.00   13926.00   15580.00   17400.00   28217.00
----------------------------------------------------------------------------------------
(e) No, it does not appear that multicollinearity is a serious problem. None of the individual
VIFs (1.61, 1.60, and 1.01) is larger than 10. The average VIF is 1.41, which is not
considerably larger than 1. Finally, none of the VIFs is larger than $1/(1 - R^2) = 2.279$.
(f) F(1, 191) = 0.05
    Prob > F = 0.8296
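Based on this test, we fail to reject the null hypothesis that the coefficient on AVGDEBT equals
zero (p-value = 0.8296). After controlling for ADMISRATE and SFACRATIO, there is no evidence
that AVGDEBT is linearly related to GRADRATE4.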
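The three VIF guidelines applied in the multicollinearity answer above can be checked with a few lines of arithmetic. The Python sketch below uses only the VIFs and the $1/(1 - R^2)$ value reported in that answer. The regression $R^2$ is not shown in this solution, so the code backs it out from the reported 2.279; treat that figure as implied rather than quoted.

```python
vifs = [1.61, 1.60, 1.01]   # individual VIFs reported in the solution
model_threshold = 2.279     # reported value of 1 / (1 - R^2) for the regression

# Guideline 1: is any individual VIF larger than 10?
any_above_ten = any(v > 10 for v in vifs)

# Guideline 2: is the average VIF considerably larger than 1?
mean_vif = sum(vifs) / len(vifs)

# Guideline 3: is any VIF larger than 1 / (1 - R^2) of the fitted model?
any_above_model = any(v > model_threshold for v in vifs)

# R-squared implied by the reported threshold (not quoted in the solution).
implied_r2 = 1.0 - 1.0 / model_threshold

print("any VIF above 10:", any_above_ten)
print(f"average VIF: {mean_vif:.2f}")
print("any VIF above 1/(1 - R^2):", any_above_model)
print(f"implied regression R-squared: {implied_r2:.4f}")
```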
4. Go to the St. Louis Federal Reserve Bank’s FRED database and obtain quarterly data on the
“30-year fixed rate mortgage average in the United States” (see footnote 3). The data begins in 1971Q2 and
ends in 2018Q3. Be sure to properly format and set the data as time series (I recommend you
follow the three-step process summarized on slide #11 in the tutorial “Time Series Data – Pt
1”). The following model is used to forecast mortgage rates:
$y_t = \beta_0 + \beta_1 y_{t-1} + e_t$
where $y_t$ is the mortgage rate at time t and $y_{t-1}$ is the mortgage rate during the previous
quarter.
(a) Use Stata’s tsline command to obtain a time-series line plot of the 30-year fixed
mortgage rates in the U.S. Give this graph a title and include it in your solutions.
(b) Use Stata’s univar command along with the tin option to obtain the average mortgage
rates over the following time periods (see footnote 4). You need only calculate and report the average
mortgage rates—you do not need to show the output from Stata for this question.
(c) Use Stata to estimate the regression model above from 1971Q2 to 2018Q2. This requires
you to use the tin option, as well as the lag operator (the lag operator is addressed in the
Stata tutorial “Time Series Data – Pt 2”). Show Stata’s output in your solutions.
(d) Interpret the coefficient of determination in (c). Do you believe this regression model is
causal or extrapolative? Briefly explain.
(e) Use the regression from (c) to forecast the mortgage rate in 2018Q3. Do this by using
Stata’s predict command (see the Stata tutorial “Post Estimation Commands”). Report
the forecasted value in your solutions and briefly discuss the accuracy of this forecast.
Footnote 3: The data is available here: https://fred.stlouisfed.org/series/MORTGAGE30US. The frequency of the data should be
changed from weekly to quarterly, which can be done via the following steps: (1) select “EDIT GRAPH”; (2) select the
drop-down menu below “Modify Frequency”; and (3) choose quarterly. Next, download this dataset as an Excel file and
then copy/paste or import the data from Excel into Stata (you may want to watch the videos posted on Blackboard under
Module 4 about moving data from Excel to Stata).
Footnote 4: The tin option is covered in the Stata tutorial “Time Series Data – Pt 1” from module 3.
(f) Use the regression from (c) to forecast the mortgage rate in 2018Q4. Do this forecast
without using Stata. Show your work.
Solution:
Before the data is analyzed, it must be properly formatted and “set” as time series data. I
accomplish this task using the 3-step process: (i) generate time = tq(1971q2)+_n-1; (ii)
format time %tq; and (iii) tsset time, quarterly.
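The quarterly index created in step (i) can be reasoned about outside Stata. The Python sketch below mirrors Stata's quarterly numbering, in which 1960Q1 is quarter 0, and counts how many observations remain once one quarter is lost to the lag. The helper name tq is borrowed from the Stata function purely for readability; this is an illustration of the counting, not a replacement for the Stata steps.

```python
def tq(year: int, quarter: int) -> int:
    """Quarters elapsed since 1960Q1, mirroring Stata's %tq numbering (1960q1 = 0)."""
    return (year - 1960) * 4 + (quarter - 1)

start = tq(1971, 2)        # first quarter in the downloaded data
end_sample = tq(2018, 2)   # last quarter used to estimate the model
end_data = tq(2018, 3)     # last quarter in the downloaded data

quarters_in_data = end_data - start + 1
quarters_in_sample = end_sample - start + 1

# The first quarter has no lagged value, so one observation is lost.
obs_with_lag = quarters_in_sample - 1

print("index of 1971Q2:", start)
print("quarters in data:", quarters_in_data)
print("quarters in estimation window:", quarters_in_sample)
print("usable observations with one lag:", obs_with_lag)
```

The final count agrees with the number of observations shown in the regression output for the model estimated over 1971Q2 to 2018Q2.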
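      Source |       SS           df       MS      Number of obs   =       188
-------------+----------------------------------   F(1, 186)       =   8018.55
       Model |  1857.58434         1  1857.58434   Prob > F        =    0.0000
    Residual |  43.0889463       186  .231661002   R-squared       =    0.9773
-------------+----------------------------------   Adj R-squared   =    0.9772
       Total |  1900.67329       187  10.1640283   Root MSE        =    .48131

------------------------------------------------------------------------------
mortgage30us |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
mortgage30us |
         L1. |   .9918335   .0110762    89.55   0.000     .9699823    1.013685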
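The summary statistics in the regression output are all functions of the sums of squares, degrees of freedom, coefficient, and standard error printed in the same table. The Python sketch below recomputes them from those printed values as a consistency check. It uses the standard textbook identities and is an added illustration, not part of the Stata output.

```python
import math

# Values copied from the regression output above.
ss_model, df_model = 1857.58434, 1
ss_resid, df_resid = 43.0889463, 186
ss_total, df_total = 1900.67329, 187
coef, std_err = 0.9918335, 0.0110762

# Mean squares are sums of squares divided by their degrees of freedom.
ms_model = ss_model / df_model
ms_resid = ss_resid / df_resid
ms_total = ss_total / df_total

r2 = ss_model / ss_total               # coefficient of determination
adj_r2 = 1.0 - ms_resid / ms_total     # adjusted R-squared
f_stat = ms_model / ms_resid           # overall F statistic
root_mse = math.sqrt(ms_resid)         # standard error of the regression
t_stat = coef / std_err                # t statistic for the lag coefficient

print(f"R-squared     = {r2:.4f}")
print(f"Adj R-squared = {adj_r2:.4f}")
print(f"F(1, 186)     = {f_stat:.2f}")
print(f"Root MSE      = {root_mse:.5f}")
print(f"t             = {t_stat:.2f}")
```

With a single explanatory variable the F statistic equals the square of the t statistic, up to rounding in the printed coefficient and standard error.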
(d) The coefficient of determination indicates that about 97.73% of the variation in the
mortgage rate from 1971Q2 to 2018Q2 is explained by the mortgage rate in the preceding
quarter. The relationship between the mortgage rate at time t and the mortgage rate in the
previous time period is extrapolative since the previous quarter’s mortgage rate does not
literally cause the mortgage rate to take a specific value in the following quarter.
The actual and predicted values can be obtained by looking at the data editor. Alternatively,
the list command can be used: list time mortgage30us yhat if tin(2018q2,2018q3)
The forecasted rate of 4.55% in 2018Q3 is very close to the actual 4.57% rate.
(f) The forecasted rate for 2018Q4 is about 4.584%. This was obtained by using the estimated
regression equation from (c) and plugging in the actual mortgage rate in 2018Q3 (4.57%) for the
lagged dependent variable:
$\hat{y}_t = 0.051 + 0.992\, y_{t-1}$
$\hat{y}_{2018Q4} = 0.051 + 0.992\, y_{2018Q3} = 0.051 + 0.992 \times 4.57 = 4.584$
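The hand calculation for 2018Q4 can be reproduced in a few lines. The Python sketch below uses the rounded coefficients from the worked equation and the actual 2018Q3 rate of 4.57. For comparison it also plugs in the 4.55 forecast for 2018Q3 in place of the actual rate; that second variant is an illustration added here and is not the method the question asks for.

```python
# Rounded estimates used in the worked equation above.
b0, b1 = 0.051, 0.992

actual_2018q3 = 4.57     # actual mortgage rate in 2018Q3
forecast_2018q3 = 4.55   # rate forecasted for 2018Q3 in part (e)

# One-step-ahead forecast using the actual lagged value.
from_actual = b0 + b1 * actual_2018q3

# Variant for comparison only: feed the earlier forecast back in.
from_forecast = b0 + b1 * forecast_2018q3

print(f"2018Q4 forecast from actual 2018Q3 rate:     {from_actual:.3f}")
print(f"2018Q4 forecast from forecasted 2018Q3 rate: {from_forecast:.4f}")
```

The two results differ by about 0.02 percentage points, so it matters which 2018Q3 value is used as the lag.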