ECON 601 Module 4 Problem Set

Fall 2019

Your solutions should be typed and well organized. You need to explain / show all of the steps you
used to arrive at your answer. Submit your work through Blackboard as a Word or pdf file.

1. Indicate whether the following statements are true or false, along with a brief explanation.

(a) The assumptions regarding the disturbance term e for a multiple linear regression are the
same as those for a simple linear regression.

(b) The coefficient of determination is not useful when analyzing a multiple linear
regression.

(c) Unlike adjusted $R^2$, the coefficient of determination can decrease as additional
explanatory variables are added to the model.

(d) Heteroskedasticity occurs if the disturbances/errors are not independent of one another.

(e) Multicollinearity occurs when an explanatory variable is correlated with the dependent
variable.

(f) Multicollinearity makes it more likely that a researcher will commit a Type I error when
evaluating $H_0: \beta_1 = 0$.

(g) While an explanatory variable can be lagged in a time series regression model, it is not
possible to lag the dependent variable in a time series regression model.

(h) The full model should be used when a researcher conducts a partial F test and rejects the
null hypothesis.

(i) For a multiple linear regression, there is less uncertainty in predicting a conditional mean
of Y than in predicting an individual value of Y.

(j) A larger value for the variance inflation factor (VIF) implies it is more likely that
multicollinearity is a problem.

Solution:
(a) True: the assumptions are that the disturbances (i) are zero on average; (ii) are normally
distributed; (iii) have a constant variance; and (iv) are independent of one another.
(b) False: the coefficient of determination is still useful in the context of a multiple linear
regression since it indicates the proportion of the variation in Y explained by the model.
(c) False. It is mathematically impossible for the coefficient of determination to decrease as
explanatory variables are added to a regression model. This fact implies that the
coefficient of determination is a poor criterion to assess whether additional variables
should be added to a model. A better criterion in this regard is the adjusted $R^2$ since it
can decrease if an additional variable does not contribute much to the model.
(d) False: autocorrelation occurs if the disturbances/errors are not independent of one
another. Heteroskedasticity occurs if the variance of the errors is not constant over
different values of the explanatory variable(s).
(e) False: multicollinearity occurs when an explanatory variable is correlated with another
explanatory variable.
(f) False: multicollinearity can result in a disproportionately large standard deviation of the
regression coefficient (i.e., the “standard error” of $b_1$ is large). This implies that the t
statistic will be small, which makes it less likely that the hypothesis $H_0: \beta_1 = 0$ will be
rejected. A Type II error occurs when a researcher fails to reject $H_0: \beta_1 = 0$ when, in
fact, the hypothesis is false. Thus, multicollinearity makes it more likely a Type II error
will be committed.
(g) False: both the explanatory variable(s) and the dependent variable can be lagged in a
time series regression model.
(h) True: this is because the null hypothesis of a partial F test is that the additional betas
(i.e., those that appear only in the full model) all equal zero. Rejecting this null
hypothesis is tantamount to saying the full model should be used since at least one of
the additional betas is not zero.
(i) True: there is greater uncertainty in predicting an individual value of Y than there is
with predicting a conditional mean of Y.
(j) True: a larger VIF implies that multicollinearity is more likely to be a problem. Page
163 in our text provides some suggested guidelines about how large a VIF must be
before multicollinearity becomes problematic.
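
The reasoning in (f) and (j) can be made concrete with the standard variance expression for an OLS slope estimate. This is only a sketch in notation that does not appear in the problem set itself: $R_j^2$ is assumed to denote the $R^2$ from regressing explanatory variable $x_j$ on the other explanatory variables, $SST_j$ the total variation in $x_j$, and $s_e^2$ the estimated variance of the disturbances.

```latex
\begin{align*}
\mathrm{VIF}_j &= \frac{1}{1 - R_j^2}, \\
\widehat{\mathrm{Var}}(b_j) &= \frac{s_e^2}{SST_j}\,\mathrm{VIF}_j,
  \qquad SST_j = \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2, \\
t &= \frac{b_j}{\sqrt{\widehat{\mathrm{Var}}(b_j)}} .
\end{align*}
```

As $R_j^2$ approaches 1, the VIF grows, the standard error grows with its square root, and the t statistic shrinks toward zero. That is why a false null hypothesis becomes harder to reject when multicollinearity is present.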

2. The textbook dataset DIV4 contains data from a random sample of 42 firms listed on the
S&P 500 in 2003. The dividend yield (DIVYIELD), the earnings per share (EPS), and the
stock price (PRICE) were recorded for these 42 firms. The following questions should be
answered using Stata.1

(a) Use Stata’s correlate command to obtain a correlation matrix of DIVYIELD, EPS,
and PRICE. Show Stata’s output in your solutions.

(b) Use Stata’s univar command to obtain summary statistics on DIVYIELD, EPS, and
PRICE. Show Stata’s output in your solutions.

(c) Use Stata to run a multiple linear regression using DIVYIELD as the dependent
variable. Show Stata’s output in your solutions.

(d) What is the sample regression equation relating DIVYIELD to PRICE and EPS?
Please use an equation editor to type this equation and be precise with your notation.
Round values to the nearest hundredth digit.

(e) What percentage of the variation of DIVYIELD has been explained by the
regression?

(f) Test the overall fit of the regression. Use a 10% level of significance and the p-value
method. State the hypotheses to be tested, the decision rule, the p value, and your
decision. What conclusion can be drawn from the test result?

(g) Consider the relationship between DIVYIELD and EPS. Given the results you
obtained in (a) through (f), do you believe there is strong evidence that a statistical
relationship exists between these two variables? Briefly explain.

Solution:

1 Please do the following to show your work that appears on the screen in Stata: highlight the output in Stata, right-click
and choose “Copy as picture”, then paste this into your solutions. If desired, you can then crop this picture and/or stretch
it to a suitable size.
(a) Command: correlate divyield eps price

             | divyield      eps    price
-------------+---------------------------
    divyield |   1.0000
         eps |   0.2397   1.0000
       price |   0.0244   0.6494   1.0000

(b) Command: univar divyield eps price


                                    -------------- Quantiles --------------
Variable      n     Mean     S.D.      Min      .25      Mdn      .75      Max
------------------------------------------------------------------------------
divyield     42     2.75     1.88     0.18     0.98     2.87     3.94     6.93
eps          42     1.92     1.21     0.12     1.08     1.73     2.51     5.05
price        42    29.26    14.58     4.00    18.00    29.00    36.00    65.00
------------------------------------------------------------------------------

(c) Command: regress divyield eps price


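      Source |       SS           df       MS      Number of obs   =        42
-------------+----------------------------------   F(2, 39)        =      1.87
       Model |  12.6765509         2  6.33827547   Prob > F        =    0.1684
    Residual |  132.531837        39  3.39825223   R-squared       =    0.0873
-------------+----------------------------------   Adj R-squared   =    0.0405
       Total |  145.208388        41    3.541668   Root MSE        =    1.8434

------------------------------------------------------------------------------
    divyield |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         eps |   .6040673   .3138289     1.92   0.062    -.0307115    1.238846
       price |  -.0293105    .025961    -1.13   0.266    -.0818215    .0232005
       _cons |   2.450218   .6529224     3.75   0.001     1.129558    3.770878
------------------------------------------------------------------------------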

(d) The estimated equation is: $\widehat{\text{DIVYIELD}}_i = 2.45 + 0.60 \times \text{EPS}_i - 0.03 \times \text{PRICE}_i$

(e) About 8.73% of the variation in dividend yield is explained by the regression model.
(f) Test the overall fit (i.e., F test):
Hypotheses: $H_0: \beta_1 = \beta_2 = 0$
            $H_1: \beta_1 \neq 0$ and/or $\beta_2 \neq 0$
Decision rule: Reject the null if p-value < alpha, where alpha = 0.10
Test statistic: F(2, 39) = 1.87, with p-value = 0.1684
Decision: Do not reject the null hypothesis at the 10% level.

Conclusion: there is insufficient evidence at the 10% level that either $\beta_1$ or $\beta_2$
differs from zero. The regression does not show that EPS and PRICE, taken together, are useful
in explaining the variation in dividend yield.
(g) No, there is not strong evidence. The individual coefficient for EPS is statistically
significant at the 10% level (p-value = 0.062), but we should be distrustful of this because
(1) multicollinearity seems likely given the correlation of 0.6494 between EPS and PRICE; and
(2) the F test fails to reject the null hypothesis that both coefficients are equal to zero.
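
As a check on (e) and (f), the summary statistics in the regression output in (c) can be reproduced by hand from its sums of squares. This is a sketch under the usual definitions, with $n = 42$ observations and $K = 2$ explanatory variables; SSR, SSE, and SST are labels chosen here for the Model, Residual, and Total rows of the output.

```latex
\begin{align*}
R^2 &= \frac{SSR}{SST} = \frac{12.6765509}{145.208388} \approx 0.0873, \\
F &= \frac{SSR / K}{SSE / (n - K - 1)} = \frac{6.33827547}{3.39825223} \approx 1.87, \\
R^2_{\text{adj}} &= 1 - \frac{SSE / (n - K - 1)}{SST / (n - 1)}
  = 1 - \frac{3.39825223}{3.541668} \approx 0.0405 .
\end{align*}
```

All three values agree with the R-squared, F(2, 39), and Adj R-squared entries that Stata reports.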
3. The textbook dataset COLLEGE4 contains data from 195 public and private schools across
the U.S. that appeared in a 2003 issue of Kiplinger’s Personal Finance. The following
questions should be answered using Stata.2 The regression model should use GRADRATE4
as the dependent variable. The following variables are included in the dataset:
GRADRATE4 Proportion of students who earned a bachelor’s degree in 4 years

ADMISRATE Proportion of student applicants admitted to the school

SFACRATIO Average number of students per faculty

AVGDEBT Average student debt at graduation

(a) Do you expect GRADRATE4 to have a positive, negative, or no relationship with each
of the other variables in the dataset? Briefly explain.

(b) Use Stata’s graph matrix command to obtain a scatterplot matrix for all of the variables
in the dataset. Include this graph in your solutions. Briefly comment on whether these
scatterplots match your expectations in (a).

(c) Use Stata’s univar command to obtain summary statistics on GRADRATE4,
ADMISRATE, SFACRATIO, and AVGDEBT. Show Stata’s output in your
solutions.

(d) Use Stata to run a multiple linear regression using GRADRATE4 as the dependent
variable, and ADMISRATE, SFACRATIO, and AVGDEBT as the explanatory
variables. Show Stata’s output in your solutions.

(e) Stata calculates variance inflation factors (VIFs) after a regression has been run by
entering the command vif. Run this command and show the output in your solutions.
Based on the three guidelines discussed in the textbook on p. 163, do you believe
multicollinearity is a serious problem here? Briefly explain.

(f) Use Stata’s test command to perform a partial F test of the null hypothesis that the
coefficient on AVGDEBT equals zero. Show Stata’s output in your solutions. Based
on this test, what can you conclude?

(g) Use the regression output in (d) to fill in the blanks for the following statements:

After controlling for student-faculty ratio and the average student debt level, the analysis
indicates that, on average, a school’s graduation rate will change by _____ percentage point(s)
when the admission rate increases by 10 percentage points.

2 In order to show your work that appears in the results window in Stata, please do the following: highlight the output in
Stata, right-click and choose “Copy as picture”, then paste this into your solutions. If desired, you can then crop this
picture and/or stretch it to a suitable size.
Solution:
(a) I expect to observe the following relationships:

• GRADRATE4 is expected to be negatively associated with ADMISRATE (more students
are admitted, but some of these students may not be academically prepared to succeed
and may not graduate in 4 years);

• GRADRATE4 is expected to be negatively associated with SFACRATIO (more students
per faculty may mean less attention given to a student and a greater likelihood he/she will
not graduate in 4 years);

• GRADRATE4 is expected to be negatively associated with AVGDEBT (the greater the
debt level the more likely a student is to be working and/or having serious financial
difficulties, making it less likely that he/she will graduate within 4 years).

(b) Command: graph matrix gradrate4 admisrate sfacratio avgdebt

The top row of scatterplots plots the dependent variable (GRADRATE4) on the vertical axis
and the explanatory variables on the horizontal axes. The scatterplot of GRADRATE4 against
ADMISRATE shows a negative relationship, which is consistent with my expectation in (a).
The scatterplot of GRADRATE4 against SFACRATIO also shows a negative relationship, which is
consistent with my expectation in (a). Lastly, the scatterplot of GRADRATE4 against AVGDEBT
does not clearly show any relationship, so it does not support the negative relationship
expected in (a).

(c) Command: univar gradrate4 admisrate sfacratio avgdebt
                                        -------------- Quantiles --------------
Variable      n      Mean      S.D.       Min       .25       Mdn       .75       Max
-------------------------------------------------------------------------------------
gradrate4   195      0.55      0.24      0.00      0.32      0.58      0.77      0.91
admisrate   195      0.54      0.21      0.08      0.36      0.55      0.70      0.95
sfacratio   195     13.18      4.22      3.00     10.00     13.00     17.00     24.00
avgdebt     195  15454.82   4755.33      0.00  13926.00  15580.00  17400.00  28217.00
-------------------------------------------------------------------------------------

(d) Command: regress gradrate4 admisrate sfacratio avgdebt


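      Source |       SS           df       MS      Number of obs   =       195
-------------+----------------------------------   F(3, 191)       =     81.47
       Model |  6.10245957         3  2.03415319   Prob > F        =    0.0000
    Residual |  4.76907572       191  .024968983   R-squared       =    0.5613
-------------+----------------------------------   Adj R-squared   =    0.5544
       Total |  10.8715353       194  .056038842   Root MSE        =    .15802

------------------------------------------------------------------------------
   gradrate4 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   admisrate |  -.3797602   .0689821    -5.51   0.000    -.5158247   -.2436956
   sfacratio |  -.0278918   .0034038    -8.19   0.000    -.0346056    -.021178
     avgdebt |   5.17e-07   2.40e-06     0.22   0.830    -4.21e-06    5.25e-06
       _cons |   1.109542   .0514142    21.58   0.000     1.008129    1.210954
------------------------------------------------------------------------------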

(e) Command: vif


    Variable |       VIF       1/VIF
-------------+----------------------
   admisrate |      1.61    0.620539
   sfacratio |      1.60    0.625162
     avgdebt |      1.01    0.989159
-------------+----------------------
    Mean VIF |      1.41

No, it does not appear that multicollinearity is a serious problem. None of the individual
VIFs (1.61, 1.60, and 1.01) is larger than 10. The average VIF is 1.41, which is not
considerably larger than 1. Finally, none of the VIFs is larger than
$1/(1 - R^2) = 1/(1 - 0.5613) \approx 2.279$.
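
Each VIF can also be read as a statement about how well one explanatory variable is predicted by the others. The following is a sketch that assumes the standard definition $\mathrm{VIF}_j = 1/(1 - R_j^2)$, where $R_j^2$ (notation introduced here, not in the problem set) is the $R^2$ from regressing variable $j$ on the remaining explanatory variables. Under that definition the 1/VIF column of the output equals $1 - R_j^2$.

```latex
\begin{align*}
R^2_{\text{admisrate}} &= 1 - 0.620539 = 0.379461, \\
R^2_{\text{sfacratio}} &= 1 - 0.625162 = 0.374838, \\
R^2_{\text{avgdebt}}   &= 1 - 0.989159 = 0.010841 .
\end{align*}
```

Read this way, no explanatory variable has more than about 38% of its variation explained by the other two.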

(f) Command: test avgdebt==0


( 1) avgdebt = 0

F( 1, 191) = 0.05
Prob > F = 0.8296

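Based on this test, we fail to reject the null hypothesis that the coefficient on AVGDEBT is
zero (p-value = 0.8296). After controlling for ADMISRATE and SFACRATIO, there is no evidence
that AVGDEBT is linearly related to GRADRATE4.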
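
When a partial F test involves a single coefficient, the F statistic equals the square of that coefficient's t statistic, so the test result can be checked against the regression output in (d). This is a sketch using the rounded coefficient and standard error reported there, so the figures are approximate.

```latex
\begin{align*}
t &= \frac{b_{\text{avgdebt}}}{s_{b_{\text{avgdebt}}}}
   = \frac{5.17 \times 10^{-7}}{2.40 \times 10^{-6}} \approx 0.215, \\
F &= t^2 \approx 0.215^2 \approx 0.05 .
\end{align*}
```

The two p-values agree as well: 0.830 in the regression output and 0.8296 from the test command.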

(g) Answer: -3.798 (i.e., a decrease of 3.798 percentage points)

The coefficient on ADMISRATE is -0.3798. Thus, holding SFACRATIO and AVGDEBT constant, if
ADMISRATE increases by 1 percentage point, then GRADRATE4 decreases by 0.3798 percentage
point on average. And if ADMISRATE increases by 10 percentage points, then GRADRATE4
decreases by 3.798 percentage points.

4. Go to the St. Louis Federal Reserve Bank’s FRED database and obtain quarterly data on the
“30-year fixed rate mortgage average in the United States”.3 The data begins in 1971Q2 and
ends in 2018Q3. Be sure to properly format and set the data as time series (I recommend you
follow the three-step process summarized on slide #11 in the tutorial “Time Series Data – Pt
1”). The following model is used to forecast mortgage rates:

$y_t = \beta_0 + \beta_1 y_{t-1} + e_t$

where $y_t$ is the mortgage rate at time $t$ and $y_{t-1}$ is the mortgage rate during the previous
quarter.

(a) Use Stata’s tsline command to obtain a time-series line plot of the 30-year fixed
mortgage rates in the U.S. Give this graph a title and include it in your solutions.

(b) Use Stata’s univar command along with the tin option to obtain the average mortgage
rates over the following time periods.4 You need only calculate and report the average
mortgage rates—you do not need to show the output from Stata for this question.

Time Period Average 30-Year Mortgage Rate


1971Q2 – 1979Q4
1980Q1 – 1989Q4
1990Q1 – 1999Q4
2000Q1 – 2009Q4
2010Q1 – 2018Q3

(c) Use Stata to estimate the regression model above from 1971Q2 to 2018Q2. This requires
you to use the tin option, as well as the lag operator (the lag operator is addressed in the
Stata tutorial “Time Series Data – Pt 2”). Show Stata’s output in your solutions.

(d) Interpret the coefficient of determination in (c). Do you believe this regression model is
causal or extrapolative? Briefly explain.

(e) Use the regression from (c) to forecast the mortgage rate in 2018Q3. Do this by using
Stata’s predict command (see the Stata tutorial “Post Estimation Commands”). Report
the forecasted value in your solutions and briefly discuss the accuracy of this forecast.

3 The data is available here: https://fred.stlouisfed.org/series/MORTGAGE30US. The frequency of the data should be
changed from weekly to quarterly, which can be done via the following steps: (1) select “EDIT GRAPH”; (2) select the
drop-down menu below “Modify Frequency”; and (3) choose quarterly. Next, download this dataset as an Excel file and
then copy/paste or import the data from Excel into Stata (you may want to watch the videos posted on Blackboard under
Module 4 about moving data from Excel to Stata).
4 The tin option is covered in the Stata tutorial “Time Series Data – Pt 1” from module 3.
(f) Use the regression from (c) to forecast the mortgage rate in 2018Q4. Do this forecast
without using Stata. Show your work.

Solution:
Before the data is analyzed, it must be properly formatted and “set” as time series data. I
accomplish this task using the 3-step process: (i) generate time = tq(1971q2)+_n-1; (ii)
format time %tq; and (iii) tsset time, quarterly.

(a) Command: tsline mortgage30us

(b) Commands:
univar mortgage30us if tin(,1979q4)
univar mortgage30us if tin(1980q1,1989q4)
univar mortgage30us if tin(1990q1,1999q4)
univar mortgage30us if tin(2000q1,2009q4)
univar mortgage30us if tin(2010q1,)

Time Period Average 30-Year Mortgage Rate


1971Q2 – 1979Q4 8.90%
1980Q1 – 1989Q4 12.70%
1990Q1 – 1999Q4 8.12%
2000Q1 – 2009Q4 6.29%
2010Q1 – 2018Q3 4.09%

(c) Command: regress mortgage30us L.mortgage30us if tin(1971q2, 2018q2)

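      Source |       SS           df       MS      Number of obs   =       188
-------------+----------------------------------   F(1, 186)       =   8018.55
       Model |  1857.58434         1  1857.58434   Prob > F        =    0.0000
    Residual |  43.0889463       186  .231661002   R-squared       =    0.9773
-------------+----------------------------------   Adj R-squared   =    0.9772
       Total |  1900.67329       187  10.1640283   Root MSE        =    .48131

------------------------------------------------------------------------------
mortgage30us |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
mortgage30us |
         L1. |   .9918335   .0110762    89.55   0.000     .9699823    1.013685
             |
       _cons |   .0511952   .0967348     0.53   0.597    -.1396432    .2420335
------------------------------------------------------------------------------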

(d) The coefficient of determination indicates that about 97.73% of the variation in the
mortgage rate from 1971Q2 to 2018Q2 is explained by the mortgage rate in the preceding
quarter. The relationship between the mortgage rate at time t and the mortgage rate in the
previous time period is extrapolative since the previous quarter’s mortgage rate does not
literally cause the mortgage rate to take a specific value in the following quarter.
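
The reported coefficient of determination follows directly from the sums of squares in the output in (c). This is a sketch in which SSR and SST are labels chosen here for the Model and Total rows. It also uses the fact that, with one explanatory variable, the F statistic is the square of the slope's t statistic.

```latex
\begin{align*}
R^2 &= \frac{SSR}{SST} = \frac{1857.58434}{1900.67329} \approx 0.9773, \\
F &= t^2 \approx 89.55^2 \approx 8019 .
\end{align*}
```

The second value differs slightly from the reported F(1, 186) = 8018.55 only because the t statistic is shown rounded to two decimals.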

(e) Command: predict yhat, xb

The actual and predicted values can be obtained by looking at the data editor. Alternatively,
the list command can be used: list time mortgage30us yhat if tin(2018q2,2018q3)

Quarter    Actual mortgage rate    Forecasted mortgage rate
2018Q2     4.54
2018Q3     4.57                    4.554119

The forecasted rate of 4.55% in 2018Q3 is very close to the actual 4.57% rate.
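
The value that predict returns for 2018Q3 can be reproduced by hand from the unrounded coefficients in (c) and the actual 2018Q2 rate of 4.54. This is a sketch of that arithmetic, with the forecast error defined here as actual minus forecast.

```latex
\begin{align*}
\hat{y}_{\text{2018Q3}} &= 0.0511952 + 0.9918335 \times 4.54 \approx 4.5541, \\
y_{\text{2018Q3}} - \hat{y}_{\text{2018Q3}} &\approx 4.57 - 4.5541 = 0.0159 .
\end{align*}
```

An error of roughly 0.016 percentage points is small relative to the Root MSE of 0.48131 reported in (c).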

(f) The forecasted rate for 2018Q4 is about 4.58%. This was obtained by using the estimated
regression equation from (c) and plugging in the actual mortgage rate in 2018Q3 for the
lagged dependent variable.

$\hat{y}_t = 0.051 + 0.992\,y_{t-1} \;\Rightarrow\; \hat{y}_{2018Q4} = 0.051 + 0.992 \times y_{2018Q3} = 0.051 + 0.992 \times 4.57 \approx 4.584$
