
Econ 251

Problem Set #4

SOLUTIONS

This problem set introduces you to multiple linear regression in STATA, simple
hypothesis tests about the parameters, and omitted variable bias.

Instructions:
Following each question, please handwrite or type your answers and copy/paste the
STATA output (please, use the ‘copy as picture’ option).

Problem 1: Ability bias in the estimated return to education (10 points in total)
Note: Please, refer to your class notes when solving this problem; we discussed this
exact same example in class.
The problem uses STATA file WAGE2.dta. Download the STATA file WAGE2.dta
from the CANVAS website. FYI: this data was used in a publication by Blackburn and
Neumark (1992).
WAGE2.dta contains information on monthly earnings, employment history, education,
demographic characteristics, and two test scores for 935 men in year 1980. In particular,
it contains the following variables:
wage monthly earnings (in 1976 USD)
hours average weekly hours of work
IQ IQ (intelligence quotient) score
educ years of education
age age in years
married =1 if the person is married
black =1 if the person is black

Suppose that the population model for log-earnings (lwage) is given by:
lwage = 𝜷0 + 𝜷1educ + 𝜷2ability + v.
Since a person’s ability is not observed, you estimate the simple linear regression model
lwage = 𝜷0 + 𝜷1educ + u
instead, where ability is part of the error term u.

1. What is the OLS estimate of the slope on education from the simple linear regression
and how do you interpret it? (1 point)

Hint: You will need to first generate a variable lwage equal to the natural logarithm of
variable wage. In order to do so, type in STATA:
gen lwage=log(wage)

Solution:
The OLS estimate of the slope on education is 0.06 and it implies that one more year
of education increases log-wages by 0.06, on average. (Alternatively, you could
interpret this as a percentage: one more year of education increases wages by 6
percent, on average.)

. reg lwage educ

      Source |       SS       df       MS              Number of obs =     935
-------------+------------------------------           F(  1,   933) =  100.70
       Model |  16.1377042     1  16.1377042           Prob > F      =  0.0000
    Residual |  149.518579   933  .160255712           R-squared     =  0.0974
-------------+------------------------------           Adj R-squared =  0.0964
       Total |  165.656283   934  .177362188           Root MSE      =  .40032

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .0598392   .0059631    10.03   0.000     .0481366    .0715418
       _cons |   5.973063   .0813737    73.40   0.000     5.813366    6.132759
------------------------------------------------------------------------------
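Just FYI: the 6-percent reading uses the approximation that a change in log-wages is roughly a proportional change in wages. If you are curious about the exact implied percent change, it can be computed directly (shown here in Python rather than STATA; the coefficient is taken from the output above):

```python
import math

# The slope on educ from the regression output above.
b_educ = 0.0598392

# Approximate reading: 100 * (change in log-wage) percent.
approx_pct = b_educ * 100                  # about 5.98 percent

# Exact reading: the percent change in the wage level implied by the coefficient.
exact_pct = (math.exp(b_educ) - 1) * 100   # about 6.17 percent

print(round(approx_pct, 2), round(exact_pct, 2))
```

The two readings are close because the coefficient is small; the approximation deteriorates for larger coefficients.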

2. What is the key condition for the OLS estimate of 𝜷1 you obtained in part (i) to be
unbiased? Does this condition hold? (1 point)

Solution:
The key condition for the OLS estimate of 𝜷1 we obtained in part (i) to be unbiased
is: cov(educ, u)=0 (exogeneity). You could also state this verbally: education should
be uncorrelated with any other factor that affects a person’s log-wage.
This condition clearly fails to hold: education is correlated with a person’s ability
level, and ability affects wages, i.e. it is part of u.

3. What is the direction of the bias of the OLS estimate of 𝜷1 from the simple linear
regression model? Explain. (2 points)
Solution:
Questions about the direction of the bias in the OLS estimate are, by far, the most
important questions in this class. They are also the hardest, so please read the
solution carefully and let me know if you have any questions.
As a general rule, whenever there is an omitted variable bias question, there are two
things to consider:
1) How is the omitted variable (ability) related to the included variable
(educ) – do more able people get more education, on average, or less? Or,
stated statistically: is corr (educ, ability) > 0 or corr (educ, ability) < 0?
2) How is the omitted variable (ability) related to the Y-variable (lwage) – do
more able people earn more or less, on average? In other words: do we think
𝜷2 > 0 or 𝜷2 < 0 in the population model? (Recall our population model is:
lwage = 𝜷0 + 𝜷1educ + 𝜷2ability + v).

In our particular example we expect that: 1) more able people get more education,
on average, i.e. corr (educ, ability) > 0, and 2) more able people earn higher wages,
i.e. 𝜷2 > 0. Since corr (educ, ability) and 𝜷2 have the same sign (in our case both of
them are >0) the bias in the OLS estimate of 𝜷1 from the simple linear regression is
positive (also called an upward bias). Loosely speaking, this means that we estimated
𝜷1 to be larger than it actually is.
In order to get full points it was enough to simply state that the bias in the OLS
estimate from the simple linear regression is positive, as more able people get more
education and they also earn higher wages.

IMPORTANT NOTE
As a general rule, whenever the correlation between the included variable X1 and
the excluded variable X2, and 𝜷2 – the effect of the omitted variable on Y – have the
same sign (either both positive or both negative), the bias in the OLS estimate from
the simple linear regression is positive. Whenever they have different signs, the OLS
bias is negative. See the table below from the textbook:

         | corr (X1, X2) > 0 | corr (X1, X2) < 0
---------+-------------------+-------------------
 𝜷2 > 0  |   Positive bias   |   Negative bias
 𝜷2 < 0  |   Negative bias   |   Positive bias
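Just FYI: you can see all four cells of this table in action with a small simulation. The Python sketch below uses made-up parameter values (true slope beta1 = 1, beta2 = ±0.5, and a regressor correlation of either sign) and checks the direction of the bias in the SLR slope in each case:

```python
import random

# Hypothetical simulation of the four cells of the sign-rule table: the sign of
# the omitted-variable bias equals sign(corr(X1, X2)) times sign(beta2).
random.seed(2)
n = 20_000
beta1 = 1.0  # true slope on the included variable X1

def slr_slope(x, y):
    # OLS slope from a simple regression of y on x: sample cov(x, y) / var(x)
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

bias = {}
for corr_sign in (+1, -1):
    for beta2 in (+0.5, -0.5):
        x2 = [random.gauss(0, 1) for _ in range(n)]               # omitted variable
        x1 = [corr_sign * v + random.gauss(0, 1) for v in x2]     # corr(X1, X2) has corr_sign
        y = [beta1 * a + beta2 * b + random.gauss(0, 1)
             for a, b in zip(x1, x2)]
        bias[(corr_sign, beta2)] = slr_slope(x1, y) - beta1       # estimated bias

for key, b in sorted(bias.items()):
    print(key, "positive bias" if b > 0 else "negative bias")
```

With this sample size the printed signs match the table: the bias is positive exactly when corr(X1, X2) and beta2 share a sign.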

4. One way to solve (or at least moderate) the omitted variable bias of 𝜷1 in the simple
linear regression model is to include variable IQ (a person’s IQ test score) in the
regression. This variable is not exactly the same thing as the omitted variable ability but
it is closely related to it. The IQ test score is a so-called proxy for ability.
Let’s see what happens to the OLS estimate of 𝜷1 if we add IQ to our regression. Run a
regression of lwage on educ and IQ (type reg lwage educ IQ). What is the OLS estimate
of 𝜷1 from this regression and how do you interpret it? Why is it smaller than the estimate
you obtained in part (i)? Explain. (2 points)

Solution:

. reg lwage educ IQ

      Source |       SS       df       MS              Number of obs =     935
-------------+------------------------------           F(  2,   932) =   69.42
       Model |  21.4779447     2  10.7389723           Prob > F      =  0.0000
    Residual |  144.178339   932  .154697788           R-squared     =  0.1297
-------------+------------------------------           Adj R-squared =  0.1278
       Total |  165.656283   934  .177362188           Root MSE      =  .39332

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .0391199   .0068382     5.72   0.000     .0256998      .05254
          IQ |   .0058631   .0009979     5.88   0.000     .0039047    .0078215
       _cons |   5.658288   .0962408    58.79   0.000     5.469414    5.847162
------------------------------------------------------------------------------

The OLS estimate of the slope on education from the multiple linear regression is
0.04 and it implies that one more year of education increases log-wages by 0.04, on
average (or, alternatively, one more year of education increases wages by 4 percent,
on average), HOLDING IQ TEST SCORE FIXED.

The estimate got smaller because the OLS estimate from the simple linear regression
was biased up.

Just FYI:

How are the slope estimates on educ from the SLR and MLR related?

𝛽̂1(SLR) = 𝛽̂1(MLR) + 𝛽̂2 ⋅ 𝛼̂1

where 𝛽̂2 is the slope estimate on IQ in the MLR, and 𝛼̂1 is the slope estimate in a
regression of IQ on educ (𝛼̂1 = 3.53 – see below).

. reg IQ educ

      Source |       SS       df       MS              Number of obs =     935
-------------+------------------------------           F(  1,   933) =  338.02
       Model |  56280.9277     1  56280.9277           Prob > F      =  0.0000
    Residual |  155346.531   933  166.502177           R-squared     =  0.2659
-------------+------------------------------           Adj R-squared =  0.2652
       Total |  211627.459   934  226.581862           Root MSE      =  12.904

------------------------------------------------------------------------------
          IQ |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   3.533829   .1922095    18.39   0.000     3.156616    3.911042
       _cons |   53.68715   2.622933    20.47   0.000     48.53962    58.83469
------------------------------------------------------------------------------

. di .0391199 + .0058631*3.533829
.05983909
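Just FYI: this decomposition is not an approximation – it holds exactly in any sample. The Python sketch below verifies it on simulated data (all numbers made up, with x2, x1 and y playing the roles of IQ, educ and lwage):

```python
import random

# Check the exact in-sample identity: b1_SLR = b1_MLR + b2_MLR * a1,
# where a1 is the slope from regressing the omitted regressor on the included one.
random.seed(1)
n = 500
x2 = [random.gauss(100, 15) for _ in range(n)]            # plays the role of IQ
x1 = [10 + 0.05 * v + random.gauss(0, 1) for v in x2]     # plays the role of educ
y = [5 + 0.04 * a + 0.006 * b + random.gauss(0, 0.4)
     for a, b in zip(x1, x2)]                             # plays the role of lwage

def s(u, v):
    # sum of demeaned cross-products
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))

s11, s22, s12 = s(x1, x1), s(x2, x2), s(x1, x2)
s1y, s2y = s(x1, y), s(x2, y)

b1_slr = s1y / s11                          # slope from the SLR of y on x1
den = s11 * s22 - s12 ** 2
b1_mlr = (s22 * s1y - s12 * s2y) / den      # slope on x1 in the MLR
b2_mlr = (s11 * s2y - s12 * s1y) / den      # slope on x2 in the MLR
a1 = s12 / s11                              # slope from regressing x2 on x1

print(abs(b1_slr - (b1_mlr + b2_mlr * a1)) < 1e-9)   # identity holds exactly
```

The two-regressor MLR slopes here are computed from the demeaned normal equations, which is what STATA's `regress` does under the hood for this special case.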

5. Now consider the general version of the population regression model:


Y = β0 + β1X1 + β2X2 + u,
where cov(X1,u)=0.
Suppose that instead you estimate the simple linear regression model:
Y = β0 + β1X1 + v.
Show formally that if both cov(X1, X2)>0 and β2>0 the OLS estimate of β1 from the
simple linear model is biased up. (3 points)
Hint: Use the fact that

plim 𝛽̂1 = plim [ Σi (X1i − X̄1)(Yi − Ȳ) / Σi (X1i − X̄1)² ] = cov(X1, Y) / var(X1)

(sums over i = 1, …, n) by the Law of Large Numbers. Replace Y with
β0 + β1X1 + β2X2 + u from the population model in cov(X1, Y), and work from there.
Consult your lecture notes if you experience difficulties.

Solution:

plim 𝛽̂1 = plim [ Σi (X1i − X̄1)(Yi − Ȳ) / Σi (X1i − X̄1)² ] = cov(X1, Y) / var(X1)

        = cov(X1, β0 + β1X1 + β2X2 + u) / var(X1)

        = β1 + β2 ⋅ cov(X1, X2) / var(X1) > β1   (using cov(X1, u) = 0),

since cov(X1, X2) > 0 and β2 > 0 imply that their product is a positive number
(remember the variance is always > 0).
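Just FYI: you can also check this result numerically. The Python sketch below simulates a large sample with made-up values beta1 = 0.5, beta2 = 0.8 and cov(X1, X2) = 0.6, and compares the SLR slope on X1 to beta1 + beta2·cov(X1, X2)/var(X1):

```python
import random

# Rough numerical check of the derivation: with a large sample, the SLR slope on
# X1 should be close to beta1 + beta2 * cov(X1, X2) / var(X1). Values are made up.
random.seed(3)
n = 100_000
beta0, beta1, beta2 = 2.0, 0.5, 0.8

x2 = [random.gauss(0, 1) for _ in range(n)]
x1 = [0.6 * v + random.gauss(0, 1) for v in x2]   # cov(X1, X2) = 0.6, var(X1) = 1.36
y = [beta0 + beta1 * a + beta2 * b + random.gauss(0, 1)
     for a, b in zip(x1, x2)]

mx, my = sum(x1) / n, sum(y) / n
slope = (sum((a - mx) * (b - my) for a, b in zip(x1, y))
         / sum((a - mx) ** 2 for a in x1))

theory = beta1 + beta2 * 0.6 / 1.36   # plim of the SLR slope, about 0.853
print(slope, theory)                  # both well above the true beta1 = 0.5
```

The estimated slope sits well above the true beta1 = 0.5, by roughly the amount the formula predicts.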

Problem 2 (20 points in total)


This problem is part of a Midterm exam question from the Fall term of 2015.

The problem uses STATA file SKIPPED.dta, containing the following variables:
termGPA GPA after the Fall term
firstyearGPA cumulative first year GPA
skipped number of classes skipped during the Fall term
final average final exam score

You wish to investigate whether skipping classes in college affects a student’s GPA.
Dataset SKIPPED.dta contains data on 680 sophomores (second year students) with
information on each student’s Fall term GPA during their second year (termGPA), and
the number of classes they skipped during that term (skipped).
1. Estimate a simple linear regression model
termGPA = 𝜷0 + 𝜷1skipped + u.
(i) What are the OLS estimates of the intercept and slope parameters? Interpret
these estimates, being as specific as possible. (2 points)

Solution:

. reg termGPA skipped

      Source |       SS       df       MS              Number of obs =     680
-------------+------------------------------           F(  1,   678) =  309.40
       Model |  115.436802     1  115.436802           Prob > F      =  0.0000
    Residual |  252.960735   678  .373098429           R-squared     =  0.3133
-------------+------------------------------           Adj R-squared =  0.3123
       Total |  368.397537   679  .542558964           Root MSE      =  .61082

------------------------------------------------------------------------------
     termGPA |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     skipped |  -.0755857   .0042971   -17.59   0.000     -.084023   -.0671484
       _cons |   3.043399   .0343692    88.55   0.000     2.975916    3.110881
------------------------------------------------------------------------------

Intercept = 3.04. This means that the estimated population average term GPA for
students who skipped no classes during the term is 3.04.
Slope = -0.08. This means that one more skipped class during the term is estimated
to decrease the student’s term GPA by 0.08, on average.

(ii) Use the regression output to test the hypothesis that the number of skipped classes
is unrelated to college GPA. State the null and alternative hypotheses in terms of the
notation used in class. What do you conclude? Use a significance level of 5%. (2 points)

Solution:
The null hypothesis is: H0: 𝜷1 = 0 (the number of classes the student skipped during
the term has no effect on their term GPA).
The alternative hypothesis is: H1: 𝜷1 ≠ 0.
We reject the null hypothesis as the p-value=0.000<0.05 (the significance level).
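Just FYI: the regression output computes this test for us, but it is easy to reproduce by hand. The Python sketch below recomputes the t-statistic from the coefficient and standard error above, and its two-sided p-value (using the normal approximation, which is very accurate with 678 degrees of freedom):

```python
import math

# By-hand version of the test in the Stata output: t = estimate / std. error,
# with a two-sided p-value from the normal approximation to the t distribution.
b1, se = -0.0755857, 0.0042971   # slope on skipped and its std. err. from above

t = b1 / se                                     # about -17.59, as in the output
p_value = math.erfc(abs(t) / math.sqrt(2))      # 2 * P(Z > |t|)

print(round(t, 2))
print(p_value < 0.05)   # True: reject H0 at the 5% level
```

The p-value is astronomically small here, which is why Stata displays it as 0.000.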
(iii) Why might the OLS estimate of the slope from the simple linear regression
be biased for the true causal effect of skipping classes on a student’s GPA?
(2 points)
Solution:
As usual, we are worried that exogeneity may fail to hold, i.e. that cov(X,u)≠0. In
other words, we are worried that the number of classes a student skipped during the
term may be correlated with other factors that determine the student's term GPA.
E.g. a more motivated student may skip fewer classes, and motivation affects term
GPA, i.e. it is one of the factors in the error term u. If so, the OLS estimate of 𝜷1
from the simple linear regression is biased.

2. Now estimate the multiple linear regression model
termGPA = 𝜷0 + 𝜷1skipped + 𝜷2 firstyearGPA+ u.
(i) What are the OLS estimates of the intercept and both slope parameters?
(3 points)
Solution:
Slope on skipped = -0.05. This means skipping one more class during the term is
estimated to decrease the population mean term GPA by 0.05, holding first year
GPA fixed.
Slope on firstyearGPA = 0.68. This means that every 1-point increase in first year’s
GPA is estimated to increase the population mean term GPA by 0.68, holding
number of skipped classes fixed.
Intercept = 1.10. This means that the estimated mean term GPA for someone with a
zero first year GPA who skipped zero classes in the term is 1.10 (as we said in class,
the intercept does not always have a meaningful interpretation as nobody really has
a 0 first year GPA).

. reg termGPA skipped firstyearGPA

      Source |       SS       df       MS              Number of obs =     680
-------------+------------------------------           F(  2,   677) =  371.21
       Model |  192.687884     2  96.3439422           Prob > F      =  0.0000
    Residual |  175.709652   677  .259541584           R-squared     =  0.5230
-------------+------------------------------           Adj R-squared =  0.5216
       Total |  368.397537   679  .542558964           Root MSE      =  .50945

------------------------------------------------------------------------------
     termGPA |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     skipped |  -.0463716   .0039639   -11.70   0.000    -.0541547   -.0385886
firstyearGPA |   .6848606   .0396966    17.25   0.000     .6069173    .7628038
       _cons |    1.10083   .1161888     9.47   0.000     .8726964    1.328964
------------------------------------------------------------------------------

(ii) Why did the estimated slope on skipped increase (notice it became less
negative) in the multiple linear regression compared to the simple linear
regression? (5 points)
Hint: There are two things to consider: what is the sign of the slope on firstyearGPA
in this model, and how are variables skipped and firstyearGPA related?

Solution:
The OLS estimate of 𝜷1 (i.e. the slope on skipped) increased because in the simple
regression it was biased DOWN (it had a NEGATIVE bias) as we did not control for
variable firstyearGPA.
The reason for this negative bias is that: 1) students with a higher first year GPA
skip fewer classes, on average (as they are better students), i.e. cov(skipped,
firstyearGPA) < 0, and 2) students with a higher first year GPA have a higher GPA in
the Fall term of their second year as well, i.e. 𝜷2 > 0. Since the
covariance/correlation between the omitted variable (firstyearGPA) and the included
variable (skipped) has a different sign than 𝜷2 – the effect of the omitted variable
on termGPA – the bias in the OLS estimate of 𝜷1 from the simple linear regression
is negative. Loosely speaking, this means that we estimated 𝜷1 to be lower (more
negative) than it truly is – skipping classes is not as bad as we first estimated it
to be in the SLR.

A brief discussion:

The whole point is that students who skip classes and students who don't are not the
same quality students (in particular, those who skip more classes are worse
students), which is why OLS in SLR cannot yield an unbiased estimate of the effect
we’re after. In order to account for the quality of the student, we include first year
GPA in the regression – in a sense it is a summary measure of the student's
motivation, ability, etc. BEFORE the student got to the Fall term of the second year.
Including first year GPA in the regression would allow us to hold it fixed, i.e. we
would be able to compare two students with the exact same first year GPA but e.g.
one of them skipped one more class than the other. Since their first year GPA was
exactly the same, we can conclude that before the Fall term these students were
similar in terms of their motivation, ability and knowledge, and if we see a
difference in their fall term GPA we can plausibly attribute it to the difference in the
number of skipped classes. This is by no means to say we are measuring the causal
effect of first year GPA on fall term GPA. In a way we can think of first year GPA
as a "proxy" for the quality of the student - their motivation, ability and
knowledge.

(iii) What is the R2 of the regression and how do you interpret it? Why did it
increase compared to the simple linear regression? (2 points)

Solution:
R2 = 0.52, which means that 52% of the total variation in term GPA is explained by
variation in the number of skipped classes during the term and first year GPA.
It is bigger than R2 from the simple regression because the R2 never decreases when
we add a variable in the regression: skipped and firstyearGPA should explain at least
as much of the variation in termGPA as skipped alone.
(iv) At the 5% significance level (α=0.05), do you reject the null hypothesis that
the slope on skipped equals zero? What about the slope on firstyearGPA? Why or
why not? (2 points)

Solution:
The p-value associated with H0: 𝜷1 = 0 is 0.000, and the p-value associated with H0:
𝜷2 = 0 is 0.000 as well. Both p-values are less than α=0.05 (our significance level),
so we reject both of these null hypotheses.

(v) On average, who has a higher predicted term GPA: a student with a perfect 4.0
first-year GPA who skipped 3 classes during the term, or a student with a 3.0
first-year GPA who skipped no classes during the term? (2 points)

Solution:
The predicted term GPA is given by:
termGPA-hat = 1.10 − 0.05·skipped + 0.68·firstyearGPA.
A 4.0 first-year GPA student who skipped 3 classes during the term:
termGPA-hat = 1.10 − 0.05·3 + 0.68·4 = 3.67
A 3.0 first-year GPA student who skipped 0 classes during the term:
termGPA-hat = 1.10 − 0.05·0 + 0.68·3 = 3.14
Clearly, the 4.0 first year GPA student has a higher predicted GPA in the Fall term
of the second year, even though s/he skipped 3 classes.
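Just FYI: the two predictions are easy to check in code. The Python sketch below uses the rounded coefficients from the fitted model; Stata's unrounded coefficients give roughly 3.70 and 3.16 instead, and the ranking of the two students is the same either way:

```python
# Predictions from the fitted model using the rounded coefficients above:
# termGPA-hat = 1.10 - 0.05*skipped + 0.68*firstyearGPA.
def predicted_gpa(skipped, firstyear_gpa, b0=1.10, b1=-0.05, b2=0.68):
    return b0 + b1 * skipped + b2 * firstyear_gpa

student_a = predicted_gpa(skipped=3, firstyear_gpa=4.0)   # 4.0 GPA, skipped 3
student_b = predicted_gpa(skipped=0, firstyear_gpa=3.0)   # 3.0 GPA, skipped 0

print(round(student_a, 2), round(student_b, 2))   # 3.67 3.14
print(student_a > student_b)                      # True
```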

Problem 3: Algebra of OLS (5 points for part a, part b is optional)

a. You estimate a simple regression model Y = 𝛽0 + 𝛽1X + u with N = 3 observations.

Suppose that X1 = 1, X2 = −1, 𝑢̂1 = 1 and 𝑢̂2 = 0. Calculate X3 and 𝑢̂3.

Hint: Use the first normal equation Σi 𝑢̂i = 0 (sums over i = 1, …, N) and the fact
that 𝑢̂1 = 1 and 𝑢̂2 = 0 to find 𝑢̂3. Then use the values of X1, X2, 𝑢̂1, 𝑢̂2 and 𝑢̂3
and the second normal equation Σi 𝑢̂iXi = 0 to find X3.
Solution:
From the first normal equation Σi 𝑢̂i = 0:

𝑢̂1 + 𝑢̂2 + 𝑢̂3 = 0.

Substituting 𝑢̂1 = 1 and 𝑢̂2 = 0: 1 + 0 + 𝑢̂3 = 0.

From here: 𝑢̂3 = −1.

From the second normal equation Σi 𝑢̂iXi = 0:

𝑢̂1X1 + 𝑢̂2X2 + 𝑢̂3X3 = 0.

Substituting X1 = 1, X2 = −1, 𝑢̂1 = 1, 𝑢̂2 = 0 and 𝑢̂3 = −1: 1⋅1 + 0⋅(−1) + (−1)⋅X3 = 0.

From here: X3 = 1.
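Just FYI: the algebra above can be double-checked in a couple of lines of Python, solving each normal equation for the one unknown it contains:

```python
# Recover u3-hat and X3 from the two normal equations, given the known values.
X1, X2 = 1, -1
u1, u2 = 1, 0

u3 = -(u1 + u2)                    # first normal equation: u1 + u2 + u3 = 0
X3 = -(u1 * X1 + u2 * X2) / u3     # second normal equation: u1*X1 + u2*X2 + u3*X3 = 0

print(u3, X3)   # -1 1.0
```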

b. OPTIONAL: You will get full credit without this part.


Throughout this class, we are going to be using STATA to calculate regression lines, but
it is a good idea to compute R2 by hand once.

In problem set 3 you calculated the OLS estimates of the intercept and slope from a
regression of Y on X (see data points below) as 𝛽̂0 = 1, 𝛽̂1 = 0.5, and you also
calculated Ȳ = 4. Calculate the R2 of this regression by filling in the table below,
proceeding the same way as the calculations for observation 1. Also illustrate the
normal equations, i.e. that Σi 𝑢̂i = 0 and Σi 𝑢̂iXi = 0.
Show all your calculations in the table below!

Obs. i | Xi | Yi | Ŷi = 𝛽̂0 + 𝛽̂1·Xi   | 𝑢̂i = Yi − Ŷi     | Yi − Ȳ   | (Yi − Ŷi)²  | (Yi − Ȳ)² | 𝑢̂iXi
-------+----+----+--------------------+-------------------+----------+-------------+-----------+------------
   1   | 3  | 3  | Ŷ1 = 1+0.5·3 = 2.5 | 𝑢̂1 = 3−2.5 = 0.5 | 3−4 = −1 | 0.5² = 0.25 |     1     | 0.5·3 = 1.5
   2   | 9  | 6  | Ŷ2 =               | 𝑢̂2 =             |          |             |           |
   3   | 6  | 3  | Ŷ3 =               | 𝑢̂3 =             |          |             |           |
   Σ   |    |    |                    | Σi 𝑢̂i =          |          | SSE = sum of the above | SST = sum of the above | Σi 𝑢̂iXi =

From here, R2 = 1 − SSE/SST (note this also equals SSR/SST).

OPTIONAL: Double-check your calculation with STATA following the steps below:
1) Import the data into STATA following the same steps as in Problem set 3, rename the
two variables as Y and X, then regress Y on X. What is the R2 of the regression?
2) Now double-check that this is the same as the squared sample correlation between X
and Y. Type:
corr Y X
and square the number you get to see it is the same as the R2 from 1) after rounding to the
4th decimal.

Solution:

Obs. i | Xi | Yi | Ŷi = 𝛽̂0 + 𝛽̂1·Xi   | 𝑢̂i = Yi − Ŷi     | Yi − Ȳ   | (Yi − Ŷi)² | (Yi − Ȳ)² | 𝑢̂iXi
-------+----+----+--------------------+-------------------+----------+------------+-----------+------------
   1   | 3  | 3  | Ŷ1 = 1+0.5·3 = 2.5 | 𝑢̂1 = 3−2.5 = 0.5 | 3−4 = −1 |    0.25    |     1     | 0.5·3 = 1.5
   2   | 9  | 6  | Ŷ2 = 1+0.5·9 = 5.5 | 𝑢̂2 = 6−5.5 = 0.5 | 6−4 = 2  |    0.25    |     4     | 0.5·9 = 4.5
   3   | 6  | 3  | Ŷ3 = 1+0.5·6 = 4   | 𝑢̂3 = 3−4 = −1    | 3−4 = −1 |     1      |     1     | −1·6 = −6
   Σ   |    |    |                    | Σi 𝑢̂i = 0        |          | SSE = 1.5  |  SST = 6  | Σi 𝑢̂iXi = 0

From here, R2 = 1 − SSE/SST = 1 − 1.5/6 = 1 − 0.25 = 0.75.
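Just FYI: the whole table can be reproduced in a few lines of Python, which also illustrates the normal equations:

```python
# By-hand replication of the table, using the fitted line Yhat = 1 + 0.5*X
# from problem set 3.
X = [3, 9, 6]
Y = [3, 6, 3]
b0, b1 = 1, 0.5

Yhat = [b0 + b1 * x for x in X]            # fitted values: 2.5, 5.5, 4.0
u = [y - yh for y, yh in zip(Y, Yhat)]     # residuals: 0.5, 0.5, -1.0
Ybar = sum(Y) / len(Y)                     # sample mean of Y = 4

SSE = sum(e ** 2 for e in u)               # residual sum of squares = 1.5
SST = sum((y - Ybar) ** 2 for y in Y)      # total sum of squares = 6
R2 = 1 - SSE / SST                         # = 0.75

print(sum(u), sum(e * x for e, x in zip(u, X)))  # normal equations: 0.0 0.0
print(R2)                                        # 0.75
```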

The STATA output confirms this result:

. reg Y X

      Source |       SS       df       MS              Number of obs =       3
-------------+------------------------------           F(  1,     1) =    3.00
       Model |         4.5     1         4.5           Prob > F      =  0.3333
    Residual |         1.5     1         1.5           R-squared     =  0.7500
-------------+------------------------------           Adj R-squared =  0.5000
       Total |           6     2           3           Root MSE      =  1.2247

------------------------------------------------------------------------------
           Y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           X |         .5   .2886751     1.73   0.333    -3.167965    4.167965
       _cons |          1   1.870829     0.53   0.687    -22.77113    24.77113
------------------------------------------------------------------------------

We also confirm that R2 equals the squared sample correlation between X and Y:
. corr Y X
(obs=3)

             |        Y        X
-------------+------------------
           Y |   1.0000
           X |   0.8660   1.0000

. di 0.8660^2
.749956
