
Lecture 3

Multiple Regression Model


• Topics
• Multiple regression model
• Joint hypothesis testing
• Note: Control variables
1. Multiple regression model
• Consider a linear regression model with two independent
variables:
Yi = β0 + β1X1i + β2X2i + ui, i = 1,…,n

Y is the dependent variable


X1, X2 are the two independent variables
(Yi, X1i, X2i) denote the ith observation on Y, X1, and X2.
β0 = unknown population intercept
β1 = effect on Y of a change in X1, holding X2 constant
β2 = effect on Y of a change in X2, holding X1 constant
ui = the regression error (omitted factors)
The OLS estimator in multiple regression
• With two regressors, the OLS method solves:

min over b0, b1, b2 of  Σ(i=1,…,n) [Yi − (b0 + b1X1i + b2X2i)]²
• The OLS estimators minimize the average squared difference
between the actual values of Yi and the predicted values based
on the linear model.

• This minimization problem provides the OLS estimators of β0, β1 and β2
OLS assumptions
Yi = β0 + β1X1i + β2X2i + … + βkXki + ui, i = 1,…,n
 
1. The conditional distribution of u given the X-values has mean
zero, that is, E(ui|X1i = x1,…, Xki = xk) = 0.

2. (X1i,…,Xki,Yi), i =1,…,n, are i.i.d.

3. Large outliers are unlikely: X1,…, Xk, and Y have finite fourth
moments: E(X1i⁴) < ∞,…, E(Xki⁴) < ∞, E(Yi⁴) < ∞.

4. There is no perfect multicollinearity: no X-variable is an exact
linear function of the other X-variables (i.e., no perfect positive or
negative correlation between the X-variables).
Omitted variable bias
The error u arises because of factors, or variables, that influence
Y but are not included in the regression function. There are
always omitted variables that may explain some variation in Y.
 
Sometimes the omission of those variables can lead to bias in
the OLS estimator.

For omitted variable bias to occur, the omitted variable Z must
satisfy two conditions:
1. Z is a determinant of Y (i.e. Z is part of u); and
2. Z is correlated with X (i.e. corr(Z,X) ≠ 0)
Directions of bias

• The bias is determined by two things:
• Correlation between the omitted variable Z and Y
• Correlation between the omitted variable Z and X

• Four scenarios (illustrated in the simulation sketch below)
• Cov(Z, Y)>0 and Cov(Z, X)>0 gives positive bias in coefficient of X
• Cov(Z, Y)<0 and Cov(Z, X)<0 gives positive bias in coefficient of X
• Cov(Z, Y)>0 and Cov(Z, X)<0 gives negative bias in coefficient of X
• Cov(Z, Y)<0 and Cov(Z, X)>0 gives negative bias in coefficient of X
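The first of these scenarios can be checked with a small simulation. This is only a sketch on made-up data (the seed, variable names and coefficient values are hypothetical, not from the lecture dataset):

* Omitted variable bias, scenario 1: Cov(Z,Y) > 0 and Cov(Z,X) > 0
clear
set seed 12345
set obs 10000
generate z = rnormal()
generate x = 0.5*z + rnormal()           // Cov(X,Z) > 0
generate y = 1 + 2*x + 3*z + rnormal()   // true effect of x on y is 2
regress y x z    // full model: coefficient on x is close to 2
regress y x      // omitting z: coefficient on x is biased upward,
                 // about 2 + 3*Cov(x,z)/Var(x) = 2 + 3*0.5/1.25 = 3.2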
What can we do about omitted variable bias?
Three ways to overcome omitted variable bias:
1. Include the omitted variable in the model if you have the
information
2. Run a controlled experiment where treatments are randomly
assigned to subjects
3. Find an instrument for the omitted variable (more on that
later)
Multiple regression in Stata
• Please open Lecture_3.do in Stata
• Data: Risk_Field.dta
• (see Andersen et al. [2010] for further documentation)
Example with linear regression model
• Linear regression model with age, female, income and education as
independent variables

. regress crra age female IncLow IncHigh skilled longedu

Source | SS df MS Number of obs = 842
-------------+---------------------------------- F(6, 835) = 10.20
Model | 33.9285307 6 5.65475512 Prob > F = 0.0000
Residual | 462.712478 835 .55414668 R-squared = 0.0683
-------------+---------------------------------- Adj R-squared = 0.0616
Total | 496.641009 841 .590536277 Root MSE = .74441

------------------------------------------------------------------------------
crra | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | -.0110023 .0018369 -5.99 0.000 -.0146078 -.0073968
female | -.0047448 .0526605 -0.09 0.928 -.1081072 .0986177
IncLow | .0061799 .0645934 0.10 0.924 -.1206046 .1329644
IncHigh | -.1309954 .0659997 -1.98 0.047 -.2605402 -.0014507
skilled | .2007726 .0676065 2.97 0.003 .0680738 .3334713
longedu | .3303963 .0704457 4.69 0.000 .1921247 .4686678
_cons | .9947522 .1157919 8.59 0.000 .7674747 1.22203
------------------------------------------------------------------------------
Example with linear regression model
------------------------------------------------------------------------------
crra | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | -.0110023 .0018369 -5.99 0.000 -.0146078 -.0073968
female | -.0047448 .0526605 -0.09 0.928 -.1081072 .0986177
IncLow | .0061799 .0645934 0.10 0.924 -.1206046 .1329644
IncHigh | -.1309954 .0659997 -1.98 0.047 -.2605402 -.0014507
skilled | .2007726 .0676065 2.97 0.003 .0680738 .3334713
longedu | .3303963 .0704457 4.69 0.000 .1921247 .4686678
_cons | .9947522 .1157919 8.59 0.000 .7674747 1.22203
------------------------------------------------------------------------------

• Predicted value of crra = 0.995 − 0.011∙age − 0.005∙female + 0.006∙IncLow − 0.131∙IncHigh + 0.201∙skilled + 0.330∙longedu
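The fitted values from this equation can be generated for every observation with predict after the regression; a minimal sketch (the variable name crra_hat is made up):

* Fitted values from the most recent regression
predict crra_hat, xb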
Example with linear regression model
------------------------------------------------------------------------------
crra | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | -.0110023 .0018369 -5.99 0.000 -.0146078 -.0073968
female | -.0047448 .0526605 -0.09 0.928 -.1081072 .0986177
IncLow | .0061799 .0645934 0.10 0.924 -.1206046 .1329644
IncHigh | -.1309954 .0659997 -1.98 0.047 -.2605402 -.0014507
skilled | .2007726 .0676065 2.97 0.003 .0680738 .3334713
longedu | .3303963 .0704457 4.69 0.000 .1921247 .4686678
_cons | .9947522 .1157919 8.59 0.000 .7674747 1.22203
------------------------------------------------------------------------------

• Marginal effect of age: d(crra)/d(age) = −0.011
• Increasing age by one year reduces the predicted crra by 0.011
Example with linear regression model
------------------------------------------------------------------------------
crra | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | -.0110023 .0018369 -5.99 0.000 -.0146078 -.0073968
female | -.0047448 .0526605 -0.09 0.928 -.1081072 .0986177
IncLow | .0061799 .0645934 0.10 0.924 -.1206046 .1329644
IncHigh | -.1309954 .0659997 -1.98 0.047 -.2605402 -.0014507
skilled | .2007726 .0676065 2.97 0.003 .0680738 .3334713
longedu | .3303963 .0704457 4.69 0.000 .1921247 .4686678
_cons | .9947522 .1157919 8.59 0.000 .7674747 1.22203
------------------------------------------------------------------------------

• Marginal effect of female: d(crra)/d(female) = −0.005
• Women have on average a crra value that is 0.005 smaller than men
Exercises
• Estimate a linear regression model on crra with female,
young, middle and old as independent variables (assuming
homoskedasticity)
• What is the predicted value of crra for young men and women,
respectively?
• What is the predicted value of crra for old men and women,
respectively?
• What are the marginal effects of young and female on crra?
Run model with female, young, middle and old
. regress crra female young middle old

Source | SS df MS Number of obs = 846
-------------+---------------------------------- F(4, 841) = 13.52
Model | 30.042182 4 7.51054549 Prob > F = 0.0000
Residual | 467.047214 841 .55534746 R-squared = 0.0604
-------------+---------------------------------- Adj R-squared = 0.0560
Total | 497.089396 845 .588271474 Root MSE = .74522

------------------------------------------------------------------------------
crra | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
female | .0308326 .05184 0.59 0.552 -.0709183 .1325835
young | .1001164 .0846307 1.18 0.237 -.0659957 .2662285
middle | -.3525659 .0764231 -4.61 0.000 -.5025682 -.2025636
old | -.3198506 .0720565 -4.44 0.000 -.4612822 -.178419
_cons | .8300892 .0638664 13.00 0.000 .7047329 .9554456
------------------------------------------------------------------------------
Predicted crra values for young men and women
. regress crra female young middle old

Source | SS df MS Number of obs = 846
-------------+---------------------------------- F(4, 841) = 13.52
Model | 30.042182 4 7.51054549 Prob > F = 0.0000
Residual | 467.047214 841 .55534746 R-squared = 0.0604
-------------+---------------------------------- Adj R-squared = 0.0560
Total | 497.089396 845 .588271474 Root MSE = .74522

------------------------------------------------------------------------------
crra | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
female | .0308326 .05184 0.59 0.552 -.0709183 .1325835
young | .1001164 .0846307 1.18 0.237 -.0659957 .2662285
middle | -.3525659 .0764231 -4.61 0.000 -.5025682 -.2025636
old | -.3198506 .0720565 -4.44 0.000 -.4612822 -.178419
_cons | .8300892 .0638664 13.00 0.000 .7047329 .9554456
------------------------------------------------------------------------------

• crra(young men) = 0.830 + 0.100 = 0.930


• crra(young women) = 0.830 + 0.100 + 0.031 = 0.961
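These predictions (with standard errors) can also be computed directly from the estimated coefficients with lincom after the regression; a minimal sketch:

* Linear combinations of coefficients from the last regression
lincom _cons + young             // predicted crra for young men
lincom _cons + young + female    // predicted crra for young women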
Predicted crra values for old men and women
. regress crra female young middle old

Source | SS df MS Number of obs = 846
-------------+---------------------------------- F(4, 841) = 13.52
Model | 30.042182 4 7.51054549 Prob > F = 0.0000
Residual | 467.047214 841 .55534746 R-squared = 0.0604
-------------+---------------------------------- Adj R-squared = 0.0560
Total | 497.089396 845 .588271474 Root MSE = .74522

------------------------------------------------------------------------------
crra | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
female | .0308326 .05184 0.59 0.552 -.0709183 .1325835
young | .1001164 .0846307 1.18 0.237 -.0659957 .2662285
middle | -.3525659 .0764231 -4.61 0.000 -.5025682 -.2025636
old | -.3198506 .0720565 -4.44 0.000 -.4612822 -.178419
_cons | .8300892 .0638664 13.00 0.000 .7047329 .9554456
------------------------------------------------------------------------------

• crra(old men) = 0.830 − 0.320 = 0.510
• crra(old women) = 0.830 − 0.320 + 0.031 = 0.541
Marginal effects of young and female
. regress crra female young middle old

Source | SS df MS Number of obs = 846
-------------+---------------------------------- F(4, 841) = 13.52
Model | 30.042182 4 7.51054549 Prob > F = 0.0000
Residual | 467.047214 841 .55534746 R-squared = 0.0604
-------------+---------------------------------- Adj R-squared = 0.0560
Total | 497.089396 845 .588271474 Root MSE = .74522

------------------------------------------------------------------------------
crra | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
female | .0308326 .05184 0.59 0.552 -.0709183 .1325835
young | .1001164 .0846307 1.18 0.237 -.0659957 .2662285
middle | -.3525659 .0764231 -4.61 0.000 -.5025682 -.2025636
old | -.3198506 .0720565 -4.44 0.000 -.4612822 -.178419
_cons | .8300892 .0638664 13.00 0.000 .7047329 .9554456
------------------------------------------------------------------------------

• Marginal effect of young: d(crra)/d(young) = 0.100
• Marginal effect of female: d(crra)/d(female) = 0.031
Single hypothesis testing
• Probability distributions
• Linear transformations of normal distributions

• Standard normal and t-distributions

• Sampling distribution
• Distribution of sample mean values

• Normally distributed for sufficiently large n (typically > 30)

• We want to:
• Generate t-statistic for test of single (one- or two-sided) hypothesis

• t = ( estimated mean − hypothesized value ) / standard error of estimated mean


Transformation of normal distributions

[Figure: two density plots, the standard normal distribution N(0,1) (variable std_normal) and an alternative normal distribution N(4,4) (variable alt_normal), both plotted over the range −5 to 15]
Student t-distribution

[Figure: density of the Student t-distribution with a large degree of freedom, plotted over t-values from −4 to 4]
Population and sampling distributions
[Figure: population distribution of the random variable Y, 1 million observations ~ N(200, 50²), and the sampling distribution of the sample mean Y_bar for samples of size n = 100 ~ N(200, 5²), both plotted over the range 0 to 400]
Sampling distribution
• Population (normal) distribution with μ = 200 and σ = 50

• Look at a large number of random samples of size n = 100 from the
population distribution:

• Sampling distribution with μ = 200 and σ = 50/sqrt(n) = 5 when n = 100

• The standard deviation of the sampling distribution is called the
standard error, which we use in tests of hypotheses about sample
means (see the simulation sketch below)
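A hedged sketch of how such a sampling distribution can be simulated in Stata (the program name drawmean, the seed, and the number of replications are arbitrary choices):

* Draw 1,000 samples of size n = 100 from N(200, 50) and store each mean
capture program drop drawmean
program define drawmean, rclass
    clear
    set obs 100
    generate y = rnormal(200, 50)
    summarize y
    return scalar ybar = r(mean)
end
simulate ybar = r(ybar), reps(1000) nodots: drawmean
summarize ybar    // mean close to 200, std. dev. close to 50/sqrt(100) = 5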
Example with continuous independent variable
• Linear regression model with age as only independent variable

. regress crra age

------------------------------------------------------------------------------
crra | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | -.0109613 .0018155 -6.04 0.000 -.0145248 -.0073978
_cons | 1.14752 .0857724 13.38 0.000 .9791665 1.315873
------------------------------------------------------------------------------

• t-value = (− 0.011 − 0) / 0.0018 = −6.04


• H0: β1 = 0 and H1: β1 ≠ 0
• Two-sided test of β1 = 0: critical t-value = ±1.96 at 5% level
• p-value: pr(t-value < −6.04) + pr(t-value > 6.04) < 0.001
• 95% confidence interval of β1: − 0.011 ± 1.96 ∙ 0.0018
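The test statistic and confidence interval can be reproduced from the stored estimation results; a hedged do-file sketch:

regress crra age
test age = 0    // Wald test of H0: beta_age = 0; the reported F equals t squared
display _b[age] - invttail(e(df_r), 0.025)*_se[age]   // lower 95% bound
display _b[age] + invttail(e(df_r), 0.025)*_se[age]   // upper 95% bound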
2. Joint hypothesis testing
• Test of coefficient on single X-variable: t-test (same as before)

• Joint test of coefficients on several X-variables: F-test

CRRAi = β0 + β1Agei + β2femalei + β3IncLowi + β4IncHighi + β5skilledi + β6longedui + ui
 
The null hypothesis that “income doesn’t matter” for risk aversion,
and the alternative hypothesis that it does matter, corresponds to:
 H0: β3 = 0 and β4 = 0
H1: either β3 ≠ 0, or β4 ≠ 0, or both
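In Stata, this joint hypothesis is tested with the test command after regress; a minimal sketch:

regress crra age female IncLow IncHigh skilled longedu
test IncLow IncHigh    // H0: coefficients on both income variables are 0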
F-statistic under homoskedasticity
• Is there a significant improvement in the fit of the model
by relaxing the two restrictions?

F = [ (SSRrestricted − SSRunrestricted) / q ] / [ SSRunrestricted / (n − k − 1) ]

where q is the number of restrictions, n is the number of observations, and k is the number of X-variables in the unrestricted model.
"Restricted" refers to the model without the two income variables, and "unrestricted" refers to the full model with the income variables.

• SSR is the sum of squared residuals (always positive)
• The reduction in unexplained variation in Y from relaxing the restrictions: (SSRrestricted − SSRunrestricted) ≥ 0
• We also refer to SSR as unobserved heterogeneity
F-statistic under heteroskedasticity
The F-statistic is more complicated to derive when we control
for heteroskedasticity.
 
Formula for the special case of the joint hypothesis β1=β1,0 and
β2=β2,0 in a model with two independent variables:
 
F = (1/2) ∙ (t1² + t2² − 2∙ρ̂t1,t2∙t1∙t2) / (1 − ρ̂²t1,t2)

where ρ̂t1,t2 is the estimated correlation coefficient between the two t-statistics t1 and t2.

When do we reject the null hypothesis?


[Figure: densities of the F-distribution with (1,120) and with (2,120) degrees of freedom, plotted over F-values from 0 to 8; the null is rejected when the F-statistic falls in the right tail beyond the critical value]
Example with linear regression model
• Linear regression model with age, female, income and education as
independent variables assuming homoscedasticity

. regress crra age female IncLow IncHigh skilled longedu

Source | SS df MS Number of obs = 842
-------------+---------------------------------- F(6, 835) = 10.20
Model | 33.9285307 6 5.65475512 Prob > F = 0.0000
Residual | 462.712478 835 .55414668 R-squared = 0.0683
-------------+---------------------------------- Adj R-squared = 0.0616
Total | 496.641009 841 .590536277 Root MSE = .74441

------------------------------------------------------------------------------
crra | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | -.0110023 .0018369 -5.99 0.000 -.0146078 -.0073968
female | -.0047448 .0526605 -0.09 0.928 -.1081072 .0986177
IncLow | .0061799 .0645934 0.10 0.924 -.1206046 .1329644
IncHigh | -.1309954 .0659997 -1.98 0.047 -.2605402 -.0014507
skilled | .2007726 .0676065 2.97 0.003 .0680738 .3334713
longedu | .3303963 .0704457 4.69 0.000 .1921247 .4686678
_cons | .9947522 .1157919 8.59 0.000 .7674747 1.22203
------------------------------------------------------------------------------
Example with linear regression model
• Linear regression model with age, female and education as
independent variables assuming homoscedasticity

. regress crra age female skilled longedu

Source | SS df MS Number of obs = 842
-------------+---------------------------------- F(4, 837) = 13.91
Model | 30.9578743 4 7.73946859 Prob > F = 0.0000
Residual | 465.683134 837 .556371726 R-squared = 0.0623
-------------+---------------------------------- Adj R-squared = 0.0579
Total | 496.641009 841 .590536277 Root MSE = .7459

------------------------------------------------------------------------------
crra | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | -.0108795 .0018387 -5.92 0.000 -.0144886 -.0072704
female | -.0097489 .0526586 -0.19 0.853 -.1131073 .0936095
skilled | .2029999 .0667915 3.04 0.002 .0719014 .3340984
longedu | .2847977 .0668483 4.26 0.000 .1535877 .4160077
_cons | .9650865 .106825 9.03 0.000 .7554102 1.174763
------------------------------------------------------------------------------
Example with linear regression model
• F-test of joint hypothesis that the two income variables
(IncLow and IncHigh) are equal to 0:

F-value = [ (465.683 − 462.712) / 2 ] / [ 462.712 / 835 ] = 2.68

The critical value at the 5% significance level for the F-distribution with
(2,835) degrees of freedom is 3.00.

We cannot reject the null hypothesis that the coefficients on the two
income variables are jointly equal to 0.
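The by-hand calculation can be reproduced from the stored results of the two regressions; a hedged do-file sketch:

quietly regress crra age female IncLow IncHigh skilled longedu
scalar ssr_u = e(rss)     // SSR of the unrestricted model
scalar df_u  = e(df_r)    // n - k - 1 = 835
quietly regress crra age female skilled longedu
scalar ssr_r = e(rss)     // SSR of the restricted model
display ((ssr_r - ssr_u)/2) / (ssr_u/df_u)    // F-statistic with q = 2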
Example with linear regression model
• What are the implications of omitting the two income variables?

• Are the marginal effects of education on risk attitudes different across the
unrestricted and restricted models?

• Income and education are most likely positively correlated: higher
education is typically associated with higher income, i.e. corr(income,
education) > 0

• In the unrestricted model we see that crra is negatively correlated with
income, i.e. corr(income, crra) < 0

• We thus expect that omitting the income variables will lead to a downward
bias in the estimated coefficients for education, and this is what happens
with the coefficient on the longedu variable
Exercises
• Estimate a linear regression model on crra with female,
young, middle, old, skilled and longedu as independent
variables (assuming homoscedasticity)
• Are each of the estimated coefficients in the model significantly
different from 0?
• Are the estimated coefficients on skilled and longedu jointly
significant?
• Are the estimated coefficients on young, middle and old jointly
significant?
• Are the estimated coefficients on all the independent variables
jointly significant?
Are each of the estimated coefficients significant?
. regress crra female young middle old skilled longedu

Source | SS df MS Number of obs = 846
-------------+---------------------------------- F(6, 839) = 13.98
Model | 45.1836697 6 7.53061161 Prob > F = 0.0000
Residual | 451.905726 839 .538624227 R-squared = 0.0909
-------------+---------------------------------- Adj R-squared = 0.0844
Total | 497.089396 845 .588271474 Root MSE = .73391

------------------------------------------------------------------------------
crra | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
female | .0189913 .0514265 0.37 0.712 -.0819484 .1199309
young | .1783451 .0848779 2.10 0.036 .0117471 .3449431
middle | -.3600816 .0753263 -4.78 0.000 -.5079319 -.2122314
old | -.2746899 .0715047 -3.84 0.000 -.415039 -.1343407
skilled | .2593997 .0661773 3.92 0.000 .1295072 .3892922
longedu | .3510618 .0672979 5.22 0.000 .2189697 .4831538
_cons | .5791444 .0799944 7.24 0.000 .4221317 .7361571
------------------------------------------------------------------------------

• Female is not significant at the 10% level, young is significant at the 5% level, and the remaining variables are significant at the 1% level
Are the education variables jointly significant?
. regress crra female young middle old skilled longedu

Source | SS df MS Number of obs = 846
-------------+---------------------------------- F(6, 839) = 13.98
Model | 45.1836697 6 7.53061161 Prob > F = 0.0000
Residual | 451.905726 839 .538624227 R-squared = 0.0909
-------------+---------------------------------- Adj R-squared = 0.0844
Total | 497.089396 845 .588271474 Root MSE = .73391

------------------------------------------------------------------------------
crra | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
female | .0189913 .0514265 0.37 0.712 -.0819484 .1199309
young | .1783451 .0848779 2.10 0.036 .0117471 .3449431
middle | -.3600816 .0753263 -4.78 0.000 -.5079319 -.2122314
old | -.2746899 .0715047 -3.84 0.000 -.415039 -.1343407
skilled | .2593997 .0661773 3.92 0.000 .1295072 .3892922
longedu | .3510618 .0672979 5.22 0.000 .2189697 .4831538
_cons | .5791444 .0799944 7.24 0.000 .4221317 .7361571
------------------------------------------------------------------------------

• F-value = 14.06, with critical F-value = 3.00 at the 5% significance level
• The coefficients on the two education variables are jointly significantly different from 0 (see the sketch below)
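A hedged sketch of the command behind this joint test:

quietly regress crra female young middle old skilled longedu
test skilled longedu    // should reproduce the F-value of 14.06 reported above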
Are the age-related variables jointly significant?
. regress crra female young middle old skilled longedu

Source | SS df MS Number of obs = 846
-------------+---------------------------------- F(6, 839) = 13.98
Model | 45.1836697 6 7.53061161 Prob > F = 0.0000
Residual | 451.905726 839 .538624227 R-squared = 0.0909
-------------+---------------------------------- Adj R-squared = 0.0844
Total | 497.089396 845 .588271474 Root MSE = .73391

------------------------------------------------------------------------------
crra | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
female | .0189913 .0514265 0.37 0.712 -.0819484 .1199309
young | .1783451 .0848779 2.10 0.036 .0117471 .3449431
middle | -.3600816 .0753263 -4.78 0.000 -.5079319 -.2122314
old | -.2746899 .0715047 -3.84 0.000 -.415039 -.1343407
skilled | .2593997 .0661773 3.92 0.000 .1295072 .3892922
longedu | .3510618 .0672979 5.22 0.000 .2189697 .4831538
_cons | .5791444 .0799944 7.24 0.000 .4221317 .7361571
------------------------------------------------------------------------------

• F-value = 20.64, with critical F-value = 3.00 at the 5% significance level
• The coefficients on the three age-related variables are jointly significantly different from 0
Are all the estimated coefficients jointly significant?
. regress crra female young middle old skilled longedu

Source | SS df MS Number of obs = 846
-------------+---------------------------------- F(6, 839) = 13.98
Model | 45.1836697 6 7.53061161 Prob > F = 0.0000
Residual | 451.905726 839 .538624227 R-squared = 0.0909
-------------+---------------------------------- Adj R-squared = 0.0844
Total | 497.089396 845 .588271474 Root MSE = .73391

------------------------------------------------------------------------------
crra | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
female | .0189913 .0514265 0.37 0.712 -.0819484 .1199309
young | .1783451 .0848779 2.10 0.036 .0117471 .3449431
middle | -.3600816 .0753263 -4.78 0.000 -.5079319 -.2122314
old | -.2746899 .0715047 -3.84 0.000 -.415039 -.1343407
skilled | .2593997 .0661773 3.92 0.000 .1295072 .3892922
longedu | .3510618 .0672979 5.22 0.000 .2189697 .4831538
_cons | .5791444 .0799944 7.24 0.000 .4221317 .7361571
------------------------------------------------------------------------------

• F-value = 13.98, with critical F-value = 2.10 at the 5% significance level
• The coefficients on the six independent variables are jointly significantly different from 0
Summary
• Learning outcomes
• Know how to estimate a multiple regression model in Stata
• Be able to interpret estimated coefficients in linear regression
models
• Understand the conditions for omitted variable bias
• Construct a joint hypothesis test of two or more coefficients
• Understand the meaning of control variables and conditional
mean independence
Extra exercises
• Use the RA_Lab.dta dataset (see Andersen et al. [2010])
A. Regress crra on female, age and IncHigh
• Are the individual coefficients on female, age and IncHigh
significantly different from 0?
• Test the joint null hypothesis that the coefficients on female and age are both equal to 0 at the 5% significance level

B. Regress crra on female, age, IncHigh and longedu


• Are the individual coefficients on female, age, IncHigh and longedu
significantly different from 0?
• Does excluding longedu lead to omitted variable bias in the estimated
coefficient for IncHigh?
• Test the joint null hypothesis that the coefficients on IncHigh and longedu are both equal to 0 at the 5% significance level
Note: Control variables
Control variable
• Omitted variable bias: can we find a proxy for the omitted
variable?
• A control variable is a proxy for an omitted variable, which is
correlated with an independent variable of interest
• A control variable, W, is correlated with an omitted variable in
the model
• The control variable may have a causal effect on Y.
Example: student test scores in the US
Test_Score = 700.2 – 1.00∙STR – 0.122∙PctEL – 0.547∙LchPct,
 
Test_Score = Student test score
STR = student-teacher ratio
PctEL = percent of English learners in the school district
LchPct = percent of students receiving a free or subsidized lunch
(only students from low-income families are eligible)
 
• Student-teacher ratio (STR) is the variable of interest
• Which variables are control variables? Do they have causal
components? What do they control for?
Control variables, ctd.
What makes an effective control variable?
i. An effective control variable is one which, when included in
the regression, makes the error term uncorrelated with the
variable of interest.
ii. Holding constant the control variable(s), the variable of
interest is “as if” randomly assigned.
iii. Among individuals (entities) with the same value of the
control variable(s), the variable of interest is uncorrelated
with the omitted determinants of Y
Conditional mean independence.
• Because the coefficient on a control variable can be biased,
LSA #1 (E(ui|X1i,…,Xki) = 0) must not hold. For example, LchPct is
correlated with unmeasured determinants of test scores such as
outside learning opportunities, so the coefficient on LchPct is subject
to OV bias. But the fact that LchPct is correlated with these omitted
variables is precisely what makes it a good control variable!
• If LSA #1 doesn't hold, then what does?

• We need a mathematical statement of what makes an effective
control variable. This condition is conditional mean independence:
given the control variable, the mean of ui doesn't depend on the
variable of interest
Conditional mean independence, ctd.
Let Xi denote the variable of interest and Wi denote the control
variable(s). W is an effective control variable if conditional mean
independence holds:
 
E(ui|Xi, Wi) = E(ui|Wi) (conditional mean independence)
 
If W is a control variable, then conditional mean independence
replaces LSA #1 – it is the version of LSA #1 which is relevant
for control variables.
Conditional mean independence, ctd.
Consider the regression model,
Y = β0 + β1X + β2W + u
where X is the variable of interest and W is an effective control
variable so that conditional mean independence holds:
E(ui|Xi, Wi) = E(ui|Wi).

In addition, suppose that LSA #2, #3, and #4 hold. Then:
1. β1 has a causal interpretation.
2. The OLS estimator β̂1 is unbiased.
3. The coefficient on the control variable, β̂2, is in general biased.
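A sketch of why this works, assuming for illustration that E(ui|Wi) is linear in Wi, say E(ui|Wi) = γ0 + γ2Wi: under conditional mean independence,

E(Yi|Xi, Wi) = β0 + β1Xi + β2Wi + E(ui|Wi) = (β0 + γ0) + β1Xi + (β2 + γ2)Wi

so the coefficient on X still identifies the causal effect β1, while the coefficient on W estimates β2 + γ2 and is therefore biased for β2.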
Implications for variable selection
• Identify the variable of interest

• Think of the omitted causal effects that could result in omitted
variable bias

• Include those omitted causal effects if you can. If you can't,
then include variables correlated with them that serve as
control variables. The control variables are effective if the
conditional mean independence assumption plausibly holds.
Short summary of Appendix 6.2
• Distribution of OLS estimators
• Suppose you have two regressors, X1 and X2, and assume
homoskedasticity
• The variance of β̂1 is then a function of the variance of the error term,
the variance of X1, and the correlation coefficient between X1 and X2
• If X1 and X2 are independent then the correlation coefficient is 0,
and we have the usual formula for the variance of β̂1 in a model with
a single regressor
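For reference, the large-sample variance formula being paraphrased here (as in Appendix 6.2 of Stock and Watson, up to notation) is:

var(β̂1) = (1/n) ∙ [1/(1 − ρ²X1,X2)] ∙ (σ²u / σ²X1)

With ρX1,X2 = 0 this collapses to σ²u/(n∙σ²X1), the familiar single-regressor formula; the more strongly X1 and X2 are correlated, the larger the variance of β̂1.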
