Dougherty, Introduction to Econometrics, 5th edition
Chapter 6: Specification of Regression Variables
True model:    Y = β1 + β2X2 + u    or    Y = β1 + β2X2 + β3X3 + u
Fitted model:  Ŷ = β̂1 + β̂2X2      or    Ŷ = β̂1 + β̂2X2 + β̂3X3
In this sequence and the next we will investigate the consequences of misspecifying the
regression model in terms of explanatory variables.
VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE
To keep the analysis simple, we will assume that there are only two possibilities. Either Y
depends only on X2, or it depends on both X2 and X3.
If Y depends only on X2, and we fit a simple regression model, we will not encounter any
problems, assuming of course that the regression model assumptions are valid.
Likewise we will not encounter any problems if Y depends on both X2 and X3 and we fit the
multiple regression.
In this sequence we will examine the consequences of fitting a simple regression when the
true model is multiple.
In the next one we will do the opposite and examine the consequences of fitting a multiple
regression when the true model is simple.
True model:  Y = β1 + β2X2 + β3X3 + u        Fitted model:  Ŷ = β̂1 + β̂2X2

E(β̂2) = β2 + β3 Σ(X2i − X̄2)(X3i − X̄3) / Σ(X2i − X̄2)²
In the present case, the omission of X3 causes β̂2 to be biased by the term involving β3 in this expression. We will explain this first intuitively and then demonstrate it mathematically.
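The decomposition behind this bias term can be checked numerically. The sketch below simulates a true model containing both X2 and X3 (the sample size, coefficients, and degree of correlation are arbitrary illustrative choices, not values from the text) and verifies that the simple-regression slope splits exactly into β2, the β3 bias term, and an error component:

```python
import numpy as np

rng = np.random.default_rng(0)
n, beta2, beta3 = 1000, 1.0, 0.5     # illustrative values only

# Correlated regressors and the true model Y = 1 + beta2*X2 + beta3*X3 + u
X2 = rng.normal(size=n)
X3 = 0.6 * X2 + rng.normal(size=n)   # X3 positively correlated with X2
u = rng.normal(size=n)
Y = 1.0 + beta2 * X2 + beta3 * X3 + u

x2 = X2 - X2.mean()                  # deviations from sample means
x3 = X3 - X3.mean()

# Slope from the misspecified simple regression of Y on X2
b2_hat = (x2 * (Y - Y.mean())).sum() / (x2 ** 2).sum()

# The two extra pieces in the decomposition of b2_hat
bias_term = beta3 * (x2 * x3).sum() / (x2 ** 2).sum()
error_term = (x2 * (u - u.mean())).sum() / (x2 ** 2).sum()

print(b2_hat)   # lies above beta2 = 1.0 because the bias term is positive
```

The decomposition b2_hat = β2 + bias_term + error_term holds exactly in any sample, not just in expectation.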
[Path diagram: X3 affects Y with effect b3; X2 has a direct effect b2 on Y, holding X3 constant; X2 also has an apparent effect on Y, acting as a mimic for X3 through its association with X3.]
The intuitive reason is that, in addition to its direct effect b2, X2 has an apparent indirect
effect as a consequence of acting as a proxy for the missing X3.
The strength of the proxy effect depends on two factors: the strength of the effect of X3 on
Y, which is given by b3, and the ability of X2 to mimic X3.
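The "ability of X2 to mimic X3" has an exact algebraic counterpart: the factor multiplying β3 in the bias expression is precisely the slope one would obtain by regressing X3 on X2. A small check with made-up data (the numbers are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
X2 = rng.normal(size=500)
X3 = 0.8 * X2 + rng.normal(size=500)   # X3 partly tracks X2

x2 = X2 - X2.mean()
x3 = X3 - X3.mean()

# Factor multiplying beta3 in the bias expression
mimic = (x2 * x3).sum() / (x2 ** 2).sum()

# OLS slope from regressing X3 on X2 -- the same quantity
slope_x3_on_x2 = np.polyfit(X2, X3, 1)[0]
```

So the stronger the sample relationship between X2 and X3, the larger the proxy component of the bias.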
β̂2 = Σ(X2i − X̄2)(Yi − Ȳ) / Σ(X2i − X̄2)²

   = Σ(X2i − X̄2)[(β1 + β2X2i + β3X3i + ui) − (β1 + β2X̄2 + β3X̄3 + ū)] / Σ(X2i − X̄2)²

   = [β2 Σ(X2i − X̄2)² + β3 Σ(X2i − X̄2)(X3i − X̄3) + Σ(X2i − X̄2)(ui − ū)] / Σ(X2i − X̄2)²

   = β2 + β3 Σ(X2i − X̄2)(X3i − X̄3) / Σ(X2i − X̄2)² + Σ(X2i − X̄2)(ui − ū) / Σ(X2i − X̄2)²
Although Y really depends on X3 as well as X2, we make a mistake and regress Y on X2 only.
The slope coefficient is therefore as shown.
We substitute for Y.
β̂2 = β2 + β3 Σ(X2i − X̄2)(X3i − X̄3) / Σ(X2i − X̄2)² + Σ(X2i − X̄2)(ui − ū) / Σ(X2i − X̄2)²

E(β̂2) = β2 + β3 Σ(X2i − X̄2)(X3i − X̄3) / Σ(X2i − X̄2)² + E[Σ(X2i − X̄2)(ui − ū)] / Σ(X2i − X̄2)²
To investigate biasedness or unbiasedness, we take the expected value of ˆ2 . The first two
terms are unaffected because they contain no random components. Thus we focus on the
expectation of the error term.
E[ Σ(X2i − X̄2)(ui − ū) / Σ(X2i − X̄2)² ]
   = (1 / Σ(X2i − X̄2)²) E[ Σ(X2i − X̄2)(ui − ū) ]
   = (1 / Σ(X2i − X̄2)²) Σ E[(X2i − X̄2)(ui − ū)]
   = (1 / Σ(X2i − X̄2)²) Σ (X2i − X̄2) E(ui − ū)
   = 0
VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE
E ˆ2 2 3
X X X X E X X u u
2i 2 3i 3 2i 2 i
2 2
X X 2i 2 X X 2i 2
X 2 i X 2 ui u
E X 2 i X 2 ui u
1
E
X 2i X 2 X 2i X 2
2 2
1
2
E X 2 i X 2 ui u
X 2i X 2
1
2
X 2 i X 2 E ui u
X 2i X 2
0
In the numerator the expectation of a sum is equal to the sum of the expectations (first
expected value rule).
In each product, the factor involving X2 may be taken out of the expectation because X2 is
nonstochastic.
By Assumption A.3, the expected value of u is 0. It follows that the expected value of the
sample mean of u is also 0. Hence the expected value of the error term is 0.
E(β̂2) = β2 + β3 Σ(X2i − X̄2)(X3i − X̄3) / Σ(X2i − X̄2)²
Thus we have shown that the expected value of ˆ2 is equal to the true value plus a bias
term. Note: the definition of a bias is the difference between the expected value of an
estimator and the true value of the parameter being estimated.
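The bias can also be seen by Monte Carlo: hold the regressors fixed, redraw the disturbances many times, and compare the average of β̂2 across replications with the true β2. All numbers below are illustrative assumptions, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 200, 2000
beta2, beta3 = 1.0, 0.5              # illustrative true coefficients

# Fixed (nonstochastic) regressors, held constant across replications
X2 = rng.normal(size=n)
X3 = 0.6 * X2 + rng.normal(size=n)
x2 = X2 - X2.mean()
x3 = X3 - X3.mean()

estimates = np.empty(reps)
for r in range(reps):
    u = rng.normal(size=n)                           # fresh disturbances
    Y = 1.0 + beta2 * X2 + beta3 * X3 + u
    # simple regression of Y on X2, omitting X3
    estimates[r] = (x2 * (Y - Y.mean())).sum() / (x2 ** 2).sum()

# Bias predicted by the formula, using the fixed regressors
predicted_bias = beta3 * (x2 * x3).sum() / (x2 ** 2).sum()
mc_bias = estimates.mean() - beta2   # Monte Carlo estimate of the bias
```

With the regressors fixed, the average of the estimates converges on β2 plus the predicted bias term as the number of replications grows.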
As a consequence of the misspecification, the standard errors, t tests and F test are invalid.
S = β1 + β2 ASVABC + β3 SM + u

. reg S ASVABC SM
----------------------------------------------------------------------------
Source | SS df MS Number of obs = 500
-----------+------------------------------ F( 2, 497) = 105.58
Model | 1119.3742 2 559.687101 Prob > F = 0.0000
Residual | 2634.6478 497 5.30110221 R-squared = 0.2982
-----------+------------------------------ Adj R-squared = 0.2954
Total | 3754.022 499 7.52309018 Root MSE = 2.3024
----------------------------------------------------------------------------
S | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-----------+----------------------------------------------------------------
ASVABC | 1.377521 .1229142 11.21 0.000 1.136026 1.619017
SM | .1919575 .0416913 4.60 0.000 .1100445 .2738705
_cons | 11.8925 .5629644 21.12 0.000 10.78642 12.99859
----------------------------------------------------------------------------
We will illustrate the bias using an educational attainment model. To keep the analysis
simple, we will assume that in the true model S depends only on ASVABC and SM. The
output above shows the corresponding regression using EAWE Data Set 21.
. cor SM ASVABC
(obs=500)

        |      SM  ASVABC
--------+------------------
     SM |  1.0000
 ASVABC |  0.3594  1.0000

E(β̂2) = β2 + β3 Σ(X2i − X̄2)(X3i − X̄3) / Σ(X2i − X̄2)²,  with X2 = ASVABC and X3 = SM
We will run the regression a second time, omitting SM. Before we do this, we will try to
predict the direction of the bias in the coefficient of ASVABC.
The correlation between ASVABC and SM is positive, so the numerator of the bias term
must be positive. The denominator is automatically positive since it is a sum of squares
and there is some variation in ASVABC. Hence the bias should be positive.
. reg S ASVABC
----------------------------------------------------------------------------
Source | SS df MS Number of obs = 500
-----------+------------------------------ F( 1, 498) = 182.56
Model | 1006.99534 1 1006.99534 Prob > F = 0.0000
Residual | 2747.02666 498 5.5161178 R-squared = 0.2682
-----------+------------------------------ Adj R-squared = 0.2668
Total | 3754.022 499 7.52309018 Root MSE = 2.3486
----------------------------------------------------------------------------
S | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-----------+----------------------------------------------------------------
ASVABC | 1.580901 .1170059 13.51 0.000 1.351015 1.810787
_cons | 14.43677 .1097335 131.56 0.000 14.22117 14.65237
----------------------------------------------------------------------------
S = β1 + β2 ASVABC + β3 SM + u
. reg S ASVABC SM
----------------------------------------------------------------------------
S | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-----------+----------------------------------------------------------------
ASVABC | 1.377521 .1229142 11.21 0.000 1.136026 1.619017
SM | .1919575 .0416913 4.60 0.000 .1100445 .2738705
_cons | 11.8925 .5629644 21.12 0.000 10.78642 12.99859
----------------------------------------------------------------------------
. reg S ASVABC
----------------------------------------------------------------------------
S | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-----------+----------------------------------------------------------------
ASVABC | 1.580901 .1170059 13.51 0.000 1.351015 1.810787
_cons | 14.43677 .1097335 131.56 0.000 14.22117 14.65237
----------------------------------------------------------------------------
As you can see, the coefficient of ASVABC is indeed higher when SM is omitted. Part of the
difference may be due to pure chance, but part is attributable to the bias.
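In fact the relation between the short and long regressions is an exact in-sample identity: the slope in the regression omitting SM equals the slope in the full regression plus the SM coefficient times the slope from regressing SM on ASVABC. A sketch with simulated stand-ins for the variables (the EAWE data themselves are not reproduced here; all numbers are invented):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300
# Hypothetical stand-ins for ASVABC and SM, loosely mimicking the setup
asvabc = rng.normal(size=n)
sm = 0.4 * asvabc + rng.normal(size=n)
s = 11.9 + 1.38 * asvabc + 0.19 * sm + rng.normal(size=n)

def ols(y, *cols):
    """OLS coefficients, intercept first."""
    X = np.column_stack([np.ones(len(y))] + list(cols))
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_long = ols(s, asvabc, sm)     # [intercept, coef on ASVABC, coef on SM]
b_short = ols(s, asvabc)        # [intercept, coef on ASVABC]
delta = ols(sm, asvabc)[1]      # slope from regressing SM on ASVABC

# Omitted-variable identity: short slope = long slope + (coef on SM) * delta
```

The identity holds exactly for the fitted coefficients of any sample, which is why only the sampling-error part of the difference is attributable to chance.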
. reg S SM
----------------------------------------------------------------------------
Source | SS df MS Number of obs = 500
-----------+------------------------------ F( 1, 498) = 68.44
Model | 453.551645 1 453.551645 Prob > F = 0.0000
Residual | 3300.47036 498 6.62745051 R-squared = 0.1208
-----------+------------------------------ Adj R-squared = 0.1191
Total | 3754.022 499 7.52309018 Root MSE = 2.5744
----------------------------------------------------------------------------
S | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-----------+----------------------------------------------------------------
SM | .3598719 .0435019 8.27 0.000 .2744021 .4453417
_cons | 9.992614 .6002469 16.65 0.000 8.813286 11.17194
----------------------------------------------------------------------------
E(β̂3) = β3 + β2 Σ(X2i − X̄2)(X3i − X̄3) / Σ(X3i − X̄3)²,  with X2 = ASVABC and X3 = SM
Here is the regression omitting ASVABC instead of SM. We would expect β̂3 to be upwards biased. We anticipate that b2 is positive, and we know that both the numerator and the denominator of the other factor in the bias expression are positive.
. reg S ASVABC SM
----------------------------------------------------------------------------
S | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-----------+----------------------------------------------------------------
ASVABC | 1.377521 .1229142 11.21 0.000 1.136026 1.619017
SM | .1919575 .0416913 4.60 0.000 .1100445 .2738705
_cons | 11.8925 .5629644 21.12 0.000 10.78642 12.99859
----------------------------------------------------------------------------
. reg S SM
----------------------------------------------------------------------------
S | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-----------+----------------------------------------------------------------
SM | .3598719 .0435019 8.27 0.000 .2744021 .4453417
_cons | 9.992614 .6002469 16.65 0.000 8.813286 11.17194
----------------------------------------------------------------------------
In this case the bias is quite dramatic. The coefficient of SM has nearly doubled. The
reason for the bigger effect is that the variation in SM is much smaller than that in ASVABC,
while b2 and b3 are similar in size, judging by their estimates.
. reg S ASVABC SM
Source | SS df MS Number of obs = 500
-----------+------------------------------ F( 2, 497) = 105.58
Model | 1119.3742 2 559.687101 Prob > F = 0.0000
Residual | 2634.6478 497 5.30110221 R-squared = 0.2982
-----------+------------------------------ Adj R-squared = 0.2954
Total | 3754.022 499 7.52309018 Root MSE = 2.3024
. reg S ASVABC
Source | SS df MS Number of obs = 500
-----------+------------------------------ F( 1, 498) = 182.56
Model | 1006.99534 1 1006.99534 Prob > F = 0.0000
Residual | 2747.02666 498 5.5161178 R-squared = 0.2682
-----------+------------------------------ Adj R-squared = 0.2668
Total | 3754.022 499 7.52309018 Root MSE = 2.3486
. reg S SM
Source | SS df MS Number of obs = 500
-----------+------------------------------ F( 1, 498) = 68.44
Model | 453.551645 1 453.551645 Prob > F = 0.0000
Residual | 3300.47036 498 6.62745051 R-squared = 0.1208
-----------+------------------------------ Adj R-squared = 0.1191
Total | 3754.022 499 7.52309018 Root MSE = 2.5744
Finally, we will investigate how R2 behaves when a variable is omitted. In the simple
regression of S on ASVABC, R2 is 0.27, and in the simple regression of S on SM it is 0.12.
Does this imply that ASVABC explains 27% of the variance in S and SM 12%? No, because
the multiple regression reveals that their joint explanatory power is 0.30, not 0.39.
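That R² is not additive across regressors is easy to reproduce. The sketch below uses simulated data with positively correlated regressors (all parameter values are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500
x2 = rng.normal(size=n)
x3 = 0.5 * x2 + rng.normal(size=n)            # positively correlated regressors
y = 1.0 + 1.0 * x2 + 0.5 * x3 + rng.normal(size=n)

def r_squared(y, *cols):
    """R-squared from an OLS regression of y on the given columns."""
    X = np.column_stack([np.ones(len(y))] + list(cols))
    fitted = X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return 1 - ((y - fitted) ** 2).sum() / ((y - y.mean()) ** 2).sum()

r2_joint = r_squared(y, x2, x3)
r2_x2 = r_squared(y, x2)
r2_x3 = r_squared(y, x3)
# Each simple R-squared is inflated by the proxy effect, so their sum
# exceeds the joint R-squared, just as 0.27 + 0.12 > 0.30 in the text
```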
In the second regression, ASVABC is partly acting as a proxy for SM, and this inflates its
apparent explanatory power. Similarly, in the third regression, SM is partly acting as a
proxy for ASVABC, again inflating its apparent explanatory power.
LGEARN = β1 + β2 S + β3 EXP + u
. reg LGEARN S EXP
----------------------------------------------------------------------------
Source | SS df MS Number of obs = 500
-----------+------------------------------ F( 2, 497) = 40.12
Model | 21.2104059 2 10.6052029 Prob > F = 0.0000
Residual | 131.388814 497 .264363811 R-squared = 0.1390
-----------+------------------------------ Adj R-squared = 0.1355
Total | 152.59922 499 .30581006 Root MSE = .51416
----------------------------------------------------------------------------
LGEARN | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-----------+----------------------------------------------------------------
S | .0916942 .0103338 8.87 0.000 .0713908 .1119976
EXP | .0405521 .009692 4.18 0.000 .0215098 .0595944
_cons | 1.199799 .1980634 6.06 0.000 .8106537 1.588943
----------------------------------------------------------------------------
However, it is also possible for omitted variable bias to lead to a reduction in the apparent
explanatory power of a variable. This will be demonstrated using a simple earnings
function model, supposing the logarithm of hourly earnings to depend on S and EXP.
. cor S EXP
(obs=500)

        |       S     EXP
--------+------------------
      S |  1.0000
    EXP | -0.5836  1.0000

E(β̂2) = β2 + β3 Σ(X2i − X̄2)(X3i − X̄3) / Σ(X2i − X̄2)²,  with X2 = S and X3 = EXP
If we omit EXP from the regression, the coefficient of S should be subject to a downward
bias. b3 is likely to be positive. The numerator of the other factor in the bias term is
negative since S and EXP are negatively correlated. The denominator is positive.
E(β̂3) = β3 + β2 Σ(X2i − X̄2)(X3i − X̄3) / Σ(X3i − X̄3)²,  with X2 = S and X3 = EXP
For the same reasons, the coefficient of EXP in a simple regression of LGEARN on EXP
should be downwards biased.
. reg LGEARN S
----------------------------------------------------------------------------
LGEARN | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-----------+----------------------------------------------------------------
S | .0664621 .0085297 7.79 0.000 .0497034 .0832207
_cons | 1.83624 .1289384 14.24 0.000 1.58291 2.089571
As can be seen, the coefficients of S and EXP are indeed lower in the simple regressions.
In the third regression, the negative bias is sufficient to wipe out the positive effect of
EXP.
Indeed, the regression of LGEARN on EXP alone suggests that EXP has a very small negative effect on earnings: very small in terms of the numerical coefficient, and hence also very small in terms of R2.
If one plotted the data for LGEARN and EXP in a simple scatter diagram, not
controlling for S, one would see no relationship at all.
. reg LGEARN S
Source | SS df MS Number of obs = 500
-----------+------------------------------ F( 1, 498) = 60.71
Model | 16.5822819 1 16.5822819 Prob > F = 0.0000
Residual | 136.016938 498 .273126381 R-squared = 0.1087
-----------+------------------------------ Adj R-squared = 0.1069
Total | 152.59922 499 .30581006 Root MSE = .52261
A comparison of R2 for the three regressions shows that the sum of R2 in the simple
regressions is actually less than R2 in the multiple regression.
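This opposite pattern can likewise be reproduced: with strongly negatively correlated regressors whose true effects are both positive, each simple-regression slope is biased downwards, each simple R² is depressed, and their sum falls short of the joint R². A sketch with invented parameters:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500
s = rng.normal(size=n)
exp = -0.9 * s + rng.normal(size=n)           # negatively correlated, as S and EXP are
y = 1.0 + 1.0 * s + 1.0 * exp + rng.normal(size=n)

def r_squared(y, *cols):
    """R-squared from an OLS regression of y on the given columns."""
    X = np.column_stack([np.ones(len(y))] + list(cols))
    fitted = X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return 1 - ((y - fitted) ** 2).sum() / ((y - y.mean()) ** 2).sum()

r2_joint = r_squared(y, s, exp)
r2_s = r_squared(y, s)
r2_exp = r_squared(y, exp)
# Downward bias in the simple-regression slopes depresses both simple
# R-squareds, so their sum falls below the joint R-squared
```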
This is because the apparent explanatory power of S in the second regression has been
undermined by the downwards bias in its coefficient. As we have seen, the same is even
more evident for the apparent explanatory power of EXP in the third equation.
Copyright Christopher Dougherty 2016.
Individuals studying econometrics on their own who feel that they might benefit
from participation in a formal course should consider the London School of
Economics summer school course
EC212 Introduction to Econometrics
http://www2.lse.ac.uk/study/summerSchools/summerSchool/Home.aspx
or the University of London International Programmes distance learning course
EC2020 Elements of Econometrics
www.londoninternational.ac.uk/lse.
2016.05.04