
Christopher Dougherty, Introduction to Econometrics, 5th edition

Chapter 6: Specification of Regression Variables

© Christopher Dougherty, 2016. All rights reserved.


VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE

Consequences of variable misspecification

True model:    $Y = \beta_1 + \beta_2 X_2 + u$    or    $Y = \beta_1 + \beta_2 X_2 + \beta_3 X_3 + u$

Fitted model:  $\hat{Y} = \hat{\beta}_1 + \hat{\beta}_2 X_2$    or    $\hat{Y} = \hat{\beta}_1 + \hat{\beta}_2 X_2 + \hat{\beta}_3 X_3$

In this sequence and the next we will investigate the consequences of misspecifying the
regression model in terms of explanatory variables.

1

To keep the analysis simple, we will assume that there are only two possibilities. Either Y
depends only on X2, or it depends on both X2 and X3.

2

If Y depends only on X2, and we fit a simple regression model, we will not encounter any
problems, assuming of course that the regression model assumptions are valid.

3

Likewise we will not encounter any problems if Y depends on both X2 and X3 and we fit the
multiple regression.

4

In this sequence we will examine the consequences of fitting a simple regression when the
true model is multiple.

5

In the next one we will do the opposite and examine the consequences of fitting a multiple
regression when the true model is simple.

6
The cases can be summarized as follows.

True model $Y = \beta_1 + \beta_2 X_2 + u$:
  Fitted $\hat{Y} = \hat{\beta}_1 + \hat{\beta}_2 X_2$: correct specification, no problems.
  Fitted $\hat{Y} = \hat{\beta}_1 + \hat{\beta}_2 X_2 + \hat{\beta}_3 X_3$: examined in the next sequence.

True model $Y = \beta_1 + \beta_2 X_2 + \beta_3 X_3 + u$:
  Fitted $\hat{Y} = \hat{\beta}_1 + \hat{\beta}_2 X_2$: coefficients are biased (in general); standard errors are invalid (in general).
  Fitted $\hat{Y} = \hat{\beta}_1 + \hat{\beta}_2 X_2 + \hat{\beta}_3 X_3$: correct specification, no problems.

In general, the omission of a relevant explanatory variable causes the regression coefficients to be biased and the standard errors to be invalid.

7
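As a quick illustration, the Monte Carlo sketch below (the coefficient values and the dependence of X3 on X2 are invented for the example, not taken from the text) generates data from a true two-regressor model and compares the simple regression, which omits X3, with the correctly specified multiple regression.

```python
import numpy as np

rng = np.random.default_rng(42)
n, reps = 200, 2000
beta1, beta2, beta3 = 1.0, 2.0, 1.5   # assumed true parameter values

# X3 is correlated with X2, so omitting X3 lets X2 act as a proxy for it.
X2 = rng.normal(size=n)
X3 = 0.5 * X2 + rng.normal(size=n)

b2_simple, b2_multiple = [], []
for _ in range(reps):
    u = rng.normal(size=n)
    Y = beta1 + beta2 * X2 + beta3 * X3 + u
    # Misspecified fit: regress Y on X2 only (slope is the first polyfit coefficient)
    b2_simple.append(np.polyfit(X2, Y, 1)[0])
    # Correct fit: regress Y on a constant, X2 and X3
    X = np.column_stack([np.ones(n), X2, X3])
    b2_multiple.append(np.linalg.lstsq(X, Y, rcond=None)[0][1])

print(np.mean(b2_simple))    # well above beta2 = 2: upward omitted-variable bias
print(np.mean(b2_multiple))  # close to beta2 = 2: unbiased
```

With these invented numbers the slope from regressing X3 on X2 is about 0.5, so the simple-regression estimate of the X2 coefficient is centered near $\beta_2 + 0.5\beta_3 \approx 2.75$ rather than 2.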
True model $Y = \beta_1 + \beta_2 X_2 + \beta_3 X_3 + u$, fitted as $\hat{Y} = \hat{\beta}_1 + \hat{\beta}_2 X_2$:

$$E(\hat{\beta}_2) = \beta_2 + \beta_3\,\frac{\sum\,(X_{2i}-\bar{X}_2)(X_{3i}-\bar{X}_3)}{\sum\,(X_{2i}-\bar{X}_2)^2}$$

In the present case, the omission of X3 causes $\hat{\beta}_2$ to be biased by the second term on the right, the bias term. We will explain this first intuitively and then demonstrate it mathematically.

8
[Diagram: $X_2$ has a direct effect $b_2$ on $Y$, holding $X_3$ constant, and an apparent indirect effect through its association with $X_3$, whose effect on $Y$ is $b_3$.]

The intuitive reason is that, in addition to its direct effect b2, X2 has an apparent indirect
effect as a consequence of acting as a proxy for the missing X3.

9

The strength of the proxy effect depends on two factors: the strength of the effect of X3 on
Y, which is given by b3, and the ability of X2 to mimic X3.

10

The ability of X2 to mimic X3 is determined by the slope coefficient obtained when X3 is regressed on X2, the ratio term in the bias expression.

11
$$\hat{\beta}_2 = \frac{\sum\,(X_{2i}-\bar{X}_2)(Y_i-\bar{Y})}{\sum\,(X_{2i}-\bar{X}_2)^2}$$

$$= \frac{\sum\,(X_{2i}-\bar{X}_2)\bigl[(\beta_1+\beta_2 X_{2i}+\beta_3 X_{3i}+u_i)-(\beta_1+\beta_2\bar{X}_2+\beta_3\bar{X}_3+\bar{u})\bigr]}{\sum\,(X_{2i}-\bar{X}_2)^2}$$

$$= \frac{\beta_2\sum\,(X_{2i}-\bar{X}_2)^2+\beta_3\sum\,(X_{2i}-\bar{X}_2)(X_{3i}-\bar{X}_3)+\sum\,(X_{2i}-\bar{X}_2)(u_i-\bar{u})}{\sum\,(X_{2i}-\bar{X}_2)^2}$$

$$= \beta_2+\beta_3\,\frac{\sum\,(X_{2i}-\bar{X}_2)(X_{3i}-\bar{X}_3)}{\sum\,(X_{2i}-\bar{X}_2)^2}+\frac{\sum\,(X_{2i}-\bar{X}_2)(u_i-\bar{u})}{\sum\,(X_{2i}-\bar{X}_2)^2}$$

We will now derive the expression for the bias mathematically.

12
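The three-component decomposition is an algebraic identity, so it can be checked exactly on a single simulated sample (the parameter values below are invented for the check):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
beta1, beta2, beta3 = 1.0, 2.0, 1.5   # assumed true parameters
X2 = rng.normal(size=n)
X3 = 0.5 * X2 + rng.normal(size=n)
u = rng.normal(size=n)
Y = beta1 + beta2 * X2 + beta3 * X3 + u

d2, d3, du = X2 - X2.mean(), X3 - X3.mean(), u - u.mean()
# OLS slope from regressing Y on X2 alone
b2_hat = np.sum(d2 * (Y - Y.mean())) / np.sum(d2 ** 2)
# The same quantity built from the three components of the decomposition
components = (beta2
              + beta3 * np.sum(d2 * d3) / np.sum(d2 ** 2)
              + np.sum(d2 * du) / np.sum(d2 ** 2))
print(abs(b2_hat - components))   # zero up to floating-point rounding
```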

Although Y really depends on X3 as well as X2, we make a mistake and regress Y on X2 only.
The slope coefficient is therefore as shown.

13

We substitute for Y.

14

We simplify and demonstrate that $\hat{\beta}_2$ has three components.

15
$$\hat{\beta}_2 = \beta_2+\beta_3\,\frac{\sum\,(X_{2i}-\bar{X}_2)(X_{3i}-\bar{X}_3)}{\sum\,(X_{2i}-\bar{X}_2)^2}+\frac{\sum\,(X_{2i}-\bar{X}_2)(u_i-\bar{u})}{\sum\,(X_{2i}-\bar{X}_2)^2}$$

$$E(\hat{\beta}_2) = \beta_2+\beta_3\,\frac{\sum\,(X_{2i}-\bar{X}_2)(X_{3i}-\bar{X}_3)}{\sum\,(X_{2i}-\bar{X}_2)^2}+E\!\left[\frac{\sum\,(X_{2i}-\bar{X}_2)(u_i-\bar{u})}{\sum\,(X_{2i}-\bar{X}_2)^2}\right]$$

To investigate biasedness or unbiasedness, we take the expected value of $\hat{\beta}_2$. The first two terms are unaffected because they contain no random components. Thus we focus on the expectation of the error term.
16
$$E\!\left[\frac{\sum\,(X_{2i}-\bar{X}_2)(u_i-\bar{u})}{\sum\,(X_{2i}-\bar{X}_2)^2}\right]=\frac{1}{\sum\,(X_{2i}-\bar{X}_2)^2}\,E\!\left[\sum\,(X_{2i}-\bar{X}_2)(u_i-\bar{u})\right]$$

$$=\frac{1}{\sum\,(X_{2i}-\bar{X}_2)^2}\,\sum E\bigl[(X_{2i}-\bar{X}_2)(u_i-\bar{u})\bigr]$$

$$=\frac{1}{\sum\,(X_{2i}-\bar{X}_2)^2}\,\sum\,(X_{2i}-\bar{X}_2)\,E(u_i-\bar{u})=0$$

X2 is nonstochastic, so the denominator of the error term is nonstochastic and may be taken outside the expression for the expectation.

17

In the numerator the expectation of a sum is equal to the sum of the expectations (first
expected value rule).

18

In each product, the factor involving X2 may be taken out of the expectation because X2 is
nonstochastic.

19

By Assumption A.3, the expected value of u is 0. It follows that the expected value of the
sample mean of u is also 0. Hence the expected value of the error term is 0.

20
$$E(\hat{\beta}_2)=\beta_2+\beta_3\,\frac{\sum\,(X_{2i}-\bar{X}_2)(X_{3i}-\bar{X}_3)}{\sum\,(X_{2i}-\bar{X}_2)^2}$$

Thus we have shown that the expected value of $\hat{\beta}_2$ is equal to the true value plus a bias term. Note: the bias of an estimator is defined as the difference between its expected value and the true value of the parameter being estimated.
21
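Because the bias formula involves only the fixed X values, it can be verified numerically by holding X2 and X3 fixed and averaging $\hat{\beta}_2$ over many drawings of the disturbance term (a sketch with invented parameter values):

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 100, 5000
beta1, beta2, beta3 = 1.0, 2.0, 1.5   # assumed true parameters
X2 = rng.normal(size=n)               # nonstochastic: held fixed across replications
X3 = 0.5 * X2 + rng.normal(size=n)    # also held fixed

d2 = X2 - X2.mean()
# The bias term: beta3 times the slope from regressing X3 on X2
bias = beta3 * np.sum(d2 * (X3 - X3.mean())) / np.sum(d2 ** 2)

slopes = []
for _ in range(reps):
    Y = beta1 + beta2 * X2 + beta3 * X3 + rng.normal(size=n)
    slopes.append(np.sum(d2 * (Y - Y.mean())) / np.sum(d2 ** 2))

print(np.mean(slopes))   # close to beta2 + bias
print(beta2 + bias)
```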

As a consequence of the misspecification, the standard errors, t tests and F test are invalid.

22
$S = \beta_1 + \beta_2\,ASVABC + \beta_3\,SM + u$
. reg S ASVABC SM
----------------------------------------------------------------------------
Source | SS df MS Number of obs = 500
-----------+------------------------------ F( 2, 497) = 105.58
Model | 1119.3742 2 559.687101 Prob > F = 0.0000
Residual | 2634.6478 497 5.30110221 R-squared = 0.2982
-----------+------------------------------ Adj R-squared = 0.2954
Total | 3754.022 499 7.52309018 Root MSE = 2.3024
----------------------------------------------------------------------------
S | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-----------+----------------------------------------------------------------
ASVABC | 1.377521 .1229142 11.21 0.000 1.136026 1.619017
SM | .1919575 .0416913 4.60 0.000 .1100445 .2738705
_cons | 11.8925 .5629644 21.12 0.000 10.78642 12.99859
----------------------------------------------------------------------------

We will illustrate the bias using an educational attainment model. To keep the analysis
simple, we will assume that in the true model S depends only on ASVABC and SM. The
output above shows the corresponding regression using EAWE Data Set 21.
23
. cor SM ASVABC
(obs=500)

        |      SM  ASVABC
--------+------------------
     SM |  1.0000
 ASVABC |  0.3594  1.0000

$$E(\hat{\beta}_2)=\beta_2+\beta_3\,\frac{\sum\,(ASVABC_i-\overline{ASVABC})(SM_i-\overline{SM})}{\sum\,(ASVABC_i-\overline{ASVABC})^2}$$

We will run the regression a second time, omitting SM. Before we do this, we will try to
predict the direction of the bias in the coefficient of ASVABC.

24

It is reasonable to suppose, as a matter of common sense, that b3 is positive. This assumption is strongly supported by the fact that its estimate in the multiple regression is positive and highly significant.
25

The correlation between ASVABC and SM is positive, so the numerator of the bias term
must be positive. The denominator is automatically positive since it is a sum of squares
and there is some variation in ASVABC. Hence the bias should be positive.
26
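The sign prediction only needs the sign of b3 and the sign of the slope obtained when the omitted variable is regressed on the included one, as this small sketch (with invented stand-ins for ASVABC and SM) shows:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1000
x2 = rng.normal(size=n)                 # stands in for ASVABC
x3 = 0.7 * x2 + rng.normal(size=n)      # stands in for SM: positively correlated
beta3 = 0.2                             # assumed positive, like the SM coefficient

d2 = x2 - x2.mean()
# Slope from regressing the omitted x3 on the included x2
mimic_slope = np.sum(d2 * (x3 - x3.mean())) / np.sum(d2 ** 2)
bias = beta3 * mimic_slope
print(mimic_slope > 0, bias > 0)   # both True: positive correlation gives upward bias
```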
. reg S ASVABC

      Source |       SS       df       MS              Number of obs =     500
-------------+------------------------------           F(  1,   498) =  182.56
       Model |  1006.99534     1  1006.99534           Prob > F      =  0.0000
    Residual |  2747.02666   498   5.5161178           R-squared     =  0.2682
-------------+------------------------------           Adj R-squared =  0.2668
       Total |    3754.022   499  7.52309018           Root MSE      =  2.3486

------------------------------------------------------------------------------
           S |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      ASVABC |   1.580901   .1170059    13.51   0.000     1.351015    1.810787
       _cons |   14.43677   .1097335   131.56   0.000     14.22117    14.65237
------------------------------------------------------------------------------

Here is the regression omitting SM.

27

Comparing the two regressions, the coefficient of ASVABC is indeed higher when SM is omitted. Part of the difference may be due to pure chance, but part is attributable to the bias.

28
. reg S SM
----------------------------------------------------------------------------
Source | SS df MS Number of obs = 500
-----------+------------------------------ F( 1, 498) = 68.44
Model | 453.551645 1 453.551645 Prob > F = 0.0000
Residual | 3300.47036 498 6.62745051 R-squared = 0.1208
-----------+------------------------------ Adj R-squared = 0.1191
Total | 3754.022 499 7.52309018 Root MSE = 2.5744
----------------------------------------------------------------------------
S | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-----------+----------------------------------------------------------------
SM | .3598719 .0435019 8.27 0.000 .2744021 .4453417
_cons | 9.992614 .6002469 16.65 0.000 8.813286 11.17194
----------------------------------------------------------------------------

$$E(\hat{\beta}_3)=\beta_3+\beta_2\,\frac{\sum\,(ASVABC_i-\overline{ASVABC})(SM_i-\overline{SM})}{\sum\,(SM_i-\overline{SM})^2}$$

Here is the regression omitting ASVABC instead of SM. We would expect $\hat{\beta}_3$ to be upward biased. We anticipate that b2 is positive, and we know that both the numerator and the denominator of the other factor in the bias expression are positive.
29

In this case the bias is quite dramatic. The coefficient of SM has nearly doubled. The
reason for the bigger effect is that the variation in SM is much smaller than that in ASVABC,
while b2 and b3 are similar in size, judging by their estimates.
30

Finally, we will investigate how R2 behaves when a variable is omitted. In the simple
regression of S on ASVABC, R2 is 0.27, and in the simple regression of S on SM it is 0.12.

31

Does this imply that ASVABC explains 27% of the variance in S and SM 12%? No, because
the multiple regression reveals that their joint explanatory power is 0.30, not 0.39.

32

In the second regression, ASVABC is partly acting as a proxy for SM, and this inflates its
apparent explanatory power. Similarly, in the third regression, SM is partly acting as a
proxy for ASVABC, again inflating its apparent explanatory power.
33
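This non-additivity of R2 with positively correlated regressors can be reproduced in a simulation (variable names and coefficients are invented; only the positive correlation between the regressors matters):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000
ability = rng.normal(size=n)
sm = 0.4 * ability + rng.normal(size=n)        # positively correlated with ability
s = 1.0 + 1.0 * ability + 0.5 * sm + rng.normal(size=n)

def r_squared(cols, y):
    # OLS R-squared from regressing y on a constant plus the given columns
    X = np.column_stack([np.ones(len(y))] + cols)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return 1.0 - resid.var() / y.var()

r2_ability = r_squared([ability], s)
r2_sm = r_squared([sm], s)
r2_both = r_squared([ability, sm], s)
print(r2_ability + r2_sm, r2_both)   # the sum exceeds the joint R-squared
```

Each simple regression credits its regressor with part of the other's explanatory power, so the two simple R2 values overlap and their sum overstates the joint R2.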
$LGEARN = \beta_1 + \beta_2\,S + \beta_3\,EXP + u$
. reg LGEARN S EXP
----------------------------------------------------------------------------
Source | SS df MS Number of obs = 500
-----------+------------------------------ F( 2, 497) = 40.12
Model | 21.2104059 2 10.6052029 Prob > F = 0.0000
Residual | 131.388814 497 .264363811 R-squared = 0.1390
-----------+------------------------------ Adj R-squared = 0.1355
Total | 152.59922 499 .30581006 Root MSE = .51416
----------------------------------------------------------------------------
LGEARN | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-----------+----------------------------------------------------------------
S | .0916942 .0103338 8.87 0.000 .0713908 .1119976
EXP | .0405521 .009692 4.18 0.000 .0215098 .0595944
_cons | 1.199799 .1980634 6.06 0.000 .8106537 1.588943
----------------------------------------------------------------------------

However, it is also possible for omitted variable bias to lead to a reduction in the apparent
explanatory power of a variable. This will be demonstrated using a simple earnings
function model, supposing the logarithm of hourly earnings to depend on S and EXP.
34
. cor S EXP
(obs=500)

        |       S     EXP
--------+------------------
      S |  1.0000
    EXP | -0.5836  1.0000

$$E(\hat{\beta}_2)=\beta_2+\beta_3\,\frac{\sum\,(S_i-\bar{S})(EXP_i-\overline{EXP})}{\sum\,(S_i-\bar{S})^2}$$

If we omit EXP from the regression, the coefficient of S should be subject to a downward
bias. b3 is likely to be positive. The numerator of the other factor in the bias term is
negative since S and EXP are negatively correlated. The denominator is positive.
35
$$E(\hat{\beta}_3)=\beta_3+\beta_2\,\frac{\sum\,(EXP_i-\overline{EXP})(S_i-\bar{S})}{\sum\,(EXP_i-\overline{EXP})^2}$$

For the same reasons, the coefficient of EXP in a simple regression of LGEARN on EXP
should be downwards biased.

36
. reg LGEARN S

------------------------------------------------------------------------------
      LGEARN |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           S |   .0664621   .0085297     7.79   0.000     .0497034    .0832207
       _cons |    1.83624   .1289384    14.24   0.000      1.58291    2.089571
------------------------------------------------------------------------------

. reg LGEARN EXP

------------------------------------------------------------------------------
      LGEARN |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         EXP |  -.0096339   .0084625    -1.14   0.255    -.0262605    .0069927
       _cons |   2.886352   .0598796    48.20   0.000     2.768704    3.003999
------------------------------------------------------------------------------

As can be seen, the coefficients of S and EXP are indeed lower in the simple regressions.
In the third regression, the negative bias is sufficient to wipe out the positive effect of
EXP.
37

Indeed, the regression suggests that EXP has a very small, statistically insignificant negative effect on earnings: very small in terms of the numerical coefficient, and hence also very small in terms of R2.
38

If one plotted the data for LGEARN and EXP in a simple scatter diagram, not
controlling for S, one would see no relationship at all.
39
VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE

. reg LGEARN S EXP


Source | SS df MS Number of obs = 500
-----------+------------------------------ F( 2, 497) = 40.12
Model | 21.2104059 2 10.6052029 Prob > F = 0.0000
Residual | 131.388814 497 .264363811 R-squared = 0.1390
-----------+------------------------------ Adj R-squared = 0.1355
Total | 152.59922 499 .30581006 Root MSE = .51416

. reg LGEARN S
Source | SS df MS Number of obs = 500
-----------+------------------------------ F( 1, 498) = 60.71
Model | 16.5822819 1 16.5822819 Prob > F = 0.0000
Residual | 136.016938 498 .273126381 R-squared = 0.1087
-----------+------------------------------ Adj R-squared = 0.1069
Total | 152.59922 499 .30581006 Root MSE = .52261

. reg LGEARN EXP


Source | SS df MS Number of obs = 500
-----------+------------------------------ F( 1, 498) = 1.30
Model | .396095486 1 .396095486 Prob > F = 0.2555
Residual | 152.203124 498 .305628763 R-squared = 0.0026
-----------+------------------------------ Adj R-squared = 0.0006
Total | 152.59922 499 .30581006 Root MSE = .55284

A comparison of R2 for the three regressions shows that the sum of R2 in the two simple
regressions (0.1087 + 0.0026 = 0.1113) is actually less than R2 in the multiple
regression (0.1390).

40
VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE

. reg LGEARN S EXP


Source | SS df MS Number of obs = 500
-----------+------------------------------ F( 2, 497) = 40.12
Model | 21.2104059 2 10.6052029 Prob > F = 0.0000
Residual | 131.388814 497 .264363811 R-squared = 0.1390
-----------+------------------------------ Adj R-squared = 0.1355
Total | 152.59922 499 .30581006 Root MSE = .51416

. reg LGEARN S
Source | SS df MS Number of obs = 500
-----------+------------------------------ F( 1, 498) = 60.71
Model | 16.5822819 1 16.5822819 Prob > F = 0.0000
Residual | 136.016938 498 .273126381 R-squared = 0.1087
-----------+------------------------------ Adj R-squared = 0.1069
Total | 152.59922 499 .30581006 Root MSE = .52261

. reg LGEARN EXP


Source | SS df MS Number of obs = 500
-----------+------------------------------ F( 1, 498) = 1.30
Model | .396095486 1 .396095486 Prob > F = 0.2555
Residual | 152.203124 498 .305628763 R-squared = 0.0026
-----------+------------------------------ Adj R-squared = 0.0006
Total | 152.59922 499 .30581006 Root MSE = .55284

This is because the apparent explanatory power of S in the second regression has been
undermined by the downward bias in its coefficient. As we have seen, the effect is even
more pronounced for the apparent explanatory power of EXP in the third regression.
41
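The direction of both biases follows from the standard omitted-variable-bias expression for the two-regressor model, restated here in the chapter's notation (with $X_3$ the omitted variable):

$$\mathrm{E}\bigl(\hat{\beta}_2\bigr) \;=\; \beta_2 \;+\; \beta_3\,\frac{\sum_i \bigl(X_{2i}-\bar{X}_2\bigr)\bigl(X_{3i}-\bar{X}_3\bigr)}{\sum_i \bigl(X_{2i}-\bar{X}_2\bigr)^2}$$

In this sample S and EXP are negatively correlated (those with more schooling have had less time to accumulate work experience), while both true coefficients are positive. The bias term is therefore negative in each simple regression, dragging the estimated coefficients of S and EXP below their multiple-regression counterparts.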
Copyright Christopher Dougherty 2016.

These slideshows may be downloaded by anyone, anywhere for personal use.


Subject to respect for copyright and, where appropriate, attribution, they may be
used as a resource for teaching an econometrics course. There is no need to
refer to the author.

The content of this slideshow comes from Section 6.2 of C. Dougherty,


Introduction to Econometrics, fifth edition 2016, Oxford University Press.
Additional (free) resources for both students and instructors may be
downloaded from the OUP Online Resource Centre
www.oxfordtextbooks.co.uk/orc/dougherty5e/.

Individuals studying econometrics on their own who feel that they might benefit
from participation in a formal course should consider the London School of
Economics summer school course
EC212 Introduction to Econometrics
http://www2.lse.ac.uk/study/summerSchools/summerSchool/Home.aspx
or the University of London International Programmes distance learning course
EC2020 Elements of Econometrics
www.londoninternational.ac.uk/lse.

2016.05.04
