You are on page 1of 32

Internal and external validity

Karim Nchare

African School of Economics

January 2020
Correlation does not imply causation!!
Correlation does not imply causation!!
Correlation does not imply causation!!
Definitions of internal and external validity

Internal validity: the statistical inferences about causal effects are


valid for the population and setting being studied.

External validity: the statistical inferences can be generalized from


the population and setting studied to other populations and
settings.
Internal validity in an OLS regression model

Suppose we are interested in the causal effect of X1 on Y and we


estimate the following regression model

Yi = β0 + β1 X1i + ui

Internal validity has two components:


I The OLS estimator of β1 is unbiased and consistent
I E [β̂1 ] = β1
I plim β̂1 = β1
n→∞

I Hypothesis tests should have the desired significance level and


confidence intervals should have the desired confidence level.
Internal validity in an OLS regression model

I Is this regression internally valid?

I Is the causal effect of an additional year of education on


average hourly earnings equal to 9.3%?
I If we increase the education of a random sample of individuals
in the U.S. by one year does this increase their average hourly
earnings by 9.3%?
Threats to internal validity

The 3 assumptions of an OLS regression model with one regressor:


I E (ui |X1i ) = 0
I (X1i , Yi ), i = 1, ..., n are iid
I Big outliers are unlikely.
Threats to internal validity:
I Omitted variables
I Functional form misspecification
I Measurement error
I Sample selection
I Simultaneous causality
I Heteroskedasticity and/or correlated error terms
The first 5 are violations of assumption (1) the last one is a
violation of assumption (2).
Omitted variables
I Suppose we want to estimate the causal effect of X1i on Yi .
I The true population regression model is:

Yi = β0 + β1 X1i + β2 X2i + wi , E [wi |X1i , X2i ] = 0

I But we estimate the following model

Yi = β0 + β1 X1i + ui , ui = β2 X2i + wi

I we then have
cov (Yi , X1i ) cov (ui , X1i )
plim β̂1 = = β1 +
n→∞ var (X 1i ) var (X1i )
cov (β2 X2i + wi , X1i ) cov (X2i , X1i )
= β1 + = β1 + β2
var (X1i ) var (X1i )
cov (X2i , X1i )
plim β̂1 = β1 + β2
n→∞ var (X1i )
Omitted variables
I An omitted variable X2i leads to an inconsistent OLS estimate
of the causal effect of X1i if:
I The omitted variable X2i is a determinant of the dependent
variable Yi (β2 6= 0), and
I he omitted variable X2i is correlated with the regressor of
interest X1i (cov (X2i , X1i ) 6= 0)
I If there exists 1 or more variables that satisfy both conditions,
I The OLS estimator does not provide a unbiased and consistent
estimate of the causal effect of X1i
I the OLS regression is not internally valid
I The sign of the bias is given by the sign of β2 cov (X2i , X1i )
I If β2 cov (X2i , X1i ) < 0 we say that OLS underestimates the
causal effect of X1i
I If β2 cov (X2i , X1i ) > 0 we say that OLS overestimates the
causal effect of X1i
I Example: In the returns to education, ability is an omitted
variable.
Functional form misspecification
I Suppose that the true population regression model is

Yi = β0 + β1 X1i + β2 X1i2 + wi , E [wi |X1i ] = 0

I But we estimate the following model

Yi = β0 + β1 X1i + ui , ui = β2 X1i2 + wi

I we then have
cov (Yi , X1i ) cov (ui , X1i )
plim β̂1 = = β1 +
n→∞ var (X1i ) var (X1i )
2
cov (β2 X1i + wi , X1i )
= β1 +
var (X1i )
cov (X1i2 , X1i )
plim β̂1 = β1 + β2
n→∞ var (X1i )

I Since cov (X1i2 , X1i ) 6= 0, OLS is not internally valid if β2 6= 0


Functional form misspecification
Should we include education squared in the regression model?
Functional form misspecification

For major part of the support of education, linear and quadratic


models are very similar.
Measurement error

There are different types of measurement error


I Measurement error in the independent variable X
I Classical measurement error (independent of X )
I Measurement error correlated with X
I Both types of measurement error in X are a violation of
internal validity

I Measurement error in the dependent variable Y


I Less problematic than measurement error in X
I Usually not a violation of internal validity
I Leads to less precise estimates
Measurement error in X : classical measurement error
I Suppose we have the following population regression model

Yi = β0 + β1 X1i + ui E (ui |X1i ) = 0

I Suppose that we do not observe X1i but we observe X̃1i a


noisy measure of X1i :

X̃1i = X1i + ωi

I Using the fact that X1i = X̃1i − ωi , we obtain

Yi = β0 + β1 X̃1i − β1 ωi + ui

I Classical measurement error:

Cov (X1i , ωi ) = 0, Cov (ωi , ui ) = 0, E [ωi ] = 0, Var (ωi ) = σω2

I Example: measurement error due to someone making random


mistakes when imputing data in a database.
Measurement error in X : classical measurement error
I Consider Yi = β0 + β1 X̃1i + ei , ei = −β1 ωi + ui
I With classical measurement error the OLS estimate is
inconsistent:

cov (Yi , X̃1i ) cov (ei , X̃1i )


plim β̂1 = = β1 +
n→∞ var (X̃1i ) var (X̃1i )
cov (−β1 ωi + ui , X1i + ωi )
= β1 +
var (X1i + ωi )
I Since Cov (X1i , ωi ) = Cov (ωi , ui ) = Cov (X1i , ui ) = 0
 
−β1 cov (ωi , ωi ) var (ωi )
plim β̂1 = β1 + = β1 1 −
n→∞ var (X1i ) + var (ωi ) var (X1i ) + var (ωi )
var (X1i )
= β1
var (X1i ) + σω2
I With classical measurement error β1 is underestimated
Measurement error in X : classical measurement error
Measurement error in X : correlated with X
I Measurement error can also be related to X
I For example if X1i is taxable income and individuals
systematically underreport by 10%: X̃1i = 0.9X1i
I Suppose we estimate

Yi = β0 + β1 X̃1i + ei , ei = 0.1β1 X1i + ui


I In this case, we can show that β1 is overestimated

cov (Yi , X̃1i ) cov (ei , X̃1i )


plim β̂1 = = β1 +
n→∞ var (X̃1i ) var (X̃1i )
cov (0.1β1 X1i + ui , 0.9X1i )
= β1 +
var (0.9X1i )
0.9 · 0.1 · β1 var (X1i )
= β1 +
0.92 var (X1i )
 
1
= β1 1 + > β1
9
Measurement error in the dependent variable Y
I Measurement error in Y is generally less problematic than
measurement error in X .
I Suppose Y is measured with classical error

Ỹi = Yi + ηi

I and we estimate

Ỹi = β0 + β1 X1i + ei , ei = ui + ηi

I The OLS estimate β̂1 will be unbiased and consistent because


E [ei |X1i ] = 0
I However, the OLS estimate will be less precise since:

var (ei ) = var (ui ) + var (ηi ) > var (ui )


Measurement error in the dependent variable Y
Measurement error in the returns to education example

I Is measurement error a threat to internal validity in the


regression of earnings on education?
I Data come from the Current Population Survey, a survey
among households in the U.S.
I When individuals have to report their earnings and years of
education in a survey it is likely that they make mistakes.
I Earnings is the dependent variable so measurement error not
so problematic.
I Measurement error in years of education is problematic and
will give a biased and inconsistent estimate of the returns to
education.
Sample selection

I Missing data are a common feature of economic data sets


I We consider 3 types of missing data
I Data are missing at random
I this will not impose a threat to internal validity
I Data are missing based on X
I This will not impose a threat to internal validity.
I For example when we only observe education & earnings for
those who completed high school.
I Can impose a threat to external validity.
I Data are missing based on Y
I This imposes a threat to internal validity.
I For example when individuals with high earnings refuse to
report how much they earn.
I Resulting bias in OLS estimates is called sample selection
bias.
Sample selection
Simultaneous causality
I So far it was assumed X affects Y , what if Y also affects X ?
Yi = β0 + β1 Xi + ui , Xi = γ0 + γ1 Yi + vi
I Simultaneous causality leads to biased & inconsistent OLS
estimate.
I We first solve for Cov (Xi , ui ) assuming cov (ui , vi ) = 0
Cov (Xi , ui ) = cov (γ0 + γ1 Yi + vi , ui ) = cov (γ1 Yi , ui )
= cov (γ1 (β0 + β1 Xi + ui ), ui )
= γ1 β1 Cov (Xi , ui ) + γ1 var (ui )
I Solving for Cov (Xi , ui ) gives:
γ1
Cov (Xi , ui ) = var (ui )
1 − γ1 β 1
I so that
Cov (Xi , ui ) γ1 var (ui )
plim β̂1 = β1 + = β1 + 6= β1
n→∞ var (Xi ) (1 − γ1 β1 )var (Xi )
Simultaneous causality: Example

I Simultaneous causality is unlikely a threat to internal validity


in returns to education example

I earnings are generally realized after completing (formal)


education

I Simultaneous causality is more likely a threat to internal


validity when

I estimating the effect of class size on average test scores

I estimating the effect of increasing the price on product demand


Heteroskedasticity and/or correlated error terms

I The threats to internal validity discussed so far


I Lead to a violation of the first OLS assumption: E [ui |Xi ] = 0
I Lead to biased & inconsistent OLS estimates of the
coefficient(s)

I Heteroskedasticity and/or correlated error terms


I Are a violation of the second OLS assumption: (Xi , Yi ) are iid
I Do not lead to biased & inconsistent OLS estimates of the
coefficient(s)
I But lead to incorrect standard errors
I Hypothesis tests do not have the desired significance level
I Confidence intervals do not have the desired confidence level.
Heteroskedasticity and/or correlated error terms
I Heteroskedasticity (var (ui ) 6= σu2 ) has been discussed during
previous lectures

I Solution is to compute heteroskedasticity robust standard


errors

I Correlated error terms are due to nonrandom sampling

Cov (ui , ui ) 6= 0, i 6= j

I For example if a dataset contains multiple members from 1


family, because instead of individuals entire families are
sampled.

I Solution: Compute cluster-robust se’s that are robust to


autocorrelation
I More about this in lecture on panel data
What to do when you doubt the internal validity?

What to do in the case of:


I Omitted variables:
I if observed, include them as additional regressors
I if unobserved: use panel data or instrumental variables
I Functional form misspecification: adjust the functional
form
I Measurement error:
I develop model of measurement error and adjust estimates
I use instrumental variables
I Sample selection: use different estimation method (beyond
scope of this course)
I Simultaneous causality: use instrumental variables
External validity
I Suppose we estimate a regression model that is internally valid
I Can the statistical inferences be generalized from the
population and setting studied to other populations and
settings?
I Threats to external validity: Differences in populations
I The population from which the sample is drawn might differ
from the population of interest.
I If you estimate the returns to education for men, these results
might not be informative if you want to know the returns to
education for women.
I Threats to external validity: Differences in settings
I The setting studied might differ from the setting of interest
due to differences in laws, institutional environment and
physical environment.
I For example, the estimated returns to education using data
from the U.S might not be informative for Norway
I the educational system is different
I different labor market laws (minimum wage laws)
Internal & external validity and prediction

I Up to now we have dicussed the use of regression analysis to


estimate causal effects
I Regression models can also be used for forecasting
I When regression models are used for forecasting
I When regression models are used for forecasting
I internal validity less important
I not very important that the estimated coefficients are unbiased
and consistent
Internal & external validity and prediction
Consider the following 2 questions:
I What is the causal effect of an additional year of education on
earnings
I What are the average earnings of a 40 year old man with 14
years of education in the U.S in 2018?
We have these results based on CPS data collected in March 2009
in the U.S.:
Internal & external validity and prediction

I The regression results can be used to answer the first question


I if the OLS estimate on education is unbiased and consistent
I if there are no threats to internal validity
I The regression results can be used to answer the second
question
I if the included explanatory variables ’explain’ a lot of the
variation in earnings
I if the regression is externally valid
I if the population and setting studied are sufficiently close to
the population and setting of interest
I It is not nessesary that the OLS estimate on education is
unbiased and consistent.

You might also like