
Misspecification Proxy Variables Random Slopes Measurement Error Missing, Nonrandom, Outlier

Chapter 8. More on Specification and Data Issues

M. Ryan Sanjaya

Departemen Ilmu Ekonomi


Fakultas Ekonomika dan Bisnis
Universitas Gadjah Mada

March 2023

M. Ryan Sanjaya — m.ryan.sanjaya@ugm.ac.id Maret 2023 1/22



Specification and data issues

How do we know that our econometric model is correctly specified?
There is no exact way to know if our model is correctly specified.
Nonetheless, there are some selection criteria to judge whether an
econometric model is good or not.


What is a Good Model?


A good model for empirical analysis should (Hendry & Richard, 1983):
Be data admissible; that is, predictions made from the model must
be logically possible.
Be consistent with theory; that is, it must make good economic
sense.
Have weakly exogenous regressors; that is, the explanatory
variables must be uncorrelated with the error term (no omitted
variable bias).
Exhibit parameter constancy; that is, the values of the parameters
should be stable.
Exhibit data coherency; that is, the residuals estimated from the
model must be purely random (technically, white noise).
Be encompassing; that is, other models cannot be an improvement
over the chosen model.

Specification Errors
Let’s say the true model is
y = β0 + β1 x1 + β2 x2 + u
We make a specification error if we
omit a relevant variable → underfitting the model
    y = β0 + β1 x1 + u
include an irrelevant variable → overfitting the model
    y = β0 + β1 x1 + β2 x2 + β3 x3 + u
estimate the wrong functional form
    ln y = β0 + β1 x1 + β2 x2 + u
use a proxy, e.g., x2∗, that may contain measurement error
    y = β0 + β1 x1 + β2 x2∗ + u
incorrectly specify the stochastic error term.

Detecting Model Misspecification

Look at the pattern of the residuals.
  Plot the residuals on the vertical axis and an explanatory variable on the horizontal axis.
  No pattern = good
RESET (regression specification error test).
Davidson-MacKinnon J test.


RESET
1 Obtain the fitted values ŷ and R²_old of the original (but not necessarily true) model
    y = β0 + β1 x1 + β2 x2 + u.
2 Estimate the expanded model by adding ŷ² and ŷ³, and get R²_new
    y = β0 + β1 x1 + β2 x2 + δ1 ŷ² + δ2 ŷ³ + error.
3 Calculate the F statistic
    F = \frac{R^2_{new} - R^2_{old}}{1 - R^2_{new}} \cdot \frac{n - k - 3}{2}
  under the null hypothesis H0: δ1 = 0 and δ2 = 0.
The distribution of the F statistic is approximately F(2, n−k−3) in large
samples (under the Gauss-Markov assumptions).
If H0 is rejected, then we have a functional form problem.
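
The F statistic in step 3 can be sketched in a few lines of Python. This is only an illustration: the function name `reset_f_stat` and the numbers passed to it are hypothetical and do not come from the slides. The expression is algebraically the same as the formula above, written as a ratio of two degrees-of-freedom-adjusted terms.

```python
def reset_f_stat(r2_old, r2_new, n, k):
    """F statistic for RESET with two added terms (yhat^2 and yhat^3).

    r2_old: R-squared of the original model with k regressors.
    r2_new: R-squared of the expanded model.
    n:      number of observations.
    """
    df_num = 2          # two restrictions: delta1 = 0 and delta2 = 0
    df_den = n - k - 3  # n minus (k + 2 slopes) minus 1 intercept
    return ((r2_new - r2_old) / df_num) / ((1.0 - r2_new) / df_den)


# Hypothetical inputs, for illustration only.
f_stat = reset_f_stat(r2_old=0.50, r2_new=0.60, n=103, k=2)
print(round(f_stat, 2))
```

The resulting value would then be compared with a critical value from the F distribution with 2 and n − k − 3 degrees of freedom.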


RESET

An LM version is also available (and the chi-square distribution will have two df).
The test can also be made robust to heteroskedasticity using the methods discussed in the previous chapter.
Drawback of the RESET test: it provides no real direction on how to
proceed if the model is rejected → what's the best model then?
RESET has no power for detecting omitted variables whose expectations are linear in the included regressors.
If the functional form is properly specified, RESET has no power
for detecting heteroskedasticity.
In Stata:
* Estimate the original model, then run RESET on powers of the fitted values
reg y x1 x2
estat ovtest


Davidson-MacKinnon test
Two nonnested models:
    y = β0 + β1 x1 + β2 x2 + u          (1)
vs
    y = β0 + β1 ln x1 + β2 ln x2 + u    (2)
1 Estimate model 2 and obtain the fitted values y̌.
2 Use y̌ as an additional regressor in model 1.
  You can also do the opposite: estimate model 1 and use its fitted
  values as a regressor in model 2.
3 Use a t-test: if the estimated parameter for y̌ is significant, then
  model 1 is rejected.
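
The three steps can be sketched in Python with the standard library only. Everything here is an assumption made for illustration: the data are simulated, each model has a single regressor rather than two, the true model is the log specification, and the helpers `invert` and `ols` are hypothetical names written for this sketch.

```python
import math
import random


def invert(a):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(a)
    m = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        p = m[col][col]
        m[col] = [v / p for v in m[col]]
        for r in range(n):
            if r != col:
                f = m[r][col]
                m[r] = [v - f * w for v, w in zip(m[r], m[col])]
    return [row[n:] for row in m]


def ols(rows, y):
    """OLS of y on rows; each row must already include the constant 1.0."""
    n, p = len(rows), len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    inv = invert(xtx)
    beta = [sum(inv[i][j] * xty[j] for j in range(p)) for i in range(p)]
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    s2 = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted)) / (n - p)
    se = [math.sqrt(s2 * inv[i][i]) for i in range(p)]
    return beta, se, fitted


# Simulated data: the log model is the true one (an assumption of this sketch).
random.seed(0)
x = [random.uniform(1.0, 20.0) for _ in range(500)]
y = [1.0 + 2.0 * math.log(xi) + random.gauss(0.0, 1.0) for xi in x]

# Step 1: estimate model 2 (logs) and keep its fitted values.
_, _, fitted_log = ols([[1.0, math.log(xi)] for xi in x], y)

# Step 2: add those fitted values as a regressor in model 1 (levels).
beta, se, _ = ols([[1.0, xi, fi] for xi, fi in zip(x, fitted_log)], y)

# Step 3: t statistic on the coefficient of the fitted values.
t_stat = beta[2] / se[2]
print(abs(t_stat) > 1.96)
```

A large absolute t statistic here counts as evidence against model 1; as the test is only asymptotically valid, the 1.96 cutoff is a rough 5% benchmark.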


Drawbacks of the Davidson-MacKinnon test

A clear winner need not emerge.
  Both models could be rejected or neither model could be rejected.
  If neither is rejected, use adjusted R² to choose between them.
  If both models are rejected, more work needs to be done.
Rejection of one model does not mean that the other model is the
correct one.
The test cannot compare models with different dependent variables.


Proxy Variables

In the absence of a relevant variable, use a proxy variable.
  If ability is unobserved, use IQ score.
  Are IQ and ability the same? Is there measurement error?


What is a Good Proxy?

Suppose the model is
    y = β0 + β1 x1 + β2 x2 + β3 x3∗ + u
where x3∗ is unobserved and is proxied by x3 in the plug-in
regression of y on x1, x2, x3.
The variable x3 is a good proxy for x3∗ if
  x3∗ is closely correlated with x3, that is x3∗ = δ0 + δ3 x3 + v3,
  the error term u is uncorrelated with x1, x2, and x3∗,
  u is uncorrelated with x3,
  v3 is uncorrelated with x1, x2, x3.
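
To see why these conditions matter, substitute the proxy relation x3∗ = δ0 + δ3 x3 + v3 into the model. The derivation below is a short sketch using only the symbols defined on this slide:

```latex
\begin{aligned}
y &= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 (\delta_0 + \delta_3 x_3 + v_3) + u \\
  &= (\beta_0 + \beta_3 \delta_0) + \beta_1 x_1 + \beta_2 x_2 + \beta_3 \delta_3 x_3 + (u + \beta_3 v_3)
\end{aligned}
```

Under the conditions listed, the composite error u + β3 v3 is uncorrelated with x1, x2, and x3, so the plug-in regression estimates β1 and β2 consistently. The intercept and the coefficient on x3 estimate β0 + β3 δ0 and β3 δ3 rather than β0 and β3.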


Lagged Dependent Variable as Proxy


We can use a lagged dependent variable as a proxy in a cross-sectional
regression.
For example:
    crime = β0 + β1 officer + β2 unem + β3 crime₋₁ + u.
Some cities had a high crime rate in the past and still have one today.
If crime₋₁ is not included, the estimates may suffer from reverse causality:
a city with a high crime rate tends to have high unemployment and many
police officers.
If crime₋₁ is included, we can do this thought experiment: if two cities have
the same previous crime rate and current unemployment rate, then
β1 measures the effect of another police officer on the crime rate.


Models with Random Slopes

A model with random slopes is given by
    yi = ai + bi xi.
The slope coefficient bi is a random draw from the population.
  That is, the slope of x varies by individual.
We cannot estimate the slope for each observation, but we can
estimate the average slope across the population → average partial
effect (APE) or average marginal effect (AME).
The assumption is that the slopes are independent of the
explanatory variables.
In Stata:
mixed
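
As a rough illustration of the average partial effect, the simulation below draws individual intercepts and slopes independently of x and then runs one pooled simple regression. This is a sketch under assumed distributions (all parameter values and the helper `ols_slope` are hypothetical); it is not a substitute for the mixed-model estimator that `mixed` fits.

```python
import random


def ols_slope(x, y):
    """Slope from a simple regression of y on x (with an intercept)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


random.seed(1)
n = 20000
x = [random.gauss(0.0, 1.0) for _ in range(n)]
a = [random.gauss(1.0, 1.0) for _ in range(n)]  # random intercepts a_i
b = [random.gauss(2.0, 0.5) for _ in range(n)]  # random slopes b_i, independent of x
y = [ai + bi * xi for ai, bi, xi in zip(a, b, x)]

slope = ols_slope(x, y)  # pooled OLS slope
mean_b = sum(b) / n      # average of the individual slopes (the APE)
print(round(slope, 2), round(mean_b, 2))
```

Because the slopes are drawn independently of x, the pooled slope should land close to the average of the individual slopes; if b_i depended on x_i, this would generally fail.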


Example

Determinants of language score.
  Dataset: snijders.dta
  Cross section of 2,287 eighth-grade students from 131 schools in the
  Netherlands.
  Variables of interest: langpost (language score), iqvc (average
  verbal IQ score), and schoolnr (identity code for each school).
In Stata:
** Random intercepts only
mixed langpost iqvc || schoolnr: , mle

** Random intercepts and slopes
mixed langpost iqvc || schoolnr: iqvc, mle covariance(indep)


Example Result — Random Intercepts

The expected language score for a child with average verbal IQ averages
40.61 across all schools, but with substantial variation (variance = 9.50).
The common slope is estimated as a gain of 2.49 points in language score
per point of verbal IQ.


Example Result — Random Intercepts and Slopes

The expected language score for a child with average IQ now averages
40.64 across schools, with about the same variance of 9.54.
The expected gain in language score per point of IQ averages 2.52, a bit
higher than in the random-intercepts model.


Measurement Error

The measurement error is defined as the difference between the
observed value (y, x1) and the actual value in the population (y∗, x1∗):
    e0 = y − y∗
    e1 = x1 − x1∗
If e0 and e1 are uncorrelated with the explanatory variables → good.
If e0 and e1 are correlated with the error term u → bias → need to
collect new data with a better data-collection technique.


Classical errors-in-variables

Classical errors-in-variables (CEV) assumption: the measurement
error is uncorrelated with the unobserved explanatory variable
    Cov(x1∗, e1) = 0.
Under CEV the observed x1 = x1∗ + e1 is correlated with e1, so OLS is
biased and inconsistent → attenuation bias (the estimated slope
is attenuated/weaker/underestimated, that is, pulled toward zero).
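
Attenuation bias under CEV can be illustrated with a small simulation. In a simple regression the probability limit of the OLS slope is β1 · Var(x1∗) / (Var(x1∗) + Var(e1)). The sketch below assumes β1 = 2 and unit variances for both x1∗ and e1, so the slope on the mismeasured regressor should be near 1. All values and the helper `ols_slope` are hypothetical choices made for this illustration.

```python
import random


def ols_slope(x, y):
    """Slope from a simple regression of y on x (with an intercept)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


random.seed(2)
n = 20000
beta1 = 2.0
x_star = [random.gauss(0.0, 1.0) for _ in range(n)]          # true regressor
y = [1.0 + beta1 * xs + random.gauss(0.0, 1.0) for xs in x_star]
x_obs = [xs + random.gauss(0.0, 1.0) for xs in x_star]        # e1 independent of x_star (CEV)

slope_true = ols_slope(x_star, y)  # regression on the correctly measured regressor
slope_obs = ols_slope(x_obs, y)    # regression on the mismeasured regressor
print(round(slope_true, 2), round(slope_obs, 2))
```

The more noise there is in the measurement relative to the variation in x1∗, the closer the estimated slope is pulled toward zero.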


Missing Data

If the data are missing completely at random (MCAR), then
missing data cause no statistical problems.
Complete cases estimator: use only observations with complete
data in the regression.
Multiple-imputation method.
In Stata:
mi estimate


Missing Indicator Method

Procedure.
1 Create Zik = xik when it is observed, 0 otherwise.
2 Create a missing data indicator mik = 1 when xik is missing, 0 otherwise.
3 Estimate yi on xi1, ..., xi,k−1, Zik, mik for i = 1, ..., n.
Drawbacks of MIM.
  Requires strong assumptions, such as xk being uncorrelated with
  x1, x2, ..., xk−1.
  It is less robust than the complete cases estimator.
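
Steps 1 and 2 of the procedure amount to building two new columns from xk. A minimal Python sketch, assuming missing values are coded as `None` (the function name `missing_indicator` and the sample values are hypothetical):

```python
def missing_indicator(xk):
    """Return (Z, m): Z is xk with missing values set to 0, m flags the missing entries."""
    z = [0.0 if v is None else float(v) for v in xk]
    m = [1 if v is None else 0 for v in xk]
    return z, m


# Hypothetical column with two missing observations.
xk = [3.5, None, 1.2, None, 4.0]
z, m = missing_indicator(xk)
print(z)
print(m)
```

Step 3 would then regress y on the other regressors together with both new columns.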


Nonrandom Samples

Exogenous sample selection.
  Selection based on the independent variables (sometimes called
  missing at random, MAR).
  E.g., regressing y on x1 and age, but the survey covers only those with
  age > 40; a nonrandom sample of adults.
  Does not cause bias.
Endogenous sample selection.
  Selection based on the dependent variable.
  E.g., regressing wealth on x1, x2, but only those with
  wealth < 250,000 are in the sample.
  Creates biased and inconsistent estimates.
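
The contrast between the two kinds of selection can be checked with a simulation. The sketch below assumes a true model y = 1 + 2x + u with standard normal x and u; the cutoffs (x > 0 and y < 1) and the helper names are hypothetical choices for illustration.

```python
import random


def ols_slope(x, y):
    """Slope from a simple regression of y on x (with an intercept)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


def slope_of(sample):
    """OLS slope for a list of (x, y) pairs."""
    return ols_slope([p[0] for p in sample], [p[1] for p in sample])


random.seed(3)
x = [random.gauss(0.0, 1.0) for _ in range(20000)]
y = [1.0 + 2.0 * xi + random.gauss(0.0, 1.0) for xi in x]
pairs = list(zip(x, y))

exog = [(xi, yi) for xi, yi in pairs if xi > 0.0]   # selection on the regressor
endog = [(xi, yi) for xi, yi in pairs if yi < 1.0]  # selection on the dependent variable

slope_exog = slope_of(exog)
slope_endog = slope_of(endog)
print(round(slope_exog, 2), round(slope_endog, 2))
```

Selecting on x leaves the slope estimate close to the true value of 2, while selecting on y pulls it well below 2.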


Outliers

Least absolute deviations (LAD) estimation can be used to
minimise the impact of outliers in a regression.
  LAD minimises the sum of the absolute residuals.
LAD is designed to estimate the parameters of the conditional
median of y given the xs.
LAD is a special case of robust regression and quantile regression.
In Stata:
qreg y x1 x2
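
The link between LAD and the median is easiest to see in the intercept-only case, where there are no regressors: the constant that minimises the sum of absolute deviations is the sample median, while the constant that minimises the sum of squares is the mean. The sketch below uses hypothetical data with one outlier; it is not a full LAD regression like `qreg`.

```python
import statistics


def sum_abs_dev(values, c):
    """Sum of absolute deviations of the values from the constant c."""
    return sum(abs(v - c) for v in values)


# Hypothetical data with one large outlier.
y = [1.0, 2.0, 3.0, 4.0, 100.0]

median_y = statistics.median(y)  # the intercept-only LAD fit
mean_y = statistics.mean(y)      # the intercept-only OLS fit

print(median_y, mean_y)
print(sum_abs_dev(y, median_y), sum_abs_dev(y, mean_y))
```

The outlier drags the mean far from the bulk of the data, while the median stays put, which is the sense in which LAD is less sensitive to outliers.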
