All content following this page was uploaded by Marius van Oordt on 06 August 2019.
ABSTRACT
Keywords: Econometrics
JEL Classifications: C01
Email: marius.vanoordt@up.ac.za
Contents
CROSS-SECTIONAL DATA
  Proxy variables
  Consistency
  Tobit model for continuous dependent variable with many zero observations
  Heteroscedasticity
  Outliers
  Testing whether a variable is endogenous
  Stationary
  Correcting serial correlation
  Heteroscedasticity
  SEMs
  Forecasting
PANEL DATA
  IV estimator
CROSS-SECTIONAL DATA
Ordinary Least Squares (OLS) Assumptions
1. The parameters are linear (note: not necessarily the independent variables). OLS cannot be
performed when the equation is, e.g., y = α + β²·x + u
2. The sample is obtained randomly from a population. This is not always the case.
3. There is variance in the independent variables. This almost always holds in practice and
can usually be ignored as a requirement.
4. Zero conditional mean of the error (which gives unbiased parameters), written as
E(u|x1, x2, …, xk) = E(u) = 0
This means that there are no unobserved factors (included in the error term) that are
correlated with the independent variables. Alternatively stated, all other factors not
included in the model that affect y are uncorrelated with x1, x2, …, xk.
If this does not hold, the parameters are biased upwards or downwards and we say that
we have endogenous explanatory variables. Note that this assumption will also not hold
if the incorrect functional form for the independent variables is chosen, if there is
measurement error in the independent variables, or in the presence of simultaneity bias
(all of these are discussed later). Asymptotically, functional form misspecification is less
important than the other sources mentioned.
It is important to understand the omitted variable bias that results if this assumption
does not hold. This can be written Bias(B̂1) = B2·δ, where B2 indicates the relationship
between the omitted variable xj and y, and δ indicates the correlation between
x1 and xj, the endogenous variable and the omitted variable. It is not possible to
determine the magnitude of the bias, but we can indicate whether the bias is upwards or
downwards. If B2 is positive and δ is positive, we have upward bias (this is based on
intuition). Similarly, if one is positive and the other negative, we have downward bias. If
both are negative, we have upward bias.
It should be remembered that a biased parameter will also bias the estimates of all
parameters whose variables are correlated with that variable. In discussing our results from
a multiple regression, however, we do not usually discuss whether the exogenous variables,
meaning variables not correlated with the error term, are biased upwards or downwards as a
result of including an endogenous variable in the model.
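The direction of omitted variable bias can be illustrated with a small simulation, a sketch with hypothetical coefficients: B2 and δ are both positive, so the short regression that omits x2 should be biased upward.

```python
import random

# Simulation sketch of omitted variable bias (hypothetical numbers):
# population model y = 1 + 2*x1 + 3*x2 + u, with x2 positively correlated
# with x1 (delta > 0) and B2 = 3 > 0, so the short regression is biased upward.
random.seed(0)
n = 5000
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [0.8 * a + random.gauss(0, 1) for a in x1]
y = [1 + 2 * a + 3 * b + random.gauss(0, 1) for a, b in zip(x1, x2)]

def slope(xs, ys):
    # simple-regression slope: cov(x, y) / var(x)
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    var = sum((a - mx) ** 2 for a in xs)
    return cov / var

b_short = slope(x1, y)  # regression of y on x1 alone, omitting x2
print(round(b_short, 2))  # roughly 2 + 3*0.8 = 4.4, well above the true 2
```

The estimate settles near 2 + 3·0.8 = 4.4 rather than the true 2, matching the Bias(B̂1) = B2·δ expression above.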
5. Homoskedasticity
Var(u|x1, x2, …, xk) = σ² = Var(y|x)
This means that the variance of the dependent variable, given the independent
variables, is constant. This also means the variance of the error terms is
constant around the regression line for each observation and does not change as the
level of the independent variables changes.
If this does not hold, the standard errors of the parameters are incorrect, and inference
about the population parameters is therefore unreliable.
It is also very important to note that increased variability in the independent variable
will decrease the standard error of the parameter.
6. There is no perfect collinearity between the independent variables
An independent variable may not be a constant, and there may not be an exact linear
relationship between independent variables, e.g. x1 = k·x2 or x1 = x2 + x3.
Note that x1 together with log(x1) or x1² does not form a linear relationship and is allowed.
The main purpose of including multiple independent variables is to take controls out of the
error terms and put them explicitly in the equation. This is done to adhere to assumption four
above.
For interpretation take the following regression
y = α + β·x1 + γ·x2 + u
β measures the relationship between y and x1 after the other variables (x2) have been partialled out.
The same is true for all other parameters, unless two parameters use different functional forms
of the same variable, discussed next.
In the case where 𝑥𝑥1 is e.g. income and 𝑥𝑥2 is income squared then the derivative of the equation
would have to be used to interpret 𝛽𝛽, for instance
y = a + B1·x + B2·x² + u
Δy/Δx = B1 + 2·B2·x
If there are other independent variables included, the partial derivative (treating all other
variables as constant) would need to be taken to interpret B1. The same logic applies to
interaction terms: the interaction term forms part of the interpretation, just as it would
for a partial derivative.
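The quadratic interpretation above can be sketched numerically; the coefficients below are hypothetical.

```python
# Marginal effect in y = a + B1*x + B2*x^2 (hypothetical coefficients):
B1, B2 = 0.30, -0.01

def marginal_effect(x):
    # partial derivative dy/dx = B1 + 2*B2*x
    return B1 + 2 * B2 * x

print(round(marginal_effect(5), 2))   # 0.2
print(round(marginal_effect(20), 2))  # -0.1: the effect has turned negative
# turning point where the effect is zero: x = -B1 / (2*B2) = 15
```

A single coefficient therefore never summarises the effect of x; it depends on the level of x at which it is evaluated.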
Proxy variables
Before estimating a model, we should always specify the population model. Often a population
model will include unobservable variables (for instance ability) that we cannot include in our
model to be estimated (we cannot observe it). In such instances, it is generally preferable to
include a proxy variable (which can be observed) to reduce or possibly remove the bias of not
including the unobservable variable. The requirements for an ideal proxy are
1. If we were able to include the unobserved variable, the proxy variable would be
irrelevant. This is always met when the population model is correctly specified.
2. The independent variables are not partially correlated with the unobserved variable after
including the proxy variable. If this is not the case then the independent variables will
still be correlated with the error term, although most likely to a lesser extent than if the
proxy was not included (less bias).
It should be noted that even if the second requirement is not met and we have an imperfect
proxy, it is generally still a good idea to include it in the estimation model.
It may also be required that the proxy interact with another independent variable in the
population model. If q is taken as the unobserved variable in the model
y = β0 + β1·x + γ1·q + γ2·x·q + u
then the interpretation of x will be the partial effect β1 + γ2·q. This presents a problem, since
q is not observed. We can, however, obtain the average partial effect if we assume the average
of q in the population is zero, meaning the average partial effect is β1 (see footnote 2). Once we
take a proxy for q, it is therefore required that we demean the proxy in the sample before
interaction; we then obtain the average partial effect for β1. Further note that if the interaction
term is significant, the error term will be heteroskedastic. A model with an interaction proxy is
called a random coefficient model.
Sum of squares total (SST) = Sum of squares explained (SSE) + Sum of squared residuals
(SSR). R² is therefore SSE over SST: the explained variance over the total variance. A higher
R squared does not always indicate a better model; additional variables should only be included
if they have a non-zero partial effect on the dependent variable in the population. It is also common
Footnote 2: If x is binary, then we call this the average treatment effect. As previously mentioned,
all estimated coefficients are average partial effects.
7
to calculate the adjusted R², defined as 1 − [SSR/(n − k − 1)] / [SST/(n − 1)]. This is useful
as the adjusted R² does not always increase when additional variables are added: if an additional
variable has a t-statistic of less than one in absolute value, the adjusted R² will decrease.
This is also useful for non-nested model selection.
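The adjusted R² formula above can be computed directly; the sums of squares below are hypothetical.

```python
# Adjusted R^2 from the sums of squares (hypothetical values):
# R2_adj = 1 - [SSR / (n - k - 1)] / [SST / (n - 1)]
def adjusted_r2(ssr, sst, n, k):
    return 1 - (ssr / (n - k - 1)) / (sst / (n - 1))

print(round(adjusted_r2(ssr=40.0, sst=100.0, n=50, k=3), 3))  # 0.574
```

Compare with the plain R² of 1 − 40/100 = 0.6; the penalty for the three slope parameters pulls the adjusted figure down.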
The sampling variance of the OLS slope estimates is calculated as follow:
Var(B̂j) = σ² / [SSTj·(1 − Rj²)]
Where 𝜎𝜎 2 is the error variance of the regression. This means a larger variance in the error (more
noise) leads to more variance in the estimate. Adding more variables reduces this variance.
Further, SSTj is the total sample variation in xj. This means that the more variation in the
sample (or, alternatively, the larger the sample), the smaller the variance of the estimate becomes.
Lastly and very importantly, Rj² indicates the extent of multicollinearity between xj (e.g. the
variable of interest) and the other independent variables. This can, for instance, be seen by
looking at VIFs for xj. In other words, this is the linear relationship between one independent
variable and all other independent variables. The more collinearity between this variable and the
others, the larger Var(B̂j) becomes. This is where multicollinearity becomes a “problem”,
but it should be seen that multicollinearity has the same effect as a small sample, as a small
sample reduces SSTj. If a variable is dropped due to multicollinearity, then we may not meet
assumption 4 (estimates will be biased) and σ² will increase, so this is not a good idea.
Multicollinearity does not make any OLS assumption invalid and does not need to be addressed
(as opposed to perfect multicollinearity). Further, if other variables are collinear, besides the
variable of interest, and these variables are not correlated with the variable of interest, this
will not influence Var(B̂j).
In conclusion, focus on having 𝜎𝜎 2 as small as possible and 𝑆𝑆𝑆𝑆𝑇𝑇𝐽𝐽 as large as possible and worry
less about multicollinearity.
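The variance formula above can be sketched numerically; the values are hypothetical, and note that Rj² = 0.9 corresponds to a VIF of 10.

```python
# Sampling variance of an OLS slope (hypothetical values):
# Var(Bj) = sigma^2 / [SST_j * (1 - R_j^2)], and VIF_j = 1 / (1 - R_j^2)
def slope_variance(sigma2, sst_j, r2_j):
    return sigma2 / (sst_j * (1 - r2_j))

print(round(slope_variance(2.0, 100.0, 0.0), 4))  # 0.02  (no collinearity)
print(round(slope_variance(2.0, 100.0, 0.9), 4))  # 0.2   (VIF = 10)
```

Collinearity with Rj² = 0.9 inflates the variance tenfold, exactly the same effect as shrinking SSTj by a factor of ten.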
This, however, does not mean that we should add as many variables as possible to the model.
The ceteris paribus interpretation should always be considered. It does not make sense to add
for instance the amount of beer consumption and the amount of tax collected from beer
consumption in a model where we are interested in the effect of the beer tax on fatalities in
motor vehicle accidents; the ceteris paribus interpretation becomes nonsensical. However, if
we have a variable that affects y and is uncorrelated with all other independent variables, such
a variable should always be included; it does not increase multicollinearity and results in
smaller standard errors.
To calculate σ̂² in a sample, we write
σ̂² = SSR / df
where df (degrees of freedom) is n (observations) − k (slope parameters) − 1.
Take the root to obtain σ̂, the standard error of the regression. This standard error is used to
compute the standard error of a parameter, se(B̂j) = √var(B̂j). Note that heteroscedasticity
invalidates this, and we can then no longer be certain that OLS has the smallest variance of
all linear unbiased estimators (that OLS is Best).
The classic linear model is not an estimator but an assumption important for hypothesis testing
and statistical inference of the sample to the population. The assumption includes 1-6 of OLS
and the normality assumption.
Formally, the additional assumption of the CLM is
E(u|x1, x2, …, xk) = E(u) and u ~ Normal(0, σ²)
The assumption is therefore that the error term follows a normal distribution, which means that
the estimates are normally distributed, any linear combination of β̂1, β̂2, …, β̂k is normally
distributed, and any subset of the β̂j has a joint normal distribution.
T = (Estimated – Hypothesised value) / Standard error of estimated (this is useful for when
hypothesized value is not zero).
It should be seen that smaller standard errors lead to higher t-stats; this, in turn, decreases
the probability of obtaining such a t-stat under the null, meaning a lower p-value. Standard
errors are calculated from standard deviations (divided by √n), which are in turn calculated
from Var(B̂j). This means that for statistical significance under the CLM assumptions we want
small σ², large SSTj and small Rj². Large samples are therefore key to statistical inference.
Also, remember that statistical significance is not necessarily equal to economic significance.
For the population hypothesis H0: B1 = B2, alternatively stated B1 − B2 = 0, the t-test is therefore
t = (B̂1 − B̂2 − 0) / se(B̂1 − B̂2)
This can be estimated by creating a new variable for 𝐵𝐵1 − 𝐵𝐵2 and replacing this in the original
equation.
For the population hypothesis 𝐻𝐻0 : 𝐵𝐵3 = 0, 𝐵𝐵4 = 0, 𝐵𝐵5 = 0 one cannot look at individual t-tests
as the other parameters are not restricted and we are interested in the joint significance of three
(or however many) variables. One way to see this is how SSR change with the removal of these
three variables. We therefore have an unrestricted (original) model and a restricted model,
which is the original model after removing the variables that we wish to restrict (indicated in
Ho). The F test is then written
F = [(SSR_restricted − SSR_unrestricted) / (df_restricted − df_unrestricted)] / [SSR_unrestricted / df_unrestricted]
If the null is rejected, then B3, B4 and B5 are jointly statistically significant.
The F-test is also useful for testing the exclusion of a group of variables that are highly
correlated. It may often be the case that many similar variables are not significant under the
t-test but are jointly significant under the F-test. This is where the F-test becomes very
important, as we then do not need to drop variables due to multicollinearity. The F-statistic
that Stata reports with each regression tests the hypothesis that all slope parameters are
jointly equal to zero.
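The F statistic above can be computed directly from the two models' SSRs; the values below are hypothetical.

```python
# F statistic for q exclusion restrictions (hypothetical sums of squares):
# F = [(SSR_r - SSR_ur) / q] / [SSR_ur / (n - k - 1)]
def f_stat(ssr_r, ssr_ur, q, n, k):
    return ((ssr_r - ssr_ur) / q) / (ssr_ur / (n - k - 1))

# n = 100 observations, k = 5 slope parameters in the unrestricted model,
# q = 3 restrictions under the null
print(round(f_stat(ssr_r=120.0, ssr_ur=100.0, q=3, n=100, k=5), 3))  # 6.267
```

The statistic is compared with an F(q, n − k − 1) critical value; here q = df_restricted − df_unrestricted = 3.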
Consistency
As n grows, B̂j converges to Bj, meaning the estimate gets closer and closer to the actual
population parameter. This essentially means that there is no asymptotic bias and the parameter
is consistently estimated. The assumption required for this to hold is that E(u) = 0 and
Cov(xj, u) = 0 for all j.
Note that this is a slightly less strict assumption than assumption 4 of OLS for a finite sample,
and it states that the covariance between each variable individually and the error term should be
zero. If this assumption does not hold, the variable that is correlated with the error term, as
well as all other variables that are correlated with this variable or the error term, will be
biased and inconsistent. This inconsistency does not disappear as the sample size increases,
meaning B̂j converges to an incorrect population value.
Asymptotic normality
The T, F and LM tests rely on a normal distribution of u in the population.
According to the central limit theorem, OLS estimates (and the error term) are approximately
normally distributed in large samples (roughly n > 30) and we can therefore use these tests for
large samples, even if the errors appear non-normally distributed (there are certain cases where
the non-normal distribution may still be an issue). This means that the normality assumption
of the CLM is generally not required for OLS hypothesis testing.
Note that the zero mean and homoscedasticity assumptions are still required.
Other consequences of the asymptotic normality of the estimators are that the error variance is
consistent and that standard errors are expected to shrink at a rate of 1/√𝑛𝑛.
Asymptotic efficiency
If OLS assumptions hold, then it has the smallest asymptotic variance of all estimators. If
heteroscedasticity is present, there may exist better estimators than OLS.
Transformation of variables
Scaling data does not change any measured effect or testing outcome, only the interpretation
changes.
It may be useful in certain scenarios to run a standardized model with only beta coefficients
(also called standardized coefficients) as this gives an indication of the magnitude of the effect
of the different independent variables on the dependent variable. This is done by taking the
z-score of all variables, and the interpretation is then the standard-deviation change in the
dependent variable for a one-standard-deviation change in an independent variable.
Logs are useful for obtaining elasticities or semi-elasticities. Further, taking the natural log of
a variable may increase the normality and reduce heteroscedasticity of the variable by drawing
in large variances (this also increase the likelihood of statistical significance as there is less
variance in the error term). This is particularly useful for significantly skewed variables where
the central limit theorem is unlikely to hold (the CLM assumption is therefore violated). Also,
the impact of outliers is reduced. It should, however, be noted that the log of a variable is a new
variable with a different interpretation than the original variable. Further, a log should not be
taken for a variable with many values between 0 and 1 or a variable with 0 values. A constant
can be added if there are few 0 values, but this is generally not preferred. Generally, it is not
preferred to transform a variable, outliers should rather be treated separately. Only if a variable
is greatly positively skewed does a transformation make sense (or if you are estimating
elasticities). Further, taking the log of the variable of interest makes little sense; you cannot
argue causality on a log-transformed variable, as the variable (particularly its variance) is not
the same as the non-transformed variable. Of course, if a variable has a log-linear relationship
with the dependent variable, the log must be taken, otherwise the model will be misspecified
and there will be bias in the parameters.
Quadratic terms are also common; just remember that interpreting such a term requires the
partial derivative of the equation. The adjusted R² is particularly useful to determine whether
a quadratic term should be included in addition to the non-quadratic variable. Again, if a
variable has a quadratic relationship with the dependent variable, the quadratic term must be
included, otherwise the model is misspecified and the estimates biased.
Logs and quadratic terms are the most common functional forms for variables. As noted, the
zero mean error assumption will not hold if a model has functional form misspecification,
meaning there is an omitted variable and it is a function of an included dependent variable.
One way to test for additional functional forms is with the F test after including additional
transformed variables. Another is Ramsey's RESET test: run the regression and save the fitted
values ŷ, then calculate ŷ², ŷ³, … Run a regression that is the same as the original, but add
the calculated values as variables. Conduct an F test on the parameters of the newly added
variables (H0 is that all are nil). If rejected, there is misspecification that needs further
consideration.
Qualitative independent variables should be transformed into dummy categories. If the
dependent variable has a log functional form, the interpretation is a percentage change. Where
there are multiple binary or ordinal variables, the intercept takes the interpretation of all the
0 (base) categories. Binary variables can also be used in interaction terms to obtain additional
information from an intercept (binary interacted with binary) or a different slope (binary
interacted with continuous). Binary variables can also be used to determine whether e.g. females
and males have different models; this is done by interacting all variables, keeping the original
variables, and using the F test where the non-interacted model is the restricted model.
It may also be useful to include a lagged dependent variable in the model. This new independent
variable will control for unobservable historical facts that cause current differences in the
dependent variable.
The linear probability model (LPM)
This model is very easy to run and interpret but has some issues. Some predictions of
probability (for individual cases) will exceed 1 or be less than 0, which is nonsensical. Further,
it is not possible to relate probability linearly to the independent variables as this model does;
this means that e.g. the probability of being employed is not a linear function of the number of
children one has. These prediction problems can be resolved by taking ỹ = 1 if ŷ ≥ 0.5 and
ỹ = 0 if ŷ < 0.5 and then seeing how often the prediction is correct. This goodness-of-fit
measure is called the percentage correctly predicted approach.
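The percentage correctly predicted measure can be sketched as follows; the fitted probabilities are hypothetical.

```python
# Percentage correctly predicted (hypothetical fitted probabilities):
# classify y_hat >= 0.5 as 1, otherwise 0, and compare with the observed y.
y = [1, 0, 1, 1, 0, 0, 1, 0]
y_hat = [0.8, 0.3, 0.6, 0.4, 0.2, 0.7, 0.9, 0.1]

pred = [1 if p >= 0.5 else 0 for p in y_hat]
pcp = 100 * sum(int(a == b) for a, b in zip(pred, y)) / len(y)
print(pcp)  # 6 of 8 correct -> 75.0
```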
The major issue with this model is that heteroscedasticity is always present, and the standard
errors under the t or F test can therefore not be trusted. The preferred approach to address this
is to use robust tests, since weighted least squares can be complex to calculate.
If we take x as all independent variables, the response probability written in functional form
together with the parameters is
P(y = 1|x) = G(β0 + xβ)
Note that the shorthand G(β0 + xβ) can also be written G(xβ) for simplicity. Since we are
concerned with probability, it is required that for all real numbers z, 0 < G(z) < 1.
We therefore need a function G(z) that adheres to this requirement. The most common choices
are the logistic function (used in the logit model) and the normal cumulative distribution
function (used in the probit model). Both of these distributions are non-linear and look very
similar (the logistic distribution has heavier tails). They are useful as they imply that the
probability changes fastest near z = 0 and most slowly in the tails. In the logit model,
G(z) = exp(z) / [1 + exp(z)]
In the probit model, G(z) = Φ(z), the standard normal cdf, whose density is
φ(z) = (2π)^(−1/2) · exp(−z²/2)
The probit model is more popular than the logit model since it is often assumed that the errors
are normally distributed. Since both the logit and probit models rely on non-linear parameters,
we use Maximum Likelihood Estimation (MLE) to estimate the models.
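The two link functions can be sketched as follows; the probit cdf is written in terms of the standard error function.

```python
import math

# The two link functions: the logistic cdf (logit) and the standard
# normal cdf (probit), the latter expressed via the error function.
def logit_G(z):
    return math.exp(z) / (1 + math.exp(z))

def probit_G(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

print(round(logit_G(0), 2), round(probit_G(0), 2))  # 0.5 0.5 at z = 0
print(round(logit_G(2), 3), round(probit_G(2), 3))  # 0.881 0.977
```

Both functions map any real z into (0, 1), which is exactly the requirement stated above.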
Maximum Likelihood Estimation for logit and probit models
The MLE estimator is based on the distribution of y given x and is therefore very important
for estimating probit or logit models. To see how MLE for LDVs is estimated, we first write
the density of y given x as
f(y|x; β) = [G(xβ)]^y · [1 − G(xβ)]^(1−y), y = 0, 1
From this, we get the log-likelihood function by taking the log of the density function above
ℓi(β) = yi·log[G(xiβ)] + (1 − yi)·log[1 − G(xiβ)]
Summing ℓi(β) over all n observations gives the log-likelihood for the sample, L(β). Under MLE,
β̂ is obtained by maximizing L(β). If we use G(z) as in the logit model, we call this the logit
estimator, and if we use G(z) as in the probit model, we call this the probit estimator. Under
general conditions, MLE is consistent, asymptotically normal and efficient.
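The log-likelihood above can be computed directly for a small hypothetical sample; the logit link is shown, and the data and parameter values are invented for illustration.

```python
import math

# Sample log-likelihood for a binary-response model (logit link, hypothetical
# data and parameter values); MLE chooses beta to maximize this sum.
def logit_G(z):
    return math.exp(z) / (1 + math.exp(z))

def log_likelihood(beta0, beta1, xs, ys):
    total = 0.0
    for x, y in zip(xs, ys):
        p = logit_G(beta0 + beta1 * x)
        # contribution: y*log(G(xb)) + (1 - y)*log(1 - G(xb))
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total

print(round(log_likelihood(0.0, 1.0, [0.5, -1.0, 2.0], [1, 0, 1]), 3))  # -0.914
```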
Note that the difference in log-likelihood is multiplied by two to allow the statistic to follow a
chi-square distribution. P-values are therefore also obtained from this distribution.
If the variable of interest is discrete, the partial effect for the variable can be obtained by
G(β0 + β1(x1 + 1) + β2x2 + ⋯) − G(β0 + β1x1 + β2x2 + ⋯)
If the variable of interest is continuous then we need to take the partial derivative for the partial
effect which will give
g(β0 + xβ)·βj, where g is the derivative of G
To compare the estimated parameters with OLS, we make use of scale factors based on the
partial effects. This is done by Stata and the most useful is the average partial effects (APE). It
is, therefore, standard to estimate a model by LPM, probit and logit and compare the estimated
coefficients.
Tobit model for continuous dependent variable with many zero observations
Using a linear estimator for models with a continuous dependent variable with many zero
observations (for instance, the number of cigarettes smoked per month over the population) will
give negative predictions of ŷi, and heteroscedasticity will be present. It is therefore preferred
to use a non-linear estimator that does not allow negative values of ŷi (meaning the estimated
parameters are more reliable).
Similar to the probit and logit models, for the tobit model we use MLE as the estimator to
maximize the sum of the following log-likelihood function
ℓi(β, σ) = 1[yi = 0]·log[1 − Φ(xiβ/σ)] + 1[yi > 0]·log[(1/σ)·φ((yi − xiβ)/σ)]
where Φ indicates the standard normal cdf and φ indicates the standard normal pdf. This can
be called the tobit estimator. Hypothesis testing is conducted in the same manner as for the
logit and probit models.
In interpreting the tobit model we again rely on partial derivatives. These are used to calculate
APEs that can be compared to an OLS estimation of the same model and interpreted as usual
(not as probabilities, as for binary dependent variables). APEs are routinely calculated by Stata.
Poisson model for a count dependent variable
The Poisson density is P(y = h|x) = exp(−exp(xβ))·[exp(xβ)]^h / h!, where h is a count value
used to indicate that y is a count variable, and h! means h factorial. Further note that the
exponential function is used as it is strictly positive. The log-likelihood function is therefore
ℓi(β) = yi·xiβ − exp(xiβ), dropping the −log(yi!) term, which does not depend on β.
And the sum of this over n is again maximized by MLE, t-stats are given and we can use APEs
to compare the coefficients with OLS. It is, however, very important to note that the Poisson
distribution assumes that
Var(y|x) = E(y|x)
which is very restrictive and unlikely to hold. If this cannot be assumed, then we should rather
use Quasi-MLE (QMLE) as the estimator, together with the quasi-likelihood ratio statistic for
multiple hypotheses.
Censored regression models
For an observation censored at ci,
P(yi ≥ ci|xi) = 1 − Φ((ci − xiβ)/σ)
This means that we can again use MLE after taking the log-likelihood, where MLE maximizes
the sum. The interpretation of the estimates does not require any scaling, and they are directly
comparable with OLS. It should, however, be noted that in the presence of heteroscedasticity
or non-normal errors, MLE will be biased and inconsistent.
Heteroscedasticity
Heteroscedasticity under OLS
Heteroscedasticity does not cause bias or inconsistency in the OLS estimates and does not
influence R-squared or adjusted R-squared. It does, however, bias the variance of the OLS
estimates, resulting in incorrect standard errors and T, F and LM test results. OLS is then no
longer asymptotically most efficient amongst linear estimators. The first step is to test for
heteroscedasticity and then to address it. Note that incorrect functional forms may indicate
heteroscedasticity even when none is present; it is therefore important to first test whether the
functional forms are correct.
For the Breusch-Pagan test, OLS is run, the residuals û are saved and û² is calculated. û² is
then regressed on the original independent variables, and an F or LM test is conducted for the
null hypothesis that all slope parameters are equal to nil. If the null is rejected,
heteroscedasticity is present.
For the special case of the White test, OLS is run, û and ŷ are saved, and û² and ŷ² are
computed. û² is regressed on ŷ and ŷ², and the null is that the parameters of these two are
equal to nil. If the null is rejected, heteroscedasticity is present. This test specifically tests
for the type of heteroscedasticity that gives biased variances under OLS.
It is important to note that for both these tests, it is required that the errors in the second
regression, vi, be homoscedastic, E(vi²|X) = k (k is a constant). This implies for the original
error that E(ui⁴|X) = k² (where k² is also a constant). This is called the homokurtosis
assumption. There are also heterokurtosis-robust tests for heteroskedasticity, but these are
seldom used (see page 141 in Wooldridge (2010) if interested).
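The Breusch-Pagan logic can be sketched for a single regressor; the data are simulated so that heteroscedasticity is present by construction, and the LM statistic n·R² is compared with a chi-square(1) critical value.

```python
import random

# Breusch-Pagan-style check with one regressor on simulated data:
# regress the squared OLS residuals on x and inspect LM = n * R^2.
random.seed(1)
n = 2000
x = [random.uniform(0, 2) for _ in range(n)]
# the error standard deviation grows with x, so heteroscedasticity is present
y = [1 + 2 * xi + random.gauss(0, 0.5 + xi) for xi in x]

def ols(xs, ys):
    # simple regression with intercept; returns (intercept, slope)
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    b = (sum((a - mx) * (c - my) for a, c in zip(xs, ys))
         / sum((a - mx) ** 2 for a in xs))
    return my - b * mx, b

a0, b0 = ols(x, y)
u2 = [(yi - a0 - b0 * xi) ** 2 for xi, yi in zip(x, y)]  # squared residuals

a1, b1 = ols(x, u2)                       # second-stage regression of u^2 on x
fitted = [a1 + b1 * xi for xi in x]
mu = sum(u2) / n
r2 = (sum((f - mu) ** 2 for f in fitted)
      / sum((v - mu) ** 2 for v in u2))   # R^2 = SSE / SST
lm = n * r2
print(lm > 3.84)  # compare with the chi-square(1) 5% critical value -> True
```

In practice the same check is one command in Stata (a Breusch-Pagan test after `regress`); the sketch only makes the mechanics explicit.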
Weighted least squares (WLS) is more efficient than OLS, assuming all OLS assumptions hold
besides homoscedasticity and the heteroscedasticity function (the weight) for WLS is correctly
identified (WLS is then BLUE).
If we write Var(ui|xi) = σ²·h(xi) = σ²·hi, where h(xi) is some function of the explanatory
variables that determines the heteroscedasticity, the standard deviation is σ·√hi. We can
multiply this by 1/√hi to get σ, the standard deviation if heteroscedasticity were not present.
To do this, we weight each variable in the original OLS model by 1/√hi, including the dependent
variable and the intercept. After this transformation, the estimators are written βj*; this is an
example of generalised least squares (GLS) and is estimated by OLS.
The WLS model does exactly the same as OLS with GLS estimators; the only difference is that
we do not transform the variables, but rather weight each term in the least squares problem by
1/hi (note: not the square root). WLS therefore minimises the weighted sum of squared
residuals, where each squared residual is weighted by 1/hi.
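The 1/√hi transformation can be sketched for a single regressor; the data are hypothetical (roughly y = 2x + noise) and the weighting function is assumed to be hi = xi.

```python
import math

# WLS as OLS on transformed data (one regressor; hypothetical data):
x = [1.0, 2.0, 4.0, 5.0, 8.0]
y = [2.1, 4.3, 8.2, 10.5, 16.4]
h = x  # assumed heteroscedasticity function: Var(u|x) = sigma^2 * x

# weight y, the intercept column, and x by 1/sqrt(h_i)
ys = [yi / math.sqrt(hi) for yi, hi in zip(y, h)]
c = [1 / math.sqrt(hi) for hi in h]
xs = [xi / math.sqrt(hi) for xi, hi in zip(x, h)]

# OLS (no further intercept) on the transformed data: 2x2 normal equations
scc = sum(a * a for a in c)
scx = sum(a * b for a, b in zip(c, xs))
sxx = sum(b * b for b in xs)
scy = sum(a * b for a, b in zip(c, ys))
sxy = sum(a * b for a, b in zip(xs, ys))
det = scc * sxx - scx * scx
b0 = (scy * sxx - scx * sxy) / det  # weighted intercept estimate
b1 = (scc * sxy - scx * scy) / det  # weighted slope estimate
print(round(b0, 2), round(b1, 2))   # 0.09 2.05
```

The transformed regression is just OLS, which is exactly the GLS-by-transformation idea described above.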
Specifying the weighting function ℎ𝑖𝑖 is therefore the key. In a simple model with one
independent variable, the weighting function must be that independent variable. This means
that we do not need a GLS estimator to estimate WLS. For more complex models we need to
estimate the weighting function, meaning we then again need a GLS estimator to estimate by
WLS. This is done by estimating feasible GLS (FGLS).
Note that using FGLS makes WLS biased, but consistent and more efficient than OLS. It is,
therefore, a good idea to run WLS and OLS with robust standard errors. Robust standard errors
should also be calculated for WLS, since the weighting function may be incorrect, meaning
heteroscedasticity remains present. WLS should then still be more efficient than OLS (both
with robust standard errors).
Measurement error
Measurement error is not the same as taking a proxy. A proxy is used where we have an
unobserved factor, and we take an observable variable that is likely correlated with the
unobserved factor. This is generally a good idea even if it increases multicollinearity, as it will
lead to smaller standard errors and less biased estimates. An example is IQ as a proxy for ability.
Measurement error is where we have an observable variable, but this variable is measured with
error, for instance, actual income vs declared income for tax purposes. If the measurement error
is in the dependent variable, it is generally not a problem. It is then just assumed that the
measurement error is random and not correlated with the independent variables. OLS,
therefore, remains unbiased and consistent as long as this assumption holds.
Measurement error in the independent variables is a problem. If it can be assumed that the
covariance between the measurement error and the actual variable included in the model is nil,
then there is no bias and OLS is BLUE. This is however unlikely to be the case. The general
assumption that needs to be made is that
Cov(x1*, e1) = 0
where x1* is the true variable that should be in the model and e1 is the measurement error,
calculated as e1 = x1 − x1*, where x1 is the variable included in the model that contains the
measurement error. This assumption is called the classical errors-in-variables (CEV)
assumption. This assumption leads to bias and inconsistency in the OLS estimates; this bias is
called attenuation bias. The bias is towards zero, e.g. if B1 is positive then B̂1 will
underestimate B1. If any other variable is correlated with the variable that contains the
measurement error, those estimates will also be biased and inconsistent. This means an
alternative estimator to OLS is required to obtain unbiased and consistent estimates when there
is measurement error in the independent variables.
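Attenuation bias can be illustrated with a small simulation; the numbers are hypothetical, with a true slope of 3 and equal variances for x1* and e1, so the plim of the OLS slope is 3 × 1/2 = 1.5.

```python
import random

# Simulation sketch of attenuation bias under CEV (hypothetical numbers):
# observed x1 = x1* + e1 with Cov(x1*, e1) = 0 and Var(x1*) = Var(e1) = 1,
# so plim of the OLS slope is B1 * Var(x1*) / [Var(x1*) + Var(e1)] = 3/2.
random.seed(2)
n = 5000
x_star = [random.gauss(0, 1) for _ in range(n)]       # true variable
x_obs = [a + random.gauss(0, 1) for a in x_star]      # measured with error
y = [3 * a + random.gauss(0, 1) for a in x_star]      # true slope B1 = 3

def slope(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return (sum((a - mx) * (b - my) for a, b in zip(xs, ys))
            / sum((a - mx) ** 2 for a in xs))

b_obs = slope(x_obs, y)
print(round(b_obs, 2))  # close to 1.5, well below the true slope of 3
```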
One way to resolve the measurement error bias is with the use of instrumental variables (IV) (refer below for a discussion hereon). Taking

y = β0 + β1 x1 + β2 x2 + (u − β1 e1)

In the above model, it is assumed that all other independent variables are exogenous. The requirement for a valid IV is that it is correlated with x1 and not correlated with u or e1. If we have two measures of x1, the second measure can be taken as an IV. Otherwise, we can take other excluded exogenous variables as IVs. By doing this we correct the attenuation bias.
Non-random sampling
Non-random sample selection generally violates OLS assumption 2. There are certain instances where OLS remains BLUE even though this assumption is violated: 1) the missing data is random, so the reason for the missing data is not correlated with any endogenous or unobservable variables (or the error) in the model; 2) the sample is selected based on the level of the exogenous independent variable(s) (called exogenous sample selection), e.g. only adults older than 40 are included in the sample and age is an independent variable; 3) the sample is selected based on a variable exogenous to the model.
OLS will, however, be biased if 1) the missing data is not random and the reason is endogenous to the model or correlated with the error; 2) the sample is selected based on the level of the dependent variable, e.g. where firm size is the dependent variable and only the biggest 20 firms are sampled; 3) the sample is selected based on an endogenous variable in the model.
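A short simulation (invented numbers) contrasting exogenous selection on x with selection on the level of the dependent variable:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)  # true slope = 2

def ols_slope(xs, ys):
    X = np.column_stack([np.ones(len(xs)), xs])
    return np.linalg.lstsq(X, ys, rcond=None)[0][1]

# Exogenous selection: keep only observations with x > 0 (OLS still fine)
keep_x = x > 0
slope_x = ols_slope(x[keep_x], y[keep_x])

# Selection on the dependent variable: keep the 'largest firms' (top 20% of y)
keep_y = y > np.quantile(y, 0.8)
slope_y = ols_slope(x[keep_y], y[keep_y])

print(slope_x, slope_y)  # slope_x near 2; slope_y attenuated below 2
```

Selecting on x leaves the slope estimate intact, while selecting on y truncates the error conditional on x and biases the slope toward zero.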
For a truncated sample (y observed only when y ≤ ci), the density of y given xi is

g(y|xi, ci) = f(y|xi β, σ²) / F(ci|xi β, σ²),  y ≤ ci

From this, we can take the log-likelihood function and use MLE to maximize the sum over all observations (Stata does this). The interpretation is the same as for OLS. In the presence of heteroscedasticity or non-normal errors, MLE will, however, be biased and inconsistent.
Incidental truncated models
For truncated models, the truncations are generally applied by choice of the data collector. It is
also possible that truncation occurs incidentally. We take a random sample, but due to
truncation, the sample is non-random for estimation purposes. Under incidental truncation,
whether we observe y will depend on external factors. If we, for instance, collect data on labor
variables, some observations will have zero wage, meaning wage is dependent on labor force
participation. We will still have observations on the other variables, but not on wage. If wage
is then used as the dependent variable, OLS will be biased.
To correct for this we follow the Heckman method (Heckman command in Stata):
1) First, we estimate a selection equation with the probit estimator using all observations. This
equation can be written
s = zγ + v

Where s = 1 where we observe yi and zero otherwise (we make s binary), and z is a set of independent variables that includes all the population variables, x, and at least one additional variable that is correlated with s (the selection process). γ are parameters as usual.
2) Compute the inverse Mills ratio λ̂i = λ(zi γ̂) = φ(zi γ̂)/Φ(zi γ̂)
3) Run OLS of yi on xi and λ̂i using the selected sample: yi = xi β + ρ λ̂i + error
The significance of λ̂i's parameter indicates whether selection bias is present. If this parameter is not zero, then the OLS test statistics are not computed correctly and an adjustment is required (Wooldridge 2010).
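The steps above can be sketched in a simulation. As a simplifying assumption, the probit first stage is skipped and the true selection coefficients are used to form the inverse Mills ratio (in practice γ must be estimated, e.g. with Stata's heckman command); all parameter values are invented:

```python
import math
import numpy as np

rng = np.random.default_rng(2)
n = 20_000
x = rng.normal(size=n)
w = rng.normal(size=n)               # exclusion restriction: affects selection only
v = rng.normal(size=n)               # selection-equation error
u = 0.8 * v + rng.normal(size=n)     # outcome error, correlated with v

y = 1.0 + 2.0 * x + u                # true beta1 = 2
s = (x + w + v > 0)                  # selection: y observed only when s = 1

def inv_mills(c):
    # lambda(c) = phi(c) / Phi(c)
    phi = math.exp(-0.5 * c * c) / math.sqrt(2 * math.pi)
    Phi = 0.5 * (1 + math.erf(c / math.sqrt(2)))
    return phi / Phi

# Step 2: inverse Mills ratio at the selection index (true gammas used here)
lam = np.array([inv_mills(c) for c in (x + w)[s]])

def ols(X, ys):
    return np.linalg.lstsq(X, ys, rcond=None)[0]

xs, ys = x[s], y[s]
b_naive = ols(np.column_stack([np.ones(len(xs)), xs]), ys)
b_heck = ols(np.column_stack([np.ones(len(xs)), xs, lam]), ys)

print(b_naive[1], b_heck[1])  # naive slope biased; corrected slope near 2
```

Adding the inverse Mills ratio term absorbs the selection-induced part of the error, so the slope on x moves back toward its true value.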
Outliers
Studentized residuals, leverage and Cook's distance are useful to detect outliers in the sample. This is important because OLS squares residuals, making it very sensitive to outliers. It is generally recommended to report results with and without outliers, unless an outlier is clearly the result of a data capturing error. It may also be preferred to use an alternative estimator as a supplement to OLS, such as:
symmetric around the zero mean under LAD, the results will greatly differ from OLS and be biased. Further, the t, F and LM test statistics are only valid in large samples under LAD.
To perform the test we need at least one instrument for each perceived endogenous variable.
Then we conduct the test by
1) Estimate each endogenous variable (perceived) in its reduced form (all exogenous
variables)
2) Save the residuals for each estimation
3) Include the residuals as new variables in the structural equation and test their significance (a t test if there is one perceived endogenous variable and an F test if more than one). It is important to take the robust test statistics for both types of tests. If the residuals are not significant, the perceived endogenous variable is exogenous (take robust standard errors). OLS can, therefore, be preferred if this is the case for all perceived endogenous variables, since OLS will be Best.
This test is the same as the first steps of the Control Function estimator discussed later, so
also refer to this section.
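These steps can be sketched as follows (simulated data in which y2 is deliberately endogenous; for brevity the sketch uses non-robust standard errors, whereas the text recommends robust ones):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5_000
z1 = rng.normal(size=n)
z2 = rng.normal(size=n)              # excluded instrument for y2
v2 = rng.normal(size=n)
y2 = 1.0 + 0.5 * z1 + 1.0 * z2 + v2  # reduced form of the endogenous variable
u1 = 0.5 * v2 + rng.normal(size=n)   # structural error correlated with v2
y1 = 1.0 + 1.0 * y2 + 0.5 * z1 + u1

def ols(X, y):
    b = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ b
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return b, se, resid

# 1) reduced form of the (perceived) endogenous variable, save the residuals
Z = np.column_stack([np.ones(n), z1, z2])
_, _, v2_hat = ols(Z, y2)

# 2)-3) include the reduced-form residuals in the structural equation, t test
X = np.column_stack([np.ones(n), y2, z1, v2_hat])
b, se, _ = ols(X, y1)
t_resid = b[3] / se[3]
print(t_resid)  # a large |t| rejects exogeneity of y2
```

Because y2 was built to be endogenous, the t statistic on the reduced-form residuals comes out far above any conventional critical value.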
Since the two samples are not drawn at the same time, the variables will not be identically distributed between the two periods. To correct for this, it is required to include a dummy variable for each year/time period (besides the first, generally) in the regression, which will control for changes between years. It is often useful to interact these dummies with other variables to determine how the variables have changed over time.
It is further possible that the functional forms of the variables in the regression should not be
the same for the different periods. This can be tested with an F test in the same manner as was
done for model selection, by conducting the test on each time period individually.
y = β0 + δ0 d2 + β1 dT + δ1 d2·dT + other controls + error

Where d2 is a dummy for the post-event time period and dT is a dummy equal to 1 for the treatment group and 0 for the control group. The parameters are interpreted as follows: β0 is the average for the control group before the event, β0 + δ0 for the control group after, β0 + β1 for the treatment group before, and β0 + δ0 + β1 + δ1 for the treatment group after, so that δ1 is the difference-in-differences (DD) estimate of the treatment effect.
Adding a third dummy, dB, for a second group dimension (e.g. a state or demographic group) gives the difference-in-difference-in-differences (DDD) model

y = β0 + δ0 d2 + β1 dT + δ1 d2·dT + β2 dB + β3 dB·dT + δ2 d2·dB + δ3 d2·dB·dT + other controls + error
The coefficient of interest is therefore δ3. It is of course also possible to use more time periods with either the DD or DDD estimator.
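A minimal simulated check of the DD setup (effect sizes invented): the OLS interaction coefficient equals the difference-in-differences of the four group means.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 40_000
d2 = rng.integers(0, 2, n)           # post-period dummy
dT = rng.integers(0, 2, n)           # treatment-group dummy
y = 1.0 + 0.5 * d2 + 1.0 * dT + 3.0 * d2 * dT + rng.normal(size=n)

X = np.column_stack([np.ones(n), d2, dT, d2 * dT])
b = np.linalg.lstsq(X, y, rcond=None)[0]

# Same number from the four group means:
m = lambda p, t: y[(d2 == p) & (dT == t)].mean()
dd = (m(1, 1) - m(0, 1)) - (m(1, 0) - m(0, 0))
print(b[3], dd)  # both close to the true delta1 = 3.0
```

Because the dummy regression is saturated, the regression and the group-means computation give numerically identical answers.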
Cluster samples
In cluster sampling, clusters are randomly sampled from a population of clusters and units of
observation are sampled from the clusters. An example is siblings (units) sampled from
families (clusters) where the population is all families (the population of clusters). It is very
important that clustering should not be done ex-post (for instance obtain a random sample of
individuals and cluster them into families) as this will result in incorrect standard errors.
Matched pairs samples are also applicable to this section.
The benefit of cluster sampling is that a fixed cluster effect that influences all of the units in
the cluster can be controlled for in the model. Note that if the key independent variable only
changes at the cluster level and not at unit level then we would not want to include a fixed
cluster effect.
To include a fixed cluster effect, we use panel data methods (first-difference estimator, fixed
effects estimator, random effects estimator, correlation random effects model or pooled OLS)
to control for the cluster effect. These methods are discussed in the section on panel data. Note
that if pooled OLS is used after cluster sampling, the errors will have cluster correlation and
cluster-robust standard errors need to be used.
1. Ignore the problem and indicate the direction of bias. This is not ideal, but we may still
learn something.
2. Include proxy variables for the unobserved variables. It may be difficult to find applicable
proxies.
3. Control for the time constant unobservable variables by including fixed effects. Refer to
the cluster sampling discussion and panel data methods.
Another popular method is to use the instrumental variables (IV) estimator. The IV estimator obtains consistent (although biased) estimates where the OLS estimates would be biased and inconsistent due to unobservable variable bias. The IV estimator is, therefore, most useful in large samples. To use the IV estimator, we first have to identify an IV, or instrument. Taking the simple regression model
y = β0 + β1 x + u

Where Cov(x, u) ≠ 0, the estimated parameter β̂1 will be biased and inconsistent under OLS. If we take a new variable, z, that adheres to the following assumptions

1) Cov(z, u) = 0 (instrument exogeneity)
2) Cov(z, x) ≠ 0 (instrument relevance)

then z is a valid instrument for x. Note that the first assumption means that the IV may not
have a partial effect on the dependent variable after controlling for the other independent variables, meaning that the IV must be exogenous in the original equation. Because the error cannot be observed, we cannot test the first assumption and need to rely on logic and theory to argue this. The second assumption can easily be tested by regressing x on z. It is important that the direction of the correlation found is aligned with logic and theory. Where an endogenous variable is interacted with another variable, the IV for the interaction term is the IV for the endogenous variable interacted with the other variable in the model.
Further see that a good proxy is a bad IV since a proxy requires correlation between the proxy
and the error (before including the proxy) and a good IV requires no correlation between the
IV and the error.
If we have found a good IV, we can use the IV assumptions to identify³ the parameter β1. Writing the simple model above in terms of covariances with z gives

β1 = Cov(z, y) / Cov(z, x)

and β̂0 = ȳ − β̂1 x̄. See that if z = x, then the IV estimator becomes the OLS estimator. As previously mentioned, β̂1 is consistent but biased, and the IV estimator is therefore only really useful in larger samples.

³ This means we can write the parameter in terms of population moments that can be estimated.
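The identification result β1 = Cov(z, y)/Cov(z, x) can be checked directly in a simulation (all parameter values invented; np.cov returns the sample covariance matrix):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 50_000
z = rng.normal(size=n)               # instrument
v = rng.normal(size=n)
x = z + v                            # endogenous: correlated with u through v
u = v + rng.normal(size=n)
y = 1.0 + 2.0 * x + u                # true beta1 = 2

b_ols = np.cov(x, y)[0, 1] / np.var(x)
b_iv = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]
print(b_ols, b_iv)  # OLS biased upward (about 2.5); IV close to 2
```

OLS picks up the Cov(x, u)/Var(x) term, while the covariance-ratio IV estimate removes it.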
The above can be extended to a multivariate model. To do this we need to make use of structural
equations and reduced forms. Given the structural equation

y1 = β0 + β1 y2 + β2 z1 + u1
The y variables are interpreted as endogenous (correlated with the error term) and the z variable is interpreted as exogenous (not correlated with the error term). It is evident that the independent variable y2 is problematic, since it is endogenous and, if the equation is estimated by OLS, will result in bias in all the parameters. To resolve this we can use the IV estimator, but note that z1 may not be an IV for y2, since it is already included in the model. We therefore need a new
exogenous variable, z2, to serve as an IV for y2. We therefore need to assume that Cov(z2, u1) = 0 and further that the partial correlation between z2 and y2 is not zero. To test the second assumption, we write y2 in its reduced form, meaning we write the endogenous variable in terms of exogenous variables (including IVs). This can also be done for dependent variables, where the interpretation of the parameters of the reduced form is intention-to-treat, as opposed to treatment in the structural model. y2 in its reduced form is therefore

y2 = π0 + π1 z1 + π2 z2 + v2

The assumption holds if π2 ≠ 0 and the reduced form is estimated by OLS (with the assumption of no perfect multicollinearity). Note that if the model contained further exogenous variables, those would also be included in the reduced form.
The asymptotic variance of the IV estimator is

Var(β̂1) = σ² / (n σx² ρ²x,z)

Where ρ²x,z is the square of the population correlation between x and z (R²x,z). The estimated variance is

Var(β̂1) = σ̂² / (SSTx · R²x,z)

Note that the only difference between the standard errors of OLS and IV is the term R²x,z. Since this is always less than one, the standard errors under IV will always be larger than under OLS
(a weakness of IV). Further, if we have a poor IV, meaning weak correlation between the endogenous variable and its instrument, then besides large standard errors, IV will also have a large asymptotic bias. Therefore, although consistent, IV can be worse than OLS if we have a poor IV. Generally, an IV is considered weak (and should not be used) if the t-stat of the IV in the reduced form model is less than about 3.2 (√10) in absolute value (Stock and Yogo, 2005).
The obtained R squared from an IV estimation is not useful and should not be reported.
If we have two exogenous variables that are correlated with y2, called z2 and z3, any linear combination of the exogenous variables is a valid IV for y2. The reduced form of y2 is therefore

y2 = π0 + π1 z1 + π2 z2 + π3 z3 + v2

In other words, the independent variable y2 is divided into two parts, y2* (the part that is exogenous in the structural model) and v2 (the part that is endogenous in the structural model). We only wish to use the exogenous part of the variable.
To estimate the exogenous part, ŷ2, we need two OLS estimations, called the first stage and the second stage.

First stage: regress y2 on all exogenous variables,

y2 = π0 + π1 z1 + π2 z2 + π3 z3 + ε

and save the fitted values, ŷ2.
Second stage: regress y1 on the fitted values and the included exogenous variables,

y1 = β0 + β1 ŷ2 + β2 z1 + error

It can, therefore, be seen that 2SLS first purges y2 of its correlation with u1 and is therefore consistent where OLS would not be. Note that the econometric package automatically estimates both stages; this should not be done manually (the manually computed second-stage standard errors are wrong). Further, when asking for instrumental variables, all exogenous variables (included and excluded) are given, as all of these are used in the first stage and therefore in the estimation of the IV.
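For illustration of the mechanics only (in practice, as noted, the econometric package should estimate both stages), the two stages can be sketched with simulated data and invented coefficients:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 20_000
z1 = rng.normal(size=n)              # included exogenous variable
z2, z3 = rng.normal(size=n), rng.normal(size=n)  # excluded instruments
v2 = rng.normal(size=n)
y2 = 0.5 * z1 + 1.0 * z2 + 1.0 * z3 + v2          # endogenous regressor
u1 = 0.7 * v2 + rng.normal(size=n)
y1 = 1.0 + 1.5 * y2 + 0.5 * z1 + u1               # true coefficient on y2 = 1.5

ols = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]

# First stage: regress y2 on ALL exogenous variables, keep the fitted values
Z = np.column_stack([np.ones(n), z1, z2, z3])
y2_hat = Z @ ols(Z, y2)

# Second stage: replace y2 by its fitted values
X2 = np.column_stack([np.ones(n), y2_hat, z1])
b_2sls = ols(X2, y1)

# Naive OLS for comparison
b_ols = ols(np.column_stack([np.ones(n), y2, z1]), y1)
print(b_ols[1], b_2sls[1])  # OLS biased upward; 2SLS near 1.5
```

The fitted values carry only the exogenous variation in y2, so the second-stage coefficient is close to the true value while naive OLS is not.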
The variance of the 2SLS estimator is

Var(β̂1) = σ² / (SST_ŷ2 · (1 − R̂2²))

Where SST_ŷ2 is the total variation in ŷ2 and R̂2² is the R squared from regressing ŷ2 on the other exogenous variables in the model (so that 1 − R̂2² reflects the multicollinearity between ŷ2 and those variables).
See from this that 2SLS will always have a larger variance than OLS, since
1. ŷ2 has less variation than y2 (a part of its variation is left in the reduced form error term), and
2. ŷ2 is more correlated with the exogenous variables, increasing the multicollinearity problem.
With two endogenous variables, y2 and y3, we would require at least two excluded exogenous variables that are partially correlated with y2 and y3. This means that the two or more excluded exogenous variables should be jointly significant (with an F stat greater than 10) in both the reduced form models of y2 and y3. To use
2SLS and to obtain valid estimates we need to adhere to the order condition. The order
condition requires that we have at least as many excluded exogenous variables as included
endogenous variables.
A requirement for a valid instrument is that it is uncorrelated with the error term in the structural
model (endogenous). If we have more instruments than we need to identify an equation (more
instruments than endogenous variables) we can test whether the additional instruments are
uncorrelated with the error term (called testing the overidentification restriction).
To test this (the Sargan test):
1) Estimate the structural equation by 2SLS and save the residuals, û1.
2) Regress û1 on all exogenous variables (included exogenous variables and all instruments) and obtain the R squared.
3) The null hypothesis that all instruments are uncorrelated with û1 is tested by comparing the R squared multiplied by the sample size, nR², against a chi-square distribution where the degrees of freedom equal the number of instruments less the number of endogenous variables. If nR² exceeds the critical value of the chi-square distribution, we reject H0, meaning that not all instruments are exogenous. If we fail to reject, the additional instruments appear valid, but only to a certain extent: it may still be that one of the additional instruments is endogenous.
4) To obtain a heteroscedasticity-robust test, we regress all endogenous variables on all exogenous variables (included and additional instrumental variables⁴) and save the fitted values (ŷ2). Next, we regress each of the overidentifying restrictions (instruments not needed for the model to be just identified) on the exogenous variables included in the original model and the ŷ2's, and we save the residuals, r̂2. Then we regress the residuals saved in step 1, û1, on r̂2 and perform the heteroscedasticity-robust Wald test on this regression.
5) E(u²|Z) = σ²
Under assumptions 1-5, 2SLS is consistent and its test statistics are asymptotically valid. The 2SLS estimator is the best IV estimator under these assumptions.
4
Note that an exogenous variable is its own instrument.
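A sketch of the nR² overidentification test from step 3 (simulated data; one endogenous variable, two instruments, so one overidentifying restriction and df = 1). The test is run twice, once with valid instruments and once with a deliberately endogenous instrument:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 5_000

def sargan_nR2(corrupt_instrument):
    z1 = rng.normal(size=n)
    z2 = rng.normal(size=n)
    v2 = rng.normal(size=n)
    u1 = 0.5 * v2 + rng.normal(size=n)
    if corrupt_instrument:
        z2 = z2 + u1                 # z2 now correlated with the error: invalid
    y2 = 1.0 * z1 + 1.0 * z2 + v2
    y1 = 1.0 + 1.0 * y2 + u1

    # 2SLS with instruments (1, z1, z2) for X = (1, y2)
    Z = np.column_stack([np.ones(n), z1, z2])
    X = np.column_stack([np.ones(n), y2])
    Xhat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]
    b = np.linalg.solve(Xhat.T @ X, Xhat.T @ y1)
    u_hat = y1 - X @ b               # residuals using the ORIGINAL y2

    # Regress the 2SLS residuals on all exogenous variables; nR^2 ~ chi2(1)
    g = Z @ np.linalg.lstsq(Z, u_hat, rcond=None)[0]
    R2 = 1 - ((u_hat - g) ** 2).sum() / ((u_hat - u_hat.mean()) ** 2).sum()
    return n * R2

good, bad = sargan_nR2(False), sargan_nR2(True)
print(good, bad)  # 'good' stays below chi-square critical values; 'bad' is huge
```

With valid instruments the statistic behaves like a chi-square(1) draw; with the corrupted instrument it blows up far past any critical value.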
If 5 does not hold, then 2SLS is not the most efficient IV estimator. Homoscedasticity can be tested by saving the residuals from 2SLS and regressing them on all exogenous variables, with the null being that the coefficients on all exogenous variables are jointly zero (required for homoscedasticity). This is analogous to the Breusch-Pagan test. To correct heteroscedasticity under 2SLS:
1) Take robust standard errors, as for OLS, or
2) Use weighted 2SLS, which is done in the same way as for OLS, but 2SLS is used after applying the weights.
We include q1 in the model and then use q2 as an instrument for q1. Doing this provides consistency where OLS would have been inconsistent (using q1). It is important that q2 meets the normal requirements for a good and valid instrument. This approach is called the multiple indicator solution.
Similarly, measurement error can be resolved if we have two indicators that measure an independent variable with error (where we do not have the correctly measured independent variable). For OLS we would only have been able to include one of the two indicators, but using 2SLS we can use the second indicator as an IV for the first, resulting in consistent estimates (this is also discussed under measurement error).
obtained from data). If the parameter for the generated regressor ≠ 0, then all standard errors
and statistics need to be adjusted for valid inference.
A generated instrument does not cause the same problems; 2SLS remains consistent with valid test statistics (assuming the other assumptions hold). Of course, if a generated regressor is included in 2SLS, then we need to adjust the asymptotic variance.
Where z1 is the set of all exogenous variables in the structural model and y2 is the endogenous variable. If we have at least one additional exogenous variable that is not included in the structural model, the reduced form of y2 is

y2 = z π2 + v2

Where z includes at least one variable not in z1. This is required to avoid perfect multicollinearity (see the final model below). Since y2 is correlated with u1, v2 must be correlated with u1 as well. Therefore we can write

u1 = ρ1 v2 + e1
See that this provides a simple test for the endogeneity of y2: if ρ1 = 0 then y2 is actually exogenous. Further see that v2 and e1 are uncorrelated, and consequently z (which includes z1) is also uncorrelated with both v2 and e1. We can therefore substitute u1 in the original model to get
y1 = z1 δ1 + α1 y2 + ρ1 v2 + e1

Which is a model with no endogeneity and will be consistently estimated by OLS. Since v̂2 is a generated regressor, we need to correct the standard errors.
CF provides identical results to 2SLS unless there is more than one function of y2 included in the model (for instance y2 and y2²). In such instances, only 2SLS will be consistent in general, but CF will be more efficient when its stronger assumptions hold. CF is very useful for non-linear models (discussed later).
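The claim that CF and 2SLS give identical results in the linear case can be verified numerically (simulated data, invented parameter values):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 10_000
z1 = rng.normal(size=n)
z2 = rng.normal(size=n)              # excluded instrument
v2 = rng.normal(size=n)
y2 = 0.5 * z1 + 1.0 * z2 + v2
u1 = 0.6 * v2 + rng.normal(size=n)
y1 = 1.0 + 1.2 * y2 + 0.4 * z1 + u1  # true coefficient on y2 = 1.2

ols = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]

# Control function: reduced-form residuals added to the structural equation
Z = np.column_stack([np.ones(n), z1, z2])
v2_hat = y2 - Z @ ols(Z, y2)
b_cf = ols(np.column_stack([np.ones(n), y2, z1, v2_hat]), y1)

# 2SLS via the first-stage fitted values
y2_hat = y2 - v2_hat
b_2sls = ols(np.column_stack([np.ones(n), y2_hat, z1]), y1)

print(b_cf[1], b_2sls[1])  # identical coefficients on y2
```

The equality holds because v̂2 is orthogonal to the instrument space, so adding it to the regression reparameterizes the same projection that 2SLS computes.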
endogenous variable. Take the model that we can estimate (not having data on the unobserved heterogeneity)

y1 = z1 δ1 + a1 y2 + u1
a1, the 'coefficient' of y2, is an unobserved random variable, meaning it changes across observations. We can write

a1 = α1 + v1

Where α1 is the correct (constant) coefficient which we wish to estimate. Substituting this into the original model gives the population model

y1 = z1 δ1 + α1 y2 + v1 y2 + u1

This shows the interaction between the unobserved heterogeneity for which we do not have a proxy, v1, and the endogenous variable. To address the endogeneity of y2 we would want to
use 2SLS. The problem with 2SLS is that the error term (v1 y2 + u1) in the model to be estimated is not necessarily uncorrelated with the instruments (z) that we would want to use. A further requirement is therefore necessary, being

Cov(y2, v1 | z) = Cov(y2, v1)

which means the conditional covariance is not a function of the instrumental variables. Finding an instrument that satisfies this condition is difficult. One option is to obtain the fitted values of a first stage regression of yi2 on zi and then use as IVs 1, zi and ŷi2(zi1 − z̄1).
Alternatively, a control function approach can be used: first regress y2 on z and save the reduced form residuals, v̂2, and then run the OLS regression of y1 on 1, z1, y2, v̂2·y2 and v̂2. This approach requires a stronger assumption, which is

E(u1|z, v2) = ρ1 v2,  E(v1|z, v2) = ϑ1 v2
Systems of Equations
It is possible that the population model is a set of equations, for instance a demand system,

y_g = x_g β_g + u_g,  g = 1, …, G

Since each equation has its own vector of coefficients β_g, this model is known as seemingly unrelated regression (SUR). In estimating such a system we can use OLS equation by equation, system OLS (SOLS) or FGLS. Of these, FGLS will be more efficient if we can assume system homoscedasticity. SOLS is generally more likely to be consistent as it requires a weaker assumption; FGLS requires strict exogeneity. If we cannot assume system homoscedasticity, then either SOLS or FGLS may be the more efficient.
Systems of equations often contain endogenous variables, and IV methods are therefore commonly used (see SEM models). There are more efficient estimators than 2SLS for systems of equations with endogeneity, for instance the Generalized Method of Moments (GMM) estimator and GMM 3SLS.
Where z indicates exogenous variables. See that the observed hours are determined by the intersection of supply and demand, and the true hours that workers are willing to supply cannot be observed, but we wish to estimate this. Because we only observe the equilibrium of hours worked, where supply equals demand, we can write for each individual

hours = γ1 wage + z1 β1 + u1

And

hours = γ2 wage + z2 β2 + u2

See that the only difference between these two equations is the subscript on the exogenous variables. If the exogenous variables are exactly the same, then the two equations will be exactly the same, meaning we have an identification problem; the true hours that workers wish to supply cannot be estimated. Taking crime and police as an example, the first equation will be

crime = α1 police + z1 β1 + u1

and the second

police = α2 crime + z2 β2 + u2
See that both equations have a ceteris paribus interpretation. Further note that these two
equations describe different behaviors. In the first equation, we are interested in factors that
change in the behavior of criminals and in the second we are interested in factors that change
in the behavior of the country/state etc. in appointing policemen and policewomen. It is,
therefore, most plausible that the exogenous variables will be different and the first (or second)
equation can be estimated. Note, however, that if we use OLS on the first or second equation,
the estimated parameters will be biased because of simultaneity. We, therefore, use 2SLS.
The order condition is again necessary, but not sufficient. The rank condition for SEMs with more than two equations follows (Wooldridge 2010, ch. 9).
The instrumental variables that are used in estimating the equation of interest are therefore all
exogenous variables in the system of equations. By doing this we remove the simultaneity bias
in the independent variable that is jointly determined with the dependent variable.
In conclusion, the only difference between 2SLS to address endogeneity bias and simultaneity
bias is in how we obtain the instrumental variables to be used and the necessary condition to
estimate an equation.
TIME SERIES DATA
OLS Assumptions for finite samples
Assumption 1-3
The OLS assumptions for time series data (TSD) that ensure OLS is BLUE in finite samples are similar to those for cross-sectional data. For instance, the model needs to be linear in parameters (1) and there may not be any perfect collinearity (2). For OLS to be unbiased with TSD, a further assumption needs to be adhered to. This assumption combines the random sample and zero conditional mean assumptions for cross-sectional data and adds a stricter requirement. If X is taken to represent all independent variables for all time periods (t), then

E(ut|X) = 0,  t = 1, 2, …, n

This means that for each time period, the expected value of the error term of that period, given the independent variables for all time periods, is zero (3). In other words, the error in any one time period may not be correlated with any independent variable in any time period. If this holds, we say the model is strictly exogenous and OLS is unbiased and consistent. This assumption will not hold if the data does not come from a random sample. Note that this assumption implies the assumption for cross-sectional data, which can be written

E(ut|xt) = 0
Which means that the error term and the independent variables for the same time period are not correlated. If only this second assumption holds, then the model is said to be contemporaneously exogenous. OLS will be consistent, but biased. This means this assumption is not sufficient for OLS to be BLUE.
Meeting assumptions 1-3 results in OLS being unbiased and consistent. The assumptions required for OLS to have the smallest variance (to be Best) are
Assumption 4
Homoscedasticity, meaning

Var(ut|X) = Var(ut) = σ²

Note again that the requirement is on all independent variables at all time periods. This said, in most cases, heteroscedasticity in the error for a time period is a result of the independent variables of that time period.
Assumption 5
No serial correlation (autocorrelation), meaning the errors (given all independent variables for all time periods) may not be correlated over time. This can be written

Corr(ut, us|X) = 0, for all t ≠ s

Note that this does not mean that an independent variable may not be correlated with itself or with other independent variables over time; only the errors (which contain unobserved factors and measurement error) are of concern.
Under assumptions 1-5, OLS is BLUE for time series data. Further, the OLS sampling variance is calculated exactly as for cross-sectional data (see above) and the estimated variance of the error term is an unbiased estimate of the population error variance. OLS therefore has the same desirable properties for time series data.
Assumption 6
To be able to use the t and F tests in finite samples, the classical linear model assumption is required. Without this assumption, the test statistics will not follow t and F distributions. This assumption is that the ut are independent of X and independent and identically normally distributed.
yt = β0 + β1 zt + β2 zt−1 + β3 zt−2 + β4 zt−3 + ut

Where β1 indicates the immediate propensity, meaning the change in yt due to a one unit increase in z at time t; and β1 + β2 + β3 + β4 indicates the long run propensity, meaning the change in y over four time periods (or however many lags are included, plus one) due to a permanent one unit increase in z starting at time t. This means that β2 indicates the change in y one period after a change in z at time t, and similarly for the remaining parameters individually considered.
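A quick numeric check of the immediate and long run propensities on simulated FDL data (all coefficient values invented):

```python
import numpy as np

rng = np.random.default_rng(9)
T = 5_000
z = rng.normal(size=T)
u = rng.normal(size=T)
betas = [0.5, 0.3, 0.2, 0.1]         # lag coefficients on z_t ... z_{t-3}

# Build y_t = 1 + sum_j betas[j] * z_{t-j} + u_t (valid from t = 3 onward)
y = 1.0 + u
for j, bj in enumerate(betas):
    y[3:] += bj * z[3 - j:T - j]

# Regress y_t on z_t, z_{t-1}, z_{t-2}, z_{t-3}
X = np.column_stack([np.ones(T - 3)] + [z[3 - j:T - j] for j in range(4)])
b = np.linalg.lstsq(X, y[3:], rcond=None)[0]

lrp = b[1:].sum()
print(b[1], lrp)  # immediate propensity near 0.5, long run propensity near 1.1
```

The sum of the estimated lag coefficients recovers the long run propensity, here 0.5 + 0.3 + 0.2 + 0.1 = 1.1.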
yt = β0 + ρ1 yt−1 + ρ2 yt−2 + β1 zt + β2 zt−1 + β3 zt−2 + β4 zt−3 + ut

For such a model (a dynamically complete model) there cannot be any serial correlation, meaning the no serial correlation assumption always holds. This does not mean all models should be dynamically complete. If the purpose of the regression is to forecast, the model must be dynamically complete. If we are, however, interested in the static impact (a static model) or the long run effect (an FDL model), such a model need not be dynamically complete. It should then be noted, however, that the model will have serial correlation and this will have to be corrected (discussed later).
Dummy variables and binary variables can also be used. Binary variables are useful for event
studies using time series data.
It should further be noted that for time series data, we always want to use real economic
variables and not nominal economic variables. This means that if data is in nominal form, this
data needs to be adjusted by an index, such as the consumer price index, to obtain the real
economic variable. Alternatively stated, not accounting for inflation gives rise to measurement
error.
1. Trends
Often we may think that variables are correlated over time, but this correlation can partly be ascribed to a similar time trend that the variables follow. If a dependent or independent variable follows a time trend, we need to control for this trend in the model. Not doing so means that the trend will be included in the error term and the estimates will be biased; this is called a spurious regression. How the trend is included in the model depends on the type of trend.
For a linear time trend, we can write

yt = β0 + β1 t + et,  t = 1, 2, 3, …

Note that the independent variable t indicates time, where 1 is for instance 2010, 2 is 2011, 3 is 2012, etc. Including this variable detrends the results of the equation. If a variable has an exponential trend we can include logs, and for a quadratic trend we can include polynomial terms. Note that when including trends, the R-squared or adjusted R-squared is biased, but this does not influence the t or F stat.
2. Seasonality
If our time periods are shorter than a year, the data can also be influenced by seasonality, e.g. crop output is influenced by rainfall and rainfall is seasonal. Most often, series are already seasonally adjusted and we do not have to make any changes to our model. If the data you receive is not seasonally adjusted and is subject to seasonality, it is required to make such an adjustment. This is easily done by including dummy variables for the relevant seasons (for instance for each month (less one) or for each quarter (less one)). This will control for the seasonality in the data.
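The spurious regression problem from item 1 can be sketched with two independent trending series (numbers invented): regressing one on the other gives a large slope, while adding the trend term t removes it.

```python
import numpy as np

rng = np.random.default_rng(10)
T = 500
t = np.arange(1, T + 1, dtype=float)

# Two INDEPENDENT series that share only a linear time trend
y = 1.0 + 0.05 * t + rng.normal(size=T)
x = 2.0 + 0.05 * t + rng.normal(size=T)

ols = lambda X, ys: np.linalg.lstsq(X, ys, rcond=None)[0]

b_naive = ols(np.column_stack([np.ones(T), x]), y)
b_trend = ols(np.column_stack([np.ones(T), x, t]), y)
print(b_naive[1], b_trend[1])  # naive slope large; detrended slope near 0
```

Since the two series are unrelated apart from the trend, controlling for t shrinks the slope on x to roughly zero.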
OLS asymptotic assumptions
In large samples, the assumptions of OLS can be made less strict, as long as the law of large numbers and the central limit theorem hold. Additional requirements, besides having a large sample, are required for this to be the case. The two additional requirements for OLS and other estimators are that the time series included in a regression are stationary and weakly dependent. Note that we are interested here in the specific variables individually and not in the regression model: we look at one variable over time (a time series) to see whether it is stationary and weakly dependent. For a time series, stationarity is not critical, but weak dependence is.
Logically, to understand the relationship between variables over time, we need to be able to assume that this relationship does not change arbitrarily between time periods. This means that each variable should follow a determinable path over time. For this reason, a time series (one variable over time) can be seen as a process (and defined in terms of a process).
For any time series, we are dealing with a stochastic process, meaning that the level of the series is not deterministic in any one period; the data points are determined by probability. The important aspect of the process is whether it is stationary or non-stationary.
Stationary
A stationary stochastic process is a process where the joint probability distribution of the sequence of random variables in the process remains unchanged over time. Flipping a coin, for instance, is a stationary stochastic process, since the joint probability of heads and tails remains unchanged over time. If a variable has a time trend, the stochastic process cannot be stationary, meaning it is a non-stationary stochastic process. A stationary stochastic process in this sense is called strictly stationary.
If we write the s-th sample moment as (x1^s + x2^s + … + xn^s)/n, the first moment is where s = 1 (this is the mean) and the second moment is where s = 2 (this relates to the variance). This can be continued further to get skewness and kurtosis.
The lesser form of stationarity is called covariance stationarity or weak stationarity and is more important than strict stationarity (since strict stationarity seldom holds). It holds where all the random variables have a finite second moment (E(xt²) < ∞ for all t), the mean and the variance of the process are constant, and the covariance depends only on the distance between two terms and not on the starting time period. Mathematically this can be written

E(xt) = μ
Var(xt) = σ²
Cov(xt, xt+h) = f(h), a function of h only and not of t
This requirement means that there is one data generating process that determines xt in all time periods; this data generating process does not change between time periods. The data generating process is unknown and can be likened to a true model that explains changes in the time series. If the generating process changed between periods, it would not be possible to have a linear relationship in the regression model, since the parameters would change greatly between time periods.
It can be seen that a strictly stationary process with a finite second moment is automatically a covariance stationary process, but the converse is not true.
Weakly dependent
The weak dependence requirement differs between a strictly stationary process and a covariance stationary process. For a strictly stationary process, it is required that x_t and x_{t+h} are "almost independent" as h increases without bound. The covariance stationary requirement is less abstract and is generally how we think of weak dependence: it requires that the correlation between x_t and x_{t+h} goes to zero sufficiently quickly as h goes to infinity. In other words, we do not want persistent correlation of a variable with itself over time, comparing one time period with another period further and further away.
One example of a weakly dependent process is a moving average process of order 1 (MA(1)). This can be written as

x_t = ε_t + α₁ε_{t−1}

This process states that a once-off change in ε_t will influence x_t in the period of the change and the following period, but not thereafter. The covariance therefore goes to zero within two periods. This process is stationary (since ε_t is i.i.d.) and weakly dependent.
Another example is a stable autoregressive process of order 1 (AR(1)):

x_t = ρx_{t−1} + ε_t

This process states that, as long as |ρ| is less than one, a change in x_t will have a persistent effect on future values of x_t, but the effect decreases to zero over time. It should be noted that if ρ gets close to one, the correlation still decreases to zero over time, but not sufficiently quickly (below about 0.95 is generally considered acceptable). A stable AR(1) process is also weakly dependent and stationary.
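The geometric decay of AR(1) autocorrelations can be illustrated with a short simulation (a sketch in Python with NumPy; the series length, seed and ρ = 0.5 are arbitrary choices for illustration, not values from the text):

```python
import numpy as np

# Simulate a stable AR(1): x_t = rho * x_{t-1} + e_t with |rho| < 1.
rng = np.random.default_rng(0)
n, rho = 100_000, 0.5
e = rng.standard_normal(n)
x = np.empty(n)
x[0] = e[0]
for t in range(1, n):
    x[t] = rho * x[t - 1] + e[t]

# For a stationary AR(1), Corr(x_t, x_{t+h}) = rho**h, so the sample
# autocorrelations should fall geometrically towards zero (weak dependence).
acf = [np.corrcoef(x[:-h], x[h:])[0, 1] for h in (1, 2, 3)]
# roughly 0.5, 0.25, 0.125
```

A random walk simulated the same way (ρ = 1) would instead show autocorrelations that stay arbitrarily close to one.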
We now turn back to the regression model, as these assumptions need to hold in the model.
Assumption 1
The model must be linear in the parameters and the process must be stationary and weakly dependent so that the LLN and CLT can be applied to sample averages. For this purpose, weak dependence is the more important property.
Assumption 2
No perfect multicollinearity
Assumption 3
The explanatory variables are contemporaneously exogenous, meaning E(u_t|x_t) = E(u_t) = 0. Note that this assumption is less strict than the finite sample assumption, as it is not concerned with how the error for one period is related to the explanatory variables in other time periods.
Under assumptions 1-3 OLS will be consistent, but not necessarily unbiased. Strict exogeneity is required for unbiasedness. In large samples, the bias is likely to be small.
Assumption 4
The errors are contemporaneously homoscedastic, Var(u_t|x_t) = σ². Note again that this is less strict than the finite sample assumption. Further note that x_t here can also include lags of the dependent and/or independent variables.
Assumption 5
The errors for different time periods are uncorrelated, no serial correlation.
Under assumptions 1-5, OLS estimators are asymptotically normal and the standard errors, t, F, and LM test statistics are valid. If a model has trending explanatory variables and the trend is stationary and included in the model, assumptions 1-5 can still be applied.
The correlations of many variables do not go to zero sufficiently quickly over time; in other words, the time series is highly persistent and the level in one period depends greatly on the level in the previous period(s). A process that describes such a time series is a random walk, which is a special case of a unit root process. The term unit root comes from the ρ in the AR(1) model equalling unity (one). A random walk can be written

y_t = y_{t−1} + ε_t

In this model the expected value does not depend on the time period, but the variance does: it increases as a linear function of time, and the correlation between y_t and y_{t−1} gets arbitrarily close to one. This process is not weakly dependent and is non-stationary. It is also possible for this process to have a time trend, which is called a random walk with drift.
We can therefore estimate ρ by obtaining the correlation between y_t and y_{t−1}, but it should be noted that this estimate is biased, and the bias can be large (we therefore rather use the Dickey-Fuller test discussed below). Note that if the process has a trend, we first need to detrend before taking the correlation. If |ρ| > 0.8 to 0.9 (preferences differ on this) then it is better to conclude that the process is I(1). If the process is I(1), we need to take the first difference of the process and include this in the regression. For the random walk process, the first difference is therefore

Δy_t = y_t − y_{t−1} = ε_t

where Δy denotes the first difference in y. Note that we lose the first observation, meaning we start at period 2, as a result of taking the first difference. Taking the first difference also has the advantage of detrending the time series, since the first difference of a linear trend is constant.
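A quick NumPy sketch (on a simulated series, not data from the text) of how first-differencing turns an I(1) random walk back into its I(0) innovations:

```python
import numpy as np

rng = np.random.default_rng(1)
e = rng.standard_normal(1_000)   # i.i.d. innovations
y = np.cumsum(e)                 # random walk: y_t = y_{t-1} + e_t

dy = np.diff(y)                  # first difference, dy[t] = y[t+1] - y[t]
# One observation is lost: dy has length 999 and starts at period 2.
```

Here `dy` reproduces `e[1:]` exactly, which is the sense in which differencing removes the unit root (and, for a trending series, the linear trend as well).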
A more formal test for a unit root is the Dickey-Fuller (DF) test. Taking the AR(1) model above and subtracting y_{t−1} from both sides gives

Δy_t = θy_{t−1} + ε_t

where θ = ρ − 1, so that a unit root (ρ = 1) corresponds to θ = 0. This model can be estimated by OLS, but the t-statistic on θ does not follow a normal distribution; it follows what is known as the Dickey-Fuller distribution. We therefore need alternative critical values, which are then used in the t-test. Higher order AR processes to address serial correlation are also allowed and can be written

Δy_t = α + θy_{t−1} + γ₁Δy_{t−1} + ⋯ + γ_qΔy_{t−q} + ε_t

If a series has a time trend, we need to include the trend in the Dickey-Fuller test. Note, however, that alternative critical values need to be used after including the time trend.
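The DF regression can be sketched with plain OLS (NumPy only; the function name and the simulated series are ours for illustration). The t-statistic must be compared with Dickey-Fuller, not normal, critical values (about −2.86 at the 5% level when an intercept is included):

```python
import numpy as np

def df_tstat(y):
    """t-statistic on theta in the DF regression dy_t = alpha + theta*y_{t-1} + e_t."""
    dy = np.diff(y)
    X = np.column_stack([np.ones(len(dy)), y[:-1]])
    beta = np.linalg.lstsq(X, dy, rcond=None)[0]
    resid = dy - X @ beta
    s2 = resid @ resid / (len(dy) - 2)               # error variance estimate
    se = np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])  # std error of theta_hat
    return beta[1] / se

rng = np.random.default_rng(2)
walk = np.cumsum(rng.standard_normal(500))   # true unit root: theta = 0
ar = np.empty(500)
ar[0] = 0.0
for t in range(1, 500):
    ar[t] = 0.5 * ar[t - 1] + rng.standard_normal()  # stationary: theta = -0.5
```

For the stationary series the statistic falls far below −2.86 (reject the unit root); for the random walk it typically does not.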
Spurious regression
It is possible for two variables to be correlated only because both are correlated with a third variable not included in the model; including this variable removes the correlation between the first two. If this is the case we have a spurious regression. This is of course also possible for time series, but time series have an additional issue: if we have an I(1) dependent variable and at least one I(1) independent variable, this will in most instances result in a spurious regression, meaning the t-statistics cannot be trusted.
One way to address this is by differencing the variables, but this limits our application. Another possibility is to determine whether the two I(1) variables are co-integrated.
Co-integration
If two I(1) variables have a long run relationship, it is possible that the difference between the two variables is an I(0) process. This can be written

y_t − βx_t = s_t, where s_t is I(0)

To test whether two I(1) variables are co-integrated we perform the Engle-Granger test:
1) Regress y_t on x_t by OLS and save the residuals.
2) Run the Dickey-Fuller regression on the residuals to estimate θ.
3) Use the Engle-Granger critical values to determine whether θ is significant.
4) If the t-stat is below the critical value then y_t − βx_t is I(0), meaning we can calculate a new variable that often has an economic interpretation.
If we include this new variable, we call the model an error correction model, which can be written (note that the variables are differenced because y and x are I(1))

Δy_t = α₀ + α₁Δx_t + δ(y_{t−1} − βx_{t−1}) + u_t
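A minimal Engle-Granger sketch on simulated data (NumPy only; the data-generating process and β = 2 are invented for illustration). The residual-based t-statistic would be judged against Engle-Granger, not standard, critical values:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2_000
x = np.cumsum(rng.standard_normal(n))   # x_t is I(1)
y = 2.0 * x + rng.standard_normal(n)    # y_t - 2*x_t is I(0): co-integrated

# Step 1: regress y on x by OLS and save the residuals.
X = np.column_stack([np.ones(n), x])
b = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ b

# Step 2: DF-style regression on the residuals:
# delta_resid_t = theta * resid_{t-1} + e_t (slope by OLS, no intercept).
theta = (resid[:-1] @ np.diff(resid)) / (resid[:-1] @ resid[:-1])
# A strongly negative theta reflects the mean reversion of y_t - beta*x_t.
```

With no co-integration (e.g. `y` an independent random walk), the residuals would themselves be I(1) and theta would be close to zero.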
Serial correlation
Remember, in a dynamically complete model there is no serial correlation. Serial correlation can, however, exist in other types of models, or where there is misspecification in a dynamically complete model. When there is serial correlation, OLS remains consistent and unbiased (even if the model includes lagged dependent variables, provided the dynamics are correctly specified). OLS will, however, be less efficient (no longer BLUE) and the usual test statistics will be invalid. The goodness of fit measures (R-squared) remain valid.
If there is no serial correlation in adjacent errors, then ρ = 0. This is therefore the null hypothesis of the test. Since we only have strictly exogenous variables, the estimate of u_t is unbiased and can be used for testing the null. Therefore:
I. Run OLS of y_t on x_t1, x_t2, …, x_tk and save the residuals û_t for all t.
II. Run OLS of û_t on û_{t−1} for all t. The p-value of the parameter ρ indicates serial correlation. Generally the null can be rejected at the 5 percent level. The test can be made robust to heteroscedasticity by computing robust standard errors.
It should be remembered that this test only tests for AR(1) serial correlation, meaning only correlation in adjacent error terms. There may still be serial correlation in non-adjacent error terms.
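The two-step test above can be sketched as a simulation (NumPy only; the model y_t = 1 + 2x_t + u_t with ρ = 0.6 is invented to illustrate):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5_000
x = rng.standard_normal(n)
u = np.empty(n)
u[0] = rng.standard_normal()
for t in range(1, n):
    u[t] = 0.6 * u[t - 1] + rng.standard_normal()   # AR(1) errors, rho = 0.6
y = 1.0 + 2.0 * x + u

# Step I: OLS of y on x; save the residuals.
X = np.column_stack([np.ones(n), x])
b = np.linalg.lstsq(X, y, rcond=None)[0]
uhat = y - X @ b

# Step II: regress uhat_t on uhat_{t-1}; rho_hat far from zero (here near 0.6;
# in practice its p-value is checked) signals AR(1) serial correlation.
rho_hat = (uhat[:-1] @ uhat[1:]) / (uhat[:-1] @ uhat[:-1])
```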
Another possible test is the Durbin-Watson test, but this requires that the classical assumptions all hold and provides the same answer as the test above. It is therefore suggested that this test rather not be used.
For higher order serial correlation (e.g. AR(2) errors, meaning there are two lags), the same test can be done, but with the higher order lagged residuals included in step II. The F test is then used to test for joint significance (all parameters of the lagged residuals should be zero), and the test can be made robust to heteroscedasticity as discussed for cross-sectional data.
To understand the estimator, you need to understand how the data is transformed. AR(1) errors (strictly speaking residuals, as we use the estimate ρ̂, but for ease we just write ρ) are written

u_t = ρu_{t−1} + e_t

where Var(u_t) = σ_e²/(1 − ρ²). Note that ρ indicates the extent of the serial correlation; if it is 0 then Var(u_t) = σ_e², meeting the no serial correlation and homoscedasticity assumptions. To obtain this we take the quasi-difference of each variable in the regression, except in time period 1. This is done, for each time period t > 1, by multiplying each variable in period t − 1 by ρ and deducting the result from the value in period t (e.g. for time period 2, period 1 is multiplied by ρ and deducted from period 2). Note that if ρ were equal to one (which we assume not to be the case) this would be exactly the same as taking the first difference to transform a variable to be weakly dependent.
To include time period 1 in our estimation, each variable in this time period is multiplied by (1 − ρ²)^(1/2). Note that these transformations are performed automatically by the regression software.
For higher order serial correlation (AR(q)) a similar approach is followed by quasi-transforming all variables. This again is done automatically by the regression software.
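The quasi-differencing (Prais-Winsten) transform itself is only a few lines (a sketch; the helper name is ours):

```python
import numpy as np

def quasi_difference(v, rho):
    """Prais-Winsten transform of one variable: period 1 is scaled by
    sqrt(1 - rho**2); for t > 1, v_t is replaced by v_t - rho * v_{t-1}."""
    out = np.empty(len(v))
    out[0] = np.sqrt(1.0 - rho ** 2) * v[0]
    out[1:] = v[1:] - rho * v[:-1]
    return out

# Applied to AR(1) errors u_t = 0.7*u_{t-1} + e_t with the true rho, the
# transform returns the serially uncorrelated e_t for every t > 1.
rng = np.random.default_rng(5)
n = 10_000
u = np.empty(n)
u[0] = rng.standard_normal()
for t in range(1, n):
    u[t] = 0.7 * u[t - 1] + rng.standard_normal()
ut = quasi_difference(u, 0.7)
```

Applying the same transform to y and to every x (with ρ̂ in place of ρ) is what the FGLS routine in the regression software does.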
From the above, there are two possible estimators when the errors are serially correlated with strictly exogenous variables: OLS and FGLS. FGLS is generally preferred since the transformation ensures all variables are I(0) and that there is no serial correlation. FGLS will, however, only be consistent if

Cov(x_t, u_t) = 0 and Cov(x_{t−1}, u_t) = 0

Note that this is a stronger requirement than for OLS, which only needs the first covariance to hold. If the second covariance does not hold, then OLS can be preferred to FGLS, since OLS will be consistent (although the test statistics will be invalid). Taking the difference of the variables with OLS, especially when ρ is large, eliminates most of the serial correlation. Both OLS and FGLS should be used and reported to show (hopefully) that there are no large differences between the estimated parameters.
It may further be a good idea to compute serial-correlation-robust (HAC) standard errors even when the independent variables are strictly exogenous, after using OLS or FGLS. FGLS is included since the parameter ρ may not account for all the serial correlation (the errors may not follow the selected AR model) and there may be heteroscedasticity in the errors.
Heteroscedasticity
If the errors are heteroscedastic but there is no serial correlation, the same procedures as discussed for cross-sectional data can be applied to time series. A specific type of heteroscedasticity in time series is autoregressive conditional heteroscedasticity (ARCH). This type of heteroscedasticity does not violate the OLS assumptions (the unconditional error variance is constant), so OLS remains BLUE, but in the presence of ARCH there may be estimators that are asymptotically more efficient than OLS, for instance weighted least squares. An ARCH(1) model for the errors can be written

u_t² = α₀ + α₁u²_{t−1} + ε_t

where α₁ captures the serial correlation in the squares of the errors, even though there is no serial correlation in the errors themselves (non-squared). This type of heteroscedasticity is often found when the model contains lagged dependent variables (hence the name), although it may be present even when the model does not contain lagged dependent variables.
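A short simulation (NumPy; α₀ = 1 and α₁ = 0.4 are arbitrary) makes the point concrete: under ARCH the errors themselves are serially uncorrelated while their squares are not. Here the errors are generated as u_t = sqrt(α₀ + α₁u²_{t−1})·e_t, one standard way to produce ARCH(1) behaviour:

```python
import numpy as np

rng = np.random.default_rng(6)
n, a0, a1 = 20_000, 1.0, 0.4
u = np.empty(n)
u[0] = rng.standard_normal()
for t in range(1, n):
    # conditional variance a0 + a1 * u_{t-1}^2, i.i.d. standard normal shocks
    u[t] = np.sqrt(a0 + a1 * u[t - 1] ** 2) * rng.standard_normal()

corr_u = np.corrcoef(u[:-1], u[1:])[0, 1]             # near 0: no serial corr.
corr_u2 = np.corrcoef(u[:-1] ** 2, u[1:] ** 2)[0, 1]  # clearly positive: ARCH
```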
2SLS estimator
The mechanics of the 2SLS estimator are identical for time series and cross-sectional data. Just as variables are differenced for time series, so instrumental variables can be differenced. The tests and correction for serial correlation change slightly when using the 2SLS estimator.
To test for AR(1) serial correlation:
1) Estimate the model by 2SLS and save the residuals, û_t.
2) Estimate y_t = β₀ + β₁x_t1 + ⋯ + ρû_{t−1} + error by 2SLS.
3) The null hypothesis is that the parameter on û_{t−1} is zero (no serial correlation).
To correct for serial correlation, serially robust standard errors can be computed, or we can quasi-difference the data and apply 2SLS to the transformed equation.
SEMs
For time series, using 2SLS for simultaneous equation models and to address simultaneity bias is no different than for cross-sectional data. In SEMs, lagged variables are often called predetermined variables. It should further be noted that the series in SEMs are generally highly persistent, and the correct treatment for these series is required (for instance first differencing).
2) No perfect multicollinearity among instrumental variables, and the order condition for identification holds. This means we need at least one excluded exogenous variable (whose parameter is not zero in the reduced form equation) for each included endogenous variable. For SEMs the rank condition is required.
3) E(u) = 0, Cov(z_j, u) = 0
Note that each exogenous independent variable is seen as its own instrumental variable; therefore all exogenous variables are denoted z_j.
An infinite distributed lag (IDL) model can be written

y_t = α + δ₀z_t + δ₁z_{t−1} + δ₂z_{t−2} + ⋯ + u_t

where it is required that δ_j → 0 as j → ∞, which makes logical sense since the distant past has less of an impact than the recent past for nearly all series. The interpretation of this model is the same as for the FDL model: δ_j is the change in the expected value of the dependent variable, after j periods, for a one-unit temporary change in the independent variable at time zero. δ₀ is again the impact propensity, and the sum of all the coefficients that are sufficiently large can be used to approximate the long run propensity (an approximation is required since the model has infinitely many lags).
Estimation requires strict exogeneity of z, although in certain situations this assumption can be weakened to involve only present and past periods (not z_{t+1}, …).
If we assume the geometric lag structure

δ_j = γρ^j

where ρ is between zero and one in absolute value (to ensure δ_j → 0 as j → ∞) and j = 0, 1, 2, …, then the original IDL model at time t is written
y_t = α + γz_t + γρz_{t−1} + γρ²z_{t−2} + ⋯ + u_t

If we also write this for time t − 1, multiply the t − 1 equation by ρ and subtract it from the time t equation, we get the geometric lag model

y_t = α(1 − ρ) + γz_t + ρy_{t−1} + v_t

where v_t = u_t − ρu_{t−1}, an MA(1). The impact propensity is γ and the long run propensity can be shown to be γ/(1 − ρ).
This equation can be estimated by OLS, but there are a few problems: y_{t−1} is endogenous, and v_t is serially correlated where ρu_{t−1} ≠ 0, so the model is not dynamically complete. The endogeneity can be resolved by using 2SLS, and a good instrumental variable for y_{t−1} is generally z_{t−1} (z_t and z_{t−1} are the IVs). Note that using z_{t−1} requires the strict exogeneity assumption to hold (otherwise it is correlated with v_t). Afterwards, we can adjust the standard errors as discussed previously.
Forecasting
Some terminology:
Conditional forecasting is where we know the future values of the independent variables; it is then easy to forecast the future dependent variable. We can write

E(y_{t+1}|I_t) = β₀ + β₁z_{t+1}

where I_t denotes the information available at time t, and we need to assume that E(u_{t+1}|I_t) = 0.
The problem with conditional forecasting is that we rarely know z_{t+1}. If we for instance want to forecast a trend, then we can use conditional forecasting, as there we do know z_{t+1}.
Unconditional forecasting is where we do not know the level of the independent variables, as they are not included in I_t. This means we first have to forecast z_{t+1} before we can forecast y_{t+1}.
One-step forecasting
The conditional forecasting problem of not knowing z_{t+1} can be resolved by forecasting the dependent variable based on lags of the dependent and independent variables. This allows us to "know" the right-hand-side values, as they are observed in the current and past periods. A model that makes use of this approach is a vector autoregressive model (VAR) and can be written

y_t = δ₀ + α₁y_{t−1} + β₁z_{t−1} + α₂y_{t−2} + β₂z_{t−2} + ⋯ + u_t

where we include as many lags as needed to make the model dynamically complete. To forecast we would have

ŷ_{t+1} = δ̂₀ + α̂₁y_t + β̂₁z_t + α̂₂y_{t−1} + β̂₂z_{t−1} + ⋯

and all the variables on the right-hand side are included in I_t. As we obtain additional data we can repeat the estimation. If, after controlling for past y, z helps to forecast y, we say that z Granger causes y. If we include additional variables w, we say that z Granger causes y conditional on w. If we consider different models that can forecast the dependent variable, the model with the lowest root mean squared error or mean absolute error is generally preferred.
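A one-step VAR forecast can be sketched by estimating the y-equation with OLS on one lag of each variable (a simulation; the coefficients 0.5, 0.3 and 0.4 are invented):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 3_000
y = np.zeros(n)
z = np.zeros(n)
for t in range(1, n):
    z[t] = 0.4 * z[t - 1] + rng.standard_normal()
    y[t] = 0.5 * y[t - 1] + 0.3 * z[t - 1] + rng.standard_normal()

# Estimate y_t = d0 + a1*y_{t-1} + b1*z_{t-1} + u_t by OLS.
X = np.column_stack([np.ones(n - 1), y[:-1], z[:-1]])
d0, a1, b1 = np.linalg.lstsq(X, y[1:], rcond=None)[0]

# One-step-ahead forecast: every right-hand-side value is already in I_t.
y_next = d0 + a1 * y[-1] + b1 * z[-1]
```

A joint F test that b1 (and any further lags of z) is zero would be the Granger causality test described above.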
Multiple-step forecasting
Multiple-step forecasting is less reliable than one-step forecasting since the error variance increases as the forecast horizon increases. We can use the VAR model above to also forecast the independent variables, and then use the forecasted dependent and independent variables as lags to forecast y_{t+2}. This process can be repeated indefinitely, but obviously becomes less reliable as the forecast horizon increases.
PANEL DATA
Panel data is similar to pooled cross-sectional data, with the difference being that the same individuals, countries, firms, etc. are sampled in different time periods. A panel dataset is therefore organised as:

City         | Year       | Variables
Pretoria     | 2015 (t=1) | 421
Pretoria     | 2016 (t=2) | 464
Johannesburg | 2015 (t=1) | 658
Johannesburg | 2016 (t=2) | 863

One estimator that can be used on this data is pooled OLS, but this is seldom used since it does not exploit the benefits of panel data. The fact that the same individual, firm, country, etc. is sampled over time gives panel datasets the advantage of being able to control for fixed factors of the individuals, firms, countries, etc. that are correlated with the dependent variable over time. To see this we can write the error term for a panel as

v_it = a_i + u_it
where v_it is known as the composite error and includes both constant (a_i) and time-varying (u_it) unobserved factors explaining the dependent variable. a_i is called the fixed effect, unobserved heterogeneity, or individual/firm/country etc. heterogeneity, and u_it is called the idiosyncratic error. A fixed effects model is used to account for the fixed effect. It is useful to control for these fixed effects, as this also removes a lot of the persistence in the variables.
For a two-period panel, we simply take the first difference between the model for t=2 and t=1 (note that δ₀d2_t = 0 in period 1), which gives one cross section

Δy_i = δ₀ + β₁Δx_i + Δu_i

Using this model is the same as saying we are only modelling what has changed over time; anything constant, including a_i, is controlled for. This model is also similar to the difference-in-differences estimator for pooled cross sections, with the only difference being that it is the same individual, firm, country, etc. that has been sampled.
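A two-period simulation (invented numbers: β = 2, δ₀ = 1, and a fixed effect built into x) shows why differencing helps: pooled OLS is contaminated by a_i, while FD recovers β:

```python
import numpy as np

rng = np.random.default_rng(8)
N = 4_000
a = rng.standard_normal(N)                        # fixed effect a_i
x1 = a + rng.standard_normal(N)                   # x correlated with a_i
x2 = a + rng.standard_normal(N)
y1 = 2.0 * x1 + a + rng.standard_normal(N)        # t = 1
y2 = 1.0 + 2.0 * x2 + a + rng.standard_normal(N)  # t = 2 (d0 = 1)

# First differencing removes a_i: dy_i = d0 + beta*dx_i + du_i.
dx, dy = x2 - x1, y2 - y1
X = np.column_stack([np.ones(N), dx])
d0_hat, beta_hat = np.linalg.lstsq(X, dy, rcond=None)[0]

# Pooled OLS on the levels is biased upwards, because a_i sits in the
# composite error and cov(x, a) > 0 here.
xp, yp = np.concatenate([x1, x2]), np.concatenate([y1, y2])
Xp = np.column_stack([np.ones(2 * N), xp])
pooled_beta = np.linalg.lstsq(Xp, yp, rcond=None)[0][1]
```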
This model can be extended to more time periods and the process of taking the first difference (t2−t1; t3−t2, etc.) remains the same. To ensure that the R-squared for the model is correctly calculated, it is advised to drop the period-2 dummy and include an intercept. The model is therefore written as

Δy_it = α₀ + α₃d3_t + α₄d4_t + ⋯ + β₁Δx_it1 + β₂Δx_it2 + ⋯ + Δu_it

Note that the differenced errors will only be serially uncorrelated if the non-differenced errors (u_it) follow a random walk; if they are AR(q) this will not hold.
7. Conditional on X_i, the Δu_it are independent and identically distributed normal random variables.
Under assumptions 5-7, the OLS test statistics are valid; under 5-6 they are asymptotically valid.
Testing for homoscedasticity and serial correlation can be done in exactly the same manner as for cross sections and time series, respectively. If we only have heteroscedasticity (no serial correlation), the corrections for cross sections can be used. If we only have serial correlation, this can be corrected by way of the PW transformation. Note, however, that this needs to be done by hand, as the regression software assumes that the serial correlation runs over both i and t, but in panel data the i's are independent. The HAC standard errors can also be used.
If we have both heteroscedasticity and serial correlation, then one option is to run OLS and take HAC standard errors. The general approach, however, is clustering. In this approach each cross-sectional unit is defined as a cluster over time, and arbitrary correlation is allowed within each cluster. Clustered standard errors are valid in large panel datasets with any kind of serial correlation and heteroscedasticity.
The fixed effects (within) estimator instead estimates the model on time-demeaned data, where for instance ẍ_it = x_it − x̄_i and ¨ indicates time-demeaned data. Note that the intercept has been eliminated and the degrees of freedom are calculated as df = NT − N − K (automatically done by the regression software).
It is important to see that with the fixed effects estimator we cannot include time-constant variables (such as gender, race, or the distance of a house from a river). Further, if we include dummy variables for time, then we cannot include variables with a constant change over time, such as age or years of experience. To calculate the fixed effect â_i (if of importance) we write

â_i = ȳ_i − β̂₁x̄_i1 − ⋯ − β̂_k x̄_ik
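The within transformation and the fixed-effect recovery can be sketched as follows (simulated data; β = 2 is invented):

```python
import numpy as np

rng = np.random.default_rng(9)
N, T = 500, 5
a = rng.standard_normal(N)                     # fixed effects a_i
x = a[:, None] + rng.standard_normal((N, T))   # regressor correlated with a_i
y = 2.0 * x + a[:, None] + rng.standard_normal((N, T))

# Time-demean: x_it - xbar_i (both the intercept and a_i drop out).
xd = x - x.mean(axis=1, keepdims=True)
yd = y - y.mean(axis=1, keepdims=True)
beta_hat = (xd.ravel() @ yd.ravel()) / (xd.ravel() @ xd.ravel())

# Recover the fixed effects if they are of interest:
a_hat = y.mean(axis=1) - beta_hat * x.mean(axis=1)
```

Despite x being correlated with a_i (which would bias pooled OLS), the within estimator recovers β.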
FD or FE
Although FD and FE both provide consistent estimates of the parameters (assuming the related assumptions hold), the extent of serial correlation determines which estimator is most efficient. If u_it is not serially correlated, FE is more efficient. If u_it follows a random walk, then FD is more efficient. If there is substantial negative correlation in Δu_it, then FE is more efficient. If T is large and N is not large, use FD, as inference with FE can be very sensitive to violations of the assumptions. If the model includes a lagged dependent variable, then the bias is much smaller under FE than FD, therefore use FE.
Under assumptions 1-6, FE is BLUE (smaller variances than FD, since the idiosyncratic errors are then serially uncorrelated, which is not the case for their first differences).
7. Conditional on X_i and a_i, the u_it are independent and identically distributed normal random variables.
Under assumptions 5-7 the test statistics are valid; under 5-6 they are asymptotically valid (large N, small T).
Random effects model
It is generally preferred to use fixed effects in panel data (this is one of the strengths of panel data), but if cov(x_itj, a_i) = 0 then the FE/FD estimator is not the most efficient. We could then use pooled OLS with a model that can be written as

y_it = β₀ + β₁x_it1 + ⋯ + β_k x_itk + v_it, with v_it = a_i + u_it

where the error term includes both the fixed effect and the idiosyncratic error. Because the fixed effect is left in the error, v_it will necessarily be serially correlated across time, and therefore pooled OLS will have invalid standard errors (unless serial-correlation and heteroscedasticity robust standard errors are calculated). Further, we lose all the benefit of being able to control for fixed effects. To alleviate these issues we use GLS and the random effects estimator, with

θ = 1 − [σ_u²/(σ_u² + Tσ_a²)]^(1/2)

where σ_u² is the variance of the idiosyncratic error, T is the total number of time periods for which data is observed (note in an unbalanced panel this will change over the i's) and σ_a² is the variance of the fixed effects. After quasi-demeaning the data (where demeaning is the same as for the fixed effects estimator) the equation becomes⁵

y_it − θȳ_i = β₀(1 − θ) + β₁(x_it1 − θx̄_i1) + ⋯ + (v_it − θv̄_i)

It can therefore be seen that the random effects estimator subtracts a fraction (θ) of the time average from the data, and the resulting errors are serially uncorrelated. Also see that if θ = 0 the random effects estimator becomes the pooled OLS estimator, and if θ = 1 it becomes the fixed effects estimator. There is further a tendency for θ to approach one as the number of time periods increases, meaning that RE and FE will give very similar results. Note that θ is never known, but it can be estimated, and therefore we use FGLS.
⁵ Note that the original equation here is the same as the fixed effects model, but with a composite error term.
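The behaviour of θ, which governs how close RE sits to pooled OLS (θ = 0) or FE (θ = 1), is easy to verify directly (a sketch using the standard random-effects formula for θ; the function name is ours):

```python
import numpy as np

def re_theta(sigma_u2, sigma_a2, T):
    """theta = 1 - sqrt(sigma_u^2 / (sigma_u^2 + T * sigma_a^2))."""
    return 1.0 - np.sqrt(sigma_u2 / (sigma_u2 + T * sigma_a2))

# theta = 0 when sigma_a^2 = 0 (no fixed-effect variance: pooled OLS), and
# theta approaches 1 as T grows (RE approaches FE).
```

In practice σ_u² and σ_a² are unknown and replaced by estimates, which is why the estimator is FGLS.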
2. No perfect multicollinearity. Because time-constant independent variables are allowed, additional assumptions are required on how the unobserved fixed effect is related to the independent variables.
3. E(u_it|X_i, a_i) = E(u_it) = 0 (strict exogeneity assumption) and E(a_i|X_i) = β₀, which means that there is no correlation between the unobserved effect and the explanatory variables.
Under assumptions 1-5, RE is consistent and the test statistics are asymptotically valid (large N, small T). Asymptotically, RE is also more efficient than pooled OLS, and more efficient than FE for the estimates on time-varying variables. FE is more robust, since it remains unbiased when a_i is correlated with the regressors; RE is more efficient when the RE assumption holds, but biased (and hence not BLUE) when it does not.
If assumptions 4 and 5 do not hold, use clustered standard errors (discussed under the FD assumptions).
A benefit of random effects over fixed effects is that the transformed errors are serially uncorrelated (although serial correlation is easily corrected for under FE/FD and pooled OLS) and time-constant independent variables can be included in the model. Therefore, if the variable of interest is time-constant (e.g. gender), FE/FD cannot be used and another estimator is needed.
Generally it cannot easily be assumed that cov(x_itj, a_i) = 0, which means that FE/FD should be used (otherwise we have biased estimates). The Hausman test can be used to test this assumption, but note that failure to reject does not mean that we should use RE; it means that we can use either estimator. If the Hausman test rejects the null, it means that we should be careful to assume that cov(x_itj, a_i) = 0 and that FE/FD may be preferred. Note, however, that the Hausman test is not a model selection test and should not be used as such.
Further, if we have reason to believe that we do not have a random sample from the population, FE/FD should be used, as this is the same as allowing a unique intercept for each unit. FE/FD is also more robust in unbalanced panels where the reason for selection may be correlated with the error term.
The correlated random effects (CRE) approach assumes

a_i = α + γx̄_i + r_i

so that γ indicates the correlation between a_i and x_it. Substituting a_i as assumed above into the fixed effects model gives

y_it = α + βx_it + γx̄_i + r_i + u_it

where r_i + u_it is a composite error and r_i is a time-constant unobservable. Note the only difference is the inclusion of the time-average variable x̄_i. Including this variable (which can easily be calculated for each independent variable) is the same as demeaning the data, and therefore the estimate of β is exactly the same under CRE and FE. However, because we are not demeaning, we can include time-constant variables in the model. Further, γ can be seen as a further test between FE and RE: if γ = 0 then there is no correlation between a_i and x_it, meaning either the FE or the RE estimator can be used. If γ is statistically significant then the assumption for RE does not hold (economic significance should also be considered) and we may prefer FE.
When using the CRE model, it is important not to include time averages of variables that change only over time and not over units (for instance dummies for years), but if the panel is unbalanced, these should be included. Further, in unbalanced panels the time averages should be calculated based on the number of periods for which data is available per unit, which will differ across units in the panel. The assumptions for CRE follow the FE estimator.
IV estimator
For panel data, the mechanics of the 2SLS estimator remain the same as for cross-sectional data. The unobserved constant effect is first removed by FE/FD and then the 2SLS estimator is used. Because the constant effect is removed, the instrumental variables will most likely have to be time-varying; otherwise they are unlikely to be correlated with the transformed endogenous variable. SEMs also do not pose any particular challenge.
To ensure that all assumptions are met, refer to the assumptions for 2SLS for cross-sectional data, read together with the homoscedasticity and serial correlation 2SLS assumptions for time series data, and then the relevant effects estimator assumptions.
There are multiple estimators that can be used; refer to the Stata manual for xtivreg.
Spatial panels
When observing firms, countries and other similar samples, cross-sectional correlation (also called spatial correlation) can cause problems. The correlation mainly arises as a result of spatial dependency and spatial structure, and it results in inefficient standard errors. For a correction, see the Stata paper on xtscc.