1. Research Question
A research question must be an interesting puzzle which can be understood through some
theory and accepts falsifiable hypotheses to explain the phenomenon.
2. What kind of data/methodology will allow you to answer it?
Even with perfect data, that is, being able to observe and accurately measure all the data for
the whole population, statistical analysis has some limitations. Some limitations come from the
assumptions inherent in the methods while others are a consequence of noise in reality. Some tools to
analyze the effects of certain factors are descriptive or comparative statistics, modeling, experiments,
in vitro analysis, regression analysis, agent-based modeling, spatial analysis, etc. These methods
are the quantitative part of possible analysis as opposed to qualitative analysis.
Regression analysis allows for hypothesis testing, i.e., testing an assumption against empirical data
and providing an indicator, such as a p-value, of how likely outcomes at least as extreme as those
observed would be if the null hypothesis were true. Given the nature of the analysis and the importance
of certainty, the researcher can choose a significance level at which to reject the null hypothesis,
and then rejects the null hypothesis if the p-value obtained from the data is below said threshold.
Lastly, consideration is given to practical significance, which calls for looking at the effect size in
order to judge whether the variation in the outcome predicted by the factor is relevant within the
context. Relevance can be judged by the relative deviation in the outcome that is possible given how
manipulable the explanatory variable is.
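As a minimal sketch of the decision rule described above, assuming a sample large enough that the test statistic is approximately standard normal (the estimate, standard error, and 5% threshold below are hypothetical numbers chosen only for illustration):

```python
import math

def two_sided_p_value(estimate, std_error, null_value=0.0):
    """Two-sided p-value under a standard normal approximation."""
    z = (estimate - null_value) / std_error
    # P(|Z| >= |z|) for Z ~ N(0, 1), via the complementary error function
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical coefficient estimate and its standard error
estimate, std_error = 0.50, 0.20
alpha = 0.05  # significance level chosen by the researcher

p = two_sided_p_value(estimate, std_error)
reject_null = p < alpha  # reject only when the p-value is below the threshold
print(f"z = {estimate / std_error:.2f}, p = {p:.4f}, reject H0: {reject_null}")
```

Statistical significance alone does not settle the question: the size of the estimate (0.50 here) still has to be judged against what matters in context.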
3. Regression analysis (Standard Econometrics)
Regression analysis is a set of tools that allow for hypothesis testing through comparative statics.
The basic idea is to use an economic model or mathematical equations as a base to fit data given
said structure and uncover estimates for the parameters that best solve the problem. These
estimates represent the values that best fit the data given the specification and estimator employed.
These estimates, with their associated certainty and effect size considerations, can be used for
hypothesis testing, prediction, and other purposes such as strategy design. Generally, these
applications come from analysis at mean values and ceteris paribus, i.e., where we would
observe it most of the time and when everything else is held constant.
Theory and empirical analysis are complements, as it is borderline absurd to do one
without the other. Theory is the underlying description of how something works. Any strategy or
policy that is based on science follows some understanding that comes from theory. Likewise,
theory, until it has been supplemented and revised through empirical analysis, should not be
applied unless necessary, as it carries high uncertainty. As is usual in science, theory is not to be
taken as true, but carries its worth in its practical usefulness and its resistance to being disproved.
Theory which has not been tested falls short in credentials regardless of its practical usefulness.

4. Theory
A theory must be a proposed explanation of a phenomenon observed or believed to be
possible. It must be of practical usefulness and it must be falsifiable. Good theories also must work
within a domain which is set by working assumptions. The mechanisms or processes that make up the
theory must also be consistent with logic or observable behavior. Lastly, the best theories also allow
the theoretical realm to be expanded into a model that can be used to better understand the
workings of the phenomenon and operationalize its arguments with measurable indicators.
Model Specification for Empirical Analysis
A good model operationalizes the theory in ways that allow its validity to be tested through
empirical analysis. In the case of regression analysis, that includes protecting the economic model
from omitted variables. A good economic model must include all relevant factors according to
the theory, including an error term if the model is stochastic and not deterministic. The equation or
series of equations used in the economic model must also have the correct functional form, for example,
a linear regression with a single outcome variable as opposed to a non-linear regression.
5. Data and Methodology
The data and methodology depend on your economic model, but are also constrained by
what data is both attainable and feasible. The methodology employed will most likely be a function
of the data, which in turn is influenced by the research question, theory, and chosen model. In
practice, the ultimate decision will depend on all factors and not necessarily in the order proposed.
Two aspects that largely influence the methodology are the previous approaches to the
issue in the literature and the quality and nature of the data.
6. Least Squares Method (Most used linear regression approach)
The Least Squares method is a technique in which equations are modeled with a single outcome
variable on the left side and some parameters on the right side. The most common form is a multiple
linear regression with an intercept coefficient,
$$y = \beta_0 + \beta_1 V.I. + \beta_2 X + u$$ (Equation 1)
where y is the outcome variable (which by theory is a dependent variable), $\beta_0$ is an intercept
which gives the fitted outcome when all variables take on the value of zero, $\beta_1$ is the sub-vector
containing the coefficients for the variables of interest, which are denoted V.I., $\beta_2$ is the
remaining sub-vector, which includes the coefficients for the remaining explanatory variables denoted X,
and lastly, u is the stochastic error term, normally distributed by assumption.
Once the model is estimated, the theoretical values become data-estimated values,
$$y = \hat{\beta}_0 + \hat{\beta}_1 V.I. + \hat{\beta}_2 X + \hat{u}$$ (Equation 2)

7. Ordinary Least Squares (OLS)

OLS is the most popular regression analysis method given its characteristics:
a) Computationally feasible
The parameters are estimated using only calculus and algebra. Modern technology
and matrix methods allow statistical software to compute these estimates easily and
promptly using vast amounts of data.
b) Applicable to many linear models
As long as the model has a single observable outcome variable in each equation and
is linear in the parameters, OLS can be employed.
Examples of Linear Right-Side Parameters

Examples of Non-linear Right-Side
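For illustration only (these specific equations are assumptions, not necessarily the examples originally intended here), a model is linear in the parameters when each coefficient enters additively and multiplies a known function of the data, even if the variables themselves are transformed:

```latex
% Linear in the parameters (OLS applies), even with transformed variables
y = \beta_0 + \beta_1 x + \beta_2 x^2 + u
\ln y = \beta_0 + \beta_1 \ln x + u

% Non-linear in the parameters (OLS does not apply directly)
y = \beta_0 + x^{\beta_1} + u
y = \beta_0 + \beta_1 e^{\beta_2 x} + u
```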

However, the full potential of the OLS method is realized when the Gauss-Markov Assumptions are
met, since through the Gauss-Markov Theorem it can be proved that the estimates are both efficient
and unbiased, i.e., the Best Linear Unbiased Estimates (BLUE). The five Gauss-Markov
Assumptions are:
a) Linearity in the functional form

The economic model must have linearity in the right-side parameters.

b) Spherical Errors (or White Noise)

In order to have white noise there must be

Homoscedasticity: the errors have uniform variance, $Var(u_i \mid X) = \sigma^2$
Nonautocorrelation: the errors are uncorrelated between observations, $Cov(u_i, u_j \mid X) = 0$ for $i \neq j$

c) Zero Conditional Mean

It follows that the distribution of the error term has mean zero ($E[u] = 0$) and the
explanatory variables are not correlated with the error term ($Cov(X, u) = 0$).

d) No Perfect Collinearity
The explanatory variables are linearly independent.
The design matrix X (explanatory variables), which is n x k for n observations
and k variables, has full column rank.
e) No simultaneity
The outcome variable is determined by the right-side variables, and these are not
influenced by the outcome variable.
Given these assumptions the derived estimator is
$$\hat{\beta} = (X'X)^{-1} X'y$$
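The closed-form OLS estimator solves the normal equations (X'X)b = X'y. The following is a minimal sketch using only the Python standard library; the helper names `solve` and `ols` and the data are made up for illustration, and statistical software would use more robust numerical routines.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols(X, y):
    """Return the coefficients that solve the normal equations (X'X) b = X'y."""
    k = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)

# Made-up data generated exactly by y = 1 + 2*x1 - 3*x2 (no error term)
x1 = [0.0, 1.0, 2.0, 3.0, 4.0]
x2 = [1.0, 0.0, 2.0, 1.0, 3.0]
X = [[1.0, a, b] for a, b in zip(x1, x2)]  # the column of ones is the intercept
y = [1 + 2 * a - 3 * b for a, b in zip(x1, x2)]

beta_hat = ols(X, y)
print([round(b, 6) for b in beta_hat])
```

Because the data contain no noise, the estimates should recover the intercept 1 and slopes 2 and -3 up to floating-point error. If the columns of X were perfectly collinear, X'X would be singular and the system could not be solved, which is the No Perfect Collinearity assumption at work.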

However, there are technical considerations as well as violations of the assumptions in practice.

Model Specification: incorrect functional form

Omitted variable bias: the model does not control for a variable that influences the outcome

What it does: Violates the Zero Conditional Mean

Consequence: Estimator is biased

Test: Use model comparisons.

Solution: Refer to the theory and test possible variables to verify the robustness of the
results. If the results change, there is evidence that the model suffers from omitted
variable bias, and the omitted variable should be included in order to fix it.
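The consequence of omitting a relevant variable can be illustrated with a small simulation. This is only a sketch under assumed values: the true model y = 1 + 2*x1 + 3*x2 + u, the 0.5 link between x1 and x2, and the helper `simple_slope` are all hypothetical.

```python
import random

def simple_slope(x, y):
    """OLS slope from regressing y on x with an intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

rng = random.Random(0)  # fixed seed so the run is reproducible
n = 5000
x1 = [rng.gauss(0, 1) for _ in range(n)]
# x2 is correlated with x1: x2 = 0.5 * x1 + noise
x2 = [0.5 * a + rng.gauss(0, 1) for a in x1]
# Assumed true model: y = 1 + 2*x1 + 3*x2 + u
y = [1 + 2 * a + 3 * b + rng.gauss(0, 1) for a, b in zip(x1, x2)]

# The short regression omits x2, so part of its effect is loaded onto x1
short_slope = simple_slope(x1, y)
# Auxiliary regression of the omitted variable on the included one
delta = simple_slope(x1, x2)

print(f"slope on x1 when x2 is omitted: {short_slope:.3f} (true value 2)")
print(f"approximate bias formula 2 + 3 * delta: {2 + 3 * delta:.3f}")
```

The slope from the short regression lands near 2 + 3 * 0.5 = 3.5 and not near the true value 2, which is why results that shift when a candidate variable is added point toward omitted variable bias.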

An irrelevant variable may be included in the model

What it does: Potentially contaminates your results or could inflate the explanatory
power of the model.

Consequence: The model does not represent the actual analysis and can potentially alter the
estimates as it attempts to fit the data to noise.

Test: Use model comparisons and calculate AIC / BIC measures to judge the
added value of controlling for a certain variable.

In Stata: estat ic. It is also useful to save the estimates and present these in tables.

Solution: Refer to theory
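As a rough sketch of how AIC and BIC trade fit against the number of parameters, the function below computes both from the residual sum of squares of a linear model with normal errors. The function name and the two sets of numbers are hypothetical, and conventions differ on whether the error variance is counted as a parameter, so values may not match a given package exactly.

```python
import math

def gaussian_ic(rss, n, k):
    """AIC and BIC for a linear model with normally distributed errors.

    rss: residual sum of squares, n: observations, k: estimated coefficients.
    """
    # Maximized Gaussian log-likelihood expressed through the RSS
    log_lik = -0.5 * n * (math.log(2 * math.pi) + math.log(rss / n) + 1)
    aic = -2 * log_lik + 2 * k
    bic = -2 * log_lik + k * math.log(n)
    return aic, bic

# Hypothetical comparison: one extra variable barely lowers the RSS
aic_small, bic_small = gaussian_ic(rss=250.0, n=100, k=3)
aic_big, bic_big = gaussian_ic(rss=249.0, n=100, k=4)

print(f"small model: AIC = {aic_small:.2f}, BIC = {bic_small:.2f}")
print(f"big model:   AIC = {aic_big:.2f}, BIC = {bic_big:.2f}")
```

Lower values are preferred. In this made-up case the tiny improvement in fit does not offset the penalty for the extra coefficient, so both criteria favor the smaller model.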

Simultaneity: functional form allows for feedback effects.

What it does: Explanatory variables are correlated with the error term.

Consequence: Violates Zero Conditional Mean

Test: Refer to theory and perform 2SLS with possible I.V.

In Stata: ivregress

Solution: If the variable is needed in order to prevent other violations such as Omitted
Variable Bias, find a good instrument to use in a 2SLS.

In Stata: estat endogenous, estat overid, estat firststage

The true model must be linear in nature.

Data considerations

Sample must be representative of the population.

Independent and identically distributed (iid) sampling allows for this.

Survey data might be collected in non-random sampling ways.

Heteroscedasticity: non-uniform variance

What it does: Violates White Noise Assumption

Consequence: Estimator is no longer efficient

Test: Graph the residuals, White's Test for heteroscedasticity (avoid BP LM-test)

In Stata: whitetst

Solutions: Attempt to find what is causing the heteroscedasticity. It might be one or
more variables.

Identify the source of heteroscedasticity and which variables are involved.

Use theory and visual interpretation or the Goldfeld-Quandt (GQ) test to identify possible
culprits and causes.

Respecify the Model/Transform the Variables

If the source of heteroscedasticity is a scalar multiplier, then transform the
variable to a per unit measure (income per capita). If the variable is not
normally distributed, attempt a standardization (use z-scores). For an outcome
variable or variables with large values, attempt a natural log transformation.

Check whether the sample is homogeneous and if not run separate restricted regressions

Perform a Chow's Test (F-test for the slopes)

Use Weighted Least Squares

Use White's standard errors

In Stata: reg y x, robust

Another possible solution to this problem would be to choose another method
such as multi-level modeling or non-regression analysis.
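White's standard errors can be sketched for the simplest case, a regression of y on one variable plus an intercept, where the robust (HC0) variance of the slope has a compact form. The function name and data are invented for illustration, with the spread of the errors growing with x.

```python
import math

def slope_with_standard_errors(x, y):
    """Simple regression slope with classical and White (HC0) standard errors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    dx = [a - mx for a in x]
    sxx = sum(d * d for d in dx)
    slope = sum(d * (b - my) for d, b in zip(dx, y)) / sxx
    intercept = my - slope * mx
    resid = [b - intercept - slope * a for a, b in zip(x, y)]
    # Classical formula: assumes one common error variance
    se_classical = math.sqrt(sum(e * e for e in resid) / (n - 2) / sxx)
    # White (HC0): each squared residual is weighted by its squared deviation of x
    se_white = math.sqrt(sum((d * e) ** 2 for d, e in zip(dx, resid))) / sxx
    return slope, se_classical, se_white

# Made-up data: deviations from the line y = 2x grow in size with x
x = [float(i) for i in range(1, 11)]
noise = [0.3 * v if i % 2 == 0 else -0.3 * v for i, v in enumerate(x)]
y = [2 * v + e for v, e in zip(x, noise)]

slope, se_classical, se_white = slope_with_standard_errors(x, y)
print(f"slope = {slope:.3f}, classical SE = {se_classical:.4f}, White SE = {se_white:.4f}")
```

The slope estimate is the same under both approaches; only the standard error changes. Statistical packages usually add a finite-sample correction to the robust formula, so their reported values can differ slightly from this HC0 sketch.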


Autocorrelation: the errors are correlated between observations

What it does: Violates White Noise Assumption

Consequence: Estimator is no longer efficient.

Test: White-noise (portmanteau Q) test for autocorrelation or Durbin-Watson (DW)

In Stata: wntestq, estat durbinalt


Solutions: Make sure the model is correctly specified (unit root test for D)

Control for seasonality effects

Generalized Least Squares

In Stata: prais y x, ssesearch
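The Durbin-Watson (DW) statistic mentioned above is simple to compute from time-ordered residuals. The sketch below uses two made-up residual series to show the two extremes; values near 2 suggest no first-order autocorrelation.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of squared changes over sum of squared residuals."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals)))
    den = sum(e * e for e in residuals)
    return num / den

# Made-up residuals that keep the same sign for long runs (positive autocorrelation)
dw_positive = durbin_watson([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
# Made-up residuals that flip sign every period (negative autocorrelation)
dw_negative = durbin_watson([1.0, -1.0, 1.0, -1.0, 1.0, -1.0])

print(f"persistent residuals: DW = {dw_positive:.3f}")   # well below 2
print(f"alternating residuals: DW = {dw_negative:.3f}")  # well above 2
```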

Technical Considerations

Measurement errors may affect the independent variables.


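Multicollinearity: explanatory variables are strongly correlated with each other

What it does: Makes the standard errors unreliable

Consequence: Estimates cannot be reliably used for hypothesis testing

Test: Correlation matrix, VIF and Condition Number test

In Stata: cor, vif, collin x w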


Specify and code variables such that the explanatory variables are linearly
independent and not too strongly correlated.
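With only two explanatory variables, the variance inflation factor (VIF) reduces to 1 / (1 - r^2), where r is their correlation. The sketch below uses made-up data; with more regressors, r^2 would be replaced by the R-squared from regressing each variable on all the others.

```python
import math

def correlation(a, b):
    """Pearson correlation between two equally long lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)

def vif_two_regressors(x, w):
    """VIF when the model has exactly two explanatory variables."""
    r = correlation(x, w)
    return 1.0 / (1.0 - r * r)

x = [1.0, 2.0, 3.0, 4.0, 5.0]
w_collinear = [2.0, 4.0, 6.0, 8.0, 11.0]     # almost exactly 2 * x
w_unrelated = [1.0, -1.0, 0.0, -1.0, 1.0]    # uncorrelated with x by construction

vif_high = vif_two_regressors(x, w_collinear)
vif_low = vif_two_regressors(x, w_unrelated)
print(f"nearly collinear pair: VIF = {vif_high:.1f}")
print(f"uncorrelated pair:     VIF = {vif_low:.1f}")
```

A common rule of thumb treats VIF values above roughly 10 as a sign that collinearity is inflating the standard errors, though the cutoff is a convention and not a formal test.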

Sample size

Enough observations.

Do not overdo it with a large number of unnecessary observations

Variation in variables

Variables should have some variation across the sample

8. Generalized Least Squares

The idea is to transform the model so that the variance-covariance matrix of a regression with
non-spherical errors becomes equivalent to one with White Noise.
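The transformation can be written out as a short derivation. The notation here is an assumption for illustration: $\Omega$ denotes the known structure of the error variance-covariance matrix and P is any invertible matrix satisfying $P'P = \Omega^{-1}$.

```latex
% Model with non-spherical errors
y = X\beta + u, \qquad \operatorname{Var}(u \mid X) = \sigma^2 \Omega

% Choose P such that P'P = \Omega^{-1} and transform the model
Py = PX\beta + Pu, \qquad \operatorname{Var}(Pu \mid X) = \sigma^2 P \Omega P' = \sigma^2 I

% OLS on the transformed model gives the GLS estimator
\hat{\beta}_{GLS} = (X'\Omega^{-1}X)^{-1} X'\Omega^{-1} y
```

In practice $\Omega$ is rarely known and has to be estimated first, which leads to feasible GLS procedures.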
9. Latent Outcome Variable
When the outcome variable is unobservable, a mathematical model with an equation can
be used to describe how explanatory variables push the unobservable value until a threshold is met
and an outcome is observable. Two common ways to estimate latent variable models are probit and
logit. The results for both are usually consistent, but each one imposes its own underlying
distribution on the data (normal for probit, logistic for logit). Other methods are Cox proportional
hazards and the like, which allow for panel analysis.
Logistic regressions have the advantage that the exponentiated coefficients can be interpreted as
odds ratios, as opposed to the marginal effects usually reported with the probit. The likelihood
functions are solved not through an analytical solution, but through numerical methods.
While there are no formal tests as there are for OLS, it can be useful to run reg on the
model to see whether there might be some indications of complications while using the latent
variable methods.
In Stata some useful commands are: probit, mfx, logistic or logit, or/rr.
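The odds ratio interpretation of logit coefficients can be checked with a few lines of arithmetic. This sketch assumes a logit model with one explanatory variable; the coefficient values are hypothetical.

```python
import math

def logistic(z):
    """Logistic CDF: maps the latent index to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def odds(p):
    """Odds corresponding to a probability p."""
    return p / (1.0 - p)

beta0, beta1 = -1.0, 0.8  # hypothetical logit coefficients

# Predicted probabilities at x = 0 and x = 1
p_at_0 = logistic(beta0 + beta1 * 0.0)
p_at_1 = logistic(beta0 + beta1 * 1.0)

# A one-unit increase in x multiplies the odds by exp(beta1)
odds_ratio = math.exp(beta1)
ratio_of_odds = odds(p_at_1) / odds(p_at_0)

# The marginal effect on the probability depends on where it is evaluated
marginal_effect_at_0 = beta1 * p_at_0 * (1.0 - p_at_0)

print(f"exp(beta1) = {odds_ratio:.4f}, ratio of odds = {ratio_of_odds:.4f}")
print(f"marginal effect at x = 0: {marginal_effect_at_0:.4f}")
```

The odds ratio is the same at every value of x, while the marginal effect on the probability changes with x, which is why the two summaries answer different questions.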