Basic econometrics in a nutshell.
Useful for an introductory course at the graduate level.

1. Research Question

A research question must be an interesting puzzle that can be understood through some theory and that admits falsifiable hypotheses to explain the phenomenon.

2. What kind of data/methodology will allow you to answer it?

Even with perfect data, or with the ability to observe and accurately measure all the data for the whole population, statistical analysis has some limitations. Some limitations come from assumptions inherent in the processes, while others are a consequence of noise in reality. Some tools to analyze the effects of certain factors are descriptive or comparative statistics, modeling, experimental and in vitro analysis, regression analysis, agent-based modeling, spatial analysis, etc. These methods are the quantitative part of possible analysis, as opposed to qualitative analysis.

Regression analysis allows for hypothesis testing, i.e., testing an assumption against empirical data and providing an indicator, such as a p-value, of how compatible the observed outcomes are with that assumption. The p-value is the probability of observing outcomes at least as extreme as those in the data if the null hypothesis were true; it is not the probability that the hypothesis itself is true. Given the nature of the analysis and the importance of certainty, the researcher chooses a significance level in advance and rejects the null hypothesis if the p-value provided by the data falls below that threshold. Lastly, consideration is given to practical significance, which calls for examining the effect size in order to judge whether the variation in the outcome predicted by the factor is relevant within the context. Relevance can be judged by the possible relative deviation in the outcome given how manipulable the explanatory variable is.
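As a minimal sketch of the decision rule described above, the following Python snippet computes a two-sided p-value for a coefficient under a standard normal approximation and compares it with a chosen significance level. The estimate, standard error, and threshold are made-up numbers for illustration, and the function name `two_sided_p_value` is hypothetical.

```python
import math

def two_sided_p_value(estimate, std_error, null_value=0.0):
    """Two-sided p-value under a standard normal approximation."""
    z = (estimate - null_value) / std_error
    # P(|Z| >= |z|) for a standard normal Z equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))

alpha = 0.05       # significance level chosen by the researcher (assumption)
estimate = 0.8     # hypothetical coefficient estimate
std_error = 0.3    # hypothetical standard error

p = two_sided_p_value(estimate, std_error)
reject_null = p < alpha  # reject only when the p-value is below the threshold
print(f"z = {estimate / std_error:.3f}, p-value = {p:.4f}, reject H0: {reject_null}")
```

Statistical significance is only the first step: an estimate of 0.8 may or may not matter in practice, depending on the units of the outcome and how far the explanatory variable can realistically be moved.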

3. Regression analysis (Standard Econometrics)

Regression analysis is a set of tools that allows for hypothesis testing through comparative statics. The basic idea is to use an economic model or a set of mathematical equations as a structure to which the data are fitted, and to uncover estimates of the parameters that best solve the problem. These estimates represent the values that best fit the data given the specification and the estimator employed. Together with their associated certainty and effect-size considerations, the estimates can be used for hypothesis testing, prediction, and other purposes such as strategy design. Generally, these applications come from analysis at mean values and ceteris paribus, i.e., where we would observe the outcome most of the time and with everything else held constant.

Theory and empirical analysis are complements, as it is borderline absurd to do one without the other. Theory is the underlying description of how something works. Any strategy or policy that is based on science follows some understanding that comes from theory. Likewise, until it has been supplemented and revised through empirical analysis, theory carries high uncertainty and should not be applied unless necessary. As always in science, theory is not to be taken as true; its worth rests on its practical usefulness and on having resisted attempts to disprove it. Theory that has not been tested lacks credentials regardless of its practical usefulness.

4. Theory

A theory must be a proposed explanation of a phenomenon that is observed or believed to be possible. It must be of practical usefulness and it must be falsifiable. Good theories must also work within a domain that is set by working assumptions. The mechanisms or processes that make up the theory must also be consistent with logic or observable behavior. Lastly, the best theories also allow the theoretical realm to be expanded into a model that can be used to better understand the workings of the phenomenon and that operationalizes its arguments with measurable indicators.

Model Specification for Empirical Analysis

A good model operationalizes the theory in ways that allow its validity to be tested through empirical analysis. In the case of regression analysis, that includes protecting the economic model from omitted variables. A good economic model must include all factors that are relevant according to the theory, including an error term if the model is stochastic rather than deterministic. The equation or series of equations used in the economic model must also have the correct functional form, for example, a linear regression with a single outcome variable as opposed to a non-linear regression model.

5. Data and Methodology

The data and methodology depend on your economic model, but are also constrained by what data are attainable and feasible to collect. The methodology employed will most likely be a function of the data, which in turn are influenced by the research question, the theory, and the chosen model. In practice, the ultimate decision will depend on all of these factors, and not necessarily in the order proposed. Two aspects that largely influence the methodology are the previous approaches to the issue in the literature and the quality and nature of the data.

6. Least Squares Method (Most used linear regression approach)

The Least Squares method is a technique in which equations are modeled with a single outcome variable on the left side and explanatory variables with their parameters on the right side. The most common form is a multiple linear regression with an intercept coefficient,

$$y = \beta_0 + \text{V.I.}\,\beta_1 + X\beta_2 + u$$ (Equation 1)

where $y$ is the outcome variable (which by theory is a dependent variable), $\beta_0$ is the intercept, i.e., the fitted value of the outcome when all explanatory variables take on the value of zero, $\beta_1$ is the sub-vector containing the coefficients for the variables of interest, which are denoted V.I., $\beta_2$ is the remaining sub-vector, which includes the coefficients for the remaining explanatory variables denoted $X$, and lastly, $u$ is the stochastic error term, which is normally distributed by assumption.

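Once the model is estimated, the theoretical values become data-estimated parameters,

$$y = \hat{\beta}_0 + \text{V.I.}\,\hat{\beta}_1 + X\hat{\beta}_2 + \hat{u}$$ (Equation 2)

where hats denote estimates and $\hat{u}$ is the residual.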

OLS is the most popular regression analysis method given its characteristics:

a) Computationally feasible

The parameters are estimated using only calculus and algebra. Modern technology and matrix methods allow statistical software to compute these estimates easily and promptly using vast amounts of data.

b) Applicable to many linear models

As long as the model has a single observable outcome variable in each equation and is linear in the parameters, OLS can be employed.

Examples of Linear Right-Side Parameters
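A few illustrative specifications that remain linear in the parameters, even though some are non-linear in the variables. These are chosen here as assumptions for illustration and are not taken from the source; the last line shows a contrasting case that is not linear in the parameters.

```latex
\begin{align*}
y &= \beta_0 + \beta_1 x + \beta_2 x^2 + u && \text{(quadratic in } x\text{)} \\
y &= \beta_0 + \beta_1 \ln(x) + u && \text{(logarithm of } x\text{)} \\
y &= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + u && \text{(interaction term)} \\
y &= \beta_0 + x^{\beta_1} + u && \text{(not linear in } \beta_1\text{)}
\end{align*}
```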

However, the full potential of the OLS method is realized when the Gauss-Markov Assumptions are met: by the Gauss-Markov Theorem, the estimators can then be proved to be unbiased and efficient among linear unbiased estimators, i.e., the Best Linear Unbiased Estimators (BLUE). The five Gauss-Markov Assumptions are:

a) Linearity in the functional form

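b) Spherical Errors (or White Noise)

i. Homoscedasticity: the error term has uniform variance that does not depend on the explanatory variables: $\mathrm{Var}(u_i \mid X) = \sigma^2$ for all $i$.

ii. Nonautocorrelation: the errors are uncorrelated between observations: $\mathrm{Cov}(u_i, u_j \mid X) = 0$ for $i \neq j$.

c) Exogeneity (zero conditional mean): $E[u \mid X] = 0$

It follows that the distribution of the error term has mean zero ($E[u] = 0$) and the explanatory variables are not correlated with the error term ($\mathrm{Cov}(X, u) = 0$).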

d) No Perfect Collinearity

The explanatory variables are linearly independent.

X, which denotes the n x k design matrix (explanatory variables) for n observations and k variables, has full column rank.

e) No simultaneity

The outcome variable is determined by the right-side variables, and these are not influenced by the outcome variable.

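Given these assumptions, the derived estimator is

$$\hat{\beta} = (X'X)^{-1}X'y$$

where $X$ is the design matrix and $\hat{\beta}$ stacks all the estimated coefficients.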
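The closed-form OLS estimator, the solution of the normal equations (X'X) b = X'y, can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than how statistical software implements it: the helper names (`transpose`, `matmul`, `solve`, `ols`) are hypothetical, and the data are generated without noise so that the recovered coefficients are easy to check.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("X'X is singular: check for perfect collinearity")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def ols(X, y):
    """Return beta_hat solving the normal equations (X'X) beta = X'y."""
    Xt = transpose(X)
    XtX = matmul(Xt, X)
    Xty = [sum(x * yi for x, yi in zip(col, y)) for col in Xt]
    return solve(XtX, Xty)

# Hypothetical data generated without noise from y = 1 + 2*x1 - 3*x2
x1 = [0.0, 1.0, 2.0, 3.0, 4.0]
x2 = [1.0, 0.0, 2.0, 1.0, 3.0]
y = [1 + 2 * a - 3 * b for a, b in zip(x1, x2)]
X = [[1.0, a, b] for a, b in zip(x1, x2)]  # first column is the intercept

beta_hat = ols(X, y)
print([round(b, 6) for b in beta_hat])
```

Because the example has no error term, the estimates should match the generating coefficients up to floating-point error. The singularity check in `solve` is where a violation of the no-perfect-collinearity assumption would surface.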

However, there are technical considerations as well as violations of the assumptions in practice.

Omitted variable bias: the model does not control for a variable that influences the outcome variable.

Solution: Refer to the theory and test possible variables to verify the robustness of the results. If the results change, there is evidence that the model suffers from omitted variable bias, and the variable should be included in order to fix it.

Irrelevant variables: the model controls for a variable that does not influence the outcome variable.

What it does: Potentially contaminates the results or inflates the explanatory power of the model.

Consequence: The model no longer represents the intended analysis, and the estimates can be altered as the model attempts to fit the data to noise.

Test: Use model comparisons and calculate AIC / BIC measures to judge the added value of controlling for a certain variable.

In Stata: estat ic; it is also useful to save the estimates and present them in tables.
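The AIC / BIC comparison can be sketched as follows, assuming a linear model with normal errors and the standard definitions AIC = -2 lnL + 2k and BIC = -2 lnL + k ln(n). The residual sums of squares and parameter counts below are made-up numbers, and k here counts estimated coefficients including the intercept (conventions differ on whether the error variance is also counted).

```python
import math

def gaussian_log_likelihood(rss, n):
    """Maximized log-likelihood of a linear model with normal errors."""
    return -0.5 * n * (math.log(2.0 * math.pi) + math.log(rss / n) + 1.0)

def aic(rss, n, k):
    """Akaike criterion: -2 lnL + 2k, with k estimated coefficients."""
    return -2.0 * gaussian_log_likelihood(rss, n) + 2.0 * k

def bic(rss, n, k):
    """Bayesian criterion: -2 lnL + k ln(n)."""
    return -2.0 * gaussian_log_likelihood(rss, n) + k * math.log(n)

n = 100  # hypothetical number of observations
models = {
    "baseline": {"rss": 250.0, "k": 3},
    "plus_weak_control": {"rss": 248.0, "k": 4},
    "plus_strong_control": {"rss": 200.0, "k": 4},
}
scores = {
    name: (aic(m["rss"], n, m["k"]), bic(m["rss"], n, m["k"]))
    for name, m in models.items()
}
for name, (a, b) in scores.items():
    print(f"{name:20s} AIC = {a:8.2f}  BIC = {b:8.2f}")
```

Lower values are preferred. In this made-up example the weak control barely reduces the residual sum of squares, so the penalty for the extra parameter outweighs the gain, while the strong control improves both criteria.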

Endogeneity

What it does: Explanatory variables are correlated with the error term.

In Stata: ivregress

Variable Bias, find a good instrument to use in a 2SLS.

Data considerations

Test: Graph the residuals and run White's test for heteroscedasticity (avoid the Breusch-Pagan LM test).

In Stata: whitetst

more variables.

Use theory and visual inspection or the Goldfeld-Quandt (GQ) test to identify possible culprits and causes.

Convert the variable to a per-unit measure (income per capita). If the variable is not normally distributed, attempt a standardization (use z-scores). For an outcome variable or variables with large values, attempt a natural log transformation (ln[population]).

Check whether the sample is homogeneous and, if not, run separate restricted regressions.

such as multi-level modeling or non-regression analysis.
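The variable transformations mentioned above (per-unit measures, z-scores, and natural logs) can be sketched with the Python standard library. All numbers and variable names are hypothetical.

```python
import math
import statistics

# Hypothetical regional data
population = [1_200.0, 5_400.0, 98_000.0, 1_500_000.0]
total_income = [30_000.0, 162_000.0, 3_430_000.0, 60_000_000.0]

# Per-unit measure: income per capita
income_per_capita = [inc / pop for inc, pop in zip(total_income, population)]

# Natural log transformation for a variable with large values
ln_population = [math.log(pop) for pop in population]

# Standardization: subtract the mean and divide by the sample standard deviation
mean = statistics.mean(income_per_capita)
sd = statistics.stdev(income_per_capita)
z_scores = [(v - mean) / sd for v in income_per_capita]

print(income_per_capita)
print([round(v, 3) for v in ln_population])
print([round(z, 3) for z in z_scores])
```

Note that z-scores only recenter and rescale a variable; they do not change the shape of its distribution, whereas the log transformation compresses large values.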

Autocorrelation

Solutions

Technical Considerations

Multicollinearity

Solve:

Specify and code variables such that the explanatory variables are linearly

independent and not too strongly correlated.

Sample size

Enough observations.

Variation on variables
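For the multicollinearity consideration listed above, one simple first check is to inspect pairwise correlations among the explanatory variables. The sketch below uses made-up data, and the cutoff of 0.9 is an arbitrary assumption for illustration; pairwise correlations also cannot detect collinearity that involves three or more variables jointly.

```python
import math

def pearson_correlation(a, b):
    """Sample Pearson correlation between two equally long lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    return cov / math.sqrt(var_a * var_b)

# Hypothetical explanatory variables
variables = {
    "income": [20.0, 25.0, 31.0, 38.0, 44.0, 52.0],
    "education": [10.0, 12.0, 12.0, 14.0, 16.0, 18.0],
    "age": [50.0, 23.0, 41.0, 35.0, 62.0, 29.0],
}

threshold = 0.9  # arbitrary cutoff chosen for illustration
names = list(variables)
flagged = []
for i, first in enumerate(names):
    for second in names[i + 1:]:
        r = pearson_correlation(variables[first], variables[second])
        print(f"corr({first}, {second}) = {r:.3f}")
        if abs(r) > threshold:
            flagged.append((first, second))

print("Strongly correlated pairs:", flagged)
```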

The idea is to transform a regression with nonspherical errors so that its variance-covariance matrix becomes that of an equivalent regression with White Noise errors.
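One standard way to write this transformation is generalized least squares (GLS). The source does not name the method, so the notation below is an assumption for illustration: the error covariance is $\sigma^2\Omega$ with $\Omega$ known, symmetric, and positive definite, and $P$ is any factor such that $\Omega = PP'$.

```latex
\begin{align*}
y &= X\beta + u, & \operatorname{Var}(u \mid X) &= \sigma^2 \Omega, \quad \Omega = P P' \\
P^{-1} y &= P^{-1} X \beta + P^{-1} u, & \operatorname{Var}(P^{-1} u \mid X) &= \sigma^2 P^{-1} \Omega (P^{-1})' = \sigma^2 I \\
\hat{\beta}_{GLS} &= (X' \Omega^{-1} X)^{-1} X' \Omega^{-1} y
\end{align*}
```

The transformed errors are spherical, so applying OLS to the transformed data yields the GLS estimator in the last line. In practice $\Omega$ is rarely known and has to be estimated.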

9. Latent Outcome Variable

When the outcome variable is unobservable, a mathematical model can be used to describe how the explanatory variables push the unobservable value until a threshold is met and an outcome is observed. Two common ways to estimate latent variable models are probit and logit. The results for both are usually consistent, but each one imposes its own underlying distribution on the data (normal for probit, logistic for logit). Other methods include the Cox proportional hazards model and the like, which allow for panel analysis. Logistic regressions have the advantage that the exponentiated coefficients can be interpreted as odds ratios, as opposed to the marginal effects typically reported with the probit. The likelihood functions are maximized numerically rather than through an analytical solution.
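To illustrate what solving the likelihood numerically means, here is a minimal Newton-Raphson sketch for a logit with an intercept and one explanatory variable. The function name `fit_logit` and the data are hypothetical; statistical packages use more careful routines with convergence checks, and this sketch would fail on perfectly separated data.

```python
import math

def fit_logit(x, y, iterations=25):
    """Newton-Raphson for P(y=1|x) = 1 / (1 + exp(-(b0 + b1*x)))."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        # Gradient of the log-likelihood and entries of the negative Hessian
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: beta <- beta + H^{-1} g, using the 2x2 inverse
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0 += step0
        b1 += step1
    return b0, b1

# Hypothetical binary data: 1 of 4 successes when x = 0, 3 of 4 when x = 1
x = [0, 0, 0, 0, 1, 1, 1, 1]
y = [0, 0, 0, 1, 0, 1, 1, 1]

b0, b1 = fit_logit(x, y)
odds_ratio = math.exp(b1)
print(f"b0 = {b0:.4f}, b1 = {b1:.4f}, odds ratio = {odds_ratio:.3f}")
```

With a single binary regressor the estimates can be checked by hand: the odds of success are 1/3 when x = 0 and 3 when x = 1, so the odds ratio should come out close to 9 and the slope close to ln(9).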

While there are no formal tests like those available for OLS, it can be useful to run reg on the model to see whether there are indications of complications before using the latent variable methods.

In Stata some useful commands are: probit, mfx, logistic or logit, or/rr.

