Advanced Econometrics
• A quick round-up of basic econometrics • Econometric Modelling
Regression analysis • Theory specifies the functional relationship • Measurement of relationship uses regression analysis to arrive at values of a and b.
Y = a + bX + e
• Components: dependent & independent variables, intercept (a), coefficient (b), error term
• Regression may be simple or multivariate according to the number of independent variables
• Model Specification: relationship between dependent and independent variables
– scatter plot – specify function that best fits the scatter
• Sufficient Data for estimation
– cross sectional – time series – panel
Some Important terminology
– Least Squares Regression: Y = a + bX + e
– Estimation
  • point estimates
  • interval estimates
– t-statistic, R-square or Coefficient of Determination, F-statistic
Estimation – OLS

• We have a set of datapoints and want to fit a line to the data.
• The most “efficient” estimator can be shown to be OLS, Ordinary Least Squares.
• OLS minimizes the squared distance between the line and the actual data points.

[Figure: scatter plot of the datapoints with a fitted line; y-axis 0–18, x-axis 0–50]
• How do we estimate a and b in the linear equation?
• The OLS estimator solves:

  Min over a, b of ∑i [Yi − a − bXi]²

• This minimization problem can be solved using calculus.
• The result is the OLS estimators of a and b.
Regression Analysis – OLS

• The basic equation:  Yi = a + b·Xi + εi

• OLS estimator of b:  b̂ = ∑(Xi − X̄)(Yi − Ȳ) / ∑(Xi − X̄)²

• OLS estimator of a:  â = Ȳ − b̂·X̄

Here a hat denotes an estimator and a bar a sample mean.
Regression Analysis

production period   Demand (Y)   Price (X)
1                   410          1
2                   370          5
3                   240          8
4                   230          10
5                   160          15
6                   150          23
7                   100          25

              Coefficients
Intercept     384.98
Q (X)         −11.89

These are the estimated coefficients for the data above.
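The slope and intercept formulas can be checked by hand; a minimal Python sketch applying the OLS formulas to the demand (Y) and price (X) observations from the table above:

```python
# Hand-rolled OLS for the simple regression Y = a + bX + e,
# using the demand (Y) and price (X) data from the table above.

X = [1, 5, 8, 10, 15, 23, 25]
Y = [410, 370, 240, 230, 160, 150, 100]

n = len(X)
x_bar = sum(X) / n
y_bar = sum(Y) / n

# OLS slope: b = sum((Xi - Xbar)(Yi - Ybar)) / sum((Xi - Xbar)^2)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))
s_xx = sum((x - x_bar) ** 2 for x in X)
b_hat = s_xy / s_xx

# OLS intercept: a = Ybar - b * Xbar
a_hat = y_bar - b_hat * x_bar

print(a_hat, b_hat)  # a is about 384.98, b about -11.9
```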
Regression Analysis – Inference

  R² = ∑(Ŷi − Ȳ)² / ∑(Yi − Ȳ)²

  s²b̂ = ∑(Yi − Ŷi)² / [(n − k)·∑(Xi − X̄)²]

Here the R-squared is a measure of the goodness of fit of our model, while the standard deviation of b̂ gives us a measure of confidence for our estimate of b.
Regression Analysis – Confidence

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.976786811
R Square             0.954112475
Adjusted R Square    0.94493497
Standard Error       27.08645377
Observations         7

ANOVA
            df    SS          MS         F        Significance F
Regression   1    76274.48    76274.48   103.96   0.000155729
Residual     5    3668.196    733.64
Total        6    79942.676

(The output also reports Coefficients, Standard Errors, t Stats and P-values for the Intercept and Q (X).)

These are “goodness of fit” measures reported by Excel for our example data.
Regression Analysis – Hypothesis testing

• Hypothesis formulation
• Test:
  – Confidence interval method: construct an interval around the estimated b at the desired level of confidence using the SE of b, then check whether the hypothesized b falls within it. If it does, accept the null hypothesis.
  – Test of significance method: estimate the t-value of b and compare it with the table t-value. If the former is less than the latter, accept the null hypothesis.
Regression Analysis – Hypothesis testing

The t-ratio:  t = b̂ / Sb̂

Combined with critical values from a “Student's t” distribution, this ratio tells us how confident we are that a value is significantly different from zero.

(The Excel SUMMARY OUTPUT shown on the previous slide applies here as well.)
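The test-of-significance method can be sketched on the demand–price data from the earlier table; the two-tailed 5% critical value t(0.025, df = 5) = 2.571 is taken from a standard t table:

```python
# Test-of-significance method for the slope b, using the
# demand (Y) and price (X) data from the earlier table.

X = [1, 5, 8, 10, 15, 23, 25]
Y = [410, 370, 240, 230, 160, 150, 100]

n = len(X)
x_bar, y_bar = sum(X) / n, sum(Y) / n
s_xx = sum((x - x_bar) ** 2 for x in X)
b_hat = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / s_xx
a_hat = y_bar - b_hat * x_bar

# Residual variance with n - k degrees of freedom (k = 2 parameters)
residuals = [y - (a_hat + b_hat * x) for x, y in zip(X, Y)]
s2 = sum(e ** 2 for e in residuals) / (n - 2)

# Standard error of b and its t-ratio
se_b = (s2 / s_xx) ** 0.5
t_b = b_hat / se_b

t_crit = 2.571  # two-tailed 5% critical value from a t table, df = 5
print(abs(t_b) > t_crit)  # True: reject H0 that b = 0
```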
Analysis of Variance: F ratio

• The F ratio tests the overall significance of the regression:

  F = [Explained variation / (k − 1)] / [Unexplained variation / (n − k)]
    = [R² / (k − 1)] / [(1 − R²) / (n − k)]

• It tests the marginal contribution of a new variable:

  F = [(ESSnew − ESSold) / no. of new X] / [RSSnew / (n − no. of X in new model)]

• It tests for structural change in the data:

  F = [(RSS_R − RSS_UR) / k] / [RSS_UR / (n1 + n2 − 2k)]
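The overall-significance form of the F ratio needs only R², k and n. A quick sketch using the R Square and sample size reported in the Excel output earlier (n = 7, k = 2 estimated parameters):

```python
# Overall significance of the regression from R^2 alone:
# F = [R^2/(k-1)] / [(1-R^2)/(n-k)]
r2 = 0.954112475  # R Square from the Excel output above
n, k = 7, 2       # 7 observations, 2 estimated parameters (a and b)

F = (r2 / (k - 1)) / ((1 - r2) / (n - k))
print(round(F, 2))  # about 103.96, matching the ANOVA F
```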
The model in matrix form, y = Xβ + u:

  | Y1 |   | 1  X21 … Xk1 | | β1 |   | u1 |
  | Y2 | = | 1  X22 … Xk2 | | β2 | + | u2 |
  | ⋮  |   | ⋮           | | ⋮  |   | ⋮  |
  | Yn |   | 1  X2n … Xkn | | βk |   | un |

The OLS estimator:  β̂ = (X′X)⁻¹X′Y
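The matrix formula β̂ = (X′X)⁻¹X′Y can be sketched without any linear-algebra library for the simple-regression case (a constant column plus one regressor), reusing the demand–price data; it reproduces the same a and b as the scalar formulas:

```python
# beta_hat = (X'X)^(-1) X'y for a 2-column design matrix,
# using the demand (Y) and price (X) data from the earlier table.

x = [1, 5, 8, 10, 15, 23, 25]
y = [410, 370, 240, 230, 160, 150, 100]
X = [[1.0, xi] for xi in x]  # n x 2 design matrix: constant and price

# X'X (2x2) and X'y (2x1)
xtx = [[sum(r[i] * r[j] for r in X) for j in range(2)] for i in range(2)]
xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(2)]

# Invert the 2x2 matrix and multiply: beta = (X'X)^(-1) X'y
det = xtx[0][0] * xtx[1][1] - xtx[0][1] * xtx[1][0]
inv = [[xtx[1][1] / det, -xtx[0][1] / det],
       [-xtx[1][0] / det, xtx[0][0] / det]]
beta = [sum(inv[i][j] * xty[j] for j in range(2)) for i in range(2)]

print(beta)  # same intercept and slope as the scalar OLS formulas
```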
Assumptions of OLS regression
– Model is correctly specified & is linear in parameters
– X values are fixed in repeated sampling and Y values are continuous & stochastic
– Each ui is normally distributed with mean E(ui) = 0
– Equal variance of ui (homoscedasticity)
– No autocorrelation: no correlation between ui and uj
– Zero covariance between Xi and ui
– No multicollinearity in multivariate regression: Cov(Xi, Xj) = 0
– Under the assumptions of the CNLRM, estimates are BLUE
Regression Analysis: Some problems

• Autocorrelation: covariance between error terms
  – Identification: DW d test, which ranges 0–4 (near 2 indicates no autocorrelation)
  – Consequences: R² is overestimated; t and F tests are misleading
  – Remedies: correctly specify any missed variable; consider an AR scheme

• Heteroscedasticity: non-constant variance of the error terms
  – Detection: scatter plot of error terms, Park test, Goldfeld–Quandt test, White test, etc.
  – Consequences: t and F tests are misleading
  – Remedial measures include transformation of variables through WLS

• Multicollinearity: covariance between the various X variables
  – Detection: high R² but insignificant t tests; high pair-wise correlation between explanatory variables
  – Consequences: t and F tests are misleading
  – Remedies: remove model over-specification, use pooled data, transform variables
Sources of misspecification
• Omission of relevant variables
• Inclusion of unnecessary variables
• Wrong functional form
• Errors of measurement
• Incorrect specification of the stochastic error term
Model Specification Errors: Omitting Relevant Variables and Including Irrelevant Variables
• To properly estimate a regression model, we need to have specified the correct model
• A typical specification error occurs when the estimated model does not include the correct set of explanatory variables
• This specification error takes two forms:
  – Omitting one or more relevant explanatory variables
  – Including one or more irrelevant explanatory variables
• Either form of specification error results in problems with OLS estimates
Model Specification Errors: Omitting Relevant Variables

• Example: two-factor model of stock returns
• Suppose that the true model that explains a particular stock's returns is given by a two-factor model with the growth of GDP and the inflation rate as factors:

  rt = β0 + β1·GDPt + β2·INFt + εt

• Suppose instead that we estimated the following model:

  rt = β0 + β1·GDPt + εt*

• The error term of this model is actually equal to:

  εt* = β2·INFt + εt

• If there is any correlation between the omitted variable (INF) and the explanatory variable (GDP), then there is a violation of the classical assumption Cov(ui, Xi) = 0
Model Specification Errors: Omitting Relevant Variables

• This means that the explanatory variable and the error term are correlated
• If that is the case, the OLS estimate of β1 (the coefficient of GDP) will be biased
• As in the above example, it is highly likely that there will be some correlation between two financial (or economic) variables
• If, however, the correlation is low, or the true coefficient of the omitted variable is zero, then the specification error is very small

• When Cov(X1, X2) ≠ 0: estimates of both the constant & the slope are biased, and the bias continues even with a larger sample
• When Cov(X1, X2) = 0: the slope is unbiased but the constant is biased
• The variance of the error is incorrectly estimated, and the variance of the slope is biased
• This leads to misleading conclusions from confidence-interval and hypothesis-testing procedures regarding the statistical significance of the estimated parameters
• Forecasts based on the mis-specified model will therefore be unreliable
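The bias when Cov(X1, X2) ≠ 0 is easy to see in a simulation. A sketch with made-up coefficients (true model y = 1 + 2·x1 + 3·x2 + ε, with x2 correlated with x1), where the short regression's slope is pushed away from the true β1 = 2 by β2·Cov(x1, x2)/Var(x1) = 3 × 0.5 = 1.5:

```python
# Simulation of omitted-variable bias (illustrative, hypothetical
# coefficients): regressing y on x1 alone while x2 is omitted.
import random

random.seed(42)
n = 20000
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [0.5 * a + random.gauss(0, 1) for a in x1]  # correlated with x1
y = [1 + 2 * a + 3 * b + random.gauss(0, 1) for a, b in zip(x1, x2)]

# Short regression of y on x1 only
x_bar = sum(x1) / n
y_bar = sum(y) / n
b_short = (sum((a - x_bar) * (c - y_bar) for a, c in zip(x1, y))
           / sum((a - x_bar) ** 2 for a in x1))

print(round(b_short, 2))  # near 3.5, not the true beta1 = 2
```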
Model Specification Errors: Omitting Relevant Variables

• A simple solution is to add the omitted variable back to the model, but the problem with this solution is being able to detect which variable was omitted
• Omitted variable bias is hard to detect, but there could be some obvious indications of this specification error
• The best way to detect omitted variable specification bias is to rely on the theoretical arguments behind the model:
  – Which variables does the theory suggest should be included?
  – What are the expected signs of the coefficients?
  – Have we omitted a variable that most other similar studies include in the estimated model?
• Note, though, that a significant coefficient with an unexpected sign can also occur due to a small sample size
• However, most of the data sets used in empirical finance are large enough that this most likely is not the cause of the specification bias
Model Specification Errors: Including Irrelevant Variables

• Example: Going back to the two-factor model, suppose that we include a third explanatory variable in the model, for example the degree of wage inequality (INEQ)
• So, we estimate the following model:

  rt = β0 + β1·GDPt + β2·INFt + β3·INEQt + εt

• The estimated coefficients (both constant and slope) are unbiased
• The variance of the error term is estimated accurately

Model Specification Errors: Including Irrelevant Variables

• However, the variances of the coefficients are inefficient
• The inclusion of an irrelevant variable (INEQ) in the model increases the standard errors of the estimated coefficients and thus decreases the t-statistics
• This implies that it will be more difficult to reject a null hypothesis that a coefficient of one of the explanatory variables is equal to zero
Model Specification Errors: Including Irrelevant Variables
• Also, the inclusion of an irrelevant variable will usually decrease the adjusted R² (but not the R²)
• An overspecified model is considered a lesser evil than an underspecified model
• But it brings other problems, such as multicollinearity and loss of degrees of freedom
Model Specification Criteria
• To decide whether an explanatory variable belongs in a regression model, we can test whether most of the following conditions hold:
  – The importance of theory: is the decision to include the explanatory variable in the model theoretically sound?
  – t-test: is the variable statistically significant, and does it have the expected coefficient sign?
  – Adjusted R²: does the overall fit of the model improve when we add the explanatory variable?
  – Bias: do the coefficients of the other variables change significantly (sign or statistical significance) when we add the variable to the model?
Problems with Specification Searches

• In an attempt to find the “right” or “desired” model, a researcher may estimate numerous models until an estimated model with the desired properties is obtained
• The wrong approach to model specification is data mining: in this case the researcher would estimate every possible model and choose to report only those that produce desired results
• The researcher should try to minimize the number of estimated models and guide the selection of variables mainly on theory, not purely on statistical fit
Sequential Model Specification Searches

• In an effort to find the appropriate regression model, it is common to begin with a benchmark (or base) specification and then sequentially add or drop variables
• The base specification can rely on theory; variables are then added or dropped based on adjusted R² and t-statistics
• In this effort, it is important to follow the principle of parsimony: try to find the simplest model that best fits the data
• Make use of the F test for the incremental contribution of variables
F test for incremental contribution of variables

• A very useful test in deciding whether a new variable should be retained in the model
• e.g. the return on a stock is a function of GDP and inflation of the country; the question is whether we should include inflation in the model
• Estimate a model without inflation and get R²old; re-estimate including inflation and get R²new

  F = [(R²new − R²old) / no. of new parameters] / [(1 − R²new) / (n − knew)]

• H0: addition of the new variable does not improve the model
• H1: addition of the new variable improves the model
• If the estimated F is higher than the critical F table value, reject the null hypothesis; in the above example this means inflation needs to be included
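The incremental F statistic is a one-liner once the two R² values are in hand. A sketch with hypothetical numbers (R² rising from 0.90 to 0.96 when inflation is added, n = 50 observations, 3 parameters in the new model, 1 new variable):

```python
# Incremental-contribution F test with hypothetical R^2 values.
r2_old, r2_new = 0.90, 0.96
n, k_new, m_new = 50, 3, 1  # m_new = number of new parameters

F = ((r2_new - r2_old) / m_new) / ((1 - r2_new) / (n - k_new))
print(round(F, 1))  # 70.5, far above the 5% critical F(1, 47) of about 4.05
```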
Nominal vs. True level of Significance

• A model derived from data mining should not be assessed at conventional levels of significance (α) such as 1%, 5% or 10%
• To begin with, if there were “c” candidate regressors of which “k” are selected after data mining, the true level of significance (α*) is related to the nominal significance level as:

  α* = 1 − (1 − α)^(c/k),  with α* ≈ (c/k)·α

• If c = 2, k = 1 and α = 5%, then α* ≈ 10%
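The c = 2, k = 1 example above can be checked directly with the exact formula:

```python
# True significance level after data mining:
# alpha* = 1 - (1 - alpha)^(c/k)
c, k, alpha = 2, 1, 0.05

alpha_true = 1 - (1 - alpha) ** (c / k)
print(round(alpha_true, 4))  # 0.0975, i.e. roughly 10%, not 5%
```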
Model Specification: Choosing the Functional Form
• One of the assumptions to derive the nice properties of OLS estimates is that the estimated model is linear • What if the relationship between two variables is not linear? • OLS maintains its nice properties of unbiased and minimum variance estimates if we transform the non-linear relationship into a model that is linear in the coefficients • Interesting case – Double-log (log-log) form
Model Specification: Choosing the Functional Form
Example: A well-known model of nominal exchange rate determination is the Purchasing Power Parity (PPP) model:

  s = P/P*

where s = nominal exchange rate (e.g. rand/$), P = price level in SA, and P* = price level in the US.

Taking natural logs, we can estimate the following model:

  ln(s) = β0 + β1·ln(P) + β2·ln(P*) + εi
Model Specification: Choosing the Functional Form

Property of the double-log model: the estimated coefficients show elasticities between the dependent and explanatory variables.
Example: A 1% change in P will result in a β1% change in the nominal exchange rate (s).

Model Specification: Choosing the Functional Form

• How do we know if we've gotten the right functional form for our model?
  – Check the expected coefficient signs, R², t-stats and the DW d-stat
  – If not satisfactory, examine the error terms
  – Use economic theory to guide you
• We've seen that a linear regression can really fit nonlinear relationships
• Can use logs on the RHS, the LHS or both
• Can use quadratic forms of x's
• Can use interactions of x's
How to choose Functional Form

• Think about the interpretation
• Does it make more sense for x to affect y in percentage (use logs) or absolute terms?
• Does it make more sense for the derivative of x1 to vary with x1 (quadratic) or with x2 (interactions), or to be fixed?
How to choose Functional Form (cont'd)
• We already know how to test joint exclusion restrictions to see if higher-order terms or interactions belong in the model
• It can be tedious to add and test extra terms, plus we may find that a square term matters when really using logs would be even better
• A test of functional form is Ramsey's regression specification error test (RESET)
DW test for model misspecification

• You suspect that a relevant variable Z (which might be a polynomial of an existing X) was omitted from the assumed model
• From the assumed model, obtain the OLS residuals
• Order the residuals according to increasing values of Z
• Compute the d stat from the residuals thus ordered
• If autocorrelation is noticed, then the model is misspecified
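The d statistic itself is just a ratio of sums over the ordered residual series. A sketch with a made-up residual series (the alternating signs are deliberate, to show d being pushed above 2):

```python
# Durbin-Watson d statistic from a residual series ordered by Z:
# d = sum((e_t - e_{t-1})^2) / sum(e_t^2); d near 2 indicates no
# autocorrelation.  The residuals below are made up for illustration.
e = [3.1, -2.4, 1.8, -0.9, 2.2, -1.7, 0.5]

num = sum((e[t] - e[t - 1]) ** 2 for t in range(1, len(e)))
den = sum(et ** 2 for et in e)
d = num / den
print(round(d, 2))  # alternating signs push d above 2
```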
Ramsey's RESET

• Regression Specification Error Test
• Estimate the assumed model and derive ŷ
• Then estimate y = β0 + β1x1 + … + βkxk + δ1ŷ² + δ2ŷ³ + error and test
  H0: δ1 = 0, δ2 = 0

  F = [(R²new − R²old) / no. of new parameters] / [(1 − R²new) / (n − knew)]

• If H0 is rejected, it indicates a mis-specified model
• Advantage: in RESET you don't have to specify the correct alternative model
• Disadvantage: it doesn't help in attaining the right model
Lagrange Multiplier Test for Adding variable

• Y = b0 + b1X1                    (1: restricted)
• Y = b0 + b1X1 + b2X2 + b3X3      (2: unrestricted)
• Obtain the residuals from Eq. 1 and regress them on all the X in Eq. 2, including the ones in Eq. 1:

  ui = a0 + a1X1 + a2X2 + a3X3

• nR² ≈ χ² with df equal to the no. of restrictions
• If the estimated Chi-sq > critical Chi-sq, reject the restricted regression
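The LM statistic is simply n times the auxiliary R². A sketch with hypothetical values (the 5% critical value χ²(2) = 5.991 is from a standard chi-square table):

```python
# LM test for adding variables: n * R^2 from the auxiliary residual
# regression, compared with the chi-square critical value whose df
# is the number of restrictions.  Values below are hypothetical.
n = 60
r2_aux = 0.15  # R^2 from regressing restricted-model residuals on all X
restrictions = 2

lm = n * r2_aux
print(lm > 5.991)  # True: reject the restricted model
```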
Nested vs. Non-nested Models
• Nested: • Y=a+b1X1+b2X2+b3X3+b4X4 • Y=a+b1X1+b2X2 • Specification test and the restricted F test can be used to test for model specification errors • Non-nested: • Y=a+b1X1+b2X2 • Y=c0+c1Z1+c2Z2
Tests for Non-nested Models
• 1) Discrimination approach: simply select better model based on goodness of fit
– R², Adj-R², AIC, SBC

  R² = ESS / TSS
  Adjusted R² = 1 − (1 − R²)·(n − 1)/(n − k)
  AIC = e^(2k/n)·RSS/n
  SIC = n^(k/n)·RSS/n
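The AIC and SIC formulas above can be compared across two candidate models; the (k, RSS) pairs below are made up for illustration, and smaller values are better:

```python
# Comparing two models by AIC = e^(2k/n) * RSS/n and
# SIC = n^(k/n) * RSS/n; smaller is better.  RSS values are made up.
import math

n = 50
models = {"A": (3, 120.0), "B": (5, 110.0)}  # name: (k, RSS)

scores = {}
for name, (k, rss) in models.items():
    aic = math.exp(2 * k / n) * rss / n
    sic = n ** (k / n) * rss / n
    scores[name] = (round(aic, 3), round(sic, 3))

print(scores)
```

With these numbers AIC slightly favours the bigger model B, while the stiffer per-parameter penalty of SIC favours the smaller model A, which is the usual pattern.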
• 2)Discerning approach: make use of information provided by other models as well along with the initial model
Non-nested Discerning Tests

• If the models have the same dependent variable but non-nested x's, we could still just make a giant model with the x's from both and test joint exclusion restrictions that lead to one model or the other:
  Y = a + b1X1 + b2X2
  Y = c0 + c1Z1 + c2Z2
  Y = a + b1X1 + b2X2 + c1Z1 + c2Z2
• Use the F test with both equations as the reference model in turns:

  F = [(R²new − R²old) / no. of new parameters] / [(1 − R²new) / (n − knew)]

Davidson-MacKinnon J test

• An alternative, the Davidson-MacKinnon J test, uses ŷ from one model as a regressor in the second model and tests for significance:
  Y = a + b1X1 + b2X2      (A)
  Y = c0 + c1Z1 + c2Z2     (B)
• Estimate B and obtain ŶB, then estimate Y = a + b1X1 + b2X2 + b3·ŶB
• Use a t-test: if b3 = 0 is not rejected, we accept model A
• Reverse the models and re-do the steps
• This is more difficult if one model uses y and the other uses ln(y); we can follow the same basic logic and transform the predicted ln(y) to get ŷ for the second step
• In any case, the Davidson-MacKinnon test may reject neither or both models rather than clearly preferring one specification
• Sometimes we have the variable we want, but we think it is measured with error
• Examples: a survey asks how many hours you worked over the last year, or how many weeks you used child care when your child was young
• The consequences of measurement error in y are different from those of measurement error in x
Measurement Error: Dependent Variable
• y* is not directly measurable; it is measured with error as y = y* + e
• Thus we are really estimating y = β0 + β1x1 + … + βkxk + (u + e)
• When will OLS produce unbiased results? Only if E(e) = E(u) = 0 and e is uncorrelated with the xj and with u; then β̂ is unbiased
• But β̂ has larger variances than with no measurement error
Measurement Error: Explanatory Variable
• x1* is not directly measurable; it is measured with error as x1 = x1* + e1
• Define the measurement error as e1 = x1 − x1*
• y = β0 + β1(x1 − e1) + u, so we are really estimating y = β0 + β1x1 + (u − β1e1)

Measurement Error: Explanatory Variable

• Assume E(e1) = 0, Cov(ei, ej) = 0 and Cov(ei, ui) = 0
• The effect of measurement error on the OLS estimates depends on our assumption about the correlation between e1 and x1
• If Cov(x1, e1) ≠ 0, the OLS estimates are biased and their variances are larger
• Use proxy or IV variables
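Under the classical errors-in-variables assumptions above, the bias takes the well-known attenuation form: the slope shrinks toward zero by the factor Var(x1*)/(Var(x1*) + Var(e1)). An illustrative simulation with made-up parameters (true β1 = 2, Var(x1*) = Var(e1) = 1, so the slope should converge to 2 × 1/2 = 1):

```python
# Attenuation bias from measurement error in the regressor:
# we observe x = x* + e instead of x*.  Hypothetical parameters.
import random

random.seed(7)
n = 20000
x_star = [random.gauss(0, 1) for _ in range(n)]
y = [1 + 2 * xs + random.gauss(0, 1) for xs in x_star]
x_obs = [xs + random.gauss(0, 1) for xs in x_star]  # measured with error

x_bar = sum(x_obs) / n
y_bar = sum(y) / n
b = (sum((a - x_bar) * (c - y_bar) for a, c in zip(x_obs, y))
     / sum((a - x_bar) ** 2 for a in x_obs))

print(round(b, 2))  # near 1, well below the true slope of 2
```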
• What if the model is mis-specified because no data are available on an important x variable?
• It may be possible to avoid omitted variable bias by using a proxy variable
• A proxy variable must be related to the unobservable variable
• But it must be uncorrelated with the error term (Sargan test)
Lagged Dependent Variables

• What if there are unobserved variables and you can't find reasonable proxy variables?
• It may be possible to include a lagged dependent variable to account for omitted variables that contribute to both past and current levels of y
• Obviously, you must think past and current y are related for this to make sense
Missing Data – Is it a Problem?

• If any observation is missing data on one of the variables in the model, it can't be used
• If data is missing at random, using a sample restricted to observations with no missing values will be fine
• A problem can arise if the data is missing systematically – say, high-income individuals refuse to provide income data

• If the sample is chosen on the basis of an x variable, then estimates are unbiased
• If the sample is chosen on the basis of the y variable, then we have sample selection bias
• Sample selection can be more subtle
• Say we are looking at wages for workers – since people choose to work, this isn't the same as wage offers
Outliers

• Sometimes an individual observation can be very different from the others and can have a large effect on the outcome
• Sometimes this outlier will simply be due to errors in data entry – one reason why looking at summary statistics is important
• Sometimes the observation will just truly be very different from the others

Outliers (cont'd)

• Not unreasonable to fix observations where it's clear there was just an extra zero entered or left off, etc.
• Not unreasonable to drop observations that appear to be extreme outliers, although readers may prefer to see estimates with and without the outliers
• Can use Stata to investigate outliers
Model Selection Criteria
• Be data admissible: Prediction must be realistic • Be consistent with theory • Have weakly exogenous regressors • Exhibit parameter constancy: values and signs must be consistent • Exhibit data coherency: white noise residuals • Be encompassing
Matrix Approach to OLS
• The model: y = Xβ + u, where y is n × 1, X is n × (k+1), β is (k+1) × 1 and u is n × 1
• b̂ = (X′X)⁻¹X′Y

Assumptions:
• E(u) = 0, where u and 0 are n × 1 column vectors, 0 being a null vector
• E(uu′) = σ²I, where I is an n × n identity matrix (homoscedasticity and no autocorrelation)
• The n × k matrix X is non-stochastic
• The rank of X is ρ(X) = k, where k is the number of columns in X and k is less than the number of observations n (no multicollinearity): there is no 1 × k row vector λ ≠ 0 such that λx = 0 for x a k × 1 column vector of X
• The vector u has a multivariate normal distribution: u ~ N(0, σ²I)

Estimators:
• σ̂² = ∑ûi² / (n − k) = û′û / (n − k)
• var-cov(β̂) = σ̂²·(X′X)⁻¹
• ESS = β̂′X′y − nȲ²;  TSS = y′y − nȲ²