Professional Documents
Culture Documents
CH Validation Handout
CH Validation Handout
Overview
Chapter 4
Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 0 / 86 Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 1 / 86
Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 2 / 86 Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 3 / 86
Regression diagnostics Regression diagnostics
Goal: Find points that are not fitted as well as they should be or have Preprocessing:
undue influence on the fitting of the model. Omit incomplete observations (only Wisconsin) using na.omit().
Techniques: Based on deletion of observations, see Belsley, Kuh, and Scale income to be in 10,000 USD.
Welsch (1980).
Illustration: PublicSchools data provide per capita Expenditure Visualization: Scatterplot with fitted linear model and three highlighted
on public schools and per capita Income by state for the 50 states of observations.
the USA plus Washington, DC., for 1979. R> ps <- na.omit(PublicSchools)
R> ps$Income <- ps$Income / 10000
R> data("PublicSchools", package = "sandwich") R> plot(Expenditure ~ Income, data = ps, ylim = c(230, 830))
R> summary(PublicSchools) R> ps_lm <- lm(Expenditure ~ Income, data = ps)
Expenditure Income R> abline(ps_lm)
Min. :259 Min. : 5736 R> id <- c(2, 24, 48)
1st Qu.:315 1st Qu.: 6670 R> text(ps[id, 2:1], rownames(ps)[id], pos = 1, xpd = TRUE)
Median :354 Median : 7597
Mean :373 Mean : 7608
3rd Qu.:426 3rd Qu.: 8286
Max. :821 Max. :10851
NA's :1
Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 4 / 86 Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 5 / 86
Alaska
QQ plot: ordered residuals versus normal quantiles (for checking
700
normality).
p
Scale-location plot: |r̂i | (of standardized residuals ri ) versus
600
Expenditure
●
fitted values ŷi (for checking i.i.d. assumption, in particular
Var(ε|X ) = σ 2 I).
500
●
●
●
● ●
● ● ● ● ●
● ● ●
● Combinations of standardized residuals, leverage, and Cook’s
400
● ●
● ● Washington DC
● ●
● ● ● ● distance.
● ● ●
● ●
● ● ● ●● ● ●●
300
● ●
●
●● ● ● ●●
● ● R> plot(ps_lm, which = 1:6)
Mississippi
By default only four of the six available plots are shown: which =
0.6 0.7 0.8 0.9 1.0 1.1
c(1:3, 5).
Income
Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 6 / 86 Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 7 / 86
Regression diagnostics Regression diagnostics
Alaska ● Alaska ●
4
200
Standardized residuals
3
● Utah
100
Residuals
2
● ● ●
● ●
● ●
● ● ●
●● ●
1
●
● ● ●
● ● ● ●●
● ●●
● ● ● ●●●●
●● ● ●●●●
0
●● ●
● ● ●●
● ● ● ● ●●●
0
● ● ●●
● ● ●●●●
●● ● ●●
● ●●
● ●●●
● ●● ●●●
● ● ●●
−1
●
−100
● ●●
● ●
●
Nevada ● ●
● Nevada
−2
● Washington DC
Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 8 / 86 Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 9 / 86
2.5
2.0
Alaska ●
Alaska
2.0
Standardized residuals
1.5
Washington
● Nevada DC ●
Cook's distance
●
●
1.5
● ●
●
● ●●
1.0
● ●●
● ●● ●
●● ●
1.0
● ●
● ● ● ●
●●
● ●
● ●
● ● ●
● ●
0.5
● ●
● ● ●
0.5
● ● ●
● ● Washington DC
Nevada
●
0.0
0.0
●
Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 10 / 86 Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 11 / 86
Regression diagnostics Regression diagnostics
2.0
Standardized residuals
3
1
Cook's distance
1.5
2
● 0.5
● ●
● 3
●
1
● ●
●
1.0
●
● ●● ● ● ● ●
● ●
● ●● ●
●●
0
●●
● ●
● ● ●●
●
●
● ●
● ●● 2
0.5
●
●
−1
●
●● ●
●
● Washington DC
● Nevada 0.5
−2
● Washington DC 1
●● ●● ●
Nevada
Cook's distance
0.0
●
●
●●●●
● ●
●● ●●●
●●● ●●●●●●
● ● 0
1
Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 12 / 86 Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 13 / 86
Interpretation: Alaska stands out in all plots. Recall: OLS residuals are not independent and do not have the same
Large residual (which = 1). variance.
Upper tail of empirical distribution of residuals (which = 2). More precisely: If Var(ε|X ) = σ 2 I, then Var(ε̂|X ) = σ 2 (I − H ).
Casts doubt on the assumption of homogeneous variances H = X (X > X )−1 X > is the “hat matrix”.
(which = 3). Hat values:
Corresponds to an extraordinarily large Cook’s distance (which = Diagonal elements hii of H.
4 and 6).
Provided by generic function hatvalues().
Has the highest leverage (which = 5 and 6).
Since Var(ε̂i |X ) = σ 2 (1 − hii ), observations with large hii will have
small values of Var(ε̂i |X ), and hence tend to have residuals ε̂i
There are further observations singled out, but none of these are as close to zero.
dominant as Alaska.
hii measures the leverage of observation i.
Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 14 / 86 Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 15 / 86
Leverage and standardized residuals Leverage and standardized residuals
High leverage:
The trace of H is k (the number of regressors).
●
0.20
Alaska
High leverage hii typically means two or three times larger than
average hat value k /n.
0.15
Leverage only depends on X and not on y .
ps_hat
●
Good/bad leverage points: high leverage points with Washington DC
typical/unusual yi .
0.10
●
Visualize hat values with mean and three times the mean: ● ●
0.05
● ● ●
● ● ● ●
● ● ● ●
R> ps_hat <- hatvalues(ps_lm) ● ● ●●
●
● ●●
●
●● ● ● ● ●
R> plot(ps_hat) ● ● ● ●●●
● ● ●●● ●
●● ● ●● ● ●
R> abline(h = c(1, 3) * mean(ps_hat), col = 2)
R> id <- which(ps_hat > 3 * mean(ps_hat)) 0 10 20 30 40 50
R> text(id, ps_hat[id], rownames(ps)[id], pos = 1, xpd = TRUE)
Index
Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 16 / 86 Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 17 / 86
Standardized residuals: Var(ε̂i |X ) = σ 2 (1 − hii ) suggests Idea: Detection of unusual observations via leave-one-out (or deletion)
diagnostics (see Belsley, Kuh, and Welsch 1980).
ε̂i
ri = √ . Notation: Exclude point i and compute
σ̂ 1 − hii
estimates β̂(i ) and σ̂(i ) ,
Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 18 / 86 Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 19 / 86
Deletion diagnostics Deletion diagnostics
Basic quantities: In R:
dffit(), dffits(),
DFFIT i = yi − ŷi ,(i ) ,
dfbeta(), dfbetas(),
DFBETA = β̂ − β̂(i ) ,
covratio(),
det(σ̂(2i ) (X(>i ) X(i ) )−1 )
COVRATIO i = , cooks.distance().
det(σ̂ 2 (X > X )−1 )
(ŷ − ŷ(i ) )> (ŷ − ŷ(i ) ) Additionally: rstudent() provides the (externally) studentized
Di2 = .
k σ̂ 2 residuals
ε̂
Interpretation: ti = √ i
σ̂(i ) 1 − hii .
DFFIT : change in fitted values (scaled version: DFFITS).
DFBETA: changes in coefficients (scaled version: DFBETAS).
COVRATIO: change in covariance matrix.
D 2 (Cook’s distance): reduce information to a single value per
observation.
Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 20 / 86 Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 21 / 86
Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 22 / 86 Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 23 / 86
The function influence.measures()
800
full sample
Alaska
without influential obs.
700
600 Diagnostics and Alternative Methods of Regression
Expenditure
●
Diagnostic Tests
500
●
●
●
● ●
● ● ● ● ● ●
● ● ●
400
● ●
● ● Washington DC
● ●
● ● ● ●
● ● ●
● ●
● ● ● ●● ● ●●
300
● ●
●
●● ● ● ●●
● ●
Mississippi
Income
Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 24 / 86 Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 25 / 86
More formal validation: Diagnostic testing, e.g., for heteroskedasticity Illustration: Reconsider Journals data as an example for
in cross-section regressions or disturbance autocorrelation in time cross-section regressions.
series regressions.
Preprocessing: As before and additionally include age of the journals
In R: Package lmtest provides a large collection of diagnostic tests. (for the year 2000, when the data were collected).
Typically, the tests return an object of class “htest” (hypothesis test) R> data("Journals", package = "AER")
R> journals <- Journals[, c("subs", "price")]
with test statistic, corresponding p value, and additional parameters R> journals$citeprice <- Journals$price/Journals$citations
such as degrees of freedom (where appropriate), the name of the R> journals$age <- 2000 - Journals$foundingyear
tested model, or the method used.
Regression: log-subscriptions explained by log-price per citation.
Background: Underlying theory is provided in Baltagi (2002), Davidson
R> jour_lm <- lm(log(subs) ~ log(citeprice), data = journals)
and MacKinnon (2004), and Greene (2003).
Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 26 / 86 Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 27 / 86
Testing for heteroskedasticity Testing for heteroskedasticity
Assumption of homoskedasticity Var(εi |xi ) = σ 2 must be checked in In R: bptest() from lmtest implements Breusch-Pagan test.
cross-section regressions.
Provides several flavors of the test via specification of a variance
formula.
Breusch-Pagan test:
Provides studentized version (overcomes assumption of normally
Fit linear regression model to the squared residuals ε̂2i .
distributed errors).
Reject if too much of the variance is explained by additional
explanatory variables, e.g., Default: Studentized version for auxiliary regression with original
Original regressors X (as in the main model). regressors X .
Original regressors plus squared terms and interactions.
Example using defaults
Illustration: For jour_lm, variance seems to decreases with the fitted R> bptest(jour_lm)
values, or, equivalently, increases with log(citeprice). studentized Breusch-Pagan test
data: jour_lm
Null distribution: Approximately χ2q , where q is the number of auxiliary BP = 9.8, df = 1, p-value = 0.002
regressors (excluding constant term).
Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 28 / 86 Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 29 / 86
White test uses original regressors as well as their squares and Goldfeld-Quandt test:
interactions in auxiliary regression. Nowadays probably more popular in textbooks than in applied
work.
Can use bptest():
Order sample with respect to the variable explaining the
R> bptest(jour_lm, ~ log(citeprice) + I(log(citeprice)^2),
+ data = journals)
heteroskedasticity (in example: price per citation).
studentized Breusch-Pagan test Split sample and compare mean residual sum of squares before
and after the split point via an F test.
data: jour_lm
BP = 11, df = 2, p-value = 0.004 Problem: A meaningful split point is rarely known in advance.
Modification: Omit some central observations to improve the
power.
Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 30 / 86 Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 31 / 86
Testing for heteroskedasticity Testing the functional form
Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 32 / 86 Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 33 / 86
data: jour_lm In R: raintest() implements both flavours (and some further options
RESET = 1.4, df1 = 2, df2 = 180, p-value = 0.2
for the subsample choice).
Default: Assume that the data are already ordered and use middle
50% as subsample.
Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 34 / 86 Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 35 / 86
Testing the functional form Testing the functional form
Illustration: Assess stability of functional form over age of journals. Harvey-Collier test:
R> raintest(jour_lm, order.by = ~ age, data = journals) Order sample prior to testing.
Rainbow test Compute recursive residuals of the fitted model.
data: jour_lm Recursive residuals are essentially standardized one-step-ahead
Rain = 1.8, df1 = 90, df2 = 88, p-value = 0.004 prediction errors.
If model is correctly specified, recursive residuals have mean zero.
Interpretation: If mean differs from zero, ordering variable has an influence on the
The fit for the 50% “middle-aged” journals is significantly different regression relationship.
from the fit comprising all journals. Use simple t test for testing.
Relationship between the number of subscriptions and the price
per citation also depends on the age of the journal.
As we will see below: Libraries are willing to pay more for
established journals.
Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 36 / 86 Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 37 / 86
R> harvtest(jour_lm, order.by = ~ age, data = journals) Illustration: Reconsider first model for US consumption function.
Harvey-Collier test R> library("dynlm")
R> data("USMacroG", package = "AER")
data: jour_lm R> consump1 <- dynlm(consumption ~ dpi + L(dpi),
HC = 5.1, df = 180, p-value = 9e-07 + data = USMacroG)
Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 38 / 86 Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 39 / 86
Testing for autocorrelation Testing for autocorrelation
Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 40 / 86 Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 41 / 86
Box-Pierce test/Ljung-Box test: Remark: Unlike diagnostic tests in lmtest, function expects a series of
Originally suggested for diagnostic checking of ARIMA models. residuals and not the specification of a linear model as its first
argument.
(Approximate) χ2 statistics based on estimates of the
autocorrelations up to order p.
R> Box.test(residuals(consump1), type = "Ljung-Box")
Box-Pierce statistic: n times the sum of squared autocorrelations.
Box-Ljung test
Ljung-Box refinement: squared autocorrelation at lag j is weighted
by (n + 2)/(n − j ). data: residuals(consump1)
X-squared = 180, df = 1, p-value <2e-16
In R: Box.test() in base R (package stats) implements both Ljung-Box test confirms significant residual autocorrelation.
versions.
Default: Box-Pierce test with p = 1.
Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 42 / 86 Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 43 / 86
Testing for autocorrelation Testing for autocorrelation
Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 44 / 86 Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 45 / 86
Robust Standard Errors and Tests Estimation: OLS typically still consistent.
Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 46 / 86 Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 47 / 86
Robust standard errors and tests Robust standard errors and tests
Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 48 / 86 Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 49 / 86
HC estimators HC estimators
where hii are the hat values, h̄ is their mean, and δi = min{4, hii /h̄}.
Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 50 / 86 Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 51 / 86
HC estimators HC estimators
Illustration: For journals regression Coefficient summary: Regression output typically contains table with
R> vcov(jour_lm) regression coefficients, their standard errors, and associated t statistics
(Intercept) log(citeprice)
and p values.
(Intercept) 3.126e-03 -6.144e-05 summary(jour_lm) computes table along with additional
log(citeprice) -6.144e-05 1.268e-03
information about the model.
R> vcovHC(jour_lm)
coeftest(jour_lm)computes only the table.
(Intercept) log(citeprice)
(Intercept) 0.003085 0.000693 It additionally allows for specification of a vcov argument, either as
log(citeprice) 0.000693 0.001188 a function or directly as a fitted matrix.
Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 52 / 86 Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 53 / 86
HC estimators HC estimators
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.7662 0.0559 85.2 <2e-16
log(citeprice) -0.5331 0.0356 -15.0 <2e-16
Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 54 / 86 Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 55 / 86
HC estimators HC estimators
Estimate Std. Error t value Pr(>|t|) Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.7662 0.0555 85.8 <2e-16 (Intercept) 4.7662 0.0555 85.8 <2e-16
log(citeprice) -0.5331 0.0345 -15.5 <2e-16 log(citeprice) -0.5331 0.0345 -15.5 <2e-16
Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 56 / 86 Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 57 / 86
HC estimators HC estimators
Comparison of the different types of HC estimators: Illustration: Reconsider public schools data to show that using robust
R> t(sapply(c("const", "HC0", "HC1", "HC2", "HC3", "HC4"), covariances can make a big difference.
+ function(x) sqrt(diag(vcovHC(jour_lm, type = x)))))
(Intercept) log(citeprice) Model: Include quadratic term in model. This appears to be significant
const 0.05591 0.03561 due to influential observations (most notably Alaska), but is in fact
HC0 0.05495 0.03377 spurious.
HC1 0.05526 0.03396
HC2 0.05525 0.03412
HC3 0.05555 0.03447 R> ps_lm <- lm(Expenditure ~ Income, data = ps)
HC4 0.05536 0.03459 R> ps_lm2 <- lm(Expenditure ~ Income + I(Income^2), data = ps)
R> anova(ps_lm, ps_lm2)
All estimators lead to almost identical results. In fact, all standard errors Analysis of Variance Table
are slightly smaller than those computed under the assumption of
homoskedastic errors. Model 1: Expenditure ~ Income
Model 2: Expenditure ~ Income + I(Income^2)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 48 181015
2 47 150986 1 30030 9.35 0.0037
Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 58 / 86 Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 59 / 86
HC estimators HAC estimators
waldtest() provides the same type of test as anova() but has a In time series regressions: If the error terms εi are correlated, Ω is
vcov argument. vcov can be either a function or a matrix (computed not diagonal and can only be estimated directly upon introducing further
for the more complex model). assumptions on its structure.
R> waldtest(ps_lm, ps_lm2, vcov = vcovHC(ps_lm2, type = "HC4")) Solution: If form of heteroskedasticity and autocorrelation is unknown,
Wald test valid standard errors and tests may be obtained by estimating X > ΩX
instead.
Model 1: Expenditure ~ Income
Model 2: Expenditure ~ Income + I(Income^2)
Res.Df Df F Pr(>F) Technically: Computing weighted sums of the empirical
1 48 autocorrelations of ε̂i xi .
2 47 1 0.08 0.77
Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 60 / 86 Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 61 / 86
In R: vcovHAC() provides general framework for HAC estimators. Comparison of standard errors: Spherical errors, quadratic spectral
kernel and Bartlett kernel HAC estimators (both using prewhitening).
Convenience interfaces:
R> rbind(SE = sqrt(diag(vcov(consump1))),
NeweyWest() (by default) uses a Bartlett kernel with + QS = sqrt(diag(kernHAC(consump1))),
nonparametric bandwidth selection (Newey and West 1987, 1994). + NW = sqrt(diag(NeweyWest(consump1))))
kernHAC() (by default) uses a quadratic spectral kernel with (Intercept) dpi L(dpi)
parametric bandwidth selection (Andrews 1991, Andrews and SE 14.51 0.2063 0.2075
QS 94.11 0.3893 0.3669
Monahan 1992). NW 100.83 0.4230 0.3989
Both use prewhitening (by default).
Interpretation: Both sets of robust standard errors are rather similar
weave() implements the class of weighted empirical adaptive
(except maybe for the intercept) and much larger than the uncorrected
variance estimators (Lumley and Heagerty 1999).
standard errors.
Illustration: Newey-West and Andrews kernel HAC estimators for These can again be passed to coeftest() or waldtest() (and other
consumption function regression. inference functions).
Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 62 / 86 Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 63 / 86
Resistant regression
Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 64 / 86 Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 65 / 86
LMS: arg min medi ε2i log(Yt /Y0 ) = β1 + β2 log(Y0 ) + β3 log(Kt ) + β4 log(nt + 0.05) + εt
β
q Yt and Y0 are real GDP in periods t and 0.
X
LTS: arg min ε̂2i :n (β) Kt is a measure of the accumulation of physical capital.
β
i =1
nt is population growth (plus 5% for labor productivity growth).
where ε̂i = yi − i x >β
and i : n denotes that the residuals are arranged
in increasing order. Data: OECDGrowth provides these variables for the period 1960–1985
LTS preferable on theoretical grounds. for all OECD countries with a population exceeding 1 million.
In econometrics: not widely used (unfortunately). gdp85 and gdp60 are real GDP (per person of working age).
invest is the average of the annual ratios of real domestic
investment to real GDP.
popgrowth is annual population growth.
Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 66 / 86 Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 67 / 86
Resistant regression Resistant regression
Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 68 / 86 Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 69 / 86
Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 70 / 86 Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 71 / 86
Resistant regression Resistant regression
Scale estimates: lqs() provides two estimates In R: cov.rob() from MASS provides both (default: MVE).
The first is defined via the fit criterion. R> X <- model.matrix(solow_lm)[,-1]
R> Xcv <- cov.rob(X, nsamp = "exact")
The second is based on the variance of those residuals whose R> nohighlev <- which(
absolute value is less than 2.5 times the initial estimate. + sqrt(mahalanobis(X, Xcv$center, Xcv$cov)) <= 2.5)
Second estimate is typically used for scaled residuals.
Details:
Outliers: Observations with “large” scaled residuals (exceeding 2.5 in
Extract model matrix.
absolute values).
Estimate its covariance matrix by MVE.
R> smallresid <- which(
+ abs(residuals(solow_lts)/solow_lts$scale[2]) <= 2.5) Compute the leverage utilizing the mahalanobis() function.
Store observations that are not high-leverage points.
High leverage: For consistency, use robust measure of leverage based
on robust covariance estimator, e.g., minimum-volume ellipsoid (MVE)
or minimum covariance determinant (MCD) estimator.
Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 72 / 86 Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 73 / 86
Good observations: Observations that have at least one of the R> summary(solow_rob)
desired properties, small residual or low leverage. Call:
R> goodobs <- unique(c(smallresid, nohighlev)) lm(formula = log(gdp85/gdp60) ~ log(gdp60) + log(invest) +
log(popgrowth + 0.05), data = OECDGrowth,
subset = goodobs)
Bad observations:
Residuals:
R> rownames(OECDGrowth)[-goodobs] Min 1Q Median 3Q Max
-0.15454 -0.05548 -0.00651 0.03159 0.26773
[1] "Canada" "USA" "Turkey" "Australia"
Coefficients:
Estimate Std. Error t value Pr(>|t|)
Robust OLS: exclude bad leverage points (Intercept) 3.7764 1.2816 2.95 0.0106
R> solow_rob <- update(solow_lm, subset = goodobs) log(gdp60) -0.4507 0.0569 -7.93 1.5e-06
log(invest) 0.7033 0.1906 3.69 0.0024
log(popgrowth + 0.05) -0.6504 0.4190 -1.55 0.1429
Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 74 / 86 Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 75 / 86
Resistant regression
Interpretation:
Results somewhat different from full-sample OLS.
Population growth does not seem to belong in this model.
Population growth does not seem to explain economic growth for Diagnostics and Alternative Methods of Regression
this subset of countries (given other regressors).
Potential explanation: OECD countries fairly homogeneous with
respect to population growth, some countries with substantial
Quantile Regression
population growth have been excluded in robust fit.
Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 76 / 86 Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 77 / 86
Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 78 / 86 Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 79 / 86
Quantile regression Quantile regression
Default: rq() by default uses τ = 0.5; i.e., median or LAD (for “least Particularly useful: Model and compare several quantiles
absolute deviations”) regression. simultaneously.
LAD regression: Illustration: Model first and third quartiles (i.e., τ = 0.25 and
R> library("quantreg") τ = 0.75).
R> data("CPS1988", package = "AER") R> cps_rq <- rq(cps_f, tau = c(0.25, 0.75), data = CPS1988)
R> cps_f <- log(wage) ~ experience + I(experience^2) + education
R> cps_lad <- rq(cps_f, data = CPS1988)
R> summary(cps_lad)
Question: Are the regression lines or surfaces parallel (i.e., are effects
Call: rq(formula = cps_f, data = CPS1988) uniform across quantiles)?
tau: [1] 0.5
Answer: Compare fits with anova() method, either in an overall test of
Coefficients: all coefficients or in coefficient-wise comparisons.
Value Std. Error t value Pr(>|t|)
(Intercept) 4.24088 0.02190 193.67805 0.00000
experience 0.07744 0.00115 67.50041 0.00000
I(experience^2) -0.00130 0.00003 -49.97891 0.00000
education 0.09429 0.00140 67.57171 0.00000
Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 80 / 86 Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 81 / 86
Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 82 / 86 Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 83 / 86
Quantile regression Quantile regression
(Intercept) experience
0.10
●
●
●
5.0
● ●
quantile.
●
0.09
●
●
● ●
4.5
●
●
●
0.08
●
●
●
●
● ●
0.07
●
●
4.0
●
●
● ●
0.06
●
●
●
3.5
●
●
0.05
+ data = CPS1988) 0.2 0.4 0.6 0.8 0.2 0.4 0.6 0.8
R> cps_rqbigs <- summary(cps_rqbig)
R> plot(cps_rqbigs) I(experience^2) education
−0.0008
●
●
● ● ●
Details
● ● ●
● ● ● ●
● ● ● ● ●
●
0.09
●
●
●
●
−0.0012
●
0.08
●
−0.0016
●
0.07
intervals for the quantile regression estimates.
●
−0.0020
●
Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 84 / 86 Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 85 / 86
Quantile regression
Christian Kleiber, Achim Zeileis © 2008–2017Applied Econometrics with R – 4 – Diagnostics and Alternative Methods of Regression – 86 / 86