
Week 8: Additional Regression Considerations

MFE 402

Dan Yavorsky

1
Topics for Today

• MLE: Consumer Search
• Outliers
• Forecasting
• Bootstrapping

2
Consumer Search with Maximum
Likelihood
A Model of Sequential Consumer Search

uij = xj′𝛽 + 𝜂ij + 𝜀ij,    where 𝛿ij ≡ xj′𝛽 + 𝜂ij

• xj are observable vehicle characteristics


• 𝜂ij ∼ N(0, 1) are preferences known only to the consumer
• 𝜀ij ∼ N(0, 𝜎) are consumer-product specific match values
• Consumer knows F𝜀 (𝜀), but must search to discover 𝜀ij
• Search costs cij = exp{𝛾0 + dij 𝛾1 }
• Perfect recall, costless revisits, no outside option

3
Reservation Utilities

Weitzman (1979) characterizes optimal behavior via “reservation utilities”

The marginal benefit from an additional search:


Bij(u*i) = ∫_{u*i}^{∞} (uij - u*i) fuij(uij) duij

Define a reservation utility (zij) by equating marginal benefit and marginal cost:

cij = Bij(zij)  ⟹  cij = [1 - Φ(𝜁ij)] × [ 𝜙(𝜁ij)/(1 - Φ(𝜁ij)) - 𝜁ij ] × 𝜎

Where 𝜁ij = (zij - 𝛿ij ) /𝜎

Kim et al. (2010) show that this mapping can be inverted to recover 𝜁ij, and hence zij = 𝛿ij + 𝜎𝜁ij
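As a rough illustration, here is a minimal R sketch of that inversion, solved numerically for given values of cij, 𝛿ij, and 𝜎 (all names and values below are illustrative assumptions, not the course's estimation code):

# invert c = sigma * ( phi(zeta) - zeta * (1 - Phi(zeta)) ) for zeta
mb <- function(zeta, sigma) sigma * (dnorm(zeta) - zeta * (1 - pnorm(zeta)))

zeta_from_cost <- function(c, sigma) {
  # mb() is strictly decreasing in zeta, so a root on a wide bracket suffices
  uniroot(function(z) mb(z, sigma) - c, interval = c(-10, 10))$root
}

# illustrative values only
sigma <- 1; delta_ij <- 0.5; c_ij <- 0.1
z_ij <- delta_ij + sigma * zeta_from_cost(c_ij, sigma)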
Rational (Optimal) Behavior

Weitzman (1979) shows that a consumer acting optimally will search alternatives in
order of descending reservation utilities, continuing until the maximum realized utility
of the searched alternatives is higher than the reservation utility of the
next-to-be-searched alternative

For a consumer who makes K searches,

• Continuation: zk ≥ maxh<k {uk} for k = 2, … , K
• Selection: z1 ≥ z2 ≥ … ≥ zK ≥ maxh>K {zh}
• Stopping: maxk≤K {uk} ≥ maxh>K {zh}
• Choice: j∗ = arg maxk≤K {uk}
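To make the rule concrete, here is a minimal R sketch of this search-then-stop behavior for given vectors of reservation utilities z and realized utilities u (the function name and example inputs are illustrative):

# Weitzman's rule: search in descending z, stop when max(u searched) >= next z
weitzman_search <- function(z, u) {
  ord <- order(z, decreasing = TRUE)            # selection rule
  searched <- integer(0)
  for (j in ord) {
    if (length(searched) > 0 && max(u[searched]) >= z[j]) break   # stopping rule
    searched <- c(searched, j)                  # continuation: search alternative j
  }
  choice <- searched[which.max(u[searched])]    # choice rule
  list(searched = searched, choice = choice)
}

# illustrative example: searches 1 then 2, stops, and chooses 2
weitzman_search(z = c(2.0, 1.5, 1.0), u = c(0.8, 1.8, 0.2))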

5
Example

6
Likelihood

Individual Likelihood
Li(𝛽, 𝛾) = ∫ 1[ zij ≥ maxh<j {uih} for j = 2, … , Ki  ⋂
                zij ≥ maxk>j {zik} for j = 1, … , Ki  ⋂
                maxh≤Ki {uih} ≥ maxk>Ki {zik}  ⋂
                uij∗i = maxh≤Ki {uih} ] dF(𝜂, 𝜀)

Total Likelihood

L(𝛽, 𝛾) = ∏_{i=1}^{N} Li(𝛽, 𝛾)
7
Estimation via KSF MLE

Following Honka (2014) and Ursu (2018), estimate by maximum likelihood with a kernel-smoothed frequency (KSF) simulator:

Define:

• 𝜈1,j = zij - maxh<j {uih} for j = 2, … , Ki
• 𝜈2,j = zij - maxk>j {zik} for j = 1, … , Ki
• 𝜈3 = maxh≤Ki {uih} - maxk>Ki {zik}
• 𝜈4 = uij∗i - maxh≤Ki {uih} for the chosen j∗i

Calculate L̃i for one set of draws of (𝜂, 𝜀):

L̃i = ( 1 + ∑_{j=2}^{Ki} e^{-𝜆𝜈1,j} + ∑_{j=1}^{Ki} e^{-𝜆𝜈2,j} + e^{-𝜆𝜈3} + e^{-𝜆𝜈4} )⁻¹
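A minimal R sketch of this smoothed indicator for a single consumer and a single draw (the function name and the value of 𝜆 are illustrative assumptions):

# smoothed likelihood contribution for one draw, given the nu terms defined above
ksf_L_tilde <- function(nu1, nu2, nu3, nu4, lambda = 10) {
  1 / (1 + sum(exp(-lambda * nu1)) +   # continuation terms, j = 2, ..., K_i
           sum(exp(-lambda * nu2)) +   # selection terms, j = 1, ..., K_i
           exp(-lambda * nu3) +        # stopping term
           exp(-lambda * nu4))         # choice term
}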
Results

9
Leverage and Outliers
Intuition about Unusual Observations

Q: What do we mean by “outlier”?

A: An observation that deviates markedly from the rest of the sample, due to…

• Large residual: the distance between the observed value Yi and its fitted value Ŷi is
  much larger for that observation than for the others
• High leverage: the observation has an unusual combination of explanatory-variable
  values (i.e., it lies in a "remote" part of the X-space)

Why do we care?

• Observations with high leverage and large residuals are influential: if you dropped that
observation from the data, coefficient estimates would change markedly
• May suggest something is wrong with the model specification
• Understand if predicted values are driven more by data or modeling assumptions
10
Visual Example: Influential Observation
[Figure: a scatterplot with one influential observation that has a large positive error e;
its residual ê is smaller than e because the point pulls the fitted line toward itself]

11
Visual Example: “Hidden” Leverage

[Figure: scatterplot of X1 vs X2 with a "hidden" high-leverage observation]

12
Leverage Values

The leverage value of observation i is the ith diagonal element of the “hat” matrix
P = X(X′ X)-1 X′ :

hii = x′i (X′ X)-1 xi

• hii is a standardized distance metric in ℝk


• hii is bounded: 0 < hii < 1
• ∑_{i=1}^{n} hii = k, so the average leverage is h̄ = k/n

Rule of thumb:

• flag the observation if hii > 3k/n


• high leverage values have the potential to influence coefficient estimates
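In R, the hii come directly from a fitted lm object. A minimal sketch of the rule of thumb, using the cps09mar example regression introduced on the computation slides later (object names are illustrative):

# flag high-leverage observations with the 3k/n rule of thumb
fit <- lm(lwage ~ exper, data = dat)   # dat as constructed on the computation slides
h   <- hatvalues(fit)                  # the h_ii: diagonal of X(X'X)^{-1}X'
k   <- length(coef(fit))
n   <- nobs(fit)
which(h > 3 * k / n)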

13
The Effect on the Coefficients and Fitted Values (part 1)

Let's assess the effect on the coefficient estimates when we leave out observation i:

𝛽̂(-i) = (X′(-i) X(-i))⁻¹ X′(-i) y(-i)
      = ( ∑j≠i Xj Xj′ )⁻¹ ( ∑j≠i Xj Yj )
      = (X′X - Xi Xi′)⁻¹ (X′Y - Xi Yi)

Multiply both sides by (X′X)⁻¹(X′X - Xi Xi′) to get

𝛽̂(-i) - (X′X)⁻¹ Xi Xi′ 𝛽̂(-i) = (X′X)⁻¹(X′Y - Xi Yi) = 𝛽̂ - (X′X)⁻¹ Xi Yi

Rearrange to find:

𝛽̂ - 𝛽̂(-i) = (X′X)⁻¹ Xi (Yi - Xi′ 𝛽̂(-i)) = (X′X)⁻¹ Xi ẽi
14
The Effect on the Coefficients and Fitted Values (part 2)

Pre-multiply by Xi′ and subtract from Yi to find:

êi = Yi - Xi′𝛽̂ = Yi - Xi′𝛽̂(-i) - Xi′(X′X)⁻¹Xi ẽi = (1 - hii) ẽi

⟹ 𝛽̂ - 𝛽̂(-i) = (X′X)⁻¹ Xi (1 - hii)⁻¹ êi

Alternatively:

Ŷi - Ỹi = Xi′𝛽̂ - Xi′𝛽̂(-i) = Xi′(X′X)⁻¹Xi ẽi = hii (1 - hii)⁻¹ êi

Thus both differences (in the coefficient estimates and in the fitted values) are
functions of the observation’s leverage and residual.

15
Some Intuition

Because OLS minimizes squared errors, observations with high leverage can “pull” the
regression line toward the observation in order to minimize the error. Thus, observations with
high leverage (hii ) will have smaller residuals (eî ) as a result of this influence.

Under homoskedasticity:

var(ê) = var(Me) = 𝔼[Mee′M′] = M𝔼[ee′]M = 𝜎²M

⟹ var(êi) = 𝜎²(1 - Xi′(X′X)⁻¹Xi) = 𝜎²(1 - hii)

So when looking for unusual observations, you should look for those with high leverage and
high standardized residuals, defined as:

r̂i = êi / ( s √(1 - hii) ) ≈ ei/𝜎 ∼ N(0, 1)
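R also computes these directly; continuing the sketch from the leverage slide (the cutoff of 2 for |r̂i| is an illustrative convention, not a hard rule):

# standardized residuals: ehat_i / (s * sqrt(1 - h_ii))
r <- rstandard(fit)
# unusual observations: high leverage AND large standardized residual
which(h > 3 * k / n & abs(r) > 2)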

16
Regression Diagnostics
[Figure: the four standard R regression diagnostic panels (Residuals vs Fitted, Normal Q-Q,
Scale-Location, and Residuals vs Leverage with Cook's distance contours); a few observations,
e.g., 123, 227, and 236, are labeled as unusual]
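These four panels are the default output of plot() applied to a fitted lm object in R; a minimal sketch:

# produce the standard diagnostic panels for a fitted model
fit <- lm(lwage ~ exper, data = dat)
par(mfrow = c(2, 2))   # 2x2 layout for the four default plots
plot(fit)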
17
What to do with Influential Observations

If the influence is a result of wrong data:

• correct the data


• omit the observations

If the influence is a result of an unusual circumstance:

• you will likely need to modify the model to accommodate the observation

Otherwise, it’s not obvious:

• Reconsider your choice of model


• Reconsider the specification (the X’s and non-linear transformations of them)
• Reconsider your estimation strategy (use a robust estimator instead of OLS)
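As one concrete (and hedged) example of the last bullet, MASS::rlm replaces least squares with Huber M-estimation, which downweights observations with large residuals:

# a robust alternative to OLS for the same specification
library(MASS)
fit_rob <- rlm(lwage ~ exper, data = dat)   # Huber M-estimator by default
summary(fit_rob)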

18
Forecasting
Prediction and Prediction Error

Consider an out-of-sample realization (Yn+1, Xn+1), where Xn+1 is observed but Yn+1 is not.

Define the predicted value:

Ŷn+1 = X′n+1 𝛽̂

We can easily decompose the prediction error:

ẽn+1 = Yn+1 - Ŷn+1
     = (X′n+1 𝛽 + en+1) - X′n+1 𝛽̂
     = en+1 + X′n+1 (𝛽 - 𝛽̂)
     = en+1 + [ (𝛽0 - 𝛽̂0) + (𝛽1 - 𝛽̂1)X1,n+1 + … + (𝛽k - 𝛽̂k)Xk,n+1 ]

where the first term (en+1) is inherent randomness and the bracketed second term is sampling error.
19
Prediction Error Visualization

[Figure: the true regression line E[Y|X] = β0 + β1X and the fitted line β̂0 + β̂1X; at Xn+1,
the prediction error Yn+1 - Ŷn+1 is the sum of the inherent error en+1 and the sampling
error from estimating 𝛽]

20
MSFE: Mean Squared Forecast Error

Define the mean squared forecast error MSFE = 𝔼[ẽ²n+1]. Then:

𝔼[ẽ²n+1] = 𝔼[ (en+1 - X′n+1(𝛽̂ - 𝛽))² ]
         = 𝔼[e²n+1] - 2𝔼[en+1 X′n+1 (𝛽̂ - 𝛽)] + 𝔼[X′n+1 (𝛽̂ - 𝛽)(𝛽̂ - 𝛽)′ Xn+1]
         = 𝜎² - 2 × 0 + 𝔼[X′n+1 V𝛽̂ Xn+1]

Notice that:

• MSFE > 𝜎2
• MSFE depends on Xn+1
• In particular, MSFE gets larger as Xn+1 is further from X̄

21
Estimating MSFE & Prediction Intervals

Under conditional homoskedasticity (i.e., var(ei |Xi ) = 𝜎2 for all i), this simplifies to:

MSFE = 𝜎²(1 + 𝔼[X′n+1 (X′X)⁻¹ Xn+1])

Replace 𝜎² with s² to get an estimator M̂SFE of the MSFE:

M̂SFE = s²(1 + X′n+1 (X′X)⁻¹ Xn+1)

Assuming the errors are normally distributed (i.e., e ∼ N(0, 𝜎²)), we can construct
prediction intervals using the estimated MSFE, where c is the critical value from a tn-k
distribution:

PIn+1 = ( X′n+1 𝛽̂ - c × √M̂SFE ,  X′n+1 𝛽̂ + c × √M̂SFE )

22
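In R, predict() on an lm fit returns exactly this t-based interval; a minimal sketch using the running lwage example (the new exper value of 10 is illustrative):

# prediction interval for a new observation
fit <- lm(lwage ~ exper, data = dat)
predict(fit, newdata = data.frame(exper = 10),
        interval = "prediction", level = 0.95)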
Cross Validation

One way to assess prediction error is to set aside part of your data (the "validation"
data), fit your model on the rest, and then assess predictions on the observations in the
validation dataset.

• When data are scarce (as is often the case), setting aside a large validation set may not be feasible

A general, simple, and widely-used method for estimating prediction error (in any
model, not just linear regression) is K-fold cross validation.

• Split your data into K roughly equal-sized parts
• For each part k = 1, … , K:
  • Fit the model on the other K - 1 parts
  • Assess your prediction errors on the held-out kth part
• Average all prediction errors to get the MSFE estimate
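A minimal R sketch of this procedure for the running lwage-on-exper regression (K = 5 is an illustrative choice):

# K-fold cross validation estimate of the MSFE
K    <- 5
n    <- nrow(dat)
fold <- sample(rep(1:K, length.out = n))   # random fold assignment
errs <- numeric(n)
for (k in 1:K) {
  fit_k <- lm(lwage ~ exper, data = dat[fold != k, ])           # fit on the other K-1 parts
  errs[fold == k] <- dat$lwage[fold == k] -
                     predict(fit_k, newdata = dat[fold == k, ]) # predict the held-out part
}
mean(errs^2)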
23
CV and a General MSFE Estimator

It turns out that a general (not just under an assumption of homoskedasticity)
estimator for the MSFE is equivalent to n-fold (leave-one-out) cross validation.

The estimator is:

M̂SFE₁ = (1/n) ∑_{i=1}^{n} ẽ²i    where ẽi = êi / (1 - hii)

We solved for ẽi analytically, but we could have done CV:

• For each observation i:
  • Fit the model on the other n - 1 observations
  • Calculate ẽi = Yi - Ỹi
• Then compute the mean squared error of the n prediction errors ẽi, i = 1, … , n

24
Computation

# get data
dat <- read.table("support/cps09mar.txt")
exper <- dat[,1] - dat[,4] - 6
lwage <- log( dat[,5]/(dat[,6]*dat[,7]) )
sam <- dat[,11]==4 & dat[,12]==7 & dat[,2]==0
dat <- data.frame(exper=exper[sam], lwage=lwage[sam])

y <- matrix(lwage[sam], ncol=1)
x <- cbind(1, exper[sam])
xxi <- solve(crossprod(x))
xy <- crossprod(x,y)
betahat <- xxi %*% xy
yhat <- x %*% betahat
ehat <- y - yhat

# select example observation
j <- 56

# calculate e_tilde from formula
hii <- t(x[j,]) %*% xxi %*% x[j,]
ehat[j] / (1 - hii)

          [,1]
[1,] 0.4442777

# calculate e_tilde from loo regression
reg <- lm(lwage[-j] ~ exper[-j])
y[j] - x[j,] %*% coef(reg)

          [,1]
[1,] 0.4753381

25
Computation
# analytic formula
P <- x %*% xxi %*% t(x)
etilde <- ehat / (1-diag(P))
mean(etilde^2)

[1] 0.511118
# loo regression
n <- length(y)
etilde_loo <- vector(length=n)
for(i in 1:n) {
coef_est <- coef(lm(lwage[-i] ~ exper[-i]))
etilde_loo[i] <- y[i] - x[i,] %*% coef_est
}

mean(etilde_loo^2)

[1] 0.5046459

26
Bootstrap
Bootstrap Procedure

While CV is a general purpose method for assessing prediction errors, the bootstrap
procedure is a general purpose method for assessing standard errors (and thus
confidence intervals and hypothesis tests).

The procedure:

• Sample (with replacement) n observations from the original dataset, call this
bootstrap sample b
• Calculate the parameter estimate (or quantity of interest) from bootstrap sample b;
  call this 𝜃̂(b)
• Repeat the first two steps B times (often, B = 1,000 or B = 10,000)
• Then assess the distribution of the B values of 𝜃̂(b), as explained on the next two slides

27
Bootstrap Standard Errors

The bootstrap estimator of the variance of an estimator 𝜃̂ is the sample variance across
the bootstrap parameter estimates (or quantities of interest):

V̂𝜃̂^boot = (1/(B-1)) ∑_{b=1}^{B} (𝜃̂(b) - 𝜃̄)(𝜃̂(b) - 𝜃̄)′    where 𝜃̄ = (1/B) ∑_{b=1}^{B} 𝜃̂(b)

The standard errors are the square roots of the diagonal elements:

s(𝜃̂j^boot) = √[V̂𝜃̂^boot]jj

We can use these standard errors to construct normal-approximation bootstrap
confidence intervals (and test hypotheses):

CI_se^boot = ( 𝜃̂j - c × s(𝜃̂j^boot) ,  𝜃̂j + c × s(𝜃̂j^boot) )    where c = z*_{1-𝛼/2}

28
Bootstrap Percentile Intervals

Given that we have an empirical distribution, we can simply take the empirical
quantiles as the confidence interval boundary values.

• For example, if you have 10,000 bootstrap values 𝜃̂(b), sort those values (call the
  sorted values 𝜃̂*(b)) and take 𝜃̂*(250) and 𝜃̂*(9750) as your 95% confidence interval:

CI_pi^boot = ( 𝜃̂*(250) , 𝜃̂*(9750) )

The most useful feature of this approach is that it is transformation-respecting: if you
want a confidence interval for m(𝜃̂), where m(⋅) is some (monotone) function, you simply
take the corresponding quantiles of the transformed bootstrap values:

( m(𝜃̂)*(250) , m(𝜃̂)*(9750) )
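For instance, using the res matrix of B = 1,000 bootstrap coefficient draws from the example on the next slide (so the 25th and 975th sorted values bound a 95% interval), a percentile CI for exp(𝛽exper) is simply:

# percentile CI for a transformed quantity: transform the draws, then take quantiles
sort(exp(res[,2]))[c(25, 975)]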

29
Example Computations
# get data
dat <- read.table("support/cps09mar.txt")
exper <- dat[,1] - dat[,4] - 6
lwage <- log( dat[,5]/(dat[,6]*dat[,7]) )
sam <- dat[,11]==4 & dat[,12]==7 & dat[,2]==0
dat <- data.frame(exper=exper[sam], lwage=lwage[sam])

# run regression
out <- lm(lwage ~ exper, data=dat)
tt <- summary(out)$coefficients[1:2,1:2]
tt

               Estimate  Std. Error
(Intercept) 2.876515044 0.067631401
exper       0.004776039 0.004335196

# calculate CIs
cbind(low = tt[,1] - 2*tt[,2],
      high = tt[,1] + 2*tt[,2])

                     low       high
(Intercept)  2.741252241 3.01177785
exper       -0.003894353 0.01344643

# bootstrap
set.seed(1234)
B <- 1000
n <- nrow(dat)

res <- matrix(NA_real_, nrow=B, ncol=2)
for(b in 1:B) {
  draws <- sample(1:n, size=n, replace=T)
  res[b,] <- lm(lwage ~ exper, data=dat[draws, ])$coef
}

# CIs from stderr
serr <- apply(res, 2, function(x) sqrt(var(x)))
cbind(low = out$coef - 2*serr,
      high = out$coef + 2*serr)

                     low       high
(Intercept)  2.737715643 3.01531444
exper       -0.003596382 0.01314846

# CIs from percentiles
cbind(sort(res[,1])[c(25, 975)],
      sort(res[,2])[c(25, 975)])

         [,1]         [,2]
[1,] 2.742939 -0.003159473
[2,] 3.012780  0.013336742
30
Next Time

Next time:

• Bayes!

31
