
Week 8: Additional Regression Considerations

MFE 402

Dan Yavorsky

1
Topics for Today

• MLE: Consumer Search
• Outliers
• Forecasting
• Bootstrapping

2
Consumer Search with Maximum
Likelihood
A Model of Sequential Consumer Search

uij = xj′𝛽 + 𝜂ij + 𝜀ij,    where 𝛿ij ≡ xj′𝛽 + 𝜂ij

• xj are observable vehicle characteristics


• 𝜂ij ∼ N(0, 1) are preferences known only to the consumer
• 𝜀ij ∼ N(0, 𝜎) are consumer-product specific match values
• Consumer knows F𝜀 (𝜀), but must search to discover 𝜀ij
• Search costs cij = exp{𝛾0 + dij 𝛾1 }
• Perfect recall, costless revisits, no outside option

3
Reservation Utilities

Weitzman (1979) characterizes optimal behavior via “reservation utilities”

The marginal benefit from an additional search:


Bij(u*i) = ∫_{u*i}^{∞} (uij - u*i) fuij(uij) duij

Define a reservation utility (zij) by equating marginal benefit and marginal cost:

cij = Bij(zij)  ⟹  cij = [1 - Φ(𝜁ij)] × [ 𝜙(𝜁ij)/(1 - Φ(𝜁ij)) - 𝜁ij ] × 𝜎

Where 𝜁ij = (zij - 𝛿ij ) /𝜎

Kim et al. (2010) show that this mapping can be inverted to recover 𝜁ij, and hence zij = 𝛿ij + 𝜎𝜁ij
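As a rough illustration, here is a minimal R sketch of that inversion, solved numerically for given values of cij, 𝛿ij, and 𝜎 (all names and values below are illustrative assumptions, not the course's estimation code):

# invert c = sigma * ( phi(zeta) - zeta * (1 - Phi(zeta)) ) for zeta
mb <- function(zeta, sigma) sigma * (dnorm(zeta) - zeta * (1 - pnorm(zeta)))

zeta_from_cost <- function(c, sigma) {
  # mb() is strictly decreasing in zeta, so a root on a wide bracket suffices
  uniroot(function(z) mb(z, sigma) - c, interval = c(-10, 10))$root
}

# illustrative values only
sigma <- 1; delta_ij <- 0.5; c_ij <- 0.1
z_ij <- delta_ij + sigma * zeta_from_cost(c_ij, sigma)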
Rational (Optimal) Behavior

Weitzman (1979) shows that a consumer acting optimally will search alternatives in
order of descending reservation utilities, continuing until the maximum realized utility
of the searched alternatives is higher than the reservation utility of the
next-to-be-searched alternative

For a consumer who makes K searches,

• Continuation: zk ≥ maxh<k {uk} for k = 2, … , K
• Selection: z1 ≥ z2 ≥ … ≥ zK ≥ maxh>K {zh}
• Stopping: maxk≤K {uk} ≥ maxh>K {zh}
• Choice: j∗ = arg maxk≤K {uk}
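To make the rule concrete, here is a minimal R sketch of this search-then-stop behavior for given vectors of reservation utilities z and realized utilities u (the function name and example inputs are illustrative):

# Weitzman's rule: search in descending z, stop when max(u searched) >= next z
weitzman_search <- function(z, u) {
  ord <- order(z, decreasing = TRUE)            # selection rule
  searched <- integer(0)
  for (j in ord) {
    if (length(searched) > 0 && max(u[searched]) >= z[j]) break   # stopping rule
    searched <- c(searched, j)                  # continuation: search alternative j
  }
  choice <- searched[which.max(u[searched])]    # choice rule
  list(searched = searched, choice = choice)
}

# illustrative example: searches 1 then 2, stops, and chooses 2
weitzman_search(z = c(2.0, 1.5, 1.0), u = c(0.8, 1.8, 0.2))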

5
Example

6
Likelihood

Individual Likelihood
Li(𝛽, 𝛾) = ∫ 1[ zij ≥ maxh<j {uih} for j = 2, … , Ki  ⋂
                zij ≥ maxk>j {zik} for j = 1, … , Ki  ⋂
                maxh≤Ki {uih} ≥ maxk>Ki {zik}  ⋂
                uij∗i = maxh≤Ki {uih} ] dF(𝜂, 𝜀)

Total Likelihood

L(𝛽, 𝛾) = ∏_{i=1}^{N} Li(𝛽, 𝛾)
7
Estimation via KSF MLE

Following Honka (2014) and Ursu (2018), estimate by maximum likelihood with a kernel-smoothed frequency (KSF) simulator:

Define:

• 𝜈1,j = zij - maxh<j {uih} for j = 2, … , Ki
• 𝜈2,j = zij - maxk>j {zik} for j = 1, … , Ki
• 𝜈3 = maxh≤Ki {uih} - maxk>Ki {zik}
• 𝜈4 = uij∗i - maxh≤Ki {uih} for the chosen j∗i

Calculate L̃i for one set of draws of (𝜂, 𝜀):

L̃i = ( 1 + ∑_{j=2}^{Ki} e^{-𝜆𝜈1,j} + ∑_{j=1}^{Ki} e^{-𝜆𝜈2,j} + e^{-𝜆𝜈3} + e^{-𝜆𝜈4} )⁻¹
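A minimal R sketch of this smoothed indicator for a single consumer and a single draw (the function name and the value of 𝜆 are illustrative assumptions):

# smoothed likelihood contribution for one draw, given the nu terms defined above
ksf_L_tilde <- function(nu1, nu2, nu3, nu4, lambda = 10) {
  1 / (1 + sum(exp(-lambda * nu1)) +   # continuation terms, j = 2, ..., K_i
           sum(exp(-lambda * nu2)) +   # selection terms, j = 1, ..., K_i
           exp(-lambda * nu3) +        # stopping term
           exp(-lambda * nu4))         # choice term
}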
Results

9
Leverage and Outliers
Intuition about Unusual Observations

Q: What do we mean by “outlier”?

A: An observation that deviates markedly from the rest of the sample, due to…

• Large residual: the distance between the observed value Yi and its fitted value Ŷi is
  much larger for that observation than for the others
• High leverage: the observation has an unusual combination of explanatory-variable
  values (i.e., it lies in a "remote" part of the X-space)

Why do we care?

• Observations with high leverage and large residuals are influential: if you dropped that
observation from the data, coefficient estimates would change markedly
• May suggest something is wrong with the model specification
• Understand if predicted values are driven more by data or modeling assumptions
10
Visual Example: Influential Observation
[Figure: a scatterplot with one influential observation that has a large positive error e;
its residual ê is smaller than e because the point pulls the fitted line toward itself]

11
Visual Example: “Hidden” Leverage

[Figure: scatterplot of X1 vs X2 with a "hidden" high-leverage observation]

12
Leverage Values

The leverage value of observation i is the ith diagonal element of the “hat” matrix
P = X(X′ X)-1 X′ :

hii = x′i (X′ X)-1 xi

• hii is a standardized distance metric in ℝk


• hii is bounded: 0 < hii < 1
• ∑_{i=1}^{n} hii = k, so the average leverage is h̄ = k/n

Rule of thumb:

• flag the observation if hii > 3k/n


• high leverage values have the potential to influence coefficient estimates
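In R, the hii come directly from a fitted lm object. A minimal sketch of the rule of thumb, using the cps09mar example regression introduced on the computation slides later (object names are illustrative):

# flag high-leverage observations with the 3k/n rule of thumb
fit <- lm(lwage ~ exper, data = dat)   # dat as constructed on the computation slides
h   <- hatvalues(fit)                  # the h_ii: diagonal of X(X'X)^{-1}X'
k   <- length(coef(fit))
n   <- nobs(fit)
which(h > 3 * k / n)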

13
The Effect on the Coefficients and Fitted Values (part 1)

Let's assess the effect on the coefficient estimates when we leave out observation i:

𝛽̂(-i) = (X′(-i) X(-i))⁻¹ X′(-i) y(-i)
      = ( ∑j≠i Xj Xj′ )⁻¹ ( ∑j≠i Xj Yj )
      = (X′X - Xi Xi′)⁻¹ (X′Y - Xi Yi)

Multiply both sides by (X′X)⁻¹(X′X - Xi Xi′) to get

𝛽̂(-i) - (X′X)⁻¹ Xi Xi′ 𝛽̂(-i) = (X′X)⁻¹(X′Y - Xi Yi) = 𝛽̂ - (X′X)⁻¹ Xi Yi

Rearrange to find:

𝛽̂ - 𝛽̂(-i) = (X′X)⁻¹ Xi (Yi - Xi′ 𝛽̂(-i)) = (X′X)⁻¹ Xi ẽi
14
The Effect on the Coefficients and Fitted Values (part 2)

Pre-multiply by Xi′ and subtract from Yi to find:

êi = Yi - Xi′𝛽̂ = Yi - Xi′𝛽̂(-i) - Xi′(X′X)⁻¹Xi ẽi = (1 - hii) ẽi

⟹ 𝛽̂ - 𝛽̂(-i) = (X′X)⁻¹ Xi (1 - hii)⁻¹ êi

Alternatively:

Ŷi - Ỹi = Xi′𝛽̂ - Xi′𝛽̂(-i) = Xi′(X′X)⁻¹Xi ẽi = hii (1 - hii)⁻¹ êi

Thus both differences (in the coefficient estimates and in the fitted values) are
functions of the observation’s leverage and residual.

15
Some Intuition

Because OLS minimizes squared errors, observations with high leverage can “pull” the
regression line toward the observation in order to minimize the error. Thus, observations with
high leverage (hii ) will have smaller residuals (eî ) as a result of this influence.

Under homoskedasticity:

var(ê) = var(Me) = 𝔼[Mee′M′] = M𝔼[ee′]M = 𝜎²M

⟹ var(êi) = 𝜎²(1 - Xi′(X′X)⁻¹Xi) = 𝜎²(1 - hii)

So when looking for unusual observations, you should look for those with high leverage and
high standardized residuals, defined as:

r̂i = êi / ( s √(1 - hii) ) ≈ ei/𝜎 ∼ N(0, 1)
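R also computes these directly; continuing the sketch from the leverage slide (the cutoff of 2 for |r̂i| is an illustrative convention, not a hard rule):

# standardized residuals: ehat_i / (s * sqrt(1 - h_ii))
r <- rstandard(fit)
# unusual observations: high leverage AND large standardized residual
which(h > 3 * k / n & abs(r) > 2)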

16
Regression Diagnostics
[Figure: the four standard R regression diagnostic panels (Residuals vs Fitted, Normal Q-Q,
Scale-Location, and Residuals vs Leverage with Cook's distance contours); a few observations,
e.g., 123, 227, and 236, are labeled as unusual]
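These four panels are the default output of plot() applied to a fitted lm object in R; a minimal sketch:

# produce the standard diagnostic panels for a fitted model
fit <- lm(lwage ~ exper, data = dat)
par(mfrow = c(2, 2))   # 2x2 layout for the four default plots
plot(fit)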
17
What to do with Influential Observations

If the influence is a result of wrong data:

• correct the data


• omit the observations

If the influence is a result of an unusual circumstance:

• you will likely need to modify the model to accommodate the observation

Otherwise, it’s not obvious:

• Reconsider your choice of model


• Reconsider the specification (the X’s and non-linear transformations of them)
• Reconsider your estimation strategy (use a robust estimator instead of OLS)
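As one concrete (and hedged) example of the last bullet, MASS::rlm replaces least squares with Huber M-estimation, which downweights observations with large residuals:

# a robust alternative to OLS for the same specification
library(MASS)
fit_rob <- rlm(lwage ~ exper, data = dat)   # Huber M-estimator by default
summary(fit_rob)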

18
Forecasting
Prediction and Prediction Error

Consider an out-of-sample realization (Yn+1, Xn+1), where Xn+1 is observed but Yn+1 is not.

Define the predicted value:

Ŷn+1 = X′n+1 𝛽̂

We can easily decompose the prediction error:

ẽn+1 = Yn+1 - Ŷn+1
     = (X′n+1 𝛽 + en+1) - X′n+1 𝛽̂
     = en+1 + X′n+1 (𝛽 - 𝛽̂)
     = en+1 + [ (𝛽0 - 𝛽̂0) + (𝛽1 - 𝛽̂1)X1,n+1 + … + (𝛽k - 𝛽̂k)Xk,n+1 ]

where the first term (en+1) is inherent randomness and the bracketed second term is sampling error.
19
Prediction Error Visualization

[Figure: the true regression line E[Y|X] = β0 + β1X and the fitted line β̂0 + β̂1X; at Xn+1,
the prediction error Yn+1 - Ŷn+1 is the sum of the inherent error en+1 and the sampling
error from estimating 𝛽]

20
MSFE: Mean Squared Forecast Error

Define the mean squared forecast error MSFE = 𝔼[ẽ²n+1]. Then:

𝔼[ẽ²n+1] = 𝔼[ (en+1 - X′n+1(𝛽̂ - 𝛽))² ]
         = 𝔼[e²n+1] - 2𝔼[en+1 X′n+1 (𝛽̂ - 𝛽)] + 𝔼[X′n+1 (𝛽̂ - 𝛽)(𝛽̂ - 𝛽)′ Xn+1]
         = 𝜎² - 2 × 0 + 𝔼[X′n+1 V𝛽̂ Xn+1]

Notice that:

• MSFE > 𝜎2
• MSFE depends on Xn+1
• In particular, MSFE gets larger as Xn+1 is further from X̄

21
Estimating MSFE & Prediction Intervals

Under conditional homoskedasticity (i.e., var(ei |Xi ) = 𝜎2 for all i), this simplifies to:

MSFE = 𝜎²(1 + 𝔼[X′n+1 (X′X)⁻¹ Xn+1])

Replace 𝜎² with s² to get an estimator M̂SFE of the MSFE:

M̂SFE = s²(1 + X′n+1 (X′X)⁻¹ Xn+1)

Assuming the errors are normally distributed (i.e., e ∼ N(0, 𝜎²)), we can construct
prediction intervals using the estimated MSFE, where c is the critical value from a tn-k
distribution:

PIn+1 = ( X′n+1 𝛽̂ - c × √M̂SFE ,  X′n+1 𝛽̂ + c × √M̂SFE )

22
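In R, predict() on an lm fit returns exactly this t-based interval; a minimal sketch using the running lwage example (the new exper value of 10 is illustrative):

# prediction interval for a new observation
fit <- lm(lwage ~ exper, data = dat)
predict(fit, newdata = data.frame(exper = 10),
        interval = "prediction", level = 0.95)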
Cross Validation

One way to assess prediction error is to set aside part of your data (the "validation"
data), fit your model on the rest, and then assess predictions on the observations in the
validation dataset.

• When data are scarce (as is often the case), setting aside a large validation set may not be feasible

A general, simple, and widely-used method for estimating prediction error (in any
model, not just linear regression) is K-fold cross validation.

• Split your data into K roughly equal-sized parts
• For each part k = 1, … , K:
  • Fit the model on the other K - 1 parts
  • Assess your prediction errors on the held-out kth part
• Average all prediction errors to get the MSFE estimate
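A minimal R sketch of this procedure for the running lwage-on-exper regression (K = 5 is an illustrative choice):

# K-fold cross validation estimate of the MSFE
K    <- 5
n    <- nrow(dat)
fold <- sample(rep(1:K, length.out = n))   # random fold assignment
errs <- numeric(n)
for (k in 1:K) {
  fit_k <- lm(lwage ~ exper, data = dat[fold != k, ])           # fit on the other K-1 parts
  errs[fold == k] <- dat$lwage[fold == k] -
                     predict(fit_k, newdata = dat[fold == k, ]) # predict the held-out part
}
mean(errs^2)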
23
CV and a General MSFE Estimator

It turns out that a general (not just under an assumption of homoskedasticity)
estimator for the MSFE is equivalent to n-fold (leave-one-out) cross validation.

The estimator is:

M̂SFE₁ = (1/n) ∑_{i=1}^{n} ẽ²i    where ẽi = êi / (1 - hii)

We solved for ẽi analytically, but we could have done CV:

• For each observation i:
  • Fit the model on the other n - 1 observations
  • Calculate ẽi = Yi - Ỹi
• Then compute the mean squared error of the n prediction errors ẽi, i = 1, … , n

24
Computation

# get data
dat <- read.table("support/cps09mar.txt")
exper <- dat[,1] - dat[,4] - 6
lwage <- log( dat[,5]/(dat[,6]*dat[,7]) )
sam <- dat[,11]==4 & dat[,12]==7 & dat[,2]==0
dat <- data.frame(exper=exper[sam], lwage=lwage[sam])

y <- matrix(lwage[sam], ncol=1)
x <- cbind(1, exper[sam])
xxi <- solve(crossprod(x))
xy <- crossprod(x,y)
betahat <- xxi %*% xy
yhat <- x %*% betahat
ehat <- y - yhat

# select example observation
j <- 56

# calculate e_tilde from formula
hii <- t(x[j,]) %*% xxi %*% x[j,]
ehat[j] / (1 - hii)

          [,1]
[1,] 0.4442777

# calculate e_tilde from loo regression
reg <- lm(lwage[-j] ~ exper[-j])
y[j] - x[j,] %*% coef(reg)

          [,1]
[1,] 0.4753381

25
Computation
# analytic formula
P <- x %*% xxi %*% t(x)
etilde <- ehat / (1-diag(P))
mean(etilde^2)

[1] 0.511118
# loo regression
n <- length(y)
etilde_loo <- vector(length=n)
for(i in 1:n) {
coef_est <- coef(lm(lwage[-i] ~ exper[-i]))
etilde_loo[i] <- y[i] - x[i,] %*% coef_est
}

mean(etilde_loo^2)

[1] 0.5046459

26
Bootstrap
Bootstrap Procedure

While CV is a general purpose method for assessing prediction errors, the bootstrap
procedure is a general purpose method for assessing standard errors (and thus
confidence intervals and hypothesis tests).

The procedure:

• Sample (with replacement) n observations from the original dataset, call this
bootstrap sample b
• Calculate the parameter estimate (or quantity of interest) from bootstrap sample b;
  call this 𝜃̂(b)
• Repeat the first two steps B times (often, B = 1,000 or B = 10,000)
• Then assess the distribution of the B values of 𝜃̂(b), as explained on the next two slides

27
Bootstrap Standard Errors

The bootstrap estimator of the variance of an estimator 𝜃̂ is the sample variance across
the bootstrap parameter estimates (or quantities of interest):

V̂𝜃̂^boot = (1/(B-1)) ∑_{b=1}^{B} (𝜃̂(b) - 𝜃̄)(𝜃̂(b) - 𝜃̄)′    where 𝜃̄ = (1/B) ∑_{b=1}^{B} 𝜃̂(b)

The standard errors are the square roots of the diagonal elements:

s(𝜃̂j^boot) = √[V̂𝜃̂^boot]jj

We can use these standard errors to construct normal-approximation bootstrap
confidence intervals (and test hypotheses):

CI_se^boot = ( 𝜃̂j - c × s(𝜃̂j^boot) ,  𝜃̂j + c × s(𝜃̂j^boot) )    where c = z*_{1-𝛼/2}

28
Bootstrap Percentile Intervals

Given that we have an empirical distribution, we can simply take the empirical
quantiles as the confidence interval boundary values.

• For example, if you have 10,000 bootstrap values 𝜃̂(b), sort those values (call the
  sorted values 𝜃̂*(b)) and take 𝜃̂*(250) and 𝜃̂*(9750) as your 95% confidence interval:

CI_pi^boot = ( 𝜃̂*(250) , 𝜃̂*(9750) )

The most useful feature of this approach is that it is transformation-respecting: if you
want a confidence interval for m(𝜃̂), where m(⋅) is some (monotone) function, you simply
take the corresponding quantiles of the transformed bootstrap values:

( m(𝜃̂)*(250) , m(𝜃̂)*(9750) )
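For instance, using the res matrix of B = 1,000 bootstrap coefficient draws from the example on the next slide (so the 25th and 975th sorted values bound a 95% interval), a percentile CI for exp(𝛽exper) is simply:

# percentile CI for a transformed quantity: transform the draws, then take quantiles
sort(exp(res[,2]))[c(25, 975)]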

29
Example Computations
# get data
dat <- read.table("support/cps09mar.txt")
exper <- dat[,1] - dat[,4] - 6
lwage <- log( dat[,5]/(dat[,6]*dat[,7]) )
sam <- dat[,11]==4 & dat[,12]==7 & dat[,2]==0
dat <- data.frame(exper=exper[sam], lwage=lwage[sam])

# run regression
out <- lm(lwage ~ exper, data=dat)
tt <- summary(out)$coefficients[1:2,1:2]
tt

               Estimate  Std. Error
(Intercept) 2.876515044 0.067631401
exper       0.004776039 0.004335196

# calculate CIs
cbind(low = tt[,1] - 2*tt[,2],
      high = tt[,1] + 2*tt[,2])

                     low       high
(Intercept)  2.741252241 3.01177785
exper       -0.003894353 0.01344643

# bootstrap
set.seed(1234)
B <- 1000
n <- nrow(dat)

res <- matrix(NA_real_, nrow=B, ncol=2)
for(b in 1:B) {
  draws <- sample(1:n, size=n, replace=T)
  res[b,] <- lm(lwage ~ exper, data=dat[draws, ])$coef
}

# CIs from stderr
serr <- apply(res, 2, function(x) sqrt(var(x)))
cbind(low = out$coef - 2*serr,
      high = out$coef + 2*serr)

                     low       high
(Intercept)  2.737715643 3.01531444
exper       -0.003596382 0.01314846

# CIs from percentiles
cbind(sort(res[,1])[c(25, 975)],
      sort(res[,2])[c(25, 975)])

         [,1]         [,2]
[1,] 2.742939 -0.003159473
[2,] 3.012780  0.013336742
30
Next Time

Next time:

• Bayes!

31
