13 March 2024
Overview
Introduction
Recap: Confidence interval
Simultaneous intervals
Selected inference (Pitfalls)
Conclusion
References
2
Introduction
3
Introduction
Who of the gentlemen likes?
[Image omitted: By Gausanchennai - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=761923]
4
Introduction (cont’d)
■ A well-known study by Giovannucci et al. (1995) analyzed the relation between the
intake of various carotenoids, retinol (vitamin A), fruits, and vegetables and the risk
of prostate cancer.
■ They measured the influence of more than 40 vegetables on prostate cancer.
■ Next they calculated p-values.
■ Four of the products (mainly tomato based products) were significantly related to
prostate cancer; all four were associated with a reduced risk.
■ For these products the authors reported confidence intervals for the relative risk.
■ The relative risk (RR) is here defined as

  RR = ( # having cancer & high consumption of tomato based products / # high consumption of tomato based products )
       / ( # having cancer & low consumption of tomato based products / # low consumption of tomato based products ).
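As a quick sketch, RR can be computed directly from the four counts; the numbers below are made up for illustration only and are not from the study:

```python
def relative_risk(cancer_high, total_high, cancer_low, total_low):
    """RR = estimated P(cancer | high consumption) / P(cancer | low consumption)."""
    return (cancer_high / total_high) / (cancer_low / total_low)

# Hypothetical counts, for illustration only (not from Giovannucci et al.)
print(relative_risk(20, 400, 35, 400))  # 4/7 ≈ 0.571 < 1
```

A value below 1 would, as on this slide, point towards a beneficial effect of the high-consumption group.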
5
Introduction (cont’d)
■ The numerator of RR is the estimated probability of cancer for those with a high
consumption of tomato based products.
■ The denominator of RR is the estimated probability of cancer for those with a low
consumption.
■ If RR < 1, this is an indication of a beneficial effect of tomato based products.
■ As said above, Giovannucci et al. (1995) reported confidence intervals for each of
the selected products (selected through the tests/p-values preceding the construction
of the confidence intervals), and the right end-points of these intervals were less than 1.
■ You may now ask yourself at least two questions:
1. Is the construction of univariate confidence intervals justified if several
parameters are involved?
2. Is that (first select relevant parameters and then construct confidence intervals)
how I learned to construct confidence intervals in my introductory courses? If
not, is it a smart extension?
■ To address these questions we will briefly recall in the next section the theoretical
concepts/justifications underlying confidence intervals.
6
Introduction (cont’d)
A few remarks on 2. on the previous slide.
■ We have just seen that, next to the construction of several univariate intervals, the
approach of Giovannucci et al. (1995) to analysing the data involved another step
often found in data analysis: first they selected relevant parameters (through
p-values) and then performed inference (here: construction of confidence
intervals) for the selected ones.
■ This approach to inference is called selected inference.
■ Selected inference has long been known in the statistical literature, but intensive
research has been carried out only very recently.
■ Two developments that have boosted this intensified research are
◆ Multiple testing (for large data sets);
◆ The LASSO.
■ With larger data sets multiple testing has become more and more popular to detect
variables of interest (just think of lecture 10 and the genes example).
■ The LASSO has led to an intensified research on selected inference, because the
LASSO estimator selects variables by construction.
7
Introduction (cont’d)
To summarize this introduction: there are two approaches to inference in the work by
Giovannucci et al. (1995) that deserve our attention as econometricians and data
scientists.
■ First, the construction of univariate confidence intervals when one is interested in
several univariate parameters.
■ Second, the approach of first using the data to find the relevant variables and of
doing inference (above: constructing confidence intervals) afterwards for those
variables.
8
Recap: Confidence interval
9
Recap
■ Definition (confidence set) Given an unknown parameter θ ∈ Θ, Θ ⊂ Rp , and
observations X1 , . . . , Xn we call C(X1 , . . . , Xn ) ⊂ Rp a confidence set at level
1 − α for θ if for all θ ∈ Θ we have
Pθ (θ ∈ C(X1 , . . . , Xn )) ≥ 1 − α. (1)
11
Confidence set (cont’d)
Example (confidence set N (µ, σ²))
■ Let X ∼ N (µ, σ²) with µ unknown and σ² known. For a sample X1 , . . . , Xn we
have

  √n ( (1/n) Σ_{i=1}^n Xi − µ ) ∼ N (0, σ²),

regardless of µ.
■ Therefore the event

  { c1 ≤ (√n/σ) ( (1/n) Σ_{i=1}^n Xi − µ ) ≤ c2 }

has probability Φ(c2) − Φ(c1), whatever the value of µ.
12
Confidence set (cont’d)
Example (confidence set N (µ, σ 2 ) (cont’d))
■ The last equation can be written as (dropping also the here unnecessary index µ)

  P ( −σ c2/√n + (1/n) Σ_{i=1}^n Xi ≤ µ ≤ −σ c1/√n + (1/n) Σ_{i=1}^n Xi ) = Φ(c2) − Φ(c1).
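A minimal sketch of this construction in Python, assuming the usual symmetric choice c1 = −c2 = −q1−α/2 (so that Φ(c2) − Φ(c1) = 1 − α); the data are simulated, not real:

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(0)
mu, sigma, n, alpha = 2.0, 1.5, 100, 0.05   # sigma is known, mu is the target
x = rng.normal(mu, sigma, size=n)

# Symmetric choice c1 = -q, c2 = q with q the (1 - alpha/2)-quantile of N(0, 1)
q = NormalDist().inv_cdf(1 - alpha / 2)
xbar = x.mean()
lo, hi = xbar - q * sigma / np.sqrt(n), xbar + q * sigma / np.sqrt(n)
print(lo, hi)
```

The interval is random: its endpoints are known only after the sample is drawn, exactly as stressed in the AR(1) example below.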
13
Confidence set (cont’d)
Example (confidence set AR(1))
■ As you know, we cannot always find the exact distribution; then the
understanding is that (1) holds for n large.
■ An example is the AR(1)-process Yt = φ Yt−1 + εt , t = 1, 2, . . . .
■ Then an (asymptotic) confidence interval at level 1 − α is obtained from

  √(T−1) ( φ̂T − φ ) →d N (0, σ²),   with φ̂T = Σ_{t=2}^T Yt Yt−1 / Σ_{t=2}^T Y²_{t−1} ,

yielding (asymptotically)

  Pφ ( c1 ≤ √(T−1) ( φ̂T − φ ) / σ̂ ≤ c2 ) = Φ(c2) − Φ(c1),

where σ̂ is an estimator of σ; we could drop the index φ at P because the right-hand
side does not depend on φ.
14
Confidence set (cont’d)
Example (confidence set AR(1) (cont’d))
■ Similarly to the previous example, the last equation can be written as

  P ( −c2 σ̂/√(T−1) + φ̂T ≤ φ ≤ −c1 σ̂/√(T−1) + φ̂T ) = Φ(c2) − Φ(c1).

This can also be written in the form (1), with the understanding that it is an
asymptotic interval.
■ As before, we can see that it is a random interval known only after sampling.
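The asymptotic interval can be sketched in Python for simulated data; taking σ̂ as the sample standard deviation of the fitted residuals is one natural choice of estimator, assumed here:

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(0)
phi, T, alpha = 0.5, 500, 0.05

# Simulate Y_t = phi * Y_{t-1} + eps_t with standard normal innovations
y = np.zeros(T)
for t in range(1, T):
    y[t] = phi * y[t - 1] + rng.normal()

phi_hat = np.sum(y[1:] * y[:-1]) / np.sum(y[:-1] ** 2)
sigma_hat = np.std(y[1:] - phi_hat * y[:-1], ddof=1)  # residual-based estimate of sigma

q = NormalDist().inv_cdf(1 - alpha / 2)     # symmetric choice c1 = -q, c2 = q
half = q * sigma_hat / np.sqrt(T - 1)
print(phi_hat - half, phi_hat + half)       # the (asymptotic) interval for phi
```

As before, the interval is random and known only after sampling.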
15
Simultaneous intervals
16
Simultaneous intervals
■ Let us now start with Question 1. from slide 6.
■ To do so, we ignore the fact that the parameters for which the intervals were
constructed were found by first testing a larger set of parameters.
■ Then, we can re-phrase our question as: Given d univariate parameters θj ,
j = 1, . . . , d, is it problematic to construct an interval for each of them?
■ Answer: it depends on how the intervals were constructed.
■ Suppose we have a linear model and want guarantees (quantified confidence) for
the vector β = (β1 , . . . , βd ) (we use β instead of θ for regression parameters);
then it might be problematic, depending on how the confidence intervals were
constructed.
17
Simultaneous intervals (cont’d)
To be concrete:
■ Suppose we have independent intervals Ij , 1 ≤ j ≤ d, for βj with
P(βj ∈ Ij ) ≥ 1 − α, j = 1, . . . , d (here, as the notation suggests, we tacitly
assume that P(βj ∈ Ij ) ≥ 1 − α holds whatever the true βj ).
■ Note that it does make sense to talk about independent intervals because they are
random (cf. Quiz 11).
■ Suppose each interval has a coverage probability of exactly 95%.
■ Then (cf. Quiz 11 for the case d = 2)

  P ( (β1 , . . . , βd ) ∈ I1 × . . . × Id ) = Π_{j=1}^d P (βj ∈ Ij ) = (0.95)^d ,   (2)

which is smaller than 0.95 for every d ≥ 2 (e.g. (0.95)² ≈ 0.90).
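A two-line check of Eq. (2), together with the per-interval level (1 − α)^{1/d} that would restore a joint level of 95% for independent intervals (the Šidák adjustment, not discussed on this slide):

```python
d = 10
joint = 0.95 ** d               # joint coverage of 10 independent 95% intervals
per_interval = 0.95 ** (1 / d)  # Sidak: per-interval level giving joint 95%
print(joint, per_interval)
```

For d = 10 the naive joint coverage is already below 0.60, while raising each interval to the adjusted level restores the joint 95% guarantee exactly (for independent intervals).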
19
Selected inference (Pitfalls)
20
Pitfalls
■ To come back to our Question 2. on slide 6.
◆ Recall that the authors constructed confidence intervals after having selected
the parameters through tests /p-values.
◆ Compare this to our formula from above
Pθ (θ ∈ C(X1 , . . . , Xn )) ≥ 1 − α.
21
Pitfalls (cont’d)
■ Such a question can only be addressed by theoretical derivations or Monte Carlo
methods; it cannot be decided by looking at data, simply because we do not
know the truth underlying the data.
■ To answer the question of appropriateness, we first examine a simulation
study of Hurvich and Tsai (1990).
22
Pitfalls (cont’d)
■ The model assumed by Hurvich and Tsai was

  Yi = Σ_{j=1}^7 βj Xij + εi ,   (3)

where the explanatory variables Xij and the error terms were normally distributed
(random design).
■ The data available for inference consisted of Y1 , . . . , Y50 and the 50 × 7 matrix

         X1,1   . . .  X1,7
         X2,1   . . .  X2,7
  X50 =   ..            ..
         X50,1  . . .  X50,7
■ Here we look at their results when the Y1 , . . . , Y50 were actually generated from

  Yi = Σ_{j=1}^4 β_j^true Xij + εi ,   i = 1, . . . , 50.
23
Pitfalls (cont’d)
■ With the true model parameter being β true = (β1 , . . . , β4 ) = (1, 2, 3, 0.6).
■ The first part of the analysis consisted of choosing which of the seven candidate
models to use (note that this already includes a choice). The seven candidate
models are obtained from the general assumption in Eq. (3):
◆ Model 1 has regression function r1 given by r1 (x) = β1 x1 ;
◆ ...
◆ Model 7 has regression function r7 given by r7 (x) = β1 x1 + . . . + β7 x7 .
24
Pitfalls (cont’d)
■ The candidate models are nested in the sense that by putting β2 = 0 model 1 is
contained in model 2 etc.
■ To choose between the seven candidate models given the data y1 , . . . , y50 and xij ,
1 ≤ i ≤ 50, 1 ≤ j ≤ 7, Hurvich and Tsai used Akaike’s Information Criterion
(AIC); see, for instance, Advanced Econometrics.
■ Then they constructed a confidence region for the parameter vector of the selected
model.
■ Examples: If model 2, for instance, had the smallest AIC value, they constructed a
confidence set for (β1 , β2 ).
Similarly, if model 5, for example, had the smallest AIC value, they constructed a
confidence set for (β1 , . . . , β5 ).
■ The results for this approach with n = 50 and the above true β true = (1, 2, 3, 0.6)
are given on the next slide.
■ In total Hurvich and Tsai ran the same simulation 500 times.
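The selection step of this design can be sketched as a small Monte Carlo. The Gaussian AIC formula below (valid up to an additive constant) and the replication count of 200 are assumptions of this sketch, not the exact choices of Hurvich and Tsai:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, reps = 50, 7, 200
beta_true = np.array([1.0, 2.0, 3.0, 0.6, 0.0, 0.0, 0.0])

def aic(rss, n, k):
    # Gaussian log-likelihood based AIC, up to an additive constant
    return n * np.log(rss / n) + 2 * k

chosen = []
for _ in range(reps):
    X = rng.normal(size=(n, p))              # random design, as in the study
    y = X @ beta_true + rng.normal(size=n)   # truth uses only the first 4 regressors
    aics = []
    for k in range(1, p + 1):                # the seven nested candidate models
        bhat, *_ = np.linalg.lstsq(X[:, :k], y, rcond=None)
        rss = np.sum((y - X[:, :k] @ bhat) ** 2)
        aics.append(aic(rss, n, k))
    chosen.append(int(np.argmin(aics)) + 1)  # model size picked by AIC

counts = np.bincount(chosen, minlength=8)[1:]   # how often each size 1..7 was chosen
print(dict(zip(range(1, 8), counts)))
```

As in the table on the next slide, the correct model (size 4) should be chosen most often, but over-fitting choices (sizes 5 to 7) occur in a non-negligible fraction of replications.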
25
Pitfalls (cont’d)
Confidence level
Model chosen by AIC 0.9 0.95 0.99
3 0 (0/6) 0 (0/6) 0 (0/6)
4 0.918 (326/355) 0.960 (341/355) 0.994 (353/355)
5 0.789 (56/71) 0.887 (63/71) 0.972 (69/71)
6 0.694 (25/36) 0.750 (27/36) 0.917 (33/36)
7 0.500 (16/32) 0.656 (21/32) 0.969 (31/32)
■ The second number in brackets gives how often the model was chosen by AIC;
the first number gives how often the confidence set (a subset of R3 ,
R4 , . . . , R7 , depending on the model chosen) constructed after selecting the model
contained the true parameter vector.
■ Clearly, in all cases but one (correct model chosen) the level of the confidence
regions is much less than what is aimed for.
■ Regarding the last bullet, note that in about 30% of the cases an incorrect model
was chosen.
26
Pitfalls (cont’d)
■ You could now argue that the result changes if
◆ we use a different criterion to select the model;
◆ we consider a problem that can be adequately answered by looking at
univariate intervals only.
■ The simulation results on the next slide, again due to Hurvich and Tsai, show that
the problem does not disappear when the model selection in the first step is done
with another criterion. The set-up in this simulation study was exactly as described
above; the only difference is that the Bayesian Information Criterion (BIC) was
used instead of AIC.
■ Finally, we will look at a simulation study by Benjamini and Yekutieli (2005) that
shows that restricting to univariate intervals does not solve the problem either.
27
Pitfalls (cont’d)
Confidence level
Model chosen by BIC 0.9 0.95 0.99
3 0.000 (0/27) 0.000 (0/27) 0.000 (0/27)
4 0.909 (398/438) 0.963 (422/438) 0.995 (436/438)
5 0.542 (13/24) 0.708 (17/24) 0.917 (22/24)
6 0.167 (1/6) 0.167 (1/6) 0.667 (4/6)
7 0.000 (0/5) 0.000 (0/5) 1.000 (5/5)
■ The second number in brackets gives how often the model was chosen by BIC;
the first number gives how often the confidence region (a subset of R3 ,
R4 , . . . , R7 , depending on the model chosen) constructed after selecting the model
contained the true parameter vector.
■ Clearly in all cases but one (right model chosen) the level of the confidence sets is
much less than what is aimed for.
■ Regarding the last bullet note that here in about 12% of the cases an incorrect
model was chosen.
28
Pitfalls (cont’d)
■ As said above, the problem does not disappear when we restrict attention to univariate intervals.
■ Benjamini & Yekutieli (2005) looked at the following situation
◆ We have 200 independent measurements Y1 , . . . , Y200 .
◆ Each measurement Yj comes from a normal distribution with expectation µj and
known variance equal to 1; in the simulation all µj are set to a common value θ.
◆ Our hypotheses about the expectations µj are Hj : µj = 0, 1 ≤ j ≤ 200,
i.e. our hypotheses correspond to the truth if θ = 0.
◆ We reject Hj , 1 ≤ j ≤ 200, at level α if |Yj | > q1−α/2 , where q1−α/2 is the
1 − α/2 quantile of the standard normal distribution.
◆ If we rejected hypothesis j, say, we construct a confidence interval Ij for µj
by Ij = [Yj − q1−α/2 , Yj + q1−α/2 ].
◆ Note that, as in Hurvich and Tsai we first make a selection and then do further
inference for the selected parameters.
■ The simulation results obtained by Benjamini & Yekutieli are given in the table on
the next slide.
29
Pitfalls (cont’d)
θ Average coverage after selection (unadjusted)
0 0
0.5 0.6
1 0.84
2 0.95
4 0.97
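This simulation is easy to replicate approximately; the seed and replication counts below are arbitrary choices of this sketch:

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(0)
m, reps, alpha = 200, 200, 0.05
q = NormalDist().inv_cdf(1 - alpha / 2)

def coverage_after_selection(theta):
    covered = selected = 0
    for _ in range(reps):
        y = rng.normal(theta, 1.0, size=m)
        sel = np.abs(y) > q                                # reject H_j => select j
        selected += sel.sum()
        covered += np.sum(sel & (np.abs(y - theta) <= q))  # does I_j cover theta?
    return covered / max(selected, 1)

c0, c1 = coverage_after_selection(0.0), coverage_after_selection(1.0)
print(c0, c1)
```

For θ = 0 the coverage is exactly 0: being selected means |Yj | > q1−α/2 , so the interval Yj ± q1−α/2 can never contain 0. For θ = 1 the simulated coverage should land near the 0.84 in the table.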
30
Pitfalls (cont’d)
■ You could now argue: ’Well, the lecturer himself, as well as Benjamini & Yekutieli,
forgot what we learned earlier this period: the selection was done using multiple
testing without accounting for it by a procedure like Bonferroni or Holm.’
■ The next table shows the results if one performs the above with Bonferroni, i.e. in
bullets 4 and 5 we replace q1−α/2 by q1−α/(2·200) .
31
Pitfalls (cont’d)
θ Average coverage after selection (adjusted)
0 0
0.5 0.82
1 0.97
2 1.00
4 1.00
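The Bonferroni variant only changes the quantile; a sketch of the θ = 0.5 row, with the same arbitrary seed and replication counts as before:

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(0)
m, reps, alpha, theta = 200, 200, 0.05, 0.5
q_bonf = NormalDist().inv_cdf(1 - alpha / (2 * m))  # Bonferroni-adjusted quantile

covered = selected = 0
for _ in range(reps):
    y = rng.normal(theta, 1.0, size=m)
    sel = np.abs(y) > q_bonf                              # selection with adjusted cutoff
    selected += sel.sum()
    covered += np.sum(sel & (np.abs(y - theta) <= q_bonf))
print(selected, covered / max(selected, 1))
```

Coverage improves relative to the unadjusted version, but as the table shows it still fails for θ = 0, where any selected interval necessarily misses the truth.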
32
Conclusion
33
Conclusion
■ At the beginning we identified two approaches to inference often found in applied
work that deserved our attention
1. The use of several univariate confidence intervals;
2. Selected inference.
■ We have seen above (slide 19) that we can combine univariate confidence intervals
for βi , 1 ≤ i ≤ d, to obtain a confidence region for β.
■ We have seen various examples illustrating that first selecting variables and then
doing inference is problematic, in the sense that our inferential methods do not
come with the same statistical guarantees as they would had we not first selected
a model or relevant parameters (above, the coverage of our confidence intervals
after model selection or after selecting relevant parameters was less than the level
aimed for).
■ Unlike for 1., there is no easy solution to the problems that come with 2.
(’easy’ is meant here in the sense of having no drawback).
34
Conclusion (cont’d)
■ One possibility to overcome the issues of selected inference discussed above is to
split the sample into two equally sized samples. The first part of the sample is used
to select (a model or relevant parameters) and the other one for inference.
■ The drawback of splitting the sample is that the sample size available for inference
(which is only half of the full sample) might be relatively small leading to large(r)
variances and therefore to not very informative inference.
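Sample splitting can be illustrated in the Benjamini & Yekutieli setting by modelling the two halves as two independent measurements per parameter; this idealization is an assumption of the sketch. Selection uses only the first copy, inference only the second:

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(0)
m, theta = 200, 0.5
q = NormalDist().inv_cdf(0.975)

# Two independent copies per parameter: one for selection, one for inference
y_select = rng.normal(theta, 1.0, size=m)   # "first half": selection only
y_infer = rng.normal(theta, 1.0, size=m)    # "second half": inference only

sel = np.abs(y_select) > q                  # select using the first copy only
covered = np.abs(y_infer[sel] - theta) <= q # unadjusted intervals on fresh data
print(sel.sum(), covered.mean())
```

Because the inference data are independent of the selection, the unadjusted intervals keep roughly their nominal 95% coverage on the selected parameters, at the cost of wider intervals when the inference half is small.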
■ In our next meeting we will discuss another approach to deal with the problems of
selected inference.
■ In general, one of the purposes of today’s lecture is to make you aware that the
common practice of first selecting (a model or relevant parameters) and then doing
inference needs extra attention.
35
References
36
References
■ Benjamini, Y., Yekutieli, D. (2005). False Discovery Rate-Adjusted Multiple
Confidence Intervals for Selected Parameters. Journal of the American Statistical
Association, 100, 71–81.
■ Hurvich, C. M., Tsai, C.-L. (1990). The Impact of Model Selection on Inference in
Linear Regression. The American Statistician, 44, 214–217.
37