
Big Data Statistics, meeting 11:
Things every data scientist should be aware of, part I

13 March 2024
Overview
Introduction
Recap: Confidence interval
Simultaneous intervals
Selected inference (Pitfalls)
Conclusion
References

Introduction

Introduction
Which of the gentlemen likes ... ? [image omitted]
Image credit: Gausanchennai, Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=761923

Introduction (cont’d)
■ A well-known study by Giovannucci et al. (1995) analyzed the relation between the
intake of various carotenoids, retinol (vitamin A), fruits and vegetables and the risk
of prostate cancer.
■ They measured the influence of more than 40 vegetables on prostate cancer.
■ Next they calculated p-values.
■ Four of the products (mainly tomato-based products) were significantly related to
prostate cancer, i.e. all four were associated with a reduced risk.
■ For these products the authors reported confidence intervals for the relative risk.
■ The relative risk (RR) is here defined as
$$
\mathrm{RR} \;=\;
\frac{\#\{\text{having cancer \& high consumption of tomato-based products}\}\,/\,\#\{\text{high consumption of tomato-based products}\}}
{\#\{\text{having cancer \& low consumption of tomato-based products}\}\,/\,\#\{\text{low consumption of tomato-based products}\}}.
$$
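To make the definition concrete, a tiny Python illustration with invented counts (the numbers are hypothetical, not from the study):

```python
# Hypothetical counts, purely for illustration (not data from Giovannucci et al.)
high_total, high_cancer = 1000, 30  # high consumption of tomato-based products
low_total, low_cancer = 1000, 50    # low consumption of tomato-based products

# RR = estimated P(cancer | high consumption) / estimated P(cancer | low consumption)
rr = (high_cancer / high_total) / (low_cancer / low_total)
print(rr)  # 0.6 < 1: estimated risk is lower under high consumption
```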

Introduction (cont’d)
■ The numerator of RR is the estimated probability of cancer for those with a high
consumption of tomato-based products.
■ The denominator of RR is the estimated probability of cancer for those with a low
consumption.
■ If RR < 1 this is an indication of a protective effect of tomato-based products.
■ As said above, Giovannucci et al. (1995) reported confidence intervals for each of
the selected products (selected through the tests/p-values preceding the construction
of the confidence intervals), and the right end-points of these intervals were less than 1.
■ You may now ask yourself at least two questions:
1. Is the construction of univariate confidence intervals justified if several
parameters are involved?
2. Is that (first select relevant parameters and then construct confidence intervals)
how I learned to construct confidence intervals in my introductory courses? If
not, is it a smart extension?
■ To address these questions we will briefly recall in the next section the theoretical
concepts/justifications underlying confidence intervals.

Introduction (cont’d)
A few remarks on Question 2 from the previous slide.
■ We have just seen that, next to the construction of several univariate intervals, the
approach of Giovannucci et al. (1995) to analyse the data involved another step
often found in data analysis: First they selected relevant parameters (through
p-values) and then performed inference (here: the construction of confidence
intervals) for the selected ones.
■ This approach to inference is called selected inference.
■ Selected inference has long been known in the statistical literature, but intensive
research on it has been carried out only very recently.
■ Two developments that have boosted this intensified research are
◆ Multiple testing (for large data sets);
◆ The LASSO.
■ With larger data sets multiple testing has become more and more popular to detect
variables of interest (just think of lecture 10 and the genes example).
■ The LASSO has led to intensified research on selected inference because the
LASSO estimator selects variables by construction.

Introduction (cont’d)
To summarize this introduction: there are two approaches to inference in the work by
Giovannucci et al. (1995) that deserve our attention as econometricians and data
scientists.
■ First, the construction of univariate confidence intervals when one is interested in
several univariate parameters.
■ Second, the approach of first using the data to find the relevant variables and then
doing inference (here: constructing confidence intervals) for those variables afterwards.

Recap: Confidence interval

Recap
■ Definition (confidence set) Given an unknown parameter θ ∈ Θ, Θ ⊂ R^p, and
observations X1, . . . , Xn we call C(X1, . . . , Xn) ⊂ R^p a confidence set at level
1 − α for θ if for all θ ∈ Θ we have
$$
P_\theta\big(\theta \in C(X_1,\dots,X_n)\big) \;\ge\; 1-\alpha. \tag{1}
$$

■ Remarks (confidence set)


◆ Equation (1) means that the probability that C(X1 , . . . , Xn ) contains the true
but unknown parameter is at least 1 − α regardless of what the true parameter
is.
◆ If p > 1 we call C(X1 , . . . , Xn ) a confidence region.
◆ With regard to Quiz 11:
■ The only random quantity in (1) is C(X1 , . . . , Xn ).
■ The probability statement is therefore about the random set
C(X1 , . . . , Xn ) containing the unknown parameter.
■ This means it refers to the situation before taking the sample.
■ Given a sample (x1, . . . , xn) of (X1, . . . , Xn) there are only two
possibilities: either θ ∈ C(x1, . . . , xn) or θ ∉ C(x1, . . . , xn).
Confidence set (cont’d)
Remarks (confidence set (cont’d))
■ If C(X1 , . . . , Xn ) is an interval, we simply say confidence interval.
■ Because (X1, . . . , Xn) depends on the true but unknown parameter, statements
about C(X1, . . . , Xn) will often depend on θ.
■ We therefore index P by θ to make this dependence explicit.
■ This possible dependence on the true unknown θ is also the reason for requiring
that (1) holds for all θ ∈ Θ.
■ As you know, there are cases where the distribution of C(X1, . . . , Xn) does not
depend on the true but unknown θ, as in the example on the next slide. These
cases are often easy and we can drop the index.

Confidence set (cont’d)
Example (confidence set N(µ, σ²))
■ Let X ∼ N(µ, σ²) with µ unknown and σ² known. For a sample X1, . . . , Xn we have
$$
\sqrt{n}\left(\frac{1}{n}\sum_{i=1}^{n} X_i - \mu\right) \sim N(0,\sigma^2),
$$
regardless of µ.
■ Therefore the event
$$
c_1 \;\le\; \frac{\sqrt{n}}{\sigma}\left(\frac{1}{n}\sum_{i=1}^{n} X_i - \mu\right) \;\le\; c_2
$$
does not depend on µ either.
■ Hence, independently of µ,
$$
P_\mu\left(c_1 \le \frac{\sqrt{n}}{\sigma}\left(\frac{1}{n}\sum_{i=1}^{n} X_i - \mu\right) \le c_2\right) = \Phi(c_2) - \Phi(c_1).
$$

Confidence set (cont’d)
Example (confidence set N(µ, σ²), cont'd)
■ The last equation can be written as (dropping also the here unnecessary index µ)
$$
P\left(-\frac{\sigma c_2}{\sqrt{n}} + \frac{1}{n}\sum_{i=1}^{n} X_i \;\le\; \mu \;\le\; -\frac{\sigma c_1}{\sqrt{n}} + \frac{1}{n}\sum_{i=1}^{n} X_i\right) = \Phi(c_2) - \Phi(c_1).
$$
■ With the confidence interval
$$
C(X_1,\dots,X_n) \;=\; \left[-\frac{\sigma c_2}{\sqrt{n}} + \frac{1}{n}\sum_{i=1}^{n} X_i \;;\; -\frac{\sigma c_1}{\sqrt{n}} + \frac{1}{n}\sum_{i=1}^{n} X_i\right]
$$
we can write it in the above form (1).
■ As you know, at level 95% we have c2 = 1.96 and c1 = −1.96.
■ Here we can see clearly that C(X1, . . . , Xn) is a random interval known only after
sampling.
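As a sanity check, a minimal Python simulation (an illustration, not part of the slides) of the coverage of this interval:

```python
import numpy as np

# Coverage of the 95% interval for the mean of N(mu, sigma^2) with sigma known
rng = np.random.default_rng(0)
mu, sigma, n, reps = 2.0, 1.5, 50, 100_000
c = 1.96  # c2 = 1.96, c1 = -1.96

covered = 0
for _ in range(reps):
    xbar = rng.normal(mu, sigma, n).mean()
    covered += (xbar - c * sigma / np.sqrt(n) <= mu <= xbar + c * sigma / np.sqrt(n))

print(covered / reps)  # close to 0.95: the random interval covers mu about 95% of the time
```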

Confidence set (cont’d)
Example (confidence set AR(1))
■ As you know, we cannot always find the exact distribution; the understanding is
then that (1) holds for n large.
■ An example is the AR(1)-process Y_t = φY_{t−1} + ε_t, t = 1, 2, . . . .
■ Then an (asymptotic) confidence interval at level 1 − α is obtained from
$$
\sqrt{T-1}\,\Bigg(\underbrace{\frac{\sum_{t=2}^{T} Y_t Y_{t-1}}{\sum_{t=2}^{T} Y_{t-1}^2}}_{\hat{\phi}_T} - \phi\Bigg) \;\xrightarrow{d}\; N(0,\sigma^2),
$$
yielding (asymptotically)
$$
P_\phi\left(c_1 \le \frac{\sqrt{T-1}}{\hat{\sigma}}\left(\frac{\sum_{t=2}^{T} Y_t Y_{t-1}}{\sum_{t=2}^{T} Y_{t-1}^2} - \phi\right) \le c_2\right) = \Phi(c_2) - \Phi(c_1),
$$
where σ̂ is an estimator of σ, and we can drop the index φ at P because the
right-hand side does not depend on φ.

Confidence set (cont’d)
Example (confidence set AR(1), cont'd)
■ Similarly to the previous example, the last equation can be written as
$$
P\left(-\frac{c_2\hat{\sigma}}{\sqrt{T-1}} + \frac{\sum_{t=2}^{T} Y_t Y_{t-1}}{\sum_{t=2}^{T} Y_{t-1}^2} \;\le\; \phi \;\le\; -\frac{c_1\hat{\sigma}}{\sqrt{T-1}} + \frac{\sum_{t=2}^{T} Y_t Y_{t-1}}{\sum_{t=2}^{T} Y_{t-1}^2}\right) = \Phi(c_2) - \Phi(c_1).
$$
■ With the confidence interval
$$
C(Y_1,\dots,Y_T) \;=\; \left[-\frac{c_2\hat{\sigma}}{\sqrt{T-1}} + \frac{\sum_{t=2}^{T} Y_t Y_{t-1}}{\sum_{t=2}^{T} Y_{t-1}^2} \;;\; -\frac{c_1\hat{\sigma}}{\sqrt{T-1}} + \frac{\sum_{t=2}^{T} Y_t Y_{t-1}}{\sum_{t=2}^{T} Y_{t-1}^2}\right],
$$
this can also be written in the form (1), with the understanding that it is an
asymptotic interval.
■ As before, we can see that it is a random interval known only after sampling.
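A minimal Python sketch (an illustration using the slides' normalization; taking σ̂ to be the residual standard deviation is our own choice) of how this interval is computed from data:

```python
import numpy as np

rng = np.random.default_rng(1)
T, phi_true = 500, 0.6

# Simulate an AR(1) path: Y_t = phi * Y_{t-1} + eps_t with eps_t ~ N(0, 1)
y = np.zeros(T)
for t in range(1, T):
    y[t] = phi_true * y[t - 1] + rng.normal()

phi_hat = np.sum(y[1:] * y[:-1]) / np.sum(y[:-1] ** 2)         # sum Y_t Y_{t-1} / sum Y_{t-1}^2
sigma_hat = np.sqrt(np.mean((y[1:] - phi_hat * y[:-1]) ** 2))  # residual standard deviation

half = 1.96 * sigma_hat / np.sqrt(T - 1)  # c2 = 1.96, c1 = -1.96
print(phi_hat - half, phi_hat + half)     # asymptotic 95% interval for phi
```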

Simultaneous intervals

Simultaneous intervals
■ Let us now start with Question 1 from the introduction.
■ To do so, we ignore the fact that the parameters for which the intervals were
constructed were found by first testing a larger set of parameters.
■ Then, we can re-phrase our question as: Given d univariate parameters θj,
j = 1, . . . , d, is it problematic to construct an interval for each of them?
■ Answer: It depends on how they were constructed.
■ Suppose we have a linear model and want to have guarantees (quantified
confidence) for the vector β = (β1, . . . , βd) (we use β instead of θ for regression
parameters); then it might be problematic, depending on how the confidence
intervals were constructed.

Simultaneous intervals (cont’d)
To be concrete:
■ Suppose we have independent intervals Ij , 1 ≤ j ≤ d, for βj with
P(βj ∈ Ij ) ≥ 1 − α, j = 1, . . . , d (here, as the notation suggests, we tacitly
assume that P(βj ∈ Ij ) ≥ 1 − α holds whatever the true βj ).
■ Note that it does make sense to talk about independent intervals because they are
random (cf. Quiz 11).
■ Suppose each interval has a coverage probability of exactly 95%.
■ Then (cf. Quiz 11 for the case d = 2)
$$
P\big((\beta_1,\dots,\beta_d) \in I_1 \times \dots \times I_d\big) \;=\; \prod_{j=1}^{d} P(\beta_j \in I_j) \;=\; (0.95)^d, \tag{2}
$$
where I1 × . . . × Id = {(x1, . . . , xd) ∈ R^d | xj ∈ Ij, j = 1, . . . , d}.


■ For instance, if d = 10 this probability equals about 60%.
■ This reminds you of lecture 8, where we did multiple testing without accounting for
the fact that we test d hypotheses and not only one.
Simultaneous intervals (cont’d)
To be concrete (cont’d):
■ Bottom line of the previous slide: Even if each Ij, 1 ≤ j ≤ d, has a coverage
probability of 95%, the probability that they jointly, i.e. I1 × . . . × Id, cover
(β1, . . . , βd) can be much lower. (This is exactly the same story as with d tests,
each controlling the Type-I error at level 5%, where the probability of rejecting at
least one true hypothesis is much higher than 5%.)
■ If we want to combine intervals I1 , . . . , Id they must be constructed in a way that
we have
P(βj ∈ Ij for all j ∈ {1, . . . , d}) ≥ 1 − α.
■ For such a construction we call I1 × . . . × Id a simultaneous interval at level 1 − α.
■ Note that we can easily achieve this with our univariate intervals if each interval
has confidence level 0.95^(1/d). Plug this into (2) to see that we obtain a level of 95%.
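A quick numerical check of (2) and of this adjustment (illustration only):

```python
d = 10
print(0.95 ** d)         # ~ 0.5987: joint coverage of ten independent 95% intervals
level = 0.95 ** (1 / d)  # per-interval level needed for a joint level of 95%
print(level)             # ~ 0.99488
print(level ** d)        # 0.95 again, as obtained by plugging into (2)
```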

Selected inference (Pitfalls)

Pitfalls
■ Let us come back to our Question 2 from the introduction.
◆ Recall that the authors constructed confidence intervals after having selected
the parameters through tests/p-values.
◆ Compare this to our formula from above:
$$
P_\theta\big(\theta \in C(X_1,\dots,X_n)\big) \;\ge\; 1-\alpha.
$$
◆ 'After having selected' means that we need to consider a conditional
probability, whereas the displayed formula is an unconditional probability.
■ Clearly, this is an inconsistency, but the question now is: Does this inconsistency
make the approach inappropriate?

Pitfalls (cont’d)
■ Such a question can only be addressed by theoretical derivations or Monte Carlo
methods; it cannot be decided by looking at data, simply because we do not know
the truth underlying the data.
■ To answer the question of appropriateness, we first examine a simulation study of
Hurvich and Tsai (1990).

Pitfalls (cont’d)
■ The model considered in Hurvich and Tsai was
$$
Y_i \;=\; \sum_{j=1}^{7} \beta_j X_{ij} + \varepsilon_i, \tag{3}
$$
where the explanatory variables X_ij and the error terms were normally distributed
(random design).
■ The data available for inference consisted of Y1, . . . , Y50 and
$$
X_{50} \;=\; \begin{pmatrix} X_{11} & \dots & X_{17} \\ X_{21} & \dots & X_{27} \\ \vdots & \ddots & \vdots \\ X_{50\,1} & \dots & X_{50\,7} \end{pmatrix}.
$$
■ Here we look at their results when Y1, . . . , Y50 were actually generated from
$$
Y_i \;=\; \sum_{j=1}^{4} \beta_j^{\mathrm{true}} X_{ij} + \varepsilon_i, \qquad i = 1, \dots, 50.
$$
Pitfalls (cont’d)
■ The true model parameter was β^true = (β1, . . . , β4) = (1, 2, 3, 0.6).
■ The first part of the analysis consisted of choosing which of the seven candidate
models to use (note that this already includes a choice). The seven candidate
models are obtained from the general assumption in Eq. (3):
◆ Model 1 has regression function r1 given by
r_{β1}(X_{i1}) = β1 X_{i1};
◆ Model 2 has regression function r2 given by
r_{β1,β2}(X_{i1}, X_{i2}) = β1 X_{i1} + β2 X_{i2};
◆ ...
◆ Model 7 has regression function r7 given by
r_{β1,...,β7}(X_{i1}, . . . , X_{i7}) = β1 X_{i1} + . . . + β7 X_{i7}.

Pitfalls (cont’d)
■ The candidate models are nested in the sense that by putting β2 = 0 model 1 is
contained in model 2 etc.
■ To choose between the seven candidate models given the data y1 , . . . , y50 and xij ,
1 ≤ i ≤ 50, 1 ≤ j ≤ 7, Hurvich and Tsai used Akaike’s Information Criterion
(AIC); see, for instance, Advanced Econometrics.
■ Then they constructed a confidence region for the parameter vector of the selected
model.
■ Examples: If model 2, for instance, had the smallest AIC value they constructed a
confidence set for (β1 , β2 ) etc.
Similarly, if model 5, for example, had the smallest AIC value they constructed a
confidence set for (β1 , . . . , β5 ).
■ The results for this approach with n = 50 and the above true β true = (1, 2, 3, 0.6)
are given on the next slide.
■ In total Hurvich and Tsai ran the same simulation 500 times.
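The following Python sketch mimics this kind of experiment under our own implementation choices (the Gaussian AIC formula and an exact F-based confidence region are assumptions on our part); it reproduces the qualitative effect, not Hurvich and Tsai's exact numbers:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, p, alpha, reps = 50, 7, 0.05, 500
beta_true = np.array([1.0, 2.0, 3.0, 0.6, 0.0, 0.0, 0.0])  # only the first 4 are nonzero

chosen = np.zeros(p, dtype=int)   # how often model k is chosen by AIC
covered = np.zeros(p, dtype=int)  # how often the post-selection region covers the truth

for _ in range(reps):
    X = rng.normal(size=(n, p))  # random design
    y = X @ beta_true + rng.normal(size=n)
    # Gaussian AIC (constants dropped) for the nested candidate models 1, ..., 7
    rss = [np.sum((y - X[:, :k] @ np.linalg.lstsq(X[:, :k], y, rcond=None)[0]) ** 2)
           for k in range(1, p + 1)]
    k = int(np.argmin([n * np.log(r / n) + 2 * (j + 1) for j, r in enumerate(rss)])) + 1
    chosen[k - 1] += 1
    # exact F-based confidence region for the selected model's coefficient vector
    Xk = X[:, :k]
    bhat = np.linalg.lstsq(Xk, y, rcond=None)[0]
    s2 = np.sum((y - Xk @ bhat) ** 2) / (n - k)
    diff = bhat - beta_true[:k]
    stat = diff @ (Xk.T @ Xk) @ diff / (k * s2)
    # a region in R^k with k < 4 can never contain the four nonzero true coefficients
    if k >= 4 and stat <= stats.f.ppf(1 - alpha, k, n - k):
        covered[k - 1] += 1

for k in range(p):
    if chosen[k]:
        print(k + 1, f"{covered[k] / chosen[k]:.3f} ({covered[k]}/{chosen[k]})")
```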

Pitfalls (cont’d)
                       Confidence level
Model chosen by AIC    0.90              0.95              0.99
3                      0.000 (0/6)       0.000 (0/6)       0.000 (0/6)
4                      0.918 (326/355)   0.960 (341/355)   0.994 (353/355)
5                      0.789 (56/71)     0.887 (63/71)     0.972 (69/71)
6                      0.694 (25/36)     0.750 (27/36)     0.917 (33/36)
7                      0.500 (16/32)     0.656 (21/32)     0.969 (31/32)
■ The second number in parentheses gives how often the model was chosen by AIC,
and the first number how often the confidence set (a subset of R^3,
R^4, . . . , R^7, depending on the model chosen) constructed after selecting the model
contained the true parameter vector.
■ Clearly, in all cases but one (the correct model being chosen) the level of the
confidence regions is much less than what is aimed for.
■ Regarding the last bullet, note that in about 30% of the cases an incorrect model
was chosen.

Pitfalls (cont’d)
■ You could now argue that the result changes if
◆ we use a different criterion to select the model;
◆ we consider a problem that can be sufficiently answered by looking at
univariate intervals only.
■ The simulation results on the next slide, again due to Hurvich and Tsai, show that
the problem does not disappear when the model selection in the first step is done
with another criterion. The set-up in this simulation study was exactly as described
above; the only difference is that the Bayesian Information Criterion (BIC) was
used instead of AIC.
■ Finally, we will look at a simulation study by Benjamini and Yekutieli (2005) that
shows that restricting to univariate intervals does not solve the problem either.

Pitfalls (cont’d)
                       Confidence level
Model chosen by BIC    0.90              0.95              0.99
3                      0.000 (0/27)      0.000 (0/27)      0.000 (0/27)
4                      0.909 (398/438)   0.963 (422/438)   0.995 (436/438)
5                      0.542 (13/24)     0.708 (17/24)     0.917 (22/24)
6                      0.167 (1/6)       0.167 (1/6)       0.667 (4/6)
7                      0.000 (0/5)       0.000 (0/5)       1.000 (5/5)
■ The second number in parentheses gives how often the model was chosen by BIC,
and the first number how often the confidence region (a subset of R^3,
R^4, . . . , R^7, depending on the model chosen) constructed after selecting the model
contained the true parameter vector.
■ Clearly, in all cases but one (the right model being chosen) the level of the
confidence sets is much less than what is aimed for.
■ Regarding the last bullet, note that here in about 12% of the cases an incorrect
model was chosen.

Pitfalls (cont’d)
■ As said above, the problem does not disappear when we restrict attention to
univariate intervals.
■ Benjamini & Yekutieli (2005) looked at the following situation:
◆ We have 200 independent measurements Y1, . . . , Y200.
◆ Each measurement comes from a normal distribution with expectation θ and
known variance equal to 1.
◆ Our hypotheses about the expectations µj = E(Yj) = θ are Hj : µj = 0,
1 ≤ j ≤ 200, i.e. our hypotheses correspond to the truth if θ = 0.
◆ We reject Hj, 1 ≤ j ≤ 200, at level α if |Yj| > q_{1−α/2}, where q_{1−α/2} is the
1 − α/2 quantile of the standard normal.
◆ If we rejected hypothesis j, say, we construct a confidence interval Ij for µj
by Ij = [Yj − q_{1−α/2}, Yj + q_{1−α/2}].
◆ Note that, as in Hurvich and Tsai, we first make a selection and then do further
inference for the selected parameters.
■ The simulation results obtained by Benjamini & Yekutieli are given in the table on
the next slide.
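A small Python re-implementation of this experiment for illustration (our own sketch; exact numbers will vary):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
alpha, m, reps = 0.05, 200, 500
q = norm.ppf(1 - alpha / 2)  # the 1 - alpha/2 quantile of the standard normal

for theta in (0.0, 0.5, 1.0, 2.0, 4.0):
    hits, total = 0, 0
    for _ in range(reps):
        y = rng.normal(theta, 1.0, m)
        sel = np.abs(y) > q  # select j (i.e. reject H_j) if |Y_j| > q
        total += sel.sum()
        # unadjusted interval [Y_j - q, Y_j + q] for every selected j
        hits += np.sum((y[sel] - q <= theta) & (theta <= y[sel] + q))
    print(theta, hits / total if total > 0 else float("nan"))
```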

Pitfalls (cont’d)
θ      Average coverage after selection (unadjusted)
0      0.00
0.5    0.60
1      0.84
2      0.95
4      0.97

■ Average coverage after selection is calculated as follows: If a parameter is selected
and contained in the confidence interval, we count a 1; if it is selected but not
contained in the interval, we count a 0. Dividing the number of times the interval
contained the selected parameter by the number of times the parameter was
selected gives the average coverage after selection.
■ The result is even more disturbing than the one by Hurvich and Tsai because the
conditional coverage, i.e. the coverage after selection, can be far from the desired
level, and the deviations from the desired level depend on the true θ, which is
unknown in practice.

Pitfalls (cont’d)
■ You could now argue: 'Well, the lecturer himself as well as Benjamini & Yekutieli
forgot what we learned earlier this period: the selection was done using multiple
testing without accounting for it by a procedure like Bonferroni or Holm.'
■ The next table shows the result if one performs the above with Bonferroni, i.e. in
bullets 4 and 5 we replace q_{1−α/2} by q_{1−α/(2·200)}.
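In the sketch above this is a one-line change (illustration):

```python
from scipy.stats import norm

alpha, m = 0.05, 200
q_bonf = norm.ppf(1 - alpha / (2 * m))  # replaces norm.ppf(1 - alpha / 2) in the sketch above
print(q_bonf)  # ~ 3.66 instead of 1.96: stricter selection and wider intervals
```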

Pitfalls (cont’d)
θ      Average coverage after selection (adjusted)
0      0.00
0.5    0.82
1      0.97
2      1.00
4      1.00

■ Average coverage was calculated as above.


■ Once more, there are cases where the coverage is far from the level 1 − 1/200.
■ Bottom line: Selecting the parameters while accounting for multiple testing using
Bonferroni does not solve the problem that the confidence intervals constructed for
the selected parameters may perform poorly.

Conclusion

Conclusion
■ At the beginning we identified two approaches to inference often found in applied
work that deserve our attention:
1. The use of several univariate confidence intervals;
2. Selected inference.
■ We have seen above (in the section on simultaneous intervals) that we can combine
univariate confidence intervals for βi, 1 ≤ i ≤ d, to obtain a confidence region for β.
■ We have seen various examples illustrating that first selecting variables and
then doing inference is problematic in the sense that our inferential methods do not
come with the same statistical guarantees as they would if we had not first selected
a model or relevant parameters (above, the coverage of our confidence intervals
after model selection or after selecting relevant parameters was less than the level
aimed for).
■ Unlike for 1., there is no easy solution to the problems that come with 2.
('easy' is meant here in the sense of being without drawbacks).

Conclusion (cont’d)
■ One possibility to overcome the issues of selected inference discussed above is to
split the sample into two equally sized subsamples: the first part of the sample is
used to select (a model or relevant parameters) and the other one for inference
(a small sketch follows at the end of this slide).
■ The drawback of splitting the sample is that the sample size available for inference
(which is only half of the full sample) might be relatively small, leading to large(r)
variances and therefore to not very informative inference.
■ In our next meeting we will discuss another approach to deal with the problems of
selected inference.
■ In general, one of the purposes of today’s lecture is to make you aware that the
common practice of first selecting (a model or relevant parameters) and then doing
inference needs extra attention.
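A minimal sketch of the sample-splitting idea in the Benjamini & Yekutieli setting, assuming (our own addition) that we have i.i.d. replicates of all measurements so that the sample can actually be split:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
n, d, alpha = 100, 200, 0.05
theta = np.zeros(d)
theta[:10] = 1.0                         # hypothetical truth: ten nonzero means
X = rng.normal(theta, 1.0, size=(n, d))  # n i.i.d. replicates of all d measurements

first, second = X[: n // 2], X[n // 2 :]
q = norm.ppf(1 - alpha / 2)

# Step 1: select using the first half only
z = first.mean(axis=0) * np.sqrt(first.shape[0])
selected = np.flatnonzero(np.abs(z) > q)

# Step 2: standard intervals from the independent second half; because the two
# halves are independent, the unconditional guarantee (1) applies to these intervals
m2 = second.mean(axis=0)
hw = q / np.sqrt(second.shape[0])
for j in selected[:5]:
    print(j, (m2[j] - hw, m2[j] + hw))
```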

References

References
■ Benjamini, Y., Yekutieli, D. (2005). False Discovery Rate-Adjusted Multiple
Confidence Intervals for Selected Parameters. Journal of the American Statistical
Association, 100, 71–81.
■ Giovannucci, E., Ascherio, A., Rimm, E. B., Stampfer, M. J., Colditz, G. A.,
Willett, W. C. (1995). Intake of Carotenoids and Retinol in Relation to Risk of
Prostate Cancer. Journal of the National Cancer Institute, 87, 1767–1776.
■ Hurvich, C. M., Tsai, C.-L. (1990). The Impact of Model Selection on Inference in
Linear Regression. The American Statistician, 44, 214–217.
