
Big Data Statistics, meeting 6: When d is bigger than n, part 3

22 February 2024
Missing (cont’d)
■ The last bullet on the previous slide can be read as a cautionary note. Phrased
differently, a priori we cannot conclude that (even) a well-motivated estimator does
what we want, i.e. is close to the true parameter (at least once we have enough
observations).
■ That some additional considerations might be needed is underlined by the next
bullet.
■ Ferguson (1982) gives an example of a continuous distribution on [−1, 1] with
parameter set Θ = [0, 1] where the MLE converges to 1 with probability 1
regardless of the true parameter. Hence, in this example the MLE will never be
close to the true parameter.

6
Guarantees (d < n) (cont’d)
■ Here comes another example that shows how crucial the assumption of d fixed and
n going to infinity is in standard (or classical) statistics.
■ For instance, the average squared approximation error in the classical linear model
is

  (1/n) E[(β̂n − β)ᵀ XnᵀXn (β̂n − β)] = d σ²/n.  (2)
■ Of course, for β̂n to be a good estimator we would like (2) to become small
(i.e. to go to zero as we get more and more observations).
■ If d is fixed and n → ∞ then this does indeed hold.
■ However, if d > n the average squared approximation error will always be bigger
than σ² (and will not converge to zero).
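
As a sanity check of (2), here is a minimal Monte Carlo sketch (illustrative code, not
part of the slides; all names are my own) that estimates the left-hand side of (2) for
the OLS estimator and compares it to dσ²/n:

```python
import numpy as np

# Monte Carlo check of (2): the average squared approximation error of OLS
# should have expectation d*sigma^2/n (assumes d < n, so X^T X is invertible).
rng = np.random.default_rng(0)
n, d, sigma, reps = 200, 5, 1.0, 2000

X = rng.standard_normal((n, d))      # fixed design matrix
beta = rng.standard_normal(d)        # true parameter

errors = []
for _ in range(reps):
    y = X @ beta + sigma * rng.standard_normal(n)
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # OLS estimator
    diff = beta_hat - beta
    errors.append(diff @ (X.T @ X) @ diff / n)     # (1/n)(b^-b)^T X^T X (b^-b)

print("Monte Carlo mean      :", np.mean(errors))    # close to 0.025
print("theoretical d*sigma2/n:", d * sigma**2 / n)   # 5 * 1 / 200 = 0.025
```

With d > n the matrix XnᵀXn is singular and the OLS step above breaks down, which is
exactly the regime this slide warns about.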

10
Quantities of interest
Which criteria could we look at for our LASSO estimator to finally conclude that it is a
good estimator?
1. We could keep the Euclidean distance between β̂ n and β as one criterion, and
hope that it goes to zero with probability going to one.
2. We could also keep the average squared approximation error

  (1/n) (β̂n − β)ᵀ XnᵀXn (β̂n − β)

as a quantity that we would like to be small (or the expectation hereof).
3. Yet, there is one more thing we should look at here. Recall that our LASSO
estimator tends to select variables because several β̂j ’s will equal zero.
You now ask: Does it select the right ones, i.e. are those β̂j ’s unequal to zero for
which we also have βj ≠ 0?
Before we tackle these three points in the above order we have a look at how they
relate to each other.

13
Quantities of interest (cont’d)
To analyze the relation between the three criteria in 1., 2. and 3. we introduce some
additional notation.
■ For a vector v ∈ Rd we define the active set (cf. Quiz 5)

SA(v) := {j ∈ {1, . . . , d} | vj ≠ 0}.

For example, if v = (1.2, 0, 3, 0, 0, −1.5) then SA (v) = {1, 3, 6}.


■ Similarly, we define the inactive set of v to be

SAᶜ(v) := {j ∈ {1, . . . , d} | vj = 0}.

■ As Quiz 5 showed: for a sequence of vectors (vn) we cannot conclude from
vn → v that SA(vn) = SA(v) as n → ∞.
■ If this does not hold for deterministic sequences, we cannot hope that
||β̂n − β|| →ᴾ 0, i.e. criterion 1. in compact notation, implies
SA(β̂n) →ᴾ SA(β) (the sketch below illustrates this).
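
Here is a minimal sketch (illustrative code, not part of the slides) of the active set
and of the Quiz 5 phenomenon, i.e. a sequence vn → v whose active set never equals
SA(v):

```python
import numpy as np

def active_set(v):
    """Active set SA(v): 1-based indices j with v_j != 0."""
    return {j + 1 for j, vj in enumerate(v) if vj != 0}

# The example from this slide:
v = np.array([1.2, 0, 3, 0, 0, -1.5])
print(active_set(v))                  # {1, 3, 6}

# Quiz-5-style counterexample: v_n -> v = (0, 1), but the active sets never agree.
v_limit = np.array([0.0, 1.0])
for n in (1, 10, 100, 1000):
    v_n = np.array([1.0 / n, 1.0])
    print(n, active_set(v_n), active_set(v_limit))   # {1, 2} vs {2} for every n
```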

14
Quantities of interest (cont’d)
■ Clearly, from SA(β̂n) →ᴾ SA(β) we can also not conclude that ||β̂n − β||2 →ᴾ 0.
To see this take a deterministic sequence (vn) with vn = (10, 0, 0) for all n and
v = (1, 0, 0): then SA(vn) = SA(v) = {1} for all n, while ||vn − v||2 = 9 for all n.
Thus, if criterion 3. on the previous slide holds, i.e.

  SA(β̂n) →ᴾ SA(β),

we cannot conclude that ||β̂n − β||2 →ᴾ 0.
■ By taking Xn such that XnᵀXn = Id×d we see that if criterion 2. is fulfilled, i.e.
(1/n)(β̂n − β)ᵀ XnᵀXn (β̂n − β) →ᴾ 0, we cannot conclude that SA(β̂n) →ᴾ SA(β).
■ Finally, in general criterion 1. fulfilled does not imply criterion 2. fulfilled, nor the
other way around.
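
The first counterexample in code form (again illustrative only): the active sets agree
for every n, yet the Euclidean distance stays at 9:

```python
import numpy as np

v_n = [10.0, 0.0, 0.0]   # constant sequence from the slide
v = [1.0, 0.0, 0.0]
active = lambda w: {j + 1 for j, wj in enumerate(w) if wj != 0}
print(active(v_n), active(v))                        # {1} and {1}: criterion 3 holds
print(np.linalg.norm(np.array(v_n) - np.array(v)))   # 9.0 for every n: criterion 1 fails
```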

15
Main result
Here comes the first main result. Note that it is not a probabilistic statement (no
probabilities are involved as we consider a fixed realization). As we will use it later to
obtain asymptotic results I will index λ by n.
Theorem (upper bound on error of LASSO): Suppose that the design matrix Xn
satisfies the restricted eigenvalue bound with parameter γ > 0 over C(S; 3). If for a
particular outcome denoted by εn(ω) we choose λn ≥ 2||Xnᵀεn(ω)||∞/n > 0, then
any LASSO estimator from (1) satisfies the following inequality

  ||β̂n − β||2 ≤ (3/γ) √k λn.  (3)
Remarks (the quantities in Theorem upper bound on error of LASSO):
■ The restricted eigenvalue bound over C(S; 3) is a technical condition;
■ In the theorem εn = (ε1, . . . , εn)ᵀ;
■ For a vector v = (v1, . . . , vn) we denote by ||v||∞ the max norm,
i.e. ||v||∞ = max1≤i≤n |vi|;
■ k is the number of non-zero elements of β.
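
To see the theorem at work numerically, here is an illustrative sketch (not part of the
slides). Two loudly flagged assumptions: γ = 1 is used as a hypothetical value of the
restricted eigenvalue constant (it is not computed), and we assume that the LASSO
formulation in (1) matches scikit-learn's parametrization
(1/(2n))||y − Xb||² + α||b||₁ with α = λn:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, d, k, sigma = 100, 400, 5, 0.5       # d > n, sparse truth

X = rng.standard_normal((n, d))
beta = np.zeros(d)
beta[:k] = 1.0                          # k non-zero coefficients
eps = sigma * rng.standard_normal(n)
y = X @ beta + eps

lam = 2 * np.max(np.abs(X.T @ eps)) / n          # lambda_n >= 2*||X^T eps||_inf / n
# ASSUMPTION: sklearn's objective (1/(2n))||y - Xb||^2 + alpha*||b||_1 equals (1).
beta_hat = Lasso(alpha=lam, fit_intercept=False).fit(X, y).coef_

gamma = 1.0                                      # hypothetical RE constant, not verified
bound = 3 * np.sqrt(k) * lam / gamma             # right-hand side of (3)
print("||beta_hat - beta||_2 :", np.linalg.norm(beta_hat - beta))
print("upper bound (3)       :", bound)
```

For i.i.d. Gaussian designs the restricted eigenvalue condition is known to hold with
high probability, but this sketch does not verify it; the printed bound is therefore
only indicative.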

18
Main result (cont’d)
 
■ Since 1 − 2 exp(−((τ² − 8)/8) · log(d)) will be close to 1 if d is not small, we can
now state the probabilistic version of the Theorem upper bound on error of LASSO.
■ Theorem (upper bound on error of LASSO, probabilistic version):
Suppose that the design matrix Xn satisfies the restricted eigenvalue bound with
parameter γ > 0 over C(S; 3). For λn = τσ √(log(d)/n), τ > √8, we have that any
LASSO estimator from (1) satisfies the following inequality

  ||β̂n − β||2 ≤ (3στ/γ) √(k log(d)/n)  (4)

with probability at least

  1 − 2 exp(−((τ² − 8)/8) · log(d)).

■ Remark: The difference between this theorem and the one on slide 18 is the part
set in red on the slides: the explicit choice of λn and the probability statement.
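
The bound (4) is nothing but (3) with the stated choice of λn plugged in, a step the
slides leave implicit:

```latex
\|\hat{\beta}_n - \beta\|_2
  \le \frac{3}{\gamma}\sqrt{k}\,\lambda_n
  = \frac{3}{\gamma}\sqrt{k}\,\tau\sigma\sqrt{\frac{\log(d)}{n}}
  = \frac{3\sigma\tau}{\gamma}\sqrt{\frac{k\log(d)}{n}}.
```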
24
Main result (cont’d)
To better understand why this is really a result suitable for d ≫ n, we discuss the
upper bound in (4) and compare it to our results in Section ’Guarantees for standard
(d < n) models’.
■ Ignoring the constants 3, σ, k, τ and γ we see that

  √(log(d)/n)

can be small even if d ≫ n. For instance, with d = 17,000 and n = 800 the
displayed term equals 0.11 (see the numerical check at the end of this slide).
■ Of course, there is a tradeoff for τ, because the smaller τ the smaller the bound,
but the less informative the probabilistic bound 1 − 2 exp(−((τ² − 8)/8) · log(d)).
■ Fortunately, even small τ do a good job for the probabilistic bound. In the example
with d = 17,000 we find for τ² = 11

  1 − 2 exp(−((τ² − 8)/8) · log(d)) = 0.948.
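
Both numbers on this slide can be reproduced in a few lines (illustrative check):

```python
import numpy as np

d, n, tau2 = 17_000, 800, 11
print(np.sqrt(np.log(d) / n))                        # 0.110...
print(1 - 2 * np.exp(-(tau2 - 8) / 8 * np.log(d)))   # 0.948...
```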

25
Main result (cont’d)
■ It is also remarkable how good our upper bound in (4) is. If we were in the
classical setting with d fixed, the best bound we could get on the rhs of (4) would
be of order 1/√n.² Here we ONLY have the additional factor log(d) (under the
square root), which is the price we have to pay for our high-dimensional setting.
■ But because log(d) increases super slowly this is an acceptable price.
■ Our bound guarantees consistency of the LASSO estimator as d, n → ∞ even if
d ≫ n.
■ Comparing our bound in (4) to what we said earlier about (2), we see that our
bound overcomes this problem: the dimension d enters the numerator only through
log(d) rather than linearly.
■ But k, the number of non-zero coefficients, needs to be small. Recall that above we
temporarily ignored k in front of √(log(d)/n). But if k ≈ d this is far from being
small.
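
A quick illustrative computation of why k matters: with the numbers from the previous
slide, the full factor √(k log(d)/n) is small for small k but explodes for k ≈ d:

```python
import numpy as np

d, n = 17_000, 800
for k in (5, 50, d):
    print(k, np.sqrt(k * np.log(d) / n))   # 0.25, 0.78 and 14.4
```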

² This is a consequence of the central limit theorem.

26
