
Big Data Statistics, meeting 4: When d is bigger than n, part 1

15 February 2024
Prediction, non-uniqueness
■ In the last section of Lecture 3 we looked at using our (estimated) linear regression
function to predict future responses.
■ Estimated because we used the function
$$\hat\beta_1 x_1^{\text{future}} + \ldots + \hat\beta_d x_d^{\text{future}}$$
to predict a future response $Y^{\text{future}}$ with explanatory variables $x_1^{\text{future}}, \ldots, x_d^{\text{future}}$.
■ For the case $d < n$ there is no ambiguity here because our OLS estimator is unique
(see Lecture 1: we take the inverse of $X_n^T X_n$, which implies uniqueness).
■ If d > n we have multiple OLS estimators (infinitely many to be precise).
■ We now check what the impact of multiple OLS estimators will be.

10
Prediction, non-uniqueness (cont’d)
■ Let $\hat\beta^{LS,1}$ and $\hat\beta^{LS,2}$ be two OLS estimators.
■ In general, the squared prediction error (SPE) for a future explanatory vector $x^{\text{future}}$ and a future response $Y^{\text{future}}$ satisfies
$$E\Big[\Big(Y^{\text{future}} - \sum_{j=1}^{d}\hat\beta_j^{LS,1} x_j^{\text{future}}\Big)^2\Big] \;\neq\; E\Big[\Big(Y^{\text{future}} - \sum_{j=1}^{d}\hat\beta_j^{LS,2} x_j^{\text{future}}\Big)^2\Big], \qquad (1)$$
as can be seen from Quiz 4, Question 2.


■ Even worse: from this quiz we see that only knowledge of the true $\beta$
allows us to decide whether $\hat\beta^{LS,1}$ or $\hat\beta^{LS,2}$ produces the smaller SPE.
■ Bottom line: If d > n we have infinitely many OLS estimates, and looking at the
SPE does not help us make an informed decision about which of them to take (see the numerical sketch below).
■ Remark (Other regression models): The same problem would occur if we
considered, for instance, a Poisson regression estimated by maximum likelihood
when d > n.
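
A minimal numerical sketch of this non-uniqueness (ours, not part of the slides): for d > n, two different OLS solutions fit the training data identically but give different predictions for a new observation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 10                                 # d > n: infinitely many OLS solutions
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# Solution 1: the minimum-norm least-squares solution
beta1, *_ = np.linalg.lstsq(X, y, rcond=None)

# Solution 2: add a vector from the null space of X; the fitted values are
# unchanged, so this is an OLS solution as well
_, _, Vt = np.linalg.svd(X)
null_vec = Vt[-1]                            # direction with X @ null_vec = 0
beta2 = beta1 + 3.0 * null_vec

print(np.allclose(X @ beta1, y))             # True: the training data are fit exactly
print(np.allclose(X @ beta1, X @ beta2))     # True: identical fitted values
x_future = rng.normal(size=d)
print(x_future @ beta1, x_future @ beta2)    # different predictions for a new x
```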

11
LASSO: Linear model
■ Today, when looking at the linear model we assume that the original data
$y_1, \ldots, y_n, x_{11}, \ldots, x_{nd}$ have already been processed so that the processed
data, denoted by $y_1^p, \ldots, y_n^p, x_{11}^p, \ldots, x_{nd}^p$, satisfy the centering and normalizing conditions
(i) $(1/n)\sum_{i=1}^{n} x_{ij}^p = 0$, $j = 1, \ldots, d$;
(ii) $(1/n)\sum_{i=1}^{n} y_i^p = 0$;
(iii) $(1/n)\sum_{i=1}^{n} (x_{ij}^p)^2 = 1$, $j = 1, \ldots, d$.
■ Processing the original data so that the processed data fulfil (i), (ii), and (iii) can be
achieved by subtracting the sample averages and dividing by the sample standard
deviations, i.e.
◆ $x_{ij}^p = \dfrac{x_{ij} - \bar{x}_j}{\sqrt{(1/n)\sum_{i=1}^{n}(x_{ij} - \bar{x}_j)^2}}$ for $j = 1, \ldots, d$, where $\bar{x}_j = (1/n)\sum_{\ell=1}^{n} x_{\ell j}$;
◆ $y_i^p = y_i - (1/n)\sum_{\ell=1}^{n} y_\ell$.
This makes sure that (i), (ii), and (iii) hold for the processed data.
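
A minimal numpy sketch of this preprocessing step (ours, not from the slides; the function name standardize is our choice):

```python
import numpy as np

def standardize(X, y):
    """Center y and center/scale the columns of X so that the processed
    data satisfy (i), (ii), and (iii)."""
    Xp = X - X.mean(axis=0)
    Xp = Xp / np.sqrt((Xp ** 2).mean(axis=0))   # divide by sqrt((1/n) * sum of squared deviations)
    yp = y - y.mean()
    return Xp, yp

rng = np.random.default_rng(1)
X = rng.normal(loc=5.0, scale=3.0, size=(50, 4))
y = rng.normal(size=50)
Xp, yp = standardize(X, y)
print(np.allclose(Xp.mean(axis=0), 0),          # (i)
      np.isclose(yp.mean(), 0),                 # (ii)
      np.allclose((Xp ** 2).mean(axis=0), 1))   # (iii)
```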

14
LASSO: Linear model (cont’d)
■ Now we come to the solution to the above estimation problem. It turns out that
for the linear model the solution $\hat\beta$ to
$$\text{minimize w.r.t. } \beta \in \mathbb{R}^d: \quad \frac{1}{n}\sum_{i=1}^{n}\Big(y_i - \sum_{j=1}^{d}\beta_j x_{ij}\Big)^2, \quad \text{subject to } \sum_{j=1}^{d}|\beta_j| \le c, \qquad (2)$$
does a good job. The estimator resulting from this constrained optimization
problem is called the LASSO (Least Absolute Shrinkage and Selection Operator).
■ This is just our ordinary least squares estimator under a constraint.
■ The constraint is meant to make sure that not too many βs are different from zero,
i.e. we do not include too many variables in our model.
■ However, it would not be a good idea to subject the intercept to this constraint, because the
intercept is just the average of the $y_i$'s and not attached to a covariate. That is why we use the
centered and scaled data when looking at (2): after centering, no intercept is needed.
16
LASSO: Linear model (cont’d)
Remarks (on LASSO)
■ Often the original minimization problem in (2) is given in its equivalent Lagrange form
$$\text{minimize w.r.t. } \beta: \quad \frac{1}{n}\sum_{i=1}^{n}\Big(y_i - \sum_{j=1}^{d}\beta_j x_{ij}\Big)^2 + \lambda \sum_{j=1}^{d}|\beta_j|. \qquad (3)$$
■ The term $\lambda \sum_{j=1}^{d}|\beta_j|$ is called a penalty. Think of it as follows: when $d > n$
the statistician looked at too simple a problem when estimating by OLS. To
penalize her we introduce the penalty $\lambda \sum_{j=1}^{d}|\beta_j|$.
We say too simple because for $d > n$ the minimum of $\frac{1}{n}\sum_{i=1}^{n}\big(y_i - \sum_{j=1}^{d}\beta_j x_{ij}\big)^2$ will typically be equal to zero.
■ The penalty form of the LASSO also explains why we scale the data (see above)
before we estimate using the LASSO. If covariate $x_{i1}$ measures the distance to the
university in metres and $x_{i2}$ measures the distance to the beach in kilometres, their
coefficients would be penalized differently simply because of the different units used,
which does not make sense.
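
In practice the Lagrange form (3) is what standard software solves. A minimal sketch (ours, not from the slides) using scikit-learn; we assume sklearn's Lasso minimizes $(1/(2n))\|y - X\beta\|_2^2 + \alpha\|\beta\|_1$, so its alpha corresponds to $\lambda/2$ in (3).

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, d = 50, 200                          # d > n
X = rng.normal(size=(n, d))
beta_true = np.zeros(d)
beta_true[:3] = [2.0, -1.5, 1.0]        # only 3 active coefficients
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# standardize as on the earlier slide (columns: mean 0, mean square 1; y centered)
Xp = X - X.mean(axis=0)
Xp /= np.sqrt((Xp ** 2).mean(axis=0))
yp = y - y.mean()

lam = 0.2                               # lambda in (3); sklearn's alpha = lambda / 2
lasso = Lasso(alpha=lam / 2, fit_intercept=False).fit(Xp, yp)
print("non-zero coefficients:", np.flatnonzero(lasso.coef_))
```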

20
LASSO: Linear model (cont’d)
Remarks (on LASSO (cont’d))
■ Good news: Criterion function (3) is continuous and it tends to $\infty$ if $\beta_1^2 + \ldots + \beta_d^2 \to \infty$.
Therefore there is at least one minimizer $\tilde\beta$ of the criterion function
$$\frac{1}{n}\sum_{i=1}^{n}\Big(y_i - \sum_{j=1}^{d}\beta_j x_{ij}\Big)^2 + \lambda \sum_{j=1}^{d}|\beta_j|.$$

■ Bad news: the criterion function is convex, but not strictly convex.
■ This is bad news because only for strictly convex functions can we guarantee
uniqueness of the minimizer.
■ You now ask: what did we gain by introducing the penalty term $\lambda \sum_{j=1}^{d}|\beta_j|$ if
there is still no guarantee that we have a unique estimator?
■ We come back to this point in a bit. But first we look at a special case that
illustrates what the LASSO does.

21
LASSO: Linear model (cont’d)
Theorem (LASSO for orthogonal design): Let $d = n$ and $(1/n)X_n^T X_n = I_{n\times n}$, where
$I_{n\times n}$ is the $n \times n$ identity matrix. Then the solution to (3), i.e. the LASSO estimator, is given by
$$\hat\beta_i = \begin{cases} \hat\beta_{LS,i} - \frac{\lambda}{2}, & \hat\beta_{LS,i} \ge \frac{\lambda}{2};\\[2pt] 0, & -\frac{\lambda}{2} < \hat\beta_{LS,i} < \frac{\lambda}{2};\\[2pt] \hat\beta_{LS,i} + \frac{\lambda}{2}, & \hat\beta_{LS,i} \le -\frac{\lambda}{2}; \end{cases} \qquad (4)$$
see Exercise 10. Here $\hat\beta_{LS} = (\hat\beta_{LS,1}, \ldots, \hat\beta_{LS,n})$ is the (usual) least squares estimator
of $\beta$, i.e. $\hat\beta_{LS}$ is a solution to the unconstrained (= non-penalized) problem
$$\text{minimize w.r.t. } \beta: \quad \frac{1}{n}\sum_{i=1}^{n}\Big(y_i - \sum_{j=1}^{n}\beta_j x_{ij}\Big)^2.$$

22
LASSO: Linear model (cont’d)
Remarks (Theorem LASSO for orthogonal design):
■ The LASSO estimator in (4) for orthogonal design is said to be a soft-threshold
version of the usual least-squares estimator.
■ Threshold estimator because the $i$th ($1 \le i \le d$) component of the usual least
squares estimator, $\hat\beta_{LS,i}$, is set equal to zero if its absolute value does not exceed the
threshold $\lambda/2$.
■ It is called a soft threshold because, as a function of $\hat\beta_{LS,i}$, the estimator starts at zero at $\hat\beta_{LS,i} = \lambda/2$ and increases
linearly from there. In contrast, a hard-threshold estimator would jump to $\lambda/2$ and
$-\lambda/2$ at $\hat\beta_{LS,i} = \lambda/2$ and $\hat\beta_{LS,i} = -\lambda/2$, respectively.
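
As a small illustration (ours, not from the slides), the soft-threshold map (4) and a hard-threshold counterpart can be coded directly; the function names are our choice.

```python
import numpy as np

def soft_threshold(beta_ls, lam):
    """Componentwise soft threshold as in (4): shrink towards zero by lam/2
    and set to zero whenever |beta_ls_i| <= lam/2."""
    return np.sign(beta_ls) * np.maximum(np.abs(beta_ls) - lam / 2, 0.0)

def hard_threshold(beta_ls, lam):
    """Componentwise hard threshold: keep beta_ls_i unchanged if |beta_ls_i| > lam/2,
    otherwise set it to zero (no shrinkage, hence the jump at the threshold)."""
    return np.where(np.abs(beta_ls) > lam / 2, beta_ls, 0.0)

beta_ls = np.array([-1.0, -0.3, 0.1, 0.6, 2.0])
print(soft_threshold(beta_ls, lam=1.0))   # [-0.5, 0., 0., 0.1, 1.5]
print(hard_threshold(beta_ls, lam=1.0))   # [-1.,  0., 0., 0.6, 2. ]
```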

23
LASSO: Linear model (cont’d)
Recall that we asked ourselves on slide 21 what we gained by looking at a constrained
problem if there is no guarantee that our solution (= estimator) is unique. Here is a
result that clarifies the situation:
Theorem (LASSO linear model): Denote the gradient (vector of partial derivatives) of
$$\frac{1}{n}\sum_{i=1}^{n}\Big(y_i - \sum_{j=1}^{d}\beta_j x_{ij}\Big)^2 = (1/n)(Y_n - X_n\beta)^T(Y_n - X_n\beta)$$
by $G(\beta) = -(2/n)\,X_n^T(Y_n - X_n\beta)$. Then a necessary and sufficient condition for
$\hat\beta$ to be a solution of (3) is (with $G_j(\hat\beta)$ denoting the $j$th component of $G(\hat\beta)$)
$$G_j(\hat\beta) = -\operatorname{sign}(\hat\beta_j)\,\lambda \quad \text{if } \hat\beta_j \neq 0; \qquad |G_j(\hat\beta)| \le \lambda \quad \text{if } \hat\beta_j = 0.$$
Furthermore, if for a solution $\hat\beta$ we have $|G_j(\hat\beta)| < \lambda$ and hence $\hat\beta_j = 0$, then for any
other solution $\tilde\beta$ of (3) we have $\tilde\beta_j = 0$ as well.
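
These optimality conditions can be checked numerically (an illustration of ours, not from the slides); we again assume that sklearn's alpha corresponds to $\lambda/2$ in (3).

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
n, d = 40, 120
X = rng.normal(size=(n, d))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=n)

lam = 0.5                                   # lambda in (3); sklearn's alpha = lambda / 2
fit = Lasso(alpha=lam / 2, fit_intercept=False, tol=1e-12, max_iter=100_000).fit(X, y)
beta_hat = fit.coef_

G = -(2 / n) * X.T @ (y - X @ beta_hat)     # gradient of the least-squares part
active = beta_hat != 0
# active coordinates: G_j should equal -sign(beta_hat_j) * lambda (up to solver tolerance)
print(np.abs(G[active] + np.sign(beta_hat[active]) * lam).max())
# inactive coordinates: |G_j| should not exceed lambda
print(np.abs(G[~active]).max() <= lam + 1e-8)
```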

24
LASSO: Linear model (cont’d)
■ Theorem (LASSO linear model, continuous regressors): If the entries of
$X \in \mathbb{R}^{n\times d}$ are drawn from a continuous probability distribution on $\mathbb{R}^{nd}$, then for
any $y$ and $\lambda > 0$ the LASSO solution is, with probability one, unique and given by
$$\hat\beta_{E^C} = 0 \quad \text{and} \quad \hat\beta_E = (X_E^T X_E)^{-1}(X_E^T y - \lambda s), \qquad (5)$$
where, for the given $\lambda$,
$$E = \{i \in \{1,\ldots,d\} : |X_i^T(y - X\hat\beta)| = \lambda\} \quad \text{and} \quad s = \operatorname{sign}\big(X_E^T(y - X\hat\beta)\big),$$
and $E^C$ is the complement of $E$.
Here, for an $n \times d$ matrix $A$ with $d$ columns $A_1, \ldots, A_d$ and an index set
$I = \{i_1, \ldots, i_k\} \subset \{1, \ldots, d\}$ we denote by $A_I$ the $n \times k$ matrix with columns
$A_{i_1}, \ldots, A_{i_k}$.
■ Remark (Theorem LASSO linear model, continuous regressors): Note that under
the assumptions of the theorem the LASSO estimator $\hat\beta$ has at most $\min\{n, d\}$
non-zero entries. This follows because if $|E| > \min\{n, d\}$, then $X_E^T X_E$ would not
be invertible (its rank is at most $\min\{n, d\}$). Think of the above data
examples to see by how much the LASSO reduces the number of parameters; the sketch below illustrates this on simulated data.
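
A quick simulated check (ours, not from the slides) that the fitted LASSO has at most $\min\{n, d\}$ non-zero coefficients:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
n, d = 20, 500                                   # far more parameters than observations
X = rng.normal(size=(n, d))
y = X[:, 0] + rng.normal(size=n)

for lam in (0.2, 0.5, 1.0):                      # lambda in (3); sklearn's alpha = lambda / 2
    fit = Lasso(alpha=lam / 2, fit_intercept=False, tol=1e-10, max_iter=200_000).fit(X, y)
    n_nonzero = int(np.sum(fit.coef_ != 0))
    print(lam, n_nonzero)                        # expected: at most min(n, d) = 20 non-zero coefficients
```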

26
LASSO: Binomial regression (cont’d)
■ For notational convenience we introduce the intercept as follows:
$$P(Y_i = 1) = \frac{\exp\!\big(\beta_0 + \sum_{j=1}^{d}\beta_j x_{ij}\big)}{1 + \exp\!\big(\beta_0 + \sum_{j=1}^{d}\beta_j x_{ij}\big)}, \quad i = 1, \ldots, n.$$

■ Of course, if d > n we have the same problem as for the linear model.
■ How could we generalize the idea of the LASSO for the linear model to the
binomial regression model?
■ Recall from lecture 1 that for the classical linear model minimizing (with fixed
design) the sum of squares is the same as maximizing the likelihood function. This
means that, for the classical linear model (with fixed design), we can see the
LASSO as a penalized log likelihood estimator.
■ The idea for the binomial regression model would then be to penalize the binomial
regression log-likelihood.

29
LASSO: Binomial regression (cont’d)
■ Minus the log-likelihood for a binomial regression can be written as (cf. Exercise 6)
$$-\frac{1}{n}\sum_{i=1}^{n}\Big\{ y_i\Big(\beta_0 + \sum_{j=1}^{d}\beta_j x_{ij}\Big) - \log\Big(1 + \exp\Big(\beta_0 + \sum_{j=1}^{d}\beta_j x_{ij}\Big)\Big)\Big\},$$
where we multiplied by $1/n$, which has no bearing on the solution.


■ Taking the derivative with respect to $\beta_0$ and equating it to zero, we see that the inclusion of an intercept
ensures that the average of the fitted probabilities equals the average of the $y_i$, i.e.
$$\frac{1}{n}\sum_{i=1}^{n} y_i = \frac{1}{n}\sum_{i=1}^{n} \frac{\exp\!\big(\hat\beta_0 + \sum_{j=1}^{d}\hat\beta_j x_{ij}\big)}{1 + \exp\!\big(\hat\beta_0 + \sum_{j=1}^{d}\hat\beta_j x_{ij}\big)},$$
where $\hat\beta = (\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_d)$ is the MLE.
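
A small numerical check of this identity (ours, not from the slides). We use scikit-learn's LogisticRegression with a very large C so that its default ridge penalty becomes negligible and the fit is effectively the unpenalized MLE with intercept.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
n, d = 200, 3
X = rng.normal(size=(n, d))
p = 1 / (1 + np.exp(-(0.5 + X @ np.array([1.0, -1.0, 0.0]))))
y = rng.binomial(1, p)

mle = LogisticRegression(C=1e10).fit(X, y)        # effectively unpenalized; intercept included
fitted_prob = mle.predict_proba(X)[:, 1]
print(y.mean(), fitted_prob.mean())               # the two averages agree (up to tolerance)
```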

30
LASSO: Binomial regression (cont’d)
■ Because the intercept is not related to an explanatory variable and because, as we
saw on the previous slide, its role is merely to ensure that the observed
average equals the fitted average, we do not penalize it.
■ The LASSO estimator for the binomial regression model is defined to be the solution of
$$\text{minimize w.r.t. } \beta_0, \beta_1, \ldots, \beta_d: \quad -\frac{1}{n}\sum_{i=1}^{n}\Big\{ y_i\Big(\beta_0 + \sum_{j=1}^{d}\beta_j x_{ij}\Big) - \log\Big(1 + \exp\Big(\beta_0 + \sum_{j=1}^{d}\beta_j x_{ij}\Big)\Big)\Big\} + \lambda\sum_{j=1}^{d}|\beta_j|.$$

■ Comparing with minus the log-likelihood, we see that this is indeed a penalized
version of it, but without penalizing $\beta_0$.
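
For completeness, a sketch (ours, not from the slides) of how such a penalized binomial regression is typically fitted in practice with scikit-learn. Its parameter C is an inverse penalty strength, so the exact correspondence to $\lambda$ above depends on scikit-learn's internal scaling; with the 'saga' solver the intercept is, to our understanding, left unpenalized, in line with the definition above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(6)
n, d = 80, 300                                   # d > n
X = rng.normal(size=(n, d))
p = 1 / (1 + np.exp(-(0.5 + 2 * X[:, 0] - 1.5 * X[:, 1])))
y = rng.binomial(1, p)

# l1-penalized logistic regression; smaller C means a stronger penalty
lasso_logit = LogisticRegression(penalty="l1", solver="saga", C=0.5, max_iter=10_000).fit(X, y)
print("intercept:", lasso_logit.intercept_)
print("non-zero coefficients:", np.flatnonzero(lasso_logit.coef_[0]))
```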

31
