15 February 2024
Prediction, non-uniqueness
■ In the last section of Lecture 3 we looked at using our (estimated) linear regression
function to predict future responses.
■ "Estimated" because we used the function
10
Prediction, non-uniqueness (cont’d)
■ Let $\hat\beta^{LS,1}$ and $\hat\beta^{LS,2}$ be two OLS estimators.
■ In general, we have for the squared prediction error (SPE) for a future explanatory vector $x^{future}$ and a future response $Y^{future}$
\[
E\Big( Y^{future} - \sum_{j=1}^{d} \hat\beta_j^{LS,1} x_j^{future} \Big)^2 \neq E\Big( Y^{future} - \sum_{j=1}^{d} \hat\beta_j^{LS,2} x_j^{future} \Big)^2. \tag{1}
\]
11
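The non-uniqueness behind (1) can be made concrete numerically: when $d > n$ the design matrix has a non-trivial null space, so two different coefficient vectors can fit the observed data identically yet predict differently for a future $x$. A minimal sketch (the random data and variable names are illustrative, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 3, 5                          # d > n: the design matrix has a non-trivial null space
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# One OLS solution: the minimum-norm least squares solution
beta1, *_ = np.linalg.lstsq(X, y, rcond=None)

# A second OLS solution: add a vector from the null space of X
null_vec = np.linalg.svd(X)[2][-1]   # a right singular vector spanning part of the null space
beta2 = beta1 + null_vec

x_future = rng.standard_normal(d)

# Both reproduce the fitted values on the data, so both minimize the sum of squares,
# but their predictions for x_future disagree.
same_fit = np.allclose(X @ beta1, X @ beta2)
pred_gap = abs(x_future @ beta1 - x_future @ beta2)
```

Both estimators attain the same (minimal) residual sum of squares, yet `pred_gap` is non-zero, which is exactly the inequality in (1).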
LASSO: Linear model
■ Today, when looking at the linear model we assume that the original data $y_1, \dots, y_n$, $x_{11}, \dots, x_{nd}$ have already been processed so that for the processed data, denoted by $y_1^p, \dots, y_n^p$, $x_{11}^p, \dots, x_{nd}^p$, the centering and normalizing conditions
(i) $(1/n)\sum_{i=1}^{n} x_{ij}^p = 0$, $j = 1, \dots, d$;
(ii) $\sum_{i=1}^{n} (x_{ij}^p)^2 = 1$, $j = 1, \dots, d$;
(iii) $(1/n)\sum_{i=1}^{n} y_i^p = 0$
hold.
■ Processing the original data so that the processed data fulfil (i), (ii), and (iii) can be achieved by subtracting the sample averages and dividing by sample standard deviations, i.e.
◆ $x_{ij}^p = \dfrac{x_{ij} - \bar x_j}{\sqrt{\sum_{i=1}^{n} (x_{ij} - \bar x_j)^2}}$ for $j = 1, \dots, d$, where $\bar x_j = (1/n)\sum_{\ell=1}^{n} x_{\ell j}$;
◆ $y_i^p = y_i - (1/n)\sum_{\ell=1}^{n} y_\ell$.
This makes sure that (i), (ii), and (iii) hold for the processed data.
14
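The centering and scaling step above is a few lines of NumPy (the function name is mine, not from the lecture):

```python
import numpy as np

def preprocess(X, y):
    """Return processed data satisfying the conditions (i)-(iii):
    each column of X is centered and scaled to unit Euclidean norm,
    and y is centered."""
    Xc = X - X.mean(axis=0)                    # (i): column means become 0
    Xp = Xc / np.sqrt((Xc ** 2).sum(axis=0))   # (ii): column sums of squares become 1
    yp = y - y.mean()                          # (iii): the mean of y becomes 0
    return Xp, yp
```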
LASSO: Linear model (cont’d)
■ Now we come to the solution to the above estimation problem. It turns out that for the linear model the solution $\hat\beta$ to
\[
\text{minimize w.r.t. } \beta \in \mathbb{R}^d:\quad \frac{1}{n}\sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{d} \beta_j x_{ij} \Big)^2, \qquad \text{subject to } \sum_{j=1}^{d} |\beta_j| \le c, \tag{2}
\]
does a good job. The estimator resulting from this constrained optimization problem is called the LASSO (Least Absolute Shrinkage and Selection Operator).
■ This is just our ordinary least squares estimator under a constraint.
■ The constraint is meant to make sure that not too many βs are different from zero,
i.e. we do not include too many variables in our model.
■ However, it would not be a good idea to penalize the intercept, because the intercept is just the average of the $y_i$s and not a covariate. That is why we use the centered and scaled data when looking at (2).
16
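The constrained problem (2) can be attacked directly with projected gradient descent, using the Euclidean projection onto the $\ell_1$-ball (the sorting-based projection follows Duchi et al., 2008; function names and iteration counts are my illustrative choices, not the lecture's algorithm):

```python
import numpy as np

def project_l1(v, c):
    """Euclidean projection of v onto the l1-ball {b : sum_j |b_j| <= c}."""
    if np.sum(np.abs(v)) <= c:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]              # |v| sorted in decreasing order
    css = np.cumsum(u)
    ks = np.arange(1, len(v) + 1)
    rho = np.nonzero(u * ks > css - c)[0][-1]  # largest index with positive threshold gap
    theta = (css[rho] - c) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def lasso_constrained(X, y, c, n_iter=2000):
    """Projected gradient descent for problem (2):
    minimize (1/n)||y - X b||^2 subject to sum_j |b_j| <= c."""
    n, d = X.shape
    step = n / (2.0 * np.linalg.norm(X, 2) ** 2)  # 1 / Lipschitz constant of the gradient
    b = np.zeros(d)
    for _ in range(n_iter):
        grad = -(2.0 / n) * X.T @ (y - X @ b)
        b = project_l1(b - step * grad, c)
    return b
```

This is a sketch, not a production solver; it only illustrates that (2) is an ordinary least squares criterion minimized over an $\ell_1$-ball.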
LASSO: Linear model (cont’d)
Remarks (on LASSO)
■ Often the original minimization problem in (2) is given in its equivalent Lagrange form
\[
\text{minimize w.r.t. } \beta:\quad \frac{1}{n}\sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{d} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{d} |\beta_j|. \tag{3}
\]
■ The term $\lambda \sum_{j=1}^{d} |\beta_j|$ is called a penalty. Think of this as follows: when $d > n$ the statistician looked at a too simple problem when estimating by OLS. To penalize her we introduce the penalty $\lambda \sum_{j=1}^{d} |\beta_j|$.
■ We say "too simple" because $\frac{1}{n}\sum_{i=1}^{n} \big( y_i - \sum_{j=1}^{d} \beta_j x_{ij} \big)^2$ will typically be equal to zero for $d > n$.
■ The penalty form of the LASSO also explains why we scale the data (see above) before we estimate using the LASSO. If covariate $x_{i1}$ measures distance to university in metres and $x_{i2}$ measures distance to the beach in kilometres, they would be penalized differently because of the different units used, which does not make sense.
20
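A standard way to compute a minimizer of the Lagrange form (3) is coordinate descent: cycle through the coordinates and minimize over one $\beta_j$ at a time, which has a closed-form soft-threshold solution. A minimal NumPy sketch (function name mine; no claim of efficiency):

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for criterion (3):
    (1/n) * sum_i (y_i - sum_j beta_j x_ij)^2 + lam * sum_j |beta_j|."""
    n, d = X.shape
    beta = np.zeros(d)
    for _ in range(n_iter):
        for j in range(d):
            r_j = y - X @ beta + X[:, j] * beta[j]   # partial residual excluding coordinate j
            z_j = (2.0 / n) * X[:, j] @ r_j
            a_j = (2.0 / n) * X[:, j] @ X[:, j]
            # closed-form one-dimensional minimizer: soft-threshold z_j at lam
            beta[j] = np.sign(z_j) * max(abs(z_j) - lam, 0.0) / a_j
    return beta
```

In an orthogonal design the inner update decouples and reproduces the soft-threshold formula of the theorem below after a single pass.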
LASSO: Linear model (cont’d)
Remarks (on LASSO (cont’d))
■ Good news: criterion function (3) is continuous and if $\beta_1^2 + \dots + \beta_d^2 \to \infty$ it goes to $\infty$. Therefore there is at least one minimizer $\tilde\beta$ of the criterion function
\[
\frac{1}{n}\sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{d} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{d} |\beta_j|.
\]
■ Bad news: the criterion function is convex, but not strictly convex.
■ This is bad news because only for strictly convex functions can we guarantee uniqueness of the minimizer.
■ You may now ask: what did we gain by introducing the penalty term $\lambda \sum_{j=1}^{d} |\beta_j|$ if there is still no guarantee that we have a unique estimator?
■ We come back to this point in a bit. But first we look at a special case that illustrates what the LASSO does.
21
LASSO: Linear model (cont’d)
Theorem (LASSO for orthogonal design): Let $d = n$ and $(1/n)X_n^T X_n = I_{n\times n}$, where $I_{n\times n}$ is the $n \times n$ identity matrix. Then the solution to (3), i.e. the LASSO estimator, is given by
\[
\hat\beta_i = \begin{cases} \hat\beta_{LS,i} - \frac{\lambda}{2}, & \hat\beta_{LS,i} \ge \frac{\lambda}{2}; \\[2pt] 0, & -\frac{\lambda}{2} < \hat\beta_{LS,i} < \frac{\lambda}{2}; \\[2pt] \hat\beta_{LS,i} + \frac{\lambda}{2}, & \hat\beta_{LS,i} \le -\frac{\lambda}{2}; \end{cases} \tag{4}
\]
see Exercise 10. Here $\hat\beta^{LS} = (\hat\beta_{LS,1}, \dots, \hat\beta_{LS,n})$ is the (usual) least squares estimator of $\beta$, i.e. $\hat\beta^{LS}$ is a solution to the unconstrained (= non-penalized) problem
\[
\text{minimize w.r.t. } \beta:\quad \frac{1}{n}\sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{n} \beta_j x_{ij} \Big)^2.
\]
22
LASSO: Linear model (cont’d)
Remarks (Theorem LASSO for orthogonal design):
■ The LASSO estimator in (4) for orthogonal design is said to be a soft-threshold version of the usual least-squares estimator.
■ Threshold estimator because the $i$th ($1 \le i \le d$) component of the usual least squares estimator $\hat\beta_{LS,i}$ is set equal to zero if its absolute value does not exceed the threshold $\lambda/2$.
■ It is called a soft threshold because the estimator starts at zero at $\hat\beta_{LS,i} = \lambda/2$ and increases linearly from there. In contrast, a hard-threshold estimator would jump to $\lambda/2$ and $-\lambda/2$ at $\hat\beta_{LS,i} = \lambda/2$ and $\hat\beta_{LS,i} = -\lambda/2$, respectively.
23
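Formula (4) is a one-liner in NumPy; a quick componentwise sketch (the function name is mine):

```python
import numpy as np

def soft_threshold(beta_ls, lam):
    """Soft threshold (4): shrink each component toward zero by lam/2,
    setting components with absolute value below lam/2 exactly to zero."""
    return np.sign(beta_ls) * np.maximum(np.abs(beta_ls) - lam / 2.0, 0.0)
```

For example, with $\lambda = 1$ the components $(1.0,\ 0.2,\ -0.7)$ map to $(0.5,\ 0.0,\ -0.2)$.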
LASSO: Linear model (cont’d)
Recall that we asked ourselves on slide 22 what we gained by looking at a constrained
problem if there is no guarantee that our solution(=estimator) is unique. Here comes a
result that clarifies:
Theorem (LASSO linear model): Denote the gradient (vector of partial derivatives) of
\[
\frac{1}{n}\sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{d} \beta_j x_{ij} \Big)^2 = (1/n)(Y_n - X_n\beta)^T (Y_n - X_n\beta)
\]
by $G(\beta) = -(2/n)\, X_n^T (Y_n - X_n\beta)$. Then a necessary and sufficient condition for $\hat\beta$ to be a solution of (3) is (with $G_j(\hat\beta)$ denoting the $j$th component of $G(\hat\beta)$)
\[
G_j(\hat\beta) = -\lambda\,\mathrm{sign}(\hat\beta_j) \ \text{ if } \hat\beta_j \neq 0, \qquad |G_j(\hat\beta)| \le \lambda \ \text{ if } \hat\beta_j = 0.
\]
Furthermore, if for a solution $\hat\beta$ we have $|G_j(\hat\beta)| < \lambda$ and hence $\hat\beta_j = 0$, then for any other solution $\tilde\beta$ of (3) we also have $\tilde\beta_j = 0$.
24
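The optimality condition can be checked numerically. In the orthogonal design of the earlier theorem the LASSO is the explicit soft-threshold estimator, so we can form $\hat\beta$ directly and verify that $G(\hat\beta)$ behaves as stated (random data and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
X = np.sqrt(n) * Q                  # orthogonal design: (1/n) X^T X = I
y = rng.standard_normal(n)
lam = 0.8

beta_ls = X.T @ y / n               # least squares estimator
beta_hat = np.sign(beta_ls) * np.maximum(np.abs(beta_ls) - lam / 2.0, 0.0)  # soft threshold (4)

G = -(2.0 / n) * X.T @ (y - X @ beta_hat)   # gradient of the squared-error part at beta_hat

nonzero = beta_hat != 0
# G_j = -lam * sign(beta_hat_j) on the active coordinates,
# |G_j| <= lam on the coordinates set to zero.
cond_active = np.allclose(G[nonzero], -lam * np.sign(beta_hat[nonzero]))
cond_zero = np.all(np.abs(G[~nonzero]) <= lam + 1e-12)
```

Both conditions hold exactly here because in the orthogonal design $G(\beta) = -2\hat\beta^{LS} + 2\beta$, so the soft-threshold estimator satisfies the stated condition by construction.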
LASSO: Linear model (cont’d)
■ Theorem (LASSO linear model, continuous regressors): If the entries of $X \in \mathbb{R}^{n\times d}$ are drawn from a continuous probability distribution on $\mathbb{R}^{nd}$, then for any $y$ and $\lambda > 0$, the LASSO solution is unique.
26
LASSO: Binomial regression (cont’d)
■ For notational convenience we introduce the intercept as follows:
\[
P(Y_i = 1) = \frac{\exp\big(\beta_0 + \sum_{j=1}^{d} \beta_j x_{ij}\big)}{1 + \exp\big(\beta_0 + \sum_{j=1}^{d} \beta_j x_{ij}\big)}, \quad i = 1, \dots, n.
\]
■ Of course, if d > n we have the same problem as for the linear model.
■ How could we generalize the idea of the LASSO for the linear model to the
binomial regression model?
■ Recall from lecture 1 that for the classical linear model minimizing (with fixed
design) the sum of squares is the same as maximizing the likelihood function. This
means that, for the classical linear model (with fixed design), we can see the
LASSO as a penalized log likelihood estimator.
■ The idea for the binomial regression model would then be to penalize the binomial regression log likelihood.
29
LASSO: Binomial regression (cont’d)
■ Minus the log likelihood for a binomial regression can be written as (cf. Exercise 6)
\[
-\frac{1}{n}\sum_{i=1}^{n} \Big[ y_i \Big( \beta_0 + \sum_{j=1}^{d} \beta_j x_{ij} \Big) - \log\Big( 1 + \exp\Big( \beta_0 + \sum_{j=1}^{d} \beta_j x_{ij} \Big)\Big) \Big],
\]
30
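The expression above is cheap to evaluate in NumPy; a sketch (the function name is mine, and `np.log1p` is used for the $\log(1 + \exp(\cdot))$ term):

```python
import numpy as np

def neg_log_lik(beta0, beta, X, y):
    """(1/n) times minus the binomial log likelihood from the slide."""
    eta = beta0 + X @ beta                      # linear predictor beta0 + sum_j beta_j x_ij
    return -np.mean(y * eta - np.log1p(np.exp(eta)))
```

This agrees with the usual Bernoulli log likelihood $\sum_i [y_i \log p_i + (1 - y_i)\log(1 - p_i)]$ after substituting $p_i = e^{\eta_i}/(1 + e^{\eta_i})$.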
LASSO: Binomial regression (cont’d)
■ Because the intercept is not related to an explanatory variable, and because we can see from the previous slide that its role is merely to ensure that the observed average equals the fitted average, we do not penalize it.
■ The LASSO estimator for the binomial regression model is defined to be the solution of
\[
\text{minimize w.r.t. } \beta_0, \beta_1, \dots, \beta_d:\quad -\frac{1}{n}\sum_{i=1}^{n} \Big[ y_i \Big( \beta_0 + \sum_{j=1}^{d} \beta_j x_{ij} \Big) - \log\Big( 1 + \exp\Big( \beta_0 + \sum_{j=1}^{d} \beta_j x_{ij} \Big)\Big) \Big] + \lambda \sum_{j=1}^{d} |\beta_j|.
\]
31
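One simple way to minimize this penalized criterion is proximal gradient descent (ISTA): a gradient step on the smooth negative log likelihood followed by a soft-threshold step for the $\ell_1$ penalty, with a plain gradient step for the unpenalized intercept. A minimal sketch (function names, step size, and iteration count are my illustrative choices, not the lecture's algorithm):

```python
import numpy as np

def soft(v, t):
    """Componentwise soft threshold at t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def logistic_lasso(X, y, lam, step=0.1, n_iter=2000):
    """Proximal gradient (ISTA) for the penalized binomial criterion:
    gradient steps on the smooth part, a soft-threshold (proximal) step
    for lam * sum_j |beta_j|; beta0 is not penalized."""
    n, d = X.shape
    beta0, beta = 0.0, np.zeros(d)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(beta0 + X @ beta)))   # fitted probabilities
        beta0 -= step * np.mean(p - y)                   # plain gradient step for the intercept
        beta = soft(beta - step * X.T @ (p - y) / n, step * lam)
    return beta0, beta
```

At a solution the intercept gradient vanishes, i.e. the average fitted probability equals the observed average of the $y_i$ — the role of the unpenalized intercept described above.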