
Master 2 MathBigData

S. Gaïffas¹

November 3, 2014

¹ CMAP, École Polytechnique
1 Supervised learning recap
    Introduction
    Loss functions, linearity
2 Penalization
    Introduction
    Ridge
    Sparsity
    Lasso
3 Some tools from convex optimization
    Quick recap
    Proximal operators
    Subdifferential, Fenchel conjugate
4 ISTA and FISTA
    The general problem
    Gradient descent
    ISTA
    FISTA
    Linesearch
5 Duality gap
    Fenchel duality
    Duality gap
Supervised learning

Setting
Data xi ∈ X , yi ∈ Y for i = 1, . . . , n
xi is an input and yi is an output
xi are called features and xi ∈ X = Rd
yi are called labels
Y = {−1, 1} or Y = {0, 1} for binary classification
Y = {1, . . . , K } for multiclass classification
Y = R for regression
Goal: given a new x, predict the corresponding y.
Supervised learning – Loss functions, linearity

What to do
Minimize with respect to f : Rd → R
    R_n(f) = (1/n) ∑_{i=1}^n ℓ(y_i, f(x_i))

where
ℓ is a loss function. ℓ(y_i, f(x_i)) small means y_i is close to f(x_i)
R_n(f) is called the goodness-of-fit or empirical risk
Computing f is called the training or estimation step
Supervised learning – Loss functions, linearity

When d is large, it is impossible to fit a complex function f on the data
When n is large, training is too time-consuming for a complex function f
Hence:
Choose a linear function f:

    f(x) = ⟨x, θ⟩ = ∑_{j=1}^d x_j θ_j,

for some parameter vector θ ∈ R^d to be trained

Remark: f is linear with respect to x_i, but you can build the features x_i from the data. Hence, the model is not necessarily linear w.r.t. the original inputs
Supervised learning – Loss functions, linearity

Training the model: compute

    θ̂ ∈ argmin_{θ ∈ R^d} R_n(θ)

where

    R_n(θ) = (1/n) ∑_{i=1}^n ℓ(y_i, ⟨x_i, θ⟩).

Classical losses
ℓ(y, z) = ½ (y − z)²: least-squares loss, linear regression (label y ∈ R)
ℓ(y, z) = (1 − yz)_+: hinge loss, or SVM loss (binary classification, label y ∈ {−1, 1})
ℓ(y, z) = log(1 + e^{−yz}): logistic loss (binary classification, label y ∈ {−1, 1})
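To fix ideas, here is a minimal NumPy sketch of these three losses; the function names are ours, not from the lecture, and each function takes a label y and a prediction z = ⟨x, θ⟩:

import numpy as np

def least_squares_loss(y, z):
    # 1/2 (y - z)^2, for regression with y in R
    return 0.5 * (y - z) ** 2

def hinge_loss(y, z):
    # (1 - y z)_+, for binary labels y in {-1, +1}
    return np.maximum(0.0, 1.0 - y * z)

def logistic_loss(y, z):
    # log(1 + exp(-y z)), computed in a numerically stable way
    return np.logaddexp(0.0, -y * z)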
Supervised learning – Loss functions, linearity

[Figure: plots of the three losses ℓ_least-sq(y, z) = ½ (y − z)², ℓ_hinge(y, z) = (1 − yz)_+ and ℓ_logistic(y, z) = log(1 + e^{−yz})]
Penalization – Introduction

You should never actually fit a model by minimizing only


    θ̂_n ∈ argmin_{θ ∈ R^d} (1/n) ∑_{i=1}^n ℓ(y_i, ⟨x_i, θ⟩).

You should minimize instead


    θ̂_n ∈ argmin_{θ ∈ R^d} { (1/n) ∑_{i=1}^n ℓ(y_i, ⟨x_i, θ⟩) + λ pen(θ) }

where
pen is a penalization function that encodes a prior assumption on θ: it forbids θ from being “too complex”
λ > 0 is a tuning or smoothing parameter that balances goodness-of-fit and penalization
Penalization – Introduction

Why use penalization?


    θ̂ ∈ argmin_{θ ∈ R^d} { (1/n) ∑_{i=1}^n ℓ(y_i, ⟨x_i, θ⟩) + λ pen(θ) }

Penalization, for a well-chosen λ > 0, helps to avoid overfitting


Penalization – Ridge

The most classical penalization is the ridge penalization

    pen(θ) = ‖θ‖₂² = ∑_{j=1}^d θ_j².

It penalizes the energy of θ, as measured by the squared ℓ₂-norm

Sparsity-inducing penalization


It would be nice to find a model where θ̂_j = 0 for many coordinates j:
only a few features are useful for prediction, so the model is simpler, with a smaller dimension
We say that θ̂ is sparse
How to do it?
Penalization – Sparsity

It is tempting to use

    θ̂ ∈ argmin_{θ ∈ R^d} { (1/n) ∑_{i=1}^n ℓ(y_i, ⟨x_i, θ⟩) + λ ‖θ‖₀ },

where
    ‖θ‖₀ = #{j : θ_j ≠ 0}.
But, to solve this exactly, you need to try all possible subsets of
non-zero coordinates of θ: 2^d possibilities. Impossible!
Penalization – Lasso

A solution: Lasso penalization (least absolute shrinkage and selection operator)

    pen(θ) = ‖θ‖₁ = ∑_{j=1}^d |θ_j|.

This is a penalization based on the ℓ₁-norm ‖ · ‖₁.


In a noiseless setting [compressed sensing, basis pursuit], and in a certain regime, ℓ₁-minimization gives the “same solution” as ℓ₀-minimization
But the Lasso-penalized problem is easy to compute
Why does ℓ₁-penalization lead to sparsity?
Penalization – Lasso

Why does ℓ₂ (ridge) not induce sparsity?


Penalization – Lasso

Hence, a minimizer

    θ̂ ∈ argmin_{θ ∈ R^d} { (1/n) ∑_{i=1}^n ℓ(y_i, ⟨x_i, θ⟩) + λ ‖θ‖₁ }

is typically sparse (θ̂_j = 0 for many j).


For λ large (larger than some constant), θ̂_j = 0 for all j
For λ = 0, there is no penalization
Between the two, the “sparsity” depends on the value of λ:
once again, it is a regularization or penalization parameter
Penalization – Lasso

For the least-squares loss,

    θ̂ ∈ argmin_{θ ∈ R^d} { 1/(2n) ‖Y − Xθ‖₂² + (λ/2) ‖θ‖₂² }

is called ridge linear regression, and

    θ̂ ∈ argmin_{θ ∈ R^d} { 1/(2n) ‖Y − Xθ‖₂² + λ ‖θ‖₁ }

is called Lasso linear regression.


Penalization – Lasso

Consider the minimization problem

    min_{a ∈ R} ½ (a − b)² + λ |a|

for λ > 0 and b ∈ R


Derivative at 0⁺: d_+ = λ − b
Derivative at 0⁻: d_− = −λ − b
Let a* be the solution:
    a* = 0 iff d_+ ≥ 0 and d_− ≤ 0, namely |b| ≤ λ
    a* ≥ 0 iff d_+ ≤ 0, namely b ≥ λ, and then a* = b − λ
    a* ≤ 0 iff d_− ≥ 0, namely b ≤ −λ, and then a* = b + λ
Hence
    a* = sign(b) (|b| − λ)_+
where a_+ = max(0, a)
Penalization – Lasso

As a consequence, we have
    a* = argmin_{a ∈ R^d} { ½ ‖a − b‖₂² + λ ‖a‖₁ } = S_λ(b)

where
    S_λ(b) = sign(b) (|b| − λ)_+
is the soft-thresholding operator
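A minimal NumPy sketch of the soft-thresholding operator, applied coordinate-wise (the function name is ours):

import numpy as np

def soft_thresholding(b, lam):
    # S_lam(b) = sign(b) * (|b| - lam)_+, coordinate-wise
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

# Coordinates with |b_j| <= lam are set exactly to zero:
b = np.array([3.0, -0.5, 1.2, -2.0])
print(soft_thresholding(b, 1.0))  # approximately [ 2.  -0.   0.2  -1. ]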
Quick recap

f : R^d → [−∞, +∞] is
convex if

    f(tx + (1 − t)x′) ≤ t f(x) + (1 − t) f(x′)

for any x, x′ ∈ R^d, t ∈ [0, 1]


proper if it never takes the value −∞ and is not identically +∞ (so a proper f takes values in (−∞, +∞])
lower-semicontinuous (l.s.c.) if and only if for any x and any sequence x_n → x we have

    f(x) ≤ lim inf_n f(x_n).

The set of such functions is often denoted Γ0 (Rd ) or Γ0


Proximal operator

For any convex l.s.c. function g and any y ∈ R^d, we define the proximal operator

    prox_g(y) = argmin_{x ∈ R^d} { ½ ‖x − y‖₂² + g(x) }

(strongly convex problem ⇒ unique minimizer)


We have seen that soft-thresholding is the proximal operator of the ℓ₁-norm:

    prox_{λ‖·‖₁}(y) = S_λ(y) = sign(y) (|y| − λ)_+

Proximal operators and proximal algorithms are now fundamental


tools for optimization in machine learning
Examples of proximal operators

g (x) = c for a constant c, proxg = Id


If C is a convex set and

    g(x) = δ_C(x) = 0 if x ∈ C, +∞ if x ∉ C,

then
    prox_g = proj_C = the projection onto C.
If g(x) = ⟨b, x⟩ + c, then

    prox_{λg}(x) = x − λb

If g(x) = ½ x^⊤ A x + ⟨b, x⟩ + c with A symmetric positive semi-definite, then

    prox_{λg}(x) = (I + λA)^{−1} (x − λb)
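As an illustration (our own sketch, not from the lecture), this last proximal operator can be computed by solving a linear system rather than forming the inverse:

import numpy as np

def prox_quadratic(x, A, b, lam):
    # prox_{lam g}(x) = (I + lam A)^{-1} (x - lam b) for g(x) = 1/2 x^T A x + <b, x> + c
    d = x.shape[0]
    return np.linalg.solve(np.eye(d) + lam * A, x - lam * b)

# Example with a random symmetric positive semi-definite A:
rng = np.random.default_rng(0)
M = rng.normal(size=(3, 3))
A = M @ M.T
b = rng.normal(size=3)
x = rng.normal(size=3)
print(prox_quadratic(x, A, b, lam=0.1))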


Examples of proximal operators

If g(x) = ½ ‖x‖₂², then

    prox_{λg}(x) = x / (1 + λ)   (the shrinkage operator)
If g(x) = −log x, then

    prox_{λg}(x) = (x + √(x² + 4λ)) / 2
If g(x) = ‖x‖₂, then

    prox_{λg}(x) = (1 − λ/‖x‖₂)_+ x,

the block soft-thresholding operator


Examples of proximal operators

If g(x) = ‖x‖₁ + (γ/2) ‖x‖₂² (elastic-net), where γ > 0, then

    prox_{λg}(x) = prox_{λ‖·‖₁}(x) / (1 + λγ)
If g(x) = ∑_{g ∈ G} ‖x_g‖₂, where G is a partition of {1, . . . , d}, then

    (prox_{λg}(x))_g = (1 − λ/‖x_g‖₂)_+ x_g

for each block g ∈ G. This is block soft-thresholding, used for the group-Lasso
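A minimal NumPy sketch of the elastic-net and group proximal operators above (helper names are ours; groups is an explicit list of index blocks):

import numpy as np

def soft_thresholding(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def prox_elastic_net(x, lam, gamma):
    # prox of lam * (||.||_1 + gamma/2 ||.||_2^2): scaled soft-thresholding
    return soft_thresholding(x, lam) / (1.0 + lam * gamma)

def block_soft_thresholding(x, lam, groups):
    # prox of lam * sum_g ||x_g||_2, where groups partitions {0, ..., d-1}
    out = np.zeros_like(x)
    for g in groups:
        norm_g = np.linalg.norm(x[g])
        if norm_g > 0.0:
            out[g] = max(0.0, 1.0 - lam / norm_g) * x[g]
    return out

x = np.array([1.5, -0.2, 0.3, -2.0])
print(prox_elastic_net(x, 0.5, 1.0))
print(block_soft_thresholding(x, 0.5, groups=[[0, 1], [2, 3]]))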


Subdifferential, Fenchel conjuguate

The subdifferential of f ∈ Γ₀ at x is the set

    ∂f(x) = { g ∈ R^d : f(y) ≥ ⟨g, y − x⟩ + f(x) for all y ∈ R^d }

Each element of ∂f(x) is called a subgradient


Optimality criterion

0 ∈ ∂f (x) iff f (x) ≤ f (y ) ∀y

If f is differentiable at x, then ∂f (x) = {∇f (x)}


Example: for f(x) = |x|, ∂f(0) = [−1, 1]
Fenchel conjugate

The Fenchel conjugate of a function f on R^d is given by

    f*(x) = sup_{y ∈ R^d} { ⟨x, y⟩ − f(y) }

It is always a convex function (as a supremum of affine functions).


f*(y) is the smallest constant c such that the affine function

    x ↦ ⟨y, x⟩ − c

is below f.

Fenchel-Young inequality: we have

    f(x) + f*(y) ≥ ⟨x, y⟩

for any x and y.


Fenchel conjugate and subgradients

Legendre-Fenchel identity: if f ∈ Γ₀, we have

    ⟨x, y⟩ = f(x) + f*(y)  ⇔  y ∈ ∂f(x)  ⇔  x ∈ ∂f*(y)

Example. Fenchel conjugate of a norm ‖ · ‖:

    (‖ · ‖)*(x) = sup_{y ∈ R^d} { ⟨x, y⟩ − ‖y‖ } = δ_{{y ∈ R^d : ‖y‖_* ≤ 1}}(x),

where
    ‖x‖_* = max_{y ∈ R^d : ‖y‖ ≤ 1} ⟨x, y⟩

is the dual norm of ‖ · ‖ [recall that the dual norm of ‖ · ‖_p is ‖ · ‖_q with 1/p + 1/q = 1]


Some extras

f : R^d → [−∞, +∞] is
L-smooth if it is continuously differentiable and

    ‖∇f(x) − ∇f(y)‖₂ ≤ L ‖x − y‖₂  for any x, y ∈ R^d.

When f is twice continuously differentiable, this is equivalent to H_f(x) ⪯ L I_d for all x, where H_f(x) is the Hessian at x [i.e. L I_d − H_f(x) is positive semi-definite]
µ-strongly convex if f(·) − (µ/2) ‖ · ‖₂² is convex. Equivalent to

    f(y) ≥ f(x) + ⟨g, y − x⟩ + (µ/2) ‖y − x‖₂²

for any g ∈ ∂f(x). Also equivalent to H_f(x) ⪰ µ I_d when f is twice differentiable.

f is µ-strongly convex iff f* is (1/µ)-smooth
The general problem we want to solve

How to solve

    θ̂ ∈ argmin_{θ ∈ R^d} { (1/n) ∑_{i=1}^n ℓ(y_i, ⟨x_i, θ⟩) + λ pen(θ) } ???

Put for short

    f(θ) = (1/n) ∑_{i=1}^n ℓ(y_i, ⟨x_i, θ⟩)   and   g(θ) = λ pen(θ)

Assume that
f is convex and L-smooth
g is convex and continuous, but possibly non-smooth (for instance the ℓ₁ penalization)
g is prox-capable: its proximal operator is not hard to compute
Examples

Smoothness of f:
Least-squares:

    ∇f(θ) = (1/n) X^⊤ (Xθ − Y),    L = ‖X^⊤X‖_op / n

Logistic loss:

    ∇f(θ) = −(1/n) ∑_{i=1}^n [ y_i / (1 + e^{y_i ⟨x_i, θ⟩}) ] x_i,    L = max_{i=1,...,n} ‖x_i‖₂² / (4n)

Prox-capability of g :
we gave the explicit prox for many penalizations above
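To make the smoothness formulas above concrete, here is a minimal NumPy sketch of the two gradients and of the least-squares smoothness constant (function names are ours):

import numpy as np

def grad_least_squares(theta, X, Y):
    # (1/n) X^T (X theta - Y)
    n = X.shape[0]
    return X.T @ (X @ theta - Y) / n

def lipschitz_least_squares(X):
    # L = ||X^T X||_op / n (largest singular value of X^T X, divided by n)
    n = X.shape[0]
    return np.linalg.norm(X.T @ X, ord=2) / n

def grad_logistic(theta, X, Y):
    # -(1/n) sum_i [ y_i / (1 + exp(y_i <x_i, theta>)) ] x_i
    n = X.shape[0]
    weights = Y / (1.0 + np.exp(Y * (X @ theta)))
    return -X.T @ weights / n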
Gradient descent

Now how do I minimize f + g ?

Key point: the descent lemma. If f is convex and L-smooth, then for any L′ ≥ L:

    f(θ′) ≤ f(θ) + ⟨∇f(θ), θ′ − θ⟩ + (L′/2) ‖θ′ − θ‖₂²

for any θ, θ′ ∈ R^d
At iteration k, the current point is θ^k. I use the descent lemma:

    f(θ) ≤ f(θ^k) + ⟨∇f(θ^k), θ − θ^k⟩ + (L′/2) ‖θ − θ^k‖₂².
Gradient descent

Remark that

    argmin_{θ ∈ R^d} { f(θ^k) + ⟨∇f(θ^k), θ − θ^k⟩ + (L′/2) ‖θ − θ^k‖₂² }
    = argmin_{θ ∈ R^d} ‖ θ − (θ^k − (1/L′) ∇f(θ^k)) ‖₂²

Hence, choose

    θ^{k+1} = θ^k − (1/L′) ∇f(θ^k)
This is the basic gradient descent algorithm [cf previous
lecture]
Gradient descent is based on a majorization-minimization principle, with a quadratic majorant given by the descent lemma
But we forgot about g ...
ISTA

Let's put back g:

    f(θ) + g(θ) ≤ f(θ^k) + ⟨∇f(θ^k), θ − θ^k⟩ + (L′/2) ‖θ − θ^k‖₂² + g(θ)

and again

    argmin_{θ ∈ R^d} { f(θ^k) + ⟨∇f(θ^k), θ − θ^k⟩ + (L′/2) ‖θ − θ^k‖₂² + g(θ) }
    = argmin_{θ ∈ R^d} { (L′/2) ‖ θ − (θ^k − (1/L′) ∇f(θ^k)) ‖₂² + g(θ) }
    = argmin_{θ ∈ R^d} { ½ ‖ θ − (θ^k − (1/L′) ∇f(θ^k)) ‖₂² + (1/L′) g(θ) }
    = prox_{g/L′}( θ^k − (1/L′) ∇f(θ^k) )
The prox operator naturally appears because of the descent lemma
ISTA

Proximal gradient descent algorithm [also called ISTA]

Input: starting point θ^0, Lipschitz constant L > 0 for ∇f
For k = 1, 2, . . . until converged do
    θ^k = prox_{g/L}( θ^{k−1} − (1/L) ∇f(θ^{k−1}) )
Return last θ^k
Also called Forward-Backward splitting. For the Lasso with least-squares loss, the iteration is

    θ^k = S_{λ/L}( θ^{k−1} − (1/L)(X^⊤Xθ^{k−1} − X^⊤Y) ),

where S_λ is the soft-thresholding operator
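A minimal sketch of ISTA for the Lasso, written with the 1/(2n)-normalized least-squares loss used earlier in these slides (function names and the synthetic data are ours):

import numpy as np

def soft_thresholding(x, thresh):
    return np.sign(x) * np.maximum(np.abs(x) - thresh, 0.0)

def ista_lasso(X, Y, lam, n_iter=200):
    n, d = X.shape
    L = np.linalg.norm(X.T @ X, ord=2) / n        # Lipschitz constant of grad f
    theta = np.zeros(d)
    for _ in range(n_iter):
        grad = X.T @ (X @ theta - Y) / n          # gradient of f at theta
        theta = soft_thresholding(theta - grad / L, lam / L)   # prox_{g/L} step
    return theta

# Usage on synthetic data with a sparse ground truth:
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
theta_true = np.zeros(20)
theta_true[:3] = [2.0, -1.5, 1.0]
Y = X @ theta_true + 0.1 * rng.normal(size=100)
print(ista_lasso(X, Y, lam=0.1))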
ISTA

Put for short F = f + g.


Take any θ* ∈ argmin_{θ ∈ R^d} F(θ)

Theorem (Beck and Teboulle (2009))

If the sequence {θ^k} is generated by ISTA, then

    F(θ^k) − F(θ*) ≤ L ‖θ^0 − θ*‖₂² / (2k)

Convergence rate is O(1/k)


Is it possible to improve the O(1/k) rate?
FISTA

Yes! Use accelerated proximal gradient descent (called FISTA; Nesterov 83, 04, Beck and Teboulle 09)
Idea: to find θ^{k+1}, use an interpolation between θ^k and θ^{k−1}

Accelerated proximal gradient descent algorithm [FISTA]

Input: starting points z^1 = θ^0, Lipschitz constant L > 0 for ∇f, t_1 = 1
For k = 1, 2, . . . until converged do
    θ^k = prox_{g/L}( z^k − (1/L) ∇f(z^k) )
    t_{k+1} = (1 + √(1 + 4 t_k²)) / 2
    z^{k+1} = θ^k + ((t_k − 1)/t_{k+1}) (θ^k − θ^{k−1})
Return last θ^k
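A minimal sketch of FISTA for the same Lasso problem as in the ISTA sketch above (same conventions and assumptions):

import numpy as np

def soft_thresholding(x, thresh):
    return np.sign(x) * np.maximum(np.abs(x) - thresh, 0.0)

def fista_lasso(X, Y, lam, n_iter=200):
    n, d = X.shape
    L = np.linalg.norm(X.T @ X, ord=2) / n
    theta = np.zeros(d)
    z = theta.copy()
    t = 1.0
    for _ in range(n_iter):
        grad = X.T @ (X @ z - Y) / n                               # gradient at z^k
        theta_new = soft_thresholding(z - grad / L, lam / L)       # prox step
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t ** 2)) / 2.0
        z = theta_new + ((t - 1.0) / t_new) * (theta_new - theta)  # interpolation
        theta, t = theta_new, t_new
    return theta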
FISTA

Theorem (Beck and Teboulle (2009))

If the sequence {θ^k} is generated by FISTA, then

    F(θ^k) − F(θ*) ≤ 2L ‖θ^0 − θ*‖₂² / (k + 1)²

Convergence rate is O(1/k 2 )


Is O(1/k 2 ) the optimal rate in general?
FISTA

Yes. Put g = 0
Theorem (Nesterov)
For any optimization procedure satisfying

    θ^{k+1} ∈ θ^1 + span(∇f(θ^1), . . . , ∇f(θ^k)),

there is a convex and L-smooth function f on R^d such that

    min_{1≤j≤k} f(θ^j) − f(θ*) ≥ (3L/32) ‖θ^1 − θ*‖₂² / (k + 1)²

for any 1 ≤ k ≤ (d − 1)/2.


FISTA

Comparison of ISTA and FISTA

FISTA is not a descent algorithm, while ISTA is


FISTA

[Proof of convergence of FISTA on the blackboard]


Backtracking linesearch

What if I don't know L > 0?


‖X^⊤X‖_op can take long to compute
Letting L evolve along the iterations k generally improves convergence speed

Backtracking linesearch. Idea:


Start from a very small Lipschitz constant L
Between iteration k and k + 1, choose the smallest L satisfying the descent lemma at z^k
Backtracking linesearch

At iteration k of FISTA, we have z^k and a constant L_k

1 Put L ← L_k
2 Do an iteration

       θ ← prox_{g/L}( z^k − (1/L) ∇f(z^k) )

3 Check if this step satisfies the descent lemma at z^k:

       f(θ) + g(θ) ≤ f(z^k) + ⟨∇f(z^k), θ − z^k⟩ + (L/2) ‖θ − z^k‖₂² + g(θ)

4 If yes, then θ^{k+1} ← θ and L_{k+1} ← L and continue FISTA
5 If not, then put L ← 2L (say), and go back to step 2

The sequence L_k is non-decreasing: between iterations k and k + 1, a tweak is to decrease it a little bit to get (much) faster convergence
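A minimal sketch of one backtracking step following the scheme above; f, grad_f and prox_g are assumed to be user-supplied callables, with prox_g(v, s) computing prox_{s g}(v):

import numpy as np

def backtracking_step(f, grad_f, prox_g, z, L):
    # Double L until the descent lemma holds at z, then return the accepted step
    fz, gz = f(z), grad_f(z)
    while True:
        theta = prox_g(z - gz / L, 1.0 / L)       # prox_{g/L} step from z
        diff = theta - z
        # Descent lemma check: f(theta) <= f(z) + <grad f(z), diff> + L/2 ||diff||^2
        if f(theta) <= fz + gz @ diff + 0.5 * L * diff @ diff:
            return theta, L
        L *= 2.0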
Fenchel duality

How to stop an iterative optimization algorithm?


If F is the objective function, fix a small ε > 0 and stop when

    |F(θ^{k+1}) − F(θ^k)| / F(θ^k) ≤ ε    or    ‖∇f(θ^k)‖ ≤ ε

An alternative is to compute the duality gap

Fenchel Duality. Consider the problem min_θ f(Aθ) + g(θ) with f : R^d → R, g : R^p → R and a d × p matrix A.

We have

    sup_u { −f*(u) − g*(−A^⊤u) } ≤ inf_θ { f(Aθ) + g(θ) }

Moreover, if f and g are convex, then under mild assumptions, equality of both sides holds (strong duality, no duality gap)
Fenchel duality

Fenchel Duality

    sup_u { −f*(u) − g*(−A^⊤u) } ≤ inf_θ { f(Aθ) + g(θ) }

The right-hand side is the primal problem
The left-hand side is a dual formulation of the primal problem
If θ* is an optimum for the primal and u* is an optimum for the dual, then

    −f*(u*) − g*(−A^⊤u*) = f(Aθ*) + g(θ*)

When g(θ) = λ‖θ‖, where λ > 0 and ‖ · ‖ is a norm, this becomes

    sup_{u : ‖A^⊤u‖_* ≤ λ} { −f*(u) } ≤ inf_θ { f(Aθ) + λ‖θ‖ }
Duality gap

If (θ*, u*) is a pair of primal/dual solutions, then we have

    u* ∈ ∂f(Aθ*), or u* = ∇f(Aθ*) if f is differentiable

Namely, we have at the optimum

    ‖A^⊤∇f(Aθ*)‖_* ≤ λ    and    f(Aθ*) + λ‖θ*‖ + f*(∇f(Aθ*)) = 0

Natural stopping rule: imagine we are at iteration k of an optimization algorithm, and the current primal variable is θ^k. Define

    u^k = u(θ^k) = min( 1, λ / ‖A^⊤∇f(Aθ^k)‖_* ) ∇f(Aθ^k)

and stop at iteration k when

    f(Aθ^k) + λ‖θ^k‖ + f*(u^k) ≤ ε

for a given small ε > 0


Duality gap

Back to machine learning:


    ∑_{i=1}^n ℓ(y_i, ⟨x_i, θ⟩) = ∑_{i=1}^n ℓ(y_i, (Xθ)_i) = f(Xθ)

with f(z) = ∑_{i=1}^n ℓ(y_i, z_i) for z = [z_1 · · · z_n]^⊤ and X the matrix with rows x_1^⊤, . . . , x_n^⊤. The gradient of θ ↦ f(Xθ) is

    X^⊤∇f(Xθ) = ∑_{i=1}^n ℓ′(y_i, ⟨x_i, θ⟩) x_i,

where ℓ′(y, z) = ∂ℓ(y, z)/∂z.


For the duality gap, we need to compute f ∗ .
Duality gap

For least squares, f(z) = ½ ‖y − z‖₂², and we have

    f*(u) = ½ ‖u‖₂² + ⟨u, y⟩
For logistic regression, f(z) = ∑_{i=1}^n log(1 + e^{−y_i z_i}), and we have

    f*(u) = ∑_{i=1}^n [ (1 + u_i y_i) log(1 + u_i y_i) − u_i y_i log(−u_i y_i) ]

if −u_i y_i ∈ (0, 1] for any i = 1, . . . , n, and

    f*(u) = +∞

otherwise
Duality gap

Example. Stopping criterion for the Lasso based on the duality gap:


Compute the residual

    r^k ← Xθ^k − y

Compute the dual variable

    u^k ← ( λ / ‖X^⊤r^k‖_∞  ∧  1 ) r^k

Stop if

    ½ ‖r^k‖₂² + λ‖θ^k‖₁ + ½ ‖u^k‖₂² + ⟨u^k, y⟩ ≤ ε
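A minimal sketch of this stopping criterion (names are ours; the loss here is the unnormalized f(z) = ½ ‖y − z‖₂² used in this duality-gap section):

import numpy as np

def lasso_duality_gap(theta, X, y, lam):
    # Duality gap f(X theta) + lam ||theta||_1 + f*(u) at the rescaled dual point u
    r = X @ theta - y                                  # residual r^k
    u = min(1.0, lam / np.max(np.abs(X.T @ r))) * r    # dual-feasible rescaling
    primal = 0.5 * r @ r + lam * np.sum(np.abs(theta))
    conjugate = 0.5 * u @ u + u @ y                    # f*(u) for least squares
    return primal + conjugate

# Stop an iterative solver once lasso_duality_gap(theta_k, X, y, lam) <= eps.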
