
Master 2 MathBigData

S. Gaïffas¹

November 3, 2014

¹ CMAP, École Polytechnique
1 Supervised learning recap
    Introduction
    Loss functions, linearity
2 Penalization
    Introduction
    Ridge
    Sparsity
    Lasso
3 Some tools from convex optimization
    Quick recap
    Proximal operators
    Subdifferential, Fenchel conjugate
4 ISTA and FISTA
    The general problem
    Gradient descent
    ISTA
    FISTA
    Linesearch
5 Duality gap
    Fenchel duality
    Duality gap
Supervised learning

Setting
Data xi ∈ X , yi ∈ Y for i = 1, . . . , n
xi is an input and yi is an output
xi are called features and xi ∈ X = Rd
yi are called labels
Y = {−1, 1} or Y = {0, 1} for binary classification
Y = {1, . . . , K } for multiclass classification
Y = R for regression
Goal: given a new x, predict the corresponding y.
Supervised learning – Loss functions, linearity

What to do
Minimize with respect to f : Rd → R
    R_n(f) = (1/n) ∑_{i=1}^n ℓ(y_i, f(x_i))

where
ℓ is a loss function. ℓ(y_i, f(x_i)) small means y_i is close to f(x_i)
R_n(f) is called the goodness-of-fit or empirical risk
Computing f is called the training or estimation step
Supervised learning – Loss functions, linearity

When d is large, it is impossible to fit a complex function f on the data
When n is large, training is too time-consuming for a complex function f
Hence:
Choose a linear function f:

    f(x) = ⟨x, θ⟩ = ∑_{j=1}^d x_j θ_j,

for some parameter vector θ ∈ R^d to be trained

Remark: f is linear with respect to x_i, but you can build the features x_i from the data. Hence, the model is not necessarily linear w.r.t. the original inputs
Supervised learning – Loss functions, linearity

Training the model: compute

    θ̂ ∈ argmin_{θ ∈ R^d} R_n(θ)

where

    R_n(θ) = (1/n) ∑_{i=1}^n ℓ(y_i, ⟨x_i, θ⟩).

Classical losses
ℓ(y, z) = ½ (y − z)²: least-squares loss, linear regression (label y ∈ R)
ℓ(y, z) = (1 − yz)_+: hinge loss, or SVM loss (binary classification, label y ∈ {−1, 1})
ℓ(y, z) = log(1 + e^{−yz}): logistic loss (binary classification, label y ∈ {−1, 1})
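To fix ideas, here is a minimal NumPy sketch of these three losses; the function names are ours, not from the lecture, and each function takes a label y and a prediction z = ⟨x, θ⟩:

import numpy as np

def least_squares_loss(y, z):
    # 1/2 (y - z)^2, for regression with y in R
    return 0.5 * (y - z) ** 2

def hinge_loss(y, z):
    # (1 - y z)_+, for binary labels y in {-1, +1}
    return np.maximum(0.0, 1.0 - y * z)

def logistic_loss(y, z):
    # log(1 + exp(-y z)), computed in a numerically stable way
    return np.logaddexp(0.0, -y * z)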
Supervised learning – Loss functions, linearity

[Figure: plots of the three losses ℓ_least-sq(y, z) = ½ (y − z)², ℓ_hinge(y, z) = (1 − yz)_+ and ℓ_logistic(y, z) = log(1 + e^{−yz})]
Penalization – Introduction

You should never actually fit a model by minimizing only


    θ̂_n ∈ argmin_{θ ∈ R^d} (1/n) ∑_{i=1}^n ℓ(y_i, ⟨x_i, θ⟩).

You should minimize instead


    θ̂_n ∈ argmin_{θ ∈ R^d} { (1/n) ∑_{i=1}^n ℓ(y_i, ⟨x_i, θ⟩) + λ pen(θ) }

where
pen is a penalization function that encodes a prior assumption on θ: it forbids θ from being “too complex”
λ > 0 is a tuning or smoothing parameter that balances goodness-of-fit and penalization
Penalization – Introduction

Why use penalization?


    θ̂ ∈ argmin_{θ ∈ R^d} { (1/n) ∑_{i=1}^n ℓ(y_i, ⟨x_i, θ⟩) + λ pen(θ) }

Penalization, for a well-chosen λ > 0, helps to avoid overfitting


Penalization – Ridge

The most classical penalization is the ridge penalization

    pen(θ) = ‖θ‖₂² = ∑_{j=1}^d θ_j².

It penalizes the energy of θ, as measured by the squared ℓ₂-norm

Sparsity-inducing penalization


It would be nice to find a model where θ̂_j = 0 for many coordinates j:
only a few features are useful for prediction, so the model is simpler, with a smaller dimension
We say that θ̂ is sparse
How to do it?
Penalization – Sparsity

It is tempting to use

    θ̂ ∈ argmin_{θ ∈ R^d} { (1/n) ∑_{i=1}^n ℓ(y_i, ⟨x_i, θ⟩) + λ ‖θ‖₀ },

where
    ‖θ‖₀ = #{j : θ_j ≠ 0}.
But, to solve this exactly, you need to try all possible subsets of
non-zero coordinates of θ: 2^d possibilities. Impossible!
Penalization – Lasso

A solution: Lasso penalization (least absolute shrinkage and selection operator)

    pen(θ) = ‖θ‖₁ = ∑_{j=1}^d |θ_j|.

This is a penalization based on the ℓ₁-norm ‖ · ‖₁.


In a noiseless setting [compressed sensing, basis pursuit], and in a certain regime, ℓ₁-minimization gives the “same solution” as ℓ₀-minimization
But the Lasso-penalized problem is easy to compute
Why does ℓ₁-penalization lead to sparsity?
Penalization – Lasso

Why does ℓ₂ (ridge) not induce sparsity?


Penalization – Lasso

Hence, a minimizer

    θ̂ ∈ argmin_{θ ∈ R^d} { (1/n) ∑_{i=1}^n ℓ(y_i, ⟨x_i, θ⟩) + λ ‖θ‖₁ }

is typically sparse (θ̂_j = 0 for many j).


For λ large (larger than some constant), θ̂_j = 0 for all j
For λ = 0, there is no penalization
Between the two, the “sparsity” depends on the value of λ:
once again, it is a regularization or penalization parameter
Penalization – Lasso

For the least-squares loss,

    θ̂ ∈ argmin_{θ ∈ R^d} { 1/(2n) ‖Y − Xθ‖₂² + (λ/2) ‖θ‖₂² }

is called ridge linear regression, and

    θ̂ ∈ argmin_{θ ∈ R^d} { 1/(2n) ‖Y − Xθ‖₂² + λ ‖θ‖₁ }

is called Lasso linear regression.


Penalization – Lasso

Consider the minimization problem

    min_{a ∈ R} ½ (a − b)² + λ |a|

for λ > 0 and b ∈ R


Derivative at 0⁺: d_+ = λ − b
Derivative at 0⁻: d_− = −λ − b
Let a* be the solution:
    a* = 0 iff d_+ ≥ 0 and d_− ≤ 0, namely |b| ≤ λ
    a* ≥ 0 iff d_+ ≤ 0, namely b ≥ λ, and then a* = b − λ
    a* ≤ 0 iff d_− ≥ 0, namely b ≤ −λ, and then a* = b + λ
Hence
    a* = sign(b) (|b| − λ)_+
where a_+ = max(0, a)
Penalization – Lasso

As a consequence, we have
    a* = argmin_{a ∈ R^d} { ½ ‖a − b‖₂² + λ ‖a‖₁ } = S_λ(b)

where
    S_λ(b) = sign(b) (|b| − λ)_+
is the soft-thresholding operator
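A minimal NumPy sketch of the soft-thresholding operator, applied coordinate-wise (the function name is ours):

import numpy as np

def soft_thresholding(b, lam):
    # S_lam(b) = sign(b) * (|b| - lam)_+, coordinate-wise
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

# Coordinates with |b_j| <= lam are set exactly to zero:
b = np.array([3.0, -0.5, 1.2, -2.0])
print(soft_thresholding(b, 1.0))  # approximately [ 2.  -0.   0.2  -1. ]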
Quick recap

f : R^d → [−∞, +∞] is
convex if

    f(tx + (1 − t)x′) ≤ t f(x) + (1 − t) f(x′)

for any x, x′ ∈ R^d, t ∈ [0, 1]


proper if it never takes the value −∞ and is not identically +∞ (so a proper f takes values in (−∞, +∞])
lower-semicontinuous (l.s.c.) if and only if for any x and any sequence x_n → x we have

    f(x) ≤ lim inf_n f(x_n).

The set of such functions is often denoted Γ0 (Rd ) or Γ0


Proximal operator

For any convex l.s.c. function g and any y ∈ R^d, we define the proximal operator

    prox_g(y) = argmin_{x ∈ R^d} { ½ ‖x − y‖₂² + g(x) }

(strongly convex problem ⇒ unique minimizer)


We have seen that soft-thresholding is the proximal operator of the ℓ₁-norm:

    prox_{λ‖·‖₁}(y) = S_λ(y) = sign(y) (|y| − λ)_+

Proximal operators and proximal algorithms are now fundamental


tools for optimization in machine learning
Examples of proximal operators

g (x) = c for a constant c, proxg = Id


If C is a convex set and

    g(x) = δ_C(x) = 0 if x ∈ C, +∞ if x ∉ C,

then
    prox_g = proj_C = the projection onto C.
If g(x) = ⟨b, x⟩ + c, then

    prox_{λg}(x) = x − λb

If g(x) = ½ x^⊤ A x + ⟨b, x⟩ + c with A symmetric positive semi-definite, then

    prox_{λg}(x) = (I + λA)^{−1} (x − λb)
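As an illustration (our own sketch, not from the lecture), this last proximal operator can be computed by solving a linear system rather than forming the inverse:

import numpy as np

def prox_quadratic(x, A, b, lam):
    # prox_{lam g}(x) = (I + lam A)^{-1} (x - lam b) for g(x) = 1/2 x^T A x + <b, x> + c
    d = x.shape[0]
    return np.linalg.solve(np.eye(d) + lam * A, x - lam * b)

# Example with a random symmetric positive semi-definite A:
rng = np.random.default_rng(0)
M = rng.normal(size=(3, 3))
A = M @ M.T
b = rng.normal(size=3)
x = rng.normal(size=3)
print(prox_quadratic(x, A, b, lam=0.1))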


Examples of proximal operators

If g(x) = ½ ‖x‖₂², then

    prox_{λg}(x) = x / (1 + λ)   (the shrinkage operator)
If g(x) = −log x, then

    prox_{λg}(x) = (x + √(x² + 4λ)) / 2
If g(x) = ‖x‖₂, then

    prox_{λg}(x) = (1 − λ/‖x‖₂)_+ x,

the block soft-thresholding operator


Examples of proximal operators

If g(x) = ‖x‖₁ + (γ/2) ‖x‖₂² (elastic-net), where γ > 0, then

    prox_{λg}(x) = prox_{λ‖·‖₁}(x) / (1 + λγ)
If g(x) = ∑_{g ∈ G} ‖x_g‖₂, where G is a partition of {1, . . . , d}, then

    (prox_{λg}(x))_g = (1 − λ/‖x_g‖₂)_+ x_g

for each block g ∈ G. This is block soft-thresholding, used for the group-Lasso
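A minimal NumPy sketch of the elastic-net and group proximal operators above (helper names are ours; groups is an explicit list of index blocks):

import numpy as np

def soft_thresholding(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def prox_elastic_net(x, lam, gamma):
    # prox of lam * (||.||_1 + gamma/2 ||.||_2^2): scaled soft-thresholding
    return soft_thresholding(x, lam) / (1.0 + lam * gamma)

def block_soft_thresholding(x, lam, groups):
    # prox of lam * sum_g ||x_g||_2, where groups partitions {0, ..., d-1}
    out = np.zeros_like(x)
    for g in groups:
        norm_g = np.linalg.norm(x[g])
        if norm_g > 0.0:
            out[g] = max(0.0, 1.0 - lam / norm_g) * x[g]
    return out

x = np.array([1.5, -0.2, 0.3, -2.0])
print(prox_elastic_net(x, 0.5, 1.0))
print(block_soft_thresholding(x, 0.5, groups=[[0, 1], [2, 3]]))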


Subdifferential, Fenchel conjuguate

The subdifferential of f ∈ Γ₀ at x is the set

    ∂f(x) = { g ∈ R^d : f(y) ≥ ⟨g, y − x⟩ + f(x) for all y ∈ R^d }

Each element of ∂f(x) is called a subgradient


Optimality criterion

0 ∈ ∂f (x) iff f (x) ≤ f (y ) ∀y

If f is differentiable at x, then ∂f (x) = {∇f (x)}


Example: for f(x) = |x|, ∂f(0) = [−1, 1]
Fenchel conjugate

The Fenchel conjugate of a function f on R^d is given by

    f*(x) = sup_{y ∈ R^d} { ⟨x, y⟩ − f(y) }

It is always a convex function (as a supremum of affine functions).


f*(y) is the smallest constant c such that the affine function

    x ↦ ⟨y, x⟩ − c

is below f.

Fenchel-Young inequality: we have

    f(x) + f*(y) ≥ ⟨x, y⟩

for any x and y.


Fenchel conjugate and subgradients

Legendre-Fenchel identity: if f ∈ Γ₀, we have

    ⟨x, y⟩ = f(x) + f*(y)  ⇔  y ∈ ∂f(x)  ⇔  x ∈ ∂f*(y)

Example. Fenchel conjugate of a norm ‖ · ‖:

    (‖ · ‖)*(x) = sup_{y ∈ R^d} { ⟨x, y⟩ − ‖y‖ } = δ_{{y ∈ R^d : ‖y‖_* ≤ 1}}(x),

where
    ‖x‖_* = max_{y ∈ R^d : ‖y‖ ≤ 1} ⟨x, y⟩

is the dual norm of ‖ · ‖ [recall that the dual norm of ‖ · ‖_p is ‖ · ‖_q with 1/p + 1/q = 1]


Some extras

f : R^d → [−∞, +∞] is
L-smooth if it is continuously differentiable and

    ‖∇f(x) − ∇f(y)‖₂ ≤ L ‖x − y‖₂  for any x, y ∈ R^d.

When f is twice continuously differentiable, this is equivalent to H_f(x) ⪯ L I_d for all x, where H_f(x) is the Hessian at x [i.e. L I_d − H_f(x) is positive semi-definite]
µ-strongly convex if f(·) − (µ/2) ‖ · ‖₂² is convex. Equivalent to

    f(y) ≥ f(x) + ⟨g, y − x⟩ + (µ/2) ‖y − x‖₂²

for any g ∈ ∂f(x). Also equivalent to H_f(x) ⪰ µ I_d when f is twice differentiable.

f is µ-strongly convex iff f* is (1/µ)-smooth
The general problem we want to solve

How to solve

    θ̂ ∈ argmin_{θ ∈ R^d} { (1/n) ∑_{i=1}^n ℓ(y_i, ⟨x_i, θ⟩) + λ pen(θ) } ???

Put for short

    f(θ) = (1/n) ∑_{i=1}^n ℓ(y_i, ⟨x_i, θ⟩)   and   g(θ) = λ pen(θ)

Assume that
f is convex and L-smooth
g is convex and continuous, but possibly non-smooth (for instance the ℓ₁ penalization)
g is prox-capable: its proximal operator is not hard to compute
Examples

Smoothness of f:
Least-squares:

    ∇f(θ) = (1/n) X^⊤ (Xθ − Y),    L = ‖X^⊤X‖_op / n

Logistic loss:

    ∇f(θ) = −(1/n) ∑_{i=1}^n [ y_i / (1 + e^{y_i ⟨x_i, θ⟩}) ] x_i,    L = max_{i=1,...,n} ‖x_i‖₂² / (4n)

Prox-capability of g :
we gave the explicit prox for many penalizations above
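To make the smoothness formulas above concrete, here is a minimal NumPy sketch of the two gradients and of the least-squares smoothness constant (function names are ours):

import numpy as np

def grad_least_squares(theta, X, Y):
    # (1/n) X^T (X theta - Y)
    n = X.shape[0]
    return X.T @ (X @ theta - Y) / n

def lipschitz_least_squares(X):
    # L = ||X^T X||_op / n (largest singular value of X^T X, divided by n)
    n = X.shape[0]
    return np.linalg.norm(X.T @ X, ord=2) / n

def grad_logistic(theta, X, Y):
    # -(1/n) sum_i [ y_i / (1 + exp(y_i <x_i, theta>)) ] x_i
    n = X.shape[0]
    weights = Y / (1.0 + np.exp(Y * (X @ theta)))
    return -X.T @ weights / n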
Gradient descent

Now how do I minimize f + g ?

Key point: the descent lemma. If f is convex and L-smooth, then for any L′ ≥ L:

    f(θ′) ≤ f(θ) + ⟨∇f(θ), θ′ − θ⟩ + (L′/2) ‖θ′ − θ‖₂²

for any θ, θ′ ∈ R^d
At iteration k, the current point is θ^k. I use the descent lemma:

    f(θ) ≤ f(θ^k) + ⟨∇f(θ^k), θ − θ^k⟩ + (L′/2) ‖θ − θ^k‖₂².
Gradient descent

Remark that

    argmin_{θ ∈ R^d} { f(θ^k) + ⟨∇f(θ^k), θ − θ^k⟩ + (L′/2) ‖θ − θ^k‖₂² }
    = argmin_{θ ∈ R^d} ‖ θ − (θ^k − (1/L′) ∇f(θ^k)) ‖₂²

Hence, choose

    θ^{k+1} = θ^k − (1/L′) ∇f(θ^k)
This is the basic gradient descent algorithm [cf previous
lecture]
Gradient descent is based on a majorization-minimization principle, with a quadratic majorant given by the descent lemma
But we forgot about g ...
ISTA

Let's put back g:

    f(θ) + g(θ) ≤ f(θ^k) + ⟨∇f(θ^k), θ − θ^k⟩ + (L′/2) ‖θ − θ^k‖₂² + g(θ)

and again

    argmin_{θ ∈ R^d} { f(θ^k) + ⟨∇f(θ^k), θ − θ^k⟩ + (L′/2) ‖θ − θ^k‖₂² + g(θ) }
    = argmin_{θ ∈ R^d} { (L′/2) ‖ θ − (θ^k − (1/L′) ∇f(θ^k)) ‖₂² + g(θ) }
    = argmin_{θ ∈ R^d} { ½ ‖ θ − (θ^k − (1/L′) ∇f(θ^k)) ‖₂² + (1/L′) g(θ) }
    = prox_{g/L′}( θ^k − (1/L′) ∇f(θ^k) )
The prox operator naturally appears because of the descent lemma
ISTA

Proximal gradient descent algorithm [also called ISTA]

Input: starting point θ^0, Lipschitz constant L > 0 for ∇f
For k = 1, 2, . . . until converged do
    θ^k = prox_{g/L}( θ^{k−1} − (1/L) ∇f(θ^{k−1}) )
Return last θ^k
Also called Forward-Backward splitting. For the Lasso with least-squares loss, the iteration is

    θ^k = S_{λ/L}( θ^{k−1} − (1/L)(X^⊤Xθ^{k−1} − X^⊤Y) ),

where S_λ is the soft-thresholding operator
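A minimal sketch of ISTA for the Lasso, written with the 1/(2n)-normalized least-squares loss used earlier in these slides (function names and the synthetic data are ours):

import numpy as np

def soft_thresholding(x, thresh):
    return np.sign(x) * np.maximum(np.abs(x) - thresh, 0.0)

def ista_lasso(X, Y, lam, n_iter=200):
    n, d = X.shape
    L = np.linalg.norm(X.T @ X, ord=2) / n        # Lipschitz constant of grad f
    theta = np.zeros(d)
    for _ in range(n_iter):
        grad = X.T @ (X @ theta - Y) / n          # gradient of f at theta
        theta = soft_thresholding(theta - grad / L, lam / L)   # prox_{g/L} step
    return theta

# Usage on synthetic data with a sparse ground truth:
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
theta_true = np.zeros(20)
theta_true[:3] = [2.0, -1.5, 1.0]
Y = X @ theta_true + 0.1 * rng.normal(size=100)
print(ista_lasso(X, Y, lam=0.1))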
ISTA

Put for short F = f + g.


Take any θ* ∈ argmin_{θ ∈ R^d} F(θ)

Theorem (Beck and Teboulle (2009))

If the sequence {θ^k} is generated by ISTA, then

    F(θ^k) − F(θ*) ≤ L ‖θ^0 − θ*‖₂² / (2k)

Convergence rate is O(1/k)


Is it possible to improve the O(1/k) rate?
FISTA

Yes! Use accelerated proximal gradient descent (called FISTA; Nesterov 83, 04, Beck and Teboulle 09)
Idea: to find θ^{k+1}, use an interpolation between θ^k and θ^{k−1}

Accelerated proximal gradient descent algorithm [FISTA]

Input: starting points z^1 = θ^0, Lipschitz constant L > 0 for ∇f, t_1 = 1
For k = 1, 2, . . . until converged do
    θ^k = prox_{g/L}( z^k − (1/L) ∇f(z^k) )
    t_{k+1} = (1 + √(1 + 4 t_k²)) / 2
    z^{k+1} = θ^k + ((t_k − 1)/t_{k+1}) (θ^k − θ^{k−1})
Return last θ^k
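A minimal sketch of FISTA for the same Lasso problem as in the ISTA sketch above (same conventions and assumptions):

import numpy as np

def soft_thresholding(x, thresh):
    return np.sign(x) * np.maximum(np.abs(x) - thresh, 0.0)

def fista_lasso(X, Y, lam, n_iter=200):
    n, d = X.shape
    L = np.linalg.norm(X.T @ X, ord=2) / n
    theta = np.zeros(d)
    z = theta.copy()
    t = 1.0
    for _ in range(n_iter):
        grad = X.T @ (X @ z - Y) / n                               # gradient at z^k
        theta_new = soft_thresholding(z - grad / L, lam / L)       # prox step
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t ** 2)) / 2.0
        z = theta_new + ((t - 1.0) / t_new) * (theta_new - theta)  # interpolation
        theta, t = theta_new, t_new
    return theta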
FISTA

Theorem (Beck and Teboulle (2009))

If the sequence {θ^k} is generated by FISTA, then

    F(θ^k) − F(θ*) ≤ 2L ‖θ^0 − θ*‖₂² / (k + 1)²

Convergence rate is O(1/k 2 )


Is O(1/k 2 ) the optimal rate in general?
FISTA

Yes. Put g = 0
Theorem (Nesterov)
For any optimization procedure satisfying

    θ^{k+1} ∈ θ^1 + span(∇f(θ^1), . . . , ∇f(θ^k)),

there is a convex and L-smooth function f on R^d such that

    min_{1≤j≤k} f(θ^j) − f(θ*) ≥ (3L/32) ‖θ^1 − θ*‖₂² / (k + 1)²

for any 1 ≤ k ≤ (d − 1)/2.


FISTA

Comparison of ISTA and FISTA

FISTA is not a descent algorithm, while ISTA is


FISTA

[Proof of convergence of FISTA on the blackboard]


Backtracking linesearch

What if I don't know L > 0?


‖X^⊤X‖_op can take long to compute
Letting L evolve along the iterations k generally improves convergence speed

Backtracking linesearch. Idea:


Start from a very small Lipschitz constant L
Between iteration k and k + 1, choose the smallest L satisfying the descent lemma at z^k
Backtracking linesearch

At iteration k of FISTA, we have z^k and a constant L_k

1 Put L ← L_k
2 Do an iteration

       θ ← prox_{g/L}( z^k − (1/L) ∇f(z^k) )

3 Check if this step satisfies the descent lemma at z^k:

       f(θ) + g(θ) ≤ f(z^k) + ⟨∇f(z^k), θ − z^k⟩ + (L/2) ‖θ − z^k‖₂² + g(θ)

4 If yes, then θ^{k+1} ← θ and L_{k+1} ← L and continue FISTA
5 If not, then put L ← 2L (say), and go back to step 2

The sequence L_k is non-decreasing: between iterations k and k + 1, a tweak is to decrease it a little bit to get (much) faster convergence
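A minimal sketch of one backtracking step following the scheme above; f, grad_f and prox_g are assumed to be user-supplied callables, with prox_g(v, s) computing prox_{s g}(v):

import numpy as np

def backtracking_step(f, grad_f, prox_g, z, L):
    # Double L until the descent lemma holds at z, then return the accepted step
    fz, gz = f(z), grad_f(z)
    while True:
        theta = prox_g(z - gz / L, 1.0 / L)       # prox_{g/L} step from z
        diff = theta - z
        # Descent lemma check: f(theta) <= f(z) + <grad f(z), diff> + L/2 ||diff||^2
        if f(theta) <= fz + gz @ diff + 0.5 * L * diff @ diff:
            return theta, L
        L *= 2.0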
Fenchel duality

How to stop an iterative optimization algorithm?


If F is the objective function, fix a small ε > 0 and stop when

    |F(θ^{k+1}) − F(θ^k)| / F(θ^k) ≤ ε    or    ‖∇f(θ^k)‖ ≤ ε

An alternative is to compute the duality gap

Fenchel Duality. Consider the problem min_θ f(Aθ) + g(θ) with f : R^d → R, g : R^p → R and a d × p matrix A.

We have

    sup_u { −f*(u) − g*(−A^⊤u) } ≤ inf_θ { f(Aθ) + g(θ) }

Moreover, if f and g are convex, then under mild assumptions, equality of both sides holds (strong duality, no duality gap)
Fenchel duality

Fenchel Duality

    sup_u { −f*(u) − g*(−A^⊤u) } ≤ inf_θ { f(Aθ) + g(θ) }

The right-hand side is the primal problem
The left-hand side is a dual formulation of the primal problem
If θ* is an optimum for the primal and u* is an optimum for the dual, then

    −f*(u*) − g*(−A^⊤u*) = f(Aθ*) + g(θ*)

When g(θ) = λ‖θ‖, where λ > 0 and ‖ · ‖ is a norm, this becomes

    sup_{u : ‖A^⊤u‖_* ≤ λ} { −f*(u) } ≤ inf_θ { f(Aθ) + λ‖θ‖ }
Duality gap

If (θ*, u*) is a pair of primal/dual solutions, then we have

    u* ∈ ∂f(Aθ*), or u* = ∇f(Aθ*) if f is differentiable

Namely, we have at the optimum

    ‖A^⊤∇f(Aθ*)‖_* ≤ λ    and    f(Aθ*) + λ‖θ*‖ + f*(∇f(Aθ*)) = 0

Natural stopping rule: imagine we are at iteration k of an optimization algorithm, and the current primal variable is θ^k. Define

    u^k = u(θ^k) = min( 1, λ / ‖A^⊤∇f(Aθ^k)‖_* ) ∇f(Aθ^k)

and stop at iteration k when

    f(Aθ^k) + λ‖θ^k‖ + f*(u^k) ≤ ε

for a given small ε > 0


Duality gap

Back to machine learning:


    ∑_{i=1}^n ℓ(y_i, ⟨x_i, θ⟩) = ∑_{i=1}^n ℓ(y_i, (Xθ)_i) = f(Xθ)

with f(z) = ∑_{i=1}^n ℓ(y_i, z_i) for z = [z_1 · · · z_n]^⊤ and X the matrix with rows x_1^⊤, . . . , x_n^⊤. The gradient of θ ↦ f(Xθ) is

    X^⊤∇f(Xθ) = ∑_{i=1}^n ℓ′(y_i, ⟨x_i, θ⟩) x_i,

where ℓ′(y, z) = ∂ℓ(y, z)/∂z.


For the duality gap, we need to compute f ∗ .
Duality gap

For least squares, f(z) = ½ ‖y − z‖₂², and we have

    f*(u) = ½ ‖u‖₂² + ⟨u, y⟩
For logistic regression, f(z) = ∑_{i=1}^n log(1 + e^{−y_i z_i}), and we have

    f*(u) = ∑_{i=1}^n [ (1 + u_i y_i) log(1 + u_i y_i) − u_i y_i log(−u_i y_i) ]

if −u_i y_i ∈ (0, 1] for any i = 1, . . . , n, and

    f*(u) = +∞

otherwise
Duality gap

Example. Stopping criterion for the Lasso based on the duality gap:


Compute the residual

    r^k ← Xθ^k − y

Compute the dual variable

    u^k ← ( λ / ‖X^⊤r^k‖_∞  ∧  1 ) r^k

Stop if

    ½ ‖r^k‖₂² + λ‖θ^k‖₁ + ½ ‖u^k‖₂² + ⟨u^k, y⟩ ≤ ε
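A minimal sketch of this stopping criterion (names are ours; the loss here is the unnormalized f(z) = ½ ‖y − z‖₂² used in this duality-gap section):

import numpy as np

def lasso_duality_gap(theta, X, y, lam):
    # Duality gap f(X theta) + lam ||theta||_1 + f*(u) at the rescaled dual point u
    r = X @ theta - y                                  # residual r^k
    u = min(1.0, lam / np.max(np.abs(X.T @ r))) * r    # dual-feasible rescaling
    primal = 0.5 * r @ r + lam * np.sum(np.abs(theta))
    conjugate = 0.5 * u @ u + u @ y                    # f*(u) for least squares
    return primal + conjugate

# Stop an iterative solver once lasso_duality_gap(theta_k, X, y, lam) <= eps.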
