S. Gaïffas
CMAP - Ecole Polytechnique
November 3, 2014
Outline

1 Supervised learning recap
    Introduction
    Loss functions, linearity
2 Penalization
    Introduction
    Ridge
    Sparsity
    Lasso
3 Some tools from convex optimization
    Quick recap
    Proximal operator
    Subdifferential, Fenchel conjugate
4 ISTA and FISTA
    The general problem
    Gradient descent
    ISTA
    FISTA
    Linesearch
5 Duality gap
    Fenchel duality
    Duality gap
Supervised learning
Setting
Data xi ∈ X , yi ∈ Y for i = 1, . . . , n
xi is an input and yi is an output
xi are called features and xi ∈ X = Rd
yi are called labels
Y = {−1, 1} or Y = {0, 1} for binary classification
Y = {1, . . . , K } for multiclass classification
Y = R for regression
Goal: given a new x, predict the corresponding y.
Supervised learning – Loss functions, linearity
What to do
Minimize with respect to f : R^d → R

R_n(f) = (1/n) ∑_{i=1}^n ℓ(y_i, f(x_i))
where
ℓ is a loss function: ℓ(y_i, f(x_i)) small means that f(x_i) is close to y_i
R_n(f) is called the goodness-of-fit or empirical risk
Computing f is called the training or estimation step
Supervised learning – Loss functions, linearity
Restrict to linear predictors f(x) = ⟨x, θ⟩ and compute

θ̂ ∈ argmin_{θ∈R^d} R_n(θ)

where

R_n(θ) = (1/n) ∑_{i=1}^n ℓ(y_i, ⟨x_i, θ⟩).
Classical losses
ℓ(y, z) = (1/2)(y − z)² : least-squares loss, linear regression (label y ∈ R)
ℓ(y, z) = (1 − yz)_+ : hinge loss, or SVM loss (binary classification, label y ∈ {−1, 1})
ℓ(y, z) = log(1 + e^{−yz}) : logistic loss (binary classification, label y ∈ {−1, 1})
Supervised learning – Loss functions, linearity
[Figure: plots of the three losses z ↦ ℓ(y, z):
ℓ_least-sq(y, z) = (1/2)(y − z)²,  ℓ_hinge(y, z) = (1 − yz)_+,  ℓ_logistic(y, z) = log(1 + e^{−yz})]
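For concreteness, here are the three losses written as NumPy functions (a quick illustration, not from the slides):

```python
import numpy as np

def least_squares_loss(y, z):
    """Least-squares loss (1/2)(y - z)^2, for regression."""
    return 0.5 * (y - z) ** 2

def hinge_loss(y, z):
    """Hinge (SVM) loss (1 - y*z)_+, for labels y in {-1, 1}."""
    return np.maximum(1.0 - y * z, 0.0)

def logistic_loss(y, z):
    """Logistic loss log(1 + exp(-y*z)), for labels y in {-1, 1}."""
    return np.log1p(np.exp(-y * z))
```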
1 Supervised learning recap Proximal operators
Introduction Subdifferential, Fenchel
Loss functions, linearity conjuguate
2 Penalization 4 ISTA and FISTA
Introduction The general problem
Ridge Gradient descent
Sparsity ISTA
Lasso FISTA
3 Some tools from convex Linesearch
optimization 5 Duality gap
Quick recap Fenchel duality
Proximal operator Duality gap
Penalization – Introduction
We now consider the penalized problem

θ̂ ∈ argmin_{θ∈R^d} { (1/n) ∑_{i=1}^n ℓ(y_i, ⟨x_i, θ⟩) + λ pen(θ) }

where
pen is a penalization function, which encodes a prior assumption on θ: it forbids θ from being "too complex"
λ > 0 is a tuning or smoothing parameter, which balances goodness-of-fit and penalization
Penalization – Introduction
It is tempting to use
θ̂ ∈ argmin_{θ∈R^d} { (1/n) ∑_{i=1}^n ℓ(y_i, ⟨x_i, θ⟩) + λ‖θ‖_0 },

where

‖θ‖_0 = #{j : θ_j ≠ 0}.
But, to solve this exactly, you need to try all possible subsets of non-zero coordinates of θ: 2^d possibilities. Impossible!
Penalization – Lasso
Hence, replace ‖θ‖_0 by its convex relaxation ‖θ‖_1 and compute a minimizer

θ̂ ∈ argmin_{θ∈R^d} { (1/n) ∑_{i=1}^n ℓ(y_i, ⟨x_i, θ⟩) + λ‖θ‖_1 }
As a consequence, we have

a* = argmin_{a∈R^d} { (1/2)‖a − b‖_2² + λ‖a‖_1 } = S_λ(b)

where

S_λ(b) = sign(b) (|b| − λ)_+   (applied coordinate-wise)

is the soft-thresholding operator.
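As an illustration (not part of the slides), a vectorized NumPy version of this operator:

```python
import numpy as np

def soft_threshold(b, lam):
    """Soft-thresholding S_lam(b) = sign(b) * (|b| - lam)_+,
    i.e. the prox of lam * ||.||_1, applied coordinate-wise."""
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

# Coordinates with |b_j| <= lam are set exactly to zero
print(soft_threshold(np.array([3.0, -0.5, 0.2, -2.0]), lam=1.0))  # [ 2. -0.  0. -1.]
```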
1 Supervised learning recap Proximal operators
Introduction Subdifferential, Fenchel
Loss functions, linearity conjuguate
2 Penalization 4 ISTA and FISTA
Introduction The general problem
Ridge Gradient descent
Sparsity ISTA
Lasso FISTA
3 Some tools from convex Linesearch
optimization 5 Duality gap
Quick recap Fenchel duality
Proximal operator Duality gap
Quick recap
f : R^d → [−∞, +∞] is convex if

f(t x + (1 − t) x′) ≤ t f(x) + (1 − t) f(x′)   for all x, x′ ∈ R^d and t ∈ [0, 1].

Recall the proximal operator of a convex function g:

prox_g(x) = argmin_{u∈R^d} { (1/2)‖u − x‖_2² + g(u) }.

If g is the indicator function of a closed convex set C (0 on C, +∞ outside), then

prox_g = proj_C = projection onto C.
If g(x) = ⟨b, x⟩ + c, then

prox_λg(x) = x − λb

If g(x) = (1/2)‖x‖_2², then

prox_λg(x) = x / (1 + λ) : the shrinkage operator

If g(x) = − log x, then

prox_λg(x) = (x + √(x² + 4λ)) / 2

If g(x) = ‖x‖_2, then

prox_λg(x) = (1 − λ/‖x‖_2)_+ x

If g(x) = ‖x‖_1 + (γ/2)‖x‖_2² (elastic net), then

prox_λg(x) = (1 / (1 + λγ)) prox_{λ‖·‖_1}(x)

If g(x) = ∑_{g∈G} ‖x_g‖_2 where G is a partition of {1, . . . , d} (group penalization), then

(prox_λg(x))_g = (1 − λ/‖x_g‖_2)_+ x_g   for each g ∈ G
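To make two of these formulas concrete, here is a small NumPy sketch (not from the slides) of the prox of the ℓ_2 norm and of the group penalization; representing the groups as a list of index arrays is an assumption made for this example:

```python
import numpy as np

def prox_l2_norm(x, lam):
    """Prox of lam * ||.||_2: shrinks x towards 0, returns 0 if ||x||_2 <= lam."""
    norm = np.linalg.norm(x)
    if norm <= lam:
        return np.zeros_like(x)
    return (1.0 - lam / norm) * x

def prox_group(x, lam, groups):
    """Prox of lam * sum_g ||x_g||_2, where `groups` is a partition of the
    coordinates given as a list of index arrays (assumed format)."""
    out = np.empty_like(x)
    for g in groups:
        out[g] = prox_l2_norm(x[g], lam)
    return out

x = np.array([3.0, 4.0, 0.1, 0.2])
groups = [np.array([0, 1]), np.array([2, 3])]
print(prox_group(x, lam=1.0, groups=groups))
# first group is shrunk (norm 5 > 1), second group is set to zero (norm < 1)
```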
Fenchel conjugate: f*(y) = sup_{x∈R^d} { ⟨x, y⟩ − f(x) }. It is the smallest constant c such that the affine function

x ↦ ⟨y, x⟩ − c

is below f. It always satisfies the Fenchel-Young inequality

f(x) + f*(y) ≥ ⟨x, y⟩.

The dual norm of a norm ‖·‖ is defined by

‖x‖_* = max_{y∈R^d : ‖y‖≤1} ⟨x, y⟩
f : R^d → [−∞, +∞] is L-smooth if it is continuously differentiable and

‖∇f(x) − ∇f(x′)‖_2 ≤ L ‖x − x′‖_2   for all x, x′ ∈ R^d.

f is µ-strongly convex if f − (µ/2)‖·‖_2² is convex. In particular, f is µ-strongly convex iff f* is (1/µ)-smooth.
The general problem we want to solve
How to solve
θ̂ ∈ argmin_{θ∈R^d} { (1/n) ∑_{i=1}^n ℓ(y_i, ⟨x_i, θ⟩) + λ pen(θ) } ???
Write this as θ̂ ∈ argmin_{θ∈R^d} { f(θ) + g(θ) } with f(θ) = (1/n) ∑_{i=1}^n ℓ(y_i, ⟨x_i, θ⟩) and g(θ) = λ pen(θ). Assume that
f is convex and L-smooth
g is convex and continuous, but possibly non-smooth (for instance the ℓ_1 penalization)
g is prox-capable: its proximal operator is not hard to compute
Examples
Smoothness of f:
Least-squares: f(θ) = (1/2n)‖y − Xθ‖_2² is L-smooth with L = λ_max(X^⊤X)/n
Prox-capability of g:
we gave the explicit prox for many penalizations above
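As a concrete instance of this f + g setting, here is an illustrative NumPy sketch for the Lasso with least-squares loss; the helper name `make_lasso_problem` and the 1/(2n) normalization are choices made for this example, not taken from the slides:

```python
import numpy as np

def make_lasso_problem(X, y, lam):
    """Composite objective F = f + g for the Lasso:
    f(theta) = (1/2n) ||y - X theta||_2^2  (convex and L-smooth),
    g(theta) = lam * ||theta||_1           (convex and prox-capable)."""
    n = X.shape[0]

    def f(theta):
        return 0.5 / n * np.sum((y - X @ theta) ** 2)

    def grad_f(theta):
        return X.T @ (X @ theta - y) / n

    # Lipschitz constant of grad_f: largest eigenvalue of X^T X / n
    L = np.linalg.norm(X, ord=2) ** 2 / n

    def g(theta):
        return lam * np.abs(theta).sum()

    def prox_g(theta, step):
        # prox of step * g = soft-thresholding at level step * lam
        return np.sign(theta) * np.maximum(np.abs(theta) - step * lam, 0.0)

    return f, grad_f, L, g, prox_g
```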
Gradient descent
Descent lemma: since f is L-smooth,

f(θ′) ≤ f(θ) + ⟨∇f(θ), θ′ − θ⟩ + (L/2)‖θ′ − θ‖_2²

for any θ, θ′ ∈ R^d
At iteration k, the current point is θ_k. We use the descent lemma at θ_k:

f(θ) ≤ f(θ_k) + ⟨∇f(θ_k), θ − θ_k⟩ + (L/2)‖θ − θ_k‖_2².
Gradient descent
Remark that

argmin_{θ∈R^d} { f(θ_k) + ⟨∇f(θ_k), θ − θ_k⟩ + (L/2)‖θ − θ_k‖_2² }

= argmin_{θ∈R^d} ‖θ − (θ_k − (1/L)∇f(θ_k))‖_2²
Hence, choose

θ_{k+1} = θ_k − (1/L)∇f(θ_k)
This is the basic gradient descent algorithm [cf previous
lecture]
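A minimal sketch of this update rule (illustrative; `grad_f` and `L` are assumed to be provided, e.g. by a setup such as `make_lasso_problem` above):

```python
import numpy as np

def gradient_descent(grad_f, L, theta0, n_iter=100):
    """Basic gradient descent with constant step size 1/L,
    where L is a Lipschitz constant of grad_f."""
    theta = theta0.copy()
    for _ in range(n_iter):
        theta = theta - grad_f(theta) / L
    return theta
```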
Gradient descent is based on a majorization-minimization principle, with a quadratic majorant given by the descent lemma
But we forgot about g ...
ISTA
Keep the same quadratic majorant of f and simply add g to both sides:

f(θ) + g(θ) ≤ f(θ_k) + ⟨∇f(θ_k), θ − θ_k⟩ + (L/2)‖θ − θ_k‖_2² + g(θ)

and again

argmin_{θ∈R^d} { f(θ_k) + ⟨∇f(θ_k), θ − θ_k⟩ + (L/2)‖θ − θ_k‖_2² + g(θ) }

= argmin_{θ∈R^d} { (L/2)‖θ − (θ_k − (1/L)∇f(θ_k))‖_2² + g(θ) }

= argmin_{θ∈R^d} { (1/2)‖θ − (θ_k − (1/L)∇f(θ_k))‖_2² + (1/L) g(θ) }

= prox_{g/L}( θ_k − (1/L)∇f(θ_k) )
The prox operator naturally appears because of the descent lemma
ISTA
ISTA (Iterative Shrinkage-Thresholding Algorithm):
Start from some θ_0 and, for k = 0, 1, 2, . . ., iterate

θ_{k+1} = prox_{g/L}( θ_k − (1/L)∇f(θ_k) )

Return last θ_k
Also called Forward-Backward splitting. For the Lasso with least-squares loss, the iteration is

θ_k = S_{λ/L}( θ_{k−1} − (1/L)(X^⊤X θ_{k−1} − X^⊤Y) ),

where S_λ is the soft-thresholding operator
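A compact NumPy sketch of this iteration for the Lasso (illustrative; it uses the 1/n-normalized least-squares loss, so the gradient step is X^⊤(Xθ − y)/n and L = λ_max(X^⊤X)/n):

```python
import numpy as np

def ista_lasso(X, y, lam, n_iter=200):
    """ISTA for the Lasso: theta_{k+1} = S_{lam/L}(theta_k - grad_f(theta_k) / L),
    with f(theta) = (1/2n) ||y - X theta||^2 and g(theta) = lam * ||theta||_1."""
    n, d = X.shape
    L = np.linalg.norm(X, ord=2) ** 2 / n      # Lipschitz constant of grad_f
    theta = np.zeros(d)
    for _ in range(n_iter):
        grad = X.T @ (X @ theta - y) / n       # forward (gradient) step
        z = theta - grad / L
        theta = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # backward (prox) step
    return theta
```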
ISTA
Convergence: ISTA satisfies

F(θ_k) − F(θ*) ≤ L‖θ_0 − θ*‖_2² / (2k)

while FISTA, its accelerated variant (a sketch of its update is given below), satisfies

F(θ_k) − F(θ*) ≤ 2L‖θ_0 − θ*‖_2² / (k + 1)²

These rates also cover the smooth case (plain gradient descent): put g = 0.
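The FISTA update itself is not spelled out above; as a reference, here is a sketch of the standard accelerated iteration of Beck and Teboulle, applied to the same Lasso problem and normalization as in the ISTA sketch (illustrative):

```python
import numpy as np

def fista_lasso(X, y, lam, n_iter=200):
    """FISTA: an ISTA step taken from an extrapolated point z, with
    momentum parameter t. Same objective as in ista_lasso."""
    n, d = X.shape
    L = np.linalg.norm(X, ord=2) ** 2 / n
    theta = np.zeros(d)
    z = theta.copy()
    t = 1.0
    for _ in range(n_iter):
        grad = X.T @ (X @ z - y) / n
        w = z - grad / L
        theta_new = np.sign(w) * np.maximum(np.abs(w) - lam / L, 0.0)
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t ** 2)) / 2.0
        z = theta_new + (t - 1.0) / t_new * (theta_new - theta)   # extrapolation
        theta, t = theta_new, t_new
    return theta
```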
Theorem (Nesterov)
For any first-order procedure, i.e. one whose iterates are built only from gradients computed at previous iterates, the O(1/k²) rate cannot be improved in general: FISTA is optimal among such methods.
When should we stop the iterations? Natural heuristics are

|F(θ_{k+1}) − F(θ_k)| / F(θ_k) ≤ ε   or   ‖∇f(θ_k)‖ ≤ ε,

but neither directly controls the optimality gap F(θ_k) − F(θ*). The duality gap, described next, does.
Fenchel Duality
Stop when

f(Xθ_k) + λ‖θ_k‖ + f*(u_k) ≤ ε

with f(z) = ∑_{i=1}^n ℓ(y_i, z_i) for z = [z_1 · · · z_n]^⊤, X the matrix with rows x_1^⊤, . . . , x_n^⊤, and u_k a dual feasible point. The gradient of θ ↦ f(Xθ) is

X^⊤∇f(Xθ) = ∑_{i=1}^n ℓ′(y_i, ⟨x_i, θ⟩) x_i ,

where ℓ′ denotes the derivative of ℓ with respect to its second argument. For the least-squares loss,

f*(u) = (1/2)‖u‖_2² + ⟨u, y⟩
For logistic regression, f(z) = ∑_{i=1}^n log(1 + e^{−y_i z_i}) and we have

f*(u) = ∑_{i=1}^n { (1 + u_i y_i) log(1 + u_i y_i) − u_i y_i log(−u_i y_i) }

if u_i y_i ∈ [−1, 0] for all i, and f*(u) = +∞ otherwise
Duality gap
Stop if
(1/2)‖r_k‖_2² + λ‖θ_k‖_1 + (1/2)‖u_k‖_2² + ⟨u_k, y⟩ ≤ ε

where r_k = Xθ_k − y is the residual and u_k is a dual feasible point, for instance a rescaling of r_k such that ‖X^⊤u_k‖_∞ ≤ λ.
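A small NumPy sketch of this stopping criterion for the Lasso (illustrative; the particular dual point below, a residual rescaled so that ‖X^⊤u‖_∞ ≤ λ, is a standard construction assumed here, since the slides do not spell it out):

```python
import numpy as np

def lasso_duality_gap(X, y, theta, lam):
    """Duality gap for min 0.5 * ||y - X theta||_2^2 + lam * ||theta||_1.
    Its value upper-bounds F(theta) - F(theta_star)."""
    residual = X @ theta - y
    # Rescale the residual so that ||X^T u||_inf <= lam (dual feasibility)
    dual_norm = np.max(np.abs(X.T @ residual))
    scaling = 1.0 if dual_norm <= lam else lam / dual_norm
    u = scaling * residual
    primal = 0.5 * np.sum(residual ** 2) + lam * np.abs(theta).sum()
    conjugate = 0.5 * np.sum(u ** 2) + u @ y
    return primal + conjugate

# Usage: stop the ISTA/FISTA iterations once lasso_duality_gap(X, y, theta, lam) <= eps
```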