Suvrit Sra
Massachusetts Institute of Technology
01 Apr, 2021
6.881 Optimization for Machine Learning (Lecture 12)

Coordinate descent

So far: $\min_x\; f(x) = \sum_{i=1}^{n} f_i(x)$
Coordinate descent

For $k = 0, 1, \ldots$
  Pick an index $i$ from $\{1, \ldots, d\}$
  Optimize the $i$-th coordinate:
$$x_i^{k+1} \leftarrow \operatorname*{argmin}_{\xi \in \mathbb{R}}\; f(\underbrace{x_1^{k+1}, \ldots, x_{i-1}^{k+1}}_{\text{done}},\; \underbrace{\xi}_{\text{current}},\; \underbrace{x_{i+1}^k, \ldots, x_d^k}_{\text{todo}})$$
Coordinate descent – context

♣ One of the simplest optimization methods
♣ Old idea: Gauss-Seidel, Jacobi methods for linear systems!
♣ Can be “slow”, but sometimes very competitive
♣ Gradient, subgradient, incremental methods also “slow”
♣ But incremental, stochastic gradient methods are scalable
♣ Renewed interest in CD was driven by ML
♣ Notice: in general CD is “derivative free”
CD – which coordinate?
Exercise: CD for least squares
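A minimal sketch of this exercise (my code, not the slides'): cyclic CD for $\min_x \tfrac{1}{2}\|Ax-b\|_2^2$. Each one-dimensional subproblem has a closed-form solution, and maintaining the residual $r = b - Ax$ keeps every coordinate update at $O(m)$ cost.

```python
import numpy as np

def cd_least_squares(A, b, num_epochs=50):
    """Cyclic coordinate descent for min_x 0.5 * ||Ax - b||^2.

    Each coordinate subproblem has the closed-form minimizer
    x_i <- x_i + a_i^T r / ||a_i||^2, where a_i is the i-th column
    of A and r = b - Ax is the current residual.
    """
    m, d = A.shape
    x = np.zeros(d)
    r = b - A @ x                      # residual, maintained incrementally
    col_sq = (A ** 2).sum(axis=0)      # ||a_i||^2 for every column
    for _ in range(num_epochs):
        for i in range(d):
            if col_sq[i] == 0.0:       # skip all-zero columns
                continue
            delta = A[:, i] @ r / col_sq[i]
            x[i] += delta
            r -= delta * A[:, i]       # keep r = b - Ax in O(m) work
    return x

# quick check against the normal-equations solution
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 20))
b = rng.standard_normal(100)
x_cd = cd_least_squares(A, b, num_epochs=200)
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.linalg.norm(x_cd - x_ls))     # should be tiny
```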
Coordinate descent – some remarks

Advantages
♦ Each iteration usually cheap (single variable optimization)
♦ No extra storage vectors needed
♦ No stepsize tuning
♦ No other pesky parameters (usually) that must be tuned
♦ Simple to implement
♦ Can work well for large-scale problems

Disadvantages
♠ Tricky if single variable optimization is hard
♠ Convergence theory can be complicated
♠ Can slow down near optimum
♠ Nonsmooth case more tricky
♠ Explore: not easy to use for deep learning...
BCD
(Basics, Convergence)
Block coordinate descent (BCD)

Gauss-Seidel update
$$x_i^{k+1} \leftarrow \operatorname*{argmin}_{\xi \in \mathcal{X}_i}\; f(\underbrace{x_1^{k+1}, \ldots, x_{i-1}^{k+1}}_{\text{done}},\; \underbrace{\xi}_{\text{current}},\; \underbrace{x_{i+1}^k, \ldots, x_m^k}_{\text{todo}})$$
BCD – convergence

Theorem. Let $f$ be $C^1$ over $\mathcal{X} := \prod_{i=1}^m \mathcal{X}_i$. Assume that for each block $i$ and each $x \in \mathcal{X}$, the minimum
$$\min_{\xi \in \mathcal{X}_i}\; f(x_1, \ldots, x_{i-1}, \xi, x_{i+1}, \ldots, x_m)$$
is uniquely attained. Then every limit point of the sequence $(x^k)$ generated by BCD is a stationary point of $f$.
BCD – Two blocks

Two block BCD
CD – projection onto convex sets

$$\min\; \tfrac{1}{2}\|x - y\|_2^2 \quad \text{s.t.} \quad x \in C_1 \cap C_2 \cap \cdots \cap C_m.$$
Solution 1: Product space technique

s.t. $x_1 = x_2 = \cdots = x_n$.
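Only the coupling constraint survives on this slide; what follows is the standard product-space construction that it suggests (a reconstruction, not verbatim from the deck):

```latex
% Lift to the product space H^m, one copy x_i per set C_i:
\min_{x_1,\dots,x_m}\ \sum_{i=1}^m \tfrac{1}{2}\|x_i - y\|_2^2
\quad\text{s.t.}\quad x_i \in C_i\ (i=1,\dots,m),\qquad
x_1 = x_2 = \cdots = x_m.
% Equivalently: project (y,\dots,y) onto the intersection of just TWO
% convex sets, the product C_1 \times \cdots \times C_m and the diagonal
% D = \{(x,\dots,x) : x \in H\}, so two-block methods apply.
```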
Solution 2: proximal Dykstra

$$\min\; \tfrac{1}{2}\|x - y\|_2^2 + f(x) + h(x)$$
The Proximal-Dykstra method

▶ $0 \in \nu + \mu_k - y + \partial f^*(\nu)$
▶ $0 \in \nu_{k+1} + \mu - y + \partial h^*(\mu)$
▶ $y - \mu_k \in \nu + \partial f^*(\nu) = (I + \partial f^*)(\nu)$
  $\implies \nu = \operatorname{prox}_{f^*}(y - \mu_k) \implies \nu = y - \mu_k - \operatorname{prox}_f(y - \mu_k)$,
  where the last step uses the Moreau decomposition $\operatorname{prox}_{f^*}(z) = z - \operatorname{prox}_f(z)$.
▶ Similarly, we see that $\mu = y - \nu_{k+1} - \operatorname{prox}_h(y - \nu_{k+1})$

$$\nu_{k+1} \leftarrow y - \mu_k - \operatorname{prox}_f(y - \mu_k)$$
$$\mu_{k+1} \leftarrow y - \nu_{k+1} - \operatorname{prox}_h(y - \nu_{k+1})$$
Proximal-Dykstra as CD

$$x = y - \nu - \mu \implies y - \mu = x + \nu$$
$$\begin{aligned}
t_k &\leftarrow \operatorname{prox}_f(x_k + \nu_k)\\
\nu_{k+1} &\leftarrow x_k + \nu_k - t_k\\
x_{k+1} &\leftarrow \operatorname{prox}_h(\mu_k + t_k)\\
\mu_{k+1} &\leftarrow \mu_k + t_k - x_{k+1}
\end{aligned}$$
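When $f$ and $h$ are indicator functions of convex sets, the prox operators are projections and the recursion above is Dykstra's projection method. A runnable sketch under that assumption (the example sets and helper names are mine):

```python
import numpy as np

def proj_ball(z, radius=1.0):
    """Euclidean projection onto the l2 ball of the given radius."""
    nrm = np.linalg.norm(z)
    return z if nrm <= radius else z * (radius / nrm)

def proj_halfspace(z, a, b):
    """Projection onto the halfspace {x : <a, x> <= b}."""
    viol = a @ z - b
    return z if viol <= 0 else z - (viol / (a @ a)) * a

def dykstra(y, prox_f, prox_h, iters=200):
    """Proximal-Dykstra iteration from the slides, specialized to
    projections (prox of an indicator function = projection)."""
    x, nu, mu = y.copy(), np.zeros_like(y), np.zeros_like(y)
    for _ in range(iters):
        t = prox_f(x + nu)          # t_k      <- prox_f(x_k + nu_k)
        nu = x + nu - t             # nu_{k+1} <- x_k + nu_k - t_k
        x = prox_h(mu + t)          # x_{k+1}  <- prox_h(mu_k + t_k)
        mu = mu + t - x             # mu_{k+1} <- mu_k + t_k - x_{k+1}
    return x

# project y onto (unit ball) ∩ (halfspace {sum(x) <= 1})
rng = np.random.default_rng(1)
y = rng.standard_normal(5) + 2.0
a, b = np.ones(5), 1.0
x = dykstra(y, proj_ball, lambda z: proj_halfspace(z, a, b))
print(np.linalg.norm(x), a @ x)     # both approximately <= 1
```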
CD – nonsmooth case
CD for nonsmooth convex problems
CD for separable nonsmoothness
CD – iteration complexity
CD non-asymptotic rate
CD – non-asymptotic convergence

$$f(x_k) - f^* \le \frac{2\,d\,M\,\|x_0 - x^*\|_2^2}{k+4}, \qquad k \ge 0.$$
BCD – Notation

▶ Observation:
$$E_i^{\top} E_j = \begin{cases} I_{N_i} & i = j\\ 0_{N_i \times N_j} & i \neq j.\end{cases}$$
▶ So the $E_i$'s define our partitioning of the coordinates
▶ Just fancier notation for a random partition of coordinates
▶ Now with this notation ...
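A tiny numerical illustration of the $E_i$ notation (a hypothetical example, not from the slides): each $E_i$ is just a selection of identity columns.

```python
import numpy as np

# Partition 5 coordinates into blocks {0,1,2} and {3,4}.
E1 = np.eye(5)[:, [0, 1, 2]]   # 5 x 3: identity columns for block 1
E2 = np.eye(5)[:, [3, 4]]      # 5 x 2: identity columns for block 2

print(E1.T @ E1)    # I_3           (the i = j case)
print(E1.T @ E2)    # 3 x 2 zeros   (the i != j case)

x = np.arange(5.0)
print(E2.T @ x)     # extracts block 2 of x: [3., 4.]
print(E2 @ (E2.T @ x))  # zero-pads block 2 back into R^5
```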
BCD – formal setup
Randomized BCD

▶ For $k \ge 0$ (no init. of $x$ necessary)
▶ Pick a block $i$ from $[d]$ with probability $p_i > 0$
▶ Optimize the upper bound (partial gradient step) for block $i$:
$$h = \operatorname*{argmin}_h\; f(x_k) + \langle \nabla_i f(x_k), h\rangle + \tfrac{L_i}{2}\|h\|^2 \;\implies\; h = -\tfrac{1}{L_i}\nabla_i f(x_k)$$

Notice: original BCD had $x_k^{(i)} = \operatorname*{argmin}_h f(\ldots, \underbrace{h}_{\text{block } i}, \ldots)$; we'll call this BCM (Block Coordinate Minimization).
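A sketch of this randomized scheme for scalar blocks (my code; the quadratic test problem and the sampling choice $p_i \propto L_i$ are illustrative assumptions, not the slides' prescription):

```python
import numpy as np

def randomized_cd(grad_i, L, d, x0, iters=1000, p=None, seed=0):
    """Randomized coordinate gradient descent (sketch).

    grad_i(x, i): i-th partial derivative of f at x
    L[i]:         coordinate-wise Lipschitz constant of grad_i
    p:            sampling probabilities (default: proportional to L;
                  uniform p_i = 1/d also works)
    """
    rng = np.random.default_rng(seed)
    p = np.asarray(L) / np.sum(L) if p is None else p
    x = x0.copy()
    for _ in range(iters):
        i = rng.choice(d, p=p)
        x[i] -= grad_i(x, i) / L[i]    # h = -(1/L_i) * grad_i f(x_k)
    return x

# example: f(x) = 0.5 x^T Q x - b^T x, so grad_i f(x) = Q[i] @ x - b[i]
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
L = np.diag(Q)                         # L_i = Q_ii for this quadratic
x = randomized_cd(lambda x, i: Q[i] @ x - b[i], L, 2, np.zeros(2), 5000)
print(x, np.linalg.solve(Q, b))        # should roughly agree
```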
Exercise: proximal extension

$$x_k^{(i)} = \operatorname*{argmin}_h\; f(x_k) + \langle \nabla_i f(x_k), h\rangle + \tfrac{L_i}{2}\|h\|^2 + r_i(E_i^{\top} x_k + h)$$
$$x_k^{(i)} = \operatorname{prox}_{r_i}(\cdots), \qquad h = \operatorname{prox}_{(1/L_i)\, r_i}\!\Big(E_i^{\top} x_k - \tfrac{1}{L_i}\nabla_i f(x_k)\Big)$$
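For scalar blocks with $r_i = \lambda|\cdot|$ the prox step is soft-thresholding, which yields the classic coordinate-descent Lasso solver; a hedged sketch (this instantiation is mine):

```python
import numpy as np

def soft_threshold(z, t):
    """prox of t*|.|: shrink z toward zero by t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def cd_lasso(A, b, lam, epochs=100):
    """Proximal coordinate descent for 0.5*||Ax-b||^2 + lam*||x||_1.

    Coordinate i uses L_i = ||a_i||^2 and the update
    x_i <- prox_{(lam/L_i)|.|}(x_i - grad_i / L_i),
    i.e., the slide's proximal coordinate-gradient step.
    """
    m, d = A.shape
    x, r = np.zeros(d), b.copy()           # r = b - Ax
    L = (A ** 2).sum(axis=0)
    for _ in range(epochs):
        for i in np.flatnonzero(L):
            g = -A[:, i] @ r                # grad_i of the smooth part
            xi_new = soft_threshold(x[i] - g / L[i], lam / L[i])
            r -= (xi_new - x[i]) * A[:, i]  # maintain r = b - Ax
            x[i] = xi_new
    return x

# x_hat = cd_lasso(A, b, lam=0.1)  # with A, b as in the earlier example
```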
Randomized BCD – analysis

$$h \leftarrow \operatorname*{argmin}_h\; f(x_k) + \langle \nabla_i f(x_k), h\rangle + \tfrac{L_i}{2}\|h\|^2$$

Descent: with $x_{k+1} = x_k + E_i h$,
$$f(x_{k+1}) \le f(x_k) + \langle \nabla_i f(x_k), h\rangle + \tfrac{L_i}{2}\|h\|^2$$
$$x_{k+1} = x_k - \tfrac{1}{L_i} E_i \nabla_i f(x_k)$$
$$f(x_{k+1}) \le f(x_k) - \tfrac{1}{L_i}\|\nabla_i f(x_k)\|^2 + \tfrac{L_i}{2}\Big\|\tfrac{1}{L_i}\nabla_i f(x_k)\Big\|^2$$
$$f(x_{k+1}) \le f(x_k) - \tfrac{1}{2L_i}\|\nabla_i f(x_k)\|^2, \quad\text{i.e.,}\quad f(x_k) - f(x_{k+1}) \ge \tfrac{1}{2L_i}\|\nabla_i f(x_k)\|^2.$$
Randomized BCD – analysis

Expected descent:
$$f(x_k) - \mathbb{E}[f(x_{k+1}) \mid x_k] = \sum_{i=1}^d p_i\Big(f(x_k) - f\big(x_k - \tfrac{1}{L_i} E_i \nabla_i f(x_k)\big)\Big)$$
$$\ge \sum_{i=1}^d \tfrac{p_i}{2L_i}\|\nabla_i f(x_k)\|^2 = \tfrac{1}{2}\|\nabla f(x_k)\|_W^2 \quad \text{(suitable } W\text{; e.g., block-diagonal with blocks } (p_i/L_i)\, I_{N_i}\text{)}.$$
BCD – Exercise
Connections
CD – exercise

$$\min_x\; \frac{1}{n}\sum_{i=1}^n f_i(x^{\top} a_i) + \frac{\lambda}{2}\|x\|^2.$$

Dual problem
$$\max_{\alpha}\; \frac{1}{n}\sum_{i=1}^n -f_i^*(-\alpha_i) - \frac{\lambda}{2}\Big\|\frac{1}{\lambda n}\sum_{i=1}^n \alpha_i a_i\Big\|^2$$
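Coordinate ascent on this dual is the SDCA method. A sketch for the squared loss $f_i(z) = \tfrac{1}{2}(z - b_i)^2$, where the one-coordinate dual maximization has a closed form (this specific instance and all names are my assumptions, not the slides'):

```python
import numpy as np

def sdca_squared_loss(A, b, lam, epochs=50, seed=0):
    """Stochastic dual coordinate ascent for ridge regression:
    min_x (1/n) sum_i 0.5*(a_i^T x - b_i)^2 + (lam/2)*||x||^2.

    Maintains the primal iterate w = (1/(lam*n)) * sum_i alpha_i * a_i;
    each step maximizes the dual exactly in one coordinate alpha_i.
    """
    n, d = A.shape
    alpha, w = np.zeros(n), np.zeros(d)
    rng = np.random.default_rng(seed)
    for _ in range(epochs * n):
        i = rng.integers(n)
        # closed-form coordinate maximizer for the squared loss
        delta = (b[i] - A[i] @ w - alpha[i]) / (1.0 + (A[i] @ A[i]) / (lam * n))
        alpha[i] += delta
        w += (delta / (lam * n)) * A[i]
    return w

# sanity check against the closed-form ridge solution
rng = np.random.default_rng(2)
A = rng.standard_normal((200, 10)); b = rng.standard_normal(200)
lam = 0.1
w = sdca_squared_loss(A, b, lam, epochs=100)
w_star = np.linalg.solve(A.T @ A / len(b) + lam * np.eye(10), A.T @ b / len(b))
print(np.linalg.norm(w - w_star))      # should be small
```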
Other connections