
Tutorial:

Optimization for Machine Learning

Elad Hazan
Princeton University and Google Brain
http://www.cs.princeton.edu/~ehazan/tutorial/MLSStutorial.htm
“Instead of trying to produce a programme to simulate the
adult mind, why not rather try to produce one which simulates
the child's?”

A. M. Turing, 1950


ML paradigm

[Diagram: distribution over examples {a} ∈ R^d → machine → label b = f_parameters(a), e.g. chair/car]

This tutorial - training the machine:
• Efficiency
• Generalization
Agenda
1. Learning as mathematical optimization
• Empirical Risk Minimization
• Basics of mathematical optimization
• Gradient descent + SGD
2. Regularization and Generalization
• Convex, non-smooth opt.
• Regret minimization in games, PAC learnability
3. Gradient Descent++
• Regularization, Adaptive Regularization and AdaGrad
• Momentum and variance reduction
• Second order methods
• Constraints and the Frank-Wolfe algorithm
4. A touch of state-of-the-art
Bibliography + more info: http://www.cs.princeton.edu/~ehazan/tutorial/MLSStutorial.htm
Agenda

Will NOT touch upon:

• Parallelism/distributed computation (asynchronous optimization, HOGWILD etc.), Bayesian inference in graphical models, Markov Chain Monte Carlo, partial information and bandit algorithms, optimization for RL (policy and value iteration, policy gradient, …), hyperparameter optimization, …
Mathematical optimization

Input: function f: K → R, for K ⊆ R^d

Output: minimizer x ∈ K, such that f(x) ≤ f(y) ∀ y ∈ K

Accessing f?
• Value oracle (0th-order opt.)
• Gradient (1st-order opt.), k'th differentials (k'th-order opt. …)

Accessing K? (separation/membership oracle, explicit constraints)

Generally NP-hard, given full access to the function.
NP-hardness of mathematical optimization

Max Cut over G = (V, E):

max_{x ∈ {−1,1}^V} ∑_{(i,j) ∈ E} (1 − x_i x_j)/2
What is Optimization

Supervised Learning = optimization over data
(a.k.a. Empirical Risk Minimization)

Fitting the parameters of the model ("training") = optimization problem:

[Figure: Duchi (UC Berkeley), "Convex Optimization for Machine Learning": "But generally speaking... we're screwed" — local (non-global) minima, all kinds of constraints, e.g. h(x) = sin(2πx) = 0]
arg min_{x ∈ R^d} (1/m) ∑_{i=1}^{m} ℓ(x, a_i, b_i) + R(x)

m = # of examples, (a, b) = (features, label), d = dimension
Example: linear classification

Given a sample S = {(a_1, b_1), …, (a_m, b_m)},
find a hyperplane (through the origin w.l.o.g.) such that:

x = arg min_{‖x‖=1} # of mistakes = arg min_{‖x‖=1} |{i s.t. sign(x^⊤ a_i) ≠ b_i}|

arg min_{‖x‖=1} (1/m) ∑_i ℓ(x, a_i, b_i)   for   ℓ(x, a_i, b_i) = { 1 if sign(x^⊤ a_i) ≠ b_i ; 0 if sign(x^⊤ a_i) = b_i }

NP hard!
Example: deep neural net classification

Given a sample S = {(a_1, b_1), …, (a_m, b_m)},
find a depth-N ResNet with weights W such that:

W = arg min_W # of mistakes = arg min_W |{i s.t. sign(Net_W(a_i)) ≠ b_i}|

arg min_W (1/m) ∑_i ℓ(W, a_i, b_i)   for   ℓ(W, a_i, b_i) = { 1 if Net_W(a_i) ≠ b_i ; 0 if Net_W(a_i) = b_i }

Sum of signs → global optimization is NP-hard!

Local property that ensures global optimality?

Convexity

A function f: R^d → R is convex if and only if:

f(½x + ½y) ≤ ½f(x) + ½f(y)

• Informally: smiley ☺
• Alternative definition (for differentiable f):

f(y) ≥ f(x) + ∇f(x)^⊤(y − x)
Convex sets

Set K is convex if and only if:

x, y ∈ K ⇒ (½x + ½y) ∈ K

Loss functions: ℓ(x, a_i, b_i) = ℓ(x^⊤ a_i ⋅ b_i)
Convex relaxations for linear classification

x = arg min_{‖x‖=1} |{i s.t. sign(x^⊤ a_i) ≠ b_i}|

1. Ridge / linear regression:  ℓ(x^⊤ a_i, b_i) = (x^⊤ a_i − b_i)²

2. SVM:  ℓ(x^⊤ a_i, b_i) = max{0, 1 − b_i x^⊤ a_i}

3. Logistic regression:  ℓ(x^⊤ a_i, b_i) = log(1 + e^{−b_i ⋅ x^⊤ a_i})
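To make the three relaxations concrete, here is a minimal NumPy sketch of the surrogate losses; the function names and the toy example are illustrative, not from the tutorial:

import numpy as np

def squared_loss(x, a, b):
    # ridge / linear regression surrogate: (x^T a - b)^2
    return (x @ a - b) ** 2

def hinge_loss(x, a, b):
    # SVM surrogate: max{0, 1 - b * x^T a}
    return max(0.0, 1.0 - b * (x @ a))

def logistic_loss(x, a, b):
    # logistic regression surrogate: log(1 + exp(-b * x^T a))
    return np.log1p(np.exp(-b * (x @ a)))

x = np.array([0.5, -1.0])           # candidate hyperplane (through the origin)
a, b = np.array([1.0, 2.0]), -1.0   # one example: features a, label b in {-1,+1}
print(squared_loss(x, a, b), hinge_loss(x, a, b), logistic_loss(x, a, b))

All three upper-bound the 0/1 mistake indicator while being convex in x, which is exactly what makes the relaxation tractable.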
Non-Convex Optimization

Multiple unsurpassable hurdles:
• Global optimization is NP-hard
• Even deciding whether you are at a local minimum is NP-hard

[Figure: Duchi (UC Berkeley), "Convex Optimization for Machine Learning": local (non-global) minima, all kinds of constraints (even restricting to continuous functions), h(x) = sin(2πx) = 0]
So what can we hope to do?

• ε-first-order critical points:  {h : ‖∇f(h)‖ ≤ ε}

Good enough: λ_min(∇²f(h)) > 0 (local minimum).   Less ideal: λ_min(∇²f(h)) < 0 (saddle point).

k-th order critical points:  f(x) − f(x*) = O(‖x − x*‖^{k+1}).   4th-order critical points: NP-hard!


A better baseline

• ε-approximate local minimum (Nesterov and Polyak):

‖∇f(h)‖ ≤ ε   and   ∇²f(h) ≽ −√ε I

• Does not guarantee a local minimum even with ε = 0 (though it does for some functions)
We have:
1. motivated learning as mathematical optimization
2. convexity ⇒ can locally verify global optimality
3. non-convex opt ⇒ settle for k-th-order local optimality
(2nd order is reasonable, 4th order is NP-hard)

Next ⇒ algorithms!
Unconstrained Gradient Descent

[∇f(x)]_i = ∂f(x)/∂x_i

x_{t+1} ← x_t − η ∇f(x_t)

[Figure: iterates p_1, p_2, p_3, … descending toward the minimizer p*]
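As a concrete illustration, a minimal gradient descent loop in Python; the quadratic objective in the demo is an assumption chosen for simplicity:

import numpy as np

def gradient_descent(grad_f, x0, eta, T):
    # iterate x_{t+1} <- x_t - eta * grad f(x_t)
    x = np.array(x0, dtype=float)
    for _ in range(T):
        x = x - eta * grad_f(x)
    return x

# demo on f(x) = ||x||^2 / 2, so grad f(x) = x (an illustrative choice)
print(gradient_descent(lambda x: x, np.ones(3), eta=0.1, T=100))  # -> near 0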
Smooth functions
β = smoothness; simplified:

−(β/2) I ≼ ∇²f(x) ≼ (β/2) I

Why is this important?

For gradients to be meaningful at all…

Remark: any function can be “smoothed” by local integration


Convergence of gradient descent

x_{t+1} ← x_t − η ∇f(x_t)

Theorem: for β-smooth, M-bounded functions, step size η = 1/β:

(1/T) ∑_t ‖∇f(x_t)‖² ≤ 2Mβ/T

Thus, there exists a time t for which:

‖∇f(x_t)‖² ≤ 2Mβ/T

Convex f ⇒ global convergence:

f(x_t) − f(x*) ≤ ‖x_t − x*‖ ⋅ ‖∇f(x_t)‖ = O(D/√T)
Proof:   x_{t+1} ← x_t − η ∇f(x_t)

1. The descent lemma:

f(x_t) − f(x_{t+1}) ≥ −∇f(x_t)^⊤(x_{t+1} − x_t) − (β/2)‖x_t − x_{t+1}‖²
= η ‖∇_t‖² − (β/2) η² ‖∇_t‖²
= (1/2β) ‖∇_t‖²

2. Summing over all iterations:

M ≥ f(x_1) − f(x*) ≥ ∑_t [f(x_t) − f(x_{t+1})] ≥ (1/2β) ∑_t ‖∇_t‖²

and thus   (1/T) ∑_t ‖∇f(x_t)‖² ≤ 2Mβ/T

For convex functions:

f(x) − f(x*) ≤ ∇f(x)^⊤(x − x*) ≤ D ‖∇f(x)‖

Thus, in T iterations, there exists an iterate t for which:

f(x_t) − f(x*) ≤ D √(2Mβ/T)
Our objective function:

f(x) = (1/m) ∑_{i=1}^{m} ℓ(x, a_i, b_i) + R(x),   minimized over x ∈ R^d

m = # of examples, (a, b) = (features, label), d = dimension

→ Gradients are expensive! (entire data set per iteration)
→ What about non-smooth functions? (non-convex: infeasible)
→ How good is a local minimum in terms of generalization?
→ Can we get faster rates?
Stochastic gradient descent

Learning problem:   arg min_{x ∈ R^d} f(x) = E_{(a_i,b_i)}[ℓ(x, a_i, b_i)]

[Figure: Duchi (UC Berkeley), non-convex surface]

"Random approximation" (Robbins/Monro '51):

estimate the gradient using a single example:   f_t(x) = ℓ(x, a_i, b_i)

∇̂ satisfies   E[∇̂] = ∇f(x) = E_{(a_i,b_i)}[∇ℓ(x, a_i, b_i)]

Move in the direction of the estimator!   x_{t+1} ← x_t − η ∇̂(x_t)

Convergence?

Variance of the gradient estimator:   E[‖∇̂‖²] = σ²

[Figure: stochastic vs. full gradient descent trajectories]
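A minimal SGD sketch on synthetic least-squares data; the data, loss, and step size below are illustrative assumptions, not from the tutorial:

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(1000, 5))   # features a_i
b = A @ rng.normal(size=5)       # labels b_i (noiseless least squares, for the demo)

def sgd(A, b, eta, T):
    x = np.zeros(A.shape[1])
    for _ in range(T):
        i = rng.integers(len(b))              # draw a single random example
        grad_est = (x @ A[i] - b[i]) * A[i]   # grad of (x^T a_i - b_i)^2 / 2
        x -= eta * grad_est                   # unbiased: E[grad_est] = grad f(x)
    return x

x_hat = sgd(A, b, eta=0.01, T=20000)
print(np.linalg.norm(A @ x_hat - b) / np.sqrt(len(b)))  # RMS error shrinks toward 0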
Convergence of SGD

x_{t+1} ← x_t − η ∇̂f(x_t)

Theorem: for β-smooth, M-bounded functions, step size η = √(M/(βσ²T)):

(1/T) ∑_t ‖∇f(x_t)‖² ≤ 2 √(Mβσ²/T)
Proof:   x_{t+1} ← x_t − η ∇̂f(x_t)

1. The descent lemma:

E[f(x_t) − f(x_{t+1})] ≥ E[−∇f(x_t)^⊤(x_{t+1} − x_t) − (β/2)‖x_t − x_{t+1}‖²]
= E[∇_t^⊤ ⋅ η∇̂_t − (β/2) η² ‖∇̂_t‖²] = η ‖∇_t‖² − (β/2) η² σ²

2. Summing over all iterations:

M ≥ f(x_1) − f(x*) ≥ ∑_t E[f(x_t) − f(x_{t+1})] ≥ η ∑_t ‖∇_t‖² − (β/2) η² σ² T

and thus   (1/T) ∑_t ‖∇f(x_t)‖² ≤ M/(ηT) + ηβσ² ≤ 2 √(Mβσ²/T)
Our objective function:

f(x) = (1/m) ∑_{i=1}^{m} ℓ(x, a_i, b_i) + R(x),   minimized over x ∈ R^d

m = # of examples, (a, b) = (features, label), d = dimension

→ Gradients are expensive! (entire data set per iteration)
→ What about non-smooth functions? (non-convex: infeasible)
→ How good is a local minimum in terms of generalization?
→ Can we get faster rates?
Tutorial: PART 2

Optimization for Machine Learning

Elad Hazan
Princeton University and Google Brain
Bibliography + more info:
http://www.cs.princeton.edu/~ehazan/tutorial/MLSStutorial.htm
Agenda
1. Learning as mathematical optimization
• Empirical Risk Minimization
• Basics of mathematical optimization
• Gradient descent + SGD
2. Regularization and Generalization
• Convex, non-smooth opt.
• Regret minimization in games, PAC learnability
3. Gradient Descent++
• Regularization, Adaptive Regularization and AdaGrad
• Momentum and variance reduction
• Second order methods
• Constraints and the Frank-Wolfe algorithm
4. A touch of state-of-the-art
→ What about non-smooth functions? (non-convex: infeasible)
→ Gradients are expensive! (entire data set per iteration)
→ How good is a local minimum in terms of generalization?
→ Can we get faster rates?

→ convex optimization
Statistical (PAC) learning
Nature: i.i.d. from distribution D over A × B = {(a, b)}

Samples: (a_1, b_1), …, (a_m, b_m)
• Learner: produces hypothesis h from class H = {h_1, h_2, …, h_N}
• Loss, e.g. ℓ(h, a, b) = (h(a) − b)²

err(h) = E_{(a,b)∼D}[ℓ(h, a, b)]

Hypothesis class H: X → Y is learnable if ∀ ε, δ > 0 there exists an algorithm s.t. after seeing m examples, for m = poly(δ, ε, dimension(H)), it finds h s.t. w.p. 1 − δ:

err(h) ≤ min_{h* ∈ H} err(h*) + ε
More powerful setting:
Online Learning in Games
Iteratively, for t = 1, 2, …, T:
Player: h_t ∈ H
Adversary: (a_t, b_t) ∈ A × B
Loss ℓ(h_t, (a_t, b_t))

Goal: minimize (average, expected) regret:

(1/T) [ ∑_t ℓ(h_t, (a_t, b_t)) − min_{h* ∈ H} ∑_t ℓ(h*, (a_t, b_t)) ] → 0   as T → ∞

Vanishing regret ⇒ generalization in the PAC setting! (online2batch)

From this point onwards:   f_t(x) = ℓ(x, a_t, b_t) = loss for one example
Can we minimize regret efficiently?
Online gradient descent

y_{t+1} = x_t − η ∇f_t(x_t)
x_{t+1} = arg min_{x ∈ K} ‖y_{t+1} − x‖

Theorem: Regret = ∑_t f_t(x_t) − min_{x*} ∑_t f_t(x*) = O(√T)

Theorem: for step size η = D/(G√T) — NO SMOOTHNESS REQUIRED!! —

∑_t f_t(x_t) − min_{x*} ∑_t f_t(x*) ≤ DG√T

Where:
• G = upper bound on the norm of the gradients:   ‖∇f_t(x_t)‖ ≤ G
• D = diameter of the constraint set:   ∀ x, y ∈ K, ‖x − y‖ ≤ D
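A minimal sketch of projected online gradient descent, assuming K is a Euclidean ball (so the projection is a rescaling) and assuming linear losses for the demo:

import numpy as np

def project_ball(y, R):
    # Euclidean projection onto K = {x : ||x|| <= R}
    n = np.linalg.norm(y)
    return y if n <= R else (R / n) * y

def online_gradient_descent(loss_grads, d, R, G, T):
    # step size eta = D / (G sqrt(T)) with diameter D = 2R
    eta = (2 * R) / (G * np.sqrt(T))
    x, plays = np.zeros(d), []
    for t in range(T):
        plays.append(x)
        y = x - eta * loss_grads[t](x)   # gradient step on f_t
        x = project_ball(y, R)           # project back onto K
    return plays

# demo: adversarial linear losses f_t(x) = c_t . x over the unit ball (illustrative)
rng = np.random.default_rng(1)
cs = rng.normal(size=(1000, 3))
plays = online_gradient_descent([lambda x, c=c: c for c in cs], 3, R=1.0, G=3.0, T=1000)
regret = sum(c @ x for c, x in zip(cs, plays)) - (-np.linalg.norm(cs.sum(0)))
print(regret / 1000)   # average regret, decaying like DG / sqrt(T)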
Proof:   y_{t+1} ← x_t − η ∇f_t(x_t),   x_{t+1} = arg min_{x ∈ K} ‖y_{t+1} − x‖

1. Observation 1:

‖x* − y_{t+1}‖² = ‖x* − x_t‖² − 2η ∇f_t(x_t)^⊤(x_t − x*) + η² ‖∇f_t(x_t)‖²

2. Observation 2 (the Pythagorean theorem for projections):

‖x* − x_{t+1}‖² ≤ ‖x* − y_{t+1}‖²

Thus:

‖x* − x_{t+1}‖² ≤ ‖x* − x_t‖² − 2η ∇f_t(x_t)^⊤(x_t − x*) + η² G²

And hence:

(1/T) ∑_t [f_t(x_t) − f_t(x*)] ≤ (1/T) ∑_t ∇f_t(x_t)^⊤(x_t − x*)
≤ (1/T) ∑_t (1/2η)[‖x* − x_t‖² − ‖x* − x_{t+1}‖²] + (η/2) G²
≤ D²/(2ηT) + (η/2) G² ≤ DG/√T
Lower bound

Regret = Ω(√T)

• 2 loss functions, T iterations:
• K = [−1, 1],  f_1(x) = x,  f_2(x) = −x
• Second expert's loss = first × (−1)
• Expected loss = 0 (for any algorithm)
• Regret (compared to either −1 or 1):

E[ |#1's − #(−1)'s| ] = Ω(√T)
SGD: non-smooth case

Learning problem:   arg min_{x ∈ R^d} F(x) = E_{(a_i,b_i)}[ℓ(x, a_i, b_i)]
random example:   f_t(x) = ℓ(x, a_i, b_i)

[Figure: Duchi (UC Berkeley), non-convex surface]

1. We have proved (for any sequence of ∇_t):

(1/T) ∑_t ∇_t^⊤ x_t ≤ min_{x* ∈ K} (1/T) ∑_t ∇_t^⊤ x* + DG/√T

2. Taking (conditional) expectation:

E[ F((1/T) ∑_t x_t) ] − min_{x* ∈ K} F(x*) ≤ E[ (1/T) ∑_t ∇_t^⊤(x_t − x*) ] ≤ DG/√T

One example per step, same convergence as GD, and it gives direct generalization!
(formally needs martingales)

O(d/ε²) vs. O(md/ε²) total running time for ε generalization error.
Agenda
1. Learning as mathematical optimization
• Empirical Risk Minimization
• Basics of mathematical optimization
• Gradient descent + SGD
2. Regularization and Generalization
• Convex, non-smooth opt.
• Regret minimization in games, PAC learnability
3. Gradient Descent++
• Regularization, Adaptive Regularization and AdaGrad
• Momentum and variance reduction
• Second order methods
• Constraints and the Frank-Wolfe algorithm
4. A touch of state of the art
Regularization &
Gradient Descent++
Why “regularize”?

• Statistical learning theory / Occam's razor:
# of examples needed to learn a hypothesis class ~ its “dimension”
• VC dimension
• Fat-shattering dimension
• Rademacher width
• Margin/norm of linear/kernel classifier

• PAC theory: regularization ↔ reduced complexity

• Regret minimization: regularization ↔ stability
Condition number of convex functions

defined as γ = β/α, where (simplified):

0 ≺ (α/2) I ≼ ∇²f(x) ≼ (β/2) I

α = strong convexity, β = smoothness

Non-convex smooth functions: (simplified)

−(β/2) I ≼ ∇²f(x) ≼ (β/2) I

Why do we care?
Well-conditioned functions exhibit much faster optimization!
Important for regularization.
Minimize regret: best-in-hindsight

Regret = ∑_t f_t(x_t) − min_{x* ∈ K} ∑_t f_t(x*)

• Most natural (Follow The Leader):

x_t = arg min_{x ∈ K} ∑_{i=1}^{t−1} f_i(x)

• Provably works [Kalai-Vempala '05]:

x'_t = arg min_{x ∈ K} ∑_{i=1}^{t} f_i(x) = x_{t+1}

• So if x_t ≈ x_{t+1}, we get a regret bound

• But the instability ‖x_t − x_{t+1}‖ can be large!
Fixing FTL: Follow-The-Regularized-Leader (FTRL)
• Linearize: replace f_t by a linear function, ∇f_t(x_t)^⊤ x
• Add regularization:

x_t = arg min_{x ∈ K} { ∑_{i=1…t−1} ∇_i^⊤ x + (1/η) R(x) }

• R(x) is a strongly convex function; it ensures stability:

∇_t^⊤(x_t − x_{t+1}) = O(η)
FTRL vs. gradient descent

• R(x) = ½ ‖x‖²

x_t = arg min_{x ∈ K} ∑_{i=1}^{t−1} ∇f_i(x_i)^⊤ x + (1/η) R(x)
    = Π_K( −η ∑_{i=1}^{t−1} ∇f_i(x_i) )

• Essentially OGD: starting with y_1 = 0, for t = 1, 2, …

x_t = Π_K(y_t)
y_{t+1} = y_t − η ∇f_t(x_t)
FTRL vs. Multiplicative Weights
• Experts setting: K = Δ_n = distributions over experts
• f_t(x) = c_t^⊤ x, where c_t is the vector of losses
• R(x) = ∑_i x_i log x_i: negative entropy

x_t = arg min_{x ∈ K} ∑_{i=1}^{t−1} ∇f_i(x_i)^⊤ x + (1/η) R(x)
    = exp( −η ∑_{i=1}^{t−1} c_i ) / Z_t      (entrywise exponential; Z_t = normalization constant)

• Gives the Multiplicative Weights method!
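A minimal sketch of the resulting Multiplicative Weights method; the demo data (two experts, one slightly better) is an illustrative assumption:

import numpy as np

def multiplicative_weights(losses, eta):
    # losses: T x n matrix of expert losses c_t; returns the played distributions x_t
    T, n = losses.shape
    cum = np.zeros(n)
    played = []
    for t in range(T):
        w = np.exp(-eta * cum)         # entrywise exponential of cumulative losses
        played.append(w / w.sum())     # divide by the normalization constant Z_t
        cum += losses[t]
    return np.array(played)

rng = np.random.default_rng(2)
L = rng.uniform(size=(500, 2)); L[:, 0] *= 0.9   # expert 0 is slightly better
x = multiplicative_weights(L, eta=0.1)
print(x[-1])   # the weight concentrates on the better expert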


FTRL ⇔ Online Mirror Descent

x_t = arg min_{x ∈ K} ∑_{i=1}^{t−1} ∇f_i(x_i)^⊤ x + (1/η) R(x)

Bregman projection:

Π_K^R(y) = arg min_{x ∈ K} B_R(x‖y)

B_R(x‖y) := R(x) − R(y) − ∇R(y)^⊤(x − y)

x_t = Π_K^R(y_t)
y_{t+1} = (∇R)^{−1}( ∇R(y_t) − η ∇f_t(x_t) )
GD++

#1: Adaptive Regularization


Adaptive Regularization: AdaGrad

• Consider a generalized linear model; the prediction is a function of a^⊤ x:

∇f_t(x) = ℓ'(a_t, b_t, x) ⋅ a_t

• OGD update:   x_{t+1} = x_t − η ∇_t = x_t − η ℓ'(a_t, b_t, x) a_t

• All features are treated equally in updating the parameter vector

• In typical text classification tasks, feature vectors a_t are very sparse →

slow learning!

• Adaptive regularization: per-feature learning rates

Optimal regularization

• The general RFTL form:

x_t = arg min_{x ∈ K} ∑_{i=1…t−1} f_i(x) + (1/η) R(x)

• Which regularizer to pick?

• AdaGrad: treat this as a learning problem!
Family of regularizations:

R(x) = ‖x‖²_A   s.t.   A ≽ 0, Trace(A) = d

• Objective in the matrix world: best regret in hindsight!


AdaGrad

• Set x_1 ∈ K arbitrarily
• For t = 1, 2, …:
  1. use x_t, obtain f_t, and gradient g_t = ∇f_t(x_t)
  2. compute x_{t+1} as follows:

G_t = ∑_{i=1…t} g_i g_i^⊤

y_{t+1} = x_t − η G_t^{−1/2} g_t      (or the diagonal version: y_{t+1} = x_t − η diag(G_t)^{−1/2} g_t)

x_{t+1} = arg min_{x ∈ K} (y_{t+1} − x)^⊤ G_t (y_{t+1} − x)
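A minimal sketch of the unconstrained diagonal version; the badly scaled quadratic in the demo is an illustrative assumption:

import numpy as np

def adagrad_diag(grad_f, x0, eta, T, eps=1e-8):
    # per-coordinate step sizes eta / sqrt(sum_s g_s[i]^2)
    x = np.array(x0, dtype=float)
    G = np.zeros_like(x)          # running sum of squared gradients, per coordinate
    for _ in range(T):
        g = grad_f(x)
        G += g * g
        x -= eta * g / (np.sqrt(G) + eps)
    return x

# demo on a badly scaled quadratic f(x) = sum_i c_i x_i^2 (illustrative)
c = np.array([100.0, 1.0, 0.01])
print(adagrad_diag(lambda x: 2 * c * x, np.ones(3), eta=0.5, T=1000))

Note how each coordinate gets its own effective learning rate, so the tiny-curvature coordinate is not starved by the large one.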
Intuition: Adaptive Preconditioning
● Per-coordinate scaling matrix D_t = ( ∑_{s=1}^{t} diag(g_s[i]²) )^{−1/2}
● Adaptive methods learn a D which makes the loss surface more isotropic:

f(x) ↦ f(Dx)
The ratio of adaptivity

• Theorem 1 (informal): AdaGrad's regret is within a factor of at most O(d) of the best regularization in hindsight from the family

R(x) = ‖x‖²_A   s.t.   A ≽ 0, Trace(A) = d

• Define the ratio of adaptivity:

μ := ∑_{t=1}^{T} ⟨g_t, x_t − x*⟩ / ( ‖x_1 − x*‖ √(∑_{t=1}^{T} ‖g_t‖²) ) = AdaGrad regret / worst-case OGD regret

• Theorem 2: μ = O( ∑_i √(∑_t ∇_{t,i}²) / √(∑_t ‖∇_t‖²) ), always lying in [1/√d, √d]

• Infrequently occurring, or small-scale, features have a small influence on the regret (and therefore on convergence to the optimal parameters)
Agenda
1. Learning as mathematical optimization
• Empirical Risk Minimization
• Basics of mathematical optimization
• Gradient descent + SGD
2. Regularization and Generalization
• Convex, non-smooth opt.
• Regret minimization in games, PAC learnability
3. Gradient Descent++
• Regularization, Adaptive Regularization and AdaGrad
• Momentum and variance reduction
• Second order methods
• Constraints and the Frank-Wolfe algorithm
Accelerating gradient descent?
1. Adaptive regularization (AdaGrad)
   works for non-smooth & non-convex
2. Variance reduction
   uses the special ERM structure
   very effective for smooth & convex
3. Acceleration/momentum
   smooth convex only; general-purpose optimization
   since the 80's
GD++

#2: Variance Reduction


Recall: smooth gradient descent
The descent lemma, β-smooth functions: (algorithm: x_{t+1} = x_t − η ∇_t)

f(x_{t+1}) − f(x_t) ≤ ∇_t^⊤(x_{t+1} − x_t) + β ‖x_t − x_{t+1}‖²
= −(η − βη²) ‖∇_t‖² = −(1/4β) ‖∇_t‖²      (choosing η = 1/(2β))

Thus, for M-bounded functions (|f(x_t)| ≤ M):

−2M ≤ f(x_T) − f(x_1) = ∑_t [f(x_{t+1}) − f(x_t)] ≤ −(1/4β) ∑_t ‖∇_t‖²

Thus, there exists a t for which:

‖∇_t‖² ≤ 8Mβ/T
Smooth gradient descent

Conclusion: for x_{t+1} = x_t − η ∇_t and T = Ω(Mβ/ε), GD finds

‖∇_t‖² ≤ ε

Holds even for non-convex functions.

Non-convex stochastic gradient descent
The descent lemma, β-smooth functions: (algorithm: x_{t+1} = x_t − η ∇̂_t)

E[f(x_{t+1}) − f(x_t)] ≤ E[∇_t^⊤(x_{t+1} − x_t) + β ‖x_t − x_{t+1}‖²]
= E[−∇_t^⊤ ⋅ η ∇̂_t + β η² ‖∇̂_t‖²] = −η ‖∇_t‖² + η² β E‖∇̂_t‖²
= −η ‖∇_t‖² + η² β (‖∇_t‖² + var(∇̂_t))

Thus, for M-bounded functions (|f(x_t)| ≤ M):

T = O(Mβσ²/ε²)   ⇒   ∃ t ≤ T:  ‖∇_t‖² ≤ ε
Controlling the variance:
Interpolating GD and SGD
Model: both full and stochastic gradients available. The estimator combines both into a lower-variance random variable:

x_{t+1} = x_t − η ( ∇̂f(x_t) − ∇̂f(x_0) + ∇f(x_0) )

Every so often, compute a full gradient and restart at the new x_0.

Theorem [Schmidt, Le Roux, Bach '12]:
Variance reduction for well-conditioned functions,

0 ≺ αI ≼ ∇²f(x) ≼ βI ,   γ = β/α,

produces an ε-approximate solution in time

O( (m + γ) d log(1/ε) )      (γ should be interpreted as 1/ε)
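A minimal sketch of this variance-reduced scheme (SVRG-style epochs); the least-squares demo data and step size are illustrative assumptions:

import numpy as np

def svrg(grad_i, full_grad, x0, m, eta, epochs, T, seed=0):
    # grad_i(x, i): gradient on example i; full_grad(x): exact gradient of f
    rng = np.random.default_rng(seed)
    x0 = np.array(x0, dtype=float)
    for _ in range(epochs):
        mu = full_grad(x0)                     # one full gradient per epoch
        x = x0.copy()
        for _ in range(T):
            i = rng.integers(m)
            # variance-reduced estimator: unbiased, and its variance -> 0 near the optimum
            g = grad_i(x, i) - grad_i(x0, i) + mu
            x -= eta * g
        x0 = x                                 # restart the epoch at the new point
    return x0

# demo: least squares f(x) = (1/2m) sum_i (x^T a_i - b_i)^2 (illustrative data)
rng = np.random.default_rng(4)
A = rng.normal(size=(200, 5)); b = A @ rng.normal(size=5)
gi = lambda x, i: (x @ A[i] - b[i]) * A[i]
fg = lambda x: A.T @ (A @ x - b) / len(b)
x = svrg(gi, fg, np.zeros(5), m=200, eta=0.01, epochs=30, T=600)
print(np.linalg.norm(fg(x)))   # the gradient norm drops geometrically per epoch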
Variance Reduction: analysis
For smooth functions (non-stochastic):

(1/β) ‖∇_t‖² ≤ h_t = f(x_t) − f(x*)

Variance-reduced estimator advantage:

‖∇̂_t‖² = ‖∇̂f(x_t) − ∇̂f(x_0) + ∇f(x_0)‖²
≤ 2 ‖∇̂f(x_t) − ∇̂f(x_0)‖² + 2 ‖∇f(x_0)‖²      (since (a + b)² ≤ 2a² + 2b²)
≤ 2G² ‖x_t − x_0‖² + 2βh_t      (G-smoothness of ∇̂f, smoothness)
≤ 2 (G²/α) h_0 + 2βh_t      (by strong convexity)
= O(β h_0)
Variance Reduction: analysis

Regret minimization for strongly convex functions:

h_{k+1} = (1/T) ∑_t h_t ≤ (1/αT) ∑_t ‖∇_t‖²/t ≤ (β log T / αT) h_k

Thus, after T = O(γ) = O(β/α) iterations, the average distance is half of what it (on average) was:

E[h_{k+1}] ≤ ½ E[h_k]

Thus O(log 1/ε) epochs. Each epoch: one full gradient, T stochastic gradients.
Total running time:

O( (m + γ) d log(1/ε) )
GD++

#3: Momentum and Nesterov acceleration


Acceleration/momentum [Nesterov '83]
• Optimal gradient complexity (smooth, convex)

• Modest practical improvements; non-convex "momentum" methods

• With variance reduction, the fastest possible running time for first-order methods:

O( (m + √(γm)) d log(1/ε) )

[Woodworth, Srebro '15] – tight lower bound w. gradient oracle


Experiments w. convex losses

Improve upon GradientDescent++?


Next few slides:

Move from first order (GD++) to second order


Higher Order Optimization

• Gradient Descent – Direction of Steepest Descent


• Second Order Methods – Use Local Curvature
Newton's method (+ Trust region)

x_{t+1} = x_t − η [∇²f(x)]^{−1} ∇f(x)

[Figure: Newton iterates x_1, x_2, x_3 on a non-convex curve]

For a non-convex function: can move to ∞!

Solution: solve a quadratic approximation in a local area (trust region)
Newton's method (+ Trust region)

x_{t+1} = x_t − η [∇²f(x)]^{−1} ∇f(x)

d³ time per iteration!
Infeasible for ML!!

Till recently…
Speed up the Newton direction computation??

• Spielman-Teng '04: diagonally dominant systems of equations in linear time!
  • 2015 Gödel prize
  • Used by Daitch-Spielman for faster flow algorithms

• Erdogdu-Montanari '15: low-rank approximation & inversion by Sherman-Morrison
  • Allows stochastic information
  • Still prohibitive: rank × d²
Stochastic Newton?

x_{t+1} = x_t − η [∇²f(x)]^{−1} ∇f(x)

• ERM, rank-1 loss:   arg min_x E_{i∼[m]} [ ℓ(x^⊤ a_i, b_i) + ½ ‖x‖² ]

• Unbiased estimator of the Hessian:

∇̂² = a_i a_i^⊤ ⋅ ℓ''(x^⊤ a_i, b_i) + I ,    i ∼ U[1, …, m]

• Clearly E[∇̂²] = ∇²f, but E[∇̂²]^{−1} ≠ [∇²f]^{−1}
Circumvent Hessian creation and inversion!

• 3 steps:
• (1) represent the Hessian inverse as an infinite series:

∇^{−2} = ∑_{i=0}^{∞} (I − ∇²)^i

• (2) sample from the infinite series (Hessian-gradient product), ONCE — for any distribution over the naturals, i ∼ N:

∇^{−2}f ∇f = ∑_i (I − ∇²f)^i ∇f = E_{i∼N} [ (I − ∇²f)^i ∇f ⋅ (1/Pr[i]) ]

• (3) estimate the Hessian-power by sampling i.i.d. data examples (single example; vector-vector products only):

= E_{i∼N, k_1…k_i i.i.d.} [ ∏_{j=1…i} (I − ∇²f_{k_j}) ∇f ⋅ (1/Pr[i]) ]
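A minimal sketch of the Neumann-series estimator in its recursive form, v_j = ∇f + (I − ∇²f_{i_j}) v_{j−1}, whose expectation approaches [∇²f]^{−1}∇f. The per-example Hessians, scaling (to ensure 0 ≺ H ≺ I), and averaging below are illustrative assumptions:

import numpy as np

def newton_dir_estimate(hess_vec_i, grad, m, depth, rng):
    # Truncated Neumann series for H^{-1} grad, assuming 0 < H < I:
    #   v_j = grad + (I - H_{i_j}) v_{j-1}  ->  E[v_depth] ~ H^{-1} grad,
    # using one random example i_j per term (Hessian-vector products only).
    v = grad.copy()
    for _ in range(depth):
        i = rng.integers(m)
        v = grad + v - hess_vec_i(v, i)
    return v

# demo: per-example Hessians H_i = a_i a_i^T + 0.5 I with E[H_i] = H (illustrative)
rng = np.random.default_rng(5)
m, d = 500, 4
A = rng.normal(size=(m, d)) * 0.2
hv = lambda v, i: (A[i] @ v) * A[i] + 0.5 * v
H = A.T @ A / m + 0.5 * np.eye(d)
g = rng.normal(size=d)
est = np.mean([newton_dir_estimate(hv, g, m, depth=60, rng=rng) for _ in range(100)], axis=0)
print(np.linalg.norm(est - np.linalg.solve(H, g)))   # small estimation error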
Linear-time Second-order Stochastic Algorithm (LiSSA)

• Use the estimator ∇̂^{−2}f defined previously
• Compute a full (large batch) gradient ∇f
• Move in the direction ∇̂^{−2}f ∇f
• Theoretical running time to produce an ε-approximate solution for γ-well-conditioned functions (convex) [Agarwal, Bullins, Hazan '15]:

O( dm log(1/ε) + γ d√d log(1/ε) )

1. Faster than first-order methods!
2. Indications this is tight [Arjevani, Shamir '16]
What about constraints??

Next few slides – projection-free

(Frank-Wolfe) methods
Recommendation systems

[Figures over four slides: a users × movies matrix, entry (i, j) = rating of user i for movie j, with most entries missing; complete the missing entries; new rating data arrives; re-complete the missing entries]

Assume the "true matrix" is low rank; convex relaxation: bounded trace norm.
Bounded trace norm matrices

• Trace norm of a matrix = sum of its singular values

• K = { X | X is a matrix with trace norm at most D }

• Computational bottleneck: projections onto K
require an eigendecomposition: O(n³) operations

• But: linear optimization over K is easier —
computing the top eigenvector; O(sparsity) time
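A minimal power-iteration sketch of that linear optimization step: over the trace-norm ball, arg min_{‖X‖_tr ≤ D} ⟨∇, X⟩ = −D u v^⊤, where (u, v) is the leading singular pair of ∇. Iteration counts and the random matrix are illustrative:

import numpy as np

def top_singular_pair(M, iters=200, seed=6):
    # power iteration on M M^T; each step costs O(nnz(M)), sparse-friendly
    rng = np.random.default_rng(seed)
    u = rng.normal(size=M.shape[0])
    for _ in range(iters):
        v = M.T @ u; v /= np.linalg.norm(v)
        u = M @ v;  u /= np.linalg.norm(u)
    return u, v

M = np.random.default_rng(7).normal(size=(50, 30))
u, v = top_singular_pair(M)
print(u @ M @ v, np.linalg.svd(M, compute_uv=False)[0])   # the two should match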
Projections → linear optimization

1. Matrix completion (K = bounded trace norm matrices)
   eigendecomposition → largest singular vector computation

2. Online routing (K = flow polytope)
   conic optimization over the flow polytope → shortest path computation

3. Rotations (K = rotation matrices)
   conic optimization over the rotations set → Wahba's algorithm, linear time

4. Matroids (K = matroid polytope)
   convex opt. via the ellipsoid method → greedy algorithm, nearly linear time!
Conditional Gradient algorithm [Frank, Wolfe '56]
Convex opt problem:

min_{x ∈ K} f(x)

• f is smooth, convex
• linear opt over K is easy

v_t = arg min_{x ∈ K} ∇f(x_t)^⊤ x
x_{t+1} = x_t + η_t (v_t − x_t)

1. At iteration t: a convex combination of at most t vertices (sparsity)
2. No learning rate: η_t ≈ 1/t (independent of diameter, gradients etc.)
FW theorem

x_{t+1} = x_t + η_t (v_t − x_t) ,   v_t = arg min_{x ∈ K} ∇_t^⊤ x

Theorem:   f(x_t) − f(x*) = O(1/t)

Proof, main observation:

f(x_{t+1}) − f(x*) = f(x_t + η_t (v_t − x_t)) − f(x*)
≤ f(x_t) − f(x*) + η_t (v_t − x_t)^⊤ ∇_t + (β/2) η_t² ‖v_t − x_t‖²      (β-smoothness of f)
≤ f(x_t) − f(x*) + η_t (x* − x_t)^⊤ ∇_t + (β/2) η_t² ‖v_t − x_t‖²      (optimality of v_t)
≤ f(x_t) − f(x*) + η_t (f(x*) − f(x_t)) + (β/2) η_t² ‖v_t − x_t‖²      (convexity of f)
≤ (1 − η_t)(f(x_t) − f(x*)) + (β η_t²/2) D².

Thus, for h_t = f(x_t) − f(x*):

h_{t+1} ≤ (1 − η_t) h_t + O(η_t²)   ⇒   with η_t = O(1/t),   h_t = O(1/t)
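A minimal Frank-Wolfe sketch; the ℓ₁-ball feasible set and quadratic objective in the demo are illustrative assumptions (for the ℓ₁ ball, the linear optimization returns a signed coordinate vertex):

import numpy as np

def frank_wolfe(grad_f, linear_opt, x0, T):
    # v_t = arg min_{x in K} <grad f(x_t), x>;  x_{t+1} = x_t + eta_t (v_t - x_t)
    x = np.array(x0, dtype=float)
    for t in range(1, T + 1):
        v = linear_opt(grad_f(x))
        eta = 2.0 / (t + 2)           # the fixed 1/t-style schedule; no tuning needed
        x = x + eta * (v - x)
    return x

# demo: minimize ||x - z||^2 over the l1 ball, z inside the ball so the optimum is z
z = np.array([0.5, -0.2, 0.1])
def lin_opt_l1(g):
    i = np.argmax(np.abs(g))
    v = np.zeros_like(g); v[i] = -np.sign(g[i])
    return v

print(frank_wolfe(lambda x: 2 * (x - z), lin_opt_l1, np.zeros(3), T=500))

Each iterate is a convex combination of at most t vertices, which is the sparsity property noted above.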
Online Conditional Gradient

• Set x_1 ∈ K arbitrarily
• For t = 1, 2, …:
  1. Use x_t, obtain f_t
  2. Compute x_{t+1} as follows:

v_t = arg min_{x ∈ K} ( ∑_{i=1}^{t} ∇f_i(x_i) + σ_t x_t )^⊤ x
x_{t+1} ← (1 − α_t) x_t + α_t v_t

Theorem [Hazan, Kale '12]:   Regret = O(T^{3/4})

Theorem [Garber, Hazan '13]: For polytopes, strongly-convex and smooth losses:
1. Offline: convergence after t steps: e^{−Ω(t)}
2. Online: Regret = O(√T)
Agenda
1. Learning as mathematical optimization
• Empirical Risk Minimization
• Basics of mathematical optimization
• Gradient descent + SGD
2. Regularization and Generalization
• Convex, non-smooth opt.
• Regret minimization in games, PAC learnability
3. Gradient Descent++
• Regularization, Adaptive Regularization and AdaGrad
• Momentum and variance reduction
• Second order methods
• Constraints and the Frank-Wolfe algorithm
4. A touch of state of the art
* Some slides by Cyril Zhang
Non-convex optimization in ML

[Diagram: distribution over examples {a} ∈ R^d → machine → label]

arg min_{x ∈ R^d} (1/m) ∑_{i=1}^{m} ℓ(x, a_i, b_i) + R(x)
Stochastic Optimization in ML
- Problem: find  argmin_{x ∈ R^d} f(x)
- Stochastic first-order oracle:   E[∇̂f(x)] = ∇f(x),   E[‖∇̂f(x)‖²] ≤ σ²
- SGD [Robbins & Monro '51]: from f(x) = x^⊤ A x + b^⊤ x to f(x) = OutrageouslyDeepNeuralNet(x)

[Image source: Andrej Karpathy's PhD thesis]

"Modern ML is SGD++"

Momentum:
[Polyak '64], [Nemirovskii & Yudin '77], [Nesterov '83]

Variance Reduction:
[Le Roux et al. '12], [Shalev-Shwartz & Zhang '12], [Johnson & Zhang '13]

Adaptive Regularization:
[Duchi, Hazan, Singer '10], [Kingma & Ba '14]
Adaptive Optimizers: The Usual Recipe
- Each coordinate x[i] gets a learning rate D_t[i]
- D_t[i] chosen "adaptively" using ∇̂f(x_{1:t})[i]  (g_{1:t}[i] for short); a sketch of all three recipes follows below

- AdaGrad:   D_t[i] := 1 / √( ∑_{s=1}^{t} g_s[i]² )

- RMSprop:   D_t[i] := 1 / √( ∑_{s=1}^{t} 0.99^{t−s} g_s[i]² )

- Adam:   D_t[i] := 1 / √( (1/(1 − 0.99^t)) ∑_{s=1}^{t} 0.99^{t−s} g_s[i]² )
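A minimal NumPy comparison of the three per-coordinate learning rates, following the slide's simplified formulas (in particular, no (1 − β₂) factor in the exponential average; the random gradient history is an illustrative assumption):

import numpy as np

def preconditioners(g_hist, beta2=0.99, eps=1e-8):
    # g_hist: T x d array of gradients g_1..g_T; returns D_T[i] for each recipe
    T = len(g_hist)
    sq = g_hist ** 2
    adagrad = 1.0 / (np.sqrt(sq.sum(axis=0)) + eps)
    decay = beta2 ** np.arange(T - 1, -1, -1)[:, None]     # weights 0.99^{T-s}
    ema = (decay * sq).sum(axis=0)
    rmsprop = 1.0 / (np.sqrt(ema) + eps)
    adam = 1.0 / (np.sqrt(ema / (1 - beta2 ** T)) + eps)   # bias-corrected
    return adagrad, rmsprop, adam

g_hist = np.random.default_rng(9).normal(size=(100, 4))
for name, D in zip(["adagrad", "rmsprop", "adam"], preconditioners(g_hist)):
    print(name, D)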
Intuition: Adaptive Preconditioning

● Per-coordinate scaling matrix D_t = ( ∑_{s=1}^{t} diag(g_s[i]²) )^{−1/2}

● Adaptive methods learn a D which makes the loss surface more isotropic:

f(x) ↦ f(Dx)
What about the other AdaGrad?

diagonal preconditioning: O(d) time per iteration   vs.   full-matrix preconditioning: > O(d²) time per iteration

What does adaptive regularization even do?!

● Main theorem, AdaGrad
● No analysis for non-convex optimization
The Case for Full-Matrix Adaptive Regularization

● GGT, a new adaptive optimizer


● Efficient full-matrix (low-rank) AdaGrad

● Theory: “Adaptive” convergence rate on convex & non-convex 𝑓

● Experiments: viable in the deep learning era


● GPU-friendly; not much slower than SGD on deep models
● Accelerates training in deep learning benchmarks
● Empirical insights on anisotropic loss surfaces, real and synthetic
The GGT Algorithm

● SGD:   x_{t+1} ← x_t − η_t ⋅ g_t
● AdaGrad:   x_{t+1} ← x_t − diag( ∑_{s=1}^{t} g_s² )^{−1/2} ⋅ g_t
● Full-Matrix AdaGrad:   x_{t+1} ← x_t − ( ∑_{s=1}^{t} g_s g_s^⊤ )^{−1/2} ⋅ g_t
● GGT:   x_{t+1} ← x_t − ( G_t G_t^⊤ )^{−1/2} ⋅ g_t

G_t = [ g_t | β g_{t−1} | β² g_{t−2} | ⋯ | β^{r−1} g_{t−r+1} ] ,   window size r ≈ 200, dimension d ≈ 10⁷
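A minimal sketch of the GGT preconditioning step via the small SVD of the gradient window; the ε-regularization of the orthogonal complement and the demo dimensions are illustrative assumptions:

import numpy as np

def ggt_direction(G, g, eps=1e-4):
    # Precondition g by (G G^T + eps I)^{-1/2} using the tiny SVD of the
    # d x r window G: cost O(r^2 d + r^3), never forming a d x d matrix
    U, s, _ = np.linalg.svd(G, full_matrices=False)   # G G^T = U diag(s^2) U^T
    coeff = U.T @ g
    in_span = U @ (coeff / np.sqrt(s ** 2 + eps))     # components along span(G)
    out_span = (g - U @ coeff) / np.sqrt(eps)         # orthogonal complement
    return in_span + out_span

rng = np.random.default_rng(8)
d, r, beta = 1000, 20, 0.9
G = rng.normal(size=(d, r)) * beta ** np.arange(r)    # decayed gradient window
print(np.linalg.norm(ggt_direction(G, rng.normal(size=d))))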


Why a low-rank preconditioner?

● Answer 1: want to forget stale gradients (like Adam)

● Synthetic experiments: logistic regression, polytope analytic center

Why a low-rank preconditioner?

Full-matrix AdaGrad: matrix ops O(r d²); huge SVD: O(d³)
GGT: matrix ops O(r² d); tiny SVD: O(r³)
Large-Scale Experiments (CIFAR-10, PTB)

Visualizing Gradient Spectra

[Figure: eigenvalues of G_t^⊤ G_t at t = 150, for a 26-layer ResNet on CIFAR-10 and a 3-layer LSTM on Penn Treebank (char-level)]
Theory: Convergence of Non-Convex SGD

- Convex:   f(x̄_T) ≤ argmin_x f(x) + ε   in O(σ²/ε²) steps
- Non-convex:   ∃t: ‖∇f(x_t)‖ ≤ ε   within O(σ²/ε⁴) steps
- Reduction via a modified descent lemma; minimize 1/ε² times:

f(x_t) + ⟨∇f(x_t), x − x_t⟩ + L ‖x − x_t‖² ≥ f(x)      (linearized surrogate)
f(x) + 2L ‖x − x_t‖² ≥ f(x)      (proximal surrogate)
Theory: Adaptive Convergence Rate of GGT

● Define the adaptivity ratio μ:

μ := ∑_{t=1}^{T} ⟨g_t, x_t − x*⟩ / ( ‖x_1 − x*‖ √(∑_{t=1}^{T} ‖g_t‖²) ) = AdaGrad regret / worst-case OGD regret

● [DHS10]: μ ∈ [1/√d, √d] for diag-AdaGrad, sometimes smaller for full AdaGrad

● Strongly convex losses: GGT* converges in O(μ²σ²/ε) steps
● Non-convex reduction: GGT* converges in O(μ²σ²/ε⁴) steps
● First step towards analyzing adaptive methods in non-convex optimization
Agenda
1. Learning as mathematical optimization
• Empirical Risk Minimization
• Basics of mathematical optimization
• Gradient descent + SGD
2. Regularization and Generalization
• Convex, non-smooth opt.
• Regret minimization in games, PAC learnability
3. Gradient Descent++
• Regularization, Adaptive Regularization and AdaGrad
• Momentum and variance reduction
• Second order methods
• Constraints and the Frank-Wolfe algorithm
4. A touch of state of the art
* Some slides by Cyril Zhang
Recap

1. Basic non-convex optimization
   • GD and SGD
2. Online learning and stochastic optimization
   • Online gradient descent, non-smooth optimization
3. Regularization and generalization
   • Multiplicative update, mirrored descent
4. Advanced optimization
   • Adaptive regularization, acceleration, variance reduction, second-order methods, Frank-Wolfe
5. New algorithms: GGT

Collaborators @ Princeton & Google Brain:
Naman Agarwal, Brian Bullins, Xinyi Chen, Karan Singh, Cyril Zhang, Yi Zhang
Bibliography & more information, see:
http://www.cs.princeton.edu/~ehazan/tutorial/MLSStutorial.htm

Thank you!
