
Tutorial:

Optimization for Machine Learning

Elad Hazan
Princeton University and Google Brain
http://www.cs.princeton.edu/~ehazan/tutorial/MLSStutorial.htm
“Instead of trying to produce a programme to simulate the
adult mind, why not rather try to produce one which simulates
the child's?”

A. M. Turing, 1950


ML paradigm

[Diagram: distribution over examples {a} ∈ R^d → machine → label b = f_parameters(a), e.g. chair/car]

This tutorial - training the machine:
• Efficiency
• Generalization
Agenda
1. Learning as mathematical optimization
• Empirical Risk Minimization
• Basics of mathematical optimization
• Gradient descent + SGD
2. Regularization and Generalization
• Convex, non-smooth opt.
• Regret minimization in games, PAC learnability
3. Gradient Descent++
• Regularization, Adaptive Regularization and AdaGrad
• Momentum and variance reduction
• Second order methods
• Constraints and the Frank-Wolfe algorithm
4. A touch of state-of-the-art
Bibliography + more info: http://www.cs.princeton.edu/~ehazan/tutorial/MLSStutorial.htm
Agenda

Will NOT touch upon:

• Parallelism/distributed computation (asynchronous optimization, HOGWILD etc.), Bayesian inference in graphical models, Markov Chain Monte Carlo, partial information and bandit algorithms, optimization for RL (policy and value iteration, policy gradient, …), hyperparameter optimization, …
Mathematical optimization

Input: function f: K → R, for K ⊆ R^d

Output: minimizer x ∈ K, such that f(x) ≤ f(y) ∀ y ∈ K

Accessing f?
• Value oracle (0th-order opt.)
• Gradient (1st-order opt.), k'th differentials (k'th-order opt. …)

Accessing K? (separation/membership oracle, explicit constraints)

Generally NP-hard, given full access to the function.
NP-hardness of mathematical optimization

Max Cut over G = (V, E):

max_{x ∈ {−1,1}^V} ∑_{(i,j) ∈ E} (1 − x_i x_j)/2
What is Optimization

Supervised Learning = optimization over data
(a.k.a. Empirical Risk Minimization)

Fitting the parameters of the model ("training") = optimization problem:

[Figure: Duchi (UC Berkeley), "Convex Optimization for Machine Learning": "But generally speaking... we're screwed" — local (non-global) minima, all kinds of constraints, e.g. h(x) = sin(2πx) = 0]
arg min_{x ∈ R^d} (1/m) ∑_{i=1}^{m} ℓ(x, a_i, b_i) + R(x)

m = # of examples, (a, b) = (features, label), d = dimension
Example: linear classification

Given a sample S = {(a_1, b_1), …, (a_m, b_m)},
find a hyperplane (through the origin w.l.o.g.) such that:

x = arg min_{‖x‖=1} # of mistakes = arg min_{‖x‖=1} |{i s.t. sign(x^⊤ a_i) ≠ b_i}|

arg min_{‖x‖=1} (1/m) ∑_i ℓ(x, a_i, b_i)   for   ℓ(x, a_i, b_i) = { 1 if sign(x^⊤ a_i) ≠ b_i ; 0 if sign(x^⊤ a_i) = b_i }

NP hard!
Example: deep neural net classification

Given a sample S = {(a_1, b_1), …, (a_m, b_m)},
find a depth-N ResNet with weights W such that:

W = arg min_W # of mistakes = arg min_W |{i s.t. sign(Net_W(a_i)) ≠ b_i}|

arg min_W (1/m) ∑_i ℓ(W, a_i, b_i)   for   ℓ(W, a_i, b_i) = { 1 if Net_W(a_i) ≠ b_i ; 0 if Net_W(a_i) = b_i }

Sum of signs → global optimization is NP-hard!

Local property that ensures global optimality?

Convexity

A function f: R^d → R is convex if and only if:

f(½x + ½y) ≤ ½f(x) + ½f(y)

• Informally: smiley ☺
• Alternative definition (for differentiable f):

f(y) ≥ f(x) + ∇f(x)^⊤(y − x)
Convex sets

Set K is convex if and only if:

x, y ∈ K ⇒ (½x + ½y) ∈ K

Loss functions: ℓ(x, a_i, b_i) = ℓ(x^⊤ a_i ⋅ b_i)
Convex relaxations for linear classification

x = arg min_{‖x‖=1} |{i s.t. sign(x^⊤ a_i) ≠ b_i}|

1. Ridge / linear regression:  ℓ(x^⊤ a_i, b_i) = (x^⊤ a_i − b_i)²

2. SVM:  ℓ(x^⊤ a_i, b_i) = max{0, 1 − b_i x^⊤ a_i}

3. Logistic regression:  ℓ(x^⊤ a_i, b_i) = log(1 + e^{−b_i ⋅ x^⊤ a_i})
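To make the three relaxations concrete, here is a minimal NumPy sketch of the surrogate losses; the function names and the toy example are illustrative, not from the tutorial:

import numpy as np

def squared_loss(x, a, b):
    # ridge / linear regression surrogate: (x^T a - b)^2
    return (x @ a - b) ** 2

def hinge_loss(x, a, b):
    # SVM surrogate: max{0, 1 - b * x^T a}
    return max(0.0, 1.0 - b * (x @ a))

def logistic_loss(x, a, b):
    # logistic regression surrogate: log(1 + exp(-b * x^T a))
    return np.log1p(np.exp(-b * (x @ a)))

x = np.array([0.5, -1.0])           # candidate hyperplane (through the origin)
a, b = np.array([1.0, 2.0]), -1.0   # one example: features a, label b in {-1,+1}
print(squared_loss(x, a, b), hinge_loss(x, a, b), logistic_loss(x, a, b))

All three upper-bound the 0/1 mistake indicator while being convex in x, which is exactly what makes the relaxation tractable.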
Non-Convex Optimization

Multiple unsurpassable hurdles:
• Global optimization is NP-hard
• Even deciding whether you are at a local minimum is NP-hard

[Figure: Duchi (UC Berkeley), "Convex Optimization for Machine Learning": local (non-global) minima, all kinds of constraints (even restricting to continuous functions), h(x) = sin(2πx) = 0]
So what can we hope to do?

• ε-first-order critical points:  {h : ‖∇f(h)‖ ≤ ε}

Good enough: λ_min(∇²f(h)) > 0 (local minimum).   Less ideal: λ_min(∇²f(h)) < 0 (saddle point).

k-th order critical points:  f(x) − f(x*) = O(‖x − x*‖^{k+1}).   4th-order critical points: NP-hard!


A better baseline

• ε-approximate local minimum (Nesterov and Polyak):

‖∇f(h)‖ ≤ ε   and   ∇²f(h) ≽ −√ε I

• Does not guarantee a local minimum even with ε = 0 (though it does for some functions)
We have:
1. motivated learning as mathematical optimization
2. convexity ⇒ can locally verify global optimality
3. non-convex opt ⇒ settle for k-th-order local optimality
(2nd order is reasonable, 4th order is NP-hard)

Next ⇒ algorithms!
Unconstrained Gradient Descent

[∇f(x)]_i = ∂f(x)/∂x_i

x_{t+1} ← x_t − η ∇f(x_t)

[Figure: iterates p_1, p_2, p_3, … descending toward the minimizer p*]
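As a concrete illustration, a minimal gradient descent loop in Python; the quadratic objective in the demo is an assumption chosen for simplicity:

import numpy as np

def gradient_descent(grad_f, x0, eta, T):
    # iterate x_{t+1} <- x_t - eta * grad f(x_t)
    x = np.array(x0, dtype=float)
    for _ in range(T):
        x = x - eta * grad_f(x)
    return x

# demo on f(x) = ||x||^2 / 2, so grad f(x) = x (an illustrative choice)
print(gradient_descent(lambda x: x, np.ones(3), eta=0.1, T=100))  # -> near 0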
Smooth functions
β = smoothness; simplified:

−(β/2) I ≼ ∇²f(x) ≼ (β/2) I

Why is this important?

For gradients to be meaningful at all…

Remark: any function can be “smoothed” by local integration


Convergence of gradient descent

x_{t+1} ← x_t − η ∇f(x_t)

Theorem: for β-smooth, M-bounded functions, step size η = 1/β:

(1/T) ∑_t ‖∇f(x_t)‖² ≤ 2Mβ/T

Thus, there exists a time t for which:

‖∇f(x_t)‖² ≤ 2Mβ/T

Convex f ⇒ global convergence:

f(x_t) − f(x*) ≤ ‖x_t − x*‖ ⋅ ‖∇f(x_t)‖ = O(D/√T)
Proof:   x_{t+1} ← x_t − η ∇f(x_t)

1. The descent lemma:

f(x_t) − f(x_{t+1}) ≥ −∇f(x_t)^⊤(x_{t+1} − x_t) − (β/2)‖x_t − x_{t+1}‖²
= η ‖∇_t‖² − (β/2) η² ‖∇_t‖²
= (1/2β) ‖∇_t‖²

2. Summing over all iterations:

M ≥ f(x_1) − f(x*) ≥ ∑_t [f(x_t) − f(x_{t+1})] ≥ (1/2β) ∑_t ‖∇_t‖²

and thus   (1/T) ∑_t ‖∇f(x_t)‖² ≤ 2Mβ/T

For convex functions:

f(x) − f(x*) ≤ ∇f(x)^⊤(x − x*) ≤ D ‖∇f(x)‖

Thus, in T iterations, there exists an iterate t for which:

f(x_t) − f(x*) ≤ D √(2Mβ/T)
Our objective function:

f(x) = (1/m) ∑_{i=1}^{m} ℓ(x, a_i, b_i) + R(x),   minimized over x ∈ R^d

m = # of examples, (a, b) = (features, label), d = dimension

→ Gradients are expensive! (entire data set per iteration)
→ What about non-smooth functions? (non-convex: infeasible)
→ How good is a local minimum in terms of generalization?
→ Can we get faster rates?
Stochastic gradient descent

Learning problem:   arg min_{x ∈ R^d} f(x) = E_{(a_i,b_i)}[ℓ(x, a_i, b_i)]

[Figure: Duchi (UC Berkeley), non-convex surface]

"Random approximation" (Robbins/Monro '51):

estimate the gradient using a single example:   f_t(x) = ℓ(x, a_i, b_i)

∇̂ satisfies   E[∇̂] = ∇f(x) = E_{(a_i,b_i)}[∇ℓ(x, a_i, b_i)]

Move in the direction of the estimator!   x_{t+1} ← x_t − η ∇̂(x_t)

Convergence?

Variance of the gradient estimator:   E[‖∇̂‖²] = σ²

[Figure: stochastic vs. full gradient descent trajectories]
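A minimal SGD sketch on synthetic least-squares data; the data, loss, and step size below are illustrative assumptions, not from the tutorial:

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(1000, 5))   # features a_i
b = A @ rng.normal(size=5)       # labels b_i (noiseless least squares, for the demo)

def sgd(A, b, eta, T):
    x = np.zeros(A.shape[1])
    for _ in range(T):
        i = rng.integers(len(b))              # draw a single random example
        grad_est = (x @ A[i] - b[i]) * A[i]   # grad of (x^T a_i - b_i)^2 / 2
        x -= eta * grad_est                   # unbiased: E[grad_est] = grad f(x)
    return x

x_hat = sgd(A, b, eta=0.01, T=20000)
print(np.linalg.norm(A @ x_hat - b) / np.sqrt(len(b)))  # RMS error shrinks toward 0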
Convergence of SGD

x_{t+1} ← x_t − η ∇̂f(x_t)

Theorem: for β-smooth, M-bounded functions, step size η = √(M/(βσ²T)):

(1/T) ∑_t ‖∇f(x_t)‖² ≤ 2 √(Mβσ²/T)
Proof:   x_{t+1} ← x_t − η ∇̂f(x_t)

1. The descent lemma:

E[f(x_t) − f(x_{t+1})] ≥ E[−∇f(x_t)^⊤(x_{t+1} − x_t) − (β/2)‖x_t − x_{t+1}‖²]
= E[∇_t^⊤ ⋅ η∇̂_t − (β/2) η² ‖∇̂_t‖²] = η ‖∇_t‖² − (β/2) η² σ²

2. Summing over all iterations:

M ≥ f(x_1) − f(x*) ≥ ∑_t E[f(x_t) − f(x_{t+1})] ≥ η ∑_t ‖∇_t‖² − (β/2) η² σ² T

and thus   (1/T) ∑_t ‖∇f(x_t)‖² ≤ M/(ηT) + ηβσ² ≤ 2 √(Mβσ²/T)
Our objective function:

f(x) = (1/m) ∑_{i=1}^{m} ℓ(x, a_i, b_i) + R(x),   minimized over x ∈ R^d

m = # of examples, (a, b) = (features, label), d = dimension

→ Gradients are expensive! (entire data set per iteration)
→ What about non-smooth functions? (non-convex: infeasible)
→ How good is a local minimum in terms of generalization?
→ Can we get faster rates?
Tutorial: PART 2

Optimization for Machine Learning

Elad Hazan
Princeton University and Google Brain
Bibliography + more info:
http://www.cs.princeton.edu/~ehazan/tutorial/MLSStutorial.htm
Agenda
1. Learning as mathematical optimization
• Empirical Risk Minimization
• Basics of mathematical optimization
• Gradient descent + SGD
2. Regularization and Generalization
• Convex, non-smooth opt.
• Regret minimization in games, PAC learnability
3. Gradient Descent++
• Regularization, Adaptive Regularization and AdaGrad
• Momentum and variance reduction
• Second order methods
• Constraints and the Frank-Wolfe algorithm
4. A touch of state-of-the-art
→ What about non-smooth functions? (non-convex: infeasible)
→ Gradients are expensive! (entire data set per iteration)
→ How good is a local minimum in terms of generalization?
→ Can we get faster rates?

→ convex optimization
Statistical (PAC) learning
Nature: i.i.d. from distribution D over A × B = {(a, b)}

Samples: (a_1, b_1), …, (a_m, b_m)
• Learner: produces hypothesis h from class H = {h_1, h_2, …, h_N}
• Loss, e.g. ℓ(h, a, b) = (h(a) − b)²

err(h) = E_{(a,b)∼D}[ℓ(h, a, b)]

Hypothesis class H: X → Y is learnable if ∀ ε, δ > 0 there exists an algorithm s.t. after seeing m examples, for m = poly(δ, ε, dimension(H)), it finds h s.t. w.p. 1 − δ:

err(h) ≤ min_{h* ∈ H} err(h*) + ε
More powerful setting:
Online Learning in Games
Iteratively, for t = 1, 2, …, T:
Player: h_t ∈ H
Adversary: (a_t, b_t) ∈ A × B
Loss ℓ(h_t, (a_t, b_t))

Goal: minimize (average, expected) regret:

(1/T) [ ∑_t ℓ(h_t, (a_t, b_t)) − min_{h* ∈ H} ∑_t ℓ(h*, (a_t, b_t)) ] → 0   as T → ∞

Vanishing regret ⇒ generalization in the PAC setting! (online2batch)

From this point onwards:   f_t(x) = ℓ(x, a_t, b_t) = loss for one example
Can we minimize regret efficiently?
Online gradient descent

y_{t+1} = x_t − η ∇f_t(x_t)
x_{t+1} = arg min_{x ∈ K} ‖y_{t+1} − x‖

Theorem: Regret = ∑_t f_t(x_t) − min_{x*} ∑_t f_t(x*) = O(√T)

Theorem: for step size η = D/(G√T) — NO SMOOTHNESS REQUIRED!! —

∑_t f_t(x_t) − min_{x*} ∑_t f_t(x*) ≤ DG√T

Where:
• G = upper bound on the norm of the gradients:   ‖∇f_t(x_t)‖ ≤ G
• D = diameter of the constraint set:   ∀ x, y ∈ K, ‖x − y‖ ≤ D
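A minimal sketch of projected online gradient descent, assuming K is a Euclidean ball (so the projection is a rescaling) and assuming linear losses for the demo:

import numpy as np

def project_ball(y, R):
    # Euclidean projection onto K = {x : ||x|| <= R}
    n = np.linalg.norm(y)
    return y if n <= R else (R / n) * y

def online_gradient_descent(loss_grads, d, R, G, T):
    # step size eta = D / (G sqrt(T)) with diameter D = 2R
    eta = (2 * R) / (G * np.sqrt(T))
    x, plays = np.zeros(d), []
    for t in range(T):
        plays.append(x)
        y = x - eta * loss_grads[t](x)   # gradient step on f_t
        x = project_ball(y, R)           # project back onto K
    return plays

# demo: adversarial linear losses f_t(x) = c_t . x over the unit ball (illustrative)
rng = np.random.default_rng(1)
cs = rng.normal(size=(1000, 3))
plays = online_gradient_descent([lambda x, c=c: c for c in cs], 3, R=1.0, G=3.0, T=1000)
regret = sum(c @ x for c, x in zip(cs, plays)) - (-np.linalg.norm(cs.sum(0)))
print(regret / 1000)   # average regret, decaying like DG / sqrt(T)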
Proof:   y_{t+1} ← x_t − η ∇f_t(x_t),   x_{t+1} = arg min_{x ∈ K} ‖y_{t+1} − x‖

1. Observation 1:

‖x* − y_{t+1}‖² = ‖x* − x_t‖² − 2η ∇f_t(x_t)^⊤(x_t − x*) + η² ‖∇f_t(x_t)‖²

2. Observation 2 (the Pythagorean theorem for projections):

‖x* − x_{t+1}‖² ≤ ‖x* − y_{t+1}‖²

Thus:

‖x* − x_{t+1}‖² ≤ ‖x* − x_t‖² − 2η ∇f_t(x_t)^⊤(x_t − x*) + η² G²

And hence:

(1/T) ∑_t [f_t(x_t) − f_t(x*)] ≤ (1/T) ∑_t ∇f_t(x_t)^⊤(x_t − x*)
≤ (1/T) ∑_t (1/2η)[‖x* − x_t‖² − ‖x* − x_{t+1}‖²] + (η/2) G²
≤ D²/(2ηT) + (η/2) G² ≤ DG/√T
Lower bound

Regret = Ω(√T)

• 2 loss functions, T iterations:
• K = [−1, 1],  f_1(x) = x,  f_2(x) = −x
• Second expert's loss = first × (−1)
• Expected loss = 0 (for any algorithm)
• Regret (compared to either −1 or 1):

E[ |#1's − #(−1)'s| ] = Ω(√T)
SGD: non-smooth case

Learning problem:   arg min_{x ∈ R^d} F(x) = E_{(a_i,b_i)}[ℓ(x, a_i, b_i)]
random example:   f_t(x) = ℓ(x, a_i, b_i)

[Figure: Duchi (UC Berkeley), non-convex surface]

1. We have proved (for any sequence of ∇_t):

(1/T) ∑_t ∇_t^⊤ x_t ≤ min_{x* ∈ K} (1/T) ∑_t ∇_t^⊤ x* + DG/√T

2. Taking (conditional) expectation:

E[ F((1/T) ∑_t x_t) ] − min_{x* ∈ K} F(x*) ≤ E[ (1/T) ∑_t ∇_t^⊤(x_t − x*) ] ≤ DG/√T

One example per step, same convergence as GD, and it gives direct generalization!
(formally needs martingales)

O(d/ε²) vs. O(md/ε²) total running time for ε generalization error.
Agenda
1. Learning as mathematical optimization
• Empirical Risk Minimization
• Basics of mathematical optimization
• Gradient descent + SGD
2. Regularization and Generalization
• Convex, non-smooth opt.
• Regret minimization in games, PAC learnability
3. Gradient Descent++
• Regularization, Adaptive Regularization and AdaGrad
• Momentum and variance reduction
• Second order methods
• Constraints and the Frank-Wolfe algorithm
4. A touch of state of the art
Regularization &
Gradient Descent++
Why “regularize”?

• Statistical learning theory / Occam's razor:
# of examples needed to learn a hypothesis class ~ its “dimension”
• VC dimension
• Fat-shattering dimension
• Rademacher width
• Margin/norm of linear/kernel classifier

• PAC theory: regularization ↔ reduced complexity

• Regret minimization: regularization ↔ stability
Condition number of convex functions

defined as γ = β/α, where (simplified):

0 ≺ (α/2) I ≼ ∇²f(x) ≼ (β/2) I

α = strong convexity, β = smoothness

Non-convex smooth functions: (simplified)

−(β/2) I ≼ ∇²f(x) ≼ (β/2) I

Why do we care?
Well-conditioned functions exhibit much faster optimization!
Important for regularization.
Minimize regret: best-in-hindsight

Regret = ∑_t f_t(x_t) − min_{x* ∈ K} ∑_t f_t(x*)

• Most natural (Follow The Leader):

x_t = arg min_{x ∈ K} ∑_{i=1}^{t−1} f_i(x)

• Provably works [Kalai-Vempala '05]:

x'_t = arg min_{x ∈ K} ∑_{i=1}^{t} f_i(x) = x_{t+1}

• So if x_t ≈ x_{t+1}, we get a regret bound

• But the instability ‖x_t − x_{t+1}‖ can be large!
Fixing FTL: Follow-The-Regularized-Leader (FTRL)
• Linearize: replace f_t by a linear function, ∇f_t(x_t)^⊤ x
• Add regularization:

x_t = arg min_{x ∈ K} { ∑_{i=1…t−1} ∇_i^⊤ x + (1/η) R(x) }

• R(x) is a strongly convex function; it ensures stability:

∇_t^⊤(x_t − x_{t+1}) = O(η)
FTRL vs. gradient descent

• R(x) = ½ ‖x‖²

x_t = arg min_{x ∈ K} ∑_{i=1}^{t−1} ∇f_i(x_i)^⊤ x + (1/η) R(x)
    = Π_K( −η ∑_{i=1}^{t−1} ∇f_i(x_i) )

• Essentially OGD: starting with y_1 = 0, for t = 1, 2, …

x_t = Π_K(y_t)
y_{t+1} = y_t − η ∇f_t(x_t)
FTRL vs. Multiplicative Weights
• Experts setting: K = Δ_n = distributions over experts
• f_t(x) = c_t^⊤ x, where c_t is the vector of losses
• R(x) = ∑_i x_i log x_i: negative entropy

x_t = arg min_{x ∈ K} ∑_{i=1}^{t−1} ∇f_i(x_i)^⊤ x + (1/η) R(x)
    = exp( −η ∑_{i=1}^{t−1} c_i ) / Z_t      (entrywise exponential; Z_t = normalization constant)

• Gives the Multiplicative Weights method!
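A minimal sketch of the resulting Multiplicative Weights method; the demo data (two experts, one slightly better) is an illustrative assumption:

import numpy as np

def multiplicative_weights(losses, eta):
    # losses: T x n matrix of expert losses c_t; returns the played distributions x_t
    T, n = losses.shape
    cum = np.zeros(n)
    played = []
    for t in range(T):
        w = np.exp(-eta * cum)         # entrywise exponential of cumulative losses
        played.append(w / w.sum())     # divide by the normalization constant Z_t
        cum += losses[t]
    return np.array(played)

rng = np.random.default_rng(2)
L = rng.uniform(size=(500, 2)); L[:, 0] *= 0.9   # expert 0 is slightly better
x = multiplicative_weights(L, eta=0.1)
print(x[-1])   # the weight concentrates on the better expert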


FTRL ⇔ Online Mirror Descent

x_t = arg min_{x ∈ K} ∑_{i=1}^{t−1} ∇f_i(x_i)^⊤ x + (1/η) R(x)

Bregman projection:

Π_K^R(y) = arg min_{x ∈ K} B_R(x‖y)

B_R(x‖y) := R(x) − R(y) − ∇R(y)^⊤(x − y)

x_t = Π_K^R(y_t)
y_{t+1} = (∇R)^{−1}( ∇R(y_t) − η ∇f_t(x_t) )
GD++

#1: Adaptive Regularization


Adaptive Regularization: AdaGrad

• Consider a generalized linear model; the prediction is a function of a^⊤ x:

∇f_t(x) = ℓ'(a_t, b_t, x) ⋅ a_t

• OGD update:   x_{t+1} = x_t − η ∇_t = x_t − η ℓ'(a_t, b_t, x) a_t

• All features are treated equally in updating the parameter vector

• In typical text classification tasks, feature vectors a_t are very sparse →

slow learning!

• Adaptive regularization: per-feature learning rates

Optimal regularization

• The general RFTL form:

x_t = arg min_{x ∈ K} ∑_{i=1…t−1} f_i(x) + (1/η) R(x)

• Which regularizer to pick?

• AdaGrad: treat this as a learning problem!
Family of regularizations:

R(x) = ‖x‖²_A   s.t.   A ≽ 0, Trace(A) = d

• Objective in the matrix world: best regret in hindsight!


AdaGrad

• Set x_1 ∈ K arbitrarily
• For t = 1, 2, …:
  1. use x_t, obtain f_t, and gradient g_t = ∇f_t(x_t)
  2. compute x_{t+1} as follows:

G_t = ∑_{i=1…t} g_i g_i^⊤

y_{t+1} = x_t − η G_t^{−1/2} g_t      (or the diagonal version: y_{t+1} = x_t − η diag(G_t)^{−1/2} g_t)

x_{t+1} = arg min_{x ∈ K} (y_{t+1} − x)^⊤ G_t (y_{t+1} − x)
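A minimal sketch of the unconstrained diagonal version; the badly scaled quadratic in the demo is an illustrative assumption:

import numpy as np

def adagrad_diag(grad_f, x0, eta, T, eps=1e-8):
    # per-coordinate step sizes eta / sqrt(sum_s g_s[i]^2)
    x = np.array(x0, dtype=float)
    G = np.zeros_like(x)          # running sum of squared gradients, per coordinate
    for _ in range(T):
        g = grad_f(x)
        G += g * g
        x -= eta * g / (np.sqrt(G) + eps)
    return x

# demo on a badly scaled quadratic f(x) = sum_i c_i x_i^2 (illustrative)
c = np.array([100.0, 1.0, 0.01])
print(adagrad_diag(lambda x: 2 * c * x, np.ones(3), eta=0.5, T=1000))

Note how each coordinate gets its own effective learning rate, so the tiny-curvature coordinate is not starved by the large one.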
Intuition: Adaptive Preconditioning
● Per-coordinate scaling matrix D_t = ( ∑_{s=1}^{t} diag(g_s[i]²) )^{−1/2}
● Adaptive methods learn a D which makes the loss surface more isotropic:

f(x) ↦ f(Dx)
The ratio of adaptivity

• Theorem 1 (informal): AdaGrad's regret is within a factor of at most O(d) of the best regularization in hindsight from the family

R(x) = ‖x‖²_A   s.t.   A ≽ 0, Trace(A) = d

• Define the ratio of adaptivity:

μ := ∑_{t=1}^{T} ⟨g_t, x_t − x*⟩ / ( ‖x_1 − x*‖ √(∑_{t=1}^{T} ‖g_t‖²) ) = AdaGrad regret / worst-case OGD regret

• Theorem 2: μ = O( ∑_i √(∑_t ∇_{t,i}²) / √(∑_t ‖∇_t‖²) ), always lying in [1/√d, √d]

• Infrequently occurring, or small-scale, features have a small influence on the regret (and therefore on convergence to the optimal parameters)
Agenda
1. Learning as mathematical optimization
• Empirical Risk Minimization
• Basics of mathematical optimization
• Gradient descent + SGD
2. Regularization and Generalization
• Convex, non-smooth opt.
• Regret minimization in games, PAC learnability
3. Gradient Descent++
• Regularization, Adaptive Regularization and AdaGrad
• Momentum and variance reduction
• Second order methods
• Constraints and the Frank-Wolfe algorithm
Accelerating gradient descent?
1. Adaptive regularization (AdaGrad)
   works for non-smooth & non-convex
2. Variance reduction
   uses the special ERM structure
   very effective for smooth & convex
3. Acceleration/momentum
   smooth convex only; general-purpose optimization
   since the 80's
GD++

#2: Variance Reduction


Recall: smooth gradient descent
The descent lemma, β-smooth functions: (algorithm: x_{t+1} = x_t − η ∇_t)

f(x_{t+1}) − f(x_t) ≤ ∇_t^⊤(x_{t+1} − x_t) + β ‖x_t − x_{t+1}‖²
= −(η − βη²) ‖∇_t‖² = −(1/4β) ‖∇_t‖²      (choosing η = 1/(2β))

Thus, for M-bounded functions (|f(x_t)| ≤ M):

−2M ≤ f(x_T) − f(x_1) = ∑_t [f(x_{t+1}) − f(x_t)] ≤ −(1/4β) ∑_t ‖∇_t‖²

Thus, there exists a t for which:

‖∇_t‖² ≤ 8Mβ/T
Smooth gradient descent

Conclusion: for x_{t+1} = x_t − η ∇_t and T = Ω(Mβ/ε), GD finds

‖∇_t‖² ≤ ε

Holds even for non-convex functions.

Non-convex stochastic gradient descent
The descent lemma, β-smooth functions: (algorithm: x_{t+1} = x_t − η ∇̂_t)

E[f(x_{t+1}) − f(x_t)] ≤ E[∇_t^⊤(x_{t+1} − x_t) + β ‖x_t − x_{t+1}‖²]
= E[−∇_t^⊤ ⋅ η ∇̂_t + β η² ‖∇̂_t‖²] = −η ‖∇_t‖² + η² β E‖∇̂_t‖²
= −η ‖∇_t‖² + η² β (‖∇_t‖² + var(∇̂_t))

Thus, for M-bounded functions (|f(x_t)| ≤ M):

T = O(Mβσ²/ε²)   ⇒   ∃ t ≤ T:  ‖∇_t‖² ≤ ε
Controlling the variance:
Interpolating GD and SGD
Model: both full and stochastic gradients available. The estimator combines both into a lower-variance random variable:

x_{t+1} = x_t − η ( ∇̂f(x_t) − ∇̂f(x_0) + ∇f(x_0) )

Every so often, compute a full gradient and restart at the new x_0.

Theorem [Schmidt, Le Roux, Bach '12]:
Variance reduction for well-conditioned functions,

0 ≺ αI ≼ ∇²f(x) ≼ βI ,   γ = β/α,

produces an ε-approximate solution in time

O( (m + γ) d log(1/ε) )      (γ should be interpreted as 1/ε)
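A minimal sketch of this variance-reduced scheme (SVRG-style epochs); the least-squares demo data and step size are illustrative assumptions:

import numpy as np

def svrg(grad_i, full_grad, x0, m, eta, epochs, T, seed=0):
    # grad_i(x, i): gradient on example i; full_grad(x): exact gradient of f
    rng = np.random.default_rng(seed)
    x0 = np.array(x0, dtype=float)
    for _ in range(epochs):
        mu = full_grad(x0)                     # one full gradient per epoch
        x = x0.copy()
        for _ in range(T):
            i = rng.integers(m)
            # variance-reduced estimator: unbiased, and its variance -> 0 near the optimum
            g = grad_i(x, i) - grad_i(x0, i) + mu
            x -= eta * g
        x0 = x                                 # restart the epoch at the new point
    return x0

# demo: least squares f(x) = (1/2m) sum_i (x^T a_i - b_i)^2 (illustrative data)
rng = np.random.default_rng(4)
A = rng.normal(size=(200, 5)); b = A @ rng.normal(size=5)
gi = lambda x, i: (x @ A[i] - b[i]) * A[i]
fg = lambda x: A.T @ (A @ x - b) / len(b)
x = svrg(gi, fg, np.zeros(5), m=200, eta=0.01, epochs=30, T=600)
print(np.linalg.norm(fg(x)))   # the gradient norm drops geometrically per epoch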
Variance Reduction: analysis
For smooth functions (non-stochastic):

(1/β) ‖∇_t‖² ≤ h_t = f(x_t) − f(x*)

Variance-reduced estimator advantage:

‖∇̂_t‖² = ‖∇̂f(x_t) − ∇̂f(x_0) + ∇f(x_0)‖²
≤ 2 ‖∇̂f(x_t) − ∇̂f(x_0)‖² + 2 ‖∇f(x_0)‖²      (since (a + b)² ≤ 2a² + 2b²)
≤ 2G² ‖x_t − x_0‖² + 2βh_t      (G-smoothness of ∇̂f, smoothness)
≤ 2 (G²/α) h_0 + 2βh_t      (by strong convexity)
= O(β h_0)
Variance Reduction: analysis

Regret minimization for strongly convex functions:

h_{k+1} = (1/T) ∑_t h_t ≤ (1/αT) ∑_t ‖∇_t‖²/t ≤ (β log T / αT) h_k

Thus, after T = O(γ) = O(β/α) iterations, the average distance is half of what it (on average) was:

E[h_{k+1}] ≤ ½ E[h_k]

Thus O(log 1/ε) epochs. Each epoch: one full gradient, T stochastic gradients.
Total running time:

O( (m + γ) d log(1/ε) )
GD++

#3: Momentum and Nesterov acceleration


Acceleration/momentum [Nesterov '83]
• Optimal gradient complexity (smooth, convex)

• Modest practical improvements; non-convex "momentum" methods

• With variance reduction, the fastest possible running time for first-order methods:

O( (m + √(γm)) d log(1/ε) )

[Woodworth, Srebro '15] – tight lower bound w. gradient oracle


Experiments w. convex losses

Improve upon GradientDescent++?


Next few slides:

Move from first order (GD++) to second order


Higher Order Optimization

• Gradient Descent – Direction of Steepest Descent


• Second Order Methods – Use Local Curvature
Newton's method (+ Trust region)

x_{t+1} = x_t − η [∇²f(x)]^{−1} ∇f(x)

[Figure: Newton iterates x_1, x_2, x_3 on a non-convex curve]

For a non-convex function: can move to ∞!

Solution: solve a quadratic approximation in a local area (trust region)
Newton's method (+ Trust region)

x_{t+1} = x_t − η [∇²f(x)]^{−1} ∇f(x)

d³ time per iteration!
Infeasible for ML!!

Till recently…
Speed up the Newton direction computation??

• Spielman-Teng '04: diagonally dominant systems of equations in linear time!
  • 2015 Gödel prize
  • Used by Daitch-Spielman for faster flow algorithms

• Erdogdu-Montanari '15: low-rank approximation & inversion by Sherman-Morrison
  • Allows stochastic information
  • Still prohibitive: rank × d²
Stochastic Newton?

x_{t+1} = x_t − η [∇²f(x)]^{−1} ∇f(x)

• ERM, rank-1 loss:   arg min_x E_{i∼[m]} [ ℓ(x^⊤ a_i, b_i) + ½ ‖x‖² ]

• Unbiased estimator of the Hessian:

∇̂² = a_i a_i^⊤ ⋅ ℓ''(x^⊤ a_i, b_i) + I ,    i ∼ U[1, …, m]

• Clearly E[∇̂²] = ∇²f, but E[∇̂²]^{−1} ≠ [∇²f]^{−1}
Circumvent Hessian creation and inversion!

• 3 steps:
• (1) represent the Hessian inverse as an infinite series:

∇^{−2} = ∑_{i=0}^{∞} (I − ∇²)^i

• (2) sample from the infinite series (Hessian-gradient product), ONCE — for any distribution over the naturals, i ∼ N:

∇^{−2}f ∇f = ∑_i (I − ∇²f)^i ∇f = E_{i∼N} [ (I − ∇²f)^i ∇f ⋅ (1/Pr[i]) ]

• (3) estimate the Hessian-power by sampling i.i.d. data examples (single example; vector-vector products only):

= E_{i∼N, k_1…k_i i.i.d.} [ ∏_{j=1…i} (I − ∇²f_{k_j}) ∇f ⋅ (1/Pr[i]) ]
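A minimal sketch of the Neumann-series estimator in its recursive form, v_j = ∇f + (I − ∇²f_{i_j}) v_{j−1}, whose expectation approaches [∇²f]^{−1}∇f. The per-example Hessians, scaling (to ensure 0 ≺ H ≺ I), and averaging below are illustrative assumptions:

import numpy as np

def newton_dir_estimate(hess_vec_i, grad, m, depth, rng):
    # Truncated Neumann series for H^{-1} grad, assuming 0 < H < I:
    #   v_j = grad + (I - H_{i_j}) v_{j-1}  ->  E[v_depth] ~ H^{-1} grad,
    # using one random example i_j per term (Hessian-vector products only).
    v = grad.copy()
    for _ in range(depth):
        i = rng.integers(m)
        v = grad + v - hess_vec_i(v, i)
    return v

# demo: per-example Hessians H_i = a_i a_i^T + 0.5 I with E[H_i] = H (illustrative)
rng = np.random.default_rng(5)
m, d = 500, 4
A = rng.normal(size=(m, d)) * 0.2
hv = lambda v, i: (A[i] @ v) * A[i] + 0.5 * v
H = A.T @ A / m + 0.5 * np.eye(d)
g = rng.normal(size=d)
est = np.mean([newton_dir_estimate(hv, g, m, depth=60, rng=rng) for _ in range(100)], axis=0)
print(np.linalg.norm(est - np.linalg.solve(H, g)))   # small estimation error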
Linear-time Second-order Stochastic Algorithm (LiSSA)

• Use the estimator ∇̂^{−2}f defined previously
• Compute a full (large batch) gradient ∇f
• Move in the direction ∇̂^{−2}f ∇f
• Theoretical running time to produce an ε-approximate solution for γ-well-conditioned functions (convex) [Agarwal, Bullins, Hazan '15]:

O( dm log(1/ε) + γ d√d log(1/ε) )

1. Faster than first-order methods!
2. Indications this is tight [Arjevani, Shamir '16]
What about constraints??

Next few slides – projection-free

(Frank-Wolfe) methods
Recommendation systems

[Figures over four slides: a users × movies matrix, entry (i, j) = rating of user i for movie j, with most entries missing; complete the missing entries; new rating data arrives; re-complete the missing entries]

Assume the "true matrix" is low rank; convex relaxation: bounded trace norm.
Bounded trace norm matrices

• Trace norm of a matrix = sum of its singular values

• K = { X | X is a matrix with trace norm at most D }

• Computational bottleneck: projections onto K
require an eigendecomposition: O(n³) operations

• But: linear optimization over K is easier —
computing the top eigenvector; O(sparsity) time
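A minimal power-iteration sketch of that linear optimization step: over the trace-norm ball, arg min_{‖X‖_tr ≤ D} ⟨∇, X⟩ = −D u v^⊤, where (u, v) is the leading singular pair of ∇. Iteration counts and the random matrix are illustrative:

import numpy as np

def top_singular_pair(M, iters=200, seed=6):
    # power iteration on M M^T; each step costs O(nnz(M)), sparse-friendly
    rng = np.random.default_rng(seed)
    u = rng.normal(size=M.shape[0])
    for _ in range(iters):
        v = M.T @ u; v /= np.linalg.norm(v)
        u = M @ v;  u /= np.linalg.norm(u)
    return u, v

M = np.random.default_rng(7).normal(size=(50, 30))
u, v = top_singular_pair(M)
print(u @ M @ v, np.linalg.svd(M, compute_uv=False)[0])   # the two should match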
Projections → linear optimization

1. Matrix completion (K = bounded trace norm matrices)
   eigendecomposition → largest singular vector computation

2. Online routing (K = flow polytope)
   conic optimization over the flow polytope → shortest path computation

3. Rotations (K = rotation matrices)
   conic optimization over the rotations set → Wahba's algorithm, linear time

4. Matroids (K = matroid polytope)
   convex opt. via the ellipsoid method → greedy algorithm, nearly linear time!
Conditional Gradient algorithm [Frank, Wolfe '56]
Convex opt problem:

min_{x ∈ K} f(x)

• f is smooth, convex
• linear opt over K is easy

v_t = arg min_{x ∈ K} ∇f(x_t)^⊤ x
x_{t+1} = x_t + η_t (v_t − x_t)

1. At iteration t: a convex combination of at most t vertices (sparsity)
2. No learning rate: η_t ≈ 1/t (independent of diameter, gradients etc.)
FW theorem

x_{t+1} = x_t + η_t (v_t − x_t) ,   v_t = arg min_{x ∈ K} ∇_t^⊤ x

Theorem:   f(x_t) − f(x*) = O(1/t)

Proof, main observation:

f(x_{t+1}) − f(x*) = f(x_t + η_t (v_t − x_t)) − f(x*)
≤ f(x_t) − f(x*) + η_t (v_t − x_t)^⊤ ∇_t + (β/2) η_t² ‖v_t − x_t‖²      (β-smoothness of f)
≤ f(x_t) − f(x*) + η_t (x* − x_t)^⊤ ∇_t + (β/2) η_t² ‖v_t − x_t‖²      (optimality of v_t)
≤ f(x_t) − f(x*) + η_t (f(x*) − f(x_t)) + (β/2) η_t² ‖v_t − x_t‖²      (convexity of f)
≤ (1 − η_t)(f(x_t) − f(x*)) + (β η_t²/2) D².

Thus, for h_t = f(x_t) − f(x*):

h_{t+1} ≤ (1 − η_t) h_t + O(η_t²)   ⇒   with η_t = O(1/t),   h_t = O(1/t)
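A minimal Frank-Wolfe sketch; the ℓ₁-ball feasible set and quadratic objective in the demo are illustrative assumptions (for the ℓ₁ ball, the linear optimization returns a signed coordinate vertex):

import numpy as np

def frank_wolfe(grad_f, linear_opt, x0, T):
    # v_t = arg min_{x in K} <grad f(x_t), x>;  x_{t+1} = x_t + eta_t (v_t - x_t)
    x = np.array(x0, dtype=float)
    for t in range(1, T + 1):
        v = linear_opt(grad_f(x))
        eta = 2.0 / (t + 2)           # the fixed 1/t-style schedule; no tuning needed
        x = x + eta * (v - x)
    return x

# demo: minimize ||x - z||^2 over the l1 ball, z inside the ball so the optimum is z
z = np.array([0.5, -0.2, 0.1])
def lin_opt_l1(g):
    i = np.argmax(np.abs(g))
    v = np.zeros_like(g); v[i] = -np.sign(g[i])
    return v

print(frank_wolfe(lambda x: 2 * (x - z), lin_opt_l1, np.zeros(3), T=500))

Each iterate is a convex combination of at most t vertices, which is the sparsity property noted above.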
Online Conditional Gradient

• Set x_1 ∈ K arbitrarily
• For t = 1, 2, …:
  1. Use x_t, obtain f_t
  2. Compute x_{t+1} as follows:

v_t = arg min_{x ∈ K} ( ∑_{i=1}^{t} ∇f_i(x_i) + σ_t x_t )^⊤ x
x_{t+1} ← (1 − α_t) x_t + α_t v_t

Theorem [Hazan, Kale '12]:   Regret = O(T^{3/4})

Theorem [Garber, Hazan '13]: For polytopes, strongly-convex and smooth losses:
1. Offline: convergence after t steps: e^{−Ω(t)}
2. Online: Regret = O(√T)
Agenda
1. Learning as mathematical optimization
• Empirical Risk Minimization
• Basics of mathematical optimization
• Gradient descent + SGD
2. Regularization and Generalization
• Convex, non-smooth opt.
• Regret minimization in games, PAC learnability
3. Gradient Descent++
• Regularization, Adaptive Regularization and AdaGrad
• Momentum and variance reduction
• Second order methods
• Constraints and the Frank-Wolfe algorithm
4. A touch of state of the art
* Some slides by Cyril Zhang
Non-convex optimization in ML

[Diagram: distribution over examples {a} ∈ R^d → machine → label]

arg min_{x ∈ R^d} (1/m) ∑_{i=1}^{m} ℓ(x, a_i, b_i) + R(x)
Stochastic Optimization in ML
- Problem: find  argmin_{x ∈ R^d} f(x)
- Stochastic first-order oracle:   E[∇̂f(x)] = ∇f(x),   E[‖∇̂f(x)‖²] ≤ σ²
- SGD [Robbins & Monro '51]: from f(x) = x^⊤ A x + b^⊤ x to f(x) = OutrageouslyDeepNeuralNet(x)

[Image source: Andrej Karpathy's PhD thesis]

"Modern ML is SGD++"

Momentum:
[Polyak '64], [Nemirovskii & Yudin '77], [Nesterov '83]

Variance Reduction:
[Le Roux et al. '12], [Shalev-Shwartz & Zhang '12], [Johnson & Zhang '13]

Adaptive Regularization:
[Duchi, Hazan, Singer '10], [Kingma & Ba '14]
Adaptive Optimizers: The Usual Recipe
- Each coordinate x[i] gets a learning rate D_t[i]
- D_t[i] chosen "adaptively" using ∇̂f(x_{1:t})[i]  (g_{1:t}[i] for short); a sketch of all three recipes follows below

- AdaGrad:   D_t[i] := 1 / √( ∑_{s=1}^{t} g_s[i]² )

- RMSprop:   D_t[i] := 1 / √( ∑_{s=1}^{t} 0.99^{t−s} g_s[i]² )

- Adam:   D_t[i] := 1 / √( (1/(1 − 0.99^t)) ∑_{s=1}^{t} 0.99^{t−s} g_s[i]² )
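A minimal NumPy comparison of the three per-coordinate learning rates, following the slide's simplified formulas (in particular, no (1 − β₂) factor in the exponential average; the random gradient history is an illustrative assumption):

import numpy as np

def preconditioners(g_hist, beta2=0.99, eps=1e-8):
    # g_hist: T x d array of gradients g_1..g_T; returns D_T[i] for each recipe
    T = len(g_hist)
    sq = g_hist ** 2
    adagrad = 1.0 / (np.sqrt(sq.sum(axis=0)) + eps)
    decay = beta2 ** np.arange(T - 1, -1, -1)[:, None]     # weights 0.99^{T-s}
    ema = (decay * sq).sum(axis=0)
    rmsprop = 1.0 / (np.sqrt(ema) + eps)
    adam = 1.0 / (np.sqrt(ema / (1 - beta2 ** T)) + eps)   # bias-corrected
    return adagrad, rmsprop, adam

g_hist = np.random.default_rng(9).normal(size=(100, 4))
for name, D in zip(["adagrad", "rmsprop", "adam"], preconditioners(g_hist)):
    print(name, D)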
Intuition: Adaptive Preconditioning

● Per-coordinate scaling matrix D_t = ( ∑_{s=1}^{t} diag(g_s[i]²) )^{−1/2}

● Adaptive methods learn a D which makes the loss surface more isotropic:

f(x) ↦ f(Dx)
What about the other AdaGrad?

diagonal preconditioning: O(d) time per iteration   vs.   full-matrix preconditioning: > O(d²) time per iteration

What does adaptive regularization even do?!

● Main theorem, AdaGrad
● No analysis for non-convex optimization
The Case for Full-Matrix Adaptive Regularization

● GGT, a new adaptive optimizer


● Efficient full-matrix (low-rank) AdaGrad

● Theory: “Adaptive” convergence rate on convex & non-convex 𝑓

● Experiments: viable in the deep learning era


● GPU-friendly; not much slower than SGD on deep models
● Accelerates training in deep learning benchmarks
● Empirical insights on anisotropic loss surfaces, real and synthetic
The GGT Algorithm

● SGD:   x_{t+1} ← x_t − η_t ⋅ g_t
● AdaGrad:   x_{t+1} ← x_t − diag( ∑_{s=1}^{t} g_s² )^{−1/2} ⋅ g_t
● Full-Matrix AdaGrad:   x_{t+1} ← x_t − ( ∑_{s=1}^{t} g_s g_s^⊤ )^{−1/2} ⋅ g_t
● GGT:   x_{t+1} ← x_t − ( G_t G_t^⊤ )^{−1/2} ⋅ g_t

G_t = [ g_t | β g_{t−1} | β² g_{t−2} | ⋯ | β^{r−1} g_{t−r+1} ] ,   window size r ≈ 200, dimension d ≈ 10⁷
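A minimal sketch of the GGT preconditioning step via the small SVD of the gradient window; the ε-regularization of the orthogonal complement and the demo dimensions are illustrative assumptions:

import numpy as np

def ggt_direction(G, g, eps=1e-4):
    # Precondition g by (G G^T + eps I)^{-1/2} using the tiny SVD of the
    # d x r window G: cost O(r^2 d + r^3), never forming a d x d matrix
    U, s, _ = np.linalg.svd(G, full_matrices=False)   # G G^T = U diag(s^2) U^T
    coeff = U.T @ g
    in_span = U @ (coeff / np.sqrt(s ** 2 + eps))     # components along span(G)
    out_span = (g - U @ coeff) / np.sqrt(eps)         # orthogonal complement
    return in_span + out_span

rng = np.random.default_rng(8)
d, r, beta = 1000, 20, 0.9
G = rng.normal(size=(d, r)) * beta ** np.arange(r)    # decayed gradient window
print(np.linalg.norm(ggt_direction(G, rng.normal(size=d))))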


Why a low-rank preconditioner?

● Answer 1: want to forget stale gradients (like Adam)

● Synthetic experiments: logistic regression, polytope analytic center

Why a low-rank preconditioner?

Full-matrix AdaGrad: matrix ops O(r d²); huge SVD: O(d³)
GGT: matrix ops O(r² d); tiny SVD: O(r³)
Large-Scale Experiments (CIFAR-10, PTB)

Visualizing Gradient Spectra

[Figure: eigenvalues of G_t^⊤ G_t at t = 150, for a 26-layer ResNet on CIFAR-10 and a 3-layer LSTM on Penn Treebank (char-level)]
Theory: Convergence of Non-Convex SGD

- Convex:   f(x̄_T) ≤ argmin_x f(x) + ε   in O(σ²/ε²) steps
- Non-convex:   ∃t: ‖∇f(x_t)‖ ≤ ε   within O(σ²/ε⁴) steps
- Reduction via a modified descent lemma; minimize 1/ε² times:

f(x_t) + ⟨∇f(x_t), x − x_t⟩ + L ‖x − x_t‖² ≥ f(x)      (linearized surrogate)
f(x) + 2L ‖x − x_t‖² ≥ f(x)      (proximal surrogate)
Theory: Adaptive Convergence Rate of GGT

● Define the adaptivity ratio μ:

μ := ∑_{t=1}^{T} ⟨g_t, x_t − x*⟩ / ( ‖x_1 − x*‖ √(∑_{t=1}^{T} ‖g_t‖²) ) = AdaGrad regret / worst-case OGD regret

● [DHS10]: μ ∈ [1/√d, √d] for diag-AdaGrad, sometimes smaller for full AdaGrad

● Strongly convex losses: GGT* converges in O(μ²σ²/ε) steps
● Non-convex reduction: GGT* converges in O(μ²σ²/ε⁴) steps
● First step towards analyzing adaptive methods in non-convex optimization
Agenda
1. Learning as mathematical optimization
• Empirical Risk Minimization
• Basics of mathematical optimization
• Gradient descent + SGD
2. Regularization and Generalization
• Convex, non-smooth opt.
• Regret minimization in games, PAC learnability
3. Gradient Descent++
• Regularization, Adaptive Regularization and AdaGrad
• Momentum and variance reduction
• Second order methods
• Constraints and the Frank-Wolfe algorithm
4. A touch of state of the art
* Some slides by Cyril Zhang
Recap

1. Basic non-convex optimization
   • GD and SGD
2. Online learning and stochastic optimization
   • Online gradient descent, non-smooth optimization
3. Regularization and generalization
   • Multiplicative update, mirrored descent
4. Advanced optimization
   • Adaptive regularization, acceleration, variance reduction, second-order methods, Frank-Wolfe
5. New algorithms: GGT

Collaborators @ Princeton & Google Brain:
Naman Agarwal, Brian Bullins, Xinyi Chen, Karan Singh, Cyril Zhang, Yi Zhang
Bibliography & more information, see:
http://www.cs.princeton.edu/~ehazan/tutorial/MLSStutorial.htm

Thank you!
