MLSS Tutorial
Elad Hazan
Princeton University and Google Brain
http://www.cs.princeton.edu/~ehazan/tutorial/MLSStutorial.htm
“Instead of trying to produce a programme to simulate the adult mind, why not rather try to produce one which simulates the child's?” (Alan Turing, 1950)
The machine as a black box: a distribution over examples $a \in \mathbb{R}^d$ and labels (e.g. chair vs. car), with
$$b = f_{\text{parameters}}(a)$$
This tutorial - training the machine
• Efficiency
• Generalization
Agenda
1. Learning as mathematical optimization
• Empirical Risk Minimization
• Basics of mathematical optimization
• Gradient descent + SGD
2. Regularization and Generalization
• Convex, non-smooth opt.
• Regret minimization in games, PAC learnability
3. Gradient Descent++
• Regularization, Adaptive Regularization and AdaGrad
• Momentum and variance reduction
• Second order methods
• Constraints and the Frank-Wolfe algorithm
4. A touch of state-of-the-art
Bibliography + more info: http://www.cs.princeton.edu/~ehazan/tutorial/MLSStutorial.htm
Accessing f?
• Value oracle (0-order opt.)
• Gradient (1st order opt.), k’th differentials (k’th order…)
Example (value-oracle access, the MAX-CUT objective):
$$\max_{x \in \{-1,1\}^n} \; \sum_{(i,j) \in E} \frac{1 - x_i x_j}{2}$$
What is Optimization
[figure: 3D surface plot of an objective function; figure credit: Duchi (UC Berkeley), Convex Optimization for Machine Learning]
Example: linear classification
Given a sample $S = \{(a_1, b_1), \dots, (a_m, b_m)\}$, find a hyperplane (through the origin w.l.o.g.) such that:
$$\arg\min_{x \in \mathbb{R}^d} \frac{1}{m}\sum_i \ell(x, a_i, b_i), \qquad \ell(x, a_i, b_i) = \begin{cases} 1 & \mathrm{sign}(x^\top a_i) \neq b_i \\ 0 & \mathrm{sign}(x^\top a_i) = b_i \end{cases}$$
NP-hard!
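A minimal numpy sketch (illustrative, not from the slides) of the empirical 0-1 risk that this objective minimizes; the toy data and dimensions below are made up for the example.

```python
import numpy as np

def zero_one_risk(x, A, b):
    """Empirical 0-1 loss of the hyperplane x on examples (A[i], b[i])."""
    preds = np.sign(A @ x)          # predicted labels
    return np.mean(preds != b)      # fraction of misclassified examples

# toy data: m examples in R^d with +/-1 labels (made up for illustration)
rng = np.random.default_rng(0)
m, d = 100, 5
A = rng.normal(size=(m, d))
b = np.sign(A @ rng.normal(size=d) + 0.1 * rng.normal(size=m))
x = rng.normal(size=d)
print(zero_one_risk(x, A, b))
```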
Example: deep neural net classification
Given a sample $S = \{(a_1, b_1), \dots, (a_m, b_m)\}$, find a depth-$N$ ResNet such that:
$$\arg\min_{W} \frac{1}{m}\sum_i \ell(W, a_i, b_i), \qquad \ell(W, a_i, b_i) = \begin{cases} 1 & \mathrm{Net}_W(a_i) \neq b_i \\ 0 & \mathrm{Net}_W(a_i) = b_i \end{cases}$$
Sum of signs → global optimization is NP-hard!
Convex functions:
$$f\!\left(\tfrac{1}{2}x + \tfrac{1}{2}y\right) \le \tfrac{1}{2} f(x) + \tfrac{1}{2} f(y)$$
• Informally: the graph "smiles" 🙂
• Alternative (first-order) definition:
$$f(y) \ge f(x) + \nabla f(x)^\top (y - x)$$
Convex sets
Solution concepts:
• First-order stationary points $\{h \mid \|\nabla f(h)\| = 0\}$, relaxed to $\{h \mid \|\nabla f(h)\| \le \epsilon\}$
• Second-order: $\|\nabla f(h)\| \le \epsilon$ and $\nabla^2 f(h) \succeq -\epsilon I$
• Does not guarantee a local minimum even with $\epsilon = 0$ (though it does for some functions)
We have:
1. Motivated learning as mathematical optimization
2. Convexity → locally verify global optimality
3. Non-convex opt. → k-th-order local optimality
(2nd order is reasonable, 4th order is NP-hard)
Next → algorithms!
Unconstrained Gradient Descent
$$[\nabla f(x)]_i = \frac{\partial}{\partial x_i} f(x)$$
$$x_{t+1} \leftarrow x_t - \eta\,\nabla f(x_t)$$
[figure: iterates $p_1, p_2, p_3, \dots$ descending toward $p^*$]
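A minimal gradient-descent sketch in numpy; the quadratic objective and step size below are made up for the example.

```python
import numpy as np

def gradient_descent(grad_f, x0, eta=0.1, T=100):
    """Unconstrained GD: x_{t+1} = x_t - eta * grad f(x_t)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(T):
        x = x - eta * grad_f(x)
    return x

# toy example: f(x) = ||x - c||^2 has gradient 2(x - c)
c = np.array([1.0, -2.0, 3.0])
x_star = gradient_descent(lambda x: 2 * (x - c), x0=np.zeros(3))
print(x_star)  # approaches c
```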
Smooth functions
$\beta$ = smoothness, simplified:
$$-\frac{\beta}{2} I \preceq \nabla^2 f(x) \preceq \frac{\beta}{2} I$$
Theorem (gradient descent on smooth $f$ with $|f| \le M$):
$$\frac{1}{T}\sum_t \|\nabla f(x_t)\|^2 \le \frac{2M\beta}{T}$$
For convex functions:
$$f(x) - f(x^*) \le \nabla f(x)^\top (x - x^*) \le D\,\|\nabla f(x)\|$$
and therefore
$$f(x) - f(x^*) \le \frac{D\sqrt{2M\beta}}{\sqrt{T}}$$
Our objective function:
$$\arg\min_{x \in \mathbb{R}^d}\; \frac{1}{m}\sum_{i=1}^{m} \ell_i(x, a_i, b_i) + R(x)$$
Stochastic gradient descent: pick an example at random to form an unbiased gradient estimator $\tilde\nabla(x_t)$, and move in the direction of the estimator: $x_{t+1} \leftarrow x_t - \eta\,\tilde\nabla(x_t)$
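A minimal SGD sketch in numpy; the least-squares data, step size, and iteration count are made up for the example.

```python
import numpy as np

def sgd(grad_example, x0, m, eta=0.01, T=1000, seed=0):
    """SGD sketch: at each step use the gradient of one random example
    as an unbiased estimate of the full gradient."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(T):
        i = rng.integers(m)                  # pick a random example
        x = x - eta * grad_example(x, i)
    return x

# toy least-squares example (data made up for illustration)
rng = np.random.default_rng(1)
m, d = 200, 5
A, b = rng.normal(size=(m, d)), rng.normal(size=m)
grad_i = lambda x, i: 2 * A[i] * (A[i] @ x - b[i])   # grad of (a_i^T x - b_i)^2
print(sgd(grad_i, np.zeros(d), m))
```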
Convergence?
$$\frac{1}{T}\sum_t \|\nabla f(x_t)\|^2 \le \sqrt{\frac{M\beta\sigma^2}{T}}$$
Proof: $x_{t+1} \leftarrow x_t - \eta\,\tilde\nabla f(x_t)$
1. The descent lemma:
$$\mathbb{E}\big[f(x_t) - f(x_{t+1})\big] \ge \mathbb{E}\left[-\nabla f(x_t)^\top (x_{t+1} - x_t) - \frac{\beta}{2}\|x_{t+1} - x_t\|^2\right]$$
$$= \mathbb{E}\left[\eta\,\nabla_t^\top \tilde\nabla_t - \frac{\beta}{2}\eta^2 \|\tilde\nabla_t\|^2\right] = \eta\,\|\nabla_t\|^2 - \frac{\beta}{2}\eta^2\sigma^2$$
and thus,
$$\frac{1}{T}\sum_t \|\nabla f(x_t)\|^2 \le \frac{M}{\eta T} + \eta\beta\sigma^2 \le \sqrt{\frac{M\beta\sigma^2}{T}}$$
Our objective function:
$$\arg\min_{x \in \mathbb{R}^d}\; \frac{1}{m}\sum_{i=1}^{m} \ell_i(x, a_i, b_i) + R(x)$$
Elad Hazan
Princeton University and Google Brain
Bibliography + more info:
http://www.cs.princeton.edu/~ehazan/tutorial/MLSStutorial.htm
Agenda
1. Learning as mathematical optimization
• Empirical Risk Minimization
• Basics of mathematical optimization
• Gradient descent + SGD
2. Regularization and Generalization
• Convex, non-smooth opt.
• Regret minimization in games, PAC learnability
3. Gradient Descent++
• Regularization, Adaptive Regularization and AdaGrad
• Momentum and variance reduction
• Second order methods
• Constraints and the Frank-Wolfe algorithm
4. A touch of state-of-the-art
→ What about non-smooth functions? (non-convex: infeasible)
→ Gradients are expensive! (entire data set per iteration)
→ How good is a local minimum in terms of generalization?
→ Can we get faster rates?
→ Convex optimization
Statistical (PAC) learning
Nature: examples $(a_1, b_1), \dots, (a_M, b_M)$ drawn i.i.d. from a distribution $D$ over $A \times B = \{(a, b)\}$
• Learner: produces hypotheses $h_1, h_2, \dots$
• Hypothesis $h$
• Loss, e.g. $\ell(h, (a, b)) = (h(a) - b)^2$
Goal: average regret vanishes,
$$\frac{1}{T}\left[\sum_t \ell\big(h_t, (a_t, b_t)\big) - \min_{h^* \in \mathcal{H}} \sum_t \ell\big(h^*, (a_t, b_t)\big)\right] \;\xrightarrow{T \to \infty}\; 0$$
Online gradient descent:
$$y_{t+1} \leftarrow x_t - \eta\,\nabla f_t(x_t), \qquad x_{t+1} = \arg\min_{x \in K} \|y_{t+1} - x\|$$
Theorem: for step size $\eta = \frac{D}{G\sqrt{T}}$ (NO SMOOTHNESS REQUIRED!),
$$\text{Regret} = \sum_t f_t(x_t) - \min_{x^* \in K} \sum_t f_t(x^*) \le DG\sqrt{T} = O(\sqrt{T})$$
Where:
• $G$ = upper bound on the norm of the gradients: $\|\nabla f_t(x_t)\| \le G$
• $D$ = diameter of the set $K$
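A minimal numpy sketch of online gradient descent with Euclidean projection; the L2-ball domain and the sequence of linear losses are made-up illustrations.

```python
import numpy as np

def project_l2_ball(y, radius=1.0):
    """Euclidean projection onto the L2 ball of the given radius (example domain K)."""
    norm = np.linalg.norm(y)
    return y if norm <= radius else y * (radius / norm)

def online_gradient_descent(grad_fns, x0, eta):
    """OGD sketch: gradient step on the current loss, then project back onto K."""
    x = np.asarray(x0, dtype=float)
    iterates = []
    for grad_ft in grad_fns:              # one loss function per round
        iterates.append(x.copy())
        y = x - eta * grad_ft(x)
        x = project_l2_ball(y)
    return iterates

# toy sequence of linear losses f_t(x) = c_t^T x (data made up for illustration)
rng = np.random.default_rng(2)
cs = rng.normal(size=(50, 3))
grads = [(lambda x, c=c: c) for c in cs]
xs = online_gradient_descent(grads, np.zeros(3), eta=0.1)
```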
Proof sketch:
1. Observation 1: $\|x^* - y_{t+1}\|^2 = \|x^* - x_t\|^2 - 2\eta\,\nabla f_t(x_t)^\top(x_t - x^*) + \eta^2\|\nabla f_t(x_t)\|^2$
2. Observation 2 (projection does not increase the distance to points of $K$): $\|x^* - x_{t+1}\|^2 \le \|x^* - y_{t+1}\|^2$
Thus:
$$\|x^* - x_{t+1}\|^2 \le \|x^* - x_t\|^2 - 2\eta\,\nabla f_t(x_t)^\top(x_t - x^*) + \eta^2 G^2$$
And hence, using convexity of $f_t$:
$$\frac{1}{T}\sum_t \big(f_t(x_t) - f_t(x^*)\big) \le \frac{1}{T}\sum_t \nabla f_t(x_t)^\top(x_t - x^*)$$
$$\le \frac{1}{T}\sum_t \frac{1}{2\eta}\big(\|x^* - x_t\|^2 - \|x^* - x_{t+1}\|^2\big) + \frac{\eta}{2}G^2
\le \frac{D^2}{2\eta T} + \frac{\eta}{2}G^2 \le \frac{DG}{\sqrt{T}}$$
Lower bound: $\text{Regret} = \Omega(\sqrt{T})$
• 2 loss functions, $T$ iterations, chosen at random each round:
• $K = [-1, 1]$, $f_1(x) = x$, $f_2(x) = -x$
• Second loss = first $\times (-1)$
• Expected loss $= 0$ (for any algorithm)
• Regret (compared to the better of $x = -1$ or $x = 1$):
$$\mathbb{E}\big[\,|\#1\text{'s} - \#(-1)\text{'s}|\,\big] = \Omega(\sqrt{T})$$
→ All kinds of constraints can be encoded, even restricting to continuous functions, e.g. $h(x) = \sin(2\pi x) = 0$ forces $x$ onto a discrete grid
Random example: $f_t(x) = \ell_i(x, a_i, b_i)$ for a uniformly random index $i$
1. By the OGD regret bound applied to the linear losses $\nabla_t^\top x$:
$$\frac{1}{T}\sum_t \nabla_t^\top x_t \le \min_{x^* \in K} \frac{1}{T}\sum_t \nabla_t^\top x^* + \frac{DG}{\sqrt{T}}$$
2. Taking (conditional) expectation:
$$\mathbb{E}\left[F\!\left(\frac{1}{T}\sum_t x_t\right)\right] - \min_{x^* \in K} F(x^*) \le \mathbb{E}\left[\frac{1}{T}\sum_t \nabla_t^\top (x_t - x^*)\right] \le \frac{DG}{\sqrt{T}}$$
One example per step, same convergence as GD, and gives direct generalization!
(formally needs martingales)
$O\!\left(\frac{d}{\epsilon^2}\right)$ vs. $O\!\left(\frac{md}{\epsilon^2}\right)$ total running time for $\epsilon$ generalization error.
Agenda
1. Learning as mathematical optimization
• Empirical Risk Minimization
• Basics of mathematical optimization
• Gradient descent + SGD
2. Regularization and Generalization
• Convex, non-smooth opt.
• Regret minimization in games, PAC learnability
3. Gradient Descent++
• Regularization, Adaptive Regularization and AdaGrad
• Momentum and variance reduction
• Second order methods
• Constraints and the Frank-Wolfe algorithm
4. A touch of state of the art
Regularization &
Gradient Descent++
Why “regularize”?
Well-conditioned ($\alpha$-strongly convex, $\beta$-smooth) functions:
$$0 \prec \frac{\alpha}{2} I \preceq \nabla^2 f(x) \preceq \frac{\beta}{2} I$$
versus general smooth functions:
$$-\frac{\beta}{2} I \preceq \nabla^2 f(x) \preceq \frac{\beta}{2} I$$
Why do we care?
well-conditioned functions exhibit much faster optimization!
Important for regularization
Minimize regret: best-in-hindsight
$$\text{Regret} = \sum_t f_t(x_t) - \min_{x^* \in K} \sum_t f_t(x^*)$$
• Most natural: play the best decision so far,
$$x_t = \arg\min_{x \in K} \sum_{i=1}^{t-1} f_i(x)$$
• Follow-The-Regularized-Leader (FTRL): add a regularizer $R$,
$$x_t = \arg\min_{x \in K} \sum_{i=1}^{t-1} \nabla f_i(x_i)^\top x + \frac{1}{\eta} R(x)$$
• e.g. $R(x) = \frac{1}{2}\|x\|^2$, for which FTRL becomes lazy projected gradient descent:
$$x_t = \arg\min_{x \in K} \sum_{i=1}^{t-1} \nabla f_i(x_i)^\top x + \frac{1}{\eta} R(x) = \Pi_K\!\left(-\eta \sum_{i=1}^{t-1} \nabla f_i(x_i)\right)$$
equivalently:
$$x_t = \Pi_K(y_t), \qquad y_{t+1} = y_t - \eta\,\nabla f_t(x_t)$$
($\Pi_K$ denotes Euclidean projection onto $K$)
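A minimal sketch of the lazy-projection form above, taking $K$ to be an L2 ball where projection is simple; the radius, step size, and loss interface are illustrative assumptions.

```python
import numpy as np

def ftrl_euclidean(grad_fns, radius=1.0, eta=0.1, d=3):
    """FTRL with R(x) = ||x||^2 / 2 over an L2 ball: equivalent to 'lazy' projected GD.
    Keeps the running gradient sum and projects -eta * sum onto K each round."""
    grad_sum = np.zeros(d)
    xs = []
    for grad_ft in grad_fns:
        y = -eta * grad_sum                              # unconstrained FTRL minimizer
        norm = np.linalg.norm(y)
        x = y if norm <= radius else y * radius / norm   # projection onto the ball
        xs.append(x)
        grad_sum += grad_ft(x)
    return xs
```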
FTRL vs. Multiplicative Weights
• Experts setting: $K = \Delta_n$, distributions over $n$ experts
• $f_t(x) = c_t^\top x$, where $c_t$ is the vector of losses
• $R(x) = \sum_i x_i \log x_i$: negative entropy
$$x_t = \arg\min_{x \in K} \sum_{i=1}^{t-1} \nabla f_i(x_i)^\top x + \frac{1}{\eta} R(x) = \exp\!\left(-\eta \sum_{i=1}^{t-1} c_i\right)\Big/\, Z_t$$
(entrywise exponential; $Z_t$ is the normalization constant)
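A minimal sketch of the resulting multiplicative-weights update; the loss vectors and learning rate below are made up for the example.

```python
import numpy as np

def multiplicative_weights(loss_vectors, eta=0.1):
    """Multiplicative-weights sketch: FTRL with the negative-entropy regularizer over
    the simplex. Weights are the entrywise exponential of the negated loss sum."""
    loss_vectors = np.asarray(loss_vectors, dtype=float)
    n = loss_vectors.shape[1]
    cum_loss = np.zeros(n)
    plays = []
    for c_t in loss_vectors:
        w = np.exp(-eta * cum_loss)      # entrywise exponential
        x_t = w / w.sum()                # normalize onto the simplex
        plays.append(x_t)
        cum_loss += c_t
    return plays

# toy run with 2 experts and made-up losses
print(multiplicative_weights([[1, 0], [0, 1], [1, 0]])[-1])
```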
General FTRL:
$$x_t = \arg\min_{x \in K} \sum_{i=1}^{t-1} \nabla f_i(x_i)^\top x + \frac{1}{\eta} R(x)$$
Bregman projection:
$$\Pi_K^{R}(y) = \arg\min_{x \in K} B_R(x \,\|\, y)$$
so that:
$$x_t = \Pi_K^{R}(y_t), \qquad y_{t+1} = (\nabla R)^{-1}\big(\nabla R(y_t) - \eta\,\nabla f_t(x_t)\big)$$
GD++
$$x_t = \arg\min_{x \in K} \sum_{i=1}^{t-1} f_i(x) + \frac{1}{\eta} R(x), \qquad R(x) = \|x\|_A^2 \ \text{ s.t. }\ A \succeq 0,\ \mathrm{Trace}(A) = d$$
AdaGrad:
• Set $x_1 \in K$ arbitrarily
• For $t = 1, 2, \dots$:
  1. Use $x_t$, obtain $f_t$ and gradient $g_t = \nabla f_t(x_t)$
  2. Compute $x_{t+1}$ as follows:
$$G_t = \sum_{i=1}^{t} g_i g_i^\top, \qquad x_{t+1} = x_t - \eta\, G_t^{-1/2} g_t \quad \text{or the diagonal version} \quad x_{t+1} = x_t - \eta\,\mathrm{diag}(G_t)^{-1/2} g_t$$
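A minimal sketch of the diagonal version; the small eps constant is an illustrative numerical safeguard, not part of the update on the slide.

```python
import numpy as np

def adagrad_diagonal(grad_fns, x0, eta=0.1, eps=1e-8):
    """Diagonal-AdaGrad sketch: per-coordinate step sizes from accumulated squared gradients."""
    x = np.asarray(x0, dtype=float)
    sq_sum = np.zeros_like(x)
    for grad_ft in grad_fns:
        g = grad_ft(x)
        sq_sum += g ** 2
        x = x - eta * g / (np.sqrt(sq_sum) + eps)   # step ~ diag(G_t)^{-1/2} g_t
    return x
```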
[figure: adaptive preconditioning intuition: a reparametrization $f(x) \mapsto f(Dx)$ can make the landscape easy 💁 or hard 😖]
The ratio of adaptivity:
• Theorem 2: $\mu = O\!\left(\sum_i \sqrt{\sum_t \nabla_{t,i}^2}\right) \in \left[\tfrac{1}{d},\, d\right]$
$$= \big(-\eta + \beta\eta^2\big)\|\nabla_t\|^2 = -\frac{1}{4\beta}\|\nabla_t\|^2 \qquad \left(\text{for } \eta = \tfrac{1}{2\beta}\right)$$
$$-2M \le f(x_T) - f(x_1) = \sum_t \big(f(x_{t+1}) - f(x_t)\big) \le -\frac{1}{4\beta}\sum_t \|\nabla_t\|^2$$
Thus, there exists a $t$ for which
$$\|\nabla_t\|^2 \le \frac{8M\beta}{T}$$
Smooth gradient descent
Conclusions: for $x_{t+1} = x_t - \eta\nabla_t$ and $T = \Omega\!\left(\frac{1}{\epsilon}\right)$, this finds a point with $\|\nabla_t\|^2 \le \epsilon$.
For the stochastic version:
$$\mathbb{E}\big[-\nabla_t^\top \eta\tilde\nabla_t + \beta\eta^2\|\tilde\nabla_t\|^2\big] = -\eta\|\nabla_t\|^2 + \eta^2\beta\,\mathbb{E}\|\tilde\nabla_t\|^2$$
$$T = O\!\left(\frac{M\beta\sigma^2}{\varepsilon^2}\right) \;\Rightarrow\; \exists\, t \le T:\ \|\nabla_t\|^2 \le \varepsilon$$
Controlling the variance:
Interpolating GD and SGD
Model: both full and stochastic gradients are available. The estimator combines both into a lower-variance random variable:
$$x_{t+1} = x_t - \eta\big(\tilde\nabla f(x_t) - \tilde\nabla f(x_0) + \nabla f(x_0)\big)$$
(the two stochastic gradients use the same random example; $x_0$ is the epoch's snapshot point)
Variance of the estimator, with $h_t = f(x_t) - f(x^*)$:
$$\le 2G^2\|x_t - x_0\|^2 + 2\beta h_t \quad\text{(smoothness)}\quad \le 2\frac{G^2}{\alpha}\,h_t + 2\beta h_t \quad\text{(by strong convexity)} \;=\; O(\beta h_t)$$
Variance Reduction: analysis
$$h_{s+1} = \frac{1}{T}\sum_t h_t \;\le\; \frac{1}{\alpha T}\sum_t \frac{\|\nabla_t\|^2}{t} \;\le\; \frac{\beta \log T}{\alpha T}\, h_s$$
Thus, after $O(\gamma)$ iterations, where $\gamma = \frac{\beta}{\alpha}$ is the condition number, the average distance is $\frac{1}{2}$ of what it (on average) was:
$$\mathbb{E}\big[h_{s+1}\big] \le \tfrac{1}{2}\,\mathbb{E}\big[h_s\big]$$
Thus $O\!\left(\log\frac{1}{\epsilon}\right)$ epochs suffice. Each epoch: one full gradient, $T$ stochastic gradients.
Total running time:
$$O\!\left((m + \gamma)\, d \log\frac{1}{\epsilon}\right)$$
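A minimal sketch of the variance-reduced scheme described above (SVRG-style); the function names, step size, and epoch lengths are illustrative assumptions rather than the constants from the analysis.

```python
import numpy as np

def svrg(grad_example, full_grad, x0, m, eta=0.05, epochs=20, T=100, seed=0):
    """Variance-reduction sketch: each epoch computes one full gradient at a snapshot
    x_snap, then takes T stochastic steps with the corrected estimator
    grad_i(x) - grad_i(x_snap) + full_grad(x_snap)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(epochs):
        x_snap = x.copy()
        mu = full_grad(x_snap)               # one full gradient per epoch
        for _ in range(T):
            i = rng.integers(m)
            v = grad_example(x, i) - grad_example(x_snap, i) + mu
            x = x - eta * v
    return x
```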
GD++
Newton's method: $x_{t+1} = x_t - \eta\,\big[\nabla^2 f(x_t)\big]^{-1}\nabla f(x_t)$
[figure: Newton iterates $x_1, x_2, x_3$ converging to the minimum]
$d^3$ time per iteration! Infeasible for ML!!
Till recently…
Speed up the Newton direction computation??
• Clearly $\mathbb{E}\big[\tilde\nabla^2 f\big] = \nabla^2 f$, but $\mathbb{E}\big[(\tilde\nabla^2 f)^{-1}\big] \neq \big(\nabla^2 f\big)^{-1}$
Circumvent Hessian creation and inversion!
• 3 steps:
• (1) represent Hessian inverse as infinite series
For any distribution on the naturals, $i \sim N$:
$$\nabla^{-2} = \sum_{i=0}^{\infty} \big(I - \nabla^2\big)^i$$
$$\nabla^{-2} f\,\nabla f = \sum_i \big(I - \nabla^2 f\big)^i \nabla f = \mathbb{E}_{i \sim N}\!\left[\big(I - \nabla^2 f\big)^i \nabla f \cdot \frac{1}{\Pr[i]}\right]$$
$$= \mathbb{E}_{i \sim N,\ j \sim [i]}\!\left[\prod_{j=1}^{i}\big(I - \nabla^2 f_j\big)\,\nabla f \cdot \frac{1}{\Pr[i]}\right]$$
(single examples; vector-vector products only)
Linear-time Second-order Stochastic Algorithm
(LiSSA)
Running time:
$$O\!\left(dm\log\frac{1}{\epsilon} + \gamma\, d\sqrt{d}\,\log\frac{1}{\epsilon}\right)$$
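A rough sketch of the estimator idea behind LiSSA: approximate the Newton direction $H^{-1}\nabla f$ with a truncated Neumann series using Hessian-vector products of single random examples. The helper name hvp_example and the parameters S and depth are made up for illustration, and it assumes the Hessian has been scaled so that $\|\nabla^2 f\| \preceq I$.

```python
import numpy as np

def lissa_direction(hvp_example, grad, m, S=10, depth=20, seed=0):
    """Estimate H^{-1} grad via the recursion u_{k+1} = grad + (I - H_j) u_k,
    where H_j is the Hessian of a single random example (accessed only through
    Hessian-vector products), averaged over S independent runs."""
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(S):
        u = np.array(grad, dtype=float)
        for _ in range(depth):
            j = rng.integers(m)                  # single random example
            u = grad + u - hvp_example(u, j)     # u <- grad + (I - H_j) u
        estimates.append(u)
    return np.mean(estimates, axis=0)
```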
[figure: a users × movies ratings matrix with most entries missing]
Recommendation systems
[figure: the users × movies matrix after completing the missing entries]
Recommendation systems
[figure: the completed matrix, now with new rating data arriving]
Recommendation systems
[figure: the matrix after re-completing the missing entries with the new data]
The Frank-Wolfe algorithm:
• $f$ is smooth, convex
• linear optimization over $K$ is easy
• Set $x_1 \in K$ arbitrarily
• For $t = 1, 2, \dots$:
  1. Use $x_t$, obtain $f_t$
  2. Compute $x_{t+1}$ as follows:
$$v_t = \arg\min_{x \in K}\ \left(\sum_{i=1}^{t}\nabla f_i(x_i) + t\,x_t\right)^{\!\top} x$$
$$x_{t+1} \leftarrow (1 - \alpha_t)\,x_t + \alpha_t v_t$$
[figure: $x_{t+1}$ lies on the segment between $x_t$ and $v_t$, where $v_t$ minimizes the linear objective over $K$]
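A minimal sketch of the textbook (offline) Frank-Wolfe step over the L1 ball, where linear optimization is easy because the minimizer is a signed vertex; the slide's online variant uses accumulated gradients, so this is an illustrative simplification with made-up data.

```python
import numpy as np

def frank_wolfe_l1(grad_f, x0, radius=1.0, T=100):
    """Frank-Wolfe sketch over the L1 ball: linear step toward the best vertex,
    then a convex combination keeps the iterate inside K (no projection needed)."""
    x = np.asarray(x0, dtype=float)
    for t in range(1, T + 1):
        g = grad_f(x)
        v = np.zeros_like(x)
        i = np.argmax(np.abs(g))
        v[i] = -radius * np.sign(g[i])      # vertex of the L1 ball minimizing g^T x
        alpha = 2.0 / (t + 2)               # standard step-size schedule
        x = (1 - alpha) * x + alpha * v
    return x

# toy quadratic: f(x) = ||x - c||^2 constrained to the L1 ball
c = np.array([0.8, -0.3, 0.1])
print(frank_wolfe_l1(lambda x: 2 * (x - c), np.zeros(3)))
```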
The machine: a distribution over examples $a \in \mathbb{R}^d$ and labels; training solves
$$\arg\min_{x \in \mathbb{R}^d}\; \frac{1}{m}\sum_{i=1}^{m} \ell_i(x, a_i, b_i) + R(x)$$
Stochastic Optimization in ML
- Problem: find $\arg\min_{x \in \mathbb{R}^d} f(x)$
- Stochastic first-order oracle: $\mathbb{E}\big[\widehat{\nabla f}(x)\big] = \nabla f(x)$, $\mathbb{E}\big\|\widehat{\nabla f}(x)\big\|^2 \le \sigma^2$
- SGD [Robbins & Monro '51]: $x_{t+1} \leftarrow x_t - \eta_t\, \widehat{\nabla f}(x_t)$
- Examples: $f(x) = x^\top \mathbf{A}x + b^\top x$ or $f(x) = \mathrm{OutrageouslyDeepNeuralNet}(x)$
Variance Reduction: [Le Roux et al. '12], [Shalev-Shwartz & Zhang '12], [Johnson & Zhang '13]
Momentum: [Polyak '64], [Nemirovskii & Yudin '77], [Nesterov '83]
Adaptive Regularization: [Duchi, Hazan, Singer '10], [Kingma & Ba '14]
Adaptive Optimizers: The Usual Recipe
- Each coordinate $x[i]$ gets its own learning rate $\mathbf{D}_t[i]$
- $\mathbf{D}_t[i]$ is chosen “adaptively” using $\widehat{\nabla f}(x_{1:t})[i]$ ($g_{1:t}[i]$ for short)
- AdaGrad: $D_t[i] := \dfrac{1}{\sqrt{\sum_{s \le t} g_s[i]^2}}$
- RMSprop: $D_t[i] := \dfrac{1}{\sqrt{\sum_{s \le t} 0.99^{\,t-s}\, g_s[i]^2}}$
- Adam: $D_t[i] := \dfrac{1}{\sqrt{\frac{1}{1 - 0.99^{\,t}}\sum_{s \le t} 0.99^{\,t-s}\, g_s[i]^2}}$
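A small numpy sketch of the three recipes above; the decay constant 0.99 and the eps safeguard are illustrative, and the Adam bias correction is written in one common form (presentations differ slightly).

```python
import numpy as np

def adaptive_step_sizes(grads, rule="adagrad", beta2=0.99, eps=1e-8):
    """Per-coordinate learning-rate scalings D_t[i] for AdaGrad / RMSprop / Adam.
    `grads` is the list g_1, ..., g_t of gradient vectors."""
    grads = np.asarray(grads, dtype=float)
    t = len(grads)
    if rule == "adagrad":
        v = (grads ** 2).sum(axis=0)                       # sum of squared gradients
    else:
        weights = beta2 ** np.arange(t - 1, -1, -1)        # 0.99^{t-s}
        v = (weights[:, None] * grads ** 2).sum(axis=0)    # exponentially weighted sum
        if rule == "adam":
            v = v / (1 - beta2 ** t)                       # bias correction
    return 1.0 / (np.sqrt(v) + eps)
```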
Intuition: Adaptive Preconditioning
A reparametrization $f(x) \mapsto f(Dx)$ can turn an easy landscape 💁 into a hard one 😖 (and vice versa); adaptive preconditioning tries to undo a bad $D$ on the fly.
What about the other AdaGrad?
● SGD: $x_{t+1} \leftarrow x_t - \eta_t \cdot g_t$
● AdaGrad: $x_{t+1} \leftarrow x_t - \eta\,\mathrm{diag}\!\big(\sum_{s=1}^{t} g_s^2\big)^{-1/2} \cdot g_t$
● Full-Matrix AdaGrad: $x_{t+1} \leftarrow x_t - \eta\big(\sum_{s=1}^{t} g_s g_s^\top\big)^{-1/2} \cdot g_t$
● GGT: $x_{t+1} \leftarrow x_t - \eta\,\big(\mathbf{G}_t \mathbf{G}_t^\top\big)^{-1/2} \cdot g_t$, where $\mathbf{G}_t$ stacks a window of recent gradients
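A rough sketch of why the GGT step can be cheap: with $\mathbf{G}_t$ of size $d \times r$, an SVD of the small factor applies a regularized $(\mathbf{G}_t\mathbf{G}_t^\top + \epsilon^2 I)^{-1/2}$ in roughly $O(dr^2)$ time instead of $d^3$. The windowing, decay, and exact placement of $\epsilon$ here follow the general idea, not the paper's exact algorithm.

```python
import numpy as np

def ggt_direction(G, g, eps=1e-4):
    """Apply (G G^T + eps^2 I)^{-1/2} to g, where G is d x r (recent gradients as columns)."""
    U, s, _ = np.linalg.svd(G, full_matrices=False)   # columns of U span the gradients
    g_in = U.T @ g                                    # component inside span(G)
    # eigenvalues of G G^T + eps^2 I are s_i^2 + eps^2 on span(U), and eps^2 off it
    precond_in = U @ (g_in / np.sqrt(s ** 2 + eps ** 2))
    precond_out = (g - U @ g_in) / eps                # orthogonal component scaled by 1/eps
    return precond_in + precond_out
```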
Experiments (gradient window $r \approx 200$): [figures: results for a 26-layer ResNet on CIFAR-10 and a 3-layer LSTM on Penn Treebank (char-level)]
Theory: Convergence of Non-Convex SGD
- Convex: $f(x_T) \le \min_x f(x) + \varepsilon$ in $O\!\left(\frac{\sigma^2}{\varepsilon^2}\right)$ steps
- Non-convex: $\exists\, t:\ \|\nabla f(x_t)\| \le \varepsilon$ within $O\!\left(\frac{\sigma^2}{\varepsilon^4}\right)$ steps
- Reduction via a modified descent lemma: run the convex analysis $1/\varepsilon^2$ times, each time minimizing a convex model around $x_t$,
$$f(x_t) + \langle\nabla f(x_t),\, x - x_t\rangle + L\|x - x_t\|^2 \qquad\text{or}\qquad f(x) + 2L\|x - x_t\|^2 \;\ge\; f(x)$$
each sub-problem produces the next iterate $x_{t+1}$ from $x_t$.
Theory: Adaptive Convergence Rate of GGT
● [DHS10]: $\mu \in \left[\tfrac{1}{d},\, d\right]$ for diag-AdaGrad, sometimes smaller for full AdaGrad
● Strongly convex losses: GGT* converges in $O\!\left(\frac{\mu^2\sigma^2}{\varepsilon}\right)$ steps
● Non-convex reduction: GGT* converges in $O\!\left(\frac{\mu^2\sigma^2}{\varepsilon^4}\right)$ steps
● First step towards analyzing adaptive methods in non-convex optimization
Agenda
1. Learning as mathematical optimization
• Empirical Risk Minimization
• Basics of mathematical optimization
• Gradient descent + SGD
2. Regularization and Generalization
• Convex, non-smooth opt.
• Regret minimization in games, PAC learnability
3. Gradient Descent++
• Regularization, Adaptive Regularization and AdaGrad
• Momentum and variance reduction
• Second order methods
• Constraints and the Frank-Wolfe algorithm
4. A touch of state of the art
* Some slides by Cyril Zhang
Recap
[thumbnail slides: the chair/car learning problem, “What is Optimization”, Frank-Wolfe, …]
4. Advanced optimization
5. New algorithms: GGT
Collaborators: Google & Princeton-Brain
Thank you!