
Optimization Methods for Large-Scale Machine Learning

Frank E. Curtis, Lehigh University

presented at

East Coast Optimization Meeting


George Mason University
Fairfax, Virginia

April 2, 2021

References

- Léon Bottou, Frank E. Curtis, and Jorge Nocedal.
  Optimization Methods for Large-Scale Machine Learning.
  SIAM Review, 60(2):223–311, 2018.

- Frank E. Curtis and Katya Scheinberg.
  Optimization Methods for Supervised Machine Learning: From Linear Models to Deep Learning.
  In INFORMS Tutorials in Operations Research, chapter 5, pages 89–114. Institute for Operations Research and the Management Sciences (INFORMS), 2017.


Motivating questions

- How do optimization problems arise in machine learning applications, and what makes them challenging?
- What have been the most successful optimization methods for large-scale machine learning, and why?
- What recent advances have been made in the design of algorithms, and what are open questions in this research area?


Outline

GD and SG

GD vs. SG

Beyond SG

Noise Reduction Methods

Second-Order Methods

Conclusion


Learning problems and (surrogate) optimization problems

Learn a prediction function h : X → Y to solve

  max_{h∈H} ∫_{X×Y} 1[h(x) ≈ y] dP(x, y)

Various meanings for h(x) ≈ y depending on the goal:

- Binary classification, with y ∈ {−1, +1}: y · h(x) > 0.
- Regression, with y ∈ R^{n_y}: ‖h(x) − y‖ ≤ δ.

Parameterizing h by w ∈ R^d, we aim to solve

  max_{w∈R^d} ∫_{X×Y} 1[h(w; x) ≈ y] dP(x, y)

Now, common practice is to replace the indicator with a smooth loss. . .
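As a concrete illustration of that replacement, here is a minimal sketch of our own (the function names and the choice of the logistic loss as the smooth surrogate are assumptions, not taken from the slides):

import numpy as np

def zero_one_loss(y, score):
    # indicator-style loss: 1 if the score disagrees in sign with y in {-1, +1}, else 0 (nonsmooth)
    return float(y * score <= 0.0)

def logistic_loss(y, score):
    # smooth convex surrogate commonly used in its place: log(1 + exp(-y * score))
    return float(np.log1p(np.exp(-y * score)))

# label y = +1, predicted score h(w; x) = 0.3
print(zero_one_loss(+1.0, 0.3), logistic_loss(+1.0, 0.3))   # 0.0 and roughly 0.554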


Stochastic optimization

Over a parameter vector w ∈ R^d and given

  ℓ(·; y) ∘ h(w; x)   (loss w.r.t. “true label” ∘ prediction w.r.t. “features”),

consider the unconstrained optimization problem

  min_{w∈R^d} f(w),  where f(w) = E_{(x,y)}[ℓ(h(w; x), y)].

Given a training set {(x_i, y_i)}_{i=1}^n, an approximate problem is given by

  min_{w∈R^d} f_n(w),  where f_n(w) = (1/n) ∑_{i=1}^n ℓ(h(w; x_i), y_i).
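For concreteness, a small sketch of f_n (our own, not from the talk) for a linear predictor h(w; x) = w^T x with the logistic loss; as n grows, this sample average approaches the expected risk f(w) for i.i.d. data:

import numpy as np

def f_n(w, X, y):
    # f_n(w) = (1/n) * sum_i loss(h(w; x_i), y_i), here with h(w; x) = w^T x and logistic loss
    return float(np.mean(np.log1p(np.exp(-y * (X @ w)))))

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 5))                           # n = 1000 samples, d = 5 features
y = np.sign(X @ np.ones(5) + 0.1 * rng.standard_normal(1000))
print(f_n(np.ones(5), X, y))                                 # empirical risk at one particular w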


Text classification

[Image: first page of Bottou, Curtis, and Nocedal, “Optimization Methods for Large-Scale Machine Learning,” SIAM Review, 60(2):223–311, 2018, shown as a document to be classified, with the candidate labels “math” and “poetry”]

  min_{w∈R^d} (1/n) ∑_{i=1}^n log(1 + exp(−(w^T x_i) y_i)) + (λ/2) ‖w‖_2^2

Image / speech recognition

What pixel combinations represent the number 4?

What sounds are these? (“Here comes the sun” – The Beatles)


Deep neural networks


h(w; x) = a_l(W_l(· · · a_2(W_2(a_1(W_1 x + ω_1)) + ω_2) · · ·))

Figure: Illustration of a DNN with input layer (x_1, . . . , x_5), two hidden layers (h_11, . . . , h_14 and h_21, . . . , h_24), and output layer (h_1, h_2, h_3), connected by weight matrices W_1, W_2, W_3
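A minimal forward pass in the slide's notation (our own sketch; the layer sizes mirror the figure, while the choice of activations a_j and a bias ω_j at every layer are assumptions):

import numpy as np

def forward(x, Ws, omegas, acts):
    # h(w; x): apply a_j(W_j h + omega_j) layer by layer
    h = x
    for W, omega, a in zip(Ws, omegas, acts):
        h = a(W @ h + omega)
    return h

relu = lambda t: np.maximum(t, 0.0)

rng = np.random.default_rng(0)
shapes = [(4, 5), (4, 4), (3, 4)]                    # 5 inputs -> 4 -> 4 -> 3 outputs, as in the figure
Ws = [0.1 * rng.standard_normal(s) for s in shapes]
omegas = [np.zeros(s[0]) for s in shapes]
print(forward(rng.standard_normal(5), Ws, omegas, [relu, relu, lambda t: t]))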


Tradeoffs of large-scale learning

Bottou, Bousquet (2008) and Bottou (2010)

Notice that we went from our true problem

  max_{h∈H} ∫_{X×Y} 1[h(x) ≈ y] dP(x, y)

to saying that we’ll find our solution h ≡ h(w; ·) by (approximately) solving

  min_{w∈R^d} (1/n) ∑_{i=1}^n ℓ(h(w; x_i), y_i).

Three sources of error:


- approximation
- estimation
- optimization


Approximation error
Choice of prediction function family H has important implications; e.g.,

H_C := {h ∈ H : Ω(h) ≤ C}.

Figure: Illustration of C and training time vs. misclassification rate (testing and training curves in each panel)


Problems of interest

Let’s focus on the expected loss/risk problem

min f (w), where f (w) = E(x,y) [`(h(w; x), y)]


w∈Rd

and the empirical loss/risk problem


n
1X
min fn (w), where fn (w) = `(h(w; xi ), yi ).
w∈Rd n i=1

For this talk, let’s assume


- f is continuously differentiable, bounded below, and potentially nonconvex;
- ∇f is L-Lipschitz continuous, i.e., ‖∇f(w) − ∇f(w̄)‖_2 ≤ L‖w − w̄‖_2.


Gradient descent
Aim: Find a stationary point, i.e., w with ∇f(w) = 0.

Algorithm GD : Gradient Descent

1: choose an initial point w_0 ∈ R^d and stepsize α > 0
2: for k ∈ {0, 1, 2, . . .} do
3:   set w_{k+1} ← w_k − α∇f(w_k)
4: end for

[Figure: f near the iterate w_k, bracketed above by the quadratic model
  f(w_k) + ∇f(w_k)^T (w − w_k) + (1/2) L ‖w − w_k‖_2^2
and below by the quadratic model
  f(w_k) + ∇f(w_k)^T (w − w_k) + (1/2) c ‖w − w_k‖_2^2,
with the unknown values f(w) in between]
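A direct NumPy rendering of Algorithm GD (our own sketch; the quadratic test function is an assumption used only to exercise the loop):

import numpy as np

def gradient_descent(grad_f, w0, alpha, iters):
    # w_{k+1} <- w_k - alpha * grad f(w_k), fixed stepsize alpha > 0
    w = np.array(w0, dtype=float)
    for _ in range(iters):
        w = w - alpha * grad_f(w)
    return w

# f(w) = 0.5 * w^T A w with A positive definite, so grad f(w) = A w; here L = 3 and c = 1
A = np.array([[3.0, 0.0], [0.0, 1.0]])
print(gradient_descent(lambda w: A @ w, [1.0, -2.0], alpha=1.0 / 3.0, iters=100))   # approaches [0, 0]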

GD theory

Theorem GD

If α ∈ (0, 1/L], then ∑_{k=0}^∞ ‖∇f(w_k)‖_2^2 < ∞, which implies {∇f(w_k)} → 0.
If, in addition, f is c-strongly convex, then for all k ≥ 1:

  f(w_k) − f_* ≤ (1 − αc)^k (f(w_0) − f_*).

Proof.
  f(w_{k+1}) ≤ f(w_k) + ∇f(w_k)^T (w_{k+1} − w_k) + (1/2) L ‖w_{k+1} − w_k‖_2^2
            = f(w_k) − α‖∇f(w_k)‖_2^2 + (1/2) α^2 L ‖∇f(w_k)‖_2^2   (since w_{k+1} − w_k = −α∇f(w_k))
            ≤ f(w_k) − (1/2) α‖∇f(w_k)‖_2^2   (due to stepsize choice, αL ≤ 1)
            ≤ f(w_k) − αc (f(w_k) − f_*)   (strong convexity gives ‖∇f(w_k)‖_2^2 ≥ 2c (f(w_k) − f_*))
  =⇒ f(w_{k+1}) − f_* ≤ (1 − αc)(f(w_k) − f_*).
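A quick numerical check of the linear rate (our own construction): on f(w) = (1/2) w^T A w with A = diag(1, 2, 4), so c = 1, L = 4, and f_* = 0, the optimality gap stays below (1 − αc)^k (f(w_0) − f_*):

import numpy as np

A = np.diag([1.0, 2.0, 4.0])
alpha, c = 0.25, 1.0                      # alpha = 1/L
w = np.ones(3)
gap0 = 0.5 * w @ A @ w                    # f(w_0) - f_* = 3.5
for k in range(6):
    gap = 0.5 * w @ A @ w                 # f(w_k) - f_*
    print(k, round(gap, 6), round((1 - alpha * c) ** k * gap0, 6))   # gap stays below the bound
    w = w - alpha * (A @ w)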


GD illustration

Figure: GD with fixed stepsize


Stochastic gradient method (SG)

Invented by Herbert Robbins and Sutton Monro in 1951.

Sutton Monro, former Lehigh faculty member


Stochastic gradient descent

Use only an approximate gradient; e.g., a random index i_k so that E[∇_w ℓ(h(w; x_{i_k}), y_{i_k}) | w] = ∇f(w).

Algorithm SG : Stochastic Gradient


1: choose an initial point w_0 ∈ R^d and stepsizes {α_k} > 0
2: for k ∈ {0, 1, 2, . . .} do
3:   set w_{k+1} ← w_k − α_k g_k, where g_k ≈ ∇f(w_k)
4: end for

Not a descent method!


. . . but can guarantee eventual descent in expectation (with E_k[g_k] = ∇f(w_k)):

  f(w_{k+1}) ≤ f(w_k) + ∇f(w_k)^T (w_{k+1} − w_k) + (1/2) L ‖w_{k+1} − w_k‖_2^2
            = f(w_k) − α_k ∇f(w_k)^T g_k + (1/2) α_k^2 L ‖g_k‖_2^2
  =⇒ E_k[f(w_{k+1})] ≤ f(w_k) − α_k ‖∇f(w_k)‖_2^2 + (1/2) α_k^2 L E_k[‖g_k‖_2^2].

Markov process: w_{k+1} depends only on w_k and the random choice at iteration k.
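A bare-bones rendering of Algorithm SG (our own sketch; the uniform single-sample choice of i_k and the stepsize schedule are assumptions):

import numpy as np

def sg(grad_i, n, w0, stepsize, iters, seed=0):
    # w_{k+1} <- w_k - alpha_k * g_k, with g_k the gradient of the loss at one sampled index i_k,
    # so that E_k[g_k] equals the full empirical-risk gradient at w_k
    rng = np.random.default_rng(seed)
    w = np.array(w0, dtype=float)
    for k in range(iters):
        i = int(rng.integers(n))             # draw i_k uniformly from {0, ..., n-1}
        w = w - stepsize(k) * grad_i(w, i)   # alpha_k from a user-supplied schedule
    return w

# e.g., stepsize=lambda k: 0.1 for a fixed stepsize, or lambda k: 1.0 / (k + 1) for a diminishing one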


SG theory

Theorem SG
If E_k[‖g_k‖_2^2] ≤ M + ‖∇f(w_k)‖_2^2, then:

  α_k = 1/L     =⇒  E[(1/k) ∑_{j=1}^k ‖∇f(w_j)‖_2^2] ≤ M

  α_k = O(1/k)  =⇒  E[∑_{j=1}^k α_j ‖∇f(w_j)‖_2^2] < ∞.

If, in addition, f is c-strongly convex, then:

  α_k = 1/L     =⇒  E[f(w_k) − f_*] ≤ O((αL)(M/c)/2)

  α_k = O(1/k)  =⇒  E[f(w_k) − f_*] = O((L/c)(M/c)/k).

(*Assumed unbiased gradient estimates; see paper for more generality.)


Why O(1/k)?

Mathematically:

  ∑_{k=1}^∞ α_k = ∞   while   ∑_{k=1}^∞ α_k^2 < ∞

Graphically (sequential version of constant stepsize result):
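A toy numeric stand-in for that comparison (entirely our own construction, not the talk's figure): on f(w) = (1/2) w^2 with noisy gradients, a fixed stepsize stalls at a noise floor, while α_k = O(1/k) keeps creeping toward the minimizer.

import numpy as np

def run_sg(stepsize, iters=5000, seed=1):
    # minimize f(w) = 0.5 * w^2 using g_k = w_k + noise (unbiased, bounded variance)
    rng = np.random.default_rng(seed)
    w = 5.0
    for k in range(iters):
        w -= stepsize(k) * (w + rng.normal())
    return w

print(run_sg(lambda k: 0.1))             # settles into a noise floor around the minimizer w = 0
print(run_sg(lambda k: 1.0 / (k + 10)))  # typically much closer to w = 0 after the same number of iterations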


SG illustration

Figure: SG with fixed stepsize (left) vs. diminishing stepsizes (right)



Why SG over GD for large-scale machine learning?


  GD: E[f_n(w_k) − f_{n,*}] = O(ρ^k)    linear convergence
  SG: E[f_n(w_k) − f_{n,*}] = O(1/k)    sublinear convergence

So why SG?

  Motivation     Explanation
  Intuitive      data “redundancy”
  Empirical      SG vs. L-BFGS with batch gradient (below)
  Theoretical    E[f_n(w_k) − f_{n,*}] = O(1/k) and E[f(w_k) − f_*] = O(1/k)

[Figure: empirical risk vs. number of accessed data points (up to 4 × 10^5); the SGD curve drops well below the batch L-BFGS curve]


Work complexity
Time, not data, as limiting factor; Bottou, Bousquet (2008) and Bottou (2010).
        Convergence rate                      Time per iteration    Time for ε-optimality
  GD:   E[f_n(w_k) − f_{n,*}] = O(ρ^k)        O(n)                  n log(1/ε)
  SG:   E[f_n(w_k) − f_{n,*}] = O(1/k)        O(1)                  1/ε

Considering the total (estimation + optimization) error

  E = E[f(w_n) − f(w_*)] + E[f(w̃_n) − f(w_n)] ∼ 1/n + ε

and a time budget T, one finds:


- SG: Process as many samples as possible (n ∼ T), leading to

    E ∼ 1/T.

- GD: With n ∼ T / log(1/ε), minimizing E yields ε ∼ 1/T and

    E ∼ log(T)/T + 1/T   (see the numeric sketch below).
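A back-of-the-envelope numeric version of this comparison (our own illustration; all constants are set to 1):

import numpy as np

T = 1.0e6                            # time budget, measured in per-sample gradient evaluations

sg_error = 1.0 / T                   # SG: process n ~ T samples, so E ~ 1/n ~ 1/T

eps = 1.0 / T                        # GD: minimizing 1/n + eps with n ~ T / log(1/eps) gives eps ~ 1/T
gd_error = np.log(T) / T + eps       # E ~ log(T)/T + 1/T

print(sg_error, gd_error)            # 1e-06 vs roughly 1.5e-05: SG wins under the fixed time budget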



End of the story?

SG is great! Let’s keep proving how great it is!


- SG is “stable with respect to inputs”
- SG avoids “steep minima”
- SG avoids “saddle points”
- . . . (many more)

No, we should want more. . .

- SG requires a lot of “hyperparameter” tuning
- Sublinear convergence is not satisfactory
  - . . . a “linearly” convergent method eventually wins
  - . . . with a higher budget, faster computation, parallel?, distributed?

Also, any “gradient”-based method is not scale invariant.
