
Optimization Methods for Large-Scale Machine Learning

Frank E. Curtis, Lehigh University

presented at

East Coast Optimization Meeting


George Mason University
Fairfax, Virginia

April 2, 2021

References

- Léon Bottou, Frank E. Curtis, and Jorge Nocedal.
  Optimization Methods for Large-Scale Machine Learning.
  SIAM Review, 60(2):223–311, 2018.

- Frank E. Curtis and Katya Scheinberg.
  Optimization Methods for Supervised Machine Learning: From Linear Models to Deep Learning.
  In INFORMS Tutorials in Operations Research, chapter 5, pages 89–114. Institute for Operations Research and the Management Sciences (INFORMS), 2017.


Motivating questions

- How do optimization problems arise in machine learning applications, and what makes them challenging?
- What have been the most successful optimization methods for large-scale machine learning, and why?
- What recent advances have been made in the design of algorithms, and what are open questions in this research area?


Outline

GD and SG

GD vs. SG

Beyond SG

Noise Reduction Methods

Second-Order Methods

Conclusion


Learning problems and (surrogate) optimization problems

Learn a prediction function h : X → Y to solve

  max_{h∈H} ∫_{X×Y} 1[h(x) ≈ y] dP(x, y)

Various meanings for h(x) ≈ y depending on the goal:

- Binary classification, with y ∈ {−1, +1}: y · h(x) > 0.
- Regression, with y ∈ R^{n_y}: ‖h(x) − y‖ ≤ δ.

Parameterizing h by w ∈ R^d, we aim to solve

  max_{w∈R^d} ∫_{X×Y} 1[h(w; x) ≈ y] dP(x, y)

Now, common practice is to replace the indicator with a smooth loss. . .
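As a concrete illustration of that replacement, here is a minimal sketch of our own (the function names and the choice of the logistic loss as the smooth surrogate are assumptions, not taken from the slides):

import numpy as np

def zero_one_loss(y, score):
    # indicator-style loss: 1 if the score disagrees in sign with y in {-1, +1}, else 0 (nonsmooth)
    return float(y * score <= 0.0)

def logistic_loss(y, score):
    # smooth convex surrogate commonly used in its place: log(1 + exp(-y * score))
    return float(np.log1p(np.exp(-y * score)))

# label y = +1, predicted score h(w; x) = 0.3
print(zero_one_loss(+1.0, 0.3), logistic_loss(+1.0, 0.3))   # 0.0 and roughly 0.554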


Stochastic optimization

Over a parameter vector w ∈ R^d and given

  ℓ(·; y) ∘ h(w; x)   (loss w.r.t. “true label” ∘ prediction w.r.t. “features”),

consider the unconstrained optimization problem

  min_{w∈R^d} f(w),  where f(w) = E_{(x,y)}[ℓ(h(w; x), y)].

Given a training set {(x_i, y_i)}_{i=1}^n, an approximate problem is given by

  min_{w∈R^d} f_n(w),  where f_n(w) = (1/n) ∑_{i=1}^n ℓ(h(w; x_i), y_i).
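For concreteness, a small sketch of f_n (our own, not from the talk) for a linear predictor h(w; x) = w^T x with the logistic loss; as n grows, this sample average approaches the expected risk f(w) for i.i.d. data:

import numpy as np

def f_n(w, X, y):
    # f_n(w) = (1/n) * sum_i loss(h(w; x_i), y_i), here with h(w; x) = w^T x and logistic loss
    return float(np.mean(np.log1p(np.exp(-y * (X @ w)))))

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 5))                           # n = 1000 samples, d = 5 features
y = np.sign(X @ np.ones(5) + 0.1 * rng.standard_normal(1000))
print(f_n(np.ones(5), X, y))                                 # empirical risk at one particular w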


Text classification

[Image: first page of Bottou, Curtis, and Nocedal, “Optimization Methods for Large-Scale Machine Learning,” SIAM Review, 60(2):223–311, 2018, shown as a document to be classified, with the candidate labels “math” and “poetry”]

  min_{w∈R^d} (1/n) ∑_{i=1}^n log(1 + exp(−(w^T x_i) y_i)) + (λ/2) ‖w‖_2^2

Image / speech recognition

What pixel combinations represent the number 4?

What sounds are these? (“Here comes the sun” – The Beatles)


Deep neural networks


h(w; x) = a_l(W_l(· · · a_2(W_2(a_1(W_1 x + ω_1)) + ω_2) · · ·))

Figure: Illustration of a DNN with input layer (x_1, . . . , x_5), two hidden layers (h_11, . . . , h_14 and h_21, . . . , h_24), and output layer (h_1, h_2, h_3), connected by weight matrices W_1, W_2, W_3
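A minimal forward pass in the slide's notation (our own sketch; the layer sizes mirror the figure, while the choice of activations a_j and a bias ω_j at every layer are assumptions):

import numpy as np

def forward(x, Ws, omegas, acts):
    # h(w; x): apply a_j(W_j h + omega_j) layer by layer
    h = x
    for W, omega, a in zip(Ws, omegas, acts):
        h = a(W @ h + omega)
    return h

relu = lambda t: np.maximum(t, 0.0)

rng = np.random.default_rng(0)
shapes = [(4, 5), (4, 4), (3, 4)]                    # 5 inputs -> 4 -> 4 -> 3 outputs, as in the figure
Ws = [0.1 * rng.standard_normal(s) for s in shapes]
omegas = [np.zeros(s[0]) for s in shapes]
print(forward(rng.standard_normal(5), Ws, omegas, [relu, relu, lambda t: t]))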


Tradeoffs of large-scale learning

Bottou, Bousquet (2008) and Bottou (2010)

Notice that we went from our true problem

  max_{h∈H} ∫_{X×Y} 1[h(x) ≈ y] dP(x, y)

to saying that we’ll find our solution h ≡ h(w; ·) by (approximately) solving

  min_{w∈R^d} (1/n) ∑_{i=1}^n ℓ(h(w; x_i), y_i).

Three sources of error:


- approximation
- estimation
- optimization


Approximation error
Choice of prediction function family H has important implications; e.g.,

H_C := {h ∈ H : Ω(h) ≤ C}.

Figure: Illustration of C and training time vs. misclassification rate (testing and training curves in each panel)


Problems of interest

Let’s focus on the expected loss/risk problem

min f (w), where f (w) = E(x,y) [`(h(w; x), y)]


w∈Rd

and the empirical loss/risk problem


n
1X
min fn (w), where fn (w) = `(h(w; xi ), yi ).
w∈Rd n i=1

For this talk, let’s assume


- f is continuously differentiable, bounded below, and potentially nonconvex;
- ∇f is L-Lipschitz continuous, i.e., ‖∇f(w) − ∇f(w̄)‖_2 ≤ L‖w − w̄‖_2.


Gradient descent
Aim: Find a stationary point, i.e., w with ∇f(w) = 0.

Algorithm GD : Gradient Descent

1: choose an initial point w_0 ∈ R^d and stepsize α > 0
2: for k ∈ {0, 1, 2, . . .} do
3:   set w_{k+1} ← w_k − α∇f(w_k)
4: end for

[Figure: f near the iterate w_k, bracketed above by the quadratic model
  f(w_k) + ∇f(w_k)^T (w − w_k) + (1/2) L ‖w − w_k‖_2^2
and below by the quadratic model
  f(w_k) + ∇f(w_k)^T (w − w_k) + (1/2) c ‖w − w_k‖_2^2,
with the unknown values f(w) in between]
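A direct NumPy rendering of Algorithm GD (our own sketch; the quadratic test function is an assumption used only to exercise the loop):

import numpy as np

def gradient_descent(grad_f, w0, alpha, iters):
    # w_{k+1} <- w_k - alpha * grad f(w_k), fixed stepsize alpha > 0
    w = np.array(w0, dtype=float)
    for _ in range(iters):
        w = w - alpha * grad_f(w)
    return w

# f(w) = 0.5 * w^T A w with A positive definite, so grad f(w) = A w; here L = 3 and c = 1
A = np.array([[3.0, 0.0], [0.0, 1.0]])
print(gradient_descent(lambda w: A @ w, [1.0, -2.0], alpha=1.0 / 3.0, iters=100))   # approaches [0, 0]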

GD theory

Theorem GD

If α ∈ (0, 1/L], then ∑_{k=0}^∞ ‖∇f(w_k)‖_2^2 < ∞, which implies {∇f(w_k)} → 0.
If, in addition, f is c-strongly convex, then for all k ≥ 1:

  f(w_k) − f_* ≤ (1 − αc)^k (f(w_0) − f_*).

Proof.
  f(w_{k+1}) ≤ f(w_k) + ∇f(w_k)^T (w_{k+1} − w_k) + (1/2) L ‖w_{k+1} − w_k‖_2^2
            = f(w_k) − α‖∇f(w_k)‖_2^2 + (1/2) α^2 L ‖∇f(w_k)‖_2^2   (since w_{k+1} − w_k = −α∇f(w_k))
            ≤ f(w_k) − (1/2) α‖∇f(w_k)‖_2^2   (due to stepsize choice, αL ≤ 1)
            ≤ f(w_k) − αc (f(w_k) − f_*)   (strong convexity gives ‖∇f(w_k)‖_2^2 ≥ 2c (f(w_k) − f_*))
  =⇒ f(w_{k+1}) − f_* ≤ (1 − αc)(f(w_k) − f_*).
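A quick numerical check of the linear rate (our own construction): on f(w) = (1/2) w^T A w with A = diag(1, 2, 4), so c = 1, L = 4, and f_* = 0, the optimality gap stays below (1 − αc)^k (f(w_0) − f_*):

import numpy as np

A = np.diag([1.0, 2.0, 4.0])
alpha, c = 0.25, 1.0                      # alpha = 1/L
w = np.ones(3)
gap0 = 0.5 * w @ A @ w                    # f(w_0) - f_* = 3.5
for k in range(6):
    gap = 0.5 * w @ A @ w                 # f(w_k) - f_*
    print(k, round(gap, 6), round((1 - alpha * c) ** k * gap0, 6))   # gap stays below the bound
    w = w - alpha * (A @ w)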


GD illustration

Figure: GD with fixed stepsize


Stochastic gradient method (SG)

Invented by Herbert Robbins and Sutton Monro in 1951.

Sutton Monro, former Lehigh faculty member


Stochastic gradient descent

Use only an approximate gradient; e.g., a random index i_k so that E[∇_w ℓ(h(w; x_{i_k}), y_{i_k}) | w] = ∇f(w).

Algorithm SG : Stochastic Gradient


1: choose an initial point w_0 ∈ R^d and stepsizes {α_k} > 0
2: for k ∈ {0, 1, 2, . . .} do
3:   set w_{k+1} ← w_k − α_k g_k, where g_k ≈ ∇f(w_k)
4: end for

Not a descent method!


. . . but can guarantee eventual descent in expectation (with E_k[g_k] = ∇f(w_k)):

  f(w_{k+1}) ≤ f(w_k) + ∇f(w_k)^T (w_{k+1} − w_k) + (1/2) L ‖w_{k+1} − w_k‖_2^2
            = f(w_k) − α_k ∇f(w_k)^T g_k + (1/2) α_k^2 L ‖g_k‖_2^2
  =⇒ E_k[f(w_{k+1})] ≤ f(w_k) − α_k ‖∇f(w_k)‖_2^2 + (1/2) α_k^2 L E_k[‖g_k‖_2^2].

Markov process: w_{k+1} depends only on w_k and the random choice at iteration k.
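A bare-bones rendering of Algorithm SG (our own sketch; the uniform single-sample choice of i_k and the stepsize schedule are assumptions):

import numpy as np

def sg(grad_i, n, w0, stepsize, iters, seed=0):
    # w_{k+1} <- w_k - alpha_k * g_k, with g_k the gradient of the loss at one sampled index i_k,
    # so that E_k[g_k] equals the full empirical-risk gradient at w_k
    rng = np.random.default_rng(seed)
    w = np.array(w0, dtype=float)
    for k in range(iters):
        i = int(rng.integers(n))             # draw i_k uniformly from {0, ..., n-1}
        w = w - stepsize(k) * grad_i(w, i)   # alpha_k from a user-supplied schedule
    return w

# e.g., stepsize=lambda k: 0.1 for a fixed stepsize, or lambda k: 1.0 / (k + 1) for a diminishing one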


SG theory

Theorem SG
If E_k[‖g_k‖_2^2] ≤ M + ‖∇f(w_k)‖_2^2, then:

  α_k = 1/L     =⇒  E[(1/k) ∑_{j=1}^k ‖∇f(w_j)‖_2^2] ≤ M

  α_k = O(1/k)  =⇒  E[∑_{j=1}^k α_j ‖∇f(w_j)‖_2^2] < ∞.

If, in addition, f is c-strongly convex, then:

  α_k = 1/L     =⇒  E[f(w_k) − f_*] ≤ O((αL)(M/c)/2)

  α_k = O(1/k)  =⇒  E[f(w_k) − f_*] = O((L/c)(M/c)/k).

(*Assumed unbiased gradient estimates; see paper for more generality.)


Why O(1/k)?

Mathematically:

  ∑_{k=1}^∞ α_k = ∞   while   ∑_{k=1}^∞ α_k^2 < ∞

Graphically (sequential version of constant stepsize result):
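A toy numeric stand-in for that comparison (entirely our own construction, not the talk's figure): on f(w) = (1/2) w^2 with noisy gradients, a fixed stepsize stalls at a noise floor, while α_k = O(1/k) keeps creeping toward the minimizer.

import numpy as np

def run_sg(stepsize, iters=5000, seed=1):
    # minimize f(w) = 0.5 * w^2 using g_k = w_k + noise (unbiased, bounded variance)
    rng = np.random.default_rng(seed)
    w = 5.0
    for k in range(iters):
        w -= stepsize(k) * (w + rng.normal())
    return w

print(run_sg(lambda k: 0.1))             # settles into a noise floor around the minimizer w = 0
print(run_sg(lambda k: 1.0 / (k + 10)))  # typically much closer to w = 0 after the same number of iterations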


SG illustration

Figure: SG with fixed stepsize (left) vs. diminishing stepsizes (right)



Why SG over GD for large-scale machine learning?


  GD: E[f_n(w_k) − f_{n,*}] = O(ρ^k)    linear convergence
  SG: E[f_n(w_k) − f_{n,*}] = O(1/k)    sublinear convergence

So why SG?

  Motivation     Explanation
  Intuitive      data “redundancy”
  Empirical      SG vs. L-BFGS with batch gradient (below)
  Theoretical    E[f_n(w_k) − f_{n,*}] = O(1/k) and E[f(w_k) − f_*] = O(1/k)

[Figure: empirical risk vs. number of accessed data points (up to 4 × 10^5); the SGD curve drops well below the batch L-BFGS curve]


Work complexity
Time, not data, as limiting factor; Bottou, Bousquet (2008) and Bottou (2010).
        Convergence rate                      Time per iteration    Time for ε-optimality
  GD:   E[f_n(w_k) − f_{n,*}] = O(ρ^k)        O(n)                  n log(1/ε)
  SG:   E[f_n(w_k) − f_{n,*}] = O(1/k)        O(1)                  1/ε

Considering the total (estimation + optimization) error

  E = E[f(w_n) − f(w_*)] + E[f(w̃_n) − f(w_n)] ∼ 1/n + ε

and a time budget T, one finds:


- SG: Process as many samples as possible (n ∼ T), leading to

    E ∼ 1/T.

- GD: With n ∼ T / log(1/ε), minimizing E yields ε ∼ 1/T and

    E ∼ log(T)/T + 1/T   (see the numeric sketch below).
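A back-of-the-envelope numeric version of this comparison (our own illustration; all constants are set to 1):

import numpy as np

T = 1.0e6                            # time budget, measured in per-sample gradient evaluations

sg_error = 1.0 / T                   # SG: process n ~ T samples, so E ~ 1/n ~ 1/T

eps = 1.0 / T                        # GD: minimizing 1/n + eps with n ~ T / log(1/eps) gives eps ~ 1/T
gd_error = np.log(T) / T + eps       # E ~ log(T)/T + 1/T

print(sg_error, gd_error)            # 1e-06 vs roughly 1.5e-05: SG wins under the fixed time budget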



End of the story?

SG is great! Let’s keep proving how great it is!


- SG is “stable with respect to inputs”
- SG avoids “steep minima”
- SG avoids “saddle points”
- . . . (many more)

No, we should want more. . .

- SG requires a lot of “hyperparameter” tuning
- Sublinear convergence is not satisfactory
  - . . . a “linearly” convergent method eventually wins
  - . . . with a higher budget, faster computation, parallel?, distributed?

Also, any “gradient”-based method is not scale invariant.
