
RECENT PROGRESS IN THE THEORY OF DEEP LEARNING

Tengyu Ma
Facebook AI Research/Stanford
How can we use mathematical thinking to
formalize, understand, and improve deep learning?
Outline
[Outline diagram: Modeling/Architecture and Statistics/Optimization/Generalization, each connected to Supervised Learning, Unsupervised Learning, and Reinforcement Learning]
Part I: Optimization in Deep Learning

Obvious questions:
How can we provide guarantees on the runtime and
the quality of the solutions?
How do we optimize faster?

Obstacle: the lack of convexity

[Figure: a non-convex loss surface with the global minimum marked]
Problem-Agnostic/Blackbox Optimization

[Diagram: given a function that is Lipschitz and smooth, a blackbox optimizer returns an approximate stationary point / local minimum, and works for all possible functions in polynomial time]

 Finding a global minimum is NP-hard; an approximate stationary point does not suffice
 Still too little structure for non-convex functions!
 Clean interface: optimizers don't need to understand the application
 Strong positive and negative results (but still an improvable gap)
 Doesn't leverage the structure of the particular problem

[Carmon-Duchi-Hinder-Sidford’17a,b,18]
Optimization for Concrete Problems

[Diagram: an ML problem (input/output pairs, model parameter, loss) fed to an algorithm that returns a global minimum of the loss]

 The algorithm only needs to work for the particular problem

 Trading generality for more ambitious goals: global minimum and faster runtime
 Encourages leveraging the structure of the problem
 Even the same algorithm may have a better analysis
Example: Linearized One-hidden-layer Neural Nets
 The activation is the identity function in this slide

[Diagram: input x → hidden layer with weights W₁ → output with weights W₂]

 Loss for example (x, y): ‖y − W₂W₁x‖²
 Loss = sum of the losses of all training examples

Theorem: Assume identity activation and mean-squared loss; then stochastic gradient descent with random initialization converges to a global minimum (with exponential decay of the error).

 ≈ solving PCA with SGD; but such a strong result is not possible without using the structure of the loss.
 Caveat: the statement is false when the activation is ReLU, sigmoid, or tanh (more later)

[Baldi-Hornik’89, Du-Lee-Hu’18, c.f. Li-M.-Zhang’18]
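A minimal numerical sketch of the setting above (not from the talk): SGD on a two-layer linear net with mean-squared loss. The dimensions, learning rate, and step count are illustrative assumptions; with a hidden width smaller than the input dimension, the global minimum corresponds to the best rank-k fit, which is the PCA connection mentioned above.

```python
# Sketch: SGD on a linearized one-hidden-layer net, y_hat = W2 @ W1 @ x,
# with mean-squared loss.  Hyperparameters are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 20, 5, 1000                      # input dim, hidden width, #examples
A = rng.normal(size=(d, d))
X = rng.normal(size=(n, d))
Y = X @ A.T                                 # targets from a linear "teacher"

W1 = rng.normal(scale=0.1, size=(k, d))     # random initialization
W2 = rng.normal(scale=0.1, size=(d, k))
lr = 1e-3

for step in range(20000):
    i = rng.integers(n)
    x, y = X[i], Y[i]
    r = W2 @ (W1 @ x) - y                   # residual of this example
    # gradients of 0.5 * ||W2 W1 x - y||^2
    gW2 = np.outer(r, W1 @ x)
    gW1 = np.outer(W2.T @ r, x)
    W2 -= lr * gW2
    W1 -= lr * gW1

# since k < d, the loss converges toward the best rank-k fit (the PCA solution)
print("final loss:", 0.5 * np.mean(np.sum((X @ (W2 @ W1).T - Y) ** 2, axis=1)))
```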


What structures of functions beyond convexity can we leverage in optimization?

[Figure: a non-convex surface with a saddle point]

One candidate structure:
1. All local minima are global minima
2. All saddle points are not flat (have a strictly negative curvature)

Theorem: If the objective satisfies 1 & 2, stochastic gradient descent (and various other algorithms) converges to a global minimum in poly time.
 Key: a saddle point is not stable in the presence of noise
 Conjecture: the losses of deep neural nets have similar properties [Choromanska-Henaff-Mathieu-BenArous-LeCun'15]

See [Ge-Jin-Huang-Yuan'15, Lee-Simchowitz-Jordan-Recht'16, Sun-Qu-Wright'16, Agarwal-AllenZhu-Bullins-Hazan-M.'16, Carmon-Duchi-Hinder-Sidford'16] and references therein
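A toy illustration (my own, not from the slides) of the key point that noise destabilizes saddle points: plain gradient descent started exactly at a strict saddle stays put, while a small amount of noise lets it escape to a global minimum. The function and hyperparameters are illustrative assumptions.

```python
# f(x, y) = 0.25*x^4 - 0.5*x^2 + 0.5*y^2 has a strict saddle at (0, 0)
# and global minima at (+-1, 0).
import numpy as np

def grad(p):
    x, y = p
    return np.array([x**3 - x, y])

rng = np.random.default_rng(1)
p_gd = np.zeros(2)        # plain GD started exactly at the saddle: stuck
p_noisy = np.zeros(2)     # GD + small noise: escapes toward a global minimum
lr, sigma = 0.1, 1e-3

for _ in range(2000):
    p_gd -= lr * grad(p_gd)
    p_noisy -= lr * (grad(p_noisy) + sigma * rng.normal(size=2))

print("plain GD stays at   ", p_gd)        # exactly (0, 0)
print("noisy GD escapes to ", p_noisy)     # approximately (+-1, 0)
```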
Leveraging the Optimization Landscape

[Diagram: an ML problem (input/output pairs, model parameter, loss) is passed through an analysis of the optimization landscape; general optimizers (with customization) then return a global minimum]

 A plethora of results on matrix-factorization-based problems and tensor problems
 Extensions to shallow or linearized neural nets
Optimization Landscape for Statistical Problems: A Partial List of References
 Isotonic regression (≈ NN w/o hidden layer) [Kakade-Kalai-Kanade-Shamir'11]
 Matrix completion/matrix sensing (≈ linearized NN w/ 1 hidden layer) [Ge-Lee-M.'16, Bhojanapalli-Neyshabur-Srebro'16, Li-M.-Zhang'18, Jain-Netrapalli-Sanghavi'13, …]
 Phase retrieval [Chen-Chi-Fan-Ma'18, Chen-Candes'15,16, …]
 Tensor decomposition [Ge-Jin-Huang-Yuan'15, Ge-M.'17, Sun-Qu-Wright'17, …]
 Linearized neural nets [Baldi-Hornik'89, Kawaguchi'16, Hardt-M.'17, Hardt-M.-Recht'17, …]
 One-hidden-layer neural nets with non-overlapping hidden units [Tian'17, Brutzkus-Globerson'17]

A few of these customize the algorithms or analyses to give stronger run-time or sample-complexity guarantees.
Landscape Design by Changing the Models

[Diagram: an ML problem (input/output pairs, model parameter, loss). What if the landscape is not great? Landscape design: a new parameterization, model, or loss; general optimizers then produce a prediction model with good test error]

Image credit: [Li-Xu-Taylor-Studer-Goldstein'17]


Example 1: Residual Neural Nets

 Same expressivity (up to 2X more parameters)
 Very different training curves for deep models

Feed-forward neural nets: hᵢ = relu(Aᵢ hᵢ₋₁)
Residual neural nets: hᵢ = hᵢ₋₁ + relu(Aᵢ hᵢ₋₁)

[Diagram: two 4-layer networks with x = h₀ and ŷ = h₄; the residual net adds an identity skip connection around each layer]
[He-Zhang-Ren-Sun’2016]
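A minimal sketch (illustrative, with made-up dimensions) of the two parameterizations above; the only change from the feed-forward block to the residual block is the identity skip connection.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def feedforward(x, weights):
    h = x
    for A in weights:
        h = relu(A @ h)              # h_i = relu(A_i h_{i-1})
    return h

def residual(x, weights):
    h = x
    for A in weights:
        h = h + relu(A @ h)          # h_i = h_{i-1} + relu(A_i h_{i-1})
    return h

d = 8
rng = np.random.default_rng(0)
weights = [rng.normal(scale=0.1, size=(d, d)) for _ in range(4)]
x = rng.normal(size=d)
print(feedforward(x, weights))
print(residual(x, weights))
```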
Residual Connections Improve the Landscape!
 A very simplified model for first-cut understanding: linearized deep nets

Feed-forward (linear) nets: flat saddle points (with zero gradient, Hessian, and higher-order derivatives) [Kawaguchi'16]

Residual (linear) nets, Theorem [Hardt-M., ICLR'17]:
1. All stationary points with small norm are global minima
2. There exists a global minimum with small norm

[Diagram: the 4-layer feed-forward and residual nets from the previous slide]

Open problem: landscape properties of resnets with non-linear activations?
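A numeric sketch of the contrast above for linearized nets (my own toy check, not from the paper): at the all-zero weights, a deep linear feed-forward net has exactly zero gradient (a flat saddle), while the residual parameterization has a nonzero gradient there. Dimensions and data are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d, depth = 5, 4
x = rng.normal(size=d)
y = rng.normal(size=d)

def loss_ff(As):                       # y_hat = A_depth ... A_1 x
    h = x
    for A in As:
        h = A @ h
    return 0.5 * np.sum((h - y) ** 2)

def loss_res(As):                      # y_hat = (I + A_depth) ... (I + A_1) x
    h = x
    for A in As:
        h = h + A @ h
    return 0.5 * np.sum((h - y) ** 2)

def num_grad_norm(loss, eps=1e-5):
    """Norm of the numerical gradient at the all-zero weights."""
    As = [np.zeros((d, d)) for _ in range(depth)]
    g = 0.0
    for A in As:
        for i in range(d):
            for j in range(d):
                A[i, j] = eps
                up = loss(As)
                A[i, j] = -eps
                lo = loss(As)
                A[i, j] = 0.0
                g += ((up - lo) / (2 * eps)) ** 2
    return np.sqrt(g)

print("feed-forward grad norm at 0:", num_grad_norm(loss_ff))   # ~ 0
print("residual     grad norm at 0:", num_grad_norm(loss_res))  # clearly nonzero
```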
Ex. 2: Over-parameterization Improves the Landscape (Empirically)
 Synthetic experiments (a teacher-student setup):
 A large amount of data generated by a fixed two-layer neural net with 100 hidden units
 Loss = squared loss (the global minimum has error 0)

[Figure: training/test error curves. Training a student with 100 hidden units (dim = 100) gets stuck at a bad local minimum; training with 400 hidden units (dim = 400) reaches a global minimum; easier!]
Experiment first performed by [LeCun, Livni-ShalevShwartz-Shamir’14]
See also [Ge-Lee-M.’17, Safran-Shamir’17]
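A sketch of this teacher-student experiment (dimensions, learning rate, and step counts are illustrative assumptions, not the exact settings from the slides): labels come from a fixed two-layer ReLU teacher with 100 hidden units, and students of width 100 and 400 are trained on the squared loss with SGD; the wider student typically reaches a lower training error.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k_teacher, n = 50, 100, 20000

def init(k):
    return (rng.normal(scale=1/np.sqrt(d), size=(k, d)),
            rng.normal(scale=1/np.sqrt(k), size=(1, k)))

def forward(W1, W2, X):
    H = np.maximum(X @ W1.T, 0.0)              # ReLU hidden layer
    return H @ W2.T, H

W1_t, W2_t = init(k_teacher)                   # fixed teacher
X = rng.normal(size=(n, d))
Y, _ = forward(W1_t, W2_t, X)

def train(k, steps=30000, lr=1e-2, batch=64):
    W1, W2 = init(k)
    for _ in range(steps):
        idx = rng.integers(n, size=batch)
        x, y = X[idx], Y[idx]
        pred, H = forward(W1, W2, x)
        r = (pred - y) / batch                 # gradient of 0.5 * mean squared error
        gW2 = r.T @ H
        gW1 = ((r @ W2) * (H > 0)).T @ x
        W1 -= lr * gW1
        W2 -= lr * gW2
    return np.mean((forward(W1, W2, X)[0] - Y) ** 2)

print("train error, 100 hidden units:", train(100))
print("train error, 400 hidden units:", train(400))   # typically lower
```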
Analysis of Over-parameterization: Inspiring Partial Results
 Setup: one hidden layer with weights W₁, W₂; input dimension d; k hidden units; n data points

Theorem [Soudry-Carmon'16]: With leaky-relu activation and squared loss, in a sufficiently over-parameterized regime, all the differentiable stationary points are global minima.
 Caveat: optimizers do converge to non-differentiable points

Theorem [Du-Lee'18]: With quadratic activation and squared loss, if the hidden layer is wide enough and the second-layer weights are fixed, all the local minima are global minima.

 Both papers assume very little about the data distribution and the labels
 Can these models generalize? Yes! (more later)
 Over-parameterized linearized recurrent neural nets [Hardt-M.-Recht'17]
Ex. 3: Landscape Design by Defining a New Loss Function
 Same one-hidden-layer setup as the previous slide, with additional distributional assumptions
 Recall that the squared loss has bad local minima

 [Ge-Lee-M.'17] designs an alternative objective function:
 the new objective has the same global minima as the standard squared loss
 all local minima of the new objective are global
 theoretical guarantees for ReLU activation and no over-parametrization
 the new objective is complicated, and generalization empirically requires a larger number of samples
Summary on Optimization
1. Fast provable non-convex optimization is possible, if the optimization
landscape has particular properties (e.g., all local minima are global)
2. Landscape can be reshaped or designed by changing the model
architecture and loss function

[Diagram: choice of architecture → optimization landscape → optimization]
Part II: Generalization Theory in Deep Learning
Why can over-parameterized neural nets generalize?
[Outline diagram (repeated)]
“Overfitting” (Textbook Version)

[Figure: the classical curve of prediction/generalization error vs. model complexity, for a fixed amount of data]

 Generalization theory via uniform convergence: for every parameter in the model class,
   |test loss − training loss| ≲ sqrt(complexity of the model class / # data)
 Trivial bound: complexity ≤ # parameters

Image credit: gluon.mxnet.io


Real-life Deep Learning: Network Size is Not the Right Complexity Measure

[Figure: training and test error vs. network size; 3-layer neural nets on MNIST [Neyshabur-Tomioka-Srebro'2015] (similar results on CIFAR)]
What Complexity Measures are Informative?
 We often add norm-based regularization, but the norm doesn't correlate with the generalization error
 Normalized margin [Bartlett et al.'17, Neyshabur et al.'17]:
   C ≜ product of the norms of the weights / margin
 Flat minima tend to generalize better [Hinton-Camp'93, Hochreiter-Schmidhuber, Keskar et al.'16, Dinh et al.'17]
 Compression-based complexity [Arora et al.'18]:
   C ≜ minimum compression of the model ≤ noise stability
 PAC-Bayes bounds (for stochastic neural nets) [Dziugaite-Roy'18]:
   C ≜ KL(P ‖ Q), where P is the estimated stochastic NN and Q is a prior stochastic NN
 Post-mortem analyses: they don't explain why low-complexity solutions are obtained in the first place
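A small sketch (my own illustrative proxy, not the exact quantity from [Bartlett et al.'17]) of a margin-based complexity measure in the spirit above: the product of layer spectral norms divided by a typical classification margin of a ReLU net.

```python
import numpy as np

def normalized_margin_proxy(weights, X, y):
    """weights: list of matrices of a ReLU net; X: (n, d); y: (n,) int labels."""
    H = X
    for A in weights[:-1]:
        H = np.maximum(H @ A.T, 0.0)             # ReLU hidden layers
    scores = H @ weights[-1].T                    # (n, num_classes)
    correct = scores[np.arange(len(y)), y]
    others = scores.copy()
    others[np.arange(len(y)), y] = -np.inf
    margin = np.median(correct - others.max(axis=1))          # a "typical" margin
    prod_norms = np.prod([np.linalg.norm(A, 2) for A in weights])  # spectral norms
    return prod_norms / max(margin, 1e-12)

# tiny usage example on random data; labels are the net's own argmax so that
# margins are positive by construction
rng = np.random.default_rng(0)
W = [rng.normal(size=(32, 10)), rng.normal(size=(32, 32)), rng.normal(size=(3, 32))]
X = rng.normal(size=(200, 10))
H = np.maximum(np.maximum(X @ W[0].T, 0.0) @ W[1].T, 0.0)
y = (H @ W[2].T).argmax(axis=1)
print(normalized_margin_proxy(W, X, y))
```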


Algorithms Matter in Generalization Theory of Deep Learning

 Same objective, regularization, and training error ⇏ same test error
 For strongly convex objectives: a unique minimizer, the same for all optimizers
 Faster algorithms don't necessarily mean better/faster generalization
 Mysteries in generalization also hamper the study of optimization

Systematic experiments in [Keskar et al.'16, Wilson-Roelofs-Stern-Srebro-Recht'17]


Algorithmic Regularization

Algorithms matter; the intrinsic complexity of the data matters.

Hypothesis: stochastic gradient descent, with proper initialization and learning rate, prefers an optimal solution with low complexity, when one exists.

Two missing parts:
 A definition of complexity
 A proof that the algorithm converges to low-complexity solutions
Algorithmic Regularization: Recent Progress
 Linear models:
   Folklore: GD on linear regression converges to the minimum-norm solution (see the sketch after this list)
   Mirror descent converges to the solution with minimum Bregman divergence [Gunasekar-Lee-Soudry-Srebro'18a]
   GD on logistic regression converges to the max-margin solution [Soudry-Hoffer-Nacson-Gunasekar-Srebro'17, Ji-Telgarsky'18]
 Neural nets with linear activation:
   GD regularizes the norm in frequency space in CNNs [Gunasekar-Lee-Soudry-Srebro'18b]
 Neural nets with quadratic activations (next several slides):
   GD with small initialization converges to a minimum-rank solution [Li-M.-Zhang'18]
 Neural nets on linearly separable data [Brutzkus-Globerson-Malach-ShalevShwartz'18]
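As referenced in the list above, a minimal sketch (with made-up dimensions and learning rate) of the folklore fact: on an underdetermined least-squares problem, gradient descent initialized at zero converges to the minimum-norm interpolating solution.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100                         # fewer equations than unknowns
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

w = np.zeros(d)                        # start at the origin
lr = 1e-3
for _ in range(5000):
    w -= lr * A.T @ (A @ w - b)        # gradient of 0.5 * ||Aw - b||^2

w_min_norm = A.T @ np.linalg.solve(A @ A.T, b)   # minimum-norm solution
print(np.linalg.norm(w - w_min_norm))            # ~ 0
```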
One-hidden-layer Quadratic Neural Nets With Over-Parameterization

[Diagram: input x ∈ ℝᵈ, hidden layer with weights U ∈ ℝ^{d×m} and quadratic activation, output ŷ]

 n data points
 quadratic activations
 empirical MSE loss

 With far fewer data points than parameters, can we still learn a model that generalizes?
 Impossible without additional assumptions (counterexample: the labels are random)
One-hidden-layer Quadratic Neural Nets With Over-Parameterization

[Diagram: the same network, with the labels produced by a small teacher network with weights U* ∈ ℝ^{d×r}]

 Assume a small network of size r that produces the labels

Theorem [Informal, Li-M.-Zhang'18, COLT best paper]: With a number of samples much smaller than the number of parameters, gradient descent with a small initialization returns a solution with small test error after polynomially many iterations.

 With a small enough initialization, the generalization error is negligible
 Stochasticity and early stopping are not necessary (in theory and practice)
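A rough simulation sketch of this setting (all sizes, the initialization scale, and the step count are illustrative assumptions): labels come from a rank-r quadratic teacher, and gradient descent on an over-parameterized quadratic net from a tiny initialization ends up with a weight matrix whose top r singular values dominate, i.e., an approximately low-rank, generalizing solution.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, m, n = 30, 2, 30, 300             # input dim, teacher rank, student width, #samples

U_star = rng.normal(size=(d, r))
X = rng.normal(size=(n, d))
y = np.sum((X @ U_star) ** 2, axis=1)   # labels from the rank-r quadratic teacher

tau = 1e-4                               # small initialization scale
U = tau * rng.normal(size=(d, m))
lr = 1e-3

for _ in range(3000):
    pred = np.sum((X @ U) ** 2, axis=1)
    r_err = (pred - y) / n
    # gradient of 0.5 * mean (pred - y)^2 with respect to U
    grad = 2 * X.T @ (r_err[:, None] * (X @ U))
    U -= lr * grad

# the learned U U^T is approximately rank r: the top-2 singular values dominate
print("top singular values of U:", np.round(np.linalg.svd(U, compute_uv=False)[:5], 3))
```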
Simulations
 Input dimension d
 Generate labels with a network of small hidden layer size
 Train a network with many more parameters
 Use a number of samples much smaller than the number of parameters
Key Intuition: Gradient Descent Prefers Low-complexity Solutions
 Def. of complexity measure: if a network is approximately equivalent to another network with r hidden units, its intrinsic complexity is r
 For quadratic NNs: complexity = approximate rank of the weight matrix

[Figure: nested sets S₁ ⊂ ⋯ ⊂ S_r ⊂ ⋯ ⊂ S_d of solutions of increasing complexity; the generalizable global minima of the training loss lie in the low-complexity set S_r, the non-generalizable global minima in the high-complexity set S_d, and gradient descent starts near 0]
Weight Matrix Has Low Complexity Throughout the Training
 Very similar setup to the previous slide
 Generate labels with a small hidden layer size
Summary on Generalization Theory
 Informative complexity measures of deep nets
 Algorithms have an implicit regularization effect to minimize the
complexity

Another related open direction:


 Understanding the (statistical) structure of the data?

[Diagram: choice of architecture → statistics/optimization → generalization]
Part III: Statistical Theory of Generative
Adversarial Networks (GANs)

[Outline diagram (repeated)]
Next few slides

Statistical problems:
 What distributions do GANs learn with finite samples?
 How do we measure the quality of the generated images?

Computational problem:
 How do we train GANs faster and better?
GANs: Learning to Sample from Samples

[Diagram: random input z → generator, a neural net with parameters θ → generated sample]
Loss Function Based on Samples

 P: distribution of real images; P_θ: distribution of generated images
 P̂: distribution over empirical samples from P; P̂_θ: distribution over empirical samples from P_θ

 Goal: find θ such that P_θ ≈ P
 Note: P_θ may not have a density, or the density is intractable
 GANs: define a loss function that can be evaluated with only samples,

   min_θ L(θ) ≔ d(P̂_θ, P̂)

 where d is some “distance measure”
Loss Function Based on Samples (Cont’d)

   min_θ L(θ) ≔ d(P̂_θ, P̂)

 Discriminator family ℱ = {neural nets}
 GANs [Goodfellow et al.'14]:
   d = max-likelihood loss for classifying real vs. fake samples using discriminators in ℱ
 W-GANs [Arjovsky et al.'16]:
   d(P̂_θ, P̂) = W_ℱ(P̂_θ, P̂) ≔ max_{f∈ℱ} |E_{P̂_θ}[f(X)] − E_{P̂}[f(X)]|
   the ℱ-integral probability metric (IPM) between P̂_θ and P̂

 If ℱ = {all 1-Lipschitz functions}, W_ℱ is the Wasserstein distance
 (W_ℱ is often much weaker than the Wasserstein distance)
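A minimal sketch of the ℱ-IPM idea for the simplest possible discriminator class, linear functions with bounded norm (an assumption made purely for illustration; GAN/W-GAN discriminators are neural nets). For this class the maximization has a closed form, and the example also hints at the weakness of small discriminator classes discussed on the next slides.

```python
import numpy as np

def linear_ipm(samples_p, samples_q):
    """max_{||w|| <= 1}  E_p[<w, X>] - E_q[<w, X>]  =  ||mean_p - mean_q||."""
    diff = samples_p.mean(axis=0) - samples_q.mean(axis=0)
    return np.linalg.norm(diff)

rng = np.random.default_rng(0)
real = rng.normal(loc=1.0, size=(1000, 5))
fake = rng.normal(loc=0.0, size=(1000, 5))
print(linear_ipm(real, fake))           # ~ sqrt(5): the means differ by 1 per coordinate

# such a weak discriminator class cannot distinguish distributions with equal means
same_mean_fake = rng.standard_t(df=3, size=(1000, 5))
print(linear_ipm(real - 1.0, same_mean_fake))   # small, despite different distributions
```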
What distributions do W-GANs learn?

 If the training succeeds, P̂_θ is close to P̂:

   W_ℱ(P̂_θ, P̂) ≤ ε   ⟹?   W(P_θ, P) ≤ g(ε)
   “training loss”, empirical distance, weak discriminators   vs.   “test loss”, population distance, strong discriminators

 Two ingredients needed: generalization (of discriminators) and approximability/distinguishability
A Pessimistic Dilemma
Generalization (of discriminators) vs. approximability/distinguishability

 If ℱ = {NNs of bounded size}: good generalization but poor approximability. There exist generated distributions supported on only a small number of images that discriminators in ℱ cannot distinguish from the real distribution, i.e., poor diversity.
 Small discriminators cannot detect mode collapse
 If ℱ = {all 1-Lipschitz functions}: the reverse problem, poor generalization
[Arora-Ge-Liang-M.-Zhang, ICML17]
Beyond the Dilemma: Restricted Approximability
Generalization (of discriminators) vs. approximability/distinguishability

 We only need: for a particular generator class, design a corresponding parameterized discriminator class ℱ with restricted approximability, i.e., ℱ generalizes while still being strong enough to distinguish distributions within the generator class.

Generator classes with restricted approximability:
 Gaussians, mixtures of Gaussians, and exponential families
 Invertible neural network generators (next slide)

[Bai-M.-Risteski’18]
Restricted Approximability of Invertible Neural Nets

 Z ∈ ℝᵏ → X = G_θ(Z) ∈ ℝᵈ

 Assume G_θ is an injective function, e.g., an ℓ-layer neural net with leaky-relu activations and full-rank weight matrices.
 Define ℱ = {ℓ-layer neural nets with a special structure}. Then restricted approximability holds: the population (“test”) distance is bounded in terms of the empirical (“training”) distance, up to generalization and approximation error terms.

Open problems:
 Removing the injectivity assumption?
 Stronger approximation inequalities?
 Practical implications?

[Bai-M.-Risteski'18] (which uses tools from [Zhang-Liu-Zhou-Xu-He'17])
Part IV (Brief): Theory for Embeddings
How do we reason about representation learning?

[Outline diagram (repeated)]
[Diagram: embeddings (for words, sentences, paragraphs) → downstream tasks]

 How do we formalize representation learning?

A generative approach: (parameters, latent variable) → corpus
 Reason about the properties of the learned embeddings
 Design new embedding methods (e.g., algorithms to recover the latent variables)

 Incorporating syntactic information into the modeling?
 Other effective frameworks for understanding representation learning?

Word and sense embeddings [Arora-Li-Liang-M.-Risteski TACL'17,'18; Hashimoto-AlvarezMelis-Jaakkola'17]
Sentence embeddings [Arora-Liang-M., ICLR'17; Arora-Khodak-Saunshi-Vodrahalli, ICLR'18]
Rare words/phrases embeddings [Khodak-Saunshi-Liang-M.-Stewart-Arora, ACL'18]
Part V (Brief): Theory for Deep RL
Sample-efficient algorithms in continuous state space?

[Outline diagram (repeated)]
Theoretical Challenges in Deep
Reinforcement Learning
All theoretical challenges with deep learning
+
sequential decision making

 High-dimensional state space


 Non-linear function approximators for the policy, the dynamics, the Q-function, etc.
 Non-convex optimization
 Sample efficiency
Model-Based Deep Reinforcement Learning
 Very promising to reduce the sample complexity [Nagabandi et al.’17,
Kurutach et al.’18]
 Uses state information (instead of only rewards)

Recent theoretical progress on controlling linear dynamical models


 Sample complexity guarantees [Dean-Mania-Matni-Recht-Tu’17,18]

Non-linear dynamics:
 no previous convergence guarantees
 known issue: initial convergence is good, asymptotic convergence is
difficult [Nagabandi et al.’17]
Model-Based Deep RL with Asymptotic
Convergence Guarantees

Framework of MB RL [Xu-Li-Tian-Darrell-M.’18]
Repeat:
 maximize an analytical lower bound of the reward, based on trajectories, over both the policy and the model
 collect new samples using the current policy; recompute the lower bound

Guarantee: the reward monotonically increases to a local maximum (under the assumption that model-free policy optimization can be solved)
Model-Based Deep RL with Asymptotic
Convergence Guarantees

[Figure: reward vs. number of samples]

 The first to achieve near-optimal reward on half-cheetah using MB RL with a single dynamical model
 3-10X more sample-efficient than model-free approaches

[Xu-Li-Tian-Darrell-M.’18]
Summary
 Supervised learning: interactions between optimization, generalization,
and choices of architectures
 Unsupervised learning: theory of GANs and representation learning
 Model-based reinforcement learning: asymptotic convergence theorem

Topics I did not cover


 Security in deep learning (e.g., defending adversarial examples)
 Uncertainty quantification
 Fairness in deep learning
 Meta learning
Concluding Thoughts
 A burst of recent promising theoretical results on deep learning

 Potential to provide intuition or guidance to practice

 Back to math: motivate the development of new tools

 I am optimistic that theory can sustain or boost the progress of deep


learning and AI

Thank you!!
