
RECENT PROGRESS IN THE THEORY OF DEEP LEARNING

Tengyu Ma
Facebook AI Research/Stanford
How can we use mathematical thinking to
formalize, understand, and improve deep learning?
Outline
[Outline diagram: Modeling/Architecture and Statistics/Optimization/Generalization, each connected to Supervised Learning, Unsupervised Learning, and Reinforcement Learning]
Part I: Optimization in Deep Learning

Obvious questions:
How can we provide guarantees on the runtime and
the quality of the solutions?
How do we optimize faster?

Obstacle: the lack of convexity

[Figure: a non-convex loss surface with the global minimum marked]
Problem-Agnostic/Blackbox Optimization

[Diagram: given a function that is Lipschitz and smooth, a blackbox optimizer returns an approximate stationary point / local minimum, and works for all possible functions in polynomial time]

 Finding a global minimum is NP-hard; an approximate stationary point does not suffice
 Still too little structure for non-convex functions!
 Clean interface: optimizers don't need to understand the application
 Strong positive and negative results (but still an improvable gap)
 Doesn't leverage the structure of the particular problem

[Carmon-Duchi-Hinder-Sidford’17a,b,18]
Optimization for Concrete Problems

[Diagram: an ML problem (input/output pairs, model parameter, loss) fed to an algorithm that returns a global minimum of the loss]

 The algorithm only needs to work for the particular problem

 Trading generality for more ambitious goals: global minimum and faster runtime
 Encourages leveraging the structure of the problem
 Even the same algorithm may have a better analysis
Example: Linearized One-hidden-layer Neural Nets
 The activation is the identity function in this slide

[Diagram: input x → hidden layer with weights W₁ → output with weights W₂]

 Loss for example (x, y): ‖y − W₂W₁x‖²
 Loss = sum of the losses of all training examples

Theorem: Assume identity activation and mean-squared loss; then stochastic gradient descent with random initialization converges to a global minimum (with exponential decay of the error).

 ≈ solving PCA with SGD; but such a strong result is not possible without using the structure of the loss.
 Caveat: the statement is false when the activation is ReLU, sigmoid, or tanh (more later)

[Baldi-Hornik’89, Du-Lee-Hu’18, c.f. Li-M.-Zhang’18]
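A minimal numerical sketch of the setting above (not from the talk): SGD on a two-layer linear net with mean-squared loss. The dimensions, learning rate, and step count are illustrative assumptions; with a hidden width smaller than the input dimension, the global minimum corresponds to the best rank-k fit, which is the PCA connection mentioned above.

```python
# Sketch: SGD on a linearized one-hidden-layer net, y_hat = W2 @ W1 @ x,
# with mean-squared loss.  Hyperparameters are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 20, 5, 1000                      # input dim, hidden width, #examples
A = rng.normal(size=(d, d))
X = rng.normal(size=(n, d))
Y = X @ A.T                                 # targets from a linear "teacher"

W1 = rng.normal(scale=0.1, size=(k, d))     # random initialization
W2 = rng.normal(scale=0.1, size=(d, k))
lr = 1e-3

for step in range(20000):
    i = rng.integers(n)
    x, y = X[i], Y[i]
    r = W2 @ (W1 @ x) - y                   # residual of this example
    # gradients of 0.5 * ||W2 W1 x - y||^2
    gW2 = np.outer(r, W1 @ x)
    gW1 = np.outer(W2.T @ r, x)
    W2 -= lr * gW2
    W1 -= lr * gW1

# since k < d, the loss converges toward the best rank-k fit (the PCA solution)
print("final loss:", 0.5 * np.mean(np.sum((X @ (W2 @ W1).T - Y) ** 2, axis=1)))
```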


What structures of functions beyond convexity can we leverage in optimization?

[Figure: a non-convex surface with a saddle point]

One candidate structure:
1. All local minima are global minima
2. All saddle points are not flat (have a strictly negative curvature)

Theorem: If the objective satisfies 1 & 2, stochastic gradient descent (and various other algorithms) converges to a global minimum in poly time.
 Key: a saddle point is not stable in the presence of noise
 Conjecture: the losses of deep neural nets have similar properties [Choromanska-Henaff-Mathieu-BenArous-LeCun'15]

See [Ge-Jin-Huang-Yuan'15, Lee-Simchowitz-Jordan-Recht'16, Sun-Qu-Wright'16, Agarwal-AllenZhu-Bullins-Hazan-M.'16, Carmon-Duchi-Hinder-Sidford'16] and references therein
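A toy illustration (my own, not from the slides) of the key point that noise destabilizes saddle points: plain gradient descent started exactly at a strict saddle stays put, while a small amount of noise lets it escape to a global minimum. The function and hyperparameters are illustrative assumptions.

```python
# f(x, y) = 0.25*x^4 - 0.5*x^2 + 0.5*y^2 has a strict saddle at (0, 0)
# and global minima at (+-1, 0).
import numpy as np

def grad(p):
    x, y = p
    return np.array([x**3 - x, y])

rng = np.random.default_rng(1)
p_gd = np.zeros(2)        # plain GD started exactly at the saddle: stuck
p_noisy = np.zeros(2)     # GD + small noise: escapes toward a global minimum
lr, sigma = 0.1, 1e-3

for _ in range(2000):
    p_gd -= lr * grad(p_gd)
    p_noisy -= lr * (grad(p_noisy) + sigma * rng.normal(size=2))

print("plain GD stays at   ", p_gd)        # exactly (0, 0)
print("noisy GD escapes to ", p_noisy)     # approximately (+-1, 0)
```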
Leveraging the Optimization Landscape

[Diagram: an ML problem (input/output pairs, model parameter, loss) is passed through an analysis of the optimization landscape; general optimizers (with customization) then return a global minimum]

 A plethora of results on matrix-factorization-based problems and tensor problems
 Extensions to shallow or linearized neural nets
Optimization Landscape for Statistical Problems: A Partial List of References
 Isotonic regression (≈ NN w/o hidden layer) [Kakade-Kalai-Kanade-Shamir'11]
 Matrix completion/matrix sensing (≈ linearized NN w/ 1 hidden layer) [Ge-Lee-M.'16, Bhojanapalli-Neyshabur-Srebro'16, Li-M.-Zhang'18, Jain-Netrapalli-Sanghavi'13, …]
 Phase retrieval [Chen-Chi-Fan-Ma'18, Chen-Candes'15,16, …]
 Tensor decomposition [Ge-Jin-Huang-Yuan'15, Ge-M.'17, Sun-Qu-Wright'17, …]
 Linearized neural nets [Baldi-Hornik'89, Kawaguchi'16, Hardt-M.'17, Hardt-M.-Recht'17, …]
 One-hidden-layer neural nets with non-overlapping hidden units [Tian'17, Brutzkus-Globerson'17]

A few of these customize the algorithms or analyses to give stronger run-time or sample-complexity guarantees.
Landscape Design by Changing the Models

[Diagram: an ML problem (input/output pairs, model parameter, loss). What if the landscape is not great? Landscape design: a new parameterization, model, or loss; general optimizers then produce a prediction model with good test error]

Image credit: [Li-Xu-Taylor-Studer-Goldstein'17]


Example 1: Residual Neural Nets

 Same expressivity (up to 2X more parameters)
 Very different training curves for deep models

Feed-forward neural nets: hᵢ = relu(Aᵢ hᵢ₋₁)
Residual neural nets: hᵢ = hᵢ₋₁ + relu(Aᵢ hᵢ₋₁)

[Diagram: two 4-layer networks with x = h₀ and ŷ = h₄; the residual net adds an identity skip connection around each layer]
[He-Zhang-Ren-Sun’2016]
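A minimal sketch (illustrative, with made-up dimensions) of the two parameterizations above; the only change from the feed-forward block to the residual block is the identity skip connection.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def feedforward(x, weights):
    h = x
    for A in weights:
        h = relu(A @ h)              # h_i = relu(A_i h_{i-1})
    return h

def residual(x, weights):
    h = x
    for A in weights:
        h = h + relu(A @ h)          # h_i = h_{i-1} + relu(A_i h_{i-1})
    return h

d = 8
rng = np.random.default_rng(0)
weights = [rng.normal(scale=0.1, size=(d, d)) for _ in range(4)]
x = rng.normal(size=d)
print(feedforward(x, weights))
print(residual(x, weights))
```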
Residual Connections Improve the Landscape!
 A very simplified model for first-cut understanding: linearized deep nets

Feed-forward (linear) nets: flat saddle points (with zero gradient, Hessian, and higher-order derivatives) [Kawaguchi'16]

Residual (linear) nets, Theorem [Hardt-M., ICLR'17]:
1. All stationary points with small norm are global minima
2. There exists a global minimum with small norm

[Diagram: the 4-layer feed-forward and residual nets from the previous slide]

Open problem: landscape properties of resnets with non-linear activations?
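A numeric sketch of the contrast above for linearized nets (my own toy check, not from the paper): at the all-zero weights, a deep linear feed-forward net has exactly zero gradient (a flat saddle), while the residual parameterization has a nonzero gradient there. Dimensions and data are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d, depth = 5, 4
x = rng.normal(size=d)
y = rng.normal(size=d)

def loss_ff(As):                       # y_hat = A_depth ... A_1 x
    h = x
    for A in As:
        h = A @ h
    return 0.5 * np.sum((h - y) ** 2)

def loss_res(As):                      # y_hat = (I + A_depth) ... (I + A_1) x
    h = x
    for A in As:
        h = h + A @ h
    return 0.5 * np.sum((h - y) ** 2)

def num_grad_norm(loss, eps=1e-5):
    """Norm of the numerical gradient at the all-zero weights."""
    As = [np.zeros((d, d)) for _ in range(depth)]
    g = 0.0
    for A in As:
        for i in range(d):
            for j in range(d):
                A[i, j] = eps
                up = loss(As)
                A[i, j] = -eps
                lo = loss(As)
                A[i, j] = 0.0
                g += ((up - lo) / (2 * eps)) ** 2
    return np.sqrt(g)

print("feed-forward grad norm at 0:", num_grad_norm(loss_ff))   # ~ 0
print("residual     grad norm at 0:", num_grad_norm(loss_res))  # clearly nonzero
```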
Ex. 2: Over-parameterization Improves the Landscape (Empirically)
 Synthetic experiments (a teacher-student setup):
 A large amount of data generated by a fixed two-layer neural net with 100 hidden units
 Loss = squared loss (the global minimum has error 0)

[Figure: training/test error curves. Training a student with 100 hidden units (dim = 100) gets stuck at a bad local minimum; training with 400 hidden units (dim = 400) reaches a global minimum; easier!]
Experiment first performed by [LeCun, Livni-ShalevShwartz-Shamir’14]
See also [Ge-Lee-M.’17, Safran-Shamir’17]
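A sketch of this teacher-student experiment (dimensions, learning rate, and step counts are illustrative assumptions, not the exact settings from the slides): labels come from a fixed two-layer ReLU teacher with 100 hidden units, and students of width 100 and 400 are trained on the squared loss with SGD; the wider student typically reaches a lower training error.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k_teacher, n = 50, 100, 20000

def init(k):
    return (rng.normal(scale=1/np.sqrt(d), size=(k, d)),
            rng.normal(scale=1/np.sqrt(k), size=(1, k)))

def forward(W1, W2, X):
    H = np.maximum(X @ W1.T, 0.0)              # ReLU hidden layer
    return H @ W2.T, H

W1_t, W2_t = init(k_teacher)                   # fixed teacher
X = rng.normal(size=(n, d))
Y, _ = forward(W1_t, W2_t, X)

def train(k, steps=30000, lr=1e-2, batch=64):
    W1, W2 = init(k)
    for _ in range(steps):
        idx = rng.integers(n, size=batch)
        x, y = X[idx], Y[idx]
        pred, H = forward(W1, W2, x)
        r = (pred - y) / batch                 # gradient of 0.5 * mean squared error
        gW2 = r.T @ H
        gW1 = ((r @ W2) * (H > 0)).T @ x
        W1 -= lr * gW1
        W2 -= lr * gW2
    return np.mean((forward(W1, W2, X)[0] - Y) ** 2)

print("train error, 100 hidden units:", train(100))
print("train error, 400 hidden units:", train(400))   # typically lower
```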
Analysis of Over-parameterization: Inspiring Partial Results
 Setup: one hidden layer with weights W₁, W₂; input dimension d; k hidden units; n data points

Theorem [Soudry-Carmon'16]: With leaky-relu activation and squared loss, in a sufficiently over-parameterized regime, all the differentiable stationary points are global minima.
 Caveat: optimizers do converge to non-differentiable points

Theorem [Du-Lee'18]: With quadratic activation and squared loss, if the hidden layer is wide enough and the second-layer weights are fixed, all the local minima are global minima.

 Both papers assume very little about the data distribution and the labels
 Can these models generalize? Yes! (more later)
 Over-parameterized linearized recurrent neural nets [Hardt-M.-Recht'17]
Ex. 3: Landscape Design by Defining a New Loss Function
 Same one-hidden-layer setup as the previous slide, with additional distributional assumptions
 Recall that the squared loss has bad local minima

 [Ge-Lee-M.'17] designs an alternative objective function:
 the new objective has the same global minima as the standard squared loss
 all local minima of the new objective are global
 theoretical guarantees for ReLU activation and no over-parametrization
 the new objective is complicated, and generalization empirically requires a larger number of samples
Summary on Optimization
1. Fast provable non-convex optimization is possible, if the optimization
landscape has particular properties (e.g., all local minima are global)
2. Landscape can be reshaped or designed by changing the model
architecture and loss function

[Diagram: choice of architecture → optimization landscape → optimization]
Part II: Generalization Theory in Deep Learning
Why can over-parameterized neural nets generalize?
[Outline diagram (repeated)]
“Overfitting” (Textbook Version)

[Figure: the classical curve of prediction/generalization error vs. model complexity, for a fixed amount of data]

 Generalization theory via uniform convergence: for every parameter in the model class,
   |test loss − training loss| ≲ sqrt(complexity of the model class / # data)
 Trivial bound: complexity ≤ # parameters

Image credit: gluon.mxnet.io


Real-life Deep Learning: Network Size is Not the Right Complexity Measure

[Figure: training and test error vs. network size; 3-layer neural nets on MNIST [Neyshabur-Tomioka-Srebro'2015] (similar results on CIFAR)]
What Complexity Measures are Informative?
 We often add norm-based regularization, but the norm doesn't correlate with the generalization error
 Normalized margin [Bartlett et al.'17, Neyshabur et al.'17]:
   C ≜ product of the norms of the weights / margin
 Flat minima tend to generalize better [Hinton-Camp'93, Hochreiter-Schmidhuber, Keskar et al.'16, Dinh et al.'17]
 Compression-based complexity [Arora et al.'18]:
   C ≜ minimum compression of the model ≤ noise stability
 PAC-Bayes bounds (for stochastic neural nets) [Dziugaite-Roy'18]:
   C ≜ KL(P ‖ Q), where P is the estimated stochastic NN and Q is a prior stochastic NN
 Post-mortem analyses: they don't explain why low-complexity solutions are obtained in the first place
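A small sketch (my own illustrative proxy, not the exact quantity from [Bartlett et al.'17]) of a margin-based complexity measure in the spirit above: the product of layer spectral norms divided by a typical classification margin of a ReLU net.

```python
import numpy as np

def normalized_margin_proxy(weights, X, y):
    """weights: list of matrices of a ReLU net; X: (n, d); y: (n,) int labels."""
    H = X
    for A in weights[:-1]:
        H = np.maximum(H @ A.T, 0.0)             # ReLU hidden layers
    scores = H @ weights[-1].T                    # (n, num_classes)
    correct = scores[np.arange(len(y)), y]
    others = scores.copy()
    others[np.arange(len(y)), y] = -np.inf
    margin = np.median(correct - others.max(axis=1))          # a "typical" margin
    prod_norms = np.prod([np.linalg.norm(A, 2) for A in weights])  # spectral norms
    return prod_norms / max(margin, 1e-12)

# tiny usage example on random data; labels are the net's own argmax so that
# margins are positive by construction
rng = np.random.default_rng(0)
W = [rng.normal(size=(32, 10)), rng.normal(size=(32, 32)), rng.normal(size=(3, 32))]
X = rng.normal(size=(200, 10))
H = np.maximum(np.maximum(X @ W[0].T, 0.0) @ W[1].T, 0.0)
y = (H @ W[2].T).argmax(axis=1)
print(normalized_margin_proxy(W, X, y))
```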


Algorithms Matter in Generalization Theory of Deep Learning

 Same objective, regularization, and training error ⇏ same test error
 For strongly convex objectives: a unique minimizer, the same for all optimizers
 Faster algorithms don't necessarily mean better/faster generalization
 Mysteries in generalization also hamper the study of optimization

Systematic experiments in [Keskar et al.'16, Wilson-Roelofs-Stern-Srebro-Recht'17]


Algorithmic Regularization

Algorithms matter; the intrinsic complexity of the data matters.

Hypothesis: stochastic gradient descent, with proper initialization and learning rate, prefers an optimal solution with low complexity, when one exists.

Two missing parts:
 A definition of complexity
 A proof that the algorithm converges to low-complexity solutions
Algorithmic Regularization: Recent Progress
 Linear models:
   Folklore: GD on linear regression converges to the minimum-norm solution (see the sketch after this list)
   Mirror descent converges to the solution with minimum Bregman divergence [Gunasekar-Lee-Soudry-Srebro'18a]
   GD on logistic regression converges to the max-margin solution [Soudry-Hoffer-Nacson-Gunasekar-Srebro'17, Ji-Telgarsky'18]
 Neural nets with linear activation:
   GD regularizes the norm in frequency space in CNNs [Gunasekar-Lee-Soudry-Srebro'18b]
 Neural nets with quadratic activations (next several slides):
   GD with small initialization converges to a minimum-rank solution [Li-M.-Zhang'18]
 Neural nets on linearly separable data [Brutzkus-Globerson-Malach-ShalevShwartz'18]
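As referenced in the list above, a minimal sketch (with made-up dimensions and learning rate) of the folklore fact: on an underdetermined least-squares problem, gradient descent initialized at zero converges to the minimum-norm interpolating solution.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100                         # fewer equations than unknowns
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

w = np.zeros(d)                        # start at the origin
lr = 1e-3
for _ in range(5000):
    w -= lr * A.T @ (A @ w - b)        # gradient of 0.5 * ||Aw - b||^2

w_min_norm = A.T @ np.linalg.solve(A @ A.T, b)   # minimum-norm solution
print(np.linalg.norm(w - w_min_norm))            # ~ 0
```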
One-hidden-layer Quadratic Neural Nets With Over-Parameterization

[Diagram: input x ∈ ℝᵈ, hidden layer with weights U ∈ ℝ^{d×m} and quadratic activation, output ŷ]

 n data points
 quadratic activations
 empirical MSE loss

 With far fewer data points than parameters, can we still learn a model that generalizes?
 Impossible without additional assumptions (counterexample: the labels are random)
One-hidden-layer Quadratic Neural Nets With Over-Parameterization

[Diagram: the same network, with the labels produced by a small teacher network with weights U* ∈ ℝ^{d×r}]

 Assume a small network of size r that produces the labels

Theorem [Informal, Li-M.-Zhang'18, COLT best paper]: With a number of samples much smaller than the number of parameters, gradient descent with a small initialization returns a solution with small test error after polynomially many iterations.

 With a small enough initialization, the generalization error is negligible
 Stochasticity and early stopping are not necessary (in theory and practice)
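A rough simulation sketch of this setting (all sizes, the initialization scale, and the step count are illustrative assumptions): labels come from a rank-r quadratic teacher, and gradient descent on an over-parameterized quadratic net from a tiny initialization ends up with a weight matrix whose top r singular values dominate, i.e., an approximately low-rank, generalizing solution.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, m, n = 30, 2, 30, 300             # input dim, teacher rank, student width, #samples

U_star = rng.normal(size=(d, r))
X = rng.normal(size=(n, d))
y = np.sum((X @ U_star) ** 2, axis=1)   # labels from the rank-r quadratic teacher

tau = 1e-4                               # small initialization scale
U = tau * rng.normal(size=(d, m))
lr = 1e-3

for _ in range(3000):
    pred = np.sum((X @ U) ** 2, axis=1)
    r_err = (pred - y) / n
    # gradient of 0.5 * mean (pred - y)^2 with respect to U
    grad = 2 * X.T @ (r_err[:, None] * (X @ U))
    U -= lr * grad

# the learned U U^T is approximately rank r: the top-2 singular values dominate
print("top singular values of U:", np.round(np.linalg.svd(U, compute_uv=False)[:5], 3))
```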
Simulations
 Input dimension d
 Generate labels with a network of small hidden layer size
 Train a network with many more parameters
 Use a number of samples much smaller than the number of parameters
Key Intuition: Gradient Descent Prefers Low-complexity Solutions
 Def. of complexity measure: if a network is approximately equivalent to another network with r hidden units, its intrinsic complexity is r
 For quadratic NNs: complexity = approximate rank of the weight matrix

[Figure: nested sets S₁ ⊂ ⋯ ⊂ S_r ⊂ ⋯ ⊂ S_d of solutions of increasing complexity; the generalizable global minima of the training loss lie in the low-complexity set S_r, the non-generalizable global minima in the high-complexity set S_d, and gradient descent starts near 0]
Weight Matrix Has Low Complexity Throughout the Training
 Very similar setup to the previous slide
 Generate labels with a small hidden layer size
Summary on Generalization Theory
 Informative complexity measures of deep nets
 Algorithms have an implicit regularization effect to minimize the
complexity

Another related open direction:


 Understanding the (statistical) structure of the data?

[Diagram: choice of architecture → statistics/optimization → generalization]
Part III: Statistical Theory of Generative
Adversarial Networks (GANs)

[Outline diagram (repeated)]
Next few slides

Statistical problems:
 What distributions do GANs learn with finite samples?
 How do we measure the quality of the generated images?

Computational problem:
 How do we train GANs faster and better?
GANs: Learning to Sample from Samples

[Diagram: random input z → generator, a neural net with parameters θ → generated sample]
Loss Function Based on Samples

 P: distribution of real images; P_θ: distribution of generated images
 P̂: distribution over empirical samples from P; P̂_θ: distribution over empirical samples from P_θ

 Goal: find θ such that P_θ ≈ P
 Note: P_θ may not have a density, or the density is intractable
 GANs: define a loss function that can be evaluated with only samples,

   min_θ L(θ) ≔ d(P̂_θ, P̂)

 where d is some “distance measure”
Loss Function Based on Samples (Cont’d)

   min_θ L(θ) ≔ d(P̂_θ, P̂)

 Discriminator family ℱ = {neural nets}
 GANs [Goodfellow et al.'14]:
   d = max-likelihood loss for classifying real vs. fake samples using discriminators in ℱ
 W-GANs [Arjovsky et al.'16]:
   d(P̂_θ, P̂) = W_ℱ(P̂_θ, P̂) ≔ max_{f∈ℱ} |E_{P̂_θ}[f(X)] − E_{P̂}[f(X)]|
   the ℱ-integral probability metric (IPM) between P̂_θ and P̂

 If ℱ = {all 1-Lipschitz functions}, W_ℱ is the Wasserstein distance
 (W_ℱ is often much weaker than the Wasserstein distance)
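A minimal sketch of the ℱ-IPM idea for the simplest possible discriminator class, linear functions with bounded norm (an assumption made purely for illustration; GAN/W-GAN discriminators are neural nets). For this class the maximization has a closed form, and the example also hints at the weakness of small discriminator classes discussed on the next slides.

```python
import numpy as np

def linear_ipm(samples_p, samples_q):
    """max_{||w|| <= 1}  E_p[<w, X>] - E_q[<w, X>]  =  ||mean_p - mean_q||."""
    diff = samples_p.mean(axis=0) - samples_q.mean(axis=0)
    return np.linalg.norm(diff)

rng = np.random.default_rng(0)
real = rng.normal(loc=1.0, size=(1000, 5))
fake = rng.normal(loc=0.0, size=(1000, 5))
print(linear_ipm(real, fake))           # ~ sqrt(5): the means differ by 1 per coordinate

# such a weak discriminator class cannot distinguish distributions with equal means
same_mean_fake = rng.standard_t(df=3, size=(1000, 5))
print(linear_ipm(real - 1.0, same_mean_fake))   # small, despite different distributions
```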
What distributions do W-GANs learn?

 If the training succeeds, P̂_θ is close to P̂:

   W_ℱ(P̂_θ, P̂) ≤ ε   ⟹?   W(P_θ, P) ≤ g(ε)
   “training loss”, empirical distance, weak discriminators   vs.   “test loss”, population distance, strong discriminators

 Two ingredients needed: generalization (of discriminators) and approximability/distinguishability
A Pessimistic Dilemma
Generalization (of discriminators) vs. approximability/distinguishability

 If ℱ = {NNs of bounded size}: good generalization but poor approximability. There exist generated distributions supported on only a small number of images that discriminators in ℱ cannot distinguish from the real distribution, i.e., poor diversity.
 Small discriminators cannot detect mode collapse
 If ℱ = {all 1-Lipschitz functions}: the reverse problem, poor generalization
[Arora-Ge-Liang-M.-Zhang, ICML17]
Beyond the Dilemma: Restricted Approximability
Generalization (of discriminators) vs. approximability/distinguishability

 We only need: for a particular generator class, design a corresponding parameterized discriminator class ℱ with restricted approximability, i.e., ℱ generalizes while still being strong enough to distinguish distributions within the generator class.

Generator classes with restricted approximability:
 Gaussians, mixtures of Gaussians, and exponential families
 Invertible neural network generators (next slide)

[Bai-M.-Risteski’18]
Restricted Approximability of Invertible Neural Nets

 Z ∈ ℝᵏ → X = G_θ(Z) ∈ ℝᵈ

 Assume G_θ is an injective function, e.g., an ℓ-layer neural net with leaky-relu activations and full-rank weight matrices.
 Define ℱ = {ℓ-layer neural nets with a special structure}. Then restricted approximability holds: the population (“test”) distance is bounded in terms of the empirical (“training”) distance, up to generalization and approximation error terms.

Open problems:
 Removing the injectivity assumption?
 Stronger approximation inequalities?
 Practical implications?

[Bai-M.-Risteski'18] (which uses tools from [Zhang-Liu-Zhou-Xu-He'17])
Part IV (Brief): Theory for Embeddings
How do we reason about representation learning?

[Outline diagram (repeated)]
[Diagram: embeddings (for words, sentences, paragraphs) → downstream tasks]

 How do we formalize representation learning?

A generative approach: (parameters, latent variable) → corpus
 Reason about the properties of the learned embeddings
 Design new embedding methods (e.g., algorithms to recover the latent variables)

 Incorporating syntactic information into the modeling?
 Other effective frameworks for understanding representation learning?

Word and sense embeddings [Arora-Li-Liang-M.-Risteski TACL'17,'18; Hashimoto-AlvarezMelis-Jaakkola'17]
Sentence embeddings [Arora-Liang-M., ICLR'17; Arora-Khodak-Saunshi-Vodrahalli, ICLR'18]
Rare words/phrases embeddings [Khodak-Saunshi-Liang-M.-Stewart-Arora, ACL'18]
Part V (Brief): Theory for Deep RL
Sample-efficient algorithms in continuous state space?

[Outline diagram (repeated)]
Theoretical Challenges in Deep
Reinforcement Learning
All theoretical challenges with deep learning
+
sequential decision making

 High-dimensional state space


 Non-linear function approximators for the policy, the dynamics, the Q-function, etc.
 Non-convex optimization
 Sample efficiency
Model-Based Deep Reinforcement Learning
 Very promising to reduce the sample complexity [Nagabandi et al.’17,
Kurutach et al.’18]
 Uses state information (instead of only rewards)

Recent theoretical progress on controlling linear dynamical models


 Sample complexity guarantees [Dean-Mania-Matni-Recht-Tu’17,18]

Non-linear dynamics:
 no previous convergence guarantees
 known issue: initial convergence is good, asymptotic convergence is
difficult [Nagabandi et al.’17]
Model-Based Deep RL with Asymptotic
Convergence Guarantees

Framework of MB RL [Xu-Li-Tian-Darrell-M.’18]
Repeat:
 maximize an analytical lower bound of the reward, based on trajectories, over both the policy and the model
 collect new samples using the current policy; recompute the lower bound

Guarantee: the reward monotonically increases to a local maximum (under the assumption that model-free policy optimization can be solved)
Model-Based Deep RL with Asymptotic
Convergence Guarantees

[Figure: reward vs. number of samples]

 The first to achieve near-optimal reward on half-cheetah using MB RL with a single dynamical model
 3-10X more sample-efficient than model-free approaches

[Xu-Li-Tian-Darrell-M.’18]
Summary
 Supervised learning: interactions between optimization, generalization,
and choices of architectures
 Unsupervised learning: theory of GANs and representation learning
 Model-based reinforcement learning: asymptotic convergence theorem

Topics I did not cover


 Security in deep learning (e.g., defending adversarial examples)
 Uncertainty quantification
 Fairness in deep learning
 Meta learning
Concluding Thoughts
 A burst of recent promising theoretical results on deep learning

 Potential to provide intuition or guidance to practice

 Back to math: motivate the development of new tools

 I am optimistic that theory can sustain or boost the progress of deep


learning and AI

Thank you!!
