Tengyu Ma
Facebook AI Research/Stanford
How can we use mathematical thinking to
formalize, understand, and improve deep learning?
Outline
(Diagram) Supervised learning, unsupervised learning, and reinforcement learning, each connected to modeling/architecture on one side and to statistics/generalization and optimization on the other
Part I: Optimization in Deep Learning
Obvious questions:
How can we provide guarantees on the runtime and
the quality of the solutions?
How do we optimize faster?
[Carmon-Duchi-Hinder-Sidford’17a,b,18]
Optimization for Concrete Problems
An ML problem: (input/output, model parameterization, loss) → Algorithm → a global minimum of the loss
Trading generality for more ambitious goals: global minimum and faster runtime
Encourages leveraging the structure of the problem
Even the same algorithm may admit a better analysis
Example: Linearized One-hidden-layer Neural Nets
(The activation is the identity function in this slide; network: input x → W_1 → W_2 → output.)
Properties of the objective f:
1. All local minima of f are global minima
2. All saddle points of f have a direction of negative curvature
Theorem: If f satisfies 1 & 2, stochastic gradient descent (and various other algorithms) converges to a global minimum of f in poly time.
In particular, this gives a guarantee for solving PCA with SGD; but such a strong result is not possible without using the structure of f.
Caveat: the statement is false when the activation is ReLU, sigmoid, or tanh (more later)
Key: saddle points are not stable in the presence of noise
Conjecture: the losses of deep neural nets have similar properties [Choromanska-Henaff-Mathieu-BenArous-LeCun’15]
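The instability of saddle points under noise can be seen in a toy experiment (an illustrative sketch, not from the talk): on f(x, y) = x² − y², exact gradient descent initialized at the strict saddle (0, 0) never moves, while adding a small amount of gradient noise escapes along the negative-curvature direction.

```python
import numpy as np

def f_grad(p):
    # f(x, y) = x^2 - y^2 has a strict saddle at the origin:
    # zero gradient, but negative curvature along the y direction.
    x, y = p
    return np.array([2 * x, -2 * y])

def descend(p, lr=0.1, steps=200, noise=0.0, seed=0):
    rng = np.random.default_rng(seed)
    p = np.array(p, dtype=float)
    for _ in range(steps):
        g = f_grad(p)
        if noise > 0:
            g = g + noise * rng.standard_normal(2)  # perturbed gradient
        p = p - lr * g
    return p

# Exact GD initialized at the saddle never moves.
stuck = descend([0.0, 0.0], noise=0.0)

# Noisy GD drifts into the negative-curvature direction and escapes:
# the y coordinate is amplified by a factor 1.2 per step.
escaped = descend([0.0, 0.0], noise=1e-3)
```

The noise term is exponentially amplified along the unstable direction, which is exactly why stochastic gradients do not get trapped at strict saddles.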
Landscape design: a new parameterization, model, or loss, so that general-purpose optimizers find a prediction model with good test error
Example: residual connections [He-Zhang-Ren-Sun’2016]
Residual Connections Improve the Landscape!
A very simplified model for first-cut understanding: linearized deep nets
Without residual connections, flat saddle points exist (with zero gradient, Hessian, and higher-order derivatives) [Kawaguchi’16]
With residual connections, Theorem [Hardt-M., ICLR’17]:
1. All stationary points with small norm are global minima
2. There exists a global minimum with small norm
(Figure: two 4-layer networks x = h_0 → A_1 → … → A_4 → ŷ = h_4, without and with identity skip connections I; non-linear version: h_i = relu(A_i h_{i−1}))
Open problem: landscape properties of resnets with non-linear activations?
Ex. 2: Over-parameterization Improves the Landscape (Empirically)
Synthetic experiments (a teacher-student setup):
Large amount of data generated by a fixed two-layer neural net with 100 hidden units; students trained on the squared loss
(Figure: training curves for student hidden dimension 100 vs. 400)
Experiment first performed by [LeCun, Livni-ShalevShwartz-Shamir’14]
See also [Ge-Lee-M.’17, Safran-Shamir’17]
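A minimal numpy sketch of such a teacher-student experiment (the dimensions, learning rate, and step count here are illustrative choices, not the slide's): labels come from a fixed relu teacher, and students of the teacher's width and of 4× that width are trained by full-batch gradient descent on the squared loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# Teacher: a fixed two-layer relu net with k hidden units generates the labels.
d, k, n = 10, 5, 2000
W_teacher = rng.standard_normal((k, d)) / np.sqrt(d)
a_teacher = rng.standard_normal(k) / np.sqrt(k)

X = rng.standard_normal((n, d))
y = np.maximum(X @ W_teacher.T, 0) @ a_teacher  # teacher labels

def train_student(m, steps=3000, lr=1e-2):
    """Train a student with m hidden units by full-batch GD on squared loss."""
    W = rng.standard_normal((m, d)) / np.sqrt(d)
    a = rng.standard_normal(m) / np.sqrt(m)
    for _ in range(steps):
        H = np.maximum(X @ W.T, 0)           # (n, m) hidden activations
        r = H @ a - y                         # residuals
        a -= lr * H.T @ r / n                 # gradient step on output weights
        W -= lr * ((np.outer(r, a) * (H > 0)).T @ X) / n  # and on hidden weights
    return np.mean((np.maximum(X @ W.T, 0) @ a - y) ** 2)

loss_exact = train_student(m=k)      # student matches the teacher's width
loss_over = train_student(m=4 * k)   # over-parameterized student
```

Comparing the final training losses across random restarts reproduces the slide's phenomenon: the wider student gets stuck in bad local minima far less often.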
Analysis of Over-parameterization: Inspiring Partial
Results
Setup: one-hidden-layer net with weight matrices W_1, W_2; k hidden units; n data points; input dimension d
Theorem [Soudry-Carmon’16]: With leaky relu activation and squared loss, if the network is sufficiently over-parameterized, all the differentiable stationary points are global minima.
Caveat: optimizers do converge to non-differentiable points
Theorem [Du-Lee’18]: With quadratic activation and squared loss, if the network is sufficiently over-parameterized and the second-layer weights are fixed, all the local minima are global minima.
Both papers assume very little about data distribution and labels
Can these models generalize? Yes! (more later)
Over-parameterized linearized recurrent neural nets [Hardt-M.-Recht’17]
Ex. 3: Landscape Design by Defining a New Loss Function
Same one-hidden-layer setup as the previous slide (weights W_1, W_2; input dimension d, k hidden units), with additional assumptions
Recall that the squared loss has bad local minima
[Ge-Lee-M.’17] designs an alternative objective function whose local minima are all global
(Diagram) Architecture choice → Landscape → Optimization
Part II: Generalization Theory in Deep
Learning
Why can over-parameterized neural nets generalize?
“Overfitting” (Textbook Version)
(Figure: the classical textbook picture of test error vs. model complexity relative to # data)
Generalization theory via uniform convergence
Margin-based: C ≜ (product of norms of the weights) / margin
Compression-based [Arora et al.’18]: C ≜ minimum compression of the model ≤ noise stability
Flat minima tend to generalize better [Hinton-Camp’93, Hochreiter-Schmidhuber, Keskar et al.’16, Dinh et al.’17]
PAC-Bayes bounds (for stochastic neural nets) [Dziugaite-Roy’18]: C ≜ KL(P‖Q)
Intrinsic complexity of the data matters
These are post-mortem analyses: they don’t explain why low-complexity solutions are obtained in the first place
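As a concrete (schematic) instance of the margin-normalized measure, one can compute the product of layer spectral norms divided by the output margin for a small random relu net; the exact norms and constants differ across the papers cited, so this is only the shape of the quantity.

```python
import numpy as np

rng = np.random.default_rng(0)

# A small random relu network with a linear classification head.
d, classes = 20, 5
weights = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(2)]
head = rng.standard_normal((classes, d)) / np.sqrt(d)

def logits(x):
    h = x
    for W in weights:
        h = np.maximum(W @ h, 0)   # relu layers
    return head @ h

x = rng.standard_normal(d)
out = logits(x)
label = int(np.argmax(out))        # treat the prediction as the label
margin = out[label] - np.max(np.delete(out, label))  # > 0 by construction

# Schematic capacity proxy: product of spectral norms over the margin.
norm_product = np.prod([np.linalg.norm(W, 2) for W in weights + [head]])
complexity = norm_product / margin
```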
Two missing parts:
A definition of complexity
Proof that the algorithm converges to low-complexity solutions
Algorithmic Regularization: Recent Progress
Linear models:
Folklore: GD on linear regression converges to the minimum ℓ2-norm solution
Mirror descent converges to the solution with minimum Bregman divergence [Gunasekar-Lee-Soudry-Srebro’18a]
GD on logistic regression converges to the max-margin solution [Soudry-Hoffer-Nacson-Gunasekar-Srebro’17, Ji-Telgarsky’18]
Beyond linear models: [Gunasekar-Lee-Soudry-Srebro’18b, Li-M.-Zhang’18, Brutzkus-Globerson-Malach-ShalevShwartz’18]
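The folklore min-norm result is easy to check numerically (a sketch with made-up dimensions): on an under-determined least-squares problem, gradient descent started from zero converges to the pseudoinverse solution, which is the minimum ℓ2-norm interpolant.

```python
import numpy as np

rng = np.random.default_rng(0)

# Under-determined least squares: more parameters than data points.
n, d = 5, 20
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Gradient descent on (1/2)||Xw - y||^2 starting from w = 0.
w = np.zeros(d)
lr = 0.01
for _ in range(20000):
    w -= lr * X.T @ (X @ w - y)

# GD iterates stay in the row space of X, so the limit is the
# minimum L2-norm interpolating solution = the pseudoinverse solution.
w_min_norm = np.linalg.pinv(X) @ y
```

The implicit bias comes from the initialization at zero: the updates never leave the row space of X, and within that space there is a unique interpolant.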
One-hidden-layer Quadratic Neural Nets With Over-Parameterization
Setup: U ∈ ℝ^{d×m}; quadratic activations, so ŷ = Σ_i (u_i^⊤ x)^2; n data points; loss = empirical MSE
Theorem [Informal, Li-M.-Zhang’18, COLT best paper]: With sample size governed by the complexity of the ground truth (rather than the number of parameters), gradient descent with a sufficiently small initialization returns a solution with small test error after polynomially many iterations.
(Figure: nested sets S_1 ⊆ … ⊆ S_r ⊆ … ⊆ S_d around 0; non-generalizable global minima of the training loss lie in the larger sets, generalizable global minima in S_r)
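Concretely, the model in this setup can be sketched as follows (dimensions are illustrative): the prediction Σ_i (u_i^⊤ x)² equals x^⊤(UU^⊤)x, so the effective parameter is the PSD matrix UU^⊤, whose rank/norm is the relevant complexity measure.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 6, 10  # input dim, hidden width (over-parameterized: m > d is allowed)
U = rng.standard_normal((d, m))
x = rng.standard_normal(d)

# One-hidden-layer net with quadratic activation:
# y_hat = sum_i (u_i^T x)^2, with u_i the columns of U.
y_hat = np.sum((U.T @ x) ** 2)

# Equivalent matrix form: y_hat = x^T (U U^T) x, so the model is really
# parameterized by the PSD matrix M = U U^T; its rank/norm is the
# complexity that matters, not the raw parameter count d*m.
y_matrix = x @ (U @ U.T) @ x
```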
Weight Matrix Has Low Complexity Throughout the Training
Very similar setup to the previous slide; labels generated by a teacher with small hidden size
(Figure: the complexity measure of the weight matrix stays low over the course of training)
Summary on Generalization Theory
Informative complexity measures of deep nets
Algorithms have an implicit regularization effect to minimize the
complexity
(Diagram) Architecture choice → Optimization, Statistics/Generalization
Part III: Statistical Theory of Generative
Adversarial Networks (GANs)
Next few slides
Statistical problems:
What distributions do GANs learn with finite samples?
How do we measure the quality of the generated images?
Computational problem:
How do we train GANs faster and better?
GANs: Learning to Sample from Samples
A generator neural net G_θ maps random noise to samples (θ are the parameters)
Loss Function Based on Samples
Maximum-likelihood loss for classifying real from fake samples using discriminators in a class ℱ gives the ℱ-distance
d(P̂_θ, P̂) = W_ℱ(P̂_θ, P̂) := max_{f ∈ ℱ} |𝔼_{P̂_θ} f(X) − 𝔼_{P̂} f(X)|
Question: does small “training loss” (empirical distance) imply small “test loss” (population distance), i.e., does W_ℱ(P̂_θ, P̂) ≤ ε imply W(P_θ, P) ≤ g(ε)?
Trade-off: weak discriminators favor generalization (of discriminators); strong discriminators favor approximability/distinguishability
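The ℱ-distance is straightforward to compute for a finite discriminator class (a toy sketch; the class and distributions here are made up): it is the largest gap in mean discriminator output between the two sample sets.

```python
import numpy as np

def ipm(samples_p, samples_q, discriminators):
    """W_F(P, Q) = max_{f in F} |E_P f(X) - E_Q f(X)| for a finite class F."""
    gaps = [abs(np.mean(f(samples_p)) - np.mean(f(samples_q)))
            for f in discriminators]
    return max(gaps)

rng = np.random.default_rng(0)
p = rng.normal(0.0, 1.0, size=5000)   # "real" samples
q = rng.normal(0.5, 1.0, size=5000)   # "generated" samples, shifted mean

# A weak discriminator class: a few fixed test functions.
F = [lambda x: x, lambda x: x ** 2, np.tanh]

d_same = ipm(p, p, F)   # identical samples: distance exactly 0
d_diff = ipm(p, q, F)   # the mean shift is detected by f(x) = x
```

With a richer (parameterized) class ℱ the max becomes an inner optimization, which is exactly the discriminator's training objective; the weaker the class, the easier it is to generalize but the less it can distinguish.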
A Pessimistic Dilemma
Generalization (of discriminators) and approximability/distinguishability appear to be in conflict: generic discriminator classes cannot deliver both [Arora-Ge-Liang-M.-Zhang, ICML17]
Beyond the Dilemma: Restricted Approximability
We only need approximability restricted to the generator class: for a particular generator class, it’s possible to design a corresponding parameterized discriminator class with restricted approximability, while retaining generalization [Bai-M.-Risteski’18]
Restricted Approximability of Invertible Neural Nets
Generator: Z ∈ ℝ^k, X = G_θ(Z) ∈ ℝ^d
Assume G_θ is an injective function, e.g., an ℓ-layer neural net with leaky relu activation and full-rank weight matrices
Define ℱ = {ℓ-layer neural nets with a special structure}. Then ℱ has restricted approximability for this generator class.
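Layer-by-layer invertibility is easy to verify numerically (my own sketch, with square k = d weight matrices for simplicity; the slide's injective case allows k ≤ d): leaky relu is invertible pointwise, and a full-rank linear map is invertible by a solve.

```python
import numpy as np

rng = np.random.default_rng(0)

def leaky(x, a=0.2):
    return np.where(x > 0, x, a * x)

def leaky_inv(y, a=0.2):
    # leaky relu is strictly monotone, hence invertible pointwise.
    return np.where(y > 0, y, y / a)

# A 3-layer generator with leaky-relu activations and square weight
# matrices (full rank with probability 1), invertible layer by layer.
k = 4
Ws = [rng.standard_normal((k, k)) for _ in range(3)]

def G(z):
    for W in Ws:
        z = leaky(W @ z)
    return z

def G_inv(x):
    for W in reversed(Ws):
        x = np.linalg.solve(W, leaky_inv(x))
    return x

z = rng.standard_normal(k)
z_rec = G_inv(G(z))
```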
Embeddings (for words, sentences, paragraphs) → Downstream tasks
Theoretical Challenges in Deep
Reinforcement Learning
All theoretical challenges with deep learning
+
sequential decision making
Non-linear dynamics: no prior convergence guarantees
Known issue: initial convergence is good, but asymptotic convergence is difficult [Nagabandi et al.’17]
Model-Based Deep RL with Asymptotic
Convergence Guarantees
Framework of MB RL [Xu-Li-Tian-Darrell-M.’18]
Repeat:
1. Maximize an analytical lower bound of the reward, based on trajectories, over both the policy and the model
2. Collect new samples using the current policy; recompute the lower bound
(Figure: reward vs. number of samples [Xu-Li-Tian-Darrell-M.’18])
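The loop above can be sketched on a toy 1-D linear system (everything here — the environment, the linear model and policy classes, and the grid-search "maximization" — is an illustrative stand-in for the paper's actual construction, which uses an analytical lower bound):

```python
import numpy as np

def env_step(s, a):
    return 0.9 * s + a          # unknown true dynamics (1-D, linear)

def reward(s):
    return -s ** 2              # drive the state to zero

def rollout_reward(theta_model, theta_policy, s0, horizon=20):
    """Surrogate objective: reward of the policy under the learned model."""
    s, total = s0, 0.0
    for _ in range(horizon):
        a = theta_policy * s            # linear policy
        s = theta_model * s + a         # learned linear model
        total += reward(s)
    return total

theta_model, theta_policy = 0.0, 0.0
for it in range(10):
    # Step 1: improve the policy against the surrogate objective
    # (a crude grid search stands in for the paper's joint maximization).
    candidates = np.linspace(-2, 2, 81)
    theta_policy = max(candidates,
                       key=lambda p: rollout_reward(theta_model, p, s0=1.0))
    # Step 2: collect fresh trajectories with the current policy,
    # then refit the model and recompute the surrogate.
    states, nexts, acts = [], [], []
    s = 1.0
    for _ in range(50):
        a = theta_policy * s
        s_next = env_step(s, a)
        states.append(s); acts.append(a); nexts.append(s_next)
        s = s_next
    # Least-squares fit of s' - a = theta_model * s.
    states, nexts, acts = map(np.array, (states, nexts, acts))
    theta_model = states @ (nexts - acts) / (states @ states)
```

Alternating these two steps converges here: the model recovers the true coefficient 0.9, and the policy learns the gain −0.9 that cancels the dynamics.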
Summary
Supervised learning: interactions between optimization, generalization,
and choices of architectures
Unsupervised learning: theory of GANs and representation learning
Model-based reinforcement learning: asymptotic convergence theorem
Thank you!!