
Pre-requisites

• Linear Algebra, Multivariable Calculus, Probability and Statistics, Algorithms

• Machine learning class or prior equivalent experience

• Experience with Python and scientific toolkits (e.g. numpy, scipy, matplotlib, sklearn) is recommended
Outline

• Introduction to Neural Networks and Representation Learning

• Backpropagation and Automatic Differentiation Software

• Optimization for Deep Learning

• Regularization and Implicit Regularization

• CNNs and Representation Learning

• Interpretability in Deep Learning (Guest Lecture)

• Generalization, Memorization, and Adversarial Examples

Outline

• RNNs and sequence models

• Attention and Self-Attention

• Multi-Task and Transfer Learning

• Deep Generative Models

• Deep Metric Learning

• Self-supervised Learning

• (If time allows) Basics of Deep Reinforcement Learning

Deep Learning

• Rebranding of (deep) Neural Networks - universal function approximator

• Representation Learning - avoid hand-crafted features

• Modular but powerful/expressive model framework

Applications
Autonomous Driving, Speech Recognition, Automated Shopping, Machine Translation

(Images from Wikimedia Commons)
Group Activity

• 10 minutes - in groups of 3-5, discuss

• Come up with a list of 1-2 Deep Learning applications that are not already mentioned

• In your group, decide on an application which you'd like to best understand after this course
Neural Networks
• Simple computation blocks

x1, x2   Inputs
w1, w2   Weights
b        Bias
σ        Activation function
h1       Neuron activation

h(x1, x2) = σ(w1 x1 + w2 x2 + b)
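As a quick illustration (not on the original slide), a single neuron can be written in a few lines of numpy; the input, weight, and bias values below are arbitrary.

    # A minimal single-neuron sketch with a sigmoid activation (numpy assumed).
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def neuron(x, w, b):
        # h(x1, x2) = sigma(w1*x1 + w2*x2 + b)
        return sigmoid(np.dot(w, x) + b)

    x = np.array([1.0, 2.0])   # inputs x1, x2
    w = np.array([0.5, -0.3])  # weights w1, w2
    b = 0.1                    # bias
    print(neuron(x, w, b))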
Neural Networks
• Simple computation blocks that work together

Width (units per layer) and Depth (number of layers): Input Layer (x1, x2) → Layer 1 (h1(1), h2(1), h3(1)) → Layer 2 (h1(2), h2(2), h3(2)) → output y

h1(2) = σ(w11(2) h1(1) + w21(2) h2(1) + w31(2) h3(1) + b1(2))
Neural Networks
• Linear models can be thought of as neural networks

(Diagram: output y connected directly to inputs x1, x2 by weights)

• Usually, in the feedforward case, "deep networks" refers to those with 2 or more hidden layers
Neural Networks Review
• Can be written as a composition of simple linear operators + pointwise non-linearities

fW1,W2,W3(x) = W3 ρ(W2 ρ(W1 x))

Wi   Matrix of parameters at layer i
ρ    Pointwise non-linearity
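A minimal numpy sketch of this composition, using ReLU as the pointwise non-linearity ρ; the layer sizes and random weights are arbitrary choices for illustration.

    # Forward pass of f(x) = W3 ρ(W2 ρ(W1 x)) with ReLU as ρ (numpy assumed).
    import numpy as np

    rho = lambda z: np.maximum(z, 0.0)   # pointwise non-linearity

    rng = np.random.default_rng(0)
    W1 = rng.normal(size=(3, 2))   # layer 1: 2 inputs -> 3 hidden units
    W2 = rng.normal(size=(3, 3))   # layer 2: 3 -> 3
    W3 = rng.normal(size=(1, 3))   # output layer: 3 -> 1

    def f(x):
        return W3 @ rho(W2 @ rho(W1 @ x))

    x = np.array([0.5, -1.0])
    print(f(x))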
Neural Networks Review
• Can incorporate bias terms in W to simplify notation

h(1) = ρ(W1 x),   where W1 = [W′1  b] and x = [x′; 1]

• Aspects of neural networks have been historically inspired by biological systems

Images from Jean-Louis Queguiner
Biologically Plausible
Be careful not to take the biological analogy too far
• Biological neurons are much more complex and have a large variety

• Timing is (potentially) important in real neuronal activity (spike timing)

Lobo, J., et al. (2020). Spiking Neural Networks and online learning: An overview and perspectives. Neural Networks.

• The learning algorithm in brains is unknown! It may not be anything like existing methods for NNs

• Brains have feedback and recurrence

• Many other huge differences
Representational Power of Linear Models

fW(x) = W1 x        fW(x) = W2 ρ(W1 x)
Linear Deep Network
fW1,W2,W3(x) = W3 ρ(W2 ρ(W1 x))

Wi   Matrix of parameters at layer i
ρ    Identity function

Can we approximate this?
Capacity of Neural Networks
• The expressive power of neural networks

Non-linear 1-hidden-layer network
Topics: single hidden layer neural network

(Figure: a single-hidden-layer network over inputs x1, x2 with hidden units y1-y4 combining into an output z1)

From Pascal Vincent and Hugo Larochelle's slides
Playground TensorFlow

https://playground.tensorflow.org/
Universal Approximation Theorem
ftarget : ℛ^d → ℛ    Function we want to approximate

fW(x) = W2 ρ(W1 x)    Approximation parametrized by W1, W2

Under conditions on ρ (satisfied e.g. for the sigmoid), there exist W1, W2 which obtain at most ϵ error:

sup_{x ∈ [0,1]^d} | fW(x) − ftarget(x) | < ϵ

Cybenko 1989
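To make the theorem concrete, here is a small, hedged sketch (not from the slides) of fitting a 1-hidden-layer sigmoid network to a 1D target with PyTorch; the target function, width, learning rate, and step count are arbitrary.

    # Fitting a 1-hidden-layer network to a 1D target function (PyTorch assumed).
    import torch

    x = torch.linspace(0, 1, 256).unsqueeze(1)   # inputs in [0, 1]
    y = torch.sin(6 * x)                         # target function to approximate

    model = torch.nn.Sequential(
        torch.nn.Linear(1, 64),   # W1 (with bias)
        torch.nn.Sigmoid(),       # ρ
        torch.nn.Linear(64, 1),   # W2 (with bias)
    )
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)

    for step in range(2000):
        loss = ((model(x) - y) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

    print(loss.item())  # the error shrinks as width and training increase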
Why Deep?
Most functions of interest can be approximated with a single hidden layer network:
why do we need multiple layers?

• Deeper networks can much more efficiently represent many functions (in terms of number of parameters)

(Figure: a 1-hidden-layer network vs. an N-hidden-layer network; images from Wikimedia Commons)

• Deeper networks, for some data, may bias the learning process towards more general solutions
Capacity of Deep NN

• Several results exist showing that deep NNs are more parameter efficient

• Existing results are for restricted settings (on the target or the network) or make use of assumptions

Eldan, Ronen, and Ohad Shamir. "The power of depth for feedforward neural networks." Conference on learning theory. 2016.

Montufar, Guido F., et al. "On the number of linear regions of deep neural networks." Advances in neural information processing systems. 2014.


Capacity of Deep NN
• (Montufar et al., NIPS 2014) studies the number of piecewise-linear regions represented by ReLU networks; followed up by many works

(Figures: the ReLU non-linearity, a ReLU network, and examples of linear regions in 2D)
Montufar, Guido F., et al. "On the number of linear regions of deep neural networks." Advances in neural information processing systems. 2014.

Serra T. et al Bounding and Counting Linear Regions of Deep Neural Networks. Neurips 2017

Capacity of Deep NN
• ReLU networks compute piecewise-linear functions

Eldan, Ronen, and Ohad Shamir. "The power of depth for feedforward neural networks." Conference on learning theory. 2016.

Montufar, Guido F., et al. "On the number of linear regions of deep neural networks." Advances in neural information processing systems. 2014.

Serra T. et al Bounding and Counting Linear Regions of Deep Neural Networks. Neurips 2017


Capacity of Deep NN
• (Montufar et al., NIPS 2014)

• The number of linear regions in the output increases exponentially with depth and polynomially in width:

Ω( (n/d)^((L−1)d) n^d )

L   number of hidden layers
d   input dimension
n   width - number of hidden units per layer
Hanin, Boris, and David Rolnick. "Complexity of linear regions in deep networks." arXiv preprint arXiv:1901.09021 (2019).

Montufar, Guido F., et al. "On the number of linear regions of deep neural networks." Advances in neural information processing systems. 2014.

Depth Helps Generalization (Optional)

• Common conjecture based on empirical observations, with some existing but limited theoretical backing

• Some relevant work:

Urban, Gregor, et al. "Do deep convolutional nets really need to be deep and convolutional?." arXiv
preprint arXiv:1603.05691 (2016).

Arora, Sanjeev, et al. "Implicit regularization in deep matrix factorization." Advances in Neural
Information Processing Systems. 2019.

Pérez, Guillermo Valle, Chico Q. Camargo, and Ard A. Louis. "Deep learning generalizes because the
parameter-function map is biased towards simple functions." ICLR 2019


Deep Learning

• Rebranding of (deep) Neural Networks - universal function approximator

• Representation learning - avoid hand-crafted features

• Modular but powerful model framework

Supervised Representation Learning

• Linear models are fast and easy to use

• Designing hand-crafted features is thus tempting

• Difficult for humans to fully describe models of perception and the relevant features they use — representation learning can fill this gap
Representation Learning

• We can interpret intermediate hidden layer outputs as learned features

(Figure: hidden units h1(1), h2(1), h3(1) acting as "Head?", "Legs?", "Neck?" detectors feeding an output y = Giraffe; photo from Wikimedia Commons)
Representation Learning
• A key feature of representation learning is reusability: features from one category can be relevant in another, suggesting generality

(Figure: the same "Head?", "Legs?", "Neck?" features feeding an output y = Cat; photo from Wikimedia Commons)
Representation Learning
• In supervised settings we can consider the final layer as encouraging features that linearly separate the data

(Figure: the last hidden layer h1(2), h2(2), h3(2) viewed as learned "features" of the inputs x1, x2)
Representation Learning
• In order to optimize the learning objective with a given deep architecture, the learning process can discover relevant features of the data

• Intermediate representations in deep networks can find features similar to those designed by humans

• However, they can also learn ones we wouldn't think of

From Yann LeCun's slides
Representation Learning Re-usability
• A goal of representation learning is to be useful on new tasks and datasets

• With enough varied data, representation learning can often outperform hand-crafted features

• ImageNet-learned features are more useful for most computer vision tasks than hand-crafted features

• Can reuse representations in multiple ways

(Diagram: train an initial NN on a large Dataset 1, then reuse the trained network's representations on Dataset 2 / Task 2)
Unsupervised Rep. Learning
• A simple example of unsupervised representation learning: autoencoders

(Diagram: an encoder network maps the input to a representation and a decoder network reconstructs the input; photo from Wikimedia Commons)
Unsupervised Rep. Learning

• Unsupervised representation learning has been a major driver of deep learning research since 2006

(Diagram: train an initial NN on a large unlabeled Dataset 1, then use its representations on a labeled Dataset 2 / Task 2)

• Starting to show promising results since 2018

Deep Learning

• Rebranding of (deep) Neural Networks - universal function approximator

• Representation learning - avoid hand-crafted features

• Modular framework for high capacity models

A Modular Framework

• Gradient descent and the backpropagation algorithm lead to a set of tools for jointly adapting a modular system toward a single objective

(Diagram: Network 1 → Network 2 → Network 3 → task output)
A Modular Framework
(Diagram: assemble model components from a toolbox of reusable blocks, then learn the parameters jointly; photo from Wikimedia Commons)
A Modular Framework

• Software frameworks exist for jointly adjusting all model components

• Adding priors is easy and learning is formulaic

• A flexible framework to build models with strong approximation ability
Large Data and Compute
• Deep learning models and frameworks are well placed to take advantage of larger datasets and increasing compute

(Images from Nvidia Inc.)
More data + high capacity model

(Figure: model fits with 10,000 samples vs. 1 million samples)
Bigger Models Keep Being Better
• The original ImageNet dataset of ~1.2 million images was released in 2010

• For a given architecture class, bigger models increase performance — a power law

• More efficient architectures shift the curve over time

(Figure: accuracy vs. model size, from AlexNet (2012) through 2014-16, 2018, and 2020 architectures)
Neural Scaling “Laws” (Hypothesis)

• Empirically observed that models continue to improve with more data and bigger (more parameters) versions of existing model classes

• Performance gains follow a power law

• Recent language models continue to improve performance

• Observations have been consistent even a decade after the first large-dataset results with deep learning models
Motivating Examples: More Data and Image Recognition

• Data explosion

• Representations keep improving

• Some domains have many examples

• Other domains: more measurements/variables compared to examples
Language Modeling

Where do these empirical scaling observations break down?

Summary of some Trends

• More compute/accuracy-efficient models, enabled by modularity and ease of use -> continue to show improved results on many tasks

• More data + compute -> improved results on many tasks

• Bigger models + compute -> improved results on many tasks
When do we not use Deep Learning

• Many machine learning problems do not require deep learning or huge models

• Training speed — deep learning is slow

• Logistic regression, random forests, gradient boosting, and related methods can often be fit on a laptop for mid-sized problems, including hyperparameter search

• Linearly separable problems
When do we not use Deep Learning
• Smaller datasets, without external related data, are sometimes more easily solved with hand-crafted features + simpler classifiers

• Good feature extractors exist

• Deep learning methods particularly excel on perceptual data (images, speech, language)

• Interpretability is critical

• Tabular data can often be solved equally well by other methods

• This is a non-exhaustive list; there are many other situations as well!
Gradient Descent

min_x f(x)

(Figure: a function f(x) in one dimension)

Gradient

The negative gradient gives the direction of steepest descent

f : ℛ^D → ℛ,   ∇f(x) = [ ∂f(x)/∂x1, …, ∂f(x)/∂xD ]ᵀ

∇x f(x, z) = [ ∂f(x, z)/∂x1, …, ∂f(x, z)/∂xD ]ᵀ
Gradient Descent

min_x f(x)

Initialize:  x0

Iterate:  xt+1 = xt − α ∇f(xt)

Stopping criterion:  | f(xt+1) − f(xt) | < tolerance
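A minimal sketch of the gradient descent loop above on a simple quadratic; the objective, step size α, and tolerance are arbitrary illustrative choices.

    # Gradient descent on f(x) = 0.5 ||x||^2 (numpy assumed).
    import numpy as np

    def f(x):
        return 0.5 * np.sum(x ** 2)

    def grad_f(x):
        return x

    x = np.array([3.0, -2.0])     # initialize x0
    alpha, tol = 0.1, 1e-8
    for t in range(10_000):
        x_new = x - alpha * grad_f(x)          # x_{t+1} = x_t - alpha * grad f(x_t)
        if abs(f(x_new) - f(x)) < tol:         # stopping criterion
            break
        x = x_new
    print(x)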
Gradient Descent in 2D
Visualization in 2 dimensions using contours

(Figure: contour plot with iterates xt0, xt1, xt2, xt3 approaching the minimum along coordinates [x]1 and [x]2)

Initialize:  x0

Iterate:  xt+1 = xt − α ∇f(xt)

Stopping criterion:  | f(xt+1) − f(xt) | < ϵ
Gradient Descent for NN
fW1,W2,W3(x) = W3 ρ(W2 ρ(W1 x))

Wi   Matrix of parameters at layer i
ρ    Pointwise non-linearity
X    Data matrix, N samples x 2 features

w = [flat(W1), flat(W2), flat(W3)]   All parameters (flattened)

fw(x)
Gradient Descent for NN

Empirical Risk Minimization:  w* = arg min_w (1/n) Σ_{i=1}^n l(fw(xi), yi)

e.g.  l(fw(x), y) = (1/2) (fw(x) − y)²

ℒ(X, Y, w) = (1/n) Σ_{i=1}^n l(fw(xi), yi) = (1/2n) ||Y − fw(X)||²

Gradient of the objective with respect to the weights w:  ∇w ℒ(X, Y, w)

w     All parameters of the model
X, Y  Data matrix and labels
Gradient based learning

Gradient Descent (GD): gradient of the full objective
    ∇w ℒ(X, Y, w)
    wt+1 = wt − α ∇w ℒ(X, Y, wt)

Stochastic GD (SGD): gradient of the loss w.r.t. 1 sample
    ∇w l(x, y, w)
    wt+1 = wt − α ∇w l(x, y, wt)

Mini-batch SGD: gradient of the loss w.r.t. a sub-sample Xn ⊂ X
    ∇w ℒ(Xn, Yn, w)
    wt+1 = wt − α ∇w ℒ(Xn, Yn, wt)
Gradient-Based Optimization in ML

• Gradient-based optimization methods are critical in machine learning and especially in deep learning

• Deriving gradients becomes tedious as the number of components and their complexity grows

fW1,W2,W3(x) = W3 ρ(W2 ρ(W1 x))

• Changes to the model require re-deriving gradients
Computing the Gradient
∇w ℒ(X, Y, w) = [ ∂ℒ(X, Y, w)/∂w11(1), …, ∂ℒ(X, Y, w)/∂wKJ(I) ]ᵀ

• Finite differences

[∇w ℒ(X, Y, w)]1 = ∂ℒ(X, Y, w)/∂w11(1)
    ≈ [ ℒ(X, Y, w11(1) + ϵ, …, wkj(i), …, wKJ(I)) − ℒ(X, Y, w11(1) − ϵ, …, wkj(i), …, wKJ(I)) ] / (2ϵ)

What's wrong with this method of estimating the gradient?
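A sketch of central finite differences for estimating the gradient; the quadratic stand-in for ℒ and the choice of ϵ are assumptions made for illustration.

    # Central finite-difference gradient estimate (numpy assumed).
    import numpy as np

    def loss(w):
        return 0.5 * np.sum(w ** 2)

    def finite_diff_grad(loss, w, eps=1e-5):
        g = np.zeros_like(w)
        for i in range(w.size):            # 2 forward passes per parameter
            e = np.zeros_like(w)
            e[i] = eps
            g[i] = (loss(w + e) - loss(w - e)) / (2 * eps)
        return g

    w = np.array([1.0, -2.0, 3.0])
    print(finite_diff_grad(loss, w))       # close to the analytic gradient (here, w itself)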
Speed of Finite Difference

∂ℒ(X, Y, w)/∂w11(1) ≈ [ ℒ(X, Y, w11(1) + ϵ, …, wkj(i), …, wKJ(I)) − ℒ(X, Y, w11(1) − ϵ, …, wkj(i), …, wKJ(I)) ] / (2ϵ)

• Requires 2 forward passes for each component i of ∇w ℒ(X, Y, w)

• For d parameters: 2d forward passes (calls to the objective function)
Automatic Differentiation

• Automatic differentiation is a general term for a system that computes gradients without needing closed-form expressions

• Backpropagation is largely synonymous with a specific form of reverse-mode automatic differentiation
Chain Rule

• Consider  z(x) = f(g(x)),   f, g : ℛ → ℛ

• The chain rule in one dimension, with y = g(x) and z = f(y):

∂z/∂x = (∂z/∂y) (∂y/∂x)
Chain Rule Example
z(x) = log(x)²,  with g(x) = log(x),  f(y) = y², and z(x) = f(g(x))

∂z/∂x = (∂z/∂y) (∂y/∂x) = (2 log(x)) · (1/x) = 2 log(x) / x
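A quick numerical check of this example using torch autograd (assumed available): the analytic derivative 2 log(x)/x matches the autograd result.

    # d/dx [log(x)^2] = 2 log(x) / x, verified with autograd.
    import torch

    x = torch.tensor(3.0, requires_grad=True)
    z = torch.log(x) ** 2
    z.backward()
    print(x.grad, 2 * torch.log(torch.tensor(3.0)) / 3.0)   # the two values agree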
Automatic Differentiation
∂z/∂x = (∂z/∂y) (∂y/∂x)

Simplified expression:             Procedural:

∂z/∂x = 2 log(x) / x               1. ∂z/∂y = 2 log(x)
                                   2. ∂z/∂x = (∂z/∂y) · (1/x)
Multivariable Calculus Review

• Gradient

• Jacobian

Multivariable Calculus Review
• Gradient — vector input and scalar output, f : ℛ^D → ℛ

∇f(w) = [ ∂f(w)/∂w1, …, ∂f(w)/∂wD ]ᵀ = (∂f/∂w)ᵀ

• Jacobian — vector input and vector output, f : ℛ^D → ℛ^M

Jf(w) = ∂f(w)/∂w =
    [ ∂f1/∂w1   …   ∂f1/∂wD ]     [ ∇f1(w)ᵀ ]
    [     ⋮              ⋮    ]  =  [    ⋮     ]
    [ ∂fM/∂w1   …   ∂fM/∂wD ]     [ ∇fM(w)ᵀ ]
Multivariable chain rule warm up

Function of two variables:  h(g(x), f(x))

∂h(g(x), f(x))/∂x = (∂h/∂g) (∂g/∂x) + (∂h/∂f) (∂f/∂x)
Chain Rule Vector Valued f
• Consider  f(g(x)),   x ∈ ℛ^n,  g : ℛ^n → ℛ^m,  f : ℛ^m → ℛ

• The chain rule in multiple dimensions, with y = g(x):

∂f/∂xi = Σ_j (∂f/∂yj) (∂yj/∂xi)   →   (∇y f)ᵀ [ ∂y1/∂xi, …, ∂ym/∂xi ]ᵀ

∇x f(x)ᵀ = ∇y f(y)ᵀ (∂y/∂x)        (∂y/∂x is the Jacobian; this is a vector-Jacobian product)
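A small sketch of a vector-Jacobian product with torch autograd (assumed available); the elementwise function g and the scalar f below are arbitrary.

    # For y = g(x) and a scalar f(y), backprop computes v^T J_g(x) with v = ∇_y f.
    import torch

    x = torch.randn(4, requires_grad=True)
    y = torch.sin(x) * 2.0                       # y = g(x), elementwise, R^4 -> R^4
    f = y.sum()                                  # scalar f(y), so ∇_y f = 1
    f.backward()                                 # x.grad now holds (∇_y f)^T ∂y/∂x
    print(x.grad, 2.0 * torch.cos(x.detach()))   # agree: the Jacobian here is diagonal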
Computation Graphs
x ∈ ℛ^n,  f : ℛ^n → ℛ

f(x) = f2(f1(f0(x))),  with x1 = f0(x),  x2 = f1(x1),  y = f2(x2)

(Chain graph: x → f0 → x1 → f1 → x2 → f2 → y)

• Nodes are input or computed variables

• Non-leaf nodes are obtained by operations dependent only on parent nodes

• Note: several valid alternative ways to formalize computation graphs exist
Computation Graphs
x0 ∈ ℛ^n,  f : ℛ^n → ℛ

f(x0) = f2(f1(f0(x0))),  with x1 = f0(x0),  x2 = f1(x1),  y = f2(x2)

(Chain graph: x0 → f0 → x1 → f1 → x2 → f2 → y)

∂y/∂x0 = (∂y/∂x2) (∂x2/∂x1) (∂x1/∂x0) = (∂f2(x2)/∂x2) (∂f1(x1)/∂x1) (∂f0(x0)/∂x0)
              1 × M3          M3 × M2          M2 × M1
Forward and Backward Differentiation
(Chain graph: x0 → f0 → x1 → f1 → x2 → f2 → y)

∂y/∂x0 = (∂f2(x2)/∂x2) (∂f1(x1)/∂x1) (∂f0(x0)/∂x0)
              1 × M3         M3 × M2         M2 × M1

Take M = M3 = M2 = M1

Forward mode autodiff (multiply right to left):              M³ + M² ops,  O(M³)

Reverse mode autodiff / backprop (multiply left to right):   M² + M² ops,  O(M²)
Reverse Mode AD
(Chain graph: x0 → f0 → x1 → f1 → x2 → f2 → y)

Forward pass:                Backward pass:
  x1 = f0(x0)                  ∂y/∂x2 = ∂f2(x2)/∂x2
  x2 = f1(x1)                  ∂y/∂x1 = (∂y/∂x2) (∂f1(x1)/∂x1)
  y = f2(x2)                   ∂y/∂x0 = (∂y/∂x1) (∂f0(x0)/∂x0)
Reverse Mode AD
(Chain graph: x0 → f0 → x1 → f1 → … → f_{J−1} → y;  v_{j−1}ᵀ is the output gradient passed backward to node j−1)

Reverse mode AD for a chain graph with scalar output

Forward pass:
  x0 ← x
  for j = 0 to J − 1:
      x_{j+1} ← f_j(x_j)

Backward pass:
  v_{J−1} ← ∇f_{J−1}(x_{J−1})
  for j = J − 1 to 1:
      v_{j−1}ᵀ ← v_jᵀ J_{f_{j−1}}(x_{j−1})        (vector-Jacobian product: 1 × M2 times M2 × M1)

  ∇x0 y = v_0
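A minimal numpy sketch of reverse-mode AD over a chain graph: each stage returns its output together with a VJP closure, and the backward pass chains the VJPs right to left. The specific functions (sin, square) are arbitrary.

    # Reverse-mode AD over a chain of elementwise functions (numpy assumed).
    import numpy as np

    def make_sin():
        def fwd(x):
            return np.sin(x), lambda v: v * np.cos(x)     # VJP: v^T diag(cos(x))
        return fwd

    def make_square():
        def fwd(x):
            return x ** 2, lambda v: v * 2 * x            # VJP: v^T diag(2x)
        return fwd

    def chain_value_and_grad(fs, x0):
        x, vjps = x0, []
        for f in fs:                                      # forward pass: store a VJP per node
            x, vjp = f(x)
            vjps.append(vjp)
        v = np.ones_like(x)                               # seed: gradient of sum(y) w.r.t. y
        for vjp in reversed(vjps):                        # backward pass: chain the VJPs
            v = vjp(v)
        return x.sum(), v

    value, grad = chain_value_and_grad([make_sin(), make_square()], np.array([0.3, 1.2]))
    print(value, grad)    # gradient of sum(sin(x)^2) is 2 sin(x) cos(x)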
Reverse Mode AD for MLP
Terminology: feedforward networks with fully connected layers -> multilayer perceptrons (MLPs)

f(x) = f1(f0(x, w)),  with x1 = f0(x0, w) and x2 = f1(x1)

(Graph: leaf nodes x and w feed f1 = Matmul producing z1; f2 = ρ produces x1)
Reverse Mode AD for MLP
fW1,W2,W3,…,WJ(x) = ρ(WJ … W3 ρ(W2 ρ(W1 x)))

x_{i−1}   Input to layer i
z_i       Pre-activation
x_i       Post-activation

(Graph: x0 → Matmul(W1) → z1 → ρ → x1 → Matmul(W2) → z2 → ρ → x2 → … → xJ → l(·, y) → L)
Reverse Mode AD for MLP
fW1,W2,W3,…,WJ(x) = ρ(WJ … W3 ρ(W2 ρ(W1 x)))

(Graph: x0 → Matmul(W1) → z1 → ρ → x1 → Matmul(W2) → z2 → ρ → x2 → … → xJ → l(·, y) → L)

We want ∇WJ L, …, ∇W1 L

x_{j−1}   Input to layer j
z_j       Pre-activation
x_j       Post-activation

Forward pass:
  x0 ← x
  for j = 0 to J − 1:
      z_{j+1} ← W_{j+1} x_j
      x_{j+1} ← ρ(z_{j+1})
  L = l(xJ, y)

Local Jacobians:
  ∂z_{j+1}/∂x_j = W_{j+1}
  ∂x_j/∂z_j = diag(ρ′(z_j))
  ∂z_j/∂W_j = ?
Reverse Mode AD for MLP
(Graph: x0 → Matmul(W1) → z1 → ρ → … → x_{j−1} → Matmul(Wj) → z_j → ρ → x_j → … → xJ → l(·, y) → L)

fW1,W2,W3,…,WJ(x) = ρ(WJ … W3 ρ(W2 ρ(W1 x)))

We want ∇WJ L, …, ∇W1 L

Forward pass:
  x0 ← x
  for j = 0 to J − 1:
      z_{j+1} ← W_{j+1} x_j
      x_{j+1} ← ρ(z_{j+1})
  L = l(xJ, y)

VJPs (vector-Jacobian products):
  ∂L/∂xJ = ∂l(xJ, y)/∂xJ
  ∂L/∂zj = (∂L/∂xj) (∂xj/∂zj) = (∂L/∂xj) diag(ρ′(zj)) = (∂L/∂xj) ∘ ρ′(zj)
  ∂L/∂x_{j−1} = (∂L/∂zj) (∂zj/∂x_{j−1}) = (∂L/∂zj) Wj
  ∂L/∂Wj = (∂L/∂zj) (∂zj/∂Wj) = (∂L/∂zj)ᵀ x_{j−1}ᵀ
Reverse Mode AD for MLP
(Graph: x0 → Matmul(W1) → z1 → ρ → … → x_{j−1} → Matmul(Wj) → z_j → ρ → x_j → … → xJ → l(·, y) → L)

We want ∇WJ L, …, ∇W1 L

Forward pass:                      Backward pass:
  x0 ← x                             v = ∇xJ L = ∇xJ l(xJ, y)
  for j = 0 to J − 1:                for j = J to 1:
      z_{j+1} ← W_{j+1} x_j              v ← ∇zj L = v ∘ ρ′(zj)
      x_{j+1} ← ρ(z_{j+1})               ∇Wj L = v x_{j−1}ᵀ
  L = l(xJ, y)                           v ← ∇x_{j−1} L = Wjᵀ v

Note: here we keep everything as column vectors
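A numpy sketch of the forward/backward pass above for a 2-layer ReLU MLP with squared-error loss; the layer sizes and random data are arbitrary, and everything is kept as column vectors as on the slide.

    # MLP forward pass and backprop by hand (numpy assumed).
    import numpy as np

    rho = lambda z: np.maximum(z, 0.0)              # ReLU
    rho_prime = lambda z: (z > 0).astype(float)

    rng = np.random.default_rng(0)
    W = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]   # W1: 3->4, W2: 4->2
    x = rng.normal(size=(3, 1))
    y = rng.normal(size=(2, 1))

    # Forward pass, storing pre- and post-activations
    xs, zs = [x], []
    for Wj in W:
        zs.append(Wj @ xs[-1])
        xs.append(rho(zs[-1]))
    L = 0.5 * np.sum((xs[-1] - y) ** 2)

    # Backward pass: v carries the running gradient
    v = xs[-1] - y                                  # ∇_{xJ} L for squared error
    grads = [None] * len(W)
    for j in reversed(range(len(W))):
        v = v * rho_prime(zs[j])                    # ∇_{zj} L = v ∘ ρ'(zj)
        grads[j] = v @ xs[j].T                      # ∇_{Wj} L = v x_{j-1}^T
        v = W[j].T @ v                              # ∇_{x_{j-1}} L = Wj^T v
    print(L, [g.shape for g in grads])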

Speed for MLP
Forward pass:                      Backward pass:
  x0 ← x                             v = ∇xJ L = ∇xJ l(xJ, y)
  for j = 0 to J − 1:                for j = J to 1:
      z_{j+1} ← W_{j+1} x_j              v ← ∇zj L = v ∘ ρ′(zj)
      x_{j+1} ← ρ(z_{j+1})               ∇Wj L = v x_{j−1}ᵀ
  L = l(xJ, y)                           v ← ∇x_{j−1} L = Wjᵀ v

• Finite differences require 2D forward passes, with D parameters

• Reverse mode AD: often ~2x the cost of a forward pass

• Forward mode AD: the cost per forward pass would increase with the width
Group Activity
• In a group, work through the following - 20 minutes

• Consider:  y = w2ᵀ tanh(W1 x + b)

• Draw the computation graph

• Nodes are input or computed variables

• Find systematically the expressions for  ∂y/∂w2,  ∂y/∂W1,  ∂y/∂b

W1 = [ 0.5  0.5 ;  −1  1 ],   x = [ 1.0 ;  0.5 ],   b = [ 0 ;  0 ]

Note: tanh′(x) = 1 − tanh²(x)

Problem
y = w2ᵀ tanh(W1 x + b)

Note: there are different valid graphs depending on how you define the primitive ops
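One way to check the group-activity gradients numerically is with torch autograd; note that w2 is not specified on the slide, so an arbitrary value is assumed here, and W1, x, b follow the values listed above.

    # Numerical check of ∂y/∂w2, ∂y/∂W1, ∂y/∂b for y = w2^T tanh(W1 x + b).
    import torch

    W1 = torch.tensor([[0.5, 0.5], [-1.0, 1.0]], requires_grad=True)
    b = torch.tensor([0.0, 0.0], requires_grad=True)
    w2 = torch.tensor([1.0, -1.0], requires_grad=True)   # assumed, not given on the slide
    x = torch.tensor([1.0, 0.5])

    y = w2 @ torch.tanh(W1 @ x + b)
    y.backward()
    print(w2.grad)   # ∂y/∂w2 = tanh(W1 x + b)
    print(W1.grad)   # ∂y/∂W1 = (w2 ∘ tanh'(W1 x + b)) x^T
    print(b.grad)    # ∂y/∂b  = w2 ∘ tanh'(W1 x + b)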
More Complex Graphs

(Graph: x → f1 → x1; x1 branches into x2 = f2(x1) and x3 = f4(x1); both paths feed x4 via f3 and f5, and f6 maps x4 to y)

∂y/∂x0 = (∂y/∂x4) [ (∂x4/∂x2) (∂x2/∂x1) + (∂x4/∂x3) (∂x3/∂x1) ] (∂x1/∂x0)

∂y/∂xj = Σ_{s ∈ Child(j)} (∂y/∂xs) (∂xs/∂xj)
Topological Sort

(Graph: x → f1 → x1, with x1 feeding x2 and x3, which feed x4 → y)

(Figure: the same graph redrawn with the nodes laid out in a topologically sorted order, so that every node appears after its parents)
Reverse AD over General Graph
(Graph: the general graph above, with nodes in topologically sorted order)

For the last node, define xJ := L

Forward pass:
  x1, …, xJ ← topological sort(Graph)
  for j = 1 to J:
      xj ← fj(Parent1(xj), …, ParentK(xj))

Backward pass:
  ∇xJ L = ∇xJ xJ = 1
  for j = J − 1 to 1:
      (∇xj L)ᵀ ← Σ_{k ∈ Child(xj)} (∇xk L)ᵀ (∂xk/∂xj)
Recap of Terminology
• Backpropagation is how we compute the gradient

• It is not Gradient Descent, which is how we optimize the objective

• Automatic Differentiation

• More general than backprop

• Backprop is essentially reverse mode AD for a scalar output

• Autograd

• A specific package implementing automatic differentiation

• Predecessor of torch autograd
Vanishing Gradients
(Graph: x0 → Matmul(W1) → z1 → ρ → x1 → Matmul(W2) → z2 → ρ → x2 → … → xJ → l(·, y) → L)

∂L/∂x1 = (∂L/∂xJ) Π_{j=J..2} diag(ρ′(zj)) Wj

using  ∂L/∂zj = (∂L/∂xj) diag(ρ′(zj))   and   ∂L/∂x_{j−1} = (∂L/∂zj) Wj

• It has been observed for feedforward nets that the gradient signal degrades with depth

• This makes adapting the lower layers difficult

• Depends on the distribution of the initial weights

• RNNs suffer from the same issue

Glorot, Xavier, and Yoshua Bengio. "Understanding the difficulty of training deep feedforward neural networks." AISTATS, JMLR Workshop and Conference Proceedings, 2010.

Sepp Hochreiter, Master's Thesis
Vanishing and Exploding Gradients

• For many decades this was believed to be the main issue in training deep networks

• Recurrent networks did not work at all until circa 2014; this was attributed to the vanishing/exploding gradients problem

(Optional for now)


Vanishing and Exploding Gradients
• Various ways to address this

• Most of these will be discussed in more detail in future lectures

• Gradient clipping for exploding gradients

• Shortcut connections (LSTMs and ResNets)

• Normalization techniques

• Better activation selection

• Initialization

• Trying to keep matrices orthogonal

• Alternatives to gradient based learning (bypass bprop)

(Optional for now)


Deep Learning Frameworks

Theano (deprecated), TensorFlow, MxNet, PyTorch

(Stack diagram: model/training-building front ends, built on automatic differentiation, built on a tensor library)
Deep Learning Frameworks (Pytorch)

Tensor Library

• Built on tensor libraries (similar to numpy)

• Backends for operations on the GPU
DL Frameworks

• Backends for operations on the GPU

(Diagram: tensors a and b live in RAM and are processed by the CPU, or in GPU memory and processed by the GPU; image from Nvidia)
DL Frameworks: Autodiff
(Stack: automatic differentiation built on the tensor library)

(Graph: x → f1 → x1 → f2 → x2 → f3 → x4 → y, with a branch through x3)

• Automates the construction of the computation graph and the backward pass
DL Frameworks: Autodiff

• Frameworks allow us to define primitives and optimize their forward and backward computation

• Optimized primitives can be chained together to form complex models
DL Frameworks: Autodiff
(Stack: automatic differentiation built on the tensor library)

• Automatic differentiation tools

• Only need to specify the forward pass behaviour if using predefined primitives

• Obtain the computation graph ahead of time (theano, tensorflow v1, mxnet v1)

• Or on the fly (pytorch, mxnet gluon, tensorflow v2)
On the Fly Construction (Tracing)

• Each torch tensor created with requires_grad=True will be recognized by torch autograd for building computation graphs

• The graph is constructed on the fly by storing, for each node, a reference to the parent nodes and the functions applied
Torch Autograd
(Example traced graph: leaf tensors a and b feed a Matmul producing y, which feeds a cost c)
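A minimal sketch of this on-the-fly tracing with torch autograd; the tensor shapes and the cost are arbitrary.

    # Tensors created with requires_grad=True are tracked; backward() walks the traced graph.
    import torch

    a = torch.randn(3, 2, requires_grad=True)   # leaf tensor, tracked by autograd
    b = torch.randn(2, requires_grad=True)      # leaf tensor, tracked by autograd
    y = a @ b                                   # Matmul node recorded on the fly
    cost = (y ** 2).sum()                       # scalar cost node
    cost.backward()                             # reverse-mode AD over the traced graph
    print(a.grad.shape, b.grad.shape)           # gradients accumulated on the leaves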
Barebones Autograd Implementations

Mathieu Blondel
https://github.com/mblondel/teaching/blob/main/autodiff-2020/autodiff.py

Andrej Karpathy
https://github.com/karpathy/micrograd

Matt Johnson
https://github.com/mattjj/autodidact
DL Frameworks: Autograd and Pytorch
• Mini-Autograd from Mathieu Blondel: https://github.com/mblondel/teaching/blob/main/autodiff-2020/autodiff.py

Forward pass:
  x1, …, xJ ← topological sort(Graph)
  for j = 1 to J:
      xj ← fj(Parent1(xj), …, ParentK(xj))

Store at each node: its Value and its Grad

(Graph: the general computation graph from before)
DL Frameworks: Autograd and Pytorch
Backward pass:
  ∇xJ L = ∇xJ xJ = 1
  for j = J − 1 to 1:
      (∇xj L)ᵀ ← Σ_{k ∈ Child(xj)} (∇xk L)ᵀ (∂xk/∂xj)

Store at each node: its Value and its Grad

(Graph: the general computation graph from before)
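A tiny sketch in the spirit of the barebones/mini-autograd implementations referenced above: each node stores its value, its parents, and its gradient; the forward pass builds the graph, and backward() topologically sorts it and accumulates gradients child to parent. Only add and multiply are implemented, purely for illustration.

    # A micrograd-style reverse-mode AD sketch over a general graph.
    class Node:
        def __init__(self, value, parents=(), backward=lambda g: ()):
            self.value, self.parents, self._backward, self.grad = value, parents, backward, 0.0

        def __add__(self, other):
            return Node(self.value + other.value, (self, other), lambda g: (g, g))

        def __mul__(self, other):
            return Node(self.value * other.value, (self, other),
                        lambda g: (g * other.value, g * self.value))

        def backward(self):
            # Topologically sort the graph, then accumulate gradients child-to-parent
            order, seen = [], set()
            def visit(n):
                if n not in seen:
                    seen.add(n)
                    for p in n.parents:
                        visit(p)
                    order.append(n)
            visit(self)
            self.grad = 1.0
            for n in reversed(order):
                for p, g in zip(n.parents, n._backward(n.grad)):
                    p.grad += g

    x, w = Node(2.0), Node(3.0)
    y = x * w + x          # y = x*w + x
    y.backward()
    print(x.grad, w.grad)  # ∂y/∂x = w + 1 = 4, ∂y/∂w = x = 2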
DL Frameworks: Model Building

torch.nn.Module

(Stack: model-building front ends, on top of automatic differentiation, on top of the tensor library)

• Simple ways to track and manipulate all parameters of large models

• Allows one to easily build and plug-and-play layers

• Easily specify the parameters and initialize them

• Describe the forward pass behaviour

• Designed to work well with training pipelines
DL Frameworks: Model Building
torch.nn.Module

• Provides commonly used modules that build on top of each other

• Tracks parameters initialized in __init__

• Specify the forward pass behaviour

• Put all parameter tensors on the GPU with one call
DL Frameworks: Model Building

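A minimal torch.nn.Module sketch in the spirit of the code shown on this slide; the layer sizes are arbitrary placeholders.

    # A small nn.Module: parameters are tracked automatically; only forward is described.
    import torch

    class SmallMLP(torch.nn.Module):
        def __init__(self, d_in=2, d_hidden=16, d_out=1):
            super().__init__()
            # Parameters created here are registered and tracked by the module
            self.fc1 = torch.nn.Linear(d_in, d_hidden)
            self.fc2 = torch.nn.Linear(d_hidden, d_out)

        def forward(self, x):
            # Describe only the forward pass; autograd handles the backward pass
            return self.fc2(torch.relu(self.fc1(x)))

    model = SmallMLP()
    print(sum(p.numel() for p in model.parameters()))   # all parameters are tracked
    # model.cuda()  # would move every parameter tensor to the GPU in one call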
References

• Differential Programming by Gabriel Peyre

• Automatic Differentiation Slides by Roger Grosse

• Autodiff Slides and Code from Mathieu Blondel
Gradient Descent in 2D
Visualization in 2 dimensions using contours

(Figure: contour plot with iterates xt0, xt1, xt2, xt3 approaching the minimum along coordinates [x]1 and [x]2)

Initialize:  x0

Iterate:  xt+1 = xt − α ∇f(xt)

Stopping criterion:  | f(xt+1) − f(xt) | < ϵ
Convexity

For θ ∈ [0,1]:   f(θx + (1 − θ)y) ≤ θ f(x) + (1 − θ) f(y)

• Choose two points and draw the line between them

• Convex if the line is above the graph

(Figures: a convex and a non-convex function with the chord between points x and y)
Convex Optimization
Convex function: for θ ∈ [0,1],   f(θx + (1 − θ)y) ≤ θ f(x) + (1 − θ) f(y)

• Many optimization methods come from convex optimization

(Figure: a convex function with the chord between points x and y)
Local and Global Minimum
• Stationary Points

• Global Minimum

• Local Minimum

• For convex functions, local minima are global minima


Convex Optimization
f(θx + (1 − θ)y) ≤ θf(x) + (1 − θ)f(y)

• Linear models + standard loss functions typically yield convex problems

• Convex optimization often gives us provable ways to reach a global minimum of the function

• Deep networks + standard loss functions typically yield a highly non-convex optimization problem
Non-convex optimization

• Non-convex optimization is NP-hard

• Fortunately, the objective and loss surface in deep networks have properties that allow reasonable solutions in practice
Convergence of Gradient Descent

wk+1 = wk − α ∇f(wk)

(Figure: contour plot with iterates w0, w1, w3, w4 along coordinates [w]1 and [w]2)

Under conditions (to follow), for convex functions we can show

f(wk) − f(w*) ≤ ||w0 − w*||²₂ / (2αk)

wk   k-th iterate
w*   Minimum
α    Learning rate

Converges at rate O(1/k)

Bertsekas et al. "Gradient Convergence in Gradient Methods with Errors."
Convergence of Gradient Descent
f(wk) − f(w*) ≤ ||w0 − w*||²₂ / (2αk)

1. We assume a Lipschitz continuous gradient:

||∇f(x) − ∇f(y)||₂ / ||x − y||₂ ≤ L

• Limits how fast gradients can change

• Often valid

2. Learning rate α ≤ 1/L
Hessian
• Matrix of all second partial derivatives, for f : ℛ^D → ℛ

• Characterizes curvature in high dimensions

Hf = ∇²f =
    [ ∂²f/∂w1²      …   ∂²f/∂w1∂wD ]
    [     ⋮                   ⋮      ]
    [ ∂²f/∂wD∂w1   …   ∂²f/∂wD²    ]
Gradient Descent
||∇f(x) − ∇f(y)||₂ / ||x − y||₂ ≤ L

∇²f(w) = H

Slide courtesy of Mark Schmidt


Gradient Descent

Slide courtesy of Mark Schmidt


Descent Lemma

Slide courtesy of Mark Schmidt


Convergence of Gradient Descent

Slide courtesy of Mark Schmidt

(Optional)
Convergence of Gradient Descent
Guaranteed progress

(Optional)
Gradient Descent for NN

Empirical Risk Minimization:  w* = arg min_w (1/n) Σ_{i=1}^n l(fw(xi), yi)

e.g.  l(fw(x), y) = (1/2) (fw(x) − y)²

ℒ(X, Y, w) = (1/n) Σ_{i=1}^n l(fw(xi), yi) = (1/2n) ||Y − fw(X)||²

Gradient of the objective with respect to the weights w:  ∇w ℒ(X, Y, w)

w     All parameters of the model
X, Y  Data matrix and labels
Stochastic Gradient Descent
Gradient Descent (GD): gradient of the full objective
    ∇w ℒ(X, Y, w)
    wt+1 = wt − α ∇w ℒ(X, Y, wt)

Stochastic GD (SGD): gradient of the loss w.r.t. 1 sample
    ∇w l(x, y, w)
    wt+1 = wt − α ∇w l(x, y, wt)

Mini-batch SGD: gradient of the loss w.r.t. a sub-sample Xn ⊂ X
    ∇w ℒ(Xn, Yn, w)
    wt+1 = wt − α ∇w ℒ(Xn, Yn, wt)

Image credits: Towards Data Science
Intuition about Stochastic Methods
(Different mini-batches give different gradient estimates:)

(1/|S1|) Σ_{i∈S1} ∇w l(xi, yi, w)     for mini-batch S1

(1/|S2|) Σ_{i∈S2} ∇w l(xi, yi, w)     for mini-batch S2
Stochastic vs Gradient Descent

• Stochastic gradient descent is much more scalable to large datasets

• E.g., in convex settings, convergence rates can be shown to be similar to GD while processing only one point at a time

• Stochastic gradient descent is a classic optimization method that is the backbone of most modern neural network training
Mini-Batch SGD
• Classical SGD uses a single point

• Mini-batch SGD can be seen as obtaining a lower-variance gradient estimate, without the need for a full batch:

(1/|Xn|) Σ_{(xi,yi)∈Xn} ∇w l(xi, yi, w)     for a sub-sample Xn ⊂ X

• Mini-batch forward and backward processing is often more efficient on a single GPU

Terminology: recently, mini-batch SGD is in many contexts just called SGD
Mini-Batch SGD Algo
Mini-Batch SGD in Code
(Graph: x0 → Matmul(W1), add b1 → z1 → ρ → x1 → Matmul(W2), add b2 → z2 → ρ → Loss)

• Each node holds a gradient buffer
Mini-Batch SGD in Code
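A hedged sketch of what a mini-batch SGD loop looks like in PyTorch; the model, toy data, batch size, and learning rate are arbitrary placeholders rather than the code from the slide.

    # Mini-batch SGD training loop (PyTorch assumed).
    import torch

    X = torch.randn(1000, 2)                      # toy data: 1000 samples, 2 features
    Y = torch.randn(1000, 1)
    model = torch.nn.Sequential(torch.nn.Linear(2, 16), torch.nn.ReLU(), torch.nn.Linear(16, 1))
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = torch.nn.MSELoss()

    for epoch in range(5):
        perm = torch.randperm(X.shape[0])         # sample mini-batches without replacement
        for i in range(0, X.shape[0], 32):
            idx = perm[i:i + 32]
            loss = loss_fn(model(X[idx]), Y[idx])
            opt.zero_grad()                       # clear each parameter's gradient buffer
            loss.backward()                       # backprop fills the .grad buffers
            opt.step()                            # w <- w - lr * grad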
Stochastic Gradient Descent
• Stochastic gradient descent can be shown to reach the global minimum in convex settings

• SGD can be shown to reach a stationary point in non-convex settings under some assumptions

• Assumption - Appropriate step size sequence

• Assumption - Lipschitz continuity


Minimum in Non-Convex Setting

• Are local minima useful?


Local Minimum Can be Useful

• It has been empirically shown in some cases that many local minima of the highly non-convex functions reached by SGD are "close" to the global minimum; why this occurs is still poorly understood

• With many assumptions and for specific model classes, recent results have begun to show that SGD can converge to global minima
GD Step Size
• The step size, or learning rate, can often greatly affect the optimization

Slide credit: Roger Grosse
Learning Rate Schedules SGD
(Figure: SGD iterates vs. the true gradient direction near the minimum)

• In stochastic optimization, larger learning rates, even if they are of the correct size to reach the minimum, may bounce around without hitting the solution

• As with GD, very small learning rates would be too slow

• In SGD we typically start at a high learning rate and decay it
Momentum

g = ∇w ℒ(X, Y, wt)
vt+1 = μ * vt + g
wt+1 = wt − α * vt+1

• Popular and simple approach to speed up and stabilize learning

• Dampens oscillations and noise from noisy gradients

• Often accelerates training
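A small numpy sketch of the momentum update above; the quadratic objective, μ, and α are arbitrary illustrative choices.

    # SGD with momentum on f(w) = 0.5 ||w||^2 (numpy assumed).
    import numpy as np

    grad = lambda w: w                      # gradient of f(w) = 0.5 ||w||^2
    w = np.array([3.0, -2.0])
    v = np.zeros_like(w)
    mu, alpha = 0.9, 0.1
    for t in range(100):
        g = grad(w)
        v = mu * v + g                      # v_{t+1} = μ v_t + g
        w = w - alpha * v                   # w_{t+1} = w_t - α v_{t+1}
    print(w)                                # approaches the minimum at the origin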


Visualizations

https://distill.pub/2017/momentum/
Adaptive Learning Rate

• Attempt to adjust the learning rate based on rules to better suit local curvature, without explicitly representing the Hessian

• Typically based on heuristics


RMSProp
• Individually adapt learning rate of each parameter

• Divide the learning rate for a weight by a running average of the


magnitudes of recent gradients for that weight.

• Parameters with very large gradients have less effect
Adam
• Extremely popular optimization algorithm

• Can be seen as a combination of momentum and RMSProp

• Adam rose to prominence in training models where learning rate schedules were extremely difficult to determine

• Robust default parameters

Kingma, D., and Ba, J. "Adam: A Method for Stochastic Optimization."
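A short sketch of using Adam (and, commented out, RMSProp) through torch.optim; the model and data are arbitrary placeholders, and the hyperparameters shown are the common defaults.

    # One optimizer step with Adam via torch.optim.
    import torch

    model = torch.nn.Linear(10, 1)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
    # opt = torch.optim.RMSprop(model.parameters(), lr=1e-3, alpha=0.99)

    x, y = torch.randn(32, 10), torch.randn(32, 1)
    loss = torch.nn.functional.mse_loss(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()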
Second-order optimization

f(w) = ℒ(X, Y, w) = (1/n) Σ_{i=1}^n l(gw(xi), yi) = (1/2n) ||Y − gw(X)||²

Second-order Taylor approximation around v:

f̂(w) ≈ f(v) + ∇f(v)ᵀ (w − v) + (1/2) (w − v)ᵀ ∇²f(v) (w − v)
Newton Method

f̂(w) ≈ f(v) + ∇f(v)ᵀ (w − v) + (1/2) (w − v)ᵀ ∇²f(v) (w − v),   with H = ∇²f(v)

Solve the local approximation:

min_w  f(v) + ∇f(v)ᵀ (w − v) + (1/2) (w − v)ᵀ H (w − v)

w* = v − H⁻¹ ∇f(v)
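A numpy sketch of a single Newton step w* = v − H⁻¹ ∇f(v) on a small quadratic, where one step lands exactly on the minimizer; the matrix A and vector b are arbitrary.

    # One Newton step on f(w) = 0.5 w^T A w - b^T w (numpy assumed).
    import numpy as np

    A = np.array([[3.0, 1.0], [1.0, 2.0]])       # positive definite
    b = np.array([1.0, -1.0])

    grad = lambda w: A @ w - b                   # ∇f(w)
    H = A                                        # Hessian is constant for a quadratic

    v = np.array([5.0, 5.0])
    w_star = v - np.linalg.solve(H, grad(v))     # Newton step (solve instead of explicit inverse)
    print(w_star, np.linalg.solve(A, b))         # matches the exact minimizer A^{-1} b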
Second-order optimization

f̂(w) ≈ f(v) + ∇f(v)ᵀ (w − v) + (1/2) (w − v)ᵀ H (w − v)

• Requires the inverse Hessian

• Large memory and computation: O(D³) compute and O(D²) memory
Second-order optimization

f̂(w) ≈ f(v) + ∇f(v)ᵀ (w − v) + (1/2) (w − v)ᵀ H (w − v)

v* = v − H⁻¹ ∇f(v)

• Common second-order approaches attempt to approximate the Hessian

• BFGS is one of the more successful

• Still maintains O(D²) computation and O(D²) memory cost

• L-BFGS further reduces the memory cost; very popular outside of DL

• KFAC, Krylov subspace methods

Second-order optimization

• In practice, at the moment, second-order methods are rarely used in deep learning

• Many properties of these methods, particularly in estimating the inverse Hessian, and their behaviour in the stochastic setting are not yet well understood

• They hold promise for potentially more rapid optimization

• Several promising results using novel hessian approximations


(KFAC)

Ba, Jimmy, Roger Grosse, and James Martens. "Distributed second-order optimization using Kronecker-factored approximations." (2016).
Generalization and Optimization

min_w  (1/n) Σ_{i=1}^n l(fw(xi), yi) + Ω(w)

• Classic machine learning often treats the optimization of the objective function and the properties of the optimum as separate concepts

• In practice, a lot of success is due to implicit regularization from the optimization methods
SGD is good for generalization

• A growing body of evidence shows that SGD has better generalization properties

• Intuitively, the process of sampling the training data in SGD mimics the process of sampling the train/test split

• Several works empirically show SGD can find "flatter minima"

• This makes it particularly hard to theoretically analyze optimization algorithms in DL

Kuzborskij, Ilja, and Christoph H. Lampert. "Data-Dependent Stability of Stochastic Gradient Descent." ICML 2018.
Hardt, Moritz, Ben Recht, and Yoram Singer. "Train faster, generalize better: Stability of stochastic gradient descent." ICML 2016.
SGD is good for generalization
• Several works argue that SGD with small mini-batches can find flatter minima, and that flat minima generalize better

Hochreiter and Schmidhuber. "Flat Minima." 1997.

Keskar, Nitish Shirish, et al. "On large-batch training for deep learning: Generalization gap and sharp minima." arXiv preprint arXiv:1609.04836 (2016).
SGD and Generalization
Kaiming He's 2015 ImageNet competition winner

Why don't we decay the learning rate in these flat regions?

Initial training at a high learning rate has been observed to act as a regularizer.
Some initial explanations for this effect have appeared in the literature:

Li, Yuanzhi, Colin Wei, and Tengyu Ma. "Towards explaining the regularization effect of initial large learning rate in training neural networks." Advances in Neural Information Processing Systems. 2019.
Distributed Optimization
Parallelizing Deep Network Training
• The most common form of parallelism is data parallelism

• Each node simultaneously processes different mini-batches

• Model parallelism - attempts to split the model across nodes

• Difficult to parallelize in some cases

Image from Wikimedia
Distributed Synchronous SGD
• The most common approach is distributed synchronous SGD

• Nodes (GPUs) sample data and wait to receive parameters from a parameter server

• The parameter server waits to aggregate gradients from all the nodes, then sends new parameters

(Diagram: GPUs 1-3 each read data, send gradients to a parameter server, and receive updated parameters)
DataParallel SGD in PyTorch

(Example: a mini-batch of size 300 with data dimensionality 500 is sampled from a huge dataset; the 300x500 batch is split into three 100x500 chunks, Data1/Data2/Data3, one per GPU (GPU 0, 1, 2); each GPU computes a gradient with its copy of the parameters)

DataParallel + SGD in PyTorch
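A hedged sketch of data-parallel SGD with torch.nn.DataParallel (assuming one or more CUDA devices are available); the model, batch, and learning rate are arbitrary placeholders rather than the slide's exact code.

    # Data-parallel forward/backward with torch.nn.DataParallel.
    import torch

    model = torch.nn.Linear(500, 10)
    if torch.cuda.is_available():
        model = torch.nn.DataParallel(model).cuda()   # replicates the model across visible GPUs

    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    x = torch.randn(300, 500)                          # a 300x500 mini-batch, split across GPUs
    y = torch.randn(300, 10)
    if torch.cuda.is_available():
        x, y = x.cuda(), y.cuda()

    loss = torch.nn.functional.mse_loss(model(x), y)   # inputs are scattered, outputs gathered
    opt.zero_grad()
    loss.backward()                                    # gradients are reduced onto the main device
    opt.step()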
Distributed Synchronous SGD
What are some issues with distributed synchronous SGD?

• Bandwidth needs to be high for synchronous SGD

• Requires a central node

(Diagram: GPU 0, GPU 1, GPU 2 communicating with a central node)
Batch size and learning rates

• With many available GPUs we would want to increase the batch size to maximize utilization

• From the point of view of variance reduction, we should multiply the learning rate by sqrt(k) for a k-fold increase in batch size

• In practice, various other rules are used, most notably:

Goyal, Priya, et al. "Accurate, large minibatch SGD: Training ImageNet in 1 hour." arXiv preprint arXiv:1706.02677 (2017).