
Pre-requisites

• Linear Algebra, Multivariable Calculus, Probability and Statistics, Algorithms

• Machine learning class or prior equivalent experience

• Experience with Python and scientific toolkits (e.g. numpy, scipy, matplotlib, sklearn) is recommended
Outline

• Introduction to Neural Networks and Representation Learning

• Backpropagation and Automatic Differentiation Software

• Optimization for Deep Learning

• Regularization and Implicit Regularization

• CNNs and Representation Learning

• Interpretability in Deep Learning (Guest Lecture)

• Generalization, Memorization, and Adversarial Examples

Outline

• RNNs and sequence models

• Attention and Self-Attention

• Multi-Task and Transfer Learning

• Deep Generative Models

• Deep Metric Learning

• Self-supervised Learning

• (If time allows) Basics of Deep Reinforcement Learning

Deep Learning

• Rebranding of (deep) Neural Networks - universal function approximator

• Representation Learning - avoid hand-crafted features

• Modular but powerful/expressive model framework

Applications
Autonomous Driving, Speech Recognition, Automated Shopping, Machine Translation

(Images from Wikimedia Commons)
Group Activity

• 10 minutes - in groups of 3-5, discuss

• Come up with a list of 1-2 Deep Learning applications that are not already mentioned

• In your group, decide on an application which you'd like to best understand after this course
Neural Networks
• Simple computation blocks

x1, x2   Inputs
w1, w2   Weights
b        Bias
σ        Activation function
h1       Neuron activation

h(x1, x2) = σ(w1 x1 + w2 x2 + b)
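As a quick illustration (not on the original slide), a single neuron can be written in a few lines of numpy; the input, weight, and bias values below are arbitrary.

    # A minimal single-neuron sketch with a sigmoid activation (numpy assumed).
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def neuron(x, w, b):
        # h(x1, x2) = sigma(w1*x1 + w2*x2 + b)
        return sigmoid(np.dot(w, x) + b)

    x = np.array([1.0, 2.0])   # inputs x1, x2
    w = np.array([0.5, -0.3])  # weights w1, w2
    b = 0.1                    # bias
    print(neuron(x, w, b))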
Neural Networks
• Simple computation blocks that work together

Width (units per layer) and Depth (number of layers): Input Layer (x1, x2) → Layer 1 (h1(1), h2(1), h3(1)) → Layer 2 (h1(2), h2(2), h3(2)) → output y

h1(2) = σ(w11(2) h1(1) + w21(2) h2(1) + w31(2) h3(1) + b1(2))
Neural Networks
• Linear models can be thought of as neural networks

(Diagram: output y connected directly to inputs x1, x2 by weights)

• Usually, in the feedforward case, "deep networks" refers to those with 2 or more hidden layers
Neural Networks Review
• Can be written as a composition of simple linear operators + pointwise non-linearities

fW1,W2,W3(x) = W3 ρ(W2 ρ(W1 x))

Wi   Matrix of parameters at layer i
ρ    Pointwise non-linearity
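A minimal numpy sketch of this composition, using ReLU as the pointwise non-linearity ρ; the layer sizes and random weights are arbitrary choices for illustration.

    # Forward pass of f(x) = W3 ρ(W2 ρ(W1 x)) with ReLU as ρ (numpy assumed).
    import numpy as np

    rho = lambda z: np.maximum(z, 0.0)   # pointwise non-linearity

    rng = np.random.default_rng(0)
    W1 = rng.normal(size=(3, 2))   # layer 1: 2 inputs -> 3 hidden units
    W2 = rng.normal(size=(3, 3))   # layer 2: 3 -> 3
    W3 = rng.normal(size=(1, 3))   # output layer: 3 -> 1

    def f(x):
        return W3 @ rho(W2 @ rho(W1 @ x))

    x = np.array([0.5, -1.0])
    print(f(x))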
Neural Networks Review
• Can incorporate bias terms in W to simplify notation

h(1) = ρ(W1 x),   where W1 = [W′1  b] and x = [x′; 1]

• Aspects of neural networks have been historically inspired by biological systems

Images from Jean-Louis Queguiner
Biologically Plausible
Be careful not to take the biological analogy too far
• Biological neurons are much more complex and have a large variety

• Timing is (potentially) important in real neuronal activity (spike timing)

Lobo, J., et al. (2020). Spiking Neural Networks and online learning: An overview and perspectives. Neural Networks.

• The learning algorithm in brains is unknown! It may not be anything like existing methods for NNs

• Brains have feedback and recurrence

• Many other huge differences
Representational Power of Linear Models

fW(x) = W1 x        fW(x) = W2 ρ(W1 x)
Linear Deep Network
fW1,W2,W3(x) = W3 ρ(W2 ρ(W1 x))

Wi   Matrix of parameters at layer i
ρ    Identity function

Can we approximate this?
Capacity of Neural Networks
• The expressive power of neural networks

Non-linear 1-hidden-layer network
Topics: single hidden layer neural network

(Figure: a single-hidden-layer network over inputs x1, x2 with hidden units y1-y4 combining into an output z1)

From Pascal Vincent and Hugo Larochelle's slides
Playground TensorFlow

https://playground.tensorflow.org/
Universal Approximation Theorem
ftarget : ℛ^d → ℛ    Function we want to approximate

fW(x) = W2 ρ(W1 x)    Approximation parametrized by W1, W2

Under conditions on ρ (satisfied e.g. for the sigmoid), there exist W1, W2 which obtain at most ϵ error:

sup_{x ∈ [0,1]^d} | fW(x) − ftarget(x) | < ϵ

Cybenko 1989
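To make the theorem concrete, here is a small, hedged sketch (not from the slides) of fitting a 1-hidden-layer sigmoid network to a 1D target with PyTorch; the target function, width, learning rate, and step count are arbitrary.

    # Fitting a 1-hidden-layer network to a 1D target function (PyTorch assumed).
    import torch

    x = torch.linspace(0, 1, 256).unsqueeze(1)   # inputs in [0, 1]
    y = torch.sin(6 * x)                         # target function to approximate

    model = torch.nn.Sequential(
        torch.nn.Linear(1, 64),   # W1 (with bias)
        torch.nn.Sigmoid(),       # ρ
        torch.nn.Linear(64, 1),   # W2 (with bias)
    )
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)

    for step in range(2000):
        loss = ((model(x) - y) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

    print(loss.item())  # the error shrinks as width and training increase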
Why Deep?
Most functions of interest can be approximated with a single hidden layer network:
why do we need multiple layers?

• Deeper networks can much more efficiently represent many functions (in terms of number of parameters)

(Figure: a 1-hidden-layer network vs. an N-hidden-layer network; images from Wikimedia Commons)

• Deeper networks, for some data, may bias the learning process towards more general solutions
Capacity of Deep NN

• Several results exist showing that deep NNs are more parameter efficient

• Existing results are for restricted settings (on the target or the network) or make use of assumptions

Eldan, Ronen, and Ohad Shamir. "The power of depth for feedforward neural networks." Conference on learning theory. 2016.

Montufar, Guido F., et al. "On the number of linear regions of deep neural networks." Advances in neural information processing systems. 2014.


Capacity of Deep NN
• (Montufar et al., NIPS 2014) studies the number of piecewise-linear regions represented by ReLU networks; followed up by many works

(Figures: the ReLU non-linearity, a ReLU network, and examples of linear regions in 2D)
Montufar, Guido F., et al. "On the number of linear regions of deep neural networks." Advances in neural information processing systems. 2014.

Serra T. et al Bounding and Counting Linear Regions of Deep Neural Networks. Neurips 2017

Capacity of Deep NN
• ReLU networks compute piecewise-linear functions

Eldan, Ronen, and Ohad Shamir. "The power of depth for feedforward neural networks." Conference on learning theory. 2016.

Montufar, Guido F., et al. "On the number of linear regions of deep neural networks." Advances in neural information processing systems. 2014.

Serra T. et al Bounding and Counting Linear Regions of Deep Neural Networks. Neurips 2017


Capacity of Deep NN
• (Montufar et al., NIPS 2014)

• The number of linear regions in the output increases exponentially with depth and polynomially in width:

Ω( (n/d)^((L−1)d) n^d )

L   number of hidden layers
d   input dimension
n   width - number of hidden units per layer
Hanin, Boris, and David Rolnick. "Complexity of linear regions in deep networks." arXiv preprint arXiv:1901.09021 (2019).

Montufar, Guido F., et al. "On the number of linear regions of deep neural networks." Advances in neural information processing systems. 2014.

Depth Helps Generalization (Optional)

• Common conjecture based on empirical observations, with some existing but limited theoretical backing

• Some relevant work:

Urban, Gregor, et al. "Do deep convolutional nets really need to be deep and convolutional?." arXiv
preprint arXiv:1603.05691 (2016).

Arora, Sanjeev, et al. "Implicit regularization in deep matrix factorization." Advances in Neural
Information Processing Systems. 2019.

Pérez, Guillermo Valle, Chico Q. Camargo, and Ard A. Louis. "Deep learning generalizes because the
parameter-function map is biased towards simple functions." ICLR 2019


Deep Learning

• Rebranding of (deep) Neural Networks - universal function approximator

• Representation learning - avoid hand-crafted features

• Modular but powerful model framework

Supervised Representation Learning

• Linear models are fast and easy to use

• Designing hand-crafted features is thus tempting

• Difficult for humans to fully describe models of perception and the relevant features they use — representation learning can fill this gap
Representation Learning

• We can interpret intermediate hidden layer outputs as learned features

(Figure: hidden units h1(1), h2(1), h3(1) acting as "Head?", "Legs?", "Neck?" detectors feeding an output y = Giraffe; photo from Wikimedia Commons)
Representation Learning
• A key feature of representation learning is reusability: features from one category can be relevant in another, suggesting generality

(Figure: the same "Head?", "Legs?", "Neck?" features feeding an output y = Cat; photo from Wikimedia Commons)
Representation Learning
• In supervised settings we can consider the final layer as encouraging features that linearly separate the data

(Figure: the last hidden layer h1(2), h2(2), h3(2) viewed as learned "features" of the inputs x1, x2)
Representation Learning
• In order to optimize the learning objective with a given deep architecture, the learning process can discover relevant features of the data

• Intermediate representations in deep networks can find features similar to those designed by humans

• However, they can also learn ones we wouldn't think of

From Yann LeCun's slides
Representation Learning Re-usability
• A goal of representation learning is to be useful on new tasks and datasets

• With enough varied data, representation learning can often outperform hand-crafted features

• ImageNet-learned features are more useful for most computer vision tasks than hand-crafted features

• Can reuse representations in multiple ways

(Diagram: train an initial NN on a large Dataset 1, then reuse the trained network's representations on Dataset 2 / Task 2)
Unsupervised Rep. Learning
• A simple example of unsupervised representation learning: autoencoders

(Diagram: an encoder network maps the input to a representation and a decoder network reconstructs the input; photo from Wikimedia Commons)
Unsupervised Rep. Learning

• Unsupervised representation learning has been a major driver of deep learning research since 2006

(Diagram: train an initial NN on a large unlabeled Dataset 1, then use its representations on a labeled Dataset 2 / Task 2)

• Starting to show promising results since 2018

Deep Learning

• Rebranding of (deep) Neural Networks - universal function approximator

• Representation learning - avoid hand-crafted features

• Modular framework for high capacity models

A Modular Framework

• Gradient descent and the backpropagation algorithm lead to a set of tools for jointly adapting a modular system toward a single objective

(Diagram: Network 1 → Network 2 → Network 3 → task output)
A Modular Framework
(Diagram: assemble model components from a toolbox of reusable blocks, then learn the parameters jointly; photo from Wikimedia Commons)
A Modular Framework

• Software frameworks exist for jointly adjusting all model components

• Adding priors is easy and learning is formulaic

• A flexible framework to build models with strong approximation ability
Large Data and Compute
• Deep learning models and frameworks are well placed to take advantage of larger datasets and increasing compute

(Images from Nvidia Inc.)
More data + high capacity model

(Figure: model fits with 10,000 samples vs. 1 million samples)
Bigger Models Keep Being Better
• The original ImageNet dataset of ~1.2 million images was released in 2010

• For a given architecture class, bigger models increase performance — a power law

• More efficient architectures shift the curve over time

(Figure: accuracy vs. model size, from AlexNet (2012) through 2014-16, 2018, and 2020 architectures)
Neural Scaling “Laws” (Hypothesis)

• Empirically observed that models continue to improve with more data and bigger (more parameters) versions of existing model classes

• Performance gains follow a power law

• Recent language models continue to improve performance

• Observations have been consistent even a decade after the first large-dataset results with deep learning models
Motivating Examples: More Data and Image Recognition

• Data explosion

• Representations keep improving

• Some domains have many examples

• Other domains: more measurements/variables compared to examples
Language Modeling

Where do these empirical scaling observations break down?

Summary of some Trends

• More compute/accuracy-efficient models, enabled by modularity and ease of use -> continue to show improved results on many tasks

• More data + compute -> improved results on many tasks

• Bigger models + compute -> improved results on many tasks
When do we not use Deep Learning

• Many machine learning problems do not require deep learning or huge models

• Training speed — deep learning is slow

• Logistic regression, random forests, gradient boosting, and related methods can often be fit on a laptop for mid-sized problems, including hyperparameter search

• Linearly separable problems
When do we not use Deep Learning
• Smaller datasets, without external related data, are sometimes more easily solved with hand-crafted features + simpler classifiers

• Good feature extractors exist

• Deep learning methods particularly excel on perceptual data (images, speech, language)

• Interpretability is critical

• Tabular data can often be solved equally well by other methods

• This is a non-exhaustive list; there are many other situations as well!
Gradient Descent

min_x f(x)

(Figure: a function f(x) in one dimension)

Gradient

The negative gradient gives the direction of steepest descent

f : ℛ^D → ℛ,   ∇f(x) = [ ∂f(x)/∂x1, …, ∂f(x)/∂xD ]ᵀ

∇x f(x, z) = [ ∂f(x, z)/∂x1, …, ∂f(x, z)/∂xD ]ᵀ
Gradient Descent

min_x f(x)

Initialize:  x0

Iterate:  xt+1 = xt − α ∇f(xt)

Stopping criterion:  | f(xt+1) − f(xt) | < tolerance
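A minimal sketch of the gradient descent loop above on a simple quadratic; the objective, step size α, and tolerance are arbitrary illustrative choices.

    # Gradient descent on f(x) = 0.5 ||x||^2 (numpy assumed).
    import numpy as np

    def f(x):
        return 0.5 * np.sum(x ** 2)

    def grad_f(x):
        return x

    x = np.array([3.0, -2.0])     # initialize x0
    alpha, tol = 0.1, 1e-8
    for t in range(10_000):
        x_new = x - alpha * grad_f(x)          # x_{t+1} = x_t - alpha * grad f(x_t)
        if abs(f(x_new) - f(x)) < tol:         # stopping criterion
            break
        x = x_new
    print(x)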
Gradient Descent in 2D
Visualization in 2 dimensions using contours

(Figure: contour plot with iterates xt0, xt1, xt2, xt3 approaching the minimum along coordinates [x]1 and [x]2)

Initialize:  x0

Iterate:  xt+1 = xt − α ∇f(xt)

Stopping criterion:  | f(xt+1) − f(xt) | < ϵ
Gradient Descent for NN
fW1,W2,W3(x) = W3 ρ(W2 ρ(W1 x))

Wi   Matrix of parameters at layer i
ρ    Pointwise non-linearity
X    Data matrix, N samples x 2 features

w = [flat(W1), flat(W2), flat(W3)]   All parameters (flattened)

fw(x)
Gradient Descent for NN

Empirical Risk Minimization:  w* = arg min_w (1/n) Σ_{i=1}^n l(fw(xi), yi)

e.g.  l(fw(x), y) = (1/2) (fw(x) − y)²

ℒ(X, Y, w) = (1/n) Σ_{i=1}^n l(fw(xi), yi) = (1/2n) ||Y − fw(X)||²

Gradient of the objective with respect to the weights w:  ∇w ℒ(X, Y, w)

w     All parameters of the model
X, Y  Data matrix and labels
Gradient based learning

Gradient Descent (GD): gradient of the full objective
    ∇w ℒ(X, Y, w)
    wt+1 = wt − α ∇w ℒ(X, Y, wt)

Stochastic GD (SGD): gradient of the loss w.r.t. 1 sample
    ∇w l(x, y, w)
    wt+1 = wt − α ∇w l(x, y, wt)

Mini-batch SGD: gradient of the loss w.r.t. a sub-sample Xn ⊂ X
    ∇w ℒ(Xn, Yn, w)
    wt+1 = wt − α ∇w ℒ(Xn, Yn, wt)
Gradient-Based Optimization in ML

• Gradient-based optimization methods are critical in machine learning and especially in deep learning

• Deriving gradients becomes tedious as the number of components and their complexity grows

fW1,W2,W3(x) = W3 ρ(W2 ρ(W1 x))

• Changes to the model require re-deriving gradients
Computing the Gradient
∇w ℒ(X, Y, w) = [ ∂ℒ(X, Y, w)/∂w11(1), …, ∂ℒ(X, Y, w)/∂wKJ(I) ]ᵀ

• Finite differences

[∇w ℒ(X, Y, w)]1 = ∂ℒ(X, Y, w)/∂w11(1)
    ≈ [ ℒ(X, Y, w11(1) + ϵ, …, wkj(i), …, wKJ(I)) − ℒ(X, Y, w11(1) − ϵ, …, wkj(i), …, wKJ(I)) ] / (2ϵ)

What's wrong with this method of estimating the gradient?
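A sketch of central finite differences for estimating the gradient; the quadratic stand-in for ℒ and the choice of ϵ are assumptions made for illustration.

    # Central finite-difference gradient estimate (numpy assumed).
    import numpy as np

    def loss(w):
        return 0.5 * np.sum(w ** 2)

    def finite_diff_grad(loss, w, eps=1e-5):
        g = np.zeros_like(w)
        for i in range(w.size):            # 2 forward passes per parameter
            e = np.zeros_like(w)
            e[i] = eps
            g[i] = (loss(w + e) - loss(w - e)) / (2 * eps)
        return g

    w = np.array([1.0, -2.0, 3.0])
    print(finite_diff_grad(loss, w))       # close to the analytic gradient (here, w itself)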
Speed of Finite Difference

∂ℒ(X, Y, w)/∂w11(1) ≈ [ ℒ(X, Y, w11(1) + ϵ, …, wkj(i), …, wKJ(I)) − ℒ(X, Y, w11(1) − ϵ, …, wkj(i), …, wKJ(I)) ] / (2ϵ)

• Requires 2 forward passes for each component i of ∇w ℒ(X, Y, w)

• For d parameters: 2d forward passes (calls to the objective function)
Automatic Differentiation

• Automatic differentiation is a general term for a system that computes gradients without needing closed-form expressions

• Backpropagation is largely synonymous with a specific form of reverse-mode automatic differentiation
Chain Rule

• Consider  z(x) = f(g(x)),   f, g : ℛ → ℛ

• The chain rule in one dimension, with y = g(x) and z = f(y):

∂z/∂x = (∂z/∂y) (∂y/∂x)
Chain Rule Example
z(x) = log(x)²,  with g(x) = log(x),  f(y) = y², and z(x) = f(g(x))

∂z/∂x = (∂z/∂y) (∂y/∂x) = (2 log(x)) · (1/x) = 2 log(x) / x
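A quick numerical check of this example using torch autograd (assumed available): the analytic derivative 2 log(x)/x matches the autograd result.

    # d/dx [log(x)^2] = 2 log(x) / x, verified with autograd.
    import torch

    x = torch.tensor(3.0, requires_grad=True)
    z = torch.log(x) ** 2
    z.backward()
    print(x.grad, 2 * torch.log(torch.tensor(3.0)) / 3.0)   # the two values agree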
Automatic Differentiation
∂z/∂x = (∂z/∂y) (∂y/∂x)

Simplified expression:             Procedural:

∂z/∂x = 2 log(x) / x               1. ∂z/∂y = 2 log(x)
                                   2. ∂z/∂x = (∂z/∂y) · (1/x)
Multivariable Calculus Review

• Gradient

• Jacobian

Multivariable Calculus Review
• Gradient — vector input and scalar output, f : ℛ^D → ℛ

∇f(w) = [ ∂f(w)/∂w1, …, ∂f(w)/∂wD ]ᵀ = (∂f/∂w)ᵀ

• Jacobian — vector input and vector output, f : ℛ^D → ℛ^M

Jf(w) = ∂f(w)/∂w =
    [ ∂f1/∂w1   …   ∂f1/∂wD ]     [ ∇f1(w)ᵀ ]
    [     ⋮              ⋮    ]  =  [    ⋮     ]
    [ ∂fM/∂w1   …   ∂fM/∂wD ]     [ ∇fM(w)ᵀ ]
Multivariable chain rule warm up

Function of two variables:  h(g(x), f(x))

∂h(g(x), f(x))/∂x = (∂h/∂g) (∂g/∂x) + (∂h/∂f) (∂f/∂x)
Chain Rule Vector Valued f
• Consider  f(g(x)),   x ∈ ℛ^n,  g : ℛ^n → ℛ^m,  f : ℛ^m → ℛ

• The chain rule in multiple dimensions, with y = g(x):

∂f/∂xi = Σ_j (∂f/∂yj) (∂yj/∂xi)   →   (∇y f)ᵀ [ ∂y1/∂xi, …, ∂ym/∂xi ]ᵀ

∇x f(x)ᵀ = ∇y f(y)ᵀ (∂y/∂x)        (∂y/∂x is the Jacobian; this is a vector-Jacobian product)
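A small sketch of a vector-Jacobian product with torch autograd (assumed available); the elementwise function g and the scalar f below are arbitrary.

    # For y = g(x) and a scalar f(y), backprop computes v^T J_g(x) with v = ∇_y f.
    import torch

    x = torch.randn(4, requires_grad=True)
    y = torch.sin(x) * 2.0                       # y = g(x), elementwise, R^4 -> R^4
    f = y.sum()                                  # scalar f(y), so ∇_y f = 1
    f.backward()                                 # x.grad now holds (∇_y f)^T ∂y/∂x
    print(x.grad, 2.0 * torch.cos(x.detach()))   # agree: the Jacobian here is diagonal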
Computation Graphs
x ∈ ℛ^n,  f : ℛ^n → ℛ

f(x) = f2(f1(f0(x))),  with x1 = f0(x),  x2 = f1(x1),  y = f2(x2)

(Chain graph: x → f0 → x1 → f1 → x2 → f2 → y)

• Nodes are input or computed variables

• Non-leaf nodes are obtained by operations dependent only on parent nodes

• Note: several valid alternative ways to formalize computation graphs exist
Computation Graphs
x0 ∈ ℛ^n,  f : ℛ^n → ℛ

f(x0) = f2(f1(f0(x0))),  with x1 = f0(x0),  x2 = f1(x1),  y = f2(x2)

(Chain graph: x0 → f0 → x1 → f1 → x2 → f2 → y)

∂y/∂x0 = (∂y/∂x2) (∂x2/∂x1) (∂x1/∂x0) = (∂f2(x2)/∂x2) (∂f1(x1)/∂x1) (∂f0(x0)/∂x0)
              1 × M3          M3 × M2          M2 × M1
Forward and Backward Differentiation
(Chain graph: x0 → f0 → x1 → f1 → x2 → f2 → y)

∂y/∂x0 = (∂f2(x2)/∂x2) (∂f1(x1)/∂x1) (∂f0(x0)/∂x0)
              1 × M3         M3 × M2         M2 × M1

Take M = M3 = M2 = M1

Forward mode autodiff (multiply right to left):              M³ + M² ops,  O(M³)

Reverse mode autodiff / backprop (multiply left to right):   M² + M² ops,  O(M²)
Reverse Mode AD
(Chain graph: x0 → f0 → x1 → f1 → x2 → f2 → y)

Forward pass:                Backward pass:
  x1 = f0(x0)                  ∂y/∂x2 = ∂f2(x2)/∂x2
  x2 = f1(x1)                  ∂y/∂x1 = (∂y/∂x2) (∂f1(x1)/∂x1)
  y = f2(x2)                   ∂y/∂x0 = (∂y/∂x1) (∂f0(x0)/∂x0)
Reverse Mode AD
(Chain graph: x0 → f0 → x1 → f1 → … → f_{J−1} → y;  v_{j−1}ᵀ is the output gradient passed backward to node j−1)

Reverse mode AD for a chain graph with scalar output

Forward pass:
  x0 ← x
  for j = 0 to J − 1:
      x_{j+1} ← f_j(x_j)

Backward pass:
  v_{J−1} ← ∇f_{J−1}(x_{J−1})
  for j = J − 1 to 1:
      v_{j−1}ᵀ ← v_jᵀ J_{f_{j−1}}(x_{j−1})        (vector-Jacobian product: 1 × M2 times M2 × M1)

  ∇x0 y = v_0
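A minimal numpy sketch of reverse-mode AD over a chain graph: each stage returns its output together with a VJP closure, and the backward pass chains the VJPs right to left. The specific functions (sin, square) are arbitrary.

    # Reverse-mode AD over a chain of elementwise functions (numpy assumed).
    import numpy as np

    def make_sin():
        def fwd(x):
            return np.sin(x), lambda v: v * np.cos(x)     # VJP: v^T diag(cos(x))
        return fwd

    def make_square():
        def fwd(x):
            return x ** 2, lambda v: v * 2 * x            # VJP: v^T diag(2x)
        return fwd

    def chain_value_and_grad(fs, x0):
        x, vjps = x0, []
        for f in fs:                                      # forward pass: store a VJP per node
            x, vjp = f(x)
            vjps.append(vjp)
        v = np.ones_like(x)                               # seed: gradient of sum(y) w.r.t. y
        for vjp in reversed(vjps):                        # backward pass: chain the VJPs
            v = vjp(v)
        return x.sum(), v

    value, grad = chain_value_and_grad([make_sin(), make_square()], np.array([0.3, 1.2]))
    print(value, grad)    # gradient of sum(sin(x)^2) is 2 sin(x) cos(x)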
Reverse Mode AD for MLP
Terminology: feedforward networks with fully connected layers -> multilayer perceptrons (MLPs)

f(x) = f1(f0(x, w)),  with x1 = f0(x0, w) and x2 = f1(x1)

(Graph: leaf nodes x and w feed f1 = Matmul producing z1; f2 = ρ produces x1)
Reverse Mode AD for MLP
fW1,W2,W3,…,WJ(x) = ρ(WJ … W3 ρ(W2 ρ(W1 x)))

x_{i−1}   Input to layer i
z_i       Pre-activation
x_i       Post-activation

(Graph: x0 → Matmul(W1) → z1 → ρ → x1 → Matmul(W2) → z2 → ρ → x2 → … → xJ → l(·, y) → L)
Reverse Mode AD for MLP
fW1,W2,W3,…,WJ(x) = ρ(WJ … W3 ρ(W2 ρ(W1 x)))

(Graph: x0 → Matmul(W1) → z1 → ρ → x1 → Matmul(W2) → z2 → ρ → x2 → … → xJ → l(·, y) → L)

We want ∇WJ L, …, ∇W1 L

x_{j−1}   Input to layer j
z_j       Pre-activation
x_j       Post-activation

Forward pass:
  x0 ← x
  for j = 0 to J − 1:
      z_{j+1} ← W_{j+1} x_j
      x_{j+1} ← ρ(z_{j+1})
  L = l(xJ, y)

Local Jacobians:
  ∂z_{j+1}/∂x_j = W_{j+1}
  ∂x_j/∂z_j = diag(ρ′(z_j))
  ∂z_j/∂W_j = ?
Reverse Mode AD for MLP
(Graph: x0 → Matmul(W1) → z1 → ρ → … → x_{j−1} → Matmul(Wj) → z_j → ρ → x_j → … → xJ → l(·, y) → L)

fW1,W2,W3,…,WJ(x) = ρ(WJ … W3 ρ(W2 ρ(W1 x)))

We want ∇WJ L, …, ∇W1 L

Forward pass:
  x0 ← x
  for j = 0 to J − 1:
      z_{j+1} ← W_{j+1} x_j
      x_{j+1} ← ρ(z_{j+1})
  L = l(xJ, y)

VJPs (vector-Jacobian products):
  ∂L/∂xJ = ∂l(xJ, y)/∂xJ
  ∂L/∂zj = (∂L/∂xj) (∂xj/∂zj) = (∂L/∂xj) diag(ρ′(zj)) = (∂L/∂xj) ∘ ρ′(zj)
  ∂L/∂x_{j−1} = (∂L/∂zj) (∂zj/∂x_{j−1}) = (∂L/∂zj) Wj
  ∂L/∂Wj = (∂L/∂zj) (∂zj/∂Wj) = (∂L/∂zj)ᵀ x_{j−1}ᵀ
Reverse Mode AD for MLP
(Graph: x0 → Matmul(W1) → z1 → ρ → … → x_{j−1} → Matmul(Wj) → z_j → ρ → x_j → … → xJ → l(·, y) → L)

We want ∇WJ L, …, ∇W1 L

Forward pass:                      Backward pass:
  x0 ← x                             v = ∇xJ L = ∇xJ l(xJ, y)
  for j = 0 to J − 1:                for j = J to 1:
      z_{j+1} ← W_{j+1} x_j              v ← ∇zj L = v ∘ ρ′(zj)
      x_{j+1} ← ρ(z_{j+1})               ∇Wj L = v x_{j−1}ᵀ
  L = l(xJ, y)                           v ← ∇x_{j−1} L = Wjᵀ v

Note: here we keep everything as column vectors
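A numpy sketch of the forward/backward pass above for a 2-layer ReLU MLP with squared-error loss; the layer sizes and random data are arbitrary, and everything is kept as column vectors as on the slide.

    # MLP forward pass and backprop by hand (numpy assumed).
    import numpy as np

    rho = lambda z: np.maximum(z, 0.0)              # ReLU
    rho_prime = lambda z: (z > 0).astype(float)

    rng = np.random.default_rng(0)
    W = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]   # W1: 3->4, W2: 4->2
    x = rng.normal(size=(3, 1))
    y = rng.normal(size=(2, 1))

    # Forward pass, storing pre- and post-activations
    xs, zs = [x], []
    for Wj in W:
        zs.append(Wj @ xs[-1])
        xs.append(rho(zs[-1]))
    L = 0.5 * np.sum((xs[-1] - y) ** 2)

    # Backward pass: v carries the running gradient
    v = xs[-1] - y                                  # ∇_{xJ} L for squared error
    grads = [None] * len(W)
    for j in reversed(range(len(W))):
        v = v * rho_prime(zs[j])                    # ∇_{zj} L = v ∘ ρ'(zj)
        grads[j] = v @ xs[j].T                      # ∇_{Wj} L = v x_{j-1}^T
        v = W[j].T @ v                              # ∇_{x_{j-1}} L = Wj^T v
    print(L, [g.shape for g in grads])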

Speed for MLP
Forward pass:                      Backward pass:
  x0 ← x                             v = ∇xJ L = ∇xJ l(xJ, y)
  for j = 0 to J − 1:                for j = J to 1:
      z_{j+1} ← W_{j+1} x_j              v ← ∇zj L = v ∘ ρ′(zj)
      x_{j+1} ← ρ(z_{j+1})               ∇Wj L = v x_{j−1}ᵀ
  L = l(xJ, y)                           v ← ∇x_{j−1} L = Wjᵀ v

• Finite differences require 2D forward passes, with D parameters

• Reverse mode AD: often ~2x the cost of a forward pass

• Forward mode AD: the cost per forward pass would increase with the width
Group Activity
• In a group, work through the following - 20 minutes

• Consider:  y = w2ᵀ tanh(W1 x + b)

• Draw the computation graph

• Nodes are input or computed variables

• Find systematically the expressions for  ∂y/∂w2,  ∂y/∂W1,  ∂y/∂b

W1 = [ 0.5  0.5 ;  −1  1 ],   x = [ 1.0 ;  0.5 ],   b = [ 0 ;  0 ]

Note: tanh′(x) = 1 − tanh²(x)

Problem
y = w2ᵀ tanh(W1 x + b)

Note: there are different valid graphs depending on how you define the primitive ops
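One way to check the group-activity gradients numerically is with torch autograd; note that w2 is not specified on the slide, so an arbitrary value is assumed here, and W1, x, b follow the values listed above.

    # Numerical check of ∂y/∂w2, ∂y/∂W1, ∂y/∂b for y = w2^T tanh(W1 x + b).
    import torch

    W1 = torch.tensor([[0.5, 0.5], [-1.0, 1.0]], requires_grad=True)
    b = torch.tensor([0.0, 0.0], requires_grad=True)
    w2 = torch.tensor([1.0, -1.0], requires_grad=True)   # assumed, not given on the slide
    x = torch.tensor([1.0, 0.5])

    y = w2 @ torch.tanh(W1 @ x + b)
    y.backward()
    print(w2.grad)   # ∂y/∂w2 = tanh(W1 x + b)
    print(W1.grad)   # ∂y/∂W1 = (w2 ∘ tanh'(W1 x + b)) x^T
    print(b.grad)    # ∂y/∂b  = w2 ∘ tanh'(W1 x + b)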
More Complex Graphs

(Graph: x → f1 → x1; x1 branches into x2 = f2(x1) and x3 = f4(x1); both paths feed x4 via f3 and f5, and f6 maps x4 to y)

∂y/∂x0 = (∂y/∂x4) [ (∂x4/∂x2) (∂x2/∂x1) + (∂x4/∂x3) (∂x3/∂x1) ] (∂x1/∂x0)

∂y/∂xj = Σ_{s ∈ Child(j)} (∂y/∂xs) (∂xs/∂xj)
Topological Sort

(Graph: x → f1 → x1, with x1 feeding x2 and x3, which feed x4 → y)

(Figure: the same graph redrawn with the nodes laid out in a topologically sorted order, so that every node appears after its parents)
Reverse AD over General Graph
(Graph: the general graph above, with nodes in topologically sorted order)

For the last node, define xJ := L

Forward pass:
  x1, …, xJ ← topological sort(Graph)
  for j = 1 to J:
      xj ← fj(Parent1(xj), …, ParentK(xj))

Backward pass:
  ∇xJ L = ∇xJ xJ = 1
  for j = J − 1 to 1:
      (∇xj L)ᵀ ← Σ_{k ∈ Child(xj)} (∇xk L)ᵀ (∂xk/∂xj)
Recap of Terminology
• Backpropagation is how we compute the gradient

• It is not Gradient Descent, which is how we optimize the objective

• Automatic Differentiation

• More general than backprop

• Backprop is essentially reverse mode AD for a scalar output

• Autograd

• A specific package implementing automatic differentiation

• Predecessor of torch autograd
Vanishing Gradients
(Graph: x0 → Matmul(W1) → z1 → ρ → x1 → Matmul(W2) → z2 → ρ → x2 → … → xJ → l(·, y) → L)

∂L/∂x1 = (∂L/∂xJ) Π_{j=J..2} diag(ρ′(zj)) Wj

using  ∂L/∂zj = (∂L/∂xj) diag(ρ′(zj))   and   ∂L/∂x_{j−1} = (∂L/∂zj) Wj

• It has been observed for feedforward nets that the gradient signal degrades with depth

• This makes adapting the lower layers difficult

• Depends on the distribution of the initial weights

• RNNs suffer from the same issue

Glorot, Xavier, and Yoshua Bengio. "Understanding the difficulty of training deep feedforward neural networks." AISTATS, JMLR Workshop and Conference Proceedings, 2010.

Sepp Hochreiter, Master's Thesis
Vanishing and Exploding Gradients

• For many decades this was believed to be the main issue in training deep networks

• Recurrent networks did not work at all until circa 2014; this was attributed to the vanishing/exploding gradients problem

(Optional for now)


Vanishing and Exploding Gradients
• Various ways to address this

• Most of these will be discussed in more detail in future lectures

• Gradient clipping for exploding gradients

• Shortcut connections (LSTMs and ResNets)

• Normalization techniques

• Better activation selection

• Initialization

• Trying to keep matrices orthogonal

• Alternatives to gradient based learning (bypass bprop)

(Optional for now)


Deep Learning Frameworks

Theano (deprecated), TensorFlow, MxNet, PyTorch

(Stack diagram: model/training-building front ends, built on automatic differentiation, built on a tensor library)
Deep Learning Frameworks (Pytorch)

Tensor Library

• Built on tensor libraries (similar to numpy)

• Backends for operations on the GPU
DL Frameworks

• Backends for operations on the GPU

(Diagram: tensors a and b live in RAM and are processed by the CPU, or in GPU memory and processed by the GPU; image from Nvidia)
DL Frameworks: Autodiff
(Stack: automatic differentiation built on the tensor library)

(Graph: x → f1 → x1 → f2 → x2 → f3 → x4 → y, with a branch through x3)

• Automates the construction of the computation graph and the backward pass
DL Frameworks: Autodiff

• Frameworks allow us to define primitives and optimize their forward and backward computation

• Optimized primitives can be chained together to form complex models
DL Frameworks: Autodiff
(Stack: automatic differentiation built on the tensor library)

• Automatic differentiation tools

• Only need to specify the forward pass behaviour if using predefined primitives

• Obtain the computation graph ahead of time (theano, tensorflow v1, mxnet v1)

• Or on the fly (pytorch, mxnet gluon, tensorflow v2)
On the Fly Construction (Tracing)

• Each torch tensor created with requires_grad=True will be recognized by torch autograd for building computation graphs

• The graph is constructed on the fly by storing, for each node, a reference to the parent nodes and the functions applied
Torch Autograd
(Example traced graph: leaf tensors a and b feed a Matmul producing y, which feeds a cost c)
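A minimal sketch of this on-the-fly tracing with torch autograd; the tensor shapes and the cost are arbitrary.

    # Tensors created with requires_grad=True are tracked; backward() walks the traced graph.
    import torch

    a = torch.randn(3, 2, requires_grad=True)   # leaf tensor, tracked by autograd
    b = torch.randn(2, requires_grad=True)      # leaf tensor, tracked by autograd
    y = a @ b                                   # Matmul node recorded on the fly
    cost = (y ** 2).sum()                       # scalar cost node
    cost.backward()                             # reverse-mode AD over the traced graph
    print(a.grad.shape, b.grad.shape)           # gradients accumulated on the leaves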
Barebones Autograd Implementations

Mathieu Blondel
https://github.com/mblondel/teaching/blob/main/autodiff-2020/autodiff.py

Andrej Karpathy
https://github.com/karpathy/micrograd

Matt Johnson
https://github.com/mattjj/autodidact
DL Frameworks: Autograd and Pytorch
• Mini-Autograd from Mathieu Blondel: https://github.com/mblondel/teaching/blob/main/autodiff-2020/autodiff.py

Forward pass:
  x1, …, xJ ← topological sort(Graph)
  for j = 1 to J:
      xj ← fj(Parent1(xj), …, ParentK(xj))

Store at each node: its Value and its Grad

(Graph: the general computation graph from before)
DL Frameworks: Autograd and Pytorch
Backward pass:
  ∇xJ L = ∇xJ xJ = 1
  for j = J − 1 to 1:
      (∇xj L)ᵀ ← Σ_{k ∈ Child(xj)} (∇xk L)ᵀ (∂xk/∂xj)

Store at each node: its Value and its Grad

(Graph: the general computation graph from before)
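A tiny sketch in the spirit of the barebones/mini-autograd implementations referenced above: each node stores its value, its parents, and its gradient; the forward pass builds the graph, and backward() topologically sorts it and accumulates gradients child to parent. Only add and multiply are implemented, purely for illustration.

    # A micrograd-style reverse-mode AD sketch over a general graph.
    class Node:
        def __init__(self, value, parents=(), backward=lambda g: ()):
            self.value, self.parents, self._backward, self.grad = value, parents, backward, 0.0

        def __add__(self, other):
            return Node(self.value + other.value, (self, other), lambda g: (g, g))

        def __mul__(self, other):
            return Node(self.value * other.value, (self, other),
                        lambda g: (g * other.value, g * self.value))

        def backward(self):
            # Topologically sort the graph, then accumulate gradients child-to-parent
            order, seen = [], set()
            def visit(n):
                if n not in seen:
                    seen.add(n)
                    for p in n.parents:
                        visit(p)
                    order.append(n)
            visit(self)
            self.grad = 1.0
            for n in reversed(order):
                for p, g in zip(n.parents, n._backward(n.grad)):
                    p.grad += g

    x, w = Node(2.0), Node(3.0)
    y = x * w + x          # y = x*w + x
    y.backward()
    print(x.grad, w.grad)  # ∂y/∂x = w + 1 = 4, ∂y/∂w = x = 2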
DL Frameworks: Model Building

torch.nn.Module

(Stack: model-building front ends, on top of automatic differentiation, on top of the tensor library)

• Simple ways to track and manipulate all parameters of large models

• Allows one to easily build and plug-and-play layers

• Easily specify the parameters and initialize them

• Describe the forward pass behaviour

• Designed to work well with training pipelines
DL Frameworks: Model Building
torch.nn.Module

• Provides commonly used modules that build on top of each other

• Tracks parameters initialized in __init__

• Specify the forward pass behaviour

• Put all parameter tensors on the GPU with one call
DL Frameworks: Model Building

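A minimal torch.nn.Module sketch in the spirit of the code shown on this slide; the layer sizes are arbitrary placeholders.

    # A small nn.Module: parameters are tracked automatically; only forward is described.
    import torch

    class SmallMLP(torch.nn.Module):
        def __init__(self, d_in=2, d_hidden=16, d_out=1):
            super().__init__()
            # Parameters created here are registered and tracked by the module
            self.fc1 = torch.nn.Linear(d_in, d_hidden)
            self.fc2 = torch.nn.Linear(d_hidden, d_out)

        def forward(self, x):
            # Describe only the forward pass; autograd handles the backward pass
            return self.fc2(torch.relu(self.fc1(x)))

    model = SmallMLP()
    print(sum(p.numel() for p in model.parameters()))   # all parameters are tracked
    # model.cuda()  # would move every parameter tensor to the GPU in one call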
References

• Differential Programming by Gabriel Peyre

• Automatic Differentiation Slides by Roger Grosse

• Autodiff Slides and Code from Mathieu Blondel
Gradient Descent in 2D
Visualization in 2 dimensions using contours

(Figure: contour plot with iterates xt0, xt1, xt2, xt3 approaching the minimum along coordinates [x]1 and [x]2)

Initialize:  x0

Iterate:  xt+1 = xt − α ∇f(xt)

Stopping criterion:  | f(xt+1) − f(xt) | < ϵ
Convexity

For θ ∈ [0,1]:   f(θx + (1 − θ)y) ≤ θ f(x) + (1 − θ) f(y)

• Choose two points and draw the line between them

• Convex if the line is above the graph

(Figures: a convex and a non-convex function with the chord between points x and y)
Convex Optimization
Convex function: for θ ∈ [0,1],   f(θx + (1 − θ)y) ≤ θ f(x) + (1 − θ) f(y)

• Many optimization methods come from convex optimization

(Figure: a convex function with the chord between points x and y)
Local and Global Minimum
• Stationary Points

• Global Minimum

• Local Minimum

• For convex functions, local minima are global minima


Convex Optimization
f(θx + (1 − θ)y) ≤ θf(x) + (1 − θ)f(y)

• Linear models + standard loss functions typically yield convex problems

• Convex optimization often gives us provable ways to reach a global minimum of the function

• Deep networks + standard loss functions typically yield a highly non-convex optimization problem
Non-convex optimization

• Non-convex optimization is NP-hard

• Fortunately, the objective and loss surface in deep networks have properties that allow reasonable solutions in practice
Convergence of Gradient Descent

wk+1 = wk − α ∇f(wk)

(Figure: contour plot with iterates w0, w1, w3, w4 along coordinates [w]1 and [w]2)

Under conditions (to follow), for convex functions we can show

f(wk) − f(w*) ≤ ||w0 − w*||²₂ / (2αk)

wk   k-th iterate
w*   Minimum
α    Learning rate

Converges at rate O(1/k)

Bertsekas et al. "Gradient Convergence in Gradient Methods with Errors."
Convergence of Gradient Descent
f(wk) − f(w*) ≤ ||w0 − w*||²₂ / (2αk)

1. We assume a Lipschitz continuous gradient:

||∇f(x) − ∇f(y)||₂ / ||x − y||₂ ≤ L

• Limits how fast gradients can change

• Often valid

2. Learning rate α ≤ 1/L
Hessian
• Matrix of all second partial derivatives, for f : ℛ^D → ℛ

• Characterizes curvature in high dimensions

Hf = ∇²f =
    [ ∂²f/∂w1²      …   ∂²f/∂w1∂wD ]
    [     ⋮                   ⋮      ]
    [ ∂²f/∂wD∂w1   …   ∂²f/∂wD²    ]
Gradient Descent
||∇f(x) − ∇f(y)||₂ / ||x − y||₂ ≤ L

∇²f(w) = H

Slide courtesy of Mark Schmidt


Gradient Descent

Slide courtesy of Mark Schmidt


Descent Lemma

Slide courtesy of Mark Schmidt


Convergence of Gradient Descent

Slide courtesy of Mark Schmidt

(Optional)
Convergence of Gradient Descent
Guaranteed progress

(Optional)
Gradient Descent for NN

Empirical Risk Minimization:  w* = arg min_w (1/n) Σ_{i=1}^n l(fw(xi), yi)

e.g.  l(fw(x), y) = (1/2) (fw(x) − y)²

ℒ(X, Y, w) = (1/n) Σ_{i=1}^n l(fw(xi), yi) = (1/2n) ||Y − fw(X)||²

Gradient of the objective with respect to the weights w:  ∇w ℒ(X, Y, w)

w     All parameters of the model
X, Y  Data matrix and labels
Stochastic Gradient Descent
Gradient Descent (GD): gradient of the full objective
    ∇w ℒ(X, Y, w)
    wt+1 = wt − α ∇w ℒ(X, Y, wt)

Stochastic GD (SGD): gradient of the loss w.r.t. 1 sample
    ∇w l(x, y, w)
    wt+1 = wt − α ∇w l(x, y, wt)

Mini-batch SGD: gradient of the loss w.r.t. a sub-sample Xn ⊂ X
    ∇w ℒ(Xn, Yn, w)
    wt+1 = wt − α ∇w ℒ(Xn, Yn, wt)

Image credits: Towards Data Science
Intuition about Stochastic Methods
(Different mini-batches give different gradient estimates:)

(1/|S1|) Σ_{i∈S1} ∇w l(xi, yi, w)     for mini-batch S1

(1/|S2|) Σ_{i∈S2} ∇w l(xi, yi, w)     for mini-batch S2
Stochastic vs Gradient Descent

• Stochastic gradient descent is much more scalable to large datasets

• E.g., in convex settings, convergence rates can be shown to be similar to GD while processing only one point at a time

• Stochastic gradient descent is a classic optimization method that is the backbone of most modern neural network training
Mini-Batch SGD
• Classical SGD uses a single point

• Mini-batch SGD can be seen as obtaining a lower-variance gradient estimate, without the need for a full batch:

(1/|Xn|) Σ_{(xi,yi)∈Xn} ∇w l(xi, yi, w)     for a sub-sample Xn ⊂ X

• Mini-batch forward and backward processing is often more efficient on a single GPU

Terminology: recently, mini-batch SGD is in many contexts just called SGD
Mini-Batch SGD Algo
Mini-Batch SGD in Code
(Graph: x0 → Matmul(W1), add b1 → z1 → ρ → x1 → Matmul(W2), add b2 → z2 → ρ → Loss)

• Each node holds a gradient buffer
Mini-Batch SGD in Code
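A hedged sketch of what a mini-batch SGD loop looks like in PyTorch; the model, toy data, batch size, and learning rate are arbitrary placeholders rather than the code from the slide.

    # Mini-batch SGD training loop (PyTorch assumed).
    import torch

    X = torch.randn(1000, 2)                      # toy data: 1000 samples, 2 features
    Y = torch.randn(1000, 1)
    model = torch.nn.Sequential(torch.nn.Linear(2, 16), torch.nn.ReLU(), torch.nn.Linear(16, 1))
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = torch.nn.MSELoss()

    for epoch in range(5):
        perm = torch.randperm(X.shape[0])         # sample mini-batches without replacement
        for i in range(0, X.shape[0], 32):
            idx = perm[i:i + 32]
            loss = loss_fn(model(X[idx]), Y[idx])
            opt.zero_grad()                       # clear each parameter's gradient buffer
            loss.backward()                       # backprop fills the .grad buffers
            opt.step()                            # w <- w - lr * grad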
Stochastic Gradient Descent
• Stochastic gradient descent can be shown to reach the global minimum in convex settings

• SGD can be shown to reach a stationary point in non-convex settings under some assumptions

• Assumption - Appropriate step size sequence

• Assumption - Lipschitz continuity


Minimum in Non-Convex Setting

• Are local minima useful?


Local Minimum Can be Useful

• It has been empirically shown in some cases that many local minima of the highly non-convex functions reached by SGD are "close" to the global minimum; why this occurs is still poorly understood

• With many assumptions and for specific model classes, recent results have begun to show that SGD can converge to global minima
GD Step Size
• The step size, or learning rate, can often greatly affect the optimization

Slide credit: Roger Grosse
Learning Rate Schedules SGD
(Figure: SGD iterates vs. the true gradient direction near the minimum)

• In stochastic optimization, larger learning rates, even if they are of the correct size to reach the minimum, may bounce around without hitting the solution

• As with GD, very small learning rates would be too slow

• In SGD we typically start at a high learning rate and decay it
Momentum

g = ∇w ℒ(X, Y, wt)
vt+1 = μ * vt + g
wt+1 = wt − α * vt+1

• Popular and simple approach to speed up and stabilize learning

• Dampens oscillations and noise from noisy gradients

• Often accelerates training
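A small numpy sketch of the momentum update above; the quadratic objective, μ, and α are arbitrary illustrative choices.

    # SGD with momentum on f(w) = 0.5 ||w||^2 (numpy assumed).
    import numpy as np

    grad = lambda w: w                      # gradient of f(w) = 0.5 ||w||^2
    w = np.array([3.0, -2.0])
    v = np.zeros_like(w)
    mu, alpha = 0.9, 0.1
    for t in range(100):
        g = grad(w)
        v = mu * v + g                      # v_{t+1} = μ v_t + g
        w = w - alpha * v                   # w_{t+1} = w_t - α v_{t+1}
    print(w)                                # approaches the minimum at the origin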


Visualizations

https://distill.pub/2017/momentum/
Adaptive Learning Rate

• Attempt to adjust the learning rate based on rules to better suit local curvature, without explicitly representing the Hessian

• Typically based on heuristics


RMSProp
• Individually adapt learning rate of each parameter

• Divide the learning rate for a weight by a running average of the


magnitudes of recent gradients for that weight.

• Parameters with very large gradients have less effect
Adam
• Extremely popular optimization algorithm

• Can be seen as a combination of momentum and RMSProp

• Adam rose to prominence in training models where learning rate schedules were extremely difficult to determine

• Robust default parameters

Kingma, D., and Ba, J. "Adam: A Method for Stochastic Optimization."
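A short sketch of using Adam (and, commented out, RMSProp) through torch.optim; the model and data are arbitrary placeholders, and the hyperparameters shown are the common defaults.

    # One optimizer step with Adam via torch.optim.
    import torch

    model = torch.nn.Linear(10, 1)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
    # opt = torch.optim.RMSprop(model.parameters(), lr=1e-3, alpha=0.99)

    x, y = torch.randn(32, 10), torch.randn(32, 1)
    loss = torch.nn.functional.mse_loss(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()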
Second-order optimization

f(w) = ℒ(X, Y, w) = (1/n) Σ_{i=1}^n l(gw(xi), yi) = (1/2n) ||Y − gw(X)||²

Second-order Taylor approximation around v:

f̂(w) ≈ f(v) + ∇f(v)ᵀ (w − v) + (1/2) (w − v)ᵀ ∇²f(v) (w − v)
Newton Method

f̂(w) ≈ f(v) + ∇f(v)ᵀ (w − v) + (1/2) (w − v)ᵀ ∇²f(v) (w − v),   with H = ∇²f(v)

Solve the local approximation:

min_w  f(v) + ∇f(v)ᵀ (w − v) + (1/2) (w − v)ᵀ H (w − v)

w* = v − H⁻¹ ∇f(v)
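A numpy sketch of a single Newton step w* = v − H⁻¹ ∇f(v) on a small quadratic, where one step lands exactly on the minimizer; the matrix A and vector b are arbitrary.

    # One Newton step on f(w) = 0.5 w^T A w - b^T w (numpy assumed).
    import numpy as np

    A = np.array([[3.0, 1.0], [1.0, 2.0]])       # positive definite
    b = np.array([1.0, -1.0])

    grad = lambda w: A @ w - b                   # ∇f(w)
    H = A                                        # Hessian is constant for a quadratic

    v = np.array([5.0, 5.0])
    w_star = v - np.linalg.solve(H, grad(v))     # Newton step (solve instead of explicit inverse)
    print(w_star, np.linalg.solve(A, b))         # matches the exact minimizer A^{-1} b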
Second-order optimization

f̂(w) ≈ f(v) + ∇f(v)ᵀ (w − v) + (1/2) (w − v)ᵀ H (w − v)

• Requires the inverse Hessian

• Large memory and computation: O(D³) compute and O(D²) memory
Second-order optimization

f̂(w) ≈ f(v) + ∇f(v)ᵀ (w − v) + (1/2) (w − v)ᵀ H (w − v)

v* = v − H⁻¹ ∇f(v)

• Common second-order approaches attempt to approximate the Hessian

• BFGS is one of the more successful

• Still maintains O(D²) computation and O(D²) memory cost

• L-BFGS further reduces the memory cost; very popular outside of DL

• KFAC, Krylov subspace methods

Second-order optimization

• In practice, at the moment, second-order methods are rarely used in deep learning

• Many properties of these methods, particularly in estimating the inverse Hessian, and their behaviour in the stochastic setting are not yet well understood

• They hold promise for potentially more rapid optimization

• Several promising results using novel hessian approximations


(KFAC)

Ba, Jimmy, Roger Grosse, and James Martens. "Distributed second-order optimization using Kronecker-factored approximations." (2016).
Generalization and Optimization

min_w  (1/n) Σ_{i=1}^n l(fw(xi), yi) + Ω(w)

• Classic machine learning often treats the optimization of the objective function and the properties of the optimum as separate concepts

• In practice, a lot of success is due to implicit regularization from the optimization methods
SGD is good for generalization

• A growing body of evidence shows that SGD has better generalization properties

• Intuitively, the process of sampling the training data in SGD mimics the process of sampling the train/test split

• Several works empirically show SGD can find "flatter minima"

• This makes it particularly hard to theoretically analyze optimization algorithms in DL

Kuzborskij, Ilja, and Christoph H. Lampert. "Data-Dependent Stability of Stochastic Gradient Descent." ICML 2018.
Hardt, Moritz, Ben Recht, and Yoram Singer. "Train faster, generalize better: Stability of stochastic gradient descent." ICML 2016.
SGD is good for generalization
• Several works argue that SGD with small mini-batches can find flatter minima, and that flat minima generalize better

Hochreiter and Schmidhuber. "Flat Minima." 1997.

Keskar, Nitish Shirish, et al. "On large-batch training for deep learning: Generalization gap and sharp minima." arXiv preprint arXiv:1609.04836 (2016).
SGD and Generalization
Kaiming He's 2015 ImageNet competition winner

Why don't we decay the learning rate in these flat regions?

Initial training at a high learning rate has been observed to act as a regularizer.
Some initial explanations for this effect have appeared in the literature:

Li, Yuanzhi, Colin Wei, and Tengyu Ma. "Towards explaining the regularization effect of initial large learning rate in training neural networks." Advances in Neural Information Processing Systems. 2019.
Distributed Optimization
Parallelizing Deep Network Training
• The most common form of parallelism is data parallelism

• Each node simultaneously processes different mini-batches

• Model parallelism - attempts to split the model across nodes

• Difficult to parallelize in some cases

Image from Wikimedia
Distributed Synchronous SGD
• The most common approach is distributed synchronous SGD

• Nodes (GPUs) sample data and wait to receive parameters from a parameter server

• The parameter server waits to aggregate gradients from all the nodes, then sends new parameters

(Diagram: GPUs 1-3 each read data, send gradients to a parameter server, and receive updated parameters)
DataParallel SGD in PyTorch

(Example: a mini-batch of size 300 with data dimensionality 500 is sampled from a huge dataset; the 300x500 batch is split into three 100x500 chunks, Data1/Data2/Data3, one per GPU (GPU 0, 1, 2); each GPU computes a gradient with its copy of the parameters)

DataParallel + SGD in PyTorch
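A hedged sketch of data-parallel SGD with torch.nn.DataParallel (assuming one or more CUDA devices are available); the model, batch, and learning rate are arbitrary placeholders rather than the slide's exact code.

    # Data-parallel forward/backward with torch.nn.DataParallel.
    import torch

    model = torch.nn.Linear(500, 10)
    if torch.cuda.is_available():
        model = torch.nn.DataParallel(model).cuda()   # replicates the model across visible GPUs

    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    x = torch.randn(300, 500)                          # a 300x500 mini-batch, split across GPUs
    y = torch.randn(300, 10)
    if torch.cuda.is_available():
        x, y = x.cuda(), y.cuda()

    loss = torch.nn.functional.mse_loss(model(x), y)   # inputs are scattered, outputs gathered
    opt.zero_grad()
    loss.backward()                                    # gradients are reduced onto the main device
    opt.step()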
Distributed Synchronous SGD
What are some issues with distributed synchronous SGD?

• Bandwidth needs to be high for synchronous SGD

• Requires a central node

(Diagram: GPU 0, GPU 1, GPU 2 communicating with a central node)
Batch size and learning rates

• With many available GPUs we would want to increase the batch size to maximize utilization

• From the point of view of variance reduction, we should multiply the learning rate by sqrt(k) for a k-fold increase in batch size

• In practice, various other rules are used, most notably:

Goyal, Priya, et al. "Accurate, large minibatch SGD: Training ImageNet in 1 hour." arXiv preprint arXiv:1706.02677 (2017).