
Machine Perception
Lecture Notes 2023
O. Hilliges, J. Song, F. Engelmann, X. Chen

Acknowledgement
These notes were written and edited by Elisabetta Fedele.
Their creation would not have been possible without the contributions of many people.
I would like to thank all the people who have contributed to the creation of the lecture
material throughout the years: Artur Grigorev, Sammy Christen.
I would like to thank Jonas Hübotter, who inspired me with his lecture notes and
agreed to share his template with us, from which I took some tricks that facilitated
the writing of these notes.
Disclaimer
These lecture notes are provided as a draft version for educational purposes only. The
content presented herein is subject to change and may contain inaccuracies or errors.
Contributing
You are encouraged to raise issues and suggest fixes for anything you think can be improved.
Contact: machine-perception@inf.ethz.ch
Compilation date: May 25, 2023
This set of notes was written for the course Machine Perception (263-5210-00L) at ETH Zürich.
Distribution of these notes without the permission of the authors is prohibited.
© 2023 ETH Zürich. All rights reserved.

Learning Objectives
Students will learn about fundamental aspects of modern deep learning approaches for perception
and generation. Students will learn to implement, train and debug their own neural networks and
gain a detailed understanding of cutting-edge research in learning-based computer vision, robotics,
and shape modeling. The optional final project assignment will involve training a complex neural
network architecture and applying it to a real-world dataset.
The core competency acquired through this course is a solid foundation in deep-learning algorithms
to process and interpret human-centric signals. In particular, students should be able to develop
systems that deal with the problem of recognizing people in images, detecting and describing body
parts, inferring their spatial configuration, performing action/gesture recognition from still images
or image sequences, also considering multi-modal data, among others.
We will focus on teaching: how to set up the problem of machine perception, the learning algorithms,
network architectures, and advanced deep learning concepts in particular probabilistic deep learning
models.
The course covers the following main areas:
1. Foundations of deep learning.

2. Advanced topics like probabilistic generative modeling of data (latent variable models,
generative adversarial networks, auto-regressive models, invertible neural networks).
3. Deep learning in computer vision, human-computer interaction, and robotics.
Summary of Notation
This chapter provides a concise reference describing the notation used throughout the lecture notes.
If you are unfamiliar with any of the corresponding mathematical concepts, we suggest reading
chapters 2-4 of the Deep Learning book [6].
Numbers and Arrays
$a$  A scalar
$\mathbf{a}$  A vector
$\mathbf{A}$  A matrix
$\mathbf{I}_n$  Identity matrix with n rows and n columns
$\mathbf{I}$  Identity matrix with dimensionality implied by context
$\mathrm{diag}(\mathbf{a})$  A square, diagonal matrix with diagonal entries given by the vector $\mathbf{a}$
Sets
$\mathcal{A}$  A set
$\mathbb{N}$  set of natural numbers {1, 2, ...}
$\mathbb{N}_0$  set of natural numbers, including 0, $\mathbb{N} \cup \{0\}$
$\mathbb{R}$  set of real numbers
$[m]$  set of natural numbers from 1 to m, {1, 2, ..., m − 1, m}
$i{:}j$  subset of natural numbers between i and j, {i, i + 1, ..., j − 1, j}
$(a, b]$  real interval between a and b including b but not including a
Indexing
$A_{ij}$  The element in position (i, j) (where i is the row and j is the column) of a matrix $\mathbf{A}$
$\mathbf{A}_{i,:}$  Row i of a matrix $\mathbf{A}$
$\mathbf{A}_{:,j}$  Column j of a matrix $\mathbf{A}$
In addition to the notation described above, when an index is specified, we use the
following notation so that an indexed scalar or vector is still represented with the right notation
(respectively $a$ and $\mathbf{a}$).
$a_i$  Indexed scalar (in a vector $\mathbf{a}$)
$a_{ij}$  Indexed scalar (in a matrix $\mathbf{A}$)
$\mathbf{a}_i$  Indexed vector (in a matrix $\mathbf{A}$)
Linear Algebra Operations
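$\mathbf{A}^\top$  transpose of matrix $\mathbf{A}$
$\mathbf{A}^{-1}$  inverse of invertible matrix $\mathbf{A}$
$\det(\mathbf{A})$  determinant of $\mathbf{A}$
$\mathrm{tr}(\mathbf{A})$  trace of $\mathbf{A}$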
Calculus
$\frac{dy}{dx}$  Derivative of y w.r.t. x
$\frac{\partial y}{\partial x}$  Partial derivative of y w.r.t. x
$\nabla_{\mathbf{x}} y$  Gradient of y w.r.t. $\mathbf{x}$
$\nabla_{\mathbf{X}} y$  Gradient of y w.r.t. $\mathbf{X}$
Probability
$\Omega$  sample space
$\mathcal{A}$  event space
$P(X = x)$  probability of a random variable X taking on the value x
$X \sim P$  random variable X follows the distribution P
$x \sim P$  value x is sampled according to distribution P
$x \mid y$  value x is sampled according to (implicit) conditional distribution $p(\cdot \mid y)$
$P_X$  cumulative distribution function of a random variable X
$\Delta^{\mathcal{A}}$  set of all probability distributions over the set $\mathcal{A}$
$X \perp Y$  random variable X is independent of random variable Y
$\mathbb{E}[X]$  expected value of random variable X
$\mathbb{E}_{x \sim X}[f(x)]$  expected value of the random variable f(X), $\mathbb{E}[f(X)]$
$\mathrm{Var}[X]$  variance of random variable X
$\mathrm{Cov}[X, Y]$  covariance of random variable X and random variable Y
$\Sigma$  covariance matrix
$\bar{X}_n$  sample mean of random variable X with n samples
$D_{\mathrm{KL}}(p \| q)$  KL-divergence of distribution p with respect to distribution q
$\mathcal{N}(\mu, \Sigma)$  normal distribution with parameters µ and Σ
$\mathrm{Unif}(S)$  uniform distribution on the set S
Bern(p) Bernoulli distribution with parameter p

Bin(n, p) binomial distribution with parameters n and p
Functions
$f : A \to B$  function f from elements of set A to elements of set B
$(\cdot)_+$  $\max\{0, \cdot\}$
log logarithm with base e
.
1{predicate} indicator function (1{predicate} = 1 if the predicate is true, else 0)
Deep Learning Notations
$\mathbf{x}^{(i)}$  The i-th example (input) from a dataset
$y^{(i)}$ or $\mathbf{y}^{(i)}$  The target associated with $\mathbf{x}^{(i)}$ for supervised learning
$x^{[i]}$ or $\mathbf{x}^{[i]}$  The value associated with layer i
$x^{\{i\}}$ or $\mathbf{x}^{\{i\}}$  The value associated with epoch i
$x^{\langle t \rangle}$ or $\mathbf{x}^{\langle t \rangle}$  The value associated with time t in LSTM and RNN
$\Theta$  The set of parameters on which a model depends
Acronyms
iff if and only if

s.t. such that
w.r.t. with respect to
w.l.o.g. without loss of generality
i.i.d. independent and identically distributed
BCE Binary Cross Entropy
DNN Deep Neural Network
ELBO Evidence Lower Bound
KL Kullback-Leibler
MAP Maximum A Posteriori
MC Monte Carlo
MLE Maximum Likelihood Estimate
MLP Multi Layer Perceptron
MSE Mean Squared Error
NLL Negative Log Likelihood
ReLU REctified Linear Unit
RL Reinforcement Learning
tanh Hyperbolic TANgent
Contents
I Part One: Foundation of Deep Learning
1 Neural Network Basics
1.1 Outline
1.2 Biological motivations
1.3 Perceptron
1.3.1 Perceptron Learning Algorithm
1.4 The ingredients of a Neural Network
1.4.1 Sigmoid
1.5 Supervised Learning
1.5.1 Classification
1.6 Defining a loss function: Maximum Likelihood Estimation
1.6.1 Final considerations on MLE
1.7 Optimizing the network: Gradient Descent
1.7.1 Optimization procedure
1.7.2 Efficient Computation of Gradients
1.7.3 Backpropagation in Neural Networks
1.8 Block diagrams of a Single Unit Network and of a MLP
1.9 Approximation capabilities of Neural Networks
1.9.1 Linear Activation Function are not enough
1.9.2 Universal approximation theorem
2 Training
2.1 Regularization
2.1.1 Parameter Norm Penalties
2.1.2 Ensemble Methods
2.1.3 Bagging
2.1.4 Dropout
2.1.5 Data Normalization
2.1.6 Batch Normalization
2.1.7 Data Augmentation
2.1.8 Transfer learning
2.1.9 Pre-existing architectures
2.2 Activation functions
2.2.1 Logistic activations
2.2.2 ReLU and its variants
2.2.3 Final considerations on activation functions
2.3 Optimization Algorithms
2.3.1 Gradient Descent and its variants
2.3.2 Challenges in Optimization and some solutions to them
2.3.3 Adaptive Learning Rate: Adagrad and RMSProp
2.3.4 Adaptive Moment Estimation (Adam)
2.4 Last practical suggestions
2.4.1 Learning rate
2.4.2 Ensembles of different models
2.4.3 Fast Geometric Ensembling
2.4.4 Stochastic Weight Averaging
II Part Two: CNNs, RNNs & Co
3 Convolutional Neural Network
3.1 Introduction
3.2 The Neuroscientific Basis for CNNs
3.3 Convolution operation
3.3.1 Convolutions as linear, shift-equivariant transforms
3.3.2 From linear filtering to convolution
3.3.3 Correlation
3.3.4 Intuition of Convolution in Images
3.4 Difference between convolution and correlation
3.4.1 Convolution as matrix multiplication
3.5 Convolutional Neural Network
3.5.1 Convolution layer
3.5.2 Pooling layer
3.5.3 Dense layer
3.6 Practical observations
4 Fully Convolutional Neural Network
4.1 The goal: pixelwise predictions
4.2 Upsampling techniques
4.2.1 Fixed upsampling techniques
4.2.2 Learnable upsampling: transposed convolutions
4.3 U-Net: the most used FCNN
4.4 Applications
5 Recurrent Neural Network
5.1 Introduction
5.2 Dynamical System
5.2.1 Dynamical System without input
5.2.2 Dynamical System with input
5.3 Vanilla Recurrent Neural Network
5.3.1 Network Structure
5.3.2 Backprop through time (BPTT)
5.4 Solving the problem of vanishing and exploding gradient: LSTM and Friends
5.4.1 Naive solution
5.4.2 LSTM
5.4.3 Gradient flow in RNNs and in LSTMs
5.4.4 Gradient clipping
III Part Three: Generative Modeling
6 Autoencoders
6.1 Introduction
6.2 Linear Autoencoders: the PCA projection
6.3 Non-Linear Autoencoders
6.3.1 Overview
6.3.2 Dimensionality of hidden layer
6.3.3 Autoencoder Limitations
6.4 Variational Autoencoders
6.4.1 Overview
6.4.2 Kullback-Leibler (KL) Divergence
6.4.3 Derivation of the objective function
6.4.4 Training
6.4.5 Training in practice
6.4.6 Generating new data
6.5 β-VAE
7 Autoregressive models
7.1 Regressive model
7.2 Sequence model
7.3 A toy example: prediction of a B&W image
7.4 Fully Visible Sigmoid Belief Network
7.4.1 Neural Autoregressive Density Estimator (NADE)
7.5 Masked Autoencoder Distribution Estimation (MADE)
7.6 Generative model of Natural images
7.7 Pixel RNN
7.8 Pixel CNN
7.8.1 Autoregressive over color-channels
7.8.2 Gated PixelCNN
7.9 TCNs-WaveNet
7.10 RNNs are autoregressive models
7.10.1 VRNN: A Recurrent Latent Variable Model for Sequential Data
7.11 Self-Attention and Transformers
7.11.1 Keys, values and queries
8 Normalizing flows
8.1 Introduction
8.2 Change of variables technique
8.3 Parameterize the Transformation f with NN
8.4 Coupling layers
8.4.1 Forward pass
8.4.2 Backward pass
8.4.3 Jacobian matrix
8.5 A Flow of Transformations
8.5.1 Training
8.5.2 Inference
8.6 Model architecture
8.6.1 Squeeze and Split
8.7 Applications in Computer Vision
9 GAN
9.1 Likelihood-free model
9.1.1 Case 1: great log-likelihood and poor samples
9.1.2 Case 2: poor log-likelihood and great samples
9.2 Introduction to GAN
9.3 Definitions
9.4 Training
9.4.1 General idea
9.4.2 Theory vs practice
9.5 Theoretical analysis
9.5.1 Derivation of the GAN objective
9.5.2 Optimal Discriminator
9.5.3 Global Optimality
9.5.4 Convergence of the training algorithm
9.6 Difficulties during training
9.6.1 Mode collapse
9.6.2 Issues with DJS
9.7 Comparison with VAE
9.8 Conditional GANs
IV Part Four: Deep Learning For Computer Vision
10 Parametric Body models and Applications
10.1 2D human pose representation and estimation
10.1.1 Introduction
10.2 Body modeling
10.2.1 Pictorial Structure Model
10.2.2 Pictorial Structure Model with Flexible Mixtures
10.3 Feature Representation Learning
10.3.1 Direct Regression
10.3.2 Heatmaps
10.4 Body modelling + Deep Representation Learning
10.4.1 Graph models
10.4.2 Inference
10.4.3 Training: Sub-gradient Descent
10.5.1 SMPL representation: 3D Mesh
10.5.2 Shape
10.5.3 Pose
10.6 Case Study: Learned-Gradient Descent
11 Neural Implicit Representations
11.1 Why we should be able to learn a representation of 3d shape
11.2 3D representations
11.3 Neural implicit representation
11.4 Implementation of Neural Implicit Representations
11.4.1 Watertight meshes
11.4.2 Point clouds
11.4.3 2d images
11.5 NEural Radiance Field
11.5.1 Architecture
11.5.2 Procedure
11.5.3 Comparison with implicit surfaces
11.5.4 Positional Encoding
11.5.5 Limits of NERF
V Part Five: Deep Reinforcement Learning
12 Reinforcement Learning
12.1 Motivations
12.2 RL problem statement
12.3 Major Components of an RL Agent
12.3.1 Policy
12.3.2 Value function
12.3.3 Model
12.4 Taxonomy of RL agents
12.4.1 Value Based
12.4.2 Policy Based
12.4.3 Model-free and model-based agents
12.5 Markov Decision Processes
12.5.1 Markov property
12.5.2 Markov Process
12.5.3 Markov Reward Process
12.5.4 Return
12.5.5 Value function
12.5.6 Markov Decision Process
12.5.7 Bellman equation to compute the return
12.5.8 Action-value function
12.5.9 Bellman Optimality Equation
12.6 Dynamic Programming
12.6.1 Value iteration
12.7 Monte Carlo methods
12.8 Temporal Difference Learning
12.8.1 Implementations
12.8.2 Pro and cons of Temporal Difference Learning
12.9 Deep Reinforcement Learning
12.9.1 Introduction
12.9.2 Deep Q-Learning
12.9.3 Policy search methods
Bibliography
Books
Articles
I Part One: Foundation of Deep Learning
1. Neural Network Basics
1.1 Outline
This chapter is a review of the basics of Neural Network Theory. We begin with the presentation of
the Perceptron model and its learning algorithm. We then extend these concepts to understand
modern Deep Neural Networks. Finally, in section 1.9.2, we analyze which types of functions can
be approximated using a DNN.
1.2 Biological motivations
During the first weeks of the course we focus on the study of bottom-up perception, with the aim
of understanding how stimuli are processed without the interference of top-down processes.
Artificial neural networks have traditionally taken inspiration from the workings of the nervous
system, which consists of basic computational units called neurons. In this regard, we will describe
the structure of a neuron and use it as an inspiration to derive the first building block of artificial
networks.
At a high level, each neuron receives signals from other neurons, processes them, and propagates the
new signal to other neurons.
The detailed process can be described as follows. First, the dendrites collect the chemical signals
(in the form of neurotransmitters) arriving from other neurons. All these signals are then integrated
in the soma, and the value obtained is compared to a given threshold. If the resulting signal, now
called the action potential, surpasses the threshold, it is transmitted to the next part of the neuron,
the axon. At this point, the action potential travels down the axon, reaching the axon terminals.
Here the signal can cause the release of neurotransmitters, which are received by the dendrites of
the next neurons, and the process restarts.
Figure 1.1: Representation of signal processing in a biological neuron
For the purposes of this course, it is important to keep in mind that a neuron combines multiple
inputs and only forwards them if they surpass the threshold, which introduces a non-linearity in
the propagation of the signal.
1.3 Perceptron
Among the earliest predecessors of modern deep learning are simple linear models. Just as brain
neurons collect multiple inputs from other neurons and combine them into a single signal in the soma,
these models are designed to take a set of m input values x and associate them with a single output
y ∈ {0, 1}. In order to perform this task, these models employ an m-dimensional set of weights w
(one for each input value) and a scalar bias b, which are called the parameters¹ of the model.
The output of the model is then computed as a function (which can be thought of as the neuron's
threshold) of the linear combination of the inputs, parametrized by the weights and the bias.
Among these early models is the Perceptron [11]. Its output is defined as follows.
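$$
\hat{y} = f(\mathbf{x}, \mathbf{w}) =
\begin{cases}
1 & \text{if } \mathbf{w}^\top \mathbf{x} + b > 0 \\
0 & \text{otherwise}
\end{cases}
$$
where $\mathbf{w}^\top \mathbf{x}$ is the dot product $\sum_{i=1}^{m} w_i x_i$.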
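As a minimal sketch of this output rule (the function name and the parameter values below are illustrative assumptions, not part of the lecture material), the Perceptron can be written in a few lines of plain Python:

```python
def perceptron_predict(x, w, b):
    """Perceptron output: 1 if w^T x + b > 0, else 0."""
    activation = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    return 1 if activation > 0 else 0

# Hypothetical parameters for m = 2 inputs, chosen only for illustration.
w = [1.0, -2.0]
b = 0.5

y_a = perceptron_predict([2.0, 0.0], w, b)   # activation = 2.5
y_b = perceptron_predict([0.0, 1.0], w, b)   # activation = -1.5
y_c = perceptron_predict([0.0, 0.25], w, b)  # activation = 0.0, which is not > 0
```

Note that the inequality is strict, so an input lying exactly on the decision boundary is assigned the output 0.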
1.3.1 Perceptron Learning Algorithm
In 1958, the Perceptron became the first among the linear models that could learn the weights that
defined the categories given examples of inputs from each category. Before introducing the learning
algorithm, we will define some terms:
• $D = ((\mathbf{x}^{(1)}, y^{(1)}), \dots, (\mathbf{x}^{(n)}, y^{(n)}))$ is the training set of n data samples, where:
– $\mathbf{x}^{(i)}$ is the i-th m-dimensional input vector
– $y^{(i)}$ is the desired output value of the perceptron for the i-th input
¹ In general, throughout the class we are going to indicate with Θ the set of parameters of a model.
Figure 1.2: The perceptron framework
• $\mathbf{w}^{\{k\}}$ represents the weights at iteration k
• η is the learning rate, which determines the step size at each iteration when updating the
model parameters towards minimizing the loss function
• $\hat{y}^{(i)} = \mathbb{1}\{\mathbf{w}^{\{k\}\top} \mathbf{x}^{(i)} > 0\}$ is the prediction of the output given $\mathbf{w}^{\{k\}}$, the weights of the model at
the k-th iteration
• $(y^{(i)} - \hat{y}^{(i)})$ is the residual of the i-th training sample
Moreover, as illustrated in the figure below, a 1 is prepended to the input vector x. In that way, x
becomes an (m + 1)-dimensional vector, as does w, which now also includes the bias term b.
Figure 1.3: The perceptron learning algorithm framework
The perceptron learning algorithm does the following.
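Algorithm 1 Perceptron Learning Algorithm
  $k \leftarrow 0$, $\mathbf{w}^{\{0\}} \leftarrow \mathbf{0}$;
  while $\exists$ a sample $(\mathbf{x}^{(i)}, y^{(i)})$ s.t. $\hat{y}^{(i)} \neq y^{(i)}$ do
    $\mathbf{w}^{\{k+1\}} = \mathbf{w}^{\{k\}} + \eta\,(y^{(i)} - \hat{y}^{(i)})\,\mathbf{x}^{(i)}$
    where $\hat{y}^{(i)} = \mathbb{1}\{\mathbf{w}^{\{k\}\top} \mathbf{x}^{(i)} > 0\}$;
    $k \leftarrow k + 1$;
  end while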
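The loop above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the names `predict` and `train_perceptron`, the `max_updates` safety cap, and the choice of the logical AND as a toy dataset are not part of the notes.

```python
def predict(w, x):
    """Thresholded output for an augmented input x (x[0] == 1 carries the bias)."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) > 0 else 0

def train_perceptron(samples, eta=1.0, max_updates=1000):
    """Perceptron learning rule; max_updates is a safety cap added for this sketch.

    samples: list of (x, y) with x a list of m features and y in {0, 1}.
    Returns (w, converged), where w has m + 1 entries (bias first).
    """
    data = [([1.0] + list(x), y) for x, y in samples]  # prepend the constant 1
    w = [0.0] * len(data[0][0])                        # w{0} = 0
    for _ in range(max_updates):
        # Look for any sample that is misclassified under the current weights.
        wrong = next(((x, y) for x, y in data if predict(w, x) != y), None)
        if wrong is None:
            return w, True
        x, y = wrong
        residual = y - predict(w, x)                   # +1 or -1 for a mistake
        w = [wj + eta * residual * xj for wj, xj in zip(w, x)]
    # Cap reached: report whether the final weights happen to classify everything.
    return w, all(predict(w, x) == y for x, y in data)

# Logical AND is linearly separable, so the loop should terminate.
and_data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
w_and, converged = train_perceptron(and_data)
```

The sketch always updates on the first misclassified sample it finds; the algorithm itself does not prescribe which misclassified sample to pick.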
Here are some observations about the algorithm:
1. By definition of $y^{(i)}$ and $\hat{y}^{(i)}$, the residual $(y^{(i)} - \hat{y}^{(i)})$ can be either −1, 0 or 1, as shown in
the table below. You can find a visualization of how the weights are adjusted in the 2D case
in the given Jupyter notebook.
   y    ŷ    y − ŷ
   0    0      0
   1    0      1
   0    1     −1
   1    1      0
Table 1.1: Possible values of y − ŷ
2. If the training set is linearly separable, the perceptron learning algorithm is guaranteed to
converge and to eventually find a set of weights w which correctly separates the two classes
of the training set. Conversely, it will never reach a state where all the input vectors are
classified correctly if the training set D is not linearly separable.
3. The algorithm does not find an optimal separation in terms of margin, as a Support Vector
Machine (SVM) does; it simply stops once it finds a solution to the separability
problem.
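The second observation can be checked empirically with a small experiment. The following is a sketch under assumptions made for illustration only: the OR and XOR truth tables as toy datasets, the helper names, and an update cap (`max_updates`) that stops the loop when no separating weights are found.

```python
def output(w, x):
    """Thresholded perceptron output for an augmented input x."""
    return 1 if sum(a * b for a, b in zip(w, x)) > 0 else 0

def updates_until_separated(samples, eta=1.0, max_updates=10_000):
    """Run the perceptron rule; return the number of updates, or None if the cap is hit.

    max_updates is an addition for this sketch so that the loop always ends.
    """
    data = [((1.0, *x), y) for x, y in samples]   # prepend 1 to absorb the bias
    w = [0.0] * len(data[0][0])
    for k in range(max_updates):
        wrong = [(x, y) for x, y in data if output(w, x) != y]
        if not wrong:
            return k
        x, y = wrong[0]
        w = [a + eta * (y - output(w, x)) * b for a, b in zip(w, x)]
    still_wrong = any(output(w, x) != y for x, y in data)
    return None if still_wrong else max_updates

or_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]   # linearly separable
xor_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]  # not linearly separable

or_updates = updates_until_separated(or_data)
xor_updates = updates_until_separated(xor_data)
```

On the OR data the loop stops after a handful of updates, whereas on XOR it exhausts the cap, since no line separates the two classes.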
1.4 The ingredients of a Neural Network
Neural Networks are, in general, an extension of the Perceptron model where:
1. The indicator threshold function is replaced by a non-linearity, called the activation function
(we will discuss it in detail later)
2. Rather than a single neuron, a neural network consists of multiple layers (a network) of
stacked neurons organized in two dimensions, where the outputs of one layer serve as the
inputs to the next layer
3. The final output is then used to compute a loss function L(Θ) which is a function of the
weights Θ of the whole architecture
Figure 1.4: Neural Network Principal Ingredients
In this chapter we use the sigmoid as the activation function. In the next chapter, we will
explore and compare other activation functions that can be used in neural networks to improve
their performance and flexibility.
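To make the three ingredients concrete, here is a minimal sketch of a forward pass through a small network with sigmoid activations. The layer sizes, the parameter values, and the squared-error loss are placeholder assumptions chosen for illustration; the notes define their loss functions later.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense_layer(inputs, weights, biases):
    """One layer: each neuron applies sigmoid(w . inputs + b)."""
    return [sigmoid(sum(w * a for w, a in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def forward(x, layers):
    """Feed x through the layers; the outputs of one layer are the inputs of the next."""
    a = x
    for weights, biases in layers:
        a = dense_layer(a, weights, biases)
    return a

# Hypothetical parameters Theta: 2 inputs -> 3 hidden units -> 1 output.
theta = [
    ([[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]], [0.0, 0.1, -0.1]),
    ([[1.0, -1.0, 0.5]], [0.2]),
]

def squared_error_loss(theta, x, y):
    """Example loss L(Theta) for one sample; squared error is a placeholder choice."""
    (y_hat,) = forward(x, theta)
    return (y_hat - y) ** 2

y_hat = forward([1.0, 2.0], theta)
loss = squared_error_loss(theta, [1.0, 2.0], 1.0)
```

The loss depends on the input only through the forward pass, so for fixed data it is a function of the parameters alone, which is what makes it suitable as an optimization objective.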
1.4.1 Sigmoid
The sigmoid function σ(x) is defined as
$$
\mathrm{sigm}(x) = \sigma(x) = \frac{1}{1 + e^{-x}} = \frac{e^{x}}{e^{x} + 1}
$$
The sigmoid has many desirable properties for use in neural networks, which we will explore in
more detail later.
• It is differentiable across its entire domain, which is necessary for training the network using
gradient-based optimization algorithms
• It outputs values in the (0, 1) interval, which can be interpreted as probabilities. This allows
us to use the sigmoid as a natural choice for the output layer of the neural network when we
want to perform binary classification tasks, where the output represents the probability of
belonging to a certain class.
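Both properties can be illustrated with a short sketch. The two algebraically equivalent forms of the sigmoid are used here to avoid overflow for large |x|, which is an implementation choice of this example. The derivative identity σ'(x) = σ(x)(1 − σ(x)) is a standard result that is not stated in this section.

```python
import math

def sigmoid(x):
    """Sigmoid evaluated with whichever equivalent form avoids overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))   # 1 / (1 + e^{-x})
    ex = math.exp(x)
    return ex / (ex + 1.0)                  # e^x / (e^x + 1)

def sigmoid_grad(x):
    """Derivative sigma'(x) = sigma(x) * (1 - sigma(x)) (standard identity)."""
    s = sigmoid(x)
    return s * (1.0 - s)

p = sigmoid(0.0)        # the midpoint of the (0, 1) range
g = sigmoid_grad(0.0)   # the slope is largest at x = 0
```

In exact arithmetic the output lies strictly inside (0, 1); in floating point it may round to 0.0 or 1.0 for inputs of very large magnitude.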
In general, activation functions are an essential component of neural networks as they introduce
nonlinearity, allowing the network to learn complex relationships and patterns in the data.
Specifically, the activation functions transform the input signal of a neuron to produce its output. This
transformation is critical as it helps to project instances that are not linearly separable into a space
where we can find a separating hyperplane, enabling the network to learn to classify or predict
outputs for a given input.
1.5 Supervised Learning
At the beginning of this course, we will focus on the framework of supervised learning, which is a
type of machine learning technique where an algorithm learns to make predictions or classifications
by training on a labeled dataset with known input-output pairs. In other words, we provide the
algorithm with a dataset containing input-output pairs, and the algorithm learns to generalize from
this dataset to make predictions or classifications on new, unseen data.
Figure 1.5: Supervised Learning framework
In particular, supervised learning consists of two stages:
• Learning: the estimation of the parameters Θ of the function fΘ from the training data
$\{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$
• Inference: keeping the learned Θ fixed, the model makes predictions ŷ = fΘ(x) for unseen
inputs
The fact that we test our model on unseen data is the main difference between learning and traditional
optimization.
1.5.1 Classification
An example of supervised learning task is classification.
Suppose we want to classify images depending on whether they represent a house or a boat.
Figure 1.6: Classification Example
In this case, our model takes a vectorized image as input and it outputs 0 if the image represents a
house, 1 if it represents a boat. Thus, our model in this case will be a mapping
$$f_\Theta : \mathbb{R}^{W \times H \times 3} \to \{\text{House}, \text{Boat}\}$$
In order to learn this mapping, we need to define a loss function to be optimized.

1.6 Defining a loss function: Maximum Likelihood Estimation
The first technique we are going to use to define a suitable loss function is Maximum Likelihood
Estimation.
Suppose we are given:
• A dataset $D = \{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$ with inputs $x^{(i)}$ and corresponding outputs $y^{(i)}$, where i
indexes the samples. The dataset is assumed to be drawn from an underlying probability
distribution $p_{data}$, which is inaccessible to us except through the samples in D, which we
assume to be i.i.d.
• A parametric family of probability distributions pmodel (y|X, Θ) over the output space, indexed
by Θ
When defining the loss function, we will always follow these three main steps:
1. Write down the parametric probability distribution of the model pmodel (y|X, Θ)
2. Decompose that probability distribution into per sample probability pmodel (y (i) |x(i) , Θ)
3. Convert everything in log scale and minimize the Negative Log Likelihood (NLL)
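More formally, the conditional maximum likelihood estimator for Θ is given by
$$\begin{aligned}
\Theta^*_{MLE} &= \arg\max_{\Theta} \; p_{model}(y \mid X, \Theta) \\
&= \arg\max_{\Theta} \prod_{i=1}^{N} p_{model}(y^{(i)} \mid x^{(i)}, \Theta) \\
&= \arg\max_{\Theta} \underbrace{\sum_{i=1}^{N} \log p_{model}(y^{(i)} \mid x^{(i)}, \Theta)}_{\text{log-likelihood}} && \text{(the logarithm is monotonically increasing)}
\end{aligned}$$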
Maximizing the log-likelihood is equivalent to minimizing its negative. This Negative Log Likelihood (NLL) is the loss function that we will use to optimize our networks.
We will now see some examples for different types of pmodel .
Case 1: pmodel is a Gaussian
Let X ∼ N(µ, σ²). In this situation our model is a 1D Gaussian distribution, and we can model
our system using a Gaussian probability density function with µ and σ as parameters.
In order to minimize the negative log-likelihood (NLL), we can adjust the values of µ and σ such
that the Gaussian curve places more probability mass on areas where we expect to see data, and
less probability mass on areas where we do not expect to see data. A visual representation of that
is given in fig. 1.7.
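To make this concrete, here is a small sketch (our own illustration, with made-up sample values) that evaluates the Gaussian NLL and checks that the sample mean and the maximum likelihood standard deviation (normalized by N rather than N − 1) give a lower NLL than a shifted or a wider Gaussian. The helper name `gaussian_nll` is hypothetical.

```python
import math

def gaussian_nll(data, mu, sigma):
    """Negative log-likelihood of i.i.d. samples under N(mu, sigma^2)."""
    n = len(data)
    squared_error = sum((x - mu) ** 2 for x in data)
    return 0.5 * n * math.log(2 * math.pi * sigma ** 2) + squared_error / (2 * sigma ** 2)

# Made-up 1D samples, used only for illustration.
data = [1.8, 2.1, 2.4, 1.9, 2.3]

# Closed-form maximum likelihood estimates for a Gaussian.
mu_mle = sum(data) / len(data)
sigma_mle = math.sqrt(sum((x - mu_mle) ** 2 for x in data) / len(data))

best = gaussian_nll(data, mu_mle, sigma_mle)
shifted = gaussian_nll(data, mu_mle + 0.5, sigma_mle)  # mass in the wrong place
wider = gaussian_nll(data, mu_mle, 2 * sigma_mle)      # mass spread too thinly
```

Both perturbed Gaussians place less probability mass on the observed data, so their NLL is higher than `best`.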
Figure 1.7: Modeling a Gaussian with NLL
Case 2: pmodel is a Bernoulli. Derivation of Cross Entropy as a MLE estimator
Suppose we are given a dataset $D = \{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$ with $y^{(i)} \in \{0, 1\}$, i.e. we are performing binary
classification.
Now a Gaussian would not be a valid pmodel anymore. Instead, we will model the output variable
ŷ(i) using a Bernoulli distribution
$$\hat{y}^{(i)} \sim \mathrm{Bern}(\sigma(\theta^\top x^{(i)}))$$
The parameter of this Bernoulli distribution, $\sigma(\theta^\top x^{(i)})$, derives from the model used for binary
classification, illustrated in fig. 1.8.
Figure 1.8: Binary classification model
Remark: the output of the sigmoid can be interpreted as the parameter of a probability distribution over two classes, since:
1. Its output is always positive
2. Its output lies in the interval (0, 1)
As we discussed before, in order to compute the optimal parameters $\theta^*_{MLE}$ of this model w.r.t. the
likelihood function, we minimize the NLL.
In particular, we do the following steps:
1. We compute the expression of the likelihood under the assumption that the instances are
independent and identically distributed (i.i.d.):
$$\begin{aligned}
p_{model}(y \mid X, \theta) &= \prod_{i=1}^{N} p(y^{(i)} \mid x^{(i)}, \theta) \\
&= \prod_{i=1}^{N} \Big(\underbrace{\frac{1}{1 + e^{-\phi}}}_{\pi^{(i)}}\Big)^{y^{(i)}} \Big(\underbrace{1 - \frac{1}{1 + e^{-\phi}}}_{1 - \pi^{(i)}}\Big)^{1 - y^{(i)}}, \qquad \phi = \theta^\top x^{(i)}
\end{aligned}$$
We can decompose the likelihood in this way since y(i) can only take on the values of 0 or 1,
so only one of the two terms in the product will be active in practice.
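2. We compute the negative log-likelihood (NLL) and use it as a loss function L(θ), which is
a function of the model parameters θ. This loss function is also known as Binary Cross
Entropy (BCE) and, averaged over the N samples, is given by:
$$L(\theta) = NLL = -\frac{1}{N} \log p_{model}(y \mid X, \theta) = -\frac{1}{N} \sum_{i=1}^{N} \Big[ y^{(i)} \log(\pi^{(i)}) + (1 - y^{(i)}) \log(1 - \pi^{(i)}) \Big]$$
The factor 1/N only rescales the loss and does not change its minimizer.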
In fig. 1.9 you can find a visualization of the values assumed by the BCE as a function of y(i) and
ŷ(i).
Figure 1.9: Cross Entropy visualization
The blue curve in the loss function plot corresponds to the case where the true label is 1, and only
the first term of the loss function is active. The red curve represents the case where the true label
is 0, and only the second term of the loss function is active.
The loss is minimized when the model predicts a high probability for the true class label and a
low probability for the other class label. By minimizing this loss function, we can train a model to
accurately predict the class labels for new instances.
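The following sketch evaluates the BCE for the binary classification model π(i) = σ(θ⊤x(i)) on a tiny, made-up dataset. The data, the parameter vectors and the helper names are assumptions for illustration only; the first input component is fixed to 1 so that it plays the role of a bias.

```python
import math

def sigmoid(z):
    # Plain form; adequate for the moderate values of z used here.
    return 1.0 / (1.0 + math.exp(-z))

def bce(theta, xs, ys):
    """Binary cross entropy, averaged over the samples."""
    total = 0.0
    for x, y in zip(xs, ys):
        phi = sum(t * xi for t, xi in zip(theta, x))
        pi = sigmoid(phi)
        total += y * math.log(pi) + (1 - y) * math.log(1 - pi)
    return -total / len(xs)

# Made-up data: label 1 when the second feature is positive.
xs = [[1.0, 2.0], [1.0, -1.0], [1.0, 0.5], [1.0, -2.0]]
ys = [1, 0, 1, 0]

loss_good = bce([0.0, 2.0], xs, ys)      # high probability on the true labels
loss_uniform = bce([0.0, 0.0], xs, ys)   # predicts 0.5 everywhere
loss_bad = bce([0.0, -2.0], xs, ys)      # high probability on the wrong labels
```

With θ = 0 every prediction is 0.5 and the loss equals log 2; parameters that favour the true labels push the loss below that value, while confidently wrong parameters are penalized heavily.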
Multiclass classification and softmax function
In multiclass classification, the number of possible class labels is greater than two. To support
multiclass classification, we need to modify the output layer of the neural network.
Instead of a single neuron, we need to have k stacked neurons, where k is the number of classes in
the dataset. The output of these neurons represents the probability of an input instance belonging
to each class.
To obtain a probability distribution over classes, we apply the softmax function to the k-dimensional
output vector of the stacked neurons. The softmax function maps the output of each neuron
to a probability between 0 and 1, such that the sum of probabilities over all classes is equal to
1.
Definition 1.6.1 — Softmax. The softmax of a k dimensional vector x is a k dimensional
vector, whose ith element is defined as
$$\mathrm{softmax}(x)_i \doteq \frac{e^{x_i}}{\sum_{j=1}^{k} e^{x_j}}$$
Like the sigmoid, the softmax satisfies all the properties required to be a probability distribution
over classes. In particular:
1. Its outputs are always positive (the exponential function is always positive)
2. Its outputs are always in (0, 1) (true thanks to the normalization factor $\sum_{j=1}^{k} e^{x_j}$ in the denominator
and to the fact that the exponential is always positive)
3. The sum of all the outputs is 1:
$$\sum_{i=1}^{k} \mathrm{softmax}(x)_i = \frac{\sum_{i=1}^{k} e^{x_i}}{\sum_{j=1}^{k} e^{x_j}} = 1$$
Remark: a sigmoid is a 2-class softmax, since $\sigma(x) = e^{x} / (e^{x} + e^{0}) = \mathrm{softmax}([x, 0])_1$.
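A short sketch of the softmax follows. Subtracting the maximum before exponentiating is a common numerical-stability trick that we add here as an assumption; it does not change the result because the common factor cancels in the ratio. The example inputs are arbitrary.

```python
import math

def softmax(x):
    """Softmax of a list of scores, shifted by max(x) for numerical stability."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, -1.0])

# A sigmoid is a 2-class softmax with the second score fixed to 0.
z = 0.7
two_class = softmax([z, 0.0])[0]
sigmoid_z = 1.0 / (1.0 + math.exp(-z))

# Large scores do not overflow thanks to the shift.
large = softmax([1000.0, 1000.0])
```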
1.6.1 Final considerations on MLE
We have seen that choosing pmodel (y|X, θ) as Bernoulli distribution yields the binary cross-entropy
estimator.
However, there are also other variations:
1. If we choose $p_{model}(y \mid x, \theta) = \mathcal{N}(y \mid \theta^\top x, \sigma^2)$ to be Gaussian, we end up with the least squares
(MSE) estimator: $\theta^*_{MLE} = \arg\min_\theta \lVert \theta^\top x - y \rVert_2^2$
2. Choosing pmodel(y|x, θ) to be a Laplacian distribution yields an estimator that minimizes
the l1 norm: $\theta^*_{MLE} = \arg\min_\theta \lVert \theta^\top x - y \rVert_1$
3. Assuming a Gaussian distribution over θ and performing maximum a-posteriori (MAP)
estimation yields ridge regression
There are many nice theoretical properties making MLE an appealing framework. Among them:
• Consistency: as the number of training samples N → ∞, the maximum likelihood estimate
θ̂ converges to the true parameters θ
• Efficiency: the maximum likelihood estimate converges quickly as N increases
1.7 Optimizing the network: Gradient Descent
So far we have learnt how to define a loss function and how to express the output for binary and
multiclass classification tasks.
In this section, our attention shifts towards optimizing the parameters of the network. Since the
minimizer of most loss functions, such as the Binary Cross-Entropy (BCE), cannot be written in closed form,
the typical approach in Deep Learning is to use iterative gradient descent.
1.7.1 Optimization procedure
The optimization procedure using Gradient Descent (GD) is as follows:
1. Choose a learning rate η and a tolerance ϵ
2. Initialize Θ{0} with small random values
3. Repeat until the norm of the gradient (∥v∥) is small enough, i.e. ∥v∥ < ϵ:
(a) Compute the gradient using the entire dataset
$$v = \nabla_\Theta L(\hat{y}, y) = \sum_{i=1}^{N} \nabla_\Theta L(\hat{y}^{(i)}, y^{(i)})$$
(b) Update the parameters using the GD update rule
$$\Theta^{\{t+1\}} = \Theta^{\{t\}} - \eta v$$
This procedure eventually leads to finding the parameters that correspond to a local minimum of
the cost function.
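The procedure above can be sketched for the binary classification model of section 1.6 trained with the BCE loss. This is an illustrative sketch under several assumptions: the dataset is made up (and deliberately not linearly separable, so that a finite minimizer exists), the parameters are initialized at zero instead of small random values to keep the run reproducible, the gradient is averaged over the samples, and `max_steps` is a safety cap that is not part of the procedure.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_gradient(theta, xs, ys):
    """Gradient of the averaged BCE: (1/N) * sum_i (pi_i - y_i) * x_i."""
    grad = [0.0] * len(theta)
    for x, y in zip(xs, ys):
        pi = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
        for k, xk in enumerate(x):
            grad[k] += (pi - y) * xk / len(xs)
    return grad

def norm(v):
    return math.sqrt(sum(g * g for g in v))

def gradient_descent(xs, ys, eta=0.5, eps=1e-3, max_steps=10000):
    theta = [0.0] * len(xs[0])                 # step 2 (zeros, for reproducibility)
    for _ in range(max_steps):                 # step 3
        v = bce_gradient(theta, xs, ys)        # (a) gradient on the entire dataset
        if norm(v) < eps:
            break
        theta = [t - eta * g for t, g in zip(theta, v)]  # (b) GD update
    return theta

# Made-up data: first component is a constant 1 (bias), classes overlap.
xs = [[1.0, -2.0], [1.0, -1.0], [1.0, 0.5], [1.0, 1.0], [1.0, 2.0], [1.0, -0.5]]
ys = [0, 0, 0, 1, 1, 1]
theta = gradient_descent(xs, ys)
```

Replacing the full dataset in step (a) with a small random subset of samples turns this loop into stochastic gradient descent.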
In practice, we use Stochastic Gradient Descent (SGD) instead of GD. SGD computes the gradient
only using a small subset of samples, which makes it more computationally efficient. As a result,
rather than computing the true gradient, at each step we compute a noisy estimate of it whose expectation is the true gradient.
It’s worth noting that the update rule for SGD is similar to what we previously saw in the perceptron
algorithm (see section 1.3.1).
1.7.2 Efficient Computation of Gradients
Now that we have seen that gradients are used to optimize the network parameters, we are interested
in ways to compute them efficiently.
In general, there are at least two methods to compute gradients:
• Method 1: symbolic differentiation
• Method 2: automatic differentiation (of which backpropagation is a form)
Symbolic differentiation
Consider the given scalar function
$$f = \exp\big(\exp(x) + \exp(x)^2\big) + \sin\big(\exp(x) + \exp(x)^2\big)$$
Symbolic differentiation gives us
$$\frac{df}{dx} = \exp\big(\exp(x) + \exp(x)^2\big)\big(\exp(x) + 2\exp(x)^2\big) + \cos\big(\exp(x) + \exp(x)^2\big)\big(\exp(x) + 2\exp(x)^2\big)$$
Automatic differentiation
Symbolic differentiation is in practice slow and inefficient for a machine: the resulting expression
repeats the same subexpressions (here exp(x) + exp(x)²) many times.
What is used in Deep Learning is called backpropagation and it is a form of automatic differentiation.
To illustrate how backpropagation works, we consider the same scalar function as
before:
$$f = \exp\big(\exp(x) + \exp(x)^2\big) + \sin\big(\exp(x) + \exp(x)^2\big)$$
To begin, we define and compute intermediate variables as follows:
$$a = \exp(x), \quad b = a^2, \quad c = a + b, \quad d = \exp(c), \quad e = \sin(c), \quad f = d + e$$
We can represent this function as a graph, known as a computational graph, composed of inputs,
single functions, and intermediate variables.
Figure 1.10: Computational graph
This graph allows us to compute gradients mechanically in a top-down manner, reusing previously
computed values according to the chain rule. Specifically:
Figure 1.11: Gradients computation
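As a hedged illustration of this mechanical procedure, the sketch below performs the forward pass through the intermediate variables a, ..., f and then walks the graph backwards, accumulating df/d(variable) with the chain rule. The function name `f_and_grad` and the finite-difference check are our own additions, not part of the notes.

```python
import math

def f_and_grad(x):
    # Forward pass: compute and store the intermediate variables.
    a = math.exp(x)
    b = a ** 2
    c = a + b
    d = math.exp(c)
    e = math.sin(c)
    f = d + e

    # Backward pass: start from df/df = 1 and reuse stored values.
    df_dd = 1.0                               # f = d + e
    df_de = 1.0
    df_dc = df_dd * d + df_de * math.cos(c)   # d = exp(c), e = sin(c)
    df_db = df_dc                             # c = a + b
    df_da = df_dc + df_db * 2.0 * a           # c = a + b and b = a^2
    df_dx = df_da * a                         # a = exp(x)
    return f, df_dx

x0 = 0.1
value, grad = f_and_grad(x0)

# Independent check with a central finite difference.
h = 1e-5
numeric = (f_and_grad(x0 + h)[0] - f_and_grad(x0 - h)[0]) / (2 * h)

# Closed-form derivative: (exp(c) + cos(c)) * (a + 2 a^2).
a0 = math.exp(x0)
c0 = a0 + a0 ** 2
closed_form = (math.exp(c0) + math.cos(c0)) * (a0 + 2 * a0 ** 2)
```

Each intermediate value is computed once in the forward pass and reused in the backward pass, which is what makes this cheaper than evaluating the expanded symbolic expression.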

1.7.3 Backpropagation in Neural Networks
Case 1: Single Unit
Next, we will apply the automatic differentiation method to compute the gradients of a simple
Neural Network composed of a single unit using the Mean Squared Error (MSE) as the loss function.
The architecture of the single unit is shown below in the form of a computational graph:
Figure 1.12: Single Unit architecture
Our ultimate goal is to obtain the gradient of the loss function, L(w), with respect to w. To
accomplish this, we will analyze how to compute the gradients of L(w) with respect to z[3] , z[2] ,
z[1] , and w[0] respectively.
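$$\begin{aligned}
\frac{\partial L}{\partial z^{[3]}} &= \frac{\partial z^{[3]}}{\partial z^{[3]}} = 1 \\
\frac{\partial L}{\partial z^{[2]}} &= \frac{\partial z^{[3]}}{\partial z^{[2]}} \cdot 1 \\
\frac{\partial L}{\partial z^{[1]}} &= \frac{\partial z^{[2]}}{\partial z^{[1]}} \cdot \frac{\partial z^{[3]}}{\partial z^{[2]}} \cdot 1 \\
\frac{\partial L}{\partial w^{[0]}} &= \frac{\partial z^{[1]}}{\partial w^{[0]}} \cdot \frac{\partial z^{[2]}}{\partial z^{[1]}} \cdot \frac{\partial z^{[3]}}{\partial z^{[2]}} \cdot 1 = x \cdot \sigma(z^{[1]})\big(1 - \sigma(z^{[1]})\big) \cdot 2\big(z^{[2]} - y\big)
\end{aligned}$$
Where in order to compute the last value we used the fact that
$$\frac{\partial z^{[3]}}{\partial z^{[2]}} = 2z^{[2]} - 2y, \qquad \frac{\partial z^{[2]}}{\partial z^{[1]}} = \sigma(z^{[1]})\big(1 - \sigma(z^{[1]})\big), \qquad \frac{\partial z^{[1]}}{\partial w^{[0]}} = x$$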
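The chain above can be sketched in code. We assume, consistently with the partial derivatives listed above, that the unit computes z[1] = w[0]·x, z[2] = σ(z[1]) and z[3] = (z[2] − y)², with z[3] being the loss; the numeric values of w, x and y are arbitrary and the function names are ours.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w, x, y):
    z1 = w * x            # linear part
    z2 = sigmoid(z1)      # activation
    z3 = (z2 - y) ** 2    # squared error, i.e. the loss L
    return z1, z2, z3

def backward(w, x, y):
    """Gradient of the loss w.r.t. w, propagated from z3 back to w."""
    z1, z2, z3 = forward(w, x, y)
    dL_dz3 = 1.0
    dL_dz2 = dL_dz3 * 2.0 * (z2 - y)
    dL_dz1 = dL_dz2 * z2 * (1.0 - z2)   # sigma'(z1) = sigma(z1)(1 - sigma(z1))
    dL_dw = dL_dz1 * x
    return dL_dw

w, x, y = 0.5, 2.0, 1.0
grad = backward(w, x, y)

# Finite-difference check of the analytic gradient.
h = 1e-6
numeric = (forward(w + h, x, y)[2] - forward(w - h, x, y)[2]) / (2 * h)
```

Here the prediction is below the target, so the gradient is negative and a gradient descent step increases w.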
Case 2: Layer-wise
We will now analyze how to combine the gradients in a multi-layer architecture, bearing in mind
that in a neural network, the input of each layer is the output of the layer below it.
Figure 1.13: Outputs and gradients flow through layers
Let us make some observations about the figures above:
1. While the outputs are propagated in forward direction, the gradients are propagated in the
backward direction.
2. Different layers can have different number of units (the layer l has N units while the layer
l − 1 has M units).
3. Typically each unit i in the l − 1 layer is connected to all N units in the layer above. Thus,
each unit of one layer receives input from all units of the layer below.
During the course we will use the following notation:
• $\delta_i^{[l-1]}$: gradient w.r.t. the ith unit in the (l − 1)th layer
• $\delta^{[l-1]}$: gradient that flows from layer l to layer l − 1
To compute the gradients we go through two substeps.
First, we compute the gradient for a single unit. In particular, we are interested in the gradient of
the j th unit of the layer l w.r.t. the ith unit of the layer l − 1.
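$$\delta_i^{[l-1]} = \frac{\partial L}{\partial z_i^{[l-1]}} = \sum_{j=1}^{N} \underbrace{\frac{\partial L}{\partial z_j^{[l]}}}_{\delta_j^{[l]}} \frac{\partial z_j^{[l]}}{\partial z_i^{[l-1]}} \qquad (1.1)$$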
Then, for an entire layer, we obtain

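$$\delta^{[l-1]} = \sum_{j=1}^{N} \frac{\partial L}{\partial z_j^{[l]}} \frac{\partial z_j^{[l]}}{\partial z^{[l-1]}} = \underbrace{\frac{\partial L}{\partial z^{[l]}}}_{\in \mathbb{R}^{1 \times N}} \underbrace{\frac{\partial z^{[l]}}{\partial z^{[l-1]}}}_{\in \mathbb{R}^{N \times M}}$$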
Next, we compute the weight update. Therefore, we need to take the derivative of L with respect
to the weights W [l] .
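$$\frac{\partial L}{\partial W^{[l]}} = \sum_{j=1}^{N} \delta_j^{[l]} \frac{\partial z_j^{[l]}}{\partial W^{[l]}} = \frac{\partial L}{\partial z^{[l]}} \frac{\partial z^{[l]}}{\partial W^{[l]}}$$
Here, as seen in Equation (1.1), we define $\delta_j^{[l]} = \frac{\partial L}{\partial z_j^{[l]}}$.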
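As a sketch of the two layer-wise formulas, consider the simplest possible case, which is an assumption on our part: a purely linear layer z[l] = W z[l−1] with N = 2 units on top of M = 3 units, so that the Jacobian ∂z[l]/∂z[l−1] is simply W. The loss L = ½‖z[l]‖² is also chosen only for illustration, because it gives δ[l] = z[l].

```python
def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def layer_backward(delta_l, W, z_prev):
    """Backward step through z^[l] = W z^[l-1].

    delta_l: dL/dz^[l], length N; W: N x M weights; z_prev: z^[l-1], length M.
    """
    N, M = len(W), len(W[0])
    # Equation (1.1): delta_i^[l-1] = sum_j delta_j^[l] * dz_j^[l]/dz_i^[l-1],
    # and for a linear layer dz_j^[l]/dz_i^[l-1] = W[j][i].
    delta_prev = [sum(delta_l[j] * W[j][i] for j in range(N)) for i in range(M)]
    # Weight gradient: dL/dW[j][i] = delta_j^[l] * z_i^[l-1].
    grad_W = [[delta_l[j] * z_prev[i] for i in range(M)] for j in range(N)]
    return delta_prev, grad_W

W = [[0.5, -1.0, 2.0],
     [1.5, 0.0, -0.5]]          # N = 2 units, M = 3 inputs
z_prev = [1.0, 2.0, -1.0]
z_l = matvec(W, z_prev)         # forward pass
delta_l = list(z_l)             # dL/dz^[l] for L = 0.5 * ||z^[l]||^2
delta_prev, grad_W = layer_backward(delta_l, W, z_prev)
```

With a non-linear activation the only change is that the Jacobian entries are additionally multiplied by the derivative of the activation.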
1.8 Block diagrams of a Single Unit Network and of a MLP
Figure 1.14: Block diagram of a Single Unit Network
Figure 1.15: Block diagram of a MLP

1.9 Approximation capabilities of Neural Networks
1.9.1 Linear Activation Functions are not enough
An MLP which makes use only of linear activation functions is equivalent to a single unit network
with a linear activation, since a composition of affine maps is itself an affine map.
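This collapse can be verified directly. The sketch below, with arbitrary made-up weights, builds a two-layer network with identity activations and the single affine map W = W2 W1, b = W2 b1 + b2 that computes the same function. The helper names are ours.

```python
def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

# Two layers with linear (identity) activations; values are arbitrary.
W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]   # 2 inputs -> 3 hidden units
b1 = [0.5, -0.5, 1.0]
W2 = [[1.0, -2.0, 0.5]]                      # 3 hidden units -> 1 output
b2 = [2.0]

def two_layer(x):
    hidden = add(matvec(W1, x), b1)
    return add(matvec(W2, hidden), b2)

# Equivalent single affine layer.
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)

def single_layer(x):
    return add(matvec(W, x), b)

x = [1.0, -2.0]
out_two, out_single = two_layer(x), single_layer(x)
```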
1.9.2 Universal approximation theorem
As we announced at the beginning of the chapter, our detour ends with an investigation of the
types of functions which can be approximated using deep neural networks (DNNs).
As seen in section 1.9.1, a network with only linear activations is not enough for learning many types of functions, thus a non-linearity
between layers is needed to make the network work.
Theorem 1.9.1 — Universal approximation theorem. Let σ : R → R be a non-constant,

bounded and continuous activation function. Let Im denote the m-dimensional unit hypercube
[0, 1]^m and let C(Im) denote the space of continuous real-valued functions on Im.
Then, for any function f ∈ C(Im) and any ϵ > 0, there exist an integer N, real constants
vi, bi ∈ R and real vectors wi ∈ R^m for i = 1, . . . , N such that
$$f(x) \approx g(x) = \sum_{i=1}^{N} v_i \, \sigma(w_i^\top x + b_i)$$
with |g(x) − f(x)| < ϵ, ∀x ∈ Im.
In a nutshell the theorem says that:

A feed-forward neural network with a single hidden layer and continuous non-linear activation
function can approximate any continuous function with arbitrary precision.
2. Training
2.1 Regularization
In general, the goal of Machine Learning is to design algorithms which perform well not only on
the training data, but especially on new inputs. For this reason, many strategies used in ML are
explicitly designed to reduce the test error, possibly at the expense of increased training error.
These strategies are known collectively as regularization.

Definition 2.1.1 Regularization is any modification we make to a learning algorithm that is
intended to reduce its generalization error but not its training error.
In an ideal scenario, the model family we have used during training includes the data generating
process but also many other possible generating processes. In this scenario regularization pushes or
restricts the solution space towards the true generating process.
Figure 2.1: Ideal scenario
However, the majority of the DL algorithms are indeed applied to extremely complicated domains
such as images, audio sequences and text, for which understanding the true generation process
essentially involves simulating the entire universe.
Thus, in a realistic scenario the model family we have used during training may not include the
true generating process. The role of regularization techniques is thus to find the model within the
family that best explains the true data generating process.
In general, we say that training a ML model is more like "trying to fit a square peg into a round
hole".
Figure 2.2: Realistic scenario
In this section, we are going to explore the most common regularization methods in deep learning.
2.1.1 Parameter Norm Penalties
Many regularization approaches are based on limiting the capacity of models, such as neural
networks, linear regression, or logistic regression, by adding a parameter norm penalty Ω(Θ) to the
loss function. We denote the regularized objective function by L̃:
L̃(Θ; X, y) = L(Θ; X, y) + λΩ(Θ)
where λ ∈ [0, ∞) is a hyperparameter that weights the relative contribution of the norm penalty term.
Setting λ = 0 gives no regularization, while larger values of λ correspond to more regularization.
As a consequence, when our training algorithm minimizes the regularized loss function L̃ it will
decrease both the original loss L on the training data and some measure of the size of the parameters
Θ (or some subset of them).
There are mainly two ways of doing parameter norm regularization, based on the type of norm we
decide to consider:
• L1 regularization (Lasso): $\Omega(\Theta) = \lVert \Theta \rVert_1 = \sum_i |\Theta_i|$
• L2 regularization (Ridge): $\Omega(\Theta) = \frac{1}{2} \lVert \Theta \rVert_2^2$
L2 parameters regularization (Ridge regression)
L2 parameters regularization, also known as weight decay, is one of the simplest and most common
kinds of parameter norm penalty. This strategy drives the weights closer to the origin by adding a
term $\frac{1}{2} \lVert \Theta \rVert_2^2$ to the objective function.
How this term affects the training of the network can be seen by analyzing the
gradient of L̃ and, subsequently, the update rule for the weights given by GD.
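The gradient of the regularized loss function w.r.t. the model’s weights can be written as:
$$\nabla_\Theta \tilde{L}(\Theta; X, y) = \nabla_\Theta L(\Theta; X, y) + \lambda \Theta$$
As a consequence, the update rule for the weights given by GD (with learning rate α) is:
$$\begin{aligned}
\Theta &\leftarrow \Theta - \alpha \nabla_\Theta \tilde{L}(\Theta; X, y) \\
&= \Theta - \alpha \nabla_\Theta \Big( L(\Theta; X, y) + \frac{\lambda}{2} \lVert \Theta \rVert_2^2 \Big) \\
&= \Theta - \alpha \big( \nabla_\Theta L(\Theta; X, y) + \lambda \Theta \big) \\
&= \underbrace{(1 - \alpha\lambda)\Theta}_{\text{weight decay}} - \underbrace{\alpha \nabla_\Theta L(\Theta; X, y)}_{\text{parameters update}}
\end{aligned}$$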
We can see that the addition of the weight decay term has modified the learning rule to multiplicatively
shrink the weight vector by a constant factor on each step, just before performing the usual
gradient update.
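The equivalence of the two forms of the update can be checked numerically. In the sketch below the parameter values, the gradient of the unregularized loss, the learning rate and λ are arbitrary illustrative numbers, and the function names are ours.

```python
def step_regularized(theta, grad, lr, lam):
    """GD step on the regularized loss: theta - lr * (grad L + lam * theta)."""
    return [t - lr * (g + lam * t) for t, g in zip(theta, grad)]

def step_weight_decay(theta, grad, lr, lam):
    """Same step written as shrink-then-update: (1 - lr*lam) * theta - lr * grad L."""
    return [(1.0 - lr * lam) * t - lr * g for t, g in zip(theta, grad)]

theta = [1.0, -2.0, 0.5]
grad = [0.1, -0.3, 0.2]     # stands in for the gradient of the unregularized loss
lr, lam = 0.1, 0.01

a = step_regularized(theta, grad, lr, lam)
b = step_weight_decay(theta, grad, lr, lam)

# With a zero data gradient the step is a pure shrink by the factor (1 - lr*lam).
shrunk = step_weight_decay(theta, [0.0, 0.0, 0.0], lr, lam)
```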
L1 regularization (Lasso)
While L2 weight decay remains the most common form of weight decay, there are other ways to
penalize the size of the model parameters. Another option, as we anticipated before, is to use L1
regularization.
L1 regularization is achieved by adding to the loss the term $\lVert \Theta \rVert_1 = \sum_i |\Theta_i|$, that in practice is the
sum of absolute values of the individual parameters. The loss in this case becomes
$$\tilde{L}(\Theta; X, y) = L(\Theta; X, y) + \lambda \lVert \Theta \rVert_1$$
Following the same reasoning as before, we will now compute its gradient to understand how using the
L1 regularization affects weights updates.
$$\nabla_\Theta \tilde{L}(\Theta; X, y) = \nabla_\Theta L(\Theta; X, y) + \lambda \, \mathrm{sign}(\Theta)$$
By inspecting the equation above, we can notice how the two regularization techniques are different
from each other. In particular, here the regularization contribution to the gradient no longer scales
linearly with each Θi ; instead it is a constant factor (λ) with a sign equal to sign(Θi ).
In comparison to L2 regularization, L1 regularization (for a sufficiently large λ) results in a solution
that is more sparse¹. Sparsity in this context refers to the fact that some parameters have an
optimal value of zero, which leads the model to ignore the input features which correspond to those weights.
For this reason, L1 regularization is often used in practice as a feature selection mechanism².
Conclusive notes on parameters regularization
Both forms of parameter regularization can be interpreted in two ways.
On the one hand, from a MAP perspective, they are equivalent to specifying a prior distribution on
the weights’ values. In particular, the L2 regularization corresponds to a Gaussian prior on the
weights, while the L1 regularization corresponds to a Laplace prior.
On the other hand, they can be seen as a form of constrained optimization. In particular, if Ω is
the L2 norm, then the weights are constrained to lie in an L2 ball. If Ω is the L1 norm, then the
weights are constrained to lie in a region of limited L1 norm.
This concept is explained visually in fig. 2.3.

¹ In mathematics, a sparse solution refers to a solution in which most of the coefficients or variables are
zero.
² Feature selection in machine learning refers to the process of selecting a subset of relevant features
from a larger set of variables to improve model accuracy and efficiency.
Figure 2.3: Position of the optimal solutions (w∗) using L1 (left) and L2 (right) regularization,
considering a 2-dimensional parameters vector w. Here, the ellipses represent level curves
of the loss (with a null loss in their centers) in the parameter space, while the grey areas
represent the constraints. w∗ represents the optimal parameter (the one we usually call Θ∗)
in the two cases. We can notice that, due to the sharp shape of the L1 constraint region, the
optimum under L1 regularization is more likely to lie at one of its corners, where some coordinates of
Θ (in the figure called w) are zero.
2.1.2 Ensemble Methods
The idea behind ensemble methods is to use a finite number of different machine learning models to
obtain better performance than any one of them alone. The reason that ensemble techniques work
is that different models will usually not all make the same errors on the test set.
In order to do that there are two main approaches:
1. Train different model classes (Linear Regression, Decision Tree, Neural Network) on the same
data and then aggregate the predictions
2. Train the same model class on different data (sampled from the original dataset) and aggregate
the predictions
In this section we are going to explore two ensemble methods: bagging and dropout.
2.1.3 Bagging
Bagging [1] (short for bootstrap aggregating) is a technique for reducing generalization error by
combining several models.
The procedure, as illustrated in fig. 2.4, is the following:
1. We create k bootstraps by sampling from the training set with replacement

2. For each bootstrap we train a classifier, to obtain k different classifiers
3. Our final prediction is a combination of the predictions of the different classifiers
Figure 2.4: Bagging procedure
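The three steps of bagging can be sketched as follows. Everything specific in this example is an assumption made for illustration: the 1D toy dataset, the very simple threshold classifier (`fit_stump`, which assumes class 1 has the larger values), the majority vote used as the combination rule, and the fixed random seed.

```python
import random

def fit_stump(sample):
    """Toy 1D classifier: threshold halfway between the two class means."""
    xs0 = [x for x, y in sample if y == 0]
    xs1 = [x for x, y in sample if y == 1]
    if not xs0 or not xs1:
        # Degenerate bootstrap containing a single class: use the sample mean.
        return sum(x for x, _ in sample) / len(sample)
    return 0.5 * (sum(xs0) / len(xs0) + sum(xs1) / len(xs1))

def bagging_fit(data, k, seed=0):
    rng = random.Random(seed)
    models = []
    for _ in range(k):
        # Step 1: bootstrap = sample len(data) points with replacement.
        bootstrap = [rng.choice(data) for _ in data]
        # Step 2: train one classifier per bootstrap.
        models.append(fit_stump(bootstrap))
    return models

def bagging_predict(models, x):
    # Step 3: combine the k predictions by majority vote.
    votes = sum(1 for threshold in models if x > threshold)
    return 1 if 2 * votes > len(models) else 0

data = [(-3.0, 0), (-2.5, 0), (-2.0, 0), (-1.5, 0), (-1.0, 0),
        (1.0, 1), (1.5, 1), (2.0, 1), (2.5, 1), (3.0, 1)]
models = bagging_fit(data, k=5)
```

For regression tasks the vote would typically be replaced by an average of the k predictions.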
2.1.4 Dropout
Dropout [12] is a computationally inexpensive technique to regularize a broad family of models.

In practice, it consists of ignoring a subset of neurons (chosen at random) during each training
iteration. It is seen as an ensemble technique as it is in practice equivalent to creating an ensemble
consisting of all sub-networks that can be formed by removing non-output units from an underlying
base network, as shown in fig. 2.5.
Figure 2.5: Dropout as an ensemble
We now analyze the modifications that the use of dropout introduces both at training and at test
stage.
Training Stage
Let y [l] be the input to the (l + 1)th layer in the network, f an activation function, and Θ[l] and b[l]
be respectively the weights and bias parameters at that layer.
As we have already seen, in the standard feed-forward configuration we have
$$z^{[l+1]} = \Theta^{[l+1]} y^{[l]} + b^{[l+1]}$$
$$y^{[l+1]} = f(z^{[l+1]})$$
When we introduce dropout in this network, we start ignoring a subset of neurons (chosen at
random) during each training iteration.
More formally, to train with dropout, we follow this procedure.

First, each time we load an example into a mini batch, we randomly sample each component of the
mask r[l] from a Bernoulli distribution with parameter p (the probability of keeping a neuron
active during training time). The mask decides which neurons are kept at layer l for that specific
mini batch. Using this mask we compute ỹ[l]:
$$r^{[l]} \sim \mathrm{Bern}(p)$$
$$\tilde{y}^{[l]} = r^{[l]} \odot y^{[l]}$$
Then we compute the input of the next layer as in the standard configuration, but this time using
ỹ[l] instead of y[l]:
$$z^{[l+1]} = \Theta^{[l+1]} \tilde{y}^{[l]} + b^{[l+1]}$$
$$y^{[l+1]} = f(z^{[l+1]})$$
Test Stage
At test time there are two main ways to account for dropout.
A first solution is to approximate with sampling. In order to do that, we first do a limited
number of forward passes through the network in which dropout behaves the same as at training time.
Then, we take the average of the results obtained. However, as one can imagine, this solution is
computationally expensive and, thus, a better one needs to be found.
An alternative solution, which is in general the preferred one, is the weight scaling inference
rule. The idea behind this solution is to make the expected total input to any unit at test time equal
to the expected total input at training time. In order to do that, as we know by hypothesis that
r[l] ∼ Bern(p), we can define and use new parameters Θ̃ defined as
$$\tilde{\Theta} = \mathbb{E}[r^{[l]} \Theta] = \sum_{r^{[l]} \in \{0, 1\}} p(r^{[l]}) \, r^{[l]} \Theta = p\Theta + (1 - p) \cdot 0 = p\Theta$$
As a consequence, in practice we scale by p the weights of each neuron in the testing phase.
For many classes of models that do not have nonlinear hidden units, the weight scaling inference
rule is exact. In case the model contains nonlinearities this rule is no longer exact, but still provides
a good approximation of the true geometric mean of the ensemble.
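The exactness of the weight scaling rule in the linear case can be checked by brute force. The sketch below uses a single linear unit with three inputs and arbitrary made-up weights; it enumerates all 2³ masks to compute the exact expected input under dropout and compares it with the input obtained at test time with weights scaled by p. The helper names are ours.

```python
import itertools
import random

def dropout_mask(n, p, rng):
    """Sample r ~ Bern(p) per unit: 1 keeps the unit, 0 drops it."""
    return [1 if rng.random() < p else 0 for _ in range(n)]

def unit_input(theta, y, b):
    """Total input z = theta . y + b of one unit."""
    return sum(t * v for t, v in zip(theta, y)) + b

def expected_train_input(theta, y, b, p):
    """Exact expectation of the unit input over all 2^n dropout masks."""
    total = 0.0
    for mask in itertools.product((0, 1), repeat=len(y)):
        prob = 1.0
        for r in mask:
            prob *= p if r == 1 else (1.0 - p)
        dropped = [r * v for r, v in zip(mask, y)]
        total += prob * unit_input(theta, dropped, b)
    return total

theta = [0.5, -1.0, 2.0]
y = [1.0, 3.0, -2.0]
b, p = 0.1, 0.8

train_expectation = expected_train_input(theta, y, b, p)
test_input = unit_input([p * t for t in theta], y, b)   # weight scaling rule

# One random mask, as it would be drawn during a training iteration.
mask = dropout_mask(len(y), p, random.Random(0))
```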
Figure 2.6: Test error for different architectures with and without dropout
Dropout and bagging
We conclude this section by comparing the two ensembling methods we have just seen.
Dropout                                                     | Bagging
Models share parameters                                     | Models are independent
Only a small percentage of the models is (partially) trained | All models are trained until convergence
Table 2.1: Dropout vs bagging
2.1.5 Data Normalization
Another regularization technique is data normalization.
Differences in scale can have a significant effect, both when they are present in the input data and
when they are present in the target values.
In particular, large input values can result in large weight values, which make the
predictions unstable.
On the other hand, a target variable with a large spread in its values can make the training
process unstable. For this reason, it is often useful to normalize the data before training a model.
We will now analyze how this should be done.
Let X be our input data, and xi a particular data point.
Then the mean and the standard deviation of the data are:
$$\mu = \frac{1}{n} \sum_{i=1}^{n} x_i$$
$$\sigma = \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \mu)^2}$$
Thus, we normalize X (per feature, per channel etc.) in the following way to obtain XN
(N=normalized)
$$X_N = \frac{X - \mu}{\sigma}$$
Important: during testing, we must use the same mean and standard deviation we have found
during the training, not a new one. This allows the model to be evaluated on a single example,
without needing to use definitions of µ, σ that depend on an entire minibatch.
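A minimal sketch of this rule for a single feature follows, using the standard library `statistics` module (whose `stdev` uses the same 1/(n − 1) normalization as the formula above). The data values and function names are made up for illustration.

```python
import statistics

def fit_normalizer(train):
    """Estimate mu and sigma on the training data only."""
    mu = statistics.mean(train)
    sigma = statistics.stdev(train)   # sample standard deviation, 1/(n-1)
    return mu, sigma

def normalize(values, mu, sigma):
    return [(v - mu) / sigma for v in values]

train = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
test = [3.0, 10.0]

mu, sigma = fit_normalizer(train)
train_n = normalize(train, mu, sigma)
# The test data reuses the training statistics; it is never used to refit them.
test_n = normalize(test, mu, sigma)
```

Because µ and σ are fixed after training, a single test example can be normalized on its own.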
2.1.6 Batch Normalization
In the case of very deep networks, however, input and output normalization is often not sufficient.
As a matter of fact, normalizing only the input may help in the learning of the first layer’s parameters,
but after that layer the data will likely no longer be normalized.
Batch normalization [5] has been proposed to try to solve this problem. The idea behind this
technique is to normalize not only input and output data, but also the activations computed at each
intermediate layer.
Figure 2.7: Batch normalization
Training phase
At training time, for each mini-batch of intermediate activations z1, . . . , zn, we compute their
mean and variance, that are:
$$\mu = \frac{1}{n} \sum_{i=1}^{n} z_i$$
$$\sigma^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (z_i - \mu)^2$$
Thus, we do the batch normalization in the following way
$$z_i^{norm} = \frac{z_i - \mu}{\sqrt{\epsilon + \sigma^2}}$$
$$\tilde{z}_i = \gamma z_i^{norm} + \beta$$
where ϵ is a small value added for numerical stability and γ, β are learnable parameters that rescale (γ)
and shift (β) the normalized activations, thereby adjusting the variance and the mean at that layer.
Test phase
At test time µ and σ are typically replaced by running averages that were collected during training time.
As said before, this allows the model to be evaluated on a single example, without needing to use
definitions of µ, σ that depend on an entire minibatch.
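The training and test behaviour can be sketched for a single scalar activation. This is a simplified illustration, not a reference implementation: the class name `BatchNorm1D`, the exponential running average with a `momentum` of 0.9, and the initial running statistics (mean 0, variance 1) are all assumptions; the notes only state that running averages are collected during training.

```python
import math

class BatchNorm1D:
    """Batch normalization of one scalar activation (illustrative sketch)."""

    def __init__(self, gamma=1.0, beta=0.0, eps=1e-5, momentum=0.9):
        self.gamma, self.beta = gamma, beta    # learnable scale and shift
        self.eps, self.momentum = eps, momentum
        self.running_mean, self.running_var = 0.0, 1.0

    def forward(self, batch, training):
        if training:
            n = len(batch)
            mu = sum(batch) / n
            var = sum((z - mu) ** 2 for z in batch) / (n - 1)
            # Collect running averages for use at test time.
            m = self.momentum
            self.running_mean = m * self.running_mean + (1 - m) * mu
            self.running_var = m * self.running_var + (1 - m) * var
        else:
            mu, var = self.running_mean, self.running_var
        scale = math.sqrt(self.eps + var)
        return [self.gamma * (z - mu) / scale + self.beta for z in batch]

bn = BatchNorm1D(gamma=2.0, beta=1.0)
train_out = bn.forward([1.0, 2.0, 3.0, 4.0], training=True)
# At test time a single example can be processed on its own.
test_out = bn.forward([2.5], training=False)
```

In training mode the output of each mini-batch has mean β, since the normalized activations are centred before the shift is applied.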
Here are some observations on batch normalization:
• The bias term in a linear layer (and convolutional layer) becomes redundant if you use batch
normalization after it
• Batch normalization makes weights in deeper layers more robust to changes in the weights of
the shallower layers of the network
• Each mini-batch is scaled by the mean/variance computed on just that mini batch. This
adds some inherent noise within that mini-batch (similar to dropout). Therefore it has a slight
regularization effect.
2.1.7 Data Augmentation
The best way to make a machine learning model generalize better is to train it on more data.
However, acquiring training data is expensive.
One way to get around this problem is to create new fake data by augmenting the real data.
This technique is called data augmentation.
For some machine learning tasks, such as image classification, it is reasonably straightforward to
create new fake data.
However, we need to ensure consistency between the transformed input and its label: we have to avoid
transformations that would change the correct class. In fig. 2.8 we can see two examples of
inconsistent data augmentation.
Figure 2.8: Examples of inconsistent data augmentation in classification (left) and regression
(right)
Thus, the goal before applying data augmentation is to exploit invariances (classification) or
equivariances (regression) of the function you are trying to learn to obtain new samples.
These are some examples of transformations used in order to do data augmentation:
• Geometric transformations: translations, rotations, flipping, scaling, cutting
• Noise injections: enforce robustness to noise perturbations
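The difference between invariance and equivariance can be illustrated with a horizontal flip. In this sketch the image is a tiny made-up grid of pixel values stored as a list of rows, the class label and the keypoint are hypothetical, and the helper names are ours: the class label is left untouched (invariance), while a keypoint label must be transformed together with the image (equivariance).

```python
def hflip_image(image):
    """Mirror an image (a list of pixel rows) left to right."""
    return [row[::-1] for row in image]

def hflip_keypoint(x, y, width):
    """Apply the same flip to a keypoint label: column x maps to width - 1 - x."""
    return width - 1 - x, y

image = [[0, 0, 9],
         [0, 0, 0]]            # one bright pixel at column 2, row 0
class_label = "boat"           # classification target
keypoint = (2, 0)              # regression target: (column, row) of the bright pixel

aug_image = hflip_image(image)
aug_class_label = class_label  # invariance: the class does not change
aug_keypoint = hflip_keypoint(*keypoint, width=len(image[0]))  # equivariance
```

Reusing the original keypoint for the flipped image would be an instance of the inconsistent augmentation described above.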
These are some useful links and packages to perform data augmentation:
• torchvision.transforms
• albumentations (check out the online demo)
• imgaug
2.1.8 Transfer learning
Another problem we may encounter when training our neural network is having only a small
dataset. As a matter of fact, training only on a small amount of data generally leads to poor
generalization.
A solution to this problem takes the name of transfer learning.
The idea is to first train the network on another task with a large dataset and, then, to fine-tune
the trained network on your original task.
The features learned from training on the large dataset can be exploited for solving the new task.
Figure 2.9: Transfer Learning
2.1.9 Pre-existing architectures
The first step can in many situations be avoided by using pre-existing architectures, already trained
on a specific task and dataset. Among the benefits of using these architectures we have the
modularity and the built-in regularizers. Some examples of available models are AlexNet, ResNet,
VGG, DenseNet, Inception.
If you are interested in seeing how transfer learning can be implemented in practice, we
recommend you take a look at this tutorial.
2.2 Activation functions
Activation functions make the layer-to-layer mappings non linear. Without activation functions
neural networks would only implement affine mappings.
Figure 2.10: Activation function
The family of activation functions is mainly divided into two groups: logistic activations and
rectified linear activations.
In this section we will see some of the most used functions of each group and we will analyze their
properties.
2.2.1 Logistic activations
Sigmoid
$$f(x) = \sigma(x) = \frac{1}{1 + e^{-x}}, \qquad f'(x) = \sigma(x)(1 - \sigma(x))$$
This function has the advantages of being differentiable everywhere and of having a finite range
(0, 1), which makes it suitable for mapping to a probability space as an output.
However, it is not the most used function (especially for the hidden units) nowadays as it saturates
across most of its domain: to a high value for very positive inputs, and to a low value for
very negative inputs.
Hyperbolic Tangent (tanh)
$$f(x) = \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}, \qquad f'(x) = 1 - f(x)^2$$
The tanh function is a scaled sigmoid function (tanh(x) = 2σ(2x) − 1).
Similarly to the sigmoid function, it is differentiable everywhere and it has a finite range (in this case (−1, 1)). As can be seen from the figure, it also shares the drawback of saturating across most of its domain.
However, it typically performs better than the sigmoid function. As a matter of fact, it resembles the identity function more closely, in the sense that tanh(0) = 0 while σ(0) = 0.5. For this reason, training a model with a tanh activation function resembles training a linear model (as long as the activations of the layer are kept small enough to avoid saturation). This makes the training of the model easier.
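The relation between the two logistic activations, and the saturation of their derivatives, can be checked numerically. The following is only a minimal NumPy sketch; the function names and the test grid are illustrative choices, not part of the lecture material.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid: 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - np.tanh(x) ** 2

x = np.linspace(-6.0, 6.0, 25)

# tanh is a rescaled and recentred sigmoid: tanh(x) = 2 * sigma(2x) - 1
identity_gap = np.max(np.abs(np.tanh(x) - (2.0 * sigmoid(2.0 * x) - 1.0)))

# Both derivatives are largest at 0 and shrink towards 0 for large |x| (saturation)
print(identity_gap, sigmoid_grad(0.0), sigmoid_grad(6.0), tanh_grad(6.0))
```

Note that the largest sigmoid derivative is 0.25 (at x = 0), while the largest tanh derivative is 1, which is one way to see why tanh behaves more like the identity near the origin.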
2.2.2 ReLU and its variants
ReLU
$$f(x) = \max(0, x), \qquad f'(x) = \begin{cases} 0 & \text{if } x < 0 \\ 1 & \text{otherwise} \end{cases}$$
Rectified linear units (ReLU) are very similar to purely linear units; the only difference is that they output zero for half of their domain. This behaviour brings both advantages and disadvantages.
Among the advantages we have that the piecewise linearity greatly accelerates the convergence
of gradient-based optimization algorithms, especially when compared to the sigmoid and tanh
functions. Moreover, it is computationally cheap to compute.
On the other hand, it may be a source of instability during learning. Its unbounded output range ([0, ∞)) can blow up the activations and destabilize training. Moreover, units with negative pre-activations get no update (the gradient is zero for x < 0). This last phenomenon is commonly referred to as the “dying” ReLU problem.
There are two possible remedies for the “dying” ReLU problem. On the one hand, one can use a slightly modified version of this function, such as Leaky ReLU, Parametric ReLU, ELU, SELU or GELU; some of these are presented in the remainder of this section. On the other hand, a careful initialization of the weights can help to avoid this problem: it makes it very likely that the rectified linear units will initially be active for most inputs in the training set and allow the derivatives to pass through.
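Leaky ReLU
$$f(x) = \begin{cases} \alpha x & \text{for } x < 0 \\ x & \text{for } x \geq 0 \end{cases}, \qquad f'(x) = \begin{cases} \alpha & \text{if } x < 0 \\ 1 & \text{if } x \geq 0 \end{cases}$$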
where α is a small positive constant.
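To make the difference between ReLU and Leaky ReLU concrete, here is a minimal NumPy sketch of both functions and their derivatives. The default α = 0.01 is an assumption chosen for illustration; the notes only require α to be a small positive constant.

```python
import numpy as np

def relu(x):
    """ReLU: max(0, x)."""
    return np.maximum(0.0, x)

def relu_grad(x):
    """ReLU derivative as defined above: 0 for x < 0, 1 otherwise."""
    return (x >= 0).astype(float)

def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: alpha * x for x < 0, x otherwise (alpha is illustrative)."""
    return np.where(x < 0, alpha * x, x)

def leaky_relu_grad(x, alpha=0.01):
    """Leaky ReLU derivative: alpha for x < 0, 1 otherwise."""
    return np.where(x < 0, alpha, 1.0)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x), relu_grad(x))
print(leaky_relu(x), leaky_relu_grad(x))
```

For negative inputs the ReLU gradient is exactly zero, whereas the Leaky ReLU gradient is α, so a unit with a negative pre-activation still receives a (small) update.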
Randomized Leaky ReLU
The Randomized Leaky ReLU is very similar to the leaky ReLU but it introduces a small random
negative slope for negative activations, rather than a fixed one.
In particular, at training time the slope value is sampled from a uniform distribution (α ∼ U(a, b)), while at test time α is set to its expected value (a + b)/2.
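The train/test behaviour of the Randomized Leaky ReLU can be sketched as follows. This is a hedged illustration only: the function name `rrelu`, the bounds a = 0.1 and b = 0.3, and the choice to sample one slope per element are assumptions made for the example.

```python
import numpy as np

def rrelu(x, a=0.1, b=0.3, training=True, rng=None):
    """Randomized Leaky ReLU sketch.

    Training: the negative slope alpha is sampled from U(a, b), one sample per element.
    Test: alpha is fixed to its expected value (a + b) / 2.
    The bounds a and b are illustrative, not values prescribed by the notes.
    """
    if training:
        rng = np.random.default_rng() if rng is None else rng
        alpha = rng.uniform(a, b, size=np.shape(x))
    else:
        alpha = (a + b) / 2.0
    return np.where(x < 0, alpha * x, x)

x = np.array([-2.0, 3.0])
test_output = rrelu(x, training=False)                              # deterministic slope 0.2
train_output = rrelu(x, training=True, rng=np.random.default_rng(0))  # random slope in [0.1, 0.3)
print(test_output, train_output)
```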
2.2.3 Final considerations on activation functions
The process of deciding which activation function to use in practice is not straightforward.
As a matter of fact, there is no clear winner, and the design process consists of trial and error and insights into the modelled system. However, some general insights from what we have seen so far are the following:
• Differentiability everywhere is good, but it is not strictly necessary
• Piecewise linear and non-saturating activations are better for hidden layers (faster convergence)
• Sigmoid and Softmax are needed for probabilistic outputs (Bernoulli and Multinoulli distributions)
2.3 Optimization Algorithms
The question we will be trying to address in this section is
How should we update model weights wij in order to minimize a loss function L?
In order to do that we will first present the Gradient Descent algorithm and some of its variants
(subsection 2.3.1). Then, we will move to analyzing some of its challenges and which methods are
nowadays used to overcome them (subsection 2.3.2).
2.3.1 Gradient Descent and its variants
The state-of-the-art methods for training DNNs are almost all variants of gradient descent (also called Batch GD). The idea is to follow the slope of the surface created by the objective function downhill.
Figure 2.11: Intuition behind gradient descent
Batch Gradient Descent
First, we analyze the vanilla version of the GD algorithms, the so-called Batch Gradient Descent or,
simply, GD.
Let L : ℝ^d → ℝ be a differentiable objective function, Θ the parameters of the network and η the learning rate. We define the iterate sequence generated by gradient descent as
$$\Theta_{t+1} = \Theta_t - \eta \nabla_\Theta L(\Theta_t)$$
where ∇Θ L(Θt) is the gradient of the loss function with respect to the parameters Θ, evaluated at Θt and computed on the entire training set.
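As an illustration of the update rule, the sketch below runs batch gradient descent on a small least-squares problem. The synthetic data, the learning rate η = 0.1 and the number of iterations are assumptions chosen so that the example converges quickly; they are not taken from the lecture.

```python
import numpy as np

# Hypothetical noiseless linear regression problem
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true

def loss(theta):
    """Mean squared error (with a factor 1/2) over the entire training set."""
    return 0.5 * np.mean((X @ theta - y) ** 2)

def grad(theta):
    """Gradient of the loss, computed on the entire training set."""
    return X.T @ (X @ theta - y) / len(y)

theta = np.zeros(3)
eta = 0.1                      # learning rate (illustrative value)
losses = []
for t in range(500):
    losses.append(loss(theta))
    theta = theta - eta * grad(theta)   # Theta_{t+1} = Theta_t - eta * grad L(Theta_t)

print(theta, losses[0], losses[-1])
```

Every iteration touches all 200 samples, which is exactly the cost that the stochastic variants discussed next try to avoid.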
Stochastic Gradient Descent
The computational cost of GD grows linearly with the number of samples in the dataset, so Batch GD becomes prohibitive for large datasets. To overcome this problem, we can use a variant of GD called Stochastic Gradient Descent (SGD). The idea is to evaluate the gradient on a single sample i ∈ {1, ..., n} (where n is the number of training samples in the dataset) drawn uniformly at random from the training set.
$$\Theta_{t+1} = \Theta_t - \eta \nabla_\Theta L(\Theta_t; x^{(i)}, y^{(i)})$$
From a theoretical point of view, the stochastic gradient has the advantage of being an unbiased estimate of the true gradient. However, it has a high variance, and an SGD step is not guaranteed to decrease the loss function at each iteration.
In addition to that, introducing stochasticity in the process allows the iterates to jump to new and potentially better local minima. This is particularly useful when the loss function is non-convex and has many local minima. However, it also makes it harder to converge to the minimum, because near a smooth minimum the SGD step is dominated by stochastic fluctuations. In practice, it is necessary to decrease the learning rate over time for SGD to converge.
From a computational perspective, SGD is much more efficient than GD. In fact, each iteration of SGD is n times cheaper than an iteration of GD, as it only has to compute the gradient on a single sample instead of the entire dataset.
Mini-batch GD
The mini-batch GD is a variant of GD that lies in between GD and SGD. In particular, the idea is
the following.
At each iteration, we sample a minibatch of m samples {x(1), ..., x(m)} from the training set together with the corresponding targets {y(1), ..., y(m)}. Then, we compute the gradient estimate as the mean of the per-sample gradients.
$$\hat{g} = \frac{1}{m} \nabla_\Theta \sum_{i=1}^{m} L(f(x^{(i)}; \Theta), y^{(i)})$$
Finally, we update the parameters Θ as follows
$$\Theta \leftarrow \Theta - \eta \hat{g}$$
This version of GD is faster per iteration than the Batch version because each update processes far fewer data points (m instead of the entire dataset). In addition to that, when compared to SGD, it reduces the variance of the gradient estimate and thus typically leads to more stable convergence.
Finally, it enables parallelization over up to m processors.
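The three variants differ only in how many samples enter the gradient estimate, which the following sketch makes explicit: m = 1 corresponds to SGD, m = n to Batch GD, and anything in between to mini-batch GD. The regression problem, the learning rate and the step count are assumptions made for the example.

```python
import numpy as np

# Hypothetical noiseless linear regression problem
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true

def minibatch_gd(X, y, m, eta=0.05, steps=2000, seed=0):
    """Mini-batch GD for least squares; m=1 is SGD, m=len(y) is Batch GD."""
    sampler = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(steps):
        idx = sampler.choice(n, size=m, replace=False)   # sample a minibatch
        residual = X[idx] @ theta - y[idx]
        g_hat = X[idx].T @ residual / m                  # mean of per-sample gradients
        theta = theta - eta * g_hat                      # Theta <- Theta - eta * g_hat
    return theta

theta_sgd = minibatch_gd(X, y, m=1)
theta_mini = minibatch_gd(X, y, m=16)
theta_batch = minibatch_gd(X, y, m=len(y))
print(theta_sgd, theta_mini, theta_batch)
```

Because the targets here are noiseless, even plain SGD with a constant learning rate reaches the solution; with noisy targets one would typically need to decay the learning rate, as noted above.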

2.3.2 Challenges in Optimization and some solutions to them
In this subsection we present the solutions to two common problems which arise in gradient-based optimization techniques. In particular, we will try to answer two questions.
How can gradient descent be modified to avoid a slow down in regions of small gradient norms?
We will see two methods (commonly referred to as heavy ball methods) which are able to accelerate the convergence of gradient descent in the aforementioned regions.
How can the effective step size be adapted per dimension?
We will see Adagrad and RMSProp, two methods which are able to adapt the learning rate per
dimension.
Finally we will analyze the most used optimization algorithm in deep learning, the Adam algorithm,
which combines the two approaches presented before.
Polyak’s Momentum
In some settings (especially the ones characterized by a poor condition number³ of the Hessian matrix of the loss function w.r.t. the parameters) the loss function changes very slowly in one direction, while it is very sensitive in another one. In these cases SGD will zigzag for a long time before reaching convergence.
The direction of the gradient in this case is not aligned with the direction toward the minima.
The updates give us very slow progress in the shallow dimension (the one corresponding to the
small gradients) and jitter in the other one. This problem becomes even more common in higher
dimensions.
One solution to this problem is to use a momentum term that accelerates SGD in the relevant
direction and dampens oscillations.
³ Conditioning refers to how rapidly a function changes with respect to small changes in its inputs. Functions that change rapidly when their inputs are perturbed slightly can be problematic for scientific computation because rounding errors in the inputs can result in large changes in the output. [6]
The first momentum-based algorithm was proposed by Polyak in 1964.
Its main goal was to accelerate learning, especially in the face of high curvature (of the loss function), small but consistent gradients, and noisy gradients (such as the ones coming from mini-batch GD). The momentum algorithm accumulates an (exponentially decaying) moving average of past gradients and continues to move in their direction.
Formally, the momentum algorithm introduces a variable v that plays the role of velocity; it is the direction and speed at which the parameters move through parameter space. We can see the velocity as a weighted sum of the previous gradients, with the most recent ones weighted more heavily. A hyperparameter α ∈ [0, 1) determines how quickly the contributions of previous gradients exponentially decay. The update rule is given by:
$$v \leftarrow \alpha v - \eta \nabla_\Theta \left( \frac{1}{m} \sum_{i=1}^{m} L(f(x^{(i)}; \Theta), y^{(i)}) \right)$$
$$\Theta_{t+1} \leftarrow \Theta_t + v$$
The larger α is relative to η, the more previous gradients affect the current direction.
As a consequence, the step size becomes larger when many successive gradients point in exactly the
same direction.
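The effect of the velocity term on a poorly conditioned problem can be illustrated with a two-dimensional quadratic whose curvature differs by a factor of 100 between the two axes. This is a sketch under assumed values: the quadratic, η = 0.018, α = 0.9 and the 200 steps are illustrative choices, and setting α = 0 recovers plain gradient descent.

```python
import numpy as np

# Hypothetical ill-conditioned quadratic: L(theta) = 0.5 * sum(c_i * theta_i^2)
curvatures = np.array([1.0, 100.0])

def loss(theta):
    return 0.5 * np.sum(curvatures * theta ** 2)

def grad(theta):
    return curvatures * theta

def run(alpha, eta=0.018, steps=200):
    """Polyak momentum; alpha = 0 reduces to plain gradient descent."""
    theta = np.array([1.0, 1.0])
    v = np.zeros(2)
    for _ in range(steps):
        v = alpha * v - eta * grad(theta)   # v <- alpha * v - eta * gradient
        theta = theta + v                   # Theta <- Theta + v
    return loss(theta)

loss_gd = run(alpha=0.0)
loss_momentum = run(alpha=0.9)
print(loss_gd, loss_momentum)
```

With these settings plain gradient descent is limited by the shallow direction, where each step shrinks the error only slightly, while the accumulated velocity lets the momentum variant make much faster progress along it.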
Figure 2.12: SGD+momentum helps with stuck points and poor conditioning
However, Polyak’s momentum has been shown to fail to converge even in the very simple case of a strongly convex and smooth function, for certain choices of α and η.
A solution to this problem is given by Nesterov’s momentum.
Nesterov’s Momentum
Nesterov’s Momentum pursues the same idea as Polyak’s Momentum, but evaluates the gradient at the look-ahead point Θ + αv rather than at Θ.
$$v \leftarrow \alpha v - \eta \nabla_\Theta \left( \frac{1}{m} \sum_{i=1}^{m} L(f(x^{(i)}; \Theta + \alpha v), y^{(i)}) \right)$$
$$\Theta_{t+1} \leftarrow \Theta_t + v$$
The idea is to give less weight to the velocity factor and more weight to the gradient.
Instead of evaluating the gradient at the current position (red circle), we know that our momentum is about to carry us to the tip of the green arrow. With Nesterov momentum we therefore evaluate the gradient at this "looked-ahead" position.
Figure 2.13: SGD+Momentum (left) and SGD+Nesterov’s Momentum (right)
However, this notation is a little inconvenient, since we usually want to evaluate the loss and the gradient at the same point, while here we compute the loss at Θt and the gradient at Θt + αvt. We can do a change of variables (Θ̃t = Θt + αvt) to obtain

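$$v_{t+1} \leftarrow \alpha v_t - \eta \nabla_\Theta \left( \frac{1}{m} \sum_{i=1}^{m} L(f(x^{(i)}; \tilde{\Theta}_t), y^{(i)}) \right)$$
$$\tilde{\Theta}_{t+1} \leftarrow \tilde{\Theta}_t - \alpha v_t + (1 + \alpha) v_{t+1}$$
The factor (1 + α) follows from Θ̃t+1 = Θt+1 + αvt+1 together with Θt+1 = Θt + vt+1 and Θt = Θ̃t − αvt.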
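The equivalence between the look-ahead form and the change-of-variables form can be verified numerically. In the sketch below the shifted update is written as Θ̃ ← Θ̃ − αv_t + (1 + α)v_{t+1}, which is what one obtains by substituting Θ̃ = Θ + αv into Θ_{t+1} = Θ_t + v_{t+1}. The quadratic objective, α = 0.9 and η = 0.005 are assumptions chosen for the example.

```python
import numpy as np

# Hypothetical quadratic objective with gradient c * theta
curvatures = np.array([1.0, 100.0])

def grad(theta):
    return curvatures * theta

alpha, eta = 0.9, 0.005

# Form 1: gradient evaluated at the look-ahead point theta + alpha * v
theta = np.array([1.0, 1.0])
v = np.zeros(2)

# Form 2: shifted variable theta_tilde = theta + alpha * v
theta_tilde = theta + alpha * v
v_tilde = np.zeros(2)

for _ in range(50):
    # Look-ahead form
    v = alpha * v - eta * grad(theta + alpha * v)
    theta = theta + v

    # Change-of-variables form (loss and gradient at the same point theta_tilde)
    v_new = alpha * v_tilde - eta * grad(theta_tilde)
    theta_tilde = theta_tilde - alpha * v_tilde + (1.0 + alpha) * v_new
    v_tilde = v_new

# Both forms describe the same trajectory: theta_tilde == theta + alpha * v
max_gap = np.max(np.abs(theta_tilde - (theta + alpha * v)))
print(max_gap)
```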
2.3.3 Adaptive Learning Rate: Adagrad and RMSProp
As we have anticipated before, momentum is not the only way to stabilize the oscillations of the gradient. Another solution is to adapt the learning rate to each parameter.
With adaptive learning rate strategies, ideally we would like to take smaller steps along “steeper” directions of the cost function. To do that, we make the step size inversely proportional to the magnitude of past gradients, so that
• Dimensions with large gradients have rapid decrease in their learning rate
• Dimensions with small gradients have a small decrease in their learning rate
• Greater progress in the more gently sloped directions of parameter space
The two best-known adaptive learning rate strategies are Adagrad and RMSprop.
The idea of Adagrad is to keep a sum of the past squared gradients for each dimension and to divide the present gradient by the square root of that sum. As a result, if a dimension has had really small gradients, its gradient will be divided by a small quantity and its magnitude will increase, while in the opposite case its magnitude will be decreased.
However, the accumulation of squared gradients from the beginning of training can result in a premature and excessive decrease in the learning rate. A solution to that is given by RMSprop.
RMSprop, instead of just summing up the squared gradients over the whole training run, takes an exponentially weighted moving average of them, which discards history from the extreme past.
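The difference between the two accumulators can be seen by feeding both methods the same constant gradient. The sketch below is a minimal illustration: the function names, η = 0.1, the decay rate ρ = 0.9 and the small constant `eps` are assumptions, and the constant gradient is an artificial test signal rather than a real loss.

```python
import numpy as np

def adagrad_step(theta, g, accum, eta=0.1, eps=1e-8):
    """Adagrad: accumulate the sum of squared gradients per dimension."""
    accum = accum + g ** 2
    return theta - eta * g / (np.sqrt(accum) + eps), accum

def rmsprop_step(theta, g, avg, eta=0.1, rho=0.9, eps=1e-8):
    """RMSprop: exponentially weighted moving average of squared gradients."""
    avg = rho * avg + (1.0 - rho) * g ** 2
    return theta - eta * g / (np.sqrt(avg) + eps), avg

g = np.array([10.0, 0.1])          # one steep and one shallow dimension
theta_a, accum = np.zeros(2), np.zeros(2)
theta_r, avg = np.zeros(2), np.zeros(2)
for t in range(100):
    prev_a, prev_r = theta_a, theta_r
    theta_a, accum = adagrad_step(theta_a, g, accum)
    theta_r, avg = rmsprop_step(theta_r, g, avg)

# Size of the last (100th) step in each dimension
last_adagrad_step = np.abs(theta_a - prev_a)
last_rmsprop_step = np.abs(theta_r - prev_r)
print(last_adagrad_step, last_rmsprop_step)
```

Both methods take equally sized steps in the steep and the shallow dimension. With a constant gradient the Adagrad step keeps shrinking like η/√t, while the RMSprop step settles near η because old squared gradients are forgotten.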
Figure 2.14: Adagrad and RMSProp
2.3.4 Adaptive Moment Estimation (Adam)
The idea of Adam is to collect both first order (momentum, velocity) and second order (Adagrad/RMSProp) moments of the gradient and to combine the momentum and adaptive learning rate approaches.
Let β1, β2 be the exponential decay rates for the moment estimates, Θ0 the initial parameter vector and α the step size.
The Adam algorithm then proceeds in the following way:

Algorithm 2 Adam algorithm
m_0 ← 0, v_0 ← 0, t ← 0
while Θ_t has not converged do
    t ← t + 1
    g_t ← ∇_Θ f(Θ_{t−1})                     ▷ get gradients
    m_t ← β_1 m_{t−1} + (1 − β_1) g_t         ▷ update biased 1st moment estimate
    v_t ← β_2 v_{t−1} + (1 − β_2) g_t²        ▷ update biased 2nd moment estimate
    m̂_t ← m_t / (1 − β_1^t)                  ▷ compute bias-corrected 1st moment estimate
    v̂_t ← v_t / (1 − β_2^t)                  ▷ compute bias-corrected 2nd moment estimate
    Θ_t ← Θ_{t−1} − α m̂_t / (√v̂_t + ϵ)      ▷ update parameters
end while

Here are some additional observations about the algorithm.
Without computing the bias-corrected estimates, after the first updates the moment estimates would be biased towards zero due to their zero initialization. In particular, we would be dividing by a very small √vt and would therefore take a very large step at the beginning, purely because of the zero initialization.
Note that ϵ (often ϵ ∼ 10^−7) is added for numerical stability, since we are dividing by √v̂t, which could be very small.
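The algorithm can be sketched in a few lines of NumPy. This is an illustrative implementation under assumptions: the function name `adam`, the fixed number of steps in place of a convergence test, and the default hyperparameter values are choices made for the example.

```python
import numpy as np

def adam(grad_fn, theta0, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-7, steps=1000):
    """Minimal Adam sketch; runs a fixed number of steps instead of testing convergence."""
    theta = np.asarray(theta0, dtype=float).copy()
    m = np.zeros_like(theta)                       # 1st moment estimate
    v = np.zeros_like(theta)                       # 2nd moment estimate
    for t in range(1, steps + 1):
        g = grad_fn(theta)                         # get gradients
        m = beta1 * m + (1.0 - beta1) * g          # update biased 1st moment
        v = beta2 * v + (1.0 - beta2) * g ** 2     # update biased 2nd moment
        m_hat = m / (1.0 - beta1 ** t)             # bias-corrected 1st moment
        v_hat = v / (1.0 - beta2 ** t)             # bias-corrected 2nd moment
        theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Thanks to the bias correction, the very first step has magnitude close to alpha,
# regardless of the scale of the gradient.
after_large_grad = adam(lambda th: 1000.0 * th, [1.0], alpha=0.01, steps=1)
after_small_grad = adam(lambda th: 0.001 * th, [1.0], alpha=0.01, steps=1)
print(after_large_grad, after_small_grad)
```

In both calls the parameter moves from 1.0 to roughly 0.99, even though the two gradients differ by six orders of magnitude, which illustrates the per-dimension normalization of the step size.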
2.4 Last practical suggestions
In this last section, we provide some suggestions about how to train a neural network in practice,
given the theoretical foundation given in the sections before.
2.4.1 Learning rate
All the optimization methods we have analyzed before have learning rate as a hyperparameter,
which must be tuned according to the specific task.
A common strategy is to set an initial learning rate η0 and to decay its value during training. There are mainly three ways to do that:
1. Step decay: $\eta_{t+1} = \eta_0 \cdot \alpha^{\lfloor t / \tau \rfloor}$, α ∈ (0, 1)
2. Exponential decay: $\eta_{t+1} = \eta_0 \cdot e^{-kt}$, k > 0
3. Time-based decay: $\eta_{t+1} = \frac{\eta_0}{1 + kt}$
A common strategy is to start without any decay and to introduce it later on.
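The three schedules can be written as small functions of the iteration counter t. The default values of α, τ and k below are assumptions chosen for illustration, and τ is interpreted as the number of iterations between two consecutive drops of the step schedule.

```python
import math

def step_decay(t, eta0, alpha=0.5, tau=10):
    """Step decay: multiply the learning rate by alpha every tau iterations."""
    return eta0 * alpha ** math.floor(t / tau)

def exponential_decay(t, eta0, k=0.1):
    """Exponential decay: eta0 * exp(-k * t)."""
    return eta0 * math.exp(-k * t)

def time_based_decay(t, eta0, k=0.1):
    """Time-based decay: eta0 / (1 + k * t)."""
    return eta0 / (1.0 + k * t)

for t in (0, 10, 25):
    print(t, step_decay(t, 0.1), exponential_decay(t, 0.1), time_based_decay(t, 0.1))
```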
2.4.2 Ensembles of different models
In section 2.1.2 we have seen that ensembles of different models can improve the performance of a
single model. In this chapter we will see how this can be done in practice when training a neural
network.
As we have seen before, the idea behind ensemble methods is to train multiple models independently,
and then, at test time average their results.
There are a few approaches to forming an ensemble.
A first approach is to use the same model with different initializations. Use cross-validation to determine the best hyperparameters, then train multiple models with the best set of hyperparameters but with different random initializations. The danger with this approach is that the variety is only due to initialization.
Another possible approach is to use the top models discovered during cross-validation. Use cross-
validation to determine the best hyperparameters, then pick the top few (e.g. 10) models to form
the ensemble. This improves the variety of the ensemble but has the danger of including suboptimal
models. In practice, this can be easier to perform since it doesn’t require additional retraining of
models after cross-validation.
A third approach is to use different checkpoints of a single model. If training is very expensive, some people have had limited success in taking different checkpoints of a single network over time (for example after every epoch) and using those to form an ensemble. Clearly, this suffers from some lack of variety, but it can still work reasonably well in practice. The advantage of this approach is that it is very cheap.
A fourth approach is to use a running average of parameters during training. This is similar to the
previous approach, but instead of taking checkpoints at fixed intervals, we take a running average
of the parameters over time. This is equivalent to taking the average of the parameters of the last
few epochs. This is a very cheap way to form an ensemble, but it has the disadvantage that it can
only be used at test time, since the running average is not a valid set of parameters for the model.
2.4.3 Fast Geometric Ensembling
Garipov et al. [2] proposed an approach called Fast Geometric Ensembling (FGE). They found that local minima are connected by simple curves with almost constant training and testing loss.
Figure 2.15: Loss surfaces
Notice that in each panel a direct linear path between each mode would incur high loss.
The key idea is to follow these curves of constant loss to explore new local minima.
More formally they have suggested the following procedure:
1. Train your model normally for about 80% of the training time
2. Adopt a cyclic LR for the remaining 20% of training time
3. Save checkpoints when LR is lowest
4. Ensemble all checkpointed models for inference
However, this approach has a drawback in terms of required inference time. If we save k checkpoints, inference requires k times the computation of a single model.
2.4.4 Stochastic Weight Averaging
Stochastic Weight Averaging (SWA) can be interpreted as an approximation to FGE ensembles, but with the test-time convenience and interpretability of a single model.
The algorithm is the following:
1. Train your model normally for about 80% of the training time
2. Initialize wSWA with the weights from your pretrained model
3. Adopt a cyclic LR for the remaining 20% of training time
4. For every cycle, when the lowest learning rate is reached, update wSWA using a running average
$$w_{SWA} \leftarrow \frac{w_{SWA} \cdot n_{models} + w}{n_{models} + 1}$$
5. Use wSWA for inference
Note that if the DNN uses batch normalization, we run one additional pass over the data after training is finished, to compute the running mean and standard deviation of the activations for each layer of the network with the wSWA weights, since these statistics are not collected during training.
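The running-average update of step 4 can be sketched as follows. This is a toy illustration under assumptions: the weights are random vectors standing in for the network parameters at the end of each learning-rate cycle, and the counter `n_models` is started at 1 so that the pretrained weights count as the first averaged model.

```python
import numpy as np

def swa_update(w_swa, w, n_models):
    """Running average: w_swa <- (w_swa * n_models + w) / (n_models + 1)."""
    return (w_swa * n_models + w) / (n_models + 1), n_models + 1

rng = np.random.default_rng(0)
w_pretrained = rng.normal(size=4)                  # stand-in for the pretrained weights
# Stand-ins for the weights reached at the end of each cyclic-LR cycle
cycle_end_weights = [w_pretrained + 0.1 * rng.normal(size=4) for _ in range(5)]

w_swa, n_models = w_pretrained.copy(), 1           # assumption: pretrained model counts as one
for w in cycle_end_weights:
    w_swa, n_models = swa_update(w_swa, w, n_models)

# The running average equals the plain mean of all averaged weight vectors
plain_mean = np.mean([w_pretrained] + cycle_end_weights, axis=0)
print(np.max(np.abs(w_swa - plain_mean)))
```

Only one extra copy of the weights has to be kept in memory, which is why updating the average is cheap and why inference costs the same as for a single model.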
Here are some advantages of SWA:
• Updating the running average is relatively cheap

• Inference cost is the same as for a normal model
• Easy to add to existing models
• Can lead to huge performance increase
II
Part Two: CNNs, RNNs & Co
3 Convolutional Neural Network
3.1 Introduction
3.2 The Neuroscientific Basis for CNNs
3.3 Convolution operation
3.4 Difference between convolution and correlation
3.5 Convolutional Neural Network
3.6 Practical observations
4 Fully Convolutional Neural Network
4.1 The goal: pixelwise predictions
4.2 Upsampling techniques
4.3 U-Net: the most used FCNN
4.4 Applications
5 Recurrent Neural Network
5.1 Introduction
5.2 Dynamical System
5.3 Vanilla Recurrent Neural Network
5.4 Solving the problem of vanishing and exploding gradient: LSTM and Friends
3. Convolutional Neural Network
3.1 Introduction
Many tasks in the field of computer vision can be solved with CNNs.
Among them we can find:
• Classification: assigning an input image to a specific category or label

• Classification and Localization: identifying the category of an object in an image and
also its location within the image
• Object detection: detecting and localizing multiple objects within an image
• Instance Segmentation: identifying and segmenting individual objects within an image
• Body Pose Estimation: estimating the 3D position and orientation of a person’s body
parts
• Eye Gaze Estimation: estimating the direction of a person’s gaze from the position of
their eyes
• Dynamic Gesture Recognition: identifying and recognizing human gestures and move-
ments in real-time
Figure 3.1: Computer vision tasks solved with CNN

3.2 The Neuroscientific Basis for CNNs
From a neuroscientific perspective, convolutional layers have been inspired by how the visual cortex
works. Particularly influential were the discoveries made by Hubel & Wiesel between 1959 and
1968.
Hubel and Wiesel introduced the concept of cell hierarchy in the visual cortex, where different
types of cells hierarchically transform visual stimuli. They distinguished between simple, complex,
and hypercomplex cells, each playing a specific role in the processing of visual information.
Figure 3.2: Feature hierarchy as theorized by Hubel & Wiesel
An important finding was that while simple cells (found at the lowest level of the hierarchy) are
susceptible to fuzziness and noise, complex cells are not. In particular, complex cells respond to
the largest output from a bank of simple cells to achieve oriented responses that are robust to
distortion.
The HMAX model is a biologically motivated architecture for computer vision that incorporates
these neuroscientific insights. It closely aligns with existing physiological evidence, particularly in
terms of the existence and operation of simple (S) and complex (C) cells at different levels of the
visual hierarchy. Simple cells (S cells) are tuned to specific stimuli and typically have small
receptive fields. Given an input x, the response y of a simple cell is computed as follows:
$$y = \exp\left( -\frac{1}{2\sigma^2} \sum_{j=1}^{n_{S_k}} (w_j - x_j)^2 \right)$$
On the other hand, complex cells (C cells) combine the outputs from multiple simple cells to
increase invariance and receptive field size. The output of a complex cell is computed as follows:
$$y = \max_{j = 1, \dots, n_{C_k}} x_j$$
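The interplay between the two cell types can be sketched on a one-dimensional toy signal: simple cells compare each patch with a stored template, and a complex cell takes the maximum over their responses. The template, the signal and the helper `s_layer` are hypothetical and only serve the illustration.

```python
import numpy as np

def simple_cell(x, w, sigma=1.0):
    """S cell: Gaussian tuning around the preferred stimulus w."""
    return np.exp(-np.sum((w - x) ** 2) / (2.0 * sigma ** 2))

def complex_cell(responses):
    """C cell: maximum over the responses of a bank of simple cells."""
    return np.max(responses)

def s_layer(signal, template, sigma=1.0):
    """Apply the same simple cell to every patch of the signal (hypothetical helper)."""
    k = len(template)
    return np.array([simple_cell(signal[i:i + k], template, sigma)
                     for i in range(len(signal) - k + 1)])

template = np.array([1.0, 0.0, 1.0])
signal = np.array([0.0, 1.0, 0.0, 1.0, 0.0, 0.0])
shifted_signal = np.roll(signal, 1)          # same pattern, one position to the right

c_original = complex_cell(s_layer(signal, template))
c_shifted = complex_cell(s_layer(shifted_signal, template))
print(c_original, c_shifted)
```

The simple-cell responses move together with the pattern, whereas the complex-cell output stays the same for both signals, which is the increase in invariance described above.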
Research has shown that through many iterations of these operations, complex objects can be
constructed from low-level features.
In the following pages we will see how this structured hierarchical model can be translated into a neural network architecture using convolution operations.
3.3 Convolution operation
Convolutional Neural Networks owe their name to the convolution operation, which is the mathematical operation at the basis of convolutional layers. For this reason, before diving into the details of CNNs, we will first present the concept of convolution and give an intuition of why this operation works well for image processing.
Figure 3.3: HMAX model
3.3.1 Convolutions as linear, shift-equivariant transforms
In deep neural networks (DNNs), our goal is to transform a given input signal f into a more
informative representation using an operator T . Among the various operators, convolutions are
an interesting class because, through their parameterization, they can express any linear, shift-
equivariant transform.
For readers who may not be familiar with the concepts of linearity, invariance, and equivariance, we
will provide a brief recap here as these concepts are essential for understanding the following pages.
Given a transform T, a function f, two input vectors u and v, and scalars α and β:
A transform T is linear if:
$$T(\alpha u + \beta v) = \alpha T(u) + \beta T(v)$$
A transform T is invariant to f if:
$$T(f(u)) = T(u)$$
Invariance is a property we want to exploit in classification tasks. For example, if we have an image
of a cat and we shift every pixel by one unit, the image should still represent a cat. In other words,
the classifier should be invariant to the shift of one pixel.
A transform T is equivariant to f if:

T (f (u)) = f (T (u))
We desire equivariance for tasks such as edge detection¹. When there is an edge present in the
input image and we apply the function f to shift the image content, we want the edge detector to
also shift its response (the position of the edge) along with the function f .
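These three properties can be checked numerically on a one-dimensional signal. In the sketch below the transform T is a sliding filter with wrap-around boundaries (an assumption that makes shift equivariance hold exactly), f is a cyclic shift by two positions, and taking the global maximum of the filter response gives a shift-invariant quantity. The signal and kernel values are illustrative.

```python
import numpy as np

def circular_filter(signal, kernel):
    """T(u)[i] = sum_m kernel[m] * u[i + m], with wrap-around at the borders."""
    return sum(k * np.roll(signal, -m) for m, k in enumerate(kernel))

def shift(u):
    """f: cyclic shift of the content by two positions."""
    return np.roll(u, 2)

u = np.array([0.0, 0.0, 1.0, 3.0, 1.0, 0.0, 0.0, 0.0])
v = np.array([1.0, 0.0, 0.0, 2.0, 0.0, 0.0, 1.0, 0.0])
kernel = np.array([-1.0, 0.0, 1.0])      # simple edge-like filter

# Linearity: T(a*u + b*v) = a*T(u) + b*T(v)
is_linear = np.allclose(circular_filter(2.0 * u + 3.0 * v, kernel),
                        2.0 * circular_filter(u, kernel) + 3.0 * circular_filter(v, kernel))

# Equivariance: T(f(u)) = f(T(u))
is_equivariant = np.allclose(circular_filter(shift(u), kernel),
                             shift(circular_filter(u, kernel)))

# Invariance of the global maximum: max T(f(u)) = max T(u)
is_invariant = np.isclose(np.max(circular_filter(shift(u), kernel)),
                          np.max(circular_filter(u, kernel)))

print(is_linear, is_equivariant, is_invariant)
```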
3.3.2 From linear filtering to convolution
Now we are interested in investigating how we can obtain a specific type of linear transform that can express convolution.
Linear operations can be represented as follows:
$$I'(i, j) = \sum_{(m, n) \in N(i, j)} K(i, j, m, n) \, I(i + m, j + n)$$
Here, I represents the input image, I′ is the output of the operation, K is the kernel of the operation, and N(i, j) is a neighborhood of (i, j).
In a linear transform, the value of the kernel K, which is applied to each point of the image,
depends on both the position on the image (i, j) and the position of the neighboring point (m, n)
with respect to (i, j). This dependency on the specific position (i, j) makes the linear transform
not shift-invariant. To achieve shift invariance, we need to remove the dependency on the position
(i, j), which can be done by considering kernels that are constant over the image. From now on, the
kernel K will be represented by a constant matrix, and we will write K(m, n).
In practice, to perform a shift-invariant linear transformation, we move the fixed kernel over the
entire input image. This process is known as convolution.
3.3.3 Correlation
Correlation (also referred to as the cross-correlation operation in the field of ML) is a particular case of shift-invariant linear filtering.
In correlation, a fixed spatial pattern is shifted over the image, and the response is recorded as the pattern is applied to different patches. The response is computed by multiplying the pattern element-wise with the underlying portion of the image and summing the products. If the pattern and the patch are similar (that is, close to parallel when viewed as vectors in Euclidean space), the output will be high, whereas dissimilar elements will yield low outputs.
The ability to perform pattern matching² (finding a pattern when the correlation between the
kernel and the input pixels is high) makes the correlation operation particularly useful in object
detection.
For instance, given a 3 × 3 kernel K with entries c_mn and an input image I, the output of the correlation operation for each cell (i, j) can be computed as follows:
$$\begin{aligned} I'(i, j) = {} & c_{11} I(i-1, j-1) + c_{12} I(i-1, j) + c_{13} I(i-1, j+1) \\ & + c_{21} I(i, j-1) + c_{22} I(i, j) + c_{23} I(i, j+1) \\ & + c_{31} I(i+1, j-1) + c_{32} I(i+1, j) + c_{33} I(i+1, j+1) \end{aligned}$$
¹ Edge detection is the process of identifying and highlighting the boundaries of objects in an image or video.
² Pattern matching in computer vision refers to a set of computational techniques which enable the localization of a template pattern in a sample image or signal.
Figure 3.4: A visualization of a correlation transform for a single pixel in position (i, j)
In general, for a (2k + 1) × (2k + 1) kernel, the correlation operation can be expressed as:
$$I'(i, j) = \sum_{m=-k}^{k} \sum_{n=-k}^{k} K(m, n) \, I(i + m, j + n)$$
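The pattern-matching behaviour of correlation can be sketched with a small synthetic image. Note two assumptions in this example: the output is only computed where the kernel fits entirely inside the image, and the output index refers to the top-left corner of the patch rather than to its centre, so the offsets run over 0, ..., 2k instead of −k, ..., k. The image and the cross-shaped pattern are made up for the illustration.

```python
import numpy as np

def correlate2d_valid(image, kernel):
    """Slide the kernel over the image and sum the element-wise products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(kernel * image[i:i + kh, j:j + kw])
    return out

# Hypothetical cross-shaped pattern hidden in an otherwise empty image
pattern = np.array([[0.0, 1.0, 0.0],
                    [1.0, 1.0, 1.0],
                    [0.0, 1.0, 0.0]])
image = np.zeros((6, 6))
image[2:5, 3:6] = pattern                 # top-left corner of the pattern at (2, 3)

response = correlate2d_valid(image, pattern)
best_match = np.unravel_index(np.argmax(response), response.shape)
print(best_match, response.max())
```

The response peaks exactly where the pattern sits. On real images, where bright regions can also produce large responses, one would typically normalize the correlation before using it for matching.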
3.3.4 Intuition of Convolution in Images
We are now interested in exploring the physical intuition behind the reason why the convolution
operation is effective in extracting features from images.
When working with images, we are dealing with data captured by an imaging system that has a
specific response for each point of light in the scene. This response is influenced by the system’s
point spread function (PSF), which characterizes how the imaging system blurs a point source
of light.
The blurring effect caused by an imaging system can be mathematically modeled as a convolution
operation. When an image is blurred by the system, it is effectively convolved with the system’s
PSF. This convolution operation describes how the light spreads out in the image due to the
characteristics of the imaging system.
As a result, if we apply deconvolution (which is also a convolution operation) to the blurred image,
we can potentially recover the original source of the image by undoing the blurring effect caused by
the imaging system.
Figure 3.5: Example of a point spread function

3.4 Difference between convolution and correlation
The concepts of convolution and correlation are closely related. In particular, the convolution operation is defined as
$$I'(i, j) = (I * K)(i, j) = \sum_{m} \sum_{n} K(i - m, j - n) \, I(m, n)$$
where the sums run over all image positions (m, n) for which the kernel index (i − m, j − n) is valid. This can be seen as a correlation with a flipped kernel.
However, there are differences between convolution and correlation. One important distinction is that convolution is commutative, which means that the order of the operands can be swapped without changing the result. Therefore, we can equivalently write:
$$\begin{aligned} I'(i, j) = (I * K)(i, j) &= \sum_{m} \sum_{n} K(i - m, j - n) \, I(m, n) \\ &= \sum_{m=-k}^{k} \sum_{n=-k}^{k} K(m, n) \, I(i - m, j - n) = (K * I)(i, j) \end{aligned}$$
In practice, the latter formula is often preferred for implementation in machine learning libraries as
it allows for a smaller variation in the range of valid (m, n) values. The commutative property of
convolution is the primary reason it is commonly used instead of correlation.
Additionally, it is worth noting that if the kernel satisfies K(m, n) = K(−m, −n), then correlation
and convolution become equivalent.
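The three statements of this section can be verified with NumPy's built-in one-dimensional routines `np.convolve` and `np.correlate`; the one-dimensional setting and the example arrays are simplifications chosen for brevity.

```python
import numpy as np

signal = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
kernel = np.array([1.0, 0.0, -1.0])          # not symmetric
symmetric_kernel = np.array([1.0, 2.0, 1.0]) # K(m) = K(-m)

# 1. Convolution is a correlation with a flipped kernel
flipped_equivalence = np.allclose(
    np.convolve(signal, kernel, mode="valid"),
    np.correlate(signal, kernel[::-1], mode="valid"))

# 2. Convolution is commutative
commutative = np.allclose(
    np.convolve(signal, kernel, mode="full"),
    np.convolve(kernel, signal, mode="full"))

# 3. For a symmetric kernel, convolution and correlation coincide ...
same_for_symmetric = np.allclose(
    np.convolve(signal, symmetric_kernel, mode="valid"),
    np.correlate(signal, symmetric_kernel, mode="valid"))

# ... while for a non-symmetric kernel they generally differ
differs_otherwise = not np.allclose(
    np.convolve(signal, kernel, mode="valid"),
    np.correlate(signal, kernel, mode="valid"))

print(flipped_equivalence, commutative, same_for_symmetric, differs_otherwise)
```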
3.4.1 Convolution as matrix multiplication
Discrete convolution can be implemented using matrix multiplication.
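Let’s consider a 1D convolution operation with an input signal I = (I_1, ..., I_n) and a kernel K = (k_1, ..., k_m). We can express it as follows:
$$(I * K) = \begin{pmatrix} k_1 & 0 & \cdots & 0 \\ k_2 & k_1 & \ddots & \vdots \\ k_3 & k_2 & \ddots & 0 \\ \vdots & k_3 & \ddots & k_1 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & \cdots & 0 & k_m \end{pmatrix} \begin{pmatrix} I_1 \\ I_2 \\ \vdots \\ I_n \end{pmatrix}$$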
It is worth noting that the convolution operation is typically represented using an asterisk (∗)
symbol.
For more practical details on this topic, please refer to the first exercise of the CNNs’ pen and
paper homeworks.
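The matrix form can be sketched by building the banded (Toeplitz) matrix explicitly and comparing the product with NumPy's `np.convolve`. The helper name `convolution_matrix` and the example signal are illustrative; the matrix has n + m − 1 rows, which corresponds to the "full" convolution.

```python
import numpy as np

def convolution_matrix(kernel, n):
    """Build the (n + m - 1) x n matrix whose columns are shifted copies of the kernel."""
    m = len(kernel)
    T = np.zeros((n + m - 1, n))
    for col in range(n):
        T[col:col + m, col] = kernel      # column `col` holds the kernel shifted down by `col`
    return T

signal = np.array([1.0, 2.0, 3.0, 4.0])
kernel = np.array([1.0, 0.0, -1.0])

T = convolution_matrix(kernel, len(signal))
via_matrix = T @ signal
via_numpy = np.convolve(signal, kernel, mode="full")
print(T)
print(via_matrix, via_numpy)
```

Each column of the matrix contains the same kernel entries, only shifted, which is the weight sharing that makes convolutional layers far more parameter-efficient than a general dense layer of the same size.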
3.5 Convolutional Neural Network
Now that we have introduced the concept of convolution and discussed why it is a suitable operation
for feature extraction from images, we can proceed to Convolutional Neural Networks (CNNs).
A CNN is composed of a sequence of convolutional layers interspersed with activation functions and pooling layers, followed by a final dense layer (also called fully connected). The dense layer aggregates the features extracted by the convolutional layers and produces the final output of the network.
A typical layer of a convolutional network consists of three stages. In the first stage (see subsection 3.5.1), the layer performs a set of convolutions in parallel, with each convolution having its own learnable kernel.
This process generates a set of activations, also known as feature maps. In the second stage, each
activation is then passed through a non-linear activation function, such as the Rectified Linear
Unit (ReLU). This stage introduces non-linearities and allows the network to capture complex
patterns and relationships within the data. In the third and final stage, a pooling operation (see
subsection 3.5.2) is employed. A pooling function replaces the output of the network at a certain
location with a summary statistic of the nearby outputs. This operation aggregates features and
obtains a representation at a lower resolution.
By combining these three stages, the convolutional layer extracts local features from the input
data, introduces non-linearities, and reduces the spatial dimensions of the features through pooling.
This hierarchical process helps the network learn hierarchical representations of the input data. In
addition to convolutional layers, CNNs also include a final dense layer. This layer aggregates the
features extracted by the preceding layers and produces the final output of the network.
Next, we will examine convolutional (subsection 3.5.1), pooling (subsection 3.5.2) and dense layers in detail.
Figure 3.6: A visualization of the global structure of a CNN (source)
3.5.1 Convolution layer
Convolution layer in practice
In the context of CNNs, the first step of a convolutional layer involves applying convolution to an
input image. This is achieved by convolving a kernel (also referred to as a filter in deep learning)
with the entire image. In practice, this involves sliding the kernel over the image spatially and
computing dot products.
It is important to note that filters must extend the full depth of the input volume, as illustrated in
Figure 3.7. For example, if we have an RGB image with three channels, we would apply a k × k × 3
filter to it.
Figure 3.7: Convolutional layer with a 5 × 5 × 3 filter
When taking the dot product between the filter and a small 5 × 5 × 3 chunk of the image (a
75-dimensional dot product plus a bias), the output is a single number.
We are now interested in analyzing the math of how parameters are updated in a convolutional layer.
We will focus on the case of a single filter, as the generalization to multiple filters is straightforward.
Mathematical derivation of the CNN layer
Let $z^{[l-1]}_{i,j}$ be the output of the $(l-1)$-th layer at position $(i, j)$, let $w^{[l]}_{m,n}$ be the weight of the filter
in position $(m, n)$, and let $b$ be the bias parameter (every convolutional layer has one bias parameter
per filter). We can write the output of the $l$-th layer as:
$$z^{[l]}_{i,j} = \left(W^{[l]} \ast z^{[l-1]}\right)_{i,j} + b = \sum_{m} \sum_{n} w^{[l]}_{m,n} \, z^{[l-1]}_{i-m,j-n} + b$$
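Let $L$ be the loss function, let again $z^{[l]}_{i,j}$ be the output at the $l$-th layer in position $(i, j)$, and let
$\delta^{[l]}_{i,j} = \partial L / \partial z^{[l]}_{i,j}$ denote the corresponding error term.
Given the forward pass above, the chain rule lets us express the derivative of the cost function with
respect to the output of the $(l-1)$-th layer as:
$$\begin{aligned}
\delta^{[l-1]}_{i,j} = \frac{\partial L}{\partial z^{[l-1]}_{i,j}}
&= \sum_{i'} \sum_{j'} \frac{\partial L}{\partial z^{[l]}_{i',j'}} \frac{\partial z^{[l]}_{i',j'}}{\partial z^{[l-1]}_{i,j}} \\
&= \sum_{i'} \sum_{j'} \delta^{[l]}_{i',j'} \, \frac{\partial \left( \sum_{m} \sum_{n} w^{[l]}_{m,n} \, z^{[l-1]}_{i'-m,j'-n} + b \right)}{\partial z^{[l-1]}_{i,j}} \\
&= \sum_{i'} \sum_{j'} \delta^{[l]}_{i',j'} \, w^{[l]}_{m,n}
\end{aligned}$$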
The last equality follows from the fact that, in the sum that produces the element $z^{[l]}_{i',j'}$, the only
term whose derivative is not zero is the one containing $z^{[l-1]}_{i,j}$, that is, the one where $i = i' - m$ and
$j = j' - n$ (or equivalently, $m = i' - i$ and $n = j' - j$).
Next, we compute the backward pass:
$$\delta^{[l-1]}_{i,j} = \sum_{i'} \sum_{j'} \delta^{[l]}_{i',j'} \, w^{[l]}_{i'-i,j'-j} = \delta^{[l]} \ast \underbrace{\mathrm{ROT180}(W^{[l]})}_{\text{kernel flipped}}$$
Figure 3.8: The blue pixel contributes only once to the computation of the green pixel
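Finally, we compute the gradient needed for the parameter update:
$$\begin{aligned}
\frac{\partial L}{\partial w^{[l]}_{m,n}}
&= \sum_{i} \sum_{j} \frac{\partial L}{\partial z^{[l]}_{i,j}} \frac{\partial z^{[l]}_{i,j}}{\partial w^{[l]}_{m,n}} \\
&= \sum_{i} \sum_{j} \delta^{[l]}_{i,j} \, \frac{\partial z^{[l]}_{i,j}}{\partial w^{[l]}_{m,n}} \\
&= \sum_{i} \sum_{j} \delta^{[l]}_{i,j} \, \frac{\partial \left( \sum_{m'} \sum_{n'} w^{[l]}_{m',n'} \, z^{[l-1]}_{i-m',j-n'} + b \right)}{\partial w^{[l]}_{m,n}} \\
&= \sum_{i} \sum_{j} \delta^{[l]}_{i,j} \, z^{[l-1]}_{i-m,j-n} \\
&= \delta^{[l]} \ast \mathrm{ROT180}(Z^{[l-1]})
\end{aligned}$$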
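The two gradient formulas above can be checked numerically. The following is a minimal one-dimensional sketch, assuming NumPy and a toy linear loss chosen only so that $\delta^{[l]}$ is known in advance; the names `z_prev`, `c` and `numerical_grad` are illustrative and do not come from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
z_prev = rng.normal(size=6)      # z^{[l-1]}, length N = 6
w = rng.normal(size=3)           # filter w^{[l]}, length M = 3
b = 0.5                          # bias
c = rng.normal(size=6 + 3 - 1)   # fixed coefficients defining a toy loss

def forward(z_prev, w, b):
    # z^{[l]}_i = sum_m w_m * z^{[l-1]}_{i-m} + b  (full convolution)
    return np.convolve(z_prev, w, mode="full") + b

def loss(z_prev, w, b):
    # Toy loss L = sum_i c_i * z^{[l]}_i, so that dL/dz^{[l]} = c
    return float(np.sum(c * forward(z_prev, w, b)))

delta = c  # delta^{[l]}

# Backward pass: delta^{[l-1]}_i = sum_{i'} delta_{i'} * w_{i'-i}
delta_prev = np.correlate(delta, w, mode="valid")
# Parameter gradient: dL/dw_m = sum_i delta_i * z^{[l-1]}_{i-m}
grad_w = np.correlate(delta, z_prev, mode="valid")

def numerical_grad(f, x, eps=1e-6):
    # Central finite differences, one coordinate at a time
    g = np.zeros_like(x)
    for k in range(x.size):
        step = np.zeros_like(x)
        step[k] = eps
        g[k] = (f(x + step) - f(x - step)) / (2 * eps)
    return g

num_delta_prev = numerical_grad(lambda z: loss(z, w, b), z_prev)
num_grad_w = numerical_grad(lambda v: loss(z_prev, v, b), w)

print(np.allclose(delta_prev, num_delta_prev, atol=1e-5))
print(np.allclose(grad_w, num_grad_w, atol=1e-5))
```

Here `np.correlate(a, v, mode="valid")` computes $\sum_n a_{n+k} v_n$, which is the cross-correlation form of the sums derived above (equivalently, a convolution with the flipped second argument).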
3.5.2 Pooling layer
The pooling layer substitutes the output of the network at a specific position with a condensed
representation of the adjacent outputs, typically in the form of a statistical summary.
This operation reduces the size of the representations and makes them more manageable.
Figure 3.9: A visualization of the output of a pooling layer
It’s important to note that the pooling operation is applied independently to each activation map.
Max pooling
One possible pooling operation is max pooling, which outputs the maximum value
of the input within a given region.
Figure 3.10: Effect of max pooling on pixel values
Let’s examine the forward and backward pass of a max-pool layer.
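First, we perform the forward pass, where the maximum is taken over the pooling region:
$$z^{[l]}_{i'} = \max_{i} \{ z^{[l-1]}_{i} \}$$
Let $i^* = \arg\max_i z^{[l-1]}_i$ be the index which corresponds to the maximum value. We have:
$$\frac{\partial z^{[l]}_{i'}}{\partial z^{[l-1]}_{i}} = \begin{cases} 1 & \text{if } i = i^* \\ 0 & \text{otherwise} \end{cases}$$
Therefore, the backward pass is defined as:
$$\delta^{[l-1]} = \{\delta^{[l]}\}_{i^*}$$
that is, the error is passed only to the position $i^*$ of the maximum, while all other positions in the
pooling region receive zero gradient.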
Note that, as the max-pooling layer has no learnable parameters, the backward pass is just a
propagation of the error and is not used for any weight update in this case.
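The forward and backward pass of max pooling can be sketched as follows. This is a minimal one-dimensional example, assuming NumPy and non-overlapping pooling regions of a fixed size; the function names are illustrative.

```python
import numpy as np

def maxpool1d_forward(z_prev, size):
    # Split the input into non-overlapping regions of length `size`
    windows = z_prev.reshape(-1, size)
    idx = windows.argmax(axis=1)                       # i* for every region
    out = windows[np.arange(windows.shape[0]), idx]    # maximum of every region
    return out, idx

def maxpool1d_backward(delta, idx, size):
    # Route each error value to the position of the maximum, zeros elsewhere
    delta_prev = np.zeros((delta.size, size))
    delta_prev[np.arange(delta.size), idx] = delta
    return delta_prev.reshape(-1)

z_prev = np.array([1.0, 3.0, 2.0, 0.0, 5.0, 4.0])
z, idx = maxpool1d_forward(z_prev, size=2)
delta = np.array([10.0, 20.0, 30.0])                   # error arriving from layer l
delta_prev = maxpool1d_backward(delta, idx, size=2)

print(z)           # [3. 2. 5.]
print(delta_prev)  # [ 0. 10. 20.  0. 30.  0.]
```

The indices `idx` stored during the forward pass are all the backward pass needs, which is why no parameters are updated in this layer.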
3.5.3 Dense layer
The dense layer, also referred to as a fully connected layer, complements the role of convolutional
and pooling layers in capturing local features and reducing spatial dimensions.
Its crucial function lies in aggregating the extracted features and generating the final output of
the network. In this layer, each neuron performs a weighted sum of all its inputs and applies a
non-linear activation function. By learning complex relationships among the features extracted by
preceding layers, the dense layer enables the network to make predictions based on these learned
representations.
3.6 Practical observations
Here are some practical observations that may be useful for the exercises:
• Computing A ∗ B means sliding the kernel B over A. In general, the kernel has dimensions
equal to or smaller than those of the image matrix
• A ∗ ROT180(B) = A ⋆ B, where ∗ denotes convolution and ⋆ denotes cross-correlation
• A ⋆ B = ROT180(B ⋆ A)
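These identities can be verified on small random matrices. The sketch below assumes NumPy and "full" versions of both operations (every overlap between A and B is kept); `conv2d_full` and `xcorr2d_full` are illustrative helpers written for this check, not library functions.

```python
import numpy as np

def rot180(K):
    # Rotate a matrix by 180 degrees (flip both axes)
    return K[::-1, ::-1]

def conv2d_full(A, B):
    # Full convolution: out[p, q] = sum_{m,n} B[m, n] * A[p - m, q - n]
    Ha, Wa = A.shape
    Hb, Wb = B.shape
    out = np.zeros((Ha + Hb - 1, Wa + Wb - 1))
    for m in range(Hb):
        for n in range(Wb):
            out[m:m + Ha, n:n + Wa] += B[m, n] * A
    return out

def xcorr2d_full(A, B):
    # Full cross-correlation: sum_{m,n} A[m + k, n + l] * B[m, n] for every
    # offset (k, l); offsets are shifted so they are stored at indices >= 0
    Ha, Wa = A.shape
    Hb, Wb = B.shape
    out = np.zeros((Ha + Hb - 1, Wa + Wb - 1))
    for m in range(Hb):
        for n in range(Wb):
            r, c = Hb - 1 - m, Wb - 1 - n
            out[r:r + Ha, c:c + Wa] += B[m, n] * A
    return out

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 5))
B = rng.normal(size=(3, 3))

# A * ROT180(B) = A (star) B
print(np.allclose(conv2d_full(A, rot180(B)), xcorr2d_full(A, B)))
# A (star) B = ROT180(B (star) A)
print(np.allclose(xcorr2d_full(A, B), rot180(xcorr2d_full(B, A))))
```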
4. Fully Convolutional Neural Network
4.1 The goal: pixelwise predictions
Semantic segmentation is a critical task in computer vision that involves assigning a semantic class
to each pixel in an image. While traditional image classification outputs a single class for the entire
image, semantic segmentation requires classifying each pixel individually.
The most straightforward approach to pixel-wise classification is to classify each pixel individually,
extracting features from a patch centered on it (Figure 4.1).
Figure 4.1: Pixelwise CNN
However, this method is inefficient and redundant for processing large images. Instead, practitioners
adopt a pipeline that involves using the entire image as input to a CNN. The final fully connected
layer, typically used for image classification, is removed, and the resulting feature maps are used as
segmentation predictions. Due to convolutions and max-pool operations, these predictions have
lower resolution than the original image.
To obtain the same resolution as the input image, we could keep the same dimensions by using
appropriate padding in the convolutions and avoiding pooling layers (Figure 4.2). However, this
method can be computationally expensive.
Figure 4.2: Semantic segmentation using CNNs at original resolution
In practice, the most common approach is to downsample the features using convolution
and pooling layers and then upsample them again. Because the convolutions are applied to smaller
feature maps, this method is more computationally efficient while still producing output with the
same resolution as the input.
While downsampling can be achieved with pooling and strided convolution, there are various
techniques for upsampling that we will now explore.
4.2 Upsampling techniques
Upsampling techniques, such as unpooling and transposed convolutions, are commonly used in
semantic segmentation to increase the spatial feature size. These techniques can be divided into
two categories: fixed and learnable.
4.2.1 Fixed upsampling techniques
In this section, we will explore three common fixed upsampling techniques: nearest neighbor, bed
of nails, and max unpooling.
Nearest neighbor upsampling involves upsampling features by copying the same value into all
corresponding pixels at a higher resolution. In contrast, bed of nails upsampling involves padding
with zero neighbor values, resulting in a sparse matrix as output. Figure 4.3 provides a visualization
of these two techniques.
Figure 4.3: Unpooling techniques
Max unpooling is another fixed upsampling technique that uses zero padding as in bed of nails.
However, it also remembers the original position of the maximum value before the corresponding
max-pooling in the downsampling phase. This information is then used to place each element back
in the correct position. Figure 4.4 provides a visualization of this technique.
Figure 4.4: Max Unpooling
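The three fixed upsampling techniques can be sketched in a few lines. This is a minimal example, assuming NumPy, a single-channel feature map and a scale factor of 2; all function names are illustrative.

```python
import numpy as np

def nearest_neighbor(x, s=2):
    # Copy every value into an s x s block
    return np.repeat(np.repeat(x, s, axis=0), s, axis=1)

def bed_of_nails(x, s=2):
    # Put every value in the top-left corner of an s x s block of zeros
    out = np.zeros((x.shape[0] * s, x.shape[1] * s), dtype=x.dtype)
    out[::s, ::s] = x
    return out

def max_pool_with_indices(x, s=2):
    # Max pooling that also remembers where each maximum came from
    H, W = x.shape
    blocks = x.reshape(H // s, s, W // s, s).transpose(0, 2, 1, 3)
    blocks = blocks.reshape(H // s, W // s, s * s)
    return blocks.max(axis=2), blocks.argmax(axis=2)

def max_unpool(y, idx, s=2):
    # Place every value back at the remembered position, zeros elsewhere
    h, w = y.shape
    blocks = np.zeros((h, w, s * s), dtype=y.dtype)
    rows, cols = np.indices((h, w))
    blocks[rows, cols, idx] = y
    return blocks.reshape(h, w, s, s).transpose(0, 2, 1, 3).reshape(h * s, w * s)

x = np.array([[1, 2],
              [3, 4]])
print(nearest_neighbor(x))
print(bed_of_nails(x))

feat = np.array([[1, 2, 6, 3],
                 [3, 5, 2, 1],
                 [1, 2, 2, 1],
                 [7, 3, 4, 8]])
pooled, idx = max_pool_with_indices(feat)
unpooled = max_unpool(pooled, idx)
print(pooled)
print(unpooled)
```

In an actual network the values passed to `max_unpool` would come from a later layer rather than directly from the pooling output; the pooled values are reused here only to keep the example short.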
4.2.2 Learnable upsampling: transposed convolutions
Fixed upsampling techniques are brute force upsampling approaches that do not involve learning.
In contrast, learnable upsampling techniques, such as transposed convolutions (also known as
deconvolutions), make use of learning.
As we have seen before, the problem with convolutions and pooling is that they result in output
with lower resolution. To address this issue, we can use transposed convolutions.
In practice, given a low-resolution image, we learn a kernel (e.g., 2 × 2) that is used to produce
all the terms whose sum will be the final output. Each term is obtained by multiplying all the
elements of the kernel by the value of one single input pixel and then inserting the result in the
correct position of a matrix of the same size as the output. Note that each term of the sum is a
sparse matrix, potentially with non-zero terms only in a number of pixels equal to the kernel size.
Figure 4.5 provides a visualization of this process.
Figure 4.5: Visualization of transposed convolution with a 2 × 2 kernel.
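The procedure described above can be written down directly. The following is a minimal sketch, assuming NumPy, a single channel, a square kernel and a fixed (not learned) kernel for illustration; the `stride` parameter is an assumption of this example and controls how far apart the scaled kernel copies are placed.

```python
import numpy as np

def transposed_conv2d(x, kernel, stride=2):
    h, w = x.shape
    k = kernel.shape[0]
    out = np.zeros(((h - 1) * stride + k, (w - 1) * stride + k))
    for i in range(h):
        for j in range(w):
            # One term of the sum: the kernel scaled by a single input pixel,
            # added at the output position that corresponds to that pixel
            out[i * stride:i * stride + k, j * stride:j * stride + k] += x[i, j] * kernel
    return out

x = np.array([[1.0, 2.0],
              [3.0, 4.0]])
kernel = np.array([[1.0, 2.0],
                   [3.0, 4.0]])

up = transposed_conv2d(x, kernel, stride=2)
print(up.shape)  # (4, 4)
print(up)

# With stride 1 the scaled kernel copies overlap and their values are summed
overlap = transposed_conv2d(x, kernel, stride=1)
print(overlap.shape)  # (3, 3)
```

During training, the entries of `kernel` would be parameters updated by backpropagation like any other convolution weights.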
4.3 U-Net: the most used FCNN
U-Net [10] is a popular fully convolutional neural network (FCNN) architecture that has been
widely used for semantic segmentation tasks. The main idea behind U-Net is to combine global
and local feature maps by copying corresponding tensors from earlier stages in each upsampling
stage. This allows the network to capture both local¹ and global context, leading to more accurate
semantic segmentation results.
A visualization of the U-Net architecture can be found in Figure 4.6.
¹ Skip connections help to maintain local features, as images are not completely downsampled at
every stage.
Figure 4.6: U-Net architecture.
4.4 Applications
Semantic segmentation is one of the most common applications of upsampling techniques, and it
has been used in a variety of fields, including medical imaging, autonomous driving, and robotics.
However, there are several other applications of upsampling techniques as well. These include:
• Image generation from semantic labels (i.e., creating realistic images from semantic labels)
• Human pose estimation (i.e., estimating the pose of a person in an image)
• Human shape estimation (i.e., estimating the 3D shape of a person from a 2D image)
Overall, upsampling techniques have a wide range of applications in computer vision, and their use
has led to significant advancements in the field.
5. Recurrent Neural Network
In this chapter we present Recurrent Neural Networks. In section 5.1 we introduce the concept
of recurrent neural networks and we see some potential applications. In section 5.2 we introduce
the concept of a dynamical system, which underlies the structure of recurrent networks. In
section 5.3 we introduce the simplest recurrent architecture and analyze its failure cases. In
section 5.4 we show how more complex architectures such as the LSTM can solve the problems of
vanilla RNNs. In the same section we also explain how gradient clipping can avoid instabilities which
may occur during the training of a recurrent network.
5.1 Introduction
Recurrent Neural Networks (RNNs) are a type of neural network that can process sequential data.
Unlike traditional feedforward neural networks, which take fixed-length inputs, RNNs can take
inputs of variable length, and can maintain an internal memory of the past inputs that they have
seen. This makes them well-suited to tasks such as sequence prediction, language modeling, and
machine translation.
Different combinations of input and output lengths lead to different applications of RNNs. Some
examples are:
• One to one: vanilla RNN, where at each time step we have one input and one output.
• One to many: this is the case of Image Captioning, where the input is a single image
(one element) and the output is a sequence of words (many elements).
• Many to one: this is the case of Sentiment Classification, where the input is a sequence
of words (many elements) and the output is the sentiment linked to them (one label).
• Many to many: this is the case of Machine Translation, where a sequence of words (many
elements) is translated into another sequence of words (many elements). Another common
case is Video Classification at the frame level, where at each time step the input is the
current frame together with the previous ones, which are encoded in the hidden state
(many elements), and the output is a label for each frame (many elements).
Figure 5.1: RNNs applications and structure of the nets. The red rectangles represent
inputs, the green ones the hidden state and the blue ones the outputs. The cases reported,
respectively, refer to vanilla RNN, Image Captioning, Sentiment Classification, Machine
Translation and Video Classification.
5.2 Dynamical System
At their core, RNNs are a type of dynamical system. In general, a dynamical system is a
mathematical concept used to describe the behavior of a system over time. It is represented by a
set of rules that determine how a system changes from one state to another over time, based on
its current state. In the context of recurrent neural networks, the hidden state of the network at
each time step can be thought of as a representation of the current state of the system, and the
transition function that updates the state can be thought of as the set of rules that govern the
behavior of the system over time.
A dynamical system may or may not have an input.
5.2.1 Dynamical System without input
In dynamical systems without input, the state at time t is a function of s<t−1> , thus
s<t> = f (s<t−1> ; Θ), as shown in Figure 5.2.
Figure 5.2: Dynamical System without input
For example, in a finite horizon setting, we can unroll the recurrence to obtain:
s<3> = f (s<2> ; Θ) = f (f (s<1> ; Θ); Θ)
5.2.2 Dynamical System with input
In the case of RNNs, the state at time t depends not only on the state at the previous time step
t − 1, but also on some input at the current time step x<t> . In other words, we can write the
RNN as a function that maps the previous state and the current input to the current state as:
s<t> = f (s<t−1> , x<t> ; Θ)
This can also be expressed in terms of the hidden layer:
h<t> = f (h<t−1> , x<t> ; Θ)
Here, Θ denotes the set of parameters that the RNN learns during training.
Figure 5.3: Unrolled RNN over a finite horizon
It is important to note that these parameters remain the same across all time steps, allowing the
RNN to learn a single model that can handle sequences of arbitrary length. The most common
representation of a RNN is shown in Figure 5.4.
Figure 5.4: Unrolled RNN, compact representation
Graphical Representations of temporal dynamics
Temporal dynamics can be represented in two different ways in RNNs.
A first way to do that is to consider a function g <t> that takes as input all the previous timesteps,
as shown in Figure 5.5. However, this option has the drawback of requiring variable-length input
sequences and being difficult to parameterize with a neural network.
Figure 5.5: A visual representation of g<t>

Another option is to write the recurrence using a function f (h<t−1> , x<t> ; Θ) that takes as inputs
the input of the current timestep x<t> , the previous hidden state h<t−1> and the set of parameters
Θ. This approach has the advantage of using the same transition function for all time steps, meaning
that the network learns to generalize across the entire sequence. In practice, the same set of
parameters is used for all time steps, allowing for efficient computation and easier training.
A representation of this option is shown in Figure 5.6.
Figure 5.6: A visual representation of f (h<t−1> , x<t> ; Θ)

5.3 Vanilla Recurrent Neural Network
5.3.1 Network Structure
The Vanilla version of a Recurrent Neural Network (RNN) is characterized by a single hidden
vector, denoted as h<t> , which forms the state of the network.
The equations for the Vanilla RNN, as depicted in Figure 5.7, are as follows:
$$\hat{y}^{<t>} = W_{hy} \, h^{<t>} \tag{5.1}$$
$$h^{<t>} = \tanh\left(W_{hh} \, h^{<t-1>} + W_{xh} \, x^{<t>}\right) \tag{5.2}$$
where h<t> = f (h<t−1> , x<t> ; Θ) represents the hidden state at time step t.
Figure 5.7: A representation of a Vanilla RNNs
Compared with an MLP, we notice two main differences. Instead of using a sigmoid activation function,
the RNN employs the hyperbolic tangent function (tanh). Intuitively, this allows the hidden states
to take both positive and negative values, enabling the model to cancel some information about
past events. Moreover, the layer at time step t depends not only on the previous hidden state, but
also on the input x<t> at that time step.
Figure 5.8: Signal flow in folded (left) and unfolded (right) RNN
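Equations (5.1) and (5.2) translate into a short loop over time steps. The sketch below assumes NumPy, randomly initialized weights and a zero initial hidden state; the dimensions and the function name `rnn_forward` are illustrative.

```python
import numpy as np

def rnn_forward(xs, h0, W_hh, W_xh, W_hy):
    h = h0
    hs, ys = [], []
    for x in xs:                               # one iteration per time step
        h = np.tanh(W_hh @ h + W_xh @ x)       # equation (5.2)
        hs.append(h)
        ys.append(W_hy @ h)                    # equation (5.1)
    return np.stack(hs), np.stack(ys)

rng = np.random.default_rng(0)
n_x, n_h, n_y = 3, 5, 2
W_hh = 0.5 * rng.normal(size=(n_h, n_h))
W_xh = 0.5 * rng.normal(size=(n_h, n_x))
W_hy = 0.5 * rng.normal(size=(n_y, n_h))
h0 = np.zeros(n_h)

# The same parameters process sequences of different lengths
xs_short = rng.normal(size=(4, n_x))
xs_long = rng.normal(size=(9, n_x))
hs_short, ys_short = rnn_forward(xs_short, h0, W_hh, W_xh, W_hy)
hs_long, ys_long = rnn_forward(xs_long, h0, W_hh, W_xh, W_hy)

print(hs_short.shape, ys_short.shape)  # (4, 5) (4, 2)
print(hs_long.shape, ys_long.shape)    # (9, 5) (9, 2)
```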
5.3.2 Backprop through time (BPTT)
In the context of recurrent neural networks (RNNs), we utilize the following equations to describe
the network’s behavior:
$$h^{<t>} = f(h^{<t-1>}, x^{<t>}; W) \tag{5.3}$$
$$\hat{y}^{<t>} = W_{hy} \, h^{<t>} \tag{5.4}$$
$$L^{<t>} = \left\| \hat{y}^{<t>} - y^{<t>} \right\|^2 \tag{5.5}$$
Given a finite horizon of size S, our objective is to compute the partial derivative of the overall loss
(which is the sum of individual losses) with respect to the network’s weights:
$$\frac{\partial L}{\partial W} = \sum_{t=1}^{S} \frac{\partial L^{<t>}}{\partial W} \tag{5.6}$$
To do this computation, it is crucial to view the unrolled recurrent model as a multi-layer network,
with a potentially infinite number of layers. We can then apply backpropagation to efficiently
compute the gradients in this extended network structure. For each time step t we can write:
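$$\frac{\partial L^{<t>}}{\partial W} = \sum_{k=1}^{t} \frac{\partial L^{<t>}}{\partial \hat{y}^{<t>}} \, \frac{\partial \hat{y}^{<t>}}{\partial h^{<t>}} \, \frac{\partial h^{<t>}}{\partial h^{<k>}} \, \frac{\partial^{+} h^{<k>}}{\partial W} \tag{5.7}$$
The sum over $k$ appears because $h^{<t>}$ depends on all the previous $h^{<k>}$. The term
$\frac{\partial^{+} h^{<k>}}{\partial W}$ is the immediate derivative, which treats $h^{<k-1>}$ as constant w.r.t. the weight $W$.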
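Let us consider only the following term of the product:
$$\begin{aligned}
\frac{\partial h^{<t>}}{\partial h^{<k>}}
&= \prod_{i=k+1}^{t} \frac{\partial h^{<i>}}{\partial h^{<i-1>}} && \text{(5.8)} \\
&= \prod_{i=k+1}^{t} \frac{\partial}{\partial h^{<i-1>}} f\left(W_{hh} \, h^{<i-1>} + W_{xh} \, x^{<i>}\right) && \text{(5.9)} \\
&= \prod_{i=k+1}^{t} W_{hh}^{T} \, \mathrm{diag}\left(f'(h^{<i-1>})\right) && \text{(5.10)} \\
&= \left(W_{hh}^{T}\right)^{t-k-1} \prod_{i=k+1}^{t} \mathrm{diag}\left(f'(h^{<i-1>})\right) && \text{(5.11)}
\end{aligned}$$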
where $f$ is the activation function. Assuming the existence of an orthogonal eigenvalue decomposition
of the weight matrix $W_{hh}$ (i.e., $W_{hh}$ is symmetric), we can alternatively express it as
$W_{hh} = W_{hh}^{T} = Q^{T} \Lambda Q$, where $Q$ is orthogonal and $\Lambda$ is a diagonal matrix containing the eigenvalues of
$W_{hh}$ along its diagonal. Substituting into the previous equation, we obtain:
$$\left(W_{hh}^{T}\right)^{t-k-1} = \left(Q^{T} \Lambda Q\right)^{t-k-1} = Q^{T} \Lambda^{t-k-1} Q \tag{5.12}$$
The last step is due to the fact that $QQ^{T} = I$, as $Q$ is orthogonal.¹ Notably, we observe that
$\Lambda$ is raised to a power that grows with $t$. We are now interested in analyzing the influence of the
eigenvalues on the final matrix.
If we consider $f$ to be a sigmoid or a hyperbolic tangent (whose derivatives are both upper bounded
by 1), we can say that there is a $\gamma \in \mathbb{R}$ s.t.
$$\left\| \mathrm{diag}\left(f'(h^{<i-1>})\right) \right\| < \gamma \tag{5.13}$$
¹ For example, for $k' = 2$ we have $\left(W_{hh}^{T}\right)^{k'} = Q^{T} \Lambda \underbrace{Q Q^{T}}_{I} \Lambda Q = Q^{T} \Lambda^{2} Q$.
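Assume that $\lambda_1$ is the largest singular value of the matrix $W_{hh}$. We will now show that the behaviour
of the gradients depends on whether it is smaller or larger than $\frac{1}{\gamma}$. In particular, if $\lambda_1 < \frac{1}{\gamma}$, then
the gradient vanishes. In the other case, the gradient may explode. Let us formally prove the first
statement.
$$\forall i, \quad \left\| \frac{\partial h^{<i>}}{\partial h^{<i-1>}} \right\| \leq \left\| W_{hh}^{T} \right\| \left\| \mathrm{diag}\left[f'(h^{<i-1>})\right] \right\| < \frac{1}{\gamma} \, \gamma = 1 \tag{5.14}$$
Here $\|\cdot\|$ is the spectral norm. Let $\eta \in \mathbb{R}$ be such that $\forall i, \left\| \frac{\partial h^{<i>}}{\partial h^{<i-1>}} \right\| \leq \eta < 1$.
By induction over $i$:
$$\left\| \prod_{i=k+1}^{t} \frac{\partial h^{<i>}}{\partial h^{<i-1>}} \right\| \leq \eta^{t-k} \to 0 \quad \text{as } t \to \infty \tag{5.15}$$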
For the reasons explained in this section (gradients are not meaningful when t → ∞), RNNs
struggle in practice to capture long-term dependencies. In the next section we will see how LSTMs
address this problem.
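The vanishing behaviour can be observed numerically. The sketch below assumes NumPy, a tanh activation (so γ = 1) and a symmetric recurrent matrix rescaled so that its largest singular value is 0.5 < 1/γ; all sizes and names are illustrative. Note that the one-step Jacobian is written here as diag(f′) · Whh, which is the transpose of the layout used in the derivation and has the same spectral norm.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8

A = rng.normal(size=(n, n))
W_hh = (A + A.T) / 2                       # symmetric, as assumed in the text
W_hh *= 0.5 / np.linalg.norm(W_hh, 2)      # largest singular value set to 0.5
W_xh = rng.normal(size=(n, n))

h = np.zeros(n)
J = np.eye(n)                              # accumulates dh<t>/dh<0>
norms = []
for t in range(1, 31):
    x = rng.normal(size=n)
    h = np.tanh(W_hh @ h + W_xh @ x)
    # One-step Jacobian dh<t>/dh<t-1>, using tanh'(a) = 1 - tanh(a)^2
    J_step = np.diag(1.0 - h ** 2) @ W_hh
    J = J_step @ J
    norms.append(np.linalg.norm(J, 2))     # spectral norm of the product

for t in (1, 10, 20, 30):
    print(t, norms[t - 1], 0.5 ** t)       # norm and its upper bound eta^t
```

Each printed norm stays below the bound 0.5^t, so the contribution of early time steps to the gradient shrinks geometrically with the distance in time.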
5.4 Solving the problem of vanishing and exploding gradient: LSTM and Friends
5.4.1 Naive solution
Figure 5.9: Naive solution
Here C is the cell state, that is, our memory: we want something that summarizes the memory in this
cell state while keeping the gradient alive.
5.4.2 LSTM
Long Short Term Memory networks [4], or LSTMs, are a special kind of RNN, capable of learning
long-term dependencies.
The structure of their cells is very different from that of a vanilla RNN, which consists of a single
internal layer where the hidden state and the input are transformed by a single affine
transformation followed by a point-wise non-linearity.
The cell of an LSTM instead consists of four layers, interacting in a very special way. In particular,
these layers, also called gates, have the following functions.
• f is the forget gate and has the role of scaling the old cell state c<t−1> . Depending on x<t>
and h<t−1> , it decides which information should be forgotten from the previous cell state.
Its output is a sigmoid value which, for each element of the previous cell state c<t−1> ,
decides how much of the old state is kept in the current one (0 deletes it, 1 keeps the element
entirely).
• i is the input gate and has the role of deciding which values of the cell state should be
updated at the current time step. Its output is a sigmoid value which, for each element
of the candidate vector g, decides how much of it should be written into the current cell
state c<t> (0 writes nothing, 1 writes the candidate value entirely).
• o is the output gate and has the role of deciding which values of the current cell state
should be exposed in the output of the cell h<t> . Like the previous gates, its output is a
sigmoid value which, for each element of the current cell state c<t> , decides how much of it
should be put in the output.
• g is the gate that decides what to write in the cell state. It is a tanh layer, which creates a
vector of new candidate values.
In practice, the idea is to stack the vectors $x^{<t>}$ and $h^{<t-1>}$ and to multiply them by a big weight
matrix in order to obtain the four different values of i, f , o, g, with the roles described before.
Given that the cell state $c^{<t>}$, the input $x^{<t>}$ and the output $h^{<t-1>}$ have dimensionality $n$, and
given $W \in \mathbb{R}^{4n \times 2n}$, we compute i, f , o, g as follows.
$$\begin{pmatrix} i \\ f \\ o \\ g \end{pmatrix} = \begin{pmatrix} \mathrm{sigm} \\ \mathrm{sigm} \\ \mathrm{sigm} \\ \tanh \end{pmatrix} W \begin{pmatrix} x^{<t>} \\ h^{<t-1>} \end{pmatrix} \tag{5.16}$$
Figure 5.10: LSTM’s gate structure
Here, we can rewrite W in blocks as follows.
$$W = \begin{pmatrix} W_{xi} & W_{ci} \\ W_{xf} & W_{cf} \\ W_{xo} & W_{co} \\ W_{xg} & W_{cg} \end{pmatrix} \tag{5.17}$$
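Moreover, in a multi-layer architecture we can see $x^{<t>}$ as the output $h^{[l-1]}_{<t>}$ of the layer before
(Figure 5.11) and we can alternatively write
$$\begin{pmatrix} i \\ f \\ o \\ g \end{pmatrix} = \begin{pmatrix} \mathrm{sigm} \\ \mathrm{sigm} \\ \mathrm{sigm} \\ \tanh \end{pmatrix} W^{[l]} \begin{pmatrix} h^{[l-1]}_{<t>} \\ h^{[l]}_{<t-1>} \end{pmatrix} \tag{5.18}$$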
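Recall that in the case of an RNN the equation for the output $h^{<t>}$ was the following.
$$h^{[l]}_{<t>} = \tanh\left( W^{[l]} \begin{pmatrix} h^{[l-1]}_{<t>} \\ h^{[l]}_{<t-1>} \end{pmatrix} \right) \tag{5.19}$$
where $h \in \mathbb{R}^{n}$ and $W \in \mathbb{R}^{n \times 2n}$.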
Once computed the values for i, f , o, g, we can compute the new cell state c<t> and the new output
h<t> as follows.
$$c^{<t>} = f^{<t>} \odot c^{<t-1>} + i^{<t>} \odot g^{<t>} \tag{5.20}$$
$$h^{<t>} = o^{<t>} \odot \tanh\left(c^{<t>}\right) \tag{5.21}$$
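Equations (5.16), (5.20) and (5.21) describe one step of an LSTM cell, which can be sketched as follows. The example assumes NumPy, no bias terms (as in the equations above), and input, hidden and cell vectors of the same size n; the function names are illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, W):
    n = h_prev.shape[0]
    a = W @ np.concatenate([x, h_prev])   # W has shape (4n, 2n), a has shape (4n,)
    i = sigmoid(a[0:n])                   # input gate
    f = sigmoid(a[n:2 * n])               # forget gate
    o = sigmoid(a[2 * n:3 * n])           # output gate
    g = np.tanh(a[3 * n:4 * n])           # candidate values
    c = f * c_prev + i * g                # equation (5.20)
    h = o * np.tanh(c)                    # equation (5.21)
    return h, c

rng = np.random.default_rng(0)
n = 4
W = 0.5 * rng.normal(size=(4 * n, 2 * n))
h, c = np.zeros(n), np.zeros(n)
for x in rng.normal(size=(6, n)):         # run the cell over a short sequence
    h, c = lstm_step(x, h, c, W)

print(h.shape, c.shape)  # (4,) (4,)
```

The cell state `c` is only touched by element-wise operations, which is the property the gradient-flow discussion relies on.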
5.4.3 Gradient flow in RNNs and in LSTMs
To understand why LSTMs are more effective than vanilla RNNs in practice, let’s examine the
gradient flow from c<t> to c<t−1> . In vanilla RNNs, the gradient flow relies on matrix multiplication,
as seen in the equation h<t> = tanh(Whh h<t−1> + Wxh x<t> ), where the weight matrix Whh remains
the same at every time step.
In LSTMs, however, the gradient flows differently. The + operator in the cell state update
allows the gradient to propagate directly to the element-wise multiplication (c<t−1> ⊙ f ). Unlike
matrix multiplication with a fixed matrix, this operation is an element-wise multiplication between
the vector c<t−1> and the forget gate f , both of which can vary at each time step.
Figure 5.11: A graphical representation of the LSTM multi-layered structure. The red
rectangles are the inputs, the green ones are the hidden layers and the blue ones the
outputs.
Figure 5.12: Gradient flow in RNNs and LSTMs
5.4.4 Gradient clipping
While the LSTM can be seen as a solution to the problem of vanishing gradients, gradient clipping
addresses the issue of exploding gradients. The idea behind gradient clipping is to rescale the
gradient if its norm surpasses a predetermined threshold. In practice, given the gradient
$g \leftarrow \frac{\partial L}{\partial \Theta}$ and a threshold $T$:
$$\Theta \leftarrow \begin{cases} \Theta - \lambda g & \text{if } \|g\|_2 \leq T \\ \Theta - \lambda T \dfrac{g}{\|g\|_2} & \text{otherwise} \end{cases} \tag{5.22}$$
where λ is the learning rate and Θ are the parameters of the model.
Figure 5.13: Training without (left) and with (right) gradient clipping in a recurrent network
with parameters w and b. Note that, in the absence of gradient clipping, gradient descent
can overshoot the bottom of the cliff and receive a very large gradient from the steep cliff
face. This can lead to catastrophic parameter updates, pushing the parameters far beyond
the plot’s axes. Picture from [6].
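In PyTorch we can write the following code to implement gradient clipping using the L2 norm (∥·∥2 )
and threshold T = 2.0.
import torch

# model, loss and optimizer are assumed to be defined by the surrounding training loop
loss.backward()  # compute the gradients
# rescale the gradients in place so that their total L2 norm is at most 2.0
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=2.0, norm_type=2)
optimizer.step()  # apply the update using the clipped gradients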
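The update rule (5.22) itself can also be written without any framework. This is a minimal sketch, assuming NumPy and a single parameter vector; `clipped_update` is an illustrative name, not a library function.

```python
import numpy as np

def clipped_update(theta, g, lr, T):
    norm = np.linalg.norm(g)        # ||g||_2
    if norm > T:
        g = T * g / norm            # rescale so that the norm equals T
    return theta - lr * g

theta = np.zeros(2)

# ||g|| = 5 > T = 2, so the gradient is rescaled to norm 2 before the step
print(clipped_update(theta, np.array([3.0, 4.0]), lr=0.1, T=2.0))   # [-0.12 -0.16]
# ||g|| = 0.5 <= T = 2, so the gradient is used unchanged
print(clipped_update(theta, np.array([0.3, 0.4]), lr=0.1, T=2.0))   # [-0.03 -0.04]
```

Clipping changes only the length of the gradient, not its direction.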
Part Three: Generative Modeling
6 Autoencoders
6.1 Introduction
6.2 Linear Autoencoders: the PCA projection
6.3 Non-Linear Autoencoders
6.4 Variational Autoencoders
6.5 β-VAE
7 Autoregressive models
7.1 Regressive model
7.2 Sequence model
7.3 A toy example: prediction of a B&W image
7.4 Fully Visible Sigmoid Belief Network
7.5 Masked Autoencoder Distribution Estimation (MADE)
7.6 Generative model of Natural images
7.7 Pixel RNN
7.8 Pixel CNN
7.9 TCNs-WaveNet
7.10 RNNs are autoregressive models
7.11 Self-Attention and Transformers
8 Normalizing flows
8.1 Introduction
8.2 Change of variables technique
8.3 Parameterize the Transformation f with NN
8.4 Coupling layers
8.5 A Flow of Transformations
8.6 Model architecture
8.7 Applications in Computer Vision
9 GAN
9.1 Likelihood-free model
9.2 Introduction to GAN
9.3 Definitions
9.4 Training
9.5 Theoretical analysis
9.6 Difficulties during training
9.7 Comparison with VAE
9.8 Conditional GANs
Discriminative vs Generative Models
In discriminative models, given pairs (x, y) as training data, the goal is to learn a function
f which maps an input x to an output y. On the other hand, in generative models, we work
with training data consisting only of unlabeled data points x. Here, our objective is to learn the
underlying hidden structure of the data. We aim to learn a model distribution $p_{\text{model}}(x)$ that
resembles the data distribution $p_{\text{data}}(x)$, so that we can generate new samples from it. Generative
models can be classified into two categories:
• Explicit model: in this category, we explicitly define the probability distribution $p_{\text{model}}(x)$
and then sample from it to generate new data points.
• Implicit model: here, we directly sample from $p_{\text{model}}(x)$ without explicitly defining the
distribution. Implicit models offer more flexibility and are commonly used in complex
scenarios.
While explicit models have the advantage of being highly interpretable, implicit models are more
versatile and applicable in various contexts. For a more detailed classification of generative models,
refer to Figure 5.14.
Generative models find applications in several areas, including:
• Super resolution [7], Compression, Text-to-speech

• Image generation [9]
• Planning, “Curiosity”, Model-based RL
• Drug discovery, Astronomy [8], High-energy physics
Figure 5.14: Taxonomy of generative models
In Parts 1 and 2, we primarily explored discriminative models. Now, in Part 3, we will shift our
focus towards the domain of generative models.
6. Autoencoders
In this chapter we introduce autoencoders. In section 6.1 we introduce the structure as well
as the idea behind the autoencoder models. In section 6.2 we introduce the simplest autoencoder,
the linear autoencoder. In section 6.3 we describe how autoencoders can be implemented with a
neural network, allowing them to learn also non-linear projections. In section 6.4 we introduce the
Variational Autoencoders.
6.1 Introduction
In the field of deep learning, the data we often work with is represented as measurement vectors,
denoted as x ∈ Rn . While the dimensionality of these vectors can be low when using carefully
selected features, modern machine learning applications frequently involve high-dimensional data
such as images, audio, or time-series. In such cases, a crucial objective is to find low-dimensional
representations that can effectively compress the data while preserving its essential information.
Moreover, these representations should be interpretable and capable of capturing different modes of
variation.
Autoencoders offer a solution to this challenge through the use of an encoder-decoder structure, as
depicted in Figure 6.1.
• The encoder f projects the original input space X into a latent space¹, denoted as Z.
• The decoder g maps samples from the latent space Z back to the input space X.
Autoencoders operate on the assumption that a meaningful compressed representation of the data
can be obtained if the decoder is capable of reconstructing the original input solely from that
compressed representation. Consequently, the composition [g ◦ f ] aims to approximate the identity
function on the data, resulting in a low reconstruction error.
¹ We may refer to the intermediate space where the data is projected equivalently as latent space,
code, or embedding space.
Figure 6.1: Autoencoders structure.
Furthermore, to enable the generation of new samples from the latent space, it is desirable for the
latent space to exhibit a well-structured nature, characterized by continuity and interpolation
capabilities.
6.2 Linear Autoencoders: the PCA projection
If we restrict the functions f and g to be linear, the encoder function f of the autoencoder becomes
equivalent to the projection performed by Principal Component Analysis (PCA), which is the
linear projection achieving the lowest reconstruction loss (L). Given N data points, L is computed
as
$$L = \sum_{n=1}^{N} \left\| x_n - \hat{x}_n \right\|^2 = \sum_{n=1}^{N} \left\| x_n - g(f(x_n)) \right\|^2 \tag{6.1}$$
The advantage of such a projection is that it can be found in closed form, by computing the
eigenvectors of the covariance matrix of the data.
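The closed-form solution can be sketched in a few lines. The example below assumes NumPy, synthetic data lying close to a 2-dimensional subspace of a 5-dimensional space, and centering of the data before projecting; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n, k = 200, 5, 2

# Synthetic data close to a k-dimensional subspace, plus a little noise
X = rng.normal(size=(N, k)) @ rng.normal(size=(k, n)) + 0.01 * rng.normal(size=(N, n))

mean = X.mean(axis=0)
Xc = X - mean
cov = Xc.T @ Xc / N                        # covariance matrix of the data
eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
U = eigvecs[:, ::-1][:, :k]                # top-k eigenvectors as columns

def f(x):
    # Linear encoder: project onto the top-k principal directions
    return (x - mean) @ U

def g(z):
    # Linear decoder: map the code back to the input space
    return z @ U.T + mean

loss_pca = np.sum((X - g(f(X))) ** 2)      # equation (6.1)

# Baseline: a random k-dimensional orthonormal projection
R, _ = np.linalg.qr(rng.normal(size=(n, k)))
loss_random = np.sum((Xc - (Xc @ R) @ R.T) ** 2)

print(loss_pca, loss_random)
print(np.isclose(loss_pca, N * eigvals[: n - k].sum()))
```

The reconstruction loss of the PCA projection equals N times the sum of the discarded eigenvalues, and it is never larger than the loss of any other k-dimensional linear projection of the centered data.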
6.3 Non-Linear Autoencoders
6.3.1 Overview
When we allow f and g to be non-linear, the Autoencoder becomes a non-linear projection of the
data. In this case, both the encoder and decoder are implemented as neural networks, as illustrated
in Figure 6.2. To construct an autoencoder, we typically use a feedforward neural network trained
to reconstruct its inputs. In practice, it optimizes the following objective function w.r.t. the encoder
and decoder parameters Θf and Θg :
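$$\hat{\Theta}_f, \hat{\Theta}_g = \arg\min_{\Theta_f, \Theta_g} \sum_{n=1}^{N} \left\| x_n - \hat{x}_n \right\|^2 = \arg\min_{\Theta_f, \Theta_g} \sum_{n=1}^{N} \left\| x_n - g(f(x_n)) \right\|^2 \tag{6.2}$$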
Figure 6.2: Autoencoder class of models.
Figure 6.3 provides a comparison between the reconstructions obtained using a PCA projection
(linear autoencoder) and a feedforward neural network (in this case, a convolutional neural network
or CNN) for compressing the data from a dimensionality of 1024 to 2.
Figure 6.3: Comparison between the reconstruction obtained with a linear (PCA) and
non-linear (CNN) autoencoder.
6.3.2 Dimensionality of hidden layer
In the context of autoencoders, the dimensionality of the hidden layer plays a crucial role. Let us
call X the original feature space and Z the latent space. In general, as we have seen before, we
have dim(X) > dim(Z). However, there is also a class of autoencoders where dim(X) < dim(Z).
Depending on the dimensionality of the latent space, we can distinguish between undercomplete
(dim(X) > dim(Z)) and overcomplete (dim(X) < dim(Z)) hidden representations (see Figure 6.4).
Undercomplete hidden representation: dim(Z) < dim(X)
The idea behind an undercomplete hidden representation is to enable the network to learn the
important features of the data by reducing the dimensionality of the hidden space. This prevents the
autoencoder from simply copying the input and forces it to extract meaningful and discriminative
features.
In practice, undercomplete autoencoders extract such features well for training samples, but may
not generalize effectively to out-of-distribution samples.
Figure 6.4: Undercomplete (left) and overcomplete (right) hidden representation.
Overcomplete hidden representation: dim(Z) > dim(X)
In contrast, an overcomplete autoencoder has a hidden layer with a higher dimensionality than the
input layer. This lack of compression potentially allows each hidden unit to simply copy different
input components, achieving perfect reconstruction, but without extracting any meaningful
feature. The question now is:
How could this still be useful?
There are mainly two applications of Autoencoders with overcomplete hidden representation: the
Denoising and the Inpainting Autoencoders.
The goal of Denoising Autoencoders is, given a noisy image, to reconstruct the original clean
one. In order to do that, during training, a clean image is intentionally corrupted by injecting noise,
such as Gaussian noise. This noisy image is then provided as input to an Autoencoder with an
overcomplete hidden representation. Since the loss is evaluated based on a comparison with the
original (clean) image, the network is discouraged from simply copying the input (noisy) image.
Instead, it must learn the necessary transformations to remove the noise and accurately restore the
clean information.
Figure 6.5: Denoising Autoencoder network structure
The other application of Autoencoders with overcomplete hidden representation is the Inpainting
Autoencoders. The goal of this model is to reconstruct the missing parts of an image. In order to
train such a network, similarly to what we have seen with Denoising Autoencoder, we provide a
corrupted image as input and the original image as target. In this case, instead of injecting noise,
the original image is intentionally occluded by applying partial or complete occlusions, as shown in
Figure 6.6.
Figure 6.6: Inpainting Autoencoder network structure
The network is then trained to reconstruct the original (complete) image by learning to fill in the
missing or occluded regions. By utilizing an overcomplete hidden representation, the Inpainting
Autoencoder learns to capture the underlying structure of the image and accurately restore the
missing parts based on the available information.
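The construction of the training pairs for both models can be sketched as follows. This is a minimal illustration rather than the setup used in the lecture: images are plain nested lists of floats in [0, 1], and the helper names, the clipping to [0, 1] and the rectangular occlusion are assumptions made for the example.

```python
import random

def add_gaussian_noise(image, std, rng):
    """Return a noisy copy of `image` (rows of floats), clipped to [0, 1]."""
    return [[min(1.0, max(0.0, px + rng.gauss(0.0, std))) for px in row]
            for row in image]

def occlude(image, top, left, height, width, fill=0.0):
    """Return a copy of `image` with a rectangular patch replaced by `fill`."""
    out = [row[:] for row in image]
    for r in range(top, min(top + height, len(out))):
        for c in range(left, min(left + width, len(out[r]))):
            out[r][c] = fill
    return out

rng = random.Random(0)
clean = [[0.5 for _ in range(4)] for _ in range(4)]

# Each training pair is (corrupted input, clean target): the loss compares the
# network output with the clean image, never with the corrupted one.
denoising_pair = (add_gaussian_noise(clean, 0.1, rng), clean)
inpainting_pair = (occlude(clean, 1, 1, 2, 2), clean)
```

In both cases the target is the untouched image, which is what discourages the overcomplete network from learning the identity mapping.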
6.3.3 Autoencoder Limitations
As mentioned in section 6.1, in order to be able to generate new samples it is desirable for the
latent space to be well structured, that is, continuous and suitable for interpolation.
However, in the classical version of autoencoders we have examined, the decoder struggles to
generate high-quality samples. This limitation arises due to the lack of continuity in the latent
space, as depicted in Figure 6.7. In regions of the latent space where there are discontinuities or
gaps between clusters, the decoder has no knowledge of how to generate realistic outputs. This
happens because during training, the autoencoder was not exposed to encoded vectors from those
regions. Therefore, while autoencoders excel at reconstructing input data, they face challenges in
generating new samples.
Figure 6.7: Training an autoencoder on the MNIST dataset and visualizing the encodings
from a 2D latent space reveals the formation of distinct clusters, given by the
projections of the input samples. However, if we sample an element from a region of that
space which has not been covered during training, the decoder will not be able to generate
a realistic output, as it was trained purely to optimize the reconstruction loss and it lacks
any kind of interpolation capabilities. (from blog post)
6.4 Variational Autoencoders
6.4.1 Overview
Variational Autoencoders (VAEs) are the last category of Autoencoders we will analyze. They are
proposed as a solution to the issue of vanilla Autoencoders discussed in subsection 6.3.3. They have
a unique feature that makes them excellent for generative modeling: their latent spaces are designed
to be continuous. This means that VAEs can easily generate new and diverse samples by smoothly
interpolating between different points (explored during training) in the latent space.
In practice, they achieve that by making the encoder not output a latent vector of size dim(Z), but
instead outputting two vectors of that size: a vector of means µ and a vector of standard deviations
σ, as depicted in Figure 6.8.
Figure 6.8: Structure of a VAE. The part highlighted in red is the encoder. µ and σ (which
have the same dimensionality as z) are the values predicted by the last layer of the encoder
network. z is the latent embedding sampled from N (µ, σ²I). The part highlighted in blue
is the decoder.
This stochastic generation means that, even for the same input, while the mean and standard
deviation remain the same, the actual encoding will vary somewhat on every pass, simply due
to sampling.
Figure 6.9: Difference of latent space encodings between AE and VAE. In VAE the input to
the decoder is a sample from a gaussian distribution centered in the projected data point
rather than the projected data point itself. (from blog post)
However, since there are no limits on the values which can be taken by µ and σ, the encoder may
learn to generate very different µ for each class while minimizing σ, in order to recover a
clustered structure, which allows the network to achieve a lower reconstruction error. This
is again something we would like to avoid, since we want the latent space to be continuous and not
clustered. For a visual representation refer to Figure 6.10.
Figure 6.10: What we would like to obtain (left) and what we obtain only changing the
structure of the encoder (right).
Ideally, we would like to obtain encodings which are as close as possible to each other in the latent
space, allowing smooth interpolation and thus good generation of new samples. In order to enforce
this, the Kullback-Leibler (KL) divergence (see footnote 2) is inserted in the loss function. In
particular, we want to minimize the KL divergence between the distribution defined by every training
sample and a standard (multivariate) normal distribution. Intuitively, this loss encourages the
encoder to distribute all encodings (for all types of inputs, e.g. all MNIST digits) evenly around
the center of the latent space. Using a loss based on reconstruction loss and KL divergence we obtain
a latent space structured as depicted in Figure 6.11.
Figure 6.11: The latent space of a VAE trained on MNIST with a loss based on reconstruction
error and KL divergence.
Footnote 2: The KL term between two probability distributions is a measure of how much they differ
from each other. More details are provided in subsection 6.4.2.
6.4.2 Kullback-Leibler (KL) Divergence
Before diving into the specific computation of the objective function, let us introduce the definition
and some properties of the Kullback-Leibler (KL) Divergence, which will be needed to understand
the following part.
Often we have two probability distributions and we want to measure how different they are. One way
to do that is the Kullback-Leibler divergence. More formally, if we want to measure how much the
distribution p differs from a second, reference probability distribution q, we write
\[ D_{KL}(p \,\|\, q) = \int_x p(x) \log\left(\frac{p(x)}{q(x)}\right) dx \]
A first thing worth noticing is that the KL divergence is not symmetric, as in general
\[ D_{KL}(p \,\|\, q) \neq D_{KL}(q \,\|\, p). \]
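Moreover, the KL divergence is non-negative, as we can see from the following proof:
\begin{align}
-D_{KL}(p \,\|\, q) &= -\int_x p(x) \log\left(\frac{p(x)}{q(x)}\right) dx \tag{6.3} \\
&= \int_x p(x) \log\left(\frac{q(x)}{p(x)}\right) dx \tag{6.4} \\
&\leq \log\left(\int_x p(x) \frac{q(x)}{p(x)} dx\right) \tag{6.5} \\
&= \log\left(\int_x q(x) dx\right) \tag{6.6} \\
&= \log 1 = 0 \tag{6.7}
\end{align}
where step (6.5) uses Jensen's inequality, E(ϕ(x)) ≤ ϕ(E(x)) if ϕ is concave.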
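Both properties can be checked numerically. The sketch below uses discrete distributions, so the integral becomes a sum; the two distributions p and q are arbitrary values chosen for illustration.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) for two discrete distributions given as lists of probabilities."""
    # Terms with p_i = 0 contribute nothing, by the convention 0 * log(0) = 0.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.4, 0.1]
q = [0.2, 0.3, 0.5]

kl_pq = kl_divergence(p, q)   # D_KL(p || q)
kl_qp = kl_divergence(q, p)   # D_KL(q || p), in general a different value
kl_pp = kl_divergence(p, p)   # zero when the two distributions coincide
```

Both divergences are positive and they differ from each other, while the divergence of a distribution from itself is zero.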
6.4.3 Derivation of the objective function
In this section we are interested in understanding how to capture the process we have just described
via estimation of the parameters Θ∗ of this generative model.
In order to train a model we would like to maximize the likelihood of the training data, which in
this case is:
\[ p(x) = \int_z p(x|z) p(z) dz \tag{6.8} \]
In this expression we know p(z) and p(x|z), but we are not able to compute the integral over all z.
As a consequence, the posterior distribution p(z|x) = p(x|z)p(z) / p(x) also becomes intractable (it
includes the computation of p(x)). In order to solve this problem, we define an approximation of the
posterior pθ(z|x) and we call it qϕ(z|x), which depends on new parameters ϕ. This function qϕ(z|x)
is the one which is in practice computed by the encoder. Given this approximation, we are now ready
to compute the log-likelihood of the data, that is
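\begin{align}
\log p_\theta(x) &= \mathbb{E}_{z \sim q_\phi(z|x)}\left[\log p(x)\right] \tag{6.9} \\
&= \mathbb{E}_{z \sim q_\phi(z|x)}\left[\log \frac{p(x|z)\,p(z)}{p(z|x)}\right] \tag{6.10} \\
&= \mathbb{E}_{z \sim q_\phi(z|x)}\left[\log\left(\frac{p(x|z)\,p(z)}{p(z|x)} \cdot \frac{q(z|x)}{q(z|x)}\right)\right] \tag{6.11} \\
&= \mathbb{E}_{z \sim q_\phi(z|x)}\left[\log p(x|z)\right] - \mathbb{E}_{z \sim q_\phi(z|x)}\left[\log \frac{q(z|x)}{p(z)}\right] + \mathbb{E}_{z \sim q_\phi(z|x)}\left[\log \frac{q(z|x)}{p(z|x)}\right] \tag{6.12} \\
&= \underbrace{\mathbb{E}_{z \sim q_\phi(z|x)}\left[\log p(x|z)\right]}_{\text{reconstruction error}} - \underbrace{D_{KL}\left[q(z|x) \,\|\, p(z)\right]}_{\text{make approx. posterior close to prior}} + \underbrace{D_{KL}\left[q(z|x) \,\|\, p(z|x)\right]}_{\text{intractable, but } \geq 0} \tag{6.13}
\end{align}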
Here we have used that: p(x) does not depend on z (6.9), Bayes' rule (6.10), multiplying numerator
and denominator by the same quantity leaves the expression unchanged (6.11), the logarithm of a
product is the sum of the logarithms (6.12), and the definition of the KL divergence, which is
non-negative (6.13) (as proven in subsection 6.4.2).
Since, as we have said before, p(z|x) is not tractable, we cannot optimize the last term. However,
since that term is non-negative, the first two terms form a lower bound on the log-likelihood. As a
consequence, we aim to maximize the first two terms (that we call ELBO, Evidence Lower BOund)
of the expression.
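\[ \text{ELBO}_{\text{VAE}} = \underbrace{\mathbb{E}_{z \sim q_\phi(z|x)}\left[\log p(x|z)\right]}_{\text{reconstruction error}} - \underbrace{D_{KL}\left[q(z|x) \,\|\, p(z)\right]}_{\text{KL divergence}} \tag{6.14} \]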
In particular we want to jointly maximize the expected log-likelihood of the reconstruction (that
is, minimize the reconstruction error) and minimize the KL divergence between the approximate
posterior and the prior. The first term encourages the encoder to form clusters where samples from
the same category or with similar properties are closely located in the latent space. The second
term encourages the encoder to project latent representations evenly around the center of the
latent space. A visualization of the effect of each term on the learned latent space can be found
in Figure 6.12.
Figure 6.12: A comparison between the latent spaces of three autoencoders trained on
MNIST optimizing (left to right): reconstruction error, KL term, jointly reconstruction
error and KL term.
Observe that in practice we assume p(z) ∼ N (0, I) (which enforces a diagonal covariance matrix, and
thus improves disentanglement) and q(z|x) ∼ N (µ, σ²I) (which makes the KL term analytically
computable).
6.4.4 Training
Now that we have obtained an objective function to optimize, in order to proceed with the actual
training we must be able to compute the gradients of the ELBO with respect to the parameters of
the encoder and the decoder. However, we have a problem. The process of sampling (in the case of
z) from a distribution that is parameterized by our model is not differentiable. For this reason, in
order to compute the gradients, we need to find a method of making our predictions separate from
the stochastic sampling element.
The solution to that is the so-called reparametrization trick, which involves treating the random
sampling as a single noise term. In particular, instead of considering z to be sampled from a
N (µ, σ²I) distribution, we consider it to be a deterministic function z = µ + σ ⊙ ϵ of the encoder
outputs, where ϵ ∼ N (0, I) is a random noise term. The benefit is that the prediction of mean and
standard deviation is no longer tied to the stochastic sampling operation. This means that we can
now differentiate with respect to our model's parameters again.
Figure 6.13: Reparametrization trick explained graphically. Circles are stochastic variables
and diamonds are deterministic variables (i.e., neural network layers).
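A minimal sketch of the trick, using plain Python lists instead of tensors. The function name and the numeric values of µ and σ are illustrative assumptions; in a real model µ and σ would be the encoder outputs and an autodiff framework would compute the gradients.

```python
import random

def reparameterize(mu, sigma, rng):
    """Return z = mu + sigma * eps (element-wise) together with eps ~ N(0, I)."""
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    z = [m + s * e for m, s, e in zip(mu, sigma, eps)]
    return z, eps

rng = random.Random(0)
mu, sigma = [1.0, -2.0], [0.5, 0.1]
z, eps = reparameterize(mu, sigma, rng)

# Once eps is drawn, z is a deterministic function of (mu, sigma), with
# dz/dmu = 1 and dz/dsigma = eps, so gradients can flow back to the encoder.
samples = [reparameterize(mu, sigma, rng)[0] for _ in range(20000)]
empirical_mean_0 = sum(s[0] for s in samples) / len(samples)
```

The empirical mean of the first component stays close to µ1 = 1.0, which confirms that the samples still follow N(µ, σ²I) even though the randomness now enters only through ϵ.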
6.4.5 Training in practice
For implementation and clarity purposes we report here all the steps of a forward pass in a VAE.
1. Consider a training sample x

2. Use the encoder to reduce the dimensions of x to a low-dimensional representation of it, say x′
3. Now feed x′ into the mean NN and into the variance NN to obtain respectively µ and σ² (and thus σ)
4. Sample an ϵ from N (0, I) and write z = µ + ϵ ⊙ σ
5. Feed z to the decoder and obtain x̂
6. Compute the loss L which, as we have seen before, consists of two terms: reconstruction loss
and KL divergence term. In practice, it can be written as:
\[ L = \underbrace{\text{BCE}(x, \hat{x})}_{\text{reconstruction loss}} + \underbrace{0.5 \cdot \sum_j \left(\sigma_j^2 + \mu_j^2 - \log(\sigma_j^2) - 1\right)}_{\text{KL divergence term}} \tag{6.15} \]
Here σj² is the predicted variance of the j-th latent dimension, and the last term is the KL
divergence between the approximate posterior and the prior.
For more details on the derivation of this term, refer to the Pen&Paper homeworks.
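The loss of step 6 can be sketched in a few lines. This is an illustration under stated assumptions: sigma holds standard deviations (so the variance is sigma squared), both terms are summed over components, and the small constant eps inside the logarithms is only there for numerical safety.

```python
import math

def bce(x, x_hat, eps=1e-12):
    """Binary cross-entropy between target x and reconstruction x_hat, summed over components."""
    return -sum(xi * math.log(xh + eps) + (1.0 - xi) * math.log(1.0 - xh + eps)
                for xi, xh in zip(x, x_hat))

def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), with sigma the standard deviations."""
    return 0.5 * sum(s * s + m * m - math.log(s * s) - 1.0
                     for m, s in zip(mu, sigma))

def vae_loss(x, x_hat, mu, sigma):
    """Reconstruction loss plus KL term, to be minimized."""
    return bce(x, x_hat) + kl_to_standard_normal(mu, sigma)

kl_at_prior = kl_to_standard_normal([0.0, 0.0], [1.0, 1.0])   # posterior equals the prior
kl_shifted = kl_to_standard_normal([1.0], [0.5])
loss = vae_loss([1.0, 0.0], [0.9, 0.2], [1.0], [0.5])
```

The KL term vanishes exactly when the encoder outputs µ = 0 and σ = 1, and grows as the predicted distribution moves away from the standard normal.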
6.4.6 Generating new data
Once the network is trained, in order to generate new data samples we sample a vector z from N (0, I)
and we use the decoder network to obtain its representation in the original space. However, the
representations obtained in that way are still entangled. For instance, in the case of the MNIST
dataset we do not have an explicit way to sample a 1 rather than a 9. We would like to further
structure the latent space in order to have a disentangled representation in which each dimension
corresponds to a factor of variation of the data (digits, style, thickness, orientation, etc.).
There are mainly two solutions to this problem: training the network in a semi-supervised way to
make it learn the labels of the data, or using the so-called β-VAEs. We are mainly interested in
discovering and separating the important factors of variation in an unsupervised fashion, so in the
next section we will present the β-VAE [3].
6.5 β-VAE
As we have seen before, in the ELBO the KL loss enforces independent Gaussians (diagonal
covariance matrix). However, due to the competition between the two losses, this is not always
achieved.
The idea introduced by β-VAE is to give more weight to the KL term, by multiplying it by an
adjustable hyperparameter β that balances latent channel capacity and independence constraints
with reconstruction accuracy. The intuition behind that is that if factors are in practice
independent from each other (as style and digit in MNIST), the model should benefit from
disentangling them.
In practice, we want to force the KL loss to be under a certain threshold, so we write
\begin{align}
&\max_{\phi,\theta} \; \mathbb{E}_{x \sim D}\left[\mathbb{E}_{z \sim q_\phi(z|x)}\left[\log p_\theta(x|z)\right]\right] \tag{6.17} \\
&\text{subject to } D_{KL}\left(q_\phi(z|x) \,\|\, p_\theta(z)\right) < \delta \tag{6.18}
\end{align}
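Re-writing Eq. 6.17 with the constraint of Eq. 6.18 as a Lagrangian under the KKT conditions, we
obtain:
\begin{align}
F(\theta, \phi, \beta) &= \mathbb{E}_{z \sim q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - \beta \left(D_{KL}\left(q_\phi(z|x) \,\|\, p_\theta(z)\right) - \delta\right) \tag{6.19} \\
&= \mathbb{E}_{z \sim q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - \beta \cdot D_{KL}\left(q_\phi(z|x) \,\|\, p_\theta(z)\right) + \underbrace{\beta\delta}_{\geq 0} \tag{6.20} \\
&\geq \mathbb{E}_{z \sim q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - \beta \cdot D_{KL}\left(q_\phi(z|x) \,\|\, p_\theta(z)\right) \tag{6.21}
\end{align}
Thus, the loss (opposite of the objective function) becomes
\[ L_\beta = -\mathbb{E}_{z \sim q_\phi(z|x)}\left[\log p_\theta(x|z)\right] + \beta \cdot D_{KL}\left(q_\phi(z|x) \,\|\, p_\theta(z)\right) \]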
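In code, the only change with respect to the VAE loss is the weight on the KL term. The sketch below assumes that the two terms have already been computed for a sample; the numeric values are hypothetical and only serve to show the effect of β.

```python
def beta_vae_loss(reconstruction_loss, kl_term, beta):
    """L_beta = reconstruction loss + beta * KL; beta = 1 recovers the VAE loss."""
    return reconstruction_loss + beta * kl_term

# Hypothetical per-sample values: a reconstruction loss of 80.0 and a KL term of 12.5.
losses = {beta: beta_vae_loss(80.0, 12.5, beta) for beta in (1.0, 4.0)}
```

With β > 1 the same KL value is penalized more heavily, so the optimizer is pushed towards posteriors closer to the factorized prior, at the cost of reconstruction accuracy.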
7. Autoregressive models
7.1 Regressive model
A regressive model is a model whose outputs are a linear combination of the inputs.
7.2 Sequence model
Given seed x1 , ..., xk , our interest is to produce xk+1 .

Once produced, we want to predict xk+2 using x2 , ..., xk+1 , and so on.
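The sliding-window generation described above can be sketched as follows. The predictor is a placeholder: here it is a fixed linear combination of the last two values (the Fibonacci rule), chosen only to make the output easy to check, whereas in practice it would be a learned model.

```python
def rollout(seed, predict_next, steps):
    """Generate `steps` new elements; each prediction sees the last len(seed) values."""
    k = len(seed)
    sequence = list(seed)
    for _ in range(steps):
        window = sequence[-k:]            # x_{t-k+1}, ..., x_t
        sequence.append(predict_next(window))
    return sequence

def fib_next(window):
    # Hypothetical predictor: a linear combination of the inputs (regressive model).
    return window[-1] + window[-2]

generated = rollout([1, 1], fib_next, 5)
```

Each generated element is appended to the sequence and becomes part of the input window for the next prediction, which is what makes the model autoregressive.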
7.3 A toy example: prediction of a B&W image
If we have n pixels, we have 2^n possible states (each pixel can be black or white).
We can model this problem using n Bernoulli variables X1, ..., Xn s.t. Val(Xi) = {0, 1} =
{Black, White}.
If we sample from p(x1, ..., xn), we generate an image (we use lowercase xi since the probability
function maps a realization to a probability mass).
First attempt: tabular approach
Using a tabular approach, via the chain rule of probability we can factorize the joint distribution
over the n dimensions:
\[ p(x) = \prod_{i=1}^{n} p(x_i \mid x_1, ..., x_{i-1}) = \prod_{i=1}^{n} p(x_i \mid x_{<i}) \]
The conditional p(xi | x<i) needs one parameter for each of the 2^(i−1) configurations of x<i, so in
total we need 2^n − 1 parameters (one less than the number of states, since the total probability
must be 1).
However, an exponential number of parameters is far too many to train.
What if we assume independence of the variables?

In this case we will obtain
\[ p(x) = p(x_1) \cdot ... \cdot p(x_n) \]
and the number of parameters needed would be n, a huge drop in the number of parameters.
However, the independence assumption is likely too strong: generation would just be independent
random sampling of each pixel. As a consequence, we need to find a solution that balances the
number of parameters to learn and the capacity of the model.
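The gap between the two extremes can be made concrete by counting parameters. The sketch assumes one Bernoulli parameter per configuration of the conditioning pixels in the tabular case; the image size of 28 × 28 is only an example.

```python
def tabular_params(n):
    """Full chain-rule table: p(x_i | x_<i) needs 2**(i-1) Bernoulli parameters."""
    return sum(2 ** (i - 1) for i in range(1, n + 1))   # equals 2**n - 1

def independent_params(n):
    """Fully factorized model: one Bernoulli parameter per pixel."""
    return n

n_pixels = 28 * 28
tabular_digits = len(str(tabular_params(n_pixels)))   # number of decimal digits
independent = independent_params(n_pixels)
```

For a 28 × 28 binary image the tabular count is a number with 237 decimal digits, against 784 parameters for the independent model.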
Another attempt: Autoregressive generative models
Idea: assume p(xi | x<i) to correspond to a Bernoulli random variable and learn to map the previous
inputs to its mean:
\[ p_{\theta_i}(x_i \mid x_{<i}) = \text{Ber}\left(f_i(x_1, ..., x_{i-1})\right) \]
Now our problem is to find a useful definition of fi : {0, 1}^(i−1) → [0, 1].
7.4 Fully Visible Sigmoid Belief Network
In a Fully Visible Sigmoid Belief Network we model f via logistic regression.

In particular, we define fi as:
\[ f_i(x_1, x_2, ..., x_{i-1}) = \sigma\left(\alpha_0^i + \alpha_1^i x_1 + ... + \alpha_{i-1}^i x_{i-1}\right) \]
At each time step i, we will have i parameters (θi = (α0^i, α1^i, ..., αi−1^i)); thus, over a horizon
of n steps we will have
\[ \text{number of parameters} = \sum_{i=1}^{n} i = O(n^2) \]
Figure 7.1: Fully Visible Sigmoid Belief Network
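A small sketch of such a model, with randomly initialized (untrained) coefficients, shows how the likelihood is evaluated and how a sample is drawn pixel by pixel. The function names and the choice n = 4 are assumptions for the example; with so few pixels we can enumerate all 2^n images and check that the probabilities sum to one.

```python
import itertools
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def conditional(alpha_i, x_prev):
    """f_i(x_<i): alpha_i[0] is the bias, alpha_i[1:] weights the previous pixels."""
    return sigmoid(alpha_i[0] + sum(a * x for a, x in zip(alpha_i[1:], x_prev)))

def log_prob(alphas, x):
    """log p(x) as the sum of the log Bernoulli conditionals."""
    total = 0.0
    for i, xi in enumerate(x):
        p1 = conditional(alphas[i], x[:i])
        total += math.log(p1 if xi == 1 else 1.0 - p1)
    return total

def sample(alphas, rng):
    """Draw the pixels one at a time, each conditioned on the ones already drawn."""
    x = []
    for alpha_i in alphas:
        x.append(1 if rng.random() < conditional(alpha_i, x) else 0)
    return x

rng = random.Random(0)
n = 4
alphas = [[rng.uniform(-1, 1) for _ in range(i + 1)] for i in range(n)]
num_params = sum(len(a) for a in alphas)               # 1 + 2 + ... + n
total_mass = sum(math.exp(log_prob(alphas, list(x)))
                 for x in itertools.product([0, 1], repeat=n))
image = sample(alphas, rng)
```

The number of parameters is n(n + 1)/2, and the model is a valid distribution by construction because every factor is a normalized Bernoulli.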

7.4.1 Neural Autoregressive Density Estimator (NADE)
The main example of a Fully Visible Sigmoid Belief Network is NADE and its variants.
There the function is expressed as
\[ h_i = \sigma\left(b + W_{\cdot,<i}\, x_{<i}\right) \]
\[ \hat{x}_i = \sigma\left(c_i + V_{i,\cdot}\, h_i\right) \]
where W·,<i represents the first i − 1 columns of W and Vi,· is the i-th row of V.
So we can express the k-th component of hi as
\[ h_{i,k} = \sigma\left(b_k + \sum_{j=1}^{i-1} W_{kj} x_j\right) \]
and our final output will be
\[ \hat{x}_i = \sigma\left(c_i + V_{i,\cdot}\, h_i\right) \]
We notice that x̂i ∈ R is a value between 0 and 1: NADE is a model for binary data, and at each time
step i we predict p(xi = 1 | x<i).
Figure 7.2: NADE architecture
Notice that the processing order is predefined.

Training of NADE
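We train this model by maximizing the average log-likelihood
\begin{align}
\frac{1}{T} \sum_{t=1}^{T} \log p(x^t) &= \frac{1}{T} \sum_{t=1}^{T} \log\left(\prod_{i=1}^{D} p(x_i^t \mid x_{<i}^t)\right) \\
&= \frac{1}{T} \sum_{t=1}^{T} \sum_{i=1}^{D} \log p(x_i^t \mid x_{<i}^t)
\end{align}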
What is nice about NADE?
• Efficiency: the computations are in O(T D)

• It was originally designed for binary observations, but it is easily extendable to other types
of observations (reals, multinomials)
• It could make use of second-order optimizers
We have seen that the inputs are processed in order. A lot of research has been done to understand
the best ordering of the vector x; however, a random order has been shown to work fine.
During the training of NADE the teacher forcing approach is used: ground truth values of the
pixels are used for conditioning when predicting subsequent values (the values predicted by the
model are not fed back during training).
During inference the predicted values are used instead: it is a fully generative model.
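The difference between training and inference can be sketched with a tiny NADE with random (untrained) weights. The shapes (W is H × D, V is D × H), the function names and the sizes D = 4, H = 3 are assumptions for the example. The running pre-activation `a` reuses the computation of the previous step, so one pass over the D pixels costs O(HD).

```python
import itertools
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def nade_conditionals(x, W, V, b, c):
    """Return [p(x_i = 1 | x_<i)] for a binary vector x, conditioning on the given x."""
    H, D = len(b), len(x)
    a = list(b)                                   # b + W[:, <i] x_<i, updated incrementally
    probs = []
    for i in range(D):
        h = [sigmoid(a_k) for a_k in a]
        probs.append(sigmoid(c[i] + sum(V[i][k] * h[k] for k in range(H))))
        for k in range(H):                        # add column i once x_i is known
            a[k] += W[k][i] * x[i]
    return probs

def log_likelihood(x, W, V, b, c):
    """Training-time objective: x holds the ground truth (teacher forcing)."""
    probs = nade_conditionals(x, W, V, b, c)
    return sum(math.log(p if xi == 1 else 1.0 - p) for xi, p in zip(x, probs))

def sample(W, V, b, c, rng):
    """Inference: each pixel is conditioned on the pixels sampled so far."""
    D = len(c)
    x = [0] * D
    for i in range(D):
        p_i = nade_conditionals(x, W, V, b, c)[i]  # depends only on x[:i]
        x[i] = 1 if rng.random() < p_i else 0
    return x

rng = random.Random(0)
D, H = 4, 3
W = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(H)]
V = [[rng.uniform(-1, 1) for _ in range(H)] for _ in range(D)]
b = [rng.uniform(-1, 1) for _ in range(H)]
c = [rng.uniform(-1, 1) for _ in range(D)]
total_mass = sum(math.exp(log_likelihood(list(x), W, V, b, c))
                 for x in itertools.product([0, 1], repeat=D))
generated = sample(W, V, b, c, rng)
```

For simplicity `sample` recomputes all conditionals at every step; an efficient implementation would keep the running pre-activation across steps instead.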
NADE’s extensions
Some extensions of NADE are:
• Real-valued NADE: it expands to real valued data, modelling the conditionals as mixture of
Gaussians
• Orderless and deep NADE (DeepNADE): a single deep neural network is trained to assign a
conditional distribution to any variable given any subset of the others.
• Convolutional NADE (ConvNADE)
7.5 Masked Autoencoder Distribution Estimation (MADE)
As in autoencoders, our objective is to learn hidden representations of the inputs that reveal the
statistical structure of the distribution that generated them.
However, the autoencoder takes the input as a whole and, thus, it does not satisfy the autoregressive
property.
So now we are going to impose some constraints on an autoencoder in order to make it fulfil the
autoregressive property.
In order to best understand the procedure we are going to explain next, recall that in a classical
autoencoder we have that
\[ h(x) = g(b + Wx) \]
\[ \hat{x} = \sigma(c + V h(x)) \]
where g is the non-linear activation function of the hidden layer.

Now we want to build some masks M^W and M^V whose entries that are set to 0 correspond to the
connections we wish to remove. Thus, we write
\[ h(x) = g\left(b + (W \odot M^W) x\right) \]
\[ \hat{x} = \sigma\left(c + (V \odot M^V) h(x)\right) \]
In particular, we want the masks to ensure that the output x̂i is computed only from the inputs
x1, ..., xi−1 (as NADE does by using only the first i − 1 columns of W and the i-th row of V).
In order to do that, we follow this process:
1. We assign (sampling uniformly) to each unit in the hidden layers an integer m s.t. 1 ≤ m ≤ D − 1,
so that m^l(k) denotes the value assigned to the k-th unit of layer l (the inputs and the outputs
are numbered 1, ..., D according to the chosen ordering)
2. Between consecutive layers up to the last hidden layer, allow connections only towards units
whose m is greater than or equal to that of the source unit, not towards smaller ones
3. Between the last hidden layer and the output, allow connections only towards units whose m is
strictly greater
Let us clarify formally how to compute the entries of the M^W and M^V matrices in order to satisfy
the conditions expressed in the last two steps.
\[ M^W_{ij} = \mathbb{1}\left(m^l(i) \geq m^{l-1}(j)\right) \]
\[ M^V_{ij} = \mathbb{1}\left(m^l(i) > m^{l-1}(j)\right) \]
Using such a procedure it can be proved that we always end up with an autoencoder which fulfils
the autoregressive property.
Computing p(x) is just a matter of performing a forward pass. When implementing MADE, we usually
use ReLU for the hidden layers and a sigmoid for the last one.
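The mask construction can be sketched for a single hidden layer as follows. The function names and the sizes D = 5, H = 8 are assumptions for the example; inputs and outputs are numbered 1, ..., D in their natural order. Multiplying the two masks counts the paths from each input to each output, which lets us verify the autoregressive property directly.

```python
import random

def made_masks(D, H, rng):
    """Masks for a MADE with D inputs/outputs and one hidden layer of H units."""
    m_in = list(range(1, D + 1))                         # input/output d has number d
    m_hid = [rng.randint(1, D - 1) for _ in range(H)]    # uniform in {1, ..., D-1}
    # Hidden unit k may see input d only if m_hid[k] >= m_in[d].
    M_W = [[1 if m_hid[k] >= m_in[d] else 0 for d in range(D)] for k in range(H)]
    # Output d may see hidden unit k only if m_in[d] > m_hid[k].
    M_V = [[1 if m_in[d] > m_hid[k] else 0 for k in range(H)] for d in range(D)]
    return M_W, M_V

def connectivity(M_W, M_V):
    """Entry [d][j] counts the paths from input j to output d."""
    D, H = len(M_V), len(M_W)
    return [[sum(M_V[d][k] * M_W[k][j] for k in range(H)) for j in range(D)]
            for d in range(D)]

rng = random.Random(0)
D, H = 5, 8
M_W, M_V = made_masks(D, H, rng)
paths = connectivity(M_W, M_V)
```

Every entry of `paths` on or above the diagonal is zero, so output d never depends on input d or on later inputs, and the first output depends on no input at all.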
7.6 Generative model of Natural images
As we have seen in the previous pages, our goal is to generate new images, given some true ones.
In order to do that we use the chain rule to decompose the likelihood of an image x into a product
of 1D distributions
\[ p(x) = \prod_{i=1}^{n} p(x_i \mid x_1, ..., x_{i-1}) \]
In order to train the model we want to maximize the likelihood of the training data.
We mainly have two issues:
1. Decide an ordering of the pixels of the image

2. Model the probability distribution p(x)
We will now explore some approaches.
7.7 Pixel RNN
Idea: we generate the image pixels starting from a corner, and we model the dependency on previous
pixels using an RNN (LSTM).
Note that from now on pixels will be RGB values (3 channels, 256 possible values each).
The issue is that sequential generation is slow, due to the explicit pixel dependencies. This issue
has been addressed in Pixel CNN, which models the dependencies on previous pixels with a CNN over a
context region.
7.8 Pixel CNN
We still generate images starting from the top left corner, but we model the dependencies using a
CNN with masked convolutions, in order to satisfy the autoregressive property.
Training maximizes the likelihood of the training data as before: we use a softmax loss over the
pixel values (from 0 to 255), so each pixel is predicted as a distribution over its possible values,
and at generation time its value is picked from this predicted distribution.
How much of the context is considered by the CNN?
The context is restricted by the masked convolutions, which ensure that the autoregressive property
is satisfied.
Only the pixels shown in blue within the receptive field of the filter are used.
Figure 7.3: Masked convolution
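A spatial mask of this kind can be sketched as below for a single channel. The function name and the `include_center` flag are assumptions for the example: the flag distinguishes a mask that hides the current pixel from one that lets it through.

```python
def causal_mask(k, include_center):
    """k x k mask (k odd): 1 for the rows above and for the pixels to the left in the center row."""
    c = k // 2
    mask = [[0] * k for _ in range(k)]
    for r in range(k):
        for col in range(k):
            above = r < c
            left_in_row = (r == c and col < c)
            center = (r == c and col == c)
            if above or left_in_row or (include_center and center):
                mask[r][col] = 1
    return mask

mask_without_center = causal_mask(3, include_center=False)
mask_with_center = causal_mask(3, include_center=True)
```

The mask is multiplied element-wise with the convolution kernel before the convolution is applied, so the weights on not-yet-generated pixels never contribute to the output.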

7.8.1 Autoregressive over color-channels
The model is autoregressive over the color channels
\[ p(x_i \mid x_{<i}) = p(x_{i,R} \mid x_{<i}) \, p(x_{i,G} \mid x_{<i}, x_{i,R}) \, p(x_{i,B} \mid x_{<i}, x_{i,R}, x_{i,G}) \]
In order to guarantee this property, we use two types of masks:
• Mask A: this mask is only applied to the first convolutional layer and restricts connections to
those colors in the current pixel that have already been predicted.
• Mask B: this mask is applied to the other layers and relaxes mask A by also allowing the
connection from a color in the current pixel to itself
Figure 7.4: RGB masked convolution, when RED and GREEN have already been predicted
for the current pixel
In order to make the training faster, the authors proposed a parallelization of the operations, using
a stack of masked convolutions.
The problem that arises is the presence of a blind spot (the pixel we are considering does not
depend, directly or indirectly, on a group of pixels, even though it should).
Figure 7.5: Blind spot problem
The authors of the paper have proposed to remove the blind spot by combining two convolutional
network stacks:
• one that conditions on the current row so far (horizontal stack)

• one that conditions on all rows above (vertical stack)
7.8.2 Gated PixelCNN
Moreover, in order to try to obtain results as good as the ones obtained with the RNN, the authors
replaced the rectified linear units between the masked convolutions in the original PixelCNN with
the following gated activation unit
\[ h_{k+1} = \tanh(W_{k,f} * h_k) \odot \sigma(W_{k,g} * h_k) \]
The introduction of this combination between vertical and horizontal convolutions makes sure that
the pixel dependencies are preserved in the right order.
Figure 7.6: GatedPixelCNN
The results obtained by this model were images which looked like natural images at first glance,
but which have no semantic meaning if observed carefully.
However, this model has historical value, since it showed that generating images which look like
real images is possible using neural networks.
The thing we like compared to PixelRNN is that training is much faster. However, the generation
remains sequential and, thus, slow.
7.9 TCNs-WaveNet
The idea was to adapt PixelCNN to work with audio data, where the dimensionality is much larger
(at least 16,000 samples per second).
WaveNet is based on the idea of dilated convolution, a type of convolution which allows capturing
longer-range dependencies without increasing the number of layers.
Figure 7.7: Dilated Convolution
In particular, WaveNet increases the dilation factor as we go up in the layers, so that with the
same number of layers we have a larger receptive field.
Figure 7.8: Dilated Convolution in WaveNet
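The effect of dilation on the receptive field can be computed directly. The sketch assumes a stack of stride-1 convolutions with the same kernel size; the kernel size of 2 and the doubling dilation schedule are example values.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in input samples) of a stack of stride-1 dilated convolutions."""
    # Each layer with dilation d extends the field by (kernel_size - 1) * d samples.
    return 1 + sum((kernel_size - 1) * d for d in dilations)

plain = receptive_field(2, [1, 1, 1, 1])      # four ordinary convolutions
dilated = receptive_field(2, [1, 2, 4, 8])    # dilation doubled at every layer
```

With four layers the receptive field grows from 5 samples to 16: linear growth in the number of layers becomes exponential growth when the dilation is doubled at each layer.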
The complete architecture is the following
Figure 7.9: Dilated Convolution in WaveNet
We can also use this architecture in motion modeling, where it can predict future poses.
7.10 RNNs are autoregressive models
If we consider the output at timestep t as an input for the next timestep, RNNs can be seen as
autoregressive models. In order to generate new images, we randomly sample h0 and then we generate
the other pixels in a deterministic fashion.
Figure 7.10: RNNs are autoregressive models

Pro: exact log-likelihood objective
Cons: slow inference due to sequential generation
7.10.1 VRNN: A Recurrent Latent Variable Model for Sequential Data
However, the internal transition structure of a standard RNN is entirely deterministic; the only
source of randomness or variability can be found in the conditional output probability model
pθ(xt | x<t).
As a consequence, RNNs are often augmented with random latent variables in order to:
• increase model capacity and better capture uncertainty

• infer high-level abstractions (e.g. who is the speaker in a dialogue or the style of a
handwriting); this is also useful in generative models, since it increases high-level control
during generation
The goal is to increase the expressive power of RNNs by incorporating stochastic latent variables
into the hidden state of an RNN.
In order to do that we combine an RNN and a VAE by including two latent variables per timestep.
They allow us to specify priors.
7.11 Self-Attention and Transformers
We form the prediction of the current time step by taking a convex combination of the entire input
sequence.
The Attention operation learns to identify/select the relevant past information for the next step.
Figure 7.11: Transformer model
7.11.1 Keys, values and queries
A linear mapping transforms the inputs/embeddings into Key, Value and Query embeddings. In
particular
\[ K = X W_K, \qquad V = X W_V, \qquad Q = X W_Q \]
where x1, ..., xT ∈ R^D and X ∈ R^(T×D).

In general we say that WK, WV, WQ ∈ R^(D×D), but keys and values could also have different
dimensionality.
Intuitively, the model at each time step works like this:
1. It transforms the input embedding into a key, by multiplying xt with the key matrix

2. It computes the query deriving from xt, by multiplying xt with the query matrix
3. It computes the dependencies of the query of the current time step on all the keys of the
previous time steps. In particular it applies a softmax operation over all the previous keys and
it evaluates the attention weights
\[ \alpha = \text{softmax}\left(\frac{Q K^\top}{\sqrt{D}}\right), \quad \text{where } \alpha \in \mathbb{R}^{1 \times T} \]
which gives us a normalized relevance score for every past step. Note that the term 1/√D is used
to control the variance of the output
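More formally
\[ Y = \alpha V = \alpha X W_V = \text{softmax}\left(\frac{(X W_Q)(X W_K)^\top}{\sqrt{D}} + M\right)(X W_V) \]
Note that the entry (QK^⊤/√D)_ij expresses how much the query i depends on the key j; thus using a
matrix M as drawn below allows us to delete the dependencies of queries on higher-indexed (and thus
future) keys.
\[ M = \begin{pmatrix} -\infty & -\infty & \cdots & -\infty \\ 0 & -\infty & \cdots & -\infty \\ \vdots & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & -\infty \end{pmatrix} \]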
The complexity of this operation is O(T² · D), i.e. it is quadratic in T. Strictly speaking the
computational cost is higher, since at every time step the matrix M changes, so an O(n) should be
added for sequential operations.
The term M prevents the model from accessing future steps (in order to fulfil the autoregressive
property); if we are doing seq2seq mapping we do not need it. In particular, M is a matrix whose
upper triangular part is filled with very negative values. It masks out the influence of future
elements in the sequence on the prediction of the current time step.
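The masked attention operation can be sketched in pure Python as follows. Two choices here are assumptions of the example: the weights are random rather than learned, and the diagonal of M is left at 0 (the common variant in which each query may also attend to its own position), so that the first row of the softmax always has at least one admissible key.

```python
import math
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]     # exp(-inf) evaluates to 0.0
    s = sum(exps)
    return [e / s for e in exps]

def causal_self_attention(X, W_Q, W_K, W_V):
    """Y = softmax(Q K^T / sqrt(D) + M) V, with M = -inf strictly above the diagonal."""
    T, D = len(X), len(W_K[0])
    Q, K, V = matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V)
    scores = matmul(Q, [list(col) for col in zip(*K)])      # Q K^T, shape T x T
    neg_inf = float("-inf")
    masked = [[scores[i][j] / math.sqrt(D) + (0.0 if j <= i else neg_inf)
               for j in range(T)] for i in range(T)]
    alpha = [softmax(row) for row in masked]
    return matmul(alpha, V), alpha

rng = random.Random(0)
T, D = 4, 3
X = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(T)]
W_Q, W_K, W_V = ([[rng.uniform(-1, 1) for _ in range(D)] for _ in range(D)]
                 for _ in range(3))
Y, alpha = causal_self_attention(X, W_Q, W_K, W_V)
```

Each row of `alpha` is a convex combination over the admissible positions, and the output at step t does not change if later inputs are modified.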
8. Normalizing flows
8.1 Introduction
A VAE has a latent space, but not a tractable likelihood: we have to optimize an approximation.
Autoregressive models have a tractable likelihood, but not a latent space.
The idea is now to satisfy both conditions, by constructing a mapping from an easy
distribution to a complex one.
We can do that using the change of variables technique.
8.2 Change of variables technique
In 1-D, with the substitution x = g(u):
\[ \int_a^b f(x)\,dx = \int_{u=g^{-1}(a)}^{u=g^{-1}(b)} f(g(u))\, g'(u)\,du \]
If we think about probabilities then, given a probability density function pz(z) and x = f(z), where
f(·) is a monotone and differentiable function, we have
\[ p_x(x) = p_z\left(f^{-1}(x)\right) \left|\left(f^{-1}\right)'(x)\right| = p_z(z(x)) \left|\left(f^{-1}\right)'(x)\right| \]
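The 1-D formula can be checked on a case where the answer is known in closed form. The affine map x = a·z + b and the values a = 2, b = 1 are assumptions chosen for the example: if z is standard normal, then x must be normal with mean b and standard deviation |a|.

```python
import math

def normal_pdf(v, mean=0.0, std=1.0):
    return math.exp(-0.5 * ((v - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

a, b = 2.0, 1.0                      # x = f(z) = a * z + b, monotone and differentiable

def f_inverse(x):
    return (x - b) / a

def f_inverse_prime(x):
    return 1.0 / a

def p_x(x):
    """Change of variables: p_x(x) = p_z(f^{-1}(x)) * |(f^{-1})'(x)|."""
    return normal_pdf(f_inverse(x)) * abs(f_inverse_prime(x))

max_error = max(abs(p_x(x) - normal_pdf(x, mean=b, std=abs(a)))
                for x in [-3.0, -1.0, 0.0, 1.0, 2.5, 6.0])
```

The factor |(f⁻¹)′(x)| = 1/|a| is what rescales the density: stretching the variable by a factor a spreads the same probability mass over an interval a times wider.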
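Let us now move to the 2-D case.

Let z be a 2-D random vector and x = f(z), where f is invertible and continuously differentiable.
Then we have
\[ dx = \left|\det\left(\frac{\partial f(z)}{\partial z}\right)\right| dz \]
and
\begin{align}
p_x(x) &= p_z\left(f^{-1}(x)\right) \left|\det\left(\frac{\partial f^{-1}(x)}{\partial x}\right)\right| \\
&= p_z\left(f^{-1}(x)\right) \left|\det\left(\frac{\partial f(z)}{\partial z}\right)\right|^{-1}
\end{align}
where the last step uses det(A⁻¹) = det(A)⁻¹ for an invertible matrix A.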
8.3 Parameterize the Transformation f with NN
The idea is now to parameterize the Transformation f with a simple MLP layer.
Let us now analyze the properties which must be satisfied by our neural network.
From a theoretical perspective, it must:
• be differentiable
• be invertible
• preserve the dimensionality
From a computational perspective, the Jacobian of the transformation must be computed efficiently.
As we can see from the examples below, the complexity of this computation depends on the form of
the transformation
Figure 8.1: Complexity of computing determinant
Thus, we want the Jacobian to be lower or upper triangular.

8.4 Coupling layers
Coupling layers are a neural network structure which fulfils all the desired requirements.
Figure 8.2: Coupling layers
Here β is some form of complex function that can include non-linearities and does not have to be
invertible (it can be a CNN or any other complex function).
We have another function h, an element-wise function, which combines (e.g. adds together) one
unprocessed half of the input with the other half of the input after it has gone through the complex
function β. This produces the first half of the output.
Then, the half of the input that was fed to β is itself copied unchanged to form the second half of
the output. That is important to ensure that we can invert the overall computation.
Let us see in detail the forward pass, the backward pass and the Jacobian matrix.
8.4.1 Forward pass
\[ \begin{pmatrix} y_A \\ y_B \end{pmatrix} = \begin{pmatrix} h(x_A, \beta(x_B)) \\ x_B \end{pmatrix} \]
8.4.2 Backward pass
\[ \begin{pmatrix} x_A \\ x_B \end{pmatrix} = \begin{pmatrix} h^{-1}(y_A, \beta(y_B)) \\ y_B \end{pmatrix} \]
8.4.3 Jacobian matrix
\[ J = \begin{pmatrix} h' & h' \beta' \\ 0 & 1 \end{pmatrix} \]
We can immediately notice that this matrix is upper triangular, as we wanted from the beginning.
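A minimal sketch of a coupling layer with an additive h, i.e. h(a, b) = a + b. The particular β used here (an element-wise square) is an arbitrary stand-in for a neural network, chosen because it is clearly not invertible.

```python
import math

def beta(x_b):
    """Arbitrary, non-invertible function of one half (stand-in for a network)."""
    return [v * v for v in x_b]

def forward(x_a, x_b):
    """y_A = h(x_A, beta(x_B)) = x_A + beta(x_B), y_B = x_B."""
    y_a = [a + s for a, s in zip(x_a, beta(x_b))]
    return y_a, list(x_b)

def inverse(y_a, y_b):
    """x_A = y_A - beta(y_B), x_B = y_B; beta itself is never inverted."""
    x_a = [a - s for a, s in zip(y_a, beta(y_b))]
    return x_a, list(y_b)

x_a, x_b = [0.3, -1.2], [2.0, -0.5]
y_a, y_b = forward(x_a, x_b)
rec_a, rec_b = inverse(y_a, y_b)
```

Because y_B = x_B, the inverse can recompute β(x_B) from the output alone. For this additive h the diagonal of the Jacobian is all ones, so its determinant is 1 and the layer does not change the volume.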
8.5 A Flow of Transformations
However, a single nonlinear transform (β) is normally not powerful enough; more complex
transformations can be attained via composition.
Now we have a flow of transformations, where we can see each transform as a NN.
\[ x = f(z) = f_K \circ f_{K-1} \circ ... \circ f_2 \circ f_1(z) \]
As the determinant of a product is the product of the determinants, we can write that
\[ p_x(x) = p_z\left(f^{-1}(x)\right) \prod_k \left|\det\left(\frac{\partial f_k^{-1}(x)}{\partial x}\right)\right| \]
8.5.1 Training
During training time, we can learn the model by maximizing the exact log-likelihood over the
dataset D.
Under the assumption that the samples are independently and identically distributed, we obtain
\[ \log p_x(D) = \sum_{x \in D} \left[ \log p_z\left(f^{-1}(x)\right) + \sum_k \log \left|\det\left(\frac{\partial f_k^{-1}(x)}{\partial x}\right)\right| \right] \]
8.5.2 Inference
To generate a sample x, we can draw a sample z from pz and transform it via f (as we can see
from the backward path).
To evaluate the probability of an observation x, we use the inverse transform to get its latent
variable z, evaluate pz at z, and correct it with the Jacobian determinant terms of the flow.
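Sampling and density evaluation can be sketched for a 2-D flow made of two coupling layers. The affine form of the coupling (a scale exp(s) and a shift t, in the style of RealNVP) and the specific functions s1, t1, s2, t2 are assumptions for the example; they stand in for learned networks. The two layers alternate which coordinate is transformed.

```python
import math
import random

def s1(v): return 0.5 * math.tanh(v)
def t1(v): return math.sin(v)
def s2(v): return -0.3 * math.tanh(v)
def t2(v): return 0.5 * v

def f(z):
    """x = f_2(f_1(z)): f_1 transforms the first coordinate, f_2 the second."""
    z1, z2 = z
    u1, u2 = z1 * math.exp(s1(z2)) + t1(z2), z2
    x1, x2 = u1, u2 * math.exp(s2(u1)) + t2(u1)
    return x1, x2

def f_inverse(x):
    """Return z = f^{-1}(x) and the sum over layers of log|det| of the inverse Jacobians."""
    x1, x2 = x
    u1, u2 = x1, (x2 - t2(x1)) * math.exp(-s2(x1))      # inverse of f_2
    z1, z2 = (u1 - t1(u2)) * math.exp(-s1(u2)), u2      # inverse of f_1
    log_det = -s2(x1) - s1(u2)
    return (z1, z2), log_det

def log_prob(x):
    """log p_x(x) = log p_z(f^{-1}(x)) + sum_k log|det(d f_k^{-1})|, with p_z = N(0, I)."""
    (z1, z2), log_det = f_inverse(x)
    log_pz = -0.5 * (z1 * z1 + z2 * z2) - math.log(2.0 * math.pi)
    return log_pz + log_det

def sample(rng):
    return f((rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)))

rng = random.Random(0)
x = sample(rng)
z_rec, _ = f_inverse(x)
```

Each inverse Jacobian is triangular, so its log-determinant is just the negative of the scale term, and the whole log-likelihood costs one pass through the inverse flow.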
8.6 Model architecture
The model is composed of L levels.

Each level contains K steps of flow, after which a part of the output is written out and the other is
fed up to the next layer.
Although coupling layers can be powerful, their forward transformation leaves some components
unchanged. This difficulty can be overcome by composing coupling layers in an alternating pattern,
such that the components that are left unchanged in one coupling layer are updated in the next.
In practice, we shuffle the input and we process a part at each step, so that at the end all the input
has been processed. How this shuffle happens is the main difference between different papers.
Note: K and L are hyperparameters of our network architecture.
Figure 8.3: RealNVP (Dinh et al. 2017) architecture

8.6.1 Squeeze and Split
We want to ensure that overall we maintain dimensionality and invertibility.

We do the splitting in this way:
1. Divide the matrix in subsquares of size 2 × 2

2. Starting from the top left corner we rotate clockwise and we assign each element to a different
channel
As a consequence, from a starting tensor of size W × H × C we obtain a tensor of size W/2 × H/2 × 4C.
Figure 8.4: Squeeze and Split
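The squeeze operation can be sketched with plain reshapes. The snippet below is an illustration under two assumptions that differ slightly from the text: arrays are stored as (H, W, C), and the four elements of each 2 × 2 block are assigned to channels in row-major order (top-left, top-right, bottom-left, bottom-right) rather than clockwise. Any fixed ordering keeps the operation invertible.

```python
import numpy as np

def squeeze(x):
    """Rearrange an (H, W, C) tensor into (H/2, W/2, 4*C)."""
    h, w, c = x.shape
    assert h % 2 == 0 and w % 2 == 0
    x = x.reshape(h // 2, 2, w // 2, 2, c)    # split rows and columns into 2x2 blocks
    x = x.transpose(0, 2, 1, 3, 4)            # (H/2, W/2, 2, 2, C)
    return x.reshape(h // 2, w // 2, 4 * c)

def unsqueeze(y):
    """Exact inverse of squeeze."""
    h2, w2, c4 = y.shape
    c = c4 // 4
    y = y.reshape(h2, w2, 2, 2, c)
    y = y.transpose(0, 2, 1, 3, 4)            # (H/2, 2, W/2, 2, C)
    return y.reshape(2 * h2, 2 * w2, c)
```

Since the operation only permutes entries, its Jacobian determinant has absolute value 1 and it does not contribute to the log-likelihood.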
8.7 Applications in Computer Vision
The following are some applications of normalizing flows in computer vision:
• Super-Resolution
• Disentanglement
• Multimodal modeling
• Noise modeling
• 3D Pose Estimation
9. GAN
9.1 Likelihood-free model
All the generative models we have seen so far are trained by maximizing the likelihood.
A question arises naturally: "Is the likelihood a good indicator of the quality of the generated samples?" To find an answer, let us consider the following two examples.
9.1.1 Case 1: great log-likelihood and poor samples
According to Theis et al., A Note on the Evaluation of Generative Models (Section 3.2), we can obtain a high log-likelihood even if we are generating poor samples.
Let p(x) be a model which generates good samples and q(x) a model which generates bad ones (just noise).
Consider the log likelihood of the model 0.01p(x) + 0.99q(x).
log(0.01p(x) + 0.99q(x)) ≥ log(0.01p(x)) = log(p(x)) − log 100
For high-dimensional data of dimension d, log p(x) is proportional to d while log 100 stays constant, so the relative loss in log-likelihood is negligible. Thus, we have obtained a model with a high log-likelihood that generates poor samples (the samples of this model will be noise 99% of the time).
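A small numerical illustration of this bound follows. The dimension and the two log-likelihood values are made-up numbers chosen only to show the orders of magnitude involved; the mixture is evaluated in log-space for numerical stability.

```python
import numpy as np

def mixture_log_lik(log_p, log_q, w_p=0.01):
    """log(w_p * p(x) + (1 - w_p) * q(x)), computed from log p(x) and log q(x)."""
    return np.logaddexp(np.log(w_p) + log_p, np.log1p(-w_p) + log_q)

d = 3072                    # e.g. a 32 x 32 x 3 image (illustrative)
log_p = -2.0 * d            # hypothetical log-likelihood under the good model
log_q = -5.0 * d            # hypothetical log-likelihood under the noise model
log_mix = mixture_log_lik(log_p, log_q)
gap = log_p - log_mix       # bounded by log(100), about 4.6 nats
relative_gap = gap / abs(log_p)
```

With these numbers the mixture loses at most about 4.6 nats out of several thousand, even though 99% of its samples are noise.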
9.1.2 Case 2: poor log-likelihood and great samples
Great samples are easily achieved by memorizing the training data.

However, unseen test samples are then assigned zero probability, so the test log-likelihood cannot be worse.
For these reasons, we are interested in a Likelihood-free model (also called Implicit Model or Neural
Sampler).
On the one hand, it solves the two main problems of the models we have seen before:
• It can handle a highly expressive model class (universal)

• It can handle a density function not defined or intractable
On the other hand, another problem arises: compared to explicit models, implicit models lack theory and learning algorithms.
9.2 Introduction to GAN
The basic idea of a GAN is to draw samples from a simple distribution (e.g. random noise) and use a neural network to learn a transformation into realistic images.
Figure 9.1: GAN: image generation
However, we still need to find a way to evaluate the generated samples.

In order to do that, we introduce a discriminator, whose role is to learn how to distinguish real
from fake images.
From this perspective, the goal of the generator becomes to maximize the discriminator's classification loss by generating real-looking images.
Let us give some definitions and formalize the process of learning.
9.3 Definitions
Let $x \in \mathbb{R}^D$ be an observation and p(z) a prior over a latent variable $z \in \mathbb{R}^Q$. Then:
• The generator is trained to (ideally) map normally distributed random inputs, drawn from the latent space Z, to samples that follow the data distribution.
$$G : \mathbb{R}^Q \to \mathbb{R}^D$$
• The discriminator is trained to output a probability
$$D : \mathbb{R}^D \to [0, 1]$$
so that ideally D(x) = 1 only if the image x is a real one.

9.4 Training
9.4.1 General idea
The generator G and the discriminator D can be implemented with arbitrary architectures (MLPs, CNNs, RNNs).
In particular:
• the discriminator is trained to predict 0 on the generated images x̂ and 1 on the real ones x
• the generator is then trained to confuse the discriminator and make it output the opposite
In practice, we reach an equilibrium when the discriminator outputs 0.5 for every image. Once training has succeeded, the generator is used to represent $p_{model}$, from which we want to draw samples.
Figure 9.2: GAN: training

9.4.2 Theory vs practice
In theory, it has been proven that this minimax game recovers pmodel = pdata if D and G are given
enough capacity and assuming that D∗ can be reached.
In practice, we must implement the game using an iterative, numerical approach.

Optimizing D to completion in the inner loop of training
• is computationally prohibitive
• on finite datasets would result in overfitting
Instead, we alternate between
• k steps of optimizing D (typically k ∈ {1, ..., 5})

• 1 step of optimizing G
This procedure keeps D near its optimum as long as G changes slowly enough.
More precisely, the training algorithm is the following:
While not converged do:
1. For k steps do:
(a) Draw N training samples $\{x^{(1)}, ..., x^{(N)}\}$ from $p_{data}(x)$
(b) Draw N noise samples $\{z^{(1)}, ..., z^{(N)}\}$ from p(z)
(c) Update D by ascending its stochastic gradient
$$\nabla_{\Theta_D} \frac{1}{N} \sum_{i=1}^{N} \left[ \log D\left(x^{(i)}\right) + \log\left(1 - D\left(G\left(z^{(i)}\right)\right)\right) \right]$$
2. For one step do:
(a) Draw N noise samples $\{z^{(1)}, ..., z^{(N)}\}$ from p(z)
(b) Update G by descending its stochastic gradient
$$\nabla_{\Theta_G} \frac{1}{N} \sum_{i=1}^{N} \log\left(1 - D\left(G\left(z^{(i)}\right)\right)\right) \qquad (9.1)$$
In practice, equation (9.1) may not provide sufficient gradient for G to learn well.
Early in learning, when G is poor, D can reject samples with high confidence because they are
clearly different from the training data. In this case, log(1 − D(G(z))) saturates.
Rather than training G to minimize log(1 − D(G(z))) we can train G to maximize log D(G(z)).
This objective function results in the same fixed point of the dynamics of G and D but provides
much stronger gradients early in learning.
Thus, point 2 becomes:
For one step do:
1. Draw N noise samples $\{z^{(1)}, ..., z^{(N)}\}$ from p(z)
2. Update G by ascending its stochastic gradient
$$\nabla_{\Theta_G} \frac{1}{N} \sum_{i=1}^{N} \log D\left(G\left(z^{(i)}\right)\right) \qquad (9.2)$$
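The saturation argument can be checked numerically. The sketch below assumes the discriminator ends in a sigmoid, so that D(G(z)) = sigmoid(a) for some logit a (this parameterization is an assumption made for the illustration), and compares the gradient of the two generator objectives with respect to that logit.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# a is the discriminator logit on a generated sample, so D(G(z)) = sigmoid(a).
# Very negative logits correspond to samples that D rejects with high confidence.
a = np.array([-6.0, -2.0, 0.0, 2.0])
d_fake = sigmoid(a)

# Saturating objective log(1 - D(G(z))): derivative w.r.t. a is -sigmoid(a).
grad_saturating = -d_fake
# Non-saturating objective log D(G(z)): derivative w.r.t. a is 1 - sigmoid(a).
grad_non_saturating = 1.0 - d_fake
```

For a = -6 the saturating gradient has magnitude of about 0.0025, while the non-saturating one is about 0.9975: the second objective gives G a much stronger signal exactly when its samples are poor.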
9.5 Theoretical analysis
9.5.1 Derivation of the GAN objective
Let us first assume that G is fixed.
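1. First, we create a training dataset:
(a) Take N samples $x^1, ..., x^N$ from the training set
(b) Draw N samples $z^1, ..., z^N$ from $Z \sim \mathcal{N}(0, 1)$
(c) Define the dataset $\mathcal{D} = \{(x^1, 1), ..., (x^N, 1), (G(z^1), 0), ..., (G(z^N), 0)\}$
2. Then, we define a loss function, in this case the binary cross-entropy
$$L(D) = -\frac{1}{2N} \left[ \sum_{i=1}^{N} y^{(i)} \log D\left(x^{(i)}\right) + \sum_{i=N+1}^{2N} \left(1 - y^{(i)}\right) \log\left(1 - D\left(x^{(i)}\right)\right) \right]$$
where $(x^{(i)}, y^{(i)})$ denotes the i-th element of the dataset.
3. We define an objective to train D as a function of G
$$D^* = \arg\min_D \; -\frac{1}{2} \left( E_{x \sim p_{data}}[\log D(x)] + E_{z \sim p_z}[\log(1 - D(G(z)))] \right)$$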
Our goal now is to find a generator G that will fool any D. In order to do that, we must increase L(D).
In particular, to find the optimal generator G∗, we define
$$V(G, D) = E_{x \sim p_{data}}[\log D(x)] + E_{z \sim p_z}[\log(1 - D(G(z)))]$$
where $x \sim p_{data}$ and $\hat{x} = G(z) \sim p_{model}$.
In order to fool any discriminator D (and so even the best one), we have to solve
$$G^*, D^* = \arg\min_G \max_D V(G, D)$$
Remember that, in practice, optimizing D in the inner-loop is computationally prohibitive and

would lead to overfitting on finite datasets.
9.5.2 Optimal Discriminator

Theorem 9.5.1 — Optimal Discriminator. For each generator G, the optimal discriminator is
$$D^*(x) = \frac{p_{data}(x)}{p_{data}(x) + p_{model}(x)}$$
Proof. The training criterion for the discriminator D, given any generator G, is to maximize the quantity V(G, D), in particular
$$\begin{aligned}
V(G, D) &= E_{x \sim p_{data}}[\log D(x)] + E_{z \sim p_z}[\log(1 - D(G(z)))] \\
&= \int_x p_{data}(x) \log D(x)\,dx + \int_z p_z(z) \log(1 - D(G(z)))\,dz \\
&= \int_x \left[ p_{data}(x) \log D(x) + p_{model}(x) \log(1 - D(x)) \right] dx
\end{aligned}$$
From mathematical analysis it can be proven that, for any $(a, b) \in \mathbb{R}^2 \setminus \{(0, 0)\}$ with $a, b \geq 0$, the function $y \mapsto a \log(y) + b \log(1 - y)$ achieves its maximum on [0, 1] at $y = \frac{a}{a+b}$. Note that the discriminator does not need to be defined outside of $Supp(p_{data}) \cup Supp(p_{model})$, where both densities are 0. Thus,
$$D^*(x) = \frac{p_{data}(x)}{p_{data}(x) + p_{model}(x)}$$
Note that D∗ is unique but, in practice, not desirable.
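The analysis fact used in the proof follows from setting the derivative $\frac{a}{y} - \frac{b}{1-y}$ to zero, which gives $y = \frac{a}{a+b}$. As a sanity check, the sketch below maximizes the function on a fine grid for one illustrative pair (a, b), standing in for the values of $p_{data}(x)$ and $p_{model}(x)$ at a fixed x.

```python
import numpy as np

def objective(y, a, b):
    """Pointwise integrand of V(G, D) as a function of y = D(x)."""
    return a * np.log(y) + b * np.log(1.0 - y)

a, b = 0.3, 1.2                                  # illustrative density values
ys = np.linspace(1e-4, 1 - 1e-4, 100_001)        # grid over (0, 1)
y_best = ys[np.argmax(objective(ys, a, b))]      # numerical maximizer
y_closed_form = a / (a + b)                      # value predicted by the theorem
```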
9.5.3 Global Optimality
Now that we have found the theoretically optimal discriminator, we are interested in which function of the generator we are actually optimizing.
Theorem 9.5.2 — Global optimum of the training criterion.
The global minimum of
$$V(G, D^*) = E_{x \sim p_{data}}[\log D^*(x)] + E_{z \sim p_z}[\log(1 - D^*(G(z)))]$$
is achieved if and only if $p_{model}(x) = p_{data}(x)$, and at the optimum $V(G, D^*) = -\log 4$.
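Proof
1. Substituting the previously found D∗ in the training criterion, and writing the expectation over z as an expectation over $x = G(z) \sim p_{model}$, we obtain
$$E_{x \sim p_{data}}\left[\log \frac{p_{data}(x)}{p_{data}(x) + p_{model}(x)}\right] + E_{x \sim p_{model}}\left[\log\left(1 - \frac{p_{data}(x)}{p_{data}(x) + p_{model}(x)}\right)\right]$$
2. Notice that
$$1 - \frac{p_{data}(x)}{p_{data}(x) + p_{model}(x)} = \frac{p_{data}(x) + p_{model}(x)}{p_{data}(x) + p_{model}(x)} - \frac{p_{data}(x)}{p_{data}(x) + p_{model}(x)} = \frac{p_{model}(x)}{p_{data}(x) + p_{model}(x)}$$
so, substituting this in the expression obtained in step 1, we have
$$E_{x \sim p_{data}}\left[\log \frac{p_{data}(x)}{p_{data}(x) + p_{model}(x)}\right] + E_{x \sim p_{model}}\left[\log \frac{p_{model}(x)}{p_{data}(x) + p_{model}(x)}\right]$$
3. Now inside the two logarithms we multiply and divide by 2, obtaining
$$\begin{aligned}
& E_{x \sim p_{data}}\left[\log \frac{2\,p_{data}(x)}{2\,(p_{data}(x) + p_{model}(x))}\right] + E_{x \sim p_{model}}\left[\log \frac{2\,p_{model}(x)}{2\,(p_{data}(x) + p_{model}(x))}\right] \\
={}& E_{x \sim p_{data}}\left[\log \frac{2\,p_{data}(x)}{p_{data}(x) + p_{model}(x)} - \log 2\right] + E_{x \sim p_{model}}\left[\log \frac{2\,p_{model}(x)}{p_{data}(x) + p_{model}(x)} - \log 2\right] \\
={}& E_{x \sim p_{data}}\left[\log \frac{2\,p_{data}(x)}{p_{data}(x) + p_{model}(x)}\right] + E_{x \sim p_{model}}\left[\log \frac{2\,p_{model}(x)}{p_{data}(x) + p_{model}(x)}\right] - \log 4 \\
={}& D_{KL}\left(p_{data} \,\Big\|\, \frac{p_{data} + p_{model}}{2}\right) + D_{KL}\left(p_{model} \,\Big\|\, \frac{p_{data} + p_{model}}{2}\right) - \log 4 \\
={}& 2 \cdot D_{JS}\left(p_{data} \,\|\, p_{model}\right) - \log 4
\end{aligned}$$
4. Since $D_{JS} \geq 0$ and we want to minimize the training criterion, we can conclude that
• the minimum is achieved when $D_{JS}(p_{data} \,\|\, p_{model}) = 0$, which happens iff $p_{data}(x) = p_{model}(x)$
• the optimum V(G, D∗) is − log 4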
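The identity V(G, D∗) = 2 · D_JS − log 4 can be verified numerically. The sketch below assumes discrete distributions over three outcomes (an illustrative simplification: sums replace the integrals) with made-up probabilities.

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence between two discrete distributions."""
    return np.sum(p * np.log(p / q))

def js(p, q):
    """Jensen-Shannon divergence: average KL to the mixture (p + q) / 2."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def value_at_optimal_d(p_data, p_model):
    """V(G, D*) with D*(x) = p_data(x) / (p_data(x) + p_model(x))."""
    d_star = p_data / (p_data + p_model)
    return np.sum(p_data * np.log(d_star)) + np.sum(p_model * np.log(1.0 - d_star))

p_data = np.array([0.1, 0.4, 0.5])       # illustrative data distribution
p_model = np.array([0.3, 0.3, 0.4])      # illustrative model distribution
v = value_at_optimal_d(p_data, p_model)
v_at_optimum = value_at_optimal_d(p_data, p_data)
```

When the two distributions coincide, D∗ equals 0.5 everywhere and the value reduces to log(0.5) + log(0.5) = − log 4.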
9.5.4 Convergence of the training algorithm
Proposition 9.5.3 If the following two assumptions are satisfied:
1. G and D have enough capacity
2. at each update step, D is allowed to reach D = D∗, and $p_{model}$ is updated to improve
$$V(p_{model}, D^*) = E_{x \sim p_{data}}[\log D^*(x)] + E_{x \sim p_{model}}[\log(1 - D^*(x))] \;\propto\; \sup_D \int_x p_{model}(x) \log(1 - D(x))\,dx$$
then $p_{model}$ converges to $p_{data}$.

Proof
Observe that the argument of the supremum is convex in $p_{model}$.
Since the supremum preserves convexity, $V(p_{model}, D^*)$ is also convex in $p_{model}$, as it is proportional to the supremum.
Thus, our algorithm can reach the global optimum: sufficiently small improvement steps are enough.
Remark Observe that these assumptions are very strong. We are requiring that:
• The generator G and discriminator D have enough capacity
• The discriminator D reaches its optimum D∗ at every outer iteration
• We directly optimize $p_{model}$ instead of its parameters Θ
In practice:
• G and D have finite capacity
• D is optimized for only k steps
• Using a neural network to define G introduces critical points in the parameter space (the objective is no longer convex)
Thus, in practice $p_{model}$ does not necessarily converge to $p_{data}$ and may oscillate.

However, GANs still work well in practice.
9.6 Difficulties during training
There are mainly two difficulties which may occur during training
1. Mode collapse
2. Divergence of the generator
9.6.1 Mode collapse
When the generator finds a very likely sample, it starts producing only samples very similar to that one, cycling over a small set of output types.
This phenomenon is called mode collapse.
A widely used remedy for mode collapse is the Unrolled GAN.
The idea of the Unrolled GAN is to let the generator look ahead in the game and prepare for the discriminator's next moves. In particular, after the k updates of D, the generator is optimized once w.r.t. the state of D after the next k steps. This often discourages G from exploiting a local minimum.
Figure 9.3: Unrolled (top) vs vanilla (bottom) GAN
9.6.2 Issues with DJS
The dimensionality of many real-world datasets, as represented by $p_{data}$, only appears to be artificially high. The data have been found to concentrate on a lower-dimensional manifold. Thinking of real-world images, once the theme or the contained object is fixed, the images have to follow many restrictions: e.g., a dog should have two ears and a tail, and a skyscraper should have a straight and tall body. These restrictions keep images away from the possibility of having a high-dimensional free form.
Because both $p_{model}$ and $p_{data}$ lie on low-dimensional manifolds, they are almost certainly going to be disjoint. When they have disjoint supports, we can always find a perfect discriminator that separates real and fake samples 100% correctly.
In particular, when the discriminator is perfect we have D(x) = 1 for all $x \in Supp(p_{data})$ and D(x) = 0 for all $x \in Supp(p_{model})$. Therefore, the loss function falls to zero and we end up with no gradient to update the generator during learning iterations. Thus, the learning of the generator becomes very slow.
A solution to this problem is the Wasserstein GAN, which uses another measure of similarity between the two distributions, the Wasserstein distance. This measure takes into account the amount of work required to turn one distribution into the other and, in practice, it does a good job.
9.7 Comparison with VAE
VAE: blurred images

GANs: sharper images
9.8 Conditional GANs
You train your GAN model on images of cats and dogs. Now that you have a generator that
can produce images of animals, you would like to be able to control the properties of the image.
Some examples of properties could be the animal type or fur colour. Therefore, you would like to
introduce some measure of control over the output of the generator. Explain how you would extend
the basic GAN framework to introduce a measure of control and how the training would look like.
One way of doing this would be to introduce a class label into the input of the generator and discriminator, leading to the following modified objective:
$$\min_G \max_D V(G, D) = E_{x \sim p_{data}}[\log D(x|y)] + E_{z \sim p_z}[\log(1 - D(G(z|y)|y))]$$
where y is the corresponding class label (for example, corresponding to cats or dogs). This was introduced in Conditional Generative Adversarial Nets (Mirza et al., 2014). However, this requires the dataset to have the appropriate labels. The training would look exactly the same as in the original algorithm, with the exception of passing the labels. Therefore, if you want to produce an image of a certain class, you just need to pass the corresponding class label and the generator will output the corresponding image.
Part IV: Deep Learning For Computer Vision
10 Parametric Body models and Applications
10.1 2D human pose representation and estimation
10.2 Body modeling
10.3 Feature Representation Learning
10.4 Body modelling + Deep Representation Learning
10.6 Case Study: Learned-Gradient Descent
11 Neural Implicit Representations
11.1 Why we should be able to learn a representation of 3d shape
11.2 3D representations
11.3 Neural implicit representation
11.4 Implementation of Neural Implicit Representations
11.5 NEural Radiance Field
10. Parametric Body models and Applications
10.1.1 Introduction
2D human pose representation and estimation consists of two main fields of study that should be combined:
• Body modeling
• Feature representation Learning
We will first analyze them separately and then we will understand how to efficiently combine them.
10.2 Body modeling
Question: can we understand how the different parts of the body are linked to each other?
10.2.1 Pictorial Structure Model
We describe the human body model as a graph G = (V, E) where:
• V = {1, ..., k} represents the k parts of the human body
• E specifies which pairs of parts are constrained to have consistent relations
Given a 2D image I, we denote by $l_i = (x_i, y_i)$ the estimated position of vertex i. Thus, given an image I and a configuration estimate $L = (l_1, ..., l_k)$, we can define a score as follows:
$$S(I, L) = \sum_{i \in V} \alpha_i \cdot \phi(I, l_i) + \sum_{(i,j) \in E} \beta_{ij} \cdot \psi(l_i, l_j)$$
where:
• $\phi(I, l_i)$ is the unary term, a feature vector which provides information on the image at location $l_i$ (e.g. a patch extracted from the original image, possibly modified using convolutions etc.)
• $\psi(l_i, l_j)$ is the pairwise term between part i and part j, a spatial feature which depends on the location $l_i$ relative to $l_j$.
10.2.2 Pictorial Structure Model with Flexible Mixtures
It has been proved empirically that a mixture of non-oriented pictorial structures can outperform
explicitly articulated parts because mixture models can capture orientation-specific statistics of
background features.
Thus, we have to slightly modify the framework previously discussed introducing the concept of
mixture models. Let us call mi the type (mixture component) of part i.
The mixture component can express many concepts, such as the orientation of a part (e.g., a vertically versus horizontally oriented hand), but types may also span out-of-plane rotations (front-view head versus side-view head) or even semantic classes (an open versus closed hand).
Formally, the score becomes:
$$S(I, L, M) = \sum_{i \in V} \alpha_i^{m_i} \cdot \phi(I, l_i) + \sum_{(i,j) \in E} \beta_{ij}^{m_i m_j} \cdot \psi(l_i, l_j) + S(M)$$
where:
• $\alpha_i^{m_i}$ is the local appearance template for part i with type assignment $m_i$
• $\beta_{ij}^{m_i m_j}$ is the spatial spring parameter for the pair of types $(m_i, m_j)$. It expresses the likelihood of having template $m_i$ for part i and template $m_j$ for part j given the distance between $l_i$ and $l_j$
• S(M) is the co-occurrence bias, defined as
$$S(M) = \sum_{(i,j) \in E} b_{ij}^{m_i m_j}$$
where $b_{ij}^{m_i m_j}$ is the pairwise co-occurrence prior between part i with mixture type $m_i$ and part j with mixture type $m_j$; it favors particular co-occurrences of part types.
10.3 Feature Representation Learning
We will explore two ways of Feature Representation Learning
• Direct regression
• Heatmaps
10.3.1 Direct Regression
• Based on deep convolutional neural networks

• Directly regress x and y coordinates
• It involves a refiner
Figure 10.1: DeepPose: Human Pose Estimation via Deep Neural Networks
10.3.2 Heatmaps
For the heatmaps-based representation learning we refer to Convolutional Pose Machines.

The objective of this paper is to improve the performance of human pose estimation methods when occlusions are present.
The main idea of this paper is to create a separate heatmap (a Gaussian distribution around the keypoint) for each keypoint and to combine them only in the last phase. The key is that at every stage the architecture operates both on image evidence and on belief maps from preceding stages. In each stage, the computed beliefs provide an increasingly refined estimate for the location of each part.
Figure 10.2: First row: part detection heatmaps. Second row: output of CNN regression
This is the complete architecture
Figure 10.3: Architecture and receptive fields of CPMs. We show a convolutional

architecture and receptive fields across layers for a CPM with any T stages. The pose
machine is shown in insets (a) and (b), and the corresponding convolutional networks are
shown in insets (c) and (d). Insets (a) and (c) show the architecture that operates only on
image evidence in the first stage. Insets (b) and (d) shows the architecture for subsequent
stages, which operate both on image evidence as well as belief maps from preceding stages.
The architectures in (b) and (d) are repeated for all subsequent stages (2 to T). The network
is locally supervised after each stage using an intermediate loss layer that prevents vanishing
gradients during training. Below in inset (e) we show the effective receptive field on an
image (centered at left knee) of the architecture, where the large receptive field enables the
model to capture long-range spatial dependencies such as those between head and knees.
10.4 Body modelling + Deep Representation Learning
The idea is to use the predictions obtained using the heatmaps and then refine them using body modelling.
Another widely used technique is spatio-temporal inference, which uses the temporal continuity of the body to obtain better predictions.
As a case study we use the Thin-Slicing Network.
Figure 10.4: Complete pipeline
10.4.1 Graph models
In a single frame we have (given an image I and a pose estimate p)
$$S(I, p) = \sum_{i \in V} \phi_i(p_i | I) + \sum_{(i,j) \in E_s} \psi_{i,j}(p_i, p_j)$$
where $\psi_{i,j}(p_i, p_j) = w_{i,j} \cdot d(p_i, p_j)$, $d(p_i, p_j) = [\Delta x, \Delta x^2, \Delta y, \Delta y^2]$ and w encodes the rest location and rigidity between pairs.
For a slice window of T frames we have
$$S_{slice} = \sum_{t=1}^{T} S(I^t, p^t) + \sum_{(i,i^*) \in E_f} \psi_{i,i^*}(p_i, p'_{i^*})$$
where $p'_{i^*} = p_{i^*} + f_{i^*,i}(p_{i^*})$ and $f_{i^*,i}(p_{i^*})$ is the optical flow evaluated at $p_{i^*}$ (this is the flow warping process in which pixel-wise flow tracks are applied to align confidence values in neighboring frames to the target frame). The term $\psi_{i,i^*}(p_i, p'_{i^*})$ thus regularizes the temporal consistency of part i in neighboring frames.
10.4.2 Inference
Inference corresponds to maximizing Sslice over p for the image sequence slice.
When the relational graph G = (V, E) is tree-structured, exact belief propagation can be applied efficiently by one pass of dynamic programming in polynomial time.
However, loopy belief propagation algorithms such as the Max-Sum algorithm make approximate inference possible in otherwise intractable loopy models.
More precisely, in our case at each iteration a part i sends a message to its neighbors and also receives reciprocal messages along the edges in G:
$$score_i(p_i) \leftarrow \phi_i(p_i | I) + \sum_{k \in child(i)} m_{ki}(p_i)$$
where child(i) is defined as the set of children of part i. The local $score_i(p_i)$ is the sum of the unary term and the messages collected from all its children. The messages $m_{ki}(p_i)$ (the best score that can be achieved using position $p_i$ for vertex i while $p_k$ is free to change) sent from body part k to part i are given by:
$$m_{ki}(p_i) \leftarrow \max_{p_k} \left( score_k(p_k) + \psi_{k,i}(p_k, p_i) \right)$$
Using this process we eventually obtain the maximum of the sum over all the $\phi_i(p_i | I)$ and all the terms coming from the edges.
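For a tree-structured graph, the two update rules amount to a single bottom-up pass. The sketch below uses a hypothetical four-part tree with a handful of candidate locations per part and random scores standing in for the unary and pairwise terms, and checks the result against exhaustive enumeration.

```python
import itertools
import numpy as np

def max_sum(children, unary, pairwise, root=0):
    """Best total score of a tree-structured model via max-sum message passing."""
    def score(i):
        s = unary[i].copy()                       # phi_i(p_i | I)
        for k in children.get(i, []):
            # m_ki(p_i) = max over p_k of score_k(p_k) + psi_ki(p_k, p_i)
            s += np.max(score(k)[:, None] + pairwise[(k, i)], axis=0)
        return s
    return np.max(score(root))

rng = np.random.default_rng(0)
n = 4                                             # candidate locations per part
children = {0: [1, 2], 2: [3]}                    # hypothetical tree: parent -> children
unary = {i: rng.normal(size=n) for i in range(4)}
pairwise = {(k, i): rng.normal(size=(n, n))       # indexed as [p_k, p_i]
            for i, ks in children.items() for k in ks}

best = max_sum(children, unary, pairwise)

def brute_force():
    """Reference: enumerate all n**4 configurations."""
    totals = []
    for p in itertools.product(range(n), repeat=4):
        total = sum(unary[i][p[i]] for i in range(4))
        total += sum(psi[p[k], p[i]] for (k, i), psi in pairwise.items())
        totals.append(total)
    return max(totals)
```

The message-passing version visits each edge once, whereas the brute-force search grows exponentially with the number of parts.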
10.4.3 Training: Sub-gradient Descent

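$$\frac{\partial m_{ki}(p_i)}{\partial score_k(p_k)} = \begin{cases} 1 & \text{if } p_k = p^* \\ 0 & \text{otherwise} \end{cases}$$
$$\frac{\partial m_{ki}(p_i)}{\partial \psi_{k,i}(p_k, p_i)} = \begin{cases} 1 & \text{if } p_k = p^* \\ 0 & \text{otherwise} \end{cases}$$
$$\frac{\partial m_{ki}(p_i)}{\partial w_{ki}} = \frac{\partial m_{ki}(p_i)}{\partial \psi_{k,i}(p_k, p_i)} \, d(p_k, p_i)$$
where $p^*$ denotes the position $p_k$ that attains the maximum in the definition of $m_{ki}(p_i)$.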
10.5.1 SMPL representation: 3D Mesh
In order to represent the body in 3D we use a 3D mesh, which is designed by an artist and contains around 7000 vertices.
In order to define a body we need to define its shape and its pose.
10.5.2 Shape
In order to define the shape we perform PCA on meshes in canonical pose to estimate the directions of maximal shape variation. Doing that, we obtain a low-dimensional subspace (10D-300D) in canonical pose. Note that usually 10 dimensions are enough to define a shape.
Figure 10.5: PCA on body shapes

10.5.3 Pose
Linear blend skinning
Linear blend skinning is the simplest mesh skinning method. The idea is to transform the vertices of a single mesh by a blend of multiple transforms.
The deformed position of a point (vertex) is the sum of the positions determined by each bone’s transform alone, weighted by that vertex’s weight for that bone.
Figure 10.6: Linear Blend Skinning
In particular, for each vertex i, starting from a rest position $t_i$, its position $t'_i$ in the transformed pose is
$$t'_i = \sum_k w_{ki} \, G_k(\theta, J) \, t_i$$
• wki are the blend skinning weights, created by an artist

• Gk is a rigid bone transformation
• θ is the desired pose
• J are the joint locations
Thus, in this model posed vertices are linear combinations of transformed template vertices.
Pro: simple and fast to compute (widely used in video games)
Con: it produces well-known artifacts.
Figure 10.7: In linear blend skinning, in the presence of strong twists, the surface collapses
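The skinning formula and its collapse artifact can be reproduced in a few lines. The sketch below assumes two bones given directly as 4 × 4 homogeneous transforms (so the dependence of $G_k$ on θ and J is not modelled) and three copies of the same rest vertex with different, hand-picked skinning weights.

```python
import numpy as np

def rotation_z(angle):
    """Homogeneous 4x4 rotation about the z-axis."""
    c, s = np.cos(angle), np.sin(angle)
    G = np.eye(4)
    G[:2, :2] = [[c, -s], [s, c]]
    return G

def linear_blend_skinning(vertices, weights, transforms):
    """vertices: (N, 3) rest positions t_i; weights: (N, K) w_ki; transforms: (K, 4, 4) G_k."""
    homo = np.concatenate([vertices, np.ones((len(vertices), 1))], axis=1)   # (N, 4)
    per_bone = np.einsum('kab,nb->nka', transforms, homo)                    # G_k t_i
    blended = np.einsum('nk,nka->na', weights, per_bone)                     # sum_k w_ki G_k t_i
    return blended[:, :3]

vertices = np.array([[1.0, 0.0, 0.0]] * 3)
weights = np.array([[1.0, 0.0],      # vertex bound to bone 0 only
                    [0.0, 1.0],      # vertex bound to bone 1 only
                    [0.5, 0.5]])     # vertex shared equally by both bones
transforms = np.stack([np.eye(4), rotation_z(np.pi)])   # bone 1 rotated by 180 degrees
posed = linear_blend_skinning(vertices, weights, transforms)
```

The shared vertex ends up at the origin: averaging the two transformed positions, instead of interpolating the rotation itself, makes the surface shrink towards the joint.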
SMPL
A solution to this problem is SMPL, where $t'_i$ is computed as
$$t'_i = \sum_k w_{ki} \, G_k(\theta, J(\beta)) \, (t_i + s_i(\beta) + p_i(\theta))$$
• $s_i(\beta)$: vertex i in $B_S(\beta)$, which represents the offset from the template depending on the shape described by β
• $p_i(\theta)$: vertex i in $B_P(\theta)$, which represents the offset from the template depending on the pose described by θ
Shape blend shapes $B_S$
The body shapes of different people are represented by a linear function $B_S$
$$B_S(\beta; S) = \sum_{n=1}^{|\beta|} \beta_n S_n$$
• $\beta = [\beta_1, ..., \beta_{|\beta|}]^\top$: linear shape coefficients
• $S_n \in \mathbb{R}^{3N}$: orthonormal principal components of shape displacements
• $S = [S_1, ..., S_{|\beta|}]$: matrix of all shape displacements, learned from registered training meshes
Notationally, the values to the right of a semicolon represent learned parameters, while those on the left are parameters set by an animator.
Pose blend shapes BP
We denote by $R : \mathbb{R}^{|\theta|} \to \mathbb{R}^{9K}$ a function that maps a pose vector θ to a vector of concatenated part-relative rotation matrices (each rotation matrix has dimensions 3 × 3).
Given that our rig has 23 joints, we have K = 23 and thus R(θ) is a vector of length 23 × 9 = 207.
Elements of R(θ) are functions of sines and cosines of joint angles and therefore R(θ) is non-linear in θ.
If we define θ∗ as the rest pose, then the vertex deviations from the rest template are
$$B_P(\theta; P) = \sum_{n=1}^{9K} \left( R_n(\theta) - R_n(\theta^*) \right) P_n$$
where $P_n \in \mathbb{R}^{3N}$ are vectors of vertex displacements. Thus, $P = [P_1, ..., P_{9K}] \in \mathbb{R}^{3N \times 9K}$ is the matrix of all 207 pose blend shapes.
As a consequence of this formula, the rotation of a particular joint can influence all the body vertices, not only the local ones.
Note that subtracting the rest pose rotation vector, R(θ∗), guarantees that the contribution of the pose blend shapes is zero in the rest pose, which is important for animation.
Summing up, there are 9 coefficients that describe the rotation of each joint; for example, $R_1(\theta) - R_1(\theta^*)$ describes the part of the rotation of joint 1 w.r.t. the rest pose that is captured by the first rotation coefficient.
SMPL summary
1. Define a base template mesh

2. Capture raw training scans
3. Register a template mesh to them
4. Define shape in a canonical pose
5. Factor body shape variation from pose
6. Learn pose-dependent deformations (e.g. the fat distribution is different if one stands or is
upside down)
7. Pose the mesh using linear blend skinning (LBS)
8. Learn everything including the blend weights
As a result we obtain
• a mesh M(β, θ, ϕ): depends on β (shape), θ (pose), ϕ (gender)

• the joints positions J(β; J , T̄ , S)
10.6 Case Study: Learned-Gradient Descent
The next part is based on Song et al., "Human Body Model Fitting by Learned Gradient Descent".
The novel idea is to train a neural network to do the optimization step.
The algorithm used is the following
In practice we sample from the training set and we try to reconstruct the ground truth starting
with all the weights set to 0.

11. Neural Implicit Representations
11.1 Why we should be able to learn a representation of 3d shape
As we have seen before, the universal approximation theorem ensures that neural networks are able to approximate any continuous function. Thus, since a 3D shape can be described by a continuous function, we should not be surprised that neural networks are able to learn it.
11.2 3D representations
• Voxels:
– Voxels are the 3D counterpart of pixels: a discretization of 3D space into a grid
– Cons: memory grows cubically, O(n³), thus the resolution is limited
• Points:
– Discretization of the surface into 3D points
– Cons: it does not model connectivity / topology
• Meshes:
– Discretization into vertices and faces
– Requires either
∗ Class-specific templates, or
∗ A maximum number of vertices fixed in advance
– Cons: there will always be an approximation error, and they can lead to self-intersections
• Implicit functions:
– Learn the analytic function which represents the 3D surface
– Pro: no approximation error + smooth and continuous surface
11.3 Neural implicit representation
We now discuss neural implicit representations.

The idea is to represent a surface as a level set of a continuous function.
For example, a sphere of radius r is the zero level set of
$$f(x) = x_1^2 + x_2^2 + x_3^2 - r^2, \qquad S = \{x \mid f(x) = 0\}$$
There are two kinds of functions that can do this job:
• Occupancy Networks: $f_\theta : \mathbb{R}^3 \times \mathcal{X} \to [0, 1]$, outputs the probability of being inside the surface
• DeepSDF (Signed Distance Field): $f_\theta : \mathbb{R}^3 \times \mathcal{X} \to \mathbb{R}$, outputs the signed distance from the surface (negative if inside, positive if outside)
Pros:
• The representation is continuous

• We obtain an arbitrary topology and resolution
• Low memory footprint
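As an illustration of the two output conventions, the sketch below writes down the exact signed distance and a hard occupancy for a sphere. These analytic functions are stand-ins for what a trained network $f_\theta$ would approximate; the conditioning input is omitted.

```python
import numpy as np

def sphere_sdf(x, r=1.0):
    """Signed distance to a sphere of radius r: negative inside, positive outside."""
    return np.linalg.norm(x, axis=-1) - r

def sphere_occupancy(x, r=1.0):
    """Hard occupancy: 1 inside the sphere, 0 on or outside it."""
    return (sphere_sdf(x, r) < 0).astype(float)

points = np.array([[0.0, 0.0, 0.0],     # center of the sphere
                   [1.0, 0.0, 0.0],     # on the surface
                   [0.0, 2.0, 0.0]])    # outside
sdf = sphere_sdf(points)
occ = sphere_occupancy(points)
```

Both functions can be queried at any point of space, which is why the representation has no fixed resolution: the surface is only discretized when it is extracted.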
11.4 Implementation of Neural Implicit Representations
The implicit function fθ is parameterized as an MLP.

We condition on the type of shape we want to obtain via input concatenation.
The output can be either an occupancy probability or a signed distance. To obtain an explicit representation, we then use marching cubes.
Figure 11.1: Conditioning on different inputs
Now our question is:
How can we learn the function f? What should we use as ground truth?
In general, we can choose one of these three representations as ground truth:
1. Watertight Meshes
2. Point Cloud
3. 2D Images
11.4.1 Watertight meshes
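This is the simplest case (watertight meshes have no holes, thus space is divided into inside and outside): we uniformly sample points in the volume, label each one as inside or outside the surface, and train the model using binary cross-entropy.
$$L(\theta, \psi) = \sum_{j=1}^{K} BCE\left(f_\theta(p_{ij}, z_i), o_{ij}\right)$$
where $p_{ij}$ is the j-th point sampled for shape i, $o_{ij}$ its ground-truth occupancy and $z_i$ the code on which the network is conditioned.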
11.4.2 Point clouds
Why do we sometimes use this representation?
• Many 3D sensors output unordered point clouds
• Generally, they are cheaper to obtain than watertight meshes
Input: point cloud $\mathcal{X} = \{x_i\}_{i \in I} \subset \mathbb{R}^3$

In this case the implicit function $f_\theta$ represents the signed distance function to a plausible surface M defined by $\mathcal{X}$.
We use the signed distance function since learning only from points would be hard.
Figure 11.2: Learning from points vs learning from a distance function
The loss we wish to minimize is
$$L(\theta) = \underbrace{\sum_{i \in I} |f_\theta(x_i)|^2}_{\text{Vanish term}} + \lambda \underbrace{E_x \left( \|\nabla_x f_\theta(x)\| - 1 \right)^2}_{\text{Eikonal term}}$$
• Vanish term: we want $f_\theta$ to vanish at the training points
• Eikonal term: we want the norm of the spatial gradient to be 1, so that $f_\theta$ can be interpreted as a signed distance function (we do not want sudden changes in the norm of the gradient); this encourages smoothness
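The role of the two terms can be seen on a toy example. The sketch below evaluates the loss for two candidate functions with the same zero level set (the unit sphere): the true signed distance and a rescaled version of it. Gradients are estimated by finite differences, the expectation is replaced by a mean over uniformly sampled points, and λ = 0.1 is an arbitrary illustrative value.

```python
import numpy as np

def grad_norm(f, x, eps=1e-5):
    """Central finite-difference estimate of ||grad_x f(x)|| for points x of shape (N, 3)."""
    g = np.stack([(f(x + eps * e) - f(x - eps * e)) / (2 * eps) for e in np.eye(3)], axis=1)
    return np.linalg.norm(g, axis=1)

def point_cloud_loss(f, surface_points, space_points, lam=0.1):
    vanish = np.sum(np.abs(f(surface_points)) ** 2)                # f should be 0 on the points
    eikonal = np.mean((grad_norm(f, space_points) - 1.0) ** 2)     # ||grad f|| should be 1
    return vanish + lam * eikonal

sdf = lambda x: np.linalg.norm(x, axis=-1) - 1.0     # true SDF of the unit sphere
scaled = lambda x: 3.0 * sdf(x)                      # same surface, gradient norm 3

rng = np.random.default_rng(0)
surface = rng.normal(size=(64, 3))
surface /= np.linalg.norm(surface, axis=1, keepdims=True)   # points on the unit sphere
space = rng.uniform(-2, 2, size=(256, 3))                   # points where the Eikonal term is evaluated

loss_true = point_cloud_loss(sdf, surface, space)
loss_scaled = point_cloud_loss(scaled, surface, space)
```

Both candidates make the vanish term zero; only the Eikonal term distinguishes the true distance function from the rescaled one.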
Convergence and linear reproduction

Theorem 11.4.1 Gradient descent of the linear model with random initialization converges with
probability 1 to the reproducing plane
11.4.3 2d images
Now all we get are 2D images (no more 3D supervision).

In order to learn from them we need to render our 3D representation into images in a differentiable way.
Let’s see how it can be done.
Differentiable Volumetric Rendering
Our goal is to learn $f_\theta$ (occupancy function) and $t_\theta$ (texture) from 2D image observations. Consider a single image observation. We define a photometric reconstruction loss
$$L(\hat{I}, I) = \sum_u \|\hat{I}_u - I_u\|$$
where I is the observed image (ground truth) and $\hat{I}$ is the image rendered by our implicit model. Moreover, $I_u$ denotes the RGB value of the observation I at pixel u and ∥·∥ is a (robust) photo-consistency measure such as the $\ell_1$-norm.
To minimize the reconstruction loss L w.r.t. the network parameters θ using gradient-based optimization techniques, we must be able to
• Render $\hat{I}$ given $f_\theta$ and $t_\theta$ ($f_\theta = \tau$ if the point is on the surface, $f_\theta > \tau$ if the point is behind it and $f_\theta < \tau$ if it is outside)
• Compute gradients of L w.r.t. the network parameters θ
To obtain the rendering we follow this procedure.

Given $r_0$, the position of the camera, for each pixel u:
1. Draw w, a vector connecting $r_0$ to u
2. Consider the ray $r(d) = r_0 + d\,w$
3. Call $\hat{p}$ the first point of intersection of the ray with the estimated surface (where $f_\theta(\hat{p}) = \tau$, found with the secant method). Call $\hat{d}$ the distance of this point from $r_0$; in particular, $r(\hat{d}) = \hat{p}$
4. Query the texture network and obtain $t_\theta(\hat{p})$
5. Color the pixel u with color $t_\theta(\hat{p})$, i.e. $\hat{I}_u = t_\theta(\hat{p})$
Forward pass
We are given $r_0$, the position of the camera in the image we are analyzing.
1. For every pixel we query the occupancy network along the corresponding ray, which gives us a value:
• $f_\theta < \tau$: outside the surface
• $f_\theta = \tau$: on the surface
• $f_\theta > \tau$: behind the surface
2. For the points $\hat{p}$ with $f_\theta = \tau$ we evaluate the texture field $t_\theta(\hat{p})$
3. We assign the color $t_\theta(\hat{p})$ to pixel u
Secant method
In order to find the points which lie on the surface, we use the secant method.
The idea is the following:
1. Start from two points $x_0$, $x_1$ at which the function has opposite signs, and connect the corresponding function values with a straight line
2. Find the intersection of this line with the x-axis; call this point $x_2$
3. Repeat until convergence, each time using the pair of points with opposite signs
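The three steps can be sketched as a root search along a ray. The example assumes a hypothetical smooth occupancy function for a unit sphere (a sigmoid of the distance to the surface), a threshold τ = 0.5, and a hand-picked camera position and direction; the function searched for a root is g(d) = f(r(d)) − τ.

```python
import numpy as np

def secant_root(g, d0, d1, tol=1e-10, max_iter=100):
    """Find d with g(d) = 0, keeping a pair (d0, d1) on which g has opposite signs."""
    g0, g1 = g(d0), g(d1)
    assert g0 * g1 < 0, "the two starting points must have opposite signs"
    d2 = d1
    for _ in range(max_iter):
        d2 = d1 - g1 * (d1 - d0) / (g1 - g0)    # intersection of the secant line with the axis
        g2 = g(d2)
        if abs(g2) < tol:
            break
        if g0 * g2 < 0:                         # keep the pair with opposite signs
            d1, g1 = d2, g2
        else:
            d0, g0 = d2, g2
    return d2

def occupancy(p):
    """Hypothetical occupancy of a unit sphere: > 0.5 inside, < 0.5 outside."""
    return 1.0 / (1.0 + np.exp(-5.0 * (1.0 - np.linalg.norm(p))))

tau = 0.5
r0 = np.array([0.0, 0.0, -3.0])                 # camera position (illustrative)
w = np.array([0.0, 0.0, 1.0])                   # ray direction (illustrative)
d_hat = secant_root(lambda d: occupancy(r0 + d * w) - tau, 0.0, 3.0)
p_hat = r0 + d_hat * w                          # first intersection with the surface
```

The starting pair d = 0 (outside) and d = 3 (at the center of the sphere) brackets the first intersection, which lies at depth 2 along this ray.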
Backward pass
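Let us call I the real image and Î the predicted one. As we have said before, we define the loss as
$$ L(\hat{I}, I) = \sum_{u} \left\| \hat{I}_u - I_u \right\|. $$
The gradient of the loss w.r.t. our parameters will be
$$ \frac{\partial L}{\partial \theta} = \sum_{u} \frac{\partial L}{\partial \hat{I}_u} \cdot \frac{\partial \hat{I}_u}{\partial \theta} $$
where
$$ \frac{\partial \hat{I}_u}{\partial \theta} = \frac{\partial t_\theta(\hat{p})}{\partial \theta} + \frac{\partial t_\theta(\hat{p})}{\partial \hat{p}} \cdot \frac{\partial \hat{p}}{\partial \theta} $$
In order to evaluate ∂p̂/∂θ we need implicit differentiation.
Consider the ray p̂ = r0 + d̂w and the condition for the intersection between the ray and the surface
(remember that we evaluate the color only for the points on the surface), and take the
derivative on both sides:
$$
\begin{aligned}
f_\theta(\hat{p}) &= \tau \\
\frac{\partial f_\theta(\hat{p})}{\partial \theta} + \frac{\partial f_\theta(\hat{p})}{\partial \hat{p}} \cdot \frac{\partial \hat{p}}{\partial \theta} &= 0 && \text{($\tau$ is a constant)} \\
\frac{\partial f_\theta(\hat{p})}{\partial \theta} + \frac{\partial f_\theta(\hat{p})}{\partial \hat{p}} \cdot w \, \frac{\partial \hat{d}}{\partial \theta} &= 0 && \text{($\hat{p} = r_0 + \hat{d} w$ and $r_0$ is constant)} \\
\frac{\partial \hat{d}}{\partial \theta} &= - \left( \frac{\partial f_\theta(\hat{p})}{\partial \hat{p}} \cdot w \right)^{-1} \frac{\partial f_\theta(\hat{p})}{\partial \theta} && \text{(we obtain an expression for $\partial \hat{d} / \partial \theta$)}
\end{aligned}
$$
As p̂ = r0 + d̂w, we have that
$$ \frac{\partial \hat{p}}{\partial \theta} = w \, \frac{\partial \hat{d}}{\partial \theta} = - w \left( \frac{\partial f_\theta(\hat{p})}{\partial \hat{p}} \cdot w \right)^{-1} \frac{\partial f_\theta(\hat{p})}{\partial \theta} $$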
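The implicit-differentiation formula can be checked numerically on a case where the answer is known. The sketch below is an illustration under assumptions, not the method of the original paper: the occupancy field is a toy sphere whose radius is the single parameter θ, the derivatives of fθ are approximated by central finite differences, and all function names are hypothetical. For a ray aimed at the centre of the sphere from distance 3, the surface is hit at d̂ = 3 − θ, so ∂d̂/∂θ should equal −1.

```python
import math

K = 10.0  # sharpness of the toy occupancy field (assumption)


def f(theta, p):
    """Toy occupancy of a sphere with radius theta: 0.5 exactly on the surface."""
    norm = math.sqrt(sum(c * c for c in p))
    return 1.0 / (1.0 + math.exp(-K * (theta - norm)))


def df_dtheta(theta, p, eps=1e-6):
    """Central finite difference of f with respect to the parameter theta."""
    return (f(theta + eps, p) - f(theta - eps, p)) / (2 * eps)


def df_dp(theta, p, eps=1e-6):
    """Central finite difference of f with respect to each coordinate of p."""
    grad = []
    for i in range(3):
        hi, lo = list(p), list(p)
        hi[i] += eps
        lo[i] -= eps
        grad.append((f(theta, hi) - f(theta, lo)) / (2 * eps))
    return grad


def dd_dtheta(theta, p_hat, w):
    """Implicit differentiation: d(d_hat)/d(theta) = -(df/dp . w)^(-1) * df/dtheta."""
    dot = sum(gi * wi for gi, wi in zip(df_dp(theta, p_hat), w))
    return -df_dtheta(theta, p_hat) / dot


theta = 1.0
r0 = (0.0, 0.0, -3.0)
w = (0.0, 0.0, 1.0)
d_hat = 3.0 - theta                                   # known intersection distance
p_hat = tuple(r + d_hat * wi for r, wi in zip(r0, w))  # point on the surface
grad_d = dd_dtheta(theta, p_hat, w)                   # should be close to -1
```

Growing the sphere by a small amount moves the intersection point towards the camera by the same amount, which is why the derivative is −1 here.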
11.5 Neural Radiance Fields (NeRF)
So far we have learnt how to represent surfaces, but in some cases this is not enough because
scenes are more complex.
In particular, we also have to model:
• Thin Structures (e.g. leaves, hair)

• Transparency (e.g. glasses, smoke)
11.5.1 Architecture
So far we were interested in a single output, the RGB value of a pixel. The novelty of NeRF
is the introduction of a density σ, which lets us model the difficult structures mentioned
above.
In particular, the network takes as input:
• x, y, z: the 3D position of the point we are considering

• θ, ϕ: the viewing direction
and outputs σ, the density of the point, and c, the RGB value of the point.
More formally the architecture they have proposed is the following (green represents input, blue
layers of the network and red outputs):
Some observations:
• The viewing direction (θ, ϕ) is given to the network only in later layers, after σ has been
predicted, so that σ depends only on x, y, z and not on θ, ϕ
• After some layers the position x, y, z is fed to the network again, to make sure it has not
been washed out
11.5.2 Procedure
We want now to analyze how we can obtain the volume rendering.

The first step, as before, is to draw a ray connecting the camera position to the point we want to
represent.
However, we now sample points along the whole ray, without stopping at the first
intersection with the surface. For each sample i we consider:
• The density σi

• The transmittance $T_i = \prod_{j=1}^{i-1} (1 - \alpha_j)$
To get the color we then apply alpha compositing, where
• δi = ti+1 − ti is the distance between adjacent samples
• $\alpha_i = 1 - e^{-\sigma_i \delta_i}$
The final color is then computed as a weighted sum of the colors along the ray:
$$ c = \sum_{i=1}^{N} T_i \, \alpha_i \, c_i $$
Since sampling is very expensive, one trick is to sample more densely at the most significant
positions (i.e. positions with high weights Ti αi).
Figure 11.3: Sampling (the points in the ray) frequency is higher where Ti αi is higher
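The compositing rule can be written out directly. The following is a minimal sketch, assuming the densities and colors at the samples are already given (in NeRF they would come from the network); the function name `composite` and the example values are hypothetical.

```python
import math


def composite(ts, sigmas, colors):
    """Alpha-composite samples along one ray.

    ts has one more entry than sigmas/colors: sample i covers [ts[i], ts[i + 1]].
    Returns the pixel color and the per-sample weights T_i * alpha_i.
    """
    transmittance = 1.0          # T_1 = 1: nothing has absorbed light yet
    pixel = [0.0, 0.0, 0.0]
    weights = []
    for i, (sigma, color) in enumerate(zip(sigmas, colors)):
        delta = ts[i + 1] - ts[i]
        alpha = 1.0 - math.exp(-sigma * delta)
        weight = transmittance * alpha
        weights.append(weight)
        for ch in range(3):
            pixel[ch] += weight * color[ch]
        # Light that survives this sample is attenuated by (1 - alpha).
        transmittance *= 1.0 - alpha
    return pixel, weights


# Empty space first, then a dense red sample that hides the blue sample behind it.
ts = [0.0, 1.0, 2.0, 3.0]
sigmas = [0.0, 50.0, 50.0]
colors = [(0.0, 1.0, 0.0), (1.0, 0.0, 0.0), (0.0, 0.0, 1.0)]
pixel, weights = composite(ts, sigmas, colors)
```

In this example the first sample has zero density and contributes nothing, the second is almost opaque and dominates the pixel, and the third is almost fully occluded. The weights are also the quantities that guide where to place more samples.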
11.5.3 Comparison with implicit surfaces
Pros: it can model transparency and thin structures, and is therefore a more flexible representation
Cons: it generally leads to worse geometry than implicit surfaces
11.5.4 Positional Encoding
Despite the fact that neural networks are universal function approximators (14), the NeRF authors
found that having the network FΘ operate directly on the x, y, z, θ, ϕ input coordinates results in
renderings that perform poorly at representing high-frequency variation in color and geometry. This
happens because neural networks are biased towards learning lower-frequency functions.
The solution proposed in the paper is to introduce a positional encoding, mapping each input to
a higher-dimensional space $\mathbb{R}^{2L}$ before applying the MLP. Formally, the encoding
function is
$$ \gamma(p) = \left( \sin(2^0 \pi p), \cos(2^0 \pi p), \ldots, \sin(2^{L-1} \pi p), \cos(2^{L-1} \pi p) \right) $$

Note that the function γ(·) is applied separately to each of the three coordinate values in x (which
are normalized to lie in [−1, 1]) and to the three components of the Cartesian viewing direction
unit vector d (which by construction lie in [−1, 1]). In their experiments, the authors set L = 10
for γ(x) and L = 4 for γ(d).
A similar mapping is used in the popular Transformer architecture, where it is also referred to as a
positional encoding. However, Transformers use it for a different goal: providing the discrete
positions of tokens in a sequence as input to an architecture that does not contain any notion of
order. In contrast, NeRF uses these functions to map continuous input coordinates into a
higher-dimensional space so that the MLP can more easily approximate a higher-frequency function.
Intuitively, the positional encoding represents each input by a set of Fourier features (sines and
cosines at increasing frequencies).
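The encoding γ is simple enough to write down explicitly. The sketch below is illustrative only; the function names `positional_encoding` and `encode_point` are hypothetical, and the default L = 10 follows the value quoted above for positions.

```python
import math


def positional_encoding(p, L):
    """gamma(p): sin and cos of p at frequencies 2^0 * pi, ..., 2^(L-1) * pi."""
    out = []
    for k in range(L):
        freq = (2.0 ** k) * math.pi
        out.append(math.sin(freq * p))
        out.append(math.cos(freq * p))
    return out


def encode_point(x, L=10):
    """Apply gamma separately to each coordinate of x and concatenate the results."""
    return [v for coord in x for v in positional_encoding(coord, L)]


features = encode_point((0.1, -0.4, 0.7))  # 3 coordinates * 2L values each
```

With L = 10 a 3D position becomes a vector of 60 values, which is what the MLP receives instead of the raw coordinates.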
11.5.5 Limits of NeRF
• Requires many (50+) calibrated views

• Slow rendering speed for high-res images

• Only models static scenes
V Part Five: Deep Reinforcement Learning
12 Reinforcement Learning
12.1 Motivations
12.2 RL problem statement
12.3 Major Components of an RL Agent
12.4 Taxonomy of RL agents
12.5 Markov Decision Processes
12.6 Dynamic Programming
12.7 Monte Carlo methods
12.8 Temporal Difference Learning
12.9 Deep Reinforcement Learning
12. Reinforcement Learning
12.1 Motivations
In Reinforcement Learning, models learn how to act by interacting with the environment through
actions.
This can be useful in many fields, such as games, logistics and operations, and robot
control/computer vision.
12.2 RL problem statement
Reinforcement Learning is a problem, not a method. Given an unknown and uncertain environment,
it aims to choose the right actions in order to maximize the reward signal in the long-term.
12.3 Major Components of an RL Agent
An RL agent may include one or more of these components:
• Policy: agent’s behaviour function

• Value function: how good is each state and/or action
• Model: agent’s representation of the environment
12.3.1 Policy
A policy expresses the agent’s behaviour: it is a map from states to actions.

It can be:
• Deterministic: a = π(s), it returns the precise action to take given a state s ∈ S

• Stochastic: π(a|s) = P[At = a | St = s], it returns the probability of taking action a in state s
12.3.2 Value function
The value function is a prediction of the expected future reward.

It is used to evaluate the goodness/badness of states. It is the basis the agent uses to decide the
next action.
In particular, we write
$$ v_\pi(s) = \mathbb{E}_\pi\left[ R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots \mid S_t = s \right] $$
Note that the value function depends on the policy π, i.e. on the way we are behaving.
The discount factor γ ∈ [0, 1] is introduced because we are usually more interested in immediate
rewards than in future ones.
12.3.3 Model
A model predicts what the environment will do next.

In particular:
• $\mathcal{P}^a_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s, A_t = a]$ predicts the probability of the next state given a state
and an action
• $\mathcal{R}^a_s = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$ predicts the next immediate reward given a state and an action
12.4 Taxonomy of RL agents
• Value Based
• Policy Based
• Actor Critic: combination of value and policy based
12.4.1 Value Based
The agent has access to the value function.

Given a value function, we can derive a greedy policy by choosing, in each state, the action that
maximizes the value.
12.4.2 Policy Based
The agent only has access to a policy and tries to adjust this policy directly in order to get the
highest possible reward.
12.4.3 Model-free and model-based agents
• Model-free: directly optimizes the value function and/or the policy

• Model-based: first builds a model of how the environment works and then finds the
optimal way to behave
12.5 Markov Decision Processes
Markov decision processes formally describe an environment for reinforcement learning where the
environment is fully observable.
12.5.1 Markov property
The future is independent of the past given the present
Definition 12.5.1 A state St is Markov if and only if
P[St+1 |St ] = P[St+1 |S1 , . . . , St ]
As a consequence:
• The state captures all relevant information from the history

• Once the state is known, the history may be thrown away
12.5.2 Markov Process

Definition 12.5.2 A Markov Process (or Markov Chain) is a tuple ⟨S, P⟩
• S is a (finite) set of states

• P is a state transition probability matrix, where
Pss′ = P[St+1 = s′ |St = s]
12.5.3 Markov Reward Process

Definition 12.5.3 A Markov Reward Process is a tuple ⟨S, P, R, γ⟩
• R is a reward function
Rs = E[Rt+1 |St = s]
• γ ∈ [0, 1] is a discount factor
In particular:
• γ close to 0 leads to ”myopic” evaluation (values only immediate reward)

• γ close to 1 leads to ”far-sighted” evaluation (values delayed reward almost as much as immediate reward)
12.5.4 Return
Definition 12.5.4 The return Gt is the total discounted reward from time-step t.
$$ G_t = R_{t+1} + \gamma R_{t+2} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} $$
Note that the value of receiving reward R after k + 1 time-steps is $\gamma^k R$.
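As a small worked example of the definition, the return of a finite episode can be accumulated from the last reward backwards. This is a sketch with a hypothetical helper name, `discounted_return`, and made-up reward sequences.

```python
def discounted_return(rewards, gamma):
    """Return G_t for rewards = [R_{t+1}, R_{t+2}, ...] of a finite episode."""
    g = 0.0
    # Work backwards so that each step applies G_t = R_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


g_a = discounted_return([1.0, 1.0, 1.0], gamma=0.5)  # 1 + 0.5 + 0.25
g_b = discounted_return([0.0, 0.0, 8.0], gamma=0.5)  # reward after 3 steps: 0.5**2 * 8
```

The second call illustrates the note above: a reward of 8 received after k + 1 = 3 time-steps is worth 0.25 · 8 = 2 at time t.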
12.5.5 Value function
The value function v(s) gives the long-term value of state s

Definition 12.5.5 The state value function v(s) of an MRP is the expected return starting from
state s
v(s) = E[Gt |St = s]
12.5.6 Markov Decision Process

Definition 12.5.6 A Markov Decision Process is a tuple ⟨S, A, P, R, γ⟩ where A is a finite set of
actions.
12.5.7 Bellman equation to compute the return
The value function can be decomposed into two parts:
• immediate reward Rt+1

• discounted value of successor state γGt+1
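$$
\begin{aligned}
v_\pi(s) &= \mathbb{E}_\pi[G_t \mid S_t = s] \\
&= \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s] && \text{(recursive expression of } G_t\text{)} \\
&= \sum_{a} \pi(a \mid s) \underbrace{\sum_{s'} \sum_{r} p(s', r \mid s, a) \Big[ r + \gamma \, \mathbb{E}_\pi[G_{t+1} \mid S_{t+1} = s'] \Big]}_{\text{weighted sum of rewards given current state and action}} \\
&= \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \Big[ r + \gamma \, v_\pi(s') \Big] && \text{(recursive formulation)}
\end{aligned}
$$
where the outer sum ranges over all actions a available in the current state s.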
12.5.8 Action-value function
The action-value function qπ (s, a) is the expected return starting from state s, taking action a, and
then following policy π
$$
\begin{aligned}
q_\pi(s, a) &= \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a] \\
&= \mathbb{E}_\pi[R_{t+1} + \gamma \, q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a]
\end{aligned}
$$
where At+1 is chosen by the policy π.
12.5.9 Bellman Optimality Equation
The optimal state-value function v∗ (s) is the maximum value function over all policies
$$ v_*(s) = \max_{\pi} v_\pi(s) = \max_{a} q_*(s, a) = \max_{a} \sum_{s', r} p(s', r \mid s, a) \big( r + \gamma \, v_*(s') \big) $$
Observe that this equation:
• is not linear
• has no closed form solution
• can be solved using many iterative solution methods, such as:
– Dynamic Programming (DP)
– Monte-Carlo Methods
– Temporal-Difference Learning (combination of DP and MC)
12.6 Dynamic Programming
Dynamic Programming (DP) can compute optimal policies given a perfect model of the world (an MDP).
Thus, in order to apply DP we need to know the transition probabilities. This limits its practical
utility, but DP remains of great theoretical importance.
There are two key ideas to compute the optimal policy:
• Value iteration
1. Compute the optimal v∗ using the value iteration algorithm
2. Extract a policy π that attains v∗
• Policy iteration
1. For the current policy π compute vπ
2. Update the policy π given vπ and obtain π′
3. Iterate until π ∼ π′
12.6.1 Value iteration
The algorithm for the value iteration is the following
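A minimal sketch of value iteration, obtained by repeatedly applying the Bellman optimality equation as an update. The MDP format is an assumption made for this example: `P[s][a]` is a hypothetical dictionary holding a list of `(probability, next_state, reward)` triples, and a state with no actions is treated as terminal. The toy MDP at the bottom is invented for illustration.

```python
def action_value(P, v, s, a, gamma):
    """Expected one-step return of action a in state s under the value estimate v."""
    return sum(prob * (r + gamma * v[s_next]) for prob, s_next, r in P[s][a])


def value_iteration(P, gamma=0.9, tol=1e-10):
    """Return (v, policy) for a small tabular MDP given as P[s][a] = [(prob, s', r), ...]."""
    v = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            if not P[s]:
                continue  # terminal state: value stays 0
            best = max(action_value(P, v, s, a, gamma) for a in P[s])
            delta = max(delta, abs(best - v[s]))
            v[s] = best
        if delta < tol:
            break
    # Step 2: extract the greedy policy with respect to the converged values.
    policy = {
        s: max(P[s], key=lambda a: action_value(P, v, s, a, gamma))
        for s in P if P[s]
    }
    return v, policy


# Toy MDP (assumption): from A we can stay or go to B; from B we can exit with reward 1.
P = {
    "A": {"stay": [(1.0, "A", 0.0)], "go": [(1.0, "B", 0.0)]},
    "B": {"exit": [(1.0, "T", 1.0)], "back": [(1.0, "A", 0.0)]},
    "T": {},
}
v, policy = value_iteration(P, gamma=0.9)
```

Here the optimal behaviour is to go from A to B and then exit, so v(B) = 1 and v(A) = γ · 1 = 0.9. Note that every sweep visits every state and needs the transition probabilities, which is exactly the limitation listed among the cons.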
Pros:
• Exact methods
• Policy/Value iteration are guaranteed to converge in a finite number of iterations
• Value iteration is typically more efficient than policy iteration
Cons:
• Need to know the transition probability matrix
• Need to iterate over the whole state space (very expensive)
• Requires memory proportional to the size of the state space
12.7 Monte Carlo methods
Now the questions are:
What can we do when the state space is too big to iterate over?
How can we estimate the value of those states?
What if we don’t know the transition probabilities?
An idea to solve these problems is to use Monte Carlo methods. Monte Carlo policy evaluation uses
the empirical mean return instead of the expected return:
$$ G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} $$
$$ v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s] \approx \frac{1}{N} \sum_{i=1}^{N} G_t^{(i)}(s) $$
where $G_t^{(i)}(s)$ is the return observed from state s in the i-th sampled episode.
The problem with the Monte Carlo estimate is that, in order to know the return of a trajectory, we
have to wait until the episode is complete: it cannot learn from incomplete episodes.
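The empirical mean can be computed as follows. This is a sketch of first-visit Monte Carlo policy evaluation, one common variant, which is an assumption here since the variant is not specified above; the episode format, a list of `(state, reward)` pairs where the reward is the one received after leaving that state, and the function name are hypothetical.

```python
from collections import defaultdict


def mc_policy_evaluation(episodes, gamma):
    """First-visit Monte Carlo estimate of v_pi from complete episodes.

    Each episode is a list of (state, reward) pairs, where reward is R_{t+1},
    the reward received after leaving that state.
    """
    returns = defaultdict(list)
    for episode in episodes:
        g = 0.0
        first_visit_return = {}
        # Walk backwards; the last overwrite for a state is its earliest visit.
        for state, reward in reversed(episode):
            g = reward + gamma * g
            first_visit_return[state] = g
        for state, g_first in first_visit_return.items():
            returns[state].append(g_first)
    # Empirical mean return per state.
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}


# Two hand-made episodes A -> B -> end with different final rewards.
episodes = [
    [("A", 0.0), ("B", 1.0)],
    [("A", 0.0), ("B", 3.0)],
]
v_hat = mc_policy_evaluation(episodes, gamma=0.5)
```

With these two episodes the returns from B are 1 and 3 (mean 2), and the returns from A are 0.5 and 1.5 (mean 1). Note that each return is only available once its episode has ended.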
12.8 Temporal Difference Learning
TD learning allows learning from incomplete episodes. We no longer have to follow a trajectory all
the way to its end: TD can learn before knowing the final outcome, and it learns online at
every step.
Intuitively, the procedure is the following:
1. I guess the return of a certain trajectory

2. I go one step forward in that trajectory
3. I estimate the return again from the new state
4. I go back and update the previous estimate for the initial state
More formally:
∆V (s) = r(s, a) + γV (s′ ) − V (s)

V (s) ← V (s) + α∆V (s)
where α > 0 is the learning rate.
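The update rule translates directly into code. A minimal sketch, assuming the value estimates are stored in a plain dictionary; the function name `td0_update` and the numbers are hypothetical.

```python
def td0_update(V, s, r, s_next, alpha, gamma):
    """Apply one TD(0) update to V[s] in place and return the TD-error."""
    td_error = r + gamma * V[s_next] - V[s]
    V[s] += alpha * td_error
    return td_error


V = {"A": 0.0, "B": 1.0}
# One observed step: from A we received reward 0 and landed in B.
delta = td0_update(V, "A", 0.0, "B", alpha=0.5, gamma=0.9)
```

Here the TD-error is 0 + 0.9 · 1 − 0 = 0.9, so V(A) moves halfway towards the target and becomes 0.45, without waiting for the episode to end.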

Notice that, even in this case, the target is composed of a term for the immediate reward
(r(s, a)) and a term (in this case an online estimate, γV(s′)) that takes into account future
steps and, thus, future rewards.
We call ∆V(s) the TD-error, as it measures the difference between the estimated value of state s
before and after taking a step forward.
Under suitable conditions (e.g. an appropriately decaying learning rate and every state being
visited often enough), TD(0) learning is guaranteed to converge to vπ(s), the true value.
Observe that in doing so we don’t update the whole state space, only the visited states!
However, we still have to find a criterion for visiting the state space. Basically there are two options:
• Random policy: in each state choose an action randomly

• Greedy policy: in each state, always choose the best action
At first glance, the greedy strategy may look better, but in practice it could get us stuck in
local optima.
We have to find a balance between:
• exploration: gather more data to avoid missing out on a potentially large reward
• exploitation: stick with our current knowledge and build an optimal policy for the data we’ve
seen
A good trade-off is the ϵ-greedy policy: in each state, with small probability ϵ choose an action
at random, otherwise choose greedily. In practice it works well. It is suggested to decrease the
value of ϵ over time, so that we have more exploration in the beginning.
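An ϵ-greedy choice with a decaying ϵ can be sketched as below. The linear decay schedule and its default values are assumptions chosen for illustration (other schedules are equally possible), and the function names are hypothetical.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action from q_values (dict: action -> estimated value)."""
    if rng.random() < epsilon:
        return rng.choice(list(q_values))      # explore: uniform random action
    return max(q_values, key=q_values.get)     # exploit: best known action


def decayed_epsilon(step, eps_start=1.0, eps_end=0.05, decay_steps=1000):
    """Linearly decay epsilon from eps_start to eps_end over decay_steps steps."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)


q = {"left": 0.2, "right": 0.7}
greedy_action = epsilon_greedy(q, epsilon=0.0)                  # never explores
early_action = epsilon_greedy(q, epsilon=decayed_epsilon(0))    # always explores at step 0
```

With ϵ = 0 the choice is purely greedy, while at step 0 of the schedule ϵ = 1 and the action is chosen uniformly at random.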
12.8.1 Implementations
There are two main implementations of TD learning:
• SARSA: on-policy, it computes the Q-value according to a policy, and the agent follows that
same policy
• Q-Learning: off-policy, it computes the Q-value according to a greedy policy, but the agent
follows a different exploration policy
The advantage of working directly with the Q function is that it explicitly takes actions, not
only states, into account. This is useful because we often have to learn from data generated by
an external policy µ that did not always take the best actions, while we want to estimate
vπ(s) and qπ(s, a) for our own policy π, which differs from µ. In that case we cannot freely
choose the action that was actually executed (it is fixed by the observations we have, and thus
by the policy µ), but we can still evaluate the next state using the greedy action.
SARSA
We follow the policy π to obtain a transition (s, a, r, s′). We then compute the difference to our
current estimate and update our value function as follows:
∆Q(s, a) = r + γQ(s′, a′) − Q(s, a)

Q(s, a) ← Q(s, a) + α∆Q(s, a)
where α is the learning rate, r = Rt+1 is the observed reward and a′ is the action chosen by π in
the state s′.
To decide the next action we can use an ϵ-greedy policy.
Q-learning
For each action A that takes us from S to S′, compute the difference to the current estimate and
update the value function:
$$ \Delta Q(S, A) = R_{t+1} + \gamma \max_{a} Q(S', a) - Q(S, A) $$
$$ Q(S, A) \leftarrow Q(S, A) + \alpha \, \Delta Q(S, A) $$
where:
• The immediate reward Rt+1 and the next state S′ are data collected by the exploration policy
• The update target depends on the policy being evaluated, in this case the greedy policy (hence
the max over a)
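The only difference between the two updates is the bootstrap target, which the following sketch makes explicit. It assumes a tabular Q function stored as a dictionary keyed by `(state, action)`; the function names and the numbers are hypothetical.

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """On-policy: bootstrap from the action a_next actually chosen in s_next."""
    delta = r + gamma * Q[(s_next, a_next)] - Q[(s, a)]
    Q[(s, a)] += alpha * delta
    return delta


def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma):
    """Off-policy: bootstrap from the greedy (maximal) action value in s_next."""
    best_next = max(Q[(s_next, b)] for b in actions)
    delta = r + gamma * best_next - Q[(s, a)]
    Q[(s, a)] += alpha * delta
    return delta


actions = ["l", "r"]
Q0 = {("s", "l"): 0.0, ("s", "r"): 0.0, ("s2", "l"): 1.0, ("s2", "r"): 2.0}

# Same transition (s, l, reward 0, s2); the behaviour policy picks "l" next.
Q_sarsa = dict(Q0)
d_sarsa = sarsa_update(Q_sarsa, "s", "l", 0.0, "s2", "l", alpha=0.5, gamma=0.9)

Q_ql = dict(Q0)
d_ql = q_learning_update(Q_ql, "s", "l", 0.0, "s2", actions, alpha=0.5, gamma=0.9)
```

For the same transition, SARSA uses Q(s2, l) = 1 because that is the action the agent will take, while Q-learning uses max(1, 2) = 2 regardless of what the exploration policy does next.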
12.8.2 Pro and cons of Temporal Difference Learning
Pros:
• Less variance than Monte Carlo sampling due to bootstrapping
• More sample efficient than Dynamic Programming
• Do not need to know the transition probability matrix
Cons:
• Biased due to bootstrapping, we use “old” value estimates as labels
• Exploration/Exploitation dilemma
12.9 Deep Reinforcement Learning
12.9.1 Introduction
Recall that
• π : S → A
• vπ : S → R
are functions, and can thus be approximated by a neural network.

In particular we will see:
• Q-learning: Deep Learning to optimize value function

• Policy methods: Deep Learning to optimize directly the policy
• Actor-Critic methods: Deep Learning to optimize directly the policy, using another
network to approximate the value function
12.9.2 Deep Q-Learning
In Q-Learning we assign a value to each pair (s, a); thus, our goal is to use function approximation
to learn the action-value function
$$ q(s, a) \approx Q_\theta(s, a) $$
We can use a neural network to learn the mapping between state-action pairs (s, a) and their values.
The Q-learning updates reduce to SGD on the squared TD-error (∆Q):
$$ \mathrm{Loss}(\theta) = \left( R + \gamma \max_{a'} Q_\theta(s', a') - Q_\theta(s, a) \right)^2 $$
However, we still have a problem: SGD assumes that the training samples are i.i.d.
But in RL the states visited in a trajectory are strongly correlated; how can we address this?
The main idea is to use a replay buffer in which we store the generated samples. Drawing random
minibatches from this buffer breaks the correlation between consecutive samples.
The training procedure is the following:
1. Run some exploration policies and store all the generated transitions in the buffer
2. When we have enough transitions, we sample a random minibatch from the buffer
3. For every transition in this minibatch, we compute the loss and update the parameters
4. Iterate until convergence
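A replay buffer needs only two operations: storing transitions and sampling random minibatches. The class below is a minimal sketch using the Python standard library; the class name, the fixed capacity with oldest-first eviction, and the `(s, a, r, s_next, done)` transition layout are assumptions for illustration.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of transitions with uniform random minibatch sampling."""

    def __init__(self, capacity, seed=None):
        # A deque with maxlen drops the oldest transition once capacity is reached.
        self.storage = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, s, a, r, s_next, done):
        self.storage.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Sampling without replacement breaks the temporal order of the data.
        return self.rng.sample(list(self.storage), batch_size)

    def __len__(self):
        return len(self.storage)


buffer = ReplayBuffer(capacity=3, seed=0)
for i in range(5):
    # Dummy transitions: state i, action 0, reward 0, next state i + 1.
    buffer.add(i, 0, 0.0, i + 1, False)
batch = buffer.sample(2)
```

After five insertions into a buffer of capacity 3, only the three most recent transitions remain, and each minibatch is drawn uniformly from them.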
12.9.3 Policy search methods
Q-learning is limited to discrete action spaces (e.g. we can consider the actions W, E, N, S but not
the angle of the direction): for continuous action spaces the maximization over actions becomes
intractable.
But there is still hope!
As a matter of fact, learning a policy π directly is often much easier. The algorithm directly
learns the correct behaviour, without going through the value function.
The question now becomes:
How can we train such a model?
Policy gradients
We can model the policy in a particular state st at time step t as a normal distribution over the
possible actions, with mean µt and variance σt²; that is, we use a Gaussian parametrization of the
policy:
$$ \pi(a_t \mid s_t) \sim \mathcal{N}(\mu_t, \sigma_t^2) $$
The advantage of this parametrization is that we can sample actions from it and learn the
parameters of our network.
In particular, recall that the probability of a particular trajectory τ is
$$ p(\tau) = p(s_1, a_1, \ldots, s_T, a_T) = p(s_1) \prod_{t=1}^{T} \pi(a_t \mid s_t) \, p(s_{t+1} \mid a_t, s_t) $$
Ideally, our objective is to make
• good trajectories more likely

• bad trajectories less likely
In order to do that, as we have already said before, we sample from the gaussian parametrization
of the policy and we learn our parameters to obtain π(at |st , θ).
Training: exploration and evaluation
The learning phase can be split into two main parts:
• Exploration: get the trajectory data. To do that, we sample an action at every time-step from
the policy probability distribution (on-policy methods)
• Evaluation: evaluate the policy by computing the expectation of the trajectory reward
given the parameters θ
$$ J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \sum_{t} \gamma^t r(s_t, a_t) \right] $$
Optimization: policy update
Our question now is:
How can we classify the trajectories? What does it mean for a trajectory to be good/bad?
The goal is to maximize the performance measure:
$$ \theta^* = \arg\max_{\theta} J(\theta) $$
In order to do that we update the parameters using gradient ascent:
$$ \theta \leftarrow \theta + \alpha \nabla_\theta J(\theta) $$
where α > 0 is the learning rate.
How can we compute the gradient of our objective function J(θ)?

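Let us first write J(θ) in a compact way:
$$
\begin{aligned}
J(\theta) &= \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \sum_{t} \gamma^t r(s_t, a_t) \right] \\
&= \mathbb{E}_{\tau \sim p_\theta(\tau)} [ r(\tau) ] \\
&= \int p_\theta(\tau) \, r(\tau) \, d\tau
\end{aligned}
$$
Now we are ready to compute the gradient:
$$
\begin{aligned}
\nabla_\theta J(\theta) &= \nabla_\theta \int p_\theta(\tau) \, r(\tau) \, d\tau \\
&= \int p_\theta(\tau) \, \nabla_\theta \log p_\theta(\tau) \, r(\tau) \, d\tau && \text{(we used the fact that } \nabla \log f(x) = \tfrac{\nabla f(x)}{f(x)}\text{)} \\
&= \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \nabla_\theta \log p_\theta(\tau) \, r(\tau) \right]
\end{aligned}
$$
Let us now give a representation of log pθ(τ):
$$
\begin{aligned}
\log p_\theta(\tau) &= \log p(s_1, a_1, \ldots, s_T, a_T) \\
&= \log \left[ p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t \mid s_t) \, p(s_{t+1} \mid a_t, s_t) \right] \\
&= \underbrace{\log p(s_1)}_{\text{constant w.r.t. } \theta} + \sum_{t=1}^{T} \log \pi_\theta(a_t \mid s_t) + \underbrace{\sum_{t=1}^{T} \log p(s_{t+1} \mid a_t, s_t)}_{\text{constant w.r.t. } \theta}
\end{aligned}
$$
Note that the first and the last term do not depend on the policy we choose and, thus, on the
parameters of our neural network. Therefore, when we apply ∇θ they vanish.
Substituting into the previous expression we obtain:
$$
\begin{aligned}
\nabla_\theta J(\theta) &= \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \nabla_\theta \log p_\theta(\tau) \, r(\tau) \right] \\
&= \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \left( \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right) r(\tau) \right] \\
&= \mathbb{E}_{\tau \sim p_\theta(\tau)} \Bigg[ \underbrace{\left( \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right)}_{\text{gradient of the log-likelihood of } \tau} \underbrace{\left( \sum_{t=1}^{T} \gamma^t r(s_t, a_t) \right)}_{\text{trajectory reward}} \Bigg], \qquad r(\tau) = \sum_{t=1}^{T} \gamma^t r(s_t, a_t)
\end{aligned}
$$
Thus the gradient of our objective function is the gradient of the log-likelihood of τ, scaled by the
trajectory reward.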
Now the question becomes
How can we in practice evaluate this quantity? i.e. how can we compute the expected value?
Attempt 1: REINFORCE algorithm
The process (REINFORCE algorithm, Williams, 1992) is the following:
1. Initialize the policy parameters θ at random
2. Use this policy πθ to collect a trajectory τ = (s0, a0, r1, s1, a1, . . . , aH, rH+1, sH+1)
3. Calculate the discounted return for each step k by working backwards through the trajectory
$$ G_k = \sum_{t=k+1}^{H+1} \gamma^{t-k-1} R_t = R_{k+1} + \gamma G_{k+1} $$
4. Estimate the expected reward J and its gradient ∇θJ(θ) using Monte Carlo sampling (we sample N
trajectories):
$$ \nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \left[ \left( \sum_{t=0}^{H} \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i) \right) G_0^i \right] $$
5. Adjust the weights of the policy to increase J (with learning rate α > 0)
$$ \theta \leftarrow \theta + \alpha \nabla_\theta J(\theta) $$
6. Iterate until convergence
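Steps 3 and 4 can be sketched for a deliberately tiny policy. The assumptions here go beyond the text: states and actions are scalars, the policy is Gaussian with mean θ · s and fixed standard deviation σ, so that the score ∇θ log πθ(a|s) = (a − θs) s / σ² has a closed form, and all names are hypothetical. A real implementation would obtain this gradient from a neural network via automatic differentiation.

```python
def returns_to_go(rewards, gamma):
    """rewards[k] holds R_{k+1}; returns [G_0, ..., G_H] via G_k = R_{k+1} + gamma * G_{k+1}."""
    G = [0.0] * len(rewards)
    running = 0.0
    for k in reversed(range(len(rewards))):
        running = rewards[k] + gamma * running
        G[k] = running
    return G


def grad_log_gaussian(theta, s, a, sigma):
    """Score of a 1D Gaussian policy with mean theta * s and fixed std sigma."""
    return (a - theta * s) * s / sigma ** 2


def reinforce_gradient(theta, trajectories, gamma, sigma):
    """Monte Carlo estimate of grad J: mean over trajectories of (sum of scores) * G_0."""
    total = 0.0
    for states, actions, rewards in trajectories:
        score = sum(grad_log_gaussian(theta, s, a, sigma)
                    for s, a in zip(states, actions))
        total += score * returns_to_go(rewards, gamma)[0]
    return total / len(trajectories)


G = returns_to_go([1.0, 1.0, 1.0], gamma=0.5)
# One hand-made trajectory with a single step: state 1, action 2, reward 3.
grad = reinforce_gradient(0.0, [([1.0], [2.0], [3.0])], gamma=0.9, sigma=1.0)
theta_new = 0.0 + 0.1 * grad  # step 5 with a learning rate of 0.1 (assumption)
```

In the one-step example the score is (2 − 0) · 1 = 2 and the return is 3, so the gradient estimate is 6: the rewarded action lay above the current mean, and the update moves θ upwards to make it more likely.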
Attempt 2: REINFORCE algorithm with baseline
However, if we proceed like this, we still have a problem: the gradient is estimated from only a
few samples (we used Monte Carlo sampling), thus the obtained policy gradients are very noisy.
Solution: reduce the variance by introducing a baseline $b(s_t^i)$ in the term related to the
trajectory reward.
$$ \nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \left[ \left( \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i) \right) \left( \sum_{t=0}^{T} \gamma^t R_t^i - b(s_t^i) \right) \right] $$
Remark: the baseline must be a function that does not depend on the action taken (common
choices: average reward, estimate of the state value function).
As a consequence, the variance is reduced, but the policy gradient estimate remains unbiased.
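The unbiasedness claim follows from a short calculation, sketched here under the assumption that the baseline b(s) depends only on the state and not on the sampled action. The extra term that the baseline adds to the gradient has zero expectation:

```latex
\begin{aligned}
\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)} \big[ \nabla_\theta \log \pi_\theta(a \mid s) \, b(s) \big]
  &= b(s) \int \pi_\theta(a \mid s) \, \nabla_\theta \log \pi_\theta(a \mid s) \, da \\
  &= b(s) \int \nabla_\theta \pi_\theta(a \mid s) \, da
     && \text{(same identity } \nabla \log f = \nabla f / f \text{ as before)} \\
  &= b(s) \, \nabla_\theta \int \pi_\theta(a \mid s) \, da \\
  &= b(s) \, \nabla_\theta 1 = 0
\end{aligned}
```

Subtracting the baseline therefore changes the variance of the estimate but not its expected value.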
Attempt 3: REINFORCE algorithm with actor-critic
Can we do better?
Recall that in the original form
$$ \nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \left[ \left( \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i) \right) r(\tau^i) \right] $$
The idea is to use bootstrapping, which introduces some bias but reduces the variance.
In particular, we weight the log-likelihood gradient at each step by the TD error, which replaces
the reward of the whole roll-out with a bootstrapped estimate:
$$ \nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i) \left( r(s_t^i, a_t^i) + \gamma V(s_{t+1}^i) - V(s_t^i) \right) $$
where V(st) (the value function in state st) is computed by another neural network.

Note that if the TD error is very close to 0, it means that the V network has learnt the value
function v and the π(a|s) network has learnt the optimal policy.
This method is called actor (policy) - critic (value function).
Additional Readings
Here we provide some suggested readings, divided for each chapter and topic.
Chapter 2: training neural networks
Optimization & Gradient Descent
• Chapter 8 of Deep Learning Book [6]

• Blog article: "Gradient Descent Algorithm and Its Variants"
Chapter 5: RNN
• Blog article: "The unreasonable effectiveness of RNNs"

• Blog article: "Understanding LSTM Networks"
Chapter 6: VAE
• Blog article: "Intuitively Understanding Variational Autoencoders"

Bibliography
Books
[6] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep Learning. Vol. 521. 2015.
Articles
[1] L. Breiman. “Bagging predictors”. In: Machine Learning 24 (2004), pp. 123–140.
[2] T. Garipov et al. “Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs”.
In: ArXiv abs/1802.10026 (2018).
[4] Sepp Hochreiter and Jürgen Schmidhuber. “Long Short-Term Memory”. In: Neural
Computation 9 (1997), pp. 1735–1780.
[7] Christian Ledig et al. “Photo-Realistic Single Image Super-Resolution Using a Generative
Adversarial Network”. In: 2017 IEEE Conference on Computer Vision and
Pattern Recognition (CVPR) (2016), pp. 105–114.
[9] Robin Rombach et al. “High-Resolution Image Synthesis with Latent Diffusion Models”.
In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
(2021), pp. 10674–10685.
[10] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. “U-Net: Convolutional Networks
for Biomedical Image Segmentation”. In: ArXiv abs/1505.04597 (2015).
[11] Frank Rosenblatt. “The perceptron: a probabilistic model for information storage and
organization in the brain.” In: Psychological review 65 6 (1958), pp. 386–408.
[12] Nitish Srivastava et al. “Dropout: a simple way to prevent neural networks from
overfitting”. In: J. Mach. Learn. Res. 15 (2014), pp. 1929–1958.

