Their creation would not have been possible without the contributions of many people.
I would like to thank all the people who have contributed to the creation of the lecture
material throughout the years: Artur Grigorev, Sammy Christen.
I would like to thank Jonas Hübotter, who inspired me with his lecture notes and
agreed to share his template with us, from which I borrowed some tricks that made
writing these notes easier.
Disclaimer
These lecture notes are provided as a draft version for educational purposes only. The
content presented herein is subject to change and may contain inaccuracies or errors.
Contributing
You are encouraged to raise issues and suggest fixes for anything you think can be improved.
Contact: machine-perception@inf.ethz.ch
This set of notes was written for the course Machine Perception (263-5210-00L) at ETH Zürich.
Distribution of these notes without the permission of the authors is prohibited.
Students will learn about fundamental aspects of modern deep learning approaches for perception
and generation. Students will learn to implement, train and debug their own neural networks and
gain a detailed understanding of cutting-edge research in learning-based computer vision, robotics,
and shape modeling. The optional final project assignment will involve training a complex neural
network architecture and applying it to a real-world dataset.
The core competency acquired through this course is a solid foundation in deep-learning algorithms
to process and interpret human-centric signals. In particular, students should be able to develop
systems that deal with the problem of recognizing people in images, detecting and describing body
parts, inferring their spatial configuration, performing action/gesture recognition from still images
or image sequences, also considering multi-modal data, among others.
We will focus on teaching how to set up the problem of machine perception, the learning algorithms,
network architectures, and advanced deep learning concepts, in particular probabilistic deep learning
models.
This chapter provides a concise reference describing the notation used throughout the lecture notes.
If you are unfamiliar with any of the corresponding mathematical concepts, we suggest reading
chapters 2 to 4 of the Deep Learning book [6].
$a$ A scalar
$\mathbf{a}$ A vector
$\mathbf{A}$ A matrix
$\mathbf{I}_n$ Identity matrix with n rows and n columns
$\mathbf{I}$ Identity matrix with dimensionality implied by context
diag($\mathbf{a}$) A square, diagonal matrix with diagonal entries given by the vector $\mathbf{a}$
Sets
$\mathcal{A}$ A set
$\mathbb{N}$ set of natural numbers {1, 2, ...}
$\mathbb{N}_0$ set of natural numbers, including 0, $\mathbb{N} \cup \{0\}$
$\mathbb{R}$ set of real numbers
[m] set of natural numbers from 1 to m, {1, 2, ..., m − 1, m}
i:j subset of natural numbers between i and j, {i, i + 1, ..., j − 1, j}
(a, b] real interval between a and b including b but not including a
Indexing
$A_{ij}$ The element in position (i, j) (where i is the row and j is the column) of a matrix $\mathbf{A}$
$\mathbf{A}_{i,:}$ Row i of a matrix $\mathbf{A}$
$\mathbf{A}_{:,j}$ Column j of a matrix $\mathbf{A}$
In addition to the notation described above, when an index is specified, we use the
following notation so that an indexed scalar or vector is still represented with the right notation
(respectively $a$ and $\mathbf{a}$).
$\mathbf{A}^\top$ transpose of matrix $\mathbf{A}$
$\mathbf{A}^{-1}$ inverse of invertible matrix $\mathbf{A}$
det(A) determinant of A
tr(A) trace of A
Calculus
$\frac{dy}{dx}$ Derivative of y w.r.t. x
$\frac{\partial y}{\partial x}$ Partial derivative of y w.r.t. x
$\nabla_{\mathbf{x}} y$ Gradient of y w.r.t. $\mathbf{x}$
$\nabla_{\mathbf{X}} y$ Gradient of y w.r.t. $\mathbf{X}$
Probability
Ω sample space
A event space
P(X = x) probability of a random variable X taking on the value x
X∼P random variable X follows the distribution P
x∼P value x is sampled according to distribution P
x|y value x is sampled according to (implicit) conditional distribution p(· | y)
$P_X$ cumulative distribution function of a random variable X
∆A set of all probability distributions over the set A
X⊥Y random variable X is independent of random variable Y
E[X] expected value of random variable X
$\mathbb{E}_{x \sim X}[f(x)]$ expected value of the random variable f(X), E[f(X)]
Var[X] variance of random variable X
Cov[X, Y] covariance of random variable X and random variable Y
Σ covariance matrix
$\bar{X}_n$ sample mean of random variable X with n samples
$D_{KL}(p \,\|\, q)$ KL-divergence of distribution p with respect to distribution q
$\mathcal{N}(\mu, \Sigma)$ normal distribution with parameters µ and Σ
Unif(S) uniform distribution on the set S
Functions
6 Autoencoders
6.1 Introduction
6.2 Linear Autoencoders: the PCA projection
6.3 Non-Linear Autoencoders
6.3.1 Overview
6.3.2 Dimensionality of hidden layer
6.3.3 Autoencoder Limitations
6.4 Variational Autoencoders
6.4.1 Overview
6.4.2 Kullback-Leibler (KL) Divergence
6.4.3 Derivation of the objective function
6.4.4 Training
6.4.5 Training in practice
6.4.6 Generating new data
6.5 β-VAE
9 GAN
9.1 Likelihood-free model
9.1.1 Case 1: great log-likelihood and poor samples
9.1.2 Case 2: poor log-likelihood and great samples
9.2 Introduction to GAN
9.3 Definitions
9.4 Training
9.4.1 General idea
9.4.2 Theory vs practice
9.5 Theoretical analysis
9.5.1 Derivation of the GAN objective
9.5.2 Optimal Discriminator
9.5.3 Global Optimality
9.5.4 Convergence of the training algorithm
9.6 Difficulties during training
9.6.1 Mode collapse
9.6.2 Issues with $D_{JS}$
9.7 Comparison with VAE
9.8 Conditional GANs
Bibliography
Books
Articles
Part One: Foundation of Deep Learning
2 Training
2.1 Regularization
2.2 Activation functions
2.3 Optimization Algorithms
2.4 Last practical suggestions
1. Neural Network Basics
1.1 Outline
This chapter is a review of the basics of Neural Network Theory. We begin with the presentation of
the Perceptron model and its learning algorithm. We then extend these concepts to understand
modern Deep Neural Networks. Finally, in section 1.9.2, we analyze which types of functions can
be approximated using a DNN.
During the first weeks of the course we focus on the study of bottom-up perception, with the aim
of understanding how stimuli are processed without the interference of top-down processes.
Artificial neural networks have traditionally taken inspiration from the workings of the nervous
system, which consists of basic computational units called neurons. We therefore describe
the structure of a neuron and use it as inspiration to derive the first building block of artificial
networks.
At a high level, each neuron receives signals from other neurons, processes them, and propagates the
new signal to other neurons.
The detailed process can be described as follows. First, the dendrites collect the chemical signals
(in the form of neurotransmitters) arriving from other neurons. All these signals are then integrated
in the soma, and the resulting value is compared to a given threshold. If the resulting signal, now
called the action potential, surpasses the threshold, it is transmitted to the next part of the neuron,
the axon. The action potential then travels down the axon until it reaches the axon terminals.
There the signal can cause the release of neurotransmitters, which are received by the dendrites of
the next neurons, and the process starts again.
20 Chapter 1. Neural Network Basics
For the purposes of this course, it is important to keep in mind that a neuron combines multiple
inputs and only forwards them if they surpass the threshold, which introduces a non-linearity in
the propagation of the signal.
1.3 Perceptron
Among the earliest predecessors of modern deep learning are simple linear models. Just as brain neurons
collect multiple inputs from other neurons and combine them into a single one in the soma, these
models are designed to take a set of m input values x and associate them with a single output
y ∈ {0, 1}. To perform this task, these models employ an m-dimensional vector of weights w
(one for each input value) and a scalar bias b, which are called the parameters¹ of the model.
The output of the model is then computed as a function (which can be thought of as the neuron's
threshold) of the linear combination of the inputs, parametrized by the weights and the bias.
Among these early models, we can find the Perceptron [11]. Its output is defined as follows.
\[
\hat{y} = f(x, w) =
\begin{cases}
1 & \text{if } w^\top x + b > 0 \\
0 & \text{otherwise}
\end{cases}
\]
where $w^\top x$ is the dot product $\sum_{i=1}^{m} w_i x_i$.
In 1958, the Perceptron became the first among the linear models that could learn the weights that
defined the categories given examples of inputs from each category. Before introducing the learning
algorithm, we will define some terms:
• $D = ((x^{(1)}, y^{(1)}), \dots, (x^{(n)}, y^{(n)}))$ is the training set of n data samples, where:
– $x^{(i)}$ is the ith m-dimensional input vector
– $y^{(i)}$ is the desired output value of the perceptron for the ith input
¹ In general, throughout the class we are going to indicate with Θ the set of parameters of a model.
Moreover, as illustrated in the figure below, a 1 is prepended to the input vector x. In that way, x
becomes an (m + 1)-dimensional vector, as does w, which now also includes the bias term b.
1. By definition of $y^{(i)}$ and $\hat{y}^{(i)}$, the residual $(y^{(i)} - \hat{y}^{(i)})$ can be either −1, 0 or 1, as shown in
the table below. You can find a visualization of how the weights are adjusted in the 2D case
in the given Jupyter notebook.
y    ŷ    (y − ŷ)
0    0     0
1    0     1
0    1    −1
1    1     0
2. If the training set is linearly separable, the perceptron learning algorithm is guaranteed to
converge and to eventually find a set of weights w that correctly separates the two classes
of the training set. Conversely, if the training set D is not linearly separable, it will never
reach a state where all the input vectors are classified correctly.
3. The algorithm does not find an optimal separation in terms of margin, as an SVM
(Support Vector Machine) does; it simply stops once it finds a solution to the separability
problem.
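The learning procedure discussed in the remarks above can be sketched as follows. This is a minimal illustration that assumes the classic perceptron update w ← w + η (y − ŷ) x with the bias folded into w through a prepended 1; the function names, the learning rate `lr`, the epoch cap `max_epochs` and the toy AND dataset are all choices made for this example only.

```python
import numpy as np

def perceptron_predict(w, X):
    """Threshold the linear score; X must already contain the leading column of ones."""
    return (X @ w > 0).astype(int)

def perceptron_train(X, y, lr=1.0, max_epochs=100):
    """Classic perceptron rule: w <- w + lr * (y - y_hat) * x."""
    X = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend 1 so that w[0] plays the role of b
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for x_i, y_i in zip(X, y):
            residual = y_i - int(x_i @ w > 0)  # one of -1, 0, 1
            if residual != 0:
                w += lr * residual * x_i
                errors += 1
        if errors == 0:  # every sample is classified correctly: stop
            break
    return w

# Linearly separable toy data: logical AND
X_and = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_and = np.array([0, 0, 0, 1])
w_and = perceptron_train(X_and, y_and)
pred_and = perceptron_predict(w_and, np.hstack([np.ones((4, 1)), X_and]))
```

On separable data such as this the loop stops as soon as one full pass produces no errors, which mirrors remark 3: the solution found is a separating one, not a maximum-margin one.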
1.4 The ingredients of a Neural Network 23
In this chapter we use the sigmoid as activation function. However, in the next chapter, we will
explore and compare other activation functions that can be used in neural networks to improve
their performance and flexibility.
1.4.1 Sigmoid
\[
\mathrm{sigm}(x) = \sigma(x) \doteq \frac{1}{1 + e^{-x}} = \frac{e^{x}}{e^{x} + 1}
\]
The sigmoid has many desirable properties for use in neural networks, which we will explore in
more detail later.
• It is differentiable across its entire domain, which is necessary for training the network using
gradient-based optimization algorithms
• It outputs values in the (0, 1) interval, which can be interpreted as probabilities. This allows
us to use the sigmoid as a natural choice for the output layer of the neural network when we
want to perform binary classification tasks, where the output represents the probability of
belonging to a certain class.
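As an illustration of the two properties listed above, here is a small NumPy sketch of the sigmoid and of its derivative σ(x)(1 − σ(x)). Switching between the two equivalent forms of the definition depending on the sign of x is a common numerical-stability trick and is an implementation choice of this example, not something prescribed by the notes.

```python
import numpy as np

def sigmoid(x):
    """Sigmoid using 1/(1+e^-x) for x >= 0 and e^x/(e^x+1) for x < 0."""
    x = np.asarray(x, dtype=float)
    e = np.exp(-np.abs(x))  # the exponent is never positive, so this cannot overflow
    return np.where(x >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

def sigmoid_grad(x):
    """Derivative of the sigmoid: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

xs = np.array([-1000.0, -2.0, 0.0, 2.0, 1000.0])
probs = sigmoid(xs)

# Central finite difference as an independent check of the derivative
h = 1e-6
numeric_grad = (sigmoid(0.3 + h) - sigmoid(0.3 - h)) / (2 * h)
```

In exact arithmetic the outputs lie strictly inside (0, 1); in floating point the extreme inputs saturate to 0 and 1.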
In general, activation functions are an essential component of neural networks as they introduce
nonlinearity, allowing the network to learn complex relationships and patterns in the data.
Specifically, the activation functions transform the input signal of a neuron to produce its output. This
transformation is critical as it helps to project instances that are not linearly separable into a space
where we can find a separating hyperplane, enabling the network to learn to classify or predict
outputs for a given input.
At the beginning of this course, we will focus on the framework of supervised learning, a
type of machine learning in which an algorithm is trained on a labeled dataset of known
input-output pairs and learns to generalize from this dataset to make predictions or
classifications on new, unseen data.
• Learning: the estimation of the parameters Θ of the function $f_\Theta$ from the training data
$\{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$
• Inference: keeping the learned Θ fixed, the model makes predictions $\hat{y} = f_\Theta(x)$ for unseen
inputs
The fact that we test our model on unseen data is the main difference between learning and traditional
optimization.
1.5.1 Classification
Suppose we want to classify images depending on whether they represent a house or a boat.
In this case, our model takes a vectorized image as input and outputs 0 if the image represents a
house and 1 if it represents a boat. Thus, our model in this case will be a mapping
\[
f_\Theta : \mathbb{R}^{W \times H \times 3} \to \{\text{House}, \text{Boat}\}
\]
The first technique we are going to use to define a suitable loss function is Maximum Likelihood
Estimation.
When defining the loss function, we will always follow these three main steps:
1. Write down the parametric probability distribution of the model $p_{\text{model}}(y \mid X, \Theta)$
2. Decompose that probability distribution into per-sample probabilities $p_{\text{model}}(y^{(i)} \mid x^{(i)}, \Theta)$
3. Convert everything to log scale and minimize the Negative Log-Likelihood (NLL)
The NLL is the loss function that we will use to optimize our networks.
Let X ∼ N (µ, σ 2 ). In this situation our model is a 1D Gaussian distribution, and we can model
our system using a Gaussian probability density function with µ and σ as parameters.
In order to minimize the negative log-likelihood (NLL), we can adjust the values of µ and σ such
that the Gaussian curve places more probability mass on areas where we expect to see data, and
less probability mass on areas where we do not expect to see data. A visual representation of that
is given in fig. 1.7.
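To make the idea concrete, the sketch below evaluates the Gaussian NLL on synthetic data and compares it at the maximum-likelihood parameters and at perturbed ones. For a 1D Gaussian the minimizers are known in closed form (the sample mean and the 1/n standard deviation); the data-generating values 3.0 and 2.0, the sample size and the perturbations are arbitrary choices made for this illustration.

```python
import numpy as np

def gaussian_nll(x, mu, sigma):
    """Negative log-likelihood of i.i.d. samples x under N(mu, sigma^2)."""
    n = x.size
    return 0.5 * n * np.log(2 * np.pi * sigma**2) + np.sum((x - mu) ** 2) / (2 * sigma**2)

rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=2.0, size=5000)

# Closed-form minimizers of the NLL: sample mean and 1/n standard deviation
mu_mle = data.mean()
sigma_mle = data.std()  # np.std defaults to ddof=0, i.e. the 1/n estimator

nll_at_mle = gaussian_nll(data, mu_mle, sigma_mle)
nll_shifted = gaussian_nll(data, mu_mle + 0.5, sigma_mle)   # curve moved away from the data
nll_wider = gaussian_nll(data, mu_mle, 1.5 * sigma_mle)     # curve spread too wide
```

Moving or widening the curve places less probability mass where the data actually lies, so both perturbed settings give a larger NLL.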
Now a Gaussian would not be a valid $p_{\text{model}}$ anymore. Instead, we will model the output variable with a Bernoulli distribution.
The parameter of this Bernoulli distribution (σ(θ ⊤ x(i) )) derives from the model used for binary
classification, illustrated in fig. 1.8.
Remark: The sigmoid can be interpreted as a probability distribution over two classes because:
1. Its output is always positive
2. Its output lies in the interval (0, 1)
As we discussed before, in order to compute the optimal parameters of this model w.r.t. the
likelihood function, $\theta^*_{MLE}$, we minimize the NLL.
1. We compute the expression of the likelihood under the assumption that the instances are
independent and identically distributed (i.i.d.):
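\[
p_{\text{model}}(y \mid X, \theta) = \prod_{i=1}^{N} p(y^{(i)} \mid x^{(i)}, \theta)
= \prod_{i=1}^{N} \Big( \underbrace{\frac{1}{1 + e^{-\phi}}}_{\pi^{(i)}} \Big)^{y^{(i)}} \Big( \underbrace{1 - \frac{1}{1 + e^{-\phi}}}_{1 - \pi^{(i)}} \Big)^{1 - y^{(i)}},
\qquad \phi = \theta^\top x^{(i)}
\]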
We can decompose the likelihood in this way since y (i) can only take on the values of 0 or 1,
so only one of the two terms in the product will be active in practice.
2. We compute the negative log-likelihood (NLL) and use it as a loss function L(θ), which is
a function of the model parameters θ. This loss function is also known as Binary Cross-Entropy (BCE).
1.6 Defining a loss function: Maximum Likelihood Estimation 27
In fig. 1.9 you can find a visualization of the values assumed by the BCE as a function of $y^{(i)}$ and
$\hat{y}^{(i)}$.
The blue curve in the loss function plot corresponds to the case where the true label is 1, and only
the first term of the loss function is active. The red curve represents the case where the true label
is 0, and only the second term of the loss function is active.
The loss is minimized when the model predicts a high probability for the true class label and a
low probability for the other class label. By minimizing this loss function, we can train a model to
accurately predict the class labels for new instances.
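For reference, the two terms mentioned above follow directly from taking the negative logarithm of the Bernoulli likelihood. The derivation below is a sketch that writes $\hat{y}^{(i)} = \pi^{(i)} = \sigma(\theta^\top x^{(i)})$ and omits any normalization by N, which is an assumption made here for brevity.

```latex
\begin{aligned}
L(\theta) &= -\log p_{\text{model}}(y \mid X, \theta)
           = -\log \prod_{i=1}^{N} \big(\pi^{(i)}\big)^{y^{(i)}} \big(1 - \pi^{(i)}\big)^{1 - y^{(i)}} \\
          &= -\sum_{i=1}^{N} \Big[ y^{(i)} \log \pi^{(i)} + \big(1 - y^{(i)}\big) \log\big(1 - \pi^{(i)}\big) \Big]
\end{aligned}
```

When $y^{(i)} = 1$ only the first term inside the sum is non-zero, and when $y^{(i)} = 0$ only the second one is.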
In multiclass classification, the number of possible class labels is greater than two. To support
multiclass classification, we need to modify the output layer of the neural network.
Instead of a single neuron, we need to have k stacked neurons, where k is the number of classes in
the dataset. The output of these neurons represents the probability of an input instance belonging
to each class.
To obtain a probability distribution over classes, we apply the softmax function to the k-dimensional
output vector of the stacked neurons. The softmax function maps the output of each neuron
to a probability between 0 and 1, such that the sum of probabilities over all classes is equal to
1.
Definition 1.6.1 — Softmax. The softmax of a k dimensional vector x is a k dimensional
vector, whose ith element is defined as
\[
\mathrm{softmax}(x)_i \doteq \frac{e^{x_i}}{\sum_{j=1}^{k} e^{x_j}}
\]
Like the sigmoid, the softmax satisfies all the properties required to be a probability distribution
over classes. In particular:
1. Its outputs are always positive (the exponential function is always positive)
2. Its outputs are always in (0, 1) (thanks to the normalization factor $\sum_{j=1}^{k} e^{x_j}$ in the denominator
and to the fact that the exponential is always positive)
3. The sum of all the outputs is 1:
\[
\sum_{i=1}^{k} \mathrm{softmax}(x)_i = \frac{\sum_{i=1}^{k} e^{x_i}}{\sum_{j=1}^{k} e^{x_j}} = 1
\]
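The three properties above can be checked numerically with a short sketch. Subtracting the maximum before exponentiating is a standard stability trick that leaves the result unchanged; it is an implementation detail of this example and not part of the definition.

```python
import numpy as np

def softmax(x):
    """Softmax over the last axis, shifted by the maximum for numerical stability."""
    x = np.asarray(x, dtype=float)
    shifted = x - x.max(axis=-1, keepdims=True)  # softmax is invariant to this shift
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

logits = np.array([2.0, 1.0, -1.0])
p = softmax(logits)
p_large = softmax(logits + 1000.0)  # a naive exp() would overflow on these inputs
```

Adding the same constant to every input does not change the output, which is why the shifted computation is safe.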
We have seen that choosing pmodel (y|X, θ) as Bernoulli distribution yields the binary cross-entropy
estimator.
1. If we choose $p_{\text{model}}(y \mid x, \theta) = \mathcal{N}(y \mid \theta^\top x, \sigma)$ to be Gaussian, we end up with the least squares
(MSE) estimator: $\theta^*_{MLE} = \arg\min_\theta \|\theta^\top x - y\|_2^2$
2. Choosing $p_{\text{model}}(y \mid x, \theta)$ to be a Laplacian distribution yields an estimator that minimizes
the $\ell_1$ norm: $\theta^*_{MLE} = \arg\min_\theta \|\theta^\top x - y\|_1$
3. Assuming a Gaussian distribution over θ and performing maximum a-posteriori (MAP)
estimation yields ridge regression
There are many nice theoretical properties making MLE an appealing framework. Among them:
So far we have learnt how to define a loss function and how to express the output for binary and
multiclass classification tasks.
In this section, our attention shifts towards optimizing the parameters of the network. Since the
minimizers of most loss functions, such as the Binary Cross-Entropy (BCE), cannot be written in closed form,
the typical approach in Deep Learning is to use iterative gradient descent.
\[
v = \nabla_\Theta L(\hat{y}, y) = \sum_{i=1}^{N} \nabla_\Theta L(\hat{y}^{(i)}, y^{(i)})
\]
\[
\Theta^{\{t+1\}} = \Theta^{\{t\}} - \eta v
\]
With a suitably small learning rate η, this procedure typically converges to parameters that correspond to a local minimum of
the cost function.
In practice, we use Stochastic Gradient Descent (SGD) instead of GD. SGD computes the gradient
using only a small subset of samples, which makes it more computationally efficient. As a result,
rather than computing the true gradient at each step, we compute a noisy estimate whose expectation is (up to scaling) the true gradient.
It’s worth noting that the update rule for SGD is similar to what we previously saw in the perceptron
algorithm (see section 1.3.1).
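The difference between GD and SGD can be sketched on a small logistic-regression problem trained with the summed BCE. Everything specific here is an assumption made for the example: the synthetic data, the learning rate `eta`, the mini-batch size of 20 and the N/20 rescaling that turns a mini-batch gradient into an estimate of the full sum.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(theta, X, y):
    """Summed binary cross-entropy, i.e. the NLL of the Bernoulli model."""
    p = np.clip(sigmoid(X @ theta), 1e-12, 1 - 1e-12)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def bce_grad(theta, X, y):
    """Gradient of the summed BCE w.r.t. theta: X^T (sigmoid(X theta) - y)."""
    return X.T @ (sigmoid(X @ theta) - y)

rng = np.random.default_rng(0)
N = 200
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, 2))])  # leading 1 is the bias input
y = (X @ np.array([-0.5, 2.0, -1.0]) + 0.3 * rng.normal(size=N) > 0).astype(float)

eta = 0.005
theta_gd = np.zeros(3)
theta_sgd = np.zeros(3)
loss_start = bce_loss(theta_gd, X, y)
for t in range(500):
    # GD: v is the sum of the per-sample gradients over the whole training set
    theta_gd -= eta * bce_grad(theta_gd, X, y)
    # SGD: gradient on a random mini-batch, rescaled to estimate the full sum
    idx = rng.choice(N, size=20, replace=False)
    theta_sgd -= eta * (N / 20) * bce_grad(theta_sgd, X[idx], y[idx])

loss_gd = bce_loss(theta_gd, X, y)
loss_sgd = bce_loss(theta_sgd, X, y)
```

Each SGD step touches only 20 of the 200 samples, so it costs roughly a tenth of a GD step, at the price of a noisier trajectory.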
Now that we have seen that gradients are used to optimize the network parameters, we are interested
in ways to compute them efficiently.
Symbolic differentiation
Automatic differentiation
What is used in Deep Learning is called backpropagation and it is a form of automatic
differentiation. To illustrate how backpropagation works, we consider the same scalar function as
before:
\[
f = \exp\left(\exp(x) + \exp(x)^2\right) + \sin\left(\exp(x) + \exp(x)^2\right)
\]
We can represent this function as a graph, known as a computational graph, composed of inputs,
single functions, and intermediate variables.
This graph allows us to compute gradients mechanically in a top-down manner, reusing previously
computed values according to the chain rule. Specifically:
Next, we will apply the automatic differentiation method to compute the gradients of a simple
Neural Network composed of a single unit using the Mean Squared Error (MSE) as the loss function.
The architecture of the single unit is shown below in the form of a computational graph:
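Our ultimate goal is to obtain the gradient of the loss function, L(w), with respect to w. To
accomplish this, we will analyze how to compute the gradients of L(w) with respect to $z^{[3]}$, $z^{[2]}$,
$z^{[1]}$, and $w^{[0]}$ respectively.
\[
\begin{aligned}
\frac{\partial L}{\partial z^{[3]}} &= \frac{\partial z^{[3]}}{\partial z^{[3]}} = 1 \\
\frac{\partial L}{\partial z^{[2]}} &= \frac{\partial z^{[3]}}{\partial z^{[2]}} \cdot 1 \\
\frac{\partial L}{\partial z^{[1]}} &= \frac{\partial z^{[2]}}{\partial z^{[1]}} \cdot \frac{\partial z^{[3]}}{\partial z^{[2]}} \cdot 1 \\
\frac{\partial L}{\partial w^{[0]}} &= \frac{\partial z^{[1]}}{\partial w^{[0]}} \cdot \frac{\partial z^{[2]}}{\partial z^{[1]}} \cdot \frac{\partial z^{[3]}}{\partial z^{[2]}} \cdot 1
= x \cdot \sigma\big(z^{[1]}\big)\big(1 - \sigma\big(z^{[1]}\big)\big) \cdot 2\big(z^{[2]} - y\big)
\end{aligned}
\]
where, in order to compute the last value, we used the fact that
\[
\frac{\partial z^{[3]}}{\partial z^{[2]}} = 2z^{[2]} - 2y, \qquad
\frac{\partial z^{[2]}}{\partial z^{[1]}} = \sigma\big(z^{[1]}\big)\big(1 - \sigma\big(z^{[1]}\big)\big), \qquad
\frac{\partial z^{[1]}}{\partial w^{[0]}} = x
\]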
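The chain of local derivatives above can be written as a short backward pass and verified against a finite-difference estimate. The graph used here (z1 = w·x, z2 = σ(z1), z3 = (z2 − y)²) is inferred from the three local derivatives given in the text, and the scalar values `w0`, `x0`, `y0` are arbitrary numbers chosen for the check.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(w, x, y):
    """Forward pass through z1 = w*x, z2 = sigmoid(z1), z3 = (z2 - y)^2."""
    z1 = w * x
    z2 = sigmoid(z1)
    z3 = (z2 - y) ** 2
    return z1, z2, z3

def backward(w, x, y):
    """Backward pass: multiply the local derivatives from the loss back to w."""
    z1, z2, _ = forward(w, x, y)
    dL_dz3 = 1.0
    dL_dz2 = (2.0 * z2 - 2.0 * y) * dL_dz3
    dL_dz1 = z2 * (1.0 - z2) * dL_dz2   # z2 = sigmoid(z1)
    dL_dw = x * dL_dz1
    return dL_dw

w0, x0, y0 = 0.7, 1.5, 1.0
grad_backprop = backward(w0, x0, y0)

# Central finite difference as an independent check
eps = 1e-6
grad_numeric = (forward(w0 + eps, x0, y0)[2] - forward(w0 - eps, x0, y0)[2]) / (2 * eps)
```

Each intermediate gradient is computed once and reused by the next line, which is the reuse of previously computed values that makes backpropagation efficient.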
Case 2: Layer-wise
We will now analyze how to combine the gradients in a multi-layer architecture, bearing in mind
that in a neural network, the input of each layer is the output of the layer below it.
1. While the outputs are propagated in forward direction, the gradients are propagated in the
backward direction.
2. Different layers can have a different number of units (layer l has N units while layer
l − 1 has M units).
3. Typically each unit i in the l − 1 layer is connected to all N units in the layer above. Thus,
each unit of one layer receives input from all units of the layer below.
• $\delta_i^{[l-1]}$: gradient w.r.t. the ith unit in the (l − 1)th layer
• $\delta^{[l-1]}$: gradient that flows from layer l to layer l − 1
First, we compute the gradient for a single unit. In particular, we are interested in the gradient of
the j th unit of the layer l w.r.t. the ith unit of the layer l − 1.
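\[
\delta_i^{[l-1]} = \frac{\partial L}{\partial z_i^{[l-1]}}
= \sum_{j=1}^{N} \underbrace{\frac{\partial L}{\partial z_j^{[l]}}}_{\delta_j^{[l]}} \frac{\partial z_j^{[l]}}{\partial z_i^{[l-1]}}
\tag{1.1}
\]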
Next, we compute the weight update. Therefore, we need to take the derivative of L with respect
to the weights W [l] .
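\[
\frac{\partial L}{\partial W^{[l]}} = \sum_{j=1}^{N} \delta_j^{[l]} \frac{\partial z_j^{[l]}}{\partial W^{[l]}}
= \frac{\partial L}{\partial z^{[l]}} \frac{\partial z^{[l]}}{\partial W^{[l]}}
\]
Here, as seen in Equation (1.1), we define $\delta_j^{[l]} = \frac{\partial L}{\partial z_j^{[l]}}$.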
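For a concrete instance of the two formulas above, the sketch below assumes the simplest possible layer, a purely linear map z^[l] = W z^[l−1] with no bias and no activation, together with a quadratic toy loss. Under these assumptions the sum in Equation (1.1) becomes a product with the transposed weight matrix and the weight gradient becomes an outer product; the helper name `linear_backward` and all sizes are hypothetical.

```python
import numpy as np

def linear_backward(W, z_prev, delta):
    """Backward step for z = W @ z_prev, given delta = dL/dz (shape N).

    Returns dL/dz_prev (shape M) and dL/dW (shape N x M).
    """
    delta_prev = W.T @ delta          # sum over j of delta_j * dz_j/dz_prev_i
    grad_W = np.outer(delta, z_prev)  # dL/dW_ji = delta_j * z_prev_i
    return delta_prev, grad_W

rng = np.random.default_rng(0)
N, M = 3, 4                           # layer l has N units, layer l-1 has M units
W = rng.normal(size=(N, M))
z_prev = rng.normal(size=M)
target = rng.normal(size=N)

def loss(W_, z_prev_):
    return 0.5 * np.sum((W_ @ z_prev_ - target) ** 2)

delta = W @ z_prev - target           # dL/dz for this particular loss
delta_prev, grad_W = linear_backward(W, z_prev, delta)

# Finite-difference check of one entry of each gradient
eps = 1e-6
e_i = np.zeros(M)
e_i[1] = eps
num_delta_prev_1 = (loss(W, z_prev + e_i) - loss(W, z_prev - e_i)) / (2 * eps)
E = np.zeros((N, M))
E[2, 0] = eps
num_grad_W_20 = (loss(W + E, z_prev) - loss(W - E, z_prev)) / (2 * eps)
```

With a nonlinearity between the layers, the only change would be an extra elementwise factor given by the derivative of the activation.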
1.8 Block diagrams of a Single Unit Network and of a MLP 33
An MLP that uses only linear activation functions is equivalent to a single unit network
with a linear activation.
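This equivalence can be verified numerically: composing two affine layers gives another affine map. The layer sizes and the random weights below are arbitrary choices for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(5, 3)), rng.normal(size=5)
W2, b2 = rng.normal(size=(2, 5)), rng.normal(size=2)
x = rng.normal(size=3)

# Two stacked layers with identity (linear) activation
two_layer_out = W2 @ (W1 @ x + b1) + b2

# Equivalent single linear layer
W_eq = W2 @ W1
b_eq = W2 @ b1 + b2
single_layer_out = W_eq @ x + b_eq
```

The same argument applies to any number of stacked linear layers, since each additional layer only multiplies the combined weight matrix once more.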
As we announced at the beginning of the chapter, our detour ends with an investigation of the
types of functions which can be approximated using a DNN.
However, this kind of network is not enough for learning many types of functions, thus a non-linearity
between layers is really needed to make the network work.
2.1 Regularization
In general, the goal of Machine Learning is to design algorithms that perform well not only on
the training data but, above all, on new inputs. For this reason, many strategies used in ML are
explicitly designed to reduce the test error, possibly at the expense of increased training error.
In an ideal scenario, the model family we have used during training includes the data generating
process but also many other possible generating processes. In this scenario regularization pushes or
restricts the solution space towards the true generating process.
However, the majority of the DL algorithms are indeed applied to extremely complicated domains
such as images, audio sequences and text, for which understanding the true generation process
essentially involves simulating the entire universe.
36 Chapter 2. Training
Thus, in a realistic scenario the model family we have used during training may not include the
true generating process. The role of regularization techniques is thus to find the model within the
family that best explains the true data generating process.
In general, we say that training an ML model is more like "trying to fit a square peg into a round
hole".
In this section, we are going to explore the regularization methods most commonly used in deep learning.
Many regularization approaches are based on limiting the capacity of models, such as neural
networks, linear regression, or logistic regression, by adding a parameter norm penalty Ω(Θ) to the
loss function. We denote the regularized objective function by L̃:
\[
\tilde{L}(\Theta; X, y) = L(\Theta; X, y) + \lambda\,\Omega(\Theta)
\]
where λ ∈ [0, ∞) is a hyperparameter that weights the relative contribution of the norm penalty term.
Setting λ = 0 gives no regularization, while larger values of λ correspond to more regularization.
As a consequence, when our training algorithm minimizes the regularized loss function L̃ it will
decrease both the original loss L on the training data and some measure of the size of the parameters
Θ (or some subset of them).
There are mainly two ways of doing parameter norm regularization, based on the type of norm we
decide to consider:
L2 parameter regularization, also known as weight decay, is one of the simplest and most common
kinds of parameter norm penalty. This strategy drives the weights closer to the origin by adding the
penalty $\Omega(\Theta) = \frac{1}{2}\|\Theta\|_2^2$, weighted by λ, to the objective function.
How this term affects the training of the network can be seen by analyzing the
gradient of L̃ and, subsequently, the update rule for the weights given by GD.
The gradient of the loss function w.r.t. the model’s weights can be written as:
∇Θ L̃(Θ; X, y) = ∇Θ L(Θ; X, y) + λΘ
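\[
\begin{aligned}
\Theta &\leftarrow \Theta - \alpha \nabla_\Theta \tilde{L}(\Theta; X, y) \\
&= \Theta - \alpha \nabla_\Theta \Big( L(\Theta; X, y) + \frac{\lambda}{2}\|\Theta\|_2^2 \Big) \\
&= \Theta - \alpha \big( \nabla_\Theta L(\Theta; X, y) + \lambda\Theta \big) \\
&= \underbrace{(1 - \alpha\lambda)\Theta}_{\text{weight decay}} - \underbrace{\alpha \nabla_\Theta L(\Theta; X, y)}_{\text{parameters update}}
\end{aligned}
\]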
We can see that the addition of the weight decay term has modified the learning rule to
multiplicatively shrink the weight vector by a constant factor on each step, just before performing the usual
gradient update.
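The rewriting of the update rule can be checked numerically. In the sketch below `grad_L` is a random stand-in for the gradient of the unregularized loss, and the values of the learning rate `alpha` and of the regularization strength `lam` are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=4)
grad_L = rng.normal(size=4)   # stand-in for the gradient of the unregularized loss
alpha, lam = 0.1, 0.01        # learning rate and regularization strength (arbitrary)

# Gradient step on the regularized objective L + (lam / 2) * ||theta||^2
step_regularized = theta - alpha * (grad_L + lam * theta)

# The same step written as "shrink the weights, then apply the usual update"
step_decay = (1 - alpha * lam) * theta - alpha * grad_L
```

The shrink factor (1 − αλ) is slightly below 1 for small α and λ, so each step pulls the weights a little towards the origin regardless of the data.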
L1 regularization (Lasso)
While L2 weight decay remains the most common form of weight decay, there are other ways to
penalize the size of the model parameters. Another option, as we anticipated before, is to use L1
regularization.
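L1 regularization is achieved by adding to the loss the term $\|\Theta\|_1 = \sum_i |\Theta_i|$, i.e. the
sum of the absolute values of the individual parameters. The loss in this case becomes
\[
\tilde{L}(\Theta; X, y) = L(\Theta; X, y) + \lambda \|\Theta\|_1
\]
Following the same reasoning as before, we now compute its gradient to understand how
L1 regularization affects the weight updates.
\[
\nabla_\Theta \tilde{L}(\Theta; X, y) = \nabla_\Theta L(\Theta; X, y) + \lambda\,\mathrm{sign}(\Theta)
\]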
By inspecting the equation above, we can notice how the two regularization techniques are different
from each other. In particular, here the regularization contribution to the gradient no longer scales
linearly with each Θi ; instead it is a constant factor (λ) with a sign equal to sign(Θi ).
On the one hand, from a MAP perspective, they are equivalent to specifying a prior distribution on
the weights' values. In particular, the L2 regularization corresponds to a Gaussian prior on the
weights, while the L1 regularization corresponds to a Laplace prior.
On the other hand, they can be seen as a form of constrained optimization. In particular, if Ω is
the L2 norm, then the weights are constrained to lie in an L2 ball. If Ω is the L1 norm, then the
weights are constrained to lie in a region of limited L1 norm.
Figure 2.3: Position of the optimal solutions (w∗) using L1 (left) and L2 (right)
regularization, considering a 2-dimensional parameter vector w. Here, the ellipses represent curves
of losses (with a null loss in their centers) in the parameter space, while the grey areas
represent the constraints. w∗ represents the optimal parameter (the one we usually call Θ∗)
in the two cases. We can notice that, due to their resulting sharp shape, the constraints of
L1 regularization are more likely to be satisfied in the corners, where some coordinates of
Θ (in the figure called w) are zero.
The idea behind ensemble methods is to combine a finite number of different machine learning models to
obtain better performance than any one of them alone. Ensemble techniques work
because different models will usually not all make the same errors on the test set.
1. Train different model classes (Linear Regression, Decision Tree, Neural Network) on the same
data and then aggregate the predictions
2. Train same model class on different data (sampled from the original dataset) and aggregate
the predictions
In this section we are going to explore two ensemble methods: bagging and dropout.
2.1.3 Bagging
Bagging [1] (short for bootstrap aggregating) is a technique for reducing generalization error by
combining several models.
2.1.4 Dropout
We now analyze the modifications that the use of dropout introduces both at training and at test
stage.
Training Stage
Let y [l] be the input to the (l + 1)th layer in the network, f an activation function, and Θ[l] and b[l]
be respectively the weights and bias parameters at that layer.
First, each time we load an example into a mini-batch, we randomly sample each component of the
mask r[l] (which decides which neurons are kept at layer l for that specific
mini-batch) from a Bernoulli distribution with parameter p (the probability of keeping a neuron
active during training). Using this mask we compute ỹ [l] .
r[l] ∼ Bern(p)
ỹ [l] = r[l] ⊙ y [l]
Then we compute the input of the next layer as in the standard configuration, but this time using
ỹ [l] instead of y [l] .
z [l+1] = Θ[l+1] ỹ [l] + b[l+1]
y [l+1] = f (z [l+1] )
Test Stage
An alternative solution, which is in general the preferred one, is the weight scaling inference
rule. The idea behind this solution is to make the expected total input to any unit at test time equal
to the expected total input at training time. To do that, since we know by hypothesis that
r[l] ∼ Bern(p), we can define and use new parameters Θ̃ given by
\[
\tilde{\Theta} = \mathbb{E}\big[r^{[l]} \Theta\big] = \sum_{r^{[l]} \in \{0, 1\}} p\big(r^{[l]}\big)\, r^{[l]}\, \Theta = p\,\Theta + (1 - p) \cdot 0 = p\,\Theta
\]
For many classes of models that do not have nonlinear hidden units, the weight scaling inference
rule is exact. In case the model contains nonlinearities this rule is no longer exact, but still provides
a good approximation of the true geometric mean of the ensemble.
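The sketch below illustrates both stages for a single linear layer, where the weight scaling rule is exact in expectation. The keep probability, the layer sizes and the number of Monte Carlo samples are arbitrary, and the function names are hypothetical helpers introduced for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.8                            # probability of keeping a unit
y_l = rng.normal(size=50)          # activations of layer l
Theta = rng.normal(size=(10, 50))  # weights of layer l + 1
b = rng.normal(size=10)

def train_preactivation(y, rng):
    """Training: sample a Bernoulli(p) mask and zero out the dropped units."""
    r = rng.binomial(1, p, size=y.shape)
    return Theta @ (r * y) + b

def inference_preactivation(y):
    """Test: keep every unit and scale the weights by p (weight scaling inference rule)."""
    return (p * Theta) @ y + b

# Averaging over many random masks approximates the expected training-time input
mc_mean = np.mean([train_preactivation(y_l, rng) for _ in range(20000)], axis=0)
z_test = inference_preactivation(y_l)
```

Here the Monte Carlo average and the scaled-weight output agree up to sampling noise; once a nonlinearity follows the layer, the match is only approximate, as stated above.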
Figure 2.6: Test error for different architectures with and without dropout
We conclude this section by comparing the two ensembling methods we have just seen.
Dropout                                                    | Bagging
Models share parameters                                    | Models are independent
Trains (partially) only a small percentage of the models   | Trains all models until convergence
Differences in scale can have a significant effect, both when they are present in the input data and
when they are present in the target values.
In particular, large input values can result in large weight values, which make the
predictions unstable.
On the other hand, a target variable with a large spread in its values can make the training
process unstable. For this reason, it is often useful to normalize the data before training a model.
We will now analyze how this should be done.
\[
\mu = \frac{1}{n} \sum_{i=1}^{n} x_i
\]
\[
\sigma = \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \mu)^2}
\]
Thus, we normalize X (per feature, per channel, etc.) in the following way to obtain $X_N$
(N = normalized):
\[
X_N = \frac{X - \mu}{\sigma}
\]
Important: during testing, we must use the same mean and standard deviation we have found
during the training, not a new one. This allows the model to be evaluated on a single example,
without needing to use definitions of µ, σ that depend on an entire minibatch.
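A minimal sketch of this procedure for a single feature is given below. The function names and data values are illustrative assumptions; `statistics.stdev` is used because it divides by n − 1, matching the definition of σ above. The statistics are fitted on the training data only and then reused unchanged for a single test example.

```python
import statistics

def fit_normalizer(train_values):
    """Estimate mean and (sample) standard deviation on the training data only."""
    mu = statistics.mean(train_values)
    sigma = statistics.stdev(train_values)  # divides by n - 1
    return mu, sigma

def normalize(values, mu, sigma):
    """Apply X_N = (X - mu) / sigma with fixed, previously fitted statistics."""
    return [(v - mu) / sigma for v in values]

train = [2.0, 4.0, 6.0, 8.0]
test = [5.0]  # a single test example

mu, sigma = fit_normalizer(train)
train_n = normalize(train, mu, sigma)
test_n = normalize(test, mu, sigma)  # reuses the training statistics

print(mu, sigma, test_n)
```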
In the case of very deep networks, however, input and output normalization does not seem to be sufficient.
As a matter of fact, normalizing only the input may help in learning the first layer’s parameters,
but after that the data will likely no longer be normalized.
Batch normalization [5] has been proposed to try to solve this problem. The idea behind this
technique is to normalize not only the input and output data, but also the activations computed at each
intermediate layer.
Training phase
where ϵ is a small value added for numerical stability and γ, β are learnable parameters that adjust
the mean and the variance at that layer.
Test phase
At test time µ, σ might be replaced by running averages that were collected during training time.
As said before, this allows the model to be evaluated on a single example, without needing to use
definitions of µ, σ that depend on an entire minibatch.
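The training and test phases can be sketched as follows for a single scalar feature. This is a sketch under stated assumptions: it uses the commonly used formulation in which each activation is standardized with the mini-batch mean and variance (with ϵ inside the square root) and then scaled by γ and shifted by β, and it keeps exponential running averages for test time. The class name, the `momentum` hyperparameter and its value are assumptions of this example, not taken from the notes.

```python
import math

class BatchNorm1D:
    """Batch normalization for a single scalar feature (illustrative sketch)."""

    def __init__(self, gamma=1.0, beta=0.0, eps=1e-5, momentum=0.1):
        self.gamma, self.beta, self.eps, self.momentum = gamma, beta, eps, momentum
        self.running_mean, self.running_var = 0.0, 1.0

    def forward_train(self, batch):
        # Statistics of the current mini-batch.
        mu = sum(batch) / len(batch)
        var = sum((x - mu) ** 2 for x in batch) / len(batch)
        # Running averages collected for later use at test time.
        self.running_mean += self.momentum * (mu - self.running_mean)
        self.running_var += self.momentum * (var - self.running_var)
        return [self.gamma * (x - mu) / math.sqrt(var + self.eps) + self.beta
                for x in batch]

    def forward_test(self, x):
        # A single example is enough: no mini-batch statistics are needed.
        return (self.gamma * (x - self.running_mean)
                / math.sqrt(self.running_var + self.eps) + self.beta)

bn = BatchNorm1D()
batch = [1.0, 2.0, 3.0, 4.0]
for _ in range(200):  # repeated training steps on the same batch
    out_train = bn.forward_train(batch)
out_test = bn.forward_test(2.5)

print(bn.running_mean, bn.running_var, out_test)
```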
• The bias term in a linear layer (and convolutional layer) becomes redundant if you use batch
normalization after it
• Batch normalization makes weights in deeper layers more robust to changes in the weights of
the shallower layers of the network
• Each mini-batch is scaled by the mean/variance computed on just that mini-batch. This
adds some noise to the activations within that mini-batch (similar to dropout). Therefore it has a slight
regularization effect.
The best way to make a machine learning model generalize better is to train it on more data.
However, acquiring training data is expensive.
One way to get around this problem is to create new fake data by augmenting the real data.
This technique is called data augmentation.
For some machine learning tasks, such as image classification, it is reasonably straightforward to
create new fake data.
However, we need to ensure consistency between the transformed input and its label: we have to avoid
transformations that would change the correct class. In fig. 2.8 we can see two examples of
inconsistent data augmentation.
Figure 2.8: Examples of inconsistent data augmentation in classification (left) and regression
(right)
Thus, the goal before applying data augmentation is to exploit invariances (classification) or
equivariances (regression) of the function you are trying to learn to obtain new samples.
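The difference between the two cases can be illustrated with a horizontal flip. In the sketch below (all names and values are hypothetical), the class label is left untouched because the classifier should be invariant to the flip, while a keypoint used as a regression target must be transformed together with the image because the target is equivariant.

```python
def hflip_image(image):
    """Flip an image (list of rows) horizontally."""
    return [list(reversed(row)) for row in image]

def hflip_keypoint(keypoint, width):
    """Apply the same flip to a (row, col) keypoint so it stays consistent."""
    row, col = keypoint
    return (row, width - 1 - col)

image = [[0, 0, 9],
         [0, 0, 0]]
class_label = "cat"  # hypothetical classification label
keypoint = (0, 2)    # regression target: location of the bright pixel

aug_image = hflip_image(image)
aug_class_label = class_label  # invariance: the class does not change
aug_keypoint = hflip_keypoint(keypoint, width=len(image[0]))  # equivariance

print(aug_image, aug_class_label, aug_keypoint)
```

Keeping the original keypoint (0, 2) for the flipped image would be an example of inconsistent augmentation in the regression case.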
These are some useful packages to perform data augmentation:
• torchvision.transforms
• albumentations (an online demo is available)
• imgaug
Another problem we may encounter when training our neural network is a small
dataset. As a matter of fact, training on only a small amount of data generally leads to poor
generalization.
The idea of transfer learning is to first train the network on another task with a large dataset and then to fine-tune
the trained network on the original task.
The features learned from training on the large dataset can be exploited for solving the new task.
The first step can in many situations be avoided by using pre-existing architectures, already trained
on a specific task and dataset. Among the benefits of using these architectures are their
modularity and built-in regularizers. Some examples of available models are AlexNet, ResNet,
VGG, DenseNet, Inception.
If you are interested in seeing how transfer learning can be implemented in practice, we
recommend that you take a look at this tutorial.
2.2 Activation functions
Activation functions make the layer-to-layer mappings nonlinear. Without activation functions,
neural networks would only implement affine mappings.
The family of activation functions is mainly divided into two groups: the logistic (sigmoid-like) functions and the rectified
linear functions.
In this section we will see some of the most used functions from both groups and we will analyze their
properties.
Sigmoid
This function has the advantages of being differentiable everywhere and of having a finite range
(0, 1), which makes it suitable for mapping to a probability space as an output.
However, nowadays it is not the most used function (especially for hidden units) as it saturates
across most of its domain: to a high value for very positive inputs and to a low value for
very negative ones.
\[ f(x) = \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \qquad f'(x) = 1 - f(x)^2 \]
Similarly to the sigmoid function, it is differentiable everywhere and it has a finite range (in this
case (−1, 1)). As can be seen from the figure, it also shares the drawback of saturating across
most of its domain.
However, it typically performs better than the sigmoid function. As a matter of fact, it resembles
the identity function more closely, in the sense that tanh(0) = 0 while σ(0) = 0.5, and for this
reason training a model with a tanh activation function resembles training a linear model (as
long as the activations of the layer are kept small enough to avoid saturation). This makes the
training of the model easier.
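The saturation behaviour of both functions can be checked numerically. The sketch below assumes the standard logistic sigmoid σ(x) = 1/(1 + e^{−x}), whose derivative is σ(x)(1 − σ(x)), together with the tanh derivative given above; the function names are illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x):
    return 1.0 - math.tanh(x) ** 2

for x in (0.0, 2.0, 10.0):
    # Both gradients shrink towards zero as |x| grows (saturation).
    print(x, sigmoid(x), sigmoid_grad(x), math.tanh(x), tanh_grad(x))
```

Around x = 0 the tanh has slope 1 and output 0, like the identity function, whereas the sigmoid has slope 0.25 and output 0.5.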
ReLU
\[ f(x) = \max(0, x) \qquad f'(x) = \begin{cases} 0 & \text{if } x < 0 \\ 1 & \text{otherwise} \end{cases} \]
Rectified linear units (ReLU) are very similar to purely linear units; the
only difference is that they output zero for half of their domain. This behaviour brings both
advantages and disadvantages.
Among the advantages, the piecewise linearity greatly accelerates the convergence
of gradient-based optimization algorithms, especially when compared to the sigmoid and tanh
functions. Moreover, ReLU is computationally cheap to evaluate.
On the other hand, it may be a source of instability during learning. Its unbounded output range
([0, ∞)) can blow up the activations and destabilize the training. Moreover, units with negative
activations get no update (the gradient is zero for x < 0). This last phenomenon is commonly
referred to as the “dying” ReLU problem.
Some possible solutions to the “dying” ReLU problem are the following. On the one hand, one can use
a slightly modified version of this function. Among them, as we will see in the rest of
this section, are Leaky ReLU, Parametric ReLU, ELU, SELU and GELU. On the other hand, a
careful initialization of the weights can help to avoid this problem. Doing so makes it very likely
that the rectified linear units will be initially active for most inputs in the training set and allow
the derivatives to pass through.
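Leaky ReLU
\[ f(x) = \begin{cases} \alpha x & \text{for } x < 0 \\ x & \text{for } x \geq 0 \end{cases} \qquad f'(x) = \begin{cases} \alpha & \text{if } x < 0 \\ 1 & \text{if } x \geq 0 \end{cases} \]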
The Randomized Leaky ReLU is very similar to the Leaky ReLU, but it uses a small random
negative slope for negative activations rather than a fixed one.
In particular, at training time the slope α is sampled from a uniform distribution, α ∼ U(a, b),
while at test time α is set to its expected value (a + b)/2.
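The rectified linear variants discussed so far can be sketched in a few lines. The function names and the default slope α = 0.01 are assumptions of this example; the randomized variant takes a `training` flag to switch between the sampled slope and its expected value.

```python
import random

def relu(x):
    return max(0.0, x)

def relu_grad(x):
    # Zero gradient for negative inputs: such units receive no update.
    return 0.0 if x < 0 else 1.0

def leaky_relu(x, alpha=0.01):
    return x if x >= 0 else alpha * x

def leaky_relu_grad(x, alpha=0.01):
    # A small but nonzero gradient for negative inputs.
    return 1.0 if x >= 0 else alpha

def randomized_leaky_relu(x, a, b, training, rng=random):
    # Training: alpha ~ U(a, b). Test: alpha = (a + b) / 2.
    alpha = rng.uniform(a, b) if training else (a + b) / 2.0
    return x if x >= 0 else alpha * x

print(relu(-3.0), relu_grad(-3.0))
print(leaky_relu(-3.0), leaky_relu_grad(-3.0))
print(randomized_leaky_relu(-4.0, 0.1, 0.3, training=False))
```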
The process of deciding which activation function to use in practice is not straightforward.
As a matter of fact, there is no clear winner and the design process consists of trial and error and
insights into the modelled system. However, some general insights from what we have seen so far
are the following:
How should we update model weights wij in order to minimize a loss function L?
In order to do that we will first present the Gradient Descent algorithm and some of its variants
(subsection 2.3.1). Then, we will move to analyzing some of its challenges and which methods are
nowadays used to overcome them (subsection 2.3.2).
The state-of-the-art methods for training DNNs are almost all variants of gradient descent (also
called Batch GD). The idea behind that is to follow the direction of the slope of the surface created
by the objective function downhill.
First, we analyze the vanilla version of the GD algorithm, the so-called Batch Gradient Descent or,
simply, GD.
\[ \Theta_{t+1} = \Theta_t - \eta \nabla_\Theta L(\Theta_t) \]
where η is the learning rate and ∇Θ L(Θ) is the gradient of the loss function with respect to the parameters Θ computed on
the entire training set.
The computational cost of GD grows linearly with the number of samples in the dataset, thus
using Batch GD becomes prohibitive for large datasets. In order to overcome this problem,
we can use a variant of GD called Stochastic Gradient Descent (SGD). The idea is to evaluate
the gradient only on a single sample i ∈ {1, ..., n} (where n is the number of training samples contained in the
dataset).
From a theoretical point of view, the SGD gradient has the advantage of being an unbiased estimate of the true
gradient. However, it has a high variance and a step is not guaranteed to decrease the loss function at
each iteration.
In addition to that, introducing stochasticity in the process makes it possible to jump to new and potentially
better local minima. This is particularly useful when the loss function is non-convex and has many
local minima. However, it also makes it harder to converge to the minimum, since near a smoothened
minimum the SGD step is dominated by stochastic fluctuations. In practice, it is necessary to
decrease the learning rate over time for SGD to converge.
From a computational perspective, SGD is much more efficient than GD. In fact, each iteration
of SGD is n times cheaper than one of GD, as it only has to compute the gradient on a single sample
instead of the entire dataset.
Mini-batch GD
The mini-batch GD is a variant of GD that lies in between GD and SGD. In particular, the idea is
the following.
At each iteration, we sample a minibatch of m samples from the training set {x(1) , ..., x(m) } together
with the corresponding targets {y (1) , ..., y (m) }. Then, we compute the gradient estimate as a mean
of those gradients.
\[ \hat{g} = \frac{1}{m} \sum_{i} \nabla_\Theta L\big(f(x^{(i)}; \Theta), y^{(i)}\big) \]
\[ \Theta \leftarrow \Theta - \eta \hat{g} \]
This version of GD is faster per iteration than the Batch version because each update processes far fewer data points
than Batch GD (which uses the entire dataset). In addition to that, when compared to SGD, it reduces the variance
of the gradient estimate and thus typically leads to more stable convergence.
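The three variants differ only in how many samples enter each gradient estimate, which the following sketch makes explicit through a single `batch_size` argument. The setting is an assumption chosen for brevity: a one-parameter linear model f(x; w) = w·x with squared error loss on noiseless toy data, so that all three variants reach the same solution.

```python
import random

def grad(w, xs, ys):
    """Gradient of the mean of 0.5 * (w * x - y)^2 over the given samples."""
    return sum((w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def train(xs, ys, batch_size, eta=0.05, steps=500, seed=0):
    rng = random.Random(seed)
    w = 0.0
    for _ in range(steps):
        # batch_size = n: Batch GD, batch_size = 1: SGD, otherwise mini-batch GD.
        idx = rng.sample(range(len(xs)), batch_size)
        g_hat = grad(w, [xs[i] for i in idx], [ys[i] for i in idx])
        w -= eta * g_hat
    return w

xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
ys = [3.0 * x for x in xs]  # noiseless targets, true parameter w = 3

w_gd = train(xs, ys, batch_size=len(xs))
w_sgd = train(xs, ys, batch_size=1)
w_mb = train(xs, ys, batch_size=2)
print(w_gd, w_sgd, w_mb)
```

Each Batch GD step here touches all six samples, each SGD step only one. With noisy targets the SGD iterates would fluctuate around the minimum unless the learning rate is decreased over time.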
In this subsection we present the solutions to two common problems which arise in gradient-based
optimization techniques. In particular we will try to answer two questions.
How can gradient descent be modified to avoid a slowdown in regions of small gradient norms?
We will see two methods (commonly referred to as heavy ball methods) which are able to accelerate
the convergence of gradient descent in the aforementioned regions.
We will see Adagrad and RMSProp, two methods which are able to adapt the learning rate per
dimension.
Finally we will analyze the most used optimization algorithm in deep learning, the Adam algorithm,
which combines the two approaches presented before.
Polyak’s Momentum
In some settings (especially the ones characterized by a poor condition number³ of the Hessian
matrix of the loss function w.r.t. the parameters) the loss function changes very
slowly in one direction, while it is very sensitive in another one. In these cases SGD will
zigzag for a long time before reaching convergence.
The direction of the gradient in this case is not aligned with the direction toward the minimum.
The updates give us very slow progress in the shallow dimension (the one corresponding to the
small gradients) and jitter in the other one. This problem becomes even more common in higher
dimensions.
One solution to this problem is to use a momentum term that accelerates SGD in the relevant
direction and dampens oscillations.
³ Conditioning refers to how rapidly a function changes with respect to small changes in its inputs.
Functions that change rapidly when their inputs are perturbed slightly can be problematic for scientific
computation because rounding errors in the inputs can result in large changes in the output. [6]
Its main goal is to accelerate learning, especially in the face of high curvature (of the loss
function), of small but consistent gradients, and of noisy gradients (such as the ones coming
from mini-batch GD). The momentum algorithm accumulates an exponentially decaying moving
average of past gradients and continues to move in their direction.
Formally, the momentum algorithm introduces a variable v that plays the role of velocity: it is
the direction and speed at which the parameters move through parameter space. We can see the
velocity as a weighted sum of the previous gradients, with the most recent ones weighted more heavily.
A hyperparameter α ∈ [0, 1) determines how quickly the contributions of previous gradients
exponentially decay. The update rule is given by:
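\[ v \leftarrow \alpha v - \eta \nabla_\Theta \left( \frac{1}{m} \sum_{i=1}^{m} L\big(f(x^{(i)}; \Theta), y^{(i)}\big) \right) \]
\[ \Theta_{t+1} \leftarrow \Theta_t + v \]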
The larger α is relative to η, the more previous gradients affect the current direction.
As a consequence, the step size becomes larger when many successive gradients point in exactly the
same direction.
However, it has been shown that Polyak’s momentum can fail to converge even in the very simple case of a
strongly convex and smooth function, for carefully chosen α and η.
Nesterov’s Momentum
Nesterov’s Momentum pursues the same idea as Polyak’s Momentum, but evaluates the
gradient at the estimated look-ahead point Θ + αv rather than at Θ.
\[ v \leftarrow \alpha v - \eta \nabla_\Theta \left( \frac{1}{m} \sum_{i=1}^{m} L\big(f(x^{(i)}; \Theta + \alpha v), y^{(i)}\big) \right) \]
\[ \Theta_{t+1} \leftarrow \Theta_t + v \]
The idea is to give less weight to the velocity factor and more weight to the gradient.
Instead of evaluating gradient at the current position (red circle), we know that our momentum is
about to carry us to the top of the green arrow. With Nesterov momentum we therefore instead
evaluate the gradient at this "looked-ahead" position.
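Both momentum update rules can be compared with plain gradient descent on a small ill-conditioned problem. The sketch below assumes the quadratic loss L(Θ) = ½(Θ₁² + 50 Θ₂²), whose exact gradient is used instead of a mini-batch estimate; the curvature values, α = 0.9 and η = 0.01 are illustrative choices.

```python
def grad(theta, curvatures=(1.0, 50.0)):
    """Gradient of L(theta) = 0.5 * sum_k c_k * theta_k^2."""
    return [c * t for c, t in zip(curvatures, theta)]

def gd_step(theta, v, alpha, eta):
    """Plain gradient descent (the velocity and alpha are ignored)."""
    return [t - eta * g for t, g in zip(theta, grad(theta))], v

def polyak_step(theta, v, alpha, eta):
    """v <- alpha * v - eta * grad(theta); theta <- theta + v."""
    g = grad(theta)
    v = [alpha * v_k - eta * g_k for v_k, g_k in zip(v, g)]
    return [t + v_k for t, v_k in zip(theta, v)], v

def nesterov_step(theta, v, alpha, eta):
    """Same as Polyak, but the gradient is taken at theta + alpha * v."""
    lookahead = [t + alpha * v_k for t, v_k in zip(theta, v)]
    g = grad(lookahead)
    v = [alpha * v_k - eta * g_k for v_k, g_k in zip(v, g)]
    return [t + v_k for t, v_k in zip(theta, v)], v

def run(step, n_steps=500, alpha=0.9, eta=0.01):
    theta, v = [1.0, 1.0], [0.0, 0.0]
    for _ in range(n_steps):
        theta, v = step(theta, v, alpha, eta)
    return theta

theta_gd = run(gd_step)
theta_polyak = run(polyak_step)
theta_nesterov = run(nesterov_step)
print(theta_gd, theta_polyak, theta_nesterov)
```

In this setting plain gradient descent is still visibly away from the minimum along the shallow direction after 500 steps, while both momentum variants have essentially converged.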
However, this notation is a little bit annoying since usually we want to evaluate the loss and the
gradient at the same point, while here we are computing the loss in xt and the gradient in xt + αv,
As we have anticipated before, momentum is not the only way to stabilize the oscillations of the
gradient. Another solution is to adapt the learning rate to the parameters.
With adaptive learning rate strategies, ideally we would like to take smaller steps in the “steeper”
directions of the cost function. In order to do that, we make the step size inversely proportional to the magnitude of past
gradients, so that
• Dimensions with large gradients have rapid decrease in their learning rate
• Dimensions with small gradients have a small decrease in their learning rate
• Greater progress in the more gently sloped directions of parameter space
The two most known adaptive learning rate strategies are Adagrad and RMSprop.
The idea of Adagrad is to keep a sum of the past squared gradients for each dimension and to divide
the current gradient by a quantity that grows with that sum. As a result, if a dimension has a really small
gradient, it will be divided by a small quantity and its magnitude will increase, while in the
opposite case its magnitude will be decreased.
However, the accumulation of squared gradients from the beginning of training can result in a
premature and excessive decrease in the learning rate. A solution to that is given by RMSprop.
RMSprop, instead of just summing up the squared gradients and letting them accumulate during
training, takes an exponentially weighted moving average, which discards history from the
extreme past with exponential decay.
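The difference between the two accumulators can be seen on a constant gradient. The sketch below assumes the commonly used forms of the two updates, in which the gradient is divided by the square root of the accumulated quantity plus a small ϵ; the hyperparameter names (`eta`, `rho`, `eps`) and values are assumptions of this example.

```python
import math

def adagrad_step(theta, g, acc, eta=0.1, eps=1e-8):
    """Accumulate the SUM of squared gradients per dimension."""
    acc = [a + g_k * g_k for a, g_k in zip(acc, g)]
    theta = [t - eta * g_k / (math.sqrt(a) + eps) for t, g_k, a in zip(theta, g, acc)]
    return theta, acc

def rmsprop_step(theta, g, avg, eta=0.1, rho=0.9, eps=1e-8):
    """Keep an exponentially weighted moving AVERAGE of squared gradients."""
    avg = [rho * a + (1.0 - rho) * g_k * g_k for a, g_k in zip(avg, g)]
    theta = [t - eta * g_k / (math.sqrt(a) + eps) for t, g_k, a in zip(theta, g, avg)]
    return theta, avg

def step_sizes(step_fn, g, n_steps=100):
    """Record the per-dimension step length taken at every iteration."""
    theta, state, sizes = [0.0, 0.0], [0.0, 0.0], []
    for _ in range(n_steps):
        new_theta, state = step_fn(theta, g, state)
        sizes.append([abs(n - o) for n, o in zip(new_theta, theta)])
        theta = new_theta
    return sizes

g = [1.0, 0.01]  # constant gradient: one steep and one gently sloped dimension
adagrad_sizes = step_sizes(adagrad_step, g)
rmsprop_sizes = step_sizes(rmsprop_step, g)
print(adagrad_sizes[0], adagrad_sizes[-1])
print(rmsprop_sizes[0], rmsprop_sizes[-1])
```

In both methods the two dimensions take steps of the same length despite the different gradient magnitudes. Adagrad's step keeps shrinking (roughly as 1/√t here), while RMSprop's step settles near the base learning rate.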
The idea of Adam is to collect both first-order (momentum, velocity) and second-order (Adagrad/
RMSProp) moments of the gradient and thereby mix the momentum and adaptive learning rate approaches.
Let β1, β2 be the exponential decay rates for the moment estimates, Θ0 the initial parameter vector
and α the step size.
The Adam algorithm then proceeds in the following way:
Without computing the bias-corrected estimates, after one update $\hat{v}_t$ would be biased towards zero
due to its zero initialization. Thus, in the update we would be dividing by a very small number,
and we would take a very large step at the beginning just because of the zero-initialization.
Note that ϵ (often ϵ ∼ $10^{-7}$) is added for numerical stability, since we are dividing by $\sqrt{\hat{v}_t}$,
which could be very small.
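A compact sketch of the Adam update is given below. It assumes the standard formulation: exponential moving averages m and v of the gradient and of its square, both initialized at zero, bias-corrected by 1 − β₁ᵗ and 1 − β₂ᵗ. The function name, the toy objective (Θ − 3)² and the step size used in the example are assumptions.

```python
import math

def adam(grad_fn, theta0, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-7, steps=1000):
    theta = list(theta0)
    m = [0.0] * len(theta)  # first moment estimate (momentum)
    v = [0.0] * len(theta)  # second moment estimate (squared gradients)
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = [beta1 * m_k + (1.0 - beta1) * g_k for m_k, g_k in zip(m, g)]
        v = [beta2 * v_k + (1.0 - beta2) * g_k * g_k for v_k, g_k in zip(v, g)]
        # Bias correction compensates for the zero initialization of m and v.
        m_hat = [m_k / (1.0 - beta1 ** t) for m_k in m]
        v_hat = [v_k / (1.0 - beta2 ** t) for v_k in v]
        theta = [th - alpha * mh / (math.sqrt(vh) + eps)
                 for th, mh, vh in zip(theta, m_hat, v_hat)]
    return theta

def grad_fn(theta):
    """Gradient of the toy loss (theta - 3)^2."""
    return [2.0 * (theta[0] - 3.0)]

first_step = adam(grad_fn, [0.0], alpha=0.1, steps=1)
solution = adam(grad_fn, [0.0], alpha=0.1, steps=500)
print(first_step, solution)
```

With bias correction, the very first update has a magnitude of about α regardless of the gradient scale, since the corrected first moment equals g and the corrected second moment equals g².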
In this last section, we provide some suggestions about how to train a neural network in practice,
given the theoretical foundation given in the sections before.
All the optimization methods we have analyzed before have learning rate as a hyperparameter,
which must be tuned according to the specific task.
A common strategy is to set a fixed initial learning rate and to decay its value during training.
There are mainly three ways to do that:
1. Step decay: $\eta_{t+1} = \eta_0 \cdot \alpha^{\lfloor t/\tau \rfloor}$, α ∈ (0, 1)
2. Exponential decay: $\eta_{t+1} = \eta_0 \cdot e^{-kt}$, k > 0
3. Time-based decay: $\eta_{t+1} = \frac{\eta_0}{1 + kt}$
A common strategy is to start without using any decay and to introduce it later on.
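The three schedules can be written as small functions of the iteration counter t. The function names and default hyperparameter values below are illustrative assumptions, and the step decay assumes that the exponent in the formula is ⌊t/τ⌋.

```python
import math

def step_decay(eta0, t, alpha=0.5, tau=10):
    """Multiply the learning rate by alpha every tau iterations."""
    return eta0 * alpha ** (t // tau)

def exponential_decay(eta0, t, k=0.1):
    return eta0 * math.exp(-k * t)

def time_based_decay(eta0, t, k=0.1):
    return eta0 / (1.0 + k * t)

for t in (0, 10, 25):
    print(t, step_decay(1.0, t), exponential_decay(1.0, t), time_based_decay(1.0, t))
```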
In section 2.1.2 we have seen that ensembles of different models can improve the performance of a
single model. In this chapter we will see how this can be done in practice when training a neural
network.
As we have seen before, the idea behind ensemble methods is to train multiple models independently,
and then, at test time average their results.
A first approach is to use the same model with different initializations. Use cross-validation
to determine the best hyperparameters, then train multiple models with the best set of
hyperparameters but with different random initializations. The danger with this approach is that
the variety is only due to initialization.
Another possible approach is to use the top models discovered during cross-validation. Use cross-
validation to determine the best hyperparameters, then pick the top few (e.g. 10) models to form
the ensemble. This improves the variety of the ensemble but has the danger of including suboptimal
models. In practice, this can be easier to perform since it doesn’t require additional retraining of
models after cross-validation.
A third approach is to use different checkpoints of a single model. If training is very expensive,
some people have had limited success in taking different checkpoints of a single network over time
(for example after every epoch) and using those to form an ensemble. Clearly, this suffers from
some lack of variety, but can still work reasonably well in practice. The advantage of this approach
is that it is very cheap.
A fourth approach is to use a running average of parameters during training. This is similar to the
previous approach, but instead of taking checkpoints at fixed intervals, we take a running average
of the parameters over time. This is equivalent to taking the average of the parameters of the last
few epochs. This is a very cheap way to form an ensemble, but it has the disadvantage that it can
only be used at test time, since the running average is not a valid set of parameters for the model.
Garipov et al. [2] proposed an approach called Fast Geometric Ensembling (FGE). They found
that local minima are connected by simple curves with almost constant training and testing loss.
Notice that in each panel a direct linear path between each mode would incur high loss.
The key idea is to follow these curves of constant loss to explore new local minima.
More formally they have suggested the following procedure:
1. Train your model normally for about 80% of the training time
2. Adopt a cyclic LR for the remaining 20% of training time
3. Save checkpoints when LR is lowest
4. Ensemble all checkpointed models for inference
However, this approach has a drawback in terms of required inference time. If we save k checkpoints,
the inference requires k times the computations compared to a single model.
Stochastic Weight Averaging (SWA) can be interpreted as an approximation to FGE ensembles but
with the test-time, convenience, and interpretability of a single model.
The algorithm is the following:
1. Train your model normally for about 80% of the training time
2. Initialize w_SWA with the weights from your pretrained model
3. Adopt a cyclic LR for the remaining 20% of training time
4. For every cycle, when the lowest learning rate is reached, update w_SWA using a running
average
\[ w_{SWA} \leftarrow \frac{w_{SWA} \cdot n_{models} + w}{n_{models} + 1} \]
5. Use w_SWA for inference
Note that if the DNN uses batch normalization, we run one additional pass over the data to
compute the running mean and standard deviation of the activations for each layer of the network
with the w_SWA weights after the training is finished, since these statistics are not collected during
training.
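The running average in step 4 can be sketched as follows. The weight vectors are made-up numbers standing in for the network weights at the end of each learning rate cycle, and the sketch assumes that n_models starts at 1 so that the pretrained weights count as the first model in the average.

```python
def swa_update(w_swa, w, n_models):
    """Running average: w_swa <- (w_swa * n_models + w) / (n_models + 1)."""
    return [(ws * n_models + wi) / (n_models + 1) for ws, wi in zip(w_swa, w)]

w_swa = [1.0, 2.0]  # weights of the pretrained model (step 2)
n_models = 1

# Hypothetical weights reached at the lowest learning rate of each cycle.
cycle_end_weights = [[3.0, 4.0], [5.0, 0.0]]

for w in cycle_end_weights:
    w_swa = swa_update(w_swa, w, n_models)
    n_models += 1

print(w_swa)  # plain mean of the three weight vectors
```

Only one set of weights is stored and used for inference, which is why the cost at test time is that of a single model.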
Here are some advantages of SWA:
3.1 Introduction
Many tasks in the field of computer vision can be solved with CNNs.
Among them we can find:
From a neuroscientific perspective, convolutional layers have been inspired by how the visual cortex
works. Particularly influential were the discoveries made by Hubel & Wiesel between 1959 and
1968.
Hubel and Wiesel introduced the concept of cell hierarchy in the visual cortex, where different
types of cells hierarchically transform visual stimuli. They distinguished between simple, complex,
and hypercomplex cells, each playing a specific role in the processing of visual information.
An important finding was that while simple cells (found at the lowest level of the hierarchy) are
susceptible to fuzziness and noise, complex cells are not. In particular, complex cells respond to
the largest output from a bank of simple cells to achieve oriented responses that are robust to
distortion.
The HMAX model is a biologically motivated architecture for computer vision that incorporates
these neuroscientific insights. It closely aligns with existing physiological evidence, particularly in
terms of the existence and operation of simple (S) and complex (C) cells at different levels of the
visual hierarchy. Simple cells (S cells) are tuned to specific stimuli and typically have small
receptive fields. Given an input x, the response y of a simple cell is computed as follows:
\[ y = \exp\left( -\frac{1}{2\sigma^2} \sum_{j=1}^{n_{S_k}} (w_j - x_j)^2 \right) \]
On the other hand, complex cells (C cells) combine the outputs from multiple simple cells to
increase invariance and receptive field size. The output of a complex cell is computed as follows:
\[ y = \max_{j = 1, \dots, n_{C_k}} x_j \]
Research has shown that through many iterations of these operations, complex objects can be
constructed from low-level features.
In the following pages we will see how this structured hierarchical model can be translated into a
neural network architecture using convolution operations.
Convolutional Neural Networks owe their name to the convolution operation, which is the mathematical
operation at the basis of convolutional layers. For this reason, before diving into the details
of CNNs, we will first present the concept of convolution and we will give an intuition of why this
operation works well for image processing.
In deep neural networks (DNNs), our goal is to transform a given input signal f into a more
informative representation using an operator T . Among the various operators, convolutions are
an interesting class because, through their parameterization, they can express any linear, shift-
equivariant transform.
For readers who may not be familiar with the concepts of linearity, invariance, and equivariance, we
will provide a brief recap here as these concepts are essential for understanding the following pages.
Given a transform T , a function f , and two input vectors u and v, as well as scalars α and β:
T (f (u)) = T (u)
Invariance is a property we want to exploit in classification tasks. For example, if we have an image
of a cat and we shift every pixel by one unit, the image should still represent a cat. In other words,
the classifier should be invariant to the shift of one pixel.
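The two properties can be checked numerically on a 1D signal. The sketch below makes several assumptions for illustration: the transformation f is a circular shift by one position, the shift-equivariant transform is a circular correlation with a small kernel (so that T(f(u)) = f(T(u))), and the shift-invariant transform is a global maximum (so that T(f(u)) = T(u)).

```python
def shift(signal, s=1):
    """Circular shift to the right by s positions."""
    s %= len(signal)
    return signal[-s:] + signal[:-s]

def circular_correlate(signal, kernel):
    """Slide the kernel over the signal with wrap-around at the border."""
    n = len(signal)
    return [sum(k * signal[(i + m) % n] for m, k in enumerate(kernel))
            for i in range(n)]

def global_max(signal):
    return max(signal)

u = [0, 1, 5, 2, 0, 0]
kernel = [1, -1]

# Equivariance: transforming the shifted input equals shifting the output.
equivariant = circular_correlate(shift(u), kernel) == shift(circular_correlate(u, kernel))

# Invariance: the output does not change when the input is shifted.
invariant = global_max(shift(u)) == global_max(u)

print(equivariant, invariant)
```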
Now we are interested in investigating how we can obtain a specific type of linear transform that
can express convolution.
Here, I represents the input image, I ′ is the output of the operation, K is the kernel of the operation,
and N (m, n) is a neighborhood of (i, j).
In a linear transform, the value of the kernel K, which is applied to each point of the image,
depends on both the position on the image (i, j) and the position of the neighboring point (m, n)
with respect to (i, j). This dependency on the specific position (i, j) makes the linear transform
not shift-invariant. To achieve shift invariance, we need to remove the dependency on the position
(i, j), which can be done by considering kernels that are constant over the image. From now on, the
kernel K will be represented by a constant matrix, and we will write K(m, n).
In practice, to perform a shift-invariant linear transformation, we move the fixed kernel over the
entire input image. This process is known as convolution.
3.3.3 Correlation
Correlation (also referred to as the cross-correlation operation in the field of ML) is a particular case of
shift-invariant linear filtering.
In correlation, a fixed spatial pattern is shifted over the image, and the response is recorded as
the pattern is applied to different patches. The response is computed by multiplying the pattern
element-wise with the underlying portion of the image and summing the results. If the elements are similar (indicating parallelism in
Euclidean space), the outputs will be high, whereas dissimilar elements will yield low outputs.
The ability to perform pattern matching² (finding a pattern when the correlation between the
kernel and the input pixels is high) makes the correlation operation particularly useful in object
detection.
For instance, given a 3 × 3 kernel K with entries $c_{mn}$ and an input image I, the output of the correlation operation
for each cell (i, j) can be computed as follows:
\[
\begin{aligned}
I'(i, j) = {} & c_{11} I(i-1, j-1) + c_{12} I(i-1, j) + c_{13} I(i-1, j+1) \\
 & + c_{21} I(i, j-1) + c_{22} I(i, j) + c_{23} I(i, j+1) \\
 & + c_{31} I(i+1, j-1) + c_{32} I(i+1, j) + c_{33} I(i+1, j+1)
\end{aligned}
\]
¹ Edge detection is the process of identifying and highlighting the boundaries of objects in an image or
video.
² Pattern matching in computer vision refers to a set of computational techniques which enable the
localization of a template pattern in a sample image or signal.
Figure 3.4: A visualization of a correlation transform for a single pixel in position (i, j)
\[ I'(i, j) = \sum_{m=-k}^{k} \sum_{n=-k}^{k} K(m, n)\, I(i + m, j + n) \]
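The correlation formula translates directly into two nested loops over the kernel offsets. The sketch below is one possible implementation, under the assumption that the output is only computed at positions where the kernel fits entirely inside the image (no padding); the toy image and the vertical-edge kernel are illustrative.

```python
def correlate2d(image, kernel):
    """Cross-correlation of an image with a (2k+1) x (2k+1) kernel, no padding."""
    k = len(kernel) // 2
    height, width = len(image), len(image[0])
    output = []
    for i in range(k, height - k):
        row = []
        for j in range(k, width - k):
            # I'(i, j) = sum_m sum_n K(m, n) * I(i + m, j + n)
            row.append(sum(kernel[m + k][n + k] * image[i + m][j + n]
                           for m in range(-k, k + 1)
                           for n in range(-k, k + 1)))
        output.append(row)
    return output

# Toy image with a vertical edge between the second and third column.
image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]

# Kernel that responds to dark-to-bright transitions from left to right.
kernel = [[-1, 0, 1],
          [-1, 0, 1],
          [-1, 0, 1]]

response = correlate2d(image, kernel)
print(response)
```

The response is high wherever the image patch resembles the pattern encoded by the kernel, which is the pattern matching behaviour described above.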
We are now interested in exploring the physical intuition behind the reason why the convolution
operation is effective in extracting features from images.
When working with images, we are dealing with data captured by an imaging system that has a
specific response for each point of light in the scene. This response is influenced by the system’s
point spread function (PSF), which characterizes how the imaging system blurs a point source
of light.
The blurring effect caused by an imaging system can be mathematically modeled as a convolution
operation. When an image is blurred by the system, it is effectively convolved with the system’s
PSF. This convolution operation describes how the light spreads out in the image due to the
characteristics of the imaging system.
As a result, if we apply deconvolution (which is also a convolution operation) to the blurred image,
we can potentially recover the original source of the image by undoing the blurring effect caused by
the imaging system.
The concepts of convolution and correlation are closely related. In particular, the convolution
operation is defined as
\[ I'(i, j) = (I * K)(i, j) = \sum_{m} \sum_{n} K(i - m, j - n)\, I(m, n) \]
However, there are differences between convolution and correlation. One important distinction
is that convolution is commutative, which means that the order of the operands can be swapped
without changing the result. Therefore, we can equivalently write:
\[
\begin{aligned}
I'(i, j) = (I * K)(i, j) &= \sum_{m} \sum_{n} K(i - m, j - n)\, I(m, n) \\
 &= \sum_{m=-k}^{k} \sum_{n=-k}^{k} K(m, n)\, I(i - m, j - n) = (K * I)(i, j)
\end{aligned}
\]
In practice, the latter formula is often preferred for implementation in machine learning libraries as
it allows for a smaller variation in the range of valid (m, n) values. The commutative property of
convolution is the primary reason it is commonly used instead of correlation.
Additionally, it is worth noting that if the kernel satisfies K(m, n) = K(−m, −n), then correlation
and convolution become equivalent.
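The relation between the two operations can be verified with a short sketch: convolution is implemented here as correlation with a kernel rotated by 180 degrees, which follows from substituting (m, n) → (−m, −n) in the formula above. The helper names and the example kernels are assumptions, and outputs are only computed where the kernel fits inside the image.

```python
def correlate2d(image, kernel):
    """I'(i, j) = sum_m sum_n K(m, n) * I(i + m, j + n), no padding."""
    k = len(kernel) // 2
    return [[sum(kernel[m + k][n + k] * image[i + m][j + n]
                 for m in range(-k, k + 1) for n in range(-k, k + 1))
             for j in range(k, len(image[0]) - k)]
            for i in range(k, len(image) - k)]

def rot180(kernel):
    """Flip the kernel along both axes: K(m, n) -> K(-m, -n)."""
    return [row[::-1] for row in kernel[::-1]]

def convolve2d(image, kernel):
    """I'(i, j) = sum_m sum_n K(m, n) * I(i - m, j - n), no padding."""
    return correlate2d(image, rot180(kernel))

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]

asymmetric = [[1, 0, 0],
              [0, 0, 0],
              [0, 0, 0]]
symmetric = [[0, 1, 0],
             [1, 1, 1],
             [0, 1, 0]]  # satisfies K(m, n) = K(-m, -n)

print(correlate2d(image, asymmetric), convolve2d(image, asymmetric))
print(correlate2d(image, symmetric), convolve2d(image, symmetric))
```

For the asymmetric kernel the two operations pick different pixels, while for the symmetric kernel the results coincide.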
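Let’s consider a 1D convolution operation with an input image I and a kernel K. We can express
it as follows:
\[
(I * K) =
\begin{pmatrix}
k_1 & 0 & \cdots & 0 \\
k_2 & k_1 & & \vdots \\
k_3 & k_2 & \ddots & \vdots \\
\vdots & k_3 & \ddots & \vdots \\
0 & \vdots & \cdots & k_m
\end{pmatrix}
\begin{pmatrix} I_1 \\ I_2 \\ \vdots \\ I_n \end{pmatrix}
\]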
It is worth noting that the convolution operation is typically represented using an asterisk (∗)
symbol.
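The matrix form can be checked against a direct implementation. The sketch below assumes the "full" 1D convolution, whose output has n + m − 1 entries for an input of length n and a kernel of length m; the function names and values are illustrative.

```python
def conv1d_full(signal, kernel):
    """Direct 1D convolution: out[i + j] += signal[i] * kernel[j]."""
    out = [0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out

def toeplitz_matrix(kernel, n):
    """(n + m - 1) x n matrix whose columns are shifted copies of the kernel."""
    m = len(kernel)
    return [[kernel[r - c] if 0 <= r - c < m else 0 for c in range(n)]
            for r in range(n + m - 1)]

def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

signal = [1, 2, 3]
kernel = [1, 10, 100]

direct = conv1d_full(signal, kernel)
via_matrix = matvec(toeplitz_matrix(kernel, len(signal)), signal)
print(direct, via_matrix)
```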
For more practical details on this topic, please refer to the first exercise of the CNNs’ pen and
paper homeworks.
Now that we have introduced the concept of convolution and discussed why it is a suitable operation
for feature extraction from images, we can proceed to Convolutional Neural Networks (CNNs).
3.5 Convolutional Neural Network
A typical layer of a convolutional network consists of three stages. In the first stage, the layer
performs a set of convolutions in parallel, with each convolution having its own learnable kernel.
This process generates a set of activations, also known as feature maps. In the second stage, each
activation is then passed through a non-linear activation function, such as the Rectified Linear
Unit (ReLU). This stage introduces non-linearities and allows the network to capture complex
patterns and relationships within the data. In the third and final stage, a pooling operation (see
subsection 3.5.2) is employed. A pooling function replaces the output of the network at a certain
location with a summary statistic of the nearby outputs. This operation aggregates features and
obtains a representation at a lower resolution.
By combining these three stages, the convolutional layer extracts local features from the input
data, introduces non-linearities, and reduces the spatial dimensions of the features through pooling.
This hierarchical process helps the network learn hierarchical representations of the input data. In
addition to convolutional layers, CNNs also include a final dense layer. This layer aggregates the
features extracted by the preceding layers and produces the final output of the network.
Next, we will examine convolutional (section 3.5.1), pooling (subsection 3.5.2) and dense layers in
detail.
In the context of CNNs, the first step of a convolutional layer involves applying convolution to an
input image. This is achieved by convolving a kernel (also referred to as a filter in deep learning)
with the entire image. In practice, this involves sliding the kernel over the image spatially and
computing dot products.
It is important to note that filters must extend the full depth of the input volume, as illustrated in
Figure 3.7. For example, if we have an RGB image with three channels, we would apply a k × k × 3
filter to it.
When taking the dot product between the filter and a small 5 × 5 × 3 chunk of the image (resulting
We are now interested in analyzing the math of how parameters are updated in a convolutional layer.
We will focus on the case of a single filter, as the generalization to multiple filters is straightforward.
Let $z^{[l-1]}_{i,j}$ be the output of the (l − 1)-th layer at position (i, j), let $w^{[l]}_{m,n}$ be the weight of the filter
in position (m, n), and let b be the bias parameter (every convolutional layer has one bias parameter
per filter). We can write the output of the l-th layer as:
\[ z^{[l]}_{i,j} = W^{[l]} \cdot z^{[l-1]} + b = \sum_{m} \sum_{n} w^{[l]}_{m,n}\, z^{[l-1]}_{i-m,j-n} + b \]
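Let L be the loss function and let again $z^{[l]}_{i,j}$ be the output at the l-th layer in position (i, j).
First, we perform the forward pass. We can express the derivative of the cost function with
respect to the output of the (l − 1)-th layer as:
\[
\begin{aligned}
\delta^{[l-1]}_{i,j} = \frac{\partial L}{\partial z^{[l-1]}_{i,j}}
 &= \sum_{i'} \sum_{j'} \frac{\partial L}{\partial z^{[l]}_{i',j'}} \frac{\partial z^{[l]}_{i',j'}}{\partial z^{[l-1]}_{i,j}} \\
 &= \sum_{i'} \sum_{j'} \delta^{[l]}_{i',j'} \frac{\partial \left( \sum_{m} \sum_{n} w^{[l]}_{m,n}\, z^{[l-1]}_{i'-m,j'-n} + b \right)}{\partial z^{[l-1]}_{i,j}} \\
 &= \sum_{i'} \sum_{j'} \delta^{[l]}_{i',j'}\, w^{[l]}_{m,n}
\end{aligned}
\]
The last equality follows because, in the sum that produces the element $z^{[l]}_{i',j'}$, the only term
whose derivative is not zero is the one containing $z^{[l-1]}_{i,j}$, i.e. the one with i = i′ − m and j = j′ − n (or
equivalently, m = i′ − i and n = j′ − j). Substituting these indices gives
\[ \delta^{[l-1]}_{i,j} = \sum_{i'} \sum_{j'} \delta^{[l]}_{i',j'}\, w^{[l]}_{i'-i,\,j'-j} = \delta^{[l]} * \underbrace{\mathrm{ROT180}\big(W^{[l]}\big)}_{\text{kernel flipped}} \]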
Figure 3.8: The blue pixel contributes only once to the computation of the green pixel.
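\[
\begin{aligned}
\frac{\partial L}{\partial w^{[l]}_{m,n}}
&= \sum_{i} \sum_{j} \frac{\partial L}{\partial z^{[l]}_{i,j}} \frac{\partial z^{[l]}_{i,j}}{\partial w^{[l]}_{m,n}} \\
&= \sum_{i} \sum_{j} \delta^{[l]}_{i,j} \, \frac{\partial z^{[l]}_{i,j}}{\partial w^{[l]}_{m,n}} \\
&= \sum_{i} \sum_{j} \delta^{[l]}_{i,j} \, \frac{\partial \left( \sum_{m} \sum_{n} w^{[l]}_{m,n} \, z^{[l-1]}_{i-m,\,j-n} + b \right)}{\partial w^{[l]}_{m,n}} \\
&= \sum_{i} \sum_{j} \delta^{[l]}_{i,j} \, z^{[l-1]}_{i-m,\,j-n}
\end{aligned}
\]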
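The two gradient expressions derived in this section can be checked on a small example. The following NumPy sketch is only an illustration under stated assumptions: a single-channel input, a single square filter, stride 1 and no padding. The function names `conv2d` and `conv2d_backward` are hypothetical. The output index (p, q) in the code corresponds to position (i, j) = (p + k − 1, q + k − 1) in the formulas, so that the indices i − m and j − n always stay inside the input.

```python
import numpy as np

def conv2d(x, w, b=0.0):
    """Forward pass: z[i, j] = sum_m sum_n w[m, n] * x[i - m, j - n] + b (valid positions only)."""
    k = w.shape[0]
    out = np.zeros((x.shape[0] - k + 1, x.shape[1] - k + 1))
    for p in range(out.shape[0]):
        for q in range(out.shape[1]):
            for m in range(k):
                for n in range(k):
                    out[p, q] += w[m, n] * x[p + k - 1 - m, q + k - 1 - n]
    return out + b

def conv2d_backward(x, w, delta):
    """Given delta = dL/dz (same shape as the output), return dL/dx, dL/dw and dL/db."""
    k = w.shape[0]
    dx = np.zeros_like(x)
    dw = np.zeros_like(w)
    for p in range(delta.shape[0]):
        for q in range(delta.shape[1]):
            for m in range(k):
                for n in range(k):
                    i, j = p + k - 1 - m, q + k - 1 - n
                    dx[i, j] += delta[p, q] * w[m, n]   # error propagated to the input
                    dw[m, n] += delta[p, q] * x[i, j]   # gradient of the filter weight
    return dx, dw, delta.sum()

# Example with the loss L = 0.5 * sum(z^2), for which delta = z.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 5))
w = rng.normal(size=(3, 3))
z = conv2d(x, w, b=0.1)
dx, dw, db = conv2d_backward(x, w, delta=z)
```

Comparing `dx` and `dw` with finite differences of the loss is a simple way to convince oneself that the index bookkeeping is right.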
The pooling layer substitutes the output of the network at a specific position with a condensed
representation of the adjacent outputs, typically in the form of a statistical summary.
This operation reduces the size of the representations and makes them more manageable.
It’s important to note that the pooling operation is applied independently to each activation map.
Max pooling
One possible pooling operation is max pooling, which outputs the maximum value of the input
within a given region.
Note that, since the max-pooling layer has no learnable parameters, its backward pass only
propagates the error to the previous layer; it is not used for any weight update.
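As an illustration, the sketch below implements 2 × 2 max pooling with stride 2 on a single activation map, together with its backward pass. It assumes the usual convention that the incoming error is routed to the position that attained the maximum in each region, while all other positions receive zero; the function names are hypothetical.

```python
import numpy as np

def maxpool2x2(x):
    """Return the pooled map and, for each 2x2 region, the index (0..3) of its maximum."""
    H, W = x.shape
    # blocks[p, q] holds the four values of region (p, q) in row-major order
    blocks = x.reshape(H // 2, 2, W // 2, 2).transpose(0, 2, 1, 3).reshape(H // 2, W // 2, 4)
    return blocks.max(axis=2), blocks.argmax(axis=2)

def maxpool2x2_backward(delta, argmax, input_shape):
    """Route each error value back to the position of the maximum; no weights are involved."""
    H, W = input_shape
    blocks = np.zeros((H // 2, W // 2, 4))
    rows, cols = np.indices(argmax.shape)
    blocks[rows, cols, argmax] = delta
    return blocks.reshape(H // 2, W // 2, 2, 2).transpose(0, 2, 1, 3).reshape(H, W)

x = np.array([[1., 2., 5., 6.],
              [3., 4., 8., 7.],
              [9., 1., 0., 2.],
              [1., 1., 3., 1.]])
pooled, argmax = maxpool2x2(x)
dx = maxpool2x2_backward(np.ones_like(pooled), argmax, x.shape)
```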
The dense layer, also referred to as a fully connected layer, complements the role of convolutional
and pooling layers in capturing local features and reducing spatial dimensions.
Its crucial function lies in aggregating the extracted features and generating the final output of
the network. In this layer, each neuron performs a weighted sum of all its inputs and applies a
non-linear activation function. By learning complex relationships among the features extracted by
preceding layers, the dense layer enables the network to make predictions based on these learned
representations.
3.6 Practical observations
Here are some practical observations that may be useful for the exercises:
• Computing A ∗ B means striding the kernel B over A. In general, we use a kernel whose
dimensions are equal to or smaller than those of the image matrix.
• A ∗ ROT180(B) = A ⋆ B
• A ⋆ B = ROT180(B ⋆ A)
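These identities can be verified numerically. In the sketch below, ∗ is implemented as a true convolution (the kernel is accessed with flipped indices) and ⋆ as a cross-correlation (no flipping). All function names are hypothetical. The first identity is checked in "valid" mode; for the second one we assume "full" mode (zero padding so that the two operands overlap by at least one cell), since otherwise the larger operand could not be slid over the smaller one.

```python
import numpy as np

def rot180(b):
    """Flip a 2D array along both axes."""
    return b[::-1, ::-1]

def cross_correlate(a, b):
    """A ⋆ B in valid mode: slide B over A without flipping it."""
    bh, bw = b.shape
    out = np.zeros((a.shape[0] - bh + 1, a.shape[1] - bw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(a[i:i + bh, j:j + bw] * b)
    return out

def convolve(a, b):
    """A ∗ B in valid mode, written index by index with a flipped kernel access."""
    bh, bw = b.shape
    out = np.zeros((a.shape[0] - bh + 1, a.shape[1] - bw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for m in range(bh):
                for n in range(bw):
                    out[i, j] += b[m, n] * a[i + bh - 1 - m, j + bw - 1 - n]
    return out

def full(op, a, b):
    """Apply `op` after zero-padding A by the size of B minus one on every side."""
    pad = ((b.shape[0] - 1,) * 2, (b.shape[1] - 1,) * 2)
    return op(np.pad(a, pad), b)

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 5))
B = rng.normal(size=(2, 3))
first_identity_holds = np.allclose(convolve(A, rot180(B)), cross_correlate(A, B))
second_identity_holds = np.allclose(full(cross_correlate, A, B),
                                    rot180(full(cross_correlate, B, A)))
```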
4. Fully Convolutional Neural Network
Semantic segmentation is a critical task in computer vision that involves assigning a semantic class
to each pixel in an image. While traditional image classification outputs a single class for the entire
image, semantic segmentation requires classifying each pixel individually.
The most straightforward approach to pixel-wise classification is to classify each pixel individually,
extracting features from a patch centered on it (Figure 4.1).
However, this method is inefficient and redundant for processing large images. Instead, practitioners
adopt a pipeline that involves using the entire image as input to a CNN. The final fully connected
layer, typically used for image classification, is removed, and the resulting feature maps are used as
segmentation predictions. Due to convolutions and max-pool operations, these predictions have
lower resolution than the original image.
To obtain the same resolution as the input image, we could keep the same dimensions by using
appropriate padding in the convolutions and avoiding pooling layers (Figure 4.2). However, this
method can be computationally expensive.
In practice, the most common approach is to downsample the features obtained using convolution
and pooling layers and then upsample them again. By applying convolution to a smaller object,
this method is more computationally efficient while producing output with the same resolution as
the input.
While downsampling can be achieved with pooling and strided convolution, there are various
techniques for upsampling that we will now explore.
Upsampling techniques, such as unpooling and transposed convolutions, are commonly used in
semantic segmentation to increase the spatial feature size. These techniques can be divided into
two categories: fixed and learnable.
In this section, we will explore three common fixed upsampling techniques: nearest neighbor, bed
of nails, and max unpooling.
Nearest neighbor upsampling involves upsampling features by copying the same value into all
corresponding pixels at a higher resolution. In contrast, bed of nails upsampling involves padding
with zero neighbor values, resulting in a sparse matrix as output. Figure 4.3 provides a visualization
of these two techniques.
Max unpooling is another fixed upsampling technique that uses zero padding as in bed of nails.
However, it also remembers the original position of the maximum value before the corresponding
max-pooling in the downsampling phase. This information is then used to place each element back
in the correct position. Figure 4.4 provides a visualization of this technique.
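The three fixed techniques can be sketched in a few lines of NumPy. This is a minimal illustration for a single feature map and an upsampling factor of 2. The function names are hypothetical, and for max unpooling we assume that the positions of the maxima were stored during pooling as flat indices into the high-resolution map.

```python
import numpy as np

def nearest_neighbor(x, s=2):
    """Copy every value into an s x s block of the output."""
    return np.repeat(np.repeat(x, s, axis=0), s, axis=1)

def bed_of_nails(x, s=2):
    """Place every value in the top-left corner of an s x s block; the rest is zero."""
    out = np.zeros((x.shape[0] * s, x.shape[1] * s), dtype=x.dtype)
    out[::s, ::s] = x
    return out

def max_unpool(x, flat_idx, out_shape):
    """Place every value at the position remembered from the corresponding max pooling."""
    out = np.zeros(out_shape, dtype=x.dtype)
    out.flat[flat_idx.ravel()] = x.ravel()
    return out

x = np.array([[1, 2],
              [3, 4]])
remembered = np.array([[5, 2],      # flat positions of the maxima in the 4 x 4 map
                       [8, 15]])
up_nn = nearest_neighbor(x)
up_nails = bed_of_nails(x)
up_unpool = max_unpool(x, remembered, (4, 4))
```

Only `max_unpool` needs information from the downsampling phase; the other two depend on the low-resolution input alone.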
Fixed upsampling techniques are brute force upsampling approaches that do not involve learning.
In contrast, learnable upsampling techniques, such as transposed convolutions (also known as
deconvolutions), make use of learning.
As we have seen before, the problem with convolutions and pooling is that they result in output
with lower resolution. To address this issue, we can use transposed convolutions.
In practice, given a low-resolution image, we learn a kernel (e.g., 2 × 2) that is used to produce
all the terms whose sum will be the final output. Each term is obtained by multiplying all the
elements of the kernel by the value of one single input pixel and then inserting the result in the
correct position of a matrix of the same size as the output. Note that each term of the sum is a
sparse matrix, potentially with non-zero terms only in a number of pixels equal to the kernel size.
Figure 4.5 provides a visualization of this process.
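The procedure described above can be written down directly: each input pixel scales the kernel, and the scaled copy is added into the output at the position determined by the stride. The sketch below is a minimal single-channel illustration; the function name and the choice of stride are assumptions made for the example, and in a real network the kernel entries would be learned.

```python
import numpy as np

def transposed_conv2d(x, kernel, stride=2):
    """Sum of sparse terms: one scaled copy of the kernel per input pixel."""
    H, W = x.shape
    kh, kw = kernel.shape
    out = np.zeros(((H - 1) * stride + kh, (W - 1) * stride + kw))
    for i in range(H):
        for j in range(W):
            top, left = i * stride, j * stride
            out[top:top + kh, left:left + kw] += x[i, j] * kernel
    return out

x = np.array([[1., 2.],
              [3., 4.]])
kernel = np.ones((2, 2))            # stands in for a learned 2 x 2 kernel
upsampled = transposed_conv2d(x, kernel, stride=2)   # 4 x 4 output
overlapping = transposed_conv2d(x, kernel, stride=1) # 3 x 3 output, terms overlap and add up
```

With a stride equal to the kernel size the terms do not overlap; with a smaller stride, overlapping contributions are summed.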
U-Net [10] is a popular fully convolutional neural network (FCNN) architecture that has been
widely used for semantic segmentation tasks. The main idea behind U-Net is to combine global
and local feature maps by copying corresponding tensors from earlier stages in each upsampling
stage. This allows the network to capture both local¹ and global context, leading to more accurate
semantic segmentation results.
¹ Residual connections help to maintain local features, as images are not completely downsampled at every
stage.
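The copying of tensors from earlier stages can be illustrated with a toy example. The sketch below is not the U-Net architecture itself: the learned convolutions are omitted, and 2 × 2 max pooling and nearest-neighbor upsampling are used as simple stand-ins for the downsampling and upsampling stages. It only shows how a high-resolution feature map from the encoder is concatenated, along the channel axis, with the upsampled feature map of the decoder.

```python
import numpy as np

def downsample(x):
    """2 x 2 max pooling on a (C, H, W) feature map."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))

def upsample(x):
    """Nearest-neighbor upsampling by a factor of 2 on a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

rng = np.random.default_rng(0)
encoder_features = rng.normal(size=(8, 16, 16))   # early stage: local detail
bottleneck = downsample(encoder_features)         # (8, 8, 8): coarser, more global
decoder_features = upsample(bottleneck)           # back to (8, 16, 16)
merged = np.concatenate([encoder_features, decoder_features], axis=0)  # (16, 16, 16)
```

The subsequent layers of the decoder then see both the coarse, upsampled information and the fine detail copied from the encoder.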
4.4 Applications
Semantic segmentation is one of the most common applications of upsampling techniques, and it
has been used in a variety of fields, including medical imaging, autonomous driving, and robotics.
However, there are several other applications of upsampling techniques as well. These include:
• Image generation from semantic labels (i.e., creating realistic images from semantic labels)
• Human pose estimation (i.e., estimating the pose of a person in an image)
• Human shape estimation (i.e., estimating the 3D shape of a person from a 2D image)
Overall, upsampling techniques have a wide range of applications in computer vision, and their use
has led to significant advancements in the field.
5. Recurrent Neural Network
In this chapter we present Recurrent Neural Networks. In section 5.1 we introduce the concept
of recurrent neural networks and look at some potential applications. In section 5.2 we introduce
the concept of a dynamical system, which underlies the structure of recurrent networks. In
section 5.3 we introduce the simplest recurrent architecture and analyze its failure cases. In
section 5.4 we show how more complex architectures such as LSTM can solve the problems of vanilla
RNNs. In the same section we also explain how gradient clipping can avoid instabilities which may
occur during the training of a recurrent network.
5.1 Introduction
Recurrent Neural Networks (RNNs) are a type of neural network that can process sequential data.
Unlike traditional feedforward neural networks, which take fixed-length inputs, RNNs can take
inputs of variable length, and can maintain an internal memory of the past inputs that they have
seen. This makes them well-suited to tasks such as sequence prediction, language modeling, and
machine translation.
Different combinations of input and output lengths yield different applications of RNNs. Some
examples are:
• One to one: vanilla RNN, where at each time step we have one input and one output.
• One to many: this is the case of Image Captioning, where the input is one image (one
element) and the output is a sequence of words (many elements).
• Many to one: this is the case of Sentiment Classification, where the input is a sequence
of words (many elements) and the output is the sentiment linked to them (one label).
• Many to many: this is the case of Machine Translation, where we translate a sequence of
words (many elements) to another sequence of words (many elements). Another common case
is Video Classification at the frame level, where at each time step the input is the present
frame together with the previous ones, which are encoded in the hidden state (many
elements), and we output a label for each of them (many elements).
Figure 5.1: RNNs applications and structure of the nets. The red rectangles represent
inputs, the green ones the hidden state and the blue ones the outputs. The cases reported,
respectively, refer to vanilla RNN, Image Captioning, Sentiment Classification, Machine
Translation and Video Classification.
At their core, RNNs are a type of dynamical system. In general, a dynamical system is a
mathematical concept used to describe the behavior of a system over time. It is represented by a
set of rules that determine how a system changes from one state to another over time, based on
its current state. In the context of recurrent neural networks, the hidden state of the network at
each time step can be thought of as a representation of the current state of the system, and the
transition function that updates the state can be thought of as the set of rules that govern the
behavior of the system over time.
In dynamical systems without input, the state at time t is a function of s<t−1> , thus
s<t> = f (s<t−1> ; Θ), as shown in Figure 5.2.
For example, in a finite horizon setting, we can unroll the recurrence to obtain:
s<3> = f (s<2> ; Θ) = f (f (s<1> ; Θ); Θ)
In the case of RNNs, the state at time t depends not only on the state at the previous time step
t − 1, but also on some input at the current time step x<t> . In other words, we can write the
RNN as a function that maps the previous state and the current input to the current state as:
s<t> = f (s<t−1> , x<t> ; Θ)
Here, Θ denotes the set of parameters that the RNN learns during training.
It is important to note that these parameters remain the same across all time steps, allowing the
RNN to learn a single model that can handle sequences of arbitrary length. The most common
representation of an RNN is shown in Figure 5.4.
A first way to do that is to consider a function g <t> that takes as input all the previous timesteps,
as shown in Figure 5.5. However, this option has the drawback of requiring variable-length input.
Another option is to write the recurrence using a function f (h<t−1> , x<t> ; Θ) that takes as inputs
the input of the current timestep x<t> , the previous hidden state h<t−1> and the set of parameters
Θ. This approach has the advantage of using the same transition function for all time steps, meaning
that the network learns to generalize across the entire sequence. In practice, the same set of
parameters is used for all time steps, allowing for efficient computation and easier training.
The Vanilla version of a Recurrent Neural Network (RNN) is characterized by a single hidden
vector, denoted as h<t> , which forms the state of the network.
The equations for the Vanilla RNN, as depicted in Figure 5.7, are as follows:
where h<t> = f (h<t−1> , x<t> ; Θ) represents the hidden state at time step t.
Compared with an MLP we notice two main differences. Instead of using a sigmoid activation function,
the RNN employs the hyperbolic tangent function (tanh). Intuitively, this allows the hidden states
to take both positive and negative values, enabling the model to cancel some information about
past events. Moreover, the layer at time step t depends not only on the previous hidden state, but
also on the input x<t> at that time step.
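To make the recurrence concrete, the sketch below runs a vanilla RNN over sequences of different lengths with the same parameters. It assumes the common tanh update h<t> = tanh(h<t−1> W_hh + x<t> W_xh) with row vectors and no bias term; the function name and the dimensions are chosen only for illustration.

```python
import numpy as np

def rnn_forward(xs, h0, W_hh, W_xh):
    """Apply the same transition function at every time step and collect the hidden states."""
    h = h0
    hidden_states = []
    for x in xs:                              # one iteration per time step
        h = np.tanh(h @ W_hh + x @ W_xh)      # new state from previous state and current input
        hidden_states.append(h)
    return np.array(hidden_states)

rng = np.random.default_rng(0)
n, d = 4, 3                                   # hidden size and input size (arbitrary)
W_hh = rng.normal(size=(n, n))
W_xh = rng.normal(size=(d, n))
h0 = np.zeros(n)

short_sequence = rng.normal(size=(2, d))
long_sequence = rng.normal(size=(7, d))
hs_short = rnn_forward(short_sequence, h0, W_hh, W_xh)   # shape (2, n)
hs_long = rnn_forward(long_sequence, h0, W_hh, W_xh)     # shape (7, n)
```

Because tanh is used, every entry of the hidden state lies between −1 and 1, and can be positive or negative.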
Figure 5.8: Signal flow in folded (left) and unfolded (right) RNN
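In the context of recurrent neural networks (RNNs), we use the following equations to describe
the network’s behavior:
\begin{align}
h^{\langle t \rangle} &= f\left(h^{\langle t-1 \rangle}, x^{\langle t \rangle}; W\right) \tag{5.3} \\
\hat{y}^{\langle t \rangle} &= W_{hy} \, h^{\langle t \rangle} \tag{5.4} \\
L^{\langle t \rangle} &= \left\lVert \hat{y}^{\langle t \rangle} - y^{\langle t \rangle} \right\rVert^2 \tag{5.5}
\end{align}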
Given a finite horizon of size S, our objective is to compute the partial derivative of the overall loss
(which is the sum of individual losses) with respect to the network’s weights:
\[
\frac{\partial L}{\partial W} = \sum_{t=1}^{S} \frac{\partial L^{\langle t \rangle}}{\partial W} \tag{5.6}
\]
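To do this computation, it is crucial to view the unrolled recurrent model as a multi-layer network,
with a potentially infinite number of layers. We can then apply backpropagation to efficiently
compute the gradients in this extended network structure. For each time step t we can write:
\[
\frac{\partial L^{\langle t \rangle}}{\partial W} = \sum_{k=1}^{t} \frac{\partial L^{\langle t \rangle}}{\partial \hat{y}^{\langle t \rangle}} \, \frac{\partial \hat{y}^{\langle t \rangle}}{\partial h^{\langle t \rangle}} \, \frac{\partial h^{\langle t \rangle}}{\partial h^{\langle k \rangle}} \, \frac{\partial^{+} h^{\langle k \rangle}}{\partial W} \tag{5.7}
\]
The sum over k appears because h<t> depends on all the previous h<k> . The term
$\partial^{+} h^{\langle k \rangle} / \partial W$ is the immediate derivative, which treats h<k−1> as constant w.r.t. the weight W .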
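Equation (5.7) can be checked on a scalar toy model. The sketch below is an illustration under assumptions that are not part of the notes: a one-dimensional RNN h<t> = tanh(w h<t−1> + u x<t>) with output ŷ<t> = v h<t> and squared-error loss, where we differentiate with respect to the recurrent weight w only. The function names are hypothetical.

```python
import numpy as np

def forward(w, u, xs, h0=0.0):
    """Return [h<0>, h<1>, ..., h<S>] for the scalar RNN h<t> = tanh(w h<t-1> + u x<t>)."""
    hs = [h0]
    for x in xs:
        hs.append(np.tanh(w * hs[-1] + u * x))
    return hs

def loss_t(w, u, v, xs, y, t):
    """Loss at time step t: (v h<t> - y)^2."""
    return (v * forward(w, u, xs)[t] - y) ** 2

def grad_w_bptt(w, u, v, xs, y, t):
    """dL<t>/dw as the sum over k of the four factors in equation (5.7)."""
    hs = forward(w, u, xs)
    dL_dyhat = 2.0 * (v * hs[t] - y)
    dyhat_dht = v
    total = 0.0
    for k in range(1, t + 1):
        # dh<t>/dh<k>: product of the one-step derivatives w * (1 - h<i>^2), i = k+1..t
        dht_dhk = np.prod([w * (1.0 - hs[i] ** 2) for i in range(k + 1, t + 1)])
        # immediate derivative of h<k> w.r.t. w, with h<k-1> treated as a constant
        dplus_hk_dw = (1.0 - hs[k] ** 2) * hs[k - 1]
        total += dL_dyhat * dyhat_dht * dht_dhk * dplus_hk_dw
    return total

w, u, v = 0.7, 0.5, 1.3
xs, y, t = [0.2, -0.4, 0.9, 0.1], 0.3, 4
analytic = grad_w_bptt(w, u, v, xs, y, t)
eps = 1e-6
numeric = (loss_t(w + eps, u, v, xs, y, t) - loss_t(w - eps, u, v, xs, y, t)) / (2 * eps)
```

The two values `analytic` and `numeric` should agree up to the accuracy of the finite-difference approximation.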
where f is the activation function. Assuming the existence of an eigenvalue decomposition of the
weight matrix $W_{hh}$ (i.e., $W_{hh}$ is symmetric), we can alternatively express it as $W_{hh} = Q^\top \Lambda Q$, where
Λ is a diagonal matrix containing the eigenvalues of $W_{hh}$ along its diagonal. By rearranging the
previous equation, we obtain:
\[
\left( W_{hh}^\top \right)^{t-k-1} = \left( Q^\top \Lambda Q \right)^{t-k-1} = Q^\top \Lambda^{t-k-1} Q \tag{5.12}
\]
where the last step is due to the fact that $Q Q^\top = I$, as Q is orthogonal.¹ Notably, we observe that
Λ is raised to a power that grows with t. We are now interested in analyzing the influence of the
eigenvalues on the final matrix.
If we consider f to be a sigmoid or a hyperbolic tangent (whose derivatives are both upper bounded
by 1), we can say that there is a γ ∈ R s.t.
\[
\left\lVert \mathrm{diag}\left( f'\left( h^{\langle i-1 \rangle} \right) \right) \right\rVert < \gamma \tag{5.13}
\]
¹ For example, for k′ = 2 we have $\left( W_{hh}^\top \right)^{k'} = Q^\top \Lambda \underbrace{Q Q^\top}_{I} \Lambda Q = Q^\top \Lambda^2 Q$.
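Let λ1 be the largest singular value of the matrix $W_{hh}$. We will now show that the behaviour
of the gradients depends on whether λ1 is smaller or larger than 1/γ. In particular, if λ1 < 1/γ, then
the gradient vanishes. In the other case, the gradient can explode. Let us formally prove the first
statement.
\[
\forall i, \quad \left\lVert \frac{\partial h^{\langle i \rangle}}{\partial h^{\langle i-1 \rangle}} \right\rVert \le \left\lVert W_{hh}^\top \right\rVert \, \left\lVert \mathrm{diag}\left[ f'\left( h^{\langle i-1 \rangle} \right) \right] \right\rVert < \frac{1}{\gamma} \, \gamma = 1 \tag{5.14}
\]
Here ∥·∥ is the spectral norm. Let η ∈ R be such that ∀i, ∥∂h<i> /∂h<i−1> ∥ ≤ η < 1.
By induction over i:
\[
\left\lVert \prod_{i=k+1}^{t} \frac{\partial h^{\langle i \rangle}}{\partial h^{\langle i-1 \rangle}} \right\rVert \le \eta^{\,t-k} \to 0 \quad \text{as } t \to \infty \tag{5.15}
\]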
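The bound can be observed numerically. The sketch below assumes f = tanh (so that the derivative is at most 1, i.e. γ = 1 works as a non-strict bound) and a random recurrent matrix rescaled to have largest singular value 0.9; the hidden size, the random pre-activations and the function name are arbitrary choices made for the illustration.

```python
import numpy as np

def jacobian_product_norms(W_hh, pre_activations):
    """Spectral norm of the running product of one-step Jacobians W_hh^T diag(f'(.))."""
    prod = np.eye(W_hh.shape[0])
    norms = []
    for a in pre_activations:
        jac = W_hh.T @ np.diag(1.0 - np.tanh(a) ** 2)   # f'(a) = 1 - tanh(a)^2 <= 1
        prod = jac @ prod
        norms.append(np.linalg.norm(prod, 2))           # ord=2 is the spectral norm
    return norms

rng = np.random.default_rng(0)
W_hh = rng.normal(size=(4, 4))
W_hh *= 0.9 / np.linalg.norm(W_hh, 2)                   # largest singular value = 0.9 < 1/gamma
norms = jacobian_product_norms(W_hh, rng.normal(size=(50, 4)))
upper_bounds = [0.9 ** (t + 1) for t in range(50)]      # eta^(t-k) with eta = 0.9, k = 0
```

Each entry of `norms` stays below the corresponding entry of `upper_bounds`, so the contribution of distant time steps to the gradient decays at least geometrically.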
For the reasons explained in this section (gradients are not meaningful when T → ∞), RNNs
struggle in practice to capture long-term dependencies. In the next section we will see how LSTMs
solve this problem.
5.4 Solving the problem of vanishing and exploding gradient: LSTM and Friends
Here C denotes the cell state, i.e. our memory: we want something that summarizes the memory in
this cell state, but keeps the gradient alive.
5.4.2 LSTM
Long Short Term Memory networks [4], or LSTMs, are a special kind of RNN, capable of learning
long-term dependencies.
The structure of their cells is very different from that of a vanilla RNN, whose cell consists
of a single internal layer, where the cell state and the input are transformed by a single affine
transformation and a point-wise non-linearity.
The cell of an LSTM instead consists of four layers, interacting in a very special way. In particular,
these layers, also called gates, have the following functions.
• f is the forget gate and has the role of scaling the old cell state c<t−1> . Depending on x<t>
and h<t−1> , it decides which information should be forgotten from the previous cell state.
Its output is a sigmoided value which, for each element of the previous cell state c<t−1> ,
decides how much of the old state is kept in the current one (0 deletes it, 1 keeps the element
entirely).
• i is the input gate and has the role of deciding which values of the cell state should be
updated at the current time step. Its output is a sigmoided value which, for each element
of the cell state, decides how much of the candidate value should be written in the current
cell state c<t> (0 writes nothing, 1 writes everything).
• o is the output gate and has the role of deciding which values of the current cell state
should be put in output of the cell h<t> . Like the previous gates, its output is a sigmoided
value which, for each element of the current cell state c<t> , decides how much of it should
be put in output.
• g is the gate that decides what to write in the cell state. It is a tanh layer, which creates a
vector of new candidate values.
In practice, the idea is to stack the vectors x<t> and h<t−1> and to multiply them by a large weight
matrix in order to obtain the four different values i, f , o, g, with the roles described before.
Given that the cell state c<t> , the input x<t> and the output h<t−1> have dimensionality n, and
given W ∈ R^{4n×2n} , we compute i, f , o, g as follows.
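\[
\begin{pmatrix} i \\ f \\ o \\ g \end{pmatrix}
=
\begin{pmatrix} \mathrm{sigm} \\ \mathrm{sigm} \\ \mathrm{sigm} \\ \tanh \end{pmatrix}
W
\begin{pmatrix} x^{\langle t \rangle} \\ h^{\langle t-1 \rangle} \end{pmatrix}
\tag{5.16}
\]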
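Recall that in the case of an RNN the equation for the output h<t> was the following.
\[
h^{[l]}_{\langle t \rangle} = \tanh\left( W^{[l]} \begin{pmatrix} h^{[l-1]}_{\langle t \rangle} \\ h^{[l]}_{\langle t-1 \rangle} \end{pmatrix} \right) \tag{5.19}
\]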
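Once the values of i, f , o, g have been computed, we can compute the new cell state c<t> and the
new output h<t> as follows.
\begin{align}
c^{\langle t \rangle} &= f^{\langle t \rangle} \odot c^{\langle t-1 \rangle} + i^{\langle t \rangle} \odot g^{\langle t \rangle} \tag{5.20} \\
h^{\langle t \rangle} &= o^{\langle t \rangle} \odot \tanh\left( c^{\langle t \rangle} \right) \tag{5.21}
\end{align}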
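Equations (5.16), (5.20) and (5.21) translate almost line by line into code. The following NumPy sketch performs one LSTM step for a single example. As in equation (5.16), no bias term is used; the function names and the order in which the four blocks are sliced out of the product are assumptions made for the illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, W):
    """One LSTM step with x, h_prev, c_prev of size n and W of shape (4n, 2n)."""
    n = h_prev.shape[0]
    a = W @ np.concatenate([x, h_prev])   # stack input and previous output, shape (4n,)
    i = sigmoid(a[0:n])                   # input gate
    f = sigmoid(a[n:2 * n])               # forget gate
    o = sigmoid(a[2 * n:3 * n])           # output gate
    g = np.tanh(a[3 * n:4 * n])           # candidate values
    c = f * c_prev + i * g                # equation (5.20)
    h = o * np.tanh(c)                    # equation (5.21)
    return h, c

rng = np.random.default_rng(0)
n = 3
W = rng.normal(size=(4 * n, 2 * n))
h, c = np.zeros(n), np.zeros(n)
for x in rng.normal(size=(5, n)):         # run over a short sequence with shared weights
    h, c = lstm_step(x, h, c, W)
```

Note that the cell state is updated only through an element-wise product and a sum, which is what the discussion of the gradient flow relies on.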
To understand why LSTMs are more effective than vanilla RNNs in practice, let’s examine the
gradient flow from c<t> to c<t−1> . In vanilla RNNs, the gradient flow relies on matrix multiplication,
as seen in the equation h<t> = tanh(h<t−1> Whh + x<t> Wxh ), where the weight matrix W remains
constant throughout.
However, in LSTMs, the gradient flow takes a different approach. As a matter of fact, the + operator
allows the gradient to propagate directly to the element-wise multiplication (c<t−1> ⊙ f ). Unlike
Figure 5.11: A graphical representation of the LSTM multi-layered structure. The red
rectangles are the inputs, the green ones are the hidden layers and the blue ones the
outputs.
While LSTM can be seen as a solution to the problem of vanishing gradients, gradient clipping solves
the issue of exploding gradients. The idea behind gradient clipping is to limit the norm of the
gradient if it surpasses a predetermined threshold. In practice, given the gradient
$g \leftarrow \frac{\partial L}{\partial \Theta}$ and a threshold T :
\[
\Theta \leftarrow
\begin{cases}
\Theta - \lambda g & \text{if } \lVert g \rVert_2 \le T \\
\Theta - \lambda T \dfrac{g}{\lVert g \rVert_2} & \text{otherwise}
\end{cases}
\tag{5.22}
\]
Figure 5.13: Training without (left) and with (right) gradient clipping in a recurrent network
with parameters w and b. Without gradient clipping, gradient descent can overshoot the
bottom of the cliff and receive a very large gradient from the steep cliff face. This can lead
to catastrophic parameter updates, pushing the parameters far beyond the plot’s axes.
Picture from [6].
In equation (5.22), λ is the learning rate and Θ are the parameters of the model.
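In PyTorch we can write the following code to implement gradient clipping using the 2-norm (∥·∥2 )
and threshold T = 2.0 (model, loss and optimizer are assumed to be already defined).
import torch

loss.backward()  # compute the gradients of the loss w.r.t. the parameters
# rescale all gradients in place so that their overall 2-norm is at most 2.0
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=2.0, norm_type=2)
optimizer.step()  # update the parameters using the clipped gradients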
Part Three: Generative Modeling
6 Autoencoders
6.1 Introduction
6.2 Linear Autoencoders: the PCA projection
6.3 Non-Linear Autoencoders
6.4 Variational Autoencoders
6.5 β-VAE
9 GAN
9.1 Likelihood-free model
9.2 Introduction to GAN
9.3 Definitions
9.4 Training
9.5 Theoretical analysis
9.6 Difficulties during training
9.7 Comparison with VAE
9.8 Conditional GANs
In discriminative models, given pairs (x, y) as training data, the goal is to learn a function
f which maps an input x to an output y. On the other hand, in generative models, we work
with training data consisting only of unlabeled data points x. Here, our objective is to learn the
underlying hidden structure of the data. We aim to model a distribution p_model(x) that resembles
the data distribution p_data(x), so that we can generate new samples from it. Generative models can
be classified into two categories:
• Explicit model: in this category, we explicitly define the probability distribution p_model(x)
and then sample from it to generate new data points.
• Implicit model: here, we directly sample from p_model(x) without explicitly defining the
distribution. Implicit models offer more flexibility and are commonly used in complex
scenarios.
While explicit models have the advantage of being highly interpretable, implicit models are more
versatile and applicable in various contexts. For a more detailed classification of generative models,
refer to Figure 5.14.
In Parts 1 and 2, we primarily explored discriminative models. Now, in Part 3, we will shift our
focus towards the domain of generative models.
6. Autoencoders
In this chapter we introduce Autoencoders. In section 6.1 we introduce the structure as well as
the idea behind autoencoder models. In section 6.2 we introduce the simplest autoencoder,
the linear autoencoder. In section 6.3 we describe how autoencoders can be implemented with a
neural network, allowing them to also learn non-linear projections. In section 6.4 we introduce
Variational Autoencoders.
6.1 Introduction
In the field of deep learning, the data we often work with is represented as measurement vectors,
denoted as x ∈ R^n . While the dimensionality of these vectors can be low when using carefully
selected features, modern machine learning applications frequently involve high-dimensional data
such as images, audio, or time-series. In such cases, a crucial objective is to find low-dimensional
representations that can effectively compress the data while preserving its essential information.
Moreover, these representations should be interpretable and capable of capturing different modes of
variation.
Autoencoders offer a solution to this challenge through the use of an encoder-decoder structure, as
depicted in Figure 6.1.
• The encoder f projects the original input space X into a latent space¹, denoted as Z.
• The decoder g maps samples from the latent space Z back to the input space X.
Autoencoders operate on the assumption that a meaningful compressed representation of the data
can be obtained if the decoder is capable of reconstructing the original input solely from that
compressed representation. Consequently, the composition [g ◦ f ] aims to approximate the identity
function on the data, resulting in a low reconstruction error.
¹ We may refer to the intermediate space where the data is projected as latent space, code, or embedding
space equivalently.
Furthermore, to enable the generation of new samples from the latent space, it is desirable for the
latent space to exhibit a well-structured nature, characterized by continuity and interpolation
capabilities.
If we restrict the functions f and g to be linear, the encoder function f of the autoencoder becomes
equivalent to the projection performed by Principal Component Analysis (PCA), which is the linear
projection achieving the lowest reconstruction loss (L). Given N data points, L is computed
as
\[
L = \sum_{n=1}^{N} \lVert x_n - \hat{x}_n \rVert^2 = \sum_{n=1}^{N} \lVert x_n - g(f(x_n)) \rVert^2 \tag{6.1}
\]
The advantage of such a reconstruction is that it can be found in closed form, by computing the
eigenvectors of the covariance matrix of the data.
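As a minimal sketch of this closed-form solution, the code below builds a linear encoder and decoder from the top eigenvectors of the covariance matrix. Centering the data with its mean is an assumption of the sketch (it is the usual PCA convention), and the function and variable names are hypothetical.

```python
import numpy as np

def pca_autoencoder(X, d):
    """Return a linear encoder/decoder pair projecting rows of X (N x n) to d dimensions."""
    mean = X.mean(axis=0)
    Xc = X - mean
    cov = Xc.T @ Xc / X.shape[0]                 # covariance matrix of the data
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
    U = eigvecs[:, ::-1][:, :d]                  # d eigenvectors with the largest eigenvalues

    def encode(x):
        return (x - mean) @ U                    # f: input space -> latent space

    def decode(z):
        return z @ U.T + mean                    # g: latent space -> input space

    return encode, decode

# Data that lies on a 2D plane inside a 5D space.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) @ rng.normal(size=(2, 5)) + 3.0
encode, decode = pca_autoencoder(X, d=2)
loss_2d = np.sum((X - decode(encode(X))) ** 2)   # reconstruction loss (6.1) with d = 2

encode_1d, decode_1d = pca_autoencoder(X, d=1)
loss_1d = np.sum((X - decode_1d(encode_1d(X))) ** 2)
```

Since the toy data is exactly two-dimensional, a 2D latent space reconstructs it up to rounding errors, whereas a 1D latent space necessarily loses information.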
6.3.1 Overview
When we allow f and g to be non-linear, the Autoencoder becomes a non-linear projection of the
data. In this case, both the encoder and decoder are implemented as neural networks, as illustrated
in Figure 6.2. To construct an autoencoder, we typically use a feedforward neural network trained
to reconstruct its inputs. In practice, it optimizes the following objective function w.r.t. the encoder
and decoder parameters Θf and Θg :
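\[
\hat{\Theta}_f, \hat{\Theta}_g = \arg\min_{\Theta_f, \Theta_g} \sum_{n=1}^{N} \lVert x_n - \hat{x}_n \rVert^2 = \arg\min_{\Theta_f, \Theta_g} \sum_{n=1}^{N} \lVert x_n - g(f(x_n)) \rVert^2 \tag{6.2}
\]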
Figure 6.3 provides a comparison between the reconstructions obtained using a PCA projection
(linear autoencoder) and a feedforward neural network (in this case, a convolutional neural network
or CNN) for compressing the data from a dimensionality of 1024 to 2.
Figure 6.3: Comparison between the reconstruction obtained with a linear (PCA) and
non-linear (CNN) autoencoder.
In the context of autoencoders, the dimensionality of the hidden layer plays a crucial role. Let us
call X the original feature space and Z the latent space. In general, as we have seen before, we
have dim(X) > dim(Z). However, there is also a class of autoencoders where dim(X) < dim(Z).
Depending on the dimensionality of the latent space, we can distinguish between undercomplete
(dim(X) > dim(Z)) and overcomplete (dim(X) < dim(Z)) hidden representations (see Figure 6.4).
The idea behind an undercomplete hidden representation is to enable the network to learn the
important features of the data by reducing the dimensionality of the hidden space. This prevents the
autoencoder from simply copying the input and forces it to extract meaningful and discriminative
features.
In practice, they work well to extract those features for training samples, but may not generalize
effectively to out-of-distribution samples.
In contrast, an overcomplete autoencoder has a hidden layer with a higher dimensionality than the
input layer. This lack of compression potentially allows each hidden unit to simply copy different
input components, achieving a perfect reconstruction loss, but without extracting any meaningful
feature. The question now is:
There are mainly two applications of Autoencoders with overcomplete hidden representation: the
Denoising and the Inpainting Autoencoders.
The goal of Denoising Autoencoders is, given a noisy image, to reconstruct the original clean
one. In order to do that, during training, a clean image is intentionally corrupted by injecting noise,
such as Gaussian noise. This noisy image is then provided as input to an Autoencoder with an
overcomplete hidden representation. Since the loss is evaluated based on a comparison with the
original (clean) image, the network is discouraged from simply copying the input (noisy) image.
Instead, it must learn the necessary transformations to remove the noise and accurately restore the
clean information.
The other application of Autoencoders with overcomplete hidden representation is the Inpainting
Autoencoders. The goal of this model is to reconstruct the missing parts of an image. In order to
train such a network, similarly to what we have seen with Denoising Autoencoder, we provide a
corrupted image as input and the original image as target. In this case, instead of injecting noise,
the original image is intentionally occluded by applying partial or complete occlusions, as shown in
Figure 6.6.
The network is then trained to reconstruct the original (complete) image by learning to fill in the
missing or occluded regions. By utilizing an overcomplete hidden representation, the Inpainting
Autoencoder learns to capture the underlying structure of the image and accurately restore the
missing parts based on the available information.
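For both models, the key point is that the input is corrupted while the loss is computed against the clean image. The sketch below only builds such training pairs and evaluates the squared reconstruction error; it does not train a network. The noise level, the occluded region and all function names are arbitrary choices made for the illustration.

```python
import numpy as np

def add_gaussian_noise(img, sigma, rng):
    """Corruption used for a denoising autoencoder."""
    return img + rng.normal(scale=sigma, size=img.shape)

def occlude(img, top, left, height, width):
    """Corruption used for an inpainting autoencoder: blank out a rectangular region."""
    out = img.copy()
    out[top:top + height, left:left + width] = 0.0
    return out

def reconstruction_loss(output, clean):
    """Squared error between the network output and the clean target."""
    return np.sum((output - clean) ** 2)

rng = np.random.default_rng(0)
clean = rng.uniform(0.1, 1.0, size=(8, 8))       # stands in for a training image
noisy = add_gaussian_noise(clean, sigma=0.2, rng=rng)
occluded = occlude(clean, top=2, left=2, height=3, width=3)

# A network that simply copies its input would be penalized in both cases:
loss_if_copying_noisy = reconstruction_loss(noisy, clean)
loss_if_copying_occluded = reconstruction_loss(occluded, clean)
```

Both losses are strictly positive, which is why copying the input is no longer an optimal strategy even with an overcomplete hidden representation.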
As mentioned in section 6.1, in order to be able to generate new samples, it is desirable for the
latent space to exhibit a well-structured nature, characterized by continuity and interpolation
capabilities.
However, in the classical version of autoencoders we have examined, the decoder struggles to
generate high-quality samples. This limitation arises due to the lack of continuity in the latent
space, as depicted in Figure 6.7. In regions of the latent space where there are discontinuities or
gaps between clusters, the decoder has no knowledge of how to generate realistic outputs. This
happens because during training, the autoencoder was not exposed to encoded vectors from those
regions. Therefore, while autoencoders excel at reconstructing input data, they face challenges in
generating new samples.
Figure 6.7: Training an autoencoder on the MNIST dataset and visualizing the encodings
from a 2D latent space reveals the formation of distinct clusters, given by the learned
projections of the input samples. However, if we sample an element from a region of that
space which has not been covered during training, the decoder will not be able to generate
a realistic output, as it was trained purely to optimize the reconstruction loss and it lacks
any kind of interpolation capabilities. (from blog post)
6.4.1 Overview
Variational Autoencoders (VAEs) are the last category of Autoencoders we will analyze and are
proposed as a solution to the issue of vanilla Autoencoders discussed in subsection 6.3.3. They have
a unique feature that makes them excellent for generative modeling: their latent spaces are designed
to be continuous. This means that VAEs can easily generate new and diverse samples by smoothly
interpolating between different points (explored during training) in the latent space.
In practice, they achieve that by making the encoder not output a latent vector of size dim(Z), but
instead outputting two vectors of that size: a vector of means µ and a vector of standard deviations
σ, as depicted in Figure 6.8.
Figure 6.8: Structure of a VAE. The part highlighted in red is the encoder. µ and σ (which
have the same dimensionality as z) are the values predicted by the last layer of the encoder
network. z is the latent embedding sampled from N (µ, diag(σ²)). The part highlighted in blue
is the decoder.
This stochastic generation means that, even for the same input, while the mean and standard
deviation remain the same, the actual encoding will vary somewhat on every single pass, simply
due to sampling.
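The sampling step can be sketched as follows. We assume here that the latent code is drawn as z = µ + σ ⊙ ε with ε ∼ N(0, I), a common way to write the sampling (often called the reparameterization trick); the values of µ and σ below are made up and stand in for the encoder outputs for one fixed input.

```python
import numpy as np

def sample_latent(mu, sigma, rng):
    """Draw z from a Gaussian with mean mu and per-dimension standard deviation sigma."""
    eps = rng.standard_normal(mu.shape)      # noise that does not depend on the input
    return mu + sigma * eps

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])                   # hypothetical encoder outputs
sigma = np.array([0.5, 1.5])

z_first = sample_latent(mu, sigma, rng)      # two passes over the same input ...
z_second = sample_latent(mu, sigma, rng)     # ... give two different encodings

many = np.array([sample_latent(mu, sigma, rng) for _ in range(20000)])
empirical_mean = many.mean(axis=0)
empirical_std = many.std(axis=0)
```

Over many passes, the empirical mean and standard deviation of the encodings approach µ and σ, even though each individual encoding differs.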
Figure 6.9: Difference of latent space encodings between AE and VAE. In VAE the input to
the decoder is a sample from a Gaussian distribution centered at the projected data point
rather than the projected data point itself. (from blog post)
However, since there are no limits on the values that µ and σ can take, the encoder may
learn to generate a very different µ for each class while minimizing σ, in order to recover a
clustered structure, which allows the network to achieve a lower reconstruction error. This
is again something we would like to avoid, since we want the latent space to be continuous and not
clustered. For a visual representation refer to Figure 6.10.
Figure 6.10: What we would like to obtain (left) and what we obtain only changing the
structure of the encoder (right).
Ideally, we would like to obtain encodings which are as close as possible to each other in the latent
space, allowing smooth interpolation and thus good generation of new samples. In order to force
this, the Kullback-Leibler divergence (KL)² is inserted in the loss function. In particular, we want
to minimize the KL divergence between the distribution defined by every training sample and
a standard (multivariate) normal distribution. Intuitively, this loss encourages the encoder to
distribute all encodings (for all types of inputs, e.g. all MNIST digits) evenly around the center
of the latent space. Using a loss based on reconstruction loss and KL divergence we obtain a latent
space structured as depicted in Figure 6.11.
Figure 6.11: The latent space of a VAE trained on MNIST with a loss based on reconstruction
error and KL divergence.
2 The KL divergence between two probability distributions is a measure of how much they differ from each
other. More details are provided in subsection 6.4.2.
100 Chapter 6. Autoencoders
Before diving into the specific computation of the objective function, let us introduce the definition
and some properties of the Kullback-Leibler (KL) Divergence, which will be needed to understand
the following part.
We often have two probability distributions and want to measure how
different they are. One way to do that is the Kullback-Leibler divergence. More formally, to
measure how much a distribution p differs from a second, reference probability distribution
q we write
$$D_{KL}(p \| q) = \int_x p(x) \log\left(\frac{p(x)}{q(x)}\right) dx$$
A first thing worth noticing is that the KL divergence is not symmetric, as in general
DKL(p||q) ≠ DKL(q||p).
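Moreover, the KL divergence is non-negative, as we can see from the following proof:
$$\begin{aligned}
-D_{KL}(p \| q) &= -\int_x p(x) \log\left(\frac{p(x)}{q(x)}\right) dx && (6.3) \\
&= \int_x p(x) \log\left(\frac{q(x)}{p(x)}\right) dx && (6.4) \\
&\le \log \int_x p(x) \frac{q(x)}{p(x)} dx && (6.5) \\
&= \log \int_x q(x) dx && (6.6) \\
&= \log 1 = 0 && (6.7)
\end{aligned}$$
Step (6.5) uses Jensen's inequality, E(ϕ(x)) ≤ ϕ(E(x)) if ϕ is concave, and step (6.7) uses the fact that q integrates to 1.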
In this section we are interested in understanding how to capture the process we have just described
via estimation of the parameters Θ∗ of this generative model.
In order to train a model we would like to maximize the likelihood of the training data, which in this
case is:
$$p(x) = \int_z p(x|z) p(z) dz \qquad (6.8)$$
In this expression we know p(z) and p(x|z), but we are not able to compute the integral over all z.
Here we have used that: x does not depend on z (1.9), Bayes' rule (1.10), multiplying both
numerator and denominator by the same factor leaves the expression unchanged (1.11), the logarithm of a product is the
sum of the logarithms (1.12), the KL divergence is non-negative (1.13) (as proven in subsection 6.4.2).
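The derivation these steps refer to is not reproduced in the text above. A standard reconstruction is sketched below, assuming that q(z|x) denotes the approximate posterior produced by the encoder (the symbol q is an assumption here, chosen to match its use later in this section). The lines follow the order of the listed steps rather than the original equation numbers.

```latex
\begin{align*}
\log p(x)
&= \mathbb{E}_{z \sim q(z|x)}\left[\log p(x)\right]
   && \text{($x$ does not depend on $z$)} \\
&= \mathbb{E}_{z \sim q(z|x)}\left[\log \frac{p(x|z)\,p(z)}{p(z|x)}\right]
   && \text{(Bayes' rule)} \\
&= \mathbb{E}_{z \sim q(z|x)}\left[\log \frac{p(x|z)\,p(z)}{p(z|x)} \cdot \frac{q(z|x)}{q(z|x)}\right]
   && \text{(multiply by 1)} \\
&= \mathbb{E}_{z \sim q(z|x)}\left[\log p(x|z)\right]
   - \mathbb{E}_{z \sim q(z|x)}\left[\log \frac{q(z|x)}{p(z)}\right]
   + \mathbb{E}_{z \sim q(z|x)}\left[\log \frac{q(z|x)}{p(z|x)}\right]
   && \text{(log of a product)} \\
&= \mathbb{E}_{z \sim q(z|x)}\left[\log p(x|z)\right]
   - D_{KL}\left(q(z|x) \,\|\, p(z)\right)
   + D_{KL}\left(q(z|x) \,\|\, p(z|x)\right) \\
&\ge \mathbb{E}_{z \sim q(z|x)}\left[\log p(x|z)\right]
   - D_{KL}\left(q(z|x) \,\|\, p(z)\right)
   && \text{(KL is non-negative)}
\end{align*}
```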
Since, as we have said before, p(z|x) is not tractable we cannot optimize the last term. As a
consequence, we aim to maximize the first two terms (that we call ELBO, Evidence Lower BOund)
of the expression.
In particular we want to jointly maximize the reconstruction likelihood (i.e. minimize the reconstruction error) and minimize the KL divergence
between the approximate posterior and the prior. The first term encourages the encoder to form
clusters where samples from the same category or with similar properties are closely located in
the latent space. The second term encourages the encoder to project latent representations evenly
around the center of the latent space. A visualization of the effect of each term on the learned latent
space can be found in Figure 6.12.
Figure 6.12: A comparison between the latent spaces of three autoencoders trained on
MNIST optimizing (left to right): reconstruction error, KL term, jointly reconstruction
error and KL term.
Observe that in practice we assume p(z) ∼ N(0, I) (this enforces a diagonal covariance
matrix, and thus improves disentanglement) and q(z|x) ∼ N(µ, σ²I) (this makes the KL
term analytically computable).
6.4.4 Training
Now that we have obtained an objective function to optimize, in order to proceed with the actual
training we must be able to compute the gradients of the ELBO with respect to the parameters of
the encoder and the decoder. However, we have a problem. The process of sampling (in the case of
z) from a distribution that is parameterized by our model is not differentiable. For this reason, in
order to compute the gradients, we need to find a method of making our predictions separate from
the stochastic sampling element.
The solution is the so-called reparametrization trick, which involves treating the random
sampling as a single noise term. In particular, instead of considering z to be sampled from a
N(µ, σ²I) distribution, we write it as a deterministic function z = µ + σ ⊙ ϵ of µ, σ and a random noise term ϵ ∼ N(0, I),
where ⊙ denotes the element-wise product. The benefit is that the prediction of mean and variance is no longer tied to
the stochastic sampling operation. This means that we can again differentiate with respect to our
model's parameters.
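The reparametrization trick can be illustrated with a few lines of plain Python. This is a minimal sketch, assuming the encoder outputs the mean `mu` and the log-variance `log_var` of a diagonal Gaussian (predicting the log-variance rather than σ is an assumption made here for numerical convenience, and the function name is hypothetical).

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Return z = mu + sigma * eps together with the noise eps ~ N(0, I)."""
    eps = [rng.gauss(0.0, 1.0) for _ in mu]            # all randomness lives here
    sigma = [math.exp(0.5 * lv) for lv in log_var]     # sigma = exp(log_var / 2)
    z = [m + s * e for m, s, e in zip(mu, sigma, eps)]
    return z, eps


rng = random.Random(0)
mu = [0.5, -1.0]
log_var = [0.0, math.log(4.0)]   # sigma = [1, 2]
z, eps = reparameterize(mu, log_var, rng)
# Given eps, z is a deterministic function of mu and sigma:
# dz/dmu = 1 and dz/dsigma = eps, so gradients can flow to the encoder.
```

Since `eps` is drawn independently of the network outputs, the sampling step no longer blocks the gradient with respect to `mu` and `log_var`.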
Figure 6.13: Reparametrization trick explained graphically. Circles are stochastic variables
and diamonds are deterministic variables (i.e., neural network layers).
For implementation and clarity purposes we report here all the steps of a forward pass in a VAE.
(6.16)
Here, the last term is the KL divergence between the approximate posterior and the prior.
For more details on the derivation of this term, refer to the Pen&Paper homeworks.
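As a hedged illustration of the KL term, the sketch below uses the well-known closed form of the KL divergence between a diagonal Gaussian N(µ, σ²I) and the standard normal prior N(0, I), namely ½ Σ (σ² + µ² − 1 − log σ²). It assumes the encoder outputs the log-variance; the function name is hypothetical and the code is not taken from the homeworks.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form."""
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv
        for m, lv in zip(mu, log_var)
    )


# The posterior equals the prior, so the divergence vanishes.
kl_same = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])
# Shifting the mean by 1 in one dimension costs 0.5 nats.
kl_shifted = kl_to_standard_normal([1.0], [0.0])
```

The term grows when µ moves away from the origin or when σ moves away from 1, which is what pushes the encodings towards the center of the latent space.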
Once the network is trained, in order to generate new data samples, we sample a vector z from N(0, I)
and use the decoder network to obtain its representation in the original space. However, the
representations obtained in this way are still entangled. For instance, in the case of the MNIST dataset
we do not have an explicit way to sample a 1 rather than a 9. We would like to further structure
the latent space in order to have a disentangled representation in which each dimension corresponds
to a factor of variation of the data (digits, style, thickness, orientation, etc.).
There are mainly two solutions to this problem: training the network in a semi-supervised way to
make it learn the labels of the data, or using the so-called β-VAEs. We are mainly interested in
discovering and separating the important factors of variation in an unsupervised fashion, so in the next
section we will present the β-VAE [3].
6.5 β-VAE
As we have seen before, in the ELBO the KL loss enforces independent Gaussians (diagonal
covariance matrix). However, because the two loss terms compete with each other, this is not always achieved in practice.
The idea introduced by β-VAE is to give more weight to the KL term, by multiplying it by an
adjustable hyperparameter β that balances latent channel capacity and independence constraints
with reconstruction accuracy. The intuition is that if factors are in practice independent
of each other (as style and digit in MNIST), the model should benefit from disentangling them.
A regressive model is a model whose outputs are a linear combination of the inputs.
If we have n pixels, we have 2^n possible states (each pixel can be black or white).
We can model this problem using n Bernoulli variables X_1, ..., X_n s.t. Val(X_i) = {0, 1} =
{Black, White}.
If we sample from p(x_1, ..., x_n), we generate an image (we use lowercase x_i since the probability function
is a function that maps a realization to a probability mass).
Using a tabular approach, via the chain rule of probability we can factorize the joint distribution
over the n dimensions:
$$p(x) = \prod_{i=1}^{n} p(x_i | x_1, ..., x_{i-1}) = \prod_{i=1}^{n} p(x_i | x_{<i})$$
So, since x_1 is given and the total probability must be 1, we obtain that we need 2^n − 1 parameters.
However, an exponential number of parameters is too many to train.
Idea: assume p(x_i | x_{<i}) to correspond to a Bernoulli random variable and learn to map previous
inputs to the mean:
Now our problem is to find a useful definition of f : {0, 1}^{i−1} → [0, 1].
7.4 Fully Visible Sigmoid Belief Network 107
At each time step i, we will have i − 1 parameters (θ = (α_1, ..., α_{i−1})), thus, over a horizon of n steps
we will have
$$\text{number of parameters} = \sum_{i=1}^{n} i = O(n^2)$$
The main example of a Fully Visible Sigmoid Belief Network is NADE and its variants.
There the function is expressed as
where W.,<i represents the first i − 1 columns of W and Vi,. is the i-th row of V .
So we can express the k-th row of h_i as
$$h_{i,k} = b_k + \sum_{j=1}^{i-1} W_{kj} x_j$$
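The sketch below shows how a NADE-style model could evaluate p(x) for a binary vector. It assumes the standard NADE parameterization, h_i = σ(b + W_{·,<i} x_{<i}) and p(x_i = 1 | x_{<i}) = σ(c_i + V_{i,·} h_i). The output bias `c`, the sigmoid on the hidden units, the parameter shapes and the random weights are assumptions for illustration and are not taken from the text above.

```python
import itertools
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def nade_log_prob(x, W, V, b, c):
    """log p(x) = sum_i log p(x_i | x_<i) for a binary vector x.

    Assumed shapes: W is H x n, V is n x H, b has length H, c has length n.
    """
    n, H = len(x), len(b)
    a = list(b)  # pre-activation b + W[:, <i] x_<i, updated incrementally
    log_p = 0.0
    for i in range(n):
        h = [sigmoid(a_k) for a_k in a]
        p_i = sigmoid(c[i] + sum(V[i][k] * h[k] for k in range(H)))  # Bernoulli mean
        log_p += math.log(p_i if x[i] == 1 else 1.0 - p_i)
        for k in range(H):
            a[k] += W[k][i] * x[i]  # x_i becomes visible to the later steps
    return log_p


rng = random.Random(0)
n, H = 4, 3
W = [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(H)]
V = [[rng.uniform(-1, 1) for _ in range(H)] for _ in range(n)]
b = [rng.uniform(-1, 1) for _ in range(H)]
c = [rng.uniform(-1, 1) for _ in range(n)]

# Because every factor is a valid conditional, the probabilities sum to 1.
total = sum(
    math.exp(nade_log_prob(x, W, V, b, c))
    for x in itertools.product([0, 1], repeat=n)
)
```

Sharing W across all steps is what keeps the parameter count linear in n for a fixed hidden size, and the incremental update of `a` avoids recomputing the sum at every step.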
Training of NADE
We have seen that the inputs are processed in order. Much research has investigated
the best ordering of the vector x; however, a random order has been shown to work fine.
During the training of NADE the teacher forcing approach is used: ground truth values of the
pixels are used for conditioning when predicting subsequent values (the values predicted
by the model are not used during training).
During inference the predicted values are used, since it is a fully generative model.
NADE’s extensions
• Real-valued NADE: it expands to real valued data, modelling the conditionals as mixture of
Gaussians
• Orderless and deep NADE (DeepNADE): a single deep neural network is trained to assign a
conditional distribution to any variable given any subset of the others.
• Convolutional NADE (ConvNADE)
110 Chapter 7. Autoregressive models
As in autoencoders, our objective is to learn hidden representations of the inputs that reveal the
statistical structure of the distribution that generated them.
However, the autoencoder takes the input as a whole and, thus, it does not satisfy the autoregressive
property.
So now we are going to impose some constraints on an autoencoder in order to make it fulfil the
autoregressive property.
To better understand the procedure we are going to explain next, recall that in a classical
autoencoder we have that
h(x) = g(b + W x)
x̂ = σ(c + V h(x))
1. We assign (sampling uniformly) each unit in the hidden layers an integer m s.t. 1 ≤ m ≤ D − 1,
so that m_l(k) denotes the value assigned to the k-th unit of
layer l
2. In the hidden layers, allow connections to propagate only to units whose m is greater than or equal to that of the source unit, not
to units with a smaller one
3. Allow connections between the last hidden layer and the output only towards output units whose m is strictly
greater
7.5 Masked Autoencoder Distribution Estimation (MADE) 111
Let us clarify formally how to compute the entries of the mask matrices M^W and M^V in order to satisfy
the conditions expressed in the last two steps.
$$M^W_{ij} = \mathbb{1}\left(m_l(i) \ge m_{l-1}(j)\right)$$
$$M^V_{ij} = \mathbb{1}\left(m_l(i) > m_{l-1}(j)\right)$$
Using such a procedure it can be proved that we always end up with an autoencoder which fulfils
the autoregressive property.
Computing p(x) is just a matter of performing a forward pass. When implementing MADE, we usually
use ReLU for the hidden layers and a sigmoid for the last one.
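The mask construction can be sketched for a network with a single hidden layer. This is an illustrative sketch, not a reference implementation: it assumes that input and output unit d both carry the degree d (for d = 1, ..., D), and the function names are hypothetical.

```python
import random


def made_masks(D, H, rng):
    """Build the input-to-hidden mask M_W (H x D) and hidden-to-output mask M_V (D x H)."""
    m_in = list(range(1, D + 1))                        # input j has degree j
    m_hid = [rng.randint(1, D - 1) for _ in range(H)]   # hidden degrees in [1, D-1]
    M_W = [[1 if m_hid[k] >= m_in[j] else 0 for j in range(D)] for k in range(H)]
    M_V = [[1 if (d + 1) > m_hid[k] else 0 for k in range(H)] for d in range(D)]
    return M_W, M_V


def connectivity(M_V, M_W):
    """Number of paths from each input j to each output d (D x D matrix)."""
    D, H = len(M_V), len(M_W)
    return [
        [sum(M_V[d][k] * M_W[k][j] for k in range(H)) for j in range(D)]
        for d in range(D)
    ]


M_W, M_V = made_masks(D=5, H=8, rng=random.Random(0))
C = connectivity(M_V, M_W)
# C is strictly lower triangular: output d has no path from inputs j >= d.
```

A path from input j to output d exists only through a hidden unit k with j ≤ m(k) < d, which forces j < d and therefore gives the autoregressive property.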
As we have seen in the previous pages, our goal is to predict new images, given some true ones.
In order to do that we use the chain rule to decompose the likelihood of an image x into a product of 1D
distributions
$$p(x) = \prod_{i=1}^{n} p(x_i | x_1, ..., x_{i-1})$$
In order to train the model we want to maximize the likelihood of training data.
We mainly have two issues:
Idea: we generate image pixels starting from a corner and we model the dependency on previous
pixels using an RNN (LSTM).
Note that from now on pixels will be RGB values (3 channels, 256 values each).
The issue is that sequential generation is slow, due to explicit pixel dependencies. This issue has
been addressed in PixelCNN, which models dependencies on previous pixels with a CNN over a context
region.
We still generate images starting from the top left corner, but model the dependencies using a CNN with
masked convolutions in order to satisfy the autoregressive property.
Training maximizes the likelihood of the training data as before; we use a softmax loss over pixel values
(from 0 to 255); the one with the lowest loss is then selected.
How much of the context is considered by the CNN?
We model this problem using masked convolutions, which ensure that the autoregressive property is
satisfied.
Only the pixels shown in blue within the receptive field of the filter are used.
p(xi |x<i ) = p(xi,R |x<i )p(xi,G |x<i , xi,R )p(xi,B |x<i , xi,R , xi,G )
• Mask A: this mask is only applied to the first convolutional layer and restricts connections to
those colors in current pixels that have already been predicted.
• Mask B: this mask is applied to other layers and allows connections to predicted colors in
the current pixels
Figure 7.4: RGB masked convolution, when RED and GREEN have already been predicted
for the current pixel
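The spatial part of the two masks can be sketched for the simplified single-channel case, where the colour ordering inside a pixel plays no role. The sketch assumes the usual convention that mask A hides the centre position while mask B keeps it visible; the function name is hypothetical.

```python
def causal_conv_mask(k, mask_type):
    """Return a k x k mask (list of rows) for a single-channel masked convolution."""
    assert k % 2 == 1 and mask_type in ("A", "B")
    c = k // 2
    mask = [[0] * k for _ in range(k)]
    for r in range(k):
        for col in range(k):
            # Visible context: every row above the centre row, plus the
            # positions to the left of the centre within the centre row.
            if r < c or (r == c and col < c):
                mask[r][col] = 1
    if mask_type == "B":
        mask[c][c] = 1  # later layers may also see the current position
    return mask


mask_a = causal_conv_mask(3, "A")
mask_b = causal_conv_mask(3, "B")
```

The convolution kernel is multiplied element-wise by the mask before it is applied, so the weights at the zeroed positions never contribute to the output.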
In order to make the training faster, the authors proposed a parallelization of the operations, using
a stack of masked convolutions.
The problem that arises is the presence of a blind spot (a group of pixels on which the pixel we are considering does not depend,
directly or indirectly, even though it should).
The authors of the paper proposed to remove the blind spot by combining two convolutional
network stacks:
Moreover, in order to obtain results as good as the ones obtained with the RNN, they replaced the
rectified linear units between the masked convolutions in the original PixelCNN with the following
gated activation unit
The introduction of this combination of vertical and horizontal convolutions makes sure that
the pixel dependencies are preserved in the right order.
The results obtained by this model were images which looked like natural images at first glance, but which
have no semantic content when observed carefully.
However, this model has historical value since it showed that generating images which looked like
natural images was possible using neural networks.
What we like compared to PixelRNN is that training is much faster. However, generation
remains sequential and, thus, slow.
7.9 TCNs-WaveNet
The idea was to adapt PixelCNN to work with audio data, where the dimensionality is much larger
(at least 16,000 samples per second).
WaveNet is based on the idea of dilated convolution, a type of convolution which allows capturing
longer-range dependencies without increasing the number of layers.
In particular WaveNet increases the dilation factor as we go up in the layers, so that with the
same number of layers we have a larger receptive field.
This can also be used in motion modeling, where it can predict future poses.
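The effect of dilation on the receptive field can be checked with a short calculation. For a stack of 1D convolutions with kernel size k and dilation d_l in layer l, each layer adds (k − 1)·d_l time steps to the receptive field. The kernel size and the dilation schedules below are illustrative choices, not values quoted from the notes.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in time steps) of stacked dilated 1D convolutions."""
    return 1 + (kernel_size - 1) * sum(dilations)


# Four layers without dilation versus four layers with doubling dilation.
plain = receptive_field(2, [1, 1, 1, 1])
dilated = receptive_field(2, [1, 2, 4, 8])
```

With the same number of layers, doubling the dilation at each layer makes the receptive field grow exponentially with depth instead of linearly.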
If we consider the output at timestep t as an input for the next timestep, RNNs can be seen as
autoregressive models. In order to generate new images, we randomly sample h0 and then we generate
However, the internal transition structure of a standard RNN is entirely deterministic; the only
source of randomness or variability can be found in the conditional output probability model
pθ(xt | x<t).
As a consequence, RNNs are often augmented with random latent variables in order to:
The goal is to increase the expressive power of RNNs by incorporating stochastic latent variables into
the hidden state of an RNN.
In order to do that we combine RNN and VAE by including two latent variables for timesteps.
They allow us to specify priors.
7.11 Self-Attention and Transformers 117
We form the prediction of the current time step by taking a convex combination of the entire input
sequence.
The Attention operation learns to identify/select the relevant past information for the next step.
A linear mapping transforms from the inputs/embeddings to Key, Value and Query embeddings. In
particular
K = XWK
V = XWV
Q = XWQ
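More formally
$$Y = \alpha V = \alpha X W_V = \mathrm{softmax}\left(\frac{(X W_Q)(X W_K)^\top}{\sqrt{D}} + M\right)(X W_V)$$
Note that $\left(\frac{Q K^\top}{\sqrt{D}}\right)_{ij}$ expresses how much the query i depends on the key j; thus using a matrix M as
drawn below allows us to delete dependencies of queries on higher (and thus future) keys.
$$M = \begin{pmatrix} -\infty & -\infty & \cdots & -\infty \\ 0 & -\infty & \cdots & -\infty \\ \vdots & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & -\infty \end{pmatrix}$$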
The complexity of this operation is O(T² · D), i.e. it is quadratic in T. Strictly speaking, the computational
cost is higher, since at every time step the matrix M changes, so an O(n) should be added for
sequential operations.
The term M prevents the model from accessing future steps (in order to fulfil the autoregressive
property); if we are doing seq2seq mapping we don't need it. In particular, M is an upper triangular
matrix filled with very negative values. It masks out the influence of future elements in the
sequence on the prediction of the current time step.
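Masked self-attention can be sketched in plain Python as follows. The sketch assumes the common convention in which a position may attend to itself, i.e. M has zeros on and below the diagonal and −∞ strictly above it; this differs from the matrix drawn above, which also masks the diagonal. The helper names and the identity projections used in the example are hypothetical.

```python
import math


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def softmax(row):
    m = max(row)
    e = [math.exp(v - m) for v in row]   # exp(-inf) evaluates to 0.0
    s = sum(e)
    return [v / s for v in e]


def causal_self_attention(X, W_Q, W_K, W_V):
    """Return (Y, alpha) with Y = softmax(Q K^T / sqrt(D) + M) V."""
    Q, K, V = matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V)
    T, D = len(Q), len(Q[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)
    neg_inf = float("-inf")
    masked = [
        [scores[i][j] / math.sqrt(D) + (0.0 if j <= i else neg_inf) for j in range(T)]
        for i in range(T)
    ]
    alpha = [softmax(row) for row in masked]  # each row is a convex combination
    return matmul(alpha, V), alpha


X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # T = 3 time steps, D = 2 features
I2 = [[1.0, 0.0], [0.0, 1.0]]              # identity projections for illustration
Y, alpha = causal_self_attention(X, I2, I2, I2)
```

Each row of `alpha` sums to 1 and has zero weight on future positions, so every output is a convex combination of the current and past values only.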
8. Normalizing flows
8.1 Introduction
VAEs have a latent space, but no tractable likelihood: we have to optimize an approximation.
Autoregressive models have a tractable likelihood, but no latent space.
The idea is now to satisfy both conditions by constructing a mapping from an easy
distribution to a complex distribution.
We can do that using the change of variables technique.
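In 1-D, with the substitution x = g(u):
$$\int_a^b f(x) dx = \int_{u = g^{-1}(a)}^{u = g^{-1}(b)} f(g(u)) g'(u) du$$
If we think about probabilities then, given a probability density p_z(z) and x = f(z), where f(·) is
a monotone and differentiable function, we have
$$p_x(x) = p_z(f^{-1}(x)) \left|(f^{-1})'(x)\right| = p_z(z(x)) \left|(f^{-1})'(x)\right|$$
and, in the multivariate case,
$$p_x(x) = p_z(f^{-1}(x)) \left|\det\left(\frac{\partial f^{-1}(x)}{\partial x}\right)\right| = p_z(f^{-1}(x)) \left|\det\left(\frac{\partial f(z)}{\partial z}\right)\right|^{-1}$$
where the last equality uses det(A^{-1}) = det(A)^{-1} for an invertible matrix A.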
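The 1-D formula can be checked numerically. The sketch below assumes a standard normal base density p_z and the monotone map f(z) = exp(z), both chosen purely for illustration; the resulting p_x is the log-normal density.

```python
import math


def p_z(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def f_inv(x):
    return math.log(x)       # inverse of f(z) = exp(z)


def f_inv_prime(x):
    return 1.0 / x           # derivative of the inverse


def p_x(x):
    """Density of x = f(z) obtained via the change of variables formula."""
    return p_z(f_inv(x)) * abs(f_inv_prime(x))


# Midpoint-rule check that p_x integrates to (approximately) 1 on (0, 200).
h, n = 0.001, 200_000
total = sum(p_x((i + 0.5) * h) for i in range(n)) * h
```

Without the factor |(f⁻¹)′(x)| the transformed function would not integrate to 1, which is why the Jacobian term is needed.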
The idea is now to parameterize the Transformation f with a simple MLP layer.
Let us now analyze the properties which must be satisfied by our neural network.
From a theoretical perspective, it must:
• be differentiable
• be invertible
• preserve the dimensionality
From a computational perspective, the Jacobian of the transformation must be computed efficiently.
As we can see from the examples below, the complexity of this computation depends on the form of
the transformation
Coupling layers are a neural network structure which fulfils all the desired requirements.
Here β is some form of complex function that can include nonlinearities and does not have to be
invertible (it can be a CNN or any other complex function).
We have another function h, an element-wise function, which combines one unprocessed part of
the input with the second half of the input after it has gone through the complex function β. This
produces the first half of the output.
Then, the second half of the input is copied unchanged to form the second half of the output.
This is important to ensure that we can invert the overall computation.
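Let us look in detail at the forward pass, the backward pass and the Jacobian matrix.
$$\begin{pmatrix} y^A \\ y^B \end{pmatrix} = \begin{pmatrix} h(x^A, \beta(x^B)) \\ x^B \end{pmatrix}$$
$$\begin{pmatrix} x^A \\ x^B \end{pmatrix} = \begin{pmatrix} h^{-1}(y^A, \beta(y^B)) \\ y^B \end{pmatrix}$$
$$J = \begin{pmatrix} h' & h' \beta' \\ 0 & 1 \end{pmatrix}$$
We can immediately notice that this matrix is upper triangular, as we wanted from the beginning.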
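A minimal additive coupling layer makes the invertibility argument concrete. The sketch assumes h(a, b) = a + b (one possible choice of h) and an arbitrary, deliberately non-invertible β; both are illustrative assumptions, and the input length is assumed to be even.

```python
import math


def beta(x_b):
    """Arbitrary non-invertible function of the second half (it only sees the sum)."""
    s = sum(x_b)
    return [math.tanh(s + k) for k in range(len(x_b))]


def forward(x):
    half = len(x) // 2
    x_a, x_b = x[:half], x[half:]
    y_a = [a + t for a, t in zip(x_a, beta(x_b))]   # y_A = h(x_A, beta(x_B))
    return y_a + x_b                                 # y_B = x_B (copied unchanged)


def inverse(y):
    half = len(y) // 2
    y_a, y_b = y[:half], y[half:]
    x_a = [a - t for a, t in zip(y_a, beta(y_b))]   # beta is only re-evaluated, never inverted
    return x_a + y_b


x = [0.3, -1.2, 2.0, 0.5]
y = forward(x)
x_rec = inverse(y)
```

For this additive choice the diagonal of the Jacobian consists of ones, so its determinant is 1 and the log-determinant term in the likelihood vanishes.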
However, a single nonlinear transform (β) is normally not powerful enough; more complex transformations
can be attained via composition.
Now we have a flow of transformations, where we can see each transform as a NN.
As the determinant of a product is the product of the determinants we can write that
$$p_x(x) = p_z(f^{-1}(x)) \prod_k \left|\det\left(\frac{\partial f_k^{-1}(x)}{\partial x}\right)\right|$$
8.5.1 Training
During training time, we can learn the model by maximizing the exact log likelihood over the
dataset D.
Under the assumption that the samples are independently and identically distributed, we obtain
$$\log(p_x(D)) = \sum_{x \in D} \left[\log p_z(f^{-1}(x)) + \sum_k \log\left|\det\left(\frac{\partial f_k^{-1}(x)}{\partial x}\right)\right|\right]$$
8.5.2 Inference
To generate a sample x, we can draw a sample z from p_z(·) and transform it via f (as we can see
from the backward pass).
To evaluate the probability of an observation x, we use the inverse transform to get its latent
variable z, and evaluate its density under p_z(·).
8.6 Model architecture 123
Although coupling layers can be powerful, their forward transformation leaves some components
unchanged. This difficulty can be overcome by composing coupling layers in an alternating pattern,
such that the components that are left unchanged in one coupling layer are updated in the next.
In practice, we shuffle the input and process one part at each step, so that at the end all of the input
has been processed. How this shuffle happens is the main difference between different papers.
Note: K and L are hyperparameters of our network architecture.
The following are some applications of normalizing flows in the field of Computer Vision
• Super-Resolution
• Disentanglement
• Multimodal modeling
• Noise modeling
• 3D Pose Estimation
9. GAN
All the generative models we have seen so far work by maximizing the likelihood.
A question arises naturally: "Is the likelihood a good indicator of the quality of the generated samples?". In
order to find an answer, let us consider the following two examples.
According to Theis et al., A Note on the Evaluation of Generative Models, chapter 3.2, we can
obtain a high log likelihood even if we are generating poor samples.
Let p(x) be a model which generates good samples and q(x) a model which generates bad ones
(just noise).
Consider the log likelihood of the mixture model 0.01p(x) + 0.99q(x). Since q(x) ≥ 0, we have
log(0.01p(x) + 0.99q(x)) ≥ log(0.01p(x)) = log p(x) − log 100.
For high-dimensional data, log p(x) will be proportional to the dimensionality d while log 100 stays constant. Thus, we
have obtained a model with a high log-likelihood that generates poor samples (the samples of this
model will be noise 99% of the time).
For these reasons, we are interested in a Likelihood-free model (also called Implicit Model or Neural
Sampler).
On one side, it solves the main two problems of the models we have seen before, so that:
On the other side another problem arises, since it lacks theory and learning algorithms when
compared to explicit models.
The basic idea of the GAN is to draw samples from a simple distribution (e.g. random noise) and use
a neural network to learn a transformation into a realistic image.
9.3 Definitions
• The generator is trained to (ideally) map random normal-distributed inputs, drawn from Z
(latent space), to a sample following the data distribution as output.
G : ℝ^Q → ℝ^D
D : ℝ^D → [0, 1]
9.4 Training
Generator G and Discriminator D can be implemented with arbitrary architectures, MLPs, CNNs,
RNNs.
In particular:
• the discriminator will be trained to predict 0 on the generated images x̂ and 1 on the real ones
x
• the generator will then be trained to confuse the discriminator and make it output the
opposite
In practice, we reach an equilibrium when the discriminator outputs 0.5 for every image. Once
training has succeeded, the generator is used to represent pmodel, from which we want to draw samples.
In theory, it has been proven that this minimax game recovers pmodel = pdata if D and G are given
enough capacity and assuming that D∗ can be reached.
• is computationally prohibitive
• on finite datasets would result in overfitting
This procedure aims to keep D near its optimum while G changes only slowly.
More precisely, the training algorithm is the following:
While not converged do:
In practice, equation (9.1) may not provide sufficient gradient for G to learn well.
Early in learning, when G is poor, D can reject samples with high confidence because they are
clearly different from the training data. In this case, log(1 − D(G(z))) saturates.
Rather than training G to minimize log(1 − D(G(z))) we can train G to maximize log D(G(z)).
This objective function results in the same fixed point of the dynamics of G and D but provides
much stronger gradients early in learning.
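The saturation argument can be made concrete by comparing the gradients of the two generator objectives with respect to the discriminator's logit. The sketch assumes D(G(z)) = sigmoid(a), where a is the logit assigned to a generated sample (an assumption about the last layer of D); the function names are hypothetical.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def grad_saturating(a):
    """d/da log(1 - sigmoid(a)) = -sigmoid(a): the objective G minimizes."""
    return -sigmoid(a)


def grad_non_saturating(a):
    """d/da log(sigmoid(a)) = 1 - sigmoid(a): the objective G maximizes."""
    return 1.0 - sigmoid(a)


# Early in training D confidently rejects fakes, e.g. a = -6 gives D(G(z)) close to 0.
early_logit = -6.0
g_sat = grad_saturating(early_logit)
g_non_sat = grad_non_saturating(early_logit)
```

When D rejects the samples with high confidence, the first gradient is close to zero while the second stays close to 1, which matches the claim that the alternative objective provides much stronger gradients early in learning.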
Thus, point 2 becomes:
For one step do:
Our goal now is to find a generator G that will fool any D. In order to do that we must increase
L(D).
In particular, to find the optimal generator G∗, we define
$$V(G, D) = \mathbb{E}_{x \sim p_{data}}[\log(D(x))] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$$
where x ∼ pdata and x̂ = G(z) ∼ pmodel.
In order to fool any discriminator D (and so even the best one) we have to satisfy
$$G^*, D^* = \arg\min_G \max_D V(G, D)$$
$$D^* = \frac{p_{data}(x)}{p_{data}(x) + p_{model}(x)}$$
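Proof. The training criterion for the discriminator D, given any generator G, is to maximize the
quantity V(G, D), in particular
$$\begin{aligned}
V(G, D) &= \mathbb{E}_{x \sim p_{data}}[\log(D(x))] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))] \\
&= \int_x p_{data}(x) \log(D(x)) dx + \int_z p_z(z) \log(1 - D(G(z))) dz \\
&= \int_x \left[ p_{data}(x) \log(D(x)) + p_{model}(x) \log(1 - D(x)) \right] dx
\end{aligned}$$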
From mathematical analysis it can be proven that, for any (a, b) ∈ ℝ² \ {(0, 0)} with a, b ≥ 0, the function y → a log(y) + b log(1 − y) achieves
its maximum on [0, 1] at y = a/(a + b). (The discriminator does not need to be defined outside of Supp(pdata) ∪
Supp(pmodel), where both densities are 0.) Thus,
$$D^*(x) = \frac{p_{data}(x)}{p_{data}(x) + p_{model}(x)}$$
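The location of the maximum can be verified numerically with a simple grid search. The values of a and b below are arbitrary illustrative choices standing in for pdata(x) and pmodel(x) at a fixed x.

```python
import math


def objective(y, a, b):
    """Pointwise integrand a*log(y) + b*log(1 - y)."""
    return a * math.log(y) + b * math.log(1.0 - y)


def argmax_on_grid(a, b, n=9999):
    """Grid search over y in (0, 1) with step 1 / (n + 1)."""
    grid = [(i + 1) / (n + 1) for i in range(n)]
    return max(grid, key=lambda y: objective(y, a, b))


a, b = 0.3, 0.1
y_star = argmax_on_grid(a, b)
closed_form = a / (a + b)   # 0.75
```

The grid maximizer agrees with a/(a + b) up to the grid resolution, as predicted by setting the derivative a/y − b/(1 − y) to zero.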
Now that we have found the optimal theoretical value for the discriminator, we are interested in
seeing, globally, which is the function we are wishing to optimize.
Proof
2. Notice that
$$1 - \frac{p_{data}(x)}{p_{data}(x) + p_{model}(x)} = \frac{p_{data}(x) + p_{model}(x)}{p_{data}(x) + p_{model}(x)} - \frac{p_{data}(x)}{p_{data}(x) + p_{model}(x)} = \frac{p_{model}(x)}{p_{data}(x) + p_{model}(x)}$$
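so, substituting this in the expression obtained in 1, we have
$$\mathbb{E}_{x \sim p_{data}}\left[\log\left(\frac{p_{data}(x)}{p_{data}(x) + p_{model}(x)}\right)\right] + \mathbb{E}_{x \sim p_{model}}\left[\log\left(\frac{p_{model}(x)}{p_{data}(x) + p_{model}(x)}\right)\right]$$
3. Now inside the two logarithms we multiply and divide by 2, obtaining
$$\begin{aligned}
& \mathbb{E}_{x \sim p_{data}}\left[\log\left(\frac{2 \cdot p_{data}(x)}{2 \cdot (p_{data}(x) + p_{model}(x))}\right)\right] + \mathbb{E}_{x \sim p_{model}}\left[\log\left(\frac{2 \cdot p_{model}(x)}{2 \cdot (p_{data}(x) + p_{model}(x))}\right)\right] \\
=\; & \mathbb{E}_{x \sim p_{data}}\left[\log\left(\frac{2 \cdot p_{data}(x)}{p_{data}(x) + p_{model}(x)}\right) - \log 2\right] + \mathbb{E}_{x \sim p_{model}}\left[\log\left(\frac{2 \cdot p_{model}(x)}{p_{data}(x) + p_{model}(x)}\right) - \log 2\right] \\
=\; & \mathbb{E}_{x \sim p_{data}}\left[\log\left(\frac{2 \cdot p_{data}(x)}{p_{data}(x) + p_{model}(x)}\right)\right] + \mathbb{E}_{x \sim p_{model}}\left[\log\left(\frac{2 \cdot p_{model}(x)}{p_{data}(x) + p_{model}(x)}\right)\right] - \log 4 \\
=\; & D_{KL}\left(p_{data} \,\Big\|\, \frac{p_{data} + p_{model}}{2}\right) + D_{KL}\left(p_{model} \,\Big\|\, \frac{p_{data} + p_{model}}{2}\right) - \log 4 \\
=\; & 2 \cdot D_{JS}\left(p_{data} \,\|\, p_{model}\right) - \log 4
\end{aligned}$$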
4. Since we know that D_JS ≥ 0 always holds and we want to minimize the training criterion, we can conclude
that
• we achieve a minimum when D_JS(pdata ∥ pmodel) = 0, which happens iff pdata(x) =
pmodel(x)
• the optimum V(G, D∗) is − log 4
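The identity V(G, D∗) = 2·D_JS(pdata ∥ pmodel) − log 4 can be checked on small discrete distributions. The two distributions below are arbitrary illustrative choices, and the sketch assumes that pdata(x) + pmodel(x) > 0 for every outcome.

```python
import math


def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js(p, q):
    m = [(a + b) / 2.0 for a, b in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def value_at_optimal_d(p_data, p_model):
    """V(G, D*) with D*(x) = p_data(x) / (p_data(x) + p_model(x))."""
    v = 0.0
    for pd, pm in zip(p_data, p_model):
        d_star = pd / (pd + pm)
        if pd > 0:
            v += pd * math.log(d_star)
        if pm > 0:
            v += pm * math.log(1.0 - d_star)
    return v


p_data = [0.5, 0.3, 0.2]
p_model = [0.1, 0.6, 0.3]
lhs = value_at_optimal_d(p_data, p_model)
rhs = 2.0 * js(p_data, p_model) - math.log(4.0)
v_equal = value_at_optimal_d(p_data, p_data)   # equals -log 4 when p_model = p_data
```

When the two distributions coincide, D∗ is 0.5 everywhere and the value reduces to log 0.5 + log 0.5 = − log 4.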
9.5 Theoretical analysis 131
Remark Observe that these assumptions are very strong, we are requiring that :
There are mainly two difficulties which may occur during training
1. Mode collapse
2. Divergence of the generator
When the generator finds a very likely sample it starts producing only samples very similar to that
one, cycling over a small set of output types.
This phenomenon is called mode collapse.
The most used solution to mode collapse is the Unrolled GAN.
The idea of the unrolled GAN is to move the generator forward in the game and prepare it
for the next moves. In particular, after the k updates of D, the generator is optimized once w.r.t.
the state of D after the next k steps. This often discourages G from exploiting a local minimum.
The dimensionality of many real-world datasets, as represented by pdata, only appears to be
high. The data have been found to concentrate on a lower dimensional manifold. Thinking of real
world images, once the theme or the contained object is fixed, the images have a lot of restrictions
to follow, e.g., a dog should have two ears and a tail, and a skyscraper should have a straight and tall
body, etc. These restrictions keep images away from the possibility of having a high-dimensional
free form.
Because both pmodel and pdata rest on low dimensional manifolds, they are almost certainly going to be
disjoint. When they have disjoint supports, we are always capable of finding a perfect discriminator
that separates real and fake samples 100% correctly.
In particular, when the discriminator is perfect we have that D(x) = 1, ∀x ∈ Supp(pdata) and D(x) = 0,
∀x ∈ Supp(pmodel). Therefore, the loss function falls to zero and we end up with no gradient to update
the generator during learning iterations. Thus, the learning of the generator becomes very slow.
A solution to this problem is the Wasserstein GAN, which uses another measure of similarity
between the two distributions, the Wasserstein distance. This measure takes into account
the amount of work required to make one distribution similar to the other and, in practice, it does
a good job.
9.7 Comparison with VAE 133
You train your GAN model on images of cats and dogs. Now that you have a generator that
can produce images of animals, you would like to be able to control the properties of the image.
Some examples of properties could be the animal type or fur colour. Therefore, you would like to
introduce some measure of control over the output of the generator. Explain how you would extend
the basic GAN framework to introduce a measure of control and what the training would look like.
One way of doing this would be to introduce a class label into the input of the generator and
discriminator, leading to the following modified loss function:
Where y is the corresponding class label (for example, corresponding to cats or dogs). This was
introduced by Conditional Generative Adversarial Net, Mirza et al., 2014. However this requires
the dataset to have the appropriate labels. The training would look exactly the same as in the
original algorithm, with the exception of passing the labels. Therefore if you want to produce an
image of a certain class, you just need to pass the corresponding class label and the generator will
output the corresponding image.
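The modified loss function mentioned in this answer is not reproduced in the text. For reference, the conditional objective as it is usually written following Mirza et al. is sketched below, using the same symbols as the unconditional value function; both networks receive the label y as an additional input.

```latex
\[
\min_G \max_D V(G, D) =
\mathbb{E}_{x \sim p_{data}}\left[\log D(x \mid y)\right]
+ \mathbb{E}_{z \sim p_z}\left[\log\left(1 - D(G(z \mid y))\right)\right]
\]
```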
IV
Part Four: Deep Learning For Computer Vision
10.1.1 Introduction
2D human pose representation and estimation consists of two main fields of study that should be
combined:
• Body modeling
• Feature representation Learning
We will first analyze them separately and then we will understand how to efficiently combine them.
Question: can we understand how the different parts of the body are linked to each other?
Given a 2D image I, we denote by l_i = (x_i, y_i) the estimated position of vertex i. Thus, given
an image I and a configuration estimate L = (l_1, ..., l_k) we can define a score as follows:
$$S(I, L) = \sum_{i \in V} \alpha_i \cdot \phi(I, l_i) + \sum_{i,j \in E} \beta_{ij} \cdot \psi(l_i, l_j)$$
where:
• ϕ(I, l_i) is the unary term, a feature vector which provides information on the pixel at
location l_i (e.g. a patch extracted from the original image, possibly modified using convolutions
etc.)
• ψ(l_i, l_j) is the pairwise term between part i and part j, which is a spatial feature which
depends on the location of l_i relative to l_j.
10.2 Body modeling 139
It has been proved empirically that a mixture of non-oriented pictorial structures can outperform
explicitly articulated parts because mixture models can capture orientation-specific statistics of
background features.
Thus, we have to slightly modify the framework previously discussed introducing the concept of
mixture models. Let us call mi the type (mixture component) of part i.
The mixture component can express many concepts as orientations of a part (e.g., a vertical versus
horizontally oriented hand), but types may span out-of-plane rotations (front-view head versus
side-view head) or even semantic classes (an open versus closed hand).
Formally, the score becomes:
$$S(I, L, M) = \sum_{i \in V} \alpha_i^{m_i} \cdot \phi(I, l_i) + \sum_{i,j \in E} \beta_{ij}^{m_i m_j} \cdot \psi(l_i, l_j) + S(M)$$
where:
• α_i^{m_i} is the local appearance template for part i with type assignment m_i
• β_ij^{m_i m_j} is the spatial spring parameter for the pair of types (m_i, m_j). It expresses the likelihood of
having template m_i for part i and template m_j for part j given the distance between l_i and l_j
• S(M) is the co-occurrence bias and it is defined as
$$S(M) = \sum_{ij \in E} b_{ij}^{m_i m_j}$$
where b_ij^{m_i m_j} is the pairwise co-occurrence prior between part i with mixture type m_i and part j
with mixture type m_j; it favors particular co-occurrences of part types.
140 Chapter 10. Parametric Body models and Applications
• Direct regression
• Heatmaps
Figure 10.1: DeepPose: Human Pose Estimation via Deep Neural Networks
10.3 Feature Representation Learning 141
10.3.2 Heatmaps
Figure 10.2: First row: part detection heatmaps. Second row: output of CNN regression
every stage the architecture operates both on image evidence as well as belief maps from preceding
stages. In each stage, the computed beliefs provide an increasingly refined estimate for the location
of each part.
This is the complete architecture
The idea is to use the predictions obtained using the heatmaps and then refine them using body
modelling.
Another widely used technique is spatio-temporal inference, which uses the temporal continuity of the
body to obtain better predictions.
As a case study we use Thin-Slicing Network
where ψ_{i,j}(p_i, p_j) = w_{i,j} · d(p_i, p_j), d(p_i, p_j) = [Δx, Δx², Δy, Δy²] and w encodes rest location and
rigidity between pairs.
For a slice window we have
$$S_{slice} = \sum_{t=1}^{T} S(I^t, p^t) + \sum_{(i,i^*) \in E_f} \psi_{i,i^*}(p_i, p'_{i^*})$$
where p′i∗ = pi∗ + fi∗ ,i (pi∗ ) and fi∗ ,i (pi∗ ) is the optical flow evaluated at pi∗ (this is the flow warping
process in which pixel-wise flow tracks are applied to align confidence values in neighboring frames
to the target frame). As a matter of fact the term ψi,i∗ (pi , p′i∗ ) regularizes the temporal consistency
of the part i in neighboring frames.
10.4 Body modelling + Deep Representation Learning 143
10.4.2 Inference
Inference corresponds to maximizing Sslice over p for the image sequence slice.
When the relational graph G = (V, E) is tree-structured, exact belief propagation can be applied efficiently by one pass of dynamic programming in polynomial time.
However, loopy belief propagation algorithms such as the Max-Sum algorithm make approximate inference possible in intractable loopy models.
More precisely, in our case at each iteration a part i sends a message to its neighbors and also receives reciprocal messages along the edges in G:
\[ \text{score}_i(p_i) \leftarrow \phi_i(p_i \mid I) + \sum_{k \in \text{child}(i)} m_{ki}(p_i) \]
where child(i) is the set of children of part i. The local score $\text{score}_i(p_i)$ is the sum of the unary term and the messages collected from all of its children. The messages $m_{ki}(p_i)$ (the best score that can be achieved when vertex i is at position $p_i$ and $p_k$ is free to vary) sent from body part k to part i are given by:
Using this process we eventually obtain the maximization of the sum over all the ϕi (pi |I) and all
the terms coming from the edges.
\[ \frac{\partial m_{ki}(p_i)}{\partial \psi_{k,i}(p_k, p_i)} = \begin{cases} 1 & \text{if } p_k = p^* \\ 0 & \text{otherwise} \end{cases} \]
In order to represent the body in 3D we use a 3D mesh that is designed by an artist and contains around 7000 vertices.
In order to define a body we need to define its shape and its pose.
10.5.2 Shape
In order to define the shape we run PCA on meshes in canonical pose to estimate the directions of maximal shape variation. This yields a low-dimensional subspace (10D-300D) in canonical pose. Note that usually 10 dimensions are enough to define a shape.
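As a rough illustration of the PCA step, the sketch below fits a low-dimensional shape space to a set of registered meshes and reconstructs one mesh from its coefficients. The function names (`fit_shape_space`, `encode`, `decode`) and the synthetic data are assumptions made for this example; they are not part of any body-model library.

```python
import numpy as np

def fit_shape_space(meshes, num_components=10):
    """Fit a PCA shape space to registered meshes of shape (M, N, 3)."""
    flat = meshes.reshape(meshes.shape[0], -1)      # (M, 3N)
    mean = flat.mean(axis=0)
    # Rows of vt are the principal directions of shape variation.
    _, _, vt = np.linalg.svd(flat - mean, full_matrices=False)
    return mean, vt[:num_components]

def encode(mesh, mean, basis):
    """Project one mesh (N, 3) onto the shape coefficients beta."""
    return basis @ (mesh.reshape(-1) - mean)

def decode(beta, mean, basis):
    """Reconstruct a mesh (N, 3) from shape coefficients beta."""
    return (mean + beta @ basis).reshape(-1, 3)

# Synthetic data (an assumption for illustration): a random template
# deformed along two hidden directions.
rng = np.random.default_rng(0)
template = rng.normal(size=(50, 3))
directions = rng.normal(size=(2, 50, 3))
coefficients = rng.normal(size=(100, 2))
meshes = template + np.einsum("mk,knd->mnd", coefficients, directions)

mean, basis = fit_shape_space(meshes, num_components=2)
beta = encode(meshes[0], mean, basis)
reconstruction = decode(beta, mean, basis)
```

Because the synthetic meshes vary along only two directions, two coefficients are enough to reconstruct them; real scans would need more components and would leave a residual.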
10.5.3 Pose
Linear blend skinning is the simplest mesh skinning method. The idea is to transform the vertices of a single mesh by a blend of multiple transforms.
The deformed position of a point (vertex) is the sum of the positions determined by each bone's transform alone, weighted by that vertex's weight for that bone.
In particular, for each vertex i with rest position $t_i$, its position $t'_i$ in the transformed pose is
\[ t'_i = \sum_k w_{ki} \, G_k(\theta, J) \, t_i \]
Thus, in this model posed vertices are linear combinations of transformed template vertices.
Pro: simple and fast to compute (widely used in video games).
Con: it produces well-known artifacts.
Figure 10.7: In linear blend skinning, the surface collapses in the presence of strong twists
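The skinning formula and the twist artifact can be reproduced in a few lines. This is a minimal sketch, assuming homogeneous 4 × 4 bone transforms and a hypothetical two-bone rig; `linear_blend_skinning` and `rotation_z` are names chosen for the example.

```python
import numpy as np

def linear_blend_skinning(rest_vertices, weights, transforms):
    """t'_i = sum_k w_ki G_k t_i.

    rest_vertices: (N, 3), weights: (N, K) with rows summing to 1,
    transforms: (K, 4, 4) homogeneous bone transforms G_k.
    """
    n = rest_vertices.shape[0]
    homogeneous = np.concatenate([rest_vertices, np.ones((n, 1))], axis=1)
    # Position of every vertex under every single bone transform: (K, N, 4).
    per_bone = np.einsum("kab,nb->kna", transforms, homogeneous)
    # Blend the per-bone positions with the skinning weights.
    blended = np.einsum("nk,kna->na", weights, per_bone)
    return blended[:, :3]

def rotation_z(angle):
    """Homogeneous rotation about the z axis (the bone axis in this toy rig)."""
    c, s = np.cos(angle), np.sin(angle)
    g = np.eye(4)
    g[:2, :2] = [[c, -s], [s, c]]
    return g

# Three vertices on a cylinder of radius 1 around the z axis.
rest = np.array([[1.0, 0.0, 0.0], [1.0, 0.0, 1.0], [1.0, 0.0, 2.0]])
weights = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])
# Bone 0 stays fixed, bone 1 is twisted by 180 degrees.
transforms = np.stack([np.eye(4), rotation_z(np.pi)])
posed = linear_blend_skinning(rest, weights, transforms)
```

The middle vertex is an equal blend of two positions on opposite sides of the axis, so it lands on the axis itself: the cylinder collapses at the joint, which is the artifact illustrated in Figure 10.7.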
SMPL
• si (β): vertex i in BS (β), which represents offset from the template depending on the shape
described by β
• pi (θ): vertex i in BP (θ), which represents offset from the template depending on the pose
described by θ
Notationally, the values to the right of a semicolon represent learned parameters, while those on
the left are parameters set by an animator.
We denote by $R : \mathbb{R}^{|\theta|} \to \mathbb{R}^{9K}$ a function that maps a pose vector θ to a vector of concatenated part-relative rotation matrices (each rotation matrix has dimensions 3 × 3).
Given that our rig has 23 joints, we have K = 23 and thus R(θ) is a vector of length 23 × 9 = 207.
Elements of R(θ) are functions of sines and cosines of joint angles, and therefore R(θ) is non-linear in θ.
If we define $\theta^*$ as the rest pose, then the vertex deviations from the rest template are
\[ B_P(\theta, \mathcal{P}) = \sum_{n=1}^{9K} \left( R_n(\theta) - R_n(\theta^*) \right) P_n \]
where $P_n \in \mathbb{R}^{3N}$ are vectors of vertex displacements. Thus, $\mathcal{P} = [P_1, \ldots, P_{9K}] \in \mathbb{R}^{3N \times 9K}$ is a matrix of all 207 pose blend shapes.
As a consequence of this formula, the rotation of a particular joint can influence all the body
vertices, not only the local ones.
Note that subtracting the rest pose rotation vector, $R(\theta^*)$, guarantees that the contribution of the pose blend shapes is zero in the rest pose, which is important for animation.
Summing up, there are 9 coefficients that describe the rotation of each joint; for example, $R_1(\theta) - R_1(\theta^*)$ is the deviation from the rest pose of the first rotation coefficient of joint 1.
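The pose blend shape sum can be sketched as a single matrix-vector product. The example assumes an axis-angle pose vector (three numbers per joint) converted with the Rodrigues formula, and a random matrix standing in for the learned blend shapes; both are assumptions for illustration, as are the function names.

```python
import numpy as np

def rodrigues(axis_angle):
    """Axis-angle vector (3,) to rotation matrix (3, 3)."""
    angle = np.linalg.norm(axis_angle)
    if angle < 1e-12:
        return np.eye(3)
    k = axis_angle / angle
    cross = np.array([[0.0, -k[2], k[1]],
                      [k[2], 0.0, -k[0]],
                      [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(angle) * cross + (1.0 - np.cos(angle)) * (cross @ cross)

def rotation_features(theta):
    """R(theta): pose of shape (K, 3) to a vector of 9K rotation entries."""
    return np.concatenate([rodrigues(joint).reshape(-1) for joint in theta])

def pose_blend_offsets(theta, theta_rest, blend_shapes):
    """B_P = sum_n (R_n(theta) - R_n(theta*)) P_n, with blend_shapes of shape (3N, 9K)."""
    delta = rotation_features(theta) - rotation_features(theta_rest)
    return (blend_shapes @ delta).reshape(-1, 3)

num_joints, num_vertices = 23, 5
rng = np.random.default_rng(0)
# Random stand-in for the learned matrix of pose blend shapes.
blend_shapes = rng.normal(size=(3 * num_vertices, 9 * num_joints))
theta_rest = np.zeros((num_joints, 3))
theta = theta_rest.copy()
theta[0] = [0.0, 0.0, np.pi / 2]      # rotate only the first joint

offsets_rest = pose_blend_offsets(theta_rest, theta_rest, blend_shapes)
offsets_posed = pose_blend_offsets(theta, theta_rest, blend_shapes)
```

With a dense blend shape matrix, rotating a single joint moves every vertex, while the rest pose produces zero offsets by construction.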
10.5 3D human pose representation and estimation 147
SMPL summary
As a result we obtain
The next part is based on Song et al., "Human Body Model Fitting by Learned Gradient Descent".
The novel idea is to train a neural network to do the optimization step.
The algorithm used is the following
In practice we sample from the training set and we try to reconstruct the ground truth starting
As we have seen before, the universal approximation theorem ensures that neural networks can approximate any continuous function. Thus, since a 3D shape can be described by a continuous function, we should not be surprised that neural networks are able to represent it.
11.2 3D representations
• Voxels:
– Voxels are the 3D counterpart of pixels: a discretization of 3D space into a grid
– Cons: memory grows cubically with the resolution, $O(n^3)$, thus the resolution is limited
• Points:
– Discretization of the surface into 3D points
– Cons: it does not model connectivity / topology
• Meshes:
– Discretization into vertices and faces
– Requires either
∗ Class-specific templates
∗ A fixed maximum number of vertices
– Cons: there will always be an approximation error, and they can lead to self-intersections
• Implicit functions:
– Learn the analytic function that represents the 3D surface
– Pro: no approximation error + smooth and continuous surface
150 Chapter 11. Neural Implicit Representations
• Occupancy Networks: $f_\theta : \mathbb{R}^3 \times \mathcal{X} \to [0, 1]$, outputs the probability of a point being inside the surface
• DeepSDF (Signed Distance Field): $f_\theta : \mathbb{R}^3 \times \mathcal{X} \to \mathbb{R}$, outputs the signed distance from the surface (negative if inside, positive if outside)
Pros:
How can we learn the function f? What should we use as ground truth?
In general, we can choose one of these three representations of the ground truth:
1. Watertight Meshes
2. Point Cloud
3. 2D Images
This is the simplest case (watertight meshes have no holes, thus space is divided into inside and outside): we uniformly sample points in the volume around the mesh, label each as inside or outside, and train the model using Binary Cross Entropy.
\[ L(\theta, \psi) = \sum_{j=1}^{K} \text{BCE}\left( f_\theta(p_{ij}, z_i), o_{ij} \right) \]
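To make the loss concrete, the sketch below evaluates it for a single object. A sphere of radius 0.5 stands in for the watertight mesh, and a soft sphere with one parameter (its radius) stands in for the network $f_\theta$; both are assumptions for illustration, and the conditioning code $z_i$ is omitted.

```python
import math
import random

def bce(prediction, target, eps=1e-12):
    """Binary cross entropy for one predicted probability."""
    prediction = min(max(prediction, eps), 1.0 - eps)
    return -(target * math.log(prediction) + (1.0 - target) * math.log(1.0 - prediction))

def ground_truth_occupancy(p, radius=0.5):
    """Label o: 1 inside the stand-in watertight shape, 0 outside."""
    return 1.0 if math.sqrt(sum(c * c for c in p)) < radius else 0.0

def soft_sphere(p, radius, sharpness=20.0):
    """Toy model f_theta: occupancy probability of a sphere with the given radius."""
    dist = math.sqrt(sum(c * c for c in p))
    return 1.0 / (1.0 + math.exp(sharpness * (dist - radius)))

def occupancy_loss(model_radius, points, labels):
    """Sum of BCE terms over the sampled points of one object."""
    return sum(bce(soft_sphere(p, model_radius), o) for p, o in zip(points, labels))

random.seed(0)
# Uniform samples in the bounding cube [-1, 1]^3 around the shape.
points = [[random.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(512)]
labels = [ground_truth_occupancy(p) for p in points]

loss_matching = occupancy_loss(0.5, points, labels)   # model agrees with the shape
loss_too_big = occupancy_loss(0.9, points, labels)    # model sphere is too large
```

The loss is lower when the model's surface coincides with the ground-truth surface, which is what gradient descent on θ exploits.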
11.4.3 2d images
Our goal is to learn $f_\theta$ (occupancy function) and $t_\theta$ (texture) from 2D image observations. Consider a single image observation. We define a photometric reconstruction loss
\[ L(\hat{I}, I) = \sum_u \left\| \hat{I}_u - I_u \right\| \]
where $I$ is the observed image (GT) and $\hat{I}$ is the image rendered by our implicit model. Moreover, $I_u$ denotes the RGB value of the observation $I$ at pixel $u$ and $\|\cdot\|$ is a (robust) photo-consistency measure such as the $\ell_1$-norm.
To minimize the reconstruction loss L w.r.t. the network parameters θ using gradient-based optimization techniques, we must be able to
• Render $\hat{I}$ given $f_\theta$ ($f_\theta = \tau$ if the point is on the surface, $> \tau$ if it is behind the surface and $< \tau$ if it is outside) and $t_\theta$
• Compute gradients of L w.r.t. the network parameters θ
Forward pass
We are given $r_0$, the position of the camera for the image we are analyzing
1. For each pixel u we query the occupancy network along the ray through u, which gives us a value:
• $f_\theta < \tau$: outside the surface
• $f_\theta = \tau$: on the surface
• $f_\theta > \tau$: behind the surface
2. For the points $\hat{p}$ with $f_\theta = \tau$ we evaluate the texture field $t_\theta(\hat{p})$
3. We assign the color $t_\theta(\hat{p})$ to pixel u
Secant method
In order to find the points which lie on the surface, we use the secant method.
The idea is the following
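A possible sketch of this search: march along the ray until the occupancy crosses τ, then refine the crossing with secant steps. The version below keeps each secant step inside the bracketing interval (a regula falsi variant), and uses a soft unit sphere as a stand-in for the occupancy network; the function names and step counts are assumptions.

```python
import math

def occupancy(p, radius=1.0, sharpness=10.0):
    """Toy f_theta: above 0.5 inside a sphere, below 0.5 outside."""
    dist = math.sqrt(sum(c * c for c in p))
    return 1.0 / (1.0 + math.exp(sharpness * (dist - radius)))

def point_on_ray(r0, w, d):
    """p = r0 + d * w."""
    return [r0[i] + d * w[i] for i in range(3)]

def find_surface_depth(f, r0, w, tau=0.5, d_max=5.0, n_steps=32, n_secant=8):
    """Depth of the first crossing f = tau along the ray, or None if the ray misses."""
    step = d_max / n_steps
    d_prev, f_prev = 0.0, f(point_on_ray(r0, w, 0.0)) - tau
    for j in range(1, n_steps + 1):
        d_cur = j * step
        f_cur = f(point_on_ray(r0, w, d_cur)) - tau
        if f_prev < 0.0 <= f_cur:                 # first outside-to-inside crossing
            lo, f_lo, hi, f_hi = d_prev, f_prev, d_cur, f_cur
            for _ in range(n_secant):
                # Secant step: root of the line through (lo, f_lo) and (hi, f_hi).
                d_new = lo - f_lo * (hi - lo) / (f_hi - f_lo)
                f_new = f(point_on_ray(r0, w, d_new)) - tau
                if f_new < 0.0:
                    lo, f_lo = d_new, f_new
                else:
                    hi, f_hi = d_new, f_new
            return d_new
        d_prev, f_prev = d_cur, f_cur
    return None

camera = [0.0, 0.0, -3.0]
depth_hit = find_surface_depth(occupancy, camera, [0.0, 0.0, 1.0])
depth_miss = find_surface_depth(occupancy, [0.0, 3.0, -3.0], [0.0, 0.0, 1.0])
```

For the ray through the sphere's center the surface is at depth 2 (camera at z = -3, sphere radius 1); the offset ray never enters the sphere and returns `None`.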
Backward pass
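Let us call $I$ the real image and $\hat{I}$ the predicted one. As we have said before, we define the loss as $L(\hat{I}, I) = \sum_u \| \hat{I}_u - I_u \|$.
The gradient of the loss w.r.t. our parameters will be
\[ \frac{\partial L}{\partial \theta} = \sum_u \frac{\partial L}{\partial \hat{I}_u} \cdot \frac{\partial \hat{I}_u}{\partial \theta} \]
where
\[ \frac{\partial \hat{I}_u}{\partial \theta} = \frac{\partial t_\theta(\hat{p})}{\partial \theta} + \frac{\partial t_\theta(\hat{p})}{\partial \hat{p}} \cdot \frac{\partial \hat{p}}{\partial \theta} \]
In order to evaluate $\partial \hat{p} / \partial \theta$ we need implicit differentiation.
Consider the ray $\hat{p} = r_0 + \hat{d} w$ and the condition for the intersection between the ray and the surface (remember that we evaluate the color only for points on the surface), and take the derivative on both sides:
\[ \begin{aligned}
f_\theta(\hat{p}) &= \tau \\
\frac{\partial f_\theta(\hat{p})}{\partial \theta} + \frac{\partial f_\theta(\hat{p})}{\partial \hat{p}} \cdot \frac{\partial \hat{p}}{\partial \theta} &= 0 && \text{($\tau$ is a constant)} \\
\frac{\partial f_\theta(\hat{p})}{\partial \theta} + \frac{\partial f_\theta(\hat{p})}{\partial \hat{p}} \cdot w \, \frac{\partial \hat{d}}{\partial \theta} &= 0 && \text{($\hat{p} = r_0 + \hat{d} w$ and $r_0$ is constant)} \\
\frac{\partial \hat{d}}{\partial \theta} &= - \left( \frac{\partial f_\theta(\hat{p})}{\partial \hat{p}} \cdot w \right)^{-1} \frac{\partial f_\theta(\hat{p})}{\partial \theta} && \text{(we obtain an expression for $\tfrac{\partial \hat{d}}{\partial \theta}$)}
\end{aligned} \]
As $\hat{p} = r_0 + \hat{d} w$ we have that
\[ \frac{\partial \hat{p}}{\partial \theta} = w \, \frac{\partial \hat{d}}{\partial \theta} = - w \left( \frac{\partial f_\theta(\hat{p})}{\partial \hat{p}} \cdot w \right)^{-1} \frac{\partial f_\theta(\hat{p})}{\partial \theta} \]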
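The implicit-differentiation result can be checked numerically on a case where the answer is known in closed form. In this sketch the "network" is a soft sphere whose only parameter θ is its radius (an assumption for illustration), so a ray from distance 3 along the z axis hits the surface at depth $\hat{d} = 3 - \theta$ and the true derivative is −1. The partial derivatives are approximated with central differences.

```python
import math

def occupancy(p, theta, sharpness=10.0):
    """Toy f_theta: soft sphere of radius theta, equal to 0.5 exactly on the surface."""
    dist = math.sqrt(sum(c * c for c in p))
    return 1.0 / (1.0 + math.exp(-sharpness * (theta - dist)))

def depth_gradient(r0, w, d_hat, theta, h=1e-6):
    """d d_hat / d theta = -(df/dp . w)^(-1) * df/dtheta at the surface point."""
    p_hat = [r0[i] + d_hat * w[i] for i in range(3)]
    df_dtheta = (occupancy(p_hat, theta + h) - occupancy(p_hat, theta - h)) / (2 * h)
    # Directional derivative of f along the ray direction w.
    p_fwd = [p_hat[i] + h * w[i] for i in range(3)]
    p_bwd = [p_hat[i] - h * w[i] for i in range(3)]
    df_dp_dot_w = (occupancy(p_fwd, theta) - occupancy(p_bwd, theta)) / (2 * h)
    return -df_dtheta / df_dp_dot_w

r0, w, theta = [0.0, 0.0, -3.0], [0.0, 0.0, 1.0], 1.0
d_hat = 3.0 - theta                    # depth at which the ray enters the sphere
grad = depth_gradient(r0, w, d_hat, theta)
```

Growing the sphere moves the intersection towards the camera at the same rate, so `grad` should be close to −1. Only derivatives at the surface point are needed, which is why the ray-marching steps do not have to be stored for backpropagation.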
So far we have learnt how to represent surfaces, but in some cases this is not enough because scenes are more complex.
In particular we have to learn:
11.5.1 Architecture
Previously we were interested in a single output, the RGB value of a pixel. The novelty of NeRF is the introduction of a density σ, which enables us to model the difficult surfaces mentioned before.
In particular, the network takes as input:
and outputs σ, the density of the point, and c, the RGB value of the point.
More formally the architecture they have proposed is the following (green represents input, blue
layers of the network and red outputs):
Some observations:
• The view directions θ and ϕ are given to the network only in later layers, after σ has been predicted, so that σ depends only on x, y, z and not on θ, ϕ
• After some layers the position x, y, z is fed to the network again to make sure it has not been washed out
11.5.2 Procedure
In order to get the color then we apply alpha compositing. To better understand the process,
consider the following formula:
• $\delta_i = t_{i+1} - t_i$
• $\alpha_i = 1 - e^{-\sigma_i \delta_i}$
The final color is then computed as a weighted average of the colors along the ray, in particular
\[ c = \sum_{i=0}^{N} T_i \alpha_i c_i \]
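The compositing sum can be sketched as a single loop over the samples of one ray. The transmittance is assumed here to be $T_i = \prod_{j<i} (1 - \alpha_j)$, the fraction of light that survives all earlier samples, as in the NeRF paper; the function name and the sample values are made up for the example.

```python
import math

def composite(sigmas, colors, t_values):
    """Alpha-composite samples along one ray.

    sigmas: N densities, colors: N RGB triples,
    t_values: N + 1 depths delimiting the N segments.
    """
    color = [0.0, 0.0, 0.0]
    weights = []
    transmittance = 1.0                       # T_i: light not yet absorbed
    for i in range(len(sigmas)):
        delta = t_values[i + 1] - t_values[i]
        alpha = 1.0 - math.exp(-sigmas[i] * delta)
        weight = transmittance * alpha        # T_i * alpha_i
        weights.append(weight)
        for channel in range(3):
            color[channel] += weight * colors[i][channel]
        transmittance *= 1.0 - alpha
    return color, weights

# Empty space, then an opaque red sample that hides the blue one behind it.
sigmas = [0.0, 1000.0, 1000.0]
colors = [(0.0, 1.0, 0.0), (1.0, 0.0, 0.0), (0.0, 0.0, 1.0)]
pixel, weights = composite(sigmas, colors, [0.0, 1.0, 2.0, 3.0])
```

The weights $T_i \alpha_i$ are the same quantities that drive the importance sampling along the ray: samples with high weight mark where the surface is likely to be.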
Since sampling is very expensive, one trick is to sample more densely at the most significant positions (i.e. positions with high weights $T_i \alpha_i$).
Figure 11.3: The sampling frequency of points along the ray is higher where $T_i \alpha_i$ is higher
Pro: NeRF can model transparency and thin structures, and is therefore a more flexible representation
Cons: it generally leads to worse geometry compared to implicit surfaces
Although neural networks are universal function approximators, the NeRF authors found that having the network $F_\Theta$ operate directly on the input coordinates x, y, z, θ, ϕ results in renderings that perform poorly at representing high-frequency variation in color and geometry. This happens because neural networks are biased towards learning lower-frequency functions.
The solution proposed in the paper is to introduce positional encoding: each input is mapped to a higher-dimensional space $\mathbb{R}^{2L}$ before the MLP is applied. Formally, the encoding function is
\[ \gamma(p) = \left( \sin(2^0 \pi p), \cos(2^0 \pi p), \ldots, \sin(2^{L-1} \pi p), \cos(2^{L-1} \pi p) \right) \]
Note that γ(·) is applied separately to each of the three coordinate values in x (which are normalized to lie in [−1, 1]) and to the three components of the Cartesian viewing direction unit vector d (which by construction lie in [−1, 1]). In their experiments, they set L = 10 for γ(x) and L = 4 for γ(d).
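The encoding is short enough to write out directly. The sketch below follows the formula for a scalar and then applies it per component; `positional_encoding` and `encode_vector` are names chosen for this example.

```python
import math

def positional_encoding(p, num_frequencies):
    """gamma(p) for a scalar p in [-1, 1]: 2L values (sin, cos at L frequencies)."""
    encoded = []
    for k in range(num_frequencies):
        frequency = (2 ** k) * math.pi
        encoded.append(math.sin(frequency * p))
        encoded.append(math.cos(frequency * p))
    return encoded

def encode_vector(v, num_frequencies):
    """Apply gamma separately to every component and concatenate the results."""
    return [value for component in v
            for value in positional_encoding(component, num_frequencies)]

position = [0.1, -0.4, 0.7]            # normalized x, y, z
direction = [0.0, 0.6, 0.8]            # unit viewing direction
encoded_position = encode_vector(position, 10)     # L = 10 for gamma(x)
encoded_direction = encode_vector(direction, 4)    # L = 4 for gamma(d)
```

With these settings a 3D position becomes a 60-dimensional input and a viewing direction a 24-dimensional one, so nearby points that differ only in fine detail are pushed far apart in the input space of the MLP.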
12.1 Motivations
In Reinforcement Learning, models learn how to act by interacting with the environment through actions.
This can be useful in many fields such as games, logistics and operations, and Robot Control/Computer Vision.
Reinforcement Learning is a problem, not a method. Given an unknown and uncertain environment, the goal is to choose the right actions in order to maximize the reward signal in the long term.
162 Chapter 12. Reinforcement Learning
12.3.1 Policy
Note that the value function depends on a policy π, i.e. on the way we are behaving.
The factor γ ∈ [0, 1] is introduced because we are usually more interested in immediate rewards than in future ones.
12.3.3 Model
• $P^a_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s, A_t = a]$ predicts the probability of the next state given a state and an action
• $R^a_s = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$ predicts the next immediate reward given a state and an action
12.4 Taxonomy of RL agents 163
• Value Based
• Policy Based
• Actor Critic: combination of value and policy based
The agent only has access to a policy and tries to adjust this policy directly in order to obtain the highest possible reward.
Markov decision processes formally describe an environment for reinforcement learning where the
environment is fully observable.
As a consequence:
• R is a reward function, $R_s = \mathbb{E}[R_{t+1} \mid S_t = s]$
• γ ∈ [0, 1] is a discount factor
In particular:
12.5.4 Return
Definition 12.5.4 The return $G_t$ is the total discounted reward from time-step t.
\[ G_t = R_{t+1} + \gamma R_{t+2} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \]
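For a finite episode the return can be computed for every time-step in one backward pass, using $G_t = R_{t+1} + \gamma G_{t+1}$. A minimal sketch (the function name and the reward values are chosen for illustration):

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, G_1, ...] where rewards[t] is R_{t+1}."""
    returns = []
    g = 0.0
    for reward in reversed(rewards):
        g = reward + gamma * g          # G_t = R_{t+1} + gamma * G_{t+1}
        returns.append(g)
    return returns[::-1]

returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
```

Here $G_0 = 1 + 0.5 \cdot 1 + 0.25 \cdot 1 = 1.75$, while the last step only sees its own reward.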
where a are all the outgoing actions from the current state s.
The action-value function $q_\pi(s, a)$ is the expected return starting from state s, taking action a, and then following policy π.
The optimal state-value function $v_*(s)$ is the maximum value function over all policies
\[ v_*(s) = \max_\pi v_\pi(s) = \max_a q_*(s, a) = \max_a \sum_{s'} p(s', r \mid s, a) \left( r + \gamma v_*(s') \right) \]
• is not linear
• has no closed form solution
• can be solved using many iterative solution methods, such as:
– DP
– Monte-Carlo Methods
– Temporal-Difference Learning (combination of DP and MC)
12.6 Dynamic Programming 167
DP is able to compute optimal policies given a perfect model of the world (MDP).
Thus, in order to apply DP we need to know the transition probabilities. This limits its practical utility, but it is still of great theoretical importance.
• Value iteration
1. Compute the optimal v∗ using the value iteration algorithm
2. Find a policy π that attains v∗
• Policy iteration
1. For the current policy π compute vπ
2. Update the policy π given vπ and obtain π′
3. Iterate until π′ ∼ π
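As an illustration of the value iteration recipe, the sketch below repeatedly applies the Bellman optimality backup on a tiny, made-up two-state MDP and then reads off the greedy policy. The data layout (`transitions[s][a]` as a list of `(probability, next_state, reward)` triples) is an assumption of this example.

```python
def value_iteration(transitions, gamma, tol=1e-10):
    """transitions[s][a] is a list of (probability, next_state, reward)."""
    def q_value(s, a, v):
        return sum(p * (r + gamma * v[s2]) for p, s2, r in transitions[s][a])

    v = [0.0] * len(transitions)
    while True:
        delta = 0.0
        for s in range(len(transitions)):
            # Bellman optimality backup for state s.
            best = max(q_value(s, a, v) for a in range(len(transitions[s])))
            delta = max(delta, abs(best - v[s]))
            v[s] = best
        if delta < tol:
            break
    # Step 2: extract the policy that is greedy with respect to v*.
    policy = [max(range(len(transitions[s])), key=lambda a: q_value(s, a, v))
              for s in range(len(transitions))]
    return v, policy

# Hypothetical MDP: in state 0, action 0 stays (reward 0), action 1 moves to
# state 1 (reward 1). In state 1, action 0 stays (reward 2), action 1 moves back.
mdp = [
    [[(1.0, 0, 0.0)], [(1.0, 1, 1.0)]],
    [[(1.0, 1, 2.0)], [(1.0, 0, 0.0)]],
]
v_star, pi_star = value_iteration(mdp, gamma=0.5)
```

Staying in state 1 forever is worth $2 / (1 - 0.5) = 4$, and moving there from state 0 is worth $1 + 0.5 \cdot 4 = 3$, so the greedy policy is "move" in state 0 and "stay" in state 1.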
What can we do when the state space is too big to iterate over?
The problem with Monte Carlo estimation is that, in order to know the value of a trajectory, we have to wait until the episode is complete: it cannot learn from incomplete episodes.
12.8 Temporal Difference Learning 169
TD learning allows learning from incomplete episodes. We no longer have to follow a trajectory all the way to its end: TD can learn before knowing the final outcome, online at every step.
Intuitively, the procedure is the following:
More formally:
Observe that in doing so we don't update the whole state space, only visited states!
However, we still have to find a criterion for visiting the state space. Basically there are two options:
At first glance, the greedy strategy may look better, but in practice it could get us stuck in local optima.
We have to find a balance between:
• exploration: gather more data to avoid missing out on a potentially large reward
• exploitation: stick with our current knowledge and build an optimal policy for the data we've seen
A good trade-off is the ϵ-greedy policy: in each state, with small probability ϵ choose an action at random, otherwise choose greedily. In practice it works well. It is suggested to decrease the value of ϵ over time, so that we have more exploration in the beginning.
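A minimal sketch of ϵ-greedy action selection with a decaying ϵ. The linear schedule and its default values are assumptions for illustration; other schedules (e.g. exponential decay) are equally common.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def decayed_epsilon(step, start=1.0, end=0.05, decay_steps=1000):
    """Linearly anneal epsilon from start to end over decay_steps steps."""
    fraction = min(step / decay_steps, 1.0)
    return start + fraction * (end - start)

rng = random.Random(0)
q_values = [0.1, 0.9, 0.3]
greedy_action = epsilon_greedy(q_values, epsilon=0.0, rng=rng)
explored = [epsilon_greedy(q_values, epsilon=1.0, rng=rng) for _ in range(200)]
```

With ϵ = 0 the choice is purely greedy; with ϵ = 1 every action is tried, which is the behaviour we want early in training.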
12.8.1 Implementations
• SARSA: on-policy; the Q-values are computed according to a policy, and the agent follows that same policy
• Q-Learning: off-policy; the Q-values are computed according to a greedy policy, but the agent follows a different exploration policy
The advantage of working directly with the Q function is that we explicitly take into account not only states but also actions. This is useful because in many cases we have to learn from external policies µ that have not taken the best actions, in order to estimate the values vπ(s) and qπ(s, a) of our optimal policy π, which is different from µ. As a consequence, we cannot freely choose the immediate next action (it is determined by the observations we have, and thus by the policy µ), but we can estimate its value by acting greedily on the following one.
SARSA
We follow the policy π to obtain a transition (s, a, r, s′), then compute the difference to our current estimate and update our value function as follows
where α is the learning rate and a′ is the action chosen by π in the state s′ .
In order to decide the next action we can use an ϵ-greedy policy.
Q-learning
For each action A transitioning from S to S′, compute the difference to the current estimate and update the value function
where:
• The immediate reward Rt+1 and the next state S′ are data from the exploration policy
• The update of the Q function depends in general on the policy we are considering (in this case the greedy policy)
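The difference between the two methods is easiest to see side by side. The sketch below uses the standard textbook form of the tabular updates, $Q(s,a) \leftarrow Q(s,a) + \alpha \, [\, r + \gamma \, Q(s', a') - Q(s,a) \,]$ for SARSA and the same with $\max_{a'} Q(s', a')$ for Q-learning; the table layout `Q[state][action]` and the numbers are assumptions for illustration.

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """On-policy: bootstrap from the action a_next actually chosen by the policy."""
    td_error = r + gamma * Q[s_next][a_next] - Q[s][a]
    Q[s][a] += alpha * td_error
    return td_error

def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    """Off-policy: bootstrap from the greedy action in s_next."""
    td_error = r + gamma * max(Q[s_next]) - Q[s][a]
    Q[s][a] += alpha * td_error
    return td_error

# Same transition (s=0, a=0, r=1, s'=1); the behaviour policy picks a'=0 in s',
# although a'=1 has the higher estimated value.
Q_sarsa = [[0.0, 0.0], [1.0, 2.0]]
Q_qlearn = [[0.0, 0.0], [1.0, 2.0]]
sarsa_update(Q_sarsa, 0, 0, 1.0, 1, 0, alpha=0.5, gamma=0.9)
q_learning_update(Q_qlearn, 0, 0, 1.0, 1, alpha=0.5, gamma=0.9)
```

SARSA moves Q(0, 0) towards $1 + 0.9 \cdot 1$, the value of what the agent actually does next, whereas Q-learning moves it towards $1 + 0.9 \cdot 2$, the value of the greedy choice.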
Pros:
• Less variance than Monte Carlo sampling, due to bootstrapping
• More sample efficient than Dynamic Programming
• Does not need to know the transition probability matrix
Cons:
• Biased due to bootstrapping: we use "old" value estimates as labels
• Exploration/exploitation dilemma
12.9.1 Introduction
Recall that
• $\pi : S \to A$
• $v_\pi : S \to \mathbb{R}$
In Q-Learning we assign a value to each pair (s, a). Our goal is to use function approximation to learn the value function
\[ v_\pi(s) \approx v_\pi(s, \theta) \]
We can use a neural network to learn the mapping between state-action pairs (s, a) and their values.
The Q-learning updates reduce to SGD on the TD-error (∆Q)
\[ \text{Loss}(\theta) = \left( R + \gamma \max_{a'} Q_\theta(s', a') - Q_\theta(s, a) \right)^2 \]
However, we still have a problem: SGD assumes that our updates are i.i.d.
But in RL the states visited in a trajectory are strongly correlated; how can we address this?
The main idea is to use a replay buffer in which we store the generated samples. Let us now discuss how this helps us obtain approximately i.i.d. samples.
The training procedure is the following
1. Run some exploration policies and store all the generated samples in the buffer
2. When we have enough transitions, sample a random minibatch from the buffer
3. For every transition present in this minibatch, update the loss and the parameters
4. Iterate until convergence
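The storage and sampling steps of this procedure can be sketched with a fixed-size buffer. The class below is a minimal illustration (the name `ReplayBuffer`, the tuple layout and the capacity are assumptions), not the implementation of any particular library.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of transitions; the oldest ones are dropped first."""

    def __init__(self, capacity, seed=None):
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks the temporal correlation
        # between consecutive transitions of the same trajectory.
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

buffer = ReplayBuffer(capacity=100, seed=0)
for step in range(150):                       # one long, correlated trajectory
    buffer.push(step, 0, 0.0, step + 1, False)
minibatch = buffer.sample(32)
```

Each minibatch mixes transitions from different parts of the trajectory (or from different trajectories), which is what makes the SGD updates closer to i.i.d.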
Q-learning is limited to discrete action spaces (e.g. we can consider the actions W, E, N, S but not the angle of the direction); for continuous action spaces the problem is intractable.
But we still have hopes!
As a matter of fact, learning directly a policy π is often much easier. The algorithm directly learns
the correct behavior, without exploring the value function.
The question now becomes:
How can we train such a model?
Policy gradients
We can see the policy from a particular state at a particular time step t as a normal distribution with mean $\mu_t$ and variance $\sigma_t^2$ over the possible actions, thus we use a Gaussian parametrization of the policy
\[ \pi(a_t \mid s_t) \sim \mathcal{N}(\mu_t, \sigma_t^2) \]
The advantage of this parametrization is that we can sample from it and learn the parameters of our network.
In particular, recall that the probability of a particular trajectory τ is
\[ p(\tau) = p(s_1, a_1, \ldots, s_T, a_T) = p(s_1) \prod_{t=1}^{T} \pi(a_t \mid s_t) \, p(s_{t+1} \mid a_t, s_t) \]
In order to do that, as we have already said, we sample from the Gaussian parametrization of the policy and we learn our parameters to obtain $\pi(a_t \mid s_t, \theta)$.
• Exploration: get the trajectory data. To do that, we sample an action at every time-step from the policy probability distribution (on-policy methods)
• Evaluation: evaluate the policy by computing the expectation of the trajectory reward given the parameters θ
\[ J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \sum_t \gamma^t r(s_t, a_t) \right] \]
\[ \theta \leftarrow \theta + \nabla_\theta J(\theta) \]
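\[ \begin{aligned}
&= \int p(\tau) \, \nabla_\theta \log p(\tau) \, r(\tau) \, d\tau && \text{(we used the fact that } \nabla \log f(x) = \tfrac{\nabla f(x)}{f(x)} \text{)} \\
&= \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \nabla_\theta \log p(\tau) \, r(\tau) \right]
\end{aligned} \]
Note that the first and the last terms do not depend on the policy we choose, and thus not on the parameters of our neural network. Thus, when we apply $\nabla_\theta$ they disappear.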
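Rearranging the terms in the previous expression we obtain:
\[ \begin{aligned}
\nabla_\theta J(\theta) &= \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \nabla_\theta \log p(\tau) \, r(\tau) \right] \\
&= \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \nabla_\theta \left[ \sum_{t=0}^{T} \log \pi_\theta(a_t \mid s_t) \right] r(\tau) \right] \\
&= \mathbb{E}_{\tau \sim p_\theta(\tau)} \Bigg[ \underbrace{\left( \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right)}_{\text{gradient of the log-likelihood of } \tau} \underbrace{\left( \sum_{t=0}^{T} \gamma^t r(s_t, a_t) \right)}_{\text{trajectory reward}} \Bigg], \qquad r(\tau) = \sum_{t=0}^{T} \gamma^t r(s_t, a_t)
\end{aligned} \]
Thus the gradient of our objective function is the gradient of the log-likelihood of τ, scaled by the trajectory reward.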
Now the question becomes:
How can we evaluate this quantity in practice, i.e. how can we compute the expected value?
4. Calculate the expected reward J and $\nabla_\theta J$ using Monte Carlo sampling (we sample N trajectories):
\[ \nabla_\theta J(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left[ \left( \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i) \right) G_0^i \right] \]
\[ \theta \leftarrow \theta + \nabla_\theta J(\theta) \]
However, if we proceed like this we still have a problem: the gradient is estimated from only a few samples (we used Monte Carlo sampling), thus the resulting policy gradients are very noisy.
Solution: reduce the variance by introducing a baseline $b(s_t^i)$ in the term related to the trajectory reward.
\[ \nabla_\theta J(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left[ \left( \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i) \right) \left( \sum_{t=0}^{T} \gamma^t R_t^i - b(s_t^i) \right) \right] \]
Remark: the baseline must be a function that does not depend on the action taken (common choices: average reward, estimate of the state value function).
As a consequence, the variance is reduced, but the policy gradient estimate remains unbiased.
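The effect of the baseline can be seen on a deliberately tiny problem. The sketch below uses a hypothetical one-step task with two actions, rewards 10 and 11, and a policy $\pi_\theta(a = 1) = \text{sigmoid}(\theta)$, so that the true gradient at θ = 0 is $0.25 \cdot (11 - 10) = 0.25$. It compares the Monte Carlo gradient samples with and without a constant baseline equal to the average reward; all names and numbers are assumptions for illustration.

```python
import math
import random

REWARDS = (10.0, 11.0)   # hypothetical one-step rewards for actions 0 and 1

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gradient_samples(theta, n, baseline, rng):
    """Per-sample estimates grad log pi(a) * (reward - baseline)."""
    p1 = sigmoid(theta)
    samples = []
    for _ in range(n):
        action = 1 if rng.random() < p1 else 0
        score = action - p1                  # d/dtheta log pi_theta(action)
        samples.append(score * (REWARDS[action] - baseline))
    return samples

def mean_and_variance(xs):
    m = sum(xs) / len(xs)
    return m, sum((x - m) ** 2 for x in xs) / len(xs)

rng = random.Random(0)
plain_mean, plain_var = mean_and_variance(gradient_samples(0.0, 10000, 0.0, rng))
base_mean, base_var = mean_and_variance(gradient_samples(0.0, 10000, 10.5, rng))
```

Both estimators target the same gradient, but without a baseline the individual samples are dominated by the large common part of the reward, while with the baseline only the difference between the actions remains. In this toy case the baseline removes the variance entirely; in general it only reduces it.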
12.9 Deep Reinforcement Learning 177
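Can we do better?
Recall that in the original form
\[ \nabla_\theta J(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left[ \left( \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i) \right) r(\tau^i) \right] \]
The idea is to use bootstrapping, which introduces some bias but reduces the variance.
In particular, we weight the log-likelihood gradient at each step by the TD error, which replaces the return of the whole roll-out with a bootstrapped estimate:
\[ \nabla_\theta J(\theta) = \frac{1}{N} \sum_{i} \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i) \left( r(s_t^i, a_t^i) + \gamma V(s_{t+1}^i) - V(s_t^i) \right) \]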
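The per-step weights in this estimator are just TD errors computed from a critic's value estimates. A minimal sketch, assuming the critic's values along one trajectory are already available as a list with one extra entry for the state after the last step (0 if that state is terminal):

```python
def td_errors(rewards, values, gamma):
    """delta_t = r_t + gamma * V(s_{t+1}) - V(s_t) for one trajectory.

    rewards has T entries; values has T + 1 entries.
    """
    return [rewards[t] + gamma * values[t + 1] - values[t]
            for t in range(len(rewards))]

rewards = [1.0, 1.0]
perfect_critic = td_errors(rewards, [2.0, 1.0, 0.0], gamma=1.0)
untrained_critic = td_errors(rewards, [0.0, 0.0, 0.0], gamma=1.0)
```

When the critic already predicts the returns exactly, all TD errors vanish and the policy receives no gradient signal from that trajectory; an untrained critic that outputs zeros reduces the weights to the raw rewards.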
Here we provide some suggested readings, divided for each chapter and topic.
Chapter 5: RNN
Chapter 6: VAE
Books
[6] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep Learning. Vol. 521. 2015.
Articles
[1] L. Breiman. “Bagging predictors”. In: Machine Learning 24 (2004), pp. 123–140.
[2] T. Garipov et al. “Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs”.
In: ArXiv abs/1802.10026 (2018).
[4] Sepp Hochreiter and Jürgen Schmidhuber. “Long Short-Term Memory”. In: Neural
Computation 9 (1997), pp. 1735–1780.
[7] Christian Ledig et al. “Photo-Realistic Single Image Super-Resolution Using a Gen-
erative Adversarial Network”. In: 2017 IEEE Conference on Computer Vision and
Pattern Recognition (CVPR) (2016), pp. 105–114.
[9] Robin Rombach et al. “High-Resolution Image Synthesis with Latent Diffusion Models”.
In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
(2021), pp. 10674–10685.
[10] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. “U-Net: Convolutional Networks
for Biomedical Image Segmentation”. In: ArXiv abs/1505.04597 (2015).
[11] Frank Rosenblatt. “The perceptron: a probabilistic model for information storage and
organization in the brain.” In: Psychological review 65 6 (1958), pp. 386–408.
[12] Nitish Srivastava et al. “Dropout: a simple way to prevent neural networks from
overfitting”. In: J. Mach. Learn. Res. 15 (2014), pp. 1929–1958.