Deep Learning: Technical Introduction: Thomas Epelbaum
[Cover figures: the ReLU activation and its derivative; a feedforward neural network with input, hidden and output layers; input and output convolution layers with their weights; ResNet and LSTM cell schematics.]
1 Preface
2 Acknowledgements
3 Introduction
7 Conclusion
Chapter 1
Preface
This work is bottom-up at its core. If you are searching for a working Neural Network in 10 lines of code and 5 minutes of your time, you have come to the wrong place. If you can mentally multiply and convolve 4D tensors, then I have nothing to convey to you either.
If on the other hand you like(d) to rederive every tiny calculation of every theorem of every class that you stepped into, then you might be interested by what follows!
Chapter 2
Acknowledgements
This work has no benefit nor added value to the deep learning topic on
Chapter 3

Introduction
This note aims at presenting the three most common forms of neural
There again, the novelties and the modifications of the material introduced in the two previous chapters are detailed in the main text, while the appendices give everything one needs to understand the most cumbersome formulas of this kind of network architecture.
Chapter 4
Feedforward Neural Networks
Contents

4.1 Introduction
4.2 FNN architecture
4.3 Some notations
4.4 Weight averaging
4.5 Activation function
4.5.1 The sigmoid function
4.5.2 The tanh function
4.5.3 The ReLU function
4.5.4 The leaky-ReLU function
4.5.5 The ELU function
4.6 FNN layers
4.6.1 Input layer
4.6.2 Fully connected layer
4.6.3 Output layer
4.7 Loss function
4.8 Regularization techniques
4.8.1 L2 regularization
4.8.2 L1 regularization
4.8.3 Clipping
4.8.4 Dropout
12 CHAPTER 4. FEEDFORWARD NEURAL NETWORKS
4.1 Introduction
In this section we review the first type of neural network that has been developed historically, as well as the techniques that allowed practitioners to circumvent the training issues encountered with "deep" architectures: neural networks with a large number of hidden states and hidden layers have historically proven very hard to train (vanishing gradient and overfitting issues).
Figure 4.1: Neural Network with N + 1 layers (N − 1 hidden layers). For simplicity of notation, the index referencing the training set has not been indicated. Shallow architectures use only one hidden layer. Deep learning amounts to taking several hidden layers, usually containing the same number of hidden neurons. This number should be in the ballpark of the average of the number of input and output variables.
An FNN is formed by one input layer, one (shallow network) or several (deep network, hence the name deep learning) hidden layers and one output layer. Each layer of the network (except the output one) is connected to the following layer. This connectivity is central to the FNN structure and has two main features in its simplest form: a weight averaging feature and an activation feature. We will review these features extensively in the following.
• $N$ the number of layers (not counting the input) in the Neural Network.
• $X^{(t)}_f = h^{(0)(t)}_f$ where $f \in [0, F_0 - 1]$, the input variables.
• $y^{(t)}_f$ where $f \in [0, F_N - 1]$, the output variables (to be predicted).
• $\hat{y}^{(t)}_f$ where $f \in [0, F_N - 1]$, the output of the network.
• $\Theta^{(\nu) f'}_f$ for $f \in [0, F_\nu - 1]$, $f' \in [0, F_{\nu+1} - 1]$ and $\nu \in [0, N - 1]$, the weight matrices.
• A bias term can be included. In practice, we will see when talking about the batch-normalization procedure that we can omit it.
[Figure: the weight averaging procedure for one output neuron, $a^{(0)}_3 = \sum_{f=0}^{5} \Theta^{(0)3}_f h^{(0)}_f$, with a bias unit $h^{(0)}_0$.]
$$a^{(t)(\nu)}_f = \sum_{f'=0}^{F_\nu - 1 + e} \Theta^{(\nu) f}_{f'} h^{(t)(\nu)}_{f'}, \qquad (4.1)$$
where $\nu \in [0, N-1]$, $t \in [0, T_{mb}-1]$ and $f \in [0, F_{\nu+1}-1]$. The $e$ is here to include or exclude a bias term. In practice, as we will be using batch-normalization, we can safely omit it ($e = 0$ in all the following).
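As a concrete illustration, the weight averaging of eq. (4.1) with $e = 0$ is a single matrix product per mini-batch. A minimal NumPy sketch (the function name and array shapes are assumptions, not notation from the text):

```python
import numpy as np

# Weight averaging (4.1) with e = 0: a^(t)(nu)_f = sum_f' Theta^(nu)f_f' h^(t)(nu)_f'.
# Assumed shapes: h is (T_mb, F_nu), Theta is (F_nu, F_nu_plus_one).
def weight_average(h, Theta):
    # One matrix product computes a_f for every sample t and output neuron f at once.
    return h @ Theta

rng = np.random.default_rng(0)
h = rng.normal(size=(32, 6))      # T_mb = 32 samples, F_nu = 6 input neurons
Theta = rng.normal(size=(6, 4))   # F_nu+1 = 4 output neurons
a = weight_average(h, Theta)
assert a.shape == (32, 4)
```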
Its derivative is
$$\sigma'(x) = \sigma(x)\,(1 - \sigma(x)). \qquad (4.4)$$
[Plot of the sigmoid function $\sigma(x)$.]
$$g(x) = \tanh(x) = \frac{1 - e^{-2x}}{1 + e^{-2x}}. \qquad (4.5)$$
Its derivative is
$$g'(x) = 1 - \tanh^2(x). \qquad (4.6)$$
This activation function has seen its popularity drop due to the use of the
activation function presented in the next section.
[Plot of the tanh function.]
The ReLU – for Rectified Linear Unit – function takes its values in $[0, +\infty)$ and reads
$$g(x) = \mathrm{ReLU}(x) = \begin{cases} x & x \geq 0 \\ 0 & x < 0 \end{cases}. \qquad (4.7)$$
Its derivative is
$$\mathrm{ReLU}'(x) = \begin{cases} 1 & x \geq 0 \\ 0 & x < 0 \end{cases}. \qquad (4.8)$$
[Plot of the ReLU function.]
This activation function is the most extensively used nowadays. Two of its most common variants can also be found: the leaky ReLU and the ELU – Exponential Linear Unit. They have been introduced because the ReLU activation function tends to "kill" certain hidden neurons: once it has been turned off (zero value), it can never be turned on again.
Its derivative is
$$\mathrm{l\text{-}ReLU}'(x) = \begin{cases} 1 & x \geq 0 \\ 0.01 & x < 0 \end{cases}. \qquad (4.10)$$
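The activations of sections 4.5.1–4.5.4 and their derivatives are one-liners. A vectorized NumPy sketch (the 0.01 leak factor matches eq. (4.10); the helper names are assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):                      # eq. (4.4)
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_prime(x):                         # derivative of eq. (4.5)
    return 1.0 - np.tanh(x) ** 2

def relu(x):                               # eq. (4.7)
    return np.where(x >= 0, x, 0.0)

def relu_prime(x):                         # eq. (4.8)
    return np.where(x >= 0, 1.0, 0.0)

def leaky_relu(x):
    return np.where(x >= 0, x, 0.01 * x)

def leaky_relu_prime(x):                   # eq. (4.10)
    return np.where(x >= 0, 1.0, 0.01)
```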
[Plots of the leaky-ReLU and ELU functions.]
with
$$\mu_f = \frac{1}{T_{train}} \sum_{t=0}^{T_{train}-1} X^{(t)}_f. \qquad (4.16)$$
¹To train the FNN, we jointly compute the forward and backward pass for $T_{mb}$ samples of the training set, with $T_{mb} \ll T_{train}$. In the following we will thus have $t \in [0, T_{mb}-1]$.
This corresponds to computing the mean per data type over the training set. Following our notations, let us recall that
$$X^{(t)}_f = h^{(t)(0)}_f. \qquad (4.17)$$
$$a^{(t)(\nu)}_f = \sum_{f'=0}^{F_\nu - 1} \Theta^{(\nu) f}_{f'} h^{(t)(\nu)}_{f'}, \qquad (4.18)$$
and $\forall \nu \in [0, N-2]$
$$h^{(t)(\nu+1)}_f = g\left(a^{(t)(\nu)}_f\right), \qquad (4.19)$$
where $o$ is called the output function. In the case of the Euclidean loss function, the output function is just the identity. In a classification task, $o$ is the softmax function
$$o\left(a^{(t)(N-1)}_f\right) = \frac{e^{a^{(t)(N-1)}_f}}{\sum_{f'=0}^{F_N - 1} e^{a^{(t)(N-1)}_{f'}}}. \qquad (4.21)$$
$$J(\Theta) = \frac{1}{2 T_{mb}} \sum_{t=0}^{T_{mb}-1} \sum_{f=0}^{F_N-1} \left(y^{(t)}_f - h^{(t)(N)}_f\right)^2, \qquad (4.22)$$
while for a classification task, the loss function is called the cross-entropy function
$$J(\Theta) = -\frac{1}{T_{mb}} \sum_{t=0}^{T_{mb}-1} \sum_{f=0}^{F_N-1} \delta^f_{y^{(t)}} \ln h^{(t)(N)}_f, \qquad (4.23)$$
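Both losses of eqs. (4.22) and (4.23) can be sketched in a few lines of NumPy (the helper names are assumptions; for the cross-entropy, the targets are assumed given as integer class indices, which makes the Kronecker delta a simple array lookup):

```python
import numpy as np

def mse_loss(y, h):
    # MSE loss (4.22): y and h are (T_mb, F_N) targets and network outputs.
    T_mb = y.shape[0]
    return np.sum((y - h) ** 2) / (2 * T_mb)

def cross_entropy_loss(y_idx, h):
    # Cross-entropy loss (4.23): y_idx is (T_mb,) integer class labels,
    # h is (T_mb, F_N) softmax outputs; the delta picks h at the true class.
    T_mb = y_idx.shape[0]
    return -np.sum(np.log(h[np.arange(T_mb), y_idx])) / T_mb
```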
For reasons that will appear clear when talking about the data sample used at
each training step, we denote
$$J(\Theta) = \sum_{t=0}^{T_{mb}-1} J_{mb}(\Theta). \qquad (4.25)$$
4.8.1 L2 regularization

L2 regularization is the most common regularization technique that one can find in the literature. It amounts to adding a regularizing term to the loss function in the following way
$$J_{L2}(\Theta) = \lambda_{L2} \sum_{\nu=0}^{N-1} \left\| \Theta^{(\nu)} \right\|^2_{L2} = \lambda_{L2} \sum_{\nu=0}^{N-1} \sum_{f=0}^{F_{\nu+1}-1} \sum_{f'=0}^{F_\nu-1} \left(\Theta^{(\nu) f}_{f'}\right)^2. \qquad (4.26)$$
This regularization technique is almost always used, but not on its own. A typical value of $\lambda_{L2}$ is in the range $10^{-4}$–$10^{-2}$. Interestingly, this L2 regularization technique has a Bayesian interpretation: it is Bayesian inference with a Gaussian prior on the weights. Indeed, for a given $\nu$, the weight averaging procedure can be considered as
$$a^{(t)(\nu)}_f = \sum_{f'=0}^{F_\nu-1} \Theta^{(\nu) f}_{f'} h^{(t)(\nu)}_{f'} + \epsilon, \qquad (4.27)$$
where $\epsilon$ is a noise term of mean 0 and variance $\sigma^2$. Hence the following Gaussian likelihood for all values of $t$ and $f$:
$$\mathcal{N}\left(a^{(t)(\nu)}_f \,\Big|\, \sum_{f'=0}^{F_\nu-1} \Theta^{(\nu) f}_{f'} h^{(t)(\nu)}_{f'},\, \sigma^2\right). \qquad (4.28)$$
Assuming all the weights to have a Gaussian prior of the form $\mathcal{N}\left(\Theta^{(\nu) f}_{f'} \,\big|\, 0, \lambda_{L2}^{-1}\right)$ with the same parameter $\lambda_{L2}$, we get the following expression
$$P = \prod_{t=0}^{T_{mb}-1} \prod_{f=0}^{F_{\nu+1}-1} \left[ \mathcal{N}\left(a^{(t)(\nu)}_f \,\Big|\, \sum_{f'=0}^{F_\nu-1} \Theta^{(\nu) f}_{f'} h^{(t)(\nu)}_{f'},\, \sigma^2\right) \prod_{f'=0}^{F_\nu-1} \mathcal{N}\left(\Theta^{(\nu) f}_{f'} \,\big|\, 0, \lambda_{L2}^{-1}\right) \right]$$
$$= \prod_{t=0}^{T_{mb}-1} \prod_{f=0}^{F_{\nu+1}-1} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{\left(a^{(t)(\nu)}_f - \sum_{f'=0}^{F_\nu-1} \Theta^{(\nu) f}_{f'} h^{(t)(\nu)}_{f'}\right)^2}{2\sigma^2}} \prod_{f'=0}^{F_\nu-1} \sqrt{\frac{\lambda_{L2}}{2\pi}}\, e^{-\frac{\lambda_{L2} \left(\Theta^{(\nu) f}_{f'}\right)^2}{2}}. \qquad (4.29)$$
Taking the log of it and forgetting most of the constant terms leads to
$$\mathcal{L} \propto \frac{1}{T_{mb}\,\sigma^2} \sum_{t=0}^{T_{mb}-1} \sum_{f=0}^{F_{\nu+1}-1} \left(a^{(t)(\nu)}_f - \sum_{f'=0}^{F_\nu-1} \Theta^{(\nu) f}_{f'} h^{(t)(\nu)}_{f'}\right)^2 + \lambda_{L2} \sum_{f=0}^{F_{\nu+1}-1} \sum_{f'=0}^{F_\nu-1} \left(\Theta^{(\nu) f}_{f'}\right)^2, \qquad (4.30)$$
and the last term is exactly the L2 regulator for a given $\nu$ value (see formula (4.26)).
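In code, the L2 penalty of eq. (4.26) and its contribution to the weight gradients are a few lines (a sketch; $\lambda_{L2} = 10^{-3}$ is just one value in the $10^{-4}$–$10^{-2}$ range quoted above, and the function names are assumptions):

```python
import numpy as np

def l2_penalty(Thetas, lambda_l2=1e-3):
    # Eq. (4.26): sum of squared entries over every layer's weight matrix.
    return lambda_l2 * sum(np.sum(Theta ** 2) for Theta in Thetas)

def l2_gradient(Theta, lambda_l2=1e-3):
    # Derivative of the penalty w.r.t. one weight matrix,
    # to be added to the backpropagation gradient.
    return 2 * lambda_l2 * Theta
```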
4.8.2 L1 regularization

L1 regularization amounts to replacing the L2 norm by the L1 norm in the L2 regularization technique
$$J_{L1}(\Theta) = \lambda_{L1} \sum_{\nu=0}^{N-1} \left\| \Theta^{(\nu)} \right\|_{L1} = \lambda_{L1} \sum_{\nu=0}^{N-1} \sum_{f=0}^{F_{\nu+1}-1} \sum_{f'=0}^{F_\nu-1} \left| \Theta^{(\nu) f}_{f'} \right|. \qquad (4.31)$$
It corresponds to a Laplace prior on the weights:
$$P\left(\Theta^{(\nu) f}_{f'} \,\big|\, 0, \lambda_{L1}\right) = \frac{\lambda_{L1}}{2}\, e^{-\lambda_{L1} \left|\Theta^{(\nu) f}_{f'}\right|}. \qquad (4.32)$$
4.8.3 Clipping

Clipping forbids the L2 norm of a weight matrix from exceeding a threshold $C$, rescaling the whole matrix when it does:
$$\text{if } \left\|\Theta^{(\nu)}\right\|_{L2} > C \;\longrightarrow\; \Theta^{(\nu) f}_{f'} = \Theta^{(\nu) f}_{f'} \times \frac{C}{\left\|\Theta^{(\nu)}\right\|_{L2}}. \qquad (4.33)$$
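The clipping rule (4.33) translates directly (a sketch; $C = 1$ is an arbitrary illustrative threshold):

```python
import numpy as np

def clip_weights(Theta, C=1.0):
    # Eq. (4.33): rescale the whole matrix when its L2 norm exceeds C.
    norm = np.sqrt(np.sum(Theta ** 2))
    if norm > C:
        return Theta * (C / norm)
    return Theta
```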
4.8.4 Dropout

Figure 4.8: The neural network of figure 4.1 with dropout taken into account for both the hidden layers and the input. Usually, a different (lower) probability for turning off a neuron is adopted for the input than the one adopted for the hidden layers.
$$h^{(\nu)}_f = m^{(\nu)}_f \, g\left(a^{(\nu)}_f\right), \qquad (4.34)$$
with $m^{(\nu)}_f$ following a Bernoulli distribution of parameter $p$, with usually $p = \frac{1}{5}$ for the mask of the input layer and $p = \frac{1}{2}$ otherwise. Dropout [1] has been the most successful regularization technique until the appearance of Batch Normalization.
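A dropout mask in the spirit of eq. (4.34) can be sketched as follows (here $p$ is taken as the probability of turning a neuron off, matching the figure caption; the common $1/(1-p)$ "inverted dropout" rescaling is a detail not discussed above and is omitted):

```python
import numpy as np

def dropout(h, p=0.5, rng=None):
    # Eq. (4.34): multiply activations by a Bernoulli mask m_f.
    # p is the turn-off probability (1/2 for hidden layers, 1/5 for the input).
    rng = np.random.default_rng(0) if rng is None else rng
    mask = (rng.random(h.shape) >= p).astype(h.dtype)
    return mask * h
```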
with
$$\hat{h}^{(\nu)}_f = \frac{1}{T_{mb}} \sum_{t=0}^{T_{mb}-1} h^{(t)(\nu+1)}_f, \qquad (4.36)$$
$$\left(\hat{\sigma}^{(\nu)}_f\right)^2 = \frac{1}{T_{mb}} \sum_{t=0}^{T_{mb}-1} \left(h^{(t)(\nu+1)}_f - \hat{h}^{(\nu)}_f\right)^2. \qquad (4.37)$$
To make sure that this transformation can represent the identity transform, we add two additional parameters $(\gamma_f, \beta_f)$ to the model
$$y^{(t)(\nu)}_f = \gamma^{(\nu)}_f \tilde{h}^{(t)(\nu)}_f + \beta^{(\nu)}_f = \tilde{\gamma}^{(\nu)}_f h^{(t)(\nu)}_f + \tilde{\beta}^{(\nu)}_f. \qquad (4.38)$$
The presence of the $\beta^{(\nu)}_f$ coefficient is what pushed us to get rid of the bias term, as it is naturally included in batchnorm. During training, one must compute a running sum for the mean and the variance, which will serve for the evaluation of the cross-validation and the test set (calling $e$ the number of iterations/epochs)
$$E\left[h^{(t)(\nu)}_f\right]_{e+1} = \frac{e\, E\left[h^{(t)(\nu)}_f\right]_e + \hat{h}^{(\nu)}_f}{e+1}, \qquad (4.39)$$
$$Var\left[h^{(t)(\nu)}_f\right]_{e+1} = \frac{e\, Var\left[h^{(t)(\nu)}_f\right]_e + \left(\hat{\sigma}^{(\nu)}_f\right)^2}{e+1}, \qquad (4.40)$$
and what will be used at test time is
$$E\left[h^{(t)(\nu)}_f\right], \qquad Var\left[h^{(t)(\nu)}_f\right] = \frac{T_{mb}}{T_{mb}-1}\, Var\left[h^{(t)(\nu)}_f\right], \qquad (4.41)$$
so that at test time
$$y^{(t)(\nu)}_f = \gamma^{(\nu)}_f\, \frac{h^{(t)(\nu)}_f - E\left[h^{(t)(\nu)}_f\right]}{\sqrt{Var\left[h^{(t)(\nu)}_f\right] + \epsilon}} + \beta^{(\nu)}_f. \qquad (4.42)$$
In practice, and as advocated in the original paper, one can get rid of dropout without loss of precision when using batch normalization. We will adopt this convention in the following.
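The training-time side of the batch normalization just described (eqs. (4.36)–(4.38)) fits in a few NumPy lines (a sketch; the function name is an assumption, and `eps` plays the role of the $\epsilon$ regulator of eq. (4.42)):

```python
import numpy as np

def batchnorm_forward(h, gamma, beta, eps=1e-8):
    # h: (T_mb, F) activations; gamma, beta: (F,) learned parameters.
    mean = h.mean(axis=0)                     # eq. (4.36)
    var = h.var(axis=0)                       # eq. (4.37)
    h_tilde = (h - mean) / np.sqrt(var + eps)
    return gamma * h_tilde + beta             # eq. (4.38)
```

At test time one would instead normalize with the running mean and (unbiased) variance of eqs. (4.39)–(4.41).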
4.9 Backpropagation

Backpropagation [9] is the standard technique to decrease the loss function error so as to correctly predict what one needs. As its name suggests, it amounts to backpropagating through the FNN the error made at the output layer, so as to update the weights. In practice, one has to compute a bunch of gradient terms, and this can be a tedious computational task. Nevertheless, if performed correctly, this is the most useful and important task that one can do in an FNN. We will therefore detail how to compute each weight (and Batchnorm coefficient) gradient in the following.
the value of $\delta^{(t)(N-1)}_f$ depends on the loss used. We show also in appendix 4.A that for the MSE loss function
$$\delta^{(t)(N-1)}_f = \frac{1}{T_{mb}} \left(h^{(t)(N)}_f - y^{(t)}_f\right), \qquad (4.47)$$
and
$$\Delta^{\Theta (0) f}_{f'} = \sum_{t=0}^{T_{mb}-1} \delta^{(t)(0)}_f h^{(t)(0)}_{f'}. \qquad (4.52)$$
4.10.1 Full-batch

Full-batch takes the whole training set at each epoch, such that the loss function reads
$$J(\Theta) = \sum_{t=0}^{T_{train}-1} J_{train}(\Theta). \qquad (4.55)$$
4.10.3 Mini-batch

Mini-batch gradient descent is a compromise between stability and time efficiency, and is the middle-ground between Full-batch and Stochastic gradient descent: $1 \ll T_{mb} \ll T_{train}$. Hence
$$J(\Theta) = \sum_{t=0}^{T_{mb}-1} J_{mb}(\Theta). \qquad (4.57)$$
All the calculations in this note have been performed using this gradient de-
scent technique.
4.11.1 Momentum

Momentum [10] introduces a new vector $v_e$ and can be seen as keeping a memory of what the previous updates were at prior epochs. Calling $e$ the number of epochs and dropping the $f, f', \nu$ indices for the gradients to ease the notation, we have
$$v_e = \gamma v_{e-1} + \eta \Delta^\Theta, \qquad (4.59)$$
and the weights at epoch $e$ are then updated as
$$\Theta_e = \Theta_{e-1} - v_e. \qquad (4.60)$$
$\gamma$ is a new parameter of the model, that is usually set to 0.9 but that could also be fixed thanks to cross-validation.
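The momentum update (4.59)–(4.60) in code (a sketch; $\gamma = 0.9$ matches the text, while $\eta = 0.01$ is an assumed learning rate, not a value given above):

```python
def momentum_step(theta, grad, v, gamma=0.9, eta=0.01):
    # Eq. (4.59): velocity keeps a decaying memory of past gradients.
    v = gamma * v + eta * grad
    # Eq. (4.60): the weights follow the velocity, not the raw gradient.
    theta = theta - v
    return theta, v
```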
4.11.3 Adagrad

Adagrad [12] allows one to fine-tune the different gradients by having individual learning rates $\eta$. Calling, for each value of $f, f', \nu$,
$$v_e = \sum_{e'=0}^{e-1} \left(\Delta^\Theta_{e'}\right)^2, \qquad (4.63)$$
One advantage of Adagrad is that the learning rate $\eta$ can be set once and for all (usually to $10^{-2}$) and does not need to be fine-tuned via cross-validation anymore, as it is individually adapted to each weight via the $v_e$ term. $\epsilon$ is here to avoid division by 0 issues, and is usually set to $10^{-8}$.
4.11.4 RMSprop

Since in Adagrad one accumulates the gradients from the first epoch on, the effective learning rate is forced to monotonically decrease. This behaviour can be smoothed via the RMSprop technique, which takes
$$v_e = \gamma v_{e-1} + (1-\gamma) \left(\Delta^\Theta_e\right)^2, \qquad (4.65)$$
with $\gamma$ a new parameter of the model, that is usually set to 0.9. The update rule then reads as the Adagrad one
$$\Theta_e = \Theta_{e-1} - \frac{\eta}{\sqrt{v_e + \epsilon}} \Delta^\Theta_e. \qquad (4.66)$$
4.11.5 Adadelta

Adadelta [13] is an extension of RMSprop, that aims at getting rid of the $\eta$ parameter. To do so, a new vector update is introduced
$$m_e = \gamma m_{e-1} + (1-\gamma) \left(\frac{\sqrt{m_{e-1} + \epsilon}}{\sqrt{v_e + \epsilon}}\, \Delta^\Theta_e\right)^2, \qquad (4.67)$$
The learning rate has been completely eliminated from the update rule, but the procedure for doing so is ad hoc. The next and last optimization technique presented seems more natural and is the default choice in a number of deep learning algorithms.
4.11.6 Adam

Adam [14] keeps track of both the gradient and its square via two epoch-dependent vectors
$$m_e = \beta_1 m_{e-1} + (1-\beta_1)\, \Delta^\Theta_e, \qquad v_e = \beta_2 v_{e-1} + (1-\beta_2) \left(\Delta^\Theta_e\right)^2, \qquad (4.69)$$
with $\beta_1$ and $\beta_2$ parameters usually respectively set to 0.9 and 0.999. But the robustness and great strength of Adam is that it makes the whole learning process weakly dependent on their precise values. To avoid numerical problems during the first steps, these vectors are rescaled
$$\hat{m}_e = \frac{m_e}{1 - \beta_1^e}, \qquad \hat{v}_e = \frac{v_e}{1 - \beta_2^e}. \qquad (4.70)$$
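One Adam step, combining eqs. (4.69)–(4.70), can be sketched as follows. The excerpt above does not reproduce the final parameter update, so the last line is the standard Adam rule $\Theta_e = \Theta_{e-1} - \eta\, \hat{m}_e / (\sqrt{\hat{v}_e} + \epsilon)$ (an assumption here, though it is the usual one):

```python
import numpy as np

def adam_step(theta, grad, m, v, e, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # Eq. (4.69): running averages of the gradient and its square.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Eq. (4.70): bias correction for the first epochs (e >= 1).
    m_hat = m / (1 - beta1 ** e)
    v_hat = v / (1 - beta2 ** e)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```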
$$\eta_e = e^{-\alpha_0}\, \eta_{e-1}, \qquad (4.72)$$
of the reasons why neural networks have experienced out-of-fashion periods. Since dropout and Batch Normalization, this issue is less pronounced, but one should not initialize the weights in a symmetric fashion (all zero for instance), nor should one initialize them too large. A good heuristic is
$$\left.\Theta^{(\nu) f'}_f\right|_{init} = \sqrt{\frac{6}{F_\nu + F_{\nu+1}}} \times \mathcal{N}(0, 1). \qquad (4.73)$$
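The heuristic (4.73) in code (a sketch; the function name is an assumption):

```python
import numpy as np

def init_weights(F_in, F_out, rng=None):
    # Eq. (4.73): Gaussian entries scaled by sqrt(6 / (F_nu + F_nu+1)),
    # breaking symmetry without starting too large.
    rng = np.random.default_rng(0) if rng is None else rng
    scale = np.sqrt(6.0 / (F_in + F_out))
    return scale * rng.standard_normal((F_in, F_out))
```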
Appendix
4.A Backprop through the output layer
Recalling the MSE loss function
$$J(\Theta) = \frac{1}{2 T_{mb}} \sum_{t=0}^{T_{mb}-1} \sum_{f=0}^{F_N-1} \left(y^{(t)}_f - h^{(t)(N)}_f\right)^2, \qquad (4.74)$$
we instantaneously get
$$\delta^{(t)(N-1)}_f = \frac{\partial}{\partial a^{(t)(N-1)}_f} J(\Theta) = \sum_{t'=0}^{T_{mb}-1} \sum_{f'=0}^{F_N-1} \frac{\partial h^{(t')(N)}_{f'}}{\partial a^{(t)(N-1)}_f}\, \frac{\partial}{\partial h^{(t')(N)}_{f'}} J(\Theta). \qquad (4.76)$$
Now
$$\frac{\partial}{\partial h^{(t')(N)}_{f'}} J(\Theta) = -\frac{\delta^{f'}_{y^{(t')}}}{T_{mb}\, h^{(t')(N)}_{f'}}, \qquad (4.77)$$
and
$$\frac{\partial h^{(t')(N)}_{f'}}{\partial a^{(t)(N-1)}_f} = \delta^{t'}_t \left(\delta^{f'}_f h^{(t)(N)}_f - h^{(t)(N)}_f h^{(t)(N)}_{f'}\right), \qquad (4.78)$$
so that
$$\delta^{(t)(N-1)}_f = -\frac{1}{T_{mb}} \sum_{f'=0}^{F_N-1} \frac{\delta^{f'}_{y^{(t)}}}{h^{(t)(N)}_{f'}} \left(\delta^{f'}_f h^{(t)(N)}_f - h^{(t)(N)}_f h^{(t)(N)}_{f'}\right) = \frac{1}{T_{mb}} \left(h^{(t)(N)}_f - \delta^f_{y^{(t)}}\right). \qquad (4.79)$$
so that
$$\delta^{(t)(\nu)}_f = g'\left(a^{(t)(\nu)}_f\right) \sum_{t'=0}^{T_{mb}-1} \sum_{f'=0}^{F_{\nu+1}-1} \Theta^{(\nu+1) f'}_f J^{(tt')(\nu)}_f\, \delta^{(t')(\nu+1)}_{f'}, \qquad (4.82)$$
Secondly
$$\frac{\partial \left(\hat{\sigma}^{(\nu)}_{f'}\right)^2}{\partial h^{(t)(\nu+1)}_f} = \frac{2\, \delta^{f'}_f}{T_{mb}} \left(h^{(t)(\nu+1)}_f - \hat{h}^{(\nu)}_f\right), \qquad (4.85)$$
so that we get
$$\frac{\partial \tilde{h}^{(t')(\nu)}_{f'}}{\partial h^{(t)(\nu+1)}_f} = \delta^{f'}_f \left[ \frac{T_{mb}\,\delta_{tt'} - 1}{T_{mb} \left(\left(\hat{\sigma}^{(\nu)}_f\right)^2 + \epsilon\right)^{\frac{1}{2}}} - \frac{\left(h^{(t)(\nu+1)}_f - \hat{h}^{(\nu)}_f\right)\left(h^{(t')(\nu+1)}_f - \hat{h}^{(\nu)}_f\right)}{T_{mb} \left(\left(\hat{\sigma}^{(\nu)}_f\right)^2 + \epsilon\right)^{\frac{3}{2}}} \right]$$
$$= \frac{\delta^{f'}_f}{\left(\left(\hat{\sigma}^{(\nu)}_f\right)^2 + \epsilon\right)^{\frac{1}{2}}} \left[ \delta_{tt'} - \frac{1 + \tilde{h}^{(t')(\nu)}_f \tilde{h}^{(t)(\nu)}_f}{T_{mb}} \right]. \qquad (4.86)$$
so that
$$\frac{\partial y^{(t')(\nu)}_{f'}}{\partial h^{(t)(\nu+1)}_f} = \tilde{\gamma}^{(\nu)}_f\, \delta^{f'}_f \left[ \delta^{t'}_t - \frac{1 + \tilde{h}^{(t')(\nu)}_f \tilde{h}^{(t)(\nu)}_f}{T_{mb}} \right]. \qquad (4.88)$$
[Figure: the "full" building block: weight averaging (∗), ReLU activation, then Batch Normalization.]
In its non-standard form presented in this section, the residual operation amounts to adding a skip connection to two consecutive full layers.
[Figure: the residual module "Res": two consecutive full layers (∗, ReLU, BN) whose output is added (+) to the module's input.]
$$y^{(t)(\nu+1)}_f = \gamma^{(\nu+1)}_f \tilde{h}^{(t)(\nu+2)}_f + \beta^{(\nu+1)}_f, \qquad a^{(t)(\nu+1)}_f = \sum_{f'=0}^{F_\nu-1} \Theta^{(\nu+1) f}_{f'} y^{(t)(\nu)}_{f'},$$
$$y^{(t)(\nu)}_f = \gamma^{(\nu)}_f \tilde{h}^{(t)(\nu+1)}_f + \beta^{(\nu)}_f, \qquad a^{(t)(\nu)}_f = \sum_{f'=0}^{F_{\nu-1}-1} \Theta^{(\nu) f}_{f'} y^{(t)(\nu-1)}_{f'}, \qquad (4.89)$$
as well as $h^{(t)(\nu+2)}_f = g\left(a^{(t)(\nu+1)}_f\right)$ and $h^{(t)(\nu+1)}_f = g\left(a^{(t)(\nu)}_f\right)$. In ResNet, we now have the slight modification
$$y^{(t)(\nu+1)}_f = \gamma^{(\nu+1)}_f \tilde{h}^{(t)(\nu+2)}_f + \beta^{(\nu+1)}_f + y^{(t)(\nu-1)}_f. \qquad (4.90)$$
The choice of skipping two and not just one layer has become a standard for empirical reasons, as is the decision not to weight the two paths (the trivial skip one and the two-FNN-layer one) by a parameter to be learned by backpropagation
$$y^{(t)(\nu+1)}_f = \alpha \left(\gamma^{(\nu+1)}_f \tilde{h}^{(t)(\nu+2)}_f + \beta^{(\nu+1)}_f\right) + (1-\alpha)\, y^{(t)(\nu-1)}_f. \qquad (4.91)$$
$$\delta^{(t)(\nu-1)}_f = \sum_{t'=0}^{T_{mb}-1} \sum_{f'=0}^{F_\nu-1} \frac{\partial a^{(t')(\nu)}_{f'}}{\partial a^{(t)(\nu-1)}_f}\, \delta^{(t')(\nu)}_{f'} + \sum_{t'=0}^{T_{mb}-1} \sum_{f'=0}^{F_{\nu+2}-1} \frac{\partial a^{(t')(\nu+2)}_{f'}}{\partial a^{(t)(\nu-1)}_f}\, \delta^{(t')(\nu+2)}_{f'},$$
so that
$$\delta^{(t)(\nu-1)}_f = g'\left(a^{(t)(\nu-1)}_f\right) \sum_{t'=0}^{T_{mb}-1} J^{(tt')(\nu)}_f \left[ \sum_{f'=0}^{F_\nu-1} \Theta^{(\nu) f'}_f \delta^{(t')(\nu)}_{f'} + \sum_{f'=0}^{F_{\nu+2}-1} \Theta^{(\nu+2) f'}_f \delta^{(t')(\nu+2)}_{f'} \right]. \qquad (4.93)$$
This formulation has one advantage: it totally preserves the usual FNN layer structure of a weight averaging (WA) followed by an activation function (AF) and then a batch normalization operation (BN). It nevertheless has one disadvantage: the backpropagation gradient does not really flow smoothly from one error rate to the other. In the following section we will present the standard ResNet formulation, which takes the problem the other way around: it allows the gradient to flow smoothly, at the cost of "breaking" the natural FNN building block.
Figure 4.11: Residual connection in an FNN, allowing trivial gradient flow through error rates.
The upsides and downsides of this formulation are the exact opposite of the index one: what we gained in readability, we lost in terms of direct implementation in low-level programming languages (C for instance). For FNNs, one can use a high-level programming language (like Python), but this will get quite intractable when we talk about Convolutional networks. Since the whole point of the present work is to introduce the index notation, and as one can easily find numerous derivations of the backpropagation update rules in matrix form, we will stick with the index notation in all the following, and now turn our attention to convolutional networks.
Chapter 5
Convolutional Neural Networks
Contents

5.1 Introduction
5.2 CNN architecture
5.3 CNN specificities
5.3.1 Feature map
5.3.2 Input layer
5.3.3 Padding
5.3.4 Convolution
5.3.5 Pooling
5.3.6 Towards fully connected layers
5.3.7 Fully connected layers
5.3.8 Output connected layer
5.4 Modification to Batch Normalization
5.5 Network architectures
5.5.1 Realistic architectures
5.5.2 LeNet
5.5.3 AlexNet
5.5.4 VGG
5.5.5 GoogleNet
5.5.6 ResNet
5.6 Backpropagation
42 CHAPTER 5. CONVOLUTIONAL NEURAL NETWORKS
5.1 Introduction

In this chapter we review a second type of neural network that is
Figure 5.1: A typical CNN architecture (in this case LeNet-inspired): convolution operations are followed by pooling operations, until the size of each feature map is reduced to one. Fully connected layers can then be introduced.
with
$$\mu_f = \frac{1}{T_{train}\, N_0\, T_0} \sum_{t=0}^{T_{train}-1} \sum_{j=0}^{N_0-1} \sum_{k=0}^{T_0-1} X^{(t)}_{f\,j\,k}, \qquad (5.2)$$
$$\mu_{f\,j\,k} = \frac{1}{T_{train}} \sum_{t=0}^{T_{train}-1} X^{(t)}_{f\,j\,k}. \qquad (5.3)$$
This corresponds to either computing the mean per pixel over the training set, or also averaging over every pixel. This procedure should not be followed for regression tasks. To conclude, figure 5.2 shows what the input layer looks like.
[Figure 5.2: the input layer: $F_0$ feature maps of width $N_0$ and height $T_0$.]
5.3.3 Padding

As we will see when we proceed, it may be convenient to "pad" the feature maps in order to preserve the width and the height of the images through several hidden layers. The padding operation amounts to adding 0's around the original image. With a padding of size P, we add P zeros at the beginning of each row and column of a given feature map. This is illustrated in the following figure.
Figure 5.3: Padding of the feature maps. The zeros added correspond to the
red tiles, hence a padding of size P = 1.
5.3.4 Convolution

The convolution operation that gives its name to the CNN is the fundamental building block of this type of network. It amounts to convolving a feature map of an input hidden layer with a weight matrix to give rise to an output feature map. The weight is really a four-dimensional tensor, one dimension ($F$) being the number of feature maps of the convolutional input layer, another ($F_p$) the number of feature maps of the convolutional output layer. The two others give the size of the receptive field in the width and the height directions. The receptive field allows one to convolve a subset instead of the whole input image. It aims at searching for similar patterns in the input image, no matter where the pattern is (translational invariance). The width and the height of the output image are also determined by the stride: it is simply the number of pixels by which one slides in the vertical and/or the horizontal direction before applying the convolution operation again. A good picture being worth a thousand words, here is the convolution operation in a nutshell.
Here $R_C$ is the size of the convolutional receptive field (we will see that the pooling operation also has a receptive field and a stride) and $S_C$ the convolutional stride. The width and height of the output image can be computed thanks to the input width $N$ and height $T$
$$N_p = \frac{N + 2P - R_C}{S_C} + 1, \qquad T_p = \frac{T + 2P - R_C}{S_C} + 1. \qquad (5.4)$$
It is common to introduce a padding to preserve the width and height of the input image ($N_p = N$, $T_p = T$), so that in these cases $S_C = 1$ and
$$P = \frac{R_C - 1}{2}. \qquad (5.5)$$
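Eqs. (5.4) and (5.5) are easy to check numerically (a sketch; integer division is used since the sizes are whole numbers, and (5.5) assumes an odd $R_C$):

```python
def conv_output_size(N, T, R_C, S_C, P):
    # Eq. (5.4): output width and height of a convolution.
    N_p = (N + 2 * P - R_C) // S_C + 1
    T_p = (T + 2 * P - R_C) // S_C + 1
    return N_p, T_p

def same_padding(R_C):
    # Eq. (5.5): padding preserving the input size when S_C = 1 (odd R_C).
    return (R_C - 1) // 2
```

For instance a 3 × 3 convolution of stride 1 with padding 1 maps a 32 × 32 feature map to a 32 × 32 one.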
For a given layer $\nu$, the convolution operation mathematically reads (similar in spirit to the weight averaging procedure of an FNN)
$$a^{(t)(\nu)}_{f\,l\,m} = \sum_{f'=0}^{F_\nu-1} \sum_{j=0}^{R_C-1} \sum_{k=0}^{R_C-1} \Theta^{(\nu) f}_{f'\,j\,k}\, h^{(t)(\nu)}_{f'\, S_C l + j\, S_C m + k}, \qquad (5.6)$$
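A naive loop implementation of eq. (5.6) for one sample $t$ makes the index gymnastics concrete (a sketch; `h` is assumed already padded, and real code would vectorize these loops):

```python
import numpy as np

def conv_forward(h, Theta, S_C=1):
    # h: (F, N_padded, T_padded) input feature maps for one sample.
    # Theta: (F_p, F, R_C, R_C) weight tensor, as described in the text.
    F_p, F, R_C, _ = Theta.shape
    N_p = (h.shape[1] - R_C) // S_C + 1
    T_p = (h.shape[2] - R_C) // S_C + 1
    a = np.zeros((F_p, N_p, T_p))
    for f in range(F_p):
        for l in range(N_p):
            for m in range(T_p):
                # Receptive-field patch h_{f', S_C l + j, S_C m + k}.
                patch = h[:, S_C * l : S_C * l + R_C, S_C * m : S_C * m + R_C]
                a[f, l, m] = np.sum(Theta[f] * patch)
    return a
```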
5.3.5 Pooling

The pooling operation, less and less used in current state-of-the-art CNNs, is fundamentally a dimension reduction operation. It amounts either to averaging or to taking the maximum of a sub-image – characterized by a pooling receptive field $R_P$ and a stride $S_P$ – of the input feature map $F$ to obtain an output feature map $F_p = F$ of width $N_p < N$ and height $T_p < T$. To be noted: the padded values of the input hidden layers are not taken into account during the pooling operation (hence the $+P$ indices in the following formulas).
[Figure: pooling reduces each $N \times T$ feature map to an $N_p \times T_p$ one.]
The average pooling procedure reads, for a given $\nu$'th pooling operation,
$$a^{(t)(\nu)}_{f\,l\,m} = \frac{1}{R_P^2} \sum_{j,k=0}^{R_P-1} h^{(t)(\nu)}_{f\, S_P l + j + P\, S_P m + k + P}, \qquad (5.8)$$
and the max pooling
$$a^{(t)(\nu)}_{f\,l\,m} = \max_{j,k=0}^{R_P-1} h^{(t)(\nu)}_{f\, S_P l + j + P\, S_P m + k + P}. \qquad (5.9)$$
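The max pooling of eq. (5.9) is a pair of loops over output positions (a sketch for one feature map; the $+P$ padding offsets are left out, so `h` is assumed unpadded):

```python
import numpy as np

def max_pool(h, R_P=2, S_P=2):
    # h: (N, T) single feature map; R_P pooling receptive field, S_P stride.
    N_p = (h.shape[0] - R_P) // S_P + 1
    T_p = (h.shape[1] - R_P) // S_P + 1
    a = np.zeros((N_p, T_p))
    for l in range(N_p):
        for m in range(T_p):
            # Maximum over the R_P x R_P sub-image, eq. (5.9).
            a[l, m] = h[S_P * l : S_P * l + R_P, S_P * m : S_P * m + R_P].max()
    return a
```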
Here $\nu$ denotes the $\nu$'th hidden layer of the network (and thus belongs to $[0, N-1]$), and $f \in [0, F_{\nu+1}-1]$, $l \in [0, N_{\nu+1}-1]$ and $m \in [0, T_{\nu+1}-1]$. Thus $S_P l + j \in [0, N_\nu - 1]$ and $S_P m + k \in [0, T_\nu - 1]$. Max pooling is extensively used in the literature, and we will therefore adopt it in all the following. Denoting $j^{(t)(p)}_{f\,l\,m}$, $k^{(t)(p)}_{f\,l\,m}$ the indices at which the $(l, m)$ maximum of the $f$'th feature map of the $t$'th batch sample is reached, we have
Figure 5.6: Fully connected operation to get images of width and height 1.
$$a^{(t)(\nu)}_f = \sum_{f'=0}^{F_\nu-1} \sum_{l=0}^{N-1} \sum_{m=0}^{T-1} \Theta^{(\nu) f}_{f'\,l\,m}\, h^{(t)(\nu)}_{f'\, l+P\, m+P}, \qquad (5.11)$$
$$a^{(t)(\nu)}_f = \sum_{f'=0}^{F_\nu-1} \Theta^{(\nu) f}_{f'} h^{(t)(\nu)}_{f'}, \qquad (5.13)$$
[Figure: a fully connected layer: $F$ input neurons plus a bias feeding $F_p$ output neurons.]
$$a^{(t)(N-1)}_f = \sum_{f'=0}^{F_{N-1}-1} \Theta^{(N-1) f}_{f'} h^{(t)(N-1)}_{f'}, \qquad h^{(t)(N)}_f = o\left(a^{(t)(N-1)}_f\right), \qquad (5.15)$$
$$\tilde{h}^{(t)(\nu)}_{f\,l\,m} = \frac{h^{(t)(\nu)}_{f\,l\,m} - \hat{h}^{(\nu)}_f}{\sqrt{\left(\hat{\sigma}^{(\nu)}_f\right)^2 + \epsilon}}, \qquad (5.16)$$
with
$$\hat{h}^{(\nu)}_f = \frac{1}{T_{mb}\, N_\nu\, T_\nu} \sum_{t=0}^{T_{mb}-1} \sum_{l=0}^{N_\nu-1} \sum_{m=0}^{T_\nu-1} h^{(t)(\nu)}_{f\,l\,m}, \qquad (5.17)$$
$$\left(\hat{\sigma}^{(\nu)}_f\right)^2 = \frac{1}{T_{mb}\, N_\nu\, T_\nu} \sum_{t=0}^{T_{mb}-1} \sum_{l=0}^{N_\nu-1} \sum_{m=0}^{T_\nu-1} \left(h^{(t)(\nu)}_{f\,l\,m} - \hat{h}^{(\nu)}_f\right)^2. \qquad (5.18)$$
The identity transform can be implemented thanks to the two additional parameters $(\gamma_f, \beta_f)$.
For the evaluation of the cross-validation and the test set (calling $e$ the number of iterations/epochs), one has to compute
$$E\left[h^{(t)(\nu)}_{f\,l\,m}\right]_{e+1} = \frac{e\, E\left[h^{(t)(\nu)}_{f\,l\,m}\right]_e + \hat{h}^{(\nu)}_f}{e+1}, \qquad (5.20)$$
$$Var\left[h^{(t)(\nu)}_{f\,l\,m}\right]_{e+1} = \frac{e\, Var\left[h^{(t)(\nu)}_{f\,l\,m}\right]_e + \left(\hat{\sigma}^{(\nu)}_f\right)^2}{e+1}, \qquad (5.21)$$
and what will be used at test time is $E\left[h^{(t)(\nu)}_{f\,l\,m}\right]$ and $\frac{T_{mb}}{T_{mb}-1} Var\left[h^{(t)(\nu)}_{f\,l\,m}\right]$.
We will now review the standard CNN architectures that have been used in the literature in the past 20 years, from old to very recent (end of 2015) ones. To allow for an easy pictorial representation, we will adopt the following schematic representation of the different layers.
[Figure: schematic layer blocks: "full" = weight averaging (∗), ReLU, BN; "conv" = convolution (∗), ReLU, BN.]
5.5.2 LeNet
The LeNet [3] network (end of the 90's) is formed by an input, followed by two conv-pool layers and then a fully connected layer before the final output. It can be seen in figure 5.1.
[Figure: LeNet: Input – Conv – Pool – Conv – Pool – Full – Output.]
When treating large images ($224 \times 224$), this implies using large receptive fields and strides. This has two downsides. Firstly, the number of parameters in a given weight matrix is proportional to the size of the receptive field, hence the larger it is, the larger the number of parameters. The network can thus be more prone to overfitting. Second, a large stride and receptive field means a less subtle analysis of the fine structures of the images. All the subsequent CNN implementations aim at reducing one of these two issues.
5.5.3 AlexNet
The AlexNet [17] (2012) brought no qualitative leap in CNN theory, but thanks to better processors it was able to deal with more hidden layers.
[Figure: AlexNet: Input – Conv – Pool – Conv – Pool – Conv – Conv – Conv – Pool – Full – Full – Full – Output.]
This network is still commonly used, though less since the arrival of the
VGG network.
5.5.4 VGG

The VGG [4] network (2014) adopted a simple criterion: only $2 \times 2$ poolings of stride 2 and $3 \times 3$ convolutions of stride 1 with a padding of size 1, hence preserving the size of the image's width and height through the convolution operations.
[Figure: VGG: stacks of Conv–Conv(–Conv)–Pool blocks followed by three Full layers and the Output.]
This network is the standard one in most of the deep learning packages
dealing with CNN. It is no longer the state of the art though, as a design
innovation has taken place since its creation.
5.5.5 GoogleNet

The GoogleNet [18] introduced a new type of "layer" (which is in reality a combination of existing layers): the inception layer (in reference to the movie by Christopher Nolan). Instead of passing from one layer of a CNN to the next by a simple pool, conv or fully-connected (fc) operation, one averages the results of the following architecture.
[Figure: an inception module: parallel branches applied to the same Input, whose outputs are concatenated (Concat).]
We won't enter into the details of the concat layer, as the GoogleNet illustrated in the following figure is (already!) no longer state of the art.
[Figure: GoogleNet: Input – Conv – Pool – Conv – Conv – Pool – stacked inception modules with intermediate Pools – Full – Output.]
5.5.6 ResNet

[Figure 5.16: the Residual module: three convolutions (Conv 1 – Conv 2 – Conv 3, each followed by BN and ReLU in the expanded view) whose output is added (+) to the module's Input.]
The ResNet [5] takes back the simple idea of the VGG net to always use the same size for the convolution operations (except for the first one). It also takes into account an experimental fact: the fully connected layers (which usually contain most of the parameters given their size) are not really necessary to perform well. Removing them leads to a great decrease in the number of parameters of a CNN. In addition, the pooling operation is also less and less popular and tends to be replaced by convolution operations. This gives the basic ingredients of the ResNet fundamental building block, the Residual module of figure 5.16.
Two important points have to be mentioned concerning the Residual module. Firstly, a usual conv-conv-conv structure would lead to the following output (forgetting about batch normalization for simplicity and only for the time being, and noting that there is no need for padding in 1 × 1 convolution operations)
$$
\begin{aligned}
h^{(t)(1)}_{f\,l+P\,m+P} &= g\left(\sum_{f'=0}^{F_0-1}\sum_{j=0}^{R_C-1}\sum_{k=0}^{R_C-1}\Theta^{(0)f}_{f'jk}\,h^{(t)(0)}_{f'\,S_Cl+j\,S_Cm+k}\right),\\
h^{(t)(2)}_{f\,l\,m} &= g\left(\sum_{f'=0}^{F_1-1}\sum_{j=0}^{R_C-1}\sum_{k=0}^{R_C-1}\Theta^{(1)f}_{f'jk}\,h^{(t)(1)}_{f'\,S_Cl+j\,S_Cm+k}\right),\\
h^{(t)(3)}_{f\,l\,m} &= g\left(\sum_{f'=0}^{F_2-1}\sum_{j=0}^{R_C-1}\sum_{k=0}^{R_C-1}\Theta^{(2)f}_{f'jk}\,h^{(t)(2)}_{f'\,S_Cl+j\,S_Cm+k}\right), \qquad (5.22)
\end{aligned}
$$
whereas the Residual module modifies the last of these equations to (assuming that the width, the height and the number of feature maps of the input and the output are the same)
$$
h^{(t)(4)}_{f\,l\,m} = h^{(t)(0)}_{f\,l\,m} + g\left(\sum_{f'=0}^{F_2-1}\sum_{j=0}^{R_C-1}\sum_{k=0}^{R_C-1}\Theta^{(2)f}_{f'jk}\,h^{(t)(2)}_{f'\,S_Cl+j\,S_Cm+k}\right) = h^{(t)(0)}_{f\,l\,m} + \delta h^{(t)(0)}_{f\,l\,m}. \qquad (5.23)
$$
Instead of trying to fit the input, one is trying to fit a tiny modification of the input, hence the name residual. This allows the network to minimally modify the input when necessary, contrary to traditional architectures. Secondly, if the number of feature maps is large, a 3 × 3 convolution with stride 1 can be very costly in terms of execution time and prone to overfitting (large number of parameters). This is the reason for the presence of the 1 × 1 convolutions, whose aim is just to prepare the input to the 3 × 3 conv by reducing the number of feature maps, a number which is then restored with the final 1 × 1 conv of the Residual module. The first 1 × 1 convolution thus reads as a weight averaging
operation
$$
h^{(t)(1)}_{f\,l+P\,m+P} = g\left(\sum_{f'=0}^{F_0-1}\Theta^{(0)f}_{f'}\,h^{(t)(0)}_{f'\,l\,m}\right), \qquad (5.24)
$$
with f running over the reduced number of feature maps, the final 1 × 1 convolution restoring the initial feature map size F_0. The ResNet architecture is then the stacking of a large number (usually 50) of Residual modules, preceded by a conv-pool layer and ended by a pooling operation that yields a fc layer, to which the output function is directly applied. This is illustrated in the following figure.
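The bottleneck structure just described can be sketched in a few lines of numpy. This is a toy single-sample version with stride 1 and without batch normalization; all names are ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d(x, w, pad):
    # x: (F_in, N, T); w: (F_out, F_in, R, R); stride 1, zero padding `pad`.
    F_out, F_in, R, _ = w.shape
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros((F_out, x.shape[1] + 2 * pad - R + 1,
                    x.shape[2] + 2 * pad - R + 1))
    for l in range(out.shape[1]):
        for m in range(out.shape[2]):
            # Contract weights against the (F_in, R, R) input patch.
            out[:, l, m] = np.tensordot(w, xp[:, l:l + R, m:m + R], axes=3)
    return out

def relu(x):
    return np.maximum(x, 0.0)

def residual_module(x, w1, w2, w3):
    # 1x1 conv reduces the feature maps, 3x3 conv (padding 1) keeps the
    # spatial size, the final 1x1 conv restores the feature maps;
    # the input is then added back (the skip connection).
    h = relu(conv2d(x, w1, pad=0))   # F -> F_r
    h = relu(conv2d(h, w2, pad=1))   # F_r -> F_r, size preserved
    h = conv2d(h, w3, pad=0)         # F_r -> F
    return x + h

F, F_r, N = 8, 4, 6
x = rng.normal(size=(F, N, N))
w1 = rng.normal(size=(F_r, F, 1, 1)) * 0.1
w2 = rng.normal(size=(F_r, F_r, 3, 3)) * 0.1
w3 = rng.normal(size=(F, F_r, 1, 1)) * 0.1
y = residual_module(x, w1, w2, w3)
print(y.shape)  # (8, 6, 6)
```

With all weights set to zero the module returns its input unchanged, which is exactly the "minimal modification" property of the residual design.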
[Figure: the ResNet architecture: input, a conv-pool layer, a stack of Residual modules, a final pooling and the output.]
The ResNet CNN has accomplished state of the art results on a number of popular training sets (CIFAR, MNIST...). In practice, we will present in the following the backpropagation algorithm for CNNs, having standard (VGG-like) architectures in mind.
5.6 Backpropagation
In a FNN, one just has to compute two kinds of backpropagation: from the output to a fully connected (fc) layer and from fc to fc. In a traditional CNN, four new kinds of propagation have to be computed: fc to pool, pool to conv, conv to conv and conv to pool. We present the corresponding error rates in the next sections, postponing their derivation to the appendix. We will consider, as in a FNN, a network with an input layer labelled 0, N − 1 hidden layers labelled ν and an output layer labelled N (N + 1 layers in total in the network).
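Since every error rate below is derived by hand, each formula can be checked numerically by finite differences, in the spirit of the empirical checks mentioned in the conclusion. A minimal sketch on a single fc layer with a MSE loss (the tiny network and all names are ours):

```python
import numpy as np

rng = np.random.default_rng(1)

# A tiny fc network: J(Theta) = 0.5 * ||relu(Theta @ x) - y||^2.
x = rng.normal(size=5)
y = rng.normal(size=3)
Theta = rng.normal(size=(3, 5))

def loss(Th):
    h = np.maximum(Th @ x, 0.0)
    return 0.5 * np.sum((h - y) ** 2)

# Analytic gradient from the backpropagated error rate delta.
a = Theta @ x
delta = (np.maximum(a, 0.0) - y) * (a > 0)   # g'(a) * (h - y)
grad = np.outer(delta, x)

# Central finite differences on each weight.
eps, num = 1e-6, np.zeros_like(Theta)
for f in range(3):
    for fp in range(5):
        e = np.zeros_like(Theta); e[f, fp] = eps
        num[f, fp] = (loss(Theta + e) - loss(Theta - e)) / (2 * eps)

print(np.max(np.abs(grad - num)))  # tiny: the two gradients agree
```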
Fc to fc
As in FNN, we find
$$
\delta^{(t)(\nu)}_f = g'\!\left(a^{(t)(\nu)}_f\right)\sum_{t'=0}^{T_{mb}-1}\sum_{f'=0}^{F_{\nu+1}-1}\Theta^{(o)f'}_{f}\,J^{(tt')(n)}_f\,\delta^{(t')(\nu+1)}_{f'}, \qquad (5.32)
$$
Fc to pool
We show in appendix 5.B that this induces the following error rate
$$
\delta^{(t)(\nu)}_{f\,l\,m} = \sum_{f'=0}^{F_{\nu+1}-1}\Theta^{(o)f'}_{f\,l\,m}\,\delta^{(t)(\nu+1)}_{f'}, \qquad (5.33)
$$
Pool to conv
We show in appendix 5.A that this induces the following error rate (calling
the pooling layer the pth one)
$$
\delta^{(t)(\nu)}_{f\,l+P\,m+P} = g'\!\left(a^{(t)(\nu)}_{f\,l\,m}\right)\sum_{t'=0}^{T_{mb}-1}\sum_{l'=0}^{N_{\nu+1}-1}\sum_{m'=0}^{T_{\nu+1}-1}\delta^{(t')(\nu+1)}_{f\,l'\,m'}\;J^{(tt')(n)}_{f\;S_Pl'+j^{(t')(p)}_{fl'm'}+P\;S_Pm'+k^{(t')(p)}_{fl'm'}+P\;l+P\;m+P}. \qquad (5.34)
$$
Conv to conv
We show in appendix 5.B that this induces the following error rate
$$
\delta^{(t)(\nu)}_{f\,l+P\,m+P} = g'\!\left(a^{(t)(\nu)}_{f\,l\,m}\right)\sum_{t'=0}^{T_{mb}-1}\sum_{f'=0}^{F_{\nu+1}-1}\sum_{l'=0}^{N_{\nu+1}-1}\sum_{m'=0}^{T_{\nu+1}-1}\sum_{j=0}^{R_C-1}\sum_{k=0}^{R_C-1}\delta^{(t')(\nu+1)}_{f'\,l'+P\,m'+P}\,\Theta^{(o)f'}_{f\,j\,k}\,J^{(tt')(n)}_{f\;S_Cl'+j\;S_Cm'+k\;l+P\;m+P} \qquad (5.35)
$$
Conv to pool
We show in appendix 5.B that this induces the following error rate
$$
\delta^{(t)(\nu)}_{f\,l\,m} = \sum_{f'=0}^{F_{\nu+1}-1}\sum_{j=0}^{R_C-1}\sum_{k=0}^{R_C-1}\Theta^{(o)f}_{f'\,j\,k}\,\delta^{(t)(\nu+1)}_{f'\;\frac{l+P-j}{S_C}+P\;\frac{m+P-k}{S_C}+P}. \qquad (5.36)
$$
Fc to fc
$$
\Delta^{(o)f}_{f'} = \sum_{t=0}^{T_{mb}-1}y^{(t)(n)}_{f'}\,\delta^{(t)(\nu)}_{f} \qquad (5.37)
$$
Fc to pool
$$
\Delta^{(o)f}_{f'\,j\,k} = \sum_{t=0}^{T_{mb}-1}h^{(t)(\nu)}_{f'\,j+P\,k+P}\,\delta^{(t)(\nu)}_{f} \qquad (5.38)
$$
Figure 5.27: Weight update between a conv and a pool layer, as well as between
a conv and the input layer.
Fc to fc
We have
$$
\begin{aligned}
\Delta^{\gamma(n)}_f &= \sum_{t=0}^{T_{mb}-1}\sum_{f'=0}^{F_{\nu+1}-1}\Theta^{(o)f'}_{f}\,\tilde h^{(t)(n)}_f\,\delta^{(t)(\nu)}_{f'},\\
\Delta^{\beta(n)}_f &= \sum_{t=0}^{T_{mb}-1}\sum_{f'=0}^{F_{\nu+1}-1}\Theta^{(o)f'}_{f}\,\delta^{(t)(\nu)}_{f'}, \qquad (5.41)
\end{aligned}
$$
Figure 5.29: Coefficient update between a fc layer and a pool layer, as well as between a conv and a pool layer.
Conv to conv
We have
Appendix
5.A Backprop through BatchNorm
For Backpropagation, we will need
$$
\frac{\partial y^{(t')(n)}_{f'l'm'}}{\partial h^{(t)(\nu)}_{flm}} = \gamma^{(n)}_f\,\frac{\partial \tilde h^{(t')(n)}_{f'l'm'}}{\partial h^{(t)(\nu)}_{flm}}. \qquad (5.44)
$$
Since
$$
\frac{\partial h^{(t')(\nu)}_{f'l'm'}}{\partial h^{(t)(\nu)}_{flm}} = \delta^{t'}_t\,\delta^{f'}_f\,\delta^{l'}_l\,\delta^{m'}_m, \qquad \frac{\partial \hat h^{(n)}_{f'}}{\partial h^{(t)(\nu)}_{flm}} = \frac{\delta^{f'}_f}{T_{mb}N_nT_n}; \qquad (5.45)
$$
and
$$
\frac{\partial\left(\hat\sigma^{(n)}_{f'}\right)^2}{\partial h^{(t)(\nu)}_{flm}} = \frac{2\,\delta^{f'}_f}{T_{mb}N_nT_n}\left(h^{(t)(\nu)}_{flm} - \hat h^{(n)}_f\right), \qquad (5.46)
$$
we get
$$
\begin{aligned}
\frac{\partial \tilde h^{(t')(n)}_{f'l'm'}}{\partial h^{(t)(\nu)}_{flm}} &= \frac{\delta^{f'}_f\left(\delta^{t'}_t\delta^{l'}_l\delta^{m'}_m - \frac{1}{T_{mb}N_nT_n}\right)}{\left(\left(\hat\sigma^{(n)}_f\right)^2+\epsilon\right)^{\frac12}} - \frac{\delta^{f'}_f\left(h^{(t')(\nu)}_{f'l'm'}-\hat h^{(n)}_f\right)\left(h^{(t)(\nu)}_{flm}-\hat h^{(n)}_f\right)}{T_{mb}N_nT_n\left(\left(\hat\sigma^{(n)}_f\right)^2+\epsilon\right)^{\frac32}}\\
&= \frac{\delta^{f'}_f}{\left(\left(\hat\sigma^{(n)}_f\right)^2+\epsilon\right)^{\frac12}}\left(\delta^{t'}_t\delta^{l'}_l\delta^{m'}_m - \frac{1+\tilde h^{(t')(n)}_{f'l'm'}\,\tilde h^{(t)(n)}_{flm}}{T_{mb}N_nT_n}\right), \qquad (5.47)
\end{aligned}
$$
so that
$$
\frac{\partial y^{(t')(n)}_{f'l'm'}}{\partial h^{(t)(\nu)}_{flm}} = \tilde\gamma^{(n)}_f\,\delta^{f'}_f\left(\delta^{t'}_t\delta^{l'}_l\delta^{m'}_m - \frac{1+\tilde h^{(t')(n)}_{f'l'm'}\,\tilde h^{(t)(n)}_{flm}}{T_{mb}N_nT_n}\right) = \delta^{f'}_f\,J^{(tt')(n)}_{f\,l\,m\,l'\,m'}. \qquad (5.49)
$$
5.B Error rate updates: details
so
and for backpropagation from conv to pool (taking the stride equal to 1 to
simplify the derivation)
Fc to Fc
Fc to pool
conv to conv
(5.66)
Numerically, this implies that for each t, f, l, m one needs to perform 3 loops (on t', l', m'), hence a 7 loop process. This can be reduced to 4 at most in the following way. Defining
and
we have introduced two new variables that can be computed in four loops, but three of them are independent of the ones needed to compute $\delta^{(t)(\nu)}_{f\,l\,m}$. For the last term, the δ functions "kill" 3 loops and we are left with
$$
\delta^{(t)(\nu)}_{f\,l\,m} = \tilde\gamma^{(n)}_f\left[g'\!\left(a^{(t)(\nu)}_{f\,l\,m}\right)\sum_{l'=0}^{N_{\nu+1}-1}\sum_{m'=0}^{T_{\nu+1}-1}\delta^{(t)(\nu+1)}_{f\,l'\,m'}\,\delta_l^{S_Pl'+j^{(p)}_{tfl'm'}}\,\delta_m^{S_Pm'+k^{(p)}_{tfl'm'}} - \frac{\mu^{(1)}_f+\mu^{(2)}_f\,\tilde h^{(t)(n)}_{f\,l+P\,m+P}}{T_{mb}N_nT_n}\right], \qquad (5.69)
$$
and (2 loops)
$$
\lambda^{(4)}_f = \sum_{f'=0}^{F_{\nu+1}-1}\lambda^{(2)}_{f\,f'}\,\lambda^{(3)}_{f'}. \qquad (5.73)
$$
Finally, we compute
$$
\lambda^{(5)}_{f\,f'\,j\,k} = \sum_{t'=0}^{T_{mb}-1}\sum_{l'=0}^{N_{\nu+1}-1}\sum_{m'=0}^{T_{\nu+1}-1}\delta^{(t')(\nu+1)}_{f'\,l'\,m'}\,\tilde h^{(t')(n)}_{f\,l'+j\,m'+k}, \qquad (5.74)
$$
$$
\lambda^{(6)}_f = \sum_{f'=0}^{F_{\nu+1}-1}\sum_{j=0}^{R_C-1}\sum_{k=0}^{R_C-1}\Theta^{(o)f'}_{f\,j\,k}\,\lambda^{(5)}_{f\,f'\,j\,k}. \qquad (5.75)
$$
$\lambda^{(6)}_f$ only requires four loops, and $\lambda^{(5)}_{ff'jk}$ is the most expensive operation, but it is also a convolution operation that involves two of the smallest loops (those on
so that
$$
\Delta^{\gamma(n)}_f = \sum_{f'=0}^{F_{\nu+1}-1}\sum_{j=0}^{R_C-1}\sum_{k=0}^{R_C-1}\nu^{(1)}_{f'\,f\,j\,k}\,\Theta^{(o)f'}_{f\,j\,k} \qquad (5.80)
$$
$$
\Delta^{\beta(n)}_f = \sum_{f'=0}^{F_{\nu+1}-1}\sum_{j=0}^{R_C-1}\sum_{k=0}^{R_C-1}\nu^{(2)}_{f'}\,\Theta^{(o)f'}_{f\,j\,k}, \qquad (5.81)
$$
5.F Backpropagation through a ResNet module

[Figure: the Residual module: a Conv-BN-ReLU chain repeated three times, whose output is added to the module's input.]
This new term is what allows the error rate to flow smoothly from the output to the input in the ResNet CNN, as the additional connexion in a ResNet is like a skip path around the convolution chain. Let us mention in passing that some architectures connect every hidden layer to each other[19].
5.G.1 2D Convolution
[Figure: a 2D convolution: an (N + 2P) × (T + 2P) padded input convolved with an R_C × R_C kernel of stride S_C produces an N_p × T_p output; the input patches can be arranged into an N_p T_p × R_C² matrix.]
and it involves 4 loops. The trick to reduce this operation to a 2D matrix multiplication is to redefine the h matrix by looking at which h indices are going to be multiplied by Θ for each value of l and m. The associated h values are then stored into an N_p T_p × R_C² matrix. Flattening out the Θ matrix, we are left with a matrix multiplication, the flattened Θ matrix being of size R_C² × 1. This is illustrated in figure 5.32.
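This rearrangement can be verified on a toy example (a sketch, with variable names of our choosing):

```python
import numpy as np

rng = np.random.default_rng(2)

N = T = 6          # input height and width (already padded)
R_C = 3            # receptive field; stride 1, so Np = N - R_C + 1
Np = Tp = N - R_C + 1

x = rng.normal(size=(N, T))
theta = rng.normal(size=(R_C, R_C))

# Direct 4-loop convolution.
direct = np.zeros((Np, Tp))
for l in range(Np):
    for m in range(Tp):
        for j in range(R_C):
            for k in range(R_C):
                direct[l, m] += theta[j, k] * x[l + j, m + k]

# im2col: store, for each (l, m), the R_C^2 input values that multiply
# Theta, as one row of an (Np Tp) x R_C^2 matrix.
rows = np.array([x[l:l + R_C, m:m + R_C].ravel()
                 for l in range(Np) for m in range(Tp)])
# Flattening Theta to an R_C^2 column turns the convolution into a
# single matrix multiplication.
as_matmul = (rows @ theta.ravel()).reshape(Np, Tp)

print(np.allclose(direct, as_matmul))  # True
```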
5.G.2 4D Convolution
[Figure: the 4D case: the T_mb × F × (N + 2P) × (T + 2P) input is rearranged into a (T_mb N_p T_p) × (F R_C²) matrix, which multiplies the flattened (F R_C²) × F_p weight matrix.]
Following the same lines, the adding of the input and output feature maps as well as the batch size poses no particular conceptual difficulty, as illustrated above.

5.H Pooling as a row matrix maximum
[Figure: pooling as a row matrix maximum: the T_mb × F × (N + 2P) × (T + 2P) input is rearranged into a (T_mb F N_p T_p) × R_P² matrix, and the pooling operation amounts to taking the maximum of each row.]
Chapter 6

Recurrent Neural Networks
Contents
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
6.2 RNN-LSTM architecture . . . . . . . . . . . . . . . . . . . . . . 78
6.2.1 Forward pass in a RNN-LSTM . . . . . . . . . . . . . . . 78
6.2.2 Backward pass in a RNN-LSTM . . . . . . . . . . . . . . 80
6.3 Extreme Layers and loss function . . . . . . . . . . . . . . . . . 80
6.3.1 Input layer . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
6.3.2 Output layer . . . . . . . . . . . . . . . . . . . . . . . . . 81
6.3.3 Loss function . . . . . . . . . . . . . . . . . . . . . . . . . 81
6.4 RNN specificities . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
6.4.1 RNN structure . . . . . . . . . . . . . . . . . . . . . . . . 81
6.4.2 Forward pass in a RNN . . . . . . . . . . . . . . . . . . . 83
6.4.3 Backpropagation in a RNN . . . . . . . . . . . . . . . . . 83
6.4.4 Weight and coefficient updates in a RNN . . . . . . . . . 84
6.5 LSTM specificities . . . . . . . . . . . . . . . . . . . . . . . . . . 85
6.5.1 LSTM structure . . . . . . . . . . . . . . . . . . . . . . . . 85
6.5.2 Forward pass in LSTM . . . . . . . . . . . . . . . . . . . . 86
6.5.3 Batch normalization . . . . . . . . . . . . . . . . . . . . . 87
6.5.4 Backpropagation in a LSTM . . . . . . . . . . . . . . . . . 88
6.5.5 Weight and coefficient updates in a LSTM . . . . . . . . 89
Appendices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
78 CHAPTER 6. RECURRENT NEURAL NETWORKS
6.1 Introduction
Figure 6.1: RNN architecture, with data propagating both in "space" and in "time". In our example, the time dimension is of size 8 while the "spatial" one is of size 4.
The real novelty of this type of neural network is that the fact that we are trying to predict a time series is encoded in the very architecture of the network. RNNs were first introduced mostly to predict the next words in a sentence (classification task), hence the notion of ordering in time of the prediction. But this kind of network architecture can also be applied to regression problems. Among other things one can think of stock price evolution, or temperature forecasting. In contrast to the previous neural networks that we introduced, where we defined (denoting ν as in previous chapters the layer index in the spatial direction)
$$
\begin{aligned}
a^{(t)(\nu)}_f &= \text{Weight Averaging}\left(h^{(t)(\nu)}_f\right),\\
h^{(t)(\nu+1)}_f &= \text{Activation function}\left(a^{(t)(\nu)}_f\right), \qquad (6.1)
\end{aligned}
$$
we now have the hidden layers that are indexed by both a "spatial" and a
"temporal" index (with T being the network dimension in this new direction),
and the general philosophy of the RNN is (the a is now usually denoted c, for cell state; this notation, trivial for the basic RNN architecture, will make more sense when we talk about LSTM networks)
$$
\begin{aligned}
c^{(t)(\nu\tau)}_f &= \text{Weight Averaging}\left(h^{(t)(\nu\tau-1)}_f,\, h^{(t)(\nu-1\tau)}_f\right),\\
h^{(t)(\nu\tau)}_f &= \text{Activation function}\left(c^{(t)(\nu\tau)}_f\right), \qquad (6.2)
\end{aligned}
$$
Figure 6.2: RNN architecture, backward pass. Here one cannot compute the gradient of a layer without having computed the gradients of the layers that flow into it.
With this in mind, let us now see in detail the implementation of a RNN and its advanced cousin, the Long Short Term Memory (LSTM)-RNN.
where $\tilde h^{(t)(0\tau)}_f$ is $h^{(t)(0\tau)}_f$ with the first time column removed.
$$
h^{(t)(N\tau)}_f = o\left(\sum_{f'=0}^{F_{N-1}-1}\Theta^{f}_{f'}\,h^{(t)(N-1\tau)}_{f'}\right), \qquad (6.4)
$$
where the output function o is, as for FNNs and CNNs, either the identity (regression task) or the cross-entropy function (classification task).
$$
J(\Theta) = \frac{1}{2T_{mb}}\sum_{t=0}^{T_{mb}-1}\sum_{\tau=0}^{T-1}\sum_{f=0}^{F_N-1}\left(h^{(t)(N\tau)}_f - y^{(t)(\tau)}_f\right)^2. \qquad (6.5)
$$
[Figure 6.3: the RNN hidden unit: a tanh activation of the weight-averaged inputs.]
And here is how the output of the hidden layer represented in figure 6.3 enters into the subsequent hidden units:
[Figure 6.4: How the RNN hidden units interact with each other.]
$$
h^{(t)(\nu\tau)}_f = \tanh\left(\sum_{f'=0}^{F_{\nu-1}-1}\Theta^{\nu(\nu)f}_{f'}\,h^{(t)(\nu-1\tau)}_{f'}\right), \qquad (6.7)
$$
$$
h^{(t)(\nu\tau)}_f = \tanh\left(\sum_{f'=0}^{F_{\nu-1}-1}\Theta^{\nu(\nu)f}_{f'}\,h^{(t)(\nu-1\tau)}_{f'} + \sum_{f'=0}^{F_{\nu}-1}\Theta^{\tau(\nu)f}_{f'}\,h^{(t)(\nu\tau-1)}_{f'}\right). \qquad (6.8)
$$
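Equation (6.8) translates almost directly into code; here is a minimal numpy sketch of the forward pass of one hidden RNN layer unrolled in time (shapes and names are ours; biases and the batch dimension are omitted):

```python
import numpy as np

rng = np.random.default_rng(3)

F_in, F, T = 3, 4, 8                          # input size, hidden size, time steps
Theta_nu = rng.normal(size=(F, F_in)) * 0.5   # "spatial" weights
Theta_tau = rng.normal(size=(F, F)) * 0.5     # "temporal" weights

x = rng.normal(size=(T, F_in))                # one input sequence
h = np.zeros((T, F))

for tau in range(T):
    # eq (6.8): tanh of the spatial contribution plus, when tau - 1
    # exists, the temporal one.
    a = Theta_nu @ x[tau]
    if tau > 0:
        a = a + Theta_tau @ h[tau - 1]
    h[tau] = np.tanh(a)

print(h.shape)  # (8, 4)
```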
$$
\delta^{(t)(\nu\tau)}_f = \frac{\delta}{\delta h^{(t)(\nu+1\tau)}_f}\,J(\Theta), \qquad (6.9)
$$
to deduce
$$
\Delta^{\text{index}\,f}_{f'} = \frac{\delta}{\delta\Theta^{\text{index}\,f}_{f'}}\,J(\Theta), \qquad (6.10)
$$
where the index can either be nothing (weights of the output layer), ν(ν) (weights between two spatially connected layers) or τ(ν) (weights between two temporally connected layers). First, it is easy to compute (in the same way as in chapter 1 for a FNN) for the MSE loss function
Calling
$$
T^{(t)(\nu\tau)}_f = 1 - \left(h^{(t)(\nu\tau)}_f\right)^2, \qquad (6.13)
$$
and
$$
H^{(t')(\nu\tau)a}_{f\,f'} = T^{(t')(\nu+1\tau)a}_{f'}\,\Theta^{a(\nu+1)f'}_{f}, \qquad (6.14)
$$
we show in appendix 6.B.1 that (if τ + 1 exists, otherwise the second term is absent)
$$
\delta^{(t)(\nu-1\tau)}_f = \sum_{t'=0}^{T_{mb}-1}J^{(tt')(\nu\tau)}_f\sum_{e=0}^{1}\sum_{f'=0}^{F_{\nu+1-e}-1}H^{(t')(\nu-e\,\tau+e)b_e}_{f\,f'}\,\delta^{(t')(\nu-e\,\tau+e)}_{f'}, \qquad (6.15)
$$
where b_0 = ν and b_1 = τ.
$$
\Delta^{\nu(\nu)f}_{f'} = \sum_{\tau=0}^{T-1}\sum_{t=0}^{T_{mb}-1}T^{(t)(\nu\tau)}_f\,\delta^{(t)(\nu-1\tau)}_f\,h^{(t)(\nu-1\tau)}_{f'}, \qquad (6.17)
$$
$$
\Delta^{\tau(\nu)f}_{f'} = \sum_{\tau=1}^{T-1}\sum_{t=0}^{T_{mb}-1}T^{(t)(\nu\tau)}_f\,\delta^{(t)(\nu-1\tau)}_f\,h^{(t)(\nu\tau-1)}_{f'}, \qquad (6.18)
$$
$$
\Delta^{f}_{f'} = \sum_{\tau=0}^{T-1}\sum_{t=0}^{T_{mb}-1}h^{(t)(N-1\tau)}_{f'}\,\delta^{(t)(N-1\tau)}_f, \qquad (6.19)
$$
$$
\Delta^{\beta(\nu\tau)}_f = \sum_{t=0}^{T_{mb}-1}\sum_{e=0}^{1}\sum_{f'=0}^{F_{\nu+1-e}-1}H^{(t)(\nu-e\,\tau+e)b_e}_{f\,f'}\,\delta^{(t)(\nu-e\,\tau+e)}_{f'}, \qquad (6.20)
$$
$$
\Delta^{\gamma(\nu\tau)}_f = \sum_{t=0}^{T_{mb}-1}\sum_{e=0}^{1}\sum_{f'=0}^{F_{\nu+1-e}-1}\tilde h^{(t)(\nu\tau)}_f\,H^{(t)(\nu-e\,\tau+e)b_e}_{f\,f'}\,\delta^{(t)(\nu-e\,\tau+e)}_{f'}. \qquad (6.21)
$$
6.5 LSTM specificities
Figure 6.6: How the LSTM hidden units interact with each other.
$$
i^{(t)(\nu\tau)}_f = \sigma\left(\sum_{f'=0}^{F_{\nu-1}-1}\Theta^{i_\nu(\nu)f}_{f'}\,h^{(t)(\nu-1\tau)}_{f'} + \sum_{f'=0}^{F_{\nu}-1}\Theta^{i_\tau(\nu)f}_{f'}\,h^{(t)(\nu\tau-1)}_{f'}\right), \qquad (6.22)
$$
$$
f^{(t)(\nu\tau)}_f = \sigma\left(\sum_{f'=0}^{F_{\nu-1}-1}\Theta^{f_\nu(\nu)f}_{f'}\,h^{(t)(\nu-1\tau)}_{f'} + \sum_{f'=0}^{F_{\nu}-1}\Theta^{f_\tau(\nu)f}_{f'}\,h^{(t)(\nu\tau-1)}_{f'}\right), \qquad (6.23)
$$
$$
o^{(t)(\nu\tau)}_f = \sigma\left(\sum_{f'=0}^{F_{\nu-1}-1}\Theta^{o_\nu(\nu)f}_{f'}\,h^{(t)(\nu-1\tau)}_{f'} + \sum_{f'=0}^{F_{\nu}-1}\Theta^{o_\tau(\nu)f}_{f'}\,h^{(t)(\nu\tau-1)}_{f'}\right). \qquad (6.24)
$$
The sigmoid function is the reason why the i, f, o functions are called gates: they take their values between 0 and 1, therefore either allowing or forbidding information to pass through to the next step. The cell state update is then
and, as announced, the hidden state update is just a probe of the current cell state
$$
h^{(t)(\nu\tau)}_f = o^{(t)(\nu\tau)}_f\,\tanh\left(c^{(t)(\nu\tau)}_f\right). \qquad (6.27)
$$
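A minimal numpy sketch of one LSTM unit follows. The cell state update c = f ⊙ c^(τ−1) + i ⊙ g is the standard LSTM form assumed here; all names and shapes are ours, and batch normalization and biases are omitted:

```python
import numpy as np

rng = np.random.default_rng(4)

def sigma(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(h_below, h_prev, c_prev, W):
    # Each gate weight-averages the layer below ("spatial") and the
    # previous time step ("temporal"), as in eqs (6.22)-(6.24).
    i = sigma(W['i_nu'] @ h_below + W['i_tau'] @ h_prev)
    f = sigma(W['f_nu'] @ h_below + W['f_tau'] @ h_prev)
    o = sigma(W['o_nu'] @ h_below + W['o_tau'] @ h_prev)
    g = np.tanh(W['g_nu'] @ h_below + W['g_tau'] @ h_prev)
    c = f * c_prev + i * g      # standard cell state update
    h = o * np.tanh(c)          # eq (6.27): h probes the cell state
    return h, c

F_in, F = 3, 5
W = {k: rng.normal(size=(F, F_in if k.endswith('nu') else F)) * 0.5
     for k in ['i_nu', 'i_tau', 'f_nu', 'f_tau', 'o_nu', 'o_tau',
               'g_nu', 'g_tau']}

h, c = np.zeros(F), np.zeros(F)
for x in rng.normal(size=(8, F_in)):       # run 8 time steps
    h, c = lstm_step(x, h, c, W)
print(h.shape)  # (5,)
```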
These formulas singularly complicate the feed forward and especially the backpropagation procedure. For completeness, we will nevertheless carefully derive them. Let us mention in passing that recent studies have tried to replace the tanh activation function of the hidden state $h^{(t)(\nu\tau)}_f$ and the cell update $g^{(t)(\nu\tau)}_f$ by Rectified Linear Units, and seem to report better results with a proper initialization of all the weight matrices, argued to be diagonal
$$
\Theta^{f}_{f'}(\text{init}) = \frac{1}{2}\left[\delta^{f}_{f'} + \sqrt{\frac{6}{F_{\text{in}}+F_{\text{out}}}}\right], \qquad (6.28)
$$
with the bracket term here to possibly (or not) include some randomness into the initialization
$$
i^{(t)(\nu\tau)}_f = \sigma\left(\sum_{f'=0}^{F_{\nu-1}-1}\Theta^{i_\nu(\nu)f}_{f'}\,y^{(t)(\nu-1\tau)}_{f'} + \sum_{f'=0}^{F_{\nu}-1}\Theta^{i_\tau(\nu)f}_{f'}\,y^{(t)(\nu\tau-1)}_{f'}\right), \qquad (6.29)
$$
$$
f^{(t)(\nu\tau)}_f = \sigma\left(\sum_{f'=0}^{F_{\nu-1}-1}\Theta^{f_\nu(\nu)f}_{f'}\,y^{(t)(\nu-1\tau)}_{f'} + \sum_{f'=0}^{F_{\nu}-1}\Theta^{f_\tau(\nu)f}_{f'}\,y^{(t)(\nu\tau-1)}_{f'}\right), \qquad (6.30)
$$
$$
o^{(t)(\nu\tau)}_f = \sigma\left(\sum_{f'=0}^{F_{\nu-1}-1}\Theta^{o_\nu(\nu)f}_{f'}\,y^{(t)(\nu-1\tau)}_{f'} + \sum_{f'=0}^{F_{\nu}-1}\Theta^{o_\tau(\nu)f}_{f'}\,y^{(t)(\nu\tau-1)}_{f'}\right), \qquad (6.31)
$$
$$
g^{(t)(\nu\tau)}_f = \tanh\left(\sum_{f'=0}^{F_{\nu-1}-1}\Theta^{g_\nu(\nu)f}_{f'}\,y^{(t)(\nu-1\tau)}_{f'} + \sum_{f'=0}^{F_{\nu}-1}\Theta^{g_\tau(\nu)f}_{f'}\,y^{(t)(\nu\tau-1)}_{f'}\right), \qquad (6.32)
$$
where
$$
y^{(t)(\nu\tau)}_f = \gamma^{(\nu\tau)}_f\,\tilde h^{(t)(\nu\tau)}_f + \beta^{(\nu\tau)}_f, \qquad (6.33)
$$
as well as
$$
\tilde h^{(t)(\nu\tau)}_f = \frac{h^{(t)(\nu\tau)}_f - \hat h^{(\nu\tau)}_f}{\sqrt{\left(\sigma^{(\nu\tau)}_f\right)^2+\epsilon}} \qquad (6.34)
$$
and
$$
\hat h^{(\nu\tau)}_f = \frac{1}{T_{mb}}\sum_{t=0}^{T_{mb}-1}h^{(t)(\nu\tau)}_f, \qquad \left(\sigma^{(\nu\tau)}_f\right)^2 = \frac{1}{T_{mb}}\sum_{t=0}^{T_{mb}-1}\left(h^{(t)(\nu\tau)}_f - \hat h^{(\nu\tau)}_f\right)^2. \qquad (6.35)
$$
It is important to compute a running sum for the mean and the variance, which will serve for the evaluation of the cross-validation and the test set (calling e the number of iterations/epochs)
$$
E\left[h^{(t)(\nu\tau)}_f\right]_{e+1} = \frac{e\,E\left[h^{(t)(\nu\tau)}_f\right]_e + \hat h^{(\nu\tau)}_f}{e+1}, \qquad (6.36)
$$
$$
\text{Var}\left[h^{(t)(\nu\tau)}_f\right]_{e+1} = \frac{e\,\text{Var}\left[h^{(t)(\nu\tau)}_f\right]_e + \left(\hat\sigma^{(\nu\tau)}_f\right)^2}{e+1}, \qquad (6.37)
$$
and what will be used at the end is $E\left[h^{(t)(\nu\tau)}_f\right]$ and $\frac{T_{mb}}{T_{mb}-1}\text{Var}\left[h^{(t)(\nu\tau)}_f\right]$.
$$
\delta^{(t)(\nu-1\tau)}_f = \sum_{t'=0}^{T_{mb}-1}J^{(tt')(\nu\tau)}_f\sum_{e=0}^{1}\sum_{f'=0}^{F_{\nu+1-e}-1}H^{(t')(\nu-e\,\tau+e)b_e}_{f\,f'}\,\delta^{(t')(\nu-e\,\tau+e)}_{f'}. \qquad (6.39)
$$
and
$$
\Delta^{\rho_\nu(\nu)f}_{f'} = \sum_{\tau=0}^{T-1}\sum_{t=0}^{T_{mb}-1}\rho^{(\nu\tau)(t)}_f\,\delta^{(\nu\tau)(t)}_f\,h^{(\nu-1\tau)(t)}_{f'}, \qquad (6.42)
$$
$$
\Delta^{\rho_\tau(\nu)f}_{f'} = \sum_{\tau=1}^{T-1}\sum_{t=0}^{T_{mb}-1}\rho^{(\nu\tau)(t)}_f\,\delta^{(\nu\tau)(t)}_f\,h^{(\nu\tau-1)(t)}_{f'}, \qquad (6.43)
$$
and otherwise
$$
\Delta^{\rho_\nu(\nu)f}_{f'} = \sum_{\tau=0}^{T-1}\sum_{t=0}^{T_{mb}-1}\rho^{(\nu\tau)(t)}_f\,\delta^{(\nu\tau)(t)}_f\,y^{(\nu-1\tau)(t)}_{f'}, \qquad (6.44)
$$
$$
\rho^{(\nu\tau)(t)}_f\,\delta^{(\nu\tau)(t)}_f\,y^{(\nu-1\tau)(t)}_{f'}, \qquad (6.45)
$$
$$
\Delta^{\rho_\tau(\nu)f}_{f'} = \sum_{\tau=1}^{T-1}\sum_{t=0}^{T_{mb}-1}\rho^{(\nu\tau)(t)}_f\,\delta^{(\nu\tau)(t)}_f\,y^{(\nu\tau-1)(t)}_{f'}, \qquad (6.46)
$$
$$
\Delta^{\beta(\nu\tau)}_f = \sum_{t=0}^{T_{mb}-1}\sum_{e=0}^{1}\sum_{f'=0}^{F_{\nu+1-e}-1}H^{(t)(\nu-e\,\tau+e)b_e}_{f\,f'}\,\delta^{(t)(\nu-e\,\tau+e)}_{f'}, \qquad (6.47)
$$
$$
\Delta^{\gamma(\nu\tau)}_f = \sum_{t=0}^{T_{mb}-1}\sum_{e=0}^{1}\sum_{f'=0}^{F_{\nu+1-e}-1}\tilde h^{(t)(\nu\tau)}_f\,H^{(t)(\nu-e\,\tau+e)b_e}_{f\,f'}\,\delta^{(t)(\nu-e\,\tau+e)}_{f'}. \qquad (6.48)
$$
and
$$
\Delta^{f}_{f'} = \sum_{\tau=0}^{T-1}\sum_{t=0}^{T_{mb}-1}y^{(t)(N-1\tau)}_{f'}\,\delta^{(t)(N-1\tau)}_f. \qquad (6.49)
$$
Appendix
6.A Backpropagation through Batch Normalization
For Backpropagation, we will need
Since
and
$$
\frac{\partial\left(\hat\sigma^{(\nu\tau)}_{f'}\right)^2}{\partial h^{(t)(\nu\tau)}_f} = \frac{2\,\delta^{f'}_f}{T_{mb}}\left(h^{(t)(\nu\tau)}_f - \hat h^{(\nu\tau)}_f\right), \qquad (6.52)
$$
we get
$$
\begin{aligned}
\frac{\partial\tilde h^{(t')(\nu\tau)}_{f'}}{\partial h^{(t)(\nu\tau)}_f} &= \delta^{f'}_f\left[\frac{\delta^{t'}_t - \frac{1}{T_{mb}}}{\left(\left(\hat\sigma^{(\nu\tau)}_f\right)^2+\epsilon\right)^{\frac12}} - \frac{\left(h^{(t')(\nu\tau)}_f-\hat h^{(\nu\tau)}_f\right)\left(h^{(t)(\nu\tau)}_f-\hat h^{(\nu\tau)}_f\right)}{T_{mb}\left(\left(\hat\sigma^{(\nu\tau)}_f\right)^2+\epsilon\right)^{\frac32}}\right]\\
&= \frac{\delta^{f'}_f}{\left(\left(\hat\sigma^{(\nu\tau)}_f\right)^2+\epsilon\right)^{\frac12}}\left(\delta^{t'}_t - \frac{1+\tilde h^{(t')(\nu\tau)}_f\,\tilde h^{(t)(\nu\tau)}_f}{T_{mb}}\right). \qquad (6.53)
\end{aligned}
$$
$$
\tilde\gamma^{(\nu\tau)}_f = \frac{\gamma^{(\nu\tau)}_f}{\left(\left(\hat\sigma^{(\nu\tau)}_f\right)^2+\epsilon\right)^{\frac12}}. \qquad (6.54)
$$
so that
This modifies the error rate backpropagation, as well as the formula for the
weight update (y’s instead of h’s). In the following we will use the formula
$$
\delta^{(t)(\nu\tau)}_f = \frac{\delta}{\delta h^{(t)(\nu+1\tau)}_f}\,J(\Theta), \qquad (6.57)
$$
$$
h^{(t)(N\tau)}_f = o\left(\sum_{f'=0}^{F_{N-1}-1}\Theta^{f}_{f'}\,y^{(t)(N-1\tau)}_{f'}\right), \qquad (6.59)
$$
and
$$
h^{(t)(\nu\tau)}_f = \tanh\left(\sum_{f'=0}^{F_{\nu-1}-1}\Theta^{\nu(\nu)f}_{f'}\,y^{(t)(\nu-1\tau)}_{f'} + \sum_{f'=0}^{F_{\nu}-1}\Theta^{\tau(\nu)f}_{f'}\,y^{(t)(\nu\tau-1)}_{f'}\right), \qquad (6.60)
$$
we get
$$
\delta^{(t)(N-2\tau)}_f = \sum_{t'=0}^{T_{mb}-1}\left[\sum_{f'=0}^{F_{N}-1}\frac{\delta h^{(t')(N\tau)}_{f'}}{\delta h^{(t)(N-1\tau)}_f}\,\delta^{(t')(N-1\tau)}_{f'} + \sum_{f'=0}^{F_{N-1}-1}\frac{\delta h^{(t')(N-1\tau+1)}_{f'}}{\delta h^{(t)(N-1\tau)}_f}\,\delta^{(t')(N-2\tau+1)}_{f'}\right]. \qquad (6.61)
$$
Let us work out explicitly once (for a regression cost function and a trivial
identity output function)
as well as
Thus
$$
\delta^{(t)(N-2\tau)}_f = \sum_{t'=0}^{T_{mb}-1}J^{(tt')(N-1\tau)}_f\left[\sum_{f'=0}^{F_{N}-1}\Theta^{f'}_{f}\,\delta^{(t')(N-1\tau)}_{f'} + \sum_{f'=0}^{F_{N-1}-1}T^{(t')(N-1\tau+1)}_{f'}\,\Theta^{\tau(N-1)f'}_{f}\,\delta^{(t')(N-2\tau+1)}_{f'}\right]. \qquad (6.64)
$$
Here we adopted the convention that the $\delta^{(t')(N-2\,\tau+1)}$'s are 0 if τ + 1 = T. In a similar way, we derive for ν ≤ N − 1
$$
\delta^{(t)(\nu-1\tau)}_f = \sum_{t'=0}^{T_{mb}-1}J^{(tt')(\nu\tau)}_f\left[\sum_{f'=0}^{F_{\nu+1}-1}T^{(t')(\nu+1\tau)}_{f'}\,\Theta^{\nu(\nu+1)f'}_{f}\,\delta^{(t')(\nu\tau)}_{f'} + \sum_{f'=0}^{F_{\nu}-1}T^{(t')(\nu\tau+1)}_{f'}\,\Theta^{\tau(\nu)f'}_{f}\,\delta^{(t')(\nu-1\tau+1)}_{f'}\right]. \qquad (6.65)
$$
Defining
$$
T^{(t')(N\tau)}_{f'} = 1, \qquad \Theta^{\nu(N)f'}_{f} = \Theta^{f'}_{f}, \qquad (6.66)
$$
the previous $\delta^{(t)(\nu-1\tau)}_f$ formula extends to the case ν = N − 1. To unite the RNN and the LSTM formulas, let us finally define (with a either τ or ν)
$$
\delta^{(t)(\nu-1\tau)}_f = \sum_{t'=0}^{T_{mb}-1}J^{(tt')(\nu\tau)}_f\sum_{e=0}^{1}\sum_{f'=0}^{F_{\nu+1-e}-1}H^{(t')(\nu-e\,\tau+e)b_e}_{f\,f'}\,\delta^{(t')(\nu-e\,\tau+e)}_{f'}. \qquad (6.68)
$$
$$
\Delta^{\nu(\nu)f}_{f'} = \frac{\partial}{\partial\Theta^{\nu(\nu)f}_{f'}}\,J(\Theta), \qquad \Delta^{\tau(\nu)f}_{f'} = \frac{\partial}{\partial\Theta^{\tau(\nu)f}_{f'}}\,J(\Theta). \qquad (6.69)
$$
We first expand
$$
\Delta^{\nu(\nu)f}_{f'} = \sum_{\tau=0}^{T-1}\sum_{f''=0}^{F_{\nu}-1}\sum_{t=0}^{T_{mb}-1}\frac{\partial h^{(t)(\nu\tau)}_{f''}}{\partial\Theta^{\nu(\nu)f}_{f'}}\,\delta^{(t)(\nu-1\tau)}_{f''}, \qquad \Delta^{\tau(\nu)f}_{f'} = \sum_{\tau=0}^{T-1}\sum_{f''=0}^{F_{\nu}-1}\sum_{t=0}^{T_{mb}-1}\frac{\partial h^{(t)(\nu\tau)}_{f''}}{\partial\Theta^{\tau(\nu)f}_{f'}}\,\delta^{(t)(\nu-1\tau)}_{f''}, \qquad (6.70)
$$
so that
$$
\Delta^{\nu(\nu)f}_{f'} = \sum_{\tau=0}^{T-1}\sum_{t=0}^{T_{mb}-1}T^{(t)(\nu\tau)}_f\,\delta^{(t)(\nu-1\tau)}_f\,h^{(t)(\nu-1\tau)}_{f'}, \qquad (6.71)
$$
$$
\Delta^{\tau(\nu)f}_{f'} = \sum_{\tau=1}^{T-1}\sum_{t=0}^{T_{mb}-1}T^{(t)(\nu\tau)}_f\,\delta^{(t)(\nu-1\tau)}_f\,h^{(t)(\nu\tau-1)}_{f'}. \qquad (6.72)
$$
$$
\Delta^{f}_{f'} = \frac{\partial}{\partial\Theta^{f}_{f'}}\,J(\Theta). \qquad (6.73)
$$
We first expand
$$
\Delta^{f}_{f'} = \sum_{\tau=0}^{T-1}\sum_{f''=0}^{F_{N}-1}\sum_{t=0}^{T_{mb}-1}\frac{\partial h^{(t)(N\tau)}_{f''}}{\partial\Theta^{f}_{f'}}\,\delta^{(t)(N-1\tau)}_{f''} \qquad (6.74)
$$
so that
$$
\Delta^{f}_{f'} = \sum_{\tau=0}^{T-1}\sum_{t=0}^{T_{mb}-1}h^{(t)(N-1\tau)}_{f'}\,\delta^{(t)(N-1\tau)}_f. \qquad (6.75)
$$
Finally, we need
$$
\Delta^{\beta(\nu\tau)}_f = \frac{\partial}{\partial\beta^{(\nu\tau)}_f}\,J(\Theta), \qquad \Delta^{\gamma(\nu\tau)}_f = \frac{\partial}{\partial\gamma^{(\nu\tau)}_f}\,J(\Theta). \qquad (6.76)
$$
First
$$
\begin{aligned}
\Delta^{\beta(\nu\tau)}_f &= \sum_{t=0}^{T_{mb}-1}\left[\sum_{f'=0}^{F_{\nu+1}-1}\frac{\partial h^{(t)(\nu+1\tau)}_{f'}}{\partial\beta^{(\nu\tau)}_f}\,\delta^{(t)(\nu\tau)}_{f'} + \sum_{f'=0}^{F_{\nu}-1}\frac{\partial h^{(t)(\nu\tau+1)}_{f'}}{\partial\beta^{(\nu\tau)}_f}\,\delta^{(t)(\nu-1\tau+1)}_{f'}\right],\\
\Delta^{\gamma(\nu\tau)}_f &= \sum_{t=0}^{T_{mb}-1}\left[\sum_{f'=0}^{F_{\nu+1}-1}\frac{\partial h^{(t)(\nu+1\tau)}_{f'}}{\partial\gamma^{(\nu\tau)}_f}\,\delta^{(t)(\nu\tau)}_{f'} + \sum_{f'=0}^{F_{\nu}-1}\frac{\partial h^{(t)(\nu\tau+1)}_{f'}}{\partial\gamma^{(\nu\tau)}_f}\,\delta^{(t)(\nu-1\tau+1)}_{f'}\right]. \qquad (6.77)
\end{aligned}
$$
So that
$$
\Delta^{\beta(\nu\tau)}_f = \sum_{t=0}^{T_{mb}-1}\left[\sum_{f'=0}^{F_{\nu+1}-1}T^{(t)(\nu+1\tau)}_{f'}\,\Theta^{\nu(\nu+1)f'}_{f}\,\delta^{(t)(\nu\tau)}_{f'} + \sum_{f'=0}^{F_{\nu}-1}T^{(t)(\nu\tau+1)}_{f'}\,\Theta^{\tau(\nu)f'}_{f}\,\delta^{(t)(\nu-1\tau+1)}_{f'}\right], \qquad (6.78)
$$
$$
\Delta^{\gamma(\nu\tau)}_f = \sum_{t=0}^{T_{mb}-1}\left[\sum_{f'=0}^{F_{\nu+1}-1}T^{(t)(\nu+1\tau)}_{f'}\,\Theta^{\nu(\nu+1)f'}_{f}\,\tilde h^{(t)(\nu\tau)}_f\,\delta^{(t)(\nu\tau)}_{f'} + \sum_{f'=0}^{F_{\nu}-1}T^{(t)(\nu\tau+1)}_{f'}\,\Theta^{\tau(\nu)f'}_{f}\,\tilde h^{(t)(\nu\tau)}_f\,\delta^{(t)(\nu-1\tau+1)}_{f'}\right], \qquad (6.79)
$$
$$
\Delta^{\beta(\nu\tau)}_f = \sum_{t=0}^{T_{mb}-1}\sum_{e=0}^{1}\sum_{f'=0}^{F_{\nu+1-e}-1}H^{(t)(\nu-e\,\tau+e)b_e}_{f\,f'}\,\delta^{(t)(\nu-e\,\tau+e)}_{f'}, \qquad (6.80)
$$
$$
\Delta^{\gamma(\nu\tau)}_f = \sum_{t=0}^{T_{mb}-1}\sum_{e=0}^{1}\sum_{f'=0}^{F_{\nu+1-e}-1}\tilde h^{(t)(\nu\tau)}_f\,H^{(t)(\nu-e\,\tau+e)b_e}_{f\,f'}\,\delta^{(t)(\nu-e\,\tau+e)}_{f'}. \qquad (6.81)
$$
$$
\delta^{(t)(N-1\tau)}_f = \frac{1}{T_{mb}}\left(h^{(t)(N\tau)}_f - y^{(t)(\tau)}_f\right). \qquad (6.82)
$$
Before going any further, it will be useful to define
$$
\begin{aligned}
O^{(t)(\nu\tau)}_f &= h^{(t)(\nu\tau)}_f\left(1 - o^{(t)(\nu\tau)}_f\right),\\
I^{(t)(\nu\tau)}_f &= o^{(t)(\nu\tau)}_f\left(1-\tanh^2 c^{(t)(\nu\tau)}_f\right)g^{(t)(\nu\tau)}_f\,i^{(t)(\nu\tau)}_f\left(1-i^{(t)(\nu\tau)}_f\right),\\
F^{(t)(\nu\tau)}_f &= o^{(t)(\nu\tau)}_f\left(1-\tanh^2 c^{(t)(\nu\tau)}_f\right)c^{(t)(\nu\tau-1)}_f\,f^{(t)(\nu\tau)}_f\left(1-f^{(t)(\nu\tau)}_f\right),\\
G^{(t)(\nu\tau)}_f &= o^{(t)(\nu\tau)}_f\left(1-\tanh^2 c^{(t)(\nu\tau)}_f\right)i^{(t)(\nu\tau)}_f\left(1-\left(g^{(t)(\nu\tau)}_f\right)^2\right), \qquad (6.83)
\end{aligned}
$$
and
$$
\delta^{(t)(N-2\tau)}_f = \sum_{t'=0}^{T_{mb}-1}\left[\sum_{f'=0}^{F_{N}-1}\frac{\delta h^{(t')(N\tau)}_{f'}}{\delta h^{(t)(N-1\tau)}_f}\,\delta^{(t')(N-1\tau)}_{f'} + \sum_{f'=0}^{F_{N-1}-1}\frac{\delta h^{(t')(N-1\tau+1)}_{f'}}{\delta h^{(t)(N-1\tau)}_f}\,\delta^{(t')(N-2\tau+1)}_{f'}\right]. \qquad (6.85)
$$
We will be able to get our hands on the second term with the general formula, so let us first look at
$$
\frac{\delta h^{(t')(N\tau)}_{f'}}{\delta h^{(t)(N-1\tau)}_f} = \Theta^{f'}_{f}\,J^{(tt')(N-1\tau)}_f, \qquad (6.86)
$$
which is similar to the RNN case. Let us put aside the second term of $\delta^{(t)(N-2\tau)}_f$, and look at the general case.
Now
$$
\frac{\delta o^{(t')(\nu+1\tau)}_{f'}}{\delta h^{(t)(\nu\tau)}_f} = o^{(t')(\nu+1\tau)}_{f'}\left[1-o^{(t')(\nu+1\tau)}_{f'}\right]\sum_{f''=0}^{F_{\nu}-1}\Theta^{o_\nu(\nu+1)f'}_{f''}\,\frac{\delta y^{(t')(\nu\tau)}_{f''}}{\delta h^{(t)(\nu\tau)}_f} = o^{(t')(\nu+1\tau)}_{f'}\left[1-o^{(t')(\nu+1\tau)}_{f'}\right]\Theta^{o_\nu(\nu+1)f'}_{f}\,J^{(tt')(\nu\tau)}_f, \qquad (6.89)
$$
and
$$
\frac{\delta c^{(t')(\nu+1\tau)}_{f'}}{\delta h^{(t)(\nu\tau)}_f} = \frac{\delta i^{(t')(\nu+1\tau)}_{f'}}{\delta h^{(t)(\nu\tau)}_f}\,g^{(t')(\nu+1\tau)}_{f'} + \frac{\delta g^{(t')(\nu+1\tau)}_{f'}}{\delta h^{(t)(\nu\tau)}_f}\,i^{(t')(\nu+1\tau)}_{f'} + \frac{\delta f^{(t')(\nu+1\tau)}_{f'}}{\delta h^{(t)(\nu\tau)}_f}\,c^{(t')(\nu\tau)}_{f'}. \qquad (6.90)
$$
$$
\begin{aligned}
\frac{\delta i^{(t')(\nu+1\tau)}_{f'}}{\delta h^{(t)(\nu\tau)}_f} &= i^{(t')(\nu+1\tau)}_{f'}\left[1-i^{(t')(\nu+1\tau)}_{f'}\right]\Theta^{i_\nu(\nu+1)f'}_{f}\,J^{(tt')(\nu\tau)}_f,\\
\frac{\delta f^{(t')(\nu+1\tau)}_{f'}}{\delta h^{(t)(\nu\tau)}_f} &= f^{(t')(\nu+1\tau)}_{f'}\left[1-f^{(t')(\nu+1\tau)}_{f'}\right]\Theta^{f_\nu(\nu+1)f'}_{f}\,J^{(tt')(\nu\tau)}_f,\\
\frac{\delta g^{(t')(\nu+1\tau)}_{f'}}{\delta h^{(t)(\nu\tau)}_f} &= \left[1-\left(g^{(t')(\nu+1\tau)}_{f'}\right)^2\right]\Theta^{g_\nu(\nu+1)f'}_{f}\,J^{(tt')(\nu\tau)}_f, \qquad (6.91)
\end{aligned}
$$
so that
$$
\frac{\delta h^{(t')(\nu+1\tau)}_{f'}}{\delta h^{(t)(\nu\tau)}_f} = J^{(tt')(\nu\tau)}_f\,H^{(t)(\nu\tau)\nu}_{f\,f'}. \qquad (6.92)
$$
This formula also allows us to compute the second term of $\delta^{(t)(N-2\tau)}_f$. In a totally similar manner,
$$
\delta^{(t)(\nu-1\tau)}_f = \sum_{t'=0}^{T_{mb}-1}J^{(tt')(\nu\tau)}_f\left[\sum_{f'=0}^{F_{\nu+1}-1}H^{(t)(\nu\tau)\nu}_{f\,f'}\,\delta^{(t')(\nu\tau)}_{f'} + \sum_{f'=0}^{F_{\nu}-1}H^{(t)(\nu-1\tau+1)\tau}_{f\,f'}\,\delta^{(t')(\nu-1\tau+1)}_{f'}\right], \qquad (6.94)
$$
$$
\delta^{(t)(\nu-1\tau)}_f = \sum_{t'=0}^{T_{mb}-1}J^{(tt')(\nu\tau)}_f\sum_{e=0}^{1}\sum_{f'=0}^{F_{\nu+1-e}-1}H^{(t')(\nu-e\,\tau+e)b_e}_{f\,f'}\,\delta^{(t')(\nu-e\,\tau+e)}_{f'}. \qquad (6.95)
$$
This formula is also valid for ν = N − 1 if we define, as for the RNN case,
$$
H^{(t')(N\tau)}_{f'} = 1, \qquad \Theta^{\nu(N)f'}_{f} = \Theta^{f'}_{f}, \qquad (6.96)
$$
$$
\Delta^{\rho_\nu(\nu)f}_{f'} = \frac{\partial}{\partial\Theta^{\rho_\nu(\nu)f}_{f'}}\,J(\Theta), \qquad \Delta^{\rho_\tau(\nu)f}_{f'} = \frac{\partial}{\partial\Theta^{\rho_\tau(\nu)f}_{f'}}\,J(\Theta), \qquad (6.97)
$$
$$
\Delta^{\rho_\nu(\nu)f}_{f'} = \sum_{\tau=0}^{T-1}\sum_{f''=0}^{F_{\nu}-1}\sum_{t=0}^{T_{mb}-1}\frac{\partial h^{(\nu\tau)(t)}_{f''}}{\partial\Theta^{\rho_\nu(\nu)f}_{f'}}\,\frac{\partial}{\partial h^{(\nu\tau)(t)}_{f''}}J(\Theta) = \sum_{\tau=0}^{T-1}\sum_{f''=0}^{F_{\nu}-1}\sum_{t=0}^{T_{mb}-1}\frac{\partial h^{(\nu\tau)(t)}_{f''}}{\partial\Theta^{\rho_\nu(\nu)f}_{f'}}\,\delta^{(\nu\tau)(t)}_{f''}, \qquad (6.98)
$$
$$
\Delta^{\rho_\nu(\nu)f}_{f'} = \sum_{\tau=0}^{T-1}\sum_{t=0}^{T_{mb}-1}\rho^{(\nu\tau)(t)}_f\,\delta^{(\nu\tau)(t)}_f\,h^{(\nu-1\tau)(t)}_{f'}, \qquad (6.99)
$$
and else
$$
\Delta^{\rho_\nu(\nu)f}_{f'} = \sum_{\tau=0}^{T-1}\sum_{t=0}^{T_{mb}-1}\rho^{(\nu\tau)(t)}_f\,\delta^{(\nu\tau)(t)}_f\,y^{(\nu-1\tau)(t)}_{f'}, \qquad (6.100)
$$
$$
\Delta^{\rho_\tau(\nu)f}_{f'} = \sum_{\tau=1}^{T-1}\sum_{t=0}^{T_{mb}-1}\rho^{(\nu\tau)(t)}_f\,\delta^{(\nu\tau)(t)}_f\,y^{(\nu\tau-1)(t)}_{f'}. \qquad (6.101)
$$
$$
\Delta^{\beta(\nu\tau)}_f = \frac{\partial}{\partial\beta^{(\nu\tau)}_f}\,J(\Theta), \qquad \Delta^{\gamma(\nu\tau)}_f = \frac{\partial}{\partial\gamma^{(\nu\tau)}_f}\,J(\Theta). \qquad (6.102)
$$
and
$$
\Delta^{\gamma(\nu\tau)}_f = \sum_{t=0}^{T_{mb}-1}\tilde h^{(t)(\nu\tau)}_f\left\{\sum_{f'=0}^{F_{\nu+1}-1}H^{(t)(\nu\tau)\nu}_{f\,f'}\,\delta^{(t)(\nu\tau)}_{f'} + \sum_{f'=0}^{F_{\nu}-1}H^{(t)(\nu-1\tau+1)\tau}_{f\,f'}\,\delta^{(t)(\nu-1\tau+1)}_{f'}\right\}, \qquad (6.104)
$$
$$
\Delta^{\beta(\nu\tau)}_f = \sum_{t=0}^{T_{mb}-1}\sum_{e=0}^{1}\sum_{f'=0}^{F_{\nu+1-e}-1}H^{(t)(\nu-e\,\tau+e)b_e}_{f\,f'}\,\delta^{(t)(\nu-e\,\tau+e)}_{f'}, \qquad (6.105)
$$
$$
\Delta^{\gamma(\nu\tau)}_f = \sum_{t=0}^{T_{mb}-1}\sum_{e=0}^{1}\sum_{f'=0}^{F_{\nu+1-e}-1}\tilde h^{(t)(\nu\tau)}_f\,H^{(t)(\nu-e\,\tau+e)b_e}_{f\,f'}\,\delta^{(t)(\nu-e\,\tau+e)}_{f'}. \qquad (6.106)
$$
$$
\Delta^{f}_{f'} = \frac{\partial}{\partial\Theta^{f}_{f'}}\,J(\Theta). \qquad (6.107)
$$
We first expand
$$
\Delta^{f}_{f'} = \sum_{\tau=0}^{T-1}\sum_{f''=0}^{F_{N}-1}\sum_{t=0}^{T_{mb}-1}\frac{\partial h^{(t)(N\tau)}_{f''}}{\partial\Theta^{f}_{f'}}\,\delta^{(t)(N-1\tau)}_{f''} \qquad (6.108)
$$
so that
$$
\Delta^{f}_{f'} = \sum_{\tau=0}^{T-1}\sum_{t=0}^{T_{mb}-1}h^{(t)(N-1\tau)}_{f'}\,\delta^{(t)(N-1\tau)}_f. \qquad (6.109)
$$
6.D Peephole connexions

[Figure: the LSTM unit with peephole connexions $\Theta^{c_f(\nu)}$, $\Theta^{c_i(\nu)}$ and $\Theta^{c_o(\nu)}$ from the cell state to the gates.]
$$
i^{(\nu\tau)(t)}_f = \sigma\left(\sum_{f'=0}^{F_{\nu-1}-1}\Theta^{i_\nu(\nu)f}_{f'}\,h^{(\nu-1\tau)(t)}_{f'} + \sum_{f'=0}^{F_{\nu}-1}\left[\Theta^{i_\tau(\nu)f}_{f'}\,h^{(\nu\tau-1)(t)}_{f'} + \Theta^{c_i(\nu)f}_{f'}\,c^{(\nu\tau-1)(t)}_{f'}\right]\right), \qquad (6.110)
$$
$$
f^{(\nu\tau)(t)}_f = \sigma\left(\sum_{f'=0}^{F_{\nu-1}-1}\Theta^{f_\nu(\nu)f}_{f'}\,h^{(\nu-1\tau)(t)}_{f'} + \sum_{f'=0}^{F_{\nu}-1}\left[\Theta^{f_\tau(\nu)f}_{f'}\,h^{(\nu\tau-1)(t)}_{f'} + \Theta^{c_f(\nu)f}_{f'}\,c^{(\nu\tau-1)(t)}_{f'}\right]\right), \qquad (6.111)
$$
$$
o^{(\nu\tau)(t)}_f = \sigma\left(\sum_{f'=0}^{F_{\nu-1}-1}\Theta^{o_\nu(\nu)f}_{f'}\,h^{(\nu-1\tau)(t)}_{f'} + \sum_{f'=0}^{F_{\nu}-1}\left[\Theta^{o_\tau(\nu)f}_{f'}\,h^{(\nu\tau-1)(t)}_{f'} + \Theta^{c_o(\nu)f}_{f'}\,c^{(\nu\tau)(t)}_{f'}\right]\right), \qquad (6.112)
$$
which also modifies the LSTM backpropagation algorithm in a non-trivial way. As it has been shown that different LSTM formulations lead to pretty similar
Chapter 7

Conclusion
We have come to the end of our journey. I hope this note lived up to its promises, and that the reader now understands better how a neural network is designed and how it works under the hood. To wrap it up, we have seen the architecture of the three most common neural networks, as well as the careful mathematical derivation of their training formulas.
Deep Learning seems to be a fast-evolving field, and this material might be out of date in the near future, but the index approach adopted will still allow the reader, as it has helped the writer, to work out for herself what is behind the next state of the art architectures.

Until then, one should have enough material to encode from scratch one's own FNN, CNN and RNN-LSTM, as the author did as an empirical proof of his formulas.
Bibliography
[1] Nitish Srivastava et al. “Dropout: A Simple Way to Prevent Neural Net-
works from Overfitting”. In: J. Mach. Learn. Res. 15.1 (Jan. 2014), pp. 1929–
1958. issn: 1532-4435. url: http://dl.acm.org/citation.cfm?id=2627435.2670313.
[2] Sergey Ioffe and Christian Szegedy. “Batch Normalization: Accelerat-
ing Deep Network Training by Reducing Internal Covariate Shift”. In:
(Feb. 2015).
[3] Yann Lecun et al. “Gradient-based learning applied to document recog-
nition”. In: Proceedings of the IEEE. 1998, pp. 2278–2324.
[4] Karen Simonyan and Andrew Zisserman. “Very Deep Convolutional
Networks for Large-Scale Image Recognition”. In: CoRR abs/1409.1556
(2014). url: http://arxiv.org/abs/1409.1556.
[5] Kaiming He et al. “Deep Residual Learning for Image Recognition”. In:
7 (Dec. 2015).
[6] Alex Graves. Supervised Sequence Labelling with Recurrent Neural Networks.
2011.
[7] Felix A. Gers, Jürgen A. Schmidhuber, and Fred A. Cummins. “Learn-
ing to Forget: Continual Prediction with LSTM”. In: Neural Comput. 12.10
(Oct. 2000), pp. 2451–2471. issn: 0899-7667. doi: 10.1162/089976600300015015.
url: http://dx.doi.org/10.1162/089976600300015015.
[8] F. Rosenblatt. “The Perceptron: A Probabilistic Model for Information
Storage and Organization in The Brain”. In: Psychological Review (1958),
pp. 65–386.
[9] Yann LeCun et al. “Efficient BackProp”. In: Neural Networks: Tricks of the Trade, This Book is an Outgrowth of a 1996 NIPS Workshop. London, UK: Springer-Verlag, 1998, pp. 9–50. isbn: 3-540-65311-2. url: http://dl.acm.org/citation.cfm?id=645754.668382.
[10] Ning Qian. “On the momentum term in gradient descent learning al-
gorithms”. In: Neural Networks 12.1 (1999), pp. 145 –151. issn: 0893-6080.
doi: http://dx.doi.org/10.1016/S0893-6080(98)00116-6. url: http://www.sciencedirect.com/science/article/pii/S0893608098001166.
[11] Yurii Nesterov. “A method for unconstrained convex minimization prob-
lem with the rate of convergence O (1/k2)”. In: Doklady an SSSR. Vol. 269.
3. 1983, pp. 543–547.
105
106 BIBLIOGRAPHY
[12] John Duchi, Elad Hazan, and Yoram Singer. “Adaptive Subgradient
Methods for Online Learning and Stochastic Optimization”. In: J. Mach.
Learn. Res. 12 (July 2011), pp. 2121–2159. issn: 1532-4435. url: http://dl.acm.org/citation.cfm?id=1953048.2021068.
[13] Matthew D. Zeiler. “ADADELTA: An Adaptive Learning Rate Method”.
In: CoRR abs/1212.5701 (2012). url: http://dblp.uni-trier.de/db/journals/corr/corr1212.html#abs-1212-5701.
[14] Diederik Kingma and Jimmy Ba. “Adam: A Method for Stochastic Op-
timization”. In: (Dec. 2014).
[15] Rupesh K. Srivastava, Klaus Greff, and Jurgen Schmidhuber. “High-
way Networks”. In: (). url: http://arxiv.org/pdf/1505.00387v1.pdf.
[16] Jiuxiang Gu et al. “Recent Advances in Convolutional Neural Networks”.
In: CoRR abs/1512.07108 (2015).
[17] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. “ImageNet
Classification with Deep Convolutional Neural Networks”. In: Advances
in Neural Information Processing Systems 25. Ed. by F. Pereira et al. Curran
Associates, Inc., 2012, pp. 1097–1105. url: http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf.
[18] Christian Szegedy et al. “Going Deeper with Convolutions”. In: Com-
puter Vision and Pattern Recognition (CVPR). 2015. url: http://arxiv.org/abs/1409.4842.
[19] Gao Huang et al. “Densely Connected Convolutional Networks”. In:
(July 2017).