Computing Neural Network Gradients

Kevin Clark

1 Introduction
The purpose of these notes is to demonstrate how to quickly compute neural
network gradients in a completely vectorized way. It is complementary to the
last part of lecture 3 in CS224n 2019, which goes over the same material.

2 Vectorized Gradients
While it is a good exercise to compute the gradient of a neural network with re-
spect to a single parameter (e.g., a single element in a weight matrix), in practice
this tends to be quite slow. Instead, it is more efficient to keep everything in ma-
trix/vector form. The basic building block of vectorized gradients is the Jacobian
Matrix. Suppose we have a function $f : \mathbb{R}^n \to \mathbb{R}^m$ that maps a vector of length $n$ to a vector of length $m$: $f(x) = [f_1(x_1, \dots, x_n), f_2(x_1, \dots, x_n), \dots, f_m(x_1, \dots, x_n)]$.
Then its Jacobian is the following $m \times n$ matrix:
$$
\frac{\partial f}{\partial x} =
\begin{bmatrix}
\frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n}
\end{bmatrix}
$$

That is, $\left(\frac{\partial f}{\partial x}\right)_{ij} = \frac{\partial f_i}{\partial x_j}$ (which is just a standard non-vector derivative). The Jacobian matrix will be useful for us because we can apply the chain rule to a vector-valued function just by multiplying Jacobians.

As a little illustration of this, suppose we have a function f (x) = [f1 (x), f2 (x)]
taking a scalar to a vector of size 2 and a function g(y) = [g1 (y1 , y2 ), g2 (y1 , y2 )]
taking a vector of size two to a vector of size two. Now let’s compose them to
get g(x) = [g1 (f1 (x), f2 (x)), g2 (f1 (x), f2 (x))]. Using the regular chain rule, we
can compute the derivative of g as the Jacobian
$$
\frac{\partial g}{\partial x} =
\begin{bmatrix}
\frac{\partial}{\partial x} g_1(f_1(x), f_2(x)) \\[4pt]
\frac{\partial}{\partial x} g_2(f_1(x), f_2(x))
\end{bmatrix}
=
\begin{bmatrix}
\frac{\partial g_1}{\partial f_1}\frac{\partial f_1}{\partial x} + \frac{\partial g_1}{\partial f_2}\frac{\partial f_2}{\partial x} \\[4pt]
\frac{\partial g_2}{\partial f_1}\frac{\partial f_1}{\partial x} + \frac{\partial g_2}{\partial f_2}\frac{\partial f_2}{\partial x}
\end{bmatrix}
$$
And we see this is the same as multiplying the two Jacobians:
$$
\frac{\partial g}{\partial x} = \frac{\partial g}{\partial f}\frac{\partial f}{\partial x} =
\begin{bmatrix}
\frac{\partial g_1}{\partial f_1} & \frac{\partial g_1}{\partial f_2} \\[4pt]
\frac{\partial g_2}{\partial f_1} & \frac{\partial g_2}{\partial f_2}
\end{bmatrix}
\begin{bmatrix}
\frac{\partial f_1}{\partial x} \\[4pt]
\frac{\partial f_2}{\partial x}
\end{bmatrix}
$$
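To make this concrete, here is a small numerical check (an added illustration, not part of the original notes): it picks arbitrary smooth choices for f and g, estimates each Jacobian by finite differences, and confirms that the Jacobian of the composition equals the product of the two Jacobians.

import numpy as np

# f: R^1 -> R^2 and g: R^2 -> R^2 (arbitrary smooth choices, for illustration only)
def f(x):                       # x is a length-1 array
    return np.array([np.sin(x[0]), x[0] ** 2])

def g(y):                       # y is a length-2 array
    return np.array([y[0] * y[1], y[0] + np.exp(y[1])])

def jacobian_fd(func, x, eps=1e-6):
    """Finite-difference Jacobian of func at x."""
    cols = []
    for j in range(len(x)):
        xp, xm = x.copy(), x.copy()
        xp[j] += eps
        xm[j] -= eps
        cols.append((func(xp) - func(xm)) / (2 * eps))
    return np.stack(cols, axis=1)

x0 = np.array([0.7])
J_f = jacobian_fd(f, x0)                      # 2 x 1
J_g = jacobian_fd(g, f(x0))                   # 2 x 2
J_gf = jacobian_fd(lambda x: g(f(x)), x0)     # 2 x 1

print(np.allclose(J_gf, J_g @ J_f, atol=1e-5))   # True: Jacobian of composition = product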

3 Useful Identities
This section will now go over how to compute the Jacobian for several simple
functions. It will provide some useful identities you can apply when taking neu-
ral network gradients.

(1) Matrix times column vector with respect to the column vector
($z = Wx$, what is $\frac{\partial z}{\partial x}$?)

Suppose $W \in \mathbb{R}^{n \times m}$. Then we can think of $z$ as a function of $x$ taking an $m$-dimensional vector to an $n$-dimensional vector. So its Jacobian will be $n \times m$. Note that
$$z_i = \sum_{k=1}^{m} W_{ik} x_k$$
So an entry $\left(\frac{\partial z}{\partial x}\right)_{ij}$ of the Jacobian will be
$$\left(\frac{\partial z}{\partial x}\right)_{ij} = \frac{\partial z_i}{\partial x_j} = \frac{\partial}{\partial x_j} \sum_{k=1}^{m} W_{ik} x_k = \sum_{k=1}^{m} \frac{\partial}{\partial x_j} W_{ik} x_k = W_{ij}$$
because $\frac{\partial}{\partial x_j} x_k = 1$ if $k = j$ and 0 otherwise. So we see that $\frac{\partial z}{\partial x} = W$.
(2) Row vector times matrix with respect to the row vector
($z = xW$, what is $\frac{\partial z}{\partial x}$?)

A computation similar to (1) shows that $\frac{\partial z}{\partial x} = W^T$.
(3) A vector with itself
($z = x$, what is $\frac{\partial z}{\partial x}$?)

We have $z_i = x_i$. So
$$\left(\frac{\partial z}{\partial x}\right)_{ij} = \frac{\partial z_i}{\partial x_j} = \frac{\partial}{\partial x_j} x_i = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{otherwise} \end{cases}$$
So we see that the Jacobian $\frac{\partial z}{\partial x}$ is a diagonal matrix where the entry at $(i, i)$ is 1. This is just the identity matrix: $\frac{\partial z}{\partial x} = I$. When applying the chain rule, this term will disappear because a matrix or vector multiplied by the identity matrix does not change.
(4) An elementwise function applied to a vector
($z = f(x)$, what is $\frac{\partial z}{\partial x}$?)

Since $f$ is being applied elementwise, we have $z_i = f(x_i)$. So
$$\left(\frac{\partial z}{\partial x}\right)_{ij} = \frac{\partial z_i}{\partial x_j} = \frac{\partial}{\partial x_j} f(x_i) = \begin{cases} f'(x_i) & \text{if } i = j \\ 0 & \text{otherwise} \end{cases}$$
So we see that the Jacobian $\frac{\partial z}{\partial x}$ is a diagonal matrix where the entry at $(i, i)$ is the derivative of $f$ applied to $x_i$. We can write this as $\frac{\partial z}{\partial x} = \mathrm{diag}(f'(x))$. Since multiplication by a diagonal matrix is the same as doing elementwise multiplication by the diagonal, we could also write $\circ f'(x)$ when applying the chain rule.

(5) Matrix times column vector with respect to the matrix


($z = Wx$, $\delta = \frac{\partial J}{\partial z}$; what is $\frac{\partial J}{\partial W} = \frac{\partial J}{\partial z}\frac{\partial z}{\partial W} = \delta \frac{\partial z}{\partial W}$?)

This is a bit more complicated than the other identities. The reason for including $\frac{\partial J}{\partial z}$ in the above problem formulation will become clear in a moment.

First suppose we have a loss function $J$ (a scalar) and are computing its gradient with respect to a matrix $W \in \mathbb{R}^{n \times m}$. Then we could think of $J$ as a function of $W$ taking $nm$ inputs (the entries of $W$) to a single output ($J$). This means the Jacobian $\frac{\partial J}{\partial W}$ would be a $1 \times nm$ vector. But in practice this is not a very useful way of arranging the gradient. It would be much nicer if the derivatives were in an $n \times m$ matrix like this:
$$
\frac{\partial J}{\partial W} =
\begin{bmatrix}
\frac{\partial J}{\partial W_{11}} & \cdots & \frac{\partial J}{\partial W_{1m}} \\
\vdots & \ddots & \vdots \\
\frac{\partial J}{\partial W_{n1}} & \cdots & \frac{\partial J}{\partial W_{nm}}
\end{bmatrix}
$$
Since this matrix has the same shape as $W$, we could just subtract it (times the learning rate) from $W$ when doing gradient descent. So (in a slight abuse of notation) let's find this matrix as $\frac{\partial J}{\partial W}$ instead.
This way of arranging the gradients becomes complicated when computing $\frac{\partial z}{\partial W}$. Unlike $J$, $z$ is a vector. So if we are trying to rearrange the gradients like with $\frac{\partial J}{\partial W}$, $\frac{\partial z}{\partial W}$ would be an $n \times m \times n$ tensor! Luckily, we can avoid the issue by taking the gradient with respect to a single weight $W_{ij}$ instead. $\frac{\partial z}{\partial W_{ij}}$ is just a vector, which is much easier to deal with. We have
$$z_k = \sum_{l=1}^{m} W_{kl} x_l$$
$$\frac{\partial z_k}{\partial W_{ij}} = \sum_{l=1}^{m} x_l \frac{\partial}{\partial W_{ij}} W_{kl}$$
Note that $\frac{\partial}{\partial W_{ij}} W_{kl} = 1$ if $i = k$ and $j = l$ and 0 otherwise. So if $k \neq i$ everything in the sum is zero and the gradient is zero. Otherwise, the only nonzero element of the sum is when $l = j$, so we just get $x_j$. Thus we find $\frac{\partial z_k}{\partial W_{ij}} = x_j$ if $k = i$ and 0 otherwise. Another way of writing this is
$$
\frac{\partial z}{\partial W_{ij}} =
\begin{bmatrix}
0 \\ \vdots \\ 0 \\ x_j \\ 0 \\ \vdots \\ 0
\end{bmatrix}
\leftarrow i\text{th element}
$$
Now let's compute $\frac{\partial J}{\partial W_{ij}}$:
$$\frac{\partial J}{\partial W_{ij}} = \frac{\partial J}{\partial z}\frac{\partial z}{\partial W_{ij}} = \delta \frac{\partial z}{\partial W_{ij}} = \sum_k \delta_k \frac{\partial z_k}{\partial W_{ij}} = \delta_i x_j$$
(the only nonzero term in the sum is $\delta_i \frac{\partial z_i}{\partial W_{ij}}$). To get $\frac{\partial J}{\partial W}$ we want a matrix where entry $(i, j)$ is $\delta_i x_j$. This matrix is equal to the outer product
$$\frac{\partial J}{\partial W} = \delta^T x^T$$
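As a quick sanity check (an added sketch; the loss $J(z) = \frac{1}{2}\|z\|^2$ and all values are made up), the snippet below forms $\delta = \frac{\partial J}{\partial z}$ as a row vector and confirms that the outer product $\delta^T x^T$ matches a finite-difference gradient of $J$ with respect to $W$.

import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 3
W = rng.normal(size=(n, m))
x = rng.normal(size=(m, 1))

J = lambda W: 0.5 * np.sum((W @ x) ** 2)    # arbitrary scalar loss of z = Wx

z = W @ x
delta = z.T                                 # dJ/dz as a 1 x n row vector (here dJ/dz = z^T)
analytic = delta.T @ x.T                    # identity (5): dJ/dW = delta^T x^T, an n x m matrix

# finite-difference check, entry by entry
numeric = np.zeros_like(W)
eps = 1e-6
for i in range(n):
    for j in range(m):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        numeric[i, j] = (J(Wp) - J(Wm)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-5))   # True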
(6) Row vector times matrix with respect to the matrix
($z = xW$, $\delta = \frac{\partial J}{\partial z}$; what is $\frac{\partial J}{\partial W} = \delta \frac{\partial z}{\partial W}$?)

A similar computation to (5) shows that $\frac{\partial J}{\partial W} = x^T \delta$.
(7) Cross-entropy loss with respect to logits ($\hat{y} = \mathrm{softmax}(\theta)$, $J = CE(y, \hat{y})$, what is $\frac{\partial J}{\partial \theta}$?)

The gradient is
$$\frac{\partial J}{\partial \theta} = \hat{y} - y$$
(or $(\hat{y} - y)^T$ if $y$ is a column vector).
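Identity (7) can also be checked numerically. The following sketch (added here, assuming a one-hot label y and row-vector logits θ) compares ŷ − y against a centered-difference gradient of the cross-entropy loss.

import numpy as np

def softmax(theta):
    e = np.exp(theta - np.max(theta))        # shift for numerical stability
    return e / e.sum()

def cross_entropy(y, theta):
    return -np.sum(y * np.log(softmax(theta)))

rng = np.random.default_rng(1)
theta = rng.normal(size=5)
y = np.eye(5)[2]                             # one-hot label

analytic = softmax(theta) - y                # identity (7): dJ/dtheta = y_hat - y

eps = 1e-6
numeric = np.array([
    (cross_entropy(y, theta + eps * np.eye(5)[i]) -
     cross_entropy(y, theta - eps * np.eye(5)[i])) / (2 * eps)
    for i in range(5)
])

print(np.allclose(analytic, numeric, atol=1e-5))   # True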

These identities will be enough to let you quickly compute the gradients for many neural networks. However, it's important to know how to compute Jacobians for other functions as well in case they show up. Some examples if you want practice: dot product of two vectors, elementwise product of two vectors, 2-norm of a vector. Feel free to use these identities in the assignments. One option is just to memorize them. Another option is to figure them out by looking at the dimensions. For example, only one ordering/orientation of $\delta$ and $x$ will produce the correct shape for $\frac{\partial J}{\partial W}$ (assuming $W$ is not square).

4 Gradient Layout
The Jacobian formulation is great for applying the chain rule: you just have to multiply the Jacobians. However, when doing SGD it's more convenient to follow the convention "the shape of the gradient equals the shape of the parameter" (as we did when computing $\frac{\partial J}{\partial W}$). That way subtracting the gradient times the learning rate from the parameters is easy. We expect answers to homework questions to follow this convention. Therefore if you compute the gradient of a column vector using the Jacobian formulation, you should take the transpose when reporting your final answer so the gradient is a column vector. Another option is to always follow the convention. In this case the identities may not work, but you can still figure out the answer by making sure the dimensions of your derivatives match up. Up to you which of these options you choose!

5 Example: 1-Layer Neural Network


This section provides an example of computing the gradients of a full neural
network. In particular we are going to compute the gradients of a one-layer
neural network trained with cross-entropy loss. The forward pass of the model
is as follows:
$$\begin{aligned}
x &= \text{input} \\
z &= Wx + b_1 \\
h &= \mathrm{ReLU}(z) \\
\theta &= Uh + b_2 \\
\hat{y} &= \mathrm{softmax}(\theta) \\
J &= CE(y, \hat{y})
\end{aligned}$$
It helps to break up the model into the simplest parts possible, so note that we defined $z$ and $\theta$ to split up the activation functions from the linear transformations in the network's layers. The dimensions of the model's parameters are
$$x \in \mathbb{R}^{D_x \times 1} \quad b_1 \in \mathbb{R}^{D_h \times 1} \quad W \in \mathbb{R}^{D_h \times D_x} \quad b_2 \in \mathbb{R}^{N_c \times 1} \quad U \in \mathbb{R}^{N_c \times D_h}$$
where $D_x$ is the size of our input, $D_h$ is the size of our hidden layer, and $N_c$ is the number of classes.

In this example, we will compute all of the network's gradients:
$$\frac{\partial J}{\partial U} \quad \frac{\partial J}{\partial b_2} \quad \frac{\partial J}{\partial W} \quad \frac{\partial J}{\partial b_1} \quad \frac{\partial J}{\partial x}$$
To start with, recall that $\mathrm{ReLU}(x) = \max(x, 0)$. This means
$$\mathrm{ReLU}'(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{otherwise} \end{cases} = \mathrm{sgn}(\mathrm{ReLU}(x))$$
where sgn is the signum function. Note that we are able to write the derivative of the activation in terms of the activation itself.

Now let's write out the chain rule for $\frac{\partial J}{\partial U}$ and $\frac{\partial J}{\partial b_2}$:
$$\frac{\partial J}{\partial U} = \frac{\partial J}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial \theta}\frac{\partial \theta}{\partial U}$$
$$\frac{\partial J}{\partial b_2} = \frac{\partial J}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial \theta}\frac{\partial \theta}{\partial b_2}$$
Notice that $\frac{\partial J}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial \theta} = \frac{\partial J}{\partial \theta}$ is present in both gradients. This makes the math a bit cumbersome. Even worse, if we're implementing the model without automatic differentiation, computing $\frac{\partial J}{\partial \theta}$ twice will be inefficient. So it will help us to define some variables to represent the intermediate derivatives:
$$\delta_1 = \frac{\partial J}{\partial \theta} \qquad \delta_2 = \frac{\partial J}{\partial z}$$
These can be thought of as the error signals passed down to $\theta$ and $z$ when doing backpropagation. We can compute them as follows:
$$\begin{aligned}
\delta_1 &= \frac{\partial J}{\partial \theta} = (\hat{y} - y)^T && \text{this is just identity (7)} \\
\delta_2 &= \frac{\partial J}{\partial z} = \frac{\partial J}{\partial \theta}\frac{\partial \theta}{\partial h}\frac{\partial h}{\partial z} && \text{using the chain rule} \\
&= \delta_1 \frac{\partial \theta}{\partial h}\frac{\partial h}{\partial z} && \text{substituting in } \delta_1 \\
&= \delta_1 U \frac{\partial h}{\partial z} && \text{using identity (1)} \\
&= \delta_1 U \circ \mathrm{ReLU}'(z) && \text{using identity (4)} \\
&= \delta_1 U \circ \mathrm{sgn}(h) && \text{we computed this earlier}
\end{aligned}$$

A good way of checking our work is by looking at the dimensions of the Jacobians:
$$\underbrace{\frac{\partial J}{\partial z}}_{1 \times D_h} = \underbrace{\delta_1}_{1 \times N_c}\, \underbrace{U}_{N_c \times D_h} \circ \underbrace{\mathrm{sgn}(h)}_{D_h}$$
We see that the dimensions of all the terms in the gradient match up (i.e., the number of columns in a term equals the number of rows in the next term). This will always be the case if we computed our gradients correctly.

Now we can use the error terms to compute our gradients. Note that we transpose our answers when computing the gradients for column vector terms to follow the shape convention.
$$\begin{aligned}
\frac{\partial J}{\partial U} &= \frac{\partial J}{\partial \theta}\frac{\partial \theta}{\partial U} = \delta_1 \frac{\partial \theta}{\partial U} = \delta_1^T h^T && \text{using identity (5)} \\
\frac{\partial J}{\partial b_2} &= \frac{\partial J}{\partial \theta}\frac{\partial \theta}{\partial b_2} = \delta_1 \frac{\partial \theta}{\partial b_2} = \delta_1^T && \text{using identity (3) and transposing} \\
\frac{\partial J}{\partial W} &= \frac{\partial J}{\partial z}\frac{\partial z}{\partial W} = \delta_2 \frac{\partial z}{\partial W} = \delta_2^T x^T && \text{using identity (5)} \\
\frac{\partial J}{\partial b_1} &= \frac{\partial J}{\partial z}\frac{\partial z}{\partial b_1} = \delta_2 \frac{\partial z}{\partial b_1} = \delta_2^T && \text{using identity (3) and transposing} \\
\frac{\partial J}{\partial x} &= \frac{\partial J}{\partial z}\frac{\partial z}{\partial x} = (\delta_2 W)^T && \text{using identity (1) and transposing}
\end{aligned}$$
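To connect these formulas to code, here is a minimal NumPy sketch of the full forward and backward pass (an added illustration; the dimension names Dx, Dh, Nc follow the notes, while the random parameter values and the one-hot label are assumed test inputs). It implements exactly the expressions derived above and checks that every gradient has the same shape as its parameter.

import numpy as np

rng = np.random.default_rng(0)
Dx, Dh, Nc = 6, 4, 3

# parameters and a one-hot label (assumed test values)
W, b1 = rng.normal(size=(Dh, Dx)), rng.normal(size=(Dh, 1))
U, b2 = rng.normal(size=(Nc, Dh)), rng.normal(size=(Nc, 1))
x = rng.normal(size=(Dx, 1))
y = np.eye(Nc)[:, [1]]

# forward pass
z = W @ x + b1
h = np.maximum(z, 0)                                         # ReLU
theta = U @ h + b2
y_hat = np.exp(theta - theta.max()); y_hat /= y_hat.sum()    # softmax
J = -np.sum(y * np.log(y_hat))                               # cross-entropy loss

# backward pass, following the derivation (deltas kept as row vectors)
delta1 = (y_hat - y).T                    # dJ/dtheta, identity (7)
delta2 = (delta1 @ U) * np.sign(h).T      # dJ/dz = delta1 U o sgn(h)

dU  = delta1.T @ h.T                      # identity (5)
db2 = delta1.T                            # identity (3), transposed
dW  = delta2.T @ x.T                      # identity (5)
db1 = delta2.T                            # identity (3), transposed
dx  = (delta2 @ W).T                      # identity (1), transposed

for name, g, p in [("U", dU, U), ("b2", db2, b2), ("W", dW, W), ("b1", db1, b1), ("x", dx, x)]:
    assert g.shape == p.shape, name       # shape convention: gradient shape == parameter shape
print("all gradient shapes match")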

CS224n: Natural Language Processing with Deep Learning
Lecture Notes: Part III
Neural Networks, Backpropagation
Course Instructors: Christopher Manning, Richard Socher
Authors: Rohit Mundra, Amani Peddada, Richard Socher, Qiaojing Yan
Winter 2019

Keyphrases: Neural networks. Forward computation. Backward propagation. Neuron Units. Max-margin Loss. Gradient checks. Xavier parameter initialization. Learning rates. Adagrad.

This set of notes introduces single and multilayer neural networks, and how they can be used for classification purposes. We then discuss how they can be trained using a distributed gradient descent technique known as backpropagation. We will see how the chain rule can be used to make parameter updates sequentially. After a rigorous mathematical discussion of neural networks, we will discuss some practical tips and tricks in training neural networks involving: neuron units (non-linearities), gradient checks, Xavier parameter initialization, learning rates, Adagrad, etc. Lastly, we will motivate the use of recurrent neural networks as a language model.

1 Neural Networks: Foundations

We established in our previous discussions the need for non-linear classifiers since most data are not linearly separable and thus, our classification performance on them is limited. Neural networks are a family of classifiers with a non-linear decision boundary as seen in Figure 1. Now that we know the sort of decision boundaries neural networks create, let us see how they manage doing so.

(Figure 1: We see here how a non-linear decision boundary separates the data very well. This is the prowess of neural networks.)
1.1 A Neuron

A neuron is a generic computational unit that takes n inputs and produces a single output. What differentiates the outputs of different neurons is their parameters (also referred to as their weights). One of the most popular choices for neurons is the "sigmoid" or "binary logistic regression" unit. This unit takes an $n$-dimensional input vector $x$ and produces the scalar activation (output) $a$. This neuron is also associated with an $n$-dimensional weight vector, $w$, and a bias scalar, $b$. The output of this neuron is then:
$$a = \frac{1}{1 + \exp(-(w^T x + b))}$$
We can also combine the weights and bias term above to equivalently formulate:
$$a = \frac{1}{1 + \exp(-[w^T \; b] \cdot [x \; 1])}$$

(Fun Fact: Neural networks are biologically inspired classifiers, which is why they are often called "artificial neural networks" to distinguish them from the organic kind. However, in reality human neural networks are so much more capable and complex than artificial neural networks that it is usually better to not draw too many parallels between the two.)

(Sidenote: A neuron is the fundamental building block of neural networks. We will see that a neuron can be one of many functions that allows for non-linearities to accrue in the network.)

This formulation can be visualized in the manner shown in Figure 2.

(Figure 2: This image captures how in a sigmoid neuron, the input vector x is first scaled, summed, added to a bias unit, and then passed to the squashing sigmoid function.)

1.2 A Single Layer of Neurons

We extend the idea above to multiple neurons by considering the case where the input $x$ is fed as an input to multiple such neurons as shown in Figure 3.

If we refer to the different neurons' weights as $\{w^{(1)}, \cdots, w^{(m)}\}$ and the biases as $\{b_1, \cdots, b_m\}$, we can say the respective activations are $\{a_1, \cdots, a_m\}$:
$$a_1 = \frac{1}{1 + \exp(-(w^{(1)T} x + b_1))}$$
$$\vdots$$
$$a_m = \frac{1}{1 + \exp(-(w^{(m)T} x + b_m))}$$
Let us define the following abstractions to keep the notation simple and useful for more complex networks:
$$\sigma(z) = \begin{bmatrix} \frac{1}{1 + \exp(-z_1)} \\ \vdots \\ \frac{1}{1 + \exp(-z_m)} \end{bmatrix}$$
$$b = \begin{bmatrix} b_1 \\ \vdots \\ b_m \end{bmatrix} \in \mathbb{R}^m$$
$$W = \begin{bmatrix} -\; w^{(1)T} \;- \\ \cdots \\ -\; w^{(m)T} \;- \end{bmatrix} \in \mathbb{R}^{m \times n}$$
We can now write the output of scaling and biases as:
$$z = Wx + b$$
The activations of the sigmoid function can then be written as:
$$\begin{bmatrix} a^{(1)} \\ \vdots \\ a^{(m)} \end{bmatrix} = \sigma(z) = \sigma(Wx + b)$$

(Figure 3: This image captures how multiple sigmoid units are stacked on the right, all of which receive the same input x.)

So what do these activations really tell us? Well, one can think
of these activations as indicators of the presence of some weighted
combination of features. We can then use a combination of these
activations to perform classification tasks.

1.3 Feed-forward Computation


So far we have seen how an input vector x ∈ Rn can be fed to a
layer of sigmoid units to create activations a ∈ Rm . But what is the
intuition behind doing so? Let us consider the following named-
entity recognition (NER) problem in NLP as an example:

"Museums in Paris are amazing"


Here, we want to classify whether or not the center word "Paris" is a named-entity. In such cases, it is very likely that we would not just want to capture the presence of words in the window of word vectors but some other interactions between the words in order to make the classification. For instance, maybe it should matter that "Museums" is the first word only if "in" is the second word. Such non-linear decisions can often not be captured by inputs fed directly to a Softmax function but instead require the scoring of the intermediate layer discussed in Section 1.2. We can thus use another matrix $U \in \mathbb{R}^{m \times 1}$ to generate an unnormalized score for a classification task from the activations:
$$s = U^T a = U^T f(Wx + b)$$
where $f$ is the activation function.

Analysis of Dimensions: If we represent each word using a 4-dimensional word vector and we use a 5-word window as input (as in the above example), then the input $x \in \mathbb{R}^{20}$. If we use 8 sigmoid units in the hidden layer and generate 1 score output from the activations, then $W \in \mathbb{R}^{8 \times 20}$, $b \in \mathbb{R}^{8}$, $U \in \mathbb{R}^{8 \times 1}$, $s \in \mathbb{R}$. The stage-wise feed-forward computation is then:
$$z = Wx + b \qquad a = \sigma(z) \qquad s = U^T a$$
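As an added illustration of this dimension analysis (the random weights are placeholders), the window-scoring computation looks like this in NumPy:

import numpy as np

rng = np.random.default_rng(0)

# Dimensions from the analysis above: a 5-word window of 4-d word vectors gives x in R^20,
# 8 hidden sigmoid units, and one unnormalized score.  Weights are random placeholders.
x = rng.normal(size=(20, 1))
W = rng.normal(size=(8, 20))
b = rng.normal(size=(8, 1))
U = rng.normal(size=(8, 1))

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

z = W @ x + b          # (8, 1) pre-activations
a = sigmoid(z)         # (8, 1) hidden activations
s = float(U.T @ a)     # scalar window score

print(s)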

1.4 Maximum Margin Objective Function

(Figure 4: This image captures how a simple feed-forward network might compute its output.)

Like most machine learning models, neural networks also need an
optimization objective, a measure of error or goodness which we
want to minimize or maximize respectively. Here, we will discuss a
popular error metric known as the maximum margin objective. The
idea behind using this objective is to ensure that the score computed
for "true" labeled data points is higher than the score computed for
"false" labeled data points.
Using the previous example, call the score computed for the "true" labeled window "Museums in Paris are amazing" $s$ and the score computed for the "false" labeled window "Not all museums in Paris" $s_c$ (subscripted as $c$ to signify that the window is "corrupt"). Then, our objective function would be to maximize $(s - s_c)$ or to minimize $(s_c - s)$. However, we modify our objective to ensure that error is only computed if $s_c > s \Rightarrow (s_c - s) > 0$. The intuition behind doing this is that we only care that the "true" data point have a higher score than the "false" data point and that the rest does not matter. Thus, we want our error to be $(s_c - s)$ if $s_c > s$ else 0. Thus,
our optimization objective is now:

$$\text{minimize } J = \max(s_c - s, 0)$$

However, the above optimization objective is risky in the sense that


it does not attempt to create a margin of safety. We would want the
"true" labeled data point to score higher than the "false" labeled data
point by some positive margin ∆. In other words, we would want
error to be calculated if (s − sc < ∆) and not just when (s − sc < 0).
Thus, we modify the optimization objective:

$$\text{minimize } J = \max(\Delta + s_c - s, 0)$$

We can scale this margin such that it is ∆ = 1 and let the other
parameters in the optimization problem adapt to this without any
change in performance. For more information on this, read about
functional and geometric margins - a topic often covered in the study
of Support Vector Machines. Finally, we define the following optimization objective which we optimize over all training windows:
$$\text{minimize } J = \max(1 + s_c - s, 0)$$
In the above formulation $s_c = U^T f(Wx_c + b)$ and $s = U^T f(Wx + b)$.

(Sidenote: The max-margin objective function is most commonly associated with Support Vector Machines (SVMs).)
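As a small added illustration (with made-up scores), the margin loss and its "update only when the margin is violated" behavior look like this in code:

def max_margin_loss(s, s_c, delta=1.0):
    """J = max(delta + s_c - s, 0): zero (and zero gradient) when the true
    window beats the corrupt window by at least the margin delta."""
    return max(delta + s_c - s, 0.0)

print(max_margin_loss(s=3.0, s_c=1.5))   # 0.0 -> no parameter update needed
print(max_margin_loss(s=1.0, s_c=0.8))   # 0.8 -> positive cost, gradients flow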

1.5 Training with Backpropagation – Elemental


In this section we discuss how we train the different parameters in
the model when the cost J discussed in Section 1.4 is positive. No
parameter updates are necessary if the cost is 0. Since we typically
update parameters using gradient descent (or a variant such as SGD),
we typically need the gradient information for any parameter as
required in the update equation:

$$\theta^{(t+1)} = \theta^{(t)} - \alpha \nabla_{\theta^{(t)}} J$$

Backpropagation is a technique that allows us to use the chain rule of differentiation to calculate loss gradients for any parameter used in the feed-forward computation on the model. To understand this further, let us understand the toy network shown in Figure 5 for which we will perform backpropagation.

(Figure 5: This is a 4-2-1 neural network where neuron $j$ on layer $k$ receives input $z_j^{(k)}$ and produces activation output $a_j^{(k)}$.)

Here, we use a neural network with a single hidden layer and a single unit output. Let us establish some notation that will make it easier to generalize this model later:

• $x_i$ is an input to the neural network.

• $s$ is the output of the neural network.

• Each layer (including the input and output layers) has neurons which receive an input and produce an output. The $j$-th neuron of layer $k$ receives the scalar input $z_j^{(k)}$ and produces the scalar activation output $a_j^{(k)}$.

• We will call the backpropagated error calculated at $z_j^{(k)}$ as $\delta_j^{(k)}$.

• Layer 1 refers to the input layer and not the first hidden layer. For the input layer, $x_j = z_j^{(1)} = a_j^{(1)}$.

• $W^{(k)}$ is the transfer matrix that maps the output from the $k$-th layer to the input to the $(k+1)$-th. Thus, $W^{(1)} = W$ and $W^{(2)} = U$ to put this new generalized notation in perspective of Section 1.3.
Let us begin: Suppose the cost $J = (1 + s_c - s)$ is positive and we want to perform the update of parameter $W_{14}^{(1)}$ (in Figure 5 and Figure 6); we must realize that $W_{14}^{(1)}$ only contributes to $z_1^{(2)}$ and thus $a_1^{(2)}$. This fact is crucial to understanding backpropagation – backpropagated gradients are only affected by values they contribute to. $a_1^{(2)}$ is consequently used in the forward computation of the score by multiplication with $W_1^{(2)}$. We can see from the max-margin loss that:
$$\frac{\partial J}{\partial s} = -\frac{\partial J}{\partial s_c} = -1$$
Therefore we will work with $\frac{\partial s}{\partial W_{ij}^{(1)}}$ here for simplicity. Thus,

$$\begin{aligned}
\frac{\partial s}{\partial W_{ij}^{(1)}} &= \frac{\partial W^{(2)} a^{(2)}}{\partial W_{ij}^{(1)}} = \frac{\partial W_i^{(2)} a_i^{(2)}}{\partial W_{ij}^{(1)}} = W_i^{(2)} \frac{\partial a_i^{(2)}}{\partial W_{ij}^{(1)}} \\
\Rightarrow W_i^{(2)} \frac{\partial a_i^{(2)}}{\partial W_{ij}^{(1)}} &= W_i^{(2)} \frac{\partial a_i^{(2)}}{\partial z_i^{(2)}} \frac{\partial z_i^{(2)}}{\partial W_{ij}^{(1)}} \\
&= W_i^{(2)} \frac{\partial f(z_i^{(2)})}{\partial z_i^{(2)}} \frac{\partial z_i^{(2)}}{\partial W_{ij}^{(1)}} \\
&= W_i^{(2)} f'(z_i^{(2)}) \frac{\partial z_i^{(2)}}{\partial W_{ij}^{(1)}} \\
&= W_i^{(2)} f'(z_i^{(2)}) \frac{\partial}{\partial W_{ij}^{(1)}} \left( b_i^{(1)} + a_1^{(1)} W_{i1}^{(1)} + a_2^{(1)} W_{i2}^{(1)} + a_3^{(1)} W_{i3}^{(1)} + a_4^{(1)} W_{i4}^{(1)} \right) \\
&= W_i^{(2)} f'(z_i^{(2)}) \frac{\partial}{\partial W_{ij}^{(1)}} \left( b_i^{(1)} + \sum_k a_k^{(1)} W_{ik}^{(1)} \right) \\
&= W_i^{(2)} f'(z_i^{(2)}) \, a_j^{(1)} \\
&= \delta_i^{(2)} \cdot a_j^{(1)}
\end{aligned}$$

We see above that the gradient reduces to the product $\delta_i^{(2)} \cdot a_j^{(1)}$ where $\delta_i^{(2)}$ is essentially the error propagating backwards from the $i$-th neuron in layer 2. $a_j^{(1)}$ is an input fed to the $i$-th neuron in layer 2 when scaled by $W_{ij}$.
Let us discuss the "error sharing/distribution" interpretation of backpropagation better using Figure 6 as an example. Say we were to update $W_{14}^{(1)}$:

(Figure 6: This subnetwork shows the relevant parts of the network required to update $W_{ij}^{(1)}$.)

1. We start with an error signal of 1 propagating backwards from $a_1^{(3)}$.

2. We then multiply this error by the local gradient of the neuron which maps $z_1^{(3)}$ to $a_1^{(3)}$. This happens to be 1 in this case and thus, the error is still 1. This is now known as $\delta_1^{(3)} = 1$.

3. At this point, the error signal of 1 has reached $z_1^{(3)}$. We now need to distribute the error signal so that the "fair share" of the error reaches $a_1^{(2)}$.

4. This amount is the (error signal at $z_1^{(3)}$ = $\delta_1^{(3)}$) $\times\, W_1^{(2)} = W_1^{(2)}$. Thus, the error at $a_1^{(2)} = W_1^{(2)}$.

5. As we did in step 2, we need to move the error across the neuron which maps $z_1^{(2)}$ to $a_1^{(2)}$. We do this by multiplying the error signal at $a_1^{(2)}$ by the local gradient of the neuron which happens to be $f'(z_1^{(2)})$.

6. Thus, the error signal at $z_1^{(2)}$ is $f'(z_1^{(2)}) W_1^{(2)}$. This is known as $\delta_1^{(2)}$.

7. Finally, we need to distribute the "fair share" of the error to $W_{14}^{(1)}$ by simply multiplying it by the input it was responsible for forwarding, which happens to be $a_4^{(1)}$.

8. Thus, the gradient of the loss with respect to $W_{14}^{(1)}$ is calculated to be $a_4^{(1)} f'(z_1^{(2)}) W_1^{(2)}$.

Notice that the result we arrive at using this approach is exactly


the same as that we arrived at using explicit differentiation earlier.
Thus, we can calculate error gradients with respect to a parameter
in the network using either the chain rule of differentiation or using
an error sharing and distributed flow approach – both of these ap-
proaches happen to do the exact same thing but it might be helpful
to think about them one way or another.

(1)
Bias Updates: Bias terms (such as b1 ) are mathematically equivalent
(2)
to other weights contributing to the neuron input (z1 ) as long as the
input being forwarded is 1. As such, the bias gradients for neuron
(k) (1)
i on layer k is simply δi . For instance, if we were updating b1
(1) (2) (2)
instead of W14 above, the gradient would simply be f 0 (z1 )W1 .

Generalized steps to propagate $\delta^{(k)}$ to $\delta^{(k-1)}$:

(Figure 7: Propagating error from $\delta^{(k)}$ to $\delta^{(k-1)}$.)

1. We have error $\delta_i^{(k)}$ propagating backwards from $z_i^{(k)}$, i.e. neuron $i$ at layer $k$. See Figure 7.

2. We propagate this error backwards to $a_j^{(k-1)}$ by multiplying $\delta_i^{(k)}$ by the path weight $W_{ij}^{(k-1)}$.

3. Thus, the error received at $a_j^{(k-1)}$ is $\delta_i^{(k)} W_{ij}^{(k-1)}$.

4. However, $a_j^{(k-1)}$ may have been forwarded to multiple nodes in the next layer as shown in Figure 8. It should receive responsibility for errors propagating backward from node $m$ in layer $k$ too, using the exact same mechanism.

5. Thus, the error received at $a_j^{(k-1)}$ is $\delta_i^{(k)} W_{ij}^{(k-1)} + \delta_m^{(k)} W_{mj}^{(k-1)}$.

6. In fact, we can generalize this to be $\sum_i \delta_i^{(k)} W_{ij}^{(k-1)}$.

7. Now that we have the correct error at $a_j^{(k-1)}$, we move it across neuron $j$ at layer $k-1$ by multiplying with the local gradient $f'(z_j^{(k-1)})$.

8. Thus, the error that reaches $z_j^{(k-1)}$, called $\delta_j^{(k-1)}$, is
$$f'(z_j^{(k-1)}) \sum_i \delta_i^{(k)} W_{ij}^{(k-1)}$$

(Figure 8: Propagating error from $\delta^{(k)}$ to $\delta^{(k-1)}$.)

1.6 Training with Backpropagation – Vectorized


So far, we discussed how to calculate gradients for a given parameter
in the model. Here we will generalize the approach above so that
we update weight matrices and bias vectors all at once. Note that
these are simply extensions of the above model that will help build
intuition for the way error propagation can be done at a matrix-
vector level.
For a given parameter $W_{ij}^{(k)}$, we identified that the error gradient is simply $\delta_i^{(k+1)} \cdot a_j^{(k)}$. As a reminder, $W^{(k)}$ is the matrix that maps $a^{(k)}$ to $z^{(k+1)}$. We can thus establish that the error gradient for the entire matrix $W^{(k)}$ is:
$$
\nabla_{W^{(k)}} =
\begin{bmatrix}
\delta_1^{(k+1)} a_1^{(k)} & \delta_1^{(k+1)} a_2^{(k)} & \cdots \\
\delta_2^{(k+1)} a_1^{(k)} & \delta_2^{(k+1)} a_2^{(k)} & \cdots \\
\vdots & \vdots & \ddots
\end{bmatrix}
= \delta^{(k+1)} a^{(k)T}
$$
Thus, we can write an entire matrix gradient using the outer product of the error vector propagating into the matrix and the activations forwarded by the matrix.

Now, we will see how we can calculate the error vector $\delta^{(k)}$. We established earlier using Figure 8 that $\delta_j^{(k)} = f'(z_j^{(k)}) \sum_i \delta_i^{(k+1)} W_{ij}^{(k)}$. This can easily generalize to matrices such that:
$$\delta^{(k)} = f'(z^{(k)}) \circ (W^{(k)T} \delta^{(k+1)})$$
In the above formulation, the $\circ$ operator corresponds to an elementwise product between elements of vectors ($\circ : \mathbb{R}^N \times \mathbb{R}^N \to \mathbb{R}^N$).

(Sidenote: Error propagates from layer $(k+1)$ to $(k)$ in the following manner: $\delta^{(k)} = f'(z^{(k)}) \circ (W^{(k)T} \delta^{(k+1)})$. Of course, this assumes that in the forward propagation the signal $z^{(k)}$ first goes through activation neurons $f$ to generate activations $a^{(k)}$, which are then linearly combined to yield $z^{(k+1)}$ via transfer matrix $W^{(k)}$.)

Computational efficiency: Having explored element-wise updates


as well as vector-wise updates, we must realize that the vectorized
implementations run substantially faster in scientific computing
environments such as MATLAB or Python (using NumPy/SciPy
packages). Thus, we should use vectorized implementation in prac-
tice. Furthermore, we should also reduce redundant calculations
in backpropagation - for instance, notice that δ(k) depends directly
on δ(k+1) . Thus, we should ensure that when we update W (k) using
δ(k+1) , we save δ(k+1) to later derive δ(k) – and we then repeat this for
(k − 1) . . . (1). Such a recursive procedure is what makes backpropa-
gation a computationally affordable procedure.
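A minimal sketch of this recursion in NumPy (an added illustration, assuming a sigmoid activation f, random weights, and an already-computed error vector at the top layer) shows how each δ(k+1) is cached and used both to form the gradient for W(k) and to derive δ(k):

import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
sigmoid_grad = lambda z: sigmoid(z) * (1.0 - sigmoid(z))

rng = np.random.default_rng(0)
sizes = [5, 4, 3, 2]                                    # layer widths (assumed)
Ws = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]

# forward pass, caching pre-activations z and activations a
a = [rng.normal(size=(sizes[0], 1))]                    # a[0] is the input
z = [None]
for W in Ws:
    z.append(W @ a[-1])
    a.append(sigmoid(z[-1]))

delta = rng.normal(size=(sizes[-1], 1))                 # error at the top layer (assumed given)
grads = [None] * len(Ws)
for k in reversed(range(len(Ws))):
    grads[k] = delta @ a[k].T                           # grad_W(k) = delta(k+1) a(k)^T
    if k > 0:
        delta = sigmoid_grad(z[k]) * (Ws[k].T @ delta)  # delta(k) = f'(z(k)) o (W(k)^T delta(k+1))

print([g.shape for g in grads])                         # matches the shapes of Ws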

2 Neural Networks: Tips and Tricks

Having discussed the mathematical foundations of neural networks, we will now dive into some tips and tricks commonly employed when using neural networks in practice.

2.1 Gradient Check


In the last section, we discussed in detail how to calculate error
gradients/updates for parameters in a neural network model via
calculus-based (analytic) methods. Here we now introduce a tech-
nique of numerically approximating these gradients – though too
computationally inefficient to be used directly for training the net-
works, this method will allow us to very precisely estimate the
derivative with respect to any parameter; it can thus serve as a useful
sanity check on the correctness of our analytic derivatives. Given a
model with parameter vector θ and loss function J, the numerical
gradient around $\theta_i$ is simply given by the centered difference formula:
$$f'(\theta) \approx \frac{J(\theta^{(i+)}) - J(\theta^{(i-)})}{2\epsilon}$$
where $\epsilon$ is a small number (usually around $10^{-5}$). The term $J(\theta^{(i+)})$ is simply the error calculated on a forward pass for a given input when we perturb the parameter $\theta$'s $i$th element by $+\epsilon$. Similarly, the term $J(\theta^{(i-)})$ is the error calculated on a forward pass for the same input when we perturb the parameter $\theta$'s $i$th element by $-\epsilon$. Thus, using two forward passes, we can approximate the gradient with respect to any given parameter element in the model. We note that this definition of the numerical gradient follows very naturally from the definition of the derivative, where, in the scalar case,
$$f'(x) \approx \frac{f(x + \epsilon) - f(x)}{\epsilon}$$
Of course, there is a slight difference – the definition above only perturbs $x$ in the positive direction to compute the gradient. While it would have been perfectly acceptable to define the numerical gradient in this way, in practice it is often more precise and stable to use the centered difference formula, where we perturb a parameter in both directions. The intuition is that to get a better approximation of the derivative/slope around a point, we need to examine the function $f$'s behavior both to the left and right of that point. It can also be shown using Taylor's theorem that the centered difference formula has an error proportional to $\epsilon^2$, which is quite small, whereas the derivative definition is more error-prone.

(Sidenote: Gradient checks are a great way to compare analytical and numerical gradients. Analytical gradients should be close to the numerical gradients, which can be calculated using $f'(\theta) \approx \frac{J(\theta^{(i+)}) - J(\theta^{(i-)})}{2\epsilon}$; $J(\theta^{(i+)})$ and $J(\theta^{(i-)})$ can be evaluated using two forward passes. An implementation of this can be seen in Snippet 2.1.)
Now, a natural question you might ask is, if this method is so pre-
cise, why do we not use it to compute all of our network gradients
instead of applying back-propagation? The simple answer, as hinted
earlier, is inefficiency – recall that every time we want to compute the
gradient with respect to an element, we need to make two forward passes through the network, which will be computationally expensive. Furthermore, many large-scale neural networks can contain
millions of parameters, and computing two passes per parameter is
clearly not optimal. And, since in optimization techniques such as
SGD, we must compute the gradients once per iteration for several
thousands of iterations, it is obvious that this method quickly grows
intractable. This inefficiency is why we only use gradient check to
verify the correctness of our analytic gradients, which are much
quicker to compute. A standard implementation of gradient check is
shown below:

Snippet 2.1

import numpy as np

def eval_numerical_gradient(f, x):
    """
    a naive implementation of numerical gradient of f at x
    - f should be a function that takes a single argument
    - x is the point (numpy array) to evaluate the gradient at
    """

    fx = f(x)  # evaluate function value at original point
    grad = np.zeros(x.shape)
    h = 0.00001

    # iterate over all indexes in x
    it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        # evaluate function at x+h and x-h
        ix = it.multi_index
        old_value = x[ix]
        x[ix] = old_value + h      # increment by h
        fxh_left = f(x)            # evaluate f(x + h)
        x[ix] = old_value - h      # decrement by h
        fxh_right = f(x)           # evaluate f(x - h)
        x[ix] = old_value          # restore to previous value (very important!)

        # compute the partial derivative with the centered difference formula
        grad[ix] = (fxh_left - fxh_right) / (2 * h)  # the slope
        it.iternext()              # step to next dimension
    return grad
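For example (an added usage sketch, assuming eval_numerical_gradient from Snippet 2.1 is in scope), checking the analytic gradient of $f(x) = \sum_i x_i^2$, whose gradient is $2x$:

import numpy as np

f = lambda x: np.sum(x ** 2)                 # simple scalar-valued function
x = np.random.randn(3, 4)

numerical = eval_numerical_gradient(f, x)    # from Snippet 2.1
analytic = 2 * x

print(np.max(np.abs(numerical - analytic)))  # should be tiny (~1e-6 or smaller)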

2.2 Regularization
As with many machine learning models, neural networks are highly
prone to overfitting, where a model is able to obtain near perfect per-
formance on the training dataset, but loses the ability to generalize
to unseen data. A common technique used to address overfitting (an
issue also known as the “high-variance problem”) is the incorpora-
tion of an L2 regularization penalty. The idea is that we will simply
append an extra term to our loss function J, so that the overall cost is
now calculated as:

$$J_R = J + \lambda \sum_{i=1}^{L} \left\| W^{(i)} \right\|_F$$

(Sidenote: The Frobenius norm of a matrix $U$ is defined as $\|U\|_F = \sqrt{\sum_i \sum_j U_{ij}^2}$.)

In the above formulation, $\|W^{(i)}\|_F$ is the Frobenius norm of the matrix $W^{(i)}$ (the $i$-th weight matrix in the network) and $\lambda$ is the hyper-parameter controlling how much weight the regularization
term has relative to the original cost function. Since we are trying
to minimize JR , what regularization is essentially doing is penaliz-
ing weights for being too large while optimizing over the original
cost function. Due to the quadratic nature of the Frobenius norm
(which computes the sum of the squared elements of a matrix), L2 -
regularization effectively reduces the flexibility of the model and
thereby reduces the overfitting phenomenon. Imposing such a con-
straint can also be interpreted as the prior Bayesian belief that the
optimal weights are close to zero – how close depends on the value
of λ. Choosing the right value of λ is critical, and must be chosen
via hyperparameter-tuning. Too high a value of λ causes most of
the weights to be set too close to 0, and the model does not learn
anything meaningful from the training data, often obtaining poor ac-
curacy on training, validation, and testing sets. Too low a value, and
we fall into the domain of overfitting once again. It must be noted
that the bias terms are not regularized and do not contribute to the
cost term above – try thinking about why this is the case!
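A small added sketch of how the penalty from the formula above enters the cost (the weight matrices are arbitrary placeholders; biases are excluded, as noted):

import numpy as np

def l2_penalty(weight_matrices, lam):
    """Frobenius-norm penalty  lam * sum_i ||W^(i)||_F  as in the formula above."""
    return lam * sum(np.linalg.norm(W) for W in weight_matrices)   # np.linalg.norm = Frobenius norm

rng = np.random.default_rng(0)
Ws = [rng.normal(size=(8, 20)), rng.normal(size=(1, 8))]   # example weight matrices (biases excluded)

J_data = 0.42                     # stand-in for the unregularized loss on a batch
J_R = J_data + l2_penalty(Ws, lam=1e-4)
print(J_R)

Many implementations penalize the squared Frobenius norm instead, in which case the penalty's contribution to the gradient of $W^{(i)}$ is simply $2\lambda W^{(i)}$.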
There are indeed other types of regularization that are sometimes
used, such as L1 regularization, which sums over the absolute values
(rather than squares) of parameter elements – however, this is less
commonly applied in practice since it leads to sparsity of parameter
weights. In the next section, we discuss dropout, which effectively acts
as another form of regularization by randomly dropping (i.e. setting
to zero) neurons in the forward pass.

2.3 Dropout
Dropout is a powerful technique for regularization, first introduced
by Srivastava et al. in Dropout: A Simple Way to Prevent Neural Net-
works from Overfitting. The idea is simple yet effective – during train-
ing, we will randomly “drop” with some probability (1 − p) a subset
of neurons during each forward/backward pass (or equivalently,
we will keep alive each neuron with a probability p). Then, during
testing, we will use the full network to compute our predictions. The
result is that the network typically learns more meaningful informa-
tion from the data, is less likely to overfit, and usually obtains higher
performance overall on the task at hand. One intuitive reason why
this technique should be so effective is that what dropout is essentially doing is training exponentially many smaller networks at once and averaging over their predictions.
In practice, the way we introduce dropout is that we take the out-
put h of each layer of neurons, and keep each neuron with prob-
ability p, and else set it to 0. Then, during back-propagation, we
only pass gradients through neurons that were kept alive during
the forward pass. Finally, during testing, we compute the forward
pass using all of the neurons in the network. However, a key subtlety is that in order for dropout to work effectively, the expected output of a neuron during testing should be approximately the same as it was during training – else the magnitude of the outputs could be radically different, and the behavior of the network is no longer well-defined. Thus, we must typically divide the outputs of each neuron during testing by a certain value – it is left as an exercise to the reader to determine what this value should be in order for the expected outputs during training and testing to be equivalent.

(Figure: Dropout applied to an artificial neural network. Image credits to Srivastava et al.)
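A minimal sketch of the training-time masking described above (an added illustration; the test-time rescaling is deliberately omitted, since the notes leave that value as an exercise):

import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(h, p):
    """Keep each unit of h alive with probability p, zero it otherwise (training time)."""
    mask = (rng.random(h.shape) < p).astype(h.dtype)
    return h * mask, mask

def dropout_backward(dout, mask):
    """Only pass gradients through the units that were kept alive in the forward pass."""
    return dout * mask

h = rng.normal(size=(8, 1))
h_dropped, mask = dropout_forward(h, p=0.5)
dh = dropout_backward(np.ones_like(h), mask)
print(mask.ravel(), dh.ravel())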

2.4 Neuron Units


So far we have discussed neural networks that contain sigmoidal
neurons to introduce nonlinearities; however in many applications
better networks can be designed using other activation functions.
Some common choices are listed here with their function and gra-
dient definitions and these can be substituted with the sigmoidal
functions discussed above.

Sigmoid: This is the default choice we have discussed; the activation function $\sigma$ is given by:
$$\sigma(z) = \frac{1}{1 + \exp(-z)}$$

(Figure 9: The response of a sigmoid nonlinearity.)

where $\sigma(z) \in (0, 1)$.

The gradient of $\sigma(z)$ is:
$$\sigma'(z) = \frac{\exp(-z)}{(1 + \exp(-z))^2} = \sigma(z)(1 - \sigma(z))$$

Tanh: The tanh function is an alternative to the sigmoid function that is often found to converge faster in practice. The primary difference between tanh and sigmoid is that the tanh output ranges from −1 to 1 while the sigmoid ranges from 0 to 1.
$$\tanh(z) = \frac{\exp(z) - \exp(-z)}{\exp(z) + \exp(-z)} = 2\sigma(2z) - 1$$

(Figure 10: The response of a tanh nonlinearity.)
where $\tanh(z) \in (-1, 1)$.

The gradient of $\tanh(z)$ is:
$$\tanh'(z) = 1 - \left( \frac{\exp(z) - \exp(-z)}{\exp(z) + \exp(-z)} \right)^2 = 1 - \tanh^2(z)$$

Hard tanh: The hard tanh function is sometimes preferred over the tanh function since it is computationally cheaper. It does however saturate for magnitudes of z greater than 1. The activation of the hard tanh is:
$$\mathrm{hardtanh}(z) = \begin{cases} -1 & : z < -1 \\ z & : -1 \le z \le 1 \\ 1 & : z > 1 \end{cases}$$

(Figure 11: The response of a hard tanh nonlinearity.)

The derivative can also be expressed in a piecewise functional form:
$$\mathrm{hardtanh}'(z) = \begin{cases} 1 & : -1 \le z \le 1 \\ 0 & : \text{otherwise} \end{cases}$$

Soft sign: The soft sign function is another nonlinearity which can be considered an alternative to tanh since it too does not saturate as easily as hard clipped functions:
$$\mathrm{softsign}(z) = \frac{z}{1 + |z|}$$

(Figure 12: The response of a soft sign nonlinearity.)

The derivative is then expressed as:
$$\mathrm{softsign}'(z) = \frac{\mathrm{sgn}(z)}{(1 + |z|)^2}$$
where sgn is the signum function which returns ±1 depending on the sign of z.

ReLU: The ReLU (Rectified Linear Unit) function is a popular choice of activation since it does not saturate even for larger values of z and has found much success in computer vision applications:
$$\mathrm{rect}(z) = \max(z, 0)$$

(Figure 13: The response of a ReLU nonlinearity.)

The derivative is then the piecewise function:
$$\mathrm{rect}'(z) = \begin{cases} 1 & : z > 0 \\ 0 & : \text{otherwise} \end{cases}$$

Leaky ReLU: Traditional ReLU units by design do not propagate any error for non-positive z – the leaky ReLU modifies this such that a small error is allowed to propagate backwards even when z is negative:
$$\mathrm{leaky}(z) = \max(z, k \cdot z)$$
where $0 < k < 1$.

(Figure 14: The response of a leaky ReLU nonlinearity.)

This way, the derivative is representable as:
$$\mathrm{leaky}'(z) = \begin{cases} 1 & : z > 0 \\ k & : \text{otherwise} \end{cases}$$
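For reference, here is a compact NumPy summary of these nonlinearities and their derivatives, following the formulas above (an added sketch; k = 0.01 for the leaky ReLU is an assumed value):

import numpy as np

sigmoid = lambda z: 1 / (1 + np.exp(-z))
d_sigmoid = lambda z: sigmoid(z) * (1 - sigmoid(z))

tanh, d_tanh = np.tanh, lambda z: 1 - np.tanh(z) ** 2

hardtanh = lambda z: np.clip(z, -1, 1)
d_hardtanh = lambda z: ((z >= -1) & (z <= 1)).astype(float)

softsign = lambda z: z / (1 + np.abs(z))
d_softsign = lambda z: np.sign(z) / (1 + np.abs(z)) ** 2

relu = lambda z: np.maximum(z, 0)
d_relu = lambda z: (z > 0).astype(float)

leaky = lambda z, k=0.01: np.maximum(z, k * z)
d_leaky = lambda z, k=0.01: np.where(z > 0, 1.0, k)

z = np.linspace(-2, 2, 5)
print(relu(z), d_relu(z))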

2.5 Data Preprocessing


As is the case with machine learning models generally, a key step
to ensuring that your model obtains reasonable performance on the
task at hand is to perform basic preprocessing on your data. Some
common techniques are outlined below.

Mean Subtraction
Given a set of input data X, it is customary to zero-center the data by
subtracting the mean feature vector of X from X. An important point
is that in practice, the mean is calculated only across the training set,
and this mean is subtracted from the training, validation, and testing
sets.

Normalization
Another frequently used technique (though perhaps less so than
mean subtraction) is to scale every input feature dimension to have
similar ranges of magnitudes. This is useful since input features are
often measured in different “units”, but we often want to initially
consider all features as equally important. The way we accomplish
this is by simply dividing the features by their respective standard
deviation calculated across the training set.

Whitening
Not as commonly used as mean-subtraction + normalization, whitening essentially converts the data to have an identity covariance matrix – that is, features become uncorrelated and have a variance of 1. This is done by first mean-subtracting the data, as usual, to get X′. We can then take the Singular Value Decomposition (SVD) of X′ to get matrices U, S, V. We then compute UX′ to project X′ into the basis defined by the columns of U. We finally divide each dimension of the result by the corresponding singular value in S to scale our data appropriately (if a singular value is zero, we can just divide by a small number instead).
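An added sketch of these three preprocessing steps in NumPy (it assumes the data matrix X has one example per row and that statistics are computed on the training split only):

import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(loc=3.0, scale=[1.0, 10.0, 0.1], size=(100, 3))

# mean subtraction: statistics come from the training set only
mean = X_train.mean(axis=0)
X0 = X_train - mean

# normalization: divide each feature by its training-set standard deviation
std = X0.std(axis=0)
X_norm = X0 / std

# whitening: decorrelate features and give them unit variance
U, S, Vt = np.linalg.svd(X0, full_matrices=False)
X_white = (X0 @ Vt.T) / (S / np.sqrt(len(X0) - 1) + 1e-8)

print(np.round(np.cov(X_white, rowvar=False), 2))   # approximately the identity matrix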

2.6 Parameter Initialization


A key step towards achieving superlative performance with a neural network is initializing the parameters in a reasonable way. A good starting strategy is to initialize the weights to small random numbers normally distributed around 0 – and in practice, this often works acceptably well. However, in Understanding the difficulty of training deep feedforward neural networks (2010), Glorot and Bengio study the effect of different weight and bias initialization schemes on training dynamics. The empirical findings suggest that for sigmoid and tanh activation units, faster convergence and lower error rates are achieved when the weights of a matrix $W \in \mathbb{R}^{n^{(l+1)} \times n^{(l)}}$ are initialized randomly with a uniform distribution as follows:
$$W \sim U\left[ -\sqrt{\frac{6}{n^{(l)} + n^{(l+1)}}},\; \sqrt{\frac{6}{n^{(l)} + n^{(l+1)}}} \right]$$
Where $n^{(l)}$ is the number of input units to $W$ (fan-in) and $n^{(l+1)}$ is the number of output units from $W$ (fan-out). In this parameter
initialization scheme, bias units are initialized to 0. This approach
attempts to maintain activation variances as well as backpropagated
gradient variances across layers. Without such initialization, the
gradient variances (which are a proxy for information) generally
decrease with backpropagation across layers.
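A short added sketch of this initialization scheme (the fan-in and fan-out values 20 and 8 are taken from the earlier dimension example):

import numpy as np

def xavier_uniform(fan_in, fan_out, rng=np.random.default_rng(0)):
    """W ~ U[-sqrt(6/(fan_in+fan_out)), +sqrt(6/(fan_in+fan_out))]; biases start at 0."""
    bound = np.sqrt(6.0 / (fan_in + fan_out))
    W = rng.uniform(-bound, bound, size=(fan_out, fan_in))
    b = np.zeros((fan_out, 1))
    return W, b

W, b = xavier_uniform(fan_in=20, fan_out=8)
print(W.shape, W.min(), W.max())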

2.7 Learning Strategies


The rate/magnitude of model parameter updates during training can
be controlled using the learning rate. In the following naive Gradient
Descent formulation, α is the learning rate:

$$\theta^{new} = \theta^{old} - \alpha \nabla_{\theta} J_t(\theta)$$



You might think that for fast convergence rates, we should set α
to larger values – however faster convergence is not guaranteed with
larger learning rates. In fact, with very large learning rates, we
might experience that the loss function actually diverges because the
parameter updates cause the model to overshoot the convex minima
as shown in Figure 15. In non-convex models (most of those we work
with), the outcome of a large learning rate is unpredictable, but the
chances of diverging loss functions are very high.
The simple solution to avoiding a diverging loss is to use a very small learning rate so that we carefully scan the parameter space – of course, if we use too small a learning rate, we might not converge in a reasonable amount of time, or might get caught in local minima. Thus, as with any other hyperparameter, the learning rate must be tuned effectively.

(Figure 15: Here we see that updating parameter w2 with a large learning rate can lead to divergence of the error.)
Since training is the most expensive phase in a deep learning system, some research has attempted to improve this naive approach to setting learning rates. For instance, Ronan Collobert scales the learning rate of a weight $W_{ij}$ (where $W \in \mathbb{R}^{n^{(l+1)} \times n^{(l)}}$) by the inverse square root of the fan-in of the neuron ($n^{(l)}$).

There are several other techniques that have proven to be effec-


tive as well – one such method is annealing, where, after several
iterations, the learning rate is reduced in some way – this method
ensures that we start off with a high learning rate and approach a
minimum quickly; as we get closer to the minimum, we start lower-
ing our learning rate so that we can find the optimum under a more
fine-grained scope. A common way to perform annealing is to reduce
the learning rate α by a factor x after every n iterations of learning.
Exponential decay is also common, where, the learning rate α at iter-
ation t is given by α(t) = α0 e−kt , where α0 is the initial learning rate,
and k is a hyperparameter. Another approach is to allow the learning
rate to decrease over time such that:

$$\alpha(t) = \frac{\alpha_0 \tau}{\max(t, \tau)}$$
In the above scheme, α0 is a tunable parameter and represents the
starting learning rate. τ is also a tunable parameter and represents
the time at which the learning rate should start reducing. In practice,
this method has been found to work quite well. In the next section
we discuss another method for adaptive gradient descent which does
not require hand-set learning rates.
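The annealing schedules mentioned above can be written compactly (an added sketch; the values of α0, k, and τ are placeholders):

import numpy as np

alpha0 = 0.1        # initial learning rate (placeholder value)

def exponential_decay(t, k=0.01):
    """alpha(t) = alpha0 * exp(-k t)"""
    return alpha0 * np.exp(-k * t)

def inverse_time_decay(t, tau=100):
    """alpha(t) = alpha0 * tau / max(t, tau): constant until step tau, then shrinking."""
    return alpha0 * tau / max(t, tau)

for t in (0, 100, 1000):
    print(t, exponential_decay(t), inverse_time_decay(t))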

2.8 Momentum Updates


Momentum methods, a variant of gradient descent inspired by the
study of dynamics and motion in physics, attempt to use the "velocity" of updates as a more effective update scheme. Pseudocode for momentum updates is shown below:

Snippet 2.2
# Computes a standard momentum update on parameters x.
# v is the velocity (initialized to zeros and persisted across updates),
# mu is the momentum coefficient (e.g. 0.9) and alpha is the learning rate.
v = mu*v - alpha*grad_x
x += v

2.9 Adaptive Optimization Methods


AdaGrad is an implementation of standard stochastic gradient de-
scent (SGD) with one key difference: the learning rate can vary for
each parameter. The learning rate for each parameter depends on
the history of gradient updates of that parameter in a way such that
parameters with a scarce history of updates are updated faster using
a larger learning rate. In other words, parameters that have not been
updated much in the past are likelier to have higher learning rates
now. Formally:

$$\theta_{t,i} = \theta_{t-1,i} - \frac{\alpha}{\sqrt{\sum_{\tau=1}^{t} g_{\tau,i}^2}}\, g_{t,i} \quad \text{where} \quad g_{t,i} = \frac{\partial}{\partial \theta_i^t} J_t(\theta)$$

In this technique, we see that if the RMS of the history of gradients


is extremely low, the learning rate is very high. A simple implemen-
tation of this technique is:

Snippet 2.3
# Assume the gradient dx and parameter vector x.
# cache is initialized to zeros with the same shape as x and persists across updates.
cache += dx**2
x += - learning_rate * dx / np.sqrt(cache + 1e-8)

Other common adaptive methods are RMSProp and Adam, whose


update rules are shown below (courtesy of Andrej Karpathy):

Snippet 2.4
# Update rule for RMS prop
cache = decay_rate * cache + (1 - decay_rate) * dx**2
x += - learning_rate * dx / (np.sqrt(cache) + eps)

Snippet 2.5
# Update rule for Adam

m = beta1*m + (1-beta1)*dx
v = beta2*v + (1-beta2)*(dx**2)
x += - learning_rate * m / (np.sqrt(v) + eps)

RMSProp is a variant of AdaGrad that utilizes a moving average


of squared gradients – in particular, unlike AdaGrad, its updates
do not become monotonically smaller. The Adam update rule is in
turn a variant of RMSProp, but with the addition of momentum-
like updates. We refer the reader to the respective sources of these
methods for more detailed analyses of their behavior.

CS231n Convolutional Neural Networks for Visual Recognition



Table of Contents:

Introduction
Simple expressions, interpreting the gradient
Compound expressions, chain rule, backpropagation
Intuitive understanding of backpropagation
Modularity: Sigmoid example
Backprop in practice: Staged computation
Patterns in backward flow
Gradients for vectorized operations
Summary

Introduction
Motivation. In this section we will develop expertise with an intuitive understanding of
backpropagation, which is a way of computing gradients of expressions through recursive
application of chain rule. Understanding of this process and its subtleties is critical for you to
understand, and effectively develop, design and debug neural networks.

Problem statement. The core problem studied in this section is as follows: We are given some
function f (x) where x is a vector of inputs and we are interested in computing the gradient of f
at x (i.e. ∇f (x) ).

Motivation. Recall that the primary reason we are interested in this problem is that in the specific
case of neural networks, f will correspond to the loss function ( L ) and the inputs x will consist
of the training data and the neural network weights. For example, the loss could be the SVM loss
function and the inputs are both the training data (xi , yi ), i = 1 … N and the weights and biases
W, b. Note that (as is usually the case in Machine Learning) we think of the training data as given
and fixed, and of the weights as variables we have control over. Hence, even though we can easily
use backpropagation to compute the gradient on the input examples xi , in practice we usually
only compute the gradient for the parameters (e.g. W, b) so that we can use it to perform a parameter update. However, as we will see later in the class, the gradient on $x_i$ can still be useful sometimes, for example for purposes of visualization and interpreting what the Neural Network might be doing.

If you are coming to this class and you’re comfortable with deriving gradients with chain rule, we
would still like to encourage you to at least skim this section, since it presents a rarely developed
view of backpropagation as backward flow in real-valued circuits and any insights you’ll gain may
help you throughout the class.

Simple expressions and interpretation of the gradient


Let's start simple so that we can develop the notation and conventions for more complex
expressions. Consider a simple multiplication function of two numbers f (x, y) = xy. It is a
matter of simple calculus to derive the partial derivative for either input:

$$f(x, y) = xy \quad \rightarrow \quad \frac{\partial f}{\partial x} = y \qquad \frac{\partial f}{\partial y} = x$$

Interpretation. Keep in mind what the derivatives tell you: They indicate the rate of change of a
function with respect to that variable surrounding an infinitesimally small region near a particular
point:

$$\frac{df(x)}{dx} = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$$

A technical note is that the division sign on the left-hand side is, unlike the division sign on the right-hand side, not a division. Instead, this notation indicates that the operator $\frac{d}{dx}$ is being applied to the function $f$, and returns a different function (the derivative). A nice way to think about the expression above is that when $h$ is very small, then the function is well-approximated by a straight line, and the derivative is its slope. In other words, the derivative on each variable tells you the sensitivity of the whole expression on its value. For example, if $x = 4, y = -3$ then $f(x, y) = -12$ and the derivative on $x$ is $\frac{\partial f}{\partial x} = -3$. This tells us that if we were to increase the value of this variable by a tiny amount, the effect on the whole expression would be to decrease it (due to the negative sign), and by three times that amount. This can be seen by rearranging the above equation ($f(x + h) = f(x) + h \frac{df(x)}{dx}$). Analogously, since $\frac{\partial f}{\partial y} = 4$, we expect that increasing the value of $y$ by some very small amount $h$ would also increase the output of the function (due to the positive sign), and by $4h$.

The derivative on each variable tells you the sensitivity of the whole expression on its value.

As mentioned, the gradient $\nabla f$ is the vector of partial derivatives, so we have that $\nabla f = [\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}] = [y, x]$. Even though the gradient is technically a vector, we will often use terms such as "the gradient on x" instead of the technically correct phrase "the partial derivative on x" for simplicity.

We can also derive the derivatives for the addition operation:

$$f(x, y) = x + y \quad \rightarrow \quad \frac{\partial f}{\partial x} = 1 \qquad \frac{\partial f}{\partial y} = 1$$

that is, the derivative on both $x, y$ is one regardless of what the values of $x, y$ are. This makes
sense, since increasing either x, y would increase the output of f , and the rate of that increase
would be independent of what the actual values of x, y are (unlike the case of multiplication
above). The last function we’ll use quite a bit in the class is the max operation:

$$f(x, y) = \max(x, y) \quad \rightarrow \quad \frac{\partial f}{\partial x} = \mathbb{1}(x \ge y) \qquad \frac{\partial f}{\partial y} = \mathbb{1}(y \ge x)$$

That is, the (sub)gradient is 1 on the input that was larger and 0 on the other input. Intuitively, if
the inputs are x = 4, y = 2, then the max is 4, and the function is not sensitive to the setting of y .
That is, if we were to increase it by a tiny amount h , the function would keep outputting 4, and
therefore the gradient is zero: there is no effect. Of course, if we were to change y by a large
amount (e.g. larger than 2), then the value of f would change, but the derivatives tell us nothing
about the effect of such large changes on the inputs of a function; They are only informative for
tiny, infinitesimally small changes on the inputs, as indicated by the $\lim_{h \to 0}$ in its definition.

Compound expressions with chain rule


Lets now start to consider more complicated expressions that involve multiple composed
functions, such as f (x, y, z) = (x + y)z. This expression is still simple enough to differentiate
directly, but we’ll take a particular approach to it that will be helpful with understanding the
intuition behind backpropagation. In particular, note that this expression can be broken down into two expressions: $q = x + y$ and $f = qz$. Moreover, we know how to compute the derivatives of both expressions separately, as seen in the previous section. $f$ is just multiplication of $q$ and $z$, so $\frac{\partial f}{\partial q} = z$, $\frac{\partial f}{\partial z} = q$, and $q$ is addition of $x$ and $y$ so $\frac{\partial q}{\partial x} = 1$, $\frac{\partial q}{\partial y} = 1$. However, we don't necessarily care about the gradient on the intermediate value $q$ - the value of $\frac{\partial f}{\partial q}$ is not useful. Instead, we are
ultimately interested in the gradient of f with respect to its inputs x, y, z. The chain rule tells us
that the correct way to “chain” these gradient expressions together is through multiplication. For

example, $\frac{\partial f}{\partial x} = \frac{\partial f}{\partial q} \frac{\partial q}{\partial x}$. In practice this is simply a multiplication of the two numbers that hold the two gradients. Let's see this with an example:

# set some inputs


x = -2; y = 5; z = -4

# perform the forward pass


q = x + y # q becomes 3
f = q * z # f becomes -12

# perform the backward pass (backpropagation) in reverse order:


# first backprop through f = q * z
dfdz = q # df/dz = q, so gradient on z becomes 3
dfdq = z # df/dq = z, so gradient on q becomes -4
dqdx = 1.0
dqdy = 1.0
# now backprop through q = x + y
dfdx = dfdq * dqdx # The multiplication here is the chain rule!
dfdy = dfdq * dqdy

We are left with the gradient in the variables [dfdx,dfdy,dfdz], which tell us the sensitivity of the variables x,y,z on f! This is the simplest example of backpropagation. Going forward, we
will use a more concise notation that omits the df prefix. For example, we will simply write dq
instead of dfdq , and always assume that the gradient is computed on the final output.

This computation can also be nicely visualized with a circuit diagram:

[Figure] The real-valued “circuit” on left shows the visual representation of the computation. The forward pass computes values from inputs to output (shown in green). The backward pass then performs backpropagation which starts at the end and recursively applies the chain rule to compute the gradients (shown in red) all the way to the inputs of the circuit. The gradients can be thought of as flowing backwards through the circuit.


Intuitive understanding of backpropagation


Notice that backpropagation is a beautifully local process. Every gate in a circuit diagram gets
some inputs and can right away compute two things: 1. its output value and 2. the local gradient
of its output with respect to its inputs. Notice that the gates can do this completely independently
without being aware of any of the details of the full circuit that they are embedded in. However,
once the forward pass is over, during backpropagation the gate will eventually learn about the
gradient of its output value on the final output of the entire circuit. Chain rule says that the gate
should take that gradient and multiply it into every gradient it normally computes for all of its
inputs.

This extra multiplication (for each input) due to the chain rule can turn a single and relatively
useless gate into a cog in a complex circuit such as an entire neural network.
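As an illustrative sketch of this locality (not code from the course, just a toy object), a gate can be written with a forward method that caches its inputs and a backward method that multiplies the incoming gradient by its local gradients:

class MultiplyGate(object):
    """Toy gate: it knows only its own forward/backward rule, nothing about the full circuit."""
    def forward(self, x, y):
        self.x, self.y = x, y        # cache the inputs for the backward pass
        return x * y
    def backward(self, dout):
        # chain rule: local gradient times the gradient flowing in from above
        dx = self.y * dout
        dy = self.x * dout
        return dx, dy

gate = MultiplyGate()
out = gate.forward(-2.0, 5.0)        # forward: -10.0
dx, dy = gate.backward(1.0)          # backward with upstream gradient 1.0 gives (5.0, -2.0)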

Let's get an intuition for how this works by referring again to the example. The add gate received
inputs [-2, 5] and computed output 3. Since the gate is computing the addition operation, its local
gradient for both of its inputs is +1. The rest of the circuit computed the final value, which is -12.
During the backward pass in which the chain rule is applied recursively backwards through the
circuit, the add gate (which is an input to the multiply gate) learns that the gradient for its output
was -4. If we anthropomorphize the circuit as wanting to output a higher value (which can help
with intuition), then we can think of the circuit as “wanting” the output of the add gate to be lower
(due to negative sign), and with a force of 4. To continue the recurrence and to chain the gradient,
the add gate takes that gradient and multiplies it to all of the local gradients for its inputs (making
the gradient on both x and y 1 * -4 = -4). Notice that this has the desired effect: If x,y were to
decrease (responding to their negative gradient) then the add gate’s output would decrease, which
in turn makes the multiply gate’s output increase.

Backpropagation can thus be thought of as gates communicating to each other (through the
gradient signal) whether they want their outputs to increase or decrease (and how strongly), so as
to make the final output value higher.

Modularity: Sigmoid example


The gates we introduced above are relatively arbitrary. Any kind of differentiable function can act
as a gate, and we can group multiple gates into a single gate, or decompose a function into
multiple gates whenever it is convenient. Let's look at another expression that illustrates this point:

$$f(w, x) = \frac{1}{1 + e^{-(w_0 x_0 + w_1 x_1 + w_2)}}$$


as we will see later in the class, this expression describes a 2-dimensional neuron (with inputs x
and weights w) that uses the sigmoid activation function. But for now let's think of this very simply
as just a function from inputs w,x to a single number. The function is made up of multiple gates. In
addition to the ones described already above (add, mul, max), there are four more:

$$f(x) = \frac{1}{x} \quad \rightarrow \quad \frac{df}{dx} = -\frac{1}{x^2}$$
$$f_c(x) = c + x \quad \rightarrow \quad \frac{df}{dx} = 1$$
$$f(x) = e^x \quad \rightarrow \quad \frac{df}{dx} = e^x$$
$$f_a(x) = ax \quad \rightarrow \quad \frac{df}{dx} = a$$
Where the functions fc , fa translate the input by a constant of c and scale the input by a constant
of a, respectively. These are technically special cases of addition and multiplication, but we
introduce them as (new) unary gates here since we do not need the gradients for the constants
c, a. The full circuit then looks as follows:

[Figure: circuit for the 2D sigmoid neuron, chaining the gates *, +, *-1, exp, +1, and 1/x over the inputs w0, x0, w1, x1, w2, with forward values shown in green and gradients in red; see the caption below.]

Example circuit for a 2D neuron with a sigmoid activation function. The inputs are [x0,x1] and the (learnable)
weights of the neuron are [w0,w1,w2]. As we will see later, the neuron computes a dot product with the input
and then its activation is softly squashed by the sigmoid function to be in range from 0 to 1.

In the example above, we see a long chain of function applications that operates on the result of
the dot product between w,x. The function that these operations implement is called the sigmoid
function σ(x). It turns out that the derivative of the sigmoid function with respect to its input
simplifies if you perform the derivation (after a fun tricky part where we add and subtract a 1 in
the numerator):
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
$$\rightarrow \quad \frac{d\sigma(x)}{dx} = \frac{e^{-x}}{(1 + e^{-x})^2} = \left(\frac{1 + e^{-x} - 1}{1 + e^{-x}}\right)\left(\frac{1}{1 + e^{-x}}\right) = (1 - \sigma(x))\,\sigma(x)$$

As we see, the gradient turns out to simplify and becomes surprisingly simple. For example, the
sigmoid expression receives the input 1.0 and computes the output 0.73 during the forward pass.
The derivation above shows that the local gradient would simply be (1 - 0.73) * 0.73 ~= 0.2, as the
circuit computed before (see the image above), except this way it would be done with a single,
simple and efficient expression (and with less numerical issues). Therefore, in any real practical
application it would be very useful to group these operations into a single gate. Let's see the backprop for this neuron in code:

import math

w = [2,-3,-3] # assume some random weights and data
x = [-1, -2]

# forward pass
dot = w[0]*x[0] + w[1]*x[1] + w[2]
f = 1.0 / (1 + math.exp(-dot)) # sigmoid function

# backward pass through the neuron (backpropagation)
ddot = (1 - f) * f # gradient on dot variable, using the sigmoid gradient
dx = [w[0] * ddot, w[1] * ddot] # backprop into x
dw = [x[0] * ddot, x[1] * ddot, 1.0 * ddot] # backprop into w
# we're done! we have the gradients on the inputs to the circuit

Implementation protip: staged backpropagation. As shown in the code above, in practice it is


always helpful to break down the forward pass into stages that are easily backpropped through.
For example here we created an intermediate variable dot which holds the output of the dot
product between w and x . During backward pass we then successively compute (in reverse
order) the corresponding variables (e.g. ddot , and ultimately dw, dx ) that hold the gradients
of those variables.

The point of this section is that the details of how the backpropagation is performed, and which
parts of the forward function we think of as gates, is a matter of convenience. It helps to be aware
of which parts of the expression have easy local gradients, so that they can be chained together
with the least amount of code and effort.

Backprop in practice: Staged computation



Let's see this with another example. Suppose that we have a function of the form:

$$f(x, y) = \frac{x + \sigma(y)}{\sigma(x) + (x + y)^2}$$

To be clear, this function is completely useless and it’s not clear why you would ever want to
compute its gradient, except for the fact that it is a good example of backpropagation in practice.
It is very important to stress that if you were to launch into performing the differentiation with
respect to either x or y , you would end up with very large and complex expressions. However, it
turns out that doing so is completely unnecessary because we don’t need to have an explicit
function written down that evaluates the gradient. We only have to know how to compute it. Here
is how we would structure the forward pass of such expression:

import math

x = 3 # example values
y = -4

# forward pass
sigy = 1.0 / (1 + math.exp(-y)) # sigmoid in numerator #(1)
num = x + sigy # numerator #(2)
sigx = 1.0 / (1 + math.exp(-x)) # sigmoid in denominator #(3)
xpy = x + y #(4)
xpysqr = xpy**2 #(5)
den = sigx + xpysqr # denominator #(6)
invden = 1.0 / den #(7)
f = num * invden # done! #(8)

Phew, by the end of the expression we have computed the forward pass. Notice that we have
structured the code in such way that it contains multiple intermediate variables, each of which are
only simple expressions for which we already know the local gradients. Therefore, computing the
backprop pass is easy: We’ll go backwards and for every variable along the way in the forward
pass ( sigy, num, sigx, xpy, xpysqr, den, invden ) we will have the same variable, but
one that begins with a d , which will hold the gradient of the output of the circuit with respect to
that variable. Additionally, note that every single piece in our backprop will involve computing the
local gradient of that expression, and chaining it with the gradient on that expression with a
multiplication. For each row, we also highlight which part of the forward pass it refers to:

# backprop f = num * invden
dnum = invden # gradient on numerator #(8)
dinvden = num #(8)
# backprop invden = 1.0 / den
dden = (-1.0 / (den**2)) * dinvden #(7)


# backprop den = sigx + xpysqr


dsigx = (1) * dden #(6)
dxpysqr = (1) * dden #(6)
# backprop xpysqr = xpy**2
dxpy = (2 * xpy) * dxpysqr #(5)
# backprop xpy = x + y
dx = (1) * dxpy #(4)
dy = (1) * dxpy #(4)
# backprop sigx = 1.0 / (1 + math.exp(-x))
dx += ((1 - sigx) * sigx) * dsigx # Notice += !! See notes below #(3)
# backprop num = x + sigy
dx += (1) * dnum #(2)
dsigy = (1) * dnum #(2)
# backprop sigy = 1.0 / (1 + math.exp(-y))
dy += ((1 - sigy) * sigy) * dsigy #(1)
# done! phew

Notice a few things:

Cache forward pass variables. To compute the backward pass it is very helpful to have some of
the variables that were used in the forward pass. In practice you want to structure your code so
that you cache these variables, and so that they are available during backpropagation. If this is too
difficult, it is possible (but wasteful) to recompute them.

Gradients add up at forks. The forward expression involves the variables x,y multiple times, so
when we perform backpropagation we must be careful to use += instead of = to accumulate
the gradient on these variables (otherwise we would overwrite it). This follows the multivariable
chain rule in Calculus, which states that if a variable branches out to different parts of the circuit,
then the gradients that flow back to it will add.
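A minimal sketch of why the accumulation matters: in f(x) = x * x the variable x feeds both inputs of the multiply gate, and only the accumulated gradient matches the analytic answer 2x:

x = 3.0
f = x * x            # forward pass; x is used twice (a "fork")

# backward pass: the multiply gate sends a gradient to each of its two inputs,
# and because both inputs are the same variable the contributions must add up
dx = 0.0
dx += x * 1.0        # gradient through the first use of x
dx += x * 1.0        # gradient through the second use of x
print(dx)            # 6.0, matching d(x^2)/dx = 2x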

Patterns in backward flow


It is interesting to note that in many cases the backward-flowing gradient can be interpreted on an
intuitive level. For example, the three most commonly used gates in neural networks
(add,mul,max), all have very simple interpretations in terms of how they act during
backpropagation. Consider this example circuit:

An example circuit demonstrating the intuition behind the operations that backpropagation performs during the backward pass in order to compute the gradients on the inputs. Sum operation distributes gradients equally to all its inputs. Max operation routes the gradient to the higher input. Multiply gate takes the input activations, swaps them and multiplies by its gradient.


[Figure: example circuit with inputs x = 3.00, y = -4.00, z = 2.00, w = -1.00; x and y feed a * gate, z and w feed a max gate, the two results are added, and the sum is scaled by a *2 gate to produce the output -20.00. Forward values are shown in green and gradients in red.]

Looking at the diagram above as an example, we can see that:

The add gate always takes the gradient on its output and distributes it equally to all of its inputs,
regardless of what their values were during the forward pass. This follows from the fact that the
local gradient for the add operation is simply +1.0, so the gradients on all inputs will exactly equal
the gradients on the output because it will be multiplied by x1.0 (and remain unchanged). In the
example circuit above, note that the + gate routed the gradient of 2.00 to both of its inputs, equally
and unchanged.

The max gate routes the gradient. Unlike the add gate which distributed the gradient unchanged
to all its inputs, the max gate distributes the gradient (unchanged) to exactly one of its inputs (the
input that had the highest value during the forward pass). This is because the local gradient for a
max gate is 1.0 for the highest value, and 0.0 for all other values. In the example circuit above, the
max operation routed the gradient of 2.00 to the z variable, which had a higher value than w, and
the gradient on w remains zero.

The multiply gate is a little less easy to interpret. Its local gradients are the input values (except
switched), and this is multiplied by the gradient on its output during the chain rule. In the example
above, the gradient on x is -8.00, which is -4.00 x 2.00.
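Reading the circuit as f = 2 * (x*y + max(z, w)) (an inference from the figure's values, not something stated explicitly in the text), a few lines reproduce the gradients quoted above:

# assumed circuit: f = 2 * (x*y + max(z, w)) with the figure's inputs
x, y, z, w = 3.0, -4.0, 2.0, -1.0

# forward pass
mul = x * y                  # -12
mx = max(z, w)               # 2
s = mul + mx                 # -10
f = 2.0 * s                  # -20

# backward pass
ds = 2.0                     # the *2 gate scales the output gradient of 1.0
dmul = ds                    # the add gate distributes the gradient unchanged
dmx = ds
dz = dmx if z >= w else 0.0  # the max gate routes the gradient to the larger input
dw = dmx if w > z else 0.0
dx = y * dmul                # multiply gate swaps the inputs: -4 * 2 = -8
dy = x * dmul                # 3 * 2 = 6
print(dx, dy, dz, dw)        # -8.0 6.0 2.0 0.0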

Unintuitive effects and their consequences. Notice that if one of the inputs to the multiply gate is very small and the other is very big, then the multiply gate will do something slightly unintuitive: it will assign a relatively huge gradient to the small input and a tiny gradient to the large input. Note that in linear classifiers where the weights are dot producted $w^T x_i$ (multiplied) with the inputs, this implies that the scale of the data has an effect on the magnitude of the gradient for the weights. For example, if you multiplied all input data examples $x_i$ by 1000 during preprocessing, then the gradient on the weights will be 1000 times larger, and you'd have to lower the learning


rate by that factor to compensate. This is why preprocessing matters a lot, sometimes in subtle
ways! And having intuitive understanding for how the gradients flow can help you debug some of
these cases.
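A small sketch of the effect just described (the shapes are chosen arbitrarily for illustration): scaling the input data by 1000 scales the weight gradient of a linear layer by the same factor.

import numpy as np

np.random.seed(0)
X = np.random.randn(100, 10)          # a batch of 100 examples with 10 features
dscores = np.random.randn(100, 1)     # some upstream gradient on the scores

dW = X.T.dot(dscores)                 # gradient on the weights of a linear layer
dW_scaled = (1000 * X).T.dot(dscores) # the same gradient after scaling the data by 1000

print(np.abs(dW_scaled).mean() / np.abs(dW).mean())  # approximately 1000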

Gradients for vectorized operations


The above sections were concerned with single variables, but all concepts extend in a straight-
forward manner to matrix and vector operations. However, one must pay closer attention to
dimensions and transpose operations.

Matrix-Matrix multiply gradient. Possibly the most tricky operation is the matrix-matrix multiplication (which generalizes all matrix-vector and vector-vector multiply operations):

import numpy as np

# forward pass
W = np.random.randn(5, 10)
X = np.random.randn(10, 3)
D = W.dot(X)

# now suppose we had the gradient on D from above in the circuit
dD = np.random.randn(*D.shape) # same shape as D
dW = dD.dot(X.T) # .T gives the transpose of the matrix
dX = W.T.dot(dD)

Tip: use dimension analysis! Note that you do not need to remember the expressions for dW and
dX because they are easy to re-derive based on dimensions. For instance, we know that the
gradient on the weights dW must be of the same size as W after it is computed, and that it must
depend on matrix multiplication of X and dD (as is the case when both X,W are single
numbers and not matrices). There is always exactly one way of achieving this so that the
dimensions work out. For example, X is of size [10 x 3] and dD of size [5 x 3], so if we want dW to have the same shape as W, which is [5 x 10], then the only way of achieving this is with dD.dot(X.T) , as shown above.

Work with small, explicit examples. Some people may find it difficult at first to derive the gradient
updates for some vectorized expressions. Our recommendation is to explicitly write out a minimal
vectorized example, derive the gradient on paper and then generalize the pattern to its efficient,
vectorized form.
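In that spirit, here is a small sketch that checks the vectorized expression above against a numerical gradient; the toy loss L = sum(D) is chosen only to make the check concrete.

import numpy as np

np.random.seed(1)
W = np.random.randn(5, 10)
X = np.random.randn(10, 3)
dD = np.ones((5, 3))                      # gradient of L = np.sum(W.dot(X)) with respect to D

dW = dD.dot(X.T)                          # analytic gradient from dimension analysis

# numerical gradient for a single entry of W
h = 1e-5
W2 = W.copy(); W2[2, 4] += h
numeric = (np.sum(W2.dot(X)) - np.sum(W.dot(X))) / h
print(numeric, dW[2, 4])                  # the two numbers should agree closely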

Erik Learned-Miller has also written up a longer related document on taking matrix/vector
derivatives which you might find helpful. Find it here.

Summary
We developed intuition for what the gradients mean, how they flow backwards in the circuit,
and how they communicate which part of the circuit should increase or decrease and with
what force to make the final output higher.
We discussed the importance of staged computation for practical implementations of
backpropagation. You always want to break up your function into modules for which you
can easily derive local gradients, and then chain them with chain rule. Crucially, you almost
never want to write out these expressions on paper and differentiate them symbolically in
full, because you never need an explicit mathematical equation for the gradient of the input
variables. Hence, decompose your expressions into stages such that you can differentiate
every stage independently (the stages will be matrix vector multiplies, or max operations, or
sum operations, etc.) and then backprop through the variables one step at a time.

In the next section we will start to define neural networks, and backpropagation will allow us to
efficiently compute the gradient of a loss function with respect to its parameters. In other words,
we’re now ready to train neural nets, and the most conceptually difficult part of this class is behind
us! ConvNets will then be a small step away.

References
Automatic differentiation in machine learning: a survey



CS231n Convolutional Neural Networks for Visual Recognition



Table of Contents:

Quick intro without brain analogies


Modeling one neuron
Biological motivation and connections
Single neuron as a linear classifier
Commonly used activation functions
Neural Network architectures
Layer-wise organization
Example feed-forward computation
Representational power
Setting number of layers and their sizes
Summary
Additional references

Quick intro
It is possible to introduce neural networks without appealing to brain analogies. In the section on
linear classification we computed scores for different visual categories given the image using the
formula s = Wx , where W was a matrix and x was an input column vector containing all pixel data of the image. In the case of CIFAR-10, x is a [3072x1] column vector, and W is a [10x3072] matrix, so that the output is a vector of 10 class scores.

An example neural network would instead compute s = W2 max(0, W1 x) . Here, W1 could be,
for example, a [100x3072] matrix transforming the image into a 100-dimensional intermediate
vector. The function max(0, −) is a non-linearity that is applied elementwise. There are several
choices we could make for the non-linearity (which we’ll study below), but this one is a common
choice and simply thresholds all activations that are below zero to zero. Finally, the matrix W2
would then be of size [10x100], so that we again get 10 numbers out that we interpret as the class
scores. Notice that the non-linearity is critical computationally - if we left it out, the two matrices
could be collapsed to a single matrix, and therefore the predicted class scores would again be a
linear function of the input. The non-linearity is where we get the wiggle. The parameters W2 , W1


are learned with stochastic gradient descent, and their gradients are derived with chain rule (and
computed with backpropagation).

A three-layer neural network could analogously look like s = W3 max(0, W2 max(0, W1 x)),
where all of W3 , W2 , W1 are parameters to be learned. The sizes of the intermediate hidden
vectors are hyperparameters of the network and we'll see how we can set them later. Let's now
look into how we can interpret these computations from the neuron/network perspective.
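A minimal sketch of the two-layer computation with randomly initialized parameters (the shapes follow the CIFAR-10 example above; a fuller worked example with biases appears later in these notes):

import numpy as np

x = np.random.randn(3072, 1)             # an input image, flattened into a column vector
W1 = 0.01 * np.random.randn(100, 3072)   # first layer weights
W2 = 0.01 * np.random.randn(10, 100)     # second layer weights

h = np.maximum(0, W1.dot(x))             # elementwise non-linearity max(0, -)
s = W2.dot(h)                            # 10 class scores, shape (10, 1)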

Modeling one neuron


The area of Neural Networks has originally been primarily inspired by the goal of modeling
biological neural systems, but has since diverged and become a matter of engineering and
achieving good results in Machine Learning tasks. Nonetheless, we begin our discussion with a
very brief and high-level description of the biological system that a large portion of this area has
been inspired by.

Biological motivation and connections


The basic computational unit of the brain is a neuron. Approximately 86 billion neurons can be
found in the human nervous system and they are connected with approximately 10^14 - 10^15
synapses. The diagram below shows a cartoon drawing of a biological neuron (left) and a
common mathematical model (right). Each neuron receives input signals from its dendrites and
produces output signals along its (single) axon. The axon eventually branches out and connects
via synapses to dendrites of other neurons. In the computational model of a neuron, the signals
that travel along the axons (e.g. x0 ) interact multiplicatively (e.g. w0 x0 ) with the dendrites of the
other neuron based on the synaptic strength at that synapse (e.g. w0 ). The idea is that the
synaptic strengths (the weights w) are learnable and control the strength of influence (and its
direction: excitory (positive weight) or inhibitory (negative weight)) of one neuron on another. In
the basic model, the dendrites carry the signal to the cell body where they all get summed. If the
final sum is above a certain threshold, the neuron can fire, sending a spike along its axon. In the
computational model, we assume that the precise timings of the spikes do not matter, and that
only the frequency of the firing communicates information. Based on this rate code interpretation,
we model the firing rate of the neuron with an activation function f, which represents the
frequency of the spikes along the axon. Historically, a common choice of activation function is the
sigmoid function σ, since it takes a real-valued input (the signal strength after the sum) and
squashes it to range between 0 and 1. We will see details of these activation functions later in this
section.


A cartoon drawing of a biological neuron (left) and its mathematical model (right).

An example code for forward-propagating a single neuron might look as follows:

import numpy as np
import math

class Neuron(object):
  # ...
  def forward(self, inputs):
    """ assume inputs and weights are 1-D numpy arrays and bias is a number """
    cell_body_sum = np.sum(inputs * self.weights) + self.bias
    firing_rate = 1.0 / (1.0 + math.exp(-cell_body_sum)) # sigmoid activation function
    return firing_rate
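A sketch of how such a neuron might be instantiated and used; the weight and bias setup is elided in the class above, so the values below are assumed here purely for illustration (they mirror the earlier backprop example, giving a firing rate of about 0.73):

n = Neuron()                               # the class defined above (its constructor is elided there)
n.weights = np.array([2.0, -3.0])          # hypothetical weights
n.bias = -3.0                              # hypothetical bias

rate = n.forward(np.array([-1.0, -2.0]))   # dot product + bias = 1.0, then sigmoid
print(rate)                                # approximately 0.73, a number between 0 and 1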

In other words, each neuron performs a dot product with the input and its weights, adds the bias
and applies the non-linearity (or activation function), in this case the sigmoid σ(x) = 1/(1 + e−x ) .
We will go into more details about different activation functions at the end of this section.

Coarse model. It’s important to stress that this model of a biological neuron is very coarse: For
example, there are many different types of neurons, each with different properties. The dendrites
in biological neurons perform complex nonlinear computations. The synapses are not just a
single weight, they’re a complex non-linear dynamical system. The exact timing of the output
spikes in many systems is known to be important, suggesting that the rate code approximation
may not hold. Due to all these and many other simplifications, be prepared to hear groaning
sounds from anyone with some neuroscience background if you draw analogies between Neural
Networks and real brains. See this review (pdf), or more recently this review if you are interested.

Single neuron as a linear classifier


The mathematical form of the model Neuron’s forward computation might look familiar to you. As
we saw with linear classifiers, a neuron has the capacity to “like” (activation near one) or “dislike”

(activation near zero) certain linear regions of its input space. Hence, with an appropriate loss
function on the neuron’s output, we can turn a single neuron into a linear classifier:

Binary Softmax classifier. For example, we can interpret σ(∑ i wi xi + b) to be the probability of
one of the classes P(yi = 1 ∣ xi ; w). The probability of the other class would be
P(yi = 0 ∣ xi ; w) = 1 − P(yi = 1 ∣ xi ; w), since they must sum to one. With this interpretation,
we can formulate the cross-entropy loss as we have seen in the Linear Classification section, and
optimizing it would lead to a binary Softmax classifier (also known as logistic regression). Since
the sigmoid function is restricted to be between 0-1, the predictions of this classifier are based on
whether the output of the neuron is greater than 0.5.
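A minimal sketch of this interpretation (the weights, bias, and input below are made up for illustration):

import numpy as np

w = np.array([0.5, -1.2, 0.3])      # hypothetical weights of the single neuron
b = 0.1                             # hypothetical bias
x = np.array([1.0, 0.5, -2.0])      # one input example

p_class1 = 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))  # P(y = 1 | x; w)
p_class0 = 1.0 - p_class1                             # P(y = 0 | x; w)
prediction = int(p_class1 > 0.5)                      # predict class 1 if the probability exceeds 0.5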

Binary SVM classifier. Alternatively, we could attach a max-margin hinge loss to the output of the
neuron and train it to become a binary Support Vector Machine.

Regularization interpretation. The regularization loss in both SVM/Softmax cases could in this
biological view be interpreted as gradual forgetting, since it would have the effect of driving all
synaptic weights w towards zero after every parameter update.

A single neuron can be used to implement a binary classifier (e.g. binary Softmax or binary SVM
classifiers)

Commonly used activation functions


Every activation function (or non-linearity) takes a single number and performs a certain fixed
mathematical operation on it. There are several activation functions you may encounter in
practice:

Left: Sigmoid non-linearity squashes real numbers to range between [0,1] Right: The tanh non-linearity
squashes real numbers to range between [-1,1].


Sigmoid. The sigmoid non-linearity has the mathematical form σ(x) = 1/(1 + e−x ) and is shown
in the image above on the left. As alluded to in the previous section, it takes a real-valued number
and “squashes” it into range between 0 and 1. In particular, large negative numbers become 0 and
large positive numbers become 1. The sigmoid function has seen frequent use historically since it
has a nice interpretation as the firing rate of a neuron: from not firing at all (0) to fully-saturated
firing at an assumed maximum frequency (1). In practice, the sigmoid non-linearity has recently
fallen out of favor and it is rarely ever used. It has two major drawbacks:

Sigmoids saturate and kill gradients. A very undesirable property of the sigmoid neuron is
that when the neuron’s activation saturates at either tail of 0 or 1, the gradient at these
regions is almost zero. Recall that during backpropagation, this (local) gradient will be
multiplied to the gradient of this gate’s output for the whole objective. Therefore, if the local
gradient is very small, it will effectively “kill” the gradient and almost no signal will flow
through the neuron to its weights and recursively to its data. Additionally, one must pay
extra caution when initializing the weights of sigmoid neurons to prevent saturation. For
example, if the initial weights are too large then most neurons would become saturated and
the network will barely learn.
Sigmoid outputs are not zero-centered. This is undesirable since neurons in later layers of
processing in a Neural Network (more on this soon) would be receiving data that is not
zero-centered. This has implications on the dynamics during gradient descent, because if the data coming into a neuron is always positive (e.g. $x > 0$ elementwise in $f = w^T x + b$), then the gradient on the weights w will during backpropagation become either all positive, or all negative (depending on the gradient of the whole expression f ). This could
introduce undesirable zig-zagging dynamics in the gradient updates for the weights.
However, notice that once these gradients are added up across a batch of data the final
update for the weights can have variable signs, somewhat mitigating this issue. Therefore,
this is an inconvenience but it has less severe consequences compared to the saturated
activation problem above.

Tanh. The tanh non-linearity is shown on the image above on the right. It squashes a real-valued
number to the range [-1, 1]. Like the sigmoid neuron, its activations saturate, but unlike the
sigmoid neuron its output is zero-centered. Therefore, in practice the tanh non-linearity is always
preferred to the sigmoid nonlinearity. Also note that the tanh neuron is simply a scaled sigmoid
neuron, in particular the following holds: tanh(x) = 2σ(2x) − 1 .
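A quick numeric confirmation of that identity (the test points are arbitrary):

import numpy as np

x = np.linspace(-3, 3, 7)
sigma = lambda z: 1.0 / (1.0 + np.exp(-z))
print(np.allclose(np.tanh(x), 2 * sigma(2 * x) - 1))   # True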


Left: Rectified Linear Unit (ReLU) activation function, which is zero when x < 0 and then linear with slope 1
when x > 0. Right: A plot from Krizhevsky et al. (pdf) paper indicating the 6x improvement in convergence
with the ReLU unit compared to the tanh unit.

ReLU. The Rectified Linear Unit has become very popular in the last few years. It computes the
function f (x) = max(0, x). In other words, the activation is simply thresholded at zero (see
image above on the left). There are several pros and cons to using the ReLUs:

(+) It was found to greatly accelerate (e.g. a factor of 6 in Krizhevsky et al.) the convergence
of stochastic gradient descent compared to the sigmoid/tanh functions. It is argued that
this is due to its linear, non-saturating form.
(+) Compared to tanh/sigmoid neurons that involve expensive operations (exponentials,
etc.), the ReLU can be implemented by simply thresholding a matrix of activations at zero.
(-) Unfortunately, ReLU units can be fragile during training and can “die”. For example, a large
gradient flowing through a ReLU neuron could cause the weights to update in such a way
that the neuron will never activate on any datapoint again. If this happens, then the gradient
flowing through the unit will forever be zero from that point on. That is, the ReLU units can
irreversibly die during training since they can get knocked off the data manifold. For
example, you may find that as much as 40% of your network can be “dead” (i.e. neurons that
never activate across the entire training dataset) if the learning rate is set too high. With a
proper setting of the learning rate this is less frequently an issue.

Leaky ReLU. Leaky ReLUs are one attempt to fix the “dying ReLU” problem. Instead of the function
being zero when x < 0, a leaky ReLU will instead have a small positive slope (of 0.01, or so). That
is, the function computes f (x) = 𝟙(x < 0)(αx) + 𝟙(x >= 0)(x) where α is a small constant.
Some people report success with this form of activation function, but the results are not always
consistent. The slope in the negative region can also be made into a parameter of each neuron, as
seen in PReLU neurons, introduced in Delving Deep into Rectifiers, by Kaiming He et al., 2015.
However, the consistency of the benefit across tasks is presently unclear.

Maxout. Other types of units have been proposed that do not have the functional form $f(w^T x + b)$ where a non-linearity is applied on the dot product between the weights and the data. One relatively popular choice is the Maxout neuron (introduced recently by Goodfellow et al.) that generalizes the ReLU and its leaky version. The Maxout neuron computes the function $\max(w_1^T x + b_1, w_2^T x + b_2)$. Notice that both ReLU and Leaky ReLU are a special case of this form (for example, for ReLU we have $w_1, b_1 = 0$). The Maxout neuron therefore enjoys all the
benefits of a ReLU unit (linear regime of operation, no saturation) and does not have its
drawbacks (dying ReLU). However, unlike the ReLU neurons it doubles the number of parameters
for every single neuron, leading to a high total number of parameters.
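For reference, a sketch of the activations discussed above written in vectorized numpy; the value of alpha and the two Maxout weight/bias pairs are placeholders, not values from the text:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0, x)

def leaky_relu(x, alpha=0.01):
    # small positive slope alpha in the negative region instead of zero
    return np.where(x < 0, alpha * x, x)

def maxout(x, w1, b1, w2, b2):
    # elementwise max of two affine functions of the input
    return np.maximum(np.dot(w1, x) + b1, np.dot(w2, x) + b2)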

This concludes our discussion of the most common types of neurons and their activation
functions. As a last comment, it is very rare to mix and match different types of neurons in the
same network, even though there is no fundamental problem with doing so.

TLDR: “What neuron type should I use?” Use the ReLU non-linearity, be careful with your learning
rates and possibly monitor the fraction of “dead” units in a network. If this concerns you, give
Leaky ReLU or Maxout a try. Never use sigmoid. Try tanh, but expect it to work worse than
ReLU/Maxout.

Neural Network architectures

Layer-wise organization
Neural Networks as neurons in graphs. Neural Networks are modeled as collections of neurons
that are connected in an acyclic graph. In other words, the outputs of some neurons can become
inputs to other neurons. Cycles are not allowed since that would imply an infinite loop in the
forward pass of a network. Instead of an amorphous blob of connected neurons, Neural Network
models are often organized into distinct layers of neurons. For regular neural networks, the most
common layer type is the fully-connected layer in which neurons between two adjacent layers are
fully pairwise connected, but neurons within a single layer share no connections. Below are two
example Neural Network topologies that use a stack of fully-connected layers:


Left: A 2-layer Neural Network (one hidden layer of 4 neurons (or units) and one output layer with 2 neurons),
and three inputs. Right: A 3-layer neural network with three inputs, two hidden layers of 4 neurons each and
one output layer. Notice that in both cases there are connections (synapses) between neurons across layers,
but not within a layer.

Naming conventions. Notice that when we say N-layer neural network, we do not count the input
layer. Therefore, a single-layer neural network describes a network with no hidden layers (input
directly mapped to output). In that sense, you can sometimes hear people say that logistic
regression or SVMs are simply a special case of single-layer Neural Networks. You may also hear
these networks interchangeably referred to as “Artificial Neural Networks” (ANN) or “Multi-Layer
Perceptrons” (MLP). Many people do not like the analogies between Neural Networks and real
brains and prefer to refer to neurons as units.

Output layer. Unlike all layers in a Neural Network, the output layer neurons most commonly do
not have an activation function (or you can think of them as having a linear identity activation
function). This is because the last output layer is usually taken to represent the class scores (e.g.
in classification), which are arbitrary real-valued numbers, or some kind of real-valued target (e.g.
in regression).

Sizing neural networks. The two metrics that people commonly use to measure the size of neural
networks are the number of neurons, or more commonly the number of parameters. Working with
the two example networks in the above picture (a short counting sketch follows the list):

The first network (left) has 4 + 2 = 6 neurons (not counting the inputs), [3 x 4] + [4 x 2] = 20
weights and 4 + 2 = 6 biases, for a total of 26 learnable parameters.
The second network (right) has 4 + 4 + 1 = 9 neurons, [3 x 4] + [4 x 4] + [4 x 1] = 12 + 16 + 4 =
32 weights and 4 + 4 + 1 = 9 biases, for a total of 41 learnable parameters.
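The same counts can be reproduced mechanically from the layer sizes; a small sketch:

def count_params(layer_sizes):
    """layer_sizes includes the input layer, e.g. [3, 4, 2] for the left network."""
    weights = sum(n_in * n_out for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))
    biases = sum(layer_sizes[1:])
    return weights + biases

print(count_params([3, 4, 2]))     # 26 learnable parameters (left network)
print(count_params([3, 4, 4, 1]))  # 41 learnable parameters (right network)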

To give you some context, modern Convolutional Networks contain on the order of 100 million parameters and are usually made up of approximately 10-20 layers (hence deep learning). However, as we will see the number of effective connections is significantly greater due to parameter sharing. More on this in the Convolutional Neural Networks module.


Example feed-forward computation


Repeated matrix multiplications interwoven with activation function. One of the primary reasons
that Neural Networks are organized into layers is that this structure makes it very simple and
efficient to evaluate Neural Networks using matrix vector operations. Working with the example
three-layer neural network in the diagram above, the input would be a [3x1] vector. All connection
strengths for a layer can be stored in a single matrix. For example, the first hidden layer’s weights
W1 would be of size [4x3], and the biases for all units would be in the vector b1 , of size [4x1].
Here, every single neuron has its weights in a row of W1 , so the matrix vector multiplication
np.dot(W1,x) evaluates the activations of all neurons in that layer. Similarly, W2 would be a
[4x4] matrix that stores the connections of the second hidden layer, and W3 a [1x4] matrix for the
last (output) layer. The full forward pass of this 3-layer neural network is then simply three matrix
multiplications, interwoven with the application of the activation function:

# forward-pass of a 3-layer neural network:
import numpy as np
# (W1, W2, W3 and b1, b2, b3 are the network's learnable parameters, assumed to be initialized elsewhere)

f = lambda x: 1.0/(1.0 + np.exp(-x)) # activation function (use sigmoid)
x = np.random.randn(3, 1) # random input vector of three numbers (3x1)
h1 = f(np.dot(W1, x) + b1) # calculate first hidden layer activations (4x1)
h2 = f(np.dot(W2, h1) + b2) # calculate second hidden layer activations (4x1)
out = np.dot(W3, h2) + b3 # output neuron (1x1)

In the above code, W1,W2,W3,b1,b2,b3 are the learnable parameters of the network. Notice
also that instead of having a single input column vector, the variable x could hold an entire batch
of training data (where each input example would be a column of x ) and then all examples
would be efficiently evaluated in parallel. Notice that the final Neural Network layer usually doesn’t
have an activation function (e.g. it represents a (real-valued) class score in a classification
setting).

The forward pass of a fully-connected layer corresponds to one matrix multiplication followed
by a bias offset and an activation function.

Representational power
One way to look at Neural Networks with fully-connected layers is that they define a family of
functions that are parameterized by the weights of the network. A natural question that arises is:
What is the representational power of this family of functions? In particular, are there functions
that cannot be modeled with a Neural Network?

It turns out that Neural Networks with at least one hidden layer are universal approximators . That
is, it can be shown (e.g. see Approximation by Superpositions of Sigmoidal Function from 1989

(pdf), or this intuitive explanation from Michael Nielsen) that given any continuous function f (x)
and some ϵ > 0, there exists a Neural Network g(x) with one hidden layer (with a reasonable
choice of non-linearity, e.g. sigmoid) such that ∀x, ∣ f (x) − g(x) ∣< ϵ. In other words, the neural
network can approximate any continuous function.

If one hidden layer suffices to approximate any function, why use more layers and go deeper? The
answer is that the fact that a two-layer Neural Network is a universal approximator is, while
mathematically cute, a relatively weak and useless statement in practice. In one dimension, the
“sum of indicator bumps” function $g(x) = \sum_i c_i \mathbb{1}(a_i < x < b_i)$ where $a, b, c$ are parameter vectors is also a universal approximator, but no one would suggest that we use this functional
form in Machine Learning. Neural Networks work well in practice because they compactly express
nice, smooth functions that fit well with the statistical properties of data we encounter in practice,
and are also easy to learn using our optimization algorithms (e.g. gradient descent). Similarly, the
fact that deeper networks (with multiple hidden layers) can work better than single-hidden-layer networks is an empirical observation, despite the fact that their representational power is equal.

As an aside, in practice it is often the case that 3-layer neural networks will outperform 2-layer
nets, but going even deeper (4,5,6-layer) rarely helps much more. This is in stark contrast to
Convolutional Networks, where depth has been found to be an extremely important component
for a good recognition system (e.g. on order of 10 learnable layers). One argument for this
observation is that images contain hierarchical structure (e.g. faces are made up of eyes, which
are made up of edges, etc.), so several layers of processing make intuitive sense for this data
domain.

The full story is, of course, much more involved and a topic of much recent research. If you are
interested in these topics we recommend for further reading:

Deep Learning book in press by Bengio, Goodfellow, Courville, in particular Chapter 6.4.
Do Deep Nets Really Need to be Deep?
FitNets: Hints for Thin Deep Nets

Setting number of layers and their sizes


How do we decide on what architecture to use when faced with a practical problem? Should we
use no hidden layers? One hidden layer? Two hidden layers? How large should each layer be?
First, note that as we increase the size and number of layers in a Neural Network, the capacity of
the network increases. That is, the space of representable functions grows since the neurons can
collaborate to express many different functions. For example, suppose we had a binary
classification problem in two dimensions. We could train three separate neural networks, each
with one hidden layer of some size and obtain the following classifiers:


Larger Neural Networks can represent more complicated functions. The data are shown as circles colored by
their class, and the decision regions by a trained neural network are shown underneath. You can play with
these examples in this ConvNetsJS demo.

In the diagram above, we can see that Neural Networks with more neurons can express more
complicated functions. However, this is both a blessing (since we can learn to classify more
complicated data) and a curse (since it is easier to overfit the training data). Overfitting occurs
when a model with high capacity fits the noise in the data instead of the (assumed) underlying
relationship. For example, the model with 20 hidden neurons fits all the training data but at the
cost of segmenting the space into many disjoint red and green decision regions. The model with 3
hidden neurons only has the representational power to classify the data in broad strokes. It
models the data as two blobs and interprets the few red points inside the green cluster as outliers
(noise). In practice, this could lead to better generalization on the test set.

Based on our discussion above, it seems that smaller neural networks can be preferred if the data
is not complex enough to prevent overfitting. However, this is incorrect - there are many other
preferred ways to prevent overfitting in Neural Networks that we will discuss later (such as L2
regularization, dropout, input noise). In practice, it is always better to use these methods to control
overfitting instead of the number of neurons.

The subtle reason behind this is that smaller networks are harder to train with local methods such
as Gradient Descent: It’s clear that their loss functions have relatively few local minima, but it
turns out that many of these minima are easier to converge to, and that they are bad (i.e. with high
loss). Conversely, bigger neural networks contain significantly more local minima, but these
minima turn out to be much better in terms of their actual loss. Since Neural Networks are non-
convex, it is hard to study these properties mathematically, but some attempts to understand
these objective functions have been made, e.g. in a recent paper The Loss Surfaces of Multilayer
Networks. In practice, what you find is that if you train a small network the final loss can display a
good amount of variance - in some cases you get lucky and converge to a good place but in some


cases you get trapped in one of the bad minima. On the other hand, if you train a large network
you’ll start to find many different solutions, but the variance in the final achieved loss will be much
smaller. In other words, all solutions are about equally as good, and rely less on the luck of
random initialization.

To reiterate, the regularization strength is the preferred way to control the overfitting of a neural
network. We can look at the results achieved by three different settings:

The effects of regularization strength: Each neural network above has 20 hidden neurons, but changing the
regularization strength makes its final decision regions smoother with a higher regularization. You can play
with these examples in this ConvNetsJS demo.

The takeaway is that you should not be using smaller networks because you are afraid of
overfitting. Instead, you should use as big of a neural network as your computational budget
allows, and use other regularization techniques to control overfitting.

Summary
In summary,

We introduced a very coarse model of a biological neuron.


We discussed several types of activation functions that are used in practice, with ReLU
being the most common choice.
We introduced Neural Networks where neurons are connected with Fully-Connected layers
where neurons in adjacent layers have full pair-wise connections, but neurons within a layer
are not connected.
We saw that this layered architecture enables very efficient evaluation of Neural Networks
based on matrix multiplications interwoven with the application of the activation function.

We saw that Neural Networks are universal function approximators, but we also
discussed the fact that this property has little to do with their ubiquitous use. They are used
because they make certain “right” assumptions about the functional forms of functions that
come up in practice.
We discussed the fact that larger networks will always work better than smaller networks,
but their higher model capacity must be appropriately addressed with stronger
regularization (such as higher weight decay), or they might overfit. We will see more forms
of regularization (especially dropout) in later sections.

Additional References
deeplearning.net tutorial with Theano
ConvNetJS demos for intuitions
Michael Nielsen’s tutorials


Derivatives, Backpropagation, and Vectorization
Justin Johnson
September 6, 2017

1 Derivatives
1.1 Scalar Case
You are probably familiar with the concept of a derivative in the scalar case:
given a function f : R → R, the derivative of f at a point x ∈ R is defined as:

$$f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$$
Derivatives are a way to measure change. In the scalar case, the derivative
of the function f at the point x tells us how much the function f changes as the
input x changes by a small amount ε:

$$f(x + \varepsilon) \approx f(x) + \varepsilon f'(x)$$
For ease of notation we will commonly assign a name to the output of $f$, say $y = f(x)$, and write $\frac{\partial y}{\partial x}$ for the derivative of $y$ with respect to $x$. This notation emphasizes that $\frac{\partial y}{\partial x}$ is the rate of change between the variables $x$ and $y$; concretely if $x$ were to change by $\varepsilon$ then $y$ will change by approximately $\varepsilon \frac{\partial y}{\partial x}$. We can write this relationship as
$$x \to x + \Delta x \implies y \to \approx y + \frac{\partial y}{\partial x} \Delta x$$
You should read this as saying “changing $x$ to $x + \Delta x$ implies that $y$ will change to approximately $y + \Delta x \frac{\partial y}{\partial x}$”. This notation is nonstandard, but I like it since it emphasizes the relationship between changes in $x$ and changes in $y$.
The chain rule tells us how to compute the derivative of the composition of functions. In the scalar case suppose that $f, g : \mathbb{R} \to \mathbb{R}$ and $y = f(x)$, $z = g(y)$; then we can also write $z = (g \circ f)(x)$, or draw the following computational graph:
$$x \xrightarrow{\;f\;} y \xrightarrow{\;g\;} z$$
The (scalar) chain rule tells us that
$$\frac{\partial z}{\partial x} = \frac{\partial z}{\partial y} \frac{\partial y}{\partial x}$$

This equation makes intuitive sense. The derivatives $\frac{\partial z}{\partial y}$ and $\frac{\partial y}{\partial x}$ give:
$$x \to x + \Delta x \implies y \to \approx y + \frac{\partial y}{\partial x} \Delta x$$
$$y \to y + \Delta y \implies z \to \approx z + \frac{\partial z}{\partial y} \Delta y$$
Combining these two rules lets us compute the effect of $x$ on $z$: if $x$ changes by $\Delta x$ then $y$ will change by $\frac{\partial y}{\partial x} \Delta x$, so we have $\Delta y = \frac{\partial y}{\partial x} \Delta x$. If $y$ changes by $\Delta y$ then $z$ will change by $\frac{\partial z}{\partial y} \Delta y = \frac{\partial z}{\partial y} \frac{\partial y}{\partial x} \Delta x$, which is exactly what the chain rule tells us.

1.2 Gradient: Vector in, scalar out


This same intuition carries over into the vector case. Now suppose that $f : \mathbb{R}^N \to \mathbb{R}$ takes a vector as input and produces a scalar. The derivative of $f$ at the point $x \in \mathbb{R}^N$ is now called the gradient, and it is defined as:
$$\nabla_x f(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{\|h\|}$$
Now the gradient $\nabla_x f(x) \in \mathbb{R}^N$ is a vector, with the same intuition as the scalar case. If we set $y = f(x)$ then we have the relationship
$$x \to x + \Delta x \implies y \to \approx y + \frac{\partial y}{\partial x} \cdot \Delta x$$
The formula changes a bit from the scalar case to account for the fact that $x$, $\Delta x$, and $\frac{\partial y}{\partial x}$ are now vectors in $\mathbb{R}^N$ while $y$ is a scalar. In particular when multiplying $\frac{\partial y}{\partial x}$ by $\Delta x$ we use the dot product, which combines two vectors to give a scalar.
One nice outcome of this formula is that it gives meaning to the individual elements of the gradient $\frac{\partial y}{\partial x}$. Suppose that $\Delta x$ is the $i$th basis vector, so that the $i$th coordinate of $\Delta x$ is 1 and all other coordinates of $\Delta x$ are 0. Then the dot product $\frac{\partial y}{\partial x} \cdot \Delta x$ is simply the $i$th coordinate of $\frac{\partial y}{\partial x}$; thus the $i$th coordinate of $\frac{\partial y}{\partial x}$ tells us the approximate amount by which $y$ will change if we move $x$ along the $i$th coordinate axis.
This means that we can also view the gradient $\frac{\partial y}{\partial x}$ as a vector of partial derivatives:
$$\frac{\partial y}{\partial x} = \left( \frac{\partial y}{\partial x_1}, \frac{\partial y}{\partial x_2}, \ldots, \frac{\partial y}{\partial x_N} \right)$$
where $x_i$ is the $i$th coordinate of the vector $x$, which is a scalar, so each partial derivative $\frac{\partial y}{\partial x_i}$ is also a scalar.

1.3 Jacobian: Vector in, Vector out
Now suppose that $f : \mathbb{R}^N \to \mathbb{R}^M$ takes a vector as input and produces a vector as output. Then the derivative of $f$ at a point $x$, also called the Jacobian, is the $M \times N$ matrix of partial derivatives. If we again set $y = f(x)$ then we can write:
$$\frac{\partial y}{\partial x} = \begin{pmatrix} \frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_N} \\ \vdots & \ddots & \vdots \\ \frac{\partial y_M}{\partial x_1} & \cdots & \frac{\partial y_M}{\partial x_N} \end{pmatrix}$$

The Jacobian tells us the relationship between each element of $x$ and each element of $y$: the $(i, j)$-th element of $\frac{\partial y}{\partial x}$ is equal to $\frac{\partial y_i}{\partial x_j}$, so it tells us the amount by which $y_i$ will change if $x_j$ is changed by a small amount.
Just as in the previous cases, the Jacobian tells us the relationship between changes in the input and changes in the output:
$$x \to x + \Delta x \implies y \to \approx y + \frac{\partial y}{\partial x} \Delta x$$
Here $\frac{\partial y}{\partial x}$ is a $M \times N$ matrix and $\Delta x$ is an $N$-dimensional vector, so the product $\frac{\partial y}{\partial x} \Delta x$ is a matrix-vector multiplication resulting in an $M$-dimensional vector.
The chain rule can be extended to the vector case using Jacobian matrices. Suppose that $f : \mathbb{R}^N \to \mathbb{R}^M$ and $g : \mathbb{R}^M \to \mathbb{R}^K$. Let $x \in \mathbb{R}^N$, $y \in \mathbb{R}^M$, and $z \in \mathbb{R}^K$ with $y = f(x)$ and $z = g(y)$, so we have the same computational graph as the scalar case:
$$x \xrightarrow{\;f\;} y \xrightarrow{\;g\;} z$$
The chain rule also has the same form as the scalar case:
$$\frac{\partial z}{\partial x} = \frac{\partial z}{\partial y} \frac{\partial y}{\partial x}$$
However now each of these terms is a matrix: $\frac{\partial z}{\partial y}$ is a $K \times M$ matrix, $\frac{\partial y}{\partial x}$ is a $M \times N$ matrix, and $\frac{\partial z}{\partial x}$ is a $K \times N$ matrix; the multiplication of $\frac{\partial z}{\partial y}$ and $\frac{\partial y}{\partial x}$ is matrix multiplication.

1.4 Generalized Jacobian: Tensor in, Tensor out
Just as a vector is a one-dimensional list of numbers and a matrix is a two-dimensional grid of numbers, a tensor is a $D$-dimensional grid of numbers.¹
Many operations in deep learning accept tensors as inputs and produce
tensors as outputs. For example an image is usually represented as a three-
dimensional grid of numbers, where the three dimensions correspond to the
height, width, and color channels (red, green, blue) of the image. We must
therefore develop a derivative that is compatible with functions operating on
general tensors.
Suppose now that $f : \mathbb{R}^{N_1 \times \cdots \times N_{D_x}} \to \mathbb{R}^{M_1 \times \cdots \times M_{D_y}}$. Then the input to $f$ is a $D_x$-dimensional tensor of shape $N_1 \times \cdots \times N_{D_x}$, and the output of $f$ is a $D_y$-dimensional tensor of shape $M_1 \times \cdots \times M_{D_y}$. If $y = f(x)$ then the derivative $\frac{\partial y}{\partial x}$ is a generalized Jacobian, which is an object with shape
$$(M_1 \times \cdots \times M_{D_y}) \times (N_1 \times \cdots \times N_{D_x})$$


Note that we have separated the dimensions of $\frac{\partial y}{\partial x}$ into two groups: the first group matches the dimensions of $y$ and the second group matches the dimensions of $x$. With this grouping, we can think of the generalized Jacobian as a generalization of a matrix, where each “row” has the same shape as $y$ and each “column” has the same shape as $x$.
Now if we let $i \in \mathbb{Z}^{D_y}$ and $j \in \mathbb{Z}^{D_x}$ be vectors of integer indices, then we can write
$$\left( \frac{\partial y}{\partial x} \right)_{i,j} = \frac{\partial y_i}{\partial x_j}$$
In this equation note that $y_i$ and $x_j$ are scalars, so the derivative $\frac{\partial y_i}{\partial x_j}$ is also a scalar. Using this notation we see that like the standard Jacobian, the generalized Jacobian tells us the relative rates of change between all elements of $x$ and all elements of $y$.
The generalized Jacobian gives the same relationship between inputs and outputs as before:
$$x \to x + \Delta x \implies y \to \approx y + \frac{\partial y}{\partial x} \Delta x$$
The difference is that now $\Delta x$ is a tensor of shape $N_1 \times \cdots \times N_{D_x}$ and $\frac{\partial y}{\partial x}$ is a generalized matrix of shape $(M_1 \times \cdots \times M_{D_y}) \times (N_1 \times \cdots \times N_{D_x})$. The product $\frac{\partial y}{\partial x} \Delta x$ is therefore a generalized matrix-vector multiply, which results in a tensor of shape $M_1 \times \cdots \times M_{D_y}$.
The generalized matrix-vector multiply follows the same algebraic rules as a traditional matrix-vector multiply:
¹ The word tensor is used in different ways in different fields; you may have seen the term before in physics or abstract algebra. The machine learning definition of a tensor as a $D$-dimensional grid of numbers is closely related to the definitions of tensors in these other fields.

$$\left( \frac{\partial y}{\partial x} \Delta x \right)_j = \sum_i \left( \frac{\partial y}{\partial x} \right)_{i,j} (\Delta x)_i = \left( \frac{\partial y}{\partial x} \right)_{j,:} \cdot \Delta x$$
The only difference is that the indices $i$ and $j$ are not scalars; instead they are vectors of indices. In the equation above the term $\left( \frac{\partial y}{\partial x} \right)_{j,:}$ is the $j$th “row” of the generalized matrix $\frac{\partial y}{\partial x}$, which is a tensor with the same shape as $x$. We have also used the convention that the dot product between two tensors of the same shape is an elementwise product followed by a sum, identical to the dot product between vectors.
product between vectors.
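As an illustration (my own NumPy sketch, not from the notes), we can build the generalized Jacobian of a small tensor-valued function by finite differences, contract it with ∆x over the x-shaped indices, and verify that this matches the first-order change in y:

    import numpy as np

    def f(x):                    # toy map from a 2x3 input to a 2x2 output
        return x @ x.T

    x = np.random.randn(2, 3)

    # Generalized Jacobian dy/dx with shape (2 x 2) x (2 x 3), built by finite differences
    eps = 1e-6
    J = np.zeros((2, 2, 2, 3))
    for a in range(2):
        for b in range(3):
            dx = np.zeros((2, 3)); dx[a, b] = eps
            J[:, :, a, b] = (f(x + dx) - f(x - dx)) / (2 * eps)

    # Generalized matrix-vector multiply: sum over the x-shaped indices
    delta_x = 1e-3 * np.random.randn(2, 3)
    J_dx = np.einsum('ijab,ab->ij', J, delta_x)          # same shape as y

    # First-order check: f(x + dx) - f(x) should be close to (dy/dx) dx
    print(np.max(np.abs(f(x + delta_x) - f(x) - J_dx)))  # small, second order in dx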
The chain rule also looks the same in the case of tensor-valued functions.
Suppose that y = f (x) and z = g(y), where x and y have the same shapes as
above and z has shape K1 × · · · × KDz . Now the chain rule looks the same as
before:
$$\frac{\partial z}{\partial x} = \frac{\partial z}{\partial y} \frac{\partial y}{\partial x}$$
The difference is that now ∂z/∂y is a generalized matrix of shape (K1 × · · · × KDz) × (M1 × · · · × MDy), and ∂y/∂x is a generalized matrix of shape (M1 × · · · × MDy) × (N1 × · · · × NDx); the product (∂z/∂y)(∂y/∂x) is a generalized matrix-matrix multiply, resulting in an object of shape (K1 × · · · × KDz) × (N1 × · · · × NDx). Like the generalized matrix-vector multiply defined above, the generalized matrix-matrix multiply follows the same algebraic rules as the traditional matrix-matrix multiply:
$$\left(\frac{\partial z}{\partial x}\right)_{i,j} = \sum_k \left(\frac{\partial z}{\partial y}\right)_{i,k} \left(\frac{\partial y}{\partial x}\right)_{k,j} = \left(\frac{\partial z}{\partial y}\right)_{i,:} \cdot \left(\frac{\partial y}{\partial x}\right)_{:,j}$$

In this equation the indices i, j, k are vectors of indices, and the terms (∂z/∂y)_{i,:} and (∂y/∂x)_{:,j} are the ith “row” of ∂z/∂y and the jth “column” of ∂y/∂x respectively.
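One way to convince yourself of this rule (a sketch under my own choice of toy functions, not from the notes) is to flatten everything: the generalized Jacobians become ordinary 2-D matrices and the generalized multiply becomes an ordinary matrix product.

    import numpy as np

    def f(x):                      # x of shape (2, 3) -> y of shape (4,)
        return np.array([x.sum(), (x ** 2).sum(), x[0, 0] * x[1, 2], x[0, 1] - x[1, 0]])

    def g(y):                      # y of shape (4,) -> z of shape (2, 2)
        return np.array([[y[0] * y[1], y[2]],
                         [np.sin(y[3]), y[0] + y[2]]])

    def num_jacobian(fn, a, out_shape, eps=1e-6):
        """Generalized Jacobian of fn at a, shape out_shape x a.shape, by finite differences."""
        J = np.zeros(out_shape + a.shape)
        for idx in np.ndindex(*a.shape):
            da = np.zeros_like(a); da[idx] = eps
            J[(Ellipsis,) + idx] = (fn(a + da) - fn(a - da)) / (2 * eps)
        return J

    x = np.random.randn(2, 3)
    y = f(x)
    dz_dy = num_jacobian(g, y, (2, 2))                    # shape (2, 2, 4)
    dy_dx = num_jacobian(f, x, (4,))                      # shape (4, 2, 3)
    dz_dx = num_jacobian(lambda t: g(f(t)), x, (2, 2))    # shape (2, 2, 2, 3)

    # Flatten to ordinary matrices: (K x M) @ (M x N) with K = 2*2, M = 4, N = 2*3
    lhs = dz_dy.reshape(4, 4) @ dy_dx.reshape(4, 6)
    print(np.max(np.abs(lhs - dz_dx.reshape(4, 6))))      # agrees up to numerical error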

2 Backpropagation with Tensors


In the context of neural networks, a layer f is typically a function of (tensor)
inputs x and weights w; the (tensor) output of the layer is then y = f (x, w).
The layer f is typically embedded in some large neural network with scalar loss
L.
During backpropagation, we assume that we are given ∂L/∂y and our goal is to compute ∂L/∂x and ∂L/∂w. By the chain rule we know that

$$\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \frac{\partial y}{\partial x} \qquad\qquad \frac{\partial L}{\partial w} = \frac{\partial L}{\partial y} \frac{\partial y}{\partial w}$$
Therefore one way to proceed would be to form the (generalized) Jacobians ∂y/∂x and ∂y/∂w and use (generalized) matrix multiplication to compute ∂L/∂x and ∂L/∂w.
However, there’s a problem with this approach: the Jacobian matrices ∂y/∂x and ∂y/∂w are typically far too large to fit in memory.
As a concrete example, let’s suppose that f is a linear layer that takes as
input a minibatch of N vectors, each of dimension D, and produces a minibatch
of N vectors, each of dimension M . Then x is a matrix of shape N × D, w is a
matrix of shape D × M , and y = f (x, w) = xw is a matrix of shape N × M .
The Jacobian ∂y/∂x then has shape (N × M) × (N × D). In a typical neural network we might have N = 64 and M = D = 4096; then ∂y/∂x consists of
64 · 4096 · 64 · 4096 scalar values; this is more than 68 billion numbers; using
32-bit floating point, this Jacobian matrix will take 256 GB of memory to store.
Therefore it is completely hopeless to try and explicitly store and manipulate
the Jacobian matrix.
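The arithmetic behind that figure is easy to reproduce (a quick sketch):

    # Memory needed to store the Jacobian dy/dx of this linear layer explicitly
    N, M, D = 64, 4096, 4096
    num_entries = (N * M) * (N * D)      # Jacobian shape is (N x M) x (N x D)
    print(num_entries)                   # 68,719,476,736 entries
    print(4 * num_entries / 2**30)       # 256.0 GiB at 4 bytes per float32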
However it turns out that for most common neural network layers, we can derive expressions that compute the product (∂L/∂y)(∂y/∂x) without explicitly forming the Jacobian ∂y/∂x. Even better, we can typically derive this expression without even computing an explicit expression for the Jacobian ∂y/∂x; in many cases we can work out a small case on paper and then infer the general formula.
Let’s see how this works out for the case of the linear layer f (x, w) = xw.
Set N = 1, D = 2, M = 3. Then we can explicitly write


$$y = \begin{pmatrix} y_{1,1} & y_{1,2} & y_{1,3} \end{pmatrix} = xw \qquad (1)$$
$$= \begin{pmatrix} x_{1,1} & x_{1,2} \end{pmatrix} \begin{pmatrix} w_{1,1} & w_{1,2} & w_{1,3} \\ w_{2,1} & w_{2,2} & w_{2,3} \end{pmatrix} \qquad (2)$$
$$= \begin{pmatrix} x_{1,1}w_{1,1} + x_{1,2}w_{2,1} & x_{1,1}w_{1,2} + x_{1,2}w_{2,2} & x_{1,1}w_{1,3} + x_{1,2}w_{2,3} \end{pmatrix} \qquad (3)$$
During backpropagation we assume that we have access to ∂L/∂y which technically has shape (1) × (N × M); however for notational convenience we will instead think of it as a matrix of shape N × M. Then we can write

$$\frac{\partial L}{\partial y} = \begin{pmatrix} dy_{1,1} & dy_{1,2} & dy_{1,3} \end{pmatrix}$$
Our goal now is to derive an expression for ∂L/∂x in terms of x, w, and ∂L/∂y, without explicitly forming the entire Jacobian ∂y/∂x. We know that ∂L/∂x will have shape (1) × (N × D), but as is typical for representing gradients we instead view ∂L/∂x as a matrix of shape N × D. We know that each element of ∂L/∂x is a scalar giving the partial derivative of L with respect to one element of x:

$$\frac{\partial L}{\partial x} = \begin{pmatrix} \frac{\partial L}{\partial x_{1,1}} & \frac{\partial L}{\partial x_{1,2}} \end{pmatrix}$$

Thinking one element at a time, the chain rule tells us that

$$\frac{\partial L}{\partial x_{1,1}} = \frac{\partial L}{\partial y} \frac{\partial y}{\partial x_{1,1}} \qquad (4)$$
$$\frac{\partial L}{\partial x_{1,2}} = \frac{\partial L}{\partial y} \frac{\partial y}{\partial x_{1,2}} \qquad (5)$$
Viewing these derivatives as generalized matrices, ∂L/∂y has shape (1) × (N × M) and ∂y/∂x_{1,1} has shape (N × M) × (1); their product ∂L/∂x_{1,1} then has shape (1) × (1). If we instead view ∂L/∂y and ∂y/∂x_{1,1} as matrices of shape N × M, then their generalized matrix product is simply the dot product ∂L/∂y · ∂y/∂x_{1,1}.
Now we compute

$$\frac{\partial y}{\partial x_{1,1}} = \begin{pmatrix} \frac{\partial y_{1,1}}{\partial x_{1,1}} & \frac{\partial y_{1,2}}{\partial x_{1,1}} & \frac{\partial y_{1,3}}{\partial x_{1,1}} \end{pmatrix} = \begin{pmatrix} w_{1,1} & w_{1,2} & w_{1,3} \end{pmatrix} \qquad (6)$$
$$\frac{\partial y}{\partial x_{1,2}} = \begin{pmatrix} \frac{\partial y_{1,1}}{\partial x_{1,2}} & \frac{\partial y_{1,2}}{\partial x_{1,2}} & \frac{\partial y_{1,3}}{\partial x_{1,2}} \end{pmatrix} = \begin{pmatrix} w_{2,1} & w_{2,2} & w_{2,3} \end{pmatrix} \qquad (7)$$

where the final equalities come from taking the derivatives of Equation 3 with respect to x_{1,1} and x_{1,2} respectively.
We can now combine these results and write
$$\frac{\partial L}{\partial x_{1,1}} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial x_{1,1}} = dy_{1,1} w_{1,1} + dy_{1,2} w_{1,2} + dy_{1,3} w_{1,3} \qquad (8)$$
$$\frac{\partial L}{\partial x_{1,2}} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial x_{1,2}} = dy_{1,1} w_{2,1} + dy_{1,2} w_{2,2} + dy_{1,3} w_{2,3} \qquad (9)$$
This gives us our final expression for ∂L/∂x:

$$\frac{\partial L}{\partial x} = \begin{pmatrix} \frac{\partial L}{\partial x_{1,1}} & \frac{\partial L}{\partial x_{1,2}} \end{pmatrix} \qquad (10)$$
$$= \begin{pmatrix} dy_{1,1} w_{1,1} + dy_{1,2} w_{1,2} + dy_{1,3} w_{1,3} & dy_{1,1} w_{2,1} + dy_{1,2} w_{2,2} + dy_{1,3} w_{2,3} \end{pmatrix} \qquad (11)$$
$$= \frac{\partial L}{\partial y} w^T \qquad (12)$$
This final result ∂L/∂x = (∂L/∂y) w^T is very interesting because it allows us to efficiently compute ∂L/∂x without explicitly forming the Jacobian ∂y/∂x. We have only derived this formula for the specific case of N = 1, D = 2, M = 3 but it in fact holds in general.
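We can check this general claim numerically with a quick NumPy sketch (mine, not part of the notes): build the full Jacobian ∂y/∂x for a slightly larger case, contract it with ∂L/∂y, and compare against the compact formula.

    import numpy as np

    N, D, M = 4, 5, 3
    x = np.random.randn(N, D)
    w = np.random.randn(D, M)
    dL_dy = np.random.randn(N, M)                 # pretend upstream gradient

    # Explicit generalized Jacobian dy/dx, shape (N, M, N, D), by finite differences
    eps = 1e-6
    J = np.zeros((N, M, N, D))
    for a in range(N):
        for b in range(D):
            dx = np.zeros((N, D)); dx[a, b] = eps
            J[:, :, a, b] = ((x + dx) @ w - (x - dx) @ w) / (2 * eps)

    dL_dx_naive = np.einsum('nm,nmab->ab', dL_dy, J)   # contract dL/dy with the Jacobian
    dL_dx_fast = dL_dy @ w.T                           # compact formula derived above

    print(np.max(np.abs(dL_dx_naive - dL_dx_fast)))    # tiny: the formulas agree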
By a similar thought process we can derive a similar expression for ∂L/∂w without explicitly forming the Jacobian ∂y/∂w. You should try and work through this as an exercise.
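If you want to check your answer numerically, a generic finite-difference gradient checker works for any candidate formula; the sketch below (my own helper, not part of the notes) estimates ∂L/∂w for a loss whose upstream gradient is exactly ∂L/∂y.

    import numpy as np

    def numeric_grad(loss_fn, a, eps=1e-6):
        """Finite-difference estimate of the gradient of a scalar-valued loss_fn at a."""
        g = np.zeros_like(a)
        for idx in np.ndindex(*a.shape):
            da = np.zeros_like(a); da[idx] = eps
            g[idx] = (loss_fn(a + da) - loss_fn(a - da)) / (2 * eps)
        return g

    N, D, M = 4, 5, 3
    x = np.random.randn(N, D)
    w = np.random.randn(D, M)
    dL_dy = np.random.randn(N, M)

    # Any loss whose gradient with respect to y = xw equals dL_dy will do for the check
    loss = lambda w_: np.sum(dL_dy * (x @ w_))

    dL_dw_numeric = numeric_grad(loss, w)          # shape (D, M), same as w
    # Compare dL_dw_numeric against whatever expression you derive for dL/dw.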

Review of differential calculus theory

Author: Guillaume Genthial

Winter 2017

Keywords: Differential, Gradients, partial derivatives, Jacobian, chain-rule

This note is optional and is aimed at students who wish to have a deeper understanding of differential calculus. It defines and explains the links between derivatives, gradients, Jacobians, etc. First, we go through definitions and examples for f : Rn → R. Then we introduce the Jacobian and generalize to higher dimensions. Finally, we introduce the chain-rule.

1 Introduction

We use derivatives all the time, but we forget what they mean. In general, we have in mind that for a function f : R → R, we have something like

$$f(x + h) - f(x) \approx f'(x)\, h$$

Some people use different notation, especially when dealing with higher dimensions, and there usually is a lot of confusion between the following notations:

$$f'(x) \qquad \frac{df}{dx} \qquad \frac{\partial f}{\partial x} \qquad \nabla_x f$$

However, these notations refer to different mathematical objects, and the confusion can lead to mistakes. This paper recalls some notions about these objects.

Side note (scalar-product and dot-product): given two vectors a and b, the scalar-product is ⟨a|b⟩ = Σᵢ aᵢbᵢ and the dot-product is a^T · b = ⟨a|b⟩ = Σᵢ aᵢbᵢ.

2 Theory for f : Rn → R

2.1 Differential
Formal definition

Let’s consider a function f : Rn → R defined on Rn with the scalar product ⟨·|·⟩. We suppose that this function is differentiable, which means that for x ∈ Rn (fixed) and a small variation h (which can change) we can write:

$$f(x + h) = f(x) + d_x f(h) + o_{h \to 0}(h) \qquad (1)$$

and d_x f : Rn → R is a linear form, which means that ∀x, y ∈ Rn we have d_x f(x + y) = d_x f(x) + d_x f(y).

Side notes: d_x f is a linear form Rn → R; it is the best linear approximation of the function f, and d_x f is called the differential of f in x. The Landau notation o_{h→0}(h) is equivalent to the existence of a function ε(h) such that lim_{h→0} ε(h) = 0.

Example

Let f : R2 → R such that f((x1, x2)^T) = 3x1 + x2². Let’s pick (a, b)^T ∈ R2 and h = (h1, h2)^T ∈ R2. We have

$$f\begin{pmatrix} a + h_1 \\ b + h_2 \end{pmatrix} = 3(a + h_1) + (b + h_2)^2 = 3a + 3h_1 + b^2 + 2bh_2 + h_2^2 = f(a, b) + 3h_1 + 2bh_2 + o(h)$$

Then,

$$d_{\begin{pmatrix} a \\ b \end{pmatrix}} f \begin{pmatrix} h_1 \\ h_2 \end{pmatrix} = 3h_1 + 2bh_2$$

Side note: $h_2^2 = o_{h \to 0}(h)$ since it is quadratic in h.
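We can sanity-check this differential numerically (a quick Python sketch, not part of the original note):

    f = lambda x1, x2: 3 * x1 + x2 ** 2

    a, b = 1.5, -0.8
    h1, h2 = 1e-4, -2e-4

    exact_change = f(a + h1, b + h2) - f(a, b)
    differential = 3 * h1 + 2 * b * h2        # d_{(a,b)} f (h)

    print(exact_change - differential)        # equals h2**2 = 4e-08, i.e. o(h)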

2.2 Link with the gradients


Formal definition

It can be shown that for all linear forms a : Rn → R, there exists a vector u_a ∈ Rn such that ∀h ∈ Rn

$$a(h) = \langle u_a | h \rangle$$

In particular, for the differential d_x f, we can find a vector u ∈ Rn such that

$$d_x f(h) = \langle u | h \rangle$$

Side notes: for x ∈ Rn, the gradient is usually written ∇_x f ∈ Rn and has the same shape as x. The dual E* of a vector space E is isomorphic to E; see the Riesz representation theorem.


We can thus define the gradient of f in x

$$\nabla_x f := u$$

Then, as a conclusion, we can rewrite the equation of section 2.1:

$$f(x + h) = f(x) + d_x f(h) + o_{h \to 0}(h) \qquad (2)$$
$$= f(x) + \langle \nabla_x f | h \rangle + o_{h \to 0}(h) \qquad (3)$$

Side note: gradients and differentials of a function are conceptually very different. The gradient is a vector, while the differential is a function.

Example

Same example as before, f : R2 → R such that f((x1, x2)^T) = 3x1 + x2². We showed that

$$d_{\begin{pmatrix} a \\ b \end{pmatrix}} f \begin{pmatrix} h_1 \\ h_2 \end{pmatrix} = 3h_1 + 2bh_2$$

We can rewrite this as

$$d_{\begin{pmatrix} a \\ b \end{pmatrix}} f \begin{pmatrix} h_1 \\ h_2 \end{pmatrix} = \left\langle \begin{pmatrix} 3 \\ 2b \end{pmatrix} \,\Big|\, \begin{pmatrix} h_1 \\ h_2 \end{pmatrix} \right\rangle$$

and thus our gradient is

$$\nabla_{\begin{pmatrix} a \\ b \end{pmatrix}} f = \begin{pmatrix} 3 \\ 2b \end{pmatrix}$$

2.3 Partial derivatives


Formal definition

Now, let’s consider an orthonormal basis (e1, . . . , en) of Rn. Let’s define the partial derivative

$$\frac{\partial f}{\partial x_i}(x) := \lim_{h \to 0} \frac{f(x_1, \ldots, x_{i-1}, x_i + h, x_{i+1}, \ldots, x_n) - f(x_1, \ldots, x_n)}{h}$$

Note that the partial derivative ∂f/∂x_i(x) ∈ R and that it is defined with respect to the i-th component and evaluated in x.

Side notes (notation): partial derivatives are usually written ∂f/∂x, but you may also see ∂_x f or f'_x. Here ∂f/∂x_i is a function Rn → R and ∂f/∂x_i(x) ∈ R, while ∂f/∂x = (∂f/∂x1, . . . , ∂f/∂xn)^T is a function Rn → Rn and ∂f/∂x(x) = (∂f/∂x1(x), . . . , ∂f/∂xn(x))^T ∈ Rn.
Example

Same example as before, f : R2 → R such that f(x1, x2) = 3x1 + x2².

Side note: depending on the context, most people omit writing the (x) evaluation and just write ∂f/∂x ∈ Rn instead of ∂f/∂x(x).

Let’s write

$$\frac{\partial f}{\partial x_1}\begin{pmatrix} a \\ b \end{pmatrix} = \lim_{h \to 0} \frac{f\begin{pmatrix} a + h \\ b \end{pmatrix} - f\begin{pmatrix} a \\ b \end{pmatrix}}{h} = \lim_{h \to 0} \frac{3(a + h) + b^2 - (3a + b^2)}{h} = \lim_{h \to 0} \frac{3h}{h} = 3$$

In a similar way, we find that


$$\frac{\partial f}{\partial x_2}\begin{pmatrix} a \\ b \end{pmatrix} = 2b$$

2.4 Link with the partial derivatives


Formal definition

It can be shown that

$$\nabla_x f = \sum_{i=1}^{n} \frac{\partial f}{\partial x_i}(x)\, e_i = \begin{pmatrix} \frac{\partial f}{\partial x_1}(x) \\ \vdots \\ \frac{\partial f}{\partial x_n}(x) \end{pmatrix}$$

where ∂f/∂x_i(x) denotes the partial derivative of f with respect to the i-th component, evaluated in x.

Side notes: that’s why we usually write ∇_x f = ∂f/∂x(x) (same shape as x). The (e_i) form an orthonormal basis; for instance, in the canonical basis, e_i = (0, . . . , 1, . . . , 0) with 1 at index i.
Example
We showed that

$$\frac{\partial f}{\partial x_1}\begin{pmatrix} a \\ b \end{pmatrix} = 3 \qquad \text{and} \qquad \frac{\partial f}{\partial x_2}\begin{pmatrix} a \\ b \end{pmatrix} = 2b$$

and that

$$\nabla_{\begin{pmatrix} a \\ b \end{pmatrix}} f = \begin{pmatrix} 3 \\ 2b \end{pmatrix}$$
and then we verify that
$$\nabla_{\begin{pmatrix} a \\ b \end{pmatrix}} f = \begin{pmatrix} \frac{\partial f}{\partial x_1}\begin{pmatrix} a \\ b \end{pmatrix} \\[6pt] \frac{\partial f}{\partial x_2}\begin{pmatrix} a \\ b \end{pmatrix} \end{pmatrix}$$

3 Summary

Formal definition

For a function f : Rn → R, we have defined the following objects, which can be summarized in the following equation:

$$f(x + h) = f(x) + d_x f(h) + o_{h \to 0}(h) \qquad \text{(differential)}$$
$$= f(x) + \langle \nabla_x f | h \rangle + o_{h \to 0}(h) \qquad \text{(gradient)}$$
$$= f(x) + \left\langle \frac{\partial f}{\partial x}(x) \,\Big|\, h \right\rangle + o_{h \to 0}(h)$$
$$= f(x) + \left\langle \begin{pmatrix} \frac{\partial f}{\partial x_1}(x) \\ \vdots \\ \frac{\partial f}{\partial x_n}(x) \end{pmatrix} \,\Big|\, h \right\rangle + o_{h \to 0}(h) \qquad \text{(partial derivatives)}$$

Side note: recall that a^T · b = ⟨a|b⟩ = Σᵢ aᵢbᵢ.

Remark

Let’s consider x : R → R such that x(u) = u for all u. Then we can easily check that d_u x(h) = h. As this differential does not depend on u, we may simply write dx. That’s why the following expression has some meaning:

$$d_x f(\cdot) = \frac{\partial f}{\partial x}(x)\, dx(\cdot)$$

because

$$d_x f(h) = \frac{\partial f}{\partial x}(x)\, dx(h) = \frac{\partial f}{\partial x}(x)\, h$$

In higher dimensions, we write

$$d_x f = \sum_{i=1}^{n} \frac{\partial f}{\partial x_i}(x)\, dx_i$$

Side note: the dx that we use refers to the differential of u ↦ u, the identity mapping!

4 Jacobian: Generalization to f : Rn → Rm

For a function
$$f : \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix} \mapsto \begin{pmatrix} f_1(x_1, \ldots, x_n) \\ \vdots \\ f_m(x_1, \ldots, x_n) \end{pmatrix}$$

We can apply the previous section to each f_i(x):

$$f_i(x + h) = f_i(x) + d_x f_i(h) + o_{h \to 0}(h)$$
$$= f_i(x) + \langle \nabla_x f_i | h \rangle + o_{h \to 0}(h)$$
$$= f_i(x) + \left\langle \frac{\partial f_i}{\partial x}(x) \,\Big|\, h \right\rangle + o_{h \to 0}(h)$$
$$= f_i(x) + \left\langle \left( \frac{\partial f_i}{\partial x_1}(x), \ldots, \frac{\partial f_i}{\partial x_n}(x) \right)^T \,\Big|\, h \right\rangle + o_{h \to 0}(h)$$

Putting all this in the same vector yields

$$f\begin{pmatrix} x_1 + h_1 \\ \vdots \\ x_n + h_n \end{pmatrix} = f\begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix} + \begin{pmatrix} \frac{\partial f_1}{\partial x}(x)^T \cdot h \\ \vdots \\ \frac{\partial f_m}{\partial x}(x)^T \cdot h \end{pmatrix} + o(h)$$

Now, let’s define the Jacobian matrix as

$$J(x) := \begin{pmatrix} \frac{\partial f_1}{\partial x}(x)^T \\ \vdots \\ \frac{\partial f_m}{\partial x}(x)^T \end{pmatrix} = \begin{pmatrix} \frac{\partial f_1}{\partial x_1}(x) & \ldots & \frac{\partial f_1}{\partial x_n}(x) \\ \vdots & & \vdots \\ \frac{\partial f_m}{\partial x_1}(x) & \ldots & \frac{\partial f_m}{\partial x_n}(x) \end{pmatrix}$$

Side note: the Jacobian matrix has dimensions m × n and is a generalization of the gradient.

Then, we have that

$$f\begin{pmatrix} x_1 + h_1 \\ \vdots \\ x_n + h_n \end{pmatrix} = f\begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix} + \begin{pmatrix} \frac{\partial f_1}{\partial x_1}(x) & \ldots & \frac{\partial f_1}{\partial x_n}(x) \\ \vdots & & \vdots \\ \frac{\partial f_m}{\partial x_1}(x) & \ldots & \frac{\partial f_m}{\partial x_n}(x) \end{pmatrix} \cdot h + o(h) = f(x) + J(x) \cdot h + o(h)$$

Example 1: m = 1

Let’s take our first function f : R2 → R such that f((x1, x2)^T) = 3x1 + x2². Then, the Jacobian of f is

$$J(x) = \begin{pmatrix} \frac{\partial f}{\partial x_1}(x) & \frac{\partial f}{\partial x_2}(x) \end{pmatrix} = \begin{pmatrix} 3 & 2x_2 \end{pmatrix} = \begin{pmatrix} 3 \\ 2x_2 \end{pmatrix}^T = \nabla_x f(x)^T$$

Side note: in the case where m = 1, the Jacobian is a row vector (∂f1/∂x1(x) . . . ∂f1/∂xn(x)). Remember that our gradient was defined as a column vector with the same elements; we thus have J(x) = ∇_x f^T.

Example 2: g : R3 → R2

Let’s define

$$g\begin{pmatrix} y_1 \\ y_2 \\ y_3 \end{pmatrix} = \begin{pmatrix} y_1 + 2y_2 + 3y_3 \\ y_1 y_2 y_3 \end{pmatrix}$$

Then, the Jacobian of g is

$$J_g(y) = \begin{pmatrix} \frac{\partial (y_1 + 2y_2 + 3y_3)}{\partial y}(y)^T \\[6pt] \frac{\partial (y_1 y_2 y_3)}{\partial y}(y)^T \end{pmatrix} = \begin{pmatrix} \frac{\partial (y_1 + 2y_2 + 3y_3)}{\partial y_1}(y) & \frac{\partial (y_1 + 2y_2 + 3y_3)}{\partial y_2}(y) & \frac{\partial (y_1 + 2y_2 + 3y_3)}{\partial y_3}(y) \\[6pt] \frac{\partial (y_1 y_2 y_3)}{\partial y_1}(y) & \frac{\partial (y_1 y_2 y_3)}{\partial y_2}(y) & \frac{\partial (y_1 y_2 y_3)}{\partial y_3}(y) \end{pmatrix} = \begin{pmatrix} 1 & 2 & 3 \\ y_2 y_3 & y_1 y_3 & y_1 y_2 \end{pmatrix}$$
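A quick finite-difference check of this Jacobian (my own NumPy sketch, not part of the note):

    import numpy as np

    def g(y):
        return np.array([y[0] + 2 * y[1] + 3 * y[2], y[0] * y[1] * y[2]])

    def jac_g(y):
        return np.array([[1.0, 2.0, 3.0],
                         [y[1] * y[2], y[0] * y[2], y[0] * y[1]]])

    y = np.array([0.5, -1.2, 2.0])
    eps = 1e-6
    numeric = np.zeros((2, 3))
    for j in range(3):
        dy = np.zeros(3); dy[j] = eps
        numeric[:, j] = (g(y + dy) - g(y - dy)) / (2 * eps)

    print(np.max(np.abs(numeric - jac_g(y))))   # tiny: the analytic Jacobian is correct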

5 Generalization to f : Rn×p → R

If a function takes as input a matrix A ∈ Rn×p, we can transform this matrix into a vector a ∈ Rnp, such that

$$A[i, j] = a[i + nj]$$

Then, we end up with a function f̃ : Rnp → R. We can apply the results from section 3 and we obtain, for x, h ∈ Rnp corresponding to X, H ∈ Rn×p,

$$\tilde{f}(x + h) = f(x) + \langle \nabla_x f | h \rangle + o(h)$$

where $\nabla_x f = \left( \frac{\partial f}{\partial x_1}(x), \ldots, \frac{\partial f}{\partial x_{np}}(x) \right)^T$.
Now, we would like to give some meaning to the following equation:

$$f(X + H) = f(X) + \langle \nabla_X f | H \rangle + o(H)$$

Side note: the gradient of f with respect to a matrix X is a matrix of the same shape as X, defined by $(\nabla_X f)_{ij} = \frac{\partial f}{\partial X_{ij}}(X)$.

Now, you can check that if you define

$$(\nabla_X f)_{ij} = \frac{\partial f}{\partial X_{ij}}(X)$$

then these two terms are equivalent:
$$\langle \nabla_x f | h \rangle = \langle \nabla_X f | H \rangle$$
$$\sum_{i=1}^{np} \frac{\partial f}{\partial x_i}(x)\, h_i = \sum_{i,j} \frac{\partial f}{\partial X_{ij}}(X)\, H_{ij}$$

6 Generalization to f : Rn×p → Rm
Applying the same idea as before, we can write

$$f(x + h) = f(x) + J(x) \cdot h + o(h)$$

where J has dimension m × n × p and is defined as

$$J_{ijk}(x) = \frac{\partial f_i}{\partial X_{jk}}(x)$$

Writing the 2d-dot product δ = J(x) · h ∈ Rm means that the i-th component of δ is

$$\delta_i = \sum_{j=1}^{n} \sum_{k=1}^{p} \frac{\partial f_i}{\partial X_{jk}}(x)\, h_{jk}$$

Side notes: let’s generalize the generalization of the previous section; you can apply the same idea to any dimensions!
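In NumPy this 2d-dot product is a single einsum (a sketch with made-up shapes, not from the note):

    import numpy as np

    m, n, p = 2, 3, 4
    J = np.random.randn(m, n, p)        # generalized Jacobian, dimension m x n x p
    h = np.random.randn(n, p)           # perturbation of the matrix input

    delta = np.einsum('ijk,jk->i', J, h)                           # delta_i = sum_{j,k} J_ijk h_jk
    delta_check = np.array([(J[i] * h).sum() for i in range(m)])   # same contraction, written out
    print(np.max(np.abs(delta - delta_check)))                     # 0 up to floating point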

7 Chain-rule

Formal definition

Now let’s consider f : Rn → Rm and g : Rp → Rn. We want to compute the differential of the composition h = f ∘ g such that h : x ↦ u = g(x) ↦ f(g(x)) = f(u), or

$$d_x(f \circ g)$$

It can be shown that the differential is the composition of the differentials:

$$d_x(f \circ g) = d_{g(x)} f \circ d_x g$$

where ∘ is the composition operator. Here, d_{g(x)} f and d_x g are linear transformations (see section 4). Then, the resulting differential is also a linear transformation and the Jacobian is just the dot product between the Jacobians. In other words,

$$J_h(x) = J_f(g(x)) \cdot J_g(x)$$

Side note: the chain-rule is just writing the resulting Jacobian as a dot product of Jacobians. Order of the dot product is very important!

where · is the dot-product. This dot-product between two matrices can also be written component-wise:

$$J_h(x)_{ij} = \sum_{k=1}^{n} J_f(g(x))_{ik} \cdot J_g(x)_{kj}$$
Example

Let’s keep our example function f : (x1, x2)^T ↦ 3x1 + x2² and our function

$$g\begin{pmatrix} y_1 \\ y_2 \\ y_3 \end{pmatrix} = \begin{pmatrix} y_1 + 2y_2 + 3y_3 \\ y_1 y_2 y_3 \end{pmatrix}$$

The composition of f and g is h = f ∘ g : R3 → R

$$h\begin{pmatrix} y_1 \\ y_2 \\ y_3 \end{pmatrix} = f\begin{pmatrix} y_1 + 2y_2 + 3y_3 \\ y_1 y_2 y_3 \end{pmatrix} = 3(y_1 + 2y_2 + 3y_3) + (y_1 y_2 y_3)^2$$

We can compute the three components of the gradient of h with the partial derivatives

$$\frac{\partial h}{\partial y_1}(y) = 3 + 2y_1 y_2^2 y_3^2$$
$$\frac{\partial h}{\partial y_2}(y) = 6 + 2y_2 y_1^2 y_3^2$$
$$\frac{\partial h}{\partial y_3}(y) = 9 + 2y_3 y_1^2 y_2^2$$

And then our gradient is

$$\nabla_y h = \begin{pmatrix} 3 + 2y_1 y_2^2 y_3^2 \\ 6 + 2y_2 y_1^2 y_3^2 \\ 9 + 2y_3 y_1^2 y_2^2 \end{pmatrix}$$
In this process, we did not use our previous calculation, and that’s a shame. Let’s use the chain-rule to make use of it. With examples 2.2 and 4, we had

$$J_f(x) = \nabla_x f^T = \begin{pmatrix} 3 & 2x_2 \end{pmatrix}$$

Side note: for a function f : Rn → R, the Jacobian is the transpose of the gradient, ∇_x f^T = J_f(x).

We also need the Jacobian of g, which we computed in section 4:

$$J_g(y) = \begin{pmatrix} 1 & 2 & 3 \\ y_2 y_3 & y_1 y_3 & y_1 y_2 \end{pmatrix}$$

Applying the chain rule, we obtain that the Jacobian of h is the product J_f · J_g (in this order). Recall that for a function Rn → R, the Jacobian is formally the transpose of the gradient. Then,

$$J_h(y) = J_f(g(y)) \cdot J_g(y) = \nabla_{g(y)} f^T \cdot J_g(y) = \begin{pmatrix} 3 & 2y_1 y_2 y_3 \end{pmatrix} \cdot \begin{pmatrix} 1 & 2 & 3 \\ y_2 y_3 & y_1 y_3 & y_1 y_2 \end{pmatrix} = \begin{pmatrix} 3 + 2y_1 y_2^2 y_3^2 & 6 + 2y_2 y_1^2 y_3^2 & 9 + 2y_3 y_1^2 y_2^2 \end{pmatrix}$$

and taking the transpose we find the same gradient that we computed before!
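The whole computation is easy to verify numerically (a sketch of mine, not part of the note): the finite-difference gradient of h matches the chain-rule result.

    import numpy as np

    f = lambda u: 3 * u[0] + u[1] ** 2
    g = lambda y: np.array([y[0] + 2 * y[1] + 3 * y[2], y[0] * y[1] * y[2]])
    h = lambda y: f(g(y))

    grad_h = lambda y: np.array([3 + 2 * y[0] * y[1] ** 2 * y[2] ** 2,
                                 6 + 2 * y[1] * y[0] ** 2 * y[2] ** 2,
                                 9 + 2 * y[2] * y[0] ** 2 * y[1] ** 2])

    y = np.array([0.3, -1.1, 0.7])
    eps = 1e-6
    numeric = np.array([(h(y + eps * e) - h(y - eps * e)) / (2 * eps) for e in np.eye(3)])

    print(np.max(np.abs(numeric - grad_h(y))))   # tiny: same gradient as the chain rule gives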
Important remark

• The gradient is only defined for functions with values in R.

• Note that the chain rule gives us a way to compute the Jacobian and not the gradient. However, we showed that in the case of a function f : Rn → R, the Jacobian and the gradient are directly identifiable, because ∇_x f^T = J(x). Thus, if we want to compute the gradient of a function by using the chain-rule, the best way to do it is to compute the Jacobian.

• As the gradient must have the same shape as the variable against which we derive, and

  – we know that the Jacobian is the transpose of the gradient,

  – and the Jacobian is the dot product of Jacobians,

  an efficient way of computing the gradient is to find the ordering of Jacobians (or transposes of Jacobians) that yields the correct shapes!

• The notation ∂·/∂· is often ambiguous and can refer to either the gradient or the Jacobian.
