Vectorized Neural Network Gradients
Kevin Clark
1 Introduction
The purpose of these notes is to demonstrate how to quickly compute neural
network gradients in a completely vectorized way. It is complementary to the
last part of lecture 3 in CS224n 2019, which goes over the same material.
2 Vectorized Gradients
While it is a good exercise to compute the gradient of a neural network with re-
spect to a single parameter (e.g., a single element in a weight matrix), in practice
this tends to be quite slow. Instead, it is more efficient to keep everything in ma-
trix/vector form. The basic building block of vectorized gradients is the Jacobian
Matrix. Suppose we have a function f : Rn → Rm that maps a vector of length n
to a vector of length m: f (x) = [f1 (x1 , ..., xn ), f2 (x1 , ..., xn ), ..., fm (x1 , ..., xn )].
Then its Jacobian is the following m × n matrix:
         [ ∂f_1/∂x_1  · · ·  ∂f_1/∂x_n ]
∂f/∂x =  [     ⋮        ⋱        ⋮     ]
         [ ∂f_m/∂x_1  · · ·  ∂f_m/∂x_n ]
As a little illustration of this, suppose we have a function f (x) = [f1 (x), f2 (x)]
taking a scalar to a vector of size 2 and a function g(y) = [g1 (y1 , y2 ), g2 (y1 , y2 )]
taking a vector of size two to a vector of size two. Now let’s compose them to
get g(x) = [g1 (f1 (x), f2 (x)), g2 (f1 (x), f2 (x))]. Using the regular chain rule, we
can compute the derivative of g as the Jacobian
∂g/∂x = [ ∂/∂x g_1(f_1(x), f_2(x)) ]  =  [ ∂g_1/∂f_1 · ∂f_1/∂x + ∂g_1/∂f_2 · ∂f_2/∂x ]
        [ ∂/∂x g_2(f_1(x), f_2(x)) ]     [ ∂g_2/∂f_1 · ∂f_1/∂x + ∂g_2/∂f_2 · ∂f_2/∂x ]
And we see this is the same as multiplying the two Jacobians:

∂g/∂x = (∂g/∂f)(∂f/∂x) = [ ∂g_1/∂f_1   ∂g_1/∂f_2 ] [ ∂f_1/∂x ]
                         [ ∂g_2/∂f_1   ∂g_2/∂f_2 ] [ ∂f_2/∂x ]
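As a quick sanity check of this fact, we can compare the Jacobian product against finite differences for a small made-up composition (a sketch; the particular f and g below are arbitrary choices, not functions from these notes):

import numpy as np

f = lambda x: np.array([np.sin(x), x ** 2])          # f : R -> R^2
g = lambda y: np.array([y[0] * y[1], y[0] + y[1]])   # g : R^2 -> R^2

x = 0.7
J_f = np.array([[np.cos(x)], [2 * x]])               # Jacobian of f, shape (2, 1)
y = f(x)
J_g = np.array([[y[1], y[0]], [1.0, 1.0]])           # Jacobian of g at f(x), shape (2, 2)

analytic = J_g @ J_f                                 # chain rule: multiply the Jacobians
h = 1e-6
numeric = ((g(f(x + h)) - g(f(x - h))) / (2 * h)).reshape(2, 1)
print(np.allclose(analytic, numeric, atol=1e-4))     # True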
3 Useful Identities
This section will now go over how to compute the Jacobian for several simple
functions. It will provide some useful identities you can apply when taking neu-
ral network gradients.
(1) Matrix times column vector with respect to the column vector
(z = Wx, what is ∂z/∂x?)
If we write z_i = Σ_{k=1}^m W_ik x_k, then an entry (∂z/∂x)_ij of the Jacobian will be

(∂z/∂x)_ij = ∂z_i/∂x_j = ∂/∂x_j Σ_{k=1}^m W_ik x_k = Σ_{k=1}^m W_ik (∂/∂x_j) x_k = W_ij

because (∂/∂x_j) x_k = 1 if k = j and 0 otherwise. So we see that ∂z/∂x = W.
(2) Row vector times matrix with respect to the row vector
(z = xW, what is ∂z/∂x?)
A computation similar to (1) shows that ∂z/∂x = Wᵀ.
(3) A vector with itself
(z = x, what is ∂z/∂x?)
We have z_i = x_i. So

(∂z/∂x)_ij = ∂z_i/∂x_j = ∂x_i/∂x_j = { 1 if i = j
                                      { 0 otherwise

So we see that the Jacobian ∂z/∂x is a diagonal matrix where the entry at (i, i) is 1. This is just the identity matrix: ∂z/∂x = I. When applying the chain rule, this term will disappear because a matrix or vector multiplied by the identity matrix does not change.
(4) An elementwise function applied to a vector
(z = f(x), what is ∂z/∂x?)
Since f is being applied elementwise, we have z_i = f(x_i). So

(∂z/∂x)_ij = ∂z_i/∂x_j = ∂/∂x_j f(x_i) = { f′(x_i) if i = j
                                          { 0       otherwise

So we see that the Jacobian ∂z/∂x is a diagonal matrix where the entry at (i, i) is the derivative of f applied to x_i. We can write this as ∂z/∂x = diag(f′(x)). Since multiplication by a diagonal matrix is the same as doing elementwise multiplication by the diagonal, we could also write ◦ f′(x) when applying the chain rule.
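For instance, with f = ReLU (a minimal numpy sketch; the arrays are made up for illustration):

import numpy as np

x = np.array([1.0, -2.0, 3.0])
f_prime = (x > 0).astype(float)                  # ReLU'(x)
upstream = np.array([0.5, -1.0, 2.0])            # some incoming row-vector gradient

via_diag = upstream @ np.diag(f_prime)           # multiplying by diag(f'(x))
via_elementwise = upstream * f_prime             # elementwise product with f'(x)
print(np.allclose(via_diag, via_elementwise))    # True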
(5) Matrix times column vector with respect to the matrix
(z = Wx, δ = ∂J/∂z, what is ∂J/∂W = (∂J/∂z)(∂z/∂W) = δ ∂z/∂W?)
This is a bit more complicated than the other identities. The reason for including ∂J/∂z in the problem formulation will become clear in a moment.
First suppose we have a loss function J (a scalar) and are computing its
gradient with respect to a matrix W ∈ Rn×m . Then we could think of J as
a function of W taking nm inputs (the entries of W ) to a single output (J).
This means the Jacobian ∂J/∂W would be a 1 × nm vector. But in practice this is not a very useful way of arranging the gradient. It would be much nicer if the derivatives were in an n × m matrix like this:

          [ ∂J/∂W_11  · · ·  ∂J/∂W_1m ]
∂J/∂W  =  [    ⋮         ⋱       ⋮    ]
          [ ∂J/∂W_n1  · · ·  ∂J/∂W_nm ]
Since this matrix has the same shape as W , we could just subtract it (times
the learning rate) from W when doing gradient descent. So (in a slight abuse
of notation) let's find this matrix as ∂J/∂W instead.
This way of arranging the gradients becomes complicated when computing ∂z/∂W. Unlike J, z is a vector. So if we are trying to rearrange the gradients like with ∂J/∂W, ∂z/∂W would be an n × m × n tensor! Luckily, we can avoid the issue by taking the gradient with respect to a single weight W_ij instead. ∂z/∂W_ij is just a vector, which is much easier to deal with. We have
z_k = Σ_{l=1}^m W_kl x_l

∂z_k/∂W_ij = Σ_{l=1}^m x_l (∂/∂W_ij) W_kl
Note that (∂/∂W_ij) W_kl = 1 if i = k and j = l and 0 otherwise. So if k ≠ i everything in the sum is zero and the gradient is zero. Otherwise, the only nonzero element of the sum is when l = j, so we just get x_j. Thus we find ∂z_k/∂W_ij = x_j if k = i and 0 otherwise. Another way of writing this is

∂z/∂W_ij = [ 0, ..., 0, x_j, 0, ..., 0 ]ᵀ     (with x_j as the i-th element)
Now let's compute ∂J/∂W_ij:

∂J/∂W_ij = (∂J/∂z)(∂z/∂W_ij) = δ ∂z/∂W_ij = Σ_k δ_k ∂z_k/∂W_ij = δ_i x_j

(the only nonzero term in the sum is δ_i ∂z_i/∂W_ij). To get ∂J/∂W we want a matrix where entry (i, j) is δ_i x_j. This matrix is equal to the outer product

∂J/∂W = δᵀ xᵀ
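We can verify this identity numerically with a toy loss J(W) = δ(Wx) for a fixed row vector δ, so that ∂J/∂z = δ (a sketch; the sizes are arbitrary):

import numpy as np

n, m = 3, 4
W = np.random.randn(n, m)
x = np.random.randn(m)
delta = np.random.randn(n)                 # plays the role of dJ/dz
J = lambda W: delta @ (W @ x)              # toy scalar loss

analytic = np.outer(delta, x)              # delta^T x^T, shape (n, m)
numeric = np.zeros_like(W)
h = 1e-6
for i in range(n):
    for j in range(m):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += h
        Wm[i, j] -= h
        numeric[i, j] = (J(Wp) - J(Wm)) / (2 * h)
print(np.allclose(analytic, numeric, atol=1e-5))   # True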
(6) Row vector times matrix with respect to the matrix
(z = xW, δ = ∂J/∂z, what is ∂J/∂W = δ ∂z/∂W?)
A similar computation to (5) shows that ∂J/∂W = xᵀ δ.
(7) Cross-entropy loss with respect to logits
(ŷ = softmax(θ), J = CE(y, ŷ), what is ∂J/∂θ?)
The gradient is

∂J/∂θ = ŷ − y

(or (ŷ − y)ᵀ if y is a column vector).
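This identity is also easy to check numerically (a sketch; y is a one-hot label and the logits are random):

import numpy as np

def softmax(t):
    e = np.exp(t - t.max())
    return e / e.sum()

theta = np.random.randn(5)
y = np.zeros(5)
y[2] = 1.0                                        # one-hot label
J = lambda t: -np.sum(y * np.log(softmax(t)))     # cross-entropy loss

analytic = softmax(theta) - y                     # y_hat - y
h = 1e-6
numeric = np.array([(J(theta + h * np.eye(5)[i]) - J(theta - h * np.eye(5)[i])) / (2 * h)
                    for i in range(5)])
print(np.allclose(analytic, numeric, atol=1e-5))  # True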
These identities will be enough to let you quickly compute the gradients for many
neural networks. However, it’s important to know how to compute Jacobians
for other functions as well in case they show up. Some examples if you want
practice: dot product of two vectors, elementwise product of two vectors, 2-norm
of a vector. Feel free to use these identities in the assignments. One option is
just to memorize them. Another option is to figure them out by looking at the
dimensions. For example, only one ordering/orientation of δ and x will produce
the correct shape for ∂J/∂W (assuming W is not square).
4 Gradient Layout
Jacobian formulation is great for applying the chain rule: you just have to multiply the Jacobians. However, when doing SGD it's more convenient to follow the convention "the shape of the gradient equals the shape of the parameter" (as we did when computing ∂J/∂W). That way subtracting the gradient times the
learning rate from the parameters is easy. We expect answers to homework
questions to follow this convention. Therefore if you compute the gradient
of a column vector using Jacobian formulation, you should take the transpose
when reporting your final answer so the gradient is a column vector. Another
option is to always follow the convention. In this case the identities may not
work, but you can still figure out the answer by making sure the dimensions of
your derivatives match up. Up to you which of these options you choose!
5 Example: 1-Layer Neural Network
As a full example, we will compute the gradients of a one-layer neural network with a ReLU hidden layer, trained with cross-entropy loss. The forward pass of the model is:

x = input
z = Wx + b_1
h = ReLU(z)
θ = Uh + b_2
ŷ = softmax(θ)
J = CE(y, ŷ)

(Here x ∈ ℝ^{D_x}, h ∈ ℝ^{D_h}, and ŷ ∈ ℝ^{N_c}.)

In this example, we will compute all of the network's gradients:

∂J/∂U    ∂J/∂b_2    ∂J/∂W    ∂J/∂b_1    ∂J/∂x
To start with, recall that ReLU(x) = max(x, 0). This means
ReLU′(x) = { 1 if x > 0; 0 otherwise } = sgn(ReLU(x))
where sgn is the signum function. Note that we are able to write the derivative
of the activation in terms of the activation itself.
Now let's write out the chain rule for ∂J/∂U and ∂J/∂b_2:

∂J/∂U = (∂J/∂ŷ)(∂ŷ/∂θ)(∂θ/∂U)

∂J/∂b_2 = (∂J/∂ŷ)(∂ŷ/∂θ)(∂θ/∂b_2)
Notice that (∂J/∂ŷ)(∂ŷ/∂θ) = ∂J/∂θ is present in both gradients. This makes the math a bit cumbersome. Even worse, if we're implementing the model without automatic differentiation, computing ∂J/∂θ twice will be inefficient. So it will help us to define some variables to represent the intermediate derivatives:

δ_1 = ∂J/∂θ        δ_2 = ∂J/∂z
These can be thought of as the error signals passed down to θ and z when doing
backpropagation. We can compute them as follows:
δ_1 = ∂J/∂θ = (ŷ − y)ᵀ                           (this is just identity (7))

δ_2 = ∂J/∂z = (∂J/∂θ)(∂θ/∂h)(∂h/∂z)              (using the chain rule)
            = δ_1 (∂θ/∂h)(∂h/∂z)                 (substituting in δ_1)
            = δ_1 U (∂h/∂z)                      (using identity (1))
            = δ_1 U ◦ ReLU′(z)                   (using identity (4))
            = δ_1 U ◦ sgn(h)                     (we computed this earlier)
A good way of checking our work is by looking at the dimensions of the Jaco-
bians:
∂J/∂z    =    δ_1        U        ◦  sgn(h)
(1 × D_h)   (1 × N_c) (N_c × D_h)     (D_h)
We see that the dimensions of all the terms in the gradient match up (i.e., the
number of columns in a term equals the number of rows in the next term). This
will always be the case if we computed our gradients correctly.
Now we can use the error terms to compute our gradients. Note that we transpose our answers when computing the gradients for column vector terms to follow the shape convention.
∂J/∂U = (∂J/∂θ)(∂θ/∂U) = δ_1 ∂θ/∂U = δ_1ᵀ hᵀ          (using identity (5))

∂J/∂b_2 = (∂J/∂θ)(∂θ/∂b_2) = δ_1 ∂θ/∂b_2 = δ_1ᵀ        (using identity (3) and transposing)

∂J/∂W = (∂J/∂z)(∂z/∂W) = δ_2 ∂z/∂W = δ_2ᵀ xᵀ           (using identity (5))

∂J/∂b_1 = (∂J/∂z)(∂z/∂b_1) = δ_2 ∂z/∂b_1 = δ_2ᵀ        (using identity (3) and transposing)

∂J/∂x = (∂J/∂z)(∂z/∂x) = (δ_2 W)ᵀ                      (using identity (1) and transposing)
CS224n: Natural Language Processing with Deep Learning
Lecture Notes: Part III
Neural Networks, Backpropagation
Course Instructors: Christopher Manning, Richard Socher
Authors: Rohit Mundra, Amani Peddada, Richard Socher, Qiaojing Yan
Winter 2019
a = 1 / (1 + exp(−(wᵀx + b)))

(Neuron: A neuron is the fundamental building block of neural networks. We will see that a neuron can be one of many functions that allows for non-linearities to accrue in the network.)

We can also combine the weights and bias term above to equivalently formulate:

a = 1 / (1 + exp(−[wᵀ  b] · [x  1]))
a_1 = 1 / (1 + exp(−(w^(1)ᵀ x + b_1)))
  ⋮
a_m = 1 / (1 + exp(−(w^(m)ᵀ x + b_m)))

z = Wx + b

(Figure 3: This image captures how multiple sigmoid units are stacked on the right, all of which receive the same input x.)

The activations of the sigmoid function can then be written as:

[ a^(1), ..., a^(m) ]ᵀ = σ(z) = σ(Wx + b)
So what do these activations really tell us? Well, one can think
of these activations as indicators of the presence of some weighted
combination of features. We can then use a combination of these
activations to perform classification tasks.
1.4 Maximum Margin Objective Function

(Figure 4: This image captures how a simple feed-forward network might compute its output.)

Like most machine learning models, neural networks also need an optimization objective, a measure of error or goodness which we want to minimize or maximize respectively. Here, we will discuss a popular error metric known as the maximum margin objective. The idea behind using this objective is to ensure that the score computed for "true" labeled data points is higher than the score computed for "false" labeled data points.
Using the previous example, if we call the score computed for the "true" labeled window "Museums in Paris are amazing" as s and the
score computed for the "false" labeled window "Not all museums in
Paris" as sc (subscripted as c to signify that the window is "corrupt").
Then, our objective function would be to maximize (s − sc ) or to
minimize (sc − s). However, we modify our objective to ensure that
error is only computed if s_c > s ⇒ (s_c − s) > 0. The intuition behind doing this is that we only care that the "true" data point has a higher score than the "false" data point, and that the rest does not matter. Thus, we want our error to be (s_c − s) if s_c > s, else 0. Thus,
our optimization objective is now:

minimize J = max(s_c − s, 0)

However, we can make this objective more robust by requiring a margin of safety Δ > 0, so that an error is incurred whenever the corrupt window scores within Δ of the true window:

minimize J = max(Δ + s_c − s, 0)

We can scale this margin such that it is Δ = 1 and let the other parameters in the optimization problem adapt to this without any change in performance. For more information on this, read about functional and geometric margins, a topic often covered in the study of Support Vector Machines. (The max-margin objective function is most commonly associated with Support Vector Machines (SVMs).) Finally, we define the following optimization objective which we optimize over all training windows:

minimize J = max(1 + s_c − s, 0)
Parameters are then updated with (stochastic) gradient descent:

θ^(t+1) = θ^(t) − α ∇_θ^(t) J
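In code, this objective and the corresponding gradient-descent step are simply (a sketch; s and s_c would come from forward passes on the true and corrupt windows, and the names here are illustrative):

def max_margin_loss(s, s_c, delta=1.0):
    # J = max(delta + s_c - s, 0): only penalize when the corrupt window
    # scores within the margin of the true window
    return max(delta + s_c - s, 0.0)

def sgd_step(theta, grad_theta, alpha=0.01):
    # theta^(t+1) = theta^(t) - alpha * gradient of J
    return theta - alpha * grad_theta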
• Each layer (including the input and output layers) has neurons which receive an input and produce an output. The j-th neuron of layer k receives the scalar input z_j^(k) and produces the scalar activation output a_j^(k).
• We will call the backpropagated error calculated at z_j^(k) as δ_j^(k).
• Layer 1 refers to the input layer and not the first hidden layer. For the input layer, x_j = z_j^(1) = a_j^(1).
• W^(k) is the transfer matrix that maps the output from the k-th layer to the input to the (k + 1)-th. Thus, W^(1) = W and W^(2) = U to put this new generalized notation in perspective of Section 1.3.
Backpropagation Notation:
• x_i is an input to the neural network.
• s is the output of the neural network.

Let us begin: Suppose the cost J = (1 + s_c − s) is positive and we want to perform the update of parameter W_14^(1) (in Figure 5 and Figure 6). We must realize that W_14^(1) only contributes to z_1^(2) and thus a_1^(2). This fact is crucial to understanding backpropagation: backpropagated gradients are only affected by values they contribute to. a_1^(2) is consequently used in the forward computation of the score by multiplication with W_1^(2). We can see from the max-margin loss that:

∂J/∂s = −∂J/∂s_c = −1

Therefore we will work with ∂s/∂W_ij^(1) here for simplicity. Thus,
∂s/∂W_ij^(1) = W_i^(2) f′(z_i^(2)) (∂/∂W_ij^(1)) ( b_i^(1) + a_1^(1) W_i1^(1) + a_2^(1) W_i2^(1) + a_3^(1) W_i3^(1) + a_4^(1) W_i4^(1) )

             = W_i^(2) f′(z_i^(2)) (∂/∂W_ij^(1)) ( b_i^(1) + Σ_k a_k^(1) W_ik^(1) )

             = W_i^(2) f′(z_i^(2)) a_j^(1)

             = δ_i^(2) · a_j^(1)
We see above that the gradient reduces to the product δ_i^(2) · a_j^(1), where δ_i^(2) is essentially the error propagating backwards from the i-th neuron in layer 2. a_j^(1) is an input fed to the i-th neuron in layer 2 when scaled by W_ij^(1).
Let us discuss the "error sharing/distribution" interpretation of backpropagation better using Figure 6 as an example. Say we were to update W_14^(1):

(Figure 6: This subnetwork shows the relevant parts of the network required to update W_ij^(1).)

1. We start with an error signal of 1 propagating backwards from a_1^(3).
Bias Updates: Bias terms (such as b_1^(1)) are mathematically equivalent to other weights contributing to the neuron input (z_1^(2)) as long as the input being forwarded is 1. As such, the bias gradient for neuron i on layer k is simply δ_i^(k). For instance, if we were updating b_1^(1) instead of W_14^(1) above, the gradient would simply be f′(z_1^(2)) W_1^(2).
4. However, a_j^(k−1) may have been forwarded to multiple nodes in the next layer, as shown in Figure 8. It should receive responsibility for errors propagating backward from node m in layer k too, using the exact same mechanism.

5. Thus, the error received at a_j^(k−1) is δ_i^(k) W_ij^(k−1) + δ_m^(k) W_mj^(k−1).

6. In fact, we can generalize this to be Σ_i δ_i^(k) W_ij^(k−1).

7. Now that we have the correct error at a_j^(k−1), we move it across neuron j at layer k − 1 by multiplying with the local gradient f′(z_j^(k−1)).

8. Thus, the error that reaches z_j^(k−1), called δ_j^(k−1), is f′(z_j^(k−1)) Σ_i δ_i^(k) W_ij^(k−1).
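In code, this recursion from layer k back to layer k − 1 is one matrix-vector product followed by an elementwise product (a sketch; delta_k holds δ^(k), W is the transfer matrix W^(k−1) with W[i, j] connecting neuron j of layer k − 1 to neuron i of layer k, and f_prime is the derivative of the activation):

import numpy as np

def backprop_error(delta_k, W, z_prev, f_prime):
    error_at_a = W.T @ delta_k           # sum_i delta_i^(k) W_ij^(k-1) for every j
    return f_prime(z_prev) * error_at_a  # multiply by the local gradient f'(z^(k-1))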
f′(θ) ≈ ( J(θ^(i+)) − J(θ^(i−)) ) / (2e)
where e is a small number (usually around 1e−5). The term J(θ^(i+)) is simply the error calculated on a forward pass for a given input when we perturb the parameter θ's i-th element by +e. Similarly, the term J(θ^(i−)) is the error calculated on a forward pass for the same input when we perturb the parameter θ's i-th element by −e. Thus,
using two forward passes, we can approximate the gradient with
respect to any given parameter element in the model. We note that
this definition of the numerical gradient follows very naturally from
the definition of the derivative, where, in the scalar case,
f′(x) ≈ ( f(x + e) − f(x) ) / e
(Gradient checks are a great way to compare analytical and numerical gradients. Analytical gradients should be close to the numerical gradients, which can be calculated using f′(θ) ≈ (J(θ^(i+)) − J(θ^(i−))) / (2e); J(θ^(i+)) and J(θ^(i−)) can be evaluated using two forward passes. An implementation of this can be seen in Snippet 2.1.)

Of course, there is a slight difference: the definition above only perturbs x in the positive direction to compute the gradient. While it would have been perfectly acceptable to define the numerical gradient in this way, in practice it is often more precise and stable to use the centered difference formula, where we perturb a parameter in both directions. The intuition is that to get a better approximation of the derivative/slope around a point, we need to examine the function f's behavior both to the left and right of that point. It can also be shown using Taylor's theorem that the centered difference formula has an error proportional to e², which is quite small, whereas the derivative definition is more error-prone.
Now, a natural question you might ask is, if this method is so pre-
cise, why do we not use it to compute all of our network gradients
instead of applying back-propagation? The simple answer, as hinted
earlier, is inefficiency: recall that every time we want to compute the gradient with respect to an element, we need to make two forward passes through the network, which quickly becomes expensive for networks with many parameters.
Snippet 2.1
def eval_numerical_gradient(f, x):
    """
    a naive implementation of numerical gradient of f at x
    - f should be a function that takes a single argument
    - x is the point (numpy array) to evaluate the gradient at
    """
    grad, h = np.zeros_like(x), 1e-5
    it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        ix = it.multi_index; old = x[ix]
        x[ix] = old + h; fxph = f(x)               # J(theta^(i+))
        x[ix] = old - h; fxmh = f(x)               # J(theta^(i-))
        x[ix] = old; grad[ix] = (fxph - fxmh) / (2 * h); it.iternext()
    return grad
2.2 Regularization
As with many machine learning models, neural networks are highly
prone to overfitting, where a model is able to obtain near perfect per-
formance on the training dataset, but loses the ability to generalize
to unseen data. A common technique used to address overfitting (an
issue also known as the “high-variance problem”) is the incorpora-
tion of an L2 regularization penalty. The idea is that we will simply
append an extra term to our loss function J, so that the overall cost is
now calculated as:
J_R = J + λ Σ_{i=1}^L ‖W^(i)‖_F
(The Frobenius norm of a matrix U is defined as follows: ‖U‖_F = √(Σ_i Σ_j U_ij²).)

In the above formulation, ‖W^(i)‖_F is the Frobenius norm of the matrix W^(i) (the i-th weight matrix in the network) and λ is the hyper-parameter controlling how much weight the regularization term has relative to the original cost function. Since we are trying
to minimize JR , what regularization is essentially doing is penaliz-
ing weights for being too large while optimizing over the original
cost function. Due to the quadratic nature of the Frobenius norm
(which computes the sum of the squared elements of a matrix), L2 -
regularization effectively reduces the flexibility of the model and
thereby reduces the overfitting phenomenon. Imposing such a con-
straint can also be interpreted as the prior Bayesian belief that the
optimal weights are close to zero – how close depends on the value
of λ. Choosing the right value of λ is critical, and must be chosen
via hyperparameter-tuning. Too high a value of λ causes most of
the weights to be set too close to 0, and the model does not learn
anything meaningful from the training data, often obtaining poor ac-
curacy on training, validation, and testing sets. Too low a value, and
we fall into the domain of overfitting once again. It must be noted
that the bias terms are not regularized and do not contribute to the
cost term above – try thinking about why this is the case!
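As a small numpy sketch of this cost (weights is a list of the network's weight matrices and lam plays the role of λ):

import numpy as np

def regularized_cost(J, weights, lam):
    # J_R = J + lambda * sum_i ||W^(i)||_F
    return J + lam * sum(np.sqrt(np.sum(W ** 2)) for W in weights)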
There are indeed other types of regularization that are sometimes
used, such as L1 regularization, which sums over the absolute values
(rather than squares) of parameter elements – however, this is less
commonly applied in practice since it leads to sparsity of parameter
weights. In the next section, we discuss dropout, which effectively acts
as another form of regularization by randomly dropping (i.e. setting
to zero) neurons in the forward pass.
2.3 Dropout
Dropout is a powerful technique for regularization, first introduced
by Srivastava et al. in Dropout: A Simple Way to Prevent Neural Net-
works from Overfitting. The idea is simple yet effective – during train-
ing, we will randomly “drop” with some probability (1 − p) a subset
of neurons during each forward/backward pass (or equivalently,
we will keep alive each neuron with a probability p). Then, during
testing, we will use the full network to compute our predictions. The
result is that the network typically learns more meaningful informa-
tion from the data, is less likely to overfit, and usually obtains higher
performance overall on the task at hand. One intuitive reason why this technique should be so effective is that what dropout is essentially doing is training exponentially many smaller networks at once and averaging over their predictions.
In practice, the way we introduce dropout is that we take the output h of each layer of neurons, and keep each neuron with probability p, and else set it to 0. Then, during back-propagation, we only pass gradients through neurons that were kept alive during the forward pass. Finally, during testing, we compute the forward pass using all of the neurons in the network. However, a key subtlety is that in order for dropout to work effectively, the expected output of a neuron during testing should be approximately the same as its expected output during training.

(Dropout applied to an artificial neural network. Image credits to Srivastava et al.)
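A common way to implement this is "inverted dropout", where the kept activations are scaled by 1/p at training time so that their expected value matches the test-time forward pass (a sketch; p is the keep probability):

import numpy as np

def dropout_forward(h, p, train=True):
    if train:
        mask = (np.random.rand(*h.shape) < p) / p   # keep each neuron with prob p, scale by 1/p
        return h * mask, mask
    return h, None                                  # at test time, use the full network

def dropout_backward(dout, mask):
    return dout * mask                              # only pass gradients through kept neurons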
σ(z) = 1 / (1 + exp(−z))

(Figure 9: The response of a sigmoid nonlinearity.)

σ′(z) = exp(−z) / (1 + exp(−z))² = σ(z)(1 − σ(z))

tanh(z) = (exp(z) − exp(−z)) / (exp(z) + exp(−z)) = 2σ(2z) − 1

(Figure 10: The response of a tanh nonlinearity.)

where tanh(z) ∈ (−1, 1)
Hard tanh: The hard tanh function is sometimes preferred over the
tanh function since it is computationally cheaper. It does however
saturate for magnitudes of z greater than 1. The activation of the
hard tanh is:
hardtanh(z) = { −1  : z < −1
              {  z  : −1 ≤ z ≤ 1
              {  1  : z > 1

(Figure 11: The response of a hard tanh nonlinearity.)

The derivative can also be expressed in a piecewise functional form:

hardtanh′(z) = { 1 : −1 ≤ z ≤ 1
               { 0 : otherwise
Soft sign: The soft sign function is another nonlinearity which can be considered an alternative to tanh since it too does not saturate as easily as hard clipped functions:

softsign(z) = z / (1 + |z|)

(Figure 12: The response of a soft sign nonlinearity.)

The derivative is then expressed as:

softsign′(z) = 1 / (1 + |z|)²
ReLU: rect(z) = max(z, 0)

The derivative is then the piecewise function:

rect′(z) = { 1 : z > 0
           { 0 : otherwise

(Figure 13: The response of a ReLU nonlinearity.)
Leaky ReLU: leaky(z) = max(z, k · z), where 0 < k < 1

(Figure 14: The response of a leaky ReLU nonlinearity.)

This way, the derivative is representable as:

leaky′(z) = { 1 : z > 0
            { k : otherwise
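These piecewise derivatives translate directly into numpy (a sketch; k is the leaky-ReLU slope):

import numpy as np

relu          = lambda z: np.maximum(z, 0)
relu_grad     = lambda z: (z > 0).astype(float)

hardtanh      = lambda z: np.clip(z, -1.0, 1.0)
hardtanh_grad = lambda z: ((z >= -1) & (z <= 1)).astype(float)

leaky         = lambda z, k=0.01: np.maximum(z, k * z)
leaky_grad    = lambda z, k=0.01: np.where(z > 0, 1.0, k)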
Mean Subtraction
Given a set of input data X, it is customary to zero-center the data by
subtracting the mean feature vector of X from X. An important point
is that in practice, the mean is calculated only across the training set,
and this mean is subtracted from the training, validation, and testing
sets.
Normalization
Another frequently used technique (though perhaps less so than
mean subtraction) is to scale every input feature dimension to have
similar ranges of magnitudes. This is useful since input features are
often measured in different “units”, but we often want to initially
consider all features as equally important. The way we accomplish
this is by simply dividing the features by their respective standard
deviation calculated across the training set.
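For example (a sketch with made-up data; the statistics come from the training set only and are reused for the other splits):

import numpy as np

X_train = np.random.randn(100, 20) * 5 + 3   # rows are training examples
X_test  = np.random.randn(30, 20) * 5 + 3

mu = X_train.mean(axis=0)                    # mean feature vector of the training set
sigma = X_train.std(axis=0) + 1e-8           # per-feature standard deviation

X_train = (X_train - mu) / sigma
X_test  = (X_test - mu) / sigma              # test data uses the training-set statistics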
Whitening
Not as commonly used as mean-subtraction + normalization, whitening essentially converts the data to have an identity covariance matrix; that is, features become uncorrelated and have a variance
of 1. This is done by first mean-subtracting the data, as usual, to get
X 0 . We can then take the Singular Value Decomposition (SVD) of X 0
to get matrices U, S, V. We then compute UX 0 to project X 0 into the
basis defined by the columns of U. We finally divide each dimension
of the result by the corresponding singular value in S to scale our
data appropriately (if a singular value is zero, we can just divide by a
small number instead).
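One common way to implement this is to take the SVD of the covariance of the centered data (a sketch; the small constant guards against dividing by zero):

import numpy as np

X = np.random.randn(200, 10)         # made-up data: one example per row
Xc = X - X.mean(axis=0)              # mean-subtract first

cov = Xc.T @ Xc / Xc.shape[0]        # covariance matrix of the centered data
U, S, Vt = np.linalg.svd(cov)        # columns of U define the new basis

Xrot = Xc @ U                        # project the data into the basis defined by U
Xwhite = Xrot / np.sqrt(S + 1e-5)    # scale each dimension to unit variance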
W ∼ U[ −√(6 / (n^(l) + n^(l+1))),  √(6 / (n^(l) + n^(l+1))) ]

where n^(l) is the number of input units to W (fan-in) and n^(l+1)
is the number of output units from W (fan-out). In this parameter
initialization scheme, bias units are initialized to 0. This approach
attempts to maintain activation variances as well as backpropagated
gradient variances across layers. Without such initialization, the
gradient variances (which are a proxy for information) generally
decrease with backpropagation across layers.
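A sketch of this initialization for one layer (fan_in = n^(l), fan_out = n^(l+1); the uniform-interval form is the usual "Xavier" choice):

import numpy as np

def xavier_init(fan_in, fan_out):
    bound = np.sqrt(6.0 / (fan_in + fan_out))
    W = np.random.uniform(-bound, bound, size=(fan_out, fan_in))
    b = np.zeros(fan_out)            # bias units are initialized to 0
    return W, b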
You might think that for fast convergence rates, we should set α to larger values; however, faster convergence is not guaranteed with larger learning rates. In fact, with very large learning rates, we
might experience that the loss function actually diverges because the
parameters update causes the model to overshoot the convex minima
as shown in Figure 15. In non-convex models (most of those we work
with), the outcome of a large learning rate is unpredictable, but the
chances of diverging loss functions are very high.
The simple solution to avoiding a diverging loss is to use a very small learning rate so that we carefully scan the parameter space; of course, if we use too small a learning rate, we might not converge in a reasonable amount of time, or might get caught in local minima.

(Figure 15: Here we see that updating parameter w2 with a large learning rate can lead to divergence of the error.)
Thus, as with any other hyperparameter, the learning rate must be
tuned effectively.
Since training is the most expensive phase in a deep learning system, some research has attempted to improve this naive approach to setting learning rates. For instance, Ronan Collobert scales the learning rate of a weight W_ij (where W ∈ ℝ^(n^(l+1) × n^(l))) by the inverse square root of the fan-in of the neuron (n^(l)).
α(t) = α_0 τ / max(t, τ)
In the above scheme, α0 is a tunable parameter and represents the
starting learning rate. τ is also a tunable parameter and represents
the time at which the learning rate should start reducing. In practice,
this method has been found to work quite well. In the next section
we discuss another method for adaptive gradient descent which does
not require hand-set learning rates.
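In code, the schedule above is simply (a sketch):

def annealed_lr(t, alpha0, tau):
    # alpha(t) = alpha0 * tau / max(t, tau): constant until step tau, then decays like 1/t
    return alpha0 * tau / max(t, tau)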
Snippet 2.2
# Computes a standard momentum update
# on parameters x
v = mu*v - alpha*grad_x
x += v
θ_{t,i} = θ_{t−1,i} − ( α / √(Σ_{τ=1}^t g_{τ,i}²) ) · g_{t,i}     where g_{t,i} = ∂J_t(θ)/∂θ_i
Snippet 2.3
# Assume the gradient dx and parameter vector x
cache += dx**2
x += - learning_rate * dx / np.sqrt(cache + 1e-8)
Snippet 2.4
# Update rule for RMS prop
cache = decay_rate * cache + (1 - decay_rate) * dx**2
x += - learning_rate * dx / (np.sqrt(cache) + eps)
Snippet 2.5
# Update rule for Adam
m = beta1*m + (1-beta1)*dx
v = beta2*v + (1-beta2)*(dx**2)
x += - learning_rate * m / (np.sqrt(v) + eps)
Table of Contents:
Introduction
Simple expressions, interpreting the gradient
Compound expressions, chain rule, backpropagation
Intuitive understanding of backpropagation
Modularity: Sigmoid example
Backprop in practice: Staged computation
Patterns in backward flow
Gradients for vectorized operations
Summary
Introduction
Motivation. In this section we will develop expertise with an intuitive understanding of
backpropagation, which is a way of computing gradients of expressions through recursive
application of chain rule. Understanding of this process and its subtleties is critical for you to
understand, and effectively develop, design and debug neural networks.
Problem statement. The core problem studied in this section is as follows: We are given some
function f (x) where x is a vector of inputs and we are interested in computing the gradient of f
at x (i.e. ∇f (x) ).
Motivation. Recall that the primary reason we are interested in this problem is that in the specific
case of neural networks, f will correspond to the loss function ( L ) and the inputs x will consist
of the training data and the neural network weights. For example, the loss could be the SVM loss
function and the inputs are both the training data (xi , yi ), i = 1 … N and the weights and biases
W, b. Note that (as is usually the case in Machine Learning) we think of the training data as given
and fixed, and of the weights as variables we have control over. Hence, even though we can easily
use backpropagation to compute the gradient on the input examples xi , in practice we usually
only compute the gradient for the parameters (e.g. W, b) so that we can use it to perform a parameter update. However, as we will see later in the class the gradient on x_i can still be useful
sometimes, for example for purposes of visualization and interpreting what the Neural Network
might be doing.
If you are coming to this class and you’re comfortable with deriving gradients with chain rule, we
would still like to encourage you to at least skim this section, since it presents a rarely developed
view of backpropagation as backward flow in real-valued circuits and any insights you’ll gain may
help you throughout the class.
f(x, y) = x y   →   ∂f/∂x = y,   ∂f/∂y = x
Interpretation. Keep in mind what the derivatives tell you: They indicate the rate of change of a
function with respect to that variable surrounding an infinitesimally small region near a particular
point:
df(x)/dx = lim_{h→0} ( f(x + h) − f(x) ) / h
A technical note is that the division sign on the left-hand side is, unlike the division sign on the right-hand side, not a division. Instead, this notation indicates that the operator d/dx is being applied to the function f, and returns a different function (the derivative). A nice way to think about the expression above is that when h is very small, then the function is well-approximated by a straight line, and the derivative is its slope. In other words, the derivative on each variable tells you the sensitivity of the whole expression on its value. For example, if x = 4, y = −3 then f(x, y) = −12 and the derivative on x is ∂f/∂x = −3. This tells us that if we were to increase the value of this variable by a tiny amount, the effect on the whole expression would be to decrease it (due to the negative sign), and by three times that amount. This can be seen by rearranging the above equation ( f(x + h) = f(x) + h (df(x)/dx) ). Analogously, since ∂f/∂y = 4, we expect that increasing the value of y by some very small amount h would also increase the output of the function (due to the positive sign), and by 4h.
The derivative on each variable tells you the sensitivity of the whole expression on its value.
f(x, y) = x + y   →   ∂f/∂x = 1,   ∂f/∂y = 1

that is, the derivative on both x, y is one regardless of what the values of x, y are. This makes
sense, since increasing either x, y would increase the output of f , and the rate of that increase
would be independent of what the actual values of x, y are (unlike the case of multiplication
above). The last function we’ll use quite a bit in the class is the max operation:
f(x, y) = max(x, y)   →   ∂f/∂x = 𝟙(x >= y),   ∂f/∂y = 𝟙(y >= x)
That is, the (sub)gradient is 1 on the input that was larger and 0 on the other input. Intuitively, if
the inputs are x = 4, y = 2, then the max is 4, and the function is not sensitive to the setting of y .
That is, if we were to increase it by a tiny amount h , the function would keep outputting 4, and
therefore the gradient is zero: there is no effect. Of course, if we were to change y by a large
amount (e.g. larger than 2), then the value of f would change, but the derivatives tell us nothing
about the effect of such large changes on the inputs of a function; They are only informative for
tiny, infinitesimally small changes on the inputs, as indicated by the limh→0 in its definition.
example, ∂f/∂x = (∂f/∂q)(∂q/∂x). In practice this is simply a multiplication of the two numbers that hold the two gradients. Lets see this with an example:
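The example referred to here is the small circuit f(x, y, z) = (x + y) z evaluated at x = −2, y = 5, z = −4 (these values match the add/multiply gate discussion below); a sketch of its forward and backward pass:

# set some inputs
x, y, z = -2, 5, -4

# forward pass
q = x + y    # q becomes 3
f = q * z    # f becomes -12

# backward pass (backpropagation) in reverse order:
# first backprop through f = q * z
dfdz = q           # df/dz = q, so the gradient on z becomes 3
dfdq = z           # df/dq = z, so the gradient on q becomes -4
# now backprop through q = x + y
dfdx = 1.0 * dfdq  # dq/dx = 1; the multiplication here is the chain rule!
dfdy = 1.0 * dfdq  # dq/dy = 1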
We are left with the gradient in the variables [dfdx,dfdy,dfdz], which tell us the sensitivity of the variables x,y,z on f! This is the simplest example of backpropagation. Going forward, we
will use a more concise notation that omits the df prefix. For example, we will simply write dq
instead of dfdq , and always assume that the gradient is computed on the final output.
This extra multiplication (for each input) due to the chain rule can turn a single and relatively
useless gate into a cog in a complex circuit such as an entire neural network.
Lets get an intuition for how this works by referring again to the example. The add gate received
inputs [-2, 5] and computed output 3. Since the gate is computing the addition operation, its local
gradient for both of its inputs is +1. The rest of the circuit computed the final value, which is -12.
During the backward pass in which the chain rule is applied recursively backwards through the
circuit, the add gate (which is an input to the multiply gate) learns that the gradient for its output
was -4. If we anthropomorphize the circuit as wanting to output a higher value (which can help
with intuition), then we can think of the circuit as “wanting” the output of the add gate to be lower
(due to negative sign), and with a force of 4. To continue the recurrence and to chain the gradient,
the add gate takes that gradient and multiplies it to all of the local gradients for its inputs (making
the gradient on both x and y 1 * -4 = -4). Notice that this has the desired effect: If x,y were to
decrease (responding to their negative gradient) then the add gate’s output would decrease, which
in turn makes the multiply gate’s output increase.
Backpropagation can thus be thought of as gates communicating to each other (through the
gradient signal) whether they want their outputs to increase or decrease (and how strongly), so as
to make the final output value higher.
f(w, x) = 1 / (1 + e^(−(w_0 x_0 + w_1 x_1 + w_2)))
as we will see later in the class, this expression describes a 2-dimensional neuron (with inputs x
and weights w) that uses the sigmoid activation function. But for now lets think of this very simply
as just a function from inputs w,x to a single number. The function is made up of multiple gates. In
addition to the ones described already above (add, mul, max), there are four more:
f(x) = 1/x        →   df/dx = −1/x²

f_c(x) = c + x    →   df/dx = 1

f(x) = eˣ         →   df/dx = eˣ

f_a(x) = a x      →   df/dx = a
Where the functions fc , fa translate the input by a constant of c and scale the input by a constant
of a, respectively. These are technically special cases of addition and multiplication, but we
introduce them as (new) unary gates here since we do not need the gradients for the constants
c, a. The full circuit then looks as follows:
(Example circuit for a 2D neuron with a sigmoid activation function. The inputs are [x0,x1] and the (learnable) weights of the neuron are [w0,w1,w2]; in the pictured example, w0 = 2, w1 = −3, w2 = −3 and x0 = −1, x1 = −2. As we will see later, the neuron computes a dot product with the input and then its activation is softly squashed by the sigmoid function to be in range from 0 to 1.)
In the example above, we see a long chain of function applications that operates on the result of
the dot product between w,x. The function that these operations implement is called the sigmoid
function σ(x). It turns out that the derivative of the sigmoid function with respect to its input
simplifies if you perform the derivation (after a fun tricky part where we add and subtract a 1 in
the numerator):
σ(x) = 1 / (1 + e^(−x))

dσ(x)/dx = e^(−x) / (1 + e^(−x))² = ( (1 + e^(−x) − 1) / (1 + e^(−x)) ) · ( 1 / (1 + e^(−x)) ) = (1 − σ(x)) σ(x)
As we see, the gradient turns out to simplify and becomes surprisingly simple. For example, the
sigmoid expression receives the input 1.0 and computes the output 0.73 during the forward pass.
The derivation above shows that the local gradient would simply be (1 - 0.73) * 0.73 ~= 0.2, as the
circuit computed before (see the image above), except this way it would be done with a single,
simple and efficient expression (and with less numerical issues). Therefore, in any real practical
application it would be very useful to group these operations into a single gate. Lets see the
backprop for this neuron in code:
# forward pass
dot = w[0]*x[0] + w[1]*x[1] + w[2]
f = 1.0 / (1 + math.exp(-dot)) # sigmoid function
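The backward pass can then reuse the simplified sigmoid gradient derived above; a self-contained sketch (the weights and inputs are the values from the example circuit):

import math

w = [2, -3, -3]   # example weights and inputs, matching the circuit above
x = [-1, -2]

# forward pass
dot = w[0]*x[0] + w[1]*x[1] + w[2]
f = 1.0 / (1 + math.exp(-dot))      # sigmoid function, f ~= 0.73

# backward pass through the neuron, using d(sigma)/d(dot) = (1 - f) * f
ddot = (1 - f) * f                            # gradient on the dot product, ~0.2
dx = [w[0] * ddot, w[1] * ddot]               # backprop into x
dw = [x[0] * ddot, x[1] * ddot, 1.0 * ddot]   # backprop into w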
The point of this section is that the details of how the backpropagation is performed, and which
parts of the forward function we think of as gates, is a matter of convenience. It helps to be aware
of which parts of the expression have easy local gradients, so that they can be chained together
with the least amount of code and effort.
Lets see this with another example. Suppose that we have a function of the form:
f(x, y) = ( x + σ(y) ) / ( σ(x) + (x + y)² )
To be clear, this function is completely useless and it’s not clear why you would ever want to
compute its gradient, except for the fact that it is a good example of backpropagation in practice.
It is very important to stress that if you were to launch into performing the differentiation with
respect to either x or y , you would end up with very large and complex expressions. However, it
turns out that doing so is completely unnecessary because we don’t need to have an explicit
function written down that evaluates the gradient. We only have to know how to compute it. Here
is how we would structure the forward pass of such expression:
import math

x = 3 # example values
y = -4
# forward pass
sigy = 1.0 / (1 + math.exp(-y)) # sigmoid in numerator #(1)
num = x + sigy # numerator #(2)
sigx = 1.0 / (1 + math.exp(-x)) # sigmoid in denominator #(3)
xpy = x + y #(4)
xpysqr = xpy**2 #(5)
den = sigx + xpysqr # denominator #(6)
invden = 1.0 / den #(7)
f = num * invden # done! #(8)
Phew, by the end of the expression we have computed the forward pass. Notice that we have
structured the code in such way that it contains multiple intermediate variables, each of which are
only simple expressions for which we already know the local gradients. Therefore, computing the
backprop pass is easy: We’ll go backwards and for every variable along the way in the forward
pass ( sigy, num, sigx, xpy, xpysqr, den, invden ) we will have the same variable, but
one that begins with a d , which will hold the gradient of the output of the circuit with respect to
that variable. Additionally, note that every single piece in our backprop will involve computing the
local gradient of that expression, and chaining it with the gradient on that expression with a
multiplication. For each row, we also highlight which part of the forward pass it refers to:
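A sketch of that backward pass, continuing the forward code above and stepping through the intermediate variables in reverse (the #(n) comments refer to the numbered lines of the forward pass):

# backprop f = num * invden
dnum = invden                                                     #(8)
dinvden = num                                                     #(8)
# backprop invden = 1.0 / den
dden = (-1.0 / (den ** 2)) * dinvden                              #(7)
# backprop den = sigx + xpysqr
dsigx = (1) * dden                                                #(6)
dxpysqr = (1) * dden                                              #(6)
# backprop xpysqr = xpy**2
dxpy = (2 * xpy) * dxpysqr                                        #(5)
# backprop xpy = x + y
dx = (1) * dxpy                                                   #(4)
dy = (1) * dxpy                                                   #(4)
# backprop sigx = 1.0 / (1 + math.exp(-x))
dx += ((1 - sigx) * sigx) * dsigx   # notice +=, gradients add at forks   #(3)
# backprop num = x + sigy
dx += (1) * dnum                                                  #(2)
dsigy = (1) * dnum                                                #(2)
# backprop sigy = 1.0 / (1 + math.exp(-y))
dy += ((1 - sigy) * sigy) * dsigy                                 #(1)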
Cache forward pass variables. To compute the backward pass it is very helpful to have some of
the variables that were used in the forward pass. In practice you want to structure your code so
that you cache these variables, and so that they are available during backpropagation. If this is too
difficult, it is possible (but wasteful) to recompute them.
Gradients add up at forks. The forward expression involves the variables x,y multiple times, so
when we perform backpropagation we must be careful to use += instead of = to accumulate
the gradient on these variables (otherwise we would overwrite it). This follows the multivariable
chain rule in Calculus, which states that if a variable branches out to different parts of the circuit,
then the gradients that flow back to it will add.
(An example circuit demonstrating the intuition behind the operations that backpropagation performs during the backward pass in order to compute the gradients on the inputs. Sum operation distributes gradients equally to all its inputs. Max operation routes the gradient to the higher input. Multiply gate takes the input activations, swaps them and multiplies by its gradient.)
(Example circuit with inputs x = 3.00, y = −4.00, z = 2.00, w = −1.00: the forward pass computes (x · y + max(z, w)) · 2 = −20.00, and the backward pass produces gradients of −8.00 on x, 6.00 on y, 2.00 on z, and 0.00 on w.)
The add gate always takes the gradient on its output and distributes it equally to all of its inputs,
regardless of what their values were during the forward pass. This follows from the fact that the
local gradient for the add operation is simply +1.0, so the gradients on all inputs will exactly equal
the gradients on the output because it will be multiplied by x1.0 (and remain unchanged). In the
example circuit above, note that the + gate routed the gradient of 2.00 to both of its inputs, equally
and unchanged.
The max gate routes the gradient. Unlike the add gate which distributed the gradient unchanged
to all its inputs, the max gate distributes the gradient (unchanged) to exactly one of its inputs (the
input that had the highest value during the forward pass). This is because the local gradient for a
max gate is 1.0 for the highest value, and 0.0 for all other values. In the example circuit above, the
max operation routed the gradient of 2.00 to the z variable, which had a higher value than w, and
the gradient on w remains zero.
The multiply gate is a little less easy to interpret. Its local gradients are the input values (except
switched), and this is multiplied by the gradient on its output during the chain rule. In the example
above, the gradient on x is -8.00, which is -4.00 x 2.00.
Unintuitive effects and their consequences. Notice that if one of the inputs to the multiply gate is
very small and the other is very big, then the multiply gate will do something slightly unintuitive: it
will assign a relatively huge gradient to the small input and a tiny gradient to the large input. Note that in linear classifiers where the weights are dot producted wᵀxᵢ (multiplied) with the inputs, this implies that the scale of the data has an effect on the magnitude of the gradient for the weights. For example, if you multiplied all input data examples xᵢ by 1000 during preprocessing, then the gradient on the weights will be 1000 times larger, and you'd have to lower the learning
rate by that factor to compensate. This is why preprocessing matters a lot, sometimes in subtle
ways! And having intuitive understanding for how the gradients flow can help you debug some of
these cases.
Matrix-Matrix multiply gradient. Possibly the most tricky operation is the matrix-matrix
multiplication (which generalizes all matrix-vector and vector-vector) multiply operations:
import numpy as np

# forward pass
W = np.random.randn(5, 10)
X = np.random.randn(10, 3)
D = W.dot(X)
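The backward pass is then a pair of matrix multiplications (a sketch; dD stands in for the gradient flowing back onto D from the rest of the circuit):

# suppose we had the gradient on D from above in the circuit
dD = np.random.randn(*D.shape)   # same shape as D, [5 x 3]
dW = dD.dot(X.T)                 # gradient on W, [5 x 10]
dX = W.T.dot(dD)                 # gradient on X, [10 x 3]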
Tip: use dimension analysis! Note that you do not need to remember the expressions for dW and
dX because they are easy to re-derive based on dimensions. For instance, we know that the
gradient on the weights dW must be of the same size as W after it is computed, and that it must
depend on matrix multiplication of X and dD (as is the case when both X,W are single
numbers and not matrices). There is always exactly one way of achieving this so that the
dimensions work out. For example, X is of size [10 x 3] and dD of size [5 x 3], so if we want dW
and W has shape [5 x 10], then the only way of achieving this is with dD.dot(X.T) , as shown
above.
Work with small, explicit examples. Some people may find it difficult at first to derive the gradient
updates for some vectorized expressions. Our recommendation is to explicitly write out a minimal
vectorized example, derive the gradient on paper and then generalize the pattern to its efficient,
vectorized form.
Erik Learned-Miller has also written up a longer related document on taking matrix/vector
derivatives which you might find helpful. Find it here.
Summary
We developed intuition for what the gradients mean, how they flow backwards in the circuit,
and how they communicate which part of the circuit should increase or decrease and with
what force to make the final output higher.
We discussed the importance of staged computation for practical implementations of
backpropagation. You always want to break up your function into modules for which you
can easily derive local gradients, and then chain them with chain rule. Crucially, you almost
never want to write out these expressions on paper and differentiate them symbolically in
full, because you never need an explicit mathematical equation for the gradient of the input
variables. Hence, decompose your expressions into stages such that you can differentiate
every stage independently (the stages will be matrix vector multiplies, or max operations, or
sum operations, etc.) and then backprop through the variables one step at a time.
In the next section we will start to define neural networks, and backpropagation will allow us to
efficiently compute the gradient of a loss function with respect to its parameters. In other words,
we’re now ready to train neural nets, and the most conceptually difficult part of this class is behind
us! ConvNets will then be a small step away.
Table of Contents:
Quick intro
It is possible to introduce neural networks without appealing to brain analogies. In the section on
linear classification we computed scores for different visual categories given the image using the
formula s = Wx , where W was a matrix and x was an input column vector containing all pixel
data of the image. In the case of CIFAR-10, x is a [3072x1] column vector, and W is a [10x3072]
matrix, so that the output scores is a vector of 10 class scores.
An example neural network would instead compute s = W2 max(0, W1 x) . Here, W1 could be,
for example, a [100x3072] matrix transforming the image into a 100-dimensional intermediate
vector. The function max(0, −) is a non-linearity that is applied elementwise. There are several
choices we could make for the non-linearity (which we’ll study below), but this one is a common
choice and simply thresholds all activations that are below zero to zero. Finally, the matrix W2
would then be of size [10x100], so that we again get 10 numbers out that we interpret as the class
scores. Notice that the non-linearity is critical computationally - if we left it out, the two matrices
could be collapsed to a single matrix, and therefore the predicted class scores would again be a linear function of the input. The non-linearity is where we get the wiggle. The parameters W2, W1
are learned with stochastic gradient descent, and their gradients are derived with chain rule (and
computed with backpropagation).
A three-layer neural network could analogously look like s = W3 max(0, W2 max(0, W1 x)),
where all of W3 , W2 , W1 are parameters to be learned. The sizes of the intermediate hidden
vectors are hyperparameters of the network and we’ll see how we can set them later. Lets now
look into how we can interpret these computations from the neuron/network perspective.
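A sketch of these forward passes in numpy (the sizes match the CIFAR-10 example above; the hidden sizes are illustrative):

import numpy as np

x = np.random.randn(3072, 1)                   # input column vector (e.g. CIFAR-10 pixels)

# two-layer network: s = W2 max(0, W1 x)
W1 = np.random.randn(100, 3072) * 0.01
W2 = np.random.randn(10, 100) * 0.01
s = W2 @ np.maximum(0, W1 @ x)                 # 10 class scores

# three-layer network: s = W3 max(0, W2' max(0, W1 x))
W2b = np.random.randn(50, 100) * 0.01
W3  = np.random.randn(10, 50) * 0.01
s3 = W3 @ np.maximum(0, W2b @ np.maximum(0, W1 @ x))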
A cartoon drawing of a biological neuron (left) and its mathematical model (right).
class Neuron(object):
  # ...
  def forward(self, inputs):
    """ assume inputs and weights are 1-D numpy arrays and bias is a number """
    cell_body_sum = np.sum(inputs * self.weights) + self.bias
    firing_rate = 1.0 / (1.0 + math.exp(-cell_body_sum)) # sigmoid activation function
    return firing_rate
In other words, each neuron performs a dot product with the input and its weights, adds the bias
and applies the non-linearity (or activation function), in this case the sigmoid σ(x) = 1/(1 + e−x ) .
We will go into more details about different activation functions at the end of this section.
Coarse model. It’s important to stress that this model of a biological neuron is very coarse: For
example, there are many different types of neurons, each with different properties. The dendrites
in biological neurons perform complex nonlinear computations. The synapses are not just a
single weight, they’re a complex non-linear dynamical system. The exact timing of the output
spikes in many systems is known to be important, suggesting that the rate code approximation
may not hold. Due to all these and many other simplifications, be prepared to hear groaning
sounds from anyone with some neuroscience background if you draw analogies between Neural
Networks and real brains. See this review (pdf), or more recently this review if you are interested.
(activation near zero) certain linear regions of its input space. Hence, with an appropriate loss
function on the neuron’s output, we can turn a single neuron into a linear classifier:
Binary Softmax classifier. For example, we can interpret σ(∑ i wi xi + b) to be the probability of
one of the classes P(yi = 1 ∣ xi ; w). The probability of the other class would be
P(yi = 0 ∣ xi ; w) = 1 − P(yi = 1 ∣ xi ; w), since they must sum to one. With this interpretation,
we can formulate the cross-entropy loss as we have seen in the Linear Classification section, and
optimizing it would lead to a binary Softmax classifier (also known as logistic regression). Since
the sigmoid function is restricted to be between 0-1, the predictions of this classifier are based on
whether the output of the neuron is greater than 0.5.
Binary SVM classifier. Alternatively, we could attach a max-margin hinge loss to the output of the
neuron and train it to become a binary Support Vector Machine.
Regularization interpretation. The regularization loss in both SVM/Softmax cases could in this
biological view be interpreted as gradual forgetting, since it would have the effect of driving all
synaptic weights w towards zero after every parameter update.
A single neuron can be used to implement a binary classifier (e.g. binary Softmax or binary SVM
classifiers)
Left: Sigmoid non-linearity squashes real numbers to range between [0,1] Right: The tanh non-linearity
squashes real numbers to range between [-1,1].
Sigmoid. The sigmoid non-linearity has the mathematical form σ(x) = 1/(1 + e−x ) and is shown
in the image above on the left. As alluded to in the previous section, it takes a real-valued number
and “squashes” it into range between 0 and 1. In particular, large negative numbers become 0 and
large positive numbers become 1. The sigmoid function has seen frequent use historically since it
has a nice interpretation as the firing rate of a neuron: from not firing at all (0) to fully-saturated
firing at an assumed maximum frequency (1). In practice, the sigmoid non-linearity has recently
fallen out of favor and it is rarely ever used. It has two major drawbacks:
Sigmoids saturate and kill gradients. A very undesirable property of the sigmoid neuron is
that when the neuron’s activation saturates at either tail of 0 or 1, the gradient at these
regions is almost zero. Recall that during backpropagation, this (local) gradient will be
multiplied to the gradient of this gate’s output for the whole objective. Therefore, if the local
gradient is very small, it will effectively “kill” the gradient and almost no signal will flow
through the neuron to its weights and recursively to its data. Additionally, one must pay
extra caution when initializing the weights of sigmoid neurons to prevent saturation. For
example, if the initial weights are too large then most neurons would become saturated and
the network will barely learn.
Sigmoid outputs are not zero-centered. This is undesirable since neurons in later layers of
processing in a Neural Network (more on this soon) would be receiving data that is not
zero-centered. This has implications on the dynamics during gradient descent, because if the data coming into a neuron is always positive (e.g. x > 0 elementwise in f = wᵀx + b), then the gradient on the weights w will during backpropagation become either all positive, or all negative (depending on the gradient of the whole expression f). This could
introduce undesirable zig-zagging dynamics in the gradient updates for the weights.
However, notice that once these gradients are added up across a batch of data the final
update for the weights can have variable signs, somewhat mitigating this issue. Therefore,
this is an inconvenience but it has less severe consequences compared to the saturated
activation problem above.
Tanh. The tanh non-linearity is shown on the image above on the right. It squashes a real-valued
number to the range [-1, 1]. Like the sigmoid neuron, its activations saturate, but unlike the
sigmoid neuron its output is zero-centered. Therefore, in practice the tanh non-linearity is always
preferred to the sigmoid nonlinearity. Also note that the tanh neuron is simply a scaled sigmoid
neuron, in particular the following holds: tanh(x) = 2σ(2x) − 1 .
Left: Rectified Linear Unit (ReLU) activation function, which is zero when x < 0 and then linear with slope 1
when x > 0. Right: A plot from Krizhevsky et al. (pdf) paper indicating the 6x improvement in convergence
with the ReLU unit compared to the tanh unit.
ReLU. The Rectified Linear Unit has become very popular in the last few years. It computes the
function f (x) = max(0, x). In other words, the activation is simply thresholded at zero (see
image above on the left). There are several pros and cons to using the ReLUs:
(+) It was found to greatly accelerate (e.g. a factor of 6 in Krizhevsky et al.) the convergence
of stochastic gradient descent compared to the sigmoid/tanh functions. It is argued that
this is due to its linear, non-saturating form.
(+) Compared to tanh/sigmoid neurons that involve expensive operations (exponentials,
etc.), the ReLU can be implemented by simply thresholding a matrix of activations at zero.
(-) Unfortunately, ReLU units can be fragile during training and can “die”. For example, a large
gradient flowing through a ReLU neuron could cause the weights to update in such a way
that the neuron will never activate on any datapoint again. If this happens, then the gradient
flowing through the unit will forever be zero from that point on. That is, the ReLU units can
irreversibly die during training since they can get knocked off the data manifold. For
example, you may find that as much as 40% of your network can be “dead” (i.e. neurons that
never activate across the entire training dataset) if the learning rate is set too high. With a
proper setting of the learning rate this is less frequently an issue.
Leaky ReLU. Leaky ReLUs are one attempt to fix the “dying ReLU” problem. Instead of the function
being zero when x < 0, a leaky ReLU will instead have a small positive slope (of 0.01, or so). That
is, the function computes f (x) = 𝟙(x < 0)(αx) + 𝟙(x >= 0)(x) where α is a small constant.
Some people report success with this form of activation function, but the results are not always
consistent. The slope in the negative region can also be made into a parameter of each neuron, as
seen in PReLU neurons, introduced in Delving Deep into Rectifiers, by Kaiming He et al., 2015.
However, the consistency of the benefit across tasks is presently unclear.
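A minimal numpy sketch of these two activations (our own helper names; alpha = 0.01 is just the conventional default mentioned above):

    import numpy as np

    def relu(x):
        # thresholds activations at zero
        return np.maximum(0, x)

    def leaky_relu(x, alpha=0.01):
        # small positive slope alpha for x < 0 instead of exactly zero
        return np.where(x < 0, alpha * x, x)

    x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
    print(relu(x))        # negative inputs are clamped to zero
    print(leaky_relu(x))  # negative inputs are scaled by 0.01 instead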
Maxout. Other types of units have been proposed that do not have the functional form
f (wᵀx + b) where a non-linearity is applied on the dot product between the weights and the data.
One relatively popular choice is the Maxout neuron (introduced recently by Goodfellow et al.) that
generalizes the ReLU and its leaky version. The Maxout neuron computes the function
max(w₁ᵀx + b₁, w₂ᵀx + b₂). Notice that both ReLU and Leaky ReLU are a special case of this
form (for example, for ReLU we have w₁, b₁ = 0). The Maxout neuron therefore enjoys all the
benefits of a ReLU unit (linear regime of operation, no saturation) and does not have its
drawbacks (dying ReLU). However, unlike the ReLU neurons it doubles the number of parameters
for every single neuron, leading to a high total number of parameters.
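A hedged numpy sketch of a single Maxout unit (the weight vectors w1, w2 and biases b1, b2 below are made-up example values):

    import numpy as np

    def maxout(x, w1, b1, w2, b2):
        # max of two affine functions of the input
        return max(np.dot(w1, x) + b1, np.dot(w2, x) + b2)

    x = np.array([1.0, -2.0, 0.5])
    w1, b1 = np.zeros(3), 0.0                   # with w1 = 0, b1 = 0 this reduces to ReLU
    w2, b2 = np.array([0.5, 0.1, -0.3]), 0.2
    print(maxout(x, w1, b1, w2, b2))            # max(0, w2.x + b2), which is about 0.35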
This concludes our discussion of the most common types of neurons and their activation
functions. As a last comment, it is very rare to mix and match different types of neurons in the
same network, even though there is no fundamental problem with doing so.
TLDR: “What neuron type should I use?” Use the ReLU non-linearity, be careful with your learning
rates and possibly monitor the fraction of “dead” units in a network. If this concerns you, give
Leaky ReLU or Maxout a try. Never use sigmoid. Try tanh, but expect it to work worse than
ReLU/Maxout.
Layer-wise organization
Neural Networks as neurons in graphs. Neural Networks are modeled as collections of neurons
that are connected in an acyclic graph. In other words, the outputs of some neurons can become
inputs to other neurons. Cycles are not allowed since that would imply an infinite loop in the
forward pass of a network. Instead of amorphous blobs of connected neurons, Neural Network
models are often organized into distinct layers of neurons. For regular neural networks, the most
common layer type is the fully-connected layer in which neurons between two adjacent layers are
fully pairwise connected, but neurons within a single layer share no connections. Below are two
example Neural Network topologies that use a stack of fully-connected layers:
Left: A 2-layer Neural Network (one hidden layer of 4 neurons (or units) and one output layer with 2 neurons),
and three inputs. Right: A 3-layer neural network with three inputs, two hidden layers of 4 neurons each and
one output layer. Notice that in both cases there are connections (synapses) between neurons across layers,
but not within a layer.
Naming conventions. Notice that when we say N-layer neural network, we do not count the input
layer. Therefore, a single-layer neural network describes a network with no hidden layers (input
directly mapped to output). In that sense, you can sometimes hear people say that logistic
regression or SVMs are simply a special case of single-layer Neural Networks. You may also hear
these networks interchangeably referred to as “Artificial Neural Networks” (ANN) or “Multi-Layer
Perceptrons” (MLP). Many people do not like the analogies between Neural Networks and real
brains and prefer to refer to neurons as units.
Output layer. Unlike the other layers in a Neural Network, the output layer neurons most commonly
do not have an activation function (or you can think of them as having a linear identity activation
function). This is because the last output layer is usually taken to represent the class scores (e.g.
in classification), which are arbitrary real-valued numbers, or some kind of real-valued target (e.g.
in regression).
Sizing neural networks. The two metrics that people commonly use to measure the size of neural
networks are the number of neurons, or more commonly the number of parameters. Working with
the two example networks in the above picture (a quick check in code follows the list):
The first network (left) has 4 + 2 = 6 neurons (not counting the inputs), [3 x 4] + [4 x 2] = 12 + 8 = 20
weights and 4 + 2 = 6 biases, for a total of 26 learnable parameters.
The second network (right) has 4 + 4 + 1 = 9 neurons, [3 x 4] + [4 x 4] + [4 x 1] = 12 + 16 + 4 =
32 weights and 4 + 4 + 1 = 9 biases, for a total of 41 learnable parameters.
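As a quick check of these counts (plain Python; the helper below is ours, and the layer sizes are read off the figure):

    def count_params(layer_sizes):
        # layer_sizes = [input_dim, hidden_1, ..., output_dim]
        weights = sum(n_in * n_out for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))
        biases = sum(layer_sizes[1:])
        return weights + biases

    print(count_params([3, 4, 2]))     # 20 weights + 6 biases = 26
    print(count_params([3, 4, 4, 1]))  # 32 weights + 9 biases = 41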
To give you some context, modern Convolutional Networks contain on the order of 100 million
parameters and are usually made up of approximately 10-20 layers (hence deep learning).
However, as we will see, the number of effective connections is significantly greater due to
parameter sharing. More on this in the Convolutional Neural Networks module.
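A minimal sketch of the kind of forward-pass code the next paragraph refers to, assuming numpy and a sigmoid activation f (the sizes match the 3-layer network in the figure and are otherwise illustrative):

    import numpy as np

    f = lambda x: 1.0 / (1.0 + np.exp(-x))    # activation function (sigmoid, as an example)
    x = np.random.randn(3, 1)                 # random input vector (3x1)
    W1, b1 = np.random.randn(4, 3), np.zeros((4, 1))
    W2, b2 = np.random.randn(4, 4), np.zeros((4, 1))
    W3, b3 = np.random.randn(1, 4), np.zeros((1, 1))
    h1 = f(np.dot(W1, x) + b1)                # first hidden layer activations (4x1)
    h2 = f(np.dot(W2, h1) + b2)               # second hidden layer activations (4x1)
    out = np.dot(W3, h2) + b3                 # output neuron (1x1), no activation function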
In the above code, W1,W2,W3,b1,b2,b3 are the learnable parameters of the network. Notice
also that instead of having a single input column vector, the variable x could hold an entire batch
of training data (where each input example would be a column of x ) and then all examples
would be efficiently evaluated in parallel. Notice that the final Neural Network layer usually doesn’t
have an activation function (e.g. it represents a (real-valued) class score in a classification
setting).
The forward pass of a fully-connected layer corresponds to one matrix multiplication followed
by a bias offset and an activation function.
Representational power
One way to look at Neural Networks with fully-connected layers is that they define a family of
functions that are parameterized by the weights of the network. A natural question that arises is:
What is the representational power of this family of functions? In particular, are there functions
that cannot be modeled with a Neural Network?
It turns out that Neural Networks with at least one hidden layer are universal approximators. That
is, it can be shown (e.g. see Approximation by Superpositions of Sigmoidal Function from 1989
(pdf), or this intuitive explanation from Michael Nielsen) that given any continuous function f (x)
and some ϵ > 0, there exists a Neural Network g(x) with one hidden layer (with a reasonable
choice of non-linearity, e.g. sigmoid) such that ∀x, | f (x) − g(x) | < ϵ. In other words, the neural
network can approximate any continuous function.
If one hidden layer suffices to approximate any function, why use more layers and go deeper? The
answer is that the fact that a two-layer Neural Network is a universal approximator is, while
mathematically cute, a relatively weak and useless statement in practice. In one dimension, the
“sum of indicator bumps” function g(x) = Σᵢ cᵢ 𝟙(aᵢ < x < bᵢ), where a, b, c are parameter
vectors, is also a universal approximator, but no one would suggest that we use this functional
form in Machine Learning. Neural Networks work well in practice because they compactly express
nice, smooth functions that fit well with the statistical properties of data we encounter in practice,
and are also easy to learn using our optimization algorithms (e.g. gradient descent). Similarly, the
fact that deeper networks (with multiple hidden layers) can work better than single-hidden-layer
networks is an empirical observation, despite the fact that their representational power is equal.
As an aside, in practice it is often the case that 3-layer neural networks will outperform 2-layer
nets, but going even deeper (4,5,6-layer) rarely helps much more. This is in stark contrast to
Convolutional Networks, where depth has been found to be an extremely important component
for a good recognition system (e.g. on order of 10 learnable layers). One argument for this
observation is that images contain hierarchical structure (e.g. faces are made up of eyes, which
are made up of edges, etc.), so several layers of processing make intuitive sense for this data
domain.
The full story is, of course, much more involved and a topic of much recent research. If you are
interested in these topics we recommend for further reading:
Deep Learning book in press by Bengio, Goodfellow, Courville, in particular Chapter 6.4.
Do Deep Nets Really Need to be Deep?
FitNets: Hints for Thin Deep Nets
Larger Neural Networks can represent more complicated functions. The data are shown as circles colored by
their class, and the decision regions by a trained neural network are shown underneath. You can play with
these examples in this ConvNetsJS demo.
In the diagram above, we can see that Neural Networks with more neurons can express more
complicated functions. However, this is both a blessing (since we can learn to classify more
complicated data) and a curse (since it is easier to overfit the training data). Overfitting occurs
when a model with high capacity fits the noise in the data instead of the (assumed) underlying
relationship. For example, the model with 20 hidden neurons fits all the training data but at the
cost of segmenting the space into many disjoint red and green decision regions. The model with 3
hidden neurons only has the representational power to classify the data in broad strokes. It
models the data as two blobs and interprets the few red points inside the green cluster as outliers
(noise). In practice, this could lead to better generalization on the test set.
Based on our discussion above, it seems that smaller neural networks can be preferred if the data
is not complex enough to prevent overfitting. However, this is incorrect - there are many other
preferred ways to prevent overfitting in Neural Networks that we will discuss later (such as L2
regularization, dropout, input noise). In practice, it is always better to use these methods to control
overfitting instead of the number of neurons.
The subtle reason behind this is that smaller networks are harder to train with local methods such
as Gradient Descent: It’s clear that their loss functions have relatively few local minima, but it
turns out that many of these minima are easier to converge to, and that they are bad (i.e. with high
loss). Conversely, bigger neural networks contain significantly more local minima, but these
minima turn out to be much better in terms of their actual loss. Since Neural Networks are non-
convex, it is hard to study these properties mathematically, but some attempts to understand
these objective functions have been made, e.g. in a recent paper The Loss Surfaces of Multilayer
Networks. In practice, what you find is that if you train a small network the final loss can display a
good amount of variance - in some cases you get lucky and converge to a good place but in some
cases you get trapped in one of the bad minima. On the other hand, if you train a large network
you’ll start to find many different solutions, but the variance in the final achieved loss will be much
smaller. In other words, all solutions are about equally as good, and rely less on the luck of
random initialization.
To reiterate, the regularization strength is the preferred way to control the overfitting of a neural
network. We can look at the results achieved by three different settings:
The effects of regularization strength: Each neural network above has 20 hidden neurons, but changing the
regularization strength makes its final decision regions smoother with a higher regularization. You can play
with these examples in this ConvNetsJS demo.
The takeaway is that you should not be using smaller networks because you are afraid of
overfitting. Instead, you should use as big of a neural network as your computational budget
allows, and use other regularization techniques to control overfitting.
Summary
In summary,
We saw that Neural Networks are universal function approximators, but we also
discussed the fact that this property has little to do with their ubiquitous use. They are used
because they make certain “right” assumptions about the functional forms of functions that
come up in practice.
We discussed the fact that larger networks will always work better than smaller networks,
but their higher model capacity must be appropriately addressed with stronger
regularization (such as higher weight decay), or they might overfit. We will see more forms
of regularization (especially dropout) in later sections.
Additional References
deeplearning.net tutorial with Theano
ConvNetJS demos for intuitions
Michael Nielsen’s tutorials
Derivatives, Backpropagation, and Vectorization
Justin Johnson
September 6, 2017
1 Derivatives
1.1 Scalar Case
You are probably familiar with the concept of a derivative in the scalar case:
given a function f : R → R, the derivative of f at a point x ∈ R is defined as:
    f'(x) = lim_{h→0} [ f(x + h) − f(x) ] / h
Derivatives are a way to measure change. In the scalar case, the derivative
of the function f at the point x tells us how much the function f changes as the
input x changes by a small amount ε:
    f(x + ε) ≈ f(x) + ε f'(x)
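As a small numerical illustration of this approximation (with our own example function f(x) = x³, which is not from the notes):

    def f(x):
        return x ** 3

    x, eps = 2.0, 1e-3
    f_prime = 3 * x ** 2                      # analytic derivative of x^3
    approx = (f(x + eps) - f(x)) / eps        # finite-difference estimate of f'(x)
    print(f_prime, approx)                    # 12.0 versus roughly 12.006
    print(f(x + eps), f(x) + eps * f_prime)   # f(x + eps) is approximately f(x) + eps * f'(x)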
For ease of notation we will commonly assign a name to the output of f, say y = f(x), and write
∂y/∂x for the derivative of y with respect to x. This notation emphasizes that ∂y/∂x is the rate of
change between the variables x and y; concretely if x were to change by ε then y will change by
approximately ε ∂y/∂x. We can write this relationship as

    x → x + ∆x  =⇒  y → ≈ y + (∂y/∂x) ∆x

You should read this as saying “changing x to x + ∆x implies that y will change to approximately
y + (∂y/∂x) ∆x”. This notation is nonstandard, but I like it since it emphasizes the relationship
between changes in x and changes in y.
The chain rule tells us how to compute the derivative of the composition of functions. In the
scalar case suppose that f, g : R → R and y = f(x), z = g(y); then we can also write z = (g ∘ f)(x),
or draw the following computational graph:

    x --f--> y --g--> z

The (scalar) chain rule tells us that

    ∂z/∂x = (∂z/∂y)(∂y/∂x)
This equation makes intuitive sense. The derivatives ∂z/∂y and ∂y/∂x give:

    x → x + ∆x  =⇒  y → ≈ y + (∂y/∂x) ∆x
    y → y + ∆y  =⇒  z → ≈ z + (∂z/∂y) ∆y

Combining these two rules lets us compute the effect of x on z: if x changes by ∆x then y will
change by (∂y/∂x) ∆x, so we have ∆y = (∂y/∂x) ∆x. If y changes by ∆y then z will change by
(∂z/∂y) ∆y = (∂z/∂y)(∂y/∂x) ∆x, which is exactly what the chain rule tells us.
1.2 Gradient: Vector in, Scalar out
Now suppose that f : R^N → R takes a vector as input and produces a scalar. The derivative of f
at a point x, called the gradient, is defined as:

    ∇_x f(x) = lim_{h→0} [ f(x + h) − f(x) ] / ‖h‖
Now the gradient ∇_x f(x) ∈ R^N is a vector, with the same intuition as the scalar case. If we set
y = f(x) then we have the relationship

    x → x + ∆x  =⇒  y → ≈ y + (∂y/∂x) · ∆x

The formula changes a bit from the scalar case to account for the fact that x, ∆x, and ∂y/∂x are
now vectors in R^N while y is a scalar. In particular when multiplying ∂y/∂x by ∆x we use the
dot product, which combines two vectors to give a scalar.

One nice outcome of this formula is that it gives meaning to the individual elements of the
gradient ∂y/∂x. Suppose that ∆x is the ith basis vector, so that the ith coordinate of ∆x is 1 and
all other coordinates of ∆x are 0. Then the dot product (∂y/∂x) · ∆x is simply the ith coordinate
of ∂y/∂x; thus the ith coordinate of ∂y/∂x tells us the approximate amount by which y will change
if we move x along the ith coordinate axis.

This means that we can also view the gradient ∂y/∂x as a vector of partial derivatives:

    ∂y/∂x = ( ∂y/∂x_1, ∂y/∂x_2, ..., ∂y/∂x_N )

where x_i is the ith coordinate of the vector x, which is a scalar, so each partial derivative ∂y/∂x_i
is also a scalar.
1.3 Jacobian: Vector in, Vector out
Now suppose that f : R^N → R^M takes a vector as input and produces a vector as output. Then
the derivative of f at a point x, also called the Jacobian, is the M × N matrix of partial derivatives.
If we again set y = f(x) then we can write:

    ∂y/∂x = [ ∂y_1/∂x_1  ···  ∂y_1/∂x_N
              ...             ...
              ∂y_M/∂x_1  ···  ∂y_M/∂x_N ]

The Jacobian tells us the relationship between each element of x and each element of y: the
(i, j)-th element of ∂y/∂x is equal to ∂y_i/∂x_j, so it tells us the amount by which y_i will change
if x_j is changed by a small amount.

Just as in the previous cases, the Jacobian tells us the relationship between changes in the input
and changes in the output:

    x → x + ∆x  =⇒  y → ≈ y + (∂y/∂x) ∆x

Here ∂y/∂x is an M × N matrix and ∆x is an N-dimensional vector, so the product (∂y/∂x) ∆x is
a matrix-vector multiplication resulting in an M-dimensional vector.
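A small numerical sketch of this relationship, using a made-up function f : R^2 → R^2 that is not from the notes:

    import numpy as np

    def f(x):
        # toy vector-to-vector function
        return np.array([x[0] * x[1], np.sin(x[0]) + x[1] ** 2])

    def jacobian(x):
        # its analytic 2x2 Jacobian
        return np.array([[x[1], x[0]],
                         [np.cos(x[0]), 2 * x[1]]])

    x = np.array([1.0, 2.0])
    dx = 1e-4 * np.array([3.0, -1.0])               # a small change in the input
    pred = f(x) + jacobian(x) @ dx                  # y + (dy/dx) dx
    print(np.allclose(f(x + dx), pred, atol=1e-6))  # True: the Jacobian predicts the change in y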
The chain rule can be extended to the vector case using Jacobian matrices. Suppose that
f : R^N → R^M and g : R^M → R^K. Let x ∈ R^N, y ∈ R^M, and z ∈ R^K with y = f(x) and
z = g(y), so we have the same computational graph as the scalar case:

    x --f--> y --g--> z

The chain rule also has the same form as the scalar case:

    ∂z/∂x = (∂z/∂y)(∂y/∂x)

However now each of these terms is a matrix: ∂z/∂y is a K × M matrix, ∂y/∂x is an M × N matrix,
and ∂z/∂x is a K × N matrix; the multiplication of ∂z/∂y and ∂y/∂x is matrix multiplication.
1.4 Generalized Jacobian: Tensor in, Tensor out
Just as a vector is a one-dimensional list of numbers and a matrix is a two-dimensional grid of
numbers, a tensor is a D-dimensional grid of numbers.

Many operations in deep learning accept tensors as inputs and produce tensors as outputs. For
example an image is usually represented as a three-dimensional grid of numbers, where the three
dimensions correspond to the height, width, and color channels (red, green, blue) of the image. We
must therefore develop a derivative that is compatible with functions operating on general tensors.

Suppose now that f : R^{N_1 × ··· × N_{D_x}} → R^{M_1 × ··· × M_{D_y}}. Then the input to f is a
D_x-dimensional tensor of shape N_1 × ··· × N_{D_x}, and the output of f is a D_y-dimensional
tensor of shape M_1 × ··· × M_{D_y}. If y = f(x) then the derivative ∂y/∂x is a generalized
Jacobian, which is an object with shape

    (M_1 × ··· × M_{D_y}) × (N_1 × ··· × N_{D_x})

It plays the same role as before: indexing the dimensions of y by j and the dimensions of x by i,
the generalized matrix-vector product with a small change ∆x in the input is

    ( (∂y/∂x) ∆x )_j = Σ_i (∂y/∂x)_{j,i} (∆x)_i = (∂y/∂x)_{j,:} · ∆x
In a neural network, f is one piece of a larger computation that ends in a scalar loss L; during
backpropagation we receive the upstream gradient ∂L/∂y and want ∂L/∂x and ∂L/∂w via the
chain rule, e.g. ∂L/∂x = (∂L/∂y)(∂y/∂x). However, there’s a problem with this approach: the
Jacobian matrices ∂y/∂x and ∂y/∂w are typically far too large to fit in memory.

As a concrete example, let’s suppose that f is a linear layer that takes as input a minibatch of N
vectors, each of dimension D, and produces a minibatch of N vectors, each of dimension M. Then
x is a matrix of shape N × D, w is a matrix of shape D × M, and y = f(x, w) = xw is a matrix of
shape N × M. The Jacobian ∂y/∂x then has shape (N × M) × (N × D). In a typical neural network
we might have N = 64 and M = D = 4096; then ∂y/∂x consists of 64 · 4096 · 64 · 4096 scalar
values; this is more than 68 billion numbers; using 32-bit floating point, this Jacobian matrix will
take 256 GB of memory to store. Therefore it is completely hopeless to try and explicitly store and
manipulate the Jacobian matrix.
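The memory arithmetic above, spelled out in plain Python just to make the numbers concrete:

    N, M, D = 64, 4096, 4096
    num_entries = (N * M) * (N * D)    # number of scalars in dy/dx viewed as a matrix
    bytes_needed = num_entries * 4     # 4 bytes per 32-bit float
    print(num_entries)                 # 68,719,476,736 entries, i.e. more than 68 billion
    print(bytes_needed / 2 ** 30)      # 256.0 (GiB)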
However it turns out that for most common neural network layers, we can derive expressions that
compute the product (∂L/∂y)(∂y/∂x) without explicitly forming the Jacobian ∂y/∂x. Even better,
we can typically derive this expression without even computing an explicit expression for the
Jacobian ∂y/∂x; in many cases we can work out a small case on paper and then infer the general
formula.
Let’s see how this works out for the case of the linear layer f(x, w) = xw. Set N = 1, D = 2,
M = 3. Then we can explicitly write

    y = [ y_{1,1}  y_{1,2}  y_{1,3} ] = xw                                                        (1)

      = [ x_{1,1}  x_{1,2} ] [ w_{1,1}  w_{1,2}  w_{1,3}
                               w_{2,1}  w_{2,2}  w_{2,3} ]                                        (2)

      = [ x_{1,1} w_{1,1} + x_{1,2} w_{2,1}   x_{1,1} w_{1,2} + x_{1,2} w_{2,2}   x_{1,1} w_{1,3} + x_{1,2} w_{2,3} ]   (3)
During backpropagation we are given the upstream gradient ∂L/∂y (written dy for short below,
with dy_{i,j} = (∂L/∂y)_{i,j}), and we want ∂L/∂x. By the chain rule,

    ∂L/∂x_{1,1} = (∂L/∂y)(∂y/∂x_{1,1})                                                            (4)
    ∂L/∂x_{1,2} = (∂L/∂y)(∂y/∂x_{1,2})                                                            (5)

Viewing these derivatives as generalized matrices, ∂L/∂y has shape (1) × (N × M) and ∂y/∂x_{1,1}
has shape (N × M) × (1); their product ∂L/∂x_{1,1} then has shape (1) × (1). If we instead view
∂L/∂y and ∂y/∂x_{1,1} as matrices of shape N × M, then their generalized matrix product is
simply the dot product ∂L/∂y · ∂y/∂x_{1,1}.
Now we compute

    ∂y/∂x_{1,1} = [ ∂y_{1,1}/∂x_{1,1}   ∂y_{1,2}/∂x_{1,1}   ∂y_{1,3}/∂x_{1,1} ] = [ w_{1,1}  w_{1,2}  w_{1,3} ]   (6)
    ∂y/∂x_{1,2} = [ ∂y_{1,1}/∂x_{1,2}   ∂y_{1,2}/∂x_{1,2}   ∂y_{1,3}/∂x_{1,2} ] = [ w_{2,1}  w_{2,2}  w_{2,3} ]   (7)

where the final equalities come from taking the derivatives of Equation 3 with respect to x_{1,1}
and x_{1,2}.
We can now combine these results and write

    ∂L/∂x_{1,1} = (∂L/∂y) · (∂y/∂x_{1,1}) = dy_{1,1} w_{1,1} + dy_{1,2} w_{1,2} + dy_{1,3} w_{1,3}   (8)
    ∂L/∂x_{1,2} = (∂L/∂y) · (∂y/∂x_{1,2}) = dy_{1,1} w_{2,1} + dy_{1,2} w_{2,2} + dy_{1,3} w_{2,3}   (9)

This gives us our final expression for ∂L/∂x:

    ∂L/∂x = [ ∂L/∂x_{1,1}   ∂L/∂x_{1,2} ]                                                         (10)
          = [ dy_{1,1} w_{1,1} + dy_{1,2} w_{1,2} + dy_{1,3} w_{1,3}   dy_{1,1} w_{2,1} + dy_{1,2} w_{2,2} + dy_{1,3} w_{2,3} ]   (11)
          = (∂L/∂y) wᵀ                                                                            (12)
This final result ∂L/∂x = (∂L/∂y) wᵀ is very interesting because it allows us to efficiently
compute ∂L/∂x without explicitly forming the Jacobian ∂y/∂x. We have only derived this formula
for the specific case of N = 1, D = 2, M = 3 but it in fact holds in general.

By a similar thought process we can derive a similar expression for ∂L/∂w without explicitly
forming the Jacobian ∂y/∂w. You should try and work through this as an exercise.
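A hedged numpy check of this result for random shapes (the array dy below plays the role of the upstream gradient ∂L/∂y; we only verify the ∂L/∂x expression derived above, and the ∂L/∂w exercise can be checked the same way):

    import numpy as np

    np.random.seed(0)
    N, D, M = 4, 5, 3
    x = np.random.randn(N, D)
    w = np.random.randn(D, M)
    dy = np.random.randn(N, M)                 # a made-up upstream gradient dL/dy

    # A scalar loss whose gradient with respect to y = xw is exactly dy: L = sum(y * dy).
    def loss(x_):
        return np.sum((x_ @ w) * dy)

    dx_formula = dy @ w.T                      # the claimed expression for dL/dx

    # Numerical gradient of L with respect to each entry of x.
    dx_numeric = np.zeros_like(x)
    eps = 1e-6
    for i in range(N):
        for j in range(D):
            x_plus, x_minus = x.copy(), x.copy()
            x_plus[i, j] += eps
            x_minus[i, j] -= eps
            dx_numeric[i, j] = (loss(x_plus) - loss(x_minus)) / (2 * eps)

    print(np.allclose(dx_formula, dx_numeric, atol=1e-4))  # True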
Review of Differential Calculus Theory
Author: Guillaume Genthial
Winter 2017
1 Introduction
We use derivatives all the time, but we forget what they mean. In general, we have in mind that
for a function f : R → R, we have something like

    f(x + h) − f(x) ≈ f'(x) h

and we denote this derivative by f'(x), df/dx, ∂f/∂x, or ∇_x f. However, these notations refer to
different mathematical objects, and the confusion can lead to mistakes. This paper recalls some
notions about these objects.

(Side note on scalar-product and dot-product: given two vectors a and b, the scalar-product is
⟨a|b⟩ = Σ_{i=1}^n a_i b_i and the dot-product is aᵀ · b = ⟨a|b⟩ = Σ_{i=1}^n a_i b_i.)
2 Theory for f : R^n → R

2.1 Differential

Formal definition. Let’s consider a function f : R^n → R defined on R^n with the scalar product
⟨·|·⟩. We suppose that this function is differentiable, which means that for x ∈ R^n (fixed) and a
small variation h (which can change) we can write

    f(x + h) = f(x) + d_x f(h) + o_{h→0}(h)                                                       (1)

where d_x f, called the differential of f in x, is a linear form R^n → R; it is the best linear
approximation of the function f around x.

Example. Let f : R^2 → R such that f((x_1, x_2)) = 3x_1 + x_2². Let’s pick (a, b) ∈ R^2 and
h = (h_1, h_2) ∈ R^2. We have

    f((a + h_1, b + h_2)) = 3(a + h_1) + (b + h_2)²
                          = 3a + 3h_1 + b² + 2b h_2 + h_2²
                          = 3a + b² + 3h_1 + 2b h_2 + h_2²
                          = f(a, b) + 3h_1 + 2b h_2 + o(h)

since h_2² is o_{h→0}(h). Then d_{(a,b)} f((h_1, h_2)) = 3h_1 + 2b h_2.
2.2 Gradient

It can be shown that there exists a vector u such that d_x f(h) = ⟨u|h⟩ for every h; we define the
gradient of f in x as ∇_x f := u. Gradients and differentials of a function are conceptually very
different: the gradient is a vector, while the differential is a function. Then, as a conclusion, we
can rewrite equation (1) as

    f(x + h) = f(x) + d_x f(h) + o_{h→0}(h)                                                       (2)
             = f(x) + ⟨∇_x f | h⟩ + o_{h→0}(h)                                                    (3)
Example. Same example as before, f : R^2 → R such that f((x_1, x_2)) = 3x_1 + x_2². We showed
that

    d_{(a,b)} f((h_1, h_2)) = 3h_1 + 2b h_2

We can rewrite this as

    d_{(a,b)} f((h_1, h_2)) = ⟨(3, 2b) | (h_1, h_2)⟩

and thus our gradient is

    ∇_{(a,b)} f = (3, 2b)
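A quick numerical sanity check of this gradient, picking a = 1 and b = 2 as made-up values:

    import numpy as np

    def f(x):
        return 3 * x[0] + x[1] ** 2

    a, b = 1.0, 2.0
    x = np.array([a, b])
    grad_analytic = np.array([3.0, 2 * b])     # the gradient (3, 2b) derived above

    eps = 1e-6
    grad_numeric = np.array([
        (f(x + eps * np.eye(2)[i]) - f(x - eps * np.eye(2)[i])) / (2 * eps)
        for i in range(2)
    ])
    print(np.allclose(grad_analytic, grad_numeric))  # True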
2.3 Partial derivatives

The partial derivative ∂f/∂x_i(x) is the derivative of f with respect to the i-th component,
evaluated in x. (Depending on the context, most people omit to write the (x) evaluation and just
write ∂f/∂x_i instead of ∂f/∂x_i(x).)

Example. Same example as before, f : R^2 → R such that f(x_1, x_2) = 3x_1 + x_2². Let’s write

    ∂f/∂x_1((a, b)) = lim_{h→0} [ f((a + h, b)) − f((a, b)) ] / h
                    = lim_{h→0} [ 3(a + h) + b² − (3a + b²) ] / h
                    = lim_{h→0} 3h / h
                    = 3
Example. We showed that

    ∂f/∂x_1((a, b)) = 3
    ∂f/∂x_2((a, b)) = 2b

and that

    ∇_{(a,b)} f = (3, 2b)

and then we verify that

    ∇_{(a,b)} f = ( ∂f/∂x_1((a, b)), ∂f/∂x_2((a, b)) )
3 Summary

Formal definition. For a function f : R^n → R, we have defined the following objects, which can
be summarized in the following equation (recall that aᵀ · b = ⟨a|b⟩ = Σ_{i=1}^n a_i b_i):

    f(x + h) = f(x) + d_x f(h) + o_{h→0}(h)                                        (differential)
             = f(x) + ⟨∇_x f | h⟩ + o_{h→0}(h)                                     (gradient)
             = f(x) + ⟨ ∂f/∂x(x) | h ⟩ + o_{h→0}(h)
             = f(x) + ⟨ ( ∂f/∂x_1(x), ..., ∂f/∂x_n(x) )ᵀ | h ⟩ + o_{h→0}(h)        (partial derivatives)
Remark. Let’s consider x : R → R such that x(u) = u for all u. Then we can easily check that
d_u x(h) = h. As this differential does not depend on u, we may simply write dx. (The dx that we
use here refers to the differential of u ↦ u, the identity mapping!) That’s why the following
expression has some meaning,

    d_x f(·) = ∂f/∂x(x) dx(·)

because

    d_x f(h) = ∂f/∂x(x) dx(h) = ∂f/∂x(x) h

In higher dimension, we write

    d_x f = Σ_{i=1}^n ∂f/∂x_i(x) dx_i
4 Jacobian: Generalization to f : R^n → R^m

For a function

    f : (x_1, ..., x_n) ↦ ( f_1(x_1, ..., x_n), ..., f_m(x_1, ..., x_n) )

we can apply the previous section to each f_i(x):

    f_i(x + h) = f_i(x) + d_x f_i(h) + o_{h→0}(h)
               = f_i(x) + ⟨∇_x f_i | h⟩ + o_{h→0}(h)
               = f_i(x) + ⟨ ∂f_i/∂x(x) | h ⟩ + o_{h→0}(h)
               = f_i(x) + ⟨ ( ∂f_i/∂x_1(x), ..., ∂f_i/∂x_n(x) )ᵀ | h ⟩ + o_{h→0}(h)

Stacking these m equations gives

    f(x + h) = f(x) + [ ∂f_1/∂x_1(x)  ···  ∂f_1/∂x_n(x)
                        ...                ...
                        ∂f_m/∂x_1(x)  ···  ∂f_m/∂x_n(x) ] · h + o(h)
             = f(x) + J(x) · h + o(h)

where J(x) is the Jacobian of f in x.
Example. Let g : R^3 → R^2 be defined by

    g((y_1, y_2, y_3)) = ( y_1 + 2y_2 + 3y_3,  y_1 y_2 y_3 )

Then

    J_g(y) = [ ∂(y_1 + 2y_2 + 3y_3)/∂y(y)ᵀ
               ∂(y_1 y_2 y_3)/∂y(y)ᵀ ]

           = [ ∂(y_1 + 2y_2 + 3y_3)/∂y_1(y)   ∂(y_1 + 2y_2 + 3y_3)/∂y_2(y)   ∂(y_1 + 2y_2 + 3y_3)/∂y_3(y)
               ∂(y_1 y_2 y_3)/∂y_1(y)         ∂(y_1 y_2 y_3)/∂y_2(y)         ∂(y_1 y_2 y_3)/∂y_3(y) ]

           = [ 1         2         3
               y_2 y_3   y_1 y_3   y_1 y_2 ]
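A small numpy check of this Jacobian at the made-up point y = (1, 2, 3):

    import numpy as np

    def g(y):
        return np.array([y[0] + 2 * y[1] + 3 * y[2], y[0] * y[1] * y[2]])

    def jacobian_g(y):
        # the analytic Jacobian derived above
        return np.array([[1.0, 2.0, 3.0],
                         [y[1] * y[2], y[0] * y[2], y[0] * y[1]]])

    y = np.array([1.0, 2.0, 3.0])
    eps = 1e-6
    # column j of the Jacobian is the change in g per unit change of y_j
    J_numeric = np.stack([
        (g(y + eps * np.eye(3)[j]) - g(y - eps * np.eye(3)[j])) / (2 * eps)
        for j in range(3)
    ], axis=1)
    print(np.allclose(jacobian_g(y), J_numeric, atol=1e-5))  # True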
5 Generalization to f : R^{n×p} → R

If f takes a matrix X ∈ R^{n×p} as input, we can always flatten X into a vector x ∈ R^{np} and
apply the previous sections, where

    ∇_x f = ( ∂f/∂x_1(x), ..., ∂f/∂x_{np}(x) )ᵀ

Now, we would like to give some meaning to the following equation:

    f(X + H) = f(X) + ⟨∇_X f | H⟩ + o(H)

The gradient of f with respect to a matrix X is a matrix of the same shape as X, defined by

    (∇_X f)_{ij} = ∂f/∂X_{ij}(X)

and one can check that the two formulations are equivalent:

    ⟨∇_x f | h⟩ = ⟨∇_X f | H⟩
    Σ_{i=1}^{np} ∂f/∂x_i(x) h_i = Σ_{i,j} ∂f/∂X_{ij}(X) H_{ij}
6 Generalization to f : R^{n×p} → R^m

Let’s generalize the generalization of the previous section. Applying the same idea as before, we
can write

    f(x + h) = f(x) + J(x) · h + o(h)

where the Jacobian is now a 3-dimensional array with entries

    J_{ijk}(x) = ∂f_i/∂X_{jk}(x)

Writing the 2d-dot product δ = J(x) · h ∈ R^m means that the i-th component of δ is (you can
apply the same idea to any number of dimensions!)

    δ_i = Σ_{j=1}^n Σ_{k=1}^p ∂f_i/∂X_{jk}(x) h_{jk}
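This 2d-dot product is exactly what np.tensordot computes; a small sketch with made-up shapes (m = 2 outputs, an n × p = 3 × 4 matrix input):

    import numpy as np

    m, n, p = 2, 3, 4
    J = np.random.randn(m, n, p)       # generalized Jacobian, J[i, j, k] = df_i / dX_jk
    H = np.random.randn(n, p)          # a small variation of the matrix input

    # delta_i = sum over j, k of J[i, j, k] * H[j, k]
    delta = np.tensordot(J, H, axes=([1, 2], [0, 1]))
    delta_loop = np.array([(J[i] * H).sum() for i in range(m)])
    print(delta.shape)                 # (2,)
    print(np.allclose(delta, delta_loop))  # True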
7 Chain-rule

Formal definition. Now let’s consider f : R^n → R^m and g : R^p → R^n. We want to compute
the differential of the composition h = f ∘ g such that h : x ↦ u = g(x) ↦ f(g(x)) = f(u), i.e.
d_x(f ∘ g). It can be shown that the differential is the composition of the differentials:

    d_x(f ∘ g) = d_{g(x)} f ∘ d_x g

Since composing linear maps amounts to multiplying their matrices, the Jacobian of h is the
matrix product of the Jacobians:

    J_h(x)_{ij} = Σ_{k=1}^n J_f(g(x))_{ik} · J_g(x)_{kj}
Example. Let’s keep our example function f : (x_1, x_2) ↦ 3x_1 + x_2² and our function
g : (y_1, y_2, y_3) ↦ ( y_1 + 2y_2 + 3y_3, y_1 y_2 y_3 ). The composition of f and g is
h = f ∘ g : R^3 → R:

    h((y_1, y_2, y_3)) = f(( y_1 + 2y_2 + 3y_3, y_1 y_2 y_3 ))
                       = 3(y_1 + 2y_2 + 3y_3) + (y_1 y_2 y_3)²

Its partial derivatives are

    ∂h/∂y_1(y) = 3 + 2 y_1 y_2² y_3²
    ∂h/∂y_2(y) = 6 + 2 y_2 y_1² y_3²
    ∂h/∂y_3(y) = 9 + 2 y_3 y_1² y_2²
Now let’s compute the same thing with the chain rule. For a scalar-valued function the Jacobian
is the transpose of the gradient, ∇_x fᵀ = J_f(x), so

    J_f(x) = ∇_x fᵀ = [ 3   2x_2 ]

and we computed above that

    J_g(y) = [ 1         2         3
               y_2 y_3   y_1 y_3   y_1 y_2 ]

Evaluating J_f at g(y), where x_2 = y_1 y_2 y_3, and multiplying the two Jacobians gives

    J_h(y) = J_f(g(y)) · J_g(y) = [ 3 + 2 y_1 y_2² y_3²   6 + 2 y_2 y_1² y_3²   9 + 2 y_3 y_1² y_2² ]

and taking the transpose we find the same gradient that we computed before!
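A numpy check of this chain-rule computation at the made-up point y = (1.0, 2.0, 0.5):

    import numpy as np

    def grad_h(y):
        # the partial derivatives of h derived above
        return np.array([3 + 2 * y[0] * y[1] ** 2 * y[2] ** 2,
                         6 + 2 * y[1] * y[0] ** 2 * y[2] ** 2,
                         9 + 2 * y[2] * y[0] ** 2 * y[1] ** 2])

    def J_f(x):
        return np.array([[3.0, 2 * x[1]]])                          # 1 x 2

    def J_g(y):
        return np.array([[1.0, 2.0, 3.0],
                         [y[1] * y[2], y[0] * y[2], y[0] * y[1]]])  # 2 x 3

    y = np.array([1.0, 2.0, 0.5])
    x = np.array([y[0] + 2 * y[1] + 3 * y[2], y[0] * y[1] * y[2]])  # g(y)
    J_h = J_f(x) @ J_g(y)                                           # 1 x 3 chain-rule product
    print(np.allclose(J_h.ravel(), grad_h(y)))                      # True: J_h is the gradient, transposed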
Important remark
• Note that the chain rule gives us a way to compute the Jacobian and not the gradient. However,
we showed that in the case of a function f : R^n → R, the Jacobian and the gradient are directly
identifiable, because ∇_x fᵀ = J_f(x). Thus, if we want to compute the gradient of a function by
using the chain-rule, the best way to do it is to compute the Jacobian.
• Keep in mind that the gradient must have the same shape as the variable against which we
derive, and that the notation ∂·/∂· is often ambiguous and can refer to either the gradient or the
Jacobian.