
Lecture 2: Backpropagation
Peter Bloem
Deep Learning
dlvu.github.io

Today's lecture will be entirely devoted to the backpropagation algorithm: the heart of all deep learning.

THE PLAN

part 1: review
part 2: scalar backpropagation
part 3: tensor backpropagation
part 4: automatic differentiation

In the first part, we will review the basics of neural networks. The rest of the lecture will mostly be about backpropagation: the algorithm that allows us to efficiently compute a gradient for the parameters of a neural net, so that we can train it using the gradient descent algorithm. The introductory lecture covered some of this material already, but we'll go over the basics again to set up the notation and visualization for the rest of the lecture.

In the second part we describe backpropagation in a scalar setting. That is, we will treat each individual element of the neural network as a single number, and simply loop over all these numbers to do backpropagation over the whole network. This simplifies the derivation, but it is ultimately a slow algorithm with a complicated notation.

In the third part, we translate neural networks to operations on vectors, matrices and tensors. This allows us to simplify our notation, and more importantly, massively speed up the computation of neural networks. Backpropagation on tensors is a little more difficult to do than backpropagation on scalars, but it's well worth the effort.

In the fourth part, we make the final leap from a manually worked out and implemented backpropagation system to full-fledged automatic differentiation: we will show you how to build a system that takes care of the gradient computation entirely by itself. This is the technology behind software like PyTorch and TensorFlow.

PART ONE: REVIEW

RECAP: NEURAL NETWORKS

We'll start with a quick recap of the basic principles behind neural networks. The name neural network is a bit of a historical artifact.

In the very early days of AI (the late 1950s), researchers decided to take a simple approach to AI. They started with a single brain cell: a neuron. A neuron receives multiple different signals from other cells through connections called dendrites. It processes these in a relatively simple way, deriving a single new signal, which it sends out through its single axon. The axon branches out so that the single signal can reach other cells.

image source: http://www.sciencealert.com/scientists-build-an-artificial-neuron-that-fully-mimics-a-human-brain-cell

1957: THE PERCEPTRON

y = w1x1 + w2x2 + b

These principles needed to be radically simplified to work with computers of that age, but doing so yielded one of the first successful machine learning systems: the perceptron.

The perceptron has a number of inputs, each of which is multiplied by a weight. These results are summed, together with a bias parameter, to provide the output of the perceptron. If we're doing binary classification, we can take the sign of the output as the class.

The bias parameter is often represented as a special input node, called a bias node, whose value is fixed to 1.

For most of you, this will be nothing new. This is simply a linear classifier or linear regression model. It just happens to be drawn as a network.

But the real power of the brain doesn't come from single neurons, it comes from chaining a large number of neurons together. Can we do the same thing with perceptrons: link the outputs of one perceptron to the inputs of the next in a large network, and so make the whole more powerful than any single perceptron?
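As a minimal sketch (not from the slides; the weights and inputs are made up for illustration), the perceptron's computation in Python:

```python
import numpy as np

def perceptron(x, w, b):
    # weighted sum of the inputs, plus the bias
    return np.dot(w, x) + b

w = np.array([1.0, -2.0])   # hypothetical weights
b = 0.5                     # hypothetical bias
x = np.array([3.0, 1.0])    # hypothetical input

y = perceptron(x, w, b)
print(y)            # 1.5
print(np.sign(y))   # take the sign for binary classification
```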

PROBLEM: COMPOSING NEURONS

This is where the perceptron turns out to be too simple an abstraction. Composing perceptrons (making the output of one perceptron the input of another) doesn't make them more powerful. All you end up with is something that is equivalent to another linear model. We're not creating models that can learn non-linear functions.

We've removed the bias node here for clarity, but that doesn't affect our conclusions: any composition of affine functions is itself an affine function.

y = 1(1x1 + 3x2) + 2(2x3 + 1x4) = 1x1 + 3x2 + 4x3 + 2x4

If we're going to build networks of perceptrons that do anything a single perceptron can't, we need another trick.

NONLINEARITY

The simplest solution is to apply a nonlinear function to each neuron, called the activation function. This is a scalar function we apply to the output of a perceptron after all the weighted inputs have been combined: y = σ(w1x1 + w2x2 + b).

One popular option (especially in the early days) is the logistic sigmoid, which we've seen already:

sigmoid(x) = 1 / (1 + e^-x)

Applying a sigmoid means that the sum of the inputs can range from negative infinity to positive infinity, but the output is always in the interval [0, 1].

Another, more recent nonlinearity is the linear rectifier, or ReLU nonlinearity:

r(x) = x if x > 0, 0 otherwise

This function just sets every negative input to zero, and keeps everything else the same.

Not using an activation function is also called using a linear activation.

If you're familiar with logistic regression, you've seen the sigmoid function already: it's stuck on the end of a linear regression function (that is, a perceptron) to turn the outputs into class probabilities. Now, we will take these sigmoid outputs, and feed them as inputs to other perceptrons.
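A minimal sketch of these two activation functions in Python (assuming numpy; not code from the lecture):

```python
import numpy as np

def sigmoid(x):
    # squashes any real number into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # sets negative inputs to zero, keeps everything else the same
    return np.maximum(0.0, x)

print(sigmoid(0.0))   # 0.5
print(relu(-3.0))     # 0.0
```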

FEEDFORWARD NETWORK (aka Multilayer Perceptron, MLP)

Using these nonlinearities, we can arrange single neurons into neural networks. Any arrangement makes a neural network, but for ease of training, the arrangement shown here was the most popular for a long time. It's called a feedforward network or multilayer perceptron. We arrange a layer of hidden units in the middle, each of which acts as a perceptron with a nonlinearity, connecting to all input nodes. Then we have one or more output nodes, connecting to all hidden units. Crucially:

• There are no cycles: the network feeds forward from input to output.
• Nodes in the same layer are not connected to each other, or to any other layer than the previous one.
• Each layer is fully connected to the previous layer: every node in one layer connects to every node in the layer before it.

In the 80s and 90s feedforward networks usually had just one hidden layer, because we hadn't figured out how to train deeper networks.

Note: every orange and blue line in this picture (input layer x1, x2; hidden layer h1, h2, h3; output layer y) represents one parameter of the model.
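To make the picture concrete, here is a hedged sketch of a forward pass through such a network, with one hidden layer of sigmoid units; the shapes and weight names are assumptions for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def feedforward(x, W, b, v, c):
    # hidden layer: every hidden unit is a perceptron with a nonlinearity
    h = sigmoid(W @ x + b)
    # output layer: a linear combination of all hidden units
    return v @ h + c

rng = np.random.default_rng(0)
W, b = rng.normal(size=(3, 2)), np.zeros(3)   # 2 inputs, 3 hidden units
v, c = rng.normal(size=3), 0.0                # 1 output
print(feedforward(np.array([1.0, -1.0]), W, b, v, c))
```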

REGRESSION

If we want to train a regression model (a model that predicts a numeric value), we put nonlinearities on the hidden nodes, and no activation on the output node. That way, the output can range from negative to positive infinity, and the nonlinearities on the hidden layer ensure that we can learn functions that a single perceptron couldn't learn (we can learn non-linear functions).

We can think of the first layer as learning some nonlinear transformation of the features (a feature extractor), and the second layer as performing linear regression on these derived features.

BINARY CLASSIFICATION

If we have a classification problem with two classes, called positive and negative, we can place a sigmoid activation on the output layer, so that the output is between 0 and 1. We can then interpret this as the probability p(C=Pos|x) that the input has the positive class (according to our network). The probability of the negative class is 1 minus this value.

As in the regression case, the first layer acts as a feature extractor, and the output layer now performs logistic regression on the derived features.

MULTI-CLASS CLASSIFICATION: THE SOFTMAX ACTIVATION

oi = wT h + b
yi = exp(oi) / Σj exp(oj)

For multi-class classification, we can use the softmax activation. We create a single output node for every class, and ensure that their values sum to one. We can then interpret this series of values as the class probabilities that our network predicts. The softmax activation ensures positive values that sum to one.

After the softmax we can interpret the output of node y3 as the probability that our input has class 3.

To compute the softmax, we simply take the exponent of each output node oi (to ensure that they are all positive) and then divide each by the total (to ensure that they sum to one). We could make the values positive in many other ways (taking the absolute or the square), but the exponent is a common choice for this sort of thing in statistics and physics, and it seems to work well enough.

The softmax activation is a little unusual: it's not element-wise like the sigmoid or the ReLU. To compute the value of one output node, it looks at the inputs of all the other output nodes.
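A minimal sketch of the softmax computation (assuming numpy; subtracting the maximum is a standard numerical-stability trick, not something the slide discusses, and it doesn't change the result):

```python
import numpy as np

def softmax(o):
    e = np.exp(o - np.max(o))   # exponents make every value positive
    return e / e.sum()          # dividing by the total makes them sum to one

p = softmax(np.array([1.0, 2.5, 0.3]))
print(p, p.sum())   # three positive values that sum to one
```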

HOW DO WE FIND GOOD WEIGHTS?

modelθ(x) = y
lossx,t(θ) = ||modelθ(x) - t||

Now that we know how to build a neural network, and how to compute its output for a given input, the next question is how do we train it? Given a particular network and a set of examples of which inputs correspond to which outputs, how do we find the weights for which the network makes good predictions?

To find good weights we first define a loss function. This is a function from a particular model (represented by its weights) to a single scalar value, called the model's loss. The better our model, the lower the loss. If we imagine a model with just two weights then the set of all models, called the model space, forms a plane. For every point in this plane, our loss function defines a loss. We can draw this above the plane as a surface: the loss surface (sometimes also called, more poetically, the loss landscape).

Our job is to search the loss surface for a low point.

Make sure you understand the difference between the model, a function from the inputs x to the outputs y in which the weights act as constants, and the loss function, a function from the weights to a loss value, in which the data acts as constants.

The symbol θ is a common notation referring to the set of all weights (sometimes combined into a vector, sometimes just a set).

This is a common way of summarizing the aim of machine learning. We have a large space of possible parameters, with θ representing a single choice, and in this space we want to find the θ for which the loss on our chosen dataset is minimized:

arg minθ lossdata(θ)

It turns out this is actually an oversimplification, and we don't want to solve this particular problem too well. We'll discuss this in the fourth lecture. For now, this serves as a good summary of what we're trying to do.

SOME COMMON LOSS FUNCTIONS

Here are some common loss functions for situations where we have examples (t) of what the model output (y) should be for a given input (x), with y = modelθ(x) throughout.

regression:
squared errors    ||y - t||² = Σi (yi - ti)²
absolute errors   ||y - t||₁ = Σi abs(yi - ti)

classification:
binary cross-entropy   -log pθ(t)   with t ∈ {0, 1}
cross-entropy          -log pθ(t)   with t ∈ {0, ..., K}
hinge loss             max(0, 1 - ty)   with t ∈ {-1, 1}

The squared error losses are derived from basic regression. The (binary) cross-entropy comes from logistic regression (as shown last lecture) and the hinge loss comes from support vector machine classification. You can find their derivations in most machine learning books/courses. We won't elaborate on them here, except to note that in all cases the loss is lower if the model output (y) is closer to the example output (t).

The loss can be computed for a single example or for multiple examples. In almost all cases, the loss for multiple examples is just the sum over all their individual losses.
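A hedged sketch of two of these losses for a single example (assuming numpy; illustrative only):

```python
import numpy as np

def squared_error(y, t):
    # sum of squared differences between model output and target
    return np.sum((y - t) ** 2)

def binary_cross_entropy(p, t):
    # p is the predicted probability of the positive class, t is 0 or 1
    return -np.log(p) if t == 1 else -np.log(1.0 - p)

print(squared_error(np.array([0.2, 0.8]), np.array([0.0, 1.0])))   # ~0.08
print(binary_cross_entropy(0.9, 1))   # small loss: confident and correct
```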

LOSS SURFACE

We want to follow the loss surface down to the lowest point.

DERIVATIVE

In one dimension, we know that the derivative of a function (like the loss) tells us how much the function increases or decreases if we take a step of size 1 to the right.

To be more precise, it tells us how much the best linear approximation of the function at a particular point, the tangent line, increases or decreases.

GRADIENT

∇θ lossx,t(θ) = (∂loss/∂w1, ∂loss/∂w2)

If our input space has multiple dimensions, like our model space, we can simply take the derivative with respect to each input, separately, treating the others as constants. This is called a partial derivative. The collection of all possible partial derivatives is called the gradient.

If we interpret the gradient as a vector, it points in the direction in which the function grows the fastest. Taking a step in the opposite direction means we are walking down the function.

In our case, this means that if we can work out the gradient of the loss (which contains the model and the data), then we can take a small step in the opposite direction and be sure that we are moving to a lower point on the loss surface.

The symbol for the gradient is a downward pointing triangle called a nabla. The subscript indicates the variable over which we are taking the derivatives. Note that in this case we are treating θ as a vector.

STOCHASTIC GRADIENT DESCENT

pick some initial weights θ (for the whole model)
loop: (epochs)
    for x, t in Data:
        θ ← θ - α · ∇θ lossx,t(θ)      (α is the learning rate)

stochastic gradient descent: loss over one example per step (loop over data)
minibatch gradient descent: loss over a few examples per step (loop over data)
full-batch gradient descent: loss over the whole data

This is the idea behind the gradient descent algorithm. We compute the gradient, take a small step in the opposite direction and repeat. The reason we take small steps is that the gradient is only the direction of steepest ascent locally. It's a linear approximation to the nonlinear loss function. The further we move from our current position, the worse an approximation the tangent hyperplane will be for the function that we are actually trying to follow. That's why we only take a small step, and then recompute the gradient in our new position.

We can compute the gradient of the loss with respect to a single example from our dataset, a small batch of examples, or over the whole dataset. These options are usually called stochastic, minibatch and full-batch gradient descent respectively (although minibatch and stochastic gradient descent are used interchangeably).

In deep learning, we almost always use minibatch gradient descent, but there are some cases where full batch is used.

Training usually requires multiple passes over the data. One such pass is called an epoch.
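A minimal sketch of this loop in Python. The model (a single weight), its gradient and the data are hypothetical placeholders, just to show the structure of the algorithm:

```python
def sgd(theta, data, grad_loss, alpha=0.01, epochs=10):
    # stochastic gradient descent: one example per update step
    for epoch in range(epochs):            # one pass over the data is an epoch
        for x, t in data:
            theta = theta - alpha * grad_loss(theta, x, t)   # step against the gradient
    return theta

# toy model y = theta * x with squared error loss (y - t)^2
grad = lambda theta, x, t: 2 * (theta * x - t) * x
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
print(sgd(0.0, data, grad, alpha=0.05, epochs=50))   # approaches 2.0
```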

RECAP

perceptron: linear combination of inputs
neural network: network of perceptrons, with scalar nonlinearities
training: (minibatch) gradient descent
But, how do we compute the gradient of a complex neural network?

This is the basic idea of neural network training. What we haven't discussed is how to work out the gradient of a loss function over a neural network. For simple functions like linear classifiers, this can be done by hand. For more complex functions, like very deep neural networks, this is no longer feasible, and we need some help. This help comes in the form of the backpropagation algorithm.

Next video: backpropagation.

Lecture 2: Backpropagation
Peter Bloem
Deep Learning
dlvu.github.io

PART TWO: SCALAR BACKPROPAGATION

How do we work out the gradient for a neural network?

For simple models, working out a gradient is usually done by hand, with pen and paper. This function is then transferred to code, and used in a gradient descent loop. But the more complicated our model becomes, the more complex it becomes to work out a complete formulation of the gradient.

Here is a diagram of the sort of network we'll be encountering (the GoogLeNet). We can't work out a complete gradient for this kind of architecture by hand. We need help.

GRADIENT COMPUTATION: THE SYMBOLIC APPROACH

Of course, working out partial derivatives is a pretty mechanical process. We could easily take all the rules we know, and put them into some algorithm. This is called symbolic differentiation, and it's what systems like Mathematica and Wolfram Alpha do for us.

Unfortunately, as you can see here, the derivatives it returns get pretty horrendous the deeper the neural network gets. This approach becomes impractical very quickly.

Note that in symbolic differentiation we get a description of the derivative that is independent of the input. We get a function that we can then feed any input to.
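For illustration, here is a small sketch of symbolic differentiation with sympy; the expression is just an arbitrary nested example, not the one from the slide:

```python
import sympy as sp

x = sp.symbols('x')
# a small "two layer" expression; deeper networks make this blow up quickly
expr = sp.tanh(3 * sp.tanh(2 * x + 1) + sp.tanh(-x))
print(sp.diff(expr, x))   # the derivative is already much longer than the function
```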

GRADIENT COMPUTATION: THE NUMERIC APPROACH

deriv ≈ (f(x + ε) - f(x)) / ε      (the "method of finite differences")

Another approach is to compute the gradient numerically, for instance by the method of finite differences: we take a small step ε and see how much the function changes. The amount of change divided by the step size is a good estimate for the gradient if ε is small enough.

Numeric approaches are sometimes used in deep learning, but it's very expensive to make them accurate if you have a large number of parameters.

Note that in the numeric approach, you only get an answer for a particular input. If you want to compute the gradient at some other point in space, you have to compute another numeric approximation. Compare this to the symbolic approach (either with pen and paper or through Wolfram Alpha) where once the differentiation is done, all we have to compute is the derivative that we've worked out.
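A minimal sketch of the finite-difference estimate, on a toy function whose derivative we know:

```python
def finite_difference(f, x, eps=1e-6):
    # the change in f over a tiny step, divided by the step size
    return (f(x + eps) - f(x)) / eps

f = lambda x: x ** 2                  # we know f'(x) = 2x
print(finite_difference(f, 3.0))      # roughly 6.0
```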

BACKPROPAGATION: THE MIDDLE GROUND

Work out parts of the derivative symbolically, chain these together in a numeric computation.
Secret ingredient: the chain rule.

Backpropagation is a kind of middle ground between symbolic and numeric approaches to working out the gradient. We break the computation into parts: we work out the derivatives of the parts symbolically, and then chain these together numerically.

The secret ingredient that allows us to make this work is the chain rule of differentiation.

THE CHAIN RULE

∂f(g(x))/∂x = ∂f(g(x))/∂g(x) · ∂g(x)/∂x

shorthand:    ∂f/∂x = ∂f/∂g · ∂g/∂x        (computation graph: x → g → f)

Here is the chain rule: if we want the derivative of a function which is the composition of two other functions, in this case f and g, we can take the derivative of f with respect to the output of g and multiply it by the derivative of g with respect to the input x.

Since we'll be using the chain rule a lot, we'll introduce a simple shorthand to make it a little easier to parse. We draw a little diagram of which function feeds into which. This means we know what the argument of each function is, so we can remove the arguments from our notation.

We call this diagram a computation graph. We'll stick with simple diagrams like this for now. At the end of the lecture, we will expand our notation a little bit to capture more detail of the computation.

INTUITION FOR THE CHAIN RULE

f(g) = sf g + bf      with sf = ∂f/∂g   (slope of f over g)
g(x) = sg x + bg      with sg = ∂g/∂x

f(g(x)) = sf (sg x + bg) + bf
        = sf sg x + sf bg + bf      (slope of f over x: sf sg = ∂f/∂g · ∂g/∂x; the rest is constant)

Since the chain rule is the heart of backpropagation, and backpropagation is the heart of deep learning, we should probably take some time to see why the chain rule is true at all.

If we imagine that f and g are linear functions, it's pretty straightforward to show that this is true. They may not be, of course, but the nice thing about calculus is that locally, we can treat them as linear functions (if they are differentiable). In an infinitesimally small neighbourhood f and g are exactly linear.

If f and g are locally linear, we can describe their behavior with a slope s and an additive constant b. The slopes, sf and sg, are simply the derivatives. The additive constants, we will show, can be ignored.

In this linear view, what the chain rule says is this: if we approximate f(x) as a linear function, then its slope is the slope of f as a function of g, times the slope of g as a function of x. To prove that this is true, we just write down f(g(x)) as linear functions, and multiply out the brackets.

Note that this doesn't quite count as a rigorous proof, but it's hopefully enough to give you some intuition for why the chain rule holds.

MULTIVARIATE CHAIN RULE

∂f/∂x = ∂f/∂g · ∂g/∂x + ∂f/∂h · ∂h/∂x

Since we'll be looking at some pretty elaborate computation graphs, we'll need to be able to deal with this situation as well: we have a computation graph, as before, but f depends on x through two different operations, g and h. How do we take the derivative of f over x?

The multivariate chain rule tells us that we can simply apply the chain rule along g, taking h as a constant, and sum it with the chain rule along h, taking g as a constant.

INTUITION FOR THE MULTIVARIATE CHAIN RULE

f(g, h) = s1 g + s2 h + bf      with s1 = ∂f/∂g, s2 = ∂f/∂h
g(x) = sg x + bg                with sg = ∂g/∂x
h(x) = sh x + bh                with sh = ∂h/∂x

f(g(x), h(x)) = s1 (sg x + bg) + s2 (sh x + bh) + bf
              = s1 sg x + s1 bg + s2 sh x + s2 bh + bf
              = (s1 sg + s2 sh) x + s1 bg + s2 bh + bf

The slope of f over x is s1 sg + s2 sh = ∂f/∂g · ∂g/∂x + ∂f/∂h · ∂h/∂x; the rest is constant.

We can see why this holds in the same way as before. The short story: since all functions can be taken to be linear, their slopes distribute out into a sum.

MULTIVARIATE CHAIN RULE

∂f/∂x = Σi ∂f/∂gi · ∂gi/∂x

If we have more than two paths from the input to the output (x → g1, ..., gN → f), we simply sum over all of them.

EXAMPLE

f(x) = 2 / sin(e^-x)

With that, we are ready to show how backpropagation works. We'll start with a fairly arbitrary function to show the principle before we move on to more realistic neural networks.

ITERATING THE CHAIN RULE

f(x) = 2 / sin(e^-x) = d(c(b(a(x))))

operations:                    computation graph:
d(c) = 2/c                     x → a → b → c → d
c(b) = sin b
b(a) = e^a
a(x) = -x

The first thing we do is to break up its functional form into a series of smaller operations. The entire function f is then just a chain of these small operations chained together. We can draw this in a computation graph as we did before.

Normally, we wouldn't break a function up in such small operations. This is just a simple example to illustrate the principle.

ITERATING THE CHAIN RULE

∂f/∂x = ∂d/∂x
      = ∂d/∂c · ∂c/∂x
      = ∂d/∂c · ∂c/∂b · ∂b/∂x
      = ∂d/∂c · ∂c/∂b · ∂b/∂a · ∂a/∂x

Now, to work out the derivative of f, we can iterate the chain rule. We apply it again and again, until the derivative of f over x is expressed as a long product of derivatives of operation outputs over their inputs.

GLOBAL AND LOCAL DERIVATIVES

∂f/∂x = ∂f/∂d · ∂d/∂x
      = ∂d/∂c · ∂c/∂b · ∂b/∂a · ∂a/∂x

global derivative: ∂f/∂x
local derivatives: ∂d/∂c, ∂c/∂b, ∂b/∂a, ∂a/∂x

We call the larger derivative of f over x the global derivative. And we call the individual factors, the derivatives of the operation outputs with respect to their inputs, the local derivatives.

BACKPROPAGATION

The BACKPROPAGATION algorithm:
• break your computation up into a sequence of operations (what counts as an operation is up to you)
• work out the local derivatives symbolically
• compute the global derivative numerically, by computing the local derivatives and multiplying them

This is how the backpropagation algorithm combines symbolic and numeric computation. We work out the local derivatives symbolically, and then work out the global derivative numerically.

WORK OUT THE LOCAL DERIVATIVES SYMBOLICALLY

f(x) = 2 / sin(e^-x)

operations:
d(c) = 2/c
c(b) = sin b
b(a) = e^a
a(x) = -x

∂f/∂x = ∂d/∂c · ∂c/∂b · ∂b/∂a · ∂a/∂x
      = -2/c² · cos b · e^a · -1

For each local derivative, we work out the symbolic derivative with pen and paper.

Note that we could fill in the a, b and c in the result, but we don't. We simply leave them as is. For the symbolic part, we are only interested in the derivative of the output of each sub-operation with respect to its immediate input.

The rest of the algorithm is performed numerically.

COMPUTE A FORWARD PASS (x = -4.499)

a = -x = 4.499
b = e^a = 90
c = sin b = 1
d = 2/c = 2

f(-4.499) = 2

Since we are now computing things numerically, we need a specific input, in this case x = -4.499. We start by feeding this through the computation graph. For each sub-operation, we store the output value. At the end, we get the output of the function f. This is called a forward pass: a fancy term for computing the output of f for a given input.

Note that at this point, we are no longer computing solutions in general. We are computing our function for a specific input. We will be computing the gradient for this specific input as well.

COMPUTE A FORWARD PASS (x = -4.499)

∂f/∂x = -2/c² · cos b · e^a · -1
      = -2/1² · cos 90 · e^4.499 · -1
      = -2 · 0 · 90 · -1
      = 0

Keeping all intermediate values from the forward pass in memory, we go back to our symbolic expression of the derivative. Here, we fill in the intermediate values a, b and c. After we do this, we can finish the multiplication numerically, giving us a numeric value of the gradient of f at x = -4.499. In this case, the gradient happens to be 0.
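To make the whole procedure concrete, here is a hedged sketch of this example in Python. Python's sin and cos work in radians, so the intermediate numbers differ from the idealized round values on the slide (where sin 90 is taken to be 1 and cos 90 to be 0), but the structure of the computation is the same, and the backpropagated gradient agrees with a finite-difference check:

```python
import math

def f(x):
    return 2 / math.sin(math.exp(-x))

def backprop(x):
    # forward pass: compute and store every intermediate value
    a = -x
    b = math.exp(a)
    c = math.sin(b)
    d = 2 / c
    # backward pass: fill the stored values into the local derivatives and multiply
    dd_dc = -2 / c ** 2
    dc_db = math.cos(b)
    db_da = math.exp(a)
    da_dx = -1.0
    return d, dd_dc * dc_db * db_da * da_dx

value, grad = backprop(-4.499)
eps = 1e-8
print(grad, (f(-4.499 + eps) - f(-4.499)) / eps)   # the two numbers agree closely
```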

BACKPROPAGATION Before we try this on a neural network, here are


the main things to remember about the
The BACKPROPAGATION algorithm: backpropagation algorithm.
• break your computation up into a sequence of operations
what counts as an operation is up to you Note that backpropagation by itself does not train
• work out the local derivatives symbolically. a neural net. It just provides a gradient. When
• compute the global derivative numerically
people say that they trained a network by
by computing the local derivatives and multiplying them backpropagation, that's actually shorthand for
• Much more accurate than finite differences training the network by gradient descent, with the
only source of inaccuracy is the numeric computation of the operations. gradients worked out by backpropagation.
• Much faster than symbolic differentiation
The backward pass has (broadly) the same complexity as the forward.

39
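As a concrete illustration of this recipe, here is a minimal Python sketch (my own, not from the slides) for the running example f(x) = 2/sin(e^-x): a forward pass that stores the intermediate values, followed by the product of symbolically worked-out local derivatives, checked against a finite-difference estimate. Evaluating sin and cos exactly in radians gives different numbers than the rounded values on the slide, but the point is that the two estimates agree closely with each other.

import math

def forward(x):
    a = -x
    b = math.exp(a)
    c = math.sin(b)
    d = 2 / c
    return a, b, c, d                        # keep the intermediate values

def gradient(x):
    a, b, c, d = forward(x)
    # local derivatives, worked out symbolically, multiplied together
    return (-2 / c**2) * math.cos(b) * math.exp(a) * -1

x = -4.499
eps = 1e-6
finite_diff = (forward(x + eps)[-1] - forward(x - eps)[-1]) / (2 * eps)
print(gradient(x), finite_diff)              # the two estimates agree closely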

BACKPROPAGATION IN A FEEDFORWARD NETWORK

(diagram: inputs x1, x2 → linear parts k1, k2, k3 → nonlinearity → h1, h2, h3 → output y;
 y is compared to the target t to compute the loss l)

40

To explain how backpropagation works in a neural
network, we extend our neural network diagram a
little bit, to make it closer to the actual
computation graph we’ll be using.

First, we separate the hidden node into the result
of the linear operation ki and the application of
the nonlinearity hi. Second, since we’re interested
in the derivative of the loss rather than the output
of the network, we extend the network with one
more step: the computation of the loss (over one
example to keep things simple). In this final step,
the output y of the network is compared to the
target value t from the data, producing a loss
value.

Note that the model is now just a subgraph of


the computation graph. You can think of t as
another input node, like x1 and x2, but one to
which the model doesn’t have access.

BACKPROPAGATION IN A FEEDFORWARD NETWORK

∂l/∂v2 ,  ∂l/∂w12

l = (y - t)²
y = v1 h1 + v2 h2 + v3 h3 + bh
h2 = 1 / (1 + exp(-k2))
k2 = w12 x1 + w22 x2 + bx

41

We want to work out the gradient of the loss over
the parameters. We’ll isolate two parameters, v2 in
the second layer, and w12 in the first, and see how
backpropagation operates.

First, we have to break the computation of the loss
into operations. If we take the graph on the left to
be our computation graph, then we end up with
the operations on the right.

To simplify things, we’ll compute the loss over only
one example. We’ll use a simple squared error
loss.

BACKPROPAGATION IN A FEEDFORWARD NETWORK

l = (y - t)²
y = v1 h1 + v2 h2 + v3 h3 + bh
h2 = 1 / (1 + exp(-k2))
k2 = w12 x1 + w22 x2 + bx

∂l/∂v2 = ∂l/∂y · ∂y/∂v2
       = 2(y - t) · h2

42

For the derivative with respect to v2, we’ll only
need these two operations. Anything below
doesn’t affect the result.

To work out the derivative we apply the chain rule,
and work out the local derivatives symbolically.

BACKPROPAGATION IN A FEEDFORWARD NETWORK

(forward pass values on the diagram: y = 10.1, t = 12.1, loss = 4, h2 = .99)

∂l/∂v2 = ∂l/∂y · ∂y/∂v2
       = 2(y - t) · h2
       = 2(10.1 - 12.1) · .99 = -3.96

v2 ← v2 - α · -3.96

43

We then do a forward pass with some values. We
get an output of 10.1, which should have been
12.1, so our loss is 4. We keep all intermediate
values in memory.

We then take our product of local derivatives, fill
in the numeric values from the forward pass, and
compute the derivative over v2.

When we apply this derivative in a gradient
descent update, v2 changes as shown below.

BACKPROPAGATION IN A FEEDFORWARD NETWORK

l = (y - t)²
y = v1 h1 + v2 h2 + v3 h3 + bh
h2 = 1 / (1 + exp(-k2))
k2 = w12 x1 + w22 x2 + bx

∂l/∂w12 = ∂l/∂y · ∂y/∂h2 · ∂h2/∂k2 · ∂k2/∂w12
        = 2(y - t) · v2 · h2(1 - h2) · x1

derivative of sigmoid: σ’(x) = σ(x)(1 - σ(x))

44

Let’s try something a bit earlier in the network:
the weight w12. We add two operations, apply the
chain rule and work out the local derivatives.
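If you want to double-check a hand-derived chain-rule product like the one above, a symbolic algebra package can do it. Below is a small sketch assuming sympy is available; the variable names mirror the slide, and only the part of y that actually depends on w12 is included.

import sympy as sp

x1, x2, w12, w22, bx, v2, bh, t = sp.symbols('x1 x2 w12 w22 bx v2 bh t')

k2 = w12 * x1 + w22 * x2 + bx
h2 = 1 / (1 + sp.exp(-k2))
y = v2 * h2 + bh                             # only the part of y that depends on w12
l = (y - t) ** 2

hand_derived = 2 * (y - t) * v2 * h2 * (1 - h2) * x1
print(sp.simplify(sp.diff(l, w12) - hand_derived))   # 0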

BACKPROPAGATION

∂l/∂w12 = ∂l/∂y · ∂y/∂h2 · ∂h2/∂k2 · ∂k2/∂w12
        = 2(y - t) · v2 · h2(1 - h2) · x1

the partial products along the way give us the other derivatives:
∂l/∂y  = 2(y - t)
∂l/∂h2 = 2(y - t) · v2
∂l/∂k2 = 2(y - t) · v2 · h2(1 - h2)

45

Note that when we’re computing the derivative
for w12, we are also, along the way, computing the
derivatives for y, h2 and k2.

This is useful when it comes to implementing
backpropagation. We can walk backward down the
computation graph and compute the derivative of
the loss for every node. For the nodes below, we
just multiply the local gradient. This means we can
very efficiently compute any derivatives we need.

In fact, this is where the name backpropagation
comes from: the derivative of the loss propagates
down the network in the opposite direction to
the forward pass. We will show this more precisely
in the last part of this lecture.

BACKPROPAGATION

The BACKPROPAGATION algorithm:


• break your computation up into a sequence of operations
what counts as an operation is up to you

• work out the local derivatives symbolically.


• compute the global derivative numerically
by computing the local derivatives and multiplying them

• Walk backward from the loss, accumulating the derivatives.

46

BUILDING SOME INTUITION

(diagram: the network from before, with the values of a forward pass filled in:
 loss 100, output 10, target 0)

47

To finish up, let’s see if we can build a little
intuition for what all these accumulated
derivatives mean.

Here is a forward pass for some weights and some
inputs. Backpropagation starts with the loss, and
walks down the network, figuring out at each step
how every value contributed to the result of the
forward pass. Every value that contributed
positively to a positive loss should be lowered,
every value that contributed positively to a
negative loss should be increased, and so on.

IMAGINE THAT y IS A PARAMETER

loss 100, y = 10, t = 0:
y ← y - α · 2(y - t) = y - α · 20

loss 100, y = 0, t = 10:
y ← y - α · 2(y - t) = y + α · 20

48

We’ll start with the first value below the loss: y. Of
course, this isn’t a parameter of the network, but
let’s imagine for a moment that it is. What would
the gradient descent update rule look like if we try
to update y?

If the output is 10, and it should have been 0, then
gradient descent on y tells us to lower the output
of the network. If the output is 0 and it should
have been 10, GD tells us to increase the value of
the output.

Even though we can’t change y directly, this is the
effect we want to achieve: we want to change the
values we can change so that we achieve this
change in y. To figure out how to do that, we take
this gradient for y, and propagate it back down the
network.
Note that even though these scenarios have the
same loss (because of the square), the derivative
of the loss has a different sign, so we can tell
whether the output is bigger than the target or
the other way around.

Instead of changing y, we have to change the
values that influenced y. Here we see what that
looks like for the weights of the second layer.

(hidden values on the diagram: h1 = 0.1, h2 = 0.5, h3 = 0.9)

v2 ← v2 - α · 2(y - t) · h2 = v2 - α · 20 · 0.5

v1 ← v1 - α · 20 · 0.1
v2 ← v2 - α · 20 · 0.5
v3 ← v3 - α · 20 · 0.9

49

Note that the current value of the weight doesn’t
factor into the update. Only how much influence
the weight had on the value of y in the forward
pass. The higher the activation of the source node,
the more the weight gets adjusted.

Note also how the sign of the derivative wrt
y is taken into account. Here, the model output
was too high, so the more a weight contributed to
the output, the more it gets "punished" by being
lowered. If the output had been too low, the
opposite would be true, and v3 would be increased

NEGATIVE ACTIVATION

(tanh activation; hidden values on the diagram: h1 = 0.1, h2 = -0.5, h3 = 0.9)

v1 ← v1 - α · 20 · 0.1
v2 ← v2 + α · 20 · 0.5
v3 ← v3 - α · 20 · 0.9

50

The sigmoid activation we’ve used so far allows
only positive values to emerge from the hidden
layer. If we switch to an activation that also allows
negative activations (like a linear activation or a
tanh activation), we see that backpropagation very
naturally takes the sign into account.

In this case, we want to update in such a way that
y decreases, but we note that the weight v2 is
multiplied by a negative value. This means that
(for this instance) v2 contributes negatively to the
loss, and its value should be increased.

Note that the sign of v2 itself doesn’t matter.


Whether it’s positive or negative, its value should
increase.

We use the same principle to work our way back
down the network. If we could change the output
of the second node h2 directly, this is how we’d do
it.

h2 ← h2 - α · 2(y - t) · v2 = h2 + α · 20 · 3

51

Note that we now take the value of v2 to be a
constant. We are working out partial derivatives,
so when we are focusing on one parameter, all
the others are taken as constant.

Remember that we want to decrease the output
of the network. Since v2 makes a negative
contribution to the loss, we can achieve this by
increasing the activation of the source node v2 is
multiplied by.

Moving down to k2, remember that the derivative
of the sigmoid is the output of the sigmoid times 1
minus that output.

k2 ← k2 - α · 2(y - t) · v2 · h2(1 - h2)

(plot: σ(x), 1 - σ(x), and their product σ(x)(1 - σ(x)))

52

We see here that in the extreme regimes, the
sigmoid is resistant to change. The closer to 1 or 0
we get, the smaller the weight update becomes.

This is actually a big downside of the sigmoid
activation, and one of the big reasons it was
eventually replaced by the ReLU as the default
choice for hidden units. We’ll come back to this in
lecture 4.

Nevertheless, this update rule tells us what the
change is to k2 that we want to achieve by
changing the values we can actually change
(the weights of layer 1).
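A quick numerical sketch (not from the slides) of this effect: the local derivative σ(k)(1 - σ(k)) of the sigmoid shrinks rapidly as the pre-activation moves away from zero, so the update to k2 shrinks with it.

import math

def sigmoid(k):
    return 1 / (1 + math.exp(-k))

for k in [0.0, 2.0, 5.0, 10.0]:
    s = sigmoid(k)
    print(k, round(s, 6), round(s * (1 - s), 6))
# at k = 0 the local derivative is 0.25; at k = 10 it is roughly 0.000045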

Finally, we come to the weights of the first layer.

w12 ← w12 - α · 2(y - t) · v2 · h2(1 - h2) · x1
    = w12 - α · 20 · -3 · 1/4 · -3
    = w12 - α · 20 · 3 · 1/4 · 3

53

As before, we want the output of the network to
decrease. To achieve this, we want h2 to increase
(because v2 is negative). However, the input x1 is
negative, so we should decrease w12 to increase
h2. This is all beautifully captured by the chain
rule: the two negatives of x1 and v2 cancel out and
we get a positive value which we subtract from
w12.

FORWARD PASS IN PSEUDOCODE

for j in [1 … 3]:
    for i in [1 … 2]:
        k[j] += w[i,j] * x[i]
    k[j] += b[j]

for i in [1 … 3]:
    h[i] = sigmoid(k[i])

for i in [1 … 3]:
    y += h[i] * v[i]
y += c

loss = (y - t) ** 2

54

To finish up, let's look at how you would implement
this in code. Here is the forward pass: computing
the model output and the loss, given the inputs
and the target value.

Assume that k and y are initialized with 0s.
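Here is a runnable version of this pseudocode, as a sketch. The example weights and inputs below are made up; the slide doesn't fix any specific values here.

import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# two inputs, three hidden units, one output (made-up example values)
x = [1.0, -1.0]
t = 0.5                                      # target
w = [[0.1, 0.2, 0.3],                        # w[i][j]: from input i to hidden j
     [-0.3, 0.2, -0.1]]
b = [0.0, 0.0, 0.0]
v = [0.5, -0.5, 0.5]
c = 0.0

k = [0.0, 0.0, 0.0]
for j in range(3):
    for i in range(2):
        k[j] += w[i][j] * x[i]
    k[j] += b[j]

h = [sigmoid(k[j]) for j in range(3)]

y = 0.0
for i in range(3):
    y += h[i] * v[i]
y += c

loss = (y - t) ** 2
print(y, loss)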

BACKWARD PASS IN PSEUDOCODE

y' = 2 * (y - t)  # the error

for i in [1 … 3]:
    v'[i] = y' * h[i]
    h'[i] = y' * v[i]
c' = y'

for i in [1 … 3]:
    k'[i] = h'[i] * h[i] * (1 - h[i])

for j in [1 … 3]:
    for i in [1 … 2]:
        w'[i, j] = k'[j] * x[i]
    b'[j] = k'[j]

55

And here is the backward pass. We compute
gradients for every node in the network,
regardless of whether the node represents a
parameter. When we do the gradient descent
update, we'll use the gradients of the parameters,
and ignore the rest.

Note that we don’t implement the derivations
from slide 44 directly. Instead, we work backwards
down the neural network: computing the
derivative of each node as we go, by taking the
derivative of the loss over the outputs and
multiplying it by the local derivative.
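Continuing the runnable sketch from the forward-pass slide, here is the backward pass and one gradient descent step; the trailing underscore plays the role of the prime in the pseudocode.

y_ = 2 * (y - t)                             # the error

v_ = [y_ * h[i] for i in range(3)]
h_ = [y_ * v[i] for i in range(3)]
c_ = y_

k_ = [h_[i] * h[i] * (1 - h[i]) for i in range(3)]

w_ = [[k_[j] * x[i] for j in range(3)] for i in range(2)]
b_ = [k_[j] for j in range(3)]

# one step of gradient descent on the parameters only
alpha = 0.1
v = [v[i] - alpha * v_[i] for i in range(3)]
c = c - alpha * c_
w = [[w[i][j] - alpha * w_[i][j] for j in range(3)] for i in range(2)]
b = [b[j] - alpha * b_[j] for j in range(3)]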

RECAP

Backpropagation: method for computing derivatives.

Combined with gradient descent to train NNs.

Middle ground between symbolic and numeric computation.


• Break computation up into operations.
• Work out local derivatives symbolically.
• Work out global derivative numerically.

56

Lecture 2: Backpropagation

Peter Bloem
Deep Learning

dlvu.github.io

We’ve seen what neural networks are, how to


train them by gradient descent and how to
compute that gradient by backpropagation.
In order to scale up to larger and more complex
PART THREE: TENSOR BACKPROPAGATION structures, we need to make our computations
as efficient as possible, and we’ll also need to
Expressing backpropagation in vector, matrix and tensor operations. simplify our notation. There’s one insight that we
are going to get a lot of mileage out of.

IT’S ALL JUST LINEAR ALGEBRA

f(x) = V(Wx + b) + c

[w11 w21]            [b1]   [k1]
[w12 w22] × [x1]  +  [b2] = [k2]      h1 = σ(k1), h2 = σ(k2), h3 = σ(k3)
[w13 w23]   [x2]     [b3]   [k3]

[v1 v2 v3] × [h1 h2 h3]ᵀ + c = y

59

When we look at the computation of a neural
network, we can see that most operations can be
expressed very naturally in those of linear algebra.

The multiplication of weights by their inputs is a
multiplication of a matrix of weights by a vector of
inputs.

The addition of a bias is the addition of a vector of
bias parameters.

The nonlinearities can be implemented as simple
element-wise operations.

This perspective buys us two things. First…

Our notation becomes extremely simple. We can


describe the whole operation of a neural network
with one simple equation. This expressiveness will
be sorely needed when we move to more
complicated networks.

f(x) = V (Wx + b) + c

MATRIX MULTIPLICATION

# k, y, h, etc. are zero'd arrays
for j in [1 … 3]:
    for i in [1 … 2]:
        k[j] += w[i,j] * x[i]
    k[j] += b[j]

for i in [1 … 3]:
    h[i] = sigmoid(k[i])

for i in [1 … 3]:
    y += h[i] * v[i]
y += c

loss = (y - t) ** 2

61

The second reason is that the biggest
computational bottleneck in a neural network is
the matrix multiplication. This operation is more
than quadratic while everything else is linear.

Matrix multiplication (and other tensor operations
like it) can be parallelized and implemented
efficiently but it’s a lot of work. Ideally, we’d like to
let somebody else do all that work (like the
implementers of numpy) and then just call their
code to do the matrix multiplications.

This is especially important if you have access to a
GPU. A matrix multiplication can easily be
offloaded to the GPU for much faster processing,
but for a loop over an array, there’s no benefit.

This is called vectorizing: taking a piece of code


written in for loops, and getting rid of the loops
by expressing the function as a sequence of linear
algebra operations.
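A small numpy sketch of what this looks like in practice: the double loop and the single vectorized call compute the same pre-activations, but the vectorized version hands the work to optimized library code. (Here W is stored as an outputs-by-inputs matrix, which differs from the w[i,j] indexing used in the pseudocode above.)

import numpy as np

n_in, n_out = 2, 3
W = np.random.randn(n_out, n_in)             # note: (outputs, inputs) here
b = np.random.randn(n_out)
x = np.random.randn(n_in)

# loop-based
k_loop = np.zeros(n_out)
for i in range(n_out):
    for j in range(n_in):
        k_loop[i] += W[i, j] * x[j]
    k_loop[i] += b[i]

# vectorized
k_vec = W @ x + b

print(np.allclose(k_loop, k_vec))            # True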

VECTORIZING Without vectorized implementations, deep


learning would be painfully slow, and GPUs would
Express everything as operations on vectors, matrices and tensors. be useless. Turning a naive, loop-based
Get rid of all the loops
implementation into a vectorized one is a key skill
for DL researchers and programmers.
Makes the notation simpler.
Makes the execution faster.

62

FORWARD

l = (y - t)²                k = w.dot(x) + b
y = Vh + c                  h = sigmoid(k)
h = σ(k)                    y = v.dot(h) + c
k = Wx + b                  l = (y - t) ** 2

63

Here’s what the vectorized forward pass looks like
as a computation graph, in symbolic notation and
in pseudocode.

When you implement this in numpy it’ll look
almost the same.
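For completeness, here is a runnable numpy sketch of this vectorized forward pass, with arbitrary shapes. Since the next slides generalize to a vector output y and vector target t, the squared errors are summed into a single scalar loss.

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

n_in, n_hid, n_out = 2, 3, 2                 # arbitrary sizes
W = np.random.randn(n_hid, n_in)
b = np.random.randn(n_hid)
V = np.random.randn(n_out, n_hid)
c = np.random.randn(n_out)
x = np.random.randn(n_in)
t = np.random.randn(n_out)

k = W.dot(x) + b
h = sigmoid(k)
y = V.dot(h) + c
l = ((y - t) ** 2).sum()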
BUT WHAT ABOUT THE BACKWARD PASS?

l = (y - t)²
y = Vh + c                  ∂l/∂W =? ∂l/∂y · ∂y/∂h · ∂h/∂k · ∂k/∂W
h = σ(k)
k = Wx + b

64

Of course, we lose a lot of the benefit of
vectorizing if the backward pass isn’t also
expressed in terms of matrix multiplications. How
do we vectorize the backward pass? That's the
question we'll answer in the rest of this video.
Can we do something like this?

On the left, we see the forward pass of our loss
computation as a set of operations on vectors and
matrices. (We’ve generalized things by giving the
network a vector output y and vector target t).

To generalize backpropagation to this view, we
might ask if something similar to the chain rule
exists for vectors and matrices. Firstly, can we
define something analogous to the derivative of
one matrix over another, and secondly, can we

break this apart into a product of local derivatives,


possibly giving us a sequence of matrix
multiplications?
The answer is yes, there are many ways of
applying calculus to vectors and matrices, and
there are many chain rules available in these
domains. However, things can quickly get a little
hairy, so we need to tread carefully.

GRADIENTS, JACOBIANS, ETC

f(A) = B,   ∂B/∂A : derivatives of every element of B over every element of A

for instance: f([a1, a2, a3]ᵀ) = [b1, b2]ᵀ

     [ ∂b1/∂a1  ∂b2/∂a1 ]
Jf = [ ∂b1/∂a2  ∂b2/∂a2 ]
     [ ∂b1/∂a3  ∂b2/∂a3 ]

arrange derivatives in a:

                          function returns a
                          scalar    vector    matrix
 input is a   scalar      scalar    vector    matrix
              vector      vector    matrix    ?
              matrix      matrix    ?         ?

65

The derivatives of high-dimensional objects are
easily defined. We simply take the derivative of
every output over every input, and we arrange all
possibilities into some logical shape. For instance,
if we have a vector-to-vector function, the natural
analog of the derivative is a matrix with all the
partial derivatives in it.

However, once we get to matrix/vector operations
or matrix/matrix operations, the only way to
logically arrange every input with every output is a
tensor of higher dimension than 2.

NB: We don’t normally apply the differential


symbol to non-scalars like this. We’ll introduce
better notation later.

As we see, even if we could come up with this kind
of chain rule, one of the local derivatives is now a
vector over a matrix. The result could only be
represented in a 3-tensor. There are two problems
with this:

∂l/∂W =? ∂l/∂y · ∂y/∂h · ∂h/∂k · ∂k/∂W      <- uh oh

66

- If the layer has n inputs and n outputs, we are
  computing n³ derivatives, even though we were
  only ultimately interested in n² of them (the
  derivatives of W). In the scalar algorithm we
  only ever had two nested loops (an n²
  complexity), and we only ever stored one
  gradient for one node in the computation
  graph. Now we suddenly have n³ complexity
  and n³ memory use. We were supposed to
  make things faster.
- We can easily represent a 3-tensor, but there’s
  no obvious, default way to multiply a 3-tensor
  with a matrix, or with a vector (in fact there are
  many different ways). The multiplication of the
  chain rule becomes very complicated this way.

In short, we need a more careful approach.

SIMPLIFYING ASSUMPTIONS

1) The computation graph always has a single, scalar output: l
2) We are only ever interested in the derivative of l.

∂l/∂W:   l is always a scalar, W can be a tensor of any dimension

                          function returns a
                          scalar    vector    matrix
 input is a   scalar      scalar    vector    matrix
              vector      vector    matrix    ?
              matrix      matrix    ?         ?
              3-tensor    3-tensor  ?         ?

67

To work out how to do this, we make the
following simplifying assumptions.

(diagram: inputs → computation of model → outputs → computation of loss → loss;
 the model is a subgraph of the full computation graph)

Note that this doesn’t mean we can only ever train
neural networks with a single scalar output. Our
neural networks can have any number of outputs
of any shape and size. However, the loss we define
over those outputs needs to map them all to a
single scalar value. The computation graph is
always the model, plus the computation of the
loss.

THE GRADIENT

We will call ∂l/∂W the gradient of l (with respect to W)

Commonly written as ∇W l

Nonstandard assumption: ∇W l has the same shape as W

[∇W l]ijk = ∂l/∂Wijk

We call this derivative the gradient of W. This is a
common term, but we will deviate from the
standard approach in one respect.

Normally, the gradient is defined as a row vector,
so that it can operate on a space of column
vectors by matrix multiplication.

In our case, we are never interested in multiplying
the gradient by anything. We only ever want to
sum the gradient of l wrt W with the original
matrix W in a gradient update step. For this reason
we define the gradient as having the same shape
as the tensor with respect to which we are taking
the derivative.

In the example shown, W is a 3-tensor. The


gradient of l wrt W has the same shape as W, and
at element (i, j, k) it holds the scalar derivative of l
wrt Wijk.
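This convention is also what automatic differentiation libraries use in practice. A minimal PyTorch sketch (assuming torch is installed): the gradient of a scalar loss is stored in a tensor with exactly the same shape as the parameter it belongs to.

import torch

W = torch.randn(4, 3, 2, requires_grad=True)   # a 3-tensor "parameter"
l = (W ** 2).sum()                              # some scalar loss
l.backward()

print(W.grad.shape)                             # torch.Size([4, 3, 2]), same shape as W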

With these rules, we can use tensors of any shape


and dimension and always have a well defined
gradient.

NEW NOTATION: THE GRADIENT FOR W

∇W l = W∇        (l is always the same; W is the actual object of interest)

b∇ = ∇b l = (∂l/∂b1, …, ∂l/∂bn)ᵀ

y∇ = ∇y l = ∂l/∂y

The standard gradient notation isn’t very suitable
for our purposes. It puts the loss front and center,
but that will be the same for all our gradients. The
object that we’re actually interested in is relegated
to a subscript. Also, it isn’t very clear from the
notation what the shape is of the tensor that
we’re looking at.

For this reason, we’ll introduce a new notation.
This isn’t standard, so don’t expect to see it
anywhere else, but it will help to clarify our
mathematics a lot as we go forward.

We’ve put the W front and center, so it’s clear that
the result of taking the gradient is also a matrix,
and we’ve removed the loss, since we’ve assumed
that we are always taking the gradient of the loss.
The notation works the same for vectors and even
for scalars. This is the gradient of l with respect to
W. Since l never changes, we’ll refer to this as “the
gradient for W”.

We can think of the nabla as an operator like a
transposition or taking an inverse. It turns a matrix
into another matrix, a vector into another vector
and so on.

When we refer to a single element of the gradient,
we will follow our convention for matrices, and
use the non-bold version of its letter.

[W∇]ij = ∂l/∂Wij = W∇ij

Element (i, j) of the gradient of W is the same as
the gradient for element (i, j) of W.

To denote this element we follow the convention
we have for elements of tensors, that we use the
same letter as we use for the matrix, but non-bold.
TENSOR BACKPROPAGATION This gives us a good way of thinking about the


gradients, but we still don’t have a chain rule to
Work out scalar derivatives first, then vectorize. base our backpropagation on.
Use the multivariate chain rule to derive the scalar derivative.

The main trick we will use is to stick to scalar


derivatives as much as possible.
Apply the chain rule step by step.
Start at the loss and work backward, accumulating the gradients.
Once we have worked out the derivative in purely
scalar terms (on pen and paper), we will then find
a way to vectorize their computation.

72

GRADIENT FOR y

\[
y^\nabla_i = \frac{\partial l}{\partial y_i}
= \frac{\partial \sum_k (y_k - t_k)^2}{\partial y_i}
= \sum_k \frac{\partial (y_k - t_k)^2}{\partial y_i}
= \sum_k 2(y_k - t_k)\,\frac{\partial (y_k - t_k)}{\partial y_i}
= 2(y_i - t_i)
\]

\[
\mathbf{y}^\nabla = 2 \begin{pmatrix} y_1 - t_1 \\ \vdots \\ y_n - t_n \end{pmatrix} = 2\,(\mathbf{y} - \mathbf{t})
\]

73

We start simple: what is the gradient for y? This is a vector, because y is a vector. Let’s first work out the derivative of the i-th element of y. This is purely a scalar derivative, so we can simply use the rules we already know. We get 2(y_i - t_i) for that particular derivative.

Then, we just re-arrange all the derivatives for y_i into a vector, which gives us the complete gradient for y.

The final step requires a little creativity: we need to figure out how to compute this vector using only basic linear algebra operations on the given vectors and matrices. In this case it’s not so complicated: we get the gradient for y by element-wise subtracting t from y and multiplying by 2.

We haven’t needed any chain rule yet, because our computation graph for this part has only one edge.
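To make the vectorized form concrete, here is a minimal numpy sketch (my own, not from the slides; the numbers are made up) that computes y∇ = 2(y − t) and checks one element against a finite-difference estimate:

import numpy as np

y = np.array([0.2, 0.9, 0.4])
t = np.array([0.0, 1.0, 0.0])

def loss(y):
    return np.sum((y - t) ** 2)

y_grad = 2 * (y - t)            # the vectorized gradient we derived

# finite-difference sanity check of one element
eps = 1e-6
e0 = np.zeros_like(y); e0[0] = eps
numeric = (loss(y + e0) - loss(y - e0)) / (2 * eps)
print(y_grad[0], numeric)       # both approximately 0.4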

GRADIENT FOR V

\[
V^\nabla_{ij} = \frac{\partial l}{\partial V_{ij}}
= \sum_k \frac{\partial l}{\partial y_k}\frac{\partial y_k}{\partial V_{ij}}
= \sum_k y^\nabla_k \frac{\partial y_k}{\partial V_{ij}}
= \sum_k y^\nabla_k \frac{\partial [\mathbf{V}\mathbf{h} + \mathbf{c}]_k}{\partial V_{ij}}
= \sum_k y^\nabla_k \frac{\partial \left(\sum_l V_{kl} h_l + c_k\right)}{\partial V_{ij}}
= y^\nabla_i \frac{\partial V_{ij} h_j}{\partial V_{ij}}
= y^\nabla_i h_j
\]

\[
\mathbf{V}^\nabla = \begin{pmatrix} & \vdots & \\ \cdots & y^\nabla_i h_j & \cdots \\ & \vdots & \end{pmatrix} = \mathbf{y}^\nabla \mathbf{h}^T
\]

74

Let’s move one step down and work out the gradient for V. We start with the scalar derivative for an arbitrary element V_ij.

Now, we need a chain rule, to backpropagate through y. However, since we are sticking to scalar derivatives until we’re ready to vectorize, we can simply use the scalar multivariate chain rule. This tells us that however many intermediate values we have, we can work out the derivative for each, keeping the others constant, and sum the results. This gives us the sum in the second equality.

We’ve worked out the gradient y∇ already, so we can fill that in and focus on the derivative of y_k over V_ij. The value of y_k is defined in terms of linear algebra operations like matrix multiplication, but with a little thinking we can always rewrite these as simple scalar sums.

In the end we find that the derivative of y_k over V_ij reduces to the value of h_j. This tells us the value of every element (i, j) of V∇.

All that's left is to figure out how to compute this in a vectorized way. In this case, we can compute V∇ as the outer product of the gradient for y, which we've computed already, and the vector h, which we can save during the forward pass.
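As a quick check of the vectorized form, here is a small numpy sketch (my own, with made-up values) showing that the outer product y∇ hᵀ matches the element-wise expression y∇_i h_j:

import numpy as np

y_grad = np.array([0.4, -0.2, 0.6])   # gradient for y, computed earlier
h = np.array([0.1, 0.9])              # hidden activations saved in the forward pass

V_grad = np.outer(y_grad, h)          # V∇ = y∇ hᵀ, shape (3, 2)

# element (i, j) equals y_grad[i] * h[j]
assert V_grad[2, 1] == y_grad[2] * h[1]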

RECIPE

To work out a gradient X∇ for some X:

• Write down a scalar derivative of the loss wrt one element of X.

• Use the multivariate chain rule to sum over all outputs.

• Work out the scalar derivative.

• Vectorize the computation of X∇ in terms of the original inputs.

75

GRADIENT FOR h

\[
h^\nabla_i = \frac{\partial l}{\partial h_i}
= \sum_k \frac{\partial l}{\partial y_k}\frac{\partial y_k}{\partial h_i}
= \sum_k y^\nabla_k \frac{\partial [\mathbf{V}\mathbf{h} + \mathbf{c}]_k}{\partial h_i}
= \sum_k y^\nabla_k \frac{\partial \left(\sum_l V_{kl} h_l + c_k\right)}{\partial h_i}
= \sum_k y^\nabla_k \frac{\partial V_{ki} h_i}{\partial h_i}
= \sum_k y^\nabla_k V_{ki}
\]

\[
\mathbf{h}^\nabla = \begin{pmatrix} \vdots \\ \sum_k y^\nabla_k V_{ki} \\ \vdots \end{pmatrix}
= \left( (\mathbf{y}^\nabla)^T \mathbf{V} \right)^T = \mathbf{V}^T \mathbf{y}^\nabla
\]

76

Since this is an important principle to grasp, let’s keep going until we get to the other set of parameters, W. We’ll leave the biases as an exercise.

For the gradient for h, most of the derivation is the same as before, until we get to the point where the scalar derivative is reduced to a matrix multiplication. Unlike the previous derivation, the sum over k doesn't disappear. We need to take it into the description of the gradient for h.

To vectorize this, we note that each element of this vector is a dot product of y∇ and a column of V. This means that if we premultiply y∇ (transposed) by V, we get the required result as a row vector. Transposing the result (to make it a column vector) is equivalent to post-multiplying y∇ by the transpose of V.

Note the symmetry of the forward and the backward operation.
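A minimal numpy sketch (my own, not from the slides) of this symmetry: the forward pass multiplies by V, the backward pass multiplies by Vᵀ.

import numpy as np

V = np.array([[0.5, -1.0],
              [2.0,  0.3],
              [0.1,  0.7]])
h = np.array([0.1, 0.9])
c = np.zeros(3)

y = V @ h + c                  # forward: multiply by V
y_grad = np.array([0.4, -0.2, 0.6])
h_grad = V.T @ y_grad          # backward: multiply by V transposed

print(h_grad.shape)            # (2,), the same shape as h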

GRADIENT FOR k

\[
k^\nabla_i = \frac{\partial l}{\partial k_i}
= \sum_m \frac{\partial l}{\partial h_m}\frac{\partial h_m}{\partial k_i}
= \sum_m h^\nabla_m \frac{\partial \sigma(k_m)}{\partial k_i}
= h^\nabla_i \frac{\partial \sigma(k_i)}{\partial k_i}
= h^\nabla_i \,\sigma(k_i)(1 - \sigma(k_i))
= h^\nabla_i \, h_i (1 - h_i)
\]

(using the derivative of the sigmoid: σ’(x) = σ(x)(1 - σ(x)))

\[
\mathbf{k}^\nabla = \mathbf{h}^\nabla \odot \sigma'(\mathbf{k}) = \mathbf{h}^\nabla \odot \mathbf{h} \odot (1 - \mathbf{h})
\]

77

Now, at this point, when we analyze k, remember that we already have the gradient over h. This means that we no longer have to apply the chain rule to anything above h. We can draw this scalar computation graph, and work out the local gradient for k in terms of the gradient for h.

Given that, working out the gradient for k is relatively easy, since the operation from k to h is an element-wise one.
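A one-line numpy sketch (my own) of the element-wise case, reusing h = σ(k) saved during the forward pass:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

k = np.array([0.3, -1.2])
h = sigmoid(k)                     # saved in the forward pass
h_grad = np.array([0.5, -0.1])     # gradient for h, computed earlier

k_grad = h_grad * h * (1 - h)      # k∇ = h∇ ⊙ h ⊙ (1 − h)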

GRADIENT FOR W

\[
W^\nabla_{ij} = \frac{\partial l}{\partial W_{ij}}
= \sum_m \frac{\partial l}{\partial k_m}\frac{\partial k_m}{\partial W_{ij}}
= \sum_m k^\nabla_m \frac{\partial [\mathbf{W}\mathbf{x} + \mathbf{b}]_m}{\partial W_{ij}}
= \sum_m k^\nabla_m \frac{\partial \left(\sum_l W_{ml} x_l + b_m\right)}{\partial W_{ij}}
= k^\nabla_i \frac{\partial W_{ij} x_j}{\partial W_{ij}}
= k^\nabla_i x_j
\]

\[
\mathbf{W}^\nabla = \begin{pmatrix} & \vdots & \\ \cdots & k^\nabla_i x_j & \cdots \\ & \vdots & \end{pmatrix} = \mathbf{k}^\nabla \mathbf{x}^T
\]

78

Finally, the gradient for W. The situation here is exactly the same as we saw earlier for V (matrix in, vector out, matrix multiplication), so we should expect the derivation to have the same form (and indeed it does).

PSEUDOCODE: BACKWARD

y’ = 2 * (y - t)
v’ = y’.mm(h.T)
h’ = v.t().mm(y’)
k’ = h’ * sigmoid(k) * (1 - sigmoid(k))
w’ = k’.mm(x.T)

79

Here’s the backward pass that we just derived. We’ve left the derivatives of the bias parameters out. You’ll have to work these out to implement the first homework exercise.
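Putting the whole derivation together, here is a self-contained numpy sketch (my own, with made-up sizes) of the forward and backward pass for this two-layer network; the bias gradients are left out, as in the slides.

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 1))        # input, column vector
t = rng.standard_normal((2, 1))        # target
W, b = rng.standard_normal((4, 3)), np.zeros((4, 1))
V, c = rng.standard_normal((2, 4)), np.zeros((2, 1))

# forward pass
k = W @ x + b
h = sigmoid(k)
y = V @ h + c
l = np.sum((y - t) ** 2)

# backward pass, as derived above
y_grad = 2 * (y - t)
V_grad = y_grad @ h.T                  # y∇ hᵀ
h_grad = V.T @ y_grad                  # Vᵀ y∇
k_grad = h_grad * h * (1 - h)          # h∇ ⊙ h ⊙ (1 − h)
W_grad = k_grad @ x.T                  # k∇ xᵀ
# gradients for b and c: left as an exercise, as in the slides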
RECAP

Vectorizing makes notation simpler, and computation faster.

Vectorizing the forward pass is usually easy.

Backward pass: work out the scalar derivatives, accumulate, then vectorize.

80


PART FOUR: AUTOMATIC DIFFERENTIATION

Letting the computer do all the work.

We’ve simplified the computation of derivatives a lot. All we’ve had to do is work out local derivatives and chain them together. We can go one step further: if we let the computer keep track of our computation graph, and provide some backwards functions for basic operations, the computer can work out the whole backpropagation algorithm for us. This is called automatic differentiation (or sometimes autograd).
THE STORY SO FAR

y’ = 2 * (y - t)
v’ = y’.mm(h.T)
h’ = v.t().mm(y’)
k’ = h’ * sigmoid(k) * (1 - sigmoid(k))
w’ = k’.mm(x.T)

pen and paper                      in the computer

83
THE IDEAL

pen and paper:

y’ = 2 * (y - t)

in the computer:

k = w.dot(x) + b
h = sigmoid(k)
y = v.dot(h) + c
l = (y - t) ** 2

l.backward() # start backprop

This is what we want to achieve. We work out on pen and paper the local derivatives of various modules, and then we chain these modules together in a computation graph in our code. The computer keeps the graph in memory, and can automatically work out the backpropagation.
AUTOMATIC DIFFERENTIATION

many inputs, few outputs: work backward  <- deep learning
few inputs, many outputs: work forward

85

This kind of algorithm is called automatic differentiation. What we’ve been doing so far is called backward, or reverse mode automatic differentiation. This is efficient if you have few output nodes. Note that if we had two output nodes, we’d need to do a separate backward pass for each.

If you have few inputs, it’s more efficient to start with the inputs and apply the chain rule working forward. This is called forward mode automatic differentiation.

Since we assume we only have one output node (where the gradient computation is concerned), we will always use reverse mode.

THE PLAN

Tensors

Building computation graphs

Working out backward functions

86

TENSORS (AKA MULTIDIMENSIONAL ARRAYS)

scalar, vector, matrix, …
0-tensor, 1-tensor (shape: 3), 2-tensor (shape: 3x2), 3-tensor (shape: 2x3x2), 4-tensor (shape: 2x3x2x4)

87

The basic datastructure of our system will be the tensor. A tensor is a generic name for a family of datastructures that includes a scalar, a vector, a matrix and so on.

There is no good way to visualize a 4-dimensional structure. We will occasionally use this form to indicate that there is a fourth dimension along which we can also index the tensor.

We will assume that whatever data we work with (images, text, sounds) will in some way be encoded into one or more tensors, before it is fed into our system. Let’s look at some examples.

A CLASSIFICATION TASK AS TWO TENSORS

X = 181 46     y = 0
    181 50         0
    166 44         1
    171 38         1
    152 36         1
    156 40         1
    167 40         1
    170 45         0
    178 50         0
    191 50         0
    166 38         1

88

A simple dataset, with numeric features, can simply be represented as a matrix. For the labels, we usually create a separate corresponding vector.

Any categoric features or labels should be converted to numeric features (normally by one-hot coding).
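As an illustration of the one-hot coding mentioned above, here is a small numpy sketch (my own) that converts a categorical label vector into a one-hot matrix:

import numpy as np

y = np.array([0, 0, 1, 1, 1])          # class indices
num_classes = 2

one_hot = np.zeros((len(y), num_classes))
one_hot[np.arange(len(y)), y] = 1.0    # one row per instance, a single 1 per row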

AN IMAGE AS A 3-TENSOR

(figure: an RGB image as a stack of three channel matrices, with axes height, width and channels; one pixel is shown with values 0.1, 0.5, 0.0)

89 image source: http://www.adsell.com/scanning101.html

Images can be represented as 3-tensors. In an RGB image, the color of a single pixel is represented using three values between 0 and 1 (how red it is, how green it is and how blue it is). This means that an RGB image can be thought of as a stack of three color channels, represented by matrices. This stack forms a 3-tensor.

A DATASET OF IMAGES

90

If we have a dataset of images, we can represent this as a 4-tensor, with dimensions indexing the instances, their width, their height and their color channels respectively. Below is a snippet of code showing that when you load the CIFAR10 image data in Keras, you do indeed get a 4-tensor like this.

There is no agreed standard ordering for the dimensions in a batch of images. Tensorflow and Keras use (batch, height, width, channels), whereas Pytorch uses (batch, channels, height, width). (You can remember the latter with the mnemonic “bachelor chow”.)
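The original snippet is an image in the slides; a sketch of the same check in Keras (assuming tensorflow is installed) would look roughly like this:

import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
print(x_train.shape)   # (50000, 32, 32, 3): batch, height, width, channels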

TENSORS IN MEMORY: ROW MAJOR ORDERING

A = 5 3 1        Tensor A
    2 6 0        data: 5 3 1 2 6 0
                 shape: (2, 3)

row-major ordering

91

It’s important to realize that even though we think of tensors as multidimensional arrays, in memory, they are necessarily laid out as a single line of numbers. The Tensor object knows the shape that these numbers should take, and uses that to compute whatever we need it to compute.

We can do this in two ways: scanning along the rows first and then the columns is called row-major ordering. This is what numpy and pytorch both do. The other option is (unsurprisingly) called column-major ordering. In higher dimensional tensors, the dimensions further to the right are always scanned before the dimensions to their left.

Imagine looping over all elements in a tensor with shape (a, b, c, d) in four nested loops. If you want to loop over the elements in the order they are in memory, then the loop over d would be the innermost loop, then c, then b and then a.

This allows us to perform some operations very cheaply, but we have to be careful.
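A quick way to see the row-major layout (a small sketch of my own, using the matrix from the slide):

import numpy as np

A = np.array([[5, 3, 1],
              [2, 6, 0]])

print(A.ravel())      # [5 3 1 2 6 0]: the rows are scanned first
print(A.flat[3])      # 2: the fourth number in memory order starts the second row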

RESHAPE/VIEW

A = 5 3 1        Tensor A
    2 6 0        data: 5 3 1 2 6 0
                 shape: (2, 3)

A’ = 5 3         Tensor A’
     1 2         data: (shared with A)
     6 0         shape: (3, 2)

92

Here is one such cheap operation: a reshape. This changes the shape of the tensor, but not the data. To do this cheaply, we can create a new tensor object with the same data as the old one, but with a different shape.

RESHAPE/VIEW

A = 5 3 1        Tensor A
    2 6 0        data: 5 3 1 2 6 0
                 shape: (2, 3)

a’ = 5 3 1 2 6 0   Tensor a’
                   data: (shared with A)
                   shape: (6, )

93
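A small pytorch sketch (my own) showing that a view really does share the underlying data:

import torch

a = torch.tensor([[5, 3, 1],
                  [2, 6, 0]])
b = a.view(3, 2)     # new shape, same storage

b[0, 0] = 99
print(a[0, 0])       # tensor(99): changing the view changes the original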

EXAMPLE: NORMALIZE COLORS

images = load_images(…)
n, h, w, c = images.size()

images = images.reshape(n*h*w, c)
images = images / images.max(dim=0).values
images = images.reshape(n, h, w, c)

# this makes no sense, but it is allowed.
images = images.reshape(n*h, w*c)
images = images.reshape(h, n, c, w)

94

Imagine that we wanted to scale a dataset of images in such a way that over the whole dataset, each color channel has a maximal value of 1 (independent of the other channels).

To do this, we need a 3-vector of the maxima per color channel. Pytorch only allows us to take the maximum in one direction, but we can reshape the array so that all the directions we’re not interested in are collapsed into one dimension. Afterwards, we can reverse the reshape to get our image dataset back.

We have to be careful: the last three lines in this slide form a perfectly valid reshape, but the c dimension in the result does not contain our color values. In general, you’re fine if you collapse dimensions that are next to each other, and uncollapse them in the same order.

TRANSPOSE

A = 5 3 1        Tensor A
    2 6 0        data: 5 3 1 2 6 0
                 shape: (2, 3)

AT = 5 2         Tensor AT
     3 6         parent: A
     1 0         shape: reverse(parent.shape)

                 def get(i, j):
                     return parent.get(j, i)

95

A transpose operation can also be achieved cheaply. In this case, we just keep a reference to the old tensor, and whenever a user requests the element (i, j) we return (j, i) from the original tensor.

NB: This isn’t precisely how pytorch does this, but the effect is the same: we get a transposed tensor in constant time, by viewing the same data in memory in a different way.
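In pytorch you can check this sharing directly; a tiny sketch (my own):

import torch

a = torch.tensor([[5, 3, 1],
                  [2, 6, 0]])
at = a.t()                               # transpose in constant time

print(at.shape)                          # torch.Size([3, 2])
print(at.data_ptr() == a.data_ptr())     # True: same memory, viewed differently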




SLICE

A = 5 3 1        Tensor A
    2 6 0        data: 5 3 1 2 6 0
                 shape: (2, 3)

A[:, :2] = 5 3   Tensor A’
           2 6   parent: A
                 shape: (2, 2)

                 def get(i, j):
                     return parent.get(i, j)

96

Even slicing can be accomplished by referring to the original data.

NON-CONTIGUOUS

Contiguous tensor: the data are directly laid out in row-major ordering for the shape, with no gaps or shuffling required.

x = torch.randn(2, 3)
x = x[:, 1:]
x.view(6)
Traceback (most recent call last):
…
RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.

x.contiguous(): make contiguous (by copying the data).
x.view(): reshape if possible without making contiguous.
x.reshape(): reshape, call contiguous() if necessary.

97

For some operations, however, the data needs to be contiguous. That is, the tensor data in memory needs to be one uninterrupted string of data in row-major ordering with no gaps. If this isn’t the case, pytorch will throw an exception like this.

You can fix this by calling .contiguous() on the tensor. The price you pay is the linear time and space complexity of copying the data.
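A short sketch (my own) of the fix described above:

import torch

x = torch.randn(2, 3)
x = x[:, 1:]                    # a slice: still the old storage, so not contiguous

print(x.is_contiguous())        # False
y = x.contiguous().view(4)      # copy the data, then the view is allowed
z = x.reshape(4)                # or let reshape decide whether a copy is needed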

Tensors

Building computation graphs

Working out backward functions

98

COMPUTATION GRAPHS

Operation nodes, Tensor nodes

bipartite: only op-to-tensor or tensor-to-op edges
tensor nodes: no input or one input, multiple outputs
operation nodes: multiple inputs and outputs

99

We’ll need to be a little more precise in our notations. From now on we’ll draw computation graphs like this, making the operation explicit. (There doesn’t seem to be a standard notation. This will work for our purposes.)

In tensorflow operations are called ops, and in pytorch they’re called functions.

FEEDFORWARD NETWORK

(figure: the MLP k = Wx + b, h = σ(k), y = Vh + c, l = loss(y, t) drawn as a computation graph with explicit op nodes (×, +, σ, loss); a shorthand version without explicit tensor nodes is shown alongside)

100

As an example, here is our MLP expressed in our new notation.

Just as before, it’s up to us how granular we make the computation. Here, we wrap the whole computation of the loss in a single operation, but we could also separate out the subtraction and the squaring.

If there’s a uniform direction to the computation, we’ll leave out the arrowheads for clarity, but you should always think of a computation graph as a directional one. All connections have a definite direction.

When we draw intricate computation graphs, we may connect an op node to another op node, without explicitly drawing the tensor node in between. This should be understood as a shorthand for a proper bipartite graph. Similarly, we may occasionally leave out the op node between two tensor nodes if the operation is clear from context.

RESTATING OUR ASSUMPTIONS

One output node l with a scalar value.

All gradients are of l, with respect to the tensors on the tensor nodes.

101

We hold on to the same assumptions we had before. Without these two assumptions things would become a lot more complicated.

IMPLEMENTATIONS

TensorNode:                        OpNode:
  value: <Tensor>                    inputs: List<TensorNode>
  gradient: <Tensor>                 outputs: List<TensorNode>
  source: <OpNode>                   op: <Op>

102

To store a computation graph in memory, we need three objects.

A TensorNode object holds the tensor value at that node. It holds the gradient of the tensor at that node (to be filled by the backpropagation algorithm) and it holds a pointer to the Operation Node that produced it (which can be null for leaf nodes).

An OpNode object represents an instance of a particular operation being applied in the computation. It stores the particular op being used (multiplication, addition, etc.) and the inputs and the outputs.
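A minimal Python sketch (my own, with the names and fields as on the slide) of these two data structures:

from dataclasses import dataclass, field
from typing import Any, List, Optional

@dataclass
class TensorNode:
    value: Any                           # the tensor stored at this node
    gradient: Any = None                 # filled in by the backpropagation algorithm
    source: Optional["OpNode"] = None    # the OpNode that produced it (None for leaf nodes)

@dataclass
class OpNode:
    op: Any                              # the particular op being applied (multiplication, addition, ...)
    inputs: List[TensorNode] = field(default_factory=list)
    outputs: List[TensorNode] = field(default_factory=list)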

DEFINING AN OPERATION

f(A) = B

forward_f : A ↦ B
backward_f : B∇ ↦ A∇

class Op:

  def forward(context, inputs):
    # given the inputs, compute the outputs
    # (context: anything we need to store for the backward pass)
    ...

  def backward(context, outputs_gradient):
    # given the gradient of the loss wrt the outputs,
    # compute the gradient of the loss wrt the inputs
    ...

103

We also need to implement the computation of the operation itself. This is done in two functions (in python these are usually class functions or static functions).

The function forward computes the outputs given the inputs (just like any function in any programming language).

The function backward takes the gradient for the outputs (the gradient of the loss wrt the outputs) and produces the gradient for the inputs.

Both functions are also given a context object.


This is a data structure (a dictionary or a list) to
which the forward can add any value which it
needs to save for the backward.
Note that the backward function does not
compute the local derivative: it computes the
accumulated gradient of the loss over the inputs
(given the accumulated gradient of the loss over
the outputs).
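As an illustration (my own sketch, not from the slides), here is what an element-wise multiplication op could look like in this style, using the context to save what the backward needs:

import numpy as np

class Multiply:
    """Element-wise multiplication: S = A * B."""

    @staticmethod
    def forward(context, a, b):
        context['a'], context['b'] = a, b   # save the inputs for the backward pass
        return a * b

    @staticmethod
    def backward(context, goutput):
        a, b = context['a'], context['b']
        # gradient of the loss wrt each input, given the gradient wrt the output
        return goutput * b, goutput * a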

BACKPROPAGATION

build a computation graph, perform a forward pass, bottom-up from the leaves
traverse the tree backwards breadth-first
call backward for each OpNode
  breadth-first ensures that the output gradients are known
add the computed gradients to the tensor nodes
  sum them with any gradients already present

104

This is the algorithm for backpropagation in this setting. We chain together these modules into a computation graph, and perform a forward pass to compute a loss. We then walk backward from the loss node breadth-first to compute the gradients for each tensor node. Because we walk breadth-first, we ensure that we’ve always computed the gradients for the outputs of each OpNode we encounter already, and we can simply call its backward to compute the gradients for its inputs.

Note the last point: some tensor nodes will be inputs to multiple operation nodes. By the multivariate chain rule, we should sum the gradients they get from all the OpNodes they feed into. We can achieve this easily by just initializing the gradients to zero and adding the gradients we compute to any that are already there.
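A compact sketch (my own) of this breadth-first backward walk. It assumes the TensorNode/OpNode structures from the slides, plus a context dict stored on each OpNode and ops whose backward returns one gradient per input (as a tuple); a full implementation must also ensure an OpNode is only processed once all of its outputs’ gradients are complete.

import numpy as np
from collections import deque

def backprop(loss_node):
    loss_node.gradient = np.ones_like(loss_node.value)     # dl/dl = 1

    queue = deque([loss_node])
    while queue:
        node = queue.popleft()
        opnode = node.source
        if opnode is None:                                  # leaf: an input or a parameter
            continue
        output_grads = [out.gradient for out in opnode.outputs]
        input_grads = opnode.op.backward(opnode.context, *output_grads)
        for inp, grad in zip(opnode.inputs, input_grads):
            if inp.gradient is None:
                inp.gradient = np.zeros_like(inp.value)
            inp.gradient = inp.gradient + grad              # accumulate: multivariate chain rule
            queue.append(inp)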


BUILDING COMPUTATION GRAPHS IN CODE

Lazy execution: build your graph, compile it, feed data through.

Eager execution: perform forward pass, keep track of computation graph.

105

There are two common strategies for constructing the computation graph: lazy and eager execution.

EXAMPLE: LAZY EXECUTION

a = TensorNode()
b = TensorNode()

c = a * b

m = Model(
    in=(a, b),
    loss=c)

m.train((1, 2))

(figure: the resulting graph; tensor nodes a, b and c around a × op node, with their value and grad fields still empty)

106

Here’s one example of how a lazy execution API might look.

Note that when we’re building the graph, we’re not telling it which values to put at each node. We’re just defining the shape of the computation, but not performing the computation itself.

When we create the model, we define which nodes are the input nodes, and which node is the loss node. We then provide the input values and perform the forward and backward passes.

BUILDING THE COMPUTATION GRAPH: LAZY EXECUTION

Tensorflow 1.x default, Keras default
• Define the computation graph.
• Compile it.
• Iterate backward/forward over the data.

Fast. Many possibilities for optimization. Easy to serialise models. Easy to make training parallel.
Difficult to debug. Model must remain static during training.

107

In lazy execution, which was the norm during the early years of deep learning, we build the computation graph, but we don’t yet specify the values on the tensor nodes. When the computation graph is finished, we define the data, and we feed it through.

This is fast since the deep learning system can optimize the graph structure during compilation, but it makes models hard to debug: if something goes wrong during training, you probably made a mistake while defining your graph, but you will only get an error while passing data through it. The resulting stack trace will never point to the part of your code where you made the mistake.

EXAMPLE: EAGER EXECUTION

a = TensorNode(value=1)
b = TensorNode(value=2)

c = a * b

c.backward()

(figure: the resulting graph; c has value 2 and source mult(a, b), a and b hold values 1 and 2, and all grad fields are still empty)

108

In eager mode deep learning systems, we create a node in our computation graph (a TensorNode) by specifying what data it should contain. The result is a tensor object that stores both the data, and the gradient over that data (which will be filled later).

Here we create the variables a and b. If we now apply an operation to these, for instance to multiply their values, the result is another variable c. Languages like python allow us to overload the * operator, so it looks like we’re just computing a multiplication, but behind the scenes, we are creating a computation graph that records all the computations we’ve done.


We compute the data stored in c by running the computation immediately, but we also store references to the variables that were used to create c, and the operation that created it. We create the computation graph on the fly, as we compute the forward pass.

Using this graph, we can perform the backpropagation from a given node that we designate as the loss node (node c in this case). We work our way down the graph computing the derivative of each variable with respect to c. At the start the TensorNodes do not have their grad’s filled in, but at the end of the backward, all gradients have been computed.

Once the gradients have been computed, and the gradient descent step has been performed, we clear the computation graph. It's rebuilt from scratch for every forward pass.
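A sketch (my own) of how the * overloading described above can build the graph on the fly; a full system would store a proper OpNode here, as in the earlier slides:

class TensorNode:
    def __init__(self, value):
        self.value = value
        self.gradient = None
        self.source = None                       # how this node was produced, if by an operation

    def __mul__(self, other):
        out = TensorNode(self.value * other.value)   # run the computation immediately
        out.source = ('mult', self, other)           # ...and record it on the fly
        return out

a = TensorNode(1)
b = TensorNode(2)
c = a * b
print(c.value, c.source[0])   # 2 mult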

BUILDING THE COMPUTATION GRAPH: EAGER EXECUTION

PyTorch, Tensorflow 2.0 default, Keras option
• Build the computation graph on the fly during the forward pass.

Easy to debug: problems in the model occur as the module is executing.
Flexible: the model can be entirely different from one forward to the next.
More difficult to optimize. A little more difficult to serialize.

109

In eager execution, we simply execute all operations immediately on the data, and collect the computation graph on the fly. We then execute the backward and ditch the graph we’ve collected.

This makes debugging much easier, and allows for more flexibility in what we can do in the forward pass. It can, however, be a little difficult to wrap your head around. Since we’ll be using pytorch later in the course, we’ll show you how eager execution works step by step.

Tensors

Building computation graphs

Working out backward functions

110

WORKING OUT THE BACKWARD FUNCTION

class Plus(Op):

  def forward(context, a, b):
    # a, b are matrices of the same size
    return a + b

  def backward(context, goutput):
    return goutput, goutput

\[
A^\nabla_{ij} = \frac{\partial l}{\partial A_{ij}}
= \sum_{kl} \frac{\partial l}{\partial S_{kl}}\frac{\partial S_{kl}}{\partial A_{ij}}
= \sum_{kl} S^\nabla_{kl} \frac{\partial [A + B]_{kl}}{\partial A_{ij}}
= \sum_{kl} S^\nabla_{kl} \frac{\partial (A_{kl} + B_{kl})}{\partial A_{ij}}
= S^\nabla_{ij} \frac{\partial A_{ij}}{\partial A_{ij}}
= S^\nabla_{ij}
\]

\[
\mathbf{A}^\nabla = \mathbf{S}^\nabla \qquad \mathbf{B}^\nabla = \mathbf{S}^\nabla
\]

111

The final ingredient we need is a large collection of operations with backward functions worked out. We’ll show how to do this for a few examples.

First, an operation that sums two matrices element-wise. The recipe is the same as we saw in the last part:

1) Write out the scalar derivative for a single element.
2) Use the multivariate chain rule to sum over all outputs.
3) Vectorize the result.

Note that when we draw the computation graph, we can think of everything that happens between S and l as a single module: we are already given the gradient of l over S, so it doesn’t matter if it’s one operation or a million.

Again, we can draw the computation graph at any granularity we like: very fine individual operations like summing and multiplying, or very coarse-grained operations like entire NNs.

For this operation, the context object is not needed.
