Lecture 2: Backpropagation
Peter Bloem
Deep Learning
dlvu.github.io
THE PLAN

part 1: review
part 2: scalar backpropagation
part 3: tensor backpropagation
part 4: automatic differentiation

In the first part, we will review the basics of neural networks. The rest of the lecture will mostly be about backpropagation: the algorithm that allows us to efficiently compute a gradient for the parameters of a neural net, so that we can train it using the gradient descent algorithm. The introductory lecture covered some of this material already, but we'll go over the basics again to set up the notation and visualization for the rest of the lecture.

In the second part, we describe backpropagation in a scalar setting. That is, we will treat each individual element of the neural network as a single number, and simply loop over all these numbers to do backpropagation over the whole network. This simplifies the derivation, but it is ultimately a slow algorithm with a complicated notation.

In the third part, we translate neural networks to operations on vectors, matrices and tensors. This allows us to simplify our notation and, more importantly, massively speed up the computation of neural networks. Backpropagation on tensors is a little more difficult to do than backpropagation on scalars, but it's well worth the effort.

In the fourth part, we make the final leap from manually worked-out and implemented backpropagation to full-fledged automatic differentiation: we will show you how to build a system that takes care of the gradient computation entirely by itself. This is the technology behind software like PyTorch and TensorFlow.
y = 1(1x1 + 3x2) + 2(2x3 + 1x4) = 1x1 + 3x2 + 4x3 + 2x4

A composition of linear functions is just another linear function. So if we're going to build networks of perceptrons that do anything a single perceptron can't, we need another trick.
NONLINEARITY

The simplest solution is to apply a nonlinear function to each neuron, called the activation function. This is a scalar function we apply to the output of a perceptron after all the weighted inputs have been combined:

y = σ(w1 x1 + w2 x2 + b)

One popular option (especially in the early days) is the logistic sigmoid, which we've seen already:

sigmoid(x) = 1 / (1 + e^-x)

Applying a sigmoid means that the sum of the inputs can range from negative infinity to positive infinity, but the output is always in the interval [0, 1].

Another, more recent nonlinearity is the linear rectifier, or ReLU:

r(x) = x if x > 0, and 0 otherwise

This function just sets every negative input to zero, and keeps everything else the same. Not using an activation function is also called using a linear activation.

If you're familiar with logistic regression, you've seen the sigmoid function already: it's stuck on the end of a linear regression function (that is, a perceptron) to turn the outputs into class probabilities. Now, we will take these sigmoid outputs, and feed them as inputs to other perceptrons.
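As a quick illustration, here are both activations in Python with numpy (a minimal sketch; the function names are ours):

    import numpy as np

    def sigmoid(x):
        # logistic sigmoid: any real input, output always in (0, 1)
        return 1 / (1 + np.exp(-x))

    def relu(x):
        # linear rectifier: negative inputs become 0, the rest pass through
        return np.maximum(0, x)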
MULTI-CLASS CLASSIFICATION: THE SOFTMAX ACTIVATION

For multi-class classification, we can use the softmax activation. We create a single output node for every class, and ensure that their values sum to one. We can then interpret this series of values as the class probabilities that our network predicts:

oi = wT h + b
yi = exp(oi) / Σj exp(oj)

The softmax activation ensures positive values that sum to one. After the softmax, we can interpret the output of node y3 as the probability that our input has class 3.

To compute the softmax, we simply take the exponent of each output node oi (to ensure that they are all positive) and then divide each by the total (to ensure that they sum to one). We could make the values positive in many other ways (taking the absolute or the square), but the exponent is a common choice for this sort of thing in statistics and physics, and it seems to work well enough.

The softmax activation is a little unusual: it's not element-wise like the sigmoid or the ReLU. To compute the value of one output node, it looks at the inputs of all the other output nodes.
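A sketch of the softmax in numpy. Subtracting the maximum before exponentiating is a standard stability trick, not something the slides discuss; it doesn't change the result:

    import numpy as np

    def softmax(o):
        e = np.exp(o - np.max(o))   # exponentiate: all values become positive
        return e / e.sum()          # normalize: the values now sum to one

    # e.g. softmax(np.array([1.0, 2.5, 0.3])) -> a vector of class probabilities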
HOW DO WE FIND GOOD WEIGHTS?

Now that we know how to build a neural network, and how to compute its output for a given input, the next question is: how do we train it? Given a particular network and a set of examples of which inputs correspond to which outputs, how do we find the weights for which the network makes good predictions?

modelθ(x) = y
lossx,t(θ) = ||modelθ(x) - t||
SOME COMMON LOSS FUNCTIONS

Here are some common loss functions for situations where we have examples (t) of what the model output (y) should be for a given input (x), with y = modelθ(x):

squared error (regression):             ||y - t||^2
absolute error:                         ||y - t||1 = Σi abs(yi - ti)
binary cross-entropy (classification):  -log pθ(t)    with t ∈ {0, 1}
cross-entropy:                          -log pθ(t)    with t ∈ {0, . . ., K}
hinge loss:                             max(0, 1 - ty) with t ∈ {-1, 1}

The binary cross-entropy comes from logistic regression (as shown last lecture) and the hinge loss comes from support vector machine classification. You can find their derivations in most machine learning books/courses. We won't elaborate on them here, except to note that in all cases the loss is lower if the model output (y) is closer to the example output (t).

The loss can be computed for a single example or for multiple examples. In almost all cases, the loss over multiple examples is simply the sum (or average) of the losses for the individual examples.
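To make the table concrete, here are these losses as small Python functions (a sketch; for the cross-entropies we assume the model outputs probabilities):

    import numpy as np

    def squared_error(y, t):
        return np.sum((y - t) ** 2)

    def absolute_error(y, t):
        return np.sum(np.abs(y - t))

    def binary_cross_entropy(p1, t):
        # p1 is the model's probability for class 1; t is 0 or 1
        return -np.log(p1 if t == 1 else 1 - p1)

    def cross_entropy(p, t):
        # p is a vector of class probabilities (e.g. a softmax output); t is a class index
        return -np.log(p[t])

    def hinge(y, t):
        # y is the raw model output; t is -1 or 1
        return max(0.0, 1.0 - t * y)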
STOCHASTIC GRADIENT DESCENT

pick some initial weights θ (for the whole model)
loop (for a given number of epochs):
    for x, t in Data:
        θ ← θ - α · ∇θ lossx,t(θ)

where α is the learning rate.

This is the idea behind the gradient descent algorithm. We compute the gradient, take a small step in the opposite direction and repeat. The reason we take small steps is that the gradient is only the direction of steepest ascent locally. It's a linear approximation to the nonlinear loss function. The further we move from our current position, the worse an approximation the tangent hyperplane will be for the function that we are actually trying to follow. That's why we only take a small step, and then recompute the gradient in our new position.

We can compute the gradient of the loss with respect to a single example from our dataset, a small batch of examples, or the whole dataset:

stochastic gradient descent: loss over one example per step (loop over data)
minibatch gradient descent: loss over a few examples per step (loop over data)
full-batch gradient descent: loss over the whole data

These options are usually called stochastic, minibatch and full-batch gradient descent respectively (although minibatch and stochastic gradient descent are often used interchangeably). In deep learning, we almost always use minibatch gradient descent, but there are some cases where full-batch is used.

Training usually requires multiple passes over the data. One such pass is called an epoch.
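In Python, the loop looks roughly like this (a sketch: gradient(theta, x, t) is a stand-in for whatever computes ∇θ loss, which is what the rest of this lecture is about; the shuffle is a common refinement, the slide just loops over the data):

    import numpy as np

    def sgd(theta, data, gradient, alpha=0.01, epochs=10):
        # theta: parameter vector; data: list of (x, t) pairs
        for epoch in range(epochs):
            np.random.shuffle(data)          # visit examples in random order
            for x, t in data:
                theta = theta - alpha * gradient(theta, x, t)
        return theta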
GRADIENT COMPUTATION: THE SYMBOLIC APPROACH

Of course, working out partial derivatives is a pretty mechanical process. We could easily take all the rules we know, and put them into some algorithm. This is called symbolic differentiation, and it's what systems like Mathematica and Wolfram Alpha do for us.

Unfortunately, as you can see here, the derivatives it returns get pretty horrendous the deeper the neural network gets. This approach becomes impractical very quickly.

Note that in symbolic differentiation we get a description of the derivative that is independent of the input. We get a function that we can then feed any input to.
GRADIENT COMPUTATION: THE NUMERIC APPROACH

Another approach is to compute the gradient numerically, for instance by the method of finite differences: we take a small step ε and see how much the function changes. The amount of change divided by the step size is a good estimate for the derivative:

f'(x) ≈ (f(x + ε) - f(x)) / ε
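A sketch of this estimate in numpy, extended to one derivative per element of the input:

    import numpy as np

    def finite_difference_gradient(f, x, eps=1e-6):
        # f: function from a vector to a scalar; returns an estimate of its gradient at x
        grad = np.zeros_like(x)
        fx = f(x)
        for i in range(len(x)):
            x_step = x.copy()
            x_step[i] += eps                 # take a small step in one direction
            grad[i] = (f(x_step) - fx) / eps
        return grad

Estimating the gradient this way costs one extra evaluation of f per parameter, which is far too expensive for networks with millions of parameters; it does make a handy sanity check for backpropagation code.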
THE CHAIN RULE

Here is the chain rule: if we want the derivative of a function which is the composition of two other functions, in this case f and g, we can take the derivative of f with respect to the output of g and multiply it by the derivative of g with respect to the input x:

∂f(g(x))/∂x = ∂f(g(x))/∂g(x) · ∂g(x)/∂x

Since we'll be using the chain rule a lot, we'll introduce a simple shorthand to make it a little easier to parse. We draw a little diagram of which function feeds into which:

x → g → f

This means we know what the argument of each function is, so we can remove the arguments from our notation:

shorthand: ∂f/∂x = ∂f/∂g · ∂g/∂x

We call this diagram a computation graph. We'll stick with simple diagrams like this for now.
INTUITION FOR THE CHAIN RULE

Since the chain rule is the heart of backpropagation, and backpropagation is the heart of deep learning, we should probably take some time to see why the chain rule is true at all.

If we imagine that f and g are linear functions, it's pretty straightforward to show that this is true. They may not be, of course, but the nice thing about calculus is that locally, we can treat them as linear functions (if they are differentiable). In an infinitesimally small neighbourhood, f and g are exactly linear.

If f and g are locally linear, we can describe their behavior with a slope s and an additive constant b:

f(g) = sf g + bf    with sf = ∂f/∂g
g(x) = sg x + bg    with sg = ∂g/∂x

The slopes, sf and sg, are simply the derivatives. The additive constants, we will show, can be ignored.

In this linear view, what the chain rule says is this: if we approximate f(x) as a linear function, then its slope is the slope of f as a function of g, times the slope of g as a function of x. To prove that this is true, we just write down f(g(x)) as linear functions, and multiply out the brackets:

f(g(x)) = sf (sg x + bg) + bf = sf sg x + sf bg + bf

The slope of f over x is sf sg = ∂f/∂g · ∂g/∂x, and the rest is constant.

Note that this doesn't quite count as a rigorous proof, but it's hopefully enough to give you some intuition for why the chain rule holds.
The same argument works when f depends on x through two functions, g and h:

f(g(x), h(x)) = s1 (sg x + bg) + s2 (sh x + bh) + bf
              = s1 sg x + s1 bg + s2 sh x + s2 bh + bf
              = (s1 sg + s2 sh) x + s1 bg + s2 bh + bf

The slope of f over x is now ∂f/∂g · ∂g/∂x + ∂f/∂h · ∂h/∂x, and the rest is constant.
MULTIVARIATE CHAIN RULE

If we have more than two paths from the input to the output, we simply sum over all of them:

x → g1, . . ., gN → f

∂f/∂x = Σi ∂f/∂gi · ∂gi/∂x
ITERATING THE CHAIN RULE

Let's see how this works on an example function:

f(x) = sin(e^-x)^2

The first thing we do is to break up its functional form into a series of smaller operations:

f(x) = d(c(b(a(x))))    with

d(c) = c^2
c(b) = sin b
b(a) = e^a
a(x) = -x

The entire function f is then just these small operations chained together. We can draw this in a computation graph as we did before:

x → a → b → c → d

Normally, we wouldn't break a function up into such small operations. This is just a simple example to illustrate the principle.
Now, to work out the derivative of f, we can iterate the chain rule. We apply it again and again, until the derivative of f over x is expressed as a long product of derivatives of operation outputs over their inputs:

∂f/∂x = ∂d/∂x
      = ∂d/∂c · ∂c/∂x
      = ∂d/∂c · ∂c/∂b · ∂b/∂x
      = ∂d/∂c · ∂c/∂b · ∂b/∂a · ∂a/∂x
GLOBAL AND LOCAL DERIVATIVES

We call the larger derivative of f over x the global derivative. And we call the individual factors, the derivatives of the operation outputs with respect to their inputs, the local derivatives:

∂f/∂x = ∂d/∂c · ∂c/∂b · ∂b/∂a · ∂a/∂x
(global derivative = product of local derivatives)
WORK OUT THE LOCAL DERIVATIVES SYMBOLICALLY

For each local derivative, we work out the symbolic derivative with pen and paper:

∂f/∂x = ∂d/∂c · ∂c/∂b · ∂b/∂a · ∂a/∂x = 2c · cos b · e^a · -1

Note that we could fill in the a, b and c in the result, but we don't. We simply leave them as is. For the symbolic part, we are only interested in the derivative of the output of each sub-operation with respect to its immediate input. The rest of the algorithm is performed numerically.
COMPUTE A FORWARD PASS (x = -4.499)

Since we are now computing things numerically, we need a specific input, in this case x = -4.499. We start by feeding this through the computation graph. For each sub-operation, we store the intermediate value:

a = -x = 4.499
b = e^a ≈ 90
c = sin b ≈ 1
d = c^2 ≈ 1

Keeping all intermediate values from the forward pass in memory, we go back to our symbolic expression of the derivative. There, we fill in the intermediate values a, b and c. After we do this, we can finish the multiplication numerically:

∂f/∂x = 2c · cos b · e^a · -1 = 2 · 1 · cos 90 · 90 · -1 = 2 · 0 · 90 · -1 = 0

In this case, the gradient happens to be 0.
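Here is the whole procedure for this example as a runnable Python sketch. The slide's numbers (sin 90 = 1, cos 90 = 0) read b in degrees, so we do the same here; that choice is ours:

    import numpy as np

    x = -4.499

    # forward pass: compute and store every intermediate value
    a = -x                       # 4.499
    b = np.exp(a)                # approximately 90
    c = np.sin(np.radians(b))    # approximately 1 (b read as degrees)
    d = c ** 2                   # approximately 1

    # backward pass: fill the stored values into the local derivatives
    # df/dx = dd/dc * dc/db * db/da * da/dx = 2c * cos b * e^a * -1
    df_dx = 2 * c * np.cos(np.radians(b)) * np.exp(a) * -1
    print(df_dx)                 # approximately 0, since cos 90 degrees is (almost) 0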
BACKPROPAGATION IN A FEEDFORWARD NETWORK

We want to work out the gradient of the loss over the parameters. We'll isolate two parameters, v2 in the second layer, and w12 in the first, and see how backpropagation operates:

∂l/∂v2, ∂l/∂w12

First, we have to break the computation of the loss into operations. If we take the graph on the left to be our computation graph, then we end up with the operations on the right:

l = (y - t)^2
y = v1 h1 + v2 h2 + v3 h3 + bh
h2 = 1 / (1 + exp(-k2))
k2 = w12 x1 + w22 x2 + bx
For the derivative with respect to v2, we'll only need these two operations. Anything below them doesn't affect the result:

l = (y - t)^2
y = v1 h1 + v2 h2 + v3 h3 + bh

To work out the derivative, we apply the chain rule and work out the local derivatives symbolically:

∂l/∂v2 = ∂l/∂y · ∂y/∂v2
       = 2(y - t) · h2

Then we fill in the numeric values from the forward pass, and compute the derivative over v2:

∂l/∂v2 = 2(10.1 - 12.1) · 0.99 = -3.96

When we apply this derivative in a gradient descent update, v2 changes so as to move the output y closer to the target t.
Let's try something a bit earlier in the network: the weight w12. We add two operations, apply the chain rule and work out the local derivatives:

l = (y - t)^2
y = v1 h1 + v2 h2 + v3 h3 + bh
h2 = 1 / (1 + exp(-k2))
k2 = w12 x1 + w22 x2 + bx

∂l/∂w12 = ∂l/∂y · ∂y/∂h2 · ∂h2/∂k2 · ∂k2/∂w12
        = 2(y - t) · v2 · h2 (1 - h2) · x1

Here we use the derivative of the sigmoid: σ'(x) = σ(x)(1 - σ(x)).
BACKPROPAGATION

This is backpropagation in a nutshell: we walk backward down the computation graph and compute the derivative of the loss for each node we pass, multiplying in one local derivative at a time.
BUILDING SOME INTUITION

To finish up, let's see if we can build a little intuition for what all these accumulated derivatives mean.

Here is a forward pass for some weights and some inputs. Backpropagation starts with the loss, and works its way back down to the inputs.

IMAGINE THAT y IS A PARAMETER

We'll start with the first value below the loss: y. Of course, this isn't a parameter of the network, but let's imagine for a moment that it is. What would the gradient descent update rule look like if we tried to update y?

y ← y - α · 2(y - t)

If the output is 10, and it should have been 0 (so t = 0), the update becomes y ← y - α · 20: gradient descent on y tells us to lower the output of the network. If the output had instead been too low, the derivative would be negative, and the update would raise the output.

Even though we can't change y directly, this is the effect we want to achieve: we want to change the values we can change so that we achieve this change in y. To figure out how to do that, we take this gradient for y, and propagate it back down the network.

Note that even though these two scenarios can have the same loss (because of the square), the derivative of the loss has a different sign, so we can tell whether the output is bigger than the target or the other way around.
One step further down, the same update rule for the second-layer weights looks like this:

v1 ← v1 - α · 20 · 0.1
v2 ← v2 - α · 20 · 0.5
v3 ← v3 - α · 20 · 0.9

Note that the current value of the weight doesn't factor into the update, only how much influence the weight had on the value of y in the forward pass. The higher the activation of the source node, the more the weight gets adjusted.

Note also how the sign of the derivative with respect to y is taken into account. Here, the model output was too high, so the more a weight contributed to the output, the more it gets "punished" by being lowered. If the output had been too low, the opposite would be true, and v3 would be increased the most.
Further down still, at the sigmoid activation, we see that in the extreme regimes the sigmoid is resistant to change: the closer its output is to 0 or 1, the smaller the local derivative h2(1 - h2) becomes, and the less any change in k2 affects h2. This is all beautifully captured by the chain rule.
FORWARD PASS IN PSEUDOCODE

To finish up, let's look at how you would implement this in code. Here is the forward pass: computing the model output and the loss, given the inputs and the target value. Assume that k and y are initialized with 0s.

for j in [1 … 3]:
    for i in [1 … 2]:
        k[j] += w[i,j] * x[i]
    k[j] += b[j]

for i in [1 … 3]:
    h[i] = sigmoid(k[i])

for i in [1 … 3]:
    y += h[i] * v[i]
y += c

loss = (y - t) ** 2
BACKWARD PASS IN PSEUDOCODE

The backward pass works the same way. Note that we don't implement the worked-out symbolic derivations from above directly. Instead, we work backwards through the operations, computing the gradient for each intermediate value from the gradients we have already computed. For instance:

for i in [1 … 3]:
    k'[i] = h'[i] * h[i] * (1-h[i])

(the loops for the remaining gradients follow the same pattern, as in the sketch below)
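Putting the two passes together, here is a complete runnable sketch for this network in Python with numpy (the concrete input values and initialization are our own; the gradient formulas follow the derivations above):

    import numpy as np

    def sigmoid(z):
        return 1 / (1 + np.exp(-z))

    # a tiny network: 2 inputs, 3 hidden units, 1 output
    x = np.array([1.0, -1.0]); t = 0.5
    w = np.random.randn(2, 3); b = np.zeros(3)   # first layer
    v = np.random.randn(3);    c = 0.0           # second layer

    # forward pass, storing all intermediate values
    k = x.dot(w) + b            # k[j] = sum_i w[i,j] * x[i] + b[j]
    h = sigmoid(k)
    y = h.dot(v) + c
    loss = (y - t) ** 2

    # backward pass: walk back down the graph, one local derivative at a time
    dy = 2 * (y - t)            # dl/dy
    dv = dy * h                 # dl/dv[i] = dl/dy * h[i]
    dc = dy
    dh = dy * v                 # dl/dh[i] = dl/dy * v[i]
    dk = dh * h * (1 - h)       # through the sigmoid: h' = h(1 - h)
    dw = np.outer(x, dk)        # dl/dw[i,j] = dl/dk[j] * x[i]
    db = dk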
RECAP
IT'S ALL JUST LINEAR ALGEBRA

When we look at the computation of a neural network, we can see that most operations can be expressed very naturally in the language of linear algebra. Multiplying every input by its weight and summing is a multiplication of the input vector x by a weight matrix W. The addition of a bias is the addition of a vector of bias parameters. The nonlinearities can be implemented as simple element-wise operations. For our two-layer network:

f(x) = V σ(Wx + b) + c

A loop-based implementation like the one we saw before can be parallelized and implemented efficiently, but it's a lot of work. Ideally, we'd like to let somebody else do all that work (like the implementers of numpy) and then just call their code to do the matrix multiplications.

This is especially important if you have access to a GPU. A matrix multiplication can easily be offloaded to the GPU for much faster processing, but for a loop over an array, there's no benefit.
FORWARD

Here's what the vectorized forward pass looks like as a computation graph, in symbolic notation and in pseudocode. The graph: x, W and b feed into k; k into h; h, V and c into y; y and t into l.

symbolic:           pseudocode:
k = Wx + b          k = w.dot(x) + b
h = σ(k)            h = sigmoid(k)
y = Vh + c          y = v.dot(h) + c
l = (y - t)^2       l = (y - t) ** 2

When you implement this in numpy it'll look almost the same.
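Indeed, in numpy the forward pass is almost a transliteration (sizes and initialization here are our own):

    import numpy as np

    def sigmoid(z):
        return 1 / (1 + np.exp(-z))

    w = np.random.randn(3, 2); b = np.zeros(3)   # first layer: 2 inputs, 3 hidden
    v = np.random.randn(3);    c = 0.0           # second layer: scalar output
    x = np.random.randn(2);    t = 1.0

    k = w.dot(x) + b
    h = sigmoid(k)
    y = v.dot(h) + c
    l = (y - t) ** 2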
WHAT ABOUT THE BACKWARD PASS?

Of course, we lose a lot of the benefit of vectorizing if the backward pass isn't also expressed in terms of matrix multiplications. How do we vectorize the backward pass? That's the question we'll answer in the rest of this video.

On the left, we see the forward pass of our loss computation as a set of operations on vectors and matrices. (We've generalized things by giving the network a vector output y and a vector target t.) Can we do something like this?

∂l/∂W =? ∂l/∂y · ∂y/∂h · ∂h/∂k · ∂k/∂W

To generalize backpropagation to this view, we might ask if something similar to the chain rule exists for vectors and matrices. Firstly, can we define something analogous to the derivative of one matrix over another, and secondly, can we chain such derivatives together the way we did with scalars?
If f is a function from a vector a to a vector b, we can arrange all the partial derivatives into a matrix, the Jacobian:

Jf = ( ∂b1/∂a1  ∂b2/∂a1
       ∂b1/∂a2  ∂b2/∂a2
       ∂b1/∂a3  ∂b2/∂a3 )

We can play the same game for other combinations of input and output:

                 scalar output   vector output   matrix output
scalar input     scalar          vector          matrix
vector input     vector          matrix          ?
matrix input     matrix          ?               ?
3-tensor input   3-tensor        ?               ?

However, once we get to matrix/vector operations or matrix/matrix operations, the only way to logically arrange every input with every output is a tensor of higher dimension than 2.

Happily, we never actually need these higher-order objects. The quantity we ultimately differentiate is always the loss l, which is a scalar, while W may be a tensor of any dimension:

∂l/∂W    (l: a scalar, W: a tensor of any dimension)

so the gradient can always be arranged in a tensor with the same shape as W.
GRADIENT FOR y

We start simple: what is the gradient for y? This is a vector, because y is a vector. We'll write y∇ for the gradient of the loss l with respect to y, with elements y∇i = ∂l/∂yi.

Let's first work out the derivative for the i-th element of y. This is purely a scalar derivative, so we can simply use the rules we already know:

y∇i = ∂l/∂yi = ∂[Σk (yk - tk)^2]/∂yi = Σk ∂(yk - tk)^2/∂yi = 2(yi - ti)

(all terms with k ≠ i are constant with respect to yi, so only the term k = i survives)

Then, we just re-arrange all the derivatives for the yi into a vector, which gives us the complete gradient for y. The final step requires a little creativity: we need to figure out how to compute this vector using only basic linear algebra operations on the given vectors and matrices. In this case it's not so complicated: we get the gradient for y by element-wise subtracting t from y and multiplying by 2:

y∇ = 2 (y1 - t1, . . ., yn - tn)^T = 2(y - t)

We haven't needed any chain rule yet, because our computation graph for this part has only one edge.
GRADIENT FOR V

Let's move one step down and work out the gradient for V. We start with the scalar derivative for an arbitrary element Vij:

V∇ij = ∂l/∂Vij

Now we need a chain rule, to backpropagate through y. However, since we are sticking to scalar derivatives until we're ready to vectorize, we can simply use the scalar multivariate chain rule. This tells us that however many intermediate values we have (here, the elements of y), we can work out the derivative through each, keeping the others constant, and sum the results. This gives us the sum in the second equality:

V∇ij = ∂l/∂Vij = Σk ∂l/∂yk · ∂yk/∂Vij = Σk y∇k · ∂yk/∂Vij

We've worked out the gradient y∇ already, so we can fill that in and focus on the derivative of yk over Vij. The value of yk is defined in terms of linear algebra operations like matrix multiplication, but with a little thinking we can always rewrite these as simple scalar sums:

∂yk/∂Vij = ∂[Vh + c]k/∂Vij = ∂(Σl Vkl hl + ck)/∂Vij

The sum Σl Vkl hl contains Vij only when k = i (and then only in the term l = j), so the derivative is hj when k = i and zero otherwise. In the end, the derivative of yk over Vij reduces to the value of hj:

V∇ij = y∇i hj

Arranging all of these into a matrix, we see that the gradient for V is the outer product of the gradient for y and the vector h:

V∇ = y∇ h^T
RECIPE

The derivation above suggests a general recipe, which we'll follow again below: write out the scalar derivative for a single element, apply the multivariate chain rule, fill in any gradients you've already computed, and finally find a vectorized expression that computes all the elements at once.
W
V
b
x
c
x y
xt
y
GRADIENT FOR h

h^∇_i = ∂l/∂h_i
      = Σ_k (∂l/∂y_k)(∂y_k/∂h_i)
      = Σ_k y^∇_k · ∂[Vh + c]_k/∂h_i
      = Σ_k y^∇_k · ∂(Σ_l V_kl h_l + c_k)/∂h_i
      = Σ_k y^∇_k V_ki

h^∇ = ((y^∇)^T V)^T = V^T y^∇

76

Since this is an important principle to grasp, let's keep going until we get to the other set of parameters, W. We'll leave the biases as an exercise.

For the gradient for h, most of the derivation is the same as before, until we get to the point where the scalar derivative is reduced to a matrix multiplication. Unlike the previous derivation, the sum over k does not disappear: it is absorbed into the description of the gradient for h.

To vectorize this, we note that each element of this vector is a dot product of y^∇ and a column of V. This means that if we multiply the transpose of y^∇ by V, we get the required result as a row vector. Transposing the result (to make it a column vector) is equivalent to premultiplying y^∇ by the transpose of V.

Note the symmetry of the forward and the backward operation.
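A tiny illustration of that symmetry (with made-up values): the forward pass multiplies h by V, and the backward pass multiplies the incoming gradient by V^T.

import numpy as np

V = np.arange(6.0).reshape(3, 2)    # y = Vh + c maps a 2-vector h to a 3-vector y
h = np.ones(2)
y_grad = np.ones(3)
print((V @ h).shape)         # forward: multiply by V, gives shape (3,)
print((V.T @ y_grad).shape)  # backward: multiply by V^T, gives shape (2,)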
GRADIENT FOR k

k^∇_i = ∂l/∂k_i
      = Σ_m (∂l/∂h_m)(∂h_m/∂k_i)
      = Σ_m h^∇_m · ∂σ(k_m)/∂k_i
      = h^∇_i σ'(k_i)
      = h^∇_i σ(k_i)(1 - σ(k_i))        with σ'(x) = σ(x)(1 - σ(x))

k^∇ = h^∇ ⊙ σ(k) ⊙ (1 - σ(k)) = h^∇ ⊙ h ⊙ (1 - h)

77

Now, at this point, when we analyze k, remember that we already have the gradient over h. This means that we no longer have to apply the chain rule to anything above h: we can express the scalar gradient for k in terms of the gradient for h. Given that, working out the gradient for k is straightforward.
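A minimal check of this elementwise rule (illustrative values only): the sigmoid derivative can be computed from the forward output h alone, so k^∇ = h^∇ ⊙ h ⊙ (1 - h).

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

k = np.array([-1.0, 0.0, 2.0])
h = sigmoid(k)
h_grad = np.ones(3)                # stand-in for the gradient flowing down from h
k_grad = h_grad * h * (1 - h)      # as derived
eps = 1e-6
print(k_grad)
print((sigmoid(k + eps) - sigmoid(k)) / eps)   # finite-difference σ'(k), should be close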
GRADIENT FOR W

W^∇_ij = ∂l/∂W_ij
       = Σ_m (∂l/∂k_m)(∂k_m/∂W_ij)
       = Σ_m k^∇_m · ∂[Wx + b]_m/∂W_ij
       = Σ_m k^∇_m · ∂(Σ_l W_ml x_l + b_m)/∂W_ij
       = k^∇_i x_j

W^∇ = (the matrix with k^∇_i x_j at position i, j) = k^∇ x^T

78

Finally, the gradient for W. The situation here is exactly the same as we saw earlier for V (matrix in, vector out, matrix multiplication), so we should expect the derivation to have the same form (and indeed it does).
PSEUDOCODE: BACKWARD

y' = 2 * (y - t)
v' = y'.mm(h.T)
h' = v.T.mm(y')
k' = h' * sigmoid(k) * (1 - sigmoid(k))
w' = k'.mm(x.T)

79

Here's the backward pass that we just derived. We've left the derivatives of the bias parameters out. You'll have to work these out to implement the first homework exercise.
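Putting it all together, here is a minimal runnable numpy version of the forward and backward pass (a sketch with made-up shapes, not the course's reference code). Following the slide, the bias gradients are left out.

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

rng = np.random.default_rng(0)
x, t = rng.standard_normal(4), rng.standard_normal(2)     # input and target
W, b = rng.standard_normal((3, 4)), rng.standard_normal(3)
V, c = rng.standard_normal((2, 3)), rng.standard_normal(2)

# forward pass
k = W @ x + b
h = sigmoid(k)
y = V @ h + c
l = ((y - t) ** 2).sum()

# backward pass, exactly as derived above (bias gradients left as an exercise)
y_grad = 2 * (y - t)
V_grad = np.outer(y_grad, h)       # y^∇ h^T
h_grad = V.T @ y_grad              # V^T y^∇
k_grad = h_grad * h * (1 - h)      # h^∇ ⊙ h ⊙ (1 - h)
W_grad = np.outer(k_grad, x)       # k^∇ x^T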
RECAP

Vectorizing makes notation simpler, and computation faster.

80
Lecture 2: Backpropagation
Peter Bloem
Deep Learning
dlvu.github.io
THE PLAN
Tensors
86
TENSORS (AKA MULTIDIMENSIONAL ARRAYS)

(figure: a scalar, a vector, a matrix and higher-dimensional tensors)
0-tensor · 1-tensor, shape: 3 · 2-tensor, shape: 3x2 · 3-tensor, shape: 2x3x2 · 4-tensor, shape: 2x3x2x4

87

The basic datastructure of our system will be the tensor. A tensor is a generic name for a family of datastructures that includes a scalar, a vector, a matrix and so on.

There is no good way to visualize a 4-dimensional structure. We will occasionally use this form to indicate that there is a fourth dimension along which we can also index the tensor.

We will assume that whatever data we work with (images, text, sounds) will in some way be encoded into one or more tensors, before it is fed into our system. Let's look at some examples.
A CLASSIFICATION TASK AS TWO TENSORS

X = 181 46        y = 0
    181 50            0
    166 44            1
    171 38            1
    152 36            1
    156 40            1
    167 40            1
    170 45            0
    178 50            0
    191 50            0
    166 38            1

88

A simple dataset with numeric features can simply be represented as a matrix. For the labels, we usually create a separate corresponding vector.

Any categoric features or labels should be converted to numeric features (normally by one-hot coding).
TENSORS IN MEMORY: ROW MAJOR ORDERING

A = 5 3 1      Tensor A: data: 5 3 1 2 6 0, shape: (2, 3)
    2 6 0

91

It's important to realize that even though we think of tensors as multidimensional arrays, in memory, they are necessarily laid out as a single line of numbers. The Tensor object knows the shape that these numbers should take, and uses that to compute whatever we need it to compute.

We can do this in two ways: scanning along the rows first and then the columns is called row-major ordering. This is what numpy and pytorch both do. The other option is (unsurprisingly) called column-major ordering. In higher-dimensional tensors, the dimensions further to the right are always scanned before the dimensions to their left.

Imagine looping over all elements in a tensor with shape (a, b, c, d) in four nested loops. If you want to loop over the elements in the order they are in memory, then the loop over d would be the innermost loop, then c, then b and then a.

This allows us to perform some operations very cheaply, but we have to be careful.
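The row-major rule translates directly into index arithmetic. A small sketch (the function name is mine): the flat position of a multi-index is found by scanning the dimensions left to right.

def flat_index(indices, shape):
    # maps a multi-index to its position in the flat, row-major data array
    pos = 0
    for index, size in zip(indices, shape):
        pos = pos * size + index
    return pos

data = [5, 3, 1, 2, 6, 0]                 # the tensor A from the slide
print(data[flat_index((1, 2), (2, 3))])   # A[1, 2] -> 0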
A' = 5 3       Tensor A': data: 5 3 1 2 6 0, shape: (3, 2)
     1 2
     6 0

92
RESHAPE/VIEW

A = 5 3 1      Tensor A: data: 5 3 1 2 6 0, shape: (2, 3)
    2 6 0

a' = 5 3 1 2 6 0      Tensor a': data: 5 3 1 2 6 0, shape: (6,)

93
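In numpy, for instance, a reshape returns a view over the same flat data whenever possible, so no numbers are copied or moved (a small demonstration, not from the slides):

import numpy as np

A = np.array([[5, 3, 1], [2, 6, 0]])
a = A.reshape(6)                 # same data buffer, new shape
print(np.shares_memory(A, a))    # True
a[0] = 99
print(A[0, 0])                   # 99: the change is visible through A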
SLICE

A = 5 3 1      data: 5 3 1 2 6 0, shape: (2, 3)
    2 6 0

A[:, :2] = 5 3      Tensor A': parent: A, shape: (2, 2)
           2 6

97

Even slicing can be accomplished by referring to the original data.
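Slices work the same way (again a numpy illustration): the slice stores an offset and strides into its parent's data rather than copying anything.

import numpy as np

A = np.array([[5, 3, 1], [2, 6, 0]])
S = A[:, :2]
print(np.shares_memory(A, S))    # True: S is a view into A's data
print(S.strides == A.strides)    # True: S simply skips the last column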
Tensors
98
99
FEEDFORWARD NETWORK

(figure: computation graph of the MLP: x is multiplied by W and added to b to give k, σ gives h, which is multiplied by V and added to c to give y; y and the target t feed into the loss l)

100

As an example, here is our MLP expressed in our new notation.

Just as before, it's up to us how granular we make the computation. Here, we wrap the whole computation of the loss in a single operation, but we could also separate out the subtraction and the squaring.

If there's a uniform direction to the computation, we'll leave out the arrowheads for clarity, but you should always think of a computation graph as a directional one. All connections have a definite direction.

When we draw intricate computation graphs, we may connect an op node to another op node without explicitly drawing the tensor node in between. This should be understood as a shorthand for a proper bipartite graph. Similarly, we may occasionally leave out the op node between two tensor nodes if the operation is clear from context.

All gradients are of l, with respect to the tensors on the tensor nodes.
101
forward_f : A ↦ B
backward_f : B^∇ ↦ A^∇

def backward(context, outputs_gradient):
    # given the gradient of the loss wrt the outputs,
    # compute the gradient of the loss wrt the inputs
    ...

103

The function forward computes the outputs given the inputs (just like any function in any programming language). The function backward takes the gradient for the outputs (the gradient of the loss wrt the outputs) and produces the gradient for the inputs.
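A minimal sketch of this interface (my own names, not necessarily the course code): the forward stores whatever the backward will need in a context.

class Op:
    def __init__(self):
        self.context = {}

    def forward(self, *inputs):
        raise NotImplementedError

    def backward(self, outputs_gradient):
        raise NotImplementedError

class Multiply(Op):
    def forward(self, a, b):
        self.context['a'], self.context['b'] = a, b   # saved for the backward
        return a * b

    def backward(self, outputs_gradient):
        a, b = self.context['a'], self.context['b']
        return outputs_gradient * b, outputs_gradient * a   # gradients for a and b

m = Multiply()
print(m.forward(2.0, 3.0))   # 6.0
print(m.backward(1.0))       # (3.0, 2.0)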
BACKPROPAGATION

• build a computation graph, then perform a forward pass, bottom-up from the leaves
• traverse the tree backwards, breadth-first
  • call backward for each OpNode
  • breadth-first ensures that the output gradients are known
• add the computed gradients to the tensor nodes
  • sum them with any gradients already present

104

This is the algorithm for backpropagation in this setting. We chain together these modules into a computation graph, and perform a forward pass to compute a loss. We then walk backward from the loss node breadth-first to compute the gradients for each tensor node. Because we walk breadth-first, we ensure that we've always computed the gradients for the outputs of each OpNode we encounter already, and we can simply call its backward to compute the gradients for its inputs. A sketch of this walk follows below.

Note the last point: some tensor nodes will be inputs to multiple operation nodes. By the multivariate chain rule, we should sum the gradients they get from all the OpNodes they feed into. We can achieve this easily by just initializing the gradients to zero and adding the gradients we compute to any that are already there.
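Here is that walk as code (a sketch with hypothetical node classes; a real system would also check that all of a node's output gradients have arrived before visiting it):

from collections import deque

def backpropagate(loss_node):
    loss_node.grad = 1.0                  # dl/dl = 1
    queue = deque([loss_node])
    while queue:
        tensor = queue.popleft()
        op = tensor.source                # the OpNode that produced this tensor
        if op is None:                    # a leaf: an input or a parameter
            continue
        input_grads = op.backward(tensor.grad)
        for node, grad in zip(op.inputs, input_grads):
            node.grad += grad             # sum, per the multivariate chain rule
            queue.append(node)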
BUILDING COMPUTATION GRAPHS IN CODE

Lazy execution: build your graph, compile it, feed data through.

105

There are two common strategies for constructing the computation graph: lazy and eager execution.
EXAMPLE: LAZY EXECUTION

a = TensorNode()
b = TensorNode()
c = a * b

m = Model(
    in=(a, b),
    loss=c)

m.train((1, 2))

(figure: the graph for c = a * b, with tensor nodes a, b and c around a × op node; each tensor node holds a value and a grad, and c records source: mult(a, b))

106

Here's one example of how a lazy execution API might look.

Note that when we're building the graph, we're not telling it which values to put at each node. We're just defining the shape of the computation, but not performing the computation itself.

When we create the model, we define which nodes are the input nodes, and which node is the loss node. We then provide the input values and perform the forward and backward passes.
BUILDING THE COMPUTATION GRAPH: LAZY EXECUTION

Tensorflow 1.x default, Keras default
• Define the computation graph.
• Compile it.
• Iterate backward/forward over the data.

Fast. Many possibilities for optimization. Easy to serialise models. Easy to make training parallel.
Difficult to debug. Model must remain static during training.

107

In lazy execution, which was the norm during the early years of deep learning, we build the computation graph, but we don't yet specify the values on the tensor nodes. When the computation graph is finished, we define the data, and we feed it through.

This is fast, since the deep learning system can optimize the graph structure during compilation, but it makes models hard to debug: if something goes wrong during training, you probably made a mistake while defining your graph, but you will only get an error while passing data through it. The resulting stack trace will never point to the part of your code where you made the mistake.
BUILDING THE COMPUTATION GRAPH: EAGER EXECUTION

PyTorch, Tensorflow 2.0 default, Keras option
• Build the computation graph on the fly during the forward pass.

Easy to debug: problems in the model occur as the module is executing.
Flexible: the model can be entirely different from one forward to the next.
More difficult to optimize. A little more difficult to serialize.

109

In eager execution, we simply execute all operations immediately on the data, and collect the computation graph on the fly. We then execute the backward and ditch the graph we've collected.

This makes debugging much easier, and allows for more flexibility in what we can do in the forward pass. It can, however, be a little difficult to wrap your head around. Since we'll be using pytorch later in the course, we'll show you how eager execution works step by step.
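To make the eager idea concrete, here is a toy sketch (hypothetical classes, not pytorch's internals): every operation runs immediately and records its inputs, so the graph exists by the time the forward pass is done.

class Tensor:
    def __init__(self, value, source=None, inputs=()):
        self.value = value        # computed right away, eagerly
        self.grad = 0.0
        self.source = source      # name of the op that produced this tensor
        self.inputs = inputs      # parent tensors, recorded for the backward pass

    def __mul__(self, other):
        return Tensor(self.value * other.value, source='mult', inputs=(self, other))

a, b = Tensor(2.0), Tensor(3.0)
c = a * b                         # c.value == 6.0, and c remembers a and b
print(c.value, [p.value for p in c.inputs])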
Tensors
110
WORKING OUT THE BACKWARD FUNCTION

(figure: an op node + with input tensors A and B and output tensor S)

class Plus(Op):
    ...

A^∇ = S^∇        B^∇ = S^∇

The final ingredient we need is a large collection of operations with backward functions worked out. We'll show how to do this for a few examples.
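For Plus, the worked-out module is as simple as it gets (a sketch assuming the Op interface from earlier): since S = A + B, both partial derivatives are 1, so the output gradient passes straight through to both inputs.

class Op:   # minimal stand-in for the base class
    pass

class Plus(Op):
    def forward(self, a, b):
        return a + b                        # nothing needs to be stored for backward

    def backward(self, s_gradient):
        return s_gradient, s_gradient       # A^∇ = S^∇ and B^∇ = S^∇

p = Plus()
print(p.forward(2.0, 3.0))   # 5.0
print(p.backward(1.0))       # (1.0, 1.0)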