Professional Documents
Culture Documents
2.1 Backpropagation
We will walk through an example for forward and backward propagation on a simple example.
Suppose are given a dataset X = { x1 , . . . , xn }, where each xi ∈ R M . Each example in this dataset has a
class Y ∈ [1, 2, . . . , K ], which you are trying to predict.
To do this, you set up a simple network:
where [k] represents the k-th index of the vector resulting from the softmax.
Answer the following questions:
1. Suppose we use a batch size of B. What should we expect to be the shapes of x, W, and b? What is
the shape of the resulting output vector?
4. What is the derivative of p(Y = k| x ) w.r.t. a weight parameter Wj,s ? What about w.r.t. a bias
parameter bs ? (Hint: Use the result from Part 3.)
Solutions:
3. Because softmax is a multiple input, multiple output function, we have to compute the derivative for
∂σ (v j )
each input-output pair (a.k.a the Jacobian). We compute ∂vi where j is the input dimension and i
2-1
2-2 Section 2: Backpropagation and Optimization
∂ ∂ exp(v j )
σ(v j ) = K
(2.4)
∂vi ∂vi ∑k=1 exp(vk )
∑kK=1 exp(vk ) ∂v∂ i exp(v j ) − exp(v j ) ∂v∂ i ∑kK=1 exp(vk )
= (2.5)
∑kK=1 exp(vk ))2
∑kK=1 exp(vk )δij exp(vi ) − exp(v j ) exp(vi )
= 2 (2.6)
∑kK=1 exp(vk )
K
exp(vi ) δij ∑k=1 exp(vk ) − exp(v j )
= K
(2.7)
∑k=1 exp(vk ) ∑kK=1 exp(vk )
= σi (v)(δij − σj (v)) (2.8)
where δij is the Kronecker delta and equals 1 when i = j, otherwise 0, and σi represents the i-
th output of the softmax function applied to the vector v. This gives the derivative in terms of
the output, which is common when exponents are involved. See https://eli.thegreenplace.
net/2016/the-softmax-function-and-its-derivative/ for another derivation with a bit more
explanation of each step.
4. In our notation, we let p(Y = k | x ) = σk ( f ( x )), where f ( x ) is the output of our affine layer. Therefore,
we want to compute
∂ ∂ ∂
σ ( f ( x )) = σ ( f ( x )) f (x) (2.9)
∂Wj,s k ∂ f (x) k ∂Wj,s
∂
= σk ( f ( x ))(δks − σs ( f ( x ))) f (x) (2.10)
∂Wj,s
∂
= σk ( f ( x ))(δks − σs ( f ( x ))) ( xW + b) (2.11)
∂Wj,s
M
∂
∂Wj,s m∑
= σk ( f ( x ))(δks − σs ( f ( x ))) x`,m Wm,s + bs (2.12)
=1
= σk ( f ( x ))(δks − σs ( f ( x ))) x`,j (2.13)
∂
For bias, similar except that ∂bs f ( x ) = 1, so we get
∂
σ ( f ( x )) = σk ( f ( x ))(δks − σs ( f ( x ))) (2.14)
∂bs k
Now that we have obtained gradients for the loss with respect to our weights, we have to use them to
update the weights.
There are several optimization methods available.
Section 2: Backpropagation and Optimization 2-3
Problem, the loss is a sum overall all samples (xi , yi ) in our dataset (D).
dL( xi , yi , W )
dw = ∑ dw
( xi ,yi )∈ D
To obtain one full gradient, we need to process the full dataset, this is often too expensive.
At each step, we select a random (stochastic) subsample of the data, called a minibatch MB.
MB ⊂ D, k MBk = S
For example, the batch size (S) could be 10, 32, 76, etc. Then we only compute the gradients with respect
to this minibatch.
1 dL( xi , yi , W )
S ( x ,y∑
dŵ =
)∈ MB
dw
i i
dŵ is noisy, and depends on the minibatch selected, as well as the batch size. However we use it as a
substitute for the true gradient.
wnew = wold − αdŵ
What if we kept the noisy gradients and summed them into a “momentum vector” to reduce noise? At
each step, update the momentum with the new gradient:
Mw,t+1 = µMw,t − αdŵ
Where µ ∈ [0, 1], is the momentum constant, which determines the forgetting rate of old gradients. Then
use the momentum to update the weigths:
wt+1 = wt + Mw,t+1
Figure 2.1: Effect of using momentum on a 2-d problem. The lines are equipotential lines of the loss, and
the momentum method converges faster, with the intuition that it remembers the direction of the older
gradients, which were “pushing” it towards the center, representing the local minimum.
2-4 Section 2: Backpropagation and Optimization
2.2.4 RMSProp
Similar to SGD, this method runs with randomly selected mini-batches. We expect the parameters to be
in the same range of scale, however, gradients of some weights might have much larger magnitude than
others, which might make them overfluctuate.
In RMSProp, you keep a moving average of of the “norm” of the gradients for each weight, and normalize
the gradient by this norm. It is equivalent to having a different learning rate for each parameter, based on
its gradient.
At each step we update the norm estimate of the weight:
Where β ∈ [0, 1] is the decay factor of the norm term. Then we apply the normalized gradient update:
α
w t +1 = w t − √ dw
sw,t + e
These operations above are applied element-wise to each element of the vectors involved. (If we’re dealing
with a multi-layer neural network, think of w as representing the concatenation of all parameters together
√
in one giant vector.) For example, sw,t means taking the square root of each element in the vector sw,t .
Note: ADAM is roughly RMSProp with a momemtum update. The two parameters (µ and β) are called
β 1 and β 2 .
Here’s a blog post with an up-to-date list of popular optimization methods for deep-learning: http:
//ruder.io/optimizing-gradient-descent/
Now we have a way to decide on a neural network architecture, obtain gradients for all the weights
(backpropagation), and update the weights (optimization method).
How do we know whether the neural network is training? How do we compare neural networks?
It is usually based on a metric evaluated on the validation dataset, and the validation loss is often used.
Here are things to keep in mind.
All the methods described above involve a learning rate. How is this rate chosen?
Answer: start with a random learning rate and then increase it or decrease it until finding a good rate.
Can you tell in each situation, whether the rate is too high or too low.
Section 2: Backpropagation and Optimization 2-5
Figure 2.2: Plotting the validation loss for the same architecture and optimization method, with varying
learning rates. When is the learning rate too high, too low, and at the right level.
A straightforward way to find a reasonable learning rate is to test every order of magnitude of a parameter,
such as α ∈ {1e−5 , 1e−4 , 1e−3 , 1e−2 }.
You can start simple with SGD, and try more complicated optimization methods such as ADAM. Note
that each method needs the learning rate to be re-tuned, as the best learning rate for SGD might not be
the best one for RMSProp.
Figure 2.3: The optimization method can both change what value the loss converges to, as well as how
fast it converges to it.
There is no reason the learning rate has to be fixed throughout training. Commonly, the learning rate is
reduced over time, allowing the network to be fine-tuned with smaller gradients. An exponential decay
2-6 Section 2: Backpropagation and Optimization
Figure 2.4: The validation accuracy of the training of an AlexNet. Every time the accuracy plateaus, α is
divided by 10, and training continues.