
CS 182/282A: Designing, Visualizing and Understanding Deep Neural Networks Spring 2019

Section 2: Backpropagation and Optimization

2.1 Backpropagation

We will walk through forward and backward propagation on a simple example.
Suppose we are given a dataset $X = \{x_1, \ldots, x_n\}$, where each $x_i \in \mathbb{R}^M$. Each example in this dataset has a class $Y \in \{1, 2, \ldots, K\}$, which you are trying to predict.
To do this, you set up a simple network:

\begin{align}
p(Y = k \mid x) &= \operatorname{softmax}(xW + b)[k] \tag{2.1}\\
&= \frac{\exp\left(x W_{:,k} + b_k\right)}{\sum_{s=1}^{K} \exp\left(x W_{:,s} + b_s\right)} \tag{2.2}\\
&= \frac{\exp\left(\sum_{j=1}^{M} x_{\ell,j} W_{j,k} + b_k\right)}{\sum_{s=1}^{K} \exp\left(\sum_{j=1}^{M} x_{\ell,j} W_{j,s} + b_s\right)} \tag{2.3}
\end{align}


where [k] represents the k-th index of the vector resulting from the softmax.
Answer the following questions:

1. Suppose we use a batch size of B. What should we expect to be the shapes of x, W, and b? What is
the shape of the resulting output vector?

2. Related: Why do we write this as xW, rather than Wx?



3. What is the derivative of the softmax function w.r.t. its input, $\frac{\partial \sigma(v)}{\partial v}$, where $v$ is a vector of size $K$?

4. What is the derivative of p(Y = k| x ) w.r.t. a weight parameter Wj,s ? What about w.r.t. a bias
parameter bs ? (Hint: Use the result from Part 3.)

Solutions:

1. $x$ is a batch of vectors in $\mathbb{R}^M$, so its shape is $B \times M$. $W$ and $b$ transform this into $K$ classes, so $W$ has shape $M \times K$ and $b$ has shape $K$; the resulting output has shape $B \times K$. Note that $b$ is 'broadcast' across the batch when added (this can also be implemented as part of the matrix multiply by appending a constant-1 dimension to the input).

2. If $x$ is batched, then $Wx$ makes no sense: an $M \times K$ matrix cannot be multiplied by a $B \times M$ matrix. Since the batch is almost always the first dimension, this is usually implemented as $xW$ instead of $Wx$.

3. Because softmax is a multiple-input, multiple-output function, we have to compute the derivative for each input–output pair (a.k.a. the Jacobian). We compute $\frac{\partial \sigma_j(v)}{\partial v_i}$, where $j$ indexes the output dimension and $i$ the input dimension of the Jacobian.

\begin{align}
\frac{\partial}{\partial v_i} \sigma_j(v) &= \frac{\partial}{\partial v_i} \frac{\exp(v_j)}{\sum_{k=1}^{K} \exp(v_k)} \tag{2.4}\\
&= \frac{\sum_{k=1}^{K} \exp(v_k) \, \frac{\partial}{\partial v_i} \exp(v_j) - \exp(v_j) \, \frac{\partial}{\partial v_i} \sum_{k=1}^{K} \exp(v_k)}{\left(\sum_{k=1}^{K} \exp(v_k)\right)^2} \tag{2.5}\\
&= \frac{\sum_{k=1}^{K} \exp(v_k) \, \delta_{ij} \exp(v_i) - \exp(v_j) \exp(v_i)}{\left(\sum_{k=1}^{K} \exp(v_k)\right)^2} \tag{2.6}\\
&= \frac{\exp(v_i)}{\sum_{k=1}^{K} \exp(v_k)} \cdot \frac{\delta_{ij} \sum_{k=1}^{K} \exp(v_k) - \exp(v_j)}{\sum_{k=1}^{K} \exp(v_k)} \tag{2.7}\\
&= \sigma_i(v)\left(\delta_{ij} - \sigma_j(v)\right) \tag{2.8}
\end{align}

where $\delta_{ij}$ is the Kronecker delta (equal to 1 when $i = j$, otherwise 0), and $\sigma_i$ represents the $i$-th output of the softmax function applied to the vector $v$. This gives the derivative in terms of the output, which is common when exponentials are involved. See https://eli.thegreenplace.net/2016/the-softmax-function-and-its-derivative/ for another derivation with a bit more explanation of each step.
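
As a sanity check, the identity in (2.8) can be verified numerically. Here is a minimal NumPy sketch (the helper names `softmax` and `softmax_jacobian` are our own, not from any library) that compares the analytic Jacobian against finite differences:

```python
import numpy as np

def softmax(v):
    # Subtract the max for numerical stability; the result is unchanged.
    e = np.exp(v - v.max())
    return e / e.sum()

def softmax_jacobian(v):
    # J[j, i] = d sigma_j / d v_i = sigma_j * (delta_ij - sigma_i),
    # which equals sigma_i * (delta_ij - sigma_j) as in (2.8).
    s = softmax(v)
    return np.diag(s) - np.outer(s, s)

v = np.random.randn(5)
J = softmax_jacobian(v)

# Finite-difference check of each column (input dimension) of the Jacobian.
eps = 1e-6
for i in range(5):
    dv = np.zeros(5)
    dv[i] = eps
    numeric = (softmax(v + dv) - softmax(v - dv)) / (2 * eps)
    assert np.allclose(J[:, i], numeric, atol=1e-6)
```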

4. In our notation, we let $p(Y = k \mid x) = \sigma_k(f(x))$, where $f(x)$ is the output of our affine layer. Therefore, we want to compute $\frac{\partial}{\partial W_{j,s}} \sigma_k(f(x))$. Since only the $s$-th component of $f(x)$ depends on $W_{j,s}$, the chain rule gives

\begin{align}
\frac{\partial}{\partial W_{j,s}} \sigma_k(f(x)) &= \frac{\partial \sigma_k(f(x))}{\partial f_s(x)} \, \frac{\partial f_s(x)}{\partial W_{j,s}} \tag{2.9}\\
&= \sigma_k(f(x))\left(\delta_{ks} - \sigma_s(f(x))\right) \frac{\partial}{\partial W_{j,s}} f_s(x) \tag{2.10}\\
&= \sigma_k(f(x))\left(\delta_{ks} - \sigma_s(f(x))\right) \frac{\partial}{\partial W_{j,s}} (xW + b)_s \tag{2.11}\\
&= \sigma_k(f(x))\left(\delta_{ks} - \sigma_s(f(x))\right) \frac{\partial}{\partial W_{j,s}} \left( \sum_{m=1}^{M} x_{\ell,m} W_{m,s} + b_s \right) \tag{2.12}\\
&= \sigma_k(f(x))\left(\delta_{ks} - \sigma_s(f(x))\right) x_{\ell,j} \tag{2.13}
\end{align}


For the bias, the derivation is similar except that $\frac{\partial f_s(x)}{\partial b_s} = 1$, so we get

\begin{align}
\frac{\partial}{\partial b_s} \sigma_k(f(x)) &= \sigma_k(f(x))\left(\delta_{ks} - \sigma_s(f(x))\right) \tag{2.14}
\end{align}

2.2 Optimization methods

Now that we have obtained gradients for the loss with respect to our weights, we have to use them to
update the weights.
There are several optimization methods available.

2.2.1 Gradient Descent


For each weight $w$, we have a gradient $dw = \frac{dL}{dw}$. We update with the goal of minimizing the loss, so we step against the gradient:

$$w_{\text{new}} = w_{\text{old}} - \alpha \, dw$$

How do you choose the value of $\alpha$, also called the learning rate? Another problem:

$$L = \sum_{(x_i, y_i) \in D} L_i = \sum_{(x_i, y_i) \in D} L(x_i, y_i, W)$$

The loss is a sum over all samples $(x_i, y_i)$ in our dataset $D$, so the full gradient is

$$dw = \sum_{(x_i, y_i) \in D} \frac{dL(x_i, y_i, W)}{dw}$$

To obtain one full gradient, we need to process the full dataset, which is often too expensive.
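
As a sketch of the update rule, consider a toy example of our own where the gradient is available in closed form:

```python
import numpy as np

# Toy problem: minimize L(w) = ||w - w_star||^2, whose gradient is 2(w - w_star).
w_star = np.array([3.0, -2.0])
w = np.zeros(2)
alpha = 0.1  # learning rate

for step in range(100):
    dw = 2 * (w - w_star)   # full gradient of the loss
    w = w - alpha * dw      # step against the gradient

print(w)  # approaches w_star
```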

2.2.2 Stochastic Gradient Descent (SGD)

At each step, we select a random (stochastic) subsample of the data, called a minibatch $MB$:

$$MB \subset D, \quad |MB| = S$$

For example, the batch size $S$ could be 10, 32, 76, etc. Then we only compute the gradients with respect to this minibatch:

$$d\hat{w} = \frac{1}{S} \sum_{(x_i, y_i) \in MB} \frac{dL(x_i, y_i, W)}{dw}$$

$d\hat{w}$ is noisy and depends on the minibatch selected, as well as the batch size. However, we use it as a substitute for the true gradient:

$$w_{\text{new}} = w_{\text{old}} - \alpha \, d\hat{w}$$
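
A minimal minibatch loop might look like the following sketch (`grad_loss` is a hypothetical function, not from any particular library, assumed to return the average gradient over a minibatch):

```python
import numpy as np

def sgd(w, X, Y, grad_loss, alpha=0.01, S=32, steps=1000):
    # grad_loss(w, X_batch, Y_batch) is assumed to return the average
    # gradient of the loss over the given minibatch.
    n = len(X)
    for _ in range(steps):
        idx = np.random.choice(n, size=S, replace=False)  # random minibatch
        dw_hat = grad_loss(w, X[idx], Y[idx])             # noisy gradient estimate
        w = w - alpha * dw_hat                            # SGD update
    return w
```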

2.2.3 SGD with momentum

What if we kept the noisy gradients and summed them into a "momentum vector" to reduce noise? At each step, update the momentum with the new gradient:

$$M_{w,t+1} = \mu M_{w,t} - \alpha \, d\hat{w}$$

where $\mu \in [0, 1]$ is the momentum constant, which determines the forgetting rate of old gradients. Then use the momentum to update the weights:

$$w_{t+1} = w_t + M_{w,t+1}$$
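
In code, this is a two-line change to SGD (a sketch using the same hypothetical `grad_loss` as above):

```python
import numpy as np

def sgd_momentum(w, X, Y, grad_loss, alpha=0.01, mu=0.9, S=32, steps=1000):
    M = np.zeros_like(w)  # momentum vector
    n = len(X)
    for _ in range(steps):
        idx = np.random.choice(n, size=S, replace=False)
        dw_hat = grad_loss(w, X[idx], Y[idx])
        M = mu * M - alpha * dw_hat   # accumulate gradients, forgetting old ones
        w = w + M                     # step along the momentum
    return w
```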

Figure 2.1: Effect of using momentum on a 2-d problem. The lines are equipotential lines of the loss, and
the momentum method converges faster, with the intuition that it remembers the direction of the older
gradients, which were “pushing” it towards the center, representing the local minimum.

2.2.4 RMSProp

Similar to SGD, this method runs with randomly selected minibatches. We expect the parameters to be on the same scale; however, the gradients of some weights might have a much larger magnitude than others, which might make those weights fluctuate excessively.
In RMSProp, you keep a moving average of the "norm" of the gradients for each weight, and normalize the gradient by this norm. This is equivalent to having a different learning rate for each parameter, based on its gradient.
At each step we update the norm estimate for each weight:

$$s_{w,t} = \beta s_{w,t-1} + (1 - \beta)(dw)^2$$

where $\beta \in [0, 1]$ is the decay factor of the norm term. Then we apply the normalized gradient update:

$$w_{t+1} = w_t - \frac{\alpha}{\sqrt{s_{w,t}} + \epsilon} \, dw$$

These operations are applied element-wise to each element of the vectors involved. (If we're dealing with a multi-layer neural network, think of $w$ as the concatenation of all parameters into one giant vector.) For example, $\sqrt{s_{w,t}}$ means taking the square root of each element in the vector $s_{w,t}$.
Note: ADAM is roughly RMSProp with a momentum update. Its two decay parameters (corresponding to $\mu$ and $\beta$) are called $\beta_1$ and $\beta_2$.
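
Here is a sketch of the RMSProp updates (same hypothetical `grad_loss` convention as the earlier sketches; all operations on `s` and `w` are element-wise):

```python
import numpy as np

def rmsprop(w, X, Y, grad_loss, alpha=0.001, beta=0.9, eps=1e-8, S=32, steps=1000):
    s = np.zeros_like(w)  # running estimate of the squared-gradient "norm"
    n = len(X)
    for _ in range(steps):
        idx = np.random.choice(n, size=S, replace=False)
        dw = grad_loss(w, X[idx], Y[idx])
        s = beta * s + (1 - beta) * dw**2        # element-wise moving average
        w = w - alpha * dw / (np.sqrt(s) + eps)  # per-parameter normalized step
    return w
```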
Here’s a blog post with an up-to-date list of popular optimization methods for deep learning: http://ruder.io/optimizing-gradient-descent/

2.3 Training advice and strategies

Now we have a way to decide on a neural network architecture, obtain gradients for all the weights
(backpropagation), and update the weights (optimization method).
How do we know whether the neural network is training properly? How do we compare neural networks?
These judgments are usually based on a metric evaluated on the validation dataset; the validation loss is often used.
Here are things to keep in mind.

2.3.1 Tuning the learning rate

All the methods described above involve a learning rate. How is this rate chosen?
Answer: start with some initial learning rate and then increase or decrease it until you find a good rate. Can you tell, in each of the situations shown in Figure 2.2, whether the rate is too high or too low?

Figure 2.2: Plotting the validation loss for the same architecture and optimization method, with varying learning rates. When is the learning rate too high, too low, or at the right level?

A straightforward way to find a reasonable learning rate is to test one value per order of magnitude, such as $\alpha \in \{10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}\}$.
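
A sketch of such a sweep (`train_and_validate` is a hypothetical function assumed to train a model with the given learning rate and return its final validation loss):

```python
def sweep_learning_rates(train_and_validate, rates=(1e-5, 1e-4, 1e-3, 1e-2)):
    # train_and_validate(alpha) is assumed to train with learning rate alpha
    # and return the resulting validation loss.
    results = {alpha: train_and_validate(alpha) for alpha in rates}
    best_alpha = min(results, key=results.get)  # lowest validation loss wins
    return best_alpha, results
```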

2.3.2 Choosing an optimization method

You can start simple with SGD, and try more complicated optimization methods such as ADAM. Note
that each method needs the learning rate to be re-tuned, as the best learning rate for SGD might not be
the best one for RMSProp.

Figure 2.3: The optimization method can both change what value the loss converges to, as well as how
fast it converges to it.

2.3.3 Learning rate schedules

There is no reason the learning rate has to be fixed throughout training. Commonly, the learning rate is reduced over time, allowing the network to be fine-tuned with smaller gradients. An exponential decay can be used for the learning rate:

$$\alpha_t = \alpha_0 e^{-\lambda t}$$
Another strategy is to keep the learning rate constant until the validation loss stabilizes, then reduce the
learning rate, and keep training.
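
Both schedules are a few lines of code. Here is a sketch (the exponential decay follows the formula above; the reduce-on-plateau rule is a simple version of our own devising):

```python
import math

def exponential_decay(alpha0, lam, t):
    # alpha_t = alpha_0 * exp(-lambda * t)
    return alpha0 * math.exp(-lam * t)

def reduce_on_plateau(alpha, val_losses, patience=5, factor=10.0):
    # If the validation loss has not improved for `patience` epochs,
    # divide the learning rate by `factor` (as in Figure 2.4).
    if len(val_losses) > patience and \
            min(val_losses[-patience:]) >= min(val_losses[:-patience]):
        return alpha / factor
    return alpha
```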

Figure 2.4: The validation accuracy during training of an AlexNet. Every time the accuracy plateaus, $\alpha$ is divided by 10, and training continues.
