
CS 182/282A: Designing, Visualizing and Understanding Deep Neural Networks Spring 2019

Section 2: Backpropagation and Optimization

2.1 Backpropagation

We will walk through forward and backward propagation on a simple example.
Suppose we are given a dataset $X = \{x_1, \ldots, x_n\}$, where each $x_i \in \mathbb{R}^M$. Each example in this dataset has a class $Y \in \{1, 2, \ldots, K\}$, which you are trying to predict.
To do this, you set up a simple network:

\begin{align}
p(Y = k \mid x) &= \operatorname{softmax}(xW + b)[k] \tag{2.1}\\
&= \frac{\exp\left(x W_{:,k} + b_k\right)}{\sum_{s=1}^{K} \exp\left(x W_{:,s} + b_s\right)} \tag{2.2}\\
&= \frac{\exp\left(\sum_{j=1}^{M} x_{\ell,j} W_{j,k} + b_k\right)}{\sum_{s=1}^{K} \exp\left(\sum_{j=1}^{M} x_{\ell,j} W_{j,s} + b_s\right)} \tag{2.3}
\end{align}


where [k] represents the k-th index of the vector resulting from the softmax.
Answer the following questions:

1. Suppose we use a batch size of B. What should we expect to be the shapes of x, W, and b? What is
the shape of the resulting output vector?

2. Related: Why do we write this as xW, rather than Wx?



3. What is the derivative of the softmax function w.r.t. its input, $\frac{\partial \sigma(v)}{\partial v}$, where $v$ is a vector of size $K$?

4. What is the derivative of p(Y = k| x ) w.r.t. a weight parameter Wj,s ? What about w.r.t. a bias
parameter bs ? (Hint: Use the result from Part 3.)

Solutions:

1. $x$ is a batch of vectors in $\mathbb{R}^M$, so its shape is $B \times M$. $W$ and $b$ transform this into $K$ classes, so $W$ has shape $M \times K$ and $b$ has shape $K$; the resulting output has shape $B \times K$. Note that $b$ is 'broadcast' across the batch when added (this can also be implemented as part of the matrix multiply by appending a constant-1 dimension to the input).

2. If $x$ is batched, then $Wx$ makes no sense: an $M \times K$ matrix cannot be multiplied by a $B \times M$ matrix. Since the batch is almost always the first dimension, this is usually implemented as $xW$ instead of $Wx$.

3. Because softmax is a multiple-input, multiple-output function, we have to compute the derivative for each input–output pair (a.k.a. the Jacobian). We compute $\frac{\partial \sigma_j(v)}{\partial v_i}$, where $j$ indexes the output dimension and $i$ the input dimension of the Jacobian.

\begin{align}
\frac{\partial}{\partial v_i} \sigma_j(v) &= \frac{\partial}{\partial v_i} \frac{\exp(v_j)}{\sum_{k=1}^{K} \exp(v_k)} \tag{2.4}\\
&= \frac{\sum_{k=1}^{K} \exp(v_k) \, \frac{\partial}{\partial v_i} \exp(v_j) - \exp(v_j) \, \frac{\partial}{\partial v_i} \sum_{k=1}^{K} \exp(v_k)}{\left(\sum_{k=1}^{K} \exp(v_k)\right)^2} \tag{2.5}\\
&= \frac{\sum_{k=1}^{K} \exp(v_k) \, \delta_{ij} \exp(v_i) - \exp(v_j) \exp(v_i)}{\left(\sum_{k=1}^{K} \exp(v_k)\right)^2} \tag{2.6}\\
&= \frac{\exp(v_i)}{\sum_{k=1}^{K} \exp(v_k)} \cdot \frac{\delta_{ij} \sum_{k=1}^{K} \exp(v_k) - \exp(v_j)}{\sum_{k=1}^{K} \exp(v_k)} \tag{2.7}\\
&= \sigma_i(v)\left(\delta_{ij} - \sigma_j(v)\right) \tag{2.8}
\end{align}

where $\delta_{ij}$ is the Kronecker delta (equal to 1 when $i = j$, otherwise 0), and $\sigma_i$ represents the $i$-th output of the softmax function applied to the vector $v$. This gives the derivative in terms of the output, which is common when exponentials are involved. See https://eli.thegreenplace.net/2016/the-softmax-function-and-its-derivative/ for another derivation with a bit more explanation of each step.
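
As a sanity check, the identity in (2.8) can be verified numerically. Here is a minimal NumPy sketch (the helper names `softmax` and `softmax_jacobian` are our own, not from any library) that compares the analytic Jacobian against finite differences:

```python
import numpy as np

def softmax(v):
    # Subtract the max for numerical stability; the result is unchanged.
    e = np.exp(v - v.max())
    return e / e.sum()

def softmax_jacobian(v):
    # J[j, i] = d sigma_j / d v_i = sigma_j * (delta_ij - sigma_i),
    # which equals sigma_i * (delta_ij - sigma_j) as in (2.8).
    s = softmax(v)
    return np.diag(s) - np.outer(s, s)

v = np.random.randn(5)
J = softmax_jacobian(v)

# Finite-difference check of each column (input dimension) of the Jacobian.
eps = 1e-6
for i in range(5):
    dv = np.zeros(5)
    dv[i] = eps
    numeric = (softmax(v + dv) - softmax(v - dv)) / (2 * eps)
    assert np.allclose(J[:, i], numeric, atol=1e-6)
```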

4. In our notation, we let $p(Y = k \mid x) = \sigma_k(f(x))$, where $f(x)$ is the output of our affine layer. Therefore, we want to compute $\frac{\partial}{\partial W_{j,s}} \sigma_k(f(x))$. Since only the $s$-th component of $f(x)$ depends on $W_{j,s}$, the chain rule gives

\begin{align}
\frac{\partial}{\partial W_{j,s}} \sigma_k(f(x)) &= \frac{\partial \sigma_k(f(x))}{\partial f_s(x)} \, \frac{\partial f_s(x)}{\partial W_{j,s}} \tag{2.9}\\
&= \sigma_k(f(x))\left(\delta_{ks} - \sigma_s(f(x))\right) \frac{\partial}{\partial W_{j,s}} f_s(x) \tag{2.10}\\
&= \sigma_k(f(x))\left(\delta_{ks} - \sigma_s(f(x))\right) \frac{\partial}{\partial W_{j,s}} (xW + b)_s \tag{2.11}\\
&= \sigma_k(f(x))\left(\delta_{ks} - \sigma_s(f(x))\right) \frac{\partial}{\partial W_{j,s}} \left( \sum_{m=1}^{M} x_{\ell,m} W_{m,s} + b_s \right) \tag{2.12}\\
&= \sigma_k(f(x))\left(\delta_{ks} - \sigma_s(f(x))\right) x_{\ell,j} \tag{2.13}
\end{align}


For the bias, the derivation is similar except that $\frac{\partial f_s(x)}{\partial b_s} = 1$, so we get

\begin{align}
\frac{\partial}{\partial b_s} \sigma_k(f(x)) &= \sigma_k(f(x))\left(\delta_{ks} - \sigma_s(f(x))\right) \tag{2.14}
\end{align}

2.2 Optimization methods

Now that we have obtained gradients for the loss with respect to our weights, we have to use them to
update the weights.
There are several optimization methods available.

2.2.1 Gradient Descent


For each weight $w$, we have a gradient $dw = \frac{dL}{dw}$. We update with the goal of minimizing the loss, so we step against the gradient:

$$w_{\text{new}} = w_{\text{old}} - \alpha \, dw$$

How do you choose the value of $\alpha$, also called the learning rate? Another problem:

$$L = \sum_{(x_i, y_i) \in D} L_i = \sum_{(x_i, y_i) \in D} L(x_i, y_i, W)$$

The loss is a sum over all samples $(x_i, y_i)$ in our dataset $D$, so the full gradient is

$$dw = \sum_{(x_i, y_i) \in D} \frac{dL(x_i, y_i, W)}{dw}$$

To obtain one full gradient, we need to process the full dataset, which is often too expensive.
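
As a sketch of the update rule, consider a toy example of our own where the gradient is available in closed form:

```python
import numpy as np

# Toy problem: minimize L(w) = ||w - w_star||^2, whose gradient is 2(w - w_star).
w_star = np.array([3.0, -2.0])
w = np.zeros(2)
alpha = 0.1  # learning rate

for step in range(100):
    dw = 2 * (w - w_star)   # full gradient of the loss
    w = w - alpha * dw      # step against the gradient

print(w)  # approaches w_star
```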

2.2.2 Stochastic Gradient Descent (SGD)

At each step, we select a random (stochastic) subsample of the data, called a minibatch $MB$:

$$MB \subset D, \quad |MB| = S$$

For example, the batch size $S$ could be 10, 32, 76, etc. Then we only compute the gradients with respect to this minibatch:

$$d\hat{w} = \frac{1}{S} \sum_{(x_i, y_i) \in MB} \frac{dL(x_i, y_i, W)}{dw}$$

$d\hat{w}$ is noisy and depends on the minibatch selected, as well as the batch size. However, we use it as a substitute for the true gradient:

$$w_{\text{new}} = w_{\text{old}} - \alpha \, d\hat{w}$$
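
A minimal minibatch loop might look like the following sketch (`grad_loss` is a hypothetical function, not from any particular library, assumed to return the average gradient over a minibatch):

```python
import numpy as np

def sgd(w, X, Y, grad_loss, alpha=0.01, S=32, steps=1000):
    # grad_loss(w, X_batch, Y_batch) is assumed to return the average
    # gradient of the loss over the given minibatch.
    n = len(X)
    for _ in range(steps):
        idx = np.random.choice(n, size=S, replace=False)  # random minibatch
        dw_hat = grad_loss(w, X[idx], Y[idx])             # noisy gradient estimate
        w = w - alpha * dw_hat                            # SGD update
    return w
```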

2.2.3 SGD with momentum

What if we kept the noisy gradients and summed them into a "momentum vector" to reduce noise? At each step, update the momentum with the new gradient:

$$M_{w,t+1} = \mu M_{w,t} - \alpha \, d\hat{w}$$

where $\mu \in [0, 1]$ is the momentum constant, which determines the forgetting rate of old gradients. Then use the momentum to update the weights:

$$w_{t+1} = w_t + M_{w,t+1}$$
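
In code, this is a two-line change to SGD (a sketch using the same hypothetical `grad_loss` as above):

```python
import numpy as np

def sgd_momentum(w, X, Y, grad_loss, alpha=0.01, mu=0.9, S=32, steps=1000):
    M = np.zeros_like(w)  # momentum vector
    n = len(X)
    for _ in range(steps):
        idx = np.random.choice(n, size=S, replace=False)
        dw_hat = grad_loss(w, X[idx], Y[idx])
        M = mu * M - alpha * dw_hat   # accumulate gradients, forgetting old ones
        w = w + M                     # step along the momentum
    return w
```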

Figure 2.1: Effect of using momentum on a 2-d problem. The lines are equipotential lines of the loss, and
the momentum method converges faster, with the intuition that it remembers the direction of the older
gradients, which were “pushing” it towards the center, representing the local minimum.

2.2.4 RMSProp

Similar to SGD, this method runs with randomly selected minibatches. We expect the parameters to be on the same scale; however, the gradients of some weights might have a much larger magnitude than others, which might make those weights fluctuate excessively.
In RMSProp, you keep a moving average of the "norm" of the gradients for each weight, and normalize the gradient by this norm. This is equivalent to having a different learning rate for each parameter, based on its gradient.
At each step we update the norm estimate for each weight:

$$s_{w,t} = \beta s_{w,t-1} + (1 - \beta)(dw)^2$$

where $\beta \in [0, 1]$ is the decay factor of the norm term. Then we apply the normalized gradient update:

$$w_{t+1} = w_t - \frac{\alpha}{\sqrt{s_{w,t}} + \epsilon} \, dw$$

These operations are applied element-wise to each element of the vectors involved. (If we're dealing with a multi-layer neural network, think of $w$ as the concatenation of all parameters into one giant vector.) For example, $\sqrt{s_{w,t}}$ means taking the square root of each element in the vector $s_{w,t}$.
Note: ADAM is roughly RMSProp with a momentum update. Its two decay parameters (corresponding to $\mu$ and $\beta$) are called $\beta_1$ and $\beta_2$.
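
Here is a sketch of the RMSProp updates (same hypothetical `grad_loss` convention as the earlier sketches; all operations on `s` and `w` are element-wise):

```python
import numpy as np

def rmsprop(w, X, Y, grad_loss, alpha=0.001, beta=0.9, eps=1e-8, S=32, steps=1000):
    s = np.zeros_like(w)  # running estimate of the squared-gradient "norm"
    n = len(X)
    for _ in range(steps):
        idx = np.random.choice(n, size=S, replace=False)
        dw = grad_loss(w, X[idx], Y[idx])
        s = beta * s + (1 - beta) * dw**2        # element-wise moving average
        w = w - alpha * dw / (np.sqrt(s) + eps)  # per-parameter normalized step
    return w
```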
Here’s a blog post with an up-to-date list of popular optimization methods for deep learning: http://ruder.io/optimizing-gradient-descent/

2.3 Training advice and strategies

Now we have a way to decide on a neural network architecture, obtain gradients for all the weights
(backpropagation), and update the weights (optimization method).
How do we know whether the neural network is training properly? How do we compare neural networks?
These judgments are usually based on a metric evaluated on the validation dataset; the validation loss is often used.
Here are things to keep in mind.

2.3.1 Tuning the learning rate

All the methods described above involve a learning rate. How is this rate chosen?
Answer: start with some initial learning rate and then increase or decrease it until you find a good rate. Can you tell, in each of the situations shown in Figure 2.2, whether the rate is too high or too low?

Figure 2.2: Plotting the validation loss for the same architecture and optimization method, with varying learning rates. When is the learning rate too high, too low, or at the right level?

A straightforward way to find a reasonable learning rate is to test one value per order of magnitude, such as $\alpha \in \{10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}\}$.
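
A sketch of such a sweep (`train_and_validate` is a hypothetical function assumed to train a model with the given learning rate and return its final validation loss):

```python
def sweep_learning_rates(train_and_validate, rates=(1e-5, 1e-4, 1e-3, 1e-2)):
    # train_and_validate(alpha) is assumed to train with learning rate alpha
    # and return the resulting validation loss.
    results = {alpha: train_and_validate(alpha) for alpha in rates}
    best_alpha = min(results, key=results.get)  # lowest validation loss wins
    return best_alpha, results
```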

2.3.2 Choosing an optimization method

You can start simple with SGD, and try more complicated optimization methods such as ADAM. Note
that each method needs the learning rate to be re-tuned, as the best learning rate for SGD might not be
the best one for RMSProp.

Figure 2.3: The optimization method can both change what value the loss converges to, as well as how
fast it converges to it.

2.3.3 Learning rate schedules

There is no reason the learning rate has to be fixed throughout training. Commonly, the learning rate is reduced over time, allowing the network to be fine-tuned with smaller gradients. An exponential decay can be used for the learning rate:

$$\alpha_t = \alpha_0 e^{-\lambda t}$$
Another strategy is to keep the learning rate constant until the validation loss stabilizes, then reduce the
learning rate, and keep training.
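
Both schedules are a few lines of code. Here is a sketch (the exponential decay follows the formula above; the reduce-on-plateau rule is a simple version of our own devising):

```python
import math

def exponential_decay(alpha0, lam, t):
    # alpha_t = alpha_0 * exp(-lambda * t)
    return alpha0 * math.exp(-lam * t)

def reduce_on_plateau(alpha, val_losses, patience=5, factor=10.0):
    # If the validation loss has not improved for `patience` epochs,
    # divide the learning rate by `factor` (as in Figure 2.4).
    if len(val_losses) > patience and \
            min(val_losses[-patience:]) >= min(val_losses[:-patience]):
        return alpha / factor
    return alpha
```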

Figure 2.4: The validation accuracy during training of an AlexNet. Every time the accuracy plateaus, $\alpha$ is divided by 10, and training continues.
