Machine Learning - Exercise 4: Companion Slides

Machine Learning - Exercise 4
Companion Slides
Ali Athar Sabarinath Mahadevan
December 6, 2018
Exercise Goal Lecture Recap
General backpropagation with

Backpropagation for fixed network computational graphs
This exercise is about
▸ Understanding backpropagation, deriving formulas, optimizing them

▸ Implement simple neural network framework yourself
▸ Digit recognition
Recap: Neural Networks
labels
Network’s
fact Network’s floss error rate
inputs
output (E)
Linear module, Θi = (W, b) Activation function (tanh, σ, ReLu)

Parameters
▸ Training data (inputs) X = {xi}i=1..N with xi ∊ 𝕀, N the batch size
▸ Training labels T = {ti}i=1..N with xi ∊ 𝕆
▸ Network is a parametrized, (sub-)differentiable function F(X,Θ) : 𝕀 x ℙ → 𝕆
▸ e.g., 𝕆 = ℝDim (regression), 𝕆 = [0,1]Dim (prob. classification)

▸ Loss (criterion) L (T,F(X,Θ)) : 𝕆 x 𝕆 → ℝ, put on top of output to measure performance
▸ find optimal parameters: Θ* = argminΘ L (T,F(X,Θ))
Recap: Backpropagation
labels
Network’s
fact Network’s floss error rate
inputs
output (E)
▸ Optimize towards lower error rate, i.e., lower E

▸ Take derivative of E with respect to each modules parameters, follow gradient
▸ Example: Gradient Descent: Θ = Θ - λ*DΘ(E(x))
▸ DΘ(E(x)) = DΘ(E) for brevity
Derivative w.r.t. modules parameters Θ at point x
▸ How to calculate DΘ(E)
Learning rate
▸ Reverse order of modules
▸ Module gets Dout(E), calculates DΘ(E), passes Din(E) to next module
Example: Linear/Fully Connected Module
Given: Derivative with respect to output
Calculate:
x y = WT·x + b ▸ Derivatives with respect to parameters Θ
Θ = (W, b)
Element-wise:
!
Without batching
Given: Derivative with respect to output
Calculate:
x y = WT·x + b ▸ Derivative with respect to input
Θ = (W, b)
Element-wise:
!
Without batching
Putting it together:
fprop(x):
cache.x = x run training data through
x y = WT·x + b return WT*x + b (forwards)
bprop(dE):
dW = cache.x * dE
db = dE run gradients through
return dE * W (backwards)
Θ = (W, b)
update(rate):
W = W - rate*dW
update the parameters
b = b - rate*dbT
(grad. descent)
!
Without batching
Mini-Batching
▸ Batch learning
▸ All training samples processed at once, parameters updated once at the end
▸ Stable, well understood, many acceleration techniques, but slow
▸ Stochastic learning
▸ Each training sample separately, parameters updated at each step
▸ Noisy (though may lead to better results), fast
▸ Mini-batching
▸ Middle ground, batches of data processed, bundled updates
▸ Combine advantages, reduce drawbacks
▸ Example
▸ Linear Module f with input dimension Nin and output dimension Nout, batch size n
broadcast (i.e., repeat) b
mini-batch matrix
Batching Update Rule
▸ (Mini-)Batch learning
▸ Multiple samples processed at once
▸ Calculate gradient for each sample, but don’t update the parameters
▸ After processing the batch, update using a sum of all gradients
▸ Learning rate has to be adapted, e.g., divide E by batch size
▸ Example: Gradient Descent
Derivative of E w.r.t parameters Θ at point xk
▸ To make things easier, we write

Example: Linear/Fully Connected Module - Batching
Given: Derivatives with respect to outputs

Plural!
Calculate:
xk yk = WT·xk + b ▸ Derivatives with respect to parameters Θ
Θ = (W, b)
kth element in the
batch
Deriv. w.r.t outputs

assumed to be
given row-wise:
Example: Linear/Fully Connected Module - Batching
Given: Derivatives with respect to outputs

Plural!
Calculate:
xk yk = WT·xk + b ▸ Derivatives with respect to inputs
Θ = (W, b)
kth element in the
batch
Deriv. w.r.t outputs

assumed to be
given row-wise:
Example: Training a Network
1. network = [module1,module2, …, modulen], loss = floss

2.
3. for X, T in batched(inputs, labels) do
4. z = X
5. for module in net do
6. z = module.fprop(z)
7. end for
8. E = loss.fprop(z,T)
9. dz = loss.bprop(1/batchSize) // Normalization for batch size
for module in reversed(net) do
10.
dz = module.brop(dz)
11.
end for
12. for module in net do
13. module.update(rate)
14. end for
15. end for
16.
Debugging Tip: Gradient Checking
Check the Jacobian J from bprop with numerical differentiation
▸ Numerical approach: Column-wise (here for the first column)
▸ Backprop: Row-wise (here for the first row)
▸ Advice
▸ Use (small) random x
Expected Results/Tips for MNIST
▸ [Linear(28x28, 10), Softmax]
▸ should give ± 750 errors
▸ [Linear(28x28, 200), tanh, Linear(200,10), Softmax]
▸ should give ± 250 errors
▸ Typical learning rates
▸ λ ∊ [0.1, 0.01]
▸ Typical batch sizes
▸ NB ∊ [100, 1000]
▸ Weight initialization
▸ W ∊ ℝMxN
▸ W ~ N (0, ), i.e., sampled from normal distribution around 0 with deviation
▸ b=0
▸ Pre-process the data
▸ Dividide values by 255 (= max pixel value)

Machine Learning - Exercise 4: Companion Slides

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Machine Learning - Exercise 4: Companion Slides

Uploaded by

Copyright:

Available Formats

Machine Learning - Exercise 4

Ali Athar Sabarinath Mahadevan

General backpropagation with

This exercise is about

▸ Understanding backpropagation, deriving formulas, optimizing them

Linear module, Θi = (W, b) Activation function (tanh, σ, ReLu)

▸ Training labels T = {ti}i=1..N with xi ∊ 𝕆

▸ Network is a parametrized, (sub-)differentiable function F(X,Θ) : 𝕀 x ℙ → 𝕆

▸ e.g., 𝕆 = ℝDim (regression), 𝕆 = [0,1]Dim (prob. classification)

▸ Optimize towards lower error rate, i.e., lower E

Given: Derivative with respect to output

Given: Derivative with respect to output

broadcast (i.e., repeat) b

▸ Example: Gradient Descent

Derivative of E w.r.t parameters Θ at point xk

▸ To make things easier, we write

Given: Derivatives with respect to outputs

Deriv. w.r.t outputs

Given: Derivatives with respect to outputs

Deriv. w.r.t outputs

1. network = [module1,module2, …, modulen], loss = floss

▸ Numerical approach: Column-wise (here for the first column)

▸ Backprop: Row-wise (here for the first row)

You might also like