You are on page 1of 14

Machine Learning - Exercise 4

Companion Slides

Ali Athar Sabarinath Mahadevan

December 6, 2018
Exercise Goal Lecture Recap

General backpropagation with


Backpropagation for fixed network computational graphs

This exercise is about

▸ Understanding backpropagation, deriving formulas, optimizing them


▸ Implement simple neural network framework yourself
▸ Digit recognition
Recap: Neural Networks
labels

Network’s
fact Network’s floss error rate
inputs
output (E)

Linear module, Θi = (W, b) Activation function (tanh, σ, ReLu)


Parameters
▸ Training data (inputs) X = {xi}i=1..N with xi ∊ 𝕀, N the batch size

▸ Training labels T = {ti}i=1..N with xi ∊ 𝕆

▸ Network is a parametrized, (sub-)differentiable function F(X,Θ) : 𝕀 x ℙ → 𝕆

▸ e.g., 𝕆 = ℝDim (regression), 𝕆 = [0,1]Dim (prob. classification)


▸ Loss (criterion) L (T,F(X,Θ)) : 𝕆 x 𝕆 → ℝ, put on top of output to measure performance
▸ find optimal parameters: Θ* = argminΘ L (T,F(X,Θ))
Recap: Backpropagation
labels

Network’s
fact Network’s floss error rate
inputs
output (E)

▸ Optimize towards lower error rate, i.e., lower E


▸ Take derivative of E with respect to each modules parameters, follow gradient
▸ Example: Gradient Descent: Θ = Θ - λ*DΘ(E(x))
▸ DΘ(E(x)) = DΘ(E) for brevity
Derivative w.r.t. modules parameters Θ at point x
▸ How to calculate DΘ(E)
Learning rate
▸ Reverse order of modules
▸ Module gets Dout(E), calculates DΘ(E), passes Din(E) to next module
Example: Linear/Fully Connected Module

Given: Derivative with respect to output

Calculate:
x y = WT·x + b ▸ Derivatives with respect to parameters Θ

Θ = (W, b)

Element-wise:

!
Without batching
Example: Linear/Fully Connected Module

Given: Derivative with respect to output

Calculate:
x y = WT·x + b ▸ Derivative with respect to input

Θ = (W, b)

Element-wise:

!
Without batching
Example: Linear/Fully Connected Module
Putting it together:

fprop(x):
cache.x = x run training data through
x y = WT·x + b return WT*x + b (forwards)

bprop(dE):
dW = cache.x * dE
db = dE run gradients through
return dE * W (backwards)
Θ = (W, b)
update(rate):
W = W - rate*dW
update the parameters
b = b - rate*dbT
(grad. descent)

!
Without batching
Mini-Batching
▸ Batch learning
▸ All training samples processed at once, parameters updated once at the end
▸ Stable, well understood, many acceleration techniques, but slow
▸ Stochastic learning
▸ Each training sample separately, parameters updated at each step
▸ Noisy (though may lead to better results), fast
▸ Mini-batching
▸ Middle ground, batches of data processed, bundled updates
▸ Combine advantages, reduce drawbacks
▸ Example
▸ Linear Module f with input dimension Nin and output dimension Nout, batch size n

broadcast (i.e., repeat) b

mini-batch matrix
Batching Update Rule
▸ (Mini-)Batch learning
▸ Multiple samples processed at once
▸ Calculate gradient for each sample, but don’t update the parameters
▸ After processing the batch, update using a sum of all gradients
▸ Learning rate has to be adapted, e.g., divide E by batch size

▸ Example: Gradient Descent

Derivative of E w.r.t parameters Θ at point xk

▸ To make things easier, we write


Example: Linear/Fully Connected Module - Batching

Given: Derivatives with respect to outputs


Plural!
Calculate:
xk yk = WT·xk + b ▸ Derivatives with respect to parameters Θ

Θ = (W, b)
kth element in the
batch

Deriv. w.r.t outputs


assumed to be
given row-wise:
Example: Linear/Fully Connected Module - Batching

Given: Derivatives with respect to outputs


Plural!
Calculate:
xk yk = WT·xk + b ▸ Derivatives with respect to inputs

Θ = (W, b)
kth element in the
batch

Deriv. w.r.t outputs


assumed to be
given row-wise:
Example: Training a Network

1. network = [module1,module2, …, modulen], loss = floss


2.
3. for X, T in batched(inputs, labels) do
4. z = X
5. for module in net do
6. z = module.fprop(z)
7. end for
8. E = loss.fprop(z,T)
9. dz = loss.bprop(1/batchSize) // Normalization for batch size
for module in reversed(net) do
10.
dz = module.brop(dz)
11.
end for
12. for module in net do
13. module.update(rate)
14. end for
15. end for
16.
Debugging Tip: Gradient Checking
Check the Jacobian J from bprop with numerical differentiation

▸ Numerical approach: Column-wise (here for the first column)

▸ Backprop: Row-wise (here for the first row)

▸ Advice
▸ Use (small) random x
Expected Results/Tips for MNIST
▸ [Linear(28x28, 10), Softmax]
▸ should give ± 750 errors
▸ [Linear(28x28, 200), tanh, Linear(200,10), Softmax]
▸ should give ± 250 errors
▸ Typical learning rates
▸ λ ∊ [0.1, 0.01]
▸ Typical batch sizes
▸ NB ∊ [100, 1000]
▸ Weight initialization
▸ W ∊ ℝMxN
▸ W ~ N (0, ), i.e., sampled from normal distribution around 0 with deviation
▸ b=0
▸ Pre-process the data
▸ Dividide values by 255 (= max pixel value)

You might also like