
Deep Learning Lab Course 2017

(Deep Learning Practical)

Labs:
(Computer Vision) Thomas Brox,
(Robotics) Wolfram Burgard,
(Machine Learning) Frank Hutter,
(Neurorobotics) Joschka Boedecker

University of Freiburg

January 12, 2018

Andreas Eitel Uni FR DL 2017 (1)


Technical Issues

I Location: Monday, 14:00 - 16:00, building 082, room 00 028


I Remark: We will be there for questions every week from 14:00 - 16:00.
I We expect you to work on your own. Your attendance is required during
lectures/presentations.
I We expect you to have basic knowledge in ML (e.g. you have attended the
Machine Learning lecture).
I Contact information (tutors):
Aaron Klein kleinaa@cs.uni-freiburg.de
Artemij Amiranashvili amiranas@cs.uni-freiburg.de
Mohammadreza Zolfaghari zolfagha@cs.uni-freiburg.de
Maria Hügle hueglem@cs.uni-freiburg.de
Jingwhei Zhang zhang@cs.uni-freiburg.de
Andreas Eitel eitel@cs.uni-freiburg.de
I Homepage:
http://ais.informatik.uni-freiburg.de/teaching/ws17/deep learning course

Andreas Eitel Uni FR DL 2017 (2)


Schedule and outline

I Phase 1
I Today: Introduction to Deep Learning (lecture).
I Assignment 1) 16.10 - 06.11: Implementation of a feed-forward neural
network from scratch (each person individually, 1-2 page report).
I 23.10: meeting to resolve open questions
I 30.10: meeting to resolve open questions
I 06.11: Intro to CNNs and TensorFlow (mandatory lecture), hand in
Assignment 1
I Assignment 2) 06.11 - 20.11: Train a CNN in TensorFlow (each person
individually, 1-2 page report)
I 13.11: meeting to resolve open questions
I Phase 2 (split into three tracks)
I 20.11: First meeting in tracks, hand in Assignment 2
I Assignment 3) 20.11 - 18.12: Group work on a track-specific task, 1-2 page
report
I Assignment 4) 18.12 - 22.01: Group work on an advanced topic, 1-2 page report
I Phase 3
I 15.01: meeting on final project topics and resolving open questions
I Assignment 5) 22.01 - 12.02: Solve a challenging problem of your choice
using the tools from Phase 2 (group work + 5 min presentation)

Andreas Eitel Uni FR DL 2017 (3)


Tracks

I Track 1 Neurorobotics/Robotics
I Robot navigation
I Deep reinforcement learning
I Track 2 Machine Learning
I Architecture search and hyperparameter optimization
I Visual recognition
I Track 3 Computer Vision
I Image segmentation
I Autoencoders
I Generative adversarial networks

Andreas Eitel Uni FR DL 2017 (4)


Groups and Evaluation

I In total we have three practical phases


I Reports:
I Short 1-2 page report explaining your results, typically accompanied by 1-2
figures (e.g. a learning curve)
I Hand in the code you used to generate these results as well; send us your
github/bitbucket repo
I Presentations:
I Phase 3: 5 min group work presentation in the last week
I The reports and presentations after each exercise will be evaluated w.r.t.
scientific quality and quality of presentation
I Requirements for passing:
I attending the class in the lecture and presentation weeks
I active participation in the small groups, i.e., programming, testing, writing
the report

Andreas Eitel Uni FR DL 2017 (5)


Today...

I Groups: You will work on your own for the first exercise
I Assignment: You will have three weeks to build an MLP and apply it to
the MNIST dataset (more on this at the end)
I Lecture: Short recap on how MLPs (feed-forward neural networks) work
and how to train them

Andreas Eitel Uni FR DL 2017 (6)


(Deep) Machine Learning Overview

[Figure: data and a model M, linked by learning (data → model) and inference (model → data)]

1. Learn the model M from the data
2. Let the model M infer unknown quantities from the data

Andreas Eitel Uni FR DL 2017 (7)



Machine Learning Overview

What is the difference between deep learning and a standard machine learning
pipeline ?

Andreas Eitel Uni FR DL 2017 (9)


Standard Machine Learning Pipeline

(1) Engineer good features (not learned)


(2) Learn Model
(3) Inference e.g. classes of unseen data

[Figure: raw data (three example objects x1, x2, x3) is mapped by hand-engineered features (1), e.g. color and height (4 cm, 3.5 cm, 10 cm), into a feature space in which a model is learned (2) and then used for inference (3)]

Andreas Eitel Uni FR DL 2017 (10)


Unsupervised Feature Learning Pipeline

(1a) Maybe engineer good features (not learned)


(1b) Learn (deep) representation unsupervisedly
(2) Learn Model
(3) Inference e.g. classes of unseen data

[Figure: raw pixel data, optionally mapped to engineered features (1a) such as color and height (4 cm, 3.5 cm, 10 cm); a (deep) representation is learned unsupervisedly on top (1b), then a model is learned (2) and used for inference (3)]
Andreas Eitel Uni FR DL 2017 (11)
Supervised Deep Learning Pipeline

(1) Jointly Learn everything with a deep architecture


(2) Inference e.g. classes of unseen data

[Figure: raw pixel data x1, x2, x3 is mapped directly to classes by a jointly learned deep architecture (1), which is then used for inference (2)]
Andreas Eitel Uni FR DL 2017 (12)


Training supervised feed-forward neural networks

I Let’s formalize!
I We are given:
I Dataset D = {(x1 , y1 ), . . . , (xN , yN )}
I A neural network with parameters θ which implements a function fθ (x)
I We want to learn:
I The parameters θ such that ∀i ∈ [1, N ] : fθ (xi ) = yi

Andreas Eitel Uni FR DL 2017 (13)


Training supervised feed-forward neural networks

I A neural network with parameters θ which implements a function fθ (x)


→ θ is given by the network weights w and bias terms b
[Figure: feed-forward network with input units x_1 ... x_j ... x_N, two hidden layers h(1)(x) and h(2)(x) of units a (plus bias units), and a softmax output layer f(x)]
Andreas Eitel Uni FR DL 2017 (14)


Neural network forward-pass

I Computing fθ (x) for a neural network is a forward-pass

I unit i activation:
  a_i = Σ_{j=1}^{N} w_{i,j} x_j + b_i
I unit i output:
  h_i^(1)(x) = t(a_i), where t(·) is an activation or transfer function

[Figure: the network from the previous slide, highlighting a single hidden unit a_i with its incoming weights w_{i,j} and bias b_i]

Andreas Eitel Uni FR DL 2017 (15)


Neural network forward-pass

I Computing fθ (x) for a neural network is a forward-pass

alternatively (and much faster) use vector notation:
I layer activation:
  a(1) = W(1) x + b(1)
I layer output:
  h(1)(x) = t(a(1)),
where t(·) is applied element-wise

[Figure: the same network, highlighting the first-layer weight matrix W(1)]

Andreas Eitel Uni FR DL 2017 (16)


Neural network forward-pass

I Computing fθ (x) for a neural network is a forward-pass

Second layer
I layer 2 activation:
  a(2) = W(2) h(1)(x) + b(2)
I layer 2 output:
  h(2)(x) = t(a(2)),
where t(·) is applied element-wise

[Figure: the same network, highlighting the second-layer weight matrix W(2)]

Andreas Eitel Uni FR DL 2017 (17)


Neural network forward-pass

I Computing fθ (x) for a neural network is a forward-pass

Output layer
I output layer activation:
  a(3) = W(3) h(2)(x) + b(3)
I network output:
  f(x) = o(a(3)),
where o(·) is the output nonlinearity
I for classification use the softmax:
  o_i(z) = e^{z_i} / Σ_{j=1}^{|z|} e^{z_j}

[Figure: the same network, highlighting the output-layer weight matrix W(3) and the softmax output units]
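To make the forward pass concrete, here is a minimal NumPy sketch of the three-layer network above (assuming tanh hidden layers and a softmax output; the parameter names W1, b1, ... and the dict layout are illustrative, not the interface of the course stub):

    import numpy as np

    def softmax(z):
        # subtract the max for numerical stability
        e = np.exp(z - np.max(z))
        return e / np.sum(e)

    def forward(x, params):
        # params: dict with weight matrices W1, W2, W3 and bias vectors b1, b2, b3 (hypothetical layout)
        a1 = params["W1"] @ x + params["b1"]   # layer 1 activation
        h1 = np.tanh(a1)                       # layer 1 output
        a2 = params["W2"] @ h1 + params["b2"]  # layer 2 activation
        h2 = np.tanh(a2)                       # layer 2 output
        a3 = params["W3"] @ h2 + params["b3"]  # output layer activation
        return softmax(a3)                     # class probabilities f(x)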

Andreas Eitel Uni FR DL 2017 (18)


Training supervised feed-forward neural networks

Neural network activation functions


I Typical nonlinearities for hidden layers are: tanh(a_i), the sigmoid
σ(a_i) = 1 / (1 + e^{-a_i}), and the ReLU relu(a_i) = max(a_i, 0)
I tanh and sigmoid are both squashing non-linearities
I ReLU just thresholds at 0

→ Why not linear ?
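For reference, the three nonlinearities (and their derivatives, which the backward pass will need) in a small NumPy sketch:

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def relu(a):
        return np.maximum(a, 0.0)

    # derivatives w.r.t. the activation a, used later in the backward pass
    def sigmoid_prime(a):
        s = sigmoid(a)
        return s * (1.0 - s)

    def tanh_prime(a):
        return 1.0 - np.tanh(a) ** 2

    def relu_prime(a):
        return (a > 0).astype(a.dtype)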

Andreas Eitel Uni FR DL 2017 (19)


Training supervised feed-forward neural networks

I Train parameters θ such that ∀i ∈ [1, N ] : fθ (xi ) = yi


I We can do this via minimizing the empirical risk on our dataset D:

  min_θ L(f_θ, D) = min_θ (1/N) Σ_{i=1}^{N} l(f_θ(x_i), y_i),   (1)

  where l(·, ·) is a per-example loss

I For regression, often use the squared loss:

  l(f_θ(x), y) = (1/2) Σ_{j=1}^{M} (f_{j,θ}(x) − y_j)²

I For M-class classification, use the negative log-likelihood:

  l(f_θ(x), y) = − Σ_{j=1}^{M} log(f_{j,θ}(x)) · y_j
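A minimal NumPy sketch of both losses (assuming f_x holds the network outputs and, for the NLL, y is one-hot encoded; the small eps guarding against log(0) is my addition):

    import numpy as np

    def squared_loss(f_x, y):
        # 0.5 * sum_j (f_j(x) - y_j)^2
        return 0.5 * np.sum((f_x - y) ** 2)

    def nll_loss(f_x, y_onehot, eps=1e-12):
        # - sum_j log(f_j(x)) * y_j
        return -np.sum(np.log(f_x + eps) * y_onehot)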

Andreas Eitel Uni FR DL 2017 (20)


Gradient descent

I The simplest approach to minimizing min_θ L(f_θ, D) is gradient descent

Gradient descent:
  θ^0 ← init randomly
  do
    θ^{t+1} = θ^t − γ ∂L(f_θ, D)/∂θ
  while (L(f_{θ^{t+1}}, V) − L(f_{θ^t}, V))² > ε

I Where V is a validation dataset (why not use D?)
I Remember, in our case: L(f_θ, D) = (1/N) Σ_{i=1}^{N} l(f_θ(x_i), y_i)
I We will get to computing the derivatives shortly
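A sketch of the loop above in Python (the functions grad_L and val_loss, the learning rate gamma, the threshold eps and the iteration cap are placeholders, not part of the assignment interface):

    def gradient_descent(theta, grad_L, val_loss, gamma=0.1, eps=1e-8, max_iter=10000):
        # grad_L(theta): gradient of the training loss L(f_theta, D)
        # val_loss(theta): loss on the validation set V, used for the stopping criterion
        prev = val_loss(theta)
        for _ in range(max_iter):
            theta = theta - gamma * grad_L(theta)
            cur = val_loss(theta)
            if (cur - prev) ** 2 <= eps:   # stop when the validation loss stops changing
                return theta
            prev = cur
        return theta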

Andreas Eitel Uni FR DL 2017 (21)


Gradient descent

I Gradient descent example: D = {(x^1, y^1), . . . , (x^100, y^100)} with

  x ∼ U[0, 1]
  y = 3 · x + ε, where ε ∼ N(0, 0.1)

I Learn the parameter θ of the function f_θ(x) = θx using the loss

  l(f_θ(x), y) = (1/2) ‖f_θ(x) − y‖₂² = (1/2) (f_θ(x) − y)²

  ∂L(f_θ, D)/∂θ = (1/N) Σ_{i=1}^{N} ∂l(f_θ(x_i), y_i)/∂θ = (1/N) Σ_{i=1}^{N} (θx_i − y_i) x_i
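The 1-D example above fits in a few lines (a sketch; the seed, the iteration count and reading 0.1 as the standard deviation of the noise are my assumptions):

    import numpy as np

    rng = np.random.RandomState(0)
    x = rng.uniform(0.0, 1.0, size=100)
    y = 3.0 * x + rng.normal(0.0, 0.1, size=100)

    theta, gamma = 0.0, 2.0
    for _ in range(50):
        grad = np.mean((theta * x - y) * x)   # (1/N) sum_i (theta*x_i - y_i) * x_i
        theta = theta - gamma * grad
    print(theta)   # should end up close to 3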

Andreas Eitel Uni FR DL 2017 (22)


Gradient descent

I Gradient descent example with γ = 2

[Figure: gradient descent on the example problem, γ = 2]

Andreas Eitel Uni FR DL 2017 (23)


Stochastic Gradient descent (SGD)

I There are two problems with gradient descent:


1. We have to find a good γ
2. Computing the gradient is expensive if the training dataset is large!
I Problem 2 can be attacked with online optimization
(we will have a look at this)
I Problem 1 remains but can be tackled via second order methods or other
advanced optimization algorithms (rprop/rmsprop, adagrad)

Andreas Eitel Uni FR DL 2017 (24)


Gradient descent

1. We have to find a good γ (γ = 2, γ = 5)

[Figure: gradient descent on the example problem for γ = 2 and γ = 5]

Andreas Eitel Uni FR DL 2017 (25)


Stochastic Gradient descent (SGD)

2. Computing the gradient is expensive if the training dataset is large!

I What if we only evaluated f on part of the data?

Stochastic Gradient Descent:
  θ^0 ← init randomly
  do
    (x′, y′) ∼ D   (sample an example from the dataset D)
    θ^{t+1} = θ^t − γ^t ∂l(f_θ(x′), y′)/∂θ
  while (L(f_{θ^{t+1}}, V) − L(f_{θ^t}, V))² > ε

  where Σ_{t=1}^{∞} γ^t → ∞ and Σ_{t=1}^{∞} (γ^t)² < ∞
  (γ should go to zero, but not too fast)

→ SGD can speed up optimization for large datasets
→ but can yield very noisy updates
→ in practice mini-batches are used
  (compute l(·, ·) for several samples and average)
→ we still have to find a learning-rate schedule γ^t
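A mini-batch SGD sketch with a decaying 1/t schedule (batch size, schedule and the grad_batch interface are illustrative choices, not prescribed by the slide):

    import numpy as np

    def sgd(theta, grad_batch, X, Y, gamma0=0.01, batch_size=64, epochs=10):
        # grad_batch(theta, xb, yb): average gradient of l over one mini-batch
        N, t = len(X), 1
        for _ in range(epochs):
            perm = np.random.permutation(N)
            for start in range(0, N, batch_size):
                idx = perm[start:start + batch_size]
                gamma_t = gamma0 / t                 # decaying learning-rate schedule
                theta = theta - gamma_t * grad_batch(theta, X[idx], Y[idx])
                t += 1
        return theta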
Andreas Eitel Uni FR DL 2017 (26)
Stochastic Gradient descent (SGD)

→ Same data, assuming that gradient evaluation on all data takes 4 times as
much time as evaluating a single datapoint
(gradient descent (γ = 2), stochastic gradient descent (γ^t = 0.01 · 1/t))

[Figure: gradient descent vs. stochastic gradient descent on the example problem]
Andreas Eitel Uni FR DL 2017 (27)
Neural Network backward pass

→ Now how do we compute the gradient for a network ?

I Use the chain rule:

  ∂f(g(x))/∂x = (∂f(g(x))/∂g(x)) · (∂g(x)/∂x)

I first compute the loss l(f(x), y) on the output layer
I then backpropagate to get ∂l(f(x), y)/∂W(3) and ∂l(f(x), y)/∂a(3)

[Figure: the network with the loss l(f(x), y) attached to the softmax output layer]

Andreas Eitel Uni FR DL 2017 (28)



Neural Network backward pass

→ Now how do we compute the gradient for a network ?

I gradient wrt. layer 3 weights:

  ∂l(f(x), y)/∂W(3) = (∂l(f(x), y)/∂a(3)) · (∂a(3)/∂W(3))

I assuming l is the NLL and softmax outputs, the gradient wrt. the layer 3 activation is:

  ∂l(f(x), y)/∂a(3) = −(y − f(x)),   where y is one-hot encoded

I gradient of a(3) wrt. W(3):

  ∂a(3)/∂W(3) = (h(2)(x))^T

I combined:

  ∂l(f(x), y)/∂W(3) = −(y − f(x)) (h(2)(x))^T

[Figure: the network with the loss attached to the output layer]
→ recall a(3) = W(3) h(2)(x) + b(3)

Andreas Eitel Uni FR DL 2017 (30)


Neural Network backward pass

→ Now how do we compute the gradient for a network ?

I gradient wrt. layer 3 weights:

  ∂l(f(x), y)/∂W(3) = (∂l(f(x), y)/∂a(3)) · (∂a(3)/∂W(3))

I assuming l is the NLL and softmax outputs, the gradient wrt. the layer 3 activation is:

  ∂l(f(x), y)/∂a(3) = −(y − f(x))

I gradient of a(3) wrt. W(3):

  ∂a(3)/∂W(3) = (h(2)(x))^T

I gradient wrt. the previous layer:

  ∂l(f(x), y)/∂h(2)(x) = (∂l(f(x), y)/∂a(3)) · (∂a(3)/∂h(2)(x)) = (W(3))^T ∂l(f(x), y)/∂a(3)

[Figure: the network with the loss attached to the output layer]
→ recall a(3) = W(3) h(2)(x) + b(3)

Andreas Eitel Uni FR DL 2017 (31)


Neural Network backward pass

→ Now how do we compute the gradient for a network ?

I gradient wrt. layer 2 weights:

  ∂l(f(x), y)/∂W(2) = (∂l(f(x), y)/∂h(2)(x)) · (∂h(2)(x)/∂a(2)) · (∂a(2)/∂W(2))

I same schema as before, we just have to consider the derivative of the
activation function ∂h(2)(x)/∂a(2), e.g. for the sigmoid σ(·):

  ∂h(2)(x)/∂a(2) = σ(a(2)) (1 − σ(a(2)))   (applied element-wise)

I and backprop even further

[Figure: the network with the loss attached to the output layer]
→ recall a(3) = W(3) h(2)(x) + b(3)
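Putting the slides above together, a backward pass for the three-layer network from the earlier forward-pass sketch could look as follows (softmax + NLL output, tanh hidden layers, single example; the cache/params layout is my own, not the course stub's):

    import numpy as np

    def backward(x, y, cache, params):
        # cache holds the forward-pass intermediates h1, h2 and the softmax output f
        h1, h2, f = cache["h1"], cache["h2"], cache["f"]
        W2, W3 = params["W2"], params["W3"]

        delta3 = f - y                               # dl/da3 = -(y - f(x)) for softmax + NLL
        dW3 = np.outer(delta3, h2)                   # dl/dW3 = delta3 (h2)^T
        db3 = delta3
        delta2 = (W3.T @ delta3) * (1.0 - h2 ** 2)   # backprop through tanh of layer 2
        dW2 = np.outer(delta2, h1)
        db2 = delta2
        delta1 = (W2.T @ delta2) * (1.0 - h1 ** 2)   # backprop through tanh of layer 1
        dW1 = np.outer(delta1, x)
        db1 = delta1
        return {"W1": dW1, "b1": db1, "W2": dW2, "b2": db2, "W3": dW3, "b3": db3}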

Andreas Eitel Uni FR DL 2017 (32)


Gradient Checking

→ Backward-pass is just repeated application of the chain rule


→ However, there is a huge potential for bugs ...
→ Gradient checking to the rescue (simply check code via finite-differences):

Gradient Checking:
  θ = (W(1), b(1), . . . ) ← init randomly
  x ← init randomly; y ← init randomly
  g_analytic ← ∂l(f_θ(x), y)/∂θ   (compute the gradient via backprop)
  for i in 1 . . . #θ
    θ̂ = θ
    θ̂_i = θ̂_i + ε
    g_numeric = (l(f_θ̂(x), y) − l(f_θ(x), y)) / ε
    assert(‖g_numeric − g_analytic,i‖ < ε)

I can also be used to test partial implementations (i.e. layers, activation functions)
→ simply remove the loss computation and backprop ones
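A finite-difference check along these lines (this sketch uses a central difference, which is usually more accurate than the one-sided version on the slide; theta is assumed to be a flat NumPy array and the tolerances are illustrative):

    import numpy as np

    def check_gradient(loss_fn, grad_fn, theta, eps=1e-5, tol=1e-6):
        # loss_fn(theta) -> scalar loss, grad_fn(theta) -> analytic gradient (same shape as theta)
        g_analytic = grad_fn(theta)
        for i in range(theta.size):
            theta_plus, theta_minus = theta.copy(), theta.copy()
            theta_plus.flat[i] += eps
            theta_minus.flat[i] -= eps
            g_numeric = (loss_fn(theta_plus) - loss_fn(theta_minus)) / (2.0 * eps)
            assert abs(g_numeric - g_analytic.flat[i]) < tol, f"gradient mismatch at parameter {i}"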

Andreas Eitel Uni FR DL 2017 (33)


Overfitting

I If you train the parameters of a large network θ = (W(1), b(1), . . . ) you
will see overfitting!
→ L(f_θ, D) ≪ L(f_θ, V)
I This can be at least partly conquered with regularization
I weight decay: change the cost (and gradient)

  min_θ [ (1/N) Σ_{i=1}^{N} l(f_θ(x_i), y_i) + (1/#θ) Σ_{i=1}^{#θ} ‖θ_i‖² ]

→ enforces small weights (Occam's razor)
I dropout: kill ≈ 50% of the activations in each hidden layer during the training
forward pass. Multiply the hidden activations by 1/2 during testing
→ prevents co-adaptation / enforces robustness to noise
I Many, many more!
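The two regularizers above can be sketched in a few lines (the decay coefficient lam and the dropout rate are illustrative; the test-time scaling by 1/2 follows the slide):

    import numpy as np

    def l2_penalty(params, lam=1e-4):
        # weight-decay term added to the data loss: lam * mean of squared parameters
        flat = np.concatenate([p.ravel() for p in params.values()])
        return lam * np.mean(flat ** 2)

    def dropout(h, p_drop=0.5, train=True):
        if train:
            mask = np.random.rand(*h.shape) > p_drop   # kill ~50% of the hidden activations
            return h * mask
        return h * (1.0 - p_drop)                      # scale activations by 1/2 at test time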

Andreas Eitel Uni FR DL 2017 (34)


Assignment
I Implementation: Implement a simple feed-forward neural network by
completing the provided stub; this includes:
I possibility to use 2-4 layers
I sigmoid/tanh and ReLU for the hidden layer
I softmax output layer
I optimization via gradient descent (gd)
I optimization via stochastic gradient descent (sgd)
I gradient checking code (!!!)
I weight initialization with random noise (!!!) (use a normal distribution with
a tunable std. deviation for now; see the init sketch at the end of this slide)
I Bonus points for testing some advanced ideas:
I implement dropout, l2 regularization
I implement a different optimization scheme (rprop, rmsprop, adagrad)
I Code stub: https://github.com/aisrobots/dl_lab_2017
I Evaluation:
I Find good parameters (learning rate, number of iterations etc.) using a
validation set (usually take the last 10k examples from the training set)
I After optimizing parameters run on the full dataset and test once on the
test-set (you should be able to reach ≈ 1.6 − 1.8% error )
I Submission: Clone our github repo and send us the link to your
github/bitbucket repo including your solution code and report. Email to
eitel@cs.uni-freiburg.de, hueglem@cs.uni-freiburg.de and
zhang@cs.uni-freiburg.de
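For the weight-initialization item above, a minimal sketch (the zero-mean normal and zero biases follow the slide; the default std and the function name are mine):

    import numpy as np

    def init_layer(n_in, n_out, std=0.01, seed=None):
        # weights from a zero-mean normal with a tunable std. deviation, biases start at zero
        rng = np.random.RandomState(seed)
        W = rng.normal(0.0, std, size=(n_out, n_in))
        b = np.zeros(n_out)
        return W, b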
Andreas Eitel Uni FR DL 2017 (35)
Slide Information

I These slides have been created by Tobias Springenberg for the DL


Lab Course WS 2016/2017.

Andreas Eitel Uni FR DL 2017 (36)
