
Tips for Deep Learning

Recipe of Deep Learning

Step 1: define a set of functions
Step 2: goodness of function          →  Neural Network
Step 3: pick the best function
        ↓
Good results on training data?  — NO → go back and modify the three steps
        ↓ YES
Good results on testing data?   — NO → overfitting!
        ↓ YES
Done.

Do not always blame overfitting:
bad performance can also mean the network is simply not well trained.

Overfitting? Compare the results on the training data and the testing data.
[Figure from "Deep Residual Learning for Image Recognition",
http://arxiv.org/abs/1512.03385: the deeper plain network has higher error on
the training data as well as the testing data, so the problem is not overfitting.]
Recipe of Deep Learning

Different approaches for different problems —
e.g. dropout is aimed at good results on testing data,
so apply it only when the results on training data are already good.
Recipe of Deep Learning

Good results on training data?
  → New activation function
  → Adaptive learning rate
Good results on testing data?
  → Early stopping
  → Regularization
  → Dropout

Hard to get the power of Deep ……

Results on training data: deeper usually does not imply better.

Vanishing Gradient Problem
[Figure: a deep network with inputs x1 … xN and outputs y1 … yM.]
The layers near the input have smaller gradients, learn very slowly, and are
still almost random, while the layers near the output have larger gradients,
learn very fast, and have already converged — converged based on the outputs of
the almost-random lower layers!?
Vanishing Gradient Problem
Why are the gradients smaller near the input? Add a large change +Δw to a
weight near the input and watch the network's output. Each sigmoid layer
squashes the change, so after many layers the outputs ŷ1 … ŷM barely move and
the loss l changes only by a small +Δl.

Intuitive way to compute the derivative:
  ∂l/∂w ≈ Δl/Δw, which is therefore small for weights near the input.
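To make this concrete, here is a minimal numpy sketch (not from the lecture; the
depth, width, random initialization, and squared-error loss are all illustrative
assumptions). It backpropagates through a randomly initialized 10-layer sigmoid
network and prints the gradient norm of each weight matrix; the norms are
typically orders of magnitude smaller for the layers near the input.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_layers, width = 10, 8
Ws = [rng.normal(scale=1.0, size=(width, width)) for _ in range(n_layers)]
x = rng.normal(size=width)
target = np.zeros(width)

# Forward pass, keeping every layer's activation.
acts = [x]
for W in Ws:
    acts.append(sigmoid(W @ acts[-1]))

# Backward pass for the squared-error loss l = ||a_L - target||^2.
delta = 2 * (acts[-1] - target) * acts[-1] * (1 - acts[-1])   # dl/dz at the output
grad_norms = []
for i in reversed(range(n_layers)):
    grad_norms.append(np.linalg.norm(np.outer(delta, acts[i])))      # ||dl/dW_i||
    if i > 0:
        delta = (Ws[i].T @ delta) * acts[i] * (1 - acts[i])           # through the sigmoid

for layer, g in zip(reversed(range(n_layers)), grad_norms):
    print(f"layer {layer}: ||dl/dW|| = {g:.2e}")   # smaller near the input (layer 0)
```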
ReLU
• Rectified Linear Unit (ReLU):
    a = z  for z > 0
    a = 0  for z ≤ 0
  (compared with the sigmoid σ(z))
• Reasons:
  1. Fast to compute
  2. Biological reason
  3. Equivalent to infinitely many sigmoids with different biases
  4. Handles the vanishing gradient problem
  [Xavier Glorot, AISTATS'11] [Andrew L. Maas, ICML'13] [Kaiming He, arXiv'15]
ReLU
a = z for z > 0,  a = 0 for z ≤ 0

[Figure: a ReLU network; the neurons whose input is negative output 0.]
Removing the zero-output neurons leaves a thinner, linear network from x1, x2
to y1, y2. In its linear (a = z) region a ReLU passes the gradient through
unchanged, so the remaining network does not have smaller gradients.
ReLU - variants

Leaky ReLU:       a = z for z > 0,  a = 0.01z for z ≤ 0
Parametric ReLU:  a = z for z > 0,  a = αz    for z ≤ 0
                  (α is also learned by gradient descent)
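A small numpy sketch of the three activations above (illustrative only; the
function names and the test values are my own, and α would be a trainable
parameter in a real framework):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, slope=0.01):
    return np.where(z > 0, z, slope * z)

def parametric_relu(z, alpha):
    # alpha is learned by gradient descent, e.g. one value per channel
    return np.where(z > 0, z, alpha * z)

z = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(z))
print(leaky_relu(z))
print(parametric_relu(z, alpha=0.25))
```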
Maxout — ReLU is a special case of Maxout
• Learnable activation function [Ian J. Goodfellow, ICML'13]

  Example with inputs x1, x2 and 2 elements per group: each group of linear
  values outputs its maximum, which plays the role of the neuron's activation.
    first layer:  (5, 7)  → max = 7      (−1, 1) → max = 1
    second layer: (1, 2)  → max = 2      (4, 3)  → max = 4

• You can have more than 2 elements in a group.


Maxout — ReLU is a special case of Maxout

ReLU network:    z = wx + b,   a = ReLU(z)
Maxout network:  z1 = wx + b,  z2 = 0,  a = max(z1, z2)

Both give exactly the same activation function: a = max(wx + b, 0).
Maxout — more than ReLU

Maxout network:  z1 = wx + b,  z2 = w′x + b′,  a = max(z1, z2)

With a second learnable line (w′, b′), max(z1, z2) becomes a learnable
activation function that is no longer restricted to the ReLU shape.
Maxout
• Learnable activation function [Ian J. Goodfellow, ICML'13]
• The activation function in a maxout network can be any piecewise linear
  convex function.
• The number of pieces depends on the number of elements in a group
  (2 elements in a group → 2 pieces; 3 elements → 3 pieces).
Maxout - Training
• Given a training example x, we know which z in each group is the max,
  so the max operation acts like a selector.

[Figure: the max element in each group is kept; the other elements are dropped.]

• What remains is a thin and linear network, which is trained as usual —
  the max operation simply passes the gradient to the selected z.
• Different examples give different thin and linear networks,
  so across the training data every parameter still gets trained.
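A possible numpy sketch of a maxout layer's forward pass (the shapes, names, and
the choice of k = 2 elements per group are illustrative assumptions, not the
lecture's code):

```python
import numpy as np

rng = np.random.default_rng(0)

def maxout_layer(x, W, b):
    """x: (d_in,), W: (d_out, k, d_in), b: (d_out, k) -> (d_out,)"""
    z = W @ x + b                 # (d_out, k): k linear pieces per output unit
    return z.max(axis=1)          # keep the max piece; the gradient flows only to it

d_in, d_out, k = 2, 3, 2          # k = 2 elements per group, as in the slides
W = rng.normal(size=(d_out, k, d_in))
b = rng.normal(size=(d_out, k))

x = np.array([1.0, -2.0])
print(maxout_layer(x, W, b))

# ReLU as a special case: one piece is wx + b, the other is fixed at 0.
```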
Recipe of Deep Learning (recap)
Good results on training data → new activation function, adaptive learning rate
Good results on testing data → early stopping, regularization, dropout
Next: adaptive learning rate.
Review: Adagrad
[Figure: error-surface contours over (w1, w2); w1 needs a larger learning rate,
w2 a smaller one.]

  w^{t+1} ← w^t − η / √( Σ_{i=0}^{t} (g^i)² ) · g^t

Use the first derivatives to estimate the second derivative.
RMSProp
The error surface can be very complex when training a neural network: even
along the same direction, the learning rate sometimes needs to be smaller and
sometimes larger.

RMSProp update:
  w^1 ← w^0 − (η/σ^0) g^0        σ^0 = g^0
  w^2 ← w^1 − (η/σ^1) g^1        σ^1 = √( α(σ^0)² + (1−α)(g^1)² )
  w^3 ← w^2 − (η/σ^2) g^2        σ^2 = √( α(σ^1)² + (1−α)(g^2)² )
  ……
  w^{t+1} ← w^t − (η/σ^t) g^t    σ^t = √( α(σ^{t−1})² + (1−α)(g^t)² )

σ^t is the root mean square of the gradients, with previous gradients being decayed.
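A minimal numpy sketch of this update rule (illustrative; the function name, the
toy quadratic loss, and the small eps added for numerical stability are my own
assumptions, not part of the slide):

```python
import numpy as np

def rmsprop_step(w, grad, sigma, eta=0.001, alpha=0.9, eps=1e-8):
    if sigma is None:                            # first step: sigma^0 = g^0
        sigma = np.abs(grad)
    else:                                        # sigma^t = sqrt(a*sigma^2 + (1-a)*g^2)
        sigma = np.sqrt(alpha * sigma**2 + (1 - alpha) * grad**2)
    w = w - eta / (sigma + eps) * grad           # w^{t+1} = w^t - eta/sigma^t * g^t
    return w, sigma

# Usage on a toy quadratic loss L(w) = sum(w^2), whose gradient is 2w.
w, sigma = np.array([3.0, -2.0]), None
for _ in range(200):
    w, sigma = rmsprop_step(w, 2 * w, sigma, eta=0.05)
print(w)   # approaches the minimum at 0
```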
Hard to find optimal network parameters

[Figure: total loss as a function of a network parameter w. Training can be
very slow on a plateau (∂L/∂w ≈ 0), stuck at a saddle point (∂L/∂w = 0), or
stuck at a local minimum (∂L/∂w = 0).]
In the physical world ……
• Momentum: a ball rolling down a surface does not stop at the first flat spot —
  it keeps some of its previous velocity.

How about putting this phenomenon into gradient descent?
Review: Vanilla Gradient Descent

Start at position θ^0
Compute gradient at θ^0;  move to θ^1 = θ^0 − η∇L(θ^0)
Compute gradient at θ^1;  move to θ^2 = θ^1 − η∇L(θ^1)
……
Stop when ∇L(θ^t) ≈ 0

[Figure: at each point the movement is in the direction opposite to the gradient.]
Momentum
Movement: the movement of the last step minus the gradient at the present step.

Start at point θ^0, with movement v^0 = 0
Compute gradient at θ^0;  movement v^1 = λv^0 − η∇L(θ^0);  move to θ^1 = θ^0 + v^1
Compute gradient at θ^1;  movement v^2 = λv^1 − η∇L(θ^1);  move to θ^2 = θ^1 + v^2
……

The movement is not based on the gradient alone, but also on the previous movement.
Momentum
v^i is actually a weighted sum of all the previous gradients
∇L(θ^0), ∇L(θ^1), …, ∇L(θ^{i−1}):

  v^0 = 0
  v^1 = −η∇L(θ^0)
  v^2 = −λη∇L(θ^0) − η∇L(θ^1)
  ……

The movement is not based on the gradient alone, but also on the previous movement.
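A minimal numpy sketch of the momentum update, plus a numerical check of the
weighted-sum claim above (illustrative; the gradients and hyperparameter values
are made up):

```python
import numpy as np

def momentum_step(theta, grad, v, eta=0.01, lam=0.9):
    v = lam * v - eta * grad      # movement = lambda * last movement - eta * gradient
    theta = theta + v             # move
    return theta, v

# Check: after three steps, v equals -eta * sum_k lambda^(2-k) * grad_k.
eta, lam = 0.01, 0.9
grads = [np.array([1.0, -2.0]), np.array([0.5, 0.5]), np.array([-1.0, 3.0])]
theta, v = np.zeros(2), np.zeros(2)
for g in grads:
    theta, v = momentum_step(theta, g, v, eta, lam)

weighted_sum = -eta * sum(lam ** (len(grads) - 1 - k) * g for k, g in enumerate(grads))
print(np.allclose(v, weighted_sum))   # True: v is a decayed weighted sum of past gradients
```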
Momentum
Still no guarantee of reaching the global minimum, but it gives some hope ……

[Figure: a cost curve with a local minimum. The real movement is the negative of
∂L/∂w plus the momentum, so even where ∂L/∂w = 0 the accumulated momentum can
carry the parameters onward.]

Movement = negative of ∂L/∂w + momentum
Adam = RMSProp + Momentum

[Figure: the Adam algorithm — one moving average of the gradient (for momentum)
and one of the squared gradient (for RMSProp).]
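A minimal sketch of how the two parts combine in the Adam update (this follows
the standard Adam algorithm with bias correction; the toy loss and the
hyperparameter values are illustrative assumptions, not the lecture's code):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad          # moving average of gradients  (for momentum)
    v = beta2 * v + (1 - beta2) * grad**2       # moving average of squares    (for RMSProp)
    m_hat = m / (1 - beta1**t)                  # bias correction (t starts at 1)
    v_hat = v / (1 - beta2**t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Usage on a toy quadratic loss L(theta) = sum(theta^2), gradient 2*theta.
theta, m, v = np.array([3.0, -2.0]), np.zeros(2), np.zeros(2)
for t in range(1, 1001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t, eta=0.05)
print(theta)   # approaches the minimum at 0
```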
Recipe of Deep Learning (recap)
Good results on training data → new activation function, adaptive learning rate
Good results on testing data → early stopping, regularization, dropout
Next: early stopping.
Early Stopping

[Figure: total loss vs. epochs. The loss on the training set keeps decreasing,
but the loss on the testing set starts to increase after some point. Use a
validation set as a proxy for the testing set and stop at the epoch where its
loss is lowest.]

Keras: http://keras.io/getting-started/faq/#how-can-i-interrupt-training-when-the-validation-loss-isnt-decreasing-anymore
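Following the Keras FAQ linked above, this is done with the EarlyStopping
callback. A hedged sketch (the model, data, and hyperparameters are
placeholders, and the import path may be tensorflow.keras in newer versions):

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.callbacks import EarlyStopping

x_train = np.random.rand(1000, 20)                    # placeholder data
y_train = (x_train.sum(axis=1) > 10).astype("float32")

model = Sequential([Dense(64, activation="relu", input_shape=(20,)),
                    Dense(1, activation="sigmoid")])
model.compile(optimizer="adam", loss="binary_crossentropy")

# Stop when the validation loss has not improved for `patience` epochs.
early_stopping = EarlyStopping(monitor="val_loss", patience=3)
model.fit(x_train, y_train,
          validation_split=0.2,     # hold out part of the training data as a validation set
          epochs=100,
          callbacks=[early_stopping])
```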
Recipe of Deep Learning (recap)
Good results on training data → new activation function, adaptive learning rate
Good results on testing data → early stopping, regularization, dropout
Next: regularization.
Regularization
• New loss function to be minimized
• Find a set of weights that not only minimizes the original cost
  but is also close to zero

  L′(θ) = L(θ) + λ · ½‖θ‖₂

  L(θ): original loss (e.g. square error, cross entropy …)
  λ · ½‖θ‖₂: regularization term,  θ = {w1, w2, …}
  L2 regularization: ‖θ‖₂ = (w1)² + (w2)² + …
  (biases are usually not regularized)

Regularization — L2
• New loss function:  L′(θ) = L(θ) + λ · ½‖θ‖₂,   ‖θ‖₂ = (w1)² + (w2)² + …

  Gradient:  ∂L′/∂w = ∂L/∂w + λw

  Update:    w^{t+1} ← w^t − η ∂L′/∂w
                     = w^t − η( ∂L/∂w + λw^t )
                     = (1 − ηλ) w^t − η ∂L/∂w

  Since 1 − ηλ is slightly smaller than 1, every update first shrinks the weight
  toward zero: this is weight decay.
Regularization — L1
• New loss function:  L′(θ) = L(θ) + λ · ½‖θ‖₁,   L1 regularization: ‖θ‖₁ = |w1| + |w2| + …

  Gradient:  ∂L′/∂w = ∂L/∂w + λ sgn(w)

  Update:    w^{t+1} ← w^t − η ∂L′/∂w
                     = w^t − η ∂L/∂w − ηλ sgn(w^t)

  L1 always subtracts a fixed amount ηλ, moving w toward zero by the same step
  regardless of its size, whereas L2 multiplies w by (1 − ηλ):
  w^{t+1} = (1 − ηλ) w^t − η ∂L/∂w.
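A tiny sketch contrasting the two update rules (illustrative values; the
gradient of the original loss is set to zero so that only the regularization
term acts):

```python
eta, lam = 0.1, 0.5

def l2_update(w, grad):
    return (1 - eta * lam) * w - eta * grad          # weight decay: shrink proportionally

def l1_update(w, grad):
    sign = 1.0 if w > 0 else (-1.0 if w < 0 else 0.0)
    # always subtract a fixed eta*lam (a practical version would clip at zero)
    return w - eta * grad - eta * lam * sign

w_l2 = w_l1 = 0.1
for t in range(10):
    w_l2, w_l1 = l2_update(w_l2, grad=0.0), l1_update(w_l1, grad=0.0)
    print(f"step {t + 1}: L2 -> {w_l2:.4f},  L1 -> {w_l1:.4f}")
# L1 drives the small weight exactly to zero within a few steps;
# L2 only shrinks it proportionally and never quite reaches zero.
```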
Regularization - Weight Decay
• Our brain prunes the useless links between neurons.
  Doing the same thing to the machine's "brain" improves its performance.
Recipe of Deep Learning (recap)
Good results on training data → new activation function, adaptive learning rate
Good results on testing data → early stopping, regularization, dropout
Next: dropout.
Dropout
Training:
➢ Each time before updating the parameters
  • each neuron has a p% chance to drop out.
Dropout
Training:
➢ Each time before updating the parameters
  • each neuron has a p% chance to drop out
    → the structure of the network is changed: it becomes thinner.
  • Use the new, thinner network for training.
For each mini-batch, we resample the dropped-out neurons.
Dropout
Testing:
➢ No dropout
  • If the dropout rate at training is p%, all the weights are multiplied by (1 − p%).
  • Example: assume the dropout rate is 50%. If a weight w = 1 after training,
    set w = 0.5 for testing.
Dropout - Intuitive Reason
Training with dropout is like training with heavy weights tied to your legs;
at testing time there is no dropout — once the weights are removed, you are much stronger.
Dropout - Intuitive Reason
"My partner will slack off, so I have to do the work well."

➢ When people team up, if everyone expects their partner to do the work,
  nothing gets done in the end.
➢ However, if you know your partner may drop out, you will do the work better yourself.
➢ At testing time no one actually drops out, so the results end up being good.
Dropout - Intuitive Reason
• Why should the weights be multiplied by (1 − p%) (p% = dropout rate) at testing?

  Training with dropout (rate 50%): z is computed from roughly half of the
  inputs through weights w1 … w4.
  Testing without dropout: z′ is computed from all of the inputs, so z′ ≈ 2z.

  Multiplying the weights by (1 − p%) — here 0.5 × w1, …, 0.5 × w4 —
  brings the testing-time output back to the training-time scale: z′ ≈ z.
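A minimal numpy sketch of the training-time and testing-time behaviour described
above (illustrative; the function names and sizes are my own, and scaling the
activations is equivalent to scaling the weights they feed into):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(a, p=0.5):
    mask = rng.random(a.shape) >= p        # keep each unit with probability 1 - p
    return a * mask

def dropout_test(a, p=0.5):
    return a * (1.0 - p)                   # no dropping; scale by (1 - p) instead

a = np.ones(1000)                          # activations of 1000 units
print(dropout_train(a).sum())              # roughly 500: about half are dropped
print(dropout_test(a).sum())               # exactly 500: same expected total input
```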
Dropout is a kind of ensemble.

Ensemble training: sample several training sets (Set 1, Set 2, Set 3, Set 4 …)
from the training data and train a bunch of networks (Network 1–4) with
different structures, one per set.
Dropout is a kind of ensemble.

Ensemble testing: feed the same testing data x into Network 1–4, obtain
y1, y2, y3, y4, and average the outputs.
Dropout is a kind of ensemble.

Training of dropout: with M neurons, dropout can produce 2^M possible networks.
➢ Each mini-batch trains one of these networks.
➢ Some parameters in the networks are shared.
Dropout is a kind of ensemble.

Testing of dropout: ideally we would feed the testing data x into all 2^M
possible networks, obtain y1, y2, y3, …, and average them — but that is
infeasible. Instead, use the whole network with all the weights multiplied by
(1 − p%); its output is approximately equal to that average y.
Testing of Dropout
Consider a single neuron z = w1x1 + w2x2 with dropout rate 50%. The four
possible dropout patterns of the two inputs give:

  z = w1x1 + w2x2      z = w2x2
  z = w1x1             z = 0

Their average is ½w1x1 + ½w2x2, which is exactly the output of the full network
with the weights multiplied by ½:  z = ½w1x1 + ½w2x2.
Recipe of Deep Learning (recap)

Step 1: define a set of functions → Step 2: goodness of function →
Step 3: pick the best function → Neural Network →
good results on training data? → good results on testing data?
(bad testing results despite good training results = overfitting)
Try another task

[Demo: document classification. The machine reads a document — e.g. one
containing "stock" or "president" — and classifies it into categories such as
Politics, Economics, Sports, and Finance.]
http://top-breaking-news.com/
Try another task
Live Demo
