
Tips for Deep Learning

Recipe of Deep Learning

Step 1: define a set of functions
Step 2: goodness of function          →  Neural Network
Step 3: pick the best function
        ↓
Good results on training data?  — NO → go back and modify the three steps
        ↓ YES
Good results on testing data?   — NO → overfitting!
        ↓ YES
Done.

Do not always blame overfitting:
bad performance can also mean the network is simply not well trained.

Overfitting? Compare the results on the training data and the testing data.
[Figure from "Deep Residual Learning for Image Recognition",
http://arxiv.org/abs/1512.03385: the deeper plain network has higher error on
the training data as well as the testing data, so the problem is not overfitting.]
Recipe of Deep Learning

Different approaches for different problems —
e.g. dropout is aimed at good results on testing data,
so apply it only when the results on training data are already good.
Recipe of Deep Learning

Good results on training data?
  → New activation function
  → Adaptive learning rate
Good results on testing data?
  → Early stopping
  → Regularization
  → Dropout

Hard to get the power of Deep ……

Results on training data: deeper usually does not imply better.

Vanishing Gradient Problem
[Figure: a deep network with inputs x1 … xN and outputs y1 … yM.]
The layers near the input have smaller gradients, learn very slowly, and are
still almost random, while the layers near the output have larger gradients,
learn very fast, and have already converged — converged based on the outputs of
the almost-random lower layers!?
Vanishing Gradient Problem
Why are the gradients smaller near the input? Add a large change +Δw to a
weight near the input and watch the network's output. Each sigmoid layer
squashes the change, so after many layers the outputs ŷ1 … ŷM barely move and
the loss l changes only by a small +Δl.

Intuitive way to compute the derivative:
  ∂l/∂w ≈ Δl/Δw, which is therefore small for weights near the input.
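To make this concrete, here is a minimal numpy sketch (not from the lecture; the
depth, width, random initialization, and squared-error loss are all illustrative
assumptions). It backpropagates through a randomly initialized 10-layer sigmoid
network and prints the gradient norm of each weight matrix; the norms are
typically orders of magnitude smaller for the layers near the input.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_layers, width = 10, 8
Ws = [rng.normal(scale=1.0, size=(width, width)) for _ in range(n_layers)]
x = rng.normal(size=width)
target = np.zeros(width)

# Forward pass, keeping every layer's activation.
acts = [x]
for W in Ws:
    acts.append(sigmoid(W @ acts[-1]))

# Backward pass for the squared-error loss l = ||a_L - target||^2.
delta = 2 * (acts[-1] - target) * acts[-1] * (1 - acts[-1])   # dl/dz at the output
grad_norms = []
for i in reversed(range(n_layers)):
    grad_norms.append(np.linalg.norm(np.outer(delta, acts[i])))      # ||dl/dW_i||
    if i > 0:
        delta = (Ws[i].T @ delta) * acts[i] * (1 - acts[i])           # through the sigmoid

for layer, g in zip(reversed(range(n_layers)), grad_norms):
    print(f"layer {layer}: ||dl/dW|| = {g:.2e}")   # smaller near the input (layer 0)
```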
ReLU
• Rectified Linear Unit (ReLU):
    a = z  for z > 0
    a = 0  for z ≤ 0
  (compared with the sigmoid σ(z))
• Reasons:
  1. Fast to compute
  2. Biological reason
  3. Equivalent to infinitely many sigmoids with different biases
  4. Handles the vanishing gradient problem
  [Xavier Glorot, AISTATS'11] [Andrew L. Maas, ICML'13] [Kaiming He, arXiv'15]
ReLU
a = z for z > 0,  a = 0 for z ≤ 0

[Figure: a ReLU network; the neurons whose input is negative output 0.]
Removing the zero-output neurons leaves a thinner, linear network from x1, x2
to y1, y2. In its linear (a = z) region a ReLU passes the gradient through
unchanged, so the remaining network does not have smaller gradients.
ReLU - variants

Leaky ReLU:       a = z for z > 0,  a = 0.01z for z ≤ 0
Parametric ReLU:  a = z for z > 0,  a = αz    for z ≤ 0
                  (α is also learned by gradient descent)
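A small numpy sketch of the three activations above (illustrative only; the
function names and the test values are my own, and α would be a trainable
parameter in a real framework):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, slope=0.01):
    return np.where(z > 0, z, slope * z)

def parametric_relu(z, alpha):
    # alpha is learned by gradient descent, e.g. one value per channel
    return np.where(z > 0, z, alpha * z)

z = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(z))
print(leaky_relu(z))
print(parametric_relu(z, alpha=0.25))
```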
Maxout — ReLU is a special case of Maxout
• Learnable activation function [Ian J. Goodfellow, ICML'13]

  Example with inputs x1, x2 and 2 elements per group: each group of linear
  values outputs its maximum, which plays the role of the neuron's activation.
    first layer:  (5, 7)  → max = 7      (−1, 1) → max = 1
    second layer: (1, 2)  → max = 2      (4, 3)  → max = 4

• You can have more than 2 elements in a group.


Maxout — ReLU is a special case of Maxout

ReLU network:    z = wx + b,   a = ReLU(z)
Maxout network:  z1 = wx + b,  z2 = 0,  a = max(z1, z2)

Both give exactly the same activation function: a = max(wx + b, 0).
Maxout — more than ReLU

Maxout network:  z1 = wx + b,  z2 = w′x + b′,  a = max(z1, z2)

With a second learnable line (w′, b′), max(z1, z2) becomes a learnable
activation function that is no longer restricted to the ReLU shape.
Maxout
• Learnable activation function [Ian J. Goodfellow, ICML'13]
• The activation function in a maxout network can be any piecewise linear
  convex function.
• The number of pieces depends on the number of elements in a group
  (2 elements in a group → 2 pieces; 3 elements → 3 pieces).
Maxout - Training
• Given a training example x, we know which z in each group is the max,
  so the max operation acts like a selector.

[Figure: the max element in each group is kept; the other elements are dropped.]

• What remains is a thin and linear network, which is trained as usual —
  the max operation simply passes the gradient to the selected z.
• Different examples give different thin and linear networks,
  so across the training data every parameter still gets trained.
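A possible numpy sketch of a maxout layer's forward pass (the shapes, names, and
the choice of k = 2 elements per group are illustrative assumptions, not the
lecture's code):

```python
import numpy as np

rng = np.random.default_rng(0)

def maxout_layer(x, W, b):
    """x: (d_in,), W: (d_out, k, d_in), b: (d_out, k) -> (d_out,)"""
    z = W @ x + b                 # (d_out, k): k linear pieces per output unit
    return z.max(axis=1)          # keep the max piece; the gradient flows only to it

d_in, d_out, k = 2, 3, 2          # k = 2 elements per group, as in the slides
W = rng.normal(size=(d_out, k, d_in))
b = rng.normal(size=(d_out, k))

x = np.array([1.0, -2.0])
print(maxout_layer(x, W, b))

# ReLU as a special case: one piece is wx + b, the other is fixed at 0.
```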
Recipe of Deep Learning (recap)
Good results on training data → new activation function, adaptive learning rate
Good results on testing data → early stopping, regularization, dropout
Next: adaptive learning rate.
Review: Adagrad
[Figure: error-surface contours over (w1, w2); w1 needs a larger learning rate,
w2 a smaller one.]

  w^{t+1} ← w^t − η / √( Σ_{i=0}^{t} (g^i)² ) · g^t

Use the first derivatives to estimate the second derivative.
RMSProp
The error surface can be very complex when training a neural network: even
along the same direction, the learning rate sometimes needs to be smaller and
sometimes larger.

RMSProp update:
  w^1 ← w^0 − (η/σ^0) g^0        σ^0 = g^0
  w^2 ← w^1 − (η/σ^1) g^1        σ^1 = √( α(σ^0)² + (1−α)(g^1)² )
  w^3 ← w^2 − (η/σ^2) g^2        σ^2 = √( α(σ^1)² + (1−α)(g^2)² )
  ……
  w^{t+1} ← w^t − (η/σ^t) g^t    σ^t = √( α(σ^{t−1})² + (1−α)(g^t)² )

σ^t is the root mean square of the gradients, with previous gradients being decayed.
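A minimal numpy sketch of this update rule (illustrative; the function name, the
toy quadratic loss, and the small eps added for numerical stability are my own
assumptions, not part of the slide):

```python
import numpy as np

def rmsprop_step(w, grad, sigma, eta=0.001, alpha=0.9, eps=1e-8):
    if sigma is None:                            # first step: sigma^0 = g^0
        sigma = np.abs(grad)
    else:                                        # sigma^t = sqrt(a*sigma^2 + (1-a)*g^2)
        sigma = np.sqrt(alpha * sigma**2 + (1 - alpha) * grad**2)
    w = w - eta / (sigma + eps) * grad           # w^{t+1} = w^t - eta/sigma^t * g^t
    return w, sigma

# Usage on a toy quadratic loss L(w) = sum(w^2), whose gradient is 2w.
w, sigma = np.array([3.0, -2.0]), None
for _ in range(200):
    w, sigma = rmsprop_step(w, 2 * w, sigma, eta=0.05)
print(w)   # approaches the minimum at 0
```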
Hard to find optimal network parameters

[Figure: total loss as a function of a network parameter w. Training can be
very slow on a plateau (∂L/∂w ≈ 0), stuck at a saddle point (∂L/∂w = 0), or
stuck at a local minimum (∂L/∂w = 0).]
In the physical world ……
• Momentum: a ball rolling down a surface does not stop at the first flat spot —
  it keeps some of its previous velocity.

How about putting this phenomenon into gradient descent?
Review: Vanilla Gradient Descent

Start at position θ^0
Compute gradient at θ^0;  move to θ^1 = θ^0 − η∇L(θ^0)
Compute gradient at θ^1;  move to θ^2 = θ^1 − η∇L(θ^1)
……
Stop when ∇L(θ^t) ≈ 0

[Figure: at each point the movement is in the direction opposite to the gradient.]
Momentum
Movement: the movement of the last step minus the gradient at the present step.

Start at point θ^0, with movement v^0 = 0
Compute gradient at θ^0;  movement v^1 = λv^0 − η∇L(θ^0);  move to θ^1 = θ^0 + v^1
Compute gradient at θ^1;  movement v^2 = λv^1 − η∇L(θ^1);  move to θ^2 = θ^1 + v^2
……

The movement is not based on the gradient alone, but also on the previous movement.
Momentum
v^i is actually a weighted sum of all the previous gradients
∇L(θ^0), ∇L(θ^1), …, ∇L(θ^{i−1}):

  v^0 = 0
  v^1 = −η∇L(θ^0)
  v^2 = −λη∇L(θ^0) − η∇L(θ^1)
  ……

The movement is not based on the gradient alone, but also on the previous movement.
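A minimal numpy sketch of the momentum update, plus a numerical check of the
weighted-sum claim above (illustrative; the gradients and hyperparameter values
are made up):

```python
import numpy as np

def momentum_step(theta, grad, v, eta=0.01, lam=0.9):
    v = lam * v - eta * grad      # movement = lambda * last movement - eta * gradient
    theta = theta + v             # move
    return theta, v

# Check: after three steps, v equals -eta * sum_k lambda^(2-k) * grad_k.
eta, lam = 0.01, 0.9
grads = [np.array([1.0, -2.0]), np.array([0.5, 0.5]), np.array([-1.0, 3.0])]
theta, v = np.zeros(2), np.zeros(2)
for g in grads:
    theta, v = momentum_step(theta, g, v, eta, lam)

weighted_sum = -eta * sum(lam ** (len(grads) - 1 - k) * g for k, g in enumerate(grads))
print(np.allclose(v, weighted_sum))   # True: v is a decayed weighted sum of past gradients
```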
Momentum
Still no guarantee of reaching the global minimum, but it gives some hope ……

[Figure: a cost curve with a local minimum. The real movement is the negative of
∂L/∂w plus the momentum, so even where ∂L/∂w = 0 the accumulated momentum can
carry the parameters onward.]

Movement = negative of ∂L/∂w + momentum
Adam = RMSProp + Momentum

[Figure: the Adam algorithm — one moving average of the gradient (for momentum)
and one of the squared gradient (for RMSProp).]
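A minimal sketch of how the two parts combine in the Adam update (this follows
the standard Adam algorithm with bias correction; the toy loss and the
hyperparameter values are illustrative assumptions, not the lecture's code):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad          # moving average of gradients  (for momentum)
    v = beta2 * v + (1 - beta2) * grad**2       # moving average of squares    (for RMSProp)
    m_hat = m / (1 - beta1**t)                  # bias correction (t starts at 1)
    v_hat = v / (1 - beta2**t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Usage on a toy quadratic loss L(theta) = sum(theta^2), gradient 2*theta.
theta, m, v = np.array([3.0, -2.0]), np.zeros(2), np.zeros(2)
for t in range(1, 1001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t, eta=0.05)
print(theta)   # approaches the minimum at 0
```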
Recipe of Deep Learning (recap)
Good results on training data → new activation function, adaptive learning rate
Good results on testing data → early stopping, regularization, dropout
Next: early stopping.
Early Stopping

[Figure: total loss vs. epochs. The loss on the training set keeps decreasing,
but the loss on the testing set starts to increase after some point. Use a
validation set as a proxy for the testing set and stop at the epoch where its
loss is lowest.]

Keras: http://keras.io/getting-started/faq/#how-can-i-interrupt-training-when-the-validation-loss-isnt-decreasing-anymore
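Following the Keras FAQ linked above, this is done with the EarlyStopping
callback. A hedged sketch (the model, data, and hyperparameters are
placeholders, and the import path may be tensorflow.keras in newer versions):

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.callbacks import EarlyStopping

x_train = np.random.rand(1000, 20)                    # placeholder data
y_train = (x_train.sum(axis=1) > 10).astype("float32")

model = Sequential([Dense(64, activation="relu", input_shape=(20,)),
                    Dense(1, activation="sigmoid")])
model.compile(optimizer="adam", loss="binary_crossentropy")

# Stop when the validation loss has not improved for `patience` epochs.
early_stopping = EarlyStopping(monitor="val_loss", patience=3)
model.fit(x_train, y_train,
          validation_split=0.2,     # hold out part of the training data as a validation set
          epochs=100,
          callbacks=[early_stopping])
```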
Recipe of Deep Learning (recap)
Good results on training data → new activation function, adaptive learning rate
Good results on testing data → early stopping, regularization, dropout
Next: regularization.
Regularization
• New loss function to be minimized
• Find a set of weights that not only minimizes the original cost
  but is also close to zero

  L′(θ) = L(θ) + λ · ½‖θ‖₂

  L(θ): original loss (e.g. square error, cross entropy …)
  λ · ½‖θ‖₂: regularization term,  θ = {w1, w2, …}
  L2 regularization: ‖θ‖₂ = (w1)² + (w2)² + …
  (biases are usually not regularized)

Regularization — L2
• New loss function:  L′(θ) = L(θ) + λ · ½‖θ‖₂,   ‖θ‖₂ = (w1)² + (w2)² + …

  Gradient:  ∂L′/∂w = ∂L/∂w + λw

  Update:    w^{t+1} ← w^t − η ∂L′/∂w
                     = w^t − η( ∂L/∂w + λw^t )
                     = (1 − ηλ) w^t − η ∂L/∂w

  Since 1 − ηλ is slightly smaller than 1, every update first shrinks the weight
  toward zero: this is weight decay.
Regularization — L1
• New loss function:  L′(θ) = L(θ) + λ · ½‖θ‖₁,   L1 regularization: ‖θ‖₁ = |w1| + |w2| + …

  Gradient:  ∂L′/∂w = ∂L/∂w + λ sgn(w)

  Update:    w^{t+1} ← w^t − η ∂L′/∂w
                     = w^t − η ∂L/∂w − ηλ sgn(w^t)

  L1 always subtracts a fixed amount ηλ, moving w toward zero by the same step
  regardless of its size, whereas L2 multiplies w by (1 − ηλ):
  w^{t+1} = (1 − ηλ) w^t − η ∂L/∂w.
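A tiny sketch contrasting the two update rules (illustrative values; the
gradient of the original loss is set to zero so that only the regularization
term acts):

```python
eta, lam = 0.1, 0.5

def l2_update(w, grad):
    return (1 - eta * lam) * w - eta * grad          # weight decay: shrink proportionally

def l1_update(w, grad):
    sign = 1.0 if w > 0 else (-1.0 if w < 0 else 0.0)
    # always subtract a fixed eta*lam (a practical version would clip at zero)
    return w - eta * grad - eta * lam * sign

w_l2 = w_l1 = 0.1
for t in range(10):
    w_l2, w_l1 = l2_update(w_l2, grad=0.0), l1_update(w_l1, grad=0.0)
    print(f"step {t + 1}: L2 -> {w_l2:.4f},  L1 -> {w_l1:.4f}")
# L1 drives the small weight exactly to zero within a few steps;
# L2 only shrinks it proportionally and never quite reaches zero.
```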
Regularization - Weight Decay
• Our brain prunes the useless links between neurons.
  Doing the same thing to the machine's "brain" improves its performance.
Recipe of Deep Learning (recap)
Good results on training data → new activation function, adaptive learning rate
Good results on testing data → early stopping, regularization, dropout
Next: dropout.
Dropout
Training:
➢ Each time before updating the parameters
  • each neuron has a p% chance to drop out.
Dropout
Training:
➢ Each time before updating the parameters
  • each neuron has a p% chance to drop out
    → the structure of the network is changed: it becomes thinner.
  • Use the new, thinner network for training.
For each mini-batch, we resample the dropped-out neurons.
Dropout
Testing:
➢ No dropout
  • If the dropout rate at training is p%, all the weights are multiplied by (1 − p%).
  • Example: assume the dropout rate is 50%. If a weight w = 1 after training,
    set w = 0.5 for testing.
Dropout - Intuitive Reason
Training with dropout is like training with heavy weights tied to your legs;
at testing time there is no dropout — once the weights are removed, you are much stronger.
Dropout - Intuitive Reason
"My partner will slack off, so I have to do the work well."

➢ When people team up, if everyone expects their partner to do the work,
  nothing gets done in the end.
➢ However, if you know your partner may drop out, you will do the work better yourself.
➢ At testing time no one actually drops out, so the results end up being good.
Dropout - Intuitive Reason
• Why should the weights be multiplied by (1 − p%) (p% = dropout rate) at testing?

  Training with dropout (rate 50%): z is computed from roughly half of the
  inputs through weights w1 … w4.
  Testing without dropout: z′ is computed from all of the inputs, so z′ ≈ 2z.

  Multiplying the weights by (1 − p%) — here 0.5 × w1, …, 0.5 × w4 —
  brings the testing-time output back to the training-time scale: z′ ≈ z.
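A minimal numpy sketch of the training-time and testing-time behaviour described
above (illustrative; the function names and sizes are my own, and scaling the
activations is equivalent to scaling the weights they feed into):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(a, p=0.5):
    mask = rng.random(a.shape) >= p        # keep each unit with probability 1 - p
    return a * mask

def dropout_test(a, p=0.5):
    return a * (1.0 - p)                   # no dropping; scale by (1 - p) instead

a = np.ones(1000)                          # activations of 1000 units
print(dropout_train(a).sum())              # roughly 500: about half are dropped
print(dropout_test(a).sum())               # exactly 500: same expected total input
```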
Dropout is a kind of ensemble.

Ensemble training: sample several training sets (Set 1, Set 2, Set 3, Set 4 …)
from the training data and train a bunch of networks (Network 1–4) with
different structures, one per set.
Dropout is a kind of ensemble.

Ensemble testing: feed the same testing data x into Network 1–4, obtain
y1, y2, y3, y4, and average the outputs.
Dropout is a kind of ensemble.

Training of dropout: with M neurons, dropout can produce 2^M possible networks.
➢ Each mini-batch trains one of these networks.
➢ Some parameters in the networks are shared.
Dropout is a kind of ensemble.

Testing of dropout: ideally we would feed the testing data x into all 2^M
possible networks, obtain y1, y2, y3, …, and average them — but that is
infeasible. Instead, use the whole network with all the weights multiplied by
(1 − p%); its output is approximately equal to that average y.
Testing of Dropout
Consider a single neuron z = w1x1 + w2x2 with dropout rate 50%. The four
possible dropout patterns of the two inputs give:

  z = w1x1 + w2x2      z = w2x2
  z = w1x1             z = 0

Their average is ½w1x1 + ½w2x2, which is exactly the output of the full network
with the weights multiplied by ½:  z = ½w1x1 + ½w2x2.
Recipe of Deep Learning (recap)

Step 1: define a set of functions → Step 2: goodness of function →
Step 3: pick the best function → Neural Network →
good results on training data? → good results on testing data?
(bad testing results despite good training results = overfitting)
Try another task

[Demo: document classification. The machine reads a document — e.g. one
containing "stock" or "president" — and classifies it into categories such as
Politics, Economics, Sports, and Finance.]
http://top-breaking-news.com/
Try another task
Live Demo
