
Deep Learning Tutorial

李宏毅
Hung-yi Lee

1
Outline

Part I: Introduction of Deep Learning

Part II: Why Deep?

Part III: Tips for Training Deep Neural Network

Part IV: Neural Network with Memory


2
Part I:
Introduction of
Deep Learning

What people already knew in the 1980s

3
Example Application
• Handwriting Digit Recognition

Image → Machine → "2"

4
Handwriting Digit Recognition

Input: x = (x1, x2, …, x256)
A 16 x 16 image has 16 x 16 = 256 pixels; ink → 1, no ink → 0.

Output: y = (y1, y2, …, y10)
Each dimension represents the confidence of a digit,
e.g. y1 = 0.1 ("is 1"), y2 = 0.7 ("is 2"), …, y10 = 0.2 ("is 0"),
so the image is recognized as "2".

5
Example Application
• Handwriting Digit Recognition

A function maps the image x1, …, x256 to the scores of the ten digits y1, …, y10:

f: R^256 → R^10

In deep learning, the function is represented by a neural network.

6
Element of Neural Network

Neuron: f: R^K → R

z = a1 w1 + a2 w2 + … + aK wK + b

a = σ(z)

a1, …, aK: inputs; w1, …, wK: weights; b: bias; σ: activation function; a: output.

7
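The computation above can be sketched in a few lines of numpy (an illustrative sketch, not code from the tutorial; the sigmoid is the activation used on the following slides):

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def neuron(a, w, b):
    # z = a1*w1 + ... + aK*wK + b, output a = sigma(z)
    z = np.dot(a, w) + b
    return sigmoid(z)

# K = 2 inputs, matching the first neuron of the later example slides
print(neuron(np.array([1.0, -1.0]), np.array([1.0, -2.0]), 1.0))  # sigma(4) ≈ 0.98
```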
Neural Network

Input → Layer 1 → Layer 2 → … → Layer L → Output
x1, x2, …, xN                                y1, y2, …, yM

Each node is a neuron; the input layer is followed by the hidden layers and the output layer.

Deep means many hidden layers.

8
Example of Neural Network

Inputs (1, −1) are fed into two neurons with weights (1, −2) and (−1, 1) and biases 1 and 0,
giving z = 4 → σ(4) = 0.98 and z = −2 → σ(−2) = 0.12.

Sigmoid function: σ(z) = 1 / (1 + e^(−z))

9
Example of Neural Network

Feeding (1, −1) forward through all three layers gives the layer outputs
(0.98, 0.12) → (0.86, 0.11) → (0.62, 0.83).

10
Example of Neural Network

With input (0, 0) the same network gives the layer outputs
(0.73, 0.5) → (0.72, 0.12) → (0.51, 0.85).

The network defines a function f: R^2 → R^2:

f([1, −1]) = [0.62, 0.83]        f([0, 0]) = [0.51, 0.85]

Different parameters define different functions.

11
Matrix Operation

The first layer can be written as a matrix operation (rows separated by ";"):

σ( [ 1  −2 ; −1  1 ] [ 1 ; −1 ] + [ 1 ; 0 ] ) = σ( [ 4 ; −2 ] ) = [ 0.98 ; 0.12 ]

12
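A quick numpy check of this matrix form (a sketch; W, b, x are just the symbols used on the slide):

```python
import numpy as np

W = np.array([[1.0, -2.0],
              [-1.0, 1.0]])
b = np.array([1.0, 0.0])
x = np.array([1.0, -1.0])

sigma = lambda z: 1.0 / (1.0 + np.exp(-z))   # element-wise sigmoid
print(sigma(W @ x + b))                      # ≈ [0.98, 0.12]
```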
Neural Network

x → W1, b1 → a1 → W2, b2 → a2 → … → WL, bL → y

a1 = σ( W1 x + b1 )
a2 = σ( W2 a1 + b2 )
…
y  = σ( WL aL−1 + bL )

13
Neural Network

y = f(x) = σ( WL … σ( W2 σ( W1 x + b1 ) + b2 ) … + bL )

Using parallel computing techniques to speed up the matrix operations.

14
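A minimal forward pass over a list of layers, directly following the nested formula above (a sketch with random parameters; the shapes are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, layers):
    # layers: list of (W, b); computes sigma(WL ... sigma(W2 sigma(W1 x + b1) + b2) ... + bL)
    a = x
    for W, b in layers:
        a = sigmoid(W @ a + b)
    return a

rng = np.random.default_rng(0)
layers = [(rng.normal(size=(3, 2)), rng.normal(size=3)),   # W1, b1
          (rng.normal(size=(2, 3)), rng.normal(size=2))]   # W2, b2
print(forward(np.array([1.0, -1.0]), layers))
```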
Softmax
• Softmax layer as the output layer

Ordinary layer:

y1 = σ(z1), y2 = σ(z2), y3 = σ(z3)

In general, the output of the network can be any value,
which may not be easy to interpret.

15
Softmax
• Softmax layer as the output layer

Softmax layer (outputs can be interpreted as probabilities):

yi = e^(zi) / Σ_{j=1}^{3} e^(zj)

Example: z = (3, 1, −3) → e^z ≈ (20, 2.7, 0.05) → y ≈ (0.88, 0.12, ≈0)

16
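A sketch of the softmax computation with the example values from the slide (subtracting the maximum is a standard numerical-stability trick, not something discussed in the slides):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # shifting by the max does not change the result
    return e / e.sum()

print(softmax(np.array([3.0, 1.0, -3.0])))   # ≈ [0.88, 0.12, 0.00]
```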
How to set network parameters

θ = { W1, b1, W2, b2, ⋯, WL, bL }

The 16 x 16 = 256 pixels (ink → 1, no ink → 0) are fed into the network, and a
softmax output layer produces y1, …, y10 (e.g. y1 = 0.1 "is 1", y2 = 0.7 "is 2", y10 = 0.2 "is 0").

Set the network parameters θ such that:
  Input an image of "2": y2 has the maximum value.
  Input an image of "1": y1 has the maximum value.

How to let the neural network achieve this?

17
Training Data
• Preparing training data: images and their labels

Example labels: "5", "0", "4", "1", "9", "2", "1", "3"

Using the training data to find the network parameters.

18
Given a set of network parameters θ, each example has a cost value.

Cost

For an image of "1", the network outputs y = (0.2, 0.3, …, 0.5)
while the target is (1, 0, …, 0).

The cost L(θ) can be the Euclidean distance or the cross entropy
between the network output and the target.

19
Total Cost

For all R training examples, the total cost is

C(θ) = Σ_{r=1}^{R} L^r(θ)

where L^r(θ) measures how bad the network parameters θ are on the r-th example (x^r, ŷ^r).

Find the network parameters θ* that minimize this value.

20
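As a sketch, the total cost with the cross-entropy option mentioned on the previous slide might look like this (the arrays below are made-up outputs and targets, only for illustration):

```python
import numpy as np

def cross_entropy(y, y_hat):
    # per-example cost L^r(theta): -sum_i y_hat_i * log(y_i)
    return -np.sum(y_hat * np.log(y + 1e-12))

# C(theta) = sum over all R training examples of L^r(theta)
outputs = [np.array([0.2, 0.8]), np.array([0.9, 0.1])]   # network outputs y^r
targets = [np.array([0.0, 1.0]), np.array([1.0, 0.0])]   # labels y_hat^r
C = sum(cross_entropy(y, y_hat) for y, y_hat in zip(outputs, targets))
print(C)
```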
Gradient Descent

Assume there are only two parameters w1 and w2 in the network: θ = { w1, w2 }.

On the error surface (the colors represent the value of C):

  Randomly pick a starting point θ^0.

  Compute the negative gradient at θ^0:

  ∇C(θ^0) = [ ∂C(θ^0)/∂w1 , ∂C(θ^0)/∂w2 ]^T

  Times the learning rate η, and move by −η∇C(θ^0).

21
Gradient Descent

Repeat: compute the negative gradient at the current point and move by −η times it,

θ^1 = θ^0 − η∇C(θ^0),   θ^2 = θ^1 − η∇C(θ^1),   …

Eventually, we would reach a minimum.

22
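A minimal gradient-descent loop for a two-parameter cost (a hedged sketch; the quadratic cost is only a stand-in for the error surface on the slide):

```python
import numpy as np

def grad_C(theta):
    # gradient of the toy cost C(w1, w2) = w1^2 + 2*w2^2
    return np.array([2.0 * theta[0], 4.0 * theta[1]])

eta = 0.1                          # learning rate
theta = np.random.randn(2)         # randomly pick a starting point theta^0
for t in range(100):
    theta = theta - eta * grad_C(theta)   # theta^{t+1} = theta^t - eta * grad C(theta^t)
print(theta)                       # close to the minimum at (0, 0)
```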
Local Minima
• Gradient descent never guarantees the global minimum.

Different initial points may reach different minima, giving different results.

"Who is Afraid of Non-Convex Loss Functions?"
http://videolectures.net/eml07_lecun_wia/

23
Besides local minima ……

Along the way, gradient descent can be very slow at a plateau (∇C(θ) ≈ 0),
get stuck at a saddle point (∇C(θ) = 0),
or get stuck at a local minimum (∇C(θ) = 0).

24
In the physical world ……
• Momentum

A ball rolling down a surface does not necessarily stop at a plateau or a
local minimum. How about putting this phenomenon into gradient descent?

25
Momentum

Still no guarantee of reaching the global minimum, but it gives some hope ……

Movement = Negative of Gradient + Momentum

Even where the gradient = 0, the momentum can keep the parameters moving.

26
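A sketch of the momentum update (the 0.9 momentum coefficient is a common choice, not a value given in the slides):

```python
import numpy as np

def grad_C(theta):
    return 2.0 * theta             # toy gradient, stands in for the real one

eta, mu = 0.1, 0.9                 # learning rate and momentum coefficient (assumed)
theta = np.random.randn(2)
v = np.zeros_like(theta)           # movement from the previous step
for t in range(100):
    v = mu * v - eta * grad_C(theta)   # movement = momentum + negative of gradient
    theta = theta + v
print(theta)
```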
Mini-batch

 Randomly initialize θ^0
 Pick the 1st mini-batch (e.g. x^1, x^31, …):
   C = L^1 + L^31 + ⋯
   θ^1 ← θ^0 − η∇C(θ^0)
 Pick the 2nd mini-batch (e.g. x^2, x^16, …):
   C = L^2 + L^16 + ⋯
   θ^2 ← θ^1 − η∇C(θ^1)
 ……

C is different each time we update the parameters!

27
Mini-batch

Original Gradient Descent vs. With Mini-batch: the mini-batch updates are unstable,
since each update uses a different C.

The colors represent the total C on all training data.

28
Mini-batch                     Faster! Better!

 Randomly initialize θ^0
 Pick the 1st mini-batch:  C = C^1 + C^31 + ⋯,   θ^1 ← θ^0 − η∇C(θ^0)
 Pick the 2nd mini-batch:  C = C^2 + C^16 + ⋯,   θ^2 ← θ^1 − η∇C(θ^1)
 ……
 Until all mini-batches have been picked: one epoch

Repeat the above process.

29
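A sketch of mini-batch training on a toy one-parameter problem (the batch size, learning rate, and per-example squared-error cost are illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=8)
Y = 3.0 * X + 0.1 * rng.normal(size=8)      # toy data: y ≈ 3 * x

def grad_on_batch(theta, x, y):
    # gradient of C = sum over the mini-batch of L^r = (theta * x_r - y_r)^2
    return np.sum(2.0 * (theta * x - y) * x)

eta, batch_size, theta = 0.02, 2, 0.0
for epoch in range(20):                            # repeat the whole process
    order = rng.permutation(len(X))                # reshuffle the examples
    for start in range(0, len(X), batch_size):     # until all mini-batches are picked
        idx = order[start:start + batch_size]      # pick the next mini-batch
        theta -= eta * grad_on_batch(theta, X[idx], Y[idx])
print(theta)   # close to 3
```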
Backpropagation
• A network can have millions of parameters.
• Backpropagation is the way to compute the gradients
efficiently (not today)
• Ref:
  http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/DNN%20backprop.ecm.mp4/index.html
• Many toolkits can compute the gradients automatically
  Ref:
  http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/Theano%20DNN.ecm.mp4/index.html

30
Part II:
Why Deep?

31
Deeper is Better?

Layer X Size   Word Error Rate (%)      Layer X Size   Word Error Rate (%)
1 X 2k         24.2
2 X 2k         20.4
3 X 2k         18.4
4 X 2k         17.8
5 X 2k         17.2                     1 X 3772       22.5
7 X 2k         17.1                     1 X 4634       22.6
                                        1 X 16k        22.1

Not surprising: more parameters, better performance.

Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription
Using Context-Dependent Deep Neural Networks." Interspeech. 2011.

32
Universality Theorem

Any continuous function f: R^N → R^M can be realized by a network
with one hidden layer (given enough hidden neurons).

Reference for the reason:
http://neuralnetworksanddeeplearning.com/chap4.html

Why a "deep" neural network, not a "fat" neural network?

33
Fat + Short v.s. Thin + Tall

A shallow, wide network and a deep, narrow network with the same number
of parameters: which one is better?

34
Fat + Short v.s. Thin + Tall

Layer X Size   Word Error Rate (%)      Layer X Size   Word Error Rate (%)
1 X 2k         24.2
2 X 2k         20.4
3 X 2k         18.4
4 X 2k         17.8
5 X 2k         17.2                     1 X 3772       22.5
7 X 2k         17.1                     1 X 4634       22.6
                                        1 X 16k        22.1

Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription
Using Context-Dependent Deep Neural Networks." Interspeech. 2011.

35
Why Deep?
• Deep → Modularization

Suppose we directly train four classifiers on images:
  Classifier 1: girls with long hair (many examples)
  Classifier 2: boys with long hair (few examples → weak)
  Classifier 3: girls with short hair (many examples)
  Classifier 4: boys with short hair (many examples)

36
Why Deep?
• Deep → Modularization

Instead, first train basic classifiers for the attributes:
  Boy or Girl?
  Long or short hair?

Each basic classifier can have sufficient training examples.

37
Why Deep?
• Deep → Modularization

The four fine classifiers (girls/boys with long/short hair) are built on top of the
basic classifiers, which are shared by the following classifiers as modules.
Each fine classifier can then be trained with little data.

38
Deep learning also works on small data sets like TIMIT.

Why Deep?
• Deep → Modularization → Less training data?

The modularization is automatically learned from data:
the 1st layer learns the most basic classifiers,
the 2nd layer uses the 1st layer as modules to build classifiers,
the 3rd layer uses the 2nd layer as modules, and so on.

39
SVM: a hand-crafted kernel function followed by a simple classifier.
Source of image: http://www.gipsa-lab.grenoble-inp.fr/transfert/seminaire/455_Kadri2013Gipsa-lab.pdf

Deep Learning: a learnable kernel φ(x) (the hidden layers) followed by a simple
classifier (the output layer).

40
Hard to get the power of Deep …

Before 2006, deeper usually did not imply better.

41
Part III:
Tips for Training DNN

42
Recipe for Learning

http://www.gizmodo.com.au/2015/04/the-basic-recipe-for-machine-learning-explained-in-a-single-powerpoint-slide/ 43
Recipe for Learning

Don't forget overfitting!

Modify the Network / Better Optimization Strategy → improve results on training data
Preventing Overfitting → improve results on testing data

http://www.gizmodo.com.au/2015/04/the-basic-recipe-for-machine-learning-explained-in-a-single-powerpoint-slide/ 44
Recipe for Learning

Modify the Network
• New activation functions, for example, ReLU or Maxout

Better Optimization Strategy
• Adaptive learning rates

Prevent Overfitting
• Dropout — only use this approach when you have already obtained
  good results on the training data.

45
Part III:
Tips for Training DNN
New Activation Function

46
ReLU
• Rectified Linear Unit (ReLU)

a = z  if z > 0
a = 0  if z ≤ 0

Reasons:
1. Fast to compute
2. Biological reason
3. Infinite sigmoids with different biases
4. Vanishing gradient problem

[Xavier Glorot, AISTATS'11] [Andrew L. Maas, ICML'13] [Kaiming He, arXiv'15]

47
Vanishing Gradient Problem

In a deep sigmoid network, the layers near the input have smaller gradients,
learn very slowly, and are still almost random,
while the layers near the output have larger gradients, learn very fast,
and have already converged — based on the still almost-random earlier layers.

In 2006, people used RBM pre-training. In 2015, people use ReLU.

48
Vanishing Gradient Problem

Intuitive way to compute the gradient: perturb a weight by Δw and see how the
cost changes,

∂C/∂w ≈ ΔC/Δw

With sigmoid neurons, a large Δw near the input is squashed layer by layer, so it
produces only a small change in the output and a small ΔC:
smaller gradients for the earlier layers.

49
ReLU

a = z for z > 0, a = 0 for z ≤ 0.
The neurons whose ReLU output is 0 can be removed from the network.

50

ReLU

What remains is a thinner linear network,
which does not have smaller gradients.

51
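A one-line ReLU in numpy (a sketch, not from the slides):

```python
import numpy as np

def relu(z):
    # a = z if z > 0, otherwise a = 0, applied element-wise
    return np.maximum(0.0, z)

print(relu(np.array([4.0, -2.0, 0.5])))   # [4.  0.  0.5]
```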
Maxout                         ReLU is a special case of Maxout

• Learnable activation function [Ian J. Goodfellow, ICML'13]

Example with inputs x1, x2: the linear units are grouped and the maximum is taken
in each group, e.g. max(5, 7) = 7 and max(−1, 1) = 1 in the first layer,
max(1, 2) = 2 and max(4, 3) = 4 in the second.

You can have more than 2 elements in a group.

52
Maxout                         ReLU is a special case of Maxout

• Learnable activation function [Ian J. Goodfellow, ICML'13]
• The activation function in a maxout network can be
  any piecewise linear convex function
• The number of pieces depends on how many elements are in a group
  (e.g. 2 elements in a group vs. 3 elements in a group)

53
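A sketch of a maxout layer: compute several linear units and keep the maximum within each group (the shapes and group size here are illustrative):

```python
import numpy as np

def maxout(x, W, b, group_size=2):
    # W: (num_units, input_dim), b: (num_units,); num_units is a multiple of group_size
    z = W @ x + b                      # all linear units
    z = z.reshape(-1, group_size)      # split the units into groups
    return z.max(axis=1)               # max within each group = layer output

rng = np.random.default_rng(0)
x = rng.normal(size=3)
W = rng.normal(size=(4, 3))            # 4 linear units -> 2 maxout outputs
b = rng.normal(size=4)
print(maxout(x, W, b))
```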
Part III:
Tips for Training DNN
Adaptive Learning Rate

54
Learning Rate                  Set the learning rate η carefully

If the learning rate is too large,
the cost may not decrease after each update.

55
Learning Rate                  Set the learning rate η carefully

If the learning rate is too large,
the cost may not decrease after each update.

If the learning rate is too small,
training will be too slow.

Can we give different parameters different learning rates?

56
Adagrad

Original Gradient Descent:   θ^t ← θ^{t−1} − η∇C(θ^{t−1})

Each parameter w is considered separately:

w^{t+1} ← w^t − η_w g^t,   where g^t = ∂C(θ^t)/∂w

Parameter-dependent learning rate:

η_w = η / sqrt( Σ_{i=0}^{t} (g^i)^2 )

η is a constant; the denominator is the summation of the squares
of the previous derivatives.

57
Adagrad                        η_w = η / sqrt( Σ_{i=0}^{t} (g^i)^2 )

Example: w1 has derivatives g^0 = 0.1, g^1 = 0.2, …;
         w2 has derivatives g^0 = 20.0, g^1 = 10.0, …

Learning rate for w1:  η / sqrt(0.1²),  then  η / sqrt(0.1² + 0.2²),  …
Learning rate for w2:  η / sqrt(20²),   then  η / sqrt(20² + 10²),   …

Observations: 1. The learning rate is smaller and smaller for all parameters.
              2. Smaller derivatives, larger learning rate, and vice versa.   Why?

58
Why?  Larger derivatives mean a steeper direction on the error surface, which calls
for a smaller learning rate; smaller derivatives mean a flatter direction, which
calls for a larger learning rate.

59
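A sketch of the Adagrad update for a parameter vector (the small epsilon that avoids division by zero is an implementation detail, not part of the slide's formula):

```python
import numpy as np

def grad_C(theta):
    return 2.0 * theta                       # toy gradient

eta, eps = 0.1, 1e-8
theta = np.array([1.0, -20.0])
sum_sq = np.zeros_like(theta)                # accumulated (g^i)^2, one per parameter
for t in range(100):
    g = grad_C(theta)
    sum_sq += g ** 2
    theta -= eta / (np.sqrt(sum_sq) + eps) * g   # eta_w = eta / sqrt(sum of squared g)
# the very first step has the same size (0.1) for both parameters,
# even though their gradients differ by a factor of 20
print(theta)
```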
Not the whole story ……
• Adagrad [John Duchi, JMLR’11]
• RMSprop
• https://www.youtube.com/watch?v=O3sxAc4hxZU

• Adadelta [Matthew D. Zeiler, arXiv’12]


• Adam [Diederik P. Kingma, ICLR’15]
• AdaSecant [Caglar Gulcehre, arXiv’14]
• “No more pesky learning rates” [Tom Schaul, arXiv’12]

60
Part III:
Tips for Training DNN
Dropout

61
Dropout                        Pick a mini-batch:  θ^t ← θ^{t−1} − η∇C(θ^{t−1})

Training:

 Each time before computing the gradients
 Each neuron has a probability of p% to drop out

62
Dropout                        Pick a mini-batch:  θ^t ← θ^{t−1} − η∇C(θ^{t−1})

Training:

 Each time before computing the gradients
 Each neuron has a probability of p% to drop out
   The structure of the network is changed — it becomes thinner!
 Use the new, thinner network for training

For each mini-batch, we resample the dropout neurons.  63
Dropout

Testing:

 No dropout
 If the dropout rate at training is p%, all the weights are multiplied by (1−p)%
 Example: assume the dropout rate is 50%.
   If a weight is w after training, use 0.5w for testing.  64
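A sketch of dropout for one layer, following the train/test rule above (the ReLU layer and the 50% rate are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                                        # dropout rate

def layer_train(a, W, b):
    # each input neuron is dropped with probability p before the layer is applied
    mask = rng.random(a.shape) >= p
    return np.maximum(0.0, W @ (a * mask) + b)

def layer_test(a, W, b):
    # no dropout at test time; the weights are multiplied by (1 - p)
    return np.maximum(0.0, (W * (1.0 - p)) @ a + b)

a = rng.normal(size=4); W = rng.normal(size=(3, 4)); b = np.zeros(3)
print(layer_train(a, W, b), layer_test(a, W, b))
```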
Dropout - Intuitive Reason

"My partner will slack off, so I have to work hard myself."

 When working in a team, if everyone expects their partners to do the work,
  nothing gets done in the end.
 However, if you know your partner may drop out, you will do better.
 When testing, no one actually drops out, so we obtain good results eventually.  65
Dropout - Intuitive Reason
• Why should the weights be multiplied by (1−p)% (the dropout rate) when testing?

Training of Dropout (assume the dropout rate is 50%):
about half of the inputs to a neuron are dropped, so z = w1 a1 + w2 a2 + … uses
only about half of the weights on average.

Testing of Dropout (no dropout): with the weights from training, all inputs are
present, so z′ ≈ 2z.  Multiplying each weight by 0.5 gives z′ ≈ z.  66
Dropout is a kind of ensemble.

Ensemble: split the training set into Set 1, Set 2, Set 3, Set 4, and
train a bunch of networks (Network 1–4) with different structures.

67
Dropout is a kind of ensemble.

Ensemble: feed the testing data x to Network 1–4, obtain y1, y2, y3, y4,
and average them.

68
Dropout is a kind of ensemble.

Training of Dropout: with M neurons there are 2^M possible thinned networks.
Each mini-batch is used to train one of these networks,
and some parameters in the networks are shared.

69
Dropout is a kind of ensemble.

Testing of Dropout: ideally we would average the outputs y1, y2, y3, … of all the
thinned networks on the testing data x.
Multiplying all the weights by (1−p)% gives approximately this average ≈ y.

70
More about dropout
• More references for dropout [Nitish Srivastava, JMLR'14] [Pierre Baldi,
  NIPS'13] [Geoffrey E. Hinton, arXiv'12]
• Dropout works better with Maxout [Ian J. Goodfellow, ICML'13]
• Dropconnect [Li Wan, ICML'13]
  • Dropout deletes neurons
  • Dropconnect deletes the connections between neurons
• Annealed dropout [S.J. Rennie, SLT'14]
  • The dropout rate decreases over epochs
• Standout [J. Ba, NIPS'13]
  • Each neuron has a different dropout rate

71
Part IV:
Neural Network
with Memory

72
Neural Network needs Memory
• Named Entity Recognition
  • Detecting named entities like names of people, locations, organizations, etc.
    in a sentence.

The word "apple" is encoded as a 1-of-N vector and fed into a DNN, which outputs
e.g. 0.1 people, 0.1 location, 0.5 organization, 0.3 none.

73
Neural Network needs Memory
• Named Entity Recognition
  • Detecting named entities like names of people, locations, organizations, etc.
    in a sentence.

In "the president of apple eats an apple", each word x1 … x7 is fed into the same
DNN to get y1 … y7; the first "apple" should be tagged ORG and the second NONE.

The same word needs different outputs, so the DNN needs memory!  74
Recurrent Neural Network (RNN)

The outputs of the hidden layer are stored in the memory (copied into a1, a2).

The memory can be considered as another input.

75
RNN

At each time step, the input xi and the memory are transformed by Wi and Wh into
the hidden activations ai, which are copied back into the memory, and Wo produces yi.

The same network is used again and again.

Output yi depends on x1, x2, …, xi.

76
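A sketch of this unrolled computation with the weight names from the slide (Wi, Wh, Wo); the shapes and the tanh nonlinearity are assumptions:

```python
import numpy as np

def rnn_forward(xs, Wi, Wh, Wo):
    # xs: the input vectors x1, x2, ...; the same Wi, Wh, Wo are used at every step
    a = np.zeros(Wh.shape[0])            # the memory, initialized to zero
    ys = []
    for x in xs:
        a = np.tanh(Wi @ x + Wh @ a)     # new hidden state, copied into the memory
        ys.append(Wo @ a)                # output y_i depends on x1, ..., x_i
    return ys

rng = np.random.default_rng(0)
Wi = rng.normal(size=(4, 3)); Wh = rng.normal(size=(4, 4)); Wo = rng.normal(size=(2, 4))
print(rnn_forward([rng.normal(size=3) for _ in range(3)], Wi, Wh, Wo))
```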
RNN                            How to train?

Each output yi has a target ŷi and a cost Li.

Find the network parameters that minimize the total cost:
Backpropagation through time (BPTT).

77
Of course it can be deep …

The hidden layers at each time step can be stacked, with each layer passing its
output both upward and on to the next time step.

78
Bidirectional RNN

One RNN reads the input xt, xt+1, xt+2, … forward and another reads it backward;
the outputs yt, yt+1, yt+2, … are produced from both hidden states.

79
Many to Many (Output is shorter)
• Both the input and the output are sequences, but the output is shorter.
• E.g. Speech Recognition

Input: a vector sequence of acoustic frames;  Output: the character sequence "好棒".
Each frame outputs a character (好好好棒棒棒棒棒), and trimming the repetitions gives "好棒".

Problem: then the output can never be "好棒棒", which genuinely repeats a character.

80
Many to Many (Output is shorter)
• Both the input and the output are sequences, but the output is shorter.
• Connectionist Temporal Classification (CTC) [Alex Graves, ICML'06]
  [Alex Graves, ICML'14] [Haşim Sak, Interspeech'15] [Jie Li, Interspeech'15]
  [Andrew Senior, ASRU'15]

Add an extra symbol "φ" representing "null":
  好 φ φ 棒 φ φ φ φ  →  "好棒"
  好 φ φ 棒 φ 棒 φ φ  →  "好棒棒"

81
Many to Many (No Limitation)
• Both the input and the output are sequences, with different lengths.
  → Sequence to sequence learning
• E.g. Machine Translation ("machine learning" → 機器學習)

The RNN reads "machine learning"; its final hidden state contains all the
information about the input sequence.

82
Many to Many (No Limitation)
• Both the input and the output are sequences, with different lengths.
  → Sequence to sequence learning
• E.g. Machine Translation ("machine learning" → 機器學習)

The decoder keeps generating characters: 機 器 學 習 慣 性 ……
It doesn't know when to stop.

83
Many to Many (No Limitation)

推 tlkagk: ========= 斷 ==========

Ref: http://zh.pttpedia.wikia.com/wiki/%E6%8E%A5%E9%BE%8D%E6%8E%A8%E6%96%87
(鄉民百科, the PTT wiki)  84
Many to Many (No Limitation)
• Both the input and the output are sequences, with different lengths.
  → Sequence to sequence learning
• E.g. Machine Translation ("machine learning" → 機器學習)

The decoder generates: 機 器 學 習 ===
Add a symbol "===" ( 斷 , "stop") to mark the end of the output.

[Ilya Sutskever, NIPS'14] [Dzmitry Bahdanau, arXiv'15]

85
Thanks to 曾柏翔 for providing the experimental results.

Unfortunately ……
• RNN-based networks are not always easy to learn.

Real experiments on language modeling: the cost sometimes fluctuates wildly during
training; only the lucky runs decrease smoothly.

86
The error surface is rough.

The error surface of an RNN is either very flat or very steep.
Clipping the gradients keeps an update from exploding when it hits a steep cliff.

Cost plotted over w1, w2.  [Razvan Pascanu, ICML'13]

87
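A sketch of norm-based gradient clipping (the threshold is an assumption; the slide only says "clipping"):

```python
import numpy as np

def clip_gradient(g, threshold=5.0):
    # if the gradient norm exceeds the threshold, rescale it down to that norm
    norm = np.linalg.norm(g)
    if norm > threshold:
        g = g * (threshold / norm)
    return g

g = np.array([30.0, -40.0])     # a "cliff" gradient with norm 50
print(clip_gradient(g))         # rescaled to norm 5
```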
Why?

Toy example: a linear RNN with a single weight w, input 1 at the first time step
and 0 afterwards, so y^1000 = w^999.

w = 1     →  y^1000 = 1
w = 1.01  →  y^1000 ≈ 20000       large gradient → small learning rate?
w = 0.99  →  y^1000 ≈ 0           small gradient → large learning rate?
w = 0.01  →  y^1000 ≈ 0

88
Helpful Techniques
• Nesterov's Accelerated Gradient (NAG):
  • An advanced momentum method
• RMSProp
  • An advanced approach to giving each parameter a different learning rate
  • Considers the change of the second derivatives
• Long Short-term Memory (LSTM)
  • Can deal with gradient vanishing (not gradient explosion)

89
Long Short-term Memory (LSTM)

Special neuron: 4 inputs, 1 output.

A memory cell is surrounded by three gates:
  Input Gate  — a signal from other parts of the network controls whether the
                input is written into the cell.
  Forget Gate — a signal from other parts of the network controls whether the
                cell content is kept or forgotten.
  Output Gate — a signal from other parts of the network controls whether the
                cell content is read out to the rest of the network.

90
LSTM cell

c′ = g(z) f(z_i) + c f(z_f)
a  = h(c′) f(z_o)

z is the cell input, z_i, z_f, z_o are the gate inputs, c is the stored value,
and c′ is the new stored value.
The gate activation function f is usually a sigmoid function:
its value lies between 0 and 1, which mimics an open or closed gate.

91
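One step of this cell in numpy, directly following the formulas above (choosing tanh for g and h is an assumption; the slide leaves them unspecified):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_step(z, z_i, z_f, z_o, c):
    # c' = g(z) f(z_i) + c f(z_f);   a = h(c') f(z_o)
    g, h = np.tanh, np.tanh              # assumed nonlinearities for input/output
    c_new = g(z) * sigmoid(z_i) + c * sigmoid(z_f)
    a = h(c_new) * sigmoid(z_o)
    return a, c_new

a, c = lstm_cell_step(z=0.5, z_i=2.0, z_f=-2.0, z_o=2.0, c=1.0)
print(a, c)   # the gate values (≈0.88 open, ≈0.12 closed) control storing and output
```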
Original Network:
Simply replace the neurons with LSTM cells.

In the original network each neuron takes one input z and produces one output a
(a1, a2 computed from the inputs x1, x2).

92
With LSTM, each cell needs four inputs (for z, z_i, z_f, z_o) computed from x1, x2,
so the network has 4 times the number of parameters.

93
Extension: "peephole" LSTM

At time t, the gate inputs z_f, z_i, z, z_o are computed from c_{t−1}, h_{t−1},
and x_t; the cell produces c_t and the output y_t, and the same is repeated at t+1.

94
Other Simpler Alternatives

Gated Recurrent Unit (GRU) [Cho, EMNLP'14]
Structurally Constrained Recurrent Network (SCRN) [Tomas Mikolov, ICLR'15]

Vanilla RNN initialized with the identity matrix + ReLU activation function
[Quoc V. Le, arXiv'15]
 Outperforms or is comparable with LSTM on 4 different tasks

95
What is the next wave?
• Attention-based Model

A controller (DNN/LSTM) takes the input x and produces the output y, while reading
heads and writing heads access an internal memory (or information from the output).

Already applied to speech recognition, caption generation, QA, visual QA.

96
What is the next wave?
• Attention-based Model
• End-To-End Memory Networks. S. Sukhbaatar, A. Szlam, J. Weston, R. Fergus. arXiv
Pre-Print, 2015.
• Neural Turing Machines. Alex Graves, Greg Wayne, Ivo Danihelka. arXiv Pre-Print,
2014
• Ask Me Anything: Dynamic Memory Networks for Natural Language Processing.
Kumar et al. arXiv Pre-Print, 2015
• Neural Machine Translation by Jointly Learning to Align and Translate. D. Bahdanau,
K. Cho, Y. Bengio; International Conference on Representation Learning 2015.
• Show, Attend and Tell: Neural Image Caption Generation with Visual
Attention. Kelvin Xu et. al.. arXiv Pre-Print, 2015.
• Attention-Based Models for Speech Recognition. Jan Chorowski, Dzmitry Bahdanau,
Dmitriy Serdyuk, Kyunghyun Cho, Yoshua Bengio. arXiv Pre-Print, 2015.
• Recurrent models of visual attention. V. Mnih, N. Hees, A. Graves and K.
Kavukcuoglu. In NIPS, 2014.
• A Neural Attention Model for Abstractive Sentence Summarization. A. M. Rush, S.
Chopra and J. Weston. EMNLP 2015.
97
Concluding Remarks

98
Concluding Remarks
• Introduction of deep learning
• Discussing some reasons for using deep learning
• New techniques for deep learning
• ReLU, Maxout
• Giving all the parameters different learning rates
• Dropout
• Network with memory
• Recurrent neural network
• Long short-term memory (LSTM)
99
Reading Materials
• “Neural Networks and Deep Learning”
• written by Michael Nielsen
• http://neuralnetworksanddeeplearning.com/
• “Deep Learning” (not finished yet)
• Written by Yoshua Bengio, Ian J. Goodfellow and
Aaron Courville
• http://www.iro.umontreal.ca/~bengioy/dlbook/

100
Thank you
for your attention!

101
Acknowledgement
• Thanks to Ryan Sun for pointing out typos in the slides by email.

102
Appendix

103
Matrix Operation

σ( W x + b ) = a

σ( [ 1  −2 ; −1  1 ] [ 1 ; −1 ] + [ 1 ; 0 ] ) = σ( [ 4 ; −2 ] ) = [ 0.98 ; 0.12 ]

104
Why Deep? – Logic Circuits
• Two levels of basic logic gates can represent any Boolean function.
• However, no one uses two levels of logic gates to build computers.
• Using multiple layers of logic gates to build some functions is much simpler
  (fewer gates needed).

105
Boosting: the input is fed to several weak classifiers, which are combined.

Deep Learning: the first layer gives weak classifiers, the second layer gives
boosted weak classifiers, and later layers boost them further.

106
Maxout                         ReLU is a special case of Maxout

Input x → w, b → ReLU:   z = wx + b,   a = max(z, 0)

Input x → Max over two linear units:   z1 = wx + b,   z2 = 0·x + 0 = 0,
a = max{ z1, z2 }

With the second unit fixed to w = 0, b = 0, the maxout unit is exactly a ReLU.

107
Maxout                         ReLU is a special case of Maxout

Input x → w, b → ReLU:   z = wx + b,   a = max(z, 0)

Input x → Max over two linear units:   z1 = wx + b,   z2 = w′x + b′,
a = max{ z1, z2 }

With learnable w′, b′, the maxout unit is a learnable activation function.

108
