
Deep Learning

Dr. M. Prasad
Deep learning attracts lots of attention.
• I believe you have seen lots of exciting results before.

Deep learning trends at Google. Source: SIGMOD/Jeff Dean

This talk focuses on the basic techniques.


Outline
Lecture I: Introduction of Deep Learning

Lecture II: Tips for Training Deep Neural Network

Lecture III: Variants of Neural Network

Lecture IV: Next Wave


Introduction of Deep Learning

Outline of Lecture I
Introduction of Deep Learning
Let’s start with general
machine learning.
Why Deep?

“Hello World” for Deep Learning


Machine Learning ≈ Looking for a Function
• Speech Recognition: f( audio ) = "How are you"
• Image Recognition: f( image ) = "Cat"
• Playing Go: f( board ) = "5-5" (next move)
• Dialogue System: f( "Hi" ) = "Hello"
  (what the user said → system response)
Image Recognition: Framework
f( image ) = "cat"
A set of functions (Model): f1, f2, …
  f1( image ) = "cat"     f2( image ) = "monkey"
  f1( image ) = "dog"     f2( image ) = "snake"
Image Recognition: Framework
f( image ) = "cat"
A set of functions (Model): f1, f2, … → which one is better?
Goodness of function f
Supervised Learning
Training Data:
  function input: images
  function output: "cat", "dog", "monkey", …
Image Recognition: Framework
f( image ) = "cat"

Training                                 Testing
Step 1: a set of functions (Model): f1, f2, …
Step 2: goodness of function f
Step 3: pick the "Best" function f*      Using f*: f*( image ) = "cat"

Training Data: images with labels "monkey", "cat", "dog"
Three Steps for Deep Learning
Step 1: define a set of function
Step 2: goodness of function
Step 3: pick the best function

Deep Learning is so simple ……


Three Steps for Deep Learning
Step 1: define a set of function (Neural Network)
Step 2: goodness of function
Step 3: pick the best function

Deep Learning is so simple ……
Human Brains → Neural Network

Neuron
z = a₁w₁ + ⋯ + aₖwₖ + ⋯ + a_K w_K + b
a = σ(z)
A simple function: inputs a₁ … a_K, weights w₁ … w_K, bias b, activation function σ.
Neural Network
Neuron with Sigmoid activation: σ(z) = 1 / (1 + e⁻ᶻ)
Worked example in the figure: with the weights (1, −2) and bias 1 shown, the weighted sum is z = 4, so the output is σ(4) ≈ 0.98.
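A minimal numpy sketch of such a neuron (not a library API; the input (1, −1) with weights (1, −2) and bias 1 matches the feedforward example on the next slides):

```python
import numpy as np

def sigmoid(z):
    # Sigmoid activation: squashes any real z into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def neuron(a, w, b):
    # z = a1*w1 + ... + aK*wK + b, then apply the activation function
    z = np.dot(a, w) + b
    return sigmoid(z)

print(neuron(np.array([1, -1]), np.array([1, -2]), 1))  # z = 4, sigma(4) ~ 0.98
```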
Neural Network
Different connections lead to different network structures.
Each neuron can have different values of weights and biases.
Weights and biases are the network parameters 𝜃.
Fully Connected Feedforward Network
Example: with the weights and biases in the figure and the sigmoid σ(z) = 1 / (1 + e⁻ᶻ), the input (1, −1) gives first-layer activations 0.98 and 0.12 and finally outputs (0.62, 0.83); the input (0, 0) gives outputs (0.51, 0.85).

This is a function: input vector, output vector.
𝑓([1, −1]) = [0.62, 0.83]      𝑓([0, 0]) = [0.51, 0.85]
Given parameters 𝜃, we define a function.
Given a network structure, we define a function set.
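A minimal numpy sketch of such a forward pass. Only the first hidden layer's weights and biases are taken from the slide; the later layers use made-up placeholder values just to show the structure:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, layers):
    # layers: list of (W, b); each layer computes a = sigmoid(W a_prev + b)
    a = x
    for W, b in layers:
        a = sigmoid(W.dot(a) + b)
    return a

layers = [
    (np.array([[1.0, -2.0], [-1.0, 1.0]]), np.array([1.0, 0.0])),   # from the slide
    (np.array([[0.5, -1.0], [1.0, 0.3]]), np.array([0.0, 0.5])),    # placeholder
    (np.array([[1.0, 0.2], [-0.5, 1.0]]), np.array([0.1, -0.2])),   # placeholder
]
print(forward(np.array([1.0, -1.0]), layers))  # first hidden activations are (0.98, 0.12)
```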
Fully Connected Feedforward Network
Input layer (x₁ … x_N) → Layer 1 → Layer 2 → …… → Layer L → Output layer (y₁ … y_M)
The layers between input and output are hidden layers.
Deep means many hidden layers.
Output Layer (Option)
• Softmax layer as the output layer

Ordinary layer: yᵢ = σ(zᵢ)
In general, the output of the network can be any value; it may not be easy to interpret.
Output Layer (Option)
• Softmax layer as the output layer

Softmax layer:
yᵢ = e^(zᵢ) / Σⱼ e^(zⱼ)
Probability: 1 > yᵢ > 0 and Σᵢ yᵢ = 1

Example: z = (3, 1, −3) → e^z ≈ (20, 2.7, 0.05) → y ≈ (0.88, 0.12, ≈0)
Example Application
Input: 16 x 16 = 256 pixels x₁ … x₂₅₆ (ink → 1, no ink → 0)
Output: y₁ … y₁₀; each dimension represents the confidence of a digit.
E.g. y₁ = 0.1 (is 1), y₂ = 0.7 (is 2), …, y₁₀ = 0.2 (is 0) → the image is "2".
Example Application
• Handwriting Digit Recognition

Machine (Neural Network): input x₁ … x₂₅₆ → output y₁ … y₁₀ → "2"
What is needed is a function.
Input: 256-dim vector; output: 10-dim vector.
Example Application: Handwriting Digit Recognition
Input layer (x₁ … x_N) → hidden layers → output layer (y₁: is 1, y₂: is 2, …, y₁₀: is 0) → "2"
The network structure defines a function set containing the candidates for handwriting digit recognition.
You need to decide the network structure so that a good function is in your function set.
FAQ

• Q: How many layers? How many neurons for each


layer?

Trial and Error + Intuition

• Q: Can the structure be automatically determined?


Three Steps for Deep Learning
Step 1: define a set of function (Neural Network)
Step 2: goodness of function
Step 3: pick the best function

Deep Learning is so simple ……
Training
Data
• Preparing training data: images and their labels

“5” “0” “4” “1”

“9” “2” “1” “3”

The learning target is defined on


the training data.
Learning Target
Input x₁ … x₂₅₆ (16 x 16 = 256, ink → 1, no ink → 0) → network → softmax → y₁ … y₁₀
The learning target is:
Input: an image of "1" → y₁ has the maximum value
Input: an image of "2" → y₂ has the maximum value
Loss
A good function should make the loss of all examples as small as possible.
Given a set of parameters, for an input "1" the network outputs y₁ … y₁₀; the target is (1, 0, …, 0).
Loss 𝑙 can be the distance between the network output and the target.
Total Loss
For all R training examples:
L = Σ_{r=1}^{R} 𝑙ʳ
As small as possible.
Find a function in the function set that minimizes total loss L.
Find the network parameters 𝜽* that minimize total loss L.
Three Steps for Deep Learning
Step 1: define a set of function (Neural Network)
Step 2: goodness of function
Step 3: pick the best function

Deep Learning is so simple ……
How to pick the best function
Find network parameters 𝜽* that minimize total loss L.
Enumerate all possible values? Infeasible: the network parameters 𝜃 = {w₁, w₂, w₃, ⋯, b₁, b₂, b₃, ⋯} number in the millions.
E.g. speech recognition: 8 layers with 1000 neurons each; two adjacent layers of 1000 neurons already have 10⁶ weights between them.
Gradient Descent
Network parameters 𝜃 = {w₁, w₂, ⋯, b₁, b₂, ⋯}
Find network parameters 𝜽* that minimize total loss L.
• Pick an initial value for w: random (or RBM pre-train); random is usually good enough.
Gradient Descent
• Pick an initial value for w
• Compute ∂L/∂w
  Negative → increase w;  Positive → decrease w
http://chico386.pixnet.net/album/photo/171572850
Gradient Descent
• Pick an initial value for w
• Compute ∂L/∂w
• Update: w ← w − η ∂L/∂w   (η is called the "learning rate")
• Repeat
Gradient Descent
• Pick an initial value for w
• Compute ∂L/∂w
• w ← w − η ∂L/∂w
• Repeat until ∂L/∂w is approximately zero (i.e. the update becomes very small)
Gradient Descent (all parameters together)
Compute the gradient
∇L = [∂L/∂w₁, ∂L/∂w₂, ⋯, ∂L/∂b₁, ⋯]ᵀ
and update every parameter:
w₁ ← w₁ − η ∂L/∂w₁,   w₂ ← w₂ − η ∂L/∂w₂,   b₁ ← b₁ − η ∂L/∂b₁, ⋯
E.g. w₁: 0.2 → 0.15 → 0.09 …,  w₂: −0.1 → 0.05 → 0.15 …,  b₁: 0.3 → 0.2 → 0.10 …
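A minimal sketch of this update loop on a toy loss. The loss function and learning rate here are made up for illustration; in a real network the gradient would come from backpropagation:

```python
import numpy as np

def loss(theta):
    # Toy loss with its minimum at theta = (1, -2)
    return (theta[0] - 1.0) ** 2 + (theta[1] + 2.0) ** 2

def gradient(theta):
    # Gradient of the toy loss (backprop would provide this for a network)
    return np.array([2.0 * (theta[0] - 1.0), 2.0 * (theta[1] + 2.0)])

theta = np.random.randn(2)   # pick an initial value
eta = 0.1                    # learning rate
for step in range(100):
    theta = theta - eta * gradient(theta)   # theta <- theta - eta * dL/dtheta
print(theta, loss(theta))    # close to (1, -2), loss close to 0
```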
Gradient Descent (two parameters w₁, w₂; color: value of total loss L)
Randomly pick a starting point, compute (∂L/∂w₁, ∂L/∂w₂), then move by (−η ∂L/∂w₁, −η ∂L/∂w₂).
Hopefully, we would reach a minimum …..
Gradient Descent - Difficulty
• Gradient descent never guarantees global minima.
Different initial points reach different minima, so different results.
There are some tips to help you avoid local minima, but no guarantee.
You are playing Age of Empires …
Gradient Descent You cannot see the whole map.

(−𝜂 𝜕𝐿 𝜕𝑤1, −𝜂 𝜕𝐿 𝜕𝑤2)

Compute 𝜕𝐿 𝜕𝑤1, 𝜕𝐿 𝜕𝑤2

𝑤2 𝑤1
Gradient Descent
This is the "learning" of machines in deep learning ……
Even AlphaGo uses this approach.
People imagine …… Actually …..
I hope you are not too disappointed :p


Backpropagation
• Backpropagation: an efficient way to compute ∂L/∂w
• Ref: http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/DNN%20backprop.ecm.mp4/index.html
  (developed by NTU student 周伯威)

Don't worry about ∂L/∂w, the toolkits will handle it.

Concluding
Remarks
Step 1: Step 2: Step 3: pick
define a set goodness of the best
of function function function

Deep Learning is so simple ……


Outline of Lecture I
Introduction of Deep Learning

Why Deep?

“Hello World” for Deep Learning


Deeper is Better?

Layer X Size   Word Error Rate (%)        Layer X Size   Word Error Rate (%)
1 X 2k         24.2
2 X 2k         20.4
3 X 2k         18.4
4 X 2k         17.8
5 X 2k         17.2                       1 X 3772       22.5
7 X 2k         17.1                       1 X 4634       22.6
                                          1 X 16k        22.1

Not surprised: more parameters, better performance.

Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks." Interspeech. 2011.
Universality Theorem
Any continuous function f : R^N → R^M can be realized by a network with one hidden layer (given enough hidden neurons).
Reference for the reason: http://neuralnetworksanddeeplearning.com/chap4.html

Why "deep" neural network, not "fat" neural network?

Fat + Short v.s. Thin + Tall
The same number of parameters: which one is better?
Shallow: x₁ x₂ …… x_N with one wide hidden layer
Deep: x₁ x₂ …… x_N with many narrow hidden layers
Fat + Short v.s. Thin + Tall

Layer X Size   Word Error Rate (%)        Layer X Size   Word Error Rate (%)
1 X 2k         24.2
2 X 2k         20.4                       Why?
3 X 2k         18.4
4 X 2k         17.8
5 X 2k         17.2                       1 X 3772       22.5
7 X 2k         17.1                       1 X 4634       22.6
                                          1 X 16k        22.1

Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks." Interspeech. 2011.
Analogy: Logic circuits vs. Neural network
• Logic circuits consist of gates; a neural network consists of neurons.
• Two layers of logic gates can represent any Boolean function; a network with one hidden layer can represent any continuous function.
• Using multiple layers of logic gates to build some functions is much simpler (fewer gates needed); likewise, using multiple layers of neurons to represent some functions is much simpler (fewer parameters, less data?).

This page is for EE background.


Modularization
• Deep → Modularization
Task: classify images into
  Classifier 1: girls with long hair
  Classifier 2: boys with long hair (few examples → weak)
  Classifier 3: girls with short hair
  Classifier 4: boys with short hair
Each basic classifier should have sufficient training examples.
n
• Deep → Modularization

Boy or Girl? 女女 v.s


. 男
Image Basic
Classifier
Long or
女 女
short? v.s
Classifiers for the .

attributes
Modularizatio
can be trained by little data
n
• Deep → Modularization
Classifier Girls with
1 long hair

Boy or Girl? Classifier Boys with


2 fine
Basic long Lhiattri le
Image
Classifier Classifier data with
Girls
Long or 3 short
short? hair
Classifier Boys with
Sharing by the 4 short hair
following classifiers
as module
Modularization
• Deep → Modularization → Less training data?
x₁ x₂ …… x_N feed into the layers of the network.
The modularization is automatically learned from data:
the most basic classifiers (layer 1) → use the 1st layer as module to build classifiers (layer 2) → use the 2nd layer as module ……
Reference: Zeiler, M. D., & Fergus, R. (2014). Visualizing and understanding convolutional networks. In Computer Vision–ECCV 2014 (pp. 818-833)

• Deep → Modularization
x1 ……
x2 ……
……

……
……

……
xN ……

The most basic Use 1st layer as module Use 2nd layer as
classifiers to build classifiers module ……
Outline of Lecture I
Introduction of Deep Learning

Why Deep?

“Hello World” for Deep Learning


Keras
Interface of TensorFlow or Theano:
• TensorFlow or Theano: very flexible, but need some effort to learn.
• Keras: easy to learn and use (still has some flexibility). You can modify it if you can write TensorFlow or Theano.
Keras
• François Chollet is the author of Keras.
• He currently works for Google as a deep learning
engineer and researcher.
• Keras means horn in Greek
• Documentation: http://keras.io/
• Example:
https://github.com/fchollet/keras/tree/master/exa
mples
Example
Application
• Handwriting Digit Recognition

Machine “1”

28 x 28

MNIST Data: http://yann.lecun.com/exdb/mnist/


“Hello world” for deep learning
Keras provides data sets loading function:
http://keras.io/datasets/
Keras
Network for the example: input 28x28 → fully connected layer of 500 → fully connected layer of 500 → softmax layer with outputs y₁, y₂, …, y₁₀.
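A minimal sketch of this network in Keras 1.x-style code, as used in this era of the library (newer Keras renames output_dim to units):

```python
from keras.models import Sequential
from keras.layers.core import Dense, Activation

model = Sequential()
# Layer 1: 28*28 = 784 inputs -> 500 hidden units
model.add(Dense(input_dim=28 * 28, output_dim=500))
model.add(Activation('sigmoid'))
# Layer 2: 500 -> 500
model.add(Dense(output_dim=500))
model.add(Activation('sigmoid'))
# Output layer: 500 -> 10 with softmax, giving y1 ... y10
model.add(Dense(output_dim=10))
model.add(Activation('softmax'))
```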
Keras
Step 3.1: Configuration - choose the loss and the optimization (e.g. gradient descent w ← w − η ∂L/∂w with learning rate 0.1).
Step 3.2: Find the optimal network parameters using the training data (images) and labels. (The details are explained in the next lecture.)
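Step 3.1 as a hedged sketch, continuing the model defined above (Keras 1.x-style; the original demo may use 'mse' here, and the "choosing proper loss" section later argues for cross entropy):

```python
from keras.optimizers import SGD

# Configuration: pick the loss and the optimizer (learning rate 0.1 as on the slide)
model.compile(loss='categorical_crossentropy',
              optimizer=SGD(lr=0.1),
              metrics=['accuracy'])
```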
Keras
Step 3.2: Find the optimal network parameters.
Training data: a numpy array of shape (number of training examples, 28 x 28 = 784).
Labels: a numpy array of shape (number of training examples, 10).
https://www.tensorflow.org/versions/r0.8/tutorials/mnist/beginners/index.html
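Step 3.2 as a sketch, assuming the (N, 784) / (N, 10) numpy arrays described above and the model compiled earlier (nb_epoch is the Keras 1.x spelling of epochs):

```python
from keras.datasets import mnist
from keras.utils import np_utils

# Load MNIST and reshape it into the numpy arrays described above
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(-1, 28 * 28).astype('float32') / 255
x_test = x_test.reshape(-1, 28 * 28).astype('float32') / 255
y_train = np_utils.to_categorical(y_train, 10)
y_test = np_utils.to_categorical(y_test, 10)

model.fit(x_train, y_train, batch_size=100, nb_epoch=20)
```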
Keras
Save and load models:
http://keras.io/getting-started/faq/#how-can-i-save-a-keras-model

How to use the neural network (testing):
case 1: you have labels and want the loss/accuracy on a test set
case 2: you only have inputs and want the network's outputs
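A sketch of the two testing cases, continuing the example above (the calls are the usual Keras evaluate/predict; the slide itself does not spell the code out):

```python
# Case 1: evaluate on a labeled test set
score = model.evaluate(x_test, y_test)
print('Total loss on testing set:', score[0])
print('Accuracy of testing set:', score[1])

# Case 2: predict the outputs for new inputs
result = model.predict(x_test)   # shape (N, 10): one score per digit
```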
Keras
• Using GPU to speed up training
• Way 1: THEANO_FLAGS=device=gpu0 python YourCode.py
• Way 2 (in your code): import os; os.environ["THEANO_FLAGS"] = "device=gpu0"
Tips for Training DNN

Recipe of Deep Learning
Step 1: define a set of function (Neural Network)
Step 2: goodness of function
Step 3: pick the best function
→ Good results on training data?  NO → go back and modify the three steps.
→ YES → Good results on testing data?  NO → overfitting!
→ YES → done.
Neural
Network
Do not always blame
Overfitting
Not well trained

Overfitting?

Training Data Testing Data


Recipe of Deep Learning
Different approaches for different problems:
e.g. dropout is for good results on testing data, not training data.
Recipe of Deep Learning
For good results on training data:
• Choosing proper loss
• Mini-batch
• New activation function
• Adaptive learning rate
• Momentum
Choosing Proper Loss
For input "1", the network (with softmax output) gives y₁ … y₁₀; the target ŷ is (1, 0, …, 0).
Which loss is better?
Square Error:   Σᵢ (yᵢ − ŷᵢ)²
Cross Entropy:  −Σᵢ ŷᵢ ln yᵢ
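A small numpy sketch comparing the two losses on one example (target "1", i.e. ŷ = (1, 0, …, 0)); the network output y here is made up for illustration:

```python
import numpy as np

y_hat = np.zeros(10); y_hat[0] = 1.0       # target for "1"
y = np.full(10, 0.02); y[0] = 0.82         # a made-up softmax output (sums to 1)

square_error  = np.sum((y - y_hat) ** 2)
cross_entropy = -np.sum(y_hat * np.log(y)) # only the target class contributes
print(square_error, cross_entropy)
```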
Let's try it
Testing accuracy: Square Error 0.11, Cross Entropy 0.84.
(Training accuracy curves for cross entropy and square error are compared in the figure.)
Choosing Proper Loss
When using a softmax output layer, choose cross entropy.
(The total-loss surface over w₁, w₂ is much flatter far from the target with square error, so cross entropy is easier to optimize.)
http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf
Recipe of Deep Learning
YES

Choosing proper loss Good Results on


Testing Data?
Mini-batch
YES
New activation function
Good Results on
Adaptive Learning Rate Training Data?

Momentum
We do not really minimize total loss!
Mini-batch
• Randomly initialize the network parameters
• Pick the 1st batch (e.g. x¹, x³¹, …): L′ = 𝑙¹ + 𝑙³¹ + ⋯ ; update parameters once
• Pick the 2nd batch (e.g. x², x¹⁶, …): L″ = 𝑙² + 𝑙¹⁶ + ⋯ ; update parameters once
• …
• Until all mini-batches have been picked → one epoch
Repeat the above process.
Mini-batch
• Pick the 1st batch: L′ = 𝑙¹ + 𝑙³¹ + ⋯ ; update parameters once
• Pick the 2nd batch: L″ = 𝑙² + 𝑙¹⁶ + ⋯ ; update parameters once
• Until all mini-batches have been picked → one epoch
E.g. 100 examples in a mini-batch, repeat for 20 epochs.
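In Keras the batch size and number of epochs are just arguments to fit, continuing the MNIST example above (a sketch; "no batch" means using the full training set as one batch):

```python
# Mini-batch: 100 examples per batch, 20 epochs (Keras 1.x spelling: nb_epoch)
model.fit(x_train, y_train, batch_size=100, nb_epoch=20)

# "No batch" (full-batch gradient descent): one update per epoch
# model.fit(x_train, y_train, batch_size=len(x_train), nb_epoch=20)
```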
We do not really minimize total loss!
Mini-batch: L is different each time we update the parameters, because each batch gives a different L′, L″, ….
Mini-batch
Original gradient descent vs. with mini-batch: the mini-batch trajectory is unstable!!! (The colors represent the total loss.)

Mini-batch is faster (not always true with parallel computing):
• Original gradient descent: update after seeing all examples (one update per epoch).
• With mini-batch: if there are 20 batches, update 20 times in one epoch, seeing only one batch per update.
One epoch can take the same time in both cases (for a not super large data set).
Mini-batch has better performance!
Testing accuracy: mini-batch 0.84, no batch 0.12.
(Training accuracy per epoch is also better with mini-batch.)
Shuffle the training examples for each epoch, so the mini-batches differ from epoch to epoch (e.g. x¹ is grouped with x³¹ in epoch 1 but with x¹⁷ in epoch 2).
Don't worry. This is the default of Keras.
Recipe of Deep Learning
YES

Choosing proper loss Good Results on


Testing Data?
Mini-batch
YES
New activation function
Good Results on
Adaptive Learning Rate Training Data?

Momentum
Hard to get the power of Deep
Results on training data: deeper usually does not imply better.
Let's try it: testing accuracy is 0.84 with 3 layers but 0.11 with 9 layers.
(The training accuracy of the 9-layer network is also poor.)
Vanishing Gradient Problem
x₁ … x_N → many layers → y₁ … y_M
Layers near the input: smaller gradients, learn very slowly, stay almost random.
Layers near the output: larger gradients, learn very fast, already converge - based on (almost) random lower layers!?
Vanishing Gradient Problem
Intuitive way to compute the derivatives: ∂𝑙/∂w ≈ Δ𝑙/Δw.
With sigmoid activations, a large change +Δw to a weight near the input is attenuated layer by layer, so the output (and the loss) changes only a little: +Δ𝑙 is small, giving small gradients near the input and larger gradients near the output.
Hard to get the power of Deep

In 2006, people used RBM pre-training.


In 2015, people use ReLU.
ReLU
• Rectified Linear Unit (ReLU): a = z if z > 0, a = 0 if z ≤ 0.
Reasons:
1. Fast to compute
2. Biological reason
3. Infinite sigmoids with different biases
4. Handles the vanishing gradient problem
[Xavier Glorot, AISTATS’11] [Andrew L. Maas, ICML’13] [Kaiming He, arXiv’15]
ReLU
Neurons with a = 0 can be removed from the network; the remaining ones (a = z) act linearly.
The result is a thinner linear network, which does not have smaller gradients.
Let's try it
• 9 layers. Testing accuracy: Sigmoid 0.11, ReLU 0.96.
(Training accuracy curves: ReLU trains; sigmoid does not.)
ReLU - variants
• Leaky ReLU: a = z for z > 0, a = 0.01z for z ≤ 0.
• Parametric ReLU: a = z for z > 0, a = αz for z ≤ 0, where α is also learned by gradient descent.
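A numpy sketch of these activation functions (α is a fixed constant here; a parametric ReLU would learn it by gradient descent):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z):
    return np.where(z > 0, z, 0.01 * z)

def parametric_relu(z, alpha):
    # alpha would be a learned parameter in a real parametric ReLU
    return np.where(z > 0, z, alpha * z)

z = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(z), leaky_relu(z), parametric_relu(z, 0.1))
```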
Maxout: ReLU is a special case of Maxout
• Learnable activation function [Ian J. Goodfellow, ICML’13]
Group the linear outputs and take the max of each group, e.g. with inputs x₁, x₂: the pair (5, 7) gives max = 7, (−1, 1) gives max = 1; in the next layer (1, 2) gives max = 2 and (4, 3) gives max = 4.
You can have more than 2 elements in a group.


Maxout: ReLU is a special case of Maxout
• Learnable activation function [Ian J. Goodfellow, ICML’13]
• The activation function in a maxout network can be any piecewise linear convex function.
• How many pieces depends on how many elements are in a group (2 elements in a group → 2 pieces; 3 elements in a group → 3 pieces).
Recipe of Deep Learning
YES

Choosing proper loss Good Results on


Testing Data?
Mini-batch
YES
New activation function
Good Results on
Adaptive Learning Rate Training Data?

Momentum
Learning Rates
Set the learning rate η carefully.
• If the learning rate is too large, the total loss may not decrease after each update.
• If the learning rate is too small, training would be too slow.
Learning
Rates
• Popular & Simple Idea: Reduce the learning rate by
some factor every few epochs.
• At the beginning, we are far from the destination, so we
use larger learning rate
• After several epochs, we are close to the destination, so
we reduce the learning rate
• E.g. 1/t decay: ηᵗ = η / (t + 1)
• Learning rate cannot be one-size-fits-all
• Giving different parameters different learning
rates
Adagrad
Original: w ← w − η ∂L/∂w
Adagrad: w ← w − η_w ∂L/∂w     (parameter-dependent learning rate)
η_w = η / √( Σ_{i=0}^{t} (gⁱ)² )     (η is a constant; gⁱ is the ∂L/∂w obtained at the i-th update)
Summation of the squares of the previous derivatives.
Adagrad
η_w = η / √( Σ_{i=0}^{t} (gⁱ)² )
Example: for w₁ the derivatives are g⁰ = 0.1, g¹ = 0.2, …; for w₂ they are g⁰ = 20.0, g¹ = 10.0, ….
Learning rate for w₁: η / √(0.1²), then η / √(0.1² + 0.2²), …
Learning rate for w₂: η / √(20²) = η / 20, then η / √(20² + 10²) ≈ η / 22, …
Observations:
1. The learning rate is smaller and smaller for all parameters.
2. Smaller derivatives → larger learning rate, and vice versa. Why?
Why? Larger derivatives (steep direction) → smaller learning rate; smaller derivatives (flat direction) → larger learning rate. Adagrad balances the step sizes across parameters.
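A minimal sketch of the Adagrad per-parameter learning rate, reproducing the numbers above:

```python
import numpy as np

def adagrad_rates(grads, eta=1.0):
    # Effective learning rate eta / sqrt(sum of squared past gradients)
    sum_sq = 0.0
    rates = []
    for g in grads:
        sum_sq += g ** 2
        rates.append(eta / np.sqrt(sum_sq))
    return rates

print(adagrad_rates([0.1, 0.2]))    # eta/0.1, then eta/sqrt(0.05) ~ eta/0.22
print(adagrad_rates([20.0, 10.0]))  # eta/20,  then eta/sqrt(500)  ~ eta/22.4
```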
Not the whole story
……
• Adagrad [John Duchi, JMLR’11]
• RMSprop
• https://www.youtube.com/watch?v=O3sxAc4hxZU

• Adadelta [Matthew D. Zeiler, arXiv’12]


• “No more pesky learning rates” [Tom Schaul, arXiv’12]
• AdaSecant [Caglar Gulcehre, arXiv’14]
• Adam [Diederik P. Kingma, ICLR’15]
• Nadam
• http://cs229.stanford.edu/proj2015/054_report.pdf
Recipe of Deep Learning
YES

Choosing proper loss Good Results on


Testing Data?
Mini-batch
YES
New activation function
Good Results on
Adaptive Learning Rate Training Data?

Momentum
Hard to find optimal network parameters
Total loss vs. the value of a network parameter w:
• Very slow at a plateau (∂L/∂w ≈ 0)
• Stuck at a saddle point (∂L/∂w = 0)
• Stuck at a local minimum (∂L/∂w = 0)

In physical world ……
• Momentum: a ball rolling down the loss surface keeps its momentum. How about putting this phenomenon into gradient descent?

Momentum
Movement = negative of ∂L/∂w + momentum (the movement of the last step).
Still no guarantee of reaching global minima, but it gives some hope: momentum can carry the parameters past a plateau or a point where ∂L/∂w = 0.
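A sketch of the momentum update (λ, the momentum coefficient, is an assumed hyperparameter, typically around 0.9; the toy gradient is made up):

```python
import numpy as np

def momentum_step(theta, grad, velocity, eta=0.01, lam=0.9):
    # movement = lam * previous movement - eta * gradient
    velocity = lam * velocity - eta * grad
    return theta + velocity, velocity

theta = np.zeros(2)
velocity = np.zeros(2)
for _ in range(100):
    grad = 2 * (theta - np.array([1.0, -2.0]))   # toy gradient
    theta, velocity = momentum_step(theta, grad, velocity)
print(theta)   # approaches (1, -2)
```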
Adam ≈ RMSProp (advanced Adagrad) + Momentum

Let's try it: ReLU, 3 layers.
Testing accuracy: original 0.96, Adam 0.97.
(Training also converges faster with Adam.)
Recipe of Deep Learning
For good results on testing data:
• Early stopping
• Regularization
• Dropout
• Network structure
Why Overfitting?
• Training data and testing data can be different.
The learning target is defined by the training data. The parameters achieving the learning target do not necessarily have good results on the testing data.
Panacea for Overfitting
• Have more training data
• Create more training data (?)
  Handwriting recognition: create new training data from the original, e.g. by shifting/rotating an image by 15°.
Why Overfitting?
• For experiments, we added some noise to the testing data.
Testing accuracy: clean 0.97, noisy 0.50.
Training is not influenced.
Recipe of Deep Learning
YES

Early Stopping Good Results on


Testing Data?

Weight Decay
YES

Dropout Good Results on


Training Data?

Network Structure
Early Stopping
Total loss vs. epochs: the training-set loss keeps decreasing, but the validation-set (and testing-set) loss eventually starts to increase; stop at the point where the validation loss is lowest.

Keras: http://keras.io/getting-started/faq/#how-can-i-interrupt-training-when-the-validation-loss-isnt-decreasing-anymore
Recipe of Deep Learning
YES

Early Stopping Good Results on


Testing Data?

Weight Decay
YES

Dropout Good Results on


Training Data?

Network Structure
Weight Decay
• Our brain prunes out the useless links between neurons.
Doing the same thing to the machine's brain improves the performance.
Weight Decay
Weight decay is one kind of regularization: useless weights shrink close to zero.
Weight Decay
• Implementation
Original:       w ← w − η ∂L/∂w
Weight decay:   w ← (1 − λ)w − η ∂L/∂w,  e.g. λ = 0.01, so each update multiplies w by 0.99.
Weights that are not pushed by the gradient become smaller and smaller.
Keras: http://keras.io/regularizers/
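In Keras 1.x-style code, an L2 penalty on a layer's weights plays the role of weight decay; a hedged sketch (the coefficient 0.01 is illustrative, and the layer is added to a model like the one built earlier):

```python
from keras.layers.core import Dense
from keras.regularizers import l2

# Adds lambda * sum(w^2) to the loss, which shrinks unused weights toward zero
model.add(Dense(output_dim=500, W_regularizer=l2(0.01)))
```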
Recipe of Deep Learning
YES

Early Stopping Good Results on


Testing Data?

Weight Decay
YES

Dropout Good Results on


Training Data?

Network Structure
Dropout
Training:
• Each time before updating the parameters, each neuron has p% chance to dropout.
• The structure of the network is changed (thinner!).
• Use the new, thinner network for training.
• For each mini-batch, we resample the dropout neurons.
Dropout
Testing:
• No dropout.
• If the dropout rate at training is p%, all the weights are multiplied by (1 − p)%.
• Assume the dropout rate is 50%: if a weight w = 1 after training, use w = 0.5 for testing.
Dropout - Intuitive Reason
("My partner will slack off, so I have to work hard.")
• When teaming up, if everyone expects the partner will do the work, nothing will be done finally.
• However, if you know your partner will dropout, you will do better.
• When testing, no one actually drops out, so good results are obtained eventually.
Dropout - Intuitive Reason
• Why should the weights multiply (1 − p)% (the dropout rate) when testing?
Training of dropout (dropout rate 50%): on average only half of w₁, w₂, w₃, w₄ contribute to z.
Testing of dropout (no dropout): using the trained weights directly gives z′ ≈ 2z; multiplying each weight by (1 − p)% = 0.5 restores z′ ≈ z.
Dropout is a kind of ensemble.
• Ensemble: split the training set into set 1 … set 4, train network 1 … network 4 (a bunch of networks with different structures); at testing time, feed the testing data x to all of them and average y₁ … y₄.
• Training of dropout: each mini-batch trains one of the 2^M possible thinned networks (M = number of neurons); some parameters are shared across these networks.
• Testing of dropout: we cannot average 2^M networks explicitly, but using the full network with all weights multiplied by (1 − p)% gives approximately the averaged output y.
More about dropout
• More reference for dropout [Nitish Srivastava, JMLR’14] [Pierre Baldi,
NIPS’13][Geoffrey E. Hinton, arXiv’12]
• Dropout works better with Maxout [Ian J. Goodfellow, ICML’13]
• Dropconnect [Li Wan, ICML’13]
• Dropout delete neurons
• Dropconnect deletes the connection between neurons
• Annealed dropout [S.J. Rennie, SLT’14]
• Dropout rate decreases by epochs
• Standout [J. Ba, NISP’13]
• Each neural has different dropout rate
Let's try it
Add dropout after each hidden layer of 500:
model.add( Dropout(0.8) )
(then the softmax output layer y₁ … y₁₀)
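The full model with dropout as a Keras 1.x-style sketch; Dropout(0.8) follows the slide, the ReLU activations are my choice, and in Keras the argument is the fraction of units dropped:

```python
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation

model = Sequential()
model.add(Dense(input_dim=28 * 28, output_dim=500))
model.add(Activation('relu'))
model.add(Dropout(0.8))          # drop 80% of this layer's outputs during training
model.add(Dense(output_dim=500))
model.add(Activation('relu'))
model.add(Dropout(0.8))
model.add(Dense(output_dim=10))
model.add(Activation('softmax'))
```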
Let's try it
Testing accuracy on the noisy test data: 0.50 without dropout → 0.63 with dropout.
(Training accuracy per epoch with and without dropout is compared in the figure.)
Recipe of Deep Learning
YES

Early Stopping Good Results on


Testing Data?

Regularization
YES

Dropout Good Results on


Training Data?

Network Structure
CNN is a very good example!
(next lecture)
Concluding
Remarks of
Lecture II
Recipe of Deep Learning
YES

Step 1: define a NO
Good Results on
set of function
Testing Data?

Step 2: goodness
of function YES

NO
Step 3: pick the Good Results on
Training Data?
best function

Neural
Network
Let's try another task: Document Classification
Input: word occurrences, e.g. "stock" in document, "president" in document, …
Output: category, e.g. politics (政治), economics (經濟), sports (體育), finance (財經).
http://top-breaking-news.com/
Data / Results

Testing accuracy:
MSE                                  0.36
Cross entropy (CE)                   0.55
+ ReLU                               0.75
+ Adam (adaptive learning rate)      0.77
+ dropout                            0.79
Lecture III:
Variants of
Neural Networks
Variants of Neural
Networks
Convolutional Neural
Network (CNN) Widely used in
image processing

Recurrent Neural Network


(RNN)
Why CNN for Image?
• When processing an image, the first layer of a fully connected network would be very large: a 100 x 100 x 3 image flattened into the input with a hidden layer of 1000 neurons already needs 3 x 10⁷ weights.
Can the fully connected network be simplified by considering the properties of images?
Why CNN for
Image
• Some patterns are much smaller than the whole
image
A neuron does not have to see the whole image
to discover the pattern.
Connecting to small region with less
parameters

“beak” detector
Why CNN for Image
• The same patterns appear in different regions.
An "upper-left beak" detector and a "middle beak" detector do almost the same thing; they can use the same set of parameters.
Why CNN for
Image
• Subsampling the pixels will not change
the object bird
bird

subsampling

We can subsample the pixels to make image smaller


Less parameters for the network to process the image
Three Steps for Deep Learning
Step 1: define a set of function (Convolutional Neural Network)
Step 2: goodness of function
Step 3: pick the best function

Deep Learning is so simple ……
The whole CNN
Input image → Convolution → Max Pooling → Convolution → Max Pooling (can repeat many times) → Flatten → Fully connected feedforward network → "cat", "dog", ……
The whole CNN
Property 1: some patterns are much smaller than the whole image → Convolution
Property 2: the same patterns appear in different regions → Convolution
Property 3: subsampling the pixels will not change the object → Max Pooling
(Convolution → Max Pooling can repeat many times, then Flatten.)
The whole
CNNcat dog ……
Convolution

Max Pooling
Can repeat
Fully Connected many times
Feedforward network Convolution

Max Pooling

Flatten
CNN – Convolution
Those are the network parameters to be learned.

6 x 6 image:
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0

Filter 1 (matrix):        Filter 2 (matrix):
 1 -1 -1                  -1  1 -1
-1  1 -1                  -1  1 -1
-1 -1  1                  -1  1 -1
……
Each filter detects a small pattern (3 x 3).  (Property 1)
CNN – Convolution (Filter 1, stride = 1)
Slide the 3 x 3 filter over the 6 x 6 image; the first two inner products are 3 and −1.
CNN – Convolution (Filter 1)
If stride = 2, the filter moves 2 pixels at a time; the first two values are 3 and −3.
We set stride = 1 below.
CNN – Convolution (Filter 1, stride = 1)
The 6 x 6 image convolved with Filter 1 gives a 4 x 4 map:
 3 -1 -3 -1
-3  1  0 -3
-3 -3  0  1
 3 -2 -2 -1
(Property 2: the two 3's show the same pattern detected at two different positions.)
CNN – Convolution (Filter 2, stride = 1)
Do the same process for every filter. Filter 2 gives another 4 x 4 map:
-1 -1 -1 -1
-1 -1 -2  1
-1 -1 -2  1
-1  0 -4  3
Together the maps form the Feature Map (two 4 x 4 images).
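A minimal numpy sketch of this convolution step, reproducing the Filter 1 map above from the 6 x 6 image:

```python
import numpy as np

image = np.array([[1,0,0,0,0,1],
                  [0,1,0,0,1,0],
                  [0,0,1,1,0,0],
                  [1,0,0,0,1,0],
                  [0,1,0,0,1,0],
                  [0,0,1,0,1,0]])
filter1 = np.array([[ 1,-1,-1],
                    [-1, 1,-1],
                    [-1,-1, 1]])

def convolve(img, filt, stride=1):
    k = filt.shape[0]
    out_size = (img.shape[0] - k) // stride + 1
    out = np.zeros((out_size, out_size), dtype=int)
    for i in range(out_size):
        for j in range(out_size):
            patch = img[i*stride:i*stride+k, j*stride:j*stride+k]
            out[i, j] = np.sum(patch * filt)   # inner product of patch and filter
    return out

print(convolve(image, filter1))   # the 4 x 4 map starting with 3, -1, -3, -1
```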
CNN – Zero Padding (Filter 1)
Pad the 6 x 6 image with zeros around the border; convolving then gives another 6 x 6 image (the same size as the input).
CNN – Colorful image
A colorful image has 3 channels (a 6 x 6 x 3 array), so each filter is also a 3 x 3 x 3 cube; convolution works the same way, summing over the channels.
The whole
CNNcat dog ……
Convolution

Max Pooling
Can repeat
Fully Connected many times
Feedforward network Convolution

Max Pooling

Flatten
CNN – Max Pooling
Take the two 4 x 4 maps from Filter 1 and Filter 2:
Filter 1:                 Filter 2:
 3 -1 -3 -1               -1 -1 -1 -1
-3  1  0 -3               -1 -1 -2  1
-3 -3  0  1               -1 -1 -2  1
 3 -2 -2 -1               -1  0 -4  3
Group the values 2 x 2 and keep the maximum of each group.
CNN – Max Pooling
The 6 x 6 image becomes a new, smaller image: after convolution and 2 x 2 max pooling, each filter gives a 2 x 2 map, e.g. Filter 1 → (3 0 / 3 1).
Each filter contributes one channel of the new image.
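And the 2 x 2 max pooling step, continuing from the convolve sketch above:

```python
def max_pool(fmap, size=2):
    h, w = fmap.shape
    out = np.zeros((h // size, w // size), dtype=fmap.dtype)
    for i in range(0, h, size):
        for j in range(0, w, size):
            # keep the maximum of each size x size block
            out[i // size, j // size] = fmap[i:i+size, j:j+size].max()
    return out

print(max_pool(convolve(image, filter1)))   # [[3, 0], [3, 1]]
```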
The whole CNN
After convolution + max pooling we get a new image: smaller than the original, and the number of channels equals the number of filters (here 2 channels, each 2 x 2).
Convolution → Max Pooling can repeat many times.
The whole CNN
… → Max Pooling → a new image → Flatten → Fully connected feedforward network

Flatten: lay out all values of the final feature maps (e.g. 3, 0, 1, 3, …) as one long vector and feed it into the fully connected network.
The whole
CNN
Convolution

Max Pooling
Can repeat
many times
Convolution

Max Pooling
Convolution followed by max pooling can itself be viewed as a network: the 6 x 6 image and the two 3 x 3 filters from before go through convolution, then max pooling.
(Ignoring the non-linear activation function after the convolution.)
Convolution as a sparsely connected layer:
Flatten the 6 x 6 image into inputs 1 … 36. The neuron computing the first output of Filter 1 (the value 3) connects only to 9 of them (pixels 1, 2, 3, 7, 8, 9, 13, 14, 15), with the filter entries as its weights.
Less parameters! Only connected to 9 inputs, not fully connected.
The neurons computing different outputs of the same filter (e.g. the values 3 and −1) use the same 9 weights: shared weights.
Less parameters! Shared weights → even less parameters!
(Convolution + max pooling as a network, continued: the 4 x 4 map
 3 -1 -3 -1 / -3 1 0 -3 / -3 -3 0 1 / 3 -2 -2 -1
is max-pooled to the 2 x 2 map 3 0 / 3 1.)
Parameter count:
A fully connected layer from the flattened image (dim = 6 x 6 = 36) to the 4 x 4 x 2 = 32 convolution outputs would need 36 x 32 = 1152 parameters.
Convolution needs only 9 x 2 = 18 parameters (two 3 x 3 filters).
Convolutional Neural Network
Step 1: define a set of function (Convolutional Neural Network)
Step 2: goodness of function
Step 3: pick the best function

CNN( image of "2" ) compared against the target, e.g. "monkey" 0, "cat" 1, "dog" 0.
Convolution, max pooling, fully connected layers.
Learning: nothing special, just gradient descent ……
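A hedged Keras 1.x-style sketch of such a CNN for the digit task; the filter counts and sizes here are my choices for illustration, not values given by the lecture:

```python
from keras.models import Sequential
from keras.layers import Convolution2D, MaxPooling2D, Flatten, Dense, Activation

model = Sequential()
# Convolution: 25 filters of size 3x3 over a 1-channel 28x28 image (channels-first)
model.add(Convolution2D(25, 3, 3, input_shape=(1, 28, 28)))
model.add(MaxPooling2D((2, 2)))
# Repeat convolution + max pooling
model.add(Convolution2D(50, 3, 3))
model.add(MaxPooling2D((2, 2)))
# Flatten, then a fully connected feedforward network
model.add(Flatten())
model.add(Dense(output_dim=100))
model.add(Activation('relu'))
model.add(Dense(output_dim=10))
model.add(Activation('softmax'))
```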
Playing Go
Network: input the board as a 19 x 19 matrix (black: 1, white: −1, none: 0), viewed as a 19 x 19 image; output the next move as a 19 x 19 vector of positions.
A fully connected feedforward network can be used, but CNN performs much better.

Training: records of previous plays, e.g. 進藤光 v.s. 社清春:
Black plays 5-5 → target: the "5-5" output = 1, else = 0.
White plays tengen (天元) → target: the "tengen" output = 1, else = 0.
Why CNN for playing Go?
• Some patterns are much smaller than the whole image: AlphaGo uses 5 x 5 filters for the first layer.
• The same patterns appear in different regions.
• Subsampling the pixels will not change the object → Max Pooling? How to explain this for Go???
AlphaGo does not use max pooling ……
Variants of Neural
Networks
Convolutional Neural
Network (CNN)

Recurrent Neural Network


(RNN) Neural Network with Memory
Example
Application
• Slot Filling

I would like to arrive Taipei on November 2nd.

ticket booking system

Destination: Taipei
Slot
time of arrival: November 2nd
Example
Application
Solving slot filling by
y1 y2
Feedforward network?
Input: a word
(Each word is represented
as a vector)

Taipei x1 x2
1-of-N
encoding
How to represent each word as a vector?
1-of-N Encoding lexicon = {apple, bag, cat, dog, elephant}
The vector is lexicon size. apple = [ 1 0 0 0 0]
Each dimension corresponds bag = [ 0 1 0 0 0]
to a word in the lexicon cat = [ 0 0 1 0 0]

The dimension for the word dog = [ 0 0 0 1 0]


is 1, and others are 0 elephant = [ 0 0 0 0 1]
Beyond 1-of-N encoding
• Dimension for "Other": add an extra dimension so that unknown words (e.g. "Gandalf", "Sauron") map to "other".
• Word hashing: represent a word by its letter tri-grams (26 x 26 x 26 dimensions), e.g. "apple" → a-p-p, p-p-l, p-l-e set to 1.
Example Application
Solving slot filling with a feedforward network?
Input: a word (each word is represented as a vector).
Output: the probability distribution that the input word belongs to each slot (destination, time of departure, other).

Problem?
"arrive Taipei on November 2nd"  →  other  dest  other  time  time
"leave Taipei on November 2nd"   →  here Taipei should be place of departure.
With the same input word "Taipei", the feedforward network always gives the same output.
The neural network needs memory!
Three Steps for Deep Learning
Step 1: define a set of function (Recurrent Neural Network)
Step 2: goodness of function
Step 3: pick the best function

Deep Learning is so simple ……
Recurrent Neural Network (RNN)
The outputs of the hidden layer are stored in the memory.
The memory can be considered as another input (a₁, a₂ feed back in together with x₁, x₂).
RNN
The same network is used again and again:
x¹ = "arrive" → y¹ = probability of "arrive" in each slot; store a¹.
x² = "Taipei" → (with a¹) y² = probability of "Taipei" in each slot; store a².
x³ = "on" → (with a²) y³ = probability of "on" in each slot; and so on for "November", "2nd".
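A minimal numpy sketch of this recurrence. The weight matrices are random placeholders and the tanh is my choice; the point is only that the hidden state a carries over from word to word:

```python
import numpy as np

def step(x, a_prev, Wx, Wa, Wy):
    # New hidden state from the current input and the stored memory, then the output
    a = np.tanh(Wx.dot(x) + Wa.dot(a_prev))
    y = Wy.dot(a)
    return a, y

rng = np.random.RandomState(0)
dim_x, dim_a, dim_y = 4, 3, 2
Wx, Wa, Wy = rng.randn(dim_a, dim_x), rng.randn(dim_a, dim_a), rng.randn(dim_y, dim_a)

a = np.zeros(dim_a)                               # memory starts at zero
for x in [rng.randn(dim_x) for _ in range(5)]:    # e.g. "arrive Taipei on November 2nd"
    a, y = step(x, a, Wx, Wa, Wy)                 # y would go through softmax -> slot probabilities
```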

RNN
Different inputs give different memories: for "leave Taipei" vs. "arrive Taipei", the values stored in the memory are different, so the probabilities for "Taipei" differ even though the word is the same.
Of course it can be deep: stack several recurrent hidden layers, producing yᵗ, yᵗ⁺¹, yᵗ⁺² from xᵗ, xᵗ⁺¹, xᵗ⁺².
Bidirectional RNN
Run one RNN forward over xᵗ, xᵗ⁺¹, xᵗ⁺² and another backward; the output yᵗ combines both directions.
Long Short-term Memory (LSTM)
Special neuron: 4 inputs, 1 output.
• Input gate: a signal (from another part of the network) controls whether the input can enter the memory cell.
• Forget gate: a signal controls whether the memory is kept or erased.
• Output gate: a signal controls whether the stored value is sent to the rest of the network.

LSTM cell:
c′ = g(z) f(zᵢ) + c f(z_f)
a = h(c′) f(z_o)
where z is the input, zᵢ, z_f, z_o are the gate signals, c is the old memory value, and the activation function f is usually a sigmoid (between 0 and 1, mimicking an open or closed gate).
(The figure works through a numerical example of the LSTM cell with inputs 3 and −3 and gate signals around ±10, where the sigmoid gates saturate to ≈0 or ≈1.)
LSTM
In practice the cells form a vector cᵗ⁻¹; the input xᵗ is multiplied by four weight matrices to give the four vectors z, zⁱ, z^f, z^o that drive all the cells at once: z^f gates the old memory, zⁱ gates the new input z, and z^o gates the output yᵗ.
Extension: "peephole" - the gate inputs also look at cᵗ⁻¹ and hᵗ⁻¹ in addition to xᵗ.
Multiple-layer LSTM: stack LSTM layers over time.
Multiple-layer LSTM
Don't worry if you cannot understand this. Keras can handle it.
Keras supports the "LSTM", "GRU", and "SimpleRNN" layers.
This is quite standard now.
https://img.komicolle.org/2015-09-20/src/14426967627131.gif
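A hedged Keras 1.x-style sketch of a recurrent model, modeled on the imdb_lstm example linked later (many-to-one sentiment classification); the layer sizes are my choices:

```python
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense, Activation

model = Sequential()
model.add(Embedding(20000, 128))   # word index -> 128-dim word vector
model.add(LSTM(128))               # could be GRU(128) or SimpleRNN(128) instead
model.add(Dense(1))
model.add(Activation('sigmoid'))   # e.g. positive / negative sentiment
model.compile(loss='binary_crossentropy', optimizer='adam')
```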
Three Steps for Deep
Learning
Step 1: Step 2: Step 3: pick
define a set goodness of the best
of function function function

Deep Learning is so simple ……


Learning Target
Training sentence: "arrive Taipei on November 2nd" with slots other, dest, other, time, time.
The target at each time step is a 1-of-N slot vector (0 … 1 … 0) with the 1 at the correct slot.
The memories are copied forward (a¹ → a² → a³) and the same weights Wᵢ are used at every step.
Three Steps for Deep
Learning
Step 1: Step 2: Step 3: pick
define a set goodness of the best
of function function function

Deep Learning is so simple ……


Learning
Backpropagation through time (BPTT): the same update w ← w − η ∂L/∂w, with gradients propagated back through the copied memories.
RNN learning is very difficult in practice.


(Thanks to 曾柏翔 for providing the experimental results.)

Unfortunately ……
• RNN-based networks are not always easy to learn.
Real experiments on language modeling: the total loss over epochs is sometimes very unstable; only a lucky run converges smoothly.

The error surface is rough: it is either very flat or very steep. Clipping the gradient when it gets too large helps. [Razvan Pascanu, ICML’13]


Why?
Toy example: a linear RNN with 1000 time steps, one weight w, input 1 at the first step and 0 afterwards, so y¹⁰⁰⁰ = w⁹⁹⁹.
w = 1     → y¹⁰⁰⁰ = 1
w = 1.01  → y¹⁰⁰⁰ ≈ 20000    large ∂L/∂w → small learning rate?
w = 0.99  → y¹⁰⁰⁰ ≈ 0        small ∂L/∂w → large learning rate?
w = 0.01  → y¹⁰⁰⁰ ≈ 0
The same parameter is used over and over, so the gradient can explode or vanish.
Helpful Techniques
• Long Short-term Memory (LSTM)
  • Can deal with gradient vanishing (not gradient explosion).
  • Memory and input are added; the influence never disappears unless the forget gate is closed → no gradient vanishing (if the forget gate is opened).
• Gated Recurrent Unit (GRU): simpler than LSTM [Cho, EMNLP’14]
Helpful Techniques
• Clockwise RNN [Jan Koutnik, JMLR’14]
• Structurally Constrained Recurrent Network (SCRN) [Tomas Mikolov, ICLR’15]
• Vanilla RNN initialized with the identity matrix + ReLU activation function [Quoc V. Le, arXiv’15]
  • Outperforms or is comparable with LSTM in 4 different tasks
More Applications ……
In the slot-filling example, the input and output are both sequences with the same length.
RNN can do more than that!
Many to one
Keras example: https://github.com/fchollet/keras/blob/master/examples/imdb_lstm.py
• Input is a vector sequence, but the output is only one vector.
Sentiment analysis of movie reviews, e.g. "看了這部電影覺得很高興" (watching this movie made me happy) → positive; "這部電影太糟了" (this movie is terrible) → negative; "這部電影很棒" (this movie is great) → positive. Finer categories are also possible (very positive / positive / neutral / negative / very negative).
The RNN reads the review word by word ("我 覺 得 …… 太 糟 了") and outputs the sentiment at the end.
Many to Many (Output is shorter)
• Both input and output are sequences, but the output is shorter.
• E.g. speech recognition: the input is an acoustic vector sequence, the output is the character sequence "好棒".
Trimming (merging duplicates) alone has a problem: "好好好棒棒棒棒棒" would always collapse to "好棒", so why can't the output be "好棒棒"?
Many to Many (Output is
shorter)
• Both input and output are both sequences, but the output
is shorter.
• Connectionist Temporal Classification (CTC) [Alex Graves,
ICML’06][Alex Graves, ICML’14][Haşim Sak, Interspeech’15][Jie Li,
Interspeech’15][Andrew Senior, ASRU’15]

“好 Add an extra symbol “φ” “ 好棒


棒” representing “null” 棒”

好 φ φ 棒 φ φ φ φ 好 φ φ 棒 φ 棒 φ φ
Many to Many (No
Limitation)
• Both input and output are both sequences with different
lengths. → Sequence to sequence learning
• E.g. Machine Translation (machine learning→ 機器學
習)

Containing all
machine

learning

information about
input sequence
Many to Many (No
Limitation)
• Both input and output are both sequences with different
lengths. → Sequence to sequence learning
• E.g. Machine Translation (machine learning→ 機器學
習)
機 器 學 ……
習 慣 性

……
machine

learning

Don’t know when to stop


Many to Many (No
Limitation)

推 tlkagk: ========= 斷 ==========


Ref:http://zh.pttpedia.wikia.com/wiki/%E6%8E%A5%E9%BE%8D%
E6%8E%A8%E6%96%87 ( 鄉民百科 )
Many to Many (No
Limitation)
• Both input and output are both sequences with different
lengths. → Sequence to sequence learning
• E.g. Machine Translation (machine learning→ 機器學
習)
機 學 習

===

machine

learning

Add a symbol “===“ ( 斷 )


[Ilya Sutskever, NIPS’14][Dzmitry Bahdanau, arXiv’15]
One to
Many
• Input an image, but output a sequence of words
[Kelvin Xu, arXiv’15][Li Yao, ICCV’15]
A vector
for whole
a woman is
===
image

CNN ……

Input
image Caption Generation
Application:
Video Caption
Generation
A girl is running.

Video

A group of people is A group of people is


knocked by a tree. walking in the forest.
Video Caption
Generation
• Can machine describe what it see from video?
• Demo: 曾柏翔、吳柏瑜、盧宏宗
Concluding
Remarks
Convolutional Neural
Network (CNN)

Recurrent Neural Network


(RNN)
Lecture
IV: Next
Wave
Outlin
e
Supervised Learning
• Ultra Deep Network
• Attention Model New network structure

Reinforcement Learning

Unsupervised Learning
• Image: Realizing what the World Looks Like
• Text: Understanding the Meaning of Words
• Audio: Learning human language without supervision
Skyscrape
r

https://zh.wikipedia.org/wiki/%E9%9B%99%E5%B3%B0%E5%A1%94#/me
dia/File:BurjDubaiHeight.svg
Ultra Deep Network
(http://cs231n.stanford.edu/slides/winter1516_lecture8.pdf)
AlexNet (2012): 8 layers, 16.4% error
VGG (2014): 19 layers, 7.3% error
GoogleNet (2014): 22 layers, 6.7% error
Residual Net (2015): 152 layers, 3.57% error (the figure compares its depth to Taipei 101)
Ultra Deep Network
152 layers: worry about overfitting? Worry about training first!
This ultra deep network has a special structure.
(AlexNet 16.4% → VGG 7.3% → GoogleNet 6.7% → Residual Net 3.57%)
Ultra Deep Network
• An ultra deep network is the ensemble of many networks with different depth (e.g. 6 layers, 4 layers, 2 layers).
• FractalNet, ResNet in ResNet, good initialization?
Ultra Deep Network
• Highway Network: a gate controller decides whether each layer transforms its input or simply copies it to the output, so the network automatically determines the layers needed!
Outlin
e
Supervised Learning
• Ultra Deep Network
• Attention Model New network structure

Reinforcement Learning

Unsupervised Learning
• Image: Realizing what the World Looks Like
• Text: Understanding the Meaning of Words
• Audio: Learning human language without supervision
Attention-based Model
The brain does not search all memories equally: asked "What is deep learning?", you attend to what you learned in these lectures, not to lunch today or the summer vacation 10 years ago; answering means attending to the relevant memory and organizing it.
http://henrylo1605.blogspot.tw/2015/05/blog-post_56.html
Attention-based Model
Input DNN/RNN output

Reading Head
Controller

Reading Head

…… ……
Machine’s Memory
Ref:
http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/Attain%20(v3).e
cm.mp4/index.html
Attention-based Model v2

Input DNN/RNN output

Reading Head Writing Head


Controller Controller

Writing Head Reading Head

…… ……
Machine’s Memory

Neural Turing Machine


Reading
Comprehension
Query DNN/RNN answer

Reading Head
Controller

Semantic
Analysis
…… ……

Each sentence becomes a vector.


Reading
Comprehension
• End-To-End Memory Networks. S. Sukhbaatar, A. Szlam, J.
Weston, R. Fergus. NIPS, 2015.
The position of reading head:

Keras has example:


https://github.com/fchollet/keras/blob/master/examples/ba
bi_memnn.py
Visual Question
Answering

source: http://visualqa.org/
Visual Question
Answering
Query DNN/RNN answer

Reading Head
Controller

CNN A vector for


each
region
Visual Question
Answering
• Huijuan Xu, Kate Saenko. Ask, Attend and Answer: Exploring
Question-Guided Spatial Attention for Visual Question
Answering. arXiv Pre-Print, 2015
Speech Question
Answering
• TOEFL Listening Comprehension Test by Machine
• Example:
Audio Story: (The original story is 5 min long.)
Question: “ What is a possible origin of Venus’ clouds? ”
Choices:
(A) gases released as a result of volcanic activity
(B) chemical reactions caused by high surface
temperatures
(C) bursts of radio energy from the planet's surface
(D) strong winds that blow dust into the atmosphere
Experimental setup: 717 examples for training, 124 for validation, 122 for testing.
Simple baselines - naive approaches (1) to (7), e.g. (2) select the shortest choice as the answer, or (4) select the choice semantically most similar to the others - perform close to random guessing.
Everything is learned
Model Architecture from training examples

…… It be quite possible that this be


Answer due to volcanic eruption because
volcanic eruption often emit gas. If
Attention that be the case volcanism could very
Select the choice most well be the root cause of Venus 's thick
similar to the answer cloud cover. And also we have observe
burst of radio energy from the planet
Attention
Question 's surface. These burst be similar to
Semantics what we see when volcano erupt on
earth ……

Semantic Speech Semantic


Analysis Recognition Analysis

Audio Story:
Question: “what is a possible
origin of Venus‘ clouds?"
Model Architecture: word-based attention.
Model Architecture: sentence-based attention.
Supervised Learning
Memory Network (proposed by the FB AI group): 39.2% accuracy, well above the naive approaches (1)-(7).
[Tseng & Lee, Interspeech 16] [Fang & Hsu & Lee, SLT 16]
Word-based attention: 48.8% accuracy.
Outlin
e
Supervised Learning
• Ultra Deep Network
• Attention Model New network structure

Reinforcement Learning

Unsupervised Learning
• Image: Realizing what the World Looks Like
• Text: Understanding the Meaning of Words
• Audio: Learning human language without supervision
Scenario of Reinforcement Learning
An agent observes the environment, takes an action, and receives a reward from the environment (e.g. the person says "Don't do that" → negative reward; "Thank you." → positive reward).
The agent learns to take actions that maximize expected reward.
http://www.sznews.com/news/content/2013-11/26/content_8800180.ht
Supervised v.s. Reinforcement
• Supervised: learning from a teacher - "Hello" → say "Hi"; "Bye bye" → say "Good bye".
• Reinforcement: learning from critics - after a whole dialogue ("Hello" …… ……), the agent only learns that the outcome was bad.
Scenario of Reinforcement Learning (playing Go)
The agent learns to take actions to maximize expected reward.
Observation: the board. Action: the next move.
Reward: if win, reward = 1; if loss, reward = −1; otherwise, reward = 0.
Supervised v.s. Reinforcement
• Supervised: told the next move for each board position ("5-5", "3-3") from human game records.
• Reinforcement Learning: play from the first move through many moves until the game is won or lost, and learn from that outcome.
AlphaGo is supervised learning + reinforcement learning.
Difficulties of
Reinforcement Learning
• It may be better to sacrifice immediate reward to
gain more long-term reward
• E.g. Playing Go
• Agent’s actions affect the subsequent data it
receives
• E.g. Exploration
Deep Reinforcement Learning
Use a DNN as the function: observation (function input) → action (function output); the reward is used to pick the best function.
Application: Interactive
Retrieval
• Interactive retrieval is helpful. [Wu & Lee, INTERSPEECH 16]

“Deep Learning”

user

“Deep Learning” related to Machine


Learning? “Deep Learning” related to
Education?
Deep Reinforcement Learning
• Different network depths: some depth is needed - deeper networks give better retrieval performance with less user labor (fewer interactions). The task cannot be addressed by a linear model.
• Alpha Go, Playing Video Games, Dialogue
• Flying Helicopter
• https://www.youtube.com/watch?v=0JL04JJjocc
• Driving
• https://www.youtube.com/watch?v=0xo1Ldx3L
5Q
• Google Cuts Its Giant Electricity Bill With
DeepMind-Powered AI
• http://www.bloomberg.com/news/articles/2016-07-
19/google-cuts-its-giant-electricity-bill-with-deepmind-
powered-ai
To learn deep reinforcement
learning ……
• Lectures of David Silver
• http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Te
aching.html
• 10 lectures (1:30 each)
• Deep Reinforcement Learning
• http://videolectures.net/rldm2015_silver_reinfo
rcement_learning/
Outlin
e
Supervised Learning
• Ultra Deep Network
• Attention Model New network structure

Reinforcement Learning

Unsupervised Learning
• Image: Realizing what the World Looks Like
• Text: Understanding the Meaning of Words
• Audio: Learning human language without supervision
Does machine know what the world looks like?
Ref: https://openai.com/blog/generative-models/
Draw something!
Deep
Dream
• Given a photo, machine adds what it sees ……

http://deepdreamgenerator.com/
Deep
Dream
• Given a photo, machine adds what it sees ……

http://deepdreamgenerator.com/
Deep
Style
• Given a photo, make its style like famous paintings

https://dreamscopeapp.com/
Deep
Style
• Given a photo, make its style like famous paintings

https://dreamscopeapp.com/
Deep Style
A CNN extracts the content representation of the photo and the style representation of the painting; then we find an image whose content matches the photo and whose style matches the painting.
Generating Images by RNN
Generate an image pixel by pixel: the color of the 1st pixel → the color of the 2nd pixel → the color of the 3rd pixel → …
• Pixel Recurrent Neural Networks: https://arxiv.org/abs/1601.06759
(Example: given part of a real-world image, the model completes the rest.)
Generating Images
• Training a decoder to generate images is unsupervised: the training data is just a lot of images (code → NN decoder → image), but the code for each image is unknown.
Auto-encoder: train an NN encoder (image → code) and an NN decoder (code → image) together so that the output is as close as possible to the input; the code is the bottleneck layer between the encoder and decoder layers.
(Not the state-of-the-art approach.)
Generating Images
• Training a decoder to generate images is
unsupervised
• Variation Auto-encoder (VAE)
• Ref: Auto-Encoding Variational Bayes,
https://arxiv.org/abs/1312.6114
• Generative Adversarial Network (GAN)
• Ref: Generative Adversarial Networks,
http://arxiv.org/abs/1406.2661

code NN
Decoder
Which one is machine-generated?
Ref: https://openai.com/blog/generative-models/
Drawing comics!!! https://github.com/mattya/chainer-DCGAN
Outlin
e
Supervised Learning
• Ultra Deep Network
• Attention Model New network structure

Reinforcement Learning

Unsupervised Learning
• Image: Realizing what the World Looks Like
• Text: Understanding the Meaning of Words
• Audio: Learning human language without supervision
Machine
Reading
• Machine learn the meaning of words from reading
a lot of documents without supervision

http://top-breaking-news.com/
Machine
Reading
• Machine learn the meaning of words from reading
a lot of documents without supervision

Word Vector / Embedding


tree
flower
do rabbit
run g
cat
jump
Machine
Reading
• Generating Word Vector/Embedding is
unsupervised
Apple Training data is a lot of text

Neural Network

?
https://garavato.files.wordpress.com/2011/11/stacksdocuments.jpg?w=490
Machine Reading
• Machines learn the meaning of words from reading a lot of documents without supervision.
• A word can be understood by its context: "You shall know a word by the company it keeps."
E.g. "馬英九 520 宣誓就職" (馬英九 was sworn in on 5/20) and "蔡英文 520 宣誓就職" (蔡英文 was sworn in on 5/20) suggest that 蔡英文 and 馬英九 are something very similar.
Word Vector

Source: http://www.slideshare.net/hustwj/cikm-keynotenov2014
Word Vector
• Characteristics:
V(Germany) ≈ V(Berlin) − V(Rome) + V(Italy)
V(hotter) − V(hot) ≈ V(bigger) − V(big)
V(Rome) − V(Italy) ≈ V(Berlin) − V(Germany)
V(king) − V(queen) ≈ V(uncle) − V(aunt)
• Solving analogies:
Rome : Italy = Berlin : ?
Compute V(Berlin) − V(Rome) + V(Italy), then find the word w with the closest V(w).
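A numpy sketch of the analogy computation; the 3-dimensional toy vectors are made up for illustration (real word vectors have hundreds of dimensions):

```python
import numpy as np

V = {
    'Rome':    np.array([0.9, 0.1, 0.3]),   # made-up toy vectors
    'Italy':   np.array([0.8, 0.2, 0.7]),
    'Berlin':  np.array([0.1, 0.9, 0.3]),
    'Germany': np.array([0.0, 1.0, 0.7]),
}

def closest(v, vocab):
    # The word whose vector has the highest cosine similarity to v
    return max(vocab, key=lambda w: np.dot(v, vocab[w]) /
               (np.linalg.norm(v) * np.linalg.norm(vocab[w])))

target = V['Berlin'] - V['Rome'] + V['Italy']   # Rome : Italy = Berlin : ?
print(closest(target, V))                       # 'Germany' with these toy vectors
```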
Machine
Reading
• Machine learn the meaning of words from reading
a lot of documents without supervision
Dem
o
• Model used in demo is provided by 陳仰德
• Part of the project done by 陳仰德、林資偉
• TA: 劉元銘
• Training data is from PTT (collected by 葉青
峰)

Outlin
e
Supervised Learning
• Ultra Deep Network
• Attention Model New network structure

Reinforcement Learning

Unsupervised Learning
• Image: Realizing what the World Looks Like
• Text: Understanding the Meaning of Words
• Audio: Learning human language without supervision
Learning from Audio
Book
Machine does not have
any prior knowledge

Machine listens to lots of


audio book

Like an infant

[Chung, Interspeech 16)


Audio Word to
Vector
• Audio segment corresponding to an unknown word
Fixed-length vector
Audio Word to
Vector
• The audio segments corresponding to words with
similar pronunciations are close to each other.
dog
never
dog
never

dogs
never

ever ever
Sequence-to-sequence
Auto-encoder
vector
audio segment

RNN Encoder The values in the memory


represent the whole
audio segment
The vector we want

How to train RNN Encoder?

x1 x2 x3 x4 acoustic features
audio segment
Sequence-to-sequence
Input acoustic features
Auto-encoder
x3 x4
x1 x2
The RNN encoder and
decoder are jointly trained.
y1 y2 y3 y4
RNN Encoder

RNN Decoder
x1 x2 x3 x4 acoustic features
audio segment
Audio Word to Vector - Results
• Visualizing the embedding vectors of the words: words with similar pronunciations (e.g. fear / near, fame / name) are close to each other.
WaveNet
(DeepMind)

https://deepmind.com/blog/wavenet-generative-model-raw-audio/
Concluding
Remarks
Concluding
Remarks
Lecture I: Introduction of Deep Learning

Lecture II: Tips for Training Deep Neural Network

Lecture III: Variants of Neural Network

Lecture IV: Next Wave


Will AI take over most jobs?
• New job in the AI age: AI trainer (machine learning expert, data scientist)
http://www.express.co.uk/news/science/651202/First-step-towards-The-Terminator-becoming-reality-AI-beats-champ-of-world-s-oldest-game

AI Trainer
Doesn't the machine learn by itself? Why do we need AI trainers?
The battles are fought by the Pokémon themselves, ……

Pokémon trainer:
• Picks suitable Pokémon for the battle
• Pokémon have different types
• The summoned Pokémon cannot always be fully controlled (e.g. Ash's Charizard)
• Needs enough experience

AI trainer:
• In step 1, picks a suitable model; different models suit different problems
• Step 3 does not always find the best function (e.g. Deep Learning)
• Needs enough experience

Behind a powerful AI, the AI trainer deserves much of the credit.
Let's walk the path of the AI trainer together!
http://www.gvm.com.tw/webonly_content_10787.html