
Game AI

July 17, 2019


Tommy Ho
AI Plays Games / Humans Play Games

https://www.youtube.com/watch?v=5GMDbStRgoc http://www.youtube.com/watch?v=Fbs4lnGLS8M
https://www.youtube.com/watch?v=qv6UVOQ0F44 https://www.youtube.com/watch?v=nEnNtiumgII
http://www.youtube.com/watch?v=Xj7-QA-aCus
Game AI
Deep Learning
July 17, 2019
Tommy Ho
Evolution of AI

https://www.embedded-vision.com/
Difference between AI, ML, and DL

https://blogs.oracle.com/bigdata/difference-ai-machine-learning-deep-learning
Traditional Hand-coded Code or Rules?

https://www.xenonstack.com/blog/data-science/log-analytics-deep-machine-learning-ai/
Outline

Lecture I: Introduction of Deep Learning

Lecture II: Variants of Neural Network

Lecture III: Beyond Supervised Learning

http://speech.ee.ntu.edu.tw/~tlkagk/slide/Tutorial_HYLee_Deep.pptx
Lecture I:
Introduction of
Deep Learning
Outline

Introduction of Deep Learning

“Hello World” for Deep Learning

Tips for Deep Learning


Machine Learning
≈ Looking for a Function
• Speech Recognition: f(audio) = "How are you"
• Image Recognition: f(image) = "Cat"
• Playing Go: f(board) = "5-5" (next move)
• Dialogue System: f("Hi") = "Hello" (what the user said → system response)
Image Recognition:
Framework: f(image) = "cat"
A set of functions (the Model): f1, f2, …
e.g. f1(cat image) = "cat" and f1(dog image) = "dog" (better!), while f2(cat image) = "monkey" and f2(dog image) = "snake".
Goodness of function f: decide which function is better.
Supervised Learning
Training data: function inputs (images) and function outputs (the labels "monkey", "cat", "dog").

Image Recognition:
Framework: f(image) = "cat"
Training: Step 1, a set of functions (the Model) f1, f2, …; Step 2, goodness of function f, measured on the training data ("monkey", "cat", "dog"); Step 3, pick the "best" function f*.
Testing: use f*(image) = "cat".
Three Steps for Deep Learning

Step 1: define a set of function


Neural Network

Step 2: goodness of function

Step 3: pick the best function


Neural Network
Neuron
z = a1 w1 + ⋯ + ak wk + ⋯ + aK wK + b
A neuron is a simple function: the inputs a1, …, aK are multiplied by the weights w1, …, wK, the bias b is added to give z, and an activation function σ(·) maps z to the output a = σ(z).

Neural Network
Sigmoid function as activation:
σ(z) = 1 / (1 + e^(−z))
For example, when z = 4 the neuron outputs σ(4) ≈ 0.98.
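A minimal numpy sketch of such a neuron (the example weights and bias are illustrative, chosen so that z = 4 and the output is σ(4) ≈ 0.98 as above):

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

def neuron(a, w, b):
    # z = a1*w1 + ... + aK*wK + b, output = sigma(z)
    z = np.dot(a, w) + b
    return sigmoid(z)

print(sigmoid(4.0))                                               # ~0.982
print(neuron(np.array([1.0, -1.0]), np.array([1.0, -2.0]), 1.0))  # z = 4, ~0.982
```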
Neural Network
Different connections lead to different network structures.
Each neuron computes its own σ(z); the neurons have different values of weights and biases. Weights and biases are the network parameters θ.
Fully Connected Feedforward Network
Worked example with sigmoid neurons: for the input (1, −1), the first hidden layer outputs (0.98, 0.12), the second hidden layer outputs (0.86, 0.11), and the output layer gives (0.62, 0.83). Feeding (0, 0) into the same network gives (0.51, 0.85).
This is a function from an input vector to an output vector:
f([1, −1]) = [0.62, 0.83],  f([0, 0]) = [0.51, 0.85]
Given the parameters θ, the network defines a function; given only the network structure, it defines a function set.
Fully Connected Feedforward Network
Input layer (x1, x2, …, xN) → Layer 1 → Layer 2 → … → Layer L → output layer (y1, y2, …, yM). Each node is a neuron; the layers between the input and output layers are the hidden layers.
Deep means many hidden layers.
Why Deep? Universality Theorem
Any continuous function f : R^N → R^M can be realized by a network with one hidden layer (given enough hidden neurons).
Reference for the reason: http://neuralnetworksanddeeplearning.com/chap4.html
Why Deep? Analogy
Logic circuits:
• Logic circuits consist of gates.
• Two layers of logic gates can represent any Boolean function.
• Building some functions with multiple layers of logic gates is much simpler (fewer gates needed).
Neural networks:
• Neural networks consist of neurons.
• A network with one hidden layer can represent any continuous function.
• Representing some functions with multiple layers of neurons is much simpler (fewer parameters, and perhaps less data?).
More reasons: https://www.youtube.com/watch?v=XsC9byQkUH8&list=PLJV_el3uVTsPy9oCRY30oBPNLCo89yu49&index=13
Deep = Many hidden layers
AlexNet (2012): 8 layers, 16.4% error
VGG (2014): 19 layers, 7.3% error
GoogleNet (2014): 22 layers, 6.7% error
http://cs231n.stanford.edu/slides/winter1516_lecture8.pdf
Deep = Many hidden layers
Residual Net (2015): 152 layers (a special structure), 3.57% error, compared with AlexNet (2012) 16.4%, VGG (2014) 7.3% and GoogleNet (2014) 6.7%. For scale, Taipei 101 has 101 floors.
Output Layer
• Softmax layer as the output layer
Ordinary layer: y1 = σ(z1), y2 = σ(z2), y3 = σ(z3).
In general, the output of the network can be any value, which may not be easy to interpret.
Output Layer
• Softmax layer as the output layer, so that the outputs form a probability distribution: 1 > yi > 0 and Σi yi = 1.
Softmax layer: yi = e^(zi) / Σ_{j=1}^{3} e^(zj)
Example: z1 = 3, z2 = 1, z3 = −3 give e^(z1) ≈ 20, e^(z2) ≈ 2.7, e^(z3) ≈ 0.05, so y1 ≈ 0.88, y2 ≈ 0.12, y3 ≈ 0.
Example Application
Input: x1, x2, …, x256 (a 16 x 16 = 256-pixel image; ink → 1, no ink → 0).
Output: y1, y2, …, y10, where each dimension represents the confidence of a digit, e.g. y1 = 0.1 ("is 1"), y2 = 0.7 ("is 2"), …, y10 = 0.2 ("is 0"), so the image is "2".
Example Application
• Handwriting Digit Recognition
A neural network maps the input x1, …, x256 to the output y1 ("is 1"), y2 ("is 2"), …, y10 ("is 0"), e.g. "2".
What is needed is a function with a 256-dim vector as input and a 10-dim vector as output.
Example Application
Input layer (x1, …, xN) → hidden layers (Layer 1, Layer 2, …, Layer L) → output layer (y1 "is 1", y2 "is 2", …, y10 "is 0").
The network structure gives a function set containing the candidates for handwriting digit recognition.
You need to decide the network structure to let a good function be in your function set.
Three Steps for Deep Learning

Step 1: define a set of function

Step 2: goodness of function

Step 3: pick the best function


Famous Example
Real case: https://youtu.be/ZKkeiqcL8qw?t=24
How did other people explain it?
https://www.youtube.com/watch?v=aircAruvnKk
Training Data
• Preparing training data: images and their labels

“5” “0” “4” “1”

“9” “2” “1” “3”

The learning target is defined on


the training data.
Learning Target
Input: x1, …, x256 (16 x 16 = 256 pixels; ink → 1, no ink → 0); after the softmax layer, output: y1 ("is 1"), y2 ("is 2"), …, y10 ("is 0").
The learning target is: for an input image of "1", y1 has the maximum value; for an input image of "2", y2 has the maximum value; and so on.
A good function should make the loss of all examples as small as possible.
Loss
Given a set of parameters, feed an image of "1" into the network: the outputs y1, …, y10 (after softmax) should be as close as possible to the target (1, 0, …, 0), and the loss 𝑙 measures the gap.
The loss can be square error or cross entropy between the network output and the target.
Total Loss
For all training data x1, …, xR with targets ŷ1, …, ŷR, the total loss is
L = Σ_{r=1}^{R} l_r
where l_r is the loss of the r-th example. We want L to be as small as possible: find a function in the function set, i.e. the network parameters θ*, that minimizes the total loss L.
Three Steps for Deep Learning

Step 1: define a set of function

Step 2: goodness of function

Step 3: pick the best function


How to pick the best function
Find the network parameters θ* that minimize the total loss L.
Enumerate all possible values? The network parameters θ = {w1, w2, w3, ⋯, b1, b2, b3, ⋯} number in the millions: two adjacent layers of 1000 neurons already have 10^6 weights between them, and a speech recognition network may have 8 layers with 1000 neurons per layer. Enumeration is hopeless.
Gradient Descent (network parameters θ = {w1, w2, ⋯, b1, b2, ⋯})
Find the network parameters θ* that minimize the total loss L.
• Pick an initial value for w: random initialization (or RBM pre-training) is usually good enough.
Gradient Descent
• Pick an initial value for w.
• Compute ∂L/∂w: if it is negative, increase w; if it is positive, decrease w.
http://chico386.pixnet.net/album/photo/171572850
Gradient Descent
• Pick an initial value for w.
• Compute ∂L/∂w and update w ← w − η ∂L/∂w, where η is called the "learning rate".
• Repeat until ∂L/∂w is approximately zero (i.e. the update becomes very small).
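A sketch of the update rule on a toy one-dimensional loss; the loss L(w) = (w − 3)² is an assumption made purely for illustration:

```python
# Gradient descent on a toy 1-D loss L(w) = (w - 3)^2.
def dL_dw(w):
    return 2 * (w - 3)          # derivative of (w - 3)^2

w = -4.0                        # pick an initial value for w
eta = 0.1                       # learning rate
for step in range(100):
    g = dL_dw(w)
    if abs(g) < 1e-6:           # stop when dL/dw is approximately zero
        break
    w = w - eta * g             # w <- w - eta * dL/dw
print(w)                        # close to the minimizer w = 3
```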
Local Minima
As the value of a network parameter w varies, the total loss surface has flat and bumpy regions: gradient descent is very slow at a plateau (∂L/∂w ≈ 0), can get stuck at a saddle point (∂L/∂w = 0), and can get stuck at a local minimum (∂L/∂w = 0).
Local Minima
• Gradient descent never guarantees reaching the global minimum: different initial points can reach different minima, and hence different results.
Gradient Descent
This is the "learning" of machines in deep learning; even AlphaGo uses this approach.
What people imagine is rather grander than what actually happens (just gradient descent). I hope you are not too disappointed :p
Backpropagation
• Backpropagation: an efficient way to compute 𝜕𝐿Τ𝜕𝑤 in neural network
Three Steps for Deep Learning
Step 1: Step 2: Step 3:
define a goodness pick the
set of of best
function function function
Deep Learning is so simple ……
Now, if you want to find a function, and you have lots of function input/output pairs (?) as training data, you can use deep learning.


For example, you can do …….
• Image Recognition
“monkey”
“monkey”
“cat”
Network
“cat”
“dog”

“dog”
For example, you can do …….
“Talk” in e-mail
Spam
filtering Network 1/0
(Yes/No)
“free” in e-mail
1 (Yes)

0 (No)

(http://spam-filter-review.toptenreviews.com/)
For example, you can do …….
Document classification: a network takes features of a document (e.g. whether "stock" or "president" appears in it) and outputs its category, such as Sports, Finance or Local News.
http://top-breaking-news.com/
Outline

Introduction of Deep Learning

“Hello World” for Deep Learning

Tips for Deep Learning


Keras
TensorFlow or Theano: very flexible, but they need some effort to learn.
Keras: easy to learn and use; it is an interface for TensorFlow or Theano and still has some flexibility. You can modify it if you can write TensorFlow or Theano.
Keras
• François Chollet is the author of Keras.
• He currently works for Google as a deep learning engineer and researcher.
• Keras means horn in Greek
• Documentation: http://keras.io/
• Example: https://github.com/fchollet/keras/tree/master/examples
• Keras tutorial: https://github.com/tgjeon/Keras-Tutorials
Example Application
• Handwriting Digit Recognition

Machine “1”

28 x 28

MNIST Data: http://yann.lecun.com/exdb/mnist/


“Hello world” for deep learning
Keras provides data sets loading function: http://keras.io/datasets/
Keras
The "hello world" network: 28 x 28 = 784 inputs → fully connected layer with 500 neurons → fully connected layer with 500 neurons → softmax output layer y1, y2, …, y10.
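A sketch of this network in Keras (using the tensorflow.keras API; the sigmoid hidden activations are an assumption, the layer sizes follow the slide):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# 28 x 28 = 784 inputs -> 500 -> 500 -> softmax over 10 digits
model = Sequential()
model.add(Dense(500, activation='sigmoid', input_shape=(784,)))
model.add(Dense(500, activation='sigmoid'))
model.add(Dense(10, activation='softmax'))
model.summary()
```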
Keras
Step 3.1: Configuration: choose the loss, the optimizer and the learning rate for the update w ← w − η ∂L/∂w (e.g. η = 0.1).
Step 3.2: Find the optimal network parameters using the training data (images) and their labels (digits).
Keras
Step 3.2: Find the optimal network parameters.
The training images are stored as a numpy array of shape (number of training examples, 28 x 28 = 784); the labels are a numpy array of shape (number of training examples, 10) with one-hot rows.
https://www.tensorflow.org/versions/r0.8/tutorials/mnist/beginners/index.html
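A hedged sketch of steps 3.1 and 3.2, continuing from the model above; x_train and y_train are assumed to be the numpy arrays just described, the learning rate 0.1 follows the slide, and batch size 100 / 20 epochs follow the mini-batch section later:

```python
from tensorflow.keras.optimizers import SGD

# Step 3.1: configuration -- loss, optimizer and learning rate (eta = 0.1)
model.compile(loss='categorical_crossentropy',
              optimizer=SGD(learning_rate=0.1),
              metrics=['accuracy'])

# Step 3.2: find the optimal network parameters
# x_train: shape (num_examples, 784); y_train: shape (num_examples, 10), one-hot
model.fit(x_train, y_train, batch_size=100, epochs=20)
```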
Keras
Save and load models: http://keras.io/getting-started/faq/#how-can-i-save-a-keras-model
How to use the neural network (testing):
case 1: evaluate the model on a labeled testing set;
case 2: predict the outputs for new inputs.
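A sketch of the two testing cases and of saving/loading, using standard Keras calls (x_test and y_test are assumed to be prepared like the training arrays):

```python
# case 1: evaluate on a labeled testing set
score = model.evaluate(x_test, y_test)
print('Total loss on testing set:', score[0])
print('Accuracy on testing set:', score[1])

# case 2: predict outputs for new inputs
result = model.predict(x_test)   # shape (num_examples, 10), one probability per digit

# saving and loading a trained model
model.save('my_model.h5')
from tensorflow.keras.models import load_model
model = load_model('my_model.h5')
```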
Keras
• Using GPU to speed training
• Way 1
• THEANO_FLAGS=device=gpu0 python YourCode.py
• Way 2 (in your code)
• import os
• os.environ["THEANO_FLAGS"] = "device=gpu0"
Three Steps for Deep Learning
Step 1: Step 2: Step 3:
define a goodness pick the
set of of best
function function function
Deep Learning is so simple ……
Outline

Introduction of Deep Learning

“Hello World” for Deep Learning

Tips for Deep Learning


Recipe of Deep Learning
YES
Step 1: define a NO
Good Results on
set of function Testing Data?
Overfitting!
Step 2: goodness
of function YES

NO
Step 3: pick the Good Results on
Training Data?
best function

Neural
Network
Do not always blame Overfitting

Not well trained

Overfitting?

Training Data Testing Data

Deep Residual Learning for Image Recognition


http://arxiv.org/abs/1512.03385
Recipe of Deep Learning
YES

Good Results on
Different approaches for Testing Data?
different problems.

e.g. dropout for good YES


results on testing data
Good Results on
Training Data?

Neural
Network
Recipe of Deep Learning
YES

Choosing proper loss Good Results on


Testing Data?
Mini-batch
New activation YES
function
Adaptive Learning Good Results on
Training Data?
Rate
Momentum
Choosing Proper Loss
For an input image of "1", the softmax outputs y1, …, y10 are compared with the target ŷ = (1, 0, …, 0). Which loss is better?
Square error: Σ_{i=1}^{10} (yi − ŷi)²     Cross entropy: −Σ_{i=1}^{10} ŷi ln yi
Both are 0 when the output equals the target.
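A small numpy comparison of the two losses; the example outputs y_good and y_bad are assumptions for illustration:

```python
import numpy as np

y_hat = np.zeros(10); y_hat[0] = 1.0          # target for an image of "1"

def square_error(y, y_hat):
    return np.sum((y - y_hat) ** 2)

def cross_entropy(y, y_hat):
    return -np.sum(y_hat * np.log(y + 1e-12))  # small epsilon for stability

y_good = np.array([0.9] + [0.1 / 9] * 9)      # confident, nearly correct output
y_bad  = np.full(10, 0.1)                     # uniform, uninformative output
for y in (y_good, y_bad):
    print(square_error(y, y_hat), cross_entropy(y, y_hat))
# Cross entropy punishes the poor output much more strongly than square error does.
```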
Recipe of Deep Learning
YES

Choosing proper loss Good Results on


Testing Data?
Mini-batch
New activation YES
function
Adaptive Learning Good Results on
Training Data?
Rate
Momentum
We do not really minimize total loss!
Mini-batch
• Randomly initialize the network parameters.
• Pick the 1st mini-batch (e.g. x1, x31, …): compute L′ = l¹ + l³¹ + ⋯ and update the parameters once.
• Pick the 2nd mini-batch (e.g. x2, x16, …): compute L″ = l² + l¹⁶ + ⋯ and update the parameters once.
• Continue until all mini-batches have been picked: that is one epoch.
• Repeat the above process.
Mini-batch
With 100 examples in a mini-batch, one epoch picks the batches one after another (1st batch: L′ = l¹ + l³¹ + ⋯, update once; 2nd batch: L″ = l² + l¹⁶ + ⋯, update once; …) until all mini-batches have been picked. Repeating 20 times means training for 20 epochs.
Shuffle the training examples for each epoch
Epoch 1: mini-batch {x1, x31, …}, then {x2, x16, …}, and so on.
Epoch 2: the examples are reshuffled, so the mini-batches differ, e.g. {x1, x17, …}, then {x2, x26, …}.
Don't worry. This is the default of Keras.
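A sketch of what one epoch of mini-batch training looks like if written by hand; update_once is a hypothetical stand-in for a single gradient-descent step on one batch (model.fit does all of this for you):

```python
import numpy as np

def train(x_train, y_train, update_once, batch_size=100, epochs=20):
    n = len(x_train)
    for epoch in range(epochs):
        order = np.random.permutation(n)              # shuffle every epoch (Keras default)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]     # pick the next mini-batch
            update_once(x_train[idx], y_train[idx])   # update parameters once on L' = sum of batch losses
        # all mini-batches picked -> one epoch finished
```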
Recipe of Deep Learning
YES

Choosing proper loss Good Results on


Testing Data?
Mini-batch
New activation YES
function
Adaptive Learning Good Results on
Training Data?
Rate
Momentum
Hard to get the power of Deep …
On the training data, deeper usually does not imply better.

Vanishing Gradient Problem
In a deep network from x1, …, xN to y1, …, yM, the layers near the input have smaller gradients, learn very slowly, and remain almost random, while the layers near the output have larger gradients, learn very fast, and have already converged, based on an almost random earlier part!?
Vanishing Gradient Problem
Why are the gradients near the input smaller? An intuitive way to compute the derivative is ∂𝑙/∂w ≈ Δ𝑙/Δw: perturb a weight w near the input by Δw and see how much the loss 𝑙 changes. Each sigmoid squashes a large change of its input into a small change of its output, so the effect of Δw shrinks as it passes through the layers; Δ𝑙 is therefore small, and so is the gradient near the input.
Hard to get the power of Deep …

In 2006, people used RBM pre-training.


In 2015, people use ReLU.
ReLU
• Rectified Linear Unit (ReLU): a = z for z > 0, a = 0 for z ≤ 0.
Reasons: 1. Fast to compute. 2. Biological reason. 3. Equivalent to an infinite number of sigmoids with different biases. 4. Handles the vanishing gradient problem.
[Xavier Glorot, AISTATS'11] [Andrew L. Maas, ICML'13] [Kaiming He, arXiv'15]
ReLU
Neurons whose ReLU output is 0 can be removed from the network. The remaining active neurons form a thinner linear network from x1, x2 to y1, y2, which does not have smaller gradients.
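A one-line numpy sketch of ReLU:

```python
import numpy as np

def relu(z):
    # a = z for z > 0, a = 0 for z <= 0
    return np.maximum(0.0, z)

print(relu(np.array([-2.0, 0.0, 3.0])))   # [0. 0. 3.]
# In Keras, simply use activation='relu' instead of 'sigmoid' in each Dense layer.
```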
ReLU - variants
Leaky ReLU: a = z for z > 0, a = 0.01z for z ≤ 0.
Parametric ReLU: a = z for z > 0, a = αz for z ≤ 0, where α is also learned by gradient descent.
Maxout (ReLU is a special case of Maxout)
• Learnable activation function [Ian J. Goodfellow, ICML'13]
The values entering a layer are grouped, and each group outputs the maximum of its elements; e.g. max(5, 7) = 7 and max(−1, 1) = 1 in the first layer, then max(1, 2) = 2 and max(4, 3) = 4 in the next. You can have more than 2 elements in a group.
• The activation function in a maxout network can be any piecewise linear convex function; how many pieces it has depends on how many elements are in a group (2 elements in a group vs. 3 elements in a group).
Recipe of Deep Learning
YES

Choosing proper loss Good Results on


Testing Data?
Mini-batch
New activation YES
function
Adaptive Learning Good Results on
Training Data?
Rate
Momentum
Learning Rates Set the
learning rate η
carefully
If learning rate is too large

Total loss may not decrease


after each update
𝑤2

𝑤1
Learning Rates Set the
learning rate η
carefully
If learning rate is too large

Total loss may not decrease


after each update
𝑤2
If learning rate is too small

Training would be too


slow
𝑤1
Learning Rates
• Popular & Simple Idea: Reduce the learning rate by some factor every
few epochs.
• At the beginning, we are far from the destination, so we use larger learning
rate
• After several epochs, we are close to the destination, so we reduce the
learning rate
• E.g. 1/t decay: η_t = η / (t + 1)
• Learning rate cannot be one-size-fits-all
• Giving different parameters different learning rates
Adagrad
Original gradient descent: w ← w − η ∂L/∂w
Adagrad: w ← w − η_w ∂L/∂w, with a parameter-dependent learning rate
η_w = η / √( Σ_{i=0}^{t} (g_i)² )
where η is a constant and g_i is the ∂L/∂w obtained at the i-th update; the denominator is the summation of the squares of the previous derivatives.

Example: parameter w1 has derivatives g⁰ = 0.1, g¹ = 0.2, so its learning rates are η/√(0.1²) and then η/√(0.1² + 0.2²); parameter w2 has derivatives g⁰ = 20.0, g¹ = 10.0, so its learning rates are η/√(20²) and then η/√(20² + 10²) ≈ η/22.
Observations: 1. The learning rate becomes smaller and smaller for all parameters. 2. Parameters with smaller derivatives get a larger learning rate, and vice versa. Why? Larger derivatives get a smaller learning rate and smaller derivatives get a larger learning rate, so every parameter takes steps of a comparable size.
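A numpy sketch of the Adagrad update for a single parameter, reproducing the two examples above:

```python
import numpy as np

# Adagrad for one parameter: w <- w - eta / sqrt(sum of squared past gradients) * g
def adagrad_updates(grads, w=0.0, eta=0.1):
    acc = 0.0
    for g in grads:
        acc += g ** 2
        w -= eta / np.sqrt(acc) * g
        print('effective learning rate:', eta / np.sqrt(acc))
    return w

adagrad_updates([0.1, 0.2])     # rates eta/0.1, then eta/sqrt(0.1^2 + 0.2^2)
adagrad_updates([20.0, 10.0])   # rates eta/20,  then eta/sqrt(20^2 + 10^2) ~ eta/22.4
```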
Recipe of Deep Learning
YES

Choosing proper loss Good Results on


Testing Data?
Mini-batch
New activation YES
function
Adaptive Learning Good Results on
Training Data?
Rate
Momentum
Hard to find
optimal network parameters
Total
Loss Very slow at
the plateau
Stuck at saddle point

Stuck at local minima

𝜕𝐿 ∕ 𝜕𝑤 𝜕𝐿 ∕ 𝜕𝑤 𝜕𝐿 ∕ 𝜕𝑤
≈0 =0 =0

The value of a network parameter w


In physical world ……
• Momentum

How about put this phenomenon


in gradient descent?
Still not guarantee reaching
Momentum global minima, but give some
hope ……
cost
Movement =
Negative of 𝜕𝐿∕𝜕𝑤 + Momentum
Negative of 𝜕𝐿 ∕ 𝜕𝑤
Momentum
Real Movement

𝜕𝐿∕𝜕𝑤 = 0
Adam = RMSProp (an advanced Adagrad) + Momentum
Recipe of Deep Learning
YES

Early Stopping Good Results on


Testing Data?

Regularization
YES

Dropout Good Results on


Training Data?

Network Structure
Panacea for Overfitting
• Have more training data
• Create more training data (?)
Handwriting recognition: from the original training data, create more training data by shifting/rotating each image by 15°.
Recipe of Deep Learning
YES

Early Stopping Good Results on


Testing Data?

Regularization
YES

Dropout Good Results on


Training Data?

Network Structure
Dropout
Training:

 Each time before updating the parameters


 Each neuron has p% to dropout
Dropout
Training:

Thinner!

 Each time before updating the parameters


 Each neuron has p% to dropout
The structure of the network is changed.
 Using the new network for training
For each mini-batch, we resample the dropout
neurons
Dropout
Testing:
• No dropout.
• If the dropout rate at training is p%, multiply all the weights by (1 − p%).
• Assume the dropout rate is 50%: if a weight is w = 1 after training, set w = 0.5 for testing.
Dropout - Intuitive Reason
• Why should the weights be multiplied by (1 − p%) (p% being the dropout rate) when testing?
Training with dropout rate 50%: roughly half of the inputs to a neuron are dropped, so z is computed from about half of the weights w1, w2, w3, w4.
Testing with no dropout: all inputs are present, so using the weights from training would give z′ ≈ 2z; multiplying the weights by (1 − p%) = 0.5 restores z′ ≈ z.
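A numpy sketch of dropout at training time and of the weight rescaling at testing time (Keras's Dropout layer takes care of this bookkeeping for you):

```python
import numpy as np

def dropout_train(a, p=0.5, rng=np.random.default_rng(0)):
    # each neuron has probability p to drop out; the mask is resampled for every mini-batch
    mask = rng.random(a.shape) >= p
    return a * mask

def dropout_test(w, p=0.5):
    # at test time there is no dropout; weights are multiplied by (1 - p)
    return w * (1 - p)

a = np.ones(4)
print(dropout_train(a))                      # roughly half of the activations become 0
print(dropout_test(np.array([1.0, 2.0])))    # [0.5, 1.0]
```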
Dropout is a kind of ensemble.
Ensemble: Training Set

Set 1 Set 2 Set 3 Set 4

Network Network Network Network


1 2 3 4

Train a bunch of networks with different structures


Dropout is a kind of ensemble.
Ensemble
Testing data x

Network Network Network Network


1 2 3 4

y1 y2 y3 y4

average
Dropout is a kind of ensemble.
minibatch minibatch minibatch minibatch Training of
1 2 3 4 Dropout

M neurons

……
2M
possible
networks
Using one mini-batch to train one network
Some parameters in the network are shared
Dropout is a kind of ensemble.
Testing of Dropout testing data x

All the
weights

……
multiply
1-p%

y1 y2 y3
?????
average ≈ y
Recipe of Deep Learning
YES

Early Stopping Good Results on


Testing Data?

Regularization
YES

Dropout Good Results on


Training Data?

Network Structure
CNN is a very good example!
(next lecture)
Concluding Remarks
Recipe of Deep Learning
YES
Step 1: define a set NO
of function Good Results on
Testing Data?

Step 2: goodness of
function YES

NO
Step 3: pick the Good Results on
best function Training Data?

Neural
Network
Lecture II:
Variants of Neural
Networks
Variants of Neural Networks
• Convolutional Neural Network (CNN): widely used in image processing
• Recurrent Neural Network (RNN)
Why CNN for Image? [Zeiler, M. D., ECCV 2014]

x1 ……
x2 ……

……

……
……

……
Represented
as pixels xN ……

The most basic Use 1st layer as module Use 2nd layer as
classifiers to build classifiers module ……

Can the network be simplified by


considering the properties of images?
Why CNN for Image
• Some patterns are much smaller than the whole image

A neuron does not have to see the whole image


to discover the pattern.
Connecting to small region with less
parameters

“beak” detector
Why CNN for Image
• The same patterns appear in different regions.
“upper-left
beak” detector

Do almost the same thing


They can use the same
set of parameters.

“middle beak”
detector
Why CNN for Image
• Subsampling the pixels will not change the object
bird
bird

subsampling

We can subsample the pixels to make image smaller


Less parameters for the network to process the image
Three Steps for Deep Learning
Step 1: Step 2: Step 3:
define a goodness pick the
Convolutional
set of
Neural Network
of best
function function function
Deep Learning is so simple ……
The whole CNN
cat dog ……
Convolution

Max Pooling
Can repeat
Fully Connected many times
Feedforward Convolution
network

Max Pooling

Flatten
The whole CNN
Property 1
 Some patterns are much Convolution
smaller than the whole image
Property 2
Max Pooling
 The same patterns appear in
Can repeat
different regions.
many times
Property 3 Convolution
 Subsampling the pixels will
not change the object
Max Pooling

Flatten
The whole CNN
cat dog ……
Convolution

Max Pooling
Can repeat
Fully Connected many times
Feedforward Convolution
network

Max Pooling

Flatten
CNN – Convolution
The filters are the network parameters to be learned.
6 x 6 image:
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0

Filter 1 (3 x 3 matrix):
 1 -1 -1
-1  1 -1
-1 -1  1

Filter 2 (3 x 3 matrix):
-1  1 -1
-1  1 -1
-1  1 -1

Each filter detects a small pattern (3 x 3): Property 1.
CNN – Convolution (Filter 1, stride = 1)
Place Filter 1 on the top-left 3 x 3 patch of the 6 x 6 image and take the inner product: the result is 3. Moving the filter one pixel to the right gives −1.
CNN – Convolution (Filter 1, stride = 2)
With stride = 2 the filter moves two pixels at a time, giving 3 and then −3 in the first row. We set stride = 1 below.
CNN – Convolution (Filter 1, stride = 1)
Sliding Filter 1 over the whole 6 x 6 image gives a 4 x 4 result:
 3 -1 -3 -1
-3  1  0 -3
-3 -3  0  1
 3 -2 -2 -1
The same pattern is detected wherever it appears in the image: Property 2.
CNN – Convolution (Filter 2, stride = 1)
Do the same process for every filter. Filter 2 gives another 4 x 4 result:
-1 -1 -1 -1
-1 -1 -2  1
-1 -1 -2  1
-1  0 -4  3
Together, the 4 x 4 outputs of all the filters form the Feature Map; the 6 x 6 image has become a 4 x 4 image per filter.
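A numpy sketch that reproduces the convolution above (the 6 x 6 image, Filter 1 and both stride settings are taken from the slides):

```python
import numpy as np

image = np.array([[1,0,0,0,0,1],
                  [0,1,0,0,1,0],
                  [0,0,1,1,0,0],
                  [1,0,0,0,1,0],
                  [0,1,0,0,1,0],
                  [0,0,1,0,1,0]])

filter1 = np.array([[ 1,-1,-1],
                    [-1, 1,-1],
                    [-1,-1, 1]])

def convolve(img, flt, stride=1):
    # slide the 3x3 filter over the image and take inner products
    k = flt.shape[0]
    out_size = (img.shape[0] - k) // stride + 1
    out = np.zeros((out_size, out_size), dtype=int)
    for i in range(out_size):
        for j in range(out_size):
            patch = img[i*stride:i*stride+k, j*stride:j*stride+k]
            out[i, j] = np.sum(patch * flt)
    return out

print(convolve(image, filter1))            # the 4 x 4 map starting 3, -1, -3, -1, ...
print(convolve(image, filter1, stride=2))  # stride 2 gives a 2 x 2 map starting 3, -3
```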
CNN – Zero Padding
Pad the 6 x 6 image with zeros around the border before convolving with Filter 1; in this way you will get another 6 x 6 image instead of a 4 x 4 one.
CNN – Colorful image
A colorful image has three channels (RGB), so it is a stack of three 6 x 6 matrices, and each filter (Filter 1, Filter 2, …) is correspondingly a stack of three 3 x 3 matrices; the convolution is taken over all channels at once.
Convolution v.s. Fully Connected
Convolution takes the 6 x 6 image and the filters directly and produces the feature map.
A fully-connected layer would instead flatten the image into x1, x2, …, x36 and connect every input to every output.
Viewing Filter 1 as a sparsely connected layer: flatten the 6 x 6 image into inputs 1, 2, …, 36. The first output of Filter 1 (the value 3) is connected to only 9 of the 36 inputs (those in the top-left 3 x 3 patch: 1, 2, 3, 7, 8, 9, 13, 14, 15), not fully connected: fewer parameters! The next output of Filter 1 (the value −1) is connected to a shifted set of 9 inputs (2, 3, 4, 8, 9, 10, 14, 15, 16) but uses the same 9 weights, i.e. shared weights: even fewer parameters!
The whole CNN
cat dog ……
Convolution

Max Pooling
Can repeat
Fully Connected many times
Feedforward Convolution
network

Max Pooling

Flatten
CNN – Max Pooling
The 4 x 4 outputs of Filter 1 and Filter 2:
Filter 1:            Filter 2:
 3 -1 -3 -1          -1 -1 -1 -1
-3  1  0 -3          -1 -1 -2  1
-3 -3  0  1          -1 -1 -2  1
 3 -2 -2 -1          -1  0 -4  3
Group the values 2 x 2 and keep the maximum of each group. After convolution and max pooling, the 6 x 6 image becomes a new but smaller 2 x 2 image, with one channel per filter:
Filter 1 channel:    Filter 2 channel:
3 0                  -1 1
3 1                   0 3
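A numpy sketch of 2 x 2 max pooling applied to the Filter 1 feature map above:

```python
import numpy as np

feature_map = np.array([[ 3,-1,-3,-1],
                        [-3, 1, 0,-3],
                        [-3,-3, 0, 1],
                        [ 3,-2,-2,-1]])   # output of Filter 1 above

def max_pool(fmap, size=2):
    # keep the maximum of each non-overlapping size x size block
    h, w = fmap.shape
    return fmap.reshape(h // size, size, w // size, size).max(axis=(1, 3))

print(max_pool(feature_map))   # [[3, 0], [3, 1]]
```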
The whole CNN
Convolution followed by Max Pooling produces a new image that is smaller than the original, and whose number of channels equals the number of filters. Convolution and max pooling can then be repeated many times.
The whole CNN
cat dog ……
Convolution

Max Pooling
A new image
Fully Connected
Feedforward Convolution
network

Max Pooling
A new image
Flatten
Flatten
The final small image (here 2 x 2 with 2 channels: 3 0 / 3 1 and −1 1 / 0 3) is flattened into a single vector and fed into a fully connected feedforward network.
Convolutional Neural Network
Step 1: Step 2: Step 3:
define a goodness pick the
set of
Convolutional of best
function
Neural Network function function
“monkey” 0
“cat” 1
CNN



“dog” 0
Convolution, Max target
Pooling, fully connected
Learning: Nothing special, just gradient descent ……
CNN in Keras
Only the network structure and the input format are modified (vector → 3-D tensor).
input_shape = (1, 28, 28): 1 channel (black/white; it would be 3 for RGB), 28 x 28 pixels.
The first convolution layer has 25 filters of size 3 x 3 (such as the diagonal-detecting filters above), followed by max pooling, then another convolution and another max pooling.
CNN in Keras
Only the network structure and the input format are modified (vector → 3-D tensor).
input: 1 x 28 x 28
→ Convolution (25 filters): 25 x 26 x 26. How many parameters for each filter? 9 (a 3 x 3 filter on 1 channel).
→ Max Pooling: 25 x 13 x 13
→ Convolution (50 filters): 50 x 11 x 11. How many parameters for each filter? 225 (a 3 x 3 filter on 25 channels).
→ Max Pooling: 50 x 5 x 5
CNN in Keras
Only the network structure and the input format are modified (vector → 3-D tensor).
input: 1 x 28 x 28 → Convolution: 25 x 26 x 26 → Max Pooling: 25 x 13 x 13 → Convolution: 50 x 11 x 11 → Max Pooling: 50 x 5 x 5 → Flatten: 1250 → fully connected feedforward network → output.
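A sketch of this CNN in Keras (tensorflow.keras defaults to channels-last, so the input shape is written (28, 28, 1) instead of (1, 28, 28); the ReLU activations and the size of the fully connected layer are assumptions):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential()
# 25 filters of size 3x3 on a 28 x 28 single-channel image -> 26 x 26 x 25
model.add(Conv2D(25, (3, 3), input_shape=(28, 28, 1), activation='relu'))
model.add(MaxPooling2D((2, 2)))                     # -> 13 x 13 x 25
model.add(Conv2D(50, (3, 3), activation='relu'))    # -> 11 x 11 x 50
model.add(MaxPooling2D((2, 2)))                     # -> 5 x 5 x 50
model.add(Flatten())                                # -> 1250
model.add(Dense(100, activation='relu'))            # fully connected part (size 100 assumed)
model.add(Dense(10, activation='softmax'))
model.summary()
```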
What does CNN learn?
The output of the k-th filter in the second convolution layer (50 3x3 filters, output 50 x 11 x 11) is an 11 x 11 matrix of activations a^k_ij. Define the degree of activation of the k-th filter as
a^k = Σ_{i=1}^{11} Σ_{j=1}^{11} a^k_{ij}
and find the input image x* = arg max_x a^k by gradient ascent.
What does CNN learn?
The output of the k-th filter is a
11 x 11 matrix. input
Degree of the activation 11 11
of the k-th filter: 𝑘
𝑎 𝑘 = ෍ ෍ 𝑎𝑖𝑗 25 3x3
Convolution
𝑖=1 𝑗=1 filters
𝑥 ∗ = 𝑎𝑟𝑔 max 𝑎𝑘 (gradient ascent)
𝑥
Max Pooling

50 3x3
Convolution
filters
50 x 11 x
11 Max Pooling

For each filter


What does CNN learn? input

𝑥 ∗ = 𝑎𝑟𝑔 max 𝑦 𝑖 Can we see digits? Convolution


𝑥

Max Pooling

0 1 2
Convolution

Max Pooling
3 4 5
flatten

6 7 8

Deep Neural Networks are Easily Fooled


https://www.youtube.com/watch?v=M2IebCN9Ht4
𝑦𝑖
What does CNN learn? Over all
pixel values

𝑥 ∗ = 𝑎𝑟𝑔 max 𝑦 𝑖 𝑥 ∗ = 𝑎𝑟𝑔 max 𝑦 𝑖 + ෍ 𝑥𝑖𝑗


𝑥 𝑥
𝑖,𝑗

0 1 2 0 1 2

3 4 5 3 4 5

6 7 8 6 7 8
CNN
Modify
Deep Dream image

• Given a photo, machine adds what it sees ……


3.9
−1.5
2.3

CNN exaggerates what it sees

http://deepdreamgenerator.com/
Deep Dream
• Given a photo, machine adds what it sees ……

http://deepdreamgenerator.com/
Deep Style
• Given a photo, make its style like famous paintings

https://dreamscopeapp.com/
Deep Style
• Given a photo, make its style like famous paintings

https://dreamscopeapp.com/
Deep Style

CNN CNN

A Neural content style


Algorithm of
Artistic Style
https://arxiv.org/abs/150
8.06576
CNN

?
More Application: Playing Go

Next move
Network (19 x 19
positions)

19 x 19 19 x 19 vector
matrix (image)
19 x 19 vector
Black: 1 Fully-connected feedforward
white: -1 network can be used
none: 0 But CNN performs much better.
Why CNN for playing Go?
• Some patterns are much smaller than the whole image

Alpha Go uses 5 x 5 for first layer


• The same patterns appear in different regions.
Why CNN for playing Go?
• Subsampling the pixels will not change the object
Max Pooling How to explain this???

Alpha Go does not use Max Pooling ……


CNN
Variants of Neural Networks

Convolutional Neural
Network (CNN)
Recurrent Neural
Network (RNN)
Neural Network with
Memory
Example Application
• Slot Filling

I would like to arrive Taipei on November 2nd.

ticket booking system

Destination: Taipei
Slot
time of arrival: November 2nd
Example Application
y1 y2
Solving slot filling by
Feedforward network?
Input: a word
(Each word is
represented as a vector)

Taipei x1 x2
1-of-N encoding

How to represent each word as a vector?


1-of-N Encoding lexicon = {apple, bag, cat, dog, elephant}
The vector is lexicon size. apple = [ 1 0 0 0 0]
Each dimension corresponds bag = [ 0 1 0 0 0]
to a word in the lexicon cat = [ 0 0 1 0 0]
The dimension for the word dog = [ 0 0 0 1 0]
is 1, and others are 0 elephant = [ 0 0 0 0 1]
Beyond 1-of-N encoding
Dimension for "Other": add an extra dimension "other" to the lexicon; words not in the lexicon, such as w = "Gandalf" or w = "Sauron", get apple = 0, bag = 0, cat = 0, dog = 0, elephant = 0, …, "other" = 1.
Word hashing: represent a word by its letter tri-grams in a 26 x 26 x 26-dim vector; e.g. w = "apple" has a-p-p = 1, p-p-l = 1, p-l-e = 1 and all other tri-grams (a-a-a, a-a-b, …) equal to 0.
Example Application time of
dest departure
y1 y2
Solving slot filling by
Feedforward network?
Input: a word
(Each word is
represented as a vector)
Output:
Probability distribution that
the input word belonging to
the slots
Taipei x1 x2
Example Application time of
dest departure
y1 y2
arrive Taipei on November 2nd

other dest other time time


Problem?
leave Taipei on November 2nd

place of departure

Neural network Taipei x1 x2


needs memory!
Three Steps for Deep Learning
Step 1: Step 2: Step 3:
define a goodness pick the
Recurrent
set of
Neural Network
of best
function function function
Deep Learning is so simple ……
Recurrent Neural Network (RNN)
y1 y2

The output of hidden layer


are stored in the memory.
store

a1 a2

Memory can be considered x1 x2


as another input.
RNN The same network is used again and again.

Probability of Probability of Probability of


“arrive” in each slot “Taipei” in each slot “on” in each slot
y1 y2 y3
store store
a1 a2 a3
a1 a2

x1 x2 x3

arrive Taipei on November 2nd


RNN Different

Prob of “leave” Prob of “Taipei” Prob of “arrive” Prob of “Taipei”


in each slot in each slot in each slot in each slot
y1 y2 … y1 y2 …
store … store …
a1 a2 a1 a2
a1 … a1 …
… …
x1 x2 … x1 x2 …
leav Taipei … arrive Taipei …
e
The values stored in the memory is different.
Of course it can be deep …
yt yt+1 yt+2

…… ……

……
……
……
…… ……

…… ……

xt xt+1 xt+2
Bidirectional RNN
xt xt+1 xt+2

…… ……

yt yt+1 yt+2

…… ……

xt xt+1 xt+2
RNN
• https://www.youtube.com/watch?v=_aCuOwF1ZjU&list=PLjJh1vlSEYg
vZ3ze_4pxKHNh1g5PId36-&index=8
Long Short-term Memory (LSTM)
A special neuron with 4 inputs and 1 output: the input z from the rest of the network, plus three control signals from other parts of the network, z_i for the input gate, z_f for the forget gate and z_o for the output gate, all applied to a memory cell.
The gate activation function f is usually a sigmoid, so its value lies between 0 and 1 and mimics an open or closed gate.
New value of the memory cell:  c′ = g(z) f(z_i) + c f(z_f)
Output of the neuron:  a = h(c′) f(z_o)
where g and h are activation functions and c is the previous value stored in the memory cell.
(Figure: a numerical walk-through of the LSTM cell, showing how gate-control inputs near ±10 open or close the input, forget and output gates, and hence control what is added to, kept in, and read out of the memory cell.)
LSTM
In vector form, the input x_t is multiplied by four different weight matrices to give four vectors z_f, z_i, z and z_o, which drive the forget gate, the input gate, the cell input and the output gate. The cell state c_{t−1} is updated with elementwise multiplications and an addition to give c_t, and the gated result is the output y_t.
Extension: "peephole"
LSTM
Unrolled over time, the next step takes x_{t+1} together with the previous output h_t and, in the "peephole" extension, the cell state c_t, to produce y_{t+1}; cells can also be stacked into a multiple-layer LSTM.
Don't worry if you cannot understand this. Keras can handle it.
Keras supports
“LSTM”, “GRU”, “SimpleRNN” layers

This is quite
standard now.

https://img.komicolle.org/2015-09-20/src/14426967627131.gif
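A sketch of a slot-filling RNN in Keras (the word-vector dimension, hidden size and number of slots are assumptions; swapping LSTM for GRU or SimpleRNN only changes the imported layer):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, TimeDistributed

word_dim, num_slots = 100, 3          # assumed sizes for illustration
model = Sequential()
# each input step is a word vector; each output step is a distribution over the slots
model.add(LSTM(128, return_sequences=True, input_shape=(None, word_dim)))
model.add(TimeDistributed(Dense(num_slots, activation='softmax')))
model.compile(loss='categorical_crossentropy', optimizer='adam')
```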
LSTM
• https://www.youtube.com/watch?v=5dMXyiWddYs
Three Steps for Deep Learning
Step 1: Step 2: Step 3:
define a goodness pick the
set of of best
function function function
Deep Learning is so simple ……
Learning Target
other dest other
0 … 1 … 0 0 … 1 … 0 0 … 1 … 0

y1 y2 y3
copy copy
a1 a2 a3
a1 a2

Wi
x1 x2 x3

Training
Sentences: arrive Taipei on November 2nd
other dest other time time
Three Steps for Deep Learning
Step 1: Step 2: Step 3:
define a goodness pick the
set of of best
function function function
Deep Learning is so simple ……
Learning y1 y2

Backpropagation
through time (BPTT)
copy

a1 a2

𝑤
𝑤 ← 𝑤 − 𝜂𝜕𝐿 ∕ 𝜕𝑤 x1 x2

RNN Learning is very difficult in practice.


(Thanks to student Po-Hsiang Tseng for providing the experimental results.)
Unfortunately ……
• RNN-based network is not always easy to learn
Total Loss Real experiments on Language modeling

sometime
s

Lucky

Epoch
The error surface is rough.
The error surface is
either very flat or very
steep.

Total
Clippin

CostLoss
g

w2

w1 [Razvan Pascanu, ICML’13]


Why?
Toy example: a 1000-step linear RNN with a single recurrent weight w, input 1 at the first step and 0 afterwards, so y^1000 = w^999.
w = 1 → y^1000 = 1; w = 1.01 → y^1000 ≈ 20000 (large ∂L/∂w, so a small learning rate is needed); w = 0.99 → y^1000 ≈ 0; w = 0.01 → y^1000 ≈ 0 (small ∂L/∂w, so a large learning rate is needed).
Helpful Techniques
• Long Short-term Memory (LSTM)
• Can deal with gradient vanishing (not gradient
explode)
 Memory and input are
added
 The influence never disappears
unless forget gate is closed
No Gradient vanishing add
(If forget gate is opened.)
Gated Recurrent Unit (GRU):
simpler than LSTM [Cho, EMNLP’14]
Helpful Techniques
Structurally Constrained
Clockwise RNN
Recurrent Network (SCRN)

[Jan Koutnik, JMLR’14] [Tomas Mikolov, ICLR’15]

Vanilla RNN Initialized with Identity matrix + ReLU activation


function [Quoc V. Le, arXiv’15]
 Outperform or be comparable with LSTM in 4 different tasks
More Applications ……
Probability of Probability of Probability of
“arrive” in each slot “Taipei” in each “on” in each slot
slot
y1 y2 y3
Input store
and output are both sequences
store
a1 with the a2 length
same a3
a 1
a2
RNN can do more than that!
x1 x2 x3

arrive Taipei on November 2nd


Many to one [Shen & Lee, Interspeech 16]
• Input is a vector sequence, but output is only one vector.
Example: key term extraction. The words of a document x1, x2, x3, x4, …, xT pass through an embedding layer and a hidden layer (DNN, LSTM) to give states V1, V2, V3, V4, …, VT; attention weights α1, α2, α3, α4, …, αT combine them into Σ αi Vi, from which the output layer predicts the key terms.
Many to Many (Output is shorter)
• Both input and output are sequences, but the output is shorter.
• E.g. Speech Recognition. Input: a sequence of acoustic feature vectors; output: a character sequence such as "好棒" ("great").
Trimming repeated characters turns 好好好棒棒棒棒棒 into "好棒". Problem? Then "好棒棒" (a sarcastic variant) can never be produced.
Many to Many (Output is shorter)
• Both input and output are both sequences, but the output is shorter.
• Connectionist Temporal Classification (CTC) [Alex Graves, ICML’06][Alex Graves,
ICML’14][Haşim Sak, Interspeech’15][Jie Li, Interspeech’15][Andrew Senior, ASRU’15]

“好棒” Add an extra symbol “好棒棒”


“φ” representing “null”

好 φ φ 棒 φ φ φ φ 好 φ φ 棒 φ 棒 φ φ
Many to Many (No Limitation)
• Both input and output are both sequences with different lengths. → Sequence
to sequence learning
• E.g. Machine Translation (machine learning→機器學習)
machine

learning

Containing all
information about
input sequence
Many to Many (No Limitation)
• Both input and output are both sequences with different lengths. → Sequence
to sequence learning
• E.g. Machine Translation (machine learning→機器學習)

機 器 學 習 慣 性 ……

……
machine

learning

Don’t know when to stop


Many to Many (No Limitation)
• Both input and output are both sequences with different lengths. → Sequence
to sequence learning
• E.g. Machine Translation (machine learning→機器學習)

===
機 器 學 習
machine

learning

Add a symbol "===" (a "stop" marker)


[Ilya Sutskever, NIPS’14][Dzmitry Bahdanau, arXiv’15]
Image Caption Generation
• Input an image, but output a sequence of words
[Kelvin Xu, arXiv’15][Li Yao, ICCV’15]
A vector
for whole ===
image a woman is

CNN ……

Input
image Caption Generation
Video Caption Generation

A girl is running.

Video

A group of people is A group of people is


knocked by a tree. walking in the
Chat-bot
Attention-based Model
What you learned in Breakfast
these lectures today

What is deep
learning?

summer
vacation 10
Organiz
Answer years ago
e

http://henrylo1605.blogspot.tw/2015/05/blog-post_56.html
Attention-based Model
Input DNN/RNN output

Reading
Head
Controller
Reading Head

…… ……
Machine’s Memory
Ref:
http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/Attain%20(v3).e
cm.mp4/index.html
Attention-based Model v2

Input DNN/RNN output

Reading
Writing Head
Head
Controller
Controller

Writing Head Reading Head

…… ……
Machine’s Memory

Neural Turing Machine


Reading Comprehension

Query DNN/RNN answer

Reading
Head
Controller

Semantic
Analysis
…… ……

Each sentence becomes a vector.


Reading Comprehension
• End-To-End Memory Networks. S. Sukhbaatar, A. Szlam, J. Weston, R. Fergus.
NIPS, 2015.
The position of reading head:

Keras has example:


https://github.com/fchollet/keras/blob/master/examples/b
abi_memnn.py
Visual Question Answering

source: http://visualqa.org/
Visual Question Answering

Query DNN/RNN answer

Reading
Head
Controller

CNN A vector for


each region
Speech Question Answering
• TOEFL Listening Comprehension Test by Machine
• Example:
Audio Story: (The original story is 5 min long.)
Question: “ What is a possible origin of Venus’ clouds? ”
Choices:
(A) gases released as a result of volcanic activity
(B) chemical reactions caused by high surface temperatures
(C) bursts of radio energy from the planet's surface
(D) strong winds that blow dust into the atmosphere
Everything is learned
Model Architecture from training examples

…… It be quite possible that this be


Answer due to volcanic eruption because
volcanic eruption often emit gas. If
that be the case volcanism could very
Select the choice most Attention well be the root cause of Venus 's
similar to the answer thick cloud cover. And also we have
observe burst of radio energy from
Question Attention the planet 's surface. These burst be
Semantics similar to what we see when volcano
erupt on earth ……

Semantic Speech Semantic


Analysis Recognition Analysis

Question: “what is a possible Audio Story:


origin of
Experimental setup:
717 examples for training, 124 for validation, 122 for testing.
Simple baselines: naive approaches (1)-(7), e.g. (2) select the shortest choice as the answer, or (4) select the choice semantically most similar to the others; their accuracy is compared against random guessing.
Memory Network (proposed by the FB AI group): 39.2% accuracy.
Proposed Approach [Tseng & Lee, Interspeech 16] [Fang & Hsu & Lee, SLT 16]: 48.8% accuracy, versus the Memory Network's 39.2%.
Concluding Remarks

Convolutional Neural
Network (CNN)
Recurrent Neural
Network (RNN)
Lecture III:
Beyond Supervised
Learning
Outline

Unsupervised Learning

Reinforcement Learning
Unsupervised Learning
• 化繁為簡 ("making the complex simple"): learn a function while only having the function inputs, e.g. compressing an object into a code.
• 無中生有 ("creating something from nothing"): learn a function while only having the function outputs, e.g. generating an object from a code.
Motivation
• In MNIST, a digit is 28 x 28 dims.
• Most 28 x 28-dim vectors are not digits; the digits occupy only a small part of the space (e.g. the same digit rotated by −20°, −10°, 0°, 10°, 20° stays within it).
Auto-encoder
NN Encoder: input (28 x 28 = 784 dims) → code (usually < 784 dims), a compact representation of the input object.
NN Decoder: code → reconstruction of the original object.
The encoder and decoder are learned together so that the output of Decoder(Encoder(x)) is as close as possible to x.
Deep Auto-encoder
• NN encoder + NN decoder = a deep network

As close as possible

Output Layer
Input Layer

Layer

Layer

Layer
Layer
Layer

bottl
Layer … …

e
Encoder Decoder
𝑥 Code 𝑥ො

Reference: Hinton, Geoffrey E., and Ruslan R. Salakhutdinov. "Reducing the


dimensionality of data with neural networks." Science 313.5786 (2006): 504-507
Deep Auto-encoder
Original image (784) → encoder layers 1000 → 500 → 250 → code (30) → decoder layers 250 → 500 → 1000 → reconstructed image (784). With a 30-dim code, the deep auto-encoder reconstructs the original MNIST images far more faithfully than PCA. Shrinking the code to 2 dims (784 → 1000 → 500 → 250 → 2 → 250 → 500 → 1000 → 784) allows the digits to be visualized in a 2-D plane.
More: Contractive auto-encoder
Ref: Rifai, Salah, et al. "Contractive
Auto-encoder auto-encoders: Explicit invariance
during feature extraction.“ Proceedings
of the 28th International Conference on
Machine Learning (ICML-11). 2011.
• De-noising auto-encoder
As close as possible

encode decode
𝑐
𝑥 𝑥′ 𝑥ො
Add
noise

Vincent, Pascal, et al. "Extracting and composing robust


features with denoising autoencoders." ICML, 2008.
Auto-encoder – Pre-training DNN
• Greedy Layer-wise Pre-training again
output 10

500
Target

1000 784 𝑥ො
W1’
1000 1000
W1
Input 784 Input 784 𝑥
Auto-encoder – Pre-training DNN
• Greedy Layer-wise Pre-training again
output 10

500 1000 𝑎ො 1
W2’
Target

1000 1000
W2
1000 1000 𝑎1
fix W1
Input 784 Input 784 𝑥
Auto-encoder – Pre-training DNN
• Greedy Layer-wise Pre-training again
output 10 1000 𝑎ො 2
W3’
500 500
W3
Target

1000 1000 𝑎2
fix W2
1000 1000 𝑎1
fix W1
Input 784 Input 784 𝑥
Auto-encoder – Pre-training DNN
Fine-tune by backpropagation
• Greedy Layer-wise Pre-training again
output 10 output 10
Rando
W4 m init
500 500
W3
Target

1000 1000
W2
1000 1000
W1
Input 784 Input 784 𝑥
Word Vector/Embedding
• Machine learn the meaning of words from reading a lot of
documents without supervision

tree
flower

dog rabbit
run
jump cat
How to exploit the context?
• Count based
• If two words wi and wj frequently co-occur, V(wi) and
V(wj) would be close to each other
• E.g. Glove Vector:
http://nlp.stanford.edu/projects/glove/

V(wi) . V(wj) Ni,j


Inner product Number of times wi and wj
in the same document

• Prediction based
Prediction-based: given the preceding words …… w_{i-2} w_{i-1} ___, predict the next word w_i

0 z1
1-of-N
1 z2 The probability
encoding
0 for each word as
of the


……
……
the next word wi
word wi-1
……

 Take out the input of the z2 tree


neurons in the first layer flower
 Use it to represent a dog rabbit run
word w jump
cat
 Word vector, word
embedding feature: V(w) z1 223
Prediction-based
– Various Architectures
• Continuous bag of word (CBOW) model

wi-1
…… wi-1 ____ wi+1 …… Neural
wi
wi+ Network
1
predicting the word given its context
• Skip-gram
…… ____ wi ____ …… w wi-1
Neural
i
Network
wi+
1
predicting the context given a word 224
Word Embedding

Source: http://www.slideshare.net/hustwj/cikm-keynotenov2014
225
Word Embedding
𝑉 𝐺𝑒𝑟𝑚𝑎𝑛𝑦
• Characteristics ≈ 𝑉 𝐵𝑒𝑟𝑙𝑖𝑛 − 𝑉 𝑅𝑜𝑚𝑒 + 𝑉 𝐼𝑡𝑎𝑙𝑦

𝑉 ℎ𝑜𝑡𝑡𝑒𝑟 − 𝑉 ℎ𝑜𝑡 ≈ 𝑉 𝑏𝑖𝑔𝑔𝑒𝑟 − 𝑉 𝑏𝑖𝑔


𝑉 𝑅𝑜𝑚𝑒 − 𝑉 𝐼𝑡𝑎𝑙𝑦 ≈ 𝑉 𝐵𝑒𝑟𝑙𝑖𝑛 − 𝑉 𝐺𝑒𝑟𝑚𝑎𝑛𝑦
𝑉 𝑘𝑖𝑛𝑔 − 𝑉 𝑞𝑢𝑒𝑒𝑛 ≈ 𝑉 𝑢𝑛𝑐𝑙𝑒 − 𝑉 𝑎𝑢𝑛𝑡

• Solving analogies

Rome : Italy = Berlin : ?


Compute 𝑉 𝐵𝑒𝑟𝑙𝑖𝑛 − 𝑉 𝑅𝑜𝑚𝑒 + 𝑉 𝐼𝑡𝑎𝑙𝑦
Find the word w with the closest V(w)
226
Audio Word to Vector

Machine does not have


any prior knowledge

Machine listens to lots of


audio book

Like an
infant

[Chung, Interspeech 16)


Audio Word to Vector
• Dimension reduction for a sequence with variable length
audio segments (word-level) Fixed-length vector
dog
never
dog
Yu-An Chung, Chao-Chung Wu, Chia-Hao Shen,
never
Hung-Yi Lee, Lin-Shan Lee, Audio Word2Vec:
dogs Unsupervised Learning of Audio Segment
Representations using Sequence-to-sequence
never
Autoencoder, Interspeech 2016

ever ever
Sequence-to-sequence
Auto-encoder
vector
audio segment

RNN Encoder The values in the memory


represent the whole audio
segment
The vector we want

How to train RNN Encoder?

x1 x2 x3 x4 acoustic features

audio segment
Sequence-to-sequence
Input acoustic features
Auto-encoder
x1 x2 x3 x4
The RNN encoder and
decoder are jointly trained.
y1 y2 y3 y4
RNN Encoder

RNN Decoder
x1 x2 x3 x4 acoustic features

audio segment
Sequence-to-sequence
Auto-encoder
• Visualizing embedding vectors of the words

fear

fame

name near
Audio Word to Vector
–Application
“US President”
spoken
query

user

“US President” “US President”

Spoken Content

Compute similarity between spoken queries and audio


files on acoustic level, and find the query term
Audio Word to Vector
–Application
Audio archive divided into variable- Off-line
length audio segments

Audio
Segment to
Vector

Audio
Segment to Similarity
Spoken Vector
Query
On-line Search Result
Experimental Results
• Query-by-Example Spoken Term Detection

SA: sequence
auto-encoder

DSA: de-noising
MAP

sequence auto-encoder
Input: clean speech +
noise
output: clean speech
training epochs for sequence
auto-encoder
Next Step ……
• Can we include semantics?
walk
dog
walked
cat

run cats

flower tree
Creation

Draw something!
Creation
• Generative Models: https://openai.com/blog/generative-models/

What I cannot create,


I do not understand.
Richard Feynman

https://www.quora.com/What-did-Richard-Feynman-mean-when-he-said-What-I-
cannot-create-I-do-not-understand
Ref: Aaron van den Oord, Nal Kalchbrenner, Koray
PixelRNN Kavukcuoglu, Pixel Recurrent Neural Networks,
arXiv preprint, 2016

• To create an image, generating a pixel each time

E.g. 3 x 3 images
NN

NN

NN

……
Can be trained just with a large collection
of images without any annotation
Ref: Aaron van den Oord, Nal Kalchbrenner, Koray
PixelRNN Kavukcuoglu, Pixel Recurrent Neural Networks,
arXiv preprint, 2016

Real
Worl
d
PixelRNN – beyond Image

Audio: Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol
Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, Koray Kavukcuoglu,
WaveNet: A Generative Model for Raw Audio, arXiv preprint, 2016
Video: Nal Kalchbrenner, Aaron van den Oord, Karen Simonyan, Ivo
Danihelka, Oriol Vinyals, Alex Graves, Koray Kavukcuoglu, Video Pixel Networks ,
arXiv preprint, 2016
Auto-encoder
As close as possible

code
NN NN
Encoder Decoder

code
Randomly generate NN
a vector as code Decoder
Image ?

Variational Auto-encoder (VAE)
Ref: Auto-Encoding Variational Bayes, https://arxiv.org/abs/1312.6114
Auto-encoder

NN NN
input output
Encoder Decoder
code
VAE
The NN Encoder outputs two vectors, (m1, m2, m3) and (σ1, σ2, σ3). Sample (e1, e2, e3) from a normal distribution and form the code
c_i = exp(σ_i) × e_i + m_i,
which the NN Decoder maps back to the output. Training minimizes the reconstruction error plus
Σ_{i=1}^{3} ( exp(σ_i) − (1 + σ_i) + (m_i)² ).
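A numpy sketch of the VAE sampling step and loss above; decode stands in for the NN Decoder and is assumed to be given:

```python
import numpy as np

def vae_sample_and_loss(m, sigma, x, decode):
    # reparameterization: c_i = exp(sigma_i) * e_i + m_i, with e ~ N(0, 1)
    e = np.random.randn(*m.shape)
    c = np.exp(sigma) * e + m
    x_hat = decode(c)                            # NN Decoder (assumed given)
    reconstruction = np.sum((x - x_hat) ** 2)    # reconstruction error
    regularizer = np.sum(np.exp(sigma) - (1 + sigma) + m ** 2)
    return reconstruction + regularizer
```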
Why VAE?

decode
code

encode
VAE

Cifar-10

https://github.com/openai/iaf

Source of image: https://arxiv.org/pdf/1606.04934v1.pdf


VAE - Writing Poetry

sentence NN NN
sentence
Encoder Decoder
code
Code Space
i went to the store to buy some
groceries.
i store to buy some groceries.
i were to buy any groceries.

……
"come with me," she said.
"talk to me," she said.
"don’t worry about it," she said.

Ref: http://www.wired.co.uk/article/google-artificial-intelligence-poetry
Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal
Jozefowicz, Samy Bengio, Generating Sentences from a Continuous Space, arXiv
prepring, 2015
Problems of VAE
• It does not really try to simulate real images
code
NN As close as
Output
Decoder possible

One pixel difference One pixel difference


from the target from the target

Realistic Fake
Generative Adversarial Network (GAN)

Ref: Generative Adversarial Networks,


The evolution of generation
NN NN NN
Generator Generator Generator
v1 v2 v3

Discri- Discri- Discri-


minator minator minator
v1 v2 v3
Binary
Classifier Real images:
Cifar-10
• Which one is machine-generated?

Ref: https://openai.com/blog/generative-models/
Want to practice
Generation Models?
Pokémon Creation
• Small images of 792 Pokémon
• Can machine learn to create new Pokémons?

Don't catch them! Create them!

• Source of image:
http://bulbapedia.bulbagarden.net/wiki/List_of_Pok%C3%A9mon_by_base_stat
s_(Generation_VI)

Original image is 40 x
40
Making them into 20 x 20
Pokémon Creation

 Each pixel is represented by 3 numbers (corresponding


to RGB)
R=50, G=150, B=100

 Each pixel is represented by a 1-of-N encoding feature


0 0 1 0 0 ……

Clustering the similar color 167 colors in total


Real
Pokémon
Never seen
by machine!

Cover
50% It is difficult to evaluate
generation.

Cover
75%
Drawing from scratch
Pokémon Creation Need some randomness
Pokémon Creation
m1
m2
NN m3 𝑐1
input NN
Encoder 𝜎1 ex + 𝑐2 output
Decoder
𝜎2 p 𝑐3
𝜎3 10-dim
X
𝑒1
𝑒2
Pick two dim, 𝑒3
and fix the rest
eight

𝑐1 NN
𝑐2 Decoder ?
𝑐3
10-dim
Pokémon Creation - Data
• Original image (40 x 40):
http://speech.ee.ntu.edu.tw/~tlkagk/courses/ML_2016/Pokemon_creation/im
age.rar
• Pixels (20 x 20):
http://speech.ee.ntu.edu.tw/~tlkagk/courses/ML_2016/Pokemon_creation/pix
el_color.txt
• Each line corresponds to an image, and each number corresponds to a
pixel
• http://speech.ee.ntu.edu.tw/~tlkagk/courses/ML_2016/Pokemon_cr
eation/colormap.txt 0
1
2
……

You can use the data without permission


Outline

Unsupervised Learning

Reinforcement Learning
Scenario of Reinforcement Learning

Observation Action

Agent

Don’t do Reward
that

Environment
Scenario of Reinforcement Learning
Agent learns to take actions to
maximize expected reward.
Observation Action

Agent

Thank you. Reward

http://www.sznews.com/news/content Environment
/2013-11/26/content_8800180.htm
Supervised v.s. Reinforcement
• Supervised “Hello” Say “Hi”
Learning from
teacher “Bye bye” Say “Good bye”

• Reinforcement

……. ……. ……

Hello  …… Bad
Learning from
critics Agent Agent
Scenario of Reinforcement Learning
Agent learns to take actions to
maximize expected reward.
Observation Action

Reward Next
Move
If win, reward = 1
If loss, reward = -1
Otherwise, reward = 0
Environment
Supervised v.s. Reinforcement
• Supervised:

Next move: Next move:


“5-5” “3-3”

• Reinforcement Learning
First move …… many moves …… Win!

Alpha Go is supervised learning + reinforcement learning.


Difficulties of Reinforcement Learning
• It may be better to sacrifice immediate reward to gain more long-
term reward
• E.g. Playing Go
• Agent’s actions affect the subsequent data it receives
• E.g. Exploration
Deep Reinforcement Learning

DNN
Observation Action


Function … Function
Input Output

Used to pick the


best function Reward

Environment
Application: Interactive Retrieval
• Interactive retrieval is helpful. [Wu & Lee, INTERSPEECH 16]

“Deep Learning”

user

“Deep Learning” related to Machine Learning?


“Deep Learning” related to Education?
Deep Reinforcement Learning
• Different network depth
Some depth is needed.

Better retrieval
The task cannot be addressed
performance,
Less user labor by linear model.

More Interaction
