
Deep Learning

Deep learning attracts lots of attention.
• I believe you have seen lots of exciting results before.

[Figure: Deep learning trends at Google. Source: SIGMOD/Jeff Dean]


Ups and downs of Deep Learning
• 1958: Perceptron (linear model)
• 1969: Perceptrons shown to have limitations
• 1980s: Multi-layer perceptron
  • Not significantly different from today's DNNs
• 1986: Backpropagation
  • Usually more than 3 hidden layers did not help
• 1989: 1 hidden layer is “good enough”, so why go deep?
• 2006: RBM initialization (breakthrough)
• 2009: GPUs
• 2011: Started to become popular in speech recognition
• 2012: Won the ILSVRC image competition
Three Steps for Deep Learning

Step 1: define a set of functions (Neural Network)
Step 2: goodness of function
Step 3: pick the best function

Deep Learning is so simple ……


Neural Network

[Figure: a single “Neuron” computes a weighted sum z of its inputs plus a bias, then outputs the activation σ(z)]
Neural Network
Different connections lead to different network structures.
Network parameter 𝜃: all the weights and biases in the “neurons”
Fully Connected Feedforward Network
[Figure: with input (1, −1), weights 1 and −2 into the first neuron, −1 and 1 into the second, and biases 1 and 0, the first-layer outputs are σ(4) ≈ 0.98 and σ(−2) ≈ 0.12]

Sigmoid function: σ(z) = 1 / (1 + e^(−z))
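As a quick sanity check of the sigmoid values used in the figure above, here is a minimal Python sketch (the function name sigmoid is just for illustration):

```python
import math

def sigmoid(z):
    """Sigmoid activation: squashes any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Values used in the worked example on this slide:
print(round(sigmoid(4), 2))   # 0.98
print(round(sigmoid(-2), 2))  # 0.12
```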
Fully Connected Feedforward Network
[Figure: propagating the input (1, −1) through all three layers gives (0.98, 0.12), then (0.86, 0.11), and finally the output (0.62, 0.83)]
Fully Connected Feedforward Network
[Figure: propagating the input (0, 0) through the same network gives (0.73, 0.5), then (0.72, 0.12), and finally the output (0.51, 0.85)]

This is a function with a vector input and a vector output:
f( [1, −1] ) = [0.62, 0.83],  f( [0, 0] ) = [0.51, 0.85]

Given a network structure, we define a function set.


Fully Connected Feedforward Network
[Figure: a fully connected feedforward network. Input layer (x1 … xN) → hidden layers (Layer 1, Layer 2, …, Layer L, each made of neurons) → output layer (y1 … yM)]
Deep = Many hidden layers
[Figure: ImageNet error rates versus depth. AlexNet (2012): 8 layers, 16.4%. VGG (2014): 19 layers, 7.3%. GoogleNet (2014): 22 layers, 6.7%. Source: http://cs231n.stanford.edu/slides/winter1516_lecture8.pdf]


Deep = Many hidden layers
[Figure: Residual Net (2015) uses a special structure with 152 layers and reaches 3.57% error, compared with AlexNet (2012) 16.4%, VGG (2014) 7.3%, and GoogleNet (2014) 6.7%. For scale, Taipei 101 has 101 floors.]
Matrix Operation
[Figure: the first layer of the earlier example written as a matrix operation]

σ( [1 −2; −1 1] [1; −1] + [1; 0] ) = σ( [4; −2] ) = [0.98; 0.12]
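The matrix operation above can be checked with a minimal NumPy sketch; the weights, biases, and input are taken from the slide, while the variable names are only illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W1 = np.array([[1., -2.],
               [-1., 1.]])      # each row: weights into one neuron
b1 = np.array([1., 0.])         # one bias per neuron
x  = np.array([1., -1.])        # input vector

a1 = sigmoid(W1 @ x + b1)       # sigma(W1 x + b1)
print(np.round(a1, 2))          # [0.98 0.12]
```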
Neural Network
[Figure: layer by layer, the network computes
a1 = σ(W1 x + b1)
a2 = σ(W2 a1 + b2)
⋯
y = σ(WL aL−1 + bL)
where Wi and bi are the weight matrix and bias vector of layer i]
Neural Network
[Figure: the same network written as one chain of matrix operations]

y = f(x) = σ( WL ⋯ σ( W2 σ( W1 x + b1 ) + b2 ) ⋯ + bL )

Parallel computing techniques can be used to speed up these matrix operations.
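A hedged sketch of the forward pass y = f(x) as a chain of matrix operations; only W1 and b1 come from the slides, while the second layer's values are made-up placeholders:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Compute sigma(WL ... sigma(W2 sigma(W1 x + b1) + b2) ... + bL)."""
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)   # one layer = one matrix operation
    return a

# W1/b1 are from the slides; the second layer is an arbitrary placeholder.
weights = [np.array([[1., -2.], [-1., 1.]]),
           np.array([[0.5, -0.3], [0.2, 0.8]])]
biases  = [np.array([1., 0.]),
           np.array([0.1, -0.1])]
print(forward(np.array([1., -1.]), weights, biases))
```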
Output Layer
The hidden layers act as a feature extractor, replacing hand-crafted feature engineering.

[Figure: input layer (x1 … xK) → hidden layers → output layer (y1 … yM) with Softmax; the output layer acts as a multi-class classifier]
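A minimal sketch of a softmax output layer; subtracting the maximum score is a common numerical-stability detail that is assumed here, not shown on the slide:

```python
import numpy as np

def softmax(z):
    """Turn raw output scores into a probability distribution over classes."""
    z = z - np.max(z)            # for numerical stability
    e = np.exp(z)
    return e / e.sum()

scores = np.array([1.0, 3.0, 0.5])
print(softmax(scores))           # sums to 1; the largest score gets the highest probability
```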
Example Application

Input: a 16 × 16 = 256-pixel image of a handwritten digit, flattened into x1 … x256 (ink → 1, no ink → 0)
Output: y1 … y10, where each dimension represents the confidence that the image is a particular digit

[Figure: for an image of “2”, the network outputs e.g. y1 = 0.1 (is “1”), y2 = 0.7 (is “2”), …, y10 = 0.2 (is “0”), so the image is recognized as “2”]
Example Application
• Handwriting Digit Recognition

[Figure: a neural network maps the 256 pixel values x1 … x256 to the 10 outputs y1 … y10 and recognizes the digit “2”]

What is needed is a function with a 256-dim input vector and a 10-dim output vector.
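A small sketch of how a 16 × 16 ink/no-ink image could be flattened into the 256-dim input vector described above; the image contents are a made-up placeholder, not a real digit:

```python
import numpy as np

# A hypothetical 16 x 16 image: 1 where there is ink, 0 where there is none.
image = np.zeros((16, 16))
image[4:12, 7] = 1               # a crude vertical stroke, just for illustration

x = image.flatten()              # 256-dim input vector for the network
print(x.shape)                   # (256,)
```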
Example Application
[Figure: input layer (x1 … xN) → hidden layers (Layer 1, Layer 2, …, Layer L) → output layer (y1 … y10); the network structure defines a function set containing the candidates for handwriting digit recognition]

You need to decide the network structure so that your function set contains a good function.
FAQ

• Q: How many layers? How many neurons for each layer?
  • Trial and error + intuition
• Q: Can the structure be automatically determined?
  • E.g., evolutionary artificial neural networks
• Q: Can we design the network structure?
  • Convolutional Neural Network (CNN)
Three Steps for Deep Learning

Step 1: define a set of functions (Neural Network)
Step 2: goodness of function
Step 3: pick the best function

Deep Learning is so simple ……


Loss for an Example
[Figure: given a set of parameters, an image of “1” is fed into the network; the softmax output y = (y1, …, y10) is compared with the target ŷ = (1, 0, …, 0) using cross entropy]

Cross entropy: C(y, ŷ) = − Σ_{i=1}^{10} ŷi ln yi
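A minimal sketch of the cross-entropy loss for a single example using the formula above; the output and target values are illustrative only:

```python
import numpy as np

def cross_entropy(y, y_hat):
    """C(y, y_hat) = -sum_i y_hat_i * ln(y_i)."""
    return -np.sum(y_hat * np.log(y))

y     = np.array([0.1, 0.7, 0.2])   # network output (softmax probabilities)
y_hat = np.array([0.0, 1.0, 0.0])   # target: the example belongs to the second class
print(cross_entropy(y, y_hat))      # -ln(0.7) ~= 0.357
```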
Total Loss
For all N training examples: L = Σ_{n=1}^{N} Cⁿ

[Figure: each training example xⁿ is passed through the NN to get yⁿ, which is compared with its target ŷⁿ to give the per-example loss Cⁿ]

Find a function in the function set that minimizes the total loss L; in other words, find the network parameters θ* that minimize L.
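A hedged sketch of the total loss L as the sum of per-example cross entropies; the dummy network and examples are placeholders for illustration:

```python
import numpy as np

def cross_entropy(y, y_hat):
    return -np.sum(y_hat * np.log(y))

def total_loss(network, examples):
    """L = sum_{n=1}^{N} C^n over all training examples (x^n, y_hat^n)."""
    return sum(cross_entropy(network(x_n), y_hat_n) for x_n, y_hat_n in examples)

# Illustrative use with a dummy "network" that ignores its input:
dummy_net = lambda x: np.array([0.1, 0.7, 0.2])
examples  = [(None, np.array([0.0, 1.0, 0.0])),
             (None, np.array([1.0, 0.0, 0.0]))]
print(total_loss(dummy_net, examples))   # -ln(0.7) - ln(0.1)
```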
Three Steps for Deep Learning

Step 1: define a set of functions (Neural Network)
Step 2: goodness of function
Step 3: pick the best function

Deep Learning is so simple ……


Gradient Descent
[Figure: each parameter in θ is updated by subtracting μ times its partial derivative, e.g.
w1 ← w1 − μ ∂L/∂w1 (0.2 → 0.15)
w2 ← w2 − μ ∂L/∂w2 (−0.1 → 0.05)
b1 ← b1 − μ ∂L/∂b1 (0.3 → 0.2)
…]

∇L = ( ∂L/∂w1, ∂L/∂w2, …, ∂L/∂b1, … )ᵀ is called the gradient.
Gradient Descent
[Figure: the update is repeated iteration after iteration, e.g.
w1: 0.2 → 0.15 → 0.09 → …
w2: −0.1 → 0.05 → 0.15 → …
b1: 0.3 → 0.2 → 0.10 → …]
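A minimal sketch of the gradient descent update on these slides; the grad function is a placeholder (in practice the gradient comes from backpropagation), and the toy loss is only for illustration:

```python
import numpy as np

def gradient_descent(theta, grad, mu=0.1, steps=100):
    """Repeatedly update each parameter: theta <- theta - mu * dL/dtheta."""
    for _ in range(steps):
        theta = theta - mu * grad(theta)
    return theta

# Toy example: minimize L(theta) = ||theta||^2, whose gradient is 2 * theta.
theta0 = np.array([0.2, -0.1, 0.3])               # e.g. (w1, w2, b1) from the slide
print(gradient_descent(theta0, lambda t: 2 * t))  # approaches [0, 0, 0]
```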
Gradient Descent
This is the “learning” of machines in deep learning ……
Even AlphaGo uses this approach.
What people imagine …… versus what actually happens …..

I hope you are not too disappointed :p


Backpropagation
• Backpropagation: an efficient way to compute ∂L/∂w in neural networks
• Toolkit: libdnn (developed by NTU student 周伯威)
• Ref: http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/DNN%20backprop.ecm.mp4/index.html
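Backpropagation itself is left to the linked lecture; as a sketch of what ∂L/∂w means, the finite-difference approximation below estimates each partial derivative numerically (much slower than backpropagation, which obtains all of them in one backward pass; the toy loss is illustrative):

```python
import numpy as np

def numerical_gradient(loss, theta, eps=1e-6):
    """Approximate dL/dtheta_i by (L(theta + eps*e_i) - L(theta - eps*e_i)) / (2*eps)."""
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        step = np.zeros_like(theta)
        step[i] = eps
        grad[i] = (loss(theta + step) - loss(theta - step)) / (2 * eps)
    return grad

# Toy loss: L(theta) = sum(theta^2); the true gradient is 2 * theta.
print(numerical_gradient(lambda t: np.sum(t ** 2), np.array([0.2, -0.1, 0.3])))
```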
Concluding Remarks

Step 1: define a set of functions (Neural Network)
Step 2: goodness of function
Step 3: pick the best function

What are the benefits of deep architecture?


Deeper is Better?
Layer × Size   Word Error Rate (%)     Layer × Size   Word Error Rate (%)
1 × 2k         24.2
2 × 2k         20.4
3 × 2k         18.4
4 × 2k         17.8
5 × 2k         17.2                    1 × 3772       22.5
7 × 2k         17.1                    1 × 4634       22.6
                                       1 × 16k        22.1

Not surprising: more parameters give better performance; the wide single-hidden-layer networks on the right provide the shallow comparison.
Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription
Using Context-Dependent Deep Neural Networks." Interspeech. 2011.
Universality Theorem
Any continuous function f : R^N → R^M can be realized by a network with one hidden layer (given enough hidden neurons).
Reference for the reason: http://neuralnetworksanddeeplearning.com/chap4.html

Why a “deep” neural network and not a “fat” neural network? (next lecture)
Learn Deep Learning in Depth (“深度學習深度學習”)
• My course: Machine Learning and Having It Deep and Structured
  • http://speech.ee.ntu.edu.tw/~tlkagk/courses_MLSD15_2.html
  • 6-hour version: http://www.slideshare.net/tw_dsconf/ss-62245351
• “Neural Networks and Deep Learning”
  • written by Michael Nielsen
  • http://neuralnetworksanddeeplearning.com/
• “Deep Learning”
  • written by Yoshua Bengio, Ian J. Goodfellow and Aaron Courville
  • http://www.deeplearningbook.org
Acknowledgment
• Thanks to Victor Chen for spotting typos in the slides.
