
Deep Learning

Deep learning attracts lots of attention.
• I believe you have seen lots of exciting results before.

[Figure: Deep learning trends at Google. Source: SIGMOD/Jeff Dean]


Ups and downs of Deep Learning
• 1958: Perceptron (linear model)
• 1969: Perceptrons shown to have limitations
• 1980s: Multi-layer perceptron
  • Not significantly different from today's DNNs
• 1986: Backpropagation
  • Usually more than 3 hidden layers did not help
• 1989: 1 hidden layer is “good enough”, so why go deep?
• 2006: RBM initialization (breakthrough)
• 2009: GPUs
• 2011: Started to become popular in speech recognition
• 2012: Won the ILSVRC image competition
Three Steps for Deep Learning

Step 1: define a set of functions (Neural Network)
Step 2: goodness of function
Step 3: pick the best function

Deep Learning is so simple ……


Neural Network

[Figure: a single “Neuron” computes a weighted sum z of its inputs plus a bias, then outputs the activation σ(z)]
Neural Network
Different connections lead to different network structures.
Network parameter 𝜃: all the weights and biases in the “neurons”
Fully Connected Feedforward Network
[Figure: with input (1, −1), weights 1 and −2 into the first neuron, −1 and 1 into the second, and biases 1 and 0, the first-layer outputs are σ(4) ≈ 0.98 and σ(−2) ≈ 0.12]

Sigmoid function: σ(z) = 1 / (1 + e^(−z))
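As a quick sanity check of the sigmoid values used in the figure above, here is a minimal Python sketch (the function name sigmoid is just for illustration):

```python
import math

def sigmoid(z):
    """Sigmoid activation: squashes any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Values used in the worked example on this slide:
print(round(sigmoid(4), 2))   # 0.98
print(round(sigmoid(-2), 2))  # 0.12
```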
Fully Connected Feedforward Network
[Figure: propagating the input (1, −1) through all three layers gives (0.98, 0.12), then (0.86, 0.11), and finally the output (0.62, 0.83)]
Fully Connected Feedforward Network
[Figure: propagating the input (0, 0) through the same network gives (0.73, 0.5), then (0.72, 0.12), and finally the output (0.51, 0.85)]

This is a function with a vector input and a vector output:
f( [1, −1] ) = [0.62, 0.83],  f( [0, 0] ) = [0.51, 0.85]

Given a network structure, we define a function set.


Fully Connected Feedforward Network
[Figure: a fully connected feedforward network. Input layer (x1 … xN) → hidden layers (Layer 1, Layer 2, …, Layer L, each made of neurons) → output layer (y1 … yM)]
Deep = Many hidden layers
[Figure: ImageNet error rates versus depth. AlexNet (2012): 8 layers, 16.4%. VGG (2014): 19 layers, 7.3%. GoogleNet (2014): 22 layers, 6.7%. Source: http://cs231n.stanford.edu/slides/winter1516_lecture8.pdf]


Deep = Many hidden layers
[Figure: Residual Net (2015) uses a special structure with 152 layers and reaches 3.57% error, compared with AlexNet (2012) 16.4%, VGG (2014) 7.3%, and GoogleNet (2014) 6.7%. For scale, Taipei 101 has 101 floors.]
Matrix Operation
[Figure: the first layer of the earlier example written as a matrix operation]

σ( [1 −2; −1 1] [1; −1] + [1; 0] ) = σ( [4; −2] ) = [0.98; 0.12]
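The matrix operation above can be checked with a minimal NumPy sketch; the weights, biases, and input are taken from the slide, while the variable names are only illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W1 = np.array([[1., -2.],
               [-1., 1.]])      # each row: weights into one neuron
b1 = np.array([1., 0.])         # one bias per neuron
x  = np.array([1., -1.])        # input vector

a1 = sigmoid(W1 @ x + b1)       # sigma(W1 x + b1)
print(np.round(a1, 2))          # [0.98 0.12]
```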
Neural Network
[Figure: layer by layer, the network computes
a1 = σ(W1 x + b1)
a2 = σ(W2 a1 + b2)
⋯
y = σ(WL aL−1 + bL)
where Wi and bi are the weight matrix and bias vector of layer i]
Neural Network
[Figure: the same network written as one chain of matrix operations]

y = f(x) = σ( WL ⋯ σ( W2 σ( W1 x + b1 ) + b2 ) ⋯ + bL )

Parallel computing techniques can be used to speed up these matrix operations.
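A hedged sketch of the forward pass y = f(x) as a chain of matrix operations; only W1 and b1 come from the slides, while the second layer's values are made-up placeholders:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Compute sigma(WL ... sigma(W2 sigma(W1 x + b1) + b2) ... + bL)."""
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)   # one layer = one matrix operation
    return a

# W1/b1 are from the slides; the second layer is an arbitrary placeholder.
weights = [np.array([[1., -2.], [-1., 1.]]),
           np.array([[0.5, -0.3], [0.2, 0.8]])]
biases  = [np.array([1., 0.]),
           np.array([0.1, -0.1])]
print(forward(np.array([1., -1.]), weights, biases))
```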
Output Layer
The hidden layers act as a feature extractor, replacing hand-crafted feature engineering.

[Figure: input layer (x1 … xK) → hidden layers → output layer (y1 … yM) with Softmax; the output layer acts as a multi-class classifier]
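A minimal sketch of a softmax output layer; subtracting the maximum score is a common numerical-stability detail that is assumed here, not shown on the slide:

```python
import numpy as np

def softmax(z):
    """Turn raw output scores into a probability distribution over classes."""
    z = z - np.max(z)            # for numerical stability
    e = np.exp(z)
    return e / e.sum()

scores = np.array([1.0, 3.0, 0.5])
print(softmax(scores))           # sums to 1; the largest score gets the highest probability
```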
Example Application

Input: a 16 × 16 = 256-pixel image of a handwritten digit, flattened into x1 … x256 (ink → 1, no ink → 0)
Output: y1 … y10, where each dimension represents the confidence that the image is a particular digit

[Figure: for an image of “2”, the network outputs e.g. y1 = 0.1 (is “1”), y2 = 0.7 (is “2”), …, y10 = 0.2 (is “0”), so the image is recognized as “2”]
Example Application
• Handwriting Digit Recognition

[Figure: a neural network maps the 256 pixel values x1 … x256 to the 10 outputs y1 … y10 and recognizes the digit “2”]

What is needed is a function with a 256-dim input vector and a 10-dim output vector.
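A small sketch of how a 16 × 16 ink/no-ink image could be flattened into the 256-dim input vector described above; the image contents are a made-up placeholder, not a real digit:

```python
import numpy as np

# A hypothetical 16 x 16 image: 1 where there is ink, 0 where there is none.
image = np.zeros((16, 16))
image[4:12, 7] = 1               # a crude vertical stroke, just for illustration

x = image.flatten()              # 256-dim input vector for the network
print(x.shape)                   # (256,)
```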
Example Application
[Figure: input layer (x1 … xN) → hidden layers (Layer 1, Layer 2, …, Layer L) → output layer (y1 … y10); the network structure defines a function set containing the candidates for handwriting digit recognition]

You need to decide the network structure so that your function set contains a good function.
FAQ

• Q: How many layers? How many neurons for each layer?
  • Trial and error + intuition
• Q: Can the structure be automatically determined?
  • E.g., evolutionary artificial neural networks
• Q: Can we design the network structure?
  • Convolutional Neural Network (CNN)
Three Steps for Deep Learning

Step 1: define a set of functions (Neural Network)
Step 2: goodness of function
Step 3: pick the best function

Deep Learning is so simple ……


Loss for an Example
[Figure: given a set of parameters, an image of “1” is fed into the network; the softmax output y = (y1, …, y10) is compared with the target ŷ = (1, 0, …, 0) using cross entropy]

Cross entropy: C(y, ŷ) = − Σ_{i=1}^{10} ŷi ln yi
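A minimal sketch of the cross-entropy loss for a single example using the formula above; the output and target values are illustrative only:

```python
import numpy as np

def cross_entropy(y, y_hat):
    """C(y, y_hat) = -sum_i y_hat_i * ln(y_i)."""
    return -np.sum(y_hat * np.log(y))

y     = np.array([0.1, 0.7, 0.2])   # network output (softmax probabilities)
y_hat = np.array([0.0, 1.0, 0.0])   # target: the example belongs to the second class
print(cross_entropy(y, y_hat))      # -ln(0.7) ~= 0.357
```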
Total Loss
For all N training examples: L = Σ_{n=1}^{N} Cⁿ

[Figure: each training example xⁿ is passed through the NN to get yⁿ, which is compared with its target ŷⁿ to give the per-example loss Cⁿ]

Find a function in the function set that minimizes the total loss L; in other words, find the network parameters θ* that minimize L.
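A hedged sketch of the total loss L as the sum of per-example cross entropies; the dummy network and examples are placeholders for illustration:

```python
import numpy as np

def cross_entropy(y, y_hat):
    return -np.sum(y_hat * np.log(y))

def total_loss(network, examples):
    """L = sum_{n=1}^{N} C^n over all training examples (x^n, y_hat^n)."""
    return sum(cross_entropy(network(x_n), y_hat_n) for x_n, y_hat_n in examples)

# Illustrative use with a dummy "network" that ignores its input:
dummy_net = lambda x: np.array([0.1, 0.7, 0.2])
examples  = [(None, np.array([0.0, 1.0, 0.0])),
             (None, np.array([1.0, 0.0, 0.0]))]
print(total_loss(dummy_net, examples))   # -ln(0.7) - ln(0.1)
```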
Three Steps for Deep Learning

Step 1: define a set of functions (Neural Network)
Step 2: goodness of function
Step 3: pick the best function

Deep Learning is so simple ……


Gradient Descent
[Figure: each parameter in θ is updated by subtracting μ times its partial derivative, e.g.
w1 ← w1 − μ ∂L/∂w1 (0.2 → 0.15)
w2 ← w2 − μ ∂L/∂w2 (−0.1 → 0.05)
b1 ← b1 − μ ∂L/∂b1 (0.3 → 0.2)
…]

∇L = ( ∂L/∂w1, ∂L/∂w2, …, ∂L/∂b1, … )ᵀ is called the gradient.
Gradient Descent
[Figure: the update is repeated iteration after iteration, e.g.
w1: 0.2 → 0.15 → 0.09 → …
w2: −0.1 → 0.05 → 0.15 → …
b1: 0.3 → 0.2 → 0.10 → …]
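A minimal sketch of the gradient descent update on these slides; the grad function is a placeholder (in practice the gradient comes from backpropagation), and the toy loss is only for illustration:

```python
import numpy as np

def gradient_descent(theta, grad, mu=0.1, steps=100):
    """Repeatedly update each parameter: theta <- theta - mu * dL/dtheta."""
    for _ in range(steps):
        theta = theta - mu * grad(theta)
    return theta

# Toy example: minimize L(theta) = ||theta||^2, whose gradient is 2 * theta.
theta0 = np.array([0.2, -0.1, 0.3])               # e.g. (w1, w2, b1) from the slide
print(gradient_descent(theta0, lambda t: 2 * t))  # approaches [0, 0, 0]
```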
Gradient Descent
This is the “learning” of machines in deep learning ……
Even AlphaGo uses this approach.
What people imagine …… versus what actually happens …..

I hope you are not too disappointed :p


Backpropagation
• Backpropagation: an efficient way to compute ∂L/∂w in neural networks
• Toolkit: libdnn (developed by NTU student 周伯威)
• Ref: http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/DNN%20backprop.ecm.mp4/index.html
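Backpropagation itself is left to the linked lecture; as a sketch of what ∂L/∂w means, the finite-difference approximation below estimates each partial derivative numerically (much slower than backpropagation, which obtains all of them in one backward pass; the toy loss is illustrative):

```python
import numpy as np

def numerical_gradient(loss, theta, eps=1e-6):
    """Approximate dL/dtheta_i by (L(theta + eps*e_i) - L(theta - eps*e_i)) / (2*eps)."""
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        step = np.zeros_like(theta)
        step[i] = eps
        grad[i] = (loss(theta + step) - loss(theta - step)) / (2 * eps)
    return grad

# Toy loss: L(theta) = sum(theta^2); the true gradient is 2 * theta.
print(numerical_gradient(lambda t: np.sum(t ** 2), np.array([0.2, -0.1, 0.3])))
```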
Concluding Remarks

Step 1: define a set of functions (Neural Network)
Step 2: goodness of function
Step 3: pick the best function

What are the benefits of deep architecture?


Deeper is Better?
Layer × Size   Word Error Rate (%)     Layer × Size   Word Error Rate (%)
1 × 2k         24.2
2 × 2k         20.4
3 × 2k         18.4
4 × 2k         17.8
5 × 2k         17.2                    1 × 3772       22.5
7 × 2k         17.1                    1 × 4634       22.6
                                       1 × 16k        22.1

Not surprising: more parameters give better performance; the wide single-hidden-layer networks on the right provide the shallow comparison.
Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription
Using Context-Dependent Deep Neural Networks." Interspeech. 2011.
Universality Theorem
Any continuous function f : R^N → R^M can be realized by a network with one hidden layer (given enough hidden neurons).
Reference for the reason: http://neuralnetworksanddeeplearning.com/chap4.html

Why a “deep” neural network and not a “fat” neural network? (next lecture)
Learn Deep Learning in Depth (“深度學習深度學習”)
• My course: Machine Learning and Having It Deep and Structured
  • http://speech.ee.ntu.edu.tw/~tlkagk/courses_MLSD15_2.html
  • 6-hour version: http://www.slideshare.net/tw_dsconf/ss-62245351
• “Neural Networks and Deep Learning”
  • written by Michael Nielsen
  • http://neuralnetworksanddeeplearning.com/
• “Deep Learning”
  • written by Yoshua Bengio, Ian J. Goodfellow and Aaron Courville
  • http://www.deeplearningbook.org
Acknowledgment
• Thanks to Victor Chen for spotting typos in the slides.
