
School of Computer Science

10-601B Introduction to Machine Learning

Deep Learning
(Part I)

Matt Gormley
Lecture 16
October 24, 2016

Readings: Nielsen (online book), Neural Networks and Deep Learning

Reminders
• Midsemester grades released today

Outline
• Deep Neural Networks (DNNs) [Part I]
  – Three ideas for training a DNN
  – Experiments: MNIST digit classification
  – Autoencoders
  – Pretraining
• Convolutional Neural Networks (CNNs)
  – Convolutional layers
  – Pooling layers
  – Image recognition
• Recurrent Neural Networks (RNNs) [Part II]
  – Bidirectional RNNs
  – Deep Bidirectional RNNs
  – Deep Bidirectional LSTMs
  – Connection to forward-backward algorithm

PRE-TRAINING FOR DEEP NETS

Background: A Recipe for Machine Learning

1. Given training data
2. Choose each of these:
   – Decision function
   – Loss function
3. Define goal
4. Train with SGD (take small steps opposite the gradient)

Goals for Today's Lecture

1. Explore a new class of decision functions (Deep Neural Networks)
2. Consider variants of this recipe for training

Training Idea #1: No pre-training
• Idea #1: (Just like a shallow network)
  – Compute the supervised gradient by backpropagation.
  – Take small steps opposite the gradient (SGD).

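As a concrete illustration of Idea #1, here is a minimal NumPy sketch: start from a random point in parameter space, compute gradients by backpropagation, and take small SGD steps. The one-hidden-layer architecture, squared-error loss, and all hyperparameters below are illustrative assumptions, not the lecture's reference setup.

```python
# A minimal sketch (illustrative, not the lecture's code) of Idea #1:
# random initialization, gradients by backpropagation, small SGD steps.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy regression data: N examples, M features, scalar target.
N, M, H = 100, 20, 16
X = rng.normal(size=(N, M))
y = rng.normal(size=N)

# Start at a random (possibly bad) point in parameter space.
W1 = rng.normal(scale=0.1, size=(H, M))   # input -> hidden
w2 = rng.normal(scale=0.1, size=H)        # hidden -> output

lr = 0.1
for epoch in range(10):
    for i in rng.permutation(N):
        a = sigmoid(W1 @ X[i])            # hidden activations
        y_hat = w2 @ a                    # prediction
        err = y_hat - y[i]                # d/dy_hat of 0.5*(y_hat - y)^2
        # Backpropagation: chain rule through both layers.
        g_w2 = err * a
        g_W1 = np.outer((err * w2) * a * (1 - a), X[i])
        # SGD: small step opposite the gradient.
        w2 -= lr * g_w2
        W1 -= lr * g_W1
```
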
Training Comparison on MNIST
• Results from Bengio et al. (2006) on the MNIST digit classification task
• [Bar chart: percent error on MNIST (lower is better); y-axis runs from 1.0 to 2.5% error]

Training Idea #1: No pre-training
• Idea #1: (Just like a shallow network)
  – Compute the supervised gradient by backpropagation.
  – Take small steps opposite the gradient (SGD).

• What goes wrong?
  A. Gets stuck in local optima
     • Nonconvex objective
     • Usually start at a random (bad) point in parameter space
  B. The gradient becomes progressively more dilute
     • "Vanishing gradients"

Training Problem A: Nonconvexity

• Where does the nonconvexity come from?
• Even a simple quadratic objective z = xy is nonconvex.
• [Figure: 3D plot of the saddle-shaped surface z = xy over the (x, y) plane]

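A quick check (added here, not on the slide) of why z = xy is nonconvex: its Hessian is constant and indefinite, so the surface is a saddle.

```latex
f(x, y) = xy, \qquad
\nabla f = \begin{pmatrix} y \\ x \end{pmatrix}, \qquad
\nabla^2 f = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}
\quad \text{with eigenvalues } +1 \text{ and } -1.
```

Along the line y = x the function behaves like x² (curving up), while along y = −x it behaves like −x² (curving down), so it has a saddle at the origin and cannot be convex. Deep networks multiply weights from successive layers together, which creates exactly this kind of bilinear interaction between parameters.
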
Training Problem A: Nonconvexity

• Stochastic Gradient Descent…
• …climbs to the top of the nearest hill…
• …which might not lead to the top of the mountain.
• [Figure: a sequence of SGD steps on a nonconvex surface, ending at a nearby local optimum rather than the global one]

Training Problem B: Vanishing Gradients

• The gradient for an edge at the base of the network depends on the gradients of many edges above it.
• The chain rule multiplies many of these partial derivatives together, so if the factors are small the product shrinks rapidly (e.g., 0.1 × 0.3 × 0.2 × 0.7 = 0.0042).
• [Figure: feed-forward network with Input x1…xM, Hidden Layers a1…aD, b1…bE, c1…cF, and Output y; example per-layer gradient factors 0.7, 0.2, 0.3, 0.1 annotate the edges from bottom to top]

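The effect is easy to see numerically. The sketch below (an illustration, not part of the lecture) pushes a vector through a deep sigmoid network and backpropagates a dummy loss; with small random weights, and sigmoid derivatives of at most 0.25, the gradient norm shrinks as it moves toward the input layer.

```python
# A minimal sketch of vanishing gradients: forward through a deep sigmoid
# network, then backpropagate and print the gradient norm at each layer.
# Depth, width, and the 0.1 weight scale are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

depth, width = 10, 50
Ws = [rng.normal(scale=0.1, size=(width, width)) for _ in range(depth)]

# Forward pass: store the activation of every layer.
h = rng.normal(size=width)
acts = [h]
for W in Ws:
    h = sigmoid(W @ h)
    acts.append(h)

# Backward pass with a dummy loss L = sum(top activations), so dL/dh_top = 1.
g = np.ones(width)
for i in reversed(range(depth)):
    a = acts[i + 1]                      # output of layer i
    g = Ws[i].T @ (g * a * (1 - a))      # chain rule: sigmoid'(z) = a * (1 - a)
    print(f"layer {i:2d}: ||dL/dh|| = {np.linalg.norm(g):.2e}")
```

Each printed norm is roughly a constant factor smaller than the one above it, which is the "dilution" the slide describes.
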
Training Idea #2: Supervised Pre-training
• Idea #2: (Two Steps)
  – Train each level of the model in a greedy way
  – Then use our original idea

1. Supervised Pre-training
   – Use labeled data
   – Work bottom-up:
     • Train hidden layer 1. Then fix its parameters.
     • Train hidden layer 2. Then fix its parameters.
     • …
     • Train hidden layer n. Then fix its parameters.
2. Supervised Fine-tuning
   – Use labeled data to train following "Idea #1"
   – Refine the features by backpropagation so that they become tuned to the end-task

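A minimal sketch of this greedy, supervised layer-wise procedure follows. Each new hidden layer is trained with a temporary linear output head that predicts y; the head is then discarded, the layer is frozen, and its activations feed the next layer. The layer widths, learning rate, squared-error loss, and the throwaway head are illustrative assumptions.

```python
# Sketch of Idea #2: supervised, greedy layer-wise pre-training (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def train_layer_supervised(F, y, width, lr=0.1, epochs=20):
    """Train one hidden layer W plus a throwaway output head v to predict y from features F."""
    W = rng.normal(scale=0.1, size=(width, F.shape[1]))
    v = rng.normal(scale=0.1, size=width)
    for _ in range(epochs):
        for i in rng.permutation(len(F)):
            a = sigmoid(W @ F[i])
            err = v @ a - y[i]                          # squared-error gradient
            g_v = err * a
            g_W = np.outer((err * v) * a * (1 - a), F[i])
            v -= lr * g_v
            W -= lr * g_W
    return W                                            # keep the layer, discard the head

# Toy labeled data and a stack of three hidden layers, trained bottom-up.
N, M = 200, 30
X, y = rng.normal(size=(N, M)), rng.normal(size=N)

features, layers = X, []
for width in [32, 32, 32]:
    W = train_layer_supervised(features, y, width)
    layers.append(W)                                    # fix this layer's parameters
    features = sigmoid(features @ W.T)                  # its outputs feed the next layer

# Step 2 (supervised fine-tuning) would now run full backpropagation through
# `layers` plus a final output layer, exactly as in Idea #1.
```

After the loop, every hidden layer has been trained against the labels, but never jointly; the fine-tuning step is what couples them.
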
Training Idea #2: Supervised Pre-training
• Idea #2: (Two Steps)
  – Train each level of the model in a greedy way
  – Then use our original idea

• [Figures: the network is grown one hidden layer at a time, each time with the Output y on top. First: Input x1…xM, Hidden Layer 1 (a1…aD), Output y. Then Hidden Layer 2 (b1…bE) is inserted beneath the output. Then Hidden Layer 3 (c1…cF).]

Training Comparison on MNIST
• Results from Bengio et al. (2006) on the MNIST digit classification task
• [Bar chart comparing the training ideas so far: percent error (lower is better), y-axis from 1.0 to 2.5% error]

Training Idea #3: Unsupervised Pre-training
• Idea #3: (Two Steps)
  – Use our original idea, but pick a better starting point
  – Train each level of the model in a greedy way

1. Unsupervised Pre-training
   – Use unlabeled data
   – Work bottom-up:
     • Train hidden layer 1. Then fix its parameters.
     • Train hidden layer 2. Then fix its parameters.
     • …
     • Train hidden layer n. Then fix its parameters.
2. Supervised Fine-tuning
   – Use labeled data to train following "Idea #1"
   – Refine the features by backpropagation so that they become tuned to the end-task

The solution: Unsupervised pre-training

Unsupervised pre-training of the first layer:
• What should it predict?
• What else do we observe?
• The input!

This topology defines an Auto-encoder.

• [Figure: Input x1, x2, x3, …, xM feeds Hidden Layer a1…aD, which is decoded back into a reconstructed "Input" x1', x2', x3', …, xM'; the full network's Output y is shown but not used in this unsupervised step]

Auto-Encoders
Key idea: Encourage z to give small reconstruction error:
– x' is the reconstruction of x
– Loss = ||x – DECODER(ENCODER(x))||²
– Train with the same backpropagation algorithm as for 2-layer Neural Networks, with xm as both input and output.

• [Figure: Input x1…xM → ENCODER z = h(Wx) → Hidden Layer a1…aD → DECODER x' = h(W'z) → reconstructed "Input" x1'…xM']

Slide adapted from Raman Arora

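A minimal sketch of one auto-encoder trained as the slide describes, with z = h(Wx), x' = h(W'z), and squared reconstruction error. The sigmoid choice for h, the layer sizes, and the learning rate are illustrative assumptions.

```python
# Sketch of a single auto-encoder layer: ENCODER z = h(Wx), DECODER x' = h(W'z),
# trained by SGD on the reconstruction loss ||x - x'||^2 (up to a constant factor).
import numpy as np

rng = np.random.default_rng(0)
h = lambda z: 1.0 / (1.0 + np.exp(-z))    # the slide's nonlinearity, assumed sigmoid here

M, D, lr = 30, 10, 0.1                    # input dim, hidden dim (D < M), step size
W = rng.normal(scale=0.1, size=(D, M))    # encoder weights W
Wp = rng.normal(scale=0.1, size=(M, D))   # decoder weights W'

X = rng.normal(size=(500, M))             # unlabeled data: only inputs are needed

for _ in range(20):
    for i in rng.permutation(len(X)):
        x = X[i]
        z = h(W @ x)                      # encode
        x_rec = h(Wp @ z)                 # decode: reconstruction x'
        err = x_rec - x                   # gradient of 0.5*||x' - x||^2 w.r.t. x'
        # Backpropagate the reconstruction error through decoder, then encoder.
        d_dec = err * x_rec * (1 - x_rec)
        d_enc = (Wp.T @ d_dec) * z * (1 - z)
        Wp -= lr * np.outer(d_dec, z)
        W -= lr * np.outer(d_enc, x)
```

Because the targets are the inputs themselves, this step needs no labels, which is what makes Idea #3's pre-training "unsupervised".
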
The solution: Unsupervised pre-training

Unsupervised pre-training:
• Work bottom-up
  – Train hidden layer 1. Then fix its parameters.
  – Train hidden layer 2. Then fix its parameters.
  – …
  – Train hidden layer n. Then fix its parameters.

• [Figures: each hidden layer is trained as an auto-encoder on the outputs of the layer below it. First Hidden Layer a1…aD reconstructs the Input x1…xM (as x1'…xM'), then Hidden Layer b1…bE reconstructs a1…aD (as a1'…aD'), then Hidden Layer c1…cF reconstructs b1…bE (as b1'…bE').]

The solution: Unsupervised pre-training

• Unsupervised pre-training: work bottom-up, training and then fixing hidden layers 1 through n, as above.
• Supervised fine-tuning: backprop and update all parameters.

• [Figure: the complete network, Input x1…xM through Hidden Layers a1…aD, b1…bE, c1…cF to Output y, now trained end to end with labels]

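Putting the pieces together, here is a compact sketch of the whole Idea #3 pipeline: greedy unsupervised pre-training of each layer as an auto-encoder, followed by supervised fine-tuning of all parameters. Every concrete choice in it (layer widths, learning rates, sigmoid nonlinearity, squared-error losses, and the regression-style output head) is an illustrative assumption, not the lecture's or Bengio et al.'s exact setup.

```python
# Sketch of Idea #3: unsupervised layer-wise pre-training (stacked auto-encoders)
# followed by supervised fine-tuning of all parameters. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)
h = lambda z: 1.0 / (1.0 + np.exp(-z))

def pretrain_autoencoder(F, width, lr=0.1, epochs=10):
    """Greedily train one layer to reconstruct its own input F; return encoder weights."""
    W = rng.normal(scale=0.1, size=(width, F.shape[1]))
    Wp = rng.normal(scale=0.1, size=(F.shape[1], width))   # throwaway decoder
    for _ in range(epochs):
        for i in rng.permutation(len(F)):
            x = F[i]
            z = h(W @ x)
            x_rec = h(Wp @ z)
            err = x_rec - x
            d_dec = err * x_rec * (1 - x_rec)
            d_enc = (Wp.T @ d_dec) * z * (1 - z)
            Wp -= lr * np.outer(d_dec, z)
            W -= lr * np.outer(d_enc, x)
    return W

# Step 1: unsupervised pre-training, bottom-up, using only unlabeled inputs.
X_unlabeled = rng.normal(size=(1000, 50))
features, layers = X_unlabeled, []
for width in [64, 32, 16]:
    W = pretrain_autoencoder(features, width)
    layers.append(W)                          # fix this layer's parameters
    features = h(features @ W.T)

# Step 2: supervised fine-tuning. Add an output layer and run ordinary
# backpropagation + SGD (Idea #1) over ALL parameters, starting from the
# pre-trained weights instead of a random point.
X_labeled = rng.normal(size=(100, 50))
y = rng.normal(size=100)
v = rng.normal(scale=0.1, size=16)            # output weights for the top layer
lr = 0.05
for _ in range(10):
    for i in rng.permutation(len(X_labeled)):
        # Forward pass through the pre-trained stack.
        acts = [X_labeled[i]]
        for W in layers:
            acts.append(h(W @ acts[-1]))
        err = v @ acts[-1] - y[i]
        # Backward pass: update the head and every pre-trained layer.
        g = err * v
        v -= lr * err * acts[-1]
        for k in reversed(range(len(layers))):
            d = g * acts[k + 1] * (1 - acts[k + 1])
            g = layers[k].T @ d
            layers[k] -= lr * np.outer(d, acts[k])
```

The only difference from Idea #1 is the starting point: the fine-tuning loop is ordinary backpropagation, but it begins from weights shaped by the unlabeled data rather than from random noise.
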
Deep Network Training
• Idea #1:
  1. Supervised fine-tuning only
• Idea #2:
  1. Supervised layer-wise pre-training
  2. Supervised fine-tuning
• Idea #3:
  1. Unsupervised layer-wise pre-training
  2. Supervised fine-tuning

Training Comparison on MNIST
• Results from Bengio et al. (2006) on the MNIST digit classification task
• [Bar chart comparing all three training ideas: percent error (lower is better), y-axis from 1.0 to 2.5% error]

Training: Is layer-wise pre-training always necessary?

• In 2010, a record on a handwriting recognition task was set by standard supervised backpropagation (our Idea #1).
• How? A very fast implementation on GPUs.
• See Ciresan et al. (2010).

Deep Learning
• Goal: learn features at different levels of abstraction
• Training can be tricky due to…
  – Nonconvexity
  – Vanishing gradients
• Unsupervised layer-wise pre-training can help with both!
