
School of Computer Science

10-601B Introduction to Machine Learning

Deep Learning
(Part I)

Matt Gormley
Lecture 16
October 24, 2016

Readings: Nielsen (online book), Neural Networks and Deep Learning

Reminders
• Midsemester grades released today

Outline
• Deep Neural Networks (DNNs) [Part I]
  – Three ideas for training a DNN
  – Experiments: MNIST digit classification
  – Autoencoders
  – Pretraining
• Convolutional Neural Networks (CNNs)
  – Convolutional layers
  – Pooling layers
  – Image recognition
• Recurrent Neural Networks (RNNs) [Part II]
  – Bidirectional RNNs
  – Deep Bidirectional RNNs
  – Deep Bidirectional LSTMs
  – Connection to forward-backward algorithm

PRE-TRAINING FOR DEEP NETS

Background: A Recipe for Machine Learning

1. Given training data
2. Choose each of these:
   – Decision function
   – Loss function
3. Define goal
4. Train with SGD (take small steps opposite the gradient)

Goals for Today's Lecture

1. Explore a new class of decision functions (Deep Neural Networks)
2. Consider variants of this recipe for training

Training Idea #1: No pre-training
• Idea #1: (Just like a shallow network)
  – Compute the supervised gradient by backpropagation.
  – Take small steps opposite the gradient (SGD).

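As a concrete illustration of Idea #1, here is a minimal NumPy sketch: start from a random point in parameter space, compute gradients by backpropagation, and take small SGD steps. The one-hidden-layer architecture, squared-error loss, and all hyperparameters below are illustrative assumptions, not the lecture's reference setup.

```python
# A minimal sketch (illustrative, not the lecture's code) of Idea #1:
# random initialization, gradients by backpropagation, small SGD steps.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy regression data: N examples, M features, scalar target.
N, M, H = 100, 20, 16
X = rng.normal(size=(N, M))
y = rng.normal(size=N)

# Start at a random (possibly bad) point in parameter space.
W1 = rng.normal(scale=0.1, size=(H, M))   # input -> hidden
w2 = rng.normal(scale=0.1, size=H)        # hidden -> output

lr = 0.1
for epoch in range(10):
    for i in rng.permutation(N):
        a = sigmoid(W1 @ X[i])            # hidden activations
        y_hat = w2 @ a                    # prediction
        err = y_hat - y[i]                # d/dy_hat of 0.5*(y_hat - y)^2
        # Backpropagation: chain rule through both layers.
        g_w2 = err * a
        g_W1 = np.outer((err * w2) * a * (1 - a), X[i])
        # SGD: small step opposite the gradient.
        w2 -= lr * g_w2
        W1 -= lr * g_W1
```
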
Training Comparison on MNIST
• Results from Bengio et al. (2006) on the MNIST digit classification task
• [Bar chart: percent error on MNIST (lower is better); y-axis runs from 1.0 to 2.5% error]

Training Idea #1: No pre-training
• Idea #1: (Just like a shallow network)
  – Compute the supervised gradient by backpropagation.
  – Take small steps opposite the gradient (SGD).

• What goes wrong?
  A. Gets stuck in local optima
     • Nonconvex objective
     • Usually start at a random (bad) point in parameter space
  B. The gradient becomes progressively more dilute
     • "Vanishing gradients"

Training Problem A: Nonconvexity

• Where does the nonconvexity come from?
• Even a simple quadratic objective z = xy is nonconvex.
• [Figure: 3D plot of the saddle-shaped surface z = xy over the (x, y) plane]

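A quick check (added here, not on the slide) of why z = xy is nonconvex: its Hessian is constant and indefinite, so the surface is a saddle.

```latex
f(x, y) = xy, \qquad
\nabla f = \begin{pmatrix} y \\ x \end{pmatrix}, \qquad
\nabla^2 f = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}
\quad \text{with eigenvalues } +1 \text{ and } -1.
```

Along the line y = x the function behaves like x² (curving up), while along y = −x it behaves like −x² (curving down), so it has a saddle at the origin and cannot be convex. Deep networks multiply weights from successive layers together, which creates exactly this kind of bilinear interaction between parameters.
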
Training Problem A: Nonconvexity

• Stochastic Gradient Descent…
• …climbs to the top of the nearest hill…
• …which might not lead to the top of the mountain.
• [Figure: a sequence of SGD steps on a nonconvex surface, ending at a nearby local optimum rather than the global one]

Training Problem B: Vanishing Gradients

• The gradient for an edge at the base of the network depends on the gradients of many edges above it.
• The chain rule multiplies many of these partial derivatives together, so if the factors are small the product shrinks rapidly (e.g., 0.1 × 0.3 × 0.2 × 0.7 = 0.0042).
• [Figure: feed-forward network with Input x1…xM, Hidden Layers a1…aD, b1…bE, c1…cF, and Output y; example per-layer gradient factors 0.7, 0.2, 0.3, 0.1 annotate the edges from bottom to top]

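The effect is easy to see numerically. The sketch below (an illustration, not part of the lecture) pushes a vector through a deep sigmoid network and backpropagates a dummy loss; with small random weights, and sigmoid derivatives of at most 0.25, the gradient norm shrinks as it moves toward the input layer.

```python
# A minimal sketch of vanishing gradients: forward through a deep sigmoid
# network, then backpropagate and print the gradient norm at each layer.
# Depth, width, and the 0.1 weight scale are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

depth, width = 10, 50
Ws = [rng.normal(scale=0.1, size=(width, width)) for _ in range(depth)]

# Forward pass: store the activation of every layer.
h = rng.normal(size=width)
acts = [h]
for W in Ws:
    h = sigmoid(W @ h)
    acts.append(h)

# Backward pass with a dummy loss L = sum(top activations), so dL/dh_top = 1.
g = np.ones(width)
for i in reversed(range(depth)):
    a = acts[i + 1]                      # output of layer i
    g = Ws[i].T @ (g * a * (1 - a))      # chain rule: sigmoid'(z) = a * (1 - a)
    print(f"layer {i:2d}: ||dL/dh|| = {np.linalg.norm(g):.2e}")
```

Each printed norm is roughly a constant factor smaller than the one above it, which is the "dilution" the slide describes.
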
Training Idea #2: Supervised Pre-training
• Idea #2: (Two Steps)
  – Train each level of the model in a greedy way
  – Then use our original idea

1. Supervised Pre-training
   – Use labeled data
   – Work bottom-up:
     • Train hidden layer 1. Then fix its parameters.
     • Train hidden layer 2. Then fix its parameters.
     • …
     • Train hidden layer n. Then fix its parameters.
2. Supervised Fine-tuning
   – Use labeled data to train following "Idea #1"
   – Refine the features by backpropagation so that they become tuned to the end-task

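A minimal sketch of this greedy, supervised layer-wise procedure follows. Each new hidden layer is trained with a temporary linear output head that predicts y; the head is then discarded, the layer is frozen, and its activations feed the next layer. The layer widths, learning rate, squared-error loss, and the throwaway head are illustrative assumptions.

```python
# Sketch of Idea #2: supervised, greedy layer-wise pre-training (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def train_layer_supervised(F, y, width, lr=0.1, epochs=20):
    """Train one hidden layer W plus a throwaway output head v to predict y from features F."""
    W = rng.normal(scale=0.1, size=(width, F.shape[1]))
    v = rng.normal(scale=0.1, size=width)
    for _ in range(epochs):
        for i in rng.permutation(len(F)):
            a = sigmoid(W @ F[i])
            err = v @ a - y[i]                          # squared-error gradient
            g_v = err * a
            g_W = np.outer((err * v) * a * (1 - a), F[i])
            v -= lr * g_v
            W -= lr * g_W
    return W                                            # keep the layer, discard the head

# Toy labeled data and a stack of three hidden layers, trained bottom-up.
N, M = 200, 30
X, y = rng.normal(size=(N, M)), rng.normal(size=N)

features, layers = X, []
for width in [32, 32, 32]:
    W = train_layer_supervised(features, y, width)
    layers.append(W)                                    # fix this layer's parameters
    features = sigmoid(features @ W.T)                  # its outputs feed the next layer

# Step 2 (supervised fine-tuning) would now run full backpropagation through
# `layers` plus a final output layer, exactly as in Idea #1.
```

After the loop, every hidden layer has been trained against the labels, but never jointly; the fine-tuning step is what couples them.
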
Training Idea #2: Supervised Pre-training
• Idea #2: (Two Steps)
  – Train each level of the model in a greedy way
  – Then use our original idea

• [Figures: the network is grown one hidden layer at a time, each time with the Output y on top. First: Input x1…xM, Hidden Layer 1 (a1…aD), Output y. Then Hidden Layer 2 (b1…bE) is inserted beneath the output. Then Hidden Layer 3 (c1…cF).]

Training Comparison on MNIST
• Results from Bengio et al. (2006) on the MNIST digit classification task
• [Bar chart comparing the training ideas so far: percent error (lower is better), y-axis from 1.0 to 2.5% error]

Training Idea #3: Unsupervised Pre-training
• Idea #3: (Two Steps)
  – Use our original idea, but pick a better starting point
  – Train each level of the model in a greedy way

1. Unsupervised Pre-training
   – Use unlabeled data
   – Work bottom-up:
     • Train hidden layer 1. Then fix its parameters.
     • Train hidden layer 2. Then fix its parameters.
     • …
     • Train hidden layer n. Then fix its parameters.
2. Supervised Fine-tuning
   – Use labeled data to train following "Idea #1"
   – Refine the features by backpropagation so that they become tuned to the end-task

The solution: Unsupervised pre-training

Unsupervised pre-training of the first layer:
• What should it predict?
• What else do we observe?
• The input!

This topology defines an Auto-encoder.

• [Figure: Input x1, x2, x3, …, xM feeds Hidden Layer a1…aD, which is decoded back into a reconstructed "Input" x1', x2', x3', …, xM'; the full network's Output y is shown but not used in this unsupervised step]

Auto-Encoders
Key idea: Encourage z to give small reconstruction error:
– x' is the reconstruction of x
– Loss = ||x – DECODER(ENCODER(x))||²
– Train with the same backpropagation algorithm as for 2-layer Neural Networks, with xm as both input and output.

• [Figure: Input x1…xM → ENCODER z = h(Wx) → Hidden Layer a1…aD → DECODER x' = h(W'z) → reconstructed "Input" x1'…xM']

Slide adapted from Raman Arora

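A minimal sketch of one auto-encoder trained as the slide describes, with z = h(Wx), x' = h(W'z), and squared reconstruction error. The sigmoid choice for h, the layer sizes, and the learning rate are illustrative assumptions.

```python
# Sketch of a single auto-encoder layer: ENCODER z = h(Wx), DECODER x' = h(W'z),
# trained by SGD on the reconstruction loss ||x - x'||^2 (up to a constant factor).
import numpy as np

rng = np.random.default_rng(0)
h = lambda z: 1.0 / (1.0 + np.exp(-z))    # the slide's nonlinearity, assumed sigmoid here

M, D, lr = 30, 10, 0.1                    # input dim, hidden dim (D < M), step size
W = rng.normal(scale=0.1, size=(D, M))    # encoder weights W
Wp = rng.normal(scale=0.1, size=(M, D))   # decoder weights W'

X = rng.normal(size=(500, M))             # unlabeled data: only inputs are needed

for _ in range(20):
    for i in rng.permutation(len(X)):
        x = X[i]
        z = h(W @ x)                      # encode
        x_rec = h(Wp @ z)                 # decode: reconstruction x'
        err = x_rec - x                   # gradient of 0.5*||x' - x||^2 w.r.t. x'
        # Backpropagate the reconstruction error through decoder, then encoder.
        d_dec = err * x_rec * (1 - x_rec)
        d_enc = (Wp.T @ d_dec) * z * (1 - z)
        Wp -= lr * np.outer(d_dec, z)
        W -= lr * np.outer(d_enc, x)
```

Because the targets are the inputs themselves, this step needs no labels, which is what makes Idea #3's pre-training "unsupervised".
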
The solution: Unsupervised pre-training

Unsupervised pre-training:
• Work bottom-up
  – Train hidden layer 1. Then fix its parameters.
  – Train hidden layer 2. Then fix its parameters.
  – …
  – Train hidden layer n. Then fix its parameters.

• [Figures: each hidden layer is trained as an auto-encoder on the outputs of the layer below it. First Hidden Layer a1…aD reconstructs the Input x1…xM (as x1'…xM'), then Hidden Layer b1…bE reconstructs a1…aD (as a1'…aD'), then Hidden Layer c1…cF reconstructs b1…bE (as b1'…bE').]

The solution: Unsupervised pre-training

• Unsupervised pre-training: work bottom-up, training and then fixing hidden layers 1 through n, as above.
• Supervised fine-tuning: backprop and update all parameters.

• [Figure: the complete network, Input x1…xM through Hidden Layers a1…aD, b1…bE, c1…cF to Output y, now trained end to end with labels]

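Putting the pieces together, here is a compact sketch of the whole Idea #3 pipeline: greedy unsupervised pre-training of each layer as an auto-encoder, followed by supervised fine-tuning of all parameters. Every concrete choice in it (layer widths, learning rates, sigmoid nonlinearity, squared-error losses, and the regression-style output head) is an illustrative assumption, not the lecture's or Bengio et al.'s exact setup.

```python
# Sketch of Idea #3: unsupervised layer-wise pre-training (stacked auto-encoders)
# followed by supervised fine-tuning of all parameters. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)
h = lambda z: 1.0 / (1.0 + np.exp(-z))

def pretrain_autoencoder(F, width, lr=0.1, epochs=10):
    """Greedily train one layer to reconstruct its own input F; return encoder weights."""
    W = rng.normal(scale=0.1, size=(width, F.shape[1]))
    Wp = rng.normal(scale=0.1, size=(F.shape[1], width))   # throwaway decoder
    for _ in range(epochs):
        for i in rng.permutation(len(F)):
            x = F[i]
            z = h(W @ x)
            x_rec = h(Wp @ z)
            err = x_rec - x
            d_dec = err * x_rec * (1 - x_rec)
            d_enc = (Wp.T @ d_dec) * z * (1 - z)
            Wp -= lr * np.outer(d_dec, z)
            W -= lr * np.outer(d_enc, x)
    return W

# Step 1: unsupervised pre-training, bottom-up, using only unlabeled inputs.
X_unlabeled = rng.normal(size=(1000, 50))
features, layers = X_unlabeled, []
for width in [64, 32, 16]:
    W = pretrain_autoencoder(features, width)
    layers.append(W)                          # fix this layer's parameters
    features = h(features @ W.T)

# Step 2: supervised fine-tuning. Add an output layer and run ordinary
# backpropagation + SGD (Idea #1) over ALL parameters, starting from the
# pre-trained weights instead of a random point.
X_labeled = rng.normal(size=(100, 50))
y = rng.normal(size=100)
v = rng.normal(scale=0.1, size=16)            # output weights for the top layer
lr = 0.05
for _ in range(10):
    for i in rng.permutation(len(X_labeled)):
        # Forward pass through the pre-trained stack.
        acts = [X_labeled[i]]
        for W in layers:
            acts.append(h(W @ acts[-1]))
        err = v @ acts[-1] - y[i]
        # Backward pass: update the head and every pre-trained layer.
        g = err * v
        v -= lr * err * acts[-1]
        for k in reversed(range(len(layers))):
            d = g * acts[k + 1] * (1 - acts[k + 1])
            g = layers[k].T @ d
            layers[k] -= lr * np.outer(d, acts[k])
```

The only difference from Idea #1 is the starting point: the fine-tuning loop is ordinary backpropagation, but it begins from weights shaped by the unlabeled data rather than from random noise.
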
Deep Network Training
• Idea #1:
  1. Supervised fine-tuning only
• Idea #2:
  1. Supervised layer-wise pre-training
  2. Supervised fine-tuning
• Idea #3:
  1. Unsupervised layer-wise pre-training
  2. Supervised fine-tuning

Training Comparison on MNIST
• Results from Bengio et al. (2006) on the MNIST digit classification task
• [Bar chart comparing all three training ideas: percent error (lower is better), y-axis from 1.0 to 2.5% error]

Training: Is layer-wise pre-training always necessary?

• In 2010, a record on a handwriting recognition task was set by standard supervised backpropagation (our Idea #1).
• How? A very fast implementation on GPUs.
• See Ciresan et al. (2010).

Deep Learning
• Goal: learn features at different levels of abstraction
• Training can be tricky due to…
  – Nonconvexity
  – Vanishing gradients
• Unsupervised layer-wise pre-training can help with both!
