
Deep Learning

Boonserm Kijsirikul
Dept. of Computer Engineering
Chulalongkorn University

What is Deep Learning?

• A cascade of many layers of nonlinear processing units for feature extraction and transformation
  – Each successive layer uses the output from the previous layer as input
• May be supervised or unsupervised
• Applications include speech recognition, image classification, text analysis, etc.

Motivation

• Deep architectures can be representationally efficient
  – Fewer computational units for the same function
• Automatic feature extraction
  – Less human effort
• Unsupervised learning
  – Modern data sets are enormous
• Deep architectures work well!
  – vision, audio, NLP, etc.

Behavior of Multilayer Neural Networks

• Single-layer: decision regions are half planes bounded by a hyperplane
• Two-layer: convex open or closed regions
• Three-layer: arbitrary regions (complexity limited by the number of nodes)
(Figure: example decision regions for each structure on the exclusive-OR problem, on classes with meshed regions, and for the most general region shapes.)
Autoencoder

• An autoencoder is a neural network trained to attempt to copy its input to the output.
• It contains two parts:
  – Encoder: maps the input to a hidden representation
  – Decoder: maps the hidden representation to the output

Unsupervised Feature Learning with a Neural Network

• Autoencoder: the network is trained to output the input (learn the identity function).
(Figure: a network with inputs x1–x6 in Layer 1, hidden units a1–a3 in Layer 2, outputs x1–x6 in Layer 3, and bias units +1.)
• This has a trivial solution unless we:
  – constrain the number of units in Layer 2 (learn a compressed representation), or
  – constrain Layer 2 to be sparse.
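
For concreteness, a minimal sketch of such an autoencoder in PyTorch (not from the original slides); the 6–3–6 layer sizes follow the figure, while the optimizer, learning rate, and toy data are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Minimal autoencoder sketch: 6 inputs -> 3 hidden units -> 6 outputs,
# trained to reproduce its input.
encoder = nn.Sequential(nn.Linear(6, 3), nn.Sigmoid())
decoder = nn.Sequential(nn.Linear(3, 6), nn.Sigmoid())
model = nn.Sequential(encoder, decoder)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

x = torch.rand(32, 6)                 # a toy batch of unlabeled inputs
for step in range(100):
    x_hat = model(x)                  # reconstruction of the input
    loss = loss_fn(x_hat, x)          # reconstruction error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

codes = encoder(x)                    # 3-dimensional compressed representation
```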

Compressed Representation

• Can this be learned?
• Hidden units transform the input space into a new space representing “constructed” features
(Figure: example binary input patterns and the compressed representation learned for them.)

Autoencoder Variants

• Bottleneck: use fewer hidden units than inputs
• Sparsity: use a penalty function that encourages most hidden unit activations to be near 0
• Denoising: train to predict the true input from a corrupted input
• Contractive: force the encoder to have small derivatives of the hidden unit outputs as the input varies
• Etc.

Sparse Autoencoder

• Training a sparse autoencoder: given unlabeled training data x(1), x(2), …, minimize an objective consisting of a reconstruction error term plus an L1 sparsity term on the hidden activations a1, a2, a3, ….

Denoising Autoencoder

• A good representation is one that can be obtained robustly from a corrupted input and that will be useful for recovering the corresponding clean input.
• The partially corrupted input is cleaned (de-noised).
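
A hedged sketch of the sparse-autoencoder objective described above, assuming a squared-error reconstruction term and an L1 penalty on the hidden activations; the penalty weight `lam` is an illustrative value, not one given in the slides:

```python
import torch
import torch.nn as nn

# Sparse-autoencoder objective sketch: reconstruction error + L1 sparsity term
# on the hidden activations a.
encoder = nn.Sequential(nn.Linear(6, 3), nn.Sigmoid())
decoder = nn.Sequential(nn.Linear(3, 6), nn.Sigmoid())
lam = 1e-3                                    # illustrative sparsity weight

def sparse_ae_loss(x):
    a = encoder(x)                            # hidden activations
    x_hat = decoder(a)                        # reconstruction
    recon = ((x_hat - x) ** 2).mean()         # reconstruction error term
    sparsity = a.abs().mean()                 # L1 sparsity term
    return recon + lam * sparsity
```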
Denoising Autoencoder Procedure

1. (Corrupting) The clean input x is partially corrupted through a stochastic mapping q_D, yielding the corrupted input x̃.
2. The corrupted input x̃ passes through a basic autoencoder and is mapped to a hidden representation y = f_θ(x̃).
3. From this hidden representation, we can reconstruct z = g_θ'(y).
4. Minimize the reconstruction error L_H(x, z).
• Choices for L_H: cross-entropy loss or squared-error loss, e.g.
  L_H(x, z) = − Σ_{j=1..d} [ x_j log z_j + (1 − x_j) log(1 − z_j) ]

Stacking Autoencoder

(Figure: an autoencoder with inputs x1–x6 in Layer 1, hidden units a1–a3 in Layer 2, outputs in Layer 3, and bias units +1.)
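
A minimal sketch of the four-step procedure, assuming the corruption q_D is random masking (one common choice; the slides do not fix a particular corruption) and inputs scaled to [0, 1] so the cross-entropy loss applies:

```python
import torch
import torch.nn as nn

# Denoising-autoencoder step sketch, assuming inputs x in [0, 1].
encoder = nn.Sequential(nn.Linear(6, 3), nn.Sigmoid())
decoder = nn.Sequential(nn.Linear(3, 6), nn.Sigmoid())
bce = nn.BCELoss()                       # L_H: cross-entropy reconstruction loss

def denoising_step(x, corruption=0.3):
    # 1. Corrupt the clean input by randomly zeroing a fraction of entries.
    x_tilde = x * (torch.rand_like(x) > corruption).float()
    y = encoder(x_tilde)                 # 2. hidden representation of the corrupted input
    z = decoder(y)                       # 3. reconstruction
    return bce(z, x)                     # 4. L_H(x, z), compared against the CLEAN input
```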

Stacking Autoencoder

(Figure: after training, the decoder is discarded and the hidden activations a1–a3 are kept as a new representation for the input x1–x6.)
Stacking Autoencoder

(Figure: a second autoencoder with hidden units b1–b3 is trained on top of the representation a1–a3; its parameters are trained so that the reconstruction matches a.)

Stacking Autoencoder

(Figure: after training the second autoencoder, the activations b1–b3 become the new representation for the input.)
Stacking Autoencoder

(Figure: a third autoencoder with hidden units c1–c3 is trained on top of the representation b1–b3 in the same way.)

Stacking Autoencoder

(Figure: the stacked encoder maps the input x1–x6 through a1–a3 and b1–b3 to c1–c3, the new representation for the input.)
• Use [c1, c2, c3] as the representation to feed to a learning algorithm.

Fine-Tuning

• After completion, run backpropagation on the entire network to fine-tune the weights for the supervised task.
• Because this backpropagation starts with good structure and weights, its credit assignment is better, and so its final results are better than if we had just run backpropagation from the start.
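
A compact sketch of the whole recipe: greedily train one autoencoder per layer on the codes of the previous one, then stack the encoders and fine-tune with backpropagation on the supervised task. Layer sizes, toy data, and training settings are illustrative assumptions:

```python
import torch
import torch.nn as nn

def train_one_autoencoder(data, n_in, n_hidden, epochs=50):
    """Train a single autoencoder on `data` and return its encoder."""
    enc = nn.Sequential(nn.Linear(n_in, n_hidden), nn.Sigmoid())
    dec = nn.Sequential(nn.Linear(n_hidden, n_in), nn.Sigmoid())
    opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-2)
    for _ in range(epochs):
        loss = ((dec(enc(data)) - data) ** 2).mean()   # reconstruction error
        opt.zero_grad(); loss.backward(); opt.step()
    return enc

x = torch.rand(128, 6)                       # unlabeled data
enc1 = train_one_autoencoder(x, 6, 3)        # learns a1..a3 from x
a = enc1(x).detach()
enc2 = train_one_autoencoder(a, 3, 3)        # learns b1..b3 from a
b = enc2(a).detach()
enc3 = train_one_autoencoder(b, 3, 3)        # learns c1..c3 from b

# Fine-tuning: stack the pretrained encoders, add a supervised output layer,
# and run backpropagation through the whole network.
classifier = nn.Sequential(enc1, enc2, enc3, nn.Linear(3, 2))
y = torch.randint(0, 2, (128,))              # toy labels
opt = torch.optim.Adam(classifier.parameters(), lr=1e-3)
for _ in range(100):
    loss = nn.functional.cross_entropy(classifier(x), y)
    opt.zero_grad(); loss.backward(); opt.step()
```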
Dropout

• “Hiding” parts of the network during training
• Allows for greater multi-function learning
• Protects against overfitting
• A wide range of dropout rates works, even 50% or more
• Builds some redundancy into the hidden units
• Essentially creates an “ensemble” of neural networks, but without the high cost of training many deep networks

Dropout

Training:
• Pick a mini-batch
• Each time before computing the gradients, each neuron has probability p% of being dropped out

Dropout

Training:
• Pick a mini-batch
• Each time before computing the gradients, each neuron has probability p% of being dropped out
  – The structure of the network is changed: the network becomes thinner
  – The new, thinner network is used for training
• For each mini-batch, we resample the dropped-out neurons

Dropout

Testing:
• No dropout
• If the dropout rate at training is p%, multiply all the weights by (1 − p)%
• Example: assume the dropout rate is 50%; if a weight is w after training, use 0.5w for testing.
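
A small numpy sketch of these two rules, assuming a single fully connected layer; at training time units are dropped at random, and at test time the weights are scaled by (1 − p):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                                   # dropout rate

def layer_train(a, W):
    mask = rng.random(a.shape) > p        # each unit kept with probability 1 - p
    return (a * mask) @ W                 # dropped units contribute nothing

def layer_test(a, W):
    return a @ (W * (1 - p))              # no dropout; weights scaled by (1 - p)

a = rng.random((4, 6))                    # activations for a mini-batch of 4
W = rng.standard_normal((6, 3))
print(layer_train(a, W))                  # a different "thinned" network each call
print(layer_test(a, W))                   # deterministic; matches the expected
                                          # training-time pre-activation
```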
Dropout

• Why should the weights be multiplied by (1 − p)% (where p% is the dropout rate) when testing?
• Training of dropout: assume the dropout rate is 50%; on average only half of the neurons feed each unit, and the weights are learned under that condition.
• Testing of dropout: no dropout, so all neurons are present; multiplying the trained weights by (1 − p)% keeps the expected input to each unit the same as during training.

Dropout is a kind of ensemble

• Training: from the training set, sample different subsets (Set 1, Set 2, Set 3, Set 4, …).
• Train a bunch of networks with different structures (Network 1, 2, 3, 4, …), one per subset.

Dropout is a kind of ensemble

• Ensemble at test time: feed the testing data x to every network (Network 1, 2, 3, 4, …), obtain the outputs y1, y2, y3, y4, …, and average them.

Dropout is a kind of ensemble

Training of dropout:
• Each mini-batch (mini-batch 1, 2, 3, 4, …) is used to train one of the “thinned” networks.
• With M neurons there are 2^M possible thinned networks.
• Some parameters in these networks are shared.
Dropout is a kind of ensemble

Testing of dropout:
• Feed the testing data x to the full network, with all the weights multiplied by (1 − p)%.
• The output y of this single network approximates the average of the outputs y1, y2, y3, … of the ensemble of thinned networks.

Softmax

• Softmax squashes the K-dimensional output vector of a K-class classifier to a K-dimensional vector of real values in the range [0, 1]:
  softmax(z)_j = exp(z_j) / Σ_{k=1..K} exp(z_k),  for j = 1, …, K
• The softmax outputs can be used as probabilities: they are non-negative and sum to 1.
• Derivative of softmax:
  ∂ softmax(z)_j / ∂ z_i = softmax(z)_j (δ_ij − softmax(z)_i)
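
A short numpy sketch of the softmax formula and its derivative as written above; subtracting the maximum is a standard numerical-stability trick, not something required by the slides:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())               # subtract the max for numerical stability
    return e / e.sum()                    # s_j = exp(z_j) / sum_k exp(z_k)

def softmax_jacobian(z):
    s = softmax(z)
    # d s_j / d z_i = s_j * (delta_ij - s_i)
    return np.diag(s) - np.outer(s, s)

z = np.array([1.0, 2.0, 3.0])
s = softmax(z)
print(s, s.sum())                         # values in [0, 1] that sum to 1
print(softmax_jacobian(z))
```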

Convolutional Neural Networks

• Inspired by the mammalian visual cortex
  – The visual cortex contains a complex arrangement of cells which are sensitive to small sub-regions of the visual field, called receptive fields. These cells act as local filters over the input space and are well suited to exploit the strong spatially local correlation present in natural images.
  – Two basic cell types:
    o Simple cells respond maximally to specific edge-like patterns within their receptive field.
    o Complex cells have larger receptive fields and are locally invariant to the exact position of the pattern.

Convolutional Neural Networks

• Intuition: a neural network with a specialized connectivity structure
  – Stacking multiple layers of feature extractors
  – Low-level layers extract local features
  – High-level layers learn global patterns
• A CNN is a list of layers that transform the input data into an output class/prediction:
  – Convolution (learned filters)
  – Non-linearity
  – Subsampling or pooling
  – Fully connected layers
  – Output
(Figure: input image → convolution (learned) → non-linearity → pooling → feature maps.)
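
A minimal PyTorch sketch of this layer list (convolution → non-linearity → pooling → fully connected → output); the 1×28×28 input, 16 filters, and 10 output classes are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Convolution -> non-linearity -> pooling -> fully connected -> output.
cnn = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=16, kernel_size=5, padding=2),  # convolution (learned)
    nn.ReLU(),                                                            # non-linearity
    nn.MaxPool2d(kernel_size=2),                                          # subsampling / pooling
    nn.Flatten(),
    nn.Linear(16 * 14 * 14, 10),                                          # fully connected -> output
)

x = torch.rand(8, 1, 28, 28)          # a batch of 8 single-channel images
print(cnn(x).shape)                   # torch.Size([8, 10]) class scores
```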
Convolutional Neural Networks

(Figure: an example convolutional network architecture.)

LeNet-5

(Figures: the LeNet-5 architecture and its layer-by-layer processing, shown over several slides.)

Big Difference from Small Shift

(Figure.)

An Example of Handcrafted Features

(Figure.)
Connectivity & Weight Sharing Depends on Layer

(Figure: a fully connected layer has all-different weights; a locally connected layer has different weights at each location; a convolution layer has shared weights.)
• The convolution layer has a much smaller number of parameters because of local connection and weight sharing.

Convolution Layer

• Detects the same feature at different positions in the input image
(Figure: the input is convolved with a filter (kernel) to produce a feature map of detected features.)

Activation Functions

Sigmoid Function

(Figure: sigmoid(x) = 1 / (1 + e^(−x)), which squashes its input to the range (0, 1).)

tanh Function

(Figure: tanh(x), which squashes its input to the range (−1, 1).)

ReLU Function

(Figure: ReLU(x) = max(0, x).)

Leaky ReLU

(Figure: LeakyReLU(x) = x for x > 0 and αx for x ≤ 0, with a small slope α.)

Maxout

• Generalizes ReLU and Leaky ReLU
• Does not saturate
• Does not die
• Doubles the number of parameters per neuron

Maxout

(Figure: each maxout unit outputs the max over a group of linear units computed from the inputs x1, x2, ….)
• You can have more than two elements in a group.

Exponential Linear Unit (ELU)

(Figure: ELU(x) = x for x > 0 and α(e^x − 1) for x ≤ 0.)
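
For reference, numpy definitions of the activation functions on these slides (not from the slides themselves); the α values are common defaults, and the maxout weight shapes are an illustrative choice:

```python
import numpy as np

def sigmoid(x):            return 1.0 / (1.0 + np.exp(-x))
def tanh(x):               return np.tanh(x)
def relu(x):               return np.maximum(0.0, x)
def leaky_relu(x, a=0.01): return np.where(x > 0, x, a * x)
def elu(x, a=1.0):         return np.where(x > 0, x, a * (np.exp(x) - 1))

def maxout(x, W, b):
    """Maxout: max over a group of linear units. W has shape
    (n_in, n_units, group_size); with group_size = 2 the parameters
    per neuron are doubled, as the slide notes."""
    z = np.einsum('i,igk->gk', x, W) + b      # b has shape (n_units, group_size)
    return z.max(axis=1)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x), leaky_relu(x), elu(x))
W = np.random.default_rng(0).standard_normal((4, 3, 2))   # 3 maxout units, groups of 2
b = np.zeros((3, 2))
print(maxout(x, W, b))
```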

Sub-sampling (Pooling) Layer

• Spatial pooling
  – Average or max
• Role of pooling
  – Invariance to small transformations
  – Reduces the effect of noise and of shift or distortion

Average & Max Pooling

• Average pooling: each output is the average of the values in the pooling window.
• Max pooling: each output is the maximum of the values in the pooling window.
(Figure: examples of average and max pooling over the same input.)
Normalization

• Contrast normalization (between/across feature maps)
  – Equalizes the feature maps
(Figure: feature maps before and after contrast normalization.)

Convolutional Layer

• The core layer of CNNs
• The convolutional layer consists of a set of filters.
  – Each filter covers a spatially small portion of the input data.
• Each filter is convolved across the dimensions of the input data, producing a multidimensional feature map.
  – As we convolve the filter, we are computing the dot product between the parameters of the filter and the input.
• Intuition: the network will learn filters that activate when they see some specific type of feature at some spatial position in the input.
• The key architectural characteristics of the convolutional layer are local connectivity and shared weights.

Convolutional Layer

(Figures: a filter w slides over the input x; each output value is the dot product between the filter weights and the input patch it currently covers, e.g. o = w1·x1 + w2·x2 + … + w9·x9 for a 3×3 filter, shown position by position over several slides.)
Average Pooling with No Padding

• No padding – input = 3×3, filter = 2×2, stride = 1×1

  Input:    Output:
  1 2 3     3 4
  4 5 6     6 7
  7 8 9

• Each output is the average of the 2×2 window it covers; the slides fill in the output one position at a time (3, then 4, then 6, then 7).
Average Pooling with Padding

• Zero padding – input = 3×3, filter = 2×2, stride = 1×1
• The input is padded with an extra row and column of zeros:

  Padded input:    Output:
  1 2 3 0          3    4    4.5
  4 5 6 0          6    7    7.5
  7 8 9 0          7.5  8.5  9
  0 0 0 0

• The output is filled in one position at a time; zeros introduced by padding are not counted in the average, e.g. the top-right output is (3 + 6) / 2 = 4.5.
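
A numpy sketch that reproduces the pooling numbers above; it assumes zeros are padded only on the right and bottom and that padded entries are excluded from the average, which is how the 4.5 = (3 + 6) / 2 value arises:

```python
import numpy as np

def avg_pool2d(x, k=2, stride=1, pad_right_bottom=0):
    """Average pooling over k x k windows; padded zeros are excluded
    from the average."""
    valid = np.ones_like(x, dtype=float)
    if pad_right_bottom:
        p = pad_right_bottom
        x = np.pad(x.astype(float), ((0, p), (0, p)))
        valid = np.pad(valid, ((0, p), (0, p)))       # 0 where the input was padded
    H, W = x.shape
    out_h = (H - k) // stride + 1
    out_w = (W - k) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            window = x[r:r + k, c:c + k]
            count = valid[r:r + k, c:c + k].sum()     # number of real (non-padded) entries
            out[i, j] = window.sum() / count
    return out

x = np.arange(1, 10).reshape(3, 3)                    # [[1,2,3],[4,5,6],[7,8,9]]
print(avg_pool2d(x))                                  # [[3. 4.] [6. 7.]]
print(avg_pool2d(x, pad_right_bottom=1))              # [[3. 4. 4.5] [6. 7. 7.5] [7.5 8.5 9.]]
```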
Max Pooling with Padding

• Zero padding – input = 3×5, filter = 3×3, stride = 2×2
(Figures: the max-pooling output is computed one window at a time over several slides.)
An Example of Conv Layer

• Computing one output value: each 3×3 slice of the filter is multiplied element-wise with the corresponding patch of its input channel, the per-channel sums are added together, and a bias of 1 is added:
  (0 + 0 + 0 + 0 + 0 + 0 + 0 − 2 − 2)
  + (0 + 0 + 0 + 0 + 0 + 0 + 0 + 1 + 0)
  + (0 + 0 + 0 + 0 + 0 + 0 + 0 − 1 + 0)
  + 1
  = −3

(Figures: the same computation is repeated at each spatial position of the input, filling in the output feature map one value at a time over several slides.)
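
A numpy sketch of this computation: each output value is the sum over input channels of the element-wise product between the filter and the patch it covers, plus a bias. The random 3-channel input, filter values, and stride are illustrative, not the values from the slides:

```python
import numpy as np

def conv2d(x, w, b, stride=1):
    """x: (C, H, W) input, w: (C, k, k) filter, b: scalar bias.
    Each output value = sum over channels of (filter . patch) + bias."""
    C, H, W = x.shape
    _, k, _ = w.shape
    out_h = (H - k) // stride + 1
    out_w = (W - k) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[:, i * stride:i * stride + k, j * stride:j * stride + k]
            out[i, j] = np.sum(patch * w) + b
    return out

rng = np.random.default_rng(0)
x = rng.integers(-1, 3, size=(3, 5, 5))     # a toy 3-channel 5x5 input
w = rng.integers(-1, 2, size=(3, 3, 3))     # one 3-channel 3x3 filter
print(conv2d(x, w, b=1))                    # 3x3 map of filter responses
```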

Learned Features

(Figure: examples of learned features.)

Feature Hierarchies

• pixels → edges → object parts (combinations of edges) → object models

Examples of Learned Object Parts from Object Categories

(Figure: learned object parts for the categories Faces, Cars, Elephants, and Chairs.)

Recurrent Neural Networks

• Recurrent Neural Networks are networks with loops in them, allowing information to persist.

Recurrent Neural Networks

(Figure: the network processes a sequence of inputs, delimited by start-of-sequence and end-of-sequence markers; some information is passed from one subunit to the next.)

Architecture for a 1980s RNN

(Figure: a sequence of inputs is mapped to a sequence of outputs through recurrently connected hidden units.)
• Problem with this: the unrolled network is extremely deep and very hard to train.
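
A minimal numpy sketch of the recurrence such a network computes, with the hidden state h carrying information from one step to the next; the sizes and the tanh nonlinearity are illustrative assumptions. Unrolled over a long sequence, this is the very deep network the slide warns about:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 4, 8, 3
W_xh = rng.standard_normal((n_in, n_hidden)) * 0.1
W_hh = rng.standard_normal((n_hidden, n_hidden)) * 0.1
W_hy = rng.standard_normal((n_hidden, n_out)) * 0.1

def rnn_forward(xs):
    """xs: list of input vectors. Information persists in h across steps."""
    h = np.zeros(n_hidden)
    ys = []
    for x in xs:
        h = np.tanh(x @ W_xh + h @ W_hh)   # new state from input and previous state
        ys.append(h @ W_hy)                # output at this step
    return ys

sequence = [rng.standard_normal(n_in) for _ in range(5)]
outputs = rnn_forward(sequence)            # one output per input in the sequence
```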
Examples of Recurrent Neural Networks

(1) Standard mode of processing without an RNN, from fixed-sized input to fixed-sized output (e.g. image classification).
(2) Sequence output (e.g. image captioning takes an image and outputs a sentence of words).
(3) Sequence input (e.g. sentiment analysis, where a given sentence is classified as expressing positive or negative sentiment).
(4) Sequence input and sequence output (e.g. machine translation: an RNN reads a sentence in English and then outputs a sentence in French).
(5) Synced sequence input and output (e.g. video classification, where we wish to label each frame of the video).

Long Short-Term Memory (LSTM)

• Short-term dependencies: a standard RNN can use recent context, e.g. a language model trying to predict the next word based on the previous few words.

Long Short-Term Memory (LSTM)

• Long-term dependencies: when the relevant context lies much further back, e.g. a language model trying to predict the next word based on words seen long before, a standard RNN has difficulty.

The Vanishing Gradient Problem

• The influence of a given input on the hidden layer, and thus on the network output, either decays or blows up exponentially as it cycles around the network's recurrent connections.
Vanishing Gradients

(Figure.)

Long Short-Term Memory (LSTM)

• Uses memory blocks
• A memory block contains
  – a self-connected memory cell
  – 3 multiplicative units: the input, output, and forget gates
• The gates allow LSTM memory cells to store and access information over long periods of time, thereby mitigating the vanishing gradient problem.
(Figure: an LSTM memory block.)

Simply Replace the Neurons with LSTM

(Figures: a feed-forward network over inputs x1, x2 with each hidden neuron replaced by an LSTM memory block.)

Preservation of Gradient Information

• The shading in the figure indicates each node's sensitivity to the inputs.
• The memory cell remembers the first input as long as the forget gate is open and the input gate is closed.
• The sensitivity of the output layer can be switched on and off by the output gate without affecting the cell.

Standard RNN

• The repeating module in a standard RNN contains a single layer.
• The module has a very simple structure.
(Figure from http://colah.github.io/posts/2015-08-Understanding-LSTMs/)

Repeating Module in an LSTM

• The repeating module in an LSTM contains four interacting layers.
(Figure: the module holds “bits of memory”; it decides what to forget, decides what to insert, and combines that with the transformed input xt.)

The Core Idea Behind LSTMs

• The horizontal line running through the top of the diagram represents the cell state.
• The cell state runs straight down the entire chain, with only some minor linear interactions.
• It is very easy for information to just flow along it unchanged.
The Core Idea Behind LSTMs

• The LSTM has the ability to remove or add information to the cell state, carefully regulated by structures called gates.
• Gates optionally let information through; each gate is composed of
  – a sigmoid neural net layer, and
  – a pointwise multiplication operation.
• The sigmoid outputs numbers between zero and one, describing how much of each component should be let through.
  – Zero means “let nothing through”
  – One means “let everything through”

Forget Gate

• ft: what part of memory to “forget” – zero means forget this bit
  ft = σ(Wf · [h(t−1), xt] + bf)
• First, decide what information to throw away from the cell state, using a sigmoid layer called the “forget gate”.
• It looks at h(t−1) and xt, and outputs a number between 0 and 1 for each number in the cell state C(t−1).
  – 1 represents “completely keep this”
  – 0 represents “completely get rid of this”

Input Gate

• it: what bits to insert into the next state
  it = σ(Wi · [h(t−1), xt] + bi)
• C̃t: what content to store into the next state
  C̃t = tanh(WC · [h(t−1), xt] + bC)
• Next, decide what new information to store in the state:
  – a sigmoid layer called the “input gate” decides which values to update
  – a tanh layer creates a vector of new candidate values C̃t that could be added to the state
  – these two are combined to create an update to the state

Update the Cell State

• Ct: the next memory cell content – a mixture of the not-forgotten part of the previous cell and the insertion
  Ct = ft ∗ C(t−1) + it ∗ C̃t
• Update the old cell state C(t−1) into the new cell state Ct by
  – multiplying C(t−1) by ft, forgetting the things we decided to forget earlier, and
  – then adding it ∗ C̃t, the new candidate values scaled by how much we decided to update each state value.
Output Gate

• ot: what part of the cell to output
  ot = σ(Wo · [h(t−1), xt] + bo)
• tanh(Ct): maps the cell state to [−1, +1]
  ht = ot ∗ tanh(Ct)
• Finally, decide what to output; the output is based on the cell state.
  – First, run a sigmoid layer to decide what parts of the cell state to output.
  – Then, put the cell state through tanh (to push the values to be between −1 and 1) and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to.
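
Putting the gate equations together, a numpy sketch of one LSTM step; stacking the four weight matrices into a single W is an implementation convenience, and all sizes are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM step. W: (4*H, H + D) and b: (4*H,) hold the parameters of
    the forget, input, candidate, and output layers stacked together."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f_t = sigmoid(z[0:H])                # forget gate: what to keep from C_prev
    i_t = sigmoid(z[H:2*H])              # input gate: what to insert
    C_tilde = np.tanh(z[2*H:3*H])        # candidate values
    o_t = sigmoid(z[3*H:4*H])            # output gate: what part of the cell to expose
    C_t = f_t * C_prev + i_t * C_tilde   # update the cell state
    h_t = o_t * np.tanh(C_t)             # output, based on the cell state
    return h_t, C_t

rng = np.random.default_rng(0)
D, H = 4, 3
W = rng.standard_normal((4 * H, H + D)) * 0.1
b = np.zeros(4 * H)
h, C = np.zeros(H), np.zeros(H)
for t in range(5):                       # run over a toy sequence
    h, C = lstm_step(rng.standard_normal(D), h, C, W, b)
```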
