
Deep Learning

Boonserm Kijsirikul
Dept. of Computer Engineering
Chulalongkorn University

What is Deep Learning?

• A cascade of many layers of nonlinear processing units for feature extraction and transformation
  – Each successive layer uses the output from the previous layer as input
• May be supervised or unsupervised
• Applications include speech recognition, image classification, text analysis, etc.

Motivation

• Deep architectures can be representationally efficient
  – Fewer computational units for the same function
• Automatic feature extraction
  – Less human effort
• Unsupervised learning
  – Modern data sets are enormous
• Deep architectures work well!
  – vision, audio, NLP, etc.

Behavior of Multilayer Neural Networks

• Single-layer: decision regions are half planes bounded by a hyperplane
• Two-layer: convex open or closed regions
• Three-layer: arbitrary regions (complexity limited by the number of nodes)
(Figure: example decision regions for each structure on the exclusive-OR problem, on classes with meshed regions, and for the most general region shapes.)
Autoencoder

• An autoencoder is a neural network trained to attempt to copy its input to the output.
• It contains two parts:
  – Encoder: maps the input to a hidden representation
  – Decoder: maps the hidden representation to the output

Unsupervised Feature Learning with a Neural Network

• Autoencoder: the network is trained to output the input (learn the identity function).
(Figure: a network with inputs x1–x6 in Layer 1, hidden units a1–a3 in Layer 2, outputs x1–x6 in Layer 3, and bias units +1.)
• This has a trivial solution unless we:
  – constrain the number of units in Layer 2 (learn a compressed representation), or
  – constrain Layer 2 to be sparse.
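
For concreteness, a minimal sketch of such an autoencoder in PyTorch (not from the original slides); the 6–3–6 layer sizes follow the figure, while the optimizer, learning rate, and toy data are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Minimal autoencoder sketch: 6 inputs -> 3 hidden units -> 6 outputs,
# trained to reproduce its input.
encoder = nn.Sequential(nn.Linear(6, 3), nn.Sigmoid())
decoder = nn.Sequential(nn.Linear(3, 6), nn.Sigmoid())
model = nn.Sequential(encoder, decoder)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

x = torch.rand(32, 6)                 # a toy batch of unlabeled inputs
for step in range(100):
    x_hat = model(x)                  # reconstruction of the input
    loss = loss_fn(x_hat, x)          # reconstruction error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

codes = encoder(x)                    # 3-dimensional compressed representation
```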

Compressed Representation

• Can this be learned?
• Hidden units transform the input space into a new space representing “constructed” features
(Figure: example binary input patterns and the compressed representation learned for them.)

Autoencoder Variants

• Bottleneck: use fewer hidden units than inputs
• Sparsity: use a penalty function that encourages most hidden unit activations to be near 0
• Denoising: train to predict the true input from a corrupted input
• Contractive: force the encoder to have small derivatives of the hidden unit outputs as the input varies
• Etc.

Sparse Autoencoder

• Training a sparse autoencoder: given unlabeled training data x(1), x(2), …, minimize an objective consisting of a reconstruction error term plus an L1 sparsity term on the hidden activations a1, a2, a3, ….

Denoising Autoencoder

• A good representation is one that can be obtained robustly from a corrupted input and that will be useful for recovering the corresponding clean input.
• The partially corrupted input is cleaned (de-noised).
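
A hedged sketch of the sparse-autoencoder objective described above, assuming a squared-error reconstruction term and an L1 penalty on the hidden activations; the penalty weight `lam` is an illustrative value, not one given in the slides:

```python
import torch
import torch.nn as nn

# Sparse-autoencoder objective sketch: reconstruction error + L1 sparsity term
# on the hidden activations a.
encoder = nn.Sequential(nn.Linear(6, 3), nn.Sigmoid())
decoder = nn.Sequential(nn.Linear(3, 6), nn.Sigmoid())
lam = 1e-3                                    # illustrative sparsity weight

def sparse_ae_loss(x):
    a = encoder(x)                            # hidden activations
    x_hat = decoder(a)                        # reconstruction
    recon = ((x_hat - x) ** 2).mean()         # reconstruction error term
    sparsity = a.abs().mean()                 # L1 sparsity term
    return recon + lam * sparsity
```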
Denoising Autoencoder Procedure

1. (Corrupting) The clean input x is partially corrupted through a stochastic mapping q_D, yielding the corrupted input x̃.
2. The corrupted input x̃ passes through a basic autoencoder and is mapped to a hidden representation y = f_θ(x̃).
3. From this hidden representation, we can reconstruct z = g_θ'(y).
4. Minimize the reconstruction error L_H(x, z).
• Choices for L_H: cross-entropy loss or squared-error loss, e.g.
  L_H(x, z) = − Σ_{j=1..d} [ x_j log z_j + (1 − x_j) log(1 − z_j) ]

Stacking Autoencoder

(Figure: an autoencoder with inputs x1–x6 in Layer 1, hidden units a1–a3 in Layer 2, outputs in Layer 3, and bias units +1.)
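
A minimal sketch of the four-step procedure, assuming the corruption q_D is random masking (one common choice; the slides do not fix a particular corruption) and inputs scaled to [0, 1] so the cross-entropy loss applies:

```python
import torch
import torch.nn as nn

# Denoising-autoencoder step sketch, assuming inputs x in [0, 1].
encoder = nn.Sequential(nn.Linear(6, 3), nn.Sigmoid())
decoder = nn.Sequential(nn.Linear(3, 6), nn.Sigmoid())
bce = nn.BCELoss()                       # L_H: cross-entropy reconstruction loss

def denoising_step(x, corruption=0.3):
    # 1. Corrupt the clean input by randomly zeroing a fraction of entries.
    x_tilde = x * (torch.rand_like(x) > corruption).float()
    y = encoder(x_tilde)                 # 2. hidden representation of the corrupted input
    z = decoder(y)                       # 3. reconstruction
    return bce(z, x)                     # 4. L_H(x, z), compared against the CLEAN input
```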

Stacking Autoencoder

(Figure: after training, the decoder is discarded and the hidden activations a1–a3 are kept as a new representation for the input x1–x6.)
Stacking Autoencoder

(Figure: a second autoencoder with hidden units b1–b3 is trained on top of the representation a1–a3; its parameters are trained so that the reconstruction matches a.)

Stacking Autoencoder

(Figure: after training the second autoencoder, the activations b1–b3 become the new representation for the input.)
Stacking Autoencoder

(Figure: a third autoencoder with hidden units c1–c3 is trained on top of the representation b1–b3 in the same way.)

Stacking Autoencoder

(Figure: the stacked encoder maps the input x1–x6 through a1–a3 and b1–b3 to c1–c3, the new representation for the input.)
• Use [c1, c2, c3] as the representation to feed to a learning algorithm.

Fine-Tuning

• After completion, run backpropagation on the entire network to fine-tune the weights for the supervised task.
• Because this backpropagation starts with good structure and weights, its credit assignment is better, and so its final results are better than if we had just run backpropagation from the start.
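
A compact sketch of the whole recipe: greedily train one autoencoder per layer on the codes of the previous one, then stack the encoders and fine-tune with backpropagation on the supervised task. Layer sizes, toy data, and training settings are illustrative assumptions:

```python
import torch
import torch.nn as nn

def train_one_autoencoder(data, n_in, n_hidden, epochs=50):
    """Train a single autoencoder on `data` and return its encoder."""
    enc = nn.Sequential(nn.Linear(n_in, n_hidden), nn.Sigmoid())
    dec = nn.Sequential(nn.Linear(n_hidden, n_in), nn.Sigmoid())
    opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-2)
    for _ in range(epochs):
        loss = ((dec(enc(data)) - data) ** 2).mean()   # reconstruction error
        opt.zero_grad(); loss.backward(); opt.step()
    return enc

x = torch.rand(128, 6)                       # unlabeled data
enc1 = train_one_autoencoder(x, 6, 3)        # learns a1..a3 from x
a = enc1(x).detach()
enc2 = train_one_autoencoder(a, 3, 3)        # learns b1..b3 from a
b = enc2(a).detach()
enc3 = train_one_autoencoder(b, 3, 3)        # learns c1..c3 from b

# Fine-tuning: stack the pretrained encoders, add a supervised output layer,
# and run backpropagation through the whole network.
classifier = nn.Sequential(enc1, enc2, enc3, nn.Linear(3, 2))
y = torch.randint(0, 2, (128,))              # toy labels
opt = torch.optim.Adam(classifier.parameters(), lr=1e-3)
for _ in range(100):
    loss = nn.functional.cross_entropy(classifier(x), y)
    opt.zero_grad(); loss.backward(); opt.step()
```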
Dropout

• “Hiding” parts of the network during training
• Allows for greater multi-function learning
• Protects against overfitting
• A wide range of dropout rates works, even 50% or more
• Builds some redundancy into the hidden units
• Essentially creates an “ensemble” of neural networks, but without the high cost of training many deep networks

Dropout

Training:
• Pick a mini-batch
• Each time before computing the gradients, each neuron has probability p% of being dropped out

Dropout

Training:
• Pick a mini-batch
• Each time before computing the gradients, each neuron has probability p% of being dropped out
  – The structure of the network is changed: the network becomes thinner
  – The new, thinner network is used for training
• For each mini-batch, we resample the dropped-out neurons

Dropout

Testing:
• No dropout
• If the dropout rate at training is p%, multiply all the weights by (1 − p)%
• Example: assume the dropout rate is 50%; if a weight is w after training, use 0.5w for testing.
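
A small numpy sketch of these two rules, assuming a single fully connected layer; at training time units are dropped at random, and at test time the weights are scaled by (1 − p):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                                   # dropout rate

def layer_train(a, W):
    mask = rng.random(a.shape) > p        # each unit kept with probability 1 - p
    return (a * mask) @ W                 # dropped units contribute nothing

def layer_test(a, W):
    return a @ (W * (1 - p))              # no dropout; weights scaled by (1 - p)

a = rng.random((4, 6))                    # activations for a mini-batch of 4
W = rng.standard_normal((6, 3))
print(layer_train(a, W))                  # a different "thinned" network each call
print(layer_test(a, W))                   # deterministic; matches the expected
                                          # training-time pre-activation
```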
Dropout

• Why should the weights be multiplied by (1 − p)% (where p% is the dropout rate) when testing?
• Training of dropout: assume the dropout rate is 50%; on average only half of the neurons feed each unit, and the weights are learned under that condition.
• Testing of dropout: no dropout, so all neurons are present; multiplying the trained weights by (1 − p)% keeps the expected input to each unit the same as during training.

Dropout is a kind of ensemble

• Training: from the training set, sample different subsets (Set 1, Set 2, Set 3, Set 4, …).
• Train a bunch of networks with different structures (Network 1, 2, 3, 4, …), one per subset.

Dropout is a kind of ensemble

• Ensemble at test time: feed the testing data x to every network (Network 1, 2, 3, 4, …), obtain the outputs y1, y2, y3, y4, …, and average them.

Dropout is a kind of ensemble

Training of dropout:
• Each mini-batch (mini-batch 1, 2, 3, 4, …) is used to train one of the “thinned” networks.
• With M neurons there are 2^M possible thinned networks.
• Some parameters in these networks are shared.
Dropout is a kind of ensemble

Testing of dropout:
• Feed the testing data x to the full network, with all the weights multiplied by (1 − p)%.
• The output y of this single network approximates the average of the outputs y1, y2, y3, … of the ensemble of thinned networks.

Softmax

• Softmax squashes the K-dimensional output vector of a K-class classifier to a K-dimensional vector of real values in the range [0, 1]:
  softmax(z)_j = exp(z_j) / Σ_{k=1..K} exp(z_k),  for j = 1, …, K
• The softmax outputs can be used as probabilities: they are non-negative and sum to 1.
• Derivative of softmax:
  ∂ softmax(z)_j / ∂ z_i = softmax(z)_j (δ_ij − softmax(z)_i)
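
A short numpy sketch of the softmax formula and its derivative as written above; subtracting the maximum is a standard numerical-stability trick, not something required by the slides:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())               # subtract the max for numerical stability
    return e / e.sum()                    # s_j = exp(z_j) / sum_k exp(z_k)

def softmax_jacobian(z):
    s = softmax(z)
    # d s_j / d z_i = s_j * (delta_ij - s_i)
    return np.diag(s) - np.outer(s, s)

z = np.array([1.0, 2.0, 3.0])
s = softmax(z)
print(s, s.sum())                         # values in [0, 1] that sum to 1
print(softmax_jacobian(z))
```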

Convolutional Neural Networks

• Inspired by the mammalian visual cortex
  – The visual cortex contains a complex arrangement of cells which are sensitive to small sub-regions of the visual field, called receptive fields. These cells act as local filters over the input space and are well suited to exploit the strong spatially local correlation present in natural images.
  – Two basic cell types:
    o Simple cells respond maximally to specific edge-like patterns within their receptive field.
    o Complex cells have larger receptive fields and are locally invariant to the exact position of the pattern.

Convolutional Neural Networks

• Intuition: a neural network with a specialized connectivity structure
  – Stacking multiple layers of feature extractors
  – Low-level layers extract local features
  – High-level layers learn global patterns
• A CNN is a list of layers that transform the input data into an output class/prediction:
  – Convolution (learned filters)
  – Non-linearity
  – Subsampling or pooling
  – Fully connected layers
  – Output
(Figure: input image → convolution (learned) → non-linearity → pooling → feature maps.)
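
A minimal PyTorch sketch of this layer list (convolution → non-linearity → pooling → fully connected → output); the 1×28×28 input, 16 filters, and 10 output classes are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Convolution -> non-linearity -> pooling -> fully connected -> output.
cnn = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=16, kernel_size=5, padding=2),  # convolution (learned)
    nn.ReLU(),                                                            # non-linearity
    nn.MaxPool2d(kernel_size=2),                                          # subsampling / pooling
    nn.Flatten(),
    nn.Linear(16 * 14 * 14, 10),                                          # fully connected -> output
)

x = torch.rand(8, 1, 28, 28)          # a batch of 8 single-channel images
print(cnn(x).shape)                   # torch.Size([8, 10]) class scores
```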
Convolutional Neural Networks

(Figure: an example convolutional network architecture.)

LeNet-5

(Figures: the LeNet-5 architecture and its layer-by-layer processing, shown over several slides.)

Big Difference from Small Shift

(Figure.)

An Example of Handcrafted Features

(Figure.)
Connectivity & Weight Sharing Depends on Layer

(Figure: a fully connected layer has all-different weights; a locally connected layer has different weights at each location; a convolution layer has shared weights.)
• The convolution layer has a much smaller number of parameters because of local connection and weight sharing.

Convolution Layer

• Detects the same feature at different positions in the input image
(Figure: the input is convolved with a filter (kernel) to produce a feature map of detected features.)

Activation Functions

Sigmoid Function

(Figure: sigmoid(x) = 1 / (1 + e^(−x)), which squashes its input to the range (0, 1).)

tanh Function

(Figure: tanh(x), which squashes its input to the range (−1, 1).)

ReLU Function

(Figure: ReLU(x) = max(0, x).)

Leaky ReLU

(Figure: LeakyReLU(x) = x for x > 0 and αx for x ≤ 0, with a small slope α.)

Maxout

• Generalizes ReLU and Leaky ReLU
• Does not saturate
• Does not die
• Doubles the number of parameters per neuron

Maxout

(Figure: each maxout unit outputs the max over a group of linear units computed from the inputs x1, x2, ….)
• You can have more than two elements in a group.

Exponential Linear Unit (ELU)

(Figure: ELU(x) = x for x > 0 and α(e^x − 1) for x ≤ 0.)
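
For reference, numpy definitions of the activation functions on these slides (not from the slides themselves); the α values are common defaults, and the maxout weight shapes are an illustrative choice:

```python
import numpy as np

def sigmoid(x):            return 1.0 / (1.0 + np.exp(-x))
def tanh(x):               return np.tanh(x)
def relu(x):               return np.maximum(0.0, x)
def leaky_relu(x, a=0.01): return np.where(x > 0, x, a * x)
def elu(x, a=1.0):         return np.where(x > 0, x, a * (np.exp(x) - 1))

def maxout(x, W, b):
    """Maxout: max over a group of linear units. W has shape
    (n_in, n_units, group_size); with group_size = 2 the parameters
    per neuron are doubled, as the slide notes."""
    z = np.einsum('i,igk->gk', x, W) + b      # b has shape (n_units, group_size)
    return z.max(axis=1)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x), leaky_relu(x), elu(x))
W = np.random.default_rng(0).standard_normal((4, 3, 2))   # 3 maxout units, groups of 2
b = np.zeros((3, 2))
print(maxout(x, W, b))
```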

Sub-sampling (Pooling) Layer

• Spatial pooling
  – Average or max
• Role of pooling
  – Invariance to small transformations
  – Reduces the effect of noise and of shift or distortion

Average & Max Pooling

• Average pooling: each output is the average of the values in the pooling window.
• Max pooling: each output is the maximum of the values in the pooling window.
(Figure: examples of average and max pooling over the same input.)
Normalization

• Contrast normalization (between/across feature maps)
  – Equalizes the feature maps
(Figure: feature maps before and after contrast normalization.)

Convolutional Layer

• The core layer of CNNs
• The convolutional layer consists of a set of filters.
  – Each filter covers a spatially small portion of the input data.
• Each filter is convolved across the dimensions of the input data, producing a multidimensional feature map.
  – As we convolve the filter, we are computing the dot product between the parameters of the filter and the input.
• Intuition: the network will learn filters that activate when they see some specific type of feature at some spatial position in the input.
• The key architectural characteristics of the convolutional layer are local connectivity and shared weights.

Convolutional Layer

(Figures: a filter w slides over the input x; each output value is the dot product between the filter weights and the input patch it currently covers, e.g. o = w1·x1 + w2·x2 + … + w9·x9 for a 3×3 filter, shown position by position over several slides.)
Average Pooling with No Padding

• No padding – input = 3×3, filter = 2×2, stride = 1×1

  Input:    Output:
  1 2 3     3 4
  4 5 6     6 7
  7 8 9

• Each output is the average of the 2×2 window it covers; the slides fill in the output one position at a time (3, then 4, then 6, then 7).
Average Pooling with Padding

• Zero padding – input = 3×3, filter = 2×2, stride = 1×1
• The input is padded with an extra row and column of zeros:

  Padded input:    Output:
  1 2 3 0          3    4    4.5
  4 5 6 0          6    7    7.5
  7 8 9 0          7.5  8.5  9
  0 0 0 0

• The output is filled in one position at a time; zeros introduced by padding are not counted in the average, e.g. the top-right output is (3 + 6) / 2 = 4.5.
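
A numpy sketch that reproduces the pooling numbers above; it assumes zeros are padded only on the right and bottom and that padded entries are excluded from the average, which is how the 4.5 = (3 + 6) / 2 value arises:

```python
import numpy as np

def avg_pool2d(x, k=2, stride=1, pad_right_bottom=0):
    """Average pooling over k x k windows; padded zeros are excluded
    from the average."""
    valid = np.ones_like(x, dtype=float)
    if pad_right_bottom:
        p = pad_right_bottom
        x = np.pad(x.astype(float), ((0, p), (0, p)))
        valid = np.pad(valid, ((0, p), (0, p)))       # 0 where the input was padded
    H, W = x.shape
    out_h = (H - k) // stride + 1
    out_w = (W - k) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            window = x[r:r + k, c:c + k]
            count = valid[r:r + k, c:c + k].sum()     # number of real (non-padded) entries
            out[i, j] = window.sum() / count
    return out

x = np.arange(1, 10).reshape(3, 3)                    # [[1,2,3],[4,5,6],[7,8,9]]
print(avg_pool2d(x))                                  # [[3. 4.] [6. 7.]]
print(avg_pool2d(x, pad_right_bottom=1))              # [[3. 4. 4.5] [6. 7. 7.5] [7.5 8.5 9.]]
```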
Max Pooling with Padding

• Zero padding – input = 3×5, filter = 3×3, stride = 2×2
(Figures: the max-pooling output is computed one window at a time over several slides.)
An Example of Conv Layer

• Computing one output value: each 3×3 slice of the filter is multiplied element-wise with the corresponding patch of its input channel, the per-channel sums are added together, and a bias of 1 is added:
  (0 + 0 + 0 + 0 + 0 + 0 + 0 − 2 − 2)
  + (0 + 0 + 0 + 0 + 0 + 0 + 0 + 1 + 0)
  + (0 + 0 + 0 + 0 + 0 + 0 + 0 − 1 + 0)
  + 1
  = −3

(Figures: the same computation is repeated at each spatial position of the input, filling in the output feature map one value at a time over several slides.)
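
A numpy sketch of this computation: each output value is the sum over input channels of the element-wise product between the filter and the patch it covers, plus a bias. The random 3-channel input, filter values, and stride are illustrative, not the values from the slides:

```python
import numpy as np

def conv2d(x, w, b, stride=1):
    """x: (C, H, W) input, w: (C, k, k) filter, b: scalar bias.
    Each output value = sum over channels of (filter . patch) + bias."""
    C, H, W = x.shape
    _, k, _ = w.shape
    out_h = (H - k) // stride + 1
    out_w = (W - k) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[:, i * stride:i * stride + k, j * stride:j * stride + k]
            out[i, j] = np.sum(patch * w) + b
    return out

rng = np.random.default_rng(0)
x = rng.integers(-1, 3, size=(3, 5, 5))     # a toy 3-channel 5x5 input
w = rng.integers(-1, 2, size=(3, 3, 3))     # one 3-channel 3x3 filter
print(conv2d(x, w, b=1))                    # 3x3 map of filter responses
```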

Learned Features

(Figure: examples of learned features.)

Feature Hierarchies

• pixels → edges → object parts (combinations of edges) → object models

Examples of Learned Object Parts from Object Categories

(Figure: learned object parts for the categories Faces, Cars, Elephants, and Chairs.)

Recurrent Neural Networks

• Recurrent Neural Networks are networks with loops in them, allowing information to persist.

Recurrent Neural Networks

(Figure: the network processes a sequence of inputs, delimited by start-of-sequence and end-of-sequence markers; some information is passed from one subunit to the next.)

Architecture for a 1980s RNN

(Figure: a sequence of inputs is mapped to a sequence of outputs through recurrently connected hidden units.)
• Problem with this: the unrolled network is extremely deep and very hard to train.
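
A minimal numpy sketch of the recurrence such a network computes, with the hidden state h carrying information from one step to the next; the sizes and the tanh nonlinearity are illustrative assumptions. Unrolled over a long sequence, this is the very deep network the slide warns about:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 4, 8, 3
W_xh = rng.standard_normal((n_in, n_hidden)) * 0.1
W_hh = rng.standard_normal((n_hidden, n_hidden)) * 0.1
W_hy = rng.standard_normal((n_hidden, n_out)) * 0.1

def rnn_forward(xs):
    """xs: list of input vectors. Information persists in h across steps."""
    h = np.zeros(n_hidden)
    ys = []
    for x in xs:
        h = np.tanh(x @ W_xh + h @ W_hh)   # new state from input and previous state
        ys.append(h @ W_hy)                # output at this step
    return ys

sequence = [rng.standard_normal(n_in) for _ in range(5)]
outputs = rnn_forward(sequence)            # one output per input in the sequence
```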
Examples of Recurrent Neural Networks

(1) Standard mode of processing without an RNN, from fixed-sized input to fixed-sized output (e.g. image classification).
(2) Sequence output (e.g. image captioning takes an image and outputs a sentence of words).
(3) Sequence input (e.g. sentiment analysis, where a given sentence is classified as expressing positive or negative sentiment).
(4) Sequence input and sequence output (e.g. machine translation: an RNN reads a sentence in English and then outputs a sentence in French).
(5) Synced sequence input and output (e.g. video classification, where we wish to label each frame of the video).

Long Short-Term Memory (LSTM)

• Short-term dependencies: a standard RNN can use recent context, e.g. a language model trying to predict the next word based on the previous few words.

Long Short-Term Memory (LSTM)

• Long-term dependencies: when the relevant context lies much further back, e.g. a language model trying to predict the next word based on words seen long before, a standard RNN has difficulty.

The Vanishing Gradient Problem

• The influence of a given input on the hidden layer, and thus on the network output, either decays or blows up exponentially as it cycles around the network's recurrent connections.
Vanishing Gradients

(Figure.)

Long Short-Term Memory (LSTM)

• Uses memory blocks
• A memory block contains
  – a self-connected memory cell
  – 3 multiplicative units: the input, output, and forget gates
• The gates allow LSTM memory cells to store and access information over long periods of time, thereby mitigating the vanishing gradient problem.
(Figure: an LSTM memory block.)

Simply Replace the Neurons with LSTM

(Figures: a feed-forward network over inputs x1, x2 with each hidden neuron replaced by an LSTM memory block.)

Preservation of Gradient Information

• The shading in the figure indicates each node's sensitivity to the inputs.
• The memory cell remembers the first input as long as the forget gate is open and the input gate is closed.
• The sensitivity of the output layer can be switched on and off by the output gate without affecting the cell.

Standard RNN

• The repeating module in a standard RNN contains a single layer.
• The module has a very simple structure.
(Figure from http://colah.github.io/posts/2015-08-Understanding-LSTMs/)

Repeating Module in an LSTM

• The repeating module in an LSTM contains four interacting layers.
(Figure: the module holds “bits of memory”; it decides what to forget, decides what to insert, and combines that with the transformed input xt.)

The Core Idea Behind LSTMs

• The horizontal line running through the top of the diagram represents the cell state.
• The cell state runs straight down the entire chain, with only some minor linear interactions.
• It is very easy for information to just flow along it unchanged.
The Core Idea Behind LSTMs

• The LSTM has the ability to remove or add information to the cell state, carefully regulated by structures called gates.
• Gates optionally let information through; each gate is composed of
  – a sigmoid neural net layer, and
  – a pointwise multiplication operation.
• The sigmoid outputs numbers between zero and one, describing how much of each component should be let through.
  – Zero means “let nothing through”
  – One means “let everything through”

Forget Gate

• ft: what part of memory to “forget” – zero means forget this bit
  ft = σ(Wf · [h(t−1), xt] + bf)
• First, decide what information to throw away from the cell state, using a sigmoid layer called the “forget gate”.
• It looks at h(t−1) and xt, and outputs a number between 0 and 1 for each number in the cell state C(t−1).
  – 1 represents “completely keep this”
  – 0 represents “completely get rid of this”

Input Gate

• it: what bits to insert into the next state
  it = σ(Wi · [h(t−1), xt] + bi)
• C̃t: what content to store into the next state
  C̃t = tanh(WC · [h(t−1), xt] + bC)
• Next, decide what new information to store in the state:
  – a sigmoid layer called the “input gate” decides which values to update
  – a tanh layer creates a vector of new candidate values C̃t that could be added to the state
  – these two are combined to create an update to the state

Update the Cell State

• Ct: the next memory cell content – a mixture of the not-forgotten part of the previous cell and the insertion
  Ct = ft ∗ C(t−1) + it ∗ C̃t
• Update the old cell state C(t−1) into the new cell state Ct by
  – multiplying C(t−1) by ft, forgetting the things we decided to forget earlier, and
  – then adding it ∗ C̃t, the new candidate values scaled by how much we decided to update each state value.
Output Gate

• ot: what part of the cell to output
  ot = σ(Wo · [h(t−1), xt] + bo)
• tanh(Ct): maps the cell state to [−1, +1]
  ht = ot ∗ tanh(Ct)
• Finally, decide what to output; the output is based on the cell state.
  – First, run a sigmoid layer to decide what parts of the cell state to output.
  – Then, put the cell state through tanh (to push the values to be between −1 and 1) and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to.
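
Putting the gate equations together, a numpy sketch of one LSTM step; stacking the four weight matrices into a single W is an implementation convenience, and all sizes are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM step. W: (4*H, H + D) and b: (4*H,) hold the parameters of
    the forget, input, candidate, and output layers stacked together."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f_t = sigmoid(z[0:H])                # forget gate: what to keep from C_prev
    i_t = sigmoid(z[H:2*H])              # input gate: what to insert
    C_tilde = np.tanh(z[2*H:3*H])        # candidate values
    o_t = sigmoid(z[3*H:4*H])            # output gate: what part of the cell to expose
    C_t = f_t * C_prev + i_t * C_tilde   # update the cell state
    h_t = o_t * np.tanh(C_t)             # output, based on the cell state
    return h_t, C_t

rng = np.random.default_rng(0)
D, H = 4, 3
W = rng.standard_normal((4 * H, H + D)) * 0.1
b = np.zeros(4 * H)
h, C = np.zeros(H), np.zeros(H)
for t in range(5):                       # run over a toy sequence
    h, C = lstm_step(rng.standard_normal(D), h, C, W, b)
```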
