• Can this be learned?
• Hidden units transform the input space into a new space representing “constructed” features
Compressed Representation
• Hidden units transform the input space into a new space representing “constructed” features
[Figure: example binary input patterns reconstructed through the compressed hidden representation]

Autoencoder Variants
• Bottleneck: use fewer hidden units than inputs (a minimal sketch follows this list)
• Sparsity: use a penalty function that encourages most hidden unit activations to be near 0
• Denoising: train to predict the true input from a corrupted input
• Contractive: force the encoder to have small derivatives of the hidden unit outputs as the input varies
• Etc.
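To make the bottleneck variant concrete, here is a minimal NumPy sketch (the layer sizes, weights, and names are illustrative assumptions, not taken from the slides): a 6-dimensional input is squeezed through 3 hidden units and reconstructed.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumed sizes: 6 inputs, 3 hidden units (the bottleneck).
n_in, n_hid = 6, 3
W1, b1 = rng.normal(0, 0.1, (n_hid, n_in)), np.zeros(n_hid)   # encoder
W2, b2 = rng.normal(0, 0.1, (n_in, n_hid)), np.zeros(n_in)    # decoder

def autoencode(x):
    a = sigmoid(W1 @ x + b1)   # compressed hidden code ("constructed" features)
    z = sigmoid(W2 @ a + b2)   # reconstruction of the input
    return a, z

x = np.array([1, 0, 0, 1, 1, 0], dtype=float)
a, z = autoencode(x)
print("hidden code:", np.round(a, 2), "reconstruction:", np.round(z, 2))
```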
Denoising Autoencoder Procedure
1. (Corrupting) The clean input x is partially corrupted through a stochastic mapping, yielding the corrupted input x̃.
2. The corrupted input passes through a basic autoencoder and is mapped to a hidden representation y = f_θ(x̃).
3. From this hidden representation, we reconstruct z = g_θ′(y).
4. Minimize the reconstruction error L_H(x, z).
• Choices for L_H: cross-entropy loss or squared error loss

L_H(x, z) = -\sum_{k=1}^{d} \left[ x_k \log z_k + (1 - x_k) \log(1 - z_k) \right]
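A minimal NumPy sketch of the four steps above, using the cross-entropy loss L_H just given (the corruption rate, layer sizes, and variable names are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

d, h = 6, 3                                            # assumed input / hidden sizes
W, b = rng.normal(0, 0.1, (h, d)), np.zeros(h)         # encoder f_theta
Wp, bp = rng.normal(0, 0.1, (d, h)), np.zeros(d)       # decoder g_theta'

def denoising_step(x, corrupt_p=0.3):
    # 1. Corrupt: randomly zero out a fraction of the inputs (stochastic mapping).
    x_tilde = x * (rng.random(d) > corrupt_p)
    # 2. Map the corrupted input to the hidden representation.
    y = sigmoid(W @ x_tilde + b)
    # 3. Reconstruct from the hidden representation.
    z = sigmoid(Wp @ y + bp)
    # 4. Cross-entropy reconstruction error against the *clean* input x.
    L = -np.sum(x * np.log(z) + (1 - x) * np.log(1 - z))
    return z, L

x = np.array([1, 0, 1, 1, 0, 0], dtype=float)
z, L = denoising_step(x)
print("reconstruction:", np.round(z, 2), "loss:", round(L, 3))
```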
Stacking Autoencoder
[Figure: an autoencoder with inputs x1 to x6 plus bias +1 (Layer 1) and hidden units a1 to a3 (Layer 2); the hidden activations a1 to a3 give a new representation for the input.]
[Figure: the hidden activations a1 to a3 are then used as input to train a second autoencoder with hidden units b1 to b3, which in turn feeds a third autoencoder with hidden units c1 to c3; each stage gives a new representation for the input.]
Use [c1, c2, c3] as the representation to feed to the learning algorithm.
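A schematic NumPy sketch of the greedy layer-wise stacking pictured above (the train_autoencoder helper is a hypothetical placeholder; real training would minimize reconstruction error at each stage): each autoencoder is trained on the previous layer's hidden activations, and the final code is what gets fed to the learning algorithm.

```python
import numpy as np

rng = np.random.default_rng(2)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(X, n_hidden):
    """Placeholder: a real version would minimize reconstruction error;
    here we only return encoder weights of the right shape."""
    return rng.normal(0, 0.1, (n_hidden, X.shape[1])), np.zeros(n_hidden)

X = rng.random((100, 6))          # toy data: 100 samples, inputs x1..x6

# Layer 1 autoencoder: x -> a (hidden units a1..a3)
W1, b1 = train_autoencoder(X, 3)
A = sigmoid(X @ W1.T + b1)        # new representation for the input

# Layer 2 autoencoder trained on A: a -> b
W2, b2 = train_autoencoder(A, 3)
B = sigmoid(A @ W2.T + b2)

# Layer 3 autoencoder trained on B: b -> c
W3, b3 = train_autoencoder(B, 3)
C = sigmoid(B @ W3.T + b3)        # [c1, c2, c3]: feed this to the learning algorithm
```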
Dropout
Training:
• Pick a mini-batch; for each update, some neurons are dropped out, so the network that is trained is thinner.

Testing:
• No dropout; all the weights are multiplied by (1 − p), where p is the dropout rate (see the sketch after this section).

Dropout as an ensemble:
• A network with M neurons defines 2^M possible thinned networks (Network 1, Network 2, Network 3, Network 4, …), with outputs y1, y2, y3, y4, …
• Using the full network with every weight multiplied by (1 − p) approximates the ensemble: the average of the yi ≈ y.

Softmax
• Softmax can be used as a probability: y_j = e^{z_j} / \sum_{k=1}^{K} e^{z_k}
• Derivative of softmax: \partial y_j / \partial z_k = y_j (\delta_{jk} - y_k)
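A minimal NumPy sketch of the dropout train/test asymmetry and the softmax output described above (layer sizes and the dropout rate p are assumptions): at training time neurons are dropped at random, and at test time nothing is dropped but the outgoing weights are scaled by (1 − p), which approximates averaging the 2^M thinned networks.

```python
import numpy as np

rng = np.random.default_rng(3)
p = 0.5                                          # assumed dropout rate

def softmax(z):
    e = np.exp(z - z.max())                      # y_j = e^{z_j} / sum_k e^{z_k}
    return e / e.sum()

W1 = rng.normal(0, 0.5, (8, 4))
W2 = rng.normal(0, 0.5, (3, 8))

def forward_train(x):
    h = np.maximum(0, W1 @ x)                    # hidden layer (ReLU)
    h = h * (rng.random(h.shape) > p)            # drop each neuron with prob. p -> thinner network
    return softmax(W2 @ h)                       # softmax output used as class probabilities

def forward_test(x):
    h = np.maximum(0, W1 @ x)                    # no dropout at test time
    return softmax(((1 - p) * W2) @ h)           # outgoing weights multiplied by (1 - p)

x = rng.random(4)
probs = forward_test(x)
print(np.round(probs, 3), probs.sum())           # probabilities summing to 1
```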
Convolutional Neural Networks
• Inspired by the mammalian visual cortex
  – The visual cortex contains a complex arrangement of cells, which are sensitive to small sub-regions of the visual field, called receptive fields. These cells act as local filters over the input space and are well suited to exploit the strong spatially local correlation present in natural images.
  – Two basic cell types:
    o Simple cells respond maximally to specific edge-like patterns within their receptive field.
    o Complex cells have larger receptive fields and are locally invariant to the exact position of the pattern.
• Intuition: a neural network with a specialized connectivity structure
  – Stacking multiple layers of feature extractors
  – Low-level layers extract local features.
  – High-level layers learn global patterns.
• A CNN is a list of layers that transform the input data into an output class/prediction (see the sketch below):
  – Convolve input
  – Non-linearity
  – Subsampling or pooling
  – Fully connected
  – Output
[Figure: CNN pipeline, from input image through convolution (learned), non-linearity, pooling, and feature maps to the output]
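A hedged PyTorch sketch of the layer list above (channel counts, kernel sizes, and the 10-class output are placeholders, not taken from the slides):

```python
import torch
import torch.nn as nn

# One module per step listed above.
cnn = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),  # convolve input (learned filters)
    nn.ReLU(),                                  # non-linearity
    nn.MaxPool2d(2),                            # subsampling / pooling
    nn.Flatten(),
    nn.Linear(8 * 16 * 16, 10),                 # fully connected -> output class scores
)

x = torch.randn(1, 3, 32, 32)   # dummy 32x32 RGB image
print(cnn(x).shape)             # torch.Size([1, 10])
```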
LeNet-5
[Figure: the LeNet-5 architecture, presented layer by layer over several slides]
Connectivity & Weight Sharing
• Connectivity depends on the layer.

Convolution Layer
• Detect the same feature at different positions in the input image (a small sketch follows).
[Figure: one shared filter applied across the input produces a feature map of detected features]
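A small NumPy sketch of weight sharing (the input and filter values are made up): the same 3x3 filter slides over the whole image, so an identical feature detector is applied at every position, producing one feature map.

```python
import numpy as np

def conv2d_valid(image, filt):
    """Slide one shared filter over the image (stride 1, no padding)."""
    H, W = image.shape
    fh, fw = filt.shape
    out = np.zeros((H - fh + 1, W - fw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+fh, j:j+fw] * filt)   # same weights at every position
    return out

image = np.random.default_rng(4).integers(0, 2, (6, 6)).astype(float)
edge_filter = np.array([[1., 0., -1.],
                        [1., 0., -1.],
                        [1., 0., -1.]])   # detects the same vertical-edge feature everywhere
feature_map = conv2d_valid(image, edge_filter)
print(feature_map.shape)                  # (4, 4) feature map
```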
tanh Function
ReLU Function
Maxout
Exponential Linear Unit (ELU)
[Figure: a maxout network, in which inputs x1 and x2 feed groups of linear units and each group's outputs pass through a max unit]
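For reference, NumPy definitions of the activations named above (the maxout group size k = 2 and the example weights are assumptions; maxout takes the max over k learned linear pieces rather than applying a fixed nonlinearity):

```python
import numpy as np

def tanh(z):
    return np.tanh(z)

def relu(z):
    return np.maximum(0.0, z)

def elu(z, alpha=1.0):
    return np.where(z > 0, z, alpha * (np.exp(z) - 1.0))

def maxout(x, W, b):
    """Maxout unit: max over k linear pieces z_i = w_i . x + b_i."""
    return np.max(W @ x + b)

x = np.array([1.0, -2.0])
W = np.array([[0.5, 0.3], [-0.2, 0.8]])   # k = 2 linear pieces (assumed)
b = np.zeros(2)
print(tanh(-1.0), relu(-1.0), elu(-1.0), maxout(x, W, b))
```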
Pooling:
• Max pooling
• Average pooling
Normalization

Convolutional Layer
Average Pooling with No Padding
• no padding – input=3x3, filter=2x2, stride=1x1

Input:        Output:
1 2 3         3 4
4 5 6         6 7
7 8 9

Average Pooling with Padding
• zero padding – input=3x3, filter=2x2, stride=1x1 (padded zeros are not counted in the average)

Padded input:     Output:
1 2 3 0           3    4    4.5
4 5 6 0           6    7    7.5
7 8 9 0           7.5  8.5  9
0 0 0 0
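A NumPy sketch that reproduces both worked examples above; note that padded cells are excluded from the average, which is how values such as 4.5 and 9 arise.

```python
import numpy as np

def avg_pool(x, k=2, stride=1, pad=0):
    """Average pooling; zero-padded cells are not counted in the average."""
    xp = np.pad(x, ((0, pad), (0, pad)))                       # pad on the right/bottom as above
    valid = np.pad(np.ones_like(x, dtype=float), ((0, pad), (0, pad)))
    H = (xp.shape[0] - k) // stride + 1
    W = (xp.shape[1] - k) // stride + 1
    out = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            win = xp[i*stride:i*stride+k, j*stride:j*stride+k]
            cnt = valid[i*stride:i*stride+k, j*stride:j*stride+k].sum()
            out[i, j] = win.sum() / cnt                        # average over the valid cells only
    return out

x = np.arange(1, 10, dtype=float).reshape(3, 3)
print(avg_pool(x, pad=0))   # [[3. 4.] [6. 7.]]
print(avg_pool(x, pad=1))   # [[3.  4.  4.5] [6.  7.  7.5] [7.5 8.5 9. ]]
```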
Max Pooling with Padding
• zero padding – input=3x5, filter=3x3, stride=2x2

An Example of Conv Layer
The output at one position is the sum of the elementwise products over the three input channels, plus the bias:

( 0+0+0 + 0+0+0 + 0−2−2 )
+ ( 0+0+0 + 0+0+0 + 0+1+0 )
+ ( 0+0+0 + 0+0+0 + 0−1+0 )
+ 1
= −3
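A tiny NumPy check of the same arithmetic, with the per-channel elementwise products copied from the sums above (the surrounding input and filter values are not recoverable from the slide, so only this one output position is shown):

```python
import numpy as np

# Elementwise products inside the 3x3 window, one array per input channel.
ch1 = np.array([0, 0, 0, 0, 0, 0, 0, -2, -2])
ch2 = np.array([0, 0, 0, 0, 0, 0, 0,  1,  0])
ch3 = np.array([0, 0, 0, 0, 0, 0, 0, -1,  0])
bias = 1

print(ch1.sum() + ch2.sum() + ch3.sum() + bias)   # -3
```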
[Figure: learned feature hierarchy, from pixels to edges, to object parts (combinations of edges), to object models]
Examples of Learned Object Parts from Object Categories
[Figure: learned object parts for Faces, Cars, Elephants, and Chairs]

Recurrent Neural Networks
• Recurrent Neural Networks are networks with loops in them, allowing information to persist.
(1) Standard mode of processing without RNN, from fixed-sized input to fixed-sized output
(e.g. image classification).
(2) Sequence output (e.g. image captioning takes an image and outputs a sentence of words).
(3) Sequence input (e.g. sentiment analysis where a given sentence is classified as expressing
positive or negative sentiment).
(4) Sequence input and sequence output (e.g. Machine Translation: an RNN reads a sentence
in English and then outputs a sentence in French).
(5) Synced sequence input and output (e.g. video classification where we wish to label each
frame of the video).
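A minimal vanilla-RNN loop in NumPy, to make the "loop that lets information persist" concrete (sizes and weight names are assumptions): the same cell is applied at every time step, and the hidden state h carries information forward.

```python
import numpy as np

rng = np.random.default_rng(5)
n_in, n_hid = 4, 3
Wxh = rng.normal(0, 0.1, (n_hid, n_in))
Whh = rng.normal(0, 0.1, (n_hid, n_hid))
bh  = np.zeros(n_hid)

def rnn(xs):
    h = np.zeros(n_hid)                    # persistent hidden state
    hs = []
    for x in xs:                           # the "loop", unrolled over the sequence
        h = np.tanh(Wxh @ x + Whh @ h + bh)
        hs.append(h)
    return hs                              # one hidden state per input step

sequence = [rng.random(n_in) for _ in range(5)]
hs = rnn(sequence)
print(len(hs), np.round(hs[-1], 3))
```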
Vanishing Gradients

Long Short-Term Memory (LSTM)
Simply Replace the Neurons with LSTM
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
• Next, decide what new information to store in the cell state:
  – a sigmoid called the “input gate” decides which values to update;
  – a tanh creates a vector of new candidate values, C̃_t, that could be added to the state;
  – these two are combined to create an update to the state.
• Update the old cell state C_{t−1} into the new cell state C_t by:
  – multiplying C_{t−1} by f_t, forgetting the things we decided to forget earlier;
  – then adding i_t · C̃_t, the new candidate values scaled by how much we decided to update each state value.
Output Gate
• o_t: what part of the cell state to output
• tanh(C_t): maps the cell state to [−1, +1]
• h_t = o_t · tanh(C_t)
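Putting the forget, input, and output gates together, here is a minimal NumPy sketch of one LSTM step following the equations above (the weight shapes and the concatenation of h_{t−1} with x_t follow the usual convention; all names and sizes are assumptions):

```python
import numpy as np

rng = np.random.default_rng(6)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

n_x, n_h = 4, 3
# One weight matrix per gate, acting on the concatenation [h_{t-1}, x_t].
Wf, Wi, Wc, Wo = (rng.normal(0, 0.1, (n_h, n_h + n_x)) for _ in range(4))
bf = bi = bc = bo = np.zeros(n_h)

def lstm_step(x_t, h_prev, C_prev):
    v = np.concatenate([h_prev, x_t])
    f_t = sigmoid(Wf @ v + bf)           # forget gate
    i_t = sigmoid(Wi @ v + bi)           # input gate: which values to update
    C_tilde = np.tanh(Wc @ v + bc)       # new candidate values
    C_t = f_t * C_prev + i_t * C_tilde   # forget old state, add scaled candidates
    o_t = sigmoid(Wo @ v + bo)           # output gate: what part of the cell to output
    h_t = o_t * np.tanh(C_t)             # tanh maps the cell state to [-1, +1]
    return h_t, C_t

h, C = np.zeros(n_h), np.zeros(n_h)
h, C = lstm_step(rng.random(n_x), h, C)
print(np.round(h, 3), np.round(C, 3))
```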