(Figure: input image * 3x3 filter with rows (-1 0 1) = feature map)
Convolutional Layer
• Though direct connections are very sparse, deeper layers are indirectly connected to most of the input image.
• The effect increases with strided convolution or pooling.
Input neurons represent a 28x28 image (such as from the MNIST dataset).
Every hidden-layer neuron has a local receptive field covering a 5x5-pixel region.
And so on, the first hidden layer is built!
(28 - 5 + 1) = 24, so there are 24x24 neurons in the hidden layer under 'valid' convolution. The size of the hidden layer can be changed using another variant of convolution.
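As a sketch of the arithmetic above, here is a minimal 'valid' convolution in NumPy (random data stands in for an MNIST digit; only the shapes matter here):

```python
import numpy as np

# 'Valid' convolution of a 28x28 input with one 5x5 filter
# (a minimal sketch; real frameworks use optimized routines).
image = np.random.rand(28, 28)   # e.g. one MNIST digit
filt = np.random.rand(5, 5)      # one local receptive field's weights

N, F = image.shape[0], filt.shape[0]
out = np.empty((N - F + 1, N - F + 1))  # (28 - 5 + 1) = 24
for i in range(out.shape[0]):
    for j in range(out.shape[1]):
        # each hidden neuron sees only its local 5x5 patch
        out[i, j] = np.sum(image[i:i+F, j:j+F] * filt)

print(out.shape)  # (24, 24)
```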
Reason 2 : Parameter sharing
A closer look at spatial dimensions:
A 32x32x3 image convolved with a 5x5x3 filter produces a 28x28x1 activation map.
For example, if we had 6 5x5 filters, we'll get 6 separate activation maps:
Convolution Layer: 32x32x3 image => 28x28x6 stack of activation maps.
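A quick NumPy sketch of this shape arithmetic (random data; `sliding_window_view` extracts every 5x5x3 patch, and `einsum` dots each patch with each of the 6 filters):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

image = np.random.rand(32, 32, 3)
filters = np.random.rand(6, 5, 5, 3)   # 6 filters, each 5x5x3

# All 5x5x3 patches at once: shape (28, 28, 1, 5, 5, 3)
patches = sliding_window_view(image, (5, 5, 3))
# Dot each patch with each filter -> 6 activation maps, shape (28, 28, 6)
maps = np.einsum('xyzfgc,kfgc->xyk', patches, filters)

print(maps.shape)  # (28, 28, 6)
```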
A closer look at spatial dimensions:
7x7 input (spatially), assume a 3x3 filter:
• applied with stride 1 => 5x5 output
• applied with stride 2 => 3x3 output!
• applied with stride 3? doesn't fit! A 3x3 filter cannot be applied to a 7x7 input with stride 3.
Output size:
(N - F) / stride + 1

e.g. N = 7, F = 3:
stride 1 => (7 - 3)/1 + 1 = 5
stride 2 => (7 - 3)/2 + 1 = 3
stride 3 => (7 - 3)/3 + 1 = 2.33 :\
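The formula can be captured in a small helper (returning `None` for the "doesn't fit" case is my convention, not from the slides):

```python
def conv_output_size(N, F, stride):
    """Spatial output size of a conv layer: (N - F) / stride + 1.

    Returns None when the filter does not tile the input evenly
    (the 'doesn't fit!' case).
    """
    if (N - F) % stride != 0:
        return None
    return (N - F) // stride + 1

print(conv_output_size(7, 3, 1))  # 5
print(conv_output_size(7, 3, 2))  # 3
print(conv_output_size(7, 3, 3))  # None (2.33 doesn't fit)
```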
Zero-Padding: it is common to zero-pad the border.
e.g. 7x7 input, 3x3 filter applied with stride 1, padded with a 1-pixel border => what is the output?
(recall: (N - F) / stride + 1)
7x7 output!
In general, it is common to see CONV layers with stride 1, filters of size FxF, and zero-padding of (F-1)/2 (this preserves the spatial size).
e.g. F = 3 => zero pad with 1
F = 5 => zero pad with 2
F = 7 => zero pad with 3
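With padding, the output-size formula generalizes to (N + 2*pad - F) / stride + 1; a small sketch:

```python
def conv_output_size_padded(N, F, stride, pad):
    """Spatial output size with zero-padding: (N + 2*pad - F) / stride + 1."""
    return (N + 2 * pad - F) // stride + 1

# 7x7 input, 3x3 filter, stride 1, 1-pixel border => size is preserved
print(conv_output_size_padded(7, 3, 1, 1))  # 7

# padding (F - 1) // 2 preserves spatial size for any odd F
for F in (3, 5, 7):
    assert conv_output_size_padded(32, F, 1, (F - 1) // 2) == 32
```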
Pooling Layer: Receptive Field Size
h^(n-1) → Conv. layer → h^n → Pool. layer → h^(n+1)
If convolutional filters have size KxK and stride 1, and the pooling layer has pools of size PxP, then each unit in the pooling layer depends upon a patch (at the input of the preceding conv. layer) of size (P+K-1)x(P+K-1).
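The (P+K-1) result follows because a pool of P adjacent conv outputs, each covering K inputs and offset from its neighbor by 1 (stride 1), spans (P-1)+K input pixels. A one-line helper that also handles other conv strides (my generalization, not from the slides):

```python
def pool_unit_receptive_field(K, P, conv_stride=1):
    """Patch size (at the conv layer's input) seen by one pooling unit.

    A pool of P conv outputs spans (P - 1) * conv_stride + K input pixels;
    with stride 1 this reduces to the slide's (P + K - 1).
    """
    return (P - 1) * conv_stride + K

print(pool_unit_receptive_field(K=5, P=2))  # 6  == 5 + 2 - 1
```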
ConvNets: Typical Stage
One stage (zoom): Convol. → Pooling
Whole system: Input Image → [Convol. → Pooling] → … → Fully Conn. Layers → Class Labels
After the last stage the feature maps are NxMxM with M small; the fully connected layers that follow can be viewed as K hidden units (Kx1x1 feature maps), then H hidden units (Hx1x1 feature maps).
TRAINING TIME: Input Image → CNN
TEST TIME: Input Image x → CNN → y
ConvNets: Test
At test time, run only in forward mode (FPROP).
CONV NETS: EXAMPLES
• OCR / House number & Traffic sign classification
• Texture classification
• Pedestrian detection
• Scene Parsing
• Segmentation of 3D volumetric images
• Action recognition from videos
• Object detection
Architecture for Classification
input image → … → label
(feature map counts per stage: 64, 128, 256, 512, 512)
24 Layers in total!!!
Optimization
• Cross-validation
• The more data: the more layers and the more kernels
• Look at the number of parameters at each layer
• Look at the number of flops at each layer
• Computational resources
• Be creative :)
How To Optimize
• SGD (with momentum) usually works very well
• Use non-linearity
• Dropout
Hinton et al., "Improving neural networks by preventing co-adaptation of feature detectors", arXiv 2012
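A minimal NumPy sketch of SGD with classical momentum on a toy quadratic loss (this particular update variant, v = mu*v - lr*grad, is an assumption; the slides name only "SGD with momentum"):

```python
import numpy as np

# SGD with classical momentum on the toy loss L(w) = 0.5 * ||w||^2,
# whose gradient is simply w. Hyperparameters are illustrative.
lr, mu = 0.1, 0.9
w = np.array([5.0, -3.0])
v = np.zeros_like(w)

for _ in range(300):
    grad = w                 # dL/dw for the toy loss
    v = mu * v - lr * grad   # velocity accumulates past gradients
    w = w + v

print(np.allclose(w, 0.0, atol=1e-3))  # True: converged to the minimum
```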
Good To Know
• Check gradients numerically by finite differences
• Visualize features (feature maps need to be uncorrelated and have high variance)
• Visualize parameters
(Figure: hidden-unit activations plotted as samples x hidden units)
Good training: hidden units are sparse across samples and across features.
Bad training: many hidden units ignore the input and/or exhibit strong correlations.
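A minimal sketch of the finite-difference gradient check on a toy loss (the function and tolerance are illustrative, not from the slides):

```python
import numpy as np

# Compare an analytic gradient against centered finite differences:
# dL/dw_i  ~=  (L(w + h*e_i) - L(w - h*e_i)) / (2h)
def loss(w):
    return np.sum(w ** 3)      # toy loss with known gradient 3 * w^2

def analytic_grad(w):
    return 3 * w ** 2

def numeric_grad(f, w, h=1e-5):
    g = np.zeros_like(w)
    for i in range(w.size):
        e = np.zeros_like(w)
        e[i] = h
        g[i] = (f(w + e) - f(w - e)) / (2 * h)  # centered difference
    return g

w = np.array([0.5, -1.2, 2.0])
max_err = np.abs(analytic_grad(w) - numeric_grad(loss, w)).max()
print(max_err < 1e-6)  # True: analytic and numeric gradients agree
```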