
Typical CNN Layer

Input → Convolutional Stage: Affine Transform → Detector Stage: Nonlinearity → Pooling Stage → Normalization Stage (Optional) → Output: Feature Map
A simple CNN structure

CONV: Convolutional kernel layer


RELU: Activation function
POOL: Dimension reduction layer
FC: Fully connected layer
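As a concrete illustration of this CONV → RELU → POOL → FC pattern, here is a minimal PyTorch sketch; the input size (28x28 grayscale), channel counts, and class count are illustrative assumptions, not values from the slides.

```python
import torch.nn as nn

# Minimal CONV -> RELU -> POOL -> FC stack (sizes are illustrative assumptions)
model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=5),  # CONV: convolutional kernel layer (28x28 -> 24x24)
    nn.ReLU(),                        # RELU: activation function
    nn.MaxPool2d(kernel_size=2),      # POOL: dimension reduction layer (24x24 -> 12x12)
    nn.Flatten(),
    nn.Linear(16 * 12 * 12, 10),      # FC: fully connected layer -> 10 class scores
)
```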
Fully Connected Layer
Example: 200x200 image
40K hidden units
~2B parameters!!!

- Spatial correlation is local


- Waste of resources + we do not have enough training samples anyway.
Locally Connected Layer

Example: 200x200 image


40K hidden units
Filter size: 10x10
4M parameters

Note: This parameterization is good when the input image is registered (e.g., face recognition).
The Convolution operation
Convolutional kernel

Padding the input volume with zeros in such a way that the conv layer does not alter the spatial dimensions of the input.
Example of 'valid' 2-D convolution (without kernel flipping), where a 3x4 matrix is convolved with a 2x2 kernel to output a 2x3 matrix.
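A minimal NumPy sketch of this 'valid' convolution without kernel flipping (i.e., cross-correlation); the input and kernel values are made up for illustration.

```python
import numpy as np

def conv2d_valid(x, k):
    """'Valid' 2-D convolution without kernel flipping (cross-correlation)."""
    kh, kw = k.shape
    out_h, out_w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

x = np.arange(12, dtype=float).reshape(3, 4)  # 3x4 input
k = np.array([[1., 0.], [0., -1.]])           # 2x2 kernel (arbitrary example values)
print(conv2d_valid(x, k).shape)               # (2, 3)
```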
Convolutional Layer

Share the same parameters across different locations (assuming the input is stationary): convolutions with learned kernels.

Example: the input image convolved (*) with the 3x3 kernel
-1 0 1
-1 0 1
-1 0 1
= an output map that responds to vertical edges.
Convolutional Layer

Learn multiple filters.

E.g.: 200x200 image, 100 filters, filter size 10x10 → 10K parameters
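To make the parameter counts in these slides concrete, here is a small back-of-the-envelope calculation in Python (single-channel 200x200 input as in the slides; biases ignored):

```python
pixels = 200 * 200          # 200x200 input image
hidden = 40_000             # 40K hidden units
filt = 10 * 10              # 10x10 filter
n_filters = 100             # 100 learned, shared filters

print(pixels * hidden)      # fully connected:   1,600,000,000 (~2B parameters)
print(hidden * filt)        # locally connected: 4,000,000     (4M parameters)
print(n_filters * filt)     # shared conv:       10,000        (10K parameters)
```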
Reason 1 : Sparse Connectivity
• Receptive fields of units in deeper layers are larger than in shallow layers

• Though direct connections are very sparse, deeper layers are indirectly connected to most of the input image

• The effect increases with strided convolution or pooling.
Input neurons representing a 28x28 image (such as from the MNIST dataset).
Every hidden layer neuron has a local receptive field of 5x5 pixels.
And so on, the first hidden layer is built!

(28 - 5 + 1) = 24, so 24x24 neurons in the hidden layer with 'valid' convolution. The size of the hidden layer can be changed using another variant of convolution.
Reason 2 : Parameter sharing
A closer look at spatial dimensions:

A 32x32x3 image convolved with a 5x5x3 filter (slid over all spatial locations) gives a 28x28x1 activation map.
For example, if we had 6 5x5 filters, we'd get 6 separate 28x28 activation maps.

We stack these up to get a "new image" of size 28x28x6!
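A quick shape check of this example, sketched in PyTorch (assuming the library is available):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)                                   # one 32x32x3 image (N, C, H, W)
conv = nn.Conv2d(in_channels=3, out_channels=6, kernel_size=5)  # six 5x5x3 filters
print(conv(x).shape)                                            # torch.Size([1, 6, 28, 28])
```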


Stride
A closer look at spatial dimensions:

7x7 input (spatially), assume a 3x3 filter:
- applied with stride 1 => 5x5 output
- applied with stride 2 => 3x3 output!
- applied with stride 3? doesn't fit! cannot apply a 3x3 filter on a 7x7 input with stride 3.
Output size: (N - F) / stride + 1

e.g. N = 7, F = 3:
stride 1 => (7 - 3)/1 + 1 = 5
stride 2 => (7 - 3)/2 + 1 = 3
stride 3 => (7 - 3)/3 + 1 = 2.33 :\
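The same rule as a small helper function (a sketch; the function name is mine, not from the slides). A non-integer result means the filter does not fit:

```python
def conv_output_size(n, f, stride, pad=0):
    """Spatial output size: (N + 2*P - F) / stride + 1."""
    return (n + 2 * pad - f) / stride + 1

for s in (1, 2, 3):
    print(s, conv_output_size(7, 3, s))  # 5.0, 3.0, 2.33... (stride 3 doesn't fit)
```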
Zero-Padding: common to zero-pad the border

e.g. input 7x7, 3x3 filter applied with stride 1, pad with a 1 pixel border of zeros => what is the output?

(recall: (N - F) / stride + 1, with N now including the padding)

7x7 output!
in general, common to see CONV layers with
stride 1, filters of size FxF, and zero-padding with
(F-1)/2. (will preserve size spatially)
e.g. F = 3 => zero pad with 1
F = 5 => zero pad with 2
F = 7 => zero pad with 3
Examples time:

Input volume: 32x32x3
10 5x5 filters with stride 1, pad 2

Output volume size:
(32 + 2*2 - 5)/1 + 1 = 32 spatially, so 32x32x10

Number of parameters in this layer?
Each filter has 5*5*3 + 1 = 76 params (+1 for the bias)
=> 76*10 = 760
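The same count can be sanity-checked with PyTorch (a sketch, assuming the library is available):

```python
import torch.nn as nn

# 10 filters of size 5x5 over a 3-channel input, stride 1, pad 2
conv = nn.Conv2d(in_channels=3, out_channels=10, kernel_size=5, stride=1, padding=2)
print(sum(p.numel() for p in conv.parameters()))  # 5*5*3*10 weights + 10 biases = 760
```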
Summary
Common settings (K = number of filters, F = filter size, S = stride, P = zero-padding):

- K = powers of 2, e.g. 32, 64, 128, 512
- F = 3, S = 1, P = 1
- F = 5, S = 1, P = 2
- F = 5, S = 2, P = ? (whatever fits)
- F = 1, S = 1, P = 0
Pooling Layer

Let us assume the filter is an "eye" detector.

Q.: how can we make the detection robust to the exact location of the eye?
Pooling Layer

By “pooling” (e.g., taking max) filter


responses at different locations we gain
robustness to the exact spatial location of
features.
Pooling Layer

Effect = invariance to small translations of the input

- makes the representations smaller and more manageable
- operates over each activation map independently
Max Pooling

Single depth slice:

Input (4x4):
1 1 2 4
5 6 7 8
3 2 1 0
1 2 3 4

Max pool with 2x2 filters and stride 2 => output (2x2):
6 8
3 4
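The same example reproduced with PyTorch max pooling (a sketch):

```python
import torch
import torch.nn as nn

x = torch.tensor([[1., 1., 2., 4.],
                  [5., 6., 7., 8.],
                  [3., 2., 1., 0.],
                  [1., 2., 3., 4.]]).reshape(1, 1, 4, 4)  # single depth slice
pool = nn.MaxPool2d(kernel_size=2, stride=2)
print(pool(x).squeeze())  # tensor([[6., 8.], [3., 4.]])
```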
Pooling Layer: Receptive Field Size

h^{n-1} → [Conv. layer] → h^n → [Pool. layer] → h^{n+1}

If convolutional filters have size KxK and stride 1, and the pooling layer has pools of size PxP, then each unit in the pooling layer depends upon a patch (at the input of the preceding conv. layer) of size (P+K-1)x(P+K-1).
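A tiny worked instance of this formula (the function name is mine):

```python
def pool_receptive_field(K, P):
    """Patch size, at the conv layer's input, seen by one pooling unit
    (KxK conv with stride 1 followed by PxP pooling): P + K - 1."""
    return P + K - 1

print(pool_receptive_field(K=3, P=2))  # 4 -> each pooling unit sees a 4x4 patch
print(pool_receptive_field(K=5, P=2))  # 6 -> a 6x6 patch
```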
ConvNets: Typical Stage

One stage (zoom): Convol. → Pooling

Conceptually similar to: SIFT, HoG, etc.


ConvNets: Typical Architecture

One stage (zoom): Convol. → Pooling

Whole system: Input Image → 1st stage → 2nd stage → 3rd stage → Fully Conn. Layers → Class Labels
Conceptually similar to:

SIFT → K-Means → Pyramid Pooling → SVM


Lazebnik et al. “...Spatial Pyramid Matching...” CVPR 2006

SIFT → Fisher Vect. → Pooling → SVM


Sanchez et al. "Image classification with F.V.: Theory and practice" IJCV 2012
Fully connected layers viewed as convolutional layers:

NxMxM feature maps (M small)
→ Fully conn. layer / Conv. layer (H kernels of size NxMxM)
→ H hidden units / Hx1x1 feature maps
→ Fully conn. layer / Conv. layer (K kernels of size Hx1x1)
→ K hidden units / Kx1x1 feature maps
Viewing fully connected layers as convolutional layers enables efficient use of convnets on bigger images (no need to slide windows; the network is simply unrolled over space as needed, re-using computation).
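A PyTorch sketch of this equivalence; the specific sizes (N=64, M=7, H=512, K=10) are assumptions for illustration:

```python
import torch
import torch.nn as nn

N, M, H, K = 64, 7, 512, 10              # assumed sizes for illustration

# Fully connected view: flatten the NxMxM maps, then two linear layers
fc = nn.Sequential(nn.Flatten(),
                   nn.Linear(N * M * M, H), nn.ReLU(),
                   nn.Linear(H, K))

# Convolutional view: H kernels of size NxMxM, then K kernels of size Hx1x1
conv = nn.Sequential(nn.Conv2d(N, H, kernel_size=M), nn.ReLU(),
                     nn.Conv2d(H, K, kernel_size=1))

x = torch.randn(1, N, M, M)
print(fc(x).shape, conv(x).shape)        # (1, 10) vs (1, 10, 1, 1): same units
# On a larger input, the conv view simply produces a spatial map of class scores.
```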

TRAINING TIME: Input Image → CNN

TEST TIME: Input Image x → CNN → y
ConvNets: Test
At test time, run only in forward mode (FPROP).
CONV NETS: EXAMPLES

- OCR / House number & Traffic sign classification
- Texture classification
- Pedestrian detection
- Scene Parsing
- Segmentation of 3D volumetric images
- Action recognition from videos
- Object detection
Architecture for Classification

Input image → convolutional stages (64, 128, 256, 512, 512 filters) → fully connected layers → label

Conv. layers: 3x3 filters
Max pooling layers: 2x2, stride 2
Fully connected layers: 4096 hiddens

24 layers in total!!!
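A rough PyTorch sketch consistent with these settings (3x3 convs, 2x2/stride-2 max pooling, 4096-unit fully connected layers, filter widths 64 → 512). The number of convs per stage, the 224x224 input size, and the 1000-way output are assumptions, not given in the slides:

```python
import torch.nn as nn

def stage(c_in, c_out, n_convs):
    """n_convs 3x3 conv+ReLU layers followed by 2x2 max pooling with stride 2."""
    layers = []
    for _ in range(n_convs):
        layers += [nn.Conv2d(c_in, c_out, kernel_size=3, padding=1), nn.ReLU()]
        c_in = c_out
    return nn.Sequential(*layers, nn.MaxPool2d(kernel_size=2, stride=2))

model = nn.Sequential(
    stage(3, 64, 2), stage(64, 128, 2), stage(128, 256, 2),
    stage(256, 512, 2), stage(512, 512, 2),
    nn.Flatten(),
    nn.Linear(512 * 7 * 7, 4096), nn.ReLU(),  # assumes a 224x224 input
    nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 1000),                    # 1000 class labels (assumed)
)
```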
Architecture for Classification

FLOPS: convolutional part ~20G, fully connected part ~0.1G, TOTAL ~20G
Nr. of parameters: convolutional part ~21M, fully connected part ~123M, TOTAL ~144M

Data augmentation is key to improve generalization:

- random translation
- left/right flipping
- scaling
Optimization

SGD with momentum:
- Learning rate = 0.01
- Momentum = 0.9

Improving generalization by:
- Weight sharing (convolution)
- Input distortions
- Dropout = 0.5
- Weight decay = 0.0005
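Expressed as a PyTorch optimizer configuration (a sketch; the tiny stand-in model is mine, the CNN itself would be defined elsewhere):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # stand-in for the CNN defined earlier

optimizer = torch.optim.SGD(model.parameters(),
                            lr=0.01,             # learning rate = 0.01
                            momentum=0.9,        # momentum = 0.9
                            weight_decay=5e-4)   # weight decay = 0.0005
# Dropout = 0.5 would live inside the model itself, e.g. nn.Dropout(p=0.5) before the FC layers.
```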
Choosing The Architecture
Task dependent

Cross-validation

[Convolution → pooling]* + fully connected layer

The more data: the more layers and the more kernels
Look at the number of parameters at each layer
Look at the number of flops at each layer
Computational resources

Be creative :)
How To Optimize
SGD (with momentum) usually works very well

Pick learning rate by running on a subset of the data


Bottou “Stochastic Gradient Tricks” Neural Networks 2012
Start with a large learning rate and divide by 2 until the loss does not diverge
Decay learning rate by a factor of ~1000 or more by the end of training

Use non-linearity

Initialize parameters so that each feature across layers has


similar variance. Avoid units in saturation.
Improving Generalization
Weight sharing (greatly reduce the number of parameters)
Data augmentation (e.g., jittering, noise injection, etc.)

Dropout
Hinton et al. "Improving neural networks by preventing co-adaptation of feature detectors" arXiv 2012

Weight decay (L2, L1)

Sparsity in the hidden units

Multi-task (unsupervised learning)


Good To Know

Check gradients numerically by finite differences.

Visualize features (feature maps need to be uncorrelated and have high variance).
- Good training: hidden units are sparse across samples and across features.
- Bad training: many hidden units ignore the input and/or exhibit strong correlations.

Visualize parameters.
- Good training: learned filters exhibit structure and are uncorrelated.
- Bad training: filters are too noisy, too correlated, or lack structure.

Measure error on both training and validation set.
Test on a small subset of the data and check that the error → 0.
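A minimal sketch of checking gradients by finite differences (central differences), on a toy loss:

```python
import numpy as np

def numerical_gradient(f, w, eps=1e-5):
    """Central finite differences: grad_i ~ (f(w + eps*e_i) - f(w - eps*e_i)) / (2*eps)."""
    grad = np.zeros_like(w)
    for i in range(w.size):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus.flat[i] += eps
        w_minus.flat[i] -= eps
        grad.flat[i] = (f(w_plus) - f(w_minus)) / (2 * eps)
    return grad

# Toy check: for f(w) = 0.5 * ||w||^2 the analytic gradient is w itself
w = np.random.randn(4)
numeric = numerical_gradient(lambda v: 0.5 * np.sum(v ** 2), w)
print(np.max(np.abs(numeric - w)))  # should be tiny (~1e-10): analytic and numerical gradients match
```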
What If It Does Not Work?
Training diverges:
Learning rate may be too large → decrease learning rate
BPROP is buggy → numerical gradient checking
Parameters collapse / loss is minimized but accuracy is low
Check loss function:
Is it appropriate for the task you want to solve?
Does it have degenerate solutions? Check “pull-up” term.
Network is underperforming
Compute flops and nr. params. → if too small, make net larger
Visualize hidden units/params → fix optimization
Network is too slow
Compute flops and nr. params. → GPU, distributed framework, or make the net smaller
• By applying convolution and pooling, important features are extracted, the dimensionality is reduced, and computation becomes faster.
• These features are then fed to an ANN (fully connected layers) for classification.
