
Typical CNN Layer

Input → Convolutional Stage: Affine Transform → Detector Stage: Nonlinearity → Pooling Stage → Normalization Stage (Optional) → Output: Feature Map
A simple CNN structure

CONV: Convolutional kernel layer


RELU: Activation function
POOL: Dimension reduction layer
FC: Fully connected layer
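As a concrete illustration of this CONV → RELU → POOL → FC pattern, here is a minimal PyTorch sketch; the input size (28x28 grayscale), channel counts, and class count are illustrative assumptions, not values from the slides.

```python
import torch.nn as nn

# Minimal CONV -> RELU -> POOL -> FC stack (sizes are illustrative assumptions)
model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=5),  # CONV: convolutional kernel layer (28x28 -> 24x24)
    nn.ReLU(),                        # RELU: activation function
    nn.MaxPool2d(kernel_size=2),      # POOL: dimension reduction layer (24x24 -> 12x12)
    nn.Flatten(),
    nn.Linear(16 * 12 * 12, 10),      # FC: fully connected layer -> 10 class scores
)
```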
Fully Connected Layer
Example: 200x200 image
40K hidden units
~2B parameters!!!

- Spatial correlation is local


- Waste of resources + we do not have enough training samples anyway.
Locally Connected Layer

Example: 200x200 image


40K hidden units
Filter size: 10x10
4M parameters

Note: This parameterization is good when the input image is registered (e.g., face recognition).
The Convolution operation
Convolutional kernel

Padding the input volume with zeros in such a way that the conv layer does not alter the spatial dimensions of the input.
Example of 'valid' 2-D convolution (without kernel flipping), where a 3x4 matrix is convolved with a 2x2 kernel to output a 2x3 matrix.
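A minimal NumPy sketch of this 'valid' convolution without kernel flipping (i.e., cross-correlation); the input and kernel values are made up for illustration.

```python
import numpy as np

def conv2d_valid(x, k):
    """'Valid' 2-D convolution without kernel flipping (cross-correlation)."""
    kh, kw = k.shape
    out_h, out_w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

x = np.arange(12, dtype=float).reshape(3, 4)  # 3x4 input
k = np.array([[1., 0.], [0., -1.]])           # 2x2 kernel (arbitrary example values)
print(conv2d_valid(x, k).shape)               # (2, 3)
```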
Convolutional Layer

Share the same parameters across different locations (assuming the input is stationary): convolutions with learned kernels.

Example: the input image convolved (*) with the 3x3 kernel
-1 0 1
-1 0 1
-1 0 1
= an output map that responds to vertical edges.
Convolutional Layer

Learn multiple filters.

E.g.: 200x200 image, 100 filters, filter size 10x10 → 10K parameters
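To make the parameter counts in these slides concrete, here is a small back-of-the-envelope calculation in Python (single-channel 200x200 input as in the slides; biases ignored):

```python
pixels = 200 * 200          # 200x200 input image
hidden = 40_000             # 40K hidden units
filt = 10 * 10              # 10x10 filter
n_filters = 100             # 100 learned, shared filters

print(pixels * hidden)      # fully connected:   1,600,000,000 (~2B parameters)
print(hidden * filt)        # locally connected: 4,000,000     (4M parameters)
print(n_filters * filt)     # shared conv:       10,000        (10K parameters)
```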
Reason 1 : Sparse Connectivity
• Receptive fields of units in deeper layers are larger than in shallow layers

• Though direct connections are very sparse, deeper layers are indirectly connected to most of the input image

• The effect increases with strided convolution or pooling.
Input neurons representing a 28x28 image (such as from the MNIST dataset).
Every hidden layer neuron has a local receptive field of 5x5 pixels.
And so on, the first hidden layer is built!

(28 - 5 + 1) = 24, so 24x24 neurons in the hidden layer with 'valid' convolution. The size of the hidden layer can be changed using another variant of convolution.
Reason 2 : Parameter sharing
A closer look at spatial dimensions:

A 32x32x3 image convolved with a 5x5x3 filter (slid over all spatial locations) gives a 28x28x1 activation map.
For example, if we had 6 5x5 filters, we'd get 6 separate 28x28 activation maps.

We stack these up to get a "new image" of size 28x28x6!
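A quick shape check of this example, sketched in PyTorch (assuming the library is available):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)                                   # one 32x32x3 image (N, C, H, W)
conv = nn.Conv2d(in_channels=3, out_channels=6, kernel_size=5)  # six 5x5x3 filters
print(conv(x).shape)                                            # torch.Size([1, 6, 28, 28])
```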


Stride
A closer look at spatial dimensions:

7x7 input (spatially), assume a 3x3 filter:
- applied with stride 1 => 5x5 output
- applied with stride 2 => 3x3 output!
- applied with stride 3? doesn't fit! cannot apply a 3x3 filter on a 7x7 input with stride 3.
Output size: (N - F) / stride + 1

e.g. N = 7, F = 3:
stride 1 => (7 - 3)/1 + 1 = 5
stride 2 => (7 - 3)/2 + 1 = 3
stride 3 => (7 - 3)/3 + 1 = 2.33 :\
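The same rule as a small helper function (a sketch; the function name is mine, not from the slides). A non-integer result means the filter does not fit:

```python
def conv_output_size(n, f, stride, pad=0):
    """Spatial output size: (N + 2*P - F) / stride + 1."""
    return (n + 2 * pad - f) / stride + 1

for s in (1, 2, 3):
    print(s, conv_output_size(7, 3, s))  # 5.0, 3.0, 2.33... (stride 3 doesn't fit)
```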
Zero-Padding: common to zero-pad the border

e.g. input 7x7, 3x3 filter applied with stride 1, pad with a 1 pixel border of zeros => what is the output?

(recall: (N - F) / stride + 1, with N now including the padding)

7x7 output!
in general, common to see CONV layers with
stride 1, filters of size FxF, and zero-padding with
(F-1)/2. (will preserve size spatially)
e.g. F = 3 => zero pad with 1
F = 5 => zero pad with 2
F = 7 => zero pad with 3
Examples time:

Input volume: 32x32x3
10 5x5 filters with stride 1, pad 2

Output volume size:
(32 + 2*2 - 5)/1 + 1 = 32 spatially, so 32x32x10

Number of parameters in this layer?
Each filter has 5*5*3 + 1 = 76 params (+1 for the bias)
=> 76*10 = 760
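The same count can be sanity-checked with PyTorch (a sketch, assuming the library is available):

```python
import torch.nn as nn

# 10 filters of size 5x5 over a 3-channel input, stride 1, pad 2
conv = nn.Conv2d(in_channels=3, out_channels=10, kernel_size=5, stride=1, padding=2)
print(sum(p.numel() for p in conv.parameters()))  # 5*5*3*10 weights + 10 biases = 760
```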
Summary
Common settings (K = number of filters, F = filter size, S = stride, P = zero-padding):

- K = powers of 2, e.g. 32, 64, 128, 512
- F = 3, S = 1, P = 1
- F = 5, S = 1, P = 2
- F = 5, S = 2, P = ? (whatever fits)
- F = 1, S = 1, P = 0
Pooling Layer

Let us assume the filter is an "eye" detector.

Q.: how can we make the detection robust to the exact location of the eye?
Pooling Layer

By “pooling” (e.g., taking max) filter


responses at different locations we gain
robustness to the exact spatial location of
features.
Pooling Layer

Effect = invariance to small translations of the input

- makes the representations smaller and more manageable
- operates over each activation map independently
Max Pooling

Single depth slice:

Input (4x4):
1 1 2 4
5 6 7 8
3 2 1 0
1 2 3 4

Max pool with 2x2 filters and stride 2 => output (2x2):
6 8
3 4
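The same example reproduced with PyTorch max pooling (a sketch):

```python
import torch
import torch.nn as nn

x = torch.tensor([[1., 1., 2., 4.],
                  [5., 6., 7., 8.],
                  [3., 2., 1., 0.],
                  [1., 2., 3., 4.]]).reshape(1, 1, 4, 4)  # single depth slice
pool = nn.MaxPool2d(kernel_size=2, stride=2)
print(pool(x).squeeze())  # tensor([[6., 8.], [3., 4.]])
```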
Pooling Layer: Receptive Field Size

h^{n-1} → [Conv. layer] → h^n → [Pool. layer] → h^{n+1}

If convolutional filters have size KxK and stride 1, and the pooling layer has pools of size PxP, then each unit in the pooling layer depends upon a patch (at the input of the preceding conv. layer) of size (P+K-1)x(P+K-1).
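A tiny worked instance of this formula (the function name is mine):

```python
def pool_receptive_field(K, P):
    """Patch size, at the conv layer's input, seen by one pooling unit
    (KxK conv with stride 1 followed by PxP pooling): P + K - 1."""
    return P + K - 1

print(pool_receptive_field(K=3, P=2))  # 4 -> each pooling unit sees a 4x4 patch
print(pool_receptive_field(K=5, P=2))  # 6 -> a 6x6 patch
```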
ConvNets: Typical Stage

One stage (zoom): Convol. → Pooling

Conceptually similar to: SIFT, HoG, etc.


ConvNets: Typical Architecture

One stage (zoom): Convol. → Pooling

Whole system: Input Image → 1st stage → 2nd stage → 3rd stage → Fully Conn. Layers → Class Labels
Conceptually similar to:

SIFT → K-Means → Pyramid Pooling → SVM


Lazebnik et al. “...Spatial Pyramid Matching...” CVPR 2006

SIFT → Fisher Vect. → Pooling → SVM


Sanchez et al. "Image classification with F.V.: Theory and practice" IJCV 2012
Fully connected layers viewed as convolutional layers:

NxMxM feature maps (M small)
→ Fully conn. layer / Conv. layer (H kernels of size NxMxM)
→ H hidden units / Hx1x1 feature maps
→ Fully conn. layer / Conv. layer (K kernels of size Hx1x1)
→ K hidden units / Kx1x1 feature maps
Viewing fully connected layers as convolutional layers enables efficient use of convnets on bigger images (no need to slide windows; the network is simply unrolled over space as needed, re-using computation).
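A PyTorch sketch of this equivalence; the specific sizes (N=64, M=7, H=512, K=10) are assumptions for illustration:

```python
import torch
import torch.nn as nn

N, M, H, K = 64, 7, 512, 10              # assumed sizes for illustration

# Fully connected view: flatten the NxMxM maps, then two linear layers
fc = nn.Sequential(nn.Flatten(),
                   nn.Linear(N * M * M, H), nn.ReLU(),
                   nn.Linear(H, K))

# Convolutional view: H kernels of size NxMxM, then K kernels of size Hx1x1
conv = nn.Sequential(nn.Conv2d(N, H, kernel_size=M), nn.ReLU(),
                     nn.Conv2d(H, K, kernel_size=1))

x = torch.randn(1, N, M, M)
print(fc(x).shape, conv(x).shape)        # (1, 10) vs (1, 10, 1, 1): same units
# On a larger input, the conv view simply produces a spatial map of class scores.
```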

TRAINING TIME: Input Image → CNN

TEST TIME: Input Image x → CNN → y
ConvNets: Test
At test time, run only in forward mode (FPROP).
CONV NETS: EXAMPLES

- OCR / House number & Traffic sign classification
- Texture classification
- Pedestrian detection
- Scene Parsing
- Segmentation of 3D volumetric images
- Action recognition from videos
- Object detection
Architecture for Classification

Input image → convolutional stages (64, 128, 256, 512, 512 filters) → fully connected layers → label

Conv. layers: 3x3 filters
Max pooling layers: 2x2, stride 2
Fully connected layers: 4096 hiddens

24 layers in total!!!
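A rough PyTorch sketch consistent with these settings (3x3 convs, 2x2/stride-2 max pooling, 4096-unit fully connected layers, filter widths 64 → 512). The number of convs per stage, the 224x224 input size, and the 1000-way output are assumptions, not given in the slides:

```python
import torch.nn as nn

def stage(c_in, c_out, n_convs):
    """n_convs 3x3 conv+ReLU layers followed by 2x2 max pooling with stride 2."""
    layers = []
    for _ in range(n_convs):
        layers += [nn.Conv2d(c_in, c_out, kernel_size=3, padding=1), nn.ReLU()]
        c_in = c_out
    return nn.Sequential(*layers, nn.MaxPool2d(kernel_size=2, stride=2))

model = nn.Sequential(
    stage(3, 64, 2), stage(64, 128, 2), stage(128, 256, 2),
    stage(256, 512, 2), stage(512, 512, 2),
    nn.Flatten(),
    nn.Linear(512 * 7 * 7, 4096), nn.ReLU(),  # assumes a 224x224 input
    nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 1000),                    # 1000 class labels (assumed)
)
```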
Architecture for Classification

FLOPS: convolutional part ~20G, fully connected part ~0.1G, TOTAL ~20G
Nr. of parameters: convolutional part ~21M, fully connected part ~123M, TOTAL ~144M

Data augmentation is key to improve generalization:

- random translation
- left/right flipping
- scaling
Optimization

SGD with momentum:
- Learning rate = 0.01
- Momentum = 0.9

Improving generalization by:
- Weight sharing (convolution)
- Input distortions
- Dropout = 0.5
- Weight decay = 0.0005
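Expressed as a PyTorch optimizer configuration (a sketch; the tiny stand-in model is mine, the CNN itself would be defined elsewhere):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # stand-in for the CNN defined earlier

optimizer = torch.optim.SGD(model.parameters(),
                            lr=0.01,             # learning rate = 0.01
                            momentum=0.9,        # momentum = 0.9
                            weight_decay=5e-4)   # weight decay = 0.0005
# Dropout = 0.5 would live inside the model itself, e.g. nn.Dropout(p=0.5) before the FC layers.
```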
Choosing The Architecture
Task dependent

Cross-validation

[Convolution → pooling]* + fully connected layer

The more data: the more layers and the more kernels
Look at the number of parameters at each layer
Look at the number of flops at each layer
Computational resources

Be creative :)
How To Optimize
SGD (with momentum) usually works very well

Pick learning rate by running on a subset of the data


Bottou “Stochastic Gradient Tricks” Neural Networks 2012
Start with a large learning rate and divide by 2 until the loss does not diverge
Decay learning rate by a factor of ~1000 or more by the end of training

Use non-linearity

Initialize parameters so that each feature across layers has


similar variance. Avoid units in saturation.
Improving Generalization
Weight sharing (greatly reduce the number of parameters)
Data augmentation (e.g., jittering, noise injection, etc.)

Dropout
Hinton et al. "Improving neural networks by preventing co-adaptation of feature detectors" arXiv 2012

Weight decay (L2, L1)

Sparsity in the hidden units

Multi-task (unsupervised learning)


Good To Know

Check gradients numerically by finite differences.

Visualize features (feature maps need to be uncorrelated and have high variance).
- Good training: hidden units are sparse across samples and across features.
- Bad training: many hidden units ignore the input and/or exhibit strong correlations.

Visualize parameters.
- Good training: learned filters exhibit structure and are uncorrelated.
- Bad training: filters are too noisy, too correlated, or lack structure.

Measure error on both training and validation set.
Test on a small subset of the data and check that the error → 0.
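A minimal sketch of checking gradients by finite differences (central differences), on a toy loss:

```python
import numpy as np

def numerical_gradient(f, w, eps=1e-5):
    """Central finite differences: grad_i ~ (f(w + eps*e_i) - f(w - eps*e_i)) / (2*eps)."""
    grad = np.zeros_like(w)
    for i in range(w.size):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus.flat[i] += eps
        w_minus.flat[i] -= eps
        grad.flat[i] = (f(w_plus) - f(w_minus)) / (2 * eps)
    return grad

# Toy check: for f(w) = 0.5 * ||w||^2 the analytic gradient is w itself
w = np.random.randn(4)
numeric = numerical_gradient(lambda v: 0.5 * np.sum(v ** 2), w)
print(np.max(np.abs(numeric - w)))  # should be tiny (~1e-10): analytic and numerical gradients match
```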
What If It Does Not Work?
Training diverges:
Learning rate may be too large → decrease learning rate
BPROP is buggy → numerical gradient checking
Parameters collapse / loss is minimized but accuracy is low
Check loss function:
Is it appropriate for the task you want to solve?
Does it have degenerate solutions? Check “pull-up” term.
Network is underperforming
Compute flops and nr. params. → if too small, make net larger
Visualize hidden units/params → fix optimization
Network is too slow
Compute flops and nr. params. → GPU, distributed framework, or make the net smaller
• By applying convolution and pooling, important features are extracted, the dimensionality is reduced, and computation becomes faster.
• These features are then fed to an ANN (fully connected layers) for classification.
