
CS60010: Deep Learning

CNN – Part 2

Sudeshna Sarkar
Spring 2019

6 Feb 2019
Fully Connected Network to CNN
• Too many parameters!
• Can some of the parameters be shared?
Local Patterns
• Some patterns are much smaller than the whole image: a small "beak" detector can represent a small region with far fewer parameters.

Patterns can appear at different places
• An "upper-left beak" detector and a "middle beak" detector do the same job, so they can be compressed to the same parameters.
CNN
• Capture local features based on local sequences
• Spatial or temporal or combination
• Local Connectivity: Connect only local region of input
(receptive field) to the next layer neurons.
• Use each feature at all spatial / temporal contexts: Parameter
Sharing across regions
• 1d convolution: (f ∗ g)[n] = Σ_{m=−M}^{M} f[n−m] g[m] (see the sketch below)
• Group the outputs of a feature afterwards if required.
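To make the 1d convolution formula concrete, here is a minimal Python sketch (plain loops, zero padding at the borders; the function name and example values are illustrative, not from the lecture):

```python
def conv1d(f, g, M):
    """(f * g)[n] = sum_{m=-M}^{M} f[n-m] * g[m], with zero padding at the borders."""
    N = len(f)
    out = []
    for n in range(N):
        s = 0.0
        for m in range(-M, M + 1):
            if 0 <= n - m < N:            # zero padding outside the signal
                s += f[n - m] * g[m + M]  # g stored as g[0..2M] for m = -M..M
        out.append(s)
    return out

print(conv1d([1, 2, 3, 4], [1, 0, -1], M=1))  # [2.0, 2.0, 2.0, -3.0]
```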
A convolutional layer

A CNN is a neural network with some convolutional layers (and some other layers). A convolutional layer has a number of filters that perform the convolution operation.

A filter acts as a pattern detector, e.g. a beak detector.
Convolution

6 x 6 image:
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0

Filter 1 (3 x 3):        Filter 2 (3 x 3):
 1 -1 -1                 -1  1 -1
-1  1 -1                 -1  1 -1
-1 -1  1                 -1  1 -1

These filters are the network parameters to be learned. Each filter detects a small (3 x 3) pattern.
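As a sanity check, here is a short NumPy sketch applying Filter 1 to the 6 x 6 image above with stride 1 and no padding; the 4 x 4 output peaks where the diagonal pattern sits:

```python
import numpy as np

image = np.array([[1,0,0,0,0,1],
                  [0,1,0,0,1,0],
                  [0,0,1,1,0,0],
                  [1,0,0,0,1,0],
                  [0,1,0,0,1,0],
                  [0,0,1,0,1,0]])
filt = np.array([[ 1,-1,-1],
                 [-1, 1,-1],
                 [-1,-1, 1]])

out = np.zeros((4, 4))
for i in range(4):
    for j in range(4):
        out[i, j] = np.sum(image[i:i+3, j:j+3] * filt)
print(out)  # the maximum value (3) appears where the filter's pattern matches
```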
Convolution in Images
https://medium.com/machine-learning-bites/deeplearning-series-convolutional-neural-networks-a9c2f2ee1524
Parameter Sharing: replicated features
• Use many different copies of the same feature detector at different positions. (The red connections in the figure all have the same weight.)
• Replication greatly reduces the number of free parameters to be learned.
• Use several different feature types, each with its own map of replicated detectors.
• This allows each patch of the image to be represented in several ways.
What does replicating the feature detectors achieve?
• Equivariant activities: a translated image gives a correspondingly translated representation in the active neurons.
• Invariant knowledge: if a feature is useful in some locations during training, detectors for that feature will be available in all locations during testing.
Backpropagation with weight constraints
• It's easy to modify the backpropagation algorithm to incorporate linear constraints between the weights.
• We compute the gradients as usual, and then modify the gradients so that they satisfy the constraints (e.g. to keep w1 = w2, apply the combined gradient ∂E/∂w1 + ∂E/∂w2 to both weights), as sketched below.
• So if the weights started off satisfying the constraints, they will continue to satisfy them.
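A tiny numeric sketch of this idea, assuming a hypothetical pair of tied weights w1 = w2 (the gradient values are made up):

```python
# Compute gradients as usual, then use the combined gradient for both
# tied weights so the constraint w1 = w2 is preserved by the update.
w1 = w2 = 0.5
g1, g2 = 0.3, -0.1        # gradients from ordinary backprop
g_shared = g1 + g2        # combined gradient applied to both weights
lr = 0.01
w1 -= lr * g_shared
w2 -= lr * g_shared
assert w1 == w2           # the constraint still holds
```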
Convolution Layer
• Take a 32x32x3 image and a 5x5x3 filter.
• Convolve (slide) the filter over all spatial locations.
• The result is one 28x28x1 activation map.
Zero Padding
(https://medium.com/machine-learning-bites/deeplearning-series-convolutional-neural-networks-a9c2f2ee1524)

e.g. input 7x7, 3x3 filter applied with stride 1, pad with a 1 pixel border => what is the output?
(recall: (N - F) / stride + 1)
7x7 output!

In general, it is common to see CONV layers with stride 1, filters of size FxF, and zero-padding with (F-1)/2 (this will preserve the size spatially):
e.g. F = 3 => zero pad with 1; F = 5 => zero pad with 2; F = 7 => zero pad with 3.
A closer look at spatial dimensions:
• 7x7 input (spatially), assume a 3x3 filter => 5x5 output.
• 7x7 input, 3x3 filter applied with stride 2 => 3x3 output!
• 7x7 input, 3x3 filter applied with stride 3? It doesn't fit! We cannot apply a 3x3 filter to a 7x7 input with stride 3.
Output size: (N - F) / stride + 1

e.g. N = 7, F = 3:
• stride 1 => (7 - 3)/1 + 1 = 5
• stride 2 => (7 - 3)/2 + 1 = 3
• stride 3 => (7 - 3)/3 + 1 = 2.33 :\ (does not fit)
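A small helper sketching this formula (the function name is illustrative):

```python
def conv_output_size(N, F, stride, pad=0):
    return (N + 2 * pad - F) // stride + 1  # integer division; must divide evenly

for stride in (1, 2, 3):
    print(f"N=7, F=3, stride={stride}: {(7 - 3) / stride + 1}")  # 5.0, 3.0, 2.33...
print(conv_output_size(7, 3, 1, pad=1))  # 7: padding preserves the size
```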
Feature Map
For example, if we had six 5x5 filters, we get 6 separate 28x28 activation maps.
We stack these up to get a "new image" of size 28x28x6! (The number of filters, n_c′, gives the depth of the output.)
Preview: a ConvNet is a sequence of convolutional layers, interspersed with activation functions.

e.g. 32x32x3 image → CONV, ReLU (six 5x5x3 filters) → 28x28x6 → CONV, ReLU (ten 5x5x6 filters) → 24x24x10 → ...
Deep Convolutional Network
Preview [From Yann LeCun's slides]
Pool layers
• Convolution layers are often followed by pool layers in CNNs.
• Pooling reduces the size of the representation without losing too much information.
• We often use the max operation to do pooling.

Single depth slice:
1 2 5 6
3 4 2 8      --max pooling-->   4 8
3 4 4 2                         5 6
1 5 6 3

• The pooling layer operates independently on every depth slice of the input and resizes it spatially, using the MAX operation.
• Translational invariance at each level: pooling four neighboring replicated detectors gives a single output to the next level.
Window Size and Stride in pool layers
• The window size is the pooling range.
• The stride is how many pixels the window moves.
• In the example above, window size = stride = 2.
• There are two types of pool layers: if window size = stride, this is traditional pooling; if window size > stride, this is overlapping pooling.
• Larger window sizes and strides are more destructive.

Problem: after several levels of pooling, we have lost information about the precise positions of things. This makes it impossible to use the precise spatial relationships between high-level parts for recognition.
General pooling
• Other pooling functions: average pooling, L2-norm pooling.
• Backpropagation: the backward pass for a max(x, y) operation routes the gradient to the input that had the highest value in the forward pass.
• Hence, during the forward pass of a pooling layer you may keep track of the index of the max activation (sometimes also called the switches) so that gradient routing is efficient during backpropagation, as in the sketch below.
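A minimal NumPy sketch of max pooling with switches, assuming a single depth slice and a 2x2 window with stride 2 (function names are illustrative):

```python
import numpy as np

def maxpool_forward(x):                      # x: (H, W), H and W even
    H, W = x.shape
    out = np.zeros((H // 2, W // 2))
    switches = {}
    for i in range(0, H, 2):
        for j in range(0, W, 2):
            window = x[i:i+2, j:j+2]
            k = np.unravel_index(np.argmax(window), (2, 2))
            out[i // 2, j // 2] = window[k]
            switches[(i // 2, j // 2)] = (i + k[0], j + k[1])  # record the argmax
    return out, switches

def maxpool_backward(dout, switches, shape):
    dx = np.zeros(shape)
    for pos, src in switches.items():        # route gradient to the winning input
        dx[src] += dout[pos]
    return dx

x = np.array([[1, 2, 5, 6], [3, 4, 2, 8], [3, 4, 4, 2], [1, 5, 6, 3]], float)
out, sw = maxpool_forward(x)
print(out)                                    # [[4, 8], [5, 6]] as in the slide
print(maxpool_backward(np.ones_like(out), sw, x.shape))
```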
Why Pooling
• Subsampling pixels will not change the object: a subsampled image of a bird is still a bird.
• We can subsample the pixels to make the image smaller, so fewer parameters are needed to characterize the image.
Pooling Layer
• Inserting a pooling layer:
  • reduces the spatial size of the representation
  • reduces the amount of parameters and computation in the network, and hence also controls overfitting.
• The depth dimension remains unchanged.
Fully-connected layer
• This layer is the same as a layer in a traditional NN.
• We often use this type of layer at the end of a CNN.
[Example architecture on a 32x32 input: alternating CONV-ReLU blocks and POOL layers, ending in a fully-connected layer. Spatial size: 32x32 → 16x16 → 8x8 → 4x4. Weights per layer: 280, 910, 910, 910, 910, 910, 1600.]
Getting rid of pooling
1. Striving for Simplicity: The All Convolutional Net proposes to
discard the pooling layer and have an architecture that only
consists of repeated CONV layers.
• To reduce the size of the representation they suggest using larger
stride in CONV layer once in a while.
• Argument:
• The purpose of pooling layers is to perform dimensionality reduction to
widen subsequent convolutional layers' receptive fields.
• The same effect can be achieved by using a convolutional layer: using a stride
of 2 also reduces the dimensionality of the output and widens the receptive
field of higher layers.
• The resulting operation differs from a max-pooling layer in that
• it cannot perform a true max operation
• it allows pooling across input channels.
2. The second is Very Deep Convolutional Networks for Large-
Scale Image Recognition.
• The core idea here is that hand-tuning layer kernel sizes to
achieve optimal receptive fields (say, 5×5 or 7×7) can be
replaced by simply stacking homogenous 3×3 layers.
• The same effect of widening the receptive field is then
achieved by layer composition rather than increasing the kernel
size
• three stacked 3×3 layers have a 7×7 receptive field.
• At the same time, the number of parameters is reduced:
• a 7×7 layer has 81% more parameters than three stacked 3×3 layers (checked in the sketch below).
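A quick check of the parameter claim above (C input and output feature maps, biases ignored):

```python
C = 64                           # illustrative number of feature maps
p_7x7 = 7 * 7 * C * C            # one 7x7 layer: 49 C^2 weights
p_3x3 = 3 * (3 * 3 * C * C)      # three stacked 3x3 layers: 27 C^2 weights
print(p_7x7 / p_3x3)             # ~1.81, i.e. ~81% more parameters
# Receptive field of three stacked 3x3 layers: 3 + 2 + 2 = 7 (same as one 7x7).
```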
Case Study: LeNet-5 [LeCun et al., 1998]
• Conv filters were 5x5, applied at stride 1.
• Subsampling (pooling) layers were 2x2, applied at stride 2.
• i.e. the architecture is [CONV-POOL-CONV-POOL-CONV-FC].
The ILSVRC-2012 competition on ImageNet
• The dataset has 1.2 million high-resolution training images.
• The classification task: get the "correct" class in your top 5 bets. There are 1000 classes.
• The localization task: for each bet, put a box around the object. Your box must have at least 50% overlap with the correct box.
• Some of the best existing computer vision methods were tried on this dataset by leading computer vision groups from Oxford, INRIA, XRCE, ...
• These computer vision systems use complicated multi-stage pipelines, whose early stages are typically hand-tuned by optimizing a few parameters.
Case Study: AlexNet [Krizhevsky et al. 2012]

AlexNet popularized convolutional networks in computer vision; it was developed by Alex Krizhevsky, Ilya Sutskever and Geoff Hinton. AlexNet was submitted to the ImageNet ILSVRC challenge in 2012 and significantly outperformed the runner-up (top-5 error of 16% vs. 26%). The network was deeper and bigger, and featured convolutional layers stacked directly on top of each other.
A neural network for ImageNet
• Alex Krizhevsky (NIPS 2012) developed a very deep convolutional neural net of the type pioneered by Yann LeCun. Its architecture was:
  – 7 hidden layers, not counting some max pooling layers.
  – The early layers were convolutional.
  – The last two layers were globally connected.
• The activation functions were:
  – Rectified linear units in every hidden layer. These train much faster and are more expressive than logistic units.
  – Competitive normalization to suppress hidden activities when nearby units have stronger activities. This helps with variations in intensity.
Error rates on the ILSVRC-2012 competition (classification / classification & localization):
• University of Toronto (Alex Krizhevsky): 16.4% / 34.1%
• University of Tokyo: 26.1% / 53.6%
• Oxford University Computer Vision Group: 26.9% / 50.0%
• INRIA (French national research institute in CS) + XRCE (Xerox Research Center Europe): 27.0%
• University of Amsterdam: 29.5%
VGGNet: ILSVRC 2014 2nd place
• Sequence of deeper networks trained progressively
• Large receptive fields replaced by successive layers of 3x3 convolutions (with ReLU in between)
• One 7x7 conv layer with C feature maps needs 49C² weights; three 3x3 conv layers need only 27C² weights
• Experimented with 1x1 convolutions

K. Simonyan and A. Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition, ICLR 2015
GoogLeNet: ILSVRC 2014 winner
• The Inception Module
• Inception Module dramatically reduced the number of
parameters in the network
(4M, compared to AlexNet with 60M).
• Uses Average Pooling instead of Fully Connected layers at
the top of the ConvNet
• Several followup versions to the GoogLeNet, most
recently Inception-v4.

C. Szegedy et al., Going deeper with convolutions, CVPR 2015


GoogLeNet
• The Inception Module: parallel paths with different receptive field sizes and operations are meant to capture sparse patterns of correlations in the stack of feature maps.
• Use 1x1 convolutions for dimensionality reduction before the expensive convolutions.

C. Szegedy et al., Going deeper with convolutions, CVPR 2015
Case Study: GoogLeNet [Szegedy et al., 2014]
• Inception module; ILSVRC 2014 winner (6.7% top-5 error)
Case Study: GoogLeNet [Szegedy et al., 2014]
• 1x1 dimension reduction layers in the Inception module (reduce compute bottlenecks)
Case Study: GoogLeNet [Szegedy et al., 2014]
• Auxiliary classifier: a helper loss, used during training only.

C. Szegedy et al., Going deeper with convolutions, CVPR 2015

Case Study: GoogLeNet

Fun features:
- Only 5 million params! (Removes FC layers completely)

Compared to AlexNet:
- 12x less params
- 2x more compute
- 6.67% top-5 error (vs. 16.4%)
Inception v2, v3
• Regularize training with batch normalization, reducing the importance of the auxiliary classifiers
• More variants of inception modules with aggressive factorization of filters
• Increase the number of feature maps while decreasing spatial resolution (pooling)

C. Szegedy et al., Rethinking the inception architecture for computer vision, CVPR 2016
ResNet
• The residual module
• Introduces skip or shortcut connections (which existed before in various forms in the literature)
• Makes it easy for network layers to represent the identity mapping

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, Deep Residual Learning for Image Recognition, CVPR 2016 (Best Paper)
Case Study: ResNet
[He et al., 2015]
- Batch Normalization after every CONV layer
- Xavier/2 initialization from He et al.
- SGD + Momentum (0.9)
- Learning rate: 0.1, divided by 10 when validation error
plateaus
- Mini-batch size 256
- Weight decay of 1e-5
- No dropout used
ResNet: deeper residual module (bottleneck)
• Directly performing 3x3 convolutions with 256 feature maps at input and output:
  256 x 256 x 3 x 3 ~ 600K operations
• Using 1x1 convolutions to reduce 256 to 64 feature maps, followed by 3x3 convolutions, followed by 1x1 convolutions to expand back to 256 maps:
  256 x 64 x 1 x 1 ~ 16K
  64 x 64 x 3 x 3 ~ 36K
  64 x 256 x 1 x 1 ~ 16K
  Total: ~70K

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, Deep Residual Learning for Image Recognition, CVPR 2016 (Best Paper)
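The operation counts above, recomputed (multiplies per spatial position, biases ignored):

```python
direct = 256 * 256 * 3 * 3                      # 589,824  (~600K)
bottleneck = (256 * 64 * 1 * 1 +                # 1x1 reduce:  16,384
              64 * 64 * 3 * 3 +                 # 3x3 conv:    36,864
              64 * 256 * 1 * 1)                 # 1x1 expand:  16,384
print(direct, bottleneck, direct / bottleneck)  # ~70K total, ~8.5x cheaper
```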
Case Study: ResNet [He et al., 2015]
(slide from Kaiming He's recent presentation)
ResNet
• Architectures for ImageNet:

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, Deep Residual Learning for Image Recognition,
CVPR 2016 (Best Paper)
Inception v4

C. Szegedy et al., Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning, arXiv 2016
Summary: ILSVRC 2012-2015

Team                                       Year  Place  Error (top-5)  External data
SuperVision – Toronto (AlexNet, 7 layers)  2012  -      16.4%          no
SuperVision                                2012  1st    15.3%          ImageNet 22k
Clarifai – NYU (7 layers)                  2013  -      11.7%          no
Clarifai                                   2013  1st    11.2%          ImageNet 22k
VGG – Oxford (16 layers)                   2014  2nd    7.32%          no
GoogLeNet (19 layers)                      2014  1st    6.67%          no

http://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/
Accuracy vs. efficiency

https://culurciello.github.io/tech/2016/06/04/nets.html
Design principles
• Reduce filter sizes (except possibly at the lowest layer),
factorize filters aggressively
• Use 1x1 convolutions to reduce and expand the number of
feature maps judiciously
• Use skip connections and/or create multiple paths through
the network
What’s missing from the picture?
• Training tricks and details: initialization, regularization,
normalization
• Training data augmentation
• Averaging classifier outputs over multiple crops/flips
• Ensembles of networks

• What about ILSVRC 2016?


• No more ImageNet classification
• No breakthroughs comparable to ResNet
APPLICATIONS
Object classification [9]

Human Pose Estimation [10]

Super Resolution [11]

CNN on Text
Convolution vs. Fully Connected

6 x 6 image:       Filter 1:     Filter 2:
1 0 0 0 0 1         1 -1 -1      -1  1 -1
0 1 0 0 1 0        -1  1 -1      -1  1 -1
0 0 1 1 0 0        -1 -1  1      -1  1 -1
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0

• Convolution: the filters slide over the image directly.
• Fully-connected: the image is flattened into inputs x1 ... x36, and every input is connected to every neuron.
• Convolution needs fewer parameters!
Convolution as a sparse fully-connected layer: flatten the 6 x 6 image into 36 inputs (1, 2, 3, ..., 36). A neuron computing one output of Filter 1 connects only to the 9 inputs covered by its 3 x 3 window (inputs 1, 2, 3, 7, 8, 9, 13, 14, 15), with the filter values as the weights. Fewer parameters: each output connects to only 9 inputs instead of all 36.

Transfer Learning

"You need a lot of data if you want to train/use CNNs" – a common belief, but transfer learning lets us reuse features learned on a large dataset.
Transfer Learning with CNNs

1. Train on ImageNet.
2. If you have a small dataset: fix all the weights (treat the CNN as a fixed feature extractor) and retrain only the classifier, i.e. swap the Softmax layer at the end.
3. If you have a medium-sized dataset: "finetune" instead. Use the old weights as initialization and train the full network, or only some of the higher layers; retrain a bigger portion of the network, or even all of it.

Tip: use only ~1/10th of the original learning rate when finetuning the top layer, and ~1/100th on the intermediate layers.
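A minimal PyTorch sketch of this recipe, assuming torchvision's pretrained ResNet-18 as the base model and 20 illustrative target classes (API details vary slightly across torchvision versions):

```python
import torch
import torch.nn as nn
import torchvision.models as models

model = models.resnet18(pretrained=True)   # 1. start from ImageNet weights

# 2. Small dataset: freeze everything, retrain only a new classifier head.
for p in model.parameters():
    p.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, 20)  # new head trains from scratch

# 3. Medium dataset: finetune instead; unfreeze the network and use a much
#    smaller learning rate on the pretrained layers (per the tip above).
for p in model.parameters():
    p.requires_grad = True
optimizer = torch.optim.SGD([
    {"params": model.fc.parameters(), "lr": 1e-2},          # fresh head
    {"params": (p for n, p in model.named_parameters()
                if not n.startswith("fc")), "lr": 1e-4},    # ~1/100th
], momentum=0.9)
```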
Case Study: VGGNet [Simonyan and Zisserman, 2014]

Only 3x3 CONV with stride 1, pad 1, and 2x2 MAX POOL with stride 2.

Best model: 11.2% top-5 error in ILSVRC 2013 -> 7.3% top-5 error.
INPUT: [224x224x3] memory: 224*224*3=150K params: 0 (not counting biases)
CONV3-64: [224x224x64] memory: 224*224*64=3.2M params: (3*3*3)*64 = 1,728
CONV3-64: [224x224x64] memory: 224*224*64=3.2M params: (3*3*64)*64 = 36,864
POOL2: [112x112x64] memory: 112*112*64=800K params: 0
CONV3-128: [112x112x128] memory: 112*112*128=1.6M params: (3*3*64)*128 = 73,728
CONV3-128: [112x112x128] memory: 112*112*128=1.6M params: (3*3*128)*128 = 147,456
POOL2: [56x56x128] memory: 56*56*128=400K params: 0
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*128)*256 = 294,912
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*256)*256 = 589,824
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*256)*256 = 589,824
POOL2: [28x28x256] memory: 28*28*256=200K params: 0
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*256)*512 = 1,179,648
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*512)*512 = 2,359,296
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*512)*512 = 2,359,296
POOL2: [14x14x512] memory: 14*14*512=100K params: 0
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
POOL2: [7x7x512] memory: 7*7*512=25K params: 0
FC: [1x1x4096] memory: 4096 params: 7*7*512*4096 = 102,760,448
FC: [1x1x4096] memory: 4096 params: 4096*4096 = 16,777,216
FC: [1x1x1000] memory: 1000 params: 4096*1000 = 4,096,000

TOTAL memory: 24M * 4 bytes ~= 93MB / image (only forward! ~*2 for bwd)
TOTAL params: 138M parameters

Note: most memory is in the early CONV layers; most params are in the late FC layers.
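A short sketch recomputing the VGG-16 parameter total from the layer configuration above (biases ignored):

```python
# Conv params = 3*3*C_in*C_out; "P" marks a 2x2 pooling layer.
cfg = [64, 64, "P", 128, 128, "P", 256, 256, 256, "P",
       512, 512, 512, "P", 512, 512, 512, "P"]
C, params = 3, 0
for v in cfg:
    if v != "P":
        params += 3 * 3 * C * v
        C = v
params += 7 * 7 * 512 * 4096 + 4096 * 4096 + 4096 * 1000  # the three FC layers
print(f"{params / 1e6:.0f}M parameters")  # ~138M, matching the table
```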
Case Study: ResNet [He et al., 2015]
ILSVRC 2015 winner (3.6% top-5 error)
(slide from Kaiming He's presentation: https://www.youtube.com/watch?v=1PGLj-uKT1w)
• 2-3 weeks of training on an 8-GPU machine
• At runtime: faster than a VGGNet! (even though it has 8x more layers)
Case Study: ResNet [He et al., 2015]
• Input 224x224x3; the spatial dimension shrinks to only 56x56 early in the network.
Case Study: ResNet [He et al., 2015]
(this trick is also used in GoogLeNet)
Case Study Bonus: DeepMind's AlphaGo

Policy network:
[19x19x48] input
CONV1: 192 5x5 filters, stride 1, pad 2 => [19x19x192]
CONV2..12: 192 3x3 filters, stride 1, pad 1 => [19x19x192]
CONV: 1 1x1 filter, stride 1, pad 0 => [19x19] (probability map of promising moves)
Summary
• ConvNets stack CONV,ReLU,POOL,FC layers
• Trend towards smaller filters and deeper architectures
• Trend towards getting rid of POOL/FC layers (just CONV)
• Early architectures look like
[(CONV-RELU)*N-POOL?]*M-(FC-RELU)*K,SOFTMAX
• but recent advances such as ResNet/GoogLeNet use only Conv-
ReLU, 1x1 convolutions and Softmax

Activation Functions: Biological Inspiration
Activation Functions
• Sigmoid: σ(x) = 1 / (1 + e^(-x))
• tanh: tanh(x)
• ReLU: max(0, x)
• Leaky ReLU: max(0.1x, x)
• Maxout
• ELU
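Plain NumPy definitions of the activations listed above (a sketch; the alpha values are common defaults, not prescribed by the slides):

```python
import numpy as np

def sigmoid(x):    return 1.0 / (1.0 + np.exp(-x))
def tanh(x):       return np.tanh(x)
def relu(x):       return np.maximum(0, x)
def leaky_relu(x): return np.maximum(0.1 * x, x)
def elu(x, a=1.0): return np.where(x > 0, x, a * (np.exp(x) - 1))
def maxout(x, w1, b1, w2, b2):          # maxout of two linear pieces
    return np.maximum(x @ w1 + b1, x @ w2 + b2)
```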
Sigmoid (logistic function)
• Squashes numbers to range [0, 1] – can kill gradients.
• A key element in LSTM networks – "control signals".
• Best for learning "logical" functions, i.e. functions on binary inputs.
• Not as good for image networks (replaced by ReLU).
• Not zero-centered.
Consider what happens when the input to a neuron (x) is always positive:
• What can we say about the gradients on w? They are always all positive or all negative :(
• The allowed gradient update directions then span only two quadrants, so reaching a hypothetical optimal w vector requires an inefficient zig-zag path.
• (This is also why you want zero-mean data!)
tanh(x) [LeCun et al., 1991]
• Squashes numbers to range [-1, 1]
• Zero-centered (nice)
• Still kills gradients when saturated :(
• Also used in LSTMs for bounded, signed values.
• Not as good for binary functions
ReLU (Rectified Linear Unit) [Krizhevsky et al., 2012]
• Computes f(x) = max(0, x)
• Does not saturate (in + region)
• Converges faster than sigmoid/tanh on image data (e.g. 6x)
• Not suitable for logical functions
• Not for control in recurrent nets
ReLU (continued)
• Very computationally efficient
• Converges much faster than sigmoid/tanh in practice (e.g. 6x)
• Not zero-centered output
• An annoyance: what is the gradient when x < 0? It is 0.

Consider a ReLU gate with input x:
• What happens when x = -10? The gradient is 0, so the unit does not update.
• What happens when x = 0? The gradient is undefined (taken as 0 in practice).
• What happens when x = 10? The gradient passes through unchanged.
• A ReLU unit whose weights put it outside the data cloud is a dead ReLU: it will never activate, and therefore never update. An active ReLU lies inside the data cloud.
• => Need to take care that no ReLU unit is outside the data envelope; it will never learn.
Leaky ReLU [Maas et al., 2013] [He et al., 2015]
• Does not saturate
• Converges faster than sigmoid/tanh on image data (e.g. 6x)
• Will not "die"
Parametric Rectifier (PReLU) [He et al., 2015]
• f(x) = max(αx, x); backprop into α (a learned parameter).
• Same benefits as Leaky ReLU: does not saturate, converges faster than sigmoid/tanh on image data, and will not "die".
Exponential Linear Units (ELU) [Clevert et al., 2015]
• All benefits of ReLU
• Does not die
• Closer to zero-mean outputs
Maxout "Neuron" [Goodfellow et al., 2013]
• Does not have the basic form of dot product -> nonlinearity: it computes max(w1ᵀx + b1, w2ᵀx + b2).
• Generalizes ReLU and Leaky ReLU.
• Linear regime! Does not saturate! Does not die!
• Problem: doubles the number of parameters per neuron :(
TLDR: In practice:
Try everything. Usually:
• Use ReLU on early image layers.
• Try out Leaky ReLU / Maxout / ELU.
• Use sigmoids for smooth functions, e.g. robot control.
• Sigmoids are also good for logical functions AND/OR.

Weight Initialization
• Q: what happens when W=0 init is used? (Every neuron computes the same output and receives the same gradient, so the neurons never differentiate.)
• First idea: small random numbers (gaussian with zero mean and 1e-2 standard deviation).
• Works ~okay for small networks, but can lead to non-homogeneous distributions of activations across the layers of a network.
Activation Statistics
E.g. a 10-layer net with 500 neurons on each layer, using tanh non-linearities, and initialized as described on the last slide.
• With std 0.01, all activations become zero at the deeper layers! Q: think about the backward pass; what do the gradients look like? (Hint: think about the backward pass for a W*X gate: tiny activations mean tiny gradients on W.)
• With *1.0 instead of *0.01, almost all neurons are completely saturated, either -1 or 1. The gradients will be all zero.
"Xavier initialization" [Glorot et al., 2010]
• Scale the weights by 1/sqrt(fan_in).
• Easy derivation (linear case): assume weights and inbound activations have mean zero and are independent. The variances multiply for each term, and summing fan_in such terms scales the output variance by fan_in, so dividing the weights by sqrt(fan_in) keeps the variance constant across layers.
• But when using the ReLU nonlinearity it breaks.
He et al., 2015 (note the additional factor of 2: scale by sqrt(2/fan_in) for ReLU)
• The factor of 2 doesn't seem like much, but remember it applies multiplicatively ~150 times in a large ResNet.
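A sketch of the 10-layer activation-statistics experiment above, comparing the naive 0.01 init with Xavier (1/sqrt(fan_in)) and the He variant (sqrt(2/fan_in)); the layer width and batch size are illustrative:

```python
import numpy as np

def run(init, nonlin, layers=10, width=500):
    x = np.random.randn(1000, width)
    for _ in range(layers):
        W = init(width)
        x = nonlin(x @ W)
        print(f"mean {x.mean():+.4f}  std {x.std():.4f}")

tanh = np.tanh
relu = lambda x: np.maximum(0, x)
naive  = lambda n: np.random.randn(n, n) * 0.01               # activations -> 0
xavier = lambda n: np.random.randn(n, n) / np.sqrt(n)         # good for tanh
he     = lambda n: np.random.randn(n, n) * np.sqrt(2.0 / n)   # good for ReLU

run(naive, tanh)   # std collapses toward zero layer by layer
run(xavier, tanh)  # std stays roughly constant
run(he, relu)      # std stays roughly constant with ReLU
```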
Proper initialization is an active area of research…
Understanding the difficulty of training deep feedforward neural networks
by Glorot and Bengio, 2010

Exact solutions to the nonlinear dynamics of learning in deep linear neural networks by Saxe et al, 2013

Random walk initialization for training very deep feedforward networks by Sussillo and Abbott, 2014

Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification by He et al., 2015

Data-dependent Initializations of Convolutional Neural Networks by Krähenbühl et al., 2015

All you need is a good init, Mishkin and Matas, 2015


This Time: Localization and Detection

Results from Faster R-CNN, Ren et al 2015

Computer Vision Tasks

Classification   Classification + Localization   Object Detection    Instance Segmentation
CAT              CAT                             CAT, DOG, DUCK      CAT, DOG, DUCK
(single object)  (single object)                 (multiple objects)  (multiple objects)
Classification + Localization: Task

Classification: C classes
• Input: image
• Output: class label (e.g. CAT)
• Evaluation metric: accuracy

Localization:
• Input: image
• Output: box in the image (x, y, w, h)
• Evaluation metric: Intersection over Union

Classification + Localization: do both, outputting a class label and a box (x, y, w, h).
Classification + Localization: ImageNet
• 1000 classes (same as classification)
• Each image has 1 class and at least one bounding box
• ~800 training images per class
• The algorithm produces 5 (class, box) guesses
• An example is correct if at least one guess has the correct class AND a bounding box with at least 0.5 intersection over union (IoU)

Krizhevsky et al. 2012
Idea #1: Localization as Regression

• Input: image
• Neural net output: box coordinates (4 numbers)
• Correct output: box coordinates (4 numbers)
• Loss: L2 distance
• Only one object: simpler than detection.
Simple Recipe for Classification + Localization

Step 1: Train (or download) a classification model (AlexNet, VGG, GoogLeNet): image → convolution and pooling → final conv feature map → fully-connected layers → class scores, with softmax loss.
Step 2: Attach a new fully-connected "regression head" to the network. The final conv feature map now feeds two heads: the "classification head" (fully-connected layers → class scores) and the "regression head" (fully-connected layers → box coordinates).
Step 3: Train the regression head only, with SGD and an L2 loss on the box coordinates.
Step 4: At test time, use both heads: the image passes through the shared conv layers once, and we read out class scores and box coordinates (see the sketch below).
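A minimal PyTorch sketch of the two-head recipe above (the trunk and layer sizes are illustrative, not the lecture's exact architecture):

```python
import torch
import torch.nn as nn

class ClassifyAndLocalize(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        self.trunk = nn.Sequential(                 # "convolution and pooling"
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(7),
        )
        self.cls_head = nn.Linear(128 * 7 * 7, num_classes)  # class scores
        self.box_head = nn.Linear(128 * 7 * 7, 4)            # (x, y, w, h)

    def forward(self, x):
        feats = self.trunk(x).flatten(1)
        return self.cls_head(feats), self.box_head(feats)

model = ClassifyAndLocalize()
scores, boxes = model(torch.randn(2, 3, 224, 224))
loss = nn.CrossEntropyLoss()(scores, torch.tensor([3, 7])) \
     + nn.MSELoss()(boxes, torch.rand(2, 4))    # softmax loss + L2 loss
```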
Per-class vs. class-agnostic regression

Assume classification over C classes (the classification head outputs C numbers, one per class).
• Class-agnostic regression head: 4 numbers (one box).
• Class-specific regression head: C x 4 numbers (one box per class).
Where to attach the regression head?
• After the conv layers: Overfeat, VGG
• After the last FC layer: DeepPose, R-CNN
Aside: Localizing multiple objects
• Want to localize exactly K objects in each image (e.g. whole cat, cat head, cat left ear, cat right ear for K=4).
• The regression head then outputs K x 4 numbers (one box per object).
Aside: Human Pose Estimation
• Represent a person by K joints.
• Regress (x, y) for each joint from the last fully-connected layer of AlexNet.
• (Details: normalized coordinates, iterative refinement.)

Toshev and Szegedy, "DeepPose: Human Pose Estimation via Deep Neural Networks", CVPR 2014
Idea #2: Sliding Window
• Run the classification + regression network at multiple locations on a high-resolution image.
• Convert fully-connected layers into convolutional layers for efficient computation.
• Combine classifier and regressor predictions across all scales for the final prediction.
Sliding Window: Overfeat
Winner of the ILSVRC 2013 localization challenge.
• Image: 3 x 221 x 221 → convolution + pooling → feature map: 1024 x 5 x 5
• Classification head: FC (4096) → FC (4096) → FC → class scores (1000), trained with softmax loss
• Regression head: FC (4096) → FC (1024) → FC → boxes (1000 x 4), trained with Euclidean loss

Sermanet et al, "Integrated Recognition, Localization and Detection using Convolutional Networks", ICLR 2014
Sliding Window: Overfeat
• Network input: 3 x 221 x 221; apply it to a larger test image, e.g. 3 x 257 x 257.
• Each window position produces a classification score P(cat) (e.g. 0.5, 0.75, 0.6, 0.8) and a box.
• Greedily merge boxes and scores (details in paper) to get the final score (e.g. 0.8).
Sliding Window: Overfeat
• In practice, use many sliding window locations and multiple scales.
• Combine the window positions + score maps and the box regression outputs into the final predictions.

Sermanet et al, "Integrated Recognition, Localization and Detection using Convolutional Networks", ICLR 2014
Efficient Sliding Window: Overfeat
Efficient sliding window by converting the fully-connected layers into convolutions:
• Classification head: feature map (1024 x 5 x 5) → 5x5 conv → 4096 x 1 x 1 → 1x1 conv → 4096 x 1 x 1 → 1x1 conv → class scores: 1000 x 1 x 1
• Regression head: feature map → 5x5 conv → 4096 x 1 x 1 → 1x1 conv → 1024 x 1 x 1 → 1x1 conv → box coordinates: (4 x 1000) x 1 x 1
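A minimal PyTorch sketch of this FC -> conv conversion, assuming the 1024 x 5 x 5 feature map from the slide (the equality check and output shapes are the point, not the exact architecture):

```python
import torch
import torch.nn as nn

# An FC layer over a 1024x5x5 feature map becomes a 5x5 convolution with the
# identical weights, so it can slide over larger feature maps for free.
fc = nn.Linear(1024 * 5 * 5, 4096)
conv = nn.Conv2d(1024, 4096, kernel_size=5)
conv.weight.data = fc.weight.data.view(4096, 1024, 5, 5)  # reuse the FC weights
conv.bias.data = fc.bias.data

feat = torch.randn(1, 1024, 5, 5)
out_fc = fc(feat.flatten(1))                  # (1, 4096)
out_conv = conv(feat)                         # (1, 4096, 1, 1)
print(torch.allclose(out_fc, out_conv.flatten(1), atol=1e-5))  # True

bigger = torch.randn(1, 1024, 6, 6)
print(conv(bigger).shape)                     # (1, 4096, 2, 2): a 2x2 output grid
```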
Efficient Sliding Window: Overfeat
• Training time: small image, 1 x 1 classifier output.
• Test time: larger image, 2 x 2 classifier output; only extra compute at the new regions.

Sermanet et al, "Integrated Recognition, Localization and Detection using Convolutional Networks", ICLR 2014
ImageNet Classification + Localization
• AlexNet: localization method not published.
• Overfeat: multiscale convolutional regression with box merging.
• VGG: same as Overfeat, but fewer scales and locations; a simpler method, with the gains all due to deeper features.
• ResNet: a different localization method (RPN) and much deeper features.
Computer Vision Tasks (revisited): we now turn to Object Detection, where each image contains a variable number of objects.
Detection as Regression?
• Image with 4 objects: DOG (x, y, w, h), CAT (x, y, w, h), CAT (x, y, w, h), DUCK (x, y, w, h) = 16 numbers.
• Image with 2 objects: DOG (x, y, w, h), CAT (x, y, w, h) = 8 numbers.
• Image with many cats: CAT (x, y, w, h), CAT (x, y, w, h), ..., CAT (x, y, w, h) = many numbers.
• We need variable-sized outputs!
Detection as Classification
• Slide a window over the image and classify each crop: CAT? NO. DOG? NO. ... CAT? YES!
• Problem: need to test many positions and scales.
• Solution: if your classifier is fast enough, just do it.
Histogram of Oriented Gradients
Dalal and Triggs, "Histograms of Oriented Gradients for Human Detection", CVPR 2005 (slide credit: Ross Girshick)

Deformable Parts Model (DPM)
Felzenszwalb et al, "Object Detection with Discriminatively Trained Part Based Models", PAMI 2010

Aside: Deformable Parts Models are CNNs?
Girshick et al, "Deformable Part Models are Convolutional Neural Networks", CVPR 2015
Detection as Classification
• Problem: need to test many positions and scales, and use a computationally demanding classifier (a CNN).
• Solution: only look at a tiny subset of possible positions.
Region Proposals
• Find "blobby" image regions that are likely to contain objects.
• A "class-agnostic" object detector.
• Look for "blob-like" regions.
Region Proposals: Selective Search
Bottom-up segmentation, merging regions at multiple scales; convert the regions to boxes.

Uijlings et al, "Selective Search for Object Recognition", IJCV 2013
Region Proposals: Many other choices

Hosang et al, "What makes for effective detection proposals?", PAMI 2015
Putting it together: R-CNN

Girshick et al, "Rich feature hierarchies for accurate object detection and semantic segmentation", CVPR 2014 (slide credit: Ross Girshick)
R-CNN Training

Step 1: Train (or download) a classification model for ImageNet (AlexNet): image → convolution and pooling → final conv feature map → fully-connected layers → class scores (1000 classes), with softmax loss.
Step 2: Fine-tune the model for detection
- Instead of 1000 ImageNet classes, we want 20 object classes + background.
- Throw away the final fully-connected layer and reinitialize it from scratch (was 4096 x 1000, now will be 4096 x 21).
- Keep training the model using positive / negative regions from the detection images → class scores: 21 classes.
Step 3: Extract features
- Extract region proposals for all images.
- For each region: warp to the CNN input size, run it forward through the CNN, save the pool5 features to disk.
- Have a big hard drive: the features are ~200GB for the PASCAL dataset!
(Image → region proposals → crop + warp → forward pass → save pool5 features to disk)
Step 4: Train one binary SVM per class to classify the cached region features
- For the cat SVM: cat regions are positive samples, all other regions are negative.
- For the dog SVM: dog regions are positive, all other regions are negative.
Step 5 (bbox regression): For each class, train a linear regression model to map from the cached features to offsets to the ground-truth boxes, making up for "slightly wrong" proposals.

Regression targets (dx, dy, dw, dh), in normalized coordinates:
- (0, 0, 0, 0): proposal is good
- (0.25, 0, 0, 0): proposal too far to the left
- (0, 0, -0.125, 0): proposal too wide
Object Detection: Datasets

                              PASCAL VOC   ImageNet Detection   MS-COCO
                              (2010)       (ILSVRC 2014)        (2014)
Number of classes             20           200                  80
Number of images (train+val)  ~20k         ~470k                ~120k
Mean objects per image        2.4          1.1                  7.2
Object Detection: Evaluation

We use a metric called “mean average precision” (mAP)

Compute average precision (AP) separately for each class, then average over classes

A detection is a true positive if it has IoU (Intersection over Union) with a ground-truth box
greater than some threshold (usually 0.5) (mAP@0.5)

Combine all detections from all test images to draw a precision / recall curve for each class; AP is
area under the curve

TL;DR mAP is a number from 0 to 100; high is good

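A minimal sketch of the IoU computation behind the 0.5 threshold (boxes as (x1, y1, x2, y2)):

```python
def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # 0.333...: not a true positive at 0.5
```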
R-CNN Results (compared against Wang et al, "Regionlets for Generic Object Detection", ICCV 2013)
• Big improvement compared to pre-CNN methods.
• Bounding box regression helps a bit.
• Features from a deeper network help a lot.
R-CNN Problems
1. Slow at test-time: need to run a full forward pass of the CNN for each region proposal.
2. SVMs and regressors are post-hoc: the CNN features are not updated in response to the SVMs and regressors.
3. Complex multistage training pipeline.
Fast R-CNN

Girshick, "Fast R-CNN", ICCV 2015 (slide credit: Ross Girshick)
R-CNN Problem #1: slow at test-time due to independent forward passes of the CNN.
Solution: share the computation of the convolutional layers between proposals for an image.

R-CNN Problem #2: post-hoc training; the CNN is not updated in response to the final classifiers and regressors.
R-CNN Problem #3: complex training pipeline.
Solution: just train the whole system end-to-end all at once!

(slide credit: Ross Girshick)
Fast R-CNN: Region of Interest Pooling
• Hi-res input image (3 x 800 x 600, with a region proposal) → convolution and pooling → hi-res conv features (C x H x W, with the region proposal).
• Problem: the fully-connected layers expect low-res conv features (C x h x w).
• Solution: project the region proposal onto the conv feature map, divide the projected region into an h x w grid, and max-pool within each grid cell.
• This yields RoI conv features of size C x h x w for the region proposal, and we can backpropagate through it, similar to max pooling (see the usage sketch below).
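A usage sketch with torchvision's built-in RoI pooling (assuming torchvision provides torchvision.ops.roi_pool; the feature stride of 16 is illustrative):

```python
import torch
from torchvision.ops import roi_pool

features = torch.randn(1, 256, 50, 38)           # C x H x W conv features
# one proposal: (batch_index, x1, y1, x2, y2) in input-image coordinates
rois = torch.tensor([[0, 100., 100., 400., 300.]])
pooled = roi_pool(features, rois, output_size=(7, 7),
                  spatial_scale=1 / 16)           # conv feature stride ~16
print(pooled.shape)                               # torch.Size([1, 256, 7, 7])
```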
Fast R-CNN Results (using the VGG-16 CNN on the PASCAL VOC 2007 dataset)

                      R-CNN       Fast R-CNN
Training time:        84 hours    9.5 hours     (8.8x speedup: faster!)
Test time per image:  47 seconds  0.32 seconds  (146x speedup: much faster!)
mAP (VOC 2007):       66.0        66.9          (better!)
Fast R-CNN Problem: the test-time speeds don't include region proposals.

                                        R-CNN       Fast R-CNN
Test time per image:                    47 seconds  0.32 seconds  (146x)
Test time per image with
Selective Search proposals:             50 seconds  2 seconds     (25x)

Solution: just make the CNN do the region proposals too!
Faster R-CNN:
• Insert a Region Proposal Network (RPN) after the last convolutional layer.
• The RPN is trained to produce region proposals directly; no need for external region proposals!
• After the RPN, use RoI pooling and an upstream classifier and bbox regressor, just like Fast R-CNN.

Ren et al, "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks", NIPS 2015 (slide credit: Ross Girshick)
Faster R-CNN: Region Proposal Network
• Slide a small window over the conv feature map.
• Build a small network (implemented with 1x1 convs) for: classifying object vs. not-object, and regressing bbox locations.
• The position of the sliding window provides localization information with reference to the image.
• Box regression provides finer localization information with reference to the sliding window.

(slide credit: Kaiming He)
Faster R-CNN: Region Proposal Network
• Use N anchor boxes at each location.
• Anchors are translation invariant: use the same ones at every location.
• Regression gives offsets from the anchor boxes.
• Classification gives the probability that each (regressed) anchor shows an object.
Faster R-CNN: Training
• In the paper: an ugly pipeline. Alternating optimization is used to train the RPN, then Fast R-CNN with RPN proposals, etc.; more complex than it has to be.
• Since publication: joint training! One network, four losses:
  - RPN classification (anchor good / bad)
  - RPN regression (anchor -> proposal)
  - Fast R-CNN classification (over classes)
  - Fast R-CNN regression (proposal -> box)

(slide credit: Ross Girshick)
Faster R-CNN: Results

                                   R-CNN       Fast R-CNN  Faster R-CNN
Test time per image
(with proposals):                  50 seconds  2 seconds   0.2 seconds
(Speedup)                          1x          25x         250x
mAP (VOC 2007)                     66.0        66.9        66.9
Object Detection State-of-the-art:
ResNet 101 + Faster R-CNN + some extras

He et al., "Deep Residual Learning for Image Recognition", arXiv 2015

ImageNet Detection 2013 - 2015
YOLO: You Only Look Once (Detection as Regression)
• Divide the image into an S x S grid.
• Within each grid cell, predict B boxes (4 coordinates + confidence each) and C class scores.
• Regression from the image to a 7 x 7 x (5 * B + C) tensor.
• Direct prediction using a CNN.

Redmon et al, "You Only Look Once: Unified, Real-Time Object Detection", arXiv 2015
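A small sketch of the YOLO output tensor for the classic setting S=7, B=2, C=20 (so 7 x 7 x 30), splitting it into boxes and class scores:

```python
import torch

S, B, C = 7, 2, 20
pred = torch.randn(S, S, 5 * B + C)             # network output for one image
boxes = pred[..., :5 * B].reshape(S, S, B, 5)   # per cell: B x (x, y, w, h, conf)
class_scores = pred[..., 5 * B:]                # per cell: C class scores
print(boxes.shape, class_scores.shape)          # (7, 7, 2, 5) and (7, 7, 20)
```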
YOLO is faster than Faster R-CNN, but not as good (Redmon et al, 2015).
Object Detection code links:

R-CNN (Caffe + MATLAB): https://github.com/rbgirshick/rcnn
  Probably don't use this; too slow.

Fast R-CNN (Caffe + MATLAB): https://github.com/rbgirshick/fast-rcnn

Faster R-CNN (Caffe + MATLAB): https://github.com/ShaoqingRen/faster_rcnn
  (Caffe + Python): https://github.com/rbgirshick/py-faster-rcnn

YOLO: http://pjreddie.com/darknet/yolo/
Recap
Localization:
- Find a fixed number of objects (one or many)
- L2 regression from CNN features to box coordinates
- Much simpler than detection; consider it for your projects!
- Overfeat: Regression + efficient sliding window with FC -> conv conversion
- Deeper networks do better
Object Detection:
- Find a variable number of objects by classifying image regions
- Before CNNs: dense multiscale sliding window (HoG, DPM)
- Avoid dense sliding window with region proposals
- R-CNN: Selective Search + CNN classification / regression
- Fast R-CNN: Swap order of convolutions and region extraction
- Faster R-CNN: Compute region proposals within the network
- Deeper networks do better
