
CS60010: Deep Learning

CNN – Part 2

Sudeshna Sarkar
Spring 2019

6 Feb 2019
Fully Connected Network to CNN
• Too many parameters!
• Can some of the parameters be shared?
Local Patterns
• Some patterns are much smaller than the whole image: a small "beak" detector can represent a small region with far fewer parameters.

Patterns can appear at different places
• An "upper-left beak" detector and a "middle beak" detector do the same job, so they can be compressed to the same parameters.
CNN
• Capture local features based on local sequences
• Spatial or temporal or combination
• Local Connectivity: Connect only local region of input
(receptive field) to the next layer neurons.
• Use each feature at all spatial / temporal contexts: Parameter
Sharing across regions
• 1d convolution: (f ∗ g)[n] = Σ_{m=−M}^{M} f[n−m] g[m] (see the sketch below)
• Group the outputs of a feature afterwards if required.
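To make the 1d convolution formula concrete, here is a minimal Python sketch (plain loops, zero padding at the borders; the function name and example values are illustrative, not from the lecture):

```python
def conv1d(f, g, M):
    """(f * g)[n] = sum_{m=-M}^{M} f[n-m] * g[m], with zero padding at the borders."""
    N = len(f)
    out = []
    for n in range(N):
        s = 0.0
        for m in range(-M, M + 1):
            if 0 <= n - m < N:            # zero padding outside the signal
                s += f[n - m] * g[m + M]  # g stored as g[0..2M] for m = -M..M
        out.append(s)
    return out

print(conv1d([1, 2, 3, 4], [1, 0, -1], M=1))  # [2.0, 2.0, 2.0, -3.0]
```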
A convolutional layer

A CNN is a neural network with some convolutional layers (and some other layers). A convolutional layer has a number of filters that perform the convolution operation.

A filter acts as a pattern detector, e.g. a beak detector.
Convolution

6 x 6 image:
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0

Filter 1 (3 x 3):        Filter 2 (3 x 3):
 1 -1 -1                 -1  1 -1
-1  1 -1                 -1  1 -1
-1 -1  1                 -1  1 -1

These filters are the network parameters to be learned. Each filter detects a small (3 x 3) pattern.
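As a sanity check, here is a short NumPy sketch applying Filter 1 to the 6 x 6 image above with stride 1 and no padding; the 4 x 4 output peaks where the diagonal pattern sits:

```python
import numpy as np

image = np.array([[1,0,0,0,0,1],
                  [0,1,0,0,1,0],
                  [0,0,1,1,0,0],
                  [1,0,0,0,1,0],
                  [0,1,0,0,1,0],
                  [0,0,1,0,1,0]])
filt = np.array([[ 1,-1,-1],
                 [-1, 1,-1],
                 [-1,-1, 1]])

out = np.zeros((4, 4))
for i in range(4):
    for j in range(4):
        out[i, j] = np.sum(image[i:i+3, j:j+3] * filt)
print(out)  # the maximum value (3) appears where the filter's pattern matches
```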
Convolution in Images
https://medium.com/machine-learning-bites/deeplearning-series-convolutional-neural-networks-a9c2f2ee1524
Parameter Sharing: replicated features
• Use many different copies of the same feature detector at different positions. (The red connections in the figure all have the same weight.)
• Replication greatly reduces the number of free parameters to be learned.
• Use several different feature types, each with its own map of replicated detectors.
• This allows each patch of the image to be represented in several ways.
What does replicating the feature detectors achieve?
• Equivariant activities: a translated image gives a correspondingly translated representation in the active neurons.
• Invariant knowledge: if a feature is useful in some locations during training, detectors for that feature will be available in all locations during testing.
Backpropagation with weight constraints
• It's easy to modify the backpropagation algorithm to incorporate linear constraints between the weights.
• We compute the gradients as usual, and then modify the gradients so that they satisfy the constraints (e.g. to keep w1 = w2, apply the combined gradient ∂E/∂w1 + ∂E/∂w2 to both weights), as sketched below.
• So if the weights started off satisfying the constraints, they will continue to satisfy them.
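A tiny numeric sketch of this idea, assuming a hypothetical pair of tied weights w1 = w2 (the gradient values are made up):

```python
# Compute gradients as usual, then use the combined gradient for both
# tied weights so the constraint w1 = w2 is preserved by the update.
w1 = w2 = 0.5
g1, g2 = 0.3, -0.1        # gradients from ordinary backprop
g_shared = g1 + g2        # combined gradient applied to both weights
lr = 0.01
w1 -= lr * g_shared
w2 -= lr * g_shared
assert w1 == w2           # the constraint still holds
```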
Convolution Layer
• Take a 32x32x3 image and a 5x5x3 filter.
• Convolve (slide) the filter over all spatial locations.
• The result is one 28x28x1 activation map.
Zero Padding
(https://medium.com/machine-learning-bites/deeplearning-series-convolutional-neural-networks-a9c2f2ee1524)

e.g. input 7x7, 3x3 filter applied with stride 1, pad with a 1 pixel border => what is the output?
(recall: (N - F) / stride + 1)
7x7 output!

In general, it is common to see CONV layers with stride 1, filters of size FxF, and zero-padding with (F-1)/2 (this will preserve the size spatially):
e.g. F = 3 => zero pad with 1; F = 5 => zero pad with 2; F = 7 => zero pad with 3.
A closer look at spatial dimensions:
• 7x7 input (spatially), assume a 3x3 filter => 5x5 output.
• 7x7 input, 3x3 filter applied with stride 2 => 3x3 output!
• 7x7 input, 3x3 filter applied with stride 3? It doesn't fit! We cannot apply a 3x3 filter to a 7x7 input with stride 3.
Output size: (N - F) / stride + 1

e.g. N = 7, F = 3:
• stride 1 => (7 - 3)/1 + 1 = 5
• stride 2 => (7 - 3)/2 + 1 = 3
• stride 3 => (7 - 3)/3 + 1 = 2.33 :\ (does not fit)
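A small helper sketching this formula (the function name is illustrative):

```python
def conv_output_size(N, F, stride, pad=0):
    return (N + 2 * pad - F) // stride + 1  # integer division; must divide evenly

for stride in (1, 2, 3):
    print(f"N=7, F=3, stride={stride}: {(7 - 3) / stride + 1}")  # 5.0, 3.0, 2.33...
print(conv_output_size(7, 3, 1, pad=1))  # 7: padding preserves the size
```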
Feature Map
For example, if we had six 5x5 filters, we get 6 separate 28x28 activation maps.
We stack these up to get a "new image" of size 28x28x6! (The number of filters, n_c′, gives the depth of the output.)
Preview: a ConvNet is a sequence of convolutional layers, interspersed with activation functions.

e.g. 32x32x3 image → CONV, ReLU (six 5x5x3 filters) → 28x28x6 → CONV, ReLU (ten 5x5x6 filters) → 24x24x10 → ...
Deep Convolutional Network
Preview [From Yann LeCun's slides]
Pool layers
• Convolution layers are often followed by pool layers in CNNs.
• Pooling reduces the size of the representation without losing too much information.
• We often use the max operation to do pooling.

Single depth slice:
1 2 5 6
3 4 2 8      --max pooling-->   4 8
3 4 4 2                         5 6
1 5 6 3

• The pooling layer operates independently on every depth slice of the input and resizes it spatially, using the MAX operation.
• Translational invariance at each level: pooling four neighboring replicated detectors gives a single output to the next level.
Window Size and Stride in pool layers
• The window size is the pooling range.
• The stride is how many pixels the window moves.
• In the example above, window size = stride = 2.
• There are two types of pool layers: if window size = stride, this is traditional pooling; if window size > stride, this is overlapping pooling.
• Larger window sizes and strides are more destructive.

Problem: after several levels of pooling, we have lost information about the precise positions of things. This makes it impossible to use the precise spatial relationships between high-level parts for recognition.
General pooling
• Other pooling functions: average pooling, L2-norm pooling.
• Backpropagation: the backward pass for a max(x, y) operation routes the gradient to the input that had the highest value in the forward pass.
• Hence, during the forward pass of a pooling layer you may keep track of the index of the max activation (sometimes also called the switches) so that gradient routing is efficient during backpropagation, as in the sketch below.
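A minimal NumPy sketch of max pooling with switches, assuming a single depth slice and a 2x2 window with stride 2 (function names are illustrative):

```python
import numpy as np

def maxpool_forward(x):                      # x: (H, W), H and W even
    H, W = x.shape
    out = np.zeros((H // 2, W // 2))
    switches = {}
    for i in range(0, H, 2):
        for j in range(0, W, 2):
            window = x[i:i+2, j:j+2]
            k = np.unravel_index(np.argmax(window), (2, 2))
            out[i // 2, j // 2] = window[k]
            switches[(i // 2, j // 2)] = (i + k[0], j + k[1])  # record the argmax
    return out, switches

def maxpool_backward(dout, switches, shape):
    dx = np.zeros(shape)
    for pos, src in switches.items():        # route gradient to the winning input
        dx[src] += dout[pos]
    return dx

x = np.array([[1, 2, 5, 6], [3, 4, 2, 8], [3, 4, 4, 2], [1, 5, 6, 3]], float)
out, sw = maxpool_forward(x)
print(out)                                    # [[4, 8], [5, 6]] as in the slide
print(maxpool_backward(np.ones_like(out), sw, x.shape))
```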
Why Pooling
• Subsampling pixels will not change the object: a subsampled image of a bird is still a bird.
• We can subsample the pixels to make the image smaller, so fewer parameters are needed to characterize the image.
Pooling Layer
• Inserting a pooling layer:
  • reduces the spatial size of the representation
  • reduces the amount of parameters and computation in the network, and hence also controls overfitting.
• The depth dimension remains unchanged.
Fully-connected layer
• This layer is the same as a layer in a traditional NN.
• We often use this type of layer at the end of a CNN.
[Example architecture on a 32x32 input: alternating CONV-ReLU blocks and POOL layers, ending in a fully-connected layer. Spatial size: 32x32 → 16x16 → 8x8 → 4x4. Weights per layer: 280, 910, 910, 910, 910, 910, 1600.]
Getting rid of pooling
1. Striving for Simplicity: The All Convolutional Net proposes to
discard the pooling layer and have an architecture that only
consists of repeated CONV layers.
• To reduce the size of the representation they suggest using larger
stride in CONV layer once in a while.
• Argument:
• The purpose of pooling layers is to perform dimensionality reduction to
widen subsequent convolutional layers' receptive fields.
• The same effect can be achieved by using a convolutional layer: using a stride
of 2 also reduces the dimensionality of the output and widens the receptive
field of higher layers.
• The resulting operation differs from a max-pooling layer in that
• it cannot perform a true max operation
• it allows pooling across input channels.
2. The second is Very Deep Convolutional Networks for Large-
Scale Image Recognition.
• The core idea here is that hand-tuning layer kernel sizes to
achieve optimal receptive fields (say, 5×5 or 7×7) can be
replaced by simply stacking homogenous 3×3 layers.
• The same effect of widening the receptive field is then
achieved by layer composition rather than increasing the kernel
size
• three stacked 3×3 layers have a 7×7 receptive field.
• At the same time, the number of parameters is reduced:
• a 7×7 layer has 81% more parameters than three stacked 3×3 layers (checked in the sketch below).
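A quick check of the parameter claim above (C input and output feature maps, biases ignored):

```python
C = 64                           # illustrative number of feature maps
p_7x7 = 7 * 7 * C * C            # one 7x7 layer: 49 C^2 weights
p_3x3 = 3 * (3 * 3 * C * C)      # three stacked 3x3 layers: 27 C^2 weights
print(p_7x7 / p_3x3)             # ~1.81, i.e. ~81% more parameters
# Receptive field of three stacked 3x3 layers: 3 + 2 + 2 = 7 (same as one 7x7).
```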
Case Study: LeNet-5 [LeCun et al., 1998]
• Conv filters were 5x5, applied at stride 1.
• Subsampling (pooling) layers were 2x2, applied at stride 2.
• i.e. the architecture is [CONV-POOL-CONV-POOL-CONV-FC].
The ILSVRC-2012 competition on ImageNet
• The dataset has 1.2 million high-resolution training images.
• The classification task: get the "correct" class in your top 5 bets. There are 1000 classes.
• The localization task: for each bet, put a box around the object. Your box must have at least 50% overlap with the correct box.
• Some of the best existing computer vision methods were tried on this dataset by leading computer vision groups from Oxford, INRIA, XRCE, ...
• These computer vision systems use complicated multi-stage pipelines, whose early stages are typically hand-tuned by optimizing a few parameters.
Case Study: AlexNet [Krizhevsky et al. 2012]

AlexNet popularized convolutional networks in computer vision; it was developed by Alex Krizhevsky, Ilya Sutskever and Geoff Hinton. AlexNet was submitted to the ImageNet ILSVRC challenge in 2012 and significantly outperformed the runner-up (top-5 error of 16% vs. 26%). The network was deeper and bigger, and featured convolutional layers stacked directly on top of each other.
A neural network for ImageNet
• Alex Krizhevsky (NIPS 2012) developed a very deep convolutional neural net of the type pioneered by Yann LeCun. Its architecture was:
  – 7 hidden layers, not counting some max pooling layers.
  – The early layers were convolutional.
  – The last two layers were globally connected.
• The activation functions were:
  – Rectified linear units in every hidden layer. These train much faster and are more expressive than logistic units.
  – Competitive normalization to suppress hidden activities when nearby units have stronger activities. This helps with variations in intensity.
Error rates on the ILSVRC-2012 competition (classification / classification & localization):
• University of Toronto (Alex Krizhevsky): 16.4% / 34.1%
• University of Tokyo: 26.1% / 53.6%
• Oxford University Computer Vision Group: 26.9% / 50.0%
• INRIA (French national research institute in CS) + XRCE (Xerox Research Center Europe): 27.0%
• University of Amsterdam: 29.5%
VGGNet: ILSVRC 2014 2nd place
• Sequence of deeper networks trained progressively
• Large receptive fields replaced by successive layers of 3x3 convolutions (with ReLU in between)
• One 7x7 conv layer with C feature maps needs 49C² weights; three 3x3 conv layers need only 27C² weights
• Experimented with 1x1 convolutions

K. Simonyan and A. Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition, ICLR 2015
GoogLeNet: ILSVRC 2014 winner
• The Inception Module
• Inception Module dramatically reduced the number of
parameters in the network
(4M, compared to AlexNet with 60M).
• Uses Average Pooling instead of Fully Connected layers at
the top of the ConvNet
• Several followup versions to the GoogLeNet, most
recently Inception-v4.

C. Szegedy et al., Going deeper with convolutions, CVPR 2015


GoogLeNet
• The Inception Module: parallel paths with different receptive field sizes and operations are meant to capture sparse patterns of correlations in the stack of feature maps.
• Use 1x1 convolutions for dimensionality reduction before the expensive convolutions.

C. Szegedy et al., Going deeper with convolutions, CVPR 2015
Case Study: GoogLeNet [Szegedy et al., 2014]
• Inception module; ILSVRC 2014 winner (6.7% top-5 error)
Case Study: GoogLeNet [Szegedy et al., 2014]
• 1x1 dimension reduction layers in the Inception module (reduce compute bottlenecks)
Case Study: GoogLeNet [Szegedy et al., 2014]
• Auxiliary classifier: a helper loss, used during training only.

C. Szegedy et al., Going deeper with convolutions, CVPR 2015

Case Study: GoogLeNet

Fun features:
- Only 5 million params! (Removes FC layers completely)

Compared to AlexNet:
- 12x less params
- 2x more compute
- 6.67% top-5 error (vs. 16.4%)
Inception v2, v3
• Regularize training with batch normalization, reducing the importance of the auxiliary classifiers
• More variants of inception modules with aggressive factorization of filters
• Increase the number of feature maps while decreasing spatial resolution (pooling)

C. Szegedy et al., Rethinking the inception architecture for computer vision, CVPR 2016
ResNet
• The residual module
• Introduces skip or shortcut connections (which existed before in various forms in the literature)
• Makes it easy for network layers to represent the identity mapping

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, Deep Residual Learning for Image Recognition, CVPR 2016 (Best Paper)
Case Study: ResNet
[He et al., 2015]
- Batch Normalization after every CONV layer
- Xavier/2 initialization from He et al.
- SGD + Momentum (0.9)
- Learning rate: 0.1, divided by 10 when validation error
plateaus
- Mini-batch size 256
- Weight decay of 1e-5
- No dropout used
ResNet: deeper residual module (bottleneck)
• Directly performing 3x3 convolutions with 256 feature maps at input and output:
  256 x 256 x 3 x 3 ~ 600K operations
• Using 1x1 convolutions to reduce 256 to 64 feature maps, followed by 3x3 convolutions, followed by 1x1 convolutions to expand back to 256 maps:
  256 x 64 x 1 x 1 ~ 16K
  64 x 64 x 3 x 3 ~ 36K
  64 x 256 x 1 x 1 ~ 16K
  Total: ~70K

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, Deep Residual Learning for Image Recognition, CVPR 2016 (Best Paper)
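The operation counts above, recomputed (multiplies per spatial position, biases ignored):

```python
direct = 256 * 256 * 3 * 3                      # 589,824  (~600K)
bottleneck = (256 * 64 * 1 * 1 +                # 1x1 reduce:  16,384
              64 * 64 * 3 * 3 +                 # 3x3 conv:    36,864
              64 * 256 * 1 * 1)                 # 1x1 expand:  16,384
print(direct, bottleneck, direct / bottleneck)  # ~70K total, ~8.5x cheaper
```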
Case Study: ResNet [He et al., 2015]
(slide from Kaiming He's recent presentation)
ResNet
• Architectures for ImageNet:

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, Deep Residual Learning for Image Recognition,
CVPR 2016 (Best Paper)
Inception v4

C. Szegedy et al., Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning, arXiv 2016
Summary: ILSVRC 2012-2015

Team                                       Year  Place  Error (top-5)  External data
SuperVision – Toronto (AlexNet, 7 layers)  2012  -      16.4%          no
SuperVision                                2012  1st    15.3%          ImageNet 22k
Clarifai – NYU (7 layers)                  2013  -      11.7%          no
Clarifai                                   2013  1st    11.2%          ImageNet 22k
VGG – Oxford (16 layers)                   2014  2nd    7.32%          no
GoogLeNet (19 layers)                      2014  1st    6.67%          no

http://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/
Accuracy vs. efficiency

https://culurciello.github.io/tech/2016/06/04/nets.html
Design principles
• Reduce filter sizes (except possibly at the lowest layer),
factorize filters aggressively
• Use 1x1 convolutions to reduce and expand the number of
feature maps judiciously
• Use skip connections and/or create multiple paths through
the network
What’s missing from the picture?
• Training tricks and details: initialization, regularization,
normalization
• Training data augmentation
• Averaging classifier outputs over multiple crops/flips
• Ensembles of networks

• What about ILSVRC 2016?


• No more ImageNet classification
• No breakthroughs comparable to ResNet
APPLICATIONS
Object classification [9]

Human Pose Estimation [10]

Super Resolution [11]

CNN on Text
Convolution vs. Fully Connected

6 x 6 image:       Filter 1:     Filter 2:
1 0 0 0 0 1         1 -1 -1      -1  1 -1
0 1 0 0 1 0        -1  1 -1      -1  1 -1
0 0 1 1 0 0        -1 -1  1      -1  1 -1
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0

• Convolution: the filters slide over the image directly.
• Fully-connected: the image is flattened into inputs x1 ... x36, and every input is connected to every neuron.
• Convolution needs fewer parameters!
Convolution as a sparse fully-connected layer: flatten the 6 x 6 image into 36 inputs (1, 2, 3, ..., 36). A neuron computing one output of Filter 1 connects only to the 9 inputs covered by its 3 x 3 window (inputs 1, 2, 3, 7, 8, 9, 13, 14, 15), with the filter values as the weights. Fewer parameters: each output connects to only 9 inputs instead of all 36.

Transfer Learning

"You need a lot of data if you want to train/use CNNs" – a common belief, but transfer learning lets us reuse features learned on a large dataset.
Transfer Learning with CNNs

1. Train on ImageNet.
2. If you have a small dataset: fix all the weights (treat the CNN as a fixed feature extractor) and retrain only the classifier, i.e. swap the Softmax layer at the end.
3. If you have a medium-sized dataset: "finetune" instead. Use the old weights as initialization and train the full network, or only some of the higher layers; retrain a bigger portion of the network, or even all of it.

Tip: use only ~1/10th of the original learning rate when finetuning the top layer, and ~1/100th on the intermediate layers.
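A minimal PyTorch sketch of this recipe, assuming torchvision's pretrained ResNet-18 as the base model and 20 illustrative target classes (API details vary slightly across torchvision versions):

```python
import torch
import torch.nn as nn
import torchvision.models as models

model = models.resnet18(pretrained=True)   # 1. start from ImageNet weights

# 2. Small dataset: freeze everything, retrain only a new classifier head.
for p in model.parameters():
    p.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, 20)  # new head trains from scratch

# 3. Medium dataset: finetune instead; unfreeze the network and use a much
#    smaller learning rate on the pretrained layers (per the tip above).
for p in model.parameters():
    p.requires_grad = True
optimizer = torch.optim.SGD([
    {"params": model.fc.parameters(), "lr": 1e-2},          # fresh head
    {"params": (p for n, p in model.named_parameters()
                if not n.startswith("fc")), "lr": 1e-4},    # ~1/100th
], momentum=0.9)
```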
Case Study: VGGNet [Simonyan and Zisserman, 2014]

Only 3x3 CONV with stride 1, pad 1, and 2x2 MAX POOL with stride 2.

Best model: 11.2% top-5 error in ILSVRC 2013 -> 7.3% top-5 error.
INPUT: [224x224x3] memory: 224*224*3=150K params: 0 (not counting biases)
CONV3-64: [224x224x64] memory: 224*224*64=3.2M params: (3*3*3)*64 = 1,728
CONV3-64: [224x224x64] memory: 224*224*64=3.2M params: (3*3*64)*64 = 36,864
POOL2: [112x112x64] memory: 112*112*64=800K params: 0
CONV3-128: [112x112x128] memory: 112*112*128=1.6M params: (3*3*64)*128 = 73,728
CONV3-128: [112x112x128] memory: 112*112*128=1.6M params: (3*3*128)*128 = 147,456
POOL2: [56x56x128] memory: 56*56*128=400K params: 0
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*128)*256 = 294,912
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*256)*256 = 589,824
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*256)*256 = 589,824
POOL2: [28x28x256] memory: 28*28*256=200K params: 0
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*256)*512 = 1,179,648
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*512)*512 = 2,359,296
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*512)*512 = 2,359,296
POOL2: [14x14x512] memory: 14*14*512=100K params: 0
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
POOL2: [7x7x512] memory: 7*7*512=25K params: 0
FC: [1x1x4096] memory: 4096 params: 7*7*512*4096 = 102,760,448
FC: [1x1x4096] memory: 4096 params: 4096*4096 = 16,777,216
FC: [1x1x1000] memory: 1000 params: 4096*1000 = 4,096,000

TOTAL memory: 24M * 4 bytes ~= 93MB / image (only forward! ~*2 for bwd)
TOTAL params: 138M parameters

Note: most memory is in the early CONV layers; most params are in the late FC layers.
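A short sketch recomputing the VGG-16 parameter total from the layer configuration above (biases ignored):

```python
# Conv params = 3*3*C_in*C_out; "P" marks a 2x2 pooling layer.
cfg = [64, 64, "P", 128, 128, "P", 256, 256, 256, "P",
       512, 512, 512, "P", 512, 512, 512, "P"]
C, params = 3, 0
for v in cfg:
    if v != "P":
        params += 3 * 3 * C * v
        C = v
params += 7 * 7 * 512 * 4096 + 4096 * 4096 + 4096 * 1000  # the three FC layers
print(f"{params / 1e6:.0f}M parameters")  # ~138M, matching the table
```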
Case Study: ResNet [He et al., 2015]
ILSVRC 2015 winner (3.6% top-5 error)
(slide from Kaiming He's presentation: https://www.youtube.com/watch?v=1PGLj-uKT1w)
• 2-3 weeks of training on an 8-GPU machine
• At runtime: faster than a VGGNet! (even though it has 8x more layers)
Case Study: ResNet [He et al., 2015]
• Input 224x224x3; the spatial dimension shrinks to only 56x56 early in the network.
Case Study: ResNet [He et al., 2015]
(this trick is also used in GoogLeNet)
Case Study Bonus: DeepMind's AlphaGo

Policy network:
[19x19x48] input
CONV1: 192 5x5 filters, stride 1, pad 2 => [19x19x192]
CONV2..12: 192 3x3 filters, stride 1, pad 1 => [19x19x192]
CONV: 1 1x1 filter, stride 1, pad 0 => [19x19] (probability map of promising moves)
Summary
• ConvNets stack CONV,ReLU,POOL,FC layers
• Trend towards smaller filters and deeper architectures
• Trend towards getting rid of POOL/FC layers (just CONV)
• Early architectures look like
[(CONV-RELU)*N-POOL?]*M-(FC-RELU)*K,SOFTMAX
• but recent advances such as ResNet/GoogLeNet use only Conv-
ReLU, 1x1 convolutions and Softmax

Activation Functions: Biological Inspiration
Activation Functions
• Sigmoid: σ(x) = 1 / (1 + e^(-x))
• tanh: tanh(x)
• ReLU: max(0, x)
• Leaky ReLU: max(0.1x, x)
• Maxout
• ELU
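Plain NumPy definitions of the activations listed above (a sketch; the alpha values are common defaults, not prescribed by the slides):

```python
import numpy as np

def sigmoid(x):    return 1.0 / (1.0 + np.exp(-x))
def tanh(x):       return np.tanh(x)
def relu(x):       return np.maximum(0, x)
def leaky_relu(x): return np.maximum(0.1 * x, x)
def elu(x, a=1.0): return np.where(x > 0, x, a * (np.exp(x) - 1))
def maxout(x, w1, b1, w2, b2):          # maxout of two linear pieces
    return np.maximum(x @ w1 + b1, x @ w2 + b2)
```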
Sigmoid (logistic function)
• Squashes numbers to range [0, 1] – can kill gradients.
• A key element in LSTM networks – "control signals".
• Best for learning "logical" functions, i.e. functions on binary inputs.
• Not as good for image networks (replaced by ReLU).
• Not zero-centered.
Consider what happens when the input to a neuron (x) is always positive:
• What can we say about the gradients on w? They are always all positive or all negative :(
• The allowed gradient update directions then span only two quadrants, so reaching a hypothetical optimal w vector requires an inefficient zig-zag path.
• (This is also why you want zero-mean data!)
tanh(x) [LeCun et al., 1991]
• Squashes numbers to range [-1, 1]
• Zero-centered (nice)
• Still kills gradients when saturated :(
• Also used in LSTMs for bounded, signed values.
• Not as good for binary functions
ReLU (Rectified Linear Unit) [Krizhevsky et al., 2012]
• Computes f(x) = max(0, x)
• Does not saturate (in + region)
• Converges faster than sigmoid/tanh on image data (e.g. 6x)
• Not suitable for logical functions
• Not for control in recurrent nets
ReLU (continued)
• Very computationally efficient
• Converges much faster than sigmoid/tanh in practice (e.g. 6x)
• Not zero-centered output
• An annoyance: what is the gradient when x < 0? It is 0.

Consider a ReLU gate with input x:
• What happens when x = -10? The gradient is 0, so the unit does not update.
• What happens when x = 0? The gradient is undefined (taken as 0 in practice).
• What happens when x = 10? The gradient passes through unchanged.
• A ReLU unit whose weights put it outside the data cloud is a dead ReLU: it will never activate, and therefore never update. An active ReLU lies inside the data cloud.
• => Need to take care that no ReLU unit is outside the data envelope; it will never learn.
Leaky ReLU [Maas et al., 2013] [He et al., 2015]
• Does not saturate
• Converges faster than sigmoid/tanh on image data (e.g. 6x)
• Will not "die"
Parametric Rectifier (PReLU) [He et al., 2015]
• f(x) = max(αx, x); backprop into α (a learned parameter).
• Same benefits as Leaky ReLU: does not saturate, converges faster than sigmoid/tanh on image data, and will not "die".
Exponential Linear Units (ELU) [Clevert et al., 2015]
• All benefits of ReLU
• Does not die
• Closer to zero-mean outputs
Maxout "Neuron" [Goodfellow et al., 2013]
• Does not have the basic form of dot product -> nonlinearity: it computes max(w1ᵀx + b1, w2ᵀx + b2).
• Generalizes ReLU and Leaky ReLU.
• Linear regime! Does not saturate! Does not die!
• Problem: doubles the number of parameters per neuron :(
TLDR: In practice:
Try everything. Usually:
• Use ReLU on early image layers.
• Try out Leaky ReLU / Maxout / ELU.
• Use sigmoids for smooth functions, e.g. robot control.
• Sigmoids are also good for logical functions AND/OR.

Weight Initialization
• Q: what happens when W=0 init is used? (Every neuron computes the same output and receives the same gradient, so the neurons never differentiate.)
• First idea: small random numbers (gaussian with zero mean and 1e-2 standard deviation).
• Works ~okay for small networks, but can lead to non-homogeneous distributions of activations across the layers of a network.
Activation Statistics
E.g. a 10-layer net with 500 neurons on each layer, using tanh non-linearities, and initialized as described on the last slide.
• With std 0.01, all activations become zero at the deeper layers! Q: think about the backward pass; what do the gradients look like? (Hint: think about the backward pass for a W*X gate: tiny activations mean tiny gradients on W.)
• With *1.0 instead of *0.01, almost all neurons are completely saturated, either -1 or 1. The gradients will be all zero.
"Xavier initialization" [Glorot et al., 2010]
• Scale the weights by 1/sqrt(fan_in).
• Easy derivation (linear case): assume weights and inbound activations have mean zero and are independent. The variances multiply for each term, and summing fan_in such terms scales the output variance by fan_in, so dividing the weights by sqrt(fan_in) keeps the variance constant across layers.
• But when using the ReLU nonlinearity it breaks.
He et al., 2015 (note the additional factor of 2: scale by sqrt(2/fan_in) for ReLU)
• The factor of 2 doesn't seem like much, but remember it applies multiplicatively ~150 times in a large ResNet.
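A sketch of the 10-layer activation-statistics experiment above, comparing the naive 0.01 init with Xavier (1/sqrt(fan_in)) and the He variant (sqrt(2/fan_in)); the layer width and batch size are illustrative:

```python
import numpy as np

def run(init, nonlin, layers=10, width=500):
    x = np.random.randn(1000, width)
    for _ in range(layers):
        W = init(width)
        x = nonlin(x @ W)
        print(f"mean {x.mean():+.4f}  std {x.std():.4f}")

tanh = np.tanh
relu = lambda x: np.maximum(0, x)
naive  = lambda n: np.random.randn(n, n) * 0.01               # activations -> 0
xavier = lambda n: np.random.randn(n, n) / np.sqrt(n)         # good for tanh
he     = lambda n: np.random.randn(n, n) * np.sqrt(2.0 / n)   # good for ReLU

run(naive, tanh)   # std collapses toward zero layer by layer
run(xavier, tanh)  # std stays roughly constant
run(he, relu)      # std stays roughly constant with ReLU
```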
Proper initialization is an active area of research…
Understanding the difficulty of training deep feedforward neural networks
by Glorot and Bengio, 2010

Exact solutions to the nonlinear dynamics of learning in deep linear neural networks by Saxe et al, 2013

Random walk initialization for training very deep feedforward networks by Sussillo and Abbott, 2014

Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification by He et al., 2015

Data-dependent Initializations of Convolutional Neural Networks by Krähenbühl et al., 2015

All you need is a good init, Mishkin and Matas, 2015


This Time: Localization and Detection

Results from Faster R-CNN, Ren et al 2015

Computer Vision Tasks

Classification   Classification + Localization   Object Detection    Instance Segmentation
CAT              CAT                             CAT, DOG, DUCK      CAT, DOG, DUCK
(single object)  (single object)                 (multiple objects)  (multiple objects)
Classification + Localization: Task

Classification: C classes
• Input: image
• Output: class label (e.g. CAT)
• Evaluation metric: accuracy

Localization:
• Input: image
• Output: box in the image (x, y, w, h)
• Evaluation metric: Intersection over Union

Classification + Localization: do both, outputting a class label and a box (x, y, w, h).
Classification + Localization: ImageNet
• 1000 classes (same as classification)
• Each image has 1 class and at least one bounding box
• ~800 training images per class
• The algorithm produces 5 (class, box) guesses
• An example is correct if at least one guess has the correct class AND a bounding box with at least 0.5 intersection over union (IoU)

Krizhevsky et al. 2012
Idea #1: Localization as Regression

• Input: image
• Neural net output: box coordinates (4 numbers)
• Correct output: box coordinates (4 numbers)
• Loss: L2 distance
• Only one object: simpler than detection.
Simple Recipe for Classification + Localization

Step 1: Train (or download) a classification model (AlexNet, VGG, GoogLeNet): image → convolution and pooling → final conv feature map → fully-connected layers → class scores, with softmax loss.
Step 2: Attach a new fully-connected "regression head" to the network. The final conv feature map now feeds two heads: the "classification head" (fully-connected layers → class scores) and the "regression head" (fully-connected layers → box coordinates).
Step 3: Train the regression head only, with SGD and an L2 loss on the box coordinates.
Step 4: At test time, use both heads: the image passes through the shared conv layers once, and we read out class scores and box coordinates (see the sketch below).
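A minimal PyTorch sketch of the two-head recipe above (the trunk and layer sizes are illustrative, not the lecture's exact architecture):

```python
import torch
import torch.nn as nn

class ClassifyAndLocalize(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        self.trunk = nn.Sequential(                 # "convolution and pooling"
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(7),
        )
        self.cls_head = nn.Linear(128 * 7 * 7, num_classes)  # class scores
        self.box_head = nn.Linear(128 * 7 * 7, 4)            # (x, y, w, h)

    def forward(self, x):
        feats = self.trunk(x).flatten(1)
        return self.cls_head(feats), self.box_head(feats)

model = ClassifyAndLocalize()
scores, boxes = model(torch.randn(2, 3, 224, 224))
loss = nn.CrossEntropyLoss()(scores, torch.tensor([3, 7])) \
     + nn.MSELoss()(boxes, torch.rand(2, 4))    # softmax loss + L2 loss
```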
Per-class vs. class-agnostic regression

Assume classification over C classes (the classification head outputs C numbers, one per class).
• Class-agnostic regression head: 4 numbers (one box).
• Class-specific regression head: C x 4 numbers (one box per class).
Where to attach the regression head?
• After the conv layers: Overfeat, VGG
• After the last FC layer: DeepPose, R-CNN
Aside: Localizing multiple objects
• Want to localize exactly K objects in each image (e.g. whole cat, cat head, cat left ear, cat right ear for K=4).
• The regression head then outputs K x 4 numbers (one box per object).
Aside: Human Pose Estimation
• Represent a person by K joints.
• Regress (x, y) for each joint from the last fully-connected layer of AlexNet.
• (Details: normalized coordinates, iterative refinement.)

Toshev and Szegedy, "DeepPose: Human Pose Estimation via Deep Neural Networks", CVPR 2014
Idea #2: Sliding Window
• Run the classification + regression network at multiple locations on a high-resolution image.
• Convert fully-connected layers into convolutional layers for efficient computation.
• Combine classifier and regressor predictions across all scales for the final prediction.
Sliding Window: Overfeat
Winner of the ILSVRC 2013 localization challenge.
• Image: 3 x 221 x 221 → convolution + pooling → feature map: 1024 x 5 x 5
• Classification head: FC (4096) → FC (4096) → FC → class scores (1000), trained with softmax loss
• Regression head: FC (4096) → FC (1024) → FC → boxes (1000 x 4), trained with Euclidean loss

Sermanet et al, "Integrated Recognition, Localization and Detection using Convolutional Networks", ICLR 2014
Sliding Window: Overfeat
• Network input: 3 x 221 x 221; apply it to a larger test image, e.g. 3 x 257 x 257.
• Each window position produces a classification score P(cat) (e.g. 0.5, 0.75, 0.6, 0.8) and a box.
• Greedily merge boxes and scores (details in paper) to get the final score (e.g. 0.8).
Sliding Window: Overfeat
• In practice, use many sliding window locations and multiple scales.
• Combine the window positions + score maps and the box regression outputs into the final predictions.

Sermanet et al, "Integrated Recognition, Localization and Detection using Convolutional Networks", ICLR 2014
Efficient Sliding Window: Overfeat
Efficient sliding window by converting the fully-connected layers into convolutions:
• Classification head: feature map (1024 x 5 x 5) → 5x5 conv → 4096 x 1 x 1 → 1x1 conv → 4096 x 1 x 1 → 1x1 conv → class scores: 1000 x 1 x 1
• Regression head: feature map → 5x5 conv → 4096 x 1 x 1 → 1x1 conv → 1024 x 1 x 1 → 1x1 conv → box coordinates: (4 x 1000) x 1 x 1
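A minimal PyTorch sketch of this FC -> conv conversion, assuming the 1024 x 5 x 5 feature map from the slide (the equality check and output shapes are the point, not the exact architecture):

```python
import torch
import torch.nn as nn

# An FC layer over a 1024x5x5 feature map becomes a 5x5 convolution with the
# identical weights, so it can slide over larger feature maps for free.
fc = nn.Linear(1024 * 5 * 5, 4096)
conv = nn.Conv2d(1024, 4096, kernel_size=5)
conv.weight.data = fc.weight.data.view(4096, 1024, 5, 5)  # reuse the FC weights
conv.bias.data = fc.bias.data

feat = torch.randn(1, 1024, 5, 5)
out_fc = fc(feat.flatten(1))                  # (1, 4096)
out_conv = conv(feat)                         # (1, 4096, 1, 1)
print(torch.allclose(out_fc, out_conv.flatten(1), atol=1e-5))  # True

bigger = torch.randn(1, 1024, 6, 6)
print(conv(bigger).shape)                     # (1, 4096, 2, 2): a 2x2 output grid
```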
Efficient Sliding Window: Overfeat
• Training time: small image, 1 x 1 classifier output.
• Test time: larger image, 2 x 2 classifier output; only extra compute at the new regions.

Sermanet et al, "Integrated Recognition, Localization and Detection using Convolutional Networks", ICLR 2014
ImageNet Classification + Localization
• AlexNet: localization method not published.
• Overfeat: multiscale convolutional regression with box merging.
• VGG: same as Overfeat, but fewer scales and locations; a simpler method, with the gains all due to deeper features.
• ResNet: a different localization method (RPN) and much deeper features.
Computer Vision Tasks (revisited): we now turn to Object Detection, where each image contains a variable number of objects.
Detection as Regression?
• Image with 4 objects: DOG (x, y, w, h), CAT (x, y, w, h), CAT (x, y, w, h), DUCK (x, y, w, h) = 16 numbers.
• Image with 2 objects: DOG (x, y, w, h), CAT (x, y, w, h) = 8 numbers.
• Image with many cats: CAT (x, y, w, h), CAT (x, y, w, h), ..., CAT (x, y, w, h) = many numbers.
• We need variable-sized outputs!
Detection as Classification
• Slide a window over the image and classify each crop: CAT? NO. DOG? NO. ... CAT? YES!
• Problem: need to test many positions and scales.
• Solution: if your classifier is fast enough, just do it.
Histogram of Oriented Gradients
Dalal and Triggs, "Histograms of Oriented Gradients for Human Detection", CVPR 2005 (slide credit: Ross Girshick)

Deformable Parts Model (DPM)
Felzenszwalb et al, "Object Detection with Discriminatively Trained Part Based Models", PAMI 2010

Aside: Deformable Parts Models are CNNs?
Girshick et al, "Deformable Part Models are Convolutional Neural Networks", CVPR 2015
Detection as Classification
• Problem: need to test many positions and scales, and use a computationally demanding classifier (a CNN).
• Solution: only look at a tiny subset of possible positions.
Region Proposals
• Find "blobby" image regions that are likely to contain objects.
• A "class-agnostic" object detector.
• Look for "blob-like" regions.
Region Proposals: Selective Search
Bottom-up segmentation, merging regions at multiple scales; convert the regions to boxes.

Uijlings et al, "Selective Search for Object Recognition", IJCV 2013
Region Proposals: Many other choices

Hosang et al, "What makes for effective detection proposals?", PAMI 2015
Putting it together: R-CNN

Girshick et al, "Rich feature hierarchies for accurate object detection and semantic segmentation", CVPR 2014 (slide credit: Ross Girshick)
R-CNN Training

Step 1: Train (or download) a classification model for ImageNet (AlexNet): image → convolution and pooling → final conv feature map → fully-connected layers → class scores (1000 classes), with softmax loss.
Step 2: Fine-tune the model for detection
- Instead of 1000 ImageNet classes, we want 20 object classes + background.
- Throw away the final fully-connected layer and reinitialize it from scratch (was 4096 x 1000, now will be 4096 x 21).
- Keep training the model using positive / negative regions from the detection images → class scores: 21 classes.
Step 3: Extract features
- Extract region proposals for all images.
- For each region: warp to the CNN input size, run it forward through the CNN, save the pool5 features to disk.
- Have a big hard drive: the features are ~200GB for the PASCAL dataset!
(Image → region proposals → crop + warp → forward pass → save pool5 features to disk)
Step 4: Train one binary SVM per class to classify the cached region features
- For the cat SVM: cat regions are positive samples, all other regions are negative.
- For the dog SVM: dog regions are positive, all other regions are negative.
Step 5 (bbox regression): For each class, train a linear regression model to map from the cached features to offsets to the ground-truth boxes, making up for "slightly wrong" proposals.

Regression targets (dx, dy, dw, dh), in normalized coordinates:
- (0, 0, 0, 0): proposal is good
- (0.25, 0, 0, 0): proposal too far to the left
- (0, 0, -0.125, 0): proposal too wide
Object Detection: Datasets

                              PASCAL VOC   ImageNet Detection   MS-COCO
                              (2010)       (ILSVRC 2014)        (2014)
Number of classes             20           200                  80
Number of images (train+val)  ~20k         ~470k                ~120k
Mean objects per image        2.4          1.1                  7.2
Object Detection: Evaluation

We use a metric called “mean average precision” (mAP)

Compute average precision (AP) separately for each class, then average over classes

A detection is a true positive if it has IoU (Intersection over Union) with a ground-truth box
greater than some threshold (usually 0.5) (mAP@0.5)

Combine all detections from all test images to draw a precision / recall curve for each class; AP is
area under the curve

TL;DR mAP is a number from 0 to 100; high is good

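A minimal sketch of the IoU computation behind the 0.5 threshold (boxes as (x1, y1, x2, y2)):

```python
def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # 0.333...: not a true positive at 0.5
```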
R-CNN Results (compared against Wang et al, "Regionlets for Generic Object Detection", ICCV 2013)
• Big improvement compared to pre-CNN methods.
• Bounding box regression helps a bit.
• Features from a deeper network help a lot.
R-CNN Problems
1. Slow at test-time: need to run a full forward pass of the CNN for each region proposal.
2. SVMs and regressors are post-hoc: the CNN features are not updated in response to the SVMs and regressors.
3. Complex multistage training pipeline.
Fast R-CNN

Girshick, "Fast R-CNN", ICCV 2015 (slide credit: Ross Girshick)
R-CNN Problem #1: slow at test-time due to independent forward passes of the CNN.
Solution: share the computation of the convolutional layers between proposals for an image.

R-CNN Problem #2: post-hoc training; the CNN is not updated in response to the final classifiers and regressors.
R-CNN Problem #3: complex training pipeline.
Solution: just train the whole system end-to-end all at once!

(slide credit: Ross Girshick)
Fast R-CNN: Region of Interest Pooling
• Hi-res input image (3 x 800 x 600, with a region proposal) → convolution and pooling → hi-res conv features (C x H x W, with the region proposal).
• Problem: the fully-connected layers expect low-res conv features (C x h x w).
• Solution: project the region proposal onto the conv feature map, divide the projected region into an h x w grid, and max-pool within each grid cell.
• This yields RoI conv features of size C x h x w for the region proposal, and we can backpropagate through it, similar to max pooling (see the usage sketch below).
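A usage sketch with torchvision's built-in RoI pooling (assuming torchvision provides torchvision.ops.roi_pool; the feature stride of 16 is illustrative):

```python
import torch
from torchvision.ops import roi_pool

features = torch.randn(1, 256, 50, 38)           # C x H x W conv features
# one proposal: (batch_index, x1, y1, x2, y2) in input-image coordinates
rois = torch.tensor([[0, 100., 100., 400., 300.]])
pooled = roi_pool(features, rois, output_size=(7, 7),
                  spatial_scale=1 / 16)           # conv feature stride ~16
print(pooled.shape)                               # torch.Size([1, 256, 7, 7])
```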
Fast R-CNN Results (using the VGG-16 CNN on the PASCAL VOC 2007 dataset)

                      R-CNN       Fast R-CNN
Training time:        84 hours    9.5 hours     (8.8x speedup: faster!)
Test time per image:  47 seconds  0.32 seconds  (146x speedup: much faster!)
mAP (VOC 2007):       66.0        66.9          (better!)
Fast R-CNN Problem: the test-time speeds don't include region proposals.

                                        R-CNN       Fast R-CNN
Test time per image:                    47 seconds  0.32 seconds  (146x)
Test time per image with
Selective Search proposals:             50 seconds  2 seconds     (25x)

Solution: just make the CNN do the region proposals too!
Faster R-CNN:
• Insert a Region Proposal Network (RPN) after the last convolutional layer.
• The RPN is trained to produce region proposals directly; no need for external region proposals!
• After the RPN, use RoI pooling and an upstream classifier and bbox regressor, just like Fast R-CNN.

Ren et al, "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks", NIPS 2015 (slide credit: Ross Girshick)
Faster R-CNN: Region Proposal Network
• Slide a small window over the conv feature map.
• Build a small network (implemented with 1x1 convs) for: classifying object vs. not-object, and regressing bbox locations.
• The position of the sliding window provides localization information with reference to the image.
• Box regression provides finer localization information with reference to the sliding window.

(slide credit: Kaiming He)
Faster R-CNN: Region Proposal Network
• Use N anchor boxes at each location.
• Anchors are translation invariant: use the same ones at every location.
• Regression gives offsets from the anchor boxes.
• Classification gives the probability that each (regressed) anchor shows an object.
Faster R-CNN: Training
• In the paper: an ugly pipeline. Alternating optimization is used to train the RPN, then Fast R-CNN with RPN proposals, etc.; more complex than it has to be.
• Since publication: joint training! One network, four losses:
  - RPN classification (anchor good / bad)
  - RPN regression (anchor -> proposal)
  - Fast R-CNN classification (over classes)
  - Fast R-CNN regression (proposal -> box)

(slide credit: Ross Girshick)
Faster R-CNN: Results

                                   R-CNN       Fast R-CNN  Faster R-CNN
Test time per image
(with proposals):                  50 seconds  2 seconds   0.2 seconds
(Speedup)                          1x          25x         250x
mAP (VOC 2007)                     66.0        66.9        66.9
Object Detection State-of-the-art:
ResNet 101 + Faster R-CNN + some extras

He et al., "Deep Residual Learning for Image Recognition", arXiv 2015

ImageNet Detection 2013 - 2015
YOLO: You Only Look Once (Detection as Regression)
• Divide the image into an S x S grid.
• Within each grid cell, predict B boxes (4 coordinates + confidence each) and C class scores.
• Regression from the image to a 7 x 7 x (5 * B + C) tensor.
• Direct prediction using a CNN.

Redmon et al, "You Only Look Once: Unified, Real-Time Object Detection", arXiv 2015
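A small sketch of the YOLO output tensor for the classic setting S=7, B=2, C=20 (so 7 x 7 x 30), splitting it into boxes and class scores:

```python
import torch

S, B, C = 7, 2, 20
pred = torch.randn(S, S, 5 * B + C)             # network output for one image
boxes = pred[..., :5 * B].reshape(S, S, B, 5)   # per cell: B x (x, y, w, h, conf)
class_scores = pred[..., 5 * B:]                # per cell: C class scores
print(boxes.shape, class_scores.shape)          # (7, 7, 2, 5) and (7, 7, 20)
```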
YOLO is faster than Faster R-CNN, but not as good (Redmon et al, 2015).
Object Detection code links:

R-CNN (Caffe + MATLAB): https://github.com/rbgirshick/rcnn
  Probably don't use this; too slow.

Fast R-CNN (Caffe + MATLAB): https://github.com/rbgirshick/fast-rcnn

Faster R-CNN (Caffe + MATLAB): https://github.com/ShaoqingRen/faster_rcnn
  (Caffe + Python): https://github.com/rbgirshick/py-faster-rcnn

YOLO: http://pjreddie.com/darknet/yolo/
Recap
Localization:
- Find a fixed number of objects (one or many)
- L2 regression from CNN features to box coordinates
- Much simpler than detection; consider it for your projects!
- Overfeat: Regression + efficient sliding window with FC -> conv conversion
- Deeper networks do better
Object Detection:
- Find a variable number of objects by classifying image regions
- Before CNNs: dense multiscale sliding window (HoG, DPM)
- Avoid dense sliding window with region proposals
- R-CNN: Selective Search + CNN classification / regression
- Fast R-CNN: Swap order of convolutions and region extraction
- Faster R-CNN: Compute region proposals within the network
- Deeper networks do better
