CNN – Part 2
Sudeshna Sarkar
Spring 2019
6 Feb 2019
Fully Connected Network to CNN
• Too many parameters!
• Can some of the parameters be shared?
Local Patterns
• Some patterns are much smaller than the whole image (e.g. a “beak” detector)
• Patterns can appear at different places: an “upper-left beak” detector and a “middle beak” detector look for the same pattern
CNN
• Capture local features based on local sequences
• Spatial or temporal, or a combination
• Local connectivity: connect only a local region of the input (receptive field) to the next-layer neurons.
• Use each feature in all spatial/temporal contexts: parameter sharing across regions.
• 1D convolution: (f ∗ g)[n] = Σ_{m=−M}^{M} f[n−m] g[m]
• Group the outputs of a feature afterwards if required.
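The 1D convolution formula above can be sketched in plain Python; treating out-of-range inputs as zero (zero padding) and the function name are illustrative choices, not part of the slides.

```python
def conv1d(f, g):
    """1D convolution (f * g)[n] = sum_{m=-M}^{M} f[n-m] g[m].

    f: input sequence; g: filter of odd length 2M+1, stored so that
    g[m + M] holds the coefficient g[m]. Out-of-range inputs count as zero.
    """
    M = (len(g) - 1) // 2
    out = []
    for n in range(len(f)):
        s = 0
        for m in range(-M, M + 1):
            if 0 <= n - m < len(f):
                s += f[n - m] * g[m + M]
        out.append(s)
    return out
```

With the identity filter g = [0, 1, 0] the input comes back unchanged; an off-center filter shifts it.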
A convolutional layer
A filter acts as a “beak detector”.

Convolution
6 × 6 image:
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0

Filter 1 (3 × 3):        Filter 2 (3 × 3):
 1 -1 -1                 -1  1 -1
-1  1 -1                 -1  1 -1
-1 -1  1                 -1  1 -1

These filters are the network parameters to be learned. Each filter detects a small pattern (3 × 3).
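Convolving the 6 × 6 binary image above with the 3 × 3 “diagonal” filter (stride 1, no padding) yields a 4 × 4 feature map; a plain-Python sketch, not library code.

```python
image = [
    [1, 0, 0, 0, 0, 1],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 1, 0, 0],
    [1, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 0, 1, 0],
]
filt = [
    [ 1, -1, -1],
    [-1,  1, -1],
    [-1, -1,  1],
]

def conv2d(img, f, stride=1):
    F, N = len(f), len(img)
    size = (N - F) // stride + 1          # output size: (N - F)/stride + 1
    return [[sum(img[i * stride + a][j * stride + b] * f[a][b]
                 for a in range(F) for b in range(F))
             for j in range(size)] for i in range(size)]

fmap = conv2d(image, filt)
```

The same 9 weights slide over every position of the image: parameter sharing. The top-left value is 3, where the diagonal pattern matches exactly.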
Convolution in Images
https://medium.com/machine-learning-bites/deeplearning-series-convolutional-neural-networks-a9c2f2ee1524
Parameter Sharing: replicated features
• Use many different copies of the same feature detector at different positions. The red connections all have the same weight.
• Translating the image translates the representation: the same neurons become active at shifted positions.
• e.g. a 32×32×3 input with one filter gives a 28×28×1 activation map.
21 Jan 2015
Zero Padding
• e.g. 7×7 input, 3×3 filter applied with stride 1, padded with a 1-pixel border of zeros ⇒ what is the output?
• (recall: output size = (N − F) / stride + 1)
• Answer: 7×7 output!
• In general, it is common to see CONV layers with stride 1, filters of size F×F, and zero-padding of (F−1)/2, which preserves the spatial size:
  • F = 3 ⇒ zero pad with 1
  • F = 5 ⇒ zero pad with 2
  • F = 7 ⇒ zero pad with 3
A closer look at spatial dimensions:
• 7×7 input (spatially), 3×3 filter applied with stride 1 ⇒ 5×5 output
• 3×3 filter applied with stride 2 ⇒ 3×3 output!
• 3×3 filter applied with stride 3? Doesn’t fit! Cannot apply a 3×3 filter on a 7×7 input with stride 3.

Output size: (N − F) / stride + 1
e.g. N = 7, F = 3:
• stride 1 ⇒ (7 − 3)/1 + 1 = 5
• stride 2 ⇒ (7 − 3)/2 + 1 = 3
• stride 3 ⇒ (7 − 3)/3 + 1 = 2.33 :\
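The output-size rule can be sketched as a small helper that also covers the zero-padding case above; returning None when the filter does not fit is an illustrative choice.

```python
def conv_output_size(N, F, stride, pad=0):
    """Spatial output size of a conv layer: (N + 2*pad - F) / stride + 1."""
    N = N + 2 * pad
    if (N - F) % stride != 0:
        return None  # the filter "doesn't fit" (e.g. N=7, F=3, stride=3)
    return (N - F) // stride + 1
```

With pad = (F − 1)/2 and stride 1, the spatial size is preserved, as noted above.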
Feature Map
For example, if we had 6 5×5 filters, we’d get 6 separate activation maps: a 32×32×3 input volume becomes a 28×28×6 output volume.
(nc′: number of filters)
Preview: A ConvNet is a sequence of convolutional layers, interspersed with activation functions.

32×32×3 input → [CONV, ReLU] with 6 5×5×3 filters → 28×28×6 → [CONV, ReLU] with 10 5×5×6 filters → 24×24×10 → …
Deep Convolutional Network
Preview [from Yann LeCun’s slides]
Pool layers
• The convolution layers are often followed by pool layers in CNNs.
• Pooling reduces the number of weights without losing too much information.
• We often use the max operation for pooling.

Single depth slice (4×4):
1 2 5 6
3 4 2 8
3 4 4 2
1 5 6 3

Max pooling (2×2, stride 2):
4 8
5 6

• The pooling layer operates independently on every depth slice of the input and resizes it spatially, using the MAX operation.
• Translational invariance at each level: averaging four neighboring replicated detectors gives a single output to the next level.
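The max-pooling example above can be reproduced with a few lines of plain Python; window size and stride default to 2, as on the slide.

```python
def max_pool(x, size=2, stride=2):
    """Max-pool a square 2D slice with the given window size and stride."""
    n = (len(x) - size) // stride + 1
    return [[max(x[i * stride + a][j * stride + b]
                 for a in range(size) for b in range(size))
             for j in range(n)] for i in range(n)]

slice_ = [
    [1, 2, 5, 6],
    [3, 4, 2, 8],
    [3, 4, 4, 2],
    [1, 5, 6, 3],
]
pooled = max_pool(slice_)
```

Each 2×2 window keeps only its maximum, halving each spatial dimension.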
Window Size and Stride in pool layers
• The window size is the pooling range.
• The stride is how many pixels the window moves.
• For the example above, window size = stride = 2.
• There are two types of pool layers: if window size = stride, this is traditional pooling; if window size > stride, this is overlapping pooling.
• Larger window sizes and strides are more destructive.
• Problem: after several levels of pooling, we have lost information about the precise positions of things. This makes it impossible to use the precise spatial relationships between high-level parts for recognition.
General pooling
• Other pooling functions: average pooling, L2-norm pooling
• Subsampling (figure: a “bird” image and its subsampled version)
• A typical ConvNet on a 32×32 input: [CONV → ReLU → CONV → ReLU → POOL] repeated three times, followed by a fully-connected layer.
The ILSVRC-2012 competition on ImageNet (tasks: classification, classification & localization)
K. Simonyan and A. Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition, ICLR 2015
GoogLeNet: ILSVRC 2014 winner
• The Inception module
• The Inception module dramatically reduced the number of parameters in the network (4M, compared to AlexNet’s 60M).
• Uses average pooling instead of fully connected layers at the top of the ConvNet.
• Several follow-up versions of GoogLeNet, most recently Inception-v4.
Inception module
Case Study: GoogLeNet [Szegedy et al., 2014]
GoogLeNet
• Auxiliary classifier
• Fun features, compared to AlexNet:
  - 12× fewer params
  - 2× more compute
  - 6.67% top-5 error (vs. 16.4%)
Inception v2, v3
• Regularize training with batch normalization, reducing importance of
auxiliary classifiers
• More variants of inception modules with aggressive factorization of
filters
• Increase the number of feature maps while decreasing spatial resolution
(pooling)
C. Szegedy et al., Rethinking the inception architecture for computer vision, CVPR 2016
ResNet
• The residual module
• Introduce skip or shortcut connections (which existed before in various forms in the literature)
• Make it easy for network layers to represent the identity mapping
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, Deep Residual Learning for Image
Recognition, CVPR 2016 (Best Paper)
Case Study: ResNet [He et al., 2015]
- Batch normalization after every CONV layer
- Xavier/2 initialization from He et al.
- SGD + momentum (0.9)
- Learning rate: 0.1, divided by 10 when validation error plateaus
- Mini-batch size 256
- Weight decay of 1e-5
- No dropout used
ResNet: deeper residual module (bottleneck)
• Directly performing 3×3 convolutions with 256 feature maps at input and output:
  256 × 256 × 3 × 3 ≈ 600K operations
• Using 1×1 convolutions to reduce 256 to 64 feature maps, followed by 3×3 convolutions, followed by 1×1 convolutions to expand back to 256 maps:
  256 × 64 × 1 × 1 ≈ 16K
  64 × 64 × 3 × 3 ≈ 36K
  64 × 256 × 1 × 1 ≈ 16K
  Total: ≈ 70K

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, Deep Residual Learning for Image Recognition, CVPR 2016 (Best Paper)
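The bottleneck arithmetic above can be checked directly (counting weights per output position; feature-map area and biases are ignored):

```python
# Direct 3x3 convolution, 256 input maps -> 256 output maps.
direct = 256 * 256 * 3 * 3
# Bottleneck: 1x1 reduce to 64 maps, 3x3 conv at 64 maps, 1x1 expand to 256.
bottleneck = 256 * 64 * 1 * 1 + 64 * 64 * 3 * 3 + 64 * 256 * 1 * 1
```

The bottleneck design costs roughly 8× less than the direct 3×3 convolution while keeping the same input/output width.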
ResNet
• Architectures for ImageNet:
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, Deep Residual Learning for Image Recognition,
CVPR 2016 (Best Paper)
Inception v4
https://culurciello.github.io/tech/2016/06/04/nets.html
Design principles
• Reduce filter sizes (except possibly at the lowest layer),
factorize filters aggressively
• Use 1x1 convolutions to reduce and expand the number of
feature maps judiciously
• Use skip connections and/or create multiple paths through
the network
What’s missing from the picture?
• Training tricks and details: initialization, regularization,
normalization
• Training data augmentation
• Averaging classifier outputs over multiple crops/flips
• Ensembles of networks
Human Pose Estimation [10]
Super Resolution [11]
CNN on Text
Convolution vs. Fully Connected

6 × 6 image:
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0

Convolution filters (3 × 3):
 1 -1 -1     -1  1 -1
-1  1 -1     -1  1 -1
-1 -1  1     -1  1 -1

Fully-connected: flatten the image into inputs x1, x2, …, x36 and connect every input to every neuron.
Convolution uses fewer parameters!
Flatten the 6 × 6 image into a 36-dimensional vector x1 … x36. The first output of Filter 1 (value 3) connects to only 9 inputs (x1–x3, x7–x9, x13–x15), not all 36: fewer parameters!
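The saving can be made concrete with a quick count; comparing against a fully-connected layer with 16 outputs (one per position of the 4 × 4 feature map) is an illustrative choice.

```python
# Fully-connected: each of the 16 outputs connects to all 36 inputs.
fc_weights = 36 * 16
# Convolution: one 3x3 filter shared across all 16 output positions,
# and each output connects to only 9 inputs.
conv_weights = 3 * 3
```

The shared filter needs 64× fewer weights than the fully-connected layer for the same number of outputs.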
“You need a lot of data if you want to train/use CNNs”
Transfer Learning with CNNs
1. Train on ImageNet
Case Study: VGGNet [Simonyan and Zisserman, 2014]
INPUT: [224x224x3] memory: 224*224*3=150K params: 0 (not counting biases)
CONV3-64: [224x224x64] memory: 224*224*64=3.2M params: (3*3*3)*64 = 1,728
CONV3-64: [224x224x64] memory: 224*224*64=3.2M params: (3*3*64)*64 = 36,864
POOL2: [112x112x64] memory: 112*112*64=800K params: 0
CONV3-128: [112x112x128] memory: 112*112*128=1.6M params: (3*3*64)*128 = 73,728
CONV3-128: [112x112x128] memory: 112*112*128=1.6M params: (3*3*128)*128 = 147,456
POOL2: [56x56x128] memory: 56*56*128=400K params: 0
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*128)*256 = 294,912
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*256)*256 = 589,824
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*256)*256 = 589,824
POOL2: [28x28x256] memory: 28*28*256=200K params: 0
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*256)*512 = 1,179,648
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*512)*512 = 2,359,296
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*512)*512 = 2,359,296
POOL2: [14x14x512] memory: 14*14*512=100K params: 0
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
POOL2: [7x7x512] memory: 7*7*512=25K params: 0
FC: [1x1x4096] memory: 4096 params: 7*7*512*4096 = 102,760,448
FC: [1x1x4096] memory: 4096 params: 4096*4096 = 16,777,216
FC: [1x1x1000] memory: 1000 params: 4096*1000 = 4,096,000

TOTAL memory: 24M * 4 bytes ~= 93MB / image (only forward! ~*2 for bwd)
TOTAL params: 138M parameters

Note: most memory is in the early CONV layers; most params are in the late FC layers.
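The per-layer counts above can be re-derived programmatically (conv layers as in_channels·3·3·out_channels, FC layers as in·out; biases omitted, as in the slide):

```python
# (in_channels, out_channels) for each 3x3 conv layer of VGG-16, in order.
conv_shapes = [
    (3, 64), (64, 64), (64, 128), (128, 128),
    (128, 256), (256, 256), (256, 256),
    (256, 512), (512, 512), (512, 512),
    (512, 512), (512, 512), (512, 512),
]
conv_params = sum(cin * 3 * 3 * cout for cin, cout in conv_shapes)
fc_params = 7 * 7 * 512 * 4096 + 4096 * 4096 + 4096 * 1000
total = conv_params + fc_params   # ~138M, dominated by the first FC layer
```

Note how the single 7·7·512 → 4096 FC layer alone contributes ~103M of the ~138M parameters.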
Case Study: ResNet [He et al., 2015]
• At runtime: faster than a VGGNet! (even though it has 8× more layers)
• From a 224×224×3 input, the spatial dimension is quickly reduced to only 56×56.
Case Study Bonus: DeepMind’s AlphaGo

Policy network:
[19x19x48] input
CONV1: 192 5x5 filters, stride 1, pad 2 => [19x19x192]
CONV2..12: 192 3x3 filters, stride 1, pad 1 => [19x19x192]
CONV: 1 1x1 filter, stride 1, pad 0 => [19x19] (probability map of promising moves)
Summary
• ConvNets stack CONV, ReLU, POOL, and FC layers
• Trend towards smaller filters and deeper architectures
• Trend towards getting rid of POOL/FC layers (just CONV)
• Early architectures look like [(CONV-RELU)*N-POOL?]*M-(FC-RELU)*K, SOFTMAX, but recent advances such as ResNet/GoogLeNet use only Conv-ReLU, 1×1 convolutions, and Softmax
Activation Functions: Biological Inspiration

Activation Functions
• tanh: tanh(x)
• ReLU: max(0, x)
• Maxout
• ELU
Consider what happens when the input to a neuron (x) is always positive:
• What can we say about the gradients on w? Always all positive or all negative :(
• This restricts the allowed gradient update directions relative to the hypothetical optimal w vector.
• (this is also why you want zero-mean data!)
Activation Functions: ReLU (Rectified Linear Unit)
• Computes f(x) = max(0, x)
• Does not saturate (in the + region)
• Very computationally efficient
• Converges much faster than sigmoid/tanh in practice (e.g. 6×)
• (figure: the region of the data cloud where a ReLU is active)
Activation Functions: Leaky ReLU [Maas et al., 2013; He et al., 2015]

Activation Functions: ELU [Clevert et al., 2015]
Maxout “Neuron” [Goodfellow et al., 2013]
• Does not have the basic form of dot product → nonlinearity
• Generalizes ReLU and Leaky ReLU
• Linear regime! Does not saturate! Does not die!
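The activations discussed above, written out as plain scalar functions; the leaky slope (0.01) and ELU alpha (1.0) are common defaults, not values fixed by the slides.

```python
import math

def relu(x):
    return max(0.0, x)

def leaky_relu(x):
    return x if x > 0 else 0.01 * x   # small slope keeps gradients alive for x < 0

def elu(x, a=1.0):
    return x if x > 0 else a * (math.exp(x) - 1.0)

def maxout(x, w1, b1, w2, b2):
    # Generalizes ReLU: the max of two affine functions of x.
    return max(w1 * x + b1, w2 * x + b2)
```

For example, maxout with (w1, b1, w2, b2) = (1, 0, 0, 0) reduces exactly to ReLU.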
TLDR: In practice:
Try everything. Usually:
• Use ReLU on early image layers.
• Try out Leaky ReLU / Maxout / ELU.
• Use sigmoids for smooth functions, e.g. robot control.
• Sigmoids are also good for logical functions AND/OR.
Weight Initialization
• First idea: small random numbers (Gaussian with zero mean and 1e-2 standard deviation)
Activation Statistics
• With small random weights, in deeper layers all activations become zero! Q: think about the backward pass. What do the gradients look like?
• With *1.0 instead of *0.01, almost all neurons are completely saturated at either −1 or 1. Gradients will be all zero.
“Xavier initialization” [Glorot et al., 2010]
• … but when using the ReLU nonlinearity it breaks.
He initialization [He et al., 2015] (note the additional /2)
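A sketch of the two schemes above: Xavier scales weights by 1/sqrt(fan_in) (Glorot et al., 2010); He et al. (2015) add the extra /2 for ReLU, i.e. scale by sqrt(2/fan_in). Plain list-of-lists weights and function names are illustrative.

```python
import math
import random

def xavier_init(fan_in, fan_out):
    """Zero-mean Gaussian weights with std 1/sqrt(fan_in)."""
    s = 1.0 / math.sqrt(fan_in)
    return [[random.gauss(0.0, s) for _ in range(fan_out)]
            for _ in range(fan_in)]

def he_init(fan_in, fan_out):
    """Zero-mean Gaussian weights with std sqrt(2/fan_in), for ReLU nets."""
    s = math.sqrt(2.0 / fan_in)
    return [[random.gauss(0.0, s) for _ in range(fan_out)]
            for _ in range(fan_in)]
```

Scaling by the fan-in keeps the variance of activations roughly constant from layer to layer, avoiding the vanishing/saturating statistics shown above.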
Proper initialization is an active area of research…
• Understanding the difficulty of training deep feedforward neural networks, Glorot and Bengio, 2010
• Exact solutions to the nonlinear dynamics of learning in deep linear neural networks, Saxe et al., 2013
• Random walk initialization for training very deep feedforward networks, Sussillo and Abbott, 2014
• Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification, He et al., 2015
This Time: Localization and Detection

Computer Vision Tasks
• Classification
• Classification + Localization
• Object Detection
• Instance Segmentation
Classification + Localization: Task
Classification (C classes):
• Input: image
• Output: class label (e.g. CAT)
• Evaluation metric: accuracy
Localization:
• Input: image
• Output: box in the image (x, y, w, h)
• Evaluation metric: Intersection over Union

Classification + Localization: ImageNet
Idea #1: Localization as Regression
• Input: image
• Correct output: box coordinates (4 numbers)
• Loss: L2 distance
• Only one object, simpler than detection
Simple Recipe for Classification + Localization
• A convolution-and-pooling backbone produces a final conv feature map from the image.
• A “classification head”: fully-connected layers producing class scores (softmax loss).
• A “regression head”: fully-connected layers producing box coordinates.
• Step 3: Train the regression head only, with SGD and L2 loss.
Per-class vs. class-agnostic regression
• Class agnostic: 4 numbers (one box)
• Class specific: C × 4 numbers (one box per class)
Where to attach the regression head?
Aside: Localizing multiple objects
• For K objects: fully-connected layers output K × 4 numbers (one box per object)
Aside: Human Pose Estimation
Idea #2: Sliding Window
Sliding Window: Overfeat
• Winner of the ILSVRC 2013 localization challenge
• Image (3 × 221 × 221) → convolution + pooling → feature map (1024 × 5 × 5)
• Classification head: FC (4096) → FC (4096) → class scores (1000), softmax loss
• Regression head: FC (4096) → FC (1024) → boxes (1000 × 4), Euclidean loss

Sermanet et al, “Integrated Recognition, Localization and Detection using Convolutional Networks”, ICLR 2014
Sliding Window: Overfeat
• Network input: 3 × 221 × 221
• Slide the network over a larger image (3 × 257 × 257), computing a classification score P(cat) for each window position (e.g. 0.5, 0.75, 0.6, 0.8).
• Merge the per-window scores and boxes into a single classification score: P(cat) = 0.8.

Sermanet et al, “Integrated Recognition, Localization and Detection using Convolutional Networks”, ICLR 2014
Efficient Sliding Window: Overfeat
• Convert the fully-connected layers into convolutional layers: the first FC layer over the 1024 × 5 × 5 feature map becomes a 5 × 5 conv, and subsequent FC layers become 1 × 1 convs.
• The whole network can then be applied to a larger image in a single pass.

Sermanet et al, “Integrated Recognition, Localization and Detection using Convolutional Networks”, ICLR 2014
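The FC → conv conversion can be sketched in 1D: a fully-connected layer applied to a flattened length-K feature vector is the same dot product a size-K convolution computes at one position, so the same weights can slide over a larger map. All names here are illustrative.

```python
def fc_as_conv(feature, weights):
    """Apply an FC layer of size len(weights) at every valid position.

    When len(feature) == len(weights), this returns exactly one value:
    the ordinary fully-connected output. On a larger input it yields one
    output per window position, i.e. the efficient sliding window.
    """
    K = len(weights)
    return [sum(feature[i + k] * weights[k] for k in range(K))
            for i in range(len(feature) - K + 1)]
```

On the training-size input you get the original FC result; on a larger input you get the sliding-window scores for free.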
ImageNet Classification + Localization
Detection as Regression?
• 4 objects (DOG, CAT, CAT, DUCK), each with (x, y, w, h) = 16 numbers
• 2 objects (DOG, CAT) = 8 numbers
• Many cats = many numbers
• Each image needs a different number of outputs!
Detection as Classification
• Slide a window over the image and classify each region: CAT? NO, DOG? NO → CAT? YES!, DOG? NO → CAT? NO, DOG? NO → …
Histogram of Oriented Gradients
Dalal and Triggs, “Histograms of Oriented Gradients for Human Detection”, CVPR 2005
Slide credit: Ross Girshick

Deformable Parts Model (DPM)

Aside: Deformable Parts Models are CNNs?
Girshick et al, “Deformable Part Models are Convolutional Neural Networks”, CVPR 2015
Region Proposals

Region Proposals: Selective Search
• Convert regions to boxes

Region Proposals: Many other choices
Hosang et al, “What makes for effective detection proposals?”, PAMI 2015
Putting it together: R-CNN

R-CNN Training
• Train a full ImageNet classification model: convolution and pooling → final conv feature map → fully-connected layers → class scores (1000 classes), softmax loss.
• Fine-tune the model for detection: class scores for 21 classes.
• Extract and cache region features (pool5 features) with the convolution and pooling layers.
• Step 4: Train one binary SVM per class to classify region features (positive and negative samples for the cat SVM, for the dog SVM, etc.).
• Step 5 (bbox regression): For each class, train a linear regression model to map from cached features to offsets to GT boxes, to make up for “slightly wrong” proposals.
Object Detection: Datasets

                           PASCAL VOC   ImageNet Detection   MS-COCO
                           (2010)       (ILSVRC 2014)        (2014)
Number of classes          20           200                  80
Number of images
(train + val)              ~20k         ~470k                ~120k
Mean objects per image     2.4          1.1                  7.2
Object Detection: Evaluation
• Compute average precision (AP) separately for each class, then average over classes (mAP).
• A detection is a true positive if it has IoU (Intersection over Union) with a ground-truth box greater than some threshold (usually 0.5) (mAP@0.5).
• Combine all detections from all test images to draw a precision/recall curve for each class; AP is the area under the curve.
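The IoU criterion above, for two axis-aligned boxes in the (x, y, w, h) format used earlier; a plain sketch, not a library implementation.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))  # overlap width
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))  # overlap height
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0
```

A detection whose IoU with a ground-truth box exceeds 0.5 would count as a true positive under mAP@0.5.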
R-CNN Results
R-CNN Problems
Fast R-CNN
R-CNN Problem #1: Slow at test time, due to independent forward passes of the CNN.
Solution: Share computation of convolutional layers between proposals for an image.

R-CNN Problem #2: Post-hoc training: the CNN is not updated in response to the final classifiers and regressors.
Solution: Just train the whole system end-to-end, all at once!
Fast R-CNN: Region of Interest Pooling
• Convolution and pooling produce a conv feature map; the fully-connected layers expect a fixed-size input.
• Project each region proposal onto the feature map, divide it into a fixed grid, and max-pool within each grid cell.
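A simplified RoI pooling sketch: crop a region from the feature map, then max-pool it down to a fixed grid so the fully-connected layers always see the same size. The 2×2 grid, the integer rounding of cell boundaries, and the assumption that the region is at least grid-sized are illustrative choices.

```python
def roi_pool(fmap, x0, y0, x1, y1, out=2):
    """Max-pool the region fmap[y0:y1][x0:x1] down to an out x out grid."""
    h, w = y1 - y0, x1 - x0
    pooled = []
    for i in range(out):
        row = []
        for j in range(out):
            ys = range(y0 + i * h // out, y0 + (i + 1) * h // out)
            xs = range(x0 + j * w // out, x0 + (j + 1) * w // out)
            row.append(max(fmap[y][x] for y in ys for x in xs))
        pooled.append(row)
    return pooled
```

Regions of different sizes all come out as the same fixed-size grid, which is what lets one set of FC weights serve every proposal.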
Fast R-CNN Results (using VGG-16 CNN on the PASCAL VOC 2007 dataset)

                         R-CNN        Fast R-CNN
Training time:           84 hours     9.5 hours
(Speedup)                1x           8.8x            Faster!
Test time per image:     47 seconds   0.32 seconds
(Speedup)                1x           146x            FASTER!
mAP (VOC 2007):          66.0         66.9            Better!
Fast R-CNN Problem: the 146× test-time speedup does not include computing region proposals.
Solution: make the CNN compute the proposals too (next: Faster R-CNN).
Faster R-CNN: Region Proposal Network

Faster R-CNN: Training

Faster R-CNN: Results
Object Detection State-of-the-art:
ResNet-101 + Faster R-CNN + some extras
He et al., “Deep Residual Learning for Image Recognition”, arXiv 2015
ImageNet Detection 2013 - 2015
YOLO: You Only Look Once (Detection as Regression)
Object Detection code links:
• R-CNN (Caffe + MATLAB): https://github.com/rbgirshick/rcnn (probably don’t use this; too slow)
• Fast R-CNN (Caffe + MATLAB): https://github.com/rbgirshick/fast-rcnn
• Faster R-CNN (Caffe + MATLAB): https://github.com/ShaoqingRen/faster_rcnn ; (Caffe + Python): https://github.com/rbgirshick/py-faster-rcnn
• YOLO: http://pjreddie.com/darknet/yolo/
Recap
Localization:
- Find a fixed number of objects (one or many)
- L2 regression from CNN features to box coordinates
- Much simpler than detection; consider it for your projects!
- Overfeat: Regression + efficient sliding window with FC -> conv conversion
- Deeper networks do better
Object Detection:
- Find a variable number of objects by classifying image regions
- Before CNNs: dense multiscale sliding window (HoG, DPM)
- Avoid dense sliding window with region proposals
- R-CNN: Selective Search + CNN classification / regression
- Fast R-CNN: Swap order of convolutions and region extraction
- Faster R-CNN: Compute region proposals within the network
- Deeper networks do better