cogneethi.com
Classification Pipeline
Feature Extractor (HOG/SIFT/CNN/etc.) -> Classifier (SVM / FC + Softmax / etc.) -> class scores
Example class scores: 0.8, 0.1, 0.02, 0.03, 0.02, 0.03
Classification
AlexNet/VGG
Cat 0.8
Dog 0.1
Rhino 0.02
Hippo 0.02
Elephant 0.02
Mouse 0.04
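A minimal sketch of how a softmax layer turns raw network outputs into class scores that sum to 1, like the scores listed above. The function name `softmax` and the logit values are illustrative assumptions, not taken from the slides.

```python
import math

def softmax(logits):
    """Convert raw class scores (logits) into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three classes (e.g. Cat, Dog, Rhino); illustrative only.
probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
```

The largest logit always gets the largest probability, which is why the predicted class is simply the arg-max of the softmax output.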
AlexNet/VGG with two outputs:
• Get Class Scores Using Softmax: Human, Car, Dog, Cat, Bicycle, etc.
• Get Bounding boxes Using L2 loss: (x1, y1, x2, y2)
[Figure: example image with a bounding box; labels (x0, y0), (x1, y1), (x2, y2), w, h]
Image Credit - http://host.robots.ox.ac.uk/pascal/VOC/voc2012/examples/index.html
L2 Loss
Expected (x1 y1 x2 y2): 200 250 600 400

Prediction (x1 y1 x2 y2)   Squared differences                                    L2 Loss
0   0   800 600            (200-0)^2 + (250-0)^2 + (600-800)^2 + (400-600)^2      182500
100 150 700 450            (200-100)^2 + (250-150)^2 + (600-700)^2 + (400-450)^2  32500
210 245 590 405            (200-210)^2 + (250-245)^2 + (600-590)^2 + (400-405)^2  250
200 250 600 400            (200-200)^2 + (250-250)^2 + (600-600)^2 + (400-400)^2  0
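The L2 loss values for the four predictions can be reproduced with a short sketch. The helper name `l2_loss` is an assumption for illustration; the box coordinates are the ones on the slide.

```python
def l2_loss(expected, predicted):
    """Sum of squared differences between two (x1, y1, x2, y2) boxes."""
    return sum((e - p) ** 2 for e, p in zip(expected, predicted))

expected = (200, 250, 600, 400)
predictions = [
    (0, 0, 800, 600),      # whole image
    (100, 150, 700, 450),  # loose box
    (210, 245, 590, 405),  # close box
    (200, 250, 600, 400),  # exact match
]
losses = [l2_loss(expected, p) for p in predictions]
print(losses)
```

The loss shrinks as the predicted corners approach the expected ones and reaches 0 for an exact match.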
Combining Results
AlexNet/VGG feeds a Classifier (FC + Softmax) and a BBox Regressor.
Classifier scores: 0.95 - Boat, 0.03 - TV, 0.02 - Person

Class    Conf   Bbox coordinates
Person   0.02   380 200 430 400
Overfeat
[Figure: Localization CNN -> Confidence scores + BBox]
[Figure: zero-padded input grid fed to AlexNet/VGG]
ConvNets input size constraints – FC as Conv
Image -> Weights/Filter -> Feature Maps -> Pool -> Pooled Feature Maps -> FV -> FC Layers
[Figure: zero-padded input grids of two different sizes, scanned horizontally (H) and vertically (V)]
1. Does this make sense?
2. If so, what does this mean?
Receptive Field
Every value in the output encodes information from some 4x4 patch of the image.
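A rough sketch of how a receptive field size can be computed layer by layer. The slides do not say which layers produce the 4x4 patch, so the layer stack below (a 3x3 conv with stride 1 followed by a 2x2 pool with stride 2) is a hypothetical example that happens to give 4; the function name `receptive_field` is also an assumption.

```python
def receptive_field(layers):
    """Receptive field (in input pixels, per dimension) of one output value.

    layers: list of (kernel_size, stride) tuples, from input to output.
    """
    rf, jump = 1, 1  # jump = distance in input pixels between adjacent outputs
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf

# Hypothetical stack: 3x3 conv (stride 1), then 2x2 pool (stride 2).
patch = receptive_field([(3, 1), (2, 2)])
print(f"{patch}x{patch}")
```

Stacking more layers only grows the patch, so deeper outputs summarise larger portions of the image.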
[Figure: zero-padded input grid scanned horizontally (H) and vertically (V), producing a Spatial output]
1. Does this make sense? -> Yes.
2. If so, what does this mean? -> Each value of the spatial output represents the computation on a different portion of the image.
Spatial Output as Sliding Window
[Figure: CNN]
ConvNets and Sliding Window Efficiency
[Figure: Localization CNN moved horizontally (H) and vertically (V) over the image, giving Confidence scores and a BBox; panel labels 8x8 and 6x6]
Spatial Output for Image Pyramids
Overfeat
Sliding Window Crop -> FC as Conv (No input size constraint) + Spatial Output + Image Pyramid
Input: 245x245. Spatial output sizes for the larger pyramid scales: 2x3, 3x5, 5x7, 6x7, 7x10
Figure axis labels: Smaller objects, Larger objects
If you want to detect even smaller objects, use even bigger image pyramids. The trade-off is an increase in computation.
Overfeat - Classification
245x245 image -> First 5 Layers of AlexNet (Modified) -> 5x5 x256 Feature Map
5x5 x256 -> 5x5 conv -> 1x1x4096 -> 1x1 conv -> 1x1x4096 -> 1x1 conv -> 1x1xC
Get Class scores Using Softmax
OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks – Sermanet et al
Images Credit – Overfeat paper & https://towardsdatascience.com/object-localization-in-overfeat-5bb2f7328b62
Convolution
[Figure: Feature Map, Filter -> Output]
Overfeat
Fully Connected layer implemented as a convolution layer
245x245 image -> First 5 Layers of AlexNet (Modified) -> 5x5 Feature Map (from Conv + Pool Layers)
5x5 Feature Map -> 5x5 Filters -> 1x1 Conv Output -> 1x1 Filters -> 1x1 Output -> 1x1 Filters -> 1x1 Final output (For 1 Class)
Overfeat
Fully Connected layer implemented as a convolution layer
245x245 image -> First 5 Layers of AlexNet (Modified) -> 5x5 Feature Map
281x317 image -> First 5 Layers of AlexNet (Modified) -> 6x7 Feature Map
6x7 Feature Map -> 5x5 Filters -> 2x3 Conv Output -> 1x1 Filters -> 2x3 Output -> 1x1 Filters -> 2x3 Final spatial output (For 1 Class)
What about feature map depth? Isn't it 256 or 512?
The dimensions of the filters should remain the same. That's the whole point: you want your network to work irrespective of the image size.
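A small pure-Python sketch of the idea: the same FC weights, viewed as 5x5 filters, give the FC result on a 5x5 feature map and a 2x3 spatial output on a 6x7 feature map. The sizes are toy values (depth 2 and 3 output units instead of 256 and 4096), and all function names are assumptions for illustration.

```python
import random

def fc(flat_input, weights):
    """Plain fully connected layer: one dot product per output unit."""
    return [sum(w * x for w, x in zip(row, flat_input)) for row in weights]

def flatten(fmap):
    """Flatten an H x W x D feature map in (row, col, channel) order."""
    return [v for row in fmap for cell in row for v in cell]

def conv_valid(fmap, weights, k):
    """Slide each FC weight row, viewed as a k x k x D filter, over the map
    (stride 1, no padding)."""
    h, w = len(fmap), len(fmap[0])
    out = []
    for i in range(h - k + 1):
        out_row = []
        for j in range(w - k + 1):
            window = [r[j:j + k] for r in fmap[i:i + k]]
            out_row.append(fc(flatten(window), weights))
        out.append(out_row)
    return out

def random_map(h, w, d, rng):
    return [[[rng.random() for _ in range(d)] for _ in range(w)] for _ in range(h)]

rng = random.Random(0)
K, D, OUT = 5, 2, 3  # toy sizes; the slides use 5x5 x256 -> 4096
weights = [[rng.random() for _ in range(K * K * D)] for _ in range(OUT)]

small = random_map(5, 5, D, rng)  # stands in for the 245x245 image's 5x5 map
large = random_map(6, 7, D, rng)  # stands in for the 281x317 image's 6x7 map

fc_out = fc(flatten(small), weights)
conv_small = conv_valid(small, weights, K)  # 1x1 spatial output
conv_large = conv_valid(large, weights, K)  # 2x3 spatial output
print(len(conv_large), len(conv_large[0]))
```

The weights never change shape; only the number of positions the filter can occupy grows with the feature map.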
Overfeat
Fully Connected layer implemented as a convolution layer
Feature Map (from Conv + Pool Layers), depth x256
-> 5x5 Filters, 256*4096 -> 2x3 Conv Output, depth x4096
-> 1x1 Filters, 4096*4096 -> 2x3 Output, depth x4096
-> 1x1 Filters, 4096*C -> 2x3 Output, depth xC
Model Size
AlexNet
First Conv layer: 11*11*3*96 = 34,848 weights (~34KB)
First FC layer: 13*13*256*4096 = 177,209,344 weights (~177MB)
The Approx column treats one weight as one byte; stored as 32-bit floats, the sizes are 4x larger.

Input  Conv Filter  Output  Total Weights  Approx
3      11 x 11      96      34848          34KB
96     5 x 5        256     614400         600KB
256    3 x 3        384     884736         900KB
384    3 x 3        384     1327104        1.3MB
384    3 x 3        256     884736         900KB
Total (5 Conv Layers)       3745824        3.7MB

FC input  FC Output  Total Weights  Approx
43264     4096       177209344      177MB
4096      4096       16777216       16MB
4096      1000       4096000        4MB
Total                198082560      ~198MB
VGG

Input  Conv Filter  Output  Total Weights  Approx
3      3 x 3        64      1728           1.7KB
64     3 x 3        64      36864          36KB
64     3 x 3        128     73728          73KB
128    3 x 3        128     147456         150KB
128    3 x 3        256     294912         300KB
256    3 x 3        256     589824         600KB
256    3 x 3        256     589824         600KB
256    3 x 3        512     1179648        1.2MB
512    3 x 3        512     2359296        2.4MB
512    3 x 3        512     2359296        2.4MB
512    3 x 3        512     2359296        2.4MB
512    3 x 3        512     2359296        2.4MB
512    3 x 3        512     2359296        2.4MB
Total (13 Conv Layers)      14710464       14MB

FC input  FC Output  Total Weights  Approx
25088     4096       102760448      102MB
4096      4096       16777216       16MB
4096      1000       4096000        4MB
Total                123633664      123MB

• You can increase the depth of your CNN without significantly increasing model size.
• But even for a 3 layer FC Network, you need significant memory for weights.
• How can we do Classifications/Bbox regression without significantly increasing model size?
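The weight counts in the two Model Size tables can be checked with a few lines. This is a sketch under the slides' own assumptions: biases are ignored, and the AlexNet FC input is taken as 13*13*256 as on the slide (the published AlexNet pools the last conv output down to 6x6x256 before the first FC layer, which gives a smaller count). The helper names are hypothetical.

```python
def conv_weights(in_ch, k, out_ch):
    """Weights of a k x k conv layer, ignoring biases."""
    return k * k * in_ch * out_ch

def fc_weights(n_in, n_out):
    """Weights of a fully connected layer, ignoring biases."""
    return n_in * n_out

# (input channels, filter size, output channels), as listed on the slides
alexnet_conv = [(3, 11, 96), (96, 5, 256), (256, 3, 384), (384, 3, 384), (384, 3, 256)]
vgg_conv = [(3, 3, 64), (64, 3, 64), (64, 3, 128), (128, 3, 128),
            (128, 3, 256), (256, 3, 256), (256, 3, 256), (256, 3, 512)] + [(512, 3, 512)] * 5

# (FC input, FC output); 13*13*256 follows the slide, not the published AlexNet
alexnet_fc = [(13 * 13 * 256, 4096), (4096, 4096), (4096, 1000)]
vgg_fc = [(7 * 7 * 512, 4096), (4096, 4096), (4096, 1000)]

totals = {
    "alexnet_conv": sum(conv_weights(*layer) for layer in alexnet_conv),
    "alexnet_fc": sum(fc_weights(*layer) for layer in alexnet_fc),
    "vgg_conv": sum(conv_weights(*layer) for layer in vgg_conv),
    "vgg_fc": sum(fc_weights(*layer) for layer in vgg_fc),
}
for name, count in totals.items():
    print(f"{name}: {count:,} weights")
```

In both networks the three FC layers hold far more weights than all the conv layers together, which is the point the bullets make.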
Model Sizes
Conv layers: 4/14MB (the conv-layer totals from the AlexNet and VGG tables, rounded)
245x245 image -> First 5 Layers of AlexNet (Modified) -> 5x5 x256 Feature Map -> 6400 -> 4096 -> 4096 -> C

FC input  FC Output  Total Weights  Approx
6400      4096       26214400       26MB
4096      4096       16777216       16MB
4096      21         86016          86KB
Total                43077632       43MB
1x1 Conv
• Using 1x1 Conv in FC Layers significantly reduces the number of weights needed.

FC as convolution (Conv layers: 4/14MB)
Filters: 5x5 with 256*4096, 1x1 with 4096*4096, 1x1 with 4096*21
245x245 image: 5x5 x256 Feature Map -> 1x1 x21 output
281x317 image: 6x7 x256 Feature Map -> 2x3 x21 output, with the same filters

FC input  FC Output  Total Weights  Approx
6400      4096       26214400       26MB
4096      4096       16777216       16MB
4096      21         86016          86KB
Total                43077632       43MB

FC as dot product, 281x317 image (6x7 x256 = 10752 inputs; each count is *6, matching the 2x3 = 6 output positions)

FC input  FC Output  Total Weights (*6)  Approx
10752     4096       264241152           264MB
4096      4096       100663296           100MB
4096      21         516096              516KB
Total                365420544           365MB
Overfeat
C = number of classes + 1 (Background)
245x245 image -> First 5 Layers of AlexNet (Modified) -> 5x5 x256 Feature Maps
Classification: 5x5 conv -> 1x1x4096 -> 1x1 conv -> 1x1x4096 -> 1x1 conv -> 1x1xC. Get Class scores Using Softmax.
Localization: 5x5 conv -> 1x1x4096 -> 1x1 conv -> 1x1x1024 -> 1x1 conv -> 1x1x4xC. Get Bounding boxes (x1, y1, x2, y2) Using L2 loss.
Overfeat
6 Scaled images in the ratio ~1.4; the 245x245 image gives 1x1x4096 outputs.
Sliding Window Crop -> FC as Conv (No input size constraint) + Spatial Output + Image Pyramid
Resolution = 36
Improving Resolution
245x245 Input Image -> Output of Last Convolution layer: 15x15
Output size = (W-F+2P)/S + 1
3x3 Pool, Stride = 3, Padding = 0: (15-3)/3 + 1 = 5 -> 5x5 -> 5x5 FC layer -> 1x1
3x3 Pool, Stride = 2, Padding = 0: (15-3)/2 + 1 = 7 -> 7x7 -> 5x5 FC layer -> 3x3
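The output-size formula (W-F+2P)/S + 1 and the two pooling cases can be checked with a short sketch. The function name `output_size` is an assumption; treating the 5x5 FC layer as a stride-1 filter follows the FC-as-Conv slides.

```python
def output_size(w, f, s, p=0):
    """Output width for input width w, filter size f, stride s, padding p."""
    span = w - f + 2 * p
    if span % s != 0:
        raise ValueError("filter does not tile the input evenly")
    return span // s + 1

pool_stride3 = output_size(15, 3, 3)  # 3x3 pool, stride 3 on the 15x15 map
pool_stride2 = output_size(15, 3, 2)  # 3x3 pool, stride 2 on the 15x15 map

# The 5x5 FC layer, applied as a stride-1 convolution on each pooled map
fc_on_5 = output_size(pool_stride3, 5, 1)
fc_on_7 = output_size(pool_stride2, 5, 1)
print(pool_stride3, fc_on_5, pool_stride2, fc_on_7)
```

Reducing the pooling stride from 3 to 2 turns a single 1x1 prediction into a 3x3 grid of predictions for the same image.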
Overfeat: the 6 scaled images include 245x245 and 281x317.
Spatial Output

Spatial output  3x per dimension  Rows  Cols  Positions
1x1xC           3x3xC             3     3     9
2x3xC           6x9xC             6     9     54
3x5xC           9x15xC            9     15    135
5x7xC           15x21xC           15    21    315
6x7xC           18x21xC           18    21    378
7x10xC          21x30xC           21    30    630
Total                                         1521
1521 x21 = 31941

Results
Won the ImageNet Localization challenge in 2013.
In Overfeat, they use a Greedy Merge strategy, but NMS can be used in its place.
Greedy Merge is not commonly used, so I am skipping the discussion.
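The totals in the Spatial Output table follow directly from the six output sizes; a quick sketch to reproduce them. The variable names are illustrative, and 21 is the number of classes used on the slides.

```python
# Spatial output sizes (rows, cols) for the 6 scaled images
coarse = [(1, 1), (2, 3), (3, 5), (5, 7), (6, 7), (7, 10)]

# Each dimension is 3x larger in the higher-resolution outputs
fine = [(3 * rows, 3 * cols) for rows, cols in coarse]

positions = [rows * cols for rows, cols in fine]
total_positions = sum(positions)
total_scores = total_positions * 21  # one score per class per position
print(fine, positions, total_positions, total_scores)
```

Every position also carries a bounding box prediction, which is why the outputs need to be merged (Greedy Merge or NMS) afterwards.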
Ideas for Localization using ConvNets
Case #1 – Only one object per image
AlexNet/VGG with two outputs:
• Get Class Scores Using Softmax: Human, Car, Dog, Cat, Bicycle, etc.
• Get Bounding boxes Using L2 loss: (x1, y1, x2, y2)
Overfeat
Sliding Window Crop -> FC as Conv (No input size constraint) + Spatial Output + Image Pyramid
Effective Stride = 36
FC as 1x1 convolutions – Model size remains the same irrespective of the size of the image
- Result: significant reduction in model size (43077632 weights, 43MB, for both the 245x245 and the 281x317 image).
FC as dot product operations – model size increases for larger images (365420544 weights, 365MB, for the 281x317 image).
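A sketch comparing the two weight counts. It assumes the slides' setup: feature map depth 256, two hidden layers of 4096, 21 classes, biases ignored, and the dot-product version repeated for each of the 2x3 = 6 output positions. The function names are hypothetical.

```python
def conv_head_weights(k, depth, hidden, classes):
    """FC layers as convolutions: one k x k filter bank, then two 1x1 banks.
    Independent of the feature map size."""
    return k * k * depth * hidden + hidden * hidden + hidden * classes

def dot_product_head_weights(fmap_h, fmap_w, depth, hidden, classes, positions):
    """FC layers as dot products over the whole feature map,
    repeated once per output position (as on the slide)."""
    n_in = fmap_h * fmap_w * depth
    return positions * (n_in * hidden + hidden * hidden + hidden * classes)

as_conv = conv_head_weights(5, 256, 4096, 21)
as_dot_product = dot_product_head_weights(6, 7, 256, 4096, 21, positions=6)
print(as_conv, as_dot_product, round(as_dot_product / as_conv, 1))
```

The convolutional count depends only on the filter shapes, so it stays at the same value for every scale in the image pyramid.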