Overfeat

cogneethi.com
Evolution of Object Detection Networks

What is Object Detection

Classification Pipeline
Conv and Pool Layers (as feature extractors) -> Fully Connected Layers (for classification)

Feature Extractor (HOG/SIFT/CNN/etc.) -> Classifier (SVM/FC + Softmax/etc.) -> class scores (0.8, 0.1, 0.02, 0.03, 0.02, 0.03)

Classification
Image -> AlexNet/VGG -> Get Class Scores Using Softmax
Conv and Pool Layers (feature maps, as feature extractors) -> Fully Connected Layers (for classification)

Class scores:
Cat       0.8
Dog       0.1
Rhino     0.02
Hippo     0.02
Elephant  0.02
Mouse     0.04
Image Credit - http://host.robots.ox.ac.uk/pascal/VOC/voc2012/examples/index.html

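The softmax step that turns the network's raw outputs into class scores can be sketched as below. This is a minimal illustration, not the slides' code: the `logits` values are made up for the example, and only the class names come from the slide.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

classes = ["Cat", "Dog", "Rhino", "Hippo", "Elephant", "Mouse"]
logits = [4.0, 1.9, 0.3, 0.3, 0.3, 1.0]  # hypothetical raw outputs
scores = softmax(logits)
best = classes[scores.index(max(scores))]
```

The predicted class is simply the one with the highest score.
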
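Ideas for Localization using ConvNets
Case #1 – Only one object per image
Classes: Human, Car, Dog, Cat, Bicycle, etc.

AlexNet/VGG -> Get Class Scores Using Softmax -> Human, Car, Dog, Cat, Bicycle, etc.
AlexNet/VGG -> Get Bounding boxes Using L2 loss -> (x1, y1, x2, y2)

Bounding box: (x1, y1) and (x2, y2) are opposite corners, w and h are the box width and height, and (x0, y0) marks the origin.
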
Bounding Box Regression Training
Image size: 800 x 600
Expected box: (x1, y1) = (200, 250), (x2, y2) = (600, 400)

             x1   y1   x2   y2   L2 Loss
Expected     200  250  600  400
Prediction     0    0  800  600  (200-0)^2 + (250-0)^2 + (600-800)^2 + (400-600)^2 = 182500
Prediction   100  150  700  450  (200-100)^2 + (250-150)^2 + (600-700)^2 + (400-450)^2 = 32500
Prediction   210  245  590  405  (200-210)^2 + (250-245)^2 + (600-590)^2 + (400-405)^2 = 250
Prediction   200  250  600  400  (200-200)^2 + (250-250)^2 + (600-600)^2 + (400-400)^2 = 0

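The L2 loss values in the table can be reproduced with a few lines of Python. This is a small sketch; the function name `l2_loss` is chosen here for illustration, while the boxes are the ones from the slide.

```python
def l2_loss(expected, predicted):
    """Sum of squared differences between two (x1, y1, x2, y2) boxes."""
    return sum((e - p) ** 2 for e, p in zip(expected, predicted))

expected = (200, 250, 600, 400)
predictions = [
    (0, 0, 800, 600),      # whole image
    (100, 150, 700, 450),  # closer
    (210, 245, 590, 405),  # almost right
    (200, 250, 600, 400),  # exact match
]
losses = [l2_loss(expected, p) for p in predictions]
```

The loss shrinks as the predicted box approaches the expected one and reaches 0 for an exact match.
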
BBox General Properties
Image Credits - see Notes

About Bounding Boxes
[Figure: bounding box example with the labels 300, 300 and 500, 600]

AlexNet/VGG -> Classifier: Get Class scores Using Softmax -> C class scores
AlexNet/VGG -> BBox Regressor: Get Bounding boxes Using L2 loss -> (x1, y1, x2, y2)

Combining Results
FC + Softmax scores: 0.95 - Boat, 0.03 - TV, 0.02 - Person

Class   Conf  x1   y1   x2   y2
Person  0.02  380  200  430  400
Boat    0.95  210  245  590  405
TV      0.03  700   10  790  100

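For the single-object case, combining the two heads amounts to picking the class with the highest confidence and reading off its box. A minimal sketch, using the values from the table (the tuple layout is an assumption made for this example):

```python
# (class, confidence, (x1, y1, x2, y2)) rows from the slide's table
detections = [
    ("Person", 0.02, (380, 200, 430, 400)),
    ("Boat", 0.95, (210, 245, 590, 405)),
    ("TV", 0.03, (700, 10, 790, 100)),
]

# Keep the row with the highest softmax confidence.
best_class, best_conf, best_box = max(detections, key=lambda d: d[1])
```
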
Overfeat

Ideas for Detection
Localization CNN -> Confidence scores, BBox
Neither do I know the number of objects nor the location of those objects.
Credits – See Description

Ideas for Detection – Sliding Window
Neither do I know the number of objects nor the location of those objects.
Localization CNN -> Confidence scores, BBox

Ideas for Detection – Sliding Window + Image Pyramid
Sliding Window – Location
Image Pyramid - Scale (from smaller objects to larger objects)

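The sliding window plus image pyramid idea can be sketched as two small helpers: one lists the scaled image sizes, the other enumerates window positions at one scale. The window size, stride, scale factor and the 256x192 image are all illustrative assumptions, not values from the slides.

```python
def pyramid_sizes(width, height, scale=1.5, min_size=64):
    """List image sizes, shrinking by `scale` until a side drops below min_size."""
    sizes = []
    w, h = float(width), float(height)
    while min(w, h) >= min_size:
        sizes.append((int(round(w)), int(round(h))))
        w, h = w / scale, h / scale
    return sizes

def sliding_windows(width, height, win=64, stride=32):
    """Yield (x1, y1, x2, y2) crops that cover the image at a fixed stride."""
    for y in range(0, height - win + 1, stride):
        for x in range(0, width - win + 1, stride):
            yield (x, y, x + win, y + win)

sizes = pyramid_sizes(256, 192)
crops = {size: list(sliding_windows(*size)) for size in sizes}
total_crops = sum(len(c) for c in crops.values())
```

Every crop would be resized and sent through the localization CNN separately, which is why the number of crops drives the cost of this approach.
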
Ideas for Detection using ConvNets
Crop + Resize with Sliding Window + Image Pyramid
Sliding Window – Location

Each crop -> AlexNet/VGG Conv and Pool Layers (feature maps, as feature extractors)
  -> Get Class scores Using Softmax
  -> Get Bounding boxes Using L2 loss -> (x1, y1, x2, y2)

ConvNets input size constraints
Image -> Weights/Filter -> Feature Maps -> Pooled Feature Maps -> FV -> FC Layers
[Figure: two input images of different sizes (8 and 10 values per row, drawn with a border of zeros) passed through this pipeline]

ConvNets input size constraints – FC as Conv
Image -> Weights/Filter -> Feature Maps -> Pool -> Pooled Feature Maps -> FV -> FC Layers
[Figure: the same two inputs, with the horizontal (H) and vertical (V) directions marked on each]

1. Does this make sense?
2. If so, what does this mean?

Receptive Field
Two 2x2 Pool layers, Stride = 2:
4x4 -> 2x2 Pool -> 2x2 -> 2x2 Pool -> 1x1
8x8 -> 2x2 Pool -> 4x4 -> 2x2 Pool -> 2x2
Every value in the output encodes information from some 4x4 patch of the image.

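The 4x4 receptive field can be checked with the usual layer-by-layer recurrence. This is a sketch under the assumption that each layer is described only by its kernel size and stride, with no padding; the helper names are made up for the example.

```python
def receptive_field(layers):
    """Return (receptive field, effective stride) for a stack of (kernel, stride) layers."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each layer widens the field by (kernel - 1) input steps
        jump *= stride             # distance in input pixels between neighbouring outputs
    return rf, jump

def output_size(size, layers):
    """Spatial size after applying the layers with no padding."""
    for kernel, stride in layers:
        size = (size - kernel) // stride + 1
    return size

pools = [(2, 2), (2, 2)]  # two 2x2 pools with stride 2
rf, jump = receptive_field(pools)
```

With these two pools, a 4x4 input collapses to a single value, while an 8x8 input gives a 2x2 output in which each value still covers a 4x4 patch.
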

Same Localization CNN
Image -> Weights/Filter -> Feature Maps -> Pooled Feature Maps -> FC Layers
[Figure: the same localization CNN applied to both inputs; the larger input yields a spatial output along H and V]

1. Does this make sense? -> Yes.
2. If so, what does this mean? -> The spatial output represents the computations on different portions of the image.

Spatial Output as Sliding Window
[Figure: CNN]

ConvNets and Sliding Window Efficiency
Localization CNN -> Confidence scores, BBox
[Figure: the Localization CNN applied across the image along H and V]

ConvNets and Sliding Window Efficiency
3x3 filter, every row: 1 0 -1

8x8 image, every row:    0 0 255 255 0 0 255 255
6x6 output, every row:   -765 -765 765 765 -765 -765

10x10 image, every row:  0 0 255 255 0 0 255 255 0 0
8x8 output, every row:   -765 -765 765 765 -765 -765 765 765

8x8 image, every row:    255 255 0 0 255 255 0 0
6x6 output, every row:   765 765 -765 -765 765 765

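The three examples above can be reproduced with a plain "valid" convolution (no padding, stride 1, filter not flipped, as is usual in ConvNets). The sketch below also checks the point of the slide: the outputs for the two 8x8 windows are sub-blocks of the output for the 10x10 image, so one pass over the large image already contains the sliding-window results. The function name `conv2d_valid` is chosen here for illustration.

```python
def conv2d_valid(image, kernel):
    """2D cross-correlation with no padding and stride 1."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]

kernel = [[1, 0, -1]] * 3
row10 = [0, 0, 255, 255, 0, 0, 255, 255, 0, 0]
big = [row10[:] for _ in range(10)]        # 10x10 image
left = [r[0:8] for r in big[0:8]]          # 8x8 window at column 0
shifted = [r[2:10] for r in big[0:8]]      # 8x8 window at column 2

out_big = conv2d_valid(big, kernel)        # 8x8
out_left = conv2d_valid(left, kernel)      # 6x6
out_shifted = conv2d_valid(shifted, kernel)  # 6x6
```
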
Spatial Output for Image Pyramids
Overfeat
Replace the Sliding Window Crop with FC as Conv (no input size constraint) + Spatial Output, and keep the Image Pyramid.
How to modify the localization framework to convert FC as Conv?
Resolution (effective stride) = 36

Image size  Spatial output
461x569     7x10
425x497     6x7
389x461     5x7
317x389     3x5
281x317     2x3
245x245     1x1

The pyramid covers smaller objects through to larger objects.
If you want to detect even smaller objects, use even bigger image pyramids. The trade-off is an increase in computation.

Overfeat - Classification
245x245 -> First 5 Layers of AlexNet (Modified) -> 5x5 x256 Feature Map
  -> 5x5 conv -> 1x1x4096
  -> 1x1 conv -> 1x1x4096
  -> 1x1 conv -> 1x1xC -> Get Class scores Using Softmax
OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks – Sermanet et al
Images Credit – Overfeat paper & https://towardsdatascience.com/object-localization-in-overfeat-5bb2f7328b62

Convolution
Feature Map, Filter -> Output

Overfeat
Fully Connected layer implemented as a convolution layer
245x245 -> AlexNet (Modified) Conv + Pool Layers -> 5x5 Feature Map
Feature Map From Conv+Pool (5x5) -> 5x5 Filters -> 1x1 Outputs -> 1x1 Filters -> 1x1 Outputs -> 1x1 Filters -> 1x1 Final output (For 1 Class)
What about other classes?

Overfeat
281x317 -> First 5 Layers of AlexNet (Modified) -> 6x7 Feature Map (245x245 gives 5x5)
Feature Map From Conv+Pool (6x7) -> 5x5 Filters -> 2x3 Outputs -> 1x1 Filters -> 2x3 Outputs -> 1x1 Filters -> 2x3 Final spatial output (For 1 Class)
The dimensions of the filters should remain the same. That's the whole point: you want your network to work irrespective of the image size.
What about feature map depth? Isn't it 256 or 512?
What about other classes?

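A small sketch of why an FC layer can be run as a convolution: on a 5x5 feature map, a 5x5 filter produces exactly the FC dot product, and on a 6x7 map the same filter produces a 2x3 spatial output whose entries are the FC results for each 5x5 crop. The example assumes a single channel and random integer values, both simplifications made here for illustration.

```python
import random

def conv2d_valid(fmap, filt):
    """2D cross-correlation with no padding and stride 1."""
    fh, fw = len(filt), len(filt[0])
    return [[sum(fmap[r + i][c + j] * filt[i][j]
                 for i in range(fh) for j in range(fw))
             for c in range(len(fmap[0]) - fw + 1)]
            for r in range(len(fmap) - fh + 1)]

def fc_dot(crop, weights):
    """Fully connected neuron: flatten the crop and take a dot product."""
    flat_x = [v for row in crop for v in row]
    flat_w = [v for row in weights for v in row]
    return sum(x * w for x, w in zip(flat_x, flat_w))

rng = random.Random(0)
weights = [[rng.randint(-3, 3) for _ in range(5)] for _ in range(5)]  # one 5x5 filter
small = [[rng.randint(0, 9) for _ in range(5)] for _ in range(5)]     # 5x5 feature map
large = [[rng.randint(0, 9) for _ in range(7)] for _ in range(6)]     # 6x7 feature map

out_small = conv2d_valid(small, weights)  # 1x1
out_large = conv2d_valid(large, weights)  # 2x3
```

The filter itself never changes size; only the output grows with the input.
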
N layer Conv – M Feature Maps
See demo here - http://cs231n.github.io/assets/conv-demo/index.html

Overfeat
245x245 -> AlexNet (Modified) Conv + Pool Layers -> 5x5 x256 Feature Map
Feature Map From Conv+Pool (5x5 x256) -> 5x5 Filters (x4096) -> 1x1 x4096 Outputs -> 1x1 Filters (x4096) -> 1x1 x4096 Outputs -> 1x1 Filters (xC) -> 1x1 xC Final output (For C Classes)
Filter annotations: 256*4096, 4096*4096, 4096*C

Overfeat
281x317 -> First 5 Layers of AlexNet (Modified) -> 6x7 x256 Feature Map (245x245 gives 5x5)
Feature Map From Conv+Pool (6x7 x256) -> 5x5 Filters (x4096) -> 2x3 x4096 Outputs -> 1x1 Filters (x4096) -> 2x3 x4096 Outputs -> 1x1 Filters (xC) -> 2x3 xC
Filter annotations: 256*4096, 4096*4096, 4096*C
Since the height and width of these filters are 1x1, they are also referred to as 1x1 convolutions.

Model Size

Model Sizes AlexNet
First conv layer: 11*11*3*96 = 34,848 (~34KB)
First FC layer: 13*13*256*4096 = 177,209,344 (~177MB)

Input  Conv Filter  Output  Total Weights  Approx
3      11 x 11      96      34848          34KB
96     5 x 5        256     614400         600KB
256    3 x 3        384     884736         900KB
384    3 x 3        384     1327104        1.3MB
384    3 x 3        256     884736         900KB
Total (5 Conv Layers)       3745824        3.7MB

FC input  FC Output  Total Weights  Approx
43264     4096       177209344      177MB
4096      4096       16777216       16MB
4096      1000       4096000        4MB
Total                198082560      ~198MB

Model Sizes - VGGNet
Input  Conv Filter  Output  Total Weights  Approx
3      3 x 3        64      1728           1.7KB
64     3 x 3        64      36864          36KB
64     3 x 3        128     73728          73KB
128    3 x 3        128     147456         150KB
128    3 x 3        256     294912         300KB
256    3 x 3        256     589824         600KB
256    3 x 3        256     589824         600KB
256    3 x 3        512     1179648        1.2MB
512    3 x 3        512     2359296        2.4MB
512    3 x 3        512     2359296        2.4MB
512    3 x 3        512     2359296        2.4MB
512    3 x 3        512     2359296        2.4MB
512    3 x 3        512     2359296        2.4MB
Total (13 Conv Layers)      14710464       14MB

FC input  FC Output  Total Weights  Approx
25088     4096       102760448      102MB
4096      4096       16777216       16MB
4096      1000       4096000        4MB
Total                123633664      123MB

• You can increase the depth of your CNN without significantly increasing model size.
• But even for a 3 layer FC Network, you need significant memory for weights.
• How can we do Classifications/Bbox regression without significantly increasing model size?

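The weight totals in the AlexNet and VGGNet tables follow from multiplying input channels, filter height, filter width and output channels per conv layer, and inputs by outputs per FC layer. A short sketch that recomputes them (biases are ignored, as in the tables; the helper names are made up for this example):

```python
def conv_weights(layers):
    """Total weights for conv layers given as (in_channels, kernel, out_channels)."""
    return sum(cin * k * k * cout for cin, k, cout in layers)

def fc_weights(layers):
    """Total weights for FC layers given as (inputs, outputs)."""
    return sum(i * o for i, o in layers)

alexnet_conv = [(3, 11, 96), (96, 5, 256), (256, 3, 384), (384, 3, 384), (384, 3, 256)]
alexnet_fc = [(13 * 13 * 256, 4096), (4096, 4096), (4096, 1000)]

vgg_conv = [(3, 3, 64), (64, 3, 64), (64, 3, 128), (128, 3, 128),
            (128, 3, 256), (256, 3, 256), (256, 3, 256),
            (256, 3, 512), (512, 3, 512), (512, 3, 512),
            (512, 3, 512), (512, 3, 512), (512, 3, 512)]
vgg_fc = [(25088, 4096), (4096, 4096), (4096, 1000)]

totals = {
    "alexnet_conv": conv_weights(alexnet_conv),
    "alexnet_fc": fc_weights(alexnet_fc),
    "vgg_conv": conv_weights(vgg_conv),
    "vgg_fc": fc_weights(vgg_fc),
}
```

In both networks the three FC layers hold far more weights than all the conv layers combined.
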
Model Sizes
245x245 -> AlexNet (Modified) conv layers (4/14MB) -> 5x5 x256 Feature Map -> FC 6400 -> FC 4096 -> FC 4096 -> C

FC input  FC Output  Total Weights  Approx
6400      4096       26214400       26MB
4096      4096       16777216       16MB
4096      21         86016          86KB
Total                43077632       43MB

1x1 Conv
245x245 -> First 5 Layers of AlexNet (Modified) (4/14MB) -> 5x5 x256 Feature Map
  -> 5x5 Filters (x4096) -> 1x1 x4096 -> 1x1 Filters (x4096) -> 1x1 x4096 -> 1x1 Filters (x21) -> 1x1 x21
Filter annotations: 256*4096, 4096*4096, 4096*21
Equivalent FC layers: 6400 -> 4096 -> 4096 -> 21

FC input  FC Output  Total Weights  Approx
6400      4096       26214400       26MB
4096      4096       16777216       16MB
4096      21         86016          86KB
Total                43077632       43MB

1x1 Conv
• Using 1x1 Conv in FC Layers significantly reduces the number of weights needed.

FC as 1x1 Conv, 281x317 input (4/14MB conv layers, 6x7 x256 feature map, 2x3 x21 output): same filters as for 245x245
FC input  FC Output  Total Weights  Approx
6400      4096       26214400       26MB
4096      4096       16777216       16MB
4096      21         86016          86KB
Total                43077632       43MB

FC as dot products for the same 2x3 output: 6x7 x256 = 10752 inputs, with one set of FC weights per output location (x6)
FC input  FC Output  Total Weights (x6)  Approx
10752     4096       264241152           264MB
4096      4096       100663296           100MB
4096      21         516096              516KB
Total                365420544           365MB

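The two totals above can be recomputed directly. The sketch follows the slide's accounting: the convolutional form reuses one set of filters at every output location, while the dot-product form is counted with a separate set of weights for each of the 2x3 = 6 locations. Variable names are chosen here for illustration.

```python
classes = 21
depth = 256

# FC implemented as convolutions: 5x5 filters, then two 1x1 layers.
conv_form = 5 * 5 * depth * 4096 + 4096 * 4096 + 4096 * classes

# FC as dot products on the 6x7x256 feature map, one weight set per output location.
locations = 2 * 3
fc_input = 6 * 7 * depth
fc_form = locations * (fc_input * 4096 + 4096 * 4096 + 4096 * classes)

ratio = fc_form / conv_form
```

For this input size the dot-product form needs roughly 8.5 times as many weights, and the gap widens as the image grows.
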
Overfeat
245x245 -> First 5 Layers of AlexNet (Modified) -> 5x5 x256 Feature Maps
Classifier: 5x5 conv -> 1x1x4096 -> 1x1 conv -> 1x1x4096 -> 1x1 conv -> 1x1xC -> Get Class scores Using Softmax
  C = number of classes + 1 (Background)
BBox regressor: 5x5 conv -> 1x1x4096 -> 1x1 conv -> 1x1x1024 -> 1x1 conv -> 1x1x4xC -> Get Bounding boxes Using L2 loss -> (x1, y1, x2, y2)

Overfeat
6 Scaled images in the ratio ~1.4
Network resolution is 36
Image -> First 5 Layers of AlexNet (Modified) -> x256 Feature Maps -> 1x1x4096 -> 1x1x4096 -> outputs

Image    Feature Maps  Class scores (Softmax)  Bounding boxes (L2 loss)
245x245  5x5           1x1xC                   1x1x4xC
281x317  6x7           2x3xC                   2x3x4xC
317x389  7x9           3x5xC                   3x5x4xC
389x461  9x11          5x7xC                   5x7x4xC
425x497  10x11         6x7xC                   6x7x4xC
461x569  11x14         7x10xC                  7x10x4xC

Each bounding box is (x1, y1, x2, y2).

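The spatial output sizes in this table follow from sliding the 5x5 classifier window over the pooled feature map with stride 1, so each side shrinks by 4. A quick check, where the dictionary simply restates the feature map sizes listed on the slide:

```python
def spatial_output(pooled_hw, window=5):
    """Output size when a window x window filter slides with stride 1, no padding."""
    h, w = pooled_hw
    return (h - window + 1, w - window + 1)

pooled = {
    "245x245": (5, 5),
    "281x317": (6, 7),
    "317x389": (7, 9),
    "389x461": (9, 11),
    "425x497": (10, 11),
    "461x569": (11, 14),
}
outputs = {image: spatial_output(size) for image, size in pooled.items()}
```
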
Improving Resolution
Input Image 245x245 -> Output of Last Convolution layer 15x15
Pool output size: (W-F+2P)/S + 1

3x3 Pool, Stride = 3, Padding = 0: (15-3)/3 + 1 = 5 -> 5x5 -> FC layer (5x5) -> 1x1
3x3 Pool, Stride = 2, Padding = 0: (15-3)/2 + 1 = 7 -> 7x7 -> FC layer (5x5) -> 3x3

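The pooling arithmetic on this slide can be written as a one-line helper. The sketch applies the (W-F+2P)/S + 1 formula to the 15x15 map and then the 5x5 FC-as-conv window; the function name is an assumption for this example.

```python
def pool_output_size(w, f, s, p=0):
    """Output width for input width w, filter f, stride s, padding p."""
    return (w - f + 2 * p) // s + 1

stride3 = pool_output_size(15, 3, 3)  # pooled map with stride 3
stride2 = pool_output_size(15, 3, 2)  # pooled map with stride 2

# A 5x5 window sliding with stride 1 over each pooled map.
out3 = stride3 - 5 + 1
out2 = stride2 - 5 + 1
```

Reducing the pool stride from 3 to 2 turns the single 1x1 output into a 3x3 spatial output for the same image.
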
Overfeat
6 Scaled images in the ratio ~1.4
Network resolution is 36
Image -> First 5 Layers of AlexNet (Modified) -> x256 Feature Maps -> 1x1x4096 -> 1x1x4096 -> outputs

Image    Feature Maps  Class scores (Softmax)  Class scores (x3)  Bounding boxes (L2 loss)
245x245  5x5           1x1xC                   3x3xC              1x1x4xC
281x317  6x7           2x3xC                   6x9xC              2x3x4xC
317x389  7x9           3x5xC                   9x15xC             3x5x4xC
389x461  9x11          5x7xC                   15x21xC            5x7x4xC
425x497  10x11         6x7xC                   18x21xC            6x7x4xC
461x569  11x14         7x10xC                  21x30xC            7x10x4xC

Each bounding box is (x1, y1, x2, y2).

Problem of Multiple Detections
281x317

Non Max Suppression
Spatial Output
Softmax instead of SVM
Human detection as an example

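Non Max Suppression can be sketched as follows: sort detections by score, keep the best one, and drop any remaining box that overlaps a kept box too much. This is the common greedy variant, not code from the slides; the 0.5 IoU threshold and the example boxes and scores are assumptions chosen for illustration.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes with positive area."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of boxes to keep, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep

# Three overlapping detections of one object plus one separate detection.
boxes = [(210, 245, 590, 405), (220, 250, 600, 410),
         (200, 240, 580, 400), (700, 10, 790, 100)]
scores = [0.95, 0.90, 0.85, 0.60]
kept = nms(boxes, scores)
```

In practice this is usually run per class, so overlapping boxes of different classes do not suppress each other.
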
Results
Won the ImageNet Localization challenge in 2013.

Class outputs per scale: 1x1xC, 2x3xC, 3x5xC, 5x7xC, 6x7xC, 7x10xC
At x3 resolution: 3x3xC, 6x9xC, 9x15xC, 15x21xC, 18x21xC, 21x30xC

Rows  Cols  Locations
3     3     9
6     9     54
9     15    135
15    21    315
18    21    378
21    30    630
Total       1521
1521 x21 = 31941

In Overfeat, they use a Greedy Merge strategy, but NMS can be used in its place.
Greedy Merge is not commonly used, so I am skipping the discussion.

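The totals on this slide are a straightforward count of output locations across the six scales, multiplied by the 21 classes. A short check (the list restates the x3 output sizes from the slide):

```python
scales = [(3, 3), (6, 9), (9, 15), (15, 21), (18, 21), (21, 30)]
locations = [h * w for h, w in scales]
total_locations = sum(locations)
total_scores = total_locations * 21  # 21 classes, as used on the slides
```

This is the number of class scores produced for a single image, which is why a merging step such as NMS is needed afterwards.
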
FC as 1x1 convolutions – Model size remains same irrespective of size of image
- Result: significant reduction in model size.
281x317 -> conv layers (4/14MB) -> 6x7 x256 -> 5x5 Filters (x4096) -> 2x3 -> 1x1 Filters (x4096) -> 2x3 -> 1x1 Filters (x21) -> 2x3 x21
Filter annotations: 256*4096, 4096*4096, 4096*21

FC input  FC Output  Total Weights  Approx
6400      4096       26214400       26MB
4096      4096       16777216       16MB
4096      21         86016          86KB
Total                43077632       43MB

FC as dot product operations – model size increases for larger images
281x317 -> 6x7 x256 -> FC 10752 -> FC 4096 -> FC 4096 -> 21, repeated for each of the 6 output locations (x6)

FC input  FC Output  Total Weights (x6)  Approx
10752     4096       264241152           264MB
4096      4096       100663296           100MB
4096      21         516096              516KB
Total                365420544           365MB