
cogneethi.com

Evolution of Object Detection Networks


Overfeat


What is Object Detection


Classification Pipeline

Conv and Pool Layers (as feature extractors) -> Fully Connected Layers (for classification)

Feature Extractor (HOG/SIFT/CNN/etc.) -> Classifier (SVM / FC + Softmax / etc.) -> class scores: 0.8, 0.1, 0.02, 0.03, 0.02, 0.03

Classification

AlexNet/VGG: Conv and Pool Layers (feature maps, as feature extractors) -> Fully Connected Layers (for classification)

Get class scores using Softmax:

Class     Score
Cat       0.8
Dog       0.1
Rhino     0.02
Hippo     0.02
Elephant  0.02
Mouse     0.04

Image Credit - http://host.robots.ox.ac.uk/pascal/VOC/voc2012/examples/index.html
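
As a rough illustration of "Get class scores using Softmax", the sketch below turns raw scores into probabilities that sum to 1. The logit values are made up for the example and are not taken from AlexNet or VGG.

```python
import math

def softmax(logits):
    """Turn raw class scores (logits) into probabilities that sum to 1."""
    m = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

classes = ["Cat", "Dog", "Rhino", "Hippo", "Elephant", "Mouse"]
logits = [4.0, 1.9, 0.3, 0.3, 0.3, 1.0]  # hypothetical values, for illustration only

probs = softmax(logits)
best = classes[probs.index(max(probs))]  # class with the highest probability
```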



Ideas for Localization using ConvNets

Case #1 – Only one object per image

AlexNet/VGG with two outputs:
Get class scores using Softmax: Human, Car, Dog, Cat, Bicycle, etc.
Get bounding boxes using L2 loss: (x1, y1, x2, y2)

[Figure: bounding box labelled with corners (x1, y1) and (x2, y2), a point (x0, y0), width w and height h]

Bounding Box Regression Training

[Figure: 800 x 600 image with the expected box drawn]
Expected box: (x1, y1) = (200, 250), (x2, y2) = (600, 400)

            x1   y1   x2   y2   L2 Loss
Expected    200  250  600  400
Prediction  0    0    800  600  (200-0)^2 + (250-0)^2 + (600-800)^2 + (400-600)^2 = 182500
Prediction  100  150  700  450  (200-100)^2 + (250-150)^2 + (600-700)^2 + (400-450)^2 = 32500
Prediction  210  245  590  405  (200-210)^2 + (250-245)^2 + (600-590)^2 + (400-405)^2 = 250
Prediction  200  250  600  400  (200-200)^2 + (250-250)^2 + (600-600)^2 + (400-400)^2 = 0
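
The L2 loss column can be checked with a few lines of Python. This is only a sketch of the arithmetic on the slide (a sum of squared coordinate differences), not a training loop; the function name `l2_loss` is chosen here for illustration.

```python
def l2_loss(expected, predicted):
    """Sum of squared differences between two (x1, y1, x2, y2) boxes."""
    return sum((e - p) ** 2 for e, p in zip(expected, predicted))

expected = (200, 250, 600, 400)
predictions = [
    (0, 0, 800, 600),      # the whole image
    (100, 150, 700, 450),
    (210, 245, 590, 405),
    (200, 250, 600, 400),  # perfect prediction
]

losses = [l2_loss(expected, p) for p in predictions]
```

The loss shrinks as the predicted box approaches the expected box and reaches 0 for an exact match.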



BBox General Properties

Image Credits - see Notes

About Bounding Boxes

300 300

500 600


Ideas for Localization using ConvNets

AlexNet/VGG feeds two heads:
Classifier: get class scores using Softmax (C class scores)
BBox Regressor: get bounding boxes using L2 loss (x1, y1, x2, y2)

Combining Results

FC + Softmax gives the class confidences; the BBox regressor gives the coordinates.

Class   Conf  x1   y1   x2   y2
Person  0.02  380  200  430  400
Boat    0.95  210  245  590  405
TV      0.03  700  10   790  100
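
One way to read the table: pick the class with the highest confidence and report its box. The sketch below assumes this simple "arg max" rule, which the slide implies but does not spell out.

```python
# (confidence, (x1, y1, x2, y2)) per class, copied from the table
detections = {
    "Person": (0.02, (380, 200, 430, 400)),
    "Boat":   (0.95, (210, 245, 590, 405)),
    "TV":     (0.03, (700, 10, 790, 100)),
}

# Assumed rule: the most confident class wins, and its box is the result
best_class = max(detections, key=lambda c: detections[c][0])
confidence, box = detections[best_class]
```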

Overfeat


Ideas for Detection

Localization CNN -> Confidence scores, BBox

I know neither the number of objects nor the location of those objects.
Credits – See Description

Ideas for Detection – Sliding Window

I know neither the number of objects nor the location of those objects.

Each window -> Localization CNN -> Confidence scores, BBox

Ideas for Detection – Sliding Window + Image Pyramid

Sliding Window – Location
Image Pyramid - Scale

[Figure: image pyramid, labelled from smaller objects to larger objects]

Ideas for Detection using ConvNets

Crop + Resize with Sliding Window + Image Pyramid
Sliding Window – Location
Image Pyramid - Scale

AlexNet/VGG: Conv and Pool Layers (feature maps, as feature extractors), then:
Get class scores using Softmax
Get bounding boxes using L2 loss: (x1, y1, x2, y2)
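
A minimal sketch of the naive "crop with sliding window + image pyramid" idea: enumerate every window at every scale, so each crop can be sent through the network separately. The window size, stride, scale factor and image size below are hypothetical values picked for the example.

```python
def sliding_windows(width, height, win, stride):
    """Yield (x1, y1, x2, y2) for every win x win window that fits in the image."""
    for y in range(0, height - win + 1, stride):
        for x in range(0, width - win + 1, stride):
            yield (x, y, x + win, y + win)

def pyramid_sizes(width, height, scale, min_side):
    """Yield successively smaller (width, height) until a side drops below min_side."""
    while min(width, height) >= min_side:
        yield (width, height)
        width, height = int(width / scale), int(height / scale)

WIN, STRIDE, SCALE = 224, 32, 1.5   # hypothetical settings

sizes = list(pyramid_sizes(800, 600, SCALE, WIN))
crops = [(size, box)
         for size in sizes
         for box in sliding_windows(size[0], size[1], WIN, STRIDE)]
# Every entry in `crops` would need its own forward pass through the CNN.
```

Even with only three pyramid levels this gives a few hundred crops for a single image, which is the inefficiency the following slides address.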

ConvNets input size constraints


[Figure: Image -> Weights/Filter -> Feature Maps -> Pooled Feature Maps -> FV -> FC Layers, drawn for a smaller and a larger zero-padded input image]

ConvNets input size constraints – FC as Conv

[Figure: Image -> Weights/Filter -> Feature Maps -> Pool -> Pooled Feature Maps -> FV -> FC Layers, with FC implemented as Conv; the larger input produces an output with spatial extent (H, V)]

1. Does this make sense?
2. If so, what does this mean?

Receptive Field

Two 2x2 Pool layers, Stride = 2 each:
4x4 -> 2x2 -> 1x1
8x8 -> 4x4 -> 2x2

Every value in the output encodes information from some 4x4 patch of the image.
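
The 4x4 figure can be derived with the usual receptive-field recurrence. The sketch below assumes each layer is described only by its kernel size and stride (no padding or dilation); the function name is illustrative.

```python
def receptive_field(layers):
    """layers: list of (kernel, stride) pairs, input side first.

    Returns (receptive field size, effective stride) of one output value,
    both measured in input pixels.
    """
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump   # each layer widens the field by (k-1) input steps
        jump *= stride              # and multiplies the step between neighbouring outputs
    return rf, jump

two_pools = [(2, 2), (2, 2)]        # 2x2 Pool, Stride = 2, applied twice
rf, effective_stride = receptive_field(two_pools)
```

For the two pooling layers on the slide this gives a 4x4 receptive field and an effective stride of 4, which is why an 8x8 input yields a 2x2 output.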

ConvNets input size constraints


[Figure: Image -> Weights/Filter -> Feature Maps -> Pooled Feature Maps -> FC Layers, shown for a smaller and a larger zero-padded input. The same Localization CNN is applied to both; the larger input produces a spatial output (H, V).]

1. Does this make sense? -> Yes.
2. If so, what does this mean? -> Each value of the spatial output represents the computation on a different portion of the image.
Spatial Output as Sliding Window

[Figure: image regions -> CNN -> spatial output]

ConvNets and Sliding Window Efficiency

Localization CNN -> Confidence scores, BBox

[Figure: the same Localization CNN applied to a larger image, producing a spatial output (H, V)]

ConvNets and Sliding Window Efficiency

3x3 filter used in all three examples, every row = 1 0 -1

Example 1: 8x8 image, every row = 0 0 255 255 0 0 255 255
6x6 output, every row = -765 -765 765 765 -765 -765

Example 2: 10x10 image, every row = 0 0 255 255 0 0 255 255 0 0
8x8 output, every row = -765 -765 765 765 -765 -765 765 765

Example 3: 8x8 image, every row = 255 255 0 0 255 255 0 0
6x6 output, every row = 765 765 -765 -765 765 765
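
The three examples can be reproduced in plain Python. The sketch below also checks the point of the slide: the output for the 10x10 image already contains the outputs of its 8x8 windows, so convolving each window separately repeats work. `conv2d_valid` is a helper written for this example (stride 1, no padding, no kernel flip).

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image (stride 1, no padding, no flip)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]

kernel = [[1, 0, -1] for _ in range(3)]
big = [[0, 0, 255, 255, 0, 0, 255, 255, 0, 0] for _ in range(10)]   # Example 2

left_window = [row[0:8] for row in big[0:8]]      # same pixels as Example 1
shifted_window = [row[2:10] for row in big[0:8]]  # same pixels as Example 3

out_big = conv2d_valid(big, kernel)               # 8x8
out_left = conv2d_valid(left_window, kernel)      # 6x6
out_shifted = conv2d_valid(shifted_window, kernel)

# Each window's output is just a slice of the big output.
left_slice = [row[0:6] for row in out_big[0:6]]
shifted_slice = [row[2:8] for row in out_big[0:6]]
```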
Spatial Output for Image Pyramids

[Figure: images at different pyramid scales and their spatial outputs (H, V)]

Ideas for Detection using ConvNets

Sliding Window – Location
Image Pyramid - Scale

AlexNet/VGG: Conv and Pool Layers (feature maps, as feature extractors), then:
Get class scores using Softmax
Get bounding boxes using L2 loss: (x1, y1, x2, y2)



Overfeat
Sliding Window Crop FC as Conv (No input size constraint) + Spatial Output + Image Pyramid

Resolution = 36
How to modify the localization framework to convert FC to Conv?

Image size  Spatial output
245x245     1x1
281x317     2x3
317x389     3x5
389x461     5x7
425x497     6x7
461x569     7x10

Larger images in the pyramid pick up smaller objects; smaller images pick up larger objects.
If you want to detect even smaller objects, use even bigger image pyramids. The trade-off is an increase in computation.

Overfeat - Classification

245x245 image -> First 5 layers of AlexNet (modified) -> 5x5 feature map (x256)
-> 5x5 conv -> 1x1x4096
-> 1x1 conv -> 1x1x4096
-> 1x1 conv -> 1x1xC -> Get class scores using Softmax

OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks – Sermanet et al
Images Credit – Overfeat paper & https://towardsdatascience.com/object-localization-in-overfeat-5bb2f7328b62

Convolution
Feature Map Filter Output


Overfeat
Fully Connected layer implemented as a convolution layer

245x245 image -> First 5 layers of AlexNet (modified), Conv + Pool layers -> 5x5 feature map
-> 5x5 filter -> 1x1 feature map output
-> 1x1 filter -> 1x1 feature map output
-> 1x1 filter -> 1x1 final output, for 1 class

What about other classes?

Overfeat
Fully Connected layer implemented as a convolution layer

245x245 image -> 5x5 feature map -> 1x1 final output
281x317 image -> First 5 layers of AlexNet (modified), Conv + Pool layers -> 6x7 feature map
-> 5x5 filter -> 2x3 feature map output
-> 1x1 filter -> 2x3 feature map output
-> 1x1 filter -> 2x3 final spatial output, for 1 class

The dimensions of the filters should remain the same. That's the whole point: you want your network to work irrespective of the image size.

What about feature map depth? Isn't it 256 or 512?
What about other classes?

N layer Conv – M Feature Maps

See demo here - http://cs231n.github.io/assets/conv-demo/index.html

Overfeat
Fully Connected layer implemented as a convolution layer

245x245 image -> First 5 layers of AlexNet (modified), Conv + Pool layers -> 5x5 feature map (x256)
-> 4096 filters of 5x5 (x256) -> 1x1 (x4096)
-> 4096 filters of 1x1 (x4096) -> 1x1 (x4096)
-> C filters of 1x1 (x4096) -> 1x1 (xC) final output, for C classes
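
A small sketch of why an FC layer can be implemented as a convolution: with a 5x5 feature map, a 5x5 filter covering the full depth computes exactly the same dot product as one FC neuron, and on a 6x7 feature map the same filter gives a 2x3 output. A depth of 4 stands in for 256 to keep the example fast; the helper names and random integer values are illustrative only.

```python
import random

def conv_valid(fmap, filt):
    """fmap: H x W x D nested lists, filt: kh x kw x D. Stride 1, no padding."""
    kh, kw, depth = len(filt), len(filt[0]), len(filt[0][0])
    out_h, out_w = len(fmap) - kh + 1, len(fmap[0]) - kw + 1
    return [[sum(fmap[r + i][c + j][d] * filt[i][j][d]
                 for i in range(kh) for j in range(kw) for d in range(depth))
             for c in range(out_w)]
            for r in range(out_h)]

def fc_unit(fmap, weights):
    """One FC neuron: flatten input and weights, then take the dot product."""
    x = [v for row in fmap for cell in row for v in cell]
    w = [v for row in weights for cell in row for v in cell]
    return sum(a * b for a, b in zip(x, w))

random.seed(0)
DEPTH = 4  # stand-in for the 256 channels on the slide

def rand_map(h, w):
    return [[[random.randint(-3, 3) for _ in range(DEPTH)]
             for _ in range(w)] for _ in range(h)]

filt = rand_map(5, 5)     # one of the "4096" filters
small = rand_map(5, 5)    # feature map for the 245x245 image
large = rand_map(6, 7)    # feature map for the 281x317 image

out_small = conv_valid(small, filt)   # 1x1, identical to the FC neuron
out_large = conv_valid(large, filt)   # 2x3 spatial output, same weights
```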

Overfeat
Fully Connected layer implemented as a convolution layer

245x245 image -> 5x5 feature map -> 1x1 outputs
281x317 image -> First 5 layers of AlexNet (modified), Conv + Pool layers -> 6x7 feature map (x256)
-> 4096 filters of 5x5 (x256) -> 2x3 (x4096)
-> 4096 filters of 1x1 (x4096) -> 2x3 (x4096)
-> C filters of 1x1 (x4096) -> 2x3 (xC)

Since the height and width of these filters are 1x1, they are also referred to as 1x1 convolutions.

Model Size


Model Sizes - AlexNet

First conv layer: 11*11*3*96 = 34,848 weights (~34KB)
First FC layer: 13*13*256*4096 = 177,209,344 weights (~177MB)

Input  Conv Filter  Output  Total Weights  Approx
3      11 x 11      96      34848          34KB
96     5 x 5        256     614400         600KB
256    3 x 3        384     884736         900KB
384    3 x 3        384     1327104        1.3MB
384    3 x 3        256     884736         900KB
Total                       3745824        3.7MB

5 Conv Layers

FC input  FC Output  Total Weights  Approx
43264     4096       177209344      177MB
4096      4096       16777216       16MB
4096      1000       4096000        4MB
Total                198082560      ~198MB

Model Sizes - VGGNet

Input  Conv Filter  Output  Total Weights  Approx
3      3 x 3        64      1728           1.7KB
64     3 x 3        64      36864          36KB
64     3 x 3        128     73728          73KB
128    3 x 3        128     147456         150KB
128    3 x 3        256     294912         300KB
256    3 x 3        256     589824         600KB
256    3 x 3        256     589824         600KB
256    3 x 3        512     1179648        1.2MB
512    3 x 3        512     2359296        2.4MB
512    3 x 3        512     2359296        2.4MB
512    3 x 3        512     2359296        2.4MB
512    3 x 3        512     2359296        2.4MB
512    3 x 3        512     2359296        2.4MB
Total                       14710464       14MB

13 Conv Layers

FC input  FC Output  Total Weights  Approx
25088     4096       102760448      102MB
4096      4096       16777216       16MB
4096      1000       4096000        4MB
Total                123633664      123MB

• You can increase the depth of your CNN without significantly increasing model size.
• But even for a 3 layer FC Network, you need significant memory for weights.
• How can we do Classification/Bbox regression without significantly increasing model size?
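
The totals in the AlexNet and VGGNet tables can be recomputed as below. The sketch counts weights only (no biases), as the slides do, and uses the layer shapes exactly as listed in the tables; the helper names are chosen for this example.

```python
def conv_weights(in_channels, kernel, out_channels):
    """Weights in a conv layer with square kernels, ignoring biases."""
    return kernel * kernel * in_channels * out_channels

def fc_weights(n_in, n_out):
    """Weights in a fully connected layer, ignoring biases."""
    return n_in * n_out

# (input channels, kernel size, output channels), as listed on the slides
alexnet_conv = [(3, 11, 96), (96, 5, 256), (256, 3, 384), (384, 3, 384), (384, 3, 256)]
alexnet_fc = [(43264, 4096), (4096, 4096), (4096, 1000)]

vgg_conv = [(3, 3, 64), (64, 3, 64), (64, 3, 128), (128, 3, 128),
            (128, 3, 256), (256, 3, 256), (256, 3, 256), (256, 3, 512)] + [(512, 3, 512)] * 5
vgg_fc = [(25088, 4096), (4096, 4096), (4096, 1000)]

alexnet_conv_total = sum(conv_weights(*layer) for layer in alexnet_conv)
alexnet_fc_total = sum(fc_weights(*layer) for layer in alexnet_fc)
vgg_conv_total = sum(conv_weights(*layer) for layer in vgg_conv)
vgg_fc_total = sum(fc_weights(*layer) for layer in vgg_fc)
```

In both networks the three FC layers hold far more weights than all the conv layers together.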

Model Sizes

245x245 image -> First 5 layers of AlexNet (modified), 4/14MB -> 5x5 feature map (x256) = 6400 values
FC layers: 6400 -> 4096 -> 4096 -> C

FC input  FC Output  Total Weights  Approx
6400      4096       26214400       26MB
4096      4096       16777216       16MB
4096      21         86016          86KB
Total                43077632       43MB

1x1 Conv

245x245 image -> First 5 layers of AlexNet (modified), 4/14MB -> 5x5 feature map (x256)
-> 4096 filters of 5x5 (x256) -> 1x1 (x4096)
-> 4096 filters of 1x1 (x4096) -> 1x1 (x4096)
-> 21 filters of 1x1 (x4096) -> 1x1 (x21)

Equivalent FC layers: 6400 -> 4096 -> 4096 -> 21

FC input  FC Output  Total Weights  Approx
6400      4096       26214400       26MB
4096      4096       16777216       16MB
4096      21         86016          86KB
Total                43077632       43MB

1x1 Conv

• Using 1x1 Conv in FC layers significantly reduces the number of weights needed.

281x317 image -> conv layers (4/14MB) -> 6x7 feature map (x256) -> 2x3 spatial output (x21), i.e. 6 output positions

FC as Conv: the same filters as for the 245x245 image are reused at all 6 positions.

FC input  FC Output  Total Weights  Approx
6400      4096       26214400       26MB
4096      4096       16777216       16MB
4096      21         86016          86KB
Total                43077632       43MB

FC as dot products: the input is 6x7x256 = 10752 values, and each of the 6 output positions needs its own weights (x6).

FC input  FC Output  Total Weights (x6)  Approx
10752     4096       264241152           264MB
4096      4096       100663296           100MB
4096      21         516096              516KB
Total                365420544           365MB
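
The two totals can be reproduced with the sketch below. It follows the slide's accounting: as convolutions, the head always has the same filters; as plain FC layers on the 6x7 feature map, the weights are counted once per output position (x6). The function names are illustrative.

```python
DEPTH, HIDDEN, CLASSES = 256, 4096, 21

def head_weights_as_conv():
    """5x5 conv followed by two 1x1 convs; independent of the image size."""
    return 5 * 5 * DEPTH * HIDDEN + HIDDEN * HIDDEN + HIDDEN * CLASSES

def head_weights_as_fc(fmap_h, fmap_w, positions):
    """FC layers on the flattened feature map, one set per output position."""
    n_in = fmap_h * fmap_w * DEPTH
    return positions * (n_in * HIDDEN + HIDDEN * HIDDEN + HIDDEN * CLASSES)

conv_total = head_weights_as_conv()            # any input size
fc_small = head_weights_as_fc(5, 5, 1)         # 245x245 image, 1x1 output
fc_large = head_weights_as_fc(6, 7, 2 * 3)     # 281x317 image, 2x3 output
```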

Overfeat

C = number of classes + 1 (background)

245x245 image -> First 5 layers of AlexNet (modified) -> 5x5 feature maps (x256)

Classifier head: 5x5 conv -> 1x1x4096 -> 1x1 conv -> 1x1x4096 -> 1x1 conv -> 1x1xC -> Get class scores using Softmax
Regressor head: 5x5 conv -> 1x1x4096 -> 1x1 conv -> 1x1x1024 -> 1x1 conv -> 1x1x4xC -> Get bounding boxes (x1, y1, x2, y2) using L2 loss

Overfeat

6 scaled images in the ratio ~1.4. Network resolution is 36.
First 5 layers of AlexNet (modified) -> feature maps (x256) -> 1x1x4096 -> 1x1x4096 -> class scores using Softmax and bounding boxes (x1, y1, x2, y2) using L2 loss.

Image size  Feature maps  Class scores  Bounding boxes
245x245     5x5           1x1xC         1x1x4xC
281x317     6x7           2x3xC         2x3x4xC
317x389     7x9           3x5xC         3x5x4xC
389x461     9x11          5x7xC         5x7x4xC
425x497     10x11         6x7xC         6x7x4xC
461x569     11x14         7x10xC        7x10x4xC

Improving Resolution

Input image: 245x245
Output of last convolution layer: 15x15
Output size of a pool layer: (W-F+2P)/S + 1

3x3 Pool, Stride = 3, Padding = 0: (15-3)/3 + 1 = 5, so 15x15 -> 5x5; the 5x5 FC layer then gives a 1x1 output.
3x3 Pool, Stride = 2, Padding = 0: (15-3)/2 + 1 = 7, so 15x15 -> 7x7; the 5x5 FC layer then gives a 3x3 output.
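
The arithmetic above, written as a small helper. This is only the (W-F+2P)/S + 1 formula from the slide applied twice (pooling, then the 5x5 FC-as-conv layer with stride 1); the function name is chosen for this example.

```python
def output_size(w, f, s, p=0):
    """Output width for input width w, filter size f, stride s, padding p."""
    return (w - f + 2 * p) // s + 1

pooled_stride3 = output_size(15, 3, 3)            # 3x3 Pool, Stride = 3
pooled_stride2 = output_size(15, 3, 2)            # 3x3 Pool, Stride = 2

final_stride3 = output_size(pooled_stride3, 5, 1)  # 5x5 FC layer applied as a conv
final_stride2 = output_size(pooled_stride2, 5, 1)
```

Reducing the pooling stride from 3 to 2 turns the single 1x1 output into a 3x3 spatial output for the same image.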

Overfeat

6 scaled images in the ratio ~1.4. Network resolution is 36.
The last column shows the class score output after improving the resolution (x3 in each dimension).

Image size  Feature maps  Class scores  Bounding boxes  Class scores (x3)
245x245     5x5           1x1xC         1x1x4xC         3x3xC
281x317     6x7           2x3xC         2x3x4xC         6x9xC
317x389     7x9           3x5xC         3x5x4xC         9x15xC
389x461     9x11          5x7xC         5x7x4xC         15x21xC
425x497     10x11         6x7xC         6x7x4xC         18x21xC
461x569     11x14         7x10xC        7x10x4xC        21x30xC

Problem of Multiple Detections

[Figure: multiple detections on the 281x317 image]

Non Max Suppression

[Figure: spatial output grid, with human detection as an example]

Softmax instead of SVM
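
A minimal sketch of greedy Non Max Suppression for one class, assuming boxes in (x1, y1, x2, y2) form and an IoU threshold of 0.5. The threshold and the example detections are hypothetical, and this is the generic NMS procedure, not the merge strategy used in the Overfeat paper.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(detections, iou_threshold=0.5):
    """detections: list of (score, box). Keep the best box, drop its overlaps, repeat."""
    remaining = sorted(detections, key=lambda d: d[0], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        remaining = [d for d in remaining if iou(best[1], d[1]) < iou_threshold]
    return kept

detections = [                       # hypothetical scores and boxes
    (0.95, (210, 245, 590, 405)),
    (0.90, (220, 250, 600, 410)),    # overlaps the first box heavily
    (0.60, (200, 240, 580, 400)),    # overlaps the first box heavily
    (0.80, (700, 10, 790, 100)),     # a separate object
]
kept = nms(detections)
```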

Results

Won the ImageNet Localization challenge in 2013.

Spatial outputs per scale: 1x1xC, 2x3xC, 3x5xC, 5x7xC, 6x7xC, 7x10xC
With improved resolution: 3x3xC, 6x9xC, 9x15xC, 15x21xC, 18x21xC, 21x30xC

Rows  Cols  Positions
3     3     9
6     9     54
9     15    135
15    21    315
18    21    378
21    30    630
Total       1521

1521 positions x 21 classes = 31941 outputs

In Overfeat, they use a Greedy Merge strategy. But NMS can be used in its place.
Greedy Merge is not commonly used, so I am skipping the discussion.


Overfeat
Sliding Window Crop FC as Conv (No input size constraint) + Spatial Output + Image Pyramid
Effective Stride = 36


FC as 1x1 convolutions – model size remains the same irrespective of the size of the image
- Result: significant reduction in model size.

281x317 image -> conv layers (4/14MB) -> 6x7 feature map (x256) -> 2x3 spatial output (x21)

FC input  FC Output  Total Weights  Approx
6400      4096       26214400       26MB
4096      4096       16777216       16MB
4096      21         86016          86KB
Total                43077632       43MB

FC as dot product operations – model size increases for larger images

FC input  FC Output  Total Weights (x6)  Approx
10752     4096       264241152           264MB
4096      4096       100663296           100MB
4096      21         516096              516KB
Total                365420544           365MB
