
SSD: Single Shot MultiBox Detector

Wei Liu(1), Dragomir Anguelov(2), Dumitru Erhan(3), Christian Szegedy(3),
Scott Reed(4), Cheng-Yang Fu(1), Alexander C. Berg(1)

UNC Chapel Hill(1), Zoox Inc.(2), Google Inc.(3), University of Michigan(4)
[Scatter plot: VOC2007 test mAP vs. speed (fps). All detectors use VGGNet pretrained on ImageNet, batch_size = 1 on a Titan X (Pascal).]

  R-CNN, Girshick 2014:       66% mAP / 0.02 fps
  Fast R-CNN, Girshick 2015:  70% mAP / 0.4 fps
  Faster R-CNN, Ren 2015:     73% mAP / 7 fps
  YOLO, Redmon 2016:          66% mAP / 21 fps
  SSD300:                     74% mAP / 46 fps (77% with the improved data augmentation shown later)
  SSD512:                     77% mAP / 19 fps (80% with the improved data augmentation)

Callouts on the plot: SSD300 is 6.6x faster than Faster R-CNN and 11% better in mAP than YOLO. Two-stage detectors (proposal + post-classify boxes) sit on the slow side of the plot; single-shot detectors on the fast side.
Bounding Box Prediction

Classical sliding windows:
  A binary question at every window position ("Is it a cat? No").
  Discretize the box space densely.

SSD and other deep approaches:
  Multiclass scores for each box (e.g. cat: 0.8, dog: 0.1; dog: 0.4, cat: 0.2).
  Discretize the box space more coarsely.
  Refine the coordinates of each box.
SSD Output Layer

A ConvNet produces a feature map. A small (e.g. 3x3) conv kernel is applied to the feature map; at each location it outputs, for every default box anchored there, box regression offsets and multiclass probabilities.
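As an illustration, here is a minimal NumPy sketch of the shape bookkeeping for one such prediction head. The channel layout (offsets first, then class scores) and the helper names are assumptions for exposition, not the released Caffe implementation.

```python
import numpy as np

# Sketch of one SSD prediction head: a 3x3 conv over an H x W feature map
# emits, at every location, k default boxes, each with 4 box-regression
# offsets and C class scores.
def head_output_shape(H, W, k, num_classes):
    return (k * (num_classes + 4), H, W)   # conv output: (channels, height, width)

def split_head_output(raw, k, num_classes):
    """raw: (k*(C+4), H, W) tensor produced by the 3x3 conv (layout assumed)."""
    _, H, W = raw.shape
    raw = raw.reshape(k, num_classes + 4, H, W)
    offsets = raw[:, :4]    # (k, 4, H, W) box regression
    scores = raw[:, 4:]     # (k, C, H, W) multiclass scores
    return offsets, scores

# Example: a 19x19 map with 6 default boxes per location and 21 VOC classes.
raw = np.zeros(head_output_shape(19, 19, 6, 21))
offsets, scores = split_head_output(raw, 6, 21)
print(offsets.shape, scores.shape)   # (6, 4, 19, 19) (6, 21, 19, 19)
```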
SSD Training

• Match default boxes to ground truth boxes to determine true/false positives.
• Loss = SmoothL1(box params) + Softmax(class probs): a Smooth L1 loss on the box regression outputs plus a softmax loss on the multiclass probabilities, both computed on the same ConvNet feature map (a sketch follows below).
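A minimal NumPy sketch of this per-image loss, assuming the matching step has already produced the positive (matched) boxes and the set kept after hard negative mining; the alpha weight and normalization by the number of positives follow the paper's formulation.

```python
import numpy as np

def smooth_l1(x):
    """Elementwise Smooth L1 (Huber with delta = 1)."""
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * x ** 2, ax - 0.5)

def softmax_ce(logits, label):
    """Softmax cross-entropy for a single box."""
    z = logits - logits.max()
    logp = z - np.log(np.exp(z).sum())
    return -logp[label]

def ssd_loss(pred_offsets, gt_offsets, pred_logits, labels, alpha=1.0):
    """
    pred_offsets, gt_offsets: (N, 4) encoded box params for matched (positive) boxes
    pred_logits:              (M, C) class scores for boxes kept after hard negative mining
    labels:                   (M,)   class indices, 0 = background for negatives
    """
    n_pos = max(len(pred_offsets), 1)
    loc = smooth_l1(pred_offsets - gt_offsets).sum()
    conf = sum(softmax_ce(l, y) for l, y in zip(pred_logits, labels))
    return (conf + alpha * loc) / n_pos
```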
Related Work

MultiBox [Erhan et al. CVPR14]: fully connected layers predict P(objectness) and offsets for K boxes; a second stage post-classifies the boxes.

YOLO [Redmon et al. CVPR16]: fully connected layers predict multiclass probabilities and offsets for K boxes in a single stage.

Faster R-CNN [Ren et al. NIPS15]: convolutional layers predict P(objectness) and box offsets; a second stage post-classifies the boxes.

SSD: convolutional layers directly predict multiclass probabilities and box offsets, with no post-classification stage.
Contribution #1: Multi-Scale Feature Maps

The ConvNet feature map feeds a prediction head (box regression + multiclass scores); a stride-2 convolution then produces a smaller feature map, which feeds another prediction head, and so on. Every scale gets its own box regression and multiclass score outputs.
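As a quick check of how the per-scale heads add up for SSD300, the grid sizes and boxes per location below are taken from the architecture slide later in the deck (4 boxes on the finest and two coarsest maps, 6 elsewhere):

```python
# SSD300 default-box count: (grid side, default boxes per location)
# for each of the six prediction layers, as drawn on the architecture slide.
layers = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]
total = sum(side * side * k for side, k in layers)
print(total)   # 8732 = 5776 + 2166 + 600 + 150 + 36 + 4
```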
Multi-Scale Feature Maps

SSD places default boxes on feature maps of several resolutions (e.g. an 8x8 and a 4x4 feature map), vs. Faster R-CNN's objectness proposals (Ren 2015), which are predicted from a single 8x8 feature map.
Multi-Scale Feature Maps Experiment

Prediction source layers from:                 mAP (use boundary boxes?)   # Boxes
38x38   19x19   10x10   5x5   3x3   1x1          Yes        No
  ✓       ✓       ✓      ✓     ✓     ✓          74.3       63.4            8732
  ✓       ✓       ✓                             70.7       69.2            9864
          ✓                                     62.4       64.0            8664
Contribution #2: Splitting the Region Space

At each location of each ConvNet feature map, the convolution predicts several default boxes of different aspect ratios.

SSD300 ablation:
  include {1/2, 2} boxes?           ✓      ✓
  include {1/3, 3} boxes?                  ✓
  number of boxes          3880   7760   8732
  VOC2007 test mAP         71.6   73.7   74.3

Using the 38x38 feature map (conv4_3): +2.5 mAP
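A minimal sketch of generating the default boxes for one layer with these aspect ratios, plus the extra ratio-1 box at an intermediate scale as in the SSD paper; the scale values and function name here are illustrative assumptions.

```python
import numpy as np

def default_boxes(fmap_size, scale, next_scale, ratios=(1.0, 2.0, 0.5, 3.0, 1/3)):
    """
    Default boxes for one feature map, in normalized [0, 1] coordinates.
    fmap_size:  feature map side length (e.g. 19)
    scale:      box scale for this layer (fraction of image size)
    next_scale: scale of the next layer, used for the extra ratio-1 box
    """
    boxes = []
    for i in range(fmap_size):
        for j in range(fmap_size):
            cx, cy = (j + 0.5) / fmap_size, (i + 0.5) / fmap_size
            for r in ratios:
                boxes.append([cx, cy, scale * np.sqrt(r), scale / np.sqrt(r)])
            # extra square box at an intermediate scale (aspect ratio 1)
            s = np.sqrt(scale * next_scale)
            boxes.append([cx, cy, s, s])
    return np.array(boxes)   # (fmap_size*fmap_size*6, 4) as (cx, cy, w, h)

print(default_boxes(19, 0.2, 0.37).shape)   # (2166, 4) -> 19 * 19 * 6 boxes
```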
Why So Many Default Boxes?

                    Faster R-CNN    YOLO      SSD300    SSD512
# Default Boxes         6000          98        8732     24564
Resolution           1000x600     448x448     300x300   512x512

• A SmoothL1 or L2 loss on box shape averages among likely hypotheses, so a single regressor ends up between the ground-truth modes (GT vs. DETECTION in the figure).
• You need enough default boxes (discrete bins) to do accurate regression in each.
• This is a general principle for regressing complex continuous outputs with deep nets (a short worked example follows).
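As a one-line worked example (my illustration, not from the slides): with an L2 loss and a single regressor facing two equally likely ground-truth widths, the optimum is their average, a box matching neither hypothesis; giving each hypothesis its own default box lets each regressor stay near one mode.

```latex
w^{\star} \;=\; \arg\min_{w} \big[(w - w_1)^2 + (w - w_2)^2\big] \;=\; \frac{w_1 + w_2}{2}
```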
Handling Many Default Boxes

• Matching ground truth (GT) and default boxes:
  ‣ Match each GT box to the closest default box.
  ‣ Also match each GT box to all unassigned default boxes with IoU > 0.5.
  (Matched default boxes become true positives, TP; the rest are false positives, FP.)

• Hard negative mining:
  ‣ Training is unbalanced: 1-30 TP vs. 8k-25k FP per image.
  ‣ Keep the TP:FP ratio fixed (1:3), using the worst-misclassified FPs.

A code sketch of both steps follows below.
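A minimal NumPy sketch of the matching and hard-negative-mining steps just described; boxes are assumed to be corner-format arrays, and the function names are illustrative rather than the released code.

```python
import numpy as np

def iou(box, boxes):
    """IoU of one box with an array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def match(gt_boxes, default_boxes, iou_thresh=0.5):
    """Return, for each default box, the index of its matched GT box (-1 = negative)."""
    assign = -np.ones(len(default_boxes), dtype=int)
    for g, gt in enumerate(gt_boxes):
        overlaps = iou(gt, default_boxes)
        assign[np.argmax(overlaps)] = g                       # closest default box
        assign[(overlaps > iou_thresh) & (assign == -1)] = g  # plus all with IoU > 0.5
    return assign

def hard_negative_mining(conf_loss, assign, ratio=3):
    """Keep all positives and the worst-misclassified negatives at a 1:ratio mix."""
    pos = assign >= 0
    neg_idx = np.where(~pos)[0]
    worst = neg_idx[np.argsort(-conf_loss[neg_idx])][:ratio * pos.sum()]
    keep = pos.copy()
    keep[worst] = True
    return keep   # boolean mask over default boxes to include in the class loss
```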
SSD Architecture

[Architecture diagram: a 300x300 input image passes through VGG16 followed by extra convolutional feature maps of sizes 38x38, 19x19, 10x10, 5x5, 3x3 and 1x1. Each feature map feeds a classifier, a 3x3 conv with k x (Classes + 4) output channels (k = 4 or 6 default boxes per location), giving 8732 detections per class that are filtered by non-maximum suppression. SSD300: 74.3 mAP at 46 FPS.]
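For completeness, a standard greedy non-maximum suppression routine like the one at the end of the pipeline; the IoU threshold value here is illustrative, not taken from the slides.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.45):
    """Greedy non-maximum suppression. boxes: (N, 4) as (x1, y1, x2, y2)."""
    order = np.argsort(-scores)
    keep = []
    while order.size:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        # IoU of the top-scoring box with the remaining boxes
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
        ious = inter / (area[i] + area[rest] - inter)
        order = rest[ious <= iou_thresh]
    return keep
```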
Contribution #3: The Devil is in the Details
Data Augmentation

data augmentation                    SSD300
horizontal flip                        ✓      ✓
random crop & color distortion                ✓
VOC2007 test mAP                      65.5   74.3
Data Augmentation

Random expansion creates more small training examples.

data augmentation                    SSD300
horizontal flip                        ✓      ✓      ✓
random crop & color distortion                ✓      ✓
random expansion                                     ✓
VOC2007 test mAP                      65.5   74.3   77.2
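A minimal sketch of the random expansion ("zoom out") step: the image is placed at a random position on a larger canvas so objects become relatively smaller. The expansion range and mean-pixel fill value here are illustrative assumptions.

```python
import numpy as np

def random_expand(image, boxes, max_ratio=4.0, fill=(104, 117, 123)):
    """
    image: (H, W, 3) uint8, boxes: (N, 4) pixel coords (x1, y1, x2, y2).
    Returns the expanded canvas and the shifted boxes.
    """
    h, w, _ = image.shape
    ratio = np.random.uniform(1.0, max_ratio)
    H, W = int(h * ratio), int(w * ratio)
    top = np.random.randint(0, H - h + 1)
    left = np.random.randint(0, W - w + 1)
    canvas = np.empty((H, W, 3), dtype=image.dtype)
    canvas[...] = fill                              # assumed mean-pixel fill
    canvas[top:top + h, left:left + w] = image      # paste original image
    shifted = boxes + np.array([left, top, left, top])
    return canvas, shifted
```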
Results on VOC2007 test

Method                  mAP            FPS   batch size   # Boxes   Input resolution
Faster R-CNN (VGG16)    73.2             7       1          ~6000      ~1000x600
Fast YOLO               52.7           155       1             98        448x448
YOLO (VGG16)            66.4            21       1             98        448x448
SSD300                  74.3 (77.2*)    46       1           8732        300x300
SSD512                  76.8 (79.8*)    19       1          24564        512x512
SSD300                  74.3 (77.2*)    59       8           8732        300x300
SSD512                  76.8 (79.8*)    22       8          24564        512x512

(* = with the improved data augmentation.) Callouts on the slide highlight SSD300's 6.6x speedup over Faster R-CNN and a roughly 10% relative mAP gain for SSD512.
Results on More Datasets

Method         VOC2007 test   VOC2012 test   MS COCO test-dev   ILSVRC2014 val2
Fast R-CNN          70.0           68.4            19.7               N/A
Faster R-CNN        73.2           70.4            21.9               N/A
YOLO                63.4           57.9             N/A               N/A
SSD300              74.3           72.4            23.2               43.4
SSD300*             77.2           75.8            25.1               N/A
SSD512              76.8           74.9            26.8               46.4
SSD512*             79.8           78.5            28.8               N/A

(* = with the improved data augmentation.)
COCO Bounding Box Precision

mAP @ IoU       0.5    0.75   0.5:0.95
Faster R-CNN    45.3   23.5    24.2
SSD512*         48.5   30.3    28.8
gain            +3.2   +6.8    +4.6
Future Work

• Object detection + pose estimation [Poirson et al., 3DV 2016]: instead of a two-stage pipeline that detects, crops, and then runs a separate pose network (resampling the image several times), a single forward pass of convolutions detects the object and its pose, so the image is never resampled and the detection and pose computation are shared.
• Single shot 3D bounding box detection
• Joint object detection + tracking model
Check out the code/models

https://github.com/weiliu89/caffe/tree/ssd
Thank you!
Come by our poster O-1A-02
