Advanced Deep Learning Based Object Detection Methods

Improving Object Detection With One Line of Code
● Non-Maximum Suppression (NMS) is a greedy process.
○ It worked well enough in 2007, but it no longer does.
● High-scoring detections can be suppressed just like low-scoring ones.
○ Overlap with a stronger detection is the only criterion.
● Should one detection completely suppress another, or simply reduce its confidence?
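● NMS: $s_i \leftarrow 0$ if $\mathrm{IoU}(M, b_i) \ge N_t$, otherwise $s_i$ is unchanged ($M$ is the current top-scoring detection, $b_i$ another detection with score $s_i$, and $N_t$ the overlap threshold).
● Linear Soft-NMS: $s_i \leftarrow s_i \, (1 - \mathrm{IoU}(M, b_i))$ if $\mathrm{IoU}(M, b_i) \ge N_t$, otherwise $s_i$ is unchanged.
○ Linear Soft-NMS is not continuous in the overlap: a sudden penalty is applied when the NMS threshold is reached.
○ Instead, we can use a continuous function.
● Gaussian Soft-NMS: $s_i \leftarrow s_i \, e^{-\mathrm{IoU}(M, b_i)^2 / \sigma}$ for every remaining detection $b_i$.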
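The three rescoring rules (NMS, Linear Soft-NMS, Gaussian Soft-NMS) can be compared in one small routine. This is a minimal sketch in plain Python, not the authors' implementation: the names `iou`, `decay` and `soft_nms`, the `(x1, y1, x2, y2)` box format, and the default values of `iou_thresh`, `sigma` and `min_score` are all assumptions made for illustration.

```python
import math

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def decay(overlap, method, iou_thresh, sigma):
    """Factor applied to a score, given its overlap with the selected box."""
    if method == "hard":      # classic NMS: drop the score to zero
        return 0.0 if overlap >= iou_thresh else 1.0
    if method == "linear":    # linear Soft-NMS: sudden penalty at the threshold
        return 1.0 - overlap if overlap >= iou_thresh else 1.0
    if method == "gaussian":  # Gaussian Soft-NMS: continuous in the overlap
        return math.exp(-(overlap ** 2) / sigma)
    raise ValueError(f"unknown method: {method}")

def soft_nms(boxes, scores, method="gaussian", iou_thresh=0.3, sigma=0.5,
             min_score=0.001):
    """Return a list of (index, rescored_score) in order of selection."""
    scores = list(scores)
    remaining = list(range(len(boxes)))
    kept = []
    while remaining:
        # Greedy step: pick the highest-scoring remaining detection.
        best = max(remaining, key=lambda i: scores[i])
        kept.append((best, scores[best]))
        remaining.remove(best)
        # Rescore everything that is left instead of deleting it outright.
        for i in remaining:
            scores[i] *= decay(iou(boxes[best], boxes[i]), method, iou_thresh, sigma)
        # Discard detections whose score has become negligible.
        remaining = [i for i in remaining if scores[i] > min_score]
    return kept
```

With `method="hard"` a heavily overlapping box disappears from the output; with the two soft variants it survives with a reduced score, so a later score threshold decides its fate.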
Learning Non-Maximum Suppression
● Object detectors are mostly trained end-to-end, except for the NMS step.
○ NMS is still fully hand-crafted and forces a trade-off between recall and precision.
● The training loss is not the evaluation loss.
○ Training is performed without NMS.
○ During evaluation, multiple detections of the same object count as false positives.
● Instead, train the network to include the suppression process.
○ Output only one bounding box per object.
○ Learn how to handle close objects.
● Additional blocks that:
○ Encode pairwise information.
○ For each detection, pool information from all pairings.
○ Update the feature vector.
○ Repeat.
● New loss:
○ Only one positive candidate per object.
○ Instead of the current practice of treating every candidate with IoU > 50% as positive.
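The difference between the two labelling schemes in the new loss can be sketched as follows. This is an illustrative sketch only: it assumes detections are matched to objects greedily in descending score order, as in standard detection evaluation, and the function names and the `iou_thresh=0.5` default are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def labels_all_above_threshold(detections, objects, iou_thresh=0.5):
    """Common practice: every detection overlapping an object is positive."""
    return [1 if any(iou(d, o) >= iou_thresh for o in objects) else 0
            for d in detections]

def labels_one_positive_per_object(detections, scores, objects, iou_thresh=0.5):
    """Assumed matching: each object accepts at most one detection."""
    labels = [0] * len(detections)
    matched = set()
    order = sorted(range(len(detections)), key=lambda i: scores[i], reverse=True)
    for i in order:
        best_j, best_iou = None, iou_thresh
        for j, obj in enumerate(objects):
            if j in matched:
                continue  # this object already has its positive
            overlap = iou(detections[i], obj)
            if overlap >= best_iou:
                best_j, best_iou = j, overlap
        if best_j is not None:
            matched.add(best_j)
            labels[i] = 1
    return labels
```

Under the second scheme a duplicate detection of an already matched object becomes a negative training example, which is what pushes the network to output a single box per object.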
Multi-Scale Object Detection
● Multi-scale object detection using image pyramid

○ Predict different scales by applying the same model at different image resolutions.
● Classic method.
● Also used in OverFeat.
● Slow: requires multiple evaluations of the same model.
● Predict multiple object scales using a single feature map.

● Same as Faster R-CNN.
● Fast.
● Single model (same in training as in testing).
● Poor feature resolution for small objects.
● Predict different object sizes at different feature scales.

● Same as SSD.
● Good feature resolution for small objects.
● But the features are much weaker than those in deeper layers.
Feature Pyramid Network (FPN)
● Single model (same in training as in testing).

● Good feature resolution for small objects.
● Strong features in all layers.
● Almost no overhead over SSD (i.e., fast).
● How important is top-down enrichment?

● How important are lateral connections?
● How important are pyramid representations?
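The top-down enrichment and lateral connections named in these questions can be sketched on tiny feature maps. This is a simplified illustration in plain Python, assuming nested lists in `[channel][row][column]` layout, nearest-neighbour 2x upsampling, and levels that halve in size; the 3x3 smoothing convolution that FPN applies after each merge is omitted, and all function names are hypothetical.

```python
def conv1x1(fmap, weight):
    """1x1 convolution: fmap is [C_in][H][W], weight is [C_out][C_in]."""
    h, w = len(fmap[0]), len(fmap[0][0])
    return [[[sum(wrow[c] * fmap[c][y][x] for c in range(len(fmap)))
              for x in range(w)]
             for y in range(h)]
            for wrow in weight]

def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a [C][H][W] map."""
    return [[[row[x // 2] for x in range(2 * len(row))]
             for row in channel for _ in range(2)]
            for channel in fmap]

def add(a, b):
    """Element-wise sum of two [C][H][W] maps of identical shape."""
    return [[[va + vb for va, vb in zip(ra, rb)]
             for ra, rb in zip(ca, cb)]
            for ca, cb in zip(a, b)]

def fpn_top_down(bottom_up, lateral_weights):
    """Merge bottom-up maps (finest first) into pyramid maps (finest first)."""
    merged = [None] * len(bottom_up)
    # The coarsest level only gets its lateral 1x1 convolution.
    merged[-1] = conv1x1(bottom_up[-1], lateral_weights[-1])
    # Walk towards finer levels, adding the upsampled coarser map each time.
    for i in range(len(bottom_up) - 2, -1, -1):
        lateral = conv1x1(bottom_up[i], lateral_weights[i])
        merged[i] = add(lateral, upsample2x(merged[i + 1]))
    return merged
```

The lateral 1x1 convolutions bring every level to the same channel count, so the strong but coarse deep features can be added directly to the high-resolution shallow ones.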
Focal Loss for Dense Object Detection
● Can we train a single-stage detector to be as accurate as two-stage detectors?

● Contributions:
○ RetinaNet: a single-stage object detector based on an FPN backbone.
○ A new loss.
● Class imbalance is an important issue for object detection.

● Previous solutions:
○ Random resampling at a 1:3 ratio.
○ Hard negative resampling at a 1:3 ratio.
● Both solutions mean that at each step only a few samples actually matter to the loss function.
● Instead, include all samples but use a different weight for each class.
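○ Regular cross entropy: $\mathrm{CE}(p_t) = -\log(p_t)$, where $p_t$ is the predicted probability of the ground-truth class.
○ Weighted cross entropy: $\mathrm{CE}(p_t) = -\alpha_t \log(p_t)$
● Using weighted CE as the baseline:
○ Can we do better?
○ Can we use a different weight for each sample?
● Focal loss: $\mathrm{FL}(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$
● Every sample is weighted according to its error.
○ We want to focus on samples that are misclassified.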
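The progression from regular cross entropy to weighted cross entropy to focal loss can be checked numerically on single binary predictions. This is a minimal sketch in plain Python: the function names are illustrative, and the defaults `alpha=0.25` and `gamma=2.0` are assumed here as commonly quoted RetinaNet settings rather than taken from these slides.

```python
import math

def prob_of_true_class(p, y):
    """p is the predicted foreground probability, y is the label (0 or 1)."""
    return p if y == 1 else 1.0 - p

def cross_entropy(p, y):
    """Regular cross entropy: every sample has the same weight."""
    return -math.log(prob_of_true_class(p, y))

def weighted_cross_entropy(p, y, alpha=0.25):
    """Per-class weight: alpha for foreground, 1 - alpha for background."""
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * math.log(prob_of_true_class(p, y))

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Per-sample weight: (1 - p_t)^gamma shrinks the loss of easy samples."""
    p_t = prob_of_true_class(p, y)
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# An easy background sample (p_t = 0.9) keeps 1% of its weighted-CE loss,
# while a badly misclassified foreground sample (p_t = 0.1) keeps 81%.
easy_ratio = focal_loss(0.1, 0) / weighted_cross_entropy(0.1, 0)
hard_ratio = focal_loss(0.1, 1) / weighted_cross_entropy(0.1, 1)
```

Setting `gamma=0` recovers weighted cross entropy, so the modulating factor is the only difference between the baseline and the focal loss.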
● Different parameters for RetinaNet

● Comparison with online hard negative mining

● Accuracy/speed trade-offs
● Benchmark results
Also Read:
Deformable Convolutional Networks
https://arxiv.org/abs/1703.06211
YouTube Videos
● CS231n
○ Lecture 11 - Detection and segmentation https://youtu.be/nDPWywWRIRo
● Deep Learning for Objects and Scenes (CVPR 2017 Workshop)
○ Lecture 1: Learning Deep Representations for Visual Recognition, by Kaiming He
https://youtu.be/jHv37mKAhV4
○ Lecture 2: Deep Learning for Instance-level Object Understanding, by Ross Girshick
https://youtu.be/jHv37mKAhV4?t=39m4s
Looking for brilliant researchers
cv@brodmann17.com /
amir@brodmann17.com
Computer Vision Tasks
Source: CS231n Object detection http://cs231n.stanford.edu/slides/2016/winter1516_lecture8.pdf

Mask R-CNN
● Instance segmentation with pose estimation for people.
● Extends Faster R-CNN by adding a new branch for the instance mask task.
● Pose estimation can be added by simply adding another branch.
● SOTA accuracy on detection, segmentation and pose estimation at 5 FPS on GPU.
● https://arxiv.org/abs/1703.06870
● Girshick won a young researcher award.
Mask R-CNN
● RoiPool
○ Quantization breaks pixel-to-pixel alignment.
○ Too coarse for the fine spatial information required for masks.
● RoiAlign
○ Bilinearly sample the proposal region and avoid the quantization.
○ Smoothly normalize features and predictions into a coordinate frame free of scale and aspect ratio.
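The bilinear sampling that lets RoiAlign avoid quantization can be sketched on a single-channel feature map. This is a simplified illustration in plain Python, not the Mask R-CNN implementation: it assumes the RoI is already expressed in feature-map coordinates with pixel centres at integer indices, averages a regular grid of `samples` x `samples` points per output bin, and uses hypothetical function names.

```python
import math

def bilinear(fmap, y, x):
    """Interpolate a 2-D map (list of rows) at a real-valued position."""
    h, w = len(fmap), len(fmap[0])
    y = min(max(y, 0.0), h - 1.0)  # clamp to the valid range
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = fmap[y0][x0] * (1 - dx) + fmap[y0][x1] * dx
    bottom = fmap[y1][x0] * (1 - dx) + fmap[y1][x1] * dx
    return top * (1 - dy) + bottom * dy

def roi_align(fmap, roi, out_size=2, samples=2):
    """roi = (x1, y1, x2, y2), real-valued and never rounded."""
    x1, y1, x2, y2 = roi
    bin_w = (x2 - x1) / out_size
    bin_h = (y2 - y1) / out_size
    out = []
    for by in range(out_size):
        row = []
        for bx in range(out_size):
            total = 0.0
            # Sample a regular grid of points inside the bin and average them.
            for sy in range(samples):
                for sx in range(samples):
                    y = y1 + (by + (sy + 0.5) / samples) * bin_h
                    x = x1 + (bx + (sx + 0.5) / samples) * bin_w
                    total += bilinear(fmap, y, x)
            row.append(total / (samples * samples))
        out.append(row)
    return out
```

Because no coordinate is rounded, shifting the RoI by a fraction of a pixel changes the output smoothly, which is the alignment property the mask branch depends on.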
Mask R-CNN
● Backbone architecture
○ ResNet
○ ResNeXt
○ FPN
● Mask representation
○ FC vs. Convolutional
○ Multinomial vs. Independent Masks: softmax vs. sigmoid
○ Class-Specific vs. Class-Agnostic Masks: almost same accuracy
● Multi-task learning
○ Mask task improves object detection accuracy.
○ Keypoint task reduces object detection accuracy.
Mask R-CNN
● Pose estimation
○ Simply add an additional branch.
○ Model a keypoint’s location as a one-hot mask,
and adopt Mask R-CNN to predict K masks.
○ Experiments are mainly to demonstrate the
generality of the Mask R-CNN framework.
○ RoiAlign improves this task’s accuracy as well.
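The one-hot keypoint encoding can be sketched as a training target plus a loss for a single keypoint. This is a minimal illustration in plain Python, assuming the loss is a cross entropy over a softmax across all positions of the mask; the function names and the tiny mask sizes are hypothetical.

```python
import math

def one_hot_mask(size, row, col):
    """Target for one keypoint: a size x size mask with a single 1."""
    return [[1 if (r == row and c == col) else 0 for c in range(size)]
            for r in range(size)]

def keypoint_loss(logits, row, col):
    """Cross entropy over a softmax across all positions of the mask."""
    flat = [v for line in logits for v in line]
    peak = max(flat)  # subtract the maximum for numerical stability
    log_sum = peak + math.log(sum(math.exp(v - peak) for v in flat))
    return log_sum - logits[row][col]

# K keypoints are handled as K such masks, one loss term per keypoint.
target = one_hot_mask(3, 1, 2)
uniform_loss = keypoint_loss([[0.0, 0.0], [0.0, 0.0]], 0, 1)
```

Treating the mask as one softmax over all positions, rather than independent per-pixel sigmoids, encourages the branch to commit to a single location per keypoint.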
