
Digging into Sample Assignment Methods for Object Detection

Hiroto Honda
Oct. 1, 2020
About Me

Hiroto Honda

- Mobility Technologies Co., Ltd. (Japan)

- homepage: https://hirotomusiker.github.io/

- blogs: Digging Into Detectron 2

- Kaggle master: 6th place at Open Images Challenge ’19

- Interests: Object Detection, Human Pose Estimation, Image Restoration


Today I talk about...

How to define training samples for object detection,
given a feature map and ground-truth boxes
Today I don’t talk about...

Accuracy and inference-time comparison among object detectors,
because it is hard to see the difference between sampling methods in a fair way
Object Detection

Input: image
Output: bounding boxes (xywh + class id + confidence)

(from [H1])
How Object Detection Works
Example of 2-stage Detector [H1][3]: Faster R-CNN [1] + Feature Pyramid Network [2]
Object Detectors Decomposed

(diagram: backbone -> neck -> dense head / roi head)

- the dense head is responsible for detection at every grid cell
- the roi head is responsible for recognition of one object from one ROI feature map
Object Detectors Decomposed

(same diagram, grouped into detector types)

- 1-stage (single-shot) detector: backbone -> neck -> dense head,
  which is responsible for detection at every grid cell
- 2-stage detector: additionally uses the roi head,
  which is responsible for recognition of one object from one ROI feature map
Object Detectors Decomposed

detector name                backbone      neck      dense head     roi head
--- 2-stage detectors ---
Faster R-CNN [1] w/ FPN [2]  ResNet        FPN       RPN            Fast R-CNN
Mask R-CNN [4]               ResNet        FPN       RPN            Mask R-CNN
--- 1-stage (single-shot) detectors ---
RetinaNet [5]                ResNet        FPN       RetinaNetHead  -
EfficientDet [6]             EfficientNet  BiFPN     RetinaNetHead  -
YOLO [7-11]                  darknet etc.  YOLO-FPN  YOLO layer     -
SSD [12]                     VGG           -         SSDHead        -

How are Feature Maps and Ground Truth Associated?

from: [H1]
Region Proposal Network

detector name                backbone  neck  dense head  roi head
Faster R-CNN [1] w/ FPN [2]  ResNet    FPN   RPN         Fast R-CNN
Mask R-CNN [4]               ResNet    FPN   RPN         Mask R-CNN

Region Proposal Network (RPN)

(input/output diagram from [H1])


Multi-Scale Detection Results (objectness)

(from [H1]: visualization of an objectness channel, corresponding to one of the three anchors,
at strides 4, 8, 16, 32 and 64)
Anchors

three anchors per scale
aspect ratios: (1, 1), (1, 2), (2, 1)
(from [H1])

Anchors on Each Grid Cell
(from [H1])

Grid cells at the coarse scale have large anchors
= responsible for detecting large objects
How are Feature Maps and Ground Truth Associated?

(from [H1])

Answer: define the ‘foreground grid cells’ by matching ‘anchors’ with GT boxes
Intersection Over Union (IoU)

IoU = area(A ∩ B) / area(A ∪ B)

(figure: two box pairs, one with IoU = 0.95 and one with IoU = 0.15)
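As a minimal sketch (pure Python; the box format (x1, y1, x2, y2) and the function name are my own choices, not from any cited codebase), the IoU of two boxes can be computed as:

```python
def iou(box_a, box_b):
    """IoU = area(A ∩ B) / area(A ∪ B), boxes given as (x1, y1, x2, y2)."""
    # intersection rectangle (empty if the boxes do not overlap)
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # 0.333...: the boxes overlap by half
```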
Sample Assignment of RPN

IoU matrix for anchor-GT matching (from [H1]; each value is the IoU between one anchor
and one GT box, with three anchors at each of positions 0-2):

            position 0        position 1        position 2
GT box 0    0     0     0.61  0.28  0     0     0     0     0
GT box 1    0     0     0     0     0     0     0.98  0     0

Here the anchor with IoU 0.61 is ignored, the anchor with IoU 0.98 is matched with
GT box 1 as foreground, and all the others are background:

- foreground (IoU ≧ T1): objectness target = 1, regression target assigned
- background (IoU < T2): objectness target = 0, no regression loss
- ignored (T2 ≦ IoU < T1)
(T1 and T2: predefined threshold values)
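A minimal NumPy sketch of this thresholding, assuming a (num GTs, num anchors) IoU matrix like the one above; function and variable names are mine, and the force-matching of each GT's single best anchor that real RPN implementations typically add is omitted:

```python
import numpy as np

def assign_rpn_samples(iou_matrix, t1=0.7, t2=0.3):
    """Label every anchor as foreground (1), background (0) or ignored (-1)."""
    max_iou = iou_matrix.max(axis=0)           # best GT overlap per anchor
    matched_gt = iou_matrix.argmax(axis=0)     # which GT box that overlap belongs to
    labels = np.full(iou_matrix.shape[1], -1)  # default: ignored (T2 <= IoU < T1)
    labels[max_iou < t2] = 0                   # background: objectness target = 0
    labels[max_iou >= t1] = 1                  # foreground: objectness target = 1
    return labels, matched_gt

# the 2 x 9 IoU matrix from the slide above
ious = np.array([[0, 0, 0.61, 0.28, 0, 0, 0.00, 0, 0],
                 [0, 0, 0.00, 0.00, 0, 0, 0.98, 0, 0]])
print(assign_rpn_samples(ious)[0])  # [ 0  0 -1  0  0  0  1  0  0]
```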
Box Regression After Sample Assignment

Δx = (x - xa) / wa
Δy = (y - ya) / ha
Δw = log(w / wa)
Δh = log(h / ha)

(from [H1])

RPN learns the relative size and location between GT boxes and anchors
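The same targets as a sketch in code (boxes as (center x, center y, width, height); the function name is mine):

```python
import numpy as np

def encode_box_deltas(gt_box, anchor):
    """Regression targets from the formulas above; boxes are center-format (x, y, w, h)."""
    x, y, w, h = gt_box
    xa, ya, wa, ha = anchor
    return np.array([(x - xa) / wa,    # Δx
                     (y - ya) / ha,    # Δy
                     np.log(w / wa),   # Δw
                     np.log(h / ha)])  # Δh

# a GT box slightly to the right of and larger than its anchor
print(encode_box_deltas((12, 10, 20, 20), (10, 10, 16, 16)))  # approx. [0.125 0. 0.223 0.223]
```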
RetinaNet / EfficientDet

detector name     backbone      neck   dense head     roi head
RetinaNet [5]     ResNet        FPN    RetinaNetHead  -
EfficientDet [6]  EfficientNet  BiFPN  RetinaNetHead  -
RetinaNet

(architecture diagram: input image (BGR, H, W) -> backbone (C2-C5) -> FPN (P2-P5, plus P6, P7)
-> RetinaNetHead: stem, then cls_subnet -> cls_score and bbox_subnet -> bbox_pred)
EfficientDet

Backbone: EfficientNet
Neck: BiFPN

(architecture diagram: same layout as RetinaNet, with EfficientNet as the backbone and BiFPN
as the neck, followed by the RetinaNetHead: stem, cls_subnet -> cls_score, bbox_subnet -> bbox_pred)
Sample Assignment of RetinaNet and EfficientDet

Same as RPN - only the number of anchors and the IoU thresholds are different [3]:

architecture  num. anchors at grid cell  T1   T2
Faster R-CNN  3                          0.7  0.3
RetinaNet     9                          0.5  0.4
EfficientDet  3                          0.5  0.5

IoU matrix example:

            position 0        position 1        position 2
GT box 0    0     0     0.41  0.28  0     0     0     0     0
GT box 1    0     0     0     0     0     0     0.68  0     0

With RetinaNet's thresholds, the anchor with IoU 0.41 is ignored, the anchor with IoU 0.68
is matched with GT box 1 as foreground, and all the others are background:

- foreground (IoU ≧ T1): class target = one-hot, regression target assigned
- background (IoU < T2): class target = zeros, no regression loss
- ignored (T2 ≦ IoU < T1) [only RetinaNet]
(T1 and T2: predefined threshold values)
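Since the thresholding logic is the same as the RPN sketch earlier, only the settings differ; as a hypothetical configuration table in code form (values taken from the table above):

```python
# Hypothetical settings dicts; plugging t1/t2 into the assign_rpn_samples()
# sketch above reproduces each detector's assignment rule.
MATCHER_SETTINGS = {
    "Faster R-CNN RPN": dict(anchors_per_cell=3, t1=0.7, t2=0.3),
    "RetinaNet":        dict(anchors_per_cell=9, t1=0.5, t2=0.4),
    "EfficientDet":     dict(anchors_per_cell=3, t1=0.5, t2=0.5),  # T1 == T2: no ignored band
}
```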
YOLO v1 / v2 / v3 / v4 / v5

detector name  backbone      neck      dense head  roi head
YOLO [7-11]    darknet etc.  YOLO-FPN  YOLO layer  -
YOLO detector

YOLOv3 architecture: darknet53 backbone -> P3, P4, P5 feature maps -> YOLO layer
-> bbox, class score, confidence

What makes YOLO ‘YOLO’ is the YOLO layer


Sample Assignment of YOLO v2 / v3 (for the details, see [H2])

            position 0        position 1        position 2
GT box 0    0     0     0.38  0.18  0     0     0     0     0   ・・・  (max: 0.38)
GT box 1    0     0     0     0     0     0     0.98  0     0   ・・・  (max: 0.98)

The anchor with IoU 0.38 is matched with GT box 0 as foreground, the anchor with IoU 0.98
is matched with GT box 1 as foreground, and all the others are background:

- foreground (max-IoU anchor per GT): objectness = 1, regression target assigned
- background (other than max-IoU anchors): objectness = 0, no regression loss
- ignored (IoU between prediction and GT > T1; T1: predefined threshold value)

only one anchor is assigned to one GT
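A sketch of the max-IoU rule under the same matrix convention as before (names are mine; the 'ignored' rule is left out because it depends on the network's predictions during training, not on the anchors alone):

```python
import numpy as np

def assign_yolo_v2_v3_samples(iou_matrix):
    """YOLOv2/v3-style: exactly one anchor - the max-IoU one - per GT box."""
    labels = np.zeros(iou_matrix.shape[1], dtype=int)  # everything starts as background
    best_anchor = iou_matrix.argmax(axis=1)            # single best anchor per GT
    labels[best_anchor] = 1                            # foreground: objectness = 1
    return labels, best_anchor

ious = np.array([[0, 0, 0.38, 0.18, 0, 0, 0.00, 0, 0],
                 [0, 0, 0.00, 0.00, 0, 0, 0.98, 0, 0]])
print(assign_yolo_v2_v3_samples(ious)[0])  # [0 0 1 0 0 0 1 0 0]: one anchor per GT
```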


Sample Assignment of YOLO v4 / v5

            position 0        position 1        position 2
GT box 0    0     0     0.88  0.78  0     0     0     0     0   ・・・
GT box 1    0     0     0     0     0     0     0.98  0     0   ・・・

The anchors with IoU 0.88 and 0.78 are both matched with GT box 0 as foreground,
and the anchor with IoU 0.98 is matched with GT box 1 as foreground:

- foreground (v4: IoU > T1, v5: box w, h ratio < Ta): objectness = 1, regression target assigned
- background (v4: IoU ≦ T1, v5: box w, h ratio ≧ Ta): objectness = 0, no regression loss
- ignored (IoU > T2, only YOLOv4)
(T1, T2 and Ta: predefined threshold values)

multiple anchors can be assigned to one GT
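A sketch of the YOLOv5-style shape matching; to my understanding Ta corresponds to the anchor_t hyperparameter of the repository [11] (4.0 by default as of v3.0), and note that IoU plays no role here:

```python
import numpy as np

def match_by_wh_ratio(gt_wh, anchor_wh, t_a=4.0):
    """A GT matches every anchor whose width and height are each within a
    factor of t_a of the GT's - hence multiple anchors per GT.
    gt_wh: (num_gt, 2), anchor_wh: (num_anchors, 2)."""
    ratio = gt_wh[:, None, :] / anchor_wh[None, :, :]   # (num_gt, num_anchors, 2)
    worst = np.maximum(ratio, 1.0 / ratio).max(axis=2)  # worst-case side ratio
    return worst < t_a                                  # boolean (num_gt, num_anchors)

gts = np.array([[30.0, 60.0]])                               # one 30 x 60 GT box
anchors = np.array([[10, 13], [30, 61], [156, 198]], float)  # three anchor shapes
print(match_by_wh_ratio(gts, anchors))  # [[False  True False]]
```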
Sample Assignment Comparison - YOLOv3 vs YOLOv5

YOLOv5 assigns three feature points for one target center -> higher recall

see my kaggle discussion topic for the YOLOv5 details:
https://www.kaggle.com/c/global-wheat-detection/discussion/172436
Sample Assignment of YOLO series

version  scales  num. anchors per scale  assignment method           anchors assigned per GT
YOLO v1  1       0                       center position comparison  single
YOLO v2  1       9                       IoU comparison              single
YOLO v3  3       3                       IoU comparison              single
YOLO v4  3       3                       IoU comparison              multiple
YOLO v5  3       3                       box size comparison         multiple (+ 2 neighboring cells)

target assignment differs greatly between YOLO versions - which one is the best?
“Anchor-Free” Detectors

detector name                       backbone   neck  dense head     roi head
FCOS [13]                           ResNet     FPN   FCOSHead       -
CenterNet (objects as points) [14]  Hourglass  -     CenterNetHead  -
FCOS

- Assign all the grid cells that fall into the GT box,
  but only at the appropriate scale
- A ‘center-ness’ score is additionally used to suppress low-quality predictions
  far from the GT center
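A point-based sketch of these two rules for a single GT box on one scale (the per-scale regression-range check from the paper is omitted, and all names are mine):

```python
import numpy as np

def fcos_targets(points, gt_box):
    """points: (N, 2) grid-cell centers; gt_box: (x1, y1, x2, y2).
    A point is foreground if it falls inside the box; its center-ness
    decays toward the box borders."""
    l = points[:, 0] - gt_box[0]
    t = points[:, 1] - gt_box[1]
    r = gt_box[2] - points[:, 0]
    b = gt_box[3] - points[:, 1]
    inside = np.stack([l, t, r, b], axis=1).min(axis=1) > 0
    eps = 1e-9  # guard against division by zero on the box border
    cx = (np.minimum(l, r) / np.maximum(np.maximum(l, r), eps)).clip(0)
    cy = (np.minimum(t, b) / np.maximum(np.maximum(t, b), eps)).clip(0)
    return inside, np.where(inside, np.sqrt(cx * cy), 0.0)

pts = np.array([[50.0, 50.0], [20.0, 50.0], [200.0, 50.0]])
print(fcos_targets(pts, (0, 0, 100, 100)))
# box center -> center-ness 1.0, off-center point -> 0.5, outside point -> 0.0
```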
Objects as Points (CenterNet)

- objectness (center) target: a heatmap with Gaussian kernels around the GT centers
- regression target assignment: one grid cell + surrounding points (optional)
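A sketch of one Gaussian center target (hypothetical names; the paper derives the kernel radius from the object size and merges overlapping objects with an element-wise max, both simplified away here):

```python
import numpy as np

def center_heatmap(shape_hw, center_xy, sigma):
    """One Gaussian splat around a GT center on an (H, W) objectness heatmap."""
    h, w = shape_hw
    ys, xs = np.mgrid[0:h, 0:w]
    cx, cy = center_xy
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

hm = center_heatmap((128, 128), (40, 60), sigma=3.0)
print(hm[60, 40], hm[60, 46])  # 1.0 at the center, exp(-2) ~ 0.14 two sigmas away
```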
Adaptive Sample Selection

detector name  backbone  neck  dense head  roi head
ATSS [15]      ResNet    FPN   ATSSHead    -

Adaptive Sample Selection

Adaptively define the IoU threshold for each GT box:

IoU threshold = mean(IoUs) + std(IoUs)

sample candidates: the K = 9 anchors nearest to the GT center

Improves the performance of both anchor-based and anchor-free detectors
Adaptive Sample Selection

            anchors
GT box 0    0     0     0.88  0.28  0     0     0     0  0   ・・・  IoU threshold = 0.71
GT box 1    0     0     0     0     0.18  0.24  0.22  0  0   ・・・  IoU threshold = 0.21

The anchor with IoU 0.88 is matched with GT box 0 as foreground; the anchors with
IoU 0.24 and 0.22 exceed GT box 1's lower adaptive threshold and are matched with it
as foreground:

- foreground (positive): candidate anchors whose centers are close to the GT centers
  and whose IoU exceeds the adaptive threshold
- background (negative): all other anchors
- ignored

multiple anchors can be assigned to one GT
High recall, but includes low-quality positives
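A sketch of the adaptive threshold for one GT box (names are mine; the paper's additional check that a positive anchor's center must lie inside the GT box is omitted, and for brevity the example pretends all nine anchors are the distance-selected candidates):

```python
import numpy as np

def atss_assign_one_gt(iou_row, candidate_idx):
    """iou_row: IoUs of all anchors with one GT box; candidate_idx: indices of
    the K anchors whose centers are nearest to the GT center."""
    cand_ious = iou_row[candidate_idx]
    thr = cand_ious.mean() + cand_ious.std()     # per-GT adaptive threshold
    return candidate_idx[cand_ious >= thr], thr  # foreground candidate indices

iou_row = np.array([0, 0, 0.88, 0.28, 0, 0, 0, 0, 0])  # GT box 0 from the slide
fg, thr = atss_assign_one_gt(iou_row, np.arange(9))
print(fg, round(thr, 2))  # [2] 0.41: only the high-quality anchor becomes foreground
```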
Conclusion

- An object detector can be decomposed into backbone, neck, dense detection head and ROI head
- The core of dense detection is ground-truth sample assignment to the feature map
- The assignment method varies among detectors:
  - anchor-based or point-based
  - multiple anchors allowed per GT or not
  - fixed or adaptive IoU threshold
- Adaptive IoU thresholding improves the performance of both anchor-based and anchor-free detectors
References
[1] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal
networks. In NIPS, 2015.
[2] T.-Y. Lin, P. Dollar, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
[3] Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo and Ross Girshick, Detectron2.
https://github.com/facebookresearch/detectron2, 2019.
[4] Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick. Mask R-CNN. In ICCV, 2017.
[5] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar. Focal loss for dense object detection. In ICCV, 2017.
[6] Mingxing Tan, Ruoming Pang, and Quoc V. Le. EfficientDet: Scalable and efficient object detection. In CVPR, 2020.
[7] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 779–788, 2016.
[8] Joseph Redmon and Ali Farhadi. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), pages 7263–7271, 2017.
[9] Joseph Redmon and Ali Farhadi. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.
[10] Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao. YOLOv4: Optimal speed and accuracy of object detection. arXiv
preprint arXiv:2004.10934, 2020.
[11] YOLOv5, https://github.com/ultralytics/yolov5, as of version 3.0.
[12] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg. SSD: Single
shot multibox detector. In ECCV, 2016.
[13] Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. FCOS: Fully convolutional one-stage object detection. In ICCV, 2019.
[14] Xingyi Zhou, Dequan Wang, and Philipp Krahenbuhl. Objects as points. arXiv preprint arXiv:1904.07850, 2019.
[15] Shifeng Zhang, Cheng Chi, Yongqiang Yao, Zhen Lei, and Stan Z Li. Bridging the gap between anchor-based and anchor-free detection
via adaptive training sample selection. In CVPR, 2020.
References

Hiroto Honda’s medium blogs


[H1] Digging Into Detectron 2
[Part 1]: Introduction - Basic Network Architecture and Repo Structure
[Part 2]: Feature Pyramid Network
[Part 3]: Data Loader and Ground Truth Instances
[Part 4]: Region Proposal Network
[Part 5]: ROI (Box) Head
[H2] Reproducing Training Performance of YOLOv3 in PyTorch
[Part 0]: Introduction
[Part 1]: Network Architecture and channel elements of YOLO layers
[Part 2]: How to assign targets to multi-scale anchors
