Jiahui Yu 1,2   Yuning Jiang 2   Zhangyang Wang 1   Zhimin Cao 2   Thomas Huang 1
1 University of Illinois at Urbana-Champaign
2 Megvii Inc
{jyu79, zwang119, t-huang1}@illinois.edu, {jyn, czm}@megvii.com
ABSTRACT
DenseBox requires the training image patches to be resized to a fixed scale. As a consequence, DenseBox has to perform detection on image pyramids, which unavoidably affects the efficiency of the framework.

This paper proposes a highly effective and efficient CNN-based object detection network, called UnitBox. It adopts a fully convolutional network architecture to predict the object bounds as well as the pixel-wise classification scores on the feature maps directly. In particular, UnitBox takes advantage of a novel Intersection over Union (IoU) loss function for bounding box prediction. The IoU loss directly enforces the maximal overlap between the predicted bounding box and the ground truth, and jointly regresses all the bound variables as a whole unit (see Figure 1). UnitBox demonstrates not only more accurate box prediction but also faster training convergence. It is also notable that, thanks to the IoU loss, UnitBox supports variable-scale training: it can localize objects of arbitrary shapes and scales, and perform more efficient testing with just one pass on a single scale. We apply UnitBox to the face detection task and achieve the best performance on FDDB [6] among all published methods.
2. IOU LOSS LAYER
Before introducing UnitBox, we first present the proposed IoU loss layer and compare it with the widely used ℓ2 loss. Some important notations are defined here: for each pixel (i, j) in an image, the ground-truth bounding box can be defined as a 4-dimensional vector:

    x̃_{i,j} = (x̃_{t,i,j}, x̃_{b,i,j}, x̃_{l,i,j}, x̃_{r,i,j}),    (1)

where x̃_t, x̃_b, x̃_l, x̃_r represent the distances between the current pixel location (i, j) and the top, bottom, left, and right bounds of the ground truth, respectively. For simplicity, we omit the subscript i, j in the rest of this paper. Accordingly, a predicted bounding box is defined as x = (x_t, x_b, x_l, x_r), as shown in Figure 1.
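To make the per-pixel encoding of Eqn. (1) concrete, the following minimal NumPy sketch (our illustration, not code from the paper; the function name and the corner-format box argument are our assumptions) builds the 4-channel target map for a single ground-truth box:

    import numpy as np

    def encode_box_targets(height, width, box):
        # Per-pixel (top, bottom, left, right) targets as in Eqn. (1).
        # box: hypothetical ground-truth box (y1, x1, y2, x2) in pixels.
        # Pixels outside the box keep all-zero targets (background).
        y1, x1, y2, x2 = box
        targets = np.zeros((height, width, 4), dtype=np.float32)
        ys, xs = np.mgrid[0:height, 0:width]
        inside = (ys >= y1) & (ys <= y2) & (xs >= x1) & (xs <= x2)
        targets[..., 0] = np.where(inside, ys - y1, 0)  # distance to top bound
        targets[..., 1] = np.where(inside, y2 - ys, 0)  # distance to bottom bound
        targets[..., 2] = np.where(inside, xs - x1, 0)  # distance to left bound
        targets[..., 3] = np.where(inside, x2 - xs, 0)  # distance to right bound
        return targets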
2.1 L2 Loss Layer
The ℓ2 loss is widely used in optimization. In [5, 7], the ℓ2 loss is also employed to regress the object bounding box via CNNs, where it can be defined as:

    L(x, x̃) = Σ_{i∈{t,b,l,r}} (x_i − x̃_i)²,    (2)

where L is the localization error.
However, the ℓ2 loss has two major drawbacks for bounding box prediction. The first is that in the ℓ2 loss, the coordinates of a bounding box (in the form of x_t, x_b, x_l, x_r) are optimized as four independent variables. This assumption violates the fact that the bounds of an object are highly correlated, and it results in a number of failure cases in which one or two bounds of a predicted box are very close to the ground truth but the entire bounding box is unacceptable. Furthermore, from Eqn. (2) we can see that, given two pixels, one falling in a larger bounding box and the other in a smaller one, the former will have a larger effect on the penalty than the latter, since the ℓ2 loss is unnormalized. This imbalance causes the CNNs to focus more on larger objects while ignoring smaller ones. To handle this, in previous work [5] the CNNs are fed with fixed-scale image patches in the training phase, while being applied on image pyramids in the testing phase. In this way, the ℓ2 loss is normalized, but the detection efficiency is also affected negatively.
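The scale imbalance of the unnormalized ℓ2 loss is easy to see numerically. In the hypothetical comparison below (our illustration), the same 10% relative localization error incurs a 100x larger penalty for a box ten times bigger:

    import numpy as np

    def l2_loss(pred, gt):
        # Unnormalized l2 localization loss of Eqn. (2).
        return np.sum((np.asarray(pred) - np.asarray(gt)) ** 2)

    # Small box: targets (t, b, l, r) = 10 each, predictions off by 10%.
    small = l2_loss([11, 11, 11, 11], [10, 10, 10, 10])          # 4.0
    # Large box: targets 100 each, predictions off by the same 10%.
    large = l2_loss([110, 110, 110, 110], [100, 100, 100, 100])  # 400.0
    print(small, large)  # the larger box dominates the training signal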
2.2 IoU Loss Layer: Forward
In the following, we present a new loss function, named the IoU loss, which addresses the above drawbacks. Given a predicted bounding box x (after the ReLU layer, we have x_t, x_b, x_l, x_r ≥ 0) and the corresponding ground truth x̃, we calculate the IoU loss as follows:

    Algorithm 1: IoU loss forward
    Input: x̃ as bounding box ground truth
    Input: x as bounding box prediction
    Output: L as localization error
    for each pixel (i, j) do
        if x̃ ≠ 0 then
            X = (x_t + x_b) ∗ (x_l + x_r)
            X̃ = (x̃_t + x̃_b) ∗ (x̃_l + x̃_r)
            I_h = min(x_t, x̃_t) + min(x_b, x̃_b)
            I_w = min(x_l, x̃_l) + min(x_r, x̃_r)
            I = I_h ∗ I_w
            U = X + X̃ − I
            IoU = I / U
            L = −ln(IoU)
        else
            L = 0
        end
    end

In Algorithm 1, x̃ ≠ 0 means that the pixel (i, j) falls inside a valid object bounding box; X is the area of the predicted box; X̃ is the area of the ground-truth box; I_h and I_w are the height and width of the intersection area I, respectively; and U is the union area.
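For reference, here is a minimal NumPy sketch of the forward pass of Algorithm 1 for a single pixel (our illustration; the variable names follow the algorithm, and the eps safeguard against division by zero and ln(0) is our addition):

    import numpy as np

    def iou_loss_forward(pred, gt, eps=1e-9):
        # pred, gt: (t, b, l, r) distances from the pixel to the four
        # bounds; pred is assumed non-negative (the network ends in ReLU).
        pt, pb, pl, pr = pred
        gt_t, gt_b, gt_l, gt_r = gt
        if not any(gt):                      # background pixel: gt is all zero
            return 0.0
        X = (pt + pb) * (pl + pr)            # predicted box area
        Xg = (gt_t + gt_b) * (gt_l + gt_r)   # ground-truth box area
        Ih = min(pt, gt_t) + min(pb, gt_b)   # intersection height
        Iw = min(pl, gt_l) + min(pr, gt_r)   # intersection width
        I = Ih * Iw                          # intersection area
        U = X + Xg - I                       # union area
        return -np.log(I / (U + eps) + eps)  # L = -ln(IoU)

For a perfect prediction, e.g. iou_loss_forward((2, 2, 2, 2), (2, 2, 2, 2)), the IoU is 1 and the loss is numerically 0; the loss grows as the overlap shrinks.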
Note that with 0 ≤ IoU ≤ 1, L = −ln(IoU) is essentially a cross-entropy loss with input IoU: we can view IoU as a random variable sampled from a Bernoulli distribution with p(IoU = 1) = 1, and the cross-entropy loss of the variable IoU is then L = −p·ln(IoU) − (1 − p)·ln(1 − IoU) = −ln(IoU). Compared to the ℓ2 loss, we can see that instead of optimizing four coordinates independently, the IoU loss treats the bounding box as a unit. Thus the IoU loss provides more accurate bounding box prediction than the ℓ2 loss. Moreover, the definition naturally normalizes the IoU to [0, 1] regardless of the scale of the bounding boxes. This advantage enables UnitBox to be trained on multi-scale objects and tested on a single-scale image only.
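A quick numerical check of this cross-entropy view (our illustration): with a hard target of p = 1, binary cross-entropy on the IoU value reduces exactly to −ln(IoU):

    import math

    iou = 0.8
    p = 1.0  # target distribution: p(IoU = 1) = 1
    bce = -p * math.log(iou) - (1 - p) * math.log(1 - iou)
    assert math.isclose(bce, -math.log(iou))  # both equal 0.2231...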
2.3 IoU Loss Layer: Backward
To derive the backward algorithm of the IoU loss, we first need to compute the partial derivative of X w.r.t. x, denoted ∇_x X (for simplicity, we write x for any of x_t, x_b, x_l, x_r when the subscript is omitted):

    ∂X/∂x_t (or ∂x_b) = x_l + x_r,    (3)

    ∂X/∂x_l (or ∂x_r) = x_t + x_b.    (4)
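Eqns. (3) and (4) follow directly from X = (x_t + x_b)(x_l + x_r). As a sanity check (our illustration, not part of the paper), central finite differences reproduce both derivatives:

    def box_area(xt, xb, xl, xr):
        # X = (xt + xb) * (xl + xr), area of the predicted box.
        return (xt + xb) * (xl + xr)

    def finite_diff(f, args, k, h=1e-6):
        # Central finite difference of f w.r.t. its k-th argument.
        hi, lo = list(args), list(args)
        hi[k] += h
        lo[k] -= h
        return (f(*hi) - f(*lo)) / (2 * h)

    x = (3.0, 2.0, 4.0, 1.5)  # (xt, xb, xl, xr)
    print(finite_diff(box_area, x, 0))  # ~5.5 = xl + xr, matching Eqn. (3)
    print(finite_diff(box_area, x, 2))  # ~5.0 = xt + xb, matching Eqn. (4)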
layers at the end of VGG stage-5, with convolutional kernel size 512 × 3 × 3 × 4. Additionally, we insert a ReLU layer to make the bounding box prediction non-negative. The predicted bounds are jointly optimized with the IoU loss proposed in Section 2. The final loss is calculated as the weighted average …
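The bounding-box branch described above can be sketched as a minimal PyTorch module (our illustration under our own naming; the padding choice, and the omission of the confidence branch and any up-sampling layers, are assumptions rather than the paper's exact configuration):

    import torch.nn as nn

    class BBoxHead(nn.Module):
        # A 3x3 convolution over the 512-channel VGG stage-5 features,
        # producing 4 maps (t, b, l, r) kept non-negative by a ReLU,
        # as required by the IoU loss.
        def __init__(self, in_channels=512):
            super().__init__()
            self.conv = nn.Conv2d(in_channels, 4, kernel_size=3, padding=1)
            self.relu = nn.ReLU()

        def forward(self, feat):
            # feat: (N, 512, H, W) stage-5 feature map
            return self.relu(self.conv(feat))  # (N, 4, H, W), all >= 0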
Figure 3: Examples of detection results of UnitBox on FDDB.
6. REFERENCES
[1] V. Belagiannis, X. Wang, H. Ben Shitrit, K. Hashimoto, R. Stauder, Y. Aoki, M. Kranzfelder, A. Schneider, P. Fua, S. Ilic, H. Feussner, and N. Navab. Parsing human skeletons in an operating room. Machine Vision and Applications, 2016.
[2] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Computer Vision and Pattern Recognition, 2014.
[3] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. ArXiv e-prints, Dec. 2015.
[4] J. Hosang, M. Omran, R. Benenson, and B. Schiele. Taking a deeper look at pedestrians. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4073–4082, 2015.
[5] L. Huang, Y. Yang, Y. Deng, and Y. Yu. DenseBox: Unifying landmark localization with end to end object detection. ArXiv e-prints, Sept. 2015.
[6] V. Jain and E. Learned-Miller. FDDB: A benchmark for face detection in unconstrained settings. Technical Report UM-CS-2010-009, University of Massachusetts, Amherst, 2010.
[7] Z. Jie, X. Liang, J. Feng, W. F. Lu, E. H. F. Tay, and S. Yan. Scale-aware pixel-wise object proposal networks. ArXiv e-prints, Jan. 2016.
[8] H. Li, Z. Lin, X. Shen, J. Brandt, and G. Hua. A convolutional neural network cascade for face detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5325–5334, 2015.
[9] J. Li, X. Liang, S. Shen, T. Xu, and S. Yan. Scale-aware Fast R-CNN for pedestrian detection. ArXiv e-prints, Oct. 2015.
[10] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NIPS), 2015.
[11] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. ArXiv e-prints, Sept. 2014.
[12] J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, and A. W. M. Smeulders. Selective search for object recognition. International Journal of Computer Vision, 104(2):154–171, 2013.
[13] Z. Wang, S. Chang, Y. Yang, D. Liu, and T. S. Huang. Studying very low resolution recognition using deep networks. CoRR, abs/1601.04153, 2016.
[14] S. Yang, P. Luo, C. C. Loy, and X. Tang. WIDER FACE: A face detection benchmark. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[15] C. L. Zitnick and P. Dollár. Edge boxes: Locating object proposals from edges. In European Conference on Computer Vision (ECCV), September 2014.