Abstract—Methods based on convolutional neural networks (CNNs) have been effectively applied to object detection. However, traditional object detection models face huge obstacles in aerial scenes due to the complexity of aerial images and the wide degrees of freedom of remote sensing objects in density, scale, and orientation. In this article, we apply rotatable bounding boxes (RBox) instead of the traditional bounding box (BBox) and improve a CNN-based network to effectively handle the orientation angle of the detected object, so that the rotation-invariance property can be achieved. Our model is trained and tested on three classes of remote sensing objects: airplanes, ships, and cars. Compared with two common BBox benchmark methods, Faster R-CNN and SSD, our model outperforms both of them, while also efficiently outputting the orientation angles of the detected objects.

Keywords—Rotation Invariant; Object Detection; Deep Convolutional Network (CNN); Computer Vision

I. INTRODUCTION

Object detection is considered a challenging task in image processing and computer vision. It plays an important role in aerial image research, with many tasks and implementations in both military and civil applications. Although classical deep-learning-based object detection algorithms such as [2], [3], [6], [8], [9], [15] have come a long way in detecting objects in natural scenes, they still need more effort in the aerial domain. They are held back in the aerial image field for the following main reasons:

• The complexity of the aerial image: the huge number of interfering objects and the varying resolutions of aerial images make them very complex.
• Aerial images are always accompanied by small objects with complex surroundings.
• Certain categories, such as vehicles and ships, challenge detection algorithms because of the dense arrangement of those objects.
• Because images are taken from high altitudes, remote sensing objects come in arbitrary orientations, which makes it technically difficult for object detectors to handle large aspect ratios.

Object detection algorithms such as Faster R-CNN [2] are commonly used in the aeronautical field. Even though they use non-maximum suppression (NMS) to obtain a more accurate bounding box (BBox) after predicting the location of the object, they are not suitable for remote sensing images, because NMS causes excessive suppression when objects are densely arranged. A significant feature for detecting objects in aerial images is rotation invariance: since the camera views the scene from the top, the orientation of an object is arbitrary. A vast number of aerial images are produced by satellites each year, which makes object detection models, in which the rotation-invariance feature plays an important role, of great importance.

II. RELATED WORK

Common object detection algorithms are either associated with or based on deep convolutional neural networks (CNNs) [1], [5], [7], [12]. Applying CNNs has successfully improved object detection algorithms and achieved state-of-the-art detection performance. The region-based convolutional neural network (R-CNN) was introduced by Girshick et al. [2]; it is a multi-stage method which first uses selective search (SS) to produce region proposals and then deploys a CNN on each region proposal to detect objects. Although R-CNN achieved amazing results, it is time-consuming because each region proposal is fed into the network individually. Spatial pyramid pooling networks (SPP-nets) [4] overcome the fixed-size input constraint through feature sharing, so the image only needs to be fed to the network once. Fast R-CNN [3] improved over R-CNN by using the RoI layer and a multi-task loss to fuse proposal classification and location regression; however, it is still a time-consuming method. Faster R-CNN [15] was introduced by replacing SS with region proposal networks (RPNs) to generate the proposals, combining all detection steps into a unified network and making it the best in performance and efficiency at the time. Single-stage methods have a speed advantage, at slightly lower precision, compared with multi-stage ones. Some famous single-stage methods are RetinaNet [13], the single-shot detector (SSD) [8], and you only look once (YOLO) [9] and its versions [14].
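To make the over-suppression issue concrete, here is a minimal sketch of the standard axis-aligned NMS step these detectors use (plain Python; box and score values are hypothetical). In dense aerial scenes the greedy IoU test also fires between distinct neighbouring objects, which is the excessive-suppression problem noted above.

```python
def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression on axis-aligned boxes.

    boxes: list of (x1, y1, x2, y2); scores: matching confidences.
    Keeps the highest-scoring box, drops remaining boxes whose IoU
    with it exceeds iou_thresh, and repeats. Returns kept indices.
    """
    def iou(a, b):
        ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
        iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
        inter = ix * iy
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        return inter / (area(a) + area(b) - inter)

    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_thresh]
    return keep
```

With two heavily overlapping boxes, only the higher-scoring one survives, even if both cover real objects.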
III. METHODOLOGY

The structure of our model is illustrated in Fig. 2, which introduces four main parts, including the RBox generation.

\[
\mathrm{ArIoU}_{180}(B_1, B_2) = \frac{\operatorname{area}(\hat{B}_1 \cap B_2)}{\operatorname{area}(\hat{B}_1 \cup B_2)}\,\bigl|\cos(\theta_{B_1} - \theta_{B_2})\bigr| \tag{2}
\]

\[
\mathrm{ArIoU}(B_1, B_2) = \frac{\operatorname{area}(\hat{B}_1 \cap B_2)}{\operatorname{area}(\hat{B}_1 \cup B_2)}\,\cos(\theta_{B_1} - \theta_{B_2}) \tag{3}
\]

Figure 1. Representation of RBox.

where the angles of the two RBoxes are \(\theta_{B_1}\) and \(\theta_{B_2}\) respectively, and \(\hat{B}_1\) denotes an RBox which keeps the location and size values of \(B_1\) but takes the angle parameter of \(B_2\). The intersection and union of the two RBoxes are denoted by \(\cap\) and \(\cup\); both are computable because \(\hat{B}_1\) and \(B_2\) have the same angle.
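Since B̂1 and B2 share one angle, their overlap reduces to an axis-aligned IoU in a frame rotated by that angle. The sketch below implements Eqs. (2) and (3) under the assumption that each RBox is given as (cx, cy, w, h, θ) with θ in radians; that parameterization is ours, not spelled out in the text.

```python
import math

def ariou(box1, box2, full_range=True):
    """Angle-related IoU between two rotatable boxes (cx, cy, w, h, theta).

    B1_hat keeps the location/size of box1 but takes box2's angle, so
    rotating both centres by -theta2 makes the two rectangles axis-aligned,
    and their overlap is an ordinary axis-aligned IoU. full_range=True
    gives ArIoU_180 (Eq. 2, with |cos|); False gives ArIoU (Eq. 3).
    """
    cx1, cy1, w1, h1, t1 = box1
    cx2, cy2, w2, h2, t2 = box2
    c, s = math.cos(-t2), math.sin(-t2)
    # Rotate both centres so the shared angle becomes zero.
    x1, y1 = c * cx1 - s * cy1, s * cx1 + c * cy1
    x2, y2 = c * cx2 - s * cy2, s * cx2 + c * cy2
    # Axis-aligned intersection of the two equally oriented rectangles.
    iw = max(0.0, min(x1 + w1 / 2, x2 + w2 / 2) - max(x1 - w1 / 2, x2 - w2 / 2))
    ih = max(0.0, min(y1 + h1 / 2, y2 + h2 / 2) - max(y1 - h1 / 2, y2 - h2 / 2))
    inter = iw * ih
    union = w1 * h1 + w2 * h2 - inter
    angle = math.cos(t1 - t2)
    if full_range:
        angle = abs(angle)
    return inter / union * angle
```

Two identical boxes give ArIoU = 1, while the same box rotated by 90° scores 0, which is what lets the angle term steer matching toward correctly oriented priors.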
B. Feature Extraction

In our model we extract features from the image using the pre-trained VGG19 network [19]. VGG19 is truncated: all fully connected layers, pooling layers, and convolution layers after the conv4_3 layer are detached. conv4_3 is then followed by two 3 × 3 convolutional layers which help predict the class and location of each RBox. We use conv4_3 for two main reasons: our network's receptive field at conv4_3 is 108 × 108 pixels, and conv4_3 has a higher resolution than conv5_3. Consequently, any object larger than the 108 × 108 scope cannot be detected, and any object smaller than 8 pixels may be missed, because the feature map of the conv4_3 layer is 38 × 38 pixels.

Figure 2. Our network structure.

C. Loss Function

Our network's loss objective is defined as:

\[
\mathrm{Loss}(d, l, p^*, p) = \mathrm{Loss}_{cls}(d, l) + \lambda\,[l \ge 1]\,\mathrm{Loss}_{reg}(p^*, p) \tag{4}
\]

\[
\mathrm{Loss}_{cls}(d, l) = -\log d_l \tag{5}
\]

where \(d\) is the class probability distribution calculated by the softmax function, \(l\) denotes the object label, and \(p^*\) and \(p\) represent the ground-truth and predicted RBox respectively. \(\lambda\) is a hyper-parameter which controls the balance between the two tasks. The Iverson bracket indicator function \([l \ge 1]\) evaluates to 1 when \(l \ge 1\) and 0 otherwise.

IV. EXPERIMENTS AND RESULTS

A. Datasets

Our model is designed for object detection in aerial images. We therefore used the publicly available dataset provided by Liu et al. [20], which is constructed from satellite images collected from Google Earth. It contains three main class categories: airplanes, vehicles, and ships; the dataset is described in Table II.

Every object inside an image is annotated in the corresponding box file with five parameters: location in x and y, size in width and height, and the angle of the object. We applied data augmentation to every image to increase the variety of the images' angles and the number of images, with 22% taken as test data and the rest used for training. Some samples are shown in Fig. 3.

B. Evaluation Metrics

We measure the model's performance using the break-even point (BEP), precision, recall, and average precision (AP), defined as follows:

1) Precision Rate and Recall Rate: two widely used metrics for detector performance evaluation, defined in (6) and (7):

\[
P = \frac{\dot{T}_d}{T_d} \tag{6}
\]

\[
R = \frac{\dot{T}_d}{T_r} \tag{7}
\]

where \(P\) and \(R\) denote the precision rate and recall rate, and \(\dot{T}_d\), \(T_d\), and \(T_r\) denote the number of correctly detected targets, the overall number of detected targets, and the number of real targets, respectively.

2) Average Precision: to calculate the AP, the average value of the precision rate is computed over the recall interval from 0 to 1. It indicates the mean accuracy of the detector over the whole range of recall rates, which implies that the higher the AP, the better the detector's performance.

3) Break-Even Point: the detector's precision rate decreases as the recall rate rises from 0 to 1. The point where the recall rate equals the precision rate is the BEP, and the precision (equally, recall) value at that moment is the BEP value, which is used as a metric to evaluate the detector. The higher the BEP value, the better the detector performs.

C. Results

The experimental results are shown in Table I. Our model outperforms the other BBox methods, achieving state-of-the-art detection accuracy, with predicted bounding boxes matching the ground truth at IoU > 0.5. While implementing our model we encountered some difficulty with the
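The precision, recall, AP, and BEP metrics of Section IV-B can be sketched as follows. This is a simplified version: detections are assumed pre-sorted by descending confidence, and AP is taken as the mean precision at each correct detection rather than by any particular interpolation scheme (the text does not specify one).

```python
def detection_metrics(correct_flags_sorted, num_real):
    """Precision/recall curve, AP and BEP from ranked detections.

    correct_flags_sorted[i] is True when the i-th detection (sorted by
    descending confidence) matches a ground-truth object, e.g. IoU > 0.5.
    num_real is T_r, the number of real targets.
    """
    precisions, recalls = [], []
    tp = 0
    for i, correct in enumerate(correct_flags_sorted, start=1):
        tp += correct                  # correctly detected targets so far
        precisions.append(tp / i)      # Eq. (6): P = correct / all detected
        recalls.append(tp / num_real)  # Eq. (7): R = correct / real targets
    # AP: mean precision taken at each correct detection.
    ap_points = [p for p, c in zip(precisions, correct_flags_sorted) if c]
    ap = sum(ap_points) / len(ap_points) if ap_points else 0.0
    # BEP: the precision value where precision and recall are closest.
    bep = min(zip(precisions, recalls), key=lambda pr: abs(pr[0] - pr[1]))[0]
    return precisions, recalls, ap, bep
```

For example, four ranked detections with correctness (hit, hit, miss, hit) against four real targets yield precision 0.75 at recall 0.75, which is also the break-even point.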
Table I
EXPERIMENTAL RESULTS FOR FASTER R-CNN, SSD, AND OUR MODEL

Table II
NUMBER OF IMAGES IN EACH DATASET CATEGORY
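Returning to the training objective of Section III-C, Eqs. (4) and (5) can be sketched as below. The smooth-L1 form of the regression term and the five-parameter RBox target are assumptions, since the text defines only the classification term.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def detection_loss(logits, label, box_pred, box_gt, lam=1.0):
    """Multi-task loss of Eq. (4) for one prior.

    label 0 is background; the Iverson bracket [label >= 1] turns the
    box-regression term on only for positive samples. lam is the
    balancing hyper-parameter (lambda in the paper).
    """
    d = softmax(logits)
    loss_cls = -math.log(d[label])            # Eq. (5): -log d_l
    loss_reg = 0.0
    if label >= 1:                            # Iverson bracket [l >= 1]
        for p, g in zip(box_pred, box_gt):    # over the 5 RBox parameters
            diff = abs(p - g)                 # assumed smooth-L1 per term
            loss_reg += 0.5 * diff * diff if diff < 1.0 else diff - 0.5
    return loss_cls + lam * loss_reg
```

A background sample (label 0) contributes only the classification term, while a positive sample whose predicted RBox deviates from the ground truth additionally pays the gated regression cost.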
V. CONCLUSION

In this article, we have proposed an improved CNN-based object detection model with rotatable bounding boxes. Our model is able to accurately detect arbitrary-oriented objects thanks to its ability to estimate the orientation angles of objects, making it rotation invariant. We outperform Faster R-CNN and SSD on object detection in satellite images.

ACKNOWLEDGMENT

This work was supported by the National Natural Science Fund of China [Grant Nos. 61373062, 61373063, 61473155 and 61703139]; the Project of the Ministry of Industry and Information Technology of PRC [Grant No. E0310/1112/02-1]; the Fundamental Research Funds for the Central Universities [Grant No. 2015B03114]; and the Collaborative Innovation Center of IoT Technology and Intelligent Systems, Minjiang University [Grant No. IIC1701].

REFERENCES

[1] Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.

[2] Girshick, R., et al. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014.

[3] Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, 2015.

[4] He, K., et al. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015, 37(9): 1904-1916.

[5] Szegedy, C., et al. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.

[6] Dai, J., et al. R-FCN: Object detection via region-based fully convolutional networks. In Advances in Neural Information Processing Systems, 2016.
[7] He, K., et al. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.

[8] Liu, W., et al. SSD: Single shot multibox detector. In European Conference on Computer Vision, Springer, 2016.

[9] Redmon, J., et al. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.

[13] Lin, T.-Y., et al. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, 2017.

[16] Xu, Z., et al. Deformable ConvNet with aspect ratio constrained NMS for object detection in remote sensing imagery. Remote Sensing, 2017, 9(12): 1312.

[18] Ren, Y., Zhu, C., and Xiao, S. Deformable Faster R-CNN with aggregating multi-layer features for partially occluded object detection in optical remote sensing images. Remote Sensing, 2018, 10(9): 1470.

[20] Liu, L., Pan, Z., and Lei, B. Learning a rotation invariant detector with rotatable bounding box. arXiv preprint arXiv:1711.09405, 2017.