
2019 International Conference on Electronic Engineering and Informatics (EEI)

An Improved Rotation Invariant CNN-based Detector with Rotatable Bounding Boxes for Aerial Image Detection

Mohammed AL-Soswa∗†, Liu Chuancai∗†‡ and Waheeb AL-Samhi†


† School of Computer Science and Engineering
Nanjing University of Science and Technology, Nanjing 210094, China
‡ Collaborative Innovation Center of IoT Technology and Intelligent Systems, Minjiang University, Fuzhou 350108, China
alsoswa@njust.edu.cn, chuancailiu@njust.edu.cn, waheebsamhi@njust.edu.cn

Abstract—Methods based on convolutional neural networks (CNNs) have been effectively applied to object detection. However, traditional object detection models face huge obstacles in aerial scenes due to the complexity of aerial images and the wide degree of freedom of remote sensing objects in density, scale, and orientation. In this article, we apply rotatable bounding boxes (RBox) instead of the traditional bounding box (BBox) and improve a CNN-based network to handle the orientation angle of the detected object effectively, so that the rotation invariance property is achieved. Our model is trained and tested on three remote sensing categories: airplanes, ships, and vehicles. Compared with two common BBox benchmark methods, Faster R-CNN and SSD, our model outperforms both of them, and it efficiently outputs the orientation angles of the detected objects.

Keywords-Rotation Invariant; Object Detection; Deep Convolutional Network (CNN); Computer Vision

I. INTRODUCTION

Object detection is considered a challenging task in image processing and computer vision. It plays an important role in aerial image research, with many tasks and applications in both military and civil domains. Although classical deep learning-based object detection algorithms such as [2], [3], [6], [8], [9], [15] have come a long way in detecting objects in natural scenes, they still need more effort in the aerial domain. They are held back in the aerial image field for the following main reasons:

• The complexity of the aerial image. The huge number of interfering objects and the varying resolutions make an aerial image very complex.
• Aerial images are often filled with small objects that have complex surroundings.
• Certain categories, such as vehicles and ships, challenge the detection algorithm because of their dense arrangement.
• Because images are taken from a high altitude, remote sensing objects appear in arbitrary orientations, which makes it technically difficult for object detection to handle large aspect ratios.

Object detection algorithms such as Faster R-CNN [2] are commonly used in the aeronautical field. Even though they use non-maximum suppression (NMS) to obtain a more accurate bounding box (BBox) after predicting the location of the object, they are not suitable for remote sensing images, because NMS causes excessive suppression when objects are dense. A significant property for detecting an object in an aerial image is rotation invariance: when the camera views the scene from the top, the orientation of the object becomes arbitrary. A vast number of aerial images are produced by satellites each year, which makes object detection models of great importance, and the rotation invariance property plays an important role in them.

II. RELATED WORK

Common object detection algorithms are either associated with or based on deep convolutional neural networks (CNNs) [1], [5], [7], [12]; applying CNNs has successfully improved object detection algorithms and achieved state-of-the-art detection performance. The region-based convolutional neural network (R-CNN) was introduced by Girshick et al. [2]; it is a multi-stage method that first uses selective search (SS) to produce region proposals and then deploys a CNN on each region proposal to detect objects. Although R-CNN achieved impressive results, it is slow because it feeds each region proposal into the network individually. Spatial pyramid pooling networks (SPP-nets) [4] overcome the fixed-size input constraint through feature sharing, so the image only needs to be fed to the network once. Fast R-CNN [3] improved over R-CNN using the RoI pooling layer and a multi-task loss to fuse proposal classification and location regression; however, it is still time consuming. Faster R-CNN [15] was then introduced by replacing SS with region proposal networks (RPNs) to generate the proposals, combining all detection steps into a unified network and making it the best in performance and efficiency at the time. Single-stage methods trade a slightly lower precision for a speed advantage over the multi-stage ones. Well-known single-stage detectors include RetinaNet [13], the single-shot detector (SSD), and you only look once (YOLO) [9] and its successors [14], [17].

Unlike multi-stage methods, single-stage detectors do not rely on region proposals; instead they generate a number of boxes, called prior boxes, according to a fixed rule, and objects are located from the boxes near or on them. YOLO [9] achieves three times the speed of Faster R-CNN by dividing the image into blocks, which improves training speed and detection efficiency. YOLOv2 [14] abandoned the fully connected layers of its predecessor [9] and used fully convolutional layers, producing even better results, while YOLOv3 [17] kept improving over YOLOv2 by enhancing multi-scale feature fusion and deploying residual networks for small objects. SSD [8] is a method based on a pyramidal feature hierarchy; it places prior boxes with different scales and aspect ratios at each position of the prediction feature layers. RetinaNet, introduced by Lin et al. [13], utilizes the focal loss (FL) function to overcome the imbalance between negative and positive samples; its performance matches the multi-stage detectors while keeping the speed of the single-stage ones.

Scholars have applied these methods in the field of remote sensing. In order to make the methods insensitive to arbitrary rotation, attempts were made either to extract rotation-insensitive features or to adjust the orientation. R-P-Faster R-CNN, proposed by Han et al. [11], was applied to small datasets and showed reasonable performance. Xu et al. [16] improved detection accuracy by merging R-FCN and deformable convolution layers [10]. The deformable Faster R-CNN introduced by Ren et al. [18] produces a single high-level feature map of fine resolution through top-down and skip connections, improving model performance. Orientation response networks (ORN), proposed by Zhou et al. [19], attempt to extract the rotation angle of the whole image while classifying it. Unlike ORN, Liu et al. [20] extract the orientation angle locally using prior boxes, achieving better performance and accuracy. Yang et al. [21] use a similar scheme to extract the orientation with a Faster R-CNN network structure.

III. METHODOLOGY

The structure of our model is illustrated in Fig. 2. It consists of four main parts: the RBox generation and sample selection part, the feature extraction part, the calculation of the loss function part, and the output of the predicted result part.

A. Rotatable Bounding Box and Sample Selection

The bounding box (BBox) is an essential component for locating targets in object detection models. It consists of a central point, a width, and a height, which together form an axis-aligned rectangle. However, the BBox has drawbacks for arbitrary-oriented detection: it cannot outline the shape of the object precisely, it does not separate the background from the object, and it makes dense objects difficult to distinguish. To overcome these obstacles, a rotatable bounding box (RBox) is defined by a 5-tuple: the coordinates of the central point (x, y), the long and short sides of the RBox (width w and height h), and the orientation angle θ. As illustrated in Fig. 1, the angle θ is the angle between the x-axis and the RBox principal axis; it rotates clockwise and is sampled every 30° in the range from 0° to 180°, and if the top and bottom of the object need to be distinguished, the range from 0° to 330° can be used.

Figure 1. Representation of RBox.

The BBox uses the Intersection-over-Union (IoU), presented in (1), as a criterion both for selecting positive samples during training and for reducing repeated predictions in detection. It can be used for the RBox with a minor improvement. Because the intersection of two RBoxes produces a polygon with eight sides or fewer, the Boolean calculation is more sophisticated than for the BBox. The angle-related IoU (ArIoU) is therefore used to compute the overlap between two RBoxes. Given two RBoxes B1 and B2, their ArIoU180, used when the top and bottom of the target need not be distinguished, is defined in (2); otherwise the definition of ArIoU in (3) applies.

$$\mathrm{IoU}(B_1, B_2) = \frac{\mathrm{area}(B_1 \cap B_2)}{\mathrm{area}(B_1 \cup B_2)} \quad (1)$$

$$\mathrm{ArIoU}_{180}(B_1, B_2) = \frac{\mathrm{area}(\hat{B}_1 \cap B_2)}{\mathrm{area}(\hat{B}_1 \cup B_2)}\,\left|\cos(\theta_{B_1} - \theta_{B_2})\right| \quad (2)$$

$$\mathrm{ArIoU}(B_1, B_2) = \frac{\mathrm{area}(\hat{B}_1 \cap B_2)}{\mathrm{area}(\hat{B}_1 \cup B_2)}\,\cos(\theta_{B_1} - \theta_{B_2}) \quad (3)$$

where θ_B1 and θ_B2 are the angles of the two RBoxes, and B̂1 denotes an RBox that keeps the location and size values of B1 but takes the angle parameter of B2. The intersection and union of B̂1 and B2 are both easy to compute because the two boxes share the same angle.
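As a concrete illustration of (1)-(3), the following Python sketch builds the RBox polygons and evaluates the exact IoU and the ArIoU with the shapely library. The helper names, the box construction, and the 30° prior-angle grid are illustrative assumptions, not the authors' released implementation.

```python
# Sketch of the RBox overlap measures, assuming an RBox is (cx, cy, w, h, theta_deg)
# with theta measured clockwise from the x-axis as in Fig. 1. Not the authors' code.
import math
from shapely.geometry import Polygon

def rbox_polygon(cx, cy, w, h, theta_deg):
    """Return the 4-corner polygon of a rotated bounding box."""
    t = math.radians(theta_deg)
    corners = []
    for dx, dy in [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]:
        # Rotate the corner offsets clockwise by theta around the centre.
        x = cx + dx * math.cos(t) + dy * math.sin(t)
        y = cy - dx * math.sin(t) + dy * math.cos(t)
        corners.append((x, y))
    return Polygon(corners)

def iou(b1, b2):
    """Exact IoU of two RBoxes, Eq. (1); the intersection is a polygon with <= 8 sides."""
    p1, p2 = rbox_polygon(*b1), rbox_polygon(*b2)
    union = p1.union(p2).area
    return p1.intersection(p2).area / union if union > 0 else 0.0

def ar_iou(b1, b2, distinguish_head=False):
    """Angle-related IoU: B1 is redrawn with B2's angle (B^1), then the overlap
    is weighted by cos(theta1 - theta2), Eq. (2) / Eq. (3)."""
    cx, cy, w, h, t1 = b1
    t2 = b2[4]
    p1_hat = rbox_polygon(cx, cy, w, h, t2)   # B^1: same location/size, B2's angle
    p2 = rbox_polygon(*b2)
    union = p1_hat.union(p2).area
    overlap = p1_hat.intersection(p2).area / union if union > 0 else 0.0
    c = math.cos(math.radians(t1 - t2))
    return overlap * c if distinguish_head else overlap * abs(c)

# Prior boxes are generated every 30 degrees: angles 0, 30, ..., 150.
prior_angles = range(0, 180, 30)
gt = (50.0, 50.0, 40.0, 12.0, 20.0)
for a in prior_angles:
    prior = (50.0, 50.0, 40.0, 12.0, float(a))
    print(a, round(ar_iou(prior, gt), 3))
```

Because B̂1 and B2 share the same angle, a production implementation could replace the polygon clipping in `ar_iou` with a cheaper axis-aligned overlap in the rotated frame; the sketch keeps shapely for clarity.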

B. Feature Extraction

In our model, the features are extracted from the image with the pre-trained VGG19 network [19]. VGG19 is truncated: all fully connected layers, and every pooling and convolutional layer after the conv4_3 layer, are detached. The conv4_3 layer is then followed by two 3 x 3 convolutional layers that predict the class and location of each RBox. We use conv4_3 for two main reasons: the receptive field of our network at conv4_3 is 108 x 108 pixels, and conv4_3 has a higher resolution than conv5_3. Consequently, any object larger than the 108 x 108 scope cannot be detected, and any object smaller than 8 pixels may be missed, because the feature map of the conv4_3 layer is 38 x 38 pixels.
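To make the backbone description concrete, here is a minimal PyTorch sketch that truncates a torchvision VGG-19 after the conv4_3 activation and attaches two 3 x 3 convolutional prediction layers. The layer index, the number of priors per cell, the class count, and the head widths are assumptions for illustration rather than the exact configuration used in our experiments.

```python
# Illustrative backbone sketch (not the exact training code): VGG-19 truncated
# after conv4_3, followed by two 3x3 convolutional prediction layers.
import torch
import torch.nn as nn
import torchvision

class RBoxDetector(nn.Module):
    def __init__(self, num_classes=3, num_angles=6):
        super().__init__()
        vgg = torchvision.models.vgg19(weights=None)  # ImageNet weights are used in practice
        # In torchvision's VGG-19, conv4_3 is features[23]; keep layers up to its ReLU
        # and drop every pooling/conv/fully-connected layer that follows.
        self.backbone = nn.Sequential(*list(vgg.features.children())[:25])
        priors_per_cell = num_angles  # e.g. one prior per 30-degree angle
        # Two 3x3 heads on the conv4_3 map: class scores and RBox parameters (x, y, w, h, theta).
        self.cls_head = nn.Conv2d(512, priors_per_cell * (num_classes + 1), 3, padding=1)
        self.loc_head = nn.Conv2d(512, priors_per_cell * 5, 3, padding=1)

    def forward(self, x):
        feat = self.backbone(x)  # a 300x300 input gives roughly a 38x38 conv4_3 map
        return self.cls_head(feat), self.loc_head(feat)

if __name__ == "__main__":
    model = RBoxDetector()
    scores, offsets = model(torch.randn(1, 3, 300, 300))
    print(scores.shape, offsets.shape)
```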
Figure 2. Our network structure.

C. Loss Function

The objective of our network loss function is defined as:

$$\mathrm{Loss}(d, l, p^{*}, p) = \mathrm{Loss}_{cls}(d, l) + \lambda\,[l \ge 1]\,\mathrm{Loss}_{loc}(p^{*}, p) \quad (4)$$

$$\mathrm{Loss}_{cls}(d, l) = -\log d_{l} \quad (5)$$

where d is the class probability distribution computed by the softmax function, l denotes the object label, and p* and p represent the ground-truth and predicted RBox respectively. λ is a hyper-parameter that controls the balance between the two tasks. The Iverson bracket indicator function [l ≥ 1] evaluates to 1 when l ≥ 1 and 0 otherwise.
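A minimal sketch of the two-task objective in (4) and (5) is given below in PyTorch form. It assumes label 0 is the background class and uses a smooth-L1 term for the RBox regression, since the exact form of the localisation loss is not written out above; all function and tensor names are illustrative.

```python
# Sketch of the multi-task objective of Eq. (4)-(5); the smooth-L1 regression
# term is an assumption, only the classification term is written out above.
import torch
import torch.nn.functional as F

def detection_loss(class_logits, labels, pred_rbox, gt_rbox, lam=1.0):
    """
    class_logits: (N, C+1) raw scores per prior box (class 0 = background)
    labels:       (N,) integer labels l; l >= 1 marks a positive (matched) prior
    pred_rbox:    (N, 5) predicted (x, y, w, h, theta) values p
    gt_rbox:      (N, 5) ground-truth values p* (only meaningful where l >= 1)
    """
    # Loss_cls(d, l) = -log d_l with d the softmax distribution, Eq. (5);
    # cross_entropy combines the softmax and the negative log-likelihood.
    cls_loss = F.cross_entropy(class_logits, labels)

    # Iverson bracket [l >= 1]: the localisation term only counts positive priors.
    positive = labels >= 1
    if positive.any():
        loc_loss = F.smooth_l1_loss(pred_rbox[positive], gt_rbox[positive])
    else:
        loc_loss = class_logits.new_zeros(())

    # Eq. (4): lambda balances the classification and localisation tasks.
    return cls_loss + lam * loc_loss

# Usage on dummy tensors:
logits = torch.randn(8, 4)  # 3 object classes + background
labels = torch.tensor([0, 0, 1, 2, 0, 3, 0, 1])
pred = torch.randn(8, 5)
gt = torch.randn(8, 5)
print(detection_loss(logits, labels, pred, gt).item())
```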
IV. EXPERIMENTS AND RESULTS

A. Datasets

Our model is designed for object detection in aerial images. We therefore used the publicly available dataset provided by Liu et al. [20], which is constructed from satellite images collected from Google Earth. It contains three main class categories: Airplane, Vehicles, and Ships. The dataset is described in Table II.

Every object inside an image is annotated in the corresponding box file with 5 parameters: the location in x and y, the size in width and height, and the angle of the object. We applied data augmentation to every image to increase the variety of object angles and the number of images, with 22% of the data held out for testing and the rest used for training. Some samples are shown in Fig. 3.

B. Evaluation Metrics

We measure the model performance with the break-even point (BEP), precision, recall, and average precision (AP), defined as follows:

1) Precision Rate and Recall Rate: two widely used metrics for evaluating detector performance, defined in (6) and (7):

$$P = \frac{\dot{T}_{d}}{T_{d}} \quad (6)$$

$$R = \frac{\dot{T}_{d}}{T_{r}} \quad (7)$$

where P and R denote the precision rate and recall rate, and T_d, Ṫ_d, and T_r denote the number of detected targets, the number of correctly detected targets, and the number of real targets, respectively.

2) Average Precision: to calculate the AP, the average value of the precision rate is computed over the recall interval from 0 to 1. It indicates the mean accuracy of the detector over all recall rates, so the higher the AP, the better the detector performs.

3) Break-Even Point: the detector precision rate decreases as the recall rate rises from 0 to 1. The point at which the recall rate equals the precision rate is the BEP, and the precision (equivalently, recall) value at that moment is the BEP value used as a metric to evaluate the detector. The higher the BEP value, the better the detector performs.
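For concreteness, the following sketch computes the precision and recall of (6) and (7), a 101-point interpolated AP, and the BEP from a list of scored detections. It assumes each detection has already been matched to an unused ground-truth target (e.g. with IoU > 0.5), and the interpolation scheme is one common choice rather than the exact procedure used here.

```python
# Sketch of the metrics in (6)-(7), AP, and BEP from scored detections.
# `detections` is a list of (confidence, is_correct) pairs after matching each
# detection to an unused ground truth; num_real_targets is T_r.
import numpy as np

def precision_recall_curve(detections, num_real_targets):
    det = sorted(detections, key=lambda d: d[0], reverse=True)  # high confidence first
    correct = np.array([1.0 if ok else 0.0 for _, ok in det])
    tp = np.cumsum(correct)                  # correctly detected targets so far (T_d_dot)
    td = np.arange(1, len(det) + 1)          # all detected targets so far (T_d)
    precision = tp / td                      # Eq. (6)
    recall = tp / num_real_targets           # Eq. (7)
    return precision, recall

def average_precision(precision, recall):
    # Mean of the interpolated precision over recall in [0, 1], sampled at 101 points.
    return float(np.mean([precision[recall >= r].max() if (recall >= r).any() else 0.0
                          for r in np.linspace(0.0, 1.0, 101)]))

def break_even_point(precision, recall):
    # BEP: the precision (= recall) value where the two rates are closest.
    i = int(np.argmin(np.abs(precision - recall)))
    return float(precision[i])

dets = [(0.9, True), (0.8, True), (0.7, False), (0.6, True), (0.3, False)]
p, r = precision_recall_curve(dets, num_real_targets=4)
print(average_precision(p, r), break_even_point(p, r))
```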

Table I
EXPERIMENTAL RESULTS FOR FASTER R-CNN, SSD, AND OUR MODEL

                 Airplane           Ship               Vehicles
Methods          BEP      AP        BEP      AP        BEP      AP        mAP
Faster R-CNN     0.9807   0.9906    0.7920   0.8229    0.7160   0.7555    0.8563
SSD              0.9774   0.9856    0.8272   0.8289    0.8313   0.8759    0.8968
Ours             0.9850   0.9932    0.9462   0.9415    0.8515   0.8956    0.9432

Figure 3. Sample images from the dataset.

Table II
NUMBER OF IMAGES IN THE DATASET CATEGORIES

                          Airplane   Ship    Vehicles
Images                    5515       5426    452
After data augmentation   66242      40287   5494
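Table II lists the image counts before and after augmentation. As a rough sketch of the rotation-based augmentation described in Section IV-A, an image and its RBox annotations can be rotated together as below; the 30° step and the coordinate conventions are assumptions for illustration, since the exact augmentation recipe is not specified.

```python
# Illustrative rotation augmentation: rotate the image and its RBox labels
# together so that every copy contributes new object orientations.
import numpy as np
from PIL import Image

def rotate_sample(image, rboxes, angle_deg):
    """rboxes: array of (cx, cy, w, h, theta_deg) in pixel coordinates."""
    w_img, h_img = image.size
    rotated = image.rotate(angle_deg, expand=False)  # rotate about the image centre
    t = np.radians(angle_deg)
    cx, cy = w_img / 2.0, h_img / 2.0
    out = rboxes.copy().astype(float)
    # PIL rotates counter-clockwise; move the box centres the same way (y axis points down).
    dx, dy = out[:, 0] - cx, out[:, 1] - cy
    out[:, 0] = cx + dx * np.cos(t) + dy * np.sin(t)
    out[:, 1] = cy - dx * np.sin(t) + dy * np.cos(t)
    out[:, 4] = (out[:, 4] - angle_deg) % 180.0      # update the clockwise RBox angle
    return rotated, out

img = Image.new("RGB", (300, 300))
boxes = np.array([[150.0, 100.0, 60.0, 20.0, 10.0]])
for a in range(0, 360, 30):
    aug_img, aug_boxes = rotate_sample(img, boxes, a)
```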

C. Results

The experimental results are shown in Table I. Our model outperforms the other BBox methods, achieving state-of-the-art detection accuracy, with predicted bounding boxes counted as correct when they match the ground truth with IoU > 0.5. While implementing our model we encountered some difficulty with the airplane category, because the shape of the airplane itself makes it hard to separate the background from it, unlike the other categories; Fig. 4 shows the difference between the RBox of an airplane and that of a ship. Nevertheless, our model still achieves a better result.

Figure 4. The RBox of an airplane and of a ship.

V. CONCLUSION

In this article, we have proposed an improved CNN-based object detection model with rotatable bounding boxes. Our model accurately detects arbitrary-oriented objects thanks to its ability to estimate the orientation angle of the object, which makes it rotation invariant. It outperforms Faster R-CNN and SSD on object detection in satellite images.

ACKNOWLEDGMENT

This work was supported by the National Natural Science Fund of China [Grant Nos. 61373062, 61373063, 61473155 and 61703139]; the Project of Ministry of Industry and Information Technology of PRC [Grant No. E0310/1112/02-1]; the Fundamental Research Funds for the Central Universities [Grant No. 2015B03114]; and the Collaborative Innovation Center of IoT Technology and Intelligent Systems, Minjiang University [Grant No. IIC1701].

REFERENCES

[1] Krizhevsky, A., I. Sutskever, and G.E. Hinton. Imagenet classification with deep convolutional neural networks. in Advances in neural information processing systems. 2012.

[2] Girshick, R., et al. Rich feature hierarchies for accurate object detection and semantic segmentation. in Proceedings of the IEEE conference on computer vision and pattern recognition. 2014.

[3] Girshick, R. Fast r-cnn. in Proceedings of the IEEE international conference on computer vision. 2015.

[4] He, K., et al., Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE transactions on pattern analysis and machine intelligence, 2015. 37(9): p. 1904-1916.

[5] Szegedy, C., et al. Going deeper with convolutions. in Proceedings of the IEEE conference on computer vision and pattern recognition. 2015.

[6] Dai, J., et al. R-fcn: Object detection via region-based fully convolutional networks. in Advances in neural information processing systems. 2016.

[7] He, K., et al. Deep residual learning for image recognition. in Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.

[8] Liu, W., et al. Ssd: Single shot multibox detector. in European conference on computer vision. 2016. Springer.

[9] Redmon, J., et al. You only look once: Unified, real-time object detection. in Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.

[10] Dai, J., et al. Deformable convolutional networks. in Proceedings of the IEEE international conference on computer vision. 2017.

[11] Han, X., Y. Zhong, and L. Zhang, An efficient and robust integrated geospatial object detection framework for high spatial resolution remote sensing imagery. Remote Sensing, 2017. 9(7): p. 666.

[12] Huang, G., et al. Densely connected convolutional networks. in Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.

[13] Lin, T.-Y., et al. Focal loss for dense object detection. in Proceedings of the IEEE international conference on computer vision. 2017.

[14] Redmon, J. and A. Farhadi, YOLO9000: better, faster, stronger. arXiv preprint, 2017.

[15] Ren, S., et al., Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017(6): p. 1137-1149.

[16] Xu, Z., et al., Deformable convnet with aspect ratio constrained nms for object detection in remote sensing imagery. Remote Sensing, 2017. 9(12): p. 1312.

[17] Redmon, J. and A. Farhadi, Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.

[18] Ren, Y., C. Zhu, and S. Xiao, Deformable Faster R-CNN with aggregating multi-layer features for partially occluded object detection in optical remote sensing images. Remote Sensing, 2018. 10(9): p. 1470.

[19] Simonyan, K. and A. Zisserman, Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[20] Liu, L., Pan, Z. and Lei, B., 2017. Learning a rotation invariant detector with rotatable bounding box. arXiv preprint arXiv:1711.09405.

[21] Yang, X., et al., Automatic ship detection in remote sensing images from google earth of complex scenes based on multiscale rotation dense feature pyramid networks. Remote Sensing, 2018. 10(1): p. 132.

