
FAST MULTIDIRECTIONAL VEHICLE DETECTION ON AERIAL IMAGES

USING REGION BASED CONVOLUTIONAL NEURAL NETWORKS

Tianyu Tang, Shilin Zhou*, Zhipeng Deng, Lin Lei, Huanxin Zou

National University of Defense Technology, Changsha, Hunan, 410073, China


ABSTRACT

This paper proposes a coupled region-based convolutional neural network (R-CNN) to automatically detect vehicles in aerial images. Traditional methods are mostly based on sliding-window search and use handcrafted or shallow-learning-based features, which have limited description ability and heavy computational costs. Recently, a series of R-CNN based methods have achieved great success in general object detection. Inspired by this work, we propose a coupled R-CNN to detect small vehicles in large-scale aerial images. First, a vehicle proposal network (VPN) is proposed to generate candidate vehicle-like regions, using a hyper feature map that combines feature maps of different layers. Then, a vehicle classification network (VCN) is developed to further verify the candidate regions and classify vehicles into eight directions. Our method is tested on the challenging Munich vehicle dataset and on a collected vehicle dataset, with improvements in accuracy and speed over existing methods.

Index Terms—Vehicle detection, vehicle proposal networks, aerial image, CNN

INTRODUCTION

Vehicle detection in aerial images has received great attention in recent years [1]-[5]. However, there are still some challenges: 1) vehicles are relatively small in such large aerial images, which hinders accurate and fast localization of vehicle-like regions; 2) the variation of vehicle directions and the large number of vehicle-like structures in the background make the classification of candidate regions difficult.

In previous studies, various algorithms have been used for vehicle detection in aerial images [6]-[11]. The most common methods are based on sliding-window search. They usually use multiple handcrafted or shallow-learning-based features combined with AdaBoost or a support vector machine (SVM), and then examine each window for the presence of a vehicle. The work of [2] is worth mentioning here, as it achieved rapid and effective detection performance. The authors employed a fast binary detector using integral channel features (ICFs) to locate vehicles in large-scale aerial images within a few seconds. A multiclass classifier with histogram of oriented gradients (HOG) features was then used to assign the type and orientation attributes of vehicles. However, some problems remain in [2]. The sliding-window technique incurs heavy computational costs and is not fast enough for real-time object detection in large-scale aerial images. Moreover, handcrafted or shallow-learning-based features limit the power of the feature representation, so some false detections remain.

In the field of computer vision, the state-of-the-art object detection methods are a series of region-based CNN (R-CNN) methods [12]-[14], especially Faster R-CNN [14]. Specifically, in Faster R-CNN, a region proposal network (RPN) is used to generate candidate regions quickly, and a CNN based classifier then verifies the proposals using powerful deep features. Faster R-CNN can thus overcome the problems of [2] mentioned above. However, to the best of our knowledge, Faster R-CNN has not yet been applied to object detection in aerial images.

In this paper, we are the first to attempt to use Faster R-CNN for vehicle detection in large-scale aerial images. Furthermore, to make Faster R-CNN more suitable for the detection of small objects, we propose an improved method based on it. First, a vehicle proposal network (VPN) is proposed to generate candidate vehicle-like regions, using a hyper feature map that combines feature maps of different layers. Following the VPN, a vehicle classification network (VCN) is developed to further verify the candidate regions and classify vehicles into eight directions. In addition, deep CNN based methods always require large amounts of manually annotated training data, but annotated data for aerial images are limited. Therefore, we crop the original large aerial images into image blocks and augment the training dataset by rotation with four angles (i.e., 0°, 90°, 180° and 270°). Finally, we test our method on two challenging datasets. Comprehensive evaluations demonstrate that: 1) Faster R-CNN can be used for vehicle detection in large-scale aerial images; compared with [2], Faster R-CNN processes an image in 3.84 s (0.56 s faster than [2]) and achieves a higher precision rate and F1-score. 2) Our improved method achieves better results than Faster R-CNN, which means that it is more suitable for the detection of small objects. 3) With limited training data, we can use data augmentation and a pre-trained model to train

© IEEE, IGARSS


a robust detector. Our method achieves successful detection performance on large-scale aerial images, and therefore has great potential for wide application.

Fig. 1. The framework of our method.

PROPOSED APPROACH

The rationale of the proposed approach is illustrated in Fig. 1. Our method consists of two parts: the VPN and the VCN. To avoid over-fitting caused by limited training data, we crop the original large aerial images into image blocks and augment their number by rotation with four angles. Each image block comes with an annotation file that contains the orientation and bounding-box coordinates of each vehicle. Then, all the training data are used to train the VPN and VCN jointly. For testing, we simply crop the large test images into image blocks. Each image block is forwarded through the shared convolutional layers, and the hyper feature maps are produced. Next, the VPN is used to generate about 300 proposals. Then, these proposals are classified by the VCN. Finally, the detection results of all blocks are stitched together to recompose the original image.

Vehicle proposal network (VPN)

Based on the ZF model [16], we propose a vehicle proposal network. As shown in Fig. 1, the VPN is a fully convolutional network that generates candidate regions. The structure is as follows. The first five convolutional layers are used to compute feature maps. As Ghodrati et al. [15] indicate, deeper convolutional layers give high recall but poor localization in detection, while lower convolutional layers behave in the opposite way. Therefore, we concatenate the feature maps of three convolutional layers (Conv3, Conv4 and Conv5). Specifically, two convolutional layers (Conv31 and Conv41) are added on top of the Conv3 and Conv4 feature maps respectively, to compress them to the same number of channels as Conv5. Then, we synthesize these three feature maps into one single output cube, which we call the hyper feature map. To generate candidate regions, we add a 3×3 convolutional layer, namely Conv_inter, on the hyper feature map. At each location of the convolutional kernel, we simultaneously predict multiple region proposals associated with different scales and aspect ratios (namely anchors). As the size of a vehicle is approximately 35×35 pixels, we adopt anchors with 3 scales of 30², 40² and 50² pixels, and 3 aspect ratios of 1:1, 1:2 and 2:1. With 256 feature maps in total, we can extract a 256-d feature vector for each region proposal. Afterwards, these region proposals and their corresponding features are fed into two sibling 1×1 convolutional layers for bounding-box classification and bounding-box regression respectively (namely conv_cls and conv_bbr).

Vehicle classification network (VCN)

To infer the orientation of vehicles, we develop a vehicle classification network (VCN), which is also based on the ZF model. The VCN takes an image with vehicle-like regions proposed by the VPN as input and outputs vehicle orientations. The architecture of the VCN is shown in Fig. 1. The VCN shares convolutional layers with the VPN. A region-of-interest (ROI) pooling layer is added on the hyper feature maps, to convert the feature map of each vehicle-like region into a fixed spatial size. Then, each ROI feature map is fed into the subsequent fully connected (FC) layers. Finally, a softmax loss layer is used to classify the vehicles into different directions.

For the training of the two networks, we take two steps. In the first step, we train the VPN and VCN separately. The two networks are initialized with the ZF model pre-trained for ImageNet classification. The VPN is first trained by processing images with ground-truth bounding boxes as input, and predicts a set of vehicle-like regions. Next, the VCN is trained using those vehicle-like regions. In the second step, both networks are jointly trained by fixing the shared convolutional layers; we only fine-tune the unique layers of the VPN and VCN.
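The anchor scheme described above (3 scales of 30², 40² and 50² pixels crossed with 3 aspect ratios) can be sketched as follows. This is an illustrative reimplementation following the usual Faster R-CNN anchor convention, not the authors' code; `make_anchors` is a hypothetical helper name.

```python
import itertools

def make_anchors(cx, cy, scales=(30, 40, 50), ratios=(1.0, 0.5, 2.0)):
    """Enumerate anchor boxes (x1, y1, x2, y2) centered at (cx, cy).

    Each scale s targets an area of s*s pixels; each ratio r = h/w
    reshapes that area while preserving it, as in Faster R-CNN.
    """
    anchors = []
    for s, r in itertools.product(scales, ratios):
        w = s / r ** 0.5   # choose w, h so that w * h == s * s
        h = s * r ** 0.5   # and h / w == r
        anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

boxes = make_anchors(100, 100)
print(len(boxes))  # 9 anchors per feature-map location
```

In the network itself, this enumeration happens implicitly at every spatial position of Conv_inter; conv_cls and conv_bbr then score and refine each of the nine anchors.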

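The four-angle rotation augmentation used for training can be sketched in a few lines, assuming NumPy; the helper names are hypothetical. The label mapping is our assumption (eight direction classes spaced 45° apart, so a 90° rotation advances the class index by 2), not something the paper states.

```python
import numpy as np

def augment_rotations(block):
    """Return the 0, 90, 180 and 270 degree rotations of an image
    block, used to enlarge the training set."""
    return [np.rot90(block, k) for k in range(4)]

def rotate_direction(label, k):
    """Shift an eight-way direction label for a k * 90 degree rotation,
    assuming the eight classes are spaced 45 degrees apart."""
    return (label + 2 * k) % 8

block = np.zeros((300, 300, 3), dtype=np.uint8)  # one cropped image block
copies = augment_rotations(block)                # four training samples
```

The bounding-box coordinates in each annotation file would have to be transformed by the same rotation.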

Fig. 2. Detection and annotation results from the test images. A red box denotes a correct localization, a blue box denotes a false alarm, and a black box denotes a missed detection. (a)-(c) are image blocks of the Munich test aerial images. (d)-(h) are images from the collected vehicle data set: (d) is a UAV image, (e) and (f) are satellite images, (g) is an original test image of the Munich dataset (5616×3744 pixels), and (h) is a large satellite image of Tokyo (18239×12837 pixels).

Table 1: Performance comparisons between different methods.


Method             Ground  True      False     Recall  Precision  F1-    Time per
                   Truth   Positive  Positive  Rate    Rate       Score  image (s)
[2]                5892    4085      619       69.30%  86.80%     0.77   4.40
ACF detector       5892    3078      4062      52.24%  43.31%     0.47   4.37
ACF + Fast R-CNN   5892    2583      1540      43.84%  62.65%     0.52   6.29
Faster R-CNN       5892    4050      --        68.74%  --         0.78   3.84
Ours               5892    --        696       --      86.20%     --     --
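The recall, precision and F1-score columns of Table 1 follow the standard definitions. As a check, the sketch below reproduces the values reported for [2] from its ground-truth, true-positive and false-positive counts.

```python
def detection_metrics(gt, tp, fp):
    """Recall = TP/GT, precision = TP/(TP+FP), F1 = harmonic mean."""
    recall = tp / gt
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1

# counts for method [2] from Table 1
r, p, f1 = detection_metrics(5892, 4085, 619)
print(f"{r:.1%} {p:.1%} {f1:.2f}")  # 69.3% 86.8% 0.77
```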

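The block-wise test pipeline (crop the large image into overlapping blocks, detect per block, stitch results back into the original frame) can be sketched as below. The block size and overlap are illustrative assumptions, not values from the paper, and duplicate detections in overlap regions would typically be removed afterwards by non-maximum suppression.

```python
def crop_origins(width, height, block=500, overlap=50):
    """Top-left corners of overlapping blocks covering a large image.
    Block size and overlap are illustrative, not from the paper."""
    step = block - overlap
    xs = list(range(0, max(width - block, 0) + 1, step))
    ys = list(range(0, max(height - block, 0) + 1, step))
    # make sure the right and bottom borders are fully covered
    if xs[-1] + block < width:
        xs.append(width - block)
    if ys[-1] + block < height:
        ys.append(height - block)
    return [(x, y) for y in ys for x in xs]

def to_image_coords(box, origin):
    """Shift a (x1, y1, x2, y2) detection from block to image frame."""
    x1, y1, x2, y2 = box
    ox, oy = origin
    return (x1 + ox, y1 + oy, x2 + ox, y2 + oy)
```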
EXPERIMENTS AND DISCUSSION

Dataset

Two data sets were used in the experiments. The Munich vehicle data set contains 20 aerial images [1], whose ground sampling distance is approximately 13 cm. To provide further verification, we also test our method on another collected vehicle data set, which contains 10 UAV images and 20 satellite images. All of the UAV images have a spatial resolution of approximately 2 cm and are taken from different cities. The 20 satellite images are downloaded from Google Earth with a spatial resolution of approximately 0.08 m.

Experimental results and discussion

Experiments were implemented with the deep learning framework Caffe [17], and executed on a PC with an Intel Core i7-4790 CPU, an NVIDIA GTX-960 GPU (2 GB video memory), and 8 GB RAM. The PC operating system was Ubuntu 14.04.

Fig. 2 displays the vehicle detection results on the two data sets using the proposed approach. A red box denotes a correct localization; blue and black boxes denote a false alarm and a missed detection, respectively. As shown in Fig. 2, even for vehicles located in the shade or near the image block boundaries (where only part of the vehicle is visible), the proposed approach successfully detects most of the vehicles. Compared with [2], our approach detects bounding boxes of adaptive size instead of a fixed window size. In addition, our method is robust when applied to UAV images and to some small-scale high-resolution satellite images. However, when detecting vehicles in large-scale satellite imagery (see Fig. 2(h)), our method performs poorly.

The performance of the different methods is shown in Table 1. We compare our method with [2], with some state-of-the-art object detection methods, and with Faster R-CNN. The best performances are highlighted in bold. It can be observed that our proposed method achieves the best performance in terms of recall rate and F1-score. Faster R-CNN achieved performance similar to [2], with a higher precision rate but a slightly lower recall rate. Compared with them, our proposed method, based on a hyper feature map, improves the recall rate effectively. Compared with the ACF detector, ACF + Fast R-CNN has a higher precision rate owing to the deep features of Fast R-CNN. This shows that CNN based methods have better classification performance than traditional methods. Moreover, our method can process a large-scale aerial image (5616×3744 pixels) in 3.65 seconds.

In Fig. 3, precision-recall (PR) curves for four test images are shown. These curves show that our method achieves the best recall and precision. To further validate the ability of our method on aerial images of different scales, we resized the images for testing but not for training. The detection results for scaled test image 2 are shown in Fig. 4. Our method performs best at the scale on which it was trained. The performance remains comparable under small scale factors, but our method does not perform well under larger scale factors.

Fig. 3. PR curves of four test images in the Munich dataset with different methods.

Fig. 4. Performance after rescaling the image with different factors.

In conclusion, our method achieves the best results in both speed and accuracy, and has some migration capability.

CONCLUSION

In this paper, we present a coupled region based CNN method for fast and accurate vehicle detection in large-scale aerial images. Experimental results show that our method is faster and more accurate than existing algorithms, and is effective for images captured from UAVs or downloaded from Google Earth. However, our method still produces some false detections, as well as missed detections. In future work, we will focus on mining hard negative samples to reduce false detections. Additionally, we will investigate how to improve the robustness of our method for object detection in large-scale satellite imagery.

ACKNOWLEDGMENT

This work is supported by the National Natural Science Foundation of China under Grant 61331015. The authors would like to thank Kang Liu and Gellert Mattyus, who generously provided their image data set with the ground truth.

REFERENCES

[1] J. Leitloff, D. Rosenbaum, F. Kurz, O. Meynberg, and P. Reinartz, "An operational system for estimating road traffic information from aerial images," Remote Sensing, vol. 6, no. 11, pp. 11315–11341, 2014.
[2] K. Liu and G. Mattyus, "Fast multiclass vehicle detection on aerial images," IEEE Geoscience and Remote Sensing Letters, vol. 12, no. 9, pp. 1–5, 2015.
[3] T. Moranduzzo and F. Melgani, "Automatic car counting method for unmanned aerial vehicle images," IEEE Transactions on Geoscience and Remote Sensing, vol. 52, no. 3, pp. 1635–1647, 2014.
[4] T. Moranduzzo and F. Melgani, "Detecting cars in UAV images with a catalog-based approach," IEEE Transactions on Geoscience and Remote Sensing, vol. 52, no. 10, pp. 6356–6367, 2014.
[5] Z. Chen, C. Wang, H. Luo, and H. Wang, "Vehicle detection in high-resolution aerial images based on fast sparse representation classification and multiorder feature," IEEE Transactions on Intelligent Transportation Systems, pp. 1–14, 2016.
[6] H. Y. Cheng, C. C. Weng, and Y. Y. Chen, "Vehicle detection in aerial surveillance using dynamic Bayesian networks," IEEE Transactions on Image Processing, vol. 21, no. 4, pp. 2152–2159, 2012.
[7] W. Shao, W. Yang, G. Liu, and J. Liu, "Car detection from high-resolution aerial imagery using multiple features," in Proc. IEEE Geoscience and Remote Sensing Symposium (IGARSS), pp. 4379–4382, 2012.
[8] S. Kluckner, G. Pacher, H. Grabner, and H. Bischof, "A 3D teacher for car detection in aerial images," in Proc. IEEE International Conference on Computer Vision (ICCV), pp. 1–8, 2007.
[9] A. Kembhavi, D. Harwood, and L. S. Davis, "Vehicle detection using partial least squares," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 6, pp. 1250–1265, 2010.
[10] Z. Chen, C. Wang, C. Wen, and X. Teng, "Vehicle detection in high-resolution aerial images via sparse representation and superpixels," IEEE Transactions on Geoscience and Remote Sensing, vol. 54, no. 1, pp. 1–14, 2015.
[11] X. Chen, S. Xiang, C. L. Liu, and C. H. Pan, "Vehicle detection in satellite images by hybrid deep convolutional neural networks," IEEE Geoscience and Remote Sensing Letters, vol. 11, no. 10, pp. 1797–1801, 2014.
[12] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Region-based convolutional networks for accurate object detection and segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 1, 2015.
[13] R. Girshick, "Fast R-CNN," in Proc. IEEE International Conference on Computer Vision (ICCV), 2015.
[14] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016.
[15] A. Ghodrati, M. Pedersoli, T. Tuytelaars, A. Diba, and L. Van Gool, "DeepProposal: Hunting objects by cascading deep convolutional layers," in Proc. IEEE International Conference on Computer Vision (ICCV), 2015.
[16] M. D. Zeiler and R. Fergus, "Visualizing and understanding convolutional networks," in Proc. European Conference on Computer Vision (ECCV), 2014.
[17] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," in Proc. ACM International Conference on Multimedia, pp. 675–678, 2014.


