
DESIGN OF IMINT MILITARY TARGET DETECTION SOFTWARE USING DEEP LEARNING


Ghazi MARZOUK, Xu QIANG*, Wei LI
College of Information and Navigation
Air Force Engineering University
Xi’an, China
marzouk.ghazi@yahoo.com

Abstract: Detecting multiple targets in a high-resolution remote sensing image is a classical computer vision problem and is often described as a difficult task. This paper presents the relevant computer vision tasks using Deep Learning technology under the constraint of small training data, and the use of a pretrained Convolutional Neural Network (CNN) together with suitable preprocessing of the image dataset and an appropriate training process. The Faster R-CNN method is used for the object detection task. For practical purposes, this work detects 2 classes (airplanes and storage tanks). The dataset used for training combines an existing dataset with collected images of military aircraft. The analysis of large volumes of IMINT data becomes faster and human labor is reduced to a minimum. The software recognizes different targets in large images collected by satellites, reconnaissance UAVs or aircraft during ISR missions, with an average accuracy of 90% for the airplane class.

Index Terms: Imagery Intelligence; Remote Sensing; ISR; Deep Learning; Object Detection; Convolutional Neural Network; CNN; Faster R-CNN; Computer Vision.

1. INTRODUCTION

The growing quantity of airborne and satellite images (electro-optical, infrared, SAR, etc.) acquired for military intelligence, known as IMINT (Imagery Intelligence), leads us to seek a solution for detecting and recognizing military targets such as tanks, armored vehicles, land troops, aircraft, ships, bridges, storage tanks and buildings. Due to the large scale of the battlefield, the solution needs an automatic system to detect and recognize the targets. This system could enhance the intelligence capability of the Air Force in ISR (Intelligence, Surveillance and Reconnaissance) missions, but it can also be employed by the Land Forces or the Navy.

In order to make this system autonomous, Artificial Intelligence (AI) could be the best choice to achieve that goal. Nowadays, DL (Deep Learning) is widely used in many fields (medicine, social media, autonomous vehicle navigation, and also military applications). Target recognition in large High Resolution (HR) Remote Sensing (RS) images, such as aircraft and vehicle detection, is a challenging task due to the small size and large number of targets and the complex neighboring environments.

This work is a combination of computer vision tasks and DL technology. The most common technique for this combination is the use of the CNN (Convolutional Neural Network) and the DCNN (Deep CNN). The basic idea of the CNN was inspired by a concept in biology called the receptive field [1]. Receptive fields act as detectors that are sensitive to certain types of stimulus, for example, edges. This biological function can be approximated in computers using the convolution operation detailed in reference [2]. My work is concerned with the detection of objects (targets) in an HR Remote Sensing image. What makes object detection a distinct problem is that it involves both locating and classifying regions [3]. The location and size are typically defined using a bounding box, in the form of corner coordinates. The sub-image contained in the bounding box is then classified by an algorithm that has been trained using machine learning or, in our case, DL [4].

In this paper, the use of the CNN in the object detection task is discussed first, together with different existing methods. Then the experimental work is presented, namely the use of one of the current methods (Faster R-CNN) for the target detection task.

2. CNN IN OBJECT DETECTION

Regardless of the visual task we want to achieve, a CNN or DCNN is necessary for DL in computer vision. 2012 marked the first year in which a CNN was used to achieve a top-5 test error rate of 15.4%. Alex Krizhevsky et al. [5] achieved promising results with CNNs for the general image classification task by developing AlexNet, which achieved excellent results on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [6] dataset. Trained on ImageNet data, the AlexNet network is composed of 25 layers and uses a relatively simple layout: 5 convolutional layers, max-pooling layers, dropout layers (which combat overfitting to the training data, making the model a good choice for transfer learning with a small dataset, hence its selection here), and 3 fully connected layers [5].
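That layout can be inspected directly from the pretrained model. The following minimal sketch uses torchvision, which is an assumption of this illustration (the implementation in this paper is in MATLAB); the `weights` argument assumes torchvision 0.13 or later.

```python
from torchvision import models

# Load AlexNet with ImageNet weights (torchvision >= 0.13 weight-enum API).
alexnet = models.alexnet(weights="IMAGENET1K_V1")

print(alexnet.features)    # the 5 convolutional layers with max pooling
print(alexnet.classifier)  # dropout layers and the 3 fully connected layers
```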



Nowadays many methods exist for object detection, and they are developing from year to year according to the ILSVRC. Depending on the purpose of the detection, we can divide them into still-image methods (R-CNN, Fast R-CNN, Faster R-CNN…) and real-time methods (YOLO [7], SSD [8]…). In this section, different still-image object detection methods that utilize convolutional neural networks are introduced and compared. Then the method used in this work, Faster R-CNN [9], is discussed.

2.1. R-CNN

In 2013, Girshick et al. published a method [4] generalizing the results of AlexNet, introduced at the beginning of section 2, to object detection. This method is called R-CNN (CNN with region proposals).

R-CNN consists of 3 simple steps: First, scan the input image for possible objects using an algorithm called Selective Search [10], generating around 2000 region proposals. Second, run a CNN on top of each of these region proposals. Third, take the output of each CNN and feed it into an SVM (Support Vector Machine [11]) to classify the region, and into a linear regressor to tighten the bounding box of the object, if such an object exists [4].

In other words, R-CNN first proposes regions, then extracts features, and then classifies those regions based on their features. In essence, it turns object detection into an image classification problem. R-CNN was very intuitive, but very slow.

2.2. Fast R-CNN

Fast R-CNN was published in 2015 by Ross Girshick to provide a more practical method for object recognition [3]. The main idea is to perform the forward pass of the CNN for the entire image, instead of performing it separately for each Region of Interest (RoI). It is the immediate descendant of R-CNN. This method generates region proposals based on the last feature map of the network. As a result, we can train just one CNN for the entire image. And instead of training many different SVMs to classify each object class, there is a single Softmax layer [12] that outputs the class probabilities directly. Now there is only one neural net to train, as opposed to one neural net and many SVMs. Fast R-CNN performed much better in terms of speed. There was just one big bottleneck remaining: the Selective Search algorithm for generating region proposals.

2.3. Region proposal generation

To use R-CNN and Fast R-CNN, we need a method for generating the regions of interest. Many methods exist, such as the sliding window technique and Edge Boxes [13]; the most popular unsupervised method is Selective Search [10], which utilizes an iterative merging of super-pixels.
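As an illustration, a Selective Search proposal stage can be sketched with OpenCV's contrib module; this particular implementation is an assumption for demonstration only (the paper does not name one), and the input file name is hypothetical.

```python
import cv2  # requires the opencv-contrib-python package

image = cv2.imread("scene.jpg")  # hypothetical input image

# Build the Selective Search segmenter, hand it the image, and pick the
# fast variant (trades some recall for speed).
ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(image)
ss.switchToSelectiveSearchFast()

rects = ss.process()  # proposals as [x, y, w, h] rows
print(len(rects), "region proposals")  # R-CNN keeps roughly the top 2000
```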
Certain advanced object detection methods, such as Faster R-CNN [9] described in subsection 2.4 below, use parts of the same convolutional network both for generating the region proposals and for detection. We call these kinds of methods integrated methods.

2.4. Faster R-CNN

Faster R-CNN [9] by Ren et al. is an integrated method. The main idea is to use shared convolutional layers for region proposal generation and for detection. The authors discovered that the feature maps generated by object detection networks can also be used to generate the region proposals. The fully convolutional part of the Faster R-CNN network that generates the region proposals is called a Region Proposal Network (RPN). The authors used the Fast R-CNN architecture for the detection network.

A Faster R-CNN network is trained by alternating between training for RoI generation and for detection. First, two separate networks are trained. Then these networks are combined and fine-tuned. The trained network receives a single image as input. The shared fully convolutional layers generate feature maps from the image. These feature maps are fed to the RPN. The RPN outputs region proposals, which are input, together with the aforementioned feature maps, to the final detection layers. These layers include an RoI pooling layer and output the final classifications, as shown in Figure 1.

Figure 1: Faster R-CNN method in test mode

In this work, this method is used for the object detection task, due to the results presented in subsection 2.5 below.

2.5. Comparing the methods

Liu et al. [8] compared the performance of Fast R-CNN, Faster R-CNN and SSD on the PASCAL VOC 2007 [14] test set. When using networks trained on the PASCAL VOC 2007 training data, Fast R-CNN achieved a mean average precision (mAP) of 66.9. Faster R-CNN performed better, with a mAP of 69.9. SSD achieved a mAP of 71.6 with input size 512×512. As the standard implementations of Fast R-CNN and Faster R-CNN use 600 as the length of the shorter dimension of the input image, SSD seems to perform better with similarly sized images. However, SSD requires extensive use of data augmentation to achieve this result; Fast R-CNN and Faster R-CNN only use horizontal flipping [8].
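To make the test-mode pipeline of Figure 1 concrete, the sketch below runs a pretrained Faster R-CNN from torchvision; the ResNet-50 FPN backbone and the random-tensor input are illustrative assumptions, not the network trained in this paper.

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Pretrained Faster R-CNN (ResNet-50 FPN backbone), set to evaluation mode.
model = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

image = torch.rand(3, 600, 800)  # stand-in for a real test image tensor

with torch.no_grad():
    # One forward pass runs the shared convolutional layers, the RPN and
    # the detection head; the output holds boxes, labels and scores.
    detections = model([image])[0]

print(detections["boxes"].shape, detections["labels"], detections["scores"])
```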
3. EXPERIMENTAL METHOD

3.1. Training data

In this section, the image dataset used for this work is presented, and the image preprocessing applied to it is discussed along with the reasons for it.

3.1.1. Image source

The main source of the RS image dataset is the well-known University of California (UC) Merced dataset [15]. This dataset is a collection of aerial images (256×256 pixels in RGB space) depicting 21 land-use classes of 100 images each. Since each image comes with a single label, the dataset can only be used for image classification purposes, so I needed to enhance the images in the dataset as described in subsection 3.1.2 below. Also, the number of images per category is relatively small (100 images), hence the need to collect more images, especially of military airplanes, and to apply the preprocessing and data augmentation described in the next subsection. For the practical purpose of this work, only 2 categories were chosen: Airplane and Storage Tank.

Figure 2: Example images associated with the 2 land-use categories in the UC Merced data set: Airplane (left), Storage tank (right)

In addition to the UC Merced dataset, I collected other satellite images, generally of military transport airplanes, using the GIS (Geographic Information System) software Global Mapper [16]. I searched for air bases around the world and downloaded high-resolution images of different sizes with a spatial resolution of 0.3 m, as shown in Figure 3.

The total number of images in the dataset is 188 images of different sizes, split into 2 categories.

Figure 3: Example of images from the dataset collected using Global Mapper

3.1.2. Image preprocessing

The most time-consuming part of this work is the image preprocessing. I applied several processing techniques to augment the dataset and to try to make the network rotationally invariant. Steps 1 and 4 are illustrated in the code sketch after this list.

1. Rotating the images: The first step is to rotate the images, for two reasons. The first is to enlarge the dataset (188 images would not be enough to get good results, even with a pretrained network). The second is to try to make the detection rotationally invariant, since the CNN cannot distinguish between different rotations of a target. The idea is to make a copy of every image in the dataset at each rotation of 45° from 0° to 180°, yielding 4 new images. I chose that interval because a data augmentation function is added in the input image layer (see subsection 3.2.1 below) that flips the image vertically. In the end, while training the network, each source image theoretically yields 9 augmented images. (Total output is 940 images.)

2. Drawing the Ground Truth (GT) boxes: The second step is to draw the GT boxes around the targets in the images manually. Training an object detection CNN requires images with GT boxes labeled with the different target categories, in order to train the RPN and to learn how to localize the object against the neighboring environment. I ran a manual labelling session using the built-in tool in MATLAB 2017b [17] to get the GT boxes in the form [x y height width], as shown in Figure 4.

Figure 4: Example of drawing the GT boxes

3. Extracting the target images from the image dataset: As a third step, all the labeled targets in the dataset after step 2 are extracted by cropping. I did this step to initially train the CNN as a classifier; I believe this initializes the CNN weights to get a better result in the Faster R-CNN training. (Total output is 1353 images.) But in order to feed the CNN, images must be resized to m × m while keeping the same image ratio, hence the need for the next step.

4. Image resizing: The CNN input is a square image (for our CNN it is 227 × 227 with RGB channels), so I resized all the cropped images from step 3. The idea of the resize method is to fill the shorter side with black (zero intensity) and output a square image, so that we do not lose the image ratio and do not get a distorted target for training. After that, the output is resized to 227 × 227 pixels.

5. Creating the image datastore: This fifth step prepares the data for training. I create an image datastore with all the images from the previous step and their respective classification labels. After that, I split every category into 3 parts: a Training Dataset representing 80% of the data, a Validation Dataset (to follow the improvement of the network while training) of 10%, and a Test Dataset (to measure the accuracy) of 10%.
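The following sketch illustrates steps 1 and 4 under stated assumptions: it uses Pillow rather than the paper's MATLAB tooling, and the file name is a hypothetical example.

```python
from PIL import Image

def rotated_copies(img):
    """Step 1: four additional copies at 45, 90, 135 and 180 degrees."""
    return [img.rotate(angle, expand=True) for angle in (45, 90, 135, 180)]

def letterbox_227(img):
    """Step 4: zero-pad the short side to a square, then resize to 227x227."""
    side = max(img.size)
    canvas = Image.new("RGB", (side, side))  # black (zero-intensity) canvas
    canvas.paste(img, ((side - img.width) // 2, (side - img.height) // 2))
    return canvas.resize((227, 227))

src = Image.open("airplane_001.jpg")  # hypothetical dataset image
augmented = [src] + rotated_copies(src)
patches = [letterbox_227(im) for im in augmented]
```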
3.2. Architecture of the Network

3.2.1. CNN

For the CNN, I used the transfer learning technique on a pre-trained network, AlexNet [5] (described in section 2). The advantage of this approach is that the network has already learned a rich set of image features applicable to a wide range of images, whereas training from scratch can take too long because of the heavy computation needed. To perform the transfer learning, I deleted the last 3 layers trained for classification (the fully connected layer, the Softmax layer and the classification layer), because the original output was 1000 classes and our output is only 2 classes. I also changed the first layer, the input layer, adding a data augmentation method that applies a random vertical flip to the input image on every iteration of the mini-batch during training. This data augmentation, together with the image preprocessing explained in subsection 3.1.2, will help the Faster R-CNN detect objects with rotational invariance, given the limitation of the object detection method explained in subsection 3.2.2 below.

The network was trained on a laptop with a single Intel Core i5 CPU and 8 GB of RAM. The operating system was Windows 10, and the implementation environment was MATLAB 2017b, using the Training Dataset with the Validation Dataset described in step 5 of the image preprocessing.

The CNN is trained using SGDM (Stochastic Gradient Descent with Momentum) with a batch size of 128, a momentum of 0.9 and a learning rate of 10⁻⁴. The training required 8 epochs (shuffling the dataset on every epoch) and took around 385 minutes.
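A minimal sketch of this transfer-learning setup is given below. It is a PyTorch illustration, not the paper's MATLAB code: the single swapped classifier layer stands in for the three replaced MATLAB layers, and the hyperparameters follow the text (227×227 inputs, random vertical flip, SGDM with momentum 0.9 and learning rate 10⁻⁴).

```python
import torch
import torch.nn as nn
from torchvision import models, transforms

NUM_CLASSES = 2  # Airplane, Storage Tank

# Load ImageNet-pretrained AlexNet and swap the final 1000-way fully
# connected layer for a 2-way one.
net = models.alexnet(weights="IMAGENET1K_V1")
net.classifier[6] = nn.Linear(4096, NUM_CLASSES)

# Input pipeline: 227x227 images with a random vertical flip on every
# mini-batch iteration, mirroring the paper's augmented input layer.
train_tf = transforms.Compose([
    transforms.Resize((227, 227)),
    transforms.RandomVerticalFlip(p=0.5),
    transforms.ToTensor(),
])

# SGDM with the stated options: momentum 0.9 and learning rate 1e-4
# (the batch size of 128 would be set on the DataLoader).
optimizer = torch.optim.SGD(net.parameters(), lr=1e-4, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()
```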
After running the trained network on the test dataset, I obtained a mean diagonal accuracy of 99.98% on the 2 trained classes, as shown in the confusion matrix in Table 1.

                Airplane     Storage Tank
Airplane        0.9999       2.7189E-06
Storage Tank    4.63E-04     0.9995

Table 1: Confusion matrix of the trained CNN
This classification-trained network will be used for the object detection method.

3.2.2. Object detection method

For the object detection method, the Faster R-CNN described in subsection 2.4 is used because of its high accuracy, as shown in subsection 2.5. Faster R-CNN does not have rotation invariance implemented [9], while objects in RS imagery need to be detected regardless of rotation, due to the random orientation of the targets in the input image. This issue was overcome by the data augmentation in the input layer and the image preprocessing. The dataset created in step 2 of subsection 3.1.2 is used.

The Faster R-CNN has 2 CNNs working together. One network is the pretrained CNN described in section 2 above, transformed into a Fast R-CNN (see subsection 2.2) by adding a regression network to output the localization box represented by 4 values [x y height width]. The other one is the RPN, which outputs the RoIs. This network shares the same weights as the previous CNN, but the last layers concerned with classification are replaced by an RoI output that feeds the Fast R-CNN for classification [9]. The Faster R-CNN goes through 4-step alternating training [9]: First, train the RPN, initialized from the pretrained CNN. Second, train a separate detection network with Fast R-CNN using the proposals generated in the previous step, also initialized from the CNN. Third, fix the convolutional layers and fine-tune the layers unique to the RPN, initialized from the detector of the second step. Fourth, fix the convolutional layers and fine-tune the fully connected layers of the Fast R-CNN.

The chosen learning options of the Faster R-CNN are almost the same for the 4 steps: SGDM with a batch size of 256, a learning rate drop factor of 0.5 every 3 epochs, and shuffling of the image dataset on every epoch. The initial learning rate of the first 2 steps is 10⁻⁵, and 10⁻⁶ for the remaining steps (because those steps fine-tune the previous ones). The number of epochs is fixed at 10 for the 1st and 3rd steps and 12 for the 2nd and 4th steps.

The positive IoU (Intersection over Union of the anchor with the Ground Truth box; more details in reference [13]) is fixed to [0.6 1] (object) and the negative IoU to [0 0.3] (not object).
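The anchor-labeling rule above can be sketched in a few lines; this is an illustration only, with boxes assumed to be in [x, y, width, height] form.

```python
def iou(box_a, box_b):
    """Intersection over Union of two [x, y, w, h] boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def label_anchor(anchor, gt_box):
    """Return +1 (object), -1 (background) or 0 (ignored) for one anchor."""
    overlap = iou(anchor, gt_box)
    if overlap >= 0.6:
        return 1    # positive sample: IoU in [0.6, 1]
    if overlap <= 0.3:
        return -1   # negative sample: IoU in [0, 0.3]
    return 0        # in between: not used for training
```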
Number of targets
high accuracy result as shown in section 2.5. Faster
R-CNN don’t have rotation invariance implemented 𝐸𝑅 = 𝐹𝑃𝑅 + 𝑀𝑅
[9]. Objects in RS imagery need to be rotational in-
variant due to the random direction of the targets in
4. CONCLUSION

The goal of this work is to show the capability of object detection with DL technology, using the Faster R-CNN method, for military targets in a high-resolution RS image. To implement this method, a series of processing steps must be applied to the dataset and to the chosen pre-trained CNN, due to the small number of images in the dataset. I demonstrated the feasibility of this type of work on a single-CPU laptop.

Two object classes (Airplane, Storage Tank) are used as an example. In the future, the software can be trained to detect more classes (ships, armored vehicles, bridges, deployed troops…). There will be no limitation as long as we can collect the appropriate dataset from Remote Sensing images (including EO, IR, SAR…).

5. REFERENCES

[1] K. Fukushima, "Neocognitron: A hierarchical neural network capable of visual pattern recognition," Neural Networks, vol. 1, no. 2, pp. 119-130, 1988.
[2] D. Marr and E. Hildreth, "Theory of edge detection," Proceedings of the Royal Society of London. Series B, vol. 207, no. 1167, pp. 187-217, 1980.
[3] R. Girshick, "Fast R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, pp. 1440-1448, 2015.
[4] R. Girshick, J. Donahue, T. Darrell and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014.
[5] A. Krizhevsky, I. Sutskever and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," Advances in Neural Information Processing Systems, pp. 1097-1105, 2012.
[6] ImageNet, "Large Scale Visual Recognition Challenge (ILSVRC)," Stanford Vision Lab. [Online]. Available: http://www.image-net.org/challenges/LSVRC/.
[7] J. Redmon, S. Divvala, R. Girshick and A. Farhadi, "You Only Look Once: Unified, real-time object detection," Computer Vision and Pattern Recognition (cs.CV), 2016.
[8] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Y. Fu and A. C. Berg, "SSD: Single shot multibox detector," in European Conference on Computer Vision, Amsterdam, Netherlands, 2016.
[9] S. Ren, K. He, R. Girshick and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137-1149, June 2017.
[10] J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers and A. W. M. Smeulders, "Selective Search for object recognition," International Journal of Computer Vision, vol. 104, no. 2, pp. 154-171, 2013.
[11] C. M. Bishop, Pattern Recognition and Machine Learning, Secaucus, NJ, USA: Springer-Verlag New York, Inc., 2006.
[12] I. Goodfellow, Y. Bengio and A. Courville, Deep Learning, MIT Press, 2016.
[13] C. L. Zitnick and P. Dollár, "Edge Boxes: Locating object proposals from edges," in European Conference on Computer Vision, pp. 391-405, 2014.
[14] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn and A. Zisserman, "The PASCAL Visual Object Classes Challenge 2007." [Online]. Available: http://host.robots.ox.ac.uk/pascal/VOC/voc2007/.
[15] Y. Yang and S. Newsam, "Bag-of-visual-words and spatial extensions for land-use classification," in 18th ACM SIGSPATIAL Int. Symposium on Advances in Geographic Information Systems, San Jose, CA, USA, June 2010.
[16] Blue Marble Geographics, "Global Mapper." [Online]. Available: http://www.bluemarblegeo.com/products/global-mapper.php.
[17] MathWorks, "MATLAB." [Online]. Available: https://www.mathworks.com/products/matlab.html.
[18] D. Rumelhart, G. Hinton and R. Williams, "Learning representations by back-propagating errors," Cognitive Modeling, vol. 5, no. 3, 1986.
[19] R. Szeliski, Computer Vision: Algorithms and Applications, Springer, 2010.
[20] F.-F. Li, A. Karpathy and J. Johnson, "Convolutional Neural Networks for Visual Recognition," Stanford University CS231n, 2016.
