
FAST ANIMAL DETECTION IN UAV IMAGES USING CONVOLUTIONAL NEURAL NETWORKS
Benjamin Kellenberger, Michele Volpi, Devis Tuia

MultiModal Remote Sensing, University of Zurich (Switzerland)


{benjamin.kellenberger, michele.volpi, devis.tuia}@geo.uzh.ch

ABSTRACT

Illegal wildlife poaching poses a severe threat to the environment. Measures to stem poaching have met with only limited success, mainly due to the effort required to keep track of wildlife stocks and to track individual animals. Recent developments in remote sensing have led to low-cost Unmanned Aerial Vehicles (UAVs), facilitating quick and repeated image acquisitions over vast areas. In parallel, progress in object detection in computer vision has yielded unprecedented performance improvements, partially attributable to algorithms like Convolutional Neural Networks (CNNs). We present an object detection method tailored to detect large animals in UAV images. We achieve a substantial increase in precision over a robust state-of-the-art model on a dataset acquired over the Kuzikus Wildlife Reserve in Namibia. Furthermore, our model processes data at over 72 images per second, as opposed to 3 for the baseline, allowing for real-time applications.

Fig. 1. Example result on animal detection using the proposed model, showing ground truth and predictions.

1. INTRODUCTION

In this paper we address the task of animal detection from sub-decimeter resolution images acquired by low-cost Unmanned Aerial Vehicles. This task is of particular interest to livestock conservation, where accurate and cost-effective solutions to animal monitoring would lead to targeted counteractions to poaching [1, 2]. Such actions are of paramount importance, as can be seen from the increasing numbers of killed individuals, which rose from 13 to 668 in the case of rhinos in South Africa in the period 2007-2012 [3] and amounted to tens of thousands for African elephants in 2011 alone [2].

The answer to the need for automatic counting might be found in UAV-based monitoring systems. They allow for frequent acquisitions over large areas at sub-decimeter resolution. The task of animal detection has traditionally been carried out by manual annotation, which requires trained experts and large amounts of time. To offer a more efficient system, we propose a pipeline performing animal (object) detection based on Convolutional Neural Networks (CNNs) [4]. An example of the results achieved by our system is shown in Fig. 1. We demonstrate the performance of our model on a set of UAV images of the Kuzikus Wildlife Reserve in Namibia¹.

This work has been supported by the SNSF grant PP00P2 150593. The authors would like to acknowledge the SAVMAP project and Micromappers for providing the data and ground truth used in this work.
¹ http://www.kuzikus-namibia.de/

2. RELATED WORK

Object detection is the task of drawing bounding boxes (localization) and identifying the classes of objects (recognition) in an image. It is one of the most investigated fields in computer vision [5]. The common principle of object detection algorithms has hardly changed and still consists of two to three steps: (i.) identifying candidate locations for objects (object proposals), (ii.) extracting expressive features for each candidate region, and (iii.) classifying each location according to these features. Traditional models typically rely on hand-crafted features like HOG [6] or SIFT [7], and use classifiers like Boosting [8] for recognition. Recently, CNNs have become the state of the art in many computer vision tasks by performing end-to-end, task-specific joint learning of features and classification. Typically, standard object detection pipelines relying on CNNs perform tasks (ii.) and (iii.) jointly. For step (i.) there are mainly two paradigms:

Sliding window-based: These models perform recognition at every possible location and scale in the image [9], in a sliding window fashion.

Region proposal-based: These models suggest a large number of bounding boxes (object proposals) that are likely to contain an instance of any object. A discriminator then has the task of classifying each proposal into the different object classes or background. Examples of such models are R-CNN [10] and its extensions, e.g. Fast R-CNN [11]; a proposal generator typically employed is Selective Search [12].

Due to the large computational complexity of sliding-window detectors, object proposal-based models have traditionally been preferred. However, recent studies such as YOLO [13] exploit the intrinsic multi-scale properties of CNNs and show large speed-ups compared to methods relying on object proposals. These models also reduce problems related to inaccurate object-proposal generation.

Remote sensing tasks have a number of peculiarities that set them apart from traditional computer vision problems: objects are typically seen only from above at an absolute scale, orientation is not discriminative, and there is no absolute location prior. Some successful applications of object detectors to overhead imagery are the detection of seals on Greenland ice shelves [14] and of airplanes [15].

Models not relying on hand-crafted features and region proposals have only hesitantly been applied to remote sensing datasets. We argue that the aforementioned properties of remote sensing scenarios could actually be beneficial to object detectors using pure CNN pipelines: objects of similar classes are of comparable absolute size and the candidate search space encompasses the entire image. We propose a CNN architecture which is optimized for fast and efficient detection of small, similarly sized objects at arbitrary locations, and which does not require object proposals.

3. PROPOSED ARCHITECTURE

Figure 2 presents the architecture of our model. We base it on a pre-trained instance of AlexNet [4], which has given good results in natural image classification [4] and has already been used successfully in object detection [11]. Under the assumption that low-level features are similar between different image analysis tasks (e.g. color gradients or edge detectors), we fine-tune the first layers from AlexNet and add new learnable layers on top.

Fig. 2. Architecture of the proposed model. It is based on a pre-trained AlexNet and learns specific features for recognition and localization in two separate branches.

We adopt a two-branch strategy in our CNN. Our network performs animal recognition (i.e., assigning a local likelihood score for the presence of an animal) and localization (i.e., localizing plausible animals) in a hybrid-parallel fashion. By doing so, we let the two branches of the network learn complementary aspects: the former learns the local appearance of animals, while the second learns the size of animals based on the local likelihood provided by the first branch. Both branches learn directly from AlexNet features.

In this implementation, we predict locally over 24 × 24 cells (from 224 × 224 pixel inputs, as explained in the next section): during the forward pass, each cell receives a confidence score on the presence of an animal (from the first branch) and an estimate of the height and width of the most likely bounding box (from the second branch). The former is learned using two convolutional blocks (blocks 1a and 2a in Fig. 2). The latter is learned by using features learned directly from the image (the 128 filters coming from block 1b in Fig. 2), stacked with the confidence score used to perform recognition in the first branch, therefore letting the localization branch know the spatial extent of detections.

The confidence in the recognition branch is normalized using a sigmoid. We observed that using a sigmoid activation function reduced the chances of exploding gradients. In turn, BatchNorm layers [16] reduce vanishing gradient effects. The constrained output range of [0, 1] facilitates setting thresholds on the final confidence map. Further studies are needed to evaluate alternative scoring functions, such as Softmax. Note that during backpropagation the two branches communicate through the updates of the shared AlexNet features, therefore performing multi-task learning.
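To make the two-branch design concrete, the following is a minimal PyTorch-style sketch of the head described above. It is an illustration rather than the authors' implementation: the filter counts (except the 128 filters of block 1b), the truncation point of AlexNet, and the adaptation required to obtain a 24 × 24 feature map from a 224 × 224 input are all assumptions.

    import torch
    import torch.nn as nn
    from torchvision.models import alexnet

    class TwoBranchDetector(nn.Module):
        def __init__(self):
            super().__init__()
            # Shared, fine-tuned AlexNet feature extractor, truncated after
            # the second pooling stage (192 channels). Assumption: strides are
            # adapted elsewhere so that a 224 x 224 input yields a 24 x 24 map.
            self.features = alexnet(weights='DEFAULT').features[:6]
            # Recognition branch (blocks 1a and 2a): per-cell confidence.
            self.recognition = nn.Sequential(
                nn.Conv2d(192, 128, kernel_size=3, padding=1),
                nn.BatchNorm2d(128),
                nn.ReLU(inplace=True),
                nn.Conv2d(128, 1, kernel_size=1))
            # Localization branch (block 1b): 128 filters learned directly
            # from the shared features.
            self.loc_features = nn.Sequential(
                nn.Conv2d(192, 128, kernel_size=3, padding=1),
                nn.BatchNorm2d(128),
                nn.ReLU(inplace=True))
            # Height and width per cell, predicted from the 128 localization
            # filters stacked with the confidence map.
            self.loc_head = nn.Conv2d(128 + 1, 2, kernel_size=1)

        def forward(self, x):
            f = self.features(x)
            conf = torch.sigmoid(self.recognition(f))  # (B, 1, H, W), in [0, 1]
            size = self.loc_head(torch.cat([self.loc_features(f), conf], dim=1))
            return conf, size                          # confidence and (h, w) per cell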

4. EXPERIMENTS

4.1. Dataset

We evaluate our pipeline on a dataset of UAV images of the Kuzikus Wildlife Reserve in central Namibia, acquired by the SAVMAP project² in May 2014 [17]. The campaign resulted in 654 RGB orthorectified images³. Ground truth was established via a crowdsourcing campaign organized by MicroMappers⁴ to retrieve the positions of large animals. Some missing animals were manually added and bounding boxes were refined to exclude animal shadows. In the end, a total of 1196 animals could be identified.

² http://lasig.epfl.ch/savmap
³ An example can be found at http://dx.doi.org/10.5281/zenodo.16445
⁴ https://micromappers.wordpress.com


4.2. Model Training

We divide all images (of size 4000 × 3000 pixels) into 224 × 224 sub-frames to match the predefined AlexNet input size, yielding a total of 1379 frames. Of those, 690 frames (50%) with 1004 animal bounding boxes are used for training, 276 (20%; 372 bounding boxes) for validation and 413 (30%; 509 animals) for testing. The number of training examples is relatively small and could potentially lead to overfitting. However, note that only a small portion of the weights is learned from scratch, as we employ the pre-trained AlexNet in the common branch of our model and only fine-tune it. We use extensive data augmentation, including mirroring (horizontal and vertical), rotations and shifting (horizontal and vertical), as well as adding small Gaussian noise to the images.
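As a sketch of this preprocessing, assuming the images are loaded as numpy arrays (the function names are illustrative, and the rotation angles and noise scale are not specified in the paper; shifting is omitted for brevity):

    import numpy as np

    def tile_image(img, tile=224):
        # Cut an (H, W, 3) array into non-overlapping tile x tile sub-frames.
        h, w = img.shape[:2]
        return [img[y:y + tile, x:x + tile]
                for y in range(0, h - tile + 1, tile)
                for x in range(0, w - tile + 1, tile)]

    def augment(frame, rng):
        # Mirroring, 90-degree rotations and small Gaussian noise.
        if rng.random() < 0.5:
            frame = frame[:, ::-1]   # horizontal mirror
        if rng.random() < 0.5:
            frame = frame[::-1, :]   # vertical mirror
        frame = np.rot90(frame, k=int(rng.integers(4)))
        noisy = frame.astype(np.float32) + rng.normal(0.0, 2.0, frame.shape)
        return np.clip(noisy, 0, 255).astype(np.uint8)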
We train our model with stochastic gradient descent with a momentum of 0.9 for 300 epochs, gradually reducing the learning rate from 10⁻⁴ to 10⁻⁷. We employ a weight decay of 0.001 and, for testing, use the average model of the last ten epochs.
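A minimal optimizer setup matching these hyperparameters could look as follows (a sketch: the exact schedule between 10⁻⁴ and 10⁻⁷ is not specified, so an exponential decay over the 300 epochs is assumed, and TwoBranchDetector refers to the illustrative model sketch in Section 3):

    import torch

    model = TwoBranchDetector()
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-4,
                                momentum=0.9, weight_decay=0.001)
    gamma = (1e-7 / 1e-4) ** (1.0 / 300)   # multiplicative decay per epoch
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=gamma)

    for epoch in range(300):
        # ... one training pass over the sub-frames ...
        scheduler.step()
    # For testing, the weights of the last ten epochs would be averaged.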
For training, we backpropagate over all grid cells scoring with high confidence outside the ground truth bounding boxes (as negative examples) and over all grid cells corresponding to ground truth bounding boxes (as positives, even when there are multiple detections on a single bounding box). We expect this procedure to make the model aware of all the appearance variations of the animals. We filter out overlapping bounding box predictions using Non-Maximum Suppression (NMS) as a post-processing step.
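As an illustration, the sketch below derives a binary target map on the prediction grid from the ground truth boxes of a sub-frame, and implements a simple greedy NMS such as could be used for the post-processing step. The names and the exact cell-assignment rule are our reading of the text, not the authors' code.

    import numpy as np

    def make_target(boxes, grid=24, img_size=224):
        # boxes: ground truth (x0, y0, x1, y1) in pixels. Every cell
        # overlapping a ground truth box becomes a positive (1); all
        # remaining cells are negatives (0).
        cell = img_size / grid
        target = np.zeros((grid, grid), dtype=np.float32)
        for x0, y0, x1, y1 in boxes:
            c0, r0 = int(x0 // cell), int(y0 // cell)
            c1, r1 = int(np.ceil(x1 / cell)), int(np.ceil(y1 / cell))
            target[r0:r1, c0:c1] = 1.0
        return target

    def nms(boxes, scores, iou_thresh=0.3):
        # Greedy non-maximum suppression over (x0, y0, x1, y1) boxes.
        boxes = np.asarray(boxes, dtype=float)
        areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
        order = np.argsort(scores)[::-1]
        keep = []
        while order.size > 0:
            i, rest = order[0], order[1:]
            keep.append(int(i))
            xx0 = np.maximum(boxes[i, 0], boxes[rest, 0])
            yy0 = np.maximum(boxes[i, 1], boxes[rest, 1])
            xx1 = np.minimum(boxes[i, 2], boxes[rest, 2])
            yy1 = np.minimum(boxes[i, 3], boxes[rest, 3])
            inter = np.maximum(0, xx1 - xx0) * np.maximum(0, yy1 - yy0)
            iou = inter / (areas[i] + areas[rest] - inter)
            order = rest[iou <= iou_thresh]
        return keep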
As a baseline, we train a Fast R-CNN model using the same pre-trained AlexNet base network, with object proposals from Selective Search [12]. This model corresponds to the CaffeNet model described in [11]. For both models, the NMS threshold and the detection cutoff (the threshold above which a region is detected as "animal") were selected to yield the maximum F1 score on the validation set.

5. RESULTS AND DISCUSSION

Table 1. Detection results. A bounding box is counted as correct if its IoU with the closest ground truth exceeds 25%.

                          Fast R-CNN (baseline)   Proposed model
    Ground truth objects  509                     509
    True positives        429                     379
    False positives       843                     254
    False negatives       80                      130
    Precision (UA)        0.34                    0.60
    Recall (PA)           0.84                    0.74
    F1 score              0.48                    0.66
    Avg. speed [Hz]       2.96                    73.62

Table 1 presents the performance measures obtained on the held-out test set; visual examples of detections are displayed in Fig. 3. The PASCAL VOC challenge requires a predicted bounding box to have an Intersection-over-Union (IoU) score above 50% with the closest ground truth bounding box in order to be counted as a true positive [18]. However, note that this threshold is optimized for problems where objects occupy a large fraction of the image. Assuming a fixed image size, deviations of a prediction from its closest ground truth have a much more severe impact (in IoU terms) on small bounding boxes than on large ones. In our case, the coverage of animal bounding boxes hardly exceeds one percent of the image area. We therefore counted a prediction as a true positive if its IoU exceeded 25%. However, we retained the rule that multiple predictions of the same object count as only one positive hit, which is crucial when automatically counting instances.
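This matching rule can be summarized in a short sketch (the names are illustrative; treating duplicate hits on an already-matched object as false positives follows the usual PASCAL convention, which the rule above implies but does not state explicitly):

    import numpy as np

    def box_iou(a, b):
        # IoU of two (x0, y0, x1, y1) boxes.
        x0, y0 = max(a[0], b[0]), max(a[1], b[1])
        x1, y1 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, x1 - x0) * max(0, y1 - y0)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area_a + area_b - inter)

    def evaluate(preds, gts, thresh=0.25):
        # preds sorted by descending confidence; each ground truth box can
        # be matched at most once.
        matched, tp, fp = set(), 0, 0
        for p in preds:
            ious = [box_iou(p, g) for g in gts]
            best = int(np.argmax(ious)) if ious else -1
            if best >= 0 and ious[best] > thresh and best not in matched:
                matched.add(best)
                tp += 1
            else:
                fp += 1
        return tp, fp, len(gts) - len(matched)

With the counts of Table 1, precision = TP / (TP + FP) gives 429 / (429 + 843) ≈ 0.34 for the baseline and 379 / (379 + 254) ≈ 0.60 for the proposed model, matching the table.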
In comparison to Fast R-CNN, we observe a significant increase in precision (User's Accuracy; UA) of about 0.25 points. This improvement is attributable to a lower number of false positives (254 vs. 843). On the one hand, our model has to evaluate far fewer potential candidates (24² = 576, against up to 5000 proposals), and thus the upper bound on the overall false alarm rate is lower. On the other hand, the object probability map from our model allows tuning the localization in one backward pass, a property that does not hold for proposal-based methods. Looking at the examples in Fig. 3, the lower false alarm rate can indeed be confirmed.

Our model struggles to detect certain animals, leading to a recall rate (Producer's Accuracy; PA) that is 0.1 points lower. To some extent, our model is more prone to predicting multiple, smaller hits per animal, as can be seen in the bottom-left two images in Fig. 3. This issue could be partially mitigated by employing stronger NMS, although this does not correct for too-small bounding box sizes. Instead, predicting on a slightly coarser confidence grid (i.e., smaller than 24 × 24) could lead to improvements, as it decreases the chances of a single animal being located on multiple prediction cells.

Finally, while Fast R-CNN is able to evaluate an average of 2.96 images per second⁵, our model processes 72.65 images in the same amount of time, both evaluated on a GTX 980 Ti graphics card. Our model could thus be used for real-time applications, reducing latency in wildlife monitoring.

⁵ This estimate does not include the time required to calculate the object proposals in Fast R-CNN.


Fig. 3. Detection examples on the test set for our model (blue, with IoU scores for predictions) and the Fast R-CNN baseline (cyan); ground truth is in green. The top row shows correct predictions by our model, while the bottom row shows failure cases: multiple detections of a single animal, false positives, and missed animals.

6. CONCLUSION

Illegal poaching of wildlife animals remains a global threat and calls for measures for near real-time monitoring of livestock. In this paper, we proposed an animal detection system able to operate efficiently on sub-decimeter images acquired by UAVs. Experiments have shown our proposed method to be far more precise in predicting the location of animals in images than the state-of-the-art Fast R-CNN model. Our model is able to predict sufficiently accurate bounding boxes while producing significantly fewer false positives. Moreover, our system is able to operate in near real time (73 Hz), which is a promising characteristic for more ubiquitous deployment in real-life, automated animal monitoring systems.
7. REFERENCES

[1] M. Mulero-Pázmány, R. Stolper, L. D. van Essen, J. J. Negro, and T. Sassen, "Remotely piloted aircraft systems as a rhinoceros anti-poaching tool in Africa," PLoS One, vol. 9, no. 1, pp. 1–10, 2014.

[2] G. Wittemyer, J. M. Northrup, J. Blanc, I. Douglas-Hamilton, P. Omondi, and K. P. Burnham, "Illegal killing for ivory drives global decline in African elephants," Proc. Natl. Acad. Sci., vol. 111, no. 36, pp. 13117–13121, 2014.

[3] D. Biggs, F. Courchamp, R. Martin, and H. P. Possingham, "Legal trade of Africa's rhino horns," Science, vol. 339, no. 6123, pp. 1038–1039, 2013.

[4] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," Adv. Neural Inf. Process. Syst., pp. 1–9, 2012.

[5] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, "ImageNet large scale visual recognition challenge," Int. J. Comput. Vis., vol. 115, no. 3, pp. 211–252, 2015.

[6] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," CVPR, pp. 886–893, 2005.

[7] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," Int. J. Comput. Vis., vol. 60, no. 2, pp. 91–110, 2004.

[8] A. Torralba, K. P. Murphy, and W. T. Freeman, "Sharing features: efficient boosting procedures for multiclass object detection," CVPR, vol. 3, pp. 762–769, 2004.

[9] D. Park, D. Ramanan, and C. Fowlkes, "Multiresolution models for object detection," ECCV, vol. 6314, pp. 1–14, 2010.

[10] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," CVPR, pp. 580–587, 2014.

[11] R. Girshick, "Fast R-CNN," in ICCV, December 2015.

[12] J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, and A. W. M. Smeulders, "Selective search for object recognition," Int. J. Comput. Vis., vol. 104, no. 2, pp. 154–171, 2013.

[13] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You Only Look Once: unified, real-time object detection," arXiv, 2016.

[14] A.-B. Salberg, "Detection of seals in remote sensing images using features extracted from deep convolutional neural networks," IGARSS, pp. 1893–1896, 2015.

[15] F. Zhang, B. Du, L. Zhang, and M. Xu, "Weakly supervised learning based on coupled convolutional neural networks for aircraft detection," IEEE Trans. Geosci. Remote Sens., vol. 54, no. 9, pp. 1–11, 2016.

[16] S. Ioffe and C. Szegedy, "Batch normalization: accelerating deep network training by reducing internal covariate shift," arXiv, pp. 1–11, 2015.

[17] F. Ofli, P. Meier, M. Imran, C. Castillo, D. Tuia, N. Rey, J. Briant, P. Millet, F. Reinhard, M. Parkan, and S. Joost, "Combining human computing and machine learning to make sense of big (aerial) data for disaster response," Big Data, vol. 4, no. 1, pp. 47–59, 2016.

[18] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The PASCAL visual object classes (VOC) challenge," Int. J. Comput. Vis., vol. 88, no. 2, pp. 303–338, 2010.