
Article

A Performance Comparison and Enhancement of Animal Species Detection in Images with Various R-CNN Models
Mai Ibraheam 1, *, Kin Fun Li 1 , Fayez Gebali 1 and Leonard E. Sielecki 2

1 Department of ECE, University of Victoria, Victoria, BC V8W 3A4, Canada; kinli@uvic.ca (K.F.L.);
fayez@uvic.ca (F.G.)
2 British Columbia Ministry of Transportation and Infrastructure, Victoria, BC V8W 9T5, Canada;
Leonard.Sielecki@gov.bc.ca
* Correspondence: maieelgendy@uvic.ca

Abstract: Object detection is one of the vital and challenging tasks of computer vision. It supports a
wide range of applications in real life, such as surveillance, shipping, and medical diagnostics. Object
detection techniques aim to detect objects of certain target classes in a given image and assign each
object to a corresponding class label. These techniques proceed differently in network architecture,
training strategy and optimization function. In this paper, we focus on animal species detection as an
initial step to mitigate the negative impacts of wildlife–human and wildlife–vehicle encounters in
remote wilderness regions and on highways. Our goal is to provide a summary of object detection
techniques based on R-CNN models, and to enhance the performance of detecting animal species
in accuracy and speed, by using four different R-CNN models and a deformable convolutional
neural network. Each model is applied to three wildlife datasets, and the results are compared and analyzed
using four evaluation metrics. Based on the evaluation, an animal species detection system is proposed.
Keywords: deep learning; convolutional neural network (CNN); region-based CNN (R-CNN) models; Deformable CNN (D-CNN); animal species detection

1. Introduction
Object detection has been widely studied to identify objects within an image to a predefined set of object classes (object identification) and where these objects are in the image (object localization) using bounding boxes [1]. It is a basic step for computer vision and image understanding. In recent years, most of the object detectors use Deep Learning Neural Networks (DNNs) including Convolutional Neural Networks (CNNs) architectures. CNNs have several blocks of multi convolution and pooling layers to extract features such as edges, textures, and shapes, etc., and to identify and locate objects in an image [2–6].

An object detection framework using a Region-based CNN (R-CNN) model can be divided into four stages: (i) region of interest (RoI) selection, also known as region proposals; (ii) features extraction for each region proposal using CNN; (iii) region classification (which objects are in each proposal); and (iv) object localization by combining overlapped region proposals into a single bounding box around each detected object using bounding box regression [7–11]. All these processes are time consuming, thus making R-CNN slow. Several models have been proposed to improve R-CNN, including Fast R-CNN [10], Faster R-CNN [7], and Mask R-CNN [11], to speed up object detection.

The most important step in the object detection task is the extraction of significant features, in order to identify and localize objects in the image with high accuracy. However, CNN is unable to deal with the geometric deformation of objects in images. In our study of animal species detection, Deformable CNN (D-CNN) is used to improve object features extraction under different geometric deformation conditions and thus object detection accuracy is improved, as concurred by [12,13].




In this paper, we focus on improving the accuracy and detection speed of animal species identification and localization. This is achieved by enhancing the extracted features through adding deformable convolutional layers to the four R-CNN models. To the best of our knowledge, there is no existing work that uses this technique in animal species detection. The effect of adding these convolutional layers to the four R-CNN detectors is investigated by evaluating the performance of these detectors using three animal datasets. The False Negative Rate (FNR) evaluation metric is added to the performance evaluation; to our knowledge, this has not been done before with animal species detection. This metric is important to determine how well a model can be used in applications that require a minimal false negative rate, such as animal species detection in remote wilderness regions and on highways to warn hikers and drivers about the presence of dangerous animals.
The rest of the paper is organized as follows: Section 2 summarizes related work in
object detection, and in particular, animal species detection. Section 3 presents an overview
of the basic CNN architecture. Section 4 introduces the four R-CNN models of interest.
Section 5 describes the three datasets used in our experiments. Section 6 presents the
methodology of animal species detection. Section 7 compares and analyzes the results of
animal species detection using various R-CNN models, in detection speed and accuracy,
with and without deformable convolutional layers. Finally, Section 8 concludes the paper
and discusses desirable enhancements and future works.

2. Related Work in Detection


2.1. Object Detection
The traditional neural network classifiers extract features from images by using image
processing feature extraction descriptors such as Haar [14], Histogram of Oriented Gradi-
ents (HOG) [15], and Scale Invariant Feature Transform (SIFT) [16]. These neural networks
could not provide high object detection accuracy, as has been shown with commonly used
datasets such as ImageNet [6] and MS COCO [17]. Therefore, improvement attempts have
been made to the traditional neural networks with image processing descriptors for better
extraction of significant features to improve the accuracy as well as to avoid intensive
computation and memory usage [8,18,19].
DNNs have a deeper structure, and densely connected neural networks such as CNNs have become popular since the mid-2000s [20]. The characteristics which differentiate
DNNs from the traditional neural networks can be summarized as [21]: (1) requiring
large-scale annotated training data for learning, (2) enabling high-performance parallel
computing system like GPU clusters, (3) having a sophisticated and advanced design of
network structures and training strategies, and (4) using high-level characteristics of the
object in feature learning strategy. Without the use of image processing feature extraction
descriptors, deep CNNs can extract low (edges), mid (corners and textures), and high
(parts of objects) level features as shown in Figure 1 [22]. These features can be enhanced
by increasing the number of layers (depth) [23,24]. Because these neural networks are very
deep, the training achieves high accuracy in complex tasks such as object detection in real-
time applications. Examples of these DNNs are AlexNet [3], GoogLeNet [25], VGGNet [23],
and ResNet [26].
Object detection based on DNN was introduced in Pascal Visual Object Classes (VOC)
challenge in 2006 [27]. Since 2014, the ImageNet Large Scale Visual Recognition Chal-
lenge (ILSVRC) has become the main benchmark for object detection using CNNs [4,6,7].
Krizhevsky et al. [3] developed a CNN to create a bounding box around an object; how-
ever, it does not work well in images with multiple objects. Girshick et al. [9] combined
region proposals with CNNs, and called their method R-CNN detector, i.e., Regions with
CNN features. Due to the success of the region proposal methods, Fast R-CNN [10] was
proposed to reduce the computational complexity of CNN, thus improving object detec-
tion speed and accuracy. Ren et al. [7] merged region proposal network (RPN) and Fast
R-CNN into a single network called Faster R-CNN, to achieve further speed up and higher
object detection accuracy. Later, Faster R-CNN was extended by predicting segmentation

masks at the pixel level for each object instance along with a bounding box; this method is called
Mask R-CNN [11]. All these improvements are significant and can be applied to animal
species detection.

Figure 1. DNNs have several hidden training layers of extracted features.

2.2. Animal Species Detection


There are many attempts to identify animals by assigning a label to an image; however,
there are limited works in the literature that focus on animal species detection, where the
location of the animal is determined as well as its identification [28–41]. Some researchers
used their own datasets which contain one or only a few animal species, and others used
relatively small datasets (a few thousand images only) [28,31,32]. Some researchers relied
on feature extraction descriptors to classify animals [29,30]; however, several recent works
have used CNNs.
Yu et al. [31] manually cropped and selected images that only contained the entire animal body. They used a dataset which consists of over 7000 Infrared (IR) images captured by
motion detection camera, called camera-trap, from two different field sites. This technique
of cropping images allowed them to obtain 82% accuracy by using linear support vector
machine (SVM) to classify 18 animal species. Kwan et al. [32,33] used IR videos to classify
and localize objects taken from different distances, and achieved a mean average precision of 89.4% by using the YOLO model. Chen et al. [34] used a 6-layer CNN to classify 20 animal
species in their own dataset of 23,876 images with an accuracy of 38.32%. The authors used
a segmentation algorithm for cropping the animals from the images and used these cropped
images to train and test their system. Gomez et al. [35] used deep CNNs to identify animal
species in the Snapshot Serengeti dataset. They reached an accuracy of 88.9% in Top-1 (the highest probability prediction matches the actual class) and 98.1% in Top-5 (one of the five highest probability predictions matches the actual class). Willi et al. [36] identified animal
species by using CNNs. They achieved an accuracy of 92.5% in Snapshot Serengeti dataset,
and an accuracy of 91.4% in Snapshot Wisconsin dataset. Norouzzadeh et al. [37] used a
human labeling process to train a deep active learning system to classify and count animals in images. Their system achieved an accuracy of 92.9% on cropped animal images
from the Snapshot Serengeti dataset, by using ResNet-50 as a backbone network for their
model. Furthermore, Norouzzadeh et al. [38] used CNN and reported an accuracy of 93.8%
in classifying images that contain only a single animal in the Snapshot Serengeti dataset.
The performance matched human accuracy in their experiments. However, though this
work showed promising results for classifying images with only a single animal, it could
not handle the challenge of localizing several animals.
Parham et al. [39] used the YOLO detector to detect zebras from a dataset of 2500 images
and created bounding boxes of Plains Zebras with an accuracy of 55.6%, and Grevy’s Zebras
with an accuracy of 56.6%. Zhang et al. [40] created a dataset of 23 different species in both
daytime color and nighttime grayscale formats from 800 camera-traps. They compared
Fast R-CNN, Faster R-CNN, and their proposed method (spatiotemporal object proposal
and patch verification framework) which achieved an average F-measure score of 82.1% for
animal species detection. Xu et al. [41] evaluated the Mask R-CNN model for the detection
and counting of cattle (single class) from quadcopter imagery. They achieved accuracy of
94%. Gupta et al. [42] used the Mask R-CNN model with a pre-trained network ResNet-101
to detect two animal species (cows and dogs). They achieved an average precision of 79.47% in detecting cows and 81.09% in detecting dogs.
The objectives of our work are to detect multiple animals and their species in images and annotate them with bounding boxes. The three datasets used, Snapshot Serengeti, images collected by the Wildlife Program of the British Columbia Ministry of Transportation and Infrastructure (BCMOTI), and Snapshot Wisconsin, are challenging as they are all imbalanced and contain a relatively large number of animal species, thirteen in total. Furthermore, there are animal species that have similar appearance. We investigate the use of D-CNNs to enhance the detection performance.
3. Overview of CNN

CNNs have varying accuracy performance on image classification (classifying what is contained in an image). The number of computation layers used for feature learning from input images differs depending on the visual task [23–26]. This section provides an overview of regular CNN and D-CNN.
3.1. Regular CNN

Regular CNN is a deep learning algorithm which can be used to analyze input images for computer vision tasks such as image classification and object detection [4]. As shown in Figure 2, CNN has two main parts: feature learning and classification.

Figure 2. Illustration of an example CNN architecture in animal classification. Convolution layers are denoted as Conv., and pooling layers are denoted as Pool. Multi-hidden layers consist of n hidden layers (Conv. n + Pool. n), depending on the input image and the visual task. The Fully Connected Layers (FCLs) flatten the output of the previous layers, which is called feature maps, and output them to the Softmax layer to classify the object in the input image into different probabilities.
For feature learning, each layer in the multi-hidden layers (convolution layers plus pooling layers) performs convolution and pooling operations on its input data to produce a feature map, which is a matrix representing different pixel intensities for the whole image [43,44].

As shown in Figure 3, convolution is performed between a sliding flipped filter window (kernel) with learned weights and a local small region of the input that has the same size as that filter (the receptive field), followed by a non-linear activation function, the Rectified Linear Unit (ReLU); the weights are learned through a back propagation process to extract the objects' features within the image, regardless of their location [43]. This procedure is repeated by applying multiple filters to produce a number of feature maps. Pooling is a down-sampling operation, applied to the output of the convolution layer, to decrease the amount of redundant information, thus reducing computations and enabling the extraction of the most significant features which are related to objects in the input image by using one of the pooling methods [45]. The most common pooling methods are average pooling and max pooling, which calculate the average value for each region on the feature map or extract the maximum value from each region of the feature map, respectively, as shown in Figure 4. The max pooling function is better than average pooling in the object detection task where it helps in avoiding overfitting and in making the pooling layer output invariant to small translations of the input [45,46]. Invariance to translation means that if the input has been translated by a small quantity, most of the pooled outputs' values do not change [46]. The process of convolutional and pooling layers is repeated n times through multiple stacked layers of computation, and n is determined by the data and the visual task.

Figure 3. An example of convolution operation with a kernel size of 3 × 3. (a) Input (numbers represent pixel intensities); (b) Kernel (numbers represent learned weights of the filter); (c) Output of element-wise multiplication between (a,b); (d) Feature map (sum of all elements in (c)).

Figure 4. An example of pooling operation with a pool filter size of 2 × 2. (a) Input matrix of pooling layer (feature map) of size 4 × 4; (b) Output of max pooling operation; (c) Output of average pooling operation.
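To make the convolution and pooling operations of Figures 3 and 4 concrete, the following minimal NumPy sketch (with illustrative values, not taken from the paper) computes one feature map with a 3 × 3 kernel, applies ReLU, and then performs 2 × 2 max and average pooling:

```python
import numpy as np

image = np.random.randint(0, 256, size=(6, 6))            # input pixel intensities (illustrative)
kernel = np.array([[1, 0, -1], [1, 0, -1], [1, 0, -1]])    # learned 3 x 3 filter weights

def convolve2d(x, k):
    """Slide the kernel over the input; each output value is the sum of an
    element-wise product between the kernel and one receptive field
    (deep learning libraries typically slide the kernel without flipping it)."""
    kh, kw = k.shape
    out = np.zeros((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def pool2d(x, size=2, mode="max"):
    """Down-sample a feature map with non-overlapping size x size windows."""
    h, w = x.shape[0] // size, x.shape[1] // size
    windows = x[:h * size, :w * size].reshape(h, size, w, size)
    return windows.max(axis=(1, 3)) if mode == "max" else windows.mean(axis=(1, 3))

feature_map = np.maximum(convolve2d(image, kernel), 0)     # ReLU activation
print(pool2d(feature_map, 2, "max"))                       # max pooling (Figure 4b)
print(pool2d(feature_map, 2, "average"))                   # average pooling (Figure 4c)
```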
For classification, the Fully Connected Layers (FCLs) in Figure 2 are the output layers which flatten the outputs of the previous layers, the feature maps, to a single vector that can be used as an input for the Softmax layer. Each input is connected to all neurons, represented as circles in Figure 2, to predict the class of the object in the input image with an activation function, Softmax, which converts the output values to conditional probabilities (normalized classification scores) for prediction, where each value ranges between 0 and 1 and all values sum to one [3,47]. The architecture of CNN has the capability to learn and extract object features, and to merge several tasks together, for example, object detection and segmentation.

Regular CNNs have been built on a fixed and known geometric structure, so they cannot deal with any geometric variations in the object such as pose, scale, viewpoint, and deformation parts [42], as illustrated in Figure 5. To solve this issue, CNN has been trained on datasets with sufficient variation, or on augmented data by changing the size, shape, and rotation angle of the object, to attain high detection accuracy. Although the problem has been solved, the training is very complex and therefore expensive. To enhance the capability of CNN to deal with geometric variations in objects or deformations without using data augmentation, D-CNNs were introduced [12,13].
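As a concrete illustration of the Softmax conversion described in the classification stage above, the following minimal sketch (with made-up scores, not values from the paper) turns raw FCL outputs for three classes into probabilities that lie between 0 and 1 and sum to one:

```python
import numpy as np

# Made-up raw scores (logits) from the last fully connected layer for three classes.
logits = np.array([4.2, 0.3, -1.0])            # e.g., moose, elk, deer
probs = np.exp(logits - logits.max())          # subtract the max for numerical stability
probs /= probs.sum()                           # normalized scores: each in [0, 1], sum to 1
print(dict(zip(["moose", "elk", "deer"], probs.round(3))))
```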

Figure 5. Example of images that contain geometric variations in the object (moose) which make it difficult to be identified by using regular CNN.

3.2. D-CNN

The idea of D-CNN is to replace the regular sampling matrix that has fixed locations, as the 3 × 3 blue points in Figure 6a, with the deformable sampling matrix that has movable locations as the orange points in Figure 6b,c. These orange points are redistributed to other locations depending on the shape of the object with learned augmented offsets (the green arrows). The structure of the deformable sampling matrix can be obtained by a convolution algorithm that calculates the offset of the sampling position to learn the objects' geometrical properties [12,13]. Each point in the regular sampling matrix is moved after adding the learnable offset to each of them, resulting in a deformable sampling matrix.
Figure 6. Illustration of the sampling locations in 3 × 3 regular and deformable sampling matrices. (a) Regular sampling matrix (blue points); (b) Deformable sampling matrix (orange points) with offsets (green arrows); (c) Example of how the positions of the deformable sampling matrix are changed from the original 3 × 3 squared positions according to the object's shape to identify deformed or occluded objects in the image.

D-CNN consists of two parts: regular convolution layers to generate feature maps for the whole input image, and additional convolution layers (deformable convolution layers) for offsets to be learned from each feature map, where they can be trained easily by using back propagation from end-to-end without any supervision. These additional convolution layers increase the detection performance of the network at the cost of adding a small amount of computations for offset learning. In Section 7.2, our experimental results show that after adding deformable convolutional layers to the four R-CNN models, the animal species detection accuracy is improved.

4. R-CNN Models

In general, the four R-CNN models consist of two stages as shown in Figure 7. The first is the RoI or region proposals algorithm, which finds regions from the feature maps (the output of CNN 1) that might contain objects and generates a bounding box for each region. The second is the region pooling layer, which detects and removes all the overlapped regions, as well as converts the extracted proposals to a fixed size by doing max-pooling on them. The fixed size of proposals is required by the FCLs in CNN 2 and the bounding box regressor to identify and localize objects [11].

This section provides an overview of R-CNN, Fast R-CNN, Faster R-CNN, and Mask R-CNN models. Each model attempts to improve accuracy and speed up processing.
Figure 7. Basic architecture of various R-CNN models.

4.1. R-CNN

The R-CNN architecture is divided into five stages as shown in Figure 8. It starts by using a selective search algorithm to generate hundreds to thousands of region proposals for an input image. These region proposals are cropped and resized [1,48]. Then, each resized region proposal is fed into CNN to extract object features. The output of each CNN is the input of a linear SVM to identify the regions of objects in the image [49]. Finally, these identified regions are adjusted by using the linear bounding box regressor, to tighten and to refine the final bounding boxes of the detected objects [50].

The selective search algorithm generates regions based on a segmentation approach. It combines both object search and segmentation to detect all the possible locations of objects. In terms of segmentation of object and non-object, the image structures including object size, color similarities, and texture similarities are used to obtain many small segmented areas. Then, a bottom-up approach is typically used as part of the selective search algorithm to merge all the similar areas to get more accurate and larger segmented areas to produce the final candidate region proposals [51,52].

The R-CNN model cannot be applied to real-time applications because:
• Network processing is expensive and slow due to the use of the selective search algorithm, where hundreds to thousands of region proposals need to be classified for each image.
• R-CNN sometimes generates bad candidate region proposals as the selective search is a fixed algorithm which has no learning capabilities.

At the same time, the training of the R-CNN model is complex and requires a big memory space, since R-CNN has to train three different models separately: CNN, SVM, and bounding box regressor.
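For reference, region proposals of the kind produced by the selective search algorithm described above can be generated with OpenCV's contrib module; the sketch below assumes the opencv-contrib-python package and a placeholder image path, neither of which comes from the paper:

```python
import cv2

# Assumes opencv-contrib-python is installed; "input.jpg" is a placeholder path.
image = cv2.imread("input.jpg")
ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(image)
ss.switchToSelectiveSearchFast()   # switchToSelectiveSearchQuality() is slower but more exhaustive
rects = ss.process()               # array of [x, y, w, h] candidate region proposals
print(f"{len(rects)} region proposals generated")
```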

Figure 8. Basic architecture of R-CNN model. The number of CNNs varies depending on the number of classes.
4.2. Fast R-CNN

The same developer of R-CNN proposed a modified model, Fast R-CNN [10], to solve some of the R-CNN limitations. As shown in Figure 9, in Fast R-CNN, CNN is used to extract features and produce feature maps for the whole input image instead of each region proposal as in R-CNN. Thereby, Fast R-CNN can save time and memory compared to R-CNN. From the feature maps of the whole image, and the RoIs which are identified by the selective search algorithm, regions are cropped out to a fixed size feature map for each region proposal by using the region pooling layer. Then, these feature maps of each region are flattened to a vector by FCLs and fed to the Softmax classification and bounding box regressor to predict the class and bounding box locations for each object in the image. Despite the advantages of Fast R-CNN in reducing used memory and processing time, and increasing detection accuracy, the selective search algorithm that generates region proposals is still a bottleneck of the model processing time.

Figure 9. Basic architecture of Fast R-CNN Model.

4.3. Faster R-CNN

In the Faster R-CNN model, the selective search algorithm in the Fast R-CNN has been replaced by RPN. As shown in Figure 10, the consumed time in generating region proposals is less in RPN compared to the selective search algorithm, since RPN shares most computations with Fast R-CNN, as both networks have the same convolution layers and feature maps.

As shown in Figure 11, RPN is used to generate a set of various size anchor boxes across the image [53]. Anchor boxes are proposals with different sizes and aspect ratios which have been selected based on object size and are used as a reference in the testing process for the prediction of object class and localization. These anchor boxes will be fed to a binary classifier to determine the probability of having an object or not, and to a regressor to create the bounding boxes of these proposals. After that, a Non-Maximum Suppression (NMS) filter is used to remove overlapping anchor boxes, by (i) selecting the anchor box that has the highest confidence score, (ii) computing the overlap between this anchor box and other anchor boxes by calculating the intersection over union (IoU), (iii) removing anchor boxes that have higher overlap than a predefined overlap threshold, and (iv) repeating steps (ii) and (iii) until all overlapping anchor boxes are removed [54].
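The IoU computation and the greedy NMS loop in steps (i)–(iv) can be summarized with the following minimal NumPy sketch; the box coordinates and the 0.5 overlap threshold are illustrative assumptions, not values taken from the paper:

```python
import numpy as np

def iou(box, boxes):
    """Intersection over Union between one box and an array of boxes ([x1, y1, x2, y2])."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping it too much, repeat."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        best = order[0]                                   # (i) highest confidence score
        keep.append(best)
        overlaps = iou(boxes[best], boxes[order[1:]])     # (ii) IoU with the remaining boxes
        order = order[1:][overlaps <= iou_threshold]      # (iii) remove high-overlap boxes
    return keep                                           # (iv) loop until none remain

boxes = np.array([[10, 10, 60, 60], [12, 12, 58, 62], [100, 80, 160, 140]], dtype=float)
scores = np.array([0.9, 0.8, 0.75])
print(nms(boxes, scores))   # indices 0 and 2 are kept; the second box is suppressed by the first
```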

Figure 10. Basic architecture of Faster R-CNN Model.

Figure 11. Region Proposal Network Architecture.

4.4. Mask R-CNN

Mask R-CNN is an extension of Faster R-CNN especially used for instance segmentation to specify which pixel is a part of which object in an image [53,55,56]. Segmentation labels each pixel in an image with an object class, and then assigns each pixel to an instance, where each instance corresponds to an object in an image. Two types of segmentations have been applied on the image in Figure 12a. Semantic segmentation, as shown in Figure 12b, does not differentiate instances of the same class (there is one bounding box for the two bears). On the other hand, instance segmentation using Mask R-CNN, as shown in Figure 12c, segments and distinguishes between objects of the same class individually in an image and localizes each object instance with a bounding box (there is a bounding box for each bear).

Figure 12. Image segmentation techniques. (a) Original image of two bears; (b) Semantic segmentation; (c) Instance segmentation using Mask R-CNN.

As shown in Figure 13, Mask R-CNN consists of two parts: (i) Faster R-CNN for object detection, and (ii) a Fully Convolutional Network (FCN) for providing segmentation masks on each object (object mask) [53]. In Faster R-CNN, the regions which have been resized by the RoI pooling layer are slightly misaligned from the original input image. This is not important in bounding boxes; however, it has a negative effect on instance segmentation. So, Mask R-CNN uses the RoI Align layer to overcome this problem and to align more precisely by removing any quantization operations [57].
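The difference between the two pooling strategies can be reproduced with torchvision, which provides both operators; the feature-map size, the RoI coordinates, and the 7 × 7 output size below are illustrative assumptions rather than the authors' settings:

```python
import torch
from torchvision.ops import roi_align, roi_pool

features = torch.randn(1, 256, 50, 50)                 # feature maps for one image (illustrative)
boxes = [torch.tensor([[11.3, 14.7, 36.2, 41.9]])]     # one RoI as (x1, y1, x2, y2)

# RoI pooling quantizes the box to the feature grid; RoI Align keeps sub-pixel positions
# and uses bilinear interpolation, avoiding the misalignment discussed above.
pooled  = roi_pool(features, boxes, output_size=(7, 7), spatial_scale=1.0)
aligned = roi_align(features, boxes, output_size=(7, 7), spatial_scale=1.0, sampling_ratio=2)
print(pooled.shape, aligned.shape)                     # both: torch.Size([1, 256, 7, 7])
```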

Figure 13. Basic architecture of Mask R-CNN Model.

5. Animal Datasets
5.1. Datasets Used in Our Study
In our research, we used three datasets: (1) the Snapshot Serengeti dataset [58],
(2) the dataset furnished by BCMOTI, and (3) the Snapshot Wisconsin dataset [59]. The
Snapshot Serengeti is the dataset for the animal species in Africa (Serengeti National Park
in Tanzania). A total of 712,158 images for seven species (lion, zebra, buffalo, giraffe,
fox, deer, and elephant) were selected. The BCMOTI dataset has 53,000 images for eight
species (bear, moose, elk, deer, cougar, mountain goat, fox, and wolf) as they are commonly
seen in highways and remote areas in Canada. The Snapshot Wisconsin dataset was
collected in North America by using 1037 camera-traps placed in a forest in Wisconsin.
It contains 0.5 million images for different animal species; six types of animals have been
chosen (bears, deer, elk, moose, wolf, and fox) since encounters between these animals
and vehicles typically lead to severe crashes on highways. These animals are sometimes
involved in tragic direct encounters with humans as well.
In the three datasets, the classes are imbalanced, and this is an issue to be dealt with
in the future. The images were labeled by human volunteers as empty or as the name of
animal species. The images in the datasets have resolutions ranging between 512 × 384
and 2048 × 1536 pixels. Snapshot Serengeti, BCMOTI, and Snapshot Wisconsin differ in
many aspects such as dataset size, camera placement, camera configuration, and species
coverage, thus allowing one to draw more general conclusions.
5.2. Limitations of Datasets

Detection of animal species in images is challenging due to images' conditions. In some instances, the whole animal covers only a small area of the field of view as shown in Figure 14a. In other instances, two or more animals are too close to the field of view and merged with each other, as shown in Figure 14b. Sometimes, only part of the animal is visible in the field of view, as shown in Figure 14c,d. Furthermore, different lighting conditions, shadows, and weather, as shown in Figure 14e,f, can make the feature extraction task even harder.

Figure 14. Image samples from the dataset used. (a) Low resolution image; (b) An image of three moose close to the camera and merged with each other; (c,d) A part of the animal; (e) A night image of cougar with falling snow; (f) A night image of cougar with mist.

6. Methodology of Animal Species Detection


The objective of this section is to find a fast and accurate animal detector. Therefore, various R-CNN models are applied on the three animal datasets to evaluate and to compare their performance in terms of accuracy and speed. Moreover, D-CNN has been integrated into the R-CNN models to enhance the extracted features, which in turn improves the models' capability in detecting animals.

6.1. Features Enhancement


Regular R-CNNs extract the features from the image by using a fixed size square
kernel. This kernel does not cover properly all the pixels of the target object to precisely
represent it. The predicted bounding box using regular R-CNN does not cover the whole
animal as shown in Figure 15a. As a result, a novel technique is required to enhance
the extracted features. By adding deformable convolutional layers to the regular R-CNN
animal detectors, the learning of the geometric transformation of animals is possible. These
layers can produce adaptive deformable kernel and offset according to the object’s scale
and shape by augmenting the spatial sampling locations in convolution layers as explained
earlier in Section 3.2. Therefore, the predicted bounding box using deformable R-CNN
covers the whole animal as shown in Figure 15b. After experimental trials, three deformable convolutional layers are used to learn offsets; these offsets are added to the regular grid sampling locations in the regular convolution. The detection capability and accuracy are
enhanced as reported later in Figures for the three datasets used.
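A deformable convolutional layer of the kind added here can be sketched with torchvision's deformable convolution operator, with the offsets predicted by a regular convolution as described in Section 3.2; the channel sizes and feature-map shape are illustrative assumptions, not the authors' exact configuration:

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableBlock(nn.Module):
    """A 3 x 3 deformable convolution whose sampling offsets are learned
    by a regular convolution from the same input feature map."""
    def __init__(self, in_channels=256, out_channels=256, kernel_size=3, padding=1):
        super().__init__()
        # 2 offsets (dx, dy) per kernel position -> 2 * 3 * 3 = 18 offset channels
        self.offset_conv = nn.Conv2d(in_channels, 2 * kernel_size * kernel_size,
                                     kernel_size, padding=padding)
        nn.init.zeros_(self.offset_conv.weight)   # start from the regular sampling grid
        nn.init.zeros_(self.offset_conv.bias)
        self.deform_conv = DeformConv2d(in_channels, out_channels,
                                        kernel_size, padding=padding)

    def forward(self, x):
        offsets = self.offset_conv(x)              # learned offsets for every location
        return self.deform_conv(x, offsets)        # sample at the shifted positions

features = torch.randn(1, 256, 64, 64)             # backbone feature map (illustrative)
print(DeformableBlock()(features).shape)           # torch.Size([1, 256, 64, 64])
```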

Figure 15. Animal species detection by using: (a) Regular convolution; (b) Deformable convolution.

6.2. Training

Each of the three datasets has been split into 70% for training, 15% for validation, and 15% for testing, which are the commonly used percentages in similar research. In the training of deep learning models, it is important to find the significant values of hyper-parameters such as the learning rate, batch size, number of iterations, etc. Reaching the optimum performance of a model is achieved by experimenting with various values for these hyper-parameters [60]. A validation dataset is used as well to fine tune the model for overfitting and for adjusting these hyper-parameters.

The eight R-CNN models (with and without deformable convolutional layers) were trained by back propagation and fine-tuned on the validation set to reduce overfitting by
using a learning rate of 0.0025 for 32 batch size. The network of these models is initialized
with the ResNet-101 [26] pre-trained model and fine-tuned end-to-end for the object
detection task to enhance efficiency of training time and improve evaluation performances.
All training input images were annotated by using the Image Labeler app [61] to provide
labeled bounding box over the animals in these images. This box is called the ground
truth box.
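The 70%/15%/15% split described at the start of this section can be reproduced with a simple shuffled partition; the file list and seed below are placeholders, not the actual dataset contents:

```python
import random

image_files = [f"image_{i:05d}.jpg" for i in range(1000)]   # placeholder annotated images
random.seed(42)                                             # fixed seed for a repeatable split
random.shuffle(image_files)

n = len(image_files)
n_train, n_val = int(0.70 * n), int(0.15 * n)
train = image_files[:n_train]                  # 70% for training
val   = image_files[n_train:n_train + n_val]   # 15% for validation
test  = image_files[n_train + n_val:]          # remaining 15% for testing
print(len(train), len(val), len(test))         # 700 150 150
```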
To identify animal species, several pre-trained models were experimented with, including:
AlexNet, GoogleNet, VGG-16, VGG-19, ResNet-18, ResNet-50, and ResNet-101, as shown in
Table 1. Finally, ResNet-101 has been selected as a backbone network for the R-CNN models
to detect animals in the training process. This selection of ResNet was also supported
by the work of Kwan et al. [33], as they achieved good performance with YOLO using
ResNet. The main reason for that selection is the ability to balance between computational
complexity and the animal species detection accuracy. ResNet-101 introduces shortcut
connections to speed up the convergence of the network and to avoid vanishing gradient
problems during the training process, as these problems could stop the network from
further training [11,26,62]. Furthermore, ResNet-101 achieves competitive accuracy and
speed performance in scale-invariant feature extraction.
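For instance, a Faster R-CNN detector with a pre-trained ResNet-101 backbone can be assembled with torchvision as sketched below; the thirteen species plus a background class follow the text, while everything else (the FPN backbone builder and its arguments, whose names vary across torchvision versions) is an assumption rather than the authors' exact setup:

```python
from torchvision.models.detection import FasterRCNN
from torchvision.models.detection.backbone_utils import resnet_fpn_backbone

# ImageNet pre-trained ResNet-101 backbone with a feature pyramid network (FPN).
# Note: newer torchvision releases use a `weights=` keyword instead of `pretrained=`.
backbone = resnet_fpn_backbone("resnet101", pretrained=True)

# 13 animal species + 1 background class.
model = FasterRCNN(backbone, num_classes=14)
model.eval()  # switch to inference mode; training would use model.train() with annotated boxes
```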
Table 1. Evaluation of animal species identification by using seven pre-trained models on the three datasets.

Pre-Trained Models    Accuracy of Animal Identification
AlexNet               93.1%
GoogleNet             95.9%
VGG-16                96.8%
VGG-19                96.3%
ResNet-18             96.8%
ResNet-50             97.1%
ResNet-101            97.6%

As shown in Figure 16, ResNet-101 consists of five regularized residual convolution blocks (Rconv.1, Rconv.2, Rconv.3, Rconv.4, and Rconv.5) with shortcut connections. These connections prevent overfitting and allow data flow from the input layer to the output layer of each block. The five blocks use 101 hidden layers to extract the image features and to produce feature maps by using 3 × 3 and 1 × 1 filter windows [26]. The output of the last block (Rconv.5) is the input of a max-pooling layer with a stride of 2 pixels to reduce the number of feature maps. The FCL flattens these maps to a single vector that can be used as the input of the Softmax classification layer to deal with the thirteen classes of animal species.
Figure 16. Architecture of ResNet-101. Rconv.1 has two layers: a. a convolution layer with kernel size (7 × 7) and 64 filters, and b. a max pooling layer of size (3 × 3). Rconv.2 has 9 convolution layers with kernel sizes (1 × 1) and (3 × 3) and different numbers of filters (64 and 256). Similarly, Rconv.3 has 12 convolution layers, Rconv.4 has 69 convolution layers, and Rconv.5 has 9 convolution layers. The last block is followed by a 7 × 7 max pooling layer, the FCL, and the Softmax layer.
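To make the shortcut pattern concrete, the following minimal PyTorch sketch shows one residual bottleneck block of the kind stacked inside these blocks; it illustrates the general design rather than the exact ResNet-101 implementation used in this work:

```python
# Minimal sketch of the ResNet bottleneck pattern described above
# (1x1 -> 3x3 -> 1x1 convolutions plus a shortcut connection).
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    def __init__(self, in_ch, mid_ch, out_ch, stride=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, kernel_size=1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, kernel_size=3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, out_ch, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        # The shortcut carries the input straight to the block output; a 1x1
        # projection is used when the spatial size or channel count changes.
        self.shortcut = (nn.Identity() if stride == 1 and in_ch == out_ch else
                         nn.Sequential(nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                                       nn.BatchNorm2d(out_ch)))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + self.shortcut(x))

# Example: one block of Rconv.2 maps 64 channels to 256 at the same resolution.
y = Bottleneck(64, 64, 256)(torch.rand(1, 64, 56, 56))
print(y.shape)  # torch.Size([1, 256, 56, 56])
```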

Figure 17 shows the animal species detection procedure for the regular R-CNN and deformable R-CNN models. The system has been trained using the pre-trained residual network (ResNet-101). First, four regular region-based object detection models (R-CNN, Fast R-CNN, Faster R-CNN, and Mask R-CNN) are trained. Then, four new deformable region-based object detection models are trained after adding three deformable convolutional layers to the last three convolutional layers with kernel size (3 × 3) in the last block of ResNet-101 (Rconv.5).
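A deformable 3 × 3 layer of this kind predicts a per-pixel sampling offset for every kernel position and then convolves over the shifted locations; the sketch below is a hedged illustration built on torchvision's DeformConv2d rather than the authors' MATLAB code:

```python
# Hedged illustration of a deformable 3x3 convolution layer of the kind added
# to the last block of ResNet-101.
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableConvBlock(nn.Module):
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        # A regular convolution predicts 2 offsets (x, y) per kernel sample.
        self.offset = nn.Conv2d(in_ch, 2 * k * k, kernel_size=k, padding=k // 2)
        nn.init.zeros_(self.offset.weight)   # start as an ordinary convolution
        nn.init.zeros_(self.offset.bias)
        self.deform = DeformConv2d(in_ch, out_ch, kernel_size=k, padding=k // 2)

    def forward(self, x):
        return self.deform(x, self.offset(x))  # sampling grid shifts per pixel

feat = torch.rand(1, 512, 32, 32)              # e.g. a Rconv.5 feature map
out = DeformableConvBlock(512, 512)(feat)
print(out.shape)                               # torch.Size([1, 512, 32, 32])
```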
Figure 17. Animal species detection training model with eight detectors.

Our work was carried out using MATLAB 2020b deep learning and parallel computing toolboxes and implemented on a laptop with a Core i7-10750H processor, an NVIDIA GeForce RTX 2070 graphics accelerator, and 32 GB of RAM, running the Windows 10 Professional x64 operating system.
7. Experimental Results of Animal Species Detection

7.1. Performance Evaluation Metrics

To compare and evaluate the performance of animal species detectors, four metrics are used: False Negative Rate (FNR), accuracy, mean Average Precision (mAP), and response-time.

IoU measures the overlap "intersection" between the ground truth box (actual) and the predicted bounding box divided by their union. The resulting value shows how close the predicted bounding box is to the ground truth box. To determine if the detection is positive or negative, a predefined IoU threshold value is used. It is important that the value of this threshold is not too small or too large; in object detection research, thresholds from 0.4 to 0.7 are commonly used [6,27]. Figure 18 shows the effect of the IoU threshold on the performance of Mask R-CNN. As shown in Figure 18a, the higher threshold (equal to or more than 0.5) detected two animals and produced two bounding boxes for each animal. In Figure 18b, the lower threshold (lower than 0.5) failed to detect two animals; however, it produced a bounding box for one detected animal. Thereby, FNR, accuracy, and mAP are measured using an IoU threshold [17,28] of 0.5.
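For reference, a minimal sketch of the IoU computation for two axis-aligned boxes is given below; the example coordinates are made up:

```python
# Minimal sketch of the IoU computation for two boxes given as (x1, y1, x2, y2).
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle (zero area if the boxes do not overlap).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    return inter / (area_a + area_b - inter)

ground_truth = (50, 60, 200, 220)
predicted = (70, 80, 210, 230)
print(iou(ground_truth, predicted))  # ~0.68, a positive detection at the 0.5 threshold
```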

Figure 18. Effect of IoU on the animal images using deformable Mask R-CNN: (a) High threshold (two bounding boxes for each detected bear); (b) Low threshold (detect only one bear with one bounding box).

FNR is an essential metric in our work, where it measures the number of images that contain animals (positive) but are incorrectly classified as empty images (negative). Thereby, FNR does not consider the animal class, and only measures the performance of binary classification. By defining the true positive (TP) as truly classified images with animals, and false negative (FN) as falsely classified images with animals as empty images, the FNR is calculated as:

FNR = FN / (TP + FN)    (1)

Accuracy is an evaluation metric which is calculated by dividing the total number of correctly predicted objects over the total number of input images as shown in Equation (2). TP is defined as the true detection of a ground truth box (if IoU is greater than or equal to 0.5), FN as the false detection of a ground truth box (if IoU is less than 0.5), false positive (FP) as the false detection of an object that does not exist, and true negative (TN) as the number of bounding boxes that are supposed not to be detected inside any image.

Accuracy = (TP + TN) / (TP + FP + TN + FN)    (2)
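The two equations above reduce to simple arithmetic on the confusion counts; the sketch below uses made-up counts purely for illustration:

```python
# Sketch of Equations (1) and (2) applied to raw counts; the numbers are
# illustrative, not results from the paper.
def fnr(fn, tp):
    return fn / (tp + fn)                     # Equation (1)

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + fp + tn + fn)    # Equation (2)

tp, fp, tn, fn = 930, 25, 20, 25
print(f"FNR      = {fnr(fn, tp):.3f}")                # 0.026
print(f"Accuracy = {accuracy(tp, tn, fp, fn):.3f}")   # 0.950
```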
The mAP is a single number metric that combines both precision and recall by averaging precision across recall values, where it is the area under a precision–recall curve for the detections of each animal class [27,63]. Then, the result is divided by the number of classes N in the dataset as shown in Equation (3).

mAP = (1/N) Σ_{i=1}^{N} AP_i    (3)

where AP_i is the average precision (AP) for each animal species class (i). It is measured with the Riemann sum as the true area under the precision–recall curve [27].

Precision measures how accurate the object detection model is, as shown in Equation (4), so high precision means a low false positive rate.

Precision = TP / (TP + FP)    (4)

Recall measures how many correct detections are found by the object detection model, as shown in Equation (5), so high recall means a low false negative rate.

Recall = TP / (TP + FN)    (5)
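Equations (3)-(5) can be illustrated with a short sketch that computes precision and recall from counts and AP for a single class as the area under its precision–recall curve over ranked detections; the confidence scores and match labels below are made up:

```python
# Sketch of Equations (3)-(5): precision and recall from counts, and AP for one
# class via a Riemann sum over the ranked detections; the data are made up.
import numpy as np

def precision(tp, fp):
    return tp / (tp + fp)                     # Equation (4)

def recall(tp, fn):
    return tp / (tp + fn)                     # Equation (5)

def average_precision(scores, is_tp, num_gt):
    order = np.argsort(-np.asarray(scores))   # rank detections by confidence
    tps = np.asarray(is_tp, dtype=float)[order]
    cum_tp = np.cumsum(tps)
    cum_fp = np.cumsum(1.0 - tps)
    prec = cum_tp / (cum_tp + cum_fp)
    rec = cum_tp / num_gt
    # Riemann sum of precision over the recall increments.
    return float(np.sum(prec * np.diff(np.concatenate(([0.0], rec)))))

scores = [0.95, 0.90, 0.80, 0.60, 0.40]
is_tp = [1, 1, 0, 1, 0]                       # IoU >= 0.5 against ground truth
print(round(average_precision(scores, is_tp, num_gt=4), 3))  # 0.688 (toy example)
# mAP (Equation (3)) is then the mean of the per-class AP values.
```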
Response-time (elapsed CPU time) is an important evaluation metric which is used to measure the amount of time MATLAB takes to detect animals in a single image for an object detector model.
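A minimal sketch of such a per-image timing measurement follows; the paper performs this inside MATLAB, so the Python version below (with a stand-in detector) is only an analogy:

```python
# Sketch of a per-image response-time measurement. time.perf_counter (wall-clock)
# stands in for MATLAB's timing; time.process_time could be used for CPU time.
# `detector` is any callable taking one image.
import time

def response_time(detector, image, repeats=10):
    start = time.perf_counter()
    for _ in range(repeats):
        detector(image)
    return (time.perf_counter() - start) / repeats   # seconds per image

def dummy_detector(img):
    return sorted(img)                                # placeholder workload

print(f"{response_time(dummy_detector, list(range(100000))):.4f} s/image")
```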
7.2. Comparison Results and Discussion

The results in Figures 19–21 present the performance of the eight R-CNN models (four regular and four deformable) with FNR, Accuracy (Acc.), and mAP on the Snapshot Serengeti, BCMOTI, and Snapshot Wisconsin datasets, respectively. Moreover, Figure 22 presents the response-time per image (sec) for the three datasets. These figures show that deformable Mask R-CNN achieves higher performance compared to the other R-CNN models. In addition, it is able to detect and to perform instance segmentation of animal species within images. In general, the results show that the added deformable convolution layers can improve the detection performance.

In Figure 19, according to the evaluation metrics (FNR, Acc., and mAP), Mask R-CNN reaches the highest performance in both regular CNNs and D-CNNs. Furthermore, deformable Mask R-CNN provides the best result with an accuracy of 98.4% and mAP of 89.2%, while incorrectly identifying 427 images with animals in the test set as empty images.

Figure 19. Evaluation of object detection models by using Regular (R.) and Deformable (D.) in terms of FNR, Acc., and mAP on the Snapshot Serengeti dataset.

In Figure 20, the BCMOTI dataset, which is the smallest dataset used in this work, the performance of deformable Mask R-CNN decreases to 93.3% accuracy, 82.9% mAP, and FNR is increased by 1.7%, as most of the images in this dataset were taken at night with poor resolution and from the backside of the animals, as shown earlier in Figure 14.

Figure 20. Evaluation of object detection models by using Regular (R.) and Deformable (D.) in terms of FNR, Acc., and mAP on the BCMOTI dataset.

Figure 21 shows that by using deformable Mask R-CNN, accuracy and mAP of
detection are 97.6% and 87.6%, respectively, on the Snapshot Wisconsin dataset with 0.6%
FNR. In the Snapshot Serengeti dataset, the system has been trained on a larger training
set than BCMOTI and Snapshot Wisconsin. Thereby, it has gained up to 5.1% accuracy
compared to BCMOTI, and up to 0.8% accuracy compared to Snapshot Wisconsin. This
shows the importance of having a large training set with a large number of instances in
each class.
Figure 21. Evaluation of object detection models by using Regular (R.) and Deformable (D.) in terms of FNR, Acc., and mAP on the Snapshot Wisconsin dataset.

As shown in Figure 22, deformable Mask R-CNN is able to detect objects in about
0.78 s per image on all three datasets. That makes deformable Mask R-CNN, though
slightly slower than the regular version, suitable for use in most real-time applications.

Figure 22. Evaluation of object detection models by using Regular (R.) and Deformable (D.) in terms of Response-time on the three datasets.

The image results in Figure 23 show that deformable Mask R-CNN can detect and segment single and multiple animal species with a confidence score for each class. Deformable Mask R-CNN detects animal species with higher accuracy and speed in comparison to other regular and deformable R-CNN models. Therefore, not only can deformable Mask R-CNN be applied in real-time systems to detect single and multiple animal species, but it can also produce a mask over each detected animal in the image for counting the number of occluded and overlapping animal species.

Figure 23. Some examples of animal species detection after deformable Mask R-CNN (output mask size is the object size).

In general, our results show that deformable Mask R-CNN using ResNet-101 can
detect and segment animals with high accuracy exceeding the performance of the related
work, as shown in Table 2. This table summarizes the datasets, performance, and techniques
of our research and similar related work on animal species detection. The integration of
D-CNN into Mask R-CNN improves the performance of animal species detection. Our research improves on these related works for the following reasons:
1. Three datasets of different characteristics have been used for training and testing.
2. Deformable convolutional layers have been added to the R-CNN detectors, which
have a great effect on enhancing the extracted features, which in turn improve the
performance of animal species detection.

Table 2. Related work in animal species detection: a comparison.

References | Year | Dataset | Performance | Technique
Parham et al. [39] | 2016 | 2500 images of Plain and Grevy zebras | mAP of zebra detection: 55.6% for Plain and 56.6% for Grevy | YOLOv1 detector
Norouzzadeh et al. [37] | 2019 | Snapshot Serengeti | Accuracy of animal species detection: 92.9% | Deep active learning
Xu et al. [41] | 2020 | 750 images of cattle | Accuracy of cattle detection: 94% | Mask R-CNN
Gupta et al. [42] | 2020 | MS COCO dataset [15] | AP of detection: 79.47% for cows, and 81.09% for dogs | Mask R-CNN
Saxena et al. [64] | 2021 | 31,774 images of various animals | mAP of animal species detection: 82.11% | Faster R-CNN
Yılmaz et al. [65] | 2021 | 1500 images of cattle | Accuracy of cattle detection: 92.85% | YOLOv4
Sato et al. [66] | 2021 | 2000 images of donkeys and horses | Accuracy of donkeys and horses detection: 84.87% | YOLOv4
Our work | 2021 | Snapshot Serengeti, BCMOTI, and Snapshot Wisconsin | Accuracy and mAP of animal species detection, respectively: 98.4% and 89.2% in Snapshot Serengeti, 93.3% and 82.9% in BCMOTI, 97.6% and 87.6% in Snapshot Wisconsin | Deformable Mask R-CNN

8. Conclusions and Future Work


In this paper, a review of deep learning-based object detection models (R-CNN, Fast R-CNN, Faster R-CNN, and Mask R-CNN) has been provided. Then these models are
evaluated on animal images from three datasets for high precision, real-time animal species
detection. Next, the accuracy and speed performance of animal species detection are
provided after enhancing the extracted features by using D-CNNs. The results show that
deformable Mask R-CNN is the optimal choice in real-time animal species detection, and
it can achieve the best performance in FNR, accuracy, and mAP, as shown in Table 2.
Deformable Mask R-CNN is capable of handling geometric variations or deformations
of an object, without the need for further training on datasets, thus reducing validation
time and cost. Moreover, as shown in Figure 23, deformable Mask R-CNN provides promising results for detecting animal species under a wide range of lighting, shadow, and weather conditions.

In future work, we aim to detect smaller animal species, which is one of the major challenges of animal species detection, and to investigate improvements by reducing FNR. Furthermore, we plan to design an efficient animal detector by improving the accuracy of animal species identification and localization at a speed high enough to be applied in real-time applications. To obtain higher accuracy, we need to extract more significant features, improve pre- and post-processing methods, solve the class imbalance issue, accommodate the imbalance between day and night images, and enhance classification confidence. For
reducing the response-time and increasing the detection speed, we need to reduce the
network complexity and computation time by removing some layers from the deformable
Mask R-CNN architecture. Furthermore, a comparative study of one-stage and two-stage
detectors would provide insights into these approaches’ speed performance.

Author Contributions: Conceptualization, M.I., K.F.L. and F.G.; methodology, M.I.; software, M.I.;
validation, M.I.; formal analysis, M.I.; investigation, M.I., K.F.L. and F.G.; data curation, M.I.; re-
sources, L.E.S.; writing—original draft preparation, M.I.; writing—review and editing, K.F.L. and
F.G.; supervision, K.F.L. and F.G. All authors have read and agreed to the published version of
the manuscript.
Funding: This research received no external funding.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: Publicly available datasets were analyzed in this study. This data can
be found here: [58,59].
Acknowledgments: We gratefully acknowledge the support by the British Columbia Ministry of
Transportation and Infrastructure.
Conflicts of Interest: The authors declare no conflict of interest.

References
1. Felzenszwalb, P.F.; Girshick, R.B.; McAllester, D.; Ramanan, D. Object Detection with Discriminatively Trained Part-Based Models.
IEEE Trans. Pattern Anal. Mach. Intell. 2010, 32, 1627–1645. [CrossRef]
2. Lipton, Z.C.; Berkowitz, J.; Elkan, C. A Critical Review of Recurrent Neural Networks for Sequence Learning. arXiv 2015,
arXiv:1506.00019.
3. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Adv. Neural Inf.
Process. Syst. 2012, 25, 1097–1105. [CrossRef]
4. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444. [CrossRef]
5. Everingham, M.; Van Gool, L.; Williams, C.K.I.; Winn, J.; Zisserman, A. The Pascal Visual Object Classes (VOC) Challenge. Int. J.
Comput. Vis. 2010, 88, 303–338. [CrossRef]
6. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al.
ImageNet Large Scale Visual Recognition Challenge. Int. J. Comput. Vis. 2015, 115, 211–252. [CrossRef]
7. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE
Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [CrossRef]
8. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
[CrossRef]
9. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014;
pp. 580–587. [CrossRef]
10. Girshick, R. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile,
7–13 December 2015; pp. 1440–1448. [CrossRef]
11. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer
Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [CrossRef]
12. Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable convolutional networks. In Proceedings of the 2017 IEEE
International Conference on Computer Vision (ICCV), Venice, Italy, 22 October 2017; pp. 764–773.
13. Zhu, X.; Hu, H.; Lin, S.; Dai, J. Deformable ConvNets V2: More Deformable, Better Results. In Proceedings of the 2019 IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 9308–9316.

14. Papageorgiou, C.P.; Oren, M.; Poggio, T. A general framework for object detection. In Proceedings of the Sixth International
Conference on Computer Vision, Bombay, India, 7 January 1998; pp. 555–562. [CrossRef]
15. Dalal, N.; Triggs, B. Histograms of Oriented Gradients for Human Detection. In Proceedings of the IEEE Computer Society
Conference on Computer Vision and Pattern Recognition, San Diego, CA, USA, 20–25 June 2005; Volume 1, pp. 886–893.
[CrossRef]
16. Lowe, D.G. Distinctive Image Features from Scale-Invariant Keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [CrossRef]
17. Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in
Context. In Computer Vision—ECCV 2014; Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T., Eds.; Springer: Cham, Switzerland, 2014.
18. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
[CrossRef]
19. Li, Z.; Peng, C.; Yu, G.; Zhang, X.; Deng, Y.; Sun, J. Detnet: A Backbone network for Object Detection. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018.
20. Hinton, G.E.; Salakhutdinov, R.R. Reducing the Dimensionality of Data with Neural Networks. Science 2006, 313, 504–507.
[CrossRef] [PubMed]
21. Hinton, G.E.; Srivastava, N.; Krizhevsky, A.; Sutskever, L.; Salakhutdinov, R.R. Improving neural networks by preventing
co-adaptation of feature detectors. arXiv 2012, arXiv:1207.0580.
22. Zeiler, M.D.; Fergus, R. Visualizing and Understanding Convolutional Networks. In European Conference on Computer Vision; Lec-
ture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics);
Springer: Cham, Switzerland, 2014; Volume 8689 LNCS, pp. 818–833. [CrossRef]
23. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
24. Sermanet, P.; Eigen, D.; Zhang, X.; Mathieu, M.; Fergus, R.; LeCun, Y. Overfeat: Integrated recognition, localization and detection
using convolutional networks. In Proceedings of the 2nd International Conference on Learning Representations, ICLR 2014,
Banff, AB, Canada, 14–16 April 2014.
25. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going Deeper with
Convolutions. arXiv 2014, arXiv:1409.4842.
26. He, K.M.; Zhang, X.Y.; Ren, S.Q.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [CrossRef]
27. Everingham, M.; Eslami, S.M.A.; Van Gool, L.; Williams, C.K.I.; Winn, J.; Zisserman, A. The Pascal Visual Object Classes Challenge:
A Retrospective. Int. J. Comput. Vis. 2015, 111, 98–136. [CrossRef]
28. Schneider, S.; Taylor, G.W.; Kremer, S. Deep Learning Object Detection Methods for Ecological Camera Trap Data. In Proceedings
of the 2018 15th Conference on Computer and Robot Vision (CRV), Toronto, ON, Canada, 8–10 May 2018; pp. 321–328.
29. Swinnen, K.; Reijniers, J.; Breno, M.; Leirs, H. A Novel Method to Reduce Time Investment When Processing Videos from Camera
Trap Studies. PLoS ONE 2014, 9, e98881. [CrossRef]
30. Figueroa, K.; Camarena-Ibarrola, A.; Garcia, J.; Villela, H.T. Fast Automatic Detection of Wildlife in Images from Trap Cameras.
Hybrid Learn. 2014, 8827, 940–947.
31. Yu, X.; Wang, J.; Kays, R.; Jansen, P.; Wang, T.; Huang, T. Automated identification of animal species in camera trap images.
EURASIP J. Image Video Process. 2013, 2013, 52. [CrossRef]
32. Kwan, C.; Gribben, D.; Tran, T. Multiple Human Objects Tracking and Classification Directly in Compressive Measurement
Domain for Long Range Infrared Videos. In Proceedings of the 2019 IEEE 10th Annual Ubiquitous Computing, Electronics &
Mobile Communication Conference (UEMCON), New York, NY, USA, 10–12 October 2019; IEEE: Piscataway, NJ, USA, 2019;
pp. 0469–0475.
33. Uddin, M.S.; Hoque, R.; Islam, K.A.; Kwan, C.; Gribben, D.; Li, J. Converting Optical Videos to Infrared Videos Using Attention
GAN and Its Impact on Target Detection and Classification Performance. Remote Sens. 2021, 13, 3257. [CrossRef]
34. Chen, G.; Han, T.X.; He, Z.; Kays, R.; Forrester, T. Deep convolutional neural network based species recognition for wild animal
monitoring. In Proceedings of the 2014 IEEE International Conference on Image Processing (ICIP), Paris, France, 27–30 October
2014; pp. 858–862.
35. Villa, A.G.; Salazar, A.; Vargas, F. Towards automatic wild animal monitoring: Identification of animal species in camera-trap
images using very deep convolutional neural networks. Ecol. Inform. 2017, 41, 24–32. [CrossRef]
36. Willi, M.; Pitman, R.T.; Cardoso, A.W.; Locke, C.; Swanson, A.; Boyer, A.; Veldthuis, M.; Fortson, L. Identifying animal species in
camera trap images using deep learning and citizen science. Methods Ecol. Evol. 2019, 10, 80–91. [CrossRef]
37. Norouzzadeh, M.S.; Morris, D.; Beery, S.; Joshi, N.; Jojic, N.; Clune, J. A deep active learning system for species identification and
counting in camera trap images. Methods Ecol. Evol. 2021, 12, 150–161. [CrossRef]
38. Norouzzadeh, M.S.; Nguyen, A.; Kosmala, M.; Swanson, A.; Palmer, M.S.; Packer, C.; Clune, J. Automatically identifying,
counting, and describing wild animals in camera-trap images with deep learning. Proc. Natl. Acad. Sci. USA 2018, 115,
E5716–E5725. [CrossRef]
39. Parham, J.; Stewart, C. Detecting plains and Grevy’s Zebras in the realworld. In Proceedings of the 2016 IEEE Winter Applications
of Computer Vision Workshops (WACVW), Lake Placid, NY, USA, 10 March 2016.

40. Zhang, Z.; He, Z.; Cao, G.; Cao, W. Animal Detection from Highly Cluttered Natural Scenes Using Spatiotemporal Object Region
Proposals and Patch Verification. IEEE Trans. Multimed. 2016, 18, 2079–2092. [CrossRef]
41. Xu, B.; Wang, W.; Falzon, G.; Kwan, P.; Guo, L.; Chen, G.; Tait, A.; Schneider, D. Automated cattle counting using Mask R-CNN in
quadcopter vision system. Comput. Electron. Agric. 2020, 171, 105300. [CrossRef]
42. Gupta, S.; Chand, D.; Kavati, I. Computer Vision based Animal Collision Avoidance Framework for Autonomous Vehicles. Inf.
Process. Manag. Uncertain. Knowl.-Based Syst. 2021, 1378, 237–248. [CrossRef]
43. Oquab, M.; Bottou, L.; Laptev, I.; Sivic, J. Learning and Transferring Mid-level Image Representations Using Convolutional
Neural Networks. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH,
USA, 23–28 June 2014; pp. 1717–1724. [CrossRef]
44. Oquab, M.; Bottou, L.; Laptev, I.; Sivic, J. Weakly supervised object recognition with convolutional neural networks. HAL 2014,
hal-01015140v1 2014.
45. Kavukcuoglu, K.; Ranzato, M.; Fergus, R.; LeCun, Y. Learning invariant features through topographic filter maps. In Proceedings
of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 1605–1612.
46. Goodfellow, I.; Bengio, Y.B.; Courville, A. Adaptive Computation and Machine Learning Series (Deep Learning); The MIT Press:
Cambridge, MA, USA, 2016. Available online: Academia.edu (accessed on 15 August 2020).
47. Bishop, C.M. Pattern Recognition, and Machine Learning; Springer: New York, NY, USA, 2006; Volume 128, pp. 1–58. Available
online: Academia.edu (accessed on 15 August 2020).
48. Uijlings, J.R.R.; van de Sande, K.E.A.; Gevers, T.; Smeulders, A.W.M. Selective Search for Object Recognition. Int. J. Comput. Vis.
2013, 104, 154–171. [CrossRef]
49. Ding, S.; Zhang, X.; An, Y.; Xue, Y. Weighted linear loss multiple birth support vector machine based on information granulation
for multi-class classification. Pattern Recognit. 2017, 67, 32–46. [CrossRef]
50. He, Y.; Zhu, C.; Wang, J.; Savvides, M.; Zhang, X. Bounding Box Regression With Uncertainty for Accurate Object Detection. In
Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA,
15–20 June 2019; Institute of Electrical and Electronics Engineers (IEEE): Piscataway, NJ, USA, 2019; pp. 2883–2892.
51. Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of
the 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009;
pp. 248–255.
52. Dai, J.; He, K.; Sun, J. Convolutional feature masking for joint object and stuff segmentation. In Proceedings of the 2015 IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; Institute of Electrical and
Electronics Engineers (IEEE): Piscataway, NJ, USA, 2015; pp. 3992–4000.
53. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [CrossRef]
54. Prokudin, S.; Kappler, D.; Nowozin, S.; Gehler, P. Learning to Filter Object Detections. In Transactions on Computational Science XI.;
Springer Science and Business Media LLC: Berlin/Heidelberg, Germany, 2017; Volume 10496, pp. 52–62.
55. Dai, J.; He, K.; Sun, J. Instance-Aware Semantic Segmentation via Multi-task Network Cascades. In Proceedings of the 2016 IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; Institute of Electrical
and Electronics Engineers (IEEE): Piscataway, NJ, USA, 2016; pp. 3150–3158.
56. Arnab, A.; Torr, P.H.S. Pixelwise Instance Segmentation with a Dynamically Instantiated Network. In 2017 IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; Institute of Electrical and Electronics
Engineers (IEEE): Piscataway, NJ, USA, 2017; pp. 879–888.
57. Wu, H.; Siebert, J.P.; Xu, X. Fully Convolutional Networks for automatically generating image masks to train Mask R-CNN. arXiv
2020, arXiv:2003.01383v1.
58. Labeled Information Library of Alexandria: Biology and Conservation (LILA BC). Available online: http://lila.science/datasets/
snapshot-serengeti.[SnapshotSerengeti] (accessed on 27 August 2020).
59. Snapshot Wisconsin, A Volunteer-Based Project for Wildlife Monitoring. Available online: https://dnr.wisconsin.gov/topic/
research/projects/snapshot.[SnapshotWisconsin] (accessed on 1 May 2020).
60. Fan, Q.; Brown, L.; Smith, J. A closer look at Faster R-CNN for vehicle detection. In Proceedings of the 2016 IEEE Intelligent
Vehicles Symposium (IV), Gotenburg, Sweden, 19–22 June 2016; Volume 1, pp. 124–129.
61. MATLAB. Available online: https://www.mathworks.com/help/vision/ug/get-started-with-the-image-labeler.html (accessed
on 15 January 2020).
62. Khan, A.; Sohail, A.; Zahoora, U.; Qureshi, A.S. A survey of the recent architectures of deep convolutional neural networks. Artif.
Intell. Rev. 2020, 53, 5455–5516. [CrossRef]
63. Henderson, P.; Ferrari, V. End-to-End Training of Object Class Detectors for Mean Average Precision. In Asian Conference on
Computer Vision; Springer: Cham, Switzerland, 2016; pp. 198–213. [CrossRef]
64. Saxena, A.; Gupta, D.K.; Singh, S. An Animal Detection and Collision Avoidance System Using Deep Learning. Adv. Graph.
Commun. Packag. Technol. Mater. 2021, 668, 1069–1084. [CrossRef]

65. Yilmaz, A.; Uzun, G.N.; Gurbuz, M.Z.; Kivrak, O. Detection and Breed Classification of Cattle Using YOLO v4 Algorithm. In
Proceedings of the 2021 International Conference on INnovations in Intelligent SysTems and Applications (INISTA), Kocaeli,
Turkey, 25–27 August 2021; pp. 1–4.
66. Sato, D.; Zanella, A.J.; Costa, E.X. Computational classification of animals for a highway detection system. Braz. J. Veter-Res. Anim.
Sci. 2021, 58, e174951. [CrossRef]
