1 Department of ECE, University of Victoria, Victoria, BC V8W 3A4, Canada; kinli@uvic.ca (K.F.L.);
fayez@uvic.ca (F.G.)
2 British Columbia Ministry of Transportation and Infrastructure, Victoria, BC V8W 9T5, Canada;
Leonard.Sielecki@gov.bc.ca
* Correspondence: maieelgendy@uvic.ca
Abstract: Object detection is one of the vital and challenging tasks of computer vision. It supports a
wide range of applications in real life, such as surveillance, shipping, and medical diagnostics. Object
detection techniques aim to detect objects of certain target classes in a given image and assign each
object to a corresponding class label. These techniques proceed differently in network architecture,
training strategy and optimization function. In this paper, we focus on animal species detection as an
initial step to mitigate the negative impacts of wildlife–human and wildlife–vehicle encounters in
remote wilderness regions and on highways. Our goal is to provide a summary of object detection
techniques based on R-CNN models, and to enhance the performance of detecting animal species
in accuracy and speed, by using four different R-CNN models and a deformable convolutional
neural network. Each model is applied to three wildlife datasets, and the results are compared and
analyzed using four evaluation metrics. Based on the evaluation, an animal species detection system
is proposed.
Keywords: deep learning; convolutional neural network (CNN); region-based CNN (R-CNN) models; Deformable CNN (D-CNN); animal species detection
Citation: Ibraheam, M.; Li, K.F.; Gebali, F.; Sielecki, L.E. A Performance Comparison and Enhancement of Animal Species Detection in Images with Various R-CNN Models. AI 2021, 2, 552–577. https://doi.org/10.3390/ai2040034
Academic Editor: Mikel Galar
Received: 28 September 2021
Accepted: 29 October 2021
Published: 31 October 2021
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Copyright: © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
1. Introduction
Object detection has been widely studied to identify objects within an image to a predefined set of object classes (object identification) and where these objects are in the image (object localization) using bounding boxes [1]. It is a basic step for computer vision and image understanding. In recent years, most of the object detectors use Deep Learning Neural Networks (DNNs) including Convolutional Neural Networks (CNNs) architectures. CNNs have several blocks of multi convolution and pooling layers to extract features such as edges, textures, and shapes, etc., and to identify and locate objects in an image [2–6].
An object detection framework using a Region-based CNN (R-CNN) model can be divided into four stages: (i) region of interest (RoI) selection, also known as region proposals; (ii) features extraction for each region proposal using CNN; (iii) region classification (which objects are in each proposal); and (iv) object localization by combining overlapped region proposals into a single bounding box around each detected object using bounding box regression [7–11]. All these processes are time consuming, thus making R-CNN slow. Several models have been proposed to improve R-CNN, including Fast R-CNN [10], Faster R-CNN [7], and Mask R-CNN [11], to speed up object detection.
The most important step in the object detection task is the extraction of significant features, in order to identify and localize objects in the image with high accuracy. However, CNN is unable to deal with the geometric deformation of objects in images. In our study of animal species detection, Deformable CNN (D-CNN) is used to improve object features extraction under different geometric deformation conditions and thus object detection accuracy is improved, as concurred by [12,13].
masks at the pixel level for each object instance with bounding box, this method is called
Mask R-CNN [11]. All these improvements are significant and can be applied to animal
species detection.
Figure 1. DNNs have several hidden training layers of extracted features.
3. Overview of CNN
CNNs have varying accuracy performance on image classification (classify what is contained in an image). The number of computation layers which have been used for features learning of input images are different depending on the visual task [23–26]. This section provides an overview of regular CNN and D-CNN.
[Figure 2: CNN architecture diagram. Labels recovered: Input image; Conv. 1, Pool. 1, ..., Conv. n, Pool. n (multi-hidden layers); FCLs; Softmax layer; Output.]
Figure 3. An example of convolution operation with a kernel size of 3 × 3. (a) Input (numbers represent pixels intensities); (b) Kernel (numbers represent learned weights of filter); (c) Output of element-wise multiplication between (a,b); (d) Feature map (sum of all elements in (c)).
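The operation described in the Figure 3 caption (element-wise multiplication of a window by the kernel, followed by a sum) can be sketched in plain Python as follows. This is a minimal illustration assuming stride 1 and no padding; the function name `conv2d_valid` and the numeric values are hypothetical and are not the values shown in Figure 3.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (lists of rows) with stride 1 and no padding."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            # Element-wise multiply the current window by the kernel, then sum.
            total = sum(image[r + i][c + j] * kernel[i][j]
                        for i in range(kh) for j in range(kw))
            row.append(total)
        feature_map.append(row)
    return feature_map


# Illustrative pixel intensities and kernel weights (assumed values).
image = [[1, 0, 2, 1],
         [0, 1, 3, 0],
         [2, 1, 0, 1],
         [1, 0, 1, 2]]
kernel = [[1, 0, -1],
          [1, 0, -1],
          [1, 0, -1]]

feature_map = conv2d_valid(image, kernel)
print(feature_map)  # [[-2, 0], [-1, -1]]
```

As in most CNN libraries, the kernel is applied without flipping, so strictly speaking this is a cross-correlation.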
Figure 4. An example of pooling operation with a pool filter size of 2 × 2. (a) Input matrix of pooling layer (feature map) of size 4 × 4; (b) Output of max pooling operation; (c) Output of average pooling operation.
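The pooling operation in the Figure 4 caption can be sketched in the same way. The example below assumes non-overlapping 2 × 2 windows on a 4 × 4 feature map; the helper name `pool2d` and the input values are illustrative assumptions, not the values in Figure 4.

```python
def pool2d(feature_map, size=2, mode="max"):
    """Pool non-overlapping size x size windows using max or average."""
    pooled = []
    for r in range(0, len(feature_map) - size + 1, size):
        row = []
        for c in range(0, len(feature_map[0]) - size + 1, size):
            window = [feature_map[r + i][c + j]
                      for i in range(size) for j in range(size)]
            if mode == "max":
                row.append(max(window))
            else:
                row.append(sum(window) / len(window))
        pooled.append(row)
    return pooled


# Illustrative 4 x 4 feature map (assumed values).
fm = [[1, 3, 2, 4],
      [5, 7, 6, 8],
      [4, 2, 0, 2],
      [6, 8, 2, 4]]

max_pooled = pool2d(fm, mode="max")
avg_pooled = pool2d(fm, mode="avg")
print(max_pooled)  # [[7, 8], [8, 4]]
print(avg_pooled)  # [[4.0, 5.0], [5.0, 2.0]]
```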
For classification, the Fully Connected Layers (FCLs) in Figure 2 are the output layers which flatten the outputs of the previous layers, the feature maps, to a single vector that can be used as an input for the Softmax layer. Each input is connected to all neurons, represented as circles in Figure 2, to predict the class of the object in the input image with an activation function Softmax, which converts the output values to conditional probabilities (normalized classification scores) for prediction, where each value ranges between 0 and 1 and all values sum to one [3,47]. The architecture of CNN has the capability to learn and extract object features, and to merge several tasks together, for example, object detection and segmentation.
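To make the Softmax step concrete, the sketch below converts a vector of raw class scores into values between 0 and 1 that sum to one. The scores are made-up numbers, and subtracting the maximum score is a common numerical-stability convention rather than something stated in the paper.

```python
import math


def softmax(scores):
    """Convert raw class scores into normalized classification scores."""
    shift = max(scores)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores for three classes.
probs = softmax([2.0, 1.0, 0.1])
predicted_class = probs.index(max(probs))
print(probs, predicted_class)
```

The predicted class is simply the index of the largest probability.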
Regular CNNs have been built on fixed and known geometric structure, so they cannot deal with any geometric variations in the object such as: pose, scale, viewpoint, and deformation parts [47], as illustrated in Figure 5. To solve this issue, CNN has been trained on datasets with sufficient variation, or on augmented data by changing the size, shape, and rotation angle of the object, to attain high detection accuracy. Although the problem has been solved, the training is very complex and therefore expensive. To enhance the capability of CNN to deal with geometric variations in object or deformations without using data-augmentation, D-CNN was introduced [12,13].
Figure 5. Example of images that contain geometric variations in the object (moose) which make it difficult to be identified by using regular CNN.
Figure 6. Illustration of the sampling locations in 3 × 3 regular and deformable sampling matrices. (a) Regular sampling matrix (blue points); (b) Deformable sampling matrix (orange points) with offsets (green arrows); (c) Example of how the positions of the deformable sampling matrix are changed from the original 3 × 3 squared positions according to the object's shape to identify deformed or occluded objects in the image.
D-CNN consists of two parts: regular convolution layers to generate feature maps for the whole input image, and additional convolution layers (deformable convolution layers) for offsets to be learned from each feature map where they can be trained easily by using back propagation from end-to-end without any supervision. These additional convolution layers increase the detection performance of the network at the cost of adding a small amount of computations for offset learning. In Section 7.2, our experimental results show that after adding deformable convolutional layers to the four R-CNN models, the animal species detection accuracy is improved.
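The idea of shifting the nine 3 × 3 sampling positions by offsets can be sketched as follows. In a D-CNN the offsets are produced by the additional convolution layers and learned by back propagation; in this sketch they are supplied by hand, which is an assumption made purely for illustration. Fractional positions are read with bilinear interpolation, and the helper names are hypothetical.

```python
import math


def bilinear(fm, y, x):
    """Bilinearly interpolate feature map `fm` at (y, x); zero outside the map."""
    h, w = len(fm), len(fm[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                weight = (1 - abs(y - yy)) * (1 - abs(x - xx))
                value += weight * fm[yy][xx]
    return value


def deformable_conv_at(fm, kernel, cy, cx, offsets):
    """3 x 3 deformable convolution response at centre (cy, cx).

    offsets[k] = (dy, dx) shifts the k-th point of the regular 3 x 3 grid.
    """
    grid = [(gy, gx) for gy in (-1, 0, 1) for gx in (-1, 0, 1)]
    out = 0.0
    for k, (gy, gx) in enumerate(grid):
        dy, dx = offsets[k]
        out += kernel[gy + 1][gx + 1] * bilinear(fm, cy + gy + dy, cx + gx + dx)
    return out


# Toy 5 x 5 feature map whose value grows linearly: fm[r][c] = 5 * r + c.
fm = [[5 * r + c for c in range(5)] for r in range(5)]
ones = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]

regular = deformable_conv_at(fm, ones, 2, 2, [(0.0, 0.0)] * 9)
shifted = deformable_conv_at(fm, ones, 2, 2, [(0.0, 0.5)] * 9)
print(regular, shifted)  # 108.0 112.5
```

With all offsets set to zero the result equals a regular 3 × 3 convolution at the same position, which is why the deformable layer can be seen as a generalization of the regular one.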
4. R-CNN Models
In general, the four R-CNN models consist of two stages as shown in Figure 7. The first is RoI or region proposals algorithm, that finds regions from the feature maps (output of CNN 1) that might contain objects and generates bounding box for each region. The second is the region pooling layer, where it detects and removes all the overlapped regions, as well as converts the extracted proposals to fixed size by doing max-pooling on them. The fixed size of proposals is required by FCLs in CNN 2 and the bounding box regressor to identify and localize objects [11].
This section provides an overview of R-CNN, Fast R-CNN, Faster R-CNN, and Mask R-CNN models. Each model attempts to improve accuracy and speed up processing.
[Figure 7: two-stage structure of the R-CNN models. Diagram labels recovered: Input Image; CNN 1; Feature Maps for the Whole Image; RoI Algorithm; Region Pooling Layer; Extracted Resized Regions; CNN 2.]
4.1. R-CNN
The R-CNN architecture is divided into five stages as shown in Figure 8. It starts by using a selective search algorithm to generate hundreds to thousands of region proposals for an input image. These region proposals are cropped and resized [1,48]. Then, each resized region proposal is fed into CNN to extract object features. The output of each CNN is the input of a linear SVM to identify the regions of objects in image [49]. Finally, these identified regions are adjusted by using the linear bounding box regressor, to tighten and to refine the final bounding boxes of the detected objects [50].
Selective search algorithm generates regions based on a segmentation approach. It combines both object search and segmentation to detect all the possible locations of objects. In terms of segmentation of object and non-object, the image structures including object size, color similarities, and texture similarities, are used to obtain many small segmented areas. Then, a bottom-up approach is typically used as part of the selective search algorithm to merge all the similar areas to get more accurate and larger segmented areas to produce the final candidate region proposals [51,52].
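The bottom-up merging step can be illustrated with a heavily simplified sketch. It is not the selective search algorithm itself: it assumes that an initial segmentation is already available, uses only mean color as the similarity cue (the text also mentions size and texture), and ignores whether regions are adjacent. All names and numbers below are hypothetical.

```python
from itertools import combinations


def merge_boxes(a, b):
    """Smallest box (x1, y1, x2, y2) that encloses boxes a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def similarity(r1, r2):
    """Higher is more similar; here the negative mean-color distance only."""
    return -sum(abs(c1 - c2) for c1, c2 in zip(r1["color"], r2["color"]))


def bottom_up_proposals(regions):
    """Greedily merge the most similar pair until one region remains.

    Every initial and every merged region contributes one box proposal.
    """
    regions = list(regions)
    proposals = [r["box"] for r in regions]
    while len(regions) > 1:
        i, j = max(combinations(range(len(regions)), 2),
                   key=lambda p: similarity(regions[p[0]], regions[p[1]]))
        a, b = regions[i], regions[j]
        size = a["size"] + b["size"]
        merged = {
            "box": merge_boxes(a["box"], b["box"]),
            "size": size,
            # Size-weighted mean color of the two merged regions.
            "color": tuple((ca * a["size"] + cb * b["size"]) / size
                           for ca, cb in zip(a["color"], b["color"])),
        }
        regions = [r for k, r in enumerate(regions) if k not in (i, j)]
        regions.append(merged)
        proposals.append(merged["box"])
    return proposals


# Three hypothetical initial segments: two reddish ones and a green one.
segments = [
    {"box": (0, 0, 10, 10), "size": 100, "color": (200, 30, 30)},
    {"box": (10, 0, 20, 10), "size": 100, "color": (190, 40, 35)},
    {"box": (0, 10, 20, 30), "size": 400, "color": (20, 120, 40)},
]
proposals = bottom_up_proposals(segments)
print(proposals)
```

The two reddish segments are merged first, so the proposal list contains the three initial boxes, the merged reddish box, and finally the box covering everything.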
The R-CNN model cannot be applied to real-time applications because:
• Network processing is expensive and slow due to the use of selective search algorithm, where hundreds to thousands of region proposals need to be classified for each image.
• R-CNN sometimes generates bad candidate region proposals as the selective search is a fixed algorithm which has no learning capabilities.
At the same time, the training of the R-CNN model is complex and requires a big memory space, since R-CNN has to train three different models separately: CNN, SVM, and bounding box regressor.
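For the bounding box regressor mentioned above, the sketch below shows how predicted offsets are commonly applied to a proposal, following the parametrization of the original R-CNN work (center shifts scaled by the proposal size, log-space width and height scaling). Whether the models in this study use exactly this form is an assumption, and the numeric values are invented.

```python
import math


def apply_box_deltas(proposal, deltas):
    """Refine a proposal (cx, cy, w, h) with predicted (dx, dy, dw, dh)."""
    cx, cy, w, h = proposal
    dx, dy, dw, dh = deltas
    return (
        cx + w * dx,       # shift the center in units of the proposal width
        cy + h * dy,       # shift the center in units of the proposal height
        w * math.exp(dw),  # scale the width in log space
        h * math.exp(dh),  # scale the height in log space
    )


# Hypothetical proposal and regressor output.
refined = apply_box_deltas((50.0, 40.0, 20.0, 10.0),
                           (0.1, -0.2, 0.0, math.log(2.0)))
print(refined)
```

Here the center moves right by 2 and up by 2, the width is unchanged, and the height doubles.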
Figure 8. Basic architecture of R-CNN model. The number of CNNs varies depending on the number of classes.
4.2. Fast R-CNN
The same developer of R-CNN proposed a modified model, the Fast R-CNN [10], to solve some of the R-CNN limitations. As shown in Figure 9, in Fast R-CNN, CNN is used to extract features and produce feature maps for the whole input image instead of each region proposal as in R-CNN. Thereby, Fast R-CNN can save time and memory compared to R-CNN. From the feature maps of the whole image, and RoI which are identified by the selective search algorithm, regions are cropped out to a fixed size feature map for each region proposal by using the region pooling layer. Then, these feature maps of each region are flattened to a vector by FCLs and fed to Softmax classification and bounding box regressor to predict the class and bounding box locations for each object in the image.
Despite the advantages of Fast R-CNN in reducing used memory and processing time, and increasing detection accuracy, the selective search algorithm that generates region proposals is still a bottleneck of the model processing time.
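The region pooling step, which crops a region from the whole-image feature map and reduces it to a fixed size by max-pooling, can be sketched as below. The integer bin boundaries, the 2 × 2 output size, and the name `roi_max_pool` are assumptions chosen for brevity; real implementations differ in how they quantize the bins.

```python
def roi_max_pool(feature_map, roi, out_size=2):
    """Max-pool the region roi = (r0, c0, r1, c1) to out_size x out_size.

    Coordinates are feature-map cells; r1 and c1 are exclusive.
    """
    r0, c0, r1, c1 = roi
    h, w = r1 - r0, c1 - c0
    pooled = []
    for i in range(out_size):
        # Integer bin edges; each bin is forced to cover at least one cell.
        rs = r0 + (i * h) // out_size
        re = max(rs + 1, r0 + ((i + 1) * h) // out_size)
        row = []
        for j in range(out_size):
            cs = c0 + (j * w) // out_size
            ce = max(cs + 1, c0 + ((j + 1) * w) // out_size)
            row.append(max(feature_map[r][c]
                           for r in range(rs, re) for c in range(cs, ce)))
        pooled.append(row)
    return pooled


# Toy 4 x 6 feature map with values 1..24.
fm = [[6 * r + c + 1 for c in range(6)] for r in range(4)]

whole = roi_max_pool(fm, (0, 0, 4, 6))  # 4 x 6 region
small = roi_max_pool(fm, (1, 1, 4, 4))  # 3 x 3 region
print(whole)  # [[9, 12], [21, 24]]
print(small)  # [[8, 10], [20, 22]]
```

Both regions, although of different sizes, yield a 2 × 2 output, which is what allows the following FCLs to accept them.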
Figure 9. Basic architecture of Fast R-CNN Model.
Figure 10. Basic architecture of Faster R-CNN Model. [Diagram labels recovered: Input Image; CNN; Feature Maps for the Whole Image; Region Proposal Network; Regions of Interest; Region Pooling Layer; Resized Regions; FCLs (Flatten Feature Map for Each Region to a Vector); Softmax Layer and Bounding Box Regressor; Detection Results (Class and Bounding Box).]
Figure 11. Region Proposal Network Architecture. [Diagram labels recovered: Convolution Layer; Anchor Box (different scales and aspect ratios of anchor boxes); Binary Category Prediction; Bounding Box Regressor; NMS; Region Proposals.]
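Figure 11 lists NMS (non-maximum suppression) as the step that turns scored anchor boxes into region proposals. A minimal greedy version is sketched below, assuming boxes in (x1, y1, x2, y2) form and an IoU threshold of 0.5; the threshold and the example boxes are illustrative and not taken from the paper.

```python
def iou(a, b):
    """Intersection over union of two boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, threshold=0.5):
    """Keep the highest-scoring boxes, dropping any that overlap a kept box."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= threshold for k in keep):
            keep.append(i)
    return keep


# Two heavily overlapping boxes and one separate box (made-up values).
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)  # [0, 2]
```

The second box overlaps the first with an IoU of 81/119 (about 0.68), so it is suppressed, while the third box is kept.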
4.4. Mask R-CNN
Mask R-CNN is an extension of Faster R-CNN especially used for instance segmentation to specify which pixel is a part of which object in an image [53,55,56]. Segmentation labels each pixel in an image with an object class, and then assigns each pixel to an instance, where each instance corresponds to an object in an image. Two types of segmentations have been applied on the image in Figure 12a. Semantic segmentation, as shown in Figure 12b, does not differentiate instances of the same class (there is one bounding box for the two bears). On the other hand, instance segmentation using Mask R-CNN, as shown in Figure 12c, segments and distinguishes between objects of the same class individually in an image and localizes each object instance with a bounding box (there is a bounding box for each bear).
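The difference between the two kinds of segmentation can be shown on a toy label map. In the sketch below, the semantic map uses one label per class while the instance map uses one label per object; the tiny grids stand in for the two bears and are invented for illustration.

```python
def boxes_from_label_map(label_map):
    """Return {label: (r_min, c_min, r_max, c_max)} for every non-zero label."""
    boxes = {}
    for r, row in enumerate(label_map):
        for c, label in enumerate(row):
            if label == 0:  # 0 marks background pixels
                continue
            if label not in boxes:
                boxes[label] = (r, c, r, c)
            else:
                r0, c0, r1, c1 = boxes[label]
                boxes[label] = (min(r0, r), min(c0, c), max(r1, r), max(c1, c))
    return boxes


# Semantic map: label 1 means "bear" for both animals.
semantic = [[0, 1, 1, 0, 0, 0],
            [0, 1, 1, 0, 1, 1],
            [0, 0, 0, 0, 1, 1]]
# Instance map: each bear has its own label (1 and 2).
instance = [[0, 1, 1, 0, 0, 0],
            [0, 1, 1, 0, 2, 2],
            [0, 0, 0, 0, 2, 2]]

semantic_boxes = boxes_from_label_map(semantic)
instance_boxes = boxes_from_label_map(instance)
print(semantic_boxes)  # {1: (0, 1, 2, 5)}
print(instance_boxes)  # {1: (0, 1, 1, 2), 2: (1, 4, 2, 5)}
```

The semantic map yields a single box around both animals, whereas the instance map yields one box per animal.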
Figure 13. Basic architecture of Mask R-CNN Model. [Diagram labels recovered: Input Image; Feature Maps for the Whole Image; Regions of Interest; Detection Results (Class and Bounding Box); Output Image.]
5. Animal Datasets
5.1. Datasets Used in Our Study
In our research, we used three datasets: (1) the Snapshot Serengeti dataset [58],
(2) the dataset furnished by BCMOTI, and (3) the Snapshot Wisconsin dataset [59]. The
Snapshot Serengeti is the dataset for the animal species in Africa (Serengeti National Park
in Tanzania). A total of 712,158 images for seven species (lion, zebra, buffalo, giraffe,
fox, deer, and elephant) were selected. The BCMOTI dataset has 53,000 images for eight
species (bear, moose, elk, deer, cougar, mountain goat, fox, and wolf) as they are commonly
seen in highways and remote areas in Canada. The Snapshot Wisconsin dataset was
collected in North America by using 1037 camera-traps placed in a forest in Wisconsin.
It contains 0.5 million images for different animal species; six types of animals have been
chosen (bears, deer, elk, moose, wolf, and fox) since encounters between these animals
and vehicles typically lead to severe crashes on highways. These animals are sometimes
involved in tragic direct encounters with humans as well.
In the three datasets, the classes are imbalanced, and this is an issue to be dealt with
in the future. The images were labeled by human volunteers as empty or as the name of
animal species. The images in the datasets have resolutions ranging between 512 × 384
and 2048 × 1536 pixels. Snapshot Serengeti, BCMOTI, and Snapshot Wisconsin differ in
many aspects such as dataset size, camera placement, camera configuration, and species
coverage, thus allowing one to draw more general conclusions.
5.2. Limitations of Datasets
Detection of animal species in images is challenging due to images’ conditions. In some instances, the whole animal covers only a small area of the field of view as shown in Figure 14a. In other instances, two or more animals are too close from the field of view and combined with each other, as shown in Figure 14b. Sometimes, only part of the animal is visible in the field of view, as shown in Figure 14c,d. Furthermore, different lighting conditions, shadows, and weather, as shown in Figure 14e,f, can make the feature extraction task even harder.
Figure 14. Image samples from the dataset used. (a) Low resolution image; (b) An image of three moose close to camera and merge to each other; (c,d) A part of the animal; (e) A night image of cougar with falling snow; (f) A night image of cougar with mist.
Figure 15. Animal species detection by using: (a) Regular convolution; (b) Deformable convolution.
6.2. Training
Each of the three datasets has been split into 70% for training, 15% for validation, and 15% for testing, which are the commonly used percentages in similar research. In the training of deep learning models, it is important to find the significant values of hyper-parameters such as: learning rate, batch size, number of iterations, etc. Reaching the optimum performance of a model is achieved by experiment using various values for these hyper-parameters [60]. A validation dataset is used as well to fine tune the model for overfitting and for adjusting these hyper-parameters.
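A 70/15/15 split of the kind described above can be sketched as follows. The shuffle with a fixed seed, the function name, and the file names are assumptions; the paper does not state how images were assigned to the three subsets.

```python
import random


def split_dataset(items, train_pct=70, val_pct=15, seed=0):
    """Shuffle items and split them into train, validation, and test lists."""
    items = list(items)
    random.Random(seed).shuffle(items)  # fixed seed keeps the split repeatable
    n_train = len(items) * train_pct // 100
    n_val = len(items) * val_pct // 100
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]      # the remainder, here 15%
    return train, val, test


# Hypothetical image file names.
image_ids = [f"img_{i:04d}.jpg" for i in range(1000)]
train_set, val_set, test_set = split_dataset(image_ids)
print(len(train_set), len(val_set), len(test_set))  # 700 150 150
```

Because the classes are imbalanced, a per-class (stratified) split may be preferable in practice; that variation is not shown here.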
The eight R-CNN models (with and without deformable convolutional layer) were trained by back propagation and fine-tuned on the validation set to reduce overfitting, using a learning rate of 0.0025 and a batch size of 32. The network of these models is initialized
with the ResNet-101 [26] pre-trained model and fine-tuned end-to-end for the object
detection task to enhance efficiency of training time and improve evaluation performances.
All training input images were annotated by using the Image Labeler app [61] to provide
labeled bounding box over the animals in these images. This box is called the ground
truth box.
To identify animal species, several pre-trained models were evaluated, including:
AlexNet, GoogleNet, VGG-16, VGG-19, ResNet-18, ResNet-50, and ResNet-101, as shown in
Table 1. Finally, ResNet-101 has been selected as a backbone network for the R-CNN models
to detect animals in the training process. This selection of ResNet was also supported
by the work of Kwan et al. [33], as they achieved good performance with YOLO using
ResNet. The main reason for that selection is the ability to balance between computational
complexity and the animal species detection accuracy. ResNet-101 introduces shortcut
connections to speed up the convergence of the network and to avoid vanishing gradient
problems during the training process, as these problems could stop the network from
further training [11,26,62]. Furthermore, ResNet-101 achieves competitive accuracy and
speed performance in scale-invariant feature extraction.
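The shortcut connection can be illustrated with a toy residual block. Real ResNet blocks use convolutions and batch normalization; the version below uses small fully connected layers on plain lists, which is a simplification made only to show the "add the input back" step. The weights are arbitrary.

```python
def relu(v):
    """Element-wise ReLU on a list of numbers."""
    return [max(0.0, x) for x in v]


def linear(v, weights):
    """Multiply vector v by a weight matrix given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in weights]


def residual_block(x, w1, w2):
    """y = relu(F(x) + x), with F(x) = W2 * relu(W1 * x)."""
    fx = linear(relu(linear(x, w1)), w2)
    # Shortcut connection: the input is added to the block's output.
    return relu([f + xi for f, xi in zip(fx, x)])


x = [1.0, 2.0]
zeros = [[0.0, 0.0], [0.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
half = [[0.5, 0.0], [0.0, 0.5]]

passthrough = residual_block(x, zeros, zeros)  # F(x) = 0, so y = x
boosted = residual_block(x, identity, half)    # F(x) = 0.5 * x
print(passthrough, boosted)  # [1.0, 2.0] [1.5, 3.0]
```

Even when the learned part F(x) contributes nothing, the input still passes through unchanged, which is the property that helps gradients flow in very deep networks.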
Table 1. Evaluation of animal species identification by using seven pre-trained models on the three datasets.
Pre-Trained Models    Accuracy of Animal Identification
AlexNet               93.1%
GoogleNet             95.9%
VGG-16                96.8%
VGG-19                96.3%
ResNet-18             96.8%
ResNet-50             97.1%
ResNet-101            97.6%
[Figure 16: Input image → Rconv1 → Rconv2 → Rconv3 → Rconv4 → Rconv5 → pooling and classification → classified output image.
Rconv1: a. {7 × 7 conv, 64}; b. 3 × 3 max pool.
Rconv2: a. {1 × 1 conv, 64} × 3; b. {3 × 3 conv, 64} × 3; c. {1 × 1 conv, 256} × 3.
Rconv3: a. {1 × 1 conv, 128} × 4; b. {3 × 3 conv, 128} × 4; c. {1 × 1 conv, 512} × 4.
Rconv4: a. {1 × 1 conv, 256} × 23; b. {3 × 3 conv, 256} × 23; c. {1 × 1 conv, 1024} × 23.
Rconv5: a. {1 × 1 conv, 512} × 3; b. {3 × 3 conv, 512} × 3; c. {1 × 1 conv, 2048} × 3.
Final stage: a. 7 × 7 max pool; b. FCL and Softmax layer.]
Figure 16. Architecture of ResNet-101. Rconv1 has two layers: a. convolution layer with kernel
size (7 × 7) and 64 filters, and b. max pooling layer of size (3 × 3). Rconv2 has 9 convolution
layers with kernel sizes (1 × 1) and (3 × 3) and with different numbers of filters (64 and 256).
Similarly, Rconv3 has 12 convolution layers, Rconv4 has 69 convolution layers, and Rconv5
has 9 convolution layers.
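The layer counts in the Figure 16 caption can be checked with a few lines of arithmetic. The sketch below is illustrative only: the dictionary layout and the names are assumptions, while the per-block numbers are read from the figure. Counting the final fully connected layer together with the convolution layers gives the 101 weighted layers in the network's name.

```python
# Convolution layers per block, stored as
# (layers per repeated unit, number of repeated units).
# The numbers are read from Figure 16; names and layout are illustrative.
RESNET101_BLOCKS = {
    "Rconv1": (1, 1),   # one 7x7 convolution (the max pooling layer has no weights)
    "Rconv2": (3, 3),   # {1x1, 3x3, 1x1} x 3
    "Rconv3": (3, 4),   # {1x1, 3x3, 1x1} x 4
    "Rconv4": (3, 23),  # {1x1, 3x3, 1x1} x 23
    "Rconv5": (3, 3),   # {1x1, 3x3, 1x1} x 3
}


def conv_layers_per_block(blocks):
    """Return the number of convolution layers in each block."""
    return {name: per_unit * repeats for name, (per_unit, repeats) in blocks.items()}


counts = conv_layers_per_block(RESNET101_BLOCKS)
total_conv = sum(counts.values())
total_weighted = total_conv + 1  # plus the final fully connected layer

print(counts)
print(total_conv, total_weighted)
```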
Figure 17 shows the animal species detection procedure for the regular R-CNN and
deformable R-CNN models. The training of the system has been applied by using the
pre-trained residual network (ResNet-101). First, four regular region-based object detection
models (R-CNN, Fast R-CNN, Faster R-CNN, and Mask R-CNN) are trained. Then,
four new deformable region-based object detection models are trained after adding three
deformable convolutional layers to the last three convolutional layers with kernel size
(3 × 3) in the last block of ResNet-101 (Rconv5).
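To make the idea of a deformable (3 × 3) convolutional layer concrete, the following is a minimal single-channel sketch in plain Python. It is not the authors' MATLAB implementation: the function names, the list-based data layout, and the zero-padding choice are assumptions made for illustration. Each of the nine kernel taps samples the feature map at its regular grid position plus a fractional offset, using bilinear interpolation. In a deformable convolutional network [12] the offsets are predicted by an additional convolutional layer; here they are simply passed in.

```python
import math

# Regular 3x3 sampling grid: (row shift, column shift) of each kernel tap.
KERNEL_TAPS = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]


def bilinear_sample(feature, y, x):
    """Bilinearly interpolate a 2-D feature map at (y, x); outside the map counts as zero."""
    h, w = len(feature), len(feature[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yi in (y0, y0 + 1):
        for xi in (x0, x0 + 1):
            if 0 <= yi < h and 0 <= xi < w:
                weight = (1.0 - abs(y - yi)) * (1.0 - abs(x - xi))
                value += weight * feature[yi][xi]
    return value


def deformable_conv3x3(feature, kernel, offsets):
    """Single-channel 3x3 deformable convolution, stride 1, zero padding.

    offsets[i][j][k] is the (dy, dx) shift applied to kernel tap k at output
    location (i, j). All-zero offsets reduce this to a regular convolution.
    """
    h, w = len(feature), len(feature[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            for k, (dy, dx) in enumerate(KERNEL_TAPS):
                off_y, off_x = offsets[i][j][k]
                sample = bilinear_sample(feature, i + dy + off_y, j + dx + off_x)
                out[i][j] += kernel[dy + 1][dx + 1] * sample
    return out


def constant_offsets(h, w, off_y, off_x):
    """Hypothetical helper: the same offset for every tap at every location."""
    return [[[(off_y, off_x)] * 9 for _ in range(w)] for _ in range(h)]


feature = [[1.0, 2.0, 3.0],
           [4.0, 5.0, 6.0],
           [7.0, 8.0, 9.0]]
identity = [[0.0, 0.0, 0.0],
            [0.0, 1.0, 0.0],
            [0.0, 0.0, 0.0]]  # only the centre tap contributes

regular = deformable_conv3x3(feature, identity, constant_offsets(3, 3, 0.0, 0.0))
shifted = deformable_conv3x3(feature, identity, constant_offsets(3, 3, 0.0, 0.5))
print(regular[1][1])  # 5.0: centre pixel sampled on the grid
print(shifted[1][1])  # 5.5: sampled half a pixel to the right, between 5 and 6
```

With zero offsets the layer behaves like an ordinary convolution, which is why deformable layers can be added to a pre-trained backbone and then fine-tuned.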
[Figure 17: Input (training images from the three datasets, with the ground truth dataset for
thirteen animal species) feeding eight detectors: 1. R-CNN; 2. Fast R-CNN; 3. Faster R-CNN;
4. Mask R-CNN; 5. R-CNN + Deformable CNN; 6. Fast R-CNN + Deformable CNN;
7. Faster R-CNN + Deformable CNN; 8. Mask R-CNN + Deformable CNN.]
Figure 17. Animal species detection training model with eight detectors.
Our work was carried out using the MATLAB 2020b deep learning and parallel computing
toolboxes and implemented on a laptop with a Core i7-10750H processor, an NVIDIA GeForce RTX
2070 graphics accelerator, and 32 GB of RAM, running a Windows 10 Professional
x64 operating system.
7. Experimental Results of Animal Species Detection
7.1. Performance Evaluation Metrics
To compare and evaluate the performance of animal species detectors, four metrics
are used: False Negative Rate (FNR), accuracy, mean Average Precision (mAP), and
response-time.
Intersection over Union (IoU) measures the overlap (“intersection”) between the ground truth
box (actual) and the predicted bounding box, divided by their union. The resulting value shows
how close the predicted bounding box is to the ground truth box. To determine whether a detection
is positive or negative, a predefined IoU threshold value is used. This threshold should be
neither too small nor too large; in object detection research, thresholds
from 0.4 to 0.7 are commonly used [6,27]. Figure 18 shows the effect of the IoU threshold on
the performance of Mask R-CNN. As shown in Figure 18a, the higher threshold (equal to
or more than 0.5) detected two animals and produced two bounding boxes for each animal.
In Figure 18b, the lower threshold (lower than 0.5) failed to detect two animals; however, it
produced a bounding box for one detected animal. Thereby, FNR, accuracy, and mAP are
measured using an IoU threshold of 0.5 [17,28].
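The IoU test described above can be sketched as follows. This is an illustrative example, not the authors' code: the corner-based box format `(x_min, y_min, x_max, y_max)`, the function names, and the sample boxes are all assumptions.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region (zero if the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union > 0 else 0.0


def is_positive_detection(predicted, ground_truth, threshold=0.5):
    """A detection counts as positive when its IoU reaches the threshold."""
    return iou(predicted, ground_truth) >= threshold


ground_truth = (10, 10, 110, 110)   # made-up ground truth box
close_box = (20, 20, 120, 120)      # made-up prediction, shifted by 10 pixels
loose_box = (30, 30, 130, 130)      # made-up prediction, shifted by 20 pixels

print(iou(close_box, ground_truth), is_positive_detection(close_box, ground_truth))
print(iou(loose_box, ground_truth), is_positive_detection(loose_box, ground_truth))
```

For these made-up boxes, the first prediction has an IoU of about 0.68 and passes the 0.5 threshold, while the second has an IoU of about 0.47 and does not.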
Figure 18. Effect of IoU on the animal images using deformable Mask R-CNN: (a) High threshold
(two bounding boxes for each detected bear); (b) Low threshold (detect only one bear with one
bounding box).
FNR is an essential metric in our work, as it measures the number of images that
contain animals (positive) but are incorrectly classified as empty images (negative). Thereby,
FNR does not consider the animal class, and only measures the performance of binary
classification. By defining the true positive (TP) as correctly classified images with animals,
and false negative (FN) as images with animals falsely classified as empty images, the FNR
is calculated as:

$$\mathrm{FNR} = \frac{FN}{TP + FN} \tag{1}$$
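As a small illustration of Equation (1), the sketch below computes FNR from image-level labels. The function name and the six example images are made up for illustration and are not taken from the datasets used in this work.

```python
def false_negative_rate(has_animal, predicted_animal):
    """FNR = FN / (TP + FN), computed over images that truly contain an animal."""
    pairs = list(zip(has_animal, predicted_animal))
    tp = sum(1 for truth, pred in pairs if truth and pred)
    fn = sum(1 for truth, pred in pairs if truth and not pred)
    if tp + fn == 0:
        raise ValueError("FNR is undefined when no image contains an animal")
    return fn / (tp + fn)


# Made-up example: four images contain an animal, two are empty.
has_animal = [True, True, True, True, False, False]
predicted_animal = [True, True, True, False, False, True]

fnr = false_negative_rate(has_animal, predicted_animal)
print(fnr)  # one of the four animal images was missed
```

Note that the empty image wrongly flagged as containing an animal does not affect FNR, since only images that truly contain animals enter the formula.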
Accuracy is an evaluation metric which is calculated by dividing the total number of
correctly predicted objects by the total number of input images, as shown in Equation (2).
TP is defined as the true detection of a ground truth box (if IoU is greater than or equal to
0.5), FN as the false detection of a ground truth box (if IoU is less than 0.5), false positive
(FP) as the false detection of an object that does not exist, and true negative (TN) as the
number of bounding boxes that are supposed not to be detected inside any image.

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \tag{2}$$
The mAP is a single number metric that combines both precision and recall by averaging
precision across recall values, where it is the area under a precision–recall curve for the
detections of each animal class [27,63]. Then, the result is divided by the number of classes
N in the dataset as shown in Equation (3).

$$\mathrm{mAP} = \frac{1}{N} \sum_{i=1}^{N} AP_i \tag{3}$$

where $AP_i$ is the average precision (AP) for each animal species class (i). It is measured
with the Riemann sum as the true area under the precision–recall curve [27].
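One way to compute AP as a Riemann sum and then average it into mAP is sketched below. This is an assumption about the procedure rather than the authors' exact implementation: detections are ranked by confidence, precision is evaluated at each rank, and each precision value is weighted by the increase in recall at that rank, with no interpolation. The class names, scores, and match flags are invented for illustration.

```python
def average_precision(scores, is_true_positive, num_ground_truth):
    """Area under the precision-recall curve as a Riemann sum over ranked detections."""
    order = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
    tp = fp = 0
    previous_recall = 0.0
    ap = 0.0
    for k in order:
        if is_true_positive[k]:
            tp += 1
        else:
            fp += 1
        precision = tp / (tp + fp)
        recall = tp / num_ground_truth
        ap += (recall - previous_recall) * precision  # rectangle width x height
        previous_recall = recall
    return ap


def mean_average_precision(ap_per_class):
    """Equation (3): the mean of the per-class AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)


# Invented detections for two classes, each with two ground truth boxes.
ap_per_class = {
    "bear": average_precision([0.9, 0.8, 0.6], [True, False, True], num_ground_truth=2),
    "deer": average_precision([0.7], [True], num_ground_truth=2),
}
print(ap_per_class)
print(mean_average_precision(ap_per_class))
```

In this example the "bear" class reaches an AP of about 0.83, the "deer" class reaches 0.5 because one of its two ground truth boxes is never detected, and the resulting mAP is about 0.67.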
Precision measures how accurate the object detection model is, as shown in Equation (4),
so high precision means a low false positive rate.

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{4}$$

Recall measures how many correct detections are found by the object detection
model, as shown in Equation (5), so high recall means a low false negative rate.

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{5}$$
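Equations (2), (4), and (5) all derive from the same four counts, which the following sketch makes explicit. The class name and the example counts are hypothetical and are not results from this work.

```python
from dataclasses import dataclass


@dataclass
class DetectionCounts:
    """Hypothetical container for the four detection outcome counts."""
    tp: int
    fp: int
    tn: int
    fn: int

    def precision(self):
        return self.tp / (self.tp + self.fp)

    def recall(self):
        return self.tp / (self.tp + self.fn)

    def accuracy(self):
        return (self.tp + self.tn) / (self.tp + self.fp + self.tn + self.fn)


counts = DetectionCounts(tp=90, fp=10, tn=40, fn=10)  # made-up counts
print(counts.precision())  # 90 / 100
print(counts.recall())     # 90 / 100
print(counts.accuracy())   # 130 / 150
```

When recall and FNR are computed from the same TP and FN counts, they sum to one, which is why a high recall corresponds to a low false negative rate.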
Figure 19. Evaluation of object detection models by using Regular (R.) and Deformable (D.) in terms of FNR, Acc., and
mAP on the Snapshot Serengeti dataset.
Figure 20. Evaluation of object detection models by using Regular (R.) and Deformable (D.) in terms of FNR, Acc., and
mAP on the BCMOTI dataset.
In Figure 19, according to the evaluation metrics (FNR, Acc., and mAP), Mask R-CNN
reaches the highest performance in both regular CNNs and D-CNNs. Furthermore,
deformable Mask R-CNN provides the best result with an accuracy of 98.4% and
mAP of 89.2%, while incorrectly identifying 427 images with animals in the test set as
empty images.
In Figure 20, on the BCMOTI dataset, which is the smallest dataset used in this work, the
performance of deformable Mask R-CNN decreases to 93.3% accuracy and 82.9% mAP, and
FNR is increased by 1.7%, as most of the images in this dataset were taken at night with
poor resolution and from the backside of the animals, as shown earlier in Figure 14.
Figure 21 shows that by using deformable Mask R-CNN, accuracy and mAP of
detection are 97.6% and 87.6%, respectively, on the Snapshot Wisconsin dataset with 0.6%
FNR. In the Snapshot Serengeti dataset, the system has been trained on a larger training
set than BCMOTI and Snapshot Wisconsin. Thereby, it has gained up to 5.1% accuracy
compared to BCMOTI, and up to 0.8% accuracy compared to Snapshot Wisconsin. This
shows the importance of having a large training set with a large number of instances in
each class.
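The accuracy gains quoted above are absolute differences, in percentage points, between the deformable Mask R-CNN accuracies reported for the three datasets. The short check below reproduces them from the reported figures; the dictionary and function names are illustrative.

```python
# Deformable Mask R-CNN accuracy (%) as reported for each dataset.
accuracy = {
    "Snapshot Serengeti": 98.4,
    "Snapshot Wisconsin": 97.6,
    "BCMOTI": 93.3,
}


def gap_in_points(results, reference, other):
    """Absolute accuracy difference in percentage points, rounded to one decimal."""
    return round(results[reference] - results[other], 1)


print(gap_in_points(accuracy, "Snapshot Serengeti", "BCMOTI"))
print(gap_in_points(accuracy, "Snapshot Serengeti", "Snapshot Wisconsin"))
```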
As shown in Figure 22, deformable Mask R-CNN is able to detect objects in about
0.78 s per image on all three datasets. That makes deformable Mask R-CNN, though
slightly slower than the regular version, suitable for use in most real-time applications.
Figure 21. Evaluation of object detection models by using Regular (R.) and Deformable (D.) in terms of FNR, Acc., and
mAP on the Snapshot Wisconsin dataset.
The image results in Figure 23 show that deformable Mask R-CNN can detect and
segment single and multiple animal species with a confidence score for each class. Deformable
Mask R-CNN detects animal species with higher accuracy and speed in comparison to
other regular and deformable R-CNN models. Therefore, not only can deformable
Mask R-CNN be applied in real-time systems to detect single and multiple animal species,
but it can also produce a mask over each detected animal in the image for counting the
number of occluded and overlapping animal species.
Figure 23. Some examples of animal species detection after deformable Mask R-CNN (output mask size is the object size).
In general, our results show that deformable Mask R-CNN using ResNet-101 can
detect and segment animals with high accuracy, exceeding the performance of the related
work, as shown in Table 2. This table summarizes the datasets, performance, and techniques
of our research and similar related work on animal species detection. The integration of
D-CNN into Mask R-CNN improves the performance of animal species detection. Our
research improves on this related work for the following reasons:
1. Three datasets of different characteristics have been used for training and testing.
2. Deformable convolutional layers have been added to the R-CNN detectors; these layers
enhance the extracted features, which in turn improves the performance of animal
species detection.
In future work, we aim to detect smaller animal species, which is one of the major
challenges of animal species detection, and to investigate improvement by reducing FNR.
Furthermore, we plan to design an efficient animal detector by improving the accuracy
of animal species identification and localization at a speed high enough for
real-time applications. To obtain higher accuracy, we need to extract more significant
features, improve pre- and post-processing methods, solve the class imbalance issue,
accommodate the imbalance between day and night images, and enhance classification
confidence. To reduce the response-time and increase the detection speed, we need to reduce the
network complexity and computation time by removing some layers from the deformable
Mask R-CNN architecture. Furthermore, a comparative study of one-stage and two-stage
detectors would provide insights into these approaches’ speed performance.
Author Contributions: Conceptualization, M.I., K.F.L. and F.G.; methodology, M.I.; software, M.I.;
validation, M.I.; formal analysis, M.I.; investigation, M.I., K.F.L. and F.G.; data curation, M.I.;
resources, L.E.S.; writing—original draft preparation, M.I.; writing—review and editing, K.F.L. and
F.G.; supervision, K.F.L. and F.G. All authors have read and agreed to the published version of
the manuscript.
Funding: This research received no external funding.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: Publicly available datasets were analyzed in this study. This data can
be found here: [58,59].
Acknowledgments: We gratefully acknowledge the support by the British Columbia Ministry of
Transportation and Infrastructure.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Felzenszwalb, P.F.; Girshick, R.B.; McAllester, D.; Ramanan, D. Object Detection with Discriminatively Trained Part-Based Models.
IEEE Trans. Pattern Anal. Mach. Intell. 2010, 32, 1627–1645. [CrossRef]
2. Lipton, Z.C.; Berkowitz, J.; Elkan, C. A Critical Review of Recurrent Neural Networks for Sequence Learning. arXiv 2015,
arXiv:1506.00019.
3. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Adv. Neural Inf.
Process. Syst. 2012, 25, 1097–1105. [CrossRef]
4. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444. [CrossRef]
5. Everingham, M.; Van Gool, L.; Williams, C.K.I.; Winn, J.; Zisserman, A. The Pascal Visual Object Classes (VOC) Challenge. Int. J.
Comput. Vis. 2010, 88, 303–338. [CrossRef]
6. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al.
ImageNet Large Scale Visual Recognition Challenge. Int. J. Comput. Vis. 2015, 115, 211–252. [CrossRef]
7. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE
Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [CrossRef]
8. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
[CrossRef]
9. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014;
pp. 580–587. [CrossRef]
10. Girshick, R. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile,
7–13 December 2015; pp. 1440–1448. [CrossRef]
11. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer
Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [CrossRef]
12. Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable convolutional networks. In Proceedings of the 2017 IEEE
International Conference on Computer Vision (ICCV), Venice, Italy, 22 October 2017; pp. 764–773.
13. Zhu, X.; Hu, H.; Lin, S.; Dai, J. Deformable ConvNets V2: More Deformable, Better Results. In Proceedings of the 2019 IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 9308–9316.
14. Papageorgiou, C.P.; Oren, M.; Poggio, T. A general framework for object detection. In Proceedings of the Sixth International
Conference on Computer Vision, Bombay, India, 7 January 1998; pp. 555–562. [CrossRef]
15. Dalal, N.; Triggs, B. Histograms of Oriented Gradients for Human Detection. In Proceedings of the IEEE Computer Society
Conference on Computer Vision and Pattern Recognition, San Diego, CA, USA, 20–25 June 2005; Volume 1, pp. 886–893.
[CrossRef]
16. Lowe, D.G. Distinctive Image Features from Scale-Invariant Keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [CrossRef]
17. Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in
Context. In Computer Vision—ECCV 2014; Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T., Eds.; Springer: Cham, Switzerland, 2014.
18. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
[CrossRef]
19. Li, Z.; Peng, C.; Yu, G.; Zhang, X.; Deng, Y.; Sun, J. Detnet: A Backbone network for Object Detection. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018.
20. Hinton, G.E.; Salakhutdinov, R.R. Reducing the Dimensionality of Data with Neural Networks. Science 2006, 313, 504–507.
[CrossRef] [PubMed]
21. Hinton, G.E.; Srivastava, N.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R.R. Improving neural networks by preventing
co-adaptation of feature detectors. arXiv 2012, arXiv:1207.0580.
22. Zeiler, M.D.; Fergus, R. Visualizing and Understanding Convolutional Networks. In European Conference on Computer Vision;
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics);
Springer: Cham, Switzerland, 2014; Volume 8689 LNCS, pp. 818–833. [CrossRef]
23. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
24. Sermanet, P.; Eigen, D.; Zhang, X.; Mathieu, M.; Fergus, R.; LeCun, Y. Overfeat: Integrated recognition, localization and detection
using convolutional networks. In Proceedings of the 2nd International Conference on Learning Representations, ICLR 2014,
Banff, AB, Canada, 14–16 April 2014.
25. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going Deeper with
Convolutions. arXiv 2014, arXiv:1409.4842.
26. He, K.M.; Zhang, X.Y.; Ren, S.Q.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [CrossRef]
27. Everingham, M.; Eslami, S.M.A.; Van Gool, L.; Williams, C.K.I.; Winn, J.; Zisserman, A. The Pascal Visual Object Classes Challenge:
A Retrospective. Int. J. Comput. Vis. 2015, 111, 98–136. [CrossRef]
28. Schneider, S.; Taylor, G.W.; Kremer, S. Deep Learning Object Detection Methods for Ecological Camera Trap Data. In Proceedings
of the 2018 15th Conference on Computer and Robot Vision (CRV), Toronto, ON, Canada, 8–10 May 2018; pp. 321–328.
29. Swinnen, K.; Reijniers, J.; Breno, M.; Leirs, H. A Novel Method to Reduce Time Investment When Processing Videos from Camera
Trap Studies. PLoS ONE 2014, 9, e98881. [CrossRef]
30. Figueroa, K.; Camarena-Ibarrola, A.; Garcia, J.; Villela, H.T. Fast Automatic Detection of Wildlife in Images from Trap Cameras.
Hybrid Learn. 2014, 8827, 940–947.
31. Yu, X.; Wang, J.; Kays, R.; Jansen, P.; Wang, T.; Huang, T. Automated identification of animal species in camera trap images.
EURASIP J. Image Video Process. 2013, 2013, 52. [CrossRef]
32. Kwan, C.; Gribben, D.; Tran, T. Multiple Human Objects Tracking and Classification Directly in Compressive Measurement
Domain for Long Range Infrared Videos. In Proceedings of the 2019 IEEE 10th Annual Ubiquitous Computing, Electronics &
Mobile Communication Conference (UEMCON), New York, NY, USA, 10–12 October 2019; IEEE: Piscataway, NJ, USA, 2019;
pp. 0469–0475.
33. Uddin, M.S.; Hoque, R.; Islam, K.A.; Kwan, C.; Gribben, D.; Li, J. Converting Optical Videos to Infrared Videos Using Attention
GAN and Its Impact on Target Detection and Classification Performance. Remote Sens. 2021, 13, 3257. [CrossRef]
34. Chen, G.; Han, T.X.; He, Z.; Kays, R.; Forrester, T. Deep convolutional neural network based species recognition for wild animal
monitoring. In Proceedings of the 2014 IEEE International Conference on Image Processing (ICIP), Paris, France, 27–30 October
2014; pp. 858–862.
35. Villa, A.G.; Salazar, A.; Vargas, F. Towards automatic wild animal monitoring: Identification of animal species in camera-trap
images using very deep convolutional neural networks. Ecol. Inform. 2017, 41, 24–32. [CrossRef]
36. Willi, M.; Pitman, R.T.; Cardoso, A.W.; Locke, C.; Swanson, A.; Boyer, A.; Veldthuis, M.; Fortson, L. Identifying animal species in
camera trap images using deep learning and citizen science. Methods Ecol. Evol. 2019, 10, 80–91. [CrossRef]
37. Norouzzadeh, M.S.; Morris, D.; Beery, S.; Joshi, N.; Jojic, N.; Clune, J. A deep active learning system for species identification and
counting in camera trap images. Methods Ecol. Evol. 2021, 12, 150–161. [CrossRef]
38. Norouzzadeh, M.S.; Nguyen, A.; Kosmala, M.; Swanson, A.; Palmer, M.S.; Packer, C.; Clune, J. Automatically identifying,
counting, and describing wild animals in camera-trap images with deep learning. Proc. Natl. Acad. Sci. USA 2018, 115,
E5716–E5725. [CrossRef]
39. Parham, J.; Stewart, C. Detecting plains and Grevy’s Zebras in the realworld. In Proceedings of the 2016 IEEE Winter Applications
of Computer Vision Workshops (WACVW), Lake Placid, NY, USA, 10 March 2016.
40. Zhang, Z.; He, Z.; Cao, G.; Cao, W. Animal Detection from Highly Cluttered Natural Scenes Using Spatiotemporal Object Region
Proposals and Patch Verification. IEEE Trans. Multimed. 2016, 18, 2079–2092. [CrossRef]
41. Xu, B.; Wang, W.; Falzon, G.; Kwan, P.; Guo, L.; Chen, G.; Tait, A.; Schneider, D. Automated cattle counting using Mask R-CNN in
quadcopter vision system. Comput. Electron. Agric. 2020, 171, 105300. [CrossRef]
42. Gupta, S.; Chand, D.; Kavati, I. Computer Vision based Animal Collision Avoidance Framework for Autonomous Vehicles. Inf.
Process. Manag. Uncertain. Knowl.-Based Syst. 2021, 1378, 237–248. [CrossRef]
43. Oquab, M.; Bottou, L.; Laptev, I.; Sivic, J. Learning and Transferring Mid-level Image Representations Using Convolutional
Neural Networks. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH,
USA, 23–28 June 2014; pp. 1717–1724. [CrossRef]
44. Oquab, M.; Bottou, L.; Laptev, I.; Sivic, J. Weakly supervised object recognition with convolutional neural networks. HAL 2014,
hal-01015140v1 2014.
45. Kavukcuoglu, K.; Ranzato, M.; Fergus, R.; LeCun, Y. Learning invariant features through topographic filter maps. In Proceedings
of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 1605–1612.
46. Goodfellow, I.; Bengio, Y.B.; Courville, A. Adaptive Computation and Machine Learning Series (Deep Learning); The MIT Press:
Cambridge, MA, USA, 2016. Available online: Academia.edu (accessed on 15 August 2020).
47. Bishop, C.M. Pattern Recognition, and Machine Learning; Springer: New York, NY, USA, 2006; Volume 128, pp. 1–58. Available
online: Academia.edu (accessed on 15 August 2020).
48. Uijlings, J.R.R.; van de Sande, K.E.A.; Gevers, T.; Smeulders, A.W.M. Selective Search for Object Recognition. Int. J. Comput. Vis.
2013, 104, 154–171. [CrossRef]
49. Ding, S.; Zhang, X.; An, Y.; Xue, Y. Weighted linear loss multiple birth support vector machine based on information granulation
for multi-class classification. Pattern Recognit. 2017, 67, 32–46. [CrossRef]
50. He, Y.; Zhu, C.; Wang, J.; Savvides, M.; Zhang, X. Bounding Box Regression With Uncertainty for Accurate Object Detection. In
Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA,
15–20 June 2019; Institute of Electrical and Electronics Engineers (IEEE): Piscataway, NJ, USA, 2019; pp. 2883–2892.
51. Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of
the 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009;
pp. 248–255.
52. Dai, J.; He, K.; Sun, J. Convolutional feature masking for joint object and stuff segmentation. In Proceedings of the 2015 IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; Institute of Electrical and
Electronics Engineers (IEEE): Piscataway, NJ, USA, 2015; pp. 3992–4000.
53. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [CrossRef]
54. Prokudin, S.; Kappler, D.; Nowozin, S.; Gehler, P. Learning to Filter Object Detections. In Transactions on Computational Science XI.;
Springer Science and Business Media LLC: Berlin/Heidelberg, Germany, 2017; Volume 10496, pp. 52–62.
55. Dai, J.; He, K.; Sun, J. Instance-Aware Semantic Segmentation via Multi-task Network Cascades. In Proceedings of the 2016 IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; Institute of Electrical
and Electronics Engineers (IEEE): Piscataway, NJ, USA, 2016; pp. 3150–3158.
56. Arnab, A.; Torr, P.H.S. Pixelwise Instance Segmentation with a Dynamically Instantiated Network. In 2017 IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; Institute of Electrical and Electronics
Engineers (IEEE): Piscataway, NJ, USA, 2017; pp. 879–888.
57. Wu, H.; Siebert, J.P.; Xu, X. Fully Convolutional Networks for automatically generating image masks to train Mask R-CNN. arXiv
2020, arXiv:2003.01383v1.
58. Labeled Information Library of Alexandria: Biology and Conservation (LILA BC). Available online: http://lila.science/datasets/
snapshot-serengeti.[SnapshotSerengeti] (accessed on 27 August 2020).
59. Snapshot Wisconsin, A Volunteer-Based Project for Wildlife Monitoring. Available online: https://dnr.wisconsin.gov/topic/
research/projects/snapshot.[SnapshotWisconsin] (accessed on 1 May 2020).
60. Fan, Q.; Brown, L.; Smith, J. A closer look at Faster R-CNN for vehicle detection. In Proceedings of the 2016 IEEE Intelligent
Vehicles Symposium (IV), Gothenburg, Sweden, 19–22 June 2016; Volume 1, pp. 124–129.
61. MATLAB. Available online: https://www.mathworks.com/help/vision/ug/get-started-with-the-image-labeler.html (accessed
on 15 January 2020).
62. Khan, A.; Sohail, A.; Zahoora, U.; Qureshi, A.S. A survey of the recent architectures of deep convolutional neural networks. Artif.
Intell. Rev. 2020, 53, 5455–5516. [CrossRef]
63. Henderson, P.; Ferrari, V. End-to-End Training of Object Class Detectors for Mean Average Precision. In Asian Conference on
Computer Vision; Springer: Cham, Switzerland, 2016; pp. 198–213. [CrossRef]
64. Saxena, A.; Gupta, D.K.; Singh, S. An Animal Detection and Collision Avoidance System Using Deep Learning. Adv. Graph.
Commun. Packag. Technol. Mater. 2021, 668, 1069–1084. [CrossRef]
65. Yilmaz, A.; Uzun, G.N.; Gurbuz, M.Z.; Kivrak, O. Detection and Breed Classification of Cattle Using YOLO v4 Algorithm. In
Proceedings of the 2021 International Conference on INnovations in Intelligent SysTems and Applications (INISTA), Kocaeli,
Turkey, 25–27 August 2021; pp. 1–4.
66. Sato, D.; Zanella, A.J.; Costa, E.X. Computational classification of animals for a highway detection system. Braz. J. Veter-Res. Anim.
Sci. 2021, 58, e174951. [CrossRef]