
Environmental Science and Pollution Research (2020) 27:39619–39634

https://doi.org/10.1007/s11356-020-09950-3

RESEARCH ARTICLE

Intelligent animal detection system using sparse multi-discriminative neural network (SMD-NN) to mitigate animal-vehicle collision

S. Divya Meena¹ & Agilandeeswari Loganathan¹

Received: 11 February 2020 / Accepted: 29 June 2020 / Published online: 10 July 2020
© Springer-Verlag GmbH Germany, part of Springer Nature 2020

Abstract
Animal-vehicle collision (AVC) is a predominant problem on both urban and rural roads and highways. Detecting animals on the road is challenging due to factors like the fast movement of both animals and vehicles, highly cluttered environmental settings, noisy images, and occluded animals. Deep learning has been widely used for animal applications; however, it requires large training data, so the dimensionality increases, leading to a complex model. In this paper, we present an animal detection system for mitigating AVC. The proposed system integrates sparse representation and deep features optimized with FixResNeXt. The deep features extracted from candidate parts of the animals are represented in a sparse form using a feature-efficient learning algorithm called the Sparse Network of Winnows (SNoW). The experimental results prove that the proposed system is invariant to viewpoint, partial occlusion, and illumination. On the benchmark datasets, the proposed system has achieved an average accuracy of 98.5%.

Keywords Animal-vehicle collision · Animal detection · Deep features · Sparse representation · Sparse Network of Winnows

Responsible editor: Philippe Garrigues

* Agilandeeswari Loganathan
agila.l@vit.ac.in

¹ School of Information and Technology, Vellore Institute of Technology, Vellore, India

Introduction

Animal-vehicle collision (AVC) signifies mounting issues in terms of conservation, safety of both animals and humans, and financial apprehensions across India, specifically the Southern zone. Both animals and humans are crashing together on the roads, leading to loss of life. All these crashes are due to the rapid depletion of the forest, which is being used for various human causes like urban sprawl, expanding the infrastructure for transportation, etc. Most instances of AVC are by far reported in rural settings. However, with the ongoing rapid destruction of the forest, both the urban and suburban landscapes will soon be affected. According to a report (Bíl et al. 2019), by 2050, urban areas alone will hold approximately 6.3 billion people. Such engorgement of cities will bring higher traffic rates and unprecedented road use, and with it a higher number of accidents involving AVC. However, only very few cities are equipped with facilities to mitigate AVC. According to the Society for Prevention of Cruelty to Animals (SPCA) (Sharma and Shah 2016), approximately 500 cattle were admitted in the past 3 years, and 90% of the cases were due to AVC.

Animal detection systems are mostly developed for real-time animal monitoring applications like human-animal conflict (Divya Meena and Agilandeeswari 2019), animal-vehicle collision (Meena and Agilandeeswari 2020), animal monitoring systems (Divya Meena and Agilandeeswari 2020a, b), wildlife population estimation (Divya Meena and Agilandeeswari 2020a, b), and much more. However, a large portion of the animal detection problem is yet to be solved. The biggest challenge for animal detection systems is the physical characteristics of animals, like their shape, size, and color, and their unpredictable behavior, which leads to pose variation, camouflage, and occlusion. Detecting animals on roads is more complex than in a static image or video, as both the animal and the vehicle are in motion. Still, the basis of any animal detection system is detecting the animals from images or videos. In this work, we describe an approach for detecting animals on roads and highways to mitigate AVC. Detecting animals on roads or highways is challenging due to noise and cluttered environmental settings. Furthermore, animals may be occluded by vehicles or by signboards. Hence, the right representation of the problem

is essential to find an optimal solution. Based on this belief, we propose a sparse representation for the deep features to support a computationally inexpensive animal detection system.

According to Fabre-Thorpe et al. (2001), human eyes get exhausted easily and require a lot of rest to focus. However, without a proper system for AVC avoidance, humans have to act swiftly to prevent collisions with animals that appear unexpectedly. A few works on AVC (Burghardt and Ćalić 2006) detect animals on the road, but they require the animals to look forward towards the camera, which is not feasible in most cases. A similar system was developed by Viola and Jones (2004), where the authors detect animals by their faces. However, since an animal can appear from different directions, detecting animals by their faces is hard and impractical. Animal movement has been considered one of the prime factors for detecting animals (Walther et al. 2004). The idea is to subtract the background, keep the remaining blobs as foreground, and detect animals from them. However, background subtraction is difficult on both highways and roadsides. Threshold segmentation (Nascimento and Marques 2006) has also been used to subtract the background; however, deciding on an appropriate threshold value is difficult with moving backgrounds, and this may lead to false positives. Ramanan et al. (2006) detected animals using texture and SIFT features. However, this approach is applicable only when the scene contains a single animal and very minimal background clutter.

One of the earliest animal-vehicle collision systems was developed with thermal images by Zhou et al. (2009). The authors specifically focused on deer-vehicle collision (DVC) through deer detection with pattern recognition and matching techniques. The detailed patterns of deer were stored in an array format, and this constituted their database. Using pattern matching techniques, the animals were detected. Ragab et al. (2011) proposed a real-time camel-vehicle collision avoidance system with GPS support. The system consists of two sub-systems, namely the animal detection and warning systems. With a programmable GPU device, the movement of camels on the roads can be detected, and the driver is alerted. This is one of the first intelligent transport systems (ITS). Despite the huge cost involved, the system has many false-negative cases due to the delay in receiving the SMS alert.

Mammeri et al. (2014) proposed a two-stage animal detection system for avoiding animal-vehicle collisions using LBP features and an AdaBoost classifier. The RoI from the first stage is sent to the second stage, which uses histogram of oriented gradients (HOG) features and an SVM classifier for classifying the RoI as animal or non-animal. Jaskó et al. (2017) also proposed a novel animal detection system using monocular color vision images for avoiding animal-vehicle collisions in traffic scenarios. Based on intensity, color, and orientation features, a saliency map is generated to detect the animal in heavy traffic scenes. The salient region is considered the RoI, and it is fed to an SVM classifier for final classification. On a dataset of animals that show their side pose, the model achieved an accuracy of 90%. Despite the good results, the model is incapable of detecting animals from their front view. Sharma and Shah (2016) proposed a real-time animal detection system for avoiding animal-vehicle collisions using computer vision techniques. The authors used a series of pre-processing techniques, followed by HOG-based feature extraction and finally a boosted cascade classifier for the classification. A method to calculate the distance of the animal from the vehicle is also proposed. One of the practical animal-vehicle collision systems was proposed by Forslund and Bjärkefur (2014). The authors proposed a vehicular animal detection system named Autoliv that is now practically used in cars like Audi, BMW, and Daimler. The classification system, based on cascade boosting, is claimed to be robust to occlusion and pose variation. The model can detect animals as far as 200 m away from the car and produces less than one false positive a year.

Biological vision theory attempts to solve object detection by decomposing objects into individual parts and establishing relations over the parts. Based on this theory, we describe a sparse image representation for the deep features that are extracted from the candidate parts and for which the spatial relationships are established. Sparse representation is gaining attention and has already been applied to several applications like car detection (Agarwal and Roth 2002) and hyperspectral imaging (Chen et al. 2011). It can identify the essential patterns of images and can work even on noisy or occluded images. A few image processing applications like face recognition and image restoration have been developed using sparse representation. We have employed sparse representation for class-specific animal detection based on candidate feature extraction that learns to detect the animals accurately, even with minimal candidate features.

While traditionally objects were represented using low-level features such as edges, bars, and other primitive structures, we believe that the candidate part-based representation produces information-rich and meaningful features. In our proposed system, we first localize the target animals and identify the candidate parts using YOLO. We then extract small patches around the candidate parts and construct a very large dictionary. The large feature space eliminates viewpoint variations. We first extract deep features and characterize them as a higher-level sparse representation. The spatial relationship among the parts is learned with the sparse representation, which is otherwise not possible with other forms like pixel-based representation.

Contribution

The contribution of the proposed system is as follows:

(i) To develop an efficient and computationally inexpensive animal detection system for mitigating the animal-vehicle collision.

(ii) Identifying the candidate parts with a CNN and extracting their features instead of manually extracting them with hand-crafted feature extraction techniques.
(iii) Representing the images as parts along with their spatial relationships. This higher-level feature representation is more robust and efficient than a low-level representation.
(iv) Representing the deep feature space in a sparse representation to reduce the complexity and computation cost.

The paper is structured as follows. Section 2 describes the proposed methodology. The experimental framework is presented in Section 3, where we discuss the datasets, performance metrics, and the experimental setup. Experimental results are discussed in Section 4. Section 5 concludes with the summary and future direction of the work.

Proposed methodology

In this section, we present the proposed system in detail. First, we pre-process the images using FixResNeXt to resolve the resolution discrepancy between the training and testing data. Then we extract the candidate features through target localization, where we highlight the salient region and identify the candidate sub-regions. The distinct representative parts are identified, a dictionary is constructed with those parts, and the spatial relationships between the parts are also studied. Finally, we learn a sparse classifier to detect and identify the target animals from the given image. The overall flow of the proposed system is depicted in Fig. 1.

Pre-processing with FixResNeXt

Image resolution discrepancy between the train and test data affects the detection performance. Recently, Touvron et al. (2019) proposed FixResNeXt, based on ResNet, to resolve these resolution discrepancies. As the name suggests, it is a method that fixes the train-test resolution discrepancy and thereby improves the detection performance. The FixRes network takes input images of any resolution and normalizes them with mean and standard deviation values that vary based on the image size. For our application, we have a mean value of [0.485, 0.456, 0.406], and the standard deviation is [0.229, 0.224, 0.225]. To avoid over-fitting and also to generalize the model, we perform data augmentation during training. Typical augmentation techniques used are cropping at random sizes, flipping horizontally, and color jitter. By combining multiple data augmentation techniques, we improve test accuracy. As a fast rule, we adopt the same parameters as FixResNeXt, except for the learning rate and the number of epochs. We set the learning rate to 0.003.

Region selection and activation statistics

Generally, a CNN requires input images to be pre-processed before they are fed to the network. The primary step in pre-processing is region selection. The required region, or area of interest, is selected and extracted from the whole image, after which we resize it based on the chosen network. By and large, a CNN is invariant to translation changes but not scale-invariant; FixResNeXt is proposed based on this idea. Consequently, resizing the images has an apparent effect on the network activations. Figure 2 illustrates how FixResNeXt resolves the resolution discrepancy between the train and test data.

Initially, the region is selected from the input image both during training and testing. The red bounding box is resampled as a cropped image to be fed to the neural network. When the objects in the image are of the same size, the cropped image will be larger at training time than at test time. This leads to a train-test resolution discrepancy. To overcome it, we either increase the test resolution (3rd and 4th columns) or decrease the train data resolution. This makes the cropped image have the same resolution at both train and test time. Furthermore, this leads to scale invariance, which is one of the essential requirements for the neural network. The above computation is computationally very cheap, besides being efficient.

The cumulative distribution of the activation function is measured after the pooling layer. The values are all positive, as we apply the rectified linear unit (ReLU) activation function. For different train-test resolutions, we have different activation maps. For instance, for the train-test resolution of 224 × 224 pixels, the activation map is 7 × 7, and for the resolution of 448 × 448, the activation map is 14 × 14. In Fig. 3, we plot the cumulative distribution function (CDF) of the activation function on the output of the pooling layer.

Target localization and candidate feature extraction

To achieve a meaningful representation of the animal, the candidate features or parts of the animals have to be captured. Such features should also capture the inter-class variance with other animal classes. Automating the extraction of candidate parts is the best approach for a real-time animal-vehicle collision avoidance system. In one of our previous works (Meena and Agilandeeswari 2019), we proposed a multi-part CNN-based system that automatically identifies and extracts the discriminative features of the animals. In Divya Meena and Agilandeeswari (2020a), we extended the multi-part system to handle real-time unlabeled data by employing a supervised clustering technique over the classified data. In this work, we propose yet another technique for extracting the candidate features of the animals.
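As an illustrative aside, the pre-processing stage described above can be sketched with torchvision-style transforms. This is a minimal sketch under our own assumptions: the helper names and the exact resize ratio are ours, not the paper's settings; only the normalization constants, the augmentation types, and the FixRes-style raised test resolution come from the text.

```python
# Hypothetical sketch of the FixResNeXt-style pre-processing described above.
from torchvision import transforms

MEAN = [0.485, 0.456, 0.406]   # normalization mean quoted in the text
STD = [0.229, 0.224, 0.225]    # normalization standard deviation

def build_train_transform(train_res=224):
    """Train-time augmentation: random-size crops, flips, color jitter."""
    return transforms.Compose([
        transforms.RandomResizedCrop(train_res),
        transforms.RandomHorizontalFlip(),
        transforms.ColorJitter(0.4, 0.4, 0.4),
        transforms.ToTensor(),
        transforms.Normalize(MEAN, STD),
    ])

def build_test_transform(test_res=448):
    """Test at a higher resolution to offset the apparent-size shift that
    RandomResizedCrop introduces at train time (the FixRes idea). With a
    stride-32 backbone, 224x224 inputs give 7x7 activation maps and
    448x448 inputs give 14x14 maps, as noted in the text."""
    return transforms.Compose([
        transforms.Resize(int(test_res * 1.15)),  # ratio is an assumption
        transforms.CenterCrop(test_res),
        transforms.ToTensor(),
        transforms.Normalize(MEAN, STD),
    ])
```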

Fig. 1 Proposed animal detection framework

Our method for automatic candidate feature extraction is based on the You Only Look Once (YOLO v1) network by Redmon et al. (2016). YOLO is a variant of single-shot detectors (SSD) that directly outputs a fixed-length regression output for a given fixed-size input image, without the need for a separate region proposal network (RPN). The goal of using YOLO is to perform bounding box regression and animal classification around all target objects and obtain a result that includes a collection of image sub-regions that can be cropped into a candidate feature set. YOLO takes input images of size 448 × 448 and outputs bounding boxes for each of the target objects along with the label and classification score. We essentially treat these localizations as salient target detections and use their sub-regions as candidate features for further processing, as sketched below.
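The extraction step itself can be sketched as follows. This is a hedged illustration: `run_yolo` stands in for the Python wrapper around the original YOLO implementation mentioned in the Experimental setup, and its detection format (label, score, box in original-image coordinates) is an assumption.

```python
# Hypothetical sketch of candidate sub-region extraction from YOLO outputs.
from PIL import Image

def extract_candidate_regions(image_path, run_yolo, score_thresh=0.5):
    """Crop the sub-regions that YOLO localizes as salient targets."""
    image = Image.open(image_path).convert("RGB")
    # The wrapper is assumed to resize to 448x448 internally and to return
    # detections as (class_label, score, (x1, y1, x2, y2)) tuples.
    detections = run_yolo(image)
    candidates = []
    for label, score, (x1, y1, x2, y2) in detections:
        if score < score_thresh:   # threshold value is an assumption
            continue
        # Each confident localization becomes a candidate feature region.
        candidates.append((label, score, image.crop((x1, y1, x2, y2))))
    return candidates
```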

Fig. 2 Illustration of FixResNeXt for resolving train-test resolution discrepancy



Fig. 3 CDF of the activation function on the output of the pooling layer, on testing for different test resolutions. Left: 128 pixels, center: 224 pixels, and right: 448 pixels

The motivation for using target localization is that we want to process relevant sub-regions of the targets and also to eliminate the distracting pixel information, thereby reducing the computational complexity of animal identification.

Sample results for target localization and candidate feature extraction are presented in Fig. 4. The ground truth is marked with a green bounding box, and the red bounding box is the result of target object localization. The region within the red bounding box is treated as a salient region. The target animal in the green bounding box has positive patches (blue) and true negative patches (red). The boxes in yellow are not considered, as they represent non-target objects. The blue patches are taken as the candidate features since they represent the salient region of the target object.

From the candidate parts, we extract small patches and construct a large dictionary. By constructing a large dictionary from all the target objects, we will be able to handle new unlabeled data. That is, a new object class can be represented by a subset of this large dictionary. This approach is very efficient since most of the animals that come onto the road are quadrupeds and follow the same general structure. Thus, a subset of the dictionary can represent any new animal class. In our experiments, we used 100 samples from each animal class to build the dictionary. As seen in Fig. 4, most of the extracted parts are visually very similar to one another. So, we cluster similar parts into individual groups. We follow a bottom-up clustering approach, where each part from the dictionary is initially put in a distinct cluster. A similarity measure is used to verify the similarity between two clusters. On this basis, similar clusters are grouped until no similar cluster remains alone. The similarity measure is defined as

$$S(C_1, C_2) = \frac{\sum_{p_1 \in C_1} \sum_{p_2 \in C_2} S(p_1, p_2)}{|C_1| \cdot |C_2|} \quad (1)$$

The above equation calculates the average similarity between the distinct parts. A cluster that represents a single part is considered one conceptual part. Since more than one cluster can represent a single part, we are able to obtain a higher-level abstract representation of the parts from the redundant representation. Having constructed the candidate part dictionary, we now represent the images using the dictionary parts. Based on the similarity measure, we determine which parts of the input image are present in the dictionary. Finally, each image is represented as a feature vector based on the dictionary parts. Searching for dictionary parts across a whole image is computationally quite expensive. Hence, we detect the candidate features and highlight the regions around the parts. The patches extracted from the highlighted regions are then compared with the dictionary parts based on the similarity measure used for the clustering. If a sufficiently similar dictionary part is found, the patch in the image is represented by the feature id corresponding to that dictionary part. A sketch of this clustering step follows.
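Below is a minimal sketch of this bottom-up clustering using the average similarity of Eq. (1). The pairwise patch similarity `patch_similarity` and the merge threshold are assumptions; the paper does not specify them.

```python
# Hypothetical sketch of bottom-up clustering of dictionary patches.
import itertools

def cluster_similarity(c1, c2, patch_similarity):
    """Eq. (1): average pairwise similarity between two clusters."""
    total = sum(patch_similarity(p1, p2) for p1 in c1 for p2 in c2)
    return total / (len(c1) * len(c2))

def bottom_up_cluster(patches, patch_similarity, threshold=0.8):
    # Each patch starts in its own cluster; repeatedly merge the most
    # similar pair until no pair exceeds the similarity threshold.
    clusters = [[p] for p in patches]
    while len(clusters) > 1:
        (i, j), best = max(
            (((i, j), cluster_similarity(clusters[i], clusters[j],
                                         patch_similarity))
             for i, j in itertools.combinations(range(len(clusters)), 2)),
            key=lambda pair: pair[1])
        if best < threshold:
            break  # remaining clusters stand for distinct conceptual parts
        clusters[i].extend(clusters.pop(j))  # merge cluster j into i (i < j)
    return clusters
```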

Fig. 4 Illustration of a target object localization, b candidate feature extraction



The spatial relationships among the candidate parts are defined in terms of the distance and direction between each pair of parts. Instead of using a geometric-alignment-based technique, we try to learn a model that best represents the relationships among the candidate parts. Basically, we discretize the distance and direction between the parts into bins. The distance is defined relative to the size of the candidate parts and is discretized into 5 bins. Similarly, the direction is discretized into 4 different ranges, each covering an angle of 90° and thus covering the entire 360° around the animal. Thus, we get 20 (5 × 4 bins) possible distance-direction relations between any two candidate parts. Since the number of active candidate parts is minimal, the computation cost of the distance-direction relations is also negligible.

Learning a sparse classifier

The training images are converted into feature vectors, and the number of features for any given image will be very large. Adding to this are the 20 possible relations between the candidate parts. However, we focus only on the candidate features, and so the number of active features will be very small. Utilizing this sparseness property, we train our classifier using the Sparse Network of Winnows (SNoW) learning architecture (Carlson et al. 1999). SNoW is a sparse network of linear units over a pre-defined learned feature space. The input nodes serve as the input features, and the linear units are the target nodes, which correspond to the animal classes in the dataset. Given a set of candidate input features, each input image is mapped onto the set of features that are active in it. This feature representation is fed to the input layer and propagates to the target nodes, which are connected via weighted edges. The active set of features is represented by $A_t = \{i_1, i_2, \ldots, i_n\}$, where $t$ is the target node. The active state is then represented by $\sum_{i \in A_t} w_{ti} > \theta_t$, where $w_{ti}$ is the weighted edge connecting target node $t$ and the $i$th feature, and $\theta_t$ is the threshold value. Each target class is represented by a sub-network, so a given input image is treated autonomously by the corresponding target sub-network. Separate feature vectors are learned for each of the animal classes, and so a distinct function is learned from the shared feature space for each of the target classes. When a new input feature vector is given, the learned functions output an activation value using the sigmoid activation function. The class with the highest activation value is taken as the class for that input vector. Basically, the model follows the winner-take-all decision approach. The activation-level-based conclusion provides a robust confidence score, and we identify the animal based on this confidence score in the final stage. The learning policy is mistake-driven, and several update rules can be used in SNoW. In this application, we used the most successful update rule, Littlestone's Winnow update rule. It is a multiplicative update rule where the input features are not known a priori. That is, (1) input features are allocated in a data-driven way: an input node for feature i is allocated only if feature i is active in the input image, and (2) a link (i.e., a non-zero weight) exists between a target node t and a feature i if and only if i has been active in an image labeled t. Thus, the architecture also supports augmenting the feature types at later stages or from external sources in a flexible way, an option we do not use in the current work. A sketch of this classifier is given below.
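The sketch below illustrates such a SNoW-style classifier with Littlestone's multiplicative Winnow update and winner-take-all prediction. Feature ids are assumed to encode dictionary-part ids together with the discretized distance-direction relations; the promotion factor and threshold are illustrative, not the paper's settings.

```python
# Hypothetical sketch of a SNoW-style classifier with the Winnow update.
from collections import defaultdict

class SnowClassifier:
    def __init__(self, classes, alpha=2.0, theta=1.0):
        self.alpha, self.theta = alpha, theta
        # One sparse target sub-network per class; links are allocated
        # lazily, only for features seen active with that class.
        self.weights = {c: defaultdict(lambda: 1.0) for c in classes}

    def activation(self, cls, active_features):
        w = self.weights[cls]
        # Only features already linked to the target node contribute.
        return sum(w[f] for f in active_features if f in w)

    def update(self, active_features, true_cls):
        """Mistake-driven multiplicative (Winnow) update."""
        for f in active_features:            # data-driven link allocation
            _ = self.weights[true_cls][f]
        for cls, w in self.weights.items():
            act = self.activation(cls, active_features)
            if cls == true_cls and act <= self.theta:
                for f in active_features:
                    w[f] *= self.alpha       # promote on a missed positive
            elif cls != true_cls and act > self.theta:
                for f in active_features:
                    if f in w:
                        w[f] /= self.alpha   # demote on a false alarm

    def predict(self, active_features):
        # Winner-take-all over the target sub-networks.
        return max(self.weights,
                   key=lambda c: self.activation(c, active_features))
```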

Fig. 5 Sample images from the a AoW, b COCO, and c AwA datasets

Table 1 Effects of different proportions of dataset splits

| Dataset | Train (%) | Test (%) | Accuracy (%) | Specificity (%) | Sensitivity (%) | F-measure (%) |
|---|---|---|---|---|---|---|
| COCO animals | 10 | 90 | 88.7 | 95.6 | 70.9 | 83.0 |
| | 20 | 80 | 90.0 | 98.3 | 72.0 | 87.0 |
| | 30 | 70 | 93.1 | 100.1 | 77.9 | 88.0 |
| | 40 | 60 | 97.8 | 96.4 | 78.7 | 82.0 |
| | 60 | 40 | 97.0 | 99.2 | 81.5 | 83.0 |
| Animals on the Web | 10 | 90 | 87.9 | 99.3 | 81.9 | 82.0 |
| | 20 | 80 | 90.3 | 98.3 | 82.9 | 93.0 |
| | 30 | 70 | 92.8 | 91.1 | 87.0 | 93.0 |
| | 40 | 60 | 96.2 | 85.9 | 86.5 | 95.0 |
| | 60 | 40 | 97.4 | 89.3 | 87.9 | 97.2 |
| Animals with Attributes | 10 | 90 | 90.3 | 92.2 | 89.0 | 89.7 |
| | 20 | 80 | 94.6 | 96.7 | 92.8 | 92.0 |
| | 30 | 70 | 96.1 | 98.0 | 95.5 | 96.1 |
| | 40 | 60 | 98.8 | 99.4 | 98.4 | 98.8 |
| | 60 | 40 | 98.9 | 99.6 | 98.5 | 99.0 |

Implementing Littlestone's Winnow update rule via a sparse network follows two steps:

1) An input node is allocated for a feature i only if i is active in the given input image.
2) A link is established between a target node t and a feature i only if i is active in an image labeled t.

The key property of the Winnow update rule is that the number of examples it requires to learn a linear function grows linearly with the number of candidate features. This property is crucial in applications where the number of features is vast, but a relatively small number of them is sufficient. Winnow efficiently learns any linear threshold function and is robust to various kinds of noise. Once the target sub-networks complete learning, the winner-take-all mechanism selects the dominant active target node in the SNoW unit to produce the final prediction.

Experimental framework

Dataset description

The proposed approach has been tested on three benchmark animal datasets, namely Animals on the Web (AoW), COCO animals, and Animals with Attributes (AwA). The first dataset, Animals on the Web, was released by Berg and Forsyth (2006). It has a total of 14,051 images of 10 different animal classes like alligator, giraffe, penguin, etc. Each image is at least 120 × 120 pixels. The second dataset, COCO animals, was released by researchers at Stanford University (Patterson and Hays 2016). It is a subset of the COCO dataset and has 8 classes of animals: bear, bird, cat, dog, giraffe, horse, sheep, and zebra. The dataset has a total of 800 training images and 200 test images. The test images are of different resolutions and contain partially occluded animals in highly cluttered backgrounds. The last dataset, AwA, is a large-scale dataset with 50 animal classes and over 30,000 images. It was released by Lampert et al. (2009) and has images that are mostly taken on roads. Sample images from each dataset are depicted in Fig. 5.

Fig. 6 Performance of balanced vs. imbalanced dataset (accuracy on COCO animals, Animals on the Web, and Animals with Attributes)

Performance metrics

The accuracy of the model is measured with two quantitative metrics, namely the positive detection rate and the false detection rate.

Table 2 Performance of the proposed system on the "COCO Animal" dataset

| Class | Accuracy | Precision | Recall | Specificity | F-measure | FPR | FNR |
|---|---|---|---|---|---|---|---|
| Bear | 0.970 | 0.883 | 0.978 | 0.469 | 0.981 | 0.531 | 0.022 |
| Bird | 0.956 | 0.851 | 0.971 | 0.300 | 0.978 | 0.700 | 0.029 |
| Cat | 0.971 | 0.880 | 0.977 | 0.455 | 0.981 | 0.545 | 0.023 |
| Dog | 0.973 | 0.927 | 0.987 | 0.602 | 0.986 | 0.398 | 0.013 |
| Giraffe | 0.974 | 0.934 | 0.989 | 0.618 | 0.987 | 0.382 | 0.011 |
| Horse | 0.966 | 0.898 | 0.981 | 0.524 | 0.983 | 0.476 | 0.019 |
| Sheep | 0.971 | 0.917 | 0.985 | 0.579 | 0.985 | 0.421 | 0.015 |
| Zebra | 0.970 | 0.915 | 0.985 | 0.572 | 0.984 | 0.428 | 0.015 |

Average accuracy: 0.97. Average precision: 0.90. Average sensitivity: 0.98. Average specificity: 0.51.

The positive detection rate is the number of correct detections relative to the total positive samples. Similarly, the false detection rate is the number of incorrect detections with respect to the total negative samples.

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \quad (2)$$

$$\text{Positive detection rate} = \frac{\text{Number of true positives}}{\text{Total number of positive samples}} \quad (3)$$

$$\text{False detection rate} = \frac{\text{Number of false positives}}{\text{Total number of negative samples}} \quad (4)$$

The trade-off between correct and incorrect detections can be appropriately expressed with a ROC (receiver operating characteristic) curve, which plots correct detections vs. incorrect detections. One other important metric in the detection problem is the precision-recall curve. Precision corresponds to the percentage of detected images that are positive (correctly classified), and recall is the total percentage of positive images obtained. Precision is also called the positive predictive value (PPV), and recall is also called sensitivity. Besides, we also have the specificity and sensitivity metrics.

$$\mathrm{Precision} = \frac{TP}{TP + FP} = \mathrm{PPV} \quad (5)$$

$$\mathrm{Recall} = \frac{TP}{\text{Total number of positive samples}} = \text{Positive detection rate} \quad (6)$$

$$\mathrm{Sensitivity} = \frac{TP}{TP + FN} \quad (7)$$

$$\mathrm{Specificity} = \frac{TN}{TN + FP} \quad (8)$$
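For concreteness, the metrics of Eqs. (2)-(8), together with the F-measure and error rates defined in Eqs. (9)-(11) below, reduce to the following direct computation from raw counts (a minimal sketch):

```python
# Direct transcription of the evaluation metrics from TP/TN/FP/FN counts.
def detection_metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)        # Eq. (2)
    recall = tp / (tp + fn)        # Eqs. (3), (6), (7): PDR = sensitivity
    fdr = fp / (fp + tn)           # Eq. (4): false detection rate
    precision = tp / (tp + fp)     # Eq. (5): PPV
    specificity = tn / (tn + fp)   # Eq. (8)
    f1 = 2 * precision * recall / (precision + recall)  # Eq. (9)
    alpha = fp / (fp + tn)         # Eq. (10): type I error (FPR)
    beta = fn / (tp + fn)          # Eq. (11): type II error (FNR)
    return {"accuracy": accuracy, "recall": recall, "precision": precision,
            "specificity": specificity, "f1": f1, "fpr": alpha,
            "fnr": beta, "fdr": fdr}
```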

Table 3 Performance of the animal detection system on the "Animals on the Web" dataset

| Class | Accuracy | Precision | Recall | Specificity | F-measure | FPR | FNR |
|---|---|---|---|---|---|---|---|
| Alligator | 0.975 | 0.939 | 0.990 | 0.626 | 0.987 | 0.374 | 0.010 |
| Ant | 0.912 | 0.923 | 0.986 | 0.593 | 0.985 | 0.407 | 0.014 |
| Bear | 0.978 | 0.950 | 0.992 | 0.648 | 0.988 | 0.352 | 0.008 |
| Beaver | 0.977 | 0.948 | 0.992 | 0.644 | 0.988 | 0.356 | 0.008 |
| Dolphin | 0.986 | 0.923 | 0.972 | 0.469 | 0.981 | 0.428 | 0.019 |
| Frog | 0.981 | 0.898 | 0.966 | 0.300 | 0.986 | 0.374 | 0.015 |
| Giraffe | 0.985 | 0.917 | 0.971 | 0.455 | 0.987 | 0.407 | 0.022 |
| Leopard | 0.981 | 0.950 | 0.981 | 0.644 | 0.981 | 0.428 | 0.023 |
| Monkey | 0.986 | 0.948 | 0.985 | 0.572 | 0.986 | 0.374 | 0.013 |
| Penguin | 0.987 | 0.915 | 0.978 | 0.626 | 0.987 | 0.407 | 0.011 |

Average accuracy: 0.98. Average precision: 0.93. Average sensitivity: 0.98. Average specificity: 0.56.

Fig. 7 Effects of clustering and spatial relationship on the overall performance (precision-recall curves for four configurations: both clustering and relations; no clustering but relations; no relations but clustering; neither)

Fig. 9 Effects of CNN depths on the COCO animals dataset (accuracy vs. number of iterations for 1- to 5-layer CNNs)
The F1 measure is a weighted average of precision and recall; it is a balance between the two. This is more important than accuracy when the classes are uneven, which is the case most of the time. It is given by

$$F_1 = 2 \times \frac{\mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \quad (9)$$

Apart from these, we have the type I and type II errors. The FPR and FNR are given by

$$\alpha = \frac{FP}{FP + TN} \quad (10)$$

$$\beta = \frac{FN}{TP + FN} \quad (11)$$

The type I error defines the false positive rate and is sometimes called alpha; it is the complement of specificity. The false-negative rate belongs to the type II error and is the complement of sensitivity.

Experimental setup

The experiments were carried out in a Windows 10 (64-bit) environment on a single NVIDIA GeForce 940M (driver version 376.82) GPU workstation with an Intel(R) Xeon(R) W-2102 CPU @ 2.90 GHz and 128 GB RAM. For storage, we used a 1 TB Seagate hard disk drive. The target localizer is trained using a Python wrapper around the original open-source implementation. The Jupyter Notebook IDE was used to build and visualize the systems. PyCharm was used to pre-process the data. Keras-GPU was used to build the neural network and tune the hyper-parameters.

Experiments and results

In this section, we discuss the results of various experiments. The experiments cover the broad aspects of dataset splitting and data balancing, sparse representation, and the depth of the CNN and its features. We also present an ablation study of the proposed system. Finally, a comparative study is presented.
Fig. 8 Effects of CNN depths on the Animals on the Web dataset (accuracy vs. number of iterations for 1- to 5-layer CNNs)

Fig. 10 Effects of the number of features for each layer of the CNN (accuracy vs. number of deep features per layer)

Fig. 11 Analysis of deep vs. sparse representation on the two datasets (accuracy vs. dictionary size for AwA+Deep, AwA+Sparse, COCO+Deep, and COCO+Sparse)

Dataset splitting and data balancing

The proportion of train and test data plays an important role in computational time and complexity, besides influencing the performance. Similarly, data balancing is an essential step in training. The effects of different proportions of training data on the accuracy of the proposed system are evaluated, and the results are presented in Table 1.

The proposed model is validated against different proportions of training data: 10%, 20%, 30%, 40%, and 60%. The system achieves good accuracy even with 10% training data. Nevertheless, the accuracy increases with an increase in the proportion of training data. However, the accuracy starts saturating at around 40% training data, beyond which more data does not have much impact on the performance, while increasing the size of the training data increases the computational cost and complexity. Hence, the proposed system is trained with 40% training data. Thus, the following experiments use a 40:40:20 proportion, where 40% each is for training and testing, and the remaining 20% is for validation (a split sketch is given at the end of this section). The effect of a balanced vs. imbalanced dataset is studied, and the result is presented in Fig. 6. Following the previous experiment, the dataset proportion is maintained at 40:40:20 on all the benchmark datasets.

From the figure, it is clear that an imbalanced dataset hinders the performance. Generally, it is good to have a balanced dataset, as an imbalanced dataset will have a bias favoring the majority class. Usually, the problem of an imbalanced dataset is predominant in multi-class problems, where more than one class may have minimal data. Data augmentation was done to balance the dataset with typical augmentation techniques like cropping at random sizes, flipping horizontally, and color jitter. By combining multiple data augmentation techniques, we improved the quality of the augmented data, and therefore the model did not overfit. There exists a trade-off in the performance metrics for balanced and imbalanced datasets: for a balanced dataset, accuracy is the best metric, while for an imbalanced dataset, precision and recall are the appropriate metrics.

Performance analysis on the benchmark datasets

We analyzed the performance of the proposed system on two standard datasets with the performance metrics discussed in Section 3.2. The results are presented in Table 2 and Table 3. From Table 2, it is clear that giraffe has the highest accuracy of 97.4%, and bird has the least accuracy of 95.6%. The model has a good recall rate of 98% and a precision of 90%.

The Animals on the Web dataset shows marginally higher performance than the COCO dataset. Among the 10 classes, penguin has the highest accuracy of 98.7%, and ant has the least accuracy of 91.2%. The test images in this dataset were more difficult than those of the COCO animal dataset: the images were mostly submerged in an already cluttered background. Nevertheless, the model achieved a recall of 98% and a precision of 93%, higher than on the previous dataset.

Effects of clustering and relations

Candidate parts are identified, and the patches around the candidate parts are extracted, which are then used to construct a dictionary. The patches are clustered based on similarity. Besides, the spatial relationships among the parts are also considered. In this experiment, we analyze the effects of clustering and the spatial relationships.

We considered 4 different combinations of clustering and relationships. The first set considered both clustering and the spatial relationships. The second and third sets considered only the spatial relationships and only clustering, respectively. To highlight the importance of both, we considered a fourth set that has neither clustering nor relations. It is evident from Fig. 7 that both clustering and the spatial relationships have a significant effect on the overall performance of the system.
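As a side note to the dataset splitting described above, the 40:40:20 partition can be sketched with scikit-learn's stratified splitting; the helper name and random seed are ours, and stratification is our assumption for keeping classes balanced.

```python
# Hypothetical sketch of the 40:40:20 train/test/validation split.
from sklearn.model_selection import train_test_split

def split_40_40_20(images, labels, seed=42):
    # 40% for training, 60% held out ...
    x_train, x_rest, y_train, y_rest = train_test_split(
        images, labels, train_size=0.40, stratify=labels, random_state=seed)
    # ... of which two thirds (40% overall) for testing and one third
    # (20% overall) for validation.
    x_test, x_val, y_test, y_val = train_test_split(
        x_rest, y_rest, train_size=2/3, stratify=y_rest, random_state=seed)
    return (x_train, y_train), (x_test, y_test), (x_val, y_val)
```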

Table 4 Ablation study on the proposed system

| Dataset | Representation | Accuracy (%) | Throughput | Training time (sec) | Testing time (sec) |
|---|---|---|---|---|---|
| Animals on the Web | Deep | 85.42 | 431 | 167 | 138 |
| | Sparse | 98.67 | 486 | 115 | 97 |
| COCO animals | Deep | 83.84 | 409 | 143 | 121 |
| | Sparse | 97.14 | 476 | 90 | 78 |
| Animals with Attributes | Deep | 84.84 | 429 | 173 | 134 |
| | Sparse | 98.50 | 462 | 102 | 87 |

Fig. 12 Sample results for correct detection

When compared with pure clustering models (Agarwal and Roth 2002), spatial relations among the parts have a better effect on detecting the animals. Moreover, it is evident from Fig. 7 that "no cluster but has relations" performs better than "no relations but has clustering."

Analysis of CNN depth and features

The proposed system extracts deep features and identifies the candidate parts. Depth is one of the important factors in a CNN: the deeper the network, the deeper the features extracted. We conducted a few experiments on both datasets to analyze the effects of the depth of the network. We tested five different depths of CNN, with the number of features set to 100, and estimated the overall accuracy for each depth. Figures 8 and 9 illustrate the effect of CNN depth on the two datasets. The features from deeper layers are more robust and have higher representational power.

From the figures, it can be inferred that the accuracy increases with the depth of the network. The initial layers can capture only low-level features, and the level of abstraction increases with depth. The first dataset (AoW) overfitted the model, and this resulted in lower accuracy when compared with the COCO animal dataset. The dimension of the spatial features is governed by the number of features in the CNN. We conducted experiments on both benchmark datasets to analyze the accuracy of the model for a varying number of features. For this experiment, we varied the number of features from 10 to 100, with the CNN set to a five-layer architecture throughout.

From Fig. 10, it can be inferred that the accuracy on the COCO animal dataset increased with the number of features, although not steadily. However, the accuracy on the Animals on the Web dataset does not show an increasing trend, and beyond 80 features there is no improvement in accuracy. This shows that accuracy is unstable beyond a certain number of features. Thus, the number of features was set to 80.

Analysis of sparse representation

Sparse representation has been adopted to reduce the overall complexity of the system. Deep features increase the dimension of the network, and the pooling layer tends to lose information. In contrast, sparse representation has better learning power and performs better with fewer features. To analyze the performance of sparse representation, we compared it with the deep feature representation on both datasets.

Figure 11 illustrates the relationship between the dictionary size and the performance of the system in terms of accuracy. The figure evaluates both the deep and sparse representation forms on the two datasets, with accuracy plotted as a function of training dictionary size. On both datasets, sparse representation shows a significant improvement in accuracy when compared with deep representation. For the deep feature representation, it can be inferred that the accuracy increases with an increase in training dictionary size. In both representations, the COCO animal dataset has better accuracy than the Animals on the Web dataset.

Fig. 13 Sample results for incorrect detection



Fig. 14 Sample results for detecting animals on the roads

Ablation study on the proposed system

The ablation study is conducted to assess the two different representations of the proposed system. In addition to accuracy, throughput, the number of images processed per second, is also considered. While FLOPS (floating-point operations per second) is often used to characterize processing speed, throughput is used here instead, since FLOPS is not proportional to the actual inference speed on the GPU (a timing sketch is given below). The results of the ablation study, conducted on a random 500 images from each dataset, are presented in Table 4.

The sparse representation has significantly lower processing time in both the training and testing phases. The high-dimensional features affect the processing time, and this is evident from the time complexity of the deep representation. Of all the datasets, COCO animals has the least training and testing time. In contrast, the AoW dataset has the highest accuracy, followed by AwA. However, there is a trade-off between throughput and accuracy. The proposed system balanced this trade-off and achieved significantly higher accuracy without affecting the throughput.
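The throughput measurement itself amounts to timing inference over a batch of images; a minimal sketch follows, where `model_predict` is a placeholder for the detector's inference call.

```python
# Hypothetical sketch of throughput measurement in images per second.
import time

def measure_throughput(model_predict, images):
    start = time.perf_counter()
    for img in images:
        model_predict(img)              # run detection on one image
    elapsed = time.perf_counter() - start
    return len(images) / elapsed        # images processed per second
```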

Fig. 15 Sample results for a partial occlusion, b viewpoint invariance



Fig. 18 Sample results for a blurred image, b shaken image

Qualitative results for animal detection and recognition

In this part, we present sample results for both correct and incorrect detections. The animals are detected with a bounding box, and each detected animal is recognized by its class, with a confidence score given along with the class. Figure 12 shows representative correct detections. The samples are tested on both datasets.

Figure 13 shows sample results for incorrect detection. In the case of the alligator, a reflection of the animal is also detected as a true positive. In the case of the zebra, two instances of zebra standing very close to each other are detected as one single zebra. False positives due to overlapping or closely standing animals are not severe, as the application is about detecting animals and not population estimation. Nevertheless, at least one instance of the animal is detected, indicating the presence of animals on the road, and thus the driver can be alerted. Figure 14 presents sample results for animal detection on the roads to simulate a real-time AVC scenario.

Animal detection in challenging image conditions

In this section, we present the challenges in detecting animals on roads and highways. Some of the prominent challenges include partial occlusion, pose or viewpoint variation, and cluttered backgrounds. Besides these, there are other challenges like poor illumination and overexposure that might arise due to lighting effects. Finally, images might be blurred or shaken due to the fast movement of the vehicle, which makes it difficult to detect animals from the images. The performance of the proposed system under these challenging image conditions is analyzed, and sample results are presented in Figs. 15-18. Since the dictionary is populated with the patches around the candidate parts, the model can detect animals even when they are partially occluded (see Fig. 15a). Besides, with the large dictionary of candidate parts, the proposed system is invariant to viewpoint changes (see Fig. 15b).

Despite the cluttered environmental setting and other challenges, the model was able to detect each instance of the animals correctly with a good confidence score. Figure 16 shows a sample of animal detection in a cluttered environment. Despite the clutter and the animals being far away from the camera, the model was able to detect them.

The training images are usually of good quality, and the training accuracy will be quite good when the parameters are fine-tuned. On the other hand, real-time test images captured on the road may not always be clear. Camera limitations may cause unnecessary noise in the images, like blur, poor illumination, and overexposure. To tackle these challenges, the proposed system pre-processes the images with FixResNeXt, which adjusts the resolution discrepancy between the train and test data.

As far as animal recognition is concerned, we set a threshold of 55%. If the detection score is less than 55%, the model detects the animal without labeling it; however, a bounding box is still shown to indicate the presence of an animal, as sketched below. In Fig. 17, the animals are detected as deer despite the illumination problem. Figure 18 illustrates some of the challenges due to hardware limitations.
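The 55% labeling rule can be sketched as a simple post-processing step on each detection (the helper name and output formatting are ours):

```python
# Hypothetical sketch of the 55% recognition threshold described above.
def annotate_detection(label, score, threshold=0.55):
    """Below the threshold, keep the bounding box but omit the class label."""
    if score >= threshold:
        return f"{label} ({score:.0%})"  # box drawn with class and score
    return "animal"                      # box drawn, presence indicated only
```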

Fig. 16 Sample results for animals in a cluttered environmental setting



Fig. 17 Sample results for a poor illumination, b overexposure

Comparative study

The proposed system is compared with other related systems on the benchmark Animals with Attributes (AwA) dataset (Lampert et al. 2009). AwA, an attribute-based dataset, has been benchmarked by several kinds of systems: learning-based systems, semantic attribute-based systems, and category-based attribute systems. Among them, Zhao et al. (2019), Guo et al. (2015), Kovashka et al. (2011), and Tian et al. (2017) were based on learning systems. On the other hand, Al-Halah and Stiefelhagen (2015, 2017) and Lampert et al. (2009) were based on semantic attributes. Finally, Yu et al. (2013), Batra and Parikh (2017), and Al-Halah et al. (2016) were based on category-level attributes. Branson et al. (2010) proposed an interactive human-computer system intended to identify the true class while minimizing the number of questions asked to humans, using the visual content of the image. Table 5 compares the existing systems with the proposed system on the benchmark AwA dataset.

In comparison with the existing systems, the proposed system has faster convergence and is significantly better than the others. The better performance is mainly attributed to the choice of feature learning and the way of representation. In comparison with deep representation, sparse representation has better convergence. Similarly, with a higher number of features, the system took more time to perform detection.

Conclusion and future scope

In this paper, we presented an animal detection system that integrates sparse representation with deep features that are automatically extracted from the candidate discriminative parts of the animals. A feature-efficient learning algorithm called the Sparse Network of Winnows is learned over the feature space. The results demonstrate the effectiveness of the model in detecting animals on roads. It works under challenging image conditions like partial occlusion, poor illumination, overexposure, pose variation, and cluttered backgrounds. The proposed model has been tested on several benchmark animal datasets with varied experiments on both the deep characteristics of the CNN and the effectiveness of the sparse representation.

Table 5 Comparative analysis of the proposed system with related existing systems on the benchmark AwA dataset

| S. no | System | Technique | Accuracy (%) |
|---|---|---|---|
| 1 | Zhao et al. (2019) | Zero-shot learning | 87.29 |
| 2 | Guo et al. (2015) | Latent attribute learning | 65.25 |
| 3 | Kovashka et al. (2011) | Active learning with objects and attributes | 75.01 |
| 4 | Tian et al. (2017) | Learning attributes from crowdsourced relative labels | 76.34 |
| 5 | Al-Halah and Stiefelhagen (2017) | Semantic attribute discovery | 89.73 |
| 6 | Al-Halah and Stiefelhagen (2015) | Hierarchical transfer of semantic attributes | 83.75 |
| 7 | Lampert et al. (2009) | Information transfer by attribute sharing | 65.97 |
| 8 | Yu et al. (2013) | Category-level attributes | 84.38 |
| 9 | Batra and Parikh (2017) | Attribute-category model | 83.51 |
| 10 | Al-Halah et al. (2016) | Class-attribute associations | 79.74 |
| 11 | Branson et al. (2010) | Visual recognition with humans in the loop | 87.22 |
| 12 | Proposed system | Candidate deep features with sparse classifier | 98.50 |

This work can be extended to detect animals from real-time videos. Besides, the system can be extended to work with thermal images, since they are the best choice for detecting animals at night, whereas visible images are the best choice for detecting animals during the day. Thus, a fusion of visible and thermal images can be used for detecting animals on the roads.

Acknowledgments The authors thank VIT for providing the "VIT SEED GRANT" for carrying out this research work.

References

Agarwal S, Roth D (2002) Learning a sparse representation for object detection. In: European Conference on Computer Vision. Springer, Berlin, pp 113–127
Al-Halah Z, Stiefelhagen R (2015) How to transfer? Zero-shot object recognition via hierarchical transfer of semantic attributes. In: 2015 IEEE Winter Conference on Applications of Computer Vision, pp 837–843. IEEE
Al-Halah Z, Stiefelhagen R (2017) Automatic discovery, association estimation and learning of semantic attributes for a thousand categories. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 614–623
Al-Halah Z, Tapaswi M, Stiefelhagen R (2016) Recovering the missing link: predicting class-attribute associations for unsupervised zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 5975–5984
Batra T, Parikh D (2017) Cooperative learning with visual attributes. arXiv preprint arXiv:1705.05512
Berg TL, Forsyth DA (2006) Animals on the web. In: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), vol 2, pp 1463–1470. IEEE
Bíl M, Andrášik R, Duľa M, Sedoník J (2019) On reliable identification of factors influencing wildlife-vehicle collisions along roads. J Environ Manag 237:297–304
Branson S, Wah C, Schroff F, Babenko B, Welinder P, Perona P, Belongie S (2010) Visual recognition with humans in the loop. In: European Conference on Computer Vision. Springer, Berlin, pp 438–451
Burghardt T, Ćalić J (2006) Analysing animal behaviour in wildlife videos using face detection and tracking. IEE Proc Vis Image Signal Process 153(3):305–312
Carlson A, Cumby C, Rosen J, Roth D (1999) The SNoW learning architecture. Technical report UIUCDCS, p 24
Chen Y, Nasrabadi NM, Tran TD (2011) Hyperspectral image classification using dictionary-based sparse representation. IEEE Trans Geosci Remote Sens 49(10):3973–3985
Divya Meena S, Agilandeeswari L (2019) Adaboost cascade classifier for classification and identification of wild animals using Movidius neural compute stick. Int J Eng Adv Technol 9(1S3):495–499
Divya Meena S, Agilandeeswari L (2020a) A new supervised clustering framework using multi discriminative parts and expectation–maximization approach for a fine-grained animal breed classification (SC-MPEM). Neural Process Lett. https://doi.org/10.1007/s11063-020-10246-3
Divya Meena S, Agilandeeswari L (2020b) FSSCaps-DetCountNet: fuzzy soft sets and CapsNet-based detection and counting network for monitoring animals from aerial images. J Appl Remote Sens 14(2):026521. https://doi.org/10.1117/1.JRS.14.026521
Fabre-Thorpe M, Delorme A, Marlot C, Thorpe S (2001) A limit to the speed of processing in ultra-rapid visual categorization of novel natural scenes. J Cogn Neurosci 13(2):171–180
Forslund D, Bjärkefur J (2014) Night vision animal detection. In: 2014 IEEE Intelligent Vehicles Symposium Proceedings, pp 737–742. IEEE
Guo Y, Ding G, Jin X, Wang J (2015) Learning predictable and discriminative attributes for visual recognition. In: Twenty-Ninth AAAI Conference on Artificial Intelligence
Jaskó G, Giosan I, Nedevschi S (2017) Animal detection from traffic scenarios based on monocular color vision. In: 2017 13th IEEE International Conference on Intelligent Computer Communication and Processing (ICCP), pp 363–368. IEEE
Kovashka A, Vijayanarasimhan S, Grauman K (2011) Actively selecting annotations among objects and attributes. In: 2011 International Conference on Computer Vision, pp 1403–1410. IEEE
Lampert CH, Nickisch H, Harmeling S (2009) Learning to detect unseen object classes by between-class attribute transfer. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp 951–958. IEEE
Mammeri A, Zhou D, Boukerche A, Almulla M (2014) An efficient animal detection system for smart cars using cascaded classifiers. In: 2014 IEEE International Conference on Communications (ICC), pp 1854–1859. IEEE
Meena SD, Agilandeeswari L (2019) An efficient framework for animal breeds classification using semi-supervised learning and multi-part convolutional neural network (MP-CNN). IEEE Access 7:151783–151802
Meena D, Agilandeeswari L (2020) Invariant features-based fuzzy inference system for animal detection and recognition using thermal images. Int J Fuzzy Syst. https://doi.org/10.1007/s40815-020-00907-9
Nascimento JC, Marques JS (2006) Performance evaluation of object detection algorithms for video surveillance. IEEE Trans Multimedia 8(4):761–774
Patterson G, Hays J (2016) COCO attributes: attributes for people, animals, and objects. In: European Conference on Computer Vision, pp 85–100. Springer, Cham
Ragab K, Zahrani M, Haque AU (2011) Camel-vehicle accidents mitigation system: design and survey. In: Future Information Technology. Springer, Berlin, pp 148–158
Ramanan D, Forsyth DA, Barnard K (2006) Building models of animals from video. IEEE Trans Pattern Anal Mach Intell 28(8):1319–1334
Redmon J, Divvala S, Girshick R, Farhadi A (2016) You only look once: unified, real-time object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 779–788
Redmon J, Farhadi A (2018) YOLOv3: an incremental improvement. arXiv preprint arXiv:1804.02767
Sharma SU, Shah DJ (2016) A practical animal detection and collision avoidance system using computer vision technique. IEEE Access 5:347–358
Tian T, Chen N, Zhu J (2017) Learning attributes from the crowdsourced relative labels. In: Thirty-First AAAI Conference on Artificial Intelligence
Touvron H, Vedaldi A, Douze M, Jégou H (2019) Fixing the train-test resolution discrepancy. In: Advances in Neural Information Processing Systems, pp 8252–8262
Viola P, Jones MJ (2004) Robust real-time face detection. Int J Comput Vis 57(2):137–154
Walther D, Edgington DR, Koch C (2004) Detection and tracking of objects in underwater video. In: Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2004), vol 1. IEEE
Yu FX, Cao L, Feris RS, Smith JR, Chang SF (2013) Designing category-level attributes for discriminative visual recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 771–778
Zhao B, Fu Y, Liang R, Wu J, Wang Y, Wang Y (2019) A large-scale attribute dataset for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops
Zhou D, Dillon M, Kwon E (2009) Tracking-based deer vehicle collision detection using thermal imaging. In: 2009 IEEE International Conference on Robotics and Biomimetics (ROBIO), pp 688–693. IEEE

Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.