
Measurement: Sensors 25 (2023) 100611


Animal image identification and classification using deep neural network techniques

Thirupathi Battu a,*, D. Sreenivasa Reddy Lakshmi b

a Department of Computer Science & Engineering, University College of Engineering (A), Osmania University, Hyderabad, Telangana, India
b Department of Information Technology, Chaitanya Bharathi Institute of Technology (A), Affiliated to Osmania University, Hyderabad, Telangana, India

ARTICLE INFO

Keywords:
Animal detection and classification
Feature learning
Image modalities
Deep neural networks

ABSTRACT

In animal identification research, few effective methods have been introduced, especially in the area of predator species. In this article, we provide a reliable learning strategy for categorising animals from camera-trap photos captured in naturally inhabited areas with high densities of people and noise. To deal with noisy labels, we offer two distinct network architectures: one with clean samples and one without. We separate the training data into groups with various properties using k-means clustering, and then train separate networks on these groups. These more diverse networks are then used, via maximum voting, to jointly forecast or correct sample labels. We test the effectiveness of the suggested method on two publicly accessible camera-trap image datasets, Snapshot Serengeti and Panama-Netherlands. Our findings show that our method is more accurate and surpasses state-of-the-art techniques for classifying animal species from camera-trap photos with high levels of label noise.

1. Introduction

According to statistics from 2017, up to five million individuals in the United States are assaulted by animals each year. According to Langley & Morrow, up to two million animal bites occur in the United States each year. Depending on where you live, animal attacks happen more or less often. A slang term for an animal that uses humans as a form of prey is "man-eater". Tigers, for instance, are reported to have killed more people than any other animal of their kind [1,2]. They are more likely than any other wild animal to attack a person directly, according to Nowak et al. Man-eating lions, on the other hand, have been seen invading human settlements both during the day and at night in order to find victims. According to American and Tanzanian specialists, there was a significant increase in man-eating incidents in Tanzania's rural areas between 1990 and 2005. At least 563 locals were attacked in this period, and many victims were eaten. According to Warrel, thousands of humans suffer fatal injuries each year as a result of animal attacks [3]. Governments keep records of animal-related deaths, though it does not seem that every government does so. The majority of animals, with the exception of tigers, do not actively seek out humans, while certain species may prey on the unconscious, ill, or dead. Animals may attack humans, livestock, and pets when they become habituated to people or when they are badly malnourished. Attacks seem to happen more often at night, when animals leave their territory in quest of food. No predator animal detector is mentioned in the literature. In this study, image processing technology is used to propose a way of identifying the species of animals. The method is then tested on a dataset that includes pets and predators, and the classification results are evaluated and discussed in terms of accuracy [4].

One of the fundamental notions in computer vision is that the initial goal is to "understand the picture", which leads to a constantly growing requirement to grasp the high-level meaning of objects in object recognition and image identification. The field has exploded in popularity as a key visual capability required by computer vision systems. As so many individuals and computers extract large quantities of information from photos, images have become ubiquitous in a number of sectors [5].

Automation, schools, self-driving vehicles, tracking, and the construction of 3D model representations all need knowledge that might be crucial. While the above-mentioned applications vary in a variety of ways, they all follow the same procedure of annotating a picture with one or more labels that correspond to a set of classes or categories. Object identification is the name given to this procedure, and it has become a popular study topic in the area due to the focus on understanding what a picture represents.

* Corresponding author.
E-mail address: ramdasv786sap@gmail.com (T. Battu).

https://doi.org/10.1016/j.measen.2022.100611
Received 24 October 2022; Received in revised form 16 November 2022; Accepted 25 November 2022
Available online 21 December 2022
2665-9174/© 2022 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

An extensive body of research has been done on object detection and identification using image processing. The objective of this project is to develop a system that enables researchers studying animals and flora, as well as wildlife photographers, to recognise and differentiate wild creatures and flowers automatically [6,7]. The crucial problem of animal detection and identification hasn't received much attention. The aim of this project is to develop a system that will help wildlife photographers and researchers who study animal behavior [8,9].

The technology used in this work might be adapted for uses in safety, monitoring, and other areas. Wildlife photography is considered one of the most challenging genres of photography. It needs strong technical skills, such as precise capturing. Photographers of wild animals often need a high level of technical skill as well as a great deal of patience. Some animals, for example, are difficult to approach, so understanding their behaviour is necessary to anticipate their activities. Photographers must sometimes stay calm and silent for many hours until the exact moment arrives. Photographing some animals may require stalking skills or the use of a hide. A stunning wildlife shot is also the result of being in the right spot at the right moment. Because of the high demands of the job, it is excessively expensive, and the company's survival may be jeopardized [9,10]. Animal researchers often travel to isolated regions all around the globe. In the life of a photographer, hostile situations are frequently the norm. Some may wait for hours on end before capturing a photo worth selling. Photographers must be bold enough to remain in a hazardous situation, comfortably and patiently, until the animals emerge. An ideal outcome comes with fewer disruptions to the animal's natural behaviour. Human presence may be quickly detected by animals, such as dogs, due to their great sensitivity. Because we can't know what will happen next, photographers must be prepared to capture any important jungle moment. The DSLR cameras used in the business, on the other hand, are relatively expensive, and the shutter on-off operation has limitations. As a result, adequate recognition in such equipment is required. The better the picture quality, the more memory space is required [11,12]. It also fosters correct recognition in nature photography. We have opted to remain cautious in this project, focusing just on animals and flowers. This scope was chosen since collecting animal data to create a trustworthy database is never an easy undertaking. To arrive at the probabilistic conclusion and reach an acceptable percentage, we used the methods of attribute abstraction, segmentation, and thresholding [13-15].

Fig. 1. Samples from camera-trap image training datasets diluted with noisy labels. The red box indicates a noisy label, while the green box indicates a clean label. (For interpretation of the references to colour in this figure legend, the reader is referred to the Web version of this article.)


Animal type identification has not received much attention, especially when it comes to predator animals. One instance is identifying a certain animal species in a tracking tunnel by capturing its ink footprints in a particular spot, which is inappropriate in this case since we want to identify the animal immediately. To recognise and follow moving object blobs and identify activities like hunting, researchers have employed a neural network. However, this study's objective is to identify the animal before, not during, the hunt. The work presented here does not categorize the traits of the targeted animal, but rather employs a background subtraction method to separate them from the backdrop. For the real-time detection and tracking of animals, the authors of [16,17] presented a modified version of Viola and Jones' Haar-like detector. They devised a method for monitoring and noting the presence of an animal in a video and retrieved the Haar-like traits of a lion; they only used one animal, however. Using machine learning techniques, we offer a unique method for extracting animal features in this study and separating predators from non-predators [18-20], as shown in Fig. 1.

2. Related work

2.1. Animal monitoring systems

In general, animal monitoring systems monitor animals, their movement, and their behaviour. Wildlife researchers have developed various technologies for monitoring animals. For continuous observation, video monitoring systems were mounted on the animals, which was both intrusive and harmful for the animals (Parkhi et al., 2012). Other prominent technologies include very high-frequency radio-tracking (Kim et al., 2010), GPS tracking (Tomkiewicz et al., 2010), satellite tracking with radio collars (Venkataraman et al., 2005), pyro-electric sensors (Hao et al., 2006), and wireless sensor networks (Mainwaring et al., 2002). These intrusive technologies had little success, as they are mostly applicable only to a small geographic area. With rapid technological advancement, camera trap technology (Wearn and Glover-Kapfer, 2017) has matured and reached a point where it is reliably used as one of the non-intrusive methods to monitor animals [21-23].

It should be mentioned that deep networks are somewhat resistant to label noise. The susceptibility of neural networks to adversarial samples was demonstrated by Szegedy et al. (2013). According to Rolnick et al. (2017), a deep neural network can learn from a large number of noisy labels; they looked at how batch size and learning rate affected the effectiveness of the model. Van Horn et al. (2015) claim that as long as the mistake rate is not too high, learning techniques based on CNN features and part localization are resilient to annotation errors and damaged training data. In this paper, we use a variety of networks to estimate the accurate label of the input sample, as described by Yuan et al. (2018). We note that diversity across these networks is crucial during this joint estimation process, so that the various networks contribute diverse insights or knowledge about the input sample. This might enhance the performance of joint estimation [24,25].

There are two methods to get over the challenge of learning from noisy labels. Assuming that clean labels are not accessible, the first method involves learning directly from noisy labels. With this technique, the input image's label noise is conditionally isolated (Natarajan et al., 2013; Sukhbaatar et al., 2014), or a label cleaning module is used to reject or fix samples with inaccurate labels (Brodley and Friedl, 1999). The second method instructs the network as it learns from noisy labels using a small number of clean samples with correct labels. To increase classification accuracy, this approach offers a tiny proportion of clean labels that have been verified by real individuals. For instance, Yuan et al. (2018) monitored and enhanced the deep neural network's learning capacity using a set of 5000 clean samples. To correct noisy input labels, Veit et al. (2017) developed a label cleaning network that was supervised by a few clean labels. Han et al. (2018), Jiang et al. (2017), and Ren et al. (2018) have all employed clean validation sets to weight training data [26,27].

2.2. Animal classification systems

2.2.1. Animal breed classification
With the rising number of public datasets, fine-grained classification (FGC) is gaining attention. It is employed when the classes in the dataset have high inter-class similarity and intra-class variance. Animal breed classification is a problem of FGC, and there are three benchmark animal breed datasets for studying the problem: Columbia Dogs with Parts (CU), Stanford Dogs (SD), and the Oxford IIIT-Pet dataset (OX). The literature on animal breed classification is categorized based on the benchmark datasets used in the research [28].

2.2.2. Columbia dogs with parts dataset
In this section, we discuss the benchmark CU dataset and the systems that have used it. The CU dataset includes real-world dog breeds with a total of 133 classes and 8351 images altogether, and the dataset is relatively balanced. It was released by researchers at Columbia University in 2012 [29].

The baseline model for the CU dataset was proposed by Liu et al. (2012). The authors proposed a novel fine-grained image classification approach for dog breeds, where the animals share common parts but differ in shape and physical characteristics. The idea was to extract the corresponding parts and classify the breeds based on the extracted parts. The main challenge was the variation in the appearance of the corresponding parts across animals sharing common parts. To find correct corresponding parts, the authors use an exemplar-based geometric model on the faces of the animals. With part correspondence, it is easier to compare the extracted face descriptors of animals having similar parts. The model also includes a parts hierarchy and part localization that varies across dog breeds. The baseline model achieved an accuracy of 67%, and the authors concluded that accurate part localization substantially increases classification performance. Wang et al. (2014) proposed a dog breed categorization model on the same dataset using statistical techniques. The model identifies the shape representation using facial landmarks. Initially, the authors identified eight facial landmarks. The identified landmarks are projected onto the Grassmann manifold, which represents the geometry of the dog breeds; the shape of the dog is considered as points on the Grassmann manifold. Based on the equivalence property, these points are then projected onto the ambient space [30].

2.3. Animal detection systems

2.3.1. Visible image based systems
Generally, visible images are most commonly used in animal detection and classification systems, as they provide a rich set of features such as colour and texture. Most state-of-the-art (SOTA) CNN architectures are trained on large-scale datasets of visible images, and there are ample visible-image-based benchmark camera trap datasets. In this section, we categorize the systems based on the camera trap datasets.


2.3.1.1. Missouri camera trap dataset. Chen et al. (2014) proposed an animal detection model for automating the animal monitoring system using a deep convolutional neural network. The authors also released the Missouri camera trap dataset, which consists of 20 species of animals commonly found in North America. The multimodal dataset contains animals in colour images, grayscale images, and infrared images. It has 23,876 images in total, with 14,346 for training and the rest for testing. The authors proposed a baseline model with a DCNN and achieved an average accuracy of 38.31% [31].

Zhang et al. (2016) proposed an ensemble coupled graph cut algorithm for segmenting the animals from the highly cluttered Missouri camera trap images. Using an object verification technique, the segmented images are checked to determine whether they belong to the foreground or background. The proposed segmentation technique for detecting animals in the image achieved an average precision of 82.93%. Later, Zhang et al. (2016) proposed spatio-temporal object region proposals for detecting animals in the highly cluttered camera trap dataset. To reduce the number of false positives of the previous approach, the authors proposed a multi-level graph cut to produce region proposals that possibly contain the object. The region proposals are then checked using patch verification to validate whether they contain the target object. The features are extracted using Histogram of Oriented Gradients (HOG) and deep learning techniques. This animal detection approach achieved an average precision of 82.09%, slightly lower than the authors' previous technique [32].

On the same dataset, Yousif et al. (2017) proposed a background modeling technique combined with a Deep Convolutional Neural Network (DCNN) to detect animals and humans in the highly cluttered dataset. The background subtraction modeling removes the background and generates region proposals that contain the foreground object; the regions are fed to the DCNN for classification. This technique achieved an average accuracy of 83.78%. Recently, Verma and Gupta (2018) proposed an animal detection model using the spatio-temporal object region proposals of Zhang et al. (2016) and deep neural networks. The authors used the region proposal technique to identify the object region and fed the proposals to neural networks for training. Specifically, the authors trained Visual Geometry Group (VGG) and Residual Network (ResNet) models and achieved an average accuracy of 91.2% with ResNet [33].

2.4. LH1 animal-face dataset

Si and Zhu (2011) released an animal face dataset with 19 classes containing the heads of the animals. The dataset has a total of 2200 images, and the classes have high intra-class and inter-class variance. The authors also proposed a generative learning model that requires only very little training data. Each animal class has a template that learns features including scale and orientation. The templates are characterized by texture, local sketch, flat regions, colour, and a few more features. The baseline model achieved an average accuracy of 75.6%.

Peng et al. (2016) proposed an animal classification system using deep boosting and dictionary learning. The dictionaries are sequentially combined with the previous layers to finally obtain an image representation that is used for training the deep neural network. On the LH1 animal face dataset, the model achieved an average accuracy of 81.5%. Taheri and Toygar (2018) also proposed an animal face classification system using score fusion. The authors combined CNN features with local hand-crafted features and weighed them using score-level fusion. For the local features, the authors used the Kernel Fisher Discriminant (KFD) and biologically inspired features. The intuition is that features from hybrid modules produce better performance than a CNN alone. The model was tested on the LH1 animal face dataset and achieved an average classification accuracy of 95.31%.

2.4.1. Public camera trap datasets
Apart from the three benchmark datasets, there are a few other public datasets, such as the Alexander von Humboldt Institute dataset, the WILD dataset, and the UCSB Sedgwick Reserve dataset. However, these datasets have not been benchmarked yet. The Alexander von Humboldt Research Institute released a camera trap dataset of 10 animal classes captured with 176 camera traps in 10 different regions of Colombia. Giraldo-Zuluaga et al. (2017) released this dataset and proposed a baseline model for animal segmentation and classification from the camera trap images. The segmentation uses a multi-layer robust PCA technique, followed by a CNN for feature extraction, the Least Absolute Shrinkage and Selection Operator (LASSO) for feature selection, and an SVM classifier for classifying the animals. The model was trained with several CNN architectures such as GoogleNet, MixtureNet, and ResNet with 50, 101, and 152 layers. Among all these networks, ResNet-152 achieved the best accuracy of 90.32%.

Parham et al. (2018) proposed a novel technique for identifying the annotation of interest (AoI) for identifying animals. The authors proposed a pipeline model that includes annotation localization, annotation classification, background segmentation, and AoI classification. The final AoI can be used as a focused input for any appearance-based model to increase the model's reliability. The authors also released a dataset named WILD with 6 different animal species accounting for 5784 images. On this dataset, the authors achieved an average accuracy of 94.28% with the proposed baseline model.

Elias et al. (2017) proposed an automated animal monitoring system using edge cloud and IoT systems. The authors deployed a practical Where's The Bear (WTB) system in the UCSB Sedgwick Reserve to analyze how well the proposed model detects animals in camera trap images. The model is built with the open-source TensorFlow and OpenCV libraries to perform classification. Unlike other animal monitoring systems, the proposed model discards empty frames on-site and transfers only the true positive images. The model has been tested on over 1.2 million images and has achieved an average accuracy of 90%.

2.4.2. Animal detection systems with own dataset
Certain animal detection systems were trained on camera trap datasets and achieved considerable performance, but these datasets are still not available as benchmarks. Maiti et al. (2015) studied the abundance of the grizzly bear at Lake Louise in Banff National Park, Canada. The capture effort over several months produced around 37,000 images, out of which only 27 images contained a bear; the rest were empty. The authors use the MSER (Maximally Stable Extremal Region) technique to segment the bear from the image, and a CNN is then applied to classify whether the image indeed contains a bear. The model, however, is slow, taking around a minute to classify an image, and the authors reported a 63% recall rate and a precision of 7%.

3. Proposed method

This section presents the strategy for training deep neural networks with noisy labels for robust animal categorization.

3.1. Noisy labels

Our objective is to train our network at various noise levels, with and without a limited number of unambiguous labels. Fig. 2(a) depicts the primary outline of the suggested method. As shown in Fig. 2(a), we employ the pretrained clean network to transform the input picture into a feature vector. We propose that the other approach does not have access to accurate data (Fig. 2(b)). From the input training picture, feature maps are produced using the pretrained base model with varied amounts of noise. Using k-means clustering (Everitt et al., 2011), we divide the training dataset for each class into 12 groups in both architectures, and from these groups, multiple convolutional neural networks are trained using a random selection of 6 clusters.
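As a concrete illustration of this grouping step, the sketch below clusters pretrained-network feature vectors class by class and draws a random 6-of-12 cluster subset for each member network. It is a minimal sketch under stated assumptions: the paper publishes no code, and the feature matrix, label array, and function names here are hypothetical.

```python
# Illustrative sketch of the grouping step: per-class k-means clustering of
# pretrained-network features into 12 clusters, then a random 6-cluster
# subset per network. Assumes every class has at least 12 samples.
import numpy as np
from sklearn.cluster import KMeans

N_CLUSTERS = 12    # groups per class (from the text)
SUBSET_SIZE = 6    # clusters sampled per network (from the text)

def make_training_groups(features, labels, n_networks=3, seed=0):
    """features: (N, D) array from the pretrained base model;
    labels: (N,) possibly noisy class labels.
    Returns one index set per network, each drawn from a random
    6-of-12 cluster subset within every class."""
    rng = np.random.default_rng(seed)
    cluster_ids = np.empty(len(labels), dtype=int)
    # Cluster each class separately into 12 groups.
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        km = KMeans(n_clusters=N_CLUSTERS, n_init=10, random_state=seed)
        cluster_ids[idx] = km.fit_predict(features[idx])
    # Each network trains on a different random selection of clusters.
    groups = []
    for _ in range(n_networks):
        chosen = rng.choice(N_CLUSTERS, size=SUBSET_SIZE, replace=False)
        groups.append(np.where(np.isin(cluster_ids, chosen))[0])
    return groups
```

Because each network sees a different subset of clusters, the ensemble members develop the diverse views of the data that the joint label estimation relies on.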


Fig. 2(a). Flow diagram of the proposed method. (a) With clean network. (b) Without clean network.

Fig. 2(b). The architecture of the convolutional neural network.

The pretrained clean network, as shown in Fig. 2(a), is used only to predict the label at each step, without being trained on the datasets generated by k-means clustering, and helps the other two networks clean up their labels. The pretrained clean network is trained only once, on clean samples. On the other hand, various parts of the training dataset may be used to train the CNN3 network in Fig. 2(b). To train CNN3, 90% of the whole training dataset was randomly chosen. Using this network and the other two autonomous networks, we forecast new labels. The anticipated labels are used to update the network in the next step, and this iterative method is performed many more times.

We utilise each network independently in both structures to predict the label for each picture in the training dataset. To update the original picture label, we apply maximum voting among the anticipated labels. In particular, if two or more networks produce the same label, we apply it to the sample; otherwise, we keep the original label, which may be incorrect. Once the labels have been changed, we return to refining those two networks. This label updating and network refining procedure is repeated many times; Techniques 1 and 2 offer detailed algorithms for implementing these two approaches. Fig. 3 depicts the network topology for the Serengeti dataset's base network, clean network, and independent networks. For fair performance comparison, we adopt the same model design as Yuan et al. (2018).
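The maximum-voting update described above can be summarised in a few lines. This is a sketch of our reading of the rule (two or more agreeing networks overwrite the label, otherwise the original label is kept); the `predict` interface on the trained networks is an assumption, not published code.

```python
# Minimal sketch of the maximum-voting label update. `nets` is a list of
# trained classifiers, each assumed to expose .predict(features) -> (N,).
import numpy as np

def update_labels(nets, features, labels):
    """If two or more networks predict the same class for a sample, that
    class replaces its current (possibly noisy) label; otherwise the
    original label is kept, as described in the text."""
    preds = np.stack([net.predict(features) for net in nets])  # (M, N)
    new_labels = labels.copy()
    for i in range(preds.shape[1]):
        votes, counts = np.unique(preds[:, i], return_counts=True)
        if counts.max() >= 2:            # two or more networks agree
            new_labels[i] = votes[counts.argmax()]
        # otherwise the original label is retained
    return new_labels
```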
It is advisable to utilise a statistical feature extraction method when studying the face area of predator animals in outdoor settings. To extract statistical information from a specific area of an image of an animal's face, area descriptors are employed. Since animal ears and eyes have characteristics that are helpful for categorising animal species, we propose using them as regions from which to extract statistical information. Researchers have found a correlation between ecological niche and pupil form in terrestrial species, indicating that predator and prey animals have distinct pupil forms [8]. In order to identify the species of an animal, we therefore focused on extracting characteristics from an animal's eyes and ears in this research. The steps are as follows: initially, a dataset with images of 10 animals was created (5 predators, 5 pets). The dimensions, locations, and lighting of the images vary widely. A few examples from the dataset are shown in Fig. 3.

4. Experimental results

This section evaluates the effectiveness of our technique on the Serengeti and Panama-Netherlands datasets using three separate metrics, with labels at three levels of noise: 30%, 50%, and 70%. We contrast the results of our experiments with the recommended method, using a CNN as a reference point.
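For these experiments the training labels are diluted to fixed noise levels. The paper does not spell out its corruption procedure, so the sketch below assumes uniform random flipping to a different class, which is one standard way to simulate 30%, 50%, and 70% label noise.

```python
# Hedged sketch of diluting labels to a target noise level (0.3, 0.5, 0.7).
# Uniform random flipping is an assumption; the paper's exact scheme is
# not specified.
import numpy as np

def inject_label_noise(labels, noise_level, n_classes, seed=0):
    rng = np.random.default_rng(seed)
    noisy = labels.copy()
    flip = rng.random(len(labels)) < noise_level
    # Replace each selected label with a different, random class.
    for i in np.where(flip)[0]:
        choices = [c for c in range(n_classes) if c != labels[i]]
        noisy[i] = rng.choice(choices)
    return noisy
```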


Fig. 3. Example of the dataset. The pet animals are shown in the top row, while the predator species chosen for this experiment are displayed in the bottom row.

4.1. Datasets

The first dataset is the camera-trap imagery from Tanzania's Serengeti National Park in the Snapshot Serengeti collection (Swanson et al., 2015). Assuming that each image includes only one species of animal, we address animal classification at the level of the individual picture. 12,904 colour images of 10 distinct animal species are included in the collection. Our dataset was randomly split into 80% training samples and 20% test samples; when utilising the clean network, we use 85% of the samples for training, 10% for testing, and the remaining 5% as clean samples. The second dataset contains photo sequences of 20 animal species from Panama and the Netherlands (Zhang et al., 2016). We chose the training, testing, and cleaning images at random.

10,321 images from the Serengeti dataset were utilised for training, while 2583 images were used for testing. Fig. 4 shows the stages used to achieve the classification accuracy for the three different label noise levels. At stage 3, our method achieves an accuracy of 73.09% for 30% noise, and accuracy declines beyond this point. For noise levels of 50% and 70%, it obtains accuracies of 59.66% and 46.50%, respectively, at stage 2, while accuracy declines beyond stage 2. As we can see, accuracy may be improved by performing label updating and network refining over several stages. However, if there are too many stages, such as more than three, the continuous label update loses accuracy and the performance suffers (see Fig. 5).

4.1.1. Results from the clean sample Snapshot Serengeti and Panama-Netherlands datasets
We assume that 5% of clean samples are available for training the clean network in the following trials. To assess the success of the suggested technique, we employ accuracy, precision, recall, and F1 score as criteria. Accuracy is defined as

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (1)

where TP stands for the total number of true positive samples, TN for the total number of true negative samples, FP for the total number of false positive samples, and FN for the total number of false negative samples. The following equation is used to determine precision:

Precision = TP / (TP + FP)    (2)

Conversely, recall or sensitivity can be computed as

Recall = TP / (TP + FN)    (3)

We utilise the F1 score or F measure, which is the harmonic mean of precision and recall, to combine precision and recall. It is defined as

F1 Score = (2 × Precision × Recall) / (Precision + Recall) = 2TP / (2TP + FP + FN)    (4)

Fig. 4. Accuracy of five steps on ten distinct Serengeti species datasets at various noise levels without clean samples.


Fig. 5. Comparison of the performance of our method with that of the ICL method, a state-of-the-art method for handling label noise.


We compare the classification accuracy, recall, and F1 score of our proposed technique with those of the conventional CNN for each animal species, as shown in Fig. 4. On the Snapshot Serengeti, we can see that the suggested approach can fix noisy labels at noise levels of 30%, 50%, and 70%.

We use 10,321 images for training and 2583 for testing when we test our approach on 10 Snapshot Serengeti species without the clean network. Stage 3 of the technique results in an accuracy of 73.09% for 30% noise; however, from stage 4 onwards accuracy declines. The classification accuracy attained for each of the three different label noise levels is shown. The method achieves accuracies of 59.66% and 46.50% at stage 2 for noise levels of 50% and 70%, respectively, but accuracy decreases beyond stage 2. We can observe that a multi-stage label updating and network refining process may boost accuracy. However, if there are too many stages, say more than three, the performance degrades and the continuous label update loses accuracy over time.
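The behaviour reported above, with accuracy peaking at stage 2 or 3 and degrading afterwards, corresponds to capping the label-update loop at a small number of stages. The sketch below shows one way to organise that loop; `train` and `evaluate` are hypothetical stand-ins, and `update_labels` is the majority-voting sketch given earlier, so this is an illustration rather than the authors' code.

```python
# Sketch of the multi-stage label-updating / network-refining loop.
# `train` fits one network on its group; `evaluate` returns held-out
# accuracy; both are hypothetical stand-ins.
def iterative_refinement(nets, features, labels, groups,
                         train, evaluate, max_stages=3):
    best_acc, best_labels = 0.0, labels
    for stage in range(max_stages):          # beyond ~3 stages performance drops
        for net, idx in zip(nets, groups):   # refine each network on its group
            train(net, features[idx], labels[idx])
        labels = update_labels(nets, features, labels)  # majority voting
        acc = evaluate(nets)                 # held-out accuracy (hypothetical)
        if acc <= best_acc:                  # stop once accuracy declines
            break
        best_acc, best_labels = acc, labels.copy()
    return best_labels
```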
The authors declare that they have no known competing financial
The Panama-Netherlands dataset, which contains data on 20 animal species native to North America, is also utilised to assess our technique; we use samples from the Panama-Netherlands dataset representing 10 species. Each sample is matched with a visual sequence: when the camera trap detects animal movement, it typically takes 3-20 photographs at a frame rate of one per second, depending on how long the animal is in the camera's field of view. To build our networks, we use the pre-trained AlexNet convolutional neural network.
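A feature extractor built on the pre-trained AlexNet mentioned above might look as follows; torchvision's ImageNet weights and preprocessing are an assumed, reasonable stand-in, since the paper does not state its exact configuration.

```python
# Sketch of extracting feature vectors with a pretrained AlexNet.
# torchvision's ImageNet-pretrained AlexNet is used as one realisation.
import torch
from torchvision import models
from PIL import Image

weights = models.AlexNet_Weights.IMAGENET1K_V1
alexnet = models.alexnet(weights=weights).eval()
preprocess = weights.transforms()  # resize, crop, normalise as AlexNet expects

def extract_features(image_path):
    img = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        x = alexnet.features(img)           # convolutional feature maps
        x = alexnet.avgpool(x).flatten(1)   # (1, 9216) feature vector
    return x.squeeze(0)
```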
References
For comparison purposes, we also include the conventional CNN method without any noise handling. Our method outperforms the standard CNN and ICL on the Snapshot Serengeti and Panama-Netherlands datasets at all three degrees of noise (Yuan et al., 2018).

We demonstrate how different sample fractions of the whole data impact performance. Using 90% of the whole data, the network is trained on the Snapshot Serengeti with high accuracy. A 70% sample of the Panama-Netherlands dataset is used to train the network, producing a network with high accuracy.

5. Conclusions and further discussions

There have been studies on the impact of noisy labels on the classification of animals. From these noisy examples, we created a technique for building an exact animal species categorization network. We investigated the network training process both with and without clean samples. The studies' findings show that our method handles label noise accurately, with and without clean samples. This study has demonstrated the importance of network diversity in producing a more accurate joint estimate of sample labels. To construct groups with a variety of traits, we combined deep neural network features with k-means clustering; the clusters are then used to generate groupings, and each group is used to train its own network. This has allowed us to guarantee that each network is trained on a unique set of photos. To determine the real label of the noisy data, we apply maximum voting.

The suggested method for categorising animal species from camera-trap photos with noisy labels may be useful for extensive wildlife monitoring by citizen scientists (Fegraus et al., 2019). Most camera-trap photos are gathered, analysed, and shared by amateur volunteers or citizen scientists, and there will inevitably be a large number of inaccurate labels in their annotations. Using the suggested methodology, we will be able to extract useful animal species classifiers from these datasets.

Funding details

There are no funding details available.

Informed Consent

There is no Informed Consent.

Author's contribution

The authors declare no contribution.

CRediT authorship contribution statement

Thirupathi Battu: Conceptualization, Methodology, Software, Validation, Implementation, Writing – original draft, Writing – review & editing, Visualization. D. Sreenivasa Reddy Lakshmi: Conceptualization, Supervision, Project administration, Formal analysis.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

No data was used for the research described in the article.

References

[1] E. Beigman, B.B. Klebanov, Learning with annotation noise, in: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, vol. 1, Association for Computational Linguistics, 2009, pp. 280-287.
[2] J.A. Ahumada, E. Fegraus, T. Birch, N. Flores, R. Kays, T.G. O'Brien, J. Palmer, S. Schuttler, J.Y. Zhao, W. Jetz, M. Kinnaird, S. Kulkarni, A. Lyet, D. Thau, Wildlife Insights: a platform to maximize the potential of camera trap and other passive sensor wildlife data for the planet, Environ. Conserv. (2019), https://doi.org/10.1017/S0376892919000298.
[3] R. Fergus, L. Fei-Fei, P. Perona, A. Zisserman, Learning object categories from internet image searches, Proc. IEEE 98 (8) (2010) 1453-1466.
[4] B. Han, Q. Yao, X. Yu, G. Niu, M. Xu, W. Hu, I. Tsang, M. Sugiyama, Co-teaching: robust training of deep neural networks with extremely noisy labels, Adv. Neural Inf. Process. Syst. (2018) 8527-8537.
[5] L. Jiang, Z. Zhou, T. Leung, L.J. Li, L. Fei-Fei, MentorNet: learning data-driven curriculum for very deep neural networks on corrupted labels, 2017, arXiv preprint arXiv:1712.05055.
[6] I. Jindal, M. Nokleby, X. Chen, Learning deep networks from noisy labels with dropout regularization, in: 2016 IEEE 16th International Conference on Data Mining (ICDM), 2016.
[7] A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks, Adv. Neural Inf. Process. Syst. (2012) 1097-1105.
[8] N. Manwani, P.S. Sastry, Noise tolerance under risk minimization, IEEE Trans. Cybern. 43 (3) (2013) 1146-1151.
[9] N. Natarajan, I.S. Dhillon, P.K. Ravikumar, A. Tewari, Learning with noisy labels, Adv. Neural Inf. Process. Syst. (2013) 1196-1204.
[10] D.F. Nettleton, A. Orriols-Puig, A. Fornells, A study of the effect of different types of noise on the precision of supervised learning techniques, Artif. Intell. Rev. 33 (4) (2010) 275-306.
[11] L. Niu, W. Li, D. Xu, Visual recognition by learning from web data: a weakly supervised domain generalization approach, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 2774-2783.
[12] M. Ren, W. Zeng, B. Yang, R. Urtasun, Learning to reweight examples for robust deep learning, 2018, arXiv preprint arXiv:1803.09050.
[13] D. Rolnick, A. Veit, S. Belongie, N. Shavit, Deep learning is robust to massive label noise, 2017, arXiv preprint arXiv:1705.10694.
[14] A. Swanson, M. Kosmala, C. Lintott, R. Simpson, A. Smith, C. Packer, Snapshot Serengeti, high-frequency annotated camera trap images of 40 mammalian species in an African savanna, Sci. Data 2 (2015).
[15] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, R. Fergus, Intriguing properties of neural networks, 2013, arXiv preprint arXiv:1312.6199.
[16] G. Van Horn, S. Branson, R. Farrell, S. Haber, J. Barry, P. Ipeirotis, P. Perona, S. Belongie, Building a bird recognition app and large scale dataset with citizen scientists, 2015.
[17] A. Veit, N. Alldrin, G. Chechik, I. Krasin, A. Gupta, S.J. Belongie, Learning from noisy large-scale datasets with minimal supervision, in: CVPR, 2017, pp. 6575-6583.
[18] T. Xiao, T. Xia, Y. Yang, C. Huang, X. Wang, Learning from massive noisy labeled data for image classification, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 2691-2699.


[19] B. Yuan, J. Chen, W. Zhang, H.S. Tai, S. McMains, Iterative cross learning on noisy labels, in: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), IEEE, 2018, pp. 757-765.
[20] N. Zhang, M. Paluri, M.A. Ranzato, T. Darrell, L. Bourdev, PANDA: pose aligned networks for deep attribute modeling, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2014, pp. 1637-1644.
[21] Z. Zhang, Z. He, G. Cao, W. Cao, Animal detection from highly cluttered natural scenes using spatiotemporal object region proposals and patch verification, IEEE Trans. Multimed. 18 (10) (2016) 2079-2092.
[22] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, A. Oliva, Learning deep features for scene recognition using places database, Adv. Neural Inf. Process. Syst. (2014) 487-495.
[23] X. Zhu, A.B. Goldberg, Introduction to semi-supervised learning, Synth. Lect. Artif. Intell. Mach. Learn. 3 (1) (2009) 1-130.
[24] Ramdas Vankdothu, Mohd Abdul Hameed, Husnah Fatima, A brain tumor identification and classification using deep learning based on CNN-LSTM method, Comput. Electr. Eng. 101 (2022) 107960.
[25] Ramdas Vankdothu, Mohd Abdul Hameed, Adaptive features selection and EDNN based brain image recognition on the internet of medical things, Comput. Electr. Eng. 103 (2022) 108338.
[26] Vankdothu Ramdas, Mohd Abdul Hameed, Ayesha Ameen, Raheem Unnisa, Brain image identification and classification on Internet of Medical Things in healthcare system using support value based deep neural network, Comput. Electr. Eng. 102 (2022) 108196.
[27] Ramdas Vankdothu, Mohd Abdul Hameed, Raju Bhukya, Gaurav Garg, Entropy and sigmoid based K-means clustering and AGWO for effective big data handling, Multimed. Tools Appl. (2022) 1-18.
[28] Ramdas Vankdothu, Mohd Abdul Hameed, Brain tumor MRI images identification and classification based on the recurrent convolutional neural network, Measurement: Sensors 24 (2022) 100412.
[29] Ramdas Vankdothu, Mohd Abdul Hameed, Brain tumor segmentation of MR images using SVM and fuzzy classifier in machine learning, Measurement: Sensors 24 (2022) 100440.
[30] Ramdas Vankdothu, Mohd Abdul Hameed, COVID-19 detection and classification for machine learning methods using human genomic data, Measurement: Sensors 24 (2022) 100537.
[31] Ayodeji Olalekan Salau, Nikhil Marriwala, Muzhgan Athaee, Data security in wireless sensor networks: attacks and countermeasures, in: Mobile Radio Communications and 5G Networks, 2020, pp. 173-186.
[32] Nikhil Marriwala, O.P. Sahu, Anil Vohra, 8-QAM software defined radio based approach for channel encoding and decoding using forward error correction, Wireless Pers. Commun. 72 (4) (2013).
[33] Nikhil Marriwala, Om Prakash Sahu, Anil Vohra, LabVIEW based design implementation of M-PSK transceiver using multiple forward error correction coding technique for software defined radio applications, J. Electr. Electron. Eng. 2 (4) (2014).
