
IET Computer Vision

Research Article

Animal classification using facial images with score-level fusion

ISSN 1751-9632
Received on 31st January 2017
Revised 11th January 2018
Accepted on 1st March 2018
E-First on 21st March 2018
doi: 10.1049/iet-cvi.2017.0079
www.ietdl.org

Shahram Taheri1, Önsen Toygar1

1Computer Engineering Department, Faculty of Engineering, Eastern Mediterranean University, Famagusta, North Cyprus, via Mersin 10, Turkey
E-mail: onsen.toygar@emu.edu.tr

Abstract: A real-world animal biometric system that detects and describes animal life in image and video data is an emerging
subject in machine vision. These systems develop computer vision approaches for the classification of animals. A novel method
for animal face classification based on score-level fusion of recently popular convolutional neural network (CNN) features and
appearance-based descriptor features is presented. The method fuses, at score level, two different approaches: one uses a
CNN, which can automatically extract features, learn and classify them; the other uses kernel Fisher analysis (KFA) for its
feature extraction phase. The proposed method may also be used in other areas of image classification and object recognition.
The experimental results show that automatic feature extraction with a CNN is better than other simple feature extraction
techniques (both local and appearance-based features) and, additionally, that an appropriate score-level combination of CNN
and simple features can achieve even higher accuracy than applying the CNN alone. The authors show that the score-level
fusion of CNN-extracted features and the appearance-based KFA method has a positive effect on classification accuracy. The
proposed method achieves a 95.31% classification rate on animal faces, which is significantly better than the other
state-of-the-art methods.

1 Introduction

Animal recognition and classification is an important area which has received relatively little attention. Animal classification, the problem of distinguishing images of different animal species, is an easy task for humans, but evidence suggests [1] that even in simple cases such as cats and dogs it is difficult to perform automatically. Animals have flexible structures that can self-mask, and they usually appear in complex scenes. Also, like all objects, they may appear under different illumination conditions, viewpoints and scales. There have been attempts to apply recognition methods to images of animals, but the specific problem of animal categorisation has until recently attracted limited interest.

Many existing methods that show promising results for human face recognition cannot properly represent the diversity of animal classes, with their complex intra-class variability and inter-class similarity. There are several kinds of approaches for solving this problem, each with its advantages and disadvantages. The first approach constructs complex features which represent and discriminate sample images better, but creating such features is complicated and problem dependent. The second approach combines the features extracted by different methods and concatenates them to build a more powerful feature vector; however, increasing the size of the feature space increases the computational cost. Instead of using a complex representation, the information from different classifiers can be consolidated and a decision made according to the combined result. This method is known as score-level fusion.

Score-level fusion shows very good performance in multimodal biometric systems [2-5]. In a biometric recognition system, the match score is a measure of similarity between the input and template biometric feature vectors. When the match scores output by different biometric matchers are consolidated in order to arrive at a final recognition decision, fusion is said to be done at the match score level, which is also known as score-level fusion. Apart from the raw data and feature vectors, the match scores contain the richest information about the input pattern. Additionally, it is relatively easy to access and combine the scores generated by different biometric matchers. Score-level fusion has been used in many research studies. In [2], the authors suggested a score-level fusion of fingerprint and face matchers for personal verification under stress conditions. Elmir et al. [3] proposed a biometric identification system which used score-level fusion of fingerprint and voice. Sim et al. [4] presented a method that combines face and iris biometric traits with a weighted score-level fusion technique to flexibly fuse the matching scores from these two modalities based on their weight availability. In [5], the authors proposed a multi-biometric system in which three traits, fingerprint, palmprint and iris, are combined using a weighted fusion technique for person identification.

In this paper, we show that score-level fusion can also be applied to other classification and recognition problems. We believe that the performance rate can be increased when the information of two different types of classification system is consolidated. For this purpose, we construct two different classifier systems, one based on features extracted from a CNN and the other based on appearance and shape-based features. We then combine their information and make the decision according to the fused result.

We tested our method on the LHI-Animal-Faces dataset, which consists of 20 classes: 19 classes of different animals and one class of human faces. The classes are challenging to discern because of their evolutionary relationships and shared parts. Besides, interesting within-class variation is present in the face categories, including rotation, flip transforms, posture variation and sub-types. From this collection, we select 30 images from each class for training and the rest of the images for testing. We used this experimental setting in order to compare with the other state-of-the-art methods. Our results show that the score-level fusion approach improves the classification accuracy significantly.

The outline of this paper is as follows. Section 2 gives an overview of related work. Section 3 reviews the feature extraction methods used in our experiments. Section 4 describes the proposed method, Section 5 presents the experimental setup and results, and Section 6 concludes the paper.

2 Related work

When dealing with a complex image database, it is important to use a method capable of capturing information from all the different classes. Sparse texture descriptors [6] depend on the region detector used and on how complex the structure and texture of the object and background are. As a result, such methods may fail, especially for animal classes with smooth skin texture and complex backgrounds.

Animal recognition and classification can also be used in expert systems for determining wild animals' migration corridors. Animals are one of the object categories used in the literature for object recognition, and the success rate of object recognition depends on good object representation and characterisation. Object characterisation can be achieved by visual descriptors, shape descriptors or texture representation. In this paper, score-level fusion of different visual descriptors is used to classify the animal images.

One of the earliest attempts to perform recognition on an animal database was made by Schmid [7], who constructed models for content-based image retrieval using Gabor-like filters. The method was tested on only four different classes, and all the animals used in this work had complex skin texture. Later, Ramanan et al. [8, 9] introduced methods to detect textured animals using the shape and texture information in video sequences. In an application for searching images on the Internet, Berg and Forsyth [10] used four cues, nearby text on the web pages, colour, texture and shape, to re-rank the images retrieved by Google image search. They reported that animals are among the hardest classes of objects to recognise in computer vision.

Peng et al. [11] proposed a deep neural network for image classification and tested it on the LHI-Animal-Faces dataset. They proposed a deep boosting framework based on layer-by-layer joint feature boosting and dictionary learning. In each layer, they constructed a dictionary of filters by combining the filters from the lower layer, and iteratively optimised the image representation with a joint discriminative-generative formulation.

A method for visual object categorisation was presented by Afkham et al. [12]. It is based on encoding the joint textural information in objects and the surrounding background. The authors tested this framework for classification among 13 different classes of animals.

Si and Zhu [13] introduced a framework for learning a generative image representation which needs only a small number of training image samples. Each learned template is composed of image patches whose features, such as location, scale and orientation, may adapt within a local neighbourhood for deformation. The appearances of these patches are characterised, respectively, by their local sketch, texture gradients, flatness regions and colours. The patches are then sorted and selected according to their information gain. This automated feature selection procedure allows their algorithm to scale up to a wide range of image categories, from those with regular shapes to those with stochastic texture. They evaluated the hybrid image templates on several public benchmarks including the LHI-Animal-Faces dataset.

3 Feature extraction methods

Feature extraction is an important step in a biometric system, and the method used for feature extraction significantly affects the final decision of a biometric recognition system. Several feature extractors from the literature are implemented in this study. These feature extraction methods, namely convolutional neural network (CNN), histograms of oriented gradients (HOGs), median robust extended local binary pattern (MRELBP), kernel Fisher analysis (KFA) and biologically inspired features (BIFs), are explained briefly below.

3.1 Convolutional neural network

CNN is a powerful machine learning technique from the field of deep learning which has been used in many computer vision tasks [14]. In this research, we apply a pre-trained CNN as a feature extractor to obtain discriminative representations of animal faces. Training a CNN requires a huge set of training image samples; from such large training sets, CNNs automatically learn to extract discriminative features which, in most cases, outperform hand-crafted features such as HOG, local binary patterns (LBPs) or speeded-up robust features (SURF).

Traditional CNNs consist of a stack of convolutional layers followed by some fully connected layers and use a softmax multi-category classifier with the cross-entropy loss function. We first briefly describe each layer type of a CNN:

• Convolutional layer: the purpose of this layer is to extract spatial correlation and invariant features by using various convolutional filters.
• Pooling layer: in order to remove high-frequency noise, the feature maps produced by a convolutional layer are down-sampled by extracting the local maximum (max-pooling) or average (avg-pooling) value of each patch in the feature map.
• Fully connected layer: each neuron in this layer is fully connected to all neurons of its two adjacent layers.

In the following sub-sections, we briefly introduce the two widely used CNN architectures that we employed for automatic feature extraction from animal faces, namely AlexNet and VGG-16.

3.1.1 AlexNet: AlexNet [15] is a deep CNN for image classification that won the ImageNet large-scale visual recognition challenge (ILSVRC)-2012 competition [16], achieving a winning top-5 test error rate of 15.3%, compared with 26.2% for the second-best entry. AlexNet is composed of five convolutional layers (C1 to C5) followed by two fully connected layers (FC6 and FC7) and a final softmax output layer (FC8). The architecture of AlexNet is shown in Fig. 1.

Fig. 1  AlexNet architecture [15]

3.1.2 VGG-16: The VGG-16 [17] architecture was proposed in ILSVRC 2014. The Oxford Visual Geometry Group's model is deeper and wider than earlier CNN structures. VGG-16 has five batches of convolution operations, each batch consisting of 3-5 adjacent convolution layers, and adjacent batches are connected via max-pooling layers. The kernels in all convolutional layers are 3 × 3, and the number of kernels within each batch is the same (increasing from 64 in the first group to 512 in the last one). Fig. 2 illustrates the 16-layer VGG architecture. The VGG-16 network architecture has been used in much research and was the first to outperform human-level performance on ImageNet [18].

Fig. 2  VGG-16 architecture [17]
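As a concrete illustration of how such pre-trained networks can serve as fixed feature extractors (the use made of them later in Section 5), the following sketch reads out the 4096-dimensional FC7 activations of VGG-16. PyTorch/torchvision is our assumption, as the paper does not name a framework, and the image file name is hypothetical.

```python
# Sketch: extracting 4096-D FC7 activations from a pre-trained VGG-16
# (PyTorch/torchvision assumed; not the authors' stated implementation).
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

vgg = models.vgg16(pretrained=True)
vgg.eval()

# Keep the classifier only up to FC7's ReLU; drop FC8 and the softmax.
fc7_head = torch.nn.Sequential(*list(vgg.classifier.children())[:5])

preprocess = T.Compose([
    T.Resize((224, 224)),                     # VGG-16 input is 224 x 224 x 3
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet statistics
                std=[0.229, 0.224, 0.225]),
])

def fc7_features(path: str) -> torch.Tensor:
    x = preprocess(Image.open(path).convert('RGB')).unsqueeze(0)
    with torch.no_grad():
        x = vgg.features(x)             # convolutional trunk
        x = vgg.avgpool(x).flatten(1)   # 25088-D vector
        return fc7_head(x)              # 4096-D FC7 activation

feat = fc7_features('animal_face.jpg')  # hypothetical image file
print(feat.shape)                       # torch.Size([1, 4096])
```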
3.2 Histograms of oriented gradients

One of the popular feature extractors that has been used in object classification and face recognition is HOG [19]. This method constructs features by counting occurrences of gradient orientations in local patches or detection windows of an image. In this approach, an image is decomposed into different partitions and the histogram of orientations for each partition is computed; finally, all of these histograms are concatenated to construct the HOG feature. HOG has been used in several recognition systems [20, 21].
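As a minimal sketch of the descriptor just described, scikit-image's hog function computes the per-cell orientation histograms and concatenates them; the library choice and parameter values are our assumptions, and the file name is hypothetical.

```python
# Sketch: computing a HOG descriptor with scikit-image.
from skimage.feature import hog
from skimage.io import imread
from skimage.transform import resize

image = imread('animal_face.jpg', as_gray=True)   # hypothetical file
image = resize(image, (60, 60))                   # the paper resizes inputs to 60 x 60

# Gradient histograms per cell, block-normalised, then concatenated.
descriptor = hog(image,
                 orientations=9,          # number of orientation bins
                 pixels_per_cell=(8, 8),
                 cells_per_block=(2, 2),
                 block_norm='L2-Hys')
print(descriptor.shape)                   # 1-D concatenated feature vector
```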

3.3 Median robust extended LBP

The local binary patterns (LBP) method [22] is considered among the most computationally efficient high-performance texture descriptors. LBP has been used in several recognition problems such as face, iris, palmprint and plant recognition [20, 21]. However, the LBP method is very sensitive to image noise and is unable to capture macrostructure information. In order to address these disadvantages, Liu et al. [23] introduced a novel descriptor for texture classification, namely MRELBP. Instead of using raw image intensities, MRELBP compares medians of image intensities in local regions. By using a multiscale LBP-type descriptor and a novel sampling scheme, this method can capture both microstructure and macrostructure texture information. MRELBP is shown to be highly robust to different types of image noise, such as salt-and-pepper noise, Gaussian noise, Gaussian blur and random pixel corruption. Liu et al. [24] compared a large number of LBP variants, designing experiments to measure the descriptors' robustness against changes in rotation, viewpoint, illumination and scale, different types of image degradation and the number of classes, as well as their computational complexity; the best overall performance was obtained by the MRELBP feature.
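MRELBP itself has no single off-the-shelf implementation, but the LBP family it extends is easy to demonstrate. The sketch below computes a plain uniform-LBP histogram with scikit-image (our choice of library); it omits MRELBP's median filtering and multiscale sampling.

```python
# Sketch: a basic uniform-LBP histogram, as a rough illustration of the
# descriptor family that MRELBP extends (not MRELBP itself).
import numpy as np
from skimage.feature import local_binary_pattern
from skimage.io import imread

image = imread('animal_face.jpg', as_gray=True)   # hypothetical file
image = (image * 255).astype('uint8')             # LBP expects integer intensities

P, R = 8, 1                                       # 8 neighbours on a radius-1 circle
codes = local_binary_pattern(image, P, R, method='uniform')

# Histogram of the P + 2 uniform pattern codes, normalised to sum to 1.
hist, _ = np.histogram(codes, bins=np.arange(P + 3), density=True)
print(hist)                                       # 10-bin texture descriptor
```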

3.4 Kernel Fisher discriminant analysis

The KFA method [25] is an extension of Fisher discriminant analysis (FDA). In this approach, the input space is first expanded using a non-linear mapping, and multiclass FDA is then applied in the obtained feature space. The non-linear mapping increases the dimensionality of the feature space and, as a result, improves the discriminative ability of the KFA method. The main advantage of the KFA method is that it can be applied to multiclass pattern classification problems and that its solution is unique, which is its superiority over the generalised discriminant analysis [25] method, which produces multiple solutions.

3.5 Biologically inspired features

BIF is a method that tries to model visual processing in the cortex as a stack of increasingly sophisticated layers. Riesenhuber and Poggio [26] proposed a set of features derived from a feed-forward model of the primate visual object recognition pathway, called the 'HMAX' model. The model consists of two different types of layers: S (simple) units and C (complex) units. A remarkable feature of this model is that it uses the non-linear maximum operation 'MAX' instead of the linear summation operation 'SUM' when pooling inputs: at the layer called C1, the maximum values within local patches and across the scales within a band are computed, so the C1 feature comprises eight bands and four orientations. BIFs have been investigated for object category recognition [27, 28] and face recognition [29]. An overview of the BIF system is shown in Fig. 3.

Fig. 3  System architecture of BIF [21]
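The distinctive 'MAX' pooling of the C1 layer can be illustrated in a few lines. The toy NumPy sketch below pools two S1 response maps of adjacent scales over local patches and then across the scales; a real BIF pipeline would use Gabor-filtered S1 maps over eight bands and four orientations.

```python
# Toy sketch of HMAX-style C1 pooling: MAX over local patches, then MAX
# across adjacent scales (NumPy; not a full BIF implementation).
import numpy as np

def c1_max_pool(s1_a: np.ndarray, s1_b: np.ndarray, pool: int = 8) -> np.ndarray:
    """Pool two adjacent-scale S1 maps of one orientation into a C1 map."""
    h = min(s1_a.shape[0], s1_b.shape[0]) // pool * pool
    w = min(s1_a.shape[1], s1_b.shape[1]) // pool * pool
    out = np.empty((h // pool, w // pool))
    for i in range(0, h, pool):
        for j in range(0, w, pool):
            patch_a = s1_a[i:i + pool, j:j + pool].max()
            patch_b = s1_b[i:i + pool, j:j + pool].max()
            out[i // pool, j // pool] = max(patch_a, patch_b)  # MAX, not SUM
    return out

# Two fake S1 response maps at adjacent scales of one orientation.
rng = np.random.default_rng(0)
c1 = c1_max_pool(rng.random((64, 64)), rng.random((64, 64)))
print(c1.shape)  # (8, 8)
```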
4 Proposed method
According to the success of score-level fusion in multimodal biometric recognition systems, it is believed that accuracy can be improved when the information of two different types of classifiers is consolidated. Due to the large inter-class similarity and intra-class variation, we need to perform fusion of two different kinds of feature descriptors: an appropriate score-level combination of CNN and other simple features can achieve even higher accuracy than applying the CNN alone. In our proposed method, we therefore perform score-level fusion of features extracted by two different descriptors.

For descriptor selection, we investigated different feature descriptors and tried to select the most powerful ones. Inspired by recent face and object recognition works in the computer vision community [26-28], we used features extracted from a fine-tuned pre-trained CNN, which achieves state-of-the-art performance in many computer vision tasks, and combined them with the features extracted by KFA, one of the most successful appearance-based recognition methods; recent research [25] shows that it outperforms the linear discriminant analysis (LDA) [30] and principal component analysis [31] methods.

The overview of the proposed method is shown in Fig. 4. For each input animal head image, in the pre-processing step, in order to eliminate the negative effect of factors such as size, illumination and picture quality, we perform simple manipulations such as image resizing, conversion from RGB to greyscale and histogram equalisation; for the CNN features, we only resize the input image. Two different sets of features are then computed. After that, we compare each feature vector with all of the feature vectors in the training set and select the minimum distance for each method. The distance between the test and training animal images is taken as the score of that sample. In the next step, we normalise the obtained scores and fuse them together. Finally, we make the decision using a nearest neighbour (NN) classifier on the normalised fused scores.
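A compact sketch of the scoring step just described, in NumPy: each branch's score for a test image is its minimum distance to the training set under that branch's features. Function names and dimensionalities here are illustrative assumptions, not the authors' code.

```python
# Sketch of the two-branch scoring in the proposed pipeline (NumPy).
import numpy as np

def branch_score(test_feat: np.ndarray, train_feats: np.ndarray) -> float:
    """Minimum Euclidean distance from one test vector to all training vectors."""
    return float(np.linalg.norm(train_feats - test_feat, axis=1).min())

# Hypothetical features: one CNN branch (4096-D FC7), one KFA branch (say 64-D).
rng = np.random.default_rng(0)
train_cnn, train_kfa = rng.random((600, 4096)), rng.random((600, 64))
test_cnn, test_kfa = rng.random(4096), rng.random(64)

score_cnn = branch_score(test_cnn, train_cnn)   # raw score, CNN branch
score_kfa = branch_score(test_kfa, train_kfa)   # raw score, KFA branch
# These raw scores are then min-max normalised and fused; see eqs. (1)-(2) in Section 5.
```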

Fig. 4  Block diagram of the proposed method

Fig. 5  LHI-Animal-Faces dataset. Five images are shown for each category

5 Experiments and results

Several experiments have been carried out in order to show the performance of the proposed method and of the other state-of-the-art methods. In the following subsections, the dataset description, the experimental setup and the results are presented.

5.1 Dataset

The LHI-Animal-Faces dataset [13] consists of 19 classes of animal head images plus one class of human head images, with 2200 images overall. Fig. 5 shows five sample images from each of these categories. In contrast with other general classification datasets, LHI-Animal-Faces contains only animal or human faces, which exhibit large inter-class similarities (due to evolutionary relationships, some animal face categories are similar to other classes) and large intra-class variations (rotation, posture variation, subtypes).

5.2 Experimental setup and results

In all of the following experiments, the LHI-Animal-Faces dataset is divided into training and test sets in the same way as in the AND-OR template (AOT) [32] method. We use 30 animal images of each class as the training set and the remaining images of each class as testing examples.

In order to eliminate the negative effect of factors such as size, illumination and picture quality, all the images are resized to 60 × 60 pixels and pixel intensity normalisation and histogram equalisation are performed in the pre-processing step for the local and appearance-based features; for the CNN features, we only resize the image to 224 × 224 for VGG-16 and 227 × 227 for AlexNet. We investigated different types of feature descriptors for both the local feature-based and the appearance-based classifiers. We used HOG, completed local binary patterns (CLBP), local binary pattern histogram Fourier (LBP-HF), Haralick features and MRELBP as local feature descriptors, and tested LDA and KFA as appearance-based feature extractors. For each method, we used a cross-validation approach to find the optimal parameter settings. After feature extraction, we used a linear support vector machine (SVM) for classification. The accuracy of each individual method is given in Table 1.

Additionally, we used CNNs as automatic feature extractors. We selected the publicly available AlexNet and VGG-16 architectures, pre-trained on the ImageNet database [18] for image classification. Our choice is motivated by the impressive results achieved by these two models on the ILSVRC [16] and by the fact that both models are freely accessible in pre-trained form. These models are trained on hundreds of thousands of different images and can separate an image into 1000 pre-defined classes of objects such as cars, bikes and airplanes. As a result, these models have learned powerful discriminative feature sets which can be reused in other object recognition tasks.

The CNN architectures of AlexNet and VGG-16 contain millions of parameters. Directly learning so many parameters from only a few hundred training images of the LHI-Animal-Faces dataset is problematic, and there are several solutions to this problem. Our first solution is to use a CNN pre-trained on a large image collection such as the ImageNet database and then re-use it on the LHI-Animal-Faces dataset. In order to exploit the representational power of pre-trained deep networks, 4096-dimensional features are extracted from the activations of the FC7 layer.
Table 1 Classification accuracy of different methods on the LHI-Animal-Faces dataset
Type of method                                Method                                            Accuracy, %
local feature descriptor methods              HOG                                               66.54
                                              LBP                                               61.74
                                              CLBP [33]                                         63.59
                                              LBP-HF [34]                                       50.29
                                              Haralick features                                 49.27
                                              BIF                                               68.46
                                              MRELBP                                            68.46
appearance-based feature descriptor methods   LDA                                               60.33
                                              KFA                                               69.87
CNN features                                  FC7 AlexNet features                              89.91
                                              FC7 VGG-16 features                               92.84
                                              fine-tuned AlexNet                                91.06
                                              fine-tuned VGG-16                                 94.39
score-level fusion methods                    LDA + HOG                                         74.32
                                              LDA + LBP                                         68.91
                                              LDA + CLBP                                        70.23
                                              LDA + LBP-HF                                      62.44
                                              LDA + Haralick features                           61.59
                                              LDA + BIF                                         77.26
                                              LDA + MRELBP                                      76.30
                                              LDA + FC7 AlexNet features                        90.61
                                              LDA + FC7 VGG-16 features                         93.77
                                              KFA + HOG                                         76.48
                                              KFA + LBP                                         74.19
                                              KFA + CLBP                                        74.14
                                              KFA + LBP-HF                                      72.65
                                              KFA + Haralick features                           70.94
                                              KFA + MRELBP                                      78.98
                                              KFA + FC7 AlexNet features                        91.37
                                              KFA + FC7 VGG-16 features                         94.21
                                              KFA + FC7 fine-tuned AlexNet features             92.86
                                              proposed method (KFA + FC7 fine-tuned VGG-16)     95.31

We selected the activations of the FC7 layer for feature extraction because, according to Donahue et al.'s results [35], the activations of the earlier layers extract mid-level image features, whereas the activations of the late layers can be used as very powerful features for many classification applications. In the next step, we train a linear multiclass SVM on the extracted features.

In order to extract features from an image, the image is resized and fed to the CNN as a multidimensional array of pixel intensities. The input of the VGG-16 net is a 224 × 224 × 3 matrix, while the input of the AlexNet architecture is a 227 × 227 × 3 matrix. As the experimental results in Table 1 show, the classifier trained using AlexNet features provides close to 90% accuracy and the classifier trained using VGG-16 features provides close to 93% accuracy, both higher than the accuracy achieved using hand-crafted features [36] such as LBP and HOG. The reason VGG-16 outperforms AlexNet is that the VGG-16 architecture is much deeper, with 16 layers in total: 13 convolutional and three fully connected. VGG-16 uses small convolutional filters of 3 × 3 pixels, so each filter captures simpler geometrical structures, but in return the increased depth allows more complex reasoning.

Another solution to the problem is fine-tuning the pre-trained CNN using our limited dataset. However, since our dataset is small, fine-tuning the pre-trained network on it might lead to overfitting, especially since the last few layers of the VGG and AlexNet networks are fully connected. In order to avoid overfitting, we increase the number of training samples with common data augmentation strategies such as translation, rotation and flipping; training on the augmented dataset makes the resulting model more robust and less prone to overfitting. We increase the size of the training set five times by performing random rotations and translations on each image sample. We also resize the training images to 256 × 256 and select four patches of size 224 × 224 for VGG-16 and 227 × 227 for AlexNet at the four image corners, using them as inputs of the corresponding CNN architecture.

As the first step of fine-tuning, we truncate the last (softmax) layer of the pre-trained VGG-16 and AlexNet networks and replace it with a new softmax layer with 20 output categories instead of 1000. We freeze the weights of the early layers so that they remain intact throughout the fine-tuning process. We then fine-tune the network using the stochastic gradient descent algorithm to minimise the loss function, with a small initial learning rate of 0.001.
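The fine-tuning recipe above maps directly onto a few lines of PyTorch (an assumed framework, as before); the 0.001 learning rate is the paper's, while the momentum value is our addition.

```python
# Sketch of the described fine-tuning: swap the 1000-way softmax layer for a
# 20-way one, freeze the early layers, and train with SGD at lr = 0.001.
import torch
import torch.nn as nn
import torchvision.models as models

model = models.vgg16(pretrained=True)

# Freeze the convolutional trunk so early-layer weights stay intact.
for p in model.features.parameters():
    p.requires_grad = False

# Replace the final classifier layer: 1000 ImageNet classes -> 20 LHI classes.
model.classifier[6] = nn.Linear(4096, 20)

optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad),
    lr=0.001, momentum=0.9)      # lr from the paper; momentum is our assumption
criterion = nn.CrossEntropyLoss()

def train_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    """One SGD step on a batch of augmented 224 x 224 face crops."""
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```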
We believe that an appropriate score-level combination of CNN and other simple features can achieve even higher accuracy than applying the CNN alone. In order to show that score-level fusion can improve the accuracy of the classification system, we tested different types of feature descriptors, namely local feature-based, appearance-based and CNN-based features. We used HOG, CLBP, LBP-HF, Haralick features and MRELBP as local feature descriptors, tested LDA and KFA as appearance-based feature extractors, and used the FC7 activation features of AlexNet and VGG-16 as CNN-based features. We tested score-level fusion of different combinations of these features, and some of the best combinations are reported in Table 1. The procedure is detailed below.

The distance between each test sample and its nearest training sample is taken as the score of that test sample in the corresponding classifier. These scores are normalised by the min-max normalisation method [37] as follows:

x′ = (x − min(x)) / (max(x) − min(x))    (1)


where x is the raw score, max(x) and min(x) are the maximum and minimum values of the raw scores, respectively, and x′ is the normalised score.

The multimodal score vector (x1, x2) can be constructed after score normalisation, with x1 and x2 corresponding to the normalised scores of the two different systems. The next step is fusion at the matching score level. The score vector is combined by the sum rule-based fusion method [37] to generate a single scalar score which is then used to make the final decision:

f_s = w1 x1 + w2 x2    (2)

The notation wi stands for the weight assigned to each of the two systems and reflects their relative importance. In Table 1, w1 is the weight of the first method in the (method1 + method2) notation. We used a grid-search algorithm to find the optimal values of w1 and w2 for all the score-level fusion experiments, trying different values between 0 and 1 with w1 = 1 − w2. The optimal values for the proposed method are w1 = 0.3 and w2 = 0.7, which reflects the greater importance of the CNN system relative to the KFA method.

The experimental results show that in all cases score-level fusion brings a meaningful improvement in accuracy. Table 1 shows the classification accuracy of the different feature descriptor methods individually and of their score-level fusion on the LHI-Animal-Faces dataset. As shown in the table, the proposed method, which fuses the FC7 activation features of the fine-tuned pre-trained VGG-16 with KFA at score level, achieves a 95.31% classification rate, better than all the other methods presented in that table. Therefore, it can be stated that the proposed method outperforms the other local and appearance-based methods and all the possible score-level fusion pairs of these methods.

In order to show the classification accuracy for each class, we computed confusion matrices both for the fine-tuned VGG-16 alone and for the score-level fusion of fine-tuned VGG-16 and KFA, as shown in Fig. 6. The classification accuracy of seven classes is 100%, and for many other classes it is close to 100%. In the fine-tuned VGG-16 confusion matrix, the maximum confusions are caused by deer head and rabbit head versus dog head (12%); these confusion values are reduced to 0 and 4%, respectively, by the score-level fusion of fine-tuned VGG-16 and KFA. In the proposed method, the maximum confusions are caused by bear head versus pigeon head and rabbit head versus mouse head (8%). In the fine-tuned VGG-16, 11 classes are confused with the dog head class, which is reduced to four classes in the score-level fusion of fine-tuned VGG-16 and KFA.

Fig. 6  Confusion matrices: (a) fine-tuned VGG-16, (b) score-level fusion of fine-tuned VGG-16 and KFA

On the other hand, the proposed method is compared with the other state-of-the-art methods that report results on the LHI-Animal-Faces dataset. Table 2 exhibits the classification accuracy of our proposed method and of the state-of-the-art methods on LHI-Animal-Faces. It is clearly seen that our proposed method outperforms the other state-of-the-art algorithms for animal face classification.

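To make eqs. (1) and (2) and the weight search concrete, a small NumPy sketch follows; the accuracy_fn callback standing in for validation accuracy is our placeholder, not part of the paper.

```python
# Sketch of eqs. (1)-(2) and the grid search over w1 = 1 - w2 (NumPy).
import numpy as np

def min_max(scores: np.ndarray) -> np.ndarray:
    """Eq. (1): min-max normalisation of raw matching scores."""
    return (scores - scores.min()) / (scores.max() - scores.min())

def fuse(x1: np.ndarray, x2: np.ndarray, w1: float) -> np.ndarray:
    """Eq. (2): weighted sum-rule fusion, with w2 = 1 - w1."""
    return w1 * x1 + (1.0 - w1) * x2

def grid_search_w1(x1, x2, accuracy_fn, step=0.1):
    """Try w1 in [0, 1] and keep the weight giving the best accuracy."""
    best_w1, best_acc = 0.0, -1.0
    for w1 in np.arange(0.0, 1.0 + 1e-9, step):
        acc = accuracy_fn(fuse(x1, x2, w1))   # e.g. NN accuracy on validation data
        if acc > best_acc:
            best_w1, best_acc = w1, acc
    return best_w1, best_acc
```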
Table 2 Classification accuracy on the LHI-Animal-Faces dataset
Method                 Accuracy, %
HOG + SVM [13]         70.8
HIT [13]               75.6
LSVM [38]              77.6
AOT [32]               79.1
deep boosting [11]     81.5
proposed method        95.31

6 Conclusions

A novel animal classification system is proposed in this paper. The proposed system applies score-level fusion to two powerful feature extractors: KFA and the FC7 activation features of VGG-16, pre-trained on the ImageNet dataset and fine-tuned on the LHI-Animal-Faces dataset. Local feature extractors, appearance-based methods, two well-known CNN architectures and several score-level fusion pairs of these methods are used in the experiments to show the superiority of the proposed method on the LHI-Animal-Faces database. Additionally, the proposed method is compared with several state-of-the-art systems that use the same dataset. It is demonstrated that the proposed method, using score-level fusion of the appearance-based KFA method and the learned features of VGG-16, outperforms the state-of-the-art systems for animal face classification.

7 References

[1] Elson, J., Douceur, J., Howell, J., et al.: 'Asirra: a CAPTCHA that exploits interest-aligned manual image categorization'. Proc. ACM Conf. Computer and Communications Security (CCS), Alexandria, VA, USA, 2007, pp. 366-374
[2] Marcialis, G., Roli, F.: 'Score-level fusion of fingerprint and face matchers for personal verification under stress conditions'. 14th IEEE Int. Conf. Image Analysis and Processing (ICIAP), DC, USA, 2007, pp. 259-264
[3] Elmir, Y., Elberrichi, Z., Adjoudj, R.: 'Score-level fusion based multimodal biometric identification (fingerprint & voice)'. 6th Int. Conf. Sciences of Electronics, Technologies of Information and Telecommunications, Sousse, 2010, pp. 146-150
[4] Sim, H., Hishammuddin, A., Rohayanti, H., et al.: 'Multimodal biometrics: weighted score-level fusion based on non-ideal iris and face images', Expert Syst. Appl., 2014, 41, (11), pp. 5390-5404
[5] Patil, A., Bhalke, D.: 'Fusion of fingerprint, palmprint and iris for person identification'. Int. Conf. Automatic Control and Dynamic Optimization Techniques (ICACDOT), Pune, 2016, pp. 960-963
[6] Takimoto, H., Mitsukura, Y., Fukumi, M., et al.: 'Robust gender and age estimation under varying facial pose', Electron. Commun. Jpn., 2008, 91, (7), pp. 32-40
[7] Schmid, C.: 'Constructing models for content-based image retrieval'. Proc. 2001 IEEE Computer Society Conf. Computer Vision and Pattern Recognition (CVPR 2001), Kauai, USA, December 2001, pp. 11-39
[8] Ramanan, D., Forsyth, D.A., Barnard, K.: 'Detecting, localizing and recovering kinematics of textured animals'. 2005 IEEE Computer Society Conf. Computer Vision and Pattern Recognition, San Diego, USA, June 2005, pp. 635-642
[9] Ramanan, D., Forsyth, D.A., Barnard, M.-K.: 'Building models of animals from video', IEEE Trans. Pattern Anal. Mach. Intell., 2006, 28, (8), pp. 1319-1334
[10] Berg, T.L., Forsyth, D.A.: 'Animals on the web'. 2006 IEEE Computer Society Conf. Computer Vision and Pattern Recognition (CVPR'06), NY, USA, 2006, pp. 1463-1470
[11] Peng, Z., Li, Y., Cai, Z., et al.: 'Deep boosting: joint feature selection and analysis dictionary learning in hierarchy', Neurocomputing, 2016, 178, (20), pp. 36-45
[12] Afkham, H., Tavakoli, A., Eklundh, J., et al.: 'Joint visual vocabulary for animal classification'. Int. Conf. Pattern Recognition (ICPR 2008), Tampa, FL, USA, 2008, pp. 1-4
[13] Si, Z., Zhu, S.-C.: 'Learning hybrid image templates (HIT) by information projection', IEEE Trans. Pattern Anal. Mach. Intell., 2012, 34, (7), pp. 1354-1367
[14] Druzhkov, P.N., Kustikova, V.D.: 'A survey of deep learning methods and software tools for image classification and object detection', Pattern Recognit. Image Anal., 2016, 26, (1), pp. 9-15
[15] Krizhevsky, A., Sutskever, I., Hinton, G.: 'Imagenet classification with deep convolutional neural networks'. Advances in Neural Information Processing Systems, Lake Tahoe, USA, 2012, pp. 1097-1105
[16] Russakovsky, O., Deng, J., Su, H., et al.: 'Imagenet large scale visual recognition challenge', Int. J. Comput. Vision, 2015, 115, (3), pp. 211-252
[17] Simonyan, K., Zisserman, A.: 'Very deep convolutional networks for large-scale image recognition', arXiv preprint arXiv:1409.1556, 2014
[18] Deng, J., Dong, W., Socher, R.: 'Imagenet: a large-scale hierarchical image database'. Proc. IEEE Conf. Computer Vision and Pattern Recognition, Miami, FL, USA, 2009, pp. 248-255
[19] Dalal, N., Triggs, B.: 'Histograms of oriented gradients for human detection'. Computer Vision and Pattern Recognition, San Diego, CA, USA, June 2005, pp. 886-893
[20] Farmanbar, M., Toygar, Ö.: 'Feature selection for the fusion of face and palmprint biometrics', Signal Image Video Process., 2016, 10, (5), pp. 951-958
[21] Eskandari, M., Toygar, Ö.: 'Fusion of face and iris biometrics using local and global feature extraction methods', Signal Image Video Process., 2014, 8, (6), pp. 995-1006
[22] Ojala, T., Pietikäinen, M., Maenpää, T.: 'Multiresolution gray-scale and rotation invariant texture classification with local binary patterns', IEEE Trans. Pattern Anal. Mach. Intell., 2002, 24, (7), pp. 971-987
[23] Liu, L., Lao, S., Fieguth, P., et al.: 'Median robust extended local binary pattern for texture classification', IEEE Trans. Image Process., 2016, 25, (3), pp. 1368-1381
[24] Liu, L., Fieguth, P., Guo, Y., et al.: 'Local binary features for texture classification: taxonomy and experimental study', Pattern Recognit., 2017, 62, pp. 135-160
[25] Liu, C.: 'Capitalize on dimensionality increasing techniques for improving face recognition grand challenge performance', IEEE Trans. Pattern Anal. Mach. Intell., 2006, 28, (5), pp. 725-737
[26] Riesenhuber, M., Poggio, T.: 'Hierarchical models of object recognition in cortex', Nat. Neurosci., 1999, 2, pp. 1019-1025
[27] Cord, M., Theriault, C., Thome, N.: 'HMAX-S: deep scale representation for biologically inspired image categorization'. Proc. IEEE Int. Conf. Image Processing, Brussels, Belgium, 2011, pp. 1261-1264
[28] Thériault, C., Thome, N., Cord, M.: 'Extended coding and pooling in the HMAX model', IEEE Trans. Image Process., 2013, 22, (2), pp. 764-777
[29] Bingpeng, M., Su, Y., Jurie, F.: 'Covariance descriptor based on bio-inspired features for person re-identification and face verification', Image Vision Comput., 2014, 32, (6-7), pp. 379-390
[30] Belhumeur, P.N., Hespanha, J.P., Kriegman, D.J.: 'Eigenfaces vs. fisherfaces: recognition using class specific linear projection', IEEE Trans. Pattern Anal. Mach. Intell., 1997, 19, pp. 711-720
[31] Eskandari, M., Toygar, Ö.: 'Selection of optimized features and weights on face-iris fusion using distance images', Comput. Vis. Image Underst., 2015, 137, pp. 63-75
[32] Si, Z., Zhu, S.-C.: 'Learning and-or templates for object recognition and detection', IEEE Trans. Pattern Anal. Mach. Intell., 2013, 35, (9), pp. 2189-2205
[33] Guo, Z., Zhang, D., Zhang, D.: 'A completed modeling of local binary pattern operator for texture classification', IEEE Trans. Image Process., 2010, 19, (6), pp. 1657-1663
[34] Ahonen, T., Matas, J., He, C., et al.: 'Rotation invariant image description with local binary pattern histogram Fourier features', Lect. Notes Comput. Sci., 2009, 5575, pp. 61-70
[35] Donahue, J., Jia, Y., Vinyals, O., et al.: 'DeCAF: a deep convolutional activation feature for generic visual recognition'. Int. Conf. Machine Learning (ICML), Beijing, China, 2014, pp. 647-655
[36] Nanni, L., Ghidoni, S., Brahnam, S.: 'Handcrafted vs. non-handcrafted features for computer vision classification', Pattern Recognit., 2017, 71, pp. 158-172
[37] He, M., Horng, S., Fan, F., et al.: 'Performance evaluation of score level fusion in multimodal biometric systems', Pattern Recognit., 2010, 43, pp. 1789-1800
[38] Felzenszwalb, P., Girshick, R., McAllester, D., et al.: 'Object detection with discriminatively trained part-based models', IEEE Trans. Pattern Anal. Mach. Intell., 2010, 32, (9), pp. 1627-1645
