
Bird Region Detection in Images with Multi-scale HOG Features and SVM Scoring

Rahul Kumar, Ajay Kumar and Arnav Bhavsar

Abstract In this paper, we address the problem of detecting regions (bounding boxes)
containing birds in images, which is closely related to the task of fine-grained visual
classification (FGVC) of bird images. We note that there exist various sophisticated
approaches proposed for this task within the overall framework of FGVC. However,
we demonstrate that the problem of bird region detection, by itself, can be addressed
in a rather simplistic manner. Our approach employs HOG features and a multi-
scale detection framework using the SVM classifier, but where real-valued scores
(or weights) from the SVM are used rather than the conventional binary decision
labels. We validate our approach on a variety of bird images from the CUB-200 bird
image data set and show that the proposed approach yields reasonable quality bird
region detection.

Keywords Bird region detection · Histogram of oriented gradients (HOG) ·
Multi-scale processing · SVM confidence score

1 Introduction

In recent years, the task of fine-grained visual classification (FGVC) has been
recognized as a challenging task for computer vision. The problem involves classifying
different types of objects of the same family. An application area of the FGVC
task is the identification of species of the same organism (e.g. plant [1], insects [2],

R. Kumar (B) · A. Kumar · A. Bhavsar


School of Computing and Electrical Engineering, Indian Institute of Technology Mandi,
Mandi, India
e-mail: rayiit2016@gmail.com
A. Kumar
e-mail: b12039@students.iitmandi.ac.in
A. Bhavsar
e-mail: arnav@iitmandi.ac.in

© Springer Nature Singapore Pte Ltd. 2018
B. B. Chaudhuri et al. (eds.), Proceedings of 2nd International Conference
on Computer Vision & Image Processing, Advances in Intelligent Systems
and Computing 704, https://doi.org/10.1007/978-981-10-7898-9_29
354 R. Kumar et al.

bird [3–6]), which is important for ecological or environmental studies. Among these,
the bird image classification task has been explored in many recent works.
The popularity of the bird image classification task is perhaps due to the associated
challenge of considering images acquired ‘in the wild’, as is the case with the well-
known CUB bird image data set [7]. A related problem (arguably, beneficial to the
classification task) involving such images captured in completely natural settings
is that of detecting birds in the image. The problem involves finding a region of
interest (ROI) which encompasses most of the pixels corresponding to the bird, while
eliminating a large part of the background.
Indeed, the bird ROI detection problem has often been considered within the
overall FGVC framework of bird image data (e.g. [3, 4, 6]), which, as we discuss
in the next subsection, typically involves sophisticated frameworks. However, the
detection task, by itself, can be a primary part of the overall FGVC pipeline. This
is because finding the image ROI locating the bird while eliminating most of the
background can be useful for better localized feature extraction, or as an initial
estimate for pixel-level segmentation [3, 8], or for constraining the image region
for detecting individual parts [5]. Notwithstanding variety in bird appearance and
the background, to benefit the overall processing pipeline, the bird ROI detection
task should, arguably, be relatively simpler as compared to subsequent tasks such as
segmentation, part localization and the overall classification. Thus, it is useful if the
bird ROI detection task is handled by a simple and efficient approach.
Hence, unlike in the works on overall FGVC, in this work, we demonstrate that the
bird ROI detection can indeed be achieved in a straightforward manner. Our proposed
method involves the traditional Histogram of Oriented Gradients (HOG) features,
used in a multi-scale manner, and an SVM classifier on local image regions. The
SVM classifier output is considered in terms of real-valued weights rather than the
conventional binary decision. We demonstrate on image examples
containing a variety of birds and backgrounds that the proposed strategy yields
reasonably good quality bird ROI detection, which satisfies the above-mentioned
role of the ROI detection task.

1.1 Related Work

As mentioned above, the task of ROI detection in bird images has been considered as
a part of approaches on FGVC. However, the region detection task has been reported,
primarily at the level of bird parts, while the complete bird detection or segmentation
is considered in relatively few works. However, as noted earlier, the latter
can be useful in further stages of FGVC such as segmentation, feature extraction or
part detection.
Having said that, we discuss below some prominent works on part detection,
whole-bird segmentation, and bird detection, as these can be considered as different
flavours of the detection task. As mentioned above, in most approaches, the bird
ROI detection is considered as a part of the overall FGVC task. Hence, in the
discussion below, we focus only on the ROI detection component considered in such
FGVC methods.
A deformable part descriptor based on the deformable parts model (DPM) is used
for part detection in [4]. The model parameters are learnt using a latent SVM trained
on filtered windows given by the part annotations. The underlying feature space
over which the DPM is defined consists of gradient, local binary patterns and colour
features. There are some FGVC approaches where the part detection is not considered
independently but is intrinsically related to the classification problem [5, 9, 10]. In
[5], given a bounding box around the bird regions, a segmentation is performed so
as to divide the region into multiple segments, with an assumption that each segment
contains a semantic attribute bird part. A latent conditional random field (CRF)
is then employed to learn (and detect) parts which are more discriminative, with
respect to the overall classification objective. The CRF is learnt via expectation
maximization, and some of the parameters are learnt as weights of an SVM. The
work in [9] also intricately links the detection problem with that of classification.
Here, a set of regions which are common across classes are learnt in an unsupervised
manner. Following this, a multi-kernel learning framework using SVM is employed
to learn weights of the features from such regions. The weights of the feature are
based on how discriminative the underlying region is. Thus, in essence, the approach
detects important regions which yield better discrimination. The feature space used
in this work consists of encoded HOG features. Another work following a similar
philosophy is that of [10], wherein image patches which are more discriminative
for classification are learnt using random forests. The features extracted from image
patches involve SIFT and related codewords from a Bag of words framework.
In recent years, some deep learning-based works have also been reported for FGVC,
wherein the whole-bird detection or part detection is also considered. For instance,
the region-based CNN (R-CNN) [6] uses the weights from the CNNs trained on the
complete bird as well as bird parts, in an SVM classifier. Moreover, the detection
is further refined using geometric constraints relating the location of parts and the
complete bird region. Another application of CNN for part detection is reported
in [11], wherein a fully convolutional network (FCN) is learnt to directly compute
part annotations, unlike the R-CNN approach. The work reported in [12] devises a
strategy to select filters from the CNN which correspond to semantically meaningful
parts. These are then used in the computation of part-saliency maps which can serve
as approximate part localization.
In [13], an approach for coarse segmentation of the overall bird region follows
a supervised Laplacian label propagation framework which involves minimizing a
cost with a smoothness constraint on the labels. The constraint involves some prior
knowledge about labels for some pixels, which is represented using the output of
a supervised SVM classifier. The SVM classifier is trained on encoded HOG and
colour features computed on super-pixels of the original image. Some relatively
simpler coarse segmentation strategies are also followed in [14, 15], wherein colour
information is learnt from regions near the image borders, and is used to represent the
background information. Based on this, a pixel is labelled in the bird or background
class based on the similarity with respect to this representation. While these
approaches are quite straightforward, an important limitation could be
that the regions close to the image borders alone may not be able to capture the
background variation well.
A somewhat different scenario for bird detection is considered in [16, 17], which
involves far-field views with birds localized to very small regions. One of these
methods employs AdaBoost classification with Haar and HOG features [16], while the
other [17] fuses the likelihoods from a CNN, an FCN, and super-parsing,
followed by a further SVM classification stage to yield better estimates. While
these works are worth mentioning, as they essentially address the bird detection
task, we note that the scenario considered in the present work involves near-field
detection, unlike these methods.
Some methods for FGVC address the recognition problem in an end-to-end manner,
such as [18–20], and it may seem that ROI detection is not required. However,
even in such cases, it may be argued that an ROI which contains more relevant
information about the actual classes, and wherein a large part of background (which
is not useful for classification) is eliminated, may provide more relevant inputs to
such end-to-end systems. This could be useful for better discrimination.
We can thus summarize the above discussion and place our approach in its
context as follows. The part detection task is often inherently connected to the
classification task in the overall FGVC framework and typically focuses on computing
the more discriminative parts. In contrast, we consider the task of bird detection,
which can be carried out independently and thus can be used with any subsequent
approaches in the FGVC pipeline. Moreover, most of the above approaches, including
the ones which do handle the detection part independently, are quite sophisticated,
involving many features and/or their encoding, intensive parameter learning, applying
latent SVM within another classification framework, deep learning, etc. As we argued
above, and demonstrate later, the bird ROI detection problem, viz. an initial building
block of FGVC (e.g. the ROI is assumed to be provided in [3, 5]), can be addressed
in a simpler manner. Indeed, in [14, 15], the detection (or coarse segmentation) is
carried out using a different strategy but which is also relatively simpler than the rest,
an observation that supports our argument.
The paper is organized as follows. In the next section, we elaborate on the proposed
approach, wherein we first provide a brief description about various modules used in
this work, followed by a description of the steps of the proposed detection process.
Section 3 consists of the experimental results, and we conclude in Sect. 4.

2 Proposed Approach

In this section, we briefly describe the HOG feature, SVM classification and the
multi-scale processing, in the context of their application in this work. We then
summarize the overall approach in a stepwise manner.

2.1 Histogram of Oriented Gradients (HOG) Features

The histogram of oriented gradients (HOG) is a feature descriptor commonly used in
visual pattern classification [21]. Some well-known examples of applications of
HOG for detection tasks are the DLIB face detection [22] and human detection [21].
Indeed, we are inspired by the work in [22], which shows that a complex problem of
large-scale face detection can also be addressed using a relatively simple application
of HOG features. The descriptor is based on counting the occurrences of gradient
orientations in localized portions of an image, and thus represents the local
distribution of gradient orientations. We briefly describe the overall
process below:
• Divide the image into small connected regions called ‘cells’, and for each cell
compute the gradient orientations for the pixels within the cell.
• Discretize the orientations in each cell into angular bins (typically
9 bins spread over 0 to 180 degrees).
• Each pixel in the cell contributes its gradient magnitude to the corresponding angular bin.
• Groups of adjacent cells are considered as spatial regions called blocks, which are
used for grouping and normalization of the cell histograms.
• The final descriptor for an image region (a detection window) is the vector of
all components of the normalized cell responses from all of the blocks in that
detection window.
In this work, while we do not extract any colour features, we concatenate the HOG
features for all the colour channels, to implicitly consider the colour information.
Examples of the representation captured by the HOG descriptors over local image
blocks are shown in Fig. 1, for two bird images. The HOG representation images
(Figs. 1b, d) depict the distribution of the dominant gradients in each image block.
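
As a minimal illustration of the per-cell binning described above, the following Python sketch computes an unsigned 9-bin orientation histogram for one cell and concatenates the normalized histograms of the colour channels. It is a simplification under stated assumptions and not the implementation used in this work: the function names are hypothetical, gradients are plain central differences, each channel is treated as a single cell, and the bin interpolation and block grouping of [21] are omitted.

```python
import math

def cell_histogram(cell, n_bins=9):
    """Unsigned-orientation histogram (0 to 180 degrees) for one cell.

    `cell` is a list of rows of pixel intensities. Gradients are central
    differences on interior pixels; border pixels are skipped.
    """
    bin_width = 180.0 / n_bins
    hist = [0.0] * n_bins
    rows, cols = len(cell), len(cell[0])
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            gx = cell[y][x + 1] - cell[y][x - 1]
            gy = cell[y + 1][x] - cell[y - 1][x]
            magnitude = math.hypot(gx, gy)
            # Fold the signed angle into [0, 180) so opposite directions share a bin.
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            # Each pixel votes with its gradient magnitude (no interpolation here).
            hist[min(int(angle // bin_width), n_bins - 1)] += magnitude
    return hist

def l2_normalize(vector, eps=1e-6):
    """L2 normalization with a small constant to avoid division by zero."""
    norm = math.sqrt(sum(v * v for v in vector) + eps * eps)
    return [v / norm for v in vector]

def descriptor(channels):
    """Concatenate the normalized histograms of all colour channels.

    Assumption for brevity: each channel is treated as one cell.
    """
    out = []
    for channel in channels:
        out.extend(l2_normalize(cell_histogram(channel)))
    return out
```

For a three-channel input, `descriptor` returns a vector of 27 values (9 bins per channel), which mirrors the channel-wise concatenation mentioned above at a much smaller scale.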

2.2 Support Vector Machine Classification

A support vector machine (SVM) is fundamentally a binary discriminative classifier
which learns a separating hyperplane. In other words, given labelled training data
(supervised learning), the algorithm yields an optimal hyperplane, in such a way that
the largest possible separation is achieved between the training samples closest to
the hyperplane. SVM can also be used for nonlinear classification by performing the
linear classification in a higher dimensional space, which is achieved via the implicit
use of kernels. Considering the argument about the simplicity of our approach, we
choose to use the linear SVM which does not use any kernels. The regularization
parameter is fine-tuned by grid search, over a specific range. In this work, we employ
the LIBSVM toolbox for SVM classification [23].
Typically, in most applications, only the final decision about the class label is used.
However, an SVM can also provide a real-valued confidence score [24] for each
test sample, which essentially represents the distance of that test sample from the
hyperplane. The higher this score, the more confidence one can have in the classified
sample belonging to its assigned class. This implies that a higher-scored test
sample better represents the characteristics of the class that it is assigned to.

Fig. 1 a, c Examples of bird images, and b, d corresponding HOG feature maps for one of the
colour channels

In this work, we employ the SVM score for the purposes of bird ROI detection.
This is because many image windows which contain some part of the bird region can
be labelled as belonging to the bird class. Hence, one needs to look for the window
that most appropriately locates the bird region. The intuition is that such an image
window should contain a large part of the bird region, and hence a large
number of features from the bird region pixels. Thus, among the various choices of
windows, it is natural to pick the one which best represents the bird features, or in other
words, the one which has the largest SVM score. In the results section, we show some
choices of selected windows to convey this point visually.
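
The distinction between the binary label and the real-valued score can be sketched as follows. This is an illustrative sketch only: the helper names are hypothetical, and it assumes a linear SVM with a weight vector `w` and bias `b` that have already been learnt. Several windows may receive a positive score (and hence the same binary label), but only one of them has the largest score.

```python
import math

def svm_score(w, b, x):
    """Signed decision value of a linear SVM: w . x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def distance_to_hyperplane(w, b, x):
    """Geometric distance: the decision value divided by the norm of w."""
    return svm_score(w, b, x) / math.sqrt(sum(wi * wi for wi in w))

def best_window(w, b, windows):
    """Return the (box, score) pair with the largest decision value.

    `windows` is a list of (box, feature_vector) pairs.
    """
    scored = [(box, svm_score(w, b, feat)) for box, feat in windows]
    return max(scored, key=lambda item: item[1])
```

Since the norm of `w` is the same for all windows of a given model, ranking by the decision value and ranking by the geometric distance select the same window.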

2.3 Multi-scale Classification

We note that the ROIs containing birds in the images can vary in size.
Hence, considering a single size of window may not fit the ROI in all the images. To
address this, we follow a multi-scale strategy for addressing the detection problem
in this work.
Such a strategy essentially involves extracting features by considering different
window sizes relative to the image. We keep the window size constant but consider
four image scales. This ensures that the HOG feature aggregation for each window
is carried out at different scales. The four scales that we choose are free parameters.
In this work, these are based on observation of the maximum and minimum areas
of the regions spanned by the bird pixels in the data set. In general, the image can
be downscaled till the point when one of the image dimensions becomes lower than
the window size (which in our case happens in most of the cases for the lowest scale
listed in Sect. 3.1).
We treat the HOG feature vectors extracted for each window, at each of the four
levels, as independent samples for the SVM classifier. Thus, for each test image, the
number of test samples for the SVM equals N × L, where N is the number of
windows and L denotes the number of scales (4, in this case).
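
The multi-scale enumeration of windows can be sketched as below. This is a rough sketch and not the code used in this work: the stride, the function names and the reading of the window size as width × height are assumptions made for illustration. A fixed-size window is slid over each rescaled image, scales at which the window no longer fits are skipped, and each window can be mapped back to the coordinates of the original image.

```python
def sliding_windows(img_w, img_h, win_w, win_h, stride):
    """Top-left corners of all windows that fit fully inside the image."""
    return [(x, y)
            for y in range(0, img_h - win_h + 1, stride)
            for x in range(0, img_w - win_w + 1, stride)]

def multiscale_windows(img_w, img_h, win_w, win_h, scales, stride):
    """Return (scale, x, y) samples in the coordinates of each rescaled image."""
    samples = []
    for s in scales:
        sw, sh = int(round(img_w * s)), int(round(img_h * s))
        if sw < win_w or sh < win_h:
            continue  # the window no longer fits at this scale
        samples.extend((s, x, y)
                       for x, y in sliding_windows(sw, sh, win_w, win_h, stride))
    return samples

def to_original_box(scale, x, y, win_w, win_h):
    """Map a window at a given scale back to an (x, y, w, h) box in the original image."""
    return (x / scale, y / scale, win_w / scale, win_h / scale)
```

Note that in this sketch the number of windows shrinks with the image at each scale, so the total number of samples is a sum over the scales rather than exactly N × L with a single N.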

2.4 Overall Approach for Bird Region Detection

Here, we provide the steps of our approach. For the training process, we use the
ground-truth bounding box annotations, which are provided with the data set that we
consider in this work, to differentiate between bird and background regions.
• Training:
– From each training image, randomly extract various image regions which
contain bird and background. Note that the bird regions may include some
background parts as well, as we do not use exact segmentations, but only bounding
box annotations.
– Extract HOG features from all bird and background subimages and create the
training data set.
– Train a linear SVM model.
• Testing:
– Given a test image, move a sliding window across the image and extract the
HOG features from each window at different positions.
– Perform the same at downscaled versions of the image. Downscale the image
till the point when one of the image dimensions becomes lower than the window
size.
– Calculate the SVM score (weight) for the HOG feature vector of each sliding window.
– To avoid very closely spaced windows in a local region getting most of the high
weights, apply non-maximum suppression on local groups of windows, and
select only a single window from each such local group.
– Pick the highest weighted window from the remaining ones, which is the final
detection region for the bird in the image.
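
The last two testing steps can be sketched as follows. The grouping criterion for non-maximum suppression is not specified above, so this sketch assumes a common greedy variant based on an intersection-over-union (IoU) threshold of 0.5; the function names and the threshold are illustrative assumptions and may differ from what was actually used.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x, y, w, h)."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    ih = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def non_max_suppression(detections, iou_threshold=0.5):
    """Greedy NMS over a list of (box, score) pairs.

    Windows are visited in decreasing score order; a window is kept only
    if it does not overlap an already kept window by more than the threshold.
    """
    kept = []
    for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
        if all(iou(box, k[0]) <= iou_threshold for k in kept):
            kept.append((box, score))
    return kept

def detect(detections, iou_threshold=0.5):
    """Final detection: the highest scored window that survives suppression."""
    kept = non_max_suppression(detections, iou_threshold)
    return kept[0] if kept else None
```

With this greedy variant the globally highest scored window always survives, so the suppression mainly affects which runner-up windows remain, for instance when the top two or three windows are displayed.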

3 Experimental Results

In this section, we first comment briefly on the data set used in this work and the
training–testing protocol. We then discuss some results of the proposed bird detection
approach.

3.1 Data set and Parameters

We validate the proposed detection algorithm on a variety of images from the well-known
Caltech-UCSD Birds image data set [7]. The data set contains images belonging
to 200 different classes of birds, against a variety of backgrounds (water, trees, relatively
plain, etc.). To validate our approach, we consider different types of birds and a variety
of such backgrounds.
In this work, we have used 20 bird classes, which include the full variety of background
scenes in the data set. We note that we manually select different bird classes which
cover enough variation in terms of their visual characteristics (e.g. very different
species of birds). As we focus on the detection problem (a two-class problem), we
believe that such a validation provides a proof of concept for the proposed approach,
as enough variety of foreground and background data is being considered. The SVM
training was carried out with 50% of the examples from each class (about 30 images in
each class), and the testing was carried out with the rest. Below, we depict and discuss
some of our results on the test images.
The sliding window size chosen in this work is 128 × 192 pixels. The four
scaling factors for the multi-scale processing are 1, 0.85, 0.72 and 0.62. The
HOG feature parameters are the same as those in [21]. As mentioned above, we train a
linear SVM model.

3.2 Results

We start with a relatively simple case where the background is rather plain (Fig. 2).
The two bounding boxes in Fig. 2a represent the two detected regions which were
assigned the top two weights by the SVM. Figure 2b depicts the bird image when
retaining only the bounding box corresponding to the highest weight. Note that the
selected bounding box indeed covers the maximum part of the bird region, while
keeping the background as low as possible. While the second best detection also
covers most of the bird pixels, it also includes some part of the background. Also,
note that the bird is correctly detected, even though one of the colours on the upper back
and the bottom of the bird is similar to the branch beneath it. However, the gradient
structure is different, which the HOG feature, arguably, seems to capture.

Fig. 2 a An example bird image with the top two weighted windows, and b the detection when
retaining the top weighted window

Figure 3 shows three examples, with somewhat varying backgrounds, each
with the top-ranked bounding box in terms of the SVM weight. In all the cases, the
blue bounding box shows the ground-truth. Figure 3b shows two red bounding boxes
which were weighted quite close. However, the one with the highest weight is the
one covering the beak of the bird. Note that the detection in Fig. 3a, b does miss
some regions around the wing, which could be because the texture of the wing
is similar to that of the background, and the mis-detected part is the one connected
with the background. However, we can still observe that the bird region is detected
well enough to extract appropriate features for a classification task. In Fig. 3c, the
scene is somewhat more challenging, and yet the detection is quite good. Indeed, the
only part missing in the detection is a small part of the tail, which is disconnected
from the main body due to the occlusion caused by the branch. Other than this, we
can observe that the detection is very close to the ground-truth, as also is the case
with the other images.
Finally, in Fig. 4, we show some more detection results, some of which involve
more complex variations. Figure 4b shows a bird with very similar texture as the
bottom half of the background. Similarly, in Fig. 4c the bird itself is dark with little
texture variation. However, in all the cases, the approach yields very encouraging
detection performance, wherein almost the complete bird is included in the bounding
box (except for a small portion of the tail in Fig. 4c). Thus, it is apparent that such a
simplistic approach is able to perform quite well, even on scene images with varying
backgrounds.

Fig. 3 a, b, c Examples of bird detection with top 2 or 3 weighted windows. The blue window
depicts the highest weighted window which is finally selected

Fig. 4 a–d Some more examples of bird detection with different background

As this is an initial study primarily targeted towards the development of a bird
recognition system, we mainly focus on achieving a performance of the detection
task in a simplistic and efficient manner, which is good enough to remove a large part
of the background without significantly affecting the bird regions. While we have
not provided quantitative comparisons, we can appreciate from the visual results that
the above-mentioned objective of the detection process is indeed well met, which is
important from a systems perspective. Also, as mentioned earlier, while we do not
experiment exhaustively on the CUB data, we consider some images of all types
of scenes/background involved in the data set and demonstrate that the approach
performs consistently on all of these scene types.

4 Conclusion

In this work, we argue that, as the ROI detection process is a primary step in FGVC,
it ought to be handled in a simplistic manner while providing a good performance
considering the variety of birds and background. Thus, we propose and demonstrate a
rather simple but effective approach for bird ROI detection in images: a framework
which involves a traditional feature representation, applied in a multi-scale manner,
and an SVM classifier, which is used to weigh the image windows for finding the one which
most appropriately locates the bird ROI. Our results clearly demonstrate that the
proposed approach is able to provide good quality bird region detections. With such
an approach able to eliminate most of the background, it can be an efficient option
to be used as a primary step for bird FGVC.

References

1. N. Kumar, P. Belhumeur, A. Biswas, D. Jacobs, W. John Kress, I. Lopez, and J. V. B. Soares,
Leafsnap: A computer vision system for automatic plant species identification, European
Conference on Computer Vision (ECCV), 2012.
2. M. Martineau, D. Conte, R. Raveaux, I. Arnault, D. Munier, and G. Venturini, A survey on
image-based insect classification, Pattern Recognition, vol. 65, 2017, pp. 273–284.
3. L. Xie, Q. Tian, R. Hong, S. Yan, and B. Zhang, Hierarchical part matching for fine-grained
visual categorization, Int. Conference on Computer Vision (ICCV), 2013.
4. N. Zhang, R. Farrell, F. Iandola, and T. Darrell, Deformable part descriptors for fine-grained
recognition and attribute prediction, IEEE Int. Conference on Computer Vision (ICCV), 2013.
5. K. Duan, D. Parikh, D. Crandall, and K. Grauman, Discovering localized attributes for fine-
grained recognition, IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
2012.
6. N. Zhang, J. Donahue, R. Girshick, and T. Darrell, Part-based R-CNNs for fine-grained category
detection, European Conference on Computer Vision (ECCV), 2014.
7. P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona, Caltech-UCSD
Birds 200. California Institute of Technology. CNS-TR-2010-001. 2010.
8. J. Krause, H. Jin, J. Yang, and F. Li, Fine-grained recognition without part annotations, IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

9. A. Angelova and A. Niculescu-Mizil, Feature combination with multi-kernel learning for fine-
grained visual classification, IEEE Winter Conference on Applications of Computer Vision
(WACV), 2014.
10. B. Yao, A. Khosla and F. Li, Combining randomization and discrimination for fine-grained
image categorization, IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
2011.
11. S. Huang, Z. Xu, D. Tao, and Y. Zhang, Part-Stacked CNN for Fine-Grained Visual Catego-
rization, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
12. X. Zhang, H. Xiong, W. Zhou, W. Lin, Q. Tian, Picking deep filter responses for fine-grained
image recognition, IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
2016.
13. A. Angelova and S. Zhu, Efficient object detection and segmentation for fine-grained recogni-
tion, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.
14. M. Das and R. Manmatha, Automatic segmentation and indexing in a database of bird images,
IEEE Int. Conference on Computer Vision (ICCV), 2001.
15. I. Lillo, J. Niebles, and A. Soto, Bird species classification based on color features, IEEE Int.
Conference on Systems, Man, and Cybernetics (SMC), 2013.
16. R. Yoshihashi, R. Kawakami, M. Iida, and T. Naemura, Evaluation of bird detection using
time-lapse images around a wind farm, EWEA 2015 Annual Event, 2015.
17. A. Takeki, T. Tuan Trinh, R. Yoshihashi, R. Kawakami, M. Iida, and T. Naemura, Detection
of small birds in large images by combining a deep detector with semantic segmentation, Int.
Conference on Image Processing (ICIP), 2016.
18. T. Lin, A. Roy Chowdhury, and S. Maji, Bilinear CNN models for fine-grained visual
recognition, International Conference on Computer Vision (ICCV), 2015.
19. J. Krause, B. Sapp, A. Howard, H. Zhou, A. Toshev, T. Duerig, J. Philbin, and F. Li, The
unreasonable effectiveness of noisy data for fine-grained recognition, European Conference
on Computer Vision (ECCV), 2016.
20. Z. Ge, A. Bewley, C. McCool, B. Upcroft, P. Corke, and C. Sanderson, Fine-grained clas-
sification via mixture of deep convolutional neural networks, IEEE Winter Conference on
Applications of Computer Vision (WACV), 2016.
21. N. Dalal and B. Triggs, Histograms of oriented gradients for human detection, IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), 2005.
22. Dlib C++ library, http://blog.dlib.net/2014/02/dlib-186-released-make-your-own-object.html
23. C. Chang and C. Lin, LIBSVM: A library for support vector machines, ACM Transactions on
Intelligent Systems and Technology, vol. 2, no. 3, 2011, pp. 1–27.
24. H. Lin, http://www.work.caltech.edu/~htlin/program/libsvm/
