Article
Weakly Supervised Fine-Grained Image Classification
via Salient Region Localization and Different Layer
Feature Fusion
Fangxiong Chen 1 , Guoheng Huang 2, * , Jiaying Lan 2 , Yanhui Wu 3 , Chi-Man Pun 4, * ,
Wing-Kuen Ling 3, * and Lianglun Cheng 2
1 School of Automation, Guangdong University of Technology, Guangzhou 510006, China;
2111604199@mail2.gdut.edu.cn
2 School of Computers, Guangdong University of Technology, Guangzhou 510006, China;
3116005834@mail2.gdut.edu.cn (J.L.); llcheng@gdut.edu.cn (L.C.)
3 School of Information Engineering, Guangdong University of Technology, Guangzhou 510006, China;
3116002239@mail2.gdut.edu.cn
4 Department of Computer and Information Science, University of Macau, Macau SAR 999078, China
* Correspondence: kevinwong@gdut.edu.cn (G.H.); cmpun@umac.mo (C.-M.P.);
yongquanling@gdut.edu.cn (W.-K.L.)

Received: 19 June 2020; Accepted: 28 June 2020; Published: 6 July 2020

Abstract: The fine-grained image classification task is about differentiating between subcategories of the same basic object class. The difficulty of the task lies in the large intra-class variance and small inter-class variance. For this reason, improving a model's accuracy on the task has heavily relied on annotations of discriminative parts and regional parts. This dependency on delicate annotations restricts the practicability of such models. To tackle this issue, this article proposes a weakly supervised fine-grained image classification model based on a saliency module. Through our salient region localization module, the proposed model can localize essential regional parts with the use of saliency maps, while only image class annotations are provided. Besides, the bilinear attention module can improve the performance of feature extraction by using higher- and lower-level layers of the network to fuse regional features with global features. On top of the bilinear attention architecture, we propose the different layer feature fusion module to improve the expressive ability of model features. We tested and verified our model on public datasets released specifically for fine-grained image classification. The results show that our proposed model achieves close to state-of-the-art classification performance on various datasets while requiring only the least amount of annotation for training. This indicates that the practicality of our model is greatly improved, since delicate annotations for fine-grained image datasets are expensive.

Keywords: fine-grained image classification; different layer feature fusion; attention model

1. Introduction
Image classification is gaining increasing attention mainly for its wide use in the Internet of Things,
self-driving cars, security, medical treatment, etc. People’s daily life has been changed by the use of
computer-based automatic classification and recognition. Nonetheless, such usage is facing growing
challenges for people who are no longer satisfied with getting coarse-grained classification results
but desire finer-grained ones. Different from general object classification, which aims to distinguish
basic-level categories, fine-grained image classification focuses on recognizing images that belong
to the same basic category but not the same class or subcategory [1,2]. For instance, in the security
domain, while monitoring vehicles passing through checkpoints, not only coarse-grained information

like the types of vehicles (SUV, sedan, truck, and so on) or brands (Volkswagen, Mercedes-Benz, BMW, and so on) is wanted, but also more accurate, fine-grained information like the exact models of vehicles (Volkswagen Sagitar 2006–2011, Volkswagen Sagitar 2012–2014, BMW 3 Series 2013–2014, and so on). Such finer-grained information greatly helps traffic law enforcement departments with case investigation, tracking hit-and-run vehicles, and identifying cloned-plate or fake vehicles. With this help, governments can maintain or even improve social stability. For this reason, fine-grained image
classification has great research value and broad application prospects.
Fine-grained image classification is an important research topic in the field of computer vision,
which aims to perform lower-level fine-grained classification upon higher-level coarse-grained
categories. The main challenge of fine-grained classification is that the differences between different
subcategories are usually subtle and local. In the related studies, it is usual to pre-process images
for the extraction of image features, like color, texture, and contour. Then, the extracted features are
used for training different models. During the test process, the same pre-processing procedure is
applied to the testing images for the extraction of features, and these features are fed into the trained
model to achieve classification results. Therefore, in the early stage, the bag-of-words method was first
proposed [3]. Traditional artificial features are fed into the model to extract the corresponding feature
vectors and obtain the final classification results. Wah et al. introduced the CUB200-2011 dataset [4]
and proposed some benchmark methods. However, their classification method for uncropped images only achieved an accuracy of 10.3%. Their model first located regional areas and then applied the bag-of-words method to encode two kinds of features, RGB histogram features and scale-invariant feature transform (SIFT) features [5]. The encoded features were fed into a support vector machine (SVM) classifier for image classification training. Such low accuracy is not satisfying. This method is limited because the regional area localization method they used is not accurate enough and because the artificially designed features are not discriminative enough. Hence, many researchers
have proposed new feature descriptors based on their research, like part-based one-vs-one features
(POOF) [6], Fisher-encoded SIFT [7], supervised kernel descriptors for visual recognition (KEDS) [8], etc.
These methods, on fine-grained image classification, can reach an accuracy from 50% up to 62% [9,10].
From these early model designs, we can see that the discriminativeness of the extracted image features and the choice of feature encoding method have a significant impact on the final classification results: the more discriminative the features and the better the encoding method, the better the classification results. It can therefore be seen that locating the regional feature areas of images plays an important part in achieving good classification results. However, annotating the regional feature areas manually is expensive, which is an obvious drawback for practical applications.
In recent years, deep neural networks have been widely used in the field of computer vision.
According to the annotations used during training, deep neural network methods are divided into
strongly supervised learning [11–15] and weakly supervised learning [16–18]. The strongly supervised
fine-grained image classification method is characterized by requiring not only image category labels but also additional manual annotations, such as key-point annotations or regional area annotations, during training. Thus, strongly supervised fine-grained image classification methods require
additional manually annotated data, which makes these methods expensive and heavily restricts their
application area. For these reasons, strongly supervised methods may not be the most appropriate
choice for the actual classification tasks. Adopting the weakly supervised method is another big trend
in the field of fine-grained image classification research. How to find the most discriminative regions
has been studied by various researchers.
To tackle this problem, this article proposes a weakly supervised fine-grained classification network based on two-level attention. Our method consists of two parts. The first is the salient region localization module, which is designed for locating discriminative regions and is trained using only image categories as labels. The second is the bilinear attention module for fusing regional and global features, which extracts regional and global features from the higher- and lower-level layers of the bilinear neural network separately. Then, we fuse these
features to improve the features’ representation capability and construct our different layer feature
fusion module. The main contributions of our study are as follows:

1. The differences between different subcategories are usually subtle and local. Hence, how to locate and distinguish these areas has become key to solving the problem. A new regional area localization method (the salient region localization module) is proposed, which can accurately locate and extract the most distinct regional areas and reduce the dependence on manual annotation information.
2. We adopt the bilinear neural network for the extraction of global features and regional features, which allows us to make better use of both kinds of features for training. The use of the bilinear neural network allows us to train our model end-to-end. Besides, using a bilinear neural network makes our model more stable.
3. Due to large intra-class variance and small inter-class variance, a different layer feature fusion module is proposed. First, we add center loss to our loss function to improve the distinction between classes. In this way, we can reduce the impact of large intra-class variance and small inter-class variance. Finally, we better guide the fine-grained image classification by combining low-level visual features and high-level semantic information.
4. Our resulting model is trained without manually annotated essential areas while reaching an accuracy of 85.1% on the CUB-200-2011 dataset. Its accuracy is better than that of most strongly supervised methods. This result shows that our model can reduce the dependence on delicate manual annotations of essential areas while maintaining competitive accuracy.

The rest of this article is organized as follows. We first review the techniques related to the two-level
attention module, applications of the saliency module in weakly supervised image classification,
and the different layer feature fusion method in Section 2. Section 3 introduces our proposed network
architectures for fine-grained image classification. To verify the effectiveness of our method, extensive
experiments are performed in Section 4. The conclusion and future works are summarized in Section 5.

2. Related Works
The key to image classification is to extract the robust features of the object and form a better
feature representation. From the relevant studies, we can find that adding a weakly supervised
method to fine-grained image classification is a big trend in recent years. The application of the
weakly supervised method is mainly for reducing dependency upon delicate manual labels, especially
manually annotated essential areas. In order to apply the fine-grained classification methods to actual
tasks, many researchers have studied how to accurately locate and distinguish salient regions under
weakly supervised conditions, and then use Convolutional Neural Network (CNN) to extract features
from these detected regions. Previous work on fine-grained classification usually focused on part
detection to establish correspondence between object instances and reduce the impact of object posture
changes under strictly supervised settings.

2.1. Two-Level Attention Model


The attention mechanism has the ability to pay attention to certain content while ignoring other
content. The ITTI model introduced the attention mechanism for the first time, where it was used
for saliency detection [19]. Bahdanau et al. employed a single-layer attention model to solve the problem of machine translation [20]. The Inception series expanded the width of CNNs to achieve adaptability
to different convolutional scales [21–23]. Xiao et al. made an initial attempt at introducing a weakly
supervised method to fine-grained image classification [24]. The two-level attention module they
proposed is capable of casting attention on two different level features, which is similar to the object-level
and part-level feature of the strongly supervised learning method. The bilinear CNN (B-CNN) model
was proposed by Lin et al. for the reduction of redundancy caused by the candidate region extraction
algorithm [25]. Similar to the two-level attention model, our model is built based on the bilinear convolutional neural network.
2.2. Saliency Module in Weakly Supervised Image Classification

Peng et al. mentioned two basic concepts [1]: one is that the collection of all feature maps for the same convolutional layer is collectively referred to as the "activation set", and the other is that an activation set can be represented by a T-dimensional vector, which is called the "descriptor". The method proposed by Peng et al. is heavily influenced by hyperparameters, which makes their model very unstable and hard to reproduce. Besides, their model is not end-to-end, which reduces its practicality. When using a convolutional neural network for training, feature maps at different depths of the convolutional layers, or feature maps at the same depth, have different responses toward the same image. Such a phenomenon is shown in Figure 1. The figure is from research produced by Selvaraju et al. [26]. Therefore, making better use of each feature map will improve the performance of the model for image classification. For this reason, we use the weighted gradient-based algorithm for class activation mapping in our method. This process is inspired by the process used in gradient class activation mapping (Grad-CAM), which was proposed by Selvaraju et al. [26], and it enables us to eliminate the influence brought by different structures of convolutional neural networks.

Figure 1. The figure above shows feature maps from different channels, generated with the Class Activation Mapping via Gradient-based Localization (Grad-CAM) method by Selvaraju et al. [26].

From the figure above, we can see that Grad-CAM can easily locate salient places of different images. Heat maps from different channels generated in this way have various focusing points. We use these heat maps to extract regional features.

2.3. Different Layer Feature Fusion

Since different layers of convolutional features describe the characteristics of objects and their surroundings from different angles, how to obtain low-level visual features while considering high-level semantic information has become a new research hotspot in the field of image processing. Hariharan et al. achieved better fine-grained segmentation, object detection, and semantic pixel segmentation by aggregating low-level features with high-level features [27–29]. Jin et al. proposed the use of a recurrent neural network to transfer high-level semantic information and low-level spatial features to each other for the analysis of scene images [30]. Based on the saliency module and the low-level attention module, and because of the large intra-class variance and small inter-class variance, this paper combines the attention features of multiple intermediate layers and delivers them layer by layer. Finally, we better guide the fine-grained image classification by combining low-level visual features and high-level semantic information.

3. Approach

The characteristics of fine-grained images are large intra-class variance and small inter-class variance. The bilinear convolutional neural network can better pay heed to the regional features of the images. Additionally, it has the capability of learning regional features, hence it is capable of representing the relationship between regional features. What is more, it is capable of end-to-end training without manual intervention.

We choose to use the bilinear CNN as our baseline feature extraction neural network. We use features from a higher level and a lower level of the network to calculate outer products, which are later used as image features. Our model is based on the weakly supervised learning method. For this reason, we can only use image class labels for training our model, without providing manually annotated essential regional areas during the training process. By doing so, our proposed model reduces the dependence on artificial annotation. The overall structure is illustrated in Figure 2. Our model is composed of three parts.

First, the salient region localization module is used for locating salient regions of the target images. The salient regions are cropped and used as the input to the first level of the bilinear CNN.

The second part is the bilinear attention module, which serves as a feature extractor. The extracted feature maps from this module are used as the parallel input of a maximum pooling layer and an average pooling layer; that is, each feature map is converted into two vectors, one containing maximum values and the other containing average values. These vectors are used as descriptor vectors.

The third part is the different layer feature fusion module, which calculates the outer product of features extracted from the higher level and lower level of the network for fusion. Then, the fused features are fed into the softmax classifier. During the training process, we construct an auxiliary mixed loss function for better integration of the regional features and global features.

Figure 2. Overview of our neural network's structure. The whole neural network is a two-level attention model. The saliency module is used for locating salient regions of the target images, then the bilinear attention module is used as our feature extractor. Finally, the derived features are used to calculate the outer product to achieve the final fused feature.
3.1. Salient Region Localization Module


Our model uses bilinear CNN to extract features. Then, a weighted gradient-based algorithm for
class activation mapping is applied on the resulting features. This process is inspired by the process
used in the Grad-CAM model, which was proposed by Selvaraju et al. [26]. This process enables our
model to eliminate the influence brought by varying structures of convolutional neural networks.
Additionally, it grants our model the ability to generate visually interpretable feature maps. It also makes our model capable of giving a score to a specific label when only one image and the target labels are fed into our bilinear CNN model, without training from the ground up or changing the original CNN model's structure. The score of the labels is obtained through the calculation of specific tasks. Among all labels, the gradient of the required label is set to 1 and the rest of the gradients are set to 0; then, the gradient is propagated back to the entire convolutional feature map. All feature maps are combined by a precise method for obtaining heat maps of the given image. The resulting heat map reveals the part that needs more attention. Finally, we apply element-wise multiplication to the heat map and the directed backpropagation result, using bilinear interpolation to up-sample to the input image's resolution. Then, we merge the backpropagation results and visualization results to obtain saliency maps, which are shown in Figure 3.

Figure 3. Saliency maps of different images.

After obtaining the generated saliency map, an adaptive maximum inter-class variance algorithm is used to obtain the threshold [31], and the threshold is used for converting the saliency feature map into a binary mask. Thereby, we can distinguish the background from the foreground and highlight the differences between these two parts of the image. We use "1" to indicate that the specific position of the provided image is foreground and "0" to indicate that the position is background. Then, we apply the eight-connected region labelling algorithm to the foreground to locate the target and label the target coordinates. The mentioned processes of the saliency module are shown in Figure 4.

Figure 4. Saliency module based on weakly supervised learning.
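To make the procedure above concrete, the following is a minimal Grad-CAM-style sketch in PyTorch. It is an illustration under our own assumptions (the feature_maps/class_score interface and the final normalization step are not taken from the authors' code), not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def grad_cam_saliency(feature_maps, class_score, output_size):
    """Hedged sketch of a weighted gradient-based class activation map.

    feature_maps: (1, C, h, w) activations of a chosen convolutional layer,
                  recorded with gradients enabled.
    class_score:  scalar logit of the target class computed from those activations.
    output_size:  (H, W) of the original image, used for bilinear up-sampling.
    """
    # Backpropagate a gradient of 1 for the target class (0 for all others).
    grads = torch.autograd.grad(class_score, feature_maps, retain_graph=True)[0]

    # Channel weights: global-average-pool the gradients over the spatial axes.
    weights = grads.mean(dim=(2, 3), keepdim=True)

    # Weighted combination of the feature maps, ReLU, then bilinear up-sampling.
    cam = F.relu((weights * feature_maps).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=output_size, mode="bilinear", align_corners=False)

    # Normalize to [0, 1] so the map can be overlaid on the original image.
    cam = cam - cam.min()
    return cam / (cam.max() + 1e-8)
```

In a full pipeline, this up-sampled map would then be combined element-wise with the backpropagation visualization, as described above.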

We locate and obtain the most distinct regional area from the input image to generate the heat map.
We choose to generate heat maps as they can be visualized directly by adding to the original image.
We use the bilinear interpolation method to generate a heat map for the original image. The heat map
and the input image have the same size. The heat map will be combined with the original image.
However, different feature maps have a different region of response on original images, and the key
regional features are found to be not salient enough after visual analysis, so we cannot localize salient
targets with them. To solve this problem, we decide to sum over the d dimension of three-dimension
tensor D, which has the size h × w × d, turning it into a two-dimension tensor B, which has the size
h × w, to better localize salient targets. The addition equation is as follows:

$$B = \sum_{i=1}^{d} A_i. \quad (1)$$

In the equation above, Ai is a feature map of the i-th channel. The fusion of multiple feature maps
through Equation (1) helps our model to enhance the feature information of salient areas, which in
turn makes it easier to locate regional salient areas more accurately. Each saliency map, having the
size of h × w, corresponds to all pixels in the h × w area. We also calculate a self-adaptive threshold, a,
using the OTSU algorithm [32]. With the derived threshold, we can turn a saliency map into a binary
map; the equation is as follows:

$$B_{x,y} = \begin{cases} 1, & A^{i}_{x,y} \geq a \\ 0, & A^{i}_{x,y} < a \end{cases} \quad (2)$$
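As an illustration of Equations (1) and (2), the following NumPy sketch sums the activation channels and binarizes the result with an Otsu-style threshold; the normalization to 8-bit values before the histogram is our own assumption, not a detail from the paper.

```python
import numpy as np

def binary_saliency_mask(feature_maps: np.ndarray) -> np.ndarray:
    """Sketch of Equations (1)-(2): sum an h x w x d activation tensor over its
    channel dimension, then binarize it with an Otsu (maximum inter-class
    variance) threshold."""
    # Equation (1): B = sum_i A_i over the d channels.
    saliency = feature_maps.sum(axis=-1)

    # Normalize to [0, 255] so a histogram-based Otsu threshold can be computed.
    saliency = saliency - saliency.min()
    saliency = (255.0 * saliency / (saliency.max() + 1e-8)).astype(np.uint8)

    # Otsu's method: choose the threshold that maximizes inter-class variance.
    hist = np.bincount(saliency.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    best_t, best_var = 0, -1.0
    for t in range(1, 256):
        w0, w1 = prob[:t].sum(), prob[t:].sum()
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (np.arange(t) * prob[:t]).sum() / w0
        mu1 = (np.arange(t, 256) * prob[t:]).sum() / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_t, best_var = t, var_between

    # Equation (2): pixels at or above the threshold are foreground (1), else 0.
    return (saliency >= best_t).astype(np.uint8)
```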
For the processed binary image B, we perform a scan, and mark all pixels’ connection areas
according to the four-neighborhood rule. It is assumed a pixel is represented by f (x, y), where the f
produces the binary value of the pixel by x, y, which stand for the pixel’s location in image. And we
assume the connectivity domain tag of pixel f (x, y) is represented by m(x, y). When scanning f (x, y),
the scanning process is already done for f (x − 1, y) and f (x, y − 1), so their marks, m(x − 1, y) and
m(x, y − 1), are already known. Hence, the connected area mark m(x, y) of the pixel f (x, y) is only
relevant to the connected area marks of pixel f (x − 1, y) and f (x, y − 1), which are m(x − 1, y) and
m(x, y − 1). The equation is as follows:



 m(x − 1, y) if f (x, y) = f (x − 1, y) and f (x, y) , f (x, y − 1)

 (x, y − 1)
m if f (x, y) , f (x − 1, y) and f (x, y) = f (x, y − 1)


m(x, y) =  . (3)



 m(x, y − 1) if f (x, y) = f (x − 1, y) and f (x, y) = f (x, y − 1)
 Newlabel if f (x, y) , f (x − 1, y) and f (x, y) , f (x, y − 1)

In the equation above, when one of the first three conditions of Equation (3) holds, the pixel takes the existing mark of the matching neighbor. In the final case of Equation (3), the pixel f(x, y) belongs to a new connection domain, so the counter is incremented (Newlabel = Newlabel + 1) and m(x, y) is set to this new connected area mark Newlabel.
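The following is a small Python sketch of the raster-scan labeling rule in Equation (3). It is a single-pass illustration under our own simplifications: it labels only foreground pixels and omits the label-equivalence merging that a complete two-pass connected-component algorithm would add.

```python
import numpy as np

def label_connected_regions(binary: np.ndarray) -> np.ndarray:
    """Raster-scan labeling per Equation (3), comparing each foreground pixel
    with its left neighbor f(x-1, y) and upper neighbor f(x, y-1)."""
    h, w = binary.shape
    marks = np.zeros((h, w), dtype=np.int32)
    new_label = 0
    for y in range(h):
        for x in range(w):
            if binary[y, x] == 0:                 # background pixels stay unlabeled
                continue
            left_same = x > 0 and binary[y, x - 1] == binary[y, x]
            up_same = y > 0 and binary[y - 1, x] == binary[y, x]
            if left_same and not up_same:
                marks[y, x] = marks[y, x - 1]     # first case of Eq. (3)
            elif up_same and not left_same:
                marks[y, x] = marks[y - 1, x]     # second case of Eq. (3)
            elif left_same and up_same:
                marks[y, x] = marks[y - 1, x]     # third case keeps the upper mark
            else:
                new_label += 1                    # new connection domain
                marks[y, x] = new_label
    return marks
```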
To better visualize the effect of the salient region localization module, we plot the bounding box
based on the resulting detected saliency regions. As shown in Figure 5, we visualize the effect of the
module on different fine-grained datasets. We can see that even for different fine-grained images,
all the regions of focus are detected accurately. Especially, most parts of the response will be in the
foreground of the target to be classified, and only a minority will be on the background of the target to
be classified.
Figure 5. Saliency localization map of different images.


3.2. Bilinear Attention Module

In this module, we adopt the general purpose bilinear neural network method, which was proposed by Lin et al. [25]. Their neural network can be mainly divided into an upper and a lower level. Each level uses the VGG neural network as a feature extractor. Images are fed into the upper and lower networks for the extraction of features. After that, the bilinear pooling function is performed on these extracted features to combine them. In the end, the combined feature is fed into the Softmax layer for classification. Our model is based on the bilinear neural network. We propose a new module, the bilinear attention module, which is shown in Figure 2. This module uses the upper-level network Net-A to extract the regional salient areas' target feature $f_{up}$, and uses the lower-level network Net-B to extract the global target feature $f_{down}$. Then, the outer product is performed on these features to get the bilinear feature $B_1$. Then, we perform the outer product again to get the bilinear feature $B_2$. The outer product $B$ is calculated with the following equation:

$$B = f_{up}^{T} \cdot f_{down}. \quad (4)$$

Feature $B_1$ is obtained by performing the dot product on $f_{up}^{7 \times 7}$ and $f_{down}^{7 \times 7}$, which are features extracted from the $7 \times 7$ convolutional layers of the upper- and lower-level networks separately. Feature $B_2$ is obtained by performing the dot product on $f_{up}^{14 \times 14}$ and $f_{down}^{14 \times 14}$, which are features extracted from the $14 \times 14$ convolutional layers of the upper- and lower-level networks separately. Since the bilinear feature $B_i$ is a three-dimensional matrix with various sizes on each dimension, we need to transform the two bilinear features into column vectors. Then, we concatenate these two resulting column vectors into a new column vector $B$ to enhance the relevance between each layer's features, so that we can fuse the regional and global features better. In the end, we feed the resulting column vector $B$ into the different layer feature fusion module for further processing.

3.3. Different Layer Feature Fusion Module

In this section, we will design a new layer fusion method to ensure that both the low-level visual features and high-level semantic information are fully utilized. We perform a simple convolution operation on each module in the network and combine them with the feature maps on the main path to perform fine-grained image classification.

The Softmax function is widely used for constructing the loss function in image classification. However, Softmax does not require intra-class compactness and inter-class separation, which is highly unsuitable for fine-grained classification. Therefore, to use the loss function to force our model to learn features with larger inter-class and smaller intra-class distances, we add the center loss function to
improve the distinction between classes. Center loss will learn the centers of each class feature and
reduce intra-class variation for each feature according to their corresponding class centers. In this way,
we can reduce the impact of large intra-class variance and small inter-class variance. The definition of
the center loss function is as follows:

$$L_C = \frac{1}{2} \sum_{i=1}^{m} \| x_i - c_{y_i} \|_2^2. \quad (5)$$

In the equation above, $c_{y_i}$ stands for the center of the $y_i$-th class features. During each iteration, only the class centers relevant to the features in the batch are updated. The classification loss consists of three parts: the loss function $P_A$ for the upper regional feature classification network, the loss function $P_B$ for the lower global feature classification network, and the fusion loss function $P$. Therefore, our loss function for the model is defined as follows:

$$L = L_P + \alpha L_{P_A} + \beta L_{P_B} + \lambda L_C = -\sum_{i=1}^{m} y'_{p_i} \log(y_{T_i}) - \alpha \sum_{i=1}^{m} y'_{p_{A_i}} \log(y_{T_i}) - \beta \sum_{i=1}^{m} y'_{p_{B_i}} \log(y_{T_i}) + \frac{\lambda}{2} \sum_{i=1}^{m} \| x_i - c_{y_i} \|_2^2 \quad (6)$$

In the equation above, $y'_{p_i}$ stands for the probability of each category produced by the main neural network; $y_{T_i}$ is the one-hot encoded vector stating each image's label; $y'_{p_{A_i}}$ and $y'_{p_{B_i}}$ are the probabilities of each category produced by the higher-level neural network $P_A$ and the lower-level one $P_B$; $c_{y_i}$ stands for the central feature of the $i$-th category; $x_i$ stands for the features of the input images; and $\alpha$ and $\beta$ stand for the weight of each module. Hyperparameters $\alpha$ and $\beta$ are chosen based on the cross-validation method, while parameter $\lambda$ is set to 1. During the experiment, we adjust the different weights to optimize the features extracted from each layer of the bilinear network, and thus to optimize the identification results of the entire model. We search the weighting constants in the ranges $\alpha \in [0.3, 0.5]$ and $\beta \in [0.5, 0.8]$, with $\lambda = 1$. With the loss function above, the regional and global features of the image can be better used, which allows us to obtain a higher classification accuracy.
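To summarize Sections 3.2 and 3.3 in code, the sketch below pools two feature maps with an outer product (Equation (4)) and combines the fused classifier loss, the two auxiliary classifier losses, and the center loss (Equations (5) and (6)). It is a simplified PyTorch illustration under our own assumptions: the class centers are treated as ordinary learnable parameters, and the center loss is averaged over the batch, which differs in scaling from the sum in Equation (5).

```python
import torch
import torch.nn.functional as F

def bilinear_pool(f_up, f_down):
    """Equation (4): outer-product pooling of two (B, C, H, W) feature maps."""
    b, c_up, h, w = f_up.shape
    c_down = f_down.shape[1]
    f_up = f_up.reshape(b, c_up, h * w)
    f_down = f_down.reshape(b, c_down, h * w)
    bilinear = torch.bmm(f_up, f_down.transpose(1, 2)) / (h * w)
    return bilinear.reshape(b, c_up * c_down)   # flattened into a column vector

class CenterLoss(torch.nn.Module):
    """Equation (5): pull each sample's feature toward its class center."""
    def __init__(self, num_classes, feat_dim):
        super().__init__()
        self.centers = torch.nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, features, labels):
        return 0.5 * ((features - self.centers[labels]) ** 2).sum(dim=1).mean()

def mixed_loss(logits, logits_a, logits_b, features, labels, center_loss,
               alpha=0.3, beta=0.6, lam=1.0):
    """Equation (6): fused loss plus two auxiliary classifiers and center loss."""
    l_p = F.cross_entropy(logits, labels)      # fused prediction P
    l_pa = F.cross_entropy(logits_a, labels)   # upper (regional) branch P_A
    l_pb = F.cross_entropy(logits_b, labels)   # lower (global) branch P_B
    return l_p + alpha * l_pa + beta * l_pb + lam * center_loss(features, labels)
```

The default weights alpha = 0.3 and beta = 0.6 follow the values reported in Section 4.3.2.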

4. Experiments
In this section, we conduct several experiments to evaluate the performance of our models on the
fine-grained image classification task. Our experiments are based primarily on public fine-grained
image datasets. First, under the same hardware and software conditions, we compared the results
derived from using different higher- and lower-level network loss functions, the fused loss function,
and the central loss function for correlation comparison. The experiments to prove the validity of the
loss function were verified in two main ways. On the one hand, by verifying the effect of an increase in
the loss term on the final classification accuracy. On the other hand, by verifying whether increasing
the effective loss term can speed up the convergence of the overall loss function and steer the overall
loss function toward the right direction for convergence. Second, we compared the results obtained by
using a single network and the bilinear network to demonstrate the effectiveness of our network. Third,
to prove the advancement of our network, we also compared our method with recent related methods.

4.1. Datasets’ Settings


The CUB-200-2011 dataset, a few sample images of which are illustrated in Figure 6, has been
extensively used in the research of fine-grained image classification [4]. The CUB-200-2011 dataset
contains 11,788 images of birds, with 200 types of birds in general. Each has a different posture,
which results in large intra-class variance and small inter-class variance. Differences between classes
are normally small and regional, such as the beak, the color of the wings, or another regional area.
The dataset not only provides classification labels for all bird image data but also provides essential
part annotations. However, our method only uses a weakly supervised method, so only image label
data was used in the model for training and testing. We used 70% of the data as a training set and 30% for testing.

Figure 6. Sample images from the CUB-200-2011 dataset.

The Stanford Dogs dataset provides image data for 120 different types of dogs [33]. There are 20,580 images in total, including different perspectives and poses. Only target frame information is provided, and key point information is excluded. The sample image data are presented in Figure 7. In the figure, two distinct dog breeds are shown. From the analysis of the pictures, the backgrounds of such a dataset are complicated, as some backgrounds are set on a sofa, grass, etc. Hence, when we used the Stanford Dogs dataset, in the data pre-processing stage, we cropped images according to the provided label box to reduce the impact of the background. Moreover, dogs of different breeds in this type of dataset have large intra-class differences. We used 70% of the data as a training set and 30% for testing.

Figure 7. Sample images from the Stanford Dogs dataset.

The FGVC-Aircraft dataset provides image data of 102 categories of aircraft [34]. Each category has more than 100 diverse images. There are 10,200 images in total, and only label box information is provided. Sample images are presented in Figure 8. We also applied image cropping to this kind of dataset during our data pre-processing stage. We cropped the image according to the label box to reduce the impact of the background. We used 80% of the data as a training set and 20% for testing.

Figure 8. Sample images from the FGVC-Aircraft dataset.

We used a subset of the CompCars dataset, which was proposed by Yang et al. [35] and contains 300,000 images of 500 categories of vehicles. We used 15 categories of vehicle type, 55 categories of vehicle brands, and 250 types of vehicle models. Each type of vehicle has approximately 300 images, covering rainy days, nights, foggy days, and different angles of view. We used 70% of the data as a training set and 30% for testing. The visualization of the CompCars dataset is shown in Figure 9.

Figure 9. Sample images from the CompCars dataset.

4.2. Data Pre-Processing

In general, whether the data can be pre-processed effectively affects the final effect of the model to a certain extent. For the case of only a few fine-grained image samples being available, we pre-processed all available images (denoising, dimension reduction, normalization, standardization, etc.) and applied data expansion to avoid over-fitting.

4.2.1. Scale Cropping

Different fine-grained image datasets have different image sizes. However, the presence of the Region of Interest (ROI) pooling layer allows any size image to be fed into the deep neural network. Inspired by the idea of transfer learning, we used the Inception v3 model that was pre-trained on ImageNet data. The input image needs to be cropped to the image size that Inception v3 requires for input, which is 299 × 299 × 3. To a certain extent, this makes it possible to reduce the amount of data used for training, and the fixed size of the image allows the convolutional neural network to better extract the characteristic information from it.
4.2.2. Data Augmentation
To improve the classification accuracy and prevent overfitting, considering the huge amount
of network parameters, we need to adopt data augmentation for an increasing amount of data.
In our experiment, we used several methods to augment data from fine-grained image datasets,
making the number of training samples for each category relatively balanced. The methods we used
included randomly flipping and distorting images, randomly cropping images, randomly adding
noise, randomly modifying the contrast and saturation of images, etc.
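As a concrete illustration of this pre-processing, the snippet below builds a torchvision augmentation pipeline; the specific parameter values (crop scale, jitter strengths, normalization statistics) are our own assumptions rather than the paper's settings.

```python
import torchvision.transforms as T

train_transforms = T.Compose([
    T.RandomResizedCrop(299, scale=(0.8, 1.0)),   # random cropping to the Inception v3 input size
    T.RandomHorizontalFlip(p=0.5),                # random flipping
    T.ColorJitter(contrast=0.2, saturation=0.2),  # random contrast/saturation changes
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
```

Random noise injection, mentioned above, is not a built-in torchvision transform and would be added as a custom transform.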

4.3. Comparison of Different Experiments

4.3.1. Evaluation Index


For the mentioned datasets, to better compare the performance of different algorithm models,
we used the classification accuracy as the evaluation index, and it is defined as follows:

$$accuracy = \frac{n_t}{n}, \quad (7)$$
where n stands for the total number of test samples and nt stands for the number of images predicted
correctly. Such an evaluation index can more intuitively reflect the classification performance of
the models.
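A direct implementation of Equation (7) is a one-liner; the sketch below assumes the predicted and ground-truth labels are available as NumPy arrays.

```python
import numpy as np

def classification_accuracy(predictions, labels):
    """Equation (7): n_t / n, the fraction of test images predicted correctly."""
    return float(np.mean(predictions == labels))
```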

4.3.2. Comparative Experiment of Different Loss Functions


To confirm the validity of the loss functions, we designed the comparative experiment using the
CUB-200-2001 dataset and compared the changes of the loss values of the functions when the number
of iterations increased. The experimental results are shown in Figure 10, where the behavior of the
different loss function is reported. The green curve refers to only the first term LP of the mixed loss
function proposed in this paper. The blue curve represents the addition of the auxiliary function LPA
to the upper-layer network. It can adjust the upper-layer network to make it more focused on the
regional feature information while enabling the model to converge faster and lower the overall loss
value. The pink curve indicates the addition of the auxiliary classifier term LPB in the lower-layer
network, which allows the lower-layer network to adjust its extracted global features. Since the global
features extracted by the lower-layer network have more abundant characteristic information than
the regional features, the loss value is decreased more, and the classification accuracy is increased.
The remaining curve, the orange one, is the variation curve of the mixed loss function proposed by
this paper. The loss value of the model in the training process is close to 0. At the same time, in this
curve, we can see a significant downward trend in the loss values and a significant increase in the rate
of convergence. In addition to adding two auxiliary classifiers, the central loss function is added to
reduce the intra-class variance and increase the inter-class variance. Additionally, the results in Table 1
show that the accuracy of the identification can be effectively improved.

Figure 10. Loss values of the different loss functions.


Table 1. Accuracy of different loss functions.

Loss Functions CUB200-2011 (Accuracy)


LP 81.12%
LP + αLPA 82.43%
LP + αLPA + βLPB 83.81%
LP + αLPA + βLPB + CentralLoss 84.12%

In the test partition of the CUB200-2011 dataset, there were approximately 20 images for each
category. We compared obtained accuracies of the model using different loss functions, and chose
to set α = 0.3 and β = 0.6. The results are shown in Table 1. The accuracy of the model proposed
in this paper reached 84.12% on this public dataset. This result is better than those of some strongly supervised learning methods and some weakly supervised learning methods mentioned in the related works section.

4.3.3. Classification Results of Different Network Structure


The basic network structure of the model proposed in this paper is based on Inception v3. For this
reason, in our experiments, we used the parameters of the Inception v3 model pre-trained on the
ILSVRC2012 dataset to initialize our model's parameters. On the CUB200-2011 dataset, we compared
against single-network baselines for fine-grained image classification, such as Inception v3 and
DenseNet, and, as the bilinear model, against B-CNN proposed by Lin et al. [25]. The experimental
results are shown in Figure 11. Although a single network structure can improve the accuracy of image
classification to some extent when the depth of the network is increased, its performance is still weaker
than that of the bilinear model. Hence, we conclude that the bilinear deep neural network makes better
use of the relationship between regional features and global features. At the same time, our method
runs with only 5M parameters, while achieving a classification speed of 48 frames per second. Overall,
our proposed method obtained a better performance than B-CNN, reaching a classification accuracy of 85.1%.
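To make the bilinear idea concrete, the sketch below fuses two convolutional feature maps by an outer product followed by the usual signed-square-root and L2 normalization, roughly in the spirit of B-CNN [25]; the tensor shapes in the example are placeholders rather than the exact layers tapped in our network.

```python
import torch
import torch.nn.functional as F

def bilinear_pool(feat_a, feat_b):
    """Outer-product fusion of two feature maps of shape (B, C_a, H, W) and (B, C_b, H, W)."""
    b, c_a, h, w = feat_a.shape
    c_b = feat_b.shape[1]
    fa = feat_a.reshape(b, c_a, h * w)                    # (B, C_a, H*W)
    fb = feat_b.reshape(b, c_b, h * w)                    # (B, C_b, H*W)
    x = torch.bmm(fa, fb.transpose(1, 2)) / (h * w)       # per-image outer products, averaged over locations
    x = x.reshape(b, c_a * c_b)
    x = torch.sign(x) * torch.sqrt(torch.abs(x) + 1e-12)  # signed square-root normalization
    return F.normalize(x, dim=1)                          # L2 normalization

# Example: fuse a higher-level "regional" map with a lower-level "global" map
# that has been pooled to the same spatial size (shapes here are illustrative):
# fused = bilinear_pool(torch.randn(2, 512, 17, 17), torch.randn(2, 256, 17, 17))
```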

Figure 11. Accuracies of different networks on the CUB200-2011 dataset.
4.3.4. Comparison of State-of-the-Art Algorithms

We intended to prove that our model is more versatile and advanced across different fine-grained
datasets. Therefore, we compared our method on the CUB-200-2011, Stanford Dogs, and FGVC-Aircraft
datasets with current state-of-the-art methods. Considering that the existing methods show significant
differences in performance across datasets, we used the classification results on the corresponding
datasets reported in the relevant papers for our comparison. The comparison results are shown in
Tables 2–4.

Table 2. Classification accuracy on the FGVC-Aircraft dataset.

Model Accuracy
B-CNN [25] 83.76%
OPAM [1] 84.01%
Multi-scale Granularity [17] 81.71%
Ours 84.23%

Table 3. Classification accuracy on the Stanford Dogs dataset.

Model Accuracy
PD [36] 72.0%
SCDA [37] 78.8%
B-CNN [25] 81.1%
PIR [38] 80.4%
Ours 80.6%

Table 4. Classification accuracy on the CUB 200-2011 dataset.

Model                            Label Box    Essential Points    Accuracy
Part-Based R-CNN [13]                √                √            73.5%
Pose Normalized CNN [15]             √                √            75.7%
Deep LAC [14]                        √                √            84.1%
PG Alignment [39]                                                   82.8%
VGG-BGLm [40]                                                       80.4%
PIR [38]                                                            79.3%
SCDA [37]                                                           80.5%
B-CNN [25]                                                          84.1%
OPAM [1]                                                            85.83%
Multi-scale Granularity [17]                                        82.5%
PD [36]                                                             84.6%
Ours                                                                85.1%

Because the CUB200-2011 dataset provides essential point data, when we compared the performance
on this dataset, we also included strongly supervised methods in the comparison. Table 5 shows the
classification labels assigned by several advanced methods to some of the images in the CUB 200-2011
dataset; incorrect classification results are marked with an asterisk. As we can see from these typical
test images, even without injecting manually supervised information, our classification remains accurate
for images with small inter-class differences and large intra-class differences, and there are fewer
classification failures caused by differences in perspective and background.
From the tables above, we can see that our method performs well on the Stanford Dogs and FGVC-Aircraft
datasets; it also reaches an accuracy of 85.1% on the CUB-200-2011 dataset, which is better than some
strongly supervised algorithms, indicating that weakly supervised methods can reduce the dependence
on manual data labelling and improve the practicability of the algorithm while maintaining a certain
accuracy. Our accuracy is higher than that of OPAM proposed by Peng et al. [1] on the FGVC-Aircraft
dataset. On the CUB 200-2011 dataset, our accuracy is very similar to that of the OPAM method. However,
OPAM runs with roughly 35M parameters, which is seven times the number we used, and only achieves a
classification speed of 4 frames per second. We reduced the number of parameters while increasing the
detection speed and maintaining classification accuracy.
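For reference, parameter counts and frame rates of the kind quoted above can be estimated as sketched below; the model object, input resolution, and timing loop are placeholders rather than the exact measurement protocol we used.

```python
import time
import torch

def count_parameters(model):
    """Number of trainable parameters (the figures above are quoted in millions)."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

@torch.no_grad()
def frames_per_second(model, input_size=(1, 3, 299, 299), n_runs=50):
    """Rough single-image throughput estimate for a classification network."""
    model.eval()
    x = torch.randn(*input_size)
    for _ in range(5):                 # warm-up forward passes
        model(x)
    start = time.time()
    for _ in range(n_runs):
        model(x)
    return n_runs / (time.time() - start)
```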
Table 5. Test image results on the CUB 200-2011 dataset; misclassified predictions are marked with an asterisk.

Ground truth                    Glaucous_Winged_Gull    Gray_Kingbird    Pine_Grosbeak     Tropical_Kingbird
Input image                     (test images omitted)
Deep LAC [14]                   Glaucous_Winged_Gull    Gray_Kingbird    Pine_Grosbeak     Tropical_Kingbird
B-CNN [25]                      Heermann_Gull*          Gray_Kingbird    Gray_Kingbird*    Gray_Kingbird*
Multi-scale Granularity [17]    Glaucous_Winged_Gull    Gray_Kingbird    Pine_Grosbeak     Dark_eyed_Junco*
PD [36]                         Forsters_Tern*          Gray_Kingbird    Pine_Grosbeak     Tropical_Kingbird
Ours                            Glaucous_Winged_Gull    Gray_Kingbird    Pine_Grosbeak     Tropical_Kingbird
To verify that the proposed model performs well on the challenges faced in the project, we also tested
our model on the CompCars dataset.

To compare the classification results at different levels of the vehicle hierarchy, the existing vehicle
labels were divided into three hierarchical labels according to the vehicle hierarchy division method.
Since the dataset we used is not a public dataset, we reproduced some related fine-grained image
classification algorithms on our dataset. Through the experimental comparison, it can be seen that for
the coarse vehicle type labels, every method performs well; the accuracy of the model proposed in this
paper reaches 98.35%, which is close to the state-of-the-art one. When the image label is configured
as the vehicle brand, the classification accuracy decreases as the number of categories to be classified
increases, while the proposed model shows better stability, with only a small decrease in accuracy.
Under the third level of 250 types of vehicle model labels, all algorithms show a significant decrease
in accuracy. It can be seen from Table 6 that the classification accuracy of the proposed model reached
90.56%, which is close to the latest classification accuracy reported by Fang et al. [41], indicating
that the model proposed in this paper has certain superiority and practicability. At the same time,
the evaluation and metrics for the model in Fang et al. were optimized for the CompCars data set only.
In contrast, in the experiments above we compared results on multiple datasets to make sure our model
is not optimized for a specific dataset. Besides, we aimed to show that our proposed method retains
the advantages of end-to-end training and testing by adopting bilinear neural networks. Thus, the
generality and superiority of our proposed algorithm are shown in the test results from multiple datasets.

Table 6. Different levels of vehicle label recognition results.

Model Vehicle Type (%) Vehicle Brand (%) Vehicle Model (%)
Inception v3 98.12 90.32 80.45
B-CNN [25] 98.57 95.71 90.31
Zhang et al. [42] 93.87 86.34 72.89
Hsieh et al. [43] 90.23 84.21 70.49
Fang et al. [41] 98.41 95.22 91.52
Ours 98.35 95.82 90.46
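As an illustration of this three-level evaluation, the sketch below derives the coarser labels from a fine-grained vehicle model label through a lookup table and scores each level separately with the accuracy of Equation (7); the label strings and the hierarchy map are made-up examples, not entries of the dataset.

```python
# Hypothetical hierarchy map: fine-grained vehicle model -> (vehicle type, vehicle brand).
HIERARCHY = {
    "A4L": ("sedan", "Audi"),
    "Q5": ("SUV", "Audi"),
    "Golf": ("hatchback", "Volkswagen"),
}

def expand_labels(model_label):
    """Return the (type, brand, model) triple for one fine-grained model label."""
    vehicle_type, brand = HIERARCHY[model_label]
    return vehicle_type, brand, model_label

def per_level_accuracy(pred_models, true_models):
    """Classification accuracy at each of the three vehicle hierarchy levels."""
    levels = {"type": 0, "brand": 1, "model": 2}
    correct = {name: 0 for name in levels}
    n = len(true_models)
    for pred, true in zip(pred_models, true_models):
        p3, t3 = expand_labels(pred), expand_labels(true)
        for name, i in levels.items():
            correct[name] += int(p3[i] == t3[i])
    return {name: correct[name] / n for name in levels}

# e.g. per_level_accuracy(["A4L", "Golf"], ["Q5", "Golf"])
# -> {'type': 0.5, 'brand': 1.0, 'model': 0.5}
```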

5. Conclusions
Our work aimed to address the small inter-class variance and large intra-class variance characteristic
of fine-grained image data, as well as the dependence on annotation labels. Based on this study, we
proposed a new weakly supervised method for fine-grained image classification built around a saliency
module. The salient region localization module first extracts salient regional area information. Then,
the information is fed into the bilinear attention module. The higher-level layer of the bilinear neural
network is used for extracting the regional features, while the lower-level one is used for extracting
the global features. Fused features are obtained by calculating the outer product of the features
acquired from the higher- and lower-level layers, and are utilized to construct the auxiliary hierarchical
mixed loss function. The different layer feature fusion module allows the neural network to better fuse
regional features and global features. The experimental results show that our model achieves strong
classification results on various datasets, which demonstrates our model's robustness.
In the future, we will mainly focus on improving the classification accuracy when there are hundreds
or even thousands of categories to predict, to realize end-to-end classification. Additionally, the
method proposed in this paper is based on weakly supervised learning, which allows our model to
accurately locate and extract the most distinctive regional areas and reduces the dependence on manual
annotation information.

Author Contributions: Methodology, F.C., and G.H.; Writing—original draft, F.C., J.L., and Y.W.; Writing—review
and editing, J.L., G.H., C.-M.P., and W.-K.L.; Supervision, C.-M.P., and W.-K.L.; Funding acquisition, L.C.
All authors have read and agreed to the published version of the manuscript.
Funding: This work was supported in part by the National Natural Science Foundation of China under Grant 61702111,
the National Nature Science Foundation of China-Guangdong Joint Fund under Grant 83-Y40G33-9001-18/20,
the National Key Research and Development Program of China under Grant 2017YFB1201203, the Guangdong
Provincial Key Laboratory of Cyber-Physical System under Grant 2016B030301008, the National Natural Science
Foundation of Guangdong Joint Fund under Grant U1801263, the National Natural Science Foundation of
Guangdong Joint Fund under Grant U1701262, the Guangdong R&D plan projects in key areas under Grant
2019B010153002, the Guangdong R&D plan projects in key areas under Grant 2018B010109007, the “Blue Fire
Plan” (Huizhou) Industry-University-Research Joint Innovation Fund 2017 Project of the Ministry of Education
under Grant CXZJHZ201730, and the Guangdong R&D plan projects in key areas under Grant 2019B010109001.
Conflicts of Interest: The authors declare no conflict of interest.

References
1. Peng, Y.; He, X.; Zhao, J. Object-Part Attention Model for Fine-Grained Image Classification. IEEE Trans.
Image Process. 2018, 27, 1487–1500. [CrossRef] [PubMed]
2. Luo, J.-H.; Wu, J.-X. A survey on fine-grained image categorization using deep convolutional features.
Acta Autom. Sin. 2017, 43, 1306–1318.
3. Zhang, L.-B.; Wang, C.-H.; Xiao, B.-H.; Shao, Y.-X. Image Representation Using Bag-of-phrases. Acta Autom. Sin.
2012, 38, 46–54. [CrossRef]
4. Wah, C.; Branson, S.; Welinder, P.; Perona, P.; Belongie, S. The Caltech-UCSD Birds-200-2011 Dataset;
CNS-TR-2011-001; California Institute of Technology: Pasadena, CA, USA, 2011.
5. Abdel-Hakim, A.E.; Farag, A.A. CSIFT: A SIFT Descriptor with Color Invariant Characteristics. In Proceedings
of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06),
New York, NY, USA, 17–22 June 2006; pp. 1978–1983.
6. Berg, T.; Belhumeur, P. Poof: Part-based one-vs.-one features for fine-grained categorization, face verification,
and attribute estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
Portland, OR, USA, 23–28 June 2013; pp. 955–962.
7. Perronnin, F.; Sánchez, J.; Mensink, T. Improving the Fisher Kernel for Large-Scale Image Classification.
In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2010; pp. 143–156.
8. Wang, P.; Wang, J.; Zeng, G.; Xu, W.; Zha, H.; Li, S. Supervised kernel descriptors for visual recognition.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA,
23–28 June 2013; pp. 2858–2865.
9. Branson, S.; Van Horn, G.; Wah, C.; Perona, P.; Belongie, S. The Ignorant Led by the Blind: A Hybrid
Human–Machine Vision System for Fine-Grained Categorization. Int. J. Comput. Vis. 2014, 108, 3–29.
[CrossRef]
10. Chai, Y.; Lempitsky, V.; Zisserman, A. Symbiotic segmentation and part localization for fine-grained
categorization. In Proceedings of the IEEE International Conference on Computer Vision, Sydney, Australia,
1–8 December 2013; pp. 321–328.
11. Deng, J.; Krause, J.; Fei-Fei, L. Fine-grained crowdsourcing for fine-grained recognition. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June
2013; pp. 580–587.

12. Xu, Z.; Huang, S.; Zhang, Y.; Tao, D. Augmenting strong supervision using web data for fine-grained
categorization. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile,
7–13 December 2015; pp. 2524–2532.
13. Zhang, N.; Donahue, J.; Girshick, R.; Darrell, T. Part-based R-CNNs for fine-grained category detection.
In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September
2014; pp. 834–849.
14. Lin, D.; Shen, X.; Lu, C.; Jia, J. Deep lac: Deep localization, alignment and classification for fine-grained
recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston,
MA, USA, 7–12 June 2015; pp. 1666–1674.
15. Branson, S.; Van Horn, G.; Belongie, S.; Perona, P. Bird Species Categorization Using Pose Normalized Deep
Convolutional Nets. In Proceedings of the BMVC 2014—British Machine Vision Conference, Nottingham,
UK, 1–5 September 2014.
16. Simon, M.; Rodner, E. Neural activation constellations: Unsupervised part model discovery with
convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Santiago,
Chile, 7–13 December 2015; pp. 1143–1151.
17. Wang, D.; Shen, Z.; Shao, J.; Zhang, W.; Xue, X.; Zhang, Z. Multiple granularity descriptors for fine-grained
categorization. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile,
7–13 December 2015; pp. 2399–2406.
18. Jaderberg, M.; Simonyan, K.; Zisserman, A. Spatial Transformer Networks, Advances in Neural information
Processing Systems; MIT Press: Montreal, QC, Canada, 2015; pp. 2017–2025.
19. Mnih, V.; Heess, N.; Graves, A. Recurrent Models of Visual Attention, Advances in Neural Information Processing
Systems; MIT Press: Montreal, QC, Canada, 2014; pp. 2204–2212.
20. Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv
2014, arXiv:1409.0473.
21. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A.
Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9.
22. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the inception architecture for computer
vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV,
USA, 26 June–1 July 2016; pp. 2818–2826.
23. Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A.A. Inception-V4, Inception-Resnet and the Impact of Residual
Connections on Learning. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence,
San Francisco, CA, USA, 4–9 February 2017.
24. Xiao, T.; Xu, Y.; Yang, K.; Zhang, J.; Peng, Y.; Zhang, Z. The application of two-level attention models in deep
convolutional neural network for fine-grained image classification. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 842–850.
25. Lin, T.-Y.; RoyChowdhury, A.; Maji, S. Bilinear cnn models for fine-grained visual recognition. In Proceedings
of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1449–1457.
26. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-cam: Visual explanations
from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on
Computer Vision, Venice, Italy, 22–29 October 2017; pp. 618–626.
27. Hariharan, B.; Arbeláez, P.; Girshick, R.; Malik, J. Hypercolumns for object segmentation and fine-grained
localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston,
MA, USA, 7–12 June 2015; pp. 447–456.
28. Badrinarayanan, V.; Kendall, A.; Cipolla, R. Segnet: A deep convolutional encoder-decoder architecture for
image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [CrossRef] [PubMed]
29. Zhang, P.; Wang, D.; Lu, H.; Wang, H.; Ruan, X. Amulet: Aggregating multi-level convolutional features for
salient object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice,
Italy, 22–29 October 2017; pp. 202–211.
30. Jin, X.; Chen, Y.; Jie, Z.; Feng, J.; Yan, S. Multi-path feedback recurrent neural networks for scene parsing.
In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9
February 2017.

31. Uijlings, J.R.; Van De Sande, K.E.; Gevers, T.; Smeulders, A.W. Selective Search for Object Recognition. Int. J.
Comput. Vis. 2013, 104, 154–171. [CrossRef]
32. Otsu, N. A Threshold Selection Method from Gray-Level Histograms. IEEE Trans. Syst. Man Cybern. 1979,
9, 62–66. [CrossRef]
33. Khosla, A.; Jayadevaprakash, N.; Yao, B.; Li, F.-F. Novel dataset for fine-grained image categorization:
Stanford dogs. In Proceedings of the CVPR Workshop on Fine-Grained Visual Categorization (FGVC); IEEE:
Colorado Springs, CO, USA, 2011.
34. Maji, S.; Rahtu, E.; Kannala, J.; Blaschko, M.; Vedaldi, A. Fine-grained visual classification of aircraft. arXiv
2013, arXiv:1306.5151.
35. Yang, L.; Luo, P.; Change Loy, C.; Tang, X. A large-scale car dataset for fine-grained categorization and
verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston,
MA, USA, 7–12 June 2015; pp. 3973–3981.
36. Zhang, X.; Xiong, H.; Zhou, W.; Lin, W.; Tian, Q. Picking deep filter responses for fine-grained image
recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas,
NV, USA, 26 June–1 July 2016; pp. 1134–1142.
37. Wei, X.-S.; Luo, J.-H.; Wu, J.; Zhou, Z.-H. Selective convolutional descriptor aggregation for fine-grained
image retrieval. IEEE Trans. Image Process. 2017, 26, 2868–2881. [CrossRef] [PubMed]
38. Zhang, Y.; Wei, X.-S.; Wu, J.; Cai, J.; Luo, Z.; Nguyen, V.-A.; Do, M. Weakly Supervised Fine-Grained
Categorization With Part-Based Image Representation. IEEE Trans. Image Process. A Publ. IEEE Signal
Process. Soc. 2016, 25, 1713–1725. [CrossRef] [PubMed]
39. Krause, J.; Jin, H.; Yang, J.; Fei-Fei, L. Fine-grained recognition without part annotations. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June
2015; pp. 5546–5555.
40. Zhou, F.; Lin, Y. Fine-grained image classification by exploring bipartite-graph labels. In Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July
2016; pp. 1124–1133.
41. Fang, J.; Zhou, Y.; Yu, Y.; Du, S. Fine-Grained Vehicle Model Recognition Using A Coarse-to-Fine Convolutional
Neural Network Architecture. IEEE Trans. Intell. Transp. Syst. 2016, 18, 1782–1792. [CrossRef]
42. Zhang, B. Reliable classification of vehicle types based on cascade classifier ensembles. IEEE Trans. Intell.
Transp. Syst. 2012, 14, 322–332. [CrossRef]
43. Hsieh, J.-W.; Chen, L.-C.; Chen, D.-Y. Symmetrical SURF and its applications to vehicle detection and vehicle
make and model recognition. IEEE Trans. Intell. Transp. Syst. 2014, 15, 6–20. [CrossRef]

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
