
Fine-Grained Visual Classification via Internal Ensemble Learning Transformer

Qin Xu, Jiahui Wang, Bo Jiang, and Bin Luo, Senior Member, IEEE

This work was supported by the National Natural Science Foundation of China under Grant 61860206004 and Grant 72071001, by the Natural Science Foundation of Anhui Province under Grant 2108085Y23, by the Natural Science Foundation for the Higher Education Institutions of Anhui Province under Grant KJ2021A0038, and by the University Synergy Innovation Program of Anhui Province under Grant GXXT-2020-013 and Grant GXXT-2022-032. (Corresponding author: Bo Jiang)

Q. Xu, J. Wang and B. Luo are with the Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education, Anhui Provincial Key Laboratory of Multimodal Cognitive Computation, School of Computer Science and Technology, Anhui University, Hefei 230601, China (E-mail: xuqin@ahu.edu.cn; e21301179@stu.ahu.edu.cn; luobin@ahu.edu.cn). B. Jiang is with the Information Materials and Intelligent Sensing Laboratory of Anhui Province, School of Computer Science and Technology, Anhui University, and also with the Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei, China (E-mail: zeyiabc@163.com).

Abstract—Recently, vision transformers (ViTs) have been investigated in fine-grained visual classification (FGVC) and are now considered state of the art. However, most ViT-based works ignore the different learning performances of the heads in the multi-head self-attention (MHSA) mechanism and of its layers. To address these issues, in this paper, we propose a novel internal ensemble learning transformer (IELT) for FGVC. The proposed IELT involves three main modules: the multi-head voting (MHV) module, the cross-layer refinement (CLR) module, and the dynamic selection (DS) module. To solve the problem of the inconsistent performances of multiple heads, we propose the MHV module, which considers all of the heads in each layer as weak learners and votes for tokens of discriminative regions as the cross-layer feature based on the attention maps and spatial relationships. To effectively mine the cross-layer feature and suppress the noise, the CLR module is proposed, where the refined feature is extracted and the assist logits operation is developed for the final prediction. In addition, a newly designed DS module adjusts the token selection number at each layer by weighting their contributions to the refined feature. In this way, the idea of ensemble learning is combined with the ViT to improve fine-grained feature representation. The experiments demonstrate that our method achieves competitive results compared with the state of the art on five popular FGVC datasets. Source code has been released and can be found at https://github.com/mobulan/IELT.

Index Terms—fine-grained visual classification; vision transformer; ensemble learning; multi-head self-attention

Fig. 1. Three similar classes in CUB-200-2011 (Common Tern, Forster's Tern, and Arctic Tern) demonstrate the main challenges of fine-grained visual classification. The rows illustrate the large differences within the same category, and the columns demonstrate the small differences between different classes.

I. INTRODUCTION

FINE-GRAINED visual classification (FGVC), which aims to distinguish various subclasses within a metaclass, has a wide range of applications, such as retail product recognition [1], smart transportation [2], and biodiversity conservation [3]–[7]. Compared with the conventional image classification problem [8], [9], there are many challenges in addressing FGVC, which can be roughly characterized by three aspects. (1) The intra-class visual differences are usually large. As intuitively shown in Figure 1, the three birds of the same class in each row look highly different due to their different postures. (2) Many samples in different classes have subtle variances and are hard to discriminate, as illustrated in Figure 1. The three birds in each column look very similar, but they belong to three different classes (the Common Tern, the Forster's Tern, and the Arctic Tern), respectively. (3) Due to (1) and (2), the annotations of fine-grained images require much expert knowledge and thus a large cost.

To address these challenges, many deep learning methods [9]–[45] have been proposed and can be roughly divided into two categories, i.e., part locating methods and feature encoding methods. Part locating methods aim to find subtle differences between input images by locating the bounding boxes of discriminative parts [11]–[17]. Early part locating works [11], [12] usually obtain the discriminative regions based on the annotation of bounding boxes or parts. These methods generally require expensive manual annotations that limit their applications in many scenes. To overcome this limitation, recent part locating methods [13]–[16] employ weakly supervised object detection techniques to obtain part bounding box information. For example, Yang et al. [13] proposed to find discriminative regions by using the region proposal network [46] and feeding the discriminative regions to several sub-networks for local feature extraction. However, these methods require a bounding box generating module, which increases the network complexity.


In order to find the parts of an object automatically, an attention mechanism is introduced to locate the discriminative regions without additional annotation [18]–[23]. For example, Rao et al. [18] presented counterfactual attention to encourage the network to learn more useful attention information for feature representation. The attention-based methods also demonstrate the advantages of exploring the relationships between feature channels [19]–[21] in finding discriminative parts. The other category is the feature encoding methods, which aim to learn rich features through a feature fusion strategy [24]–[35]. For example, Xu et al. [24] proposed to mix the discriminative regions of samples from different categories together with their labels so that the network can better discover the common features and the specific features of different categories.

Recently, vision transformers (ViTs) [9] have been shown to be promising in many computer vision tasks. It is known that the core of a ViT is its multi-head self-attention (MHSA) mechanism, which calculates the interrelationships of the input patches to capture a global feature representation. ViTs can capture global features but lack the ability to learn local features, which limits their application to FGVC tasks. Therefore, recent works [36]–[41] have attempted to enhance the regular ViT for capturing local features. He et al. [36] proposed to select more discriminative image patches to improve the ability to extract local features for classification. Wang et al. [40] proposed to fuse the features of different transformer layers to improve the feature representation ability of the network. However, the abilities to estimate discriminative regions differ among the layers and the MHSA heads of ViTs [47], [48]. Additionally, the discriminative regions are spatially close, while the noise regions are scattered. Most existing ViT-based methods ignore the different learning performances of heads and layers and the spatial relationships among different attention maps.

To address this issue, in this work, we propose a novel internal ensemble learning transformer (IELT) for FGVC. The ensemble learning strategy can solve the problem of unbalanced learning performances [49]. In the IELT, all attention heads of the MHSA in each layer are considered weak learners in terms of selecting valuable tokens from each layer, and all transformer layers are considered weak learners in terms of contributing more valuable cross-layer features.

First, we propose the multi-head voting (MHV) module, where each head of the MHSA in each transformer layer votes for several tokens with the highest response in the attention map, similar to the bagging algorithm [50] in ensemble learning. The voting results of all heads are assembled to obtain a score map. To suppress the scattered noise, a convolution operation is conducted, which enhances the concentrated discriminative regions and yields an enhanced score map. According to the enhanced score map, several tokens are selected from the output of each transformer layer as cross-layer tokens.

Second, existing works [23], [40], [51] have revealed that cross-layer feature fusion can improve the feature representation ability. However, it is still a challenge to deal with the noise introduced by the cross-layer feature. To make full use of the cross-layer feature and further reduce the influence of noise, we propose a cross-layer refinement (CLR) module. In the CLR module, the cross-layer tokens selected by all previous layers are concatenated to extract the cross-layer feature. Then, an additional selection is conducted on the cross-layer tokens to obtain refined tokens and a refined feature. We use the cross-layer feature to guide the refined feature and design the assist logits operation for prediction. In addition, according to the proportion of refined tokens in the CLR module, we attempt to obtain the contribution of each layer to the refined feature. Inspired by the boosting algorithm [52] in ensemble learning, we propose a dynamic selection (DS) module to update the selection number of the MHV module at each layer. This can reduce the effect of poorly performing layers and improve the quality of the cross-layer feature.

Overall, we summarize the main contributions of our work as follows:

1) We propose a novel internal ensemble learning transformer (IELT) that introduces ensemble learning into a ViT to solve the problem of inconsistent learning performances among different MHSA heads and transformer layers in ViTs.
2) We develop an effective multi-head voting (MHV) module that allows all MHSA heads in each layer to select the desired tokens as the cross-layer feature based on their attention maps and spatial relationships.
3) We propose a cross-layer refinement (CLR) module to fuse the cross-layer feature and further extract the refined feature to improve the feature representation ability.
4) A dynamic selection (DS) module is designed to dynamically update the selection number of tokens in the MHV module of each layer based on their importance.

The rest of this paper is organized as follows. Section II briefly reviews the related works. Section III describes the proposed method. Section IV reports the experimental results and a comprehensive analysis. Finally, conclusions and suggestions for further work are presented in Section V.

II. RELATED WORK

A. Methods Based on Convolutional Neural Networks

Existing CNN-based works for FGVC can be divided into two categories: part locating methods and feature encoding methods.

1) Part Locating Methods: Part locating methods [11]–[16] aim to find discriminative regions to distinguish subtle inter-class differences. Early part locating methods [11], [12] directly use the bounding boxes and part annotations. Huang et al. [11] proposed to extract features from annotated parts during training to enhance the recognition ability of the network. To avoid expensive labeling, more weakly supervised locating methods have been proposed recently. Liu et al. [14] proposed a weakly supervised cross-part convolutional neural network for FGVC that localizes the multi-regional features and learns the cross-part feature. Yang et al. [15] proposed to extract the global features for coarse class prediction and then re-rank the prediction results by the local features extracted from the feature pyramid network.


Liu et al. [16] proposed a filter learning method to improve the part localization ability of the network. Ardhendu et al. [53] proposed to encode the spatial arrangement and visual features of multi-scale regions and to capture subtle variations of these regions via bilinear pooling. However, the locating modules of these methods often require a large number of parameters to obtain accurate part locating results, thus increasing the training complexity.

The attention-based part locating methods [18]–[23] have gained popularity, since they can find discriminative regions automatically without a region proposal network. Liu et al. [19] proposed to use the attention mechanism to group the feature channels for discriminative parts. Ding et al. [22] proposed a selective sparse sampling framework that divides the parts into discriminative parts and complementary parts according to the response differences of the attention map. Luo et al. [23] proposed a Cross-X learning framework that extracts features of multiple attention regions and uses cross-layer constraints for multi-scale feature representations. To obtain rich semantic information, Gao et al. [21] proposed to mine the channel-wise relationships of the feature map through the attention mechanism. Zheng et al. [20] proposed using trilinear attention to integrate feature maps through spatial relationships and highlight the detailed areas of the input image at a high resolution. The aforementioned methods are based on CNN models. Building on recent ViT models, in this work, we group the channels by MHSA heads and propose the multi-head voting module to locate tokens of discriminative regions.

2) Feature Encoding Methods: Feature encoding methods, e.g., [24]–[35], aim to learn richer features and usually rely on the fusion of multiple network features or on various novel training strategies. The bilinear pooling methods [25], [26] use multiple networks to extract features separately and fuse the features for recognition. Several authors [27], [28] have proposed to partition the input images into small patches and shuffle them to break the spatial correlation of the images. This can force the network to focus on more valuable fine-grained features. The peak suppression methods [29], [30] mask the highest response area so that the network can find more potential features and improve the generalization ability. Some methods [33], [34] utilize the click features from the user's click data for fusing and regularizing the fine-grained visual features. Wang et al. [35] designed a bimodal network that includes spatial-domain and frequency-domain branches to obtain the spatial-frequency features for classification. However, it is difficult to directly indicate what features these networks have learned. In this work, we fuse the cross-layer feature of discriminative regions in each layer to ensure the interpretability of the network.

B. Methods Based on Vision Transformers

The transformer framework [54] was proposed to process sequence data and was applied to natural language processing tasks [55], [56] and multimodal learning tasks [57]. The core of the transformer is a multi-head self-attention mechanism that groups the channels and calculates the interrelationship of the input token sequences. Dosovitskiy et al. [9] proposed the ViT to split the 2D image into a patch sequence and use a pure transformer architecture to achieve a performance similar to that of CNNs in computer vision tasks. ViTs can capture global features, but they cannot easily capture local features, which limits their performance in FGVC tasks. To solve this problem, He et al. [36] proposed a transformer architecture for fine-grained recognition (TransFG), which splits the images into overlapped patches to obtain richer features and uses the discriminative image patches for classification. In [37], [38], the most discriminative region is located and fed to the network again for detailed feature representation. Liu et al. [39] proposed to suppress the most responsive token during prediction, which allows the network to find richer features and add them to the knowledge base. Sun et al. [58] proposed to integrate structural information through a ViT by mining the spatial context relation of salient patches and to enhance the feature robustness with a contrastive learning strategy. Zhao et al. [41] proposed to use several transformer blocks to extract both global features and local features. Zhu et al. [42] proposed a dual cross-attention learning algorithm that enhances the interactions between global images and local high-response regions and establishes interactions between image pairs. Inspired by these works, we propose a cross-layer refinement module to make full use of the cross-layer feature selected by each layer of the ViT.

The feature fusion vision transformer (FFVT) [40], which chooses the discriminative tokens from each transformer layer and conducts feature fusion in the last layer for the cross-layer feature, is most related to our work. However, the FFVT ignores the different learning performances of the transformer layers and of the heads in MHSA. We propose a multi-head voting module to suppress the influence of heads with weak performances and use a dynamic selection module to enable the layers with better learning performances to contribute more cross-layer features.

C. Ensemble Learning

In machine learning, ensemble learning aims to combine multiple weak learners to form a strong learner. Through the combination, the errors of an individual learner can be compensated by the other learners. Ensemble learning usually uses voting methods to simulate the wisdom of the crowd and can also expand the search space to better fit the data space [49].

The classic algorithms in ensemble learning include the bagging algorithm [50] and the boosting algorithm [52]. The bagging algorithm allocates the training samples to each weak learner through random sampling with replacement. All weak learners aggregate their votes to obtain the final prediction result. The boosting algorithm uses all samples at each iteration and updates the weight of each sample according to the prediction result during the training process. After the training, the weight of each weak learner is updated according to its performance. The bagging algorithm reduces prediction variance, and the boosting algorithm reduces prediction bias [49]. A minimal illustration of these two ideas is sketched below.
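The following minimal Python sketch (ours, for illustration only) contrasts the two ideas referenced above, bagging-style vote aggregation and boosting-style re-weighting; the array shapes and the exponential update rule are simplifying assumptions rather than part of any cited algorithm.

```python
import numpy as np

def bagging_vote(predictions):
    """Bagging-style aggregation: majority vote over the class labels
    predicted by several weak learners.
    predictions: array of shape (num_learners, num_samples), integer labels."""
    num_classes = predictions.max() + 1
    counts = np.apply_along_axis(
        lambda col: np.bincount(col, minlength=num_classes), 0, predictions)
    return counts.argmax(axis=0)            # aggregated label per sample

def boosting_reweight(weights, correct, rate=0.1):
    """Boosting-style update: shift weight towards learners that predicted
    correctly (a simplified exponential re-weighting)."""
    weights = weights * np.exp(rate * (2.0 * correct - 1.0))
    return weights / weights.sum()

# Example: three weak learners vote on four samples.
preds = np.array([[0, 1, 1, 2],
                  [0, 1, 2, 2],
                  [1, 1, 2, 0]])
print(bagging_vote(preds))                                    # -> [0 1 2 2]
print(boosting_reweight(np.ones(3) / 3, np.array([1.0, 1.0, 0.0])))
```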


In this work, we explore the strong correlation between the MHSA, cross-layer feature fusion, and ensemble learning. The characteristics of ensemble learning can solve the problem of the different learning performances of each head and each layer. To this end, we integrate ensemble learning into the transformer and propose the internal ensemble learning transformer to further improve the performance of ViTs on FGVC.

III. THE PROPOSED METHOD

The architecture of the proposed internal ensemble learning transformer (IELT) for FGVC is illustrated in Figure 2. First, the features of the input image are extracted through linear projection and represented in the form of sequenced tokens, which are concatenated with the class token of category features. The concatenated tokens are sent into the transformer layers. To improve the feature representation ability, the multi-head voting (MHV) module is proposed to select the tokens of discriminative regions from the outputs of the previous L−1 layers, where L is the number of layers of the original ViT. The input of the L-th layer is then replaced by concatenating 1) the MHV-selected tokens and 2) the class token of the (L−1)-th layer in the proposed cross-layer refinement (CLR) module to obtain the cross-layer feature. To refine the extracted cross-layer feature, the output tokens are selected by the MHV module and concatenated with the class token again as the input of the (L+1)-th layer. The class tokens extracted from the outputs of the L-th layer and the (L+1)-th layer, which represent the cross-layer feature and the refined feature, respectively, are fed into the fully connected (FC) layer. The cross-layer feature then guides the refined feature for prediction through the assist logits, an operation that we designed following [39]. In addition, a dynamic selection (DS) module is proposed to update the selection numbers of the previous L−1 layers in the MHV module according to the selection results in the CLR module.

This section first introduces the backbone of our method in Section III-A. We then describe the MHV module, the CLR module, and the DS module in Section III-B, Section III-C, and Section III-D, respectively.

A. The Backbone Network

The proposed IELT adopts the vision transformer (ViT) [9] as the backbone network. First, the input image I is divided into N = n_1 × n_2 non-overlapped patches, where n_1 and n_2 are the numbers of patches in each row and column of the input image, respectively. The non-overlapped patches are denoted by I_i^p ∈ R^{P×P×c}, i = 1, 2, ..., N, where (P, P) is the spatial dimension and c is the number of channels. Using a learnable linear projection E ∈ R^{cP^2×D}, the patch I_i^p can be transformed into an embedded token x_i = I_i^p E ∈ R^D, i = 1, 2, ..., N. The embedded tokens X = [x_1, x_2, ..., x_N] and the class token x_{class} ∈ R^D are then concatenated and added to the trainable positional encoding E_{pos} ∈ R^{(N+1)×D} to form the initial input token sequence X_0:

X_0 = [x_{class}; x_1; x_2; \cdots; x_N] + E_{pos}    (1)

The ViT backbone consists of multiple transformer layers, each of which includes the multi-head self-attention and a multi-layer perceptron. For one transformer layer, if the input is X_{in}, the output token sequence X_{out} with the same size as X_{in} is computed as

X' = \mathrm{MHSA}(\mathrm{LN}(X_{in})) + X_{in}, \qquad X_{out} = \mathrm{MLP}(\mathrm{LN}(X')) + X'    (2)

where LN(·) is the layer normalization [59], MHSA(·) is the multi-head self-attention, and MLP(·) is the multi-layer perceptron.

B. The Multi-Head Voting Module

The multi-head self-attention (MHSA) in each transformer layer divides the input into K groups along the channel dimension and feeds each group to one head to calculate self-attention. The attention map generated by each head represents the degree of attention paid to each region. However, the performances of the heads in locating the discriminative regions are not consistent [47], [48]. In some attention maps, the peak areas are located on the discriminative regions, while in others, the peak areas are concentrated on the background. To mitigate the effects of heads with poor location performances and select valuable tokens from each layer, inspired by the bagging algorithm [50], we propose the multi-head voting (MHV) module, in which the multiple heads in MHSA are treated as weak learners for selecting tokens of discriminative regions.

For the l-th (l ∈ {1, 2, ..., L−1}) transformer layer, suppose the input and output token sequences of this layer are X_l^{in} and X_l^{out}, respectively. Let A = [A^1; A^2; \cdots; A^k; \cdots; A^K] denote the attention score of the class token, where A^k ∈ R^N is the attention score of the k-th head, taken from the attention map generated by the MHSA. To characterize the spatial relationship, according to the original spatial arrangement, the attention score A^k is resized into a two-dimensional matrix A^{k′} ∈ R^{n_1×n_2}, as shown in Figure 3. To select the valuable tokens based on the attention score A^{k′}, we define a score map M^k ∈ R^{n_1×n_2} as

M^k(i, j) = \begin{cases} 1, & \text{if } A^{k′}(i, j) \text{ is a top-}v \text{ value} \\ 0, & \text{otherwise} \end{cases}    (3)

where v is a hyper-parameter that defines the number of votes for each head. The score maps of all heads are then combined to obtain the total score map M' ∈ R^{n_1×n_2} as

M' = \sum_{k=1}^{K} M^k    (4)

To suppress the noise and enhance discriminative regions, a convolution operation is performed on the total score map M' with a convolution kernel K. The enhanced score map M^* ∈ R^{n_1×n_2} is calculated by

M^* = M' * K    (5)

where * represents the convolution operator. The adopted convolution kernel K is as follows:

K = \begin{bmatrix} 1 & 2 & 1 \\ 2 & 4 & 2 \\ 1 & 2 & 1 \end{bmatrix}    (6)
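To make the tensor bookkeeping concrete, the following PyTorch sketch shows Eqs. (1) and (2) and how the per-head class-token attention A^k used by the MHV module can be read off; the class names and default sizes are our illustrative choices and not the released IELT code.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Eq. (1): patch projection, class token, and positional encoding."""
    def __init__(self, img_size=448, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        n = (img_size // patch_size) ** 2                             # N = n1 * n2
        self.proj = nn.Conv2d(in_chans, dim, patch_size, patch_size)  # linear projection E
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))         # x_class
        self.pos_embed = nn.Parameter(torch.zeros(1, n + 1, dim))     # E_pos

    def forward(self, img):                                  # img: (B, 3, H, W)
        x = self.proj(img).flatten(2).transpose(1, 2)        # (B, N, D) embedded tokens
        cls = self.cls_token.expand(x.size(0), -1, -1)
        return torch.cat([cls, x], dim=1) + self.pos_embed   # X_0

class TransformerLayer(nn.Module):
    """Eq. (2): pre-norm MHSA and MLP blocks with residual connections."""
    def __init__(self, dim=768, heads=12, mlp_ratio=4):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
                                 nn.Linear(dim * mlp_ratio, dim))

    def forward(self, x):
        h = self.norm1(x)
        attn_out, attn_map = self.attn(h, h, h, average_attn_weights=False)
        x = x + attn_out                         # X' = MHSA(LN(X_in)) + X_in
        x = x + self.mlp(self.norm2(x))          # X_out = MLP(LN(X')) + X'
        # attn_map has shape (B, K, N+1, N+1); its first query row is the
        # class-token attention, whose patch entries give A^k for each head.
        return x, attn_map
```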


Fig. 2. Overview of the proposed method. The multi-head voting module selects the tokens of discriminative regions from each layer. The cross-layer refinement module extracts the cross-layer feature and obtains the refined feature through an additional selection. The cross-layer feature guides the refined feature in the prediction. The dynamic selection module updates the selection number of each layer in the multi-head voting module.

Fig. 3. The process of the multi-head voting module. We obtain the attention map of each head in one layer and locate the positions of its largest v values (v is set to 4 in this figure for clear observation). We then converge the voting results of all heads into a total score map. To suppress the noise, we enhance the score map by conducting a 3×3 convolution on it. Finally, we determine the selected tokens.

Here, we can use either a learnable convolution kernel or a fixed convolution kernel. Inspired by the asymmetric convolution kernel [60], we adopt a fixed Gauss-like kernel in our implementation. It can effectively remove the scattered noise and keep the edges of the main parts of the object distinct [61]. The enhanced score map M^* of a sample is visualized in Figure 3. To filter the noise, we define a selection number vector m ∈ R^{L−1} for the previous L−1 layers and a vector id ∈ R^{m(l)} whose components are the indexes of the largest m(l) values of the flattened M^*. The determination of the selection number vector m is detailed in Section III-D.


For the l-th layer, we select m(l) tokens from the output X_l^{out} by id and concatenate them to form a selected token sequence X_l^{sel} ∈ R^{m(l)×D} of the l-th layer, which is

X_l^{sel} = [X_{l,id(1),:}^{out}; X_{l,id(2),:}^{out}; \cdots; X_{l,id(m(l)),:}^{out}]    (7)

The proposed MHV module reduces the influence of heads with weak performances and selects the tokens of discriminative regions in each layer. A condensed sketch of the whole selection step is given below.
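The sketch below walks through Eqs. (3)–(7) in PyTorch for a single layer, assuming the per-head class-token attention scores and the layer's output tokens are already available; it reflects our reading of the module, not the released IELT implementation.

```python
import torch
import torch.nn.functional as F

# Fixed Gaussian-like kernel of Eq. (6).
GAUSS_KERNEL = torch.tensor([[1., 2., 1.],
                             [2., 4., 2.],
                             [1., 2., 1.]]).view(1, 1, 3, 3)

def multi_head_voting(attn_cls, tokens, n1, n2, v=24, m_l=10):
    """attn_cls: (K, N) class-token attention scores, one row per head (A^k).
    tokens:   (N, D) output tokens X_l^out of the same layer.
    Returns the selected token sequence X_l^sel and the chosen indexes id."""
    K, N = attn_cls.shape
    votes = torch.zeros(K, N)
    top_v = attn_cls.topk(v, dim=1).indices            # each head votes for its top-v patches
    votes.scatter_(1, top_v, 1.0)                      # per-head score maps M^k, Eq. (3)
    total = votes.sum(dim=0).view(1, 1, n1, n2)        # total score map M', Eq. (4)
    enhanced = F.conv2d(total, GAUSS_KERNEL, padding=1)  # enhanced map M* = M' * K, Eq. (5)
    idx = enhanced.flatten().topk(m_l).indices         # indexes of the m(l) largest values
    return tokens[idx], idx                            # X_l^sel, Eq. (7)
```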

C. The Cross-Layer Refinement Module

The existing methods [23], [40], [51] have reported the effectiveness of cross-layer feature fusion in enhancing the feature representation. However, the direct fusion of the cross-layer feature may involve the redundant information of unreliable regions, which affects the final classification performance. In order to better exploit the cross-layer feature and suppress the noise, we propose the cross-layer refinement (CLR) module, as shown in Figure 2. In this module, the inputs are the selected tokens of all previous layers, which are considered as cross-layer tokens. Based on the cross-layer tokens, we select several tokens by the MHV module, named refined tokens. The cross-layer feature and the refined feature are extracted from the cross-layer tokens and the refined tokens, respectively, by the transformer layers. In order to avoid a loss of detail, the assist logits operation was designed, as suggested in [39]. The refined feature and the cross-layer feature are combined by the assist logits to generate the final prediction results.

To extract the cross-layer feature, we concatenate the class token of the (L−1)-th layer and the cross-layer tokens as the input of the L-th layer:

X_L^{in} = [X_{L-1}^{class}; X_1^{sel}; X_2^{sel}; \cdots; X_{L-1}^{sel}]    (8)

The output of the L-th layer is obtained by Eq. (2) and denoted by X_L^{out}. The cross-layer feature is x_L^{class} = X_{L,1,:}^{out}, which contains rich cross-layer information.

To obtain the refined feature, first, as the tokens in X_L^{out} have no spatial relationships, X_L^{out} is processed by the MHV module without the resize operation or the enhancement convolution to obtain the refined tokens. Let id' ∈ R^t denote the indexes of the refined tokens, where t is a hyper-parameter that represents the number of refined tokens. An additional transformer layer, the (L+1)-th layer, is then used to extract the refined feature from the refined tokens. To prevent the refined feature from the effect of noise, instead of direct fusion, we adopt the class token of the (L−1)-th layer rather than that of the L-th layer as the input of the (L+1)-th layer. Thus, the input of the (L+1)-th layer consists of the class token and the refined tokens, denoted by X_{L+1}^{in} ∈ R^{(t+1)×D}:

X_{L+1}^{in} = [X_{L-1}^{class}; X_{L,id'(1),:}^{out}; X_{L,id'(2),:}^{out}; \cdots; X_{L,id'(t),:}^{out}]    (9)

The output of the (L+1)-th layer is computed by Eq. (2) and represented by X_{L+1}^{out}. The refined feature is x_{L+1}^{class} = X_{L+1,1,:}^{out}, which is the noise-filtered cross-layer feature.

Fig. 4. The process of the assist logits. The cross-layer feature first makes the prior prediction, which is multiplied (Hadamard product) with the summed weights of the FC layer to obtain the cross-layer logits. The refined feature is added to the cross-layer logits for the final prediction. We calculate the cross-entropy loss of both predictions. We omit the LN(·) before the FC layer for a clear view.

After obtaining the cross-layer feature x_L^{class} and the refined feature x_{L+1}^{class}, inspired by [39], we propose the assist logits operation, which uses the prior prediction result to guide the final prediction. The process of the assist logits is shown in Figure 4. The prior prediction result p ∈ R^C is computed from the cross-layer feature as follows:

p = \mathrm{softmax}(\mathrm{FC}(\mathrm{LN}(x_L^{class})))    (10)

where C is the number of categories and softmax(·) is the softmax operation. As the weight W = [w_1, w_2, \cdots, w_D] ∈ R^{C×D} of the FC layer represents the responses of the embedding dimensions to the predicted categories, the cross-layer logits y ∈ R^C are computed from the prior prediction p and the weight W as follows:

y = p \odot \sum_{i=1}^{D} w_i    (11)

where \odot is the Hadamard product operation. The final prediction p' ∈ R^C is then computed from the cross-layer logits y and the refined feature x_{L+1}^{class} as follows:

p' = \mathrm{softmax}(\mathrm{FC}(\mathrm{LN}(x_{L+1}^{class})) + y)    (12)

Herein, p and p' are computed by the same FC layer. To optimize the network during the training phase, the cross-entropy loss is adopted for p and p'. The loss function L is defined as

L = \lambda\,\mathrm{CrossEntropy}(p, z) + (1-\lambda)\,\mathrm{CrossEntropy}(p', z)    (13)

where \lambda is the tradeoff hyper-parameter and z denotes the ground-truth label.

By fusing the cross-layer feature and the refined feature, the proposed cross-layer refinement module can avoid the effect of noise and improve the feature representation ability for effective classification.
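A hedged PyTorch sketch of the assist logits operation and the loss of Eqs. (10)–(13) is given below; the module name and the way the shared FC head is exposed are our own assumptions made to mirror the description, not the official code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AssistLogitsHead(nn.Module):
    """Shared FC head applied to the cross-layer and refined class tokens."""
    def __init__(self, dim=768, num_classes=200, lam=0.4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fc = nn.Linear(dim, num_classes)    # weight W has shape (C, D)
        self.lam = lam                           # tradeoff lambda of Eq. (13)

    def forward(self, cls_cross, cls_refined, target=None):
        p = F.softmax(self.fc(self.norm(cls_cross)), dim=-1)     # prior prediction, Eq. (10)
        w_sum = self.fc.weight.sum(dim=1)                        # sum_i w_i, one entry per class
        y = p * w_sum                                            # cross-layer logits, Eq. (11)
        final_logits = self.fc(self.norm(cls_refined)) + y       # pre-softmax form of p', Eq. (12)
        if target is None:
            return final_logits
        loss = self.lam * F.nll_loss(torch.log(p + 1e-12), target) \
             + (1.0 - self.lam) * F.cross_entropy(final_logits, target)   # Eq. (13)
        return final_logits, loss
```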


D. The Dynamic Selection Module

To obtain more discriminative features, inspired by the boosting algorithm [52], the dynamic selection (DS) module is proposed. In the DS module, the transformer layers are considered as weak learners, and the selection number vector m in the MHV modules of the previous L−1 layers is updated so that the stronger layers contribute more tokens of discriminative regions and more useful cross-layer features, as shown in Figure 2. To determine the contribution of each weak learner, the indexes of the refined tokens in the CLR module are exploited.

The selection number vector m ∈ R^{L−1} of the previous L−1 layers is computed by m = \lceil s r \rceil, where s is the total selection number and r ∈ R^{L−1} is the selection ratio of the previous L−1 layers, which is initialized as r = 1/(L−1). To characterize the contribution of each layer to the refined feature, we count the number of tokens of each layer that are selected as refined tokens. For the l-th layer, let [a(l), b(l)) denote the index interval of the tokens contributed by this layer, i.e.,

a(l) = \begin{cases} 0, & l = 1 \\ \sum_{i=1}^{l-1} m(i), & l > 1 \end{cases}, \qquad b(l) = a(l) + m(l)    (14)

where l ∈ {1, 2, ..., L−1}. By determining whether each element of the refined token indexes id' is within the interval [a(l), b(l)), a count number q(l) is obtained. Specifically, if an element of id' is within the interval [a(l), b(l)), then q(l) ← q(l) + 1, as presented in Algorithm 1. An auxiliary selection ratio of the l-th layer is calculated as r'(l) = q(l)/t. Using the auxiliary selection ratio r'(l) and the moving rate θ, we update the selection ratio r and the selection number vector m as follows:

r ← (1−θ) r + θ r', \qquad m = \lceil s r \rceil    (15)

Algorithm 1 Dynamic Selection
Input: The number of refined tokens t; the indexes of the refined tokens id' ∈ R^t in the CLR module; the total selection number s; the selection ratio r ∈ R^{L−1}; the moving rate θ.
Output: New selection number vector m ∈ R^{L−1}.
 1: a = b = 0                      ▷ index interval
 2: m = ⌈sr⌉                       ▷ original selection number vector m ∈ R^{L−1}
 3: r' = 0                         ▷ new selection ratio r' ∈ R^{L−1}
 4: for l = 1, ..., L−1 do
 5:     a = b; b = a + m(l)        ▷ index interval of layer l
 6:     q = 0                      ▷ refined tokens from layer l
 7:     for i = 1, ..., t do
 8:         if a ≤ id'(i) < b then
 9:             q = q + 1
10:         end if
11:     end for
12:     r'(l) = q/t                ▷ new selection ratio of layer l
13: end for
14: r ← (1−θ)r + θr'               ▷ update the selection ratio
15: m = ⌈sr⌉
16: return m
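For reference, Algorithm 1 translates almost line for line into the following Python sketch (variable names follow the pseudo-code); it is an illustration of the update rule rather than the released implementation.

```python
import math

def dynamic_selection(refined_ids, t, s, r, theta):
    """One DS update following Algorithm 1.
    refined_ids: indexes id' of the t refined tokens within the concatenated
                 cross-layer tokens; r: per-layer selection ratios (length L-1)."""
    m = [math.ceil(s * ri) for ri in r]                  # original selection numbers, m = ceil(s*r)
    r_new = [0.0] * len(r)
    a = b = 0
    for l in range(len(r)):
        a, b = b, b + m[l]                               # index interval [a, b) of layer l
        q = sum(1 for i in refined_ids if a <= i < b)    # refined tokens contributed by layer l
        r_new[l] = q / t                                 # auxiliary selection ratio r'(l)
    r = [(1 - theta) * ri + theta * rn for ri, rn in zip(r, r_new)]   # Eq. (15)
    return [math.ceil(s * ri) for ri in r], r            # new m and updated r
```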
It can be inferred that the more tokens selected from a certain layer contribute to the refined feature, the larger the selection ratios r and r' of this layer become, and vice versa. This dynamically increases the selection numbers of layers with a strong learning ability and decreases the selection numbers of layers with a weak learning ability. Thus, the dynamic selection can improve the feature representation ability for classification. Through the DS module, the network can automatically adjust the contribution of each layer to the cross-layer feature and learn more valuable cross-layer features.

Note that, in the DS module, the total selection number s and the number of refined tokens t do not change; only the selection numbers of the previous L−1 layers change. To be specific, the proposed network uses the ViT as the backbone, and the layer normalization [59] in each layer can ensure the stability of the network. In addition, all of the tokens in the CLR module come from the previous L−1 layers, which is similar to previous work [40]. It does not affect the parameters receiving a gradient value in the corresponding layer [62], which ensures the differentiability of the entire network.

IV. EXPERIMENTS

In this section, the five FGVC datasets we used as benchmarks and the experiment settings are introduced in Section IV-A. The experimental results compared to the state-of-the-art works are presented in Section IV-B. The ablation studies are presented in Section IV-C. The visualization of the proposed method is presented in Section IV-D. The hyper-parameter analyses are reported in Section IV-E. Finally, the model complexity is analyzed in Section IV-F.

A. Datasets and Experiment Settings

1) Datasets: To verify the robustness and effectiveness of the proposed method, the experiments were conducted on five fine-grained datasets of various scales, including two small datasets, Oxford 102 Flowers [6] and Oxford-IIIT Pet [7], two medium datasets, CUB-200-2011 [3] and Stanford Dogs [5], and a large dataset, NABirds [4]. Table I shows the details of the five datasets. Top-1 accuracy is adopted as the evaluation metric. We only use classification labels without any additional annotations for supervised training.

2) Experiment Settings: The ViT-B-16 pre-trained on ImageNet21K was adopted as the backbone network. The input images were resized to 448×448. Random cropping, horizontal flipping, and color jittering were applied for training, and center cropping was applied for testing. The model was trained by the stochastic gradient descent (SGD) optimizer with a momentum of 0.9, and a cosine annealing scheduler was applied. The initial learning rate was set to 0.002 for Stanford Dogs and 0.02 for the other four datasets. The model was trained for 50 epochs, and the batch size was set to 8 on all datasets.

In the MHV module, the vote number of each head v was set to 24. In the CLR module, the number of refined tokens t was set to 24, and the loss proportion λ was set to 0.4. In the DS module, the total selection number s was set to 126, and the selection ratio of each layer r was initialized to 1/(L−1). The moving rate θ was set to 1e-4 for Stanford Dogs and NABirds and to 1e-3 for the others.
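As a rough guide, the reported optimization settings translate into the PyTorch setup sketched below; the pre-crop resize size and the jitter strengths are our assumptions, and the linear layer merely stands in for the IELT model.

```python
import torch
from torchvision import transforms

# Training augmentation (exact resize and jitter magnitudes are assumed).
train_transform = transforms.Compose([
    transforms.Resize((600, 600)),
    transforms.RandomCrop((448, 448)),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.ToTensor(),
])

model = torch.nn.Linear(768, 200)   # stand-in for the IELT model
optimizer = torch.optim.SGD(model.parameters(), lr=0.02, momentum=0.9)   # 0.002 for Stanford Dogs
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)

for epoch in range(50):             # 50 epochs, batch size 8 in the paper
    # ... one training pass over the dataset goes here ...
    scheduler.step()
```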


In the first 10 epochs, the DS module was not used, as there are domain gaps between the pre-training dataset and the fine-grained datasets, which make the low-level features more helpful for classification in the initial epochs. If the DS module were used, it would select more tokens in the lower layers, which is not desirable. Thus, the DS module was used only after the network had been optimized on the target dataset. The experimental results of TransFG [36] on Stanford Dogs were adopted from [40]. Furthermore, our model was implemented in PyTorch over four Nvidia Titan Xp GPUs.

TABLE I
FINE-GRAINED DATASET STATISTICS.
Dataset | Classes | Training | Testing
CUB-200-2011 [3] | 200 | 5994 | 5794
NABirds [4] | 555 | 23929 | 24633
Stanford Dogs [5] | 120 | 12000 | 8580
Oxford 102 Flowers [6] | 102 | 2040 | 6179
Oxford-IIIT Pet [7] | 37 | 3680 | 3669

B. Comparison with SOTA Methods

To verify the effectiveness of the proposed IELT, we compared it with the SOTA methods on five datasets. The compared SOTA methods include two categories: the CNN-based methods and the ViT-based methods. For each dataset, as our proposed method adopts random seeds for initialization, we repeated the experiments five times and report the average classification accuracy with the standard deviation. We have conducted significance tests on the experimental results, and all results are within the 95% confidence interval. The best accuracy is in bold, and the second best accuracy is underlined. Table II shows the comparison results on CUB-200-2011 and Stanford Dogs. Table III shows the comparison results on NABirds. The comparison results on Oxford 102 Flowers and Oxford-IIIT Pet are shown in Table IV.

TABLE II
COMPARISON RESULTS ON CUB-200-2011 AND STANFORD DOGS.
Method | Backbone | CUB (%) | Dogs (%)
ResNet [8] | ResNet-50 | 84.5 | 82.7
RA-CNN [63] | VGG-19 | 85.3 | 87.3
MA-CNN [19] | ResNet-50 | 86.5 | -
NTS-Net [13] | ResNet-50 | 87.5 | 87.5
Cross-X [23] | ResNet-50 | 87.7 | 88.9
DCL [27] | ResNet-50 | 87.8 | -
CIN [21] | ResNet-101 | 88.1 | 87.6
S3N [22] | ResNet-50 | 88.5 | 87.1
MRDMN [24] | ResNet-50 | 88.8 | 89.1
FDL [16] | DenseNet-161 | 89.1 | 84.9
PMG [28] | ResNet-50 | 89.6 | -
FBSD [29] | ResNet-50 | 89.8 | 88.1
MSHQP [51] | ResNet-50 | 89.0 | 90.4
API-Net [32] | DenseNet-161 | 90.0 | 90.3
PRIS [64] | ResNet-101 | 90.0 | 90.7
CAL [18] | ResNet-101 | 90.6 | 88.7
CCFR [15] | ResNet-50 | 91.1 | -
ViT [9] | ViT-B-16 | 91.0 | 90.6
FFVT [40] | ViT-B-16 | 91.6 | 91.5
TransFG [36] | ViT-B-16 | 91.7 | 90.6
IELT (ours) | ViT-B-16 | 91.8±0.04 | 91.8±0.05

TABLE III
COMPARISON RESULTS ON NABIRDS.
Method | Backbone | Accuracy (%)
Cross-X [23] | ResNet-50 | 86.4
PAIRS [65] | ResNet-50 | 87.9
DSTL [66] | Inception-v3 | 87.9
GHRD [31] | ResNet-50 | 88.0
API-Net [32] | DenseNet-161 | 88.1
PRIS [64] | ResNet-101 | 88.4
CS-Part [67] | ResNet-50 | 88.5
MGE-CNN [68] | SENet-154 | 88.6
ViT [9] | ViT-B-16 | 89.9
TransFG [36] | ViT-B-16 | 90.8
IELT (ours) | ViT-B-16 | 90.8±0.05

1) Results on CUB-200-2011 and Stanford Dogs: Table II shows that our method achieves the best performance compared to the SOTA methods on both CUB-200-2011 and Stanford Dogs. The proposed method achieves a significant improvement of 0.8% and 1.2% over the baseline ViT [9] on the two datasets, respectively. Compared with TransFG [36], which selects tokens from the input of the last layer and replaces the original input, the proposed IELT extracts the features of each layer by the CLR module and obtains an accuracy that is higher by 0.1% and 1.2%, respectively. FFVT [40] fuses the cross-layer feature but ignores the influence of the noise involved with the cross-layer feature and the different learning performances of the multiple heads in the MHSA and of the transformer layers. The proposed method reduces the effect of noise and outperforms FFVT by 0.2% and 0.3%, respectively.

2) Results on NABirds: For NABirds, a large-scale dataset, the ViT-based methods have greater potential than the CNN-based methods. Based on Table III, the proposed method obtains an accuracy of 90.8%, which outperforms the baseline by 0.9%. The proposed method achieves the same accuracy as TransFG [36] with a significant reduction in memory and computation. Our proposed IELT makes full use of the cross-layer feature to obtain a competitive performance without an increase in computation.

3) Results on Oxford 102 Flowers and Oxford-IIIT Pet: From Table IV, the proposed method obtains improvements of 0.2% and 1.4% over the baseline on Oxford 102 Flowers and Oxford-IIIT Pet, respectively. We can see that the existing SOTA methods perform well on the flowers, and our method can further improve the performance without adopting any additional annotations. Compared with CNN-based methods such as FixSENet [69], which uses a large amount of data preprocessing, and OPAM [70], which trains the network in stages, the proposed method achieves a higher accuracy without special training strategies. Compared to CvT [71], which is a powerful ViT-based backbone network that integrates a convolution operation into the tokens, the proposed method achieves a slight improvement of 0.1% and 0.5% on the two datasets, respectively. These results show that our method achieves an improved performance on the FGVC datasets compared to other improved ViT-based networks.

C. Ablation Studies

Ablation studies were conducted to verify the effectiveness of our contributed components. They contain experimental results and corresponding analyses. The experiments were performed on CUB-200-2011 if not otherwise mentioned.


TABLE IV
COMPARISON RESULTS ON OXFORD 102 FLOWERS AND OXFORD-IIIT PET.
Method | Backbone | Flowers (%) | Pet (%)
NAC [72] | VGG-19 | 95.3 | 93.8
FixSENet [69] | Inception-v2 | 95.7 | 94.8
InterAct [73] | DenseNet-161 | 96.4 | 93.5
OPAM [70] | VGG-19 | 97.1 | 93.8
MC-Loss [74] | VGG-16 | 97.7 | -
Grafit [75] | ResNet-50 | 99.1 | -
ViT [9] | ViT-B-16 | 99.4 | 93.8
CvT [71] | CvT-21 | 99.5 | 94.7
IELT (ours) | ViT-B-16 | 99.6±0.03 | 95.2±0.09

1) Effectiveness of the Proposed Modules: The proposed IELT consists of three modules: the MHV module, the CLR module, and the DS module. The ablation experiments on the three modules were conducted, and the results are presented in Table V. Adding the MHV module to the original ViT improves the accuracy on the five datasets, e.g., by 0.44% on CUB-200-2011, 1.18% on Stanford Dogs, 0.63% on NABirds, and 1.2% on Oxford-IIIT Pet. These improvements show that selecting tokens of discriminative regions at each layer to obtain the cross-layer feature can improve the performance. Adding the CLR module further improves the classification accuracy on CUB-200-2011, Stanford Dogs, and NABirds, as the CLR module filters the noise in the cross-layer feature and obtains the refined tokens. It also improves the feature representation ability by fusing the cross-layer feature and the refined feature. There is a slight reduction in accuracy on Oxford 102 Flowers and Oxford-IIIT Pet. This may be due to the fact that the two datasets rely more on the low-level features, which is discussed in Section IV-C5. In this case, adopting fixed token selection numbers for the previous L−1 layers in the MHV module causes the network to extract unhelpful features. To solve this problem, the DS module is added to adjust the selection numbers dynamically. In this way, the layers with a stronger learning ability can contribute more tokens of discriminative regions, and the network can further reduce the influence of layers with weaker performances and better extract the cross-layer feature.

The proposed MHV, CLR, and DS modules complement each other. When the CLR module or the DS module is used alone, the accuracy does not improve significantly. When the three modules are combined, the CLR module compensates for the shortcomings of the MHV module, and the DS module further improves the effect of the CLR module. Therefore, the proposed IELT combines the three proposed modules, which yields overall significant improvements, especially on Oxford 102 Flowers and Oxford-IIIT Pet, as shown in Table V.

2) Comparison of Selection Methods: The proposed MHV module is a token selection method for obtaining the tokens of the discriminative regions. To evaluate our selection method, we visually compared the MHV module with the part selection module (PSM) proposed in TransFG [36] and the mutual attention weight selection (MAWS) module proposed in FFVT [40]. The comparison results are shown in Figure 5. For a fair comparison, the token selection number of the three methods was fixed at 12. The figure shows that the proposed MHV module can reduce the effect of noise and select tokens from discriminative regions accurately. The PSM selects the tokens based on the location of the maximum value on the attention map of each head. As the heads in the MHSA usually focus on different regions, the selection results of PSM are relatively scattered and lack discriminative features. MAWS averages the attention maps of the heads and computes the mutual attention weights. However, the values of the mutual attention weights generated by MAWS vary widely, and the selection results are overly influenced by the highly responsive regions. This reduces the diversity of the selected features. Compared with MAWS, the values on the enhanced score map M^* generated by the proposed MHV module are relatively uniform. The selected regions of the MHV module focus on the discriminative regions while ensuring feature diversity. For example, in the NABirds sample in the third row, our method selects the head, foot, and tail tokens of the sample, while MAWS only focuses on the head. These experimental results demonstrate that our proposed MHV module outperforms the other selection methods.

Fig. 5. Selection results of the comparison methods (the PSM of TransFG and the MAWS of FFVT) and the proposed MHV module. The first column is the input image, and the second column is the generated attention map. The remaining columns are the selection results of the different methods. The selected tokens are marked with a light color, and the selection number is fixed at 12.

TABLE VI
ABLATION STUDY ON THE TYPE OF ENHANCED CONVOLUTION KERNEL ON CUB-200-2011.
Method | Kernel Type | Kernel Size | Accuracy (%)
IELT | Without enhancement | 3×3 | 91.70
IELT | Learnable kernel | 3×3 | 91.79
IELT | Gaussian-like kernel | 3×3 | 91.81
IELT | Gaussian-like kernel | 5×5 | 91.77


TABLE V
A BLATION S TUDIES OF THE P ROPOSED MHV, CLR, AND DS MODULES .
Accuracy (%)
MHV CLR DS
CUB-200-2011 Stanford Dogs NABirds Oxford-IIIT Pet Oxford 102 Flowers
-
√ - - 91.04 90.57 89.93 93.84 99.38
√ -
√ - 91.48 91.75 90.56 95.04 99.61
√ -
√ 91.67 91.82 90.67 94.99 99.58
√ -
√ √ 91.59 91.82 90.65 95.07 99.63
91.81 91.84 90.78 95.29 99.64

TABLE VII
ABLATION STUDY ON THE CHOICE OF THE CLASS TOKEN IN THE CLR MODULE ON CUB-200-2011.

Feature              | Input class token       | Accuracy (%)
Cross-layer feature  | L-th layer              | 91.81
                     | previous L − 1 layers   | 91.12
Refined feature      | (L − 1)-th layer        | 91.81
                     | L-th layer              | 91.57

3) Enhanced Convolution Kernel in MHV: A convolution operation is conducted on the score map M0 obtained by aggregating the vote results of each head Mk in the MHV module. Ablation experiments on the choice of this convolution kernel were conducted, and the results on CUB-200-2011 are reported in Table VI. Enhancing the score map with a Gaussian-like convolution kernel improves classification accuracy by 0.11% compared with applying no enhancement. A learnable convolution kernel achieves the same effect and obtains similar performance. In addition, choosing a larger convolution kernel reduces the diversity of the selected features and lowers accuracy. If the score map is not enhanced, its values are very sparse and the differences between them are small, so many tokens end up with the same score, which makes it difficult to select valuable tokens. Moreover, the convolution operation reduces the relative score of scattered noise, because a Gaussian-like convolution effectively removes isolated noisy values while retaining the sharp boundaries of the object [61]. According to the experimental results, we use a 3 × 3 Gaussian-like convolution kernel in the MHV module.
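To make this smoothing-and-selection step concrete, the following sketch applies a fixed 3 × 3 Gaussian-like kernel to an aggregated vote-score map and then keeps the highest-scoring patch tokens. It is a minimal PyTorch illustration that assumes a (B, H, W) score map whose H × W grid matches the N patch tokens; the helper names (smooth_score_map, select_tokens) and the exact kernel weights are assumptions made here for illustration, not the released IELT code.

```python
import torch
import torch.nn.functional as F

def smooth_score_map(score_map: torch.Tensor) -> torch.Tensor:
    """Smooth a (B, H, W) vote-score map with a fixed 3x3 Gaussian-like kernel.

    A hypothetical stand-in for the enhancement step described above: it spreads
    isolated high scores to their neighbours, so scattered noise is suppressed
    while spatially consistent regions keep clearly distinguishable scores.
    """
    kernel = torch.tensor([[1., 2., 1.],
                           [2., 4., 2.],
                           [1., 2., 1.]],
                          device=score_map.device, dtype=score_map.dtype) / 16.0
    kernel = kernel.view(1, 1, 3, 3)
    x = score_map.unsqueeze(1)             # (B, 1, H, W)
    x = F.conv2d(x, kernel, padding=1)     # same spatial size
    return x.squeeze(1)                    # (B, H, W)

def select_tokens(tokens: torch.Tensor, score_map: torch.Tensor, k: int) -> torch.Tensor:
    """Pick the k patch tokens with the highest smoothed scores.

    tokens:    (B, N, D) patch tokens of one layer (class token excluded).
    score_map: (B, H, W) with H * W == N.
    """
    b, n, d = tokens.shape
    scores = smooth_score_map(score_map).flatten(1)              # (B, N)
    top_idx = scores.topk(k, dim=1).indices                      # (B, k)
    return tokens.gather(1, top_idx.unsqueeze(-1).expand(-1, -1, d))  # (B, k, D)
```

Keeping the kernel at 3 × 3 is consistent with the ablation above: a small kernel breaks score ties by pooling local evidence, whereas a larger kernel blurs the map and reduces the diversity of the selected tokens.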
4) Cross-Layer Refinement Ablation: To prevent the refined feature from being affected by noise, our proposed CLR module adopts the class token of the (L − 1)-th layer, rather than that of the L-th layer, as the input of the (L + 1)-th layer for extracting the refined feature. To demonstrate the effectiveness of this choice, we compare using the class token of the (L − 1)-th layer and that of the L-th layer as the input of the (L + 1)-th layer. The accuracy comparison is shown in Table VII. Using the (L − 1)-th layer produces a 0.24% higher accuracy than directly using the class token of the L-th layer. Moreover, in extracting the cross-layer feature, only the output class token of the L-th layer is taken instead of fusing the outputs of multiple layers. To demonstrate the effectiveness of this design, we also experiment with fusing the class tokens of the previous L − 1 layers as the cross-layer feature. As shown in Table VII, the proposed method outperforms this fusion by 0.69%. The reason is that the input of the L-th layer consists of the tokens selected from the previous L − 1 layers, and the multi-head self-attention mechanism computes the interrelationships between these tokens, so the output class token of the L-th layer probably already contains rich cross-layer information.
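The routing choice compared in Table VII can be written down schematically. The sketch below only illustrates the two options, prepending either the (L − 1)-th-layer or the L-th-layer class token to the tokens routed into one extra encoder layer; the function name, the tensor shapes, and the use of a stock nn.TransformerEncoderLayer as a stand-in for the paper's (L + 1)-th layer are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

def clr_input(cls_prev: torch.Tensor,              # (B, D) class token of the (L-1)-th layer
              selected: torch.Tensor,              # (B, t-1, D) tokens kept for refinement
              use_prev_layer: bool = True,
              cls_last: torch.Tensor | None = None # (B, D) class token of the L-th layer
              ) -> torch.Tensor:
    """Assemble the token sequence fed to the extra (L+1)-th layer.

    The ablation compares prepending the (L-1)-th-layer class token (the choice
    adopted above) with prepending the L-th-layer class token instead.
    """
    cls = cls_prev if use_prev_layer else cls_last
    return torch.cat([cls.unsqueeze(1), selected], dim=1)   # (B, t, D)

# Usage sketch with dummy tensors (D = 768, t = 13); the encoder layer is only a
# generic stand-in, not the paper's exact block.
extra_layer = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
x = clr_input(torch.randn(2, 768), torch.randn(2, 12, 768))
refined_cls = extra_layer(x)[:, 0]    # refined feature = output class token
```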
5) Dynamic Selection Analysis: The selection number vector of the previous L − 1 layers, m ∈ R^(L−1), in the DS module is recorded when the model obtains its best accuracy. As shown in Figure 6, the distribution of the selection numbers varies across datasets. This demonstrates that the DS module treats the transformer layers of the ViT as multiple weak learners and adaptively adjusts the weight of each layer according to its contribution to the cross-layer feature. It is worth noting that the tokens selected for Oxford 102 Flowers and Oxford-IIIT Pet are mainly concentrated in the lower layers, whereas the tokens selected for NABirds are mainly concentrated in the higher layers. This phenomenon suggests that low-level features are more useful for small-scale datasets, while high-level features are more useful for large-scale datasets. In Stanford Dogs and Oxford-IIIT Pet, the selection number of one particular layer is much larger than those of the other layers. This imbalance across layers may be due to the large intra-class variance of these two datasets, and it enables the DS module to provide a larger improvement on them, as shown in Table V.

D. Visualization of Selected Examples

To demonstrate the effectiveness of the proposed method more directly, visualization results of IELT are presented in Figure 7. The first row shows the original images. The second and third rows show the attention maps generated by the baseline and the proposed method, respectively. The fourth row shows the selection results of the MHV module. Compared with the baseline, the attention maps generated by IELT respond more strongly to discriminative regions and have better-defined boundaries, e.g., the head of a bird, the eyes and nose of a dog, and the center of a flower. This demonstrates that the proposed IELT focuses better on the discriminative regions, so that the MHV module can select tokens that are more helpful for classification.

E. Hyperparameter Analyses

To examine the effects of the hyperparameters in the proposed modules, parameter analysis experiments are performed on CUB-200-2011 unless otherwise mentioned.

[Figure 6: five panels, one per dataset (CUB-200-2011, Stanford Dogs, NABirds, Oxford 102 Flowers, Oxford-IIIT Pet); x-axis: transformer layer index, y-axis: selection number.]
Fig. 6. The selection numbers of the previous L − 1 transformer layers adjusted by the dynamic selection module when the network obtains the best accuracy of each dataset.
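Figure 6 can be read as the outcome of repeatedly redistributing a fixed token budget across the previous L − 1 layers. The sketch below shows one plausible redistribution rule, assuming each layer's usefulness is summarized by a non-negative contribution score; the proportional-allocation rule, the minimum of one token per layer, and the variable names are illustrative assumptions rather than the DS module's exact update.

```python
import torch

def reallocate_budget(contrib: torch.Tensor, total: int = 126, min_per_layer: int = 1) -> torch.Tensor:
    """Split a total selection budget across L-1 layers in proportion to their contributions.

    contrib: (L-1,) non-negative scores, e.g., how much each layer's selected
             tokens contributed to the refined feature.
    Returns an integer tensor m with m.sum() == total and m >= min_per_layer.
    """
    n_layers = contrib.numel()
    weights = contrib.clamp(min=0)
    weights = weights / weights.sum().clamp(min=1e-8)
    spare = total - min_per_layer * n_layers
    m = min_per_layer + torch.floor(weights * spare).long()
    # Hand out any remainder left over from flooring to the highest-weighted layers.
    for i in torch.argsort(weights, descending=True)[: total - int(m.sum())]:
        m[i] += 1
    return m

# With 11 previous layers and s = 126 the average allocation is 126 / 11 ≈ 11.5 tokens
# per layer, which matches the figure quoted in the hyperparameter analysis below.
print(reallocate_budget(torch.rand(11)).sum())   # tensor(126)
```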

[Figure 7: qualitative examples, one column per dataset (CUB-200-2011, Stanford Dogs, NABirds, Oxford 102 Flowers, Oxford-IIIT Pet); rows: Input, Baseline, Ours, The Selected Tokens.]
Fig. 7. Visualization results of our method on each dataset. The first row is the input image. The second and third rows are the attention maps generated by the baseline and ours, respectively. The light-colored locations indicate higher responses to the class token. The fourth row shows where the tokens are selected by the MHV module, and the selected places are marked with a lighter color.
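The attention maps in Figure 7 show how strongly each patch responds to the class token. One common way to produce such a map from a ViT-style layer is sketched below; averaging over heads, the choice of layer, and the bilinear upsampling to image resolution are assumptions about a generic recipe and are not claimed to be how the figures above were generated.

```python
import torch
import torch.nn.functional as F

def class_token_heatmap(attn: torch.Tensor, image_hw: tuple[int, int]) -> torch.Tensor:
    """Turn MHSA attention weights into a class-token response heatmap.

    attn: (B, heads, N+1, N+1) attention of one layer, index 0 = class token.
    Returns a (B, H_img, W_img) map scaled to [0, 1]; lighter = higher response.
    """
    cls_to_patches = attn[:, :, 0, 1:].mean(dim=1)     # (B, N): average over heads
    n = cls_to_patches.shape[1]
    side = int(n ** 0.5)                               # assumes a square patch grid
    heat = cls_to_patches.view(-1, 1, side, side)
    heat = F.interpolate(heat, size=image_hw, mode="bilinear", align_corners=False)
    heat = heat.squeeze(1)
    heat = heat - heat.amin(dim=(1, 2), keepdim=True)
    return heat / heat.amax(dim=(1, 2), keepdim=True).clamp(min=1e-8)
```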

TABLE VIII
EFFECT OF VOTE NUMBER v IN MHV.

v            |   8   |  16   |  20   |  24   |  28   |  32
Accuracy (%) | 91.43 | 91.49 | 91.75 | 91.81 | 91.80 | 91.46

TABLE IX
INFLUENCES OF THE LOSS PROPORTION λ.

λ            |   0   |  0.2  |  0.4  |  0.6  |  0.8  |   1
Accuracy (%) | 91.61 | 91.70 | 91.81 | 91.70 | 91.70 | 91.46

1) Vote Number in MHV: To determine the optimal vote number v in the MHV module, different values are adopted, and the accuracy results are shown in Table VIII. The best result is obtained when v is set to 24. We believe that larger vote numbers may introduce additional noise, because only a few parts of the attention map have high responses, while smaller vote numbers may cause the network to ignore the low-response parts and reduce the diversity of the cross-layer feature. Therefore, we set v = 24 in this work.

2) Loss Proportion: To evaluate the effect of the loss proportion λ in the CLR module, different loss proportions are used, and the results are shown in Table IX. The highest accuracy is obtained when λ is set to 0.4. Because the cross-layer feature predicts the result with the aid of the assist logits, the network can still be updated even if the loss proportion λ is close to 0. In that case, however, the optimization of the network tends to depend more on the refined tokens, which increases the risk of over-fitting and reduces the convergence speed. When λ is close to 1, the parameters in the (L + 1)-th layer become difficult to optimize, which affects the performance of the network. Thus, we choose λ = 0.4 as the default setting.
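One way to read the λ trade-off in Table IX is as a convex combination of two cross-entropy terms: one on the assist logits obtained from the cross-layer feature and one on the logits obtained from the refined feature of the (L + 1)-th layer. The paper's exact loss is not restated here, so the form below, including the argument names, should be treated as an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def combined_loss(assist_logits: torch.Tensor,
                  refined_logits: torch.Tensor,
                  target: torch.Tensor,
                  lam: float = 0.4) -> torch.Tensor:
    """Weight the cross-layer (assist) branch against the refined branch.

    With lam close to 0 the gradient mostly flows through the refined branch
    (the extra layer), which is the over-fitting-prone regime discussed above;
    with lam close to 1 the extra layer receives little supervision.
    This is an assumed form, not a restatement of the paper's loss.
    """
    loss_assist = F.cross_entropy(assist_logits, target)
    loss_refined = F.cross_entropy(refined_logits, target)
    return lam * loss_assist + (1.0 - lam) * loss_refined

# Usage sketch: 200 classes (CUB-200-2011), batch of 4.
logits_a, logits_r = torch.randn(4, 200), torch.randn(4, 200)
labels = torch.randint(0, 200, (4,))
print(combined_loss(logits_a, logits_r, labels).item())
```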
3) Total Selection Number: To explore the effect of the total selection number s in the DS module, we show the accuracy for different values of s in Figure 8. Since FGVC mainly depends on a few discriminative parts, only a relatively small fraction of the total tokens should be selected in each layer. When s is set to 126, the network obtains the highest accuracy, i.e., 91.81%. Note that, with this setting of s, the average number of tokens selected per layer is 11.5, which is much smaller than the token number N = 784. Reducing s can slightly improve the training and inference speed; however, selecting fewer tokens may lose valuable features, while selecting more tokens may introduce additional noise. Therefore, we set s = 126 in the DS module.


[Figure 8: accuracy (histogram, %) and total training time (line, minutes) for different total selection numbers.]
Fig. 8. Results of the ablation experiments on the total selection number. The histogram shows the accuracy for different selection numbers, and the line graph shows the total training time for different selection numbers.

TABLE X
COMPARISON OF THE MODEL COMPLEXITY AND CLASSIFICATION ACCURACY ON CUB-200-2011.

Method  | Layer Number | Params | Flops  | Accuracy (%)
ViT     | L            | 86.4M  | 77.8G  | 91.0
TransFG | L            | 86.4M  | 129.4G | 91.7
IELT    | L + 1        | 93.5M  | 72.6G  | 91.8
IELT*   | L + 2        | 100.6M | 79.9G  | 91.4

F. Model Complexity

Compared with the baseline, our method introduces an additional transformer layer. As shown in Table X, the parameter amount increases from 86.4 million to 93.5 million. However, the number of tokens involved in the computation of the L-th layer is reduced from N to s, and the token number in the (L + 1)-th layer is t, which is much smaller than N. Since the time complexity of one transformer layer is Ω(12ND² + 2N²D), the proposed IELT has a slightly larger number of parameters but a slightly smaller amount of computation. Specifically, the computation cost of the last layer of the baseline is 6.49 GFlops, while the computation cost of the CLR module in IELT is 1.16 GFlops. The computational cost of calculating the score map, performing the convolution on the score map, and dynamically updating the selection number of each layer can be ignored. Therefore, the training and inference speed of our method is slightly faster than that of the baseline.

We also conducted an experiment in which two additional transformer layers were used to extract the cross-layer feature and the refined feature, respectively. The experimental results are shown in Table X. Employing two additional layers (IELT*) reduces the accuracy by 0.4% compared with the proposed method (IELT). Since the layer number increases from L + 1 to L + 2, the parameter amount increases by 7.1 million and the computational cost increases by 6.49 GFlops. The reason why IELT* does not yield an effective improvement may be that adding a further transformer layer does not facilitate the extraction of the cross-layer feature and the refined feature, while the existing network can already extract rich features. Considering the classification accuracy and the computation cost, we only add one additional layer.

In TransFG [36], an overlapped patch split method is used, and the token number N increases from 784 to 1296, so the computation cost of each layer is 11.8 GFlops. The proposed IELT takes only half of the training and inference time and achieves competitive performance with much less computational cost in comparison with TransFG.
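As a rough sanity check of the per-layer figures quoted above, the snippet below evaluates the Ω(12ND² + 2N²D) estimate with D = 768 (ViT-B/16) and the token counts mentioned in the text; biases, LayerNorm, softmax, and the class token are ignored, so the results are approximate by construction.

```python
# Rough per-layer cost check for the numbers quoted above (assumes ViT-B/16: D = 768).
def layer_flops(n_tokens: int, d: int = 768) -> float:
    """Approximate FLOPs of one transformer layer: 12*N*D^2 (projections + MLP) + 2*N^2*D (attention)."""
    return 12 * n_tokens * d**2 + 2 * n_tokens**2 * d

def layer_params(d: int = 768) -> int:
    """Approximate parameters of one transformer layer: 4*D^2 (MHSA) + 8*D^2 (MLP), ignoring biases/LayerNorm."""
    return 12 * d**2

print(f"baseline last layer (N=784):  {layer_flops(784) / 1e9:.2f} GFlops")   # ~6.49
print(f"TransFG layer (N=1296):       {layer_flops(1296) / 1e9:.2f} GFlops")  # ~11.75
print(f"one extra layer params:       {layer_params() / 1e6:.2f} M")          # ~7.08
```

The printed values (about 6.49 GFlops for N = 784, 11.75 GFlops for N = 1296, and 7.08 M parameters per layer) line up with the 6.49 GFlops, 11.8 GFlops, and 7.1 million figures quoted in this subsection.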
IELT can be further reduced and how the scalability of IELT
on other backbone networks can be improved.

V. CONCLUSION

This paper proposes an internal ensemble learning transformer architecture for fine-grained visual classification by exploring the relationship between the vision transformer operating mechanism and ensemble learning. We consider each head in the multi-head self-attention mechanism as a weak learner that selects several tokens from each transformer layer to form the cross-layer feature in the proposed multi-head voting module. The proposed cross-layer refinement module further extracts the refined feature from the cross-layer feature to suppress noise and fuses it with the cross-layer feature to improve the feature representation ability. At the same time, we consider the transformer layers as weak learners to obtain more valuable cross-layer features: the proposed dynamic selection module adjusts the token selection number of each layer in the multi-head voting module according to its contribution to the refined feature. We conducted experiments on five fine-grained visual classification datasets and achieved competitive performance compared with state-of-the-art methods. In the future, we will investigate how the number of parameters of IELT can be further reduced and how the scalability of IELT to other backbone networks can be improved.

REFERENCES

[1] W. Wang, Y. Cui, G. Li, C. Jiang, and S. Deng, "A self-attention-based destruction and construction learning fine-grained image classification method for retail product recognition," Neural Computing and Applications, vol. 32, no. 18, pp. 14613–14622, 2020.
[2] S. Zhang, T. Niu, Y. Wu, K. M. Zhang, T. J. Wallington, Q. Xie, X. Wu, and H. Xu, "Fine-grained vehicle emission management using intelligent transportation system data," Environmental Pollution, vol. 241, pp. 1027–1037, 2018.
[3] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie, "The caltech-ucsd birds-200-2011 dataset," California Institute of Technology, Tech. Rep. CNS-TR-2011-001, 2011.
[4] G. Van Horn, S. Branson, R. Farrell, S. Haber, J. Barry, P. Ipeirotis, P. Perona, and S. Belongie, "Building a bird recognition app and large scale dataset with citizen scientists: The fine print in fine-grained dataset collection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 595–604.
[5] A. Khosla, N. Jayadevaprakash, B. Yao, and L. Fei-Fei, "Novel dataset for fine-grained image categorization," in First Workshop on Fine-Grained Visual Categorization, IEEE Conference on Computer Vision and Pattern Recognition, Colorado Springs, CO, June 2011.
[6] M.-E. Nilsback and A. Zisserman, "Automated flower classification over a large number of classes," in 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing. IEEE, 2008, pp. 722–729.
[7] O. M. Parkhi, A. Vedaldi, A. Zisserman, and C. Jawahar, "Cats and dogs," in 2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2012, pp. 3498–3505.
[8] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[9] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, "An image is worth 16x16 words: Transformers for image recognition at scale," in International Conference on Learning Representations, 2021.
[10] X.-S. Wei, Y.-Z. Song, O. Mac Aodha, J. Wu, Y. Peng, J. Tang, J. Yang, and S. Belongie, "Fine-grained image analysis with deep learning: A survey," IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1, 2021.


[11] S. Huang, Z. Xu, D. Tao, and Y. Zhang, "Part-stacked cnn for fine-grained visual categorization," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1173–1182.
[12] X.-S. Wei, C.-W. Xie, J. Wu, and C. Shen, "Mask-cnn: Localizing parts and selecting descriptors for fine-grained bird species categorization," Pattern Recognition, vol. 76, pp. 704–714, 2018.
[13] Z. Yang, T. Luo, D. Wang, Z. Hu, J. Gao, and L. Wang, "Learning to navigate for fine-grained classification," in Proceedings of the European Conference on Computer Vision, 2018, pp. 420–435.
[14] M. Liu, C. Zhang, H. Bai, R. Zhang, and Y. Zhao, "Cross-part learning for fine-grained image classification," IEEE Transactions on Image Processing, vol. 31, pp. 748–758, 2022.
[15] S. Yang, S. Liu, C. Yang, and C. Wang, "Re-rank coarse classification with local region enhanced features for fine-grained image recognition," arXiv preprint arXiv:2102.09875, 2021.
[16] C. Liu, H. Xie, Z.-J. Zha, L. Ma, L. Yu, and Y. Zhang, "Filtration and distillation: Enhancing region attention for fine-grained visual categorization," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 07, 2020, pp. 11555–11562.
[17] J. Han, X. Yao, G. Cheng, X. Feng, and D. Xu, "P-cnn: Part-based convolutional neural networks for fine-grained visual categorization," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 2, pp. 579–590, 2022.
[18] Y. Rao, G. Chen, J. Lu, and J. Zhou, "Counterfactual attention learning for fine-grained visual categorization and re-identification," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 1025–1034.
[19] H. Zheng, J. Fu, T. Mei, and J. Luo, "Learning multi-attention convolutional neural network for fine-grained image recognition," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5209–5217.
[20] H. Zheng, J. Fu, Z.-J. Zha, and J. Luo, "Looking for the devil in the details: Learning trilinear attention sampling network for fine-grained image recognition," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 5012–5021.
[21] Y. Gao, X. Han, X. Wang, W. Huang, and M. Scott, "Channel interaction networks for fine-grained image categorization," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 07, 2020, pp. 10818–10825.
[22] Y. Ding, Y. Zhou, Y. Zhu, Q. Ye, and J. Jiao, "Selective sparse sampling for fine-grained image recognition," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 6599–6608.
[23] W. Luo, X. Yang, X. Mo, Y. Lu, L. S. Davis, J. Li, J. Yang, and S.-N. Lim, "Cross-x learning for fine-grained visual categorization," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 8242–8251.
[24] K. Xu, R. Lai, L. Gu, and Y. Li, "Multiresolution discriminative mixup network for fine-grained visual categorization," IEEE Transactions on Neural Networks and Learning Systems, pp. 1–13, 2021.
[25] T.-Y. Lin, A. RoyChowdhury, and S. Maji, "Bilinear convolutional neural networks for fine-grained visual recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 6, pp. 1309–1322, 2018.
[26] H. Zheng, J. Fu, Z.-J. Zha, and J. Luo, "Learning deep bilinear transformation for fine-grained image representation," in Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, Eds., vol. 32. Curran Associates, Inc., 2019.
[27] Y. Chen, Y. Bai, W. Zhang, and T. Mei, "Destruction and construction learning for fine-grained image recognition," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 5157–5166.
[28] R. Du, D. Chang, A. K. Bhunia, J. Xie, Z. Ma, Y.-Z. Song, and J. Guo, "Fine-grained visual classification via progressive multi-granularity training of jigsaw patches," in European Conference on Computer Vision. Springer, 2020, pp. 153–168.
[29] J. Song and R. Yang, "Feature boosting, suppression, and diversification for fine-grained visual classification," in 2021 International Joint Conference on Neural Networks, 2021, pp. 1–8.
[30] G. Sun, H. Cholakkal, S. Khan, F. Khan, and L. Shao, "Fine-grained recognition: Accounting for subtle differences between similar classes," Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 07, pp. 12047–12054, 2020.
[31] Y. Zhao, K. Yan, F. Huang, and J. Li, "Graph-based high-order relation discovery for fine-grained recognition," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 15079–15088.
[32] P. Zhuang, Y. Wang, and Y. Qiao, "Learning attentive pairwise interaction for fine-grained classification," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 07, 2020, pp. 13130–13137.
[33] J. Yu, M. Tan, H. Zhang, Y. Rui, and D. Tao, "Hierarchical deep click feature prediction for fine-grained image recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 2, pp. 563–578, 2022.
[34] M. Tan, J. Yu, Z. Yu, F. Gao, Y. Rui, and D. Tao, "User-click-data-based fine-grained image recognition via weakly supervised metric learning," ACM Transactions on Multimedia Computing, Communications, and Applications, vol. 14, no. 3, pp. 1–23, 2018.
[35] M. Wang, P. Zhao, X. Lu, F. Min, and X. Wang, "Fine-grained visual categorization: A spatial-frequency feature fusion perspective," IEEE Transactions on Circuits and Systems for Video Technology, pp. 1–1, 2022.
[36] J. He, J.-N. Chen, S. Liu, A. Kortylewski, C. Yang, Y. Bai, and C. Wang, "Transfg: A transformer architecture for fine-grained recognition," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 1, 2022, pp. 852–860.
[37] Y. Hu, X. Jin, Y. Zhang, H. Hong, J. Zhang, Y. He, and H. Xue, "Rams-trans: Recurrent attention multi-scale transformer for fine-grained image recognition," in Proceedings of the 29th ACM International Conference on Multimedia, 2021, pp. 4239–4248.
[38] Y. Zhang, J. Cao, L. Zhang, X. Liu, Z. Wang, F. Ling, and W. Chen, "A free lunch from vit: adaptive attention multi-scale fusion transformer for fine-grained visual recognition," in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2022, pp. 3234–3238.
[39] X. Liu, L. Wang, and X. Han, "Transformer with peak suppression and knowledge guidance for fine-grained image recognition," Neurocomputing, vol. 492, pp. 137–149, 2022.
[40] J. Wang, X. Yu, and Y. Gao, "Feature fusion vision transformer for fine-grained visual categorization," British Machine Vision Conference, 2021.
[41] Y. Zhao, J. Li, X. Chen, and Y. Tian, "Part-guided relational transformers for fine-grained visual recognition," IEEE Transactions on Image Processing, vol. 30, pp. 9470–9481, 2021.
[42] H. Zhu, W. Ke, D. Li, J. Liu, L. Tian, and Y. Shan, "Dual cross-attention learning for fine-grained visual categorization and object re-identification," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 4692–4702.
[43] Z. Yu, J. Yu, C. Xiang, J. Fan, and D. Tao, "Beyond bilinear: Generalized multimodal factorized high-order pooling for visual question answering," IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 12, pp. 5947–5959, 2018.
[44] C. Zhang, H. Bai, and Y. Zhao, "Fine-grained image classification by class and image-specific decomposition with multiple views," IEEE Transactions on Multimedia, pp. 1–11, 2022.
[45] C. Zhang, G. Lin, Q. Wang, F. Shen, Y. Yao, and Z. Tang, "Guided by meta-set: A data-driven method for fine-grained visual recognition," IEEE Transactions on Multimedia, pp. 1–13, 2022.
[46] S. Ren, K. He, R. Girshick, and J. Sun, "Faster r-cnn: Towards real-time object detection with region proposal networks," in Advances in Neural Information Processing Systems, C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, Eds., vol. 28. Curran Associates, Inc., 2015.
[47] D. Zhou, B. Kang, X. Jin, L. Yang, X. Lian, Z. Jiang, Q. Hou, and J. Feng, "Deepvit: Towards deeper vision transformer," arXiv preprint arXiv:2103.11886, 2021.
[48] K. Kim, B. Wu, X. Dai, P. Zhang, Z. Yan, P. Vajda, and S. J. Kim, "Rethinking the self-attention in vision transformers," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 3071–3075.
[49] O. Sagi and L. Rokach, "Ensemble learning: A survey," Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 8, no. 4, p. e1249, 2018.
[50] L. Breiman, "Bagging predictors," Machine Learning, vol. 24, no. 2, pp. 123–140, 1996.
[51] M. Tan, F. Yuan, J. Yu, G. Wang, and X. Gu, "Fine-grained image classification via multi-scale selective hierarchical biquadratic pooling," ACM Trans. Multimedia Comput. Commun. Appl., vol. 18, no. 1s, Jan. 2022.
[52] Y. Freund and R. E. Schapire, "Experiments with a new boosting algorithm," in Proceedings of the Thirteenth International Conference on Machine Learning, ser. ICML'96. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1996, pp. 148–156.


[53] A. Behera, Z. Wharton, P. R. Hewage, and A. Bera, "Context-aware attentional pooling (cap) for fine-grained visual classification," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 2, 2021, pp. 929–937.
[54] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds., vol. 30. Curran Associates, Inc., 2017.
[55] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "Bert: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[56] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei, "Language models are few-shot learners," in Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, Eds., vol. 33. Curran Associates, Inc., 2020, pp. 1877–1901.
[57] J. Yu, J. Li, Z. Yu, and Q. Huang, "Multimodal transformer with multi-view visual representation for image captioning," IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 12, pp. 4467–4480, 2020.
[58] M. Tan, J. Yu, Z. Yu, F. Gao, Y. Rui, and D. Tao, "User-click-data-based fine-grained image recognition via weakly supervised metric learning," ACM Trans. Multimedia Comput. Commun. Appl., vol. 14, no. 3, Jul. 2018.
[59] J. L. Ba, J. R. Kiros, and G. E. Hinton, "Layer normalization," arXiv preprint arXiv:1607.06450, 2016.
[60] X. Ding, Y. Guo, G. Ding, and J. Han, "Acnet: Strengthening the kernel skeletons for powerful cnn via asymmetric convolution blocks," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 1911–1920.
[61] E. S. Gedraite and M. Hadad, "Investigation on the effect of a gaussian blur in image filtering and segmentation," in Proceedings ELMAR-2011, 2011, pp. 393–396.
[62] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., "Pytorch: An imperative style, high-performance deep learning library," Advances in Neural Information Processing Systems, vol. 32, 2019.
[63] J. Fu, H. Zheng, and T. Mei, "Look closer to see better: Recurrent attention convolutional neural network for fine-grained image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4438–4446.
[64] R. Du, J. Xie, Z. Ma, D. Chang, Y.-Z. Song, and J. Guo, "Progressive learning of category-consistent multi-granularity features for fine-grained visual classification," IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1, 2021.
[65] P. Guo and R. Farrell, "Aligned to the object, not to the image: A unified pose-aligned representation for fine-grained recognition," in 2019 IEEE Winter Conference on Applications of Computer Vision, 2019, pp. 1876–1885.
[66] Y. Cui, Y. Song, C. Sun, A. Howard, and S. Belongie, "Large scale fine-grained categorization and domain-specific transfer learning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4109–4118.
[67] D. Korsch, P. Bodesheim, and J. Denzler, "Classification-specific parts for improving fine-grained visual categorization," in Pattern Recognition, G. A. Fink, S. Frintrop, and X. Jiang, Eds. Cham: Springer International Publishing, 2019, pp. 62–75.
[68] L. Zhang, S. Huang, W. Liu, and D. Tao, "Learning a mixture of granularity-specific experts for fine-grained categorization," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 8331–8340.
[69] H. Touvron, A. Vedaldi, M. Douze, and H. Jegou, "Fixing the train-test resolution discrepancy," in Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, Eds., vol. 32. Curran Associates, Inc., 2019.
[70] Y. Peng, X. He, and J. Zhao, "Object-part attention model for fine-grained image classification," IEEE Transactions on Image Processing, vol. 27, no. 3, pp. 1487–1500, 2018.
[71] H. Wu, B. Xiao, N. Codella, M. Liu, X. Dai, L. Yuan, and L. Zhang, "Cvt: Introducing convolutions to vision transformers," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 22–31.
[72] M. Simon and E. Rodner, "Neural activation constellations: Unsupervised part model discovery with convolutional networks," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1143–1151.
[73] L. Xie, L. Zheng, J. Wang, A. L. Yuille, and Q. Tian, "Interactive: Inter-layer activeness propagation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 270–279.
[74] D. Chang, Y. Ding, J. Xie, A. K. Bhunia, X. Li, Z. Ma, M. Wu, J. Guo, and Y.-Z. Song, "The devil is in the channels: Mutual-channel loss for fine-grained image classification," IEEE Transactions on Image Processing, vol. 29, pp. 4683–4695, 2020.
[75] H. Touvron, A. Sablayrolles, M. Douze, M. Cord, and H. Jégou, "Grafit: Learning fine-grained image representations with coarse labels," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 874–884.
