Abstract—Recently, vision transformers (ViTs) have been investigated in fine-grained visual recognition (FGVC) and are now considered state of the art. However, most ViT-based works ignore the different learning performances of the heads in the multi-head self-attention mechanism.
© 2023 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.See https://www.ieee.org/publications/rights/index.html for more information.
This article has been accepted for publication in IEEE Transactions on Multimedia. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/TMM.2023.3244340
parts of an object automatically, an attention mechanism is introduced to find the discriminative regions without additional annotation [18]–[23]. For example, Rao et al. [18] presented counterfactual attention to encourage the network to learn more useful attention information for feature representation. The attention-based methods also demonstrate the advantages of exploring the relationships between feature channels [19]–[21] in finding discriminative parts. Another category is the feature encoding methods, which aim to learn rich features through a feature fusion strategy [24]–[35]. For example, Xu et al. [24] proposed to mix the discriminative regions of samples in different categories with their labels, so that the network can better discover the common features and the specific features of different categories.

Recently, vision transformers (ViTs) [9] have been shown to be promising in many computer vision tasks. The core of a ViT is its multi-head self-attention mechanism (MHSA), which calculates the interrelationships of the input patches to capture a global feature representation. ViTs can capture global features but lack the ability to learn local features, which limits their application to FGVC tasks. Therefore, recent works [36]–[41] have attempted to enhance the regular ViT for capturing local features. He et al. [36] proposed to select more discriminative image patches to improve the ability to extract local features for classification. Wang et al. [40] proposed to fuse the features of different transformer layers to improve the feature representation ability of the network. However, the abilities to estimate discriminative regions differ among the layers and the MHSA heads of a ViT [47], [48]. Additionally, the discriminative regions are spatially close, while the noise regions are scattered. Most existing ViT-based methods ignore the different learning performances of heads and layers and the spatial relationships among different attention maps.

To address this issue, in this work, we propose a novel internal ensemble learning transformer (IELT) for FGVC. The ensemble learning strategy can solve the problem of unbalanced learning performances [49]. In the IELT, all attention heads of the MHSA in each layer are considered weak learners in terms of selecting valuable tokens from each layer, and all transformer layers are considered weak learners in terms of contributing more valuable cross-layer features.

First, we propose the multi-head voting (MHV) module, where each head of the MHSA in each transformer layer votes for several tokens with the highest responses in the attention map, similar to the bagging algorithm [50] in ensemble learning. The voting results of all heads are assembled to obtain a score map. To suppress the scattered noise, a convolution operation is conducted, which enhances the concentrated discriminative regions and yields an enhanced score map. According to the enhanced score map, several tokens are selected from the output of each transformer layer as cross-layer tokens.

Second, existing works [23], [40], [51] have revealed that cross-layer feature fusion can improve the feature representation ability. However, it remains a challenge to deal with the noise introduced by the cross-layer feature. To make full use of the cross-layer feature and further reduce the influence of noise, we propose a cross-layer refinement (CLR) module. In the CLR module, the cross-layer tokens selected by all previous layers are concatenated to extract the cross-layer feature. Then, an additional selection is conducted on the cross-layer tokens to obtain refined tokens and a refined feature. We use the cross-layer feature to guide the refined feature and design the assist logits operation for prediction. In addition, according to the proportion of refined tokens in the CLR module, we attempt to obtain the contribution of each layer to the refined feature. Inspired by the boosting algorithm [52] in ensemble learning, we propose a dynamic selection (DS) module to update the selection number of the MHV module on each layer. This can reduce the effect of poorly performing layers and improve the quality of the cross-layer feature.

Overall, we summarize the main contributions of our work as follows:
1) We propose a novel internal ensemble learning transformer (IELT) that introduces ensemble learning into a ViT to solve the problem of inconsistent learning performances among different MHSA heads and transformer layers.
2) We develop an effective multi-head voting (MHV) module that allows all MHSA heads in each layer to select the desired tokens as the cross-layer feature based on their attention maps and spatial relationships.
3) We propose a cross-layer refinement (CLR) module to fuse the cross-layer feature and further extract the refined feature to improve the feature representation ability.
4) A dynamic selection (DS) module is designed to dynamically update the selection number of tokens in the MHV module of each layer based on their importance.

The rest of this paper is organized as follows. Section II briefly reviews the related works. Section III describes the proposed method. Section IV reports the experimental results and a comprehensive analysis. Finally, conclusions and suggestions for further work are presented in Section V.

II. RELATED WORK

A. Methods Based on Convolutional Neural Networks

Existing CNN-based works for FGVC can be divided into two categories: part locating methods and feature encoding methods.

1) Part Locating Methods: Part locating methods [11]–[16] aim to find discriminative regions to distinguish subtle inter-class differences. Early part locating methods [11], [12] directly use the bounding boxes and part annotations. Huang et al. [11] proposed to extract features from annotated parts during training to enhance the recognition ability of the network. To avoid expensive labeling, more weakly supervised locating methods have been proposed recently. Liu et al. [14] proposed a weakly supervised cross-part convolutional neural network for FGVC that localizes multi-regional features and learns the cross-part feature. Yang et al. [15] proposed to extract the global features for coarse class prediction and then re-rank the prediction results by the local features extracted from the feature pyramid network. Liu et al. [16] proposed a filter
learning method to improve the part localization ability of the network. Ardhendu et al. [53] proposed to encode the spatial arrangement and visual features of multi-scale regions and capture subtle variations of these regions via bilinear pooling. However, the locating modules of these methods often require a large number of parameters to obtain accurate part locating results, thus increasing the training complexity.

The attention-based part locating methods [18]–[23] have gained popularity, since they can find discriminative regions automatically without a region proposal network. Liu et al. [19] proposed to use the attention mechanism to group the feature channels for discriminative parts. Ding et al. [22] proposed a selective sparse sampling framework that divides the parts into discriminative parts and complementary parts according to the response differences of the attention map. Luo et al. [23] proposed a Cross-X learning framework that extracts features of multiple attention regions and uses cross-layer constraints for multi-scale feature representations. To obtain rich semantic information, Gao et al. [21] proposed to mine the channel-wise relationships of the feature map through the attention mechanism. Zheng et al. [20] proposed using trilinear attention to integrate feature maps through spatial relationships and highlight the detailed areas of the input image at a high resolution. The aforementioned methods are based on CNN models. Building on the recent ViT models, in this work, we group the channels by MHSA heads and propose the multi-head voting module to locate the tokens of discriminative regions.

2) Feature Encoding Methods: Feature encoding methods, e.g., [24]–[35], aim to learn richer features and are usually realized by the fusion of multiple network features or by various novel training strategies. The bilinear pooling methods [25], [26] use multiple networks to extract features separately and fuse the features for recognition. Several authors [27], [28] have proposed to partition the input images into small patches and shuffle them to break the spatial correlation of the images. This can force the network to focus on more valuable fine-grained features. The peak suppression methods [29], [30] mask the highest response area so that the network can find more potential features and improve the generalization ability. Some methods [33], [34] utilize the click features from the user's click data for fusing and regularizing the fine-grained visual features. Wang et al. [35] designed a bimodal network that includes spatial-domain and frequency-domain branches to obtain the spatial-frequency features for classification. However, it is difficult to directly indicate what features these networks have learned. In this work, we fuse the cross-layer feature of discriminative regions in each layer to ensure the interpretability of the network.

B. Methods Based on Vision Transformers

The transformer framework [54] was proposed to process sequence data and was applied to natural language processing tasks [55], [56] and multimodal learning tasks [57]. The core of the transformer is a multi-head self-attention mechanism that groups the channels and calculates the interrelationships of the input token sequences. Dosovitskiy et al. [9] proposed the ViT to split the 2D image into a patch sequence and use a pure transformer architecture to achieve a performance similar to that of CNNs in computer vision tasks. ViTs can capture global features, but they cannot easily capture local features, which limits their performance in FGVC tasks. To solve this problem, He et al. [36] proposed a transformer architecture for fine-grained recognition (TransFG), which splits the images into overlapping patches to obtain richer features and uses the discriminative image patches for classification. In [37], [38], the most discriminative region is located and fed to the network again for detailed feature representation. Liu et al. [39] proposed to suppress the most responsive token during prediction, which allows the network to find richer features and add them to the knowledge base. Sun et al. [58] proposed to integrate structural information through a ViT by mining the spatial context relations of salient patches and to enhance the feature robustness with a contrastive learning strategy. Zhao et al. [41] proposed to use several transformer blocks to extract both global features and local features. Zhu et al. [42] proposed a dual cross-attention learning algorithm that enhances the interactions between global images and local high-response regions and establishes interactions between image pairs. Inspired by these works, we propose a cross-layer refinement module to make full use of the cross-layer feature selected by each layer of the ViT.

The feature fusion vision transformer (FFVT) [40], which chooses the discriminative tokens from each transformer layer and conducts feature fusion in the last layer to form the cross-layer feature, is most closely related to our work. However, the FFVT ignores the different learning performances of the transformer layers and of the heads in the MHSA. We propose a multi-head voting module to suppress the influence of heads with weak performances and use a dynamic selection module to enable the layers with better learning performances to contribute more cross-layer features.

C. Ensemble Learning

In machine learning, ensemble learning aims to combine multiple weak learners to form a strong learner. Through the combination, the errors of an individual learner can be compensated by other learners. Ensemble learning usually uses voting methods to simulate the wisdom of the crowd and can also expand the search space to better fit the data space [49].

The classic algorithms in ensemble learning include the bagging algorithm [50] and the boosting algorithm [52]. The bagging algorithm allocates the training samples to each weak learner through random sampling with replacement. All weak learners aggregate their votes to obtain the final prediction result. The boosting algorithm uses all samples at each iteration and updates the weight of each sample according to the prediction result during the training process. After the training, the weight of each weak learner is updated according to its performance. The bagging algorithm reduces the prediction variance, and the boosting algorithm reduces the prediction bias [49].

In this work, we explore the strong correlation between the MHSA and cross-layer feature fusion on the one hand and ensemble learning on the other. The characteristics of ensemble learning can solve the problem
of the different learning performances of each head and each layer. To this end, we integrate ensemble learning into the transformer and propose the internal ensemble learning transformer to further improve the performance of ViTs in FGVC.

III. THE PROPOSED METHOD

The architecture of the proposed internal ensemble learning transformer (IELT) for FGVC is illustrated in Figure 2. First, the features of the input image are extracted through linear projection and represented as a sequence of tokens, which are concatenated with the class token that carries the category features. The concatenated tokens are sent into the transformer layers. To improve the feature representation ability, the multi-head voting (MHV) module is proposed to select the tokens of discriminative regions from the outputs of the previous L-1 layers, where L is the number of layers of the original ViT. In the proposed cross-layer refinement (CLR) module, the input of the L-th layer is then replaced by the concatenation of 1) the MHV-selected tokens and 2) the class token of the (L-1)-th layer to obtain the cross-layer feature. To refine the extracted cross-layer feature, the output tokens are selected by the MHV module and concatenated with the class token again as the input of the (L+1)-th layer. The class tokens extracted from the outputs of the L-th layer and the (L+1)-th layer, which represent the cross-layer feature and the refined feature, respectively, are fed into the fully connected (FC) layer. The cross-layer feature then guides the refined feature for prediction by the assist logits (an operation that we designed, inspired by [39]). In addition, a dynamic selection (DS) module is proposed to update the selection numbers of the previous L-1 layers in the MHV module according to the selection results in the CLR module.

This section first introduces the backbone of our method in Section III-A. We then describe the MHV module, the CLR module, and the DS module in Section III-B, Section III-C, and Section III-D, respectively.

A. The Backbone Network

The proposed IELT adopts the vision transformer (ViT) [9] as the backbone network. First, the input image I is divided into N = n1 × n2 non-overlapping patches, where n1 and n2 are the numbers of patches in each row and column of the input image, respectively. The non-overlapping patches are denoted by I_i^p ∈ R^{P×P×c}, i = 1, 2, ..., N, where (P, P) is the spatial dimension, and c is the number of channels. Using a learnable linear projection E ∈ R^{cP²×D}, the patch I_i^p can be transformed into an embedded token x_i = I_i^p E ∈ R^D, i = 1, 2, ..., N. The embedded tokens X = [x_1, x_2, ..., x_N] and the class token x_class ∈ R^D are then concatenated and added to the trainable positional encoding E_pos ∈ R^{(N+1)×D} to form the initial input token sequence X_0:

X_0 = [x_class; x_1; x_2; ...; x_N] + E_pos    (1)

The ViT backbone consists of multiple transformer layers, each of which includes the multi-head self-attention and the multi-layer perceptron. For one transformer layer, if the input is X_in, the output token sequence X_out, with the same size as X_in, is computed as

X′ = MHSA(LN(X_in)) + X_in
X_out = MLP(LN(X′)) + X′    (2)

where LN(·) is the layer normalization [59], MHSA(·) is the multi-head self-attention, and MLP(·) is the multi-layer perceptron.

B. The Multi-Head Voting Module

The multi-head self-attention (MHSA) in each transformer layer divides the input into K groups along the channel dimension and feeds each group to a head to calculate self-attention. The attention map generated by each head represents the degree of attention paid to each region. However, the performances of the heads in locating the discriminative regions are not consistent [47], [48]: in some attention maps, the peak areas are located on the discriminative regions, while in others, the peak areas are concentrated on the background. To mitigate the effects of heads with poor locating performances and select valuable tokens from each layer, inspired by the bagging algorithm [50], we propose the multi-head voting (MHV) module, in which the multiple heads of the MHSA are treated as weak learners for selecting tokens of discriminative regions.

For the l-th (l ∈ {1, 2, ..., L-1}) transformer layer, suppose the input and output token sequences of this layer are X_l^in and X_l^out, respectively. Let A = [A^1; A^2; ...; A^k; ...; A^K] denote the attention scores of the class token, where A^k ∈ R^N is the attention score of the k-th head, taken from the attention map generated by the MHSA. To characterize the spatial relationship, the attention score A^k is resized into a two-dimensional matrix A^k′ ∈ R^{n1×n2} according to the original spatial arrangement, as shown in Figure 3. To select the valuable tokens based on the attention score A^k′, we define a score map M^k ∈ R^{n1×n2} as

M^k(i, j) = 1, if A^k′(i, j) is a top-v value; 0, otherwise    (3)

where v is a hyper-parameter that defines the number of votes for each head. The score maps of all heads are then combined to obtain the total score map M′ ∈ R^{n1×n2} as

M′ = Σ_{k=1}^{K} M^k    (4)

To suppress the noise and enhance the discriminative regions, a convolution operation is performed on the total score map M′ with a convolution kernel K. The enhanced score map M∗ ∈ R^{n1×n2} is calculated by

M∗ = M′ ∗ K    (5)

where ∗ represents the convolution operator. The adopted convolution kernel K is as follows:

K = [1 2 1; 2 4 2; 1 2 1]    (6)
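For concreteness, the voting and enhancement steps of the MHV module (Eqs. (3)–(6)) can be sketched in plain Python. This is a minimal illustration rather than the authors' implementation: the function names and the list-based score maps are our own, and a real implementation would operate on batched attention tensors.

```python
import itertools

def vote_score_map(attn_heads, n1, n2, v):
    """Build the total score map M' (Eqs. (3)-(4)): each head votes for the
    top-v positions of the class token's attention over the n1*n2 patches."""
    total = [[0] * n2 for _ in range(n1)]
    for head in attn_heads:  # head: flat list of n1*n2 attention scores
        top = sorted(range(len(head)), key=lambda i: head[i], reverse=True)[:v]
        for idx in top:
            total[idx // n2][idx % n2] += 1  # reshape vote to 2D position
    return total

# Fixed Gauss-like kernel of Eq. (6).
GAUSS_KERNEL = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]

def enhance(score, kernel=GAUSS_KERNEL):
    """3x3 convolution with zero padding (Eq. (5)); the symmetric kernel makes
    correlation and convolution coincide. Clustered votes are amplified,
    isolated noisy votes stay small."""
    n1, n2 = len(score), len(score[0])
    out = [[0] * n2 for _ in range(n1)]
    for i, j in itertools.product(range(n1), range(n2)):
        for di, dj in itertools.product((-1, 0, 1), repeat=2):
            ii, jj = i + di, j + dj
            if 0 <= ii < n1 and 0 <= jj < n2:
                out[i][j] += kernel[di + 1][dj + 1] * score[ii][jj]
    return out
```

For example, two heads that both place their single vote (v = 1) on the top-left patch of a 2×2 grid yield a total score map [[2, 0], [0, 0]], which the kernel spreads and peaks at that patch.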
Fig. 2. Overview of the proposed method. The multi-head voting module selects the tokens of discriminative regions from each layer. The cross-layer refinement module extracts the cross-layer feature and obtains the refined feature through an additional selection. The cross-layer feature guides the refined feature in the prediction. The dynamic selection module updates the selection number of each layer in the multi-head voting module.

Fig. 3. The process of the multi-head voting module. We obtain the attention map of each head in one layer and locate its largest v values (v is set to 4 in this figure for clear observation). We then gather and merge the voting results of all heads. To suppress the noise, we enhance the score map by conducting a convolution operation on it. Finally, we determine the selected tokens.

Here, we can use either a learnable convolution kernel or a fixed convolution kernel. Inspired by the asymmetric convolution kernel [60], we adopt a fixed Gauss-like kernel in our implementation. It can effectively remove the scattered noise and keep the edges of the main parts of the object distinct [61]. The enhanced score map M∗ of a sample is visualized in Figure 3. To filter the noise, we define a selection number vector m ∈ R^{L-1} for the previous L-1 layers and a vector id ∈ R^{m(l)}, whose components are the indexes of the largest m(l) values of the flattened M∗.
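Extracting the id vector described above amounts to taking the indexes of the m(l) largest entries of the flattened enhanced score map. A minimal sketch (the function name is ours; ties are broken by position here, whereas the actual implementation may break them differently):

```python
def select_token_indexes(enhanced, m_l):
    """Return id: indexes of the m(l) largest values of the flattened
    enhanced score map M*, in row-major order matching the token layout."""
    flat = [v for row in enhanced for v in row]  # flatten n1 x n2 map
    ranked = sorted(range(len(flat)), key=lambda i: flat[i], reverse=True)
    return ranked[:m_l]
```

The returned indexes are then used to gather the corresponding tokens from the layer output, as in Eq. (7) below.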
The determination of the selection number vector m is presented in Section III-D. For the l-th layer, we select m(l) tokens from the output X_l^out by id and concatenate them to form a selected token sequence X_l^sel ∈ R^{m(l)×D} of the l-th layer:

X_l^sel = [X_{l,id(1),:}^out; X_{l,id(2),:}^out; ...; X_{l,id(m(l)),:}^out]    (7)

The proposed MHV module reduces the influence of heads with weak performances and selects the tokens of discriminative regions in each layer.

C. The Cross-Layer Refinement Module

The existing methods [23], [40], [51] have reported the effectiveness of cross-layer feature fusion in enhancing the feature representation. However, the direct fusion of the cross-layer feature may involve the redundant information of unreliable regions, which affects the final classification performance. In order to better exploit the cross-layer feature and suppress the noise, we propose the cross-layer refinement (CLR) module, as shown in Figure 2. In this module, the inputs are the selected tokens of all previous layers, which are considered as cross-layer tokens. Based on the cross-layer tokens, we select several tokens by the MHV module, named refined tokens. The cross-layer feature and the refined feature are extracted from the cross-layer tokens and the refined tokens, respectively, by the transformer layers. In order to avoid a loss of detail, the assist logits operation is designed, as suggested in [39]. The refined feature and the cross-layer feature are combined by the assist logits to generate the final prediction results.

To extract the cross-layer feature, we concatenate the class token of the (L-1)-th layer and the cross-layer tokens as the input of the L-th layer:

X_L^in = [x_{L-1}^class; X_1^sel; X_2^sel; ...; X_{L-1}^sel]    (8)

The output of the (L+1)-th layer is computed by Eq. (2) and represented by X_{L+1}^out. The refined feature is x_{L+1}^class = X_{L+1,1,:}^out, which is the noise-filtered cross-layer feature.

Fig. 4. The process of the assist logits. The cross-layer feature first makes the prior prediction, which then takes the Hadamard product with the summed weights of the FC layer to obtain the cross-layer logits. The refined feature is added to the cross-layer logits for the final prediction. We calculate the cross-entropy loss of both predictions. We omit the LN(·) before the FC layer for a clear view.

After obtaining the cross-layer feature x_L^class and the refined feature x_{L+1}^class, inspired by [39], we propose the assist logits operation, which uses the prior prediction result to guide the final prediction. The process of the assist logits is shown in Figure 4. The prior prediction result p ∈ R^C is computed from the cross-layer feature as follows:

p = softmax(FC(LN(x_L^class)))    (10)

where C is the number of categories, and softmax(·) is the softmax operation. As the weight W = [w_1, w_2, ..., w_D] ∈ R^{C×D} in the FC layer represents the responses of the embedding dimensions to the predicted categories, the cross-layer logits y ∈ R^C are computed from the prior prediction p and the weight W as follows:

y = p ⊙ Σ_{i=1}^{D} w_i    (11)

D. The Dynamic Selection Module

To obtain more discriminative features, inspired by the boosting algorithm [52], the dynamic selection (DS) module
is proposed. In the DS module, the transformer layers are considered as weak learners that contribute tokens of discriminative regions and useful cross-layer features, and their contributions are balanced by updating the selection number vector m of the MHV modules of the previous L-1 layers, as shown in Figure 2. To determine the contribution of each weak learner, the indexes of the refined tokens in the CLR module are exploited.

The selection number vector m ∈ R^{L-1} of the previous L-1 layers is computed by m = ⌈sr⌉, where s is the total selection number, and r ∈ R^{L-1} is the selection ratio of the previous L-1 layers, which is initialized by r = 1/(L-1). To characterize the contribution of each layer to the refined feature, we count the number of tokens of each layer that are selected as refined tokens. For the l-th layer, let [a(l), b(l)) denote the index interval of the tokens contributed by this layer, i.e.,

a(l) = 0, if l = 1;  a(l) = Σ_{i=1}^{l-1} m(i), if l > 1;  b(l) = a(l) + m(l)    (14)

where l ∈ {1, 2, ..., L-1}. By determining whether each element of the refined token indexes id′ is within the interval [a(l), b(l)), a count number q(l) is obtained. Specifically, if an element of id′ is within the interval [a(l), b(l)), then q(l) ← q(l) + 1, as presented in Algorithm 1. An auxiliary selection ratio of the l-th layer is calculated as r′(l) = q(l)/t. Using the auxiliary selection ratio r′(l) and the moving rate θ, we update the selection ratio r and the selection number vector m as follows:

r ← (1-θ)r + θr′;  m = ⌈sr⌉    (15)

Algorithm 1 Dynamic Selection
Input: The number of refined tokens t; the indexes of the refined tokens id′ ∈ R^t in the CLR module; the total selection number s; the selection ratio r ∈ R^{L-1}; the moving rate θ.
Output: New selection number vector m ∈ R^{L-1}.
1: a = b = 0                    ▷ Index interval
2: m = ⌈sr⌉                    ▷ Original selection number vector m ∈ R^{L-1}
3: r′ = 0                       ▷ New selection ratio r′ ∈ R^{L-1}
4: for l = 1, ..., L-1 do
5:    a = b; b = a + m(l)        ▷ Index interval of layer l
6:    q = 0                      ▷ Refined tokens from layer l
7:    for i = 1, ..., t do
8:       if a ≤ id′(i) < b then
9:          q = q + 1
10:      end if
11:   end for
12:   r′(l) = q/t                ▷ New selection ratio of layer l
13: end for
14: r ← (1-θ)r + θr′            ▷ Updating the selection ratio
15: m = ⌈sr⌉
16: return m

It can be inferred that the more the tokens selected from a certain layer contribute to the refined feature, the larger the selection ratios r and r′ of this layer are, and vice versa. This helps to dynamically increase the selection numbers of layers with a strong learning ability and decrease the selection numbers of layers with a weak learning ability. Thus, the dynamic selection can improve the feature representation ability for classification. Through the DS module, the network can automatically adjust the contribution of each layer to the cross-layer feature and learn more valuable cross-layer features.

Note that, in the DS module, the total selection number s and the number of refined tokens t do not change; only the selection numbers of the previous L-1 layers change. To be specific, the proposed network uses the ViT as the backbone, and the layer normalization [59] in each layer can ensure the stability of the network. In addition, all of the tokens in the CLR module come from the previous L-1 layers, which is similar to previous work [40]. It does not affect the parameters receiving a gradient value in the corresponding layer [62], which ensures the differentiability of the entire network.

IV. EXPERIMENTS

In this section, the five FGVC datasets we used as benchmarks and the experiment settings are introduced in Section IV-A. The experimental results compared with the state-of-the-art works are presented in Section IV-B. The ablation studies are presented in Section IV-C. The visualization of the proposed method is presented in Section IV-D. The hyper-parameter analyses are reported in Section IV-E. Finally, the model complexity is analyzed in Section IV-F.

A. Datasets and Experiment Settings

1) Datasets: To verify the robustness and effectiveness of the proposed method, the experiments were conducted on five fine-grained datasets of various scales, including two small datasets, Oxford 102 Flowers [6] and Oxford-IIIT Pet [7], two medium datasets, CUB-200-2011 [3] and Stanford Dogs [5], and a large dataset, NABirds [4]. Table I shows the details of the five datasets. Top-1 accuracy is adopted as the evaluation metric. We only use classification labels, without any additional annotations, for supervised training.

2) Experiment Settings: The ViT-B-16 model pre-trained on ImageNet21K was adopted as the backbone network. The input images were resized to 448×448. Random cropping, horizontal flipping, and color jittering were applied for training, and center cropping was applied for testing. The model was trained by the stochastic gradient descent (SGD) optimizer with a momentum of 0.9, and a cosine annealing scheduler was applied. The initial learning rate was set to 0.002 for Stanford Dogs and 0.02 for the other four datasets. The model was trained for 50 epochs, and the batch size was set to 8 on all datasets.

In the MHV module, the vote number v of each head was set to 24. In the CLR module, the number of refined tokens t was set to 24, and the proportion of loss λ was set to 0.4. In the DS module, the total selection number s was set to 126, and the selection ratio r of each layer was initialized to 1/(L-1). The moving rate θ was set to 1e-4 for Stanford Dogs
Authorized licensed use limited to: South China Normal University. Downloaded on September 20,2023 at 14:41:29 UTC from IEEE Xplore. Restrictions apply.
© 2023 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.See https://www.ieee.org/publications/rights/index.html for more information.
This article has been accepted for publication in IEEE Transactions on Multimedia. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/TMM.2023.3244340
TABLE I TABLE II
F INE - GRAINED DATASET STATISTICS . C OMPARISON RESULTS ON CUB-200-2011 AND S TANFORD D OGS .
Dataset Class Training Testing Accuracy (%)
Method Backbone
CUB-200-2011 [3] 200 5994 5794 CUB Dogs
NABirds [4] 555 23929 24633 ResNet [8] ResNet-50 84.5 82.7
Stanford Dogs [5] 120 12000 8580 RA-CNN [63] VGG-19 85.3 87.3
Oxford 102 Flowers [6] 102 2040 6179 MA-CNN [19] ResNet-50 86.5 -
Oxford-IIIT Pet [7] 37 3680 3669 NTS-Net [13] ResNet-50 87.5 87.5
Cross-X [23] ResNet-50 87.7 88.9
DCL [27] ResNet-50 87.8 -
CIN [21] ResNet-101 88.1 87.6
S3N [22] ResNet-50 88.5 87.1
and NABirds and to 1e-3 for the others. In the first 10 epochs, MRDMN [24] ResNet-50 88.8 89.1
the DS module was not used, as there are domain gaps between FDL [16] DenseNet-161 89.1 84.9
the pre-trained dataset and fine-grained datasets, which cause PMG [28] ResNet-50 89.6 -
FBSD [29] ResNet-50 89.8 88.1
the low-level features to be more helpful for classification in MSHQP [51] ResNet-50 89.0 90.4
the initial epochs. If the DS module is used, it will select API-Net [32] DenseNet-161 90.0 90.3
more tokens in the lower layers, which is not desirable. Thus, PRIS [64] ResNet-101 90.0 90.7
CAL [18] ResNet-101 90.6 88.7
the DS module was used after the network was optimized on CCFR [15] ResNet-50 91.1 -
the target dataset. The experimental results of TransFG [36] ViT [9] ViT-B-16 91.0 90.6
on Stanford Dogs were adopted from [40]. Furthermore, our FFVT [40] ViT-B-16 91.6 91.5
TransFG [36] ViT-B-16 91.7 90.6
model was implemented in PyTorch over four Nvidia Titan IELT ViT-B-16 91.8±0.04 91.8±0.05
Xp GPUs.
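As a concrete illustration of the dynamic selection update in Eq. (15), the following sketch updates the per-layer selection ratios and selection numbers. The function and variable names are ours, for illustration only, and do not come from the authors' released code:

```python
import math

def update_selection(r, q, t, s, theta):
    """One DS update step, sketching Eq. (15).

    r:     current selection ratios, one entry per contributing layer (L - 1 entries)
    q:     q[l] counts how many of the t refined tokens came from layer l
    t:     number of refined tokens
    s:     total selection number
    theta: moving rate
    """
    n = len(r)
    r_aux = [q[l] / t for l in range(n)]                         # auxiliary ratios r'(l) = q(l) / t
    r_new = [(1 - theta) * r[l] + theta * r_aux[l] for l in range(n)]
    m = [math.ceil(s * ri) for ri in r_new]                      # selection numbers m = ceil(s * r)
    return r_new, m
```

With the paper's settings (s = 126 and r initialized to 1/(L − 1) for a 12-layer ViT), every layer starts with ⌈126/11⌉ = 12 selected tokens, and layers whose tokens survive into the refined feature gradually receive a larger share.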
B. Comparison with SOTA Methods

To verify the effectiveness of the proposed IELT, we compared it with the SOTA methods on five datasets. The compared SOTA methods include two categories: the CNN-based methods and the ViT-based methods. For each dataset, as our proposed method adopts random seeds for initialization, we repeated the experiments five times and report the average classification accuracy with the standard deviation. We conducted significance tests on the experimental results, and all results are within the 95% confidence interval. The best accuracy is in bold, and the second best accuracy is underlined. Table II shows the comparison results on CUB-200-2011 and Stanford Dogs. Table III shows the comparison results on NABirds. The comparison results on Oxford 102 Flowers and Oxford-IIIT Pet are shown in Table IV.

TABLE III
COMPARISON RESULTS ON NABIRDS.

Method        | Backbone     | Accuracy (%)
Cross-X [23]  | ResNet-50    | 86.4
PAIRS [65]    | ResNet-50    | 87.9
DSTL [66]     | Inception-v3 | 87.9
GHRD [31]     | ResNet-50    | 88.0
API-Net [32]  | DenseNet-161 | 88.1
PRIS [64]     | ResNet-101   | 88.4
CS-Part [67]  | ResNet-50    | 88.5
MGE-CNN [68]  | SENet-154    | 88.6
ViT [9]       | ViT-B-16     | 89.9
TransFG [36]  | ViT-B-16     | 90.8
IELT          | ViT-B-16     | 90.8±0.05

1) Results on CUB-200-2011 and Stanford Dogs: Table II shows that our method achieves the best performance among the SOTA methods on both CUB-200-2011 and Stanford Dogs. The proposed method achieves significant improvements of 0.8% and 1.2% over the baseline ViT [9] on the two datasets, respectively. Compared with TransFG [36], which selects tokens from the input of the last layer and replaces the original input, the proposed IELT extracts the features of each layer by the CLR module and obtains an accuracy that is higher by 0.1% and 1.2%, respectively. FFVT [40] fuses the cross-layer feature but ignores the noise involved in the cross-layer feature and the different learning performances of the multiple heads in the MHSA and transformer layers. The proposed method reduces the effect of noise and outperforms FFVT by 0.2% and 0.3%, respectively.

2) Results on NABirds: On NABirds, a large-scale dataset, the ViT-based methods have greater potential than the CNN-based methods. Based on Table III, the proposed method obtains an accuracy of 90.8%, which outperforms the baseline by 0.9%. The proposed method achieves the same accuracy as TransFG [36] with a significant reduction in memory and computation. Our proposed IELT fully uses the cross-layer feature to obtain a competitive performance without an increase in computation.

3) Results on Oxford 102 Flowers and Oxford-IIIT Pet: From Table IV, the proposed method obtains improvements of 0.2% and 1.4% over the baseline on Oxford 102 Flowers and Oxford-IIIT Pet, respectively. We can see that the existing SOTA methods perform well on the flowers, and our method can further improve the performance without adopting any additional annotations. Compared with CNN-based methods, such as FixSENet [69], which uses a large amount of data preprocessing, and OPAM [70], which trains the network in stages, the proposed method achieves higher accuracy without special training strategies. Compared to CvT [71], which is a powerful backbone network based on a ViT that integrates a convolution operation into the tokens, the proposed method achieves slight improvements of 0.1% and 0.5% on the two datasets, respectively. These results show that our method achieves improved performance on the FGVC datasets compared to other improved ViT-based networks.

C. Ablation Studies

Ablation studies were conducted to verify the effectiveness of our contributed components. They contain experimental results for each of the proposed modules.
TABLE IV
COMPARISON RESULTS ON OXFORD 102 FLOWERS AND OXFORD-IIIT PET.

Method         | Backbone     | Flowers (%) | Pet (%)
NAC [72]       | VGG-19       | 95.3        | 93.8
FixSENet [69]  | Inception-v2 | 95.7        | 94.8
InterAct [73]  | DenseNet-161 | 96.4        | 93.5
OPAM [70]      | VGG-19       | 97.1        | 93.8
MC-Loss [74]   | VGG-16       | 97.7        | -
Grafit [75]    | ResNet-50    | 99.1        | -
ViT [9]        | ViT-B-16     | 99.4        | 93.8
CvT [71]       | CvT-21       | 99.5        | 94.7
IELT           | ViT-B-16     | 99.6±0.03   | 95.2±0.09
TABLE V
ABLATION STUDIES OF THE PROPOSED MHV, CLR, AND DS MODULES (ACCURACY, %).

MHV | CLR | DS | CUB-200-2011 | Stanford Dogs | NABirds | Oxford-IIIT Pet | Oxford 102 Flowers
 -  |  -  | -  | 91.04        | 90.57         | 89.93   | 93.84           | 99.38
 √  |  -  | -  | 91.48        | 91.75         | 90.56   | 95.04           | 99.61
 √  |  √  | -  | 91.67        | 91.82         | 90.67   | 94.99           | 99.58
 √  |  -  | √  | 91.59        | 91.82         | 90.65   | 95.07           | 99.63
 √  |  √  | √  | 91.81        | 91.84         | 90.78   | 95.29           | 99.64
TABLE VII
ABLATION STUDY ON THE CHOICE OF THE CLASS TOKEN IN THE CLR MODULE ON CUB-200-2011.

Feature             | Input class token      | Accuracy (%)
Cross-layer feature | (L − 1)-th layer       | 91.81
Cross-layer feature | previous L − 1 layers  | 91.12
Refined feature     | (L − 1)-th layer       | 91.81
Refined feature     | L-th layer             | 91.57

3) Enhanced Convolution Kernel in MHV: A convolution operation was conducted on the score map M′ after aggregating the vote results of each head M_k in the MHV module. The ablation experiments were conducted on the choices of the convolution kernel, and the results on CUB-200-2011 are demonstrated in Table VI. Enhancing the score map with a Gaussian-like convolution kernel improves the classification accuracy by 0.11% compared with no enhancement. A learnable convolution kernel can achieve the same effect and obtain similar performance. In addition, choosing a larger convolution kernel reduces the diversity of the selected features and causes a lower accuracy. If the score map is not enhanced, the values in the score map are very sparse, and the differences between the values are small. This results in many tokens having the same score, making it challenging to select valuable tokens. In addition, the convolution operation can reduce the relative score of scattered noise, because a Gaussian-like convolution can effectively remove scattered noise values and retain the sharp boundaries of the object [61]. According to the experimental results, we use a 3 × 3 Gaussian-like convolution kernel in the MHV module.

4) Cross-Layer Refinement Ablation: To protect the refined feature from the effect of noise, our proposed CLR module adopts the class token of the (L − 1)-th layer rather than that of the L-th layer as the input of the (L + 1)-th layer for extracting the refined feature. To demonstrate the effectiveness of this choice, we conducted comparison experiments using the class token of the (L − 1)-th layer and that of the L-th layer, respectively, as the input of the (L + 1)-th layer. The accuracy comparison is shown in Table VII. It can be seen that using the (L − 1)-th layer produces an accuracy 0.24% higher than directly using the class token of the L-th layer. Moreover, in extracting the cross-layer feature, instead of fusing the outputs of multiple layers, only the output class token of the L-th layer is taken. To demonstrate the effectiveness of this approach, we experimented with fusing the class tokens of the previous L − 1 layers as the cross-layer feature. The result is shown in Table VII. We can see that the proposed method outperforms fusing the class tokens of the previous L − 1 layers by 0.69%. This is because the input of the L-th layer consists of the tokens selected from the previous L − 1 layers, and the multi-head self-attention mechanism computes the interrelationship between tokens. The output class token of the L-th layer probably already contains rich cross-layer information.

5) Dynamic Selection Analysis: The selection number vector of the previous L − 1 layers, m ∈ R^(L−1), in the DS module is recorded when the model obtains the best accuracy. As shown in Figure 6, the distribution of the selection numbers varies across datasets. This demonstrates that the DS module considers the transformer layers of the ViT as multiple weak learners and adjusts the weight of each layer adaptively according to its contribution to the cross-layer feature. It is worth noting that the tokens selected for Oxford 102 Flowers and Oxford-IIIT Pet are mainly concentrated in the lower layers. In contrast, the tokens selected for NABirds are mainly concentrated in the higher layers. This phenomenon suggests that low-level features are more useful for small-scale datasets, while high-level features are more useful for large-scale datasets. On Stanford Dogs and Oxford-IIIT Pet, the selection number of a particular layer is much larger than those of the others. The performance imbalance across layers may be due to the large intra-class variance of these two datasets. This enables the DS module to provide a much greater improvement on these two datasets, as shown in Table V.

D. Visualization of Selected Examples

To demonstrate the effectiveness of the proposed method more directly, visualization results of IELT are presented in Figure 7. The first row shows the original images. The second and third rows show the attention maps generated by the baseline and the proposed method, respectively. The fourth row shows the selection results of the MHV module. Compared with the baseline, the attention maps generated by IELT are more responsive to discriminative regions and have more defined boundaries, e.g., the head of a bird, the eyes and nose of a dog, and the center of a flower. This result fully demonstrates that the proposed IELT can better focus on the discriminative regions, such that the MHV module can select tokens that are more helpful for classification.

E. Hyperparameter Analyses

To figure out the effects of the parameters in the proposed modules, the parameter analysis experiments were performed on CUB-200-2011 if not otherwise mentioned.

1) Vote Number in MHV: To determine the optimal vote number v in the MHV module, different values were adopted, and the accuracy results are shown in Table VIII.
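The "Gaussian-like" 3 × 3 enhancement discussed in Section IV-C.3 amounts to smoothing the vote-score map with a fixed low-pass kernel. The sketch below is a pure-Python illustration under our own naming; the model itself would apply this as a framework convolution, and the exact kernel values are an assumption:

```python
def smooth_score_map(score):
    """Smooth a 2-D vote-score map with a fixed 3x3 Gaussian-like kernel.

    Isolated high scores (likely noise) are damped, while scores supported
    by high-scoring neighbours are largely preserved. Borders use zero padding.
    """
    kernel = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]  # entries sum to 16
    h, w = len(score), len(score[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    ni, nj = i + di, j + dj
                    if 0 <= ni < h and 0 <= nj < w:  # zero padding outside the map
                        acc += kernel[di + 1][dj + 1] * score[ni][nj]
            out[i][j] = acc / 16.0
    return out
```

A single isolated spike keeps only 4/16 of its value after smoothing, which matches the motivation above: scattered noise is suppressed while clusters of high-scoring tokens survive.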
Fig. 6. The selection numbers of the previous L − 1 transformer layers (plotted against the transformer layer index) adjusted by the dynamic selection module when the network obtains the best accuracy on each dataset.
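The per-layer selection numbers shown in Fig. 6 drive a simple top-m pick of token indices in each layer. A minimal sketch (names are illustrative, not from the authors' code):

```python
def select_top_m(scores, m):
    """Return the indices of the m highest-scoring tokens of one layer.

    scores: per-token vote scores for a single transformer layer
    m:      this layer's selection number, as adjusted by the DS module
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:m])  # keep original token order for concatenation
```

The tokens picked this way from each of the previous L − 1 layers are what the CLR module concatenates into the cross-layer feature.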
Fig. 7. Visualization results of our method on each dataset. The first row shows the input images. The second and third rows show the attention maps generated by the baseline and our method, respectively; light-colored locations indicate higher responses to the class token. The fourth row shows where the tokens are selected by the MHV module; the selected places are marked with a lighter color.
TABLE VIII
EFFECT OF VOTE NUMBER v IN MHV.

v            | 8     | 16    | 20    | 24    | 28    | 32
Accuracy (%) | 91.43 | 91.49 | 91.75 | 91.81 | 91.80 | 91.46

TABLE IX
INFLUENCE OF THE LOSS PROPORTION λ.

λ            | 0     | 0.2   | 0.4   | 0.6   | 0.8   | 1
Accuracy (%) | 91.61 | 91.70 | 91.81 | 91.70 | 91.70 | 91.46

We can find that the best result is provided when v is set to 24. We believe that larger vote numbers may introduce additional noise, because only a few parts of the attention map have high responses. Smaller vote numbers may cause the network to ignore the low-response parts and reduce the diversity of the cross-layer feature. Therefore, we set v = 24 in this work.

2) Loss Proportion: To evaluate the effect of the loss proportion λ in the CLR module, different loss proportions were used, and the results are shown in Table IX. The highest accuracy is obtained when λ is set to 0.4. Because the cross-layer feature predicts the result through the aid of the assist logits, the network can be updated even if the loss proportion λ is close to 0. In this case, the optimization of the network tends to be more dependent on the refined tokens, which increases the risk of over-fitting and reduces the convergence speed. When λ is close to 1, the parameters in the (L + 1)-th layer become difficult to optimize, which affects the performance of the network. Thus, we chose λ = 0.4 as the default setting.

3) Total Selection Number: To explore the effect of the total selection number s in the DS module, we show the accuracy for different s values in Figure 8. Because FGVC mainly depends on a few discriminative parts, a relatively small portion of the total tokens should be selected in each layer. When s is set to 126, the network obtains the highest accuracy, i.e., 91.81%. Note that, with this setting of s, the average number of selected tokens per layer is 11.5, which is much smaller than the token number N = 784. Reducing s can slightly improve the training and inference speed. However, selecting fewer tokens may lead to the loss of valuable features, while selecting more tokens may introduce additional noise. Therefore, we set s = 126 in the DS module.

F. Model Complexity

Compared with the baseline, our method introduces an additional transformer layer. As is shown in Table X, the parameter
the computation cost of each layer is 11.8 GFlops. The proposed IELT takes only half of the training and inference time and can achieve a competitive performance with a much lower computational cost in comparison with TransFG.