IEEE SIGNAL PROCESSING LETTERS, VOL. 29, 2022
Abstract—Few-shot image classification (FSIC) is the task of generalizing a model to unknown categories by learning from a small number of labeled samples of some given categories. Recently, metric-based approaches have received much attention for their simplicity and effectiveness, but they often use only the support set to generate inaccurate prototypes for matching the query set, ignoring the rich information contained in the queries and the reversibility of the matching relationship between the two. In this letter, we propose a simple and effective metric-based method called Bidirectional Matching Prototypical Network (BMPN), which has three innovations: 1) It has an additional reverse matching process. This process uses queries to generate more accurate prototypes to improve the model’s performance while also forcing the model to learn features far from the decision boundary to enhance generalization capabilities; 2) It has a lightweight coordinate attention feature extractor (CAFE). This module not only captures long-range dependencies along one spatial direction but also preserves accurate position information along the other spatial direction, helping the model to locate the region of interest more accurately; 3) It has a joint loss function, including a forward matching loss and a reverse matching loss, and a progressive weighting strategy is used in the training process to balance the importance of the two. Our model is trained end-to-end, and the experimental results show that it reaches state-of-the-art performance on two benchmark datasets.
Index Terms—Bidirectional matching, deep learning, few-shot image classification, metric-based method.
I. INTRODUCTION
Deep learning has developed rapidly in recent years. With the support of large amounts of data, more complex networks, and powerful hardware, deep learning can achieve satisfactory results in various visual tasks [1]–[8]. However, sometimes the cost of labeling training images is very high, or some scarce samples are difficult to obtain, resulting in few available training samples. In this case, the performance of existing models is unsatisfactory due to overfitting or an inconsistent training-testing distribution [9]. Therefore, the problem of learning with few labeled samples, called few-shot learning, has become a popular research direction [10]–[18].
Metric-based few-shot image classification is an essential direction of few-shot learning. In recent years, many models have been proposed [10], [11], [19], [20]. The matching network [10] combines meta-learning with a deep neural network to train a learnable nearest neighbor classifier. The prototypical network [11] uses Euclidean distance to measure the distance between prototypes and queries to classify queries. Further, the relation network [19] trains an additional neural network to learn a nonlinear metric used to calculate the similarity between prototypes and queries. However, these methods generally use a limited number of support samples to generate inaccurate prototypes for matching queries, ignoring the rich information contained in the queries.
Some recent work in few-shot image classification has attempted to use information from both the support set and the query set [21]–[23]. GNN [23] uses supports and queries to construct a graph model that transfers the distance metric from Euclidean space to non-Euclidean space and uses a message-passing algorithm to predict queries. TPN [21] uses transductive learning to feed supports and queries into the network for training and uses a label-propagation algorithm to predict queries. Their common idea is to transfer label information from the labeled support set to the unlabeled query set. CAN [22] proposes a CAM module, which generates a cross-attention map between supports and queries to highlight the target area and generate more discriminative features. Although these attempts have achieved good performance gains, they have ignored that the matching relationship between support and query should be reversible. Inspired by PANet [24], which reuses the predictive mask as a new segmentation annotation, we propose a new metric-based few-shot image classification method based on the prototypical network, called Bidirectional Matching Prototypical Network (BMPN). The intuitive motivation of our work is that if the support set can classify the query set well, then the query set must also be able to classify the support set well. Since there are more queries than supports, the model can generate more accurate prototypes. Meanwhile, due to the reverse matching process, the learned features lie far from the decision boundary, so the model generalizes better. Our main contributions are summarized as follows:
1) BMPN with an additional reverse matching process is proposed. It can use the rich information in queries to generate more accurate prototypes to improve the model’s performance while forcing the model to learn features …
Manuscript received October 12, 2021; revised February 1, 2022; accepted February 14, 2022. Date of publication February 22, 2022; date of current version April 26, 2022. This work was supported by the National Science Foundation of China under Grant U1832217. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Mylene Q. Farias. (Corresponding author: Jie Chen.)
Wen Fu is with the Institute of Microelectronics, Chinese Academy of Sciences, Beijing 100029, China, and also with the University of Chinese Academy of Sciences, Beijing 100049, China (e-mail: fuwen2020@ime.ac.cn).
Li Zhou and Jie Chen are with the Institute of Microelectronics, Chinese Academy of Sciences, Beijing 100029, China (e-mail: zhouli@ime.ac.cn; jchen@ime.ac.cn).
Digital Object Identifier 10.1109/LSP.2022.3152686
1070-9908 © 2022 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: TIANJIN UNIVERSITY. Downloaded on November 14,2022 at 10:18:19 UTC from IEEE Xplore. Restrictions apply.
FU et al.: BIDIRECTIONAL MATCHING PROTOTYPICAL NETWORK FOR FEW-SHOT IMAGE CLASSIFICATION 983
A. Problem Formulation
In the few-shot image classification setup, datasets are usually divided into a training set $D_{train}$ and a testing set $D_{test}$ ($D_{train} \cap D_{test} = \varnothing$). Following [10], we employ the standard episodic strategy for both the training and the testing set. Specifically, $C$ ($C \geq 5$) categories are randomly sampled from the dataset, and $K$ ($K \leq 5$) samples are randomly selected from each category to form a labeled support set $S = \{(x_i, y_i)\}_{i=1}^{CK}$, referred to as the $C$-way $K$-shot task. Then, $N$ ($N = 15$) samples are randomly selected from each of the same $C$ categories to form an unseen query set $Q = \{(\hat{x}_i, \hat{y}_i)\}_{i=1}^{CN}$. In the training phase, the model is trained to minimize the prediction error on the query set in each episode. In the testing phase, the generalization performance of the model is tested on $D_{test}$ with the same settings.
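The episode construction described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `sample_episode`, the toy label list, and the choice to relabel the sampled classes as 0 to C-1 within each episode are all assumptions made for the example.

```python
import random
from collections import defaultdict


def sample_episode(labels, c_way=5, k_shot=1, n_query=15, rng=None):
    """Sample one C-way K-shot episode (hypothetical helper).

    Returns (support, query), each a list of (sample_index, episode_label).
    """
    rng = rng or random.Random()
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    # Only classes with at least K + N samples can supply both sets.
    eligible = sorted(c for c, idxs in by_class.items()
                      if len(idxs) >= k_shot + n_query)
    classes = rng.sample(eligible, c_way)
    support, query = [], []
    for episode_label, c in enumerate(classes):
        # Draw K + N distinct samples so queries are unseen by the supports.
        picked = rng.sample(by_class[c], k_shot + n_query)
        support += [(i, episode_label) for i in picked[:k_shot]]
        query += [(i, episode_label) for i in picked[k_shot:]]
    return support, query


# Toy dataset (assumption): 8 classes with 20 samples each.
labels = [c for c in range(8) for _ in range(20)]
support, query = sample_episode(labels, c_way=5, k_shot=1, n_query=15,
                                rng=random.Random(0))
```

With C = 5, K = 1, and N = 15, the support set holds CK = 5 samples and the query set holds CN = 75 samples, matching the set sizes in the formulation.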
B. Prototypical Network
The prototypical network averages the feature vectors of each class in the support set as its corresponding prototype. It then calculates the distance between the prototype and the query through forward matching to achieve classification. For a $C$-way $K$-shot task, the $c$-th class prototype $p_c$ is calculated by the following formula:
$$p_c = \frac{1}{K} \sum_{i=1}^{K} F_\theta(x_i) \tag{1}$$
Fig. 2. (a) is the overall structure of CAFE, which has four CA blocks. (b) is the composition of a CA block. The coordinate attention module decomposes the 2-D global pooling into two 1-D average pooling operations along the X and Y directions and generates attention through subsequent operations.
In the training stage, the standard cross-entropy loss is used, which is called the forward loss in this paper:
$$L_{forward} = -\frac{1}{CN} \sum_{j=1}^{CN} \log P(y = c \mid Q) \tag{4}$$
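As a small worked example of Eqs. (1) and (4), the sketch below computes class prototypes and the forward loss on toy 2-D features. It assumes the features have already been produced by the extractor $F_\theta$, and it assumes a softmax over negative squared Euclidean distances for the class probability, in the style of the prototypical network [11]; the intermediate equations are not reproduced in this excerpt, so that choice is an assumption rather than the letter's confirmed definition. All function names are hypothetical.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of equally sized feature vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def prototypes(features, labels, num_classes):
    """Eq. (1): average the support features of each class."""
    return [mean_vector([f for f, y in zip(features, labels) if y == c])
            for c in range(num_classes)]


def log_probs(feature, protos):
    """Log-softmax over negative squared Euclidean distances (assumed metric)."""
    logits = [-sum((a - b) ** 2 for a, b in zip(feature, p)) for p in protos]
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return [l - log_z for l in logits]


def cross_entropy(features, labels, protos):
    """Eq. (4): mean negative log-probability of the true class."""
    total = sum(log_probs(f, protos)[y] for f, y in zip(features, labels))
    return -total / len(features)


# Toy 2-way 2-shot episode with 2-D features (assumption).
support_x = [[0.0, 0.0], [0.2, 0.0], [4.0, 4.0], [4.0, 4.2]]
support_y = [0, 0, 1, 1]
query_x = [[0.1, 0.1], [3.9, 4.1]]
query_y = [0, 1]

protos = prototypes(support_x, support_y, num_classes=2)
forward_loss = cross_entropy(query_x, query_y, protos)
```

Because each toy query lies almost on top of its own class prototype, the forward loss is close to zero here; moving a query toward the other prototype increases it.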
984 IEEE SIGNAL PROCESSING LETTERS, VOL. 29, 2022
… other spatial direction, helping the model locate the region of interest more accurately. Experimental results show that CAFE effectively suppresses useless information and significantly improves the performance and robustness of the model.
Specifically, for a feature map $x$ of size $C \times H \times W$, its operation through the coordinate attention module can be expressed as:
$$u_c^h(h) = \frac{1}{W} \sum_{0 \le i < W} x_c(h, i), \quad u_c^w(w) = \frac{1}{H} \sum_{0 \le j < H} x_c(j, w) \tag{5}$$
$$v = \mathrm{ReLU}\left(\mathrm{BN}\left(H_1\left(\left[u^h, u^w\right]\right)\right)\right) \tag{6}$$
$$v^h = H_2(v), \quad v^w = H_3(v) \tag{7}$$
$$g^h = \mathrm{Sigmoid}(v^h), \quad g^w = \mathrm{Sigmoid}(v^w) \tag{8}$$
where $x_c$ is the feature map of the $c$-th channel of $x$, $H_i$ ($i = 1, 2, 3$) is a $1 \times 1$ convolution, $u_c^h$ and $u_c^w$ are the outputs of $x_c$ after 1-D pooling in the X direction and Y direction, respectively, $[\cdot]$ is the concatenation operation, $v \in \mathbb{R}^{\frac{C}{r} \times (H+W)}$ is the output after the nonlinear operation, $r$ is the reduction ratio, which is set to 32 in this letter, and $v^h \in \mathbb{R}^{\frac{C}{r} \times H}$ and $v^w \in \mathbb{R}^{\frac{C}{r} \times W}$ are the partition of $v$ into two independent tensors along the spatial dimension. The final output of the feature map after the CA module is:
$$y_c(i, j) = x_c(i, j) \times g_c^h(i) \times g_c^w(j) \quad (c = 1, \dots, C) \tag{9}$$
2) Bidirectional Matching Process: The flow of the bidirectional matching algorithm is shown in Fig. 1. The forward matching process is precisely the same as in the prototypical network, and the model learns from the forward matching loss $L_{forward}$. After completing the forward matching process, we get the network’s predictions for the queries, which we call the pseudo-labels of the queries. We get the reverse prototypes by averaging the queries with the same pseudo-label. Assuming that forward matching is accurate enough, adding a reverse matching process has the following two advantages: 1) The reverse matching loss $L_{reverse}$ forces the model to learn features far from the decision boundary, making the model generalize better. 2) Having more queries than supports results in more accurate prototypes. Both losses are cross-entropy losses. The total loss of the model is $L_{total} = L_{forward} + \lambda L_{reverse}$, where $\lambda$ balances the two losses.
3) Progressive Weight Strategy: The reverse matching process assumes that the pseudo-labels obtained by forward matching are sufficiently accurate. Since our model is trained end-to-end, the predictions for the queries are mostly wrong at the beginning. To solve this problem, we designed a progressive weight generation module, as shown in Algorithm 1. At the beginning of training, the weight of $L_{reverse}$ is almost zero, and the model is mainly supervised by the forward matching loss $L_{forward}$. As training progresses, the model’s performance improves, and the weight of $L_{reverse}$ increases. In the later stage of training, $L_{reverse}$ dominates the supervision. $\alpha$ is the maximum value that $\lambda$ can reach, set to 4 in this letter.
Algorithm 1
… rate ξ, hyper-parameter α;
Output: model parameter θ;
1: initialize θ;
2: for iter in iterations do
3: forward matching, labeling queries: $\{(\hat{x}_i, \hat{y}_i)\}_{i=1}^{CN}$;
4: $L_{forward} = -\frac{1}{CN} \sum_{j=1}^{CN} \log P(y = c \mid Q)$;
5: reverse matching, labeling supports: $\{(x_i, y_i)\}_{i=1}^{CK}$;
6: $L_{reverse} = -\frac{1}{CK} \sum_{i=1}^{CK} \log P(y = c \mid S)$;
7: $\lambda = \alpha \times \frac{iter}{iterations}$;
8: $L_{total} = L_{forward} + \lambda L_{reverse}$;
9: update model: $\theta \leftarrow \theta - \xi \cdot \nabla_\theta L_{total}$;
10: end for
III. EXPERIMENTS
We evaluated BMPN on two few-shot image classification benchmarks, MiniImageNet [10] and CUB-200-2011 [27], and compared it with the most advanced methods available.
A. Dataset
1) MiniImageNet: It is a subset of ImageNet, containing 60,000 RGB images in 100 categories, with 600 images in each category, and the size of each image is 84 × 84. Following the settings in [13], we used 64, 16, and 20 randomly selected classes for training, validation, and testing, respectively.
2) CUB-200-2011: It is commonly used to evaluate fine-grained classification and contains 11,788 images in 200 categories. Following the settings in [28], we used 100, 50, and 50 randomly selected classes for training, validation, and testing, respectively.
B. Implementation Details
We conduct experiments on the most common 5-way 1-shot and 5-way 5-shot cases in few-shot image classification. Accuracy is used as the performance metric. During the training and testing phases, each class has 15 queries (75 in total). Our model is trained end-to-end from scratch for 150,000 iterations. CAFE is used as the feature extraction network, and Adam [29] is used as the optimizer. The initial learning rate is set to 0.001 and halved at 30,000, 50,000, and 100,000 iterations. All input images are resized to 84 × 84 × 3. We use standard data augmentation methods in the training phase, including random crop, left-right flip, and color jitter. $\alpha$ is set to 4 in this letter. In the testing phase, to make the evaluation more convincing, we conducted more than 600 tests and report the results with 95% confidence intervals. It is worth emphasizing that we used the same setup for all datasets. Our code is implemented in PyTorch [30] and runs on an NVIDIA TITAN Xp GPU.
C. Result and Analysis
1) Result: Our backbone is CAFE, which consists of four layers of convolution blocks. For a fair comparison, methods with the same structure are selected, and the baseline network is …
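Returning to Algorithm 1, one pass of its loss computation (steps 3 to 8) can be sketched on toy features as follows. This is an illustration under stated assumptions, not the authors' implementation: the class probability is assumed to be a softmax over negative squared Euclidean distances, features are assumed to be precomputed, the fallback to the forward prototype when no query receives a given pseudo-label is an added assumption, and every function name is hypothetical. No gradient step is performed.

```python
import math


def mean_vector(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def log_probs(feature, protos):
    # Log-softmax over negative squared Euclidean distances (assumed metric).
    logits = [-sum((a - b) ** 2 for a, b in zip(feature, p)) for p in protos]
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return [l - log_z for l in logits]


def cross_entropy(features, labels, protos):
    total = sum(log_probs(f, protos)[y] for f, y in zip(features, labels))
    return -total / len(features)


def progressive_weight(it, iterations, alpha=4.0):
    """Step 7: lambda grows linearly from 0 towards alpha."""
    return alpha * it / iterations


def bidirectional_loss(support_x, support_y, query_x, query_y,
                       num_classes, it, iterations, alpha=4.0):
    # Forward matching: support prototypes classify the queries.
    fwd = [mean_vector([f for f, y in zip(support_x, support_y) if y == c])
           for c in range(num_classes)]
    l_forward = cross_entropy(query_x, query_y, fwd)
    # Pseudo-labels are the forward predictions for the queries.
    pseudo = []
    for q in query_x:
        lp = log_probs(q, fwd)
        pseudo.append(lp.index(max(lp)))
    # Reverse matching: query prototypes classify the supports.
    rev = []
    for c in range(num_classes):
        members = [q for q, y in zip(query_x, pseudo) if y == c]
        # Assumption: reuse the forward prototype if a class gets no queries.
        rev.append(mean_vector(members) if members else fwd[c])
    l_reverse = cross_entropy(support_x, support_y, rev)
    lam = progressive_weight(it, iterations, alpha)
    return l_forward + lam * l_reverse, l_forward, l_reverse, lam


# Toy 2-way 1-shot episode with two queries per class (assumption).
support_x, support_y = [[0.0, 0.0], [4.0, 4.0]], [0, 1]
query_x = [[0.2, 0.0], [0.0, 0.2], [4.2, 4.0], [3.8, 4.0]]
query_y = [0, 0, 1, 1]

start = bidirectional_loss(support_x, support_y, query_x, query_y, 2, 0, 100)
end = bidirectional_loss(support_x, support_y, query_x, query_y, 2, 100, 100)
```

At the first iteration the weight is zero, so the total loss equals the forward loss; at the last iteration the reverse loss is weighted by α = 4, which is why it carries most of the supervision late in training.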
TABLE II
RESULT ON CUB-200-2011