
Applied Intelligence

https://doi.org/10.1007/s10489-023-04634-0

Few-shot relation classification based on the BERT model, hybrid attention and fusion networks

Yibing Li1,2,3 · Zenghui Ding1 · Zuchang Ma1 · Yichen Wu1,2 · Yu Wang1,2 · Ruiqi Zhang1,2 · Fei Xie3 · Xiaoye Ren3

Corresponding authors: Zenghui Ding (dingzenghui@iim.ac.cn); Zuchang Ma (zcma@iim.ac.cn). Extended author information is available on the last page of the article.

Accepted: 11 April 2023


© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2023

Abstract
Relation classification (RC) is an essential task in information extraction. The distant supervision (DS) method can exploit large amounts of unlabeled data and alleviate the lack of training data for the RC task. However, the DS method suffers from long-tail and noise problems. Intuitively, these problems can be addressed with few-shot learning (FSL). Our work aims to improve the accuracy and the speed of convergence on the few-shot RC task. We believe that entity pairs play an essential role in the few-shot RC task. We propose a new context encoder, built on the bidirectional encoder representations from transformers (BERT) model, that fuses entity pairs and their dependency information in instances. At the same time, we design a hybrid attention mechanism, which includes support instance-level and query instance-level attention. Support instance-level attention dynamically assigns a weight to each instance in the support set; it makes up for the shortcoming of prototypical networks, which distribute weights to sentences equally. Query instance-level attention dynamically assigns weights to query instances according to their similarity with the prototype. An ablation study shows the effectiveness of the proposed method. In addition, a fusion network is designed to replace the Euclidean distance method of previous works for class matching, improving the speed of convergence. This makes our model more suitable for industrial applications. The experimental results show that the proposed model's accuracy is better than that of several other models.

Keywords Relation classification · Few-shot learning · BERT · Attention · Rapidity of convergence

1 Introduction

Relation classification (RC) is important in natural language processing (NLP) tasks such as knowledge graph construction and question-answering systems. Its main goal is to identify a relation between two entities in a sentence. With the rapid development of deep learning, an increasing number of scholars use deep learning to solve RC tasks. For example, some scholars use supervised methods for RC, such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs) [1]. However, supervised methods depend on training with large-scale labeled data, and manual data labeling is an uphill task. Many researchers use distant supervision (DS) to label data and thus solve the problem of insufficient training data [2]. The DS method uses an existing knowledge graph to label the data automatically. However, the DS method has problems with noise and long tails, and the quality of the labeled data obtained is not high, affecting the model's final performance. Therefore, it becomes meaningful to use only a few data for RC. Few-shot RC can address the long-tail problem and the lack of labeled data.

This research focuses on the few-shot RC task, in which only a few instances in the support set are given for each relation. Few-shot learning (FSL) has been successfully applied in computer vision [3, 4]. Han et al. applied FSL to RC for the first time in NLP and released two datasets, FewRel1.0 and FewRel2.0 [5, 6]. Subsequently, an increasing number of scholars have applied few-shot learning to RC [7–10]. Table 1 shows an example of the few-shot RC task. In Table 1, the correct relation class of the query instance is A, and examples of the other relations are omitted to save space. The head entity is marked in red with [·]eh, and the tail entity in blue with [·]et.


Table 1 A data example of the 5-way-2-shot few-shot RC task


Support set

Relation Sentence
(A) place served by transport hub (1) Nearby airports include Akwa Ibom Airport at Okobo and [Margaret Ekpo International
Airport]eh in [Calabar]et.
(2) [Pakyong Airport]eh, a Greenfield project, is under construction southeast of [Gangtok]et.
(B) religion ...
(C) participating team ...
(D) league ...
(E) publisher ...
Query instance
(A) or (B) or (C) or (D) or (E) [Manchester]et boasts a [large regional airport]eh with scheduled commercial services.
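To make the notation in Table 1 concrete, here is a minimal sketch of how one such instance could be represented in code; the field names and the dataclass layout are illustrative assumptions, not part of the FewRel release.

```python
from dataclasses import dataclass

@dataclass
class RCInstance:
    """One support/query instance in the format suggested by Table 1."""
    tokens: list    # the sentence, tokenized
    head: tuple     # (start, end) token span of the head entity, i.e. [...]eh
    tail: tuple     # (start, end) token span of the tail entity, i.e. [...]et
    relation: str   # relation label

example = RCInstance(
    tokens="Pakyong Airport a Greenfield project is under construction southeast of Gangtok".split(),
    head=(0, 1),     # "Pakyong Airport"
    tail=(10, 10),   # "Gangtok"
    relation="place served by transport hub",
)
```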

The metric learning model was previously applied to the few-shot RC task, which learns the distance distributions among relation classes [11, 12]. Among them, prototypical networks are efficient and straightforward, and many scholars have made variants and improvements based on this model. First, support and query instances are represented as vectors in the prototypical networks. Then, the class prototype for each relation is represented by the average vector of the corresponding instances in the support set. Finally, the distance between the query instance and each class prototype in the support set is calculated, and the relation corresponding to the shortest distance is the prediction for the query instance.

In the prototypical networks, when constructing the prototype of each relation, the instances of each relation in the support set are all considered equally. However, we think the weight of an instance in the support set is related to the query instance, and the instance that is more similar to the query instance should receive a higher weight. At the same time, when calculating the distance between the class prototype and the query instance, we also design a query instance weight: the more similar the query instance is to the class prototype, the higher its weight should be. Therefore, we created a hybrid attention mechanism to assign query and support instance weights.

The bidirectional encoder representations from transformers (BERT) model won state-of-the-art results on 11 NLP tasks. Therefore, we use the BERT model to encode query and support instances. We believe that entity pairs in sentences play a crucial role in RC, so it is essential to integrate entity pair information when encoding the instance. Furthermore, we believe that the words that depend on the sentence's entity pair also play a critical role in RC. For this purpose, we improved BERT and designed a new encoding method to fuse the information of the entity pair and its dependents. We denote this context encoder method FD_BERT.

According to Snell et al. (2017), Euclidean distance outperforms other distance functions in prototypical networks. Because of the use of BERT in our model, the context encoding has a higher dimensionality; when the prototypical network model calculates the distance, the amount of calculation is tremendous and convergence is slow. This paper designs a fusion network that fuses the query instance and each class prototype into one vector used as input to a linear layer for class matching. It ensures the accuracy of the prediction, reduces computational complexity, and speeds up convergence.

The main contributions of this paper are as follows:

1. We improve the BERT encoding method and propose the FD_BERT context encoder. The context encoder fuses the entity pair and its dependency information to improve accuracy.
2. A hybrid attention method is designed, and different weights are assigned to support instances and query instances to improve accuracy.
3. A fusion network is designed, which reduces the calculation amount of the model and speeds up convergence.
4. Experimental results and ablation studies confirm the effectiveness of the proposed method. The proposed model performs better than several benchmark models.

2 Related work

The main task of RC is to identify the relation between entity pairs in a sentence. Recently, deep learning methods have been widely used for RC. Zeng et al. defined the relation extraction problem as follows: given a sentence S and entity pairs e1 and e2, determine the relation between e1 and e2 in the sentence.


The relation extraction problem is equivalent to an RC problem. Compared with traditional methods, deep learning methods only need the entire sentence as input and do not require feature selection, resulting in very good results. Li et al. proposed the keyword-attentive sentence mechanism to effectively combine attention mechanisms and shortest dependency paths [13]. Sun et al. proposed a training representation based on the dependency paths between entities in a dependency tree [14]. Shi et al. proposed the A2DSRE framework to adaptively obtain the dependency paths related to entity relations based on the adaptive dependency path [15]. Liu et al. also used the dependency tree. They offered the augmented dependency path, which is composed of the shortest dependency path between two entities and the subtrees attached to the shortest path [16]. They developed a dependency-based neural network model that combines the advantages of RNNs and CNNs. Ma et al. proposed a graph-based model for RC. They build a semantic graph to describe syntactic dependencies and sentence interactions, and a bidirectional gated recurrent unit (Bi-GRU) and graph attention network (GAT) are introduced to mine semantic features [17]. Runyan et al. used a bidirectional RNN architecture for RC. They used an attention layer to organize context information at the word level and a tensor layer to detect complex connections between two entities [18]. All these models adopt supervised training and require large quantities of labeled data.

However, labeled data are relatively scarce, and labeling data consumes considerable time and energy. To solve this problem, Mintz proposed using DS [19]. DS uses knowledge graph alignment to label data automatically; however, it brings noise and long-tail problems, and many scholars have researched these problems. Zhang et al. used knowledge graph embedding to learn implicit relational knowledge and graph convolutional networks to learn explicit relational knowledge, reducing the long-tail problem [20]. The primary purpose of these works is to minimize noise and long-tail problems in DS; however, they only reduce, and do not fundamentally eliminate, the effect of these problems.

The few-shot RC model can learn from a few data, and it does not have the long-tail and noise problems of DS. The metric-based classification method mainly uses distance to determine the relation class, which is easy to implement [21, 22]. FSL was first applied in computer vision (CV), and most of these works focus on CV applications [23, 24]. Han et al. introduced FSL into the NLP field for the first time [6]. Recently, many scholars have conducted studies on few-shot RC in NLP [25–27]. Ye et al. proposed multilayer matching and aggregation network methods for few-shot RC [28]. Gao et al. used a hybrid attention mechanism to reduce the influence of noise in few-shot RC [8]. In 2019, they proposed the BERT-PAIR model to address real application scenarios where the test set is a new domain and new relations emerge, and they published a new dataset, FewRel 2.0 [5]. They also used the neural snowball method for FSL to explore new relations [29].

Pang et al. focused on few-shot RC under DS [30]. They incorporate multiple instance learning methods into the prototypical networks. Xiao et al. proposed an adaptive mixture mechanism to add label words to the representation of the class prototype; in addition, they introduce a new loss function for joint representation learning to encode each support set instance adaptively [31]. FSL has applications in many fields. Li et al. proposed an FSL model that can predict the various laws relevant to a case based on its description [7]. Wu et al. proposed an FSL model for the intelligent diagnosis of rotating machinery [32]. Li et al. detected multi-label intent by FSL [33]. The Signatures model was proposed by Bao et al. They used distributional word features for encoding, and a meta-learning framework maps these features into attention scores, which are then used to weight the lexical word representations [34].

Our work focuses on the few-shot RC task based on prototypical networks. The BERT context encoding method is improved, a hybrid attention mechanism is designed to improve accuracy, and a fusion function is proposed to accelerate model convergence. The details of each method are described in Section 3.

3 Methodology

This section introduces our proposed model's overall framework, the context encoder, the hybrid attention mechanism, and the fusion network. Relevant definitions are given first.

3.1 Task definition

Generally, the few-shot training set contains many relations, and these relations form the set R. There are multiple instances for each relation. In the training phase, we randomly select N relations, with K instances for each relation, so there are a total of N * K instances in the support set S used as the model's input. Then, Q samples are randomly selected from the non-selected instances of each relation as the model's prediction objects, the query set Q. The few-shot RC task needs to determine, among the N relations described by the N * K instances, which relation the query instance belongs to. This task is called the N-way-K-shot problem. The instances in the training set can be built up into a support set, defined as follows:

S = \{(x_i^k, h_i^k, t_i^k, r_i);\ i = 1 \cdots N,\ k = 1 \cdots K,\ r_i \in R\}   (1)

where x_i^k represents the kth instance of the ith relation, h_i^k and t_i^k are the head and tail entities in the sentence x_i^k, respectively, and r_i is the relation between the head entity and the tail entity. The query set Q can be expressed as follows:

Q = \{(q_i^j, r_j);\ j = 1 \cdots N\}   (2)

where q_i^j represents an instance of the relation r_j. Therefore, the few-shot RC task can be defined as follows:

p(y = r_i \mid c, q) = \frac{\exp f(\{x_i^k\}_{k=1}^{K}, q)}{\sum_{i=1}^{N} \exp f(\{x_i^k\}_{k=1}^{K}, q)}   (3)

where f(\{x_i^k\}_{k=1}^{K}, q) is a function for calculating the class-matching degree between the query sentence q and \{x_i^k\}_{k=1}^{K}. The design of this function is the focus of this research. Here, c denotes the class prototype that represents the K sentences of a relation in the support set S. In the prototypical networks, all sentences are considered equally. In contrast, this article designs a hybrid attention mechanism, which does not simply give each instance equal consideration. The details are described in Section 3.4.
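To make the N-way-K-shot setup above concrete, the following is a minimal sketch of how a single episode could be sampled from a labeled corpus; the dictionary layout and the function name are assumptions made for illustration, not the FewRel toolkit's API.

```python
import random

def sample_episode(data_by_relation, n_way=5, k_shot=2, q_queries=1):
    """Sample one N-way-K-shot episode (support set + query set).

    data_by_relation: dict mapping a relation name to a list of instances
    (e.g. dicts with "tokens", "head", "tail" fields).
    """
    relations = random.sample(list(data_by_relation), n_way)
    support, query = [], []
    for label, rel in enumerate(relations):
        instances = random.sample(data_by_relation[rel], k_shot + q_queries)
        # the first K instances go to the support set, the rest to the query set
        support += [(inst, label) for inst in instances[:k_shot]]
        query += [(inst, label) for inst in instances[k_shot:]]
    return support, query  # |support| = N*K, |query| = N*Q

# Example: one 5-way-2-shot episode with one query per relation
# support, query = sample_episode(train_data, n_way=5, k_shot=2, q_queries=1)
```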


3.2 Model framework

This section introduces the framework of the proposed HAFN model. The structure diagram of the model framework is shown in Fig. 1.

First, the K instances of each relation in the support set and the query instance are encoded with the FD_BERT method proposed in this paper. Then, the K instances of each relation and the query instance q are used to obtain the class prototype vector c_i through the support instance-level attention (SIA) mechanism. The vectors c_i and q produce q_i through the query instance-level attention (QIA) mechanism. Finally, the fusion network concatenates c_i and q_i into a new vector that is fed to a linear layer, which outputs a score between the query sentence q and the prototype vector c_i. The relation with the maximum score is the model's prediction for the query instance q.

Fig. 1 Framework of the HAFN model


Take the 5-way-2-shot task in Table 1 as an example to illustrate the technical details of the proposed approach. During model training, five relations are randomly selected, with two sentences for each relation, so each batch has ten sentences. The task of the model is to determine which of the five relations the query instance q belongs to.

First, the positions of the words that depend on the entity pair are identified according to Stanford dependency parsing. For example, in instance (1) of relation A in Table 1, "Margaret Ekpo International Airport" is the head entity, and its position [9, 10, 11, 12] in the sentence is noted as P_h. Position [14] is the tail entity position in the sentence and is indicated as P_t. According to Stanford dependency parsing, the dependency positions of the head and tail entities are [8, 13, 7], which are recorded as P_d.

Second, the BERT model encodes the sentence to obtain the embedding of each word. All word embeddings make up an embedding set, denoted E_all. The word embeddings of the entity pair and its dependencies are then selected from the word embedding set E_all by their positions and are recorded as E_h, E_t and E_d, respectively.

Third, the entity pair and dependent words perform attention operations with each word embedding in the sentence. We use the head entity as an example to illustrate the technical details of this attention. In the above example, the position of the head entity in the sentence is [9, 10, 11, 12]. The embedding of "Margaret" is taken from the word embedding set E_h of the head entity, and the attention operation between "Margaret" and each word embedding of the sentence is performed as in Eq. 6. The operation changes the weight of each word in the constructed sentence encoding and highlights the importance of the word "Margaret" in the sentence encoding. Therefore, we obtain the weight of each word, denoted A. The weights A and the word embeddings are multiplied and then max-pooled to obtain a sentence encoding, as shown in Eq. 7; this sentence encoding emphasizes the importance of "Margaret." Since the head entity is not a single word, there are three more words, "Ekpo," "International," and "Airport," and each performs the same operation as "Margaret." Therefore, we obtain four sentence encodings, and the sum of the four encodings is the encoding that incorporates all the word information of the head entity, as shown in Eq. 8. The same operation is performed for the tail entity and for the entity pair's dependencies, and the final sentence encoding is obtained by summing, as shown in Eq. 9. The above is the process of encoding sentences with our proposed FD_BERT encoder.

The proposed hybrid attention mechanism then obtains the relation prototypes and the new query instance encodings. Again take the 5-way-2-shot task in the table as an example. We obtain the sentence encoding of sentence (1) of relation A as x1 and of sentence (2) as x2. The query instance encoding is q.

First, we calculate the similarity between q and x1 and x2 and then apply softmax to obtain the weights b1 and b2 (b1 + b2 = 1). The prototype c1 of relation A is expressed as b1*x1 + b2*x2, as shown in Eq. 10. This process is our proposed SIA. The other four relation prototypes, c2, c3, c4, and c5, are obtained in the same way.

Second, the similarity between the query sentence encoding q and the prototypes c1, c2, c3, c4, and c5 is calculated to obtain the similarities r1, r2, r3, r4, and r5, as shown in Eq. 12. r1, r2, r3, r4, and r5 are multiplied by the query sentence encoding q to obtain the new query encodings q1, q2, q3, q4, and q5. The query instance encoding is thus obtained dynamically from the relation prototypes. The above is the process of our proposed QIA.

Third, network fusion is performed. c1, q1, and the absolute value |c1 - q1| are concatenated, as shown in Eq. 14. In our experiments, the sentence encoding has 768 dimensions, and after fusion the vector has 2,304 dimensions. Then, the value d1 is obtained by a linear layer, as shown in Eq. 15. The same method fuses (c2, q2) ... (c5, q5) to obtain the other four values, d2, d3, d4, and d5. The relation corresponding to the maximum value is the prediction for the query instance.

The above example illustrates the technical details of the model.
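The entity-word attention walked through above (Eqs. 6–9, formalized in Section 3.3 below) can be sketched roughly as follows in PyTorch; the module name, the single shared linear projections, and the tensor shapes are simplifying assumptions rather than the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EntityAwareEncoder(nn.Module):
    """Sketch of FD_BERT-style pooling: each entity/dependency word attends over
    the whole sentence (Eq. 6), the weighted token embeddings are max-pooled
    (Eq. 7), the per-word results are summed (Eq. 8), and the head, tail and
    dependency encodings are added together (Eq. 9)."""

    def __init__(self, hidden=768):
        super().__init__()
        self.q_proj = nn.Linear(hidden, hidden)
        self.k_proj = nn.Linear(hidden, hidden)
        self.v_proj = nn.Linear(hidden, hidden)
        self.scale = hidden ** 0.5

    def pool(self, e_all, positions):
        # e_all: [seq_len, hidden], the BERT output for one sentence
        # positions: token indices of the head, tail, or dependency words
        q = self.q_proj(e_all[positions])                   # [m, hidden]
        k, v = self.k_proj(e_all), self.v_proj(e_all)       # [seq_len, hidden]
        alpha = F.softmax(q @ k.t() / self.scale, dim=-1)   # Eq. 6: [m, seq_len]
        weighted = alpha.unsqueeze(-1) * v.unsqueeze(0)     # [m, seq_len, hidden]
        per_word = weighted.max(dim=1).values               # Eq. 7: max-pool over tokens
        return per_word.sum(dim=0)                          # Eq. 8: sum over entity words

    def forward(self, e_all, p_h, p_t, p_d):
        # Eq. 9: E = E_h + E_t + E_d
        return self.pool(e_all, p_h) + self.pool(e_all, p_t) + self.pool(e_all, p_d)

# usage sketch: e_all is one sentence's BERT output, positions as in the example above
# enc = EntityAwareEncoder()(e_all, [9, 10, 11, 12], [14], [7, 8, 13])
```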


3.3 Context encoder

BERT has achieved state-of-the-art performance in various NLP tasks [35]. Our research is based on BERT and proposes a new context encoder, FD_BERT. Previous studies inspired us regarding the importance of dependency paths in RC tasks. FD_BERT fuses the information of the entity pair and the words in its dependency tree. The framework is shown in Fig. 2.

Fig. 2 Framework of the FD_BERT model

An instance s(x, h, t, r) is processed with the dependency parse tree to find the words that depend on the entity pair and to obtain the positions of the head entity, tail entity, and dependencies, P_h, P_t, and P_d, respectively. At the same time, each word of instance s is embedded by the BERT model, giving the embedding set E_all. According to the positions P_h, P_t and P_d, the word embedding sets of the head entity, the tail entity, and the dependent words, E_h, E_t, and E_d, respectively, are selected from E_all. An attention operation between the word embeddings in E_h, E_t, and E_d and E_all is then performed to obtain \bar{E}_h, \bar{E}_t, and \bar{E}_d. Because \bar{E}_h, \bar{E}_t, and \bar{E}_d include the information of each word in the sentence, they can be used as context encodings. This paper adds these three context encodings to obtain the final context encoding.

Following the literature [1], we use relative positions. The positions of the word W_i relative to the two entities are P_i^h and P_i^t, respectively. If the dimension of the word embedding is d_w and the dimension of the position vector is d_p, then the word W_i is expressed as follows:

W_i = [e_i; P_i^h, P_i^t]   (4)

After the sentence is input into the BERT model, each word embedding is obtained, and the embedding set of all words is labeled E_all:

E_{all} = \mathrm{BERT}(w_1, w_2, \cdots, w_n)   (5)

The word embedding of the ith word is represented as E_{all}^i.

We used the Stanford Parser to obtain the dependency parse tree in the experiment. According to the existing entity pair positions P_h and P_t in the sentence, the position P_d of the words that depend on the head and tail entities is obtained. The word embeddings of the head entity, tail entity, and dependent words are then taken from the word embedding set E_all by their positions and are denoted E_h, E_t, and E_d, respectively. We think these words significantly impact the RC task, so we designed a context encoder called FD_BERT. Inspired by the idea of self-attention, the context encoder fuses the entity pair and its dependency information by performing attention operations with each word of the instance. Take the head entity as an example. The formula for the weight of each word is as follows:

\alpha = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d}}\right), \quad Q = \mathrm{linear}(E_h^i), \; K = \mathrm{linear}(E_{all}), \; V = \mathrm{linear}(E_{all})   (6)

where E_h^i represents the word embedding of the ith word of the head entity. The word E_h^i and each word of the instance undergo an attention operation, and after the max-pooling operation, the context encoding is as follows:

\bar{E}_h^i = \mathrm{maxpooling}(\alpha V)   (7)

The tail entity words and the entity pair's dependency words perform the same attention operation as the head entity words to obtain \bar{E}_t^i and \bar{E}_d^i. The context encoding of the head entity is obtained by summation, as shown in Eq. 8:

\bar{E}_h = \sum_{i=1}^{n} \bar{E}_h^i   (8)

In the same way, summation yields \bar{E}_t and \bar{E}_d for the tail entity and the dependency words. Finally, the instance encoding is defined as follows:

E = \bar{E}_h + \bar{E}_t + \bar{E}_d   (9)

3.4 Hybrid attention

We propose hybrid attention composed of two attention mechanisms: one based on SIA and the other based on QIA. When constructing a relation prototype, prototypical networks consider each sentence equally. However, we believe that the weight assignment should be based on the contribution of each sentence, and the instance that is more similar to the query instance should receive more weight. The weight can be calculated by cosine similarity, and the formula is as follows:

\beta = \mathrm{softmax}(\mathrm{cosine\_similarity}(x_i^k, q))   (10)

"cosine_similarity" is the cosine function used to find the cosine value between two vectors and express their similarity. Then the class prototype vector c_i (i = 1 \cdots N), composed of the K sentences of each relation, can be defined as follows:

c_i = \sum_{j=1}^{K} \beta_j x_i^j \quad (i = 1 \cdots N)   (11)

In the same way, we believe that the query instance q and c_i should also be related: the more similar q is to c_i, the more weight it should be assigned. For this reason, we dynamically set the weight of the query instance q, and the weight \gamma_i (i = 1 \cdots N) of the query instance q_i can be defined as follows:

\gamma_i = \mathrm{cosine\_similarity}(q, c_i) \quad (i = 1 \cdots N)   (12)

Then the query instance vector can be defined as follows:

q_i = \gamma_i q \quad (i = 1 \cdots N)   (13)

3.5 Fusion network

The prototypical networks and related variants use the Euclidean distance method to calculate the distance between the query vector q and the class prototype vector c_i, which requires extensive calculation. Following the fusion ideas in the relevant literature [36, 37], this paper designs a fusion network to ensure accuracy and rapid convergence. The calculation formula of the fusion function is as follows:

m(c_i, q_i) = [c_i, q_i, |c_i - q_i|] \quad (i = 1 \cdots N)   (14)

where q_i is the new vector of the query instance q after the hybrid attention mechanism is applied. The vector dimension is 3 * d_w after concatenation. In our experiments, we also compared several other connection functions and fusion methods.

After a linear layer reduces the dimensionality to d_w, an activation function outputs a new vector:

m'(c_i, q_i) = \mathrm{Relu}(\sigma(m(c_i, q_i)))   (15)

where \sigma(\cdot) is a linear layer and \mathrm{Relu}(\cdot) is an activation function. Finally, another linear layer \sigma'(\cdot) outputs a value:

d_i = \sigma'(m'(c_i, q_i))   (16)

The relation class with the maximum score d_i (i = 1 \cdots N) is the model's predicted relation class for the entity pair in the query instance q.
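As a concrete illustration of Eqs. 10–16, the sketch below combines SIA, QIA, and the fusion scoring head for one episode; the layer sizes, variable names, and overall structure are assumptions made for the example and are not taken from the published code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridAttentionFusion(nn.Module):
    """Sketch of SIA (Eqs. 10-11), QIA (Eqs. 12-13) and the fusion head (Eqs. 14-16)."""

    def __init__(self, hidden=768):
        super().__init__()
        self.reduce = nn.Linear(3 * hidden, hidden)   # Eq. 15: 3*d_w -> d_w
        self.score = nn.Linear(hidden, 1)             # Eq. 16: one matching score

    def forward(self, support, query):
        # support: [N, K, hidden] encoded support instances; query: [hidden]
        n, k, h = support.shape
        # SIA: weight each support sentence by its cosine similarity to the query
        beta = F.softmax(F.cosine_similarity(support, query.view(1, 1, h), dim=-1), dim=-1)  # [N, K]
        prototypes = (beta.unsqueeze(-1) * support).sum(dim=1)                               # Eq. 11: [N, hidden]
        # QIA: rescale the query by its similarity to each prototype
        gamma = F.cosine_similarity(prototypes, query.view(1, h), dim=-1)                    # Eq. 12: [N]
        queries = gamma.unsqueeze(-1) * query.view(1, h)                                     # Eq. 13: [N, hidden]
        # Fusion: concatenate [c_i, q_i, |c_i - q_i|] and score with two linear layers
        fused = torch.cat([prototypes, queries, (prototypes - queries).abs()], dim=-1)       # Eq. 14
        scores = self.score(torch.relu(self.reduce(fused))).squeeze(-1)                      # Eqs. 15-16: [N]
        return scores.argmax().item(), scores

# usage sketch for a 5-way-2-shot episode:
# pred, scores = HybridAttentionFusion()(support_enc, query_enc)
```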


4 Experiments

In this section, we present the experimental details and the results of the proposed model. We conduct ablation studies on the context encoder, the hybrid attention mechanism, and the fusion network and explore their impact on the results.

4.1 Dataset and hyperparameters

The dataset used in this experiment is FewRel1.0, which contains 100 relations. Each relation has 700 sentences. Among them, 64 relations form the training set and 16 form the validation set; this part of the data has been made public. The other 20 relations are the test set, which has not been made public. An official test script is provided, and results can be submitted to obtain the accuracy on the test set.

The dependency parse tree ignores words that are more than 40 positions away from the head and tail entities. According to Snell et al., a model trained with more relations in the training phase can achieve better results; therefore, the number of relations used in the training phase is 10. Due to the limitation of hardware resources, the batch size used in this experiment is 2. If the accuracy on the validation set drops twice in a row, the model is considered to have converged and training stops; we choose the model with the highest accuracy as the final training result. The GPU used in the experiment was a 2080 Ti, with the PyTorch framework. The related parameters are shown in Table 2.

Table 2 Model-related hyperparameters

Component         Parameter                   Value
Position feature  Max relative distance       ±40
                  Dimension                   5
Batch             Batch size                  2
Optimization      Size of query set Q         5
                  Size of support set Ntrain  10
                  Weight decay                10^-5
                  Dropout rate                0.1
                  Max_length                  40
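The training configuration and the stopping rule described above can be summarized in a short sketch; the dictionary keys and the hypothetical should_stop helper are illustrative only.

```python
# Hyperparameters taken from Table 2 / Section 4.1
config = {
    "max_relative_distance": 40,
    "position_dim": 5,
    "batch_size": 2,
    "query_set_size": 5,
    "train_support_relations": 10,   # N used during training
    "weight_decay": 1e-5,
    "dropout": 0.1,
    "max_length": 40,
}

def should_stop(val_accuracies, patience=2):
    """Stop when validation accuracy has dropped `patience` times in a row,
    as described in Section 4.1; the best checkpoint is kept separately."""
    drops = 0
    for prev, curr in zip(val_accuracies, val_accuracies[1:]):
        drops = drops + 1 if curr < prev else 0
        if drops >= patience:
            return True
    return False
```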

Table 3 Accuracy (%) on the FewRel1.0 test set for different models

Model              5-way-1-shot  5-way-5-shot  10-way-1-shot  10-way-5-shot
Proto (CNN)        74.52         88.40         62.38          80.45
Proto (BERT)       80.68         89.60         71.48          82.89
APN-LW-JRL (CNN)   73.45         87.27         61.02          77.69
MLMAN (CNN)        82.98         92.66         75.59          87.29
Proto-HATT (CNN)   –             90.12         –              83.05
BERT-PAIR (BERT)   88.32         93.22         80.63          87.02
HAFN (this work)   85.56         94.93         79.58          90.83


4.2 Comparison with previous work

Table 3 shows the experimental results of different models on the FewRel1.0 test dataset. The results are taken from the published papers. In the Proto model, Han et al. (2018) directly migrated the prototypical networks from the CV field and used only a CNN to encode the instances. Afterward, Han et al. used the BERT model to encode sentences in 2019; Table 3 shows that the BERT encoding method improves few-shot RC accuracy in the prototypical networks. The MLMAN model was proposed by Ye et al. in 2019. Unlike the Proto model, the MLMAN model encodes the query instance and support instances interactively by considering their matching information at both the local and instance levels. The Proto-HATT model (2019) also uses a CNN as the encoder; it adds instance-level and feature-level attention mechanisms to the Proto model to highlight essential instances and features.

The BERT-PAIR model (2019) is based on the BERT sequence classification model. The query instance is paired with each instance of the support set to form sequences that are used as the model's input; the BERT model then outputs a score, and the maximum score indicates that the query instance has the same relation as the corresponding support instances. Xiao et al. proposed the APN-LW-JRL model in 2021. The model adds label words to the representation of the prototype class through an adaptive mixture mechanism and designs a loss function for joint representation learning to encode each sentence of the support set [31]. The Signatures model (2020) combines the distribution statistics of a source pool and the support set to generate class-specific attention; the model requires additional source pool datasets.

Compared with the above models, the HAFN model has three main differences. First, the HAFN model fuses the information of the entity pair and its dependencies, and we design a method to embed these words; this distinguishes the HAFN model from the other models. Second, the Proto-HATT model targets noisy data; it designed sentence-level and feature-level attention mechanisms to emphasize the importance of specific sentences and features. We create a hybrid attention mechanism that includes support and query sentence-level attention, which is different from that of the Proto-HATT model: we consider both support and query instances and design the attention function differently. The attention of the Proto-HATT model adopts a linear layer and element-wise products and is finally processed by softmax, whereas we use a simple vector-angle method, which is less computationally intensive. The Signatures model combines the distribution statistics of the source pool and the support set to generate class-specific attention and requires additional source pool datasets.

In addition, except for the MLMAN model in Table 3, the other models use the distance-distribution method for class matching. The MLMAN model adopts the technique of connecting two vectors, and we also conduct experiments with this method; for details, see Section 4.3.3. The experimental results show that our proposed fusion method outperforms that of the MLMAN model.

Compared with the baseline models, the proposed HAFN model thus has significant differences. First, we incorporate the entity pair and its dependency information into the sentence encoding for the first time; the other models do not use this information. Inspired by the idea of self-attention, the FD_BERT encoding method is proposed to emphasize the role of entity pairs and their dependencies in sentence encoding, and the experimental results verify their importance.

Second, instead of assigning weights equally to the sentences in the support set when generating the prototype vector, weights are assigned according to the similarity to the query sentence. Additionally, the query sentence weights are changed dynamically according to the similarity between the query sentence and the prototype vector: the more similar the query sentence is to the prototype vector, the more weight it is assigned. This dynamic assignment of weights to query sentences is also used for the first time and is not used by other models.

Third, we introduce the fusion network. The benchmark models, except the MLMAN model, use the Euclidean distance method to compute the similarity of query sentences to the prototype vectors. Experiments show that our proposed fusion network approach improves the convergence speed and accuracy compared with the Euclidean distance approach.

These differences are demonstrated to be effective by the ablation experiments.

In Table 3, the HAFN model is better than all the other models at K = 5 (5-shot), and its accuracy is worse than that of BERT-PAIR at K = 1 (1-shot). The main reason is that the SIA does not work with only one instance: the attention mechanism can only focus on one sentence. Overall, the HAFN model achieved good results.

We conducted experiments on the HuffPost headlines dataset to further validate the effectiveness of the proposed method. The HuffPost headlines dataset has 41 classes, which contain the news headlines of HuffPost from 2012-2018. This dataset is larger than the FewRel1.0 dataset, with 200,853 news headlines; the FewRel1.0 dataset has a total of 70,000 instances.

We follow the experimental dataset setup of Bao et al. [34] and divide the 41 classes into training, test, and validation sets: the training set has 20 classes, the validation set has 5 classes, and the test set has 16 classes. The HAFN model needs to fuse entity pair and dependency information. The FewRel dataset is already labeled with entity pairs, but the HuffPost headlines are not. We use the Stanford named entity recognizer to annotate sentence entities as head and tail entities. If only one entity is annotated, it is the head entity, and the tail entity is empty. If no entity is marked, both head and tail entities are empty. The number of sentences with empty head and tail entities in the experiment is 133,438. The experimental results of the HAFN model and other baselines on the HuffPost headlines dataset are shown in Table 4.

Table 4 Accuracy (%) on the HuffPost headlines test set for different models

Model               5-way-1-shot  5-way-5-shot
Proto (CNN)         35.6          41.6
MAML (CNN)          36.1          49.6
Ridge (CNN)         36.3          54.8
Proto (BERT)        39.27         52.47
Signatures (BERT)   42.12         62.97
HAFN (this work)    46.16         61.69
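The head/tail annotation rule used for the HuffPost headlines can be written as a small helper; the function below assumes that named entities have already been extracted into a list, and the treatment of sentences with more than two entities (taking the first two) is our reading of the rule rather than something stated explicitly in the text.

```python
def assign_head_tail(entities):
    """Annotation rule from Section 4.2: the first recognized entity becomes the
    head, the second the tail; missing entities are left empty (assumed reading)."""
    head = entities[0] if len(entities) >= 1 else ""
    tail = entities[1] if len(entities) >= 2 else ""
    return head, tail

# e.g. assign_head_tail(["Manchester"]) -> ("Manchester", "")
#      assign_head_tail([])             -> ("", "")
```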


Table 5 Accuracy (%) on the FewRel1.0 validation set for different ablation methods

Model          5-way-1-shot  5-way-5-shot  10-way-1-shot  10-way-5-shot
HAFN           83.46         92.64         73.77          85.72
W/o FD_BERT    80.90         88.60         70.51          79.10
w/ Head        82.79         92.13         73.08          84.61
w/ Tail        82.36         92.23         72.38          84.83
w/ Dep         82.51         92.13         72.46          84.74
W/o QIA        82.65         91.63         72.79          84.30
W/o SIA        83.46         92.06         73.77          84.36

The experimental results show that the HAFN model also achieves good results on the HuffPost headlines dataset. It significantly outperforms all baselines on the 5-way-1-shot task and is 4.04% more accurate than the Signatures model. Its accuracy on the 5-way-5-shot task is 1.28% lower than that of the Signatures model. This indicates that HAFN is still valid on the HuffPost headlines dataset.

4.3 Ablation study

To further verify the performance of each proposed method, this section demonstrates the effectiveness of the proposed modules on the final experimental results through an ablation study. The ablation experiments were carried out on the validation set. First, an ablation study of the context encoder is conducted: we replace our FD_BERT with the plain BERT model and use only the BERT method for context encoding, denoted "W/o FD_BERT." The FD_BERT model fuses the head and tail entities and the dependent words of the entity pair, and we further explored the impact of each of these parts on accuracy. In the FD_BERT method, using only the head entity information is denoted "w/ Head," using only the tail entity information is denoted "w/ Tail," and using only the dependent word information of the entity pair is denoted "w/ Dep." The HAFN model with the QIA removed is denoted "W/o QIA"; the model with the SIA removed, using the plain average instead, is denoted "W/o SIA." The results of the ablation study are shown in Table 5.

We also conducted ablation experiments on the HuffPost headlines dataset, and the experimental results are shown in Table 6. The experimental results in Table 6 show that the proposed method is still appropriate for the HuffPost headlines dataset.

Table 6 Accuracy (%) on the HuffPost headlines test set for different ablation methods

Model          5-way-1-shot  5-way-5-shot
HAFN           46.16         61.69
W/o FD_BERT    39.33         52.47
w/ Head        45.86         59.70
w/ Tail        45.59         59.60
w/ Dep         45.64         59.76
W/o QIA        45.36         61.18
W/o SIA        46.16         61.51

4.3.1 The effect of FD_BERT

Tables 5 and 6 show that the FD_BERT method has the most significant effect on accuracy; the SIA and the QIA have less influence. In the FD_BERT encoding model, "w/ Head," "w/ Tail," and "w/ Dep" each improve the accuracy, but each single effect is smaller than the combined effect of the three. This shows that the entity pair and its dependent word information are significant for RC.

We visualize the sentence encodings of the support set produced by the BERT and FD_BERT encoders. Figure 3 is a visualization for a 5-way-5-shot task. Each dot represents a sentence of the support set, and each color represents a relation. (a) and (b) are the encoding visualization results of the sentences by the FD_BERT and BERT encoders, respectively. As shown in Fig. 3, sentences are more aggregated by FD_BERT than by BERT; that is, sentences belonging to the same relation are more concentrated together. This shows that the FD_BERT encoder understands the semantics of the sentences better than BERT, indicating the effectiveness of FD_BERT.

4.3.2 The effect of hybrid attention

According to the query instance, SIA dynamically assigns weights to the K sentences in the support set. It is more reasonable to assign more weight to the support set sentences that are more similar to the query instance than to distribute the weights equally, as in the prototypical networks. The "W/o SIA" experimental results show that SIA improves the accuracy and verify the effectiveness of SIA. When the number of support set instances is one, the SIA does not play a role, and the model's accuracy is not improved: when there is only one instance, all the attention focuses on this sentence, and its weight can only be 1.


Fig. 3 Visualization of sentence encoding with different encoders in the 5-way-5-shot task

Inspired by the idea of SIA, when the K sentences of each relation are represented as a class prototype vector by SIA, we believe that the query instance that is more similar to the class prototype should be assigned more weight. The "W/o QIA" experimental results of the ablation study show that the accuracy is reduced without QIA in RC tasks, indicating the effectiveness of the proposed QIA method.

We visualize the experimental results with and without the hybrid attention mechanism. Figure 4 shows the class prototype vectors generated from the support set sentences and the query instance of a 5-way-5-shot task. The circles represent the five class prototype vectors generated from the support set sentences, and the triangles represent the vector of the query instance. Among them, (a)(b) is one group of experimental examples, and (c)(d) is another group; (a)(c) do not add the hybrid attention mechanism, and (b)(d) add it. The comparison of (a) and (b) shows that regardless of whether the hybrid attention mechanism is added, the query instance is judged as having the same relation. However, after adding the hybrid attention mechanism, the query instance is closer to the correct class prototype vector.

Fig. 4 Comparison of experimental results with and without the hybrid attention mechanism


Table 7 Accuracy (%) on the FewRel1.0 validation set for different fusion methods

Model                      5-way-1-shot  5-way-5-shot  10-way-1-shot  10-way-5-shot
(m1, m2, |m1-m2|)          83.46         92.64         73.77          85.72
(m1, m2, |m1-m2|, m1*m2)   83.00         92.12         73.14          84.88
(m1, m2)                   83.24         91.39         73.05          83.79
Euclidean distance         82.81         92.21         72.67          85.02

In Fig. 4(c), after principal component analysis (PCA) processing, the coordinates of the "0" relation prototype are (-9.876678, -6.461625), the coordinates of the "2" relation prototype are (-16.707176, -17.965567), and the query instance coordinates are (-44.804405, -16.9106). The distances from the query instance to relation prototypes 0 and 2 are 36.45 and 28.11, respectively.

In Fig. 4(d), the coordinates of the "0" relation prototype are (4.1756144, -17.280622), the coordinates of the "2" relation prototype are (3.1600468, -31.593447), and the query instance coordinates are (-20.522394, -6.0680723). The distances from the query instance to relation prototypes 0 and 2 are 27.12 and 34.81, respectively.

The correct relation of the query instance in (c)(d) is "0". However, in (c), without the hybrid attention mechanism, the relation judged is "2", whereas (d) considers the relation to be "0" because the hybrid attention mechanism is added. This shows that the hybrid attention mechanism is effective.

4.3.3 Effect of the fusion network

In our experiment, we compared two class-matching methods: one is the method of connecting the query instance and relation prototypes proposed in this paper, and the other is the Euclidean distance method of prototypical networks and their variant models. This paper also designed a variety of fusion methods, and the experimental results of each method are shown in Table 7.

Through experimental comparison, it is found that the proposed fusion network method improves the accuracy slightly compared with the Euclidean distance method. Moreover, the speed of convergence is dramatically improved. In summary, the experimental results of the connection method (m1, m2, |m1 - m2|) are better than those of the other two fusion methods, and it is also superior to the Euclidean distance method.

The experiment further compares the convergence time of the fusion network in this paper and the Euclidean distance. Because the Euclidean distance method needs to calculate the distance between the query instance and each candidate class prototype vector, and each vector has a dimension of 768 in our experiment, the number of calculations in the model is enormous. However, the fusion network method only needs to connect the query vector and the candidate relation prototype and pass them through a linear layer, avoiding many calculations, so it converges faster than the Euclidean distance method. The convergence time comparison is shown in Tables 8 and 9; the convergence times in Tables 8 and 9 are for the training phases on the FewRel1.0 and HuffPost headlines datasets, respectively. The convergence time of the fusion network is much shorter than that of the Euclidean distance method.

Table 9 Convergence times of the training phase of the two methods on the HuffPost headlines dataset

Model               5-way-1-shot  5-way-5-shot
Euclidean distance  7h43m22s      5h21m01s
(m1, m2, |m1-m2|)   1h10m29s      1h27m52s

We also visualize the trends of the loss and accuracy for the Euclidean distance and our proposed fusion network methods. Figure 5 shows the variation of the loss on the training set with the number of iterations for the two approaches during training, and Figure 6 shows the variation of the accuracy on the validation set.

As shown in Figs. 5 and 6, the loss decreases and the accuracy increases faster for the fusion network than for the Euclidean distance method. This result is also consistent with the conclusions from Tables 8 and 9. The convergence time of the fusion network for the different tasks is between 30% and 36.2% of that of the Euclidean distance on the FewRel1.0 dataset.

4.3.4 Computational complexity

Let E be the number of epochs, D the dataset size, B the batch size, and T the time complexity of a single iteration. The model complexity can be denoted as O(E * (D/B) * T). The time complexity of one iteration is discussed below.

Some definitions follow. V indicates the number of words in the BERT-base vocabulary; the total number of words is 30,522.

Table 8 Convergence times of the training phase of the two methods on the FewRel1.0 dataset

Model               5-way-1-shot  5-way-5-shot  10-way-1-shot  10-way-5-shot
Euclidean distance  5h46m44s      6h23m17s      7h10m10s       9h58m51s
(m1, m2, |m1-m2|)   1h47m15s      2h19m24s      2h12m7s        3h4m36s


Fig. 5 Loss of the two methods on the training set

S represents the length of the sentence; in this experiment, S is 40. H represents the dimension of the word embedding, 768, and L is the number of BERT-base layers, 12. N represents the number of relations, K represents the number of sentences per relation in the support set, and Q indicates the number of query set sentences in each batch. M denotes the number of head and tail entity words and their dependency words.

The HAFN model consists of FD_BERT for sentence encoding, support instance-level attention (SIA) for generating prototype vectors, and QIA for dynamically generating new query vectors; finally, a fusion network is used to predict the final result.

1. FD_BERT computational complexity
   The first step of the HAFN model is to use BERT to obtain word embeddings. BERT's word embedding method is divided into three parts: token embedding, segment embedding, and position embedding. The required parameters are V*H + 2*H + S*H, so the time complexity of embedding initialization can be denoted as O((V+S)H).

Fig. 6 Accuracy of the two methods on the validation set


Table 10 The computational complexity of the word embedding model

Embedding initialization  Multi-head attention  Add & norm  Feed forward  Pooling
O((V+S)H)                 O(LS²H)               O(LSH)      O(LSH²)       O(LSH²)

Table 11 The computational complexity of the modules of the HAFN model

FD_BERT                       SIA        QIA      Fusion network
O(LS²H) + O(LSH²) + O(MS²H)   O(N²QKH²)  O(N²QH)  O(NH)

parameters are V*H+2*H+S*H. The time complexity connected layer becomes 3H, the output value is N, and
can be denoted as O((V+S)H). its complexity is O(NH). The computational complexity
The BERT encoding layer includes “multi head atten- of the modules of the HAFN model is shown in Table 11.
tion,” “add & norm,” “feed forward,” and pooling.
The Therefore, the computational complexity
of
a single

“multi head attention” complexity is O L S 2 H [35]. iteration is expressed as O L S 2H + O LSH2 +

The computational complexity of “add & norm” is O M S2 H + O N 2 Q K H 2 .
focused on the norm. The norm is layer normalization,
and its main parameters are the mean and variance. The
time complexity is called O (L S H ). 5 Discussion
The “feed forward” is two fully
connected layers. The
time complexity is O L S H 2 . In general, the experimental results of the HAFN model
The pooling operation is essentially a fully connected
achieved good results. The ablation study verified the effec-
layer. The time complexity can be denoted as O L S H 2 . tiveness of the proposed method. According to the published
The word embedding complexity using BERT is shown source code1 , BERT-PAIR uses the “BertForSequenceClas-
in Table 10. sification” model in the transformer library provided by
Therefore, the computational complexity
of
a single
Hugging face, which requires many computing resources.
iteration is expressed as O L S 2H + O LSH2 +
According to the parameter configuration in Table 2 on our
O M S2 H + O N 2 Q K H 2 . computer in this experiment, we executed the 5-way-1-shot
The head and tail entities and their dependencies per- task of BERT-PARI, which showed insufficient computing
form
self-attention
operations, whose complexity is resources. However, our computer can run the 5-way-10-shot
O M S 2 H . Next, the sentence encoding is obtained task of the HAFN model.
after pooling operation with time complexity O S H 2 . Table 3 shows that the accuracy of HAFN is lower than
Therefore, the time complexity of the sentence 2encod-
that of the BERT_PAIR model on the 5-way-1-shot and 10-
ing method FD_BERT
can
be denoted as O LS H + way-1-shot tasks. This is mainly because the SIA mechanism
O L S H 2 + O M S2 H . does not work when K = 1 (N -way-1-shot). In this case,
2. SIA Computational Complexity attention can only focus on one sentence in the support set. As
The relation prototype is constructed using SIA. When shown in Table 4 of the ablation study, the model’s accuracy
creating the prototype, the similarity between the query does not change on the 5-way-1-shot and 10-way-1-shot tasks
instances and the support set sentence is first calcu- when the model performs the “w/SIA” experiment. This is
lated, and the time complexity is O(N K N Q H ). The new the disadvantage of the model.
prototype is obtained by the softmax and matrix multi-
However, our model also has advantages. The first advan-
plication operation. Its complexity is O N K N Q H 2 , so tage is that the product value of N and K (N ∗ K ) is greater,
the complexity of the prototype vector
construction part and the model’s accuracy is better. In Table 3, we compare
can be denoted as O N 2 Q K H 2 . the accuracy of our proposed HAFN model and the BERT-
3. QIA Computational Complexity PAIR model. The accuracy of the HAFN model differs from
The QIA first performs the similarity calculation with that of the BERT-PAIR model by -2.76%, -1.05%, 1.71%,
complexity O(N N Q H ). Then softmax and matrix dot 3.81% in 5-way-1-shot, 10-way-1-shot, 5-way-5-shot, and
product operations are performed, and the time complex- 5-way-10-shot tasks, respectively. This shows that the model
ity is O(NNQH). Therefore,
the QIA complexity can be works better when the N ∗ K value is greater. That is, the
written as O N 2 Q H . more relations and sentences of each relation in the support
4. Computational Complexity of the Fusion Network set, the more significant the model’s advantages.
The fusion network layer predicts N relations. Because
of the fusion operation, the input dimension of the fully 1 https://github.com/thunlp/FewRel

123
Li et al.

There are two main reasons for this. First, the greater N ∗ Acknowledgements This paper was supported by the Exercise and
K is, the more sentences there are in the support set. The Health Laboratory of the Institute of Intelligent Machinery, Chinese
Academy of Sciences. Thanks to Tsinghua University for developing
more sentences there are, the more information that is fused the FewRel dataset. We would like to thank the anonymous reviewers
about entity pairs and their dependencies, and our proposed for their helpful comments.
FD_BERT mechanism can play a more significant role. From
the ablation study results in Table 4, when performing the Funding This research was supported by the National Key Research and
Development Program of China (2022YFC2010200, 2020YFC2005603),
“W/o FD_BERT” ablation experiment, the model accuracy
the National Natural Science Foundation of China (NSFC) (grant
drops by 2.56%, 3.26%, 4.04%, and 6.62% on the 5v1 s, 10v1 numbers 61701482), the Key projects of the National Natural Sci-
s, 5V5s, and 5V10S tasks, respectively. This shows that the ence Foundation of universities in Anhui Province (grant number
greater the N ∗ K value is, the greater the accuracy decreases, KJ2020A0112),the Major Special Projects of Anhui Province(grant
number 202103a07020004), the Natural Science Foundation of Anhui
indicating that the effect of FD_BERT is more significant.
Province, China (grant number 1808085MF191),the Education Research
Second, the effect of the hybrid attention mechanism is Project of Anhui Province, China(2020jyxm1573) and the High-level
more significant at K > 1. When K = 1, the SIA does not Talents Research Start-up Fund of Hefei Normal University (grant num-
work. However, the SIA comes into play at k > 1. From ber 2020rcjj45). In addition, we would like to thank the anonymous
reviewers who have helped to improve the paper.
the results of the ablation study in Table 4, when conducting
the “W/o SIA” ablation experiment, the model’s accuracy on Data availability Data openly available in a public repository.
the 5v1 s, 10v1 s, 5V5s, and 5V10S tasks dropped by 0%,
0%, 1.01%, and 1.36%, respectively. This shows that when Declarations
K > 1, the model accuracy decreases, indicating that the
SIA plays a role in this case.
Another advantage is that we design a fusion network to Conflict of interest We declare that we have no financial and personal
relationships with other people or organizations that can inappropriately
accelerate the rapidity of convergence. Other models except influence our work. There is no professional or other personal interest
the MLMAN in Table 3 do not create similar fusion net- of any nature or kind in any product, service and/or company that could
works. Most of them still use distance distribution for class be construed as influencing the position presented in, or the review of
matching. the manuscript entitled.
The HAFN model is less effective than the BERT-PAIR
model when there is only one sentence (1-shot) in the support
set and is better than all models at other tasks (N-way-5-shot). References
It requires less computational resource overhead than the
BERT_PAIR model. Additionally, using the fusion network 1. Wang, L., Cao, Z., De Melo, G., Liu, Z.: Relation classification
via multi-level attention cnns. In: 54th Annual Meeting of the
for class matching reduces the convergence time. Therefore,
Association for Computational Linguistics, ACL 2016, August 7,
our model can be better applied in industry. 2016 - August 12, 2016. 54th Annual Meeting of the Association
for Computational Linguistics, ACL 2016 - Long Papers, vol. 3,
pp. 1298–1307. Association for Computational Linguistics (ACL).
10.18653/v1/p16-1123
6 Conclusion and future work

This paper proposes a model for few-shot RC. First, the FD_BERT method is proposed for context encoding. Second, a hybrid attention mechanism is presented, including SIA and QIA. Finally, a fusion network is proposed for class matching, which fuses the class prototype and the query instance. Our proposed HAFN model outperforms the other models when multiple instances (K = 5) of each relation are in the support set. When there is only one instance (K = 1) of each relation in the support set, its performance is worse than that of the BERT-PAIR model; in this case, SIA does not play a role. The ablation study demonstrates the effectiveness of each component. Additionally, the fusion network improves the convergence speed of the HAFN model, which may make it more suitable for industrial applications. In future work, we will explore the hybrid attention mechanism further to make the model more general and will continue to adapt it to industrial applications.

Conflict of interest We declare that we have no financial and personal relationships with other people or organizations that could inappropriately influence our work. There is no professional or other personal interest of any nature or kind in any product, service and/or company that could be construed as influencing the position presented in, or the review of, this manuscript.

References

1. Wang, L., Cao, Z., De Melo, G., Liu, Z.: Relation classification via multi-level attention CNNs. In: 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016 - Long Papers, vol. 3, pp. 1298–1307. Association for Computational Linguistics (2016). https://doi.org/10.18653/v1/p16-1123
2. Chen T, Wang N, Wang H, Zhan H (2021) Distant supervision for relation extraction with sentence selection and interaction representation. Wireless Communications and Mobile Computing 2021:1–16. https://doi.org/10.1155/2021/8889075
3. Feng RW, Zheng XS, Gao TX, Chen JT, Wang WZ, Chen DZ, Wu J (2021) Interactive few-shot learning: Limited supervision, better medical image segmentation. IEEE Transactions on Medical Imaging 40(10):2575–2588. https://doi.org/10.1109/tmi.2021.3060551
4. Ye HJ, Hu HX, Zhan DC (2021) Learning adaptive classifiers synthesis for generalized few-shot learning. International Journal of Computer Vision 129(6):1930–1953. https://doi.org/10.1007/s11263-020-01381-4
5. Gao, T., Han, X., Zhu, H., Liu, Z., Li, P., Sun, M., Zhou, J.: FewRel 2.0: Towards more challenging few-shot relation classification. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, pp. 6250–6255. Association for Computational Linguistics (2019). https://doi.org/10.18653/v1/D19-1649


6. Han, X., Zhu, H., Yu, P., Wang, Z., Yao, Y., Liu, Z., Sun, M.: FewRel: A large-scale supervised few-shot relation classification dataset with state-of-the-art evaluation. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018, pp. 4803–4809. Association for Computational Linguistics (2018). https://doi.org/10.18653/v1/D18-1514
7. Chen YS, Chiang SW, Wu ML (2022) A few-shot transfer learning approach using text-label embedding with legal attributes for law article prediction. Applied Intelligence 52(3):2884–2902. https://doi.org/10.1007/s10489-021-02516-x
8. Gao, T., Han, X., Liu, Z., Sun, M.: Hybrid attention-based prototypical networks for noisy few-shot relation classification. In: 33rd AAAI Conference on Artificial Intelligence, AAAI 2019, pp. 6407–6414. AAAI Press (2019). https://doi.org/10.1609/aaai.v33i01.33016407
9. Xie Y, Wang H, Yu B, Zhang C (2020) Secure collaborative few-shot learning. Knowledge-Based Systems 203:10. https://doi.org/10.1016/j.knosys.2020.106157
10. Xu H, Wang JX, Li H, Ouyang DQ, Shao J (2021) Unsupervised meta-learning for few-shot learning. Pattern Recognition 116:10. https://doi.org/10.1016/j.patcog.2021.107951
11. Li DW, Tian YJ (2018) Survey and experimental study on metric learning methods. Neural Networks 105:447–462. https://doi.org/10.1016/j.neunet.2018.06.003
12. Snell, J., Swersky, K., Zemel, R.: Prototypical networks for few-shot learning. In: Advances in Neural Information Processing Systems, NIPS 2017, vol. 2017-December, pp. 4078–4088. Neural Information Processing Systems Foundation (2017)
13. Li LQ, Wang JB, Li JC, Ma QL, Wei J (2019) Relation classification via keyword-attentive sentence mechanism and synthetic stimulation loss. IEEE-ACM Transactions on Audio Speech and Language Processing 27(9):1392–1404. https://doi.org/10.1109/taslp.2019.2921726
14. Sun HY, Grishman R (2022) Lexicalized dependency paths based supervised learning for relation extraction. Computer Systems Science and Engineering 43(3):861–870. https://doi.org/10.32604/csse.2022.030759
15. Shi Y, Xiao Y, Quan P, Lei ML, Niu LF (2021) Distant supervision relation extraction via adaptive dependency-path and additional knowledge graph supervision. Neural Networks 134:42–53. https://doi.org/10.1016/j.neunet.2020.10.012
16. Liu Y, Li SJ, Wei FR, Ji H (2016) Relation classification via modeling augmented dependency paths. IEEE-ACM Transactions on Audio Speech and Language Processing 24(9):1589–1598. https://doi.org/10.1109/taslp.2016.2573050
17. Ma, Y.H., Zhu, J., Liu, J.: Enhanced semantic representation learning for implicit discourse relation classification. Applied Intelligence (2021). https://doi.org/10.1007/s10489-021-02785-6
18. Runyan Z, Fanrong M, Yong Z, Bing L (2018) Relation classification via recurrent neural network with attention and tensor layers. Big Data Mining and Analytics 1(3):234–244. https://doi.org/10.26599/bdma.2018.9020022
19. Mintz, M., Bills, S., Snow, R., Jurafsky, D.: Distant supervision for relation extraction without labeled data. In: Proceedings of the 47th Annual Meeting of the Association for Computational Linguistics and the 4th International Joint Conference on Natural Language Processing of the AFNLP, ACL 2009, 2-7 August 2009, Singapore. https://doi.org/10.3115/1690219.1690287
20. Zhang, N., Deng, S., Sun, Z., Wang, G., Chen, X., Zhang, W., Chen, H.: Long-tail relation extraction via knowledge graph embeddings and graph convolution networks. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 3016–3025. Association for Computational Linguistics (2019). https://doi.org/10.18653/v1/N19-1306
21. Khan MS, Lohani QMD (2022) Topological analysis of intuitionistic fuzzy distance measures with applications in classification and clustering. Engineering Applications of Artificial Intelligence 116. https://doi.org/10.1016/j.engappai.2022.105415
22. Hallajian B, Motameni H, Akbari E (2022) Ensemble feature selection using distance-based supervised and unsupervised methods in binary classification. Expert Systems with Applications 200:18. https://doi.org/10.1016/j.eswa.2022.116794
23. Jiang W, Huang K, Geng J, Deng XY (2021) Multi-scale metric learning for few-shot learning. IEEE Transactions on Circuits and Systems for Video Technology 31(3):1091–1102. https://doi.org/10.1109/tcsvt.2020.2995754
24. Lake BM, Salakhutdinov R, Tenenbaum JB (2015) Human-level concept learning through probabilistic program induction. Science 350(6266):1332–1338. https://doi.org/10.1126/science.aab3050
25. Li, W.B., Xu, J.L., Huo, J., Wang, L., Gao, Y., Luo, J.B.: Distribution consistency based covariance metric networks for few-shot learning. In: 33rd AAAI Conference on Artificial Intelligence, AAAI 2019, pp. 8642–8649. AAAI Press (2019). https://doi.org/10.1609/aaai.v33i01.33018642
26. Xie, Y.X., Xu, H., Yang, C.C., Gao, K.: Multi-channel convolutional neural networks with adversarial training for few-shot relation classification. In: 34th AAAI Conference on Artificial Intelligence, AAAI 2020, vol. 34, pp. 13967–13968. AAAI Press (2020). https://doi.org/10.1609/aaai.v34i10.7256
27. Xie Y, Xu H, Li J, Yang C, Gao K (2020) Heterogeneous graph neural networks for noisy few-shot relation classification. Knowledge-Based Systems 194. https://doi.org/10.1016/j.knosys.2020.105548
28. Ye, Z.X., Ling, Z.H.: Multi-level matching and aggregation network for few-shot relation classification. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (2019). https://doi.org/10.18653/v1/P19-1277
29. Gao, T.Y., Han, X., Xie, R.B., Liu, Z.Y., Lin, F., Lin, L.Y., Sun, M.S.: Neural snowball for few-shot relation learning. In: 34th AAAI Conference on Artificial Intelligence, AAAI 2020, vol. 34, pp. 7772–7779. AAAI Press (2020). https://doi.org/10.1609/aaai.v34i05.6281
30. Pang N, Tan Z, Xu H, Xiao WD (2020) Boosting knowledge base automatically via few-shot relation classification. Frontiers in Neurorobotics 14. https://doi.org/10.3389/fnbot.2020.584192
31. Xiao, Y., Jin, Y., Hao, K.: Adaptive prototypical networks with label words and joint representation learning for few-shot relation classification. IEEE Transactions on Neural Networks and Learning Systems PP (2021). https://doi.org/10.1109/tnnls.2021.3105377
32. Wu JY, Zhao ZB, Sun C, Yan RQ, Chen XF (2020) Few-shot transfer learning for intelligent fault diagnosis of machine. Measurement 166. https://doi.org/10.1016/j.measurement.2020.108202


33. Hou, Y.T., Lai, Y.K., Wu, Y.S., Che, W.X., Liu, T.: Few-shot learning for multi-label intent detection. In: 35th AAAI Conference on Artificial Intelligence, AAAI 2021, vol. 35, pp. 13036–13044. AAAI Press (2021)
34. Bao, Y., Wu, M., Chang, S., Barzilay, R.: Few-shot text classification with distributional signatures. In: International Conference on Learning Representations (2020). https://doi.org/10.1007/978-981-33-4859-2_14
35. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: Advances in Neural Information Processing Systems, NIPS 2017, vol. 2017-December, pp. 5999–6009. Neural Information Processing Systems Foundation (2017)
36. Chen XF, Wang GH, Ren HP, Cai Y, Leung HF, Wang T (2022) Task-adaptive feature fusion for generalized few-shot relation classification in an open world environment. IEEE-ACM Transactions on Audio Speech and Language Processing 30:1003–1015. https://doi.org/10.1109/taslp.2022.3153254
37. Wang, W., Yan, M., Wu, C.: Multi-granularity hierarchical attention fusion networks for reading comprehension and question answering. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1705–1714. Association for Computational Linguistics (2018). https://doi.org/10.18653/v1/P18-1158

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


Authors and Affiliations

Yibing Li1,2,3 · Zenghui Ding1 · Zuchang Ma1 · Yichen Wu1,2 · Yu Wang1,2 · Ruiqi Zhang1,2 · Fei Xie3 · Xiaoye Ren3

Yibing Li
liyibing@mail.ustc.edu.cn

Yichen Wu
wuyichen@mail.ustc.edu.cn

Yu Wang
626344827@qq.com

Ruiqi Zhang
1131377413@qq.com

Fei Xie
xiefei9815057@sina.com

Xiaoye Ren
121403909@qq.com

1 Institute of Intelligent Machines, Institutes of Physical Science, Chinese Academy of Sciences, Hefei 230031, China
2 Science Island Branch of Graduate School, University of Science and Technology of China, Hefei 230026, China
3 School of Computer Science and Technology, Hefei Normal University, Hefei 230601, China
