You are on page 1of 12

Named Entity Recognition and Relation Extraction

using Enhanced Table Filling by Contextualized Representations

Youmi MaTatsuya Hiraoka Naoaki Okazaki


Tokyo Institute of Technology
{youmi.ma, tatsuya.hiraoka}@nlp.c.titech.ac.jp
okazaki@c.titech.ac.jp

Abstract single table (Miwa and Sasaki, 2014; Gupta et al.,


2016; Zhang et al., 2017). As reported by these
In this study, a novel method for extract-
ing named entities and relations from unstruc-
studies, table-filling is a promising approach for
arXiv:2010.07522v1 [cs.CL] 15 Oct 2020

tured text based on the table representation extracting both NE and relations. However, table-
is presented. By using contextualized word filling approaches require feature engineering and
embeddings, the proposed method computes search strategy, which is merely a representation of
representations for entity mentions and long- the label space of NER and RE. Previous work have
range dependencies without complicated hand- designed complicated features to encode contexts
crafted features or neural-network architec- and long-range dependencies between NE and rela-
tures. We also adapt a tensor dot-product to
tions. For example, Miwa and Sasaki (2014) used
predict relation labels all at once without re-
sorting to history-based predictions or search hand-crafted syntactic features (e.g., the shortest
strategies. These advances significantly sim- path between two words in the syntactic tree) and
plify the model and algorithm for the extrac- Zhang et al. (2017) extracted syntactic information
tion of named entities and relations. Despite using the encoder of a pre-trained syntactic parser.
its simplicity, the experimental results demon- Authors in Miwa and Sasaki (2014) explore decod-
strate that the proposed method outperforms ing (search) strategies for filling in the table, based
the state-of-the-art methods on the CoNLL04
on history-based predictions. In addition, they ex-
and ACE05 English datasets. We also con-
firm that the proposed method achieves a com-
plore six strategies to determine the order of filling
parable performance with the state-of-the-art for table cells. History-based predictions are also
NER models on the ACE05 datasets when mul- an obstacle for parallelizing label decoding.
tiple sentences are provided for context aggre- To address the aforementioned issues, we
gation. present a novel yet simple method for NER and RE
by enhancing table-filling approaches. We utilize
1 Introduction
BERT initialized with pre-trained weights for rep-
Named Entity Recognition (NER) (Nadeau and resenting entity mentions and encoding long-range
Sekine, 2007; Ratinov and Roth, 2009) and Rela- dependencies among entities to simplify feature
tion Extraction (RE) (Zelenko et al., 2003; Zhou engineering. Furthermore, the presented model
et al., 2005) are two major sub-tasks of Information enhances entity representations with span-based
Extraction (IE). Recent studies have reported ad- features. To reduce the burden of exploring search-
vantages of solving these two tasks jointly in terms ing strategies, we utilize a tensor dot-product to fill
of both efficiency and accuracy (Miwa and Sasaki, up cells of relation labels in the table all-at-once
2014; Li and Ji, 2014; Gupta et al., 2016; Miwa and (instead of cell-by-cell with beam-search). This
Bansal, 2016; Zhang et al., 2017). Compared with modification also simplifies the decoding process
the pipelined approaches (Chan and Roth, 2011), and improves decoding parallelism, completing RE
models that jointly extract named entities (NE) and with matrix and tensor operations.
relations can capture dependencies between entities This work uses two widely used benchmark
and relations. datasets, namely, CoNLL04 (Roth and Yih, 2004)
Many existing studies cast joint extraction of and ACE05, both in English, for evaluating the
NER and RE as a table-filling problem, where en- models of both NER and RE. Experimental results
tity and relation labels are represented as cells in a demonstrate that the proposed method achieves
the word wi , and an off-diagonal element Yi,j ∈ R
Johanson lives London
(1 ≤ i < j ≤ n) represents a directed relation
Smith in
label between the words wi and wj . Following
1 2 3 4 5 j Zhang et al. (2017), we hard-code directions into
1 6 6 6 6 relation labels R to avoid considering the lower
Johanson 1 B-PER ⊥ ⊥ ⊥ LiveIn triangular part of the table for RE. Our model can
2 6 6 6 be seen as a mapping transforming a sequence of
Smith 2 L-PER ⊥ ⊥ LiveIn words w1 w2 · · · wn to an upper triangular matrix
3 6 6 Y . We denote an NE label as yi = Yi,i for sim-
lives 3 O ⊥ ⊥ plicity. Figure 1 illustrates an example of a matrix
4 6 Y for the input sentence, “Johanson Smith lives
in 4 O ⊥ in London”. Notably, relations are mapped from
5
1-dimensional word sequences to 2-dimensional
London 5 U-LOC matrix Y on entity-level. Further, each word inside
i
an entity span is annotated with the corresponding
relation label. Take the sentence in Figure 1 as an
Figure 1: An example of the upper triangular matrix Y example, for the NE “Johanson Smith” labeled as
and table-filling strategy. The numbers in the cells in- −−−−→
P ERSON, relation L IVE I N is labeled on both Y1,5
dicate the order of filling cells (decoding). ⊥ indicates
a negative relation label (i.e., there exists no relation
and Y2,5 corresponding to “Johanson” and “Smith,”
among the corresponding words). respectively.
This study is based on pre-trained BERT models
to leverage contextual information in solving NER
higher performance than previous state-of-the-art and RE. The proposed method stacks layers for
methods, including SpERT (Eberts and Ulges, NER and RE on the top of a BERT encoder. As
2020) and DyGIE++ (Wadden et al., 2019) in addi- illustrated in Figures 2 and 3, our method computes
tion to the conventional table-filling systems (Miwa word representations from contextualized embed-
and Sasaki, 2014; Zhang et al., 2017). We con- dings of sub-word tokens obtained by Byte-Pair-
firm that the tensor dot-product successfully pre- Encoding (BPE) (explained in Section 2.1), and
dict relation labels at once without any special performs NER (Section 2.2) and RE (Section 2.3).
search strategy. Moreover, the proposed method
attains comparable performance to the state-of-the- 2.1 Word Representations
art NER model DyGIE++ when providing mul- BERT tokenizer uses WordPiece to split words
tiple sentences as input for context aggregation. (e.g., “Johanson”) into sub-word tokens (e.g., “Jo-
The source code is publicly available at https: han” and “##son”) with the aid of BPE. This tech-
//github.com/YoumiMa/Enhanced_TF. nique is proved to be effective in reducing the vo-
cabulary size and unknown words (Devlin et al.,
2 Proposed Method 2019). Since NEs are annotated at word level, we
need its representations at word level during both
This study aims to extract NE and relation in- training and predicting.
stances. Given a sequence of words w1 w2 · · · wn In this study, we compute a max-pooling of
(n is the number of words in the input), our BERT embeddings of sub-word tokens composing
goal is to extract relation triples in the form of the word as its representation1 (Liu et al., 2019),
(arg0type0 , relation, arg1type1 ). Here, type0 rep-
resents the NE type of the mention arg0; arg1 and ewi = f (eti,1 , eti,2 , · · · , eti,s ). (1)
type1 are defined analogously. We define E and R
as label sets of named entities and relations, respec- Here, we assume the following: the word wi com-
tively. prises s sub-word tokens ti,1 , ti,2 , · · · , ti,s ; ewi and
The table representation (Miwa and Sasaki, 1
We examined the performance on the CoNLL04 devel-
2014) is employed for jointly recognizing NEs and opment set by using (1) embedding of first sub-word to-
relation instances. Formally, we define an n × n up- ken (Devlin et al., 2019); (2) mean-pooling of constituent
sub-word tokens; (3) mean-pooling of constituent sub-word
per triangular matrix Y , where a diagonal element tokens and [CLS]; (4) max-pooling of constituent sub-word
Yi,i ∈ E (1 ≤ i ≤ n) represents an NE label for tokens. Among these, max-pooling worked the best.
|𝑅|
Label 𝒍"" 𝒍"!
Embedding
Softmax Q |𝑅| Johanson − ⊥ ⊥ ⊥
O B-PER L-PER O O U-LOC 𝐿𝑖𝑣𝑒𝐼𝑛
Classifier (#$%) Smith − − ⊥ ⊥
𝒉! 𝐿𝑖𝑣𝑒𝐼𝑛
Feed-Forward … … 𝒉'
(#$%) lives − − − ⊥ ⊥
Neural Network …
|𝑅|
in − − − − ⊥
Entity (*+,) … (*+,) … 𝒉(
(#$%)
London − − − − −
Representation 𝒉$ 𝒉'
K
Word-level
𝒆!" 𝒆!!
Embedding
Word Tensor Softmax
Subtoken-level Representation Dot-Product Classifier
𝒆#"" 𝒆#"! 𝒆#!"
Embedding
Pretrained Figure 3: The Relation Extraction model. Relation hid-
BERT BERT (rel)
den vectors hi are computed from the NE module.
[CLS] 𝑡$$ 𝑡$% 𝑡%$ 𝑡'$ 𝑡($ 𝑡&$ In the right-most table, − indicates a masked-out cell
during both training and predicting.
Johanson Smith lives in London

Live In
the previous word and a unit-length outside (O) as
Figure 2: The Named Entity Recognition model. For the previous label, i.e., y0 = O and w0 = [CLS].
clarity, we only show the calculation of entity hidden Following Zhang et al. (2017), when the previous
(ent) (ent)
vectors h1 and h3 for the words “Johanson” and word is labeled O, we assume that the previous span
“lives” in detail. is a unit-length span, i.e., first(i − 1) = i − 1.
We define the entity representation for predicting
the label of current timestep i as the concatenation
eti,k are embeddings for the word wi and sub-word
of three features described above,
token ti,k , respectively; and f presents a max-
pooling operation (henceforth). (ent)
hi = ewi ⊕ lyi−1 ⊕ f (ewfirst(i−1) , · · · , ewi−1 ),
2.2 Named Entity Recognition (2)
where ⊕ stands for a vector concatenation.
We use the BILOU (begin, inside, last, outside,
unit-length) notation for representing spans of We apply a fully connected layer followed by a
NEs (Ratinov and Roth, 2009). We consider NER softmax function σ to obtain the probability distri-
as a sequential labeling task, where each word wi bution across all possible NE labels at timestep i,
in the input is labeled as yi (a diagonal element
(ent)
in Y ) in the BILOU notation. In this study, we ŷi = σ(W (ent) hi + b(ent) ). (3)
enhance the existing architecture by using span
features at previous timesteps. The use of span fea- Here, W (ent) and b(ent) represent the matrix and
tures is inspired by Zhang et al. (2017); the authors the bias vector for a linear transformation. The vec-
extracted span representations from bidirectional tor ŷi represents the probability distribution over
LSTM cells as features. NE labels; we fill the element Yi,i (= yi ) with
Specifically, the model predicts an NE label for the NE label yielding the highest probability for
the word wi based on three features: (1) a rep- ŷi . Thus, we perform NER by filling up diagonal
resentation ewi of the word wi , (2) embeddings elements Yi,i from i = 1 to n.
lyi−1 of the label yi−1 at the previous timestep
2.3 Relation Extraction
i − 1, and (3) max-pooling of BERT embeddings
of the previous NE span appearing at timesteps We perform RE on top of the BERT encoder and
(first(i − 1), · · · , i − 1). Here, first(i) denotes the entity spans recognized in Section 2.2. After the
timestep where the NE span including word i starts. NER model fills up all diagonal elements in Y ,
For example, when processing the sentence shown the RE model predicts all off-diagonal elements in
in Figure 2, since the phrase “Johanson Smith” is Y . We adapt a tensor dot-product to score each
labeled as an NE, we have first(1) = first(2) = 1. word pair along with the relation label distribution.
Similarly, since “lives” is a single non-entity word, The computation is similar to the multi-head self-
we have first(3) = 3. In addition, when the attention mechanism (Vaswani et al., 2017), but our
timestep is one (i = 1), we assume [CLS] as goal is not to compute attention weights over entity
representations2 . j (1 ≤ i < j ≤ n), which is realized by the dot-
Our model utilizes features of entity spans and product of Q and K,
their NE labels to obtain relation representations
ŷi,j = σ(QK> )i,j,: . (10)
for predicting relation labels. Let last(i) denote
the timestep where the entity span containing the Here, (·)i,j,: denotes a slice (vector) of the tensor
timestep i ends, analogous to first(i) defined in extracting the (i, j, ∗) elements. Softmax function
Section 2.2. For instance, for sentence delineated σ computes the probability distribution across all
in Figure 2, we have last(1) = last(2) = 2. The relation labels R. We fill in the cell Yi,j with the
entity-span feature zi (at timestep i) is computed relation label yielding the highest probability for
using the representations of the constituent words ŷi,j . In this way, the RE model predicts relation
in the entity span, labels for all pairs of input words at once by com-
puting Equation 10 on the top of NE labels and
zi = f (ewfirst(i) , · · · , ewi , · · · , ewlast(i) ). (4) spans predicted by the NE module.
To rephrase, zi is the max-pooling across word rep- Meanwhile, we replace Equation 5 with 11 dur-
resentations of the entity span starting at first(i) ing the training phase,
and ending at last(i). Embedding of the NE la- zi = ewi . (11)
bel lyi is then used as the entity-label feature at
timestep i. Mathematically, the word representa- The rationale behind this treatment is that Equa-
tion, i.e., an input to the RE model, is a concatena- tion 5 might repeat the same pattern of parameter
tion of the entity-span and entity-label feature, updates for multiple words within an entity span
because of the max-pooling operation. In contrast,
hi
(rel)
= z i ⊕ l yi . (5) Equation 11 promotes different patterns of parame-
ter updates for different words in an entity span3 .
For each possible relation r ∈ R, we apply lin-
ear transformations parameterized by two matri- 2.4 Training and Predicting
(q) (k)
ces Wr , Wr ∈ Rdatt ×drel and two bias vectors The objective of training is to minimize the sum
(q) (k)
br , br ∈ Rdatt , of cross-entropy losses of NER (L(ent) ) and RE
(rel) (L(rel) ),
qi,r = Wr(q) hi + b(q)
r , (6) L = L(ent) + L(rel) . (12)
(rel)
ki,r = Wr(k) hi + b(k)
r , (7) The proposed NER model uses ground-truth NE
labels and spans at training time, similar to Zhang
where drel denotes the number of dimensions of
(rel) et al. (2017). We perform a greedy search (from left
hi , and datt represents the number of dimen-
to right) for predicting a label sequence4 . Our RE
sions after the transformations. qi,r ∈ Rdatt and
model receives the predicted NE labels and spans
ki,r ∈ Rdatt are query and key vectors, respectively,
from the NER model and then predicts relation
for the relation r at timestep i. As demonstrated,
(rel) labels based on these predictions.
we map the word representation vector hi into
Notably, as shown in Figure 1, if a phrase is la-
the query and key spaces associated with the rela-
beled as an NE, relations sourcing from or pointing
tion r.
to the phrase will be mapped span-wise to the ma-
After collecting both query and key vectors for
trix Y . The mapping strategy helps us fully update
all relations r ∈ R at all timesteps i ∈ {1, . . . , n},
parameters for different words inside an entity span
we obtain two tensors Q ∈ Rn×|R|×datt and K ∈
during relation training. During relation prediction,
Rn×|R|×datt . Slices of the tensors are,
we ignore inconsistency of label predictions among
(rel)
Qi,r,: = qi,r = Wr(q) hi + b(q)
r , (8) componential words by max-pooling as in Equa-
(rel) tion 4, which results in the same entity-span feature
Ki,r,: = ki,r = Wr(k) hi + b(k)
r . (9) inside the span.
We compute a probability distribution across all 3
We explore several combinations. For example, using
possible relations for every combination of i and Equation 5 for both training and predicting phases, we con-
firmed that the combination: (Equation 5 for prediction and
2 Equation 11 for training), performed the best.
We also tried the deep bi-affine attention mechanism
4
used in Dozat and Manning (2017) and Nguyen and Verspoor The beam search decoding did not depict a definite per-
(2019). However, we found it sufficient to use the tensor formance improvement, which is consistent with the report in
dot-product in the early experiments. Miwa and Bansal (2016).
3 Experiment PyTorch for parameter updates (Loshchilov and
Hutter, 2019).
3.1 Datasets Hyperparameters were tuned on the held-out de-
We assessed the performance for NER and RE of velopment set of CoNLL04. Further, we merged
the proposed method on two widely used datasets: the development and training sets of CoNLL04
CoNLL04 (Roth and Yih, 2004) and ACE05. In ad- for the final training and evaluation, following the
dition, the ability of the proposed model to capture procedure of Gupta et al. (2016) and Eberts and
cross-sentence dependencies in NER is evaluated Ulges (2020). Major hyperparameters are listed in
on CoNLL03 (Tjong Kim Sang and De Meulder, Appendix A. We report mean values of all evalu-
2003) and ACE05, which include markers for doc- ation metrics following five runs on each dataset
ument boundaries. throughout the paper.
CoNLL04 The dataset defines four entity types 3.3 Main Results
and five relation types. We report F1 scores for Table 1 reports the performance of our method on
NER and RE adhering to the conventional evalua- the datasets, along with in several recent studies on
tion scheme. The experiments followed the setup joint NER and RE. On the ConLL04 dataset, the
and data split of Gupta et al. (2016) and Eberts and proposed method achieved comparable or slightly
Ulges (2020), which are similar to those of Miwa better performance in both NER and RE than
and Sasaki (2014) and Zhang et al. (2017). SpERT, i.e., an existing state-of-the-art (SOTA)
ACE05 We used the English corpus that encom- model. Another advantage of the proposed method
passes seven coarse-grained entity types and six over SpERT is its stability in achieving higher
coarse-grained relation types. We followed the performance. Table 2 reports the standard deriva-
data splits, pre-processing, and task settings of Li tions (SDs) of F1 scores of NER and RE on the
and Ji (2014) and Miwa and Bansal (2016). For CoNLL04 test set achieved by the two models.
evaluating NER, we regarded an entity mention as In addition, Table 1 indicates that the proposed
correct if its label and the headword of its span method outperformed existing work on ACE05. Re-
were identical to the ground truth. For evaluat- gardless of the evaluation criteria for RE (ACE05♦
ing RE, we report performance values computed and ACE05♠ ), F1 scores of the proposed method
by two different criteria to make them compara- were approximately 1.0 point higher than those of
ble with the previous work: ACE05♦ is indifferent previous SOTA models (DyGIE++ and Multi-turn
towards incorrect predictions of NE labels, while QA).
ACE05♠ requires NE labels of relation arguments The proposed method ranked second for NER on
to be correct. ACE05 among previous studies, while DyGIE++
portrayed superior performance on NER. However,
CoNLL03 The dataset contains four different their method receives document-level contexts as
entity types similar to CoNLL04 (Roth and Yih, input that make it incomparable with our model,
2004). We used this dataset to measure the per- i.e., trained only with sentence-level contexts. In
formance of NER that considers cross-sentence addition, DyGIE++ utilized coreference informa-
contexts within a document. tion from OntoNotes (Pradhan et al., 2012). As we
will see in Section 3.6 with Table 5, the proposed
3.2 Experimental Settings method achieved comparable performance to Dy-
The presented model is implemented in Py- GIE++ on NER with document-level context given
Torch (Paszke et al., 2019) with HuggingFace to the input.
Transformer package (Wolf et al., 2019), utiliz- Additionally, a detailed error inspection of both
ing BERTBASE (cased) (Devlin et al., 2019) as a NER and RE is shown in Appendix B. The error
pre-trained BERT model. We ran all experiments inspection aided us in categorizing two significant
on a single GPU of NVIDIA Tesla V100 (16 GiB). error types of RE sources, namely, a lack of exter-
We trained parameters of the NER and RE mod- nal knowledge and global constraints.
els as well as those in BERT (fine-tuning) during
the training phase, with parameters other than pre- 3.4 Ablation Test
trained BERT initialized with the default initializer. To better understand the significance of our pro-
We used the AdamW algorithm implemented in posed method, we conducted ablation tests. This
Entity Relation
Dataset Model
P R F1 P R F1
Miwa and Sasaki (2014) 81.2 80.2 80.7 76.0 50.9 61.0
Zhang et al. (2017) - - 85.6 - - 67.8
CoNLL04 Multi-turn QA (Li et al., 2019) 89.0 86.6 87.8 69.2 68.2 68.9
SpERT (Eberts and Ulges, 2020) 88.3 89.6 88.9 73.0 70.0 71.5
This work 89.3 91.1 90.2 71.6 71.9 71.8
Li and Ji (2014) 85.2 76.9 80.8 68.9 41.9 52.1
Dixit and Al-Onaizan (2019) 85.9 86.1 86.0 68.0 58.4 62.8
ACE05♦
DyGIE++ (Wadden et al., 2019) - - 88.6 - - 63.4
This work 87.2 88.1 87.6 69.0 60.7 64.5
Li and Ji (2014) 85.2 76.9 80.8 65.4 39.8 49.5
SPTree (Miwa and Bansal, 2016) 82.9 83.9 83.4 57.2 54.0 55.6
ACE05♠ Zhang et al. (2017) - - 83.6 - - 57.5
MRT (Sun et al., 2018) 83.9 83.2 83.6 64.9 55.1 59.6
Multi-turn QA (Li et al., 2019) 84.7 84.9 84.8 64.8 56.2 60.2
This work 87.2 88.1 87.6 65.3 57.5 61.1

Table 1: Micro-averaged precision (P), recall (R), and F1 score (F1) on the test sets of CoNLL04 and ACE05.
ACE05♦ regards a relation prediction to be correct when both the relation label and head regions of two arguments
are correct. ACE05♠ requires that NE labels of arguments are correct in addition to the evaluation criteria of
ACE05♦ . Notably, NER scores of ACE05♦ and ACE05♠ are comparable because the difference in the evaluation
criteria affects RE scores only.

SD Entity Relation
Model Model
P R F1 P R F1
NER RE Full 89.3 91.1 90.2 71.6 71.9 71.8
SpERT 0.378 0.857 - Label 89.3 90.6 89.9 72.6 70.7 71.6
This work 0.187 0.334 - Span 88.7 90.5 89.6 70.6 71.0 70.8
- Both 88.5 90.2 89.3 71.2 70.1 70.7
Table 2: Standard derivations (SD) of F1 scores of NER
Table 3: Ablation test of features used in this study on
and RE on the CoNLL04 test set predicted by SpERT
the CoNLL04 test set. “- Label” presents the perfor-
and this work (with five runs).
mance when removing the label embeddings at previ-
ous timesteps from the model. “- Span” removes the
features of the previous span. “- Both” presents the re-
involves gradual removal of introduced features sults when removing both of them.
from the model followed by the measurement of
performance drops. Specifically, we ablated fea-
tures of the label embedding lyi−1 and the previ- pre-defined order (Miwa and Sasaki, 2014; Gupta
ous span f (ewfirst(i−1) , · · · , ewi−1 ) from Equation et al., 2016; Zhang et al., 2017). These methods
2. Table 3 shows that removing features of previous assume that earlier decisions help later decisions,
spans from the model harmed the most, whereas which may involve long-range dependencies. In
the impact of removing label embeddings at pre- contrast, our model is free from prediction history
vious timesteps was relatively small. This result for relation labels; it focuses on predicting them at
validates our assumption that span-level features once using a tensor dot-product.
are beneficial in representing entities for both NER
A natural question is whether history-based pre-
and RE.
dictions are useful for the proposed method or not?
To find the answer, we designed a variant of the RE
3.5 Prediction Order
model that utilized predicted results of cells to the
Existing methods for jointly extracting entities and left of and below a target cell. More specifically,
relations with the table-filling approach rely on we modified Equation 10 to make use of the embed-
history-based predictions, i.e., fill up the lower (or dings of relation labels at (i, j − 1) and (i + 1, j)
upper) triangular part of a table cell-by-cell in a when predicting a relation label for the element
Entity Relation fectiveness of our model, we use “BERT (our repli-
Order
P R F1 P R F1
Once 89.3 91.1 90.2 71.6 71.9 71.8 cation)” as the baseline. Our models then equip the
Seq 88.8 90.3 89.6 68.4 71.7 70.0 BERT encoder with extra modules, as described
in Section 2.2. We observe that multi-sentence
Table 4: Performance of the proposed method on inputs boosted the performance on both datasets,
the CoNLL04 test set with history-based predictions.
making our model outperform other models, in-
“Once” stands for the proposed method that predicts all
relation labels at once. “Seq” decides relation labels se- cluding DyGIE++ and BERT (Devlin et al., 2019).
quentially so that it can incorporate prediction results By comparing the prediction results, we conclude
of cells to the left of and below a target. that multi-sentence input improved predictions for
multiple occurrences of the same entity, gathering
contexts in different occurrences. Specific shreds
(i, j), and scheduled predictions in ascending order of evidence with examples are shown in Appendix
of distance from the diagonal elements and from C.
left-top to right-bottom. Although we cannot apply this technique directly
However, a significant improvement in the per- to RE (because the table size is O(n2 )), we tested
formance of the model is not observed even after our RE model in a pipelined fashion by using pre-
several fine-tuning efforts. Experimental results on dictions of NER with multi-sentence input on the
the CoNLL04 test set are shown in Table 4. It is ACE05 test set. However, we did not see a per-
difficult to identify the reason for the experimental formance boost on RE. By analyzing predicted
results, but the RE model might utilize long-range RE instances, we discovered that multi-sentence
dependencies from the BERT encoder to make de- NER increased the overall performance by captur-
cisions. In addition, the history-based prediction ing coreferences, but it also failed to correctly label
increases the number of parameters and complex- some of the NEs essential for RE. The observation
ity of label predictions, by introducing extra pa- emphasizes the importance of improving the design
rameters for relation embeddings and additional of the RE model and a better approach to combine
classifiers. It is potentially beneficial to try several sentence-level and document-level context during
prediction orders as in Miwa and Sasaki (2014), RE.
but the experimental results suggest that predicting
relation labels at once is sufficient for the proposed 4 Related Work
method.
Early studies formulated the task of jointly extract-
3.6 Multi-Sentence NER ing entities and relations as a structured prediction
As described in Section 3.3, the proposed method with the global features and search algorithms. Li
could not outperform DyGIE++ (Wadden et al., and Ji (2014) presented an incremental algorithm
2019) on ACE05 NER, because of the unavailabil- for joint NER and RE with global features and in-
ity of cross-sentence information such as corefer- exact (beam-search) decoding. Miwa and Sasaki
ences. In this subsection, we describe how we (2014) proposed a table representation for entities
eliminated the performance gap by merely pro- and relations. Further, it investigated hand-crafted
viding multiple sentences into the model without features and complex search heuristics on the ta-
modifying the architecture. Specifically, we split ble. Gupta et al. (2016) enhanced the table-filling
a document into segments of multiple sentences approach by adapting recurrent neural networks
such that each segment was not longer than 256 (RNNs) to fill cells of a table in a pre-defined se-
sub-word tokens. This length restriction is intro- quential order. Miwa and Bansal (2016) explored a
duced due to GPU memory capacity limitations. shared representation for entities and relations by
Assuming that each segment was a sequence of stacking bidirectional tree-structured and sequen-
sub-word tokens from multiple sentences, we fed tial LSTM-RNNs. Zhang et al. (2017) integrated
each segment to the model with multiple sentences a global optimization technique and syntax-aware
separated by a special token [SEP], similar to word representations. These studies heavily relied
Devlin et al. (2019). on feature engineering (as hand-crafted features or
Table 5 shows the performance of NER mod- specialized models of deep neural networks) and
els on the CoNLL03 and ACE05 with and without search/optimization strategies.
multi-sentence inputs. To better investigate the ef- Recently, several researchers explored the deep
Entity
Dataset Model Sentence
P R F1
BERT (reported in Devlin et al. (2019)) Multi - - 92.4
BERT (our replication) Single 89.5 89.9 89.7
CoNLL03 BERT (our replication) Multi 91.3 92.7 92.0
This work Single 90.3 90.5 90.4
This work Multi 92.0 92.9 92.5
DyGIE++ (Wadden et al., 2019) Multi - - 88.6
ACE05 This work Single 87.2 88.1 87.6
This work Multi 88.8 88.6 88.7

Table 5: Results of NER on the CoNLL03 and ACE05 test sets. Values of Devlin et al. (2019) and Wadden et al.
(2019) are reported scores. We included a baseline BERT, following the original study settings, which involves
each word to be represented by its first sub-word token.

contextualized word representations for the sequen- We aimed at solving the drawbacks of the table-
tial labeling problem. Liu et al. (2019) proposed filling approach, e.g., complicated feature engi-
a deep transition architecture enhanced with the neering and decoding algorithm. In our approach,
global context and reported improvements on NER feature engineering for NER was removed by us-
and chunking tasks by using contextualized word ing contextualized word representations and span-
embeddings. Straková et al. (2019) also demon- based features. The proposed method utilized a
strated the effectiveness of the contextualized rep- tensor dot-product for filling in off-diagonal cells
resentations on the architectures for nested named at once without using history-based predictions.
entity recognition, where NE may overlap with Although the proposed architecture was differ-
multiple labels assigned. ent from the span-enumeration based approaches
(DyGIE++ and SpERT), the experimental results
More recently, span-enumeration methods have demonstrated competitive or better performance
been a popular approach for jointly extracting en- than the span-enumeration based approaches.
tities and relations (Luan et al., 2019; Wadden
et al., 2019; Eberts and Ulges, 2020). In general,
span enumeration methods consider possible en-
tity spans for an input sentence with some crite- 5 Conclusion
ria (e.g., the maximum number of words), choose
likely spans using features extracted from the span
This paper presented a novel method for extracting
candidates. Luan et al. (2019) proposed a general
NE and relations based on the table representation,
framework of information extraction called DyGIE
making use of contextualized word embeddings for
that can incorporate global information on a dy-
representing entity mentions. We applied tensor
namic span graph. Wadden et al. (2019) further
dot-product for predicting all the relation labels at
expanded the model to DyGIE++. The method
once. The experimental results on the CoNLL04
receives multiple sentences from the same docu-
and ACE05 dataset demonstrated that the proposed
ment as input and enumerates candidate spans for
method outperformed not only the existing table-
relation, coreference, and event identification. To
filling methods but also the SOTA methods based
update span representations of entities, they care-
on pre-trained BERT models. We also confirmed
fully designed strategies for dynamic graph con-
that the method achieved comparable performance
struction and span refinement. Eberts and Ulges
to the SOTA NER models on the ACE05 when
(2020) also proposed an end-to-end RE model for
multiple sentences were fed to the model.
extracting both entities and relations called SpERT.
The method bases on pre-trained BERT models In the future, we plan to explore an approach for
and enumerates candidates of entity spans. Using a incorporating global constraints in the RE model,
negative sampling strategy for both NER and RE, which currently predicts all relation labels indepen-
the method classifies entity and relation candidates dently.
into positive and negative.
References Yi Luan, Dave Wadden, Luheng He, Amy Shah, Mari
Ostendorf, and Hannaneh Hajishirzi. 2019. A gen-
Yee Seng Chan and Dan Roth. 2011. Exploiting eral framework for information extraction using dy-
syntactico-semantic structures for relation extrac- namic span graphs. In Proceedings of the 2019 Con-
tion. In Proceedings of the 49th Annual Meeting of ference of the North American Chapter of the Asso-
the Association for Computational Linguistics: Hu- ciation for Computational Linguistics: Human Lan-
man Language Technologies (ACL), pages 551–560. guage Technologies, Volume 1 (Long and Short Pa-
pers) (NAACL), pages 3036–3046.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Kristina Toutanova. 2019. BERT: Pre-training of Makoto Miwa and Mohit Bansal. 2016. End-to-end re-
deep bidirectional transformers for language under- lation extraction using LSTMs on sequences and tree
standing. In Proceedings of the 2019 Conference structures. In Proceedings of the 54th Annual Meet-
of the North American Chapter of the Association ing of the Association for Computational Linguistics
for Computational Linguistics: Human Language (Volume 1: Long Papers) (ACL), pages 1105–1116.
Technologies, Volume 1 (Long and Short Papers) Makoto Miwa and Yutaka Sasaki. 2014. Modeling
(NAACL), pages 4171–4186. joint entity and relation extraction with table repre-
sentation. In Proceedings of the 2014 Conference on
Kalpit Dixit and Yaser Al-Onaizan. 2019. Span-level Empirical Methods in Natural Language Processing
model for relation extraction. In Proceedings of the (EMNLP), pages 1858–1869.
57th Annual Meeting of the Association for Compu-
tational Linguistics (ACL), pages 5308–5314. David Nadeau and Satoshi Sekine. 2007. A survey of
named entity recognition and classification. Linguis-
Timothy Dozat and Christopher D. Manning. 2017. ticae Investigationes, 30(1):3–26.
Deep biaffine attention for neural dependency pars-
Dat Quoc Nguyen and Karin Verspoor. 2019. End-to-
ing. In International Conference on Learning Rep-
end neural relation extraction using deep biaffine at-
resentations (ICLR).
tention. In Proceedings of the 41st European Con-
ference on Information Retrieval (ECIR).
Markus Eberts and Adrian Ulges. 2020. Span-based
joint entity and relation extraction with transformer Adam Paszke, Sam Gross, Francisco Massa, Adam
pre-training. In 24th European Conference on Artifi- Lerer, James Bradbury, Gregory Chanan, Trevor
cial Intelligence (ECAI). Killeen, Zeming Lin, Natalia Gimelshein, Luca
Antiga, Alban Desmaison, Andreas Kopf, Edward
Pankaj Gupta, Hinrich Schütze, and Bernt Andrassy. Yang, Zachary DeVito, Martin Raison, Alykhan Te-
2016. Table filling multi-task recurrent neural net- jani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang,
work for joint entity and relation extraction. In Pro- Junjie Bai, and Soumith Chintala. 2019. Pytorch:
ceedings of COLING 2016, the 26th International An imperative style, high-performance deep learn-
Conference on Computational Linguistics: Techni- ing library. In Advances in Neural Information Pro-
cal Papers (COLING), pages 2537–2547. cessing Systems 32 (NIPS), pages 8024–8035. Cur-
ran Associates, Inc.
Qi Li and Heng Ji. 2014. Incremental joint extraction Sameer Pradhan, Alessandro Moschitti, Nianwen Xue,
of entity mentions and relations. In Proceedings Olga Uryupina, and Yuchen Zhang. 2012. Conll-
of the 52nd Annual Meeting of the Association for 2012 shared task: Modeling multilingual unre-
Computational Linguistics (Volume 1: Long Papers) stricted coreference in ontonotes. In Joint Confer-
(ACL), pages 402–412. ence on EMNLP and CoNLL - Shared Task, pages
1–40.
Xiaoya Li, Fan Yin, Zijun Sun, Xiayu Li, Arianna
Yuan, Duo Chai, Mingxin Zhou, and Jiwei Li. 2019. Lev Ratinov and Dan Roth. 2009. Design chal-
Entity-relation extraction as multi-turn question an- lenges and misconceptions in named entity recog-
swering. In Proceedings of the 57th Annual Meet- nition. In Proceedings of the Thirteenth Confer-
ing of the Association for Computational Linguistics ence on Computational Natural Language Learning
(ACL), pages 1340–1350. (CoNLL-2009), pages 147–155.
Dan Roth and Wen-tau Yih. 2004. A linear program-
Yijin Liu, Fandong Meng, Jinchao Zhang, Jinan Xu,
ming formulation for global inference in natural lan-
Yufeng Chen, and Jie Zhou. 2019. GCDT: A global
guage tasks. In Proceedings of the Eighth Confer-
context enhanced deep transition architecture for se-
ence on Computational Natural Language Learning
quence labeling. In Proceedings of the 57th Annual
(CoNLL-2004) at HLT-NAACL 2004, pages 1–8.
Meeting of the Association for Computational Lin-
guistics (ACL), pages 2431–2441. Jana Straková, Milan Straka, and Jan Hajic. 2019. Neu-
ral architectures for nested NER through lineariza-
Ilya Loshchilov and Frank Hutter. 2019. Decoupled tion. In Proceedings of the 57th Annual Meeting
weight decay regularization. In International Con- of the Association for Computational Linguistics
ference on Learning Representations (ICLR). (ACL), pages 5326–5331.
Changzhi Sun, Yuanbin Wu, Man Lan, Shiliang Sun, Parameter Value
Wenting Wang, Kuang-Chih Lee, and Kewen Wu. # dims of token embeddings |ew | 768
2018. Extracting entities and relations with joint
# dims of label embeddings |ly | 50
minimum risk training. In Proceedings of the 2018
Conference on Empirical Methods in Natural Lan- # dims of relation attention datt 20
learning rate (BERT encoder) 5 × 10−5
guage Processing (EMNLP), pages 2256–2265.
learning rate (others) 1 × 10−3
Erik F. Tjong Kim Sang and Fien De Meulder.
dropout rate 0.3
2003. Introduction to the CoNLL-2003 shared task:
Language-independent named entity recognition. In warm-up period 0.2
Proceedings of the Seventh Conference on Natu- total number of epochs 30
ral Language Learning at HLT-NAACL 2003, pages
142–147. Table 6: Hyper-parameter settings. We adopted a
learning rate scheduler that increased learning rates
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob linearly from 0 to 5 × 10−5 in the warm-up period
Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz of 0.2 × (total number of epochs), and then decreased
Kaiser, and Illia Polosukhin. 2017. Attention is all learning rates using a cosine function.
you need. In Advances in neural information pro-
cessing systems (NIPS), pages 5998–6008.

David Wadden, Ulme Wennberg, Yi Luan, and Han-


Appendix B: Error Inspection
naneh Hajishirzi. 2019. Entity, relation, and event This section contains descriptions for incorrect pre-
extraction with contextualized span representations.
In Proceedings of the 2019 Conference on Empirical dictions of the proposed method to delineate future
Methods in Natural Language Processing and the directions for improvement. Table 7 summarizes
9th International Joint Conference on Natural Lan- typical errors of the proposed method found in the
guage Processing (EMNLP-IJCNLP), pages 5784– CoNLL04 test set.
5789.
Incorrect NE span Cases where the NER model
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien predicts slightly incorrect spans. Typical errors
Chaumond, Clement Delangue, Anthony Moi, Pier-
ric Cistac, Tim Rault, R’emi Louf, Morgan Funtow- of this category involve adding/missing a nearby
icz, and Jamie Brew. 2019. Huggingface’s trans- phrase of an NE span. Table 7 (a) is an example
formers: State-of-the-art natural language process- where the phrase “on Earth” can be interpreted as
ing. ArXiv, abs/1910.03771. a prepositional phrase or a part of a proper noun.
Dmitry Zelenko, Chinatsu Aone, and Anthony Incorrect NE type Cases where the NER model
Richardella. 2003. Kernel methods for relation ex-
predicts incorrect NE labels for entity mentions.
traction. Journal of Machine Learning Research,
3:1083–1106. These cases usually occur when an entity can be
interpreted with different NE types. Table 7 (b)
Meishan Zhang, Yue Zhang, and Guohong Fu. 2017. illustrates that “Charing Cross Hospital” can be
End-to-end neural relation extraction with global op- categorized as O RGANIZATION if we look at the
timization. In Proceedings of the 2017 Conference
on Empirical Methods in Natural Language Process- NE alone, but is actually annotated as L OCATION
ing (EMNLP), pages 1730–1740. in the context (indicating the location of the event
‘died’).
GuoDong Zhou, Jian Su, Jie Zhang, and Min Zhang.
2005. Exploring various knowledge in relation ex- Lack of Knowledge Cases where the RE model
traction. In Proceedings of the 43rd Annual Meet- fails to recognize implicit relations. In Table 7 (c),
ing of the Association for Computational Linguistics
it is not so easy to recognize the relation instance
(ACL), pages 427–434.
(LivingstonLoc , LocatedIn, MontanaLoc ) only
Appendix A: Major Hyper-Parameters from the sentence without the knowledge about the
entities ‘Livingston’ and ‘Montana’. Fortunately,
This section contains a list of major hyper- the RE model could predict the relation instance
parameters in our model, as shown in Table 6. correctly in this example. However, it is even
For both CoNLL04 and ACE05, we used the same more difficult to infer the relation instance
hyperparameters with an exception to batch size, (LivingstonLoc , LocatedIn, Rocky MountainsLoc )
owing to the difference in the data scale. We ap- from the text; we are not sure of the inclusion rela-
plied dropout to ewi and f (ewfirst(i−1) , · · · , ewi−1 ) tion between “Rocky Mountain” and ‘Livingston’
in Equation 2 and zi in Equation 5. without the knowledge about the entities.
(a) Incorrect NE span
Text of the statement issued by the [Organization of the Oppressed on Earth]O RG
Ground truth
claiming U. S. Marine Lt. William R. Higgins was hanged.
Text of the statement issued by the [Organization of the Oppressed]O RG on Earth
Prediction
claiming U. S. Marine Lt. William R. Higgins was hanged.
(b) Incorrect NE type
Manygate Management said Ogdon died peacefully after going into a coma
Ground truth following his admission to London’s [Charing Cross Hospital]L OC Monday
for bronchopneumonia.
Manygate Management said Ogdon died peacefully after going into a coma
Prediction following his admission to London’s [Charing Cross Hospital]O RG Monday
for bronchopneumonia.
(c) Lack of knowledge
High winds blew on the east slopes of the [Rocky Mountains]L OC in [Montana]L OC ,
Sentence
with winds gusting to near 50 mph at [Livingston]L OC .
(Rocky MountainsLoc , LocatedIn, MontanaLoc )
Ground Truth (LivingstonLoc , LocatedIn, MontanaLoc )
(LivingstonLoc , LocatedIn, Rocky MountainsLoc )
(Rocky MountainsLoc , LocatedIn, MontanaLoc )
Prediction
(LivingstonLoc , LocatedIn, MontanaLoc )
(d) Lack of global constraints
[Soviet]L OC Foreign [Eduard A. Shevardnadze]P EOP is to visit [China]L OC next month
Sentence
to pave the way for the first Chinese - Soviet summit in 30 years ...
Ground Truth (Eduard A. ShevardnadzePeop , LiveIn, SovietLoc )
(Eduard A. ShevardnadzePeop , LiveIn, SovietLoc )
Prediction
(Eduard A. ShevardnadzePeop , LiveIn, ChinaLoc )

Table 7: Typical error cases of the proposed method on the CoNLL04 test set. Cases (a) and (b) are errors caused
by the NER model; and Cases (c) and (d) are those caused by the RE model. Each RE case shows the sentence
with the NE labels in the first line, followed by relation tuples of the ground truth annotation and the prediction.

Lack of global constraints Cases where the RE


model could avoid incorrect RE instances with con-
straints. As shown in Table 7 (d), the model infers
that the same person lives in two different places
(Soviet and China). the proposed method cannot
consider associations between relation predictions
explicitly because relation labels are predicted in-
dependently of each other.

Appendix C: Predicted Examples for


Multi-Sentence NER
This section contains several typical predicted ex-
amples showing the effectiveness of the multi-
sentence NER model, as shown in Table 8.
CoNLL03
Location (single), Organization (multi), Organization (gold)
Charleroi ( Belgium ) 75 Estudiantes Madrid ( Spain ) 82 ( 34 - 35 )
Leading scorers: Charleroi - Eric Cleymans 18, Ron Ellis 18, Jacques Stas 14
Person (single), Organization (multi), Organization (gold)
Tambang Timah at $ 15. 575 in London.
LONDON 1996 - 12 - 07
PT Tambang Timah closed at $ 15. 575 per GDR in London on Friday.
ACE05
Person (single), Organization (multi), Organization (gold)
North Korea has told American lawmakers it already has nuclear weapons ...
“They admitted to having just about completed the reprocessing of 8, 000 rods,” said ...
Geographical Entity (single), Person (multi), Person (gold)
...today’s Southern voters are “children of Democrats who are not swayed by the same things ...
They certainly are susceptible to the Republican message. ”

Table 8: Typical examples where the proposed NER model showed improvements with multi-sentence input. A
boldface word presents a target for predicting the NE label, and an italic word is a coreference to the boldface word.
NE labels shown on top of each example are the prediction from single-sentence input; that from multi-sentence
input; and the ground-truth label. We can infer that the model with multi-sentence input made correct predictions,
looking at coreferential words.

You might also like