The resulting vectors are used to compute mention and antecedent scores for each span without [...]

The candidate antecedent with the highest positive score is assumed to be the predicted antecedent of each token. If there are no candidates with positive coreference scores, the token is concluded to have no antecedents.
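The decision rule above is easy to express by adding a dummy antecedent with a score of zero: predicting the dummy is equivalent to predicting no antecedent. Below is a minimal PyTorch sketch, not the paper's released code; the tensor name and its [n_words, n_candidates] shape are assumptions for illustration.

```python
import torch

def predict_antecedents(coref_scores: torch.Tensor) -> torch.Tensor:
    """For each word, return the index of its predicted antecedent, or -1.

    coref_scores: [n_words, n_candidates] pairwise coreference scores,
    with invalid candidate pairs assumed to be masked to -inf.
    """
    n_words = coref_scores.size(0)
    # A dummy candidate with score 0: it wins whenever no real candidate
    # has a positive coreference score.
    dummy = torch.zeros(n_words, 1, device=coref_scores.device)
    scores = torch.cat([dummy, coref_scores], dim=1)
    return scores.argmax(dim=1) - 1   # 0 (the dummy) maps to -1: no antecedent
```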
Our model does not use higher-order inference, as Xu and Choi (2020) have shown that it has only a marginal impact on performance. It is also computationally expensive, since it repeats the most memory-intensive part of the computation during each iteration.

3.3 Span extraction

The tokens that are found to be coreferent to some other tokens are further passed to the span extraction module. For each token, the module reconstructs the span by predicting the most probable start and end tokens in the same sentence. To reconstruct a span headed by a token, all tokens in the same sentence are concatenated with this head token and then passed through a feed-forward neural network followed by a convolution block with two output channels (for start and end scores) and a kernel size of three. The intuition captured by this approach is that the best span boundary is located between a token that is likely to be in the span and a token that is unlikely to be in the span. During inference, tokens to the right of the head token are not considered as potential start tokens, and tokens to the left are not considered as potential end tokens.
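As a concrete illustration, the sketch below shows one way such a span-prediction head could look in PyTorch. It is not the paper's released implementation: the hidden sizes, the FFNN depth, and the module name are assumptions; only the pairing of each sentence token with the head token and the two-channel convolution with a kernel size of three follow the description above.

```python
import torch
import torch.nn as nn

class SpanPredictor(nn.Module):
    """Predicts start/end scores for the span headed by a given token.

    A sketch under assumed sizes, not the reference implementation.
    """

    def __init__(self, emb_dim: int, hidden: int = 1024):
        super().__init__()
        # Scores each (sentence token, head token) pair.
        self.ffnn = nn.Sequential(
            nn.Linear(emb_dim * 2, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 64),
        )
        # Two output channels (start and end scores), kernel size of three.
        self.conv = nn.Conv1d(64, 2, kernel_size=3, padding=1)

    def forward(self, sent_emb: torch.Tensor, head_emb: torch.Tensor) -> torch.Tensor:
        # sent_emb: [sent_len, emb_dim] embeddings of the head token's sentence
        # head_emb: [emb_dim]           embedding of the head token itself
        sent_len = sent_emb.size(0)
        pairs = torch.cat([sent_emb, head_emb.expand(sent_len, -1)], dim=-1)
        features = self.ffnn(pairs)                    # [sent_len, 64]
        scores = self.conv(features.T.unsqueeze(0))    # [1, 2, sent_len]
        return scores.squeeze(0).T                     # [sent_len, 2]: start, end
```

At inference time, the start scores to the right of the head token and the end scores to its left would additionally be masked out before taking the argmax, as described above.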
3.4 Training

To train our model, we first need to transform the OntoNotes 5.0 training dataset to link individual words rather than word spans. To do that, we use the syntactic information available in the dataset to reduce each span to its syntactic head. We define a span's head as the only word in the span that depends on a word outside the span or is the head of the sentence. If the number of such words is not exactly one, we choose the rightmost word as the span's head. This allows us to build two training datasets: 1) a word coreference dataset to train the coreference module and 2) a word-to-span dataset to train the span extraction module.
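The head-reduction rule can be made concrete with the following sketch. It assumes the dependency annotation is available as a list mapping each word index to the index of its syntactic head (with -1 marking the head of the sentence); the function name and this data layout are illustrative, not the paper's conversion code.

```python
def span_head(start: int, end: int, dep_heads: list[int]) -> int:
    """Reduce the word span [start, end) to a single head word index.

    dep_heads[i] holds the index of word i's syntactic head,
    or -1 if word i is the head of the sentence (assumed convention).
    """
    # Words that depend on a word outside the span or head the sentence.
    external = [
        i for i in range(start, end)
        if dep_heads[i] == -1 or not (start <= dep_heads[i] < end)
    ]
    if len(external) == 1:
        return external[0]
    # Otherwise fall back to the rightmost word of the span, as described above.
    return end - 1
```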
Following Lee et al. (2017), the coreference module uses negative log marginal likelihood as its base loss function, since the exact antecedents of gold spans are unknown and only the final gold clustering is available:

L_NLML = −log ∏_{i=0}^{N} ∑_{y′ ∈ Y(i) ∩ GOLD(i)} P(y′)        (12)

We propose to use binary cross-entropy as an additional regularization factor:

L_COREF = L_NLML + α · L_BCE        (13)

This encourages the model to output higher coreference scores for all coreferent mentions, as the pairs are classified independently from each other. We use the value of α = 0.5 to prioritize NLML loss over BCE loss.
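To make equations (12) and (13) concrete, the sketch below computes the combined loss in PyTorch. It assumes the score matrix already contains a dummy zero-score "no antecedent" column and that a boolean matrix marks the gold antecedents; the names, shapes, and reduction choices are illustrative rather than the released implementation.

```python
import torch
import torch.nn.functional as F

def coref_loss(scores: torch.Tensor, gold: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """L_COREF = L_NLML + alpha * L_BCE, following eqs. (12) and (13).

    scores: [n_words, n_candidates + 1] coreference scores; column 0 is the
            dummy "no antecedent" candidate with a fixed score of 0.
    gold:   [n_words, n_candidates + 1] boolean gold-antecedent mask; column 0
            is True for words that have no gold antecedent.
    """
    # Eq. (12): marginal log-likelihood of the gold antecedents,
    # log sum_{y'} P(y') = logsumexp(gold scores) - logsumexp(all scores).
    gold_scores = scores.masked_fill(~gold, float("-inf"))
    log_marginal = gold_scores.logsumexp(dim=1) - scores.logsumexp(dim=1)
    nlml = -log_marginal.sum()

    # BCE regularizer: every candidate pair is classified independently
    # (the dummy column is excluded).
    bce = F.binary_cross_entropy_with_logits(scores[:, 1:], gold[:, 1:].float())

    # Eq. (13): alpha = 0.5 keeps the NLML term dominant.
    return nlml + alpha * bce
```

When training jointly, the span extraction cross-entropy described next is simply added to this total.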
The span extraction module is trained using cross-entropy loss over the start and end scores of all words in the same sentence. We jointly train the coreference and span extraction modules by summing their losses.

4 Experiments

We use the OntoNotes 5.0 test dataset to evaluate our model and compare its performance with the results reported by the authors of other mention ranking models. The development portion is used to reason about the importance of the decisions in our model's architecture.

4.1 Experimental setup

We implement our model with the PyTorch framework (Paszke et al., 2019). We use the Hugging Face Transformers (Wolf et al., 2020) implementations of BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), SpanBERT (Joshi et al., 2020) and Longformer (Beltagy et al., 2020). All the models were used in their large variants. We also replicate the span-level model by Joshi et al. (2020) to use it as a baseline for comparison.

We do not perform hyperparameter tuning and mostly use the same hyperparameters as the ones used in the independent model by Joshi et al. (2020), except that we also linearly decay the learning rates.
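For concreteness, linear decay of the learning rate can be set up as in the short sketch below; the stand-in model, optimizer, learning rate, and step count are placeholders for illustration, not the paper's actual training configuration.

```python
import torch

model = torch.nn.Linear(768, 768)            # stand-in for the real model
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
total_steps = 20_000                         # illustrative value

# Linearly decay the learning rate from its initial value down to zero.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: max(0.0, 1.0 - step / total_steps)
)

for _ in range(total_steps):
    optimizer.step()                         # forward/backward omitted in this sketch
    scheduler.step()
```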
Our models were trained for 20 epochs on a 48 GB NVIDIA Quadro RTX 8000. The training took 5 hours (except for the Longformer-based model, which needed 17 hours). Evaluating the final model requires 4 GB of GPU memory.

4.2 Selecting the best model

The results in Table 2 demonstrate that the word-level SpanBERT-based model performs better than its span-level counterpart (JOSHI-REPLICA), despite being much more computationally efficient. The replacement of SpanBERT by RoBERTa or [...]
                        |       MUC        |        B³        |      CEAFφ4      |
                        |   P     R    F1  |   P     R    F1  |   P     R    F1  | Mean F1
Joshi et al. (2020)     | 85.8  84.8  85.3 | 78.3  77.9  78.1 | 76.4  74.2  75.3 |  79.6
JOSHI-REPLICA           | 85.2  85.5  85.4 | 78.2  78.7  78.4 | 75.4  75.2  75.3 |  79.7
Xu and Choi (2020)      | 85.9  85.5  85.7 | 79.0  78.9  79.0 | 76.7  75.2  75.9 |  80.2
Kirstain et al. (2021)  | 86.5  85.1  85.8 | 80.3  77.9  79.1 | 76.8  75.4  76.1 |  80.3
wl-coref + RoBERTa      | 84.9  87.9  86.3 | 77.4  82.6  79.9 | 76.1  77.1  76.6 |  81.0

Table 1: Best results on the OntoNotes 5.0 test dataset. All the scores have been obtained using the official CoNLL-2012 scorer (Pradhan et al., 2014) and rounded to one decimal place. MUC is the metric proposed by Vilain et al. (1995), B³ was introduced by Bagga and Baldwin (1998), and CEAFφ4 was designed by Luo (2005). The usage of the mean F1 score as the aggregate metric was suggested by Pradhan et al. (2012).
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc.

Liyan Xu and Jinho D. Choi. 2020. Revealing the myth of higher-order inference in coreference resolution. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8527–8533, Online. Association for Computational Linguistics.