arXiv:2209.00133v1 [cs.CL] 31 Aug 2022

Abstract

Most NLP approaches to entity linking and coreference resolution focus on retrieving sim-
Table 3: CoNLL Scores for CrossRef name disambiguation models with varied metadata and embedding sources.
Results given for naive (N), antecedent classifier (AC), and community detection (CD) models.
Table 4: CoNLL Score for Base and Tuned isnad naive, community detection, and antecedent classification models
using kNN networks (left) and surface form heuristic networks (right)
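For reference, the CoNLL score reported in these tables is conventionally the unweighted average of the MUC, B3, and CEAFe F1 scores. A minimal illustrative sketch (helper names are ours, not from the paper's codebase):

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def conll_score(muc_f1: float, b3_f1: float, ceafe_f1: float) -> float:
    """CoNLL score: unweighted mean of the MUC, B3, and CEAFe F1 values."""
    return (muc_f1 + b3_f1 + ceafe_f1) / 3

# A hypothetical model with the component F1s below would be
# reported with a CoNLL score of 0.8.
print(round(conll_score(0.75, 0.85, 0.80), 2))
```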
mine antecedence regardless of the presence of the additional linguistic information added by the fine-tuning process. The small size of the dataset may have caused the embeddings to not shift enough for the performance to significantly change.

To understand how much the performance depends on community detection algorithms rather than on network structure alone, we compare against a naive baseline in which each mention is assigned to the same cluster as its nearest earlier neighbor in the network. This naive method cannot assign mentions to singleton clusters unless a mention is disconnected from the rest of the network. Comparing the results for this method to those for community detection using the Leiden algorithm on the CrossRef dataset with added Journal metadata (Table 5), it can be seen that there is a clear precision-recall tradeoff. The naive model still has high precision, so the structure of the network contributes a fair amount of performance, properly isolating distinct clusters from the rest of the network, implying that the network on its own contributes some information rather than being reliant on the quality of the community detection algorithm. However, expanding the evaluation to include singleton mentions makes the precision-recall tradeoff much more obvious. Consequently, the simple cluster assignment model's inability to assign mentions to singleton clusters causes its precision to suffer in the easy cases, where two singleton names happen to be nearest to each other in the network but are completely unrelated. Community detection algorithms like Leiden can resolve those issues by not clustering two mentions together if they are connected by a low-weight edge, though such a method can occasionally overdo this, as the decreased MUC recall for community detection shows. The simple inference model, as one would expect, struggles with precision but does better on recall. B3 scores are of course still generally high, as there are many singletons and larger clusters that the naive model can do quite well on.

Method   CoNLL Score
Naive    .833
CD       .722

Table 5: Evaluation of Leiden community detection on the CrossRef dataset compared to a naive nearest-neighbor baseline.

A similar comparison with the isnad dataset shows a comparable, albeit lower, tradeoff, with the recall of the naive model being slightly higher and precision slightly lower. MUC is less obviously affected, as the dataset contains fewer singletons.

5.2 Effect of CrossRef Metadata

Table 3 also shows the effect of including additional context in the language model when computing name embeddings. These metadata fields, such as paper title or journal name, are appended to the list of author names as they would be in a plain-text citation. With the anomalous exception of untuned community detection, adding the title alone or both the title and journal helps slightly, while the journal alone gives the highest increase in performance. It is possible that adding the journal alone is more effective than adding both due to
the dissimilarity in titles making the embeddings less useful, as scholars may try to make their titles unique, forcing mentions of their name further apart in embedding space. Additionally, one should note that the performance variance across different metadata settings decreases when one uses tuned embeddings or supervised methods.

5.3 Effect of Training Data Quantity

To understand how much benefit the additional training data and annotation time brings, we investigate the rate at which the inference performance improves as more documents are used to train the antecedent classifier. The plots used for this analysis can be seen in Appendix C. On the isnad dataset, the scores stay in the high 80s, with peak performance at using only 40% of the training data. This is likely an artifact of the small size of the dataset. It is possible that this 40% of the training data was more similar to the test set, and as additional training data was added, the more general model started to struggle on the test set in comparison. The CrossRef results are more like what one would expect, with scores increasing with training data quantity. The supervised model does not surpass community detection until 40% of the data is used for training, likely because of the higher diversity of the dataset.

6 Discussion

There are several takeaways from these experiments for those interested in solving similar problems of name ambiguity in low external-resource domains. Firstly, depending on the size of the dataset, unsupervised community detection methods may give competitive performance when compared to more complex supervised inference methods. If one opts to use unsupervised methods, the quality of one's mention representations is extremely important and should be improved by additional pretraining of the source language model. For supervised methods, the effect of further pretraining depends more on the size of the dataset in question, with larger, more diverse datasets benefiting more from pretraining. Similarly, depending on the distribution of cluster sizes and the division of surface forms between clusters, different network construction heuristics will give different clustering results. As k increases for the CrossRef dataset, performance tends to decrease as the network includes irrelevant edges, even with aggressive pruning. The isnads, in contrast, display increased performance as k increases, likely because for any given mention, the distinct surface forms of the individual that mention refers to are going to be further from mentions of the same surface form, and thus the connections to those other regions of embedding space do not exist in the low-k networks. Using the surface form heuristic, we find that even for the isnad data, where the method performed well, performance begins to drop off quickly as the multiplier increases, with the optimal multiplier never being greater than 1.2. When the surface forms are spread across a wide variety of clusters and occur in many varied contexts, like in CrossRef, the kNN networks become more useful, while the surface form networks perform better when each surface form is more associated with a single individual. Additional metadata associated with the mentions, like the bibliographic information provided by CrossRef, is also useful, as it allows otherwise distinct mentions to be moved closer together in embedding space. Finally, for larger datasets, if annotating data from scratch, one would need to do a fair bit of annotation for the kind of supervised models we explore here to surpass the performance of unsupervised models, but once one gets past a certain point there will be diminishing performance returns from the addition of more training data.

This work also serves as the first step in future work exploring this problem of name ambiguity in more detail, whether in other domains or in expanded versions of the current datasets, like expanding the isnad dataset to cover multiple texts and dealing with the issues that would arise from varied authorial style, or integrating the partially annotated papers present in the CrossRef dataset. Related inference problems in these networked texts, such as inferring missing nodes or estimating the number of individuals present in a dataset, could also be of interest. One could also use the output of these models as the beginning of further work in social network analysis. The networks created by these sorts of models would also be invaluable for understanding the roles played by individuals in these social networks of knowledge production.

Acknowledgements

Omitted for peer review
References

Amit Bagga and Breck Baldwin. 1998. Entity-Based Cross-Document Coreferencing Using the Vector Space Model. In 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 1, pages 79–85, Montreal, Quebec, Canada. Association for Computational Linguistics.

Indrajit Bhattacharya and Lise Getoor. 2007. Collective entity resolution in relational data. ACM Transactions on Knowledge Discovery from Data, 1(1):5.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805 [cs].

Greg Durrett and Dan Klein. 2014. A Joint Model for Entity Analysis: Coreference, Typing, and Linking. Transactions of the Association for Computational Linguistics, 2:477–490.

Pigi Kouki, Jay Pujara, Christopher Marcum, Laura Koehly, and Lise Getoor. 2019. Collective entity resolution in multi-relational familial networks. Knowledge and Information Systems, 61(3):1547–1581.

Wuwei Lan, Yang Chen, Wei Xu, and Alan Ritter. 2020. An Empirical Study of Pre-trained Transformers for Arabic Information Extraction. arXiv:2004.14519 [cs].

Eryani, Alexander Erdmann, and Nizar Habash. 2020. CAMeL Tools: An Open Source Python Toolkit for Arabic Natural Language Processing. In Proceedings of the 12th Language Resources and Evaluation Conference, page 11.

Usha Nandini Raghavan, Réka Albert, and Soundar Kumara. 2007. Near linear time algorithm to detect community structures in large-scale networks. Physical Review E, 76(3):036106.

V. A. Traag, L. Waltman, and N. J. van Eck. 2019. From Louvain to Leiden: guaranteeing well-connected communities. Scientific Reports, 9(1):5233.

Marc Vilain, John Burger, John Aberdeen, Dennis Connolly, and Lynette Hirschman. 1995. A model-theoretic coreference scoring scheme. In Proceedings of the 6th Conference on Message Understanding - MUC6 '95, page 45, Columbia, Maryland. Association for Computational Linguistics.

Li Zhang. 2021. LAGOS-AND-Online-Appendix. CoRR. Publisher: Zenodo. Version Number: 1.0.
B Example Isnad

Although we use the original Arabic texts of isnads in the experiments, this English translation gives some idea of the typical brevity of each name reference: