
ETH Library

Non-Euclidean Neural
Representation Learning of Words,
Entities and Hierarchies

Doctoral Thesis

Author(s):
Ganea, Octavian-Eugen

Publication date:
2019

Permanent link:
https://doi.org/10.3929/ethz-b-000397726

Rights / license:
In Copyright - Non-Commercial Use Permitted

diss. eth no. 26043

NON-EUCLIDEAN NEURAL REPRESENTATION LEARNING
OF WORDS, ENTITIES AND HIERARCHIES

A thesis submitted to attain the degree of

doctor of sciences of ETH Zürich


(Dr. sc. ETH Zürich)

presented by

Octavian-Eugen Ganea

Master in Computer Science


École Polytechnique Fédérale de Lausanne

born on 26.09.1987
citizen of Romania

accepted on the recommendation of

Prof. Dr. Thomas Hofmann


Prof. Dr. Regina Barzilay
Prof. Dr. Tommi Jaakkola
Prof. Dr. Gunnar Rätsch
Dr. Olivier Bousquet

2019
Octavian-Eugen Ganea: Non-Euclidean Neural Representation Learning of
Words, Entities and Hierarchies, © 2019
ABSTRACT

In recent years, some of the greatest breakthroughs in machine learning have been achieved by neural networks and deep learning methods.
These non-linear models with potentially millions of parameters are all
about finding “good” representations, meaning generic or task depen-
dent features. The underlying conjecture, also known as the “manifold
assumption”, is that the intrinsic dimensionality of the input data is
much lower compared to the input feature space dimension.
In this thesis, we investigate several problems associated with em-
beddings and deep learning methods, focusing on natural language
processing (NLP) and graph embedding applications.
Popular neural network or embedding methods assume data have
Euclidean structure and, thus, employ Euclidean geometry in the em-
bedding space. For example, semantic similarity is typically geomet-
rically modeled via the standard Euclidean distance or its positive
semi-definite (PSD) inner product. However, this has been theoretically
and empirically shown to exhibit a significant information loss or dis-
tortion for many types of real data [Bro+17]; [DP10]; [Kri+10]; [NK17];
[RH07]; [Wil+14], e.g. for data with non-PSD pairwise similarities.
So, what geometry offers a better inductive bias for the data of
interest? In this dissertation, we focus on a particular large class of
data with non-Euclidean latent anatomy, namely tree-like, hierarchical
or exhibiting power-law distributions. Such data are provably [De
+18a]; [Gro87]; [Kri+10] best embedded in a fixed dimensional space of
constant negative curvature that exhibits exponential volume growth,
i.e. the hyperbolic space. Recently, [De +18a]; [NK17] have learned
embeddings of undirected graphs and adapted some linear and non-
linear dimensionality reduction methods in this geometry. However,
there still remains a large demand for generalizing popular Machine
Learning (ML) techniques and Euclidean models, while matching their
state-of-the-art results on various downstream tasks. Alleviating this
gap between Euclidean and hyperbolic geometry in ML is one main
goal of this thesis.
Our first contribution is a novel and principled method for embed-
ding directed acyclic graphs in the hyperbolic space. We advocate

for inducing a partial order over the embedding space by exploiting
geodesically convex cones for modeling antisymmetric entailment rela-
tions. We show that these cones have an appealing optimal closed form
expression. Empirically, we demonstrate improved representational
capacity on the link prediction task for word hypernymy detection.
Second, we tackle the question: how to use hyperbolic embeddings
together with popular deep learning architectures for downstream
tasks? In our journey, we rely on the Gyrovector space formalism
for taking concrete steps towards generalizing popular deep learning
methods to hyperbolic spaces in a principled manner. Our models can
be deformed to their respective Euclidean counterparts when the space
is continuously flattened.
Third, we dive into NLP and learn word representations in products
of hyperbolic spaces. To our knowledge, we present the first fully
unsupervised word embeddings being simultaneously competitive on
three different tasks - semantic similarity, hypernymy prediction and
semantic analogy. The key aspect is leveraging the connection between
products of hyperbolic spaces and Gaussian embeddings.
Finally, this thesis investigates closely how to learn good (neural)
representations for data with power-law distributions such as entities
in a knowledge base (e.g. Wikipedia). Towards this end, we focus
on solving the core NLP task of “entity linking”, i.e. finding entities
in raw text corpora and linking them to knowledge bases. We push
state-of-the-art accuracies on popular datasets by leveraging entity
embeddings, attention mechanisms, probabilistic graphical models and
unrolled approximate inference networks using truncated loopy belief
propagation.

RÉSUMÉ

Au cours des dernières années, certaines des plus grandes avancées en matière d'apprentissage automatique ont été réalisées par des réseaux
de neurones et des méthodes d’apprentissage profond. Ces modèles
non linéaires contenant potentiellement des millions de paramètres
sont à la recherche de bonnes représentations, qui peuvent être
génériques ou spécifiques à une tâche donnée. La conjecture sous-
jacente, également connue sous le nom d’hypothèse de la variété,
est que la dimension intrinsèque des données est beaucoup plus faible
que la dimension de l’espace ambiant dans lequel les données sont
plongées.
Dans cette thèse, nous étudions plusieurs problèmes associés aux
plongements et aux méthodes d’apprentissage profond, en se concen-
trant sur le traitement du langage naturel (Natural Language Process-
ing (NLP)) et les applications liées à la représentation de graphes.
Les méthodes populaires de réseaux de neurones ou de plongemen-
t/représentation des données supposent implicitement une géométrie
euclidienne dans l’espace ambiant. Par exemple, la similarité sémantique
est modélisée via une fonction géométrique, généralement la distance
euclidienne ou son produit scalaire semi-défini-positif. Cependant, il a
été montré théoriquement et empiriquement que ceci pouvait présenter
une perte d’information significative ou résulter en une distorsion
pour de nombreux types de données réelles [Bro+17]; [DP10]; [Kri+10];
[NK17]; [RH07]; [Wil+14], par exemple pour les données ayant une
matrice de similarité non semi-définie positive.
Ainsi, quelle géométrie offre un meilleur biais inductif pour les
données d’intérêt? Dans cette thèse, nous nous concentrons sur une
grande classe de données avec une anatomie latente non euclidienne, à
savoir des distributions arborescentes, hiérarchiques ou présentant des
distributions puissance. Ces données sont manifestement [De +18a];
[Gro87]; [Kri+10] mieux intégrées dans un espace à la courbure négative
constante et dont le volume croît exponentiellement, c'est-à-dire dans un
espace hyperbolique. Récemment, [De +18a]; [NK17] ont représenté
des graphes non orientés et ont adapté certaines méthodes de réduction
de dimensionnalité linéaire et non linéaire à l’aide de cette géométrie.

Cependant, il existe toujours une forte demande de généralisation
des techniques populaires d’Apprentissage Automatique (AA) et des
modèles euclidiens, tout en atteignant des résultats compétitifs sur les
tâches et applications d’usage. La réduction de cet écart entre géométrie
euclidienne et hyperbolique pour l’AA est l’un des objectifs principaux
de cette thèse.
Notre première contribution est une méthode novatrice, avec une
fondation théorique, pour des graphes acycliques dirigés dans l’espace
hyperbolique en introduisant des cônes géodésiquement convexes pour
la modélisation de relations d’implication. Ces cônes induisent une rela-
tion d’ordre partiel dans l’espace ambiant et possèdent une expression
optimale et attrayante en forme close. Nous démontrons une capacité
de représentation accrue sur la tâche de prédiction de lien pour la
détection de relations d’hyperonymie entre des mots.
Deuxièmement, nous abordons la question: comment rendre com-
patible la donnée en entrée de représentations hyperboliques aux ar-
chitectures populaires d’apprentissage profond?”. Dans notre quête,
nous nous appuyons sur le formalisme des espaces gyro vectoriels afin
d’effectuer un pas concret vers la généralisation des méthodes popu-
laires d’apprentissage profond aux espaces hyperboliques, d’une façon
théoriquement fondée. Nos modèles peuvent être continuellement
déformés en leurs contreparties euclidiennes respectives.
Troisièmement, nous plongeons dans le NLP et l’apprentissage au-
tomatique de représentations des mots dans des produits d’espaces
hyperboliques. À notre connaissance, nous présentons les premières
représentations de mots, obtenues par apprentissage non supervisé,
étant simultanément compétitives sur trois tâches différentes - la simi-
larité sémantique, la prédiction de l’hyperonymie et l’analogie sémantique.
L’aspect essentiel est de tirer parti de la connexion entre les produits
d’espaces hyperboliques et la représentation de distributions gaussi-
ennes.
Enfin, cette thèse étudie de près comment apprendre de bonnes
représentations neuronales pour les données ayant des distributions à
queues longues telles que des entités dans une base de données comme
Wikipédia. Pour ce faire, nous nous concentrons sur la résolution d’une
tâche primordiale de NLP consistant à la ”liaison d’entités”, c’est-à-
dire à rechercher des entités dans un corpus de textes quelconque et
à les relier à des bases de données. Nous poussons l’état de l’art de
la recherche sur des bases de données populaires en exploitant des

représentations d’entités, des modèles graphiques probabilistes, des
réseaux d’inférence approximative déroulés utilisant des mécanismes
de propagation tronqués de croyance en boucle, et des mécanismes
d’attention.

PUBLICATIONS

The material presented in this thesis has in part been published in the following publications:

• Octavian-Eugen Ganea, Gary Bécigneul and Thomas Hofmann.


“Hyperbolic Entailment Cones for Learning Hierarchical Em-
beddings.”
International Conference on Machine Learning (ICML 2018).
[GBH18b]

• Octavian-Eugen Ganea 1 , Gary Bécigneul 1 and Thomas Hofmann.


“Hyperbolic Neural Networks.”
Advances in Neural Information Processing Systems (NeurIPS 2018).
[GBH18a].

• Alexandru Tifrea 1 , Gary Bécigneul 1 and Octavian-Eugen Ganea 1 .


“Poincaré GloVe: Hyperbolic Word Embeddings.”
International Conference on Learning Representations (ICLR 2019).
[TBG19].

• Octavian-Eugen Ganea and Thomas Hofmann.


“Deep Joint Entity Disambiguation with Local Neural Atten-
tion.”
Conference on Empirical Methods in Natural Language Processing
EMNLP 2017.
[GH17].

• Octavian-Eugen Ganea, Marina Ganea, Aurelien Lucchi, Carsten


Eickhoff and Thomas Hofmann.
“Probabilistic Bag-of-hyperlinks Model for Entity Linking.”
International Conference on World Wide Web (WWW 2016).
[Gan+16].

The following publications were part of my PhD research and present results that are supplemental to this work or build upon its results, but are not covered in this dissertation:
1 Equal contribution.

• Octavian-Eugen Ganea, Sylvain Gelly, Gary Bécigneul and Aliak-
sei Severyn.
“Breaking the Softmax Bottleneck via Learnable Monotonic Point-
wise Non-linearities.”
International Conference on Machine Learning (ICML 2019).
[Gan+19].

• Nikolaos Kolitsas 1 , Octavian-Eugen Ganea 1 and Thomas Hof-


mann.
“End-to-End Neural Entity Linking. ”
Conference on Computational Natural Language Learning (CoNLL
2018).
[KGH18].

• Gary Bécigneul and Octavian-Eugen Ganea.


“Riemannian Adaptive Optimization Methods.”
International Conference on Learning Representations (ICLR 2019).
[BG19].

• Valentin Trifonov, Octavian-Eugen Ganea, Anna Potapenko and


Thomas Hofmann.
“Learning and Evaluating Sparse Interpretable Sentence Em-
beddings.”
BlackboxNLP Workshop 2018: Analyzing and Interpreting Neural Net-
works for NLP.
[Tri+18a].

The remaining publications were part of my PhD research, but are outside the scope of the material covered here.

• Till Haug, Octavian-Eugen Ganea and Paulina Grnarova.


“Neural Multi-step Reasoning for Question Answering on Semi-
structured Tables.”
European Conference on Information Retrieval (ECIR 2018).
[HGG18].

• Thijs Vogels, Octavian-Eugen Ganea and Carsten Eickhoff.


“Web2text: Deep Structured Boilerplate Removal.”
European Conference on Information Retrieval (ECIR 2018).
[VGE18].

ACKNOWLEDGMENTS

First, I thank my professor Thomas Hofmann for the continuous support during my PhD studies at ETH. His trust in my success (even
when it was incipient) and the provided scientific freedom and guidance
were decisive to the result of this thesis. Next, I would also like to
express my gratitude to Gary Bécigneul and other reviewers who
provided valuable feedback which I incorporated in the final version of
this dissertation.
I would like to acknowledge my co-workers and co-authors of papers
published during my PhD. It was a great pleasure working with all of
you: Gary Bécigneul, Carsten Eickhoff, Aurelien Lucchi, Marina Ganea,
Sylvain Gelly, Aliaksei Severyn, Paulina Grnarova, Alexandru Tifrea,
Anna Potapenko, Till Haug, Nikolaos Kolitsas, Valentin Trifonov, Thijs
Vogels and all my Data Analytics lab colleagues. Thank you!
Special gratitude to all my former professors who, starting from my first school years, laid the bricks of my passion for mathematics and computer science. I am especially grateful to prof. Victor Vacariu and Valentin Vornicu, who transformed me from an average high school student into a fighter for the top places in the Romanian National Mathematical Olympiad. Without you, I would have never gotten here! Also, many thanks to prof. Radu Gologan, leader of the Romanian Mathematical Olympiad, for his support over the years.
All my love to my mother, father and sister for their encouragement,
trust and support throughout these years. Also, many thanks to my
Romanian friends spread all over the world for their encouragement.

Last, but not least, there is a special place in my heart and in my mind for Marina and our cheerful daughter, Natalia.

CONTENTS

acronyms xxiii
1 introduction and motivation 1
1.1 Representation Learning . . . . . . . . . . . . . . . . . . . 1
1.1.1 (Dis)Similarity Representation and Metric Spaces 1
1.1.2 Embedding Maps, Isometric Functions and Dis-
tortion Measures . . . . . . . . . . . . . . . . . . . 3
1.2 Representations in the Euclidean Space . . . . . . . . . . 5
1.2.1 Pairwise Distance and Similarity (Gram) Matrices 6
1.2.2 Manifold Learning . . . . . . . . . . . . . . . . . . 8
1.3 Properties and Constraints of Euclidean Embedding Spaces 10
1.3.1 Positive Semi-definite Gram Matrices . . . . . . . 11
1.3.2 Short Diagonals Lemma . . . . . . . . . . . . . . . 12
1.3.3 Ptolemy’s inequality and Ptolemaic graphs . . . . 13
1.3.4 Distortions of Arbitrary Metric Embeddings in the
Euclidean Space. Bourgain’s Embedding Theorem 14
1.3.5 Johnson-Lindenstrauss Lemma . . . . . . . . . . . 15
1.3.6 Distortions of Euclidean Graphs Embeddings . . 15
1.3.7 Non-flat Manifold Data . . . . . . . . . . . . . . . 16
1.3.8 Other Causes of Non-Euclidean Data . . . . . . . 17
1.3.9 Curse of Dimensionality . . . . . . . . . . . . . . . 17
1.4 Non-Euclidean Geometric Spaces for Embedding Specific
Types of Data . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.4.1 Constant Curvature Spaces . . . . . . . . . . . . . 20
1.4.2 Non-Metric Spaces for Neural Entity Disambigua-
tion . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.5 Hyperbolic Spaces – an Intuition . . . . . . . . . . . . . . 21
1.6 Thesis Contributions . . . . . . . . . . . . . . . . . . . . . 23
2 hyperbolic geometry 27
2.1 Short Overview of Differential geometry . . . . . . . . . 27
2.2 Hyperbolic Space: the Poincaré Ball . . . . . . . . . . . . 29
2.2.1 Gyrovector Spaces . . . . . . . . . . . . . . . . . . 34
2.2.2 Connecting Gyrovector Spaces with the Rieman-
nian geometry of the Poincaré Ball . . . . . . . . . 36


3 hyperbolic entailment cones for learning hierarchical embeddings 41
3.1 Introduction and Related Work . . . . . . . . . . . . . . . 41
3.2 Entailment Cones in the Poincaré Ball . . . . . . . . . . . 43
3.3 Learning with Entailment Cones . . . . . . . . . . . . . . 51
3.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4 hyperbolic neural networks 59
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.2 Hyperbolic Multiclass Logistic Regression . . . . . . . . 61
4.3 Hyperbolic Feed-forward Neural Networks . . . . . . . . 67
4.4 Hyperbolic Recurrent Neural Networks . . . . . . . . . . 69
4.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5 hyperbolic word embeddings 81
5.1 Introduction & Motivation . . . . . . . . . . . . . . . . . . 81
5.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.3 Hyperbolic Spaces and their Cartesian Product . . . . . . 83
5.4 Adapting GloVe . . . . . . . . . . . . . . . . . . . . . . . . 84
5.5 Connecting Gaussian Distributions and Hyperbolic Em-
beddings . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.6 Analogies for Hyperbolic/Gaussian Embeddings . . . . 87
5.7 Towards a Principled Score for Entailment/Hypernymy 88
5.8 Embedding Space Hyperbolicity . . . . . . . . . . . . . . 89
5.9 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.10 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
6 probabilistic graphical models for entity resolu-
tion 99
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 99
6.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . 102
6.3 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
6.4 Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
6.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . 117
6.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
7 deep joint entity disambiguation with local neu-
ral attention 129
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 129
7.2 Contributions and Related Work . . . . . . . . . . . . . . 130


7.3 Learning Entity Embeddings . . . . . . . . . . . . . . . 131
7.4 Local Model with Neural Attention . . . . . . . . . . . 133
7.5 Document-Level Deep Model . . . . . . . . . . . . . . . 135
7.6 Candidate Selection . . . . . . . . . . . . . . . . . . . . . 139
7.7 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . 139
7.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . 147
8 conclusion 149
8.1 Future Research Directions . . . . . . . . . . . . . . . . . 149

bibliography 151

LIST OF FIGURES

Figure 1.1 Examples of graphs that cannot be embedded in the Euclidean space with arbitrarily low distor-
tion. Left: 4-cycle - isometrically embeddable in
spaces of constant positive curvature (spherical).
Right: 3-star tree or K1,3 - isometrically embed-
dable in spaces of constant negative curvature
(hyperbolic). . . . . . . . . . . . . . . . . . . . . 12
Figure 1.2 Examples of three different types of Gaussian
curvature. Source: www.science4all.org . . . . 20
Figure 1.3 Visualization of Escher tiles (left) and a regular
tree (right) represented in the Poincaré ball. . . . 22
Figure 2.1 Tangent space, a tangent unit-speed vector and
its determined geodesic in a Riemannian mani-
fold. Image source: Wikipedia.org . . . . . . . . 28
Figure 3.1 Convex cones in a complete Riemannian mani-
fold. . . . . . . . . . . . . . . . . . . . . . . . . . 43
Figure 3.2 Poincaré angular cones satisfying eq. (3.31) for
K = 0.1. Left: examples of cones for points with
Euclidean norm varying from 0.1 to 0.9. Right:
transitivity for various points on the border of
their parent cones. . . . . . . . . . . . . . . . . . 49
Figure 3.3 Two dimensional embeddings of two datasets: a
toy uniform tree of depth 7 and branching factor
3, with root removed (left); the mammal subtree
of WordNet with 4230 relations, 1165 nodes and
top 2 nodes removed (right). [NK17] (each left
side) has most of the nodes and edges collapsed
on the space border, while our hyperbolic cones
(each right side) nicely reveal the data structure. 54
Figure 4.1 An example of a hyperbolic hyperplane in D_1^3
plotted using sampling. The red point is p. The
shown normal axis to the hyperplane through p
is parallel to a. . . . . . . . . . . . . . . . . . . . . 63


Figure 4.2 Test accuracies for various models and four datasets.
“Eucl” denotes Euclidean, “Hyp” denotes hy-
perbolic. All word and sentence embeddings
have dimension 5. We highlight in bold the best
method (or methods, if the difference is less than
0.5%). . . . . . . . . . . . . . . . . . . . . . . . . . 73
Figure 4.3 PREFIX-30% accuracy and first (premise) sen-
tence norm plots for different runs of the same
architecture: hyperbolic GRU followed by hy-
perbolic FFNN and hyperbolic/Euclidean (half-
half) MLR. The X axis shows millions of training
examples processed. . . . . . . . . . . . . . . . . 74
Figure 4.4 PREFIX-30% accuracy and first (premise) sen-
tence norm plots for different runs of the same
architecture: Euclidean GRU followed by Eu-
clidean FFNN and Euclidean MLR. The X axis
shows millions of training examples processed. 75
Figure 4.5 PREFIX-30% accuracy and first (premise) sen-
tence norm plots for different runs of the same
architecture: hyperbolic RNN followed by hy-
perbolic FFNN and hyperbolic MLR. The X axis
shows millions of training examples processed. 76
Figure 4.6 Hyperbolic (left) vs Direct Euclidean (right) bi-
nary MLR used to classify nodes as being part
in the group.n.01 subtree of the WordNet noun
hierarchy solely based on their Poincaré embed-
dings. The positive points (from the subtree) are
in red, the negative points (the rest) are in yellow
and the trained positive separation hyperplane
is depicted in green. . . . . . . . . . . . . . . . . 77

Figure 4.7 Test F1 classification scores (%) for four different
subtrees of WordNet noun tree. 95% confidence
intervals for 3 different runs are shown for each
method and each dimension. “Hyp” denotes our
hyperbolic MLR, “Eucl” denotes directly apply-
ing Euclidean MLR to hyperbolic embeddings
in their Euclidean parametrization, and log0 de-
notes applying Euclidean MLR in the tangent
space at 0, after projecting all hyperbolic embed-
dings there with log0 . . . . . . . . . . . . . . . . 79
Figure 5.1 Isometric deformation ϕ of D2 (left end) into H2
(right end). . . . . . . . . . . . . . . . . . . . . . . 84
Figure 6.1 An entity disambiguation problem showcasing
five given mentions and their potential entity
candidates. . . . . . . . . . . . . . . . . . . . . . . 100
Figure 6.2 Proposed factor graph for a document with four
mentions. Each mention node mi is paired with
its corresponding entity node Ei , while all entity
nodes are connected through entity-entity pair
factors. . . . . . . . . . . . . . . . . . . . . . . . . 108
Figure 6.3 Interactive Gerbil visualization of “in-Knowledge
Base (KB)” (i.e. only entities in KB should be
linked) micro F1 scores for a variety of Entity
Disambiguation (ED) methods and datasets. Our
system, PBOH, is outperforming the vast major-
ity of the presented baselines. Screen shot from
January 2018. . . . . . . . . . . . . . . . . . . . . 120
Figure 7.1 Local model with neural attention. Inputs: con-
text word vectors, candidate entity priors and
embeddings. Outputs: entity scores. All parts
are differentiable and trainable with backpropagation. . . 133
Figure 7.2 Global model: unrolled LBP deep network that
is end-to-end differentiable and trainable. . . . . 136


Figure 7.3 Non-linear scoring function of the belief and mention prior learned with a neural network.
Achieves a 1.7% improvement on AIDA-B dataset
compared to a weighted average scheme. . . . . 138

LIST OF TABLES

Table 3.1 Test F1 results for various models for embedding dimensionality equal to 5. Simple Euclidean
Emb and Poincaré Emb are the Euclidean and
hyperbolic methods proposed by [NK17], Order
Emb is proposed by [Ven+15]. . . . . . . . . . . 57
Table 3.2 Test F1 results for the same methods as in ta-
ble 3.1, but for embedding dimensionality equal
to 10. . . . . . . . . . . . . . . . . . . . . . . . . . 57
Table 5.1 average distances, δ-hyperbolicities and ratios
computed via sampling for the metrics induced
by different h functions, as defined in eq. (5.12). 90
Table 5.2 Word similarity results for 100-dimensional mod-
els. Highlighted: the best and the 2nd best. Im-
plementation of these experiments was done by
co-authors in [TBG19]. . . . . . . . . . . . . . . . 92
Table 5.3 Nearest neighbors (in terms of Poincaré dis-
tance) for some words using our 100D hyperbolic
embedding model. . . . . . . . . . . . . . . . . . 93
Table 5.4 Word analogy results for 100-dimensional mod-
els on the Google and MSR datasets. High-
lighted: the best and the 2nd best. Implemen-
tation of these experiments was done by co-
authors in [TBG19]. . . . . . . . . . . . . . . . . . 94
Table 5.5 Some words selected from the 100 nearest neigh-
bors and ordered according to the hypernymy
score function for a 50x2D hyperbolic embed-
ding model using h( x ) = x2 . . . . . . . . . . . . . 95


Table 5.6 Hyperlex results in terms of Spearman correlation for 3 different model types ordered accord-
ing to their difficulty. Implementation of these
experiments was done by co-authors in [TBG19]. 96
Table 5.7 WBLESS results in terms of accuracy for 3 dif-
ferent model types ordered according to their
difficulty. Implementation of these experiments
was done by co-authors in [TBG19]. . . . . . . . 97
Table 6.1 Statistics on some of the used datasets . . . . . . 119
Table 6.2 AIDA test-a and AIDA test-b datasets results. . 121
Table 6.3 Results on the newer versions of the MSNBC,
AQUAINT and ACE04 datasets. . . . . . . . . . 121
Table 6.4 Micro (top) and macro (bottom) F1 scores re-
ported by Gerbil for each of the 14 datasets and
of the 11 ED methods including PBOH. For each
dataset and each metric, we highlight in red the
best system and in blue the second best system. 123
Table 6.5 Loopy belief propagation statistics. Average run-
ning time, number of rounds and convergence
rate of our inference procedure are provided. . . 124
Table 6.6 Average number of entities that appear in the
graph built by PBOH and by REL-RW. . . . . . . 126
Table 6.7 Accuracy gains of individual PBOH components. 126
Table 7.1 Statistics of ED datasets. Gold recall is the per-
centage of mentions for which the entity can-
didate set contains the ground truth entity. We
only trained on mentions with at least one can-
didate. . . . . . . . . . . . . . . . . . . . . . . . . 140
Table 7.2 Entity relatedness results on the test set of [Cec+13b].
WLM is a well-known similarity method of [MW08]. . . 141
Table 7.3 Closest words to a given entity. Words with at
least 500 frequency in the Wikipedia corpus are
shown. . . . . . . . . . . . . . . . . . . . . . . . . 142
Table 7.4 In-KB accuracy for AIDA-B test set. All base-
lines use KB+YAGO mention-entity index. For
our method we show 95% confidence intervals
obtained over 5 runs. . . . . . . . . . . . . . . . . 143
Table 7.5 Micro F1 results for other datasets. . . . . . . . 144


Table 7.6 Effects of two of the hyper-parameters. Left: A low T (e.g. 5) is already sufficient for accurate
approximate marginals. Right: Hard attention
improves accuracy of a local model with K=100. 145
Table 7.7 ED accuracy on AIDA-B for our best system
split by Wikipedia hyperlink frequency and
mention prior of the ground truth entity, in cases
where the gold entity appears in the candidate
set. . . . . . . . . . . . . . . . . . . . . . . . . . . 146
Table 7.8 Examples of context words selected by our lo-
cal attention mechanism. Distinct words are
sorted decreasingly by attention weights and
only words with non-zero weights are shown. . 147

ACRONYMS

ML Machine Learning

AA Apprentissage Automatique

NLP Natural Language Processing

CNN Convolutional Neural Network

FFNN Feed-forward Neural Network

RNN Recurrent Neural Network

GRU Gated Recurrent Unit

MLR Multinomial Logistic Regression

SVM Support Vector Machine

MF Matrix Factorization

PSD positive semi-definite

SGD Stochastic Gradient Descent

MDS Multidimensional Scaling

LLE Locally-linear Embedding

PCA Principal Component Analysis

DAG Directed Acyclic Graphs

PBOH Probabilistic Bag of Hyperlinks Model for Entity


Disambiguation

ER Entity Resolution

ED Entity Disambiguation

KB Knowledge Base

KG Knowledge Graph


CRF Conditional Random Field

BP Belief Propagation

LBP Loopy Belief Propagation

1 INTRODUCTION AND MOTIVATION
1.1 representation learning

Producing semantically rich representations of data such as text or images is a central point of interest in artificial intelligence. Deep
learning and neural representations have become the de facto method-
ology towards this goal, exhibiting state-of-the-art models and results
in NLP, computer vision or graph pattern discovery. These methods
learn to parameterize the intrinsically low-dimensional data manifold
accessible via point samples living in a very high dimensional feature
space. Thus, compressed embeddings are obtained from a non-linear
dimensionality reduction method which can be task-specific or fully un-
supervised. These representations are combined and learned together
using parametric composition functions such as Feed-forward Neural
Network (FFNN), Recurrent Neural Network (RNN), Gated Recurrent
Unit (GRU) or Convolutional Neural Network (CNN).
A long line of research focuses on learning embeddings of discrete
data such as graphs [GF17]; [GL16] or linguistic instances [Kir+15];
[Mik+13b]; [PSM14]. This is the main class of models we are explor-
ing in this dissertation. Neural representations and deep learning
architectures have established state-of-the-art results for various such
discrete problems, e.g. link prediction in knowledge bases [Bor+13];
[NTK11] or in social networks [HRH02], text disambiguation [GH17],
word hypernymy [SGD16], textual entailment [Roc+15] or taxonomy
induction [Fu+14].

1.1.1 (Dis)Similarity Representation and Metric Spaces

Neural models embed data into continuous 1 latent spaces (e.g. R^n) that exhibit certain desirable geometric characteristics. The general
aim is to capture the semantic properties of interest in a geometric
1 Space continuity is necessary to enable continuous optimization of loss/cost functions which can leverage efficient algorithms such as Stochastic Gradient Descent (SGD), AdaGrad [DHS11] or Adam [KB14].


manner. In ML, data are classically represented in particular metric spaces where various traditional geometric learning techniques can be
applied to learn useful models. Here, semantic similarity in the input
data is encoded via the distance function associated with the geometric
embedding space. Intuitively,

• Semantically similar instances (or data examples or data points)


should be embedded close in the latent space.

• Semantically dissimilar instances should be embedded far apart


in the latent space.

• Geodesics (“straight lines”) in the embedding space should be


semantically meaningful and leveraged in a principled learning
process (otherwise the geometry of the space would be ignored).

This is formalized using the mathematical notion of a metric space, i.e. a pair (X, d_X), where X is a set and d_X : X × X → R is a metric or distance function satisfying the following axiomatic properties for any x, y, z ∈ X (a small numerical check of these axioms on a distance matrix is sketched below):

• non-negativity: d_X(x, y) ≥ 0

• identity of indiscernibles: d_X(x, y) = 0 iff x = y

• symmetry: d_X(x, y) = d_X(y, x)

• triangle inequality: d_X(x, y) + d_X(y, z) ≥ d_X(x, z)
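As referenced above, the axioms can be verified numerically for a finite sample. The following Python/NumPy sketch is purely illustrative (the helper name is_metric and the tolerance argument are not from the thesis); it takes an N × N matrix of pairwise distances and checks non-negativity, a zero diagonal, symmetry and the triangle inequality.

import numpy as np

def is_metric(D, tol=1e-9):
    """Check the metric axioms for an N x N matrix of pairwise distances."""
    D = np.asarray(D, dtype=float)
    nonneg = np.all(D >= -tol)                      # non-negativity
    zero_diag = np.all(np.abs(np.diag(D)) <= tol)   # d(x, x) = 0
    symmetric = np.allclose(D, D.T, atol=tol)       # symmetry
    # triangle inequality: d(x_i, x_k) <= d(x_i, x_j) + d(x_j, x_k) for all i, j, k
    triangle = np.all(D[:, None, :] <= D[:, :, None] + D[None, :, :] + tol)
    return bool(nonneg and zero_diag and symmetric and triangle)

# Pairwise Euclidean distances of random points satisfy all four axioms.
X = np.random.randn(50, 3)
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
print(is_metric(D))  # True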

Classic examples of metric spaces that will be widely used in this thesis are:

• The Euclidean space R^n with the l_2 distance function ‖x − y‖_2

• Riemannian manifolds (geodesically complete 2). Intuitively, these are generalizations of 2D surfaces and include all Euclidean spaces, among a lot more. Formally, (M, g) is a real, smooth manifold equipped with an inner product g_p defined in the tangent space T_p M at each point p and that varies smoothly with p. Several well-known Euclidean geometric notions are generalized on a Riemannian manifold, e.g. straight lines (geodesics), angle
2 To have the distance function defined between every pair of points in the space, see the Hopf–Rinow theorem.


between geodesics, length of a curve, area of a surface or curvature, making them interesting for data representation and ML. We will discuss them in more detail in Chapter 2.

• Graph metrics: given any graph G (simple, connected, unweighted or weighted with positive weights), the lengths of the shortest
paths between every two nodes define a metric for the set of vertices. In this dissertation, a particularly interesting class is that
of tree metrics (i.e. when graphs are restricted to trees). Relating
graph patterns and characteristics to its associated metric prop-
erties is usually non-trivial. However, many interesting graphs
have corresponding associated isometric Riemannian manifolds,
i.e. where (geodesic) distances in the two spaces match. Such
structures are mathematically richer and, thus, more useful for
ML methods. This includes, for example, all the tree graphs and
their associated hyperbolic manifold.

1.1.2 Embedding Maps, Isometric Functions and Distortion Measures

In ML, data are available via sample sets, e.g. images of particular
objects sampled from an unknown data distribution. The goal is to
extract patterns and characteristics from the particular data distribution
of interest by learning information preserving data representations.
Formally, data points are vectors in an input metric space ( X, d X ) which
is usually the continuous Euclidean vector space Rn of some large
dimension n. Examples include vectorial pixel image representations
or feature vectors such as co-occurrence statistics for discrete inputs.

data manifold assumption. It is typical to make the data manifold assumption, i.e. that data lies on a latent “surface” or manifold
of much smaller dimension. This helps us to perform supervised and
unsupervised learning, avoid overfitting and generalize to unseen data
points. We further assume this manifold has an useful geometry in the
sense that it is a metric space (Y, dY ) and has a smooth differentiable
structure that allows learning via function optimization. Thus, we
are interested in geodesically complete Riemannian manifolds of intrinsic
dimension d ≤ n. This is typically done by learning an embedding
mapping f : X → Y from the input high dimensional feature space to
the low-dimensional embedding space.


Such embedding functions are frequently learned in one of the two settings: supervised and unsupervised. In the former, the goal is to provide
representations useful for a specific task at hand for which a significant
amount of labeled data exists. In the latter, generic data representa-
tions useful for any task are desired – for example, by minimizing the
distortion in the embedding process (see below).

isometric graph embeddings. A particular situation is when the input data forms a graph structure, e.g. points are vertices of a
nearest neighbor graph, and the goal of dimensionality reduction and
representation learning is to obtain an isometric embedding, i.e. one
that (approximately) preserves the graph metric, namely the length
of the shortest path connecting each two nodes matches the manifold
geodesic distance of their corresponding embedding points. This prop-
erty is desirable since it allows to preserve the graph structure in the
low-dimensional space, being important for various downstream tasks
such as link prediction, node clustering, node similarity learning, node
classification, graph generation or graph visualization. Moreover, if the
original graph is obtained from samples on a Riemannian manifold
connected based on geodesic proximity, then an isometric embedding
retains both the topology and the geometric structure, being useful for
learning complex parametric data representations (manifold learning) –
this is the goal of (non-linear) dimensionality reduction methods such
as metric Multidimensional Scaling (MDS) [BG03] or Isomap [TDL00]
discussed in section 1.2.2. For example, if nodes are point samples
from the unit sphere Sn and edges are between each two points with
an Euclidean distance smaller than some small constant e > 0, then the
lengths of the shortest path between any two points in this graph will
approximately be the true spherical geodesic distance between any two
points, i.e. the cosine similarity, and this is the distance one would like
to preserve in the low-dimensional embedding space.

distortion measures. To quantify how “far” an embedding map is from an isometry, we define distortion measures. We follow popular literature (e.g. [Mat13]) and define the notion of worst-case distortion:

Definition 1.1. Assume we have access to several data points P = {p_1, . . . , p_N} ⊂ X. An embedding map f : X → Y is called a D-embedding, where D ≥ 1, if there exists a real number r > 0 such that

r · d_X(p_i, p_j) ≤ d_Y(f(p_i), f(p_j)) ≤ D · r · d_X(p_i, p_j)    (1.1)

where d_X(·, ·) is the distance in the input space (e.g. Euclidean or graph distance) and d_Y(·, ·) is the distance in the embedding space. The infimum of numbers D for which f is a D-embedding is called the (worst-case) distortion of f.

However, in practice, we might be interested in the average or total distortion. Classical choices are the strain and stress functions used in metric or generalized MDS methods [BBK06]; [BG03]; [Tor52]. Here, we use a similar choice, inspired from [Gu+19]:

Definition 1.2. With the notations from definition 1.1, we define the total distortion as

total-distortion(f) := Σ_{1 ≤ i < j ≤ N} ( ( d_Y(f(p_i), f(p_j)) / d_X(p_i, p_j) )^2 − 1 )^2    (1.2)
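Both notions translate directly into code. The sketch below is illustrative only (the function names and the NumPy dependency are not part of the thesis): the worst-case distortion of Definition 1.1 is the ratio between the largest and smallest expansion factors d_Y/d_X over all pairs, and the total distortion follows eq. (1.2) as reconstructed above.

import numpy as np

def worst_case_distortion(DX, DY):
    """Smallest D such that r*d_X <= d_Y <= D*r*d_X for some r > 0 (Definition 1.1)."""
    iu = np.triu_indices_from(DX, k=1)
    ratios = DY[iu] / DX[iu]            # expansion factor d_Y / d_X for every pair
    return ratios.max() / ratios.min()  # largest expansion / smallest expansion

def total_distortion(DX, DY):
    """Total distortion of eq. (1.2): sum over pairs of ((d_Y/d_X)^2 - 1)^2."""
    iu = np.triu_indices_from(DX, k=1)
    ratios = DY[iu] / DX[iu]
    return float(np.sum((ratios ** 2 - 1.0) ** 2))

# Sanity check: an isometry has worst-case distortion 1 and total distortion 0.
X = np.random.randn(20, 4)
D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
print(worst_case_distortion(D, D), total_distortion(D, D))  # 1.0 0.0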

1.2 representations in the euclidean space

It is extremely common in ML to embed data in the Euclidean embedding space R^n. This is the natural generalization of our intuition-friendly, visual three-dimensional space. Other reasons for such a
choice are the vectorial structure and simple closed-form expressions
of the distance, inner-product and geodesics (straight lines). Learn-
ing is typically done by minimizing the distance (or maximizing the
dot-product) between representations of correlated items, while maxi-
mizing the distance between embeddings of dissimilar objects. Popular
examples are the algorithms for embedding words [Mik+13b]; [PSM14]
and graphs [GL16] trained on co-occurrence statistics gathered from
sentences in large text corpora or, random graph walks, respectively.
These methods have been shown to strongly relate semantically close
words and their topics, as well as graph nodes with similar patterns.
Moreover, embedding symbolic objects in the Euclidean continuous
space allows tackling more complex tasks involving compositionality
and non-linear hidden interactions. This is achieved by feeding data
vectorial representations as input to (deep) neural networks, which
has led to unprecedented performance on a broad range of problems


including sentiment detection [Kim14], machine translation [BCB14],


abstractive text summarization [RCW15] or textual entailment [Roc+15].
Last, Euclidean geometry dominates the popular unsupervised ap-
proaches for non-linear dimensionality reduction or manifold learning with
various applications such as 3D data visualization, low-dimensional
embeddings, graph link prediction or graph node clustering. These
methods are analyzed in section 1.2.2.

1.2.1 Pairwise Distance and Similarity (Gram) Matrices

We now discuss and relate embedding, Gram and Euclidean distance matrices. These will later be useful to understand properties
and restrictions of Euclidean spaces, as well as to derive popular di-
mensionality reduction methods such as metric-MDS [BG03]; [Tor52] or
Isomap [TDL00].
Given N data points P = {p_1, . . . , p_N} ⊂ R^n, typically as feature vectors, their corresponding embeddings via an embedding map f : R^n → R^d are X = {x_i : x_i = f(p_i) ∈ R^d, ∀i ∈ [N]} 3. These points can be further collected together in an embedding matrix

X = (x_1, x_2, . . . , x_N)^⊤ ∈ R^{N×d}    (1.3)

Then, if one assumes the embedding space M = R^d to be Euclidean, one typically defines a squared distance matrix or dissimilarity matrix in this space

D = (D_ij) ∈ R^{N×N},   D_ij := d(x_i, x_j)^2 = ‖x_i − x_j‖^2    (1.4)
In general, data can be embedded in any metric space, in which case
d(·, ·) is the associated distance function.
On the other hand, it is useful to define a notion of similarity in the embedding space for which large values correspond to semantically similar symbolic objects. In the Euclidean space, the typical measure is the inner or dot product

⟨x, y⟩ = Σ_{k=1}^{d} x_k y_k    (1.5)
3 Notation: [ N ] = {1, . . . , N }


which can be further used to define a similarity matrix or Gram matrix of points in R^d

K = (K_ij) ∈ R^{N×N},   K_ij := ⟨x_i, x_j⟩    (1.6)
There is a natural and elegant connection between all the three matrices
defined above. We will briefly derive it here.
First, one derives the connection between K and X, namely

K = XX^⊤    (1.7)

Note that K has to be a PSD matrix. Let

K = UΛU^⊤    (1.8)

be the eigenvalue decomposition of the matrix K. Then, in order for K to be PSD, all the eigenvalues in Λ have to be non-negative 4. If that is the case, then one can recover X from K by

X = UΛ^{1/2}    (1.9)

An important and very common operation is centering, i.e. translating all the embeddings such that their mean coincides with the origin. While it changes neither the distance matrix nor the geometry of the system, it does affect the Gram matrix. Formally, the centered embedding matrix is obtained as

X̄ = CX    (1.10)

where C is the symmetric centering matrix

C = I_N − (1/N) 1 1^⊤    (1.11)

Then, the corresponding Gram matrix of the centered points is

K̄ = C X X^⊤ C = C K C    (1.12)

Now, the connection between K and D is given by the relation

d(x_i, x_j)^2 = ‖x_i − x_j‖^2 = ⟨x_i, x_i⟩ + ⟨x_j, x_j⟩ − 2⟨x_i, x_j⟩ = K_ii + K_jj − 2K_ij    (1.13)

which gives

D = −2K + eγ^⊤ + γe^⊤    (1.14)

where e is the vector of all 1's and γ = diag(K). Conversely, the centered Gram matrix can be recovered as

K̄ = −(1/2) C D C    (1.15)

Note that, as distances are translation invariant, the original uncentered K cannot be recovered solely from D.
Finally, the matrix X can be computed from D by combining eqs. (1.8), (1.9) and (1.15).

4 Known as the Mercer condition.
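These relations are easy to verify numerically. The following Python/NumPy snippet is an illustration, not part of the thesis (variable names merely mirror the equations); it checks eqs. (1.7), (1.11)-(1.12) and (1.14)-(1.15) on random points and recovers an embedding from D via eqs. (1.8)-(1.9).

import numpy as np

N, d = 6, 3
X = np.random.randn(N, d)                        # embedding matrix (eq. 1.3)
K = X @ X.T                                      # Gram matrix (eq. 1.7)
D = np.sum((X[:, None] - X[None, :]) ** 2, -1)   # squared distance matrix (eq. 1.4)

gamma = np.diag(K)                               # eq. (1.14): D = -2K + e gamma^T + gamma e^T
assert np.allclose(D, -2 * K + gamma[None, :] + gamma[:, None])

C = np.eye(N) - np.ones((N, N)) / N              # centering matrix (eq. 1.11)
K_bar = -0.5 * C @ D @ C                         # eq. (1.15)
assert np.allclose(K_bar, C @ K @ C)             # eq. (1.12)

lam, U = np.linalg.eigh(K_bar)                   # eigenvalue decomposition (eq. 1.8)
X_rec = U * np.sqrt(np.clip(lam, 0.0, None))     # recovered points (eq. 1.9), up to rotation
D_rec = np.sum((X_rec[:, None] - X_rec[None, :]) ** 2, -1)
assert np.allclose(D_rec, D)                     # pairwise distances are preserved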

1.2.2 Manifold Learning

Finding the low-dimensional data manifold is the goal of linear and non-linear dimensionality reduction or manifold learning methods such
as Principal Component Analysis (PCA), Kernel-PCA [SSM98], metric-
MDS [BG03]; [Tor52], Isomap [TDL00], Laplacian Eigenmap [BN03]
or Locally-linear Embedding (LLE) [RS00]. These models focus on
finding low-dimensional embeddings that preserve the most important
properties and structure of the metric in the input feature space. We
briefly review some of these methods.

pca and kernel pca. PCA is the most well-known linear dimensionality reduction approach. It finds the most significant directions along which the data maximizes its variance, namely the eigenvectors corresponding to the top eigenvalues of the covariance matrix Σ = (1/N) X^⊤X = (1/N) Σ_{i=1}^{N} x_i x_i^⊤. Then, the data is linearly projected onto the linear vector subspace spanned by these vectors.
However, PCA is ineffective when the data does not lie on or close to a d-dimensional linear sub-space, which is typically a strong assumption. A possible fix is the kernel-PCA [SSM98] method that employs a non-linear data transformation Φ via a kernel function, resulting in the covariance matrix Σ = (1/N) Σ_{i=1}^{N} Φ(x_i)Φ(x_i)^⊤. This can be efficiently computed using the kernel trick. Unfortunately, finding a suitable kernel for the problem at hand is not trivial.
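For concreteness, a bare-bones NumPy version of the linear PCA step just described (illustrative only; the function name pca and the eigendecomposition-of-covariance route are one of several equivalent implementations, e.g. an SVD of the centered data would do as well):

import numpy as np

def pca(X, d):
    """Project centered data onto the top-d eigenvectors of the covariance matrix."""
    Xc = X - X.mean(axis=0)                    # center the data
    cov = Xc.T @ Xc / X.shape[0]               # covariance matrix Sigma (n x n)
    lam, V = np.linalg.eigh(cov)               # eigenvalues in ascending order
    top = V[:, np.argsort(lam)[::-1][:d]]      # d most significant directions
    return Xc @ top                            # linear projection onto their span

Z = pca(np.random.randn(100, 10), d=2)
print(Z.shape)  # (100, 2)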

metric mds. The goal of the metric MDS algorithm [BG03]; [Tor52] is to perform non-linear dimensionality reduction by matching the pairwise distances in the input feature space (given as an input matrix D ∈ R^{N×N}) with the Euclidean distances in the embedding space. Concretely, metric MDS minimizes the following stress function w.r.t. the data embeddings

stress(x_1, . . . , x_N) = √( Σ_{1 ≤ i,j ≤ N} ( D_ij − ‖x_i − x_j‖ )^2 )    (1.16)

The algorithm follows eq. (1.15) to compute the Gram matrix of the centered data, then determines its d largest eigenvalues λ_1, λ_2, . . . , λ_d and their corresponding eigenvectors u_1, u_2, . . . , u_d, and, last, computes the new embeddings using X = U_d Λ_d^{1/2}, where Λ_d is the diagonal matrix of these d eigenvalues and U_d is the matrix of their associated eigenvectors.
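A compact NumPy sketch of this procedure is given below, for illustration only; here the input D is assumed to hold plain pairwise distances, which are squared before the double centering of eq. (1.15), and the function name classical_mds is not from the thesis.

import numpy as np

def classical_mds(D, d):
    """Embed N points in R^d from their pairwise distance matrix D."""
    N = D.shape[0]
    C = np.eye(N) - np.ones((N, N)) / N
    K = -0.5 * C @ (D ** 2) @ C                 # centered Gram matrix, eq. (1.15)
    lam, U = np.linalg.eigh(K)
    idx = np.argsort(lam)[::-1][:d]             # d largest eigenvalues
    lam_d = np.clip(lam[idx], 0.0, None)        # drop negative (non-Euclidean) part
    return U[:, idx] * np.sqrt(lam_d)           # X = U_d Lambda_d^{1/2}

# Points lying on a 2D linear subspace of R^10 are recovered with no distortion.
P = np.random.randn(40, 2) @ np.random.randn(2, 10)
D = np.linalg.norm(P[:, None] - P[None, :], axis=-1)
X = classical_mds(D, d=2)
D_emb = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
print(np.abs(D - D_emb).max())  # ~0 up to numerical error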

isomap. This method [TDL00] estimates the intrinsic geometry of the data manifold. It is based on a rough estimate of the geodesic distances
that uses the Floyd–Warshall algorithm for a K-nearest-neighbor graph
of the input data. It then employs metric-MDS to compute a quasi-
isometric embedding by attempting to match the manifold distances in
the input feature space to Euclidean distances in the embedding space.
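The same two ingredients (a K-nearest-neighbor graph with Floyd–Warshall shortest paths, followed by the MDS step above) can be sketched in a few lines of NumPy. This is only an illustration of the pipeline, not the reference implementation of [TDL00], and it assumes the neighborhood graph is connected.

import numpy as np

def isomap(P, d, k=10):
    """k-NN graph -> shortest-path (geodesic) distances -> classical MDS."""
    N = P.shape[0]
    E = np.linalg.norm(P[:, None] - P[None, :], axis=-1)   # ambient Euclidean distances
    G = np.full((N, N), np.inf)
    np.fill_diagonal(G, 0.0)
    nn = np.argsort(E, axis=1)[:, 1:k + 1]                 # k nearest neighbors of each point
    for i in range(N):
        G[i, nn[i]] = E[i, nn[i]]                          # connect i to its neighbors
        G[nn[i], i] = E[i, nn[i]]                          # keep the graph symmetric
    for m in range(N):                                     # Floyd-Warshall shortest paths
        G = np.minimum(G, G[:, m:m + 1] + G[m:m + 1, :])
    C = np.eye(N) - np.ones((N, N)) / N                    # classical MDS on graph distances
    lam, U = np.linalg.eigh(-0.5 * C @ (G ** 2) @ C)
    idx = np.argsort(lam)[::-1][:d]
    return U[:, idx] * np.sqrt(np.clip(lam[idx], 0.0, None))

# Example: unroll a noisy spiral curve living in R^3 onto a single coordinate.
t = np.linspace(0, 3 * np.pi, 200)
P = np.stack([t * np.cos(t), t * np.sin(t), 0.1 * np.random.randn(200)], axis=1)
Z = isomap(P, d=1, k=8)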

limitations. We have seen that manifold learning approaches focus on preserving metric properties and are based on a nearest-neighbor
search that generally uses straight-line Euclidean distances to model
both global and local similarities. This is consistent with a locally
“flat” approximation of a manifold, but becomes problematic when a
global isometric embedding is desired (e.g. for link prediction or graph
generative models). For example, Isomap has zero distortion only for
intrinsically flat manifolds, i.e. that can be isometrically embedded
onto an Euclidean vector space, that have zero intrinsic curvature. In
this case, geodesic distances would correspond to Euclidean distances
in the target embedding space. Similarly, metric MDS is suitable for
manifolds whose extrinsic dimensionality 5 is smaller than or equal to the target embedding space dimension. To visualize these statements,
consider data sampled from the sphere S^2 embedded in R^10; metric MDS can isometrically embed it into R^3 since 3 is this manifold's extrinsic dimensionality; however, it will encounter distortion when attempting

5 The extrinsic dimensionality of a manifold can be a large number compared to its intrinsic dimensionality, as stated by Nash's embedding theorem (discussed in section 1.3.7).


to move to R^2; in the latter case, one can still obtain zero distortion
embeddings using MDS if the target Euclidean spaces are replaced by
spaces of constant positive curvature (i.e. spherical spaces) and the
algorithm is modified to use their associated distance function. In
contrast, Isomap cannot reduce the dimension of this data without loss
because there exists no isometric deformation of the sphere into an
Euclidean space.

1.3 properties and constraints of euclidean embedding spaces

As we have seen, Euclidean geometry is heavily used by both supervised and unsupervised methods for learning neural representations
(via deep neural networks), for non-linear dimensionality reduction,
manifold learning or embedding graphs.
However, another line of research [Bro+17]; [DP10]; [RH07]; [Wil+14]
revealed that in many domains one has to deal with data whose latent
anatomy is best defined by non-Euclidean spaces such as intrinsically
curved (non-flat) Riemannian manifolds. Examples include computer
graphics (meshed surfaces [Bro+17]), genomics (human protein interac-
tion networks [AMA18]), phylogenetics [BHV01], bioinformatics (cancer
networks [San+15]), recommender systems [Cha+19] or network sci-
ence (Internet topology [Ni+15]). Non-Euclidean nature also happens
for a large class of structured objects and graphs (e.g. trees or cycle
graphs) or data sources exhibiting power-law or Zipf’s law distributions
- e.g. social networks [VS14] or natural language [NK17]. Such data
would be better embedded in certain non-Euclidean manifolds, for
example complex heterogeneous networks are provably [Kri+10] best
represented in the hyperbolic space described in section 1.5. Moreover,
remaining in the Euclidean space and increasing embedding dimension-
ality does not suffice to bridge this gap as we will see in sections 1.3.2,
1.3.7 and 1.5.
We next present several properties and constraints of Euclidean em-
bedding spaces, motivating the need to complement popular Euclidean
techniques with non-Euclidean methods for data with certain character-
istics.


1.3.1 Positive Semi-definite Gram Matrices

In section 1.2.1 we explained that a similarity (Gram) matrix is Euclidean if and only if it is PSD, i.e. has no negative eigenvalues (kernel
matrix). The same argument holds for a (squared) distance matrix: it is
Euclidean iff the similarity matrix recovered from it via eq. (1.15) is PSD.
This shows that Euclidean embeddings cannot perfectly reconstruct
the data if the underlying Gram matrix contains even a single negative
eigenvalue.
A different, but equivalent result (given in various sources, e.g. Thm. 3.7.2 of [Mat13]) states that D is an Euclidean squared distance matrix iff the N × N matrix G given by

g_ij := (1/2) ( D_{1i} + D_{1j} − D_{ij} ),   i, j ∈ [N]    (1.17)

is PSD. This result holds irrespectively of which node is chosen as the “1” node. The proof relies on the observation that G is the Gram matrix of the translated vectors that set x_1 to be the origin:

g_ij = ⟨x_i − x_1, x_j − x_1⟩,   i, j ∈ [N]    (1.18)

discarding negative eigenvalues. The most simplistic assumption is that the negative eigenvalues of a Gram matrix are due to noise and, thus, do not carry significant discriminative information. It might appear that simply discarding (or setting to zero) these negative eigenvalues would be a good correction heuristic, similarly to how in PCA or Matrix Factorization (MF) one would discard the lowest eigenvalues. Unfortunately, [DP10]; [Wil+14]; [Xu12] show that this has non-negligible effects, with the resulting embeddings exhibiting a large distortion w.r.t. the original distances. This procedure also degrades performance in downstream tasks such as Support Vector Machine (SVM) classification.

how often is the gram matrix non-psd? From an empirical point of view, [DP10]; [Wil+14]; [Xu12] argue that this happens almost always in practice for real data. They evaluate the negative eigenfraction measure on the eigenvalues of the Gram matrix, i.e.

NEF = ( Σ_{λ_i < 0} |λ_i| ) / ( Σ_i |λ_i| ) ∈ [0, 1]    (1.19)

and show it is almost always larger than 0 in practice.

Figure 1.1: Examples of graphs that cannot be embedded in the Euclidean space with arbitrarily low distortion. Left: 4-cycle - isometrically embeddable in spaces of constant positive curvature (spherical). Right: 3-star tree or K1,3 - isometrically embeddable in spaces of constant negative curvature (hyperbolic).
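A small NumPy illustration of this measure (not from the thesis; the function name is ours): the Gram matrix is recovered from a squared-distance matrix via eq. (1.15), and NEF is computed from its spectrum. For the shortest-path metric of the 3-star tree K1,3 from Figure 1.1, the result is strictly positive, confirming that this metric is not Euclidean.

import numpy as np

def negative_eigenfraction(D_sq):
    """NEF (eq. 1.19) of the Gram matrix -1/2 C D C obtained from squared distances."""
    N = D_sq.shape[0]
    C = np.eye(N) - np.ones((N, N)) / N
    lam = np.linalg.eigvalsh(-0.5 * C @ D_sq @ C)
    return np.abs(lam[lam < 0]).sum() / np.abs(lam).sum()

# Squared graph distances of the 3-star tree K1,3 (center node first).
D_star = np.array([[0, 1, 1, 1],
                   [1, 0, 2, 2],
                   [1, 2, 0, 2],
                   [1, 2, 2, 0]], dtype=float) ** 2
print(negative_eigenfraction(D_star))  # > 0, so no exact Euclidean embedding exists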


So, what are the main sources of non-PSD Gram matrices and non-Euclidean
behavior? We will discuss this in the following sections.

1.3.2 Short Diagonals Lemma

Embeddings in Euclidean spaces need to satisfy an additional constraint compared to the ones in any metric space, namely the following lemma (see e.g. [Mat13]).
Lemma 1.1. Short Diagonals. Any four points x_1, x_2, x_3, x_4 in an Euclidean space of any dimension must satisfy

‖x_1 − x_3‖^2 + ‖x_2 − x_4‖^2 ≤ ‖x_1 − x_2‖^2 + ‖x_2 − x_3‖^2 + ‖x_3 − x_4‖^2 + ‖x_4 − x_1‖^2    (1.20)

Proof. Using the fact that ‖x − y‖^2 = ‖x‖^2 + ‖y‖^2 − 2⟨x, y⟩ for any Euclidean vectors x, y, one rewrites the above as:

‖x_1 + x_3 − x_2 − x_4‖^2 ≥ 0    (1.21)

The latter is obviously true.

One can see, for example, that the graph metric of an undirected
4-cycle does not satisfy this lemma and, thus, cannot be embedded
isometrically in the Euclidean space of any dimension. In fact, we can
lower bound the total distortion for this particular graph.
Lemma 1.2. Embedding the 4-cycle graph in the Euclidean space of any dimension incurs a total distortion (as defined by eq. (1.2)) of at least 4/3.

12
1.3 properties and constraints of euclidean embedding spaces

Proof. Consider any Euclidean embedding map. Let a, b, c, d be the embedding distances corresponding to the edge lengths (i.e. graph distances of 1). Let e, f be the distances corresponding to the diagonals, namely the graph distances of 2. Then, the total distortion is

total-distortion(G) = (a^2 − 1)^2 + (b^2 − 1)^2 + (c^2 − 1)^2 + (d^2 − 1)^2 + (1/4)(e^2 − 4)^2 + (1/4)(f^2 − 4)^2    (1.22)

Let us denote by s = a^2 + b^2 + c^2 + d^2 and t = e^2 + f^2, where t ≤ s from lemma 1.1. Applying the Cauchy–Schwarz inequality twice, one derives

(a^2 − 1)^2 + (b^2 − 1)^2 + (c^2 − 1)^2 + (d^2 − 1)^2 ≥ (1/4)(s − 4)^2
(1/4)(e^2 − 4)^2 + (1/4)(f^2 − 4)^2 ≥ (1/8)(8 − t)^2    (1.23)

We distinguish two cases:

• If t ≤ 4, then 8 − t ≥ 4 and total-distortion(G) ≥ (1/8)(8 − t)^2 ≥ 2.

• Else, s ≥ t ≥ 4, so total-distortion(G) ≥ (1/4)(t − 4)^2 + (1/8)(8 − t)^2 ≥ 4/3.

Similar results hold for the worst-case distortion (a minimum value of √2, see [Mat13]), as well as for the 3-star tree in fig. 1.1, in terms of both total and worst-case distortion. However, trees and cycle graphs can be isometrically embedded in spaces of constant negative and positive curvature, respectively, as we will discuss later in this section (see also [Gu+19]).
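As a quick numerical companion to lemma 1.1 and lemma 1.2 (purely illustrative, not from the thesis), the shortest-path metric of the 4-cycle from fig. 1.1 violates the short diagonals inequality, so no Euclidean embedding in any dimension can reproduce it exactly:

import numpy as np

# Shortest-path distances of the undirected 4-cycle (vertices in cyclic order).
D = np.array([[0, 1, 2, 1],
              [1, 0, 1, 2],
              [2, 1, 0, 1],
              [1, 2, 1, 0]], dtype=float)

diagonals = D[0, 2] ** 2 + D[1, 3] ** 2                              # 4 + 4 = 8
sides = D[0, 1] ** 2 + D[1, 2] ** 2 + D[2, 3] ** 2 + D[3, 0] ** 2    # 1 + 1 + 1 + 1 = 4
print(diagonals <= sides)  # False: the short diagonals inequality fails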

1.3.3 Ptolemy’s inequality and Ptolemaic graphs

An additional 4-point constraint always holds for Euclidean embeddings.

Lemma 1.3. Ptolemy's inequality. Any four points x_1, x_2, x_3, x_4 in an Euclidean space of any dimension satisfy

‖x_1 − x_3‖ · ‖x_2 − x_4‖ ≤ ‖x_1 − x_2‖ · ‖x_3 − x_4‖ + ‖x_2 − x_3‖ · ‖x_4 − x_1‖    (1.24)


Of particular interest is how limiting this property is when embedding graph metrics (i.e. such that distances in the embedding space approximate shortest path graph distances). It turns out that the class of such Ptolemaic graphs has several equivalent characterizations 6, out of which we mention two:

• Every k-vertex cycle has at least 3(k − 3)/2 diagonals.

• The graph does not contain an induced 4-cycle (i.e. with no diagonals) and is distance-hereditary (i.e. every connected induced subgraph has the same distances as the whole graph).

This shows that cycle graphs of length at least 4 are not Ptolemaic,
thus not embeddable isometrically in the Euclidean space. Examples of
Ptolemaic spaces include all Hadamard spaces (e.g. Euclidean and the
Poincaré disk), but do not include spaces of constant positive curvature
(e.g. spherical).

1.3.4 Distortions of Arbitrary Metric Embeddings in the Euclidean Space. Bourgain's Embedding Theorem

Can we understand how powerful Euclidean spaces are in terms of the best achievable embedding distortion? In the most generic case, we want to
assume that the N data points come from an arbitrary metric space.
Then, the most important and well-known result is Bourgain’s theorem.

Theorem 1.3. Bourgain's embedding theorem [Bou85]. Any set of N points from a metric space (X, d_X) can be embedded in the Euclidean space with (worst-case) distortion O(log N).

It can also be shown (see [IM04]; [Mat13]) that the embedding space that achieves the above distortion will have dimension O(log^2 N) with high probability. The above distortion is tight: for example, constant-degree expander graph metrics cannot be embedded with a lower distortion.
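For intuition, Bourgain's construction can be sketched in a few lines; this is a simplified illustration, not the statement-tight construction of [Bou85], and the function name and NumPy usage are ours. Each coordinate of the embedding is the distance from a point to a randomly sampled subset, with subsets drawn at O(log N) density scales and O(log N) repetitions per scale, giving O(log^2 N) coordinates whose Euclidean distances achieve O(log N) distortion with high probability, up to scaling.

import numpy as np

def bourgain_embedding(D, seed=0):
    """Coordinates = distances to random subsets, O(log^2 N) of them in total."""
    rng = np.random.default_rng(seed)
    N = D.shape[0]
    L = int(np.ceil(np.log2(N)))
    coords = []
    for i in range(1, L + 1):                    # scale i: inclusion probability 2^-i
        for _ in range(L):                       # O(log N) random subsets per scale
            S = rng.random(N) < 2.0 ** (-i)
            if not S.any():
                S[rng.integers(N)] = True        # avoid empty subsets
            coords.append(D[:, S].min(axis=1))   # distance of every point to the subset
    return np.stack(coords, axis=1)              # N x O(log^2 N) Euclidean embedding

# Example: embed the shortest-path metric of the 8-cycle.
N = 8
idx = np.abs(np.arange(N)[:, None] - np.arange(N)[None, :])
D = np.minimum(idx, N - idx).astype(float)
X = bourgain_embedding(D)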

6 See https://en.wikipedia.org/wiki/Ptolemaic_graph.


1.3.5 Johnson-Lindenstrauss Lemma

One of the most useful results regarding quasi-isometric embeddings


deals with the case when the original input space ( X, d X ) is Euclidean.
For any ε > 0, the set of N points can be embedded with worst-case distortion 1 + ε in a Euclidean space of dimension O(log N / ε²).

Lemma 1.4. Johnson-Lindenstrauss lemma. Let ε > 0 and S a set of N points in R^n. Then, for any d > 8 ln(N)/ε², there is a map f : R^n → R^d such that

\[ (1 - \epsilon)\|x - y\|^2 \leq \|f(x) - f(y)\|^2 \leq (1 + \epsilon)\|x - y\|^2, \quad \forall x, y \in S \]

A proof can be found in [DG03]. However, this result does not apply
to other metric spaces like graph metrics.
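A small sketch of the lemma in action, assuming a Gaussian random projection as one standard realization of the map f (all names and constants below are illustrative choices):

```python
# Sketch of the Johnson-Lindenstrauss lemma using a Gaussian random projection
# as one standard realization of the map f; names and constants are ours.
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
N, n, eps = 300, 2000, 0.3
d = int(np.ceil(8 * np.log(N) / eps**2))   # target dimension from the lemma
X = rng.normal(size=(N, n))                # N points in R^n
A = rng.normal(size=(d, n)) / np.sqrt(d)   # random projection matrix
Y = X @ A.T                                # f(x) = Ax

ratio = pdist(Y, "sqeuclidean") / pdist(X, "sqeuclidean")
print(f"d = {d}, squared-distance ratios in [{ratio.min():.3f}, {ratio.max():.3f}]")
# With high probability all ratios fall inside [1 - eps, 1 + eps].
```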

1.3.6 Distortions of Euclidean Graphs Embeddings

If the input data has a graph structure, the resulting graph metric
might be very different from the Euclidean metric, resulting in poten-
tially large embedding distortion no matter how many dimensions are
used. We already dived into this issue in section 1.3.2, seeing that even
small cycles or trees cannot be isometrically or with arbitrarily low
distortion embedded in the Euclidean space.
In general, there are various theoretical results that provide lower and
upper bounds on the worst-case distortion for Euclidean embeddings of
several classes of graphs [IM04]; [Mat13]. For example, k-regular graphs

(k ≥ 3) with girth g encounter an Ω(√g) distortion, planar graphs can be embedded with O(√(log N)) distortion, trees can be embedded in d dimensions with distortion O(N^{1/(d−1)}), while the recursive diamond graph G_m cannot be embedded with distortion less than √(1 + m).
The particular case of tree structures and directed acyclic graphs
is of special interest in this dissertation. From a theoretical point of
view, [Gro87] (cited by [De +18a], [GBH18b]) show that arbitrary tree
structures cannot be embedded with arbitrary low distortion in the
Euclidean space with any number of dimensions, but this task becomes
possible in the hyperbolic space with only 2 dimensions where the
exponential volume growth matches the exponential growth of nodes
with the tree depth. We will examine hyperbolic spaces extensively in


Chapter 2, Chapter 3, Chapter 4 and Chapter 5.

1.3.7 Non-flat Manifold Data

An important source of non-Euclidean data are intrinsically curved


(non-flat) manifolds. If data points are samples from such an underlying
Riemannian manifold, then they cannot be embedded in the Euclidean
space without (potentially large) distortion when using the Euclidean
distance as an approximation to the geodesic distance. Thus, popular
embedding techniques or non-linear dimensionality reduction methods
(e.g. Isomap, metric MDS, Laplacian Eigenmap) would exhibit embed-
ding distortion and error when the data comes from an intrinsically
curved manifold.
Particular examples are the sphere and cycle graphs which can
only be isometrically embedded in spaces of strictly positive curva-
ture [Wil+14]. Similarly, trees and their associated metric require a
space of constant negative curvature that can accommodate their ex-
ponential volume growth [NK17], i.e. the hyperbolic space, which is of particular interest in this dissertation. In these examples, convert-
ing the distance matrix to the associated Gram matrix is done differently
than in the Euclidean space, as follows (see also [Wil+14]):

• For a d-dimensional sphere of radius r, the matrix K of dot-products K_ij = ⟨x_i, x_j⟩ is
\[ K = r^2 \cos\!\left(\frac{1}{r} D\right) \tag{1.25} \]
where D_ij = r cos⁻¹(⟨x_i, x_j⟩/r²) is the geodesic distance on the sphere. The resulting matrix K is then restricted to be PSD for this particular manifold.

• For the hyperboloid model 7 of the hyperbolic space with curvature −1/r², the matrix K of Minkowski dot-products 8 K_ij = ⟨x_i, x_j⟩₁ is
\[ K = -r^2 \cosh\!\left(\frac{1}{r} D\right) \tag{1.26} \]
7 https://en.wikipedia.org/wiki/Hyperboloid_model
8 ⟨x, y⟩₁ = ∑_{k=1}^{d} x_k y_k − x_0 y_0 if x, y are vectors in the Minkowski space R^{d,1}.


where D_ij = r cosh⁻¹(−⟨x_i, x_j⟩₁/r²) is the geodesic distance on the hyperboloid. This matrix is restricted to have exactly one negative eigenvalue (a numerical sketch follows this list).
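A short numerical sketch of the hyperboloid case (the sampling scheme and all names are illustrative), recovering K from the distance matrix and checking its eigenvalue signature:

```python
# Numerical sketch of eq. (1.26): sample points on the hyperboloid model,
# rebuild K = -r^2 cosh(D / r) from the geodesic distance matrix D and verify
# that K has exactly one negative eigenvalue. Helper names are ours.
import numpy as np

rng = np.random.default_rng(0)
r, n, N = 1.0, 3, 20                        # curvature -1/r^2, spatial dim n, N points

sp = rng.normal(size=(N, n))                # spatial parts x'
x0 = np.sqrt(r**2 + np.sum(sp**2, axis=1))  # time components so that <x, x>_1 = -r^2

# Minkowski dot products <x_i, x_j>_1 = -x0_i * x0_j + <x'_i, x'_j>
K_mink = sp @ sp.T - np.outer(x0, x0)
D = r * np.arccosh(np.clip(-K_mink / r**2, 1.0, None))  # geodesic distance matrix
K = -r**2 * np.cosh(D / r)                               # eq. (1.26)

eigs = np.linalg.eigvalsh(K)
print("negative eigenvalues:", int(np.sum(eigs < -1e-8)))  # expected output: 1
```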

a note on nash embedding theorem. Relevant in this con-


text is the Nash embedding theorem 9 stating that any d-dimensional
Riemannian manifold can be “isometrically” embedded in an Eu-
d(3d+11)
clidean space of dimension d0 ≤ 2 for compact manifolds, and
d(d+1)(3d+11)
d0 ≤ 2 for non-compact manifolds. However, “isometric” in
this context means preserving the metric tensor (i.e. the inner-product
in each tangent space), meaning that the original manifold is isometric
with its embedding image, but not with the entire ambient Euclidean
space. This does not imply that the Euclidean distances would match
the geodesic distances. In fact, the later cannot happen for non-flat
manifolds (e.g. spherical, hyperbolic) as stated above.

1.3.8 Other Causes of Non-Euclidean Data

[DP10]; [Xu12] investigate other main causes of non-Euclidean data,


i.e. non-PSD Gram matrices. We only mention here Gaussian noise
(which can be related to the spectrum of random matrices and Wigner’s
semicircle law 10 ), extended objects, overestimation of large distances,
underestimation of small distances and numerical inaccuracies. We
refer the reader to the mentioned resources for further information.

1.3.9 Curse of Dimensionality

It is known that Euclidean spaces suffer from the curse of dimen-


sionality in high dimensions [AHK01]; [Bey+99]. The minimum and
maximum Euclidean distances between N points become indiscernible
as the space dimension goes to infinity, which poses a problem for meth-
ods such as nearest neighbor search, k-nearest neighbor classification
or clustering.
Formally, any fixed distribution on R induces a product distribu-
tion on Rd . For any fixed N, let Q be a random reference point and
P1 , . . . , PN be random points sampled from this distribution. We denote
9 https://en.wikipedia.org/wiki/Nash_embedding_theorem
10 https://en.wikipedia.org/wiki/Wigner_semicircle_distribution


by dist_min(d) := min_i dist(Q, P_i) and, similarly, we define dist_max(d). Then,

\[ \lim_{d \to \infty} \mathbb{E}\left[\frac{\mathrm{dist}_{\max}(d) - \mathrm{dist}_{\min}(d)}{\mathrm{dist}_{\min}(d)}\right] \to 0 \tag{1.27} \]
For this reason, as well as for better generalization capabilities and
less overfitting, it is desirable to work in small dimensional embedding
spaces having a better clustering behavior, e.g. trees embedded in small
hyperbolic spaces [Kri+10]; [NK17] will be discussed below.
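The following minimal sketch illustrates this concentration effect empirically (the distributions, constants and names are illustrative choices):

```python
# Empirical illustration of the concentration phenomenon in eq. (1.27):
# the relative gap between the farthest and the nearest of N random points,
# seen from a random reference point, shrinks as the dimension d grows.
import numpy as np

rng = np.random.default_rng(0)
N = 1000
for d in [2, 10, 100, 1000, 10000]:
    Q = rng.uniform(size=d)                      # reference point
    P = rng.uniform(size=(N, d))                 # N sample points
    dist = np.linalg.norm(P - Q, axis=1)
    rel_gap = (dist.max() - dist.min()) / dist.min()
    print(f"d = {d:5d}   (dist_max - dist_min) / dist_min = {rel_gap:.3f}")
# The printed relative gap decreases towards 0 with growing d.
```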

1.4 non-euclidean geometric spaces for embedding specific types of data

In the previous section we have seen several important properties


and constraints of the Euclidean spaces. As a consequence, it is natural
to ask:
Which (non-Euclidean) geometric spaces offer better/complementary inductive
biases for learning (supervised or unsupervised) representations of various
(structured or unstructured) data types?
Since similarity relations for real data can be very complex, we would
like to explore the question:
What other embedding geometries are appropriate to represent global similarity
structure of specific data ?
We are restricting to embedding data on Riemannian manifolds
which represent the principled generalization of our intuition friendly
notion of (curved) surfaces. The first reason for this choice is that
they have well understood geometric properties generalizing the classic
Euclidean tools [RS11], e.g. inner products, vector fields, geodesics,
angles, distances, exponential map, parallel transport, volume element,
curvature, geodesic convexity. Second, optimization methods essential
for any ML pipeline have recently made great advances in the context
of Riemannian manifolds [BG19]; [Bon13]; [Tri+18b]; [ZRS16]; [ZS16];
[ZS18]. In order to respect and benefit from the particular manifold’s
intrinsic geometry and structure, it is essential to use these tools when
learning embedding in such non-Euclidean spaces.
The choice of the Riemannian manifold to embed data should match
the geometry and structure of the data of interest and its invariances.
As a consequence, these Riemannian methods have the potential to


alleviate some of the current ML problems for certain types of data, e.g.
achieving

• drastic reduction in the number of parameters and embedding


dimensionality – can result in better generalization capabilities,
less overfitting, better quality with less training data, improved
computational complexity and better clustering behavior due to
less effect of the curse of dimensionality

• low embedding distortion – thus better preserving the local and


global geometric information and avoiding discarding essential
information

• better disentanglement in the embedding space – improved clus-


tering, classification, interpretability

• better model understanding and interpretation – popular deep


learning methods are usually black-box, thus hard to interpret,
debug and improve.

However, moving away from the Euclidean space comes with its own
challenges. Practical and computationally efficient implementations of
Riemannian manifolds require for modeling, learning and optimization,
access to closed form expressions of the most relevant Riemannian
geometric tools such as geodesic equations, exponential map, distance
function or parallel transport. In the case of generic manifolds, these
geometric objects can easily lose their appealing closed form expres-
sions.
Moreover, the adoption of neural networks and deep learning in
these non-Euclidean settings has been rather limited until very re-
cently, the main reason being the non-trivial or impossible principled
generalizations of basic operations (e.g. vector addition, matrix-vector
multiplication, vector translation, vector inner product) as well as, in
more complex geometries, the lack of closed form expressions for basic
objects (e.g. distances, geodesics, parallel transport). Thus, classic tools
such as FFNNs, RNNs or Multinomial Logistic Regression (MLR) do not
have correspondence in non-Euclidean geometries. This is important
as much of the value of embeddings lies in their use for downstream
tasks (e.g. in NLP).


Figure 1.2: Examples of three different types of Gaussian curvature.


Source: www.science4all.org

1.4.1 Constant Curvature Spaces

The previous obstacles can be overcome when restricting to partic-


ular well behaved Riemannian manifolds, namely constant curvature
spaces [Wil+14] and products of these manifolds [Gu+19]. Such spaces
are Euclidean (flat, 0 curvature), spherical (positively curved) and hy-
perbolic (negative curvature). An intuition of the notion of (Gaussian)
curvature is provided by fig. 1.2, but we refer the reader to more
extensive resources for a deeper view [Do 92]; [GHL90]; [RS11].
These constant curvature Riemannian manifolds are mathematically
very well understood [Can+97]; [Do 92]; [GHL90]; [Par13]; [RS11], have
closed form expressions for all the relevant geometric tools needed for
ML modeling and optimization (e.g. distances, geodesics, exponential
and logarithmic map, parallel transport) and have recently outperformed
Euclidean embeddings on various data structures such as tree graphs,
scale free and complex networks embeddings [Kri+10], hierarchical
word embeddings [NK17], texture mapping [Wil+14] and various other
dissimilarity-based datasets [Wil+14].
In this dissertation we seek to understand and leverage hyperbolic
embeddings for ML problems. As will be extensively detailed later, our
first goals are three-fold:
i Embed Directed Acyclic Graphs (DAG) and hierarchical structures
in the hyperbolic space – Chapter 3.

ii Generalize popular deep and recurrent neural networks to hyper-


bolic spaces in a principled manner – Chapter 4.

iii Learn multi-task word representations in hyperbolic spaces and


products of those – Chapter 5.


1.4.2 Non-Metric Spaces for Neural Entity Disambiguation

Our second goal is to explore a particular type of non-metric spaces


as a way to avoid the limitation of the Euclidean space. Specifically, we
focus on d-dimensional embeddings in Rd where the similarity function
is given by a parameterizable bilinear form B : R^d × R^d → R:
\[ B(x, y) = x^\top M y \tag{1.28} \]
where M ∈ Rd×d is a learnable symmetric matrix. Unfortunately, with
the exception of PSD matrices M which would define a Mahalanobis
distance, this function does not give a proper “inner-product”, vector
norm or distance function and, thus, the associated space is non-metric,
loosing important geometric properties described before (for example,
Cauchy-Schwarz inequality does not hold and vector norms loose
significance, potentially even being negative). However, it has several
other advantages (e.g. it is linear and symmetric), the most important
for our interests being the capability of generating similarity (Gram)
matrices of any signature, thus avoiding the PSD restriction specific to
Euclidean spaces (see section 1.3.1).
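A tiny illustrative sketch of this point, using a random symmetric M (all names are ours): the resulting similarity matrix is generally indefinite.

```python
# Illustration of eq. (1.28): with a symmetric (not necessarily PSD) matrix M,
# the similarity matrix S_ij = B(x_i, x_j) can have any signature. Names are ours.
import numpy as np

rng = np.random.default_rng(0)
d, N = 5, 8
A = rng.normal(size=(d, d))
M = (A + A.T) / 2                        # symmetric "learnable" matrix (random here)
X = rng.normal(size=(N, d))              # N entity embeddings

S = X @ M @ X.T                          # S_ij = x_i^T M x_j
print(np.round(np.linalg.eigvalsh(S), 2))
# Typically both positive and negative eigenvalues appear, i.e. S is not PSD.
```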
Our second goal for this dissertation is to explore bilinear forms as
“similarity” measures that would help us learn useful data embeddings.
Towards this goal, we focus on learning embeddings of entities in a
knowledge base and leverage them for the NLP downstream task of ED
– Chapter 6 and Chapter 7.

1.5 hyperbolic spaces – an intuition

In this dissertation, one of our major goals is to embed data in the


hyperbolic space. We now provide some motivation and intuition on
some useful properties of this special Riemannian manifold, but defer
the mathematical details to Chapter 2.
We begin by considering the case of embedding tree structures in
an approximately isometric manner. It is mathematically impossible to
achieve arbitrarily low (total or worst-case) distortion in the Euclidean
space with any number of dimensions. However, it is mathematically
possible [De +18a]; [Gro87] to do so for any tree when using an embed-
ding hyperbolic space of only 2 dimensions! An intuition on this fact
is given in fig. 1.3.


Figure 1.3: Visualization of Escher tiles (left) and a regular tree (right)
represented in the Poincaré ball.

Why does this happen ? One important property of the hyperbolic


spaces is the exponential volume growth, i.e. in just 2 dimensions, a
ball of radius r centered at the origin has volume

\[ V_2^{\mathbb{H}}(r) = 4\pi \sinh^2(r/2) = O(e^r) \tag{1.29} \]

On the contrary, the same ball in the Euclidean space of dimension d grows only polynomially in the radius, i.e.

\[ V_d^{E}(r) = \frac{\pi^{d/2}}{\Gamma\!\left(\frac{d}{2} + 1\right)}\, r^d = O(r^d) \tag{1.30} \]

This poses a capacity problem for data requiring exponential vol-


ume growth such as trees and hierarchical structures or sequence data
(e.g. NLP, time series, genetics). In the Euclidean space, distances do not
have the capacity to accurately represent (dis)similarity if the volume of
the embedding space does not grow with a similar rate as the data. For
example, if the tree in fig. 1.3 would have been embedded in the Eu-
clidean space such that each graph edge would correspond to the same
embedding distance, then, after a few tree levels, different tree branches
would overlap, breaking the assumption of geometric distances rep-
resenting the graph metric. However, in the hyperbolic space of only
2 dimensions, all the tree edges shown in fig. 1.3 would correspond
to the same geodesic distance, thus preventing undesired overlapping


(and, in fact, preserving the original tree-metric with arbitrarily low


distortion).
As a consequence, hyperbolic spaces have the potential to use much
fewer dimensions than the current Euclidean models for tree-like types
of data. Indeed, empirically, [De +18a]; [NK17]; [NK18] have proved
that hyperbolic spaces with 5 dimensions achieve better ranking qual-
ity compared to 200-dimensional Euclidean spaces when embedding
taxonomic data (such as the WordNet dataset). Disjoint subtrees from
the latent hierarchical structure surprisingly disentangle and cluster
in the embedding space as a simple reflection of the space’s negative
curvature.
The tree-likeness properties of hyperbolic spaces have been exten-
sively studied [Gro87]; [Ham17]; [Ung08] and used to visualize large
taxonomies [LRP95] or to embed heterogeneous complex networks
[Kri+10]. Due to the negative curvature and exponentially growing
volume, this geometry is mathematically suitable (see [Kri+10]) to ap-
proximately isometrically embed hierarchical structures or scale-free
networks with heavy tailed degree distributions that are ubiquitous
among real-world graphs [ST08]; [VS14] (e.g. social, text or biologi-
cal networks). For such datasets, this space offers a better inductive
bias that can ease optimization, lead to improved clustering [Kri+10],
interpretability, disentanglement and generalization properties. As
suggested in [NK17] the data’s latent hierarchical structure surprisingly
emerges due to the space’s negative curvature.

1.6 thesis contributions

We next give a brief overview of the main contributions of this thesis.

• Embedding DAGs in the Hyperbolic Space using Nested En-


tailment Cones

Our first contribution is based on the publication “Hyperbolic Entail-


ment Cones for Learning Hierarchical Embeddings” [GBH18b], detailed
in Chapter 3.
Learning graph representations via low-dimensional embeddings that
preserve relevant network properties is an important class of problems
in machine learning. We present a novel method to embed directed
acyclic graphs. Following prior work, we first advocate for using


hyperbolic spaces which provably model tree-like structures better than


Euclidean geometry. Second, we view hierarchical relations as partial
orders defined using a family of nested geodesically convex cones. We
prove that these entailment cones admit an optimal shape with a closed
form expression both in the Euclidean and hyperbolic spaces, which
canonically define the embedding learning process. Experiments show
significant improvements of our method over strong recent baselines
both in terms of representational capacity and generalization.

• Hyperbolic Neural Networks

Our second contribution is based on the publication “Hyperbolic


Neural Networks” [GBH18a], detailed in Chapter 4.
The key question we target is: “how to use hyperbolic embeddings in
downstream tasks in a principled manner by leveraging the geometric
properties of the embedding space?”. Traditionally, deep learning
architectures digest input data embeddings as feature vectors endowed
with standard Euclidean operations (e.g. vector addition). To benefit
from the hyperbolic structure, we generalize in a principled manner
common neural network layers to the hyperbolic domain - feed-forward
neural networks FFNN, recurrent neural networks RNN, gated recurrent
units GRU and multiclass logistic regression MLR. Towards this goal,
we use the connection between the gyro-vector space theory and the
Riemannian geometry of the Poincaré ball. Our models recover their
corresponding Euclidean variants when the space’s curvature goes to
0, i.e. when continuously deforming the hyperbolic into the Euclidean
space. Empirically, we show that, even if hyperbolic optimization
tools are limited, hyperbolic word embeddings are better classified
using our method compared to traditional Euclidean techniques, while
hyperbolic sentence embeddings are better suited when addressing
textual entailment and noisy-prefix recognition tasks.

• Hyperbolic Word Embeddings

Our next contribution is based on the publication “Poincaré Glove:


Hyperbolic Word Embeddings” [TBG19] 11 , detailed in Chapter 5.
We take a step towards learning the first unsupervised hyperbolic
word embeddings that can simultaneously be competitive on three

11 Equal contribution. Shared first authorship across all authors.


different tasks - semantic similarity, hypernymy prediction and semantic


analogy. The key aspect is to leverage the connection between distances
in products of hyperbolic spaces and Fisher distances of Gaussian
embeddings with diagonal covariance matrices [CSS15]. This allows us
to move back and forth between point and distributional embeddings,
taking the best of both worlds, but also benefiting from powerful
Riemannian optimization tools we previously developed [BG19] in
order to match their Euclidean variants (e.g. AdaGrad [DHS11]).

• Neural Embeddings and Message Passing Inference Techniques


for Entity Disambiguation

Finally, the last contribution is based on the publications “Probabilis-


tic Bag-of-hyperlinks Model for Entity Linking” [Gan+16] and “Deep
Joint Entity Disambiguation with Local Neural Attention” [GH17],
detailed in Chapter 6 and Chapter 7.
We investigate the core NLP task of ED, i.e. finding entities in raw
text corpora and linking them to knowledge bases. As a strong start,
we establish a state-of-the-art model that does not use embeddings
and neural networks. Towards this goal, we propose a probabilistic
approach that makes use of an effective graphical model to perform col-
lective entity disambiguation. Input mentions (i.e., linkable token spans)
are disambiguated jointly across an entire document by combining a
document-level prior of entity co-occurrences with local information
captured from mentions and their surrounding context. The model is
based on simple sufficient statistics extracted from data, thus relying on
few parameters to be learned. Our method does not require extensive
feature engineering, nor an expensive training procedure. We use loopy
belief propagation to perform approximate inference. We demonstrate
the accuracy of our approach on a wide range of benchmark datasets,
showing that it matches, and in many cases outperforms, existing
state-of-the-art methods.
Next, we propose a novel deep learning model for joint document-
level entity disambiguation, which leverages learned neural represen-
tations. Key components are entity embeddings, bilinear forms as
similarity functions, a neural attention mechanism over local context
windows and a differentiable joint inference stage for disambiguation,
namely truncated unrolled loopy belief propagation. Our approach
thereby combines the benefits of deep learning with more traditional
approaches such as graphical models and probabilistic mention-entity


maps. Extensive experiments show that we are able to obtain competi-


tive or state-of-the-art accuracy at moderate computational costs.

2 hyperbolic geometry
In this chapter, we present the hyperbolic Riemannian manifold that
would be used as an embedding space in our methods presented in
Chapter 3, Chapter 4 and in Chapter 5. We already gave an intuition
on Riemannian manifolds and hyperbolic spaces in Chapter 1 and
section 1.4. We now give mathematical details of these concepts. The
material presented here has in parts been published in the publica-
tions [GBH18a]; [GBH18b].

notations. We always use k · k to denote the Euclidean norm of


a point (in both hyperbolic or Euclidean spaces). We also use h·, ·i to
denote the Euclidean scalar product.

2.1 short overview of differential geometry

For a rigorous reasoning about hyperbolic spaces, one needs to use


concepts in differential geometry, some of which we highlight here. For
an in-depth introduction, we refer the reader to [Spi79] and [HA10].

manifold. A manifold M of dimension n is a set that can be locally


approximated by the Euclidean space Rn . For instance, the sphere S2
and the torus T2 embedded in R3 are 2-dimensional manifolds, also
called surfaces, as they can locally be approximated by R2 . The notion
of manifold is a generalization of the notion of surface.

tangent space. For x ∈ M, the tangent space Tx M of M at x is


defined as the n-dimensional vector-space approximating M around x
at a first order. It can be defined as the set of vectors v ∈ Rn that can
be obtained as v := c′(0), where c : (−ε, ε) → M is a smooth path in
M such that c(0) = x.

riemannian metric. A Riemannian metric g on M is a collection


( gx )x of inner-products gx : Tx M × Tx M → R on each tangent space


Figure 2.1: Tangent space, a tangent unit-speed vector and its deter-
mined geodesic in a Riemannian manifold. Image source:
Wikipedia.org

Tx M, varying smoothly with x. Although it defines the geometry of


M locally, it induces a global distance function d : M × M → R+ by
setting d(x, y) to be the infimum of all lengths of smooth curves joining
x to y in M, where the length ` of a curve γ : [0, 1] → M is defined by
integrating the length of the speed vector living in the tangent space:
\[ \ell(\gamma) = \int_0^1 \sqrt{g_{\gamma(t)}\!\left(\gamma'(t), \gamma'(t)\right)}\, dt. \tag{2.1} \]

riemannian manifold. A smooth manifold M equipped with a


Riemannian metric g is called a Riemannian manifold and is denoted by
the pair (M, g). Subsequently, due to their metric properties, we will
only consider such manifolds.

geodesics. A geodesic (straight line) between two points x, y ∈ M


is a smooth curve of minimal length joining x to y in M. Geodesics
define shortest paths on the manifold. They are a generalization of lines
in the Euclidean space.

parallel transport. In certain spaces, such as the hyperbolic


space, there is a unique geodesic between two points, which allows
to consider the parallel transport from x to y (implicitly taken along
this unique geodesic) Px→y : Tx M → Ty M, which is a linear isometry
between tangent spaces corresponding to moving tangent vectors along
geodesics and defines a canonical way to connect tangent spaces.

exponential map. The exponential map expx : Tx M → M around


x, when well-defined, maps a small perturbation of x by a vector


v ∈ Tx M to a point expx (v ) ∈ M, such that t ∈ [0, 1] 7→ expx (tv ) is


a geodesic joining x to expx (v ). In Euclidean space, we simply have
expx (v ) = x + v. The exponential map is important, for instance,
when performing gradient-descent over parameters lying in a mani-
fold [Bon13]. For geodesically complete manifolds, such as the Poincaré
ball considered in this work, expx is well-defined on the full tangent
space Tx M.

conformality. A metric g̃ on M is said to be conformal to g if it


defines the same angles, i.e. for all x ∈ M and u, v ∈ Tx M \ {0},
\[ \frac{\tilde{g}_x(u, v)}{\sqrt{\tilde{g}_x(u, u)}\sqrt{\tilde{g}_x(v, v)}} = \frac{g_x(u, v)}{\sqrt{g_x(u, u)}\sqrt{g_x(v, v)}}. \tag{2.2} \]
This is equivalent to the existence of a smooth function λ : M → (0, ∞)
such that g̃x = λ2x gx , which is called the conformal factor of g̃ (w.r.t. g).

2.2 hyperbolic space: the poincaré ball

The hyperbolic space of dimension n ≥ 2 is a fundamental object in


Riemannian geometry. It is (up to isometry) uniquely characterized
as a complete, simply connected Riemannian manifold with constant
negative curvature [Can+97]. The other two model spaces of constant
sectional curvature are the flat Euclidean space Rn (zero curvature) and
the hyper-sphere Sn (positive curvature).
The hyperbolic space has five models which are often insightful
to work in. They are isometric to each other and conformal to the
Euclidean space [Can+97]; [Par13]. We prefer to work in the Poincaré
ball model Dn for the same reasons as [NK17] and, additionally, because
we can derive a closed form expression of geodesics and exponential
map.

poincaré metric tensor. The Poincaré ball model (D^n, g^D) is defined by the manifold D^n = {x ∈ R^n : ‖x‖ < 1} equipped with the following Riemannian metric
\[ g_x^{\mathbb{D}} = \lambda_x^2\, g^E, \qquad \text{where } \lambda_x := \frac{2}{1 - \|x\|^2} \tag{2.3} \]
and g^E is the Euclidean metric tensor with components I_n of the standard space R^n with the usual Cartesian coordinates.


As the above model is a Riemannian manifold, its metric tensor is


fundamental in order to uniquely define most of its geometric prop-
erties like distances, inner products (in tangent spaces), straight lines
(geodesics), curve lengths or volume elements. In the Poincaré ball
model, the Euclidean metric is changed by a simple scalar field, hence
the model is conformal (i.e. angle preserving), yet distorts distances.

induced distance and norm. It is known [NK17] that the in-


duced distance between 2 points x, y ∈ Dn is given by

\[ d_{\mathbb{D}}(x, y) = \cosh^{-1}\!\left(1 + 2\,\frac{\|x - y\|^2}{(1 - \|x\|^2)(1 - \|y\|^2)}\right). \tag{2.4} \]

The Poincaré norm is then defined as:

\[ \|x\|_{\mathbb{D}} := d_{\mathbb{D}}(0, x) = 2\tanh^{-1}(\|x\|) \tag{2.5} \]
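A direct sketch of these two formulas in code (function names are illustrative):

```python
# Direct implementation of eqs. (2.4) and (2.5); function names are ours.
import numpy as np

def poincare_dist(x, y):
    """Geodesic distance d_D(x, y) in the Poincare ball, eq. (2.4)."""
    sq = np.sum((x - y) ** 2)
    denom = (1 - np.sum(x ** 2)) * (1 - np.sum(y ** 2))
    return np.arccosh(1 + 2 * sq / denom)

def poincare_norm(x):
    """||x||_D = d_D(0, x) = 2 artanh(||x||), eq. (2.5)."""
    return 2 * np.arctanh(np.linalg.norm(x))

x = np.array([0.3, 0.4])
print(poincare_dist(np.zeros(2), x))   # ~1.0986
print(poincare_norm(x))                # same value, as expected from eq. (2.5)
```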

geodesics and exponential map. We derive closed form para-


metric expressions of unit-speed geodesics and exponential map in the
Poincaré ball. We will make extensive use of these tools throughout this
thesis. Geodesics in Dn are all intersections of the Euclidean unit ball
Dn with (degenerated) Euclidean circles orthogonal to the unit sphere
∂Dn (equations are derived below). We know from the Hopf-Rinow
theorem that the hyperbolic space is complete as a metric space. This
guarantees that Dn is geodesically complete. Thus, the exponential
map is defined for each point x ∈ Dn and any v ∈ Rn (= Tx Dn ). To
derive its closed form expression, we first prove the following.

Theorem 2.1. (Unit-speed geodesics) Let x ∈ Dn and v ∈ Tx Dn (= Rn )


such that gxD (v, v ) = 1. The unit-speed geodesic γx,v : R+ → Dn with
γx,v (0) = x and γ̇x,v (0) = v is given by

\[ \gamma_{x,v}(t) = \frac{\left(\lambda_x \cosh(t) + \lambda_x^2 \langle x, v\rangle \sinh(t)\right) x + \lambda_x \sinh(t)\, v}{1 + (\lambda_x - 1)\cosh(t) + \lambda_x^2 \langle x, v\rangle \sinh(t)} \tag{2.6} \]

Proof. We first explain the formulas for geodesics in the hyperboloid


model of the hyperbolic geometry. Further, we project them to geodesics
in the Poincaré ball.

The hyperboloid model is (Hn , h·, ·i1 ), where Hn := {x ∈ Rn,1 :


hx, xi1 = −1, x0 > 0}. The hyperboloid model can be viewed from


the extrinsic point of view as embedded in the pseudo-Riemannian


manifold Minkowski space (Rn,1 , h·, ·i1 ) and inducing its metric. The
Minkowski metric tensor g^{R^{n,1}} of signature (n, 1) has the components
\[ g^{\mathbb{R}^{n,1}} = \begin{pmatrix} -1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{pmatrix} \]

The associated inner-product is ⟨x, y⟩₁ := −x_0 y_0 + ∑_{i=1}^{n} x_i y_i. Note that


the hyperboloid model is a Riemannian manifold because the quadratic
form associated with gH is positive definite.
In the extrinsic view, the tangent space at Hn can be described as
Tx Hn = {v ∈ Rn,1 : hv, xi1 = 0}, see [Par13]; [RS11].

Unit-speed geodesics of Hn are given by the following theorem (Eq


(6.4.10) in [RS11]):
Theorem 2.2. Let x ∈ Hn and v ∈ Tx Hn be a unit norm vector, i.e. hv, v i =
1. The unique unit-speed geodesic φx,v : [0, 1] → Hn with φx,v (0) = x and
φ̇x,v (0) = v is

φx,v (t) = x cosh(t) + v sinh(t). (2.7)

We can now use the Egregium theorem to project the geodesics of Hn


to the geodesics of Dn . We can do that because we know an isometry
ψ : Dn → Hn between the two spaces:

\[ \psi(x) := (\lambda_x - 1,\, \lambda_x x), \qquad \psi^{-1}(x_0, x') = \frac{x'}{1 + x_0} \tag{2.8} \]

Formally, let x ∈ Dn , v ∈ Tx Dn with gD (v, v ) = 1. Also, let γ :


[0, 1] → Dn be the unique unit-speed geodesic in Dn with γ(0) = x
and γ̇(0) = v. Then, by Egregium theorem, φ := ψ ◦ γ is also a
unit-speed geodesic in Hn . From theorem 2.2, we have that φ(t) =
x0 cosh(t) + v 0 sinh(t), for some x0 ∈ Hn , v 0 ∈ Tx0 Hn . One derives their
expression:

x0 = ψ ◦ γ(0) = (λx − 1, λx x) (2.9)


λ2x hx, v i
 
0 ∂ψ(y0 , y)
v = φ̇(0) = γ̇(0) = 2
∂y
γ (0) λx hx, v ix + λx v


Inverting once again, γ(t) = ψ−1 ◦ φ(t), one gets the closed-form
expression for γ stated in the theorem.
One can sanity check that indeed the formula from theorem 2.1
satisfies the conditions:
• dD (γ(0), γ(t)) = t, ∀t ∈ [0, 1]
• γ (0) = x

• γ̇(0) = v

• limt→∞ γ(t) := γ(∞) ∈ ∂Dn

We can now derive the formula for the exponential map in the
Poincaré ball.
Corollary 2.2.1. (Exponential map) The exponential map at a point x ∈ Dn ,
namely expx : Tx Dn → Dn , is given by

\[ \exp_x(v) = \frac{\lambda_x\!\left(\cosh(\lambda_x\|v\|) + \langle x, \tfrac{v}{\|v\|}\rangle \sinh(\lambda_x\|v\|)\right)}{1 + (\lambda_x - 1)\cosh(\lambda_x\|v\|) + \lambda_x\langle x, \tfrac{v}{\|v\|}\rangle \sinh(\lambda_x\|v\|)}\, x + \frac{\frac{1}{\|v\|}\sinh(\lambda_x\|v\|)}{1 + (\lambda_x - 1)\cosh(\lambda_x\|v\|) + \lambda_x\langle x, \tfrac{v}{\|v\|}\rangle \sinh(\lambda_x\|v\|)}\, v \tag{2.10} \]

Proof. Denote u = v / √(g_x^D(v, v)). Using the notations from theorem 2.1, one has exp_x(v) = γ_{x,u}(√(g_x^D(v, v))). Using eqs. (2.3) and (2.6), one derives the result.
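As a quick sanity check of eq. (2.10) (an illustrative sketch; all names are ours), the geodesic distance from x to exp_x(v) should equal the Riemannian norm √(g_x^D(v, v)) = λ_x‖v‖:

```python
# Sanity check of eq. (2.10): the geodesic distance from x to exp_x(v) must
# equal the Riemannian norm of v, i.e. sqrt(g_x(v, v)) = lambda_x * ||v||.
import numpy as np

def poincare_dist(x, y):
    sq = np.sum((x - y) ** 2)
    return np.arccosh(1 + 2 * sq / ((1 - np.sum(x**2)) * (1 - np.sum(y**2))))

def exp_map(x, v):
    lam = 2.0 / (1 - np.sum(x ** 2))
    nv = np.linalg.norm(v)
    cos_a = np.dot(x, v / nv)
    denom = 1 + (lam - 1) * np.cosh(lam * nv) + lam * cos_a * np.sinh(lam * nv)
    coef_x = lam * (np.cosh(lam * nv) + cos_a * np.sinh(lam * nv))
    coef_v = np.sinh(lam * nv) / nv
    return (coef_x * x + coef_v * v) / denom

x = np.array([0.2, -0.1, 0.3])
v = np.array([0.006, -0.007, 0.032])
lam = 2.0 / (1 - np.sum(x ** 2))
print(poincare_dist(x, exp_map(x, v)))   # the two printed values agree
print(lam * np.linalg.norm(v))
```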

We also derive the following fact (useful for future proofs).


Corollary 2.2.2. Given any arbitrary geodesic in Dn , all its points are copla-
nar with the origin 0.
Proof. For any geodesic γx,v (t), consider the plane spanned by the
vectors x and v. Then, from theorem 2.1, this plane contains all the
points of γx,v (t), i.e.

{γx,v (t) : t ∈ R} ⊆ { ax + bv : a, b ∈ R} (2.11)


angles in hyperbolic space. It is natural to extend the Eu-


clidean notion of an angle to any geodesically complete Riemannian
manifold. For any points A, B, C on such a manifold, the angle ∠ ABC is
the angle between the initial tangent vectors of the geodesics connecting
B with A, and B with C, respectively. In the Poincaré ball, the angle
between two tangent vectors u, v ∈ Tx Dn is given by

\[ \cos(\angle(u, v)) = \frac{g_x^{\mathbb{D}}(u, v)}{\sqrt{g_x^{\mathbb{D}}(u, u)}\sqrt{g_x^{\mathbb{D}}(v, v)}} = \frac{\langle u, v\rangle}{\|u\|\|v\|} \tag{2.12} \]

The second equality holds since g^D is conformal to g^E.

hyperbolic trigonometry. The notion of angles and geodesics


allow definition of the notion of a triangle in the Poincaré ball. Then,
the classic theorems in Euclidean geometry have hyperbolic formula-
tions [Par13]. In the next chapters, we will use the following theorems.
Let A, B, C ∈ Dn . Denote by ∠ B := ∠ ABC and by c = dD ( B, A) the
length of the hyperbolic segment BA (and others). Then, the hyperbolic
laws of cosines and sines hold respectively

\[ \cos(\angle B) = \frac{\cosh(a)\cosh(c) - \cosh(b)}{\sinh(a)\sinh(c)} \tag{2.13} \]
\[ \frac{\sin(\angle A)}{\sinh(a)} = \frac{\sin(\angle B)}{\sinh(b)} = \frac{\sin(\angle C)}{\sinh(c)} \tag{2.14} \]

embedding trees in hyperbolic vs euclidean spaces. We


give a brief explanation on why hyperbolic spaces are better suited
than Euclidean spaces for embedding trees.
[Gro87] introduces a notion of δ-hyperbolicity in order to characterize
how ‘hyperbolic’ a metric space is. For instance, the Euclidean space R^n for n ≥ 2 is not δ-hyperbolic for any δ ≥ 0, while the Poincaré ball D^n is log(1 + √2)-hyperbolic. This is formalized in the following theorem 1 (section 6.2 of [Gro87], proposition 6.7 of [Bow06]):

Theorem 2.3. For any δ > 0, any δ-hyperbolic metric space ( X, d X ) and any
set of points x1 , ..., xn ∈ X, there exists a finite weighted tree ( T, d T ) and an
embedding f : T → X such that for all i, j,

|dT ( f −1 ( xi ), f −1 ( x j )) − d X ( xi , x j )| = O(δ log(n)). (2.15)


1 https://en.wikipedia.org/wiki/Hyperbolic_metric_space


Conversely, any tree can be embedded with arbitrary low distortion


into the Poincaré disk (with only 2 dimensions), whereas this is not true
for Euclidean spaces even when an unbounded number of dimensions
is allowed [De +18a]; [Sar11].
As explained in section 1.5, the difficulty in embedding trees having a
branching factor at least 2 in a quasi-isometric manner comes from the
fact that they have an exponentially increasing number of nodes with
depth. The exponential volume growth of hyperbolic metric spaces
confers them enough capacity to embed trees quasi-isometrically, unlike
the Euclidean space.

2.2.1 Gyrovector Spaces

In the Euclidean space, natural operations inherited from the vectorial


structure, such as vector addition, subtraction and scalar multiplication
are often useful. The framework of gyrovector spaces provides an ele-
gant non-associative algebraic formalism for hyperbolic geometry just
as vector spaces provide the algebraic setting for Euclidean geometry
[Alb08]; [Ung01]; [Ung08].
In particular, these operations are used in special relativity, allowing
to add speed vectors belonging to the Poincaré ball of radius c (the
celerity, i.e. the speed of light) so that they remain in the ball, hence not
exceeding the speed of light.
We will make extensive use of these operations in our definitions of
hyperbolic neural networks.

For any c ≥ 0, let us denote 2 by D_c^n := {x ∈ R^n | c‖x‖² < 1} the Poincaré ball of constant curvature −c. Note that if c = 0, then D_c^n = R^n; if c > 0, then D_c^n is the open ball of radius 1/√c. If c = 1 then we recover the usual ball D^n.

möbius addition. The Möbius addition of x, y ∈ D_c^n is defined as
\[ x \oplus_c y := \frac{(1 + 2c\langle x, y\rangle + c\|y\|^2)\, x + (1 - c\|x\|^2)\, y}{1 + 2c\langle x, y\rangle + c^2\|x\|^2\|y\|^2}. \tag{2.16} \]

In particular, when c = 0, one recovers the Euclidean addition of


x, y ∈ Rn : limc→0 x ⊕c y = x + y. Note that, without loss of generality,
2 We take different notations as in [Ung01] where the author uses s = 1/√c.


the case c > 0 can be reduced to c = 1. Unless stated otherwise,


we will use ⊕ as ⊕1 to simplify notations. For general c > 0, this
operation is not commutative nor associative. However, it satisfies, for
any x, y ∈ Dnc ,

• x ⊕c 0 = 0 ⊕c x = x (identity element)

• (−x) ⊕c x = x ⊕c (−x) = 0 (inverse element)

• (−x) ⊕c (x ⊕c y) = y (left-cancellation law)

The Möbius subtraction is then defined by the use of the following notation: x ⊖_c y := x ⊕_c (−y). See [Ver05, section 2.1] for a geometric
interpretation of the Möbius addition.

möbius scalar multiplication. For c > 0, the Möbius scalar multiplication of x ∈ D_c^n \ {0} by r ∈ R is defined as
\[ r \otimes_c x := (1/\sqrt{c}) \tanh\!\left(r \tanh^{-1}(\sqrt{c}\|x\|)\right) \frac{x}{\|x\|}, \tag{2.17} \]
and r ⊗_c 0 := 0. Similarly as for the Möbius addition, one recovers the Euclidean scalar multiplication when c goes to zero: lim_{c→0} r ⊗_c x = rx. This operation satisfies desirable properties such as the following (a short numerical check is given after the list):

• n ⊗_c x = x ⊕_c ⋯ ⊕_c x (n additions)

• (r + r′) ⊗_c x = r ⊗_c x ⊕_c r′ ⊗_c x (scalar distributivity 3)

• (rr′) ⊗_c x = r ⊗_c (r′ ⊗_c x) (scalar associativity)

• |r| ⊗_c x / ‖r ⊗_c x‖ = x / ‖x‖ (scaling property)
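The promised numerical check of two of these identities (a minimal sketch; function names are illustrative):

```python
# Numerical check of the scalar associativity and scaling properties of the
# Moebius scalar multiplication, eq. (2.17); function names are ours.
import numpy as np

def mobius_scalar_mul(r, x, c=1.0):
    nx = np.linalg.norm(x)
    if nx == 0:
        return np.zeros_like(x)
    return (1 / np.sqrt(c)) * np.tanh(r * np.arctanh(np.sqrt(c) * nx)) * x / nx

c = 1.0
x = np.array([0.3, -0.2, 0.1])
r, s = 0.7, -1.3

lhs = mobius_scalar_mul(r * s, x, c)
rhs = mobius_scalar_mul(r, mobius_scalar_mul(s, x, c), c)
print(np.allclose(lhs, rhs))                 # scalar associativity: True

y = mobius_scalar_mul(abs(r), x, c)
print(np.allclose(y / np.linalg.norm(y), x / np.linalg.norm(x)))  # scaling: True
```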

distance. If one defines the generalized hyperbolic metric tensor g^c as the metric conformal to the Euclidean one, with conformal factor λ_x^c := 2/(1 − c‖x‖²), then the induced distance function on the Riemannian manifold (D_c^n, g^c) is given by 4
\[ d_c(x, y) = (2/\sqrt{c}) \tanh^{-1}\!\left(\sqrt{c}\, \|{-x} \oplus_c y\|\right). \tag{2.18} \]

3 ⊗_c has priority over ⊕_c in the sense that r ⊗_c x ⊕_c y := (r ⊗_c x) ⊕_c y and r ⊕_c x ⊗_c y := r ⊕_c (x ⊗_c y).
4 The notation −x ⊕_c y should always be read as (−x) ⊕_c y and not −(x ⊕_c y).


Again, observe that lim_{c→0} d_c(x, y) = 2‖x − y‖, i.e. we recover Euclidean geometry in the limit 5. Moreover, for c = 1 we recover d_D previously stated in eq. (2.4).
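A minimal sketch of eqs. (2.16) and (2.18) in code (names are illustrative), verifying the left-cancellation law and the Euclidean limit c → 0:

```python
# Minimal implementation of the Moebius addition (eq. (2.16)) and the induced
# distance (eq. (2.18)), checking the left-cancellation law and the Euclidean
# limit c -> 0. Function names are ours.
import numpy as np

def mobius_add(x, y, c=1.0):
    xy, x2, y2 = np.dot(x, y), np.dot(x, x), np.dot(y, y)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    return num / (1 + 2 * c * xy + c**2 * x2 * y2)

def dist_c(x, y, c=1.0):
    return (2 / np.sqrt(c)) * np.arctanh(
        np.sqrt(c) * np.linalg.norm(mobius_add(-x, y, c)))

x = np.array([0.1, -0.3, 0.2])
y = np.array([-0.2, 0.1, 0.4])

print(np.allclose(mobius_add(-x, mobius_add(x, y)), y))   # left-cancellation law

for c in [1.0, 1e-2, 1e-6]:                               # d_c -> 2 * ||x - y||
    print(c, dist_c(x, y, c), 2 * np.linalg.norm(x - y))
```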

hyperbolic angles. Similarly as in the Euclidean space, one can


define the notions of hyperbolic angles or gyroangles (when using the ⊕c
operator) in the generalized Poincaré ball (Dnc , gc ). For A, B, C ∈ Dnc ,
we denote by ∠ A := ∠ BAC the angle between the two geodesics
starting from A and ending at B and C respectively. This angle can be
defined in two equivalent ways: i) either using the angle between the
initial velocities of the two geodesics as given by eq. (2.12), or ii) using
the formula
\[ \cos(\angle A) = \left\langle \frac{(-A) \oplus_c B}{\|(-A) \oplus_c B\|},\ \frac{(-A) \oplus_c C}{\|(-A) \oplus_c C\|} \right\rangle, \tag{2.19} \]

In this case, ∠ A is also called a gyroangle by [Ung08, section 4].

hyperbolic law of sines. The classic law of sines has also a


hyperbolic variant that we state here. If for A, B, C ∈ Dnc , we denote by
∠ B := ∠ ABC the angle between the two geodesics starting from B and
ending at A and C respectively, and by c̃ = dc ( B, A) the length of the
hyperbolic segment BA (and similarly for others), then we have:

\[ \frac{\sin(\angle A)}{\sinh(\sqrt{c}\,\tilde{a})} = \frac{\sin(\angle B)}{\sinh(\sqrt{c}\,\tilde{b})} = \frac{\sin(\angle C)}{\sinh(\sqrt{c}\,\tilde{c})}. \tag{2.20} \]

Note that one can also adapt the hyperbolic law of cosines to the
hyperbolic space.

2.2.2 Connecting Gyrovector Spaces with the Riemannian geometry of the Poincaré Ball

In this subsection, we connect the gyrospace framework with the Rie-


mannian geometry of the hyperbolic space. We present how geodesics
in the Poincaré ball model are usually described with Möbius oper-
ations, and push one step further the existing connection between

5 The factor 2 comes from the conformal factor λ_x = 2/(1 − ‖x‖²), which is a convention setting the curvature to −1.


gyrovector spaces and the Poincaré ball by finding new identities in-
volving the exponential map, and parallel transport.
In particular, these findings provide us with a simpler formulation
of Möbius scalar multiplication, yielding a natural definition of matrix-
vector multiplication in the Poincaré ball.

riemannian gyroline element. The Riemannian gyroline ele-


ment is defined for an infinitesimal dx as ds := (x + dx) c x, and its
size is given by [Ung08, section 3.7]:

kdsk = k(x + dx) c xk = kdxk/(1 − ckxk2 ). (2.21)

What is remarkable is that it turns out to be identical, up to a scaling fac-


tor of 2, to the usual line element 2kdxk/(1 − ckxk2 ) of the Riemannian
manifold (Dnc , gc ).

geodesics. The geodesic connecting points x, y ∈ Dnc is shown in


[Alb08]; [Ung08] to have the formula:

γx→y (t) := x ⊕c (−x ⊕c y) ⊗c t (2.22)

where γx→y : R → Dnc is a curve with conditions γx→y (0) = x and


γx→y (1) = y. Note that, when c goes to 0, geodesics become straight-
lines, recovering Euclidean geometry.

Lemma 2.1. For any x ∈ D_c^n and v ∈ T_x D_c^n s.t. g_x^c(v, v) = 1, the unit-speed geodesic γ_{x,v} : R → D_c^n starting from point x with direction v, namely γ_{x,v}(0) = x and γ̇_{x,v}(0) = v, is given by:
\[ \gamma_{x,v}(t) = x \oplus_c \left(\tanh\!\left(\sqrt{c}\, \frac{t}{2}\right) \frac{v}{\sqrt{c}\|v\|}\right) \tag{2.23} \]

Proof. One can use eq. (2.22) and re-parametrize it to unit-speed using
eq. (2.18). Alternatively, direct computation and identification with the
formula in theorem 2.1 would give the same result. Using eqs. (2.18)
and (2.23), one can sanity-check that dc (γ(0), γ(t)) = t, ∀t ∈ [0, 1].

exponential and logarithmic maps. The following lemma


gives the closed-form derivation of exponential and logarithmic maps.


Lemma 2.2. For any point x ∈ D_c^n, the exponential map exp_x^c : T_x D_c^n → D_c^n and the logarithmic map log_x^c : D_c^n → T_x D_c^n are given for v ≠ 0 and y ≠ x by:
\[ \exp_x^c(v) = x \oplus_c \left(\tanh\!\left(\sqrt{c}\, \frac{\lambda_x^c\|v\|}{2}\right) \frac{v}{\sqrt{c}\|v\|}\right) \tag{2.24} \]
\[ \log_x^c(y) = \frac{2}{\sqrt{c}\, \lambda_x^c} \tanh^{-1}\!\left(\sqrt{c}\, \|{-x} \oplus_c y\|\right) \frac{-x \oplus_c y}{\|{-x} \oplus_c y\|} \tag{2.25} \]
Proof. Following the proof of corollary 2.2.1, one obtains
\[ \exp_x^c(v) = \gamma_{x,\, \frac{v}{\lambda_x^c\|v\|}}\!\left(\lambda_x^c \|v\|\right) \tag{2.26} \]
Using eq. (2.23) gives the formula for exp_x^c. Algebraic check of the identity log_x^c(exp_x^c(v)) = v concludes the proof.

The above maps have more appealing forms when x = 0, namely for v ∈ T_0 D_c^n \ {0}, y ∈ D_c^n \ {0}:
\[ \exp_0^c(v) = \tanh(\sqrt{c}\|v\|)\, \frac{v}{\sqrt{c}\|v\|} \tag{2.27} \]
\[ \log_0^c(y) = \tanh^{-1}(\sqrt{c}\|y\|)\, \frac{y}{\sqrt{c}\|y\|} \tag{2.28} \]
Moreover, we still recover Euclidean geometry in the limit c → 0,
as limc→0 expcx (v ) = x + v is the Euclidean exponential map, and
limc→0 logcx (y) = y − x is the Euclidean logarithmic map.
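A small sketch of the maps at the origin (function names are illustrative), checking the round trip log_0^c(exp_0^c(v)) = v:

```python
# The maps at the origin, eqs. (2.27)-(2.28), with a round-trip check
# log_0(exp_0(v)) = v. Function names are ours.
import numpy as np

def exp0(v, c=1.0):
    nv = np.linalg.norm(v)
    return np.tanh(np.sqrt(c) * nv) * v / (np.sqrt(c) * nv)

def log0(y, c=1.0):
    ny = np.linalg.norm(y)
    return np.arctanh(np.sqrt(c) * ny) * y / (np.sqrt(c) * ny)

v = np.array([0.4, -0.7, 1.2])
print(np.allclose(log0(exp0(v)), v))   # True: the maps are mutual inverses
print(np.linalg.norm(exp0(3 * v)))     # < 1: large vectors land close to, but
                                       # strictly inside, the ball boundary
```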

m öbius scalar multiplication using exponential and log-


arithmic maps. We study the exponential and logarithmic maps in
order to gain a better understanding of the Möbius scalar multiplication
eq. (2.17)). We find the following:
Lemma 2.3. The quantity r ⊗_c x can actually be obtained by projecting x in
the tangent space T0 Dnc with the logarithmic map, multiplying this projection
by the scalar r, and then projecting it back on the manifold with the exponential
map:

r ⊗c x = exp0c (r log0c (x)), ∀r ∈ R, x ∈ Dnc . (2.29)

In addition, we recover the well-known relation between geodesics connecting


two points and the exponential map:

γx→y (t) = x ⊕c (−x ⊕c y) ⊗c t = expcx (t logcx (y)), t ∈ [0, 1]. (2.30)


This last result enables us to generalize scalar multiplication in order


to define matrix-vector multiplication between Poincaré balls, one of
the essential building blocks of hyperbolic neural networks.

parallel transport. Finally, we connect parallel transport along


the unique geodesic from 0 to x to gyrovector spaces with the following
theorem, which we prove in [GBH18a].

Theorem 2.4. In the manifold (Dnc , gc ), the parallel transport w.r.t. the Levi-
Civita connection of a vector v ∈ T0 Dnc to another tangent space Tx Dnc is
given by the following isometry:

\[ P^c_{0 \to x}(v) = \log^c_x(x \oplus_c \exp^c_0(v)) = \frac{\lambda^c_0}{\lambda^c_x}\, v. \tag{2.31} \]
As we’ll see later, this result is crucial in order to define and optimize
parameters shared between different tangent spaces, such as biases in
hyperbolic neural layers or parameters of hyperbolic MLR.
The general parallel transport formula for any x, y ∈ Dn , v ∈ Tx D is
given by

\[ P_{x \to y}(v) = \frac{\lambda_x}{\lambda_y} \cdot \mathrm{gyr}[y, -x]\, v \tag{2.32} \]
where gyr 6 is the gyro-automorphism on Dn with closed form expres-
sion shown in Eq. 1.27 of [Ung08]:

\[ \mathrm{gyr}[u, v]w = -(u \oplus v) \oplus \{u \oplus (v \oplus w)\} = w + 2\,\frac{Au + Bv}{D}. \tag{2.33} \]
where the quantities A, B, D have closed-form expressions and are
thus easy to implement:

\[ A = -\langle u, w\rangle\|v\|^2 + \langle v, w\rangle + 2\langle u, v\rangle \cdot \langle v, w\rangle, \tag{2.34} \]
\[ B = -\langle v, w\rangle\|u\|^2 - \langle u, w\rangle, \tag{2.35} \]
\[ D = 1 + 2\langle u, v\rangle + \|u\|^2\|v\|^2. \tag{2.36} \]
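A numerical cross-check of the two expressions for the gyration in eq. (2.33), together with the fact that gyrations preserve the Euclidean norm (an illustrative sketch with c = 1; all names are ours):

```python
# Cross-check of the gyration (c = 1): the closed form of eqs. (2.33)-(2.36)
# against the compositional definition -(u (+) v) (+) (u (+) (v (+) w)), and the
# fact that gyrations preserve the Euclidean norm. Names are ours.
import numpy as np

def mobius_add(x, y):
    xy, x2, y2 = np.dot(x, y), np.dot(x, x), np.dot(y, y)
    return ((1 + 2 * xy + y2) * x + (1 - x2) * y) / (1 + 2 * xy + x2 * y2)

def gyr_def(u, v, w):
    return mobius_add(-mobius_add(u, v), mobius_add(u, mobius_add(v, w)))

def gyr_closed(u, v, w):
    A = -np.dot(u, w) * np.dot(v, v) + np.dot(v, w) + 2 * np.dot(u, v) * np.dot(v, w)
    B = -np.dot(v, w) * np.dot(u, u) - np.dot(u, w)
    D = 1 + 2 * np.dot(u, v) + np.dot(u, u) * np.dot(v, v)
    return w + 2 * (A * u + B * v) / D

u = np.array([0.2, -0.1, 0.3])
v = np.array([-0.3, 0.2, 0.1])
w = np.array([0.1, 0.4, -0.2])

print(np.allclose(gyr_def(u, v, w), gyr_closed(u, v, w)))               # same map
print(np.isclose(np.linalg.norm(gyr_def(u, v, w)), np.linalg.norm(w)))  # isometry
```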

6 https://en.wikipedia.org/wiki/Gyrovector_space

3 hyperbolic entailment cones for learning hierarchical embeddings

In this chapter, we present a novel and principled method for em-


bedding DAGs and hierarchical data in the hyperbolic space. The
key components are nested geodesically convex cones for modeling
entailment relations. These cones induce a partial order relation in
the embedding space and have an appealing closed form expression
derived in a principled manner. This approach improves the representa-
tional capacity of previous related taxonomy embeddings on the word
hypernymy link prediction task.

The material presented here has in parts been published in the publi-
cation [GBH18b].

3.1 introduction and related work

In this chapter, we are interested in geometrically modeling hierarchi-


cal structures, directed acyclic graphs DAGs and entailment relations via
low dimensional embeddings. Examples of applications include word
hypernymy detection [SGD16], textual entailment [Roc+15], taxonomy
induction [Fu+14] or knowledge base completion [Bor+13]; [NTK11].
We follow the motivation of order embeddings [Ven+15] that explic-
itly model the partial order induced by entailment relations between
embedded objects. Formally, in this method, a vector x ∈ Rn represents
a more general concept than any other embedding from the Euclidean
entailment region Ox := {y | yi ≥ xi , ∀1 ≤ i ≤ n}. A first concern is
that the capacity of order embeddings grows only linearly with the
embedding space dimension. Moreover, the regions Ox suffer from
heavy intersections, implying that their disjoint volumes rapidly be-
come bounded 1 . As a consequence, representing wide (with high

1 For example, in n dimensions, no n + 1 distinct regions O_x can simultaneously have unbounded disjoint sub-volumes.


branching factor) and deep hierarchical structures in a bounded region


of the Euclidean space would cause many points to end up undesirably
close to each other. This also implies that Euclidean distances would
no longer be capable of reflecting the original tree metric.
Fortunately, the hyperbolic space does not suffer from the afore-
mentioned capacity problem because the volume of any ball grows
exponentially with its radius, instead of polynomially as in the Eu-
clidean space. This exponential growth property enables hyperbolic
spaces to embed any weighted tree while almost preserving their met-
ric 2 [Bow06]; [Gro87]; [Sar11]. The tree-likeness of hyperbolic spaces
has been extensively studied [Ham17]. Moreover, hyperbolic spaces
are used to visualize large hierarchies [LRP95], to efficiently forward
information in complex networks [CC09]; [Kri+09] or to embed hetero-
geneous, scale-free graphs [Blä+16]; [Kri+10]; [ST08].
From a machine learning perspective, recently, hyperbolic spaces
have been observed to provide powerful representations of entailment
relations [NK17]. The latent hierarchical structure surprisingly emerges
as a simple reflection of the space’s negative curvature. However, the
approach of [NK17] suffers from a few drawbacks: first, their loss
function causes most points to collapse on the border of the Poincaré
ball, as exemplified in fig. 3.3. Second, the hyperbolic distance alone
(being symmetric) is not capable of encoding antisymmetric relations
needed for entailment detection, thus a heuristic score is chosen to
account for concept generality or specificity encoded in the embedding
norm.
We inspire ourselves from hyperbolic embeddings [NK17] and order
embeddings [Ven+15] to develop a cone-based approach to embed
entailment relational data in the hyperbolic space. Our contributions
are summarized as follows:

• We address the aforementioned issues of [NK17] and [Ven+15].


We propose to replace the entailment regions Ox of order em-
beddings by a more efficient and generic class of objects, namely
geodesically convex entailment cones. These cones are defined on a
large class of Riemannian manifolds and induce a partial ordering
relation in the embedding space.

2 See end of section 2.2 for a rigorous formulation and section 1.5 for a detailed intuition.


• The optimal entailment cones satisfying four natural properties


surprisingly exhibit canonical closed-form expressions in both
Euclidean and hyperbolic geometry that we rigorously derive.

• An efficient algorithm for learning hierarchical embeddings of


directed acyclic graphs is presented. This learning process is
driven by our entailment cones.

• Using the analytic closed-form expression for the exponential


map in the n-dimensional Poincaré ball derived in Chapter 2, we
show how to perform full Riemannian optimization [Bon13] in
the Poincaré ball, as opposed to the approximate optimization
method used by [NK17].

• Experimentally, we learn high quality embeddings and improve


over experimental results in [NK17] and [Ven+15] on hypernymy
link prediction for word embeddings, both in terms of capacity of
the model and generalization performance.

3.2 entailment cones in the poincaré ball

In this section, we define “entailment” cones that will be used to


embed hierarchical structures in the Poincaré ball. They generalize and
improve over the idea of order embeddings [Ven+15].

Figure 3.1: Convex cones in a complete Riemannian manifold.


convex cones in a complete riemannian manifold. We are


interested in generalizing the notion of a convex cone to any geodesi-
cally complete Riemannian manifold M (such as hyperbolic models).
In a vector space, a convex cone S (at the origin) is a set that is closed
under non-negative linear combinations

v1 , v2 ∈ S =⇒ αv1 + βv2 ∈ S (∀α, β ≥ 0) . (3.1)

The key idea for generalizing this concept is to make use of the expo-
nential map at a point x ∈ M.

expx : Tx M → M, Tx M = tangent space at x (3.2)

We can now take any cone in the tangent space S ⊆ Tx M at a fixed


point x and map it into a set Sx ⊂ M, which we call the S-cone at x,
via

Sx := expx (S) , S ⊆ Tx M . (3.3)

Note that, in the above definition, we desire that the exponential map
be injective. We already know that it is a local diffeomorphism. Thus,
we restrict the tangent space in eq. (3.3) to the ball B n (0, r ), where r is
the injectivity radius of M at x. Note that for hyperbolic space models
the injectivity radius of the tangent space at any point is infinite, thus
no restriction is needed.

angular cones in the poincar é ball. We are interested in


special types of cones in Dn that can extend in all space directions.
We want to avoid heavy cone intersections and to have capacity that
scales exponentially with the space dimension. To achieve this, we
want the definition of cones to exhibit the following four intuitive
properties detailed below. Subsequently, solely based on these necessary
conditions, we formally prove that the optimal cones in the Poincaré
ball have a closed form expression.
1) Axial symmetry. For any x ∈ Dn \ {0}, we require circular sym-
metry with respect to a central axis of the cone Sx . We define this axis
to be the spoke through x from x:

\[ A_x := \{x' \in \mathbb{D}^n : x' = \alpha x,\ \tfrac{1}{\|x\|} > \alpha \geq 1\} \tag{3.4} \]


Then, we fix any tangent vector with the same direction as x, e.g. x̄ = exp_x^{-1}(((1 + ‖x‖)/(2‖x‖)) x) ∈ T_x D^n. One can verify using corollary 2.2.1 that x̄ generates the axis-oriented geodesic as:
\[ A_x = \exp_x\!\left(\{y \in \mathbb{R}^n : y = \alpha\bar{x},\ \alpha > 0\}\right). \tag{3.5} \]

We next define the angle ∠(v, x̄) for any tangent vector v ∈ T_x D^n as in eq. (2.12). Then, the axial symmetry property is satisfied if we define the angular cone at x to have a non-negative aperture 2ψ(x) ≥ 0 as follows:
\[ \tilde{S}_x^{\psi(x)} := \{v \in T_x\mathbb{D}^n : \angle(v, \bar{x}) \leq \psi(x)\}, \qquad S_x^{\psi(x)} := \exp_x(\tilde{S}_x^{\psi(x)}). \tag{3.6} \]
We further define the conic border (face):
\[ \partial\tilde{S}_x^{\psi} := \{v : \angle(v, \bar{x}) = \psi(x)\}, \qquad \partial S_x^{\psi} := \exp_x(\partial\tilde{S}_x^{\psi}). \tag{3.7} \]
2) Rotation invariance. We want the definition of cones S_x^{ψ(x)} to be independent of the angular coordinate of the apex x, i.e. to only depend on the (Euclidean) norm of x:
\[ \psi(x) = \psi(x') \quad (\forall x, x' \in \mathbb{D}^n \setminus \{0\}\ \text{s.t. } \|x\| = \|x'\|). \tag{3.8} \]

This implies that there exists ψ̃ : (0, 1) → [0, π ) s. t. for all x ∈


Dn \ {0} we have ψ(x) = ψ̃(kxk).
3) Continuous cone aperture functions. We require the aperture ψ
of our cones to be a continuous function. Using eq. (3.8), it is equivalent
to the continuity of ψ̃. This requirement seems reasonable and will be
helpful in order to prove uniqueness of the optimal entailment cones.
When optimization-based training is employed, it is also necessary that
this function be differentiable. Surprisingly, we will show below that
the optimal functions ψ̃ are actually smooth, even when only requiring
continuity.
4) Transitivity of nested angular cones. We want cones to deter-
mine a partial order in the embedding space. The difficult property is
transitivity. We are interested in defining a cone width function ψ(x)
such that the resulting angular cones satisfy the transitivity property of
partial order relations, i.e. they form a nested structure as follows
\[ \forall x, x' \in \mathbb{D}^n \setminus \{0\} : \quad x' \in S_x^{\psi(x)} \implies S_{x'}^{\psi(x')} \subseteq S_x^{\psi(x)}. \tag{3.9} \]


closed form expression of the optimal ψ. We now analyze


the implications of the above necessary properties. Surprisingly, the
optimal form of the function ψ admits an interesting closed-form ex-
pression. We will see below that mathematically ψ cannot be defined
on the entire open ball Dn . Towards these goals, we first prove the
following.

Lemma 3.1. If transitivity holds, then
\[ \forall x \in \mathrm{Dom}(\psi) : \quad \psi(x) \leq \frac{\pi}{2}. \tag{3.10} \]
Proof. Assume the contrary and let x ∈ D^n \ {0} s.t. ψ(‖x‖) > π/2. We will show that transitivity implies that
\[ \forall x' \in \partial S_x^{\psi(x)} : \quad \psi(\|x'\|) \leq \frac{\pi}{2} \tag{3.11} \]
If the above is true, by moving x′ on any arbitrary (continuous) curve on the cone border ∂S_x^{ψ(x)} that ends in x, one will get a contradiction due to the continuity of ψ(‖·‖).
We now prove the remaining fact, namely eq. (3.11). Let any arbitrary x′ ∈ ∂S_x^{ψ(x)}. Also, let y ∈ ∂S_x^{ψ(x)} be any arbitrary point on the geodesic half-line connecting x with x′, starting from x′ (i.e. excluding the segment from x to x′). Moreover, let z be any arbitrary point on the spoke through x′ radiating from x′, namely z ∈ A_{x′} (notation from eq. (3.4)). Then, based on the properties of hyperbolic angles discussed before (based on eq. (2.12)), the angles ∠yx′z and ∠zx′x are well-defined 3.
From corollary 2.2.2 we know that the points 0, x, x′, y, z are coplanar. We denote this plane by P. Furthermore, the metric of the Poincaré

3 Abuse of notation.


ball is conformal with the Euclidean metric. Given these two facts, we
derive that

\[ \angle y x' z + \angle z x' x = \angle(y x' x) = \pi \tag{3.12} \]
thus
\[ \min(\angle y x' z, \angle z x' x) \leq \frac{\pi}{2} \tag{3.13} \]
It only remains to prove that
\[ \angle y x' z \geq \psi(x') \ \ \&\ \ \angle z x' x \geq \psi(x') \tag{3.14} \]
Indeed, assume w.l.o.g. that ∠yx′z < ψ(x′). Since ∠yx′z < ψ(x′), there exists a point t in the plane P such that
\[ \angle 0xt < \angle 0xy \ \ \&\ \ \psi(x') \geq \angle t x' z > \angle y x' z \tag{3.15} \]
Then, clearly, t ∈ S_{x′}^{ψ(x′)}, and also t ∉ S_x^{ψ(x)}, which contradicts the transitivity property in eq. (3.9).

Note that, so far we removed the origin 0 of Dn from our definitions.


However, the above surprising lemma implies that we cannot define
a useful cone at the origin. To see this, we first note that the origin
should “entail” the entire space Dn , i.e. S0 = Dn . Second, similar with
property 3, we desire the cone at 0 be a continuous deformation of the
cones of any sequence of points (xn )n≥0 in Dn \ {0} that converges to
0. Formally, limn→∞ Sxn = S0 when limn→∞ xn = 0. However, this is
impossible because lemma 3.1 implies that the cone at each point xn
can only cover at most half of Dn . We further prove the following:

Theorem 3.1. If transitivity holds, then the function


h : (0, 1) ∩ Dom(ψ̃) → R_+,  h(r) := (r / (1 − r²)) sin(ψ̃(r)),    (3.16)
is non-increasing.

Proof. We first need to prove the following fact:


Lemma 3.2. Transitivity implies that for all x ∈ D^n \ {0}, ∀x′ ∈ ∂S_x^{ψ(x)}:

sin(ψ(‖x′‖)) sinh(‖x′‖_D) ≤ sin(ψ(‖x‖)) sinh(‖x‖_D).    (3.17)


Proof. We will use the exact same figure and notations of points y, z as
in the proof of lemma 3.1. In addition, we assume w.l.o.g. that

∠yx′z ≤ π/2.    (3.18)

Further, let b ∈ ∂D^n be the intersection point of the spoke through x
with the border of D^n. Following the same argument as in the proof of
lemma 3.1, one proves eq. (3.14) which gives:

∠yx′z ≥ ψ(x′).    (3.19)

In addition, the angle at x′ between the geodesics xy and 0z can be
written in two ways:

∠0x′x = ∠yx′z.    (3.20)

Since x′ ∈ ∂S_x^{ψ(x)}, one proves

∠0xx′ = π − ∠x′xb = π − ψ(x).    (3.21)

We apply the hyperbolic law of sines (eq. (2.20)) in the hyperbolic triangle 0xx′:

sin(∠0xx′) / sinh(d_D(0, x′)) = sin(∠0x′x) / sinh(d_D(0, x)).    (3.22)

Putting together eqs. (3.18) to (3.22), and using the fact that sin(·) is an
increasing function on [0, π/2], we derive the conclusion of this helper
lemma.

We now return to the proof of our theorem. Consider any arbitrary
r, r′ ∈ (0, 1) ∩ Dom(ψ) with r < r′. Then, we claim that it is enough to
prove that

∃x ∈ D^n, x′ ∈ ∂S_x^{ψ(x)} s.t. ‖x‖ = r, ‖x′‖ = r′.    (3.23)

Indeed, if the above is true, then one can use the fact in eq. (2.5), i.e.

sinh(‖x‖_D) = sinh(ln((1 + r)/(1 − r))) = 2r/(1 − r²),    (3.24)

and apply lemma 3.2 to derive

h(r′) ≤ h(r),    (3.25)


Figure 3.2: Poincaré angular cones satisfying eq. (3.31) for K = 0.1. Left:
examples of cones for points with Euclidean norm varying
from 0.1 to 0.9. Right: transitivity for various points on the
border of their parent cones.

which is enough for proving the non-increasing property of function h.

We are only left to prove eq. (3.23). Let any arbitrary x ∈ D^n s.t.
‖x‖ = r. Also, consider any arbitrary geodesic γ_{x,v} : R_+ → ∂S_x^{ψ(x)} that
takes values on the cone border, i.e. ∠(v, x) = ψ(x). We know that

‖γ_{x,v}(0)‖ = ‖x‖ = r    (3.26)

and that this geodesic “ends” on the ball’s border ∂D^n, i.e.

‖ lim_{t→∞} γ_{x,v}(t)‖ = 1.    (3.27)

Thus, because the function ‖γ_{x,v}(·)‖ is continuous, we obtain that for
any r′ ∈ (r, 1) there exists a t₀ ∈ R_+ s.t. ‖γ_{x,v}(t₀)‖ = r′. By setting
x′ := γ_{x,v}(t₀) ∈ ∂S_x^{ψ(x)} we obtain the desired result.

The above theorem implies that a non-zero ψ̃ cannot be defined


on the entire (0, 1) because limr→0 h(r ) = 0, for any function ψ̃. As
a consequence, we are forced to restrict Dom(ψ̃) to some [e, 1), i.e.
to leave the open ball B n (0, e) outside of the domain of ψ. Then,
theorem 3.1 implies that
∀r ∈ [e, 1) : sin(ψ̃(r)) · r/(1 − r²) ≤ sin(ψ̃(e)) · e/(1 − e²).    (3.28)


Since we are interested in cones with an aperture as large as possible


(to maximize model capacity), it is natural to set all terms h(r ) equal to
K := h(e), i.e. to make h constant:
∀r ∈ [e, 1) : sin(ψ̃(r)) · r/(1 − r²) = K,    (3.29)

which gives both a restriction on e (in terms of K):

K ≤ e/(1 − e²)  ⇐⇒  e ∈ [ 2K/(1 + √(1 + 4K²)), 1 ),    (3.30)

as well as a closed form expression for ψ:

ψ : D^n \ B^n(0, e) → (0, π/2),
x ↦ arcsin( K (1 − ‖x‖²)/‖x‖ ),    (3.31)
which is also a sufficient condition for transitivity to hold:

Theorem 3.2. If ψ is defined as in eqs. (3.30) and (3.31), then transitivity


holds.

The above theorem has a proof similar to that of theorem 3.1.


So far, we have obtained a closed form expression for hyperbolic
entailment cones. However, we still need to understand how they
can be used during embedding learning. For this goal, we derive an
equivalent (and more practical) definition of the cone S_x^{ψ(x)}:

Theorem 3.3. For any x, y ∈ Dn \ B n (0, e), we denote the angle between
the half-lines (xy and (0x as

Ξ(x, y) := π − ∠0xy.    (3.32)

Then, this angle equals

arccos( (⟨x, y⟩(1 + ‖x‖²) − ‖x‖²(1 + ‖y‖²)) / (‖x‖ · ‖x − y‖ · √(1 + ‖x‖²‖y‖² − 2⟨x, y⟩)) ).    (3.33)

Moreover, we have the following equivalent expression of the Poincaré entail-
ment cones satisfying eq. (3.31):

S_x^{ψ(x)} = { y ∈ D^n | Ξ(x, y) ≤ arcsin( K (1 − ‖x‖²)/‖x‖ ) }.    (3.34)


Proof. For any y ∈ S_x^{ψ(x)}, the axial symmetry property implies that
π − ∠0xy ≤ ψ(x). Applying the hyperbolic cosine law in the triangle
0xy and writing the above angle inequality in terms of the cosines of
the two angles, one gets

cos ∠0xy = ( −cosh(‖y‖_D) + cosh(‖x‖_D) cosh(d_D(x, y)) ) / ( sinh(‖x‖_D) sinh(d_D(x, y)) ).    (3.35)

Equation (3.33) is then derived from the above by an algebraic reformulation.

Examples of 2-dimensional Poincaré cones corresponding to apex


points located at different radii from the origin are shown in fig. 3.2.
This figure also shows that transitivity is satisfied for some points on
the border of the hypercones.

euclidean entailment cones. One can easily adapt the above


proofs to derive entailment cones in the Euclidean space (Rn , g E ). The
only adaptations are: i) replace the hyperbolic cosine law by the usual
Euclidean cosine law, ii) geodesics are straight lines, and iii) the expo-
nential map is given by exp_x(v) = x + v. Thus, one derives in a similar
way that h(r) = r sin(ψ(r)) is non-increasing, that the optimal values of ψ
are obtained for constant h equal to K ≤ ε, and that

S_x^{ψ(x)} = {y ∈ R^n | Ξ(x, y) ≤ ψ(x)},    (3.36)

where Ξ(x, y) now becomes

Ξ(x, y) = arccos( (‖y‖² − ‖x‖² − ‖x − y‖²) / (2‖x‖ · ‖x − y‖) ),    (3.37)

for all x, y ∈ Rn \ B(0, ε). From a learning perspective, there is no


need to be concerned about the Riemannian optimization described in
section 3.3, as the usual Euclidean gradient-step is used in this case.
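For concreteness, the Euclidean cone test can be written in a few lines. The following NumPy sketch implements eqs. (3.36) and (3.37) with the constant-h aperture ψ(x) = arcsin(K/‖x‖) derived above; it is only an illustration (the values of K and ε below are illustrative), not the released implementation.

```python
import numpy as np

K, EPS = 0.1, 0.1  # illustrative cone constant and excluded-ball radius, with K <= EPS

def xi_euclidean(x, y):
    """Angle Xi(x, y) between the half-lines (xy and (0x, eq. (3.37)."""
    nx, nxy = np.linalg.norm(x), np.linalg.norm(x - y)
    cos_xi = (np.dot(y, y) - np.dot(x, x) - nxy ** 2) / (2.0 * nx * nxy)
    return np.arccos(np.clip(cos_xi, -1.0, 1.0))

def psi_euclidean(x):
    """Aperture of the Euclidean cone at apex x, from the constant h(r) = r sin(psi(r)) = K."""
    return np.arcsin(min(1.0, K / np.linalg.norm(x)))

def in_euclidean_cone(x, y):
    """Membership test of eq. (3.36), valid for apex points with norm at least EPS."""
    return xi_euclidean(x, y) <= psi_euclidean(x)
```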

3.3 learning with entailment cones

We now describe how embedding learning is performed.


Max-margin training on angles

We learn hierarchical word embeddings from a dataset X of entail-


ment relations (u, v) ∈ X , also called hypernym links, defining that u
entails v, or, equivalently, that v is a subconcept of u 4 .
We choose to model the entailment relation (u, v) in the embedding
space as v belonging to the entailment cone S_u^{ψ(u)}.
Our model is trained with a max-margin loss function similar to the
one in [Ven+15]:

L = Σ_{(u,v)∈P} E(u, v) + Σ_{(u′,v′)∈N} max(0, γ − E(u′, v′)),    (3.38)

for some margin γ > 0, where P and N define samples of positive and
negative edges respectively. The energy E(u, v) measures the penalty
of a wrongly classified pair (u, v), which in our case measures how far
point v is from belonging to S_u^{ψ(u)}, expressed as the smallest angle of a
rotation with center u bringing v into S_u^{ψ(u)}:

E(u, v) := max(0, Ξ(u, v) − ψ(u)),    (3.39)

where Ξ(u, v) is defined in eqs. (3.33) and (3.37). Note that [Ven+15]
use ‖max(0, v − u)‖². This loss function encourages positive samples
to satisfy E(u, v ) = 0 and negative ones to satisfy E(u, v ) ≥ γ. The
same loss is used both in the hyperbolic and Euclidean cases.
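For illustration, the following NumPy sketch assembles eqs. (3.31), (3.33), (3.38) and (3.39) into the hyperbolic training objective; it is a minimal reference sketch rather than the released code, with K = 0.1 as in fig. 3.2 and γ = 0.01 as in our training setup.

```python
import numpy as np

K, GAMMA = 0.1, 0.01  # K as in fig. 3.2, margin gamma as in our training setup

def psi(x):
    """Hyperbolic cone aperture of eq. (3.31); assumes ||x|| >= epsilon."""
    nx = np.linalg.norm(x)
    return np.arcsin(np.clip(K * (1.0 - nx ** 2) / nx, 0.0, 1.0))

def xi(x, y):
    """Angle Xi(x, y) of eq. (3.33) in the Poincare ball."""
    nx, ny, nxy = np.linalg.norm(x), np.linalg.norm(y), np.linalg.norm(x - y)
    dot = np.dot(x, y)
    num = dot * (1.0 + nx ** 2) - nx ** 2 * (1.0 + ny ** 2)
    den = nx * nxy * np.sqrt(1.0 + nx ** 2 * ny ** 2 - 2.0 * dot)
    return np.arccos(np.clip(num / den, -1.0, 1.0))

def energy(u, v):
    """Penalty of eq. (3.39): smallest rotation angle around u bringing v into the cone."""
    return max(0.0, xi(u, v) - psi(u))

def max_margin_loss(pos_pairs, neg_pairs, gamma=GAMMA):
    """Max-margin loss of eq. (3.38) over positive pairs P and negative pairs N."""
    pos = sum(energy(u, v) for u, v in pos_pairs)
    neg = sum(max(0.0, gamma - energy(u, v)) for u, v in neg_pairs)
    return pos + neg
```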

Full Riemannian optimization

As the parameters of the model live in the hyperbolic space, the


back-propagated gradient is a Riemannian gradient. Indeed, if u is in
the Poincaré ball, and if we compute the usual (Euclidean) gradient
∇u L of our loss, then

u ← u − η ∇u L (3.40)

makes no sense as an operation in the Poincaré ball, since the sub-


traction operation is not defined in this manifold. Instead, one should

4 We prefer this notation over the one in [NK17]


compute the Riemannian gradient ∇_u^R L indicating a direction in the tan-
gent space T_u D^n, and should move u along the corresponding geodesic
in D^n [Bon13]:

u ← exp_u(−η ∇_u^R L),    (3.41)

where the Riemannian gradient is obtained by rescaling the Euclidean
gradient by the inverse of the metric tensor. As our metric is conformal,
i.e. g_D = λ² g_E where g_E = I is the Euclidean metric (see eq. (2.3)), this
leads to a simple formulation

∇_u^R L = (1/λ_u)² ∇_u L.    (3.42)

Previous work [NK17] optimizing word embeddings in the Poincaré ball


used the retraction map Rx (v ) := x + v as a first order approximation
of expx (v ). Note that since we derived a closed-form expression of the
exponential map in the Poincaré ball (corollary 2.2.1), we are able to
perform full Riemannian optimization in this model of the hyperbolic
space.
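A minimal NumPy sketch of one such update for a single embedding and c = 1 follows, with the exponential map written via Möbius addition; the retraction branch corresponds to the first-order approximation of [NK17], and the learning rate and ball-projection constant are illustrative.

```python
import numpy as np

def mobius_add(x, y):
    """Mobius addition in the Poincare ball (c = 1)."""
    xy, nx2, ny2 = np.dot(x, y), np.dot(x, x), np.dot(y, y)
    return ((1 + 2 * xy + ny2) * x + (1 - nx2) * y) / (1 + 2 * xy + nx2 * ny2)

def exp_map(x, v):
    """Exponential map exp_x(v) in the Poincare ball (c = 1)."""
    nv = np.linalg.norm(v)
    if nv < 1e-15:
        return x
    lam = 2.0 / (1.0 - np.dot(x, x))  # conformal factor lambda_x
    return mobius_add(x, np.tanh(lam * nv / 2.0) * v / nv)

def riemannian_sgd_step(u, eucl_grad, lr=1e-4, use_retraction=True):
    """One step of eq. (3.41), with the gradient rescaling of eq. (3.42)."""
    riem_grad = ((1.0 - np.dot(u, u)) ** 2 / 4.0) * eucl_grad  # (1/lambda_u)^2 * grad
    new_u = u - lr * riem_grad if use_retraction else exp_map(u, -lr * riem_grad)
    n = np.linalg.norm(new_u)  # keep the point strictly inside the unit ball
    return new_u if n < 1.0 - 1e-5 else new_u * (1.0 - 1e-5) / n
```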

3.4 experiments

We evaluate the representational and generalization power of hy-


perbolic entailment cones and of other baselines using data that ex-
hibits a latent hierarchical structure. We follow previous work [NK17];
[Ven+15] and use the full transitive closure of the WordNet noun hi-
erarchy [Mil+90]. Our binary classification task is link prediction for
unseen edges in this directed acyclic graph.

dataset splitting. train and evaluation settings. We re-


move the tree root since it carries little information and only has trivial
edges to predict. Note that this implies that we co-embed the resulting
subgraphs together to prevent overlapping embeddings (see smaller
examples in fig. 3.3). The remaining WordNet dataset contains 82,114
nodes and 661,127 edges in the full transitive closure. We split it into
train - validation - test sets as follows. We first compute the transitive
reduction 5 of this directed acyclic graph, i.e. “basic” edges that form the
minimal edge set for which the original transitive closure can be fully
recovered. These edges are hard to predict, so we will always include
5 https://en.wikipedia.org/wiki/Transitive_reduction


Figure 3.3: Two dimensional embeddings of two datasets: a toy uniform


tree of depth 7 and branching factor 3, with root removed
(left); the mammal subtree of WordNet with 4230 relations,
1165 nodes and top 2 nodes removed (right). [NK17] (each
left side) has most of the nodes and edges collapsed on the
space border, while our hyperbolic cones (each right side)
nicely reveal the data structure.


them in the training set. The remaining “non-basic” edges (578,477) are
split into validation (5%), test (5%) and train (fraction of the rest).
We augment both the validation and the test parts with sets of nega-
tive pairs as follows: for each true (positive) edge (u, v), we randomly
sample five (u0 , v) and five (u, v0 ) negative corrupted pairs that are not
edges in the full transitive closure. These are then added to the respec-
tive negative set. Thus, ten times as many negative pairs as positive
pairs are used. They are used to compute standard classification metrics
associated with these datasets: precision, recall, F1. For the training set,
negative pairs are dynamically generated as explained below.
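A small sketch of this corruption procedure, assuming the transitive closure is given as a set of ordered pairs (all names are illustrative):

```python
import random

def corrupt_pairs(pos_edges, nodes, closure, k=5, seed=0):
    """For each positive edge (u, v), sample k corrupted pairs (u', v) and k pairs (u, v')
    that are not edges of the full transitive closure (given as a set of tuples)."""
    rng = random.Random(seed)
    negatives = []
    for u, v in pos_edges:
        for corrupt_head in (True, False):
            sampled = 0
            while sampled < k:
                w = rng.choice(nodes)
                pair = (w, v) if corrupt_head else (u, w)
                if pair not in closure and pair[0] != pair[1]:
                    negatives.append(pair)
                    sampled += 1
    return negatives
```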
We make the task harder in order to understand the generalization
ability of various models when differing amounts of transitive closure
edges are available during training. We generate four training sets
that include 0%, 10%, 25%, or 50% of the non-basic edges, selected
randomly. We then train separate models using each of these four sets
after being augmented with the basic edges.

baselines. We compare against the strong hierarchical embedding


methods of Order embeddings [Ven+15] and Poincaré embeddings [NK17].
Additionally, we also use Simple Euclidean embeddings, i.e. the Euclidean
version of the method presented in [NK17] (one of their baselines).
We note that Poincaré and Simple Euclidean embeddings were trained
using a symmetric distance function, and thus cannot be directly used
to evaluate antisymmetric entailment relations. Thus, for these baselines
we use the heuristic scoring function proposed in [NK17]:

score(u, v) = (1 + α(‖u‖ − ‖v‖)) d(u, v)    (3.43)

and tune the parameter α on the validation set. For all the other methods
(our proposed cones and order embeddings), we use the energy penalty
E(u, v ), e.g. eq. (3.39) for hyperbolic cones. This scoring function is then
used at test time for binary classification as follows: if it is lower than a
threshold, we predict an edge; otherwise, we predict a non-edge. The
optimal threshold is chosen to achieve maximum F1 on the validation
set by passing over the sorted array of scores of positive and negative
validation pairs.
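The scoring heuristic of eq. (3.43) and the threshold selection can be sketched as follows; this is a minimal NumPy illustration in which lower scores indicate edges, and the distance function and α are supplied by the corresponding baseline.

```python
import numpy as np

def score(u, v, dist, alpha):
    """Heuristic score of eq. (3.43); lower values indicate a more likely edge."""
    return (1.0 + alpha * (np.linalg.norm(u) - np.linalg.norm(v))) * dist(u, v)

def best_threshold(scores, labels):
    """Sweep the sorted validation scores and keep the threshold with maximal F1.
    labels: 1 for positive (true edge) pairs, 0 for corrupted pairs."""
    order = np.argsort(scores)
    scores, labels = np.asarray(scores)[order], np.asarray(labels)[order]
    total_pos = labels.sum()
    best_f1, best_t, tp, fp = 0.0, scores[0], 0, 0
    for s, y in zip(scores, labels):   # predict "edge" for every score <= s
        tp, fp = tp + y, fp + (1 - y)
        f1 = 2.0 * tp / (2.0 * tp + fp + (total_pos - tp))
        if f1 > best_f1:
            best_f1, best_t = f1, s
    return best_t, best_f1
```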

training details. For all methods except Order embeddings, we


observe that initialization is very important. Being able to properly
disentangle embeddings from different subparts of the graph in the


initial learning stage is essential in order to train qualitative models. We


conjecture that initialization is hard because these models are trained
to minimize highly non-convex loss functions. In practice, we obtain
our best results when initializing the embeddings corresponding to
the hyperbolic cones using the Poincaré embeddings pre-trained for
100 epochs. The embeddings for the Euclidean cones are initialized
using Simple Euclidean embeddings pre-trained also for 100 epochs.
For the Simple Euclidean embeddings and Poincaré embeddings, we
find the burn-in strategy of [NK17] to be essential for a good initial
disentanglement. We also observe that the Poincaré embeddings are
heavily collapsed to the unit ball border (as also pictured in fig. 3.3)
and so we rescale them by a factor of 0.7 before starting the training of
the hyperbolic cones.
Each model is trained for 200 epochs after the initialization stage,
except for order embeddings which were trained for 500 epochs. During
training, 10 negative edges are generated per positive edge by randomly
corrupting one of its end points. We use batch size of 10 for all models.
For both cone models we use a margin of γ = 0.01.
All Euclidean models and baselines are trained using stochastic
gradient descent. For the hyperbolic models, we do not find significant
empirical improvements when using full Riemannian optimization
instead of approximating it with a retraction map as done in [NK17].
We thus use the retraction approximation since it is faster. For the
cone models, we always project outside of the e ball centered on the
origin during learning as constrained by eq. (3.31) and its Euclidean
version. For both we use e = 0.1. A learning rate of 1e-4 is used for
both Euclidean and hyperbolic cone models.

results and discussion. Tables 3.1 and 3.2 show the obtained
results. For a fair comparison, we use models with the same number
of dimensions. We focus on the low dimensional setting (5 and 10
dimensions) which is more informative. It can be seen that our hyper-
bolic cones are better than all the baselines in all settings, except in
the 0% setting for which order embeddings are better. However, once
a small percentage of the transitive closure edges becomes available
during training, we observe significant improvements of our method,
sometimes by more than 8% F1 score. Moreover, hyperbolic cones
have the largest growth when transitive closure edges are added at


Embedding Dimension = 5      Percentage of Transitive Closure Non-basic Edges used for Training
                               0%       10%      25%      50%
Simple Euclidean Emb         26.8%    71.3%    73.8%    72.8%
Poincaré Emb                 29.4%    70.2%    78.2%    83.6%
Order Emb                    34.4%    70.2%    75.9%    81.7%
Our Euclidean Cones          28.5%    69.7%    75.0%    77.4%
Our Hyperbolic Cones         29.2%    80.1%    86.0%    92.8%

Table 3.1: Test F1 results for various models for embedding dimension-
ality equal to 5. Simple Euclidean Emb and Poincaré Emb are
the Euclidean and hyperbolic methods proposed by [NK17],
Order Emb is proposed by [Ven+15].

Embedding Dimension = 10     Percentage of Transitive Closure Non-basic Edges used for Training
                               0%       10%      25%      50%
Simple Euclidean Emb         29.4%    75.4%    78.4%    78.1%
Poincaré Emb                 28.9%    71.4%    82.0%    85.3%
Order Emb                    43.0%    69.7%    79.4%    84.1%
Our Euclidean Cones          31.3%    81.5%    84.5%    81.6%
Our Hyperbolic Cones         32.2%    85.9%    91.0%    94.4%

Table 3.2: Test F1 results for the same methods as in table 3.1, but for
embedding dimensionality equal to 10.

train time. We further note that, while mathematically not justified 6 ,


if embeddings of our proposed Euclidean cones model are initialized
with the Poincaré embeddings instead of the Simple Euclidean ones,
then they perform on par with the hyperbolic cones.

3.5 summary

Learning meaningful graph embeddings is relevant for many impor-


tant ML applications. Hyperbolic geometry has proven to be powerful

6 Indeed, mathematically, hyperbolic embeddings cannot be considered as Euclidean points.


for embedding hierarchical structures, but it was lacking a good ap-
proach to deal with DAG-like data such as word hypernymy datasets
(e.g. the WordNet noun hierarchy [Mil+90]). We here take one step
forward towards this goal and propose a novel model based on geodesi-
cally convex entailment cones and show its theoretical and practical
benefits. We empirically discover that strong embedding methods can
vary a lot with the percentage of the taxonomy observable during
training and demonstrate that our proposed method benefits the most
from increasing size of the training data. As future work, we aim to
understand if the proposed entailment cones can be used to embed
more complex data such as sentences or images.
Our code is publicly available 7 .

7 https://github.com/dalab/hyperbolic_cones.

HYPERBOLIC NEURAL NETWORKS
4
Hyperbolic spaces have recently gained momentum in the context of
machine learning due to their high capacity and tree-likeliness prop-
erties. However, the representational power of hyperbolic geometry is
not yet on par with Euclidean geometry, mostly because of the absence
of corresponding hyperbolic neural network layers. This makes it hard
to use hyperbolic embeddings in downstream tasks. In this chapter, we
bridge this gap in a principled manner by combining the formalism
of Möbius gyrovector spaces with the Riemannian geometry of the
Poincaré model of hyperbolic spaces. As a result, we derive hyper-
bolic versions of important deep learning tools – multinomial logistic
regression MLR, feed-forward FFNN and recurrent neural networks RNN
such as gated recurrent units GRU – and prove some of their interesting
properties. This further allows to embed sequential data and perform
classification in the hyperbolic space. Empirically, we show the benefit
of using our hyperbolic models compared to their Euclidean variants
on word embedding classification, textual entailment and noisy-prefix
recognition tasks.

The material presented here has in parts been published in the publi-
cation [GBH18a].

4.1 introduction

As highlighted in previous chapters, due to the negative curvature


and exponentially growing volume, hyperbolic spaces are mathemati-
cally suitable to (e-)isometrically embed hierarchical structures or scale-
free networks with heavy tailed degree distributions that are ubiquitous
among real-world graphs (e.g. social, text or biological networks). Em-
pirically, various recent ML works [De +18a]; [NK17]; [NK18]; [Wil+14]
have confirmed this, by learning hyperbolic representations that greatly
outperform Euclidean embeddings for hierarchical, taxonomic or en-
tailment data.


However, Euclidean embeddings are so popular due to their usage


as input to deep neural networks which allows capturing non-linear
interactions and, thus, target complex downstream tasks. To match
this desirable property, appropriate deep learning tools are needed to
embed feature data in the hyperbolic space and use it in downstream
tasks. For example, sequence data implicitly hierarchical (e.g. textual
entailment data, phylogenetic trees of DNA sequences or hierarchical
captions of images) would benefit from suitable hyperbolic RNNs.
How should one generalize deep neural models to hyperbolic spaces? The
main contribution of this chapter is to bridge the gap between hy-
perbolic and Euclidean geometry in the context of neural networks
and deep learning by generalizing in a principled manner both the
atomic operations of neural networks (e.g. matrix-vector multiplica-
tion), as well as multinomial logistic regression MLR, feed-forward FFNN,
simple and gated GRU recurrent neural networks RNN to the Poincaré
model of the hyperbolic geometry. We connect the theory of gyrovector
spaces and generalized Möbius transformations introduced by [Alb08];
[Ung08] with the Riemannian geometry properties of the manifold. We
smoothly parametrize basic operations and objects in all spaces of con-
stant negative curvature using a unified framework that depends only
on the curvature value. Thus, we show how Euclidean and hyperbolic
spaces can be continuously deformed into each other. On a series of
experiments and datasets we showcase the effectiveness of our hyper-
bolic neural network layers compared to their classic Euclidean variants
on word embedding classification, textual entailment and noisy-prefix
recognition tasks.
Neural networks can be seen as being made of compositions of
basic operations, such as linear maps, bias translations, pointwise non-
linearities and a final sigmoid or softmax layer. We first explain how to
construct a softmax layer for logits lying in the Poincaré ball. Then, we
explain how to transform a mapping between two Euclidean spaces into
one between Poincaré balls, yielding matrix-vector multiplication and
pointwise non-linearities in the Poincaré ball. Finally, we present possi-
ble adaptations of various recurrent neural networks to the hyperbolic
domain.


4.2 hyperbolic multiclass logistic regression

In order to perform multi-class classification on the Poincaré ball,


one needs to generalize multinomial logistic regression (MLR) − also
called softmax regression − to the Poincaré ball.

reformulating euclidean mlr. Let’s first reformulate Euclidean


MLR from the perspective of distances to margin hyperplanes, as in
[LL04, Section 5]. This will allow us to easily generalize it.
Given K classes, one learns a margin hyperplane for each such class
k ∈ {1, ..., K} using softmax probabilities:

p(y = k | x) ∝ exp(⟨a_k, x⟩ − b_k),    (4.1)

where b_k ∈ R is a class bias, a_k ∈ R^n is a class weight vector and x ∈ R^n
is an embedding vector of the point to be classified. Note that any affine
hyperplane in R^n can be written with a normal vector a ∈ R^n \ {0}
and a scalar shift b ∈ R:

H_{a,b} = {x ∈ R^n : ⟨a, x⟩ − b = 0}.    (4.2)

As in [LL04, Section 5], we note that

⟨a, x⟩ − b = sign(⟨a, x⟩ − b) ‖a‖ d(x, H_{a,b}).    (4.3)

Using eq. (4.1), we get for b_k ∈ R, x, a_k ∈ R^n:

p(y = k | x) ∝ exp(sign(⟨a_k, x⟩ − b_k) ‖a_k‖ d(x, H_{a_k,b_k})).    (4.4)

As it is not immediately obvious how to generalize the Euclidean
hyperplane of eq. (4.2) to other spaces such as the Poincaré ball, we
reformulate it as follows:

H̃_{a,p} = {x ∈ R^n : ⟨−p + x, a⟩ = 0} = p + {a}^⊥,    (4.5)

where p ∈ R^n, a ∈ R^n \ {0}. This new definition relates to the previous
one as H̃_{a,p} = H_{a,⟨a,p⟩}. Rewriting eq. (4.4) with b = ⟨a, p⟩:

p(y = k | x) ∝ exp(sign(⟨−p_k + x, a_k⟩) ‖a_k‖ d(x, H̃_{a_k,p_k})),    (4.6)

where pk , x, ak ∈ Rn . It is now natural to adapt the previous definition


to the hyperbolic setting by replacing + by ⊕c :


Definition 4.1 (Poincaré hyperplanes). For p ∈ Dnc , a ∈ Tp Dnc \ {0},


we define

{a}^⊥ := {z ∈ T_p D^n_c : g^c_p(z, a) = 0} = {z ∈ T_p D^n_c : ⟨z, a⟩ = 0}.    (4.7)

Then, we define¹ Poincaré hyperplanes as

H̃^c_{a,p} := {x ∈ D^n_c : ⟨log^c_p(x), a⟩_p = 0} = exp^c_p({a}^⊥).    (4.8)

An alternative definition of the Poincaré hyperplanes is

Lemma 4.1.

H̃^c_{a,p} = {x ∈ D^n_c : ⟨−p ⊕_c x, a⟩ = 0}.    (4.9)

Proof. We give a two-step proof:

i) exp^c_p({a}^⊥) ⊆ {x ∈ D^n_c : ⟨−p ⊕_c x, a⟩ = 0}:
Let z ∈ {a}^⊥. From eq. (2.25), we have that

exp^c_p(z) = p ⊕_c βz, for some β ∈ R.    (4.10)

This, together with the left-cancellation law in gyrospaces (see sec-
tion 2.2.1), implies that

⟨−p ⊕_c exp^c_p(z), a⟩ = ⟨βz, a⟩ = 0,    (4.11)

which is what we wanted.

ii) {x ∈ D^n_c : ⟨−p ⊕_c x, a⟩ = 0} ⊆ exp^c_p({a}^⊥):

Let x ∈ D^n_c s.t. ⟨−p ⊕_c x, a⟩ = 0. Then, using eq. (2.25), we derive
that:

log^c_p(x) = β(−p ⊕_c x), for some β ∈ R,    (4.12)

which is orthogonal to a, by assumption. This implies log^c_p(x) ∈ {a}^⊥,
hence x ∈ exp^c_p({a}^⊥).
Turning back, H̃^c_{a,p} can also be described as the union of images of
all geodesics in D^n_c orthogonal to a and containing p. Notice that our
definition matches that of hypergyroplanes, see [Ung14, definition 5.8]. A
3D hyperplane example is depicted in fig. 4.1.

1 Where ⟨·, ·⟩ denotes the (Euclidean) inner-product of the ambient space.


Figure 4.1: An example of a hyperbolic hyperplane in D31 plotted using


sampling. The red point is p. The shown normal axis to the
hyperplane through p is parallel to a.

Theorem 4.2.
d_c(x, H̃^c_{a,p}) := inf_{w ∈ H̃^c_{a,p}} d_c(x, w)
               = (1/√c) sinh⁻¹( 2√c |⟨−p ⊕_c x, a⟩| / ((1 − c‖−p ⊕_c x‖²) ‖a‖) ).    (4.13)

Proof. We first need to prove the following lemma, trivial in the Eu-
clidean space, but not in the Poincaré ball:
Lemma 4.2. (Orthogonal projection on a geodesic) Any point in the Poincaré
ball has a unique orthogonal projection on any given geodesic that does not
pass through the point. Formally, for all y ∈ D^n_c and for all geodesics γ_{x→z}(·)
such that y ∉ Im γ_{x→z}, there exists a unique w ∈ Im γ_{x→z} such that
∠(γ_{w→y}, γ_{x→z}) = π/2.

Proof. We first note that any geodesic in Dnc has the form γ(t) =
u ⊕c v ⊗c t as given by eq. (2.23), and has two “points at infinity” lying
on the ball border (v ≠ 0):

γ(±∞) = u ⊕_c (±v / (√c ‖v‖)) ∈ ∂D^n_c.    (4.14)


Using the notations in the lemma statement, the closed-form of γx→z


is given by eq. (2.22):

γx→z (t) = x ⊕c (−x ⊕c z) ⊗c t

We denote by x0 , z0 ∈ ∂Dnc its points at infinity as described by eq. (4.14).


Then, the hyperbolic angle ∠ywx0 is well defined from eq. (2.19):

cos(∠(γ_{w→y}, γ_{x→z})) = cos(∠ywz0)
                        = ⟨−w ⊕_c y, −w ⊕_c z0⟩ / (‖−w ⊕_c y‖ · ‖−w ⊕_c z0‖).    (4.15)
We now perform 2 steps for this proof.
i) Existence of w:
The angle function from eq. (4.15) is continuous w.r.t t when w =
γx→z (t). So we first prove existence of an angle of π/2 by continuously
moving w from x0 to z0 when t goes from −∞ to ∞, and observing that
cos(∠ywz0) goes from 1 to −1 as follows:

cos(∠yx0z0) = 1  &  lim_{w→z0} cos(∠ywz0) = −1.    (4.16)

The left part of eq. (4.16) follows from eq. (4.15) and from the fact (easy
to show from the definition of ⊕_c) that a ⊕_c b = a when ‖a‖ = 1/√c
(which is the case of x0). The right part of eq. (4.16) follows from the
fact that ∠ywz0 = π − ∠ywx0 (from the conformal property, or from
eq. (2.19)) and cos(∠yz0x0) = 1 (proved as above).
Hence cos(∠ywz0) has to pass through 0 when going from 1 to −1,
which achieves the proof of existence.
ii) Uniqueness of w:
Assume by contradiction that there are two points w and w′ on γ_{x→z} that
form angles ∠ywx0 and ∠yw′x0 of π/2. Since w, w′, x0 are on the same
geodesic, we have

π/2 = ∠yw′x0 = ∠yw′w = ∠ywx0 = ∠yww′.    (4.17)

So the triangle ∆yww′ has two right angles, but in the Poincaré ball
this is impossible.

Now, we need two more lemmas:


Lemma 4.3. (Minimizing distance from point to geodesic) The orthogonal
projection of a point to a geodesic (not passing through the point) is minimizing
the distance between the point and the geodesic.


Proof. The proof is similar to the Euclidean case and is based on the
hyperbolic sine law and the fact that in any right hyperbolic triangle
the hypotenuse is strictly longer than any of the other sides.
Lemma 4.4. (Geodesics through p) Let H̃^c_{a,p} be a Poincaré hyperplane. Then,
for any w ∈ H̃^c_{a,p} \ {p}, all points on the geodesic γ_{p→w} are included in
H̃^c_{a,p}.

Proof. γp→w (t) = p ⊕c (−p ⊕c w) ⊗c t. Then, it is easy to check the


condition in eq. (4.8):

⟨−p ⊕_c γ_{p→w}(t), a⟩ = ⟨(−p ⊕_c w) ⊗_c t, a⟩ ∝ ⟨−p ⊕_c w, a⟩ = 0.    (4.18)

We now turn back to our proof. Let x ∈ D^n_c be an arbitrary point and
H̃^c_{a,p} a Poincaré hyperplane. We prove that there is at least one point
w* ∈ H̃^c_{a,p} that achieves the infimum distance

d_c(x, w*) = inf_{w ∈ H̃^c_{a,p}} d_c(x, w),    (4.19)

and, moreover, that this distance is the same as the one in the theorem’s
statement.
We first note that for any point w ∈ H̃^c_{a,p}, if ∠xwp ≠ π/2, then
w ≠ w*. Indeed, using lemmas 4.3 and 4.4, it is obvious that the
projection of x to γ_{p→w} will give a strictly lower distance.
Thus, we only consider w ∈ H̃^c_{a,p} such that ∠xwp = π/2. Applying
the hyperbolic sine law in the right triangle ∆xwp, one gets:

d_c(x, w) = (1/√c) sinh⁻¹( sinh(√c d_c(x, p)) · sin(∠xpw) ).    (4.20)


One of the above quantities does not depend on w:

sinh(√c d_c(x, p)) = sinh(2 tanh⁻¹(√c ‖−p ⊕_c x‖))
                   = 2√c ‖−p ⊕_c x‖ / (1 − c‖−p ⊕_c x‖²).    (4.21)

The other quantity is sin(∠xpw) which is minimized when the angle


∠xpw is minimized (because ∠xpw < π/2 for the hyperbolic right


triangle ∆xwp), or, alternatively, when cos(∠xpw) is maximized. But,


we already have from eq. (2.19) that:

cos(∠xpw) = ⟨−p ⊕_c x, −p ⊕_c w⟩ / (‖−p ⊕_c x‖ · ‖−p ⊕_c w‖).    (4.22)
To maximize the above, the constraint on the right angle at w can be
dropped because cos(∠xpw) depends only on the geodesic γp→w and
not on w itself, and because there is always an orthogonal projection
from any point x to any geodesic as stated by lemma 4.2. Thus, it
remains to find the maximum of eq. (4.22) when w ∈ H̃^c_{a,p}. Using the
definition of H̃^c_{a,p} from eq. (4.8), one can easily prove that

{log^c_p(w) : w ∈ H̃^c_{a,p}} = {a}^⊥.    (4.23)

Using that fact that logcp (w)/k logcp (w)k = −p ⊕c w/k − p ⊕c wk, we
just have to find

max_{z ∈ {a}^⊥} ( ⟨−p ⊕_c x, z⟩ / (‖−p ⊕_c x‖ · ‖z‖) ),    (4.24)

and we are left with a well known Euclidean problem which is equiva-
lent to finding the minimum angle between the vector −p ⊕c x (viewed
as Euclidean) and the hyperplane {a}⊥ . This angle is given by the
Euclidean orthogonal projection whose sin value is the distance from
the vector’s endpoint to the hyperplane divided by the vector’s length:

sin(∠xpw) = |⟨−p ⊕_c x, a/‖a‖⟩| / ‖−p ⊕_c x‖.    (4.25)

It follows that a point w* ∈ H̃^c_{a,p} satisfying eq. (4.25) exists (but might
not be unique). Combining eqs. (4.19) to (4.21) and (4.25), one concludes
not be unique). Combining eqs. (4.19) to (4.21) and (4.25), one concludes
the proof.

These results are the last missing pieces to conclude the proof of
theorem 4.2.

final formula for mlr in the poincaré ball. Putting to-


gether eq. (4.6) and theorem 4.2, we get the hyperbolic MLR formulation.


Given K classes and k ∈ {1, . . . , K}, the separation hyperplanes are
determined by p_k ∈ D^n_c, a_k ∈ T_{p_k} D^n_c \ {0} and given for all x ∈ D^n_c by:

p(y = k | x) ∝ exp( sign(⟨−p_k ⊕_c x, a_k⟩) √(g^c_{p_k}(a_k, a_k)) d_c(x, H̃^c_{a_k,p_k}) ),    (4.26)

or, equivalently,

p(y = k | x) ∝ exp( (λ^c_{p_k} ‖a_k‖ / √c) sinh⁻¹( 2√c ⟨−p_k ⊕_c x, a_k⟩ / ((1 − c‖−p_k ⊕_c x‖²) ‖a_k‖) ) ).    (4.27)

Notice that when c goes to zero, the above formula becomes

p(y = k | x) ∝ exp(4⟨−p_k + x, a_k⟩) = exp((λ^0_{p_k})² ⟨−p_k + x, a_k⟩),    (4.28)

recovering the usual Euclidean softmax.


However, it is unclear how to perform optimization over ak , since
these vectors live in Tpk Dnc and, hence, depend on pk . The solution is
that one should write

a_k = P^c_{0→p_k}(a′_k) = (λ^c_0 / λ^c_{p_k}) a′_k,    (4.29)

where a′_k ∈ T_0 D^n_c = R^n, and optimize a′_k as a Euclidean parameter.
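As a concrete illustration, the following NumPy sketch computes the logits of eq. (4.27) with the reparametrization of eq. (4.29); a softmax over the returned values gives p(y = k | x). Variable names are illustrative and the code is a sketch rather than the released implementation.

```python
import numpy as np

def mobius_add(x, y, c=1.0):
    """Mobius addition in the Poincare ball of curvature -c."""
    xy, nx2, ny2 = np.dot(x, y), np.dot(x, x), np.dot(y, y)
    num = (1 + 2 * c * xy + c * ny2) * x + (1 - c * nx2) * y
    return num / (1 + 2 * c * xy + c ** 2 * nx2 * ny2)

def hyperbolic_mlr_logits(x, P, A, c=1.0):
    """Unnormalized log-probabilities of eq. (4.27); A stores the Euclidean parameters a'_k."""
    logits = []
    for p_k, a_k in zip(P, A):
        m = mobius_add(-p_k, x, c)      # -p_k (+)_c x
        na = np.linalg.norm(a_k) + 1e-15
        arg = 2.0 * np.sqrt(c) * np.dot(m, a_k) / ((1.0 - c * np.dot(m, m)) * na)
        # by eq. (4.29), lambda^c_{p_k} * ||a_k|| equals lambda^c_0 * ||a'_k|| = 2 ||a'_k||
        logits.append(2.0 * na / np.sqrt(c) * np.arcsinh(arg))
    return np.array(logits)
```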

4.3 hyperbolic feed-forward neural networks

In order to define hyperbolic neural networks, it is crucial to define


a canonically simple parametric family of transformations, playing
the role of linear mappings in usual Euclidean neural networks, and
to know how to apply pointwise non-linearities. Inspiring ourselves
from our reformulation of Möbius scalar multiplication in eq. (2.29), we
define:

Definition 4.3 (Möbius version). For f : Rn → Rm , we define the


Möbius version of f as the map from D^n_c to D^m_c by:

f^{⊗c}(x) := exp^c_0(f(log^c_0(x))),    (4.30)

where exp^c_0 : T_0 D^m_c → D^m_c and log^c_0 : D^n_c → T_0 D^n_c.


Note that similarly as for other Möbius operations, we recover


the Euclidean mapping in the limit c → 0 if f is continuous, as
limc→0 f ⊗c (x) = f (x). This definition satisfies a few desirable prop-
erties too, such as: ( f ◦ g)⊗c = f ⊗c ◦ g⊗c for f : Rm → Rl and g :
Rn → Rm (morphism property), and f ⊗c (x)/k f ⊗c (x)k = f (x)/k f (x)k
for f (x) 6= 0 (direction preserving). It is then straight-forward to prove
the following result:
Lemma 4.5 (Möbius matrix-vector multiplication). If M : Rn → Rm is
a linear map, which we identify with its matrix representation M ∈ Mm,n ,
then ∀x ∈ D^n_c, if Mx ≠ 0 we have

M^{⊗c}(x) = (1/√c) tanh( (‖Mx‖/‖x‖) tanh⁻¹(√c ‖x‖) ) Mx/‖Mx‖,    (4.31)
and M⊗c (x) = 0 if Mx = 0. Moreover, if we define the Möbius matrix-vector
multiplication of M with x by M ⊗c x := M⊗c ( x ), then we have:
• (MM0 ) ⊗c x = M ⊗c (M0 ⊗c x) for any M ∈ Ml,m (R) and M0 ∈
Mm,n (R) (matrix associativity)
• (rM) ⊗c x = r ⊗c (M ⊗c x) for r ∈ R and M ∈ Mm,n (R) (scalar-
matrix associativity)

• M ⊗c x = Mx for all M ∈ On (R) (rotations are preserved)
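A minimal NumPy sketch of eq. (4.30) and of the Möbius matrix–vector multiplication of eq. (4.31), written through the exp/log maps at the origin (which reduce to radial tanh/tanh⁻¹ rescalings); this is only an illustration of the closed forms above.

```python
import numpy as np

def exp0(v, c=1.0):
    """exp_0^c: tangent space at the origin -> Poincare ball D^n_c."""
    nv = np.linalg.norm(v)
    if nv < 1e-15:
        return v.copy()
    return np.tanh(np.sqrt(c) * nv) * v / (np.sqrt(c) * nv)

def log0(x, c=1.0):
    """log_0^c: Poincare ball D^n_c -> tangent space at the origin."""
    nx = np.linalg.norm(x)
    if nx < 1e-15:
        return x.copy()
    return np.arctanh(min(np.sqrt(c) * nx, 1.0 - 1e-5)) * x / (np.sqrt(c) * nx)

def mobius_version(f, x, c=1.0):
    """Mobius version of eq. (4.30): f applied through the exp/log maps at 0."""
    return exp0(f(log0(x, c)), c)

def mobius_matvec(M, x, c=1.0):
    """Mobius matrix-vector multiplication of eq. (4.31)."""
    if np.linalg.norm(M @ x) < 1e-15:
        return np.zeros(M.shape[0])
    return mobius_version(lambda v: M @ v, x, c)
```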

pointwise non-linearity. If ϕ : Rn → Rn is a pointwise non-


linearity, then its Möbius version ϕ⊗c can be applied to elements of the
Poincaré ball.

bias translation. The generalization of a translation in the Poincaré


ball is naturally given by moving along geodesics. But should we use
the Möbius sum x ⊕c b with a hyperbolic bias b ∈ Dnc or the expo-
nential map expcx (b0 ) with a Euclidean bias b0 ∈ Rn ? These views are
unified with parallel transport (see theorem 2.4). Möbius translation of
a point x ∈ Dnc by a bias b ∈ Dnc is given by
x ← x ⊕_c b = exp^c_x(P^c_{0→x}(log^c_0(b))) = exp^c_x( (λ^c_0/λ^c_x) log^c_0(b) ).    (4.32)
We recover Euclidean translations in the limit c → 0. Note that bias
translations play a particular role in this model. Indeed, consider multi-
ple layers of the form f k (x) = ϕk (Mk x), each of which having Möbius


version f_k^{⊗c}(x) = ϕ_k^{⊗c}(M_k ⊗_c x). Then their composition can be re-
written f_k^{⊗c} ◦ · · · ◦ f_1^{⊗c} = exp^c_0 ◦ f_k ◦ · · · ◦ f_1 ◦ log^c_0. This means that these
operations can essentially be performed in Euclidean space. Therefore,
it is the interposition between those with the bias translation of eq. (4.32)
which differentiates this model from its Euclidean counterpart.

concatenation of multiple input vectors. If a vector x ∈


Rn+ p is the (vertical) concatenation of two vectors x1 ∈ Rn , x2 ∈ R p ,
and M ∈ Mm,n+ p (R) can be written as the (horizontal) concatenation
of two matrices M1 ∈ Mm,n (R) and M2 ∈ Mm,p (R), then Mx =
M1 x1 + M2 x2 .
We generalize this to hyperbolic spaces: if we are given x_1 ∈ D^n_c,
x_2 ∈ D^p_c, x = (x_1 x_2)^T ∈ D^n_c × D^p_c, and M, M_1, M_2 as before, then we
define

M ⊗_c x := M_1 ⊗_c x_1 ⊕_c M_2 ⊗_c x_2.    (4.33)
Note that, when c goes to zero, we recover the Euclidean formulation,
as
lim_{c→0} M ⊗_c x = lim_{c→0} (M_1 ⊗_c x_1 ⊕_c M_2 ⊗_c x_2) = M_1 x_1 + M_2 x_2 = Mx.    (4.34)

Moreover, hyperbolic vectors x ∈ Dnc can also be “concatenated” with


real features y ∈ R via M ⊗c x ⊕c y ⊗c b with learnable bias b ∈ Dmc
and weight matrices M ∈ Mm,n (R).

4.4 hyperbolic recurrent neural networks

naive rnn. A simple RNN is defined by


h_{t+1} = ϕ(W h_t + U x_t + b),    (4.35)

where ϕ is a pointwise non-linearity, typically tanh, sigmoid, ReLU, etc.
This formula can be naturally generalized to the hyperbolic space as
follows. For parameters W ∈ M_{m,n}(R), U ∈ M_{m,d}(R), b ∈ D^m_c, we
define:

h_{t+1} = ϕ^{⊗c}(W ⊗_c h_t ⊕_c U ⊗_c x_t ⊕_c b),  h_t ∈ D^n_c, x_t ∈ D^d_c.    (4.36)

Note that if the inputs x_t are Euclidean, one can write x̃_t := exp^c_0(x_t) and
use the above formula because of the following relation:

exp^c_{W⊗_c h_t}(P^c_{0→W⊗_c h_t}(U x_t)) = W ⊗_c h_t ⊕_c exp^c_0(U x_t)
                                        = W ⊗_c h_t ⊕_c U ⊗_c x̃_t.    (4.37)
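For illustration, a NumPy sketch of one step of eq. (4.36) for c = 1 with Euclidean inputs lifted to the ball as in eq. (4.37); ϕ is taken to be the identity, which is the variant that worked best for our hyperbolic models in the experiments below.

```python
import numpy as np

def mobius_add(x, y):
    """Mobius addition in the Poincare ball (c = 1)."""
    xy, nx2, ny2 = np.dot(x, y), np.dot(x, x), np.dot(y, y)
    return ((1 + 2 * xy + ny2) * x + (1 - nx2) * y) / (1 + 2 * xy + nx2 * ny2)

def exp0(v):
    nv = np.linalg.norm(v)
    return v if nv < 1e-15 else np.tanh(nv) * v / nv

def log0(x):
    nx = np.linalg.norm(x)
    return x if nx < 1e-15 else np.arctanh(min(nx, 1 - 1e-5)) * x / nx

def mobius_matvec(M, x):
    """M applied through exp0/log0, which reproduces eq. (4.31) for c = 1."""
    return np.zeros(M.shape[0]) if np.linalg.norm(M @ x) < 1e-15 else exp0(M @ log0(x))

def hyp_rnn_cell(h, x_eucl, W, U, b):
    """One step of eq. (4.36) with phi = identity; the Euclidean input is lifted
    to the ball via exp0(U x), which by eq. (4.37) equals U Mobius-multiplied with exp0(x)."""
    return mobius_add(mobius_add(mobius_matvec(W, h), exp0(U @ x_eucl)), b)
```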


gru architecture. One can also adapt the GRU architecture:


r_t = σ(W^r h_{t−1} + U^r x_t + b^r)
z_t = σ(W^z h_{t−1} + U^z x_t + b^z)
h̃_t = ϕ(W(r_t ⊙ h_{t−1}) + U x_t + b)
h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t,    (4.38)

where ⊙ denotes the pointwise product. First, how should we adapt the
pointwise multiplication by a scaling gate? Note that the definition of
the Möbius version (see eq. (4.30)) can be naturally extended to maps
f : R^n × R^p → R^m as

f^{⊗c} : D^n_c × D^p_c → D^m_c,  f^{⊗c}(h, h′) = exp^c_0(f(log^c_0(h), log^c_0(h′))).    (4.39)

In particular, choosing f(h, h′) := σ(h) ⊙ h′ yields²

f^{⊗c}(h, h′) = exp^c_0(σ(log^c_0(h)) ⊙ log^c_0(h′))
             = diag(σ(log^c_0(h))) ⊗_c h′.    (4.40)

Hence we adapt r_t ⊙ h_{t−1} to diag(r_t) ⊗_c h_{t−1} and the reset gate r_t to:

r_t = σ(log^c_0(W^r ⊗_c h_{t−1} ⊕_c U^r ⊗_c x_t ⊕_c b^r)),    (4.41)

and similarly for the update gate z_t. Note that as the argument of σ in
the above is unbounded, r_t and z_t can a priori take values over the full
range (0, 1). Now the intermediate hidden state becomes:

h̃_t = ϕ^{⊗c}((W diag(r_t)) ⊗_c h_{t−1} ⊕_c U ⊗_c x_t ⊕_c b),    (4.42)

where Möbius matrix associativity simplifies W ⊗_c (diag(r_t) ⊗_c h_{t−1})
into (W diag(r_t)) ⊗_c h_{t−1}. Finally, we propose to adapt the update-gate
equation as

h_t = h_{t−1} ⊕_c diag(z_t) ⊗_c (−h_{t−1} ⊕_c h̃_t).    (4.43)
Note that when c goes to zero, one recovers the usual GRU. Moreover,
if zt = 0 or zt = 1, then ht becomes ht−1 or h̃t respectively, similarly as
in the usual GRU. This adaptation was obtained by adapting [TO18]: in
this work, the authors re-derive the update-gate mechanism from a first
principle called time-warping invariance. We adapted their derivation to
the hyperbolic setting by using the notion of gyroderivative [BU01] and
proving a gyro-chain-rule (proof omitted here, presented in [GBH18a]).
2 If x has n coordinates, then diag(x) denotes the diagonal matrix of size n with
x_i’s on its diagonal.
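For completeness, a NumPy sketch of one hyperbolic GRU step, eqs. (4.41) to (4.43), again for c = 1 and ϕ = identity; parameter names are illustrative and all inputs are assumed to already live in the Poincaré ball.

```python
import numpy as np

def mobius_add(x, y):
    xy, nx2, ny2 = np.dot(x, y), np.dot(x, x), np.dot(y, y)
    return ((1 + 2 * xy + ny2) * x + (1 - nx2) * y) / (1 + 2 * xy + nx2 * ny2)

def exp0(v):
    nv = np.linalg.norm(v)
    return v if nv < 1e-15 else np.tanh(nv) * v / nv

def log0(x):
    nx = np.linalg.norm(x)
    return x if nx < 1e-15 else np.arctanh(min(nx, 1 - 1e-5)) * x / nx

def mobius_matvec(M, x):
    return np.zeros(M.shape[0]) if np.linalg.norm(M @ x) < 1e-15 else exp0(M @ log0(x))

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def hyp_gru_cell(h, x, Wr, Ur, br, Wz, Uz, bz, W, U, b):
    """One hyperbolic GRU step, eqs. (4.41)-(4.43), with phi = identity."""
    def affine(Wm, Um, bm):  # W (x)_c h (+)_c U (x)_c x (+)_c b
        return mobius_add(mobius_add(mobius_matvec(Wm, h), mobius_matvec(Um, x)), bm)
    r = sigmoid(log0(affine(Wr, Ur, br)))                              # reset gate, eq. (4.41)
    z = sigmoid(log0(affine(Wz, Uz, bz)))                              # update gate
    h_tilde = mobius_add(mobius_add(mobius_matvec(W @ np.diag(r), h),
                                    mobius_matvec(U, x)), b)           # eq. (4.42)
    delta = mobius_add(-h, h_tilde)
    return mobius_add(h, mobius_matvec(np.diag(z), delta))             # eq. (4.43)
```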


4.5 experiments

Textual Entailment

snli task and dataset. We evaluate our method on two tasks.


The first is natural language inference, or textual entailment. Given two
sentences, a premise (e.g. “Little kids A. and B. are playing soccer.”)
and a hypothesis (e.g. “Two children are playing outdoors.”), the binary
classification task is to predict whether the second sentence can be
inferred from the first one. This defines a partial order in the sentence
space. We test hyperbolic networks on the biggest real dataset for this
task, SNLI [Bow+15]. It consists of 570K training, 10K validation and
10K test sentence pairs. Following [Ven+15], we merge the “contradic-
tion” and “neutral” classes into a single class of negative sentence pairs,
while the “entailment” class gives the positive pairs.

prefix task and datasets. We conjecture that the improvements


of hyperbolic neural networks are more significant when the underlying
data structure is closer to a tree. To test this, we design a proof-of-
concept task of detection of noisy prefixes, i.e. given two sentences, one
has to decide if the second sentence is a noisy prefix of the first, or a
random sentence. We thus build synthetic datasets PREFIX-Z% (for Z
being 10, 30 or 50) as follows: for each random first sentence of random
length at most 20 and one random prefix of it, a second positive sentence
is generated by randomly replacing Z% of the words of the prefix, and a
second negative sentence of same length is randomly generated. Word
vocabulary size is 100, and we generate 500K training, 10K validation
and 10K test pairs.

models architecture. Our neural network layers can be used in


a plug-n-play manner exactly like standard Euclidean layers. They can
also be combined with Euclidean layers. However, optimization w.r.t.
hyperbolic parameters is different (see below) and based on Riemannian
gradients which are just rescaled Euclidean gradients when working in
the conformal Poincaré model [NK17]. Thus, back-propagation can be
applied in the standard way.
In our setting, we embed the two sentences using two distinct hyper-
bolic RNNs or GRUs. The sentence embeddings are then fed together
with their squared distance (hyperbolic or Euclidean, depending on


their geometry) to a FFNN (Euclidean or hyperbolic, see section 4.3)


which is further fed to an MLR (Euclidean or hyperbolic, see section 4.2)
that gives probabilities of the two classes (entailment vs neutral). We
use cross-entropy loss on top. Note that hyperbolic and Euclidean
layers can be mixed, e.g. the full network can be hyperbolic and only
the last layer be Euclidean, in which case one has to use log0 and exp0
functions to move between the two manifolds in a correct manner as
explained for eq. (4.30).

optimization. Our models have both Euclidean (e.g. weight ma-


trices in both Euclidean and hyperbolic FFNNs, RNNs or GRUs) and
hyperbolic parameters (e.g. word embeddings or biases for the hyper-
bolic layers). We optimize the Euclidean parameters with Adam [KB14]
(learning rate 0.001). Hyperbolic parameters cannot be updated with
an equivalent method that keeps track of gradient history due to the
absence of a Riemannian Adam. Thus, they are optimized using full
Riemannian stochastic gradient descent (RSGD) [Bon13]. We also exper-
iment with projected RSGD [NK17], but optimization was sometimes
less stable. We use a different constant learning rate for word em-
beddings (0.1) and other hyperbolic weights (0.01) because words are
updated less frequently.

numerical errors. Gradients of the basic operations defined


above (e.g. ⊕c , exponential map) are not defined when the hyper-

bolic argument vectors are on the ball border, i.e. √c‖x‖ = 1. Thus,
we always project results of these operations in the ball of radius 1 − e,
where e = 10−5 . Numerical errors also appear when hyperbolic vectors
get closer to 0, thus we perturb them with an e0 = 10−15 before they
are used in any of the above operations. Finally, arguments of the
tanh function are clipped between ±15 to avoid numerical errors, while
arguments of tanh−1 are clipped to at most 1 − 10−5 .
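These safeguards can be summarized in a short sketch (c = 1, with the constants quoted above):

```python
import numpy as np

BALL_EPS, ORIGIN_EPS, TANH_CLIP, ATANH_CLIP = 1e-5, 1e-15, 15.0, 1.0 - 1e-5

def project_to_ball(x):
    """Keep results strictly inside the ball of radius 1 - BALL_EPS."""
    n = np.linalg.norm(x)
    return x if n <= 1.0 - BALL_EPS else x * (1.0 - BALL_EPS) / n

def perturb_origin(x):
    """Avoid undefined gradients of the basic operations exactly at 0."""
    return x if np.linalg.norm(x) > ORIGIN_EPS else x + ORIGIN_EPS

def safe_tanh(u):
    return np.tanh(np.clip(u, -TANH_CLIP, TANH_CLIP))

def safe_atanh(u):
    return np.arctanh(np.clip(u, -ATANH_CLIP, ATANH_CLIP))
```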

hyperparameters. For all methods, baselines and datasets, we


use c = 1, word and hidden state embedding dimension of 5 (we
focus on the low dimensional setting that was shown to already be
effective [NK17]), batch size of 64. We ran all methods for a fixed
number of 30 epochs. For all models, we experiment with both identity
(no non-linearity) or tanh non-linearity in the RNN/GRU cell, as well
as identity or ReLU after the FFNN layer and before MLR. As expected,


Model                       SNLI      PREFIX 10%   PREFIX 30%   PREFIX 50%
Eucl RNN+FFNN, Eucl MLR     79.34 %   89.62 %      81.71 %      72.10 %
Hyp RNN+FFNN, Eucl MLR      79.18 %   96.36 %      87.83 %      76.50 %
Hyp RNN+FFNN, Hyp MLR       78.21 %   96.91 %      87.25 %      62.94 %
Eucl GRU+FFNN, Eucl MLR     81.52 %   95.96 %      86.47 %      75.04 %
Hyp GRU+FFNN, Eucl MLR      79.76 %   97.36 %      88.47 %      76.87 %
Hyp GRU+FFNN, Hyp MLR       81.19 %   97.14 %      88.26 %      76.44 %

Figure 4.2: Test accuracies for various models and four datasets. “Eucl”
denotes Euclidean, “Hyp” denotes hyperbolic. All word
and sentence embeddings have dimension 5. We highlight
in bold the best method (or methods, if the difference is less
than 0.5%).

for the fully Euclidean models, tanh and ReLU respectively surpassed
the identity variant by a large margin. We only report the best Euclidean
results. Interestingly, for the hyperbolic models, using only identity for
both non-linearities works slightly better and this is likely due to the
fact that our hyperbolic layers already contain non-linearities by their
nature.
For the results shown in fig. 4.2, we run each model (baseline or
ours) exactly 3 times and report the test result corresponding to the
best validation result from these 3 runs. We do this because the highly
non-convex spectrum of hyperbolic neural networks sometimes results
in convergence to poor local minima, suggesting that initialization is
very important.

results. Results are shown in fig. 4.2. Note that the fully Euclidean
baseline models might have an advantage over hyperbolic baselines
because more sophisticated optimization algorithms such as Adam
do not have a hyperbolic analogue at the moment. We first observe
that all GRU models overpass their RNN variants. Hyperbolic RNNs


(a) Test accuracy

(b) Norm of the first sentence. Averaged over all sentences in the test set.

Figure 4.3: PREFIX-30% accuracy and first (premise) sentence norm


plots for different runs of the same architecture: hyper-
bolic GRU followed by hyperbolic FFNN and hyperbol-
ic/Euclidean (half-half) MLR. The X axis shows millions of
training examples processed.

and GRUs have the most significant improvement over their Euclidean
variants when the underlying data structure is more tree-like, e.g. for
PREFIX-10% − for which the tree relation between sentences and their
prefixes is more prominent − we reduce the error by a factor of 3.35
for hyperbolic vs Euclidean RNN, and by a factor of 1.5 for hyperbolic
vs Euclidean GRU. As soon as the underlying structure diverges more
and more from a tree, the accuracy gap decreases − for example, for
PREFIX-50% the noise heavily affects the representational power of
hyperbolic networks. Also, note that on SNLI our methods perform
similarly as with their Euclidean variants. Moreover, hyperbolic and
Euclidean MLR are on par when used in conjunction with hyperbolic
sentence embeddings, suggesting further empirical investigation is
needed for this direction (see below).

additional experimental results. We observed that, in the


hyperbolic setting, accuracy is often much higher when sentence em-


(a) Test accuracy

(b) Norm of the first sentence. Averaged over all sentences in the test set.

Figure 4.4: PREFIX-30% accuracy and first (premise) sentence norm


plots for different runs of the same architecture: Euclidean
GRU followed by Euclidean FFNN and Euclidean MLR. The
X axis shows millions of training examples processed.

beddings increase norms, tending to diverge towards the “infinity” bor-


der of the Poincaré ball. Moreover, the faster the two sentence norms
go to 1, the more likely it is that a good local minimum was reached. See
figs. 4.3 and 4.5.
We often observe that test accuracy starts increasing exactly when
sentence embedding norms do. However, in the hyperbolic setting,
the sentence embeddings norms remain close to 0 for a few epochs,
which does not happen in the Euclidean case. See figs. 4.3 to 4.5. This
surprising behavior was also exhibited in a similar way by [NK17]
which suggests that the model first has to adjust the angular layout in
the almost Euclidean vicinity of 0 before increasing norms and fully
exploiting hyperbolic geometry.

Classification of Hierarchical Word Embeddings

setup description. For the sentence entailment classification task


we did not see a clear advantage of using the hyperbolic MLR layer


(a) Test accuracy

(b) Norm of the first sentence. Averaged over all sentences in the test set.

Figure 4.5: PREFIX-30% accuracy and first (premise) sentence norm


plots for different runs of the same architecture: hyperbolic
RNN followed by hyperbolic FFNN and hyperbolic MLR.
The X axis shows millions of training examples processed.

compared to its Euclidean variant. A possible reason is that, when


trained end-to-end, the model might decide to place positive and neg-
ative embeddings in a manner that is already well separated with a
classic MLR. As a consequence, we further investigate MLR for the
task of subtree classification. Using an open source implementation 3
of [NK17], we pre-trained Poincaré embeddings of the WordNet noun
hierarchy (82,115 nodes). We then choose one node in this tree (see
fig. 4.7) and classify all other nodes (solely based on their embeddings)
as being part of the subtree rooted at this node. All nodes in such a
subtree are divided into positive training nodes (80%) and positive test
nodes (20%).
The same splitting procedure is applied for the remaining WordNet
nodes that are divided into a negative training and negative test sets.

training details. Three variants of MLR are then trained on


top of pre-trained Poincaré embeddings[NK17] to solve this binary
3 https://github.com/dalab/hyperbolic_cones


Figure 4.6: Hyperbolic (left) vs Direct Euclidean (right) binary MLR


used to classify nodes as being part in the group.n.01 sub-
tree of the WordNet noun hierarchy solely based on their
Poincaré embeddings. The positive points (from the subtree)
are in red, the negative points (the rest) are in yellow and
the trained positive separation hyperplane is depicted in
green.


classification task: hyperbolic MLR, Euclidean MLR applied directly on


the hyperbolic embeddings (even if mathematically this is not respecting
the hyperbolic geometry) and Euclidean MLR applied after mapping
all embeddings in the tangent space at 0 using the log0 map.
We use different embedding dimensions : 2, 3, 5 and 10. For the
hyperbolic MLR, we use full Riemannian SGD with a learning rate of
0.001. For the two Euclidean models we use ADAM optimizer and
the same learning rate. During training, we always sample the same
number of negative and positive nodes in each minibatch of size 16;
thus positive nodes are frequently re-sampled. All methods are trained
for 30 epochs and the final F1 score is reported (no hyperparameters
to validate are used, thus we do not require a validation set). This
procedure is repeated for four subtrees of different sizes.

results. Quantitative results are presented in fig. 4.7. We can see


that the hyperbolic MLR overpasses its Euclidean variants in almost all
settings, sometimes by a large margin. Moreover, to provide further
understanding, we plot the 2-dimensional embeddings and the trained
separation hyperplanes (geodesics in this case) in fig. 4.6.

WordNet subtree              Model   d = 2           d = 3          d = 5          d = 10
animal.n.01   (3218 / 798)   Hyp     47.43 ± 1.07    91.92 ± 0.61   98.07 ± 0.55   99.26 ± 0.59
                             Eucl    41.69 ± 0.19    68.43 ± 3.90   95.59 ± 1.18   99.36 ± 0.18
                             log0    38.89 ± 0.01    62.57 ± 0.61   89.21 ± 1.34   98.27 ± 0.70
group.n.01    (6649 / 1727)  Hyp     81.72 ± 0.17    89.87 ± 2.73   87.89 ± 0.80   91.91 ± 3.07
                             Eucl    61.13 ± 0.42    63.56 ± 1.22   67.82 ± 0.81   91.38 ± 1.19
                             log0    60.75 ± 0.24    61.98 ± 0.57   67.92 ± 0.74   91.41 ± 0.18
worker.n.01   (861 / 254)    Hyp     12.68 ± 0.82    24.09 ± 1.49   55.46 ± 5.49   66.83 ± 11.38
                             Eucl    10.86 ± 0.01    22.39 ± 0.04   35.23 ± 3.16   47.29 ± 3.93
                             log0     9.04 ± 0.06    22.57 ± 0.20   26.47 ± 0.78   36.66 ± 2.74
mammal.n.01   (953 / 228)    Hyp     32.01 ± 17.14   87.54 ± 4.55   88.73 ± 3.22   91.37 ± 6.09
                             Eucl    15.58 ± 0.04    44.68 ± 1.87   59.35 ± 1.31   77.76 ± 5.08
                             log0    13.10 ± 0.13    44.89 ± 1.18   52.51 ± 0.85   56.11 ± 2.21

Figure 4.7: Test F1 classification scores (%) for four different subtrees of WordNet noun tree. 95% confidence intervals
for 3 different runs are shown for each method and each dimension. “Hyp” denotes our hyperbolic
MLR, “Eucl” denotes directly applying Euclidean MLR to hyperbolic embeddings in their Euclidean
parametrization, and log0 denotes applying Euclidean MLR in the tangent space at 0, after projecting all
hyperbolic embeddings there with log0 .


4.6 summary

We showed how classic Euclidean deep learning tools such as MLR,


FFNN, RNN or GRU can be generalized in a principled manner to all
spaces of constant negative curvature combining Riemannian geom-
etry with the elegant theory of gyrovector spaces. Empirically we
found that our models outperform or are on par with corresponding
Euclidean architectures on sequential data with implicit hierarchical
structure. We hope to trigger exciting future research related to better
understanding of the hyperbolic non-convexity spectrum and devel-
opment of other non-Euclidean deep learning methods. Our data and
Tensorflow [Aba+16] code are publicly available 4 .

4 https://github.com/dalab/hyperbolic_nn

HYPERBOLIC WORD EMBEDDINGS
5
Words are not created equal. In fact, they form an aristocratic graph
with a latent hierarchical structure that the next generation of unsuper-
vised learned word embeddings should reveal. In this chapter, justified
by the notion of delta-hyperbolicity or tree-likeliness of a space, we
propose to embed words in a Cartesian product of hyperbolic spaces
which we theoretically connect to the Gaussian word embeddings and
their Fisher geometry. This connection allows us to introduce a novel
principled hypernymy score for word embeddings. Moreover, we adapt
the well-known Glove algorithm to learn unsupervised word embed-
dings in this type of Riemannian manifolds. We further explain how
to solve the analogy task using the Riemannian parallel transport that
generalizes vector arithmetics to this new type of geometry. Empirically,
based on extensive experiments, we prove that our embeddings, trained
unsupervised, are the first to simultaneously outperform strong and
popular baselines on the tasks of similarity, analogy and hypernymy
detection. In particular, for word hypernymy, we obtain new state-of-
the-art on fully unsupervised WBLESS classification accuracy.

The material presented here has in parts been published in the publi-
cation [TBG19] 1 .

5.1 introduction & motivation

Word embeddings are ubiquitous nowadays as first layers in neural


network and deep learning models for natural language processing.
They are essential in order to move from the discrete word space to
the continuous space where differentiable loss functions can be opti-
mized. The popular models of Glove [PSM14], Word2Vec [Mik+13b] or
FastText [Boj+16], provide efficient ways to learn word vectors fully un-
supervised from raw text corpora, solely based on word co-occurrence

1 Shared equal first authorship position.


statistics. These models are then successfully applied to word similarity


and other downstream tasks and, surprisingly (or not [Aro+16]), exhibit
a linear algebraic structure that is also useful to solve word analogy.
However, unsupervised word embeddings still largely suffer from
revealing antisymmetric word relations including the latent hierarchi-
cal structure of words. This is currently one of the key limitations in
automatic text understanding, e.g. for tasks such as textual entail-
ment [Bow+15]. To address this issue, [MC18]; [VM15] propose to move
from point embeddings to probability density functions, the simplest
being Gaussian or Elliptical distributions. Their intuition is that the
variance of such a distribution should encode the generality/specificity
of the respective word. However, this method results in losing the arith-
metic properties of point embeddings (e.g. for analogy reasoning) and
becomes unclear how to properly use them in downstream tasks. To this
end, we propose to take the best from both worlds: we embed words
as points in a Cartesian product of hyperbolic spaces and, additionally,
explain how they are bijectively mapped to Gaussian embeddings with
diagonal covariance matrices, where the hyperbolic distance between
two points becomes the Fisher distance between the corresponding
probability distribution functions (PDFs). This allows us to derive a
novel principled is-a score on top of word embeddings that can be
leveraged for hypernymy detection. We learn these word embeddings
unsupervised from raw text by generalizing the Glove method. More-
over, the linear arithmetic property used for solving word analogy has
a mathematical grounded correspondence in this new space based on
the established notion of parallel transport in Riemannian manifolds.
In addition, these hyperbolic embeddings outperform Euclidean Glove
on word similarity benchmarks. We thus describe, to our knowledge,
the first word embedding model that competitively addresses the above
three tasks simultaneously. Finally, these word vectors can also be used
in downstream tasks as explained by [GBH18a].
We provide additional reasons for choosing the hyperbolic geometry
to embed words. We explain the notion of average δ-hyperbolicity of a
graph [BCC15], a geometric quantity that measures its “democracy”.
A small hyperbolicity constant implies “aristocracy”, namely the exis-
tence of a small set of nodes that “influence” most of the paths in the
graph. It is known that real-world graphs are mainly complex networks
(e.g. scale-free exhibiting power-law node degree distributions) which
in turn are better embedded in a tree-like space, i.e. hyperbolic [Kri+10].


Since, intuitively, words form an “aristocratic” community (few generic


ones from different topics and many more specific ones) and since a
significant subset of them exhibits a hierarchical structure (e.g. Word-
Net [Mil+90]), it is natural to learn hyperbolic word embeddings.
Moreover, we empirically measure very low average δ-hyperbolicity
constants of some variants of the word log-co-occurrence graph (used
by the Glove method), providing additional quantitative reasons for
why spaces of negative curvature (i.e. hyperbolic) are better suited for
word representations.

5.2 related work

Recent supervised methods can be applied to embed any tree or directed
acyclic graph in a low dimensional space with the aim of improving
link prediction either by imposing a partial order in the embedding
space [AW18]; [Ven+15]; [Vil+18], by using hyperbolic geometry [NK17];
[NK18], or both, as we have seen in Chapter 3.
To learn word embeddings that exhibit hypernymy or hierarchical
information, supervised methods [Ngu+17]; [VM18] leverage external
information (e.g. WordNet) together with raw text corpora. However,
the same goal is also targeted by more ambitious fully unsupervised
models which move away from the “point” assumption and learn
various probability densities for each word [AW17]; [MC18]; [Sin+18];
[VM15].
There have been two very recent attempts at learning unsupervised
word embeddings in the hyperbolic space [Dhi+18]; [LW18]. However,
they suffer from either not being competitive on standard tasks in
high dimensions, not showing the benefit of using hyperbolic spaces to
model antisymmetric relations, or not being trained on realistically large
corpora. We address these problems and, moreover, the connection
with density based methods is made explicit and leveraged to improve
hypernymy detection.

5.3 hyperbolic spaces and their cartesian product

In order to work in the hyperbolic space, we have to choose one
model, among the five isometric models that exist. We first choose
to embed words in the Poincaré ball $\mathbb{D}^n = \{x \in \mathbb{R}^n : \|x\|_2 < 1\}$,


Figure 5.1: Isometric deformation ϕ of D2 (left end) into H2 (right end).

extensively analyzed in Chapter 2. This is illustrated in fig. 5.1 for


n = 2 dimensions, where dark lines represent geodesics. Second,
we also embed words in products of hyperbolic spaces, and explain
why later on. A product of p balls $(\mathbb{D}^n)^p$, with the induced product
geometry, is known to have distance function

d_{(\mathbb{D}^n)^p}(\mathbf{x}, \mathbf{y})^2 = \sum_{i=1}^{p} d_{\mathbb{D}^n}(x_i, y_i)^2    (5.1)

Finally, another model of interest for us is the Poincaré half-plane
$\mathbb{H}^2 = \mathbb{R} \times \mathbb{R}^*_+$ illustrated in fig. 5.1, with distance function

d_{\mathbb{H}^2}(\mathbf{x}, \mathbf{y}) = \cosh^{-1}\left(1 + \frac{\|\mathbf{x} - \mathbf{y}\|_2^2}{2\, x_2 y_2}\right)    (5.2)

Figure 5.1 shows an isometry ϕ from D2 to H2 mapping the vertical


segment {0} × (−1, 1) to {0} × R∗+ , where 0 ∈ D2 becomes (0, 1) ∈ H2 .

5.4 adapting glove

euclidean glove. The Glove [PSM14] algorithm is an unsuper-


vised method for learning word representations in the Euclidean space
from statistics of word co-occurrences in a text corpus, with the aim to
geometrically capture the words’ meaning and relations.
We use the notations: Xij is the number of times word j occurs in the
same window context as word i; Xi = ∑k Xik is the word frequency; the
embedding of a target word i is written wi , while the embedding of a
context word k is written w̃k .


The initial formulation of the Glove [PSM14] model suggests to learn
embeddings so as to match the word log-co-occurrence counts using the
vector dot-product together with biases:

w_i^T \tilde{w}_k + b_i + \tilde{b}_k = \log(X_{ik}).    (5.3)

The authors suggest to enforce this equality by optimizing a weighted
least-squares loss:

J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2,    (5.4)

where V is the size of the vocabulary and f down-weights the signal
coming from frequent words (it is typically chosen to be $f(x) = \min\{1, (x/x_m)^\alpha\}$,
with $\alpha = 3/4$ and $x_m = 100$).

glove in metric spaces. How should one generalize Glove to non-


Euclidean spaces such as the hyperbolic space? Note that there is no
clear correspondence of the Euclidean inner-product in a hyperbolic
space. However, we are provided with a distance function. Further
notice that one could rewrite eq. (5.3) with the Euclidean distance as

-\frac{1}{2}\|w_i - \tilde{w}_k\|^2 + b_i + \tilde{b}_k = \log(X_{ik})    (5.5)

where we absorbed the squared norms of the embeddings into the
biases. We thus replace the Glove loss by:

J = \sum_{i,j=1}^{V} f(X_{ij}) \left( -h(d(w_i, \tilde{w}_j)) + b_i + \tilde{b}_j - \log X_{ij} \right)^2,    (5.6)

where h is a function to be chosen as a hyperparameter of the model,


and d can be any differentiable distance function. Although the most di-
rect correspondence with Glove would suggest h( x ) = x2 /2, we some-
times obtained better results with other functions, such as h = cosh2
(see sections 5.8 and 5.9). Note that [De +18b] also apply cosh to
their distance matrix for hyperbolic MDS before applying PCA. Under-
standing why h = cosh2 is a good choice would be interesting future
work.
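For concreteness, the following minimal NumPy sketch evaluates one term of the generalized loss in eq. (5.6), combining the Poincaré ball distance from Chapter 2 with the GloVe weighting f and a choice of h. The function names, the eps stabilizer and the default h = cosh^2 are illustrative assumptions and not the implementation of [TBG19], which additionally trains all parameters with Radagrad.

import numpy as np

def poincare_dist(x, y, eps=1e-7):
    # Distance in the Poincare ball D^n (closed form from Chapter 2).
    sq_dist = np.sum((x - y) ** 2)
    nx, ny = np.sum(x ** 2), np.sum(y ** 2)
    return np.arccosh(1.0 + 2.0 * sq_dist / ((1.0 - nx) * (1.0 - ny) + eps))

def glove_weight(x, x_max=100.0, alpha=0.75):
    # Down-weighting function f of eq. (5.4).
    return min(1.0, (x / x_max) ** alpha)

def pair_loss(w_i, w_j_tilde, b_i, b_j_tilde, x_ij, h=lambda d: np.cosh(d) ** 2):
    # Weighted squared residual of eq. (5.6) for a single co-occurrence pair.
    residual = -h(poincare_dist(w_i, w_j_tilde)) + b_i + b_j_tilde - np.log(x_ij)
    return glove_weight(x_ij) * residual ** 2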


5.5 connecting gaussian distributions and hyperbolic embeddings

In order to endow Euclidean word embeddings with richer informa-


tion, [VM15] proposed to represent words as Gaussians, i.e. with a mean
vector and a covariance matrix 2 , expecting the variance parameters
to capture how generic/specific a word is, and, hopefully, entailment
relations. On the other hand, [NK17] proposed to embed words of the
WordNet hierarchy [Mil+90] in hyperbolic space, because this space is
mathematically known to be better suited to embed tree-like graphs. It
is hence natural to wonder: is there a connection between the two?

the fisher geometry of gaussians is hyperbolic. It turns
out that there exists a striking connection [CSS15]. Note that a 1D
Gaussian $\mathcal{N}(\mu, \sigma^2)$ can be represented as a point $(\mu, \sigma)$ in $\mathbb{R} \times \mathbb{R}^*_+$. Then,
the Fisher distance between two distributions relates to the hyperbolic
distance in $\mathbb{H}^2$:

d_F\left(\mathcal{N}(\mu, \sigma^2), \mathcal{N}(\mu', \sigma'^2)\right) = \sqrt{2}\, d_{\mathbb{H}^2}\left((\mu/\sqrt{2}, \sigma), (\mu'/\sqrt{2}, \sigma')\right).    (5.7)

For n-dimensional Gaussians with diagonal covariance matrices written
$\Sigma = \mathrm{diag}(\sigma)^2$, it becomes:

d_F\left(\mathcal{N}(\mu, \Sigma), \mathcal{N}(\mu', \Sigma')\right) = \sqrt{\sum_{i=1}^{n} 2\, d_{\mathbb{H}^2}\left((\mu_i/\sqrt{2}, \sigma_i), (\mu_i'/\sqrt{2}, \sigma_i')\right)^2}.    (5.8)

Hence there is a direct correspondence between diagonal Gaussians
and the product space $(\mathbb{H}^2)^n$.
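As an illustration of eqs. (5.7)-(5.8), the short sketch below maps each coordinate of a diagonal Gaussian to a point of the half-plane H2 and accumulates the Fisher distance; it assumes NumPy, and the helper names are ours.

import numpy as np

def half_plane_dist(p, q):
    # Poincare half-plane distance of eq. (5.2); p = (p1, p2), q = (q1, q2) with p2, q2 > 0.
    sq = (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
    return np.arccosh(1.0 + sq / (2.0 * p[1] * q[1]))

def fisher_dist_diag(mu, sigma, mu_p, sigma_p):
    # Fisher distance of eq. (5.8) between two diagonal Gaussians,
    # computed coordinate-wise through the half-plane H^2.
    total = 0.0
    for m1, s1, m2, s2 in zip(mu, sigma, mu_p, sigma_p):
        d = half_plane_dist((m1 / np.sqrt(2.0), s1), (m2 / np.sqrt(2.0), s2))
        total += 2.0 * d ** 2
    return np.sqrt(total)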
This connection allows us to mathematically ground

• word analogy computations for Gaussian embeddings using hyper-


bolic geometry – section 5.6.

• hypernymy detection for hyperbolic embeddings using Gaussian


distributions – section 5.7.

fisher distance, kl & gaussian embeddings. The above para-


graph lets us relate the word2gauss algorithm [VM15] to hyperbolic
word embeddings. Although one could object that word2gauss is
2 diagonal or even spherical, for simplicity.


trained using a KL divergence, while hyperbolic embeddings relate to


Gaussian distributions via the Fisher distance d F , let us recall that KL
and d F define the same local geometry. Indeed, the KL is known to be
related to d F , as its local approximation [Jef46].

riemannian optimization. A benefit of representing words in


(products of) hyperbolic spaces, as opposed to (diagonal) Gaussian dis-
tributions, is that one can use recent Riemannian adaptive optimization
tools such as Radagrad [BG19]. Note that without this connection, it
would be unclear how to define a variant of Adagrad [DHS11] intrinsic
to the statistical manifold of Gaussians. Empirically, we indeed noticed
better results using Radagrad, rather than simply Riemannian sgd
[Bon13]. Similarly, note that Glove trains with Adagrad.

5.6 analogies for hyperbolic/gaussian embeddings

A common task used to evaluate word embeddings, called analogy,


consists in finding which word d is to the word c, what the word b is
to the word a. For instance, queen is to woman what king is to man. In
the Euclidean embedding space, the solution to this problem is usually
taken geometrically as d = c + (b − a) = b + (c − a). Note that this
implies that the same d is also to b, what c is to a.
How should one intrinsically define “analogy parallelograms” in a
space of Gaussian distributions? Note that [VM15] do not evaluate
their Gaussian embeddings on the analogy task, and that it would be
unclear how to do so. However, since we can go back and forth between
(diagonal) Gaussians and (products of) hyperbolic spaces as explained
in section 5.5, we can use the fact that parallelograms are naturally
defined in the Poincaré ball, by the notion of gyro-translation [Ung12,
section 4]. In the Poincaré ball, the two solutions d1 = c + (b − a) and
d2 = b + (c − a) are respectively generalized to

d_1 = c \oplus \mathrm{gyr}[c, -a]((-a) \oplus b), \quad \text{and} \quad d_2 = b \oplus \mathrm{gyr}[b, -a]((-a) \oplus c).    (5.9)
where the gyr operator as well as the closed-form formulas for these
operations are described in Chapter 2 (e.g. eq. (2.33)), being easy to
implement. The fact that d1 and d2 differ is due to the curvature of the
space. For evaluation, we chose a point $m^t_{d_1 d_2} := d_1 \oplus ((-d_1 \oplus d_2) \otimes t)$


located on the geodesic between d1 and d2 for some t ∈ [0, 1]; if t = 1/2,
this is called the gyro-midpoint and then $m^{0.5}_{d_1 d_2} = m^{0.5}_{d_2 d_1}$, which is at
equal hyperbolic distance from d1 as from d2 . We select t based on
2-fold cross-validation, as explained in [TBG19].
Note that continuously deforming the Poincaré ball to the Euclidean
space (by sending its radius to infinity) lets these analogy computations
recover their Euclidean counterparts, which is a nice sanity check.
Indeed, one can rewrite eq. (5.9) with tools from differential geometry
as
c \oplus \mathrm{gyr}[c, -a]((-a) \oplus b) = \exp_c(P_{a \to c}(\log_a(b))),    (5.10)

where $P_{x \to y} = (\lambda_x / \lambda_y)\, \mathrm{gyr}[y, -x]$ denotes the parallel transport along
the unique geodesic from x to y (see eq. (2.32)). The exp and log
maps of Riemannian geometry are related to the theory of gyrovector
spaces as mentioned in Chapter 2. We also mention again that, when
continuously deforming the hyperbolic space Dn into the Euclidean
space Rn , sending its curvature κ from −1 to 0 (i.e. the radius of
Dn from 1 to ∞), the Möbius operations ⊕κ , ⊖κ , ⊗κ , gyrκ recover their
respective Euclidean counterparts +, −, ·, Id. Hence, the analogy so-
lutions $d_1, d_2, m^t_{d_1 d_2}$ of eq. (5.9) would then all recover the Euclidean
formulation d = c + b − a.
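The gyro-operations involved in eq. (5.9) and the interpolation point m^t can be sketched as follows for the curvature −1 Poincaré ball; the Möbius closed forms follow the operations referenced in Chapter 2, and the function names and the way t is passed in are illustrative choices.

import numpy as np

def mobius_add(x, y):
    # Mobius addition in the Poincare ball of curvature -1.
    xy = np.dot(x, y)
    nx2, ny2 = np.dot(x, x), np.dot(y, y)
    num = (1.0 + 2.0 * xy + ny2) * x + (1.0 - nx2) * y
    return num / (1.0 + 2.0 * xy + nx2 * ny2)

def gyr(u, v, w):
    # Gyration via the identity gyr[u, v]w = -(u (+) v) (+) (u (+) (v (+) w)).
    return mobius_add(-mobius_add(u, v), mobius_add(u, mobius_add(v, w)))

def mobius_scalar(r, x, eps=1e-9):
    # Mobius scalar multiplication r (x) x.
    nx = np.linalg.norm(x)
    if nx < eps:
        return np.zeros_like(x)
    return np.tanh(r * np.arctanh(nx)) * x / nx

def analogy_point(a, b, c, t):
    # Eq. (5.9): d1, d2, and the point m^t on the geodesic between them.
    d1 = mobius_add(c, gyr(c, -a, mobius_add(-a, b)))
    d2 = mobius_add(b, gyr(b, -a, mobius_add(-a, c)))
    return mobius_add(d1, mobius_scalar(t, mobius_add(-d1, d2)))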

5.7 towards a principled score for entailment/hypernymy

We are interested in using our word embeddings to address another


task, namely hypernymy detection, i.e. to predict relations of type
is-a(v,w) such as is-a(dog, animal).
As a strong first baseline, we focus on the method of [NK17] that
uses a heuristic entailment score in order to predict whether u is-a v,
defined in terms of their embeddings u, v ∈ Dn as

\text{is-a}(u, v) := -(1 + \alpha(\|v\|_2 - \|u\|_2))\, d(u, v)    (5.11)

This choice is based on the intuition that the Euclidean norm should
encode generality/specificity of a concept/word. However, such a
choice depends on the parametrization and origin of the hyperbolic
space, which is problematic when the word embedding training loss
involves only the distance function.
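For reference, a small sketch of the heuristic score of eq. (5.11) is given below; the default value of α is only an assumption, and the setting actually used should be taken from [NK17].

import numpy as np

def poincare_dist(u, v, eps=1e-7):
    # Poincare ball distance (see Chapter 2).
    sq = np.sum((u - v) ** 2)
    return np.arccosh(1.0 + 2.0 * sq / ((1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2)) + eps))

def is_a_score(u, v, alpha=1000.0):
    # Heuristic entailment score of eq. (5.11) from [NK17].
    return -(1.0 + alpha * (np.linalg.norm(v) - np.linalg.norm(u))) * poincare_dist(u, v)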
A second baseline is that of Gaussian word embeddings [VM15] in
which words are modeled as normal probability distributions. The


authors propose using the entailment score is-a( P, Q) := −KL( P|| Q)


for continuous distributions P and Q representing two words. Their
argument is that a low KL( P|| Q) indicates that we can encode Q easily
as P, implying that Q entails P.
Our contribution is to use the connection explained in section 5.5
to introduce a novel principled score that can be applied on top of
our unsupervisedly learned Poincaré Glove embeddings to address the
task of hypernymy detection. This hypernymy score can leverage
either fully unsupervised information (i.e. word frequencies) or weakly-
supervised information (i.e. WordNet tree levels). However, this score
is presented in detail in [TBG19] and not in this dissertation.

5.8 embedding space hyperbolicity

Why would we embed words in a hyperbolic space? Given some


symbolic data, such as a vocabulary along with similarity measures
between words − in our case, co-occurrence counts Xij − can we under-
stand in a principled manner which geometry would represent it best?
Choosing the right metric space to embed words can be understood as
selecting the right inductive bias − an essential step.

δ-hyperbolicity. A particular quantity of interest describing qual-


itative aspects of metric spaces is the δ-hyperbolicity. This metric in-
troduced by [Gro87] quantifies the tree-likeness of a metric space.
Formally, the hyperbolicity δ( x, y, z, t) of a 4-tuple ( x, y, z, t) is defined
as half the difference between the biggest two of the following sums:
d( x, y) + d(z, t), d( x, z) + d(y, t), d( x, t) + d(y, z). The δ-hyperbolicity
of a metric space is defined as the supremum of these numbers over
all 4-tuples. Following [ADM14], we will denote this number by δworst
(it is a worst-case measure), and by δavg the average of these over all
4-tuples, when the space is a finite set. Intuitively, a low δavg of a finite
metric space characterizes that this space has an underlying hyperbolic
geometry, i.e. an approximate tree-like structure, and that the hyper-
bolic space would be well suited to isometrically embed it. We also
report the ratio $2\delta_{avg}/d_{avg}$ (invariant to metric scaling), where $d_{avg}$
is the average distance in the finite space, as suggested by [BCC15],
whose low value also characterizes the “hyperbolicness” of the space.
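A simple Monte-Carlo estimator of δavg can be sketched as follows, assuming a distance function d over a finite set of points; the sample size and helper names are illustrative and do not reproduce the exact estimation protocol used for the numbers reported below.

import random

def delta_of_quadruple(d, x, y, z, t):
    # Half the difference between the two largest of the three pairwise sums.
    sums = sorted([d(x, y) + d(z, t), d(x, z) + d(y, t), d(x, t) + d(y, z)])
    return (sums[2] - sums[1]) / 2.0

def delta_avg(points, d, num_samples=10000, seed=0):
    # Monte-Carlo average of the 4-tuple hyperbolicity over random samples.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        x, y, z, t = rng.sample(points, 4)
        total += delta_of_quadruple(d, x, y, z, t)
    return total / num_samples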


computing δavg . Since our methods are trained on a weighted


graph of co-occurrences, it makes sense to look for the corresponding
hyperbolicity δavg of this symbolic data. The lower this value, the more
hyperbolic is the underlying nature of the graph, thus indicating that
the hyperbolic space should be preferred over the Euclidean space for
embedding words. However, in order to do so, one needs to be provided
with a distance d(i, j) for each pair of words (i, j), while our symbolic
data is only made of similarity measures. Inspired by eq. (5.6), we as-
sociate to words i, j the distance 3 $h(d(i, j)) := -\log(X_{ij}) + b_i + b_j \geq 0$
with the choice $b_i := \log(X_i)$, i.e.

d(i, j) := h^{-1}\left(\log\left((X_i X_j)/X_{ij}\right)\right).    (5.12)

Table 5.1 shows values for different choices of h. The discrete metric
spaces we obtained for our symbolic data of co-occurrences appear to
have a very low hyperbolicity, i.e. to be very much “hyperbolic”, which
suggests to embed words in (products of) hyperbolic spaces. We report
in section 5.9 empirical results for h = (·)2 and h = cosh2 .

h(x)           log(x)    x        x^2     cosh(x)  cosh^2(x)  cosh^4(x)  cosh^10(x)
d_avg          18950.4   18.9505  4.3465  3.68     2.3596     1.7918     1.4947
δ_avg          8498.6    0.7685   0.088   0.0384   0.0167     0.0072     0.0026
2δ_avg/d_avg   0.8969    0.0811   0.0405  0.0209   0.0142     0.0081     0.0034

Table 5.1: Average distances, δ-hyperbolicities and ratios computed via
sampling for the metrics induced by different h functions, as
defined in eq. (5.12).

5.9 experiments

implementation authorship. The experimental results reported


in this section are based on the implementation of my co-authors (with
whom I share equal first authorship). For more details please refer to
the original publication [TBG19].

3 One can replace log( x ) with log(1 + x ) to avoid computing the logarithm of zero.


experimental setup. We trained all models on a corpus provided


by [LG14]; [LGD15] used in other word embeddings related work.
Corpus preprocessing is explained in the above references. The dataset
has been obtained from an English Wikipedia dump and contains 1.4
billion tokens. Words appearing fewer than one hundred times in the
corpus have been discarded, leaving 189,533 unique tokens. The co-
occurrence matrix contains approximately 700 million non-zero entries,
for a symmetric window size of 10. All models were trained for 50
epochs, and unless stated otherwise, on the full corpus of 189,533 word
types. For certain experiments, we also trained the model on a restricted
vocabulary of the 50,000 most frequent words, which we specify by
appending either “(190k)” or “(50k)” to the experiment’s name in the
table of results.

poincar é models, euclidean baselines. We report results for


both 100D embeddings trained in a 100D Poincaré ball, and for 50x2D
embeddings, which were trained in the Cartesian product of 50 2D
Poincaré balls. Note that in the case of both models, one word will be
represented by exactly 100 parameters. For the Poincaré models we
employ both h( x ) = x2 and h( x ) = cosh2 ( x ). All hyperbolic models
were optimized with Radagrad [BG19] as explained in section 5.5.
On the tasks of similarity and analogy, we compare against a 100D
Euclidean GloVe model which was trained using the hyperparameters
suggested in the original GloVe paper [PSM14]. The vanilla GloVe
model was optimized using Adagrad [DHS11]. For the Euclidean
baseline as well as for models with h( x ) = x2 we used a learning rate
of 0.05. For Poincaré models with h( x ) = cosh2 ( x ) we used a learning
rate of 0.01.

the initialization trick. We obtained improvement in the ma-


jority of the metrics when initializing our Poincaré model with pre-
trained parameters. These were obtained by first training the same
model on the restricted (50k) vocabulary, and then using this model
as an initialization for the full (190K) vocabulary. This will be referred
to as the “initialization trick”. For fairness, we also trained the vanilla
(Euclidean) GloVe model in the same fashion.

similarity. Word similarity is assessed using a number of well


established benchmarks shown in table 5.2. We summarize here our


main results, but more extensive experiments (including in lower di-


mensions) are shown in [TBG19]. We note that, with a single exception,
our 100D and 50x2D models outperform the vanilla Glove baselines in
all settings.

Table 5.2: Word similarity results for 100-dimensional models. High-
lighted: the best and the 2nd best. Implementation of these
experiments was done by co-authors in [TBG19].

Experiment name                                        Rare Word  WordSim  SimLex  SimVerb  MC      RG
100D Vanilla GloVe                                     0.3798     0.5901   0.2963  0.1675   0.6524  0.6894
100D Vanilla GloVe w/ init trick                       0.3787     0.5668   0.2964  0.1639   0.6562  0.6757
100D Poincaré GloVe, h(x) = cosh^2(x), w/ init trick   0.4187     0.6209   0.3208  0.1915   0.7833  0.7578
50x2D Poincaré GloVe, h(x) = cosh^2(x), w/ init trick  0.4276     0.6234   0.3181  0.189    0.8046  0.7597
50x2D Poincaré GloVe, h(x) = x^2, w/ init trick        0.4104     0.5782   0.3022  0.1685   0.7655  0.728

analogy. For word analogy, we evaluate on the Google bench-


mark [Mik+13a] and its two splits that contain semantic and syntactic
analogy queries. We also use a benchmark by MSR that is commonly
employed in other word embedding works. For the Euclidean
baselines we use 3COSADD [LGD15]. For our models, the solution d
to the problem “which d is to c, what b is to a” is selected as mtd1 d2 , as
described in section 5.6. In order to select the best t without overfitting
on the benchmark dataset, we used the same 2-fold cross-validation
method used by [LGD15, section 5.1], which resulted in selecting t = 0.3.
We report our main results in table 5.4. More extensive experiments in
various settings (including in lower dimensions) are shown in [TBG19].
We note that the vast majority of our models outperform the vanilla


Table 5.3: Nearest neighbors (in terms of Poincaré distance) for some
words using our 100D hyperbolic embedding model.
sixties seventies, eighties, nineties, 60s, 70s, 1960s, 80s, 90s, 1980s,
1970s
dance dancing, dances, music, singing, musical, performing, hip-
hop, pop, folk, dancers
daughter wife, married, mother, cousin, son, niece, granddaughter,
husband, sister, eldest
vapor vapour, refrigerant, liquid, condenses, supercooled, fluid,
gaseous, gases, droplet
ronaldo cristiano, ronaldinho, rivaldo, messi, zidane, romario, pele,
zinedine, xavi, robinho
mechanic electrician, fireman, machinist, welder, technician, builder,
janitor, trainer, brakeman
algebra algebras, homological, heyting, geometry, subalgebra,
quaternion, calculus, mathematics, unital, algebraic

Glove baselines, with the 100D hyperbolic embeddings being the abso-
lute best.


Table 5.4: Word analogy results for 100-dimensional models on the
Google and MSR datasets. Highlighted: the best and the
2nd best. Implementation of these experiments was done by
co-authors in [TBG19].

Experiment name                                        Method         SemG    SynG    G       MSR
100D Vanilla GloVe                                     3COSADD        0.6005  0.5869  0.5931  0.4868
100D Vanilla GloVe w/ init trick                       3COSADD        0.6427  0.595   0.6167  0.4826
100D Poincaré GloVe, h(x) = cosh^2(x), w/ init trick   Cosine dist    0.6641  0.6088  0.6339  0.4971
50x2D Poincaré GloVe, h(x) = x^2, w/ init trick        Poincaré dist  0.6582  0.6066  0.6300  0.4672
50x2D Poincaré GloVe, h(x) = cosh^2(x), w/ init trick  Poincaré dist  0.6048  0.6042  0.6045  0.4849

hypernymy evaluation. For hypernymy evaluation we use the


Hyperlex [Vul+17] and WBLess (subset of BLess) [BL11] datasets. We
classify all the methods in three categories depending on the supervi-
sion used for word embedding learning and for the hypernymy score,
respectively. For Hyperlex we report results in table 5.6 and use the
baseline scores reported in [NK17]; [Vul+17]. For WBLess we report
results in table 5.7 and use the baseline scores reported in [Ngu+17].


Table 5.5: Some words selected from the 100 nearest neighbors and
ordered according to the hypernymy score function for a
50x2D hyperbolic embedding model using h( x ) = x2 .
reptile amphibians, carnivore, crocodilian, fish-like, dinosaur, alli-
gator, triceratops
algebra mathematics, geometry, topology, relational, invertible, en-
domorphisms, quaternions
music performance, composition, contemporary, rock, jazz, elec-
troacoustic, trio
feeling sense, perception, thoughts, impression, emotion, fear,
shame, sorrow, joy

hypernymy results discussion. We first note that our fully un-


supervised 50x2D, h( x ) = x2 model outperforms all its corresponding
baselines, setting a new state-of-the-art on unsupervised WBLESS ac-
curacy and matching the previous state-of-the-art on unsupervised
HyperLex Spearman correlation.
Second, once a small amount of weak supervision is used for the
hypernymy score, we obtain significant improvements as shown in
the same tables. We note that this weak supervision is used only as a post-
processing step (after word embeddings are trained). Moreover, it
does not consist of hypernymy pairs, but only of 400 or 800 generic
and specific sets of words from WordNet. Even so, our unsupervisedly
learned embeddings are remarkably able to outperform all (except
WN-Poincaré) supervised embedding learning baselines on HyperLex
which have the great advantage of using the hypernymy pairs to train
the word embeddings.

which model to choose? While there is no single model that


outperforms all the baselines on all presented tasks, one can remark
that the model 50x2D, h( x ) = x2 , with the initialization trick obtains
state-of-the-art results on hypernymy detection and is close to the best
models for similarity and analogy (also Poincaré Glove models), but
almost constantly outperforming the vanilla Glove baseline on these.
This is the first model that can achieve competitive results on all these


Table 5.6: HyperLex results in terms of Spearman correlation for 3 dif-
ferent model types ordered according to their difficulty. Im-
plementation of these experiments was done by co-authors
in [TBG19].
Method ρ

i) Supervised embedding learning, Unsupervised hypernymy score

Order Embeddings [Ven+15] 0.191


PARAGRAM + CF 0.320
WN-Basic 0.240
WN-WuP 0.214
WN-LCh 0.214
WN-Eucl from [NK17] 0.389
WN-Poincaré from [NK17] 0.512

ii) Unsupervised embedding learning, Weakly-supervised hypernymy score

50x2D Poincaré GloVe, h( x ) = cosh2 ( x ), init trick (190k) 0.402


50x2D Poincaré GloVe, h( x ) = x2 , init trick (190k) 0.421

iii) Unsupervised embedding learning, Unsupervised hypernymy score

Word2Gauss-DistPos 0.206
SGNS-Deps 0.205
Frequency 0.279
SLQS-Slim 0.229
Vis-ID 0.253
DIVE-W∆S [Cha+18] 0.333
SBOW-PPMI-C∆S from [Cha+18] 0.345
50x2D Poincaré GloVe h( x ) = cosh2 ( x ) init trick (190k) 0.284
50x2D Poincaré GloVe, h( x ) = x2 , init trick (190k) 0.341


Table 5.7: WBLESS results in terms of accuracy for 3 different model
types ordered according to their difficulty. Implementation
of these experiments was done by co-authors in [TBG19].
Method Acc.

i) Supervised embedding learning, Unsupervised hypernymy score

[Wee+14] 0.75
WN-Poincaré from [NK17] 0.86
[Ngu+17] 0.87

ii) Unsupervised embedding learning, Weakly-supervised hypernymy score


50x2D Poincaré GloVe, h( x ) = cosh2 ( x ), init trick (190k) 0.749
50x2D Poincaré GloVe, h( x ) = x2 , init trick (190k) 0.790

iii) Unsupervised embedding learning, Unsupervised hypernymy score


SGNS from [Ngu+17] 0.48
[Wee+14] 0.58
50x2D Poincaré GloVe, h( x ) = cosh2 ( x ), init trick (190k) 0.575
50x2D Poincaré GloVe, h( x ) = x2 , init trick (190k) 0.652


three tasks, additionally offering interpretability via the connection to
Gaussian word embeddings.

5.10 summary

We propose to adapt the GloVe algorithm to hyperbolic spaces and


to leverage a connection between statistical manifolds of Gaussian
distributions and hyperbolic geometry, in order to better interpret
entailment relations between hyperbolic embeddings. We justify the
choice of products of hyperbolic spaces via this connection to Gaussian
distributions and via computations of the hyperbolicity of the symbolic
data upon which GloVe is based. Empirically, we present the first model
that can simultaneously obtain state-of-the-art (or close) results on the
three tasks of word similarity, analogy and hypernymy detection.
Future work includes jointly learning the curvature of the model,
together with the h function defining the geometry from co-occurrence
counts, as well as re-running experiments in the hyperboloid model,
which has been reported to lead to fewer numerical instabilities.

PROBABILISTIC GRAPHICAL MODELS FOR ENTITY
RESOLUTION
6
We now leave the hyperbolic geometry and turn our attention to
other non-Euclidean representations, namely for entities, i.e. entries in
a Knowledge Graph (KG). We choose to evaluate the quality of entity
representations using a popular downstream task related to text se-
mantic understanding, namely ED (or Entity Resolution (ER)) – the task
of resolving potential ambiguous textual references to entities in a KG.
Before we dive into neural network models for ED (in the next Chap-
ter 7) and non-metric spaces, we first need to investigate probabilistic
graphical models such as Conditional Random Field (CRF), plus approx-
imate learning and inference techniques specific to those. We devise
a novel model called Probabilistic Bag of Hyperlinks Model for Entity
Disambiguation (PBOH), achieving the best results on many datasets
and against a variety of methods. In the next Chapter 7 we will build
upon PBOH and design a fully differentiable deep neural network based
on truncated message passing inference that utilizes word and
entity embeddings and non-Euclidean similarity measures to further
push state-of-the-art ER performance.
The material presented here has in parts been published in the publi-
cation [Gan+16].

6.1 introduction

Many fundamental problems in natural language processing rely on


determining what entities appear in a given text. Commonly referenced
as entity linking and disambiguation, this step is a fundamental component
of many NLP tasks such as text understanding, automatic summariza-
tion, semantic search or machine translation. Name ambiguity, word
polysemy, context dependencies and a heavy-tailed distribution of
entities contribute to the complexity of this problem.
We here propose a probabilistic approach that makes use of an
effective graphical model to perform collective entity disambiguation.
Input mentions (i.e., linkable token spans) are disambiguated jointly


across an entire document by combining a document-level prior of


entity co-occurrences with local information captured from mentions
and their surrounding context. The model is based on simple sufficient
statistics extracted from data, thus relying on few parameters to be
learned.
Our method does not require extensive feature engineering, nor an
expensive training procedure. We use loopy belief propagation to
perform approximate inference. The low complexity of our model
makes this step sufficiently fast for real-time usage. We demonstrate
the accuracy of our approach on a wide range of benchmark datasets,
showing that it matches, and in many cases outperforms, existing
state-of-the-art methods.

Figure 6.1: An entity disambiguation problem showcasing five given


mentions and their potential entity candidates.

Digital systems are producing increasing amounts of data every day.


With daily global volumes of several terabytes of new textual content,
there is a growing need for automatic methods for text aggregation,
summarization, and, eventually, semantic understanding. Entity disam-
biguation is a key step towards these goals as it reveals the semantics
of spans of text that refer to real-world entities. In practice, this is
achieved by establishing a mapping between potentially ambiguous
surface forms of entities and their canonical representations in a KG such


as corresponding Wikipedia 1 articles or Freebase 2 entries. Figure 6.1


illustrates the difficulty of this task when dealing with real-world data.
The main challenges arise from word ambiguities inherent to natural
language: surface form synonymy, i.e., different spans of text referring
to the same entity, and homonymy, i.e., the same name being shared by
multiple entities.
We here describe and evaluate a novel light-weight and fast alterna-
tive to heavy machine-learning approaches for document-level entity
disambiguation with Wikipedia. Our model is primarily based on
simple empirical statistics acquired from a training dataset and relies
on a very small number of learned parameters. This has certain advan-
tages like a very fast training procedure that can be applied to massive
amounts of data, as well as a better understanding of the model com-
pared to increasingly popular deep learning architectures (e.g., He et
al. [He+13]). As a prerequisite, we assume that a given input set of
mentions was already discovered via a mention detection procedure 3 .
Our starting point is the natural assumption that each entity depends
(i) on its mention, (ii) its neighboring local contextual words, and (iii)
on other entities that appear in the same document.
In order to enforce these conditions, we rely on a conditional proba-
bilistic model that consists of two parts: (1) the likelihood of a candidate
entity given the referring token span and its surrounding context, and
(2) the prior joint distribution of the candidate entities corresponding to
all the mentions in a document. Our model relies on the max-product
algorithm to collectively infer entities for all mentions in a given docu-
ment.
We further illustrate these modeling decisions. In the example de-
picted in fig. 6.1, each highlighted mention constrains the set of possible
entity candidates to a limited size set, yet leaves a significant level of
ambiguity. However, there is one collective way of disambiguation
that is jointly consistent with all the chosen entities and supported
by contextual cues. Intuitively, the related entities Thomas Müller and
Germany national football team are likely to appear in the same doc-
ument, especially in the presence of contextual words related to soccer,
like “team” or “goal”.

1 http://en.wikipedia.org/
2 https://www.freebase.com/
3 For example, using a named-entity recognition system. However, note that our
approach is not restricted to named entities, but targets any Wikipedia entity.


Our main contributions are outlined below: (1) We employ rigorous


probabilistic semantics for the entity disambiguation problem by intro-
ducing a principled probabilistic graphical model that requires a simple
and fast training procedure. (2) At the core of our joint probabilistic
model, we derive a minimal set of potential functions that proficiently
explain statistics of observed training data. (3) Throughout a range of
experiments performed on several standard datasets using the Gerbil
platform [Usb+15], we demonstrate competitive or state of the art qual-
ity compared to some of the best existing approaches. (4) Moreover,
our training procedure is solely based on publicly available Wikipedia
hyperlink statistics and the method does not require extensive hy-
perparameter tuning, nor feature engineering, making this chapter
a self-contained manual of implementing an entity disambiguation
system from scratch.
The remainder of this chapter is structured as follows. Section 6.2
briefly discusses relevant ED literature. Section 6.3 formally introduces
our probabilistic graphical model and details the initialization and
learning procedure of the model’s parameters. Section 6.4 describes
the inference process used for collective entity resolution. Section 6.5
empirically demonstrates the merits of the proposed method on mul-
tiple standard collections of manually annotated documents. Finally,
in section 6.6, we conclude with a summary of our findings and an
overview of ongoing and future work.

6.2 related work

There is a substantial body of existing work dedicated to the task of


ED with Wikipedia (Wikification). We can identify four major paradigms
of how this challenge is approached.
Local models consider the individual context of each entity mention in
isolation in order to reduce the size of the decision space. In one of the
early ED methods, Mihalcea and Csomai [MC07] propose an entity dis-
ambiguation scheme based on similarity statistics between the mention
context and the entity’s Wikipedia page. Milne and Witten [MW08]
further refine their scheme with special focus on the mention detection
step. Bunescu and Pasca [BP06] present a Wikipedia-driven approach,
making use of manually created resources such as redirect and disam-
biguation pages. Dredze et al. [Dre+10] cast the ED task as a retrieval


problem, treating mentions and their contexts as queries, and ranking


candidate entities according to their likelihood of being referred to.
Global models attempt to jointly disambiguate all mentions in a docu-
ment based on the assumption that the underlying entities are corre-
lated and consistent with the main topic of the document. While this
approach tends to result in superior accuracy, the space of possible
entity assignments grows combinatorially. As a consequence, many
approaches in this group rely on approximate inference mechanisms.
[Cuc07] use high-dimensional vector space representations of candidate
entities and attempts to iteratively choose candidates that optimize
the mutual proximity to existing candidates. [Kul+09] exploit topical
information about candidate entities and try to harmonize these topics
across all assigned entities. [Rat+11] prune the list of entity mentions
using support vector machines trained on a range of similarity and
term overlap features between entity representations. [FS10b] focus on
short documents such as tweets or search engine snippets. Based on
evidence across all mentions, the authors employ a voting scheme for
entity disambiguation. [CR13] and [Sin+13] describe models for jointly
capturing the interdependence between the tasks of entity tagging, rela-
tion extraction and co-reference resolution. Similarly, [DK14] describe a
graphical model for collectively addressing the tasks of named entity
recognition, entity disambiguation and co-reference resolution.
Graph-based models establish relationships between candidate enti-
ties and mentions using structural models. For inference, various
approaches are employed, ranging from densest graph estimation algo-
rithms ([Hof+11a]) to graph traversal methods such as random graph
walks ([GB14]; [HSZ11]). In a similar fashion, these techniques can be
combined to enhance the quality of both ED and word sense disam-
biguation in a synergistic solution ([MRN14]).
The above approaches are limited because they assume a single
topic per document. Naturally, topic modelling can be used for entity
disambiguation by attempting to harmonize the individual distribution
of latent topics across candidate entities. [HC14] and [PP11] rely on
Latent Dirichlet Allocation (LDA) and compare the resulting topic
distribution of the input document to the topic distributions of the
disambiguated entities’ Wikipedia pages. [HS12] propose a joint model
of mention context compatibility and topic coherence, allowing them
to simultaneously draw from both local (terms, mentions) as well as
global (topic distributions) information. [Kat+11] use a semi-supervised


hierarchical LDA model based on a wide range of features extracted


from Wikipedia pages and topic hierarchies.
In contrast to previous work on this problem, our method exploits co-
occurrence statistics in a fully probabilistic manner using a graph-based
model that addresses collective entity disambiguation. It combines a
clean and light-weight probabilistic model with an elegant, real-time
inference algorithm. An advantage over increasingly popular deep
learning architectures for ED (e.g. [He+13]; [Sun+15]) is the speed of
our training procedure that relies on count statistics from data and
that learns only very few parameters. State-of-the-art accuracy is achieved
without the need for special-purpose computational heuristics.

6.3 model

In this section, we formally define the ED task that we address in this


work and describe our modeling approach in detail.

Problem Definition and Formulation

Let E be a knowledge base (KB) of entities, V a finite dictionary of


phrases or names and C a context representation. Formally, we seek a
mapping F : (V , C)n → E n , that takes as input a sequence of linkable
mentions m = (m1 , . . . , mn ) along with their contexts c = (c1 , . . . , cn )
and produces a joint entity assignment e = (e1 , . . . , en ). Here n refers
to the number of linkable spans in a document. Our problem is also
known as entity disambiguation or link generation in the literature 4 .
We can construct such a mapping F in a probabilistic approach, by
learning a conditional probability model p(e|m, c) from data and then
employing (approximate) probabilistic inference in order to find the
maximum a posteriori (MAP) assignment, hence:

F(\mathbf{m}, \mathbf{c}) := \arg\max_{\mathbf{e} \in \mathcal{E}^n} p(\mathbf{e}|\mathbf{m}, \mathbf{c}).    (6.1)

In the sequel, we describe how to estimate such a model from a corpus


of entity-linked documents. Finally, we show in section 6.4 how to

4 Note
that we do not address the issues of mention detection or nil identification in
this work. Rather, our input is a document along with a fixed set of linkable mentions
corresponding to existing KB entities.


apply belief propagation (max-product) for approximate inference in


this model.

Maximum Entropy Models

Assume a corpus of entity-linked documents is available. Specifically,


we used the set of Wikipedia pages together with their respective Wiki
hyperlinks. These hyperlinks are considered ground truth annotations,
the mention being the linked span of text and the truth entity being
the Wikipedia page it refers to. One can extract two kinds of basic
statistics from such a corpus: First, counts of how often each entity was
referred to by a specific name. Second, pairwise co-occurrence counts
for entities in documents. Our fundamental conjecture is that most of
the relevant information needed for entity disambiguation is contained
in these counts, that they are sufficient statistics. We thus request that our
probability model reproduces these counts in expectation. As this alone
typically yields an ill-defined problem, we follow the maximum entropy
principle of Jaynes [Jay82]: Among the feasible set of distributions we
favor the one with maximal entropy.
Formally, let D be an entity-linked document collection. Ignoring
mention contexts for now, we extract for each document d ∈ D a
sequence of mentions m(d) and their corresponding target entities e(d) ,
both of length n(d) . Assuming exchangeability of random variables
within these sequences, we reduce each (e, m) to statistics (or features)
about mention-entity and entity-entity co-occurrence as follows:

\phi_{e,m}(\mathbf{e}, \mathbf{m}) := \sum_{i=1}^{n} \mathbb{1}[e_i = e] \cdot \mathbb{1}[m_i = m], \quad \forall (e, m) \in \mathcal{E} \times \mathcal{V}    (6.2)

\psi_{\{e,e'\}}(\mathbf{e}) := \sum_{i<j} \mathbb{1}[\{e_i, e_j\} = \{e, e'\}], \quad \forall e, e' \in \mathcal{E},    (6.3)

where 1[·] is the indicator function. Note that we use the subscript
notation {e, e0 } for ψ to take into account the symmetry in e, e0 as well
as the fact that one may have e = e0 .
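As an illustration, these statistics can be accumulated from an entity-linked corpus as sketched below; the representation of a document as a list of (mention, entity) pairs is an assumption made only for this example, and dividing the resulting counts by the number of documents yields the empirical averages introduced next.

from collections import Counter
from itertools import combinations

def sufficient_statistics(documents):
    # documents: list of documents, each a list of (mention, entity) pairs.
    phi = Counter()  # (entity, mention) -> count, cf. eq. (6.2)
    psi = Counter()  # unordered entity pair -> count, cf. eq. (6.3)
    for doc in documents:
        for mention, entity in doc:
            phi[(entity, mention)] += 1
        entities = [entity for _, entity in doc]
        for e_i, e_j in combinations(entities, 2):
            psi[frozenset((e_i, e_j))] += 1
    return phi, psi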


The document collection provides us with empirical estimates for


the expectation of these statistics under an i.i.d. sampling model for
documents, namely the averages

\phi_{e,m}(\mathcal{D}) := \frac{1}{|\mathcal{D}|} \sum_{d \in \mathcal{D}} \phi_{e,m}(\mathbf{e}^{(d)}, \mathbf{m}^{(d)}),    (6.4)

\psi_{\{e,e'\}}(\mathcal{D}) := \frac{1}{|\mathcal{D}|} \sum_{d \in \mathcal{D}} \psi_{\{e,e'\}}(\mathbf{e}^{(d)}).    (6.5)

Note that in entity disambiguation, the mention sequence m is al-


ways considered given, while we seek to predict the corresponding
entity sequence e. It is thus not necessary to try to model the joint
distribution p(e, m), but sufficient to construct a conditional model
p(e|m). Following Berger et al. [BPP96] this can be accomplished by
taking the empirical distribution p(m|D) of mention sequences and
combining it with a conditional model via p(e, m) = p(e|m) · p(m|D).
We then require that:

\mathbb{E}_p[\phi_{e,m}] = \phi_{e,m}(\mathcal{D}) \quad \text{and} \quad \mathbb{E}_p[\psi_{\{e,e'\}}] = \psi_{\{e,e'\}}(\mathcal{D}),    (6.6)

which yields $|\mathcal{E}| \cdot |\mathcal{V}| + \binom{|\mathcal{E}|}{2} + |\mathcal{E}|$ moment constraints on p(e|m).


The maximum entropy distributions fulfilling the constraints stated
in eq. (6.6) form a conditional exponential family for which
φ(·, m) and ψ(·, ·) are sufficient statistics. We thus know that there
are canonical parameters ρe,m and λ{e,e0 } (formally corresponding to
Lagrange multipliers) such that the maximum entropy distribution can
be written as

p(\mathbf{e}|\mathbf{m}; \rho, \lambda) = \frac{1}{Z(\mathbf{m})} \exp\left[\langle \rho, \phi(\mathbf{e}, \mathbf{m}) \rangle + \langle \lambda, \psi(\mathbf{e}) \rangle\right]    (6.7)

where Z(m) is the partition function

Z(\mathbf{m}) := \sum_{\mathbf{e} \in \mathcal{E}^n} \exp\left[\langle \rho, \phi(\mathbf{e}, \mathbf{m}) \rangle + \langle \lambda, \psi(\mathbf{e}) \rangle\right].    (6.8)

Here we interpret (e, m) and {e, e0 } as multi-indices and suggestively
define the shorthands

\langle \rho, \phi \rangle := \sum_{e,m} \rho_{e,m}\, \phi_{e,m}, \qquad \langle \lambda, \psi \rangle := \sum_{\{e,e'\}} \lambda_{\{e,e'\}}\, \psi_{\{e,e'\}}.    (6.9)


Note that we can switch between the statistics view and the raw data
view by observing that

\langle \rho, \phi(\mathbf{e}, \mathbf{m}) \rangle = \sum_{i=1}^{n} \rho_{e_i, m_i}, \qquad \langle \lambda, \psi(\mathbf{e}) \rangle = \sum_{i<j} \lambda_{\{e_i, e_j\}}.    (6.10)

While the maximum entropy principle applied to our fundamental


conjecture restricts the form of our model to a finite-dimensional expo-
nential family, we need to investigate ways of finding the optimal or –
as we will see – an approximately optimal distribution in this family.
To that extent, we first re-interpret the obtained model as a factor graph
model.

Markov Network and Factor Graph

Complementary to the maximum entropy estimation perspective, we


want to present a view on our model in terms of probabilistic graphical
models and factor graphs. Inspecting eq. (6.7) and interpreting
φ and ψ as potential functions, we can recover a Markov network that
makes conditional independence assumptions of the following type: an
entity link ei and a mention m j with i 6= j are independent, given mi
and e−i , where e−i denotes the set of entity variables in the document
excluding ei . This means that a mention m j only influences a variable ei
through the intermediate variable e j . However, the functional form in
eq. (6.7) goes beyond these conditional independences in that it
limits the order of interaction among the variables. A variable ei inter-
acts with neighbors in its Markov blanket through pairwise potentials.
In terms of a factor graph decomposition, p(e|m) decomposes into func-
tions of two arguments only, modeling pairwise interactions between
entities on one hand, and between entities and their corresponding
mentions on the other hand.
We emphasize the factor model view by rewriting eq. (6.7) as

p(\mathbf{e}|\mathbf{m}; \rho, \lambda) \propto \prod_{i} \exp\left[\rho_{e_i, m_i}\right] \cdot \prod_{i<j} \exp\left[\lambda_{\{e_i, e_j\}}\right]    (6.11)

where we think of ρ and λ as functions

\rho: \mathcal{E} \times \mathcal{V} \to \mathbb{R}, \quad (e, m) \mapsto \rho_{e,m}
\lambda: \mathcal{E} \cup \mathcal{E}^2 \to \mathbb{R}, \quad \{e, e'\} \mapsto \lambda_{\{e,e'\}}



Figure 6.2: Proposed factor graph for a document with four mentions.
Each mention node mi is paired with its corresponding
entity node Ei , while all entity nodes are connected through
entity-entity pair factors.

An example of a factor graph (n = 4) is shown in fig. 6.2. We will


investigate in the sequel how the factor graph structure can be further
exploited.

(Pseudo–)Likelihood Maximization

While the maximum entropy approach directly motivates the expo-


nential form of eq. (6.7) and is amenable to a plausible factor
graph interpretation, it does not by itself suggest an efficient parameter
fitting algorithm. As is known by convex duality, the optimal parame-
ters can be obtained by maximizing the conditional likelihood of the
model under the data,

L(ρ, λ; D) = ∑ log p(e(d) |m(d) ; ρ, λ) (6.12)


d

However, specialized algorithms for maximum entropy estimation such


as generalized iterative scaling [DR72] are known to be slow, whereas
gradient-based methods require the computation of gradients of L,
which involves evaluating expectations with regard to the model, since

\nabla_\rho \log Z(\mathbf{m}) = \mathbb{E}_p[\phi(\mathbf{e}, \mathbf{m})], \qquad \nabla_\lambda \log Z(\mathbf{m}) = \mathbb{E}_p[\psi(\mathbf{e})].    (6.13)

The exact inference problem of computing these model expectations,


however, is not generally tractable due to the pairwise couplings
through the ψ-statistics.

As an alternative to maximizing the likelihood in eq. (6.12),
we have investigated an approximation known as pseudo-likelihood


maximization [SM09]; [Vis+06]. Its main benefits are low computational


complexity, simplicity and practical success. Switching to the Markov
network view, the pseudo-likelihood estimator predicts each variable
conditioned on the value of all variables in its Markov blanket. The
latter consists of the minimal set of variables that renders a variable
conditionally independent of everything else. In our case the Markov
blanket consists of all variables that share a factor with a given variable.
Consequently, the Markov blanket of ei is N (ei ) := (mi , e−i ). The
posterior is then approximated in the pseudo-likelihood approach as:

\tilde{p}(\mathbf{e}|\mathbf{m}; \rho, \lambda) := \prod_{i=1}^{n} p(e_i | \mathcal{N}(e_i); \rho, \lambda),    (6.14)

which results in the tractable log-likelihood function

\tilde{L}(\rho, \lambda; \mathcal{D}) := \sum_{d \in \mathcal{D}} \sum_{i=1}^{n^{(d)}} \log p(e_i^{(d)} | \mathcal{N}(e_i^{(d)}); \rho, \lambda).    (6.15)

Introducing additional L2-norm penalties $\gamma(\|\lambda\|_2^2 + \|\rho\|_2^2)$ to further
regularize L̃, we have utilized parallel stochastic gradient descent
(SGD) [Rec+11] with sparse updates to learn parameters ρ, λ. From a
practical perspective, we only keep for each token span m parameters
ρe,m for the most frequently observed entities e. Moreover, we only
use λ{e,e0 } for entity pairs (e, e0 ) that co-occurred together a sufficient
number of times in the collection D 5 . As we will discuss in more
detail in section 6.5, our experimental findings suggest this brute-force
learning approach to be somewhat ineffective, which has motivated us
to develop simpler, yet more effective plug-in estimators as described
below.

Bethe Approximation

The major computational difficulty with our model lies in the pair-
wise couplings between entities and the fact that these couplings are
dense: The Markov dependency graph between different entity links in
a document is always a complete graph. Let us consider what would
happen, if the dependency structure were loop-free, i.e., it would form
5 Forthe Wikipedia collection, even after these pruning steps, we ended up with
more than 50 million parameters in total.


a tree. Then we could rewrite the prior probability in terms of marginal
distributions in the so-called Bethe form. Encoding the tree structure in
a symmetric relation T , we would get

p(\mathbf{e}) = \frac{\prod_{\{i,j\} \in T} p(e_i, e_j)}{\prod_{i=1}^{n} p(e_i)^{d_i - 1}}, \qquad d_i := |\{j : \{i,j\} \in T\}|.    (6.16)
The Bethe approximation [YFW00] pursues the idea of using the above
representation as an unnormalized approximation for p(e), even when
the Markov network has cycles. How does this relate to the exponential
form in eq. (6.7)? By simple pattern matching, we see that if we
choose

\lambda_{\{e,e'\}} = \log\left(\frac{p(e, e')}{p(e)\, p(e')}\right), \quad \forall e, e' \in \mathcal{E}    (6.17)

we can apply eq. (6.16) to get an approximate distribution

\bar{p}(\mathbf{e}) \propto \frac{\prod_{i<j} p(e_i, e_j)}{\prod_{i=1}^{n} p(e_i)^{n-2}} = \prod_{i=1}^{n} p(e_i) \prod_{i<j} \frac{p(e_i, e_j)}{p(e_i)\, p(e_j)} = \exp\left[\sum_{i} \log p(e_i) + \sum_{i<j} \lambda_{\{e_i, e_j\}}\right],    (6.18)

where we see the same exponential form in λ appearing as in eq. (6.10).
We complete this argument by observing that with

\rho_{e,m} = \log p(e) + \log p(m|e)    (6.19)

we obtain a representation of a joint distribution that exactly matches
the form in eq. (6.7).

What have we gained so far? We started from the desire of con-


structing a model that would agree with the observed data on the
co-occurrence probabilities of token spans and their linked entities as
well as on the co-link probability of entity pairs within a document.
This has led to the conditional exponential family in eq. (6.7).
We have then proposed pseudo-likelihood maximization as a way to
arrive at a tractable learning algorithm to try to fit the massive amount
of parameters ρ and λ. Alternatively, we have now seen that a Bethe
approximation of the joint prior p(e) yields a conditional distribution
p(e|m) that (i) is a member of the same exponential family, (ii) has


explicit formulas for how to choose the parameters from pairwise


marginals, and (iii) would be exact in the case of a dependency tree.
We claim that the benefits of computational simplicity together with the
correctness guarantee for non-dense dependency networks outweigh
the approximation loss, relative to the model with the best generaliza-
tion performance within the conditional exponential family. In order to
close the suboptimality gap further, we suggest refinements below.

Parameter Calibration

With the previous suggestion, one issue comes into play: The total
contribution coming from the pairwise interactions between entities will
scale with $\binom{n}{2}$, while the entity–mention compatibility contributions will
scale with n, the total number of mentions. This is a direct observation
of the number of terms contributing to the sums in eq. (6.10).
However, for practical reasons, it is somewhat implausible that, as n
grows, the prior p(e) should dominate and the contribution of the
likelihood term should vanish. The model is not well-calibrated with
regard to n.
We propose to correct for this effect by adding a normalization factor
to the λ-parameters by replacing eq. (6.17) with:

\lambda_{e,e'} = \frac{2}{n-1} \log\left(\frac{p(e, e')}{p(e) \cdot p(e')}\right), \quad \forall e, e' \in \mathcal{E}    (6.20)

where now these parameters scale inversely with n, the number of
entity links in a document, making the corresponding sum in eq. (6.7)
scale with n. With this simple change, a substantial accuracy
improvement was observed empirically, the details of which are re-
ported in our experiments.
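The resulting plug-in estimators can be sketched as follows; the probabilities are assumed to be the (smoothed) empirical marginals discussed later in this chapter, and the function names are ours.

import math

def rho_param(p_entity, p_mention_given_entity):
    # Unary plug-in parameter of eq. (6.19).
    return math.log(p_entity) + math.log(p_mention_given_entity)

def lambda_param(p_pair, p_e, p_e_prime, n=None):
    # Pairwise parameter: eq. (6.17), or its re-calibrated form eq. (6.20)
    # when the number of mentions n in the document is provided.
    pmi = math.log(p_pair / (p_e * p_e_prime))
    return pmi if n is None else 2.0 / (n - 1.0) * pmi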

The re-calibration in eq. (6.20) can also be justified by the


following combinatorial argument: For a given set Y of random vari-
ables, define a Y-cycle as a graph containing as nodes all variables in
Y, each with degree exactly 2, connected in a single cycle. Let Ξ be the
set enumerating all possible Y-cycles. Then, |Ξ| = (n − 1)!, where n is
the size of Y.


In our case, if the entity variables e per document had formed
a cycle of length n instead of a complete subgraph, the Bethe approxi-
mation would have been written as:

\bar{p}_\pi(\mathbf{e}) \propto \frac{\prod_{(i,j) \in E(\pi)} p(e_i, e_j)}{\prod_{i} p(e_i)}, \quad \forall \pi \in \Xi    (6.21)
where E(π ) is the set of edges of the e-cycle π. However, as we do
not desire to further constrain our graph with additional independence
assumptions, we propose to approximate the joint prior p(e) by the
average of the Bethe approximation of all possible π, that is
1
|Ξ| π∑
log p̄(e) ≈ log p̄π (e) . (6.22)
∈Ξ

Since each pair (ei , e j ) would appear in exactly 2(n − 2)! e-cycles, one
can derive the final approximation:

\bar{p}(\mathbf{e}) \approx \frac{\prod_{i<j} p(e_i, e_j)^{\frac{2}{n-1}}}{\prod_{i} p(e_i)}.    (6.23)
Distributing marginal probabilities over the parameters starting from
eq. (6.23) and applying a similar argument as in eq. (6.18)
results in the assignment given by eq. (6.20). While the
above line of argument is not a strict mathematical derivation, we be-
lieve this to shed further light on the empirically observed effectiveness
of the parameter re-scaling.

Integrating Context

The model that we have discussed so far does not consider the
local context of a mention. This is a powerful source of information
that a competitive ED system should utilize. For example, words like
“computer”, “company” or “device” are more likely to appear near
references of the entity Apple Inc. than of the entity Apple fruit. We
demonstrate in this section how this integration can be easily done
in a principled way on top of the current probabilistic model. This
showcases the extensibility of our approach. Enhancing our model with
additional knowledge such as entity categories or word co-reference
can also be done in a rigorous way, so we hope that this provides a
template for future extensions.


As stated in section 6.3, for each mention mi in a document, we


maintain a context representation ci consisting of the bag of words sur-
rounding the mention within a window of length K 6 . Hence, ci can be
viewed as an additional random variable with an observed outcome. At
this stage, we make additional reasonable independence assumptions
that increase tractability of our model. First, we assume that, knowing
the identity of the linked entity ei , the mention token span mi is just
the surface form of the entity, so it brings no additional information
for the generative process describing the surrounding context ci . For-
mally, this means that mi and ci are conditionally independent given ei .
Consequently, we obtain a factorial expression for the joint model

p(\mathbf{e}, \mathbf{m}, \mathbf{c}) = p(\mathbf{e})\, p(\mathbf{m}, \mathbf{c}|\mathbf{e}) = p(\mathbf{e}) \prod_{i=1}^{n} p(m_i|e_i)\, p(c_i|e_i)    (6.24)

This is a simple extension of the previous factor graph that includes
context variables. Second, we assume conditional independence of
the words in ci given an entity ei , which lets us factorize the context
probabilities as

p(c_i | e_i) = \prod_{w_j \in c_i} p(w_j | e_i).    (6.25)

Note that this assumption is commonly made in models using bag-of-
words representations or naïve Bayes classifiers.
While this completes the argument from a joint model point of view,
we need to consider one more aspect for the conditional distribution
p(e|m, c) that we are interested in. If we cannot afford (computationally
as well as with regard to training data size) a full-blown discriminative
learning approach, then how do we balance the relative influence of the
context ci and the mention token span mi on ei ? For instance, the effect
of ci will depend on the chosen window size K, which is not realistic.
To address this issue, we resort to a hybrid approach, where, in the
spirit of the Bethe approximation, we continue to express our model in
terms of simple marginal distributions that can be easily estimated in-
dependently from data, yet that allow for a small number of parameters
(in our case “small” equals 2) to be chosen to optimize the condi-
tional log-likelihood p(e|m, c). We thus introduce weights ζ and τ that
6 Throughout our experiments, we used a context window of size K = 100, intu-
itively chosen and without extensive validation.


control the importance of the context factors and, respectively, of the
entity-entity interaction factors. Putting eqs. (6.19), (6.20), (6.24)
and (6.25) together, we arrive at the
final model that will be subsequently referred to as the PBOH model
(Probabilistic Bag of Hyperlinks):

\log p(\mathbf{e}|\mathbf{m}, \mathbf{c}) = \sum_{i=1}^{n} \left( \log p(e_i|m_i) + \zeta \sum_{w_j \in c_i} \log p(w_j|e_i) \right)
    + \frac{2\tau}{n-1} \sum_{i<j} \log\left(\frac{p(e_i, e_j)}{p(e_i)\, p(e_j)}\right) + \text{const}.    (6.26)

Here we used the identity p(m|e) p(e) = p(e|m) p(m) and absorbed all
log p(m) terms in the constant. We use grid-search on a validation
set for the remaining problem of optimizing over the parameters ζ, τ.
Details are provided in section 6.5.
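To make the roles of the two weights concrete, the following Python sketch (an illustration of Eq. 6.26, not the code used in our experiments) evaluates the unnormalized log-score of one candidate entity assignment; the probability tables passed in as dictionaries are placeholders that would hold the smoothed estimates described in the next subsection.

import math

def pboh_log_score(entities, mentions, contexts,
                   p_e_given_m, p_w_given_e, p_pair, p_e,
                   zeta=0.075, tau=0.5):
    """Unnormalized log p(e | m, c) of Eq. 6.26 for one candidate assignment.

    entities, mentions, contexts are aligned lists; the probability
    arguments are dictionaries holding the (smoothed) empirical estimates.
    """
    n = len(entities)
    score = 0.0
    # local factors: mention-entity prior and context words
    for e, m, c in zip(entities, mentions, contexts):
        score += math.log(p_e_given_m[(e, m)])
        score += zeta * sum(math.log(p_w_given_e[(w, e)]) for w in c)
    # pairwise coherence factors, re-scaled by 2*tau / (n - 1)
    for i in range(n):
        for j in range(i + 1, n):
            ei, ej = entities[i], entities[j]
            score += 2.0 * tau / (n - 1) * math.log(
                p_pair[(ei, ej)] / (p_e[ei] * p_e[ej]))
    return score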

Smoothing Empirical Probabilities

In order to estimate the probabilities involved in Eq. 6.26, we


rely on an entity annotated corpus of text documents, e.g., Wikipedia
Web pages together with their hyperlinks which we view as ground
truth annotations. From this corpus, we derive empirical probabilities
for a name-to-entity dictionary p̂(m|e) based on counting how many
times an entity appeared referenced by a given name 7 . We also compute
the pairwise probabilities p̂(e, e′), obtained by counting the pairwise
co-occurrence of entities e and e′ within the same document. Similarly,
we obtain empirical values for the marginals p̂(e) = ∑_{e′} p̂(e, e′) and
for the context word-entity statistics p̂(w|e).
In the absence of huge amounts of data, estimating such probabilities
from counts is subject to sparsity. For instance, in our statistics, there
are 8 times more distinct pairs of entities that co-occur in at most 3
Wikipedia documents compared to the total number of distinct pairs
of entities that appear together in at least 4 documents. Thus, it is
expected that the heavy tail of infrequent pairs of entities will have a
strong impact on the accuracy of our system.

7 In our implementation we summed the mention-entity counts from Wikipedia hyperlinks with the Crosswikis counts [SC12a].


Traditionally, various smoothing techniques are employed to address


sparsity issues arising commonly in areas such as natural language pro-
cessing. Out of the wealth of methods, we decided to use the absolute
discounting smoothing technique [ZL04] that involves interpolation
of higher and lower order (backoff) models. In our case, whenever
insufficient data is available for a pair of entities (e, e′), we assume the
two entities are drawn from independent distributions. Thus, if we
denote by N(e, e′) the total number of corpus documents that link both
e and e′, and by N_ep the total number of pairs of entities referenced in
each document, then the final formula for the smoothed entity pairwise
probabilities is:

\tilde{p}(e, e') = \frac{\max(N(e, e') - \delta,\, 0)}{N_{ep}} + (1 - \mu_e)\, \hat{p}(e)\, \hat{p}(e')   (6.27)

where δ ∈ [0, 1] is a fixed discount and µ_e is a constant that ensures that
\sum_{e} \sum_{e'} \tilde{p}(e, e') = 1. δ was set by performing a coarse grid search on a
validation set. The best δ value was found to be 0.5.
The word-entity empirical probabilities p̂(w|e) were computed based
on the Wikipedia corpus by counting the frequency with which word
w appears in the context windows of size K around the hyperlinks
pointing to e. In order to avoid memory explosion, we only considered
the entity-words pairs for which these counts are at least 3. These
empirical estimates are also sparse, so we used absolute discounting
smoothing for their correction by backing off to the unbiased estimates
p̂(w). The latter can be much more accurately estimated from any text
corpus. Finally, we obtain:

\tilde{p}(w \mid e) = \frac{\max(N(w, e) - \xi,\, 0)}{N_{wp}} + (1 - \mu_w)\, \hat{p}(w) .   (6.28)

Again ξ ∈ [0, 1] was optimized by grid search to be 0.5.
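As an illustration of how Eq. 6.27 can be applied to raw counts, the following Python sketch (ours, not the released implementation) builds the smoothed pair distribution; the choice of µ_e below is one way to make the distribution sum to one, consistent with the text, and the analogous construction with ξ, N_wp and p̂(w) yields Eq. 6.28.

def smoothed_pair_prob(count_pair, n_ep, p_hat, delta=0.5):
    """Absolute-discounting smoothing of entity-pair probabilities (Eq. 6.27).

    count_pair : dict mapping (e, e') to the raw co-occurrence count N(e, e')
    n_ep       : total number of entity-pair co-occurrences, N_ep
    p_hat      : dict with the empirical marginals p-hat(e)
    """
    # probability mass removed from the observed pairs by the discount delta;
    # choosing mu_e this way makes the smoothed distribution sum to one
    removed = sum(min(c, delta) for c in count_pair.values())
    mu_e = 1.0 - removed / n_ep

    def p_tilde(e1, e2):
        high_order = max(count_pair.get((e1, e2), 0.0) - delta, 0.0) / n_ep
        backoff = (1.0 - mu_e) * p_hat[e1] * p_hat[e2]   # independence backoff
        return high_order + backoff

    return p_tilde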

6.4 inference

After introducing our model and showing how to train it in the pre-
vious section, we now explain the inference process used for prediction.


Candidate Selection

At test time, for each mention to be disambiguated, we first select


a set of potential candidates by considering the top R ranked entities
based on the local mention-entity probability dictionary p̂(e|m). We
found R = 64 to be a good compromise between efficiency and accuracy
loss. Second, we want to keep the average number of candidates per
mention as small as possible in order to reduce the running time
which is quadratic in this number (see the next section for details).
Consequently, we further limit the number of candidates per mention
by keeping only the top 10 entity candidates re-ranked by the local
mention-context-entity compatibility defined as

\log p(e_i \mid m_i, c_i) = \log p(e_i \mid m_i) + \zeta \sum_{w_j \in c_i} \log p(w_j \mid e_i) + \mathrm{const} .   (6.29)

These pruning heuristics result in a significantly improved running


time at an insignificant accuracy loss.
If the given mention is not found in our map p̂(e|m), we try to replace
it by the closest name in this dictionary. Such a name is picked only
if the Jaccard distance between the set of letter trigrams of these two
strings is smaller than a threshold that we empirically picked as 0.5.
Otherwise, the mention is not linked at all.
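A minimal Python sketch of this candidate selection step is given below (illustrative only; the second re-ranking stage based on Eq. 6.29 is omitted, and the function and variable names are ours, not those of the released code).

def letter_trigrams(s):
    s = s.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}

def jaccard_distance(a, b):
    union = len(a | b)
    return 1.0 - len(a & b) / union if union else 1.0

def select_candidates(mention, p_e_given_m, top_r=64, threshold=0.5):
    """First candidate-selection stage of section 6.4.

    p_e_given_m : dict mapping a surface name to a list of (entity, prob)
                  pairs sorted by decreasing p-hat(e|m)
    """
    name = mention
    if name not in p_e_given_m:
        # fall back to the closest dictionary name by letter-trigram Jaccard distance
        tri = letter_trigrams(name)
        best = min(p_e_given_m,
                   key=lambda n: jaccard_distance(tri, letter_trigrams(n)))
        if jaccard_distance(tri, letter_trigrams(best)) >= threshold:
            return []                      # the mention is left unlinked
        name = best
    return [e for e, _ in p_e_given_m[name][:top_r]]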

Belief Propagation

Collectively disambiguating all mentions in a text involves iterating


through an exponential number of possible entity resolutions. Exact
inference in general graphical models is NP-hard, therefore approx-
imations are employed. We propose solving the inference problem
through the loopy belief propagation (LBP) [MWJ99a] technique, using
the max-product algorithm that approximates the MAP solution in a
run-time polynomial in n, the number of input mentions. For the sake
of brevity, we only present the algorithm for the maximum entropy
model described by Eq. 6.7; a similar approach was used for the enhanced PBOH model given by Eq. 6.26.
Our proposed graphical model is a fully connected graph where
each node corresponds to an entity random variable. Unary potentials
exp(ρ_{m,e}) model the entity-mention compatibility, while pairwise potentials exp(λ_{\{e,e′\}}) express entity-entity correlations. For the posterior in


Eq. 6.7, one can derive the update equation of the logarithmic message that is sent in round t + 1 from entity random variable E_i to the outcome e_j of the entity random variable E_j:

m_{E_i \to E_j}^{t+1}(e_j) = \max_{e_i} \Big( \rho_{e_i, m_i} + \lambda_{\{e_i, e_j\}} + \sum_{1 \le k \le n;\, k \ne j} m_{E_k \to E_i}^{t}(e_i) \Big)   (6.30)

Note that, for simplicity, we skip the factor graph framework and
send messages directly between each pair of entity variables. This is
equivalent to the original Belief Propagation (BP) framework.
We chose to update messages synchronously: in each round t, each
two entity nodes Ei and Ej exchange messages. This is done until
convergence or until an allowed maximum number of iterations (15 in
our experiments) is reached. The convergence criterion is:

\max_{1 \le i, j \le n;\; e_j \in \mathcal{E}} \big| m_{E_i \to E_j}^{t+1}(e_j) - m_{E_i \to E_j}^{t}(e_j) \big| \le \epsilon   (6.31)

where ε = 10^{-5}. This setting was sufficient in most of the cases to reach
convergence.
In the end, the final entity assignment is determined by:
e_i^{*} = \arg\max_{e_i} \Big( \rho_{e_i, m_i} + \sum_{1 \le k \le n} m_{E_k \to E_i}^{t}(e_i) \Big)   (6.32)

The complexity of the belief propagation algorithm is, in our case,


O(n^2 · r^2), with n being the number of mentions in a document and
r being the average number of candidate entities per mention (10 in
our case). More details regarding the run-time and convergence of the
loopy BP algorithm can be found in section 6.5.
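For concreteness, the following Python/numpy sketch implements the synchronous max-product updates of Eqs. 6.30-6.32 with the convergence test of Eq. 6.31; it is a simplified illustration (dense score arrays, no candidate pruning), not the production implementation.

import numpy as np

def max_product_lbp(rho, lam, max_rounds=15, eps=1e-5):
    """Synchronous max-product loopy BP of Eqs. 6.30-6.32 (simplified sketch).

    rho : array of shape (n, r), unary scores rho[i, e] for mention i
    lam : array of shape (n, n, r, r), pairwise scores lam[i, j, e_i, e_j]
    Returns the index of the selected candidate for every mention.
    """
    n, r = rho.shape
    msg = np.zeros((n, n, r))                  # msg[i, j, e_j]: message E_i -> E_j
    for _ in range(max_rounds):
        incoming = msg.sum(axis=0)             # (n, r): all messages arriving at each node
        new_msg = np.zeros_like(msg)
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                # Eq. 6.30: maximize over e_i the unary, pairwise and incoming terms
                score = (rho[i][:, None] + lam[i, j]
                         + (incoming[i] - msg[j, i])[:, None])
                new_msg[i, j] = score.max(axis=0)
        new_msg -= new_msg.max(axis=2, keepdims=True)   # normalize for numerical stability
        converged = np.abs(new_msg - msg).max() <= eps  # Eq. 6.31
        msg = new_msg
        if converged:
            break
    beliefs = rho + msg.sum(axis=0)                     # Eq. 6.32
    return beliefs.argmax(axis=1)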

6.5 experiments

We now present the experimental evaluation of our method. We


first uncover some practical details of our approach. Further, we show
an empirical comparison between PBOH and well known or recent
competitive entity disambiguation systems. We use the Gerbil testing
platform [Usb+15] version 1.1.4 with the D2KB setting in which a
document together with a fixed set of mentions to be annotated are
given as input. We run additional experiments that allow us to compare
against more recent approaches, such as [HC14] and [GB14].


Note that in all the experiments we assume that we have access to


a set of linkable token spans for each document. In practice this set is
obtained by first applying a mention detection approach which is not
part of our method. Our main goal is then to annotate each token span
with a Wikipedia entity 8 .

evaluation metrics We quantify the quality of an ED system by


measuring common metrics such as precision, recall and F1 scores.
Let M∗ be the ground truth entity annotations associated with a given
set of mentions X. Note that in all the results reported, mentions that
contain NIL or empty ground truth entities are discarded before the
evaluation; this decision is taken as well in Gerbil version 1.1.4. Let M
be the output annotations of an entity disambiguation system on the
same input. Then, our quality metrics are computed as follows:
• Precision: P = |M ∩ M*| / |M|

• Recall: R = |M ∩ M*| / |M*|

• F1 score: F1 = 2 · P · R / (P + R)

We mostly report results in terms of F1 scores, namely macro-averaged


F1@MA (aggregated across documents), and micro-averaged F1@MI
(aggregated across mentions). For a fair comparison with [HC14], we
also report micro-recall R@MI and macro-recall R@MA on the AIDA
datasets. Note that, in our case, the precision and recall are not nec-
essarily identical since a method may not consider annotating certain
mentions 8 .
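The micro versus macro aggregation can be summarized by the short Python sketch below, where each annotation is assumed to be a (document id, mention, entity) tuple; it only illustrates the two averaging schemes and is not the exact Gerbil matching procedure.

def prf1(pred, gold):
    """Precision, recall and F1 for sets of (doc id, mention, entity) annotations."""
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    return p, r, (2 * p * r / (p + r) if p + r else 0.0)

def micro_macro_f1(per_document):
    """per_document: list of (predicted annotations, gold annotations), one pair per document."""
    # micro: pool all annotations over mentions, then compute a single F1
    all_pred = set().union(*(p for p, _ in per_document))
    all_gold = set().union(*(g for _, g in per_document))
    micro = prf1(all_pred, all_gold)[2]
    # macro: compute F1 per document and average over documents
    macro = sum(prf1(p, g)[2] for p, g in per_document) / len(per_document)
    return micro, macro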

pseudo-likelihood training We briefly mention some of the


practical issues that we encounter with the likelihood maximization de-
scribed in section 6.3. From the practical perspective, for each mention
m, we only considered the set of parameters ρm,e limited to the top 64
candidate entities e per mention, ranked by p̂(e|m) . Additionally, we
restricted the set λe,e0 to entity pairs (e, e0 ) that co-occurred together in
at least 7 documents throughout the Wikipedia corpus. In total, a set
of 26 million ρ and 39 million λ parameters were learned using the

8 In PBOH, we refrain from annotating mentions for which no candidate entity is


found according to the procedure described in section 6.4.


Dataset # non-NIL mentions # documents


AIDA test A 4791 216
AIDA test B 4485 231
MSNBC 656 20
AQUAINT 727 50
ACE04 257 35

Table 6.1: Statistics on some of the used datasets

previously described procedure. Note that the universe of all Wikipedia


entities is of size ∼ 4 million.
For the SGD procedure, we tried different initializations of these
parameters, including ρ_{m,e} = log p(e|m), λ_{e,e′} = 0, as well as the param-
eters given by Eq. 6.17. However, in all cases, the accuracy
gain on a sample of 1000 Wikipedia test pages was small or negligible
compared to the LocalMention baseline (described below). One reason
is the inherent sparsity of the data: the parameters associated with the
long tail of infrequent entity pairs are updated rarely and expected to
be defective at the end of the SGD procedure. However, these scattered
pairs are crucial for the effectiveness and coverage of the entity disam-
biguation system. To overcome this problem, we refined our model as
described in section 6.3 and subsequent sections.

pboh training details Wikipedia itself is a valuable resource for


ED since each internal hyperlink can be considered as the ground truth
annotation for the respective anchor text. In our system, the training
is solely done on the entire Wikipedia corpus 9 . Hyper-parameters
are grid-searched such that the micro F1 plus macro F1 scores are
maximized over the combined held-out set containing only the AIDA
Test-A dataset and a Wikipedia validation set consisting of 1000 random
pages. As a preprocessing step in our training procedure, we
removed all annotations and hyperlinks that point to non-existing,
disambiguation or list Wikipedia pages.

9 We used the Wikipedia dump from February 2014


Figure 6.3: Interactive Gerbil visualization of “in-KB” (i.e. only entities


in KB should be linked) micro F1 scores for a variety of ED
methods and datasets. Our system, PBOH, is outperforming
the vast majority of the presented baselines. Screen shot
from January 2018.

The PBOH system used in the experimental comparison is the model


given by Eq. 6.26, for which grid search of the hyper-parameters
suggested using ζ = 0.075, τ = 0.5, δ = 0.5, ξ = 0.5.

datasets We evaluate our approach on 14 well-known public ED


datasets built from various sources, whose statistics are shown in
table 6.1 and [Usb+15], and some descriptions are provided below.

• The CoNLL-AIDA dataset is an entity annotated corpus of Reuters


news documents introduced by [Hof+11a]. It is much larger than
most of the other existing EL datasets, making it an excellent
evaluation target. The data is divided into three parts: Train (not
used in our current setting for training, but only in the Gerbil
evaluation), Test-A (used for validation) and Test-B (used for blind
evaluation). Similar to [HC14] and others, we report results also
on the validation set Test-A.


Datasets
AIDA test A AIDA test B
Systems R@MI R@MA R@MI R@MA

LocalMention 69.73 69.30 67.98 72.75


TagMe reimpl. 76.89 74.57 78.64 78.21
AIDA 79.29 77.00 82.54 81.66
S&Y - 84.22 - -
Houlsby et al. 79.65 76.61 84.89 83.51

PBOH 85.70 85.26 87.61 86.44

Table 6.2: AIDA test-a and AIDA test-b datasets results.

Datasets
new MSNBC new AQUAINT new ACE2004
Systems F1@MI F1@MA F1@MI F1@MA F1@MI F1@MA
LocalMention 73.64 77.71 87.33 86.80 84.75 85.70

Cucerzan 88.34 87.76 78.67 78.22 79.30 78.22


M&W 78.43 80.37 85.13 84.84 81.29 84.25
Han et al. 88.46 87.93 79.46 78.80 73.48 66.80
AIDA 78.81 76.26 56.47 56.46 80.49 84.13
GLOW 75.37 77.33 83.14 82.97 81.91 83.18
RI 90.22 90.87 87.72 87.74 86.60 87.13
REL-RW 91.37 91.73 90.74 90.58 87.68 89.23

PBOH 91.06 91.19 89.27 88.94 88.71 88.46

Table 6.3: Results on the newer versions of the MSNBC, AQUAINT and
ACE04 datasets.


• The AQUAINT dataset introduced by [MW08] contains docu-


ments from a news corpus from the Xinhua News Service, the
New York Times and the Associated Press.

• MSNBC [Cuc07] - a dataset of news documents that includes many


mentions which do not easily map to Wikipedia titles because of
their rare surface forms or distinctive lexicalization.

• The ACE04 dataset [Rat+11] is a subset of ACE2004 Coreference


documents annotated using Amazon Mechanical Turk. Note
that the ACE04 dataset contains mentions that are annotated
with NIL entities, meaning that no proper Wikipedia entity was
found. Following common practice, we removed all the mentions
corresponding to these NIL entities prior to our evaluation.

Note that the Gerbil platform uses an old version of the AQUAINT,
MSNBC and ACE04 datasets that contain some no-longer existing
Wikipedia entities. A new cleaned version of these sets 10 was released
by [GB14]. We report results for the new cleaned datasets in table 6.3,
while table 6.4 and fig. 6.3 contain results for the old versions currently
used by Gerbil.

10 http://www.cs.ualberta.ca/~denilson/data/deos14_ualberta_experiments.tgz

Datasets: IITB, MSNBC, KORE50, ACE2004, AQUAINT, N3-RSS-500, AIDA-Test B, AIDA-Test A, AIDA-Training, N3-Reuters-128, AIDA-Complete, DBpediaSpotlight, Microposts2014-Test, Microposts2014-Train

AGDISTIS             micro: 65.83 60.27 59.06 58.32 61.05 60.10 36.61 41.23 34.16 42.43 50.39 75.42 67.95 59.88
                     macro: 77.63 56.97 53.36 58.03 57.53 58.62 33.25 43.38 30.20 61.08 62.87 73.82 75.52 70.80
Babelfy              micro: 63.20 78.00 75.77 80.36 78.01 72.27 51.05 57.13 73.12 47.20 50.60 78.17 58.61 69.17
                     macro: 76.71 73.81 71.26 74.52 74.22 73.23 51.97 55.36 69.77 62.11 61.02 75.73 59.87 76.00
DBpedia Spotlight    micro: 70.38 58.84 54.90 57.69 60.04 74.03 69.27 65.44 37.59 56.43 56.26 69.27 56.44 57.63
                     macro: 80.02 60.59 54.11 61.34 62.23 73.13 67.23 62.81 32.90 71.63 67.99 69.82 58.77 65.03
Dexter               micro: 18.72 48.46 45.44 48.59 49.25 38.28 26.70 28.53 17.20 31.27 35.21 36.86 32.74 31.11
                     macro: 16.97 45.29 42.17 46.20 45.85 38.15 22.75 28.48 12.54 44.02 42.07 39.42 31.85 33.55
Entityclassifier.eu  micro: 12.74 46.6 44.13 44.02 47.83 21.67 22.59 18.46 27.97 29.12 32.69 41.24 28.4 21.77
                     macro: 12.3 42.86 42.36 41.31 43.36 19.59 18.0 19.54 25.2 39.53 38.41 40.3 24.84 22.2
Kea                  micro: 80.08 73.39 70.9 72.64 74.22 81.84 73.63 72.03 57.95 63.4 64.67 85.49 63.2 69.29
                     macro: 87.57 73.26 67.91 73.31 74.47 81.27 76.60 70.52 53.17 76.54 74.32 87.4 64.45 75.93
NERD-ML              micro: 54.89 54.62 52.85 52.59 55.55 49.68 46.8 51.08 29.96 38.65 39.83 64.03 54.96 61.22
                     macro: 72.22 52.35 49.6 51.34 53.23 46.06 45.59 49.91 24.75 57.91 53.74 67.28 62.9 67.3
TagMe 2              micro: 81.93 72.07 69.07 70.62 73.2 76.27 63.31 57.23 57.34 56.81 59.14 75.96 59.32 78.05
                     macro: 89.09 71.19 66.5 70.38 72.45 75.12 65.1 55.8 54.67 71.66 70.45 77.05 67.55 83.2
WAT                  micro: 80.0 83.82 81.82 84.34 84.21 76.82 65.18 61.14 58.99 59.56 61.96 77.72 64.38 68.21
                     macro: 86.49 83.59 80.25 84.12 84.22 77.64 68.24 59.36 53.13 73.89 72.65 79.08 65.81 76.0
Wikipedia Miner      micro: 77.14 64.72 61.65 60.71 66.48 75.96 62.57 58.59 41.63 54.88 55.93 64.25 60.05 64.54
                     macro: 86.36 66.17 61.67 63.19 67.93 74.63 61.43 56.98 35.0 69.29 67.0 64.68 66.51 72.23
PBOH                 micro: 87.19 86.72 86.63 87.39 86.59 86.64 79.48 62.47 61.70 74.19 73.08 89.54 76.54 71.24
                     macro: 90.40 86.85 85.48 86.32 87.30 86.14 80.13 61.04 55.83 84.48 81.25 89.62 83.31 78.33

Table 6.4: Micro (top) and macro (bottom) F1 scores reported by Gerbil for each of the 14 datasets and of the 11 ED
methods including PBOH. For each dataset and each metric, we highlight in red the best system and in
blue the second best system.

                               AIDA test A   AIDA test B    MSNBC   AQUAINT   ACE04
Avg. # mentions / doc              22.18         19.41       32.8     14.54    7.34
Algorithm convergence rate          100%        99.56%       100%      100%    100%
Avg. running time (ms/doc)        445.56        203.66     371.65     40.42   10.88
Avg. # rounds                       2.86          2.83        3.0      2.56    2.25

Table 6.5: Loopy belief propagation statistics. Average running time,


number of rounds and convergence rate of our inference
procedure are provided.

systems For comparison, we selected a broad range of competitor


systems from the vast literature in this field. The Gerbil platform al-
ready integrates the methods of Agdistis [Usb+14], Babelfy [MRN14],
DBpedia Spotlight [Men+11], Dexter [Cec+13a], Kea [SS13], Nerd-ML
[RET14], Tagme2 [FS10a], WAT [PF14], Wikipedia Miner [MW08] and
Illinois Wikifier [Rat+11]. We furthermore compare against Cucerzan
[Cuc07] – the first collective EL system that uses optimization tech-
niques, M&W [MW08] – a popular machine learning approach, Han
et al. [HSZ11] – a graph based disambiguation system that uses ran-
dom walks for joint disambiguation, AIDA [Hof+11a] – a performant
graph based approach, GLOW [Rat+11] – a system that uses local and
global context to perform joint entity disambiguation, RI [CR13] – an
approach using relational inference for mention disambiguation, and
REL-RW [GB14], a recent system that iteratively solves mentions rely-
ing on an online updating random walk model. In addition, on the
AIDA datasets we also compare against S&Y [SY13] – an apparatus for
combining the NER and EL tasks, and Houlsby et al. [HC14] – a topic
modelling LDA-based approach for EL.
To empirically assess the accuracy gain introduced by each incremen-
tal step of our approach, we ran experiments on several of our method’s
components, individually: LocalMention – links mentions to entities


solely based on the token span statistics, i.e., e* = arg max_e p̂(e|m);
Unnorm – uses the unnormalized mention-entity model described in
section 6.3; Rescaled – relies on the rescaled model presented in sec-
tion 6.3; LocalContext – disambiguates an entity based on the mention
and the local context probability given by Eq. 6.29, i.e.,
e* = arg max_e p(e|m, c). Note that Unnorm, Rescaled and PBOH use
the loopy belief propagation procedure for inference.

Results

Results of the experiments run on the Gerbil platform are shown


in table 6.4 and fig. 6.3. Detailed results are also provided 11,12 . We
obtain the highest performance on 11 datasets and the second highest
performance on 2 datasets, showing the effectiveness of our method.
Other results are presented in tables 6.2 and 6.3. The highest accuracy
for the cleaned version of AQUAINT, MSNBC and ACE04 was previ-
ously reported by [GB14], while [HC14] dominate the AIDA datasets.
Note that the performance of the baseline systems shown in these two
tables is taken from [GB14] and [HC14].
All these methods are tested in the setting where a fixed set of
mentions is given as input, without requiring the mention detection
step.

discussion Several observations are worth noting here. First, the


simple LocalMention component alone outperforms many EL systems.
However, our experimental results show that PBOH consistently beats
LocalMention on all the datasets. Second, PBOH produces state-of-the-art
results on both development (Test-A) and blind evaluation (Test-B) parts
of the AIDA dataset. Third, on the AQUAINT, MSNBC and ACE04
datasets, PBOH outperforms all but one of the presented EL systems
and is competitive with the state-of-art approaches. The method whose
performance is closest to ours is REL-RW [GB14], whose average F1 score
is only slightly higher than ours (+0.6 on average). However, there are
significant advantages of our method that make it easier to use for prac-
titioners. First, our approach is conceptually simpler and only requires
11 The PBOH Gerbil experiment is available at http://gerbil.aksw.org/gerbil/experiment?id=201510160025.
12 The detailed Gerbil results of the baseline systems can be accessed at http://gerbil.aksw.org/gerbil/experiment?id=201510160026.


Datasets
MSNBC AQUAINT ACE2004
Avg # mentions per doc 36.95 14.54 8.68
Systems # entities # entities # entities
PBOH 247.19 95.38 66.66
REL-RW 382773.6 242443.1 256235.49

Table 6.6: Average number of entities that appear in the graph built by
PBOH and by REL-RW.

Datasets
AIDA test A AIDA test B
Systems R@MI R@MA R@MI R@MA

LocalMention 69.73 69.30 67.98 72.75


Unnorm 69.77 69.95 75.87 75.12
Rescaled 75.09 74.25 74.76 78.28
LocalContext 82.50 81.56 85.46 84.08
PBOH 85.53 85.09 87.51 86.39

Table 6.7: Accuracy gains of individual PBOH components.

sufficient statistics computed from Wikipedia. Second, PBOH has a

more favorable computational complexity, manifested in significantly lower
run times (table 6.5), making it a good fit for large-scale real-time ED
systems; this is not the case for REL-RW, which its authors describe as “time consuming”.
Third, the number of entities in the underlying graph,
and thus the required memory, is significantly lower for PBOH (see
statistics provided in table 6.6).


Incremental accuracy gains


To give further insight into our method, table 6.7 provides an overview
of the contribution brought, step by step, by each incremental component
of the full PBOH system. It can be noted that PBOH performs best,
outperforming all of its individual components.

Reproducibility of the experiments


Our experiments are easily reproducible using the details provided
in this chapter. Our learning procedure is only based on statistics
coming from the set of Wikipedia webpages. As a consequence, one can
implement a real-time highly accurate entity disambiguation system
solely based on the details described here.
Our code is publicly available at: https://github.com/dalab/pboh-entity-linking

6.6 summary

In this chapter, we described a light-weight graphical model for ED


via approximate inference. Our method employs simple sufficient
statistics that rely on three sources of information: First, a probabilistic
name to entity map p̂(e|m) derived from a large corpus of hyperlinks;
second, observational data about the pairwise co-occurrence of enti-
ties within documents from a Web collection; third, entity - contextual
words statistics. Our experiments based on a number of popular ED
benchmarking collections show improved performance as compared to
several well-known or recent systems.

There are several promising directions of future work. Currently,


our model considers only pairwise potentials. In the future, it would
be interesting to investigate the use of higher-order potentials and
submodular optimization in an ED pipeline, thus allowing us to capture
the interplay between entire groups of entity candidates (e.g., through
the use of entity categories). Additionally, we will further enrich our
probabilistic model with statistics from new sources of information. We
expect some of the performance gains that other papers report from
using entity categories or semantic relations to be additive with regard
to our system’s current accuracy.

7 deep joint entity disambiguation with local neural attention

In this chapter, we build upon the previously presented PBOH model


and propose a novel deep learning model for joint document-level ED.
Key components are entity embeddings, non-metric bilinear form simi-
larity measures between words, contexts and entities, a neural attention
mechanism over local context windows, and a differentiable joint infer-
ence stage for disambiguation. Our approach thereby combines benefits
of deep learning with more traditional approaches such as graphical
models and probabilistic mention-entity maps. Extensive experiments
show that we are able to obtain competitive or state-of-the-art accuracy
at moderate computational costs.
The material presented here has in part been published in [GH17].

7.1 introduction

We have seen in the previous Chapter 6 that entity disambiguation (ED)
is an important stage for semantic text understanding, which automatically
resolves references to entities in a given KB. This task is challenging
due to the inherent ambiguity between surface form mentions such as
names and the entities they refer to. This many-to-many ambiguity
can often be captured partially by name-entity co-occurrence counts
extracted from entity-linked corpora.
ED research has largely focused on two types of contextual informa-
tion for disambiguation: local information based on words that occur
in a context window around an entity mention, and, global informa-
tion, exploiting document-level coherence of the referenced entities.
Many state-of-the-art methods aim to combine the benefits of both,
which is also the philosophy we follow in this chapter. What is specific
to our approach is that we use embeddings of entities as a common
representation to assess local as well as global evidence.
In recent years, many text and language understanding tasks have
been advanced by neural network architectures. However, despite


recent work, competitive ED systems still largely employ manually


designed features. Such features often rely on domain knowledge and
may fail to capture all relevant statistical dependencies and interactions.
The explicit goal of our work is to use deep learning in order to learn
basic features and their combinations from scratch. To the best of our
knowledge, our approach is the first to carry out this program with full
rigor.

7.2 contributions and related work

There is vast prior research on entity disambiguation, as highlighted


by [Ji16]. We will focus here on a discussion of our main contributions
in relation to prior work.

entity embeddings. We have developed a simple, yet effective


method to embed entities and words in a common vector space. This
follows the popular line of work on word embeddings, e.g. [Mik+13b];
[PSM14], which was recently extended to entities and ED by [Fan+16];
[HHJ15]; [Yam+16]; [ZSG16]. In contrast to the above methods that
require data about entity-entity co-occurrences which often suffers from
sparsity, we rather bootstrap entity embeddings from their canonical
entity pages and local context of their hyperlink annotations. This
allows for more efficient training and alleviates the need to compile
co-linking statistics. These vector representations are a key component
to avoid hand-engineered features, multiple disambiguation steps, or
the need for additional ad hoc heuristics when solving the ED task.

context attention. We present a novel attention mechanism


for local ED. Inspired by memory networks of [S+15] and insights
of [Laz+15], our model deploys attention to select words that are infor-
mative for the disambiguation decision. A learned combination of the
resulting context-based entity scores and a mention–entity prior yields
the final local scores. Our local model achieves better accuracy than the
local probabilistic model of [Gan+16], as well as the feature-engineered
local model of [Glo+16]. As an added benefit, our model has a smaller
memory footprint and is very fast for both training and testing.
There have been other deep learning approaches to define local con-
text models for ED. For instance [FDK16]; [He+13] use convolutional
neural networks (CNNs) and stacked denoising auto-encoders, respec-


tively, to learn representations of textual documents and canonical entity


pages. Entities for each mention are locally scored based on cosine
similarity with the respective document embedding. In a similar local
setting, [Sun+15] embed mentions, their immediate contexts and their
candidate entities using word embeddings and CNNs. However, their
entity representations are restrictively built from entity titles and entity
categories only. Unfortunately, the above models are rather ’black-box’
(as opposed to ours which reveals the attention focus) and were never
extended to perform joint document disambiguation.

collective disambiguation. Last, a novel deep learning archi-


tecture for global ED is proposed. Mentions in a document are resolved
jointly, using a conditional random field [L+01] with parametrized
potentials. We suggest to learn the latter by casting Loopy Belief Propa-
gation (LBP) [MWJ99b] as a rolled-out deep network. This is inspired
by similar approaches in computer vision, e.g. [Dom13], and allows
us to backpropagate through the (truncated) message passing, thereby
optimizing the CRF potentials to work well in conjunction with the
inference scheme. Our model is thus trained end-to-end with the
exception of the pre-trained word and entity embeddings. Previous
work has investigated different approximation techniques, including:
random graph walks [GB16], personalized PageRank [PHG15], inter-
mention voting [FS10b], graph pruning [Hof+11b], integer linear pro-
gramming [CR13], or ranking SVMs [Rat+11]. Most closely related to our
approach is [Gan+16], where LBP is used for inference (but not learning)
in a probabilistic graphical model and [Glo+16] where a single round
of message passing with attention is performed. To our knowledge, we
are one of the first to investigate differentiable message passing for NLP
problems.

7.3 learning entity embeddings

In a first step, we propose to train entity vectors that can be used


for the ED task (and potentially for other tasks). These embeddings
compress the semantic meaning of entities and drastically reduce the
need for manually designed features or co-occurrence statistics.
Entity embeddings are bootstrapped from word embeddings and
are trained independently for each entity. A few arguments motivate
this decision: (i) there is no need for entity co-occurrence statistics that


suffer from sparsity issues and/or large memory footprints; (ii) vectors
of entities in a subset domain of interest can be trained separately,
obtaining potentially significant speed-ups and memory savings that
would otherwise be prohibitive for large entity KBs 1 ; (iii) entities can be
easily added in an incremental manner, which is important in practice;
(iv) the approach extends well into the tail of rare entities with few
linked occurrences; (v) empirically, we achieve better quality compared
to methods that use entity co-occurrence statistics.
Our model embeds words and entities in the same low-dimensional
vector space in order to exploit geometric similarity between them. We
start with a pre-trained word embedding map x : W → Rd that is
known to encode semantic meaning of words w ∈ W ; specifically we
use word2vec pre-trained vectors [Mik+13b]. We extend this map to
entities E , i.e. x : E → Rd , as described below.
We assume a generative model in which words that co-occur with an
entity e are sampled from a conditional distribution p(w|e) when they
are generated. Empirically, we collect word-entity co-occurrence counts
#(w, e) from two sources: (i) the canonical KB description page of the
entity, and (ii) the windows of fixed size surrounding mentions of the
entity in an annotated corpus (e.g. Wikipedia hyperlinks). These counts
define a practical approximation of the above word-entity conditional
distribution, i.e. p̂(w|e) ∝ #(w, e). We call this the “positive” distribu-
tion of words related to the entity. Next, let q(w) be a generic word
probability distribution which we use for sampling “negative” words
unrelated to a specific entity. As in [Mik+13b], we choose a smoothed
unigram distribution q(w) = p̂(w)α for some α ∈ (0, 1). The desired
outcome is that vectors of positive words are closer (in terms of dot
product) to the embedding of entity e compared to vectors of random
words. Let w+ ∼ p̂(w|e) and w− ∼ q(w). Then, we use a max-margin
objective to infer the optimal embedding xe for entity e:

J(z; e) := \mathbb{E}_{w^{+}|e}\, \mathbb{E}_{w^{-}}\, h(z; w^{+}, w^{-})

h(z; w, v) := [\gamma - \langle z, x_w - x_v \rangle]_{+}   (7.1)

x_e := \arg\min_{z : \|z\| = 1} J(z; e)

where γ > 0 is a margin parameter and [·]+ is the ReLU activation.


The above loss is optimized using stochastic gradient descent with
1 Notably useful with (limited memory) GPU hardware.


Figure 7.1: Local model with neural attention. Inputs: context word
vectors, candidate entity priors and embeddings. Outputs:
entity scores. All parts are differentiable and trainable with
backpropagation.

projection over sampled pairs (w^+, w^-). Note that the entity vector is
directly optimized on the unit sphere, which is important in order to
obtain high-quality embeddings.
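A compact numpy sketch of this training loop is shown below; it only illustrates the max-margin objective of Eq. 7.1 with projection onto the unit sphere, and the sampling callbacks (pos_sampler, neg_sampler) are hypothetical stand-ins for the empirical distributions p̂(w|e) and q(w).

import numpy as np

def train_entity_vector(word_vecs, pos_sampler, neg_sampler,
                        dim=300, gamma=0.1, lr=0.3, steps=10000, seed=0):
    """Max-margin entity embedding of Eq. 7.1 (illustrative numpy sketch).

    word_vecs   : dict mapping a word to its fixed d-dimensional vector
    pos_sampler : callable returning a word drawn from p-hat(w | e)
    neg_sampler : callable returning a word drawn from the negative distribution q(w)
    """
    rng = np.random.default_rng(seed)
    z = rng.normal(size=dim)
    z /= np.linalg.norm(z)                        # start on the unit sphere
    for _ in range(steps):
        xw = word_vecs[pos_sampler()]             # positive word vector
        xv = word_vecs[neg_sampler()]             # negative word vector
        if gamma - z @ (xw - xv) > 0:             # hinge loss is active
            z += lr * (xw - xv)                   # gradient step on the margin term
            z /= np.linalg.norm(z)                # project back onto the unit sphere
    return z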
We empirically assess the quality of our entity embeddings on entity
similarity and ED tasks as detailed in section 7.7 and tables 7.2 and 7.3.
The technique described in this section can also be applied, in principle,
for computing embeddings of general text documents, but a comparison
with such methods is left as future work.

7.4 local model with neural attention

We now explain our local ED approach that uses embeddings to steer


a neural attention mechanism. We build on the insight that only a
few context words are informative for resolving an ambiguous men-
tion, something that has been exploited before in [Laz+15]. Focusing
only on those words helps reducing noise and improves disambigua-
tion. [Yam+16] observe the same problem and adopt the restrictive


strategy of removing all non-nouns. Here, we assume that a context


word may be relevant, if it is related to at least one of the entity candi-
dates of a given mention.

context scores. Let us assume that we have computed a mention–


entity prior p̂(e|m). In addition, for each mention m, a pruned candidate
set Γ(m) of at most S entities has been identified. Our model, depicted
in fig. 7.1, computes a score for each e ∈ Γ(m) based on the K-word
local context c = {w1 , . . . , wK } surrounding m, as well as on the prior. It
is a composition of differentiable functions, thus it is smooth from input
to output, allowing us to easily compute gradients and backpropagate
through it.
Each word w ∈ c and entity e ∈ Γ(m) is mapped to its embedding
via the pre-trained map x (cf. section 7.3). We then compute
an unnormalized support score for each word in the context as follows:
u(w) = \max_{e \in \Gamma(m)} x_e^{\top} A\, x_w   (7.2)

where A is a parameterized diagonal matrix. The weight is high if


the word is strongly related to at least one candidate entity. We often
observed that uninformative words (e.g. similar to stop words) receive
non-negligible scores which add undesired noise to our local context
model. As a consequence, we (hard) prune to the top R ≤ K words with
the highest scores 2 and apply a soft-max function on these weights.
Define the reduced context:

\bar{c} = \{ w \in c \mid u(w) \in \mathrm{top}_R(u) \}   (7.3)

Then, the final attention weights are explicitly

\beta(w) = \begin{cases} \dfrac{\exp[u(w)]}{\sum_{v \in \bar{c}} \exp[u(v)]} & \text{if } w \in \bar{c} \\ 0 & \text{otherwise.} \end{cases}   (7.4)

Finally, we define a β-weighted context-based entity-mention score via

\Psi(e, c) = \sum_{w \in \bar{c}} \beta(w)\, x_e^{\top} B\, x_w   (7.5)

where B is another trainable diagonal matrix. We will later use the


same architecture for the unary scores of our global ED model.
2 We implement this in a differentiable way by setting the lowest K-R attention

weights in u to −∞ and applying a vanilla softmax on top of them. We used the layers
Threshold and TemporalDynamicKMaxPooling from Torch NN package, which allow
subgradient computation.
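The following numpy sketch summarizes the local attention computation of Eqs. 7.2-7.5; unlike the differentiable Torch implementation mentioned in the footnote, it uses a plain (non-differentiable) top-R selection and is intended purely as an illustration.

import numpy as np

def local_context_scores(cand_vecs, word_vecs, a_diag, b_diag, R=25):
    """Attention-weighted context scores Psi(e, c) of Eqs. 7.2-7.5 (plain numpy sketch).

    cand_vecs : (S, d) embeddings of the candidate entities in Gamma(m)
    word_vecs : (K, d) embeddings of the context words
    a_diag, b_diag : (d,) diagonals of the matrices A and B
    """
    # Eq. 7.2: u(w) = max_e  x_e^T A x_w
    u = ((cand_vecs * a_diag) @ word_vecs.T).max(axis=0)      # (K,)
    # Eqs. 7.3-7.4: hard-prune to the top R words, softmax over the kept ones
    keep = np.argsort(u)[-R:]
    beta = np.zeros_like(u)
    exp_u = np.exp(u[keep] - u[keep].max())
    beta[keep] = exp_u / exp_u.sum()
    # Eq. 7.5: Psi(e, c) = sum_w beta(w) x_e^T B x_w, for every candidate e
    return (cand_vecs * b_diag) @ (word_vecs.T @ beta)        # (S,)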


local score combination. We integrate these context scores


with the context-independent scores encoded in p̂(e|m). Our final (un-
normalized) local model is a combination of both Ψ(e, c) and log p̂(e|m):

Ψ(e, m, c) = f (Ψ(e, c), log p̂(e|m)) (7.6)

Here, we found a flexible choice for f to be important and to be superior


to a naïve weighted average combination model. We therefore used a
neural network with two fully connected layers of 100 hidden units and
ReLU non-linearities, which we regularized as suggested in [Den+15]
by constraining the sum of squares of all weights in the linear layer.
We use standard projected SGD for training. The same network is also
used in section 7.5.
Prediction is done independently for each mention mi and context ci
by maximizing the Ψ(e, mi , ci ) score.

learning the local model. Entity and word embeddings are


pre-trained as discussed in section 7.3. Thus, the only learnable pa-
rameters are the diagonal matrices A and B, plus the parameters of
f . Having few parameters helps to avoid overfitting and to be able
to train with little annotated data. We assume that a set of known
mention-entity pairs {(m, e∗ )} with their respective context windows
have been extracted from a corpus. For model fitting, we then utilize
a max-margin loss that ranks ground truth entities higher than other
candidate entities. This leads us to the objective:

\theta^{*} = \arg\min_{\theta} \sum_{D \in \mathcal{D}} \sum_{m \in D} \sum_{e \in \Gamma(m)} g(e, m),   (7.7)

g(e, m) := [\gamma - \Psi(e^{*}, m, c) + \Psi(e, m, c)]_{+}

where γ > 0 is a margin parameter and D is a training set of entity


annotated documents. We aim to find a Ψ (i.e. parameterized by θ)
such that the score of the correct entity e∗ referenced by m is at least a
margin γ higher than that of any other candidate entity e. Whenever
this is not the case, the margin violation becomes the experienced loss.

7.5 document-level deep model

Next, we address global ED assuming document coherence among


entities. We therefore introduce the notion of a document as consisting


Figure 7.2: Global model: unrolled LBP deep network that is end-to-end
differentiable and trainable.

of a sequence of mentions m = m1 , . . . , mn , along with their context


windows c = c1 , . . . cn . Our goal is to define a joint probability distribu-
tion over Γ(m_1) × · · · × Γ(m_n) ∋ e. Each such e selects one candidate
entity for each mention in the document. Obviously, the state space of
e grows exponentially in the number of mentions n.

crf model Our model is a fully-connected pairwise conditional


random field, defined on the log scale as

g(e, m, c) = \sum_{i=1}^{n} \Psi_i(e_i) + \sum_{i<j} \Phi(e_i, e_j)   (7.8)

The unary factors are the local scores Ψ_i(e_i) = Ψ(e_i, c_i) described in
Eq. 7.5. The pairwise factors are bilinear forms of the entity embeddings

\Phi(e, e') = \frac{2}{n-1}\, x_e^{\top} C\, x_{e'} ,   (7.9)
where C is again a diagonal matrix. Similar to [Gan+16], the above
normalization helps balancing the unary and pairwise terms across
documents with different numbers of mentions.

differentiable inference Training and prediction in binary


CRF models such as the one above is NP-hard. Therefore, in learning one
usually maximizes a likelihood approximation and during operations
(i.e. in prediction) one may use an approximate inference procedure,
often based on message-passing. Among many challenges of these
approaches, it is worth pointing out that weaknesses of the approxi-
mate inference procedure are generally not captured during learning.
Inspired by [Dom11]; [Dom13], we use truncated fitting of loopy belief
propagation (LBP) to a fixed number of message passing iterations. Our


model directly optimizes the marginal likelihoods, using the same net-
works for learning and prediction. As noted by [Dom13], this method
is robust to model mis-specification, avoids inherent difficulties of parti-
tion functions and is faster compared to double-loop likelihood training
(where, for each stochastic update, inference is run until convergence is
achieved).
Our architecture is shown in fig. 7.2. A neural network with T layers
encodes T message passing iterations of synchronous max-product LBP 3
which is designed to find the most likely (MAP) entity assignments that
maximize g(e, m, c). We also use message damping, which is known to
speed-up and stabilize convergence of message passing. Formally, in
iteration t, mention m_i votes for entity candidate e ∈ Γ(m_j) of mention
m_j using the normalized log-message \bar{m}_{i \to j}^{t}(e) computed as:

m_{i \to j}^{t+1}(e) = \max_{e' \in \Gamma(m_i)} \big\{ \Psi_i(e') + \Phi(e, e') + \sum_{k \ne j} \bar{m}_{k \to i}^{t}(e') \big\}   (7.10)

Herein the first part just reflects the CRF potentials, whereas the second
part is defined as

\bar{m}_{i \to j}^{t}(e) = \log\big[ \delta \cdot \mathrm{softmax}(m_{i \to j}^{t}(e)) + (1 - \delta) \cdot \exp(\bar{m}_{i \to j}^{t-1}(e)) \big]   (7.11)

where δ ∈ (0, 1] is a damping factor. Note that, without loss of general-


ity, we simplified the LBP procedure by dropping the factor nodes. The
messages at first iteration (layer) are set to zero.
After T iterations (network layers), the beliefs (marginals) are computed as:

\mu_i(e) = \Psi_i(e) + \sum_{k \ne i} \bar{m}_{k \to i}^{T}(e)   (7.12)

\mu_i(e) = \frac{\exp[\mu_i(e)]}{\sum_{e' \in \Gamma(m_i)} \exp[\mu_i(e')]}   (7.13)
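The unrolled inference can be sketched in a few lines of numpy (shown below purely for illustration; the actual model is implemented as a differentiable Torch network so that gradients flow through the T message-passing layers).

import numpy as np

def unrolled_lbp_marginals(psi, phi, T=10, delta=0.5):
    """Truncated, damped max-product message passing of Eqs. 7.10-7.13 (numpy sketch).

    psi : (n, S) unary scores Psi_i(e)
    phi : (n, n, S, S) pairwise scores Phi(e_i, e_j) for every mention pair
    """
    def log_softmax(x, axis):
        x = x - x.max(axis=axis, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

    n, S = psi.shape
    m_bar = np.zeros((n, n, S))                    # damped messages, zero at the first layer
    diag = np.arange(n)
    for _ in range(T):
        incoming = m_bar.sum(axis=0)               # (n, S): sum_k of messages arriving at node i
        raw = np.zeros_like(m_bar)
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                # Eq. 7.10: maximize over e' in Gamma(m_i)
                score = (psi[i][:, None] + phi[i, j]
                         + (incoming[i] - m_bar[j, i])[:, None])
                raw[i, j] = score.max(axis=0)
        # Eq. 7.11: damped update in log space
        m_bar = np.log(delta * np.exp(log_softmax(raw, axis=2))
                       + (1.0 - delta) * np.exp(m_bar))
        m_bar[diag, diag] = 0.0                    # no self-messages
    mu = psi + m_bar.sum(axis=0)                   # Eq. 7.12
    return np.exp(log_softmax(mu, axis=1))         # Eq. 7.13: normalized beliefs, shape (n, S)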
Similar to the local case, we obtain an accuracy improvement when
combining the mention-entity prior p̂(e|m) with marginal µi (e) using
the same non-linear combination function f from eq. (7.6) as follows:

ρi (e) := f (µi (e), log p̂(e|mi )) (7.14)

The learned function f for global ED is non-trivial (see fig. 7.3), showing
that the influence of the prior tends to weaken for larger µ(e), whereas
3 Sum-product and mean-field performed worse in our experiments.


Figure 7.3: Non-linear scoring function of the belief and mention prior
learned with a neural network. Achieves a 1.7% improve-
ment on AIDA-B dataset compared to a weighted average
scheme.

it has a dominating influence whenever the document-level evidence is


weak. We also experimented with the prior integrated directly inside
the unary factors Ψi (ei ), but results were worse because, in some cases,
the global entity interaction is not able to recover from strong incorrect
priors (e.g. country names have a strong prior towards the respective
countries as opposed to national sports teams).
Parameters of our global model are the diagonal matrices A, B, C
and the weights of the f network. As before, we found a margin based
objective to be the most effective and we suggest to fit parameters by
minimizing a ranking loss 4 defined as follows:
L(\theta) = \sum_{D \in \mathcal{D}} \sum_{m_i \in D} \sum_{e \in \Gamma(m_i)} h(m_i, e)   (7.15)

h(m_i, e) = [\gamma - \rho_i(e_i^{*}) + \rho_i(e)]_{+}   (7.16)


Computing this objective is straightforward: we run the steps described by Eqs. 7.10 and 7.11 for T iterations, followed in the end by the step in Eq. 7.13. Each step is differentiable, so the gradient with respect to the model parameters can be computed on the resulting marginals and back-propagated over the messages using the chain rule.
At test time, marginals ρi (e) are computed jointly per document using
this network, but prediction is done independently for each mention
mi by maximizing its respective marginal score.
4 Optimizing a marginal log-likelihood loss function performed worse.


7.6 candidate selection

We make use of a mention-entity prior p̂(e|m) both as a feature and


for entity candidate selection. It is computed by averaging probabilities
from two indexes built from mention-entity hyperlink statistics from
Wikipedia and a large Web corpus [SC12b], plus the YAGO index
of [Hof+11b] (with uniform prior).
Candidate selection, i.e. building of Γ(m), is done for each input
mention as follows: first, the top 30 candidates are selected based on
p̂(e|m). Then, in order to optimize for memory and run time (LBP
has complexity quadratic in S), we keep only 7 of these entities based
on the following heuristic: (i) the top 4 entities based on p̂(e|m) are
selected, (ii) the top 3 entities based on the local context-entity similarity
measured as in eq. (7.5) are selected 5 . We refrain from annotating
mentions without any candidate, implying that precision and recall can
be different in our case.
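A small Python sketch of this pruning heuristic is given below; the argument names are illustrative and the context score is assumed to be precomputed as in Eq. 7.5.

def prune_candidates(candidates, prior, context_score,
                     top_prior=4, top_context=3, max_kept=7):
    """Two-stage candidate pruning of section 7.6 (illustrative sketch).

    candidates    : entity ids pre-selected by the prior (the top ~30 by p-hat(e|m))
    prior         : dict entity -> p-hat(e|m)
    context_score : dict entity -> local context-entity similarity (cf. Eq. 7.5)
    """
    by_prior = sorted(candidates, key=lambda e: prior[e], reverse=True)
    by_context = sorted(candidates, key=lambda e: context_score[e], reverse=True)
    # union of the two short lists, preserving order and removing duplicates
    kept = list(dict.fromkeys(by_prior[:top_prior] + by_context[:top_context]))
    return kept[:max_kept]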
In a few cases, generic mentions of persons (e.g. “Peter”) are coref-
erences of more specific mentions (e.g. “Peter Such”) from the same
document. We employ a simple heuristic to address this issue: for each
mention m, if there exist mentions of persons that contain m as a contin-
uous subsequence of words, then we consider the merged candidate set
of these specific mentions for the mention m. We decide that a mention
refers to a person if its most probable candidate by p̂(e|m) is a person.

7.7 experiments

ed datasets We validate our ED models on some of the most


popular available datasets used by our predecessors 6 . We provide
statistics in table 7.1.

• AIDA-CoNLL dataset [Hof+11b] is one of the biggest manually


annotated ED datasets. It contains training (AIDA-train), valida-
tion (AIDA-A) and test (AIDA-B) sets.

• MSNBC (MSB), AQUAINT (AQ) and ACE2004 (ACE) datasets


cleaned and updated by [GB16] 7
5 We have used a simpler context vector here computed by simply averaging all its

constituent word vectors.


6 TAC-KBP datasets used by [Glo+16]; [Sun+15]; [Yam+16] are no longer available.
7 Available at: bit.ly/2gnSBLg


Dataset            Number mentions   Number docs   Mentions per doc   Gold recall
AIDA-train 18448 946 19.5 -
AIDA-A (valid) 4791 216 22.1 96.9%
AIDA-B (test) 4485 231 19.4 98.2%
MSNBC 656 20 32.8 98.5%
AQUAINT 727 50 14.5 94.2%
ACE2004 257 36 7.1 90.6%
WNED-CWEB 11154 320 34.8 91.1%
WNED-WIKI 6821 320 21.3 92%

Table 7.1: Statistics of ED datasets. Gold recall is the percentage of


mentions for which the entity candidate set contains the
ground truth entity. We only trained on mentions with at
least one candidate.

• WNED-WIKI (WW) and WNED-CWEB (CWEB) are larger, but automatically extracted and thus less reliable. They are built from the ClueWeb and Wikipedia corpora by [GB16]; [GRS13].

training details and (hyper)parameters We explain train-


ing details of our approach. All models are implemented in the Torch
framework.

entity vectors training & relatedness evaluation. For


entity embeddings only, we use Wikipedia (Feb 2014) corpus for train-
ing. Entity vectors are initialized randomly from a 0-mean normal
distribution with standard deviation 1. We first train each entity vec-
tor on the entity canonical description page (title words included) for
400 iterations. Subsequently, Wikipedia hyperlinks of the respective
entities are used for learning until validation score (described below)
stops improving. In each iteration, 20 positive words, each with 5
negative words, are sampled and used for optimization as explained in
section 7.3. We use Adagrad [DHS11] with a learning rate of 0.3. We
choose embedding size d = 300, pre-trained (fixed) Word2Vec word


Method                                    NDCG@1   NDCG@5   NDCG@10    MAP
WikiLinkMeasure (WLM)                       0.54     0.52      0.55    0.48
[Yam+16] (d = 500)                          0.59     0.56      0.59    0.52
our (canonical pages, d = 300)              0.624    0.589     0.615   0.549
our (canonical & hyperlinks, d = 300)       0.632    0.609     0.641   0.578

Table 7.2: Entity relatedness results on the test set of [Cec+13b]. WLM
is a well-known similarity method of [MW08].

vectors 8 , α = 0.6, γ = 0.1 and window size of 20 for the hyperlinks. We


remove stop words before training. Learning of vectors for all candidate
entities in all datasets (270000 entities) takes 20 hours on a single TitanX
GPU with 12GB of memory.
We test and validate our entity embeddings on the entity relatedness
dataset of [Cec+13b]. It contains 3319 and 3673 queries for the test and
validation sets. Each query consists of one target entity and up to 100
candidate entities with gold standard binary labels indicating if the
two entities are related. The associated task requires ranking of related
candidate entities higher than the others. Following previous work, we
use different evaluation metrics: normalized discounted cumulative
gain (NDCG) and mean average precision (MAP). The validation score
used during learning is then the sum of the four metrics showed in
table 7.2. We perform candidate ranking based on cosine similarity of
entity pairs.
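For reference, the sketch below shows the standard NDCG@k computation with binary relevance labels, together with the cosine similarity used to rank candidates; it is a generic illustration and may differ in minor details from the exact evaluation script.

import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def ndcg_at_k(scores, labels, k):
    """NDCG@k for one relatedness query with binary labels (generic sketch).

    scores : predicted relatedness of each candidate (e.g. cosine similarity
             between the candidate and target entity embeddings)
    labels : 0/1 ground-truth relatedness labels, aligned with scores
    """
    labels = np.asarray(labels, dtype=float)
    order = np.argsort(scores)[::-1][:k]                    # best-scored candidates first
    discounts = 1.0 / np.log2(np.arange(2, len(order) + 2))
    dcg = float((labels[order] * discounts).sum())
    ideal = np.sort(labels)[::-1][:len(order)]              # best possible ordering
    idcg = float((ideal * discounts).sum())
    return dcg / idcg if idcg > 0 else 0.0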
Results for the entity similarity task are shown in table 7.2. Our
method outperforms the well established Wikipedia link measure and
the method of [Yam+16] using less information (only word - entity
statistics). We note that the best result on this dataset was reported in
the unpublished work of [HHJ15]. Their entity embeddings are trained
on many more sources of information (e.g. KG links, relations, entity
types). However, our focus was to prove that lightweight trained em-

8 By Word2Vec authors: http://bit.ly/1R9Wsqr


Entity: Japan national football team
Closest words: Japan player Shizuoka Yokohama played Asian USISL Saitama Okada Nakamura Tokyo Pele matches Japanese Korea players Tanaka soccer Chunnam game Suwon Takuya Kawaguchi Mizuno match Qatar team Eto Eiji football playing Confederations tournament Kagawa Chiba

Entity: Apple
Closest words: apple fruit berry grape varieties apples crop pear potato blueberry strawberry growers peach orchards pears Prunus grower Rubus citrus spinosa tomato berries Blueberry peaches grapes almond juice melon bean apricot insect vegetable strawberries olive pomegranate Vaccinium cherries potatoes Strawberry plums cultivar Apples harvest figs cultivars sunflower beet

Entity: Apple Inc.
Closest words: Apple software computer Microsoft Adobe hardware company iPod PC product Dell laptop Mac computers Macintosh Flash video desktop iPhone Digital Windows app PCs Intel technology device iTunes Motorola Sony digital Multimedia iPad HP licensing multimedia Nokia apps smartphone laptops Computer previewed products application Jobs devices startup

Entity: Queen (band)
Closest words: U2 band singer Avenged Rockers Coldplay concert Lynyrd Kiss Metallica Killers rerecorded song Beatles rock Stones recording Slash Singer touring musician music CD Dirty Moby rockers Sting Blackest songs rocker

Entity: Germany
Closest words: Germany Berlin German Munich Hamburg Austria Cologne Bavaria Hessen country Europe Wernigerode Saxony western Germans Schwaben Switzerland TuS Heilbronn Realschule Westfalen Deutschland Brandenburg eastern Rudolf Glarus Wolfgang Esslingen Kaserne Swabia Schwerin Andreas Poland Helmut Palatinate history Darmstadt Rhein Harald Ludwigsburg Kiel

Entity: Barack Obama
Closest words: Obama campaign President presidential endorsed Democrat Clinton nominee Presidential inauguration Senator senator administration speech Barack Democratic appointee Washington Republican vote Tuesday Secretary election Administration elect nomination Bush November president congressman Senate endorsing announcement candidacy

Entity: Leicestershire County Cricket Club
Closest words: Leicestershire curacy town Yeomanry Buckinghamshire Leicestershire Bedfordshire Lichfield Wiltshire Shropshire almshouses Lancashire Stonyhurst Warwickshire batsman England Hampshire Leicestershire Trott Glamorgan Nottinghamshire Northants Lancashire Middlesex Essex Giles fielding Porterfield Test Surrey cricketer centurion Gough Bevan Sussex Gloucestershire bowled Worcestershire Tests Martyn Croft Derbyshire Clarke overs bowler Lancastrian played Northamptonshire Kent Vaughan Fletcher captaining internationals batting Gilchrist Notts batted cricket

Table 7.3: Closest words to a given entity. Words with at least 500
frequency in the Wikipedia corpus are shown.


Methods AIDA-B
Local models
prior p̂(e|m) 71.9
[Laz+15] 86.4
[Glo+16] 87.9
[Yam+16] 87.2
our (local, K=100, R=50) 88.8

Global models
[HHJ15] 86.6
[Gan+16] 87.6
[CH15] 88.7
[GB16] 89.0
[Glo+16] 91.0
[Yam+16] 91.5
our (global) 92.22 ± 0.14

Table 7.4: In-KB accuracy for AIDA-B test set. All baselines use
KB+YAGO mention-entity index. For our method we show
95% confidence intervals obtained over 5 runs.

beddings useful for the ED task can also perform decently for the entity
similarity task. We emphasize that our global ED model outperforms
Huang’s ED model (table 7.4), likely due to the power of our local and
joint neural network architectures. For example, our attention mecha-
nism clearly benefits from explicitly embedding words and entities in
the same space.
As a qualitative evaluation, we show in table 7.3 the closest words to
a given entity in terms of their embeddings.

local and global ed model training. Our local and global


ED models are trained on AIDA-train (multiple epochs), validated
on AIDA-A and tested on AIDA-B and other datasets mentioned in
section 7.7. We use Adam [KB14] with learning rate of 1e-4 until


Global methods MSB AQ ACE CWEB WW


prior p̂(e|m) 89.3 83.2 84.4 69.8 64.2
[Fan+16] 81.2 88.8 85.3 - -
[Gan+16] 91 89.2 88.7 - -
[MW08] 78 85 81 64.1 81.7
[Hof+11b] 79 56 80 58.6 63
[Rat+11] 75 83 82 56.2 67.2
[CR13] 90 90 86 67.5 73.4
[GB16] 92 87 88 77 84.5
our (global)   93.7 ± 0.1   88.5 ± 0.4   88.5 ± 0.3   77.9 ± 0.1   77.5 ± 0.1

Table 7.5: Micro F1 results for other datasets.

validation accuracy exceeds 90%, afterwards setting it to 1e-5. Variable


size mini-batches consisting of all mentions in a document are used
during training. We remove stop words. Hyper-parameters of the
best validated global model are: γ = 0.01, K = 100, R = 25, S = 7, δ =
0.5, T = 10. For the local model, R = 50 was best. Validation accuracy
is computed after each 5 epochs. To regularize, we use early stopping,
i.e. we stop learning if the validation accuracy does not increase after
500 epochs. Training on a single GPU takes, on average, 2ms per
mention, or 16 hours for 1250 epochs over AIDA-train.
By using diagonal matrices A, B, C, we keep the number of param-
eters very low (approx. 1.2K parameters). This is necessary to avoid
overfitting when learning from a very small training set. We also exper-
imented with diagonal plus low-rank matrices, but encountered quality
degradation.

ed baselines & results We compare with systems that report


state-of-the-art results on the datasets and report results in tables 7.4
and 7.5. Some baseline scores from table 7.5 are taken from [GB16].
The best results for the AIDA datasets are reported by [Yam+16]
and [Glo+16]. We do not compare against [PHG15] since, as noted


Table 7.6: Effects of two of the hyper-parameters. Left: A low T (e.g. 5) is


already sufficient for accurate approximate marginals. Right:
Hard attention improves accuracy of a local model with
K=100.

also by [Glo+16], their mention index artificially includes the gold


entity (guaranteed gold recall), which is not a realistic setting.
For a fair comparison with prior work, we use in-KB accuracy and
micro F1 (averaged per mention) metrics to evaluate our approach.
Results are shown in tables 7.4 and 7.5. We run our system 5 times,
each time we pick the best model on the validation set, and report
results on the test set for these models. We obtain state-of-the-art
accuracy on AIDA, which is the largest and hardest (by the accuracy
of the p̂(e|m) baseline) manually created ED dataset. We are also
competitive on the other datasets. It should be noted that all the other
methods use, at least partially, engineered features. The merit of our
proposed method is to show that, with the exception of the p̂(e|m)
feature, a neural network is able to learn the best features for ED
without requiring expert input.
To gain further insight, we analyzed the accuracy on the AIDA-B
dataset for situations where gold entities have low frequency or mention
prior. Table 7.7 shows that our method performs well in these harder
cases.

hyperparameter studies. In table 7.6, we analyze the effect of
two hyper-parameters. First, we see that hard attention (i.e. R < K)
helps reduce the noise from uninformative context words (as opposed
to keeping all words when R = K).
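
A minimal sketch of this hard-attention step is shown below: each of the K context words receives a relevance score (e.g. its best diagonal-bilinear match against the entity candidates), only the top-R words keep their scores, and the remaining words get zero weight after the softmax; the shapes and random scores here are illustrative assumptions.

    import numpy as np

    def hard_attention_weights(word_scores, R):
        # word_scores: shape (K,), one relevance score per context word.
        # Only the R highest-scoring words receive non-zero attention weight.
        K = word_scores.shape[0]
        kept = np.argsort(word_scores)[-R:]                 # indices of the top-R words
        masked = np.full(K, -np.inf)
        masked[kept] = word_scores[kept]
        weights = np.exp(masked - word_scores[kept].max())  # softmax restricted to kept words
        return weights / weights.sum()

    # With K = 6 words and R = 3, exactly three words get non-zero weight:
    rng = np.random.default_rng(0)
    print(np.round(hard_attention_weights(rng.normal(size=6), R=3), 3))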

Freq gold     Number      Solved          p̂(e|m) gold    Number      Solved
entity        mentions    correctly       entity          mentions    correctly

0             5           80.0%           ≤ 0.01          36          89.19%
1-10          0           -               0.01 - 0.03     249         88.76%
11-20         4           100.0%          0.03 - 0.1      306         82.03%
21-50         50          90.0%           0.1 - 0.3       381         86.61%
> 50          4345        94.2%           > 0.3           3431        96.53%

Table 7.7: ED accuracy on AIDA-B for our best system, split by
Wikipedia hyperlink frequency and mention prior of the
ground truth entity, in cases where the gold entity appears in
the candidate set.

Second, we see that a small number of LBP iterations (hard-coded
in our network) is enough to obtain good accuracy. This speeds up
training and testing compared to traditional methods that run LBP until
convergence. An explanation is that a truncated version of LBP can
perform well enough if used at both training and test time.
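
The unrolled inference can be sketched as T rounds of damped max-product message passing on the fully connected mention graph; the version below is a simplified illustration under assumed score shapes, not the exact update of our network (the damping and normalization details may differ).

    import numpy as np

    def truncated_lbp(local_scores, pairwise, T=10, delta=0.5):
        # local_scores: (n, C) per-mention candidate scores.
        # pairwise: (n, n, C, C) pairwise compatibility scores between candidates.
        # Runs only T synchronized rounds of damped max-product LBP (instead of
        # iterating to convergence) and returns (n, C) normalized beliefs.
        n, C = local_scores.shape
        msgs = np.zeros((n, n, C))                        # msgs[i, j]: message from mention i to mention j

        for _ in range(T):
            incoming = msgs.sum(axis=0)                   # total incoming message at every mention
            new_msgs = np.zeros_like(msgs)
            for i in range(n):
                for j in range(n):
                    if i == j:
                        continue
                    cavity = local_scores[i] + incoming[i] - msgs[j, i]         # exclude j's own message
                    new_msgs[i, j] = np.max(cavity[:, None] + pairwise[i, j], axis=0)
                    new_msgs[i, j] -= new_msgs[i, j].max()                      # stabilize
            msgs = delta * new_msgs + (1.0 - delta) * msgs                      # damping with weight delta

        beliefs = local_scores + msgs.sum(axis=0)
        beliefs -= beliefs.max(axis=1, keepdims=True)
        probs = np.exp(beliefs)
        return probs / probs.sum(axis=1, keepdims=True)

Because T is fixed, the same T-step computation can be unrolled as layers of the network and differentiated end-to-end, which is what allows the global model to be trained jointly with the local scores.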

qualitative analysis of local model. In table 7.8 we show
some examples of context words attended by our local model for cor-
rectly solved hard cases (where the mention prior of the correct entity
is low). One can notice that, in most cases, our model selects words
that are relevant for at least one entity candidate.

error analysis. We analyzed some of the errors made by our
model on the AIDA-B dataset. We mostly observed three situations:
i) annotation errors, ii) gold entities that do not appear in mentions’
candidate sets, or iii) gold entities with very low p(e|m) prior whose
mentions have an incorrect entity candidate with high prior. As an
example for the last case, the mention “Italians” refers in some specific
context to the entity “Italy national football team” rather than the entity
representing the country. The contextual information is not strong
enough in this case to avoid an incorrect model behavior. On the
other hand, there are situations where the context can be misleading,
e.g. a document that heavily discusses cricket will favor resolving
the mention “Australia” to the entity “Australia national cricket team”
instead of the gold entity “Australia” (naming a location of cricket
games).

Mention: Scotland
Gold entity: Scotland national rugby union team (p̂(e|m) of gold entity: 0.034)
Attended contextual words: England Rugby team squad Murrayfield Twickenham national
play Cup Saturday World game George following Italy week Friday selection dropped
row month

Mention: Wolverhampton
Gold entity: Wolverhampton Wanderers F.C. (p̂(e|m) of gold entity: 0.103)
Attended contextual words: matches League Oxford Hull league Charlton Oldham Cambridge
Sunderland Blackburn Sheffield Southampton Huddersfield Leeds Middlesbrough Reading
Coventry Darlington Bradford Birmingham Enfield Barnsley

Mention: Montreal
Gold entity: Montreal Canadiens (p̂(e|m) of gold entity: 0.021)
Attended contextual words: League team Hockey Toronto Ottawa games Anaheim Edmonton
Rangers Philadelphia Caps Buffalo Pittsburgh Chicago Louis National home Friday York
Dallas Washington Ice

Mention: Santander
Gold entity: Santander Group (p̂(e|m) of gold entity: 0.192)
Attended contextual words: Carlos Telmex Mexico Mexican group firm market week Ponce
debt shares buying Televisa earlier pesos share stepped Friday analysts ended

Mention: World Cup
Gold entity: FIS Alpine Ski World Cup (p̂(e|m) of gold entity: 0.063)
Attended contextual words: Alpine ski national slalom World Skiing Whistler downhill
Cup events race consecutive weekend Mountain Canadian racing

Table 7.8: Examples of context words selected by our local attention
mechanism. Distinct words are sorted in decreasing order of attention
weight; only words with non-zero weights are shown.

7.8 conclusion

We have proposed a novel deep learning architecture for entity dis-
ambiguation that combines entity embeddings, a contextual attention
mechanism, an adaptive local score combination, as well as unrolled
differentiable message passing for global inference. Compared to many

other methods, we do not rely on hand-engineered features, nor on an
extensive corpus for entity co-occurrences or relatedness. Our system
is fully differentiable, although we chose to pre-train word and entity
embeddings. Extensive experiments show the competitiveness of our
approach across a wide range of corpora. In the future, we would like
to extend this system to perform nil detection, coreference resolution
and mention detection.
Our code and data are publicly available: http://github.com/dalab/deep-ed

8 CONCLUSION
In this dissertation, we advocated for non-Euclidean spaces and
Riemannian manifolds as offering a better inductive bias for learning
representations than traditional Euclidean methods. In particular,
constant curvature spaces (e.g. hyperbolic) and their product spaces
offer computationally efficient geometries with appealing properties for
machine learning methods, e.g. isometric embedding guarantees for
hierarchical structures and a connection with Gaussian distributions.
First, we saw how taxonomies, hierarchical structures and directed
acyclic graphs can be embedded in the hyperbolic space using geodesi-
cally convex entailment cones.
Next, we explored how hyperbolic embeddings can be used in down-
stream tasks by generalizing some of the most important deep learning
architectures.
Third, we investigated how word embeddings can be trained in prod-
ucts of hyperbolic spaces, exploiting the connection between such
products and Gaussian distributions; this was the first method to obtain
competitive results on the three different tasks of word similarity,
analogy and hypernymy.
Further, we moved on to investigating non-Euclidean representations
of entities for the task of text disambiguation, or entity resolution. We
explored probabilistic graphical models, learned entity embeddings and
leveraged attention mechanisms and fully differentiable approximate
message passing inference methods in Markov Random Fields.

8.1 future research directions

Embedding data with a latent hierarchical anatomy in the hyperbolic
space still has many more stories to tell. It would be interesting to
explore generative models in the hyperbolic space (e.g. variational auto-
encoders or generative adversarial networks) that would, for example,
reveal the underlying taxonomic structure in an unsupervised setting.
Relevant applications include text, financial and genomic data.


We explored spaces of constant curvature, but one might ask: what
about more complex Riemannian manifolds? Very recent work in machine
learning has begun exploring modeling [Gu+19] and optimization [BG19]
in products of constant curvature spaces, but much remains to be
discovered - for example, what types of data and graphs are provably
better embedded in these product spaces?
In general, we envision future research to span across interrelated
sub-areas of non-Euclidean embeddings and Riemannian manifold
learning. One goal would be to understand which geometric spaces i)
offer a provably better inductive bias for various specific data properties,
ii) can benefit from easier optimization, iii) achieve superior results with
very dense models, or iv) exhibit enhanced disentanglement and strong
clustering. Such spaces should be computationally efficient, ideally
revealing closed form expressions of useful differential geometry tools
(e.g. geodesics, distance functions, exponential map).

BIBLIOGRAPHY

[Aba+16] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen,


Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghe-
mawat, Geoffrey Irving, Michael Isard, et al. TensorFlow:

A System for Large-Scale Machine Learning.“ In: 2016 (cit.
on p. 80).
[ADM14] Réka Albert, Bhaskar DasGupta, and Nasim Mobasheri.
Topological implications of negative curvature for biologi-

cal and social networks.“ In: Physical Review E 89.3 (2014),
p. 032811 (cit. on p. 89).
[AHK01] Charu C Aggarwal, Alexander Hinneburg, and Daniel A
Keim. On the surprising behavior of distance metrics

in high dimensional space.“ In: International conference on
database theory. Springer. 2001, pp. 420–434 (cit. on p. 17).
[Alb08] Ungar Abraham Albert. Analytic hyperbolic geometry and
Albert Einstein’s special theory of relativity. World scientific,
2008 (cit. on pp. 34, 37, 60).
[AMA18] Gregorio Alanis-Lobato, Pablo Mier, and Miguel Andrade-
Navarro. The latent geometry of the human protein inter-

action network.“ In: Bioinformatics 34.16 (2018), pp. 2826–
2834 (cit. on p. 10).
[Aro+16] Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and
Andrej Risteski. A latent variable model approach to pmi-

based word embeddings.“ In: Transactions of the Association
for Computational Linguistics 4 (2016), pp. 385–399 (cit. on
p. 82).
[AW17] Ben Athiwaratkun and Andrew Wilson. Multimodal Word

Distributions.“ In: Proceedings of the 55th Annual Meeting of
the Association for Computational Linguistics (Volume 1: Long
Papers). Vol. 1. 2017, pp. 1645–1656 (cit. on p. 83).
[AW18] Ben Athiwaratkun and Andrew Gordon Wilson. Hier-

archical Density Order Embeddings.“ In: arXiv preprint
arXiv:1804.09843 (2018) (cit. on p. 83).


[BBK06] Alexander M Bronstein, Michael M Bronstein, and Ron


Kimmel. Generalized multidimensional scaling: a frame-

work for isometry-invariant partial surface matching.“ In:
Proceedings of the National Academy of Sciences 103.5 (2006),
pp. 1168–1172 (cit. on p. 5).
[BCB14] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio.
Neural machine translation by jointly learning to align

and translate.“ In: arXiv preprint arXiv:1409.0473 (2014) (cit.
on p. 6).
[BCC15] Michele Borassi, Alessandro Chessa, and Guido Caldarelli.
Hyperbolicity measures democracy in real-world networks.“

In: Physical Review E 92.3 (2015), p. 032812 (cit. on pp. 82,
89).
[Bey+99] Kevin Beyer, Jonathan Goldstein, Raghu Ramakrishnan,
and Uri Shaft. When is “nearest neighbor” meaningful?“

In: International conference on database theory. Springer. 1999,
pp. 217–235 (cit. on p. 17).
[BG03] Ingwer Borg and Patrick Groenen. Modern multidimen-

sional scaling: Theory and applications.“ In: Journal of Edu-
cational Measurement 40.3 (2003), pp. 277–280 (cit. on pp. 4–
6, 8).
[BG19] Gary Becigneul and Octavian-Eugen Ganea. Riemannian

Adaptive Optimization Methods.“ In: International Con-
ference on Learning Representations. 2019. url: https : / /
openreview.net/forum?id=r1eiqi09K7 (cit. on pp. x, 18,
25, 87, 91, 150).
[BHV01] Louis J Billera, Susan P Holmes, and Karen Vogtmann.
Geometry of the space of phylogenetic trees.“ In: Advances

in Applied Mathematics 27.4 (2001), pp. 733–767 (cit. on
p. 10).
[BL11] Marco Baroni and Alessandro Lenci. How we BLESSed

distributional semantic evaluation.“ In: Proceedings of the
GEMS 2011 Workshop on GEometrical Models of Natural Lan-
guage Semantics. Association for Computational Linguistics.
2011, pp. 1–10 (cit. on p. 94).


[Blä+16] Thomas Bläsius, Tobias Friedrich, Anton Krohmer, and


Sören Laue. Efficient embedding of scale-free graphs in

the hyperbolic plane.“ In: LIPIcs-Leibniz International Pro-
ceedings in Informatics. Vol. 57. Schloss Dagstuhl-Leibniz-
Zentrum fuer Informatik. 2016 (cit. on p. 42).
[BN03] Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps

for dimensionality reduction and data representation.“ In:
Neural computation 15.6 (2003), pp. 1373–1396 (cit. on p. 8).
[Boj+16] Piotr Bojanowski, Edouard Grave, Armand Joulin, and
Tomas Mikolov. Enriching word vectors with subword

information.“ In: arXiv preprint arXiv:1607.04606 (2016) (cit.
on p. 81).
[Bon13] Silvere Bonnabel. Stochastic gradient descent on Rieman-

nian manifolds.“ In: IEEE Transactions on Automatic Control
58.9 (2013), pp. 2217–2229 (cit. on pp. 18, 29, 43, 53, 72, 87).
[Bor+13] Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran,
Jason Weston, and Oksana Yakhnenko. Translating em-

beddings for modeling multi-relational data.“ In: Advances
in neural information processing systems. 2013, pp. 2787–2795
(cit. on pp. 1, 41).
[Bou85] Jean Bourgain. On Lipschitz embedding of finite metric

spaces in Hilbert space.“ In: Israel Journal of Mathematics
52.1-2 (1985), pp. 46–52 (cit. on p. 14).
[Bow+15] Samuel R Bowman, Gabor Angeli, Christopher Potts, and
Christopher D Manning. A large annotated corpus for

learning natural language inference.“ In: Proceedings of the
2015 Conference on Empirical Methods in Natural Language
Processing. 2015, pp. 632–642 (cit. on pp. 71, 82).
[Bow06] Brian H Bowditch. A course on geometric group theory.“

In: (2006) (cit. on pp. 33, 42).
[BP06] Razvan C Bunescu and Marius Pasca. Using Encyclopedic

Knowledge for Named entity Disambiguation.“ In: Eacl.
Vol. 6. 2006, pp. 9–16 (cit. on p. 102).
[BPP96] Adam L Berger, Vincent J Della Pietra, and Stephen A Della
Pietra. A maximum entropy approach to natural language

processing.“ In: Computational linguistics 22.1 (1996), pp. 39–
71 (cit. on p. 106).


[Bro+17] Michael M Bronstein, Joan Bruna, Yann LeCun, Arthur


Szlam, and Pierre Vandergheynst. Geometric deep learn-

ing: going beyond euclidean data.“ In: IEEE Signal Pro-
cessing Magazine 34.4 (2017), pp. 18–42 (cit. on pp. iii, v,
10).
[BU01] Graciela S Birman and Abraham A Ungar. The hyperbolic

derivative in the poincaré ball model of hyperbolic geome-
try.“ In: Journal of mathematical analysis and applications 254.1
(2001), pp. 321–333 (cit. on p. 70).
[Can+97] James W Cannon, William J Floyd, Richard Kenyon, Wal-
ter R Parry, et al. Hyperbolic geometry.“ In: Flavors of

geometry 31 (1997), pp. 59–115 (cit. on pp. 20, 29).
[CC09] Andrej Cvetkovski and Mark Crovella. Hyperbolic em-

bedding and routing for dynamic graphs.“ In: INFOCOM
2009, IEEE. IEEE. 2009, pp. 1647–1655 (cit. on p. 42).
[Cec+13a] Diego Ceccarelli, Claudio Lucchese, Salvatore Orlando,
Raffaele Perego, and Salvatore Trani. Dexter: an open

source framework for entity linking.“ In: Proceedings of the
sixth international workshop on Exploiting semantic annotations
in information retrieval. ACM. 2013, pp. 17–20 (cit. on p. 124).
[Cec+13b] Diego Ceccarelli, Claudio Lucchese, Salvatore Orlando,
Raffaele Perego, and Salvatore Trani. Learning relatedness

measures for entity linking.“ In: Proceedings of the 22nd
ACM international conference on Information & Knowledge
Management. ACM. 2013, pp. 139–148 (cit. on p. 141).
[CH15] Andrew Chisholm and Ben Hachey. Entity disambigua-

tion with web links.“ In: Transactions of the Association
for Computational Linguistics 3 (2015), pp. 145–156 (cit. on
p. 143).
[Cha+18] Haw-Shiuan Chang, Ziyun Wang, Luke Vilnis, and An-
drew McCallum. Distributional Inclusion Vector Embed-

ding for Unsupervised Hypernymy Detection.“ In: Proceed-
ings of the 2018 Conference of the North American Chapter of the
Association for Computational Linguistics: Human Language
Technologies, Volume 1 (Long Papers). Vol. 1. 2018, pp. 485–
495 (cit. on p. 96).


[Cha+19] Benjamin Paul Chamberlain, Stephen R Hardwick, David R


Wardrope, Fabon Dzogang, Fabio Daolio, and Saúl Vargas.
Scalable Hyperbolic Recommender Systems.“ In: arXiv

preprint arXiv:1902.08648 (2019) (cit. on p. 10).
[CR13] Xiao Cheng and Dan Roth. Relational inference for wik-

ification.“ In: Urbana 51.61801 (2013), pp. 16–58 (cit. on
pp. 103, 124, 131, 144).
[CSS15] Sueli IR Costa, Sandra A Santos, and João E Strapasson.
Fisher information distance: a geometrical reading.“ In:

Discrete Applied Mathematics 197 (2015), pp. 59–69. url:
https://arxiv.org/pdf/1210.2354.pdf (cit. on pp. 25,
86).
[Cuc07] Silviu Cucerzan. Large-scale named entity disambigua-

tion based on Wikipedia data.“ In: (2007) (cit. on pp. 103,
122, 124).
[De +18a] Christopher De Sa, Albert Gu, Christopher Ré, and Fred-
eric Sala. Representation Tradeoffs for Hyperbolic Embed-

dings.“ In: Proceedings of the 35th International Conference on
Machine Learning (ICML-18) (2018) (cit. on pp. iii, v, 15, 21,
23, 34, 59).
[De +18b] Christopher De Sa, Albert Gu, Christopher Ré, and Fred-
eric Sala. Representation Tradeoffs for Hyperbolic Em-

beddings.“ In: International Conference on Machine Learning.
2018 (cit. on p. 85).
[Den+15] Emily Denton, Jason Weston, Manohar Paluri, Lubomir
Bourdev, and Rob Fergus. User conditional hashtag pre-

diction for images.“ In: Proceedings of the 21th ACM SIGKDD
International Conference on Knowledge Discovery and Data
Mining. ACM. 2015, pp. 1731–1740 (cit. on p. 135).
[DG03] Sanjoy Dasgupta and Anupam Gupta. An elementary

proof of a theorem of Johnson and Lindenstrauss.“ In:
Random Structures & Algorithms 22.1 (2003), pp. 60–65 (cit.
on p. 15).
[Dhi+18] Bhuwan Dhingra, Christopher Shallue, Mohammad Norouzi,
Andrew Dai, and George Dahl. Embedding Text in Hy-

perbolic Spaces.“ In: Proceedings of the Twelfth Workshop


on Graph-Based Methods for Natural Language Processing


(TextGraphs-12). 2018, pp. 59–69 (cit. on p. 83).
[DHS11] John Duchi, Elad Hazan, and Yoram Singer. Adaptive

subgradient methods for online learning and stochastic
optimization.“ In: Journal of Machine Learning Research 12.Jul
(2011), pp. 2121–2159 (cit. on pp. 1, 25, 87, 91, 140).
[DK14] Greg Durrett and Dan Klein. A Joint Model for Entity

Analysis: Coreference, Typing, and Linking.“ In: Transac-
tions of the Association for Computational Linguistics 2 (2014),
pp. 477–490 (cit. on p. 103).
[Do 92] Manfredo Perdigao Do Carmo. Riemannian geometry. Vol. 115.
1992 (cit. on p. 20).
[Dom11] Justin Domke. Parameter learning with truncated message-

passing.“ In: Computer Vision and Pattern Recognition (CVPR),
2011 IEEE Conference on. IEEE. 2011, pp. 2937–2943 (cit. on
p. 136).
[Dom13] Justin Domke. Learning graphical model parameters with

approximate marginal inference.“ In: IEEE transactions
on pattern analysis and machine intelligence 35.10 (2013),
pp. 2454–2467 (cit. on pp. 131, 136, 137).
[DP10] Robert PW Duin and Elżbieta Pekalska. Non-Euclidean

dissimilarities: causes and informativeness.“ In: Joint IAPR
International Workshops on Statistical Techniques in Pattern
Recognition (SPR) and Structural and Syntactic Pattern Recog-
nition (SSPR). Springer. 2010, pp. 324–333 (cit. on pp. iii, v,
10, 11, 17).
[DR72] John N Darroch and Douglas Ratcliff. Generalized itera-

tive scaling for log-linear models.“ In: The annals of mathe-
matical statistics (1972), pp. 1470–1480 (cit. on p. 108).
[Dre+10] Mark Dredze, Paul McNamee, Delip Rao, Adam Gerber,
and Tim Finin. Entity disambiguation for knowledge base

population.“ In: Proceedings of the 23rd International Confer-
ence on Computational Linguistics. Association for Computa-
tional Linguistics. 2010, pp. 277–285 (cit. on p. 102).


[Fan+16] Wei Fang, Jianwen Zhang, Dilin Wang, Zheng Chen, and
Ming Li. Entity Disambiguation by Knowledge and Text

Jointly Embedding.“ In: CoNLL 2016 (2016), p. 260 (cit. on
pp. 130, 144).
[FDK16] Matthew Francis-Landau, Greg Durrett, and Dan Klein.
Capturing semantic similarity for entity linking with con-

volutional neural networks.“ In: arXiv preprint arXiv:1604.00734
(2016) (cit. on p. 130).
[FS10a] Paolo Ferragina and Ugo Scaiella. Fast and accurate an-

notation of short texts with Wikipedia pages.“ In: arXiv
preprint arXiv:1006.3498 (2010) (cit. on p. 124).
[FS10b] Paolo Ferragina and Ugo Scaiella. Tagme: on-the-fly an-

notation of short text fragments (by wikipedia entities).“
In: Proceedings of the 19th ACM international conference on In-
formation and knowledge management. ACM. 2010, pp. 1625–
1628 (cit. on pp. 103, 131).
[Fu+14] Ruiji Fu, Jiang Guo, Bing Qin, Wanxiang Che, Haifeng
Wang, and Ting Liu. Learning semantic hierarchies via

word embeddings.“ In: Proceedings of the 52nd Annual Meet-
ing of the Association for Computational Linguistics (Volume 1:
Long Papers). Vol. 1. 2014, pp. 1199–1209 (cit. on pp. 1, 41).
[Gan+16] Octavian-Eugen Ganea, Marina Ganea, Aurelien Lucchi,
Carsten Eickhoff, and Thomas Hofmann. Probabilistic

bag-of-hyperlinks model for entity linking.“ In: Proceed-
ings of the 25th International Conference on World Wide Web.
International World Wide Web Conferences Steering Com-
mittee. 2016, pp. 927–938 (cit. on pp. ix, 25, 99, 130, 131,
136, 143, 144).
[Gan+19] Octavian-Eugen Ganea, Sylvain Gelly, Gary Bécigneul, and
Aliaksei Severyn. Breaking the Softmax Bottleneck via

Learnable Monotonic Pointwise Non-linearities.“ In: arXiv
preprint arXiv:1902.08077 (2019) (cit. on p. x).
[GB14] Zhaochen Guo and Denilson Barbosa. Robust Entity Link-

ing via Random Walks.“ In: Proceedings of the 23rd ACM
International Conference on Conference on Information and
Knowledge Management. CIKM ’14. Shanghai, China: ACM,


2014, pp. 499–508. url: http://doi.acm.org/10.1145/


2661829.2661887 (cit. on pp. 103, 117, 122, 124, 125).
[GB16] Zhaochen Guo and Denilson Barbosa. Robust Named

Entity Disambiguation with Random Walks.“ In: (2016)
(cit. on pp. 131, 139, 140, 143, 144).
[GBH18a] Octavian Ganea, Gary Becigneul, and Thomas Hofmann.
Hyperbolic Neural Networks.“ In: Advances in Neural In-

formation Processing Systems 31. Ed. by S. Bengio, H. Wal-
lach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and
R. Garnett. Curran Associates, Inc., 2018, pp. 5345–5355.
url: http://papers.nips.cc/paper/7780-hyperbolic-
neural-networks.pdf (cit. on pp. ix, 24, 27, 39, 59, 70, 82).
[GBH18b] Octavian-Eugen Ganea, Gary Bécigneul, and Thomas Hof-
mann. Hyperbolic Entailment Cones for Learning Hi-

erarchical Embeddings.“ In: Proceedings of the 35th Inter-
national Conference on Machine Learning, ICML 2018, Stock-
holmsmässan, Stockholm, Sweden, July 10-15, 2018. 2018, pp. 1632–
1641. url: http://proceedings.mlr.press/v80/ganea18a.
html (cit. on pp. ix, 15, 23, 27, 41).
[GF17] Palash Goyal and Emilio Ferrara. Graph embedding tech-

niques, applications, and performance: A survey.“ In: arXiv
preprint arXiv:1705.02801 (2017) (cit. on p. 1).
[GH17] Octavian-Eugen Ganea and Thomas Hofmann. Deep Joint

Entity Disambiguation with Local Neural Attention.“ In:
Proceedings of the 2017 Conference on Empirical Methods in
Natural Language Processing. 2017, pp. 2619–2629 (cit. on
pp. ix, 1, 25, 129).
[GHL90] Sylvestre Gallot, Dominique Hulin, and Jacques Lafontaine.
Riemannian geometry. Vol. 3. Springer, 1990 (cit. on p. 20).
[GL16] Aditya Grover and Jure Leskovec. node2vec: Scalable fea-

ture learning for networks.“ In: Proceedings of the 22nd ACM
SIGKDD international conference on Knowledge discovery and
data mining. ACM. 2016, pp. 855–864 (cit. on pp. 1, 5).
[Glo+16] Amir Globerson, Nevena Lazic, Soumen Chakrabarti, Amar-
nag Subramanya, Michael Ringgaard, and Fernando Pereira.
Collective Entity Resolution with Multi-Focal Attention.“

In: ACL (1). 2016 (cit. on pp. 130, 131, 139, 143–145).


[Gro87] Mikhael Gromov. Hyperbolic groups.“ In: Essays in group



theory. Springer, 1987, pp. 75–263 (cit. on pp. iii, v, 15, 21,
23, 33, 42, 89).
[GRS13] Evgeniy Gabrilovich, Michael Ringgaard, and Amarnag
Subramanya. FACC1: Freebase annotation of ClueWeb

corpora, Version 1 (Release date 2013-06-26, Format ver-
sion 1, Correction level 0).“ In: Note: http://lemurproject.
org/clueweb09/FACC1/Cited by 5 (2013) (cit. on p. 140).
[Gu+19] Albert Gu, Frederic Sala, Beliz Gunel, and Christopher Ré.
Learning Mixed-Curvature Representations in Product

Spaces.“ In: International Conference on Learning Represen-
tations. 2019. url: https://openreview.net/forum?id=
HJxeWnCcF7 (cit. on pp. 5, 13, 20, 150).
[HA10] Christopher Hopper and Ben Andrews. The Ricci flow in
Riemannian geometry. 2010 (cit. on p. 27).
[Ham17] Matthias Hamann. On the tree-likeness of hyperbolic

spaces.“ In: Mathematical Proceedings of the Cambridge Philo-
sophical Society (2017), pp. 1–17 (cit. on pp. 23, 42).
[HC14] Neil Houlsby and Massimiliano Ciaramita. A scalable

gibbs sampler for probabilistic entity linking.“ In: Advances
in Information Retrieval. Springer, 2014, pp. 335–346 (cit. on
pp. 103, 117, 118, 120, 124, 125).
[He+13] Zhengyan He, Shujie Liu, Mu Li, Ming Zhou, Longkai
Zhang, and Houfeng Wang. Learning Entity Representa-

tion for Entity Disambiguation.“ In: ACL (2). 2013, pp. 30–
34 (cit. on pp. 101, 104, 130).
[HGG18] Till Haug, Octavian-Eugen Ganea, and Paulina Grnarova.
Neural multi-step reasoning for question answering on

semi-structured tables.“ In: European Conference on Informa-
tion Retrieval. Springer. 2018, pp. 611–617 (cit. on p. x).
[HHJ15] Hongzhao Huang, Larry Heck, and Heng Ji. Leveraging

deep neural networks and knowledge graphs for entity
disambiguation.“ In: arXiv preprint arXiv:1504.07678 (2015)
(cit. on pp. 130, 141, 143).


[Hof+11a] Johannes Hoffart, Mohamed Amir Yosef, Ilaria Bordino,


Hagen Fürstenau, Manfred Pinkal, Marc Spaniol, Bilyana
Taneva, Stefan Thater, and Gerhard Weikum. Robust dis-

ambiguation of named entities in text.“ In: Proceedings of
the Conference on Empirical Methods in Natural Language Pro-
cessing. Association for Computational Linguistics. 2011,
pp. 782–792 (cit. on pp. 103, 120, 124).
[Hof+11b] Johannes Hoffart, Mohamed Amir Yosef, Ilaria Bordino,
Hagen Fürstenau, Manfred Pinkal, Marc Spaniol, Bilyana
Taneva, Stefan Thater, and Gerhard Weikum. Robust dis-

ambiguation of named entities in text.“ In: Proceedings of
the Conference on Empirical Methods in Natural Language Pro-
cessing. Association for Computational Linguistics. 2011,
pp. 782–792 (cit. on pp. 131, 139, 144).
[HRH02] Peter D Hoff, Adrian E Raftery, and Mark S Handcock.
Latent space approaches to social network analysis.“ In:

Journal of the american Statistical association 97.460 (2002),
pp. 1090–1098 (cit. on p. 1).
[HS12] Xianpei Han and Le Sun. An entity-topic model for en-

tity linking.“ In: Proceedings of the 2012 Joint Conference on
Empirical Methods in Natural Language Processing and Com-
putational Natural Language Learning. Association for Com-
putational Linguistics. 2012, pp. 105–115 (cit. on p. 103).
[HSZ11] Xianpei Han, Le Sun, and Jun Zhao. Collective entity

linking in web text: a graph-based method.“ In: Proceedings
of the 34th international ACM SIGIR conference on Research
and development in Information Retrieval. ACM. 2011, pp. 765–
774 (cit. on pp. 103, 124).
[IM04] Piotr Indyk and Jiří Matoušek. Low-distortion embed-

dings of finite metric spaces.“ In: Handbook of discrete and
computational geometry 37 (2004), p. 46 (cit. on pp. 14, 15).
[Jay82] Edwin T Jaynes. On the rationale of maximum-entropy

methods.“ In: Proceedings of the IEEE 70.9 (1982), pp. 939–
952 (cit. on p. 105).
[Jef46] Harold Jeffreys. An invariant form for the prior proba-

bility in estimation problems.“ In: Proc. R. Soc. Lond. A
186.1007 (1946), pp. 453–461 (cit. on p. 87).


[Ji16] Heng Ji. Entity discovery and linking reading list.“ In:

(2016). url: http://nlp.cs.rpi.edu/kbp/2014/elreading.
html (cit. on p. 130).
[Kat+11] Saurabh S Kataria, Krishnan S Kumar, Rajeev R Rastogi,
Prithviraj Sen, and Srinivasan H Sengamedu. Entity dis-

ambiguation with hierarchical topic models.“ In: Proceed-
ings of the 17th ACM SIGKDD international conference on
Knowledge discovery and data mining. ACM. 2011, pp. 1037–
1045 (cit. on p. 103).
[KB14] Diederik Kingma and Jimmy Ba. Adam: A method for

stochastic optimization.“ In: arXiv preprint arXiv:1412.6980
(2014) (cit. on pp. 1, 72, 143).
[KGH18] Nikolaos Kolitsas, Octavian-Eugen Ganea, and Thomas
Hofmann. End-to-End Neural Entity Linking.“ In: Pro-

ceedings of the 22nd Conference on Computational Natural
Language Learning. 2018, pp. 519–529 (cit. on p. x).
[Kim14] Yoon Kim. Convolutional neural networks for sentence

classification.“ In: arXiv preprint arXiv:1408.5882 (2014) (cit.
on p. 6).
[Kir+15] Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard
Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler.
Skip-thought vectors.“ In: Advances in neural information

processing systems. 2015, pp. 3294–3302 (cit. on p. 1).
[Kri+09] Dmitri Krioukov, Fragkiskos Papadopoulos, Marián Boguñá,
and Amin Vahdat. Greedy forwarding in scale-free net-

works embedded in hyperbolic metric spaces.“ In: ACM
SIGMETRICS Performance Evaluation Review 37.2 (2009),
pp. 15–17 (cit. on p. 42).
[Kri+10] Dmitri Krioukov, Fragkiskos Papadopoulos, Maksim Kit-
sak, Amin Vahdat, and Marián Boguná. Hyperbolic ge-

ometry of complex networks.“ In: Physical Review E 82.3
(2010), p. 036106 (cit. on pp. iii, v, 10, 18, 20, 23, 42, 82).
[Kul+09] Sayali Kulkarni, Amit Singh, Ganesh Ramakrishnan, and
Soumen Chakrabarti. Collective annotation of Wikipedia

entities in web text.“ In: Proceedings of the 15th ACM SIGKDD
international conference on Knowledge discovery and data min-
ing. ACM. 2009, pp. 457–466 (cit. on p. 103).


[L+01] John Lafferty, Andrew McCallum, Fernando Pereira, et


al. Conditional random fields: Probabilistic models for

segmenting and labeling sequence data.“ In: Proceedings
of the eighteenth international conference on machine learning,
ICML. Vol. 1. 2001, pp. 282–289 (cit. on p. 131).
[Laz+15] Nevena Lazic, Amarnag Subramanya, Michael Ringgaard,
and Fernando Pereira. Plato: A selective context model

for entity resolution.“ In: Transactions of the Association
for Computational Linguistics 3 (2015), pp. 503–515 (cit. on
pp. 130, 133, 143).
[LG14] Omer Levy and Yoav Goldberg. Linguistic Regularities

in Sparse and Explicit Word Representations.“ In: Proceed-
ings of the Eighteenth Conference on Computational Natural
Language Learning. Ann Arbor, Michigan: Association for
Computational Linguistics, 2014, pp. 171–180. url: http:
//www.aclweb.org/anthology/W14-1618 (cit. on p. 91).
[LGD15] Omer Levy, Yoav Goldberg, and Ido Dagan. Improving

Distributional Similarity with Lessons Learned from Word
Embeddings.“ In: Transactions of the Association for Compu-
tational Linguistics 3 (2015), pp. 211–225. url: https://
transacl.org/ojs/index.php/tacl/article/view/570
(cit. on pp. 91, 92).
[LL04] Guy Lebanon and John Lafferty. Hyperplane margin clas-

sifiers on the multinomial manifold.“ In: Proceedings of the
international conference on machine learning (ICML). ACM.
2004, p. 66 (cit. on p. 61).
[LRP95] John Lamping, Ramana Rao, and Peter Pirolli. A focus+

context technique based on hyperbolic geometry for visu-
alizing large hierarchies.“ In: Proceedings of the SIGCHI
conference on Human factors in computing systems. ACM
Press/Addison-Wesley Publishing Co. 1995, pp. 401–408
(cit. on pp. 23, 42).
[LW18] Matthias Leimeister and Benjamin J Wilson. Skip-gram

word embeddings in hyperbolic space.“ In: arXiv preprint
arXiv:1809.01498 (2018) (cit. on p. 83).
[Mat13] Jiří Matoušek. Lecture notes on metric embeddings. Tech. rep.
Technical report, ETH Zürich, 2013 (cit. on pp. 4, 11–15).


[MC07] Rada Mihalcea and Andras Csomai. Wikify!: linking doc-



uments to encyclopedic knowledge.“ In: Proceedings of the
sixteenth ACM conference on Conference on information and
knowledge management. ACM. 2007, pp. 233–242 (cit. on
p. 102).
[MC18] Boris Muzellec and Marco Cuturi. Generalizing Point

Embeddings using the Wasserstein Space of Elliptical Dis-
tributions.“ In: arXiv preprint arXiv:1805.07594 (2018) (cit.
on pp. 82, 83).
[Men+11] Pablo N Mendes, Max Jakob, Andrés García-Silva, and
Christian Bizer. DBpedia spotlight: shedding light on the

web of documents.“ In: Proceedings of the 7th International
Conference on Semantic Systems. ACM. 2011, pp. 1–8 (cit. on
p. 124).
[Mik+13a] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean.
Efficient estimation of word representations in vector

space.“ In: arXiv preprint arXiv:1301.3781 (2013) (cit. on
p. 92).
[Mik+13b] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado,
and Jeff Dean. Distributed representations of words and

phrases and their compositionality.“ In: Advances in neural
information processing systems. 2013, pp. 3111–3119 (cit. on
pp. 1, 5, 81, 130, 132).
[Mil+90] George A Miller, Richard Beckwith, Christiane Fellbaum,
Derek Gross, and Katherine J Miller. Introduction to Word-

Net: An on-line lexical database.“ In: International journal
of lexicography 3.4 (1990), pp. 235–244 (cit. on pp. 53, 58, 83,
86).
[MRN14] Andrea Moro, Alessandro Raganato, and Roberto Nav-
igli. Entity Linking meets Word Sense Disambiguation:

a Unified Approach.“ In: Transactions of the Association for
Computational Linguistics (TACL) 2 (2014), pp. 231–244 (cit.
on pp. 103, 124).
[MW08] David Milne and Ian H Witten. Learning to link with

wikipedia.“ In: Proceedings of the 17th ACM conference on
Information and knowledge management. ACM. 2008, pp. 509–
518 (cit. on pp. 102, 122, 124, 141, 144).


[MWJ99a] Kevin P. Murphy, Yair Weiss, and Michael I. Jordan. Loopy



Belief Propagation for Approximate Inference: An Empir-
ical Study.“ In: Proceedings of the Fifteenth Conference on
Uncertainty in Artificial Intelligence. UAI’99. Stockholm, Swe-
den: Morgan Kaufmann Publishers Inc., 1999, pp. 467–475.
url: http://dl.acm.org/citation.cfm?id=2073796.
2073849 (cit. on p. 116).
[MWJ99b] Kevin P Murphy, Yair Weiss, and Michael I Jordan. Loopy

belief propagation for approximate inference: An empirical
study.“ In: Proceedings of the Fifteenth conference on Uncer-
tainty in artificial intelligence. Morgan Kaufmann Publishers
Inc. 1999, pp. 467–475 (cit. on p. 131).
[Ngu+17] Kim Anh Nguyen, Maximilian Köper, Sabine Schulte im
Walde, and Ngoc Thang Vu. Hierarchical Embeddings

for Hypernymy Detection and Directionality.“ In: Proceed-
ings of the 2017 Conference on Empirical Methods in Natural
Language Processing. 2017, pp. 233–243 (cit. on pp. 83, 94,
97).
[Ni+15] Chien-Chun Ni, Yu-Yao Lin, Jie Gao, Xianfeng David Gu,
and Emil Saucan. Ricci curvature of the Internet topol-

ogy.“ In: 2015 IEEE Conference on Computer Communications
(INFOCOM). IEEE. 2015, pp. 2758–2766 (cit. on p. 10).
[NK17] Maximillian Nickel and Douwe Kiela. Poincaré embed-

dings for learning hierarchical representations.“ In: Ad-
vances in neural information processing systems. 2017, pp. 6338–
6347 (cit. on pp. iii, v, 10, 16, 18, 20, 23, 29, 30, 42, 43, 52–57,
59, 71, 72, 75, 76, 83, 86, 88, 94, 96, 97).
[NK18] Maximilian Nickel and Douwe Kiela. Learning Continu-

ous Hierarchies in the Lorentz Model of Hyperbolic Geom-
etry.“ In: International Conference on Machine Learning. 2018
(cit. on pp. 23, 59, 83).
[NTK11] Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel.
A Three-Way Model for Collective Learning on Multi-

Relational Data.“ In: 2011 (cit. on pp. 1, 41).
[Par13] Jouni Parkkonen. Hyperbolic Geometry.“ In: (2013) (cit.

on pp. 20, 29, 31, 33).


[PF14] Francesco Piccinno and Paolo Ferragina. From TagME to



WAT: a new entity annotator.“ In: Proceedings of the first
international workshop on Entity recognition & disambiguation.
ACM. 2014, pp. 55–62 (cit. on p. 124).
[PHG15] Maria Pershina, Yifan He, and Ralph Grishman. Person-

alized Page Rank for Named Entity Disambiguation.“ In:
2015 (cit. on pp. 131, 144).
[PP11] Anja Pilz and Gerhard Paaß. From names to entities us-

ing thematic context distance.“ In: Proceedings of the 20th
ACM international conference on Information and knowledge
management. ACM. 2011, pp. 857–866 (cit. on p. 103).
[PSM14] Jeffrey Pennington, Richard Socher, and Christopher D
Manning. Glove: Global Vectors for Word Representa-

tion.“ In: EMNLP. Vol. 14. 2014, pp. 1532–43 (cit. on pp. 1,
5, 81, 84, 85, 91, 130).
[Rat+11] Lev Ratinov, Dan Roth, Doug Downey, and Mike Ander-
son. Local and global algorithms for disambiguation to

wikipedia.“ In: Proceedings of the 49th Annual Meeting of the
Association for Computational Linguistics: Human Language
Technologies-Volume 1. Association for Computational Lin-
guistics. 2011, pp. 1375–1384 (cit. on pp. 103, 122, 124, 131,
144).
[RCW15] Alexander M Rush, Sumit Chopra, and Jason Weston. A

Neural Attention Model for Abstractive Sentence Summa-
rization.“ In: Proceedings of the 2015 Conference on Empirical
Methods in Natural Language Processing. 2015, pp. 379–389
(cit. on p. 6).
[Rec+11] Benjamin Recht, Christopher Re, Stephen Wright, and Feng
Niu. Hogwild: A Lock-Free Approach to Parallelizing

Stochastic Gradient Descent.“ In: Advances in Neural Infor-
mation Processing Systems 24. Ed. by J. Shawe-Taylor, R. S.
Zemel, P. Bartlett, F. C. N. Pereira, and K. Q. Weinberger. 2011,
pp. 693–701. url: http://books.nips.cc/papers/files/
nips24/NIPS2011_0485.pdf (cit. on p. 109).
[RET14] Giuseppe Rizzo, Marieke van Erp, and Raphaël Troncy.
Benchmarking the extraction and disambiguation of named

entities on the semantic web.“ In: Proceedings of the 9th In-


ternational Conference on Language Resources and Evaluation.


2014 (cit. on p. 124).
[RH07] Antonio Robles-Kelly and Edwin R Hancock. A Rieman-

nian approach to graph embedding.“ In: Pattern Recognition
40.3 (2007), pp. 1042–1056 (cit. on pp. iii, v, 10).
[Roc+15] Tim Rocktäschel, Edward Grefenstette, Karl Moritz Her-
mann, Tomáš Kočiský, and Phil Blunsom. Reasoning about

entailment with neural attention.“ In: arXiv preprint arXiv:1509.06664
(2015) (cit. on pp. 1, 6, 41).
[RS00] Sam T Roweis and Lawrence K Saul. Nonlinear dimen-

sionality reduction by locally linear embedding.“ In: science
290.5500 (2000), pp. 2323–2326 (cit. on p. 8).
[RS11] Joel W Robbin and Dietmar A Salamon. Introduction to

differential geometry.“ In: ETH, Lecture Notes, preliminary
version, January (2011) (cit. on pp. 18, 20, 31).
[S+15] Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al.
End-to-end memory networks.“ In: Advances in neural

information processing systems. 2015, pp. 2440–2448 (cit. on
p. 130).
[San+15] Romeil Sandhu, Tryphon Georgiou, Ed Reznik, Liangjia
Zhu, Ivan Kolesov, Yasin Senbabaoglu, and Allen Tan-
nenbaum. Graph curvature for differentiating cancer net-

works.“ In: Scientific reports 5 (2015), p. 12323 (cit. on p. 10).
[Sar11] Rik Sarkar. Low distortion delaunay embedding of trees

in hyperbolic plane.“ In: International Symposium on Graph
Drawing. Springer. 2011, pp. 355–366 (cit. on pp. 34, 42).
[SC12a] Valentin I. Spitkovsky and Angel X. Chang. A Cross-

Lingual Dictionary for English Wikipedia Concepts.“ In:
Proceedings of the Eighth International Conference on Language
Resources and Evaluation (LREC 2012). Istanbul, Turkey, May
2012. url: pubs/crosswikis.pdf (cit. on p. 114).
[SC12b] Valentin I Spitkovsky and Angel X Chang. A Cross-Lingual

Dictionary for English Wikipedia Concepts.“ In: 2012 (cit.
on p. 139).


[SGD16] Vered Shwartz, Yoav Goldberg, and Ido Dagan. Improv-



ing Hypernymy Detection with an Integrated Path-based
and Distributional Method.“ In: Proceedings of the 54th An-
nual Meeting of the Association for Computational Linguistics
(Volume 1: Long Papers). Vol. 1. 2016, pp. 2389–2398 (cit. on
pp. 1, 41).
[Sin+13] Sameer Singh, Sebastian Riedel, Brian Martin, Jiaping
Zheng, and Andrew McCallum. Joint inference of en-

tities, relations, and coreference.“ In: Proceedings of the 2013
workshop on Automated knowledge base construction. ACM.
2013, pp. 1–6 (cit. on p. 103).
[Sin+18] Sidak Pal Singh, Andreas Hug, Aymeric Dieuleveut, and
Martin Jaggi. Wasserstein is all you need.“ In: arXiv preprint

arXiv:1808.09663 (2018) (cit. on p. 83).
[SM09] Charles Sutton and Andrew McCallum. Piecewise train-

ing for structured prediction.“ English. In: Machine Learning
77.2-3 (2009), pp. 165–194. url: http://dx.doi.org/10.
1007/s10994-009-5112-z (cit. on p. 109).
[Spi79] Michael Spivak. A comprehensive introduction to differ-

ential geometry. Volume four.“ In: (1979) (cit. on p. 27).
[SS13] Nadine Steinmetz and Harald Sack. Semantic multimedia

information retrieval based on contextual descriptions.“ In:
The Semantic Web: Semantics and Big Data. Springer, 2013,
pp. 382–396 (cit. on p. 124).
[SSM98] Bernhard Schölkopf, Alexander Smola, and Klaus-Robert
Müller. Nonlinear component analysis as a kernel eigen-

value problem.“ In: Neural computation 10.5 (1998), pp. 1299–
1319 (cit. on p. 8).
[ST08] Yuval Shavitt and Tomer Tankel. Hyperbolic embedding

of internet graph for distance estimation and overlay con-
struction.“ In: IEEE/ACM Transactions on Networking (TON)
16.1 (2008), pp. 25–36 (cit. on pp. 23, 42).
[Sun+15] Yaming Sun, Lei Lin, Duyu Tang, Nan Yang, Zhenzhou
Ji, and Xiaolong Wang. Modeling Mention, Context and

Entity with Neural Networks for Entity Disambiguation.“
In: IJCAI. 2015, pp. 1333–1339 (cit. on pp. 104, 131, 139).


[SY13] Avirup Sil and Alexander Yates. Re-ranking for joint



named-entity recognition and linking.“ In: Proceedings of
the 22nd ACM international conference on Conference on infor-
mation & knowledge management. ACM. 2013, pp. 2369–2374
(cit. on p. 124).
[TBG19] Alexandru Tifrea, Gary Becigneul, and Octavian-Eugen
Ganea. Poincare Glove: Hyperbolic Word Embeddings.“

In: International Conference on Learning Representations. 2019.
url: https : / / openreview . net / forum ? id = Ske5r3AqK7
(cit. on pp. ix, 24, 81, 88–90, 92, 94, 96, 97).
[TDL00] Joshua B Tenenbaum, Vin De Silva, and John C Langford.
A global geometric framework for nonlinear dimension-

ality reduction.“ In: science 290.5500 (2000), pp. 2319–2323
(cit. on pp. 4, 6, 8, 9).
[TO18] Corentin Tallec and Yann Ollivier. Can recurrent neural

networks warp time?“ In: Proceedings of the International
Conference on Learning Representations (ICLR). 2018 (cit. on
p. 70).
[Tor52] Warren S Torgerson. Multidimensional scaling: I. Theory

and method.“ In: Psychometrika 17.4 (1952), pp. 401–419
(cit. on pp. 5, 6, 8).
[Tri+18a] Valentin Trifonov, Octavian-Eugen Ganea, Anna Potapenko,
and Thomas Hofmann. Learning and Evaluating Sparse

Interpretable Sentence Embeddings.“ In: Proceedings of the
2018 EMNLP Workshop BlackboxNLP: Analyzing and Inter-
preting Neural Networks for NLP. 2018, pp. 200–210 (cit. on
p. x).
[Tri+18b] Nilesh Tripuraneni, Nicolas Flammarion, Francis Bach,
and Michael I Jordan. Averaging Stochastic Gradient De-

scent on Riemannian Manifolds.“ In: Proceedings of Machine
Learning Research vol 75 (2018), pp. 1–38 (cit. on p. 18).
[Ung01] Abraham A Ungar. Hyperbolic trigonometry and its appli-

cation in the Poincaré ball model of hyperbolic geometry.“
In: Computers & Mathematics with Applications 41.1-2 (2001),
pp. 135–147 (cit. on p. 34).


[Ung08] Abraham Albert Ungar. A gyrovector space approach to



hyperbolic geometry.“ In: Synthesis Lectures on Mathematics
and Statistics 1.1 (2008), pp. 1–194 (cit. on pp. 23, 34, 36, 37,
39, 60).
[Ung12] Abraham A Ungar. Beyond the Einstein addition law and its
gyroscopic Thomas precession: The theory of gyrogroups and
gyrovector spaces. Vol. 117. Springer Science & Business
Media, 2012 (cit. on p. 87).
[Ung14] Abraham Albert Ungar. Analytic hyperbolic geometry in n
dimensions: An introduction. CRC Press, 2014 (cit. on p. 62).
[Usb+14] Ricardo Usbeck, Axel-Cyrille Ngonga Ngomo, Michael
Röder, Daniel Gerber, Sandro Athaide Coelho, Sören Auer,
and Andreas Both. AGDISTIS-graph-based disambigua-

tion of named entities using linked data.“ In: The Semantic
Web–ISWC 2014. Springer, 2014, pp. 457–471 (cit. on p. 124).
[Usb+15] Ricardo Usbeck, Michael Röder, Axel-Cyrille Ngonga Ngomo,
Ciro Baron, Andreas Both, Martin Brümmer, Diego Cec-
carelli, Marco Cornolti, Didier Cherix, Bernd Eickmann,
et al. GERBIL: General Entity Annotator Benchmarking

Framework.“ In: Proceedings of the 24th International Con-
ference on World Wide Web. International World Wide Web
Conferences Steering Committee. 2015, pp. 1133–1143 (cit.
on pp. 102, 117, 120).
[Ven+15] Ivan Vendrov, Ryan Kiros, Sanja Fidler, and Raquel Ur-
tasun. Order-embeddings of images and language.“ In:

arXiv preprint arXiv:1511.06361 (2015) (cit. on pp. 41–43, 52,
53, 55, 57, 71, 83, 96).
[Ver05] J Vermeer. A geometric interpretation of Ungar’s addition

and of gyration in the hyperbolic plane.“ In: Topology and
its Applications 152.3 (2005), pp. 226–242 (cit. on p. 35).
[VGE18] Thijs Vogels, Octavian-Eugen Ganea, and Carsten Eickhoff.
Web2text: Deep structured boilerplate removal.“ In: Eu-

ropean Conference on Information Retrieval. Springer. 2018,
pp. 167–179 (cit. on p. x).


[Vil+18] Luke Vilnis, Xiang Li, Shikhar Murty, and Andrew McCal-
lum. Probabilistic Embedding of Knowledge Graphs with

Box Lattice Measures.“ In: arXiv preprint arXiv:1805.06627
(2018) (cit. on p. 83).
[Vis+06] S. V. N. Vishwanathan, Nicol N. Schraudolph, Mark W.
Schmidt, and Kevin P. Murphy. Accelerated Training

of Conditional Random Fields with Stochastic Gradient
Methods.“ In: Proceedings of the 23rd International Conference
on Machine Learning. ICML ’06. Pittsburgh, Pennsylvania,
USA: ACM, 2006, pp. 969–976. url: http://doi.acm.org/
10.1145/1143844.1143966 (cit. on p. 109).
[VM15] Luke Vilnis and Andrew McCallum. Word representa-

tions via gaussian embedding.“ In: ICLR (2015) (cit. on
pp. 82, 83, 86–88).
[VM18] Ivan Vulić and Nikola Mrkšić. Specialising Word Vectors

for Lexical Entailment.“ In: Proceedings of the 2018 Con-
ference of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies, Vol-
ume 1 (Long Papers). Vol. 1. 2018, pp. 1134–1145 (cit. on
p. 83).
[VS14] Kevin Verbeek and Subhash Suri. Metric embedding, hy-

perbolic space, and social networks.“ In: Proceedings of the
thirtieth annual symposium on Computational geometry. ACM.
2014, p. 501 (cit. on pp. 10, 23).
[Vul+17] Ivan Vulić, Daniela Gerz, Douwe Kiela, Felix Hill, and
Anna Korhonen. Hyperlex: A large-scale evaluation of

graded lexical entailment.“ In: Computational Linguistics
43.4 (2017), pp. 781–835 (cit. on p. 94).
[Wee+14] Julie Weeds, Daoud Clarke, Jeremy Reffin, David Weir,
and Bill Keller. Learning to distinguish hypernyms and

co-hyponyms.“ In: Proceedings of COLING 2014, the 25th
International Conference on Computational Linguistics: Tech-
nical Papers. Dublin City University and Association for
Computational Linguistics. 2014, pp. 2249–2259 (cit. on
p. 97).


[Wil+14] Richard C Wilson, Edwin R Hancock, Elżbieta Pekalska,


and Robert PW Duin. Spherical and hyperbolic embed-

dings of data.“ In: IEEE transactions on pattern analysis
and machine intelligence 36.11 (2014), pp. 2255–2269 (cit. on
pp. iii, v, 10, 11, 16, 20, 59).
[Xu12] Weiping Xu. Non-Euclidean dissimilarity data in pattern

recognition.“ PhD thesis. University of York, 2012 (cit. on
pp. 11, 17).
[Yam+16] Ikuya Yamada, Hiroyuki Shindo, Hideaki Takeda, and
Yoshiyasu Takefuji. Joint Learning of the Embedding of

Words and Entities for Named Entity Disambiguation.“ In:
CoNLL 2016 (2016), p. 250 (cit. on pp. 130, 133, 139, 141,
143, 144).
[YFW00] J.S. Yedidia, W.T. Freeman, and Y. Weiss. Generalized

Belief Propagation.“ In: Advances in Neural Information Pro-
cessing Systems (NIPS). Vol. 13. Dec. 2000, pp. 689–695. url:
http://www.merl.com/publications/TR2000-26 (cit. on
p. 110).
[ZL04] Chengxiang Zhai and John Lafferty. A study of smoothing

methods for language models applied to information re-
trieval.“ In: ACM Transactions on Information Systems (TOIS)
22.2 (2004), pp. 179–214 (cit. on p. 115).
[ZRS16] Hongyi Zhang, Sashank J Reddi, and Suvrit Sra. Rieman-

nian SVRG: Fast stochastic optimization on Riemannian
manifolds.“ In: Advances in Neural Information Processing
Systems. 2016, pp. 4592–4600 (cit. on p. 18).
[ZS16] Hongyi Zhang and Suvrit Sra. First-order methods for

geodesically convex optimization.“ In: Conference on Learn-
ing Theory. 2016, pp. 1617–1638 (cit. on p. 18).
[ZS18] Hongyi Zhang and Suvrit Sra. Towards Riemannian Accel-

erated Gradient Methods.“ In: arXiv preprint arXiv:1806.02812
(2018) (cit. on p. 18).
[ZSG16] Stefan Zwicklbauer, Christin Seifert, and Michael Granitzer.
Robust and collective entity disambiguation through se-

mantic embeddings.“ In: Proceedings of the 39th International
ACM SIGIR conference on Research and Development in Infor-
mation Retrieval. ACM. 2016, pp. 425–434 (cit. on p. 130).
