Non-Euclidean Neural
Representation Learning of Words,
Entities and Hierarchies
Doctoral Thesis
Author(s):
Ganea, Octavian-Eugen
Publication date:
2019
Permanent link:
https://doi.org/10.3929/ethz-b-000397726
Rights / license:
In Copyright - Non-Commercial Use Permitted
diss. eth no. 26043
NON-EUCLIDEAN NEURAL REPRESENTATION LEARNING
OF WORDS, ENTITIES AND HIERARCHIES
presented by
octavian-eugen ganea
born on 26.09.1987
citizen of Romania
2019
Octavian-Eugen Ganea: Non-Euclidean Neural Representation Learning of
Words, Entities and Hierarchies, © 2019
ABSTRACT
for inducing a partial order over the embedding space by exploiting geodesically convex cones for modeling antisymmetric entailment relations. We show that these cones have an appealing optimal closed-form expression. Empirically, we demonstrate improved representational capacity on the link prediction task for word hypernymy detection.
Second, we tackle the question: how can hyperbolic embeddings be used together with popular deep learning architectures for downstream tasks? In our journey, we rely on the gyrovector space formalism to take concrete steps towards generalizing popular deep learning methods to hyperbolic spaces in a principled manner. Our models can be deformed into their respective Euclidean counterparts when the space is continuously flattened.
Third, we dive into NLP and learn word representations in products
of hyperbolic spaces. To our knowledge, we present the first fully
unsupervised word embeddings that are simultaneously competitive on three different tasks: semantic similarity, hypernymy prediction and semantic analogy. The key aspect is leveraging the connection between products of hyperbolic spaces and Gaussian embeddings.
Finally, this thesis investigates closely how to learn good (neural) representations for data with power-law distributions, such as entities in a knowledge base (e.g. Wikipedia). Towards this end, we focus on solving the core NLP task of “entity linking”, i.e. finding entities in raw text corpora and linking them to knowledge bases. We push state-of-the-art accuracies on popular datasets by leveraging entity embeddings, attention mechanisms, probabilistic graphical models and unrolled approximate inference networks using truncated loopy belief propagation.
ABSTRACT

However, there is still a strong demand for generalizing popular Machine Learning (ML) techniques and Euclidean models, while achieving competitive results on the usual tasks and applications. Reducing this gap between Euclidean and hyperbolic geometry for ML is one of the main goals of this thesis.

Our first contribution is a novel, theoretically grounded method for embedding directed acyclic graphs in hyperbolic space by introducing geodesically convex cones for modeling entailment relations. These cones induce a partial order relation on the ambient space and admit an appealing optimal closed-form expression. We demonstrate increased representational capacity on the link prediction task for detecting hypernymy relations between words.

Second, we address the question: how can hyperbolic representations be fed as input to popular deep learning architectures? In this quest, we rely on the formalism of gyrovector spaces to take concrete steps towards generalizing popular deep learning methods to hyperbolic spaces in a theoretically grounded way. Our models can be continuously deformed into their respective Euclidean counterparts.

Third, we dive into NLP and the unsupervised learning of word representations in products of hyperbolic spaces. To our knowledge, we present the first word representations, obtained by unsupervised learning, that are simultaneously competitive on three different tasks: semantic similarity, hypernymy prediction and semantic analogy. The essential aspect is to leverage the connection between products of hyperbolic spaces and the representation of Gaussian distributions.

Finally, this thesis closely studies how to learn good neural representations for data with long-tailed distributions, such as the entities of a knowledge base like Wikipedia. To do so, we focus on solving a core NLP task, “entity linking”, that is, finding entities in arbitrary text corpora and linking them to knowledge bases. We push the state of the art on popular datasets by exploiting entity representations, probabilistic graphical models, unrolled approximate inference networks using truncated loopy belief propagation, and attention mechanisms.
PUBLICATIONS
• Octavian-Eugen Ganea, Sylvain Gelly, Gary Bécigneul and Aliaksei Severyn. “Breaking the Softmax Bottleneck via Learnable Monotonic Pointwise Non-linearities.” International Conference on Machine Learning (ICML 2019). [Gan+19].
ACKNOWLEDGMENTS
CONTENTS

acronyms
1 introduction and motivation
1.1 Representation Learning
1.1.1 (Dis)Similarity Representation and Metric Spaces
1.1.2 Embedding Maps, Isometric Functions and Distortion Measures
1.2 Representations in the Euclidean Space
1.2.1 Pairwise Distance and Similarity (Gram) Matrices
1.2.2 Manifold Learning
1.3 Properties and Constraints of Euclidean Embedding Spaces
1.3.1 Positive Semi-definite Gram Matrices
1.3.2 Short Diagonals Lemma
1.3.3 Ptolemy's Inequality and Ptolemaic Graphs
1.3.4 Distortions of Arbitrary Metric Embeddings in the Euclidean Space. Bourgain's Embedding Theorem
1.3.5 Johnson-Lindenstrauss Lemma
1.3.6 Distortions of Euclidean Graph Embeddings
1.3.7 Non-flat Manifold Data
1.3.8 Other Causes of Non-Euclidean Data
1.3.9 Curse of Dimensionality
1.4 Non-Euclidean Geometric Spaces for Embedding Specific Types of Data
1.4.1 Constant Curvature Spaces
1.4.2 Non-Metric Spaces for Neural Entity Disambiguation
1.5 Hyperbolic Spaces – an Intuition
1.6 Thesis Contributions
2 hyperbolic geometry
2.1 Short Overview of Differential Geometry
2.2 Hyperbolic Space: the Poincaré Ball
2.2.1 Gyrovector Spaces
2.2.2 Connecting Gyrovector Spaces with the Riemannian Geometry of the Poincaré Ball
bibliography
LIST OF FIGURES

Figure 4.2: Test accuracies for various models and four datasets. “Eucl” denotes Euclidean, “Hyp” denotes hyperbolic. All word and sentence embeddings have dimension 5. We highlight in bold the best method (or methods, if the difference is less than 0.5%).
Figure 4.3: PREFIX-30% accuracy and first (premise) sentence norm plots for different runs of the same architecture: hyperbolic GRU followed by hyperbolic FFNN and hyperbolic/Euclidean (half-half) MLR. The X axis shows millions of training examples processed.
Figure 4.4: PREFIX-30% accuracy and first (premise) sentence norm plots for different runs of the same architecture: Euclidean GRU followed by Euclidean FFNN and Euclidean MLR. The X axis shows millions of training examples processed.
Figure 4.5: PREFIX-30% accuracy and first (premise) sentence norm plots for different runs of the same architecture: hyperbolic RNN followed by hyperbolic FFNN and hyperbolic MLR. The X axis shows millions of training examples processed.
Figure 4.6: Hyperbolic (left) vs direct Euclidean (right) binary MLR used to classify nodes as being part of the group.n.01 subtree of the WordNet noun hierarchy, solely based on their Poincaré embeddings. The positive points (from the subtree) are in red, the negative points (the rest) are in yellow and the trained positive separation hyperplane is depicted in green.
Figure 4.7: Test F1 classification scores (%) for four different subtrees of the WordNet noun tree. 95% confidence intervals for 3 different runs are shown for each method and each dimension. “Hyp” denotes our hyperbolic MLR, “Eucl” denotes directly applying Euclidean MLR to hyperbolic embeddings in their Euclidean parametrization, and log0 denotes applying Euclidean MLR in the tangent space at 0, after projecting all hyperbolic embeddings there with log0.
Figure 5.1: Isometric deformation ϕ of $\mathbb{D}^2$ (left end) into $\mathbb{H}^2$ (right end).
Figure 6.1: An entity disambiguation problem showcasing five given mentions and their potential entity candidates.
Figure 6.2: Proposed factor graph for a document with four mentions. Each mention node $m_i$ is paired with its corresponding entity node $E_i$, while all entity nodes are connected through entity-entity pair factors.
Figure 6.3: Interactive Gerbil visualization of “in-Knowledge Base (KB)” (i.e. only entities in KB should be linked) micro F1 scores for a variety of Entity Disambiguation (ED) methods and datasets. Our system, PBOH, outperforms the vast majority of the presented baselines. Screenshot from January 2018.
Figure 7.1: Local model with neural attention. Inputs: context word vectors, candidate entity priors and embeddings. Outputs: entity scores. All parts are differentiable and trainable with backpropagation.
Figure 7.2: Global model: unrolled LBP deep network that is end-to-end differentiable and trainable.
LIST OF TABLES
ACRONYMS
ML Machine Learning
AA Apprentissage Automatique
MF Matrix Factorization
ER Entity Resolution
ED Entity Disambiguation
KB Knowledge Base
KG Knowledge Graph
BP Belief Propagation
1 INTRODUCTION AND MOTIVATION
1.1 representation learning
• non-negativity: $d_X(x, y) \ge 0$
• symmetry: $d_X(x, y) = d_X(y, x)$
In ML, data are available via sample sets, e.g. images of particular objects sampled from an unknown data distribution. The goal is to extract patterns and characteristics of the data distribution of interest by learning information-preserving data representations. Formally, data points are vectors in an input metric space $(X, d_X)$, which is usually the continuous Euclidean vector space $\mathbb{R}^n$ of some large dimension $n$. Examples include vectorial pixel representations of images or feature vectors such as co-occurrence statistics for discrete inputs.
1.2 representations in the euclidean space
Definition 1.2. With the notations from definition 1.1, we define the total distortion as

$$\text{total-distortion}(f) := \sum_{1 \le i < j \le N} \left( \frac{d_Y(f(p_i), f(p_j))}{d_X(p_i, p_j)} - 1 \right)^2 \qquad (1.2)$$
$$X = \begin{pmatrix} \mathbf{x}_1^\top \\ \mathbf{x}_2^\top \\ \vdots \\ \mathbf{x}_N^\top \end{pmatrix} \in \mathbb{R}^{N \times d} \qquad (1.3)$$
Then, if one assumes the embedding space M = Rd to be Euclidean,
one typically defines a squared distance matrix or dissimilarity matrix in
this space
$$D = (D_{ij}) \in \mathbb{R}^{N \times N}, \qquad D_{ij} := d(\mathbf{x}_i, \mathbf{x}_j)^2 = \|\mathbf{x}_i - \mathbf{x}_j\|^2 \qquad (1.4)$$
In general, data can be embedded in any metric space, in which case
d(·, ·) is the associated distance function.
On the other hand, it is useful to define a notion of similarity in the embedding space for which large values correspond to semantically similar symbolic objects. In the Euclidean space, the typical measure is the inner (dot) product
$$\langle \mathbf{x}, \mathbf{y} \rangle = \sum_{k=1}^{d} x_k y_k \qquad (1.5)$$
3 Notation: $[N] = \{1, \ldots, N\}$
$$K = XX^\top \qquad (1.7)$$
$$K = U \Lambda U^\top \qquad (1.8)$$
$$X = U \Lambda^{1/2} \qquad (1.9)$$
$$X = CX \qquad (1.10)$$
pca and kernel pca. PCA is the most well-known linear dimensionality reduction approach. It finds the most significant directions, those along which the data has maximal variance, namely the eigenvectors corresponding to the top eigenvalues of the covariance matrix $\Sigma = \frac{1}{N} X^\top X = \frac{1}{N} \sum_{i=1}^{N} \mathbf{x}_i \mathbf{x}_i^\top$. The data is then linearly projected onto the linear subspace spanned by these eigenvectors.
However, PCA is ineffective when the data does not lie on or close to a d-dimensional linear subspace, which is typically a strong assumption. A possible fix is the kernel-PCA [SSM98] method that employs a non-linear data transformation $\Phi$ via a kernel function, resulting in the covariance matrix $\Sigma = \frac{1}{N} \sum_{i=1}^{N} \Phi(\mathbf{x}_i) \Phi(\mathbf{x}_i)^\top$. This can be efficiently computed using the kernel trick. Unfortunately, finding a suitable kernel for the problem at hand is not trivial.
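To make the procedure concrete, the following is a minimal numpy sketch of the linear PCA steps just described; the function and variable names are ours, and the data matrix is assumed to hold one point per row.

```python
import numpy as np

def pca(X, d):
    """Project the rows of X (N x n) onto the top-d principal directions.

    Follows the text: form the covariance matrix, take the eigenvectors
    of its largest eigenvalues, and linearly project the centered data.
    """
    Xc = X - X.mean(axis=0)                            # center the data
    cov = Xc.T @ Xc / Xc.shape[0]                      # covariance matrix (n x n)
    eigvals, eigvecs = np.linalg.eigh(cov)             # ascending eigenvalues
    top = eigvecs[:, np.argsort(eigvals)[::-1][:d]]    # top-d eigenvectors
    return Xc @ top                                    # N x d projected data

X = np.random.randn(100, 20)
Z = pca(X, d=2)  # 100 x 2
```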
metric mds. The goal of the metric MDS algorithm [BG03]; [Tor52] is to perform non-linear dimensionality reduction by matching the pairwise distances in the input feature space (given as an input matrix $D \in \mathbb{R}^{N \times N}$) with the Euclidean distances in the embedding space.
The algorithm follows eq. (1.15) to compute the Gram matrix of the centered data, then determines its d largest eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_d$ and their corresponding eigenvectors $\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_d$, and, last, computes the new embeddings using $X = U_d \Lambda_d^{1/2}$, where $\Lambda_d$ is the diagonal matrix of these d eigenvalues and $U_d$ is the matrix of their associated eigenvectors.
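The following is a minimal numpy sketch of these classical MDS steps. Since eq. (1.15) itself falls outside this excerpt, the double-centering step below is an assumption standing in for it.

```python
import numpy as np

def classical_mds(D, d):
    """Embed a squared-distance matrix D (N x N) into R^d via classical MDS.

    Steps from the text: recover the Gram matrix of the centered data
    (double centering, assumed form of eq. (1.15)), take the top-d
    eigenpairs, and set X = U_d Lambda_d^{1/2}.
    """
    N = D.shape[0]
    C = np.eye(N) - np.ones((N, N)) / N      # centering matrix
    K = -0.5 * C @ D @ C                     # Gram matrix of centered data
    eigvals, eigvecs = np.linalg.eigh(K)
    idx = np.argsort(eigvals)[::-1][:d]      # d largest eigenvalues
    L = np.clip(eigvals[idx], 0, None)       # guard against negative eigenvalues
    return eigvecs[:, idx] * np.sqrt(L)      # X = U_d Lambda_d^{1/2}
```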
to move to $\mathbb{R}^2$; in the latter case, one can still obtain zero-distortion embeddings using MDS if the target Euclidean spaces are replaced by spaces of constant positive curvature (i.e. spherical spaces) and the algorithm is modified to use their associated distance function. In contrast, Isomap cannot reduce the dimension of this data without loss because there exists no isometric deformation of the sphere into a Euclidean space.
1.3 properties and constraints of euclidean embedding spaces
$$g_{ij} := \frac{1}{2}\left( D_{1i} + D_{1j} - D_{ij} \right), \quad i, j \in [N] \qquad (1.17)$$

is PSD. This result holds irrespective of which node is chosen as the “1” node. The proof relies on the observation that $G$ is the Gram matrix of the translated vectors that set $\mathbf{x}_1$ to be the origin:
$$\text{NEF} = \frac{\sum_{\lambda_i < 0} |\lambda_i|}{\sum_{\lambda_i} |\lambda_i|} \in [0, 1] \qquad (1.19)$$
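A short numpy sketch tying eqs. (1.17) and (1.19) together: given a squared-distance matrix, it builds the Gram matrix with node 1 as the origin and reports the negative eigenfraction; the names are ours, and the 4-cycle example anticipates the discussion below.

```python
import numpy as np

def gram_from_sq_dists(D):
    """Gram matrix g_ij = (D_1i + D_1j - D_ij)/2 of eq. (1.17),
    using node 1 (index 0) as the origin."""
    return 0.5 * (D[0][None, :] + D[0][:, None] - D)

def negative_eigenfraction(D):
    """NEF of eq. (1.19): relative mass of negative Gram eigenvalues.
    NEF = 0 iff the squared-distance matrix D is Euclidean-embeddable."""
    lam = np.linalg.eigvalsh(gram_from_sq_dists(D))
    return np.abs(lam[lam < 0]).sum() / np.abs(lam).sum()

# The graph metric of an undirected 4-cycle is not Euclidean:
d = np.array([[0, 1, 2, 1], [1, 0, 1, 2], [2, 1, 0, 1], [1, 2, 1, 0]])
print(negative_eigenfraction(d.astype(float) ** 2))  # strictly > 0
```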
$$\|\mathbf{x}_1 + \mathbf{x}_3 - \mathbf{x}_2 - \mathbf{x}_4\|^2 \ge 0 \qquad (1.21)$$
One can see, for example, that the graph metric of an undirected
4-cycle does not satisfy this lemma and, thus, cannot be embedded
isometrically in the Euclidean space of any dimension. In fact, we can
lower bound the total distortion for this particular graph.
Lemma 1.2. Embedding the 4-cycle graph in the Euclidean space of any dimension incurs a total distortion (as defined by eq. (1.2)) of at least 3/4.
$$(a^2 - 1)^2 + (b^2 - 1)^2 + (c^2 - 1)^2 + (d^2 - 1)^2 \ge \frac{1}{4}(s - 4)^2 \qquad (1.23)$$

$$\frac{1}{4}(e^2 - 4)^2 + \frac{1}{4}(f^2 - 4)^2 \ge \frac{1}{8}(8 - t)^2 \qquad (1.24)$$

We distinguish two cases:
• The graph does not contain an induced 4-cycle (i.e. with no diagonals) and is distance-hereditary (i.e. every connected induced subgraph has the same distances as the whole graph).
This shows that cycle graphs of length at least 4 are not Ptolemaic,
thus not embeddable isometrically in the Euclidean space. Examples of
Ptolemaic spaces include all Hadamard spaces (e.g. Euclidean and the
Poincaré disk), but do not include spaces of constant positive curvature
(e.g. spherical).
It can also be shown (see [IM04]; [Mat13]) that the embedding space that achieves the above distortion will have dimension $O(\log^2 N)$ with high probability. The above distortion is tight: for example, constant-degree expander graph metrics cannot be embedded with a lower distortion.
6 See https://en.wikipedia.org/wiki/Ptolemaic_graph.
A proof can be found in [DG03]. However, this result does not apply to other metric spaces like graph metrics.
If the input data has a graph structure, the resulting graph metric might be very different from the Euclidean metric, resulting in potentially large embedding distortion no matter how many dimensions are used. We already dived into this issue in section 1.3.2, seeing that even small cycles or trees cannot be embedded in the Euclidean space isometrically, or even with arbitrarily low distortion.
In general, there are various theoretical results that provide lower and upper bounds on the worst-case distortion for Euclidean embeddings of several classes of graphs [IM04]; [Mat13]. For example, $k$-regular graphs ($k \ge 3$) with girth $g$ incur an $\Omega(\sqrt{g})$ distortion, planar graphs can be embedded with $O(\sqrt{\log N})$ distortion, trees can be embedded in $d$ dimensions with distortion $O(N^{1/(d-1)})$, while the recursive diamond graph $G_m$ cannot be embedded with distortion less than $\sqrt{1 + m}$.
The particular case of tree structures and directed acyclic graphs is of special interest in this dissertation. From a theoretical point of view, [Gro87] (cited by [De +18a]; [GBH18b]) shows that arbitrary tree structures cannot be embedded with arbitrarily low distortion in the Euclidean space with any number of dimensions, but this task becomes possible in the hyperbolic space with only 2 dimensions, where the exponential volume growth matches the exponential growth of the number of nodes with the tree depth. We will examine hyperbolic spaces extensively in
1.4 non-euclidean geometric spaces for embedding specific types of data
alleviate some of the current ML problems for certain types of data, e.g.
achieving
However, moving away from the Euclidean space comes with its own challenges. Practical and computationally efficient implementations of Riemannian manifolds require, for modeling, learning and optimization, access to closed-form expressions of the most relevant Riemannian geometric tools, such as geodesic equations, the exponential map, the distance function or parallel transport. In the case of generic manifolds, these geometric objects can easily lose their appealing closed-form expressions.
Moreover, the adoption of neural networks and deep learning in these non-Euclidean settings has been rather limited until very recently, the main reason being that basic operations (e.g. vector addition, matrix-vector multiplication, vector translation, vector inner product) are non-trivial or impossible to generalize in a principled manner and that, in more complex geometries, basic objects (e.g. distances, geodesics, parallel transport) lack closed-form expressions. Thus, classic tools such as FFNNs, RNNs or Multinomial Logistic Regression (MLR) have no correspondents in non-Euclidean geometries. This is important, as much of the value of embeddings lies in their use for downstream tasks (e.g. in NLP).
1.5 hyperbolic spaces – an intuition
Figure 1.3: Visualization of Escher tiles (left) and a regular tree (right)
represented in the Poincaré ball.
1.6 thesis contributions
2 HYPERBOLIC GEOMETRY
In this chapter, we present the hyperbolic Riemannian manifold that will be used as an embedding space in the methods of Chapter 3, Chapter 4 and Chapter 5. We already gave an intuition on Riemannian manifolds and hyperbolic spaces in Chapter 1 and section 1.4. We now give the mathematical details of these concepts. The material presented here has in parts been published in the publications [GBH18a]; [GBH18b].
Figure 2.1: Tangent space, a tangent unit-speed vector and its deter-
mined geodesic in a Riemannian manifold. Image source:
Wikipedia.org
2.2 hyperbolic space: the poincaré ball
$$d_{\mathbb{D}}(\mathbf{x}, \mathbf{y}) = \cosh^{-1}\left( 1 + 2\, \frac{\|\mathbf{x} - \mathbf{y}\|^2}{(1 - \|\mathbf{x}\|^2) \cdot (1 - \|\mathbf{y}\|^2)} \right). \qquad (2.4)$$
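For concreteness, a direct numpy transcription of eq. (2.4); the function name is ours.

```python
import numpy as np

def poincare_dist(x, y):
    """Poincare ball distance of eq. (2.4), for points with norm < 1."""
    sq = np.sum((x - y) ** 2)
    denom = (1.0 - np.sum(x ** 2)) * (1.0 - np.sum(y ** 2))
    return np.arccosh(1.0 + 2.0 * sq / denom)

x = np.array([0.0, 0.0])
y = np.array([0.9, 0.0])
print(poincare_dist(x, y))  # distances grow quickly near the boundary
```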
$$g^{\mathbb{R}^{n,1}} = \begin{pmatrix} -1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{pmatrix}$$

$$\psi(\mathbf{x}) := (\lambda_{\mathbf{x}} - 1,\ \lambda_{\mathbf{x}} \mathbf{x}), \qquad \psi^{-1}(x_0, \mathbf{x}') = \frac{\mathbf{x}'}{1 + x_0} \qquad (2.8)$$
Inverting once again, $\gamma(t) = (\psi^{-1} \circ \phi)(t)$, one gets the closed-form expression for $\gamma$ stated in the theorem.
One can sanity-check that the formula from theorem 2.1 indeed satisfies the conditions:
• $d_{\mathbb{D}}(\gamma(0), \gamma(t)) = t$, $\forall t \in [0, 1]$
• $\gamma(0) = \mathbf{x}$
• $\dot{\gamma}(0) = \mathbf{v}$
We can now derive the formula for the exponential map in the
Poincaré ball.
Corollary 2.2.1. (Exponential map) The exponential map at a point $\mathbf{x} \in \mathbb{D}^n$, namely $\exp_{\mathbf{x}} : T_{\mathbf{x}}\mathbb{D}^n \to \mathbb{D}^n$, is given by

$$\exp_{\mathbf{x}}(\mathbf{v}) = \frac{\lambda_{\mathbf{x}} \left( \cosh(\lambda_{\mathbf{x}} \|\mathbf{v}\|) + \left\langle \mathbf{x}, \frac{\mathbf{v}}{\|\mathbf{v}\|} \right\rangle \sinh(\lambda_{\mathbf{x}} \|\mathbf{v}\|) \right)}{1 + (\lambda_{\mathbf{x}} - 1) \cosh(\lambda_{\mathbf{x}} \|\mathbf{v}\|) + \lambda_{\mathbf{x}} \left\langle \mathbf{x}, \frac{\mathbf{v}}{\|\mathbf{v}\|} \right\rangle \sinh(\lambda_{\mathbf{x}} \|\mathbf{v}\|)}\, \mathbf{x} + \frac{\frac{1}{\|\mathbf{v}\|} \sinh(\lambda_{\mathbf{x}} \|\mathbf{v}\|)}{1 + (\lambda_{\mathbf{x}} - 1) \cosh(\lambda_{\mathbf{x}} \|\mathbf{v}\|) + \lambda_{\mathbf{x}} \left\langle \mathbf{x}, \frac{\mathbf{v}}{\|\mathbf{v}\|} \right\rangle \sinh(\lambda_{\mathbf{x}} \|\mathbf{v}\|)}\, \mathbf{v} \qquad (2.10)$$

Proof. Denote $\mathbf{u} = \frac{1}{\sqrt{g^{\mathbb{D}}_{\mathbf{x}}(\mathbf{v}, \mathbf{v})}}\, \mathbf{v}$. Using the notations from theorem 2.1, one has $\exp_{\mathbf{x}}(\mathbf{v}) = \gamma_{\mathbf{x},\mathbf{u}}\left( \sqrt{g^{\mathbb{D}}_{\mathbf{x}}(\mathbf{v}, \mathbf{v})} \right)$. Using eqs. (2.3) and (2.6), one derives the result.
Theorem 2.3. For any $\delta > 0$, any $\delta$-hyperbolic metric space $(X, d_X)$ and any set of points $x_1, \ldots, x_n \in X$, there exists a finite weighted tree $(T, d_T)$ and an embedding $f : T \to X$ such that for all $i, j$,
• $\mathbf{x} \oplus_c \mathbf{0} = \mathbf{0} \oplus_c \mathbf{x} = \mathbf{x}$ (identity element)
• $n \otimes_c \mathbf{x} = \mathbf{x} \oplus_c \cdots \oplus_c \mathbf{x}$ ($n$ additions)
• $(r + r') \otimes_c \mathbf{x} = (r \otimes_c \mathbf{x}) \oplus_c (r' \otimes_c \mathbf{x})$ (scalar distributivity³)
Again, observe that $\lim_{c \to 0} d_c(\mathbf{x}, \mathbf{y}) = 2\|\mathbf{x} - \mathbf{y}\|$, i.e. we recover Euclidean geometry in the limit⁵. Moreover, for $c = 1$ we recover $d_{\mathbb{D}}$ as previously stated in eq. (2.4).
Note that one can also adapt the law of cosines to the hyperbolic space.
gyrovector spaces and the Poincaré ball by finding new identities involving the exponential map and parallel transport. In particular, these findings provide us with a simpler formulation of Möbius scalar multiplication, yielding a natural definition of matrix-vector multiplication in the Poincaré ball.
Lemma 2.1. For any $\mathbf{x} \in \mathbb{D}^n_c$ and $\mathbf{v} \in T_{\mathbf{x}}\mathbb{D}^n_c$ s.t. $g^c_{\mathbf{x}}(\mathbf{v}, \mathbf{v}) = 1$, the unit-speed geodesic $\gamma_{\mathbf{x},\mathbf{v}} : \mathbb{R} \to \mathbb{D}^n_c$ starting from point $\mathbf{x}$ with direction $\mathbf{v}$, namely $\gamma_{\mathbf{x},\mathbf{v}}(0) = \mathbf{x}$ and $\dot{\gamma}_{\mathbf{x},\mathbf{v}}(0) = \mathbf{v}$, is given by:

$$\gamma_{\mathbf{x},\mathbf{v}}(t) = \mathbf{x} \oplus_c \left( \tanh\left( \sqrt{c}\, \frac{t}{2} \right) \frac{\mathbf{v}}{\sqrt{c}\, \|\mathbf{v}\|} \right) \qquad (2.23)$$

Proof. One can use eq. (2.22) and re-parametrize it to unit speed using eq. (2.18). Alternatively, direct computation and identification with the formula in theorem 2.1 gives the same result. Using eqs. (2.18) and (2.23), one can sanity-check that $d_c(\gamma(0), \gamma(t)) = t$, $\forall t \in [0, 1]$.
Lemma 2.2. For any point $\mathbf{x} \in \mathbb{D}^n_c$, the exponential map $\exp^c_{\mathbf{x}} : T_{\mathbf{x}}\mathbb{D}^n_c \to \mathbb{D}^n_c$ and the logarithmic map $\log^c_{\mathbf{x}} : \mathbb{D}^n_c \to T_{\mathbf{x}}\mathbb{D}^n_c$ are given for $\mathbf{v} \ne \mathbf{0}$ and $\mathbf{y} \ne \mathbf{x}$ by:

$$\exp^c_{\mathbf{x}}(\mathbf{v}) = \mathbf{x} \oplus_c \left( \tanh\left( \sqrt{c}\, \frac{\lambda^c_{\mathbf{x}} \|\mathbf{v}\|}{2} \right) \frac{\mathbf{v}}{\sqrt{c}\, \|\mathbf{v}\|} \right) \qquad (2.24)$$

$$\log^c_{\mathbf{x}}(\mathbf{y}) = \frac{2}{\sqrt{c}\, \lambda^c_{\mathbf{x}}} \tanh^{-1}\left( \sqrt{c}\, \|-\mathbf{x} \oplus_c \mathbf{y}\| \right) \frac{-\mathbf{x} \oplus_c \mathbf{y}}{\|-\mathbf{x} \oplus_c \mathbf{y}\|} \qquad (2.25)$$

Proof. Following the proof of corollary 2.2.1, one obtains

$$\exp^c_{\mathbf{x}}(\mathbf{v}) = \gamma_{\mathbf{x},\, \frac{\mathbf{v}}{\lambda^c_{\mathbf{x}} \|\mathbf{v}\|}}(\lambda^c_{\mathbf{x}} \|\mathbf{v}\|) \qquad (2.26)$$

Using eq. (2.23) gives the formula for $\exp^c_{\mathbf{x}}$. An algebraic check of the identity $\log^c_{\mathbf{x}}(\exp^c_{\mathbf{x}}(\mathbf{v})) = \mathbf{v}$ concludes the proof.
The above maps have more appealing forms when $\mathbf{x} = \mathbf{0}$, namely for $\mathbf{v} \in T_{\mathbf{0}}\mathbb{D}^n_c \setminus \{\mathbf{0}\}$, $\mathbf{y} \in \mathbb{D}^n_c \setminus \{\mathbf{0}\}$:

$$\exp^c_{\mathbf{0}}(\mathbf{v}) = \tanh(\sqrt{c}\, \|\mathbf{v}\|) \frac{\mathbf{v}}{\sqrt{c}\, \|\mathbf{v}\|} \qquad (2.27)$$

$$\log^c_{\mathbf{0}}(\mathbf{y}) = \tanh^{-1}(\sqrt{c}\, \|\mathbf{y}\|) \frac{\mathbf{y}}{\sqrt{c}\, \|\mathbf{y}\|} \qquad (2.28)$$

Moreover, we still recover Euclidean geometry in the limit $c \to 0$, as $\lim_{c \to 0} \exp^c_{\mathbf{x}}(\mathbf{v}) = \mathbf{x} + \mathbf{v}$ is the Euclidean exponential map, and $\lim_{c \to 0} \log^c_{\mathbf{x}}(\mathbf{y}) = \mathbf{y} - \mathbf{x}$ is the Euclidean logarithmic map.
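A minimal numpy sketch of the maps of eqs. (2.27) and (2.28), together with Möbius addition, whose standard closed form is assumed here since its definition falls on a page not reproduced in this excerpt.

```python
import numpy as np

def mobius_add(x, y, c=1.0):
    """Standard Moebius addition x (+)_c y in the Poincare ball (assumed form)."""
    xy = np.dot(x, y)
    x2, y2 = np.dot(x, x), np.dot(y, y)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    return num / (1 + 2 * c * xy + c ** 2 * x2 * y2)

def exp0(v, c=1.0):
    """exp_0^c of eq. (2.27): tangent vector at the origin -> ball point."""
    n = np.linalg.norm(v)
    return np.tanh(np.sqrt(c) * n) * v / (np.sqrt(c) * n)

def log0(y, c=1.0):
    """log_0^c of eq. (2.28): ball point -> tangent vector at the origin."""
    n = np.linalg.norm(y)
    return np.arctanh(np.sqrt(c) * n) * y / (np.sqrt(c) * n)

v = np.array([0.3, -0.2])
np.testing.assert_allclose(log0(exp0(v)), v)  # inverse maps, as in the text
```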
Theorem 2.4. In the manifold $(\mathbb{D}^n_c, g^c)$, the parallel transport w.r.t. the Levi-Civita connection of a vector $\mathbf{v} \in T_{\mathbf{0}}\mathbb{D}^n_c$ to another tangent space $T_{\mathbf{x}}\mathbb{D}^n_c$ is given by the following isometry:

$$P^c_{\mathbf{0} \to \mathbf{x}}(\mathbf{v}) = \log^c_{\mathbf{x}}(\mathbf{x} \oplus_c \exp^c_{\mathbf{0}}(\mathbf{v})) = \frac{\lambda^c_{\mathbf{0}}}{\lambda^c_{\mathbf{x}}}\, \mathbf{v}. \qquad (2.31)$$
As we’ll see later, this result is crucial in order to define and optimize
parameters shared between different tangent spaces, such as biases in
hyperbolic neural layers or parameters of hyperbolic MLR.
The general parallel transport formula for any $\mathbf{x}, \mathbf{y} \in \mathbb{D}^n$, $\mathbf{v} \in T_{\mathbf{x}}\mathbb{D}^n$ is given by

$$P_{\mathbf{x} \to \mathbf{y}}(\mathbf{v}) = \frac{\lambda_{\mathbf{x}}}{\lambda_{\mathbf{y}}} \cdot \operatorname{gyr}[\mathbf{y}, -\mathbf{x}]\mathbf{v} \qquad (2.32)$$

where gyr⁶ is the gyro-automorphism on $\mathbb{D}^n$ with the closed-form expression shown in Eq. 1.27 of [Ung08]:

$$\operatorname{gyr}[\mathbf{u}, \mathbf{v}]\mathbf{w} = \ominus(\mathbf{u} \oplus \mathbf{v}) \oplus \{\mathbf{u} \oplus (\mathbf{v} \oplus \mathbf{w})\} = \mathbf{w} + 2\, \frac{A\mathbf{u} + B\mathbf{v}}{D}, \qquad (2.33)$$

where the quantities A, B, D have closed-form expressions and are thus easy to implement:
6 https://en.wikipedia.org/wiki/Gyrovector_space
3 HYPERBOLIC ENTAILMENT CONES FOR LEARNING HIERARCHICAL EMBEDDINGS
The material presented here has in parts been published in the publication [GBH18b].
2 See end of section 2.2 for a rigorous formulation and section 1.5 for a detailed intuition.
3.2 entailment cones in the poincaré ball
The key idea for generalizing this concept is to make use of the expo-
nential map at a point x ∈ M.
Note that, in the above definition, we desire that the exponential map
be injective. We already know that it is a local diffeomorphism. Thus,
we restrict the tangent space in eq. (3.3) to the ball $B^n(\mathbf{0}, r)$, where $r$ is the injectivity radius of $\mathcal{M}$ at $\mathbf{x}$. Note that for hyperbolic space models the injectivity radius at any point is infinite, thus no restriction is needed.
$$A_{\mathbf{x}} := \left\{ \mathbf{x}' \in \mathbb{D}^n : \mathbf{x}' = \alpha \mathbf{x},\ \frac{1}{\|\mathbf{x}\|} > \alpha \ge 1 \right\} \qquad (3.4)$$
We next define the angle $\angle(\mathbf{v}, \bar{\mathbf{x}})$ for any tangent vector $\mathbf{v} \in T_{\mathbf{x}}\mathbb{D}^n$ as in eq. (2.12). Then, the axial symmetry property is satisfied if we define the angular cone at $\mathbf{x}$ to have a non-negative aperture $2\psi(\mathbf{x}) \ge 0$ as follows:

$$\bar{S}^{\psi(\mathbf{x})}_{\mathbf{x}} := \{\mathbf{v} \in T_{\mathbf{x}}\mathbb{D}^n : \angle(\mathbf{v}, \bar{\mathbf{x}}) \le \psi(\mathbf{x})\}, \qquad S^{\psi(\mathbf{x})}_{\mathbf{x}} := \exp_{\mathbf{x}}(\bar{S}^{\psi(\mathbf{x})}_{\mathbf{x}}). \qquad (3.6)$$

We further define the conic border (face):

$$\partial \bar{S}^{\psi(\mathbf{x})}_{\mathbf{x}} := \{\mathbf{v} : \angle(\mathbf{v}, \bar{\mathbf{x}}) = \psi(\mathbf{x})\}, \qquad \partial S^{\psi(\mathbf{x})}_{\mathbf{x}} := \exp_{\mathbf{x}}(\partial \bar{S}^{\psi(\mathbf{x})}_{\mathbf{x}}). \qquad (3.7)$$

2) Rotation invariance. We want the definition of the cones $S^{\psi(\mathbf{x})}_{\mathbf{x}}$ to be independent of the angular coordinate of the apex $\mathbf{x}$, i.e. to only depend on the (Euclidean) norm of $\mathbf{x}$:
$$\forall \mathbf{x}' \in \partial S^{\psi(\mathbf{x})}_{\mathbf{x}} : \quad \psi(\|\mathbf{x}'\|) \le \frac{\pi}{2} \qquad (3.11)$$

If the above is true, then by moving $\mathbf{x}'$ on any arbitrary (continuous) curve on the cone border $\partial S^{\psi(\mathbf{x})}_{\mathbf{x}}$ that ends in $\mathbf{x}$, one will get a contradiction due to the continuity of $\psi(\|\cdot\|)$.
We now prove the remaining fact, namely eq. (3.11). Let $\mathbf{x}' \in \partial S^{\psi(\mathbf{x})}_{\mathbf{x}}$ be arbitrary. Also, let $\mathbf{y} \in \partial S^{\psi(\mathbf{x})}_{\mathbf{x}}$ be any arbitrary point on the geodesic half-line connecting $\mathbf{x}$ with $\mathbf{x}'$, starting from $\mathbf{x}'$ (i.e. excluding the segment from $\mathbf{x}$ to $\mathbf{x}'$). Moreover, let $\mathbf{z}$ be any arbitrary point on the spoke through $\mathbf{x}'$ radiating from $\mathbf{x}'$, namely $\mathbf{z} \in A_{\mathbf{x}'}$ (notation from eq. (3.4)). Then, based on the properties of hyperbolic angles discussed before (based on eq. (2.12)), the angles $\angle \mathbf{y}\mathbf{x}'\mathbf{z}$ and $\angle \mathbf{z}\mathbf{x}'\mathbf{x}$ are well-defined³. From corollary 2.2.2 we know that the points $\mathbf{0}, \mathbf{x}, \mathbf{x}', \mathbf{y}, \mathbf{z}$ are coplanar. We denote this plane by $\mathcal{P}$. Furthermore, the metric of the Poincaré
3 Abuse of notation.
ball is conformal with the Euclidean metric. Given these two facts, and since $\mathbf{x}$, $\mathbf{x}'$ and $\mathbf{y}$ lie on the same geodesic, the angles $\angle \mathbf{y}\mathbf{x}'\mathbf{z}$ and $\angle \mathbf{z}\mathbf{x}'\mathbf{x}$ sum to $\pi$, thus

$$\min(\angle \mathbf{y}\mathbf{x}'\mathbf{z},\ \angle \mathbf{z}\mathbf{x}'\mathbf{x}) \le \frac{\pi}{2} \qquad (3.13)$$
It only remains to prove that

$$\psi(\mathbf{x}') \le \min(\angle \mathbf{y}\mathbf{x}'\mathbf{z},\ \angle \mathbf{z}\mathbf{x}'\mathbf{x}) \qquad (3.14)$$

Indeed, assume w.l.o.g. that $\angle \mathbf{y}\mathbf{x}'\mathbf{z} < \psi(\mathbf{x}')$. Then there exists a point $\mathbf{t}$ in the plane $\mathcal{P}$ such that $\mathbf{t} \in S^{\psi(\mathbf{x}')}_{\mathbf{x}'}$ and $\mathbf{t} \notin S^{\psi(\mathbf{x})}_{\mathbf{x}}$, which contradicts the transitivity property in eq. (3.9).
Proof. We will use the exact same figure and notations for the points $\mathbf{y}, \mathbf{z}$ as in the proof of lemma 3.1. In addition, we assume w.l.o.g. that

$$\angle \mathbf{y}\mathbf{x}'\mathbf{z} \le \frac{\pi}{2} \qquad (3.18)$$

Further, let $\mathbf{b} \in \partial\mathbb{D}^n$ be the intersection point of the spoke through $\mathbf{x}$ with the border of $\mathbb{D}^n$. Following the same argument as in the proof of lemma 3.1, one proves eq. (3.14), which gives:

Indeed, if the above is true, then one can use the fact in eq. (2.5), i.e.

$$\sinh(\|\mathbf{x}\|_{\mathbb{D}}) = \sinh\left( \ln \frac{1+r}{1-r} \right) = \frac{2r}{1-r^2} \qquad (3.24)$$

and apply lemma 3.2 to derive

$$h(r') \le h(r) \qquad (3.25)$$
Figure 3.2: Poincaré angular cones satisfying eq. (3.31) for K = 0.1. Left:
examples of cones for points with Euclidean norm varying
from 0.1 to 0.9. Right: transitivity for various points on the
border of their parent cones.
We are only left to prove eq. (3.23). Let $\mathbf{x} \in \mathbb{D}^n$ be arbitrary with $\|\mathbf{x}\| = r$. Also, consider any arbitrary geodesic $\gamma_{\mathbf{x},\mathbf{v}} : \mathbb{R}_+ \to \partial S^{\psi(\mathbf{x})}_{\mathbf{x}}$ that takes values on the cone border, i.e. $\angle(\mathbf{v}, \mathbf{x}) = \psi(\mathbf{x})$. We know that $\|\gamma_{\mathbf{x},\mathbf{v}}(0)\| = r$ and that this geodesic “ends” on the ball's border $\partial\mathbb{D}^n$, i.e. $\lim_{t \to \infty} \|\gamma_{\mathbf{x},\mathbf{v}}(t)\| = 1$. Thus, because the function $\|\gamma_{\mathbf{x},\mathbf{v}}(\cdot)\|$ is continuous, we obtain that for any $r' \in (r, 1)$ there exists a $t_0 \in \mathbb{R}_+$ s.t. $\|\gamma_{\mathbf{x},\mathbf{v}}(t_0)\| = r'$. By setting $\mathbf{x}' := \gamma_{\mathbf{x},\mathbf{v}}(t_0) \in \partial S^{\psi(\mathbf{x})}_{\mathbf{x}}$ we obtain the desired result.
Theorem 3.3. For any $\mathbf{x}, \mathbf{y} \in \mathbb{D}^n \setminus B^n(\mathbf{0}, \varepsilon)$, we denote the angle between the half-lines $(\mathbf{x}\mathbf{y}$ and $(\mathbf{0}\mathbf{x}$ as $\Xi(\mathbf{x}, \mathbf{y})$. Then:

$$S^{\psi(\mathbf{x})}_{\mathbf{x}} = \left\{ \mathbf{y} \in \mathbb{D}^n \ \middle|\ \Xi(\mathbf{x}, \mathbf{y}) \le \arcsin\left( K\, \frac{1 - \|\mathbf{x}\|^2}{\|\mathbf{x}\|} \right) \right\}. \qquad (3.34)$$
3.3 learning with entailment cones
ψ(x)
Proof. For any y ∈ Sx , the axial symmetry property implies that
π − ∠0xy ≤ ψ(x). Applying the hyperbolic cosine law in the triangle
0xy and writing the above angle inequality in terms of the cosines of
the two angles, one gets
k y k2 − k x k2 − k x − y k2
Ξ(x, y) = arccos , (3.37)
2k x k · k x − y k
3.3 learning with entailment cones
$$\mathcal{L} = \sum_{(u,v) \in P} E(u, v) + \sum_{(u',v') \in N} \max(0,\ \gamma - E(u', v')), \qquad (3.38)$$

for some margin $\gamma > 0$, where $P$ and $N$ define samples of positive and negative edges respectively. The energy $E(u, v)$ measures the penalty of a wrongly classified pair $(u, v)$, which in our case measures how far point $v$ is from belonging to $S^{\psi(u)}_{u}$, expressed as the smallest angle of a rotation of center $u$ bringing $v$ into $S^{\psi(u)}_{u}$:

$$E(u, v) = \max(0,\ \Xi(u, v) - \psi(u)), \qquad (3.39)$$

where $\Xi(u, v)$ is defined in eqs. (3.33) and (3.37). Note that [Ven+15] use $\|\max(0, v - u)\|^2$. This loss function encourages positive samples to satisfy $E(u, v) = 0$ and negative ones to satisfy $E(u, v) \ge \gamma$. The same loss is used both in the hyperbolic and Euclidean cases.
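For concreteness, a small numpy sketch of this objective, combining the angle of eq. (3.37), the closed-form aperture of eq. (3.34) and the energy of eq. (3.39) into the loss of eq. (3.38); the function names and the default values of K and gamma are ours.

```python
import numpy as np

def xi(u, v):
    """Angle Xi(u, v) of eq. (3.37) between half-lines (uv and (0u."""
    nu, d = np.linalg.norm(u), np.linalg.norm(u - v)
    cos = (np.linalg.norm(v) ** 2 - nu ** 2 - d ** 2) / (2 * nu * d)
    return np.arccos(np.clip(cos, -1.0, 1.0))

def psi(u, K=0.1):
    """Cone half-aperture of eq. (3.34); the apex u must lie outside a
    small ball around the origin so the arcsin argument stays <= 1."""
    nu = np.linalg.norm(u)
    return np.arcsin(K * (1 - nu ** 2) / nu)

def energy(u, v, K=0.1):
    """Eq. (3.39): smallest rotation angle bringing v inside the cone at u."""
    return max(0.0, xi(u, v) - psi(u, K))

def loss(pos, neg, gamma=0.01, K=0.1):
    """Max-margin loss of eq. (3.38) over positive/negative edge samples."""
    return (sum(energy(u, v, K) for u, v in pos)
            + sum(max(0.0, gamma - energy(u, v, K)) for u, v in neg))
```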
$$\mathbf{u} \leftarrow \mathbf{u} - \eta\, \nabla_{\mathbf{u}} \mathcal{L} \qquad (3.40)$$
3.4 experiments
them in the training set. The remaining “non-basic” edges (578,477) are
split into validation (5%), test (5%) and train (fraction of the rest).
We augment both the validation and the test parts with sets of nega-
tive pairs as follows: for each true (positive) edge (u, v), we randomly
sample five (u′, v) and five (u, v′) negative corrupted pairs that are not
edges in the full transitive closure. These are then added to the respec-
tive negative set. Thus, ten times as many negative pairs as positive
pairs are used. They are used to compute standard classification metrics
associated with these datasets: precision, recall, F1. For the training set,
negative pairs are dynamically generated as explained below.
We make the task harder in order to understand the generalization
ability of various models when differing amounts of transitive closure
edges are available during training. We generate four training sets
that include 0%, 10%, 25%, or 50% of the non-basic edges, selected
randomly. We then train separate models using each of these four sets
after being augmented with the basic edges.
and tune the parameter α on the validation set. For all the other methods
(our proposed cones and order embeddings), we use the energy penalty
E(u, v ), e.g. eq. (3.39) for hyperbolic cones. This scoring function is then
used at test time for binary classification as follows: if it is lower than a
threshold, we predict an edge; otherwise, we predict a non-edge. The
optimal threshold is chosen to achieve maximum F1 on the validation
set by passing over the sorted array of scores of positive and negative
validation pairs.
results and discussion. Tables 3.1 and 3.2 show the obtained
results. For a fair comparison, we use models with the same number
of dimensions. We focus on the low dimensional setting (5 and 10
dimensions) which is more informative. It can be seen that our hyper-
bolic cones are better than all the baselines in all settings, except in
the 0% setting for which order embeddings are better. However, once
a small percentage of the transitive closure edges becomes available
during training, we observe significant improvements of our method,
sometimes by more than 8% F1 score. Moreover, hyperbolic cones
have the largest growth when transitive closure edges are added at
Table 3.1: Test F1 results for various models for embedding dimension-
ality equal to 5. Simple Euclidean Emb and Poincaré Emb are
the Euclidean and hyperbolic methods proposed by [NK17],
Order Emb is proposed by [Ven+15].
Table 3.2: Test F1 results for the same methods as in table 3.1, but for
embedding dimensionality equal to 10.
3.5 summary
6 Indeed, mathematically, hyperbolic embeddings cannot be considered as Euclidean points.
7 https://github.com/dalab/hyperbolic_cones.
4 HYPERBOLIC NEURAL NETWORKS
Hyperbolic spaces have recently gained momentum in the context of machine learning due to their high capacity and tree-likeliness properties. However, the representational power of hyperbolic geometry is not yet on par with Euclidean geometry, mostly because of the absence of corresponding hyperbolic neural network layers. This makes it hard to use hyperbolic embeddings in downstream tasks. In this chapter, we bridge this gap in a principled manner by combining the formalism of Möbius gyrovector spaces with the Riemannian geometry of the Poincaré model of hyperbolic spaces. As a result, we derive hyperbolic versions of important deep learning tools, multinomial logistic regression (MLR), feed-forward (FFNN) and recurrent neural networks (RNN), including gated recurrent units (GRU), and prove some of their interesting properties. This further allows us to embed sequential data and perform classification in the hyperbolic space. Empirically, we show the benefit of using our hyperbolic models over their Euclidean variants on word embedding classification, textual entailment and noisy-prefix recognition tasks.
The material presented here has in parts been published in the publication [GBH18a].
4.1 introduction
4.2 hyperbolic multiclass logistic regression
Lemma 4.1.

$$\tilde{H}^c_{\mathbf{a},\mathbf{p}} = \{ \mathbf{x} \in \mathbb{D}^n_c : \langle -\mathbf{p} \oplus_c \mathbf{x}, \mathbf{a} \rangle = 0 \}. \qquad (4.9)$$
Theorem 4.2.

$$d_c(\mathbf{x}, \tilde{H}^c_{\mathbf{a},\mathbf{p}}) := \inf_{\mathbf{w} \in \tilde{H}^c_{\mathbf{a},\mathbf{p}}} d_c(\mathbf{x}, \mathbf{w}) = \frac{1}{\sqrt{c}} \sinh^{-1}\left( \frac{2\sqrt{c}\, |\langle -\mathbf{p} \oplus_c \mathbf{x}, \mathbf{a} \rangle|}{(1 - c\, \|-\mathbf{p} \oplus_c \mathbf{x}\|^2)\, \|\mathbf{a}\|} \right). \qquad (4.13)$$
Proof. We first need to prove the following lemma, trivial in the Euclidean space, but not in the Poincaré ball:
Lemma 4.2. (Orthogonal projection on a geodesic) Any point in the Poincaré ball has a unique orthogonal projection on any given geodesic that does not pass through the point. Formally, for all $\mathbf{y} \in \mathbb{D}^n_c$ and for all geodesics $\gamma_{\mathbf{x} \to \mathbf{z}}(\cdot)$ such that $\mathbf{y} \notin \operatorname{Im} \gamma_{\mathbf{x} \to \mathbf{z}}$, there exists a unique $\mathbf{w} \in \operatorname{Im} \gamma_{\mathbf{x} \to \mathbf{z}}$ such that $\angle(\gamma_{\mathbf{w} \to \mathbf{y}}, \gamma_{\mathbf{x} \to \mathbf{z}}) = \pi/2$.
Proof. We first note that any geodesic in $\mathbb{D}^n_c$ has the form $\gamma(t) = \mathbf{u} \oplus_c (\mathbf{v} \otimes_c t)$ as given by eq. (2.23), and has two “points at infinity” lying on the ball border ($\mathbf{v} \ne \mathbf{0}$):

$$\gamma(\pm\infty) = \mathbf{u} \oplus_c \frac{\pm \mathbf{v}}{\sqrt{c}\, \|\mathbf{v}\|} \in \partial\mathbb{D}^n_c. \qquad (4.14)$$
The left part of eq. (4.16) follows from eq. (4.15) and from the fact (easy to show from the definition of $\oplus_c$) that $\mathbf{a} \oplus_c \mathbf{b} = \mathbf{a}$ when $\|\mathbf{a}\| = 1/\sqrt{c}$ (which is the case of $\mathbf{x}'$). The right part of eq. (4.16) follows from the fact that $\angle \mathbf{y}\mathbf{w}\mathbf{z}' = \pi - \angle \mathbf{y}\mathbf{w}\mathbf{x}'$ (from the conformal property, or from eq. (2.19)) and $\cos(\angle \mathbf{y}\mathbf{z}'\mathbf{x}') = 1$ (proved as above).
Hence $\cos(\angle \mathbf{y}\mathbf{w}\mathbf{z}')$ has to pass through 0 when going from $-1$ to $1$, which achieves the proof of existence.
ii) Uniqueness of $\mathbf{w}$:
Assume by contradiction that there are two points $\mathbf{w}$ and $\mathbf{w}'$ on $\gamma_{\mathbf{x} \to \mathbf{z}}$ that form angles $\angle \mathbf{y}\mathbf{w}\mathbf{x}'$ and $\angle \mathbf{y}\mathbf{w}'\mathbf{x}'$ of $\pi/2$. Since $\mathbf{w}, \mathbf{w}', \mathbf{x}'$ are on the same geodesic, the triangle $\Delta \mathbf{y}\mathbf{w}\mathbf{w}'$ would have two right angles, but in the Poincaré ball this is impossible.
Proof. The proof is similar to the Euclidean case; it is based on the hyperbolic law of sines and the fact that in any right hyperbolic triangle the hypotenuse is strictly longer than either of the other sides.
Lemma 4.4. (Geodesics through p) Let $\tilde{H}^c_{\mathbf{a},\mathbf{p}}$ be a Poincaré hyperplane. Then, for any $\mathbf{w} \in \tilde{H}^c_{\mathbf{a},\mathbf{p}} \setminus \{\mathbf{p}\}$, all points on the geodesic $\gamma_{\mathbf{p} \to \mathbf{w}}$ are included in $\tilde{H}^c_{\mathbf{a},\mathbf{p}}$.
We now turn back to our proof. Let $\mathbf{x} \in \mathbb{D}^n_c$ be an arbitrary point and $\tilde{H}^c_{\mathbf{a},\mathbf{p}}$ a Poincaré hyperplane. We prove that there is at least one point $\mathbf{w}^* \in \tilde{H}^c_{\mathbf{a},\mathbf{p}}$ that achieves the infimum distance and, moreover, that this distance is the same as the one in the theorem's statement.
We first note that for any point $\mathbf{w} \in \tilde{H}^c_{\mathbf{a},\mathbf{p}}$, if $\angle \mathbf{x}\mathbf{w}\mathbf{p} \ne \pi/2$, then

$$\sinh(\sqrt{c}\, d_c(\mathbf{x}, \mathbf{p})) = \sinh\left( 2 \tanh^{-1}(\sqrt{c}\, \|-\mathbf{p} \oplus_c \mathbf{x}\|) \right) = \frac{2\sqrt{c}\, \|-\mathbf{p} \oplus_c \mathbf{x}\|}{1 - c\, \|-\mathbf{p} \oplus_c \mathbf{x}\|^2}. \qquad (4.21)$$
$$\cos(\angle \mathbf{x}\mathbf{p}\mathbf{w}) = \frac{\langle -\mathbf{p} \oplus_c \mathbf{x},\ -\mathbf{p} \oplus_c \mathbf{w} \rangle}{\|-\mathbf{p} \oplus_c \mathbf{x}\| \cdot \|-\mathbf{p} \oplus_c \mathbf{w}\|}. \qquad (4.22)$$
To maximize the above, the constraint on the right angle at $\mathbf{w}$ can be dropped because $\cos(\angle \mathbf{x}\mathbf{p}\mathbf{w})$ depends only on the geodesic $\gamma_{\mathbf{p} \to \mathbf{w}}$ and not on $\mathbf{w}$ itself, and because there is always an orthogonal projection from any point $\mathbf{x}$ to any geodesic, as stated by lemma 4.2. Thus, it remains to find the maximum of eq. (4.22) when $\mathbf{w} \in \tilde{H}^c_{\mathbf{a},\mathbf{p}}$. Using the definition of $\tilde{H}^c_{\mathbf{a},\mathbf{p}}$ from eq. (4.8), one can easily prove that
Using the fact that $\log^c_{\mathbf{p}}(\mathbf{w}) / \|\log^c_{\mathbf{p}}(\mathbf{w})\| = (-\mathbf{p} \oplus_c \mathbf{w}) / \|-\mathbf{p} \oplus_c \mathbf{w}\|$, we just have to find

$$\max_{\mathbf{z} \in \{\mathbf{a}\}^\perp} \frac{\langle -\mathbf{p} \oplus_c \mathbf{x}, \mathbf{z} \rangle}{\|-\mathbf{p} \oplus_c \mathbf{x}\| \cdot \|\mathbf{z}\|}\,, \qquad (4.24)$$

and we are left with a well-known Euclidean problem, equivalent to finding the minimum angle between the vector $-\mathbf{p} \oplus_c \mathbf{x}$ (viewed as Euclidean) and the hyperplane $\{\mathbf{a}\}^\perp$. This angle is given by the Euclidean orthogonal projection, whose sine is the distance from the vector's endpoint to the hyperplane divided by the vector's length:

$$\sin(\angle \mathbf{x}\mathbf{p}\mathbf{w}^*) = \frac{\left| \left\langle -\mathbf{p} \oplus_c \mathbf{x},\ \frac{\mathbf{a}}{\|\mathbf{a}\|} \right\rangle \right|}{\|-\mathbf{p} \oplus_c \mathbf{x}\|}\,. \qquad (4.25)$$
not be unique). Combining eqs. (4.19) to (4.21) and (4.25), one concludes
the proof.
These results are the last missing pieces to conclude the proof of
theorem 4.2.
or, equivalently,

$$p(y = k \mid \mathbf{x}) \propto \exp\left( \frac{\lambda^c_{\mathbf{p}_k} \|\mathbf{a}_k\|}{\sqrt{c}} \sinh^{-1}\left( \frac{2\sqrt{c}\, \langle -\mathbf{p}_k \oplus_c \mathbf{x}, \mathbf{a}_k \rangle}{(1 - c\, \|-\mathbf{p}_k \oplus_c \mathbf{x}\|^2)\, \|\mathbf{a}_k\|} \right) \right) \qquad (4.27)$$
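A compact numpy sketch of the per-class score inside eq. (4.27); the names are ours, and Möbius addition is assumed in its standard closed form.

```python
import numpy as np

def mobius_add(x, y, c=1.0):
    """Standard Moebius addition in the Poincare ball (assumed form)."""
    xy, x2, y2 = np.dot(x, y), np.dot(x, x), np.dot(y, y)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    return num / (1 + 2 * c * xy + c ** 2 * x2 * y2)

def hyp_mlr_logit(x, a, p, c=1.0):
    """Unnormalized log-probability of one class, following eq. (4.27).

    x: input point in the ball; p: class offset point in the ball;
    a: class normal vector in the tangent space at p.
    """
    z = mobius_add(-p, x, c)                  # -p (+)_c x
    lam_p = 2.0 / (1.0 - c * np.dot(p, p))    # conformal factor at p
    na = np.linalg.norm(a)
    arg = 2 * np.sqrt(c) * np.dot(z, a) / ((1 - c * np.dot(z, z)) * na)
    return lam_p * na / np.sqrt(c) * np.arcsinh(arg)

# A softmax over hyp_mlr_logit(x, a_k, p_k) for all classes k gives p(y=k|x).
```

4.3 hyperbolic feed-forward neural networks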
written $f_k^{\otimes_c} \circ \cdots \circ f_1^{\otimes_c} = \exp^c_{\mathbf{0}} \circ f_k \circ \cdots \circ f_1 \circ \log^c_{\mathbf{0}}$. This means that these operations can essentially be performed in Euclidean space. Therefore, it is interposing them with the bias translation of eq. (4.32) that differentiates this model from its Euclidean counterpart.
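A tiny numpy illustration of this collapsing property, using the exp/log maps at the origin from eqs. (2.27) and (2.28); the helper name mobius_version is ours.

```python
import numpy as np

def exp0(v, c=1.0):
    n = np.linalg.norm(v)
    return np.tanh(np.sqrt(c) * n) * v / (np.sqrt(c) * n)

def log0(y, c=1.0):
    n = np.linalg.norm(y)
    return np.arctanh(np.sqrt(c) * n) * y / (np.sqrt(c) * n)

def mobius_version(f, c=1.0):
    """Moebius version of a map f: exp_0^c o f o log_0^c."""
    return lambda x: exp0(f(log0(x, c)), c)

# Composing Moebius versions collapses, since log_0 inverts exp_0:
f = lambda v: 0.5 * v
g = lambda v: v + 0.1
x = np.array([0.2, 0.1])
lhs = mobius_version(g)(mobius_version(f)(x))
rhs = mobius_version(lambda v: g(f(v)))(x)
np.testing.assert_allclose(lhs, rhs)
```

4.4 hyperbolic recurrent neural networks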
4.5 experiments
Textual Entailment
Figure 4.2: Test accuracies for various models and four datasets. “Eucl”
denotes Euclidean, “Hyp” denotes hyperbolic. All word
and sentence embeddings have dimension 5. We highlight
in bold the best method (or methods, if the difference is less
than 0.5%).
for the fully Euclidean models, tanh and ReLU respectively surpassed the identity variant by a large margin. We only report the best Euclidean results. Interestingly, for the hyperbolic models, using the identity for both non-linearities works slightly better, likely because our hyperbolic layers already contain non-linearities by their nature.
For the results shown in fig. 4.2, we run each model (baseline or
ours) exactly 3 times and report the test result corresponding to the
best validation result from these 3 runs. We do this because the highly
non-convex spectrum of hyperbolic neural networks sometimes results
in convergence to poor local minima, suggesting that initialization is
very important.
results. Results are shown in fig. 4.2. Note that the fully Euclidean baseline models might have an advantage over the hyperbolic ones because more sophisticated optimization algorithms such as Adam do not have a hyperbolic analogue at the moment. We first observe that all GRU models outperform their RNN variants. Hyperbolic RNNs
and GRUs show the most significant improvement over their Euclidean variants when the underlying data structure is more tree-like. For example, on PREFIX-10%, for which the tree relation between sentences and their prefixes is most prominent, we reduce the error by a factor of 3.35 for hyperbolic vs Euclidean RNN, and by a factor of 1.5 for hyperbolic vs Euclidean GRU. As soon as the underlying structure diverges more and more from a tree, the accuracy gap decreases: for PREFIX-50%, the noise heavily affects the representational power of hyperbolic networks. Also, note that on SNLI our methods perform similarly to their Euclidean variants. Moreover, hyperbolic and Euclidean MLR are on par when used in conjunction with hyperbolic sentence embeddings, suggesting that further empirical investigation is needed in this direction (see below).
WordNet subtree     Model   d=2             d=3             d=5             d=10
animal.n.01         Hyp     47.43 ± 1.07    91.92 ± 0.61    98.07 ± 0.55    99.26 ± 0.59
(3218 / 798)        Eucl    41.69 ± 0.19    68.43 ± 3.90    95.59 ± 1.18    99.36 ± 0.18
                    log0    38.89 ± 0.01    62.57 ± 0.61    89.21 ± 1.34    98.27 ± 0.70
group.n.01          Hyp     81.72 ± 0.17    89.87 ± 2.73    87.89 ± 0.80    91.91 ± 3.07
(6649 / 1727)       Eucl    61.13 ± 0.42    63.56 ± 1.22    67.82 ± 0.81    91.38 ± 1.19
                    log0    60.75 ± 0.24    61.98 ± 0.57    67.92 ± 0.74    91.41 ± 0.18
worker.n.01         Hyp     12.68 ± 0.82    24.09 ± 1.49    55.46 ± 5.49    66.83 ± 11.38
(861 / 254)         Eucl    10.86 ± 0.01    22.39 ± 0.04    35.23 ± 3.16    47.29 ± 3.93
                    log0     9.04 ± 0.06    22.57 ± 0.20    26.47 ± 0.78    36.66 ± 2.74
mammal.n.01         Hyp     32.01 ± 17.14   87.54 ± 4.55    88.73 ± 3.22    91.37 ± 6.09
(953 / 228)         Eucl    15.58 ± 0.04    44.68 ± 1.87    59.35 ± 1.31    77.76 ± 5.08
                    log0    13.10 ± 0.13    44.89 ± 1.18    52.51 ± 0.85    56.11 ± 2.21

Figure 4.7: Test F1 classification scores (%) for four different subtrees of the WordNet noun tree. 95% confidence intervals for 3 different runs are shown for each method and each dimension. “Hyp” denotes our hyperbolic MLR, “Eucl” denotes directly applying Euclidean MLR to hyperbolic embeddings in their Euclidean parametrization, and log0 denotes applying Euclidean MLR in the tangent space at 0, after projecting all hyperbolic embeddings there with log0.
4.6 summary
4 https://github.com/dalab/hyperbolic_nn
5 HYPERBOLIC WORD EMBEDDINGS
Words are not created equal. In fact, they form an aristocratic graph with a latent hierarchical structure that the next generation of unsupervised word embeddings should reveal. In this chapter, justified by the notion of delta-hyperbolicity or tree-likeliness of a space, we propose to embed words in a Cartesian product of hyperbolic spaces, which we theoretically connect to Gaussian word embeddings and their Fisher geometry. This connection allows us to introduce a novel principled hypernymy score for word embeddings. Moreover, we adapt the well-known GloVe algorithm to learn unsupervised word embeddings in this type of Riemannian manifold. We further explain how to solve the analogy task using the Riemannian parallel transport that generalizes vector arithmetic to this new type of geometry. Empirically, based on extensive experiments, we show that our embeddings, trained unsupervised, are the first to simultaneously outperform strong and popular baselines on the tasks of similarity, analogy and hypernymy detection. In particular, for word hypernymy, we obtain new state-of-the-art results on fully unsupervised WBLESS classification accuracy.
The material presented here has in parts been published in the publication [TBG19]¹.
5.2 related work
$$d_{\mathbb{H}^2}(\mathbf{x}, \mathbf{y}) = \cosh^{-1}\left( 1 + \frac{\|\mathbf{x} - \mathbf{y}\|_2^2}{2\, x_2 y_2} \right) \qquad (5.2)$$
5.4 adapting glove
$$-\frac{1}{2} \|\mathbf{w}_i - \tilde{\mathbf{w}}_k\|^2 + b_i + \tilde{b}_k = \log(X_{ik}) \qquad (5.5)$$

where we absorbed the squared norms of the embeddings into the biases. We thus replace the GloVe loss by:

$$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( -h(d(\mathbf{w}_i, \tilde{\mathbf{w}}_j)) + b_i + \tilde{b}_j - \log X_{ij} \right)^2, \qquad (5.6)$$
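A minimal numpy sketch of one term of eq. (5.6); the weighting function f is assumed to be the standard GloVe one, and all names are ours.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Standard GloVe weighting f(X_ij) (assumed form)."""
    return min((x / x_max) ** alpha, 1.0)

def pair_loss(d_ij, x_ij, b_i, b_j, h=lambda t: t ** 2):
    """One term of eq. (5.6).

    d_ij: (hyperbolic) distance between target and context vectors;
    h: the monotone reparametrization, e.g. h = (.)^2 or h = cosh^2;
    b_i, b_j: the target and context biases.
    """
    return glove_weight(x_ij) * (-h(d_ij) + b_i + b_j - np.log(x_ij)) ** 2
```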
5.6 analogies for hyperbolic/gaussian embeddings
located on the geodesic between $\mathbf{d}_1$ and $\mathbf{d}_2$ for some $t \in [0, 1]$; if $t = 1/2$, this is called the gyro-midpoint, and then $m^{0.5}_{\mathbf{d}_1 \mathbf{d}_2} = m^{0.5}_{\mathbf{d}_2 \mathbf{d}_1}$, which is at equal hyperbolic distance from $\mathbf{d}_1$ and from $\mathbf{d}_2$. We select $t$ based on 2-fold cross-validation, as explained in [TBG19].
Note that continuously deforming the Poincaré ball to the Euclidean
space (by sending its radius to infinity) lets these analogy computations
recover their Euclidean counterparts, which is a nice sanity check.
Indeed, one can rewrite eq. (5.9) with tools from differential geometry as

$$\mathbf{c} \oplus \operatorname{gyr}[\mathbf{c}, \ominus\mathbf{a}](\ominus\mathbf{a} \oplus \mathbf{b}) = \exp_{\mathbf{c}}(P_{\mathbf{a} \to \mathbf{c}}(\log_{\mathbf{a}}(\mathbf{b}))), \qquad (5.10)$$

where $P_{\mathbf{x} \to \mathbf{y}} = (\lambda_{\mathbf{x}} / \lambda_{\mathbf{y}}) \operatorname{gyr}[\mathbf{y}, -\mathbf{x}]$ denotes the parallel transport along the unique geodesic from $\mathbf{x}$ to $\mathbf{y}$ (see eq. (2.32)). The exp and log maps of Riemannian geometry are related to the theory of gyrovector spaces as mentioned in Chapter 2. We also mention again that, when continuously deforming the hyperbolic space $\mathbb{D}^n$ into the Euclidean space $\mathbb{R}^n$, sending its curvature $\kappa$ from $-1$ to $0$ (i.e. the radius of $\mathbb{D}^n$ from $1$ to $\infty$), the Möbius operations $\oplus_\kappa, \ominus_\kappa, \otimes_\kappa, \operatorname{gyr}_\kappa$ recover their respective Euclidean counterparts $+, -, \cdot, \operatorname{Id}$. Hence, the analogy solutions $\mathbf{d}_1, \mathbf{d}_2, m^t_{\mathbf{d}_1 \mathbf{d}_2}$ of eq. (5.9) would then all recover the Euclidean formulation $\mathbf{d} = \mathbf{c} + \mathbf{b} - \mathbf{a}$.
This choice is based on the intuition that the Euclidean norm should
encode generality/specificity of a concept/word. However, such a
choice depends on the parametrization and origin of the hyperbolic
space, which is problematic when the word embedding training loss
involves only the distance function.
A second baseline is that of Gaussian word embeddings [VM15], in which words are modeled as normal probability distributions.
5.8 embedding space hyperbolicity
89
Table 5.1 shows values for different choices of h. The discrete metric spaces we obtained for our symbolic co-occurrence data appear to have a very low hyperbolicity, i.e. to be very much “hyperbolic”, which suggests embedding words in (products of) hyperbolic spaces. We report empirical results for $h = (\cdot)^2$ and $h = \cosh^2$ in section 5.9.
5.9 experiments
3 One can replace log( x ) with log(1 + x ) to avoid computing the logarithm of zero.
Table 5.3: Nearest neighbors (in terms of Poincaré distance) for some words, using our 100D hyperbolic embedding model.

sixties: seventies, eighties, nineties, 60s, 70s, 1960s, 80s, 90s, 1980s, 1970s
dance: dancing, dances, music, singing, musical, performing, hip-hop, pop, folk, dancers
daughter: wife, married, mother, cousin, son, niece, granddaughter, husband, sister, eldest
vapor: vapour, refrigerant, liquid, condenses, supercooled, fluid, gaseous, gases, droplet
ronaldo: cristiano, ronaldinho, rivaldo, messi, zidane, romario, pele, zinedine, xavi, robinho
mechanic: electrician, fireman, machinist, welder, technician, builder, janitor, trainer, brakeman
algebra: algebras, homological, heyting, geometry, subalgebra, quaternion, calculus, mathematics, unital, algebraic
Glove baselines, with the 100D hyperbolic embeddings being the abso-
lute best.
Table 5.5: Some words selected from the 100 nearest neighbors and ordered according to the hypernymy score function, for a 50x2D hyperbolic embedding model using $h(x) = x^2$.

reptile: amphibians, carnivore, crocodilian, fish-like, dinosaur, alligator, triceratops
algebra: mathematics, geometry, topology, relational, invertible, endomorphisms, quaternions
music: performance, composition, contemporary, rock, jazz, electroacoustic, trio
feeling: sense, perception, thoughts, impression, emotion, fear, shame, sorrow, joy
Word2Gauss-DistPos                                           0.206
SGNS-Deps                                                    0.205
Frequency                                                    0.279
SLQS-Slim                                                    0.229
Vis-ID                                                       0.253
DIVE-W∆S [Cha+18]                                            0.333
SBOW-PPMI-C∆S from [Cha+18]                                  0.345
50x2D Poincaré GloVe, h(x) = cosh²(x), init trick (190k)     0.284
50x2D Poincaré GloVe, h(x) = x², init trick (190k)           0.341
[Wee+14]                     0.75
WN-Poincaré from [NK17]      0.86
[Ngu+17]                     0.87
5.10 summary
6 PROBABILISTIC GRAPHICAL MODELS FOR ENTITY RESOLUTION
We now leave hyperbolic geometry and turn our attention to other non-Euclidean representations, namely for entities, i.e. entries in a Knowledge Graph (KG). We choose to evaluate the quality of entity representations using a popular downstream task related to semantic text understanding, namely ED (or Entity Resolution (ER)): the task of resolving potentially ambiguous textual references to entities in a KG. Before we dive into neural network models for ED and non-metric spaces (in Chapter 7), we first investigate probabilistic graphical models such as Conditional Random Fields (CRFs), plus approximate learning and inference techniques specific to them. We devise a novel model called the Probabilistic Bag of Hyperlinks Model for Entity Disambiguation (PBOH), achieving the best results on many datasets against a variety of methods. In Chapter 7 we will build upon PBOH and design a fully differentiable deep neural network based on truncated message passing inference that utilizes word and entity embeddings and non-Euclidean similarity measures to further push the state-of-the-art ER performance.
The material presented here has in parts been published in the publication [Gan+16].
6.1 introduction
1 http://en.wikipedia.org/
2 https://www.freebase.com/
3 For example, using a named-entity recognition system. However, note that our approach is not restricted to named entities, but targets any Wikipedia entity.
6.2 related work
6.3 model
4 Note that we do not address the issues of mention detection or nil identification in this work. Rather, our input is a document along with a fixed set of linkable mentions corresponding to existing KB entities.
where $\mathbb{1}[\cdot]$ is the indicator function. Note that we use the subscript notation $\{e, e'\}$ for $\psi$ to take into account the symmetry in $e, e'$, as well as the fact that one may have $e = e'$.
$$\phi_{e,m}(\mathcal{D}) := \frac{1}{|\mathcal{D}|} \sum_{d \in \mathcal{D}} \phi_{e,m}(\mathbf{e}^{(d)}, \mathbf{m}^{(d)})\,, \qquad (6.4)$$

$$\psi_{\{e,e'\}}(\mathcal{D}) := \frac{1}{|\mathcal{D}|} \sum_{d \in \mathcal{D}} \psi_{\{e,e'\}}(\mathbf{e}^{(d)})\,. \qquad (6.5)$$
Note that we can switch between the statistics view and the raw data view by observing that

$$\langle \rho, \phi(\mathbf{e}, \mathbf{m}) \rangle = \sum_{i=1}^{n} \rho_{e_i, m_i}\,, \qquad \langle \lambda, \psi(\mathbf{e}) \rangle = \sum_{i < j} \lambda_{\{e_i, e_j\}}\,. \qquad (6.10)$$

$$\rho : \mathcal{E} \times \mathcal{V} \to \mathbb{R}, \quad (e, m) \mapsto \rho_{e,m}$$
$$\lambda : \mathcal{E} \cup \binom{\mathcal{E}}{2} \to \mathbb{R}, \quad \{e, e'\} \mapsto \lambda_{\{e,e'\}}$$
Figure 6.2: Proposed factor graph for a document with four mentions.
Each mention node mi is paired with its corresponding
entity node Ei , while all entity nodes are connected through
entity-entity pair factors.
(Pseudo–)Likelihood Maximization
$$\tilde{L}(\rho, \lambda; \mathcal{D}) := \sum_{d \in \mathcal{D}} \sum_{i=1}^{n^{(d)}} \log p\big(e_i^{(d)} \mid \mathcal{N}(e_i^{(d)}); \rho, \lambda\big)\,. \qquad (6.15)$$
Bethe Approximation
The major computational difficulty with our model lies in the pairwise couplings between entities and the fact that these couplings are dense: the Markov dependency graph between different entity links in a document is always a complete graph. Let us consider what would happen if the dependency structure were loop-free, i.e. if it formed
5 For the Wikipedia collection, even after these pruning steps, we ended up with more than 50 million parameters in total.
$$p(\mathbf{e}) = \frac{\prod_{\{i,j\} \in T} p(e_i, e_j)}{\prod_{i=1}^{n} p(e_i)^{d_i - 1}}\,, \qquad d_i := |\{j : \{i, j\} \in T\}|\,. \qquad (6.16)$$
The Bethe approximation [YFW00] pursues the idea of using the above representation as an unnormalized approximation for $p(\mathbf{e})$, even when the Markov network has cycles. How does this relate to the exponential form in eq. (6.7)? By simple pattern matching, we see that if we choose

$$\lambda_{\{e,e'\}} = \log \frac{p(e, e')}{p(e)\, p(e')}\,, \quad \forall e, e' \in \mathcal{E} \qquad (6.17)$$
$$\bar{p}(\mathbf{e}) \propto \frac{\prod_{i<j} p(e_i, e_j)}{\prod_{i=1}^{n} p(e_i)^{n-2}} = \prod_{i=1}^{n} p(e_i) \prod_{i<j} \frac{p(e_i, e_j)}{p(e_i)\, p(e_j)} = \exp\left[ \sum_{i} \log p(e_i) + \sum_{i<j} \lambda_{\{e_i, e_j\}} \right], \qquad (6.18)$$
Parameter Calibration
With the previous suggestion, one issue comes into play: the total contribution coming from the pairwise interactions between entities will scale with $\binom{n}{2}$, while the entity–mention compatibility contributions will scale with $n$, the total number of mentions. This follows directly from the number of terms contributing to the sums in eq. (6.10). However, for practical reasons, it is somewhat implausible that, as $n$ grows, the prior $p(\mathbf{e})$ should dominate and the contribution of the likelihood term should vanish. The model is not well-calibrated with regard to $n$.
We propose to correct for this effect by adding a normalization factor to the $\lambda$-parameters, replacing eq. (6.17) with:

$$\lambda_{\{e,e'\}} = \frac{2}{n-1} \log \frac{p(e, e')}{p(e) \cdot p(e')}\,, \quad \forall e, e' \in \mathcal{E} \qquad (6.20)$$
In our case, if the entity variables $\mathbf{e}$ of a document formed a cycle of length $n$ instead of a complete subgraph, the Bethe approximation would read:

$$\bar{p}_\pi(\mathbf{e}) \propto \frac{\prod_{(i,j) \in E(\pi)} p(e_i, e_j)}{\prod_i p(e_i)}\,, \quad \forall \pi \in \Xi \qquad (6.21)$$
where $E(\pi)$ is the set of edges of the e-cycle $\pi$. However, as we do not desire to further constrain our graph with additional independence assumptions, we propose to approximate the joint prior $p(\mathbf{e})$ by the average of the Bethe approximations over all possible $\pi$, that is

$$\log \bar{p}(\mathbf{e}) \approx \frac{1}{|\Xi|} \sum_{\pi \in \Xi} \log \bar{p}_\pi(\mathbf{e})\,. \qquad (6.22)$$
Since each pair $(e_i, e_j)$ appears in exactly $2(n-2)!$ e-cycles, one can derive the final approximation:

$$\bar{p}(\mathbf{e}) \approx \frac{\prod_{i<j} p(e_i, e_j)^{\frac{2}{n-1}}}{\prod_i p(e_i)}\,. \qquad (6.23)$$
Distributing marginal probabilities over the parameters starting from eq. (6.23) and applying a similar argument as in eq. (6.18) results in the assignment given by eq. (6.20). While the above line of argument is not a strict mathematical derivation, we believe it sheds further light on the empirically observed effectiveness of the parameter re-scaling.
Integrating Context
The model that we have discussed so far does not consider the
local context of a mention. This is a powerful source of information
that a competitive ED system should utilize. For example, words like
“computer”, “company” or “device” are more likely to appear near
references of the entity Apple Inc. than of the entity Apple fruit. We
demonstrate in this section how this integration can be easily done
in a principled way on top of the current probabilistic model. This
showcases the extensibility of our approach. Enhancing our model with
additional knowledge such as entity categories or word co-reference
can also be done in a rigorous way, so we hope that this provides a
template for future extensions.
$$p(c_i \mid e_i) = \prod_{w_j \in c_i} p(w_j \mid e_i)\,. \qquad (6.25)$$
$$\log p(\mathbf{e} \mid \mathbf{m}, \mathbf{c}) = \sum_{i=1}^{n} \left( \log p(e_i \mid m_i) + \zeta \sum_{w_j \in c_i} \log p(w_j \mid e_i) \right) + \frac{2\tau}{n-1} \sum_{i<j} \log \frac{p(e_i, e_j)}{p(e_i)\, p(e_j)} + \text{const}\,. \qquad (6.26)$$
Here we used the identity p(m|e) p(e) = p(e|m) p(m) and absorbed all
log p(m) terms into the constant. The remaining free parameters ζ and τ
are optimized by grid search on a validation set. Details are provided
in section 6.5.
\tilde{p}(e, e') = \frac{\max(N(e, e') - \delta,\; 0)}{N_{ep}} + (1 - \mu_e)\, \hat{p}(e)\, \hat{p}(e')    (6.27)

\tilde{p}(w \mid e) = \frac{\max(N(w, e) - \xi,\; 0)}{N_{wp}} + (1 - \mu_w)\, \hat{p}(w) .    (6.28)
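Read as absolute-discounting estimators with back-off, these two formulas amount to the following minimal Python sketch; the count tables, normalizers and interpolation weights are assumptions matching the notation above:

    def smoothed_pair_prob(N_ee, N_ep, p_e, p_e2, delta, mu_e):
        # Equation 6.27: discounted relative co-occurrence frequency,
        # backed off to the product of the unigram entity estimates.
        return max(N_ee - delta, 0.0) / N_ep + (1.0 - mu_e) * p_e * p_e2

    def smoothed_word_prob(N_we, N_wp, p_w, xi, mu_w):
        # Equation 6.28: the same discounting scheme applied to
        # word-entity co-occurrence counts, backed off to p^(w).
        return max(N_we - xi, 0.0) / N_wp + (1.0 - mu_w) * p_w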
6.4 inference
After introducing our model and showing how to train it in the pre-
vious section, we now explain the inference process used for prediction.
Candidate Selection
\log p(e_i \mid m_i, c_i) = \log p(e_i \mid m_i) + \zeta \sum_{w_j \in c_i} \log p(w_j \mid e_i) + \text{const} .    (6.29)
Belief Propagation
From equation 6.7, one can derive the update equation of the logarithmic
message that is sent in round t + 1 from entity random variable E_i to
the outcome e_j of the entity random variable E_j:

m^{t+1}_{E_i \to E_j}(e_j) = \max_{e_i} \Big( \rho_{e_i, m_i} + \lambda_{\{e_i, e_j\}} + \sum_{1 \le k \le n;\, k \ne j} m^{t}_{E_k \to E_i}(e_i) \Big)    (6.30)
Note that, for simplicity, we bypass the factor graph formalism and
send messages directly between each pair of entity variables; this is
equivalent to the original Belief Propagation (BP) framework.
We update messages synchronously: in each round t, every pair of entity
nodes E_i and E_j exchanges messages. This is repeated until convergence
or until the maximum allowed number of iterations (15 in our
experiments) is reached. The convergence criterion is:
\max_{1 \le i, j \le n;\; e_j \in \mathcal{E}} \big| m^{t+1}_{E_i \to E_j}(e_j) - m^{t}_{E_i \to E_j}(e_j) \big| \le \epsilon    (6.31)

where ε = 10^{-5}. This setting was sufficient to reach convergence in
most cases.
In the end, the final entity assignment is determined by:

e_i^* = \arg\max_{e_i} \Big( \rho_{e_i, m_i} + \sum_{1 \le k \le n} m^{t}_{E_k \to E_i}(e_i) \Big)    (6.32)
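For illustration, the synchronous max-product procedure of equations 6.30 to 6.32 can be sketched as follows in Python/NumPy; candidate sets, score arrays and the message data structure are our assumptions, not the thesis implementation:

    import numpy as np

    def loopy_bp(rho, lam, max_rounds=15, eps=1e-5):
        # rho[i][e]      : unary score of candidate e for mention i
        # lam[i, j][a,b] : pairwise log-parameter for candidates a of i, b of j
        n = len(rho)
        m = {(i, j): np.zeros(len(rho[j]))
             for i in range(n) for j in range(n) if i != j}
        for t in range(max_rounds):
            new_m = {}
            for i in range(n):
                incoming = sum(m[k, i] for k in range(n) if k != i)
                for j in range(n):
                    if j == i:
                        continue
                    # eq. 6.30: maximize unary + pairwise + context over e_i
                    belief_i = rho[i] + incoming - m[j, i]
                    new_m[i, j] = np.max(belief_i[:, None] + lam[i, j], axis=0)
            # eq. 6.31: stop when no message changed by more than eps
            delta = max(np.max(np.abs(new_m[k] - m[k])) for k in new_m)
            m = new_m
            if delta <= eps:
                break
        # eq. 6.32: final assignment from unary score plus incoming messages
        return [int(np.argmax(rho[i] + sum(m[k, i] for k in range(n) if k != i)))
                for i in range(n)]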
6.5 experiments
• Recall: R = |M ∩ M*| / |M*|
• F1 score: F1 = 2 · P · R / (P + R)
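With M the set of predicted (mention, entity) annotations, M* the gold set, and precision P defined analogously to recall, these metrics amount to the following minimal sketch (not the evaluation code used in the experiments):

    def micro_scores(M, M_star):
        # Precision, recall and F1 over predicted vs. gold annotations.
        tp = len(M & M_star)
        p = tp / len(M) if M else 0.0
        r = tp / len(M_star)
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f1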
                  new MSNBC          new AQUAINT        new ACE2004
Systems           F1@MI    F1@MA     F1@MI    F1@MA     F1@MI    F1@MA
Local Mention     73.64    77.71     87.33    86.80     84.75    85.70

Table 6.3: Results on the newer versions of the MSNBC, AQUAINT and
ACE04 datasets.
Note that the Gerbil platform uses old versions of the AQUAINT,
MSNBC and ACE04 datasets that contain some no-longer existing
Wikipedia entities. A new, cleaned version of these sets10 was released
by [GB14]. We report results for the new cleaned datasets in table 6.3,
while table 6.4 and fig. 6.3 contain results for the old versions currently
used by Gerbil.
10 http://www.cs.ualberta.ca/~denilson/data/deos14_ualberta_experiments.tgz
Datasets (14 columns): IITB, MSNBC, KORE50, ACE2004, AQUAINT,
N3-RSS-500, AIDA-Test B, AIDA-Test A, AIDA-Training, N3-Reuters-128,
AIDA-Complete, DBpediaSpotlight, Microposts2014-Test, Microposts2014-Train

AGDISTIS (micro)            65.83 60.27 59.06 58.32 61.05 60.10 36.61 41.23 34.16 42.43 50.39 75.42 67.95 59.88
AGDISTIS (macro)            77.63 56.97 53.36 58.03 57.53 58.62 33.25 43.38 30.20 61.08 62.87 73.82 75.52 70.80
Babelfy (micro)             63.20 78.00 75.77 80.36 78.01 72.27 51.05 57.13 73.12 47.20 50.60 78.17 58.61 69.17
Babelfy (macro)             76.71 73.81 71.26 74.52 74.22 73.23 51.97 55.36 69.77 62.11 61.02 75.73 59.87 76.00
DBpedia Spotlight (micro)   70.38 58.84 54.90 57.69 60.04 74.03 69.27 65.44 37.59 56.43 56.26 69.27 56.44 57.63
DBpedia Spotlight (macro)   80.02 60.59 54.11 61.34 62.23 73.13 67.23 62.81 32.90 71.63 67.99 69.82 58.77 65.03
Dexter (micro)              18.72 48.46 45.44 48.59 49.25 38.28 26.70 28.53 17.20 31.27 35.21 36.86 32.74 31.11
Dexter (macro)              16.97 45.29 42.17 46.20 45.85 38.15 22.75 28.48 12.54 44.02 42.07 39.42 31.85 33.55
Entityclassifier.eu (micro) 12.74 46.6 44.13 44.02 47.83 21.67 22.59 18.46 27.97 29.12 32.69 41.24 28.4 21.77
Entityclassifier.eu (macro) 12.3 42.86 42.36 41.31 43.36 19.59 18.0 19.54 25.2 39.53 38.41 40.3 24.84 22.2
Kea (micro)                 80.08 73.39 70.9 72.64 74.22 81.84 73.63 72.03 57.95 63.4 64.67 85.49 63.2 69.29
Kea (macro)                 87.57 73.26 67.91 73.31 74.47 81.27 76.60 70.52 53.17 76.54 74.32 87.4 64.45 75.93
NERD-ML (micro)             54.89 54.62 52.85 52.59 55.55 49.68 46.8 51.08 29.96 38.65 39.83 64.03 54.96 61.22
NERD-ML (macro)             72.22 52.35 49.6 51.34 53.23 46.06 45.59 49.91 24.75 57.91 53.74 67.28 62.9 67.3
TagMe 2 (micro)             81.93 72.07 69.07 70.62 73.2 76.27 63.31 57.23 57.34 56.81 59.14 75.96 59.32 78.05
TagMe 2 (macro)             89.09 71.19 66.5 70.38 72.45 75.12 65.1 55.8 54.67 71.66 70.45 77.05 67.55 83.2
WAT (micro)                 80.0 83.82 81.82 84.34 84.21 76.82 65.18 61.14 58.99 59.56 61.96 77.72 64.38 68.21
WAT (macro)                 86.49 83.59 80.25 84.12 84.22 77.64 68.24 59.36 53.13 73.89 72.65 79.08 65.81 76.0
Wikipedia Miner (micro)     77.14 64.72 61.65 60.71 66.48 75.96 62.57 58.59 41.63 54.88 55.93 64.25 60.05 64.54
Wikipedia Miner (macro)     86.36 66.17 61.67 63.19 67.93 74.63 61.43 56.98 35.0 69.29 67.0 64.68 66.51 72.23
PBOH (micro)                87.19 86.72 86.63 87.39 86.59 86.64 79.48 62.47 61.70 74.19 73.08 89.54 76.54 71.24
PBOH (macro)                90.40 86.85 85.48 86.32 87.30 86.14 80.13 61.04 55.83 84.48 81.25 89.62 83.31 78.33

Table 6.4: Micro (top) and macro (bottom) F1 scores reported by Gerbil
for each of the 14 datasets and of the 11 ED methods including PBOH.
For each dataset and each metric, we highlight in red the best system
and in blue the second best system.
                              AIDA       AIDA
                              test A     test B     MSNBC     AQUAINT    ACE04
Avg. nb mentions / doc        22.18      19.41      32.8      14.54      7.34
Algorithm convergence rate    100%       99.56%     100%      100%       100%
Avg. running time (ms/doc)    445.56     203.66     371.65    40.42      10.88
Avg. nb rounds                2.86       2.83       3.0       2.56       2.25
solely based on the token-span statistics, i.e., e* = arg max_e p̂(e|m);
Unnorm – uses the unnormalized mention–entity model described in
section 6.3; Rescaled – relies on the rescaled model presented in
section 6.3; LocalContext – disambiguates an entity based on the mention
and the local context probability given by equation 6.29, i.e.,
e* = arg max_e p(e|m, c). Note that Unnorm, Rescaled and PBOH all use
the loopy belief propagation procedure for inference.
Results
11 http://gerbil.aksw.org/gerbil/experiment?id=201510160025
12 The detailed Gerbil results of the baseline systems can be accessed at
http://gerbil.aksw.org/gerbil/experiment?id=201510160026
                          MSNBC        AQUAINT      ACE2004
Avg # mentions per doc    36.95        14.54        8.68
Systems                   # entities   # entities   # entities
PBOH                      247.19       95.38        66.66
REL-RW                    382773.6     242443.1     256235.49

Table 6.6: Average number of entities that appear in the graph built by
PBOH and by REL-RW.
6.6 summary
7 DEEP JOINT ENTITY DISAMBIGUATION WITH LOCAL NEURAL ATTENTION
7.1 introduction
7.3 learning entity embeddings
suffer from sparsity issues and/or large memory footprints; (ii) vectors
of entities in a subset domain of interest can be trained separately,
obtaining potentially significant speed-ups and memory savings that
would otherwise be prohibitive for large entity KBs 1 ; (iii) entities can be
easily added in an incremental manner, which is important in practice;
(iv) the approach extends well into the tail of rare entities with few
linked occurrences; (v) empirically, we achieve better quality compared
to methods that use entity co-occurrence statistics.
Our model embeds words and entities in the same low-dimensional
vector space in order to exploit geometric similarity between them. We
start with a pre-trained word embedding map x : W → R^d that is
known to encode the semantic meaning of words w ∈ W; specifically, we
use word2vec pre-trained vectors [Mik+13b]. We extend this map to
entities E, i.e. x : E → R^d, as described below.
We assume a generative model in which words that co-occur with an
entity e are sampled from a conditional distribution p(w|e) when they
are generated. Empirically, we collect word-entity co-occurrence counts
#(w, e) from two sources: (i) the canonical KB description page of the
entity, and (ii) the windows of fixed size surrounding mentions of the
entity in an annotated corpus (e.g. Wikipedia hyperlinks). These counts
define a practical approximation of the above word-entity conditional
distribution, i.e. p̂(w|e) ∝ #(w, e). We call this the “positive” distribu-
tion of words related to the entity. Next, let q(w) be a generic word
probability distribution which we use for sampling “negative” words
unrelated to a specific entity. As in [Mik+13b], we choose a smoothed
unigram distribution q(w) = p̂(w)^α for some α ∈ (0, 1). The desired
outcome is that vectors of positive words are closer (in terms of dot
product) to the embedding of entity e compared to vectors of random
words. Let w+ ∼ p̂(w|e) and w− ∼ q(w). Then, we use a max-margin
objective to infer the optimal embedding x_e for entity e:

x_e = \arg\min_{z : \|z\| = 1} \; \mathbb{E}_{w^+ \sim \hat{p}(w|e),\; w^- \sim q(w)} \big[ \max\big( 0,\; \gamma - \langle z, x_{w^+} \rangle + \langle z, x_{w^-} \rangle \big) \big] ,

where γ > 0 is the margin.
7.4 local model with neural attention
Figure 7.1: Local model with neural attention. Inputs: context word
vectors, candidate entity priors and embeddings. Outputs:
entity scores. All parts are differentiable and trainable with
backpropagation.
Optimization proceeds by stochastic gradient descent with projection
over sampled pairs (w+, w−). Note that the entity vector is directly
optimized on the unit sphere, which is important in order to obtain
qualitative embeddings.
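A minimal sketch of one stochastic update of this objective, with the projection back onto the unit sphere made explicit; the margin gamma, the learning rate and all names are illustrative assumptions rather than the actual training code:

    import numpy as np

    def entity_embedding_step(x_e, x_pos, x_neg, gamma=0.1, lr=0.01):
        # Hinge loss: want <x_e, x_pos> to exceed <x_e, x_neg> by gamma,
        # where x_pos ~ p^(w|e) and x_neg ~ q(w) are sampled word vectors.
        if gamma - x_e @ x_pos + x_e @ x_neg > 0:
            x_e = x_e + lr * (x_pos - x_neg)   # subgradient step
        return x_e / np.linalg.norm(x_e)       # stay on the unit sphere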
We empirically assess the quality of our entity embeddings on entity
similarity and ED tasks as detailed in section 7.7 and tables 7.2 and 7.3.
The technique described in this section can also be applied, in principle,
for computing embeddings of general text documents, but a comparison
with such methods is left as future work.
weights in u to −∞ and applying a vanilla softmax on top of them. We used the layers
Threshold and TemporalDynamicKMaxPooling from Torch NN package, which allow
subgradient computation.
7.5 document-level deep model
Figure 7.2: Global model: unrolled LBP deep network that is end-to-end
differentiable and trainable.
The unary factors are the local scores Ψ_i(e_i) = Ψ(e_i, c_i) described in
equation 7.5. The pairwise factors are bilinear forms of the entity
embeddings

\Phi(e, e') = \frac{2}{n-1}\, x_e^\top C\, x_{e'} ,    (7.9)

where C is again a diagonal matrix. Similar to [Gan+16], the above
normalization helps balance the unary and pairwise terms across
documents with different numbers of mentions.
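Since C is diagonal, equation 7.9 reduces to a rescaled weighted inner product; a minimal sketch with illustrative names:

    import numpy as np

    def pairwise_score(x_e, x_e2, c_diag, n):
        # Bilinear pairwise factor of equation 7.9 with diagonal C,
        # normalized by 2/(n-1) to balance unary and pairwise terms.
        return (2.0 / (n - 1)) * np.sum(x_e * c_diag * x_e2)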
The model directly optimizes the marginal likelihoods, using the same
networks for learning and prediction. As noted by [Dom13], this method
is robust to model mis-specification, avoids the inherent difficulties of
partition functions, and is faster than double-loop likelihood training
(where, for each stochastic update, inference is run until convergence
is achieved).
Our architecture is shown in fig. 7.2. A neural network with T layers
encodes T message-passing iterations of synchronous max-product LBP 3,
which is designed to find the most likely (MAP) entity assignments that
maximize g(e, m, c). We also use message damping, which is known to
speed up and stabilize the convergence of message passing. Formally, in
iteration t, mention m_i votes for entity candidate e ∈ Γ(m_j) of mention
m_j using the normalized log-message m^{t+1}_{i→j}(e) computed as:

m^{t+1}_{i \to j}(e) = \max_{e' \in \Gamma(m_i)} \Big\{ \Psi_i(e') + \Phi(e, e') + \sum_{k \ne j} \bar{m}^{t}_{k \to i}(e') \Big\}    (7.10)

Herein the first part just reflects the CRF potentials, whereas the
second part, the damped message \bar{m}, is defined as

\bar{m}^{t+1}_{i \to j}(e) = \log\big[ \delta \cdot \mathrm{softmax}\big( m^{t+1}_{i \to j}(e) \big) + (1 - \delta) \cdot \exp\big( \bar{m}^{t}_{i \to j}(e) \big) \big]    (7.11)
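One unrolled layer of equations 7.10 and 7.11 could then look as follows; a minimal NumPy sketch with assumed names and shapes (the thesis implementation uses Torch):

    import numpy as np

    def log_softmax(x):
        z = x - np.max(x)
        return z - np.log(np.sum(np.exp(z)))

    def lbp_layer(m_bar, psi, phi, delta=0.5):
        # m_bar[i, j] : damped log-message over candidates of mention j
        # psi[i]      : local scores over candidates of mention i
        # phi[i, j]   : pairwise scores for candidate pairs of (i, j)
        n = len(psi)
        new_m_bar = {}
        for i in range(n):
            incoming = sum(m_bar[k, i] for k in range(n) if k != i)
            for j in range(n):
                if j == i:
                    continue
                # eq. 7.10: max-product message before damping
                msg = np.max((psi[i] + incoming - m_bar[j, i])[:, None]
                             + phi[i, j], axis=0)
                # eq. 7.11: damp in log space with the previous message
                new_m_bar[i, j] = np.log(delta * np.exp(log_softmax(msg))
                                         + (1 - delta) * np.exp(m_bar[i, j]))
        return new_m_bar

Every operation here is differentiable, so in practice the same computation would be written in a framework that tracks gradients end to end.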
The learned function f for global ED is non-trivial (see fig. 7.3), showing
that the influence of the prior tends to weaken for larger µ(e), whereas
3 Sum-product and mean-field performed worse in our experiments.
Figure 7.3: Non-linear scoring function of the belief and mention prior
learned with a neural network. It achieves a 1.7% improvement
on the AIDA-B dataset compared to a weighted-average scheme.
7.6 candidate selection
7.7 experiments
Method                                   NDCG@1   NDCG@5   NDCG@10   MAP
WikiLinkMeasure (WLM)                    0.54     0.52     0.55      0.48
[Yam+16] (d = 500)                       0.59     0.56     0.59      0.52
our (canonical pages, d = 300)           0.624    0.589    0.615     0.549
our (canonical & hyperlinks, d = 300)    0.632    0.609    0.641     0.578

Table 7.2: Entity relatedness results on the test set of [Cec+13b]. WLM
is a well-known similarity method of [MW08].
Table 7.3: Closest words to a given entity. Only words with frequency at
least 500 in the Wikipedia corpus are shown.
Methods                        AIDA-B
Local models
  prior p̂(e|m)                 71.9
  [Laz+15]                      86.4
  [Glo+16]                      87.9
  [Yam+16]                      87.2
  our (local, K=100, R=50)      88.8
Global models
  [HHJ15]                       86.6
  [Gan+16]                      87.6
  [CH15]                        88.7
  [GB16]                        89.0
  [Glo+16]                      91.0
  [Yam+16]                      91.5
  our (global)                  92.22 ± 0.14

Table 7.4: In-KB accuracy for AIDA-B test set. All baselines use
KB+YAGO mention-entity index. For our method we show
95% confidence intervals obtained over 5 runs.
Entity embeddings useful for the ED task can thus also perform decently
on the entity similarity task. We emphasize that our global ED model
outperforms Huang's ED model (table 7.4), likely due to the power of
our local and joint neural network architectures. For example, our
attention mechanism clearly benefits from explicitly embedding words
and entities in the same space.
As a qualitative evaluation, we show in table 7.3 the closest words to
a given entity in terms of their embeddings.
Freq of gold entity    Number of mentions    Solved correctly
0                      5                     80.0%
1-10                   0                     -
11-20                  4                     100.0%
21-50                  50                    90.0%
> 50                   4345                  94.2%

p̂(e|m) of gold entity  Number of mentions    Solved correctly
≤ 0.01                 36                    89.19%
0.01 - 0.03            249                   88.76%
0.03 - 0.1             306                   82.03%
0.1 - 0.3              381                   86.61%
> 0.3                  3431                  96.53%
Mention         Gold entity                          p̂(e|m)   Attended contextual words
Scotland        Scotland national rugby union team   0.034     England Rugby team squad Murrayfield
                                                               Twickenham national play Cup Saturday
                                                               World game George following Italy week
                                                               Friday selection dropped row month
Wolverhampton   Wolverhampton Wanderers F.C.         0.103     matches League Oxford Hull league
                                                               Charlton Oldham Cambridge Sunderland
                                                               Blackburn Sheffield Southampton
                                                               Huddersfield Leeds Middlesbrough
                                                               Reading Coventry Darlington Bradford
                                                               Birmingham Enfield Barnsley
Montreal        Montreal Canadiens                   0.021     League team Hockey Toronto Ottawa
                                                               games Anaheim Edmonton Rangers
                                                               Philadelphia Caps Buffalo Pittsburgh
                                                               Chicago Louis National home Friday York
                                                               Dallas Washington Ice
Santander       Santander Group                      0.192     Carlos Telmex Mexico Mexican group
                                                               firm market week Ponce debt shares
                                                               buying Televisa earlier pesos share
                                                               stepped Friday analysts ended
7.8 conclusion
8 CONCLUSION
In this dissertation, we advocated for non-Euclidean spaces and
Riemannian manifolds as offering a better inductive bias for learning
representations than traditional Euclidean methods. In particular,
constant-curvature spaces (e.g. hyperbolic spaces) and their products
offer computationally efficient geometries with properties appealing for
machine learning methods, e.g. isometric embedding guarantees for
hierarchical structures and the connection with Gaussian distributions.
First, we saw how taxonomies, hierarchical structures and directed
acyclic graphs can be embedded in the hyperbolic space using geodesi-
cally convex entailment cones.
Next, we explored how hyperbolic embeddings can be used in down-
stream tasks by generalizing some of the most important deep learning
architectures.
Third, we investigated how word embeddings can be trained in products
of hyperbolic spaces, exploiting the connection between such products
and Gaussian distributions; this yielded the first method to obtain
competitive results on the three different tasks of word similarity,
analogy and hypernymy.
Further, we moved on to investigating non-Euclidean representations
for entities, for the task of text disambiguation or entity resolution.
We explored probabilistic graphical models, learned entity embeddings,
and leveraged attention mechanisms and fully differentiable approximate
message-passing inference in Markov Random Fields.
BIBLIOGRAPHY
[Fan+16] Wei Fang, Jianwen Zhang, Dilin Wang, Zheng Chen, and Ming Li. "Entity Disambiguation by Knowledge and Text Jointly Embedding." In: CoNLL 2016 (2016), p. 260 (cit. on pp. 130, 144).

[FDK16] Matthew Francis-Landau, Greg Durrett, and Dan Klein. "Capturing semantic similarity for entity linking with convolutional neural networks." In: arXiv preprint arXiv:1604.00734 (2016) (cit. on p. 130).

[FS10a] Paolo Ferragina and Ugo Scaiella. "Fast and accurate annotation of short texts with Wikipedia pages." In: arXiv preprint arXiv:1006.3498 (2010) (cit. on p. 124).

[FS10b] Paolo Ferragina and Ugo Scaiella. "Tagme: on-the-fly annotation of short text fragments (by wikipedia entities)." In: Proceedings of the 19th ACM international conference on Information and knowledge management. ACM. 2010, pp. 1625–1628 (cit. on pp. 103, 131).

[Fu+14] Ruiji Fu, Jiang Guo, Bing Qin, Wanxiang Che, Haifeng Wang, and Ting Liu. "Learning semantic hierarchies via word embeddings." In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vol. 1. 2014, pp. 1199–1209 (cit. on pp. 1, 41).

[Gan+16] Octavian-Eugen Ganea, Marina Ganea, Aurelien Lucchi, Carsten Eickhoff, and Thomas Hofmann. "Probabilistic bag-of-hyperlinks model for entity linking." In: Proceedings of the 25th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee. 2016, pp. 927–938 (cit. on pp. ix, 25, 99, 130, 131, 136, 143, 144).

[Gan+19] Octavian-Eugen Ganea, Sylvain Gelly, Gary Bécigneul, and Aliaksei Severyn. "Breaking the Softmax Bottleneck via Learnable Monotonic Pointwise Non-linearities." In: arXiv preprint arXiv:1902.08077 (2019) (cit. on p. x).

[GB14] Zhaochen Guo and Denilson Barbosa. "Robust Entity Linking via Random Walks." In: Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management. CIKM '14. Shanghai, China: ACM, 2014.
[Ji16] Heng Ji. "Entity discovery and linking reading list." In: (2016). url: http://nlp.cs.rpi.edu/kbp/2014/elreading.html (cit. on p. 130).

[Kat+11] Saurabh S Kataria, Krishnan S Kumar, Rajeev R Rastogi, Prithviraj Sen, and Srinivasan H Sengamedu. "Entity disambiguation with hierarchical topic models." In: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM. 2011, pp. 1037–1045 (cit. on p. 103).

[KB14] Diederik Kingma and Jimmy Ba. "Adam: A method for stochastic optimization." In: arXiv preprint arXiv:1412.6980 (2014) (cit. on pp. 1, 72, 143).

[KGH18] Nikolaos Kolitsas, Octavian-Eugen Ganea, and Thomas Hofmann. "End-to-End Neural Entity Linking." In: Proceedings of the 22nd Conference on Computational Natural Language Learning. 2018, pp. 519–529 (cit. on p. x).

[Kim14] Yoon Kim. "Convolutional neural networks for sentence classification." In: arXiv preprint arXiv:1408.5882 (2014) (cit. on p. 6).

[Kir+15] Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. "Skip-thought vectors." In: Advances in neural information processing systems. 2015, pp. 3294–3302 (cit. on p. 1).

[Kri+09] Dmitri Krioukov, Fragkiskos Papadopoulos, Marián Boguñá, and Amin Vahdat. "Greedy forwarding in scale-free networks embedded in hyperbolic metric spaces." In: ACM SIGMETRICS Performance Evaluation Review 37.2 (2009), pp. 15–17 (cit. on p. 42).

[Kri+10] Dmitri Krioukov, Fragkiskos Papadopoulos, Maksim Kitsak, Amin Vahdat, and Marián Boguná. "Hyperbolic geometry of complex networks." In: Physical Review E 82.3 (2010), p. 036106 (cit. on pp. iii, v, 10, 18, 20, 23, 42, 82).

[Kul+09] Sayali Kulkarni, Amit Singh, Ganesh Ramakrishnan, and Soumen Chakrabarti. "Collective annotation of Wikipedia entities in web text." In: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM. 2009, pp. 457–466 (cit. on p. 103).
[Vil+18] Luke Vilnis, Xiang Li, Shikhar Murty, and Andrew McCallum. "Probabilistic Embedding of Knowledge Graphs with Box Lattice Measures." In: arXiv preprint arXiv:1805.06627 (2018) (cit. on p. 83).

[Vis+06] S. V. N. Vishwanathan, Nicol N. Schraudolph, Mark W. Schmidt, and Kevin P. Murphy. "Accelerated Training of Conditional Random Fields with Stochastic Gradient Methods." In: Proceedings of the 23rd International Conference on Machine Learning. ICML '06. Pittsburgh, Pennsylvania, USA: ACM, 2006, pp. 969–976. url: http://doi.acm.org/10.1145/1143844.1143966 (cit. on p. 109).

[VM15] Luke Vilnis and Andrew McCallum. "Word representations via gaussian embedding." In: ICLR (2015) (cit. on pp. 82, 83, 86–88).

[VM18] Ivan Vulić and Nikola Mrkšić. "Specialising Word Vectors for Lexical Entailment." In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). Vol. 1. 2018, pp. 1134–1145 (cit. on p. 83).

[VS14] Kevin Verbeek and Subhash Suri. "Metric embedding, hyperbolic space, and social networks." In: Proceedings of the thirtieth annual symposium on Computational geometry. ACM. 2014, p. 501 (cit. on pp. 10, 23).

[Vul+17] Ivan Vulić, Daniela Gerz, Douwe Kiela, Felix Hill, and Anna Korhonen. "Hyperlex: A large-scale evaluation of graded lexical entailment." In: Computational Linguistics 43.4 (2017), pp. 781–835 (cit. on p. 94).

[Wee+14] Julie Weeds, Daoud Clarke, Jeremy Reffin, David Weir, and Bill Keller. "Learning to distinguish hypernyms and co-hyponyms." In: Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers. Dublin City University and Association for Computational Linguistics. 2014, pp. 2249–2259 (cit. on p. 97).